I have no good heuristics for setting t3 and t4. My only suggestion is that the centroid averaging done by the canopy mapper using t1 and t2 tends to create less-sparse centroid vectors for the reducer step, and this might make the points in that pass lie closer together. I would suggest t3<t1 and t4<t2, but by how much is anybody's guess.
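To make the reasoning above concrete, here is a rough Python sketch of a two-pass canopy run. It is not Mahout's code: the names euclidean, centroid, canopy_pass and two_pass_canopy are made up for illustration, vectors are plain {index: value} dicts, and I am assuming the usual rule that a point within t1 of a canopy joins it and a point within t2 of any canopy does not start a new one.

```python
import math

def euclidean(a, b):
    """Distance between two sparse vectors stored as {index: value} dicts."""
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys))

def centroid(members):
    """Average a list of sparse vectors; the result is usually less sparse."""
    total = {}
    for m in members:
        for k, v in m.items():
            total[k] = total.get(k, 0.0) + v
    n = len(members)
    return {k: v / n for k, v in total.items()}

def canopy_pass(points, t1, t2, distance=euclidean):
    """One streaming canopy pass; returns the centroid of each canopy."""
    canopies = []  # each entry: (center, list of member points)
    for p in points:
        strongly_bound = False
        for center, members in canopies:
            d = distance(p, center)
            if d < t1:
                members.append(p)      # loosely bound: joins this canopy
            if d < t2:
                strongly_bound = True  # tightly bound: no new canopy needed
        if not strongly_bound:
            canopies.append((p, [p]))
    return [centroid(members) for _, members in canopies]

def two_pass_canopy(splits, t1, t2, t3, t4):
    """Mapper pass per split with t1/t2, then one reducer pass with t3/t4."""
    mapper_centroids = []
    for split in splits:
        mapper_centroids.extend(canopy_pass(split, t1, t2))
    return canopy_pass(mapper_centroids, t3, t4)

# Two input splits of very sparse points (hypothetical toy data).
splits = [[{0: 1.0}, {1: 1.0}], [{2: 1.0}, {3: 1.0}]]
```

In this toy data every pair of input points is sqrt(2) apart, but the two mapper centroids ({0: 0.5, 1: 0.5} and {2: 0.5, 3: 0.5}) are only 1.0 apart, which is the effect I mean. With t3=t1 and t4=t2 the reducer merges everything into one canopy, while smaller t3 and t4 keep the two apart.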
From: Konstantin Shmakov [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 09, 2011 2:41 PM
To: [EMAIL PROTECTED]
Subject: Re: T1 and T2 in Canopy
I am experimenting with canopy clustering from Mahout and found -t3 -t4
parameters for canopy in the latest release:
--t3 (-t3) t3 T3 (Reducer T1) threshold value
--t4 (-t4) t4 T4 (Reducer T2) threshold value
Thanks for adding them.
Could you clarify what the typical settings for -t3, -t4 would be
compared to -t1, -t2?
a) should one keep t1>=t2 and experiment with t3,t4 to speed up the reducer?
b) what are the relative values of t1,t2,t3,t4?
-- I am using vectors that have cardinality >20k with the number of nonzero
elements ~20-50 - similar to the original posting
-- similarly, the mapping phase goes fast for most t1, t2 parameters, while
the single reducer can take forever for most t1>=t2 combinations - mostly it
is impossible to run a measurable experiment
-- I also found that t1<t2 can dramatically shorten reducer time
In this case should one keep t1<t2 and use default t3,t4 or try t1>t2 and
experiment with t3,t4?
What values of t3,t4 should be used compared to t1,t2?
On Sat, Mar 12, 2011 at 3:23 PM, Jeff Eastman <[EMAIL PROTECTED]> wrote: