Subject: Re: T1 and T2 in Canopy


I have no good heuristics for setting t3 and t4. I can only suggest that the centroid averaging done by the canopy mappers using t1 and t2 tends to produce less-sparse centroid vectors for the reducer step, which may bring the points closer together in that pass. I would suggest t3<t1 and t4<t2, but by how much is anybody's guess.
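
To make the two passes concrete, here is a minimal Python sketch of the general canopy technique. It is not Mahout's code. The function names, the Euclidean distance, and the toy data are all assumptions chosen for illustration. Each "mapper" runs a canopy pass over its split with t1/t2 and emits centroids, and a single "reducer" runs the same pass over those centroids with t3/t4.

```python
import math

def distance(a, b):
    # Euclidean distance; an assumption, any distance measure could be used.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(members):
    # Component-wise average of the member vectors.
    n = len(members)
    return tuple(sum(col) / n for col in zip(*members))

def canopy_pass(points, loose, tight):
    # loose plays the role of T1 (or T3), tight the role of T2 (or T4).
    canopies = []
    for p in points:
        strongly_bound = False
        for c in canopies:
            d = distance(p, c["center"])
            if d < loose:
                c["members"].append(p)   # p contributes to this centroid
            if d < tight:
                strongly_bound = True    # p will not seed a new canopy
        if not strongly_bound:
            canopies.append({"center": p, "members": [p]})
    return [centroid(c["members"]) for c in canopies]

def two_pass_canopy(splits, t1, t2, t3, t4):
    # "Mapper" step: one canopy pass per input split, using t1/t2.
    mapper_centroids = []
    for split in splits:
        mapper_centroids.extend(canopy_pass(split, t1, t2))
    # "Reducer" step: one canopy pass over all mapper centroids, using t3/t4.
    return canopy_pass(mapper_centroids, t3, t4)

# Hypothetical toy data: two splits, two obvious groups.
splits = [
    [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0)],
    [(0.2, 0.0), (10.2, 10.0)],
]
final = two_pass_canopy(splits, t1=3.0, t2=1.5, t3=2.0, t4=1.0)
print(len(final), final)
```

In this sketch the inner loop visits every existing canopy for every point, so a tight threshold that is small relative to the typical distance between points lets almost every point seed its own canopy, and the single reducer pass grows roughly quadratically.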

-----Original Message-----
From: Konstantin Shmakov [mailto:[EMAIL PROTECTED]]
Sent: Thursday, June 09, 2011 2:41 PM
To: [EMAIL PROTECTED]
Subject: Re: T1 and T2 in Canopy

Hello

I am experimenting with canopy clustering from Mahout and found -t3 -t4
parameters for canopy in the latest release:
  --t3 (-t3) t3                              T3 (Reducer T1) threshold value

  --t4 (-t4) t4                              T4 (Reducer T2) threshold value

Thanks for adding them.

Could you clarify what the typical settings for -t3, -t4 would be
compared to -t1, -t2?
a) should one keep t1>=t2 and experiment with t3,t4 to speed up the reducer
phase?
b) what are the relative values of t1,t2,t3,t4:
       t1>t2>t3>t4?
       t3>t4>t1>t2?
Some background:
-- I am using vectors that have cardinality >20k with number of nonzero
elements ~20-50 - similar to the original posting
-- similarly, the mapping phase goes fast for most t1, t2 parameters, while
the single reducer can take forever for most t1>=t2 combinations - mostly it is
impossible to run a measurable experiment
-- I also found that t1<t2 can dramatically shorten reducer time

In this case should one keep t1<t2 and use default t3,t4 or try t1>t2 and
experiment with t3,t4?
What values of t3,t4 should be used compared to t1,t2?

Thanks
Konstantin
On Sat, Mar 12, 2011 at 3:23 PM, Jeff Eastman <[EMAIL PROTECTED]> wrote:

ksh: