Subject: T1 and T2 in Canopy


Canopy can be difficult to control, and it appears you may have found a use case for not enforcing T1>T2 (we don't). It is curious, though, that the settings you have chosen assign points to canopies (dist<T2) but do not include all of their weights (T2>dist>T1) in the centroids. What happens if you set T1=T2+epsilon; T2=1.9? That would at least follow the rules and give you the same number of clusters, but it would also add the outliers (dist>1.15) to the centers. Is this where your processing time blows up?
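To make the T1/T2 interaction above concrete, here is a minimal Python sketch of a single canopy pass. It is an illustration only, not Mahout's code: the name canopy_pass, the dict layout, and the toy points are all assumptions, as is the choice to measure Euclidean distance against each canopy's founding point. Under those assumptions, T2 alone decides how many canopies are created, while T1 decides which points contribute to a centroid.

```python
import math


def canopy_pass(points, t1, t2):
    """Single pass over points; returns a list of canopies (hypothetical layout).

    Each canopy keeps its founding point as a fixed 'center' for distance
    tests, plus the running 'total' and 'count' of points observed into it.
    """
    canopies = []
    for p in points:
        strongly_bound = False
        for c in canopies:
            d = math.dist(c["center"], p)
            if d < t1:
                # The point contributes its weight to this canopy's centroid.
                c["total"] = [a + b for a, b in zip(c["total"], p)]
                c["count"] += 1
            if d < t2:
                # The point is close enough that it will not found a new canopy.
                strongly_bound = True
        if not strongly_bound:
            canopies.append({"center": p, "total": list(p), "count": 1})
    return canopies


# Toy one-dimensional data (made up for illustration).
points = [(0.0,), (1.0,), (1.5,), (4.0,), (5.5,)]

# T1 < T2, as in the settings discussed above.
inverted = canopy_pass(points, t1=1.15, t2=1.9)
# T1 just above T2, which follows the usual rule.
regular = canopy_pass(points, t1=1.9 + 1e-9, t2=1.9)

print(len(inverted), [c["count"] for c in inverted])
print(len(regular), [c["count"] for c in regular])
```

On this toy data both runs produce the same number of canopies, but with T1=1.15 the points at distance 1.5 from a center are bound to a canopy without being counted in its centroid.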

-----Original Message-----
From: Szymon Chojnacki [mailto:[EMAIL PROTECTED]]
Sent: Monday, February 28, 2011 11:55 AM
To: [EMAIL PROTECTED]
Subject: T1 and T2 in Canopy

Hello,

I am working with my colleague Tim on the Mahout-588 project (https://issues.apache.org/jira/browse/MAHOUT-588). The goal of the project is to compare Mahout's clustering algorithms on the Apache-Mail-Archives dataset (6 million emails). I have spent the last few days trying to find values of T1 and T2 that would give a non-trivial set of clusters (more than 1 and fewer than the number of all vectors) and would output the result within, e.g., 3 hours.

I would be grateful for your advice, as the only way I could do it was by breaking the rule from the wiki that T1>T2. The problem is that if T1 is large, then we get many non-empty coordinates in each canopy, and both memory and CPU demands grow. However, setting a low T1 forces a low T2, which leads to a large number of canopies, with the same memory and CPU problem.
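The memory side of this trade-off can be illustrated with a small Python sketch. It assumes sparse vectors stored as index-to-weight dicts and a canopy that accumulates every vector within T1 into a running sum; the helper add_sparse and the toy documents are hypothetical and are not taken from Mahout or from the dataset.

```python
def add_sparse(total, vec):
    """Add sparse vector `vec` (dict of index -> weight) into `total` in place."""
    for idx, w in vec.items():
        total[idx] = total.get(idx, 0.0) + w
    return total


# Made-up documents: each has only 3 non-zero terms, with little overlap.
docs = [
    {0: 1.0, 1: 1.0, 2: 1.0},
    {2: 1.0, 3: 1.0, 4: 1.0},
    {4: 1.0, 5: 1.0, 6: 1.0},
    {6: 1.0, 7: 1.0, 8: 1.0},
]

total = {}
nonzeros = []
for d in docs:
    add_sparse(total, d)
    # Number of non-empty coordinates in the running sum so far.
    nonzeros.append(len(total))

print(nonzeros)
```

Each vector stays sparse, but the running sum gains new non-empty coordinates with almost every vector it absorbs, so a larger T1 (more vectors per canopy) tends to mean denser centroids.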

My understanding of the source code is that T1 and T2 are independent. So I set T1=1.15 and T2=1.9. This setting let me obtain ~200 canopies after 40 mins.

Thank you in advance for your suggestions on setting T1 and T2, and on the importance of the T1>T2 constraint.

Kind regards
Szymon

ps.
I described my struggle in detail in https://issues.apache.org/jira/secure/attachment/12472217/mahout-588_canopy.pdf.

--
Szymon Chojnacki
http://www.ipipan.eu/~sch/