Szymon Chojnacki

2011-02-28, 19:55

Jeff Eastman

2011-02-28, 21:25

Szymon Chojnacki

2011-03-01, 10:21

Jeff Eastman

2011-03-01, 16:56

Szymon Chojnacki

2011-03-08, 22:42

Lance Norskog

2011-03-09, 03:36

Jeff Eastman

2011-03-12, 23:23

Hello,

I am working with my colleague Tim on the MAHOUT-588 project (https://issues.apache.org/jira/browse/MAHOUT-588). The goal of the project is to compare Mahout's clustering algorithms on the Apache Mail Archives dataset (6 million emails). I have spent the last few days trying to find values of T1 and T2 that give a non-trivial set of clusters (more than 1 and fewer than the number of all vectors) and produce a result within a reasonable time, e.g. up to 3 hours.

I would be grateful for your advice, as the only way I could achieve this was by breaking the rule from the wiki that T1>T2. The problem is that if T1 is large, then each canopy accumulates many non-zero coordinates, and both memory and CPU demand grow. However, a low T1 forces an even lower T2, which leads to a large number of canopies, with the same memory and CPU problem.

My understanding of the source code is that T1 and T2 are treated independently, so I set T1=1.15 and T2=1.9. This setting let me obtain ~200 canopies after 40 minutes.

Thank you in advance for your suggestions on setting T1 and T2, and on the importance of the T1>T2 constraint.
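For readers unfamiliar with how the two thresholds interact, the canopy assignment rule they control can be sketched roughly as follows. This is an illustrative single-machine sketch, not Mahout's actual implementation; all names here are made up for the example:

```python
def canopy_cluster(points, t1, t2, dist):
    """Single-pass canopy clustering sketch.

    t1 (outer, "loose") and t2 (inner, "tight") are distance thresholds,
    conventionally t1 > t2. A point within t1 of an existing canopy center
    is added loosely to that canopy; a point within t2 is strongly bound
    and never seeds a new canopy of its own.
    """
    canopies = []  # list of (center, members) pairs
    for p in points:
        strongly_bound = False
        for center, members in canopies:
            d = dist(center, p)
            if d < t1:
                members.append(p)      # loose membership, contributes weight
            if d < t2:
                strongly_bound = True  # tight membership, suppresses seeding
        if not strongly_bound:
            canopies.append((p, [p]))  # p seeds a new canopy
    return canopies

manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))

# Tiny 1-D example: with t1=3.5, t2=2.0 these four points form two canopies.
result = canopy_cluster([(0,), (1,), (5,), (6,)], 3.5, 2.0, manhattan)
```

With T1 < T2 as in the settings above, points can be strongly bound to a canopy (suppressing new seeds) without contributing their weight to its loose membership, which is the asymmetry discussed in this thread.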

Kind regards

Szymon

P.S. I described my struggle in detail in https://issues.apache.org/jira/secure/attachment/12472217/mahout-588_canopy.pdf.

--

Szymon Chojnacki

http://www.ipipan.eu/~sch/


Canopy can be difficult to control, and it appears you may have found a use case for not enforcing T1>T2 (we don't). It is curious, though, that the settings you have chosen assign points to canopies (dist<T2) but do not include all of their weights (T2>dist>T1) in the centroids. What happens if you set T1=T2+epsilon; T2=1.9? That would at least follow the rules and give you the same number of clusters, but it would also add the centers of the outliers (dist>1.15). Is this where your processing time blows up?


Thank you Jeff for your advice,

I think the problems I encounter are characteristic of the structure of our dataset. The cardinality of the vectors is 20K, whereas the average number of non-zero coordinates is ~50. I checked on a sample that, on average, 12% of the distances between vectors are maximal (i.e. there is no overlap in the non-zero coordinates). Moreover, the same values of T1 and T2 are used in the mappers and in the reducer, which poses another challenge, as the distances among the centroids transferred to the reducer probably have a different distribution than the distances between the raw vectors.

The process blows up either at the very beginning (too many centroids are created in the mappers) or after the mappers transfer the centroids to the reducer (as far as I can see, a single reducer is hard-coded, so everything has to be processed by one node).

Cheers

Szymon


It seems like we need to introduce additional T arguments for the reduce step (T3, T4?) to help control these situations. The default could be to use the T1/T2 values. It's a pretty simple patch: add the new parameters to CanopyDriver.run(String[]); pass T3 & T4 (defaulted if not provided) into clusterDateMR() and put them into the conf; it's a little tricky to get the CanopyClusterer initialized right in the reducer (perhaps a new constructor to set t1=t3, t2=t4?); then it should just work. Add a unit test to exercise it and it would be a nice addition to Mahout. I can probably get to it this weekend if nobody else wants to attempt it.

Worth doing?
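The shape of the proposal can be sketched generically: run one canopy pass per partition with T1/T2, then cluster the resulting centers with separate T3/T4, since center-to-center distances are distributed differently than point-to-point distances. This is an illustrative sketch with hypothetical names, not the actual Mahout patch:

```python
def canopy_pass(points, t_loose, t_tight, dist):
    """One canopy pass: t_loose adds members, t_tight suppresses new seeds."""
    canopies = []
    for p in points:
        bound = False
        for c in canopies:
            d = dist(c["center"], p)
            if d < t_loose:
                c["members"].append(p)
            if d < t_tight:
                bound = True
        if not bound:
            canopies.append({"center": p, "members": [p]})
    return canopies

def two_phase_canopy(partitions, t1, t2, t3, t4, dist):
    # "Map" phase: each partition produces local canopy centers with T1/T2.
    centers = []
    for part in partitions:
        centers.extend(c["center"] for c in canopy_pass(part, t1, t2, dist))
    # "Reduce" phase: cluster the centers themselves with their own T3/T4.
    return canopy_pass(centers, t3, t4, dist)

manhattan = lambda a, b: sum(abs(x - y) for x, y in zip(a, b))

# Two partitions whose local centers merge in the reduce phase.
parts = [[(0,), (1,)], [(0.5,), (9,)]]
final = two_phase_canopy(parts, 3.0, 1.5, 4.0, 2.0, manhattan)
```

Defaulting t3=t1 and t4=t2 reproduces the current behavior, so the new parameters would be backward compatible.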


Such functionality would be appreciated.

I think similar problems can happen with MeanShift. I have made a few attempts to run MeanShift on the same large, sparse dataset, and either I get one trivial cluster or the algorithm virtually stops. I'll investigate the issue further.

Cheers


High-dimensional vectors don't work as well as 2D vectors with Manhattan or Euclidean distance. Minkowski distance is a real-valued variant where Minkowski(1.0) is Manhattan and Minkowski(2.0) is Euclidean. You can try MinkowskiDistanceMeasure(0.00001) or MinkowskiDistanceMeasure(10000) and see if these are more interesting. There are a few more distance algorithms.

I would experiment on small datasets and do stats on various distances etc. Pairs of vectors that you can understand (with term strings) matched with distances could be a real eye-opener.

Lance
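The family of distances Lance describes can be written in a few lines. This is a plain-Python illustration of the Minkowski formula itself, not Mahout's MinkowskiDistanceMeasure class:

```python
def minkowski(a, b, p):
    """Minkowski distance with exponent p: p=1.0 is Manhattan, p=2.0 is
    Euclidean. Very small p emphasizes the number of differing coordinates;
    very large p is dominated by the single largest coordinate difference."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a, b = (0.0, 0.0), (3.0, 4.0)
d1 = minkowski(a, b, 1.0)  # Manhattan: |3| + |4| = 7.0
d2 = minkowski(a, b, 2.0)  # Euclidean: sqrt(9 + 16) = 5.0
```

Comparing d1 and d2 on hand-picked vector pairs (with their term strings, as suggested above) is a cheap way to see how the exponent reshapes the distance distribution before committing to a full clustering run.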

On Tue, Mar 8, 2011 at 2:42 PM, Szymon Chojnacki <[EMAIL PROTECTED]> wrote:

> Such functionality would be appreciated,

>

> I think that similar problems can happen with MeanShift,

> I have made a few attempts to run MeanShift with the same large, spare dataset and either I get one trivial cluster or the algorithm virtually stops. I'll investigate the issue further

>

> Cheers

>

> Dnia 1 marca 2011 17:56 Jeff Eastman <[EMAIL PROTECTED]> napisał(a):

>

>> It seems like we need to introduce additional T arguments for the reduce step (T3, T4?) to help control these situations. The default could be to use the T1/T2 values. It's a pretty simple patch: add the new parameters to CanopyDriver.run(String[]); pass T3 & T4 (defaulted if not provided) into clusterDateMR() and put into conf; a little tricky to get the CanopyClusterer initialized right in the reducer (perhaps a new constructor to set t1=t3, t2=t4?); then it should just work. Add a unit test to exercise it and it would be a nice addition to Mahout. I can probably get to it by this weekend if nobody else wants to attempt it.

>>

>> Worth doing?

>>

>> -----Original Message-----

>> From: Szymon Chojnacki [mailto:[EMAIL PROTECTED]]

>> Sent: Tuesday, March 01, 2011 2:22 AM

>> To: [EMAIL PROTECTED]

>> Subject: RE: T1 and T2 in Canopy

>>

>> Thank you Jeff for your advice,

>>

>> I think that the problems I encounter are characteristic for the structure of our dataset. The cardinality of the vectors is 20K, whereas an average number of non-zero coordinates is ~50. I checked with a sample that on average 12% of the distances between the vectors are maximum (i.e. there is no overlap in the non-zero coordinates). Moreover, the same values of T1 and T2 are used in mappers and in a reducer. Which imposes another challenge as the distances among the centroids transferred to the reducer probably have different distribution than the distances between pure vectors.

>>

>> The process blows up either at the very begining (too many centroids are created in mappers) or after the mappers transfer the centroids to the reducer (as I see there is only one reducer hard-coded and everything has to be processed by one node)

>>

>> Cheers

>> Szymon

>>

>>

>> Dnia 28 lutego 2011 22:25 Jeff Eastman <[EMAIL PROTECTED]> napisał(a):

>>

>> > Canopy can be difficult to control and it appears you may have found a use case for not enforcing T1>T2 (we don't). It is curious, though, that the settings you have chosen assign points to canopies (dist<T2) but does not include all of their weights (T2>dist>T1) in the centroids. What happens if you set T1=T2+epsilon; T2=1.9? That would at least follow the rules and give you the same number of clusters, but it would also add the centers of the outliers (dist>1.15). Is this where your processing time blows up?

>> >

>> > -----Original Message-----

>> > From: Szymon Chojnacki [mailto:[EMAIL PROTECTED]]

>> > Sent: Monday, February 28, 2011 11:55 AM

>> > To: [EMAIL PROTECTED]

>> > Subject: T1 and T2 in Canopy

>> >

>> > Hello,

>> >

>> > I am working with my colleague Tim within a Mahout-588 project (https://issues.apache.org/jira/browse/MAHOUT-588). The goal of the project is to compare mahout's clustering algorithms with Apache-Mail-Archives dataset (6 million emails). I have spent last few days trying to set such values of T1 and T2, which would give a non-trivial set of clusters (>1 and < # of all vectors). And would output the result within e.g. up to 3h.

Lance Norskog

[EMAIL PROTECTED]

Manhattan or Euclidean distance.

Minkowski distance is a real-valued variant where Minkowski (1.0) is

Manhattan and Minkowski(2.0) is Euclidean. You can try

MinkowskiDistanceMeasure(0.00001) or MinkowskiDistanceMeasure(10000)

and see if these are more interesting. There are a few more distance

algorithms.

I would experiment on small datasets and do stats on various distances

etc. Pairs of vectors that you can understand (with term strings)

matched with distances could be a real eye-opener.

Lance

On Tue, Mar 8, 2011 at 2:42 PM, Szymon Chojnacki <[EMAIL PROTECTED]> wrote:

> Such functionality would be appreciated,

>

> I think that similar problems can happen with MeanShift,

> I have made a few attempts to run MeanShift with the same large, sparse dataset, and either I get one trivial cluster or the algorithm virtually stops. I'll investigate the issue further.

>

> Cheers

>

> On 1 March 2011 at 17:56, Jeff Eastman <[EMAIL PROTECTED]> wrote:

>

>> It seems like we need to introduce additional T arguments for the reduce step (T3, T4?) to help control these situations. The default could be to use the T1/T2 values. It's a pretty simple patch: add the new parameters to CanopyDriver.run(String[]); pass T3 & T4 (defaulted if not provided) into clusterDateMR() and put into conf; a little tricky to get the CanopyClusterer initialized right in the reducer (perhaps a new constructor to set t1=t3, t2=t4?); then it should just work. Add a unit test to exercise it and it would be a nice addition to Mahout. I can probably get to it by this weekend if nobody else wants to attempt it.

>>

>> Worth doing?

>>

>> -----Original Message-----

>> From: Szymon Chojnacki [mailto:[EMAIL PROTECTED]]

>> Sent: Tuesday, March 01, 2011 2:22 AM

>> To: [EMAIL PROTECTED]

>> Subject: RE: T1 and T2 in Canopy

>>

>> Thank you Jeff for your advice,

>>

>> I think that the problems I encounter are characteristic of the structure of our dataset. The cardinality of the vectors is 20K, whereas the average number of non-zero coordinates is ~50. I checked on a sample that, on average, 12% of the distances between vectors are maximal (i.e. there is no overlap in the non-zero coordinates). Moreover, the same values of T1 and T2 are used in the mappers and in the reducer, which poses another challenge, as the distances among the centroids transferred to the reducer probably have a different distribution than the distances between the raw vectors.

>>

>> The process blows up either at the very beginning (too many centroids are created in the mappers) or after the mappers transfer the centroids to the reducer (as far as I can see, there is only one hard-coded reducer, so everything has to be processed by one node).

>>

>> Cheers

>> Szymon

>>

>>

>> On 28 February 2011 at 22:25, Jeff Eastman <[EMAIL PROTECTED]> wrote:

>>

>> > Canopy can be difficult to control and it appears you may have found a use case for not enforcing T1>T2 (we don't). It is curious, though, that the settings you have chosen assign points to canopies (dist<T2) but do not include all of their weights (T2>dist>T1) in the centroids. What happens if you set T1=T2+epsilon; T2=1.9? That would at least follow the rules and give you the same number of clusters, but it would also add the centers of the outliers (dist>1.15). Is this where your processing time blows up?

>> >

>> > -----Original Message-----

>> > From: Szymon Chojnacki [mailto:[EMAIL PROTECTED]]

>> > Sent: Monday, February 28, 2011 11:55 AM

>> > To: [EMAIL PROTECTED]

>> > Subject: T1 and T2 in Canopy

>> >

>> > Hello,

>> >

>> > I am working with my colleague Tim on the MAHOUT-588 project (https://issues.apache.org/jira/browse/MAHOUT-588). The goal of the project is to compare Mahout's clustering algorithms on the Apache-Mail-Archives dataset (6 million emails). I have spent the last few days trying to find values of T1 and T2 that would give a non-trivial set of clusters (more than one, and fewer than the number of vectors) and produce output within, say, 3 hours.

Lance Norskog

[EMAIL PROTECTED]

I've got a patch which adds T3/T4 arguments to Canopy. I will create an issue for it and post the patch later today. If this is useful, and not just two more knobs to guess the values for, I will commit it and we can take a look at MeanShift. I suspect you are correct here too but it needs more investigation.
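A rough sketch of the flow such a patch would enable, in plain Python rather than the actual CanopyDriver/CanopyClusterer Java code (all names here are hypothetical): mappers build canopies from raw points with (T1, T2), and the single reducer re-clusters the mapper centroids with its own, independently tuned (T3, T4):

```python
def canopy(points, loose, tight, dist):
    """Compact one-pass canopy: loose (T1-like) > tight (T2-like)."""
    remaining, canopies = list(points), []
    while remaining:
        center = remaining.pop(0)
        members = [p for p in remaining if dist(center, p) < loose] + [center]
        remaining = [p for p in remaining if dist(center, p) >= tight]
        canopies.append(members)
    return canopies

def centroid(members):
    """Coordinate-wise mean of a list of equal-length tuples."""
    dim = len(members[0])
    return tuple(sum(p[i] for p in members) / len(members) for i in range(dim))

def two_stage(splits, t1, t2, t3, t4, dist):
    # Map phase: one canopy pass per input split, with (T1, T2).
    mapper_centroids = [centroid(c) for split in splits
                        for c in canopy(split, t1, t2, dist)]
    # Reduce phase: re-cluster the mapper centroids with (T3, T4),
    # since centroid-to-centroid distances are distributed differently
    # than raw vector-to-vector distances.
    return canopy(mapper_centroids, t3, t4, dist)

euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
```

Defaulting T3=T1 and T4=T2 would preserve today's behavior while letting the reducer pass be tuned separately.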

-----Original Message-----

From: Szymon Chojnacki [mailto:[EMAIL PROTECTED]]

Sent: Tuesday, March 08, 2011 2:43 PM

To: [EMAIL PROTECTED]

Subject: RE: T1 and T2 in Canopy

Such functionality would be appreciated,

I think that similar problems can happen with MeanShift,

I have made a few attempts to run MeanShift with the same large, sparse dataset, and either I get one trivial cluster or the algorithm virtually stops. I'll investigate the issue further.

Cheers

On 1 March 2011 at 17:56, Jeff Eastman <[EMAIL PROTECTED]> wrote:

> It seems like we need to introduce additional T arguments for the reduce step (T3, T4?) to help control these situations. The default could be to use the T1/T2 values. It's a pretty simple patch: add the new parameters to CanopyDriver.run(String[]); pass T3 & T4 (defaulted if not provided) into clusterDateMR() and put into conf; a little tricky to get the CanopyClusterer initialized right in the reducer (perhaps a new constructor to set t1=t3, t2=t4?); then it should just work. Add a unit test to exercise it and it would be a nice addition to Mahout. I can probably get to it by this weekend if nobody else wants to attempt it.

>

> Worth doing?

>

> -----Original Message-----

> From: Szymon Chojnacki [mailto:[EMAIL PROTECTED]]

> Sent: Tuesday, March 01, 2011 2:22 AM

> To: [EMAIL PROTECTED]

> Subject: RE: T1 and T2 in Canopy

>

> Thank you Jeff for your advice,

>

> I think that the problems I encounter are characteristic of the structure of our dataset. The cardinality of the vectors is 20K, whereas the average number of non-zero coordinates is ~50. I checked on a sample that, on average, 12% of the distances between vectors are maximal (i.e. there is no overlap in the non-zero coordinates). Moreover, the same values of T1 and T2 are used in the mappers and in the reducer, which poses another challenge, as the distances among the centroids transferred to the reducer probably have a different distribution than the distances between the raw vectors.

>

> The process blows up either at the very beginning (too many centroids are created in the mappers) or after the mappers transfer the centroids to the reducer (as far as I can see, there is only one hard-coded reducer, so everything has to be processed by one node).

>

> Cheers

> Szymon

>

>

> On 28 February 2011 at 22:25, Jeff Eastman <[EMAIL PROTECTED]> wrote:

>

> > Canopy can be difficult to control and it appears you may have found a use case for not enforcing T1>T2 (we don't). It is curious, though, that the settings you have chosen assign points to canopies (dist<T2) but do not include all of their weights (T2>dist>T1) in the centroids. What happens if you set T1=T2+epsilon; T2=1.9? That would at least follow the rules and give you the same number of clusters, but it would also add the centers of the outliers (dist>1.15). Is this where your processing time blows up?

> >

> > -----Original Message-----

> > From: Szymon Chojnacki [mailto:[EMAIL PROTECTED]]

> > Sent: Monday, February 28, 2011 11:55 AM

> > To: [EMAIL PROTECTED]

> > Subject: T1 and T2 in Canopy

> >

> > Hello,

> >

> > I am working with my colleague Tim on the MAHOUT-588 project (https://issues.apache.org/jira/browse/MAHOUT-588). The goal of the project is to compare Mahout's clustering algorithms on the Apache-Mail-Archives dataset (6 million emails). I have spent the last few days trying to find values of T1 and T2 that would give a non-trivial set of clusters (more than one, and fewer than the number of vectors) and produce output within, say, 3 hours.

> >

> > I would be grateful for your advice, as the only way I could achieve this was by breaking the rule from the wiki that T1>T2. The problem is that if T1 is large, then we get many non-empty coordinates in each canopy, and both memory and CPU demand grow. However, setting a low T1 forces a low T2, which leads to a large number of canopies, and the same problem with memory and CPU.

Szymon Chojnacki

http://www.ipipan.eu/~sch/
