Subject: Document Clustering


Isn't this what Mahout's clustering code will do?  In other words, if
I compute a vector for each document (presumably after removing
stopwords), where each cell is a term weight (presumably TF-IDF),
normalize it, and then put the vectors into a matrix (keeping track of
the document labels), I should then be able to just run any of
Mahout's clustering jobs on that matrix using the appropriate
DistanceMeasure implementation, right?  Or am I missing something?
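
To make concrete what I mean by the matrix, here is a rough sketch in plain Python. It does not use Mahout's API. The stopword list, the function names (tokenize, tfidf_matrix, cosine_distance), and the smoothed IDF formula are all my own assumptions for illustration. Cosine distance stands in for whichever DistanceMeasure turns out to be appropriate.

```python
import math
from collections import Counter

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "of", "and", "is", "on"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def tfidf_matrix(docs):
    """Build unit-length TF-IDF rows from a dict of label -> text.

    Returns (labels, vocab, rows), where rows[i] is the vector for
    labels[i] and its columns follow the order of vocab.
    """
    labels = sorted(docs)
    tokens = {label: tokenize(docs[label]) for label in labels}
    vocab = sorted({t for toks in tokens.values() for t in toks})

    # Document frequency: number of documents containing each term.
    df = Counter(t for toks in tokens.values() for t in set(toks))
    n = len(labels)
    # Assumed weighting: log(N / df) + 1, one of several common variants.
    idf = {t: math.log(n / df[t]) + 1.0 for t in vocab}

    rows = []
    for label in labels:
        tf = Counter(tokens[label])
        row = [tf[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in row))
        # Normalize to unit length; leave an all-zero row untouched.
        rows.append([w / norm for w in row] if norm else row)
    return labels, vocab, rows


def cosine_distance(u, v):
    """1 - cosine similarity; assumes u and v are already unit length."""
    return 1.0 - sum(a * b for a, b in zip(u, v))
```

The labels list is what lets a cluster assignment on row i be mapped back to the original document, which is the "keeping track of labels" part.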

On May 28, 2009, at 11:55 AM, Ted Dunning wrote: