Isn't this what Mahout's clustering code already does? In other words,
if I compute a vector for each document (presumably after removing
stopwords), where each cell is a term weight (presumably TF-IDF),
normalize it, and then put those vectors into a matrix (keeping track of
the document labels), I should be able to run any of Mahout's clustering
jobs on that matrix with the appropriate DistanceMeasure implementation,
right? Or am I missing something?
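
To make the steps concrete, here is a rough sketch of the preprocessing I have in mind, in plain Python rather than against Mahout's API. Everything in it is an assumption for illustration: the tiny stopword list, the whitespace tokenizer, the smoothed IDF variant (log(N/df) + 1), and the helper names tfidf_matrix and cosine_distance, which only stand in for whatever vector format and DistanceMeasure the clustering job would actually use.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}


def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]


def tfidf_matrix(docs):
    """Build unit-length TF-IDF rows from a dict of label -> text.

    Returns (labels, vocab, rows), where rows[i] is the vector for
    labels[i] and its columns follow the order of vocab.
    """
    labels = sorted(docs)
    tokens = {label: tokenize(docs[label]) for label in labels}
    vocab = sorted({w for toks in tokens.values() for w in toks})
    n_docs = len(labels)

    # Document frequency: number of documents containing each term.
    df = Counter(w for toks in tokens.values() for w in set(toks))
    # One common smoothed IDF variant; other weightings would work too.
    idf = {w: math.log(n_docs / df[w]) + 1.0 for w in vocab}

    rows = []
    for label in labels:
        tf = Counter(tokens[label])
        row = [tf[w] * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in row))
        # L2-normalize so every non-empty document has length 1.
        rows.append([x / norm for x in row] if norm else row)
    return labels, vocab, rows


def cosine_distance(u, v):
    """Cosine distance for unit-length vectors: 1 minus the dot product."""
    return 1.0 - sum(a * b for a, b in zip(u, v))
```

With the rows normalized this way, two documents with identical term counts are at distance 0 and two with no terms in common are at distance 1, which is the behavior I would expect from a cosine-style distance measure in the clustering step.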
On May 28, 2009, at 11:55 AM, Ted Dunning wrote: