Suneel, this is extremely helpful. I hope it gets to the Mahout wiki.
* A threshold for self-similarity seems useful. I'm thinking of
mirrored newsgroups, bulletin boards, and social network posts,
where the docs may be nearly identical but have some surrounding text
that doesn't quite match, so a similarity of exactly 1.0 might not
work. This is not an academic question, since these are some of the
docs we plan to examine. It should be pretty easy to do this in a
post-processing step for now.
* I see how you use RowSimilarityJob to guess at good T1 and T2. In my
case I am also concerned with the cohesion of the resulting
clusters. The outliers will likely never be seen by humans. The
intuition here is that well-formed clusters, even if diffuse, will
give better results for us than a greater number of poorly formed
clusters. One way we have considered getting this result is to form
lots of clusters, perhaps as you describe using T1 and T2 derived
from RowSimilarityJob, then throw out the ones that fail some
quality measurement (Dunning mentions entropy):
http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output#9d3f6a55f4a91cb6
This would allow overfitting but toss the overfit cases.
I don't see that anyone has implemented something like this yet.
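
To make the first point concrete, here is a rough sketch of the post-processing step I have in mind. It is plain Python, not Mahout code. It assumes the pairwise similarities have already been parsed into (doc_a, doc_b, similarity) tuples, and both the function name and the 0.95 cutoff are made up for illustration.

```python
from collections import defaultdict


def near_duplicate_groups(pairs, threshold=0.95):
    """Group doc ids whose pairwise similarity is >= threshold.

    pairs: iterable of (doc_a, doc_b, similarity) tuples (assumed input
    shape, e.g. parsed from row-similarity output).
    threshold: hypothetical cutoff; it would need tuning on real data.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, sim in pairs:
        # Skip self-pairs and anything below the near-duplicate cutoff.
        if a != b and sim >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    groups = defaultdict(set)
    for doc in list(parent):
        groups[find(doc)].add(doc)
    # Only groups with more than one member are near-duplicate sets.
    return [sorted(group) for group in groups.values() if len(group) > 1]


pairs = [
    ("a", "b", 0.97),
    ("b", "c", 0.96),
    ("d", "e", 0.50),
    ("f", "f", 1.00),
]
print(near_duplicate_groups(pairs))
```

Note that the grouping is transitive: a and c end up together because each is close to b, even if they are not that close to each other. That seems right for mirrored posts, but it is a design choice, not a given.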
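
For the second point, the "form lots of clusters, then throw out the bad ones" idea could look roughly like the sketch below. This is only one possible reading of the entropy suggestion, not Dunning's method and not anything in Mahout: it pools the term counts of each cluster, computes the Shannon entropy of that distribution, and drops clusters that score above a cutoff. The function names, the input shape (cluster id mapped to a list of term-count dicts), and the cutoff values are all assumptions.

```python
import math
from collections import Counter


def term_entropy(docs):
    """Shannon entropy in bits of a cluster's pooled term counts.

    docs: list of {term: count} dicts, one per document (assumed shape).
    """
    pooled = Counter()
    for doc in docs:
        pooled.update(doc)  # adds the counts term by term
    total = sum(pooled.values())
    if total == 0:
        return 0.0
    return -sum(
        (count / total) * math.log2(count / total)
        for count in pooled.values()
        if count > 0
    )


def filter_clusters(clusters, max_entropy, min_size=2):
    """Keep clusters with >= min_size docs and entropy <= max_entropy.

    max_entropy and min_size are hypothetical knobs, not Mahout options.
    """
    kept = {}
    for cluster_id, docs in clusters.items():
        if len(docs) >= min_size and term_entropy(docs) <= max_entropy:
            kept[cluster_id] = docs
    return kept


clusters = {
    "tight": [{"mahout": 3, "canopy": 1}, {"mahout": 2, "canopy": 2}],
    "loose": [{"a": 1, "b": 1}, {"c": 1, "d": 1}],
    "single": [{"x": 1}],
}
print(sorted(filter_clusters(clusters, max_entropy=1.5)))
```

Raw entropy grows with vocabulary size, so a real version would probably need to normalize it or compare clusters of similar size before applying a single cutoff.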
On 2/19/12 9:00 PM, Suneel Marthi wrote: