Subject: How to find the k most similar docs


Suneel, this is extremely helpful. I hope it gets to the Mahout wiki.

Some thoughts:

  * a threshold for self-similarity seems useful. I'm thinking of
    mirrored news groups, bulletin boards, and social network posts
    where the docs may be nearly identical but carry some surrounding
    text that doesn't quite match, so testing for similarity 1.0 would
    miss them. This is not an academic question, since these are some
    of the docs we plan to examine. It should be pretty easy to do this
    in a post-processing step for now.
  * I see how you use RowSimilarityJob to guess at good T1 and T2. In my
    case I am also concerned with the cohesion of the resulting
    clusters. The outliers will likely never be seen by humans. The
    intuition here is that well-formed clusters even if diffuse will
    give better results for us than a greater number of poorly-formed
    clusters. One way we have considered getting this result is to form
    lots of clusters, perhaps as you describe using T1 and T2 derived
    from RowSimilarityJob, then throw out the ones that fail some
    quality measurement (Dunning mentions entropy). This would allow
    overfitting but toss the overfit cases.
    http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output#9d3f6a55f4a91cb6
    I don't see that anyone has implemented something like this yet.
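
To make the two post-processing ideas above concrete, here is a rough
Python sketch. It is not Mahout code. It assumes the similarity output
has already been loaded into a dict mapping each doc id to a list of
(other_doc_id, similarity) pairs, and that the clusters are a dict of
cluster id to member doc ids. The function names, the 0.9 and 0.5
cutoffs, and the use of mean pairwise similarity as the cohesion
measure are all placeholders of mine (an entropy-based measure could
be swapped in).

```python
from itertools import combinations


def near_duplicates(similar_docs, threshold=0.9):
    """Return (doc_a, doc_b, sim) for each pair at or above the threshold.

    similar_docs: dict of doc id -> list of (other doc id, similarity).
    This layout is an assumption, not the RowSimilarityJob output format.
    Each unordered pair is reported once; self-matches are skipped.
    """
    seen = set()
    pairs = []
    for doc, neighbors in similar_docs.items():
        for other, sim in neighbors:
            key = (min(doc, other), max(doc, other))
            if doc != other and sim >= threshold and key not in seen:
                seen.add(key)
                pairs.append((key[0], key[1], sim))
    return pairs


def mean_pairwise_similarity(members, sim):
    """Cohesion stand-in: average similarity over all member pairs.

    sim is any callable sim(a, b) -> float. Singleton clusters score 0.0.
    """
    pairs = list(combinations(members, 2))
    if not pairs:
        return 0.0
    return sum(sim(a, b) for a, b in pairs) / len(pairs)


def keep_cohesive(clusters, sim, min_cohesion=0.5):
    """Over-cluster first, then keep only clusters that pass the cutoff."""
    return {
        cid: members
        for cid, members in clusters.items()
        if mean_pairwise_similarity(members, sim) >= min_cohesion
    }
```

With something like this, near_duplicates() would handle the mirrored
posts, and keep_cohesive() would toss the overfit or poorly-formed
clusters after running with aggressive T1 and T2.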

Thanks again.

On 2/19/12 9:00 PM, Suneel Marthi wrote: