Subject: How to find the k most similar docs


Given documents that have had stop words removed and have been
vectorized into Mahout vectors, along with a TF-IDF dictionary, what is
the best distributed way to get the k nearest documents using a measure
like cosine similarity (or one of the others provided in Mahout)? I will
be doing this for every document in the corpus, so the question is
partly how best to do this within the existing Mahout+Hadoop framework.
What is the intuition about the processing resources needed?
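
To make the question concrete, the result I am after is equivalent to
the brute-force, single-machine sketch below. It is plain Python rather
than Mahout code: the names cosine and k_most_similar are made up for
illustration, and the sparse {term_id: weight} dicts are only an assumed
stand-in for the TF-IDF vectors.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term_id: weight}."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the shorter vector
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an empty vector as similar to nothing
    return dot / (norm_a * norm_b)


def k_most_similar(docs, k):
    """Brute force: score every document against every other document."""
    result = {}
    for doc_id, vec in docs.items():
        scored = ((cosine(vec, other), other_id)
                  for other_id, other in docs.items() if other_id != doc_id)
        # nlargest keeps only the k best (score, id) pairs, best first
        result[doc_id] = [other_id for _, other_id in heapq.nlargest(k, scored)]
    return result


# Toy corpus (hypothetical data, hand-made weights)
docs = {
    "d1": {0: 1.0, 1: 2.0},
    "d2": {0: 2.0, 1: 4.0},   # same direction as d1
    "d3": {2: 5.0},           # shares no terms with d1 or d2
    "d4": {0: 1.0, 2: 1.0},
}
neighbours = k_most_similar(docs, 2)
```

This touches every pair of documents, so the work grows with the square
of the corpus size, which is why I am asking about a distributed
approach.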

Expansion: At some point I'd like to extend this idea to find similar
clusters, but I expect that the same method should work, only with
centroids instead of doc vectors. Also, I expect to do canopy clustering
to feed into k-means clustering, and then perform the similarity measure
only on docs in the same cluster. I think I understand how to do this
preprocessing, so the question is primarily about finding the k most
similar docs and/or centroids. This sounds like k nearest neighbors. If
so, is that the best way to do it in Mahout+Hadoop?
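
As a rough illustration of why I want to compare only docs in the same
cluster, here is a small sketch that counts the pairwise comparisons
with and without that restriction. The cluster assignments are invented
for the example and the helper names are hypothetical; real cluster
sizes from canopy + k-means would of course differ.

```python
from collections import Counter


def pair_count(n):
    """Number of unordered pairs among n items."""
    return n * (n - 1) // 2


def clustered_pair_count(assignments):
    """Pairs to compare when only docs in the same cluster are compared."""
    sizes = Counter(assignments.values())
    return sum(pair_count(size) for size in sizes.values())


# Hypothetical doc -> cluster assignment
assignments = {
    "d1": "c0", "d2": "c0",
    "d3": "c1", "d4": "c1", "d5": "c1",
    "d6": "c2",
}
all_pairs = pair_count(len(assignments))             # every doc vs. every other
within_pairs = clustered_pair_count(assignments)     # same-cluster pairs only
```

With these made-up assignments the comparisons drop from 15 to 4. The
saving depends entirely on how evenly the clusters split the corpus, and
any true neighbors that land in different clusters would be missed.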