On Nov 16, 2011, at 9:39 PM, Ioan Eugen Stan wrote:
I'll try in the next few days to track down the numbers from running the examples in my recent IBM article (http://www.ibm.com/developerworks/java/library/j-mahout-scaling/). Or, you can go run them yourself!
Otherwise, I don't know that we have any formula just yet. I suspect that once you reach a certain number of documents, your dictionary will more or less stop growing. After that, it is just a question of how many vectors you have and how sparse they are. You could probably estimate this by looking at the average number of words in your email collection. Naturally, attachments may skew this if you are including them.
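To make that estimate concrete, here is a rough back-of-envelope sketch. The per-entry byte cost and the example numbers are assumptions for illustration, not measured values from any Mahout run:

```python
# Back-of-envelope storage estimate for a set of sparse TF vectors,
# assuming the dictionary has stopped growing. Each distinct term
# occurrence costs roughly one (int index, double value) pair,
# ~12 bytes (4 + 8). These figures are illustrative assumptions.

def estimate_sparse_storage_mb(num_docs, avg_terms_per_doc, bytes_per_entry=12):
    """Approximate size in MB of the sparse vectors for a collection."""
    total_entries = num_docs * avg_terms_per_doc
    return total_entries * bytes_per_entry / (1024 * 1024)

# e.g. a hypothetical 1M emails averaging 200 distinct terms each:
print(round(estimate_sparse_storage_mb(1_000_000, 200), 1))  # ~2288.8 MB
```

The dictionary itself adds a roughly constant overhead on top of this once it plateaus, so the vector side usually dominates for large collections.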
That has been my experience, too. Seq2Sparse is often the longest part. I suspect one could get it done a lot faster in Lucene. SequenceFilesFromDirectory is also slow, but that is inherently sequential.
I haven't explored yet what it would mean to use encoded vectors in clustering, but perhaps I can call Ted to the front of the class and see if he has thoughts on whether that even makes sense, as that would give you a fixed-size vector.
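For anyone unfamiliar with the encoded-vector idea: the point is that hashing words into a fixed number of buckets gives every document a vector of the same dimensionality, no matter how large the vocabulary grows. Here is a minimal generic sketch of that feature-hashing trick; it is not Mahout's actual encoder API, and the bucket count is an arbitrary illustrative choice:

```python
# Minimal sketch of feature hashing ("encoded vectors"): each word is
# hashed into one of a fixed number of buckets, so the output vector
# has a fixed size regardless of vocabulary growth. Generic
# illustration only, not Mahout's encoder classes.

import hashlib

def hashed_vector(words, num_buckets=1000):
    vec = [0.0] * num_buckets
    for w in words:
        # Stable hash so the same word always lands in the same bucket.
        h = int(hashlib.md5(w.encode("utf-8")).hexdigest(), 16)
        vec[h % num_buckets] += 1.0
    return vec

v = hashed_vector(["mahout", "clustering", "mahout"])
print(len(v), sum(v))  # fixed length (1000), total term count (3.0)
```

The trade-off is that distinct words can collide in a bucket, which is why it is an open question here whether the distance computations that clustering relies on still behave well enough.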