Subject: How to find the k most similar docs


OK, making progress. I created a matrix with the rowid job and got the
following output:

    Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowid -i
    wikipedia-clusters/tfidf-vectors/ -o wikipedia-matrix --tempDir temp
    ...
    12/03/05 16:52:45 INFO common.AbstractJob: Command line arguments:
    {--endPhase=2147483647, --input=wikipedia-clusters/tfidf-vectors/,
    --output=wikipedia-matrix, --startPhase=0, --tempDir=temp}
    2012-03-05 16:52:45.870 java[4940:1903] Unable to load realm info
    from SCDynamicStore
    12/03/05 16:52:46 WARN util.NativeCodeLoader: Unable to load
    native-hadoop library for your platform... using builtin-java
    classes where applicable
    12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
    12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
    12/03/05 16:52:47 INFO vectors.RowIdJob: Wrote out matrix with 4838
    rows and 87325 columns to wikipedia-matrix/matrix
    12/03/05 16:52:47 INFO driver.MahoutDriver: Program took 1758 ms
    (Minutes: 0.0293)
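
As a side note, I believe rowid also writes a docIndex next to the matrix
that maps the new integer row ids back to the original document keys. Here
is a minimal Java sketch for dumping that mapping, assuming the docIndex is
a SequenceFile of IntWritable keys to Text values (the docIndex path is my
guess at the output layout, I haven't verified it):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class DumpDocIndex {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Assumed location of the rowid row-id-to-doc-key mapping.
        Path path = new Path("wikipedia-matrix/docIndex");
        FileSystem fs = FileSystem.get(path.toUri(), conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        IntWritable rowId = new IntWritable();
        Text docKey = new Text();
        while (reader.next(rowId, docKey)) {
          // One line per document: integer row id, then original key.
          System.out.println(rowId.get() + "\t" + docKey);
        }
        reader.close();
      }
    }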

So that's a doc matrix with 4838 docs and 87325 dimensions. Next I ran
RowSimilarityJob:

    Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowsimilarity
    -i wikipedia-matrix/matrix -o wikipedia-similarity -r 87325
    --similarityClassname SIMILARITY_COSINE -m 10 -ess true --tempDir temp
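
For what it's worth, my understanding is that the CLI call above is roughly
equivalent to driving the job from Java like this (a sketch only; I'm
assuming the rowsimilarity driver in 0.6 maps to
org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.util.ToolRunner;
    // Assumption: this is the class behind the "rowsimilarity" driver in 0.6.
    import org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob;

    public class RunRowSimilarity {
      public static void main(String[] args) throws Exception {
        // Same flags as the command line above; the job parses them itself.
        String[] jobArgs = {
            "-i", "wikipedia-matrix/matrix",
            "-o", "wikipedia-similarity",
            "-r", "87325",
            "--similarityClassname", "SIMILARITY_COSINE",
            "-m", "10",
            "-ess", "true",
            "--tempDir", "temp"
        };
        ToolRunner.run(new Configuration(), new RowSimilarityJob(), jobArgs);
      }
    }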

This gives me output in wikipedia-similarity/part-m-00000, but the file is
only 97 bytes. Shouldn't it have created 4838 * 10 results, i.e. ten per
row? I set no threshold, so I'd expect it to pick the 10 nearest rows even
if they are far away.
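
To check, I was going to read the part file directly with something like
the sketch below. I'm assuming the output is a SequenceFile of IntWritable
row ids to VectorWritable vectors whose non-zero elements are (other row
id, similarity) pairs; if that assumption is right, it would also answer my
format question:

    import java.util.Iterator;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.math.VectorWritable;

    public class DumpRowSimilarities {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("wikipedia-similarity/part-m-00000");
        FileSystem fs = FileSystem.get(path.toUri(), conf);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        IntWritable rowId = new IntWritable();
        VectorWritable similarities = new VectorWritable();
        int rows = 0;
        while (reader.next(rowId, similarities)) {
          rows++;
          // Assumed layout: each non-zero element is (similar row id, cosine similarity).
          Iterator<Vector.Element> it = similarities.get().iterateNonZero();
          while (it.hasNext()) {
            Vector.Element e = it.next();
            System.out.println(rowId.get() + "\t" + e.index() + "\t" + e.get());
          }
        }
        reader.close();
        System.out.println("rows: " + rows); // I'd expect something near 4838 here
      }
    }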

BTW what is the output format?

On 3/5/12 11:48 AM, Suneel Marthi wrote: