Subject: How to find the k most similar docs


I'm using Mahout 0.6, compiled from source via 'mvn install'. I used
Suneel's code below to get NumberOfColumns.

When I try to run the rowsimilarity job via:

    bin/mahout rowsimilarity -i wikipedia-clusters/tfidf-vectors/ -o /wikipedia-similarity -r 87325 -s SIMILARITY_COSINE -m 10 -ess true

I get the following error:

    12/03/04 19:14:32 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --excludeSelfSimilarity=true, --input=wikipedia-clusters/tfidf-vectors/, --maxSimilaritiesPerRow=10, --numberOfColumns=87325, --output=/wikipedia-similarity, --similarityClassname=SIMILARITY_COSINE, --startPhase=0, --tempDir=temp}
    2012-03-04 19:14:32.376 java[1090:1903] Unable to load realm info from SCDynamicStore
    12/03/04 19:14:33 INFO input.FileInputFormat: Total input paths to process : 1
    12/03/04 19:14:33 INFO mapred.JobClient: Running job: job_local_0001
    12/03/04 19:14:33 INFO mapred.MapTask: io.sort.mb = 100
    12/03/04 19:14:33 INFO mapred.MapTask: data buffer = 79691776/99614720
    12/03/04 19:14:33 INFO mapred.MapTask: record buffer = 262144/327680
    12/03/04 19:14:34 WARN mapred.LocalJobRunner: job_local_0001
    java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
        at org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$VectorNormMapper.map(RowSimilarityJob.java:154)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

The cast error (as I understand it) usually happens when you pass in a
classname incorrectly. This seems likely, since the trace shows
cooccurrence similarity being used?
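
Reading the exception literally, it only says that VectorNormMapper.map
received a Text object where it expected an IntWritable. To check my
understanding of how that kind of failure arises in general, here is a
minimal plain-Java sketch. Everything in it is a stand-in: the Text and
IntWritable classes are hypothetical stubs (not the Hadoop ones),
NormMapper is a made-up mapper, and I'm only assuming that the mismatch in
my run is on the record key. It does not use or reproduce any Mahout code.

```java
public class KeyCastSketch {
    // Hypothetical stubs standing in for org.apache.hadoop.io.Text
    // and org.apache.hadoop.io.IntWritable.
    public static class Text {
        final String value;
        public Text(String value) { this.value = value; }
    }

    public static class IntWritable {
        final int value;
        public IntWritable(int value) { this.value = value; }
        int get() { return value; }
    }

    // Made-up mapper that is written against IntWritable keys.
    static class NormMapper {
        int map(IntWritable key) { return key.get(); }
    }

    // The generic cast is erased at compile time, so nothing is checked here.
    // The real check happens where the caller assigns the result.
    @SuppressWarnings("unchecked")
    static <K> K readKey(Object raw) { return (K) raw; }

    // Returns true if handing rawKey to the mapper fails with a ClassCastException.
    public static boolean failsWithClassCast(Object rawKey) {
        try {
            IntWritable key = readKey(rawKey); // checked cast to IntWritable happens here
            new NormMapper().map(key);
            return false;
        } catch (ClassCastException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(failsWithClassCast(new Text("doc-1")));  // prints true
        System.out.println(failsWithClassCast(new IntWritable(7))); // prints false
    }
}
```

In this sketch the failure depends only on the type of the object handed
to the mapper, not on any option passed alongside it. Whether the same
holds for my rowsimilarity run is exactly what I'm unsure about.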

I've probably missed something obvious about how to pass in the
similarity measure to use.

On 2/19/12 9:00 PM, Suneel Marthi wrote: