Subject: can't get <point-id, cluster-id> thru "-p"


What I have done is to use "seq2sparse -nv" to get named vectors.

    mahout seq2sparse \
        -i reuters-seqfiles/ \
        -o reuters-vectors/ \
        -ow -chunk 100 \
        -x 90 \
        -seq \
        -a com.finderbots.analyzers.LuceneStemmingAnalyzer \
        -ml 50 \
        -n 2 \
        -nv

This will use the filename as the key in the vector sequence file. The
keys will remain the same through the clustering phase.

Run "kmeans -cl" to create the clusteredPoints directory; in it you
will find a part-m-00000 file.

    mahout kmeans \
        -i reuters-vectors/tfidf-vectors/ \
        -c reuters-kmeans-centroids \
        -cl \
        -o reuters-kmeans-clusters \
        -k 20 \
        -ow \
        -x 10 \
        -dm org.apache.mahout.common.distance.CosineDistanceMeasure

Use seqdumper:

    mahout seqdumper \
        -s reuters-kmeans-clusters/clusteredPoints/part-m-00000 | more

You will see that the file contains

    key: cluster id, value: wt = likelihood that the vector is in the
    cluster, distance from centroid, name of the vector belonging to
    the cluster, vector data.

For kmeans the likelihood will be 1.0 or 0. For example:

    Key: 21477: Value: wt: 1.0distance: 0.9420744909793364  vec:
    /-tmp/reut2-000.sgm-158.txt = [372:0.318, 966:0.396, 3027:0.230,
    8816:0.452, 8868:0.308, 13639:0.278, 13648:0.264, 14334:0.270,
    14371:0.413]
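
To get from this output to the <point-id, cluster-id> pairs, one option
is to post-process the seqdumper text. Below is a minimal Python sketch,
assuming every record follows the layout of the example above and that
vector names contain no whitespace. The function name extract_pairs and
the regular expression are my own, not part of Mahout, and the exact
seqdumper output may differ between Mahout versions.

```python
import re

# Assumed record layout, taken from the example above:
#   Key: <cluster id>: Value: wt: <weight>distance: <d>  vec: <name> = [...]
# [^\[] keeps a match from running past the start of the vector data,
# so a record without a name is skipped rather than mis-paired.
RECORD = re.compile(
    r"Key:\s*(\d+):\s*Value:[^\[]*?vec:\s*([^\s\[]+)\s*=\s*\["
)


def extract_pairs(text):
    """Return (point_id, cluster_id) tuples found in seqdumper output."""
    return [(name, int(cluster)) for cluster, name in RECORD.findall(text)]
```

To use it, you could read the seqdumper output (piped in or saved to a
file) into a string, pass it to extract_pairs, and print each tuple as
one tab-separated line.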

Clusters, of course, do not have names of their own. A simple solution
is to construct a name from the top terms in the centroid output from
clusterdump.
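
As a rough illustration of that naming idea, here is a small Python
sketch. It assumes you have already pulled each centroid's terms and
weights out of the clusterdump output into a dict; the function
cluster_name, its parameters, and the sample terms and weights are
hypothetical and only show the top-terms step.

```python
def cluster_name(term_weights, top_n=3, sep="_"):
    """Build a label from the top_n highest-weighted terms of one centroid."""
    # Sort by weight, highest first; break ties alphabetically so the
    # same centroid always yields the same name.
    ranked = sorted(term_weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return sep.join(term for term, _ in ranked[:top_n])


# Made-up weights, for illustration only.
example = {"oil": 0.41, "crude": 0.38, "barrel": 0.22, "price": 0.35}
print(cluster_name(example))  # oil_crude_price
```

Keeping a mapping from cluster id to such a name lets you report
readable labels next to each point id.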

I also recommend /Mahout in Action/ from Manning Publications. You can
buy it here:
http://manning.com/owen/

On 3/19/12 7:45 PM, Baoqiang Cao wrote: