Subject: Problems with KMeans clustering


Thanks Steve,

That was a subtle change, evidently made after KMeans was implemented,
and it did not show up until people such as Philippe and you ran it on
real problems on real clusters. While the type signatures of the reducer
and combiner are in fact the same, the values emitted by the mapper and
by the combiner are different, so a reducer that receives anything other
than combiner output could indeed produce the odd behavior that was
reported.

The algorithm's dependence on the combiner running exactly once is
pretty fundamental, since the summing of cluster centroids is done in
the combiner and the reducer merges those clusters. I'd be interested in
exactly how you resolved this.
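
For what it's worth, one common way to remove that dependence (not
necessarily what you did) is to have the mapper and the combiner emit
the same kind of value, a (count, vector sum) pair, so the reducer gets
the same centroid whether the combiner runs zero, one, or several times.
A minimal Python sketch of the idea follows. All names are hypothetical
and it is not the actual KMeans code:

```python
from functools import reduce


def map_point(point):
    # Hypothetical mapper output: a partial sum covering exactly one point.
    return (1, tuple(point))


def merge(a, b):
    # Add two partial sums. The operation is associative and commutative,
    # so it does not matter how often or in what order it is applied.
    (na, sa), (nb, sb) = a, b
    return (na + nb, tuple(x + y for x, y in zip(sa, sb)))


def combine(partials):
    # Hypothetical combiner: the output has the same shape as the input.
    return reduce(merge, partials)


def reduce_centroid(partials):
    # Hypothetical reducer: merge whatever arrives, then divide once.
    n, s = reduce(merge, partials)
    return tuple(x / n for x in s)


points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0)]
mapped = [map_point(p) for p in points]

# Combiner never runs: the reducer sees raw mapper output.
no_combiner = reduce_centroid(mapped)

# Combiner runs once per split.
one_pass = reduce_centroid([combine(mapped[:2]), combine(mapped[2:])])

# Combiner runs twice on some values and not at all on others.
two_pass = reduce_centroid(
    [combine([combine(mapped[:2]), combine(mapped[2:3])]), mapped[3]]
)

print(no_combiner, one_pass, two_pass)
```

All three calls give the same centroid because the reducer never has to
know whether a value came from the mapper or from a combiner.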

The same problem likely applies to some of the other clustering
implementations too.

Finally, can you explain why this problem no longer seems to occur with
Hadoop trunk?

Jeff

Steve Schlosser wrote: