Subject: syntheticcontroldata clustering example failure due to combiner


The Synthetic Control example used to work with all of the clustering
jobs. The move to Hadoop 0.19 introduced intermittent problems that
depend upon optimizations done behind the scenes in Hadoop. All of the
original implementations used combiners under the assumption that they
would run only after the mapper and exactly once. These assumptions
no longer hold as of 0.19. M-99 fixed K-Means, but not Canopy or Mean
Shift, which still rely on them.
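
To make the failure mode concrete, here is a small plain-Java sketch
(no Hadoop dependency) of why a combiner that assumes a single pass
breaks. I am assuming the changed behavior is that the combiner may run
over several separate batches of one mapper's output instead of over
all of it at once. The class and method names are made up for the
illustration and are not the Mahout code; points are one-dimensional to
keep it short.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombinerAssumptionDemo {

    /** Partial result (sum plus count) that can be merged any number of times. */
    static final class Partial {
        final double sum;
        final long count;
        Partial(double sum, long count) { this.sum = sum; this.count = count; }
    }

    /** Fragile combiner: collapses its input to a mean and throws the count away. */
    static double fragileCombine(List<Double> values) {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.size();
    }

    /** Robust combiner: input and output have the same type, so reruns are harmless. */
    static Partial robustCombine(List<Partial> partials) {
        double sum = 0;
        long count = 0;
        for (Partial p : partials) { sum += p.sum; count += p.count; }
        return new Partial(sum, count);
    }

    /** Fragile combiner applied once per batch; the "reducer" averages the results. */
    static double fragileCentroid(List<List<Double>> batches) {
        List<Double> means = new ArrayList<>();
        for (List<Double> batch : batches) means.add(fragileCombine(batch));
        return fragileCombine(means);
    }

    /** Robust combiner applied once per batch; the "reducer" merges, then divides. */
    static double robustCentroid(List<List<Double>> batches) {
        List<Partial> combined = new ArrayList<>();
        for (List<Double> batch : batches) {
            List<Partial> raw = new ArrayList<>();
            for (double v : batch) raw.add(new Partial(v, 1));
            combined.add(robustCombine(raw));
        }
        Partial total = robustCombine(combined);
        return total.sum / total.count;
    }

    public static void main(String[] args) {
        // The same four points, seen as one batch or split into two batches.
        List<List<Double>> oneBatch =
            Arrays.asList(Arrays.asList(1.0, 2.0, 3.0, 10.0));
        List<List<Double>> twoBatches =
            Arrays.asList(Arrays.asList(1.0, 2.0, 3.0), Arrays.asList(10.0));

        System.out.println(fragileCentroid(oneBatch));   // 4.0
        System.out.println(fragileCentroid(twoBatches)); // 6.0 (wrong)
        System.out.println(robustCentroid(oneBatch));    // 4.0
        System.out.println(robustCentroid(twoBatches));  // 4.0
    }
}
```

The fragile version only gives the right centroid when it happens to
see everything in one pass, which matches what we observe in the unit
tests.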

Unfortunately, in the development mode used by the build and all the
unit tests, the combiner seems to run exactly once and only alongside
the mappers. As a result, the severity of the semantics change went
undetected until recently, when users began trying to run clustering
on real Hadoop clusters.

The only solution I can imagine right now is to move the combiner
centroid summation code back into the mappers and have the mappers
output fully combined data during close(). It is not very elegant;
perhaps someone has a better solution in mind. I will take a look at it
tonight after the Hadoop Summit.
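
Roughly what I have in mind, as a plain-Java sketch rather than real
Hadoop code: CombiningMapper, PartialCentroid and the BiConsumer that
stands in for the output collector are all hypothetical names for this
illustration. The sketch keeps a reference to the collector it was
handed in map(), on the assumption that close() does not receive one
of its own.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiConsumer;

public class InMapperCombiningSketch {

    /** Running sum of points plus the number of points summed. */
    static final class PartialCentroid {
        final double[] sum;
        long count;
        PartialCentroid(int dim) { sum = new double[dim]; }
        void add(double[] point) {
            for (int i = 0; i < sum.length; i++) sum[i] += point[i];
            count++;
        }
    }

    /** Hypothetical mapper: accumulates in map(), emits nothing until close(). */
    static final class CombiningMapper {
        private final Map<String, PartialCentroid> partials = new LinkedHashMap<>();
        private BiConsumer<String, PartialCentroid> collector;

        void map(String clusterId, double[] point,
                 BiConsumer<String, PartialCentroid> out) {
            collector = out; // remembered so that close() can emit
            partials.computeIfAbsent(clusterId,
                k -> new PartialCentroid(point.length)).add(point);
        }

        void close() {
            if (collector == null) return; // map() was never called
            partials.forEach(collector);   // one fully combined record per cluster
            partials.clear();
        }
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        BiConsumer<String, PartialCentroid> out = (id, p) ->
            emitted.add(id + " sum=(" + p.sum[0] + ", " + p.sum[1] + ") n=" + p.count);

        CombiningMapper mapper = new CombiningMapper();
        mapper.map("c1", new double[] {1, 2}, out);
        mapper.map("c2", new double[] {10, 10}, out);
        mapper.map("c1", new double[] {3, 4}, out);
        mapper.close();

        emitted.forEach(System.out::println);
        // c1 sum=(4.0, 6.0) n=2
        // c2 sum=(10.0, 10.0) n=1
    }
}
```

Since each mapper emits at most one record per cluster, the result no
longer depends on how often a combiner runs, or whether it runs at
all. The cost is that the mapper has to hold one partial sum per
cluster in memory until close().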

Jeff

Adil Aijaz wrote: