Subject: syntheticcontroldata clustering example failure due to combiner

  Adil Aijaz 2009-06-10, 17:49
  Jeff Eastman 2009-06-10, 23:30
  Jeff Eastman 2009-06-11, 00:04
  Jeff Eastman 2009-06-11, 03:54
  Jeff Eastman 2009-06-11, 03:59
  Adil Aijaz 2009-06-11, 16:49
  Benson Margulies 2009-06-11, 17:06
  Jeff Eastman 2009-06-11, 17:22
  Jeff Eastman 2009-06-11, 17:32
  Ted Dunning 2009-06-11, 19:54
For K-means, that is.

For other methods, there is either a set of sufficient statistics analogous
to the sum and count, or an approximation of that, or the combiner can't be
used.
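
To make the sum-and-count idea concrete, here is a minimal sketch in plain
Python rather than against the Hadoop or Mahout APIs. The names to_partial,
combine and centroid are hypothetical and only illustrate the shape of the
computation: each point becomes a (vector sum, count) pair, and merging two
pairs is associative and commutative, which is the property a combiner needs.

```python
from functools import reduce

def to_partial(point):
    """Wrap one point as a (vector sum, count) partial, as a mapper might."""
    return (list(point), 1)

def combine(a, b):
    """Merge two partials by adding sums and counts.

    The operation is associative and commutative, so it gives the same
    result whether it runs zero, one, or many times before the reducer.
    """
    sum_a, n_a = a
    sum_b, n_b = b
    return ([x + y for x, y in zip(sum_a, sum_b)], n_a + n_b)

def centroid(partial):
    """Final step: turn the accumulated sum and count into a mean."""
    total, n = partial
    return [x / n for x in total]

points = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0), (7.0, 8.0)]

# Pretend two mappers saw an unbalanced split and combined locally.
left = reduce(combine, map(to_partial, points[:3]))
right = reduce(combine, map(to_partial, points[3:]))
merged = combine(left, right)

print(centroid(merged))  # [4.0, 5.0]
```

Note that the centroid is only computed at the very end. Averaging inside the
combiner and then averaging the averages would weight the two splits equally
and give the wrong answer when they differ in size.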

For instance, if you have something like a median, you can pass around a
sample of (say) at most 100 points and a count of how many points these
represent.  Addition of two sets would consist of sampling from two sets in
proportion to the number of elements each sample represents.  In the end,
you have up to 100 points randomly sampled from everything that the reducer
would have seen which can give you a decent measure of the median.
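
The sampling scheme above could be sketched roughly as follows, again in plain
Python and not against any Hadoop or Mahout API. MAX_SAMPLE, make_summary and
merge are hypothetical names, and the rule of falling back to the other side
when one sample runs out is my own assumption, so treat this as one possible
reading rather than a reference implementation.

```python
import random
import statistics

MAX_SAMPLE = 100  # cap on points carried around, per the "at most 100" above

def make_summary(points, rng):
    """Summarize points as (sample, count), where count is how many points
    the sample represents."""
    points = list(points)
    if len(points) > MAX_SAMPLE:
        return (rng.sample(points, MAX_SAMPLE), len(points))
    return (points, len(points))

def merge(a, b, rng):
    """Combine two summaries, drawing from each side in proportion to the
    number of points it represents."""
    sample_a, n_a = a
    sample_b, n_b = b
    total = n_a + n_b
    pool_a, pool_b = list(sample_a), list(sample_b)
    rng.shuffle(pool_a)
    rng.shuffle(pool_b)
    merged = []
    while len(merged) < MAX_SAMPLE and (pool_a or pool_b):
        # Pick side A with probability n_a / total; if the chosen side is
        # exhausted, fall back to the other one (an assumption of this sketch).
        if pool_a and (not pool_b or rng.random() < n_a / total):
            merged.append(pool_a.pop())
        else:
            merged.append(pool_b.pop())
    return (merged, total)

rng = random.Random(0)
# An unbalanced split: one side represents 9000 points, the other 1000.
big = make_summary(range(0, 9000), rng)
small = make_summary(range(9000, 10000), rng)
combined_sample, combined_count = merge(big, small, rng)

estimate = statistics.median(combined_sample)
print(combined_count, len(combined_sample), estimate)
```

However large the inputs are, the summary that reaches the reducer never holds
more than MAX_SAMPLE points, and the estimate is only as good as a random
sample of that size allows.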

This isn't as good as the sums because the samples are bigger and the median
is only approximated.  It does deal with the problem of massive data going
to the reducer in the event of imbalance.

Ted Dunning, CTO

111 West Evelyn Ave. Ste. 202
Sunnyvale, CA 94086
858-414-0013 (m)
408-773-0220 (fax)