For K-means, that is.

For other methods, there is either a set of sufficient statistics analogous
to the sum and count, or an approximation of them, or the combiner can't be
used.
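To make the sum-and-count case concrete, here is a minimal sketch in Python
rather than actual Hadoop code. The names Partial, from_point, combine and
centroid are made up for illustration. The point is only that partial results
can be merged in any order, which is what lets a combiner run before the
reducer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Partial:
    """Sufficient statistics for one cluster: per-dimension sum and point count."""
    total: List[float]
    count: int


def from_point(point: List[float]) -> Partial:
    # A single point is a partial result that represents one element.
    return Partial(list(point), 1)


def combine(a: Partial, b: Partial) -> Partial:
    # Merging is element-wise addition of the sums plus addition of the counts.
    # It is associative and commutative, so it can be applied in any order.
    return Partial([x + y for x, y in zip(a.total, b.total)], a.count + b.count)


def centroid(p: Partial) -> List[float]:
    # The final step (normally done in the reducer) divides sum by count.
    return [x / p.count for x in p.total]
```

Combining from_point([1.0, 2.0]) with from_point([3.0, 4.0]) and calling
centroid on the result gives [2.0, 3.0], the same answer as averaging the raw
points directly.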

For instance, if you have something like a median, you can pass around a
sample of (say) at most 100 points and a count of how many points these
represent. Adding two such sets consists of sampling from the two samples in
proportion to the number of elements each sample represents. In the end,
you have up to 100 points randomly sampled from everything that the reducer
would have seen, which can give you a decent estimate of the median.
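The merge step described above could be sketched roughly as follows. This is
one possible reading, not a definitive implementation: Sketch, merge,
approx_median and MAX_SAMPLE are hypothetical names, and the code assumes each
incoming sample is itself a uniform random sample of the points it represents.

```python
import random
import statistics
from dataclasses import dataclass
from typing import List

MAX_SAMPLE = 100  # assumed cap, matching the "(say) at most 100 points" above


@dataclass
class Sketch:
    """A bounded sample plus the number of original points it represents."""
    sample: List[float]
    count: int


def merge(a: Sketch, b: Sketch, rng: random.Random, k: int = MAX_SAMPLE) -> Sketch:
    pool_a, pool_b = list(a.sample), list(b.sample)
    rem_a, rem_b = a.count, b.count  # represented points not yet "used up"
    merged: List[float] = []
    while len(merged) < k and (pool_a or pool_b):
        # Weight each side by the points it represents; a side whose
        # sample is exhausted gets weight zero.
        w_a = rem_a if pool_a else 0
        w_b = rem_b if pool_b else 0
        if w_a + w_b == 0:
            break
        if rng.random() * (w_a + w_b) < w_a:
            merged.append(pool_a.pop(rng.randrange(len(pool_a))))
            rem_a -= 1
        else:
            merged.append(pool_b.pop(rng.randrange(len(pool_b))))
            rem_b -= 1
    return Sketch(merged, a.count + b.count)


def approx_median(s: Sketch) -> float:
    # The median of the sample serves as the estimate for the full data set.
    return statistics.median(s.sample)
```

A combiner would call merge on pairs of sketches, and the reducer would call
approx_median once on the final one. The result never holds more than k points,
however many points it represents.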

This isn't as good as the sums because the samples are bigger and the median
is only approximated. It does, however, deal with the problem of massive data
going to the reducer in the event of imbalance.

On Thu, Jun 11, 2009 at 12:54 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:

--

Ted Dunning, CTO

DeepDyve

111 West Evelyn Ave. Ste. 202

Sunnyvale, CA 94086

http://www.deepdyve.com

858-414-0013 (m)

408-773-0220 (fax)
