Hm, maybe I need some clarification on what the combiner exactly does. From
what I understand from "Hadoop - The Definitive Guide", there are a few
occasions when a combiner may be called before the sort&shuffle phase.
1) Once the in-memory buffer reaches the threshold it will spill out to
disk. "Before it writes to disk, the thread first divides the data into
partitions corresponding to the reducers that they will ultimately be sent
to. Within each partition, the background thread performs an in-memory sort
by key, and if there is a combiner function, it is run on the output of the
sort. Running the combiner function makes for a more compact map output, so
there is less data to write to local disk and to transfer to the reducer."
So to me, this means that the combiner at this point only operates on the
data that is located in the in-memory buffer. If the buffer can keep at
most n records with k distinct keys (uniformly distributed), then the
combiner will cause a reduction in records spilled to disk by a factor of
2) "Before the task is finished, the spill files are merged into a single
partitioned and sorted output file. [...] If there are at least three spill
files (set by the min.num.spills.for.combine property) then the combiner is
run again before the output file is written." So the number of spill files
is not affected by the use of a combiner, only their sizes usually get
reduced and only at the end of the map task, all spill files are touched
again, merged and combined. If I have k distinct keys per map-task, then I
will be guaranteed to have k records at the very end of the map-task.
Is there any other occasion when the combiner may be called? Are spill
files ever touched again before the final merge?
2012/11/7 Sigurd Spieckermann <[EMAIL PROTECTED]>