Thanks for all your answers so far! One question is still open, and I
can't find an answer to it in the source code or the documentation.
Suppose I pass the source directories of my two datasets to
CompositeInputFormat to be joined, with dataset A first and B second.
Will Hadoop MR try to run each map task on a datanode that stores a
replica of the split of A, on a datanode that stores a replica of the
corresponding split of B, or at least on any datanode that stores either
of the two splits? Since Hadoop MR takes split locations into account
when scheduling map tasks, I assume there must be some strategy for the
case where a map task reads two splits, but it is not clear to me how
exactly it behaves here.
On 25.10.2012 at 09:17, "Bertrand Dechoux" <[EMAIL PROTECTED]> wrote: