OK, interesting. Just to confirm: is it okay to distribute quite large
files through the DistributedCache? Dataset B could be on the order of
gigabytes.

Also, if I have far fewer nodes than elements/blocks in A, then the
probability that every node has to read (almost) every block of B is quite
high, so (assuming the DistributedCache is okay here in general) it would
be more efficient to use it than to read from HDFS. But what about the
case where I have m*n nodes? Then every node would receive all of B while
needing only a small fraction of it, right? Could you elaborate on this in
a few sentences, just to be sure I understand Hadoop correctly?
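To make my intuition concrete (this is just my own back-of-envelope model,
so the assumptions may be off): suppose each map task touches an
independent, uniformly random fraction q of B's blocks, and a node runs t
tasks. The expected fraction of B that node ends up needing is then
1 - (1 - q)^t, which approaches 1 when there are few nodes (large t) and
stays at q when there are m*n nodes (t = 1):

```python
def expected_fraction_needed(q, t):
    """Expected fraction of B's blocks one node needs, assuming each of
    its t tasks independently touches a random fraction q of B."""
    return 1 - (1 - q) ** t

# Few nodes: each node runs many tasks, so it needs nearly all of B,
# and shipping all of B via the DistributedCache wastes little.
print(expected_fraction_needed(0.1, 50))  # ~0.99

# m*n nodes: one task per node, so each node needs only a small
# fraction of B but would still receive all of it via the cache.
print(expected_fraction_needed(0.1, 1))   # ~0.1
```

Is this roughly the right way to think about the trade-off?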


2012/9/10 Harsh J <[EMAIL PROTECTED]>