I am benchmarking my cluster of 16 nodes (all in one rack) with TestDFSIO on Hadoop 1.0.4. For simplicity, I turned off speculative task execution and set the maximum number of map and reduce tasks per node to 1.
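For concreteness, a typical invocation on 1.0.4 looks roughly like the following (the jar path and the 5000 MB file size are illustrative for my setup; the speculative-execution and max-task properties I mentioned are the standard mapred-site.xml keys):

    # Relevant mapred-site.xml settings (restart the TaskTrackers after changing them):
    #   mapred.map.tasks.speculative.execution    = false
    #   mapred.reduce.tasks.speculative.execution = false
    #   mapred.tasktracker.map.tasks.maximum      = 1
    #   mapred.tasktracker.reduce.tasks.maximum   = 1
    # dfs.replication = 2 in hdfs-site.xml gives the two-way replication.

    # Write phase: N files of 5GB each (here N=1)
    hadoop jar $HADOOP_HOME/hadoop-test-1.0.4.jar TestDFSIO -write -nrFiles 1 -fileSize 5000

    # Read phase against the files written above
    hadoop jar $HADOOP_HOME/hadoop-test-1.0.4.jar TestDFSIO -read -nrFiles 1 -fileSize 5000

    # Remove the /benchmarks/TestDFSIO output between runs
    hadoop jar $HADOOP_HOME/hadoop-test-1.0.4.jar TestDFSIO -clean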
With a replication factor of 2, writing one 5GB file takes twice as long as reading one. That result seems to make sense, since replication means the write generates twice the disk I/O in the cluster that the read does. However, as I scale the workload from 1 to 64 such 5GB files, reading gradually slows relative to writing, until at 64 files reading takes as long as writing.
What could cause read performance to degrade faster than write performance as the number of files increases?
The full results (number of 5GB files, ratio of write time to read time) are below:

    Number of 5GB files    Write/Read time ratio
             1                     2.02
             2                     1.87
             4                     1.73
             8                     1.54
            16                     1.37
            32                     1.29
            64                     1.01
I would advise against using TestDFSIO and suggest trying TeraGen and TeraValidate instead. IIRC TestDFSIO doesn't actually schedule for task locality, so it's not a great benchmark if your cluster is bigger than your replication factor. You might be becoming network bound as you try to read more files.
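Something along these lines should work with the stock examples jar on 1.0.4 (the jar path, the ~50GB data size, and the HDFS paths are just placeholders for your setup):

    # TeraGen writes 100-byte rows, so 500,000,000 rows is roughly 50GB (write benchmark)
    hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar teragen 500000000 /benchmarks/teragen

    # TeraSort sorts that data; TeraValidate then reads the sorted output (read benchmark)
    hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar terasort /benchmarks/teragen /benchmarks/terasort
    hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar teravalidate /benchmarks/terasort /benchmarks/teravalidate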
Reads can be faster than writes for smaller bursts of I/O, in part because of disk and memory caching on the read path (if you turn on write-back caching, which I don't recommend, your numbers above are likely to get closer together). As your volume of I/O increases, you tend to reach a point where you are bound, more or less, by physical I/O and no longer get any benefit from the cache. FYI, if what you are seeing is these cache misses, then reducing the percentage of memory UNIX uses for file system buffers should make the phenomenon you observed show up sooner.
If you look at iostat you may see that the read and write service times on the devices are converging, which is another indication of cache misses.
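For example, something like the following on each datanode during the benchmark (the 5-second interval is arbitrary, and older sysstat versions only report a combined await/svctm rather than separate read and write latencies):

    # Extended per-device statistics every 5 seconds.
    # If read and write service times converge as the file count grows,
    # that points to the page cache no longer helping the reads.
    iostat -x 5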
Daemeon - Indeed, I neglected to mention that I am clearing the caches throughout my cluster before running the read benchmark. My expectation was that, ideally, the results would be proportional to disk I/O, given that replicated writes perform twice the disk I/O of reads. I've verified the I/O with iostat. However, as I mentioned earlier, reads and writes converge as the number of files in the workload increases, despite the constant ratio of write I/O to read I/O.
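For reference, the usual way to drop the caches on Linux (run as root on each node before the read phase; the exact mechanism may vary by distribution) is something like:

    # Flush dirty pages to disk, then drop the page cache, dentries and inodes
    sync && echo 3 > /proc/sys/vm/drop_caches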
Andrew - I've verified that the network is not the bottleneck (all of the links are 10Gb). As you'll see below, I suspect that the lack of data locality causes the slowdown, because a given node can end up serving multiple remote block reads at once.
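One way to see how much remote reading is going on is to compare block placement with where the read map tasks actually ran (the path below is the default TestDFSIO data directory; adjust it if your install puts the benchmark files elsewhere):

    # List which datanodes hold each block of the benchmark files
    hadoop fsck /benchmarks/TestDFSIO/io_data -files -blocks -locations

    # Cross-check against the read job's "Data-local map tasks" counter
    # in the JobTracker web UI to see how many map tasks read a local replica.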
I hope my understanding of writes and reads can be confirmed:
Write pipelining allows a node to write, replicate, and receive replicated data in parallel. If node A is writing its own data while also receiving replicated data from node B, node B does not wait for node A to finish writing B's replicated data to disk; rather, node B can begin writing its next local block immediately. Thus, pipelining lets replicated writes achieve good performance.
In contrast, suppose node A is currently reading a block. If node A receives an additional read request from node B, A will take longer to serve that block to B because of its pre-existing read. Because node B waits longer for the block to arrive from A, node B is delayed before it can move on to the next block in the file. Multiple read requests from different nodes are a consequence of TestDFSIO having no built-in data locality. Finally, as the number of concurrent tasks across the cluster increases, the wait time for reads increases.
Is my understanding of these read and write mechanisms correct?