Subject: Poor HDFS performance: "Slow BlockReceiver write packet to mirror"

Hi there,

I'd like to report some strange behavior we are seeing while setting up a
new Hadoop cluster on a new hardware stack.

Once everything was installed, as soon as we try to perform any I/O
operation on HDFS, we see many of these messages in the datanode logs:

15/01/14 22:13:07 WARN datanode.DataNode: Slow BlockReceiver write packet
to mirror took 6339ms (threshold=300ms)
15/01/14 22:13:26 INFO DataNode.clienttrace: src: /, dest: /, bytes: 176285, op: HDFS_WRITE, cliID:
DFSClient_NONMAPREDUCE_-832581408_1, offset: 0, srvID:
af886556-96db-4b03-9b5b-cd20c3d66f5a, blockid:
BP-784291941-, duration:

Followed by the famous one: 60000 millis timeout while waiting for
channel to be ready for read (...)

We suspected the dual-VLAN + bonded network interfaces (2x10 Gbps) to be
part of this, but we checked a number of these points and found nothing:
iperf, dd/hdparm, increasing Xmx (8 GB), sysbench...
The only thing we found is that the cluster's disks show a fairly high
`await` time while HDFS is running (>500 ms, correlated with our log
messages), but we can't clearly explain why.
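
For anyone who wants to reproduce that correlation, here is a minimal
sketch, assuming the datanode log format shown above. It buckets the
slow-write warnings per minute so they can be lined up against `await`
samples collected separately (e.g. with iostat). The function name and
the per-minute bucketing are illustrative choices, not anything provided
by Hadoop.

```python
import re
from collections import defaultdict

# Matches the warning quoted above. \s+ tolerates the line wrapping that
# mail clients add; in a real log file the message sits on one line.
SLOW_RE = re.compile(
    r"(\d{2}/\d{2}/\d{2} \d{2}:\d{2}):\d{2} WARN datanode\.DataNode: "
    r"Slow BlockReceiver write packet\s+to mirror took (\d+)ms "
    r"\(threshold=(\d+)ms\)"
)

def slow_writes_per_minute(log_text):
    """Return {minute: (count, max_ms)} for slow mirror-write warnings."""
    buckets = defaultdict(list)
    for match in SLOW_RE.finditer(log_text):
        minute, took_ms, _threshold = match.groups()
        buckets[minute].append(int(took_ms))
    return {m: (len(v), max(v)) for m, v in buckets.items()}

# The sample is the warning from this message, including the mail wrapping.
sample = (
    "15/01/14 22:13:07 WARN datanode.DataNode: Slow BlockReceiver write packet\n"
    "to mirror took 6339ms (threshold=300ms)\n"
)
print(slow_writes_per_minute(sample))
```

The per-minute keys can then be compared by hand against the timestamps
of the disk statistics to see whether the warnings cluster around the
`await` spikes.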

Even though you will probably all suspect the HDDs to be the cause of our
troubles, can someone explain these log messages? We couldn't find anything
interesting in the source code, except that they occur while doing
`flush or sync` (makes sense...).

Hadoop 2.6.0
9 Datanodes
Debian 3.2.63-2+deb7u2 x86_64
10x 1TB SAS drives
OpenJDK Runtime Environment (IcedTea 2.5.3) (7u71-2.5.3-2~deb7u1)
OpenJDK 64-Bit Server VM (build 24.65-b04, mixed mode)



*Adrien Mogenet*
Head of Backend/Infrastructure
4, avenue Franklin D. Roosevelt - 75008 Paris