Monitoring Hadoop


SPM monitors both MRv1-based (0.22 and earlier, 1.0, 1.1) and YARN-based (0.23, 2.*) Hadoop versions. Because the two architectures differ, SPM uses a separate application type for each, and each type has its own set of reports.
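As a rough illustration of the version split described above, the following sketch maps a Hadoop version string to one of the two application types. The function name `hadoop_app_type` and the labels `"MRv1"` and `"YARN"` are hypothetical and not part of SPM; versions not listed above (for example 1.2) are deliberately left unclassified.

```python
from typing import Optional


def hadoop_app_type(version: str) -> Optional[str]:
    """Return "MRv1" or "YARN" for a listed Hadoop version, else None.

    Hypothetical helper: it only encodes the version ranges named in the
    text (0.22 and earlier, 1.0, 1.1 -> MRv1; 0.23, 2.* -> YARN).
    """
    parts = version.strip().split(".")
    if not parts[0].isdigit():
        return None
    major = int(parts[0])
    if major == 2:
        # Every 2.* release is YARN-based, so the minor part is irrelevant.
        return "YARN"
    if len(parts) < 2 or not parts[1].isdigit():
        return None
    minor = int(parts[1])
    if major == 0:
        if minor <= 22:
            return "MRv1"
        if minor == 23:
            return "YARN"
        return None
    if major == 1 and minor in (0, 1):
        return "MRv1"
    # Anything else is not covered by the text, so make no claim.
    return None
```

For example, `hadoop_app_type("0.20.2")` returns `"MRv1"` and `hadoop_app_type("2.7.3")` returns `"YARN"`.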

Common reports for all Hadoop types:

  • Overview
  • NameNode
  • DataNode
  • CPU & Mem
  • Disk
  • Network
  • JVM
  • GC

In addition, MRv1 versions get the following reports:

  • JobTracker
  • JobTracker Queues
  • TaskTracker

Reports specific to YARN versions are:

  • ResourceManager
  • ResourceManager Queues
  • NodeManager
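The report lists above can be summarized as a small lookup. This is only a sketch of the grouping, not SPM code; the names `COMMON_REPORTS`, `EXTRA_REPORTS`, and `reports_for` are hypothetical.

```python
from typing import Dict, List

# Reports shown for every Hadoop application type.
COMMON_REPORTS: List[str] = [
    "Overview", "NameNode", "DataNode", "CPU & Mem",
    "Disk", "Network", "JVM", "GC",
]

# Reports that exist only for one application type.
EXTRA_REPORTS: Dict[str, List[str]] = {
    "MRv1": ["JobTracker", "JobTracker Queues", "TaskTracker"],
    "YARN": ["ResourceManager", "ResourceManager Queues", "NodeManager"],
}


def reports_for(app_type: str) -> List[str]:
    """Return the common reports followed by the type-specific ones."""
    if app_type not in EXTRA_REPORTS:
        raise ValueError(f"unknown application type: {app_type!r}")
    return COMMON_REPORTS + EXTRA_REPORTS[app_type]
```

Calling `reports_for("YARN")` yields the eight common reports followed by ResourceManager, ResourceManager Queues, and NodeManager.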

Some reports can be empty because a particular Hadoop version doesn't expose the corresponding metrics over JMX. For instance, the 0.20, 0.21, and 0.22 MRv1 versions have no data in the JobTracker, JobTracker Queues, and TaskTracker reports, while 1.0 and 1.1 populate all reports. NOTE: regardless of this, you can monitor JVM stats of the JobTracker and TaskTracker processes under the JVM report for all MRv1 versions (0.20, 0.21, and 0.22 included). Also, since SecondaryNameNode doesn't expose specific metrics, it has no dedicated report, but it can be monitored under the JVM report too (for instance, you can create an alert that notifies you when its heap size reaches some limit or drops to 0, which means the process has likely died).
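The heap-size alert mentioned above boils down to two conditions. The sketch below shows that rule in isolation; it is not SPM's alerting API, and the function name `heap_alert` and its parameters are hypothetical.

```python
from typing import Optional


def heap_alert(heap_used_bytes: float, limit_bytes: float) -> Optional[str]:
    """Return an alert message for a JVM heap reading, or None if healthy.

    Hypothetical rule: a reading of 0 is treated as "process likely died",
    and a reading at or above the limit as "heap limit reached".
    """
    if heap_used_bytes <= 0:
        return "heap size dropped to 0: process likely died"
    if heap_used_bytes >= limit_bytes:
        return "heap size reached the configured limit"
    return None
```

With a 1 GiB limit, a reading of 512 MiB produces no alert, while readings of 0 or 1 GiB each produce one.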

All YARN versions (0.23, 2.*) populate every available report, and we expect new Hadoop versions to do the same.

YARN versions don't have separate reports for the following components, because they don't expose specific metrics:

  • HistoryServer
  • WebAppProxy

However, you can still monitor these processes under the JVM report, in the same way SecondaryNameNode is monitored in MRv1 setups. You can also define alerts based on JVM metrics, which should be good enough for most situations.



| Metric Name | Key | Agg | Type | Description |
|---|---|---|---|---|
| num allocated containers | hadoop.nm.containers.allocated | Avg | Double | |
| containers launched | hadoop.nm.containers.launched | Sum | Long | |
| containers killed | hadoop.nm.containers.killed | Sum | Long | |
| shuffle output bytes | hadoop.nm.shuffle.output.bytes | Sum | Long | |
| containers completed | hadoop.nm.containers.completed | Sum | Long | |
| available GB | hadoop.nm.available | Avg | Double | |
| shuffle connections | hadoop.nm.shuffle.connections | Sum | Long | |
| shuffle outputs failed | hadoop.nm.shuffle.output.failed | Sum | Long | |
| containers running | hadoop.nm.containers.running | Avg | Double | |
| containers failed | hadoop.nm.containers.failed | Sum | Long | |
| containers inited | hadoop.nm.containers.initiating | Avg | Double | |
| shuffle outputs ok | hadoop.nm.shuffle.output.ok | Sum | Long | |
| allocated GB | hadoop.nm.allocated | Avg | Double | |
| applications killed | hadoop.rm.apps.killed | Sum | Long | |
| running 300 | hadoop.rm.running.300 | Avg | Double | |
| available MB | hadoop.rm.memory.available | Avg | Double | |
| applications failed | hadoop.rm.apps.failed | Sum | Long | |
| applications submitted | hadoop.rm.apps.submitted | Sum | Long | |
| running 60 | hadoop.rm.running.60 | Avg | Double | |
| pending MB | hadoop.rm.memory.pending | Avg | Double | |
| applications pending | hadoop.rm.apps.pending | Avg | Double | |
| active applications | | Avg | Double | |
| containers allocated | hadoop.rm.containers.alloc | Sum | Long | |
| running 0 | hadoop.rm.running.0 | Avg | Double | |
| containers pending | hadoop.rm.containers.pending | Avg | Double | |
| applications completed | hadoop.rm.apps.completed | Sum | Long | |
| allocated MB | hadoop.rm.memory.alloc | Avg | Double | |
| containers reserved | hadoop.rm.containers.reserved | Avg | Double | |
| active users | | Avg | Double | |
| containers released | hadoop.rm.containers.released | Sum | Long | |
| reserved MB | hadoop.rm.memory.reserved | Avg | Double | |
| applications running | hadoop.rm.apps.running | Avg | Double | |
| running 1440 | hadoop.rm.running.1440 | Avg | Double | |
| active NMs | | Avg | Double | |
| decom NMs | | Avg | Double | |
| rebooted NMs | | Avg | Double | |
| lost NMs | | Avg | Double | |
| unhealthy NMs | | Avg | Double | |
| blocks replicated | hadoop.dn.blocks.replicated | Sum | Long | |
| blocks read | | Sum | Long | |
| writes from local client | | Sum | Long | |
| copy block op avg time | hadoop.dn.blocks.op.copies.time | Avg | Double | |
| heartbeats num ops | | Sum | Long | |
| reads from remote client | | Sum | Long | |
| read from local client | | Sum | Long | |
| write block num ops | hadoop.dn.blocks.op.writes | Sum | Long | |
| blocks removed | hadoop.dn.blocks.removed | Sum | Long | |
| blocks verified | hadoop.dn.blocks.verified | Sum | Long | |
| replace block op avg time | hadoop.dn.blocks.op.replaces.time | Avg | Double | |
| writes from remote client | | Sum | Long | |
| replace block num ops | hadoop.dn.blocks.op.replaces | Sum | Long | |
| copy block num ops | hadoop.dn.blocks.op.copies | Sum | Long | |
| write block op avg time | hadoop.dn.blocks.op.writes.time | Avg | Double | |
| blocks written | hadoop.dn.blocks.write | Sum | Long | |
| read block op avg time | hadoop.dn.blocks.op.reads.time | Avg | Double | |
| heartbeats avg time | | Avg | Double | |
| bytes read | | Sum | Long | |
| read block num ops | hadoop.dn.blocks.op.reads | Sum | Long | |
| bytes written | | Sum | Long | |
| map task slots | | Avg | Double | |
| reduces running | | Avg | Double | |
| tasks failed timeout | | Sum | Long | |
| tasks failed ping | | Sum | Long | |
| tasks completed | | Sum | Long | |
| reduce task slots | | Avg | Double | |
| maps running | | Avg | Double | |
| waiting reduces | hadoop.jt.reduces.waiting | Avg | Double | |
| waiting maps | hadoop.jt.maps.waiting | Avg | Double | |
| jobs killed | | Sum | Long | |
| jobs submitted | | Sum | Long | |
| maps failed | hadoop.jt.maps.failed | Sum | Long | |
| running 300 | hadoop.jt.running.300 | Avg | Double | |
| maps killed | hadoop.jt.maps.killed | Sum | Long | |
| reduce slots | hadoop.jt.reduces.slots | Avg | Double | |
| running 60 | hadoop.jt.running.60 | Avg | Double | |
| maps launched | hadoop.jt.maps.launched | Sum | Long | |
| jobs preparing | | Avg | Double | |
| reduces completed | hadoop.jt.reduces.completed | Sum | Long | |
| jobs failed | | Sum | Long | |
| running 1440 | hadoop.jt.running.1440 | Avg | Double | |
| reduces launched | hadoop.jt.reduces.launched | Sum | Long | |
| map slots | hadoop.jt.maps.slots | Avg | Double | |
| reduces killed | hadoop.jt.reduces.killed | Sum | Long | |
| maps completed | hadoop.jt.maps.completed | Sum | Long | |
| jobs running | | Avg | Double | |
| reduces failed | hadoop.jt.reduces.failed | Sum | Long | |
| jobs completed | | Sum | Long | |
| running 0 | hadoop.jt.running.0 | Avg | Double | |
| get listing ops | hadoop.nn.files.ops.listing | Sum | Long | |
| renamed files | hadoop.nn.files.renamed | Sum | Long | |
| decom nodes | hadoop.nn.nodes.decom | Avg | Double | |
| dead nodes | hadoop.nn.nodes.dead | Avg | Double | |
| create file ops | hadoop.nn.files.ops.create | Sum | Long | |
| blocks pending replication | hadoop.nn.blocks.pending.replication | Avg | Long | |
| missing blocks | hadoop.nn.blocks.missing | Avg | Long | |
| under replicated blocks | hadoop.nn.blocks.underreplicated | Avg | Long | |
| file info ops | | Sum | Long | |
| blocks total | hadoop.nn.blocks | Avg | Long | |
| blocks pending deletion | hadoop.nn.blocks.pending.deletion | Avg | Long | |
| corrupt blocks | hadoop.nn.blocks.corrupt | Avg | Long | |
| capacity remaining | hadoop.nn.capacity.remaining | Avg | Long | |
| total files | hadoop.nn.files | Avg | Long | |
| appended files | hadoop.nn.files.appended | Sum | Long | |
| excess blocks | hadoop.nn.blocks.excess | Avg | Long | |
| capacity used | hadoop.nn.capacity.used | Avg | Long | |
| deleted files | hadoop.nn.files.deleted | Sum | Long | |
| delete file ops | hadoop.nn.files.ops.delete | Sum | Long | |
| capacity total | hadoop.nn.capacity | Avg | Long | |
| created files | hadoop.nn.files.created | Sum | Long | |
| live nodes | | Avg | Double | |
| scheduled replication blocks | hadoop.nn.blocks.scheduled.replication | Avg | Long | |
| heartbeats | hadoop.jt.heartbeats | Sum | Long | |
| blacklisted trackers | hadoop.jt.reduces.trackers.blacklisted | Sum | Long | |
| running reduces | hadoop.jt.reduces.running | Avg | Double | |
| occupied map slots | | Avg | Double | |
| running maps | hadoop.jt.maps.running | Avg | Double | |
| blacklisted maps | hadoop.jt.maps.blacklisted | Sum | Long | |
| blacklisted reduces | hadoop.jt.reduces.blacklisted | Sum | Long | |
| occupied reduce slots | hadoop.jt.slots.reduce.occupied | Avg | Double | |
| decommissioned trackers | hadoop.jt.reduces.trackers.decommissioned | Sum | Long | |
| trackers | hadoop.jt.reduces.trackers | Sum | Long | |
| waiting reduces | hadoop.jt.waiting.reduces | Avg | Double | |
| waiting maps | hadoop.jt.waiting.maps | Avg | Double | |
| map slots | | Avg | Double | |
| graylisted trackers | hadoop.jt.reduces.trackers.graylisted | Sum | Long | |
| reduce slots | hadoop.jt.slots.reduce | Avg | Double | |
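The Agg column in the table above takes one of two values, Sum or Avg. The sketch below shows one plausible reading of that column: counter-like metrics are added up over a time bucket, and gauge-like metrics are averaged. This interpretation is an assumption rather than documented SPM behavior, and the names `AGG_FUNCS` and `aggregate` are hypothetical.

```python
from statistics import mean
from typing import Callable, Dict, Optional, Sequence

# Assumed meaning of the Agg column: "Sum" adds samples, "Avg" averages them.
AGG_FUNCS: Dict[str, Callable[[Sequence[float]], float]] = {
    "Sum": sum,
    "Avg": mean,
}


def aggregate(agg: str, samples: Sequence[float]) -> Optional[float]:
    """Collapse the samples of one time bucket into a single value."""
    if agg not in AGG_FUNCS:
        raise ValueError(f"unknown aggregation: {agg!r}")
    if not samples:
        # No data points in the bucket, so there is nothing to report.
        return None
    return AGG_FUNCS[agg](samples)
```

Under this reading, three samples of `hadoop.nm.containers.launched` (Sum) would be added together, while three samples of `hadoop.nm.containers.running` (Avg) would be averaged.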