Hadoop Monitoring Integration

Overview

SPM can monitor both MRv1-based (0.22 and earlier, 1.0, 1.1) and YARN-based (0.23, 2.*) Hadoop versions. Because the two architectures differ, SPM uses a separate application type for each, and each type offers a different set of reports.

Common reports for all Hadoop types:

  • Overview
  • NameNode
  • DataNode
  • CPU & Mem
  • Disk
  • Network
  • JVM
  • GC

In addition, MRv1 versions get the following reports:

  • JobTracker
  • JobTracker Queues
  • TaskTracker

The reports specific to YARN versions are:

  • ResourceManager
  • ResourceManager Queues
  • NodeManager

Some reports may be empty because a particular Hadoop version does not expose the corresponding metrics over JMX. For instance, the 0.20, 0.21 and 0.22 MRv1 versions of Hadoop will not have data in the JobTracker, JobTracker Queues and TaskTracker reports (while 1.0 and 1.1 will have all reports populated). NOTE: regardless of this, you can monitor JVM stats of the JobTracker and TaskTracker processes under the JVM report for all MRv1 versions (0.20, 0.21 and 0.22 included). Also, since SecondaryNameNode doesn't expose specific metrics, it has no dedicated report, but it can be monitored under the JVM report as well (for instance, you can create an alert that notifies you when its heap size reaches some limit or drops to 0, which means the process has likely died).

All YARN versions (0.23, 2.*) populate all available reports, and we expect new Hadoop versions to continue to do so.
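The version rules above can be summarized in a short Python sketch. The report names and version ranges come from the text; the function names, the "major.minor" version format, and the use of None for cases the text does not cover are assumptions made for illustration and are not part of SPM.

```python
from typing import Dict, Optional, Tuple

# Report names as listed in this document.
COMMON = ["Overview", "NameNode", "DataNode", "CPU & Mem",
          "Disk", "Network", "JVM", "GC"]
MRV1_ONLY = ["JobTracker", "JobTracker Queues", "TaskTracker"]
YARN_ONLY = ["ResourceManager", "ResourceManager Queues", "NodeManager"]


def _major_minor(version: str) -> Tuple[int, int]:
    """Parse 'major.minor[.patch]' into two ints (assumed input format)."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def hadoop_type(version: str) -> str:
    """Classify a version as 'YARN' (0.23, 2.* and newer) or 'MRv1'."""
    major, minor = _major_minor(version)
    if major >= 2 or (major == 0 and minor >= 23):
        return "YARN"
    return "MRv1"  # 0.22 and earlier, 1.0, 1.1


def report_status(version: str) -> Dict[str, Optional[bool]]:
    """Map each available report to True (populated), False (empty),
    or None (this document does not say)."""
    major, minor = _major_minor(version)
    if hadoop_type(version) == "YARN":
        return {name: True for name in COMMON + YARN_ONLY}

    status: Dict[str, Optional[bool]] = {name: True for name in COMMON}
    if major == 1:
        specific: Optional[bool] = True    # 1.0 and 1.1: all reports populated
    elif major == 0 and minor in (20, 21, 22):
        specific = False                   # metrics not exposed over JMX
    else:
        specific = None                    # older versions: not stated here
    for name in MRV1_ONLY:
        status[name] = specific
    return status
```

For example, `report_status("0.21")["JobTracker"]` is `False`, while `report_status("0.21")["JVM"]` is `True`, which matches the note that JVM stats remain available for all MRv1 versions.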

YARN versions don't have separate reports for the following components (since they don't expose specific metrics):

  • HistoryServer
  • WebAppProxy

However, you can still monitor these processes under the JVM report, in the same way that SecondaryNameNode can be monitored in MRv1 setups. You can also define alerts based on JVM metrics, which should be sufficient for most situations.
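The heap-based alert rule mentioned above (heap size reaches some limit, or drops to 0 because the process has likely died) can be written down as a small check. This is only a sketch of the rule itself: the function name, its parameters, and the messages are hypothetical and do not represent SPM's alerting mechanism.

```python
from typing import Optional


def heap_alert(heap_bytes: float, limit_bytes: float) -> Optional[str]:
    """Return an alert message for a JVM heap reading, or None if it looks fine.

    heap_bytes  -- latest heap size reported for the process (hypothetical input)
    limit_bytes -- threshold chosen by the operator (hypothetical input)
    """
    if heap_bytes <= 0:
        # A heap size of 0 suggests the JVM stopped reporting,
        # so the process has likely died.
        return "process likely down: heap size is 0"
    if heap_bytes >= limit_bytes:
        return "heap size reached the configured limit"
    return None


# Example readings for a SecondaryNameNode with a 512 MB limit.
LIMIT = 512 * 1024 ** 2
for reading in (128 * 1024 ** 2, 600 * 1024 ** 2, 0):
    print(reading, heap_alert(reading, LIMIT))
```

The same rule applies to any process that is only visible under the JVM report, such as HistoryServer or WebAppProxy.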

Integration

Metrics

| Metric Name | Key | Type | Unit |
|---|---|---|---|
| data node bytes read | hadoop.dn.io.read | long counter | |
| data node bytes written | hadoop.dn.io.write | long counter | |
| data node reads from local client | hadoop.dn.io.read.local | long counter | |
| data node reads from remote client | hadoop.dn.io.read.remote | long counter | |
| data node writes from local client | hadoop.dn.io.write.local | long counter | |
| data node writes from remote client | hadoop.dn.io.write.remote | long counter | |
| data node heartbeats avg time | hadoop.dn.io.write.heartbeats.time | double gauge | ms |
| data node heartbeats ops | hadoop.dn.io.write.heartbeats | long counter | |
| block checksum op avg time | hadoop.dn.blocks.op.checksum.time | double gauge | ms |
| block checksum num ops | hadoop.dn.blocks.op.checksum | long counter | |
| block report op avg time | hadoop.dn.blocks.op.reports.time | double gauge | ms |
| block report ops | hadoop.dn.blocks.op.reports | long counter | |
| copy block op avg time | hadoop.dn.blocks.op.copies.time | double gauge | ms |
| copy block ops | hadoop.dn.blocks.op.copies | long counter | |
| read block op avg time | hadoop.dn.blocks.op.reads.time | double gauge | ms |
| read block ops | hadoop.dn.blocks.op.reads | long counter | |
| replace block op avg time | hadoop.dn.blocks.op.replaces.time | double gauge | ms |
| replace block ops | hadoop.dn.blocks.op.replaces | long counter | |
| write block op avg time | hadoop.dn.blocks.op.writes.time | double gauge | ms |
| write block ops | hadoop.dn.blocks.op.writes | long counter | |
| blocks read | hadoop.dn.blocks.read | long counter | |
| blocks removed | hadoop.dn.blocks.removed | long counter | |
| blocks replicated | hadoop.dn.blocks.replicated | long counter | |
| blocks verified | hadoop.dn.blocks.verified | long counter | |
| blocks written | hadoop.dn.blocks.write | long counter | |
| jobtracker heartbeats | hadoop.jt.heartbeats | long counter | |
| running maps | hadoop.jt.maps.running | long gauge | |
| running reduces | hadoop.jt.reduces.running | long gauge | |
| waiting maps | hadoop.jt.maps.waiting | long gauge | |
| waiting reduces | hadoop.jt.reduces.waiting | long gauge | |
| blacklisted maps | hadoop.jt.maps.blacklisted | long counter | |
| blacklisted reduces | hadoop.jt.reduces.blacklisted | long counter | |
| trackers | hadoop.jt.reduces.trackers | long counter | |
| blacklisted trackers | hadoop.jt.reduces.trackers.blacklisted | long counter | |
| decommissioned trackers | hadoop.jt.reduces.trackers.decommissioned | long counter | |
| graylisted trackers | hadoop.jt.reduces.trackers.graylisted | long counter | |
| reduce slots | hadoop.jt.slots.reduce | long gauge | |
| map slots | hadoop.jt.slots.map | long gauge | |
| occupied map slots | hadoop.jt.slots.map.occupied | long gauge | |
| occupied reduce slots | hadoop.jt.slots.reduce.occupied | long gauge | |
| jobs completed | hadoop.jt.jobs.completed | long counter | |
| jobs failed | hadoop.jt.jobs.failed | long counter | |
| jobs killed | hadoop.jt.jobs.killed | long counter | |
| jobs preparing | hadoop.jt.jobs.preparing | long gauge | |
| jobs running | hadoop.jt.jobs.running | long gauge | |
| jobs submitted | hadoop.jt.jobs.submitted | long counter | |
| maps completed | hadoop.jt.maps.completed | long counter | |
| maps failed | hadoop.jt.maps.failed | long counter | |
| maps killed | hadoop.jt.maps.killed | long counter | |
| maps launched | hadoop.jt.maps.launched | long counter | |
| reduces completed | hadoop.jt.reduces.completed | long counter | |
| reduces failed | hadoop.jt.reduces.failed | long counter | |
| reduces killed | hadoop.jt.reduces.killed | long counter | |
| reduces launched | hadoop.jt.reduces.launched | long counter | |
| map slots | hadoop.jt.maps.slots | long gauge | |
| reduce slots | hadoop.jt.reduces.slots | long gauge | |
| waiting maps | hadoop.jt.waiting.maps | long gauge | |
| waiting reduces | hadoop.jt.waiting.reduces | long gauge | |
| running 0 | hadoop.jt.running.0 | long gauge | |
| running 60 | hadoop.jt.running.60 | long gauge | |
| running 300 | hadoop.jt.running.300 | long gauge | |
| running 1440 | hadoop.jt.running.1440 | long gauge | |
| live nodes | hadoop.nn.nodes.live | long gauge | |
| dead nodes | hadoop.nn.nodes.dead | long gauge | |
| decom nodes | hadoop.nn.nodes.decom | long gauge | |
| blocks total | hadoop.nn.blocks | long gauge | |
| corrupt blocks | hadoop.nn.blocks.corrupt | long gauge | |
| excess blocks | hadoop.nn.blocks.excess | long gauge | |
| missing blocks | hadoop.nn.blocks.missing | long gauge | |
| blocks pending deletion | hadoop.nn.blocks.pending.deletion | long gauge | |
| blocks pending replication | hadoop.nn.blocks.pending.replication | long gauge | |
| scheduled replication blocks | hadoop.nn.blocks.scheduled.replication | long gauge | |
| under replicated blocks | hadoop.nn.blocks.underreplicated | long gauge | |
| capacity remaining | hadoop.nn.capacity.remaining | long gauge | |
| capacity total | hadoop.nn.capacity | long gauge | |
| capacity used | hadoop.nn.capacity.used | long gauge | |
| total files | hadoop.nn.files | long gauge | |
| create file ops | hadoop.nn.files.ops.create | long counter | |
| get listing ops | hadoop.nn.files.ops.listing | long counter | |
| delete file ops | hadoop.nn.files.ops.delete | long counter | |
| file info ops | hadoop.nn.files.ops.info | long counter | |
| created files | hadoop.nn.files.created | long counter | |
| appended files | hadoop.nn.files.appended | long counter | |
| renamed files | hadoop.nn.files.renamed | long counter | |
| deleted files | hadoop.nn.files.deleted | long counter | |
| num allocated containers | hadoop.nm.containers.allocated | long gauge | |
| allocated GB | hadoop.nm.allocated.gb | long gauge | GB |
| available GB | hadoop.nm.available.gb | long gauge | GB |
| containers completed | hadoop.nm.containers.completed | long counter | |
| containers failed | hadoop.nm.containers.failed | long counter | |
| containers inited | hadoop.nm.containers.initiating | long gauge | |
| containers killed | hadoop.nm.containers.killed | long counter | |
| containers launched | hadoop.nm.containers.launched | long counter | |
| containers running | hadoop.nm.containers.running | long gauge | |
| shuffle connections | hadoop.nm.shuffle.connections | long counter | |
| shuffle output size | hadoop.nm.shuffle.output.bytes | long counter | bytes |
| shuffle outputs failed | hadoop.nm.shuffle.output.failed | long counter | |
| shuffle outputs ok | hadoop.nm.shuffle.output.ok | long counter | |
| active applications | hadoop.rm.apps.active | long gauge | |
| active users | hadoop.rm.users.active | long gauge | |
| agg containers allocated | hadoop.rm.agg.containers.alloc | long counter | |
| containers released | hadoop.rm.containers.released | long counter | |
| containers allocated | hadoop.rm.containers.alloc | long gauge | |
| allocated MB | hadoop.rm.memory.alloc.mb | long gauge | MB |
| applications completed | hadoop.rm.apps.completed | long counter | |
| applications failed | hadoop.rm.apps.failed | long counter | |
| applications killed | hadoop.rm.apps.killed | long counter | |
| applications pending | hadoop.rm.apps.pending | long gauge | |
| applications running | hadoop.rm.apps.running | long gauge | |
| applications submitted | hadoop.rm.apps.submitted | long counter | |
| available MB | hadoop.rm.memory.available.mb | long gauge | MB |
| containers pending | hadoop.rm.containers.pending | long gauge | |
| pending MB | hadoop.rm.memory.pending.mb | long gauge | MB |
| containers reserved | hadoop.rm.containers.reserved | long gauge | |
| reserved MB | hadoop.rm.memory.reserved.mb | long gauge | MB |
| running 0 | hadoop.rm.running.0 | long gauge | |
| running 60 | hadoop.rm.running.60 | long gauge | |
| running 300 | hadoop.rm.running.300 | long gauge | |
| running 1440 | hadoop.rm.running.1440 | long gauge | |
| active NMs | hadoop.rm.nm.active | long gauge | |
| decom NMs | hadoop.rm.nm.active.decom | long gauge | |
| lost NMs | hadoop.rm.nm.active.lost | long gauge | |
| rebooted NMs | hadoop.rm.nm.active.rebooted | long gauge | |
| unhealthy NMs | hadoop.rm.nm.active.unhealthy | long gauge | |
| map task slots | hadoop.tt.maps.slots | long gauge | |
| maps running | hadoop.tt.maps.running | long gauge | |
| reduce task slots | hadoop.tt.reduces.slots | long gauge | |
| reduces running | hadoop.tt.reduces.running | long gauge | |
| tasks completed | hadoop.tt.tasks.completed | long counter | |
| tasks failed ping | hadoop.tt.tasks.failed.ping | long counter | |
| tasks failed timeout | hadoop.tt.tasks.failed.timeout | long counter | |
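
As an illustration of how the metric keys listed above might be used, the following Python sketch groups a key by its component and derives two utilization percentages from gauge pairs. The mapping of key prefixes to component names is inferred from the key names, and the `sample` dictionary, its values, and all function names are hypothetical; nothing here is an SPM API.

```python
from typing import Dict, Optional

# Prefix-to-component mapping inferred from the metric keys (an assumption).
COMPONENTS = {
    "dn": "DataNode",
    "jt": "JobTracker",
    "nn": "NameNode",
    "nm": "NodeManager",
    "rm": "ResourceManager",
    "tt": "TaskTracker",
}


def component_of(key: str) -> str:
    """Return the component a key such as 'hadoop.nn.blocks' belongs to."""
    parts = key.split(".")
    if len(parts) < 3 or parts[0] != "hadoop" or parts[1] not in COMPONENTS:
        raise ValueError(f"unrecognized metric key: {key}")
    return COMPONENTS[parts[1]]


def percent(part: float, whole: float) -> Optional[float]:
    """Return part/whole as a percentage, or None when whole is 0."""
    if whole == 0:
        return None
    return 100.0 * part / whole


def hdfs_used_percent(sample: Dict[str, float]) -> Optional[float]:
    """Share of total HDFS capacity currently in use."""
    return percent(sample["hadoop.nn.capacity.used"],
                   sample["hadoop.nn.capacity"])


def map_slot_utilization(sample: Dict[str, float]) -> Optional[float]:
    """Share of JobTracker map slots that are occupied (MRv1 only)."""
    return percent(sample["hadoop.jt.slots.map.occupied"],
                   sample["hadoop.jt.slots.map"])


# Made-up readings, keyed by metric key.
sample = {
    "hadoop.nn.capacity": 1000,
    "hadoop.nn.capacity.used": 250,
    "hadoop.jt.slots.map": 40,
    "hadoop.jt.slots.map.occupied": 10,
}
print(component_of("hadoop.rm.apps.running"))
print(hdfs_used_percent(sample), map_slot_utilization(sample))
```

Because both ratios divide two gauges expressed in the same unit, they do not depend on the unit in which capacity is reported.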