Extending Hadoop Metrics

Here at Sematext we really like performance metrics and we like HBase.  We like them so much we’ve created a service for HBase Performance Monitoring (and for Solr, too).  In the process we’ve done some experiments with Hadoop and HBase around performance monitoring and are sharing our experience and some relevant code in this post.

The Hadoop metrics framework is simple to extend and customise. For example, you can very easily write a custom MetricsContext which sends metrics to your own storage solution.

All you need to do is extend the AbstractMetricsContext class and implement

protected void emitRecord(String context, String record, OutputRecord outputrecord)
  throws IOException;

To demonstrate, I wrote HBaseMetricsContext which stores Hadoop metrics in HBase. Since HBase itself uses the Hadoop metrics framework, you can use it to store its own metrics inside itself. Useful? Maybe. This is just an example after all.
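To make the shape of such an extension concrete, here is a minimal self-contained sketch. It uses stand-in types rather than the real `org.apache.hadoop.metrics.spi.AbstractMetricsContext` and `OutputRecord` classes (so it compiles without the Hadoop dependency), and it collects rows into a buffer instead of writing to HBase; only the `emitRecord` logic is the point.

```java
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

// Stand-ins for the Hadoop metrics types, so this sketch is self-contained.
// The real classes live in org.apache.hadoop.metrics.spi.
abstract class AbstractMetricsContext {
    protected abstract void emitRecord(String context, String record,
                                       OutputRecord outputRecord) throws IOException;
}

class OutputRecord {
    final Map<String, String> tags = new LinkedHashMap<>();
    final Map<String, Number> metrics = new LinkedHashMap<>();
}

// A custom context that collects the rows it would insert into a buffer.
class LoggingMetricsContext extends AbstractMetricsContext {
    final StringBuilder sink = new StringBuilder();

    @Override
    protected void emitRecord(String context, String record, OutputRecord r)
            throws IOException {
        // Column family is contextName + "." + recordName, as described below.
        String family = context + "." + record;
        sink.append(family);
        for (Map.Entry<String, String> tag : r.tags.entrySet())
            sink.append(' ').append(tag.getKey()).append('=').append(tag.getValue());
        for (Map.Entry<String, Number> m : r.metrics.entrySet())
            sink.append(' ').append(m.getKey()).append('=').append(m.getValue());
        sink.append('\n');
    }
}

public class MetricsContextSketch {
    public static void main(String[] args) throws IOException {
        OutputRecord rec = new OutputRecord();
        rec.tags.put("hostName", "master.example.org");
        rec.metrics.put("cluster_requests", 101);

        LoggingMetricsContext ctx = new LoggingMetricsContext();
        ctx.emitRecord("hbase", "master", rec);
        System.out.print(ctx.sink);
        // prints: hbase.master hostName=master.example.org cluster_requests=101
    }
}
```

A real implementation would turn each `OutputRecord` into an HBase `Put` inside `emitRecord` instead of appending to a buffer.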

If you’d like to try it out, get the source from GitHub. Then build the project using:

mvn package

Put the resulting JAR file in the HBase lib directory.

You will need to create a table with the relevant column families. The column family names are a composite of the context name and the record name:

columnFamily = contextName + "." + recordName

In the HBase shell create your table:

create 'metrics', 'hbase.master', 'hbase.regionserver'

Edit your hadoop-metrics.properties file to include:
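A configuration along these lines should work; note that the class and package names below are assumptions for illustration, so check the project's README for the real ones. Only the `<context>.class` and `<context>.period` property conventions come from the Hadoop metrics framework itself.

```properties
# Sketch only: the class/package names are assumptions -- see the project README.
# Route the hbase context through the custom MetricsContext implementation.
hbase.class=com.sematext.metrics.HBaseMetricsContext
# Emit metrics every 10 seconds.
hbase.period=10
```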


Restart HBase and it will start inserting into the metrics table every 10 seconds.

The row key of each record is made up of the timestamp and the tags (for disambiguation) like so:

rowKey = bytes(maxlong - timestamp) + bytes(tagName) + bytes(tagValue) + …

Subtracting the timestamp from maxlong ensures that scans return the most recent records first.
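Sketched in self-contained Java (the real code would presumably use HBase's `Bytes` utility; the method names here are illustrative), the key scheme and its ordering property look like this:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Sketch of the row-key scheme: bytes(Long.MAX_VALUE - timestamp) followed by
// the tag names and values.
public class RowKeySketch {
    static byte[] rowKey(long timestamp, String... tagsAndValues) {
        int size = 8;
        for (String s : tagsAndValues)
            size += s.getBytes(StandardCharsets.UTF_8).length;
        ByteBuffer buf = ByteBuffer.allocate(size);
        buf.putLong(Long.MAX_VALUE - timestamp); // big-endian, so newer sorts first
        for (String s : tagsAndValues)
            buf.put(s.getBytes(StandardCharsets.UTF_8));
        return buf.array();
    }

    // HBase compares row keys lexicographically as unsigned bytes.
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) return d;
        }
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] newer = rowKey(2000L, "hostName", "rs1.example.org");
        byte[] older = rowKey(1000L, "hostName", "rs1.example.org");
        // The newer record sorts before the older one, so a scan from the start
        // of the table sees the most recent data first.
        System.out.println(compareUnsigned(newer, older) < 0); // prints true
    }
}
```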

Each tag and metric is stored in its own column. This gives us a table that looks something like this:

         | hbase.master                          | hbase.regionserver
         | cluster_requests  hostName            | hostName         flushQueueSize  regions
 rowKey2 |                                       | rs1.example.org  0               1
 rowKey1 | 101               master.example.org  |

For clarity, cell timestamps are not shown in the table above; every cell is timestamped, and all cells for a record share the same timestamp.

Hive Digest, March 2011

Welcome to the first Hive digest!

Hive is a data warehouse built on Hadoop. Initially developed by Facebook, it has been under the Apache umbrella for about two years and has seen very active development. Last year there were two major releases that introduced loads of features and bug fixes, and now Hive 0.7.0 has just been released, packed with goodness.

Hive 0.6.0

Hive 0.6.0 was released October last year. Some of its most interesting features included

Hive 0.7.0

Hive 0.7.0 has just been released! Some of the major features include:

  • Indexing has been implemented; index types are currently limited to compact indexes. This feature opens up lots of potential for future improvements, such as HIVE-1694, which aims to use indexes to accelerate query execution for GROUP BY, ORDER BY, JOIN, and other miscellaneous cases, and HIVE-1803, which will implement bitmap indexing.
  • Security features have been added with authorisation and authentication.
  • There is now an optional concurrency model which makes use of ZooKeeper, so tables can be locked during writes. It is disabled by default, but can be enabled by setting hive.support.concurrency=true in the config.
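Enabling the locking support in hive-site.xml looks something like the following; `hive.support.concurrency` and `hive.zookeeper.quorum` are real properties, but the quorum hostnames here are placeholders for your own ZooKeeper ensemble.

```xml
<!-- hive-site.xml: enable the optional ZooKeeper-backed locking -->
<property>
  <name>hive.support.concurrency</name>
  <value>true</value>
</property>
<property>
  <name>hive.zookeeper.quorum</name>
  <!-- placeholder hostnames: point this at your ZooKeeper ensemble -->
  <value>zk1.example.org,zk2.example.org,zk3.example.org</value>
</property>
```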

And many other small improvements including:

  • Databases are now more useful: you can query tables across databases.
  • The Hive command line interface has gotten some love and now supports auto-complete.
  • There’s now support for HAVING clauses, so users no longer have to do nested queries in order to apply a filter on group by expressions.

and much more.
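As an illustration of the HAVING support (the table and column names below are made up), a filter on a grouped result can now be written directly rather than wrapped in a nested query:

```sql
-- Before 0.7.0 this required nesting the GROUP BY in a subquery
-- and filtering the outer SELECT; now it is a single statement.
SELECT hostName, COUNT(*) AS requests
FROM access_log
GROUP BY hostName
HAVING COUNT(*) > 100;
```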

You can download Hive 0.7.0 from here and you can follow @sematext on Twitter.