Announcing Hadoop Monitoring in SPM

Take it from one of the most trusted names in the world of Hadoop and HBase, and one of the friendliest people you’ll encounter on the Hadoop conference circuit, Lars George from Cloudera:

Hadoop Club Monitoring

We’re happy to announce the immediate availability of SPM for Hadoop (see Sneak Peek: Hadoop Monitoring comes to SPM for some screenshots).  With the latest SPM release, Hadoop joins Apache Solr, Apache HBase, ElasticSearch, Sensei, and the JVM on the list of technologies you can monitor with SPM. With SPM for Hadoop you go from zero to seeing all key metrics for your Hadoop cluster in just a few minutes.  The reports include metrics for both HDFS and MapReduce – NameNode, JobTracker, TaskTracker, and DataNode metrics are all covered, along with all the default server metrics.  The YARN version of Hadoop is also supported, including metrics for the NodeManager, ResourceManager, etc.

Don’t forget that the SPM monitoring agent can run as an in-process agent, as well as in standalone mode (i.e., as an external process).  Running in standalone mode means you may not have to restart the various daemons of the existing Hadoop cluster you want to monitor (assuming you have already enabled JMX), so you can quickly get to your Hadoop metrics without interrupting anything!
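If JMX is not enabled yet, it is typically turned on via the standard JVM flags in hadoop-env.sh.  Here is an illustrative snippet for the NameNode – the port number is an arbitrary choice, and a real deployment should of course secure remote JMX access:

```shell
# hadoop-env.sh -- illustrative JMX flags for the NameNode daemon.
# Port 8004 is arbitrary; authentication/SSL are disabled here only
# for brevity and should be enabled in production.
export HADOOP_NAMENODE_OPTS="-Dcom.sun.management.jmxremote \
  -Dcom.sun.management.jmxremote.port=8004 \
  -Dcom.sun.management.jmxremote.authenticate=false \
  -Dcom.sun.management.jmxremote.ssl=false \
  $HADOOP_NAMENODE_OPTS"
```

The same pattern applies to the other daemons (DataNode, JobTracker, TaskTracker) via their respective *_OPTS variables.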

What else would you like us to monitor?  Please select your candidates!

Poll Results: Hadoop YARN vs. pre-YARN

Back in April 2013 there was a poll in the Hadoop Users LinkedIn group:

YARN or pre-YARN – which version of Hadoop are you using?

Because we were working on adding Hadoop monitoring to SPM, this was an important question for us – which version of Hadoop should SPM be able to monitor?

Here are the results of that poll:

Hadoop MRv1 vs. Hadoop YARN

As we can see, most Hadoop users are still on the old version of Hadoop and are not using YARN.  The percentage in the “YARN” bar at the top is partially hidden – it’s 13%; only 13% of the Hadoop users who responded are using Hadoop YARN.  Combined with the 17% who said they are moving to YARN, that’s 30% altogether – still only about half the number of Hadoop MRv1 users, but if we asked the same question in early 2014 we would likely see a close tie.

So which version of Hadoop are we supporting in SPM?  Both!  With SPM you can monitor both Hadoop MRv1 and Hadoop YARN.  And if you are using pre-YARN Hadoop today and want to switch to Hadoop YARN later, that’s not a problem for SPM.

Sneak Peek: Hadoop Monitoring comes to SPM

When it comes to Hadoop, they say you’ve got to monitor it and then monitor it some more.  Since our own Performance Monitoring and Search Analytics services run on top of Hadoop, we figured it was time to add Hadoop performance monitoring to SPM.  So here is a sneak peek at SPM for Hadoop.  If you’d like to try it on your Hadoop cluster, we’ll be sending invitations soon and you can get on the private beta list starting today!

In the meantime, here is a small sample of pretty self-explanatory reports from SPM for Hadoop, so you can get a sense of what’s available.  There are, of course, a number of other Hadoop-specific reports included, as well as server reports, filtering, alerting, multi-user support, report sharing, etc.

Please don’t forget to tell us what else you would like us to monitor – select your candidates – and if you like what you see and want a good monitoring tool for your Hadoop cluster, please sign up for the private beta now.

Click on any graph to see it in its full size and high quality.

Hadoop NameNode Files


Hadoop DataNode Read-Write


Hadoop JobTracker MapReduce Runtime


Hadoop TaskTracker Tasks


What else would you like us to monitor with SPM?  Please select your candidates!

For announcements, promotions, discounts, service status, milk, cookies, and other goodies follow @sematext.

Announcing HBase Refcard

We’re happy to announce the very first HBase Refcard proudly authored by two guys from Sematext.  We hope people will find the HBase Refcard useful in their work with HBase, along with the wonderful Apache HBase Reference Guide.  If you think the refcard is missing some important piece of information that deserves to be included or that it contains superfluous content, please do let us know! (e.g., via comments here)

Data Engineer Position at Sematext International

If you’ve always wanted to work with Hadoop, HBase, Flume, and friends and build massively scalable, high-throughput distributed systems (like our Search Analytics and SPM), we have a Data Engineer position that is all about that!  If you are interested, please send your resume to


  • Versatile architect and developer – design and build large, high-performance, scalable data processing systems using Hadoop, HBase, and other big data technologies
  • DevOps fan –  run and tune large data processing production clusters
  • Tool maker – develop ops and management tools 
  • Open source participant – keep up with development in areas of cloud and distributed computing, NoSQL, Big Data, Analytics, etc.


  • a solid Math, Statistics, Machine Learning, or Data Mining background is not required, but is a big plus
  • experience with Analytics, OLAP, Data Warehouse or related technologies is a big plus
  • ability and desire to expand and lead a data engineering team
  • ability to think both business and engineering
  • ability to build products and services based on observed client needs
  • ability to present in public, at meetups, conferences, etc.
  • ability to contribute to
  • active participation in open-source communities
  • desire to share knowledge and teach
  • positive attitude, humor, agility, high integrity, low ego, and attention to detail


  • New York

We’re small and growing.  Our HQ is in Brooklyn, but our team is spread over 4 continents.  If you follow this blog you know we have deep expertise in search and big data analytics and that our team members are conference speakers, book authors, Apache members, open-source contributors, etc.

Relevant pointers:

Hadoop 1.0.0 – Extra Notes

The big Hadoop 1.0.0 release has arrived.  The general notes about releases from the dev team include:

  • Security
  • Better support for HBase (append/hsynch/hflush, and security)
  • WebHDFS (with full support for security)
  • Performance-enhanced access to local files for HBase
  • Other performance enhancements, bug fixes, and features

You can also find the complete release notes here and see all the fixes, improvements, and new features included in the release. To save you time, below is additional information about some of the items from the Hadoop 1.0.0 release that caught our attention.

Cluster Management Optimizations: enable task memory manager
Adds additional options for managing memory usage by MR tasks. In particular, this allows setting the maximum memory usage for map and reduce tasks (separately).
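To make this concrete, here is a hedged mapred-site.xml fragment using the Hadoop 1.x per-job memory parameters – the 1024/2048 values are arbitrary examples, not recommendations:

```xml
<!-- Illustrative mapred-site.xml fragment; the values are arbitrary
     examples, not tuning recommendations -->
<property>
  <name>mapred.job.map.memory.mb</name>
  <value>1024</value> <!-- max memory per map task, in MB -->
</property>
<property>
  <name>mapred.job.reduce.memory.mb</name>
  <value>2048</value> <!-- max memory per reduce task, in MB -->
</property>
```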

Performance Improvements
This is a short-term solution for HDFS-347 “DFS read performance suboptimal when client co-located on nodes with data”, an issue that has been quite hot in the Hadoop dev community lately. NOTE: by default this optimization is switched off (or is it? Update: it is not, see the comments), so some config adjustments are required to benefit from it. And you will definitely want to benefit from it: some have reported 2x I/O performance improvements. It is also highly recommended for HBase users.

HDFS-895 Allow hflush/sync to occur in parallel with new writes to the file
Previously, if an hflush/sync was in progress, an application could not write data to the HDFS client buffer. Again, we stress this improvement for HBase users, as it increases the write throughput of the HBase transaction log.

MAPREDUCE-2494 Make the distributed cache delete entries using LRU priority
Previously, when a certain threshold was reached and the distributed cache was being purged, all entries not currently in use were deleted. With the new code, more hot data can be left in the cache (the percentage is configurable), thus decreasing cache misses.

New Features

HDFS-2316 [umbrella] WebHDFS: a complete FileSystem implementation for accessing HDFS over HTTP
Allows accessing HDFS over HTTP (read & write)

MAPREDUCE-3169 Create a new MiniMRCluster equivalent which only provides client APIs cross MR1 and MR2
A cleaner, MR1- and MR2-compatible API for a mini MR cluster to be used in unit tests.

HADOOP-7710 Create a script to setup application in order to create root directories for application such hbase, hcat, hive etc
Similar to the hadoop-setup-user script, a hadoop-setup-applications script was added to set up root directories for apps to write to (/hbase, /hive, etc.).

Enjoy Hadoop 1.0.0 and we hope you found this quick summary useful!


Search Analytics: Business Value & NoSQL Backend Presentation

Last week involved a few late nights for some of us at Sematext – we were busy readying our Search Analytics and Scalable Performance Monitoring services, as well as putting the final touches on our Search Analytics: Business Value & NoSQL Backend presentation for Lucene Eurocon in Barcelona.

In the past we’ve given a few other public talks about Search Analytics and you can check them all out via

Extending Hadoop Metrics

Here at Sematext we really like performance metrics and we like HBase.  We like them so much we’ve created a service for HBase Performance Monitoring (and for Solr, too).  In the process we’ve done some experiments with Hadoop and HBase around performance monitoring and are sharing our experience and some relevant code in this post.

The Hadoop metrics framework is simple to extend and customise. For example, you can very easily write a custom MetricsContext which sends metrics to your own storage solution.

All you need to do is extend the AbstractMetricsContext class and implement

protected void emitRecord(String context, String record, OutputRecord outputrecord)
  throws IOException;

To demonstrate, I wrote HBaseMetricsContext which stores Hadoop metrics in HBase. Since HBase itself uses the Hadoop metrics framework, you can use it to store its own metrics inside itself. Useful? Maybe. This is just an example after all.
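To give a feel for the shape of such a class without reproducing the GitHub code, here is a stripped-down sketch of a custom context that merely logs each record – the class name is made up, and a real implementation would write to its backend in emitRecord:

```java
import java.io.IOException;
import org.apache.hadoop.metrics.ContextFactory;
import org.apache.hadoop.metrics.spi.AbstractMetricsContext;
import org.apache.hadoop.metrics.spi.OutputRecord;

// Illustrative sketch only -- the real HBaseMetricsContext on GitHub
// writes records to HBase instead of logging them.
public class LoggingMetricsContext extends AbstractMetricsContext {

    @Override
    public void init(String contextName, ContextFactory factory) {
        super.init(contextName, factory);
        // Reads "<contextName>.period" (seconds) from hadoop-metrics.properties;
        // parseAndSetPeriod is provided by AbstractMetricsContext.
        parseAndSetPeriod("period");
    }

    @Override
    protected void emitRecord(String contextName, String recordName,
                              OutputRecord outRec) throws IOException {
        // A real implementation would persist the record here; we just
        // print the tag and metric names that make up the record.
        System.out.println(contextName + "." + recordName
            + " tags=" + outRec.getTagNames()
            + " metrics=" + outRec.getMetricNames());
    }
}
```

The framework calls emitRecord once per record at every period, so this is the single place where you plug in your storage backend.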

If you’d like to try it out, get the source from GitHub. Then build the project using:

mvn package

Put the resulting JAR file in the HBase lib directory.

You will need to create a table with the relevant column families. We assume the column families are a composite of:

columnFamily = contextName + "." + recordName

In the HBase shell create your table:

create 'metrics', 'hbase.master', 'hbase.regionserver'

Edit your hadoop-metrics.properties file to include:
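The exact entries depend on the class’s package name, but they follow the standard Hadoop metrics format – something along these lines (the package here is hypothetical; substitute the real class from the GitHub project):

```
# hadoop-metrics.properties -- hypothetical package name, substitute the real one;
# the period matches the 10-second insert interval mentioned below
hbase.class=com.example.metrics.HBaseMetricsContext
hbase.period=10
```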


Restart HBase and it will start inserting into the metrics table every 10 seconds.

The row key of each record is made up of the timestamp and the tags (for disambiguation) like so:

rowKey = bytes(maxlong - timestamp) + bytes(tagName) + bytes(tagValue) + …

Subtracting the timestamp from maxlong ensures the scans get the most recent record first.
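To see why this works, here is a small self-contained Java sketch (plain JDK, independent of the actual HBaseMetricsContext code) showing that keys built from maxlong - timestamp sort newest-first under lexicographic unsigned byte comparison, which is how HBase orders row keys:

```java
import java.nio.ByteBuffer;

public class ReverseKeyDemo {
    // Key prefix as described above: bytes(maxlong - timestamp), big-endian
    static byte[] keyPrefix(long timestamp) {
        return ByteBuffer.allocate(Long.BYTES)
                         .putLong(Long.MAX_VALUE - timestamp)
                         .array();
    }

    // Lexicographic comparison of unsigned bytes -- the ordering HBase
    // applies to row keys
    static int compareUnsigned(byte[] a, byte[] b) {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) return cmp;
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] older = keyPrefix(1000L);
        byte[] newer = keyPrefix(2000L);
        // The newer record's key compares as smaller, so a forward scan
        // returns it first
        System.out.println(compareUnsigned(newer, older) < 0); // prints "true"
    }
}
```

Because the key for the newer timestamp sorts first, a plain forward scan over the table naturally yields the most recent metrics at the top.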

Each tag and metric is stored in its own column. This gives us a table that looks something like this:

         | hbase.master               | hbase.regionserver
         | cluster_requests  hostName | hostName  flushQueueSize  regions
rowKey2  |                            |           0               1
rowKey1  | 101                        |

For clarity timestamps are not included in the above table, as each cell is timestamped. All cells for a record will have the same timestamp.

Wanted: Devops to run and

If you are dreaming about working on search, big data, analytics, data mining, and machine learning, and are a positive, proactive, independent devops creature, inquire within!

We are a small and highly distributed team who likes to eat a little bit of everything: search for breakfast, mapreduce for lunch, and bigtable for dinner.  We are looking for a part-time-to-grow-into-full-time devops to work on the popular and sites and take them to the next level. As such, you’ll need to be on top of Lucene, Solr, and Elastic Search.  Similarly, you must be completely at $HOME on the UNIX command line.  Working knowledge of Mahout or statistics/machine learning/data mining background would be a major plus, but is not required.  Experience with productive web frameworks and slick modern front-end frameworks is another plus, as is familiarity with EC2 and EBS.

More about the ideal you:

  • You are well organized, disciplined, and efficient
  • You don’t wait to be told what to do and don’t need hand-holding
  • You are reliable, friendly, have a positive attitude, and don’t act like a prima donna
  • You have an eye for detail – no sloppy code, no poor spelelling and typous
  • You are able to communicate complex ideas in a clear fashion in English (or pretty diagrams)
  • You have experience with (large scale) search or data analysis
  • You like to write about technologies relevant to what we do
  • You are an open-source software contributor

Not all of the above are required, of course – the closer the match, the higher the relevance score, that’s all.

Interested?  Please get in touch.

Google Summer of Code and Intern Sponsoring

Are you a student and looking to do some fun and rewarding coding this summer? Then join us for the 2011 Google Summer of Code!

The application deadline is in less than a month! Lucene has identified initial potential projects, but this doesn’t mean you can’t also pick your own.  If you need additional ideas, look at our Lucene / Solr for Academia: PhD Thesis Ideas (or just the spreadsheet if you don’t want to read the what and the why), just be sure to discuss your idea with the community first (send an email to

We should also add that, separately from GSoC, Sematext would be happy to sponsor good students and interns interested in working on projects involving search (Lucene, Solr), machine learning & analytics (Mahout), big data (Hadoop, HBase, Hive, Pig, Cassandra), and related areas. We are a virtual, geographically distributed organization whose members are spread over several countries and continents, and we welcome students from all across the globe.  For more information, please inquire within.