
Poll: What do you use for ElasticSearch performance monitoring?

Posted by sematext

The results of this poll will be included in the “Large Scale ElasticSearch, Solr & HBase Performance Monitoring” presentation at Berlin Buzzwords next week.  Please vote and share this post to help us make this poll statistically significant!

15 thoughts on “Poll: What do you use for ElasticSearch performance monitoring?”

    1. @Andrej – Thanks for the info. You may want to have a look at SPM for ElasticSearch; it sounds like it may be simpler to use, and need less maintenance, than bash scripts. I see you are in .de – if you are going to Berlin Buzzwords, stop by Sematext’s booth to see SPM for ElasticSearch in live action.

    1. @Radu – interesting. I haven’t heard of it before. But note that, like Nagios, it monitors whether servers/services are up or down, but not how they performed, say, over the last 15 minutes. So it’s not really a performance monitoring tool, from what I can tell.

      1. @sematext: Yeah, well that’s the obvious part of it, but you can use NRPE[1] to run Nagios plugins on remote systems. A Nagios plugin[2] is basically any sort of application that returns the output and exit codes Nagios/Shinken can interpret (doesn’t need to be run via NRPE, it can run on the Shinken host as well, if it doesn’t need access to the remote machine). And the output can include one or more performance data values.

        For example, we use Elasticsearch for storing logs, and we have a check that fires off a log, then returns the time it takes that log to be returned in searches. If the time passes certain thresholds, we trigger warning or critical alerts. It’s all work in progress for us, but that’s how we started.
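A minimal sketch of what such a check could look like, assuming the standard Nagios plugin conventions (exit codes 0/1/2/3 and performance data after the `|`). The names `evaluate`, `measure`, `send_log` and `is_searchable` are hypothetical; the last two stand in for whatever code ships the log and queries Elasticsearch, which is not shown here.

```python
import time

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(latency_s, warn_s, crit_s):
    """Map a measured latency to a (exit code, plugin output line) pair."""
    if latency_s is None:
        return UNKNOWN, "UNKNOWN - log never became searchable"
    if latency_s >= crit_s:
        state = CRITICAL
    elif latency_s >= warn_s:
        state = WARNING
    else:
        state = OK
    # Text before '|' is for humans; text after it is performance data
    # in the form label=value[UOM];warn;crit
    line = "%s - log searchable after %.2fs | latency=%.2fs;%.2f;%.2f" % (
        STATE_NAMES[state], latency_s, latency_s, warn_s, crit_s)
    return state, line


def measure(send_log, is_searchable, timeout_s=30.0, poll_s=0.5):
    """Send a log, poll until it shows up in searches, and return the
    elapsed seconds (or None if the timeout is reached first).

    send_log and is_searchable are hypothetical callables supplied by the
    caller; they are assumptions made for this sketch.
    """
    start = time.monotonic()
    send_log()
    while time.monotonic() - start < timeout_s:
        if is_searchable():
            return time.monotonic() - start
        time.sleep(poll_s)
    return None
```

A real plugin would print the returned line and exit with the returned code, so that Nagios/Shinken can raise the warning or critical alert and pnp4nagios can graph the `latency` value.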

        We also use that performance data to build graphs using pnp4nagios. With “special templates”[3], you can do all sorts of stuff to make your graphs more meaningful, for instance aggregating the performance data from all the ES nodes.

        Also, when a service changes its state, you can use “event handlers”[4] to react to those state changes. In our case it might be dropping some logs or throttling them when ES is too loaded. This becomes even more interesting when you start defining custom services, called “business rules”[5], where you can combine the states of individual services. For example, when load on ES becomes CRITICAL on at least 2 out of 8 nodes, increase the auto-refresh interval.
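The logic behind a rule like “CRITICAL on at least 2 out of 8 nodes” can be sketched in a few lines. This only illustrates the counting; it is not Shinken’s business rule syntax, and `rule_triggered` is a hypothetical name.

```python
# Nagios/Shinken state code for CRITICAL.
CRITICAL = 2


def rule_triggered(node_states, state=CRITICAL, min_count=2):
    """Return True when at least min_count nodes are in the given state."""
    return sum(1 for s in node_states if s == state) >= min_count


# Eight ES nodes, two of them CRITICAL: the rule fires.
eight_nodes = [0, 0, 2, 0, 1, 2, 0, 0]
fires = rule_triggered(eight_nodes)
```

An event handler attached to such a rule would then do the actual work, such as increasing the refresh interval.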

        The problem with all this is that the tools are quite generic. You need to write your own plugins, event handlers and PHP templates for graphs, and you need to define services and business rules. It’s a bit of a pain.

        Useful links:

        P.S. We also use BigDesk, I voted for it. It goes without saying – it’s awesome 😀

        1. @Radu – thanks for sharing all that, very informative! Sounds powerful, but I agree it sounds rather involved and has a lot of small moving pieces.

          1. @sematext [sorry for the late reply] Well, we plan to change a bit of that by open-sourcing the bits that would be re-usable. For example, a plugin that measures inserts per second, coupled with a PHP template if appropriate.

          2. @Radu – Shinken now provides monitoring packs[1] that put together the templates, the commands and the plugin. This makes it easier to share and improve the monitoring logic. They are also directly linked with the configuration system and the wiki to integrate all the pieces for users. This functionality is being released as part of Shinken 1.2, but it is available in the Shinken git today.

            Note also that Shinken is moving full speed ahead on Graphite integration. Shinken can collect performance data included as part of the monitoring check output. A typical plugin will return the state of a service but can also include performance data. This data is exported via Shinken to Graphite. There are also plugins[2] [3] that will run checks against data in Graphite itself to use its statistical functions.
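To make the data flow concrete, here is a rough sketch of turning a plugin’s performance data into lines in Graphite’s plaintext format (`path value timestamp`). It is not how Shinken does the export internally; `perfdata_to_graphite` and the `es.node1` prefix are made up for illustration, and quoted labels containing spaces are not handled.

```python
import re

# Leading numeric part of a perfdata value, e.g. "0.50" in "0.50s".
_VALUE = re.compile(r"^(-?\d+(?:\.\d+)?)")


def perfdata_to_graphite(output, prefix, timestamp):
    """Convert the perfdata part of a plugin output line into Graphite
    plaintext lines of the form "<path> <value> <timestamp>"."""
    if "|" not in output:
        return []
    perfdata = output.split("|", 1)[1]
    lines = []
    for token in perfdata.split():
        label, sep, rest = token.partition("=")
        if not sep:
            continue
        # Keep only the value; drop the unit and the warn/crit thresholds.
        match = _VALUE.match(rest.split(";", 1)[0])
        if match is None:
            continue
        lines.append("%s.%s %s %d" % (prefix, label, match.group(1), timestamp))
    return lines


sample = "OK - log searchable after 0.50s | latency=0.50s;2.00;5.00 docs=120"
metrics = perfdata_to_graphite(sample, "es.node1", 1338000000)
```

Each resulting line can be sent, newline-terminated, to a Graphite carbon listener.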


            Typical Nagios/Shinken installations have lots of small moving pieces, though Shinken does aim at making this easier.

  1. We use a performance monitoring tool, open-sourced by Silicon Graphics (SGI), that captures hardware, OS and application metrics, and now has an ES integration (which we wrote).

    It is an amazing tool, particularly for retrospective analysis, since we can dig deep into performance archive logs, replay theories, and set up complex rules for triggering.
