Our Elasticsearch monitoring integration just got a whole lot better! You get new metrics, better dashboards, and more default alerts out of the box. Stemming from many years of consulting and production support experience, our new Elasticsearch monitoring makes troubleshooting clusters a lot faster and simpler. Let me explain why.
First off, the dashboards: the new Overview screen lets you spot health issues (e.g. unassigned shards, thread pool rejections) as well as the #1 performance killer: load imbalance. To pinpoint the source of that load, there's an index breakdown for both reads and writes:
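If you'd like to see the raw signals behind these charts, Elasticsearch exposes them through its own APIs. Here's a minimal sketch (assuming an unauthenticated cluster at localhost:9200 and Python's requests library) that pulls unassigned shards, thread pool rejections, and per-index read/write totals; it only illustrates the underlying data, not how our integration collects it.

```python
import requests

ES = "http://localhost:9200"  # assumed: a local, unauthenticated cluster

# Unassigned shards come straight from the cluster health API
health = requests.get(f"{ES}/_cluster/health").json()
print("unassigned shards:", health["unassigned_shards"])

# Thread pool rejections are reported per node, per pool
for node in requests.get(f"{ES}/_nodes/stats/thread_pool").json()["nodes"].values():
    for pool, stats in node["thread_pool"].items():
        if stats["rejected"] > 0:
            print(f"{node['name']}: {pool} pool rejected {stats['rejected']} tasks")

# Per-index read/write totals show which indices drive the load
indices = requests.get(f"{ES}/_stats/search,indexing").json()["indices"]
busiest = sorted(indices.items(),
                 key=lambda kv: kv[1]["total"]["indexing"]["index_total"],
                 reverse=True)[:5]
for name, stats in busiest:
    total = stats["total"]
    print(name, "writes:", total["indexing"]["index_total"],
          "reads:", total["search"]["query_total"])
```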
The old Overview dashboard was super useful, so we kept it and renamed it Essential Metrics. There's also a new dashboard called Daily Patterns, which shows you which days of the week and which hours get the most traffic. Here's an example showing that, in this cluster, queries are slower on Monday mornings.
Notice that TIP markdown on top? We've added those wherever the charts aren't self-explanatory, so you know what to look for and, if there are issues, you get hints on what to do. Check the tip on refresh time below. Also, notice the breakdowns by index/node on the right, which let you look closer at the source of load. As with the tips, we've added breakdowns like this in many places.
Other new dashboards, such as Ingest and Scripting, are built from new metrics: there are 52 new metrics in total. You can now identify the most expensive ingest pipelines, see how many times circuit breakers have tripped, track script compilations, and more. Existing metric categories are enhanced, too: cache hit ratios, disk IOPS, merge and recovery throttling, and so on.
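As a quick illustration of where these numbers come from, the sketch below (again assuming requests and an unauthenticated cluster at localhost:9200) reads ingest pipeline costs, circuit breaker trips, and script compilations straight from the node stats API. The metric names are Elasticsearch's own; the script itself is just an example, not part of our agent.

```python
import requests

ES = "http://localhost:9200"  # assumed: a local, unauthenticated cluster

nodes = requests.get(f"{ES}/_nodes/stats/ingest,breaker,script").json()["nodes"]

for node in nodes.values():
    name = node["name"]

    # Most expensive ingest pipelines on this node, by cumulative time spent
    pipelines = node.get("ingest", {}).get("pipelines", {})
    for pipeline_id, p in sorted(pipelines.items(),
                                 key=lambda kv: kv[1]["time_in_millis"],
                                 reverse=True)[:3]:
        print(f"{name}: pipeline {pipeline_id} spent {p['time_in_millis']} ms "
              f"on {p['count']} docs ({p['failed']} failed)")

    # Circuit breaker trips: anything above zero deserves a look
    for breaker, b in node["breakers"].items():
        if b["tripped"] > 0:
            print(f"{name}: {breaker} breaker tripped {b['tripped']} times")

    # Frequent script compilations usually mean values are hardcoded in
    # scripts instead of being passed as params
    script = node["script"]
    print(f"{name}: {script['compilations']} script compilations, "
          f"{script['cache_evictions']} cache evictions")
```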
There's a new dimension for filtering and grouping by node role. Dedicated masters and data nodes have different load patterns, so you'll often want to select just the data nodes to see their aggregate load (and maybe group by host later). In a large cluster, selecting all data nodes used to be tricky, but now it's just one click!
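Under the hood, node roles come from Elasticsearch itself. For example, here's one way to see which nodes are data nodes vs. master-eligible via the cat nodes API (same localhost:9200, no-auth assumption as above):

```python
import requests

ES = "http://localhost:9200"  # assumed: a local, unauthenticated cluster

# node.role is a compact string, e.g. "dim" = data + ingest + master-eligible
nodes = requests.get(f"{ES}/_cat/nodes",
                     params={"h": "name,node.role,load_1m,heap.percent",
                             "format": "json"}).json()

data_nodes = [n for n in nodes if "d" in n["node.role"]]
masters = [n for n in nodes if "m" in n["node.role"]]

print("data nodes:", [n["name"] for n in data_nodes])
print("master-eligible nodes:", [n["name"] for n in masters])
```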
Last but not least: default alerts. Previously there were 4 such alerts, covering some clear red flags: heartbeats, heap and disk usage approaching 100%, and an anomalous number of nodes. We've added 8 more, for example on an anomalous number of unassigned shards or on load that is much higher than the number of processors. This way, you'll know sooner when the cluster has performance issues or is behaving abnormally.
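To give you an idea of what checks like these look at, here's a rough sketch of the load-vs-processors, heap, and disk conditions using the node info and node stats APIs. The 2x load factor and 90% thresholds are purely illustrative, not the actual alert conditions, and an unauthenticated cluster at localhost:9200 is assumed.

```python
import requests

ES = "http://localhost:9200"  # assumed: a local, unauthenticated cluster

info = requests.get(f"{ES}/_nodes/os").json()["nodes"]               # node info: processor counts
stats = requests.get(f"{ES}/_nodes/stats/os,jvm,fs").json()["nodes"]  # node stats: load, heap, disk

for node_id, node in stats.items():
    name = node["name"]
    processors = info[node_id]["os"]["allocated_processors"]
    load_1m = node["os"]["cpu"].get("load_average", {}).get("1m")

    # Load far above the processor count means the node can't keep up;
    # the 2x factor here is purely illustrative
    if load_1m is not None and load_1m > 2 * processors:
        print(f"{name}: load {load_1m} vs {processors} processors")

    # Heap close to 100% usually precedes long GC pauses or OutOfMemory errors
    heap_pct = node["jvm"]["mem"]["heap_used_percent"]
    if heap_pct > 90:
        print(f"{name}: heap at {heap_pct}%")

    # Disk close to full will hit the shard allocation watermarks
    fs = node["fs"]["total"]
    disk_pct = 100 * (1 - fs["available_in_bytes"] / fs["total_in_bytes"])
    if disk_pct > 90:
        print(f"{name}: disk {disk_pct:.0f}% used")
```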
As you can tell, we're quite excited about these improvements! We hope you'll find them useful, too. Either way, feel free to reach out with any feedback or questions.