Correlating Metrics and Logs — Use Case: Elasticsearch Indexing

Here’s one way users can benefit from the integration of SPM Performance Monitoring, Alerting and Anomaly Detection with Logsene Log Management and Analytics, which we just announced in the latest release.

Problem – CPU Utilization hits 95%!

  • You get an alarm about a CPU usage jump to 95% (note: using classic threshold-based alerts for CPU usage is a little crazy.  SPM’s anomaly detection feature would be a much better thing to use for CPU usage metrics).
  • You wonder, naturally, why this is happening and investigate immediately.
  • Without access to log graphs, which you would have with an SPM and Logsene combination, you would not be able to tell right away that the indexing rate increased. It could be anything. So you would need to connect, via ssh or VPN, to the server (or servers) where the CPU jumped and look around to see which process has been using the most CPU. You’d run tools like top, vmstat, etc., but of course they’d have no historical data.
  • Even knowing which process uses the most CPU is not detailed enough. You need to start looking at logs, either in another vendor’s log management tool that does not work seamlessly with your monitoring tool, or by manually “grepping” through one or more potentially very large log files on one or more servers, and try to determine what this application is doing more of now than it did before. Not surprisingly, this is error-prone, time-consuming, and needlessly manual. Most people have better things to do and want better tools.
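To make the first bullet's aside about threshold alerts versus anomaly detection concrete, here is a minimal Python sketch. It is not SPM's anomaly detection algorithm, which this post does not describe. It simply compares a fixed 95% threshold with a trailing-window z-score check. The function names, the sample CPU values, the window size, and the z-score limit are all assumptions made up for illustration.

```python
from statistics import mean, stdev

def threshold_alerts(samples, limit=95.0):
    """Return indices of samples at or above a fixed limit (classic threshold alert)."""
    return [i for i, value in enumerate(samples) if value >= limit]

def anomaly_alerts(samples, window=5, z_limit=3.0, min_sigma=1.0):
    """Return indices of samples that deviate sharply from the trailing window.

    A sample is flagged when it lies more than z_limit standard deviations
    away from the mean of the previous `window` samples. min_sigma is a
    floor on the standard deviation so a perfectly flat baseline does not
    cause a division by zero. All defaults are illustrative assumptions.
    """
    if window < 2:
        raise ValueError("window must be at least 2 to compute a standard deviation")
    alerts = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        sigma = max(stdev(baseline), min_sigma)
        if abs(samples[i] - mu) / sigma > z_limit:
            alerts.append(i)
    return alerts

# Hypothetical CPU utilization samples (percent), one per minute.
cpu_samples = [40, 42, 41, 43, 40, 42, 75, 78, 77, 95]

print(threshold_alerts(cpu_samples))  # [9]
print(anomaly_alerts(cpu_samples))    # [6]
```

With this made-up data the fixed threshold stays silent until the last sample, while the baseline-relative check flags the jump from roughly 42% to 75% three samples earlier. A check like this only reports that something changed. Finding out why is where correlating the metric with logs comes in.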

Solution: Use SPM and Logsene Together to Triage

With a dashboard like the one you see here you can quickly tell what happened, i.e., why CPU usage went up. In this particular case it is because the Elasticsearch indexing rate increased. Now that the problem has been identified you can move on to taking action to fix it, if a fix is needed. Note: You can even access the actual logs via Logsene, so you can be sure that there is no increase in errors related to the higher CPU usage.

[Image: test_dashboard_SPM_Logsene]

We hope you found this use case helpful. Got other performance monitoring, centralized log management or search-related use case ideas you’d like to see? Drop us a line!
