Monitoring rsyslog with Kibana and SPM

A while ago we published this post where we explained how you can get stats about rsyslog, such as the number of messages enqueued, the number of output errors and so on. The point was to send them to Elasticsearch (or Logsene, our logging SaaS, which exposes the Elasticsearch API) in order to analyze them.

This is part 2 of that story, where we share how we process these stats in production. We’ll cover:

  • an updated config, working with Elasticsearch 2.x
  • what Kibana dashboards we have in Logsene to get an overview of what rsyslog is doing
  • how we send some of these metrics to SPM as well, in order to set up alerts on their values: both threshold-based alerts and anomaly detection


5-Minute Recipe: Log Alerting and Anomaly Detection

Until software becomes so sophisticated that it is truly self-healing without human intervention, it will remain important that we humans be notified of any problems with the computing systems we run. This is especially true for large or distributed systems, where it quickly becomes impossible to watch logs manually. A common practice is to watch performance metrics instead, centralize logs, and dig into logs only when performance problems are detected. If you use SPM Performance Monitoring already, you are used to defining alerts on critical metrics, and if you are a Logsene user you can now use alerting on logs, too! Here is how:

  1. Run your query in Logsene to search for relevant logs and press the “Save” button (see screenshot below)
  2. Mark the checkbox “Create Alert Query” and pick whether you want threshold-based or anomaly detection-based alerting:
Threshold-based alert in Logsene
Anomaly Detection using “Algolerts” in Logsene
Manage Alert Queries in Logsene

While the alert creation dialog currently shows only email as a possible destination for alert notifications, you can actually have alert notifications sent to one or more other destinations.  To configure that, go to “App Settings” as shown below:

Screenshot – Go to App Settings in Logsene

Once there, under “Notification Transport” you will see all available alert destinations:

Screenshot – Logsene Application Settings

In addition to email, PagerDuty, and Nagios, you can have alert notifications go to any WebHook you configure, including Slack and HipChat.

How does one decide between Threshold-based and Anomaly Detection-based Alerts (aka Algolerts)?

The quick answers:

  • If you have a clear idea about how many logs should be matching a given Alert Query, then simply use threshold-based Alerts.
  • If you do not have a sense of how many logs a given Alert Query matches on a regular basis, but you want to watch out for sudden changes in volume, whether dips or spikes, use Algolerts (Anomaly Detection-based Alerts).
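
To make the difference concrete, here is a minimal Python sketch of the two kinds of checks applied to the number of logs an Alert Query matches per interval. It is only an illustration: the function names, the `tolerance` ratio, and the sample counts are our own assumptions, not how Logsene evaluates alerts internally.

```python
from statistics import median

def threshold_alert(match_count, max_matches):
    """Fire when the number of matching logs exceeds a fixed, known limit."""
    return match_count > max_matches

def anomaly_alert(history, match_count, tolerance=3.0):
    """Fire when the count is far above or below the usual recent volume.

    "Usual" is the median of recent counts. tolerance is a hypothetical
    ratio: 3.0 means "3x more or 3x fewer matches than usual".
    """
    typical = median(history)
    if typical == 0:
        return match_count > 0
    ratio = match_count / typical
    return ratio > tolerance or ratio < 1 / tolerance

# Matches per interval for a hypothetical alert query
recent_counts = [40, 52, 47, 38, 45, 50]

print(threshold_alert(200, max_matches=100))  # True: above the fixed limit
print(anomaly_alert(recent_counts, 200))      # True: spike versus usual volume
print(anomaly_alert(recent_counts, 5))        # True: dip versus usual volume
print(anomaly_alert(recent_counts, 48))       # False: ordinary volume
```

The threshold check needs you to know the limit up front, while the anomaly check only needs some history, which is why it also catches dips without a second rule.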

For more detailed explanations of Logsene alerts, see the FAQ on our Wiki.

3-Part Blog Series about Log Queries

Speaking of log queries…this post is part of our 3-part blog series to detail the different types of Queries that Logsene lets you create.  Check out the other posts about Saved Queries and Scheduled Queries.

Keep an eye on anomalies or other patterns in your logs

…by checking out Logsene. Simply sign up here – there’s no commitment and no credit card required.  Small startups, startups with no or very little outside investment money, non-profit and educational institutions get special pricing – just get in touch with us.  If you’d like to help us make SPM and Logsene even better, we are hiring.

Integrate PagerDuty with SPM Performance Monitoring

Got Alarm Fatigue?

If so, you are not alone!  We talk to a lot of people who want to reduce the frequent “noise” from monitoring alarms.  To solve this common problem, Sematext added anomaly detection for alerts and PagerDuty integration to its SPM Performance Monitoring solution to dramatically reduce this noise compared with simple threshold-based alerting mechanisms.  The integration with PagerDuty helps DevOps with incident management, i.e., managing escalation and routing alerts to the right person by defined schedules and communication channels.

“PagerDuty is an alarm aggregation and dispatching service for system administrators and support teams. It collects alerts from your monitoring tools, gives you an overall view of all of your monitoring alarms, and alerts an on-duty engineer if there’s a problem. PagerDuty allows you to build sophisticated alerting rules to determine who to contact when problems occur. You can build on-call schedules to equitably share on-call responsibilities. You can also set up multiple levels of coverage, so if the “primary” on-call person doesn’t respond to an alert in a timely fashion, it’s automatically escalated to a “secondary” person, and so on.” – Source: PagerDuty FAQ.
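
The multi-level coverage described in that quote can be sketched in a few lines of Python. This is purely an illustration of the escalation idea and does not use the PagerDuty API; the `EscalationLevel` class, the contact names, and the timeouts are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class EscalationLevel:
    contact: str
    timeout_minutes: int  # how long this level has to acknowledge the alert

def current_contact(levels, minutes_unacknowledged):
    """Return who is responsible for an alert that nobody has acknowledged yet."""
    elapsed = minutes_unacknowledged
    for level in levels:
        if elapsed < level.timeout_minutes:
            return level.contact
        elapsed -= level.timeout_minutes  # this level timed out, escalate
    return levels[-1].contact  # the last level keeps the alert

# Hypothetical policy: primary first, then secondary, then a manager
policy = [
    EscalationLevel("primary", 15),
    EscalationLevel("secondary", 15),
    EscalationLevel("manager", 30),
]

for minutes in (5, 20, 45):
    print(minutes, current_contact(policy, minutes))
```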

SPM Performance Monitoring is an enterprise-class, server and application performance monitoring, alerting, and anomaly detection solution. It is available both in the cloud (SaaS) and On Premises.  SPM also integrates with Logsene Log Management and Analytics to correlate metrics, alerts, anomalies, and events with application and server logs.

Get started

Two basic setup steps are required to hook up the two services:

  1. In PagerDuty: Get an API Key
  2. In SPM: Enter the API Key in SPM alert settings

1) In PagerDuty:

Create a new service:

  1. In your account, under the Services tab, click “Add New Service”.
  2. Select an Escalation Policy (e.g. default)
  3. Start typing “Sematext” for the Integration Type, which will narrow your filtering.
  4. Click the Add Service button
  5. Once the service is created, you’ll be taken to the Service page. On this page, you’ll see the “Service API Key”, which you will need when you configure Sematext products to send events to PagerDuty. Copy the “Service API Key” to the clipboard.

2) In SPM

1) Navigate to SPM Application Settings of your SPM App by clicking the App Settings button in the top right when you’re in the SPM UI.

 SPM - App Settings

2) Navigate to Alerts / PagerDuty

SPM - Service API Key for PagerDuty

3) Enter the API key from PagerDuty in the field Service API key

4) Press the Save button

Done. Every alert from your SPM app will be forwarded to PagerDuty, where you can manage escalation policies and configure notifications to other services like HipChat, Slack, Zapier, Flowdock, and more.

If you’ve got some feedback on this post or ideas for similar posts please let us know!

Integrating SPM Performance Monitoring with HipChat

Many agile DevOps teams rely on communication via HipChat, which provides an API and mobile apps to receive messages while being away from one’s desktop. SPM Performance Monitoring’s new integration via WebHooks provides the capability to forward alerts to many services, including HipChat.

The integration of both services can be achieved by collecting the room_id and an access token from HipChat and then building a WebHook in SPM.  The SPM Wiki explains how to get this information from HipChat and build the WebHook in SPM: Alerts – HipChat integration
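
As a rough sketch of what such a WebHook boils down to, the Python snippet below assembles a target URL and JSON body from a room_id and an access token. We are assuming the HipChat API v2 room notification endpoint and its `message`, `message_format`, `color` and `notify` fields here; the helper name and sample values are hypothetical, and the SPM Wiki page remains the reference for the exact settings to enter in SPM. Nothing is sent over the network.

```python
import json
from urllib.parse import quote, urlencode

def build_hipchat_webhook(room_id, access_token, message):
    """Build the URL and JSON body for a HipChat room notification.

    Assumes the HipChat API v2 endpoint shape; verify against the
    HipChat documentation before relying on it.
    """
    url = "https://api.hipchat.com/v2/room/{}/notification?{}".format(
        quote(str(room_id), safe=""),
        urlencode({"auth_token": access_token}),
    )
    body = json.dumps({
        "message": message,
        "message_format": "text",
        "color": "red",
        "notify": True,
    })
    return url, body

# Hypothetical room_id and token, for illustration only
url, body = build_hipchat_webhook(12345, "MY_TOKEN", "SPM alert: CPU usage is high")
print(url)
print(body)
```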


This whole process only takes a minute or two.  HipChat is a tool that is becoming more popular among the DevOps crowd, and here at Sematext we pride ourselves on staying on top of what our users need and expect.

Need some extra help with this setup or another app you might want to integrate?  Have ideas for other integrations we should explore? Please drop us a line, we’re here to help and listen.

Announcement: New Functionality in SPM and Logsene

Summer is all but officially over, yet our work with SPM Performance Monitoring, Alerting and Anomaly Detection and Logsene Log Management and Analytics is not.  While lots of us took a well-deserved break over the last 1-2 months, we added a few goodies to both SPM and Logsene.  More interesting stuff is coming in the next release.

New in SPM

With SPM, the most notable addition is monitoring for Apache Spark.  We’ll have a separate post about Spark monitoring with SPM next week with all the details, including screenshots.  But that’s not the only new goodness; other additions include:

Integration with Nagios

  • You can now tell SPM where your Nagios lives and SPM will push all your Alerts to Nagios.  If you use PagerDuty, SPM can push your Alerts there, too.

Lowered SPM agent overhead

  • Those sending large volumes of metrics will see the most benefit.  The new agent makes use of Apache Flume to transport metrics.

Switched to sending metrics over HTTPS by default

These additions to SPM, along with recently announced monitoring support for NGINX Plus and NGINX, make it an even more effective solution for organizations that are paying the unfortunate price of having a mish-mash of monitoring and alerting tools bolted together in an uneasy coexistence.

If you haven’t seen SPM yet, we have a live SPM demo so you can see it for yourself.  The demo shows Hadoop, HBase, Kafka, Elasticsearch, Solr, MySQL, Redis, and other types of apps being monitored.

New in Logsene

Until now you could create an unlimited number of Dashboards with SPM graphs, and now you can do that with Logsene graphs, too.  Moreover, you can place Logsene log graphs alongside SPM’s performance graphs, on the same Dashboard, and correlate your performance with your application logs!

This makes the integration of performance metrics, logs, events and anomalies more robust for those of you looking to combine performance monitoring and centralized log management in one place — not only knowing that SOMETHING happened when you look at your performance metrics graphs, but also exactly WHAT happened by having immediate access to relevant logs right there!

Screenshot – Dashboard with SPM Performance Graphs & Logsene Log Graphs

Take a Test Drive — It’s Easy and Free to Get Started

Like what you see here?  Sound like something that could benefit your organization?  Then try SPM or Logsene for Free for 30 days by registering here.  There’s no commitment and no credit card required.

Announcement: Percentiles added to SPM

In the spirit of continuous improvement, we are happy to announce that percentiles have recently been added to SPM’s arsenal of measurement tools.  Percentiles describe the distribution of a metric more faithfully than averages, which a handful of outliers can skew.  Users can see the 50th, 95th and 99th percentiles for specific metrics and set both regular threshold-based and anomaly detection alerts.  We will go into the details of how the percentiles are computed in another post, but for now we want to put the word out and show some of the related graphs.  Enjoy!
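
To see why percentiles tell you more than an average, consider this small Python sketch. It uses the simple nearest-rank definition of a percentile, which is our assumption for illustration and not necessarily how SPM computes them; the latency samples are made up.

```python
import math
from statistics import mean

def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

# Made-up request latencies in ms: mostly fast, a few slow, two very slow
latencies_ms = [20] * 90 + [200] * 8 + [5000] * 2

print("mean:", mean(latencies_ms))
for pct in (50, 95, 99):
    print("p%d:" % pct, percentile(latencies_ms, pct))
```

Here the average lands at 134 ms, a value no request actually experienced, while the 50th percentile shows that a typical request takes 20 ms and the 99th percentile exposes the 5-second outliers.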

Elasticsearch – Request Rate and Latency

Garbage Collectors Time

Kafka – Flush Time

Kafka – Fetch/Produce Latency 1

Kafka – Fetch/Produce Latency 2

Solr – Req. Rate and Latency 1

Solr – Req. Rate and Latency 2

If you enjoy performance monitoring, log analytics, or search analytics, and like working with projects such as Elasticsearch, Solr, HBase, Hadoop, Kafka, and Storm, we’re hiring planet-wide!

Introducing Algolerts – Algorithmic Anomaly Detection Alerts

It is not every day that you come across a new term or concept Google doesn’t yet know about.  So today we’ll teach Google about something new we’ve added to SPM in the latest release: Algolerts.


The Problem with Threshold-based Alerts

Why do we even have alerts in performance monitoring systems?  We have them because we want to be notified when something bad happens, when some metric spikes or dips too much – when CPU usage hits the roof, when disk IO goes up, when the network traffic suspiciously quiets down, and so on.  We see such spikes or dips in metric values as signs that something might be wrong or is about to go wrong.  When limited to traditional threshold-based alerts one is forced to figure out what range of metric values represents a non-alarming, normal state and, conversely, at which point spikes and dips should be considered out of an acceptable range and taken seriously.  One needs to pick minimum and maximum metric values and then create one alert rule for each such value.  The same process then needs to be repeated for every metric one wants to monitor.  This is painful and time-consuming.  To make things worse, these thresholds have to be regularly updated to match what represents the new normal!  One can try to fight this by setting very “loose alerts” by picking high maxima and low minima, but then one risks not getting alerted when something really does go awry.

To summarize:

  • It is hard to estimate the normal range of each metric and pick min and max thresholds
  • Metric values often fluctuate and create false alerts
  • To avoid false alerts one has to regularly adjust alert rule thresholds

Algolerts to the Rescue!

With a name obviously derived from the terms Algorithm and Alert, Algolerts are SPM’s alternative to, or perhaps even a replacement for, the traditional threshold-based alerts you see in most, if not all, monitoring solutions.  Algolerts don’t require thresholds to figure out when to alert you.  Algolerts can watch any metric you tell them to watch and alert you when an anomalous pattern – a pattern that deviates from the norm – is detected.

Creating Algolerts is even simpler than adding threshold-based alerts and is done through a familiar interface:

SPM Algolert creation

Algolert notifications provide useful, easy-to-read numbers so one can quickly see just how big the anomaly is.  Here is an example notification:

Anomalous value for 'received' metric has been detected for SPM Application SA.Prod.Kafka
Host filter=xxx, Network Interface filter=eth0.

Anomaly detection window size: 1800 seconds.

Statistics for 'received' metric are:
Current: 1,220,121.00
Average:   185,147.97
Median:     89,536.00
StdDev:    222,173.70
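
One way to read the numbers in such a notification is to express the current value as a distance from the average, measured in standard deviations. The Python sketch below does that with the values from the example above. Note that this z-score is just our illustration of how to interpret the statistics; it is not a description of the algorithm Algolerts use internally, and the function name is hypothetical.

```python
def deviation_in_stddevs(current, average, stddev):
    """How many standard deviations the current value is from the average."""
    return (current - average) / stddev

# Values copied from the example notification for the 'received' metric
current = 1220121.00
average = 185147.97
median = 89536.00
stddev = 222173.70

z = deviation_in_stddevs(current, average, stddev)
print("standard deviations above average: %.2f" % z)
print("multiple of the median: %.1f" % (current / median))
```

In this example the current value sits more than four standard deviations above the average and is over 13 times the median, which is why it stands out so clearly.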

Known Kinks

The Algolerts implementation that’s in place in SPM today has a few known kinks.  The ones we know about and will be ironing out are:

  • no “things are OK again, you can go back to sleep” notifications are sent when the metric value goes back to normal
  • regular anomalies (e.g. a CPU intensive nightly cron job) may trigger false alerts, though this is not necessarily different from threshold-based alerts anyway
  • recently observed anomalies can create “the new norm” and thus hide subsequent anomalies
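
The last kink is easy to reproduce with any detector that compares new values against statistics of a recent window. The Python sketch below uses a plain z-score for that purpose, which is an assumption made for illustration and not the actual Algolerts algorithm; the window contents and the cut-off of 3 are made up.

```python
from statistics import mean, pstdev

def z_score(window, value):
    """Distance of value from the window mean, in standard deviations."""
    return (value - mean(window)) / pstdev(window)

ALERT_AT = 3.0  # hypothetical cut-off: alert beyond 3 standard deviations

quiet_window = [100, 102, 98, 101, 99, 100, 103, 97]
spiky_window = [100, 102, 98, 101, 900, 100, 103, 97]  # holds a recent anomaly

new_value = 600
for name, window in (("quiet", quiet_window), ("spiky", spiky_window)):
    z = z_score(window, new_value)
    print(name, "window -> alert:", z > ALERT_AT)
```

With the quiet window the new value of 600 is flagged, but once a recent spike to 900 is part of the window it inflates both the mean and the standard deviation, so the same value no longer crosses the cut-off.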

Despite this, Algolerts have already proven valuable in our own use of SPM.  We’re gradually replacing all our threshold-based alerts with Algolerts, and we invite you to try them out as well.

Please send us your feedback and follow @sematext for updates.