The Complete Guide to Metrics, Monitoring & Alerting

sematext sematext on

Monitoring your systems and infrastructure is critical to ensuring the performance of your services. In fact, as software development moves faster and faster, alerting and monitoring have become indispensable practices for modern DevOps teams. Why is that exactly?

That’s what I’m going to discuss today. You are going to learn about the three main components of a monitoring system (metrics, monitoring, and alerting) and why they are important, so you can make informed decisions when building your monitoring strategy and choosing the right tools for your business.

Metrics, Monitoring and Alerting: A Monitoring System Defined

Metrics, monitoring, and alerting are the key elements of a monitoring system. Metrics are the input, the raw data needed for monitoring performance, health, and availability. Monitoring is what alerting is built on top of. Together, they provide insight into how your applications and infrastructure are performing. They detect performance or usage anomalies, help you troubleshoot and identify issues, reveal usage and behavior patterns or trends, and help you understand the impact of any changes you make to your applications or infrastructure. Whenever metrics meet conditions you have defined as part of alert rules, monitoring solutions send notifications prompting you to investigate issues and even help you narrow down and identify possible root causes.

What Are System Metrics?

Metrics are raw data about resource usage or behavior that your monitoring system collects from the applications and services on your infrastructure. This is typically done with a monitoring agent, but it can also be done without one, for example for serverless applications. You can gather low-level metrics provided by the operating system or higher-level ones specific to your application or even your business.

Operating system metrics refer to those metrics that provide information about the availability of resources such as CPU, memory, disk space or various processes. Since they are readily available, you can easily see them by sending them to your monitoring system. Other metrics, say for database monitoring, web servers, etc. are collected in a similar fashion.
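To make "readily available" concrete, here is a minimal sketch of reading a few operating system metrics with nothing but the Python standard library. The function name `collect_os_metrics` and the metric names are made up for illustration; a real monitoring agent collects far more and ships the values to the monitoring system on a schedule.

```python
import os
import shutil
import time


def collect_os_metrics(path="/"):
    """Return a snapshot of a few low-level operating system metrics."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    used_percent = 100.0 * usage.used / usage.total if usage.total else 0.0
    snapshot = {
        "timestamp": time.time(),
        "disk.used_percent": used_percent,
        "disk.free_bytes": usage.free,
        "cpu.count": os.cpu_count(),
    }
    # Load average exists only on Unix-like systems and may be unobtainable.
    try:
        snapshot["load.1m"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        pass
    return snapshot


if __name__ == "__main__":
    for name, value in collect_os_metrics().items():
        print(f"{name} = {value}")
```

Calling this at a fixed interval and sending each snapshot to a central store is, in essence, what an agent does for you.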

By comparison, other components, mainly your custom applications, require instrumentation. They don’t readily expose their metrics, so you need to add code or interfaces, such as a Prometheus client library, that collect and expose the statistics you need to monitor.
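As a rough illustration of what instrumentation means, the sketch below wraps a function so that every call updates a call counter, an error counter, and a "last duration" gauge. The registry (`COUNTERS`, `GAUGES`), the `instrumented` decorator, and the `checkout` function are all hypothetical; in practice you would normally use a metrics client library rather than hand-rolled dictionaries.

```python
import time
from collections import defaultdict
from functools import wraps

# Hypothetical in-process registry; a real client library would expose or
# ship these values to the monitoring system.
COUNTERS = defaultdict(int)  # values that only go up, e.g. number of calls
GAUGES = {}                  # values that go up and down, e.g. last duration


def instrumented(name):
    """Decorator that records call count, error count, and last duration."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            COUNTERS[f"{name}.calls"] += 1
            try:
                return func(*args, **kwargs)
            except Exception:
                COUNTERS[f"{name}.errors"] += 1
                raise
            finally:
                elapsed = time.perf_counter() - start
                GAUGES[f"{name}.last_duration_seconds"] = elapsed
        return wrapper
    return decorator


@instrumented("checkout")
def checkout(cart_total):
    """Example business function: apply a 20% tax to the cart total."""
    if cart_total < 0:
        raise ValueError("cart total cannot be negative")
    return round(cart_total * 1.2, 2)
```

The application code stays almost unchanged; the decorator is the only thing that knows about metrics.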

Events

Besides metrics, some monitoring systems can also capture events. While metrics are collected on an ongoing basis at fixed intervals, events are generated at varying frequencies and are therefore collected only when they happen. And while a metric often makes more sense when looked at together with other metrics, an event usually provides its context all by itself. Events typically capture what happened, where it happened, and when. They work hand in hand with metrics by providing additional context, which helps you find the root cause of an issue faster.
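One way to picture the "what, where, and when" of an event, and how it adds context to a metric, is the small sketch below. The `Event` class and the `events_near` helper are illustrative names, not part of any particular product: given the timestamp of a metric spike, the helper returns the events recorded shortly before it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    when: float  # Unix timestamp of the event
    where: str   # host, service, or component
    what: str    # human-readable description


def events_near(events, timestamp, window_seconds=300):
    """Return events that happened in the window leading up to `timestamp`."""
    return [e for e in events if 0 <= timestamp - e.when <= window_seconds]


if __name__ == "__main__":
    events = [
        Event(when=1000.0, where="web-1", what="deployed build 42"),
        Event(when=5000.0, where="db-1", what="configuration changed"),
    ]
    spike_at = 1120.0  # moment a latency metric jumped
    for event in events_near(events, spike_at):
        print(f"{event.where}: {event.what}")
```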

Logs

Metrics alone are insufficient for troubleshooting, which is where logs come into play. While metrics describe application or service trends and current performance, logs provide information about what applications or services are doing and have been doing. They create a trail of events that show what happened, where, and when. This makes them extremely valuable for troubleshooting. Of course, nobody wants to watch logs all day, which is why monitoring systems let you alert on logs, too. As with metrics and events, you can set up alert rules and be notified when their conditions are met.
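Alerting on logs usually boils down to counting matching lines in a time window and comparing the count to a limit. The sketch below assumes a made-up log format of `ISO-timestamp LEVEL message`; the function name `error_count` and the five-minute window are illustrative choices.

```python
from datetime import datetime, timedelta


def error_count(lines, now, window=timedelta(minutes=5), level="ERROR"):
    """Count log lines of the given level within `window` before `now`."""
    count = 0
    for line in lines:
        parts = line.split(" ", 2)
        if len(parts) < 3:
            continue  # not in the assumed "timestamp LEVEL message" format
        stamp, line_level, _message = parts
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # unparseable timestamp, skip the line
        if line_level == level and timedelta(0) <= now - when <= window:
            count += 1
    return count


if __name__ == "__main__":
    logs = [
        "2024-05-01T12:00:10 ERROR db timeout",
        "2024-05-01T12:01:00 INFO request ok",
        "2024-05-01T12:03:30 ERROR db timeout",
    ]
    now = datetime(2024, 5, 1, 12, 4, 0)
    if error_count(logs, now) >= 2:
        print("alert: too many errors in the last 5 minutes")
```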

Why Are Metrics Important?

Metrics help you understand the current health of your whole infrastructure and applications. They are useful individually since you can set up alert rules to be notified when conditions you defined are met. However, when multiple metrics are aggregated into a single dashboard, especially when they represent different parts of your infrastructure, they provide a full picture of your environment. You can correlate them, identify historical trends and patterns, and see the impact of any issues that occurred or changes that you made.

With Sematext, you can collect metrics, logs, and events from all layers of your environment. Furthermore, you can combine metrics with application and server logs, events, alerts, anomalies, and more to monitor and troubleshoot faster from a single dashboard!

What Is Monitoring?

Monitoring is the process of collecting, aggregating, and analyzing the metrics provided by the components in your environment by using a monitoring solution.

What Is the Purpose of Monitoring: Do You Need It?

A monitoring system enables you to gather, store, centralize, and visualize metrics, events, logs, and traces in real time. A good monitoring system lets you see the bigger picture of what is going on across your infrastructure at any moment.

Monitoring solutions enable you to sample or aggregate both current and historical data. While fresh metrics are essential for troubleshooting any new issues, they are also valuable when analyzed over longer periods of time. Doing this provides insight you can see only when you zoom out and see changes, patterns, and trends over time. Good monitoring solutions will make your job easier by providing you with out-of-the-box or customizable graphs and dashboards where you can instantly see how all your apps and services interact with each other. Instead of finding your way around tables, you can now understand what’s happening at a glance.
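Zooming out over longer periods typically relies on downsampling: grouping raw samples into fixed time buckets and keeping a summary per bucket. Here is a minimal sketch of that idea; the function name `downsample` and the choice of average and maximum as summaries are assumptions made for the example.

```python
from collections import defaultdict


def downsample(samples, bucket_seconds):
    """Group (timestamp, value) samples into fixed-size time buckets.

    Returns {bucket_start: {"avg": ..., "max": ...}} ordered by bucket start.
    """
    buckets = defaultdict(list)
    for timestamp, value in samples:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[bucket_start].append(value)
    return {
        start: {"avg": sum(values) / len(values), "max": max(values)}
        for start, values in sorted(buckets.items())
    }


if __name__ == "__main__":
    raw = [(0, 1), (30, 3), (60, 10), (90, 20)]  # one sample every 30 seconds
    for start, summary in downsample(raw, 60).items():
        print(start, summary)
```

Keeping both the average and the maximum matters: averaging alone smooths away short spikes that may be exactly what you are looking for.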

At the same time, monitoring systems allow you to correlate data from different inputs, so you can see the relationships between various resources and across groups of servers.

And, of course, you can use monitoring systems for alerting, as I’ll describe further below.

Types of Monitoring

To ensure the best experience for your users, you need to measure performance at all levels of the deployment, from individual server components, applications, and services to collections of servers and their communication, external dependencies and the deployment environment, and end-to-end experience. To cover all your bases, you need a comprehensive monitoring system that supports different types of monitoring. For example, Sematext offers full stack visibility by bringing together application and infrastructure monitoring, log management, tracing, real user and synthetic monitoring. This mix enables users to easily diagnose and solve performance issues and spot trends and patterns to deliver a better user experience.

What Is Alerting?

Alerting is the reactive element of a monitoring system: it triggers actions whenever metric values meet conditions you have defined. The most common and basic type of alert therefore consists of a threshold and an action the system performs whenever the alert rule conditions are met. Besides threshold-based alerts, monitoring systems often include anomaly detection-based alerts and heartbeat alerts, which fire when an expected signal stops arriving.
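The two simplest alert types can be sketched in a few lines. The function names and default values below are illustrative assumptions: the threshold check requires several consecutive breaches so a single spike does not fire, and the heartbeat check fires when a component has been silent for too long.

```python
def threshold_breached(values, threshold, min_consecutive=3):
    """True if the last `min_consecutive` values all exceed `threshold`."""
    recent = values[-min_consecutive:]
    return len(recent) == min_consecutive and all(v > threshold for v in recent)


def heartbeat_missed(last_seen, now, max_silence_seconds=120):
    """True if nothing has been heard since `last_seen` for too long."""
    return now - last_seen > max_silence_seconds


if __name__ == "__main__":
    cpu_percent = [50, 91, 95, 97]
    if threshold_breached(cpu_percent, threshold=90):
        print("alert: CPU above 90% for 3 consecutive samples")
    if heartbeat_missed(last_seen=1000, now=1200):
        print("alert: no heartbeat for more than 120 seconds")
```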

Why Is Alerting Important?

Alerts are often the main “interaction interface” DevOps engineers have with monitoring systems. While pretty dashboards are useful, too, you don’t want to have to look at them all the time to spot problems. You want to carry on with your life and work and be alerted when there is something you need to address.

There are two common outputs of alerts:

  1. Notifications
  2. Automated actions

Based on these, you can use monitoring alerts to be notified or to perform a programmatic response.

Notifications are the most common output of alerts. Their main purpose is to alert a human being to a problem, providing as much context as possible to help the person troubleshoot and solve it faster. Effective alert notifications must contain enough information to paint a clear picture of what happened, where, and when, so that engineers can quickly understand the root cause and fix it.

Automated actions are not as common in monitoring systems. They can be useful in situations when automated actions are safe to perform without human intervention. An example of such action may be an automated restart of a problematic service.

Sematext features powerful alerting on metrics and logs with anomaly detection and scheduling, enabling both reactive and proactive system monitoring. With a rich library of integrations, you can choose where to push your notifications, whether it’s Slack, PagerDuty, or other ChatOps tools, email, mobile, etc.

Effective Monitoring and Alerting: Best Practices for Your Monitoring Strategy

Setting up effective alerting and monitoring is not trivial. Here are a few best practices we’ve identified over the years that can help you be more proactive than reactive with alerting and monitoring:

  1. Monitor both underlying system components and the system as a whole to get the full picture of both how your system components behave and influence each other and whether your users are being affected or not.
  2. If your infrastructure is elastic, you probably want to focus only on monitoring services and not on the individual components.
  3. Define alerts based on deviations from baseline performance and use historical data to establish how many standard deviations are acceptable. This helps prevent performance issues or solve them before they impact the end user, and reduce false positives.
  4. You need to have full trust in your alerts. Avoid false alerts. They lead to alert fatigue, which then leads to alerts getting ignored.
  5. Establish rules around deployment of new services or infrastructure, their monitoring and alerts. Ensure that new infra or services don’t go to production without being monitored and without suitable alert rules.
  6. Don’t forget to monitor your product or service from the viewpoint of your users. Capture metrics from your real users, from multiple locations on the planet.
  7. To see things from the perspective of your frontend application, monitor your APIs, your pages, and multi-step transactions, since users will most likely navigate more than one page on your website.
  8. Don’t forget to monitor any third party services’ performance. Problems with a third party affect the overall digital experience of your users and customers just as much as problems rooted in your own infrastructure.
  9. Re-evaluate your monitoring strategy on a regular basis to make sure it reflects the changes in your environment.
  10. Remember that you can use synthetic monitoring solutions, such as Sematext Synthetics, to benchmark against competitors, identify areas for improvement, and optimize your strategy.
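The baseline-deviation practice from the list above can be sketched with the standard library’s `statistics` module. This is a deliberately simple interpretation, with a hypothetical function name and a default of three standard deviations; real anomaly detection usually also accounts for seasonality and trends.

```python
from statistics import mean, stdev


def is_anomalous(history, current, max_deviations=3.0):
    """True if `current` is more than `max_deviations` standard deviations
    away from the mean of `history`."""
    if len(history) < 2:
        return False  # not enough data to establish a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return current != baseline  # perfectly flat baseline
    return abs(current - baseline) / spread > max_deviations


if __name__ == "__main__":
    response_times_ms = [100, 102, 98, 101, 99]
    print(is_anomalous(response_times_ms, 103))  # within normal variation
    print(is_anomalous(response_times_ms, 110))  # far outside the baseline
```

Tuning `max_deviations` against historical data is how you trade early detection against false positives.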

When Should You Alert Someone?

Not all alerts are created equal. Not all have the same level of severity and urgency. You need to define clear alert conditions that notify the appropriate team or individual and have enough contextual information to help them decide how urgent an alert is.

Levels of Urgency

Alert severity should tell you how serious a problem is and how fast it needs to be addressed. Some alerts require immediate human intervention, others require eventual human intervention, and some simply indicate where you may have to intervene in the future, in which case they are really more like early warnings than true alerts. Nevertheless, all alerts should be logged to a central location, such as Sematext Cloud, for easier access and correlation with other metrics and events.

Low Severity

Low urgency alerts are generated and recorded in the monitoring system to provide the context for potential troubleshooting later. They serve to record changes in performance, unusual events, deviations from the baseline, but they don’t notify you of these things.

Medium Severity

Medium severity alerts serve to give you a heads-up. They are used to warn you, but do not demand your immediate attention. These alerts give you the chance to act early and avoid high severity alerts. Examples include an increase in CPU usage, disk I/O, or network traffic that lasts for hours or days, or a notification that disk space will run out in several days or weeks.
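A "disk space running out in several days" warning can be approximated with a straight-line projection of recent growth. The sketch below is one simple way to do it, using only the first and last sample; the function name `days_until_full` and the sample format are assumptions for the example.

```python
def days_until_full(samples, capacity_bytes):
    """Estimate days until a disk fills up, using linear growth.

    `samples` is a list of (day, used_bytes) pairs ordered by day.
    Returns None if usage is flat or shrinking, or if there is too little data.
    """
    if len(samples) < 2:
        return None
    (first_day, first_used), (last_day, last_used) = samples[0], samples[-1]
    if last_day == first_day:
        return None
    growth_per_day = (last_used - first_used) / (last_day - first_day)
    if growth_per_day <= 0:
        return None
    return (capacity_bytes - last_used) / growth_per_day


if __name__ == "__main__":
    usage = [(0, 500), (10, 600)]  # grew by 100 bytes over 10 days
    remaining = days_until_full(usage, capacity_bytes=1000)
    if remaining is not None and remaining < 60:
        print(f"warning: disk projected to be full in {remaining:.0f} days")
```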

High Severity

Critical alerts are the worst-case scenario. These are the most urgent ones, requiring a prompt response. In other words, it’s too late to take preemptive measures and you need to react and fix the issue(s) ASAP. Any alerts on resources with limited capacity (e.g. disk space, queue nearly full, etc.) fall in this category, as do service-level alerts that are impacting your users.

How to Determine the Severity Level of an Alert?

Setting up alerting correctly, including deciding on alert severity, is critical. You want adequate alert rules and severities to avoid false alerts, unnecessary interruptions (of work, life, or sleep), context switching, and alert fatigue, and to keep trust in your alert system high. Here are several rules of thumb used at Sematext:

  • If it’s not happening in a production environment it’s not really an alert.
  • If it’s frequent, nobody’s reacting to it, and it’s not affecting or breaking anything, it’s either low severity or should be removed. In Sematext you can also just temporarily disable alert rules, which is handy when you have complex rules you don’t really want to lose forever.
  • If it’s important and good to know but not urgent, it’s a medium severity alert.
  • If it requires immediate attention, if it’s impacting users, it’s a high severity alert.
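The rules of thumb above can be written down as a small decision function. This is only one possible reading of them: the function name, the boolean flags, and the choice to treat "medium" as the fallback for anything important but not urgent are assumptions made for the sketch.

```python
def classify_alert(
    in_production,
    impacts_users=False,
    needs_immediate_attention=False,
    frequent_and_ignored=False,
):
    """Map the rules of thumb to a severity, or None if it is not an alert."""
    if not in_production:
        return None  # not happening in production: not really an alert
    if impacts_users or needs_immediate_attention:
        return "high"
    if frequent_and_ignored:
        return "low"  # or a candidate for removal or temporary disabling
    return "medium"   # assumed default: good to know but not urgent


if __name__ == "__main__":
    print(classify_alert(in_production=True, impacts_users=True))
    print(classify_alert(in_production=True, frequent_and_ignored=True))
    print(classify_alert(in_production=False, impacts_users=True))
```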

Alerting and Monitoring Systems: How to Choose the Best One

Effective alerting and monitoring requires both a strategy and a good solution. Good monitoring solutions collect the right metrics, enable you to visualize and correlate them so that you can rapidly identify causes and minimize service degradation and disruption. There are a lot of tools available, some better than others. Here’s what we suggest you should look for when you’re in the market for a monitoring system:

External to Other Services

Don’t collocate your monitoring system in the same infrastructure where your product is running. Use an external system, a SaaS. When you need to troubleshoot your application you don’t want to have issues with your monitoring system. This is also wise from a security point of view, especially when it comes to logs. Ship them out to a SaaS where you can access them even if your own infrastructure is compromised.

Out of the Box Insights

Don’t pick a solution that requires hours, days, or weeks to build a set of charts and dashboards from scratch. Don’t underestimate how much time it takes to figure out what metrics some piece of software has, how it should be collected (counter? gauge?), how it should be aggregated (avg? max? p95?), how it should be visualized, what other metrics it’s related to, etc. Pick a solution that comes with out of the box dashboards for a number of integrations where all this has already been figured out. For example, Sematext also comes with default, out of the box alert rules to save new users time and effort.
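To see why questions like "counter or gauge?" and "avg, max, or p95?" take time to answer, consider the sketch below. It converts a cumulative counter into per-second rates (treating a drop as a counter reset) and compares three aggregations of the same latency data. The function names and the nearest-rank percentile method are choices made for this example.

```python
import math


def counter_to_rates(samples):
    """Turn (timestamp, cumulative_count) samples into per-second rates.

    A drop in the counter is treated as a reset to zero, for example after
    a process restart.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        delta = c1 - c0 if c1 >= c0 else c1
        rates.append(delta / (t1 - t0))
    return rates


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list, with pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    requests_total = [(0, 0), (10, 50), (20, 150), (30, 20)]  # reset at t=30
    print(counter_to_rates(requests_total))

    latencies_ms = [10] * 19 + [500]  # one slow request out of twenty
    print("avg:", sum(latencies_ms) / len(latencies_ms))
    print("max:", max(latencies_ms))
    print("p95:", percentile(latencies_ms, 95))
```

The same twenty latencies tell three different stories depending on the aggregation, which is why prebuilt dashboards that already made these choices save so much effort.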

Flexible Visualizations and Dashboards

You want a system that’s intuitive to use, snappy, and flexible, which lets you build visualizations of any sort of timeseries data.

Full Stack Observability

Look for a solution that handles not just performance metrics, but all other types of observability data: logs, events, traces, user monitoring, API monitoring, network monitoring, and so on. Having a single solution that can do all that makes troubleshooting simpler and faster, not to mention that it is cheaper than using N different solutions. And I don’t mean only cheaper in terms of monthly or annual cost for the solution, but also in terms of the time needed to learn one solution instead of several, and the time saved by pivoting seamlessly between different types of observability data instead of jumping between multiple monitoring or alerting systems.

Summary and Detailed Views

The power of a monitoring system also lies in its visualization capabilities, in providing real-time high-level summaries of the huge amount of data it collects in a way that’s easy and quick to digest. Out-of-the-box dashboards are a must in a monitoring system.

However, these reports usually show only the most commonly used metrics. Effective troubleshooting requires a high degree of flexibility, so dashboards must be customizable. You need to be able to graph any time series or report to correlate data from various sources and identify the cause of a service failure.

Able to Collect and Correlate Metrics From Different Data Sources

Since it’s responsible for providing you with an overview of all the components across your infrastructure, a monitoring system must process data in various forms from various sources. For effective system monitoring and troubleshooting, you have to be able to visualize that data in one place, see how your apps and services interact with each other in real time, and correlate them so that you can quickly catch an issue and initiate incident management if needed.

Capacity to Adapt to Changes in Your Environment or Monitoring Strategy

As we’ve already mentioned, your environment will inevitably evolve. The monitoring system must be flexible enough to adapt whenever you add new components to your infrastructure, with little to no disruption. The same goes for outdated machines. If you need to remove them, the system has to manage the change without losing the data associated with them.

Consequently, you may need to measure new metrics as well, in which case, the ease with which you can set up new variables is important.

Flexible and Powerful Alerting

Of course, assessing alerting features is one of the most important steps when shopping for monitoring software. First, the system needs to be able to push notifications through various channels. However, many systems do not deliver notifications themselves but offer integrations with other parties, which is not a bad thing. In fact, it makes the alerting system more flexible since all you need to do is use an API. Furthermore, the system should offer multiple options to personalize notification triggers.
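Pushing notifications through various channels is, at its core, a routing problem: decide from the alert which channels should hear about it. The sketch below keeps channels as plain callables so that any integration can be plugged in. The names `route_alert`, `channels`, and `routes`, and the severity-based routing table, are hypothetical.

```python
def route_alert(alert, channels, routes):
    """Deliver `alert` to every channel configured for its severity.

    `channels` maps a channel name to a callable that sends the alert.
    `routes` maps a severity to a list of channel names.
    Returns the names of the channels the alert was delivered to.
    """
    delivered = []
    for name in routes.get(alert["severity"], []):
        channels[name](alert)
        delivered.append(name)
    return delivered


if __name__ == "__main__":
    # Stand-ins for real integrations such as chat or paging tools.
    channels = {
        "chat": lambda alert: print("chat:", alert["summary"]),
        "pager": lambda alert: print("pager:", alert["summary"]),
    }
    routes = {"medium": ["chat"], "high": ["chat", "pager"]}
    route_alert({"severity": "high", "summary": "checkout is down"}, channels, routes)
```

Low severity alerts have no route here, which matches the idea that they are recorded but do not notify anyone.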

However, setting up alerting rules pertains strictly to the monitoring system. You need comprehensive solutions that allow you to define smart, robust and reliable alerting conditions that do not overwhelm you but red flag only meaningful events you should look into.

System Monitoring with Sematext

Available both on-prem and in the cloud, Sematext Cloud is a full observability solution that helps you monitor the health of your IT infrastructure in real time. It brings metrics, logs, and events under one roof for easier, faster, and better troubleshooting.

Log management capabilities allow you to ship and analyze any application or infrastructure logs, including serverless logs. You can keep tabs on your environments with network, database, processes, and inventory monitoring, and check on your users’ experiences with real user monitoring and your APIs with synthetic monitoring. Sematext supports microservices and containerized environments allowing you to monitor Kubernetes or Docker and the applications running inside them.

Sematext features powerful alerting, anomaly detection and scheduling, and rich, completely customizable dashboards where you can graph virtually everything to better see how your apps and services are interacting with each other.

There is a free trial so you can see how Sematext works with your favorite apps.

The Bottom Line

Whether you should or shouldn’t monitor your applications and services is not a question anymore. You definitely should. The question is how. Understanding monitoring systems is the first step. Collecting the right metrics from all the components in your infrastructure and defining meaningful and actionable alerts are key to a healthy production environment. Setting up a monitoring system is no easy feat, but it’s definitely worth it. A reliable monitoring system helps you and your team detect and solve issues faster, keep system downtime to a minimum, and impress your bosses.