Nowadays, most applications we build are composed of microservices and distributed in nature. In such a setup, communication between these microservices is crucial, but can, unfortunately, cause some headaches.
The first thing I check when I’m troubleshooting a bug in production is inter-service communication. Having a reliable tool at your disposal to take care of this can reduce a lot of stress. RabbitMQ, a hybrid messaging broker, is one such tool.
But before you can troubleshoot communication issues, you need to make sure the tool responsible for that communication is functioning properly. Monitoring RabbitMQ becomes important to make sure the whole application is working as expected.
In this post, I’ll be showing you the various metrics RabbitMQ offers for monitoring performance, message delivery, and the general functioning of RabbitMQ. I’ll also talk about some of the best RabbitMQ monitoring tools that can help you monitor the health of your clusters. Last but not least, I’ll address the importance of correlating metrics with logs.
Understanding How RabbitMQ Works
At its core, RabbitMQ is a simple message broker system, made up of just a few parts, namely an exchange and queues. Outside of RabbitMQ, you have the producers and consumers, sending and receiving messages to and from RabbitMQ.
It’s all pretty straightforward: The producer sends a message to RabbitMQ, which is received by the RabbitMQ exchange. Based on a few parameters both in the configuration and the message received, the exchange will decide on the right target queue for the message and forward it there for a consumer to read.
There can, of course, be multiple target queues and multiple consumers consuming the same message. As to the types of exchanges you’ll find in RabbitMQ, here are a few of them:
- Fanout Exchanges send the incoming message to all queues available.
- Direct Exchanges route the incoming message to the target queue indicated by the message’s routing key.
- Topic Exchanges match the message’s routing key against the patterns queues are bound with. Patterns can contain wildcards, so a queue can match a routing key fully or partially and receive the message.
- Headers Exchanges route based on message headers rather than the routing key. If an incoming message’s headers match the headers a queue was bound with, the queue will receive the message.
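To make these routing strategies concrete, here’s a minimal Python sketch that simulates how each exchange type picks target queues. This is only an illustration of the routing rules, not RabbitMQ’s actual implementation; the `route` and `topic_matches` helpers are hypothetical names, and the headers variant is simplified to “all headers must match.”

```python
def topic_matches(pattern, key):
    """Match a routing key against a topic pattern.
    '*' matches exactly one word, '#' matches zero or more words."""
    def match(p, k):
        if not p:
            return not k
        if p[0] == "#":
            # '#' may absorb zero or more words of the key
            return any(match(p[1:], k[i:]) for i in range(len(k) + 1))
        if not k:
            return False
        return (p[0] == "*" or p[0] == k[0]) and match(p[1:], k[1:])
    return match(pattern.split("."), key.split("."))

def route(exchange_type, bindings, message):
    """Return the queues a message would be routed to.
    bindings: list of (queue_name, binding_key_or_headers) pairs."""
    key = message.get("routing_key", "")
    if exchange_type == "fanout":
        return [q for q, _ in bindings]          # everyone gets a copy
    if exchange_type == "direct":
        return [q for q, bk in bindings if bk == key]
    if exchange_type == "topic":
        return [q for q, bk in bindings if topic_matches(bk, key)]
    if exchange_type == "headers":
        # simplified: a queue receives the message only if all of its
        # bound headers are present in the message with the same values
        return [q for q, hdrs in bindings
                if all(message.get("headers", {}).get(k) == v
                       for k, v in hdrs.items())]
    raise ValueError(f"unknown exchange type: {exchange_type}")
```

For example, with a topic exchange, a binding pattern of `logs.#` would receive a message keyed `logs.app.error`, while `logs.*` would not, since `*` matches exactly one word.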
You should know that this is an overview of the architecture of RabbitMQ. This architecture might seem very simple, but when RabbitMQ is deployed in a production environment with hundreds of consumers and producers, things get more complicated. So it’s critical you monitor the setup so that there are no messages lost in transit.
Why You Should Monitor RabbitMQ
Many transaction-based systems rely on messaging services for sending transactions downstream, e.g., for record-keeping and analytics. Losing just one message can mean losing an actual transaction and a potential monetary loss for your business. You can lose a RabbitMQ message for a variety of reasons, such as high CPU usage, several messages going unacknowledged, etc. Monitoring will help you narrow down the reasons for such behavior.
But monitoring alone is not enough. You also need alerts for when something is about to go wrong or has already gone wrong. Queues shutting down or messages piling up in a queue—these are all cause for concern. Continuous monitoring ensures problems are immediately communicated to the relevant departments and fixed.
Building a successful monitoring and alerting strategy means keeping all of these factors in mind. But what exactly should you be monitoring?
How Do You Monitor RabbitMQ: Key Metrics You Should Measure
The most basic metrics used to monitor a RabbitMQ system are health checks, which essentially tell you if a node is healthy or not. But the definition of a healthy node is different for each project.
So, for one project, a node may be healthy if the Erlang VM is running on the system. But for another project, the node will only qualify as healthy if, along with the Erlang VM, a particular service (say, a producer) is also running on it. When you run RabbitMQ, your team gets to define what makes a node healthy or not.
Health checks typically involve monitoring system-level parameters, not RabbitMQ parameters, so they’re limited in the information they can provide; for example, you’ll only get data about the node they’re running on. But to look at the overall health of RabbitMQ, you need to record metrics from all nodes in the cluster and over longer periods of time.
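As a sketch of such a project-specific health check, the function below treats a node as healthy only if it is running and every application your project requires is present on it. It assumes node data shaped like one entry from the management API’s `/api/nodes` response; treat the exact field names (`running`, `applications`) as assumptions to verify against your RabbitMQ version.

```python
def node_is_healthy(node, required_apps=()):
    """Project-specific health check over one /api/nodes entry.

    A node counts as healthy only if it reports itself as running AND
    every application in required_apps appears in its application list.
    """
    if not node.get("running", False):
        return False
    running_apps = {app["name"] for app in node.get("applications", [])}
    return all(app in running_apps for app in required_apps)
```

One team might call `node_is_healthy(node)` with no required apps at all; another might insist on `required_apps=("rabbit", "my_producer")`, matching the two definitions of “healthy” described above.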
The metrics you can collect from the nodes running RabbitMQ can be generally classified into three types:
- Infrastructure or Kernel metrics
- RabbitMQ metrics
- Application metrics
Now, looking at these metrics individually won’t give you enough information about the system’s health. But looking at multiple metrics together gives you much more context, making it far easier to debug issues as they arise.
Let’s dive in.
Infrastructure or Kernel Metrics
Infrastructure or kernel metrics are also called “system metrics” and basically concern the node running RabbitMQ. They’re not specific to RabbitMQ and provide only the general health of the node. Kernel metrics, such as CPU, IO, and memory, need to be monitored to understand the nature of a bottleneck. For example, high IO operations could mean that the queues are getting too large or there might be a need for more nodes.
Note: Polling too frequently when collecting these metrics will result in the monitoring tool consuming a lot of CPU cycles and leaving fewer resources for RabbitMQ. Then again, if the polling frequency is too low, you might end up missing spikes in resource usage.
RabbitMQ Metrics
These are metrics that are specific to RabbitMQ. Although you can collect many of these cluster-wide, there are certain performance metrics collected at the node level.
Cluster-Level Metrics
As the name suggests, cluster-level metrics provide a high-level overview of the entire cluster. These metrics could be a measure of the interaction between nodes or metrics combined across nodes. Let’s look at a few of them.
Number of Connections
This tells us the number of connections to the RabbitMQ cluster. Any drop in this number means that some consumers might be down.
Number of Queues
This metric tells us the number of queues that are created across the nodes in the cluster. It helps in identifying if there are queues going down in the cluster or if there are new queues being created.
Number of Consumers
Monitoring the number of consumers helps you identify if consumers are going down. A decreasing number of consumers could lead to a pileup of messages, driving down message rates as well.
Number of Messages
This is the total number of messages published, delivered, acknowledged, and unacknowledged in the cluster, across all queues, consumers, and nodes.
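The four cluster-level numbers above are all exposed by the management plugin’s `/api/overview` endpoint. Here’s a small sketch that condenses such a payload into just those numbers; the `object_totals` and `queue_totals` field names follow the management API, but verify them against your RabbitMQ version before relying on them.

```python
def summarize_cluster(overview):
    """Condense an /api/overview payload into the four cluster-level
    metrics discussed above: connections, queues, consumers, messages."""
    totals = overview.get("object_totals", {})
    queue_totals = overview.get("queue_totals", {})
    return {
        "connections": totals.get("connections", 0),
        "queues": totals.get("queues", 0),
        "consumers": totals.get("consumers", 0),
        # total messages across all queues, ready plus unacknowledged
        "messages": queue_totals.get("messages", 0),
    }
```

A monitoring script would fetch the endpoint on a schedule, run it through a summarizer like this, and alert when, say, the connection count drops sharply between samples.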
Node-Level Metrics
Many node-level metrics overlap with the system metrics I mentioned above, but there are RabbitMQ-specific metrics as well. Below are a few of them.
Memory and Storage Alarm Status
This metric will tell us if the memory or storage on a particular node is more than the limit set, thereby triggering an alarm.
Sockets, Available vs. Used
The number of sockets available vs. the number of sockets used tells us if we are reaching the number of connections a node can support. If all the sockets are being used up, you might have to consider scaling out the cluster.
Message Store Disk Reads
This indicates the total number of times messages have been read by the queue from the disk. If this number is higher than expected, it could mean one or more consumers are not able to acknowledge or process the message and are reading it again and again. Based on this, you might want to check if any consumer is breaking.
Inter-Node Communication Links
This metric tells us if there is any inter-node communication link down, which may require you to manually bring the nodes back up online or check the network configuration to make sure all nodes are connected. If a link is down, data stored on that node will not be available for consumers.
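Several of these node-level signals can be checked from a single `/api/nodes` entry. The sketch below scans one for alarm flags and socket exhaustion; the field names (`mem_alarm`, `disk_free_alarm`, `sockets_used`, `sockets_total`) follow the management API, and the 90% socket threshold is just an example value to tune for your cluster.

```python
def node_warnings(node, socket_threshold=0.9):
    """Scan one /api/nodes entry for the node-level signals above.
    Returns a list of human-readable warnings (empty if all clear)."""
    warnings = []
    if node.get("mem_alarm"):
        warnings.append("memory alarm triggered")
    if node.get("disk_free_alarm"):
        warnings.append("disk alarm triggered")
    used = node.get("sockets_used", 0)
    total = node.get("sockets_total", 0)
    # warn before sockets run out entirely, so you can scale out in time
    if total and used / total >= socket_threshold:
        warnings.append(f"{used}/{total} sockets in use")
    return warnings
```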
Monitoring and visualizing these metrics paint a picture of your overall cluster health, along with the health of RabbitMQ processes on each node. So, if a node is encountering any kind of network issue, like maxing out the number of sockets or connections available, you’ll be able to fix the problem before a node goes down.
Queue Metrics
Queue metrics are critical because you need to monitor how each queue is performing. Why? Well, depending on the rate at which messages are being produced versus the rate at which they’re being consumed, you may want to increase the number of consumers or even split a queue into multiple queues. To make that decision, you need data from monitoring the queues. The following are a few options.
Number of Delivered Plus Acknowledged Messages
This metric tells us the number of messages delivered versus the number acknowledged. If the gap between the two is large, one or more consumers aren’t able to acknowledge messages, meaning there is something wrong with those consumers.
Number of Messages Ready for Delivery
If this number is more than expected, there could be a bottleneck at the message processing end. If the processing is taking too much time, the consumer will not be ready to accept more messages, which would push this number up.
Number of Messages Unacknowledged
If a large number of messages are unacknowledged, it could again mean that a consumer is down, which would call for debugging the consumers.
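All three queue metrics can be read from one `/api/queues` entry. Here’s a sketch that extracts them; `messages_ready`, `messages_unacknowledged`, and the `message_stats` counters (`deliver_get`, `ack`) follow the management API, but double-check the names against your version.

```python
def queue_backlog(queue):
    """Evaluate one /api/queues entry against the three queue metrics
    above: delivery/ack gap, messages ready, messages unacknowledged."""
    stats = queue.get("message_stats", {})
    delivered = stats.get("deliver_get", 0)  # cumulative deliveries
    acked = stats.get("ack", 0)              # cumulative acknowledgements
    return {
        # a persistently large gap suggests consumers failing to ack
        "delivery_ack_gap": delivered - acked,
        # a growing ready count suggests a processing bottleneck
        "ready": queue.get("messages_ready", 0),
        "unacked": queue.get("messages_unacknowledged", 0),
    }
```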
Application Metrics
The applications producing and consuming RabbitMQ messages play an important role in maintaining the health of the RabbitMQ cluster. If an application is unable to maintain a stable connection to RabbitMQ or is taking too long to acknowledge the delivery of messages, it can bottleneck your queues and cause the cluster to misbehave. When looking at the metrics collected from RabbitMQ and applications together, it’s way easier to pinpoint issues and identify the applications causing problems.
Here are some of the metrics I collect from applications to help me understand how my applications are performing:
- Connection opening and failure rates
- Channel opening rate
- Message publishing and delivery rate
- Positive and negative acknowledge rate
Using RabbitMQ’s Java client, for instance, you can collect these application-level metrics easily. You can also use other libraries, say the Spring AMQP Library, along with RabbitMQ to help you do this for different frameworks.
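In the absence of a client library with built-in metrics, the idea is simple enough to sketch by hand. The minimal Python class below (a hypothetical `AppMetrics` helper, not part of any RabbitMQ client) counts the events listed above and derives rates; a real application would hook these counters into its client library’s connection and publish/ack callbacks instead.

```python
import time
from collections import defaultdict

class AppMetrics:
    """Minimal application-side counters for the rates listed above.
    Call record() from your client's event hooks, rates() when scraping."""

    def __init__(self):
        self.counts = defaultdict(int)
        self.started = time.monotonic()

    def record(self, event):
        # event is one of: "connection_opened", "connection_failed",
        # "channel_opened", "published", "delivered", "ack", "nack"
        self.counts[event] += 1

    def rates(self):
        """Events per second since startup, per event type."""
        elapsed = max(time.monotonic() - self.started, 1e-9)
        return {event: n / elapsed for event, n in self.counts.items()}
```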
RabbitMQ gives you HTTP APIs to collect a lot of these metrics, plus a management UI for monitoring them. The one downside to using this UI is that you’re limited to the most recent data collected, from just the last few hours. But most real-world applications would need data from days or weeks ago.
This is where standalone monitoring tools come in: they use those same RabbitMQ monitoring APIs but give you more flexibility, including storing the collected data over longer periods of time. Let’s take a look at a few of them.
Best RabbitMQ Monitoring Tools
To begin with, I’ll show you one built-in CLI tool used to monitor RabbitMQ, then I’ll review a couple of standalone monitoring solutions.
rabbitmq-diagnostics
rabbitmq-diagnostics is a built-in RabbitMQ tool that provides a very basic monitoring framework, with specific commands such as “ping” and “status” for checking specific metrics. If this is your first attempt at monitoring RabbitMQ, it’s a great place to start.
Its “observer” command is a “top”- or “htop”-like tool that provides a very pleasant CLI user interface with a lot of metrics collected at the process level.
Sematext
Sematext is a monitoring software that collects a vast variety of metrics from various systems and applications. It easily integrates with a RabbitMQ cluster to collect all the metrics provided by RabbitMQ as well as all the system metrics from the RabbitMQ cluster; plus, being a complete and general-purpose monitoring solution, you get a powerful visualization tool. So, once all this data is collected, you can build custom dashboards to plot the metrics that are more important for your business.
With separate, detailed dashboards for different RabbitMQ and system metrics, such as queues and nodes, Sematext provides great insight into the entire RabbitMQ infrastructure. When you combine this with OS-level metrics, such as CPU and memory, you can clearly identify whether a queue or a node is slowing your setup down.
Then there are Sematext Agents—Sematext services running on the nodes in your cluster. These agents can automatically detect RabbitMQ installations and then let you take over to configure all the metrics you want to monitor and customize the dashboards exactly to your liking.
Sematext also offers RabbitMQ Logs Integration. Using this feature, you can plot charts from the data extracted from log messages: client authentication, RabbitMQ restarts, etc. Correlating this log data with the metric data collected from RabbitMQ in a split-screen view will give you more insight into an issue during troubleshooting.
Pros
- Extremely easy setup with automatic detection of RabbitMQ installations
- Customizable agents that let you configure which metrics they collect
- Predefined and customizable dashboards and alert rules
- Part of a full-stack monitoring platform, so easy to correlate RabbitMQ performance with metrics, events, and logs from other parts of the infrastructure and application stack
Cons
- Only the agent is open-sourced
- No annual pricing, but bundling discounts are available
Pricing: Offers a free plan with 500 MB/day ingestion. Also offers paid plans that include a 14-day free trial, after which pricing starts at $0.035 per agent per hour.
Prometheus and Grafana
Prometheus is a very well-known data collection application and integrates well with almost any application or system that generates data for monitoring. Built as a general-purpose monitoring tool that can be hosted anywhere, on-premises or in the cloud, it can be integrated into applications written in most of the widely used programming languages today.
Unfortunately, Prometheus is not a visualization tool, so you need to combine it with another tool for this purpose. Grafana is the most popular choice here. Grafana lets you create custom charts, using the data from Prometheus, and you can even create a dashboard with these charts. Being able to completely custom-build these dashboards from scratch makes the Prometheus/Grafana combination particularly powerful. If you want something pre-built, RabbitMQ provides a bunch of such dashboards to get started with.
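The data Prometheus scrapes from RabbitMQ (via the rabbitmq_prometheus plugin’s `/metrics` endpoint) arrives in the Prometheus text exposition format. To show what that looks like, here’s a simplified parser for a single exposition line; it ignores escaping and other edge cases of the real format, and the sample metric name is just an illustration.

```python
def parse_prom_sample(line):
    """Parse one simplified Prometheus exposition line into
    (metric_name, labels_dict, value). Ignores escaping/edge cases."""
    name_part, value = line.rsplit(" ", 1)
    if "{" in name_part:
        metric, label_str = name_part.split("{", 1)
        label_str = label_str.rstrip("}")
        labels = dict(pair.split("=", 1) for pair in label_str.split(","))
        labels = {k: v.strip('"') for k, v in labels.items()}
    else:
        metric, labels = name_part, {}
    return metric, labels, float(value)
```

So a line like `rabbitmq_queue_messages_ready{queue="orders"} 5` parses into the metric name, a `{"queue": "orders"}` label set, and the value 5.0—exactly the shape Grafana queries against.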
There are a few things to consider if you decide to use these tools. One inconvenience is that they are not managed but self-hosted, meaning you’ll have to host Prometheus and Grafana yourself and maintain them as well.
If you have a big analytics team, meaning more queries and more need to scale, keep in mind you’ll need to scale these systems yourself, too.
Pros
- Can be self-hosted
- Both are open-source tools
- Highly customizable chart options
Cons
- If deployed on-premises, needs maintenance
- Setting up Prometheus and Grafana takes time
- Available options for dashboards could become overwhelming quickly
- Alerting is provided by yet another tool that needs to be installed and maintained
Pricing: Grafana is open-source, so you don’t pay a license fee if you self-host it. Grafana Cloud is a managed service with a $49/month plan for bigger teams and projects.
AWS CloudWatch
If you’re familiar with AWS and looking for a tool you’ll easily recognize and feel at home with, there’s AWS CloudWatch. Most of the logs and metrics from AWS services are centralized in CloudWatch. It has a powerful log-search capability and integrates easily with a host of external services to make metric collection easy.
Similar to other tools in this list, configuring AWS CloudWatch to collect metrics from RabbitMQ is a bit of a process, with quite a few things to configure before data starts flowing. But anybody who has configured any other AWS service should have no problem.
There is one configuration file that tells CloudWatch which metrics need to be collected and then divides these metrics into exchange, queue, connection, and channel categories, making configuration a snap. As expected, almost all RabbitMQ metrics can be collected by CloudWatch.
Pros
- Well integrated with other AWS services
- Has rule-based automatic Lambda trigger functionality
- Easy to set up and use, though only within AWS
Cons
- Doesn’t work well with non-AWS services
- Chart options for dashboards are limited
- UI/UX leave users unsatisfied
Pricing: Starts from $0.30 per metric for the first 10,000 metrics per month.
New Relic
New Relic is another well-known monitoring tool that can monitor most of the important metrics you’d want to monitor in a RabbitMQ cluster and is especially good for small clusters. Make sure you’re running a supported version of RabbitMQ though, as not all versions are supported by New Relic.
Compared to Sematext, integrating New Relic with RabbitMQ can be a bit complicated. Since there’s no auto-update for New Relic’s integration package, you need constant human intervention: you have to check whether an update for the integration tool is available and apply it manually. I find this process of manually maintaining a monitoring tool bothersome, to say the least, and it definitely doesn’t make for a seamless experience.
Note: The process of installing and configuring New Relic changes depending on the platform you’re running your RabbitMQ cluster on—Linux, Windows, or Kubernetes.
Pros
- SQL-like query language makes it easy to query collected data
- Provides a variety of built-in dashboards
- Good integration with alerting tools
Cons
- Integration could have been easier
- Needs more types of charts
- Could become expensive as most features require a paid plan
- In the Standard package, there’s only one free user account. Every additional user added will cost $99. Only the Pro and Enterprise plans include multiple user accounts.
Pricing: Starts with a free tier with 100 GB of data ingestion per month, after which it costs $0.25 per GB ingested.
Datadog
Datadog is a popular system monitoring tool that allows you to monitor both RabbitMQ and system metrics. Just install Datadog Agents, which will automatically install the required packages to monitor RabbitMQ. After this, you’ll have to deal with some configuration for the Datadog Agents to connect from the RabbitMQ infrastructure to the Datadog dashboard.
Datadog is capable of collecting almost all the metrics available in RabbitMQ, and most metrics from the system as well. But the popular Watchdog feature, which uses algorithms to detect and alert you to potential system and application issues, does not support RabbitMQ. However, the dashboard that Datadog provides without the Watchdog feature is good enough for a simple RabbitMQ cluster.
One specific feature of Datadog is Runbook. This is basically a guide that tells the team members how to react to an alert or an issue; it also provides context to the issue with historical data.
Pros
- Runbook makes it easy to act on alerts
- Alerting with email templating and period digest capabilities
- Good visualization
Cons
- Navigating the user interface can be a challenge
- Documentation isn’t the best
- Setup can become cumbersome
- Has a single paid plan
Pricing: Starts at $15 per host per month.
AppDynamics
Yet another popular monitoring tool that supports RabbitMQ is AppDynamics. Right off the bat, you’ll realize that the installation and configuration of AppDynamics agents on each node in your RabbitMQ cluster is pretty hands-on. You need to download and copy files to specific directories, edit XML files, etc. But after doing all this, you can monitor all the important RabbitMQ metrics.
Plus, with AppDynamics’ Unified Monitoring solution, you’ll be able to monitor all of your applications, databases, systems, etc. in one place. This is a good solution for a big, distributed system. And with its Cloud Monitoring tool, you’ll have every metric you’d need from a cloud-based RabbitMQ installation.
Pros
- Great documentation
- Business transaction monitoring
- Intuitive user interface
Cons
- Can be overwhelming in the beginning
- User role definition is a bit limited
- There have been reports of crashes in production systems
Pricing: Offers a 15-day free trial. After that, pricing starts at $6/month per CPU core for the most basic package, and $60/month or $90/month per CPU core for other plans.
Correlating RabbitMQ Metrics with Logs for Easier Troubleshooting
Similar to any other system, RabbitMQ logs a lot of data continuously. Even though most of this data isn’t needed in a production deployment, logs about connections, RabbitMQ restarts, and errors can be useful. For example, I can plot a chart of all connections coming into RabbitMQ from different sources or the rate at which RabbitMQ restarts.
Using Sematext’s RabbitMQ integration, I can also easily ship RabbitMQ logs along with metrics to Sematext and create dashboards showing, for example, the impact of increased CPU or memory usage by RabbitMQ when it authenticates or restarts. This kind of information helps me troubleshoot quickly and avoid RabbitMQ downtime.
Conclusion
RabbitMQ is a widely used tool for interprocess messaging and communication. But setting up a cluster isn’t enough. There has to be continuous monitoring of the cluster to make sure it’s working as expected.
Even though RabbitMQ provides a way to get started with basic monitoring, it’s not built for continuous monitoring or large-scale production monitoring. RabbitMQ does, however, give you the means to do these things in the form of metrics.
Regardless of which tool you choose, you can see how your RabbitMQ cluster is performing as well as observe resource-usage patterns to proactively scale up resources when an increase in messages is expected. Analyzing connection information from the logs will also help you identify malfunctioning applications that could be deteriorating cluster performance. Using this information, collected and plotted on the same page, helps you make sure no messages are lost—ensuring business continuity and protecting profits.
If you’re looking for a tool to monitor RabbitMQ, give Sematext Monitoring a try! There’s a 14-day free trial available for you to test all its features and see for yourself how it can help improve performance.
Sunny has been working in the data space for over seven years. He writes microservices to work with data at scale and has experience using a variety of databases, including MySQL, MongoDB, DynamoDB, Couchbase, Amazon Athena, and Apache HBase. He is currently working on a Customer 360 product that is intelligently unifying customer data from multiple sources at scale.