How to Monitor MongoDB: Key Metrics to Measure for High Performance

November 9, 2022

Monitoring distributed systems like MongoDB is essential to ensure optimal performance and continued health. But even the best monitoring tool will not be effective unless you fully understand the metrics it gathers and presents: what they represent, how to interpret them, and what they affect. That’s why it is crucial not only to collect the metrics but also to understand them.

In this article we will look into the key MongoDB metrics that you should monitor to ensure the best performance, fault tolerance, and health of this NoSQL database.

What Is MongoDB?

MongoDB is a cross-platform, distributed NoSQL database that uses JSON-like documents and can work without a schema. It was designed to be highly scalable, fault tolerant, and highly available, with performance in mind, especially for large data sets. It was built with the cloud in mind and supports automatic scaling. It is usually a good choice for document-oriented use cases when you need quick prototyping or massive scale and your workload does not depend heavily on ACID transactions.

Why Monitor MongoDB?

Massive amounts of data and large numbers of users mean massive scale, and massive scale brings additional challenges. You need to keep your infrastructure and the software running on it under control, understand what is going on, be proactively alerted when things go bad, and be able to do a proper post-mortem analysis when things eventually fail. You need a complete view of the metrics, and you need to understand which of them are important, what they mean, and what they affect. MongoDB is no different.

Of course, a good monitoring solution helps keep an eye on the health of your clusters and the hardware running them. However, being able to look at the metrics themselves is not everything – you need to understand what they mean and how they affect your MongoDB instances and clusters.

MongoDB Performance Metrics

We can group MongoDB metrics into categories to help us identify the key components that can impact health and performance.

Cluster Operations and Connections Metrics

The first group contains the metrics related to MongoDB itself. They are crucial to understanding how your MongoDB instances and clusters are performing, their state, and if they need any action to keep your infrastructure healthy. Those metrics include:

  • Opcounters – the average rate of operations per second performed on your MongoDB instances. You should look at the general number and the operation type breakdown, which tells you the average number of write and read operations. There are no definitive bad values here, but a sudden, unexpected increase in operations of a given kind is worth investigating.
  • Operation execution time – the average operation time of a given type. You want the operations performed against the MongoDB cluster to be as fast as possible. Large values in the average operation execution time may point to performance degradation and issues and require investigation.
  • Query executors – the metric tells us the average rate per second of scanned documents during queries and query plan evaluation.
  • Query targeting – the metric represents the ratio between the number of scanned and retrieved documents. A high value usually indicates that the application using MongoDB can be improved as many documents are scanned, but only a small portion is returned.
  • Number of connections – this describes the number of clients connected to the MongoDB instance. An increase in the metric or a sudden spike usually means an unresponsive MongoDB instance or an issue with the connection logic in the application using it.
  • Queues – a set of metrics informing you about the number of operations waiting to be executed. An operation can wait to read the data, write the data, or obtain a lock. Large queues usually point out issues with application design. They can also mean that MongoDB can’t keep up with the load as it can no longer execute the usual number of operations in the given time frame.
  • Scan and order – the metric shows the number of operations that couldn’t use an index to perform the sort.
  • Locks time and count – metrics dedicated to different lock types showing the time needed to acquire the lock and the number of locks in the given time frame. Operations waiting for a longer period of time to obtain the lock will degrade MongoDB performance. This may point out various issues, such as poor query structure or insufficient memory, not to mention inefficient architecture.
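To make a few of the metrics above concrete, here is a minimal Python sketch that derives per-second operation rates and the query targeting ratio from two snapshots of the serverStatus command output. It assumes the cumulative counters found under opcounters, metrics.queryExecutor.scannedObjects, and metrics.document.returned. The snapshots are hand-written dictionaries with made-up numbers, and the helper names are hypothetical.

```python
def opcounter_rates(prev, curr, interval_s):
    """Per-second rate of each operation type between two snapshots."""
    return {
        op: (curr["opcounters"][op] - prev["opcounters"][op]) / interval_s
        for op in curr["opcounters"]
    }


def query_targeting(prev, curr):
    """Documents scanned per document returned between two snapshots."""
    scanned = (curr["metrics"]["queryExecutor"]["scannedObjects"]
               - prev["metrics"]["queryExecutor"]["scannedObjects"])
    returned = (curr["metrics"]["document"]["returned"]
                - prev["metrics"]["document"]["returned"])
    # No documents returned means the ratio is undefined for this interval.
    return scanned / returned if returned else None


# Illustrative snapshots taken 10 seconds apart (all numbers are made up).
before = {
    "opcounters": {"insert": 1000, "query": 5000, "update": 200, "delete": 50},
    "metrics": {
        "queryExecutor": {"scannedObjects": 100000},
        "document": {"returned": 20000},
    },
}
after = {
    "opcounters": {"insert": 1100, "query": 5600, "update": 230, "delete": 50},
    "metrics": {
        "queryExecutor": {"scannedObjects": 160000},
        "document": {"returned": 20600},
    },
}

rates = opcounter_rates(before, after, interval_s=10)
ratio = query_targeting(before, after)
print(rates)
print(ratio)
```

In this made-up interval, 60,000 documents were scanned to return 600, so the ratio is 100. A value that far above 1 is the kind of signal that usually suggests a missing or poorly matched index.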

System and Resource Utilization

The system and resource utilization metrics concern the hardware used to run your MongoDB instances and clusters. Certain hardware elements have corresponding metrics, such as:

CPU

Each request processed by MongoDB will require CPU cycles to be processed and completed. That’s why you need to monitor the overall CPU utilization of your MongoDB instances and clusters.

CPU usage can be a single number that shows the average over a given time period, but you can also break it down into areas such as:

  • User – the percentage of total CPU processing power spent on user-based execution, like applications. In short, it’s the user-space CPU usage.
  • System – the percentage of total CPU processing power spent on operating system-related execution.
  • Wait – the percentage of time spent waiting for resources, like disk or network.
  • And more.

The user part of the CPU usage will show what your MongoDB process needs. You should avoid situations where the CPU is constantly at 100% utilization, as this means the server is overloaded, which will increase your request processing time.
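As a rough illustration of how such a breakdown is computed, the sketch below turns two samples of cumulative CPU tick counters (similar in spirit to what Linux exposes in /proc/stat) into percentages. The counter names, the sample values, and the cpu_breakdown helper are assumptions made for the example.

```python
def cpu_breakdown(prev, curr):
    """Percentage of CPU time spent in each state between two samples."""
    deltas = {state: curr[state] - prev[state] for state in curr}
    total = sum(deltas.values())
    if total == 0:
        return {state: 0.0 for state in deltas}
    return {state: 100.0 * ticks / total for state, ticks in deltas.items()}


# Made-up cumulative tick counters from two consecutive samples.
prev_ticks = {"user": 4000, "system": 1000, "iowait": 500, "idle": 14500}
curr_ticks = {"user": 4600, "system": 1100, "iowait": 600, "idle": 14700}

usage = cpu_breakdown(prev_ticks, curr_ticks)
busy = 100.0 - usage["idle"]
print(usage, busy)
```

Here the user share dominates, which is what you would expect on a host where the MongoDB process does most of the work. A high wait share would instead point toward the disk or network.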

Memory

When working with MongoDB, you should consider the following three memory-related metrics:

  • Used memory – the memory occupied and used by the processes running on the monitored infrastructure element, meaning MongoDB instances and the processes required to run the environment. You should avoid situations where the used memory is at 100% or very close to that value, as it means that your MongoDB will not have enough memory space to work.
  • Free memory – the space that is not occupied and freely available for use by MongoDB and other system elements.
  • Swap memory – describes how much memory is written to the swap space. Higher values of this metric indicate that the instance is under-provisioned for your workload, which may lead to performance degradation.

Disk

Disk, or, in general, the I/O subsystem of your operating system, is where the data is stored and from which it is retrieved. That’s why it is crucial to monitor its condition and speed. From MongoDB’s perspective, you should keep an eye on at least the following information:

  • Free disk space – refers to the free space on the device MongoDB uses for data storage. You need to be sure that you have enough free space available on every MongoDB instance to allow for data growth.
  • Disk latency – tells you about the read and write latency, meaning the time needed to start retrieving the data from the disk or writing the data to the disk. If the latency is high, MongoDB will be slower, which can be especially visible if the amount of data is larger than the amount of free memory available.
  • Disk IOPS – shows the average number of I/O operations per second used by your MongoDB instance. You can use that metric to see if you are reaching the performance limits of your disks.
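The three disk metrics above can be approximated with very little code. The following Python sketch reads free space with the standard library's shutil.disk_usage and derives IOPS and average latency from two samples of cumulative I/O counters. The counter values are made up, the helper names are hypothetical, and the path "." stands in for whatever directory holds your MongoDB data files.

```python
import shutil


def free_space_percent(path):
    """Free space on the filesystem holding `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total if usage.total else 0.0


def iops(prev_ops, curr_ops, interval_s):
    """Average I/O operations per second between two counter samples."""
    return (curr_ops - prev_ops) / interval_s


def avg_latency_ms(prev_ms, curr_ms, prev_ops, curr_ops):
    """Average time per I/O operation, in milliseconds."""
    ops = curr_ops - prev_ops
    return (curr_ms - prev_ms) / ops if ops else 0.0


# Made-up cumulative counters sampled 60 seconds apart.
ops_before, ops_after = 120000, 126000   # completed I/O operations
ms_before, ms_after = 300000, 318000     # total time spent on I/O, in ms

disk_iops = iops(ops_before, ops_after, interval_s=60)
latency = avg_latency_ms(ms_before, ms_after, ops_before, ops_after)
free_pct = free_space_percent(".")
print(disk_iops, latency, free_pct)
```

Comparing the IOPS figure against the limit advertised for your disk or cloud volume shows how much headroom is left before the storage becomes the bottleneck.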

Replication Metrics

The replication metrics are important whenever you run more than a single MongoDB instance, which is the case every time you operate at scale and need high availability and fault tolerance from your infrastructure. You must monitor the health of the MongoDB replication to ensure that the cluster remains healthy and can serve the data without unnecessary delays. The most important replication-related metrics include:

  • Replication lag – describes the approximate number of seconds a secondary node is behind the primary in write operations. It is normal to see a small replication lag as the primary node will write the data faster in most cases, but a high number indicates issues. The reasons may vary, but when the secondary node cannot keep up with the primary’s writes, you must investigate.
  • Replication oplog window – is the approximate number of hours available in the primary’s replication oplog. The oplog is a special collection that keeps a rolling record of all the changes that are made to the data. If the replication lag of the secondary replica is higher than this metric, it will not be able to catch up with the primary. It will need a full resynchronization, meaning a full data transfer between the MongoDB instances.
  • Replication headroom – informs you about the difference between the primary replication oplog window and the secondary replication lag. If the value falls to 0, the secondary replica can go into the recovery procedure.
  • Oplog GB/hour – shows the average rate at which the primary writes to the oplog, in gigabytes per hour. Unexpectedly high volumes of oplog usually indicate insufficient write capabilities of your MongoDB nodes.
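The relationship between the oplog window, replication lag, and replication headroom is simple arithmetic, sketched below in Python. The inputs are the timestamps of the oldest and newest entries in the primary's oplog and of the last operation applied on a secondary. The timestamps are made up and the replication_headroom helper is hypothetical.

```python
from datetime import datetime, timedelta


def replication_headroom(oplog_first, oplog_last, secondary_last_applied):
    """Return (oplog window, replication lag, headroom) as timedeltas."""
    window = oplog_last - oplog_first          # how far back the oplog reaches
    lag = oplog_last - secondary_last_applied  # how far behind the secondary is
    return window, lag, window - lag


# Made-up timestamps: a 24-hour oplog and a secondary 30 seconds behind.
first_entry = datetime(2022, 11, 8, 12, 0, 0)
last_entry = datetime(2022, 11, 9, 12, 0, 0)
secondary_applied = datetime(2022, 11, 9, 11, 59, 30)

window, lag, headroom = replication_headroom(
    first_entry, last_entry, secondary_applied
)
print(window, lag, headroom)

# A headroom at or below zero means the secondary has fallen off the oplog.
needs_resync = headroom <= timedelta(0)
```

With a 24-hour window and a 30-second lag, the headroom is just under 24 hours, so the secondary is in no danger. An alert on headroom shrinking toward zero gives you time to act before a full resynchronization becomes necessary.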

Errors

While it’s not a metric, it is a very good idea to constantly monitor the number of errors happening inside your MongoDB cluster. You can expect some of them, but when the number of errors appearing in a given period is higher than usual, you should investigate the cause. It doesn’t mean that MongoDB is the one to blame. It can be something different, like a bug in the application code.

How to Monitor MongoDB

The number of metrics related to MongoDB can be overwhelming, especially when working with numerous MongoDB instances connected in a cluster. MongoDB provides command-line utilities that help you keep track of them, such as mongostat and mongotop. However, they have limited capabilities.
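Beyond mongostat and mongotop, the same counters can be read programmatically through the serverStatus command. The sketch below assumes a PyMongo-style client whose admin.command("serverStatus") call returns the status document. To stay runnable without a server, it uses a hypothetical FakeClient stub with made-up values. The collect_key_metrics helper is also hypothetical.

```python
def collect_key_metrics(client):
    """Pick a few connection, queue, and sort counters out of serverStatus."""
    status = client.admin.command("serverStatus")
    queue = status["globalLock"]["currentQueue"]
    return {
        "connections_current": status["connections"]["current"],
        "connections_available": status["connections"]["available"],
        "queued_readers": queue["readers"],
        "queued_writers": queue["writers"],
        "scan_and_order": status["metrics"]["operation"]["scanAndOrder"],
    }


class _FakeAdmin:
    """Stand-in for the admin database; returns a made-up status document."""

    def command(self, name):
        assert name == "serverStatus"
        return {
            "connections": {"current": 42, "available": 51158},
            "globalLock": {"currentQueue": {"total": 3, "readers": 1, "writers": 2}},
            "metrics": {"operation": {"scanAndOrder": 7}},
        }


class FakeClient:
    """Hypothetical stub so the example runs without a MongoDB server."""

    admin = _FakeAdmin()


# With a real deployment you would pass a pymongo.MongoClient here instead,
# for example MongoClient("mongodb://localhost:27017") (URI is an assumption).
metrics = collect_key_metrics(FakeClient())
print(metrics)
```

Polling a function like this on a fixed interval and storing the results is, in essence, what a monitoring agent does, and it shows why a dedicated tool becomes attractive once you have more than a handful of instances.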

When it comes to monitoring, it doesn’t really matter if you are using a single MongoDB instance, a few independent ones, or multiple clusters that form a complicated and scalable distributed system: you need to keep an eye on their health and performance. A good MongoDB monitoring solution should provide all the necessary metrics, help with root cause analysis, and allow correlation of metrics, logs, and traces. It should be user-friendly, fast, and easy to use. We have already reviewed such solutions in our article about the best MongoDB monitoring tools.

Monitor MongoDB Metrics with Sematext

Sematext Monitoring is a full-stack observability platform with advanced capabilities dedicated to MongoDB monitoring. Easy to set up and intuitive, it provides out-of-the-box dashboards mapping out all the necessary MongoDB metrics and the infrastructure elements supporting it, such as the CPU, memory, and network. Sematext features a powerful alerting system that allows you to create alerts for crucial MongoDB resources. As soon as an issue happens, it will notify you via your chosen notification channel from the many available, including e-mail, Slack, and custom webhooks.

If you want to learn more about how Sematext Monitoring can help, start the 14-day free trial.

Conclusion

There are various MongoDB metrics available to help you ensure the health and performance of your clusters. Understanding them is crucial to diagnosing issues and seeing what your MongoDB instance is doing and how the clusters are performing. They enable you to troubleshoot when things go wrong and restore health to your infrastructure. To reap the full benefits of metrics, you need a proper monitoring tool that allows you to correlate them and analyze their historical values. Sematext Monitoring checks all these boxes and more.
