Apache ZooKeeper is a coordination service that many popular systems depend on. Your Kafka uses ZooKeeper, your HDFS uses it, your SolrCloud uses it, and your ClickHouse may be using it too. Wherever you run Apache ZooKeeper, it is usually a crucial piece of the infrastructure, so it needs to be reliable and fast.
ZooKeeper's reliability comes from its architecture and is achieved with a properly constructed ZooKeeper ensemble – the cluster of ZooKeeper nodes. Speed is something we need to be aware of and monitor constantly. We need to know how ZooKeeper performs, choose meaningful metrics, and have an observability solution that lets us look at those metrics in real time as well as do post-mortem analysis when needed.
In this article, I will go through the most crucial ZooKeeper metrics and some of the best ZooKeeper monitoring tools available to help ensure that your ZooKeeper ensemble nodes are healthy and fast.
Key ZooKeeper Metrics for Performance Monitoring
To get a sense of how Apache ZooKeeper performs, we need metrics. We need to know what is happening inside ZooKeeper, how the operating system is behaving, and what state the Java Virtual Machine running ZooKeeper is in.
ZooKeeper exposes many metrics, and since version 3.6.0 it has shipped with a new metrics system that makes monitoring easier. There are various ways of getting the metrics – via JMX, by using Prometheus with Grafana, InfluxDB, or even the Four Letter Words that ZooKeeper exposes for basic information. Whatever you choose, you should know which metrics are crucial and should be measured. Let’s dive in.
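As a rough illustration of the Four Letter Words route, the sketch below sends `mntr` over a plain TCP socket and parses the tab-separated reply into a dictionary. The host, port, and the sample output are assumptions made for illustration; on recent ZooKeeper versions the command also has to be allowed through the `4lw.commands.whitelist` setting before the server will answer it.

```python
import socket


def four_letter_word(host: str, port: int, command: str, timeout: float = 5.0) -> str:
    """Send a four-letter word (for example "mntr") and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mntr(raw: str) -> dict:
    """Turn "key<TAB>value" lines into a dict; numeric values become int or float."""
    metrics = {}
    for line in raw.splitlines():
        key, sep, value = line.partition("\t")
        if not sep:
            continue  # skip anything that is not a key/value line
        key, value = key.strip(), value.strip()
        try:
            metrics[key] = int(value)
        except ValueError:
            try:
                metrics[key] = float(value)
            except ValueError:
                metrics[key] = value  # keep non-numeric values as text
    return metrics


# Illustrative sample only; a real reply contains many more keys.
SAMPLE = (
    "zk_version\t3.6.3\n"
    "zk_avg_latency\t0.5\n"
    "zk_znode_count\t1200\n"
    "zk_server_state\tleader\n"
)

if __name__ == "__main__":
    print(parse_mntr(SAMPLE))
    # Against a live server (assumed here to listen on localhost:2181):
    # print(parse_mntr(four_letter_word("localhost", 2181, "mntr")))
```

Keeping the network call and the parsing in separate functions makes the parser easy to check against captured output without a running server.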
Operating System Metrics
The operating system works behind the scenes, but it plays a major role in ZooKeeper's health. We suggest monitoring the following metrics:
- CPU usage – ZooKeeper performance depends heavily on the CPU not being overloaded. You should dedicate at least two or three cores to ZooKeeper itself, covering both standard operations and JVM garbage collection. When the machine runs only ZooKeeper, CPU usage is usually low; if it stays above 70% for some time, other metrics are likely elevated as well, telling you that ZooKeeper is not performing as well as it could. Keep in mind that overall CPU usage is not the only CPU-related metric. Details such as wait time can point you in the right direction when dealing with performance issues, so ideally, your monitoring solution should show you a breakdown of those.
- Memory usage – the applications running on top of the operating system need memory. In general, memory usage should stay below 85% for the system to remain healthy. Correlate this metric with others, such as swap usage, to fully understand what is happening.
- Swap usage – JVM doesn’t like to have its memory swapped. If your operating system uses swap and it is the JVM memory that is written to the swap space, your ZooKeeper performance will suffer. This metric should be 0 if possible.
- Disk usage – you should have enough disk space not only for data but for the normal operations of the whole system, for example, logs, memory dumps, etc. Keep at least 20% of the disks free, just in case.
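The thresholds in the list above can be turned into a simple check. The following is a minimal sketch, not a monitoring agent: free disk space is read with the standard library, while CPU, memory, and swap percentages are passed in as arguments because the standard library has no portable way to read them (a third-party library such as psutil, or your monitoring agent, would supply them). The function and constant names are made up for this example.

```python
import shutil

# Thresholds taken from the list above, expressed as percentages.
CPU_MAX = 70.0
MEMORY_MAX = 85.0
SWAP_MAX = 0.0
DISK_FREE_MIN = 20.0


def disk_free_percent(path: str = "/") -> float:
    """Return the percentage of free space on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.free / usage.total


def os_warnings(cpu: float, memory: float, swap: float, disk_free: float) -> list:
    """Compare sampled percentages with the thresholds and list what is off."""
    warnings = []
    if cpu > CPU_MAX:
        warnings.append(f"CPU usage {cpu:.0f}% is above {CPU_MAX:.0f}%")
    if memory > MEMORY_MAX:
        warnings.append(f"memory usage {memory:.0f}% is above {MEMORY_MAX:.0f}%")
    if swap > SWAP_MAX:
        warnings.append(f"swap is in use ({swap:.0f}%)")
    if disk_free < DISK_FREE_MIN:
        warnings.append(f"only {disk_free:.0f}% of disk space is free")
    return warnings


if __name__ == "__main__":
    # CPU, memory, and swap values below are placeholders, not real readings.
    for warning in os_warnings(cpu=35.0, memory=60.0, swap=0.0,
                               disk_free=disk_free_percent(".")):
        print(warning)
```

A single sample says little on its own; as the list notes, what matters is a value that stays past its threshold for some time, so a real check would evaluate a window of samples.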
Java Virtual Machine Metrics
The Java Virtual Machine is the runtime environment used to run ZooKeeper. Thus it is extremely important that it remains operational and healthy. Some of the metrics worth monitoring are:
- Heap usage – the JVM heap usage shouldn’t go above 80% over a longer period of time – if it is constantly above 80%, you may start suffering from extensive garbage collection and eventually run into Java OutOfMemoryError situations.
- Garbage collection times – your garbage collection should be healthy. But what does healthy actually mean? It is not an easy question and we have a series of blog posts about garbage collection that will help you understand your use case.
- Number of threads – the average number of active JVM threads used by ZooKeeper. The more threads you have, the more resources, such as CPU cycles and memory, are used to keep them alive. It is hard to say what the proper value is, because it depends on the use case, but you should definitely monitor for anomalies, like sudden increases and drops.
Apache ZooKeeper Metrics
There are a few metrics that are worth looking into when it comes to ZooKeeper itself and those include:
- Number of ZooKeeper nodes – this metric should always be higher than 0, because you want at least one node to be operational. However, it also greatly depends on your use case. If your ensemble is built of 3 nodes, you need to have at least 2 that are running at the same time.
- Znode count – if the number of znodes becomes too large, you will start seeing performance degradation. The suggestion is to keep the number of znodes below 1,000,000.
- Active connections – the number of active connections shouldn’t be larger than around 85% of the allowed connections configured via maxClientCnxns. Note that this setting limits concurrent connections per client IP address, so compare against it per client.
- Znode memory usage – the total amount of memory used by ZooKeeper znodes shouldn’t be higher than 1GB.
- Average request latency – the average latency of the requests executed against ZooKeeper should be as low as possible. If the average latency over a period of time is high, for example above 100 milliseconds, you should look into your ZooKeeper instances and see what can be the potential reason for that.
- Average fsync time – like the average request latency, the fsync time is crucial, as it tells you how fast files are written to disk. It needs to be as low as possible; if it stays above 100 milliseconds on average over a period of time, for example one minute, you should be alerted.
- Open files – ZooKeeper doesn’t need a lot of files to be opened, so this metric should be relatively low, usually below 300. If it is above that value, have a look at the instances to see what those files are.
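Two of the rules above are easy to express in code: the quorum rule (a majority of the ensemble must be running) and the fixed limits on individual metrics. The sketch below assumes the metrics have already been collected into a dictionary. The `zk_*` key names follow what the `mntr` command reports, but treat them as assumptions and check them against your ZooKeeper version. The function names are hypothetical.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of running nodes that still forms a majority."""
    return ensemble_size // 2 + 1


def has_quorum(alive_nodes: int, ensemble_size: int) -> bool:
    """True when enough nodes are up for the ensemble to keep serving requests."""
    return alive_nodes >= quorum_size(ensemble_size)


# Upper limits taken from the list above.
LIMITS = {
    "zk_znode_count": 1_000_000,             # number of znodes
    "zk_approximate_data_size": 1024 ** 3,   # znode memory usage, 1 GB in bytes
    "zk_avg_latency": 100,                   # milliseconds
    "zk_open_file_descriptor_count": 300,    # open files
}


def breached(metrics: dict) -> list:
    """Return the names of the metrics whose value exceeds its limit."""
    return sorted(
        name for name, limit in LIMITS.items()
        if metrics.get(name, 0) > limit
    )


if __name__ == "__main__":
    print(quorum_size(3))  # a 3-node ensemble needs 2 running nodes
    print(breached({"zk_avg_latency": 250, "zk_znode_count": 10}))
```

With a 5-node ensemble the same formula gives a quorum of 3, so the ensemble tolerates two failed nodes.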
Top ZooKeeper Monitoring Tools
One thing is certain – you need visibility into your infrastructure, applications, performance, etc. This is why you need to monitor your ZooKeeper ensemble and, in particular, every node it is built from. There are different approaches to monitoring it.
You can use one of the full-fledged commercial tools that provide all the necessary features out of the box. These are managed by the vendor and require payment. This is usually a good solution if you don’t want to spend time managing a monitoring solution in-house.
On the other hand, there are awesome open source tools that can give you similar, if not the same, features but will require time and effort to manage them, keep them healthy, etc. In this section, I will compare the features, pros, cons, and pricing of the top ZooKeeper monitoring tools.
Sematext Monitoring is a full-stack monitoring tool with ZooKeeper monitoring available as a first-class citizen. Easy to set up and intuitive, it provides all that is necessary to have a complete overview of your ZooKeeper ensemble – an out-of-the-box dashboard with all the operating system, Java Virtual Machine, and ZooKeeper metrics that is also easy to customize to your needs. You can view ZooKeeper metrics and be alerted using the powerful alerting engine with anomaly detection and scheduling.
- Discovers ZooKeeper services automatically, which enables hands-off monitoring.
- Out-of-the-box key ZooKeeper metrics and performance monitoring.
- Automated discovery for both services and logs.
- Easy to deploy anywhere, from bare metal or virtual machines to Docker and Kubernetes.
- Monitor resource utilization, installed hardware and packages with IT inventory monitoring.
- Out-of-the-box parsing and dashboard for graphing virtually any data shipped to Sematext.
- Quick setup by installing lightweight, open-sourced, and pluggable agents.
- Out-of-the-box integrations, including MySQL, Apache Cassandra, and many more, allow you to have a single observability solution for your whole environment.
- Alerting with anomaly detection on both metrics and logs.
- Quick correlation of metrics, logs, and events for faster troubleshooting.
- Reliable IT inventory management system to unify package inventory used across all servers – installed packages and their versions, detailed server info, container image inventory, etc.
- Limited support for transaction tracing.
- Lack of full-featured profilers.
The pricing for each solution is straightforward and flexible – you can have a different plan depending on your needs. ZooKeeper monitoring is metered by the hour, making it suitable for dynamic environments that scale up and down. It starts at $0.005 per agent per hour for the standard plan with 7 days of data retention.
Prometheus & Grafana
The Prometheus + Grafana combo is the only open-source ZooKeeper monitoring solution on the list, but a very powerful one allowing you to easily get the most out of your monitoring. The setup is very easy: install the Prometheus exporter following the steps in the official ZooKeeper documentation, point your Grafana to it, and you are done. You can use the Prometheus + Grafana duo to monitor ZooKeeper as well as your whole environment. The best thing is that it is not bound to metrics but can handle logs and traces, giving you the single observability tool to manage it all.
- Enormous visualization and dashboarding capabilities allow you to slice and dice through the data in a visual order.
- Enables connection of various data sources for broader visibility giving you a single tool for observability of the whole environment.
- Supports the creation and managing of alerts based on your ZooKeeper metrics.
- Supports annotations on graphs to leave visual notes pointing to interesting data for easy understanding and communication.
- Open-source and easy to set up.
- Well-known and mature, both inside the open-source community and beyond.
- Extensible with lots of plugins available.
- Very powerful visualization and dashboarding for getting the critical information regarding your ZooKeeper ensemble.
- Large community with lots of integration examples available, making it easy to implement.
- Self-maintained, requiring housekeeping, with the cloud version also available.
There is no pricing associated with Prometheus and Grafana, at least in the basic and self-managed version – it is an open-source solution, and the only cost is the one paid for maintaining it in your environment.
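To give a feel for what Prometheus scrapes in the setup described above, here is a minimal sketch that parses the Prometheus text exposition format into a dictionary. It assumes ZooKeeper 3.6 or later with the built-in Prometheus metrics provider enabled in zoo.cfg (`metricsProvider.className=org.apache.zookeeper.metrics.prometheus.PrometheusMetricsProvider`), which serves metrics over HTTP on port 7000 by default. The sample text and the metric names in it are illustrative, and the parser is deliberately simplified.

```python
def parse_prometheus_text(text: str) -> dict:
    """Map each series (metric name plus optional labels) to its float value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        if "}" in line:
            # Labelled series: everything up to the closing brace is the key.
            head, _, rest = line.rpartition("}")
            series = head + "}"
        else:
            series, _, rest = line.partition(" ")
        fields = rest.split()
        if not fields:
            continue
        try:
            samples[series] = float(fields[0])  # an optional timestamp is ignored
        except ValueError:
            continue  # not a sample line this simplified parser understands
    return samples


# Illustrative sample only; a real endpoint returns far more series.
SAMPLE = """\
# HELP znode_count znode_count
# TYPE znode_count gauge
znode_count 1200.0
avg_latency 0.5
jvm_memory_bytes_used{area="heap",} 5.0E7
"""

if __name__ == "__main__":
    print(parse_prometheus_text(SAMPLE))
    # Against a live node (host and port are assumptions):
    # from urllib.request import urlopen
    # body = urlopen("http://localhost:7000/metrics", timeout=5).read().decode()
    # print(parse_prometheus_text(body))
```

In practice Prometheus does this scraping for you; a script like this is mainly useful for quickly confirming that a node exposes the metrics you expect before building Grafana dashboards on top of them.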
ManageEngine Applications Manager is a single, integrated application performance monitoring system for infrastructure and applications, including Apache ZooKeeper. It supports various infrastructure architectures like bare metal, virtual machines, and containers and includes a wide variety of application integrations. All of that combined with actionable alerts and robust reporting gives you a proper monitoring solution for your Apache ZooKeeper nodes and ensemble, including the whole infrastructure.
- Alerting engine with notifications support.
- Docker and Kubernetes integrations make it easy to monitor ZooKeeper running in different environments.
- Support for Apache ZooKeeper metrics when running on common cloud providers like Amazon Web Services, Microsoft Azure, Google Cloud Platform, and OpenStack.
- Customizable dashboards for better visibility allowing you to focus on key ZooKeeper metrics.
- Hotspot detection for quick insights into potential ZooKeeper problems.
- Great integration support with REST APIs.
- Lack of support for metrics and logs correlation.
- Limited number of ZooKeeper metrics available.
ManageEngine Applications Manager comes in two versions – Professional and Enterprise. The pricing depends on the selected version, the number of monitors, and the number of users of the product. A free version is also available.
Site24x7 is a cloud monitoring tool with an Apache ZooKeeper monitoring integration. It provides all the necessary metrics to get complete visibility into your Apache ZooKeeper ensemble. The service allows you to set up alerts based on advanced rules to limit alert fatigue and get immediate insights. You can monitor not only ZooKeeper but also your servers and over 50 additional technologies running inside your infrastructure, including commonly used technologies such as MySQL.
- Support for ZooKeeper’s ruok four letter word (a healthy server answers imok) for ensemble health checks.
- Server monitoring with support for Microsoft Windows and Linux, so you can monitor ZooKeeper no matter which runtime environment you choose.
- Performance metrics and log alerting capabilities for full visibility into ZooKeeper health.
- Cloud monitoring with support for hybrid cloud infrastructure to support the most common platforms where ZooKeeper is run.
- Quick and easy agent installation.
- Monitoring and complex alerting capabilities for a number of technologies besides ZooKeeper.
- Crucial ZooKeeper metrics support such as status, average latency, minimum and maximum session timeout and connections.
- Easy-to-build dashboards enable custom views into each component of your infrastructure.
- The large number of features can be too much for new users and too much just for monitoring ZooKeeper servers.
- Agent files must be modified manually to support additional metrics.
- Server monitoring support for a limited number of technologies.
Pricing varies depending on which parts of the product you want to use. The annual billing plan for APM starts at $35 per month for up to 3 applications, 40 servers and up to 500MB of logs. If you want to track the servers where ZooKeeper is running, you have to opt for infrastructure monitoring, which starts at 9 euros per month when billed annually for up to 10 servers, 500MB of logs, and 100K page views for a single site. For a monthly fee, you can purchase additional add-ons.
OpsView is an all-in-one monitoring system providing monitoring coverage for your cloud platform, on-premises infrastructure, containerized and virtualized environment, and applications including Apache ZooKeeper. With the Opspacks extensions you can monitor more than 200 infrastructure elements, with more than 2000 additional extensions available via the Nagios Exchange. Finally, OpsView is designed and architected to grow with you, so you don’t have to worry about the size of your infrastructure.
- Rich set of integrations and monitoring coverage to include metrics not only coming from ZooKeeper itself.
- Customizable dashboarding, including performance graphs and network maps for full visibility of the ZooKeeper ensemble, but also the runtime environment.
- Business service monitoring covers not only the technical metrics, but also the key business metrics.
- Alerts with notification support for mobile, e-mail, Slack, Twilio and more.
- Easy to set up.
- Clear and easy to navigate UI.
- Compatible with a rich set of Nagios plugins.
- Pricing available on demand.
The pricing is available on request, but the official pricing page mentions a starting price of 9 euros per host per month in the cloud version.
Instana ZooKeeper monitoring gives you the view of the necessary metrics for a single ZooKeeper node as well as the working ensemble. It doesn’t only focus on ZooKeeper metrics but also on data such as version, client, mode, and state in a distributed environment. Combined with out-of-the-box alerting, Instana provides a valuable observability solution for your ZooKeeper ensemble.
- Automated alerts for crucial ZooKeeper metrics available out of the box.
- Dependency map for real-time performance analysis.
- Support for root cause analysis to help you find the source of a problem efficiently.
- Support for tracing across all the requests generated by your applications.
- Predictable pricing with no surprises.
- Single agent required per host with automatic services discovery.
- Additional data available for ZooKeeper, such as mode, state and version.
- Rich set of additional integrations.
- Not tailored towards small companies, as a minimum of 10 monitored hosts is required.
The SaaS version of Instana costs $75 per host per month, with the minimum number of hosts set to 10, which gives it a starting price of $750/month. The self-hosted solution is priced at $93.80 with a minimum of 10 hosts.
Splunk APM, formerly SignalFx, provides a versatile observability platform with Apache ZooKeeper monitoring as a CollectD plugin. It gives you a view over the crucial ZooKeeper metrics no matter which environment you run your ZooKeeper on – whether it is a cloud environment or a private bare metal server.
- Service mapping for out-of-the-box visibility into service interactions so you can see and control which software connects to your ZooKeeper.
- Smart alerting with anomaly detection to reduce notification fatigue.
- Code-level performance analysis for quickly determining whether a problem lies in ZooKeeper or in the application using it.
- AI-driven analytics and troubleshooting for quick root cause analysis for your ZooKeeper metrics.
- Full support for OpenTelemetry limiting the vendor lock-in, so you can move your ZooKeeper monitoring elsewhere if you decide.
- Continuous code profiling analyzes the code-level performance with minimal overhead.
- The setup may be overwhelming to less experienced users looking only for ZooKeeper monitoring.
- Lack of detailed, out-of-the-box ZooKeeper dashboards.
SignalFx provides two types of pricing – host-based and usage-based. The host-based pricing starts at $55 per host per month when billed annually, with AlwaysOn Profiling starting at $3 a month for 1.7MB/min profiling volume.
Datadog is a comprehensive software as a service monitoring solution with ZooKeeper monitoring capabilities. Without an additional installation, it provides all the ZooKeeper metrics that you need to assess the health and performance of your ensemble. For improved visibility, you can easily configure Datadog to collect ZooKeeper logs. The tool features anomaly detection and alerting based on machine learning that instantly lets you know whenever your ZooKeeper is underperforming.
- End-to-end application monitoring with tracing with support for OpenTracing and OpenTelemetry for full ZooKeeper visibility.
- Code-level insights with easy root cause analysis for improved troubleshooting.
- Automatic containers and services discovery for containerized and orchestrated environments allowing you to monitor ZooKeeper no matter where it runs.
- Alerting with anomaly detection.
- ZooKeeper integration supported out of the box with the default agent package.
- Logs centralization and analysis, so you can monitor both ZooKeeper metrics and logs.
- Network and host monitoring to see the network utilization of your ZooKeeper ensemble.
- Dashboard framework for building ZooKeeper-tailored visualizations combining metrics and logs and sharing those with the team.
- The installation process can be overwhelming if you want to monitor anything beyond the basic ZooKeeper metrics, especially if you are a beginner.
- Few out-of-the-box dashboards compared to other ZooKeeper monitoring tools, which means that new users need to spend additional time learning the metrics and customizing dashboards before being able to use the solution.
Pricing depends on the part of the solution that you will go for. For basic ZooKeeper monitoring, you can use the application performance monitoring. This plan starts at $31/host per month when billed annually or $36/host per month when billed monthly.
AppDynamics offers monitoring tools with an Apache ZooKeeper addon in both a software as a service and an on-premise model. It offers support for ZooKeeper monitoring regardless of the environment you run your ensemble on. Besides tracking ZooKeeper metrics, you can also use AppDynamics to get information about your entire infrastructure, from VMs to containers, business performance metrics, and more.
- Alerting with email templating and periodic digest capabilities to bring detailed information about the cause of the issue, for ZooKeeper and beyond.
- Machine Learning supported anomaly detection and root cause analysis features for quick issue resolution.
- Insights down to the code level, making it easy to detect whether issues are related to ZooKeeper or the applications using it.
- Features end-user monitoring, including mobile and browser real-user, synthetic, and internet of things monitoring within a single solution.
- Detailed information about the environment, such as JVM application startup parameters or the JVM version, allows for better insights into ZooKeeper performance.
- Visibility into connections between system components, environment elements, endpoint response times, and business transactions – get to know what is using ZooKeeper and when.
- Advanced monitoring support for a variety of languages, like automatic leak detection and object instance tracking for Java applications.
- Code-level visibility and automated diagnostics for quickly diagnosing errors.
- Complicated, non-transparent and extremely high pricing. It targets large enterprises with more conventional high-touch sales strategies.
- Complex setup: manual operations such as downloading and starting the agent are required; no one-line installation and setup command.
- The lowest price plan does not include some of the basic metrics such as system CPU, memory, and network utilization, which leads to limited visibility into your ZooKeeper ensemble.
Pricing depends on the number of CPU cores and the plan variant. For US-based companies, the plan that includes application performance monitoring and infrastructure monitoring for full ZooKeeper visibility is listed at $60/month per CPU core.
Get Started with ZooKeeper Monitoring
ZooKeeper is a distributed system, and monitoring it is necessary if you want to ensure that your cluster stays healthy and fully operational. After all, it needs to provide all its features with the best performance possible so that the systems that rely on it don’t suffer. A ZooKeeper performance monitoring tool must provide metrics in three key areas – the operating system, the Java Virtual Machine, and ZooKeeper itself.