As the first part of a three-part series on Apache Kafka monitoring, this article explores which Kafka metrics are important to monitor and why. When monitoring Kafka, it’s important to also monitor ZooKeeper, since Kafka depends on it. The second part will cover open source Kafka monitoring tools and identify the techniques you need to further monitor and administer Kafka in production.
Why Monitor Kafka?
The main use of Apache Kafka is to transfer large volumes of real-time data (sometimes called data-in-motion) in a wide variety of forms. It’s typically used to gather data across space and time, between servers and even data centers, with reliable categorization and storage. Specifically, Kafka is used in two ways:
- To build out real-time data pipelines that reliably collect, stream and transform data between systems
- To process streams of real-time data using algorithms and analytics
To do so, Kafka clusters are broken into three main components: producers, consumers, and brokers. A broker is a server that runs the Kafka software, and there are one or more servers in your Kafka cluster. Producers publish data to topics that are processed by the brokers within your cluster. Finally, consumers listen for data sent to these topics and pull that data on their own schedule to do something with it.
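To make these roles concrete, here is a minimal producer sketch using the Java client; the broker address (localhost:9092) and topic name (events) are placeholder values. A consumer would subscribe to the same topic and poll records on its own schedule.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ProducerSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        // Publish one record to a placeholder topic; the brokers append it to the
        // topic's partitions, and consumers pull it whenever they poll.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("events", "key-1", "hello"));
        }
    }
}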
For each topic, Kafka maintains a log broken down into partitions, where each partition is an ordered, immutable sequence of records that is continually appended to. Every partition has a server that acts as a “leader” that handles all read and write requests for the partition while the followers passively replicate the leader. If the leader fails, one of the followers will automatically become the new leader. Replica partitions maintain duplicate copies of data, allowing automatic failover to these replicas when a server in the cluster fails so messages remain available in the presence of failures. Each server acts as a leader for some of its partitions and a follower for others so the load is well balanced within the cluster.
All of these Kafka components have their own metrics to be monitored, which break down into the following overall groups of metrics:
- Kafka Broker metrics
- JVM metrics
- Host/server metrics
Kafka moves data using a platform-independent protocol, and its challenges include language and platform differences, the volume and variety of data types, dealing with cloud-based components and communication, and the differences between servers and data centers used in a Kafka solution. Things to watch for include the timeliness of data delivery, overall application performance, knowing when to scale up, connectivity issues, and ensuring data is not lost. When planning a Kafka monitoring strategy, consider the different categories of components, including:
- Data producers and consumers (also called publishers and subscribers)
- Message size and data types
- Brokers and servers for data transfer
- Programming language and platforms
- Network architecture (e.g. cloud, hybrid IT)
- Open source packages deployed
Top 10 Kafka Metrics to Focus on First
Related to the concerns listed above, there are key metrics to monitor and track to help alert you if there is trouble. These can be broken into categories such as server-related metrics, message throughput, queue sizes and latency, and data consumer and connectivity errors. Keep in mind that with so many metrics available to monitor, data and application-specific differences, and varying network architecture per deployment, it’s difficult to provide a concise list for all cases. However, there are some specific metrics that prove useful over and over. Let’s look at some of these in detail.
Network Request Rate
Since the goal of Kafka brokers is to gather and move data for processing, they can also be sources of high network traffic. Monitor and compare the network throughput per server in each of your data centers and cloud providers, if possible, by tracking the number of network requests per second. When a particular broker’s network bandwidth goes above or below a threshold, it can indicate the need to scale up the number of brokers, or that some condition is causing latency. It can also indicate the need to implement (or modify) a consumer back-off protocol to handle the data rates more efficiently.
The specific metric type and name, exposed via JMX:
kafka.network:type=RequestMetrics,name=RequestsPerSec
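As a rough sketch of how such a metric can be read programmatically, the following assumes the broker exposes remote JMX (the host name broker-host and port 9999 are assumptions, enabled for example via the JMX_PORT environment variable when starting the broker). Note that RequestsPerSec is reported per request type, so the request=Produce tag below is one of several variants.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RequestRateCheck {
    public static void main(String[] args) throws Exception {
        // Connect to the broker's remote JMX endpoint (host and port are assumptions).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            // Request rates are broken out per request type (Produce, FetchConsumer, ...).
            ObjectName produceRequests = new ObjectName(
                    "kafka.network:type=RequestMetrics,name=RequestsPerSec,request=Produce");
            // Meter-style metrics expose rate attributes such as OneMinuteRate.
            Object rate = mbeans.getAttribute(produceRequests, "OneMinuteRate");
            System.out.println("Produce requests/sec (1-minute rate): " + rate);
        }
    }
}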
Network Error Rate
Cross-referencing network throughput with related network error rates can help diagnose the reasons for latency. Error conditions include dropped network packets, error rates in responses per request type, and the types of error(s) occurring. If network throughput decreases without an increase in error rate, this can indicate the need to scale data consumers up, whereas dropped packets can indicate a hardware issue.
The specific Kafka server metric type and name:
kafka.network:type=RequestMetrics,name=ErrorsPerSec
Under-replicated Partitions
To ensure data durability and that brokers are always available to deliver data, you can set a replication factor per topic as applicable. As a result, data will be replicated across more than one broker and remain available for processing even if a single broker fails. The Kafka UnderReplicatedPartitions metric alerts you to cases where one or more of a partition’s replicas have fallen out of sync with the leader, leaving fewer in-sync copies than the configured replication factor. As a rule, there should be no under-replicated partitions in a running Kafka deployment (meaning this value should always be zero), making this a very important metric to monitor and alert on.
The specific metric type and name:
kafka.server:type=ReplicaManager,name=UnderReplicatedPartitions
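Besides reading this gauge over JMX, you can spot under-replicated partitions from a client. The sketch below uses the Kafka AdminClient to compare each partition's in-sync replica set with its full replica set; the broker address and topic name (events) are placeholder values.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.TopicDescription;
import org.apache.kafka.common.TopicPartitionInfo;

public class UnderReplicatedCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription topic = admin.describeTopics(Collections.singletonList("events"))
                    .all().get().get("events");
            for (TopicPartitionInfo p : topic.partitions()) {
                // A partition is under-replicated when its in-sync replica set
                // is smaller than its full replica set.
                if (p.isr().size() < p.replicas().size()) {
                    System.out.println("Under-replicated: partition " + p.partition());
                }
            }
        }
    }
}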
Offline Partition Count
Offline partitions represent data stores unavailable to your applications due to a server failure or restart. In a Kafka cluster, one of the brokers serves as the controller, responsible for managing the states of partitions and replicas and for reassigning partitions when needed.
If partitions on specific servers go up and down due to factors such as cloud connectivity or other transient network events, increases in this count could indicate the need to increase partition replication (discussed earlier). If the offline partition count goes above zero, data in those partitions is unavailable until a new leader is elected, so treat any non-zero value as a problem requiring immediate attention; repeated occurrences can also signal the need to scale up the number of brokers.
Producers and consumers can only write to and read from a partition’s leader, so partitions without an active leader are neither readable nor writable.
The specific metric type and name:
kafka.controller:type=KafkaController,name=OfflinePartitionsCount – Number of partitions without an active leader, and therefore neither readable nor writable.
Total Broker Partitions
The previous two metrics indicate potential partition error conditions, but simply knowing how many partitions a broker is managing can help you avoid errors and know when it’s time to scale out. The goal should be to keep the count balanced across brokers.
The specific metric type and name:
kafka.server:type=ReplicaManager,name=PartitionCount – Number of partitions on this broker.
Log Flush Latency
Kafka stores data by appending to existing log files. Cache-based writes are flushed to physical storage asynchronously, based on a number of Kafka-internal factors, to optimize performance and durability. You can force the write to happen via the fsync system call, and the log.flush.interval.messages and log.flush.interval.ms settings tell Kafka exactly when to do this flush. However, the Kafka documentation recommends not setting these and instead relying on the operating system’s background flush, which is more efficient.
Your monitoring strategy for data durability should combine data replication (as described in the metric above) with the latency of the asynchronous disk log flush. Comparing the actual log flush time against the scheduled flush time yields the latency, which can indicate the need for more replication and scale, faster storage, or a hardware issue.
The specific metric type and name:
kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs
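As a sketch of how the flush timer can be sampled, the following reads it over remote JMX (host and port are assumptions, as before); the attribute names come from Kafka's Yammer-based JMX reporter and may differ slightly between versions.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class LogFlushLatencyCheck {
    public static void main(String[] args) throws Exception {
        // Connect to the broker's remote JMX endpoint (host and port are assumptions).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://broker-host:9999/jmxrmi");
        try (JMXConnector connector = JMXConnectorFactory.connect(url)) {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            ObjectName flushTimer = new ObjectName(
                    "kafka.log:type=LogFlushStats,name=LogFlushRateAndTimeMs");
            // This metric is a timer: rate attributes count flushes per second, while
            // percentile attributes report flush duration in milliseconds.
            System.out.println("Flush time, 99th percentile (ms): "
                    + mbeans.getAttribute(flushTimer, "99thPercentile"));
            System.out.println("Flush rate (1-minute, flushes/sec): "
                    + mbeans.getAttribute(flushTimer, "OneMinuteRate"));
        }
    }
}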
Consumer Message Rate
Set baselines for expected consumer message throughput and measure fluctuations in the rate to detect latency and the need to scale the number of consumers up or down accordingly. Cross-reference this data with bytes-per-second measurements and queue sizes (called max lag, see below) to get an indication of the root cause, such as messages that are too large. To change the Kafka maximum message size, use the topic-level max.message.bytes setting (message.max.bytes at the broker level), which specifies the largest record batch size Kafka will accept. If this is increased, the consumers’ fetch size might also need to be increased so that they can fetch record batches this large, as shown in the sketch after the metric below.
The specific metric type and name:
kafka.consumer:type=ConsumerTopicMetrics,name=MessagesPerSec,clientId=([-.\w]+) – Messages consumed per second.
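As a rough illustration of the fetch-size adjustment mentioned above, here is a minimal consumer configuration sketch; the broker address, group id, topic, and byte limits are placeholder values to adapt to your own deployment.

import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class FetchSizeConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed broker
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "metrics-demo");            // placeholder group
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // If the topic's max.message.bytes was raised, these consumer fetch limits may
        // also need raising so large record batches can be fetched (see the text above).
        props.put(ConsumerConfig.MAX_PARTITION_FETCH_BYTES_CONFIG, 4 * 1024 * 1024);  // per partition
        props.put(ConsumerConfig.FETCH_MAX_BYTES_CONFIG, 16 * 1024 * 1024);           // per fetch request
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // placeholder topic
        }
    }
}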
Consumer Max Lag
Even with consumers fetching messages at a high rate, producers can still outpace them. Understanding how large the queue of incoming messages grows per topic is important to accurately build out your real-time system. Measure and track the MaxLag metric for consumers to know precisely when the need to scale up arises. However, tools such as Sematext Kafka monitoring and others offer alternatives to Kafka’s MaxLag metric, with the goal of providing a better measurement. Remember that this metric works at the level of consumer and partition, meaning that each partition in each topic has its own lag for a given consumer.
The specific metric type and name:
kafka.consumer:type=ConsumerFetcherManager,name=MaxLag,clientId=([-.\w]+) – Number of messages by which the consumer lags behind the producer.
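As an alternative to reading MaxLag over JMX, lag can also be computed directly by subtracting each committed offset from the partition's latest offset. A sketch, assuming a broker at localhost:9092 and a placeholder consumer group named my-group:

import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // assumed broker address
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        try (AdminClient admin = AdminClient.create(props);
             KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Offsets the group has committed, per partition.
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("my-group")
                    .partitionsToOffsetAndMetadata().get();
            // Latest offsets for those partitions; lag = end offset - committed offset.
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(committed.keySet());
            committed.forEach((tp, offset) -> System.out.println(
                    tp + " lag=" + (endOffsets.get(tp) - offset.offset())));
        }
    }
}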
Fetcher Lag
This metric reports the lag, in number of messages, per follower replica, a sign that replication has slowed, stopped, or been interrupted. This is a major problem when dealing with the real-time processing of incoming data and should be monitored. The related replica.lag.time.max.ms configuration parameter sets the maximum time a follower can go without fetching from (or catching up to) the leader before it is removed from the in-sync replica set.
The specific metric type and name:
kafka.server:type=FetcherLagMetrics,name=ConsumerLag,clientId=([-.\w]+),topic=([-.\w]+),partition=([0-9]+)
Free Memory and Swap Space Usage
Kafka performance is best when swapping is kept to a minimum. To achieve this, set the JVM max heap size large enough to avoid frequent garbage collection activity, but small enough to leave room for filesystem caching. To size this properly, monitor for when server free memory drops below a threshold, as well as disk reads. If brokers need to read recently written data from disk instead of cache, it’s possible you need to tune the JVM memory settings, or you simply need more RAM. Additionally, if swap is enabled, watch for increases in server swapping activity, as this can lead to Kafka operation timeouts. Note that in many cases, it’s best to turn swap off entirely. Adjust your monitoring accordingly.
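As a simple illustration of heap monitoring, the sketch below reads heap usage through the JVM’s standard MemoryMXBean; run as-is it reports the current process’s heap, and for a broker the same java.lang:type=Memory MBean can be read over the broker’s remote JMX port.

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

public class HeapCheck {
    public static void main(String[] args) {
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memory.getHeapMemoryUsage();
        long usedMb = heap.getUsed() / (1024 * 1024);
        long maxMb = heap.getMax() / (1024 * 1024);
        // Alert when heap usage stays high: sustained pressure means frequent GC,
        // while an oversized heap leaves less RAM for the filesystem page cache.
        System.out.printf("Heap used: %d MB of %d MB (%.1f%%)%n",
                usedMb, maxMb, 100.0 * usedMb / maxMb);
    }
}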
Where to Go Next
Kafka is a great solution for real-time analytics due to its high throughput and durability in terms of message delivery. Keeping your Kafka cluster healthy may seem daunting due to the potentially large number of components and the high volume of data, but monitoring a few key metrics, like the ones presented here (and others you may find useful), is all that’s really needed. In the next part, we’ll take a look at some of the most useful open source tools available for Kafka cluster and application monitoring beyond Kafka Manager, and how to use them effectively.
About the Author
Eric Bruno is a writer and editor for multiple online publications, with more than 25 years of experience in the information technology community. He is a highly requested moderator and speaker for a variety of conferences and other events on topics spanning the technology spectrum, from the desktop to the data center. He has written articles, blogs, white papers and books on software architecture and development for more than a decade. He is also an enterprise architect, developer, and industry analyst with expertise in full lifecycle, large-scale software architecture, design, and development for companies all over the globe. His accomplishments span highly distributed system development, multi-tiered web development, real-time development, and transactional software development. See his editorial work online at www.ericbruno.com.