The world runs on data. Humans process it constantly: each sound we hear and each picture we see is data for our brain. The same goes for modern applications and algorithms: data is the fuel that allows them to function and provide useful features.
Even though such thinking is not new, what is new in recent years is the requirement to process large quantities of events in near real time. That’s why the technological stack grew from simple applications to whole processing pipelines. And the more complex the systems become, the less visibility we have into how they work, at least when not using proper tools.
One system that allows us to process large amounts of data is Apache Kafka – an open-source, distributed event streaming platform designed to stream massive amounts of data. However, as with everything, we need to monitor it to ensure that everything works well and is healthy. One of the most crucial metrics for Kafka and the systems using it is consumer lag. In this blog post, we will learn how to monitor it.
What Is Consumer Lag in Kafka?
Kafka Consumer Lag indicates how far Kafka consumers are behind Kafka producers, that is, how much of the written data has not been read yet.
When talking about Kafka, people typically refer to Kafka Brokers. You can think of a Kafka Broker as a Kafka server. A Broker is what actually stores and serves Kafka messages. Kafka Producers are applications that write messages into Kafka (Brokers). Kafka Consumers are applications that read messages from Kafka (Brokers).
Inside Brokers, data is stored in one or more Topics, and each Topic consists of one or more Partitions. When writing data, a Broker actually writes it into a specific Partition. As it writes data, it keeps track of the last “write position” in each Partition. This is called the Latest Offset, also known as the Log End Offset. Each Partition has its own independent Latest Offset.
Consumer Group Offset
Just like Brokers keep track of their write position in each Partition, each Consumer keeps track of its “read position” in each Partition whose data it is consuming. That is, it keeps track of which data it has read. This is known as the Consumer Offset. This Consumer Offset is periodically persisted (to ZooKeeper in old Kafka versions, or to the internal __consumer_offsets topic in Kafka itself), so it can survive Consumer crashes or unclean shutdowns and avoid re-consuming too much old data.
When a new consumer group starts consuming data from Kafka, it has no committed offset yet, so its starting position is determined by the consumer's auto.offset.reset setting (the latest offset by default, or the beginning of the partition when set to earliest). From then on, the group offset increases as data is read and the offset is committed, so that the consumer knows where it left off.
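To make the role of the committed offset more concrete, here is a minimal sketch in Python. It is a toy in-memory model, not the Kafka client API: the PartitionConsumer class, its methods, and the event names are hypothetical, and the consumer simply starts at offset 0. It shows why a restarted consumer re-reads only the records consumed after the last commit.

```python
class PartitionConsumer:
    """Toy model of one consumer reading one partition (not the Kafka client API)."""

    def __init__(self, committed_offset=0):
        # Offset of the next record to read; on start we resume from the last commit.
        self.position = committed_offset
        self.committed_offset = committed_offset

    def poll(self, log, max_records):
        """Read up to max_records from the in-memory log and advance the read position."""
        records = log[self.position:self.position + max_records]
        self.position += len(records)
        return records

    def commit(self):
        """Persist the current read position (Kafka stores this outside the consumer)."""
        self.committed_offset = self.position


log = [f"event-{i}" for i in range(10)]

consumer = PartitionConsumer()
consumer.poll(log, 4)   # reads event-0 .. event-3
consumer.commit()       # committed offset is now 4
consumer.poll(log, 3)   # reads event-4 .. event-6, then "crashes" before committing

# A replacement consumer resumes from the committed offset, not from the crash point,
# so event-4 .. event-6 are read a second time.
restarted = PartitionConsumer(consumer.committed_offset)
replayed = restarted.poll(log, 3)
print(replayed)  # ['event-4', 'event-5', 'event-6']
```

Committing more often shrinks the window of records that can be replayed after a crash, at the cost of more commit traffic.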
Why Is Consumer Lag Important?
Many applications today are based on processing (near) real-time data. Think about performance monitoring systems like Sematext Monitoring or log management tools like Sematext Logs. They continuously process infinite streams of near real-time data. If they were to show you metrics or logs with too much delay – if the Consumer Lag were too big – they’d be nearly useless. This Consumer Lag tells us how far behind each Consumer (Group) is in each Partition. The smaller the lag, the more real-time the data consumption.
How Is Kafka Consumer Lag Calculated?
The rate at which Consumers are consuming messages from Brokers should be roughly equal to the rate at which Brokers are writing messages created by Producers. Otherwise, the Consumers will fall behind. However, there will always be some delay between the moment a message is written and the moment it is consumed. Reads will always lag behind writes, and that is what we call Consumer Lag. The Consumer Lag is simply the delta between the Broker Latest Offset and the Consumer Offset, calculated separately for each Partition.
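The calculation can be sketched in a few lines of Python. The function name and the offset values below are made up for illustration; in a real setup the two dictionaries would be filled from Kafka (for example from the tooling described later in this post).

```python
def consumer_lag(latest_offsets, consumer_offsets):
    """Return per-partition lag: Latest (Log End) Offset minus Consumer Offset."""
    lag = {}
    for partition, latest in latest_offsets.items():
        committed = consumer_offsets.get(partition)
        # A partition without a committed offset has no defined lag yet.
        lag[partition] = None if committed is None else max(latest - committed, 0)
    return lag


# Hypothetical offsets for a topic with three partitions.
latest_offsets = {0: 1500, 1: 1480, 2: 1510}
consumer_offsets = {0: 1495, 1: 1200}  # nothing committed for partition 2 yet

lag = consumer_lag(latest_offsets, consumer_offsets)
total_lag = sum(value for value in lag.values() if value is not None)

print(lag)        # {0: 5, 1: 280, 2: None}
print(total_lag)  # 285
```

Keeping the per-partition values, and not only the total, matters because a single slow partition can hide behind a healthy-looking sum.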
What Causes Kafka Consumer Lag?
Many things can cause Kafka Consumer Lag, including:
- big jump in traffic resulting in producing way more Kafka messages
- poorly written code
- various software bugs and issues resulting in slow processing
- issues with the pipeline elements
- uneven load in Kafka partitions
Of course, the above list is just an example, but let’s look at one of the mentioned causes a bit deeper, the one about issues with the pipeline elements. Let’s imagine a very simplified architecture that looks like this:
We have the data sent to the Data Receiver that works as the Kafka Producer and sends the data to our Kafka Broker. Next, the Data Processor reads the messages from the Kafka Broker. The Data Processor is, in fact, the Kafka Consumer which, in addition to reading the data, also enriches it and finally writes it to the Data Store.
What can cause the Kafka Consumer Lag in this scenario? From a technical point of view, the cause can be in almost every part of this architecture, but to be very strict, the Consumer Lag is only present when the consumer cannot keep up. So in our case, if the Data Processor cannot read, process, and write the data to the Data Store at least as fast as it is written to the Kafka Broker by our Data Receiver, the Consumer Lag will start to grow. The reasons for that may vary from inefficient processing to issues with the Data Store, network issues, and many more. Basically, anything that slows down consuming data from the Kafka Broker will cause Consumer Lag and make the processing fall behind.
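The effect of a consumer that cannot keep up can be illustrated with a small simulation. This is a simplified sketch with constant, made-up rates (messages per second), not a model of any real pipeline.

```python
def simulate_lag(produce_rate, consume_rate, seconds, initial_lag=0):
    """Return the lag (in messages) at the end of each simulated second."""
    lag = initial_lag
    history = []
    for _ in range(seconds):
        lag += produce_rate            # messages written during this second
        lag -= min(consume_rate, lag)  # the consumer cannot read more than exists
        history.append(lag)
    return history


# The Data Processor is faster than the Data Receiver: lag stays at zero.
healthy = simulate_lag(produce_rate=1000, consume_rate=1200, seconds=5)

# The Data Processor is slowed down (for example by the Data Store): lag keeps growing.
slow = simulate_lag(produce_rate=1000, consume_rate=800, seconds=5)

print(healthy)  # [0, 0, 0, 0, 0]
print(slow)     # [200, 400, 600, 800, 1000]
```

Even a modest shortfall in consume rate accumulates without bound, which is why a steadily growing lag is a stronger warning sign than a lag that is merely non-zero.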
How to Fix Consumer Lag in Kafka?
There isn’t a simple answer to how to fix consumer lag in Kafka. There may not even be a general answer because it all depends on why the lag happened in the first place. We know of a few common causes, and I’ll try to discuss them and tell you what you can do in each case.
Poorly Written Code
If you know that the code responsible for consuming data from Kafka is poorly written, and you need a reliable solution that reads data from Kafka fast and without issues, then this won’t surprise you – you need to at least refactor the code.
There are various resources on how to approach that – one of them is, for example, the introduction to the New Consumer Client introduced with Kafka 0.9. It provides insights into how things work if you don’t know that and shows code fragments that can be incorporated. It is based on Kafka 0.9, though, so you may need to adjust when using recent Kafka versions, but at least you know where to start.
Software Bugs and Issues
Similar to the above point, you need to find and fix the issues in your code if they are the ones responsible for the Kafka Consumer lag. If you can’t find any other reasons, and everything else works well and the number of messages in Kafka is similar to what you expect, you may have bugs. Developers know all that, I know, but unit tests, pair programming, and code reviews really help find issues and correct them. So keep that in mind, and good luck!
Big Jump in Traffic
In some cases, we are not the ones to blame – the code works well, the whole pipeline works as intended, but still, issues may happen. You may be very successful and receive traffic far beyond what you ever imagined. In such cases, you may not be prepared to process such a tremendous amount of data and need to scale up. You will probably need more Kafka Consumers.
However, keep in mind that read parallelization may be limited by the number of partitions or the consumer implementation: within a consumer group, each partition is read by at most one consumer, so consumers beyond the number of partitions sit idle. If you can’t increase the number of partitions in your Kafka topic and introduce new consumers to parallelize the processing, maybe you should have a look at the Parallel Consumer implementation?
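The partition limit can be illustrated with a simplified round-robin assignment. This is only a sketch: the function and the consumer names are hypothetical, and Kafka's real assignors are more elaborate, but the constraint that a partition belongs to only one consumer in a group is the same.

```python
def assign_partitions(partitions, consumers):
    """Round-robin sketch: each partition goes to exactly one consumer in the group."""
    assignment = {consumer: [] for consumer in consumers}
    for index, partition in enumerate(partitions):
        assignment[consumers[index % len(consumers)]].append(partition)
    return assignment


# A topic with 3 partitions and a consumer group scaled up to 5 consumers.
assignment = assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4", "c5"])
idle = [consumer for consumer, parts in assignment.items() if not parts]

print(assignment)  # {'c1': [0], 'c2': [1], 'c3': [2], 'c4': [], 'c5': []}
print(idle)        # ['c4', 'c5']
```

Adding the fourth and fifth consumer does nothing for throughput here; only more partitions, or processing several records concurrently inside one consumer, would help.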
Issues with the Pipeline Elements
If the issues are in the pipeline, the key is to fix the issue. For example, even if your Kafka consumers are doing an amazing job and process everything in real time, you may not be able to write data to the data store that fast, so you must pause reading. First, make sure that your pipeline works again, and then decide whether you need to catch up faster or not. If you don’t, just wait for things to settle after fixing the pipeline. If you do need to catch up, you may need additional resources, just as mentioned in the Big Jump in Traffic section.
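Whether you need to catch up faster can be estimated with simple arithmetic: the backlog shrinks only by the difference between the consume rate and the produce rate. The sketch below assumes constant rates, and all the numbers are made up.

```python
def catch_up_seconds(backlog, produce_rate, consume_rate):
    """Estimate the time needed to drain a backlog, assuming constant rates.

    Returns None when consumers are not faster than producers, because the
    backlog then never shrinks.
    """
    surplus = consume_rate - produce_rate
    if surplus <= 0:
        return None
    return backlog / surplus


# Hypothetical outage: 600,000 messages piled up while the data store was down.
current = catch_up_seconds(600_000, produce_rate=1000, consume_rate=1200)
scaled = catch_up_seconds(600_000, produce_rate=1000, consume_rate=2000)
stuck = catch_up_seconds(600_000, produce_rate=1000, consume_rate=1000)

print(current)  # 3000.0 seconds, about 50 minutes
print(scaled)   # 600.0 seconds, about 10 minutes
print(stuck)    # None: the lag never goes away
```

If the first estimate is acceptable, waiting is enough; if it is not, the second one shows how much additional consume capacity buys.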
Uneven Load in Kafka Partitions
If the messages you are writing to Kafka use a key, the partition that will store the data is determined by the hash calculated based on the message key.
Suppose such a key is based on an identifier associated with the source, like the user identifier. If one of the sources is very noisy, the partition its key maps to will be loaded more than the others and will be processed more slowly. Such a situation is not very likely, but it can happen. Luckily you can mitigate such issues, for example, by using the Parallel Consumer or by repartitioning the Kafka topic to isolate the noisy data source.
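The skew caused by one noisy key is easy to reproduce. In the sketch below CRC32 is only a stand-in hash chosen to keep the example dependency-free (Kafka's default partitioner uses murmur2), and the traffic numbers are hypothetical.

```python
import zlib
from collections import Counter


def partition_for(key, num_partitions):
    """Map a message key to a partition using a hash of the key."""
    # Stand-in hash (CRC32); Kafka's default partitioner uses murmur2 instead.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


NUM_PARTITIONS = 6

# Hypothetical traffic: one noisy user sends 10,000 messages, 99 others send 100 each.
messages = ["user-0"] * 10_000
for user_id in range(1, 100):
    messages += [f"user-{user_id}"] * 100

load = Counter(partition_for(key, NUM_PARTITIONS) for key in messages)
hot_partition = partition_for("user-0", NUM_PARTITIONS)

for partition in range(NUM_PARTITIONS):
    marker = "  <- noisy source" if partition == hot_partition else ""
    print(f"partition {partition}: {load[partition]} messages{marker}")
```

All messages with the same key land in the same partition, so the consumer assigned to the hot partition has to process more than half of the total traffic on its own.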
How to Monitor Kafka Consumer Lag?
The basic way to monitor Kafka Consumer Lag is to use the Kafka command line tools and see the lag in the console. We can use the kafka-consumer-groups.sh script provided with Kafka and run a lag command similar to this one:
$ bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --describe --group console-consumer-15340
The result would be the lag for the provided consumer group. Here is a very simple example that uses the console consumer:
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
console-consumer-8551 example_blog 0 - 26 - consumer-console-consumer-8551-1-4f57353f-a040-4b4d-a13a-6514e9a245ec /127.0.0.1 consumer-console-consumer-8551-1
The column you are most interested in is LAG, which is the difference between LOG-END-OFFSET and CURRENT-OFFSET. If it is positive, there is a lag. (In the example above, CURRENT-OFFSET and LAG show “-”, which typically means the group has not committed an offset for that partition yet.) In most cases, if your Kafka Producer is actively producing messages and the Kafka Consumers are actively consuming, you will have a small lag here. This is expected. The problems start when the lag is significant or is constantly growing. That means that the data is not processed fast enough.
Using the console tools is possible and a viable solution if we have access to Kafka brokers, we are online when the issues are happening, and we know that the issues are happening. But without a proper monitoring solution, users may see the issues way before we notice them. By then, it may be too late to react and prevent a disaster. That’s why we need an observability solution like Sematext Monitoring.
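If you want to script around the console tool, its output can be turned into structured data. The sketch below is an assumption-laden example: it parses the sample output shown above, expects the first non-empty line to be the header, and relies on whitespace-separated columns. Real output may contain extra lines (warnings, several groups), which this does not handle.

```python
def parse_describe_output(text):
    """Parse `kafka-consumer-groups.sh --describe` text into a list of dicts."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    header = lines[0].split()
    rows = []
    for line in lines[1:]:
        row = dict(zip(header, line.split()))
        for column in ("PARTITION", "CURRENT-OFFSET", "LOG-END-OFFSET", "LAG"):
            # "-" means the value is not available, e.g. no committed offset yet.
            row[column] = None if row[column] == "-" else int(row[column])
        rows.append(row)
    return rows


sample = """
GROUP TOPIC PARTITION CURRENT-OFFSET LOG-END-OFFSET LAG CONSUMER-ID HOST CLIENT-ID
console-consumer-8551 example_blog 0 - 26 - consumer-console-consumer-8551-1-4f57353f-a040-4b4d-a13a-6514e9a245ec /127.0.0.1 consumer-console-consumer-8551-1
"""

rows = parse_describe_output(sample)
for row in rows:
    print(row["TOPIC"], row["PARTITION"], row["LOG-END-OFFSET"], row["LAG"])
# example_blog 0 26 None
```

A None in the LAG field should be treated as "unknown", not as zero, when feeding such data into alerts.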
Sematext Monitoring is one of the most comprehensive Kafka monitoring tools, capturing a number of the 200+ Kafka metrics, including Kafka Broker, Producer, and Consumer metrics. If the 200 Kafka metrics sound scary and overwhelming, you shouldn’t worry. Sematext Monitoring includes pre-built dashboards with metrics that you should really take care of and keep a close eye on. If you want to see everything, you can create a custom dashboard and choose the Kafka metrics you want and need to monitor.
To start monitoring, just create a Kafka monitoring application and follow the instructions in the documentation or the ones displayed on the screen. You’ll have your monitoring set up in seconds.
However, keep in mind that it is crucial to have full visibility into what is happening when using Kafka. To achieve that, monitoring your Kafka Producers, Kafka Brokers, and Kafka Consumers in a single Sematext Monitoring App is the optimal way to get the most out of the monitoring solution.
Look Beyond Kafka Consumer Lag
Kafka Consumer Lag and Broker Offset Changes
As we’ve just learned, the delta between the Latest Offset and the Consumer Offset is what gives us the Consumer Lag. But there is more to monitoring Kafka than the lag itself. In the above chart from Sematext, you may have noticed a few other metrics:
- Broker Write Rate
- Consume Rate
- Broker Earliest Offset Changes
The rate metrics are derived metrics. If you look at Kafka’s metrics, you won’t find them there. Under the hood, the open source Sematext agent collects a few Kafka metrics with various offsets from which these rates are computed. You can use the Sematext flexible dashboarding and the awesome chart builder to derive new metrics from the ones gathered by Sematext Agent to get even more visibility into what is happening in your environment.
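As a rough illustration of what "derived" means here, rates can be computed from periodic offset samples. This is a generic sketch, not a description of how the Sematext agent actually does it; the function name and the sample values are hypothetical.

```python
def rate_per_second(samples):
    """Average change per second between the first and last (timestamp, offset) sample."""
    (t0, first), (t1, last) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing timestamps")
    return (last - first) / (t1 - t0)


# Hypothetical samples for one partition: (seconds since start, offset).
latest_offset_samples = [(0, 10_000), (10, 22_000), (20, 34_000)]
consumer_offset_samples = [(0, 9_900), (10, 19_900), (20, 29_900)]

write_rate = rate_per_second(latest_offset_samples)      # Broker Write Rate
consume_rate = rate_per_second(consumer_offset_samples)  # Consume Rate
lag_growth = write_rate - consume_rate

print(write_rate)    # 1200.0
print(consume_rate)  # 1000.0
print(lag_growth)    # 200.0 messages per second of additional lag
```

Comparing the two rates shows not only that lag exists but also how quickly it is growing or shrinking.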
Avoid Consumer Lag with Sematext’s Kafka Monitoring Tools
If you need a good Kafka monitoring tool, give Sematext Monitoring a go. Ship your Kafka and other logs into Sematext Logs and you’ve got yourself a comprehensive DevOps solution that will make troubleshooting easy instead of dreadful. Read how to choose the best monitoring software for your use case from our alerting and monitoring guide.