Kafka Consumer Lag Monitoring

SPM is one of the most comprehensive Kafka monitoring solutions, capturing some 200 Kafka metrics, including Kafka Broker, Producer, and Consumer metrics. While lots of those metrics are useful, there is one particular metric everyone wants to monitor – Consumer Lag.

What is Consumer Lag

When people talk about Kafka or about a Kafka cluster, they are typically referring to Kafka Brokers. You can think of a Kafka Broker as a Kafka server. A Broker is what actually stores and serves Kafka messages. Kafka Producers are applications that write messages into Kafka (Brokers). Kafka Consumers are applications that read messages from Kafka (Brokers).

Inside Kafka Brokers data is stored in one or more Topics, and each Topic consists of one or more Partitions. When writing data a Broker actually writes it into a specific Partition. As it writes data it keeps track of the last “write position” in each Partition. This is called Latest Offset also known as Log End Offset. Each Partition has its own independent Latest Offset.

Just like Brokers keep track of their write position in each Partition, each Consumer keeps track of “read position” in each Partition whose data it is consuming. That is, it keeps track of which data it has read. This is known as Consumer Offset. This Consumer Offset is periodically persisted (to ZooKeeper or a special Topic in Kafka itself) so it can survive Consumer crashes or unclean shutdowns and avoid re-consuming too much old data.

Kafka Consumer Lag and Read/Write Rates
Kafka Consumer Lag and Read/Write Rates

In our diagram above we can see yellow bars, which represents the rate at which Brokers are writing messages created by Producers.  The orange bars represent the rate at which Consumers are consuming messages from Brokers. The rates look roughly equal – and they need to be, otherwise the Consumers will fall behind.  However, there is always going to be some delay between the moment a message is written and the moment it is consumed. Reads are always going to be lagging behind writes, and that is what we call Consumer Lag. The Consumer Lag is simply the delta between the Latest Offset and Consumer Offset.

Why is Consumer Lag Important

Many applications today are based on being able to process (near) real-time data. Think about performance monitoring system like SPM or log management service like Logsene. They continuously process infinite streams of near real-time data. If they were to show you metrics or logs with too much delay – if the Consumer Lag were too big – they’d be nearly useless.  This Consumer Lag tells us how far behind each Consumer (Group) is in each Partition.  The smaller the lag the more real-time the data consumption.

Monitoring Read and Write Rates

Kafka Consumer Lag and Broker Offset Changes
Kafka Consumer Lag and Broker Offset Changes

As we just learned the delta between the Latest Offset and the Consumer Offset is what gives us the Consumer Lag.  In the above chart from SPM you may have noticed a few other metrics:

  • Broker Write Rate
  • Consume Rate
  • Broker Earliest Offset Changes

The rate metrics are derived metrics.  If you look at Kafka’s metrics you won’t find them there.  Under the hood SPM collects a few metrics with various offsets from which these rates are computed.  In addition, it charts Broker Earliest Offset Changes, which is  the earliest known offset in each Broker’s Partition.  Put another way, this offset is the offset of the oldest message in a Partition.  While this offset alone may not be super useful, knowing how it’s changing could be handy when things go awry.  Data in Kafka has has a certain TTL (Time To Live) to allow for easy purging of old data.  This purging is performed by Kafka itself.  Every time such purging kicks in the offset of the oldest data changes.  SPM’s Broker Earliest Offset Change surfaces this information for your monitoring pleasure.  This metric gives you an idea how often purges are happening and how many messages they’ve removed each time they ran.

There are several Kafka monitoring tools out there that, like LinkedIn’s Burrow, whose Offset and Consumer Lag monitoring approach is used in SPM.  If you need a good Kafka monitoring solution, give SPM a go.  Ship your Kafka and other logs into Logsene and you’ve got yourself a DevOps solution that will make troubleshooting easy instead of dreadful.

 

Sematext Elastic Stack Training

Elasticsearch / Elastic Stack Training – NYC June 13-16

Next month, June 13-16, 2016, we will be running three Elastic Stack (aka ELK Stack) classes in New York City:

  1. June 13 & 14: Elasticsearch for Developers Training Workshop
  2. June 15: Elasticsearch Operations Training Workshop
  3. June 16: Elasticsearch for Logging Training Workshop

All classes cover Elasticsearch 2.x as well as Elasticsearch 5.x!

You can see the complete course outlines under Training Overview.  All three classes include lots of valuable hands-on exercises.  Be prepared to learn a lot!

Cost:

  • 2-day course: $1,200 early bird rate (valid through June 1) and $1,500 afterwards.
  • 1-day course: $700 early bird rate (valid through June 1) and $800 afterwards.

There’s also a 50% discount for the purchase of a 2nd seat!

Location:
462 7th Avenue, New York, NY 10018 – see map

If you have any questions please get in touch.

Sematext Solr Training

Apache Solr Training in NYC June 13-14

If you’ve missed our Core Solr training in October 2015 in New York, here is another chance – we’re running the 2-day Core Solr class again next month – June 13 & 14, 2016.
This course covers Solr 5.x as well as Solr 6.x!  You can see the complete course outline under Solr & Elasticsearch Training Overview . The course is very comprehensive — it includes over 20 chapters and lots of hands-on exercises.  Be prepared to learn a lot!

Cost:
$1,200 early bird rate (valid through June 1) and $1,500 afterwards.

There’s also a 50% discount for the purchase of a 2nd seat!

Location:

462 7th Avenue, New York, NY 10018 – see map

If you have any questions please get in touch.  To sign up, just register here.

technology_partners_monitoring_image_0 (2)

Sematext is Docker Ecosystem Technology Partner (ETP) for Monitoring

technology_partners_monitoring_image_0 (2)May 5 2016 — Sematext, a global, Brooklyn-based products and services company that builds innovative Cloud and On Premises solutions for application performance monitoring, log management and analytics, today announced that it has been recognized by Docker as the Ecosystem Technology Partner (ETP) for monitoring and logging. This designation indicates that SPM Performance Monitoring and Logsene have demonstrated working integration with the Docker platform via the Docker API and are available to users and organizations that seek solutions to monitor their Dockerized distributed applications.

Sematext Docker Agent is extremely easy to deploy on  Docker Swarm, Docker Cloud and Docker Datacenter. It discovers new and existing containers, collects Docker performance metrics, events and logs, and runs in a tiny container on every Docker Host. In addition to standard log collection functionality the agent performs automatic log format detection and field extraction for a number of log formats, including Docker Swarm, Elasticsearch, Solr, Nginx, Apache, MongoDB, Kubernetes, etc.  

Sematext Docker Agent

Many organizations invest a lot of time in monitoring and logging setups because monitoring and logging changed dramatically with the introduction of Docker and related orchestration tools. We’ve observed that organizations and teams that use different tools for logging and monitoring often have difficulties correlating logs, events and metrics. Sematext automates performance monitoring and logging for Docker. Operational insights  are provided in a single UI, which helps one efficiently correlate metrics, logs and events. Sematext Docker Agent detects many log formats and structures the logs automatically for analysis in Logsene.

We would like to congratulate Sematext on their inclusion into Docker’s Ecosystem Technology Partner program for logging and monitoring,” said Nick Stinemates, VP of Business Development and Technical Alliances. “The ETP program recognizes organizations like Sematext that have demonstrated integration with the Docker platform to provide users with intelligent insights and increased visibility into their Dockerized environments. The goal is to provide users with the data needed to ensure the highest degree of availability and performance for all their business-critical applications”.

Sematext SPM is available at http://sematext.com/spm

About Sematext

Sematext Group, Inc. is a global, Brooklyn-based products and services company that builds innovative Cloud and On Premises solutions for application performance monitoring, log management and analytics, and site search analytics. Sematext Docker Agent is extremely easy to deploy; it collects Docker performance metrics, events and logs and runs in a container on every Docker Host. In addition to standard log collection functionality the agent performs automatic log format detection and field extraction for a number of log formats.  Besides monitoring Docker, Sematext SPM agents also monitor applications running inside and outside containers, such as Elasticsearch, Nginx, Apache, Kafka, Cassandra, Spark, Node.js, MongoDB, Solr, MySQL, etc.

Sematext also provides professional services around Elasticsearch, the ELK / Elastic Stack, and Apache Solr – Consulting, Training, and Production Support.

Contacts: press@sematext.com

Elasticsearch Training in London

3 Elasticsearch Classes in London

 

es-training

Elasticsearch for Developers ……. April 4-5

Elasticsearch for Logging ……… April 6

Elasticsearch Operations …….  April 6

All classes cover Elasticsearch 2.x

Hands-on — lab exercises follow each class section

Early bird pricing until February 29

Add a second seat for 50% off

Register_Now_2

Course overviews are on our Elasticsearch Training page.

Want a training in your city or on-site?  Let us know!

Attendees in all three workshops will go through several sequences of short lectures followed by interactive, group, hands-on exercises. There will be Q&A sessions in each workshop after each such lecture-practicum block.

Got any questions or suggestions for the course? Just drop us a line or hit us @sematext!

Lastly, if you can’t make it…watch this space or follow @sematext — we’ll be adding more Elasticsearch training workshops in the US, Europe and possibly other locations in the coming months.  We are also known worldwide for Elasticsearch Consulting Services, and Elasticsearch Production Support.
We hope to see you in London in April!

Core Solr Training in London

solr-training

April 4 & 5 — Covers Solr 5.x

Hands-on — lab exercises follow each class section

Early bird pricing until February 29

Add a second seat for 50% off

Sematext is running a 2-day, very comprehensive, hands-on workshop in London on April 4 & 5 for Developers and DevOps who want to configure, tune and manage Solr at scale.

The workshop will be taught by Sematext engineer — and author of Solr books — Rafał Kuć. Attendees will go through several sequences of short lectures followed by interactive, group, hands-on exercises. There will be a Q&A session after each such lecture-practicum block.  See details, including training overview.

 

Register_Now_2

Target audience:

Developers who want to configure, tune and manage Solr at scale and learn about a wide array of Solr features this training covers in its 23 chapters – we mean it when we say this is comprehensive!

What you’ll get out of it:

In two days of training Rafal will:

  1. Bring Solr novices to the level where he/she would be comfortable with taking Solr to production
  2. Give experienced Solr users proven and practical advice based on years of experience designing, tuning, and operating numerous Solr clusters to help with their most advanced and pressing issues

When & Where:

  • Dates: April 4-5 (Monday & Tuesday)
  • Time: 9:00 am to 5:00 pm
  • Place: Imparando City of London Training Centre — 56 Commercial Road, Aldgate, London, E1 1LP (see map)
  • Cost: GBP £845.00 “Early Bird” rate (valid through February 29) and GBP £1.045.00 afterward.  There’s also a 50% discount for the purchase of a 2nd seat! (limit of 1 discounted seat per full-price seat)
  • Food/Drinks: Light morning & afternoon refreshments and Lunch will be provided

Got any questions or suggestions for the course? Just drop us a line or hit us @sematext!

Lastly, if you can’t make it…watch this space or follow @sematext — we’ll be adding more Solr training workshops in the US, Europe and possibly other locations in the coming months.  We are also known worldwide for our Solr Consulting Services and Solr Production Support.

Register_Now_2

Hope to see you in the London in April!  See detailed info about this training.

 

 

2015 in Review

Another year is behind us, and it’s been another good year for us at Sematext.  Here are the highlights in the chronological order.  If you prefer looking non-chronological overview, look further below.

January

We started the year by doing a ton of publishing on the blog – about Solr-Redis, about SPM and Slack, about Solr vs. Elasticsearch – always a popular topic, Spark, Kafka, Cassandra, Solr, etc.  Logsene being ELK as a Service means we made sure users have the freedom and flexibility to create custom Elasticsearch Index Templates in Logsene.

February

We added Account Sharing to all our products, thus making it easier to share SPM, Logsene, and Site Search Analytics apps by teams.  We made a big contribution to Kafka 0.8.2 by reworking pretty much all Kafka metrics and making them much more useful and consumable by Kafka monitoring agents.  We also added support for HAProxy monitoring to SPM.

March

We announced Node.js / io.js monitoring.  This was a release of our first Node.js-based monitoring monitoring agent – spm-agent-nodejs, and our first open-source agent.  The development of this agent resulted in creation of spm-agent an extensible framework for Node.js-based monitoring agents.  HBase is one of those systems with tons of metrics and with metrics that change a lot from release to release, so we updated our HBase monitoring support for HBase 0.98.

April

The SPM REST API was announced in April, and a couple of weeks later the spm-metrics-js npm module for sending custom metrics from Node.js apps to SPM was released on Github.

May

A number of us from several different countries gathered in Krakow in May.  The excuse was to give a talk about Tuning Elasticsearch Indexing Pipeline for Logs at GeeCon and give away our eBook – Log Management & Analytics – A Quick Guide to Logging Basics while sponsoring GeeCon, but in reality it was really more about Żubrówka and Vișinată, it turned out.  Sematext grew a little in May with 3 engineers from 3 countries joining us in a span of a couple of weeks.  We were only dozen people before that, so this was decent growth for us.

Right after Krakow some of us went to Berlin to give another talk: Solr and Elasticsearch – Side by Side with Elasticsearch and Solr: Performance and Scalability.  While in Berlin we held our first public Elasticsearch training and, following that, quickly hopped over to Hamburg to give a talk at a local search meetup.

June

In June we gave a talk on the other side of the Atlantic – in NYC – Beyond POC: Processing Metrics, Logs and Traces … at Scale.  We were conference sponsors there as well and took part in the panel about microservices.  We published our second eBook – Elasticsearch Monitoring Essentials eBook.  The two most important June happenings were the announcement of Docker monitoring – SPM for Docker – our solution for monitoring Docker containers, as well as complete, seamless integration of Kibana 4 into Logsene.  We’ve added Servers View to SPM and Logsene got much needed Alerting and Anomaly Detection, as well as Saved Searches and Scheduled Reporting.

July

In July we announced public Solr and Elasticsearch trainings, both in New York City, scheduled for October.  We built and open-sourced Logsene Command Line Interfacelogsene-cli – and we added Tomcat monitoring integration to SPM.

August

At Sematext we use Akka, among other things, and in August we introduced Akka monitoring integration for SPM and open-sourced the Kamon backend for SPM.  We also worked on and announced Transaction Tracing that lets you easily find slow transactions and bottlenecks that caused their slowness, along with AppMaps, which are a wonderful way to visualize all your infrastructure along applications running on it and see, in real-time, which apps and servers are communicating, how much, how often there are errors in each app, and so on.

September

In September we held our first 2 webinars on Docker Monitoring and Docker Logging.  You can watch them both in Sematext’s YouTube channel.

October

We presented From zero to production hero: Log Analysis with Elasticsearch at O’Reilly’s Velocity conference in New York and then Large Scale Log Analytics with Solr at Lucene/Solr Revolution in Austin.  After Texas we came back to New York for our Solr and Elasticsearch trainings.

November

Logsene users got Live Tail in November, while SPM users welcomed the new Top Database Operations report.  Live Tail comes in very handy when you want to semi-passively watch out for errors (or other types of logs) without having to constantly search for them.  While most SPM users have SPM monitoring agents on their various backend components, Top Database Operations gives them the ability to gain more insight in performance between the front-end/web applications and backend servers like Solr, Elasticsearch, or other databases by putting the monitoring agents on applications that act as clients for those backend services.  We worked with O’Reilly and produced a 3-hour Working with Elasticsearch Training Video.

December

We finished off 2015 by adding MongoDB monitoring to SPM, joining Docker’s ETP Program for Logging, further integrating monitoring and logging, ensuring Logsene works with Grafana, writing about monitoring Solr on Docker, publishing the popular Top 10 Node.js Metrics to Watch, as well as a SPM vs. New Relic APM comparison.  

Pivoting the above and grouping it by our products and services:

Logsene:

  • Live Tail
  • Alerting
  • Anomaly Detection
  • logsene-cli + logsene.js + logagent-js
  • Saved Searches
  • Scheduled Email Reporting
  • Integrated Kibana
  • Compatibility with Grafana
  • Search AutoComplete
  • Powerful click-and-filter
  • Native charting of numerical fields
  • Account Sharing
  • REST API

 

SPM:

  • Transaction Tracing
  • SPM Tracing API
  • AppMap
  • NetMap
  • On Demand Profiling
  • Integration with Logsene
  • Expanded monitoring for Elasticsearch, Solr, HBase, and Kafka
  • Added monitoring for Docker, Node.js, Akka, MongoDB, HAProxy, and Tomcat
  • Birds Eye Servers View
  • Account Sharing
  • REST API

 

Webinars:

  • Docker Monitoring
  • Docker Logging

 

Trainings:

  • Elasticsearch training in Berlin
  • Solr and Elasticsearch trainings in New York

 

eBooks:

  • Elasticsearch Monitoring Essentials
  • Log Management & Analytics – A Quick Guide to Logging Basics

 

Talks / Presentations / Conferences:

  • Lucene/Solr Revolution, Austin, TX – Large Scale Log Analytics with Solr
  • Velocity Conference, NYC, NY – Log Analysis with Elasticsearch
  • Berlin Buzzwords, Berlin, Germany – Side by Side with Elasticsearch and Solr: Performance and Scalability
  • GeeCon, Krakow, Poland – Tuning Elasticsearch Indexing Pipeline for Logs
  • DevOps Days, Warsaw, Poland – Running High Performance and Fault Tolerant Elasticsearch Clusters on Docker
  • DevOps Expo, NYC, NY – Process Metrics, Logs, and Traces at Scale

 

Trends:

All numbers are up  – our SPM and Logsene signups are up, product revenue is up a few hundred percent from last year, we’ve nearly doubled our blogging volume, our site traffic is up,we’ve made several UI-level facelifts for both apps.sematext.com and www.sematext.com, our team has grown, we’ve increased the number of our Solr and Elasticsearch Production Support customers, and we’ve added Solr and Elasticsearch Training to the list of our professional services.

 

Kafka Real-time Stream Multi-topic Catch up Trick

Half of the world, Sematext included, seems to be using Kafka.

Kafka is the spinal cord that connects various components in SPM, Site Search Analytics, and Logsene.  If Kafka breaks, we’re in trouble (but we have anomaly detection all over the place to catch issues early).  In many Kafka deployments, ours included, the most recent data is the most valuable.  Consider the case of Kafka in SPM, which processes massive amounts of performance metrics for monitoring applications and servers.  Clearly, in a performance monitoring system you primarily care about current performance numbers.  Thus, if SPM’s Kafka pipeline were to break and we restore it, what we’d really like to avoid is processing all data sequentially, oldest to newest.  What we’d prefer is processing new metrics data first and then processing older data using any spare capacity we have in order to “fill the gap” caused by Kafka downtime.

Here’s a very quick “video” that show this in action:

Kafka Catch Up
Kafka Catch Up

 

How does this work?

We asked about it back in 2013, but didn’t really get good tips.  Shortly after that we implemented the following logic that’s been working well for us, as you can see in the animation above.

The catch up logic assumes having multiple topics to consume from and one of these topics being the “active” topic to which producer is publishing messages. Consumer sets which topic is active, although Producer can also set it if it has not already been set. The active topic is set in ZooKeeper.

Consumer looks at the lag by looking at the timestamp that Producer adds to each message published to Kafka. If the lag is over N minutes then Consumer starts paying attention to the offset.  If the offset starts getting smaller and keeps getting smaller M times in a row, then Consumer knows we are able to keep up (i.e. the offset is not getting bigger) and sets another topic as active. This signals to Producer to switch publishing to this new topic, while Consumer keeps consuming from all topics.

As the result, Consumer is able to consume both new data and the delayed/old data and avoid not having fresh data while we are in catch-up mode busy processing the backlog.  Consuming from one topic is what causes new data to be processed (this corresponds to the right-most part of the chart above “moving forward”), and consuming from the other topic is where we get data for filling in the gap.

If you run Kafka and want a good monitoring tool for Kafka, check out SPM for Kafka monitoring.