Kafka Open Source Monitoring Tools

April 8, 2019

Open-source software adoption continues to grow within enterprises (even for legacy applications), beyond just startups and born-in-the-cloud software.

The open-source Kafka monitoring tools covered here may not have the full-blown features of the Sematext Kafka monitoring integration or other SaaS tools, but they can hold their own just fine. We’ll explore what it takes to install, configure, and actually use each tool in a meaningful way.

The Lifesaving Guide to Kafka Monitoring

Sometimes it feels like managing your Kafka cluster is going to kill you. Kafka clusters are complex by nature, which makes it difficult to tell at a glance whether they’re healthy. At the same time, monitoring them and reacting properly, perhaps with automation, is critical. As we explored in part 1 of this series, Kafka metrics to monitor, the challenges include the varying sizes and types of data streams to monitor, the variety of servers and platforms they run on, and the highly distributed hybrid IT networking that connects them.

When searching for an open-source monitoring tool to help you, look for the following qualities:

  • The ability to monitor and manage multiple clusters
  • An easy, at-a-glance overview of cluster state
  • A clear illustration of how requests flow through the system
  • Aggregation and search across all of the metrics and data in your deployment
  • A combination of lower-level JVM and OS metrics with Kafka-specific metrics to help find correlations
  • Easy access to low-level statistics such as replication counts, consumer lag, queue sizes, and throughput and latency per broker
  • The ability to set up alerts

Let’s now explore some specific packages and how to use them. You can also jump to monitoring Kafka with Sematext if you are looking for an easy-to-use Kafka monitoring solution with alerting, dashboards, team support, and so on, and not just a tool.

Kafka Monitor

As described on the Kafka Monitor GitHub page, the goal of the Kafka Monitor framework is to make it as easy as possible to develop and execute long-running Kafka-specific system tests in real clusters and monitor application performance. It complements Kafka’s existing system tests by capturing issues that only appear after a cluster has been running for an extended period of time.

Kafka Monitor allows you to monitor a cluster using end-to-end pipelines to obtain vital statistics such as end-to-end latency, service availability and message loss rate. For example, to start Kafka Monitor and begin monitoring a cluster, use the following script where you add the parameters specific to your cluster:

./bin/single-cluster-monitor.sh --topic <topic> --broker-list <host:port> --zookeeper <host:port>

To monitor multiple clusters, all you need is to modify the multi-cluster-monitor.properties config file (within the config directory) with your cluster specific information and run the following script:

./bin/kafka-monitor-start.sh config/multi-cluster-monitor.properties

In his blog post on the history of open-sourcing Kafka Monitor, Dong Lin (also one of the main project contributors) describes the philosophy and design overview of the tool and useful tests to run. Out-of-the-box monitoring checks include those that measure availability, end-to-end latency, duplication rates, and message loss rates. The values from these tests can be easily viewed in a web interface as shown below.

[Image: Kafka Monitor web interface]
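To make the loss and duplication figures more concrete, here is a minimal Python sketch of how such rates could be derived if every probe message carries an incrementing sequence number. This is an illustration under that assumption, not Kafka Monitor's actual implementation; the function name `summarize_sequence` and the sample data are hypothetical.

```python
from collections import Counter


def summarize_sequence(received, produced_count):
    """Classify probe messages by sequence number.

    received: sequence numbers seen by the consumer, in arrival order.
    produced_count: how many messages the producer sent (numbered 0..n-1).
    """
    counts = Counter(received)
    # Every extra copy of a sequence number counts as one duplicate.
    duplicated = sum(c - 1 for c in counts.values() if c > 1)
    # A sequence number that never arrived counts as lost.
    lost = sum(1 for seq in range(produced_count) if seq not in counts)
    return {
        "consumed": len(received),
        "lost": lost,
        "duplicated": duplicated,
        "loss_rate": lost / produced_count if produced_count else 0.0,
    }


if __name__ == "__main__":
    # Hypothetical run: 10 messages produced, #3 and #7 never arrive, #5 arrives twice.
    print(summarize_sequence([0, 1, 2, 4, 5, 5, 6, 8, 9], produced_count=10))
```

A real check would also have to decide how long to wait before declaring a message lost, which this sketch ignores.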

You begin by cloning and building the GitHub repository:

$ git clone https://github.com/linkedin/kafka-monitor.git
$ cd kafka-monitor
$ ./gradlew jar

The bin/kafka-monitor-start.sh script runs Kafka Monitor and begins executing checks against your Kafka clusters. Although the tool uses the word “test”, it means a runtime monitoring check: you execute “tests” against a running production cluster to return the information needed to monitor its health. To do so, you must configure Kafka Monitor to run your checks and connect to your cluster. All of this is set up in the kafka-monitor.properties file in the config directory. Each check and service is specified in JSON format using the following structure:

{
   "name1" : {
     "class.name": MonitorClassName,
     "key1": value1,
     "key2": value2,
     ...
   },
   "name2" : {
     "class.name": ServiceClassName,
     "key1": value1,
     "key2": value2,
     ...
   },
   ...
}

Each test class must implement the com.linkedin.kmf.services.Test Java interface, and each service class must implement the com.linkedin.kmf.services.Service interface. The key for each test and service in the JSON map identifies it in the log or JMX metrics. The following sample service can be used to report some useful Kafka metrics:

"reporter-kafka-service": {
  "class.name": "com.linkedin.kmf.services.KafkaMetricsReporterService",
  "report.interval.sec": 3,
  "zookeeper.connect": "localhost:2181",
  "bootstrap.servers": "localhost:9092",
  "topic": "kafka-monitor-topic-metrics",
  "report.kafka.topic.replication.factor": 1,
  "report.metrics.list": [
    "kmf.services:type=produce-service,name=*:produce-availability-avg",
    "kmf.services:type=consume-service,name=*:consume-availability-avg",
    "kmf.services:type=produce-service,name=*:records-produced-total",
    "kmf.services:type=consume-service,name=*:records-consumed-total",
    "kmf.services:type=consume-service,name=*:records-lost-total",
    "kmf.services:type=consume-service,name=*:records-duplicated-total",
    "kmf.services:type=consume-service,name=*:records-delay-ms-avg",
    "kmf.services:type=produce-service,name=*:records-produced-rate",
    "kmf.services:type=produce-service,name=*:produce-error-rate",
    "kmf.services:type=consume-service,name=*:consume-error-rate"
  ]
}
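Each entry in report.metrics.list appears to pair a JMX object name pattern with an attribute name, separated by the last colon. The Python sketch below splits entries on that assumption and groups the attributes by service type. The helper names (`split_metric`, `service_type`, `group_by_service`) are hypothetical and are not part of Kafka Monitor.

```python
from collections import defaultdict


def split_metric(entry):
    """Split '<object name pattern>:<attribute>' at the last colon."""
    object_name, _, attribute = entry.rpartition(":")
    if not object_name or not attribute:
        raise ValueError(f"not in '<object name>:<attribute>' form: {entry!r}")
    return object_name, attribute


def service_type(object_name):
    """Return the value of the 'type' key property, or None if it is absent."""
    _, _, props = object_name.partition(":")
    for prop in props.split(","):
        key, _, value = prop.partition("=")
        if key == "type":
            return value
    return None


def group_by_service(entries):
    """Map each service type to the list of attributes requested for it."""
    grouped = defaultdict(list)
    for entry in entries:
        object_name, attribute = split_metric(entry)
        grouped[service_type(object_name)].append(attribute)
    return dict(grouped)


if __name__ == "__main__":
    metrics = [
        "kmf.services:type=produce-service,name=*:produce-availability-avg",
        "kmf.services:type=consume-service,name=*:records-lost-total",
        "kmf.services:type=consume-service,name=*:records-delay-ms-avg",
    ]
    for service, attributes in group_by_service(metrics).items():
        print(service, attributes)
```

A quick check like this can catch a mistyped entry before the service is started.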

Further service configuration is described on the Kafka Monitor configuration page, allowing you to list the servers in your deployment, the producer and consumer classes, topics to be monitored, and so on.

Filebeat

Filebeat is a tool from Elastic that eases the pain of collecting scores of log files distributed across a multitude of servers (and their respective VMs and containers). It is one of several good Logstash alternatives. The Kafka module for Filebeat collects and parses logs created by running Kafka instances, and provides a dashboard to visualize the log data. To use it, begin by downloading and installing Filebeat.

The following steps are required to further set up and run the Kafka Filebeat module:

Step 1. Enable the module:

./filebeat modules enable kafka

Step 2. Next, set up the initial environment:

./filebeat setup -e

The setup command writes the Kafka indexing template to Elasticsearch and deploys the sample dashboards for visualizing the data in Kibana.

Step 3. To run Filebeat, use the following command:

./filebeat -e

Step 4. Once started and connected, you can view the Filebeat Kibana dashboard via the URL:

http://localhost:5601/app/kibana#/dashboards

Filebeat comes with a sample dashboard to show Kafka logs and stack traces:

[Image: Filebeat sample dashboard showing Kafka logs and stack traces]

Configuring Filebeat means adding options to the output section of its filebeat.yml file. To send the collected events to a Kafka cluster, the output you define looks like the following example:

output.kafka:
  # initial brokers for reading cluster metadata
  hosts: ["kafka1:9092", "kafka2:9092", "kafka3:9092"]

  # message topic selection + partitioning
  topic: '%{[fields.log_topic]}'
  partition.round_robin:
    reachable_only: false

  required_acks: 1
  compression: gzip
  max_message_bytes: 1000000

Here you define the brokers and the topic that log events are produced to, along with many other options described in the documentation, such as the partitioning strategy and authentication.
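The topic setting in the example is a format string that picks the topic from a field on each event, and partition.round_robin spreads events across partitions in turn. The Python sketch below imitates both ideas to show the intent. It is an illustration only, not Filebeat's implementation, and it handles just the dotted `%{[a.b]}` form used in the example; the function names and the sample event are hypothetical.

```python
import itertools
import re

# Matches references such as %{[fields.log_topic]}
_FIELD_REF = re.compile(r"%\{\[([^\]]+)\]\}")


def resolve_topic(template, event):
    """Replace each %{[dotted.path]} reference with the value found in the event."""
    def lookup(match):
        value = event
        for key in match.group(1).split("."):
            value = value[key]  # raises KeyError if the field is missing
        return str(value)

    return _FIELD_REF.sub(lookup, template)


def round_robin(partition_count):
    """Yield partition numbers 0..n-1 in an endless cycle."""
    return itertools.cycle(range(partition_count))


if __name__ == "__main__":
    event = {"message": "broker started", "fields": {"log_topic": "kafka-logs"}}
    print(resolve_topic("%{[fields.log_topic]}", event))

    partitions = round_robin(3)
    print([next(partitions) for _ in range(5)])
```

In this sketch, an event without the referenced field raises an error; check the Filebeat documentation for how the real output treats such events.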

Cruise Control

The Cruise Control project is an open source tool to help monitor and manage large-scale Kafka clusters. Out of the box, it lets you track resource utilization for brokers, topics, and partitions; query cluster state; view the status of partitions; monitor server capacity (CPU, network I/O, and so on) and message traffic distribution; add and remove brokers; and rebalance your cluster. Cruise Control is used within LinkedIn to manage almost 3000 Kafka brokers.

To get started, clone and build Cruise Control:

$ git clone https://github.com/linkedin/cruise-control.git
$ cd cruise-control/
$ ./gradlew jar

Next, copy the ./cruise-control-metrics-reporter/build/libs/cruise-control-metrics-reporter.jar file to your Kafka server dependency jar folder. For Apache Kafka, the folder would be: core/build/dependant-libs-scala.

Next, modify your Kafka server configuration (located at ./config/server.properties) to set metric.reporters to com.linkedin.kafka.cruisecontrol.metricsreporter.CruiseControlMetricsReporter.

Next, modify the Cruise Control properties file (./config/cruisecontrol.properties) and edit the bootstrap.servers and zookeeper.connect settings for your Kafka cluster. For more information on all of the configuration options, see the configuration wiki.

# The Kafka cluster to control.
bootstrap.servers=localhost:9092
# The zookeeper connect of the Kafka cluster
zookeeper.connect=localhost:2181/

Finally, start Zookeeper and your Kafka server, and run the following commands:

$ ./gradlew jar copyDependantLibs
$ ./kafka-cruise-control-start.sh config/cruisecontrol.properties
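Once Cruise Control is running, its state can be queried over its REST API. The Python sketch below builds the request URL and fetches the JSON response. It assumes Cruise Control is listening on localhost port 9090 under the /kafkacruisecontrol/ prefix, which should be verified against your own cruisecontrol.properties; the helper names `endpoint_url` and `fetch` are hypothetical.

```python
import json
from urllib.parse import urlencode, urljoin
from urllib.request import urlopen

# Assumption: host, port, and URL prefix of a local Cruise Control instance.
BASE = "http://localhost:9090/kafkacruisecontrol/"


def endpoint_url(endpoint, base=BASE, **params):
    """Build the URL for an endpoint such as 'state', asking for JSON output."""
    query = urlencode({"json": "true", **params})
    return f"{urljoin(base, endpoint)}?{query}"


def fetch(endpoint, base=BASE, **params):
    """GET the endpoint and decode the JSON body (needs a running instance)."""
    with urlopen(endpoint_url(endpoint, base, **params), timeout=10) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Only prints the URLs; call fetch("state") against a live instance.
    print(endpoint_url("state"))
    print(endpoint_url("kafka_cluster_state"))
```

Keeping URL construction separate from the network call makes it easy to check the request before pointing it at a production cluster.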

This LinkedIn article on Cruise Control describes how LinkedIn has used the tool to manage Kafka operational issues. The Cruise Control architecture, as illustrated in the article (see the image below), shows the relationship between the components and highlights the pluggable pieces such as the metrics-related sampler, store and reporter.

[Image: Cruise Control architecture]

Cruise Control has a separate front-end project to help visualize the Kafka cluster state that Cruise Control monitors. Cruise Control Front End (CCFE) is a single-page web application that can be deployed either alongside Cruise Control or on an existing web server installation. Details about the clusters to be managed are supplied in a configuration file.

One of the views provided gives an overview of the configured Kafka cluster status, including broker count, leader partitions, replicas, throughput data, and status, such as out-of-sync replicas.

[Image: Cruise Control Front End cluster status overview]
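The out-of-sync replica status in such an overview comes down to comparing each partition's assigned replicas with its in-sync replica (ISR) set. The Python sketch below shows that comparison on hand-written data. The dictionary layout and the `out_of_sync` helper are hypothetical and do not reflect Cruise Control's own data model.

```python
def out_of_sync(partitions):
    """Return {(topic, partition): [replica ids missing from the ISR]}."""
    result = {}
    for p in partitions:
        missing = sorted(set(p["replicas"]) - set(p["isr"]))
        if missing:
            result[(p["topic"], p["partition"])] = missing
    return result


if __name__ == "__main__":
    # Hypothetical cluster snapshot: broker 3 has fallen behind on orders-1.
    snapshot = [
        {"topic": "orders", "partition": 0, "replicas": [1, 2, 3], "isr": [1, 2, 3]},
        {"topic": "orders", "partition": 1, "replicas": [2, 3, 1], "isr": [2, 1]},
    ]
    print(out_of_sync(snapshot))
```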

Other views provide overviews of cluster load and underlying server resource usage statistics.

Burrow

Burrow is an open source monitoring tool to track consumer lag in Apache Kafka clusters. It’s designed to monitor every consumer group that is committing offsets to either Kafka or Zookeeper, and to monitor every topic and partition consumed by those groups. Burrow deliberately does not monitor the MaxLag metric (nor do other tools, such as the Sematext Kafka monitoring agent) because its value is only useful under limited conditions.
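For readers new to the term, consumer lag for a partition is the distance between the newest offset in the log and the offset the consumer group has committed. The Python sketch below computes it from two plain dictionaries. The data and the `partition_lag` helper are hypothetical, and Burrow itself evaluates lag over a window of commits rather than a single snapshot like this.

```python
def partition_lag(end_offsets, committed_offsets):
    """Return {(topic, partition): lag} for partitions with a committed offset."""
    lag = {}
    for tp, end in end_offsets.items():
        committed = committed_offsets.get(tp)
        if committed is None:
            continue  # the group has not committed for this partition yet
        # Clamp at zero: the two offsets are read at slightly different times.
        lag[tp] = max(end - committed, 0)
    return lag


if __name__ == "__main__":
    # Hypothetical snapshot of log end offsets and a group's committed offsets.
    end_offsets = {("orders", 0): 1500, ("orders", 1): 1480, ("orders", 2): 900}
    committed = {("orders", 0): 1500, ("orders", 1): 1230}
    lag = partition_lag(end_offsets, committed)
    print(lag)
    print("total lag:", sum(lag.values()))
```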

Burrow is written in Go, so you’ll need to download, install and set up Go separately. After that, you can download and install Burrow using the Go commands:

$ go get github.com/linkedin/Burrow
$ cd $GOPATH/src/github.com/linkedin/Burrow
$ dep ensure
$ go install

To run Burrow using Go, execute the command:

$ $GOPATH/bin/Burrow --config-dir /path/containing/config

You can also run the Burrow binary directly from the directory where it is installed:

$ ./Burrow --config-dir=/path/to/configurations

Burrow reads its configuration with Viper, a configuration library used by many Go projects that supports formats such as JSON, YAML, and TOML. The examples below use TOML. The configuration header specifies basic Burrow information:

[general]
pidfile="burrow.pid"
stdout-logfile="burrow.out"
access-control-allow-origin="mysite.example.com"

Here, you specify a filename and path to store the process ID (PID) of the running Burrow process, a filename for Burrow’s stdout output, and the value of the Access-Control-Allow-Origin header that the Burrow REST server sends in its responses. You also need to specify important Kafka information, such as Zookeeper, Kafka version, security, clusters, and so on:

[zookeeper]
servers=["zkhost01.example.com:2181", "zkhost02.example.com:2181", "zkhost03.example.com:2181"]
timeout=6
root-path="/mypath/burrow"

[client-profile.myclient]
kafka-version="0.10.2"
client-id="burrow-myclient"
tls="mytlsprofile"
sasl="mysaslprofile"

[cluster.myclustername]
class-name="kafka"

[consumer.myconsumers]
class-name="kafka"
cluster="myclustername"

To visualize Kafka cluster data as gathered by Burrow, there are open source projects available, such as the browser-based BurrowUI and burrow-dashboard, the command-line UI tool burrow-client, and various plug-ins to other tools.
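Burrow's evaluations can also be read directly over its HTTP API. The Python sketch below builds the consumer lag URL and picks out partitions that are not healthy. It assumes Burrow's HTTP server is enabled on localhost port 8000 and that the v3 response carries a `status` object with per-partition `status` and `current_lag` fields; verify both against your Burrow version. The sample response is hand-written for illustration, and the helper names are hypothetical.

```python
import json
from urllib.request import urlopen

# Assumption: address of a Burrow instance with its HTTP server enabled.
BURROW = "http://localhost:8000"


def lag_url(cluster, group, base=BURROW):
    """URL of the v3 consumer lag endpoint for one cluster and group."""
    return f"{base}/v3/kafka/{cluster}/consumer/{group}/lag"


def fetch_lag(cluster, group, base=BURROW):
    """GET the lag evaluation (needs a running Burrow instance)."""
    with urlopen(lag_url(cluster, group, base), timeout=10) as resp:
        return json.load(resp)


def unhealthy_partitions(response):
    """List (topic, partition, status, current_lag) for partitions not marked OK."""
    status = response["status"]
    return [
        (p["topic"], p["partition"], p["status"], p["current_lag"])
        for p in status.get("partitions", [])
        if p["status"] != "OK"
    ]


if __name__ == "__main__":
    print(lag_url("myclustername", "mygroup"))
    # Hand-written sample in the assumed response shape, not real Burrow output.
    sample = {
        "error": False,
        "status": {
            "group": "mygroup",
            "status": "WARN",
            "partitions": [
                {"topic": "orders", "partition": 0, "status": "OK", "current_lag": 0},
                {"topic": "orders", "partition": 1, "status": "WARN", "current_lag": 250},
            ],
        },
    }
    print(unhealthy_partitions(sample))
```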

Conclusion

As you can see, open-source tools abound for Kafka monitoring and management. Many have large followings and are even used at high-profile companies for very large Kafka clusters. However, keep in mind that most of them require a good deal of setup and configuration to get up and running and to maintain as your cluster changes. If you’re looking for a Kafka monitoring tool that allows you to get set up in minutes, check out Part 3 of this Kafka Monitoring series to learn more.

Bio

Eric Bruno is a writer and editor for multiple online publications with more than 20 years of experience in the information technology community. He is a highly requested moderator and speaker for a variety of conferences and other events on topics spanning the technology spectrum from the desktop to the data center. He has written articles, blogs, white papers, and books on software architecture and development topics for more than a decade. He is also an enterprise architect, developer, and industry analyst with expertise in full lifecycle, large-scale software architecture, design, and development for companies all over the globe. His accomplishments span highly distributed system development, multi-tiered web development, real-time development, and transactional software development. See his editorial work online at www.ericbruno.com.
