2016 Year in Review: Monitoring and Logging Highlights

2017 is almost here and, like last year, we thought we'd share how 2016 went for us. We remain committed to being your "one-stop shop" for all things Elasticsearch and Solr: from Consulting, Production Support, and Training, complemented by Logsene for all your logs and SPM for all your monitoring needs.

Docker

It's safe to say 2016 was the year of Docker and, by extension, of Kubernetes, Mesos, Docker Swarm, and friends. They stopped being just early adopters' toys and became production-ready technologies used by many. This year we added excellent support for Docker monitoring with SPM and logging with Logsene via the open-source Sematext Docker Agent.

Read More

Kubernetes Containers: Logging and Monitoring support

In this post we will:

  • Introduce Kubernetes concepts and motivation for Kubernetes-aware monitoring and logging tooling
  • Show how to deploy the Sematext Docker Agent to each Kubernetes node with a DaemonSet (a minimal sketch follows this list)
  • Point out key Kubernetes metrics and log elements to help you troubleshoot and tune Docker and Kubernetes
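For a rough idea of what this looks like, here is a minimal, hypothetical DaemonSet sketch for running a monitoring and logging agent on every node. The API version, image name and tokens below are placeholders, not the exact manifest from the post:

apiVersion: extensions/v1beta1   # DaemonSet API group used on 2016-era clusters; apps/v1 on newer ones
kind: DaemonSet
metadata:
  name: sematext-docker-agent
spec:
  template:
    metadata:
      labels:
        app: sematext-docker-agent
    spec:
      containers:
        - name: sematext-docker-agent
          image: sematext/sematext-agent-docker   # assumed agent image; replace with the one you deploy
          env:
            - name: SPM_TOKEN                     # placeholder monitoring token
              value: YOUR_SPM_TOKEN
            - name: LOGSENE_TOKEN                 # placeholder logging token
              value: YOUR_LOGSENE_TOKEN
          volumeMounts:
            - name: docker-sock                   # the agent reads container metadata, metrics and logs via the Docker socket
              mountPath: /var/run/docker.sock
      volumes:
        - name: docker-sock
          hostPath:
            path: /var/run/docker.sock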

Managing microservices in containers is typically done with cluster managers and orchestration tools such as Google Kubernetes, Apache Mesos, Docker Swarm, Docker Cloud, Amazon ECS, and Hashicorp Nomad, to mention just a few. However, each platform has slightly different options for deploying containers or scheduling tasks on each cluster node. This is why we started a series of blog posts with Docker Swarm Monitoring, and continue today with a quick tutorial for container monitoring and log collection on Kubernetes.

Read More

Exploring Windows Kernel with Fibratus and Logsene

This is a guest post by Nedim Šabić, developer of Fibratus, a tool for exploration and tracing of the Windows kernel. 

Unlike Linux / UNIX environments, which provide a plethora of open source and native tools to instrument the user / kernel space internals, the Windows operating systems are pretty limited when it comes to the diversity of tools and interfaces for performing such tasks. Prior to Windows 7, you could use some not-so-legal techniques like SSDT hooking to intercept system calls issued from user space and do your own pre-processing, but they are far from efficient or stable. A kernel-mode driver could be helpful if it didn't require a digital signature granted by Microsoft. Some tools like Sysmon or Process Monitor can be helpful, but they are closed-source and don't leave much room for extensibility or integration with external systems such as message queues, databases, endpoints, etc.

Read More

Docker Swarm Lessons from Swarm3K

This is a guest post by Prof. Chanwit Kaewkasi, Docker Captain who organized Swarm3K – the largest Docker Swarm cluster to date.

Swarm3K Review

Swarm3K was the second collaborative project to try to form a very large Docker cluster with Swarm mode. It took place on 28th October 2016, with more than 50 individuals and companies joining the project.

Sematext was one of the very first companies to help us, offering their Docker monitoring and logging solution. They became the official monitoring system for Swarm3K. Stefan, Otis and their team provided wonderful support from the very beginning.

Swarm3K public dashboard by Sematext

To my knowledge, Sematext is currently the only Docker monitoring vendor that lets you deploy the monitoring agents as a global Docker service. This deployment model greatly simplifies the monitoring setup.

Swarm3K Setup and Workload

There were two planned workloads:

  1. MySQL with WordPress cluster
  2. C1M

25 of the nodes formed a MySQL cluster. We experienced some mixing of IP addresses from both the mynet and ingress networks. This was the same issue we had hit when forming an Apache Spark cluster in the past (see https://github.com/docker/docker/issues/24637). We prevented it by binding the cluster to a single overlay network only.

A WordPress node was scheduled somewhere on our huge cluster, and we intentionally didn't control where. When we tried to connect the WordPress node to the backend MySQL cluster, the connection kept timing out. We concluded that a WordPress / MySQL combo would only run correctly if we put both in the same datacenter.

We aimed for 3000 nodes, but in the end we successfully formed a working, geographically distributed 4,700-node Docker Swarm cluster.

Swarm3K Observations

What we also learned from this issue was that the performance of the overlay network greatly depends on correctly tuning the network configuration on each host.

When the MySQL / WordPress test failed, we changed the plan to try NGINX on Routing Mesh.

The ingress network is a /16 network, which supports up to 64K IP addresses. At Alex Ellis's suggestion, we then started 4,000 NGINX containers on the formed cluster. During this test, nodes were still coming and going. The NGINX service started and the Routing Mesh was formed. It kept serving correctly even as some nodes kept failing.

We concluded that the Routing Mesh in 1.12 is rock solid and production ready.

We then stopped the NGINX service and started to test the scheduling of as many containers as possible.

This time we simply used "alpine top", as we did for Swarm2K. However, the scheduling rate was quite slow. We reached 47,000 containers in approximately 30 minutes, so it would have taken ~10.6 hours to fill the cluster with 1M containers. Because that would take too long, we decided to shut down the managers, as there was no point in going further.

Swarm3k Task Status

Scheduling a huge batch of containers stressed the cluster. We scheduled the launch of a large number of containers using "docker service scale alpine=70000". This created a large scheduling queue that would not commit until all 70,000 containers were finished scheduling. This is why, when we shut down the managers, all scheduling tasks disappeared and the cluster became unstable: the Raft log got corrupted.

One of the most interesting things was that we were able to collect enough CPU profile information to show us what was keeping the cluster busy.

dockerd-flamegraph-01

Here we can see that only 0.42% of the CPU was spent on the scheduling algorithm. I think we can say with certainty: 

The Docker Swarm scheduling algorithm in version 1.12 is quite fast.

This means that there is an opportunity to introduce a more sophisticated scheduling algorithm that could result in even better resource utilization.

dockerd-flamegraph-02

We found that a lot of CPU cycles were spent on node communication. Here we see Libnetwork's memberlist layer, which used ~12% of the overall CPU.

dockerd-flamegraph-03

Another major CPU consumer was the Raft communication, which also triggered garbage collection here. This used ~30% of the overall CPU.

Docker Swarm Lessons Learned

Here’s the summarized list of what we learned together.

  1. For a large set of nodes like this, managers require a lot of CPU. CPU usage will spike whenever the Raft recovery process kicks in.
  2. If the leading manager dies, you'd better stop "docker daemon" on that node and wait until the cluster becomes stable again with n-1 managers.
  3. Don't use "dockerd -D" in production. Of course, I know you won't do that.
  4. Keep the snapshot reservation as small as possible. The default Docker Swarm configuration will do. Persisting Raft snapshots uses extra CPU.
  5. Thousands of nodes require a huge amount of resources to manage, both in terms of CPU and network bandwidth. In contrast, hundreds of thousands of tasks require high-memory nodes.
  6. 500 – 1000 nodes are recommended for production. I'm guessing you won't need more than that in most cases, unless you're planning on being the next Twitter.
  7. If managers seem to be stuck, wait for them. They'll recover eventually.
  8. The --advertise-addr parameter is mandatory for the Routing Mesh to work.
  9. Put your compute nodes as close to your data nodes as possible. The overlay network is great, but it requires tweaking the Linux network configuration on all hosts to work at its best.
  10. Despite slow scheduling, Docker Swarm mode is robust. There were no task failures this time, even with the unpredictable network connecting this huge cluster together.

“Ten Docker Swarm Lessons Learned” by @chanwit

Credits
Finally, I would like to thank all Swarm3K heroes: @FlorianHeigl, @jmaitrehenry from PetalMD, @everett_toews from Rackspace,  Internet Thailand, @squeaky_pl, @neverlock, @tomwillfixit from Demonware, @sujaypillai from Jabil, @pilgrimstack from OVH, @ajeetsraina from Collabnix, @AorJoa and @PNgoenthai from Aiyara Cluster, @f_soppelsa, @GroupSprint3r, @toughIQ, @mrnonaki, @zinuzoid from HotelQuickly,  @_EthanHunt_,  @packethost from Packet.io, @ContainerizeT – ContainerizeThis The Conference, @_pascalandy from FirePress, @lucjuggery from TRAXxs, @alexellisuk, @svega from Huli, @BretFisher,  @voodootikigod from Emerging Technology Advisors, @AlexPostID,  @gianarb from ThumpFlow, @Rucknar,  @lherrerabenitez, @abhisak from Nipa Technology, and @enlamp from NexwayGroup.

I would like to thank Sematext again for their best-in-class Docker monitoring system, DigitalOcean for providing all the resources for the huge Docker Swarm managers, and the Docker engineering team for making this great software and supporting us during the run.

While this time around we didn’t manage to launch all 150,000 containers we wanted to have, we did manage to create a nearly 5,000-node Docker Swarm cluster distributed over several continents.  Lessons we’ve learned from this experiment will help us launch another huge Docker Swarm cluster next year.  Thank you all and I’m looking forward to the new run!


Akka & Play Framework Monitoring

Akka Monitoring with Kamon and SPM

SPM has provided Akka monitoring via Kamon for quite a while now.  With SPM and Kamon you get out-of-the-box metrics about Akka Actors, Dispatchers and Routers, about the JVMs your Akka app runs in, and system metrics.

We’ve recently made a few nice improvements that should be of interest to anyone using Akka, and especially those using Play! Framework.

Want to see a demo and don’t feel like reading?
Go to
https://apps.sematext.com/demo and look for any SPM apps with “Akka” in their name.

Want to see an example Akka app that uses Kamon SPM backend for monitoring?
See https://github.com/sematext/kamon-spm-example/

Transaction Traces, Trace Segments, and Errors

We’ve expanded our Transaction Tracing support and now support Kamon’s Traces and Trace Segments.  Note that Traces don’t necessarily have to be initiated by an HTTP request.  SPM’s Transaction Tracing lets you specify where a transaction starts.  You can see that in our Demo Akka App, which is not actually a web app, so we specified where in code its transactions start and end. Traces can be produced by instrumentation libraries like ‘kamon-play’ or manually in the code using something like this:

val tContext = Kamon.tracer.newContext("name")

And for segments:

val segment = tContext.startSegment("some-section", "business-logic", "kamon")
// your code that is a part of this transaction would be here
segment.finish()
tContext.finish()

So what exactly do these Akka monitoring reports look like?  Here are some examples:


Trace response time for AWS/ECS request trace

Read More

Docker “Swarm Mode”: Full Cluster Monitoring & Logging with 1 Command

Until recently, automating the deployment of performance monitoring agents in Docker Swarm clusters was challenging, because monitoring agents had to be deployed to each cluster node and previous Docker releases (before Docker Engine v1.12 / Docker Swarm 1.2.4) had no global service scheduler (Github issue #601).  Scheduling services via docker-compose and scheduling constraints required manual updates whenever the number of nodes in the Swarm cluster changed – definitely not convenient for dynamically scaling clusters! In Docker Swarm Monitoring and Logging we shared some Linux shell acrobatics as a workaround for this issue.

The good news: all this has changed with Docker Engine v1.12 and the new Swarm Mode. Docker v1.12 provides many new orchestration features, and the new Swarm mode makes it much easier to deploy and manage Swarm clusters.


With Docker v1.12, services can be scheduled globally – similar to a Kubernetes DaemonSet, RancherOS global services, or CoreOS global fleet services. A minimal sketch of such a globally scheduled agent service follows.
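On Docker 1.12 itself this boils down to a single "docker service create --mode global ..." command. As a rough illustration only (using the later version 3 stack-file format rather than the exact setup from the post, with image name and tokens as placeholders), a globally scheduled agent service might look like this:

version: '3'                                     # compose file format v3 (Docker 1.13+), shown purely as an illustration
services:
  sematext-agent:
    image: sematext/sematext-agent-docker        # assumed agent image name
    environment:
      - SPM_TOKEN=YOUR_SPM_TOKEN                 # placeholder monitoring token
      - LOGSENE_TOKEN=YOUR_LOGSENE_TOKEN         # placeholder logging token
    volumes:
      - '/var/run/docker.sock:/var/run/docker.sock'
    deploy:
      mode: global                               # one agent task on every node that joins the swarm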


Read More

Container Monitoring: Top Docker Metrics to Watch

Monitoring Docker environments is challenging. Why? Because each container typically runs a single process, has its own environment, utilizes virtual networks, and has various methods of managing storage. Traditional monitoring solutions take metrics from each server and the applications it runs; these servers and applications are typically very static, with very long uptimes. Docker deployments are different: a set of containers may run many applications, all sharing the resources of one or more underlying hosts. It's not uncommon for Docker servers to run thousands of short-lived containers (e.g., for batch jobs) while a set of permanent services runs in parallel. Traditional monitoring tools, built for static environments, are not suited for such deployments, while some modern monitoring solutions (e.g. SPM from Sematext) were built with such dynamic systems in mind and even provide out-of-the-box reports for Docker monitoring. Moreover, container resource sharing calls for stricter enforcement of resource usage limits, an additional aspect you must watch carefully. To make appropriate adjustments to resource quotas you need good visibility into any limits containers have reached or errors they have caused. We recommend setting alerts on the defined limits; this way you can adjust limits or resource usage even before errors start happening.

Read More

Kafka Consumer Lag Monitoring

SPM is one of the most comprehensive Kafka monitoring solutions, capturing some 200 Kafka metrics, including Kafka Broker, Producer, and Consumer metrics. While lots of those metrics are useful, there is one particular metric everyone wants to monitor – Consumer Lag.

What is Consumer Lag

When people talk about Kafka or about a Kafka cluster, they are typically referring to Kafka Brokers. You can think of a Kafka Broker as a Kafka server. A Broker is what actually stores and serves Kafka messages. Kafka Producers are applications that write messages into Kafka (Brokers). Kafka Consumers are applications that read messages from Kafka (Brokers).

Inside Kafka Brokers, data is stored in one or more Topics, and each Topic consists of one or more Partitions. When writing data, a Broker actually writes it into a specific Partition. As it writes data, it keeps track of the last "write position" in each Partition. This is called the Latest Offset, also known as the Log End Offset. Each Partition has its own independent Latest Offset.

Just like Brokers keep track of their write position in each Partition, each Consumer keeps track of its "read position" in each Partition whose data it is consuming. That is, it keeps track of which data it has read. This is known as the Consumer Offset. The Consumer Offset is periodically persisted (to ZooKeeper or to a special Topic in Kafka itself) so it can survive Consumer crashes or unclean shutdowns and avoid re-consuming too much old data.

Kafka Consumer Lag and Read/Write Rates

In the diagram above, the yellow bars represent the rate at which Brokers are writing messages created by Producers, while the orange bars represent the rate at which Consumers are consuming messages from Brokers. The rates look roughly equal – and they need to be, otherwise the Consumers will fall behind.  However, there is always going to be some delay between the moment a message is written and the moment it is consumed. Reads are always going to lag behind writes, and that is what we call Consumer Lag. The Consumer Lag is simply the delta between the Latest Offset and the Consumer Offset: for example, if the Latest Offset of a Partition is 1,000 and a Consumer's Offset in that Partition is 950, the lag there is 50 messages.

Why is Consumer Lag Important

Many applications today are based on being able to process (near) real-time data. Think of a performance monitoring system like SPM or a log management service like Logsene. They continuously process infinite streams of near real-time data. If they were to show you metrics or logs with too much delay – if the Consumer Lag were too big – they'd be nearly useless.  The Consumer Lag tells us how far behind each Consumer (Group) is in each Partition.  The smaller the lag, the more real-time the data consumption.

Monitoring Read and Write Rates

Kafka Consumer Lag and Broker Offset Changes

As we just learned, the delta between the Latest Offset and the Consumer Offset is what gives us the Consumer Lag.  In the above chart from SPM you may have noticed a few other metrics:

  • Broker Write Rate
  • Consume Rate
  • Broker Earliest Offset Changes

The rate metrics are derived metrics – if you look at Kafka's metrics you won't find them there. Under the hood, SPM collects a few offset metrics from which these rates are computed.  In addition, it charts Broker Earliest Offset Changes; the earliest offset is the earliest known offset in each Broker's Partition, i.e., the offset of the oldest message in a Partition.  While this offset alone may not be super useful, knowing how it changes can be handy when things go awry.  Data in Kafka has a certain TTL (Time To Live) to allow for easy purging of old data.  This purging is performed by Kafka itself.  Every time such purging kicks in, the offset of the oldest data changes.  SPM's Broker Earliest Offset Change surfaces this information for your monitoring pleasure.  This metric gives you an idea of how often purges are happening and how many messages they've removed each time they ran.

There are several Kafka monitoring tools out there, like LinkedIn's Burrow, whose Offset and Consumer Lag monitoring approach is used in SPM.  If you need a good Kafka monitoring solution, give SPM a go.  Ship your Kafka and other logs into Logsene and you've got yourself a DevOps solution that will make troubleshooting easy instead of dreadful.


Sematext is Docker Ecosystem Technology Partner (ETP) for Monitoring

May 5, 2016 — Sematext, a global, Brooklyn-based products and services company that builds innovative Cloud and On Premises solutions for application performance monitoring, log management and analytics, today announced that it has been recognized by Docker as an Ecosystem Technology Partner (ETP) for monitoring and logging. This designation indicates that SPM Performance Monitoring and Logsene have demonstrated working integration with the Docker platform via the Docker API and are available to users and organizations that seek solutions to monitor their Dockerized distributed applications.

Sematext Docker Agent is extremely easy to deploy on Docker Swarm, Docker Cloud and Docker Datacenter. It discovers new and existing containers, collects Docker performance metrics, events and logs, and runs in a tiny container on every Docker host. In addition to standard log collection functionality, the agent performs automatic log format detection and field extraction for a number of log formats, including Docker Swarm, Elasticsearch, Solr, Nginx, Apache, MongoDB, Kubernetes, etc.

Sematext Docker Agent

Many organizations invest a lot of time in monitoring and logging setups, because monitoring and logging changed dramatically with the introduction of Docker and related orchestration tools. We've observed that organizations and teams that use different tools for logging and monitoring often have difficulties correlating logs, events and metrics. Sematext automates performance monitoring and logging for Docker. Operational insights are provided in a single UI, which helps you efficiently correlate metrics, logs and events. Sematext Docker Agent detects many log formats and structures the logs automatically for analysis in Logsene.

"We would like to congratulate Sematext on their inclusion in Docker's Ecosystem Technology Partner program for logging and monitoring," said Nick Stinemates, VP of Business Development and Technical Alliances at Docker. "The ETP program recognizes organizations like Sematext that have demonstrated integration with the Docker platform to provide users with intelligent insights and increased visibility into their Dockerized environments. The goal is to provide users with the data needed to ensure the highest degree of availability and performance for all their business-critical applications."

Sematext SPM is available at http://sematext.com/spm

About Sematext

Sematext Group, Inc. is a global, Brooklyn-based products and services company that builds innovative Cloud and On Premises solutions for application performance monitoring, log management and analytics, and site search analytics. Sematext Docker Agent is extremely easy to deploy; it collects Docker performance metrics, events and logs and runs in a container on every Docker Host. In addition to standard log collection functionality the agent performs automatic log format detection and field extraction for a number of log formats.  Besides monitoring Docker, Sematext SPM agents also monitor applications running inside and outside containers, such as Elasticsearch, Nginx, Apache, Kafka, Cassandra, Spark, Node.js, MongoDB, Solr, MySQL, etc.

Sematext also provides professional services around Elasticsearch, the ELK / Elastic Stack, and Apache Solr – Consulting, Training, and Production Support.

Contacts: press@sematext.com

Monitoring Kafka on Docker Cloud

For those of you using Apache Kafka and Docker Cloud or considering it, we’ve got a Sematext user case study for your reading pleasure. In this use case, Ján Antala, a Software Engineer in the DevOps Team at @pygmalios, talks about the business and technical needs that drove their decision to use Docker Cloud, how they are using Docker Cloud, and how their Docker and Kafka monitoring is done.

Pygmalios – Future of data-driven retail.

Pygmalios Logo

Pygmalios helps companies monitor how customers and staff interact in real time. Our retail analytics platform tracks sales, display conversions, and customer and staff behavior to deliver better service, targeted sales, faster check-outs and the optimal amount of staffing for a given time and location. Among our partners are big names such as dm drogerie or BMW.

I am a software engineer in a DevOps role, so I know the challenges from both sides – infrastructure as well as software development.

Our infrastructure

At Pygmalios we decided on a microservices-based architecture for our analytics platform. We have a complex system of Apache Spark, Apache Kafka, Cassandra and InfluxDB databases, Node.js backend services and JavaScript frontend applications, where every service has a single responsibility, which makes them easy to scale. We run most of them in Docker containers, apart from Spark and Cassandra, which run on the DataStax Enterprise stack.

We have around 50 different Docker services in total. Why Docker? Because it's easy to deploy and scale, and you don't have to care about where you run your applications. You can even move them between node clusters in seconds. We don't have our own servers but use cloud providers instead, especially AWS. We have been using Tutum to orchestrate our Docker containers for the past year (Tutum was recently acquired by Docker and the service is now called Docker Cloud).
Docker Cloud is the best service for Docker container management and deployment and totally matches our needs. You can create servers on any cloud provider or bring your own, add a Docker image, and write a stack file with rules that specify what to deploy and where. Then you can manage all your services and nodes via a dashboard. We really love the CI & CD features: when we push a new commit to Github, the Docker image is built and then automatically deployed to production.

DevOps Challenges

As we use a microservices architecture, we have a lot of applications across multiple servers, so we need to orchestrate them. We also have many physical sensors out in the retail stores, which are our data sources. In the end, there are a lot of things we have to think about, including the correlations between them:

Server monitoring

Basic metrics for the hardware layer, such as memory, CPU and network.

Docker monitoring

In the software layer we want to know whether our applications inside Docker containers are running properly.

Kafka, Spark and Cassandra monitoring

Our core services. They are crucial, so monitoring is a must.

Sensors monitoring

Sensors are deployed outside in the retail stores. We have to monitor them as well and use custom metrics.

Notifications

We want alerts whenever anything breaks.

Centralized logging

Store all logs in one place, combine them with hardware usage and then analyze anomalies.

Monitoring & Logging on Docker Cloud

There is already a great post about monitoring and logging on Docker Cloud, so for more information head over to: Docker Cloud Monitoring and Logging.

Kafka on Docker Cloud

Because Kafka is crucial for us, we use a cluster of 3 brokers, each running in a Docker container on a separate node. When Kafka is not available we are not collecting any data, so that data is lost forever if Kafka is ever down. Sure, we have buffers inside the sensors, but we don't want to rely on them. All topics are also replicated across all brokers, so we can handle an outage of 2 nodes. Our goal is also to be able to scale easily.

Kafka and Zookeeper live together, so you have to link them using connection parameters. Kafka doesn't have a master broker; the leader is automatically elected from the available brokers via Zookeeper, and Zookeeper elects its own leader automatically. To scale Kafka and Zookeeper to more nodes, we just have to add the nodes to the Docker Cloud cluster – since we use the every_node deployment strategy – and update the connection strings in the stack file.

We use our own fork of wurstmeister/kafka and signalfx/docker-zookeeper Docker images and I would encourage you to do the same so you can easily tune them to your needs.

To run the Kafka + Zookeeper cluster, launch the following stack on Docker Cloud.

Code from https://gist.github.com/janantala/c93a284e3f93bc7d7942f749aae520af

kafka:
  image: 'pygmalios/kafka:latest'
  deployment_strategy: every_node
  environment:
    - JMX_PORT=9999
    - KAFKA_ADVERTISED_HOST_NAME=$DOCKERCLOUD_CONTAINER_HOSTNAME
    - KAFKA_ADVERTISED_PORT=9092
    - KAFKA_DEFAULT_REPLICATION_FACTOR=3
    - KAFKA_DELETE_TOPIC_ENABLE=true
    - KAFKA_LOG_CLEANER_ENABLE=true
    - 'KAFKA_ZOOKEEPER_CONNECT=zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181'
    - KAFKA_ZOOKEEPER_CONNECTION_TIMEOUT_MS=6000
  ports:
    - '9092:9092'
    - '9999:9999'
  restart: always
  tags:
    - kafka
  volumes:
    - '/var/run/docker.sock:/var/run/docker.sock'
zookeeper:
  image: 'pygmalios/zookeeper-cluster:latest'
  deployment_strategy: every_node
  environment:
    - CONTAINER_NAME=$DOCKERCLOUD_CONTAINER_HOSTNAME
    - SERVICE_NAME=zookeeper
    - 'ZOOKEEPER_INSTANCES=zookeeper-1,zookeeper-2,zookeeper-3'
    - 'ZOOKEEPER_SERVER_IDS=zookeeper-1:1,zookeeper-2:2,zookeeper-3:3'
    - ZOOKEEPER_ZOOKEEPER_1_CLIENT_PORT=2181
    - ZOOKEEPER_ZOOKEEPER_1_HOST=zookeeper-1
    - ZOOKEEPER_ZOOKEEPER_1_LEADER_ELECTION_PORT=3888
    - ZOOKEEPER_ZOOKEEPER_1_PEER_PORT=2888
    - ZOOKEEPER_ZOOKEEPER_2_CLIENT_PORT=2181
    - ZOOKEEPER_ZOOKEEPER_2_HOST=zookeeper-2
    - ZOOKEEPER_ZOOKEEPER_2_LEADER_ELECTION_PORT=3888
    - ZOOKEEPER_ZOOKEEPER_2_PEER_PORT=2888
    - ZOOKEEPER_ZOOKEEPER_3_CLIENT_PORT=2181
    - ZOOKEEPER_ZOOKEEPER_3_HOST=zookeeper-3
    - ZOOKEEPER_ZOOKEEPER_3_LEADER_ELECTION_PORT=3888
    - ZOOKEEPER_ZOOKEEPER_3_PEER_PORT=2888
  ports:
    - '2181:2181'
    - '2888:2888'
    - '3888:3888'
  restart: always
  tags:
    - kafka
  volumes:
    - '/var/lib/zookeeper:/var/lib/zookeeper'
    - '/var/log/zookeeper:/var/log/zookeeper'

We use private networking and hostname addressing (the KAFKA_ADVERTISED_HOST_NAME environment variable) for security reasons in our stack. However, you can use IP addressing directly by replacing the hostname with an IP address. To connect to Kafka from an outside environment, you have to add records to the /etc/hosts file:

KAFKA_NODE.1.IP.ADDRESS kafka-1
KAFKA_NODE.2.IP.ADDRESS kafka-2
KAFKA_NODE.3.IP.ADDRESS kafka-3
KAFKA_NODE.1.IP.ADDRESS zookeeper-1
KAFKA_NODE.2.IP.ADDRESS zookeeper-2
KAFKA_NODE.3.IP.ADDRESS zookeeper-3

Or, on Docker Cloud, add extra_hosts to the service configuration.

extra_hosts:
- 'kafka-1:KAFKA_NODE.1.IP.ADDRESS'
- 'kafka-2:KAFKA_NODE.2.IP.ADDRESS'
- 'kafka-3:KAFKA_NODE.3.IP.ADDRESS'
- 'zookeeper-1:KAFKA_NODE.1.IP.ADDRESS'
- 'zookeeper-2:KAFKA_NODE.2.IP.ADDRESS'
- 'zookeeper-3:KAFKA_NODE.3.IP.ADDRESS'

Then you can use the following Zookeeper connection string to connect to Kafka:

zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181

And the following Kafka broker list:

kafka-1:9092,kafka-2:9092,kafka-3:9092
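
To illustrate where these strings end up, here is a hypothetical consumer service entry for the same kind of stack file; the image and environment variable names are made up for the example and are not part of the Pygmalios setup:

analytics-consumer:
  image: 'yourorg/analytics-consumer:latest'                                      # placeholder image for an app that reads from Kafka
  environment:
    - 'KAFKA_BROKERS=kafka-1:9092,kafka-2:9092,kafka-3:9092'                      # the broker list from above
    - 'ZOOKEEPER_CONNECT=zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181'      # the Zookeeper connection string from above
  restart: always
  tags:
    - kafka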

Kafka + SPM

To monitor Kafka we use SPM by Sematext, which provides out-of-the-box monitoring of all the Kafka Broker, Producer and Consumer metrics available via the JMX interface. Sematext also provides monitoring for the other apps we use, such as Spark, Cassandra and Docker, and we can also collect logs, so we have it all in one place. With this information we can find out not only when something happened, but also why.

Our Kafka node cluster with Docker containers is displayed in the following diagram:


SPM Performance Monitoring for Kafka

SPM collects Kafka performance metrics. First you have to create an SPM application of type Kafka in the Sematext dashboard and run the SPM client Docker container from the sematext/spm-client image. We use the SPM client in in-process mode, as a Java agent, so it is easy to set up: just add the SPM_CONFIG environment variable to the SPM client Docker container, where you specify the monitor configuration for Kafka Brokers, Consumers and Producers. Note that you have to use your own SPM token instead of YOUR_SPM_TOKEN.

create new SPM app

sematext-agent-kafka:
  image: 'sematext/spm-client:latest'
  deployment_strategy: every_node
  environment:
    - 'SPM_CONFIG=YOUR_SPM_TOKEN kafka javaagent kafka-broker;YOUR_SPM_TOKEN kafka javaagent kafka-producer;YOUR_SPM_TOKEN kafka javaagent kafka-consumer'
  restart: always
  tags:
    - kafka

Kafka

You also have to connect Kafka and the SPM monitor together. This can be done by mounting a volume from the SPM monitor service into the Kafka container using the volumes_from option. To enable the SPM monitor, add the KAFKA_JMX_OPTS environment variable to the Kafka container by adding the following arguments to your JVM startup script for the Kafka Broker, Producer & Consumer.

KAFKA_JMX_OPTS=-Dcom.sun.management.jmxremote -javaagent:/opt/spm/spm-monitor/lib/spm-monitor-kafka.jar=YOUR_SPM_TOKEN:kafka-broker:default -Dcom.sun.management.jmxremote -javaagent:/opt/spm/spm-monitor/lib/spm-monitor-kafka.jar=YOUR_SPM_TOKEN:kafka-producer:default -Dcom.sun.management.jmxremote -javaagent:/opt/spm/spm-monitor/lib/spm-monitor-kafka.jar=YOUR_SPM_TOKEN:kafka-consumer:default -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false

Done! Your Kafka cluster monitoring is set up. Now you can monitor requests, topics and other JMX metrics out of the box or you can create custom dashboards by connecting other apps.


Kafka metrics overview in SPM


Requests


Topic Bytes/Messages

Stack file

To run the Zookeeper + Kafka + SPM monitoring cluster, just launch the following stack and update these environment variables in your stack file:

  • YOUR_SPM_TOKEN inside SPM_CONFIG in the Sematext monitoring service and inside KAFKA_JMX_OPTS in the Kafka service

Code from https://gist.github.com/janantala/d816071a7a00eefeea934ec630a57c07

Kafka, Zookeeper, SPM Stack File

kafka:
  image: 'pygmalios/kafka:latest'
  deployment_strategy: every_node
  environment:
    - JMX_PORT=9999
    - KAFKA_ADVERTISED_HOST_NAME=$DOCKERCLOUD_CONTAINER_HOSTNAME
    - KAFKA_ADVERTISED_PORT=9092
    - KAFKA_DEFAULT_REPLICATION_FACTOR=3
    - KAFKA_DELETE_TOPIC_ENABLE=true
    - 'KAFKA_JMX_OPTS=-Dcom.sun.management.jmxremote -javaagent:/opt/spm/spm-monitor/lib/spm-monitor-kafka.jar=YOUR_SPM_TOKEN:kafka-broker:default -Dcom.sun.management.jmxremote -javaagent:/opt/spm/spm-monitor/lib/spm-monitor-kafka.jar=YOUR_SPM_TOKEN:kafka-producer:default -Dcom.sun.management.jmxremote -javaagent:/opt/spm/spm-monitor/lib/spm-monitor-kafka.jar=YOUR_SPM_TOKEN:kafka-consumer:default -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false'
    - KAFKA_LOG_CLEANER_ENABLE=true
    - 'KAFKA_ZOOKEEPER_CONNECT=zookeeper-1:2181,zookeeper-2:2181,zookeeper-3:2181'
    - KAFKA_ZOOKEEPER_CONNECTION_TIMEOUT_MS=6000
  ports:
    - '9092:9092'
    - '9999:9999'
  restart: always
  tags:
    - kafka
  volumes:
    - '/var/run/docker.sock:/var/run/docker.sock'
  volumes_from:
    - sematext-agent-kafka
sematext-agent-kafka:
  image: 'sematext/spm-client:latest'
  deployment_strategy: every_node
  environment:
    - 'SPM_CONFIG=YOUR_SPM_TOKEN kafka javaagent kafka-broker;YOUR_SPM_TOKEN kafka javaagent kafka-producer;YOUR_SPM_TOKEN kafka javaagent kafka-consumer'
  restart: always
  tags:
    - kafka
zookeeper:
  image: 'pygmalios/zookeeper-cluster:latest'
  deployment_strategy: every_node
  environment:
    - CONTAINER_NAME=$DOCKERCLOUD_CONTAINER_HOSTNAME
    - SERVICE_NAME=zookeeper
    - 'ZOOKEEPER_INSTANCES=zookeeper-1,zookeeper-2,zookeeper-3'
    - 'ZOOKEEPER_SERVER_IDS=zookeeper-1:1,zookeeper-2:2,zookeeper-3:3'
    - ZOOKEEPER_ZOOKEEPER_1_CLIENT_PORT=2181
    - ZOOKEEPER_ZOOKEEPER_1_HOST=zookeeper-1
    - ZOOKEEPER_ZOOKEEPER_1_LEADER_ELECTION_PORT=3888
    - ZOOKEEPER_ZOOKEEPER_1_PEER_PORT=2888
    - ZOOKEEPER_ZOOKEEPER_2_CLIENT_PORT=2181
    - ZOOKEEPER_ZOOKEEPER_2_HOST=zookeeper-2
    - ZOOKEEPER_ZOOKEEPER_2_LEADER_ELECTION_PORT=3888
    - ZOOKEEPER_ZOOKEEPER_2_PEER_PORT=2888
    - ZOOKEEPER_ZOOKEEPER_3_CLIENT_PORT=2181
    - ZOOKEEPER_ZOOKEEPER_3_HOST=zookeeper-3
    - ZOOKEEPER_ZOOKEEPER_3_LEADER_ELECTION_PORT=3888
    - ZOOKEEPER_ZOOKEEPER_3_PEER_PORT=2888
  ports:
    - '2181:2181'
    - '2888:2888'
    - '3888:3888'
  restart: always
  tags:
    - kafka
  volumes:
    - '/var/lib/zookeeper:/var/lib/zookeeper'
    - '/var/log/zookeeper:/var/log/zookeeper'


Summary

Thanks to Sematext you can easily monitor all the important metrics. The basic setup should take only a few minutes, and then you can tune it to your needs, connect other applications and create custom dashboards.

If you have feedback about monitoring your Kafka cluster, get in touch with me at @janantala or email me at j.antala@pygmalios.com. You can also follow us at @pygmalios for more cool stuff. If you have problems setting up your monitoring and logging, don't hesitate to send an email to support@sematext.com or tweet @sematext.