Until recently, automating the deployment of performance monitoring agents in Docker Swarm clusters was challenging because monitoring agents had to be deployed to each cluster node, and earlier Docker releases (before Docker Engine v1.12 / Docker Swarm 1.2.4) had no global service scheduler (GitHub issue #601). Scheduling services via docker-compose and scheduling constraints required manual updates whenever the number of nodes in the Swarm cluster changed – definitely not convenient for dynamically scaling clusters! In Docker Swarm Monitoring and Logging we shared some Linux shell acrobatics as a workaround for this issue.
The good news: all of this has changed with Docker Engine v1.12 and the new Swarm mode. The v1.12 release provides many new orchestration features, and Swarm mode makes it much easier to deploy and operate Swarm clusters.
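With Swarm mode, a monitoring agent can be scheduled as a “global” service so that every node – including nodes added later – automatically runs one instance of it. Below is a minimal sketch using the Docker SDK for Python; the image name, environment variables, and mount are illustrative placeholders rather than the exact Sematext Docker Agent configuration:

```python
# Sketch: deploy a monitoring agent as a Swarm "global" service (one task per node).
# Image name, tokens, and mount are placeholders, not the actual SDA settings.
import docker
from docker.types import ServiceMode

client = docker.from_env()

client.services.create(
    image="sematext/sematext-agent-docker:latest",   # example image name
    name="monitoring-agent",
    mode=ServiceMode("global"),                       # Swarm schedules one task per node
    env=["SPM_TOKEN=<your-token>", "LOGSENE_TOKEN=<your-token>"],
    mounts=["/var/run/docker.sock:/var/run/docker.sock:rw"],
)
```

When a new node joins the swarm, the scheduler starts the agent there automatically – no manual constraint updates needed.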
Docker is growing by leaps and bounds, and its ecosystem along with it. Because containers are lightweight, the predominant deployment pattern is to run just a single app or service inside each container. Most software products and services are made up of at least several such apps/services, and we want all of them to be highly available and fault tolerant. Thus, Docker containers in an organization quickly start popping up like mushrooms after the rain. They multiply faster than rabbits. While in the beginning we play with them like cute little pets, as their number quickly grows we realize we are dealing with a herd of cattle, implying we’ve become cowboys. Managing a herd with your two hands, a horse, and a lasso will get you only so far; you won’t be able to ride after each and every calf that wanders in the wrong direction. To get back to containers from this zoological analogy – operating so many moving pieces at scale is impossible without orchestration – this is why we’ve seen the rise of Docker Swarm, Kubernetes, Mesos, CoreOS, RancherOS and so on.
Container orchestration helps you manage your containers, their placement, their resources, and their whole life cycle. While containers and applications in them are running, in addition to the whole life cycle management we need container monitoring and log management so we can troubleshoot performance or stability issues, debug or tune applications, and so on. Just like with orchestration, there are a number of open-source container monitoring and logging tools. It’s great to have choices, but having lots of them means you need to evaluate and compare them to pick the one that best matches your needs.
DevOps Tools Comparison
We’ve open-sourced our Sematext Docker Agent (SDA for short), which works with SPM for monitoring and Logsene for log management (think of it as ELK as a Service), and wanted to provide a high-level comparison of SDA and several popular Docker monitoring and logging tools, like cAdvisor, Logspout, and others. In the following table we group tools by functionality and include monitoring agents, log collectors and shippers, storage backends, and tools that provide the UI and visualizations. For each functionality we list in the “Common Tools” column one or more popular open-source tools that provide it. An empty “Common Tools” cell means there are no popular open-source tools that provide it, or at least none we are aware of — if we got something wrong, please leave a comment or tweet @sematext.
| Functionality | Common Tools | Sematext Tools |
| Collect Logs from Docker API (including auto-discovery of new containers) | Logspout | Sematext Docker Agent |
| Routing setup for containers (via HTTP API to syslog, redis, kafka, logstash) | Docker Logging Drivers (e.g. syslog, journald, fluentd, etc.) | Sematext Docker Agent (routing of logs to different indices based on container labels) |
| Automatic log tagging (with Docker Compose or Swarm or Kubernetes metadata) | For Kubernetes: fluentd-elasticsearch, assumes Elasticsearch deployed locally | Sematext Docker Agent |
| Collect Docker Metrics | | Sematext Docker Agent |
| Collect Docker Events | | Sematext Docker Agent |
| Logs format detection (most tools need a static setup per logfile/application) | | Sematext Docker Agent (out-of-the-box format detection and parsing; the parser and the logagent-js pattern library are open source) |
| Logs parsing and shipping | Fluentd, Logstash, rsyslog, syslog-ng | Sematext Docker Agent |
| Logs storage and indexing | | Logsene (exposes the Elasticsearch API) |
| Logs anomaly detection and alerting | | Logsene |
| Log search and analytics | | Logsene (Logsene’s own UI or integrated Kibana, or Grafana connected to Logsene via the Elasticsearch data source) |
Some of the functionality provided by SPM and Logsene is not available in some of the most popular open-source monitoring and logging tools included here
Some of the SPM and Logsene functionality is indeed provided by some of the open-source tools, however none of them seems to encompass all the features, forcing one to mix and match and head down the tech debt-ridden Franken-monitoring path
Try it yourself in the MindMap below – pick a few functionalities and see how many different tools you might have to use.
Avoid building technical-debt & Franken-monitoring by using a limited number of Docker monitoring & logging tools
P.S.: Sematext Docker Agent is available in the RancherOS Community Catalog and shows up with our new mascot “Octi” – just one more pet 🙂 – so if you use RancherOS, search for “sematext” in the RancherOS Catalog and within a few clicks you’ll have the Sematext Docker Agent deployed to your RancherOS clusters!
Docker Datacenter (DDC) simplifies container orchestration and increases the flexibility and scalability of application deployments. However, the high level of automation creates new challenges for monitoring and log management. Organizations that introduce Docker Datacenter manage container deployments in various scenarios, e.g., on bare metal, on virtual machines, or in hybrid clouds. That’s why at Sematext we are seeing a shift from traditional server monitoring to container-centric monitoring. This post is an excerpt from the newly published “Reference Architecture: Monitoring and Logging for Docker Datacenter” and shows how Docker Datacenter can be extended with logging and monitoring services.
Download Reference Architecture Logging & Monitoring for Docker Datacenter
The Docker Universal Control Plane (UCP) management functionality includes real-time monitoring of the cluster state and real-time metrics and logs for each container. However, operating larger infrastructures requires a longer retention time for logs and metrics and the capability to correlate metrics, logs and events on several levels (cluster, nodes, applications and containers). A comprehensive monitoring and logging solution ought to provide operational insights on all of these levels.
SPM is one of the most comprehensive Kafka monitoring solutions, capturing some 200 Kafka metrics, including Kafka Broker, Producer, and Consumer metrics. While lots of those metrics are useful, there is one particular metric everyone wants to monitor – Consumer Lag.
What is Consumer Lag
When people talk about Kafka or about a Kafka cluster, they are typically referring to Kafka Brokers. You can think of a Kafka Broker as a Kafka server. A Broker is what actually stores and serves Kafka messages. Kafka Producers are applications that write messages into Kafka (Brokers). Kafka Consumers are applications that read messages from Kafka (Brokers).
Inside Kafka Brokers data is stored in one or more Topics, and each Topic consists of one or more Partitions. When writing data, a Broker actually writes it into a specific Partition. As it writes data it keeps track of the last “write position” in each Partition. This is called the Latest Offset, also known as the Log End Offset. Each Partition has its own independent Latest Offset.
Just like Brokers keep track of their write position in each Partition, each Consumer keeps track of “read position” in each Partition whose data it is consuming. That is, it keeps track of which data it has read. This is known as Consumer Offset. This Consumer Offset is periodically persisted (to ZooKeeper or a special Topic in Kafka itself) so it can survive Consumer crashes or unclean shutdowns and avoid re-consuming too much old data.
In our diagram above we can see yellow bars, which represent the rate at which Brokers are writing messages created by Producers. The orange bars represent the rate at which Consumers are consuming messages from Brokers. The rates look roughly equal – and they need to be, otherwise the Consumers will fall behind. However, there is always going to be some delay between the moment a message is written and the moment it is consumed. Reads are always going to lag behind writes, and that is what we call Consumer Lag. The Consumer Lag is simply the delta between the Latest Offset and the Consumer Offset.
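To make that arithmetic concrete, here is a minimal sketch using the kafka-python client (our own example, not the mechanism SPM uses to collect these metrics) that computes the lag per partition as Latest Offset minus committed Consumer Offset; the topic and group names are made up:

```python
# Sketch: Consumer Lag per partition = Latest Offset - Consumer Offset (committed).
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="my-consumer-group",      # hypothetical consumer group
    enable_auto_commit=False,
)

topic = "metrics"                      # hypothetical topic name
partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]

latest = consumer.end_offsets(partitions)       # Latest Offset (Log End Offset)
for tp in partitions:
    committed = consumer.committed(tp) or 0     # Consumer Offset for this group
    print(f"{tp}: lag = {latest[tp] - committed}")
```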
Why is Consumer Lag Important
Many applications today are based on being able to process (near) real-time data. Think about a performance monitoring system like SPM or a log management service like Logsene. They continuously process infinite streams of near real-time data. If they were to show you metrics or logs with too much delay – if the Consumer Lag were too big – they’d be nearly useless. The Consumer Lag tells us how far behind each Consumer (Group) is in each Partition. The smaller the lag, the more real-time the data consumption.
Monitoring Read and Write Rates
As we just learned, the delta between the Latest Offset and the Consumer Offset is what gives us the Consumer Lag. In the above chart from SPM you may have noticed a few other metrics:
Broker Write Rate
Broker Earliest Offset Changes
The rate metrics are derived metrics; if you look at Kafka’s metrics you won’t find them there. Under the hood SPM collects a few metrics with various offsets, from which these rates are computed. In addition, it charts Broker Earliest Offset Changes, which is the earliest known offset in each Broker’s Partition. Put another way, this offset is the offset of the oldest message in a Partition. While this offset alone may not be super useful, knowing how it’s changing can be handy when things go awry. Data in Kafka has a certain TTL (Time To Live) to allow for easy purging of old data. This purging is performed by Kafka itself. Every time such purging kicks in, the offset of the oldest data changes. SPM’s Broker Earliest Offset Change surfaces this information for your monitoring pleasure. This metric gives you an idea how often purges are happening and how many messages they’ve removed each time they ran.
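For completeness, here is a similar hedged sketch, again with kafka-python and a made-up topic name, that reads the earliest (oldest message) and latest offsets per partition – the raw values behind Broker Earliest Offset Changes:

```python
# Sketch: oldest and newest offsets per partition; when Kafka purges expired
# data, the earliest offset jumps forward.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(bootstrap_servers="localhost:9092")
topic = "metrics"                                           # hypothetical topic
partitions = [TopicPartition(topic, p) for p in consumer.partitions_for_topic(topic)]

earliest = consumer.beginning_offsets(partitions)           # oldest message offset
latest = consumer.end_offsets(partitions)                   # newest message offset
for tp in partitions:
    print(f"{tp}: oldest={earliest[tp]}, newest={latest[tp]}")
```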
There are several other Kafka monitoring tools out there, like LinkedIn’s Burrow, whose Offset and Consumer Lag monitoring approach is used in SPM. If you need a good Kafka monitoring solution, give SPM a go. Ship your Kafka and other logs into Logsene and you’ve got yourself a DevOps solution that will make troubleshooting easy instead of dreadful.
Want a training in your city or on-site? Let us know!
Attendees in all three workshops will go through several sequences of short lectures followed by interactive, group, hands-on exercises. There will be Q&A sessions in each workshop after each such lecture-practicum block.
Got any questions or suggestions for the course? Just drop us a line or hit us @sematext!
For many of us in the DevOps field, MongoDB is a critical part of our IT stack. With today’s acquisition of WiredTiger, MongoDB is further establishing itself as the NoSQL DB built to support massive data processing and storage. It would be an understatement to say that MongoDB does a lot, with many organizations using it as their backend storage framework, analytics backend, and so on.
So your MongoDB cluster really, really needs to be in tip-top shape. All the time. And if it’s not, you need to know asap — or better yet — prevent problems before they kick in and make your life difficult. That’s where SPM comes in — with MongoDB monitoring, alerting and anomaly detection. MongoDB exposes a boatload of metrics, but instead of just throwing all of them on endless charts, we’ve taken the time to cherry-pick what we think are the top 50 most valuable MongoDB metrics to monitor. We have furthermore made it possible to filter the MongoDB metrics by server, as well as by database and table where possible.
The key metric groups we track are:
The Overview chart below provides 9 charts with MongoDB key metrics:
Row 3 adds Collection/Document Metrics, Locks, and wait times; followed by Network Metrics for MongoDB
SPM for MongoDB Overview
If you monitor a MongoDB cluster, the Server tab provides a quick overview of the health of each node:
SPM Server View
The Reports on the left side of the screen below provide detailed information for each group of metrics. Let’s have a quick look at them.
OS Metrics: CPU Metrics, Memory Usage, Disk Space and I/O
Below is an example of some of the key MongoDB Metrics found in SPM:
Database Operations: Counters for Queries, Insert, Update, Delete and other commands for the main database plus replica operations
Database Memory: Resident, Virtual, Mapped, and Journal Memory
Database Storage: Size of Data Files, Namespace Files, DB Files etc., plus Size of Objects, Number of Collections and Objects
MongoDB Storage & Collections
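If you are curious where numbers like these originate, MongoDB exposes them through commands such as serverStatus and dbStats. The following is a small illustrative pymongo sketch (our own example, not how the SPM agent is implemented); the connection string and database name are placeholders:

```python
# Sketch: pulling operation counters, memory usage, and storage stats from MongoDB.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")   # placeholder connection string

server_status = client.admin.command("serverStatus")
print(server_status["opcounters"])   # query/insert/update/delete counters
print(server_status["mem"])          # resident/virtual/mapped memory (MB)

db_stats = client["mydb"].command("dbStats")         # "mydb" is a placeholder database
print(db_stats["dataSize"], db_stats["objects"], db_stats["collections"])
```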
The screenshot below shows:
Documents: Counters for Documents inserted, updated or returned by queries
Locks: Lock counters and lock acquisition wait times at the Global, Database, Collection and Journal level. Since MongoDB 3.x, locks are not always global; SPM shows a breakdown for all lock types. These metrics are good candidates for alerting when anomalies are detected. Simply add an alert from the menu in the top-left corner of each chart.
Metrics for all MongoDB Locks
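The lock counters and wait times charted above also come from serverStatus. Here is a small pymongo sketch (again just an illustration, with a placeholder connection string) that prints the per-level lock acquisition and wait counters reported by MongoDB 3.x:

```python
# Sketch: per-level lock counters (Global, Database, Collection, oplog, ...).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
locks = client.admin.command("serverStatus")["locks"]

for level, stats in locks.items():
    acquired = stats.get("acquireCount", {})      # locks acquired, by lock mode
    waited = stats.get("acquireWaitCount", {})    # acquisitions that had to wait
    print(level, acquired, waited)
```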
Other key MongoDB metrics that SPM displays are:
Network: Number of client connections, Received and transmitted data, Request rate
Database Journal: Commits, Early Commits, Commit times and lock times
MongoDB Journal Metrics
If you’d like to see MongoDB metrics together with the top Node.js metrics, you might like the idea of putting MongoDB metrics and Node.js metrics from SPM for Node.js in a custom dashboard:
SPM Custom Dashboard with MongoDB Locks and Node.js Event Loop Latency
We hope you like this new addition to SPM. Got ideas how we could make it more useful for you? Let us know via comments, email or @sematext.
Not using SPM yet? Check out the free 30-day trial by registering here. There’s no commitment and no credit card required. Even better — combine SPM with Logsene to make the integration of performance metrics, logs, events and anomalies more robust for those looking for a single pane of glass.
If you run Elasticsearch, Solr, or any backend you communicate with using SQL (via JDBC), like SparkSQL, Apache Cassandra (CQL), Apache Impala, Apache Drill, MySQL, PostgreSQL, etc., you’ll like what we’ve just added to SPM. We call it Database Operations and in SPM you can find it in the new Database report:
If you didn’t watch the video, here’s what Database Operations gives you:
Top 5 operation types across all your data stores or filtered to a specific data store type
Top 5 operation types by speed, throughput, or simply their volume
Time-series reports for volume, throughput, and latency broken down by operation type
Ability to view all collected operations, not just the slowest ones, filtered by database type or operation type, and sorted by average duration, total duration, or throughput
Sparklines that show last 5 minute values and trends
Top 10 slowest individual operations and drill-in details
Integration with Transaction Tracing, so you can correlate slow data store operations with the actual transaction/request that triggered slow operations
To get this information, add the SPM agent to the application that talks to a data store (e.g. Solr or Elasticsearch or MySQL or …). This is because the SPM agent captures operations at the client layer, not in the server itself.
Don’t forget – when you enable Database Operations you will also automatically get Transaction Tracing, as well as the cool AppMaps – enjoy! 🙂
Got ideas how we could make Database Operations better and more useful to you? Let us know via comments, email or @sematext.
Grab a free 30-day SPM trial by registering here (ping us if you’re a startup, a non-profit, or educational institution – we’ve got special pricing for you!). There’s no commitment and no credit card required.
Half of the world, Sematext included, seems to be using Kafka.
Kafka is the spinal cord that connects various components in SPM, Site Search Analytics, and Logsene. If Kafka breaks, we’re in trouble (but we have anomaly detection all over the place to catch issues early). In many Kafka deployments, ours included, the most recent data is the most valuable. Consider the case of Kafka in SPM, which processes massive amounts of performance metrics for monitoring applications and servers. Clearly, in a performance monitoring system you primarily care about current performance numbers. Thus, if SPM’s Kafka pipeline were to break and we then restored it, what we’d really like to avoid is processing all data sequentially, oldest to newest. What we’d prefer is processing new metrics data first and then processing older data using any spare capacity we have in order to “fill the gap” caused by Kafka downtime.
Here’s a very quick “video” that shows this in action:
How does this work?
We asked about it back in 2013, but didn’t really get good tips. Shortly after that we implemented the following logic that’s been working well for us, as you can see in the animation above.
The catch-up logic assumes there are multiple topics to consume from, with one of these topics being the “active” topic to which the Producer is publishing messages. The Consumer sets which topic is active, although the Producer can also set it if it has not already been set. The active topic is stored in ZooKeeper.
The Consumer determines the lag by looking at the timestamp that the Producer adds to each message published to Kafka. If the lag is over N minutes, the Consumer starts paying attention to the offset delta. If that delta starts getting smaller and keeps getting smaller M times in a row, the Consumer knows it is able to keep up (i.e. the lag is not growing) and sets another topic as active. This signals the Producer to switch publishing to the new topic, while the Consumer keeps consuming from all topics.
As a result, the Consumer is able to consume both new data and the delayed/old data, and it avoids lacking fresh data while in catch-up mode busy processing the backlog. Consuming from one topic is what causes new data to be processed (this corresponds to the right-most part of the chart above “moving forward”), and consuming from the other topic is where we get the data for filling in the gap.
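To make the description above more concrete, here is a condensed sketch of that catch-up logic in Python, using kafka-python and kazoo. It is an illustration only, not our production code, and the topic names, thresholds, ZooKeeper path, and message timestamp field are invented:

```python
# Sketch: two-topic catch-up. The Consumer reads both topics, measures lag from a
# Producer-added timestamp, and flips the "active" topic znode once the lag has
# shrunk M times in a row; the Producer (not shown) watches that znode and
# publishes to whichever topic it names.
import json
import time

from kafka import KafkaConsumer
from kazoo.client import KazooClient

TOPICS = ["metrics-a", "metrics-b"]           # hypothetical topic pair
ACTIVE_TOPIC_PATH = "/pipeline/active_topic"  # hypothetical ZooKeeper znode
LAG_THRESHOLD_SEC = 5 * 60                    # "N minutes"
M = 3                                         # shrinking-lag streak needed to switch

zk = KazooClient(hosts="localhost:2181")
zk.start()
zk.ensure_path(ACTIVE_TOPIC_PATH)
if not zk.get(ACTIVE_TOPIC_PATH)[0]:
    zk.set(ACTIVE_TOPIC_PATH, TOPICS[0].encode())   # set active topic if unset

consumer = KafkaConsumer(*TOPICS, bootstrap_servers="localhost:9092",
                         group_id="catchup-demo")

prev_lag, shrinking_streak = None, 0
for msg in consumer:
    event = json.loads(msg.value)             # assumes the Producer adds a "ts" field
    lag = time.time() - event["ts"]

    if lag > LAG_THRESHOLD_SEC:
        # Behind: watch whether the lag keeps shrinking.
        if prev_lag is not None and lag < prev_lag:
            shrinking_streak += 1
        else:
            shrinking_streak = 0
        prev_lag = lag

        if shrinking_streak >= M:
            # We can keep up: make the *other* topic active so the Producer
            # switches to it, while this Consumer keeps reading both topics
            # and drains the backlog from the old one.
            active = zk.get(ACTIVE_TOPIC_PATH)[0].decode()
            other = TOPICS[1] if active == TOPICS[0] else TOPICS[0]
            zk.set(ACTIVE_TOPIC_PATH, other.encode())
            prev_lag, shrinking_streak = None, 0

    # ... normal message processing goes here ...
```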