
Introduction: Monitoring and Logging for Docker Datacenter

1 Introduction

Docker Enterprise Edition (Docker EE) simplifies container orchestration and increases the flexibility and scalability of application deployments. However, the high level of automation creates new challenges for monitoring and log management. Why? Because each container typically runs a single process, has its own environment, uses virtual networks, and may use any of several methods of managing storage.

Traditional monitoring solutions collect metrics from each server and the applications it runs. These servers and applications are typically very static, with very long uptimes. Docker deployments are different: a set of containers may run many applications, all sharing the resources of one or more underlying hosts. It’s not uncommon for Docker servers to run many short-lived containers for batch jobs while a set of permanent services runs in parallel. Traditional monitoring tools were not designed for such dynamic environments and are poorly suited to these deployments.

On the other hand, some modern monitoring solutions were built with such dynamic systems in mind and even offer out-of-the-box reporting for Docker monitoring. Moreover, container resource sharing calls for stricter enforcement of resource usage limits, an additional issue you must watch carefully. To adjust resource quotas appropriately, you need good visibility into any limits containers have reached and any errors they have caused. We recommend setting alerts based on the defined limits; this way you can adjust limits or resource usage before errors start happening.
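
The idea of alerting against defined limits can be sketched in a few lines of Python. This is only an illustration and not part of any Docker or Sematext API: the function name check_limit, the 80% warning ratio and the sample values are all assumptions chosen for the example.

```python
from typing import Optional


def check_limit(usage: float, limit: float, warn_ratio: float = 0.8) -> Optional[str]:
    """Return an alert level for a resource, or None when usage is healthy.

    warn_ratio is a hypothetical threshold: warn once usage reaches
    80% of the configured limit, before the limit itself is hit.
    """
    if limit <= 0:
        return None  # no limit configured, nothing to compare against
    ratio = usage / limit
    if ratio >= 1.0:
        return "critical"  # the limit has been reached or exceeded
    if ratio >= warn_ratio:
        return "warning"  # early warning, time to adjust the quota
    return None


MIB = 1024 * 1024
# A container using 450 MiB of a 512 MiB memory limit (made-up numbers).
print(check_limit(450 * MIB, 512 * MIB))  # prints "warning"
```

Warning below the hard limit is the point of the sketch: the alert fires while there is still time to raise the quota or reduce usage.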


1.1 Monitoring and Logging Designed for Docker Datacenter

Docker Universal Control Plane (UCP) includes real-time monitoring of the cluster state, as well as real-time metrics and logs for each container. Operating larger infrastructures requires a longer retention time for logs and metrics and the capability to correlate metrics, logs and events on several levels (cluster, nodes, applications and containers). A comprehensive monitoring and logging solution ought to provide the following operational insights:

  • Auditing of Docker events. Tracking all Docker events provides a clear view of the container life cycle. For example, by collecting events you gain insight into what happens with your containers during (re)deployments or the re-scheduling of containers to different nodes. Some containers might be configured for automatic restarts, and the events can indicate whether container processes crash frequently. In case of out-of-memory events, it might be wise to modify the memory limits or check with developers why the event happened. Docker events also carry information critical for the security of applications, such as:
    • Version changes of application images
    • Application shutdowns
    • Changes of storage volumes or network settings
    • Deletion of storage volumes, which might cause data loss
  • Resource usage for capacity planning and tuning. Resource management is one of the main advantages of running multi-tenant workloads on shared resources with Docker. It requires defined resource limits for CPU, I/O and memory. Many organisations face the challenge that they don’t know the exact requirements of their Dockerized applications, typically because those applications were deployed in other ways in the past. Monitoring the resources containers actually use helps determine the right limits and shows whether the assumed limits are truly appropriate.
  • Detailed metrics for cluster nodes and containers. Detailed metrics help optimize application resource usage. They are also the basis for defining application-specific alert criteria for any critical resources applications depend on. Metrics are aggregated for all hosts, images and containers and are filterable by host, image and container. This lets you drill down from a cluster view to a single container while troubleshooting or simply trying to understand operational details. Long retention times for metrics make it possible to compare resource usage before and after different deployments and releases, or to troubleshoot problems that appear only when a service has been running for several days or weeks.
  • Centralized log management with full-text search, filtering, and analytics across all containers. Logs should be collected, parsed and shipped to an indexing engine. The integrated charting functions in Logsene and its integrations with Kibana and Grafana make it easy to analyze logs collected in Docker EE.
  • Anomaly detection and alerts for all logs and metrics. Anomaly detection can help reduce the noise and alert fatigue often caused by classic threshold-based alerts. Log alerting is also possible with Logsene, e.g. to detect anomalies in the frequency of logs matching a specific query. For example, a search for “error” in the system might normally return a dozen non-critical errors, which could be ignored, whereas an increase in error logs indicates that something might be going wrong. Another type of alert is the Heartbeat alert for all cluster nodes. Disk space alerts are very useful for Docker nodes, because Docker images might consume a lot of disk space. Docker EE runs some cleanup agents to remove unused containers and images; nevertheless, the default disk-space alert created by SPM provides an early warning before the capacity limit is reached.
  • Long retention time for logs, metrics and events. Comparing metrics and logs across deployments or watching performance under different workloads requires storing logs and metrics for a reasonable time. We have seen cases where memory leaks went undetected at first and only became serious after a few weeks of stable operation. In such cases all context information, such as logs, events and metrics, can be very valuable in identifying the root cause.
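
As a rough illustration of the event auditing described in the first item above, the Python sketch below counts "die" and "oom" container events per container. It assumes the events arrive as JSON lines, as produced for example by `docker events --format '{{json .}}'`. The function names, the threshold of three exits and the sample events are hypothetical; this is not how Sematext Docker Agent processes events.

```python
import json
from collections import Counter
from typing import Dict, Iterable, List


def summarize_events(lines: Iterable[str]) -> Dict[str, Counter]:
    """Count 'die' and 'oom' container events per container name."""
    summary = {"die": Counter(), "oom": Counter()}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("Type") != "container":
            continue  # skip network, volume and image events
        action = event.get("Action")
        if action in summary:
            actor = event.get("Actor", {})
            # Fall back to the container ID when no name attribute is present.
            name = actor.get("Attributes", {}).get("name", actor.get("ID", "unknown"))
            summary[action][name] += 1
    return summary


def crash_suspects(summary: Dict[str, Counter], min_deaths: int = 3) -> List[str]:
    """Containers that exited at least min_deaths times (hypothetical threshold)."""
    return sorted(name for name, count in summary["die"].items() if count >= min_deaths)


# Made-up sample events for illustration only.
sample = [
    '{"Type": "container", "Action": "die", "Actor": {"ID": "a1", "Attributes": {"name": "web"}}}',
    '{"Type": "container", "Action": "die", "Actor": {"ID": "a1", "Attributes": {"name": "web"}}}',
    '{"Type": "container", "Action": "die", "Actor": {"ID": "a1", "Attributes": {"name": "web"}}}',
    '{"Type": "container", "Action": "oom", "Actor": {"ID": "b2", "Attributes": {"name": "batch"}}}',
    '{"Type": "network", "Action": "connect", "Actor": {"ID": "n1", "Attributes": {}}}',
]
report = summarize_events(sample)
print(crash_suspects(report))  # prints ['web']
print(dict(report["oom"]))     # prints {'batch': 1}
```

A container that shows up in the first list is a candidate for a crash loop, and one that shows up in the second is a candidate for a higher memory limit or a conversation with its developers.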

1.2 Sematext Docker Agent

The Docker EE architecture is open for extensions, such as monitoring and logging. This document explains how Docker EE can be extended with Sematext SPM for Docker performance monitoring and Logsene for log management. More specifically, we will use the open-source Sematext Docker Agent to collect all data from hosts and containers, providing a complete Docker monitoring and logging solution.

Sematext Docker Agent is a modern, Docker-aware metrics, events, and log collection agent. It runs as a tiny container on every Docker host and collects logs, metrics and events for all cluster hosts and all containers from the Docker Remote API. It auto-discovers all containers, including the containers for Docker UCP services. Sematext Docker Agent streams metrics, events, and logs via TLS (HTTPS) to SPM and Logsene. Once the agent is deployed, all logs and metrics are immediately available in SPM and Logsene.
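
To give a feel for the kind of data available from the Docker Remote API, the Python sketch below derives CPU and memory percentages from a single container stats payload. It is a minimal illustration and not the agent's actual implementation. The field names follow the container stats response of the Docker Engine API as we understand it; the function names and the sample numbers are assumptions made for the example.

```python
def cpu_percent(stats: dict) -> float:
    """CPU usage in percent, computed from the current and previous samples."""
    cpu = stats["cpu_stats"]
    pre = stats["precpu_stats"]
    cpu_delta = cpu["cpu_usage"]["total_usage"] - pre["cpu_usage"]["total_usage"]
    system_delta = cpu["system_cpu_usage"] - pre["system_cpu_usage"]
    if cpu_delta <= 0 or system_delta <= 0:
        return 0.0  # first sample or counter reset, no meaningful delta
    online_cpus = cpu.get("online_cpus") or 1
    return cpu_delta / system_delta * online_cpus * 100.0


def memory_percent(stats: dict) -> float:
    """Memory usage as a percentage of the container's memory limit."""
    mem = stats["memory_stats"]
    limit = mem.get("limit", 0)
    if not limit:
        return 0.0
    return mem.get("usage", 0) / limit * 100.0


# Made-up payload, reduced to the fields the sketch reads.
sample_stats = {
    "cpu_stats": {
        "cpu_usage": {"total_usage": 200_000_000},
        "system_cpu_usage": 2_000_000_000,
        "online_cpus": 2,
    },
    "precpu_stats": {
        "cpu_usage": {"total_usage": 100_000_000},
        "system_cpu_usage": 1_000_000_000,
    },
    "memory_stats": {"usage": 256 * 1024 * 1024, "limit": 512 * 1024 * 1024},
}
print(round(cpu_percent(sample_stats), 1), round(memory_percent(sample_stats), 1))
```

Both values are relative: CPU is the container's share of host CPU time between two samples, scaled by the number of online CPUs, and memory is measured against the container's own limit.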

For more information, visit the Sematext Docker Agent page.