2017 is almost here and, like last year, we thought we’d share how 2016 went for us. We remain committed to be your “one-stop shop” for all things Elasticsearch and Solr: from Consulting, Production Support, and Training, to complementing that with our Logsene for all your logs, and SPM for all your monitoring needs.
It’s safe to say 2016 was the year of Docker and by extension Kubernetes, Mesos, Docker Swarm, among others, too. They stopped being just early adopters’ toys and have become production-ready technologies used by many. This year we’ve added excellent support for Docker monitoring with SPM and logging with Logsene via the open-source Sematext Docker Agent.
Running on Elasticsearch on Docker sounds like a natural fit – both technologies promise elasticity. However, running a truly elastic Elasticsearch cluster on Docker Swarm became somewhat difficult with Docker 1.12 in Swarm mode. Why? Since Elasticsearch gave up on multicast discovery (by moving multicast node discovery into a plugin and not including it by default) one has to specify IP addresses of all master nodes to join the cluster. Unfortunately, this creates the chicken or the egg problem in the sense that these IP addresses are not actually known in advance when you start Elasticsearch as a Swarm service! It would be easy if we could use the shared Docker bridge or host network and simply specify the Docker host IP addresses, as we are used to it with the “docker run” command. However, “docker service create” rejects the usage of bridge or host network. Thus, the question remains: How can we deploy Elasticsearch in a Docker Swarm cluster?
Point out key Kubernetes metrics and log elements to help you troubleshoot and tune Docker and Kubernetes
Managing microservices in containers is typically done with Cluster Managers and Orchestration tools such as Google Kubernetes, Apache Mesos, Docker Swarm, Docker Cloud, Amazon ECS, Hashicorp Nomad just to mention a few. However, each platform has slightly different of options to deploy containers or schedule tasks to each cluster node. This is why we started a Series of blog post with Docker Swarm Monitoring, and continue today with a quick tutorial for Container Monitoring and Log Collection on Kubernetes.
Docker is all the rage these days, but one doesn’t hear about running Solr on Docker very much.
Last month, we gave a talk on the topic of running containerized Solr at the Lucene Revolution conference in Boston, the biggest open source conference dedicated to Apache Lucene/Solr. The 40-minute talk included a live demo that shows how to actually do it, while addressing a number of important bits if you want to run Solr on Docker in production.
Curious to check the presentation? You may find it below.
Or, interested in listening to the 40-minute talk? Check it below.
Needless to say, monitoring Solr is essential in production and Docker is disruptive in many ways, and there are many things that are slightly different and worth mentioning. For instance, one can create, deploy, and run applications by using containers and this gives a significant performance boost and reduces the size of the applications.
This is a guest post by Prof. Chanwit Kaewkasi, Docker Captain who organized Swarm3K – the largest Docker Swarm cluster to date.
Swarm3K was the second collaborative project trying to form a very large Docker cluster with the Swarm mode. It happened on 28th October 2016 with more than 50 individuals and companies joining this project.
Sematext was one of the very first companies that offered to help us by offering their Docker monitoring and logging solution. They became the official monitoring system for Swarm3K. Stefan, Otis and their team provided wonderful support for us from the very beginning.
To my knowledge, Sematext is one and the only Docker monitoring company which allow to deploy the monitoring agents as the global Docker service at the moment. This deployment model provides for a greatly simplified the monitoring process.
Swarm3K Setup and Workload
There were two planned workloads:
MySQL with WordPress cluster
The 25 nodes formed a MySQL cluster. We experiences some mixing of IP addresses from both mynet and ingress networks. This was the same issue found when forming a cluster of Apache Spark in the past (see https://github.com/docker/docker/issues/24637). We prevented this by binding the cluster only to a single overlay network.
A WordPress node was scheduled somewhere on our huge cluster, and we intentionally didn’t control where it should be. When we were trying to connect a WordPress node to the backend MySQL cluster, the connection kept timing out. We concluded that a WordPress / MySQL combo would be set to run correctly if we put them together in the same DC.
We aimed for 3000 nodes, but in the end we successfully formed a working, geographically distributed 4,700-node Docker Swarm cluster.
What we also learned from this issue was that the performance of the overlay network greatly depends on the correct tuning of network configuration on each host.
When the MySQL / WordPress test failed, we changed the plan to try NGINX on Routing Mesh.
The Ingress network is a /16 network which supports up to 64K IP addresses. Suggested by Alex Ellis, we then started 4,000 NGINX containers on the formed cluster. During this test, nodes were still coming in and out. The NGINX service started and the Routing Mesh was formed. It could correctly serve even as some nodes kept failing.
We concluded that the Routing Mesh in 1.12 is rock solid and production ready.
We then stopped the NGINX service and started to test the scheduling of as many containers as possible.
This time we simply used “alpine top” as we did for Swarm2K. However, the scheduling rate was quite slow. We went to 47,000 containers in approximately 30 minutes. Therefore it was going to be ~10.6 hours to fill the cluster with 1M containers. Unfortunately, because that would take too long, we decided to shut down the manager as it made no point to go further.
Scheduling with a huge batch of containers stressed out the cluster. We scheduled the launch of a large number of containers using “docker scale alpine=70000”. This created a large scheduling queue that would not commit until all 70,000 containers were finished scheduling. This is why when we shut down the managers all scheduling tasks disappeared and the cluster became unstable, for the Raft log got corrupted.
One of the most interesting things was that we were able to collect enough CPU profile information to show us what was keeping the cluster busy.
Here we can see that only 0.42% of the CPU was spent on the scheduling algorithm. I think we can say with certainty:
The Docker Swarm scheduling algorithm in version 1.12 is quite fast.
I would like to thanks Sematext again for the best-of-class Docker monitoring system, DigitalOcean for providing all resources for huge Docker Swarm managers, and the Docker Engineering team for making this great software and supporting us during the run.
While this time around we didn’t manage to launch all 150,000 containers we wanted to have, we did manage to create a nearly 5,000-node Docker Swarm cluster distributed over several continents. Lessons we’ve learned from this experiment will help us launch another huge Docker Swarm cluster next year. Thank you all and I’m looking forward to the new run!
One of the things you learn when attending Sematext Solr training is how to scale Solr. We discuss various topics regarding leader shards and their replicas – things like when to go for more leaders, when to go for more replicas and when to go for both. We discuss what you can do with them, how to control their creation and work with them. In this blog entry I would like to focus on one of the mentioned aspects – handling replica creation in SolrCloud. Note that while this is not limited to Solr on Docker deployment, if you are considering running Solr in Docker containers you will want to pay attention to this.
SwarmZilla/swarm3k by Docker Captain Chanwit Kaewkasi is a unique community project/event aimed at launching a Docker Swarm cluster with 3000+ community-sponsored nodes. The previous project – Swarm2k – successfully demonstrated a 2000+ node Swarm cluster with only 3 Swarm managers running a workload with 95,000 tasks/containers on worker nodes.
Swarm3k goal is to run more than 3,000 nodes with a very large subnet – with 4096 IP addresses and several workloads, such as a 20 node MySQL cluster and 2,980 WordPress tasks on top of it.
Test run: October 19, 2016, 3 PM UTC – smaller test run – creating nodes, deploying services, etc.
Sematext is the official Docker Swarm3k monitoring partner – see Live Swarm3k Dashboard at http://sematext.com/swarm3k. Compared to the setup for Swarm2k we will enrich the monitoring setup with host metrics, container metrics and logging of Docker events and task errors. All these bits of operational data are collected by Sematext Docker Agent.
Until recently, automating the deployment of Performance Monitoring agents in Docker Swarm clusters was challenging because monitoring agents had to be deployed to each cluster node and the previous Docker releases (<Docker engine v1.12 / Docker Swarm 1.2.4) had no global service scheduler (Github issue #601). Scheduling services with via docker-compose and scheduling constraints required manual updates when the number of nodes changed in the swarm cluster – definitely not convenient for dynamic scaling of clusters! In Docker Swarm Monitoring and Logging we shared some Linux shell acrobatics as workaround for this issue.
The good news: All this has changed with Docker Engine v1.12 and new Swarm Mode. The latest release of Docker v1.12 provides many new features for orchestration and the new Swarm mode made it much easier to deploy Swarm clusters.
Docker is growing by leaps and bounds, and along with it its ecosystem. Being light, the predominant container deployment involves running just a single app or service inside each container. Most software products and services are made up of at least several such apps/services. We all want all our apps/services to be highly available and fault tolerant. Thus, Docker containers in an organization quickly start popping up like mushrooms after the rain. They multiply faster than rabbits. While in the beginning we play with them like cute little pets, as their number quickly grow we realize we are dealing with aherd of cattle, implying we’ve become cowboys. Managing a herd with your two hands, a horse, and a lasso willget you only so far. You won’t be able to ride after each and every calf that wonders in the wrong direction. To get back to containers from this zoological analogy – operating so many moving pieces at scale is impossible without orchestration – this is why we’ve seen the rise of Docker Swarm, Kubernetes, Mesos, CoreOS, RancherOS and so on.
Container orchestration helps you manage your containers, their placement, their resources, and their whole life cycle. While containers and applications in them are running, in addition to the whole life cycle management we need container monitoring and log management so we can troubleshoot performance or stability issues, debug or tune applications, and so on. Just like with orchestration, there are a number of open-source container monitoring and logging tools. It’s great to have choices, but having lots of them means you need to evaluate and compare them to pick the one that best matches your needs.
DevOps Tools Comparison
We’ve open-sourced our Sematext Docker Agent (SDA for short) which works with SPM for monitoring and Logsene for log management (think of it as ELK as a Service), and wanted to provide a high level comparison of SDA and several popular Docker monitoring and logging tools, like CAdvisor, Logspout, and others. In the following table we group tools by functionality and include monitoring agents, log collectors and shippers, storage backends, and tools that provide the UI and visualizations. For each functionality we list in the “Common Tools” column one or more popular open-source tools that provide that functionality. An empty “Common Tools” cell means there are no popular open-source tools that provide it, or at least we are not aware of it — if we messed something up, please leave a comment or tweet @sematext.
Collect Logs from Docker API (including auto-discovery of new containers)
Sematext Docker Agent
Logspout Routing setup for containers via HTTP API to syslog, redis, kafka, logstash Docker Logging Drivers (e.g. syslog, journald, fluentd, etc.)
Sematext Docker Agent (routing of logs to different indices based on container labels)
Automatic log tagging (with Docker Compose or Swarm or Kubernetes metadata)
For Kubernetes: fluentd-elasticsearch, assumes Elasticsearch deployed locally
Sematext Docker Agent
Collect Docker Metrics
Sematext Docker Agent
Collect Docker Events
Sematext Docker Agent
Logs format detection (most tools need a static setup per logfile/application)
Sematext Docker Agent (out of the box format detection and parsing; the parser and the logagent-js pattern library is open source)
Logs parsing and shipping
Fluentd Logstash rsyslog syslog-ng
Sematext Docker Agent
Logs storage and indexing
Logsene (exposes Elasticsearch API)
Logs anomaly detection and alerting
Log search and analytics
Logsene (Logsene’s own UI or integrated Kibana, or Grafana connected to Logsene via Elasticsearch data source)
Some of the functionality provided by SPM and Logsene is not available in some of the most popular open-source monitoring and logging tools included here
Some of the SPM and Logsene functionality is indeed provided by some of the open-source tools, however none of them seems to encompass all the features, forcing one to mix and match and head down the tech debt-ridden Franken-monitoring path
Try it yourself in the MindMap below – pick a few functionalities and see how many different tools you might have to use?
Avoid building technical-debt & Franken-monitoring by using a limited number of Docker monitoring & logging tools Tweet
P.S.: Sematext Docker Agent is available in the Rancher Community Catalog and shows up with our new mascot “Octi” only one more pet 🙂 – so if you use RancherOS search for “sematext” in the RancherOS Catalog and within a few clicks you’ll have the Sematext Docker Agent deployed to your RancherOS clusters!
Docker Datacenter (DDC) simplifies container orchestration and increases the flexibility and scalability of application deployments. However, the high level of automation create new challenges for monitoring and log management. Organizations that introduce Docker Datacenter manage container deployments in various scenarios e.g., on bare metal, virtual machines, or hybrid clouds. That’s why at Sematext we are seeing a shift from traditional server monitoring to container-centric monitoring. This post is an excerpt from the newly published “Reference Architecture: Monitoring and Logging for Docker Datacenter” and shows how Docker Datacenter could be extended with Logging and Monitoring services.
Download Reference Architecture Logging & Monitoring for Docker Datacenter Tweet
The Docker Universal Control Plane (UCP) management functionalities include real-time monitoring of the cluster state, real-time metrics and logs for each container. However, operating larger infrastructures requires a longer retention time for logs and metrics and the capability to correlate metrics, logs and events on several levels (cluster, nodes, applications and containers). A comprehensive monitoring and logging solution ought to provide the following operational insights: