2016 Year in Review: Monitoring and Logging Highlights

2017 is almost here and, like last year, we thought we’d share how 2016 went for us.  We remain committed to being your “one-stop shop” for all things Elasticsearch and Solr: from Consulting, Production Support, and Training, to complementing that with Logsene for all your logs and SPM for all your monitoring needs.

Docker

It’s safe to say 2016 was the year of Docker and, by extension, of Kubernetes, Mesos, Docker Swarm, and friends.  They stopped being just early adopters’ toys and have become production-ready technologies used by many. This year we’ve added excellent support for Docker monitoring with SPM and logging with Logsene via the open-source Sematext Docker Agent.

Read More

Making Elasticsearch in Docker Swarm Elastic

Running Elasticsearch on Docker sounds like a natural fit – both technologies promise elasticity. However, running a truly elastic Elasticsearch cluster on Docker Swarm became somewhat difficult with Docker 1.12 in Swarm mode. Why? Since Elasticsearch gave up on multicast discovery (it was moved into a plugin and is no longer included by default), one has to specify the IP addresses of all master nodes to join the cluster.  Unfortunately, this creates a chicken-and-egg problem: those IP addresses are not actually known in advance when you start Elasticsearch as a Swarm service!  It would be easy if we could use the shared Docker bridge or host network and simply specify the Docker host IP addresses, as we are used to doing with the “docker run” command. However, “docker service create” rejects the use of bridge and host networks. Thus, the question remains: how can we deploy Elasticsearch in a Docker Swarm cluster?
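One possible way out (a minimal sketch, not necessarily the exact approach from the original post) is to attach the service to a user-defined overlay network and let Elasticsearch discover its peers through Swarm’s built-in DNS, which resolves tasks.<service-name> to the IP addresses of all tasks of that service. The network name, replica count, and image tag below are just examples:

    # Sketch: create an overlay network and an Elasticsearch service on it.
    # "docker service create" does not accept bridge/host networks, so we use an overlay
    # network and point unicast discovery at Swarm's DNS name for the service's tasks.
    docker network create --driver overlay es-net

    docker service create --name elasticsearch --network es-net --replicas 3 \
      elasticsearch:2.4 \
      elasticsearch -Des.discovery.zen.ping.unicast.hosts=tasks.elasticsearch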

Read More

Kubernetes Containers: Logging and Monitoring support

In this post we will:

  • Introduce Kubernetes concepts and motivation for Kubernetes-aware monitoring and logging tooling
  • Show how to deploy the Sematext Docker Agent to each Kubernetes node with DaemonSet
  • Point out key Kubernetes metrics and log elements to help you troubleshoot and tune Docker and Kubernetes

Managing microservices in containers is typically done with cluster managers and orchestration tools such as Google Kubernetes, Apache Mesos, Docker Swarm, Docker Cloud, Amazon ECS, and HashiCorp Nomad, to mention just a few. However, each platform has slightly different options for deploying containers or scheduling tasks on cluster nodes. This is why we started a series of blog posts with Docker Swarm Monitoring, and continue today with a quick tutorial for container monitoring and log collection on Kubernetes.
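As a rough illustration of the DaemonSet approach mentioned above, here is a minimal, hypothetical manifest that schedules one agent per Kubernetes node; the image name, API version, and token values are placeholders rather than the exact configuration from the post:

    # Sketch: run one monitoring/logging agent on every Kubernetes node via a DaemonSet.
    cat <<'EOF' | kubectl create -f -
    apiVersion: extensions/v1beta1
    kind: DaemonSet
    metadata:
      name: sematext-agent
    spec:
      template:
        metadata:
          labels:
            app: sematext-agent
        spec:
          containers:
          - name: sematext-agent
            image: sematext/sematext-agent-docker
            env:
            - name: SPM_TOKEN
              value: "YOUR_SPM_TOKEN"
            - name: LOGSENE_TOKEN
              value: "YOUR_LOGSENE_TOKEN"
            volumeMounts:
            - name: docker-sock
              mountPath: /var/run/docker.sock
          volumes:
          - name: docker-sock
            hostPath:
              path: /var/run/docker.sock
    EOF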

Read More


Running Solr in Docker: How & Why

Docker is all the rage these days, but one doesn’t hear about running Solr on Docker very much.

Last month, we gave a talk on the topic of running containerized Solr at the Lucene Revolution conference in Boston, the biggest open source conference dedicated to Apache Lucene/Solr. The 40-minute talk included a live demo showing how to actually do it, and covered a number of important points for running Solr on Docker in production.

Curious to check the presentation, or want to listen to the full 40-minute talk? You can find both below.

Indeed, a rapidly growing number of organizations are using Solr and Docker in production. If you also run Solr in Docker, be sure to check out Docker + Solr How-to: Monitoring the Official Solr Docker Image.

Needless to say, monitoring Solr is essential in production, and Docker is disruptive in many ways, so there are quite a few things that work slightly differently and are worth pointing out. For instance, creating, deploying, and running applications as containers can give a significant performance boost and reduce the size of the applications compared to traditional virtual machines.
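For reference, getting the official Solr image running locally takes only a couple of commands (a minimal sketch; the container and core names are just examples):

    # Sketch: start the official Solr image and create an example core.
    docker run -d --name solr -p 8983:8983 solr
    docker exec -it solr bin/solr create_core -c example
    # Solr's admin UI is then available at http://localhost:8983/solr/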


Docker Swarm Lessons from Swarm3K

This is a guest post by Prof. Chanwit Kaewkasi, Docker Captain who organized Swarm3K – the largest Docker Swarm cluster to date.

Swarm3K Review

Swarm3K was the second collaborative project trying to form a very large Docker cluster in Swarm mode. It took place on October 28, 2016, with more than 50 individuals and companies joining.

Sematext was one of the very first companies to offer help, providing their Docker monitoring and logging solution. They became the official monitoring system for Swarm3K. Stefan, Otis and their team provided wonderful support from the very beginning.

Swarm3K public dashboard by Sematext

To my knowledge, Sematext is currently the only Docker monitoring vendor that allows deploying its monitoring agents as a global Docker service. This deployment model greatly simplifies the monitoring setup.

Swarm3K Setup and Workload

There were two planned workloads:

  1. MySQL with WordPress cluster
  2. C1M

The 25 nodes formed a MySQL cluster. We experienced some mixing of IP addresses from both the mynet and ingress networks. This was the same issue found when forming an Apache Spark cluster in the past (see https://github.com/docker/docker/issues/24637). We prevented it by binding the cluster to a single overlay network only.
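A hedged sketch of that workaround: attaching the service to one user-defined overlay network and not publishing any ports keeps the tasks off the ingress network (the network name matches the text above, while the image tag and password are illustrative):

    # Sketch: create a single overlay network and attach the MySQL service only to it;
    # without published ports the tasks do not also receive an ingress network address.
    docker network create --driver overlay mynet
    docker service create --name mysql --network mynet --replicas 25 \
      -e MYSQL_ROOT_PASSWORD=changeme mysql:5.7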

A WordPress node was scheduled somewhere on our huge cluster, and we intentionally didn’t control where. When we tried to connect a WordPress node to the backend MySQL cluster, the connection kept timing out. We concluded that a WordPress / MySQL combo would only run correctly if we placed both in the same data center.

We aimed for 3000 nodes, but in the end we successfully formed a working, geographically distributed 4,700-node Docker Swarm cluster.

Swarm3K Observations

What we also learned from this issue was that the performance of the overlay network greatly depends on the correct tuning of network configuration on each host.

When the MySQL / WordPress test failed, we changed the plan to try NGINX on Routing Mesh.

The ingress network is a /16 network, which supports up to 64K IP addresses. At Alex Ellis’s suggestion, we then started 4,000 NGINX containers on the formed cluster. During this test, nodes were still coming and going. The NGINX service started and the Routing Mesh was formed. It kept serving requests correctly even as some nodes failed.
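For reference, starting a service like that on the routing mesh is a one-liner (a minimal sketch; the published port and image tag are illustrative):

    # Sketch: an NGINX service published on port 80; the routing mesh makes it reachable
    # on every Swarm node, regardless of where the 4,000 tasks actually run.
    docker service create --name nginx --publish 80:80 --replicas 4000 nginx
    # Any node's IP should then answer, e.g.: curl http://<any-swarm-node>/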

We concluded that the Routing Mesh in 1.12 is rock solid and production ready.

We then stopped the NGINX service and started to test the scheduling of as many containers as possible.

This time we simply used “alpine top” as we did for Swarm2K. However, the scheduling rate was quite slow. We reached 47,000 containers in approximately 30 minutes, meaning it would have taken ~10.6 hours to fill the cluster with 1M containers. Because that would take too long, we decided to shut down the managers, as there was no point in going further.

Swarm3k Task Status

Scheduling a huge batch of containers stressed the cluster. We scheduled the launch of a large number of containers using “docker service scale alpine=70000”.  This created a large scheduling queue that would not commit until all 70,000 containers were finished scheduling. That is why, when we shut down the managers, all scheduling tasks disappeared and the cluster became unstable, as the Raft log got corrupted.

One of the most interesting things was that we were able to collect enough CPU profile information to show us what was keeping the cluster busy.

dockerd CPU flame graph (1)

Here we can see that only 0.42% of the CPU was spent on the scheduling algorithm. I think we can say with certainty: 

The Docker Swarm scheduling algorithm in version 1.12 is quite fast.

This means that there is an opportunity to introduce a more sophisticated scheduling algorithm that could result in even better resource utilization.

dockerd CPU flame graph (2)

We found that a lot of CPU cycles were spent on node communication. Here we see Libnetwork’s memberlist layer, which used ~12% of the overall CPU.

dockerd CPU flame graph (3)

Another major CPU consumer was the Raft communication, which also triggered garbage collection here. This used ~30% of the overall CPU.

Docker Swarm Lessons Learned

Here’s the summarized list of what we learned together.

  1. For a large set of nodes like this, managers require a lot of CPU. CPU usage spikes whenever the Raft recovery process kicks in.
  2. If the leading manager dies, you had better stop “docker daemon” on that node and wait until the cluster becomes stable again with n-1 managers.
  3. Don’t use “dockerd -D” in production. Of course, I know you won’t do that.
  4. Keep snapshot reservation as small as possible. The default Docker Swarm configuration will do. Persisting Raft snapshots uses extra CPU.
  5. Thousands of nodes require a huge amount of resources to manage, both in terms of CPU and network bandwidth. In contrast, hundreds of thousands of tasks require high-memory nodes.
  6. 500 – 1,000 nodes are recommended for production. I’m guessing you won’t need more than this in most cases, unless you’re planning on being the next Twitter.
  7. If managers seem to be stuck, wait for them. They’ll recover eventually.
  8. The --advertise-addr parameter is mandatory for the Routing Mesh to work (see the short sketch after this list).
  9. Put your compute nodes as close to your data nodes as possible. The overlay network is great, but you will need to tweak the Linux network configuration on all hosts to get the best out of it.
  10. Despite slow scheduling, Docker Swarm mode is robust. There were no task failures this time, even with the unpredictable network connecting this huge cluster together.
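A quick sketch of the --advertise-addr flag from point 8 (the addresses and join token below are placeholders):

    # Initialize the first manager, explicitly advertising an address reachable by the
    # other nodes; without it the routing mesh cannot form properly.
    docker swarm init --advertise-addr 192.0.2.10

    # Workers join via the advertised manager address and the worker join token.
    docker swarm join --token <worker-join-token> 192.0.2.10:2377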

“Ten Docker Swarm Lessons Learned” by @chanwit

Credits
Finally, I would like to thank all Swarm3K heroes: @FlorianHeigl, @jmaitrehenry from PetalMD, @everett_toews from Rackspace,  Internet Thailand, @squeaky_pl, @neverlock, @tomwillfixit from Demonware, @sujaypillai from Jabil, @pilgrimstack from OVH, @ajeetsraina from Collabnix, @AorJoa and @PNgoenthai from Aiyara Cluster, @f_soppelsa, @GroupSprint3r, @toughIQ, @mrnonaki, @zinuzoid from HotelQuickly,  @_EthanHunt_,  @packethost from Packet.io, @ContainerizeT – ContainerizeThis The Conference, @_pascalandy from FirePress, @lucjuggery from TRAXxs, @alexellisuk, @svega from Huli, @BretFisher,  @voodootikigod from Emerging Technology Advisors, @AlexPostID,  @gianarb from ThumpFlow, @Rucknar,  @lherrerabenitez, @abhisak from Nipa Technology, and @enlamp from NexwayGroup.

I would like to thank Sematext again for their best-in-class Docker monitoring system, DigitalOcean for providing all the resources for the huge Docker Swarm managers, and the Docker engineering team for making this great software and supporting us during the run.

While this time around we didn’t manage to launch all 150,000 containers we wanted, we did manage to create a nearly 5,000-node Docker Swarm cluster distributed over several continents.  The lessons we learned from this experiment will help us launch another huge Docker Swarm cluster next year.  Thank you all, and I’m looking forward to the next run!


Handling Shards in SolrCloud


One of the things you learn when attending Sematext Solr training is how to scale Solr. We discuss various topics regarding leader shards and their replicas – things like when to go for more leaders, when to go for more replicas, and when to go for both. We discuss what you can do with them, how to control their creation, and how to work with them. In this blog entry I would like to focus on one of those aspects – handling replica creation in SolrCloud. Note that while this is not limited to Solr on Docker deployments, if you are considering running Solr in Docker containers you will want to pay attention to this.
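As a quick illustration of the kind of shard and replica handling discussed there, here is a minimal sketch using the SolrCloud Collections API (the collection, shard, and host names are just examples):

    # Create a collection with 2 shards and 2 replicas per shard.
    curl 'http://localhost:8983/solr/admin/collections?action=CREATE&name=myCollection&numShards=2&replicationFactor=2'

    # Later, add one more replica to a specific shard.
    curl 'http://localhost:8983/solr/admin/collections?action=ADDREPLICA&collection=myCollection&shard=shard1'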

Read More

Taming SwarmZilla: 150k Containers in 3K+ Docker Swarm Nodes

SwarmZilla/swarm3k by Docker Captain Chanwit Kaewkasi is a unique community project/event aimed at launching a Docker Swarm cluster with 3000+ community-sponsored nodes. The previous project – Swarm2k – successfully demonstrated a 2000+ node Swarm cluster with only 3 Swarm managers running a workload with 95,000 tasks/containers on worker nodes.

The Swarm3k goal is to run more than 3,000 nodes with a very large subnet (4,096 IP addresses) and several workloads, such as a 20-node MySQL cluster with 2,980 WordPress tasks on top of it.

Schedule:

  • Test run: October 19, 2016, 3 PM UTC – smaller test run – creating nodes, deploying services, etc. 
  • Real run: October 28, 2016, 3 PM UTC

Joining:

You can join with your own Docker containers. See swarmzilla/swarm3k for instructions.  Got questions?  We’re in swarmzilla Gitter chat.

Observing / Monitoring:

Sematext is the official Docker Swarm3k monitoring partner – see the live Swarm3k dashboard at http://sematext.com/swarm3k. Compared to the Swarm2k setup, we will enrich the monitoring with host metrics, container metrics, and logging of Docker events and task errors.  All these bits of operational data are collected by the Sematext Docker Agent.

Read More

Docker “Swarm Mode”: Full Cluster Monitoring & Logging with 1 Command

Until recently, automating the deployment of performance monitoring agents in Docker Swarm clusters was challenging: monitoring agents had to be deployed to each cluster node, and previous Docker releases (before Docker Engine v1.12 / Docker Swarm 1.2.4) had no global service scheduler (GitHub issue #601).  Scheduling services via docker-compose and scheduling constraints required manual updates whenever the number of nodes in the swarm cluster changed – definitely not convenient for dynamically scaling clusters! In Docker Swarm Monitoring and Logging we shared some Linux shell acrobatics as a workaround for this issue.

The good news: all this has changed with Docker Engine v1.12 and the new Swarm mode. Docker v1.12 provides many new orchestration features, and the new Swarm mode makes it much easier to deploy Swarm clusters.

With Docker v1.12, services can be scheduled globally – similar to Kubernetes DaemonSets, RancherOS global services, or CoreOS global fleet services.
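A rough sketch of what that single command looks like for a monitoring agent deployed as a global service (the image name and token values are placeholders, not the exact command from the post):

    # Sketch: one command schedules the agent on every current and future Swarm node.
    docker service create --mode global --name sematext-agent-docker \
      --mount type=bind,source=/var/run/docker.sock,target=/var/run/docker.sock \
      -e SPM_TOKEN=YOUR_SPM_TOKEN -e LOGSENE_TOKEN=YOUR_LOGSENE_TOKEN \
      sematext/sematext-agent-docker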

Read More

Open Source Docker Monitoring & Logging

Pets ⇒ Cattle ⇒ Orchestration

Docker is growing by leaps and bounds, and so is its ecosystem.  Because containers are so light, the predominant deployment pattern is to run just a single app or service inside each container.  Most software products and services are made up of at least several such apps/services.  We all want all our apps/services to be highly available and fault tolerant.  Thus, Docker containers in an organization quickly start popping up like mushrooms after the rain.  They multiply faster than rabbits. While in the beginning we play with them like cute little pets, as their number quickly grows we realize we are dealing with a herd of cattle, which means we have become cowboys.  Managing a herd with your two hands, a horse, and a lasso will get you only so far.  You won’t be able to ride after each and every calf that wanders in the wrong direction.  To get back to containers from this zoological analogy – operating so many moving pieces at scale is impossible without orchestration.  This is why we’ve seen the rise of Docker Swarm, Kubernetes, Mesos, CoreOS, RancherOS and so on.


Containers multiply faster than Gremlins


Pets ⇒ Cattle ⇒ Orchestration + Operational Insights

Container orchestration helps you manage your containers, their placement, their resources, and their whole life cycle.  While containers and applications in them are running, in addition to the whole life cycle management we need container monitoring and log management so we can troubleshoot performance or stability issues, debug or tune applications, and so on.  Just like with orchestration, there are a number of open-source container monitoring and logging tools.  It’s great to have choices, but having lots of them means you need to evaluate and compare them to pick the one that best matches your needs.

DevOps Tools Comparison

We’ve open-sourced our Sematext Docker Agent (SDA for short), which works with SPM for monitoring and Logsene for log management (think of it as ELK as a Service), and wanted to provide a high-level comparison of SDA and several popular open-source Docker monitoring and logging tools, like cAdvisor, Logspout, and others.  In the comparison below we group tools by functionality and include monitoring agents, log collectors and shippers, storage backends, and tools that provide the UI and visualizations.  For each functionality, the “Common Tools” entry lists one or more popular open-source tools that provide it.  A “?” means there are no popular open-source tools that provide it, or at least we are not aware of any — if we messed something up, please leave a comment or tweet @sematext.

  • Collect logs from the Docker API (including auto-discovery of new containers). Common tools: Logspout. Sematext tools: Sematext Docker Agent.
  • Log routing. Common tools: Logspout (routing setup for containers via HTTP API to syslog, redis, kafka, logstash) and Docker logging drivers (e.g. syslog, journald, fluentd, etc.). Sematext tools: Sematext Docker Agent (routing of logs to different indices based on container labels).
  • Automatic log tagging (with Docker Compose, Swarm or Kubernetes metadata). Common tools: for Kubernetes, fluentd-elasticsearch (assumes Elasticsearch is deployed locally). Sematext tools: Sematext Docker Agent.
  • Collect Docker metrics. Common tools: cAdvisor. Sematext tools: Sematext Docker Agent.
  • Collect Docker events. Common tools: ?. Sematext tools: Sematext Docker Agent.
  • Log format detection (most tools need a static setup per logfile/application). Common tools: ?. Sematext tools: Sematext Docker Agent (out-of-the-box format detection and parsing; the parser and the logagent-js pattern library are open source).
  • Log parsing and shipping. Common tools: Fluentd, Logstash, rsyslog, syslog-ng. Sematext tools: Sematext Docker Agent.
  • Log storage and indexing. Common tools: Elasticsearch, Solr. Sematext tools: Logsene (exposes the Elasticsearch API).
  • Log anomaly detection and alerting. Common tools: ?. Sematext tools: Logsene.
  • Log search and analytics. Common tools: Kibana, Grafana. Sematext tools: Logsene (Logsene’s own UI, integrated Kibana, or Grafana connected to Logsene via the Elasticsearch data source).
  • Metrics storage and aggregation. Common tools: Graphite, OpenTSDB, KairosDB, Elasticsearch, InfluxDB, Prometheus. Sematext tools: SPM.
  • Metrics charts and dashboards. Common tools: Grafana, Kibana. Sematext tools: SPM.
  • Metrics anomaly detection and alerting. Common tools: InfluxDB, Prometheus. Sematext tools: SPM.
  • Correlation of metrics, logs and events. Common tools: ?. Sematext tools: SPM & Logsene integration.
This comparison shows a few things:

  • Some of the functionality provided by SPM and Logsene is not available in some of the most popular open-source monitoring and logging tools included here
  • Some of the SPM and Logsene functionality is indeed provided by some of the open-source tools, however none of them seems to encompass all the features, forcing one to mix and match and head down the tech debt-ridden Franken-monitoring path
  • Try it yourself in the mind map below – pick a few functionalities and see how many different tools you might have to use


Avoid building technical debt & Franken-monitoring by using a limited number of Docker monitoring & logging tools

Again, if we missed something, please leave a comment or tweet @sematext.
If you want to try Sematext Docker Agent, sign up for a free trial.

P.S.: Sematext Docker Agent is available in the Rancher Community Catalog and shows up with our new mascot “Octi” – just one more pet 🙂 – so if you use Rancher, search for “sematext” in the catalog and within a few clicks you’ll have the Sematext Docker Agent deployed to your Rancher clusters!



Monitoring Docker Datacenter Logs & Metrics

Docker Datacenter (DDC) simplifies container orchestration and increases the flexibility and scalability of application deployments.  However, the high level of automation creates new challenges for monitoring and log management. Organizations that introduce Docker Datacenter manage container deployments in various scenarios, e.g., on bare metal, virtual machines, or hybrid clouds. That’s why at Sematext we are seeing a shift from traditional server monitoring to container-centric monitoring. This post is an excerpt from the newly published “Reference Architecture: Monitoring and Logging for Docker Datacenter” and shows how Docker Datacenter can be extended with logging and monitoring services.

Download Reference Architecture Logging & Monitoring for Docker Datacenter

The Docker Universal Control Plane (UCP) management functionality includes real-time monitoring of the cluster state and real-time metrics and logs for each container. However, operating larger infrastructures requires a longer retention time for logs and metrics and the ability to correlate metrics, logs and events on several levels (cluster, nodes, applications and containers).  A comprehensive monitoring and logging solution ought to provide the following operational insights:

Read More