This is a guest post by Prof. Chanwit Kaewkasi, Docker Captain who organized Swarm3K – the largest Docker Swarm cluster to date.
Swarm3K was the second collaborative project trying to form a very large Docker cluster with the Swarm mode. It happened on 28th October 2016 with more than 50 individuals and companies joining this project.
Sematext was one of the very first companies that offered to help us by offering their Docker monitoring and logging solution. They became the official monitoring system for Swarm3K. Stefan, Otis and their team provided wonderful support for us from the very beginning.
To my knowledge, Sematext is one and the only Docker monitoring company which allow to deploy the monitoring agents as the global Docker service at the moment. This deployment model provides for a greatly simplified the monitoring process.
Swarm3K Setup and Workload
There were two planned workloads:
- MySQL with WordPress cluster
The 25 nodes formed a MySQL cluster. We experiences some mixing of IP addresses from both mynet and ingress networks. This was the same issue found when forming a cluster of Apache Spark in the past (see https://github.com/docker/docker/issues/24637). We prevented this by binding the cluster only to a single overlay network.
A WordPress node was scheduled somewhere on our huge cluster, and we intentionally didn’t control where it should be. When we were trying to connect a WordPress node to the backend MySQL cluster, the connection kept timing out. We concluded that a WordPress / MySQL combo would be set to run correctly if we put them together in the same DC.
We aimed for 3000 nodes, but in the end we successfully formed a working, geographically distributed 4,700-node Docker Swarm cluster.
What we also learned from this issue was that the performance of the overlay network greatly depends on the correct tuning of network configuration on each host.
When the MySQL / WordPress test failed, we changed the plan to try NGINX on Routing Mesh.
The Ingress network is a /16 network which supports up to 64K IP addresses. Suggested by Alex Ellis, we then started 4,000 NGINX containers on the formed cluster. During this test, nodes were still coming in and out. The NGINX service started and the Routing Mesh was formed. It could correctly serve even as some nodes kept failing.
We concluded that the Routing Mesh in 1.12 is rock solid and production ready.
We then stopped the NGINX service and started to test the scheduling of as many containers as possible.
This time we simply used “alpine top” as we did for Swarm2K. However, the scheduling rate was quite slow. We went to 47,000 containers in approximately 30 minutes. Therefore it was going to be ~10.6 hours to fill the cluster with 1M containers. Unfortunately, because that would take too long, we decided to shut down the manager as it made no point to go further.
Scheduling with a huge batch of containers stressed out the cluster. We scheduled the launch of a large number of containers using “docker scale alpine=70000”. This created a large scheduling queue that would not commit until all 70,000 containers were finished scheduling. This is why when we shut down the managers all scheduling tasks disappeared and the cluster became unstable, for the Raft log got corrupted.
One of the most interesting things was that we were able to collect enough CPU profile information to show us what was keeping the cluster busy.
Here we can see that only 0.42% of the CPU was spent on the scheduling algorithm. I think we can say with certainty:
The Docker Swarm scheduling algorithm in version 1.12 is quite fast.
This means that there is an opportunity to introduce a more sophisticated scheduling algorithm that could result in even better resource utilization.
We found that a lot of CPU cycles were spent on node communication. Here we see the Libnetwork’s member list layer. It used ~12% of the overall CPU.
Another major CPU consumer was the Raft communication, which also caused the GC here. This used ~30% of the overall CPU.
Docker Swarm Lessons Learned
Here’s the summarized list of what we learned together.
- For a large set of nodes like this, managers require a lot of CPUs. CPUs will spike whenever the Raft recovery process kicks in.
- If the Leading manager dies, you better stop “docker daemon” on that node and wait until the cluster becomes stable again with n-1 managers.
- Don’t use “dockerd -D” in production. Of course, I know you won’t do that.
- Keep snapshot reservation as small as possible. The default Docker Swarm configuration will do. Persisting Raft snapshots uses extra CPU.
- Thousands of nodes require a huge set of resources to manage, both in terms of CPU and Network bandwidth. In contrast, hundreds of thousand tasks require high Memory nodes.
- 500 – 1000 nodes are recommended for production. I’m guessing you won’t need larger than this in most cases, unless you’re planning on being the next Twitter.
- If managers seem to be stuck, wait for them. They’ll recover eventually.
- The parameter –advertise-addr is mandatory for Routing Mesh to work.
- Put your compute nodes as close to your data nodes as possible. The overlay network is great and will require tweaking Linux net configuration for all hosts to make it work best.
- Despite slow scheduling, Docker Swarm mode is robust. There were no task failures this time even with unpredictable network connecting this huge cluster together.
“Ten Docker Swarm Lessons Learned” by @chanwit
Finally, I would like to thank all Swarm3K heroes: @FlorianHeigl, @jmaitrehenry from PetalMD, @everett_toews from Rackspace, Internet Thailand, @squeaky_pl, @neverlock, @tomwillfixit from Demonware, @sujaypillai from Jabil, @pilgrimstack from OVH, @ajeetsraina from Collabnix, @AorJoa and @PNgoenthai from Aiyara Cluster, @f_soppelsa, @GroupSprint3r, @toughIQ, @mrnonaki, @zinuzoid from HotelQuickly, @_EthanHunt_, @packethost from Packet.io, @ContainerizeT – ContainerizeThis The Conference, @_pascalandy from FirePress, @lucjuggery from TRAXxs, @alexellisuk, @svega from Huli, @BretFisher, @voodootikigod from Emerging Technology Advisors, @AlexPostID, @gianarb from ThumpFlow, @Rucknar, @lherrerabenitez, @abhisak from Nipa Technology, and @enlamp from NexwayGroup.
I would like to thanks Sematext again for the best-of-class Docker monitoring system, DigitalOcean for providing all resources for huge Docker Swarm managers, and the Docker Engineering team for making this great software and supporting us during the run.
While this time around we didn’t manage to launch all 150,000 containers we wanted to have, we did manage to create a nearly 5,000-node Docker Swarm cluster distributed over several continents. Lessons we’ve learned from this experiment will help us launch another huge Docker Swarm cluster next year. Thank you all and I’m looking forward to the new run!