
Scaling Elasticsearch by Cleaning the Cluster State

October 18, 2023


We often get questions like:

  • How much data can I put in an Elasticsearch cluster?
  • How many nodes can an Elasticsearch cluster have?
  • What’s the biggest cluster that you’ve seen?

And while the 14-year-old in me is proud to say that we’ve done 24/7 support for clusters of 1000+ nodes holding many PB of data, I am quick to add that:

  1. It doesn’t mean it’s a good idea to have clusters that big.
  2. Such generic questions deserve more nuanced answers, which is exactly what this blog post aims to provide. It applies to OpenSearch as well as Elasticsearch and, for the most part, to Solr (where the cluster state is stored in ZooKeeper).

The Cluster State

An Elasticsearch cluster’s limit of scale isn’t about how much data is stored and how quickly it changes, but how much metadata is stored and how quickly it changes. Metadata is information about nodes, indices, shards, mappings, settings, etc. Everything besides documents and queries. All these are stored in the cluster state. You can see what’s in the cluster state via the Cluster state API.
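For instance, here’s a minimal sketch of pulling just the metadata and routing table parts of the cluster state, which is usually enough to get a feel for how big it is. It uses Python with the requests library and assumes an unsecured cluster on localhost:9200; adjust the URL and add auth for your own setup:

```python
import requests

ES = "http://localhost:9200"  # assumption: local, unsecured cluster

# Ask only for the metadata and routing table portions of the cluster state,
# so you don't pull (and re-serialize) the entire thing at once.
resp = requests.get(f"{ES}/_cluster/state/metadata,routing_table", timeout=30)
resp.raise_for_status()
state = resp.json()

print("cluster:", state["cluster_name"])
print("indices in metadata:", len(state["metadata"]["indices"]))
```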

Each node in the cluster has a copy of the cluster state, but only one is allowed to write to it: the active master. The master node has to process every cluster change serially and wait for an ACK from the other nodes to make sure they’re up to date. This is the limit of scale: how quickly can the master node get through cluster state updates (and how many it has to process) before some operations become unreasonably slow or the cluster starts misbehaving due to timeouts.
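One rough way to see whether the active master is keeping up is the pending cluster tasks API, which lists cluster state updates that are queued but not yet applied. A quick sketch, with the same local-cluster assumption as above:

```python
import requests

ES = "http://localhost:9200"  # assumption: local, unsecured cluster

# Cluster state updates waiting for the active master to process them.
tasks = requests.get(f"{ES}/_cluster/pending_tasks", timeout=30).json()["tasks"]
print(f"{len(tasks)} pending cluster state update(s)")

for task in tasks[:10]:
    # Each entry reports its priority, what triggered it (source) and how long
    # it has been sitting in the queue.
    print(task["priority"], task["source"], task["time_in_queue"])
```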

The worst case scenario is a cluster restart: lots of nodes and shards go up and down, generating a ton of cluster state updates. That’s how most people find out that their cluster state is too large: after an outage, the cluster either fails to recover or takes a very long time to do so.

What if I Have Beefy Master Nodes?

In short, it doesn’t help. Once you make sure master nodes have enough heap and don’t have GC issues, adding more CPU and memory to them will not get you anything. Cluster state update speeds have more to do with:

  • The size of the cluster state, because the active master has to create a new (updated) cluster state object in heap.
  • Whether the update is a diff (normal operations) or a full cluster state update (e.g. a node joined).
  • How quickly the other nodes acknowledge the update (e.g. number of nodes, network latency).

But More Data Means More Metadata!

There is some correlation: more nodes will carry more metadata with them, so will more indices and shards. But indices can have 5 fields in the mapping or 500. You may have index templates, big analyzer definitions, stored scripts or tons of aliases. These are less about how much data you have and more about your application design.

As for nodes and indices/shards, you should know that advice like:

  • Don’t have more than 30GB of heap. You’ll lose compressed object pointers and run into GC issues!
  • Don’t allocate more than 1/2 of RAM as heap. You need plenty of memory for OS caches!
  • Don’t allow shards over 50GB. Shards won’t recover in time!
  • All sorts of formulae about the ratio between disk space and RAM, number of shards per node, etc.

are only rules of thumb. They may or may not apply to your use-case. I’ve seen healthy clusters (for their own use-cases) rocking:

  • Hundreds of GB of heap. Hello ZGC!
  • Shards of hundreds of GB. A more compact index means quicker full-text search, and recovery was fine on beefy boxes with a beefy network.
  • Clusters running thousands of shards per node. Because nodes were few, mappings were thin and shards were relatively small.

I’m not saying these are the new rules, I’m saying it depends on the use-case and workload.

What Helps?

It helps to design your clusters and applications to use small cluster states. If you can, of course. Here are some examples:

  • Don’t use fields as keys in key-value pairs. If you can find a way to have fewer fields, your Elasticsearch will be happier. Worst case scenario, use nested documents (see the mapping sketch after this list).
  • Break data into multiple clusters. If workloads are completely separated (or you can afford to use cross-cluster search occasionally), then having more smaller clusters will effectively break down the cluster state and parallelize the work of maintaining it between multiple active master nodes. Plus, you’ll likely have fewer data nodes per cluster, which leads to the next point.
  • Prefer fewer, bigger nodes. This might feel like advice from 1998, but 12 beefy nodes are better than 48 nodes 1/4 of their size – for most use-cases. It’s easier to deal with challenges of huge nodes (like GC) than with challenges of big clusters. Like a larger cluster state to be maintained across more nodes. Or keeping data balanced.
  • Don’t abuse aliases. While you can abstract away some complexity by using aliases, make sure you don’t end up with thousands of them. I’ve seen a cluster with millions of filtered aliases (one per tenant), making it unusable. The fix was for the application to apply those “tenant filters” directly in queries, so we could get rid of those aliases (there’s a query-time sketch after this list).
  • Use strict mappings. This will avoid having millions of fields in your cluster state, though it may not work for your use-case. Still, you can go in that direction by adjusting index.mapping.total_fields.limit or using dynamic templates to enforce some sort of naming convention for your fields. The mapping sketch after this list shows both the strict setting and the field limit.
  • Prefer synonym files over storing them in the cluster state.
  • Clean up unused metadata. If you don’t need an index, a search template, a stored script, etc., don’t keep them around like I did below 🤦‍♂️ (there’s a cleanup sketch right after this list).
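To make the key-value and strict-mappings points more concrete, here’s a sketch of creating an index where attribute keys become values of a nested field instead of top-level field names. The index name, field names and field limit are made up for illustration; the same local-cluster assumption applies:

```python
import requests

ES = "http://localhost:9200"  # assumption: local, unsecured cluster

index_body = {
    "settings": {
        # Safety net: cap the number of mapped fields per index.
        "index.mapping.total_fields.limit": 200
    },
    "mappings": {
        # Reject documents that try to add fields not declared below.
        "dynamic": "strict",
        "properties": {
            "timestamp": {"type": "date"},
            "message": {"type": "text"},
            # Key-value pairs stored as nested documents: the keys are values
            # of attributes.key, not thousands of separate mapped fields.
            "attributes": {
                "type": "nested",
                "properties": {
                    "key": {"type": "keyword"},
                    "value": {"type": "keyword"}
                }
            }
        }
    }
}

requests.put(f"{ES}/logs-example", json=index_body, timeout=30).raise_for_status()
```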
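And here’s roughly what applying such a “tenant filter” at query time looks like, instead of keeping one filtered alias per tenant in the cluster state. The tenant_id field and the index name are, again, hypothetical:

```python
import requests

ES = "http://localhost:9200"  # assumption: local, unsecured cluster

# The same filter a per-tenant alias would carry, applied by the application
# at query time instead of living in the cluster state.
query = {
    "query": {
        "bool": {
            "filter": [{"term": {"tenant_id": "acme"}}],
            "must": [{"match": {"message": "timeout"}}]
        }
    }
}

resp = requests.post(f"{ES}/logs-example/_search", json=query, timeout=30)
print(resp.json()["hits"]["total"])
```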
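Finally, a sketch of the kind of cleanup pass the last point describes. This isn’t our actual maintenance job, just an illustration: list the composable index templates and delete the ones the application no longer needs. The is_still_needed() check and the naming convention are placeholders for your own bookkeeping:

```python
import requests

ES = "http://localhost:9200"  # assumption: local, unsecured cluster

def is_still_needed(name: str) -> bool:
    # Placeholder: look the template up in your own records of active apps.
    return not name.startswith("deleted-app-")

# Composable index templates; legacy ones live under /_template instead.
templates = requests.get(f"{ES}/_index_template", timeout=30).json()["index_templates"]

for tpl in templates:
    name = tpl["name"]
    if not is_still_needed(name):
        requests.delete(f"{ES}/_index_template/{name}", timeout=30).raise_for_status()
        print("deleted unused template:", name)
```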

Show Some Proof!

This will be a little embarrassing, but OK 😊

In Sematext Logs, we support custom templates. We recently noticed that we don’t delete custom templates of deleted Logs Apps. Which, as you may imagine, bloated our cluster state.

When we cleaned them up, we cut the cluster state size down by about a third on one cluster. Here’s a screenshot from our Elasticsearch Monitoring:

Elasticsearch Cluster State Metrics Before & After

Notice the highlighted spikes. These are due to a maintenance job. On the top graph, you’ll see the same number of cluster states published (proof that the job did about the same amount of work). Meanwhile, with the smaller cluster state, the number of pending tasks that we saw (on average) was down to a third.

We also saw significant improvements for CPU, load and GC time on the active master. Graphs there are less impressive, because the absolute figures are pretty much always low.

Key Takeaways

  • The practical scaling limit of an Elasticsearch (or OpenSearch) cluster isn’t how much data it holds, but how large the cluster state is and how quickly it changes.
  • Beefier master nodes won’t help much once heap and GC are fine. What helps is keeping the cluster state small: fewer and bigger nodes, strict mappings, fewer aliases, synonym files instead of synonyms in the cluster state.
  • Clean up unused metadata such as leftover templates and stored scripts. In our case, deleting templates of deleted Logs Apps cut the cluster state size by about a third on one cluster and reduced pending tasks accordingly.
