Field Stats for Elasticsearch 6.x

November 26, 2018

We’re excited to announce the release of the Field Stats API plugin for Elasticsearch. The Field Stats API was available from Elasticsearch 1.6 through 5.6 and provided efficient per-index statistics for fields, such as the minimum and maximum values of a date field.

The Field Stats API was deprecated in Elasticsearch 5.4 and removed in 6.0. We needed this functionality in Sematext Cloud, so we created a plugin to add it back. If you’re in a similar situation, feel free to download the Field Stats plugin, open issues and pull requests.

Why Field Stats?

Probably the biggest consumer of the Field Stats API was Kibana. Starting with 4.3, Kibana used the Field Stats API to figure out which indices match the timestamp range of your query. No need to hit all indices if you’re only interested in last hour’s errors. The Field Stats API was also cheap to run, relying on measurements natively available in the underlying Lucene indices.
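As a rough sketch of that index selection, the Python snippet below takes a response shaped like the one the original `_field_stats` endpoint returned at index level and keeps only the indices whose min/max range overlaps the query window. The `@timestamp` field, the index names, the sample values, and the `indices_in_range` helper are assumptions for illustration, and we assume `min_value` and `max_value` hold epoch milliseconds. The plugin's actual endpoint and response may differ.

```python
from typing import Dict, List


def indices_in_range(field_stats: Dict, field: str, start_ms: int, end_ms: int) -> List[str]:
    """Return the indices whose [min_value, max_value] for `field` overlaps [start_ms, end_ms]."""
    matching = []
    for name, stats in field_stats.get("indices", {}).items():
        info = stats.get("fields", {}).get(field)
        if info is None:
            continue  # this index has no values for the field
        # Two closed intervals overlap when each one starts before the other ends.
        if info["min_value"] <= end_ms and info["max_value"] >= start_ms:
            matching.append(name)
    return sorted(matching)


# Hypothetical, heavily trimmed response (small numbers stand in for epoch millis).
sample_response = {
    "indices": {
        "logs-1": {"fields": {"@timestamp": {"min_value": 0, "max_value": 99}}},
        "logs-2": {"fields": {"@timestamp": {"min_value": 100, "max_value": 199}}},
        "logs-3": {"fields": {"@timestamp": {"min_value": 200, "max_value": 299}}},
    }
}

print(indices_in_range(sample_response, "@timestamp", 150, 250))
```

With this sample data, only `logs-2` and `logs-3` overlap the window, so a query for that range can skip `logs-1` entirely.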

Even before that, Kibana could select the right indices if you rolled your Elasticsearch indices by time (e.g. one index per day). With Field Stats, you could roll Elasticsearch indices by size as well (e.g. one index per 10GB). If you remember our Velocity presentation, rolling by size works better for most use cases.

We expose Kibana in our log management solution, built on top of Elasticsearch. But we also have our own UI that integrates well with monitoring, tracing, digital experience monitoring, and other features of Sematext Cloud. Can you guess what we use to hit the right indices for a query? Hint: the last word, written backwards, is stats.

Why was Field Stats removed?

The core idea was that range queries, at least in Elasticsearch 5.x and later, go through a rewriting step. Say you look for last hour’s logs and the query hits an index with data only from yesterday. For that index, the timestamp range query gets rewritten to MatchNoDocsQuery. This is very cool for caching, which is also why the shard request cache got turned on by default in Elasticsearch 5.x. A sliding time window doesn’t hurt caches as much as it used to: all shards that are either completely in or completely out of the time window can serve results from cache.
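To make the rewriting step more concrete, here is a simplified model of the decision, not Elasticsearch's actual code. It compares a shard's min/max timestamps with the query range and returns one of three outcomes. The function name `rewrite_range` and the outcome labels are made up for this sketch.

```python
def rewrite_range(shard_min: int, shard_max: int, query_from: int, query_to: int) -> str:
    """Model how a timestamp range query could be rewritten for one shard."""
    if shard_max < query_from or shard_min > query_to:
        # Shard is completely outside the window: nothing can match.
        return "match_none"
    if query_from <= shard_min and shard_max <= query_to:
        # Shard is completely inside the window: every document with the field matches.
        return "match_all"
    # Shard straddles a window boundary: the range has to be evaluated as written.
    return "range"


# Yesterday's shard vs. a query for the last hour (hours since some arbitrary origin).
print(rewrite_range(shard_min=0, shard_max=24, query_from=47, query_to=48))
```

In this model, the first two outcomes don't depend on the exact window boundaries, which is why a sliding window can still be served from cache for those shards.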

With these optimized range queries, the added complexity of Field Stats calls didn’t seem justified anymore.

The problem is, with many log management solutions, queries don’t repeat that often. And when they don’t, broadcasting a query to many large shards is expensive. Expensive as in adding seconds to the latency, despite the rewriting optimization. We saw this in our setup and at some of our Elastic Stack consulting clients.

What about shard pre-filtering?

Clearly, rewriting and caching weren’t enough. To make things better, Elasticsearch 5.6 and later run a shard pre-filtering step if the query touches a high number of shards. That threshold is configurable and defaults to 128.
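The threshold is exposed on search requests as the `pre_filter_shard_size` parameter. As a minimal sketch, the snippet below builds a search URL and a range query body without sending anything. The host, the `logs-*` index pattern, the `@timestamp` field, and the `search_url` helper are assumptions for illustration.

```python
from urllib.parse import urlencode


def search_url(base_url: str, index_pattern: str, pre_filter_shard_size: int = 128) -> str:
    """Build a _search URL that sets the shard pre-filtering threshold explicitly."""
    query_string = urlencode({"pre_filter_shard_size": pre_filter_shard_size})
    return f"{base_url}/{index_pattern}/_search?{query_string}"


# Range query for the last hour; pre-filtering can skip shards with no data in this window.
last_hour_body = {
    "query": {"range": {"@timestamp": {"gte": "now-1h", "lte": "now"}}}
}

print(search_url("http://localhost:9200", "logs-*", pre_filter_shard_size=1))
```

Lowering the value should make the pre-filtering step kick in for searches that touch fewer shards, which may be worth testing if your shards are few but large.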

Even cooler, the pre-filtering step is cheap. In fact, it uses the same native Lucene measurements that the Field Stats API used. Kibana 6.x running on top of Elasticsearch 6.x gets shard pre-filtering out of the box. You’ll find this combo in our log management solution as well.

Problem solved, right? Kibana 6.x doesn’t use Field Stats, but who cares? Shard pre-filtering is there. All gain and no pain.

Well, not quite. If you have a lot of data to search through (as some of our customers do), you’ll want to load results incrementally. For example, think about searching the last 7 days.

It’s likely that the whole first page is in the first index you hit, assuming you sort by date and look at the latest index first. In that case, you could show results right away. Then, to populate the histogram chart, you’d still need to aggregate all the data. But even there you could go index by index and populate the chart incrementally, instead of having the user wait for the entire reply at once.
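The index-by-index approach can be sketched as a Python generator that yields a partial result after each index, so a UI could redraw as data arrives. This is only an illustration, not how our UI or Kibana is implemented. `search_index` is a hypothetical callable that returns the hits (newest first) and date histogram bucket counts for a single index.

```python
from typing import Callable, Dict, Iterator, List, Tuple


def load_incrementally(
    indices_newest_first: List[str],
    search_index: Callable[[str], Dict],
    page_size: int = 10,
) -> Iterator[Tuple[List, Dict[str, int]]]:
    """Yield (first_page_so_far, histogram_so_far) after each index is searched."""
    hits: List = []
    histogram: Dict[str, int] = {}
    for index in indices_newest_first:
        result = search_index(index)
        # Fill the first page from the newest indices until it is complete.
        if len(hits) < page_size:
            hits.extend(result["hits"][: page_size - len(hits)])
        # Date histogram buckets can simply be summed across indices.
        for bucket, count in result["buckets"].items():
            histogram[bucket] = histogram.get(bucket, 0) + count
        yield list(hits), dict(histogram)


# Fake per-index results standing in for real searches.
fake_results = {
    "logs-3": {"hits": ["c2", "c1"], "buckets": {"day3": 2}},
    "logs-2": {"hits": ["b1"], "buckets": {"day2": 1}},
}

for page, chart in load_incrementally(["logs-3", "logs-2"], fake_results.__getitem__, page_size=2):
    print(page, chart)
```

Here the first page is complete after the newest index alone, while the histogram keeps growing as older indices are searched.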

Kibana doesn’t do incremental loading anymore. It would be difficult without Field Stats: you’d need to run more expensive min/max aggregations, cache them, then worry about consistency. Granted, not all types of aggregations benefit from incremental loading. For example, a date histogram does, a terms aggregation doesn’t. That said, we find incremental loading useful for dealing with large amounts of data.
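For comparison, the more expensive alternative could be a search body with `min` and `max` aggregations on the timestamp field, run per index. The sketch below only builds that body. The `@timestamp` field, the aggregation names, and the `min_max_body` helper are assumptions for illustration.

```python
import json
from typing import Dict


def min_max_body(field: str) -> Dict:
    """Build a search body that returns only the min and max of `field`, with no hits."""
    return {
        "size": 0,  # we only need the aggregation results
        "aggs": {
            "min_value": {"min": {"field": field}},
            "max_value": {"max": {"field": field}},
        },
    }


print(json.dumps(min_max_body("@timestamp"), indent=2))
```

Unlike Field Stats, the results of such a request would have to be cached on the client side and refreshed as new data is indexed, which is where the consistency worries come from.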

Next steps

If your use case would benefit from the Field Stats API, you can have it back even on Elasticsearch 6.x. Get the Elasticsearch Field Stats Plugin and let us know how it worked for you: add a comment here, or open an issue or a PR in the git repository. It’s free and open-source (Apache-licensed).
