
Documents Update By Query with Elasticsearch

March 21, 2016


SIDE NOTE: We run Elasticsearch and ELK trainings, which may be of interest to you and your teammates.

Just recently, we’ve described how to re-index your Elasticsearch data using the built-in Re-index API. Today, we’ll look at the Update by Query API, which lets you update your documents using a query without having to do any expensive fetching and processing on the application side.

Do you know how updates work in Elasticsearch, or in Apache Lucene in general? Lucene segments are immutable, so once you update a document, the old version gets marked as deleted in its segment and the new version of the document gets indexed. Of course, Elasticsearch builds some additional processing on top of Lucene, so we can use scripts to update our data, rely on optimistic locking, and so on, but the above picture still holds.
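As a refresher, a scripted update of a single document could look like this in Elasticsearch 2.x (the document type video and id 1 below are hypothetical); the version parameter shows optimistic locking in action, as the update is rejected if the document was modified in the meantime:

```shell
# Hypothetical type (video) and id (1); version=2 turns this into an
# optimistic-locking update: it fails if the stored version differs.
curl -XPOST 'localhost:9200/videosearch/video/1/_update?version=2&pretty' -d '{
  "script" : {
    "inline" : "ctx._source.likes += 1"
  }
}'
```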

However, some use cases force us to update documents, sometimes lots of them at once. To update a batch of documents matching a query, we first needed to know their identifiers. This is how things used to work, and the general principle was:

  1. Run a query
  2. Gather the results (probably using Scroll API if you expect a lot of them)
  3. Update returned documents one by one or use bulk API
  4. Repeat from 1) when in need
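Sketched as curl commands against our example index (the document type video is hypothetical, and SOME_ID / SCROLL_ID stand in for values returned by the previous responses), the old approach looked roughly like this:

```shell
# 1) + 2) run the query and open a scroll, fetching identifiers only
curl -XGET 'localhost:9200/videosearch/_search?scroll=1m&pretty' -d '{
  "size" : 100,
  "_source" : false,
  "query" : {
    "term" : {
      "tags" : "solr"
    }
  }
}'

# 3) feed every returned _id into a scripted update through the bulk API
curl -XPOST 'localhost:9200/videosearch/video/_bulk?pretty' -d '
{ "update" : { "_id" : "SOME_ID" } }
{ "script" : { "inline" : "ctx._source.likes += 1" } }
'

# 4) continue the scroll with the returned _scroll_id and repeat 3)
curl -XGET 'localhost:9200/_search/scroll?scroll=1m&scroll_id=SCROLL_ID&pretty'
```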

That complication ended when, similar to how Elasticsearch builds the document update features on top of Lucene, we got the ability to run a query and update all documents matching it in a single request. Welcome the Update by Query API. 🙂

For the purposes of this blog post we will again use the same small data set that we used when describing the Re-index API, the one available on our GitHub account (https://github.com/sematext/berlin-buzzwords-samples/tree/master/2014/sample-documents). After indexing the data we should have 18 documents:

$ curl -XGET 'localhost:9200/videosearch/_search?size=0&pretty'
{
  "took" : 12,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 18,
    "max_score" : 0.0,
    "hits" : [ ]
  }
}

Let’s assume that we would like to update all the documents that have solr (yes, yes, I know) in the tags field and increment the values stored in their likes field. With the Update by Query API, this is as simple as running the following code:

$ curl -XPOST 'localhost:9200/videosearch/_update_by_query?pretty' -d '{
  "query" : {
    "term" : {
      "tags" : "solr"
    }
  },
  "script" : {
    "inline" : "ctx._source.likes += num_likes",
    "params" : {
      "num_likes" : 1
    }
  }
}'

As you can see, this was easy. We’ve provided a simple term query and included a script that increments the data. The whole request was sent to the _update_by_query REST endpoint of the index we are interested in.

Elasticsearch’s response to the above request, on our example data set, should be similar to the following (don’t forget to enable inline scripting by adding script.inline: on to elasticsearch.yml):

{
  "took" : 60,
  "timed_out" : false,
  "total" : 11,
  "updated" : 11,
  "batches" : 1,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : 0,
  "failures" : [ ]
}

The response Elasticsearch returns tells us about the number of updated documents, the number of batches that were created, and information about conflicts and retries. Finally, we have the information on failures.
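To verify what actually happened, we can simply fetch the matching documents again and look at their likes field (here we use _source filtering to return only that field):

```shell
# Fetch the updated documents, returning only the likes field
curl -XGET 'localhost:9200/videosearch/_search?pretty' -d '{
  "_source" : [ "likes" ],
  "query" : {
    "term" : {
      "tags" : "solr"
    }
  }
}'
```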

Is there anything we can control when using the Update by Query API? Again, the answer is yes. We can control the language of the script, the write consistency, replication (synchronous or asynchronous), routing, timeout, and the response. For example, to get information about all processed documents we could use the following request:

$ curl -XPOST 'localhost:9200/videosearch/_update_by_query?pretty&response=all' -d '{
  "query" : {
    "term" : {
      "tags" : "solr"
    }
  },
  "script" : {
    "inline" : "ctx._source.likes += num_likes",
    "params" : {
      "num_likes" : 1
    },
    "lang" : "groovy"
  }
}'

Or we can control consistency and timeout:

$ curl -XPOST 'localhost:9200/videosearch/_update_by_query?pretty&consistency=one&timeout=1m' -d '{
  "query" : {
    "term" : {
      "tags" : "solr"
    }
  },
  "script" : {
    "inline" : "ctx._source.likes += num_likes",
    "params" : {
      "num_likes" : 1
    },
    "lang" : "groovy"
  }
}'

Let’s take a second to look at what the response parameter does. It controls which bulk response items are included in the response of the command. The possible values are:

  • none – the default value, which means that no response items will be returned,
  • failed – only information about documents that failed to be updated will be returned,
  • all – information about all processed documents will be returned. Remember that this option can lead to very large responses when your update by query request processes a lot of data, which in turn may cause high memory consumption or even out of memory situations.
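So, following the values described above, if we only care about documents that could not be updated, we can ask for just the failed items:

```shell
# Return bulk response items only for documents that failed to update
curl -XPOST 'localhost:9200/videosearch/_update_by_query?pretty&response=failed' -d '{
  "query" : {
    "term" : {
      "tags" : "solr"
    }
  },
  "script" : {
    "inline" : "ctx._source.likes += num_likes",
    "params" : {
      "num_likes" : 1
    }
  }
}'
```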

What’s next?

Of course, the Update by Query API and the Re-index API that we recently wrote about are nice, but what if the update request takes a very long time to execute? It would be nice to be able to monitor or even cancel its execution, wouldn’t it? Well, we have good news – this is coming soon, probably in the next major version of Elasticsearch – see GitHub issue #15117.

If you need any help with Elasticsearch, check out our Elasticsearch Consulting, Elasticsearch Production Support, and Elasticsearch Training info.
