
Reindexing Data with Elasticsearch

March 21, 2016


Last updated on Jan 8, 2018

SIDE NOTE: We run Elasticsearch and ELK trainings, which may be of interest to you and your teammates.

Sooner or later, you’ll run into the problem of reindexing the data in your Elasticsearch instances. When we do Elasticsearch consulting for clients, we always look at whether they have some way to efficiently reindex previously indexed data. The reasons for reindexing vary – from data type changes and analysis changes to the introduction of new fields that need to be populated. No matter the case, you may either reindex from your source of truth or treat your Elasticsearch instance as such. Up to Elasticsearch 2.3 we had to use external tools to help us with this operation, like Logstash or stream2es. We even wrote about how to approach reindexing of data with Logstash. However, today we would like to look at the functionality added to the core in Elasticsearch 2.3 – the Elasticsearch Reindex API.

The prerequisites are quite low – you only need Elasticsearch 2.3 or later and the ability to run a command against it. And that’s it, nothing more is needed and Elasticsearch will do the rest for us.

For the purpose of this post, we will use the data we use during our Elasticsearch and Solr talks, which is available on our GitHub account – https://github.com/sematext/berlin-buzzwords-samples/tree/master/2014/sample-documents.

Initial data indexing

Let’s assume that we want to index the mentioned data quickly and that we want to use the schema-less approach. We will just send the data to an index called videosearch, into a type called vid, using the following command (I have the downloaded JSON files in a directory called data):

for file in data/*.json; do curl -XPOST "localhost:9200/videosearch/vid/" -H 'Content-Type:application/json' --data-binary "@$file"; echo; done

After indexing, we should have exactly 18 documents in the index.
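If you want to double-check that count, a quick refresh followed by a _count request should report 18 documents (this is just a sanity check, not a required step):

curl -XPOST 'localhost:9200/videosearch/_refresh'
curl -XGET 'localhost:9200/videosearch/_count?pretty'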

The problem

Now imagine that we would like to change how the data is indexed. For example, the title field from our data was set up as a text field with the default analysis configuration. What if we would like to change its analyzer to the one called english? Such an operation is not possible without changing the mappings and indexing the data once again. That is, of course, easy in our case, where we have our data outside Elasticsearch, but let’s assume that we don’t have it. Let’s assume our data exists only in Elasticsearch.

If we simply tried to change the analysis of the title field, we would run into issues. For example, if we ran the following request:

curl -XPUT 'localhost:9200/videosearch/vid/_mappings' -H 'Content-Type:application/json' -d '{
 "vid" : {
  "properties" : {
   "title" : {
    "type" : "text",
    "analyzer" : "english"
   }
  }
 }
}'

Elasticsearch would tell us that there is a mapper conflict and that the operation we are trying to perform cannot be completed:

{
 "error" : {
  "root_cause" : [
   {
    "type" : "illegal_argument_exception",
    "reason" : "Mapper for [title] conflicts with existing mapping in other types:\n[mapper [title] has different [analyzer]]"
   }
  ],
  "type" : "illegal_argument_exception",
  "reason" : "Mapper for [title] conflicts with existing mapping in other types:\n[mapper [title] has different [analyzer]]"
 },
 "status" : 400
}

 

Introducing the Reindex API

We start by creating an index called video_new with the following mappings:

curl -XPUT 'localhost:9200/video_new' -H 'Content-Type:application/json' -d '{
 "mappings" : {
  "vid" : {
   "properties" : {
    "title" : { "type" : "text", "analyzer" : "english" }
   }
  }
 }
}'

Assuming we have the mappings done and we’ve created the video_new index, we can run the reindexing command. Let’s also assume that we would like to preserve the versioning of the documents. So, to reindex our data we would use the following command:

curl -XPOST 'localhost:9200/_reindex' -H 'Content-Type:application/json' -d '{
 "source" : {
  "index" : "videosearch"
 },
 "dest" : {
  "index" : "video_new",
  "version_type": "external"
 }
}'

As you can see, we need to specify the source index (using the source section) and the destination index (using the dest section), and send the command to the _reindex REST endpoint. We’ve also specified the version_type and set it to external to preserve document versions. After running the command, Elasticsearch should respond with the following JSON:

{
 "took":58,
 "timed_out":false,
 "total":37,
 "updated":0,
 "created":37,
 "deleted":0,
 "batches":1,
 "version_conflicts":0,
 "noops":0,
 "retries": {
  "bulk":0,
  "search":0
 },
 "throttled_millis":0,
 "requests_per_second":-1.0,
 "throttled_until_millis":0,
 "failures":[]
}

We can see a few useful statistics about the re-indexing process here:

  • took – how long, in milliseconds, the reindexing operation took,
  • updated – the number of documents updated,
  • created – the number of documents created,
  • batches – the number of batches used,
  • version_conflicts – the number of version conflicts encountered,
  • failures – information about documents that failed to be reindexed, none in our case.

As you can see, the operation succeeded.
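If you would like to confirm that the new analysis configuration is really in place, a quick look at the mappings of the new index will show the english analyzer on the title field (again, purely a sanity check):

curl -XGET 'localhost:9200/video_new/_mapping?pretty'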

Limiting source documents

A very nice reindex API feature is the ability to filter the source documents. For example, let’s assume that we would like to copy only a part of the source documents from one index to another: say, the three newest documents that have the elasticsearch term in the tags field. We can do that with a query, a size limit on the results, and sorting, all within the reindex API, like this:

curl -XPOST 'localhost:9200/_reindex' -H 'Content-Type:application/json' -d '{
 "size" : 3,
 "source" : {
  "index" : "videosearch",
  "type" : "vid",
  "query" : {
   "term" : {
    "tags" : "elasticsearch"
   }
  },
  "sort" : {
   "upload_date" : "desc"
  }
 },
 "dest" : {
  "index" : "video_new_sample"
 }
}'

As you can see, we’ve added the type property, which limited the documents by document type. We’ve added the query and the sort section as well, just like during the standard search operation. And we’ve limited the results to only three documents returned by the query using the size parameter. Simple, ain’t it?

Waiting for completion and timing out

Of course, we can control how Elasticsearch behaves during the reindexing process, which can take a while if there are a lot of documents to reindex. One of the things we will probably want to control is whether the reindex request should block until the operation finishes. To avoid blocking, we can set wait_for_completion to false. Elasticsearch will then check the prerequisites for the reindexing operation and return task information, which can be used to check the progress of the reindexing.
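For example, a fire-and-forget reindex request and a follow-up check of the returned task could look like this (the task identifier below is only a placeholder; use the one Elasticsearch returns in the task field of the response):

curl -XPOST 'localhost:9200/_reindex?wait_for_completion=false' -H 'Content-Type:application/json' -d '{
 "source" : {
  "index" : "videosearch"
 },
 "dest" : {
  "index" : "video_new"
 }
}'
curl -XGET 'localhost:9200/_tasks/<task_id_from_the_response>'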

The next thing is the timeout: we can control how long Elasticsearch will wait for unavailable shards to become available for each batch of documents.
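This is done with the timeout parameter on the reindex request; the five minutes below is just an example value:

curl -XPOST 'localhost:9200/_reindex?timeout=5m' -H 'Content-Type:application/json' -d '{
 "source" : {
  "index" : "videosearch"
 },
 "dest" : {
  "index" : "video_new"
 }
}'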

We also have the ability to control routing, write consistency, and refresh, but let’s leave all those properties for the next time we talk about the reindex API.

Additional options

When comparing the initial release of the Reindex API in Elasticsearch 2.3 with the current Elasticsearch version, we can see a lot of additional options available to us. For example, setting op_type to create in the dest section of the request body will cause documents to be created only if they do not already exist. We can also ignore conflicts by setting the conflicts parameter to proceed in the body of the request. And that’s not all. You can control the size of the batches used for reindexing (the size parameter in the source section), use scripts to modify documents on the fly, or even reindex data from a remote Elasticsearch instance.
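To give you an idea, a request combining a few of these options could look like the following sketch (the reindexed flag set by the script is purely illustrative and not part of our sample data):

curl -XPOST 'localhost:9200/_reindex' -H 'Content-Type:application/json' -d '{
 "conflicts" : "proceed",
 "source" : {
  "index" : "videosearch",
  "size" : 500
 },
 "dest" : {
  "index" : "video_new",
  "op_type" : "create"
 },
 "script" : {
  "lang" : "painless",
  "source" : "ctx._source.reindexed = true"
 }
}'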

Finally, the Reindex API supports sliced scrolling to parallelize the reindexing process. You can let Elasticsearch slice the data automatically by adding a numerical slices parameter to the request URI. In addition to that, Elasticsearch allows us to create manual slices, for example like this:

curl -XPOST 'localhost:9200/_reindex' -H 'Content-Type:application/json' -d '{ 
 "source" : {
  "index" : "videosearch",
  "type" : "vid",
  "slice" : {
   "id" : "0",
   "max" : "2"
  }
 },
 "dest" : {
  "index" : "video_new_sliced"
 }
}'
curl -XPOST 'localhost:9200/_reindex' -H 'Content-Type:application/json' -d '{ 
 "source" : {
  "index" : "videosearch",
  "type" : "vid",
  "slice" : {
   "id" : "1",
   "max" : "2"
  }
 },
 "dest" : {
  "index" : "video_new_sliced"
 }
}'

The above two requests result in two slices, which we can send in parallel to speed up execution if our hardware allows.
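For comparison, automatic slicing is just a matter of adding the slices parameter to the request URI; the value of 2 and the video_new_sliced_auto destination index below are only examples:

curl -XPOST 'localhost:9200/_reindex?slices=2' -H 'Content-Type:application/json' -d '{
 "source" : {
  "index" : "videosearch"
 },
 "dest" : {
  "index" : "video_new_sliced_auto"
 }
}'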

Summary

As you can see, we now have a tool that lets us reindex data without relying on external tools. This is especially useful when we don’t have access to our original data, or when the data in Elasticsearch is modified by external processes that do not modify the original data (like comments added to documents). Just another step towards an easier life with the search engine. 🙂

If you need any help with Elasticsearch, check out our Elasticsearch Consulting, Elasticsearch Production Support, and Elasticsearch Training info.
