
How to reindex your Elasticsearch data

May 18, 2023

The Elasticsearch reindex API copies data from one index to another. You can use reindex to change the index mapping, copy data to another cluster, or copy only a subset of data to another index.

For example, suppose you want to reindex all the data in index1 into index2. In that case, you run the following example in Kibana dev tools:

POST _reindex
{
 "source": {
   "index": "index1"
 },
 "dest": {
   "index": "index2"
 }
}

In this article, we dive into some common issues solved by reindexing as well as troubleshooting issues with reindexing itself.

Key Elasticsearch Reindex API concepts

  • Reindex API: Use the Reindex API to reindex data into a new index with updated mappings.
  • How to use reindex: Use reindex to solve mapping conflicts and to move data between clusters.
  • How to speed up reindexing: Speed up reindexing by using the correct API, reindexing only the data you need, and batching and slicing the data before reindexing.
  • Solve reindexing timeouts in Kibana: Prevent query timeouts by setting the wait_for_completion flag to false.
  • Reindexing without downtime: Use aliases and split your input data to reindex without downtime.

How to use reindex to solve mapping conflicts

Elasticsearch maps fields in your document to internal data types. You can define these mappings before you begin ingesting data or let Elasticsearch create the mappings dynamically from the first document.

Conflicts arise when a document being indexed contains a field whose value doesn’t match the type in the index’s existing mapping, for example, a string sent to a field mapped as a number. When a mapping conflict occurs, Elasticsearch rejects the document you tried to index and returns an error similar to the following.

{
  "error": {
    "root_cause": [
      {
        "type": "mapper_parsing_exception",
        "reason": "failed to parse field [field_name] of type [long] in document with id '123'"
      }
    ],
    "type": "mapper_parsing_exception",
    "reason": "failed to parse field [field_name] of type [long] in document with id '123'",
    "caused_by": {
      "type": "number_format_exception",
      "reason": "For input string: \"invalid_number\"
    }
  },
  "status": 400
}

In this example, the error message indicates a mapper_parsing_exception caused by a number_format_exception. The issue arises when attempting to index a document with an incorrect data type for a specific field. In this case, the field “field_name” is of type long, but the provided value is a text string “invalid_number” instead of a numeric value.
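To see how the problematic field is currently mapped, you can ask Elasticsearch directly with the field mapping API. A quick check, with the index and field names as placeholders:

GET <index_name>/_mapping/field/<field_name>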

Follow these steps to resolve a mapping conflict:

  1. Stop ingestion. How you stop ingestion will depend on your specific use case. If, for example, you are using Filebeat or Logstash, you’ll first have to stop the Filebeat/Logstash component.

  2. Create a new index with the correct mappings.

PUT <target_index_name>
{
 "mappings": {
   "properties": {
     "<field_1_name>": {
       "type": "<field_1_type>"
     },
     "<field_2_name>": {
       "type": "<field_2_type>"
     },
     ...
   }
 }
}
  3. Reindex data into the new index.
POST _reindex
{
 "source": {
   "index": "<source_index_name>"
 },
 "dest": {
   "index": "<target_index_name>"
 }
}
  4. Update alias. If you were using an alias for your source index, you need to change the alias to point to the new index. Then any read or write operations can continue using the alias. To change the alias, run the following command.
POST _aliases
{
   "actions": [
       { "add": { "index": "target_index_name", "alias": "existing_alias" } },
       { "remove": { "index": "source_index_name", "alias": "existing_alias" } }
   ]
}
  5. Start ingestion. For this step, you’ll start whatever component you stopped in step one. If you’re not using an alias, you’ll have to point your writer to the “target_index_name”.

  6. Delete the old index.

DELETE <source_index_name>

Note: If you’re not using aliases, it’s important to remember that once you’ve deleted the old index, the name source_index_name no longer exists, and you will have to access your data at target_index_name.
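Before switching the alias or deleting anything, it can also be worth double-checking that the new index really has the mapping you expect. A quick sanity check, with the index name as a placeholder:

GET <target_index_name>/_mapping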


How to move data from one cluster to another using the Reindex API

Sometimes, you need to move data from one cluster to another but no longer have the original data. One way to solve this problem is to remotely reindex the data to the new cluster. The Reindex API allows you to move data between two Elasticsearch clusters within the same network or across different networks. Here’s the process:

  1. Create a new, empty index in the target cluster with the desired mapping:
PUT /<target_index_name>
{
 "mappings": {
   ...
 }
}
  2. Use the cross-cluster reindexing feature of the Reindex API to copy data from the source index to the target index (see the note on authentication and whitelisting after these steps).
POST _reindex
{
 "source": {
   "remote": {
     "host": "<source_cluster_url>"
   },
   "index": "<source_index_name>"
 },
 "dest": {
   "index": "<target_index_name>"
 }
}
  3. Optionally, validate that the data was successfully copied over.
GET /<target_index_name>/_search
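If the source cluster requires authentication, you can pass basic credentials inside the remote block; the host, username, and password below are placeholders. Keep in mind that remote reindexing also requires the source host to be listed in the reindex.remote.whitelist setting on the destination cluster. A sketch of the authenticated request:

POST _reindex
{
 "source": {
   "remote": {
     "host": "<source_cluster_url>",
     "username": "<username>",
     "password": "<password>"
   },
   "index": "<source_index_name>"
 },
 "dest": {
   "index": "<target_index_name>"
 }
}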

Is Reindex too slow? How to speed up reindexing

If you find your reindex operations are taking too long, you can check the following:

Make sure you are using the Reindex API instead of the Scroll or Bulk APIs.

Elasticsearch offers a dedicated Reindex API, which simplifies the reindexing process by abstracting the need to write custom code or use lower-level APIs like the now-deprecated “Scroll API” or the “Bulk API.” It is important to note that the Reindex API internally utilizes the “search_after” parameter (which replaced the Scroll API) and the Bulk API to efficiently handle the reindexing process. The advantage of using the Reindex API is that it streamlines the operation, allowing you to avoid adding extra steps to your custom application while still benefiting from the efficiency and functionality of the underlying APIs.

You can use the reindex API in the Kibana Console like this:

POST _reindex
{
 "source": {
   "index": "index1"
 },
 "dest": {
   "index": "index2"
 }
}

Filter your data if you don’t need it all

To reindex only some data from an index in Elasticsearch, you can specify filters in the request. The Reindex API applies the filter to the reindexed data and only includes documents that match the filter in the re-indexing process.

Here’s an example of how you can use a filter in the Reindex API request:

POST _reindex
{
 "source": {
   "index": "<source_index_name>",
   "query": {
     "bool": {
       "filter": [
         {
           "term": {
             "<field_name>": "<field_value>"
           }
         }
       ]
     }
   }
 },
 "dest": {
   "index": "<target_index_name>"
 }
}
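Along the same lines, if you only need recent documents, you can filter on a timestamp range instead of a term. A sketch, assuming your documents have a @timestamp field:

POST _reindex
{
 "source": {
   "index": "<source_index_name>",
   "query": {
     "range": {
       "@timestamp": {
         "gte": "now-30d/d"
       }
     }
   }
 },
 "dest": {
   "index": "<target_index_name>"
 }
}

This would copy only documents from roughly the last 30 days into the target index.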

Batching and slicing the reindexing

Batching

Elasticsearch inherently processes documents in batches during the reindexing process. However, you can control the batch size by specifying the ‘size’ parameter inside the ‘source’ section of the Reindex API request. Adjusting the ‘size’ parameter allows you to find the optimal balance between the number of documents processed in each batch and the overall overhead of the process. While increasing the batch size doesn’t necessarily reduce the total time it takes to complete the reindexing, it can help in reducing the overhead involved.

Adjusting the batch size during the reindexing process can affect available resources in the following ways:

  1. Memory usage: Larger batch sizes result in higher memory usage, potentially leading to slower performance or out-of-memory errors, while smaller batch sizes help reduce memory usage.
  2. CPU usage: Processing larger batches increases CPU utilization, which may slow down response times for other tasks. Smaller batch sizes can alleviate CPU pressure and improve overall efficiency.
  3. I/O operations: Larger batch sizes cause more intense I/O operations, increasing latency and affecting response times for other tasks. Smaller batch sizes help distribute I/O load more evenly, reducing the impact on other tasks.

An example of reindexing with a batch size:

POST _reindex
{
 "source": {
   "index": "<source_index_name>",
   "size": 1000
 },
 "dest": {
   "index": "<target_index_name>"
 }
}

In this example, the “size” parameter inside “source” is set to 1000, which means that Elasticsearch will process up to 1000 documents in each batch during the reindexing process.

Note: Increasing the batch size increases memory usage, so it’s best to grow incrementally and monitor heap usage. As a rule of thumb, each batch should ideally be between 50KB and 500KB per shard. However, this sweet spot may vary depending on the size of your nodes and the number of shards in your target index. Larger nodes may be able to handle bigger batches, while a higher shard count may require smaller batches. The optimal value for the “size” parameter depends on your specific use case and the resources available in your Elasticsearch cluster. You may need to experiment to find the best value for your needs.
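One simple way to keep an eye on heap while you experiment with batch sizes is the cat nodes API; for example:

GET _cat/nodes?v&h=name,heap.percent,heap.current,heap.max

This returns per-node heap usage, so you can see whether larger batches are pushing nodes toward their limits.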

Slicing

Slicing in Elasticsearch can significantly speed up the reindexing process by parallelizing the workload across multiple workers, each handling a specific portion of the data. By dividing the source index into smaller, non-overlapping segments called slices, the reindexing operation can be performed concurrently, thus improving overall performance. This is particularly helpful when dealing with large indexes, as it allows for better distribution of computational resources.

However, there are trade-offs to consider when using slicing. First, increasing the number of slices may lead to increased resource consumption, as each slice requires additional memory, CPU, and I/O operations. This can result in higher contention for system resources, potentially affecting the performance of other tasks running on the Elasticsearch cluster. Second, coordinating the parallel execution of multiple slices introduces some overhead, which might slightly offset the performance gains. Finally, choosing the optimal number of slices is not always straightforward, as it depends on various factors such as the size of the index, cluster resources, and the desired level of parallelism. It’s essential to experiment and monitor the performance to find the right balance for your specific use case.

In most cases, it’s best to use automatic slicing:

POST _reindex?slices=5&refresh
{
 "source": {
   "index": "source_index_name"
 },
 "dest": {
   "index": "target_index_name"
 }
}

In this example, the index is automatically sliced into 5 slices. If, however, you want to control the slicing yourself, you can do so like this:

POST _reindex
{
 "source": {
   "index": "<source_index_name>",
   "slice": {
     "id": 0,
     "max": 5
   },
   "size": 1000
 },
 "dest": {
   "index": "<target_index_name>"
 }
}

In this example, the source index is divided into 5 slices (0 to 4). The id parameter represents the specific slice being processed, and the max parameter indicates the total number of slices. To reindex the entire source index, you’ll need to run this request for each slice by changing the id parameter accordingly (0, 1, 2, 3, and 4).

You can run these requests concurrently to speed up the reindexing process.
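If you’d rather not pick the number of slices yourself, you can also let Elasticsearch choose it (typically one slice per shard of the source index) by passing slices=auto:

POST _reindex?slices=auto
{
 "source": {
   "index": "<source_index_name>"
 },
 "dest": {
   "index": "<target_index_name>"
 }
}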

Reindex times out in Kibana

If you’re reindexing in the Kibana dev console and get the following error:

{
 "statusCode": 504,
 "error": "Gateway Timeout",
 "message": "Client request timeout"
}

This error means your query took too long to respond and timed out. It’s important to note that Elasticsearch still completes the reindexing operation in the background despite this error. You can find your Reindexing task by running:

GET _tasks?actions=*reindex&detailed=true

To prevent this error in the future, we can set the wait_for_completion URL parameter to false so that Kibana does not wait for the reindex to finish before returning. For example:

POST _reindex?wait_for_completion=false
{
    …
}
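When wait_for_completion is set to false, the response contains a task ID. You can use it to check on the reindex or to cancel it; the task ID below is a placeholder:

GET _tasks/<task_id>

POST _tasks/<task_id>/_cancel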

How to reindex without downtime

Reindexing without interrupting reads is relatively straightforward. Reindexing without interrupting reads and writes is more difficult and requires changes to your writing application. We’ll discuss both methods here.

Reindex without interrupting reads

The trick to reindexing without interrupting reads is to use aliases. When you give an index an alias, all requests to that alias go to the underlying index. What’s great about using an alias is that you can change which index the alias points to. Even better, you can do this atomically, so there is no downtime where an index is without an alias.
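If you’re not using an alias yet, you can add one to your existing index up front so readers query the alias rather than the index directly; index and alias names here are placeholders:

POST _aliases
{
   "actions": [
       { "add": { "index": "<source_index_name>", "alias": "<existing_alias>" } }
   ]
}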

Assuming you already use an alias for your index, reindexing without interrupting reads works like this:

  1. Create the new index with the new mapping:
PUT /<target_index_name>
{
 "mappings": {
   ...
 }
}
  2. Now reindex your data as before.
POST _reindex
{
 "source": {
   "index": "<source_index_name>"
 },
 "dest": {
   "index": "<target_index_name>"
 }
}
  3. Switch the alias to the new index.
POST _aliases
{
   "actions": [
       { "add": { "index": "target_index_name", "alias": "existing_alias" } },
       { "remove": { "index": "source_index_name", "alias": "existing_alias" } }
   ]
}

Optionally, if you no longer need the data in the source index, you can delete it, as below.

DELETE source_index_name

This process only works if you’re not indexing new data. If you are indexing new data, you must pause it during reindexing. So there’s no downtime for reading old data, but there is downtime for writing new data.

Reindexing without interrupting reads or writes

To truly have no downtime, that is, to continue indexing data while reindexing, you have to be able to split your write operations so that they go to two indexes at the same time. Splitting your input is necessary because you cannot index through a single alias that points to more than one index.

As before, you first create your new index with the new mapping.

PUT /<target_index_name>
{
 "mappings": {
   ...
 }
}

Now for the trick. You must split your write operation to write to both the new and old indexes. You cannot do this in Elasticsearch itself; you will have to do it with whatever you are using to index data. This could be a Python script, a Kafka connector, a message queue consumer, Beats, or some other tool. The specifics will depend on your implementation.

It’s important to note that you should only write full documents, not updates to existing documents. If you need to perform updates, you may have to accept some downtime or attempt a more delicate approach. One possible method is to use upserts, and when reindexing, take existing documents and only replace fields that don’t exist.

Once you have split your indexing, you can begin reindexing. To ensure you do not overwrite any data in the new index, you can set op_type to create in the POST request. Setting op_type to create forces Elasticsearch to only create documents that don’t already exist in the destination; if a document already exists, the reindex of that specific document fails with a version conflict. To prevent the entire reindex operation from failing, we set the “conflicts” parameter to “proceed”. Putting all this together, your reindex command looks something like this.

POST _reindex
{
 "conflicts": "proceed",
 "source": {
   "index": "<source-index-name>"
 },
 "dest": {
   "index": "<dest-index-name>",
    "op_type": "create"
 }
}

Then, as before, we switch the alias.

POST _aliases
{
   "actions": [
       { "add": { "index": "target_index_name", "alias": "existing_alias" } },
       { "remove": { "index": "source_index_name", "alias": "existing_alias" } }
   ]
}
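Before tearing anything down, it’s worth a quick sanity check that document counts in the old and new indexes are in the same ballpark (they may differ slightly while writes are still going to both):

GET <source_index_name>/_count
GET <target_index_name>/_count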

At this point it’s safe to stop your ingest pipeline to the old index. Finally, if you no longer need it, delete the old index:

DELETE source_index_name


Conclusion

In this article, we’ve introduced the Reindex API and shown how you can use it to solve some problems you encounter while managing an Elasticsearch cluster. For example, you can:

  • Use the Reindex API to solve mapping conflicts.
  • Move data from one cluster to another using the remote reindex feature.
  • Avoid timeouts in Kibana while reindexing by setting wait_for_completion to false.

We’ve also discussed some common pitfalls, like slow reindexing and long downtimes. Solutions include reindexing only the data you need, batching and slicing the reindex, and making sure you are using the Reindex API. And finally, we’ve learned about aliases and how we can use them to reindex data without any downtime.
