If you’re working with Elasticsearch, it’s very likely that you’ll need to reindex data at some point. The most common reason is that you need a mapping change that is incompatible with your current mapping. New fields can be added by default, but many changes to existing fields are not allowed. For example:
- Want to switch to doc values because field data is taking too much heap? Reindex!
- Want to change the analyzer of a given field? Reindex!
- Want to break one great big index into time-based indices? Reindex!
Enter Logstash
A while ago I was using stream2es for reindexing, but its GitHub page recommends using Logstash instead. Why? In general, Logstash can do more. Here are my top three reasons:
- On the input side, you can use a query to select only a subset of documents to reindex
- You can add filters to transform documents on their way to the new index (or indices)
- It should perform better, as you can add more filter threads (using the -w parameter) and multiple output worker threads (using the workers configuration option)
Show Me the Configuration!
In short, you’ll use the elasticsearch input to read existing data and the elasticsearch output to write it. In between, you can use various filters to change what documents look like.
Input
To read documents, you’ll use the elasticsearch input. You’ll probably want to specify the host(s) to connect to and the index (check the documentation for more options like query):
input {
  elasticsearch {
    hosts => ["localhost"]
    index => "old-index"
  }
}
By default, this runs a match_all query that scrolls through all the documents of the index, fetches pages of 1000, and times out after a minute (i.e. if more than a minute passes between two page fetches, Elasticsearch discards the scroll context and the input won’t know where it left off). All this is configurable, but the defaults are sensible. Scrolling with scan is good for deep paging (normally, when you fetch results 1000000 to 1000020, Elasticsearch fetches 1000020 documents, sorts them, and gives back the last 20). It also works with a “snapshot” of the index (updates made after the scan started won’t be taken into account).
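As a rough sketch of those options, the input below narrows the reindex to a subset of documents and spells out the paging settings explicitly. The status field and its value are hypothetical, and option names can differ between plugin versions, so check them against the documentation for the version you run.

```logstash
input {
  elasticsearch {
    hosts => ["localhost"]
    index => "old-index"
    # Hypothetical filter: only reindex documents whose "status" field is "active"
    query => '{ "query": { "term": { "status": "active" } } }'
    # Documents fetched per scroll page (1000 is the default described above)
    size => 1000
    # How long Elasticsearch keeps the scroll context alive between page fetches
    scroll => "1m"
  }
}
```

If your filters or outputs are slow, raising scroll (for example to "5m") gives Logstash more time between page fetches before the scroll context expires.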
Filter
Next, you might want to change documents on their way to the new index. For example, if the data you’re reindexing wasn’t originally indexed with Logstash, you probably want to remove the @version and/or @timestamp fields that are automatically added. To do that, you’ll use the mutate filter:
filter {
  mutate {
    remove_field => [ "@version" ]
  }
}
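If you want to drop both fields and reshape documents at the same time, a slightly fuller sketch could look like this. The field names user_name and username are made up for illustration.

```logstash
filter {
  mutate {
    # Drop both fields that Logstash adds to every event
    remove_field => [ "@version", "@timestamp" ]
    # Hypothetical cleanup: rename a field on its way to the new index
    rename => { "user_name" => "username" }
  }
}
```

Keep @timestamp if your output index name is built from it, as is the case with time-based index patterns.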
Output
Finally, you’ll use the elasticsearch output to send data to a new index. The defaults are once again geared towards the logging use case. If this is not your setup, you might want to disable the default Logstash template (manage_template => false) and use your own:
output {
  elasticsearch {
    hosts => ["localhost"]
    manage_template => false
    index => "new-index"
    document_type => "new-type"
  }
}
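One thing this output doesn’t do is keep the original document IDs, so Elasticsearch will generate new ones. A possible sketch for preserving them, assuming your version of the elasticsearch input supports the docinfo option and stores the source document’s _id under @metadata (the exact metadata location may vary between plugin versions):

```logstash
input {
  elasticsearch {
    hosts => ["localhost"]
    index => "old-index"
    # Assumption: docinfo is available in your plugin version
    docinfo => true
  }
}
output {
  elasticsearch {
    hosts => ["localhost"]
    manage_template => false
    index => "new-index"
    document_type => "new-type"
    # Reuse the ID of the source document instead of generating a new one
    document_id => "%{[@metadata][_id]}"
  }
}
```

Because @metadata is not sent to outputs as part of the event, the ID is used for addressing the document without being added to it as a field.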
Final Remarks
If you want to use time-based indices, you can change index to something like "logstash-%{+YYYY.MM.dd}" (this is the default), and the date is taken from the @timestamp field. By default, this field is populated with the time Logstash processes the document, but you can use the date filter to replace it with a timestamp from the document itself:
filter {
  date {
    match => [ "custom_timestamp", "MM/dd/YYYY HH:mm:ss" ]
    target => "@timestamp"
  }
}
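Putting the two pieces together, a sketch of a time-based reindex might pair that date filter with an index pattern on the output. The new-index- prefix is just a placeholder.

```logstash
filter {
  date {
    # custom_timestamp is the document's own time field, as in the snippet above
    match => [ "custom_timestamp", "MM/dd/YYYY HH:mm:ss" ]
    target => "@timestamp"
  }
}
output {
  elasticsearch {
    hosts => ["localhost"]
    manage_template => false
    # One index per day, named after the day in @timestamp
    index => "new-index-%{+YYYY.MM.dd}"
  }
}
```

Note that @timestamp is stored in UTC, so the day boundaries of the resulting indices follow UTC rather than your local time zone.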
If your Logstash configuration contains only these snippets, Logstash will shut down cleanly when it’s done reindexing.
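For reference, here is a minimal sketch of what such a complete file could look like. The file name reindex.conf is an assumption, and the workers option on the output is deprecated in later plugin versions, so remove it if your version rejects it.

```logstash
# reindex.conf (hypothetical file name)
input {
  elasticsearch {
    hosts => ["localhost"]
    index => "old-index"
  }
}
filter {
  mutate {
    remove_field => [ "@version" ]
  }
}
output {
  elasticsearch {
    hosts => ["localhost"]
    manage_template => false
    index => "new-index"
    # Multiple output worker threads; assumption: supported by your plugin version
    workers => 2
  }
}
```

You would then run it with something like bin/logstash -f reindex.conf -w 4, where -w sets the number of filter threads.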
That’s it! We are happy to answer questions or receive feedback – please drop us a line or get us @sematext. And, yes, we’re hiring!