If you’re using Solr, sooner or later you’ll change the schema and need to reindex. Quite often the source of truth is a database, so you can use streaming expressions via the JDBC source to reindex. But sometimes that’s not possible, or it adds too much load to the DB. So how can we use Solr itself as a source?
First of all, you can still use streaming expressions to get data from Solr via the search source. However, if you use `qt=/export`, all the relevant fields need to have `docValues`. That’s rarely going to be the case, since text fields can’t have `docValues`. If you use the default `/select` handler, you can only get the top N results, so you won’t be able to stream or page through large datasets without running out of memory.
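To make the `/export` option concrete, here’s roughly what pulling documents with a streaming expression looks like. The collection and field names below are assumptions, and this only works if every field listed in `fl` has `docValues`:

```shell
# Sketch only: collection and field names are examples, not from the article.
SOLR=http://localhost:8983/solr/source_collection

# A streaming expression using the search source over /export;
# every field listed in fl must have docValues for this to work.
EXPR='search(source_collection, q="*:*", fl="id,price", sort="id asc", qt="/export")'

# To actually run it (commented out so the sketch is self-contained):
# curl --data-urlencode "expr=$EXPR" "$SOLR/stream"
echo "$EXPR"
```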
The general solution is to run a cursor through your data, fetch one page of documents at a time, and reindex them to the destination. Many people end up writing scripts to do that, so we thought we’d spare you the effort and give you an easy-to-use, open-source one that reindexes Solr documents via a cursor: solr-reindexer.
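Under the hood, cursor paging boils down to sorting on the uniqueKey, starting with `cursorMark=*`, and passing each response’s `nextCursorMark` into the next request. A minimal sketch of one such request (the collection name and `id` field are assumptions):

```shell
# Sketch of one cursor-paged request; solr-reindexer automates this loop.
# Collection name and uniqueKey field (id) are assumptions.
SOLR=http://localhost:8983/solr/source_collection
CURSOR='*'   # the first request always starts with cursorMark=*

# The sort MUST include the uniqueKey so the cursor is stable.
QUERY_URL="$SOLR/select?q=*:*&rows=1000&sort=id+asc&cursorMark=$CURSOR"

# To actually fetch a page (commented out): the JSON response contains
# "nextCursorMark", which you pass as cursorMark on the next request;
# stop when it no longer changes between two consecutive requests.
# curl "$QUERY_URL"
echo "$QUERY_URL"
```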
In this article, we’ll give you a quick run through solr-reindexer, so you can get started in a few minutes.
Before we get started, keep in mind that Sematext offers a full range of services for Solr.
First, some prerequisites:
- Java 11 or later to run the solr-reindexer JAR.
- SolrCloud version 6.x or later, to support cursors.
- A source and a destination collection. Typically, you’ll have an alias pointed to the current collection, create a new one with the new configuration, reindex, then flip the alias once everything is fine.
- A uniqueKey field defined in the schema, needed for the cursor to work. All but the most exotic schemas out there have one, and it’s usually called `id`.
- Download the solr-reindexer uber-jar from the releases page.
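The create-reindex-flip workflow from the prerequisites might look like this. The collection, alias, and configset names are all hypothetical, while `CREATE` and `CREATEALIAS` are standard Solr Collections API actions:

```shell
# All names below are hypothetical examples.
SOLR=http://localhost:8983/solr
OLD=products_v1      # collection the alias currently points to
NEW=products_v2      # new collection with the changed schema
ALIAS=products       # alias your application queries

# 1. Create the new collection with the new configset (commented out):
# curl "$SOLR/admin/collections?action=CREATE&name=$NEW&numShards=2&collection.configName=products_conf_v2"

# 2. Reindex old -> new (commented out):
# java -jar solr-reindexer.jar -sourceCollection $OLD -targetCollection $NEW -zkAddress localhost:2181

# 3. Once everything looks fine, flip the alias to the new collection:
FLIP_URL="$SOLR/admin/collections?action=CREATEALIAS&name=$ALIAS&collections=$NEW"
# curl "$FLIP_URL"
echo "$FLIP_URL"
```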
For a quick reindex you can simply do:
```shell
java -jar solr-reindexer.jar -sourceCollection source_collection_name -targetCollection destination_collection_name -zkAddress localhost:2181
```
Reindexing a Collection Subset
You may want to limit what you reindex:
- Documents. For example, some use-cases have deactivated documents, which you may want to skip. To do that, provide a query that matches only the documents you want to reindex.
- Fields. By default, we skip the `_version_` field, because it’s written automatically by Solr. You can skip others as well: for example, if you have `copyField` targets that are also stored, you’ll want to skip them too, e.g. via `-skipFields _version_,text`. Though a `copyField` target is typically not stored if you store the original.
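Putting it together, a reindex that skips `_version_` and a stored copyField target might look like this. The collection names and the `text` field are placeholders; the flags are the ones described above:

```shell
# Example invocation only; collection names and the "text" field are placeholders.
CMD='java -jar solr-reindexer.jar \
  -sourceCollection source_collection_name \
  -targetCollection destination_collection_name \
  -zkAddress localhost:2181 \
  -skipFields _version_,text'
# eval "$CMD"   # commented out so the sketch stays self-contained
echo "$CMD"
```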
By default, solr-reindexer connects to Zookeeper at `localhost:2181`. You can supply a comma-separated list of Zookeeper addresses for redundancy. Also, if a request fails, it will retry up to 10 times at an interval of 5 seconds. Both the retry count and the interval are configurable. For a full list of parameters and their defaults, check the tool’s help output.
Contributing and Next Steps
There’s always room for improvement, and because solr-reindexer is Apache 2.0-licensed, we invite you to contribute. At the time of writing, support for authentication and reindexing from a remote cluster would be the most important functionality additions. We could also make reindexing faster by adding a queue and writing on multiple threads, or by splitting the read into partitions (e.g. by shard or by uniqueKey range) to parallelize that side, too.
Lastly, you may want to know that we’re a one-stop-shop for Solr. We have:
- Solr consulting, if you need help developing your project
- Solr training classes, both public and private
- Solr production support, if you need someone to help fight fires
- Excellent Solr monitoring and log-aggregation. Here’s how you can troubleshoot slow Solr queries with Sematext Cloud
- Jobs for people who already know Solr and want to work in consulting, support and product development
Feel free to contact us if you’re interested in any of the above, or if you have any questions or comments!