solr-reindexer: Quick Way to Reindex to a New Collection

July 27, 2022

If you’re using Solr, sooner or later you’ll change the schema and need to reindex. Quite often the source of truth is a database, so you can use streaming expressions via the JDBC source to reindex. But sometimes that’s not possible, or it would add too much load to the database. So how can we use Solr itself as the source?

First of all, you can still use streaming expressions to get data from Solr via the search source. However, if you use qt=/export, all the relevant fields need docValues, and that’s rarely the case, because text fields can’t have docValues. If you use the default qt=/search, you can only get the top N results, so you can’t stream or page through large datasets without running out of memory.
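
For reference, here’s what such a streaming expression could look like against the /stream handler. This is a minimal sketch: the collection name and the fields in fl are assumptions, and it only works if every field in fl (here, id and price) has docValues:

curl http://localhost:8983/solr/source_collection_name/stream --data-urlencode 'expr=search(source_collection_name, q="*:*", fl="id,price", sort="id asc", qt="/export")'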

The general solution is to run a cursor through your data: fetch one page of documents at a time and reindex it to the destination. Many people end up writing scripts to do exactly that, so we thought we’d spare you the effort and give you an easy-to-use, open-source tool that reindexes Solr documents via a cursor: solr-reindexer.
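
Under the hood, that’s the standard Solr cursor: you sort by the uniqueKey field, start with cursorMark=*, and pass each response’s nextCursorMark back in on the following request. A minimal sketch, assuming a uniqueKey called id and a page size of 500:

# first page; the response contains a nextCursorMark token
curl "http://localhost:8983/solr/source_collection_name/select?q=*:*&sort=id+asc&rows=500&cursorMark=*"

# next page; replace the placeholder with the token from the previous response
curl "http://localhost:8983/solr/source_collection_name/select?q=*:*&sort=id+asc&rows=500&cursorMark=NEXT_CURSOR_MARK"

You keep fetching until the nextCursorMark you get back equals the cursorMark you sent, which means you’ve read everything.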

In this article, we’ll give you a quick run-through of solr-reindexer, so you can get started in a few minutes.

Before we get started, keep in mind that Sematext offers a full range of services for Solr.

Running solr-reindexer

First, some prerequisites:

  • Java 11 or later to run the solr-reindexer JAR.
  • SolrCloud version 6.x or later, to support cursors.
  • A source and a destination collection. Typically, you’ll have an alias pointing to the current collection; you create a new collection with the new configuration, reindex, then flip the alias once everything looks good (see the alias sketch after this list).
  • A uniqueKey field defined in the schema, needed for the cursor to work. All but the most exotic schemas out there have it, and it’s usually called id.
  • Download the solr-reindexer uber-jar from the releases page.
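
As a sketch of that alias flip, Solr’s Collections API CREATEALIAS action creates or re-points an alias (the alias name products below is made up):

# point the products alias at the freshly reindexed collection
curl "http://localhost:8983/solr/admin/collections?action=CREATEALIAS&name=products&collections=destination_collection_name"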

For a quick reindex you can simply do:

java -jar solr-reindexer.jar -sourceCollection source_collection_name -targetCollection destination_collection_name -zkAddress localhost:2181

Reindexing a Collection Subset

You may want to limit what you reindex:

  • Documents. For example, some use-cases have deactivated documents that you may want to skip. To do that, provide a query that matches only the documents you want, e.g. -query "isDeactivated:false"
  • Fields. By default, we skip the _version_ field, because it’s written automatically by Solr. You can skip other fields as well, for example stored copyField targets, via e.g. -skipFields _version_,text (though a copyField target typically isn’t stored if you store the original). See the combined example after this list.
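
Putting the two together, here’s a sketch using the flags described above (the query and the skipped text field are assumptions about your schema):

java -jar solr-reindexer.jar -sourceCollection source_collection_name -targetCollection destination_collection_name -zkAddress localhost:2181 -query "isDeactivated:false" -skipFields _version_,text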

Connection Preferences

By default, solr-reindexer connects to Zookeeper at localhost:2181, but you can supply a comma-separated list of Zookeeper addresses via -zkAddress for redundancy. Also, if a request fails, solr-reindexer retries it up to 10 times at an interval of 5 seconds. Both are configurable, via -retries and -retryInterval respectively.
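
For example, here’s a sketch with a three-node Zookeeper ensemble and more patient retry behavior (the hostnames and values are made up, and we’re assuming -retryInterval takes seconds, matching the 5-second default above):

java -jar solr-reindexer.jar -sourceCollection source_collection_name -targetCollection destination_collection_name -zkAddress zk1:2181,zk2:2181,zk3:2181 -retries 20 -retryInterval 10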

For a full list of parameters and their defaults, check out the output of -help.

Contributing and Next Steps

There’s always room for improvement, and because solr-reindexer is Apache2-licensed, we invite you to contribute. At the time of writing, support for authentication and for reindexing from a remote cluster would be the most valuable additions. Reindexing could also be made faster, by adding a queue and writing on multiple threads, and by splitting the read into partitions (e.g. by shard or by ranges of uniqueKey) to parallelize that side as well.

Lastly, you may want to know that we’re a one-stop shop for Solr, with a full range of services.

Feel free to contact us if you’re interested in any of the above, or if you have any questions or comments!
