
DocValues Reindexing with Solr Streaming Expressions

Last time, when talking about Solr 6, we learned how to use streaming expressions to automatically update data in a collection. As you can imagine, that is not the only cool thing you can do with streaming expressions. Today we will see how to re-index the data in a collection whose fields use doc values. For that we will use Solr 6.1, because of a bug that was fixed in that version (see SOLR-9015 for details).

Let’s assume we have two collections. The first, called video, will be the source of the data. The second, called video_new, will be the target. The two collections have slightly different structures: some of the field names differ. The video collection has the following fields:

  • id – document identifier
  • url – URL of the video
  • likes – number of likes
  • views – number of views

The second collection, video_new, will have the following fields:

  • id – document identifier
  • url – URL of the video
  • num_likes – number of likes
  • num_views – number of views

Exporting the data

The first thing we need to figure out is how to export data from the source collection efficiently. We can’t just set the rows parameter to some huge value, because that is inefficient and can cause Solr to run out of memory. Instead, we will use the /export request handler. Its only limitations are that the results must be sorted and that the fields used for sorting and listed in fl must use doc values. That is not a problem for our data, but you should be aware of this requirement.

We will start by exporting the data using the standard Solr way – using the request params with the /export handler. The request looks like this:

curl -XGET 'localhost:8983/solr/video/export?q=*:*&sort=id+desc&fl=id,url,likes,views'

The above will result in Solr using the /export handler and returning all data, not only the first page of the results.
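If you prefer to build the same request from a script, the sketch below assembles the /export URL using only the Python standard library. The helper name export_url and the base URL are our own assumptions for illustration and are not part of Solr.

```python
from urllib.parse import urlencode


def export_url(base, collection, fields, sort, query="*:*"):
    """Build a /export URL for the given collection.

    Every field in `fields` and in `sort` must have doc values enabled,
    otherwise the /export handler will reject the request.
    """
    params = urlencode({"q": query, "sort": sort, "fl": ",".join(fields)})
    return f"{base}/{collection}/export?{params}"


# Assumed local Solr instance, matching the curl example above.
url = export_url("http://localhost:8983/solr", "video",
                 ["id", "url", "likes", "views"], "id desc")
print(url)
```

The URL-encoded form differs visually from the hand-written curl URL (for example, commas become %2C), but it carries the same parameters.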

However, we want to use streaming expressions to re-index the data. Because of that we can change the above request to use the search streaming expression, which looks as follows:

search(
  video,
  zkHost="localhost:9983",
  qt="/export",
  q="*:*",
  fl="id,url,likes,views",
  sort="id desc")

The working command with the request looks like this:

curl --data-urlencode 'expr=search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc")' http://localhost:8983/solr/video/stream

We use the search streaming expression and provide the name of the collection (video in our case), the ZooKeeper host (yes, we can read from other clusters), and the name of the request handler, which has to be /export in our case so that the whole result set is returned and not just the first page. Finally, we provide the match-all query, the list of fields that we are interested in, and the sorting expression. Please remember that when using the /export handler all fields listed in the fl parameter must use doc values.
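The parameters described above can also be put together in code. The following is a minimal sketch, not an official client: search_expr and tuples are hypothetical helper names, and the sample response is made-up data that only mimics the shape of what the /stream handler returns, a result-set object whose docs list ends with an EOF marker tuple.

```python
def search_expr(collection, zk_host, q, fl, sort, qt="/export"):
    """Return a search() streaming expression as a string."""
    fields = ",".join(fl)
    return (f'search({collection},zkHost="{zk_host}",qt="{qt}",'
            f'q="{q}",fl="{fields}",sort="{sort}")')


def tuples(response):
    """Yield data tuples from a parsed /stream response, skipping the EOF marker."""
    for doc in response["result-set"]["docs"]:
        if "EOF" not in doc:
            yield doc


expr = search_expr("video", "localhost:9983", "*:*",
                   ["id", "url", "likes", "views"], "id desc")
print(expr)

# Illustrative (made-up) response with one data tuple and the EOF marker.
sample = {"result-set": {"docs": [
    {"id": "2", "url": "http://example.com/2", "likes": 5, "views": 10},
    {"EOF": True, "RESPONSE_TIME": 3},
]}}
print(list(tuples(sample)))
```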

Changing field names

Our collections have different field names, so the above search request is not enough. We need to rename the fields by using the select streaming expression. We will change the name of the likes field to num_likes and the name of the views field to num_views. The expression that does that is:

select(
  search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),
  id, 
  url, 
  likes as num_likes,
  views as num_views
)

The select streaming expression lets us choose which fields should be used in the resulting tuples and how they will be named. In our case we take the id and url fields as is and we change the name of the likes and views fields.

To test the result of that expression you can simply use the following command:

curl --data-urlencode 'expr=select(search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),id,url,likes as num_likes,views as num_views)' http://localhost:8983/solr/video/stream
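To make the renaming step more concrete, here is a small sketch that generates the select expression from a field mapping and applies the same mapping to a single tuple locally. Both helpers (select_expr and rename) and the sample tuple are assumptions made for illustration. The local rename only mirrors what we expect select to emit; it is not how Solr implements it.

```python
# Source field name -> name in the resulting tuple.
FIELDS = {"id": "id", "url": "url", "likes": "num_likes", "views": "num_views"}

# The inner search expression, copied from the command above.
SEARCH = ('search(video,zkHost="localhost:9983",qt="/export",q="*:*",'
          'fl="id,url,likes,views",sort="id desc")')


def select_expr(inner, fields):
    """Wrap `inner` in select(); unchanged names are passed through as is."""
    parts = [src if src == dst else f"{src} as {dst}"
             for src, dst in fields.items()]
    return f"select({inner},{','.join(parts)})"


def rename(doc, fields):
    """Apply the same mapping to one tuple locally."""
    return {dst: doc[src] for src, dst in fields.items()}


expr = select_expr(SEARCH, FIELDS)
print(expr)
print(rename({"id": "1", "url": "u", "likes": 3, "views": 7}, FIELDS))
```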

Running the re-indexing

Finally, we have the data prepared and read in an efficient way, so we can send it to Solr for indexing. We do that with the update streaming expression, simply by specifying the target collection name and the batch size, like this:

update(
  video_new, 
  batchSize=100, 
  select(search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),id,url,likes as num_likes,views as num_views))

And the command that we would send to Solr:

curl --data-urlencode 'expr=update(video_new,batchSize=100,select(search(video,zkHost="localhost:9983",qt="/export",q="*:*",fl="id,url,likes,views",sort="id desc"),id,url,likes as num_likes,views as num_views))' http://localhost:8983/solr/video/stream

Please note that we send the command to the source collection /stream handler – in our case to the video collection. This is important.
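The same request can be sent from a script instead of curl. The sketch below wraps the select expression in update and defines a helper that posts an expression to a /stream handler. The names update_expr and run_expr are hypothetical, the URL assumes the local setup used in this post, and run_expr is only defined here, not called, because it needs a running Solr.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

# The select expression from the previous section.
SELECT = ('select(search(video,zkHost="localhost:9983",qt="/export",q="*:*",'
          'fl="id,url,likes,views",sort="id desc"),'
          'id,url,likes as num_likes,views as num_views)')


def update_expr(target, inner, batch_size=100):
    """Wrap `inner` in update() so its tuples are indexed into `target`."""
    return f"update({target},batchSize={batch_size},{inner})"


def run_expr(stream_url, expr):
    """POST the expression to a /stream handler and return the parsed JSON."""
    body = urlencode({"expr": expr}).encode("utf-8")
    with urlopen(Request(stream_url, data=body)) as resp:
        return json.load(resp)


expr = update_expr("video_new", SELECT)
print(expr)

# Usage against the source collection's /stream handler (assumed local Solr):
# run_expr("http://localhost:8983/solr/video/stream", expr)
```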

Verifying the re-indexing

Once Solr has finished the task (and the documents in video_new have been committed, so that they are visible to searches), we can compare the number of documents in each collection to verify that the data has been re-indexed properly. We can do that by running these commands:

curl -XGET 'localhost:8983/solr/video/select?q=*:*&indent=true&rows=0'

and

curl -XGET 'localhost:8983/solr/video_new/select?q=*:*&indent=true&rows=0'

Both commands report the same number of documents:

<?xml version="1.0" encoding="UTF-8"?>
<response>

<lst name="responseHeader">
  <bool name="zkConnected">true</bool>
  <int name="status">0</int>
  <int name="QTime">38</int>
  <lst name="params">
    <str name="q">*:*</str>
    <str name="indent">true</str>
    <str name="rows">0</str>
  </lst>
</lst>
<result name="response" numFound="18" start="0">
</result>
</response>

And that means that everything works as intended 🙂
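If you want to automate this check, the numFound attribute can be read from the XML response with the Python standard library. This is only a sketch: num_found is a hypothetical helper, and the sample below is a trimmed copy of the response shown above. Keep in mind that equal counts are a quick sanity check and do not compare the documents themselves.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the select response shown above.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
</lst>
<result name="response" numFound="18" start="0">
</result>
</response>"""


def num_found(xml_text):
    """Extract numFound from a Solr XML select response."""
    root = ET.fromstring(xml_text)
    return int(root.find("result").get("numFound"))


source_count = num_found(SAMPLE)   # would come from the video collection
target_count = num_found(SAMPLE)   # would come from the video_new collection
print(source_count, target_count, source_count == target_count)
```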

Interested in Solr Streaming Expressions? Subscribe to this blog or follow @sematext – we have more Streaming Expressions blog posts in the queue. If you need any help with Solr / SolrCloud – don’t forget @sematext does Solr Consulting, Production Support, as well as Solr Training!
