
Writing a Custom Sort Plugin for Solr

August 29, 2022

OK, so you want to sort your documents by something that can’t be implemented with Solr’s built-in functions. This calls for a custom function, which you can implement through your own ValueSourceParser.

To address the elephant in the room: Elasticsearch and OpenSearch have script sorting, which is easier to implement but not as close to Lucene. If you need to get closer to Lucene there, you can of course use a native script as well.

Back to Solr: in this post we’ll go through all the steps, from configs and queries to the actual code. The code samples will work on Solr 9.0. It’s very similar with 8.x and 7.x, so you should be able to tweak them with the help of a friendly IDE.

Looking for some help with Solr? Sematext offers a full range of services for Solr!

Sample Use-Case

So you can follow along, let’s assume we have a multi-valued string field and we want to sort by the average ASCII code of all the characters from all values.

In practice, it would be better to compute this at index time into a numeric field and use that for sorting. But this is just an example, meant to show you the various points where you can hook in your own logic.
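To make the target concrete, here is a minimal, Lucene-free sketch of the computation itself. The class and method names (AvgAsciiSketch, averageCode) are hypothetical and not part of the plugin, and returning -1 for a document without characters is just an assumed convention. The three inputs are the same values used for the test documents later in this post.

```java
class AvgAsciiSketch {

    // Average character code across all values of one document.
    // Returns -1 when there are no characters at all (assumed convention).
    static int averageCode(String... values) {
        long sum = 0;   // long, so long inputs don't overflow the running sum
        long count = 0;
        for (String value : values) {
            for (int i = 0; i < value.length(); i++) {
                sum += value.charAt(i);
                count++;
            }
        }
        return count == 0 ? -1 : (int) (sum / count);
    }

    public static void main(String[] args) {
        System.out.println(averageCode("a", "b", "c")); // 98  ('b')
        System.out.println(averageCode("x", "y", "z")); // 121 ('y')
        System.out.println(averageCode("f", "g", "h")); // 103 ('g')
    }
}
```

Sorting ascending by these averages gives the order a, f, x, which is the order the test expects. The same method could also feed a numeric field at index time, as suggested above.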

Schema and Solrconfig

To extract the strings from fields, it’s best to use docValues. So we’ll have docValues=true in our test field in the schema.

For solrconfig, there are three things to consider. First, our plugin will be a JAR file, so we need to make sure that it’s loaded. We can do that by either adding a lib directive or by copying the JAR file into one of the default lib directories.

Once the JAR is loaded, we need to tell Solr to use our custom ValueSourceParser. We’ll put something like this under config:

<valueSourceParser name="avgascii"
    class="com.sematext.solrsortplugin.AvgAsciiValueSourceParser" />

Where avgascii is what we’ll call the function in queries and class is our implementation of ValueSourceParser. We can also add custom parameters – I’ll show you how later – but it’s usually nice to be explicit and set variables at query time.

Speaking of queries, let’s see how we can call our custom function.

Queries and Test Cases

We’ll call our custom function like any other function, by the name defined in solrconfig.xml. For example:

q=*:*&sort=avgascii(stringfield_ss) desc

You’ll want to add tests for your function. To do that, we’ll create a TestAvgAsciiValueSourceParser class in the project under test/java, in the com.sematext.solrsortplugin package. Excluding some boilerplate, one test can look like this:

public class TestAvgAsciiValueSourceParser extends SolrTestCaseJ4 {

   @BeforeClass
   public static void beforeClass() throws Exception {
       // here we create a test core. You can find the config in test/resources/solr/collection1/conf/
       initCore("solrconfig.xml","schema.xml");
   }

   @Test
   public void testSimple() throws Exception {
       clearIndex();

       // add some documents
       assertU(adoc("id", "1", "foo_ss", "a", "foo_ss", "b", "foo_ss", "c"));
       assertU(adoc("id", "2", "foo_ss", "x", "foo_ss", "y", "foo_ss", "z"));
       assertU(adoc("id", "3", "foo_ss", "f", "foo_ss", "g", "foo_ss", "h"));
       assertU(commit());

       // run the query and check if we get back the documents
       // in the expected order
       assertJQ(req("q", "*:*", "fl", "id", "sort", "avgascii(foo_ss) asc")
               , "/response/docs==[{'id': '1'}," +
                       "{'id': '3'}," +
                       "{'id': '2'}]");
   }
}

Each test will be its own testNameGoesHere method. As noted in the comments, you’ll need to provide a schema and a solrconfig file as well. These can be minimal. For example, the schema can contain just the fields you’re using:

<?xml version="1.0" encoding="UTF-8"?>
<schema name="miniSchema" version="1.6">
 <uniqueKey>id</uniqueKey>

 <fields>
   <field name="id" type="string" stored="true" indexed="true" multiValued="false" required="true"/>
   <field name="_version_" type="long" indexed="true" stored="true"/>
   <dynamicField name="*_ss" type="string" stored="false" indexed="true" multiValued="true" docValues="true"/>
 </fields>

 <types>
   <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
   <fieldType name="long" class="solr.LongPointField" docValues="true"/>
 </types>

</schema>

And the solrconfig can be adapted from Solr’s own solr/core/src/test-files/solr/collection1/conf/solrconfig-functionquery.xml. Just add your custom function as mentioned in the previous section:

<?xml version="1.0" encoding="UTF-8" ?>
<config>
...
 <!-- our custom sort plugin goes here -->
 <valueSourceParser name="avgascii"
        class="com.sematext.solrsortplugin.AvgAsciiValueSourceParser" />
...
</config>

Now that we have a failing test, let’s add the implementation to make it work 🙂

Maven Project Setup

Your project doesn’t have to be Maven, of course, but I find this easy, because these plugins tend to be simple. Here is a sample pom.xml that includes Solr, junit, and the plugins needed to package the plugin into a JAR, based on Java 17.

mvn package should build the plugin you need, but we need the actual code, so let’s get to it.

ValueSourceParser – the Entrypoint

First off, we need the ValueSourceParser implementation that we reference in solrconfig.xml. This is where we get the query-time arguments and we pass them to a custom implementation of Lucene’s ValueSource.

If we call our custom function with the field name:

avgascii(foo_ss) asc

Our ValueSourceParser will have to take those parameters via FunctionQParser’s parseArg method. If a parameter is optional – we should check for its existence through the hasMoreArguments method. Here’s an example implementation:

public class AvgAsciiValueSourceParser extends ValueSourceParser {

  @Override
  public ValueSource parse(FunctionQParser fp) throws SyntaxError {
     final String fieldName = fp.parseArg();    
     return new AvgAsciiValueSource(fieldName);
  }
}

If you want to make defaults configurable through solrconfig.xml, override the init() method of NamedListInitializedPlugin. Here’s an example implementation. Otherwise, we should be good with just the parse() method, returning a custom ValueSource – so let’s move on to that.

ValueSource – Returning a Custom Sort Field

So far we’ve gone through steps that are valid for any custom function. But functions can provide values for two things:

  • Sorting. This is what we do in the rest of this post, and it’s done via the getSortField method. We need to return a SortField, which we can manipulate.
  • Other functions. Typically used to implement a custom scoring function that will, for example, combine the similarity score with the number of interactions of different kinds (likes, shares, comments…) on a social media post. You’d do this through the getValues method. Here’s a tutorial.

Our constructor will take the field name (and other parameters if you need them):

public class AvgAsciiValueSource extends ValueSource {

  public static final String NAME = "avgascii";

  String fieldName;

  public AvgAsciiValueSource(String fieldName) {
     this.fieldName = fieldName;
  }

Now we can get to our getSortField(), which has to return a SortField. This class has a bunch of constructors, but to inject our own sort logic, we’ll need to provide a custom FieldComparatorSource:

@Override
public SortField getSortField(boolean reverse) {
  FieldComparatorSource comparatorSource = new AvgAsciiComparatorSource();
  return new SortField(this.fieldName, comparatorSource, reverse);
}

We also have to implement getValues(), but since we don’t want to use our function as a source for other functions (e.g. to compute the score), we can simply throw an error if someone tries to use it like that:

@Override
public FunctionValues getValues(Map context, LeafReaderContext readerContext){
  throw new UnsupportedOperationException(this + " is only for sorting");
}

If we wanted to use our function with other functions, we could add our own implementation, typically an implementation of one of the abstract classes that already extend FunctionValues, such as DoubleDocValues:

@Override
public FunctionValues getValues(Map context, LeafReaderContext readerContext){
  return new DoubleDocValues(this) {
     @Override
     public double doubleVal(int i) throws IOException {
        return 0; // TODO custom logic here
     }
  };
}

Besides getSortField() and getValues(), the other methods that you need to implement are basically boilerplate:

  • a description() for this function in general.
  • a hashCode() that identifies the specific instance of the function, typically a combination of the String.hashCode() values of its parameters.
  • an equals() to check if two calls of the same function are the same.

So let’s move on to where we left off with getSortField(). We have three parameters to our SortField:

  • fieldName, which points to our multivalued field.
  • reverse, which controls whether our sort is ascending or descending.
  • the comparatorSource, which lets us tell Solr (or rather, Lucene) how to sort the values from the multivalued field. So let’s move on to that.

ComparatorSource – a Wrapper for the Comparator

The custom FieldComparatorSource is just a necessary step to get to our FieldComparator implementation – where the custom sort logic would go.

If we had other parameters than the field name, we could take them in the constructor of this class and pass them to the FieldComparator. But since that’s not the case, our FieldComparatorSource can look like this:

public class AvgAsciiComparatorSource extends FieldComparatorSource {

  public AvgAsciiComparatorSource() {
  }

  @Override
  public FieldComparator<?> newComparator(String fieldName, int numHits, int sortPos, boolean reversed) {
     return new AvgAsciiComparator(fieldName, numHits);
  }
}

Moving on to the FieldComparator, where most of the magic happens.

FieldComparator – Custom Comparison Logic

To implement a FieldComparator, we can look at one of the existing implementations. The simplest one for comparing strings is TermValComparator, which compares the byte arrays of each value it looks at.

We’ll start with the constructor. Again, if we had any more parameters, we’d take them here. Notice that we also need somewhere to store the values we collect, plus a top value and a bottom value. We’ll talk about them later. Also note that Lucene represents strings as BytesRef objects, which are essentially byte arrays with some tooling around them.

public class AvgAsciiComparator extends FieldComparator<BytesRef> implements LeafFieldComparator {

   private final BytesRef[] values;
   private final BytesRefBuilder[] tempBRs;
   private BytesRef topValue;
   private BytesRef bottom;
   private BinaryDocValues docTerms;
   String fieldName;

   public AvgAsciiComparator(String fieldName, int numHits) {
       values = new BytesRef[numHits];
       tempBRs = new BytesRefBuilder[numHits];
       this.fieldName = fieldName;
   }

Then we have a bunch of methods that we can get straight out of the reference TermValComparator. Some of them implement LeafFieldComparator and you’ll find the detailed descriptions in its JavaDoc. But I’ll add a few comments here as well, for convenience:

// sets the value that’s sorted last
public void setBottom(int slot) {
   this.bottom = values[slot];
}

// compare the value from this doc to the last
public int compareBottom(int doc) throws IOException {
   final BytesRef comparableBytes = getValueForDoc(doc);
   return compareValues(bottom, comparableBytes);
}

// compare the top value to this doc
public int compareTop(int doc) throws IOException {
   return compareValues(topValue, getValueForDoc(doc));
}

// copy a competitive value to this slot in the list
public void copy(int slot, int doc) throws IOException {
   final BytesRef comparableBytes = getValueForDoc(doc);
   if (comparableBytes == null) {
       values[slot] = null;
   } else {
       if (tempBRs[slot] == null) {
           tempBRs[slot] = new BytesRefBuilder();
       }
       tempBRs[slot].copyBytes(comparableBytes);
       values[slot] = tempBRs[slot].get();
   }
}

// can set the Scorer to use if we need the document’s score, but we don’t
@Override
public void setScorer(Scorable scorable) {}
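If the slot terminology feels abstract, the following toy model may help. It is a simplified, Lucene-free sketch of how a top-N collector could drive copy(), setBottom() and compareBottom(). All names here (SlotComparatorSketch, topN, worstSlot) are hypothetical, it uses plain strings instead of BytesRef, and it finds the worst slot with a linear scan, which is only an assumption made to keep the example short and not how Lucene tracks the bottom entry.

```java
import java.util.Arrays;

class SlotComparatorSketch {

    // One slot per hit we intend to keep, like values[] in the comparator
    private final String[] values;
    private String bottom;

    SlotComparatorSketch(int numHits) {
        values = new String[numHits];
    }

    // Remember a competitive document's value in the given slot
    void copy(int slot, String docValue) {
        values[slot] = docValue;
    }

    // Tell the comparator which slot currently holds the weakest kept hit
    void setBottom(int slot) {
        bottom = values[slot];
    }

    // > 0 means the document sorts before the bottom, so it is competitive
    int compareBottom(String docValue) {
        return bottom.compareTo(docValue);
    }

    // Slot holding the value that sorts last among the kept hits
    int worstSlot() {
        int worst = 0;
        for (int slot = 1; slot < values.length; slot++) {
            if (values[slot].compareTo(values[worst]) > 0) {
                worst = slot;
            }
        }
        return worst;
    }

    // Toy collector: keeps the numHits smallest values (ascending sort)
    static String[] topN(String[] docValues, int numHits) {
        SlotComparatorSketch cmp = new SlotComparatorSketch(numHits);
        int filled = 0;
        int bottomSlot = -1;
        for (String docValue : docValues) {
            if (filled < numHits) {
                // still room: every document is competitive
                cmp.copy(filled++, docValue);
                if (filled == numHits) {
                    bottomSlot = cmp.worstSlot();
                    cmp.setBottom(bottomSlot);
                }
            } else if (cmp.compareBottom(docValue) > 0) {
                // replace the weakest kept hit, then find the new bottom
                cmp.copy(bottomSlot, docValue);
                bottomSlot = cmp.worstSlot();
                cmp.setBottom(bottomSlot);
            }
        }
        String[] result = Arrays.copyOf(cmp.values, filled);
        Arrays.sort(result);
        return result;
    }
}
```

For example, calling topN with the values g, b, y, a and numHits set to 2 keeps a and b: g and b fill the two slots, y is rejected by compareBottom(), and a replaces g.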

Other methods extend FieldComparator directly. I’ll add small comments for convenience again:

// compare two docs. By the time we get here, the values will be
// the computed average ASCII codes
public int compare(int slot1, int slot2) {
   final BytesRef val1 = values[slot1];
   final BytesRef val2 = values[slot2];
   return compareValues(val1, val2);
}

// keeps the highest value. Useful for searchAfter
public void setTopValue(BytesRef value) {
   topValue = value;
}

// return the value from this slot
public BytesRef value(int slot) {
   return values[slot];
}

// returns a leaf (per-segment) comparator given the context
// i.e. the hierarchical relationship between IndexReader instances
public LeafFieldComparator getLeafComparator(LeafReaderContext context) throws IOException {
   docTerms = getBinaryDocValues(context, fieldName);
   return this;
}

Note this docTerms variable, because it’s our point to access the docValues of each document. It’s what the earlier copy(), compareTop() and compareBottom() use to access the actual values, because they call getValueForDoc(). Which, in turn, looks like this:

private BytesRef getValueForDoc(int doc) throws IOException {
   if (docTerms.advanceExact(doc)) {
       return docTerms.binaryValue();
   } else {
       return null;
   }
}

The problem is that we need to return one value per document – so we can compare them. This is what docTerms does, as an instance of BinaryDocValues. But we have a multivalued field – a SortedSetDocValues. So we need a custom implementation of BinaryDocValues that takes the SortedSetDocValues and returns one value per document. That’s what we do in getBinaryDocValues(), which populates docTerms, as we saw earlier:

protected BinaryDocValues getBinaryDocValues(LeafReaderContext context, String field) throws IOException {
   final SortedSetDocValues multiValues = DocValues.getSortedSet(context.reader(), field);
   return new AvgAsciiValue(multiValues);
}

And since we’re providing one value per document in this custom AvgAsciiValue, why don’t we shoot two birds with one bazooka? We can compute the average ASCII code and return that. Let’s see how.

Custom BinaryDocValues to Turn MultiValued into Single Value

Lucene does this multi-valued-to-single-valued conversion already. A good example is org.apache.lucene.search.SortedSetSelector.MinValue, which returns the minimum value from a multi-valued string field. So we’ll do much of the same. Let’s start with the constructor:

static class AvgAsciiValue extends BinaryDocValues {

   final SortedSetDocValues multiValues;

   //Where we store the average ascii character
   String avgCharacter;

   AvgAsciiValue(SortedSetDocValues multiValues) {
       this.multiValues = multiValues;
   }

Lucene navigates through documents during a query via nextDoc(), advance() and advanceExact(). In our case, each of them calls the respective method of the underlying multiValues field. nextDoc() and advance() then return docID(), which also delegates to the docID() of the multiValues field:

@Override
public int docID() {
   return multiValues.docID();
}

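@Override
public int nextDoc() throws IOException {
   // only compute the average if we actually landed on a document
   if (multiValues.nextDoc() != NO_MORE_DOCS) {
       computeAvgAscii();
   }
   return docID();
}

@Override
public int advance(int target) throws IOException {
   // same guard: there are no ordinals to read once the iterator is exhausted
   if (multiValues.advance(target) != NO_MORE_DOCS) {
       computeAvgAscii();
   }
   return docID();
}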

@Override
public boolean advanceExact(int target) throws IOException {
   if (multiValues.advanceExact(target)) {
       computeAvgAscii();
       return true;
   }
   return false;
}

Here’s where we can add a twist: notice how all of them call computeAvgAscii(). This is where we come up with the average character to sort on. The logic is: once we get to a document, we want to compute avgCharacter, so that we return it when users of this AvgAsciiValue call binaryValue():

@Override
public BytesRef binaryValue() {
   return new BytesRef(avgCharacter);
}

Which finally gets us to computing the average ASCII value. The key here is to keep calling the nextOrd() method of the multiValues field to get the next ordinal. Then, lookupOrd() will give you the actual value as a BytesRef. To convert that to string, you’ll have to call BytesRef.utf8ToString().

private void computeAvgAscii() throws IOException {
   // this is what we compute; default to empty string; could be a missing_value parameter
   avgCharacter = "";

   // we'll concatenate all the values here
   StringBuilder totalString = new StringBuilder();

   // append all values of a multivalued field to it
   long nextOrd = multiValues.nextOrd();
   while (nextOrd != SortedSetDocValues.NO_MORE_ORDS) {
       BytesRef bytesValue = multiValues.lookupOrd(nextOrd);
       totalString.append(bytesValue.utf8ToString());
       nextOrd = multiValues.nextOrd();
   }

   // compute the average code
   // Note: this PoC will only work for small strings, otherwise we exceed int limits
   if (totalString.length() > 0) {
       int sum = 0;
       for (char ch : totalString.toString().toCharArray()) {
           sum += ch;
       }
       char avg = (char) (sum / totalString.length());
       avgCharacter += avg;
   }
}
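One detail worth checking is why a one-character string works as a sort key here. The comparator compares BytesRef values, which hold the UTF-8 bytes of the string and are ordered by unsigned byte comparison. The sketch below mimics that with plain JDK calls, assuming Java 9 or later for Arrays.compareUnsigned. The class and method names (SortKeyBytesSketch, compareAsUtf8) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class SortKeyBytesSketch {

    // Compare two strings the way their UTF-8 byte arrays would compare
    // when each byte is treated as an unsigned value
    static int compareAsUtf8(String a, String b) {
        byte[] left = a.getBytes(StandardCharsets.UTF_8);
        byte[] right = b.getBytes(StandardCharsets.UTF_8);
        return Arrays.compareUnsigned(left, right);
    }

    public static void main(String[] args) {
        // 'b' (98) sorts before 'g' (103), which sorts before 'y' (121)
        System.out.println(compareAsUtf8("b", "g") < 0);
        System.out.println(compareAsUtf8("g", "y") < 0);
        // a non-ASCII average such as U+00E9 still sorts after 'z'
        System.out.println(compareAsUtf8("\u00e9", "z") > 0);
    }
}
```

All three lines print true, so the byte order agrees with the order of the average character codes for these inputs. A numeric representation would be the more robust choice if you need this in production, which is one more reason to prefer the index-time numeric field mentioned at the start.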

To complete the class, we also need to implement a cost() method – which will return the cost of this iterator. We’ll just assume it’s the same cost as the “parent” multiValues field:

@Override
public long cost() {
   return multiValues.cost();
}

And that’s it! If you’ve been following along, the failing test from the beginning should now pass.

Conclusions

First of all, if you read through all this, you’re a hero! If I were you, I would email sales@sematext.com and ask for a loyalty discount on Sematext Cloud for monitoring my Solr and aggregating Solr logs.

Jokes aside (or maybe not?), writing a Solr plugin to sort can be a bit difficult, but it’s very flexible. There are lots of points where you can hook your own logic, depending on what exactly you want to implement. We chose a more complex example – one involving a multi-valued field – precisely so you can see many of those hook-up points.

We hope you enjoyed the read and come back for more search-related posts. And if you need some help now, feel free to reach out.
