What is Search Relevance Similarity
Lucene has a lot of options for configuring similarity. By extension, Solr and Elasticsearch have the same options. Similarity forms the base of your relevance score: how similar is this document (actually, this field in this document) to the query? I’m calling it the base of the score because, on top of it, you can apply per-field boosts, function scoring (e.g. boosting more recent documents) and re-ranking (e.g. Learning to Rank).
The default similarity (BM25 – described below) is a very good start, but you may need to tweak it for your use-case. You’ll likely have different criteria for matching a query with the content field of a book, for example, than for the author field. The importance of things like field length or term frequency could be different.
Before we start, check out two useful cheat sheets to guide you through both Solr and Elasticsearch, boost your productivity and save time when you’re working with either of these two open-source search engines.
Solr Cheat Sheet
Elasticsearch Developer Cheat Sheet
Similarity Options in Solr & Elasticsearch
Whether you’re using Solr or Elasticsearch, you can choose a similarity class/framework and, depending on that choice, some options to influence how scores are calculated. In this post, we’re going to cover all the available similarity classes and their options:
- classic TF-IDF and the newer default BM25
- the Divergence from Randomness framework
- Divergence from Independence models
- Information-based models
- Language models (Dirichlet and Jelinek-Mercer)
Along the way, we will share graphs and intuitions of how each option would influence the score, so you can choose the best combination for your use-case.
Term Frequency * Inverse Document Frequency (TF*IDF)
TF*IDF has been in Lucene since forever, and was the default until BM25 replaced it in version 6. As the name suggests, the score is calculated from multiplying TF with IDF, where:
TF stands for Term Frequency. We’re looking at one term at a time (all similarities are doing this) and, the more often the term appears in our field, the higher the score. Lucene actually takes the square root of the TF: if you query for cat, a document mentioning cat twice is more likely about cats, but maybe not twice as likely as a document with only one occurrence.
Value of TF grows, the more the term occurs in the document
IDF stands for Inverse Document Frequency. Here, Document Frequency (DF) stands for the ratio between the number of documents containing the term and all the documents of your index. For example, if you’re searching for cat AND food in an E-commerce site dedicated to cats, the DF of cat will be very high. The IDF (Inverse DF) will be very low, because cat will carry less interesting information about the documents compared to food which, presumably, has a higher IDF.
Lucene doesn’t take IDF as simply 1/DF. It actually computes 1 + log((docCount+1)/(docFreq+1)). The idea is to tame the curve a little bit as docFreq (the number of documents containing the term) increases:
IDF value drops for more frequent terms (here with 10K docs in the index)
Lucene also normalizes scores by document length. The longer the document, the more likely it is to contain that term by pure chance, so it will get a lower score (the result is score/sqrt(docLength)). Here, docLength is the (approximate) number of terms in the document.
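To make the three pieces concrete, here is a minimal sketch of the classic per-term score in Java, following the simplified formulas above (boosts and query normalization are left out, and the example numbers are made up):

```java
// A sketch of the classic TF-IDF building blocks described above
// (simplified: per-term score only, no boosts or query normalization).
public class ClassicTfIdfSketch {
    static double tf(double termFreq) {
        return Math.sqrt(termFreq);                                   // square root tames raw counts
    }
    static double idf(long docCount, long docFreq) {
        return 1 + Math.log((docCount + 1.0) / (docFreq + 1.0));      // tamed 1/DF curve
    }
    static double lengthNorm(long docLength) {
        return 1.0 / Math.sqrt(docLength);                            // longer fields score lower
    }
    public static void main(String[] args) {
        // e.g. "cat" appears twice in a 100-term field; 1K of 10K docs contain it
        System.out.println(tf(2) * idf(10_000, 1_000) * lengthNorm(100));
    }
}
```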
BM25 – the New Default Search Ranking
Where BM stands for Best Matching 🙂 You can think of Okapi BM25 as an upgrade of TF-IDF, and you’ll find a detailed comparison between the two here. The main difference is the introduction of term frequency saturation: if you’re searching for cat, the scores of documents with more cats will get asymptotically closer to a configurable ceiling, as opposed to growing indefinitely. More cats will still mean a higher score, but never a huge one – which is usually a good thing:
The TF part of BM25 grows asymptotically to 1+k1. k1 defaults to 1.2
There are two configurable values here: k1 and b. A higher k1 implies a higher ceiling, but it also makes document length normalization more aggressive (longer documents are penalized more). Speaking of document length, there are two changes compared to TF-IDF:
- Length doesn’t get factored into the score directly. Instead, we use the ratio between the document length and the average length of all documents in the index.
- Length normalization is controlled by the b parameter. A higher b (the default is 0.75) makes length matter more.
Last but not least, the score is multiplied by IDF. Conceptually, it’s the same IDF you had in TF-IDF, but it’s computed as the [natural] logarithm of (docCount - docFreq + 0.5)/(docFreq + 0.5), which makes it more aggressive in lowering the score for higher-frequency terms than the IDF in TF-IDF.
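Here is how those pieces fit together, as a rough sketch of the standard BM25 formula described above (not Lucene’s exact code; the example numbers are made up):

```java
// A sketch of BM25 as described above: saturating TF, length normalization via b,
// and a more aggressive IDF.
public class Bm25Sketch {
    static double score(double tf, long docLength, double avgDocLength,
                        long docCount, long docFreq, double k1, double b) {
        // term frequency saturation: this part grows asymptotically towards 1 + k1
        double lengthNorm = k1 * (1 - b + b * docLength / avgDocLength);
        double tfPart = (tf * (k1 + 1)) / (tf + lengthNorm);
        // IDF as described above; Lucene actually wraps the ratio in log(1 + ...) so it never goes negative
        double idf = Math.log((docCount - docFreq + 0.5) / (docFreq + 0.5));
        return tfPart * idf;
    }
    public static void main(String[] args) {
        // defaults k1=1.2, b=0.75; "cat" occurs 3 times in a 100-term field, average length 120
        System.out.println(score(3, 100, 120, 10_000, 1_000, 1.2, 0.75));
    }
}
```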
Divergence from Randomness (DFR) Framework
Divergence from Randomness (original paper here) isn’t just one model, like TF-IDF or BM25, but a framework including multiple models and normalization techniques. They all share the same principle: the term may occur in a document randomly, following a certain distribution. The more our document diverges from our configured random distribution, the higher the score. We will also normalize this score in a configurable way, depending on things like document length.
Each DFR configuration is made out of three components:
- The base model, which defines what our “random” distribution looks like.
- An after-effect, which normalizes the score of our base model based on term frequency.
- A normalization, which adjusts the term frequency used by the after-effect based on document length.
Base search relevance models
As of version 8, Lucene implements four models from the original paper: G, I(f), I(n) and I(ne).
Let’s take them one at a time:
- G stands for “Geometric distribution limiting a Bose-Einstein distribution”. It’s a mouthful, I know, but the intuition here is that the more frequent the term is in the whole index, the lower the score, and the drop is exponential. Note that we take the total count of the term into account (totalTermFreq), not the number of documents containing the term (like the docFreq we used in IDF previously).
- IN stands for Inverse(N), where N is document frequency. If this sounds familiar, it’s the IDF we know and love – almost the same as in TF-IDF.
- IF stands for Inverse(Frequency). It’s like IN, but we take the total frequency of the term in the index (like model G does), instead of the number of documents matching the term.
- INE stands for Inverse(Expected document frequency), where the expectation follows a Poisson distribution. Think of it like IF, where we took the total frequency of the term in the index, except that we scale it from 0 to the number of documents in the index with the following formula: docCount * (1 - ((docCount - 1)/docCount)^totalTermFreq)
In the end, INE will produce similar scores to IN: IN is an approximation of INE – assuming document frequency is a good indicator of term frequency. In both, more frequent terms in the index will produce lower scores in a similar way to the IDF from TF-IDF.
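To get a feel for I(ne), here is a small sketch of the expected document frequency formula quoted above (the numbers are made up):

```java
// A sketch of how model I(ne) estimates an "expected" document frequency
// from the total number of occurrences of the term.
public class IneExpectedDocFreqSketch {
    static double expectedDocFreq(long docCount, long totalTermFreq) {
        // scales totalTermFreq into the 0..docCount range, Poisson-style
        return docCount * (1 - Math.pow((docCount - 1.0) / docCount, totalTermFreq));
    }
    public static void main(String[] args) {
        // 10K docs; the term occurs 1500 times overall
        // result is a bit below 1500, since some occurrences land in the same document
        System.out.println(expectedDocFreq(10_000, 1_500));
    }
}
```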
After-effects
There are two of them: L and B.
L stands for Laplace’s rule of succession. The score function of the class actually returns 1.0 (i.e. it doesn’t modify the score). That’s because the equation from the paper is 1/(1+tfn), where tfn is the normalized term frequency (see the normalization rules below), and this fraction of 1/(1+tfn) is incorporated in various ways in the base model. For example, all models but G return this:
return A * aeTimes1pTfn * (1 - 1 / (1 + tfn));
Where A is the function we talked about (typically an IDF variant), aeTimes1pTfn is what the after-effect class returns (1.0 for L), and the rest was, in the original paper, supposed to be part of the after-effect. Note how Lucene subtracts 1/(1+tfn) from 1, to ensure that the higher the term frequency, the higher the score.
B stands for Bernoulli. Namely, we’re looking at the ratio of two Bernoulli processes: the total term frequency in the index on one hand, and the document frequency on the other. Our aeTimes1pTfn becomes (1 + totalTermFreq)/docFreq.
The intuition for B is that if our term tends to appear, on average, more often in the matching documents, it is more informative, because it’s less likely to appear in those documents by accident.
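Here is a small sketch of how the two after-effects plug into the combination above (A and tfn are made-up placeholder values, standing in for a base model and a normalization):

```java
// A sketch of the after-effects, following the Lucene 8 snippet quoted above.
// A is the base model's informative content, tfn the normalized term frequency.
public class DfrAfterEffectSketch {
    static double afterEffectL() {
        return 1.0;                              // Laplace: leaves the score alone
    }
    static double afterEffectB(long totalTermFreq, long docFreq) {
        return (1.0 + totalTermFreq) / docFreq;  // Bernoulli ratio
    }
    static double combine(double A, double aeTimes1pTfn, double tfn) {
        // higher tfn brings (1 - 1/(1+tfn)) closer to 1, so the score goes up
        return A * aeTimes1pTfn * (1 - 1 / (1 + tfn));
    }
    public static void main(String[] args) {
        double A = 4.2;    // placeholder informative content from the base model
        double tfn = 2.5;  // placeholder normalized term frequency
        System.out.println(combine(A, afterEffectL(), tfn));
        System.out.println(combine(A, afterEffectB(1_500, 1_000), tfn));
    }
}
```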
Length-based normalization
This is where tfn comes from. We have four normalization techniques, with similarly creative names: H1, H2, H3 and Z. Wait, no, there is one descriptive name: Normalization.NoNormalization, which simply returns the raw term frequency.
Back to the funky names:
- H1 assumes a uniform distribution of frequencies. It returns the term frequency (tf) multiplied by avgFieldLength/docLength, where avgFieldLength is the average length of this field across the index and docLength is the field length of this document. You can multiply all this by a provided c, for extra control – though this wasn’t specified in the original paper.
- H2: here, length becomes less important, as we multiply tf by log2(1 + c*avgFieldLength/docLength).
- Z is yet another variation based on avgFieldLength/docLength, after the Pareto-Zipf law. Here we multiply tf by (avgFieldLength/docLength)^z. z must be in the (0..0.5) range, meaning the score still decays less than linearly with length, like in H2.
- H3 is completely different 🙂 It’s based on Dirichlet priors and, like the others, comes with a parameter (mu). The formula (simplified) is (tf + mu*totalTermFreq/numberOfFieldTokens)/(docLength + mu), meaning that a higher mu will make both the term frequency and the document length matter less. All four are sketched in code below.
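Here is a sketch of all four normalizations, using the simplified formulas above (example numbers are made up):

```java
// A sketch of the length-based normalizations described above.
// tf is the raw term frequency in this document's field.
public class DfrNormalizationSketch {
    static double h1(double tf, double avgFieldLength, double docLength, double c) {
        return tf * c * avgFieldLength / docLength;                    // linear in the length ratio
    }
    static double h2(double tf, double avgFieldLength, double docLength, double c) {
        return tf * Math.log(1 + c * avgFieldLength / docLength) / Math.log(2); // log2: length matters less
    }
    static double z(double tf, double avgFieldLength, double docLength, double zParam) {
        return tf * Math.pow(avgFieldLength / docLength, zParam);      // zParam in (0..0.5)
    }
    static double h3(double tf, double mu, long totalTermFreq,
                     long numberOfFieldTokens, double docLength) {
        // Dirichlet-prior flavor: a higher mu makes both tf and length matter less
        return (tf + mu * (double) totalTermFreq / numberOfFieldTokens) / (docLength + mu);
    }
    public static void main(String[] args) {
        // the term appears 3 times in a 200-term field; the average field length is 100
        System.out.println(h1(3, 100, 200, 1));
        System.out.println(h2(3, 100, 200, 1));
        System.out.println(z(3, 100, 200, 0.3));
        System.out.println(h3(3, 800, 1_500, 1_000_000, 200));
    }
}
```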
Key Takeaways
At this point, if you’re complaining that there’s too much choice, I can’t blame you. But in the end, it comes down to which metrics suit your dataset. For example, how much do you think you should penalize the score of longer documents? The original DFR paper shows results comparable to BM25 on various datasets, with different model+after-effect+normalization combinations coming quite close to each other. That said, INE often came out on top, combined with the H2 normalization (note that Z and H3 weren’t in the original paper). After-effect L worked well in general, but B outperformed it when there wasn’t too much variance in term frequency.
Divergence from Independence (DFI)
The Divergence from Independence model starts from a simple assumption: if a term occurs in documents by pure chance, then the variable of document IDs and that of term frequencies are statistically independent. If a term occurs in a document more times than expected, we can invalidate the assumption of independence with a certain probability. A higher probability will produce a higher score, telling us that the term is “more dependent” on the document. If the term occurs in a document no more times than expected, we’ll just give it a score of 0 (Lucene doesn’t accept negative scores).
Expected term frequency
Now that we’ve explained the principle, let’s dive into the details: how do we compute the “expected” number of occurrences, and how do we compute the probability that the term and the document aren’t independent?
The expected number is docLength*totalTermFrequency/numberOfFieldTokens, where docLength is the number of terms in the field of our document, totalTermFrequency is the total number of occurrences of the search term in the index and numberOfFieldTokens is the total number of terms in the index for that field. Intuitively, totalTermFrequency/numberOfFieldTokens is the probability of any term being the search term. If we multiply that by the number of terms in our document (i.e. its length), we get the number of times the search term should occur in the document if the independence assumption holds.
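As a quick sketch (with made-up numbers):

```java
// A sketch of the "expected" term frequency under the independence assumption.
public class DfiExpectedFrequencySketch {
    static double expectedFrequency(long docLength, long totalTermFreq, long numberOfFieldTokens) {
        // probability of any token being our term, times the number of tokens in this document
        return docLength * (double) totalTermFreq / numberOfFieldTokens;
    }
    public static void main(String[] args) {
        // a 100-term field, in an index where the term occurs 1500 times out of 1M tokens
        System.out.println(expectedFrequency(100, 1_500, 1_000_000)); // 0.15
    }
}
```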
Independence measures
With the number of expected terms, we can go ahead and compute the similarity score. We have three types of independence tests to choose from:
- standardized. This is computed as (actualFrequency - expectedFrequency)/sqrt(expectedFrequency). The original paper describes it as good at tasks that require both high recall and high precision, especially against short queries composed of a few words, as in the case of Internet searches. The intuition here is that the score grows linearly with frequency, but growth is more aggressive for rare words (and/or short documents).
Score for expectedFrequency=0.1 (purple), 2 (blue) and 5 (green)
- saturated. This is simply (actualFrequency - expectedFrequency)/expectedFrequency, and is described as working well for tasks that require high recall against long queries. The intuition here is that growth is more aggressive for rare words, so documents containing them should bubble up to your top N.
Same values for expectedFrequency, but scores diverge more
- chi-squared. This is (actualFrequency - expectedFrequency)^2/expectedFrequency, which is the standardized formula squared, meaning the score grows exponentially with frequency, even for frequent terms. That is why the original paper says it can be used for tasks that require high precision, against both short and long queries. All three measures are sketched in code below.
Chi-squared scores grow very abruptly, even for frequent terms
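Here is a sketch of the three measures applied to such an expected frequency (made-up numbers again; per the DFI description above, anything at or below the expected frequency would simply score 0):

```java
// The three independence measures described above, as a sketch.
public class DfiMeasuresSketch {
    static double standardized(double actual, double expected) {
        return (actual - expected) / Math.sqrt(expected);
    }
    static double saturated(double actual, double expected) {
        return (actual - expected) / expected;
    }
    static double chiSquared(double actual, double expected) {
        return Math.pow(actual - expected, 2) / expected;   // standardized, squared
    }
    public static void main(String[] args) {
        double expected = 0.15;  // from the previous sketch
        double actual = 3;       // the term really occurs 3 times in the document
        System.out.println(standardized(actual, expected));
        System.out.println(saturated(actual, expected));
        System.out.println(chiSquared(actual, expected));
    }
}
```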
Key Takeaways
No matter the model, the DFI paper recommends keeping stop-words. That’s because removing stop-words will pretty much arbitrarily change document length, which is a critical part of how the score is calculated.
DFI models are said to give results comparable to BM25 and DFR models, but should be less sensitive to changes in the distribution of terms. In other words, DFI needs less tuning to work on diverse datasets. On the other hand, as the DFI paper suggests, if your terms typically follow a Poisson distribution, for example, a DFR model based on Poisson (here, INE) should outperform DFI.
Information-Based (IB) Models
Information-Based models are similar to DFR in how they work, although the philosophy is slightly different: how much information gain do we get from this term? The implementation is so similar, in fact, that the Lucene documentation still mentions that the DFR and IB frameworks might get merged.
Like DFR, IB comes with different base distributions for term frequency: log-logistic (LL) and smoothed power law (SPL). Smoothed means we take the natural logarithm, so even though “power law” sounds impressive, the curve is typically smoother than with log-logistic, as the graph from the paper suggests.
Like in DFR, these probability distributions don’t simply work with term frequencies. In Lucene, these term frequencies are normalized with the same classes that DFR supports.
Retrieval function (lambda)
Besides the probability distribution and the normalized term frequency, we’ll want to consider “the average of the document information brought by each query term”. If this sounds familiar, you might be thinking about the after-effects of the DFR framework 🙂
We’ll call this average information lambda, and compute it in one of two ways:
- document frequency (DF). This is simply docFreq/docCount, not smoothed like in TF-IDF or BM25, where docFreq is the number of documents containing the term and docCount is the total number of documents in the index.
- total term frequency (TTF). This is totalTermFreq/docCount. The difference from DF is that we care about the total occurrences of the term in the index, not just the number of matching documents.
Compared to DF, TTF will generate higher lambda values. This typically means that the [normalized] term frequency will matter less in comparison to how frequent the word is in the overall index.
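As a quick sketch of both options (made-up numbers):

```java
// A sketch of the two lambda (average information) options described above.
public class IbLambdaSketch {
    static double lambdaDF(long docFreq, long docCount) {
        return (double) docFreq / docCount;        // fraction of documents containing the term
    }
    static double lambdaTTF(long totalTermFreq, long docCount) {
        return (double) totalTermFreq / docCount;  // average occurrences per document
    }
    public static void main(String[] args) {
        // 1K of 10K documents match; the term occurs 1500 times in total
        System.out.println(lambdaDF(1_000, 10_000));   // 0.1
        System.out.println(lambdaTTF(1_500, 10_000));  // 0.15, higher than DF
    }
}
```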
Probability distributions
Let’s take a closer look at what they do:
LL (log-logistic) effectively computes the score as the natural logarithm of tfn/lambda + 1. As in DFR, tfn stands for the normalized term frequency, and you’d use the same normalization classes. Because lambda is higher for more popular terms, the effect of tfn on their score is going to be lower:
LL score with lambda=0.1 (red), 0.3 (black) and 0.8 (blue)
SPL (smoothed power law) also takes a natural logarithm, but this time of (lambda - 1)/(lambda - lambda^(1 - 1/(tfn+1))). Once again, a higher tfn will give a higher score, because the power factor gets closer to 1. However, as the image from the paper suggests, scores will tend to diverge less between frequent and rare terms:
SPL score with lambda=0.1 (red), 0.3 (black) and 0.8 (blue)
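Here is a sketch of both distributions, using the formulas above (the tfn and lambda values are made up, mirroring the graphs):

```java
// A sketch of the two IB distributions described above.
public class IbDistributionSketch {
    static double logLogistic(double tfn, double lambda) {
        return Math.log(tfn / lambda + 1);
    }
    static double smoothedPowerLaw(double tfn, double lambda) {
        return Math.log((lambda - 1) / (lambda - Math.pow(lambda, 1 - 1 / (tfn + 1))));
    }
    public static void main(String[] args) {
        double tfn = 2.5;  // normalized term frequency (same normalization classes as DFR)
        for (double lambda : new double[] {0.1, 0.3, 0.8}) {
            System.out.println(logLogistic(tfn, lambda) + " vs " + smoothedPowerLaw(tfn, lambda));
        }
    }
}
```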
Key Takeaways
From the experiments in the original paper, both IB models tend to outperform the selected DFR and BM25 configurations, with LL usually ranking better than SPL. It also ranked better than the configured language model similarities (see below). The IB framework is also one of the newest frameworks (2010) among those described here.
Language Models (LM)
Language models revolve around the idea of smoothing scores based on unseen words (i.e. document length). This is something all similarity frameworks do, but the two language models implemented here (Dirichlet and Jelinek-Mercer) come at it from yet another angle.
The central point is the probability that a term in a document is the search term. If this sounds familiar, it’s somewhat similar to how DFI starts with the probability of a term coming up in a document. Here, we factor in the document length later; initially we only care about the likelihood of any term being our term: totalTermFreq/totalFieldTokens. totalTermFreq is the total number of occurrences of the term we’re looking for in the index, and totalFieldTokens is the total number of occurrences of any term in the index (for that field). We’ll call this proportion probability, and we’ll use it in one of the two smoothing methods.
Dirichlet
This model does Bayesian smoothing using Dirichlet priors. In other words, how does the actual term frequency stack up against what a Dirichlet distribution would consider “normal”? This is a technique we’ve seen in the H3 normalization of DFR/IB.
Like in H3, we have a configurable parameter (mu) which controls smoothing: higher mu values will make scores vary less.
The gist of the formula is log(normalizedTermFrequency/normalizedDocLength). Here, normalizedTermFrequency would be computed as mu + totalFieldTokens * tf/totalTermFreq, while normalizedDocLength would simply be mu + docLength. From this we can derive two main intuitions:
- The score grows with the ratio tf/totalTermFreq. If we search for “foo”, the more times “foo” occurs in this document compared to the whole index, the higher the score. At the same term frequency, rare terms will rank logarithmically higher than common words – roughly logarithmically, at least, because mu normalizes the score even further.
- The score drops for longer documents. That said, if the field has more terms overall (totalFieldTokens – proportional to the average document length, if the field isn’t sparse), the score will be a bit higher. Both intuitions show up in the short sketch below.
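Here is the simplified Dirichlet formula from above as a small sketch (example numbers are made up):

```java
// A sketch of the simplified LM Dirichlet score described above.
public class LmDirichletSketch {
    static double score(double tf, long docLength, long totalTermFreq,
                        long totalFieldTokens, double mu) {
        double normalizedTermFrequency = mu + totalFieldTokens * tf / totalTermFreq;
        double normalizedDocLength = mu + docLength;  // a higher mu smooths both parts
        return Math.log(normalizedTermFrequency / normalizedDocLength);
    }
    public static void main(String[] args) {
        // default mu=2000; the term occurs 3 times in a 100-term field
        // and 35 times in an index with 10K tokens for this field
        System.out.println(score(3, 100, 35, 10_000, 2_000));
    }
}
```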
Jelinek-Mercer Smoothing
The gist of the Jelinek-Mercer smoothing method is that we divide the document model (tf/docLength) by that of the whole index (probability). We use a lambda in the (0..1] range to control the weight of each: the document model gets multiplied by 1-lambda, while the index model gets multiplied by lambda.
If we unwrap the formula, we get the [natural] logarithm of:
1 + (1/lambda - 1) * (tf/totalTermFreq) * (totalFieldTokens/docLength)
Like with Dirichlet smoothing, the score increases with the proportion of our search term in the document relative to the overall index. Similarly, as lambda increases, scores diverge less from each other.
The main difference from Dirichlet smoothing is that in Jelinek-Mercer, lambda multiplies our base factors, while in Dirichlet, mu is added to them.
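And here is the unwrapped Jelinek-Mercer formula as a sketch, with the same made-up example data as the Dirichlet one:

```java
// A sketch of the unwrapped LM Jelinek-Mercer score described above.
public class LmJelinekMercerSketch {
    static double score(double tf, long docLength, long totalTermFreq,
                        long totalFieldTokens, double lambda) {
        // lambda multiplies the two models instead of being added to them
        return Math.log(1 + (1 / lambda - 1)
                * (tf / (double) totalTermFreq)
                * (totalFieldTokens / (double) docLength));
    }
    public static void main(String[] args) {
        // default lambda=0.1; 3 occurrences in a 100-term field, 35 in 10K field tokens
        System.out.println(score(3, 100, 35, 10_000, 0.1));
    }
}
```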
Key Takeaways
Because of this difference (adding vs multiplying), scores tend to vary more in the JM language model than in the Dirichlet one, especially when it comes to document length. Here’s a graph of scores with the default configurations (lambda=0.1 and mu=2000) on a field with 10K terms, for a term occurring 35 times in the index, given docLength=100:
JM (black) scores grow more with term frequency, compared to Dirichlet (red)
As a result, and as shown by experiments in the original paper, Dirichlet tends to perform better for short (title) queries, while Jelinek-Mercer is slightly better for longer queries.
Final words
First of all, congratulations for reading all the way 🙂
Formulas used here
Before jumping to conclusions, let’s talk a bit about the math: the formulas shown here are transformed from what’s in the code, in ways that don’t change the output but highlight the intuitions better. Although we’ve tried hard to iron out any mistakes or inconsistencies, please let us know in the comments if you see anything dubious.
What’s in the code isn’t always exactly what’s in the papers, either. For example, Lucene’s DFI doesn’t allow for negative scores (Lucene never allows negative scores), while the algorithm from the original paper may produce them. Finally, within Lucene, algorithms change between versions. For example, LUCENE-8015 changed a lot of the DFR code for Lucene 8, because scores could drop as term frequency increased.
If you want to test out these formulas for yourself, I’d highly recommend the Desmos Graphing Calculator. Not only does it have a lot of functionality, but it allows you to slide a certain “constant” and see how the graph changes with it. This may help you tune those mu and lambda values.
Which similarity is the best?
As you might expect, there are no hard and clear guidelines on which similarity is better. There are some rules of thumb, though. For example:
- BM25 typically performs better than Classic similarity
- LMDirichlet typically performs better than LMJelinekMercer, especially for short queries
- DFI is easy to set up and to understand, but usually doesn’t have the ultimate retrieval performance of others: there’s not much to tune
- DFR and IB models can perform really well if you tune them for your use-case
A good approach is to figure out which factors are important for your use-case (and in which proportions), then try out the model(s) that seem to fit best. For example, if you want to penalize longer fields more – like in the classic case of “phone accessories” showing up first when searching for “phone” – then you may pick something on top of Normalization H1, which normalizes term frequency with avgFieldLength/docLength and is pretty aggressive.
Let’s also assume that you want to factor in the probability of the field containing the term in the first place. Then, you might pick the DF lambda (where lambda=docFreq/docCount) from the IB models. Finally, to make sure scores don’t get too high for high frequencies (or short fields, given our aggressive normalization), we’d pick the smoothed power law (SPL) over the log-logistic distribution (LL) as our base probability distribution.
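Here is a minimal sketch of wiring that combination directly at the Lucene level, assuming the Lucene 8.x similarity classes; Solr and Elasticsearch expose the same options through the schema and the index settings, respectively:

```java
// A sketch of composing an IB similarity: SPL distribution, DF lambda,
// and the aggressive H1 length normalization discussed above.
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.DistributionSPL;
import org.apache.lucene.search.similarities.IBSimilarity;
import org.apache.lucene.search.similarities.LambdaDF;
import org.apache.lucene.search.similarities.NormalizationH1;

public class IbSimilaritySetup {
    public static void configure(IndexSearcher searcher) {
        searcher.setSimilarity(new IBSimilarity(
                new DistributionSPL(), new LambdaDF(), new NormalizationH1()));
    }
}
```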
Custom similarities
What if none of these similarities fits your use-case? For example, many product search scenarios don’t care about term frequency: an item containing the term “SSD” twice isn’t “more” SSD. We might still care about other factors, such as DF: in a query for “Intel SSD”, the terms “Intel” and “SSD” may carry different weights.
In this particular example, you could get away with disabling term frequencies at the Lucene level. If frequencies aren’t indexed, tf will always be 1. However, by disabling frequencies you implicitly disable positions, so you lose phrase queries. If you need phrases, you’ll have to implement your own similarity, either by writing/extending a similarity class or by using some sort of script (Elasticsearch, for example, allows for scripted similarity). In such cases, even if I don’t use the out-of-the-box similarities, I find it useful to know what’s already there, to use as inspiration.
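For the Lucene-level option, here is a minimal sketch of a docs-only field, assuming Lucene’s FieldType and IndexOptions API (the field name is just an example):

```java
// A sketch of disabling term frequencies (and, implicitly, positions) for a field,
// so tf is effectively always 1 for it.
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexOptions;

public class DocsOnlyField {
    public static Field titleField(String value) {
        FieldType type = new FieldType(TextField.TYPE_NOT_STORED);
        type.setIndexOptions(IndexOptions.DOCS);  // index doc IDs only: no freqs, no positions
        type.freeze();
        return new Field("title", value, type);   // phrase queries on this field won't work anymore
    }
}
```

In Elasticsearch, the equivalent is setting index_options to docs in the field mapping; in Solr, it’s omitTermFreqAndPositions on the field.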
Next steps
Whether you use a default or a custom-built similarity, you’ll need a way to test the quality of results in some sort of test suite. In essence, you’d define relevance judgements (queries and expected results) and measure how close your actual results are to the expected ones. There are quite a few tools out there, both open-source and commercial, that help you with this.
Want to learn more about similarities? We were at ACTIVATE Conference 2019 and gave a talk on exactly this topic: tweaking the base score with Lucene/Solr similarities. Lots of comments and questions were answered! Curious to check out our presentation? You can find it below.
[slideshare id=171191134&doc=activate2019-tweakingthebasescorelucenesolrsimilaritiesexplained-190912133448&w=750&h=450]
If you’re using Elasticsearch, the Rank Evaluation API is a great built-in tool. It allows you to use search templates, too, including stored templates: just specify the template id instead of inline.
If you find this stuff exciting, please join us: we’re hiring worldwide. If you need relevancy tuning, or any other help with your search infrastructure, please reach out, because we provide:
- Consulting for Solr, Elasticsearch and Elastic Stack. Including targeted search relevance projects
- Production support for Solr, Elasticsearch and Elastic Stack
- Solr, Elasticsearch and Elastic Stack training classes (on site and remote, public and private)
- Monitoring, log centralization and tracing for not only Solr and Elasticsearch, but for other applications (e.g. Kafka, Zookeeper…), infrastructure and containers
Good luck on improving your search relevance! Before you go, don’t forget to download your Solr or Elasticsearch Cheat Sheet:
Solr Cheat Sheet
Elasticsearch Developer Cheat Sheet