Subject: potential accuracy degradation due to approximation of document length in BM25 (and other similarities)


Hi everybody,

Some time ago, I had to re-implement some Lucene similarities (in
particular BM25 and the older cosine). I noticed that the re-implemented
version (despite using the same formula) performed better on my data set.
The main difference was that my version *did not approximate* document
length.
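
For context: classic (pre-8.0) Lucene stores the length norm as a single
byte, so only 256 distinct values are representable and nearby document
lengths collide. A minimal sketch of the lossiness, assuming a classic
Lucene core jar on the classpath (class and variable names are mine):

  import org.apache.lucene.util.SmallFloat;

  // Shows how the one-byte "315" norm encoding (3 mantissa bits,
  // 5 exponent bits) collapses distinct document lengths.
  public class NormLossDemo {
    public static void main(String[] args) {
      for (int len : new int[] {100, 110, 120, 127, 128, 200}) {
        // Index time: 1/sqrt(fieldLength) is squeezed into one byte.
        byte code = SmallFloat.floatToByte315((float) (1.0 / Math.sqrt(len)));
        // Search time: the byte is decoded back; precision is lost.
        float decoded = SmallFloat.byte315ToFloat(code);
        float approxLen = 1.0f / (decoded * decoded);
        System.out.printf("exact length = %3d, code = %4d, decoded length ~ %.1f%n",
                          len, code, approxLen);
      }
    }
  }

With only 256 representable values, many different exact lengths decode
to the same approximate length, which is exactly the approximation at
issue here.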

Recently, I implemented a modification of the current Lucene BM25 that
doesn't use this approximation either. I compared the existing and the
modified similarities (again on some of my quirky data sets). The results
are as follows:

1) The modified Lucene BM25 similarity is, indeed, a tad slower (3-5% in my
tests).
2) The modified Lucene BM25 is also more accurate (a sketch of the
exact-length formula follows this list).
(I see no good reason why memoization plus document-length approximation
should yield any efficiency gain at all, but this is what seems to happen
on current hardware.)
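
For concreteness, the per-term score both versions compute is the standard
BM25 contribution below; the only difference is whether docLen is the exact
field length or the value decoded from the one-byte norm (the current
implementation memoizes the 256 possible length-norm values in a lookup
table). A standalone sketch, not the actual patch (names are mine):

  // One term's BM25 contribution; k1 and b are the usual parameters.
  // In the modified similarity, docLen is exact; in stock Lucene it is
  // the lossy one-byte approximation.
  static double bm25TermScore(double tf, double idf,
                              double docLen, double avgDocLen,
                              double k1, double b) {
    double lengthNorm = k1 * (1 - b + b * docLen / avgDocLen);
    return idf * tf * (k1 + 1) / (tf + lengthNorm);
  }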

If this potential accuracy degradation concerns you, additional experiments
could be run on more standard collections (e.g., some TREC collections).

In any case, the reproducible example (which also links to a more detailed
explanation) is in my repo:
https://github.com/searchivarius/AccurateLuceneBM25

Many thanks!

---
Leo