On Wed, Nov 22, 2017 at 5:28 AM, Nick Wellnhofer <[EMAIL PROTECTED]> wrote:
> On 21/11/2017 18:42, [EMAIL PROTECTED] wrote:
>> 2- (same question but for multiple indexes and polysearcher) If I use
>> polysearcher with 2 or more indexes, will the tf/idf scores be consistent?
>> Or would they be calculated separately for each index?
> I don't know off top of my head. It's possible that indexes are searched
> separately and the results are simply merged by normalized score. I'd have
> to look at the code to answer the question, but maybe Marvin can chime in.
The scores will be consistent.
To calculate IDF for a term accurately across a composite corpus
formed from multiple indexes, you need to know two things:
1. The total number of documents in the corpus. (Doc_Max())
2. The total number of documents which contain the term. (Doc_Freq(field, term))
Both PolySearcher and ClusterSearcher calculate their doc_max on
construction by summing the doc_max totals of all subsearchers.
Similarly, both calculate Doc_Freq for a term by summing Doc_Freq
responses for all subsearchers.
https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L69
https://github.com/apache/lucy/blob/rel/v0.6.1/core/Lucy/Search/PolySearcher.c#L119
https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L73
https://github.com/apache/lucy/blob/rel/v0.6.1/perl/lib/LucyX/Remote/ClusterSearcher.pm#L348
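To make the summing concrete, here is a rough Python sketch of the idea. It is not Lucy's code: ShardStats, AggregateStats and idf() are hypothetical names invented for illustration, and the IDF formula is a common Lucene-style one that I am assuming here, so check the Similarity class for the formula Lucy really uses.

```python
import math


class ShardStats:
    """Hypothetical stand-in for the statistics of one subsearcher."""

    def __init__(self, doc_max, doc_freqs):
        self.doc_max = doc_max
        # Maps (field, term) to the number of docs containing the term.
        self.doc_freqs = doc_freqs

    def doc_freq(self, field, term):
        return self.doc_freqs.get((field, term), 0)


class AggregateStats:
    """Hypothetical composite that sums statistics over all shards."""

    def __init__(self, shards):
        self.shards = list(shards)
        # doc_max is summed once, at construction time.
        self.doc_max = sum(s.doc_max for s in self.shards)

    def doc_freq(self, field, term):
        # doc_freq is summed per request, one lookup per shard.
        return sum(s.doc_freq(field, term) for s in self.shards)


def idf(doc_freq, doc_max):
    # Assumed Lucene-style formula, for illustration only.
    return 1.0 + math.log(doc_max / (1.0 + doc_freq))


shards = [
    ShardStats(1000, {("content", "iphone"): 2}),
    ShardStats(1000, {("content", "iphone"): 400}),
]
agg = AggregateStats(shards)
corpus_idf = idf(agg.doc_freq("content", "iphone"), agg.doc_max)
```

Every shard then scores the term with the same corpus_idf value, which is what keeps scores comparable when the hits are merged.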
This approach trades away some performance for the sake of accuracy,
particularly with Doc_Freq -- query normalization takes longer when
you have to wait for a lot of subsearchers to report Doc_Freq numbers
for N terms. However, the alternative is occasional bizarre search
results.
The best anecdote I ever heard illustrating why it's important to
calculate aggregate IDF consistently was an application searching a
multi-shard index containing news articles split by year. If you
searched for "iphone", it would be a very common term after the first
release of the Apple iPhone. However, in the years prior to the Apple
iPhone's release, if "iphone" existed in a shard it was likely a typo,
so it would be very rare **and thus heavily weighted**. So the top hit
for "iphone", without consistent IDF calculation, would be a typo'd
article.
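The effect is easy to reproduce with made-up numbers. In the sketch below, the shard sizes and term counts are invented for illustration, and idf() is again an assumed Lucene-style formula rather than anything taken from the Lucy source.

```python
import math


def idf(doc_freq, doc_max):
    # Assumed Lucene-style formula, for illustration only.
    return 1.0 + math.log(doc_max / (1.0 + doc_freq))


# Invented figures: shard name -> (doc_max, doc_freq for "iphone").
shards = {
    "2005": (100_000, 3),       # before the release: only typos
    "2010": (100_000, 20_000),  # after the release: a common term
}

# Per-shard IDF: each shard weights the term by its own statistics.
local_idf = {name: idf(df, n) for name, (n, df) in shards.items()}

# Aggregate IDF: one weight derived from the summed statistics.
total_docs = sum(n for n, _ in shards.values())
total_freq = sum(df for _, df in shards.values())
global_idf = idf(total_freq, total_docs)

for name, value in local_idf.items():
    print(f"shard {name}: local idf = {value:.2f}")
print(f"all shards: global idf = {global_idf:.2f}")
```

With per-shard IDF, the 2005 shard weights "iphone" several times more heavily than the 2010 shard does, so a typo from 2005 can outrank a genuine iPhone article once the hits are merged. With the aggregate value, both shards apply the same weight.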
(A performance improvement on this stratagem is to create a shared
Doc_Freq source. So long as it contains all the common terms across
all shards, it doesn't have to be updated often -- Doc_Freq values
don't change very fast as indexes are updated.)
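One possible shape for such a shared source, sketched in Python. SharedDocFreqSource and its fallback callback are hypothetical, not part of Lucy: the snapshot holds precomputed aggregate counts for common terms, and anything missing from it falls back to asking every shard.

```python
class SharedDocFreqSource:
    """Hypothetical shared doc_freq lookup with a fan-out fallback."""

    def __init__(self, snapshot, fallback):
        # snapshot: {(field, term): aggregate doc_freq}, rebuilt rarely.
        self.snapshot = dict(snapshot)
        # fallback: callable(field, term) that sums over all shards.
        self.fallback = fallback
        self.fallback_calls = 0

    def doc_freq(self, field, term):
        key = (field, term)
        if key in self.snapshot:
            # Common term: answered locally, no round trips to shards.
            return self.snapshot[key]
        # Rare term: pay for the fan-out.
        self.fallback_calls += 1
        return self.fallback(field, term)


# Invented per-shard counts standing in for remote subsearchers.
shard_counts = [
    {("content", "iphone"): 3, ("content", "zyzzyva"): 1},
    {("content", "iphone"): 20_000},
]


def fan_out(field, term):
    return sum(counts.get((field, term), 0) for counts in shard_counts)


source = SharedDocFreqSource({("content", "iphone"): 20_003}, fan_out)
common = source.doc_freq("content", "iphone")  # served from the snapshot
rare = source.doc_freq("content", "zyzzyva")   # falls back to the shards
```

The snapshot can go slightly stale without much harm, since a small error in the doc_freq of a common term barely moves its IDF.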