On Nov 21, 2017, at 02:09, [EMAIL PROTECTED] wrote:
> I have a question regarding the scoring mechanism for relevancy. Is the scoring mechanism tf/idf when the field is indexed with the EasyAnalyzer in the schema? What happens when multiple terms are used? Are the tf/idf values summed?
Lucy uses Lucene's Practical Scoring Function by default: https://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html
Essentially, tf/idf values are summed after being multiplied with each term's boost and normalization factor.
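To make the summation concrete, here is a rough Python sketch of the Practical Scoring Function as described in the Lucene 3.6 documentation linked above. It is not Lucy code. The function names, the argument layout, and the default formulas (square-root tf, logarithmic idf, inverse-square-root length norm) are assumptions taken from Lucene's DefaultSimilarity, and Lucy's internals may differ in detail.

```python
import math

# Illustrative defaults modeled on Lucene 3.6's DefaultSimilarity.
# These are assumptions for the sketch, not Lucy's actual implementation.

def tf(freq):
    """Term frequency factor: grows with the square root of the raw count."""
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms get a larger weight."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def length_norm(num_terms):
    """Field length normalization: shorter fields score higher."""
    return 1.0 / math.sqrt(num_terms) if num_terms > 0 else 0.0

def score(query_terms, doc_term_freqs, doc_len, doc_freqs, num_docs, boosts=None):
    """Sum per-term tf * idf^2 * boost * norm, scaled by coord and query norm."""
    boosts = boosts or {}
    if not query_terms:
        return 0.0

    # Query norm: makes scores from different queries roughly comparable.
    sum_sq = sum(
        (idf(doc_freqs.get(t, 0), num_docs) * boosts.get(t, 1.0)) ** 2
        for t in query_terms
    )
    query_norm = 1.0 / math.sqrt(sum_sq) if sum_sq > 0 else 0.0

    # Coord: fraction of the query terms that occur in the document.
    matched = [t for t in query_terms if doc_term_freqs.get(t, 0) > 0]
    coord = len(matched) / len(query_terms)

    norm = length_norm(doc_len)
    total = sum(
        tf(doc_term_freqs[t])
        * idf(doc_freqs.get(t, 0), num_docs) ** 2
        * boosts.get(t, 1.0)
        * norm
        for t in matched
    )
    return coord * query_norm * total
```

For a two-term query, a document containing both terms gets two summands and a coord factor of 1, while a document containing only one term gets a single summand and a coord factor of 0.5.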
> How does it incorporate the location of the words into the scoring mechanism for queries with multiple words?
> How about fields that have a RegexTokenizer? Is it still the same mechanism? Does the type of the tokenizer affect the scoring? I believe the important thing is the generated tokens (and not the tokenizer itself), and maybe the order of the tokens in a document.
If you use the core Tokenizers, neither the type of Tokenizer nor the location of terms in a document affects scoring. But you can write a custom Tokenizer that sets a different boost value for each Token, for example depending on its location within the document.
> One more thing: if I were to change the scoring mechanism for different fields, how can I do it? Are there any predefined mechanisms, e.g. tf/idf, doc2vec, etc.? Or if I want to go further and come up with my own, how can I do it?
You can tweak the scoring formula by supplying your own Similarity subclass for each FieldType, possibly in conjunction with your own Query/Compiler/Matcher subclasses: https://lucy.apache.org/docs/c/Lucy/Index/Similarity.html
The public documentation for Similarity is incomplete, unfortunately. But the class is similar to Lucene’s. The .cfh file contains more details: https://git1-us-west.apache.org/repos/asf?p=lucy.git;a=blob;f=core/Lucy/Index/Similarity.cfh;h=15ec409dee06b19af1b855db50b4fef229dd314e;hb=HEAD
You’d typically override the methods TF, IDF, Coord, Length_Norm, or Query_Norm.
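As a language-neutral illustration of the idea, the Python sketch below shows what overriding a couple of these methods could look like. It does not use Lucy's bindings: the `Similarity` base class here is a hypothetical stand-in with Lucene-style defaults, and `BinaryTFSimilarity`, `field_similarity`, and `term_weight` are names invented for the example. In Lucy itself you would subclass Lucy::Index::Similarity and attach the subclass to a FieldType.

```python
import math

class Similarity:
    """Hypothetical stand-in for a Similarity class with Lucene-style defaults."""

    def tf(self, freq):
        return math.sqrt(freq)

    def idf(self, doc_freq, total_docs):
        return 1.0 + math.log(total_docs / (1 + doc_freq))

    def coord(self, overlap, max_overlap):
        return overlap / max_overlap if max_overlap else 0.0

    def length_norm(self, num_tokens):
        return 1.0 / math.sqrt(num_tokens) if num_tokens else 0.0

    def query_norm(self, sum_of_squared_weights):
        if sum_of_squared_weights <= 0:
            return 0.0
        return 1.0 / math.sqrt(sum_of_squared_weights)


class BinaryTFSimilarity(Similarity):
    """Custom variant: ignore repeated terms and field length."""

    def tf(self, freq):
        # A term either occurs or it does not; repeats add nothing.
        return 1.0 if freq > 0 else 0.0

    def length_norm(self, num_tokens):
        # Do not favor short fields.
        return 1.0


# One similarity per field, mimicking "a Similarity subclass for each FieldType".
field_similarity = {
    "title": BinaryTFSimilarity(),
    "body": Similarity(),
}

def term_weight(sim, freq, doc_freq, total_docs, num_tokens, boost=1.0):
    """Per-term contribution to the score under the given similarity."""
    return (
        sim.tf(freq)
        * sim.idf(doc_freq, total_docs) ** 2
        * boost
        * sim.length_norm(num_tokens)
    )
```

With this setup, a term that appears nine times in a "title" field contributes the same weight as one that appears once, whereas in the "body" field the default square-root tf still rewards the repeats.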