Thank you for jumping in @[EMAIL PROTECTED]
I have an index with raw addresses in a nonstandardized format such as "123 main street" or "main street 123", and I am looking to search this index and pull the closest addresses from another raw input with a similar unpredictable format. Ideally, I am trying to reduce the number of results as much as possible because of time constraints.
At the moment, I am running a dismax query with the mm (minimum should match) parameter set to a value I am comfortable with (50%, for example).
For an address such as "123 main street CA 90201 US", if I execute a query such as "return addresses that match 50% of the tokens" (dismax with mm set to 50%), I will potentially get records such as "US Street 123" or "main street CA", which is not what I am looking for. I understand that I could increase the mm parameter to, say, 100%, but then I am not sure whether the token "street" should be counted when calculating mm, as I could miss a record such as "123 main CA 90201 US".
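To make the problem concrete, here is a minimal Python sketch of the token counting behind a percentage mm. It is only an approximation: it assumes whitespace tokenization and lowercasing in place of a real analyzer, it ignores scoring and per-field behavior, and the helper names (tokens, min_should_match, passes_mm) and the ignore parameter are hypothetical, not Solr APIs. The rounding down and the clamp to at least 1 follow the documented behavior of percentage mm values.

```python
def tokens(text, ignore=frozenset()):
    """Lowercase and whitespace-split; drop ignored tokens (stand-in for a real analyzer)."""
    return [t for t in text.lower().split() if t not in ignore]


def min_should_match(clause_count, mm_percent):
    """Percentage mm: round down, then clamp between 1 and the clause count."""
    needed = (clause_count * mm_percent) // 100
    return max(1, min(clause_count, needed))


def passes_mm(query, candidate, mm_percent, ignore=frozenset()):
    """True if the candidate shares enough distinct tokens with the query."""
    query_tokens = set(tokens(query, ignore))
    candidate_tokens = set(tokens(candidate, ignore))
    matched = len(query_tokens & candidate_tokens)
    return matched >= min_should_match(len(query_tokens), mm_percent)


query = "123 main street CA 90201 US"

# At 50%, 3 of the 6 query tokens are enough, so loose matches get through.
print(passes_mm(query, "US Street 123", 50))
print(passes_mm(query, "main street CA", 50))

# At 100%, the record without "street" is lost...
print(passes_mm(query, "123 main CA 90201 US", 100))

# ...unless "street" is excluded from the count on both sides.
print(passes_mm(query, "123 main CA 90201 US", 100, ignore={"street"}))
```

Excluding "street" from the count is what a stop word on the address field would do, which is why the question of which tokens to exclude matters here.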
For longer addresses, the relevance of "main" or "street" is much lower than that of tokens such as the apartment number or the city.
I am not sure if this is the right way to search unstructured addresses, so we are open to suggestions.
Thank you
-----Original Message-----
From: Dave <[EMAIL PROTECTED]>
Sent: Monday, December 2, 2019 7:50 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: Is it possible to have different Stop words depending on the value of a field?
I’ll add to that since I’m up. Stopwords are, in a practical sense, useless and serve no purpose. They’re an old way to save index size that’s not needed any more. You’d need a very specific use case to want them. Maybe you have one, but generally you don’t unless it’s for training a machine or something a bit more on the experimental side. If you can explain why you think you need stop words, that would be helpful in perhaps guiding you to an alternative.
> On Dec 2, 2019, at 7:45 PM, <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> wrote:
>
> That makes sense, thank you for the clarification!
>
> @[EMAIL PROTECTED] If you can, please build on your explanation, as it sounds relevant.
> -----Original Message-----
> From: Dave <[EMAIL PROTECTED]>
> Sent: Monday, December 2, 2019 7:38 PM
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: Is it possible to have different Stop words depending on the value of a field?
>
> It clarifies, yes. You need new fields, in this case something like Address_us and Address_uk. Index and search them accordingly, with a different stopword file used in each field type, hence the copy field from “address” into as many new fields as needed.
>
>> On Dec 2, 2019, at 7:33 PM, <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> wrote:
>>
>> To clarify, a document would look like this :
>>
>> {
>> address: "123 main Street",
>> country : "US"
>> }
>>
>> What I'd like to do when I configure my index is to apply a different set of stop words to the address field depending on the value of the country. For example, something like this:
>>
>> If (country == US) -> File1
>> Else If (country == UK) -> File2
>>
>> Etc..
>>
>> Hopefully, that clarifies.
>>
>> -----Original Message-----
>> From: Jörn Franke <[EMAIL PROTECTED]>
>> Sent: Monday, December 2, 2019 3:25 PM
>> To: [EMAIL PROTECTED]
>> Subject: Re: Is it possible to have different Stop words depending on the value of a field?
>>
>> You can have different fields by country. I am not sure about your stop words, but if they do not occur in the other languages then you do not have a problem.
>> On the other hand, if you need more than stop words (e.g. lemmatization, a specialized way of tokenizing, etc.) then you need a different field per language. You don’t describe your full use case, but if you have different fields for different languages then your client application needs to handle this (not difficult, but you have to be aware of it).