One way to create a better search experience is to understand user intent. One of the phases in that process is query understanding, and one simple step in that direction is query segmentation. In this post, we’ll cover what query segmentation is and when it is useful. We will also introduce you to Solr Query Segmenter, an open-source Solr component that we developed to make the search experience better.
What is query segmentation and when is it useful?
Query segmentation refers to processing a query in order to break it down into meaningful segments. Such segments may be single tokens or sequences of multiple tokens. Once such meaningful segments are discovered, one can use them to enhance the search experience in various ways. One use of query segmentation is to rewrite the query in order to make it more precise. Below are two such examples everyone will be able to relate to.
Think about searching for people on LinkedIn. Sometimes you search for a specific person using their first and last name. If that name is fairly unique, it’s easy to locate that person (e.g. at the moment, there is just one Otis Gospodnetic on the planet, so it’s easy to find his LinkedIn profile). However, when the name is not enough, people use additional criteria to make their query more precise. For example, there are over 35,000 people named Satya on LinkedIn, but if you search for Satya Microsoft there is only one match. While I do not know exactly how LinkedIn handles queries like “Satya Microsoft”, they could be using query segmentation for it. By using query segmentation they could determine that Satya is a first name (or at least a part of a personal name) and Microsoft is the name of an organization. Using this knowledge they could rewrite the query into the equivalent of firstName:Satya AND organization:Microsoft, which would be more precise than a generic version of this query such as keywords:(Satya, Microsoft).
Another query segmentation use case can be found in retail, where people may search for things like “red dress” or “toaster stainless steel Braun”. Using query segmentation one could rewrite the query with the understanding that red is a color, dress and toaster are the actual items, stainless steel is a material, and Braun is the name of the company. A query rewritten using this knowledge can yield more precise results, helping people find what they are looking for faster instead of wading through hundreds of items that are really just loose query matches.
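To make the retail example concrete, here is a minimal Python sketch of a dictionary-based rewrite. It is only an illustration of the idea, not how Solr Query Segmenter is implemented: the dictionary contents, the field names (color, material, brand, item) and the greedy longest-match strategy are all assumptions made for this example.

```python
# Hypothetical dictionary mapping known segments to the field they belong to.
# The entries and field names are made up for illustration.
SEGMENTS = {
    "red": "color",
    "stainless steel": "material",
    "braun": "brand",
    "dress": "item",
    "toaster": "item",
}
MAX_SEGMENT_TOKENS = max(len(segment.split()) for segment in SEGMENTS)


def rewrite(query: str) -> str:
    """Greedy longest-match segmentation.

    At each position the longest candidate token sequence is tried first,
    so "stainless steel" is matched as one segment rather than two tokens.
    """
    tokens = query.lower().split()
    parts = []
    i = 0
    while i < len(tokens):
        longest = min(MAX_SEGMENT_TOKENS, len(tokens) - i)
        for size in range(longest, 0, -1):
            candidate = " ".join(tokens[i:i + size])
            if candidate in SEGMENTS:
                parts.append(f'{SEGMENTS[candidate]}:"{candidate}"')
                i += size
                break
        else:
            # No dictionary match: keep the token as a plain keyword.
            parts.append(tokens[i])
            i += 1
    return " ".join(parts)


print(rewrite("toaster stainless steel Braun"))
print(rewrite("red dress"))
```

Tokens that are not in the dictionary are passed through unchanged, so they can still be matched as ordinary keywords.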
Query segmentation can also be used to extract location or point-of-interest information from a query and turn it into a geospatial query, as you can see in the Solr Query Segmenter README.
Setup Solr Example
We’ll assume you already have Solr installed. In the example below we’ll use Solr 6.0.1, but other versions should work, as long as there is a version of Solr Query Segmenter that is based on it (if you don’t find one, send us a PR!). Solr ships with several examples. We’ll use the “techproducts” example to show how Solr Query Segmenter works and what is in it for you.
Let’s first run the techproducts example:
bin/solr start -e techproducts
Just to make sure all is working, you should be able to visit http://localhost:8983/solr/#/techproducts and see the Solr web admin interface.
If all is OK, we can stop Solr for now:
bin/solr stop -e techproducts
Setup Query Segmenter
Query Segmenter setup has two parts:
- Download / install of the required Jar files
- Configuration
Setup library
Download the Query Segmenter jars from the Maven repository:
mkdir example/techproducts/solr/techproducts/lib
cd example/techproducts/solr/techproducts/lib
wget https://oss.sonatype.org/content/repositories/releases/com/sematext/querysegmenter/st-QuerySegmenter-core/1.3.6.3.0/st-QuerySegmenter-core-1.3.6.3.0.jar
wget https://oss.sonatype.org/content/repositories/releases/com/sematext/querysegmenter/st-QuerySegmenter-solr/1.3.6.3.0/st-QuerySegmenter-solr-1.3.6.3.0.jar
Configuration
The Query Segmenter Solr library includes Solr components that use QueryComponent – the Solr SearchComponent that handles queries. The library currently contains 3 components – QuerySegmenterQParser, QuerySegmenterComponent, and CentroidComponent. Let’s have a look at each of them.
Dictionary-based Segmentation
Query segmentation is based on matching of dictionary elements against queries. Dictionary elements are specified in dictionary files. Dictionary files are plain text files that contain segments to look for when parsing and segmenting a query. A few dictionary files used for unit tests can be found under https://github.com/sematext/query-segmenter/tree/master/core/src/test/resources.
Dictionary Structure
There are 3 types of dictionaries:
Segment Dictionary
This is used with QuerySegmenterQParser and QuerySegmenterComponent and is nothing more than a text file with a set of keywords, one keyword per line. For example:
electronics
currency
memory
wireless mouse
Centroid Dictionary
This is used with QuerySegmenterQParser & CentroidComponent. It contains a set of points, one point per line. Points have the format of name|lat|lon. For example:
Aaronsburg|40.9068|-77.4081
Area Dictionary
This is another type of location dictionary for QuerySegmenterQParser & CentroidComponent. Instead of having a point per line it contains an area per line, specified using the name|maxlat|maxlon|minlat|minlon format. For example:
Northeast,61.235009,-149.703891,61.195252,-149.778423
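The two location formats are simple enough to parse by hand. The sketch below reads one line of each kind into a small Python record. Note that the centroid example above is pipe-separated while the area example is comma-separated, so the separator is a parameter here. The class and function names are hypothetical and are not part of the library.

```python
from typing import NamedTuple


class Centroid(NamedTuple):
    name: str
    lat: float
    lon: float


class Area(NamedTuple):
    name: str
    max_lat: float
    max_lon: float
    min_lat: float
    min_lon: float


def parse_centroid(line: str, separator: str = "|") -> Centroid:
    """Parse a name|lat|lon line."""
    name, lat, lon = line.strip().split(separator)
    return Centroid(name, float(lat), float(lon))


def parse_area(line: str, separator: str = "|") -> Area:
    """Parse a name|maxlat|maxlon|minlat|minlon line."""
    name, max_lat, max_lon, min_lat, min_lon = line.strip().split(separator)
    return Area(name, float(max_lat), float(max_lon), float(min_lat), float(min_lon))


print(parse_centroid("Aaronsburg|40.9068|-77.4081"))
print(parse_area("Northeast,61.235009,-149.703891,61.195252,-149.778423", separator=","))
```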
If a segment in the user query matches an element of the dictionary (built from the dictionary file), the query is rewritten using either the field specified in the segmenter configuration or the location (only when an area dictionary is used, shown later in this article). For example, for the query “pizza brooklyn”, if “brooklyn” is an area in the dictionary, the query may be rewritten to “pizza neighborhood:brooklyn”, or perhaps “pizza location:[minlat,minlon TO maxlat,maxlon]”. Both the field to use and whether to use the label or the location are configurable.
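The two rewrite styles for the “pizza brooklyn” example can be sketched in a few lines of Python. This is an illustration under assumptions, not the library's code: the bounding box values are invented, only single-token segments are handled, and the use_lat_lon flag merely mirrors the useLatLon setting that appears in the configuration.

```python
# Hypothetical area dictionary entry; the bounding box values are made up.
AREAS = {
    "brooklyn": {"max_lat": 40.74, "max_lon": -73.83, "min_lat": 40.57, "min_lon": -74.04},
}


def rewrite_area(query: str, field: str, use_lat_lon: bool) -> str:
    """Rewrite area segments to either field:label or a field:[range] clause."""
    parts = []
    for token in query.lower().split():
        area = AREAS.get(token)
        if area is None:
            # Not an area segment: keep it as a plain keyword.
            parts.append(token)
        elif use_lat_lon:
            lower = f"{area['min_lat']},{area['min_lon']}"
            upper = f"{area['max_lat']},{area['max_lon']}"
            parts.append(f"{field}:[{lower} TO {upper}]")
        else:
            parts.append(f"{field}:{token}")
    return " ".join(parts)


print(rewrite_area("pizza brooklyn", "neighborhood", use_lat_lon=False))
print(rewrite_area("pizza brooklyn", "location", use_lat_lon=True))
```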
| | Segment Dictionary | Centroid Dictionary | Area Dictionary |
|---|---|---|---|
| QuerySegmenterQParser | x | x | x |
| QuerySegmenterComponent | x | | |
| CentroidComponent | | x | x |
QuerySegmenterQParser
This QParser is used to parse the query, extract segments from the query, and then rewrite it before letting Solr execute it.
Configuration
Configure QuerySegmenterQParser in the solrconfig.xml (example/techproducts/solr/techproducts/conf/) file:
<queryParser name="seg" class="com.sematext.querysegmenter.solr.QuerySegmenterQParserPlugin">
  <lst name="segments">
    <lst name="cats">
      <str name="field">cat</str>
      <str name="dictionary">com.sematext.querysegmenter.GenericSegmentDictionaryMemImpl</str>
      <str name="filename">${solr.solr.home}/${solr.core.name}/conf/segmenter/categories.txt</str>
      <bool name="useLatLon">false</bool>
    </lst>
  </lst>
</queryParser>
Create the dictionary file:
mkdir example/techproducts/solr/techproducts/conf/segmenter
cat <<EOF > example/techproducts/solr/techproducts/conf/segmenter/categories.txt
electronics
currency
memory
software
camera
copier
music
printer
scanner
EOF
Usage
Let’s start Solr again after adding the above segmenter component configuration.
bin/solr start -e techproducts
To use the QParser directly, use LocalParams syntax:
http://localhost:8983/solr/techproducts/select/?q={!seg}electronics%20device
Note that the “seg” part in {!seg} local parameter matches the “seg” name of the config section above.
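When building this URL from code, the braces, exclamation mark and space in the q parameter need to be percent-encoded. Here is a small Python sketch using only the standard library. The segmenter_url helper is hypothetical, and the fl=cat parameter is added only to keep responses short.

```python
from urllib.parse import quote, urlencode

# Assumes Solr is running locally with the techproducts example.
BASE_URL = "http://localhost:8983/solr/techproducts/select"


def segmenter_url(user_query: str, parser_name: str = "seg") -> str:
    """Build a select URL that routes the query through the named QParser."""
    params = {
        "q": f"{{!{parser_name}}}{user_query}",  # e.g. {!seg}electronics device
        "fl": "cat",
    }
    # quote_via=quote encodes spaces as %20 instead of "+".
    return f"{BASE_URL}?{urlencode(params, quote_via=quote)}"


print(segmenter_url("electronics device"))
```

The resulting URL can then be fetched with any HTTP client, for example urllib.request.urlopen.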
In the above example, the Query Segmenter will first spot the “electronics” segment because that was one of the dictionary elements we provided. Thus, it will rewrite that part of the query to cat:"electronics". Why does it use the “cat” field? Because that is the field we specified in the config earlier. Once the query is rewritten like this, it is handled by the eDismax parser, which then uses just the remaining “device” part with the fields defined in its qf. The cat:"electronics" portion of the query is not used with qf because of the field-specific prefix.
Using our “techproducts” Solr example such a segmented query returns 12 docs, all of which are in the “electronics” category — this is the key here!
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">2</int>
    <lst name="params">
      <str name="q">{!seg}electronics device</str>
      <str name="fl">cat</str>
    </lst>
  </lst>
  <result name="response" numFound="12" start="0">
    <doc> <arr name="cat"> <str>electronics</str> <str>connector</str> </arr> </doc>
    <doc> <arr name="cat"> <str>electronics</str> <str>connector</str> </arr> </doc>
    <doc> <arr name="cat"> <str>electronics</str> <str>memory</str> </arr> </doc>
    <doc> <arr name="cat"> <str>electronics</str> <str>memory</str> </arr> </doc>
    <doc> <arr name="cat"> <str>electronics</str> <str>memory</str> </arr> </doc>
    <doc> <arr name="cat"> <str>electronics</str> <str>music</str> </arr> </doc>
    <doc> <arr name="cat"> <str>electronics</str> <str>graphics card</str> </arr> </doc>
    <doc> <arr name="cat"> <str>electronics</str> <str>graphics card</str> </arr> </doc>
    <doc> <arr name="cat"> <str>electronics</str> <str>multifunction printer</str> <str>printer</str> <str>scanner</str> <str>copier</str> </arr> </doc>
    <doc> <arr name="cat"> <str>electronics</str> <str>camera</str> </arr> </doc>
    <doc> <arr name="cat"> <str>electronics</str> <str>hard drive</str> </arr> </doc>
    <doc> <arr name="cat"> <str>electronics</str> <str>hard drive</str> </arr> </doc>
  </result>
</response>
Now compare that to the results of a query without the segmenter component:
http://localhost:8983/solr/techproducts/select/?q=electronics%20device
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">5</int>
    <lst name="params">
      <str name="q">electronics device</str>
      <str name="fl">cat</str>
    </lst>
  </lst>
  <result name="response" numFound="14" start="0">
Notice something different? We got 14 hits, not just 12. Here is an excerpt of the results that includes those 2 extra hits:
    <doc> <arr name="cat"> <str>electronics and computer1</str> </arr> </doc>
    <doc> <arr name="cat"> <str>electronics</str> <str>memory</str> </arr> </doc>
    <doc> <arr name="cat"> <str>electronics and stuff2</str> </arr> </doc>
You see the problem? Our “electronics” query picked up matches in other categories. Sometimes that is what you want, but sometimes you really don’t want that, and the Solr Query Segmenter helps you avoid that and return more precise results.
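A quick way to spot such loose matches is to list the documents whose cat values do not contain the exact category. The Python sketch below does this with the standard XML parser on a trimmed copy of the three documents shown above. The loose_matches function name is made up for illustration, and the sample omits the response header.

```python
import xml.etree.ElementTree as ET

# Trimmed sample built from the three documents shown in the excerpt above.
SAMPLE = """
<response>
  <result name="response" start="0">
    <doc><arr name="cat"><str>electronics and computer1</str></arr></doc>
    <doc><arr name="cat"><str>electronics</str><str>memory</str></arr></doc>
    <doc><arr name="cat"><str>electronics and stuff2</str></arr></doc>
  </result>
</response>
"""


def loose_matches(xml_text: str, category: str) -> list:
    """Return the cat lists of docs that lack an exact `category` value."""
    root = ET.fromstring(xml_text)
    loose = []
    for doc in root.iter("doc"):
        cats = [s.text for s in doc.findall("arr[@name='cat']/str")]
        if category not in cats:
            loose.append(cats)
    return loose


for cats in loose_matches(SAMPLE, "electronics"):
    print(cats)
```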
QuerySegmenterComponent
This component works like the QParser described above, but is implemented as a Solr SearchComponent instead of a QParser. Using QuerySegmenterComponent lets us configure, for each individual Request Handler, whether to include query segmentation. One could also configure multiple QuerySegmenterComponents, perhaps with different dictionaries and/or different fields.
Using this component also means you don’t need to add the {!seg} prefix to every user query, as in q={!seg}electronics%20device.
Note that you should put this component before the standard query component (or simply define it to be the first component), because it needs to rewrite the query before the query is made against Solr.
Configuration
<searchComponent name="segmenter" class="com.sematext.querysegmenter.solr.QuerySegmenterComponent">
  <lst name="segments">
    <lst name="cats">
      <str name="field">cat</str>
      <str name="dictionary">com.sematext.querysegmenter.GenericSegmentDictionaryMemImpl</str>
      <str name="filename">${solr.solr.home}/${solr.core.name}/conf/segmenter/categories.txt</str>
      <bool name="useLatLon">false</bool>
    </lst>
  </lst>
</searchComponent>
<requestHandler name="/qs" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">
      name^1.2 id^10.0 features^1.0 manu^1.1 cat^1.4
    </str>
  </lst>
  <arr name="first-components">
    <str>segmenter</str>
  </arr>
</requestHandler>
Usage
http://localhost:8983/solr/techproducts/qs?q=electronics%20device
CentroidComponent
This SearchComponent is used to rewrite queries by segmenting them, looking for segments that match a centroid in the provided area dictionary, and then centering queries using that centroid. It must be used within a RequestHandler that uses a location filter (bbox or geofilt). If a match is found, the user location (the required pt request param) is changed to the center location of the centroid. The effect is that instead of using the user location for the location filter, it will use the centroid location. If multiple centroid segments are returned from the user query, the closest centroid to the original user location is used.
For example, if a user searches for “pizza Aaronsburg”, the segment “Aaronsburg” might be returned as a centroid with location 40.9068, -77.4081. This location would then be used instead of the original user’s location (think of a person sitting in front of a computer in Cleveland, Ohio and looking for where to eat pizza in Aaronsburg, Pennsylvania). This would filter results and return only matches within some radius around the centroid location. This radius is specified in the configuration, as shown below.
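The rule of picking the centroid closest to the user when several segments match can be sketched as follows. This is an assumption-laden illustration: it uses the haversine great-circle formula with a mean Earth radius of 6371 km, which may differ from the distance calculation the component actually uses, and the Cleveland coordinates are approximate.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


def pick_centroid(user_pt, matches):
    """Return the (name, lat, lon) match closest to user_pt, or None."""
    if not matches:
        return None
    user_lat, user_lon = user_pt
    return min(matches, key=lambda m: haversine_km(user_lat, user_lon, m[1], m[2]))


cleveland = (41.4993, -81.6944)  # approximate user location
matches = [
    ("Adelphia", 40.2295, -74.2954),
    ("Aaronsburg", 40.9068, -77.4081),
]
print(pick_centroid(cleveland, matches))
```

With no matching centroid the function returns None, which corresponds to leaving the user's original pt untouched.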
Configuration
We’ll define the SearchComponent in solrconfig.xml:
<searchComponent name="centroidcomp"
class="com.sematext.querysegmenter.solr.CentroidComponent">
<str name="filename">${solr.solr.home}/${solr.core.name}/conf/segmenter/centroid.csv</str>
<str name="separator">|</str>
</searchComponent>
Note how we’ve specified a dictionary file with centroid information and that it’s in the csv format, which was described earlier. You can see an example centroid.csv in https://github.com/sematext/query-segmenter/tree/master/core/src/test/resources.
Next, we need to add this component to a request handler:
<requestHandler name="/centroid" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<str name="sfield">store</str>
<str name="fq">{!geofilt}</str>
<str name="q.alt">*:*</str>
<str name="d">75</str> <!-- radius from location, in kilometers by default -->
</lst>
<arr name="first-components">
<str>centroidcomp</str>
</arr>
</requestHandler>
The “sfield” needs to specify a location field. In this example that field is “store”. The “d” setting specifies the radius from the location, in kilometers. Any point outside that radius will be filtered out.
Usage
We can use it with the /centroid request handler defined above. Let’s search for adelphia radeon:
http://localhost:8983/solr/techproducts/centroid?q=adelphia%20radeon
Searching for adelphia radeon will return the following:
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
    <lst name="params">
      <str name="q">adelphia radeon</str>
    </lst>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">100-435805</str>
      <str name="name">ATI Radeon X1900 XTX 512 MB PCIE Video Card</str>
      <str name="manu">ATI Technologies</str>
      <str name="manu_id_s">ati</str>
      <arr name="cat">
        <str>electronics</str>
        <str>graphics card</str>
      </arr>
      <arr name="features">
        <str>ATI RADEON X1900 GPU/VPU clocked at 650MHz</str>
        <str>512MB GDDR3 SDRAM clocked at 1.55GHz</str>
        <str>PCI Express x16</str>
        <str>dual DVI, HDTV, svideo, composite out</str>
        <str>OpenGL 2.0, DirectX 9.0</str>
      </arr>
      <float name="weight">48.0</float>
      <float name="price">649.99</float>
      <str name="price_c">649.99,USD</str>
      <int name="popularity">7</int>
      <bool name="inStock">false</bool>
      <date name="manufacturedate_dt">2006-02-13T00:00:00Z</date>
      <str name="store">40.7143,-74.006</str>
      <long name="_version_">1538980276785381376</long>
    </doc>
  </result>
</response>
What happened here? One of the centroid dictionary entries is this:
Adelphia|40.2295|-74.2954
Thus, the Solr Query Segmenter matched adelphia in the dictionary and rewrote that part of the query to use the Adelphia lat,lon. It limited the query to stores within a 75 km radius of that point, and then looked for the keyword radeon in documents from that filtered set.
As a result, it found the ATI Radeon X1900 XTX 512 MB PCIE Video Card that is being sold in a store in or near Adelphia.
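As a rough sanity check of the radius filter, one can compute the great-circle distance between the Adelphia centroid (40.2295, -74.2954) and the store location from the response (40.7143, -74.006). The Python sketch below assumes the haversine formula and a mean Earth radius of 6371 km, so the number is approximate and may not match Solr's own calculation exactly.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation
RADIUS_KM = 75.0          # the "d" value from the request handler defaults


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two lat/lon points, in kilometers."""
    dlat = radians(lat2 - lat1)
    dlon = radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(a))


adelphia = (40.2295, -74.2954)  # centroid dictionary entry
store = (40.7143, -74.006)      # "store" field of the matching document

distance = haversine_km(*adelphia, *store)
print(f"{distance:.1f} km, within radius: {distance <= RADIUS_KM}")
```

Under these assumptions the store comes out a bit under 60 km from the Adelphia centroid, which is inside the 75 km radius and explains why the document passed the filter.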