Solr Digest, January 2010

Similar to our Lucene Digest post , we’ll occasionally cover recent developments in the  Solr world. Although Solr 1.4 was released only two months ago, work on new features for 1.5 (located in the svn trunk) is in full gear:

1) GeoSpatial search features have been added to Solr. The work is based on Local Lucene project, donated to the Lucene community a while back and converted to a Lucene contrib module. The features to be incorporated into Solr will allow one to implement things like:

  • filter by a bounding box – find all documents that match some specific area
  • calculate the distance between two points
  • sort by distance
  • use the distance to boost the score of documents (this is different than sort and would be used in case you want some other factors to affect the ordering of documents)

The main JIRA issue is SOLR-773, and it is scheduled to be implemented in Solr 1.5.

2) Solr Autocomplete component – Autocomplete functionality is a very common requirement. Solr itself doesn’t have the component which would provide such functionality (unlike SpellcheckComponent which provides spellchecking, although in limited form; more on this in another post or, if you are curious, have a look at the DYM ReSearcher). There are a few common approaches to solving this problem, for instance, by using recently added TermsComponent (for more information, you can check this excellent showcase by Matt Weber).  These approaches, however, have some limitations. The first limitation that comes to mind is spellchecking and correction of misspelled words. You can see such feature in Firefox’ google-bar,  if you write “Washengton”. You’ll still get “Washington” offered, while TermsComponent (and some other) approaches fail here.

The aim of SOLR-1316 is to provide autocomplete component in Solr out-of-the-box. Since it is still in development, you can’t quite rely on it, but it is scheduled to be released with Solr 1.5. In the mean-time, you can check another Sematext product which offers AutoComplete functionality with few more advanced features. It is a constantly developing product whose features have been heavily absed on real-life/customer feedback.  It uses an approach unlike than of TermsComponent and is, therefore, faster. Also, one of the features currently in development (and soon to be released) is spell-correction of queries.

3) One very nice addition in Solr 1.5 is edismax or Extended Dismax. Dismax is very useful for situations where you want to let searchers just enter a few free-text keywords (think Google) , without field names, logical operators, etc.  The Extended Dismax is contributed thanks to Lucid Imagination.  Here are some of its features:

  • Supports full Lucene query syntax in the absence of syntax errors
  • supports “and”/”or” to mean “AND”/”OR” in Lucene syntax mode
  • When there are syntax errors, improved smart partial escaping of special characters is done to prevent them… in this mode, fielded queries, +/-, and phrase queries are still supported.
  • Improved proximity boosting via word bi-grams… this prevents the problem of needing 100% of the words in the document to get any boost, as well as having all of the words in a single field.
  • advanced stopword handling… stopwords are not required in the mandatory part of the query but are still used (if indexed) in the proximity boosting part. If a query consists of all stopwords (e.g. to be or not to be) then all will be required.
  • Supports the “boost” parameter.. like the dismax bf param, but multiplies the function query instead of adding it in
  • Supports pure negative nested queries… so a query like +foo (-foo) will match all documents

You can check the development in JIRA issue SOLR-1553.

4) Up until version 1.5, distributed Solr deployments depended not only on Solr features (like replication or sharding), but also on some external systems, like load-balancers. Now, Zookeeper is used to provide Solr specific naming service (check JIRA issue SOLR-1277 for the details). The features we’ll get look exciting:

  • Automatic failover (i.e. when a server fails, clients stop trying to index to or search it and uses a different server)
  • Centralized configuration management (i.e. new solrconfig.xml or schema.xml propagate to a live Solr cluster)
  • Optionally allow shards of a partition to be moved to another server (i.e. if a server gets hot, move the hot segments out to cooler servers).

We’ll cover some of these topics in more details in the future installments. Thanks for reading!