Lucene Digest, May 2010

Last month we were busy with work and didn’t publish our monthly Lucene Digest.  To make up for it, this month’s Lucene Digest really covers all Lucene developments in May from A to Z.

  • Mark Harwood had a busy month.  In LUCENE-2454 he contributed production-tested and often-requested functionality for properly indexing parent-child entities (or, more generally, any form of hierarchical search).  He introduced his work in Adding another dimension to Lucene searches.  Joaquin Delgado has been talking about the merger of unstructured and structured search (not surprising, considering his old company with a Lucene-based Federated Search product was acquired by Oracle several years ago!), so he quickly related this to the ability to perform XQuery + full-text searches.  MarkLogic, watch your back! 😉
  • Mark also contributed a match spotter for all query types in LUCENE-1999.  This patch makes it possible to figure out which field(s) a match for a particular hit was on, which is functionality people ask about on the Lucene and Solr mailing lists every so often.  One warning, though: spotting the match and encoding that information causes some loss of score precision.
  • While Lucene already has TimeLimitedCollector, it’s not perfect and leaves room for improvement.  Back in 2009, Mark came up with TimeLimitedIndexReader, as you can tell from his messages in the Improving TimeLimitedCollector thread, and created a patch for it in LUCENE-1720, which filled some of TimeLimitedCollector’s gaps:
    • Any reader activity can be time-limited, not just individual searches — e.g., the document retrieval phase.
    • It times out faster (i.e., runaway queries such as fuzzy queries are detected quickly, before the last “collect” stage of query processing).
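To make the difference concrete, here is a minimal, self-contained sketch of the collector-level idea: check a deadline on every collected hit and abort with an exception once the budget is spent.  This is not the Lucene API — the class and method names below are hypothetical — and TimeLimitedIndexReader’s improvement is precisely that it pushes this check down into the reader, so even non-search activity is covered.

```java
import java.util.ArrayList;
import java.util.List;

public class TimeLimitedCollect {
    // Thrown when the time budget runs out mid-collection.
    static class TimeExceededException extends RuntimeException {
        TimeExceededException(String msg) { super(msg); }
    }

    // Collects doc ids until the source is drained or the deadline passes.
    static List<Integer> collect(Iterable<Integer> docIds, long budgetMillis) {
        long deadline = System.currentTimeMillis() + budgetMillis;
        List<Integer> hits = new ArrayList<>();
        for (int doc : docIds) {
            if (System.currentTimeMillis() > deadline) {
                throw new TimeExceededException("stopped after " + hits.size() + " hits");
            }
            hits.add(doc);
        }
        return hits;
    }

    public static void main(String[] args) {
        // A tiny doc set finishes well within a one-second budget.
        System.out.println(collect(List.of(1, 2, 3), 1000));
    }
}
```

The weakness this sketch shares with TimeLimitedCollector is visible here: the clock is only consulted between collect calls, so a query that spends a long time *before* producing its first hit is not interrupted.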
  • Robert Muir, who gave a well-received presentation on Finite State Queries in Lucene at the New York Search & Discovery Meetup (see slides) back in April 2010, has been busy consolidating Lucene and Solr analyzers, tokenizers, token filters, character filters, etc. and moving them to their new home: modules/analysis, under the Lucene root dir in svn.  The plan is to produce separate and standalone artifacts (read: jars) for this analysis module.  Here at Sematext we will make use of this module immediately for some of our products that currently list Lucene as a dependency, even though they really only need Lucene’s analyzers.  Solr, too, will be another customer for the new analysis module, as described by Robert in solr and analyzers module (yes, we’re showing off the in-document search-term highlighting, which we find very useful).
  • Robert also worked on and committed an ICU-based tokenizer for Unicode-based text segmentation in LUCENE-2414.  This translates to having the ability to properly tokenize text that doesn’t use spaces as token separators.  If you’ve ever had to deal with searching Chinese, for example, you’ll know that word segmentation is one of the initial challenges one has to deal with.
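For a feel of what locale-aware text segmentation looks like, here is a sketch using the JDK’s own BreakIterator.  To be clear, this is not the ICU tokenizer from LUCENE-2414 — ICU adds, among other things, dictionary-based segmentation for scripts that don’t separate words with spaces — but the boundary-analysis idea is the same.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class WordSegmentation {
    // Splits text into word tokens using locale-aware boundary analysis,
    // dropping segments that contain no letters or digits (spaces, punctuation).
    static List<String> tokenize(String text, Locale locale) {
        BreakIterator it = BreakIterator.getWordInstance(locale);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String candidate = text.substring(start, end);
            if (candidate.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(candidate);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Hello, world!", Locale.ENGLISH)); // [Hello, world]
    }
}
```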
  • Speaking of splitting on spaces, another task Robert took upon himself was to stop the Lucene QueryParser from splitting queries on whitespace: LUCENE-2458.  This problem of queries being tokenized on spaces comes up quite often, so this is going to be a very welcome improvement in Lucene.
  • One day Robert was super bored, so he decided to write a Lucene analyzer for Indonesian: LUCENE-2437.
  • Andrzej and Israel Ekpo (the author of one of the Solr PHP clients) both decided, at around the same time, to add support for search-time bitwise operations on integer fields.  Israel’s work is in LUCENE-2460, with an accompanying SOLR-1913 issue, while Andrzej’s is in SOLR-1918 and has no pure Lucene patch.  The difference is that Israel’s patch offers only filtering, while Andrzej’s performs scoring, which allows finding the best-matching inexact bit patterns.  This has applications in, e.g., near-duplicate detection.
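The filtering-vs-scoring distinction is easy to show in plain Java.  This is a concept sketch, not code from either patch: the first method is a boolean yes/no filter on a bit pattern, while the second counts agreeing bits, so near-miss patterns still rank — which is what makes near-duplicate detection possible.

```java
public class BitwiseMatch {
    // Filtering: does the stored field value contain all bits of the pattern?
    static boolean matchesAll(int fieldValue, int pattern) {
        return (fieldValue & pattern) == pattern;
    }

    // Scoring: similarity between two bit patterns = number of bit positions
    // where they agree (out of 32). Inexact matches score lower instead of
    // being dropped outright.
    static int bitSimilarity(int a, int b) {
        return Integer.bitCount(~(a ^ b));
    }

    public static void main(String[] args) {
        System.out.println(matchesAll(0b1110, 0b0110));      // true
        System.out.println(bitSimilarity(0b1010, 0b1010));   // 32: identical
        System.out.println(bitSimilarity(0b1010, 0b1011));   // 31: one bit off
    }
}
```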
  • In one of our current engagements we are working with a large, household-name organization and a big U.S. government contractor.  Their index is heavily sharded and is well over 2 TB.  Working with such large indices is no joke (though I’m happy to say we were able to immediately improve their search performance by 40% in the first performance tuning iteration).  What if we could make their indices smaller?  Would that make their search even faster?  Of course!  In LUCENE-1812 (nice number), Andrzej implemented a static index pruning tool that removes posting data from indices for terms with in-document frequency lower than some threshold.  We haven’t used this tool, and it looks like we may not use it for a while, because IBM apparently holds a patent on the very algorithm it uses.
  • Phrase queries got a little performance boost in LUCENE-2410.  Every little bit counts!
  • Tom Burton-West created and contributed a handy tool that outputs total term frequency and document frequency from a Lucene index: LUCENE-2393.  This tool is useful for estimating the sizes of some of the Lucene index files, and thus getting a better grasp on disk IO needs.
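The two statistics are easy to mix up, so here is a toy illustration (plain Java over tokenized documents, not the actual tool, which of course reads these numbers from a real Lucene index): document frequency counts each document once per term, while total term frequency counts every occurrence.

```java
import java.util.*;

public class TermStats {
    // For each term: [0] = document frequency (number of docs containing it),
    //                [1] = total term frequency (occurrences across all docs).
    static Map<String, int[]> stats(List<List<String>> docs) {
        Map<String, int[]> out = new HashMap<>();
        for (List<String> doc : docs) {
            Set<String> seenInThisDoc = new HashSet<>();
            for (String term : doc) {
                int[] s = out.computeIfAbsent(term, k -> new int[2]);
                s[1]++;                               // every occurrence counts
                if (seenInThisDoc.add(term)) s[0]++;  // each doc counts once
            }
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, int[]> s = stats(List.of(
                List.of("lucene", "index", "lucene"),
                List.of("lucene", "solr")));
        // "lucene" appears in 2 docs, 3 times total.
        System.out.println(Arrays.toString(s.get("lucene"))); // [2, 3]
    }
}
```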
  • On both the Lucene and Solr lists we often see people asking about updating individual Document fields instead of simply deleting and re-adding the whole Document.  The delete-and-re-add approach is not necessarily a problem for Lucene/Solr itself, but it can be for the external system from which the data for the complete new Document needs to be fetched.  Shai Erera, another recently added Lucene committer, proposed a new approach for incremental field updates that was well received.  Once implemented, this will be a big boon for Lucene and Solr!  If that thread or message is too long for you to read, let us at least highlight (pun intended) the two great use cases from this message.
  • Lucandra is a Cassandra backend for Lucene.  But no, it’s not a Lucene Directory implementation.  Lucandra has its own IndexReader and IndexWriter that read from Cassandra and write to it.  But in LUCENE-2456 we now have another option: a Cassandra-based Lucene Directory.  We hope to have a post on this in the near future!
  • The author of Cassandra-based Lucene Directory also opened LUCENE-2425 for Anti-Merging Multi-Directory Indexing Framework that splits an index (at run-time) into multiple sub-indices, based on a “split policy”, several of which have also been added to Lucene’s JIRA.  This is somewhat similar to Lucene’s ParallelWriter, but has some differences, as described in the issue.
  • Michael McCandless is working on prototyping a multi-stage pipeline sub-system that aims to further decouple analysis from indexing.  In this pipeline, indexing would be just one step, one stage in the pipeline.  Based on the work done so far, this may even bring some performance improvements.
  • LUCENE-2295 added LimitTokenCountAnalyzer / LimitTokenCountFilter, which wrap any other Analyzer and provide the same functionality that MaxFieldLength provided on IndexWriter.
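The effect of the wrapper is simple to picture: whatever tokens the underlying Analyzer emits, only the first N survive.  A minimal sketch of that behavior over a plain token list (not the actual TokenFilter, which operates on a TokenStream):

```java
import java.util.List;

public class LimitTokens {
    // Truncates a token stream after maxTokens tokens, mirroring what
    // MaxFieldLength did on IndexWriter and what LimitTokenCountFilter
    // now does as a wrapper anywhere in the analysis chain.
    static List<String> limit(List<String> tokens, int maxTokens) {
        return tokens.subList(0, Math.min(maxTokens, tokens.size()));
    }

    public static void main(String[] args) {
        System.out.println(limit(List.of("a", "b", "c", "d"), 2)); // [a, b]
    }
}
```

The practical win of doing this in the analysis chain rather than on IndexWriter is that the limit becomes a per-field, per-analyzer decision instead of a global indexing setting.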
  • Shay Banon, the author of ElasticSearch, contributed LUCENE-2468 (can you complete this hard-to-figure-out numeric sequence?), which allows one to specify how deletions of Documents should be handled in CachingWrapperFilter and CachingSpanFilter.  We recently did work for another large organization and a household name (in the U.S., at least) where we improved their Lucene-based search performance by over 30%.  One of the things we did was to make good use of CachingWrapperFilter.
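The trade-off behind that option can be sketched in a few lines.  The class below is hypothetical, not Lucene code: a cached filter bitset keyed by an index version that changes when documents are deleted.  The caller picks between recomputing on deletes (exact results) and serving the stale cached set (faster, but deleted docs may briefly match until a refresh).

```java
import java.util.BitSet;
import java.util.function.Supplier;

public class CachingFilterSketch {
    private long cachedVersion = -1;
    private BitSet cachedBits;
    int recomputes = 0; // exposed so the trade-off is observable

    // indexVersion changes whenever documents are deleted. The flag picks
    // the behavior: recompute for exactness, or keep serving the stale
    // cached bitset for speed.
    BitSet get(long indexVersion, boolean recomputeOnDeletes, Supplier<BitSet> compute) {
        boolean stale = cachedBits == null
                || (recomputeOnDeletes && indexVersion != cachedVersion);
        if (stale) {
            cachedBits = compute.get();
            cachedVersion = indexVersion;
            recomputes++;
        }
        return cachedBits;
    }
}
```

With recomputeOnDeletes set to false, a deletion-heavy index keeps reusing one cached bitset; with it set to true, every deletion-induced version bump pays the cost of re-running the filter.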
  • LUCENE-2480 removes support for pre-Lucene 3.x indices from Lucene 4.x.  Thus, if you are still on Lucene 1.x or Lucene 2.x, we suggest moving to Lucene 3.x soon.  But, due to radical Lucene changes, even moving from Lucene 3.x to Lucene 4.0 won’t be as seamless as with previous Lucene upgrades.  Lucene 4.0 will include a Lucene 3.x to Lucene 4.0 migration tool: LUCENE-2441.

That’s it for this month.  Remember that you can also follow @sematext on Twitter.