Nutch Digest, April 2010

In the first part of this Nutch Digest we’ll go through new and useful features of the upcoming Nutch 1.1 release, while in the second part we’ll focus on developments and plans for the next big Nutch milestone, Nutch 2.0. But let’s start with a few informational items.

  • Nutch has been approved by the ASF board to become a Top Level Project (TLP) in the Apache Software Foundation. Changes to the Nutch mailing lists, URLs, etc. will start soon.

Nutch 1.1 will be officially released any day now, so here is a walk-through of the Nutch 1.1 release features:

  • Nutch release 1.1 uses Tika 0.7 for parsing and MIME type detection
  • Hadoop 0.20.2 is used for job distribution (Map/Reduce) and distributed file system (HDFS)
  • On the indexing and search side, Nutch 1.1 uses either Lucene 3.0.1 with its own search application or Solr 1.4
  • Some of the new features included in release 1.1 were discussed in a previous Nutch Digest. For example, the alternative Generator, which can generate several segments in a single pass over the CrawlDB, is included in release 1.1. We used a flavour of this patch in our most recent Nutch engagement, which involved a very large vertical crawl. An improvement to SolrIndexer is also included in Nutch 1.1: it now commits only once, after all reducers have finished.
  • Some new and very useful features have not been mentioned before. For example, Fetcher2 (now renamed simply Fetcher) was changed to implement Hadoop’s Tool interface. With this change it is possible to override parameters from configuration files, such as nutch-site.xml or hadoop-site.xml, on the command line.
  • If you’ve done any focused or vertical crawling, you probably know that one or a few unresponsive hosts can slow down the entire fetch, so one very useful feature added to Nutch 1.1 is the ability to skip queues (which roughly translate to hosts) for URLs getting repeated exceptions. We made good use of that here at Sematext in the Nutch project we completed in April 2010.
  • Another improvement in the 1.1 release, related to Nutch-Solr integration, comes in the form of an improved Solr schema that allows field mapping from Nutch to the Solr index.
  • One useful addition to Nutch’s injector is new functionality that allows the user to inject metadata into the CrawlDB. Sometimes you need additional data, related to each URL, to be stored. Such external knowledge can later be used (e.g. indexed) by a custom plug-in. If we can all agree that storing arbitrary data in the CrawlDB (with URL as the primary key) can be very useful, then migration to database-oriented storage (like HBase) is only a logical step. This makes a good segue to the second part of this Digest…
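Because Fetcher now implements Hadoop’s Tool interface, configuration properties can be overridden per run with -D switches instead of editing nutch-site.xml. A sketch of such an invocation (the property and segment path are just illustrative):

```shell
# Override the fetcher thread count for this run only (illustrative values)
bin/nutch fetch -D fetcher.threads.fetch=20 crawl/segments/20100426000000
```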
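The queue-skipping behaviour for misbehaving hosts is driven by configuration. A sketch of what the setting might look like in nutch-site.xml — the property name below is our assumption, so check the 1.1 release notes for the exact key:

```xml
<!-- Assumed property name: skip a fetch queue (host) after this many
     repeated exceptions; a negative value would disable the check -->
<property>
  <name>fetcher.max.exceptions.per.queue</name>
  <value>10</value>
</property>
```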
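The Nutch-to-Solr field mapping mentioned a few bullets up lives in a mapping configuration file (conf/solrindex-mapping.xml in 1.1). A minimal sketch, with field names chosen purely for illustration:

```xml
<mapping>
  <fields>
    <!-- map Nutch's "content" field to Solr's "text" field (illustrative) -->
    <field source="content" dest="text"/>
    <field source="title" dest="title"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
```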
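With the injector addition, per-URL metadata can ride along in the seed list as tab-separated key=value pairs after each URL. A sketch of such a seed file, with hypothetical keys:

```
http://www.example.com/	dept=engineering	source=partner-feed
http://www.example.org/	dept=marketing
```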

In the second half of this Digest we’ll focus on the future of Nutch, starting with Nutch 2.0. Plans and ideas for the next Nutch release can be found on the mailing list under “Nutch 2.0 roadmap” and on the official wiki page.

Nutch is slowly replacing some of its home-grown functionality with best-of-breed products — it uses Tika for parsing, Solr for indexing/searching, and HBase for storing various types of data. The migration to Tika is already included in the Nutch 1.1 release, and exclusive use of Solr as the (enterprise) search engine makes sense — for months we have been telling clients and friends we predict Nutch will deprecate its own Lucene-based search web application in favour of Solr, and that time has finally come. Solr offers much more functionality, configurability, performance, and ease of integration than Nutch’s simple search web application. We are happy Solr users ourselves.

Storing data in HBase instead of directly in HDFS has all the usual benefits of storing data in a database instead of a file system. Structured (fetched and parsed) data is not split into segments (file system directories), so data can be accessed easily and time-consuming segment merges can be avoided, among other things. As a matter of fact, we are about to engage in a project that involves this exact functionality: the marriage of Nutch and HBase. Naturally, we are hoping we can contribute this work back to Nutch, possibly through NUTCH-650.

Of course, when you add a persistence layer to an application there is always the question of whether it is acceptable for it to be tied to one back-end (database) or whether it is better to have an ORM layer on top of the datastore. Such an ORM layer would allow different back-ends to be used to store the data. And guess what? Such an ORM, initially focused on HBase and Nutch, and later on Cassandra and other column-oriented databases, is already in the works! Check out the evaluation of ORM frameworks that support non-relational column-oriented datastores and RDBMSs, and the development of an ORM framework that, while initially using Nutch as the guinea pig, already lives its own decoupled life as a separate project.

That’s all from us on Nutch’s present and future for this month; stay tuned for more Nutch news next month! And of course, as usual, feel free to leave any comments or questions – we appreciate any and all feedback. You can also follow @sematext on Twitter.

Solr Digest, April 2010

Another month is almost over, so it is time for our regular monthly Solr Digest. This time we’ll focus on interesting JIRA issues, so let’s start:

  • Issue SOLR-1860 aims to improve stopword list handling in Solr, based on Lucene’s recent additions of stopword lists to all of its language analyzers. The work hasn’t started just yet (there are no patches to try), so we’ll have to be patient before actually using it.
  • Ever had problems with HTTP authentication in a distributed Solr environment? Until now, it worked only when querying a single Solr server. JIRA issue SOLR-1861 solves this problem and allows credentials to be specified for each shard, falling back to the default behaviour (no credentials) when no credential info is given. The patch is already attached to the issue and can be used with Solr 1.4.
  • If you have used Solr’s MoreLikeThisComponent, you may have noticed that its output lacks any info explaining why it recommended a particular item. The patch in issue SOLR-860 deals with that and improves the MLT component by adding debug info, like this (copied from JIRA):

"realMLTQuery":"+() -id:IW-02"},
"realMLTQuery":"+() -id:SOLR1000"},
"realMLTQuery":"+() -id:F8V7067-APL-KIT"},
"rawMLTQuery":"features:2 features:0 features:lcd features:x features:3",
"boostedMLTQuery":"features:2 features:0 features:lcd features:x features:3",
"realMLTQuery":"+(features:2 features:0 features:lcd features:x features:3) -id:MA147LL/A"}},

This issue is marked to be included in Solr 3.1.
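To trigger output like the above, one would issue an ordinary MoreLikeThis request with debugging enabled; the mlt.* parameters below are standard Solr 1.4 parameters against the example schema, while the extra debug fields come from the patch (exact parameter requirements may differ once it lands):

```
http://localhost:8983/solr/select?q=id:SP2514N&mlt=true&mlt.fl=features&mlt.mintf=1&mlt.mindf=1&debugQuery=true
```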

  • If you ever got a requirement like “some users should be able to access these documents while being forbidden to access those others”, Solr wasn’t able to help you much. Recently, document-level security has been the subject of two JIRA issues. In SOLR-1834 you can find a patch that is already running in a production environment, while another approach to the same problem (also with an attached patch) is presented in SOLR-1872 (the latter currently adds security only to select queries; delete is not supported yet).
  • SolrCloud brings exciting new capabilities to Solr, some of which we already mentioned in our Solr Digest posts (for instance, check Solr Digest, January 2010). SolrCloud functionality is getting committed to trunk; you can monitor the progress in SOLR-1873. This is big!
  • When working with Solr, you have to explicitly configure it to lowercase indexed tokens and query strings, so that uppercase versions of a word match its lowercase versions (for instance, so that the query Sematext matches SEMATEXT, sematext, and Sematext). However, there is an old JIRA issue, SOLR-219, designated to be fixed in Solr 1.5, which would make Solr smart enough to perform case-insensitive searches automatically.
  • One common source of confusion for first-time Solr users is dismax and its relation to the default query operator defined in schema.xml. In reality, the default query operator has no effect on how dismax works. Also, with dismax you can’t use the AND and OR operators directly, but you can achieve the same functionality with dismax’s mm (minimum should match) parameter. Its default value is 100%, meaning that all clauses must match, which is equivalent to using the AND operator between all clauses. If you want OR-operator behaviour, you would simply set its value to 1, meaning that one matching clause is enough. The confusion arises from the fact that even if your default query operator in schema.xml is OR, dismax by default behaves as if it were AND. Issue SOLR-1889 should deal with that and derive the default mm value for dismax from the default query operator in schema.xml, which will make Solr behave more consistently for new users.
  • Another old JIRA issue, SOLR-571, got its first patch a few days ago. The patch allows autowarmCount values to be specified as a percentage of cache size (for instance, 50% would mean that only the top half of cached queries is autowarmed) instead of as an absolute amount.
  • Solr 1.4 introduced the ClusteringComponent, which can cluster search results and documents. Through plugins, it allows any clustering engine to be plugged in. One such engine, lsa4solr, based on Latent Semantic Analysis, was recently unveiled. This engine depends on a development version of Solr 1.3 and Clojure 1.2, so take a look if you’re interested in clustering.
  • And last, but not least, for all Solr enthusiasts, an interesting webinar is scheduled for April 29th: “Practical Search with Solr: Beyond just looking it up”. You can find out more about it here.
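Until SOLR-219 materializes, the usual way to get the case-insensitive matching described a few bullets up is a schema.xml field type that lowercases at both index and query time, along these lines (field type name is illustrative):

```xml
<fieldType name="text_lc" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- lowercases tokens at both index and query time,
         so Sematext matches SEMATEXT and sematext -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```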
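The dismax mm behaviour described above is visible directly in request parameters. For example, a request like the following (field names are illustrative) makes one matching clause sufficient, i.e. OR-like behaviour:

```
http://localhost:8983/solr/select?defType=dismax&q=apache+nutch+solr&qf=title+body&mm=1
```

Omitting mm, or setting it to 100%, would require all three terms to match, i.e. AND-like behaviour.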
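With the SOLR-571 patch applied, the autowarmCount attribute in solrconfig.xml should accept a percentage in addition to an absolute count; a sketch (the percentage syntax is taken from the patch description and may still change):

```xml
<!-- autowarm the top half of cached entries on a new searcher
     (percentage syntax from the SOLR-571 patch, assumed) -->
<filterCache class="solr.LRUCache" size="512" initialSize="512" autowarmCount="50%"/>
```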

Remember, you can also follow us on Twitter: @sematext.  Until next month!

Poll: Handling lucene-dev Merge

The Lucene and Solr projects merged recently, as we mentioned in the Solr Digest and Lucene Digest for March 2010. Today, their -dev mailing lists finally merged as well. Since Sematext runs the service that makes these lists (and more) searchable, we need to decide how to handle this relatively drastic change.

Short version: please tell us how you would like us to handle the lucene-dev merge by selecting your choice in our Handling lucene-dev merge Poll. The two choices are described below.

We’ve identified two options, and we need your input to help us decide which one is right:

  • We can add a new lucene-dev list and start indexing it. This would contain only the new lucene-dev content (both Lucene and Solr development from today on). The downside is that if you wanted to include old lucene-dev or old solr-dev messages in your search, you would have to select those lists explicitly. We could rename them lucene-dev-old and solr-dev-old, for example, so the UI would show lucene-dev, lucene-dev-old, and solr-dev-old. You’d have total control over what gets searched, but you would have to make your choices explicitly, which also means people would have to understand what those -old lists are and why there is no solr-dev.
  • We could merge the old solr-dev and old lucene-dev, and have a single lucene-dev that contains both lists’ old messages (up to today), as well as all the new messages from the merged lucene-dev list from here on. In effect, it would look as if Lucene and Solr had always had a single lucene-dev list, since all of the old lucene-dev and solr-dev content would be in this new lucene-dev. If we go this route, there would be no lucene-dev-old or solr-dev-old in the UI, just one lucene-dev choice. But there also wouldn’t be a solr-dev choice in the UI, since that list no longer exists, which may be confusing. Thus, when you choose to search Solr, you wouldn’t see a solr-dev facet in the UI, but the lucene-dev list’s content would still be searched, so you wouldn’t actually miss any matches.

If there is a 3rd or 4th option that we missed, please let us know via comments!

Please tell us which option you would prefer as a user by selecting your choice in our Handling lucene-dev merge Poll. Thank you.