Hadoop Digest, May 2010

Big news: HBase and Avro have become Apache’s Top Level Projects (TLPs)! The initial discussion happened when our previous Hadoop Digest was published, so you can find links to the threads there. The question of whether or not to become a TLP caused some pretty heated debates in Hadoop subprojects’ communities.  You might find it interesting to read the discussions of the vote results for HBase and ZooKeeper. Chris Douglas was kind enough to sum up the Hadoop subprojects’ responses to becoming TLPs in his post. We are happy to say that all subprojects that became TLPs are still fully searchable via our search-hadoop.com service.

More news:

  • Great news! Google granted a MapReduce patent license to Hadoop.
  • Chukwa team announced the release of Chukwa 0.4.0, their second public release. This release fixes many bugs, improves documentation, and adds several more collection tools, such as the ability to collect UDP packets.
  • HBase 0.20.4 was released. More info in our May HBase Digest!
  • New Chicago area Hadoop User Group was organized.

Good-to-know nuggets shared by the community:

  • Dedicate a separate partition to Hadoop file space – do not use the “/” (root) partition. Setting dfs.datanode.du.reserved property is not enough to limit the space used by Hadoop, since it limits only HDFS usage, but not MapReduce’s.
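As a sketch, the property lives in hdfs-site.xml and is expressed in bytes per volume (the 10 GB figure below is just an example); again, it caps only HDFS usage, not MapReduce’s intermediate output:

```xml
<property>
  <name>dfs.datanode.du.reserved</name>
  <!-- Reserve 10 GB per volume for non-HDFS use. Note: this does NOT
       bound the space taken by MapReduce intermediate files. -->
  <value>10737418240</value>
</property>
```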
  • Cloudera’s Support Team shares some basic hardware recommendations in this post. Read more on proper dedicating & counting RAM for specific parts of the system (and thus avoiding swapping) in this thread.
  • Find a couple of pieces of advice about how to save seconds when you need a job to be completed in tens of seconds or less in this thread.
  • Use Combiners to increase performance when the majority of Map output records have the same key.
  • Useful tips on how to implement Writable can be found in this thread.

Notable efforts:

  • Cascalog: Clojure-based query language for Hadoop inspired by Datalog.
  • pomsets: computational workflow management system for your public and/or private cloud.
  • hiho: a framework for connecting disparate data sources with the Apache Hadoop system, making them interoperable


FAQ:

  1. How can I attach external libraries (jars) which my jobs depend on?
    You can put them in a “lib” subdirectory of your job jar’s root directory. Alternatively, you can use the DistributedCache API.
  2. How to Recommission DataNode(s) in Hadoop?
    Remove the hostname from your dfs.hosts.exclude file and run ‘hadoop dfsadmin -refreshNodes’. Then start the DataNode process on the recommissioned node again.
  3. How to configure log placement under specific directory?
    You can specify the log directory in the environment variable HADOOP_LOG_DIR. It is best to set this variable in bin/hadoop-env.sh.
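The recommission and log-placement answers above, sketched as shell commands (paths and script locations may vary by Hadoop version and installation):

```shell
# Recommission: after removing the node's hostname from the file pointed
# to by dfs.hosts.exclude, make the NameNode re-read its node lists:
hadoop dfsadmin -refreshNodes

# Then, on the recommissioned node itself, start the DataNode again:
bin/hadoop-daemon.sh start datanode

# Log placement: export HADOOP_LOG_DIR (e.g. in hadoop-env.sh) before
# starting the daemons:
export HADOOP_LOG_DIR=/var/log/hadoop
```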

Thank you for reading us, and if you are a Twitter addict, you can now follow @sematext, too!

Mahout Digest, May 2010

As we reported in our previous Digest, Mahout was approved by the Board to become a Top Level Project (TLP), and TLP-related preparations kept Mahout developers busy in May. Mahout mailing lists (user- and dev-) were moved to their new addresses and all subscribers were automatically moved to the new lists: user@mahout.apache.org and dev@mahout.apache.org. For other TLP-related changes and their progress, check the list of items to complete the move.

We’ll start a quick walk-through of May’s new features and improvements with an effort to reduce the on-disk size of Mahout’s Vector representation, which yielded an 11% size reduction on a test data set.

MAHOUT-389 contains a discussion of how UncenteredCosineSimilarity, an implementation of cosine similarity that does not “center” its data, should behave in distributed vs. non-distributed versions, along with a patch for the distributed version. Furthermore, a distributed implementation of any ItemSimilarity currently available in non-distributed form was committed in MAHOUT-393.
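A toy sketch of the distinction in plain Python (not Mahout’s API): “centering” subtracts each vector’s mean before taking the cosine, which turns plain cosine similarity into the Pearson correlation:

```python
import math

def cosine(u, v):
    # Plain (uncentered) cosine similarity: dot(u, v) / (|u| * |v|).
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def center(v):
    # Subtract the mean; the cosine of centered vectors is the Pearson
    # correlation of the originals.
    mean = sum(v) / len(v)
    return [x - mean for x in v]

u, v = [3.0, 4.0, 5.0], [4.0, 5.0, 6.0]
uncentered = cosine(u, v)                # high, biased by the vectors' offsets
centered = cosine(center(u), center(v))  # ~1.0: identical shape after centering
```

The uncentered version keeps the influence of each user's overall rating level, which is exactly the behavior difference the MAHOUT-389 thread debates.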

Mahout has utilities to generate Vectors from a directory of text documents, and one improvement in terms of consistency and ease of understanding was made to the tf/tfidf-vectors outputs. When using the bin/mahout seq2sparse command to generate vectors from text (actually, from the sequence file generated from the text), output used to be created in different directories depending on the weighting method (term frequency or term frequency–inverse document frequency). Now, with the fix from MAHOUT-398, the tf-vectors and tfidf-vectors output directories are created at the same level.
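As a hedged sketch of the command (exact option names can differ by Mahout version; -wt selects the weighting, and the paths are placeholders):

```shell
# Generate vectors from a directory of SequenceFiles; with MAHOUT-398,
# tf-vectors/ and tfidf-vectors/ now end up at the same level under -o:
bin/mahout seq2sparse -i /path/to/seqfiles -o /path/to/vectors -wt tfidf
```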

We’ll end with Hidden Markov Model and its integration into Mahout. In MAHOUT-396 you’ll find a patch and more information regarding where and how it is used.

That’s all from Mahout’s world for May, please feel free to leave any questions or comments and don’t forget you can follow @sematext on Twitter.

Solr Digest, May 2010

May’s Solr Digest brings another review of interesting Solr developments and a short look at current state of Solr’s branches and versions. Confused about which versions to use and which to avoid? Don’t worry, many people are.  We’ll try to clear it up in this Digest.

  • In April’s edition of Solr Digest, we mentioned two JIRA issues dealing with document level security (SOLR-1834 and SOLR-1872). Now another issue (SOLR-1895) deals with LCF security and aims to work with SOLR-1834 to provide a complete solution.
  • One ancient JIRA issue, SOLR-397, finally got resolved and its code is now committed. Solr now has the ability to control how date range end points in facets are dealt with.  You can use this functionality by specifying the facet.date.include HTTP request parameter, which can have values “all”, “lower“, “upper“, “edge“, “between“, “before“, or “after“. More details about this can be found in SOLR-397.
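For example (the core URL and field name below are hypothetical; the facet.date.* parameters existed in Solr 1.4, and facet.date.include is the new one from SOLR-397):

```shell
curl "http://localhost:8983/solr/select?q=*:*&facet=true&facet.date=added_date&facet.date.start=NOW/DAY-30DAYS&facet.date.end=NOW/DAY&facet.date.gap=%2B1DAY&facet.date.include=lower&facet.date.include=edge"
```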
  • Another issue related to date ranges was created.  This one aims to add Date Range QParser to Solr, which would simplify definition of date range filters resulting from date faceting.   This issue is still in its infancy and has no patches attached to it as of this writing, but it looks useful and promising.  When we add date faceting to Search-Lucene.com and Search-Hadoop.com we’ll be looking to use this date range query parser.
  • Some errors in Solr will be much easier to trace after JIRA issue SOLR-1898 gets resolved. Everyone using Solr probably encountered exceptions like: java.lang.NumberFormatException: For input string: “0.0”.  The message itself lacks some crucial details, such as information about the document and field that triggered the exception.  SOLR-1898 will solve that problem, and we are looking forward to this patch!
  • Have you recently been in the situation where you were unsure about which branch or version of Solr you should use on your projects? If yes, you’re certainly not alone! After the recent merge of Solr and Lucene (covered in Solr March Digest and Lucene March Digest), things became confusing, especially for casual observers of Lucene and Solr. Here are some facts about the current state of Solr:
  1. latest stable release version of Solr is still 1.4
  2. 1.4 version was released more than 6 months ago, so many new features, patches and bug fixes aren’t included in it
  3. however, it was a stable release, so if you’re planning to go to production very soon, one low-risk choice would be using version 1.4, on top of which you could apply any patches you find necessary for your deployment
  4. current development is ongoing on trunk (considered unstable and slated to become Solr 4.0) and on a branch named branch_3x. This branch is the most likely candidate for the next version of Solr (to be named 3.1) and is considered a (stable) development version which should be usable, though you have to be careful with your testing, as always
  5. another choice could be an old 1.5 nightly build, but 1.5 is abandoned and, in our opinion, it makes more sense to use nightly builds from branch_3x

Here are a couple of threads where you can get more information:

  1. Lucene 3.x branch created
  2. Which Solr to use?
  • To show one of the dangers of unstable versions, we’ll immediately point to one recently opened JIRA issue related to a “file descriptor leak” while indexing.
  • Although at Sematext we’ve been using Java 6 for a very long time both for our products and with our clients, some people might still be stuck with Java 5. It appears that they will never be able to use Solr 4.0 once it is released, since Solr trunk version now requires Java 6 to compile.

We’ll finish this month’s Solr Digest with two new Solr features:

  • For anyone wanting to use the JSON format when sending documents to Solr, the JSON update handler is now committed to trunk
  • On the other hand, if you need CSV as an output format from Solr, you might benefit from the work on the new CSV Response Writer. Currently, there are no patches attached to the issue, but you can watch it and see when one is added.
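A hedged sketch of posting documents through the JSON update handler (assuming it is mapped at /update/json on trunk; the document fields are made up):

```shell
curl "http://localhost:8983/solr/update/json?commit=true" \
  -H "Content-Type: application/json" \
  -d '[{"id": "doc1", "title": "Hello, JSON update handler"}]'
```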

Thanks for reading another Solr Digest!  Help us spread the word, please Re-Tweet it, and follow @sematext on Twitter.

Nutch Digest, May 2010

With May almost over, it’s time for our regular monthly Nutch Digest. As you know, Nutch became a top-level project and the first related changes are already visible. The Nutch site was moved to its new home at nutch.apache.org and the mailing lists (user- and dev-) have been moved to new addresses: user@nutch.apache.org and dev@nutch.apache.org.  Old subscriptions were automatically moved to the new lists, so subscribers don’t have to do anything (apart from changing mail filters, perhaps).  Even though Nutch is not a Lucene sub-project any more, we will continue to index its content (mailing lists, JIRA, Wiki, web site) and keep it searchable over on Search-Lucene.com.  We’ve helped enough organizations with their Nutch clusters that it makes sense for at least us at Sematext to have a handy place to find all things Nutch.

There is a vote on an updated Release Candidate (RC3) for the Apache Nutch 1.1 release.  The major differences between this release and RC2 are several bug fixes: NUTCH-732, NUTCH-815, NUTCH-814, and NUTCH-812, plus one improvement in NUTCH-816.  Nutch 1.1 is expected shortly!

Among the relevant changes in Nutch development during May, it’s important to note that Nutch will be using Ivy in its builds, and that there is one change regarding Nutch’s language identification: the code for Japanese changed from “jp” to “ja”.  The former is Japan’s country code, while the latter is the language code for Japanese.

There’s been a lot of talk on the Nutch mailing list about the first book on Nutch which Dennis Kubes started writing.  We look forward to reading it, Dennis!

Nutch developers were busy with the TLP-related changes and with preparations for the Nutch 1.1 release this month, so this Digest is a bit thinner than usual.  Follow @sematext on Twitter.

Berlin Buzzwords

June is approaching, and that means Berlin Buzzwords is near.  Here at Sematext we are looking forward to Berlin this year, as we’ll be both at the Cloudera Hadoop training (and getting certified), plus attending Berlin Buzzwords.

With this guest post we are doing our part in helping Berlin Buzzwords organizers spread the word about their conference.

With data storage space getting cheaper and cheaper, the question of how to efficiently analyze data and find information becomes important for a growing number of businesses. Apache Hadoop, Hive, NoSQL, MongoDB, Apache Lucene and CouchDB are just a few terms that spring to mind when thinking of large data. But what are these projects really all about? Who are the developers behind them? What do people build with these software packages? On June 7th and 8th, Berlin Buzzwords, a conference on these exact topics, sets out to bring together users and developers of hot data analysis technologies. There are tracks specific to the three tags: search, store and scale. The conference features a fantastic mixture of developers and users of the open source software projects that make scaling data processing possible today:

  • Christophe Bisciglia from Cloudera is giving an overview of Apache Hadoop from an industry perspective. In addition there are talks by Steve Loughran, Aaron Kimball and Stefan Groschupf.
  • As for Apache Lucene the schedule has talks by Grant Ingersoll, Robert Muir and the “Generics Policeman” Uwe Schindler. Michael Busch from Twitter will explain how real-time search can be implemented with Lucene.
  • For those interested in NoSQL databases there is Mathias Stearn from MongoDB, Jan Lehnardt from CouchDB and Eric Evans, the guy who coined the term NoSQL one year ago.

The schedule is available online: http://berlinbuzzwords.de/content/schedule-published

Regular tickets are available online at http://berlinbuzzwords.de/content/tickets . In addition, student tickets are available: with a valid student ID, participants are eligible for one of these tickets, each costing 100 Euro. There are also special group tickets: if you buy four tickets or more, you are eligible for a 25% discount; when purchasing 10 tickets or more, the discount is 50%.

Looking forward to seeing you in Berlin.

Solr Performance Monitoring Announcement

Update: Sematext now offers SPM – Scalable Performance Monitoring for Solr (as well as for HBase, OS, JVM, etc.). See our Solr Performance Monitoring with SPM blog post.

We’re happy to announce the partnership between Sematext and New Relic.  Over the last few months we have been using New Relic’s RPM Gold Plan for monitoring performance of our own Solr-based services for searching Lucene and Hadoop ecosystems: http://search-lucene.com/ and http://search-hadoop.com/. We found it valuable for understanding (and fixing!) Solr performance bottlenecks and are going to be offering this service to our new and old customers.

We’ll also set up our Lucene/Solr tech support subscribers with New Relic’s RPM.  This will give our tech support team more visibility into customers’ past Solr performance, letting us spot any suspicious trends or errors and quickly understand overall performance trends over time, thus resolving our customers’ issues much more quickly.

You can get more details about the service, and you can get a free, no-strings-attached 30-day Gold plan trial here.

Note: you will NOT be asked for any credit card information and will not have to pay a thing, nor will you have to cancel anything in order to avoid paying for the service.  Signing up via the above link gives you a 30-day free Gold plan trial.  After 30 days the plan simply reverts to the free Lite plan.  If you need more than 30 days to evaluate the service, please let us know.  The free, no-strings-attached 30-day Gold plan trial sign-up is here.  Enjoy!

Mahout Digest, April 2010

April was a busy month here at Sematext, so we are a few days late with our April Mahout Digest, but better late than never!

The Board has approved Mahout to become a Top Level Project (TLP) at the Apache Software Foundation. Check the status thread on changes to the Mahout mailing lists, svn layout, etc. Several discussions on the mailing list, all describing what needs to be done for Mahout to become a Top Level Project, resulted in the Mahout TLP to-do list.

As we reported in the previous Digest, there was a lot of talk on the mailing list about Google Summer of Code (GSoC), so here is a follow-up on that subject. GSoC announcements are up and Mahout got 5 projects accepted this year. Check the full list of GSoC projects and the Mahout community’s reactions on the mailing list.

In the past we have reported about the idea of Mahout/Solr integration, and it seems this is now getting realized. Read more on features and implementation progress of this integration here.

At the beginning of April, there was a proposal to make collections releases independent of the rest of Mahout. After a transition period of loosening the coupling between mahout-collections and the rest of Mahout, mahout-collections was extracted as an independent component.  The vote on the first independent release of collections is in progress. Mahout-collections-1.0 differs from the version released with Mahout 0.3 only by the removed dependency on slf4j.

Mahout components that use Lucene were updated to the latest Lucene release, 3.0.1; the code changes for this migration can be found in this patch.

There was a question that generated a good discussion about cosine similarity between two items and how it is implemented in Mahout. More on this similarity, which is implemented as PearsonCorrelationSimilarity (source) in the Mahout code, can be found in MAHOUT-387.

The goal of clustering is to determine the grouping in a set of data (e.g. a corpus of items, a (search) result set, etc.).  Often, the problem in clustering implementations is that the data to be clustered has a high number of dimensions, which tend to need reducing (to a smaller number of dimensions) in order to make clustering computationally more feasible (read: faster).  The simplest way of reducing those dimensions is to use some sort of a ‘mapping function’ that takes data presented in n-dimensional space and transforms it to data presented in m-dimensional space, where m is lower than n, hence the reduction. Of course, the key here is to preserve the variance of important data features (dimensions) and flatten out unimportant ones. One simple and interesting approach to clustering is the use of several independent hash functions for which the probability of collision is higher for similar items. This approach, called Minhash based clustering, was proposed back in March as part of GSoC (see the proposal).  You’ll find more on the theory behind it and a Mahout-applicable implementation in MAHOUT-344.
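A minimal, self-contained sketch of the idea (plain Python with a made-up linear hash family, not the MAHOUT-344 implementation): the fraction of min-hash positions on which two sets agree estimates their Jaccard similarity, so similar items land in the same buckets with high probability:

```python
import random
import zlib

def make_hash_funcs(n, prime=2**31 - 1, seed=42):
    # A family of random linear hash functions h(x) = (a*x + b) mod p,
    # applied to a deterministic fingerprint (crc32) of each element.
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(n)]
    return [lambda x, a=a, b=b: (a * zlib.crc32(x.encode()) + b) % prime
            for a, b in params]

def signature(item_set, hash_funcs):
    # Min-hash signature: the minimum hash of the set under each function.
    return [min(h(x) for x in item_set) for h in hash_funcs]

hashes = make_hash_funcs(200)
a = {"hadoop", "mahout", "lucene", "solr"}
b = {"hadoop", "mahout", "lucene", "nutch"}
agree = sum(x == y for x, y in zip(signature(a, hashes), signature(b, hashes)))
estimate = agree / len(hashes)   # estimates Jaccard(a, b) = 3/5 = 0.6
```

More hash functions tighten the estimate; in practice signatures are also banded so that only items colliding in some band are compared at all.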

Those interested in Neural Networks should keep an eye on MAHOUT-383, where Neuroph (a lightweight neural net library) and Mahout will get integrated.

This concludes our short April Mahout Digest.  Once Mahout completes its transition to TLP, we expect the project to flourish even more.  We’ll be sure to report on the progress and new developments later this month.  If you are a Twitter user, you can follow @sematext on Twitter.

Elastic Search: Distributed, Lucene-based Search Engine

Here at Sematext we are constantly looking to expand our horizons and acquire new knowledge, especially in and around our main domains of expertise – search and analytics.  (The same goes for our search team – see the Sematext jobs page – we are always on the lookout for people with a passion for search, and, yes, we are heavy users of ElasticSearch!)  That is why today we are talking with Shay Banon about his latest project: ElasticSearch.  If his name sounds familiar to you, that’s likely because Shay is also known for his work on GigaSpaces and Compass.
  • What is the history of Elastic Search in terms of:
  • When you got the idea?
  • How long you had that brewing in your head?
  • When did you first start cutting code for it?
I have been thinking about something along the lines of what elasticsearch has turned out to be for a few years now. As you may know, I am the creator of Compass (http://www.compass-project.org), which I started more than 7 years ago, and the aim of Compass was to ease the integration of search into any Java application.
When I developed Compass, I slowly started to add more and more features to it. For example, Compass, from the get-go, supported mapping of Objects to a Search Engine (OSEM – http://www.compass-project.org/docs/2.2.0/reference/html/core-osem.html). But it also added a JSON to Search Engine mapping layer (JSEM – http://www.compass-project.org/docs/2.2.0/reference/html/core-jsem.html) as JSON was slowly becoming a de-facto standard wire protocol.
Another aspect that I started to tackle was Compass support for a distributed mode. GigaSpaces (http://www.compass-project.org/docs/2.2.0/reference/html/needle-gigaspaces.html), Coherence (http://www.compass-project.org/docs/2.2.0/reference/html/needle-coherence.html), and Terracotta (http://www.compass-project.org/docs/2.2.0/reference/html/needle-terracotta.html) are attempts at solving that. All support using a distributed Lucene Directory implementation (scaling the storage), but, as you know Lucene, sometimes this is not enough. With GigaSpaces, the integration took another step with sharding the index itself and using “map/reduce” to search on nodes.
The last important aspect of Compass is its integration with different mechanisms to handle content and make it searchable. For example, it has very nice integration with JPA (Hibernate, TopLink, OpenJPA), which means any change you make to the database through JPA is automatically indexed. Another integration point was with data grids such as GigaSpaces and Coherence: any change done to them gets applied to the index.
But still, Compass is a library that gets embedded in your Java app, and its solutions for distributed search are far from ideal. So, I started to play around with the idea of creating a Compass Search server that would use its mapping support (JSON) and expose itself through a RESTful API.
Also, I really wanted to try and tackle the distributed search problem. I wanted to create a distributed search solution that is inspired, in its distributed model, from how current state of the art data grids work.
So, about 7 months ago I wrote the first line of elasticsearch and have been hacking on it ever since.
  • The inevitable: How is Elastic Search different from Solr?
To be honest, I never used Solr. When I was looking around for current distributed search solutions, I took a brief look at Solr’s distributed model, and was shocked that this is what people need to deal with in order to build a scalable search solution (that was 7 months ago, so maybe things have changed). While looking at Solr’s distributed model I also noticed the very problematic “REST” API it exposes. I am a strong believer in having the product talk the domain model, and not the other way around. ElasticSearch is very much a domain driven search engine, and I explain it more here: http://www.elasticsearch.com/blog/2010/02/12/yourdatayoursearch.html. You will find this attitude throughout elasticsearch APIs.
  • Is there a feature-for-feature comparison to Solr that would make it easier for developers of new search applications to understand the differences and choose the right tool for the job?
There isn’t one, and frankly, I am not that expert with Solr to create such a list. What I hope is that people who work with both will create such a list, hopefully with input from both projects.
  • When would one want (or need) to use Elastic Search instead of Solr and vice versa?
As far as I am concerned, elasticsearch is being built to be a complete distributed, RESTful search solution, with all the features anyone would ever want from a search server. So, I really don’t see a reason why someone would choose Solr over ElasticSearch. To be honest, with today’s data scale and the slow move to the cloud (or “cloud architectures”), you *need* a search engine that you can scale, and I really don’t understand how one would work with Solr’s distributed model – but that might just be me, spoiled by what I expect from distributed solutions because of my data grid background.
  • What was the reason for simply not working with the Solr community and enhancing Solr? (discussion, patches…)  Are some of Elastic Search’s features simply not implementable in Solr?
First, the challenge. Writing a data grid level distributed search engine is quite challenging to say the least (note, I am not talking about data grid features, such as transactions and so on, just data grids distributed model).
Second, building something on top of an existing codebase will never be as good as building something from scratch. For example, elasticsearch has a highly optimized, asynchronous transport layer to communicate between nodes (which the native Java client uses), and a highly modular core where almost anything is pluggable. These are things that are very hard to introduce or change with an existing codebase and existing developers. It’s much simpler to write it from scratch.
  • We see more and more projects using Github.  What is your reason for choosing Git for SCM and Github for Elastic Search’s home?
Well, there is no doubt that Git is much nicer to work with than SVN thanks to its distributed nature (and I like things that are distributed 🙂 ). As for GitHub, I think that it’s currently the best project hosting service out there. You really feel like the people there know developers and what developers need. As a side note, I am a sucker for eye candy, and it simply looks good.
  • We see the community already created a Perl client.  Are there other client libraries in the works?
Yeah, so there is an excellent Perl client (http://github.com/clintongormley/ElasticSearch.pm) which Clinton Gormley has been developing (he has also been an invaluable source of suggestions/bugs for the development of elasticsearch). There are several more, including Erlang, Ruby, Python and PHP clients (all listed here: http://www.elasticsearch.com/products/). Note, thanks to the fact that elasticsearch has a rich, domain-driven JSON API, writing clients for it is very simple, since most of the time there is no need to perform any mappings, especially with dynamic languages.
  • We realize it is super early for Elastic Search, but what is the biggest known deployment to date?
Yes, it is quite early. But, I can tell you that some not that small sites (10s of millions of documents) are already playing with elasticsearch successfully. Not sure if I can disclose any names, but once they go live, I will try and get them to post something about it.
  • What are Elastic Search future plans, is there a roadmap?
The first things to get into ElasticSearch are more search engine features. The features are derived from those already exposed in Compass, including some new aspects such as geo/local search.
Another interesting aspect is making elasticsearch more cloud-provider friendly. For example, elasticsearch’s persistent store is designed in a write-behind fashion, and I would love to get one that persists the index to Amazon S3 or Rackspace CloudFiles (for more information on how persistence works with elasticsearch, see here: http://www.elasticsearch.com/blog/2010/02/16/searchengine_time_machine.html).
NoSQL is also an avenue that I would love to explore. Similar in concept to how Compass works with JPA / Data Grids, I would like to make the integration of search with NoSQL solutions simpler. It should be as simple as: you do something against the NoSQL solution, and it automatically gets applied to elasticsearch as well. Thanks to the fact that the elasticsearch model is very much domain driven, and very similar to what NoSQL uses, the integration should be simple. As an example, TerraStore already comes with an integration module that applies to elasticsearch any changes done to TerraStore. I blogged my thoughts about search and NoSQL here: http://www.elasticsearch.com/blog/2010/02/25/nosql_yessearch.html.
If you have additional questions for Shay about Elastic Search, please feel free to leave them in comments, and we will ask Shay to use comments to answer them.