Hadoop Digest, June 2010

The Hadoop 0.21 release is getting close: a few blocking issues remain in the Common, HDFS, and MapReduce modules.

Big announcement from Cloudera: CDHv3 and Cloudera Enterprise were released. In CDHv3 beta 2 the following was added:

  • HBase: the popular distributed columnar storage system with fast read-write access to data managed by HDFS.
  • Oozie: Yahoo!’s workflow engine. (op.ed. How many MapReduce workflow engines are there out there?  We know of at least 4-5 of them!)
  • Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.
  • Hue: a graphical user interface to work with CDH. Hue lets developers build attractive, easy-to-use Hadoop applications by providing a desktop-based user interface SDK.
  • ZooKeeper: a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Cloudera Enterprise combines the open source CDHv3 platform with critical monitoring, management and administrative tools. It also enables control of access to the data and resources by users and groups (can be integrated with Active Directory and other LDAP implementations). The bad news is that it isn’t going to be free.

Community trends & news:

  • Amazon Elastic MapReduce now supports Hadoop 0.20, Hive 0.5, and Pig 0.6. See the announcement.
  • Chukwa is going to move to the Apache Incubator to prepare to become a TLP.
  • Using ‘wget’ to download a file from HDFS is explained here.
  • Yahoo!’s backport of security to Hadoop 0.20 is available, including a sandbox VM.
  • Those of you who missed a great webinar from Cloudera, “Top ten tips & tricks for Hadoop success”, can get the slides from here.
  • Twitter intends to open-source Crane, a MySQL-to-Hadoop tool.
  • An interesting talk from Jeff Hammerbacher about analytical data platforms. Don’t forget to read this nice passage dedicated to it.

Notable efforts:

Follow @sematext on Twitter.

Solr Digest, June 2010

We have already written about this month’s news in the Solr world here and here, so you already know that Solr 1.4.1, based on Lucene 2.9.3, was released. Still, one thread from the mailing lists gives some more information about the svn branches and how they relate to Solr versions.

Real-time indexing is again one of the hot topics. We already mentioned the Zoie plugin in the Solr March Digest, so this time we’ll point to an interesting discussion on the mailing lists. If you have followed this topic, you know that the Zoie Solr Plugin is a great plugin for Solr but still has some limitations. For instance, the master-slave architecture (the basis of almost all big Solr deployments) isn’t well suited to Zoie. Lucene 2.9 brought an interesting addition: Near Realtime (NRT) Search capabilities. As you probably know, Solr 1.4 already ran on Lucene 2.9 (2.9.1, to be precise), but NRT support wasn’t implemented. Solr’s next release might have it, since there is a JIRA issue dealing with NRT integration, but don’t hold your breath.

We’ll also mention some new functionality in Solr:

  • Added relevancy function queries – JIRA issue SOLR-1932 adds function queries for relevancy factors such as tf, idf, etc. This issue is already fixed and committed to trunk.
  • Improved Solr response indentation – added with issue SOLR-1933. Previously Solr supported only 7 levels of indentation, and this issue fixes that. The downside is a small increase in response size (since spaces are now used instead of tabs). The fix is already committed, not only to trunk but also to the 3_x branch.
  • Ever wanted to see index files without logging into your servers? This patch will make them visible from Solr admin pages or by using LukeRequestHandlers.
  • Another related issue also got a patch and is already committed to the trunk – SOLR-1946: misc enhancements to SystemInfoHandler. Here is a brief list of additions: include CWD in directory info, include raw bytes version of memory stats, include a list of all system properties.
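
To give a feel for the relevancy factors mentioned in the first item above, here is a minimal Python sketch of the tf and idf formulas used by Lucene's default similarity. It is an assumption that the new function queries return exactly these values; the function names below are illustrative and are not Solr API calls.

```python
import math


def tf(freq: float) -> float:
    """Term-frequency factor: square root of the raw term count in a document."""
    return math.sqrt(freq)


def idf(doc_freq: int, num_docs: int) -> float:
    """Inverse document frequency: 1 + ln(num_docs / (doc_freq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


if __name__ == "__main__":
    # A term that occurs 4 times in a document, and in 9 of 10 documents overall.
    print(tf(4))
    print(idf(9, 10))
```

Rarer terms get a higher idf, while tf grows more slowly than the raw count, so repeating a term many times in one document yields diminishing returns.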

We’ll end with a short overview of interesting issues that are still in development:

  • Use Lucene’s Field Cache To Retrieve Stored Fields From Memory – the issue SOLR-1961 isn’t finished yet, although there is a patch. When it is finished, it might give a new boost to the performance of your Solr server, thanks to developers from Cisco.
  • If you want to track the performance improvements prepared for the 4.0 release, you can just follow JIRA issue SOLR-1965. Some items are already listed there, so you can go and check what is in store for future versions.
  • For anyone using PHP to talk to Solr, there is a new PHP Response Writer – currently, it is available as a Jar that has to be added to Solr’s classpath. For more details, check the JIRA issue comments.
  • Field collapsing is one of the longest-standing unresolved issues in the Solr world. SOLR-236 (many people probably easily recognize this JIRA issue number :)) was created more than 3 years ago and over time it has grown into a “monster” – a huge number of comments, patches, problems, parameters… you name it. Integrating it with your Solr version was never fun (we tried it!). New hope appeared on the field collapsing horizon with the opening of SOLR-1682 (that’s a new JIRA issue for you to commit to your memory!). Some work had already been done there in the past, but now Yonik has decided to dedicate some of his time to this issue, which means we might soon have a non-monster implementation that will be committed to Solr.

That’s all for this month. As you can see, the Solr May Digest made no mention of the new 1.4.1 release, but it happened, almost unexpectedly. So stay tuned (and follow @sematext) – you never know if something unexpected might happen this month too…

HBase Digest, June 2010

HBase 0.20.5 is out! It fixes 24 issues since the 0.20.4 release. HBase developers “recommend that all users, particularly those running 0.20.4, upgrade to this release”.

Community trends:

  • There’s a clear need for a “sanity check DNS across my cluster” tool, as a lot of questions and help requests related to name/address resolution in the cluster are submitted over time. Any volunteers?
  • The bulk incremental load into an existing table feature (HBASE-1923) is committed to trunk. Still no multi-family support.
  • A good deal of advice about increasing write performance/speed is in this thread, including shared numbers/techniques from a large production cluster.
  • A set of ORM tools to consider for HBase is suggested here.
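
As a starting point for the DNS sanity-check tool asked for above, here is a minimal Python sketch. It assumes that the checks worth making are: the forward lookup succeeds, the address is not a loopback one, and the reverse lookup returns the same name. The function names and the injectable `forward`/`reverse` parameters are illustrative, not part of HBase or Hadoop.

```python
import socket
from typing import Callable, Dict, Iterable, List


def check_host(
    hostname: str,
    forward: Callable = socket.gethostbyname,
    reverse: Callable = socket.gethostbyaddr,
) -> List[str]:
    """Return a list of DNS problems found for one hostname (empty if none)."""
    try:
        ip = forward(hostname)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return [f"{hostname}: forward lookup failed ({exc})"]

    problems = []
    if ip.startswith("127."):
        problems.append(f"{hostname}: resolves to loopback address {ip}")

    try:
        reverse_name = reverse(ip)[0]  # (name, aliases, addresses)
    except OSError as exc:  # socket.herror is a subclass of OSError
        problems.append(f"{hostname}: reverse lookup of {ip} failed ({exc})")
        return problems

    if reverse_name.lower().rstrip(".") != hostname.lower().rstrip("."):
        problems.append(f"{hostname}: {ip} maps back to {reverse_name}")
    return problems


def check_cluster(
    hostnames: Iterable[str],
    forward: Callable = socket.gethostbyname,
    reverse: Callable = socket.gethostbyaddr,
) -> Dict[str, List[str]]:
    """Check every hostname and return only the ones with problems."""
    report = {}
    for hostname in hostnames:
        problems = check_host(hostname, forward, reverse)
        if problems:
            report[hostname] = problems
    return report
```

Calling `check_cluster(["node1.example.com", "node2.example.com"])` (hypothetical hostnames) returns a dictionary of problems per host. Since each machine may resolve names differently, a real tool would need to run this on every node and compare the results.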

Notable efforts:


  • Common issue: tables/data disappear after a system restart. People usually face it when playing with HBase for the first time, even on a single-node setup. The problem is that by default HDFS is configured to store its data in the /tmp dir, which might get cleaned up by the OS. Configure the “dfs.name.dir” and “dfs.data.dir” properties in hdfs-site.xml to avoid these problems.
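
The two properties can be sketched as follows. This small Python script only prints the XML fragment that would go into hdfs-site.xml; the directory paths are hypothetical placeholders, so pick locations on your own machines that the OS does not clean up.

```python
import xml.etree.ElementTree as ET


def hdfs_site(name_dir: str, data_dir: str) -> str:
    """Build a <configuration> fragment that sets dfs.name.dir and dfs.data.dir."""
    configuration = ET.Element("configuration")
    for name, value in (("dfs.name.dir", name_dir), ("dfs.data.dir", data_dir)):
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(configuration, encoding="unicode")


# Hypothetical paths outside /tmp, used for illustration only.
print(hdfs_site("/var/hadoop/dfs/name", "/var/hadoop/dfs/data"))
```

Note that pointing these properties at new, empty directories does not move existing data; on a fresh install the NameNode has to be formatted before first use.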