Registration is open - Live, Instructor-led Online Classes - Elasticsearch in March - Solr in April - OpenSearch in May. See all classes


Hadoop Digest, May 2010

Big news: HBase and Avro have become Apache’s Top Level Projects (TLPs)! The initial discussion happened when our previous Hadoop Digest was published, so you can find links to the threads there. The question of whether to become a TLP or not caused some pretty heated debates in Hadoop subprojects’ communities.  You might find it interesting to read the discussions of the vote results for HBase and Zookeeper. Chris Douglas was kind enough to sum up the Hadoop subprojects’ response to becoming a TLP in his post. We are happy to say that all subprojects which became TLP are still fully searchable via our search-hadoop.com service.

More news:

  • Great! Google granted MapReduce patent license to Hadoop.
  • Chukwa team announced the release of Chukwa 0.4.0, their second public release. This release fixes many bugs, improves documentation, and adds several more collection tools, such as the ability to collect UDP packets.
  • HBase 0.20.4 was released. More info in our May HBase Digest!
  • New Chicago area Hadoop User Group was organized.

Good-to-know nuggets shared by the community:

  • Dedicate a separate partition to Hadoop file space – do not use the “/” (root) partition. Setting dfs.datanode.du.reserved property is not enough to limit the space used by Hadoop, since it limits only HDFS usage, but not MapReduce’s.
  • Cloudera’s Support Team shares some basic hardware recommendations in this post. Read more on proper dedicating & counting RAM for specific parts of the system (and thus avoiding swapping) in this thread.
  • Find a couple of pieces of advice about how to save seconds when you need a job to be completed in tens of seconds or less in this thread.
  • Use Combiners to increase performance when the majority of Map output records have the same key.
  • Useful tips on how to implement Writable can be found in this thread.

Notable efforts:

  • Cascalog: Clojure-based query language for Hadoop inspired by Datalog.
  • pomsets: computational workflow management system for your public and/or private cloud.
  • hiho: a framework for connecting disparate data sources with the Apache Hadoop system, making them interoperable

FAQ

  1. How can I attach external libraries (jars) which my jobs depend on?
    You can put them in a “lib” subdirectory of your jar root directory. Alternatively you can use DistributedCache API.
  2. How to Recommission DataNode(s) in Hadoop?
    Remove the hostname from your dfs.hosts.exclude file and run ‘hadoop dfsadmin -refreshNodes‘. Then start the DataNode process in the ‘recommissioned’ DataNode again.
  3. How to configure log placement under specific directory?
    You can specify the log directory in the environment variable HADOOP_LOG_DIR. It is best to set this variable in bin/hadoop-env.sh.

Thank you for reading us, and if you are a Twitter addict, you can now follow @sematext, too!

Start Free Trial