Hadoop Digest, August 2010

The biggest announcement of the year: Apache Hadoop 0.21.0 has been released and is available for download here. Over 1300 issues have been addressed since 0.20.2; you can find details for Common, HDFS and MapReduce. A note from Tom White, who did an excellent job as release manager: “Please note that this release has not undergone testing at scale and should not be considered stable or suitable for production. It is being classified as a minor release, which means that it should be API compatible with 0.20.2.” Please find a detailed description of what’s new in the 0.21.0 release here.

Community trends & news:

  • New branch hadoop-0.20-security is being created. Apart from the security features, which are in high demand, it will include improvements and fixes from over 12 months of work by Yahoo!. The new security features are going to be a very valuable and welcome contribution (also discussed before).
  • A thorough discussion about approaches to backing up HDFS data can be found in this thread.
  • Hive voted to become Top Level Apache Project (TLP) (also here).  Note that we’ll keep Hive under Search-Hadoop.com even after Hive goes TLP.
  • Pig voted to become TLP too (also here).  Note that we’ll keep Pig under Search-Hadoop.com even after Pig goes TLP.
  • Tip: if you define a Hadoop object (e.g. Partitioner) as implementing Configurable, then its setConf() method will be called once, right after it gets instantiated.
  • For those new to ZooKeeper and pressed for time, here you can find the shortest ZooKeeper description — only 4 sentences short!
  • Good read: the “Avoiding Common Hadoop Administration Issues” article.

Notable efforts:

  • Howl: Common metadata layer for Hadoop’s Map Reduce, Pig, and Hive (yet another contribution from Yahoo!)
  • PHP library for Avro, includes schema parsing, Avro data file and string IO.
  • avro-scala-compiler-plugin: aimed to auto-generate Avro serializable classes based on some simple case class definitions

FAQ:

  • How to programmatically determine the names of the files in a particular Hadoop/HDFS directory?
    Use the FileSystem and FileStatus APIs. Detailed examples are in this thread.
  • How to restrict HDFS space usage?
    Please refer to the HDFS Quotas Guide.
  • How to pass parameters determined at run-time (i.e. not hard-coded) to Hadoop objects (like Partitioner, Writable, etc.)?
    One option is to define a Hadoop object as implementing Configurable. In this case its setConf() method will be called once, right after it gets instantiated and you can use “native” Hadoop configuration for passing parameters you need.
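
The Configurable approach from the last answer can be sketched roughly as follows. To keep the example runnable without Hadoop on the classpath, it declares two tiny stand-in types, MiniConf and MiniConfigurable; these are hypothetical and only mirror the shape of Hadoop’s org.apache.hadoop.conf.Configuration and org.apache.hadoop.conf.Configurable. The class names, the newInstance helper and the example.prefix.length property are likewise made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for org.apache.hadoop.conf.Configuration.
class MiniConf {
    private final Map<String, String> values = new HashMap<>();

    void set(String key, String value) { values.put(key, value); }

    int getInt(String key, int defaultValue) {
        String v = values.get(key);
        return v == null ? defaultValue : Integer.parseInt(v);
    }
}

// Hypothetical stand-in for org.apache.hadoop.conf.Configurable.
interface MiniConfigurable {
    void setConf(MiniConf conf);
    MiniConf getConf();
}

// A partitioner-like object whose behaviour depends on a run-time parameter.
class PrefixPartitioner implements MiniConfigurable {
    private MiniConf conf;
    private int prefixLength;

    @Override
    public void setConf(MiniConf conf) {
        this.conf = conf;
        // "example.prefix.length" is a made-up property name.
        this.prefixLength = conf.getInt("example.prefix.length", 1);
    }

    @Override
    public MiniConf getConf() { return conf; }

    // Keys sharing the same prefix end up in the same partition.
    int getPartition(String key, int numPartitions) {
        String prefix = key.substring(0, Math.min(prefixLength, key.length()));
        return (prefix.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

class ConfigurableSketch {
    // Mimics the framework: instantiate reflectively, then call setConf() once.
    static <T> T newInstance(Class<T> type, MiniConf conf) {
        try {
            T instance = type.getDeclaredConstructor().newInstance();
            if (instance instanceof MiniConfigurable) {
                ((MiniConfigurable) instance).setConf(conf);
            }
            return instance;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot instantiate " + type, e);
        }
    }

    public static void main(String[] args) {
        MiniConf conf = new MiniConf();
        conf.set("example.prefix.length", "3"); // value decided at run time
        PrefixPartitioner p = newInstance(PrefixPartitioner.class, conf);
        // Both keys share the 3-character prefix "use", so this prints true.
        System.out.println(p.getPartition("user-42", 4) == p.getPartition("user-99", 4));
    }
}
```

In a real job the driver would put the value into the job configuration before submission, and the framework, not your own code, would take care of calling setConf().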

HBase Digest, August 2010

The second “developer release”, hbase-0.89.201007d, is now available for download. To remind everyone, there are currently two active branches of HBase:

  • 0.20 – the current stable release series, being maintained with patches for bug fixes only.
  • 0.89 – a development release series with active feature and stability development, not currently recommended for production use.

The first one doesn’t support HDFS durability (edits may be lost in the case of node failure), whereas the second one does. You can find more information at this wiki page.  The HBase 0.90 release may happen in October!  See info from developers.

Community trends & news:

  • New HBase AMIs are available for dev release and 0.20.6.
  • Looking for some GUI that could be used for browsing through tables in HBase? Check out Toad for Cloud, watch for HBase-Explorer and HBase-GUI-Admin.
  • How many regions can a RegionServer support, and what are the consequences of having lots of regions in a RegionServer? Check info in this thread.
  • Some more complaints to be aware of regarding HBase performing on EC2 in this thread. For those who missed it, more on Hadoop & HBase reliability with regard to EC2 in our March digest post.
  • Need guidance in sizing your first Hadoop/HBase cluster? This article will be helpful.

FAQ:

  • Where can I find information about data model design with regard to HBase?
    Take a look at http://wiki.apache.org/hadoop/HBase/HBasePresentations.
  • How can I perform SQL-like query “SELECT … FROM …” on HBase?
    First, consider that HBase is a key-value store which should be treated accordingly. But if you are still up for writing ad-hoc queries in your particular situation take a look at Hive & HBase integration.
  • How can I access Hadoop & HBase metrics?
    Refer to HBase Metrics documentation.
  • How to connect to HBase from a Java app running on a machine remote to the cluster?
    Check out client package documentation. Alternatively, one can use the REST interface: Stargate.

Solr Digest, August 2010

August brought a lot of activity to the Solr world. There were many important developments, so we again compiled the most interesting ones for you, grouped into 4 categories:

Some new (and already committed) features

  • We already wrote about new work done on CollapsingComponent in June’s digest under SOLR-1682. A lot of work was done on this component and it appears that it is very close to being committed. Patches attached to the issue are functional, so you can give it a try.
  • SpellCheckComponent got an improvement related to recent Lucene changes – Add support for specifying Spelling SuggestWord Comparator to Lucene spell checkers for SpellCheckComponent. Issue SOLR-2053 is already fixed; a patch is attached if you need it, but it is also committed to trunk and the 3_x branch.
  • Another minor feature is an improvement of WordDelimiterFilter in SOLR-2059 (Allow customizing how WordDelimiterFilter tokenizes text). The patch is already there and committed to trunk and 3_x.
  • A performance boost for faceting can be found in SOLR-2089 (Faceting: order term ords before converting to values). Behind this intimidating title hides a very decent speedup in cases when facet.limit is high. A patch is available, and trunk and branch 3_x also got this magic committed.

Some new features being discussed and implemented

  • One very important (and probably much wanted) feature just got its Jira issue – SOLR-2080 (Create a Related Search Component). The issue was created by Grant Ingersoll, so we can expect some quality work to be done here. There are no patches (or even discussions) yet as the issue is in its infancy, but you can watch its progress in Jira. In the meantime, if you’re interested in such functionality, you can check Sematext’s RelatedSearches product.
  • Jira issue SOLR-2026 (Need infrastructure support in Solr for requests that perform multiple sequential queries) might add some interesting capabilities to search components, especially if you’re writing some of them on your own. We at Sematext have plenty of experience with writing custom Solr components (check, for instance, our DYM ReSearcher or its Relaxer sibling), so we know that sometimes it is not a very pleasant task. If Solr gets better support for execution of multiple queries during a single request, writing custom components will become easier. One patch is already posted to this issue, so you can check it out; however, it is still unclear in which way this feature will evolve. We’re hoping for a flexible and comprehensive solution which would be easily extensible to many other features.
  • Defining QueryComponent’s default query parser can be made configurable with the patch attached to the issue SOLR-2031. You probably didn’t encounter many cases where you needed this functionality, but if you needed it, you had a problem before, and now that problem will become history.
  • It appears that QueryElevationComponent might get an improvement: Distinguish Editorial Results from “normal” results in the QueryElevationComponent. Jira issue SOLR-2037 will be the place to watch the progress.

Some newly found bugs

  • DataImportHandler has a bug – Multivalued fields with dynamic names does not work properly with DIH – the fix isn’t available yet, but if you have such problems, you can check the status here.
  • Another bug in DataImportHandler points to a connection leak issue – DIH doesn’t release JDBC connections in conjunction with DB2. There is no fix at the moment but, as usual, you can check the status in Jira.

Other interesting news

  • One potentially useful tool we recommend checking is SolrMeter. It is a standalone tool for stress testing your Solr. From their site: The main goal of this open source project is to bring to the solr user community a “generic tool to interact specifically with solr”, firing queries and adding documents to make sure that your Solr implementation will support the real use. With SolrMeter you can simulate your work load over solr index and retrieve statistics graphically.
  • In which IDEs do you work with Solr/Lucene? Here at Sematext, we use both Eclipse and IntelliJ IDEA. If you use the latter and you want to set up Lucene or Solr in it, you can check a very useful description and patch in LUCENE-2611 IntelliJ IDEA setup.

We hope you enjoyed another Solr Digest from @sematext.  Come back and read us next month!

Solr Digest, July 2010

As usual, July is one of the slower months in the Solr world; however, we managed to find a few interesting topics for our readers.

  • An interesting feature might be added with SOLR-1979 (Create LanguageIdentifierUpdateProcessor). It would provide the ability to handle text in different languages differently (think about stemming in analysis, for instance) and to do it automatically. This issue was just created, so the work on it and any usable patches are coming some time in the future. However, if you need something working now, Sematext has a few products for similar multilingual functionality, for instance, Multilingual Indexer or its cousin Language Identifier.
  • Another interesting feature might come with SOLR-1980 (Implement boundary match support). This will enable one to specify that a query should match only at the start or at the end of the field (or be an exact match), not somewhere in the middle, which could provide more relevant search results in some specific cases. This issue is also in its infancy and has no patches yet, so we’ll have to wait and see how it progresses.
  • Ever wanted Solr to store as the value of some field something other than the raw input value (remember, when you search Solr, you search on analyzed and indexed values; when you fetch the content of some field, you get the raw input value added to that field, not its analyzed version)? A patch for that already exists in one rather fresh JIRA issue – SOLR-1997 (Store internal value instead of input one).
  • Getting ready to start using Solr, but are unsure about which version you should use? Don’t worry, confusion about Solr’s version started this spring (see Solr May 2010 Digest), but things stabilized lately. The latest release is the fairly recent 1.4.1, which is basically 1.4 version with many bugfixes. The next release version is 3.1 which can be found on branch_3x branch. You can find its nightly build versions here. The trunk is still used for “unstable” development and the future 4.0 version. To get more information, check these recent threads on the Solr mailing list: here and here.
  • Many will probably agree that Solr’s SpellCheckComponent isn’t very useful in real-life applications. One of the main problems is that it handles multi-word queries poorly: it creates its suggestion as a collated version of the best suggestion for each word of the query, so you often get suggestions which have 0 hits. Also, it doesn’t return important information about the suggested query, like how many hits such a query would generate and what results it would give. Some of these issues could be fixed some day with SOLR-2010 (Improvements to SpellCheckComponent Collate functionality). The first version of the patch is already provided. However, if you’d like to use such functionality in your Solr production today, you might consider a much more sophisticated and production-ready component developed by Sematext – DYM ReSearcher – you can see DYM ReSearcher in action on Search-Lucene.com, for example.
  • One minor piece of functionality was added to QueryElevationComponent – Add option to return only the specified results. It was added with JIRA issue SOLR-1966 and is already committed to 3.x and trunk.
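
The zero-hit collation problem mentioned in the SpellCheckComponent item above can be illustrated with a small toy model. This is only a sketch: the three “documents”, the misspelled query and the per-word suggestions are made-up data, and the class is not how Solr implements spell checking. It merely shows why correcting each word independently can produce a query that matches nothing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CollationSketch {
    // Toy "index": each document is just a set of terms (made-up data).
    static final List<Set<String>> DOCS = new ArrayList<>();
    static {
        DOCS.add(new HashSet<>(Arrays.asList("apache", "solr")));
        DOCS.add(new HashSet<>(Arrays.asList("apache", "lucene")));
        DOCS.add(new HashSet<>(Arrays.asList("solar", "panel")));
    }

    // Number of documents that contain every term of the query (AND semantics).
    static int hits(List<String> query) {
        int count = 0;
        for (Set<String> doc : DOCS) {
            if (doc.containsAll(query)) {
                count++;
            }
        }
        return count;
    }

    // Naive collation: replace each word by its own best suggestion,
    // without checking whether the combined query matches anything.
    static List<String> collate(List<String> query, Map<String, String> bestSuggestion) {
        List<String> collated = new ArrayList<>();
        for (String word : query) {
            collated.add(bestSuggestion.getOrDefault(word, word));
        }
        return collated;
    }

    // Corrects the misspelled query "lucen panle" word by word.
    static List<String> demoCollation() {
        Map<String, String> best = new HashMap<>();
        best.put("lucen", "lucene");
        best.put("panle", "panel");
        return collate(Arrays.asList("lucen", "panle"), best);
    }

    public static void main(String[] args) {
        List<String> collated = demoCollation();
        // Each corrected word exists in the index, but never in the same document.
        System.out.println(collated + " -> " + hits(collated) + " hits");
    }
}
```

A collation step that re-ran hits() on each candidate before returning it would avoid suggesting such dead-end queries, at the cost of extra lookups.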

We hope that this was enough to satisfy your Solr appetite.  Hopefully, we’ll dig more interesting topics for you in August.  Until then you can keep up with us via @sematext on Twitter.

Hadoop Digest, July 2010

Strong moves towards the 0.21 Hadoop release “detected”: 0.21 Release Candidate 0 was out and tested. A number of issues were identified and with it the roadmap to the next candidate is set. Tom White has been hard at work and is acting as the release engineer for the 0.21 release.

Community trends and discussions:

  • Hadoop Summit 2010 slides and videos are available here.
  • In case you’re at the design stage of your Hadoop cluster aimed at work with text-based and/or structured data, you should read the “Text files vs SequenceFiles” thread.
  • Thinking of decreasing HDFS replication factor to 2? This thread might be useful to you.
  • Managing workflows of Sqoop, Hive, Pig, and MapReduce jobs with Oozie (Hadoop workflow engine from Yahoo!) is explained in this post.
  • The 2nd edition of “Hadoop: The Definitive Guide” is now in Production.  Again, Tom White in action.

Small FAQ:

  • How do you efficiently process (large) XML documents in Hadoop MapReduce?
    Take a look at Mahout’s XmlInputFormat in case StreamXmlRecordReader doesn’t do a good job for you. The former one got a lot of positive feedback from the community.
  • What are the ways of importing data to HDFS from remote locations? I need this process to be well-managed and automated.
    Here are just some of the options. First, you should look at the available HDFS shell commands. For large inter/intra-cluster copying, distcp might work best for you. For moving data from an RDBMS, you should check Sqoop. To automate moving (constantly produced) data from many different locations, refer to Flume. You might also want to look at Chukwa (data collection system for monitoring large distributed systems) and Scribe (server for aggregating log data streamed in real time from a large number of servers).

Hey, follow @sematext if you are on Twitter and RT!

Hadoop Digest, June 2010

Hadoop 0.21 release is getting close: a few blocking issues remain in Common, HDFS and MapReduce modules.

Big announcement from Cloudera: CDHv3 and Cloudera Enterprise were released. In CDHv3 beta 2 the following was added:

  • HBase: the popular distributed columnar storage system with fast read-write access to data managed by HDFS.
  • Oozie: Yahoo!’s workflow engine. (op.ed. How many MapReduce workflow engines are there out there?  We know of at least 4-5 of them!)
  • Flume: a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows.
  • Hue: a graphical user interface to work with CDH. Hue lets developers build attractive, easy-to-use Hadoop applications by providing a desktop-based user interface SDK.
  • ZooKeeper: a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Cloudera Enterprise combines the open source CDHv3 platform with critical monitoring, management and administrative tools. It also enables control of access to the data and resources by users and groups (can be integrated with Active Directory and other LDAP implementations). The bad news is that it isn’t going to be free.

Community trends & news:

  • Amazon Elastic MapReduce now supports Hadoop 0.20, Hive 0.5, and Pig 0.6. Please see the announcement.
  • Chukwa is going to move to the Apache Incubator to prepare to become a TLP.
  • Using ‘wget’ to download a file from HDFS is explained here.
  • Yahoo’s back port of security into Hadoop 0.20 is available including a sandbox VM.
  • Those of you who missed a great webinar from Cloudera, “Top ten tips tricks for Hadoop success” can get the slides from here.
  • Twitter intends to open-source Crane: MySQL-to-Hadoop tool.
  • Interesting talk from Jeff Hammerbacher about analytical data platforms. Don’t forget to read this nice passage dedicated to it.

Notable efforts:

Follow @sematext on Twitter.

Solr Digest, June 2010

We have already written about news in the Solr world this month here and here, so you already know that Solr 1.4.1 was released, based on Lucene 2.9.3. Still, one thread from the mailing lists gives some more info about svn branches and how they are related to Solr versions.

Real Time indexing is again one of the hot topics. We already mentioned the Zoie plugin in Solr March Digest, so this time we’ll point to an interesting discussion on the mailing lists. In case you followed this topic, Zoie Solr Plugin is a great plugin for Solr, but still has some limitations. For instance, master-slave architecture (which is the base of almost all big Solr deployments) isn’t well suited for Zoie. Version 2.9 of Lucene brought the interesting addition of Near Realtime Search capabilities. As you probably already know, the Solr 1.4 release was already running on Lucene 2.9 (2.9.1, to be precise), but support for NRT wasn’t implemented. Solr’s next release might have it since there is a JIRA issue dealing with NRT integration, but don’t hold your breath.

We’ll also mention some new functionalities in Solr:

  • Added relevancy function queries – JIRA issue SOLR-1932 adds function queries for relevancy factors such as tf, idf, etc. This issue is already fixed and committed to trunk.
  • Improved Solr response indentation – added with issue SOLR-1933. Solr only supported 7 levels of indenting previously, and this issue removes that limit. The downside is a small increase in response size (since blank spaces will be used instead of tabs). The fix is already committed, not only to trunk but also to the 3_x branch.
  • Ever wanted to see index files without logging into your servers? This patch will make them visible from Solr admin pages or by using LukeRequestHandlers.
  • Another related issue also got a patch and is already committed to the trunk – SOLR-1946 (misc enhancements to SystemInfoHandler). Here is a brief list of additions: include CWD in directory info, include raw bytes version of memory stats, include a list of all system properties.

We’ll end with the short overview of interesting issues which are still in development:

  • Use Lucene’s Field Cache To Retrieve Stored Fields From Memory – the issue SOLR-1961 isn’t finished yet, although there is a patch. When it is finished, it might give a new boost to the performance of your Solr server, thanks to developers from Cisco.
  • If you want to track performance improvements prepared for 4.0 release, you can just follow JIRA issue SOLR-1965. Some stuff is already listed there, so you can go and check what is in store for the future versions.
  • For anyone using PHP to talk to Solr, there is a new PHP Response Writer – currently, it is available as a Jar that has to be added to your Solr’s classpath. For more details check JIRA issue comments.
  • Field collapsing is one of the longest still unresolved issues in Solr world. SOLR-236 (many people probably easily recognize this JIRA issue number :)) was created more than 3 years ago and during the time it has grown into a “monster” – huge number of comments, patches, problems, parameters… you name it.  Integrating it with your Solr version was never fun (we tried it!). New hope appeared on the field collapsing horizon with the opening of SOLR-1682 (that’s a new JIRA issue for you to commit to your memory!). Some work had already been done there in the past, but now Yonik decided to dedicate some of his time to this issue, which means we might soon have a non-monster implementation that will be committed to Solr.

That’s all for this month. As you can see, in Solr May Digest there was no mention of the new 1.4.1 release, but it happened, almost unexpectedly. So stay tuned (and follow @sematext) – you never know if something unexpected might happen this month too…

HBase Digest, June 2010

HBase 0.20.5 is out! It fixes 24 issues since the 0.20.4 release. HBase developers “recommend that all users, particularly those running 0.20.4, upgrade to this release”.

Community trends:

  • There’s a clear need for a “sanity check DNS across my cluster” tool, as a lot of questions/help requests related to name/address resolution in the cluster are submitted over time. Any volunteers?
  • The feature for bulk incremental load into an existing table (HBASE-1923) is committed to trunk. Still no multi-family support.
  • A good deal of advice on increasing write performance/speed can be found in this thread, including shared numbers/techniques from a large production cluster.
  • A set of ORM tools to consider for HBase is suggested here.

Notable efforts:


  • Common issue: tables/data disappear after a system restart. Usually people face it when playing with HBase for the first time, even on a single-node set-up. The problem is that by default HDFS is configured to store its data in the /tmp dir, which might get cleaned up by the OS. Configure the “dfs.name.dir” and “dfs.data.dir” properties in hdfs-site.xml to avoid these problems.
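
As a rough illustration of the fix above, an hdfs-site.xml fragment could look like the following. The two property names come from the tip itself; the directory paths are placeholders we made up, so substitute locations that exist on your machines and survive a reboot.

```xml
<!-- hdfs-site.xml: keep NameNode and DataNode storage out of /tmp -->
<configuration>
  <property>
    <name>dfs.name.dir</name>
    <!-- placeholder path, pick your own -->
    <value>/var/lib/hadoop/dfs/name</value>
  </property>
  <property>
    <name>dfs.data.dir</name>
    <!-- placeholder path, pick your own -->
    <value>/var/lib/hadoop/dfs/data</value>
  </property>
</configuration>
```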

Lucene Digest, May 2010

Last month we were busy with work and didn’t publish our monthly Lucene Digest.  To make up for it, this month’s Lucene Digest really covers all Lucene developments in May from A to Z.

  • Mark Harwood had a busy month.  In LUCENE-2454 he contributed a production-tested and often-needed functionality for properly indexing parent-child entities (or, more generally, any form of hierarchical search).  He introduced his work in Adding another dimension to Lucene searches.  Joaquin Delgado has been talking about the merge of unstructured and structured search (not surprising, considering his old company with Lucene-based Federated Search product got acquired by Oracle several years ago!), so he quickly related this to ability to perform XQuery + Full-Text searches.  MarkLogic, watch your back! 😉
  • Mark also contributed a match spotter for all query types in LUCENE-1999.  This patch makes it possible to figure out which field(s) a match for a particular hit was on, which is functionality people ask about on Lucene and Solr mailing lists every so often.  Warning, though: spotting the match and encoding it causes some score precision loss.
  • While Lucene already has TimeLimitedCollector, it’s not perfect and offers room for improvement.  Back in 2009, Mark came up with TimeLimitedIndexReader, as you can tell from his messages in Improving TimeLimitedCollector thread and created a patch with it in LUCENE-1720, which filled some of the TimeLimitedCollector’s gaps:
    • Any reader activity can be time-limited rather than just single searches e.g. the document retrieve phase.
    • Times out faster (i.e. runaway queries such as fuzzies detected quickly before last “collect” stage of query processing)
  • Robert Muir, who gave a well-received presentation on Finite State Queries in Lucene at New York Search & Discovery Meetup (see slides) back in April 2010, has been busy consolidating Lucene and Solr analyzers, tokenizers, token filters, character filters, etc. and moving them to their new home: modules/analysis, under Lucene root dir in svn.  The plan is to produce separate and standalone artifacts (read: jars) for this analysis module.  Here at Sematext we will make use of this module immediately for some of our products that currently list Lucene as a dependency, even though they really only need Lucene’s analyzers.  Solr, too, will be another customer for the new analysis module, as described by Robert in solr and analyzers module (yes, we’re showing off Search-Lucene.com’s in-document search-term highlighting, which we find very useful).
  • Robert also worked on and committed an ICU-based tokenizer for Unicode-based text segmentation in LUCENE-2414.  This translates to having the ability to properly tokenize text that doesn’t use spaces as token separators.  If you’ve ever had to deal with searching Chinese, for example, you’ll know that word segmentation is one of the initial challenges one has to deal with.
  • Talking about splitting on space, another task Robert took upon himself was to stop Lucene QueryParser from splitting queries on space: LUCENE-2458.  This problem of tokenizing queries on space comes up quite often, so this is going to be a very welcome improvement in Lucene.
  • One day Robert was super bored, so he decided to write a Lucene analyzer for Indonesian: LUCENE-2437.
  • Andrzej and Israel Ekpo (the author of one of the Solr PHP clients) both decided to add support for search-time bitwise operations on integer fields around the same time.  Israel’s work is in LUCENE-2460, with an accompanying SOLR-1913 issue, while Andrzej’s is in SOLR-1918 and has no pure Lucene patch.  The difference is that Israel’s patch offers only filtering, while Andrzej’s patch performs scoring, which allows finding the best matching inexact bit patterns. This has applications in e.g. near-duplicate detection.
  • In one of our current engagements we are working with a large, household-name organization and a big U.S. government contractor.  Their index is heavily sharded and is well over 2 TB.  Working with such large indices is no joke (though I’m happy to say we were able to immediately improve their search performance by 40% in the first performance tuning iteration). What if we could make their indices smaller?  Would that make their search even faster?  Of course!  In LUCENE-1812 (nice number), Andrzej implemented a static index pruning tool that removes posting data from indices for terms with in-document frequency lower than some threshold.  We haven’t used this tool, and it looks like we may not use it for a while, because IBM apparently holds a patent on the exact same algorithm used in this tool.
  • Phrase queries got a little performance boost in LUCENE-2410.  Every little bit counts!
  • Tom Burton-West created and contributed a handy tool that outputs total term frequency and document frequency from a Lucene index: LUCENE-2393.  This tool can be handy for estimating sizes of some of the Lucene index files, and thus getting a better grasp on disk IO needs.
  • On both Lucene and Solr lists we often see people asking about updating individual Document fields instead of simply deleting and re-adding the whole Document.  The delete and re-add approach is not necessarily a problem for Lucene/Solr, but it can be for an external system from which data for the complete new Document needs to be fetched.  Shai Erera, another recently added Lucene committer, proposed a new approach for incremental field updates that was well received.  Once implemented, this will be a big boon for Lucene and Solr!  If that thread or message is too long for you to read, let us at least highlight (pun intended) the two great use cases from this message.
  • Lucandra is a Cassandra backend for Lucene.  But no, it’s not a Lucene Directory implementation.  Lucandra has its own IndexReader and IndexWriter that read from Cassandra and write to it.  But in LUCENE-2456 we now have another option: a Cassandra-based Lucene Directory.  We hope to have a post on this in the near future!
  • The author of Cassandra-based Lucene Directory also opened LUCENE-2425 for Anti-Merging Multi-Directory Indexing Framework that splits an index (at run-time) into multiple sub-indices, based on a “split policy”, several of which have also been added to Lucene’s JIRA.  This is somewhat similar to Lucene’s ParallelWriter, but has some differences, as described in the issue.
  • Michael McCandless is working on prototyping a multi-stage pipeline sub-system that aims to further decouple analysis from indexing.  In this pipeline, indexing would be just one step, one stage in the pipeline.  Based on the work done so far, this may even bring some performance improvements.
  • LUCENE-2295 added a LimitTokenCountAnalyzer / LimitTokenCountFilter to wrap any other Analyzer and provide the same functionality as MaxFieldLength provided on IndexWriter.
  • Shay Banon, the author of Elastic Search, contributed LUCENE-2468 (can you complete this hard to figure out numeric sequence?), which allows one to specify how new Document deletions should be handled in CachingWrapperFilter and CachingSpanFilter.  We recently did work for another large organization and a household name (in the U.S. at least) where we improved their Lucene-based search performance by over 30%.  One of the things we did was making good use of CachingWrapperFilter.
  • LUCENE-2480 removes support for pre-Lucene 3.* indices from Lucene 4.*.  Thus, if you are still on Lucene 1.* or Lucene 2.*, we suggest moving to Lucene 3.* soon.  But, due to radical Lucene changes, even moving from Lucene 3.x to Lucene 4.0 won’t be as seamless as with previous Lucene upgrades.  Lucene 4.0 will include a Lucene 3.x to Lucene 4.0 migration tool: LUCENE-2441.
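
To make the filtering-versus-scoring distinction from the bitwise operations item above more concrete, here is a minimal plain-Java sketch. It is not the code from LUCENE-2460 or SOLR-1918; the class, the method names and the similarity formula (fraction of agreeing bit positions) are our own assumptions, chosen only to show the difference between an all-or-nothing bit filter and a score that ranks inexact matches.

```java
class BitwiseMatchSketch {
    // Filtering: accept a value only if every bit set in the mask is also set in the value.
    static boolean matchesAll(int fieldValue, int mask) {
        return (fieldValue & mask) == mask;
    }

    // Scoring: fraction of the 32 bit positions on which value and pattern agree.
    static float similarity(int fieldValue, int pattern) {
        int differingBits = Integer.bitCount(fieldValue ^ pattern);
        return (32 - differingBits) / 32.0f;
    }

    // Index of the value most similar to the pattern (first one wins on ties), or -1 if empty.
    static int bestMatch(int[] fieldValues, int pattern) {
        int best = -1;
        float bestScore = -1f;
        for (int i = 0; i < fieldValues.length; i++) {
            float score = similarity(fieldValues[i], pattern);
            if (score > bestScore) {
                bestScore = score;
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        int[] docs = {0b1111_0000, 0b1010_1010, 0b1111_0001}; // made-up field values
        int pattern = 0b1111_0011;
        for (int value : docs) {
            // None of the values contains all bits of the pattern, so the filter rejects each one.
            System.out.println(matchesAll(value, pattern));
        }
        // Scoring still ranks them: the third value differs in a single bit.
        System.out.println(bestMatch(docs, pattern)); // prints 2
    }
}
```

With fingerprints stored as integers, the scoring variant is what lets near-duplicates surface even when no document matches the pattern exactly.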

That’s it for this month.  Remember that you can also follow @sematext on Twitter.

HBase Digest, May 2010

Big news first:

  • HBase 0.20.4 is out! This release includes critical fixes, some improvements and performance improvements. HBase 0.20.4 EC2 AMIs are now available in all regions, the latest launch scripts can be found here.
  • HBase has become Apache’s Top Level Project. Congratulations!

Good to know things shared by community:

  • HBase got a code review board. Feel free to join!
  • The guarantees for each operation in HBase with regard to ACID properties are stated here.
  • Writing filter that compares values in different columns is explained in this thread.
  • It is OK to mix transactional IndexTable and regular HTables in the same cluster. One can access tables w/out the transactional semantics/overhead as normal, even when running a TransactionalRegionServer. More in this thread.
  • Gets and scans now never return partially updated rows (as of 0.20.4 release).
  • Try to avoid building code on top of lockRow/unlockRow because this can lead to serious delays in system operation and even deadlock. Thread…
  • Read about how HBase performs load-balancing in this thread.
  • Thinking about using HBase with alternative (to HDFS) file system? Then this thread is a must-read for you.

Notable efforts:

  • HBase Indexing Library aids in building and querying indexes on top of HBase, in Google App Engine datastore-style. The library is complementary to the tableindexed contrib module of HBase.
  • HBasene is a scalable information retrieval engine, compatible with the Lucene library while using HBase as the store for the underlying TF-IDF representation.  This is much like Lucandra, which uses Lucene on top of Cassandra.  We will be covering HBasene in the near future here on Sematext Blog.

FAQ:

  1. Is there an easy way to remove/unset/clean a few columns in a column family for an HBase table?
    You can either delete an entire family or delete all the versions of a single family/qualifier. There is no ‘wild card’ deletion or other pattern matching. Column Family is the closest.
  2. How to unsubscribe from the user mailing list?
    Send mail to user-unsubscribe@hbase.apache.org.