A hash table isn’t really a good analogy.
It’s really a scan of the META table that identifies which region holds the row.
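
To make the difference from a hash lookup concrete, here is a rough sketch in plain Python. It is not the HBase client API; the region names and start keys are made up, and the sorted list is only a stand-in for the start-key entries you would find in META. The point is that the lookup is a range search over sorted keys, not a hash of the row key.

```python
from bisect import bisect_right

# Hypothetical stand-in for META: sorted region start keys and the
# region that begins at each one. Real META holds much more than this.
REGION_START_KEYS = ["", "g", "n", "t"]
REGION_NAMES = ["region-1", "region-2", "region-3", "region-4"]


def locate_region(row_key: str) -> str:
    """Return the region whose [start key, next start key) range holds row_key."""
    # Find the last start key that is <= row_key.
    idx = bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_NAMES[idx]


print(locate_region("apple"))  # region-1
print(locate_region("n"))      # region-3 (a start key belongs to its own region)
print(locate_region("zebra"))  # region-4
```

Because the keys are ordered, neighbouring row keys land in the same region, which is what makes range scans possible and is exactly what a hash table would not give you.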

With respect to your use case…

You can use HBase to solve it.
You will need to pre-generate your indexes, and you can expect your data to grow considerably with each index you add. You shouldn’t rule it out, just plan for it.
In terms of secondary indexes… which one?
Since HBase doesn’t natively support secondary indexes, you could use anything: inverted tables, or Lucene for that matter. (Or some other format.)

In terms of using the indexes… you would have to run a query/scan against each index, then take the intersection of the result sets.
(This step could be omitted if using Lucene, but there are other issues… like managing the memory footprint of your index, and even there you run into challenges.)
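
As a rough illustration of the scan-then-intersect step, here is a minimal sketch in plain Python. The two dictionaries are hypothetical in-memory stand-ins for inverted index tables (attribute value to base-table row keys); in HBase each lookup would be a get or scan against the index table instead.

```python
# Hypothetical inverted indexes: attribute value -> set of base-table row keys.
COLOR_INDEX = {
    "red": {"row1", "row3", "row7"},
    "blue": {"row2", "row5"},
}
SIZE_INDEX = {
    "large": {"row3", "row5", "row7"},
    "small": {"row1", "row2"},
}


def lookup(index, value):
    """Stand-in for a scan of one index table; returns matching base row keys."""
    return index.get(value, set())


def intersect(*result_sets):
    """Keep only the row keys that appear in every index result set."""
    if not result_sets:
        return set()
    return set.intersection(*result_sets)


matches = intersect(lookup(COLOR_INDEX, "red"), lookup(SIZE_INDEX, "large"))
print(sorted(matches))  # ['row3', 'row7']
```

The surviving row keys are what you would then fetch from the base table. Note that the result sets have to be held somewhere while you intersect them, so an index on a low-selectivity attribute can get expensive.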

So if you want to start simple… use an inverted table. Even here you have a choice… you can have a thin row (one base-table key per index row) or you can store some number X of keys in each inverted row. It gets back to fat row vs. thin row, or something in between. Again, there are permutations of the basic pattern with differing amounts of complexity and performance. (Note: We didn’t have time to walk through and benchmark these options, so they’re still relatively theoretical.)
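
To show what the thin vs. fat choice looks like, here is a sketch in plain Python of one possible row-key layout for each. Everything in it is an assumption for illustration: the separator, the bucket suffix, and the function names are made up, and the dict and list only mimic an index table. It is one permutation of the pattern, not a benchmarked design.

```python
from collections import defaultdict

SEP = "\x00"  # assumed separator between the indexed value and the suffix


def thin_rows(pairs):
    """Thin layout: one index row per (value, base key); row key = value + SEP + base key."""
    return sorted(value + SEP + base_key for value, base_key in pairs)


def fat_rows(pairs, keys_per_row):
    """Fat layout: up to keys_per_row base keys per index row; row key = value + SEP + bucket."""
    grouped = defaultdict(list)
    for value, base_key in sorted(pairs):
        grouped[value].append(base_key)
    rows = {}
    for value, keys in grouped.items():
        for bucket, start in enumerate(range(0, len(keys), keys_per_row)):
            rows[f"{value}{SEP}{bucket:04d}"] = keys[start:start + keys_per_row]
    return rows


def thin_lookup(rows, value):
    """Prefix scan over thin rows; the base key is read out of the row key itself."""
    prefix = value + SEP
    return [r[len(prefix):] for r in rows if r.startswith(prefix)]


def fat_lookup(rows, value):
    """Prefix scan over fat rows; the base keys are read out of each row's contents."""
    prefix = value + SEP
    return [k for rk in sorted(rows) if rk.startswith(prefix) for k in rows[rk]]


pairs = [("red", "row1"), ("red", "row3"), ("red", "row7"), ("blue", "row2")]
thin = thin_rows(pairs)
fat = fat_rows(pairs, keys_per_row=2)
print(len(thin), len(fat))       # 4 3
print(thin_lookup(thin, "red"))  # ['row1', 'row3', 'row7']
print(fat_lookup(fat, "red"))    # ['row1', 'row3', 'row7']
```

Both layouts answer the same lookup; what changes is how many index rows you write and scan, and how large each row gets. Setting X to 1 gives you the thin case, and a very large X gives you one fat row per value.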

Without knowing more about your use case… it’s hard to say what will and what won’t work. (e.g. choosing which attribute to index has an impact.) The size of your raw data set, and then its size with the differing indexes, also matters.

Outside of HBase, if you’re running MapR, they do have MapRDB, which doesn’t have some of the issues you have with HBase. While it’s more stable, it only runs on MapR.
(I’m told it’s in the community edition, so when I get the chance, I’ll have to play with it.)


The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental.
Use at your own risk.
Michael Segel
michael_segel (AT) hotmail.com