First, I would strongly recommend against using HBase… but since you insist.

Let's start with your row key.


2) How are you going to access the base table?
For example, if you’re never going to do a “Get me Mary Smith’s record” but rather “Show me all of the patients who had a positive TB test and cluster them by zip code…”, you may want to consider using a UUID as the row key, since you’re always going to go after the data via an index.

If you want to use a patient’s name, e.g. “last|first”, you will want to take the hash of it so that the row keys spread evenly across the table instead of clustering by name.
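
To make the two row key options concrete, here is a minimal sketch in plain Python (no HBase client involved). The function names and the choice of MD5 are assumptions for illustration only; any hash that spreads the keys reasonably well would do.

```python
import hashlib
import uuid


def hashed_row_key(last: str, first: str) -> str:
    """Build a fixed-width row key by hashing the natural key 'last|first'."""
    natural_key = f"{last}|{first}"
    # MD5 is used here only to spread keys, not for security.
    return hashlib.md5(natural_key.encode("utf-8")).hexdigest()


def uuid_row_key() -> str:
    """Build a random row key for tables that are only reached via an index."""
    return uuid.uuid4().hex


mary_key = hashed_row_key("Smith", "Mary")
random_key = uuid_row_key()
```

The hashed key is repeatable, so a direct lookup by name still works. Note that two patients with the same name would hash to the same key, so a real key would presumably need more than the name in it.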

Now let's talk about indexing.

First, what’s the use case for the database?
Do you want real time access to specific records? Then you would want to consider using Lucene, although that would be a bit more heavy lifting.

The simplest index is an inverted table index.
There are two ways to implement it.

One is to use the attribute value as the index row key, with each column containing the RowKey of a matching row in the base table. Use the RowKey’s value as the column header as well, so that you get your results back in sort order.
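
As a rough sketch of this fat-row layout, the snippet below imitates the index table with an in-memory dict rather than a real HBase table. The names `wide_index`, `index_wide` and `lookup_wide`, and the sample keys, are made up for illustration.

```python
from collections import defaultdict

# index row key (the attribute value) -> {column header: cell value}
wide_index = defaultdict(dict)


def index_wide(attribute_value: str, base_row_key: str) -> None:
    """Add one base table row key as a column of the attribute's index row."""
    # The column header and the cell value are both the base row key.
    wide_index[attribute_value][base_row_key] = base_row_key


def lookup_wide(attribute_value: str) -> list:
    """Return the base row keys for an attribute value in sorted order."""
    # HBase keeps the columns of a row sorted by header; sorted() imitates that.
    return sorted(wide_index.get(attribute_value, {}))


index_wide("tb_positive", "patient-0003")
index_wide("tb_positive", "patient-0001")
matches = lookup_wide("tb_positive")
```

A lookup is a single read of one row, at the price of that row growing by one column for every matching record.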

The other way is to create a single row per record, where the row key is “attribute|RowKey” and the only column is the RowKey itself.
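
The skinny layout can be sketched the same way. Here a sorted Python list stands in for HBase's sorted row keys, and a lookup becomes a prefix scan. The names are hypothetical, and the sketch assumes the attribute value itself never contains the “|” separator.

```python
import bisect

# Sorted list of index row keys, imitating HBase's sorted row order.
skinny_index = []


def index_skinny(attribute_value: str, base_row_key: str) -> None:
    """Insert one index row keyed by 'attribute|RowKey'."""
    row_key = f"{attribute_value}|{base_row_key}"
    pos = bisect.bisect_left(skinny_index, row_key)
    # Writing the same row twice overwrites it, so skip duplicates.
    if pos == len(skinny_index) or skinny_index[pos] != row_key:
        skinny_index.insert(pos, row_key)


def lookup_skinny(attribute_value: str) -> list:
    """Scan every row whose key starts with 'attribute|'."""
    prefix = f"{attribute_value}|"
    start = bisect.bisect_left(skinny_index, prefix)
    results = []
    for row_key in skinny_index[start:]:
        if not row_key.startswith(prefix):
            break  # rows are sorted, so the first miss ends the scan
        results.append(row_key[len(prefix):])
    return results


index_skinny("tb_positive", "patient-0003")
index_skinny("tb_positive", "patient-0001")
index_skinny("tb_negative", "patient-0002")
positives = lookup_skinny("tb_positive")
```

Every row stays tiny, and the results still come back in sort order because the base RowKey is part of the index row key.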

This is skinny table vs. fat table: the first approach gives you fat rows, the second skinny ones. Of course, you could do something in the middle that limits each row to N columns, and then your result set is a set of rows.
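
One possible reading of the middle option is sketched below: each attribute value gets a series of index rows, and a new row is started once the current one holds N columns. The “attribute|bucket number” key format, `MAX_COLUMNS` and the function names are all assumptions, not something HBase prescribes.

```python
MAX_COLUMNS = 3  # the "N" above; a small value picked for illustration

# index row key ('attribute|bucket number') -> {column header: cell value}
bucketed_index = {}


def index_bucketed(attribute_value: str, base_row_key: str) -> str:
    """Store the base row key in the first index row that has room for it."""
    bucket = 0
    while True:
        row_key = f"{attribute_value}|{bucket:06d}"
        columns = bucketed_index.setdefault(row_key, {})
        if base_row_key in columns or len(columns) < MAX_COLUMNS:
            columns[base_row_key] = base_row_key
            return row_key
        bucket += 1  # this row is full, try the next one


def lookup_bucketed(attribute_value: str) -> list:
    """Collect the columns of every index row for the attribute value."""
    results = []
    bucket = 0
    while True:
        columns = bucketed_index.get(f"{attribute_value}|{bucket:06d}")
        if columns is None:
            return results
        results.extend(sorted(columns))
        bucket += 1


for n in range(1, 6):
    index_bucketed("tb_positive", f"patient-{n:04d}")
all_positives = lookup_bucketed("tb_positive")
```

In this sketch the results are sorted within each row but not necessarily across rows, which is one trade-off of bucketing this way.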

That’s pretty much it. You build your index either via a M/R job or, as you insert a row into the base table, you insert into the index at the same time.
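
The two ways of building the index can be sketched roughly as follows, again with dicts standing in for the base and index tables. The “attribute=value|RowKey” index key and all of the names are hypothetical, and `rebuild_index` is only a single-process stand-in for what a M/R job would do over the whole base table.

```python
base_table = {}   # base row key -> record (dict of attribute -> value)
index_table = {}  # index row key -> base row key


def index_row_key(attribute: str, value: str, base_row_key: str) -> str:
    """Build a skinny index row key for one attribute of one record."""
    return f"{attribute}={value}|{base_row_key}"


def put_with_index(base_row_key: str, record: dict, indexed_attributes: list) -> None:
    """Write the record, then write its index rows in the same call."""
    base_table[base_row_key] = dict(record)
    for attribute in indexed_attributes:
        if attribute in record:
            key = index_row_key(attribute, record[attribute], base_row_key)
            index_table[key] = base_row_key


def rebuild_index(indexed_attributes: list) -> dict:
    """Batch alternative: derive the whole index by scanning the base table."""
    rebuilt = {}
    for base_row_key, record in base_table.items():
        for attribute in indexed_attributes:
            if attribute in record:
                key = index_row_key(attribute, record[attribute], base_row_key)
                rebuilt[key] = base_row_key
    return rebuilt


INDEXED = ["tb_test", "zip"]
put_with_index("patient-0001", {"tb_test": "positive", "zip": "60601"}, INDEXED)
put_with_index("patient-0002", {"tb_test": "negative", "zip": "60601"}, INDEXED)
```

In this sketch the base write and the index writes are separate steps with nothing tying them together, and changing an indexed value would leave the old index row behind, so a real implementation would need to deal with both.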

The opinions expressed here are mine, while they may reflect a cognitive thought, that is purely accidental.
Use at your own risk.
Michael Segel
michael_segel (AT)