If you are thinking about using HBase you will likely want to understand HBase backup options. I know we did, so let us share what we found. Please let us know what we missed and what you use for HBase backup!
You could export your tables using the Export (org.apache.hadoop.hbase.mapreduce.Export) MapReduce job, which writes the table data to a SequenceFile on HDFS. This was implemented in HBASE-1684 if you want to check out the patch or the comments there. The tool works on one table at a time, so if you need to back up multiple tables, run it once per table. The exported data can then be imported back into HBase with the Import tool.
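Since Export handles one table per run, backing up several tables means one invocation per table. Here is a minimal sketch in Python that only builds the command lines; it does not run them. The table names and the backup directory are made up for illustration, and we assume the usual `hbase <class> <tablename> <dir>` form for both Export and Import.

```python
# Sketch: build one Export command per table, plus the matching Import command.
# Table names and the backup root are hypothetical examples.

EXPORT_CLASS = "org.apache.hadoop.hbase.mapreduce.Export"
IMPORT_CLASS = "org.apache.hadoop.hbase.mapreduce.Import"


def export_commands(tables, backup_root):
    """Return one Export invocation (as an argv list) per table."""
    root = backup_root.rstrip("/")
    # Each table gets its own output directory under the backup root.
    return [["hbase", EXPORT_CLASS, table, f"{root}/{table}"] for table in tables]


def import_command(table, backup_root):
    """Return the Import invocation that reads back one table's export."""
    root = backup_root.rstrip("/")
    return ["hbase", IMPORT_CLASS, table, f"{root}/{table}"]


if __name__ == "__main__":
    for argv in export_commands(["users", "events"], "/backups/demo"):
        print(" ".join(argv))
    print(" ".join(import_command("users", "/backups/demo")))
```

You could hand each list to subprocess.run on a machine with the HBase client installed. Keep in mind that the tables are exported one after another, not at a single point in time.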
If you have another HBase cluster that you want to treat as a backup cluster, you can use the handy CopyTable tool to copy a table at a time.
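As a rough sketch of what a CopyTable run to a backup cluster looks like, the Python below builds the command line without executing it. It assumes the `--peer.adr` option takes the backup cluster's ZooKeeper quorum, client port, and parent znode separated by colons, and that `--new.name` renames the table on the destination. The host names and table names are hypothetical.

```python
# Sketch: build a CopyTable command that targets a second (backup) cluster.
# Host names and table names are hypothetical examples.

COPYTABLE_CLASS = "org.apache.hadoop.hbase.mapreduce.CopyTable"


def copytable_command(table, peer_quorum, peer_port=2181,
                      peer_znode="/hbase", new_name=None):
    """Return the argv list for copying one table to the peer cluster."""
    # The peer address is written as quorum:port:znode-parent.
    peer = f"{peer_quorum}:{peer_port}:{peer_znode}"
    argv = ["hbase", COPYTABLE_CLASS, f"--peer.adr={peer}"]
    if new_name:
        # Optionally store the copy under a different table name.
        argv.append(f"--new.name={new_name}")
    argv.append(table)  # The source table name goes last.
    return argv


if __name__ == "__main__":
    print(" ".join(copytable_command("users", "zk1.backup.example.com")))
```

As with Export, this covers one table per run, so copying several tables means looping over them.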
You could use Hadoop’s distcp command to copy the whole /hbase directory from one HDFS cluster to the other. However, if the cluster is being written to while distcp runs, this can leave the copy in an inconsistent state, so it should be avoided. See http://search-hadoop.com/m/wkMgSjVLDb
At this point we should point out that the Export and CopyTable methods above are per-table. Moreover, none of the methods above works on or creates a snapshot of the table. Export and CopyTable are atomic only at the row level. Furthermore, if you have multiple tables whose data depend on each other and they are being modified while you are exporting or copying them, you will end up with inconsistent data: the data in those tables will not be in sync. See http://search-hadoop.com/m/Q4bU81G116p.
Backup from Mozilla
Because of the above-mentioned issues with running distcp over a cluster whose data is being modified at the same time, developers at Mozilla came up with their own Backup tool. They’ve described the tool and its use in the popular Migrating HBase in the Trenches post.
HBase has a relatively new and not yet widely used whole cluster replication mechanism. The backup cluster does not have to be identical to the master cluster, which means that the backup cluster could be much less powerful and thus cheaper, while still having enough storage to serve as backup.
Ah, the infamous HBASE-50! This issue saw some great work during GSoC 2010, but it looks like it was never integrated into HBase. It is unclear whether the contributor simply ran out of steam or time, or whether it became apparent that table snapshots are too difficult to implement or simply not doable because of the highly distributed nature of HBase. The JIRA issue does contain patches you can look at, and the author has a now inactive hbase-snapshot repository up on GitHub.
You could also simply crank up the HDFS replication factor to the level that makes you feel safe and call that a backup. This will not guard against data corruption or accidental deletes, since every replica reflects the same change, but it does guard against certain partial hardware failures.
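For illustration, here is a small Python sketch that builds the HDFS shell command for raising the replication factor of files already under /hbase. The value 5 is an arbitrary example, not a recommendation. We assume the standard `hadoop fs -setrep` command with its `-R` (recursive) and `-w` (wait for completion) flags.

```python
# Sketch: build a "hadoop fs -setrep" command for the HBase root directory.
# The replication factor used here is an arbitrary example value.


def setrep_command(path="/hbase", replication=5, wait=False):
    """Return the argv list that sets the replication factor under path."""
    if replication < 1:
        raise ValueError("replication factor must be at least 1")
    argv = ["hadoop", "fs", "-setrep", "-R"]
    if wait:
        # -w blocks until the target replication is reached; this can be slow.
        argv.append("-w")
    argv += [str(replication), path]
    return argv


if __name__ == "__main__":
    print(" ".join(setrep_command()))
```

Note that setrep only changes files that already exist. Newly written files get their replication factor from the dfs.replication setting that the writing client sees, so that would need to be raised as well.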
Since so many people seem to be asking about HBase backup options, I hope this serves as a good point-in-time snapshot, a summary of all HBase backup options that are currently on the table. With time, this will be added to the HBase Book.
Are there other HBase backup options we should have included?
What do you use for HBase backup?