In part 1 of this 3-part post series we’ll describe how we use HBase at Sematext for real-time analytics and how we perform data rollbacks using an append-only updates approach.
Some bits of this topic were already covered in Deferring Processing Updates to Increase HBase Write Performance and some were briefly presented at BerlinBuzzwords 2011 (video). We will also talk about some of the ideas below during HBaseCon-2012 in late May (see Real-time Analytics with HBase). The approach described in this post is used in our production systems (SPM & SA) and the implementation was open-sourced as the HBaseHUT project.
Problem we are Solving
While HDFS & MapReduce are designed for massive batch processing with the idea of data being immutable (write once, read many times), HBase adds support for operations such as real-time and random read/write/delete access to data records. HBase performs its basic job very well, but there are times when developers have to think at a higher level about how to apply its capabilities to a specific use-case: the core functionality and implementation are solid, but it takes some thinking to ensure that functionality is used properly and optimally. The use-case we’ll be working with in this post is a typical data analytics system where:
- new data are continuously streaming in
- data are processed and stored in HBase, usually as time-series data
- processed data are served to users who can navigate through most recent data as well as dig deep into historical data
Although the above points frame the use-case relatively narrowly, the approach and its implementation that we’ll describe here are really more general and applicable to a number of other systems, too. The basic issues we want to solve are the following:
- increase record update throughput. Ideally, changes should be applied in real-time despite the high volume of incoming data. Usually, due to the limitations of the “normal HBase update”, which requires a Get+Put pair of operations, updates are applied using a batch-processing approach (e.g. as MapReduce jobs). This, of course, is anything but real-time: incoming data is not immediately visible; it is seen only after it has been processed.
- ability to roll back changes in the served data. Human errors or any other issues should not permanently corrupt the data the system serves.
- ability to fetch data interactively (i.e. fast enough for impatient humans). Retrieval should be fast whether one navigates through a small amount of recent data or selects a time interval that spans years.
Here is what we consider an “update”:
- addition of a new record if no record with the same key exists
- update of an existing record with a particular key
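To make this notion of an “update” concrete, here is a minimal sketch of the standard Get+Put pattern it implies. A plain Python dict stands in for an HBase table; `FakeTable`, `upsert`, and the row keys are our own illustrative names, not HBase API:

```python
class FakeTable:
    """In-memory stand-in for an HBase table: row key -> {column: value}."""
    def __init__(self):
        self.rows = {}

    def get(self, key):
        return self.rows.get(key)

    def put(self, key, columns):
        self.rows.setdefault(key, {}).update(columns)

def upsert(table, key, columns):
    """An 'update' in our sense: add a new record if the key is absent,
    otherwise merge the new column values into the existing record."""
    existing = table.get(key)        # Get: check for an existing record ...
    if existing is None:
        table.put(key, columns)      # ... Put: add a brand new record
    else:
        merged = dict(existing)
        merged.update(columns)
        table.put(key, merged)       # ... Put: write back the merged record

table = FakeTable()
upsert(table, "sensor1|2012-05-01", {"temp": 21})   # addition of a new record
upsert(table, "sensor1|2012-05-01", {"hum": 40})    # update of an existing record
```

Every update costs a read followed by a write; that Get+Put round trip is the bottleneck the rest of this post works around.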
Let’s take a look at the following example to better understand the problem we are solving.
Briefly, here are the details of an example system:
- System collects metrics from a large number of sensors (N) very frequently (each second) and displays them on chart(s) over time
- User needs to be able to select small time intervals to display on a chart (e.g. several minutes) as well as very large spans (e.g. several years)
- Ideally, data shown to user should be updated in real-time (i.e. user can see the most recent state of the sensors)
Note that even if some of the above points are not applicable to your system the ideas that follow may still be relevant and applicable.
Possible “direct” Implementation Steps
The following steps are by no means the only possible approach.
Step 1: Write every data point as a new record or new column(s) in some record in HBase. In other words, use a simple append-only approach. While this works well for displaying charts with data from short time intervals, showing a year’s worth of data (there are about 31,536,000 seconds in one year) may be too slow to call the experience “interactive”.
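Step 1 can be sketched as follows. A sorted list of string keys mimics HBase’s lexicographically sorted row keys; `TimeSeriesStore` and the key layout are our own assumptions for illustration:

```python
import bisect

class TimeSeriesStore:
    """Append-only store: each data point becomes its own record whose key
    embeds a fixed-width timestamp, so keys sort chronologically (as HBase
    row keys do)."""
    def __init__(self):
        self.keys = []   # sorted list of row keys (HBase keeps keys sorted)
        self.rows = {}   # row key -> value

    def append(self, sensor, ts, value):
        key = "%s|%010d" % (sensor, ts)   # zero-padded ts preserves sort order
        bisect.insort(self.keys, key)
        self.rows[key] = value

    def scan(self, sensor, ts_from, ts_to):
        """Range scan over [ts_from, ts_to]: cheap for short intervals, but a
        year of per-second data means ~31.5M rows -- too slow to be
        'interactive'."""
        lo = bisect.bisect_left(self.keys, "%s|%010d" % (sensor, ts_from))
        hi = bisect.bisect_right(self.keys, "%s|%010d" % (sensor, ts_to))
        return [self.rows[k] for k in self.keys[lo:hi]]

store = TimeSeriesStore()
for ts, v in [(100, 1.0), (101, 2.0), (102, 3.0), (200, 9.0)]:
    store.append("sensor1", ts, v)
```

Writes are pure appends (no read needed), which is why this step is fast on the write side; the problem is only on the read side for large intervals.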
Step 2: Store extra records with aggregated data for larger time intervals (say 1 hour, so that 1 year = 8,760 data points). New data arrives continuously, we want it to be visible in real-time, and we cannot rely on data arriving in strict order (say, because one sensor had network connectivity issues, or because we want the ability to import historical data from a new data source), so we have to use update operations on the records that hold data for longer intervals. This requires a lot of Get+Put operations to update the aggregated records, and that means degraded performance: writing to HBase in this fashion will be significantly slower than the append-only approach described in Step 1. It may slow writes down so much that the system cannot keep up with the volume of incoming data. Not good.
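A toy sketch of Step 2 makes the cost visible: every incoming point forces a read-modify-write of its aggregate row. The dict again stands in for HBase, and `record_point` plus the counter are our own illustrative names:

```python
# Hourly-rollup rows maintained with Get+Put. Each incoming point triggers
# a read of the aggregate row followed by a write-back -- the pattern that
# degrades write throughput compared to plain appends.
hourly = {}      # "sensor|hour" -> {"sum": ..., "count": ...}
get_count = 0    # how many Gets we were forced to issue

def record_point(sensor, ts, value):
    global get_count
    key = "%s|%d" % (sensor, ts // 3600)   # bucket the point into its hour
    get_count += 1
    agg = hourly.get(key)                  # Get: fetch the current aggregate
    if agg is None:
        agg = {"sum": 0.0, "count": 0}
    agg["sum"] += value
    agg["count"] += 1
    hourly[key] = agg                      # Put: write the aggregate back

for ts, v in [(10, 1.0), (20, 2.0), (3700, 5.0)]:
    record_point("sensor1", ts, v)
```

Note that the number of Gets equals the number of incoming points, even though most of them land in a bucket that was just read a moment earlier; out-of-order arrivals are handled, but only at this per-point cost.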
Step 3: Compromise on real-time data analysis and process data in small batches (near real-time). This decreases the load on HBase, as we can process (aggregate) data more efficiently in batches and reduce the number of update (Get+Put) operations. But do we really want to compromise on real-time analytics? No, of course not. While it may seem acceptable in this specific example to show data for bigger intervals with some delay (near real-time), in real-world systems this usually affects other charts/reports, such as reports that need to show total, up-to-date figures. So no, we really don’t want to compromise on real-time analytics if we don’t have to. In addition, imagine what happens if something goes wrong (e.g. wrong data was fed as input, or the application aggregated data incorrectly due to a bug or human error). We would not be able to easily roll back recently written data. Utilizing native HBase column versions may help in some cases, but in general, when we want greater control over the rollback operation, a better solution is needed.
Use Versions for Rolling Back?
Recent additions in managing cell versions make cell versioning even more powerful than before. Things like HBASE-4071 make it easy to store historical data without a big data overhead by cleaning old data efficiently. While it may seem obvious to use versions (a native HBase feature) to allow rolling back data changes, we cannot (and do not want to) rely heavily on cell versions here. The main reason is that versioning is simply not very efficient when a cell accumulates many versions, and a long update history for a record/cell means exactly that. Versions are managed and navigated as a simple list in HBase (as opposed to the Map-like structure used for records and columns), so managing long lists of versions is less efficient than having a number of separate records/columns. Besides, using versions would not help us with the Get+Put situation, and we are aiming to kill these two birds with one stone with the solution we are about to describe. One could try to combine the append-only updates approach described below with cell versions as the update log, but this would again bring us back to managing long lists inefficiently.
Given the example above, our suggested solution can be described as follows:
- replace update (Get+Put) operations at write time with simple append-only writes, and defer the processing of updates to periodic jobs or perform aggregations on the fly when the user asks for data before the individual additions have been processed.
The idea is simple and not necessarily novel, but given the specific qualities of HBase, namely fast range scans and high write throughput, this approach works very well. So well, in fact, that we’ve implemented it in HBaseHUT and have been using it with success in our production systems (SPM & SA).
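The core idea can be sketched in a few lines. This is a toy model of the approach, not the actual HBaseHUT implementation; the composite-key scheme and function names are our own assumptions:

```python
import itertools

# Append-only updates with deferred processing: every "update" is written as
# a NEW row whose key is the logical key plus a monotonically increasing
# suffix. Reads scan the rows for that logical key and merge on the fly;
# a periodic job can later collapse the chain into a single row.
seq = itertools.count()
rows = {}   # composite key "logical_key#seq" -> column dict

def write_update(key, columns):
    rows["%s#%06d" % (key, next(seq))] = columns   # plain append, no Get

def read_merged(key):
    """Merge the update chain on the fly, relying on fast sorted-key scans."""
    merged = {}
    for k in sorted(rows):                  # scan in key (and thus seq) order
        if k.startswith(key + "#"):
            merged.update(rows[k])          # later updates win
    return merged

def compact(key):
    """Deferred processing job: replace the update chain with one row."""
    merged = read_merged(key)
    for k in [k for k in rows if k.startswith(key + "#")]:
        del rows[k]
    write_update(key, merged)

write_update("sensor1|hour42", {"sum": 10.0, "count": 5})
write_update("sensor1|hour42", {"sum": 12.5, "count": 6})
```

Readers see the latest state immediately via `read_merged`, whether or not `compact` has run yet, which is how real-time visibility survives the deferred processing.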
So, what we gain here is:
- high update throughput
- real-time updates visibility: despite deferring the actual processing of updates, the user always sees the latest data changes
- efficient updates processing: random Get+Put operations are replaced with processing whole sets of records at a time (during a fast scan), and the redundant Get+Put attempt is eliminated when writing the first data item
- ability to roll back any range of updates
- avoidance of data inconsistency problems caused by tasks that fail after only partially updating data in HBase without rolling back (when used with MapReduce, for example)
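The rollback point deserves a small sketch of its own. Because each update is a separate record (tagged here with an explicit sequence number, a simplification of our own), undoing a range of recent changes is just deleting those records; the merged view then reflects the earlier state, since the original data was never overwritten:

```python
# Rollback with append-only updates: delete the records in the bad range,
# and the merge-on-read view falls back to the state before them.
log = {}   # (logical key, seq) -> column dict

def append_update(key, seq, columns):
    log[(key, seq)] = columns              # append only; nothing overwritten

def merged_view(key):
    merged = {}
    for k, s in sorted(log):               # apply updates in sequence order
        if k == key:
            merged.update(log[(k, s)])
    return merged

def rollback(key, from_seq):
    """Undo all updates with sequence number >= from_seq."""
    for k, s in [ks for ks in log if ks[0] == key and ks[1] >= from_seq]:
        del log[(k, s)]

append_update("total", 1, {"value": 100})
append_update("total", 2, {"value": 250})   # suppose this write was bad
rollback("total", 2)                        # the earlier state is restored
```

In a real table the “sequence number” role could be played by a write timestamp embedded in the row key, so rolling back a time range of updates maps onto deleting a contiguous range of rows.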
In the part 2 post we’ll dig into the details around each of the above points and talk more about HBaseHUT, which makes all of the above possible. If you like this sort of stuff, we’re looking for Data Engineers!