I have been digging into some details of Accumulo to model the disk and network costs associated with various types of scan patterns and I have a few questions regarding compression.
Assuming an inverted index table with rows following the pattern of
and a scan that specifies an exact key and value so as to constrain the range, it seems that the dominant factor in network utiltization would be sending key-value pairs from the tablet server to the client and a secondary factor would be transmitting data from non-local RFiles (assuming no caching).
Is my understanding correct that the on-disk compression of this type of table is predominantly a function of the average number of differing bits between adjacent ids? Or, has anyone observed a significant improvement with gz or lzo vs no additional compression? I'm considering running some experiments to measure the difference for a few types of ids (uuid, snowflake-like, content based hashes), but I'm curious if anyone else has done similar experiments.
Given a scan that specifies a range for an exact key and value, is there any transport compression performed for tablet server to client communication beyond the Key.compress method which appears to only compress equivalent rows, columns, etc as opposed to those that share a common prefix?
It seems possible to implement a more specialized compression algorithm with the iterator framework, performing the decompression on the client side, but I'm curious if it could lead to general scan performance improvements if the default compression also involved run-length encoding.
Sounds about right to me. There is definitely a noticed difference between gzip, lzo/snappy (they're about on par with speed and size), and no compression. I'm not sure what the deltas are off the top of my head, but I would expect that you see noticeable differences.
You can also notice differences in "densities" when the lexicographical delta between two keys. Sequential keys that differ very little will result in very dense RFiles (lots of keys), where keys that vary greatly will result in less keys per file. We've had some discussions about the run-length encoding in RFile lately -- have you stumbled on that? No, there are no compression algorithms applied to the data before sending it. By default we use the TCompactProtocol from Thrift. If you're interested in the specifics, Thrift has some documentation on the subject.
I do recall some experiments I tried previously in which a logical "object" in Accumulo was comprised of multiple key-value pairs which were returned in one key-value pair (very similar to how the WholeRowIterator serializes things). In this case, I experimented with compressing the serialized Key-Values before returning from the Iterator stack. Sadly, I don't recall what the takeaway was, but I don't believe it was outright "bad" :) This would be really neat to test.
Some rigor in designing an experiment would be the first step, IMO. If you want to spear-head this, I'm sure there are many who would be happy to give some feedback. It would be prudent to figure out what the variables in the "equation" would be, specifically the distribution of Key-Value pairs -- amount stored, compression on disk, cache sizes, query workload. First out what you want to test, what you will tweak, and define a hypothesis you want to test.
You could consider using the continuous ingest framework for data-generation and query workloads. You could also look at YCSB  for some inspiration. Either of these would make your life easier in not having to generate the read/write workloads.  https://github.com/brianfrankcooper/YCSB
Thanks for the information. I did read through the discussion about compression of visibility expressions and columns within RFiles a while back which got me thinking about some of this. It makes sense that gzip or lzo/snappy compression would have a very noticeable impact when there are columns or visibility expressions that are not compressed with RLE even if neighboring rows have very small lexicographical deltas.
I will put some thought into desiging an experiment to evaluate whether or not there is any benefit to applying RLE during key-value transport from server to client. Even if it proves to be situationally beneficial, I think it could be implemented as a common iterator similar to the WholeRowIterator.
Given, the current compression strategy I would expect better server-client transport compression retrieving a single row with many columns
<key><value> : <id> 
compared to many lexicographically close rows.
<key><value><id> : 
with the understanding that very large rows can lead to poor load balancing.
On Thu, Oct 22, 2015 at 11:54 AM, Josh Elser <[EMAIL PROTECTED]> wrote:
I have been able to put some more thought into this over the weekend and make initial observations on tables I currently have populated. Looking at the rfile-info for a few different tables, I noticed that one which has particularly small lexicographical deltas between keys costs an average of ~2.5 bits per key to store on disk. All of the data is stored in the row component of the key and a full row is typically about 36 bytes. I wrote a little utility to recreate ScanResult objects for batches of sequential key-value pairs returned from a scanner and then used the TCompactProtocol to write the ScanResult to a byte array. Each key-value pair costs rougly 48 bytes which makes sense given that every row is different and there will be some space required for the timestamps, visibilities, and other bookeeping info.
Another table I looked at has larger lexicographical deltas between keys and costs roughly 5 bytes per key to store on disk. This table is a reverse index with very large rows, each column within a row identifies data that resides in another table. Each column is rougly 12 bytes uncompressed. When encoded in a ScanResult, each key-value pair costs roughly 25 bytes which makes sense since the row cost should be negligible for large batch sizes and the overhead from timestamp, visibility, and other bookkeeping info is roungly the same as the other table.
Since compression depends heavily on both table design and the actual data, it seemed the next logical step would be to create a tool that the community could use to easily measure the compression ratio for ScanResult objects. So, I threw together a shell extension to wrap the utility that I previously described. It measures compression ratio for the default strategy Key.compress as well as a few other simple strategies that seemed reasonable to test. The usage is almost the same as the scan command, it just prints out compression statistics rather than the data.