Subject: Cacheblocksonwrite not working during compaction?


My questions were primarily about how cacheblocksonwrite, prefetching, and compaction work together, which I don't think is AWS-specific. That said, the 1+ hour prefetching I am seeing may well be an AWS-specific phenomenon.

I've looked at the 1.4.9 source a bit more now that I have a better understanding of everything. As you say, cacheDataOnWrite is hardcoded to false for compactions, so the hbase.rs.cacheblocksonwrite setting has no effect in these cases.
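To check my understanding, here is a minimal sketch of that behavior. It is a simplified model and not the HBase source: the class and method names are hypothetical, and only the setting name comes from the discussion above.

```java
// Simplified model of the behavior described above, NOT HBase code.
// WriterCacheDecision and shouldCacheDataOnWrite are hypothetical names.
final class WriterCacheDecision {
    // Stands in for the value of hbase.rs.cacheblocksonwrite.
    private final boolean cacheBlocksOnWrite;

    WriterCacheDecision(boolean cacheBlocksOnWrite) {
        this.cacheBlocksOnWrite = cacheBlocksOnWrite;
    }

    /** Returns whether data blocks written by this writer would be cached. */
    boolean shouldCacheDataOnWrite(boolean isCompaction) {
        if (isCompaction) {
            // Forced off for compaction writers, whatever the setting says.
            return false;
        }
        return cacheBlocksOnWrite;
    }
}
```

With the setting enabled, a flush writer in this model caches its blocks and a compaction writer does not, which matches what I am observing.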

I also now understand that the cache key is partly based on filename, so disabling hbase.rs.evictblocksonclose isn't going to help for compactions either since the pre-compaction filenames will no longer be relevant.
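A small sketch of why the old blocks stop being useful, assuming the key is the file name plus the block offset (the offset part is my assumption; the file name part is what matters here). The classes and file names below are hypothetical and only model the idea.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical model of a block cache keyed by (file name, offset); NOT HBase code.
final class SketchCacheKey {
    final String fileName;
    final long offset;

    SketchCacheKey(String fileName, long offset) {
        this.fileName = fileName;
        this.offset = offset;
    }

    @Override
    public boolean equals(Object other) {
        if (!(other instanceof SketchCacheKey)) {
            return false;
        }
        SketchCacheKey that = (SketchCacheKey) other;
        return offset == that.offset && fileName.equals(that.fileName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(fileName, offset);
    }
}

final class SketchBlockCache {
    private final Map<SketchCacheKey, byte[]> blocks = new HashMap<>();

    void put(String fileName, long offset, byte[] block) {
        blocks.put(new SketchCacheKey(fileName, offset), block);
    }

    /** Returns the cached block, or null on a cache miss. */
    byte[] get(String fileName, long offset) {
        return blocks.get(new SketchCacheKey(fileName, offset));
    }
}
```

Even if the blocks of the pre-compaction files are kept in the cache, every read against the compacted file looks up keys with the new file name, so it misses.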

Prefetching also makes more sense now that I have looked at the code. I see that it takes effect in HFileReaderV2, so it happens on a per-file basis, not per-region. I was confused about why I was seeing prefetching when the region had not been opened recently, but it makes sense now: it occurs when the compacted file is opened, not when the region is opened.

So unfortunately, it looks like I'm sunk in terms of caching data during compaction. Thanks for the aid in understanding this.

However, I do think this is a valid use case, and it seems like it should be fairly easy to implement with a new cache config setting. On the one hand, there is the prefetching feature, which acknowledges that people sometimes want to cache entire tables, a use case that becomes more common with larger L2 caches. On the other hand, there is a hardcoded setting that assumes nobody would ever want to cache all of the blocks written during a compaction, which seems at odds with the use case prefetching is trying to address. Don't get me wrong: I understand that in many use cases caching on write during compaction is undesirable, because you don't want the compaction to evict blocks you care about. In other words, it throws a big monkey wrench into the concept of an LRU cache. I also realize that hbase.rs.cacheblocksonwrite is geared more towards flushes, for use cases where people often read what was recently written and don't necessarily want to cache the entire table. But a new config option (call it hbase.rs.cacheblocksoncompaction?) to address this specific use case would be nice.
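Roughly what I have in mind, as a sketch only. The option name hbase.rs.cacheblocksoncompaction is just my suggestion, the class is hypothetical, and a plain Map stands in for the real configuration object. The new option defaults to false here so that existing behavior would be unchanged.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the proposal, NOT existing HBase code.
final class ProposedCacheOnWritePolicy {
    static final String CACHE_ON_WRITE_KEY = "hbase.rs.cacheblocksonwrite";
    // Proposed option name; hypothetical, taken from the suggestion above.
    static final String CACHE_ON_COMPACTION_KEY = "hbase.rs.cacheblocksoncompaction";

    private final boolean cacheOnWrite;
    private final boolean cacheOnCompaction;

    ProposedCacheOnWritePolicy(Map<String, String> conf) {
        // Both default to false when the key is absent (an assumption of this sketch).
        this.cacheOnWrite =
            Boolean.parseBoolean(conf.getOrDefault(CACHE_ON_WRITE_KEY, "false"));
        this.cacheOnCompaction =
            Boolean.parseBoolean(conf.getOrDefault(CACHE_ON_COMPACTION_KEY, "false"));
    }

    /** Compaction writers follow the new option; all other writers follow the old one. */
    boolean shouldCacheDataOnWrite(boolean isCompaction) {
        return isCompaction ? cacheOnCompaction : cacheOnWrite;
    }
}
```

Keeping the two settings independent means someone could cache compaction output without also caching every flush, or the other way around.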

I plan to open a JIRA ticket for this, and I'd also be happy to take a stab at creating a patch.

--Jacob LeBlanc

-----Original Message-----
From: Vladimir Rodionov [mailto:[EMAIL PROTECTED]]
Sent: Friday, September 20, 2019 10:29 PM
To: [EMAIL PROTECTED]
Subject: Re: Cacheblocksonwrite not working during compaction?

You are asking questions on the Apache HBase user forum that would be more appropriate for an AWS forum, given that you are using an Amazon-specific distribution of HBase and an Amazon-specific implementation of an S3 file system.

As for hbase.rs.cacheblocksonwrite not working: HBase ignores this flag and forcibly sets it to false if the file writer is opened by a compaction thread (this is true for 2.x, but I am pretty sure it is the same in 1.x).

-Vlad

On Fri, Sep 20, 2019 at 4:24 PM Jacob LeBlanc <[EMAIL PROTECTED]>
wrote: