Subject: Cacheblocksonwrite not working during compaction?


Hi HBase Community!

I have some questions about block caching, specifically how the prefetch and cacheblocksonwrite settings work.

In our production environments we've been having some performance issues with our HBase deployment (HBase 1.4.9 as part of AWS EMR 5.22, with data backed by S3).

Looking into the issue, we've discovered that when regions of a particular table under heavy simultaneous read and write load go through a large compaction, the RPC handler threads all block while servicing read requests to the region that was just compacted. Here are a few relevant log lines showing the compaction; I've included a couple of responseTooSlow warnings, but many more follow in the log:

2019-09-16 15:31:10,204 INFO  [regionserver/ip-172-20-113-118.us-west-2.compute.internal/172.20.113.118:16020-shortCompactions-1568478085425] regionserver.HRegion: Starting compaction on a in region block_v2,\x07\x84\x8B>b\x00\x00\x14mU0p6,1567528560602.98be887c6f4938e0b492b17c669f3ac7.
2019-09-16 15:31:10,204 INFO  [regionserver/ip-172-20-113-118.us-west-2.compute.internal/172.20.113.118:16020-shortCompactions-1568478085425] regionserver.HStore: Starting compaction of 8 file(s) in a of block_v2,\x07\x84\x8B>b\x00\x00\x14mU0p6,1567528560602.98be887c6f4938e0b492b17c669f3ac7. into tmpdir=s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/block_v2/98be887c6f4938e0b492b17c669f3ac7/.tmp, totalSize=3.0 G
2019-09-16 15:33:55,572 INFO  [regionserver/ip-172-20-113-118.us-west-2.compute.internal/172.20.113.118:16020-shortCompactions-1568478085425] dispatch.DefaultMultipartUploadDispatcher: Completed multipart upload of 24 parts 3144722724 bytes
2019-09-16 15:33:56,017 INFO  [regionserver/ip-172-20-113-118.us-west-2.compute.internal/172.20.113.118:16020-shortCompactions-1568478085425] s3n2.S3NativeFileSystem2: rename s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/block_v2/98be887c6f4938e0b492b17c669f3ac7/.tmp/eede47d55e06454ca72482ce33529669 s3://cmx-emr-hbase-us-west-2-oregon/hbase/data/default/block_v2/98be887c6f4938e0b492b17c669f3ac7/a/eede47d55e06454ca72482ce33529669
2019-09-16 15:34:03,328 WARN  [RpcServer.default.FPBQ.Fifo.handler=3,queue=3,port=16020] ipc.RpcServer: (responseTooSlow): {"call":"Get(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest)","starttimems":1568648032777,"responsesize":562,"method":"Get","param":"region= block_v2,\\x07\\x84\\x8B>b\\x00\\x00\\x14mU0p6,1567528560602.98be887c6f4938e0b492b17c669f3ac7., row=\\x07\\x84\\x8B>b\\x00\\x00\\x14newAcPMKh/dkK2vGxPO1XI <TRUNCATED>","processingtimems":10551,"client":"172.20.132.45:51168","queuetimems":0,"class":"HRegionServer"}
2019-09-16 15:34:03,750 WARN  [RpcServer.default.FPBQ.Fifo.handler=35,queue=5,port=16020] ipc.RpcServer: (responseTooSlow): {"call":"Get(org.apache.hadoop.hbase.protobuf.generated.ClientProtos$GetRequest)","starttimems":1568648032787,"responsesize":565,"method":"Get","param":"region= block_v2,\\x07\\x84\\x8B>b\\x00\\x00\\x14mU0p6,1567528560602.98be887c6f4938e0b492b17c669f3ac7., row=\\x07\\x84\\x8B>b\\x00\\x00\\x14nfet675AvHhY4nnKAV2iqu <TRUNCATED>","processingtimems":10963,"client":"172.20.112.226:52222","queuetimems":0,"class":"HRegionServer"}

Note that those log lines are from a "shortCompactions" thread. This also happens with major compactions, but I understand we can better control major compactions by running them manually during off hours if we choose.
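
For reference, if we do move to triggering major compactions ourselves during off hours, we'd likely do it with something like the rough sketch below (the table name comes from the logs above; scheduling it from cron or similar would be up to us, and this is just an illustration, not something we run today):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class OffPeakMajorCompact {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            // Request a major compaction of the table; the call is asynchronous,
            // the region servers schedule and run the compaction themselves.
            admin.majorCompact(TableName.valueOf("block_v2"));
        }
    }
}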

When this occurs we see the numCallsInGeneralQueue metric spike, and some threads in our application that service REST API requests get tied up, which causes 504 gateway timeouts for end users.

Thread dumps from the region server show that the RPC handler threads are blocked on an FSInputStream object (its read method is synchronized). Here is a pastebin of one such dump: https://pastebin.com/Mh0JWx3T
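
To make sure I'm describing the pattern clearly: what we think we're seeing is many handler threads funneling through one open stream for the same store file, so after a burst of cache misses every read queues up behind the previous one. The toy sketch below is only meant to show that shape; it is not HBase code, and the file name and sizes are made up:

import java.io.OutputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SharedStreamContention {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.getLocal(conf);
        Path p = new Path(System.getProperty("java.io.tmpdir"), "shared-stream-demo");
        try (OutputStream out = fs.create(p, true)) {
            out.write(new byte[1024 * 1024]); // 1 MB of dummy data
        }

        // One stream shared by many "handler" threads.
        final FSDataInputStream shared = fs.open(p);

        Runnable reader = () -> {
            byte[] buf = new byte[64 * 1024];
            try {
                // Positioned read; if the underlying stream's read path is
                // synchronized, concurrent callers still serialize here.
                shared.read(0L, buf, 0, buf.length);
            } catch (Exception e) {
                e.printStackTrace();
            }
        };

        Thread[] threads = new Thread[8];
        for (int i = 0; i < threads.length; i++) {
            threads[i] = new Thread(reader, "handler-" + i);
            threads[i].start();
        }
        for (Thread t : threads) {
            t.join();
        }
        shared.close();
    }
}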

Because we are running in AWS with data backed by S3, where we expect higher read latencies, we host large bucket caches on the local disks of the region servers. Our understanding is that after the compaction the relevant portions of the bucket cache are invalidated, so read requests have to go to S3; those requests all try to use the same input stream and block each other, and this continues until the cache is repopulated enough for performance to return to normal.
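
For context, our cache setup looks roughly like the sketch below (written as a Configuration object purely for illustration; these properties actually live in hbase-site.xml on the region servers, and the path and sizes are placeholders rather than our exact values):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class BucketCacheSettingsSketch {
    public static Configuration bucketCacheConf() {
        Configuration conf = HBaseConfiguration.create();
        // L2 bucket cache backed by a file on the region server's local disk.
        conf.set("hbase.bucketcache.ioengine", "file:/mnt/hbase/bucketcache.data");
        // Bucket cache size in MB when the value is > 1 (placeholder value).
        conf.set("hbase.bucketcache.size", "102400");
        // On-heap LRU cache (L1) stays comparatively small; with the combined
        // cache it mostly holds index and bloom blocks.
        conf.setFloat("hfile.block.cache.size", 0.2f);
        return conf;
    }
}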

In an effort to mitigate the effects of compaction on the cache, we enabled the hbase.rs.cacheblocksonwrite setting on our region servers. My understanding was that this would place blocks into the bucket cache while the new HFile is being written. However, after enabling this setting we still see the same issue. Furthermore, we enabled PREFETCH_BLOCKS_ON_OPEN on the column family, and when the issue occurs, one of the threads blocked from reading is the prefetch thread.
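
Roughly speaking, this is how we applied those settings (a sketch using the standard 1.x client API; the server-wide key itself is set in hbase-site.xml rather than from code):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class EnableWriteTimeCaching {
    public static void main(String[] args) throws Exception {
        // The server-wide switch lives in hbase-site.xml on the region servers;
        // noted here only to document the key we flipped:
        //   hbase.rs.cacheblocksonwrite = true
        Configuration conf = HBaseConfiguration.create();

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {
            TableName table = TableName.valueOf("block_v2");
            // Fetch the existing descriptor so other CF attributes are preserved.
            HColumnDescriptor cf =
                admin.getTableDescriptor(table).getFamily(Bytes.toBytes("a"));
            cf.setCacheDataOnWrite(true);      // CACHE_DATA_ON_WRITE
            cf.setPrefetchBlocksOnOpen(true);  // PREFETCH_BLOCKS_ON_OPEN
            admin.modifyColumn(table, cf);     // online alter of the column family
        }
    }
}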

Here are my questions:
- Why does hbase.rs.cacheblocksonwrite not seem to work here? Does it only apply to flushes and not to compactions? I can see from the logs that the compacted file is renamed after being written; does that have something to do with why cacheblocksonwrite isn't working?
- Why are the prefetch threads trying to read the same data? I thought prefetching only happened when a region is opened, and I confirmed from the master and region server logs that no region open was happening at the time. Maybe I misunderstand how/when prefetching comes into play?
- Speaking more generally, any other thoughts on how we can avoid this issue? It seems a shame that we have this nicely populated bucket cache, which is somewhat necessary with a slower file system (S3), only for the cache to be suddenly invalidated by compactions. I'm wary of turning off compactions altogether during our peak load hours because I don't want updates to be blocked by too many store files (a sketch of the off-peak tuning we've been considering follows this list).
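
For what it's worth, here is the kind of off-peak compaction tuning we've been considering as a mitigation (values are placeholders, the keys would go in hbase-site.xml on the region servers, and this is a sketch rather than something we've deployed):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class OffPeakCompactionSketch {
    public static Configuration offPeakConf() {
        Configuration conf = HBaseConfiguration.create();
        // Declare an off-peak window (hours of day, 0-23) so the compaction
        // policy can be more aggressive outside our peak traffic hours.
        conf.setInt("hbase.offpeak.start.hour", 2);
        conf.setInt("hbase.offpeak.end.hour", 6);
        // Allow larger compactions during the off-peak window.
        conf.setFloat("hbase.hstore.compaction.ratio.offpeak", 5.0f);
        // Raise the store-file count at which writes are blocked, to buy some
        // headroom if compactions are deferred during peak hours.
        conf.setInt("hbase.hstore.blockingStoreFiles", 20);
        return conf;
    }
}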

Thanks for any help on this!