We periodically execute Spark jobs to run ETL from some of our HBase tables to another data repository. The Spark jobs read data by taking a snapshot and then using the TableSnapshotInputFormat class. Lately we've been having failures because when the jobs try to read the data, they attempt to delete files under the recovered.edits directory for some regions, and the user the jobs run as doesn't have permission to do that. A pastebin of the error and stack trace from one of our job logs is here: https://pastebin.com/MAhVc9JB
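For context, the flow above looks roughly like the following dry-run sketch. The table name, snapshot name, restore dir, and job class are all hypothetical placeholders, not our actual setup; the script only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Dry-run sketch of the snapshot-then-read ETL flow (all names are
# hypothetical, not from our real jobs). It only prints commands.
set -euo pipefail

snapshot_etl_commands() {
  local table="$1" snapshot="$2" restore_dir="$3"
  # 1) Take a snapshot of the table from the hbase shell.
  echo "echo \"snapshot '$table', '$snapshot'\" | hbase shell"
  # 2) Submit the Spark job; the job is assumed to call
  #    TableSnapshotInputFormat.setInput(job, snapshot, new Path(restoreDir))
  #    before creating the RDD over the snapshot files.
  echo "spark-submit --class com.example.SnapshotEtl etl.jar '$snapshot' '$restore_dir'"
}

snapshot_etl_commands my_table etl_snapshot /user/etl/restore
```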
This started happening after upgrading to EMR 5.22, where the recovered.edits directory is colocated with the WALs in HDFS; it used to be in S3-backed EMRFS.
I have two questions regarding this:
1) First off, why are these files under the recovered.edits directory at all? The timestamp of the files coincides with a hiccup we had with our cluster, where I had to run "hbase hbck -fixAssignments" to fix regions that were stuck in transition. That command seemed to work just fine: all regions were assigned and there have been no inconsistencies since. Does this mean the WALs were not replayed correctly? Does "hbase hbck -fixAssignments" not recover regions properly?
2) Why is our job trying to delete these files? I don't know enough to say for sure, but it seems like reading snapshot data with TableSnapshotInputFormat should not involve recovering or deleting edits.
I've fixed the problem by running "assign '<region>'" in hbase shell for every region that had files under the recovered.edits directory, and those files appeared to be cleaned up once the assignment completed. But I'd like to understand this better, especially if something is interfering with replaying edits from WALs (and making sure our ETL jobs don't start failing again would be nice).
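The manual fix above can be sketched as a small script that scans the HBase data directory for leftover recovered.edits files and emits the matching "assign" commands. This assumes the default HDFS layout (/hbase/data/&lt;namespace&gt;/&lt;table&gt;/&lt;region&gt;/recovered.edits/...); adjust the root for your hbase.rootdir.

```shell
#!/usr/bin/env bash
# Sketch: extract the region names that still have files under
# recovered.edits from a recursive HDFS listing. The path layout is the
# HBase default; this is an illustration, not a vetted tool.
set -euo pipefail

regions_with_edits() {
  # stdin: output of `hdfs dfs -ls -R /hbase/data`
  awk '{print $NF}' |                               # last field = full path
    grep '/recovered.edits/' |                      # only leftover edit files
    sed 's|.*/\([^/]*\)/recovered.edits/.*|\1|' |   # component before recovered.edits = region
    sort -u
}

# Usage on a live cluster (not run here):
#   hdfs dfs -ls -R /hbase/data | regions_with_edits | \
#     while read -r region; do echo "assign '$region'"; done | hbase shell
```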