Subject: TableSnapshotInputFormat failing to delete files under recovered.edits


First of all, thanks for the reply! I appreciate the time taken addressing our issues.

> It seems the mentioned "hiccup" caused RS(es) crash(es), as you got RITs and recovered edits under these regions dirs.

To give more context, I was making changes to increase the snapshot timeout on region servers and did a graceful restart, so I didn't mean to crash anything. It seems I did this to too many region servers at once (about half the cluster), which resulted in some number of regions getting stuck in transition. This was attempted on a live production cluster, so the hope was to do it without downtime, but it resulted in an outage to our application instead. Unfortunately, the master and region server logs have since rolled and aged out, so I no longer have them.

> The fact there was a "recovered" dir under some regions dirs means that when the snapshot was taken, crashed RS(es) WAL(s) had been split, but not completely replayed yet.

The snapshot was taken many days later. File timestamps under the recovered.edits directory were from June 6th, and the snapshot from the pastebin was taken on June 14th. In fact, snapshots were taken many times with the same result (ETL jobs are launched at least daily in Oozie). Do you mean that if a snapshot was taken before a region was fully recovered, it could result in this state even if that snapshot was subsequently deleted?

> Would you know which specific hbase version is this?

It is EMR 5.22, which runs HBase 1.4.9 (possibly with some Amazon-specific edits, since I noticed that the line numbers in HRegion.java in the stack trace don't quite line up with those in the 1.4.9 tag on GitHub).

> Could your job restore the snapshot into a temp table and then read from this temp table using TableInputFormat, instead?

Maybe we could do this, but it would take us some effort to make the changes, test, release, etc. Of course, we'd rather not jump through hoops like this.

> In this case, it's finding "recovered" folder under regions dir, so it will replay the edits there. Looks like a problem with TableSnapshotInputFormat, seems weird that it tries to delete edits on a non-staging dir (your path suggests it's trying to delete the actual edit folder), that could cause data loss if it would succeed to delete edits before RSes actually replay it.

I agree that this "seems weird", although I am not intimately familiar with all of the inner workings of the HBase code. The potential data loss is what I'm wondering about: would data loss have occurred if we had happened to execute our job under a user that had delete permissions on the HDFS directories? Or did the edits actually get replayed while the regions were stuck in transition, and the files just didn't get cleaned up? Is this something for which I should file a defect in JIRA?

Thanks again,

--Jacob LeBlanc
-----Original Message-----
From: Wellington Chevreuil [mailto:[EMAIL PROTECTED]]
Sent: Monday, June 17, 2019 3:55 PM
To: [EMAIL PROTECTED]
Subject: Re: TableSnapshotInputFormat failing to delete files under recovered.edits

It seems the mentioned "hiccup" caused RS(es) crash(es), as you got RITs and recovered edits under these regions dirs. The fact there was a "recovered" dir under some regions dirs means that when the snapshot was taken, crashed RS(es) WAL(s) had been split, but not completely replayed yet.

Since you are facing error when reading from table snapshot, and the stack trace shows TableSnapshotInputFormat is using "HRegion.openHRegion" code path to read snapshotted data, it will basically do the same as an RS would when trying to assign a region. In this case, it's finding "recovered" folder under regions dir, so it will replay the edits there. Looks like a problem with TableSnapshotInputFormat, seems weird that it tries to delete edits on a non-staging dir (your path suggests it's trying to delete the actual edit folder), that could cause data loss if it would succeed to delete edits before RSes actually replay it. Would you know which specific hbase version is this? Could your job restore the snapshot into a temp table and then read from this temp table using TableInputFormat, instead?

On Mon, Jun 17, 2019 at 17:22, Jacob LeBlanc <[EMAIL PROTECTED]> wrote: