Subject: incomplete hbase exports due to scan timeouts


>> Now we see that the Export/Import is broken since April 2019. But the
question remains what causes the long timeouts
As data may have grown over time, the scanner could simply be taking a long
time to filter out the unwanted rows and eventually timing out.

>> How can one identify regions which are in trouble?
You can check these counters for each mapper (or at the job level) to see
how many times the scanner was restarted due to an issue and how much time
the scanner took between next() calls:
-- NUM_SCANNER_RESTARTS
-- MILLIS_BETWEEN_NEXTS
The debug-level application logs may also contain the region info each
mapper was working on (but I am not sure at this point).
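As a sketch, these counters can be read for a finished job with the `mapred job -counter` CLI. The job id shown is a placeholder, and the counter group name ("HBaseCounters") is an assumption that may differ between HBase versions:

```shell
# Read the scan counters for a completed Export job.
# job_1566000000000_0042 is a placeholder job id; substitute your own.
# The counter group name "HBaseCounters" may vary by HBase version.
mapred job -counter job_1566000000000_0042 HBaseCounters NUM_SCANNER_RESTARTS
mapred job -counter job_1566000000000_0042 HBaseCounters MILLIS_BETWEEN_NEXTS
```

A nonzero restart count, or an unusually large MILLIS_BETWEEN_NEXTS for one mapper, points at the region(s) that mapper was scanning.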
>> My questions are: Is it possible that „silent timeouts“ can cause
incomplete exports?
Unless there is a bug, the job should fail if a timeout exception occurs
during the scan in a mapper.
You may check whether your MR setup/config allows some percentage of map
failures, or skips bad records, via either of the following properties:
-- mapred.max.map.failures.percent
-- mapreduce.map.skip.maxrecords
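For reference, this is roughly what the relevant entries would look like in the job configuration (a sketch; the values shown are the usual defaults, under which no silent map failures or record skipping should occur):

```xml
<!-- Illustrative job configuration fragment; values are the defaults. -->
<property>
  <name>mapred.max.map.failures.percent</name>
  <value>0</value> <!-- 0 = any failed map task fails the whole job -->
</property>
<property>
  <name>mapreduce.map.skip.maxrecords</name>
  <value>0</value> <!-- 0 = bad-record skipping disabled -->
</property>
```

If either value is nonzero in your setup, a mapper hitting a scanner timeout could be dropped or its records skipped without failing the job, which would explain an incomplete export.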

If the above doesn't apply to you, please raise a bug in the HBase project
with reproducible steps on a small data set (the timeout can be set as low
as 1 second to reproduce the problem). Any debugging/patch to fix the
problem would be greatly appreciated.
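As an aside, the timeout increase you described can also be applied per job on the Export command line via -D overrides, rather than cluster-wide. A sketch, assuming HBase 1.x property names; the table name and output path are placeholders:

```shell
# Run Export with both timeouts raised to 10 minutes (values in ms).
# my_table and /backup/my_table_2019_04 are placeholders.
hbase org.apache.hadoop.hbase.mapreduce.Export \
  -D hbase.client.scanner.timeout.period=600000 \
  -D hbase.rpc.timeout=600000 \
  my_table /backup/my_table_2019_04
```

Per-job overrides keep the larger timeouts scoped to the export, so the rest of the cluster still fails fast.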

Regards,
Ankit Singhal

On Mon, Aug 26, 2019 at 2:04 AM Udo Offermann <[EMAIL PROTECTED]>
wrote:

> Hi everybody,
>
>
> We are running 6 data nodes (plus one master node - version HBase
> 1.0.0-cdh5.6.0) in each case on a productive and a test environment. Each
> month we export the deltas of the previous month from the productive system
> (using org.apache.hadoop.hbase.mapreduce.Export) and import them into the
> test system.
> From time to time we are using RowCounter and an analytics map-reduce job
> written by ourselves to check if the restore is fine.
>
> Now we see that the Export/Import is broken since April 2019. After lots
> of investigations and tests we found that the bug described in
> https://github.com/hortonworks-spark/shc/issues/174 causes the problems.
>
> After increasing the timeouts (client and RPC timeout) from 1 minute to 10
> minutes the row counts in the test system seem to be in a good shape (we
> counted the rows for one month via RowCounter and scan on the hbase shell).
>
> Now we are about to implement the changes in the productive system.
>
> But the question remains what causes the long timeouts. Some of the tests
> we did revealed ScannerTimeouts after 60 seconds (the default setting). But
> 60 seconds - for an android, that is nearly an eternity. Thus we assume
> that there is something wrong, but how can we find out?
> The hbase locality factor is 1.0 or close to 1.0 for most of the regions.
>
> My questions are: Is it possible that „silent timeouts“ can cause
> incomplete exports?
> Is it usual that scans take longer than 1 minute - even if it seems that
> up to April the exports were all ok?
> How can one identify regions which are in trouble?
>
> Thank you and best regards
> Udo
>
>