Subject: HBase Scan consumes high cpu


Hi Ramkrishna,

Thank you for your inputs! Unfortunately, we do not know the column
names beforehand; we generated the above scenario only for illustration
purposes.

The intent of our query is: given a single row key, a start column
qualifier, and an end column qualifier, scan for the columns that lie
between the two. We have been achieving that with ColumnRangeFilter.
Our write pattern is a Put followed immediately by a Delete
(KEEP_DELETED_CELLS is set to false). As more Deletes accumulate, we
notice that scan times become very long and CPU usage shoots up to 100%
on one core during every scan. While trying to debug this, we observed
the following behavior:

At any instant, the cells of the particular row are roughly organized
as follows:

D1 P1 D2 P2 D3 P3 ............ Dn-1 Pn-1 Dn Pn Pn+1 Pn+2 Pn+3 Pn+4....

where each D is a Delete and P is its corresponding Put. The newer
values, from Pn onwards, have not been deleted yet.
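For concreteness, here is a throwaway Python sketch of that layout (an
illustration only, not HBase code; real cells carry row, family,
qualifier, timestamp, and type):

```python
# Toy model of the row layout above: each deleted column contributes a
# delete marker ("D") followed by its masked put ("P"); the newest
# columns are live puts with no marker.
def row_cells(n_deleted, n_live):
    cells = []
    for i in range(n_deleted):
        cells += [("D", i), ("P", i)]          # Di, Pi pairs
    for i in range(n_deleted, n_deleted + n_live):
        cells.append(("P", i))                 # live puts Pn, Pn+1, ...
    return cells

print(row_cells(2, 3))
# → [('D', 0), ('P', 0), ('D', 1), ('P', 1), ('P', 2), ('P', 3), ('P', 4)]
```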

As the scan starts, inside the StoreScanner, NormalUserScanQueryMatcher
matches the first cell (D1): it is added to the DeleteTracker and a
MatchCode of SKIP is returned. For the next cell (P1), the matcher
checks with the DeleteTracker and returns SEEK_NEXT_COL. The next cell
is D2, and the two codes keep alternating in this way; no filter is
applied. This continues until the matcher encounters Pn, where the
filter is finally applied, SEEK_NEXT_USING_HINT is returned, and a
reseek positions the scanner near the desired range. The result is
returned quickly after that.
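The matcher loop described above can be sketched as a toy simulation
(plain Python, not HBase code; the cell list and match codes are
simplified stand-ins). It shows that every delete marker and every
masked put is touched before the filter ever sees a live cell:

```python
# Toy simulation of the matcher behavior described above: "D" cells are
# delete markers (SKIP), puts masked by a marker get SEEK_NEXT_COL, and
# only the first live put finally reaches the filter.
def steps_until_filter(cells):
    deleted = set()              # stands in for the DeleteTracker
    steps = 0
    for kind, qual in cells:
        steps += 1
        if kind == "D":
            deleted.add(qual)    # delete marker tracked -> SKIP
        elif qual in deleted:
            pass                 # masked put -> SEEK_NEXT_COL
        else:
            return steps         # live cell: filter can now apply

# Row shaped like the diagram: 5 deleted (D, P) pairs, then live puts.
cells = []
for i in range(5):
    cells += [("D", i), ("P", i)]
cells += [("P", i) for i in range(5, 8)]

print(steps_until_filter(cells))  # → 11: all 10 dead cells touched first
```

With n = 1499000 deleted columns, as in our test data, roughly 3 million
cells are touched one by one before the first seek, which matches the
CPU behavior we observe.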

These SKIP iterations dominate because our pattern leaves very few
active cells, and only towards the latest column qualifiers (which sort
highest lexicographically). We were wondering whether the query could be
modified so that the filter is applied earlier, or whether there is some
other way to seek to the desired range directly.
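As a side note for anyone reading the shell transcripts quoted below:
the qualifiers there are the 4-byte big-endian encoding produced by
Bytes.toBytes(int), which is why the range [1499000, 1499010) shows up
as bytes ending in x, y, z, {, |, }. A quick check in plain Python (not
HBase code):

```python
import struct

# Bytes.toBytes(int) in the HBase client serializes a Java int as
# 4 big-endian bytes; struct.pack(">i", ...) does the same here.
print(struct.pack(">i", 1499000))  # → b'\x00\x16\xdfx'  (0x78 is 'x')
print(struct.pack(">i", 1499005))  # → b'\x00\x16\xdf}'  (0x7D is '}')
```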

Regards,
Solvannan R M
On 2019/09/13 15:53:51, ramkrishna vasudevan wrote:
> Hi
>
> Generally, if you can form the column names like you did in the above
> case, it is always better to add them using scan#addColumn(family,
> qual). I am not sure of the shell syntax to add multiple columns, but
> I am sure there is a provision to do it.
>
> This will ensure that the scan starts from the given column and
> fetches only the required columns. In your case you probably need to
> pass a set of qualifiers (instead of just 1).
>
> Regards
> Ram
>
> On Fri, Sep 13, 2019 at 8:45 PM Solvannan R M wrote:
>
> > Hi Anoop,
> >
> > We have executed the query with the qualifier set as you advised,
> > but we don't get the results for the range; only the specified
> > qualifier cell is returned.
> >
> > Query & Result:
> >
> > hbase(main):008:0> get 'mytable', 'MY_ROW',
> >   {COLUMN => ["pcf:\x00\x16\xDFx"],
> >    FILTER => ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
> >    true, Bytes.toBytes(1499010.to_java(:int)), false)}
> > COLUMN                 CELL
> >  pcf:\x00\x16\xDFx     timestamp=1568380663616, value=\x00\x16\xDFx
> > 1 row(s) in 0.0080 seconds
> >
> > hbase(main):009:0>
> >
> > Is there any other way to get around this?
> >
> > Regards,
> > Solvannan R M
> >
> > On 2019/09/13 04:53:45, Anoop John wrote:
> > > Hi
> > > When you did a put with a lower qualifier int (put 'mytable',
> > > 'MY_ROW', "pcf:\x0A", "\x00"), the system flow gets a valid cell
> > > at the 1st step itself, and that gets passed to the Filter. The
> > > Filter then does a seek, which just avoids processing all the
> > > in-between deletes and puts. In the 1st case the Filter won't get
> > > into action at all unless the scan flow sees a valid cell. The
> > > delete processing happens as a 1st step, before the filter
> > > processing step.
> > >
> > > In this case I am wondering why you cannot add the specific 1st
> > > qualifier in the get part itself along with the column range
> > > filter. I mean:
> > >
> > > get 'mytable', 'MY_ROW', {COLUMN=>['pcf:*1499000*'],
> > > FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
> > > true, Bytes.toBytes(1499010.to_java(:int)), false)}
> > >
> > > Pardon the syntax, it might not be proper for the shell. Can this
> > > be done? This will make the scan seek to the given qualifier at
> > > the 1st step itself.
> > >
> > > Anoop
> > >
> > > On Thu, Sep 12, 2019 at 10:18 PM Udai Bhan Kashyap (BLOOMBERG/
> > > PRINCETON) <[EMAIL PROTECTED]> wrote:
> > >
> > > > Are you keeping the deleted cells? Check 'VERSIONS' for the
> > > > column family and set it to 1 if you don't want to keep the
> > > > deleted cells.
> > > >
> > > > From: [EMAIL PROTECTED] At: 09/12/19 12:40:01 To: [EMAIL PROTECTED]
> > > > Subject: Re: HBase Scan consumes high cpu
> > > >
> > > > Hi,
> > > >
> > > > As said earlier, we have populated the rowkey "MY_ROW" with
> > > > integers from 0 to 1500000 as column qualifiers. Then we have
> > > > deleted the qualifiers from 0 to 1499000.
> > > >
> > > > We executed the following query. It took 15.3750 seconds to
> > > > execute.
> > > >
> > > > hbase(main):057:0> get 'mytable', 'MY_ROW', {COLUMN=>['pcf'],
> > > >   FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
> > > >   true, Bytes.toBytes(1499010.to_java(:int)), false)}
> > > > COLUMN                 CELL
> > > >  pcf:\x00\x16\xDFx     timestamp=1568123881899, value=\x00\x16\xDFx
> > > >  pcf:\x00\x16\xDFy     timestamp=1568123881899, value=\x00\x16\xDFy
> > > >  pcf:\x00\x16\xDFz     timestamp=1568123881899, value=\x00\x16\xDFz
> > > >  pcf:\x00\x16\xDF{     timestamp=1568123881899, value=\x00\x16\xDF{
> > > >  pcf:\x00\x16\xDF|     timestamp=1568123881899, value=\x00\x16\xDF|
> > > >  pcf:\x00\x16\xDF}     timestamp=1568123881899, value