Subject: HBase Scan consumes high cpu


Hi Anoop,

    We executed the query with the qualifier set as you advised, but we
do not get results for the range; only the cell for the specified
qualifier is returned.

Query & Result:

hbase(main):008:0> get 'mytable', 'MY_ROW',
{COLUMN=>["pcf:\x00\x16\xDFx"],
FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
true, Bytes.toBytes(1499010.to_java(:int)), false)}
COLUMN CELL
  pcf:\x00\x16\xDFx                 timestamp=1568380663616, value=\x00\x16\xDFx
1 row(s) in 0.0080 seconds

hbase(main):009:0>
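
For reference, here is a small plain-Ruby sketch (not HBase shell code) of how the integer qualifiers map to the bytes shown in the output. It assumes that Bytes.toBytes(int) emits 4 big-endian bytes, which pack('N') mimics; the helper names int_qualifier and hex are made up for this illustration.

```ruby
# Plain Ruby, runnable outside the HBase shell.
# Assumption: Bytes.toBytes(int) yields 4 big-endian bytes; pack('N') mimics that.

# Hypothetical helper: encode an int the way the qualifiers in this thread look.
def int_qualifier(n)
  [n].pack('N') # 32-bit, network (big-endian) byte order
end

# Hypothetical helper: render raw bytes as a hex string for printing.
def hex(bytes)
  bytes.unpack1('H*')
end

lower = int_qualifier(1_499_000) # inclusive lower bound passed to the filter
upper = int_qualifier(1_499_010) # exclusive upper bound passed to the filter

puts hex(lower) # 0016df78, i.e. \x00\x16\xDFx (0x78 is ASCII 'x')
puts hex(upper) # 0016df82

# Qualifiers that the half-open range [lower, upper) is expected to cover.
in_range = (1_499_000...1_499_010).map { |n| int_qualifier(n) }
puts in_range.length
```

Under that assumption, the single qualifier passed in COLUMN above is the encoding of 1499000, the inclusive lower bound of the range.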

Is there any other way to get around this?

Regards,

Solvannan R M
On 2019/09/13 04:53:45, Anoop John wrote:
 > Hi
 > When you did a put with a lower qualifier int (put 'mytable',
 > 'MY_ROW', "pcf:\x0A", "\x00"), the scan flow gets a valid cell at the
 > 1st step itself, and that cell is passed to the Filter. The Filter then
 > does a seek, which avoids processing all the in-between deletes and puts.
 > In the 1st case the Filter won't get into action at all unless the scan
 > flow sees a valid cell. The delete processing happens as the 1st step,
 > before the filter processing step.
 >
 > In this case I am wondering why you cannot add the specific 1st qualifier
 > in the get part itself, along with the column range filter. I mean
 >
 > get 'mytable', 'MY_ROW', {COLUMN=>['pcf: *1499000 * '],
 > FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
 > true, Bytes.toBytes(1499010.to_java(:int)), false)}
 >
 > Pardon the syntax, it might not be proper for the shell. Can this be done?
 > This will make the scan seek to the given qualifier at the 1st step
 > itself.
 >
 > Anoop
 >
 > On Thu, Sep 12, 2019 at 10:18 PM Udai Bhan Kashyap (BLOOMBERG/PRINCETON)
 > <[EMAIL PROTECTED]> wrote:
 >
 > > Are you keeping the deleted cells? Check 'VERSIONS' for the column family
 > > and set it to 1 if you don't want to keep the deleted cells.
 > >
 > > From: [EMAIL PROTECTED] At: 09/12/19 12:40:01 To:
 > > [EMAIL PROTECTED]
 > > Subject: Re: HBase Scan consumes high cpu
 > >
 > > Hi,
 > >
 > > As said earlier, we populated the rowkey "MY_ROW" with integers
 > > from 0 to 1500000 as column qualifiers. Then we deleted the
 > > qualifiers from 0 to 1499000.
 > >
 > > We executed the following query. It took 15.3750 seconds to execute.
 > >
 > > hbase(main):057:0> get 'mytable', 'MY_ROW', {COLUMN=>['pcf'],
 > > FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
 > > true, Bytes.toBytes(1499010.to_java(:int)), false)}
 > > COLUMN CELL
 > > pcf:\x00\x16\xDFx timestamp=1568123881899, value=\x00\x16\xDFx
 > > pcf:\x00\x16\xDFy timestamp=1568123881899, value=\x00\x16\xDFy
 > > pcf:\x00\x16\xDFz timestamp=1568123881899, value=\x00\x16\xDFz
 > > pcf:\x00\x16\xDF{ timestamp=1568123881899, value=\x00\x16\xDF{
 > > pcf:\x00\x16\xDF| timestamp=1568123881899, value=\x00\x16\xDF|
 > > pcf:\x00\x16\xDF} timestamp=1568123881899, value=\x00\x16\xDF}
 > > pcf:\x00\x16\xDF~ timestamp=1568123881899, value=\x00\x16\xDF~
 > > pcf:\x00\x16\xDF\x7F timestamp=1568123881899, value=\x00\x16\xDF\x7F
 > > pcf:\x00\x16\xDF\x80 timestamp=1568123881899, value=\x00\x16\xDF\x80
 > > pcf:\x00\x16\xDF\x81 timestamp=1568123881899, value=\x00\x16\xDF\x81
 > > 1 row(s) in 15.3750 seconds
 > >
 > >
 > > Now we inserted a new column with qualifier 10 (\x0A), such that it
 > > comes earlier in lexicographical order. We then executed the same query.
 > > It took only 0.0240 seconds.
 > >
 > > hbase(main):058:0> put 'mytable', 'MY_ROW', "pcf:\x0A", "\x00"
 > > 0 row(s) in 0.0150 seconds
 > > hbase(main):059:0> get 'mytable', 'MY_ROW', {COLUMN=>['pcf'],
 > > FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
 > > true, Bytes.toBytes(1499010.to_java(:int)), false)}
 > > COLUMN CELL
 > > pcf:\x00\x16\xDFx timestamp=1568123881899, value=\x00\x16\xDFx
 > > pcf:\x00\x16\xDFy timestamp=1568123881899, value=\x00\x16\xDFy
 > > pcf:\x00\x16\xDFz timestamp=1568123881899, value=\x00\x16\xDFz
 > > pcf:\x00\x16\xDF{ timestamp=1568123881899, value=\x00\x16\xDF{
 > > pcf:\x00\x16\xDF| timestamp=1568123881899, value=\x00\x16\xDF|
 > > pcf:\x00\x16\xDF} timestamp=1568123881899, value=\x00\x16\xDF}
 > > pcf:\x00\x16\xDF~ timestamp=1568123881899, value=\x00\x16\xDF~
 > > pcf:\x00\x16\xDF\x7F timestamp=1568123881899, value=\x00\x16\xDF\x7F
 > > pcf:\x00\x16\xDF\x80 timestamp=1568123881899, value=\x00\x16\xDF\x80
 > > pcf:\x00\x16\xDF\x81 timestamp=1568123881899, value=\x00\x16\xDF\x81
 > > 1 row(s) in 0.0240 seconds
 > > hbase(main):060:0>
 > >
 > >
 > > We were able to reproduce this result consistently; the pattern is a
 > > bulk insert followed by a bulk delete of most of the earlier columns.
 > >
 > >
 > > We observed the following behaviour while debugging the StoreScanner
 > > (regionserver).
 > >
 > > Case 1:
 > >
 > > 1. When StoreScanner.next() is called, it starts to iterate over the
 > > cells from the start of the rowkey.
 > >
 > > 2. As all the cells from 0 to 1499000 are deleted, we could see
 > > alternating Delete and Put type cells. Here,
 > > NormalUserScanQueryMatcher.match() returns
 > > ScanQueryMatcher.MatchCode.SKIP and
 > > ScanQueryMatcher.MatchCode.SEEK_NEXT_COL for Delete and Put type cells
 > > respectively. This iteration happens throughout the range of 0 to 1499000.
 > >
 > > 3. This continues until a valid Put type cell is encountered, where the
 > > matcher applies the ColumnRangeFilter to the cell, which in turn returns
 > > ScanQueryMatcher.MatchCode.SEEK_NEXT_USING_HINT. In the next iteration
 > > it seeks directly to the desired column.
 > >
 > >
 > > Case 2:
 > >
 > > 1. When StoreScanner.next() is called, it starts to iterate over the
 > > cells from the start of the rowkey.
 > >
 >