Subject: HBase Scan consumes high CPU


Hi Ram,

Thanks for your support! We will explore alternative schema designs.

Regards,
Solvannan R M
On 2019/09/17 05:21:39, ramkrishna vasudevan wrote:
> Hi Solvannan,
>
> Currently there is no easy way to overcome this case, because deletes and
> their tracking take precedence before the filter is even applied.
>
> I get your case where you really don't know the columns that could have
> been previously deleted, and hence you specify the entire range of
> columns in the filter. When this Put/Delete combination keeps increasing,
> you end up in these issues.
>
> I am not aware of the use case here, but is there a better way to handle
> your schema for these cases?
>
> Regards
> Ram
>
> On Mon, Sep 16, 2019 at 10:54 PM Solvannan R M wrote:
>
> > Hi Ramkrishna,
> >
> > Thank you for your inputs! Unfortunately, we would not know the
> > column names beforehand. We had generated the above scenario for
> > illustration purposes.
> >
> > The intent of our query is: given a single row key, a start column
> > key and an end column key, scan for the columns that lie between the
> > two column keys. We have been achieving that by using ColumnRangeFilter.
> > Our write pattern is a Put followed immediately by a Delete
> > (KEEP_DELETED_CELLS is set to false). So as more Deletes accumulate,
> > we notice that scan times become very long and one CPU core stays at
> > 100% during every scan. On trying to debug, we observed the following
> > behaviour:
> >
> > At any instant, the cells of the particular row would be roughly
> > organized like
> >
> > D1 P1 D2 P2 D3 P3 ............ Dn-1 Pn-1 Dn Pn Pn+1 Pn+2 Pn+3 Pn+4....
> >
> > where D and P are a Delete and its corresponding Put. The newer values
> > from Pn onwards have not been deleted yet.
> >
> > As the scan initiates, inside the StoreScanner,
> > NormalUserScanQueryMatcher matches the first cell (D1). It is added to
> > the DeleteTracker and a MatchCode of SKIP is returned. For the next
> > cell (P1), the matcher checks with the DeleteTracker and returns a
> > code of SEEK_NEXT_COL. The next cell would again be a Delete (D2), and
> > this alternates. No filter is applied. This goes on until the scan
> > encounters Pn, where the filter is applied, SEEK_NEXT_USING_HINT is
> > returned, and a reseek positions the scanner near the desired range.
> > The result is returned quickly after that.
> >
> > The SKIP iterations happen a lot, because our pattern leaves very few
> > active cells, and only towards the latest column qualifiers (ordered
> > high lexicographically). We were wondering if the query could be
> > modified so that the filter is applied up front, or if there is some
> > other way to seek to the desired range directly.
> >
> > Regards,
> > Solvannan R M
> >
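For reference, the access pattern described above might look roughly like the following with the HBase Java client. This is a minimal sketch only, assuming the HBase 2.x API, a table named 'mytable', family 'pcf', and 4-byte integer qualifiers as in the shell examples further down the thread:

    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.*;
    import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ColumnRangeScanSketch {
        public static void main(String[] args) throws Exception {
            byte[] row = Bytes.toBytes("MY_ROW");
            byte[] fam = Bytes.toBytes("pcf");

            try (Connection conn = ConnectionFactory.createConnection();
                 Table table = conn.getTable(TableName.valueOf("mytable"))) {

                // Write pattern: each Put is followed almost immediately by a Delete
                // of the same column, which leaves the D1 P1 ... Dn Pn layout above.
                byte[] qual = Bytes.toBytes(1498000);
                table.put(new Put(row).addColumn(fam, qual, qual));
                table.delete(new Delete(row).addColumns(fam, qual));

                // Read pattern: a single-row scan restricted to a qualifier range
                // via ColumnRangeFilter.
                Scan scan = new Scan()
                        .withStartRow(row)
                        .withStopRow(row, true)              // single row, stop row inclusive
                        .addFamily(fam)
                        .setFilter(new ColumnRangeFilter(
                                Bytes.toBytes(1499000), true,    // min qualifier, inclusive
                                Bytes.toBytes(1499010), false)); // max qualifier, exclusive

                try (ResultScanner scanner = table.getScanner(scan)) {
                    for (Result r : scanner) {
                        r.listCells().forEach(System.out::println);
                    }
                }
            }
        }
    }
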
> > On 2019/09/13 15:53:51, ramkrishna vasudevan wrote:
> > > Hi,
> > > Generally, if you can form the column names like you did in the
> > > above case, it is always better to add them using
> > > scan#addColumn(family, qual). I am not sure of the shell syntax to
> > > add multiple columns, but I am sure there is a provision to do it.
> > >
> > > This will ensure that the scan starts from the given column and
> > > fetches only the required columns. In your case you probably need
> > > to pass a set of qualifiers (instead of just one).
> > >
> > > Regards
> > > Ram
> > >
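In the Java API, Ram's suggestion might look roughly like the sketch below (reusing the imports and the 'table' handle from the earlier sketch). It assumes the qualifiers of interest can be enumerated up front, which, as Solvannan notes above, is not always the case here:

    // Sketch of the scan#addColumn approach: name each wanted qualifier
    // explicitly so the scan is restricted to those columns and can seek
    // between them rather than examining every (possibly deleted) column.
    Scan scan = new Scan()
            .withStartRow(Bytes.toBytes("MY_ROW"))
            .withStopRow(Bytes.toBytes("MY_ROW"), true);
    for (int q = 1499000; q < 1499010; q++) {
        scan.addColumn(Bytes.toBytes("pcf"), Bytes.toBytes(q));
    }
    try (ResultScanner scanner = table.getScanner(scan)) {
        for (Result r : scanner) {
            System.out.println(r);
        }
    }
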
> > > On Fri, Sep 13, 2019 at 8:45 PM Solvannan R M wrote:
> > >
> > > > Hi Anoop,
> > > >
> > > > We have executed the query with the qualifier set as you advised,
> > > > but we don't get the results for the range; only the specified
> > > > qualifier's cell is returned.
> > > >
> > > > Query & result:
> > > >
> > > > hbase(main):008:0> get 'mytable', 'MY_ROW',
> > > > {COLUMN=>["pcf:\x00\x16\xDFx"],
> > > > FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
> > > > true, Bytes.toBytes(1499010.to_java(:int)), false)}
> > > > COLUMN                 CELL
> > > > pcf:\x00\x16\xDFx      timestamp=1568380663616, value=\x00\x16\xDFx
> > > > 1 row(s) in 0.0080 seconds
> > > >
> > > > hbase(main):009:0>
> > > >
> > > > Is there any other way to get around this?
> > > >
> > > > Regards,
> > > > Solvannan R M
> > > >
> > > > On 2019/09/13 04:53:45, Anoop John wrote:
> > > > > Hi,
> > > > > When you did a put with a lower qualifier int (put 'mytable',
> > > > > 'MY_ROW', "pcf:\x0A", "\x00"), the scan flow gets a valid cell at
> > > > > the 1st step itself, and that gets passed to the Filter. The
> > > > > Filter does a seek which simply skips over all the in-between
> > > > > deletes and puts processing. In the 1st case the Filter won't get
> > > > > into action at all unless the scan flow sees a valid cell. The
> > > > > delete processing happens as the 1st step, before the filter
> > > > > processing step.
> > > > >
> > > > > In this case I am wondering why you cannot add the specific 1st
> > > > > qualifier in the get part itself along with the column range
> > > > > filter. I mean:
> > > > >
> > > > > get 'mytable', 'MY_ROW', {COLUMN=>['pcf: *1499000 * '],
> > > > > FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
> > > > > true, Bytes.toBytes(1499010.to_java(:int)), false)}
> > > > >
> > > > > Pardon the syntax, it might not be proper for the shell. Can this
> > > > > be done? This will make the scan seek to the given qualifier at
> > > > > the 1st step itself.
> > > > >
> > > > > Anoop
> > > > >
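Anoop's idea, translated to the Java client, might look like the sketch below (illustrative only, reusing the setup from the sketches above). Note that, as Solvannan reports further up the thread, combining an explicit column with ColumnRangeFilter ended up returning only the named qualifier rather than the whole range:

    // Sketch of Anoop's suggestion: name the first qualifier of the range
    // explicitly so the read seeks to it at the first step, and keep the
    // ColumnRangeFilter for the rest of the range. The 4-byte integer
    // qualifier encoding is assumed from the shell examples above.
    Get get = new Get(Bytes.toBytes("MY_ROW"));
    get.addColumn(Bytes.toBytes("pcf"), Bytes.toBytes(1499000));
    get.setFilter(new ColumnRangeFilter(
            Bytes.toBytes(1499000), true,     // start of range, inclusive
            Bytes.toBytes(1499010), false));  // end of range, exclusive
    Result result = table.get(get);
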
 > > > > > On Thu, Sep 12, 2019 at 10:18 PM Udai