Hi,
We have been using HBase (1.4.9) for a case where time-series data is continuously inserted and deleted (high churn) against a single rowkey. The column qualifiers more or less represent timestamps. When we scan this data using ColumnRangeFilter for a recent time range, the scanner for each store (memstore & storefiles) has to go through the contiguous deletes before it reaches the requested time range's data. While this scan runs, we see the regionserver process using 100% CPU on a single core.
So, for our case, most of the cells with older timestamps will be in a deleted state, and traversing these deleted cells is what causes the regionserver's 100% single-core CPU usage.
We tried to trace the scan code and observed the following behaviour.
1. When the scanner is initialized, it seeks all the store scanners to the start of the rowkey.
2. It then traverses the deleted cells, discarding them one by one (as they were deleted).
3. When it encounters a valid cell (Put type), it applies the filter, which returns SEEK_NEXT_USING_HINT.
4. The scanner then seeks directly to the required key and returns the results quickly.
To confirm this behaviour, we ran a test:
1. We populated a single rowkey with column qualifiers formed from the integers 0 to 1500000, with random data.
2. We then deleted the column qualifier range 0 to 1499000.
3. At this point the data is only in the memstore; no store file exists.
4. We scanned the rowkey with ColumnRangeFilter [1499000, 1499010).
5. The query took 12 seconds to execute, and a single core was fully occupied for its duration.
6. We then put a new cell with qualifier 10.
7. We executed the same query again; it took 0.018 seconds.
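For reference, the test can be reproduced with the Java client roughly as follows (a minimal sketch; the table name 'mytable', family 'pcf' and batch size are our illustrative assumptions):

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class ChurnScanRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    byte[] row = Bytes.toBytes("MY_ROW");
    byte[] fam = Bytes.toBytes("pcf");
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("mytable"))) {
      // Step 1: populate one rowkey with qualifiers 0..1500000.
      List<Put> puts = new ArrayList<>();
      for (int q = 0; q <= 1500000; q++) {
        puts.add(new Put(row).addColumn(fam, Bytes.toBytes(q), Bytes.toBytes(q)));
        if (puts.size() == 10000) { table.put(puts); puts.clear(); }
      }
      table.put(puts);
      // Step 2: delete the qualifier range [0, 1499000).
      List<Delete> dels = new ArrayList<>();
      for (int q = 0; q < 1499000; q++) {
        dels.add(new Delete(row).addColumns(fam, Bytes.toBytes(q)));
        if (dels.size() == 10000) { table.delete(dels); }  // the call drains the list
      }
      table.delete(dels);
      // Step 3 is simply not flushing, so everything stays in the memstore.
      // Step 4: read back [1499000, 1499010) with a ColumnRangeFilter.
      Get get = new Get(row).addFamily(fam);
      get.setFilter(new ColumnRangeFilter(
          Bytes.toBytes(1499000), true, Bytes.toBytes(1499010), false));
      long start = System.currentTimeMillis();
      Result result = table.get(get);
      System.out.println(result.size() + " cells in "
          + (System.currentTimeMillis() - start) + " ms");
    }
  }
}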
Kindly check this and advise!
Regards,
Solvannan R M
Deletes are held in memory. They represent data you have to traverse
until that data is flushed out to disk. When you write a new cell with a
qualifier of 10, that sorts, lexicographically, "early" with respect to
the other qualifiers you've written.
By that measure, if you are only scanning for the first column in this
row which you've loaded with deletes, it would make total sense to me
that the first case is slow and the second case is fast.
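For instance, with 4-byte integer qualifiers like the ones in your test (a standalone sketch, assuming Bytes.toBytes(int) encodings):

import org.apache.hadoop.hbase.util.Bytes;

// Qualifiers compare as unsigned bytes, left to right, so the 4-byte
// encoding of 10 sorts before the encoding of 1499000.
public class QualifierOrder {
  public static void main(String[] args) {
    byte[] early = Bytes.toBytes(10);       // \x00\x00\x00\x0A
    byte[] late  = Bytes.toBytes(1499000);  // \x00\x16\xDFx
    System.out.println(Bytes.toStringBinary(early));       // \x00\x00\x00\x0A
    System.out.println(Bytes.toStringBinary(late));        // \x00\x16\xDFx
    System.out.println(Bytes.compareTo(early, late) < 0);  // true: sorts first
  }
}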
Can you please share exactly how you execute your "query" for both (all) scenarios?
On 9/10/19 11:35 AM, Solvannan R M wrote:
Hi,
As said earlier, we populated the rowkey "MY_ROW" with the integers
from 0 to 1500000 as column qualifiers. Then we deleted the
qualifiers from 0 to 1499000.
We executed the following query. It took 15.3750 seconds to execute.
hbase(main):057:0> get 'mytable', 'MY_ROW', {COLUMN=>['pcf'],
FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
true, Bytes.toBytes(1499010.to_java(:int)), false)}
COLUMN                  CELL
 pcf:\x00\x16\xDFx      timestamp=1568123881899, value=\x00\x16\xDFx
 pcf:\x00\x16\xDFy      timestamp=1568123881899, value=\x00\x16\xDFy
 pcf:\x00\x16\xDFz      timestamp=1568123881899, value=\x00\x16\xDFz
 pcf:\x00\x16\xDF{      timestamp=1568123881899, value=\x00\x16\xDF{
 pcf:\x00\x16\xDF|      timestamp=1568123881899, value=\x00\x16\xDF|
 pcf:\x00\x16\xDF}      timestamp=1568123881899, value=\x00\x16\xDF}
 pcf:\x00\x16\xDF~      timestamp=1568123881899, value=\x00\x16\xDF~
 pcf:\x00\x16\xDF\x7F   timestamp=1568123881899, value=\x00\x16\xDF\x7F
 pcf:\x00\x16\xDF\x80   timestamp=1568123881899, value=\x00\x16\xDF\x80
 pcf:\x00\x16\xDF\x81   timestamp=1568123881899, value=\x00\x16\xDF\x81
1 row(s) in 15.3750 seconds
Now we inserted a new column with qualifier 10 (\x0A), such that it
comes earlier in lexicographic order, and executed the same query.
It took only 0.0240 seconds.
hbase(main):058:0> put 'mytable', 'MY_ROW', "pcf:\x0A", "\x00"
0 row(s) in 0.0150 seconds
hbase(main):059:0> get 'mytable', 'MY_ROW', {COLUMN=>['pcf'],
FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
true, Bytes.toBytes(1499010.to_java(:int)), false)}
COLUMN                  CELL
 pcf:\x00\x16\xDFx      timestamp=1568123881899, value=\x00\x16\xDFx
 pcf:\x00\x16\xDFy      timestamp=1568123881899, value=\x00\x16\xDFy
 pcf:\x00\x16\xDFz      timestamp=1568123881899, value=\x00\x16\xDFz
 pcf:\x00\x16\xDF{      timestamp=1568123881899, value=\x00\x16\xDF{
 pcf:\x00\x16\xDF|      timestamp=1568123881899, value=\x00\x16\xDF|
 pcf:\x00\x16\xDF}      timestamp=1568123881899, value=\x00\x16\xDF}
 pcf:\x00\x16\xDF~      timestamp=1568123881899, value=\x00\x16\xDF~
 pcf:\x00\x16\xDF\x7F   timestamp=1568123881899, value=\x00\x16\xDF\x7F
 pcf:\x00\x16\xDF\x80   timestamp=1568123881899, value=\x00\x16\xDF\x80
 pcf:\x00\x16\xDF\x81   timestamp=1568123881899, value=\x00\x16\xDF\x81
1 row(s) in 0.0240 seconds
hbase(main):060:0>
We were able to reproduce this result consistently, the pattern
being a bulk insert followed by a bulk delete of most of the earlier columns.
We observed the following behaviour while debugging the StoreScanner
(regionserver).
Case 1:
1. When StoreScanner.next() is called, it starts to iterate over the
cells from the start of the rowkey.
2. As all the cells are deleted (from 0 to 1499000), we see
alternating Delete and Put type cells. NormalUserScanQueryMatcher.match()
returns ScanQueryMatcher.MatchCode.SKIP for each Delete cell and
ScanQueryMatcher.MatchCode.SEEK_NEXT_COL for each (masked) Put cell.
This iteration happens throughout the range of 0 to 1499000.
3. This continues until a valid Put type cell is encountered, where the
matcher applies the ColumnRangeFilter to the cell, which in turn returns
ScanQueryMatcher.MatchCode.SEEK_NEXT_USING_HINT. In the next iteration
it seeks directly to the desired column.
Case 2:
1. When StoreScanner.next() is called, it starts to iterate over the
cells from the start of the rowkey.
2. When the Put cell of qualifier 10 (\x0A) is encountered, the matcher
returns ScanQueryMatcher.MatchCode.SEEK_NEXT_USING_HINT. In the next
iteration it seeks directly to the desired column.
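To convey the scale of the difference, here is a toy, self-contained model of the iteration counts in the two cases (our illustration only; this is not the actual HBase code):

import java.util.Map;
import java.util.TreeMap;

// Toy model of the two cases: case 1 pays two matcher calls per deleted
// column (SKIP for the delete marker, SEEK_NEXT_COL for the masked put)
// before the filter ever gets a chance to hint-seek.
public class SkipVersusSeek {
  public static void main(String[] args) {
    int total = 1500000;
    int deleted = 1499000;
    TreeMap<Integer, Boolean> columns = new TreeMap<>();  // qualifier -> live?
    for (int q = 0; q < total; q++) {
      columns.put(q, q >= deleted);
    }
    // Case 1: iterate from the start of the row until the first live cell.
    long matchCalls = 0;
    for (Map.Entry<Integer, Boolean> e : columns.entrySet()) {
      if (e.getValue()) {
        break;  // first live Put: the ColumnRangeFilter can now return its hint
      }
      matchCalls += 2;  // SKIP + SEEK_NEXT_COL
    }
    System.out.println("case 1: " + matchCalls + " match calls before the hint");
    // Case 2: a live cell sorts first, so the very first match call reaches
    // the filter, which returns SEEK_NEXT_USING_HINT; one big reseek follows.
    System.out.println("case 2: 1 match call before the hint");
  }
}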
Please let us know if this behaviour is intentional or whether it can be avoided.
Regards,
Solvannan R M
On 2019/09/10 17:12:36, Josh Elser wrote:
Are you keeping the deleted cells? Check 'VERSIONS' for the column family and set it to 1 if you don't want to keep the deleted cells.
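Something along these lines with the 1.x Admin API would show (and, if needed, change) those settings; a minimal sketch, assuming the table and family names from this thread:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.KeepDeletedCells;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.util.Bytes;

public class CheckFamilySettings {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    TableName tn = TableName.valueOf("mytable");
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Admin admin = conn.getAdmin()) {
      HColumnDescriptor pcf =
          admin.getTableDescriptor(tn).getFamily(Bytes.toBytes("pcf"));
      System.out.println("VERSIONS=" + pcf.getMaxVersions()
          + ", KEEP_DELETED_CELLS=" + pcf.getKeepDeletedCells());
      if (pcf.getMaxVersions() != 1) {
        pcf.setMaxVersions(1);
        pcf.setKeepDeletedCells(KeepDeletedCells.FALSE);
        admin.modifyColumn(tn, pcf);  // HBase 1.x Admin API
      }
    }
  }
}

Note that delete markers already written stay in the store until a flush and major compaction happen, whatever these settings are.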
Hi
When you did a put with a lower qualifier (put 'mytable',
'MY_ROW', "pcf:\x0A", "\x00"), the scan flow gets a valid cell at the
very first step, and that cell is passed to the Filter. The Filter then
does a seek, which skips all of the in-between delete and put processing.
In the first case, the Filter won't come into action at all until the scan
flow sees a valid cell; the delete processing happens as the first step,
before the filter-processing step.
In this case I am wondering why you cannot add the specific first qualifier
to the get itself, along with the column range filter. I mean:
get 'mytable', 'MY_ROW', {COLUMN=>['pcf:1499000'],
FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
true, Bytes.toBytes(1499010.to_java(:int)), false)}
Pardon the syntax, it might not be proper for the shell. Can this be done?
This would make the scan seek to the given qualifier at the very first step.
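With the Java client, the idea would look roughly like this (a sketch of the suggestion, not tested):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.filter.ColumnRangeFilter;
import org.apache.hadoop.hbase.util.Bytes;

class SeedColumnGet {
  // Name the range's first qualifier explicitly so the scan seeks straight
  // to it, and keep the ColumnRangeFilter for the rest of the range.
  static Result rangeGetWithSeedColumn(Table table) throws IOException {
    Get get = new Get(Bytes.toBytes("MY_ROW"));
    get.addColumn(Bytes.toBytes("pcf"), Bytes.toBytes(1499000));
    get.setFilter(new ColumnRangeFilter(
        Bytes.toBytes(1499000), true, Bytes.toBytes(1499010), false));
    return table.get(get);
  }
}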
Anoop
On Thu, Sep 12, 2019 at 10:18 PM Udai Bhan Kashyap (BLOOMBERG/ PRINCETON) <[EMAIL PROTECTED]> wrote:
Hi Anoop,
We executed the query with the qualifier specified as you advised,
but we do not get the results for the range; only the cell of the
specified qualifier is returned.
Query & Result:
hbase(main):008:0> get 'mytable', 'MY_ROW',
{COLUMN=>["pcf:\x00\x16\xDFx"],
FILTER=>ColumnRangeFilter.new(Bytes.toBytes(1499000.to_java(:int)),
true, Bytes.toBytes(1499010.to_java(:int)), false)}
COLUMN                  CELL
 pcf:\x00\x16\xDFx      timestamp=1568380663616, value=\x00\x16\xDFx
1 row(s) in 0.0080 seconds
hbase(main):009:0>
Is there any other way to get around this?
Regards,
Solvannan R M
On 2019/09/13 04:53:45, Anoop John wrote:
Hi
Generally, if you can form the column names as you did in the case above,
it is always better to add them using scan#addColumn(family, qual). I am
not sure of the shell syntax for adding multiple columns, but I am sure
there is a provision for it.
This will ensure that the scan starts from the given column and fetches
only the required columns. In your case you probably need to pass a set
of qualifiers (instead of just one).
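For the illustration above, where the qualifiers can be enumerated, a Java-client sketch of what I mean (a hypothetical helper; your real workload may not allow enumerating them):

import java.io.IOException;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

class ExplicitColumnsGet {
  // Add every qualifier in [1499000, 1499010) explicitly; the scan then
  // seeks to the first named column and fetches only those columns, so
  // no range filter is needed.
  static Result rangeGetByExplicitColumns(Table table) throws IOException {
    Get get = new Get(Bytes.toBytes("MY_ROW"));
    for (int q = 1499000; q < 1499010; q++) {
      get.addColumn(Bytes.toBytes("pcf"), Bytes.toBytes(q));
    }
    return table.get(get);
  }
}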
Regards
Ram
On Fri, Sep 13, 2019 at 8:45 PM Solvannan R M <[EMAIL PROTECTED]lid> wrote:
Hi Ramkrishna,
Thank you for your inputs! Unfortunately, we do not know the
column names beforehand; we generated the above scenario for
illustration purposes.
The intent of our query is: given a single rowkey, a start column
key and an end column key, scan for the columns that lie between the
two column keys. We have been achieving that by using ColumnRangeFilter.
Our write pattern is a Put followed immediately by a Delete
(KEEP_DELETED_CELLS is set to false). So, as more Deletes
accumulate, the scan time grows very long and the CPU of one core is
pinned at 100% during every scan. On trying to debug, we observed the
following behavior:
At any instant, the cells of the particular row would be roughly
organized like
D1 P1 D2 P2 D3 P3 ............ Dn-1 Pn-1 Dn Pn Pn+1 Pn+2 Pn+3 Pn+4....
where D and P are a Delete and its corresponding Put. The newer values,
from Pn onwards, have not been deleted yet.
As the scan initiates, inside the StoreScanner,
NormalUserScanQueryMatcher matches the first cell (D1); it is added to
the DeleteTracker and a MatchCode of SKIP is returned. For the next
cell (P1), the matcher checks the DeleteTracker and returns
SEEK_NEXT_COL. The next cell is D2, and so on, alternating. No filter
is applied. This goes on until the scan encounters Pn, where the filter
is applied, SEEK_NEXT_USING_HINT is returned, and a reseek positions the
scanner near the desired range. The result is returned quickly after that.
The SKIP iterations dominate because our pattern leaves very few active
cells, and only towards the latest (lexicographically highest) column
qualifiers. We were wondering whether the query could be modified so
that the filter is applied up front, or whether there is some other way
to seek to the desired range directly.
Regards,
Solvannan R M
On 2019/09/13 15:53:51, ramkrishna vasudevan wrote:
Hi Solvannan
Currently there is no easy way to overcome this, because delete
tracking takes precedence: it happens before the filter is even applied.
I get your case: you don't really know which columns could have been
previously deleted, and hence you specify the entire range of columns
in the filter. When this Put/Delete combination keeps growing, you end
up with these issues.
I am not aware of the details of the use case, but is there a better
way to structure your schema for these cases?
Regards
Ram
On Mon, Sep 16, 2019 at 10:54 PM Solvannan R M <[EMAIL PROTECTED]lid> wrote:
Hi Ram,
Thanks for your support! We will explore alternative schema designs.
Regards,
Solvannan R M
On 2019/09/17 05:21:39, ramkrishna vasudevan wrote:
I would suggest you look at your design again; it is a wide-table
approach, putting so many columns against a single rowkey.
You could also store your time-series data row-wise.
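For example, a hypothetical tall layout, with the timestamp encoded in the rowkey instead of the qualifier, so that a time-range read becomes a plain rowkey range scan:

import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

class TallSchemaSketch {
  // Hypothetical tall layout: one row per data point, with the timestamp
  // encoded big-endian in the rowkey after the series id.
  static byte[] rowkey(String seriesId, long ts) {
    return Bytes.add(Bytes.toBytes(seriesId), Bytes.toBytes(ts));
  }

  // A time-range read starts with a seek straight to the start row instead
  // of walking one huge row full of tombstones.
  static Scan timeRange(String seriesId, long fromTs, long toTs) {
    Scan scan = new Scan();
    scan.setStartRow(rowkey(seriesId, fromTs));  // HBase 1.x API
    scan.setStopRow(rowkey(seriesId, toTs));     // stop row is exclusive
    return scan;
  }
}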
Thanks
Manjeet singh
On Wed, 18 Sep 2019, 22:23 Solvannan R M, <[EMAIL PROTECTED]lid> wrote: