Last week on a separate thread it was suggested that I use tableOperations.deleteRows for deleting rows that match specific ranges. I was curious to try it out and see if it's better than my current implementation, which iterates over all rows and calls putDelete for each. While researching, I also found that Accumulo already provides BatchDeleter, which does the same thing. I tried all three, and below are my test results against three different tables (numbers are in milliseconds):
Test 1 (scanning all rows and calling putDelete for each):
Table 1: 5,702
Table 2: 6,912
Table 3: 4,694
Test 3 (using tableOperations.deleteRows; note that I first iterate over all rows just to get the last row id, which is then passed as an argument to the function):
Table 1: 196,597
Table 2: 226,496
Table 3: 8,442

I ran the tests a few times, and pretty much got the consistent results above. I haven't looked at what deleteRows is actually doing, but judging from my test results, its performance is terrible. Note that for that test I did scan and iterate just to get the last row id, but even if I subtract the time spent doing that, it's still way too slow. Therefore, I'd recommend avoiding deleteRows for this scenario. YMMV, but I'd stick with my original approach, which is the same as Test 1 above. Thanks, Z

View this message in context: http://apache-accumulo.1065345.n5.nabble.com/delete-rows-test-result-tp15569.html
Sent from the Developers mailing list archive at Nabble.com.
What happens when you subtract the time to read all of your rows? deleteRows is designed so you don't have to read any data: you can compute a range to delete. For instance, in a time series table, it's trivial to give a start and end date as your rows and call deleteRows.
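To make that concrete, here is a minimal sketch of computing a delete range for a time-series table without reading any data. The date-based row layout and class/method names are illustrative assumptions, not anything from this thread; the computed bounds would then be handed to tableOperations().deleteRows (check the javadoc for the exact inclusivity of the start and end bounds).

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

class TimeSeriesRange {
    // Assumed row-key layout: rows are dates formatted as yyyyMMdd, so they
    // sort chronologically and a date window maps directly to a row range.
    static final DateTimeFormatter ROW_FMT = DateTimeFormatter.BASIC_ISO_DATE;

    // Compute [startRow, endRow] for a date window -- no scan required.
    static String[] rowRange(LocalDate start, LocalDate end) {
        return new String[] { ROW_FMT.format(start), ROW_FMT.format(end) };
    }
}
```

With bounds like these in hand, the delete itself is a single client call, with no scan of the data being removed.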
On Mon, Nov 16, 2015 at 10:35 AM, z11373 <[EMAIL PROTECTED]> wrote:
"Reading" all of the rows first implies you're bringing back the entire result to a client, which provides you serial access to the data.
I think you should re-run test #3 so that it measures only the time it takes to call deleteRows. I'm emphasizing this because I've worked on projects that could quickly define a range to be deleted without reading any data, and using deleteRows decreased our latency significantly.
On Mon, Nov 16, 2015 at 11:19 AM, z11373 <[EMAIL PROTECTED]> wrote:
An advantage of deleteRows is that it can drop entire tablets that fall completely within the range. However, the tablet at the end of the range may need to be compacted in order to extend its range. Using deleteRows for a "small" range that falls completely within a single tablet may be suboptimal. Is that your case? How many key/value pairs are you deleting? If it's not the compaction that's causing the delay, then there may be a bug.
Not sure if it will help, but there is a utility function for finding the max row. It does a binary search within the key space.
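The idea behind that kind of utility can be sketched as an oracle-style binary search. This is only an illustration (the class and method names are made up): the TreeSet stands in for the table, and each ceiling() probe corresponds to one short scan in the real key space, so the max row is found in O(log range) probes instead of a full scan.

```java
import java.util.NavigableSet;

class MaxRowSearch {
    // Binary search for the largest existing row id in [lo, hi], given an
    // oracle answering "does any row >= x exist?". In Accumulo each probe
    // would be a short scan; here a NavigableSet stands in for the table.
    static long findMax(NavigableSet<Long> rows, long lo, long hi) {
        long best = -1;
        while (lo <= hi) {
            long mid = lo + (hi - lo) / 2;
            Long hit = rows.ceiling(mid);      // one probe: first row >= mid
            if (hit != null && hit <= hi) {
                best = hit;                    // a row exists at or above mid
                lo = hit + 1;                  // keep looking for a larger one
            } else {
                hi = mid - 1;                  // nothing >= mid within range
            }
        }
        return best;                           // -1 if the range holds no rows
    }
}
```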
Anyone seen this exception before when calling deleterows from shell?
user@dev> deleterows -t T1 -b 9000 -e 9001
Thread "shell" died no net in java.library.path
java.lang.UnsatisfiedLinkError: no net in java.library.path
        at java.lang.ClassLoader.loadLibrary(ClassLoader.java:1865)
        at java.lang.Runtime.loadLibrary0(Runtime.java:870)
        at java.lang.System.loadLibrary(System.java:1122)
        at java.net.AbstractPlainSocketImpl$1.run(AbstractPlainSocketImpl.java:84)
        at java.net.AbstractPlainSocketImpl$1.run(AbstractPlainSocketImpl.java:82)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.AbstractPlainSocketImpl.<clinit>(AbstractPlainSocketImpl.java:81)
        at java.net.Socket.setImpl(Socket.java:503)
        at java.net.Socket.<init>(Socket.java:84)
        at org.apache.thrift.transport.TSocket.initSocket(TSocket.java:116)
        at org.apache.thrift.transport.TSocket.<init>(TSocket.java:109)
        at org.apache.thrift.transport.TSocket.<init>(TSocket.java:94)
        at org.apache.accumulo.core.util.ThriftUtil.createClientTransport(ThriftUtil.java:277)
        at org.apache.accumulo.core.client.impl.ThriftTransportPool.createNewTransport(ThriftTransportPool.java:487)
        at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:420)
        at org.apache.accumulo.core.client.impl.ThriftTransportPool.getTransport(ThriftTransportPool.java:397)
        at org.apache.accumulo.core.util.ThriftUtil.getClient(ThriftUtil.java:128)
        at org.apache.accumulo.core.util.ThriftUtil.getClientNoTimeout(ThriftUtil.java:116)
        at org.apache.accumulo.core.client.impl.MasterClient.getConnection(MasterClient.java:67)
        at org.apache.accumulo.core.client.impl.MasterClient.getConnectionWithRetry(MasterClient.java:45)
        at org.apache.accumulo.core.client.impl.TableOperationsImpl.beginFateOperation(TableOperationsImpl.java:233)
        at org.apache.accumulo.core.client.impl.TableOperationsImpl.doFateOperation(TableOperationsImpl.java:303)
        at org.apache.accumulo.core.client.impl.TableOperationsImpl.doFateOperation(TableOperationsImpl.java:295)
        at org.apache.accumulo.core.client.impl.TableOperationsImpl.doTableFateOperation(TableOperationsImpl.java:1594)
        at org.apache.accumulo.core.client.impl.TableOperationsImpl.deleteRows(TableOperationsImpl.java:557)
        at org.apache.accumulo.core.util.shell.commands.DeleteRowsCommand.execute(DeleteRowsCommand.java:39)
        at org.apache.accumulo.core.util.shell.Shell.execCommand(Shell.java:747)
        at org.apache.accumulo.core.util.shell.Shell.start(Shell.java:607)
        at org.apache.accumulo.core.util.shell.Shell.main(Shell.java:528)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:497)
        at org.apache.accumulo.start.Main$1.run(Main.java:141)
        at java.lang.Thread.run(Thread.java:745)
Deletemany does a scan and issues selective putDeletes based on the matches from the scan.
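The difference between the two strategies can be sketched with an in-memory sorted map standing in for a table (this is an illustration, not the actual shell code): deletemany visits every entry and deletes the ones that match, while deleterows drops a whole contiguous range in one operation without inspecting its contents.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Predicate;

class DeleteSketch {
    // deletemany style: scan every entry, delete the ones that match.
    static int deleteMatching(TreeMap<String, String> table,
                              Predicate<String> rowMatches) {
        int deleted = 0;
        Iterator<Map.Entry<String, String>> it = table.entrySet().iterator();
        while (it.hasNext()) {
            if (rowMatches.test(it.next().getKey())) {
                it.remove();    // one putDelete per matching entry
                deleted++;
            }
        }
        return deleted;
    }

    // deleterows style: drop a contiguous row range (modeled here as
    // exclusive start, inclusive end) without reading the entries.
    static void deleteRange(TreeMap<String, String> table,
                            String startExclusive, String endInclusive) {
        table.subMap(startExclusive, false, endInclusive, true).clear();
    }
}
```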
Deleterows doesn't use authorizations, because it just drops whole ranges without discrimination of their contents. We don't partition ranges of keys based on authorizations, so deleterows wouldn't be able to make use of this parameter. Your application could do that, if efficient deletes for particular authorizations were essential, by making an authorization string a prefix of your row, or by partitioning your data into separate tables based on authorizations. But this probably wouldn't be that useful unless all your deletes of this nature were associated with a single authorization which you knew in advance.
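A tiny illustration of the prefix idea (the key scheme, separator, and sentinel are hypothetical, not anything Accumulo prescribes): if the authorization string leads the row key, all rows for one authorization sort contiguously, so a single row range covers them all.

```java
class AuthPrefixedRow {
    // Hypothetical scheme: "<auth>|<rowId>". Rows for one auth sort together.
    static String rowKey(String auth, String rowId) {
        return auth + "|" + rowId;
    }

    // Row-range bounds covering every row for a given auth prefix.
    // "\u007f" is an illustrative upper sentinel above printable characters.
    static String[] rangeFor(String auth) {
        return new String[] { auth + "|", auth + "|\u007f" };
    }
}
```

Under a layout like this, deleting everything for one authorization becomes a single range delete instead of a filtered scan.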
On Wed, Nov 18, 2015 at 6:10 PM z11373 <[EMAIL PROTECTED]> wrote:
Thanks Christopher! I had no idea why it didn't work yesterday; that's why I thought it might be looking for the authz. I just tried running the command again from the shell, and this time it works fine. Yes, the authz string is actually already used as a prefix of the row in our case, so it works nicely :-)
Hi William, I re-ran the same test calling deleteRows without scanning the table first (so only the deleteRows operation is timed here), and you're right, it's faster, as shown in the results below.
Table 1: 3,301
Table 2: 3,184
Table 3: 2,635
It's definitely faster in comparison to the fastest result I got by scanning the table and calling putDelete for each, shown below.
Table 1: 5,702
Table 2: 6,912
Table 3: 4,694
However, there is one case I didn't mention last time, in which the table has a summing combiner installed. So even though it may appear to have 1M rows, it can actually contain 10M entries or more underneath, which may explain why deleteRows can take longer. Still, something seems wrong looking at my test results.
Test 1 (scanning and calling putDelete for each):
Table 4 (with summing combiner): 11,081
Test 2 (calling deleteRows):
Table 4 (with summing combiner): 197,050
Last time someone mentioned compaction, so I was curious and ran the following test: compact first, then call deleteRows (to see if it'd be faster). Here are the results:
Compact on Table 4 (with summing combiner): 376,619
Call deleteRows on Table 4 (with summing combiner): 188,862
So given the results above, I'd say compacting the table first doesn't help. Perhaps I did something wrong here. It seems to me that for certain cases (like this one), scanning the table and calling putDelete for each row performs better than calling deleteRows. Does this make sense? Thanks, Z
Revisiting this thread... I just want to know whether deleteRows is inappropriate for a table with summing combiners. The problem with scanning and calling putDelete for each row is that it consumes more memory, though from my tests it is way faster than calling deleteRows for this particular case.
deleteRows should be fine with a combiner, but it's probably not going to be efficient for small ranges. ACCUMULO-3235 should make it more efficient, but it'll still probably add a split point and (very) briefly take tablets offline.
On Mon, Nov 30, 2015 at 12:46 PM z11373 <[EMAIL PROTECTED]> wrote:
Hi Christopher, do you have any idea what I should do to improve the performance in my case, or should I wait for ACCUMULO-3235?
If you look at my test results, calling deleteRows was >15x slower than calling putDelete for the same table and data. Is it because the actual number of entries (i.e., before being combined) is way bigger than the number of combined rows? I'd imagine that if deleteRows has to delete 100M entries while putDelete only needs to deal with the 3-4M combined rows, that may explain why it takes so long. Thanks, Z

View this message in context: http://apache-accumulo.1065345.n5.nabble.com/delete-rows-test-result-tp15569p15637.html
Without ACCUMULO-3235, one way you can make deleteRows faster is to only use it to delete rows on existing tablet boundaries. Even then, there may be cases where it's going to do a chop compaction before it completes the delete, and some tablets may be offline while it does this.
Aside from possibly only using existing tablet boundaries, I'm not sure there is anything you can do which would be faster.
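A minimal sketch of "only delete on existing tablet boundaries": snap the requested end row down to the nearest split point at or below it, so the range delete never cuts through the middle of a tablet. The TreeSet here is a stand-in for the table's current split points (which the client API can list); the names are illustrative.

```java
import java.util.TreeSet;

class BoundaryAlignedDelete {
    // Snap the requested end row down to the nearest existing split point at
    // or below it; returns null if no split falls at or below endRow (in
    // which case a boundary-aligned delete isn't possible yet).
    static String snapToSplit(TreeSet<String> splits, String endRow) {
        return splits.floor(endRow);
    }
}
```

Any rows between the snapped boundary and the originally requested end would need to wait for a later delete, once more splits exist past them.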
If the deleteMany (scan/putDelete) strategy is faster for you, and memory is less important than speed, then stick with that. That's almost certainly going to be better if the data you wish to delete is interspersed with data you wish to keep.
deleteRows is going to work best in cases where you have large quantities of sequential rows to delete, spanning more than one tablet. If your application can tolerate it, you could wait for a significantly large run before doing a delete. For instance, if you wish to age-off old data, and your data is ordered by time, you could age off once a week instead of daily, to allow the ranges of things to delete to build up.
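The weekly age-off idea can be sketched as a simple policy check (the threshold, row format, and names are illustrative): only trigger the range delete once the accumulated window of expired data is wide enough that the range is likely to span whole tablets.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

class AgeOffPolicy {
    // Only age off once at least minDays of expired, time-ordered data have
    // accumulated, so the deleted range is more likely to cover full tablets.
    static boolean shouldAgeOff(LocalDate oldestRow, LocalDate cutoff, long minDays) {
        return ChronoUnit.DAYS.between(oldestRow, cutoff) >= minDays;
    }
}
```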
On Mon, Nov 30, 2015 at 3:18 PM z11373 <[EMAIL PROTECTED]> wrote: