I want to get the total number of rows in a table (it likely has more than 100M rows). I think that to get this information, Accumulo would have to iterate over all rows :-( This may not be a typical Accumulo scenario.
Is there a more efficient way to get the total number of rows in a table? When Accumulo iterates over those items, does that mean it pulls the data to the client? If so, is there a way to ask it to return just the number, since that's the only data I care about?
Yeah, there's no explicit tracking of all rows in Accumulo; you're stuck with enumerating them (or explicitly tracking them yourself at ingest time).
The easiest approach you can take is probably using the FirstEntryInRowIterator and counting each row on the client-side.
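To make the client-side approach concrete, here is a minimal sketch in plain Java. It does not use the Accumulo API: it only simulates what a first-entry-per-row filter hands back from a sorted stream of entries, and then counts what arrives. The class name, the helper names, and the String[]{row, column} representation are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simulation only: no Accumulo classes are used here.
class FirstEntryCountSketch {

    // Keeps the first entry of each row from entries sorted by row.
    // Each entry is a hypothetical String[]{row, column} pair.
    static List<String[]> firstEntryPerRow(List<String[]> sortedEntries) {
        List<String[]> firstEntries = new ArrayList<>();
        String previousRow = null;
        for (String[] entry : sortedEntries) {
            if (!entry[0].equals(previousRow)) {
                firstEntries.add(entry);
                previousRow = entry[0];
            }
        }
        return firstEntries;
    }

    // Client-side count: one returned entry per row, so the count is the size.
    static long countRows(List<String[]> sortedEntries) {
        return firstEntryPerRow(sortedEntries).size();
    }

    public static void main(String[] args) {
        List<String[]> entries = Arrays.asList(
            new String[] {"row1", "a"},
            new String[] {"row1", "b"},
            new String[] {"row2", "a"},
            new String[] {"row3", "a"},
            new String[] {"row3", "b"},
            new String[] {"row3", "c"});
        System.out.println("rows = " + countRows(entries)); // prints "rows = 3"
    }
}
```

Note that one entry per row still crosses the network in this scheme, so the saving depends on how many entries each row has.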
You could do another summation in a second iterator but this is a little tricky to get correct. I tried to touch on this a little in a blog post. If this is a one-off question you want to answer, doing the summation on the client side is likely not to take excessively longer than a server-side summation.
Note that CountingIterator is in the system iterator package (FirstEntryInRowIterator also isn't in the user package for iterators, so its stability is a little questionable too). I think David ran into this a long time ago as well.
Stable versions of both of these would be good, IMO. It isn't like Z is the first one to ask how to count the unique rows :)
It's not recommended to read the Metadata table? When I needed the 'real' number, I ran a compaction. When I needed an estimate I just read the table. I also upgraded our ingest process to track numbers as a second phase to avoid the need for compaction to get 'real' numbers.
On Mon, Nov 9, 2015 at 10:52 AM, Josh Elser <[EMAIL PROTECTED]> wrote:
@Josh: Is my understanding correct that iterating over the rows to get the count on the client side versus the server side doesn't make a significant performance difference?
Besides a counting iterator, could we also add a feature for deleting in bulk? Right now, I have to go through each row and call putDelete from the client. I wish there were a magic way to tell the server to delete all rows in a specific range. Thanks, Z
There is a performance difference. In the worst case, all of the data is still returned to the client to be scanned, even with a FirstEntryInRowIterator: imagine a table layout where each Key/Value pair represents a single row or document. A counting iterator instead returns one count (most likely a 64-bit long) per tablet, which the client can then add together.
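The per-tablet summation can be sketched in plain Java as follows. This is a simulation, not Accumulo code: each inner list stands in for the sorted row IDs of the entries held by one tablet, and the class and method names are hypothetical. It assumes exactly one partial count per tablet; a real server-side iterator may be re-seeked mid-scan and emit several partial counts, which is part of why getting this right is tricky.

```java
import java.util.Arrays;
import java.util.List;

// Simulation only: lists of row IDs stand in for tablets.
class PerTabletCountSketch {

    // "Server side": count distinct rows among one tablet's sorted entries.
    static long countRowsInTablet(List<String> sortedRowIdsOfEntries) {
        long rows = 0;
        String previousRow = null;
        for (String row : sortedRowIdsOfEntries) {
            if (!row.equals(previousRow)) {
                rows++;
                previousRow = row;
            }
        }
        return rows;
    }

    // "Client side": add up one long per tablet.
    // Safe because a row never spans two tablets, so no row is counted twice.
    static long sumPartialCounts(List<List<String>> tablets) {
        long total = 0;
        for (List<String> tablet : tablets) {
            total += countRowsInTablet(tablet);
        }
        return total;
    }

    public static void main(String[] args) {
        List<List<String>> tablets = Arrays.asList(
            Arrays.asList("a", "a", "b"),        // tablet 1: rows a, b
            Arrays.asList("m", "n", "n", "n"),   // tablet 2: rows m, n
            Arrays.asList("x"));                 // tablet 3: row x
        System.out.println("rows = " + sumPartialCounts(tablets)); // prints "rows = 5"
    }
}
```

In this example the client receives three numbers rather than eight entries; on a table where every entry is its own row, the gap is the difference between one long per tablet and the whole table.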
There is a deleteRows feature (TableOperations#deleteRows) which may be what you want. It avoids having to bring data back to the client.
On Thu, Nov 12, 2015 at 9:23 AM, z11373 <[EMAIL PROTECTED]> wrote:
Thanks William! This is indeed what I was looking for.
Text startRow = new Text("k");
Text endRow = new Text("r");
ops.deleteRows("myTable", startRow, endRow);
The Accumulo book says: "When you specify start and end rows, the deleteRows() method will remove rows that sort after but not including the start row, and rows that sort before and including the end row."
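The exclusive-start, inclusive-end behavior in that quote can be illustrated with a small plain-Java simulation. A TreeSet of row IDs stands in for the table; this is not the Accumulo API, and the class and method names are made up for the example. It also assumes ASCII row IDs, so that Java's String ordering matches byte ordering.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Simulation only: a sorted set of row IDs stands in for a table.
class DeleteRowsRangeSketch {

    // Returns the rows left after removing the range (startExclusive, endInclusive].
    static TreeSet<String> deleteRows(TreeSet<String> rows,
                                      String startExclusive,
                                      String endInclusive) {
        TreeSet<String> remaining = new TreeSet<>(rows);
        // subSet(from, false, to, true) is a live view of (from, to]; clearing it
        // removes exactly those elements from the backing set.
        remaining.subSet(startExclusive, false, endInclusive, true).clear();
        return remaining;
    }

    public static void main(String[] args) {
        TreeSet<String> rows =
            new TreeSet<>(Arrays.asList("j", "k", "ka", "m", "r", "ra", "s"));
        // Row "k" survives (start is exclusive); row "r" is removed (end is inclusive).
        System.out.println(deleteRows(rows, "k", "r")); // prints [j, k, ra, s]
    }
}
```

So with start "k" and end "r", row "k" itself is not deleted, which is the inconvenience if "k" was meant to be included.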
Ick, that's kind of a pain. We should probably have some kind of utility to compute this for you.
You'd likely want to treat the row as bytes (instead of thinking in terms of characters) and decrement the last byte in the array. We have a static method, Range.followingPrefix(Text), which goes the opposite way. You could try to take that approach, while we (hopefully) consider adding a Range.previousRow(Text) or something.
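A rough sketch of that byte-level decrement, in plain Java on a byte[] rather than a Text, might look like the following. The previousRow helper and its padding parameter are hypothetical; no such utility exists in Accumulo as far as this thread says. The padding is an addition beyond the suggestion above: simply decrementing "k" to "j" leaves rows such as "ja" between the new start and "k", and appending 0xFF bytes narrows that gap without fully closing it.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Accumulo.
class PreviousRowSketch {

    // Returns a row that sorts before the given row in unsigned byte order.
    // padding = number of 0xFF bytes appended after the decremented byte.
    static byte[] previousRow(byte[] row, int padding) {
        if (row.length == 0) {
            throw new IllegalArgumentException("the empty row has no predecessor");
        }
        int last = row[row.length - 1] & 0xFF; // read the byte as unsigned
        if (last == 0) {
            // Dropping a trailing 0x00 gives the exact immediate predecessor.
            return Arrays.copyOf(row, row.length - 1);
        }
        byte[] result = Arrays.copyOf(row, row.length + padding);
        result[row.length - 1] = (byte) (last - 1);
        Arrays.fill(result, row.length, result.length, (byte) 0xFF);
        return result;
    }

    public static void main(String[] args) {
        byte[] start = previousRow(new byte[] {'k'}, 0);
        System.out.println((char) start[0]); // prints "j"
    }
}
```

The result could then be wrapped in a Text and passed as the start row to deleteRows, with the caveat that any rows sorting between the computed start and the intended first row would be deleted as well.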