Subject: Materialized views in Hbase/Phoenix


My first email details the use cases:
1. Get values for certain row/column sets – this is where HBase comes in handy, as we can easily query based on row key and column. No more than 500 rows and 30 columns will be queried.

2. Get an entire column.

3. Get aggregations per column based on groupings of row keys.

Number (1) is easily satisfied by HBase, with fast lookups on the row key.
(2) and (3) will have to be precomputed, storing compressed data by column for (2) and the aggregations for (3). My main concern here was maintaining data consistency between the tables created for each matrix.
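
As a rough sketch of (1), a point lookup by row key through the Python phoenixdb client might look like the following (the Phoenix Query Server URL, table name, and column names are placeholders, not our actual schema):

import phoenixdb

# Connect through the Phoenix Query Server (URL is a placeholder).
conn = phoenixdb.connect('http://phoenix-queryserver:8765/', autocommit=True)
cur = conn.cursor()

# Use case (1): fetch a few columns for a small set of row keys.
# Fast in HBase/Phoenix because the filter is on the primary (row) key.
cur.execute(
    "SELECT row_id, col_a, col_b FROM EXPRESSION_MATRIX WHERE row_id IN (?, ?, ?)",
    ['row_001', 'row_002', 'row_003']
)
for row in cur.fetchall():
    print(row)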

From: Pedro Boado [mailto:[EMAIL PROTECTED]]
Sent: Friday, September 27, 2019 10:53 AM
To: [EMAIL PROTECTED]
Subject: Re: Materialized views in Hbase/Phoenix

Yeah, Phoenix won't aggregate billions of rows in under 100ms (probably nothing will).

This sounds more and more like an OLAP use case, doesn't it? A facts table with billions of rows (you can still handle those volumes with a sharded RDBMS) that will never be queried directly, and precomputed aggregations to be queried interactively (maybe you could use Phoenix here, but you could also use an RDBMS, which additionally can give you all the guarantees you're looking for).

If that's the case, I don't really think HBase/Phoenix is the right choice (it's good at doing gets by key or running scans/aggregations over reasonable key intervals).
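
To make the "precomputed aggregations queried interactively" idea concrete, a minimal sketch (invented table layout, Python phoenixdb client): if the aggregates are keyed by matrix/group, the interactive read becomes a plain key lookup, which is exactly what HBase/Phoenix is good at.

import phoenixdb

conn = phoenixdb.connect('http://phoenix-queryserver:8765/', autocommit=True)  # placeholder URL
cur = conn.cursor()

# Read a precomputed per-column aggregate keyed by (matrix_id, group_id, column_id):
# no scanning of the billions-of-rows facts table at query time.
cur.execute(
    "SELECT column_id, mean_value FROM PRECOMPUTED_COLUMN_MEANS "
    "WHERE matrix_id = ? AND group_id = ?",
    ['matrix_42', 'cluster_7']
)
print(cur.fetchall())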

Maybe explaining the use case could help (we are getting more info drop by drop with each new message, in terms of volume, expected query patterns, concurrency, etc.). For instance, how are these hundreds of queries interacting with the DB? Via a REST API?

On Fri, 27 Sep 2019, 17:39 Gautham Acharya, <[EMAIL PROTECTED]> wrote:
We are looking at being able to support hundreds of concurrent queries, but not too many more.

Will aggregations be performant across these large datasets? (e.g. give me the mean value of each column when all rows are grouped by a certain row property).
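
For concreteness, the on-the-fly version of that aggregation would be something like the sketch below (Python phoenixdb client, invented table/column names); it's this whole-table GROUP BY that I doubt will meet our latency target at billions of rows.

import phoenixdb

conn = phoenixdb.connect('http://phoenix-queryserver:8765/', autocommit=True)  # placeholder URL
cur = conn.cursor()

# Mean of each value column, grouping all rows by a row-level property.
# This has to scan the full table, so latency grows with row count.
cur.execute(
    "SELECT row_property, AVG(col_a), AVG(col_b) "
    "FROM EXPRESSION_MATRIX GROUP BY row_property"
)
print(cur.fetchall())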

Precomputing seems much more efficient.

From: Pedro Boado [mailto:[EMAIL PROTECTED]]
Sent: Friday, September 27, 2019 9:27 AM
To: [EMAIL PROTECTED]
Subject: Re: Materialized views in Hbase/Phoenix

Can the aggregation be run on the fly in a Phoenix query? 100ms response time, but... with how many concurrent queries?

On Fri, 27 Sep 2019, 17:23 Gautham Acharya, <[EMAIL PROTECTED]> wrote:
We will be reaching 100 million rows early next year, and then billions shortly after that. So HBase will be needed to scale to that degree.

If one of the tables fails to write, we need some kind of a rollback mechanism, which is why I was considering a transaction. We cannot be in a partial state where some of the ‘views’ are written and some aren’t.
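
Roughly, the kind of all-or-nothing write I have in mind, assuming the 'view' tables were created with TRANSACTIONAL=true and a transaction provider is configured (Python phoenixdb client; names and values below are placeholders):

import phoenixdb

conn = phoenixdb.connect('http://phoenix-queryserver:8765/', autocommit=False)  # placeholder URL
cur = conn.cursor()
packed_blob = b''  # placeholder for the compressed per-column data

try:
    # Both 'views' for the matrix are written in the same transaction...
    cur.execute(
        "UPSERT INTO MATRIX_42_COLUMNS (column_id, packed_values) VALUES (?, ?)",
        ['col_a', packed_blob])
    cur.execute(
        "UPSERT INTO MATRIX_42_COLUMN_MEANS (group_id, column_id, mean_value) VALUES (?, ?, ?)",
        ['cluster_7', 'col_a', 0.42])
    conn.commit()      # ...so either both become visible, or neither
except Exception:
    conn.rollback()    # discard partial writes
    raise
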
From: Pedro Boado [mailto:[EMAIL PROTECTED]]
Sent: Friday, September 27, 2019 7:22 AM
To: [EMAIL PROTECTED]
Subject: Re: Materialized views in Hbase/Phoenix

For just a few million rows I would go for an RDBMS and not Phoenix/HBase.

You don't really need transactions to control completion; just write a flag (an empty COMPLETED file, for instance) as a final step in your job.
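
For instance (just a sketch, with made-up bucket/key names, using boto3): the job writes every table first and drops an empty COMPLETED marker on S3 last; readers only switch to the new tables once the marker exists.

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client('s3')
BUCKET = 'my-analysis-bucket'                  # placeholder
MARKER = 'matrix_42/v20190927/_COMPLETED'      # placeholder, one marker per run

def mark_run_complete():
    # Writer: call only after every 'materialized view' table for this run is written.
    s3.put_object(Bucket=BUCKET, Key=MARKER, Body=b'')

def run_is_complete():
    # Reader: ignore the run entirely unless the marker is present.
    try:
        s3.head_object(Bucket=BUCKET, Key=MARKER)
        return True
    except ClientError:
        return False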

On Fri, 27 Sep 2019, 15:03 Gautham Acharya, <[EMAIL PROTECTED]> wrote:
Thanks Anil.

So, what you’re essentially advocating for is using some kind of Spark/compute framework job (I was going to use AWS Glue) to write the ‘materialized views’ as separate tables (maybe tied together with some kind of naming convention?).

In this case, we’d end up with some sticky data consistency issues if the write job failed halfway through (some ‘materialized view’ tables would be updated, and some wouldn’t). Can I use Phoenix transactions to wrap the write jobs together, to make sure either all the data is updated or none of it is?
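
For reference, a rough sketch of how I'm picturing the job writing one of those ‘materialized view’ tables from Spark via the phoenix-spark connector (the table name, zkUrl, versioned naming convention, and facts layout are all assumptions on my part; the exact connector format/options also depend on the Phoenix version):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("build-matrix-views").getOrCreate()

# Facts for one matrix, e.g. Parquet on S3 (path is a placeholder).
facts = spark.read.parquet("s3://my-analysis-bucket/matrix_42/facts/")

# Use case (3): mean of each value column per row-level grouping.
col_means = (facts
             .groupBy("row_property", "column_id")
             .agg(F.avg("value").alias("mean_value")))

# Push the precomputed aggregate into a (pre-created) Phoenix table.
# Requires the phoenix-spark connector jar on the job's classpath.
(col_means.write
    .format("org.apache.phoenix.spark")
    .mode("overwrite")
    .option("table", "MATRIX_42_COLUMN_MEANS_V20190927")   # naming-convention idea
    .option("zkUrl", "zookeeper-host:2181")                # placeholder
    .save())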

From: anil gupta [mailto:[EMAIL PROTECTED]]
Sent: Friday, September 27, 2019 6:58 AM
To: [EMAIL PROTECTED]
Subject: Re: Materialized views in Hbase/Phoenix

For your use case, I would suggest creating another table that stores the matrix. Since this data doesn't change that often, maybe you can write a nightly Spark/MR job to update/rebuild the matrix table. (If you want near real time, that is also possible with any streaming system.) Have you looked into bloom filters? They might help if you have a sparse dataset and you are using Phoenix dynamic columns.
We use dynamic columns for a table that has up to 40k columns. Here is the presentation and the optimizations we made for that use case: https://www.slideshare.net/anilgupta84/phoenix-con2017-truecarfinal
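
In case it helps, a tiny sketch of the dynamic-column access pattern from Python (phoenixdb client; table and column names are just examples). The columns are not part of the table DDL, so you declare them inline in each statement:

import phoenixdb

conn = phoenixdb.connect('http://phoenix-queryserver:8765/', autocommit=True)  # placeholder URL
cur = conn.cursor()

# Upsert into a dynamic column by declaring its name and type inline.
cur.execute(
    "UPSERT INTO EXPRESSION_MATRIX (row_id, gene_00042 DOUBLE) VALUES (?, ?)",
    ['row_001', 3.14]
)

# Read it back; dynamic columns are declared after the table name.
cur.execute(
    "SELECT row_id, gene_00042 FROM EXPRESSION_MATRIX (gene_00042 DOUBLE) WHERE row_id = ?",
    ['row_001']
)
print(cur.fetchall())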