Hi zhang, you may need Hybrid model: https://kylin.apache.or
Community user Roberto shared his experience with it before; you can check this:
2018-06-13 12:22 GMT+08:00 张 <[EMAIL PROTECTED]>:
Shaofeng Shi 史少锋 Hybrid Model
Hybrid Model Tutorial
Roberto Tardío Olmos
Document Version: [1.0]
1. Introduction
1.1. Hybrid Model
2. Defining a Cube Using Hybrid Model
2.1. Initial scenario
2.2. Defining a Hybrid model
3. Conclusions
3.1. Conclusions
Apache Kylin is an M-OLAP engine for Big Data scenarios. After choosing the data sources (e.g. Hive, Kafka), we define a Data Model over these data sources. Then we define a Cube over this data model and build it in order to load data into our M-OLAP analytic model, i.e. the resulting database. This M-OLAP Cube stores pre-combined and pre-aggregated data from fact and dimension tables. Thanks to this M-OLAP approach on top of the Hadoop framework, Kylin supports sub-second SQL queries over fact tables with billions of rows.
However, once we have built a Cube, we can modify neither the Cube definition nor the Data Model without purging all data stored in the cube. This restriction is easy to understand, because the Cube storage format on HBase is tightly bound to the Cube definition in Kylin. Thus, adding a column to the Data Model and using it to define a new Measure or Dimension requires purging all data stored in the Cube and rebuilding it. Because the cube rebuilding process can take several hours (or even days) and requires significant Hadoop cluster resources (YARN cores and memory), cube modification can be a big problem in many real scenarios.
In order to mitigate the issues related to cube modification, Apache Kylin provides a method called Hybrid Model. This hybrid model allows us to combine data from two Cubes defined over the same Data Model. Therefore, we can define a new cube that includes the modifications we need to make to an existing Cube. If we define a Hybrid Model that includes these two cubes (old and new), we can query them as one cube. The data from common dimensions and facts will be joined by Kylin’s query engine at query time.
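To make the idea of "querying two cubes as one" more concrete, here is a minimal toy sketch in Python. It is not Kylin's implementation: the dictionaries, the country codes, the numbers and the function name query_hybrid are all hypothetical. It only assumes that each cube holds pre-aggregated sums per dimension value and that the combined answer adds up what both cubes contribute.

```python
from collections import defaultdict

# Hypothetical stand-ins for pre-aggregated cube data:
# {dimension value: SUM(measure)}. Names and numbers are illustrative only.
cube_v1 = {"ES": 120.0, "FR": 80.0}   # historical data (old cube)
cube_v2 = {"ES": 30.0, "DE": 15.0}    # data built after the change (new cube)


def query_hybrid(*cubes):
    """Combine per-dimension sums from several cubes into a single result.

    Toy model only: it assumes the cubes cover disjoint data, so adding
    their partial sums does not count any row twice.
    """
    totals = defaultdict(float)
    for cube in cubes:
        for dimension_value, partial_sum in cube.items():
            totals[dimension_value] += partial_sum
    return dict(totals)


result = query_hybrid(cube_v1, cube_v2)
print(result)
```

In this toy model, a dimension value present in both cubes gets one combined total, while values present in only one cube are passed through unchanged.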
The hybrid model was introduced in Apache Kylin 1.0 and was briefly explained on the Kylin web page blog: http://kylin.apache.org/blog/2015/09/25/hybrid-model/. As I used this feature in a real project, with this document I aim to help Kylin users better understand this feature, how to use it, and its limitations.
Defining a Cube Using Hybrid Model
We suppose that we have one Cube called Cube_V1 defined over a Data Model called My_Data_Model. This data model uses several Hive tables as data sources, e.g. structured as a Star Schema. This cube has been built incrementally over a long time, so it stores about 2,000 million rows. But now, end users ask for new measure and dimension columns in the analytic model. Therefore, we need to modify the following elements of our cube:
Data Sources: Add new tables from Hive or refresh the existing ones to pick up changes in the data source model.
Data Model: Add new lookup tables (dimensions), dimension columns and measure columns.
Cube Definition: Add the new dimensions and measures to the cube.
However, as I explained before, we cannot modify the cube definition without purging all data stored in it. Also, we cannot modify an existing data model if there are enabled cubes using it. Therefore, in this case I recommend keeping the original cube Cube_V1 unchanged and creating a new cube Cube_V2 in order to apply the changes. At the end, we will create a Hybrid Model that combines both cubes.
In our scenario, we also have to modify the data model My_Data_Model and update the data sources. We have to keep in mind that if we change a data model or data source in a way that affects an existing cube (e.g. Cube_V1), we may not be able to build the original cube anymore. However, we can still query the original cube Cube_V1 and start to build the newly defined cube Cube_V2 over the modified data model My_Data_Model and data sources.
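Because Cube_V1 stops being built while Cube_V2 takes over, the two cubes should end up covering different time ranges; otherwise the same rows could be counted twice when both cubes are queried together. The following Python sketch shows one way to check this before building. It is only an illustration under stated assumptions: the segment ranges are hypothetical, they are modelled as half-open [start, end) date pairs, and the helpers ranges_overlap and find_overlaps are not part of Kylin.

```python
from datetime import date


def ranges_overlap(a, b):
    """Return True if two half-open ranges [start, end) share any day."""
    a_start, a_end = a
    b_start, b_end = b
    return a_start < b_end and b_start < a_end


def find_overlaps(segments_v1, segments_v2):
    """List every pair of segments (one per cube) whose ranges overlap."""
    return [
        (s1, s2)
        for s1 in segments_v1
        for s2 in segments_v2
        if ranges_overlap(s1, s2)
    ]


# Hypothetical build ranges: Cube_V1 is frozen at 2018-06-01 and
# Cube_V2 starts exactly there, so the ranges only touch, never overlap.
cube_v1_segments = [
    (date(2016, 1, 1), date(2017, 1, 1)),
    (date(2017, 1, 1), date(2018, 6, 1)),
]
cube_v2_segments = [(date(2018, 6, 1), date(2018, 7, 1))]

overlaps = find_overlaps(cube_v1_segments, cube_v2_segments)
if overlaps:
    print("Overlapping segments, data may be duplicated:", overlaps)
else:
    print("No overlap between Cube_V1 and Cube_V2 ranges.")
```

Starting the first Cube_V2 build at the end date of the last Cube_V1 segment is the simplest way to keep the ranges disjoint in this model.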
We can summarize this process in the following steps:
1. Disable cube Cube_V1.
2. Refresh/add tables on Data Sources to get changes and new tables.
3. Edit My_Data_Model (used by Cube_V1).
   a. Add/delete lookup tables (dimensions), dimension columns and measure columns.
4. Now, Cube_V1 can be enabled again if we need to continue running queries over the historical data.
   b. From this moment, you should not build cube Cube_V1 anymore.
5. Create a new cube Cube_V2.
   c. This new cube has to use the same Data Model as the old cube V1: My_Data_Model.
   d. Add/delete/modify dimensions, measures and optimizations over them (e.g. kind of dimensions or agg groups).
6. Start to build Cube_V2.
   e. Avoid data overlap between cubes V1 and V2. Overlap is not checked by the Hybrid model, so you will probably get duplicate data if there is an overlap between the two cubes used in the target hybrid model.
7. Define a Hybrid model over cubes V1 and V2.
   f. If you have enabled cubes V1 and V2 but have not yet defined a hybrid cube over them, queries will be routed to just one of these cubes:
      i. Queries that include dimensions or measures not defined on Cube_V1, but defined on Cube_V2, will be routed to the latter. Therefore, only the new data stored in Cube_V2 will be used to answer the query, and data from Cube_V1 will be ignored.
      ii. Queries that include dimensions and measures common to cubes V1 and V2 will be answered with the data stored on cube V2. Therefore, these queries will ignore dat