基于Hadoop生态圈的数据仓库实践 —— 进阶技术（八）

八、多路径和参差不齐的层次
本节讨论多路径层次，它是对单路径层次的扩展。上一节里数据仓库的月维度只有一条层次路径，即年-季度-月这条路径。在本节中加一个新的级别——促销期，并且加一个新的年-促销期-月的层次路径。这时月维度将有两条层次路径，因此具有多路径层次。本节讨论的另一个主题是不完全层次，这种层次在它的一个或多个级别上没有数据。

1. 增加一个层次
下面的脚本给month_dim表添加一个叫做campaign_session的新列，并建立rds.campaign_session过渡表。

use dw;  
-- 增加促销期列
alter table month_dim rename to month_dim_old;
create table month_dim (    
    month_sk int comment 'surrogate key',    
    month tinyint comment 'month',  
    month_name varchar(9) comment 'month name', 
    campaign_session varchar(30) comment 'campaign session',
    quarter tinyint comment 'quarter',  
    year smallint comment 'year'    
)   
comment 'month dimension table'    
clustered by (month_sk) into 8 buckets    
stored as orc tblproperties ('transactional'='true') ;  
insert into month_dim select month_sk,month,month_name,null,quarter,year from month_dim_old;
drop table month_dim_old;

-- 建立促销期过渡表  
use rds;
create table campaign_session (  
    campaign_session varchar(30),  
    month tinyint,  
    year smallint 
)
row format delimited fields terminated by ',' stored as textfile;

修改后的模式如下图所示。

假设所有促销期都不跨年，并且一个促销期可以包含一个或多个年月，但一个年月只能属于一个促销期。为了理解促销期如何工作，看下表的示例。

Campaign Session	Month
2016 First Campaign	January-April
2016 Second Campaign	May-July
2016 Third Campaign	August-August
2016 Last Campaign	September-December

每个促销期有一个或多个月。一个促销期也许并不是正好一个季度，也就是说，促销期级别不能上卷到季度，但是促销期可以上卷至年级别。2016年促销期的数据如下，并保存在campaign_session.csv文件中。

2016 First Campaign,1,20162016 First Campaign,2,20162016 First Campaign,3,20162016 First Campaign,4,20162016 Second Campaign,5,20162016 Second Campaign,6,20162016 Second Campaign,7,20162016 Third Campaign,8,20162016 Last Campaign,9,20162016 Last Campaign,10,20162016 Last Campaign,11,20162016 Last Campaign,12,2016

现在可以执行下面的脚本把2016年的促销期数据装载进月维度。

use rds;load data local inpath '/root/campaign_session.csv' overwrite into table campaign_session;use dw;drop table if exists tmp;create table tmp as select t1.month_sk month_sk,       t1.month month,       t1.month_name month_name,       t2.campaign_session campaign_session,       t1.quarter quarter,       t1.year year  from month_dim t1 inner join rds.campaign_session t2 on t1.year = t2.year and t1.month = t2.month;delete from month_dim where month_dim.month_sk in (select month_sk from tmp);insert into month_dim select * from tmp;select year,month,campaign_session from dw.month_dim;

查询结果如下图所示，2016年的促销期已经有数据，其它年份的campaign_session字段值为null。
基于Hadoop生态圈的数据仓库实践 —— 进阶技术（八）

2. 层次查询
下面的语句查询年-促销期-月层次。

USE dw; SELECT product_category, time, order_amount, order_quantity   FROM (SELECT *   FROM (SELECT product_category,         year,         1 month,         year time,         1 sequence,         SUM(month_order_amount) order_amount,         SUM(month_order_quantity) order_quantity    FROM month_end_sales_order_fact a, product_dim b, month_dim c   WHERE a.product_sk = b.product_sk     AND a.order_month_sk = c.month_sk     AND year = 2016   GROUP BY product_category, year UNION ALL SELECT product_category,         year,         month,         campaign_session time,         2 sequence,         SUM(month_order_amount) order_amount,         SUM(month_order_quantity) order_quantity    FROM month_end_sales_order_fact a, product_dim b, month_dim c   WHERE a.product_sk = b.product_sk     AND a.order_month_sk = c.month_sk     AND year = 2016   GROUP BY product_category, year, month, campaign_session UNION ALL SELECT product_category,         year,         month,         month_name time,         3 sequence,         SUM(month_order_amount) order_amount,         SUM(month_order_quantity) order_quantity    FROM month_end_sales_order_fact a, product_dim b, month_dim c   WHERE a.product_sk = b.product_sk     AND a.order_month_sk = c.month_sk     AND year = 2016   GROUP BY product_category, year, quarter, month, month_name) tCLUSTER BY product_category, year, month, sequence) t;

查询结果如下图所示。
基于Hadoop生态圈的数据仓库实践 —— 进阶技术（八）

3. 不完全层次
在一个或多个级别上没有数据的层次称为不完全层次。例如在特定月份没有促销期，那么月维度就具有不完全推广期层次。本小节说明不完全层次，还有在促销期上如何应用它。
下面是一个不完全促销期（在ragged_campaign.csv文件里）的例子，2016年1月、4月、6月、9月、10月、11月和12月没有促销期。

,1,20162016 Early Spring Campaign,2,20162016 Early Spring Campaign,3,2016,4,20162016 Spring Campaign,5,2016,6,20162016 Last Campaign,7,20162016 Last Campaign,8,2016,9,2016,10,2016,11,2016,12,2016

下面的命令先把campaign_session字段置空，然后向month_dim表装载促销期数据。

use rds;load data local inpath '/root/ragged_campaign.csv' overwrite into table campaign_session;use dw;update month_dim set campaign_session = null;drop table if exists tmp;create table tmp as select t1.month_sk month_sk,       t1.month month,       t1.month_name month_name,       case when t2.campaign_session != '' then t2.campaign_session else t1.month_name end campaign_session,       t1.quarter quarter,       t1.year year  from month_dim t1 inner join rds.campaign_session t2 on t1.year = t2.year and t1.month = t2.month;delete from month_dim where month_dim.month_sk in (select month_sk from tmp);insert into month_dim select * from tmp;select year,month,campaign_session from dw.month_dim;

查询结果如下图所示。
基于Hadoop生态圈的数据仓库实践 —— 进阶技术（八）

再次执行上面的层次查询语句，结果如下图所示。
基于Hadoop生态圈的数据仓库实践 —— 进阶技术（八）

在有促销期月份的路径，月级别行的汇总与促销期级别的行相同。而对于没有促销期的月份，其促销期级别的行与月级别的行相同。也就是说，在没有促销期级别的月份，月上卷了它们自己。例如，6月没有促销期，所以在输出看到了两个6月的行（第2行和第3行）。第3行是月份级别的行，第2行表示是没有促销期的行。对于没有促销期的月份，促销期行的销售订单金额（输出里的order_amount列）与月分行的相同。

秒客网

基于Hadoop生态圈的数据仓库实践 —— 进阶技术（八）

相关文章