Hive provides an EXPLAIN command that shows the execution plan for a query. The syntax for this statement is as follows:
EXPLAIN [EXTENDED] query
hive> explain SELECT , count(*) FROM invites a WHERE > 0 GROUP BY ;
OKABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_TABREF (TOK_TABNAME invites) a)) (TOK_INSERT (TOK_DESTINATION (TOK_DIR TOK_TMP_FILE)) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL a) bar)) (TOK_SELEXPR (TOK_FUNCTIONSTAR count))) (TOK_WHERE (> (. (TOK_TABLE_OR_COL a) foo) 0)) (TOK_GROUPBY (. (TOK_TABLE_OR_COL a) bar))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 is a root stage
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
a
TableScan
alias: a
Filter Operator
predicate:
expr: (foo > 0)
type: boolean
Filter Operator
predicate:
expr: (foo > 0)
type: boolean
Select Operator
expressions:
expr: bar
type: string
outputColumnNames: bar
Group By Operator
aggregations:
expr: count()
bucketGroup: false
keys:
expr: bar
type: string
mode: hash
outputColumnNames: _col0, _col1
Reduce Output Operator
key expressions:
expr: _col0
type: string
sort order: +
Map-reduce partition columns:
expr: _col0
type: string
tag: -1
value expressions:
expr: _col1
type: bigint
Reduce Operator Tree:
Group By Operator
aggregations:
expr: count(VALUE._col0)
bucketGroup: false
keys:
expr: KEY._col0
type: string
mode: mergepartial
outputColumnNames: _col0, _col1
Select Operator
expressions:
expr: _col0
type: string
expr: _col1
type: bigint
outputColumnNames: _col0, _col1
File Output Operator
compressed: false
GlobalTableId: 0
table:
input format:
output format:
Stage: Stage-0
Fetch Operator
limit: -1
Time taken: 0.133 seconds
hive> explain insert overwrite TABLE lpx SELECT , , FROM pokes t1 JOIN invites t2 ON ( = ) ;
OK
ABSTRACT SYNTAX TREE:
(TOK_QUERY (TOK_FROM (TOK_JOIN (TOK_TABREF (TOK_TABNAME pokes) t1) (TOK_TABREF (TOK_TABNAME invites) t2) (= (. (TOK_TABLE_OR_COL t1) bar) (. (TOK_TABLE_OR_COL t2) bar)))) (TOK_INSERT (TOK_DESTINATION (TOK_TAB (TOK_TABNAME lpx))) (TOK_SELECT (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) bar)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t1) foo)) (TOK_SELEXPR (. (TOK_TABLE_OR_COL t2) foo)))))
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-2 depends on stages: Stage-0
STAGE PLANS:
Stage: Stage-1
Map Reduce
Alias -> Map Operator Tree:
t1
TableScan
alias: t1
Reduce Output Operator
key expressions:
expr: bar
type: string
sort order: +
Map-reduce partition columns:
expr: bar
type: string
tag: 0
value expressions:
expr: foo
type: int
expr: bar
type: string
t2
TableScan
alias: t2
Reduce Output Operator
key expressions:
expr: bar
type: string
sort order: +
Map-reduce partition columns:
expr: bar
type: string
tag: 1
value expressions:
expr: foo
type: int
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
condition expressions:
0 {VALUE._col0} {VALUE._col1}
1 {VALUE._col0}
handleSkewJoin: false
outputColumnNames: _col0, _col1, _col5
Select Operator
expressions:
expr: _col1
type: string
expr: _col0
type: int
expr: _col5
type: int
outputColumnNames: _col0, _col1, _col2
File Output Operator
compressed: false
GlobalTableId: 1
table:
input format:
output format:
serde: .
name:
Stage: Stage-0
Move Operator
tables:
replace: true
table:
input format:
output format:
serde: .
name:
Stage: Stage-2
Stats-Aggr Operator
注:
ABSTRACT SYNTAX TREE为抽象语法树
从信息头:
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
Stage-2 depends on stages: Stage-0
从这里可以看出Plan计划的Job任务结构,整个任务会分为3个Job 执行,
第一个Job 将由Stage-1 构成;
第二个Job处理由Stage-0 构成,Stage-0 的处理必须依赖Stage-1 阶段的结果;
第三个Job处理由Stage-2 构成,Stage-2 的处理必须依赖Stage-0 阶段的结果。
下面分别解释 Stage-1 和 Stage-0,执行SQL可以分成两步:
(1)SELECT , , FROM pokes t1 JOIN invites t2 ON ( = );
(2)insert overwrite TABLE lpx;
Stage: Stage-1对应一次完整的 Map Reduce任务,包括:Map Operator Tree和Reduce Operator Tree两步操作,Map Operator Tree对应Map任务,Reduce Operator Tree对应Reduce任务。
从Map Operator Tree阶段可以看出进行了两个并列的操作t1和t2,分别SELECT , FROM t1;和 SELECT FROM t2;而且两个Map任务分别产生了Reduce阶段的输入[Reduce Output Operator]。
从Reduce Operator Tree分析可以看到如下信息,条件连接Map 的输出以及通过预定义的输出格式生成符合的存储格式的数据存储到HDFS 中。在我们创建lpx表
的时候,没有指定该表的存储格式,默认会以Text 为存储格式,输入输出会以TextInputFormat 与TextOutputFormat 进行读写:
table:
input format:
output format:
serde: .
name:
input format 的值对应,
这是因为在开始的Map 阶段产生的临时输出文件是以TextOutputFormat 格式保存的,自然Reduce 的读取是由TextInputFormat 格式处理读入数据。这些是由Hadoop 的MapReduce 处
理细节来控制,而Hive 只需要指定处理格式即可。
Serde 值为. 类,这时这个对象的保存的值为_col0, _col1, _col2,也就是我们预期要查询的, , ,这个值具体的应该为_col0+表lpx 设置的列分割符+_col1+表lpx 设置的列分割符+_col2。outputformat: 可以知道output 的处理是使用该类来处理的。
Stage-0 对应上面提到的第二步操作。这时stage-1 产生的临时处理文件举例如tmp,需要经过stage-0 阶段操作处理到lpx 表中。Move Operator 代表了这并不是一个
MapReduce 任务,只需要调用MoveTask 的处理就行,在处理之前会去检查输入文件是否符合lpx表的存储格式。
ref: /confluence/display/Hive/LanguageManual+Explain