Hive's data falls into two categories: table data and metadata. Table data is the data held by Hive tables themselves, while metadata records things like the table name, its columns and partitions and their properties, table-level properties (for example, whether the table is external), and the directory where the table's data lives. These are introduced in turn below.
Indexing is a standard database technique, and Hive has supported indexes since version 0.7. Hive provides only limited indexing: unlike a traditional relational database, it has no notion of a "key", but users can create indexes on selected columns to speed up certain operations, and the index data for a table is stored in a separate table. Hive's indexing support is still relatively immature and offers few options; however, index handlers are designed as pluggable Java code, so users can extend the feature to meet their own needs. Of course, not every query benefits from a Hive index. You can use the EXPLAIN statement to analyze whether a HiveQL query can use an index to improve its performance. As with indexes in an RDBMS, you need to evaluate whether creating an index is worthwhile: an index takes extra disk space, and building and maintaining it has a cost. You must weigh the benefit an index brings against that cost.
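As a minimal sketch of the EXPLAIN check mentioned above (using the `user` table and `id` column from the walkthrough below):

```sql
-- Show the execution plan for a candidate query. Comparing the plan
-- before and after building an index reveals whether the filter can be
-- served from the index instead of a full table scan.
EXPLAIN SELECT * FROM user WHERE id = 500000;
```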
Here is how to create an index:
1. First, create a table:
```
hive> create table user(id int, name string)
    > ROW FORMAT DELIMITED
    > FIELDS TERMINATED BY '\t'
    > STORED AS TEXTFILE;
```
2. Load data:
```
hive> load data local inpath '/export1/tmp/wyp/row.txt'
    > overwrite into table user;
```
3. Run a test query before creating the index:
```
hive> select * from user where id = 500000;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Cannot run job locally: Input Size (= 356888890) is larger than
  hive.exec.mode.local.auto.inputbytes.max (= 134217728)
Starting Job = job_1384246387966_0247, Tracking URL =
  http://l-datalogm1.data.cn1:9981/proxy/application_1384246387966_0247/
Kill Command = /home/q/hadoop/bin/hadoop job -kill job_1384246387966_0247
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 0
2013-11-13 15:09:53,336 Stage-1 map = 0%,   reduce = 0%
2013-11-13 15:09:59,500 Stage-1 map = 50%,  reduce = 0%, Cumulative CPU 2.0 sec
2013-11-13 15:10:00,531 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.63 sec
2013-11-13 15:10:01,560 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 5.63 sec
MapReduce Total cumulative CPU time: 5 seconds 630 msec
Ended Job = job_1384246387966_0247
MapReduce Jobs Launched:
Job 0: Map: 2   Cumulative CPU: 5.63 sec   HDFS Read: 361084006   HDFS Write: 357 SUCCESS
Total MapReduce CPU Time Spent: 5 seconds 630 msec
OK
500000  wyp.
Time taken: 14.107 seconds, Fetched: 1 row(s)
```
The query took 14.107 seconds in total.
4. Create an index on the user table:
```
hive> create index user_index on table user(id)
    > as 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    > with deferred rebuild
    > IN TABLE user_index_table;
hive> alter index user_index on user rebuild;
hive> select * from user_index_table limit 5;
0   hdfs://mycluster/user/hive/warehouse/table02/000000_0   [0]
1   hdfs://mycluster/user/hive/warehouse/table02/000000_0   [352]
2   hdfs://mycluster/user/hive/warehouse/table02/000000_0   [704]
3   hdfs://mycluster/user/hive/warehouse/table02/000000_0   [1056]
4   hdfs://mycluster/user/hive/warehouse/table02/000000_0   [1408]
Time taken: 0.244 seconds, Fetched: 5 row(s)
```
This creates an index on the user table.
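With the index built, one way to exercise it manually is to materialize the matching offsets and point Hive's input format at them. The following is a sketch based on the compact index handler's mechanics, and the output directory `/tmp/index_result` is only an example path:

```sql
-- 1. Write the bucket filename and offsets for the target value into a
--    temporary directory (path chosen purely for illustration).
INSERT OVERWRITE DIRECTORY "/tmp/index_result"
SELECT `_bucketname`, `_offsets`
FROM user_index_table
WHERE id = 500000;

-- 2. Tell Hive to read only the blocks listed in that index result.
SET hive.index.compact.file=/tmp/index_result;
SET hive.input.format=org.apache.hadoop.hive.ql.index.compact.HiveCompactIndexInputFormat;

-- 3. Re-run the query; it should now scan only the indexed blocks
--    instead of the whole table.
SELECT * FROM user WHERE id = 500000;
```

On newer releases (Hive 0.8 and later), setting `hive.optimize.index.filter=true` should let Hive consult indexes automatically, without this manual step.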
Creating an index in Hive still has a known bug: if the table's schema information comes from a SerDe, Hive cannot create the index:

```
hive> CREATE INDEX employees_index
    > ON TABLE employees (country)
    > AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    > WITH DEFERRED REBUILD
    > IDXPROPERTIES ('creator' = 'me', 'created_at' = 'some_time')
    > IN TABLE employees_index_table
    > COMMENT 'Employees indexed by country and name.';
FAILED: Error in metadata: java.lang.RuntimeException: \
  Check the index columns, they should appear in the table being indexed.
FAILED: Execution Error, return code 1 from \
  org.apache.hadoop.hive.ql.exec.DDLTask
```
This bug affects Hive 0.10.0, 0.10.1, and 0.11.0, and was fixed in Hive 0.12.0; for details, see https://issues.apache.org/jira/browse/HIVE-4251
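For completeness, indexes can be listed and removed with DDL statements, sketched here assuming the `user_index` created above:

```sql
-- List the indexes defined on the user table.
SHOW FORMATTED INDEX ON user;

-- Drop the index; this also removes its index table user_index_table.
DROP INDEX IF EXISTS user_index ON user;
```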
Unless otherwise noted, all articles on this blog are original. Please respect the original work and credit it when reposting: reposted from 过往记忆 (http://www.iteblog.com/)
Permalink: "Creating an Index in Hive" (http://www.iteblog.com/archives/836)
E-mail:wyphao.2007@163.com QQ:397090770