Contents
- 1. Enabling Iceberg support in Hive
- 2. Catalog management
- 3. Creating databases
- 4. Creating and dropping tables
- 4.1 External tables
- 4.2 create table
- 4.3 Dropping tables
- 5. Inserting data
- 6. Querying data
- 7. Column type conversion and how to enable it
1. Enabling Iceberg support in Hive
Hive can read and write Iceberg tables.
Add the dependency jar:
[root@hive1 ~]#
[root@hive1 ~]# wget https://repo1.maven.org/maven2/org/apache/iceberg/iceberg-hive-runtime/0.13.1/iceberg-hive-runtime-0.13.1.jar
[root@hive1 ~]#
[root@hive1 ~]# beeline -u jdbc:hive2://hive1:10000 -n root -p root123
0: jdbc:hive2://hive1:10000>
0: jdbc:hive2://hive1:10000> add jar /root/iceberg-hive-runtime-0.13.1.jar;
INFO : Added [/root/iceberg-hive-runtime-0.13.1.jar] to class path
INFO : Added resources: [/root/iceberg-hive-runtime-0.13.1.jar]
No rows affected (0.312 seconds)
0: jdbc:hive2://hive1:10000>
Add the following configuration to hive-site.xml:
<property>
<name>iceberg.engine.hive.enabled</name>
<value>true</value>
<description>Whether Hive enables Iceberg support</description>
</property>
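If editing hive-site.xml is not convenient, the same flag can also be set per session; a minimal sketch, assuming the runtime jar has already been registered with add jar:

```sql
-- Session-level alternative (sketch): enables Iceberg support only for
-- the current Beeline session instead of globally in hive-site.xml.
set iceberg.engine.hive.enabled=true;
```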
2. Catalog management
Hive itself has no concept of a catalog, but Iceberg does. Hive therefore stores the catalog information as key-value properties, so that a previously created catalog can be referenced directly when creating a table.
Hive's Iceberg integration supports both the Hive Catalog and the Hadoop Catalog.
Create a Hive Catalog:
0: jdbc:hive2://hive1:10000>
0: jdbc:hive2://hive1:10000> set iceberg.catalog.hive_catalog.type=hive;
No rows affected (0.093 seconds)
0: jdbc:hive2://hive1:10000> set iceberg.catalog.hive_catalog.uri=thrift://hive1:9083;
No rows affected (0.052 seconds)
0: jdbc:hive2://hive1:10000> set iceberg.catalog.hive_catalog.clients=5;
No rows affected (0.058 seconds)
0: jdbc:hive2://hive1:10000> set iceberg.catalog.hive_catalog.warehouse=hdfs://nnha/user/iceberg/warehouse;
No rows affected (0.044 seconds)
0: jdbc:hive2://hive1:10000>
Create a Hadoop Catalog:
0: jdbc:hive2://hive1:10000>
0: jdbc:hive2://hive1:10000> set iceberg.catalog.hadoop_catalog.type=hadoop;
No rows affected (0.082 seconds)
0: jdbc:hive2://hive1:10000> set iceberg.catalog.hadoop_catalog.warehouse=hdfs://nnha/user/iceberg/warehouse;
No rows affected (0.046 seconds)
0: jdbc:hive2://hive1:10000>
3. Creating databases
1. Databases under the Hive Catalog
If another system uses this Hive as its catalog, a database created there can be used directly, without creating it again, because Hive databases and Iceberg databases map onto each other directly.
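For example, if Spark uses this Hive metastore as its Iceberg catalog and has already created a database and table there, Hive can query them without any DDL. `spark_db` and `events` are hypothetical names used only for illustration:

```sql
-- The database and table below were created by another engine (e.g. Spark)
-- through the shared Hive catalog; no CREATE statement is needed in Hive.
show tables in spark_db;
select * from spark_db.events limit 10;
```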
2. Databases under a Hadoop Catalog
Because Hive has no catalog concept, it cannot automatically discover databases through the catalog created above. You therefore need to create a Hive database that corresponds to the Iceberg database, for example:
0: jdbc:hive2://hive1:10000> create schema iceberg_db
. . . . . . . . . . . . . .> location 'hdfs://nnha/user/iceberg/warehouse/iceberg_db/';
0: jdbc:hive2://hive1:10000>
4. Creating and dropping tables
4.1 External tables
For Iceberg tables that were already created by another system, you can create an external table in Hive to read and write them.
1. Tables under the Hive Catalog
If another system uses this Hive as its catalog, a table created there can be used directly, without creating it again, because Hive tables and Iceberg tables map onto each other directly.
2. Tables under a Hadoop Catalog
Create a Hive table that maps onto the Iceberg table. Querying it returns the same data as the Iceberg table.
0: jdbc:hive2://hive1:10000> create external table iceberg_db.my_user
. . . . . . . . . . . . . .> stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
. . . . . . . . . . . . . .> tblproperties('iceberg.catalog'='hadoop_catalog');
0: jdbc:hive2://hive1:10000>
0: jdbc:hive2://hive1:10000> select * from iceberg_db.my_user;
+------------------+--------------------+-------------------+------------------+
| my_user.user_id | my_user.user_name | my_user.birthday | my_user.country |
+------------------+--------------------+-------------------+------------------+
| 6 | zhang_san | 2022-02-01 | china |
| 5 | zhao_liu | 2022-02-02 | japan |
| 2 | zhang_san | 2022-02-01 | china |
| 1 | zhang_san | 2022-02-01 | china |
+------------------+--------------------+-------------------+------------------+
6 rows selected (7.855 seconds)
0: jdbc:hive2://hive1:10000>
If you create a table without specifying the table property, the Hive Catalog is used by default: the metadata is stored in the current Hive metastore, and the table data is stored in the current Hive warehouse.
3. location_based_table tables
For a Hadoop Catalog table created by another system, Hive can also bypass the catalog and create a table of type location_based_table directly from the HDFS path.
0: jdbc:hive2://hive1:10000>
0: jdbc:hive2://hive1:10000> create external table default.my_user
. . . . . . . . . . . . . .> stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
. . . . . . . . . . . . . .> location 'hdfs://nnha/user/iceberg/warehouse/iceberg_db/my_user'
. . . . . . . . . . . . . .> tblproperties('iceberg.catalog'='location_based_table');
No rows affected (1.184 seconds)
0: jdbc:hive2://hive1:10000> select * from default.my_user;
+-------------------+---------------------+--------------------+-------------------+
| my_user2.user_id | my_user2.user_name | my_user2.birthday | my_user2.country |
+-------------------+---------------------+--------------------+-------------------+
| 6 | zhang_san | 2022-02-01 | china |
| 5 | zhao_liu | 2022-02-02 | japan |
| 2 | zhang_san | 2022-02-01 | china |
| 1 | zhang_san | 2022-02-01 | china |
+-------------------+---------------------+--------------------+-------------------+
6 rows selected (1.994 seconds)
0: jdbc:hive2://hive1:10000>
Note that no /user/hive/warehouse//my_user directory is created on HDFS.
4.2 create table
Iceberg tables can also be created directly from Hive. The default catalog is the Hive Catalog.
0: jdbc:hive2://hive1:10000> create table iceberg_db.student(
. . . . . . . . . . . . . .> id bigint,
. . . . . . . . . . . . . .> name string
. . . . . . . . . . . . . .> ) partitioned by (birthday date, country string)
. . . . . . . . . . . . . .> stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
No rows affected (5.966 seconds)
0: jdbc:hive2://hive1:10000>
The HDFS path now looks as follows; the Iceberg table's metadata files are present as well.
[root@hive1 ~]# hadoop fs -ls /user/iceberg/warehouse/iceberg_db/student/metadata
Found 1 items
-rw-r--r-- 1 root supergroup 2154 2022-02-17 11:25 /user/iceberg/warehouse/iceberg_db/student/metadata/
[root@hive1 ~]#
Specifying a Hadoop Catalog instead:
0: jdbc:hive2://hive1:10000> create table iceberg_db.employee(
. . . . . . . . . . . . . .> id bigint,
. . . . . . . . . . . . . .> name string
. . . . . . . . . . . . . .> ) partitioned by (birthday date, country string)
. . . . . . . . . . . . . .> stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler'
. . . . . . . . . . . . . .> location 'hdfs://nnha/user/iceberg/warehouse/iceberg_db/employee'
. . . . . . . . . . . . . .> tblproperties('iceberg.catalog'='hadoop_catalog');
No rows affected (0.324 seconds)
0: jdbc:hive2://hive1:10000>
0: jdbc:hive2://hive1:10000> show create table iceberg_db.student;
+----------------------------------------------------+
| createtab_stmt |
+----------------------------------------------------+
| CREATE TABLE `iceberg_db.student`( |
| `id` bigint COMMENT 'from deserializer', |
| `name` string COMMENT 'from deserializer', |
| `birthday` date COMMENT 'from deserializer', |
| `country` string COMMENT 'from deserializer') |
| ROW FORMAT SERDE |
| 'org.apache.iceberg.mr.hive.HiveIcebergSerDe' |
| STORED BY |
| 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler' |
| |
| LOCATION |
| 'hdfs://nnha/user/iceberg/warehouse/iceberg_db/student' |
| TBLPROPERTIES ( |
| 'bucketing_version'='2', |
| 'engine.hive.enabled'='true', |
| 'external.table.purge'='TRUE', |
| 'metadata_location'='hdfs://nnha/user/iceberg/warehouse/iceberg_db/student/metadata/', |
| 'table_type'='ICEBERG', |
| 'transient_lastDdlTime'='1645068339', |
| 'uuid'='29e4e983-6c71-473c-980a-6c391f3cd993') |
+----------------------------------------------------+
20 rows selected (0.318 seconds)
0: jdbc:hive2://hive1:10000>
Although the Iceberg table is a partitioned table, the partition information is not visible in the Hive table schema. Computed (transform) columns are also not currently supported as partition columns.
The HDFS path looks as follows; note that it differs from the Hive Catalog table.
[root@hive1 ~]# hadoop fs -ls /user/iceberg/warehouse/iceberg_db/employee/metadata
Found 2 items
-rw-r--r-- 1 root supergroup 2119 2022-02-17 12:51 /user/iceberg/warehouse/iceberg_db/employee/metadata/
-rw-r--r-- 1 root supergroup 1 2022-02-17 12:51 /user/iceberg/warehouse/iceberg_db/employee/metadata/
[root@hive1 ~]#
4.3 Dropping tables
0: jdbc:hive2://hive1:10000>
0: jdbc:hive2://hive1:10000> drop table if exists default.my_user;
No rows affected (0.215 seconds)
0: jdbc:hive2://hive1:10000>
After dropping a table, check whether its directory on HDFS was deleted as well.
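In Hive, whether dropping an external table also purges its data files is controlled by the external.table.purge table property; a sketch of keeping the data on HDFS when dropping (using the iceberg_db.student table from above):

```sql
-- Sketch: disable purge so that DROP TABLE leaves the data files in place.
alter table iceberg_db.student set tblproperties ('external.table.purge'='false');
drop table if exists iceberg_db.student;
-- Afterwards, check on HDFS whether
-- hdfs://nnha/user/iceberg/warehouse/iceberg_db/student still exists.
```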
5. Inserting data
Configuration:
- iceberg.mr.commit.table.thread.pool.size: default 10, size of the thread pool used when committing inserts into multiple tables
- iceberg.mr.commit.file.thread.pool.size: default 10, size of the thread pool used to commit the files written for each table
Inserting '2022-02-01' into the date-typed birthday column fails with the error Not an instance of : 2022-02-01. This is an Iceberg bug, so null is inserted here instead; alternatively, declare the column as a string type when creating the table.
0: jdbc:hive2://hive1:10000> insert into iceberg_db.student(id, name, birthday, country)
. . . . . . . . . . . . . .> values(1, 'zhang_san', null, 'china'),
. . . . . . . . . . . . . .> (2, 'zhang_san', null, 'china');
No rows affected (135.859 seconds)
0: jdbc:hive2://hive1:10000> insert into iceberg_db.student(id, name, birthday, country)
. . . . . . . . . . . . . .> select 3, 'zhang_san', null, 'china';
No rows affected (121.076 seconds)
0: jdbc:hive2://hive1:10000>
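The string-column workaround mentioned above can be sketched as follows; student_str is a hypothetical table name:

```sql
-- Workaround sketch: declare birthday as string instead of date so that
-- the literal '2022-02-01' can be inserted without triggering the bug.
create table iceberg_db.student_str(
  id bigint,
  name string,
  birthday string,
  country string
) stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';

insert into iceberg_db.student_str
values(1, 'zhang_san', '2022-02-01', 'china');
```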
For insert into db.tb2(col1, col2) select col1, col2 from db.tb1; note:
- If db.tb1 is stored in Parquet format, the statement fails with: Parquet support not yet supported for Pig and Hive. This can be avoided by specifying the table property tblproperties('write.format.default'='orc') when creating the table.
- The second failure is the stack trace below. It is an unresolved Iceberg bug; the only workaround is to insert the rows of db.tb1 into db.tb2 one at a time.
Error: (stack trace abridged: the exception class names were lost in formatting; the remaining frames show the failure occurring while MapReduce initializes Hive's VectorizedOrcRecordReader)
Inserting data into multiple tables:
from db.tb1
insert into db.tb2 select col1, col2
insert into db.tb3 select col3, col4;
Data is fully inserted into one table before insertion into the next begins, so atomicity is guaranteed only for each individual table, not for the multi-table insert as a whole.
6. Querying data
Characteristics:
- where-clause predicates are pushed down
- projection of the selected columns is supported
- both of Hive's MapReduce and Tez execution engines are supported
Configuration:
- iceberg.mr.reuse.containers: default false, whether to reuse Avro containers when reading Avro-format data
- iceberg.mr.case.sensitive: default true, whether select queries are case sensitive
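Both options can be overridden per session before running a query; a sketch:

```sql
-- Session-level overrides (sketch) for the two query options above.
set iceberg.mr.reuse.containers=true;   -- reuse Avro containers when reading
set iceberg.mr.case.sensitive=false;    -- match column names case-insensitively
select user_id, country from iceberg_db.my_user where country = 'china';
```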
7. Column type conversion and how to enable it
Configuration:
- iceberg.mr.schema.auto.conversion: default false, whether to enable automatic column type conversion between Hive and Iceberg
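A sketch of enabling the flag and creating a table whose Hive column types are auto-converted; conv_demo is a hypothetical table name:

```sql
set iceberg.mr.schema.auto.conversion=true;

-- Hypothetical table: per the conversion rules, smallint is auto-converted
-- to Iceberg integer and char to Iceberg string.
create table iceberg_db.conv_demo(
  small_col smallint,
  code char(2)
) stored by 'org.apache.iceberg.mr.hive.HiveIcebergStorageHandler';
```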
When an Iceberg table is created through Hive, and when data is written to an Iceberg table, the following type conversion rules are applied:
| Hive | Iceberg | Notes |
| --- | --- | --- |
| boolean | boolean | |
| short | integer | auto-conversion |
| byte | integer | auto-conversion |
| integer | integer | |
| long | long | |
| float | float | |
| double | double | |
| date | date | |
| timestamp | timestamp without timezone | |
| timestamplocaltz | timestamp with timezone | Hive 3 only |
| interval_year_month | not supported | |
| interval_day_time | not supported | |
| char | string | auto-conversion |
| varchar | string | auto-conversion |
| string | string | |
| binary | binary | |
| decimal | decimal | |
| struct | struct | |
| list | list | |
| map | map | |
| union | not supported | |