HBase User Manual
1. Overview
Usage Notes
This document applies to application development with the HBase component of the CRH product.
Introduction to HBase
HBase is a typical NoSQL database that uses Hadoop's HDFS as its underlying storage. Data is retrieved by row key (Rowkey), only single-row transactions are supported, and it is mainly used to store loosely structured data that is unstructured or semi-structured. Like Hadoop, HBase is designed to scale horizontally: compute and storage capacity are increased by continually adding inexpensive commodity servers.
HBase Features
Huge capacity, column orientation, sparsity, scalability, high reliability, and high performance. Because of these characteristics, more and more companies choose HBase as their storage database.
HBase Table Basics
-
Row key
The row's primary key. HBase does not support conditional queries, ORDER BY, or similar queries; records can only be read by Row key (or a Row key range) or by a full table scan. Row keys therefore have to be designed around the business access pattern so that the sorted-storage property (a table is sorted by Row key in lexicographic order, e.g. 1, 10, 100, 11, 2) can be used to improve performance; see the Java sketch after this list for a row-key padding example.
-
Column Family
Declared when the table is created; each Column Family is a storage unit.
-
Column
Every HBase column belongs to a column family and is prefixed with the column family name; for example, the columns article:title and article:content belong to the article column family, while author:name and author:nickname belong to the author column family. Columns do not have to be defined when the table is created and can be added dynamically. Columns in the same Column Family are stored together in one storage unit and sorted by column key, so columns with the same I/O characteristics should be placed in the same Column Family to improve performance. Also note that columns can be added and deleted at any time, which is a major difference from traditional databases and makes HBase well suited to unstructured data.
-
Timestamp
Timestamp. HBase identifies a piece of data by row and column, and that data may have multiple versions. The versions are sorted by time in descending order, so the newest value comes first and is returned by default on a query. The Timestamp defaults to the current system time (in milliseconds), but it can also be specified explicitly when data is written (see the sketch after this list).
-
Value
Every value is uniquely indexed by four keys: tableName + RowKey + ColumnKey + Timestamp => value. TableName is a string, RowKey and ColumnName are binary values (Java type byte[]), Timestamp is a 64-bit integer (Java type long), and value is a byte array (Java type byte[]).
-
Storage structure
The storage structure of an HTable can be understood simply as follows: the HTable is automatically sorted by Rowkey, each Row contains any number of Columns, Columns are automatically sorted by column key, and each Column contains any number of Values. Understanding this storage structure helps when iterating over query results.
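The data model above maps directly onto the Java client API. Below is a minimal illustrative sketch, assuming an existing table named 'test' with column family 'cf' (as created in section 2.1) and an hbase-site.xml on the classpath. It pads a numeric Row key so that lexicographic order matches numeric order, writes a cell with an explicit timestamp, reads back several versions, and iterates the result cell by cell:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class DataModelSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();   // reads hbase-site.xml from the classpath
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("test"))) {

            // Row key design: zero-pad numeric IDs so lexicographic order equals numeric order
            // (otherwise "10" sorts before "2", as described above).
            byte[] rowkey = Bytes.toBytes(String.format("%010d", 42L));

            // Timestamp: a cell version can be written with an explicit timestamp (milliseconds).
            Put put = new Put(rowkey);
            put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("a"),
                    System.currentTimeMillis(), Bytes.toBytes("value1"));
            table.put(put);

            // Versions: request up to 3 versions of each cell; the newest comes back first.
            Get get = new Get(rowkey);
            get.setMaxVersions(3);
            Result result = table.get(get);

            // Value addressing: every cell is identified by row + family + qualifier + timestamp.
            for (Cell cell : result.rawCells()) {
                System.out.println(Bytes.toString(CellUtil.cloneRow(cell)) + " "
                        + Bytes.toString(CellUtil.cloneFamily(cell)) + ":"
                        + Bytes.toString(CellUtil.cloneQualifier(cell)) + " @" + cell.getTimestamp()
                        + " = " + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}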
2. Basic Use of HBase
2.1 Using the HBase shell
2.1.1 Entering the hbase shell
Switch to the hdfs or hbase user.
[[email protected] ~]$ hbase shell
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/crh/6.1.2.6-1457/hbase/lib/phoenix-4.13.1-HBase-1.2-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/crh/6.1.2.6-1457/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/crh/6.1.2.6-1457/tez/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
HBase Shell; enter 'help<RETURN>' for list of supported commands.
Type "exit<RETURN>" to leave the HBase Shell
Version 1.2.6, rUnknown, Thu Apr 19 02:55:07 UTC 2018
hbase(main):001:0>
2.1.2 Displaying help
hbase(main):001:0> help
2.1.3 Creating the test table with the column family cf
hbase(main):002:0> create 'test','cf'
0 row(s) in 2.6990 seconds
=> Hbase::Table - test
2.1.4 Listing table information
hbase(main):010:0> list 'test'
TABLE
test
1 row(s) in 0.0100 seconds
=> ["test"]
2.1.5 Adding data to the test table
hbase(main):011:0> put 'test','row1','cf:a','value1'
0 row(s) in 0.2660 seconds
hbase(main):012:0> put 'test','row2','cf:b','value2'
0 row(s) in 0.0260 seconds
hbase(main):013:0> put 'test','row3','cf:c','value3'
0 row(s) in 0.0270 seconds
2.1.6 Viewing the data in the test table
hbase(main):014:0> scan 'test'
2.1.7 Getting a single row
hbase(main):016:0> get 'test','row1'
2.1.8 Disabling the test table
Do this before altering the table schema or dropping the table.
hbase(main):017:0> disable 'test'
0 row(s) in 2.3300 seconds
2.1.9 Re-enabling the table
hbase(main):018:0> enable 'test'
0 row(s) in 1.3690 seconds
2.1.10 Dropping the table
hbase(main):020:0> disable 'test'
0 row(s) in 2.2930 seconds
hbase(main):021:0> drop 'test'
0 row(s) in 1.2840 seconds
2.1.11 Exiting the HBase shell
hbase(main):022:0> exit
[[email protected] ~]#
2.2 HBase shell tips
Since version 0.95, the HBase shell provides JRuby-style object-oriented references to HBase tables. A table reference can be used to read and write data, which makes managing HBase tables more convenient.
2.2.1 Assigning the HBase table t to the variable t
hbase(main):001:0> t = create 't','f'
2.2.2 Adding data to the table
hbase(main):003:0> t.put 'r','f','v'
2.2.3 Scanning the table
hbase(main):004:0> t.scan
2.2.4 Describing the table structure
hbase(main):005:0> t.describe
2.2.5 Disabling and dropping the table
hbase(main):007:0> t.disable
0 row(s) in 2.3770 seconds
hbase(main):008:0> t.drop
0 row(s) in 2.3650 seconds
hbase(main):009:0> list
2.2.6 Assigning an existing table to a variable with the get_table method
hbase(main):010:0> create 't','f'
hbase(main):012:0> tab = get_table 't'
hbase(main):013:0> tab.put 'r1','f','v'
3. Advanced HBase Operations
3.1 Using HBase together with Hadoop
3.1.1 Listing the HBase storage directory in HDFS
[[email protected] ~]$ hdfs dfs -ls /apps/hbase/data
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/crh/6.1.2.6-1457/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/crh/6.1.2.6-1457/tez/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
Found 9 items
drwxr-xr-x   - hbase hdfs          0 2018-09-17 15:09 /apps/hbase/data/.tmp
drwxr-xr-x   - hbase hdfs          0 2018-09-17 15:14 /apps/hbase/data/MasterProcWALs
drwxr-xr-x   - hbase hdfs          0 2018-09-17 10:56 /apps/hbase/data/WALs
drwxr-xr-x   - hbase hdfs          0 2018-09-17 13:48 /apps/hbase/data/archive
drwxr-xr-x   - hbase hdfs          0 2018-07-03 10:11 /apps/hbase/data/corrupt
drwxr-xr-x   - hbase hdfs          0 2018-07-03 09:55 /apps/hbase/data/data
-rw-r--r--   3 hbase hdfs         42 2018-07-03 09:55 /apps/hbase/data/hbase.id
-rw-r--r--   3 hbase hdfs          7 2018-07-03 09:55 /apps/hbase/data/hbase.version
drwxr-xr-x   - hbase hdfs          0 2018-09-17 15:06 /apps/hbase/data/oldWALs
3.1.2 Running hbase-examples-1.2.6.jar with hadoop jar
Before doing this, complete the following preparation steps:
1. Stop Hadoop and HBase.
2. Copy /etc/hbase/6.1.2.6-1457/0/hbase-site.xml to /etc/hadoop/6.1.2.6-1457/0/.
3. Copy the jar files under /usr/crh/6.1.2.6-1457/hbase/lib to /usr/crh/6.1.2.6-1457/hadoop/lib.
4. Restart Hadoop and HBase.
5. Run the MapReduce job that counts the rows of the test table.
6. Run the jar:
[[email protected] ~]$ hadoop jar /usr/crh/6.1.2.6-1457/hadoop/lib/hbase-examples-1.2.6.jar
3.2 Integrating HBase with Hive
3.2.1 Creating a table in Hive
hive> create table htest(key int,value string) stored by 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' with serdeproperties('hbase.columns.mapping'=':key,f:value') tblproperties('hbase.table.name'='htest');
OK
Time taken: 4.676 seconds
3.2.2 Verifying in HBase
hbase(main):001:0> list
TABLE
test
htest
2 row(s) in 0.3130 seconds
=> ["htest", "test"]
3.2.3 Importing external data
Create score.csv:
hive,85
hbase,90
hadoop,92
flume,89
kafka,95
spark,80
storm,70
Upload it to the HDFS file system:
[[email protected] opt]$ hdfs dfs -put /opt/score.csv /
Create an external table and load the data:
hive> create external table if not exists default.testcourse(cname string,score int) row format delimited fields terminated by ',' stored as textfile;
OK
Time taken: 0.132 seconds
hive> load data inpath '/score.csv' into table default.testcourse;
Create an HBase-backed table:
hive> create table default.hbase_testcourse(cname string,score int) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES("hbase.columns.mapping" = ":key,cf:score") TBLPROPERTIES("hbase.table.name" = "hbase_testcourse","hbase.mapred.output.outputtable" = "hbase_testcourse");
OK
Time taken: 2.577 seconds
Insert data via a query:
hive> insert overwrite table default.hbase_testcourse select cname,score from default.testcourse;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Query ID = hdfs_20180917165113_4e862430-b6e7-4a36-a497-fe1b455e7484
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1537152941617_0004, Tracking URL = http://cloud02:8088/proxy/application_1537152941617_0004/
Kill Command = /usr/crh/6.1.2.6-1457/hadoop/bin/hadoop job -kill job_1537152941617_0004
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2018-09-17 16:51:35,696 Stage-3 map = 0%, reduce = 0%
2018-09-17 16:51:46,346 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 6.4 sec
MapReduce Total cumulative CPU time: 6 seconds 400 msec
Ended Job = job_1537152941617_0004
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 6.4 sec HDFS Read: 12507 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 400 msec
OK
Time taken: 34.993 seconds
Query the hbase_testcourse table:
hive> select * from hbase_testcourse;
OK
flume   89
hadoop  92
hbase   90
hive    85
kafka   95
spark   80
storm   70
Time taken: 0.492 seconds, Fetched: 7 row(s)
Verify in HBase:
hbase(main):007:0> scan 'hbase_testcourse'
ROW        COLUMN+CELL
 flume     column=cf:score, timestamp=1537174306012, value=89
 hadoop    column=cf:score, timestamp=1537174306012, value=92
 hbase     column=cf:score, timestamp=1537174306012, value=90
 hive      column=cf:score, timestamp=1537174306012, value=85
 kafka     column=cf:score, timestamp=1537174306012, value=95
 spark     column=cf:score, timestamp=1537174306012, value=80
 storm     column=cf:score, timestamp=1537174306012, value=70
7 row(s) in 0.0270 seconds
This shows the Hive-HBase mapping was created successfully.
3.3 A SQL engine on top of HBase
Phoenix is a SQL layer built on top of Apache HBase that lets developers run SQL queries directly against HBase.
3.3.1 Entering the Phoenix installation directory
[[email protected] bin]# ls
config          example.csv                         hbase-site.xml    pherf-standalone.py  psql.py                   sqlline.py       traceserver.py
daemon.py       ex.csv                              log4j.properties  phoenix_sandbox.py   queryserver.py            sqlline-thin.py
daemon.pyc      hadoop-metrics2-hbase.properties    performance.py    phoenix_utils.py     readme.txt                tephra
end2endTest.py  hadoop-metrics2-phoenix.properties  pherf-cluster.py  phoenix_utils.pyc    sandbox-log4j.properties  tephra-env.sh
3.3.2 Launching Phoenix
[[email protected] bin]# ./sqlline.py localhost:2181
Setting property: [incremental, false]
Setting property: [isolation, TRANSACTION_READ_COMMITTED]
issuing: !connect jdbc:phoenix:localhost:2181 none none org.apache.phoenix.jdbc.PhoenixDriver
Connecting to jdbc:phoenix:localhost:2181
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/crh/6.1.2.6-1457/phoenix/phoenix-4.13.1-HBase-1.2-client.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/crh/6.1.2.6-1457/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/crh/6.1.2.6-1457/tez/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
18/09/17 17:01:50 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/09/17 17:01:51 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
Connected to: Phoenix (version 4.13)
Driver: PhoenixEmbeddedDriver (version 4.13)
Autocommit status: true
Transaction isolation: TRANSACTION_READ_COMMITTED
Building list of tables and columns for tab-completion (set fastconnect to true to skip)...
144/144 (100%) Done
Done
sqlline version 1.2.0
3.3.3 Listing tables
0: jdbc:phoenix:localhost:2181> !tables
+------------+--------------+--------------------+---------------+----------+------------+----------------------------+-----------------+--------------+-----------------+-+
| TABLE_CAT  | TABLE_SCHEM  | TABLE_NAME         | TABLE_TYPE    | REMARKS  | TYPE_NAME  | SELF_REFERENCING_COL_NAME  | REF_GENERATION  | INDEX_STATE  | IMMUTABLE_ROWS  | |
+------------+--------------+--------------------+---------------+----------+------------+----------------------------+-----------------+--------------+-----------------+-+
|            | SYSTEM       | CATALOG            | SYSTEM TABLE  |          |            |                            |                 |              | false           | |
|            | SYSTEM       | FUNCTION           | SYSTEM TABLE  |          |            |                            |                 |              | false           | |
|            | SYSTEM       | SEQUENCE           | SYSTEM TABLE  |          |            |                            |                 |              | false           | |
|            | SYSTEM       | STATS              | SYSTEM TABLE  |          |            |                            |                 |              | false           | |
|            |              | ABC                | TABLE         |          |            |                            |                 |              | false           | |
|            | ETL          | USERS              | TABLE         |          |            |                            |                 |              | false           | |
|            | LIAONING     | VEHICLES_GPS       | TABLE         |          |            |                            |                 |              | false           | |
|            | LIAONING     | VEHICLES_GPS_TEST  | TABLE         |          |            |                            |                 |              | false           | |
|            | TEST         | GPS                | TABLE         |          |            |                            |                 |              | false           | |
+------------+--------------+--------------------+---------------+----------+------------+----------------------------+-----------------+--------------+-----------------+-+
3.3.4 Creating a table
0: jdbc:phoenix:localhost:2181> create table test_redoop(
. . . . . . . . . . . . . . . > name VARCHAR,
. . . . . . . . . . . . . . . > address VARCHAR,
. . . . . . . . . . . . . . . > salary DOUBLE,
. . . . . . . . . . . . . . . > id INTEGER not null primary key
. . . . . . . . . . . . . . . > );
No rows affected (2.649 seconds)
3.3.5 Inserting data
0: jdbc:phoenix:localhost:2181> upsert into TEST_REDOOP values('zhaoshuai','Beijing',5000,01);
1 row affected (0.081 seconds)
0: jdbc:phoenix:localhost:2181> upsert into TEST_REDOOP values('zhangjie','Beijing',50000,02);
1 row affected (0.016 seconds)
0: jdbc:phoenix:localhost:2181> upsert into TEST_REDOOP values('lisi','Shanghai',8000,03);
3.3.6 Querying the table
0: jdbc:phoenix:localhost:2181> select * from test_redoop;
+------------+-----------+----------+-----+
|    NAME    |  ADDRESS  |  SALARY  | ID  |
+------------+-----------+----------+-----+
| zhaoshuai  | Beijing   | 5000.0   | 1   |
| zhangjie   | Beijing   | 50000.0  | 2   |
| lisi       | Shanghai  | 8000.0   | 3   |
+------------+-----------+----------+-----+
3.3.7 Viewing the table in HBase
hbase(main):002:0> scan 'TEST_REDOOP'
ROW                COLUMN+CELL
 \x80\x00\x00\x01  column=0:\x00\x00\x00\x00, timestamp=1537176692090, value=x
 \x80\x00\x00\x01  column=0:\x80\x0B, timestamp=1537176692090, value=zhaoshuai
 \x80\x00\x00\x01  column=0:\x80\x0C, timestamp=1537176692090, value=Beijing
 \x80\x00\x00\x01  column=0:\x80\x0D, timestamp=1537176692090, value=\xC0\xB3\x88\x00\x00\x00\x00\x01
 \x80\x00\x00\x02  column=0:\x00\x00\x00\x00, timestamp=1537176711631, value=x
 \x80\x00\x00\x02  column=0:\x80\x0B, timestamp=1537176711631, value=zhangjie
 \x80\x00\x00\x02  column=0:\x80\x0C, timestamp=1537176711631, value=Beijing
 \x80\x00\x00\x02  column=0:\x80\x0D, timestamp=1537176711631, value=\xC0\xE8j\x00\x00\x00\x00\x01
 \x80\x00\x00\x03  column=0:\x00\x00\x00\x00, timestamp=1537176736422, value=x
 \x80\x00\x00\x03  column=0:\x80\x0B, timestamp=1537176736422, value=lisi
 \x80\x00\x00\x03  column=0:\x80\x0C, timestamp=1537176736422, value=Shanghai
 \x80\x00\x00\x03  column=0:\x80\x0D, timestamp=1537176736422, value=\xC0\[email protected]\x00\x00\x00\x00\x01
3 row(s) in 0.1560 seconds
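The encoded row keys and column qualifiers above are written by Phoenix; applications normally read them back through the Phoenix JDBC driver rather than the raw HBase API. A minimal Java sketch, assuming the same localhost:2181 ZooKeeper quorum used by sqlline above and the Phoenix client jar on the classpath:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class PhoenixQuery {
    public static void main(String[] args) throws Exception {
        // The connection string has the same form as the one used by sqlline.py above.
        try (Connection conn = DriverManager.getConnection("jdbc:phoenix:localhost:2181");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("select name, address, salary, id from TEST_REDOOP")) {
            while (rs.next()) {
                System.out.println(rs.getString("NAME") + " " + rs.getString("ADDRESS") + " "
                        + rs.getDouble("SALARY") + " " + rs.getInt("ID"));
            }
        }
    }
}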
4. Manipulating HBase with the Java API
4.1 Using Eclipse with HBase
4.1.1 Creating a Maven project
Right-click Files --> Others --> Maven Project, then enter the project name and package name to create the project.
4.1.2 Adding the hbase-site.xml file
1. The project communicates with the cluster by reading hbase-site.xml, so copy the hbase-site.xml file out of the /usr/crh/6.1.2.6-1457/hbase/conf directory.
2. Right-click the project, create a conf folder, and put hbase-site.xml into it.
3. Under Libraries, use Add Class Folder to add the conf folder, then click Apply and OK. (A programmatic alternative is sketched below.)
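If the conf folder cannot be added to the classpath, the key connection settings can also be set in code. A minimal sketch, assuming a hypothetical ZooKeeper host named cloud01 on the default client port (adjust the values to match your cluster's hbase-site.xml):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ProgrammaticConfig {
    public static Configuration build() {
        // Equivalent to the key settings normally read from hbase-site.xml on the classpath.
        Configuration cfg = HBaseConfiguration.create();
        cfg.set("hbase.zookeeper.quorum", "cloud01");              // hypothetical host; use your ZooKeeper hosts
        cfg.set("hbase.zookeeper.property.clientPort", "2181");
        cfg.set("zookeeper.znode.parent", "/hbase-unsecure");      // assumption: copy the value from your hbase-site.xml
        return cfg;
    }
}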
4.1.3 Editing the pom.xml file
Add the Hadoop and HBase dependencies:
<!-- https://mvnrepository.com/artifact/org.apache.hadoop/hadoop-hdfs -->
<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.5</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.hbase/hbase-client -->
<dependency>
    <groupId>org.apache.hbase</groupId>
    <artifactId>hbase-client</artifactId>
    <version>1.2.6</version>
</dependency>
<dependency>
    <groupId>jdk.tools</groupId>
    <artifactId>jdk.tools</artifactId>
    <version>1.8</version>
    <scope>system</scope>
    <systemPath>${JAVA_HOME}/lib/tools.jar</systemPath>
</dependency>
4.1.4 Creating the test class HBaseTestCase
package com.redoop.hbase.HBase;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseTestCase {

    // Declare the static configuration (HBaseConfiguration)
    static Configuration cfg = HBaseConfiguration.create();

    // Create a table using HBaseAdmin and HTableDescriptor
    public static void creat(String tablename, String columnFamily) throws Exception {
        HBaseAdmin admin = new HBaseAdmin(cfg);
        if (admin.tableExists(tablename)) {
            System.out.println("table Exists!");
            System.exit(0);
        } else {
            HTableDescriptor tableDesc = new HTableDescriptor(tablename);
            tableDesc.addFamily(new HColumnDescriptor(columnFamily));
            admin.createTable(tableDesc);
            System.out.println("create table success!");
        }
    }

    // Add one row of data to an existing table using HTable and Put
    public static void put(String tablename, String row, String columnFamily, String column, String data) throws Exception {
        HTable table = new HTable(cfg, tablename);
        Put p1 = new Put(Bytes.toBytes(row));
        p1.add(Bytes.toBytes(columnFamily), Bytes.toBytes(column), Bytes.toBytes(data));
        table.put(p1);
        System.out.println("put '" + row + "','" + columnFamily + ":" + column + "','" + data + "'");
    }

    // Read one row back by row key
    public static void get(String tablename, String row) throws IOException {
        HTable table = new HTable(cfg, tablename);
        Get g = new Get(Bytes.toBytes(row));
        Result result = table.get(g);
        System.out.println("Get: " + result);
    }

    // Show all data; read the existing table with HTable and Scan
    public static void scan(String tablename) throws Exception {
        HTable table = new HTable(cfg, tablename);
        Scan s = new Scan();
        ResultScanner rs = table.getScanner(s);
        for (Result r : rs) {
            System.out.println("Scan: " + r);
        }
    }

    // Disable and delete the table if it exists
    public static boolean delete(String tablename) throws IOException {
        HBaseAdmin admin = new HBaseAdmin(cfg);
        if (admin.tableExists(tablename)) {
            try {
                admin.disableTable(tablename);
                admin.deleteTable(tablename);
            } catch (Exception ex) {
                ex.printStackTrace();
                return false;
            }
        }
        return true;
    }

    public static void main(String[] agrs) {
        String tablename = "redoop_hbase_test";
        String columnFamily = "cf";
        try {
            HBaseTestCase.creat(tablename, columnFamily);
            HBaseTestCase.put(tablename, "row1", columnFamily, "crh", "redoop");
            HBaseTestCase.get(tablename, "row1");
            HBaseTestCase.scan(tablename);
            /*
            if (true == HBaseTestCase.delete(tablename))
                System.out.println("Delete table:" + tablename + "success!");
            */
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
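HBaseTestCase above uses HBaseAdmin and HTable directly, which still works against HBase 1.2.6 but is deprecated in the client API. The following is a minimal sketch of the same create/put/get flow using the newer Connection-based API, with the same table and column family names as the test class:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseConnectionExample {
    public static void main(String[] args) throws Exception {
        Configuration cfg = HBaseConfiguration.create();
        TableName name = TableName.valueOf("redoop_hbase_test");
        try (Connection conn = ConnectionFactory.createConnection(cfg);
             Admin admin = conn.getAdmin()) {
            // Create the table only if it does not exist yet.
            if (!admin.tableExists(name)) {
                HTableDescriptor desc = new HTableDescriptor(name);
                desc.addFamily(new HColumnDescriptor("cf"));
                admin.createTable(desc);
            }
            try (Table table = conn.getTable(name)) {
                // Put and Get work the same way as in HBaseTestCase, but via the Table interface.
                Put put = new Put(Bytes.toBytes("row1"));
                put.addColumn(Bytes.toBytes("cf"), Bytes.toBytes("crh"), Bytes.toBytes("redoop"));
                table.put(put);
                Result result = table.get(new Get(Bytes.toBytes("row1")));
                System.out.println("Get: " + result);
            }
        }
    }
}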
4.1.5 Verifying the results
In the Eclipse console output:
View the created table in HBase:
View the table contents in HBase: