Developer Guide
(http://hypertable.com/documentation/developer_guide/)
Table of Contents
- HQL Tutorial
- Hadoop MapReduce
- Hadoop Streaming MapReduce
- Regular Expression Filtering
- Atomic Counters
- Group Commit
HQL Tutorial
Introduction
This tutorial shows you how to import a search engine query log into Hypertable, storing the data into tables with different primary keys, and how to issue queries against the tables. You'll need to download the data from http://hypertable.googlecode.com/files/query-log.tsv.gz:
$ mkdir -p hql_tutorial
$ cd hql_tutorial
$ wget http://cdn.hypertable.com/pub/query-log.tsv.gz
The next step is to make sure Hypertable is properly installed (see Installation) and then launch the service. Once you have Hypertable up and running, fire up an interactive session:
$ /opt/hypertable/current/bin/ht shell
Welcome to the hypertable command interpreter.
For information about Hypertable, visit http://www.hypertable.org/
Type 'help' for a list of commands, or 'help shell' for a
list of shell meta commands.
hypertable>
Type "help" to display the list of valid HQL commands:
hypertable> help
USE ................ Sets the current namespace
CREATE NAMESPACE ... Creates a new namespace
DROP NAMESPACE ..... Removes a namespace
EXISTS TABLE ....... Check if table exists
CREATE TABLE ....... Creates a table
DELETE ............. Deletes all or part of a row from a table
DESCRIBE TABLE ..... Displays a table's schema
DROP TABLE ......... Removes a table
RENAME TABLE ....... Renames a table
DUMP TABLE ......... Create efficient backup file
ALTER TABLE ........ Add/remove column family from existing table
INSERT ............. Inserts data into a table
LOAD DATA INFILE ... Loads data from a TSV input file into a table
SELECT ............. Selects (and display) cells from a table
SHOW CREATE TABLE .. Displays CREATE TABLE command used to create table
SHOW TABLES ........ Displays only the list of tables in the current namespace
GET LISTING ........ Displays the list of tables and namespace in the current namespace
Statements must be terminated with ';'. For more information on a specific statement, type 'help <statement>', where <statement> is from the preceding list.
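For example, to see the detailed help for the SELECT statement you would type:

hypertable> help select;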
USE
First, open the root namespace. For an explanation of namespaces see Namespaces. To open the root namespace issue the following HQL command:
hypertable> use "/";
CREATE NAMESPACE
Now create a namespace, Tutorial, within which we'll create our tables:
hypertable> create namespace "Tutorial";
hypertable> use Tutorial;
CREATE TABLE
Now that we have created and opened the Tutorial namespace we can create tables within it. In this tutorial we will be loading data into, and querying data from, two separate tables. The first table, QueryLogByUserID, will be indexed by the fields UserID+QueryTime and the second table, QueryLogByTimestamp, will be indexed by the fields QueryTime+UserID. Notice that any identifier that contains non-alphanumeric characters (e.g. '-') must be surrounded by quotes.
hypertable> CREATE TABLE QueryLogByUserID ( Query, ItemRank, ClickURL );
hypertable> CREATE TABLE QueryLogByTimestamp ( Query, ItemRank, ClickURL );
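As a hypothetical illustration of the quoting rule mentioned above (this table is not part of the tutorial), an identifier containing a hyphen would have to be written in quotes:

hypertable> CREATE TABLE "Query-Log-2008" ( Query, ItemRank, ClickURL );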
See the HQL Documentation: CREATE TABLE for complete syntax.
SHOW TABLES
Show all of the tables that exist in the current namespace:
hypertable> show tables;
QueryLogByUserID
QueryLogByTimestamp
SHOW CREATE TABLE
Now, issue the SHOW CREATE TABLE command to make sure you got everything right. We didn't have to include the field called 'row' because we'll use that in our LOAD DATA INFILE command later:
hypertable> show create table QueryLogByUserID;
CREATE TABLE QueryLogByUserID (
Query,
ItemRank,
ClickURL,
ACCESS GROUP default (Query, ItemRank, ClickURL)
);
And, notice that, by default, a single ACCESS GROUP named "default" is created. Access groups are a way to physically group columns together on disk. See the CREATE TABLE documentation for a more detailed description of access groups.
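As a sketch of how a custom access group could be declared (a hypothetical variant, not the table actually used in this tutorial), ItemRank and ClickURL could be placed in their own access group so they are stored together on disk, separately from Query:

hypertable> CREATE TABLE QueryLogClustered (
Query,
ItemRank,
ClickURL,
ACCESS GROUP ranking ( ItemRank, ClickURL )
);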
LOAD DATA INFILE
Now, let's load some data using the MySQL-like TAB delimited format (TSV). For that, we assume you have the example data in the file query-log.tsv.gz. This file includes an initial header line indicating the format of each line in the file by listing tab delimited column names. To inspect this file we first quit out of the Hypertable command line interpreter and then use the zcat program (requires gzip package) to display the contents of query-log.tsv.gz:
hypertable> quit
$ zcat query-log.tsv.gz
#QueryTime UserID Query ItemRank ClickURL
2008-11-13 00:01:30 2289203 kitchen counter in new orleans 10 http://www.era.com
2008-11-13 00:01:30 2289203 kitchen counter in new orleans 4 http://www.superpages.com
2008-11-13 00:01:30 2289203 kitchen counter in new orleans 5 http://www.superpages.com
2008-11-13 00:01:31 1958633 beads amethyst gemstone 1 http://www.gemsbiz.com
2008-11-13 00:01:31 3496052 chat
2008-11-13 00:01:33 892003 photo example quarter doubled die coin 5 http://www.coinresource.com
2008-11-13 00:01:33 892003 photo example quarter doubled die coin 5 http://www.coinresource.com
2008-11-13 00:01:35 2251112 radio stations in buffalo 1 http://www.ontheradio.net
2008-11-13 00:01:37 1274922 fafsa renewal 1 http://www.fafsa.ed.gov
2008-11-13 00:01:37 1274922 fafsa renewal 1 http://www.fafsa.ed.gov
2008-11-13 00:01:37 441978 find phone numbers 1 http://www.anywho.com
2008-11-13 00:01:37 441978 find phone numbers 3 http://www.411.com
...
Now let's load the data file query-log.tsv.gz into the table QueryLogByUserID. The row key is formulated by zero-padding the UserID field out to nine digits and concatenating the QueryTime field. The QueryTime field is used as the internal cell timestamp. To load the file, first jump back into the Hypertable command line interpreter and then issue the LOAD DATA INFILE command shown below.
$ /opt/hypertable/current/bin/ht shell
...
hypertable> use Tutorial;
hypertable> LOAD DATA INFILE ROW_KEY_COLUMN="%09UserID"+QueryTime TIMESTAMP_COLUMN=QueryTime "query-log.tsv.gz" INTO TABLE QueryLogByUserID;
Loading 7,464,729 bytes of input data...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Load complete.
Elapsed time: 9.84 s
Avg value size: 15.42 bytes
Avg key size: 29.00 bytes
Throughput: 4478149.39 bytes/s (764375.74 bytes/s)
Total cells: 992525
Throughput: 100822.73 cells/s
Resends: 0
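The "%09UserID" portion of the ROW_KEY_COLUMN spec zero-pads the UserID out to nine digits before the QueryTime is appended. A rough sketch of what that padding produces, using plain printf with one of the sample rows above (this is just an illustration, not part of the load itself):

$ printf '%09d %s\n' 2289203 '2008-11-13 00:01:30'
002289203 2008-11-13 00:01:30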
A quick inspection of the table shows:
hypertable> select * from QueryLogByUserID limit 8;
000000036 2008-11-13 10:30:46 Query helena ga
000000036 2008-11-13 10:31:34 Query helena ga
000000036 2008-11-13 10:45:23 Query checeron s
000000036 2008-11-13 10:46:07 Query cheveron gas station
000000036 2008-11-13 10:46:34 Query cheveron gas station richmond virginia
000000036 2008-11-13 10:48:56 Query cheveron glenside road richmond virginia
000000036 2008-11-13 10:49:05 Query chevron glenside road richmond virginia
000000036 2008-11-13 10:49:05 ItemRank 1
000000036 2008-11-13 10:49:05 ClickURL http://yp.yahoo.com
000000053 2008-11-13 15:18:21 Query mapquest
000000053 2008-11-13 15:18:21 ItemRank 1
000000053 2008-11-13 15:18:21 ClickURL http://www.mapquest.com
Elapsed time: 0.01 s
Avg value size: 18.08 bytes
Avg key size: 30.00 bytes
Throughput: 43501.21 bytes/s
Total cells: 12
Throughput: 904.70 cells/s
Now let's load the data file query-log.tsv.gz into the table QueryLogByTimestamp. The row key is formulated by concatenating the QueryTime field with the nine digit, zero-padded UserID field. The QueryTime field is used as the internal cell timestamp.
hypertable> LOAD DATA INFILE ROW_KEY_COLUMN=QueryTime+"%09UserID" TIMESTAMP_COLUMN=QueryTime "query-log.tsv.gz" INTO TABLE QueryLogByTimestamp;
Loading 7,464,729 bytes of input data...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Load complete.
Elapsed time: 10.18 s
Avg value size: 15.42 bytes
Avg key size: 29.00 bytes
Throughput: 4330913.20 bytes/s (739243.98 bytes/s)
Total cells: 992525
Throughput: 97507.80 cells/s
Resends: 0
And a quick inspection of the table shows:
hypertable> select * from QueryLogByTimestamp limit 4;
2008-11-13 00:01:30 002289203 Query kitchen counter in new orleans
2008-11-13 00:01:30 002289203 ItemRank 5
2008-11-13 00:01:30 002289203 ClickURL http://www.superpages.com
2008-11-13 00:01:31 001958633 Query beads amethyst gemstone
2008-11-13 00:01:31 001958633 ItemRank 1
2008-11-13 00:01:31 001958633 ClickURL http://www.gemsbiz.com
2008-11-13 00:01:31 003496052 Query chat
2008-11-13 00:01:33 000892003 Query photo example quarter doubled die coin
2008-11-13 00:01:33 000892003 ItemRank 5
2008-11-13 00:01:33 000892003 ClickURL http://www.coinresource.com
Elapsed time: 0.00 s
Avg value size: 18.11 bytes
Avg key size: 30.00 bytes
Throughput: 287150.49 bytes/s
Total cells: 19
Throughput: 5969.21 cells/s
See the HQL Documentation: LOAD DATA INFILE for complete syntax.
SELECT
Let's start by examining the QueryLogByUserID table. To select all of the data for user ID 003269359 we need to use the starts with operator =^. Remember that the row key is the concatenation of the user ID and the timestamp which is why we need to use the starts with operator.
hypertable> select * from QueryLogByUserID where row =^ '003269359';
003269359 2008-11-13 04:36:34 Query binibining pilipinas 2008 winners
003269359 2008-11-13 04:36:34 ItemRank 5
003269359 2008-11-13 04:36:34 ClickURL http://www.missosology.org
003269359 2008-11-13 04:37:34 Query pawee's kiss and tell
003269359 2008-11-13 04:37:34 ItemRank 3
003269359 2008-11-13 04:37:34 ClickURL http://www.missosology.org
003269359 2008-11-13 05:07:10 Query rn jobs in 91405
003269359 2008-11-13 05:07:10 ItemRank 9
003269359 2008-11-13 05:07:10 ClickURL http://91405.jobs.com
003269359 2008-11-13 05:20:22 Query rn jobs in 91405
...
003269359 2008-11-13 09:42:49 Query wound ostomy rn training
003269359 2008-11-13 09:42:49 ItemRank 11
003269359 2008-11-13 09:42:49 ClickURL http://www.wocn.org
003269359 2008-11-13 09:46:50 Query pych nurse in encino tarzana hospital
003269359 2008-11-13 09:47:18 Query encino tarzana hospital
003269359 2008-11-13 09:47:18 ItemRank 2
003269359 2008-11-13 09:47:18 ClickURL http://www.encino-tarzana.com
003269359 2008-11-13 09:52:42 Query encino tarzana hospital
003269359 2008-11-13 09:53:08 Query alhambra hospital
003269359 2008-11-13 09:53:08 ItemRank 1
003269359 2008-11-13 09:53:08 ClickURL http://www.alhambrahospital.com
Elapsed time: 0.01 s
Avg value size: 19.24 bytes
Avg key size: 30.00 bytes
Throughput: 2001847.79 bytes/s
Total cells: 352
Throughput: 40651.35 cells/s
The result set was fairly large (352 cells), so let's now try selecting just the queries that were issued by the user with ID 003269359 during the hour of 5am. To do this we need to add a TIMESTAMP predicate. Each cell has an internal timestamp and the TIMESTAMP predicate can be used to filter the results based on this timestamp.
hypertable> select * from QueryLogByUserID where row =^ '003269359' AND "2008-11-13 05:00:00" <= TIMESTAMP < "2008-11-13 06:00:00";
003269359 2008-11-13 05:07:10 Query rn jobs in 91405
003269359 2008-11-13 05:07:10 ItemRank 9
003269359 2008-11-13 05:07:10 ClickURL http://91405.jobs.com
003269359 2008-11-13 05:20:22 Query rn jobs in 91405
003269359 2008-11-13 05:20:22 ItemRank 16
003269359 2008-11-13 05:20:22 ClickURL http://www.careerbuilder.com
003269359 2008-11-13 05:34:02 Query usc university hospital
003269359 2008-11-13 05:34:02 ItemRank 1
003269359 2008-11-13 05:34:02 ClickURL http://www.uscuh.com
003269359 2008-11-13 05:37:01 Query rn jobs in san fernando valley
003269359 2008-11-13 05:37:01 ItemRank 7
003269359 2008-11-13 05:37:01 ClickURL http://www.medhunters.com
003269359 2008-11-13 05:46:22 Query northridge hospital
003269359 2008-11-13 05:46:22 ItemRank 2
003269359 2008-11-13 05:46:22 ClickURL http://northridgehospital.org
003269359 2008-11-13 05:53:34 Query valley presbyterian hospital
003269359 2008-11-13 05:53:34 ItemRank 4
003269359 2008-11-13 05:53:34 ClickURL http://www.hospital-data.com
003269359 2008-11-13 05:55:36 Query valley presbyterian hospital website
003269359 2008-11-13 05:55:36 ItemRank 1
003269359 2008-11-13 05:55:36 ClickURL http://www.valleypres.org
003269359 2008-11-13 05:59:24 Query mission community hospital
003269359 2008-11-13 05:59:24 ItemRank 1
003269359 2008-11-13 05:59:24 ClickURL http://www.mchonline.org
Elapsed time: 0.00 s
Avg value size: 18.50 bytes
Avg key size: 30.00 bytes
Throughput: 2602086.44 bytes/s
Total cells: 36
Throughput: 53651.27 cells/s
Keep in mind that the internal cell timestamp is different than the one embedded in the row key. In this example, they both represent the same time. By specifying the TIMESTAMP_COLUMN option to LOAD DATA INFILE, we extracted the QueryTime field to be used as the internal cell timestamp. If we hadn't supplied that option, the system would have auto-assigned a timestamp. To display the internal cell timestamp, add the DISPLAY_TIMESTAMPS option:
hypertable> select * from QueryLogByUserID limit 5 DISPLAY_TIMESTAMPS;
2008-11-13 10:30:46.000000000 000000036 2008-11-13 10:30:46 Query helena ga
2008-11-13 10:31:34.000000000 000000036 2008-11-13 10:31:34 Query helena ga
2008-11-13 10:45:23.000000000 000000036 2008-11-13 10:45:23 Query checeron s
2008-11-13 10:46:07.000000000 000000036 2008-11-13 10:46:07 Query cheveron gas station
2008-11-13 10:46:34.000000000 000000036 2008-11-13 10:46:34 Query cheveron gas station richmond virginia
Elapsed time: 0.00 s
Avg value size: 17.20 bytes
Avg key size: 30.00 bytes
Throughput: 207563.76 bytes/s
Total cells: 5
Throughput: 4397.54 cells/s
There is no index for the internal cell timestamps, so if we don't include a row =^ expression in our predicate, the system will do a full table scan. This is why we imported the data into a second table QueryLogByTimestamp. This table includes the timestamp as the row key prefix which allows us to efficiently query data over a time interval.
The following query selects all query log data for November 14th, 2008:
hypertable> select * from QueryLogByTimestamp WHERE ROW =^ '2008-11-14';
2008-11-14 00:00:00 001040178 Query noodle tools
2008-11-14 00:00:00 001040178 ItemRank 1
2008-11-14 00:00:00 001040178 ClickURL http://www.noodletools.com
2008-11-14 00:00:01 000264655 Query games.myspace.com
2008-11-14 00:00:01 000264655 ItemRank 1
2008-11-14 00:00:01 000264655 ClickURL http://games.myspace.com
2008-11-14 00:00:01 000527424 Query franklinville schools new jersey
2008-11-14 00:00:01 000527424 ItemRank 1
2008-11-14 00:00:01 000527424 ClickURL http://www.greatschools.net
2008-11-14 00:00:01 000632400 Query lack of eye contact symptom of...
...
2008-11-14 06:02:33 003676354 Query baby 20showers
2008-11-14 06:02:35 003378030 Query task and responsibility matrix
2008-11-14 06:02:35 003378030 ItemRank 2
2008-11-14 06:02:35 003378030 ClickURL http://im.ncsu.edu
2008-11-14 06:02:36 004578101 Query jcpenneys
2008-11-14 06:02:37 005120734 Query ebay
2008-11-14 06:02:40 000957500 Query buccal fat size of ping pong ball
Elapsed time: 2.37 s
Avg value size: 15.36 bytes
Avg key size: 30.00 bytes
Throughput: 1709616.45 bytes/s
Total cells: 89412
Throughput: 37689.18 cells/s
And to select all query log data for November 14th, 2008 during the hour of 3 am:
hypertable> select * from QueryLogByTimestamp WHERE ROW =^ '2008-11-14 03';
2008-11-14 03:00:00 002512415 Query ny times
2008-11-14 03:00:00 002512415 ItemRank 1
2008-11-14 03:00:00 002512415 ClickURL http://www.nytimes.com
2008-11-14 03:00:00 005294906 Query kickmeto.fosi
2008-11-14 03:00:00 005459226 Query http://www.dickdyertoyota.com
2008-11-14 03:00:02 000637292 Query days of our lives
2008-11-14 03:00:02 000637292 ItemRank 3
2008-11-14 03:00:02 000637292 ClickURL http://www.nbc.com
2008-11-14 03:00:03 002675105 Query ghetto superstar lyrics
...
2008-11-14 03:59:52 002874080 ClickURL http://www.paintball-discounters.com
2008-11-14 03:59:53 004292772 Query drop down menu
2008-11-14 03:59:55 005656539 Query to buy indian hair to make wigs in new york
2008-11-14 03:59:55 005656539 ItemRank 1
2008-11-14 03:59:55 005656539 ClickURL http://query.nytimes.com
2008-11-14 03:59:58 004318586 Query myspace .com
Elapsed time: 0.17 s
Avg value size: 15.37 bytes
Avg key size: 30.00 bytes
Throughput: 2267099.06 bytes/s
Total cells: 8305
Throughput: 49967.51 cells/s
And finally, to select all query log data for November 14th, 2008 during the minute of 3:45 am:
hypertable> select * from QueryLogByTimestamp WHERE ROW =^ '2008-11-14 03:45';
2008-11-14 03:45:00 003895650 Query ks lottery.
2008-11-14 03:45:00 003895650 ItemRank 2
2008-11-14 03:45:00 003895650 ClickURL http://www.lotterypost.com
2008-11-14 03:45:00 005036796 Query http://www.glasgowdailytimes 10-20-2005
2008-11-14 03:45:01 002863052 Query map quest
2008-11-14 03:45:01 005514285 Query john bermeo
2008-11-14 03:45:02 002394176 Query http://www.eggseye.com
2008-11-14 03:45:02 003454227 Query hawaiian weddig band
2008-11-14 03:45:03 001006089 Query brokers hiring loan officers in indiana
2008-11-14 03:45:06 000844720 Query latest design microsoft freeware
...
2008-11-14 03:45:55 003920469 ItemRank 3
2008-11-14 03:45:55 003920469 ClickURL http://www.pennyblood.com
2008-11-14 03:45:56 002729906 Query tryaold
2008-11-14 03:45:56 003919348 Query feathered draped fox fur mandalas
2008-11-14 03:45:56 003919348 ItemRank 8
2008-11-14 03:45:56 003919348 ClickURL http://www.greatdreams.com
2008-11-14 03:45:56 004803968 Query -
Elapsed time: 0.02 s
Avg value size: 15.71 bytes
Avg key size: 30.00 bytes
Throughput: 305030.80 bytes/s
Total cells: 130
Throughput: 6673.51 cells/s
See the HQL Documentation: SELECT for complete syntax.
ALTER TABLE
The ALTER TABLE command can be used to add and/or remove columns from a table. The following command will add a 'Notes' column in a new access group called 'extra' and will drop column 'ItemRank'.
hypertable> ALTER TABLE QueryLogByUserID ADD(Notes, ACCESS GROUP extra(Notes)) DROP(ItemRank);
To verify the change, issue the SHOW CREATE TABLE command:
hypertable> show create table QueryLogByUserID;
CREATE TABLE QueryLogByUserID (
Query,
ClickURL,
Notes,
ACCESS GROUP default (Query, ClickURL),
ACCESS GROUP extra (Notes)
)
And to verify that the column no longer exists, issue the same SELECT statement we issued above (NOTE: the data for the column still exists in the file system, it will get lazily garbage collected).
hypertable> select * from QueryLogByUserID limit 8;
000000036 2008-11-13 10:30:46 Query helena ga
000000036 2008-11-13 10:31:34 Query helena ga
000000036 2008-11-13 10:45:23 Query checeron s
000000036 2008-11-13 10:46:07 Query cheveron gas station
000000036 2008-11-13 10:46:34 Query cheveron gas station richmond virginia
000000036 2008-11-13 10:48:56 Query cheveron glenside road richmond virginia
000000036 2008-11-13 10:49:05 Query chevron glenside road richmond virginia
000000036 2008-11-13 10:49:05 ClickURL http://yp.yahoo.com
000000053 2008-11-13 15:18:21 Query mapquest
000000053 2008-11-13 15:18:21 ClickURL http://www.mapquest.com
Elapsed time: 0.00 s
Avg value size: 21.50 bytes
Avg key size: 30.00 bytes
Throughput: 140595.14 bytes/s
Total cells: 10
Throughput: 2730.00 cells/s
See HQL Documentation: ALTER TABLE for complete syntax.
INSERT & DELETE
Now let's augment the QueryLogByUserID table by adding some information in the Notes column for a few of the queries:
hypertable> INSERT INTO QueryLogByUserID VALUES
("000019058 2008-11-13 07:24:43", "Notes", "animals"),
("000019058 2008-11-13 07:57:16", "Notes", "food"),
("000019058 2008-11-13 07:59:36", "Notes", "gardening");
Elapsed time: 0.01 s
Avg value size: 6.67 bytes
Total cells: 3
Throughput: 298.36 cells/s
Resends: 0
Notice the new data by querying the affected row:
hypertable> select * from QueryLogByUserID where row =^ '000019058';
000019058 2008-11-13 07:24:43 Query tigers
000019058 2008-11-13 07:24:43 Notes animals
000019058 2008-11-13 07:57:16 Query bell peppers
000019058 2008-11-13 07:57:16 Notes food
000019058 2008-11-13 07:58:24 Query bell peppers
000019058 2008-11-13 07:58:24 ClickURL http://agalternatives.aers.psu.edu
000019058 2008-11-13 07:59:36 Query growing bell peppers
000019058 2008-11-13 07:59:36 Query growing bell peppers
000019058 2008-11-13 07:59:36 ClickURL http://www.farm-garden.com
000019058 2008-11-13 07:59:36 ClickURL http://www.organicgardentips.com
000019058 2008-11-13 07:59:36 Notes gardening
000019058 2008-11-13 12:31:02 Query tracfone
000019058 2008-11-13 12:31:02 ClickURL http://www.tracfone.com
Elapsed time: 0.00 s
Avg value size: 16.38 bytes
Avg key size: 30.00 bytes
Throughput: 162271.26 bytes/s
Total cells: 13
Throughput: 3498.39 cells/s
Now try deleting one of the notes we just added:
hypertable> delete Notes from QueryLogByUserID where ROW = "000019058 2008-11-13 07:24:43";
Elapsed time: 0.00 s
Total cells: 1
Throughput: 256.41 cells/s
Resends: 0
And verify that the cell was, indeed, deleted:
hypertable> select * from QueryLogByUserID where row =^ '000019058';
000019058 2008-11-13 07:24:43 Query tigers
000019058 2008-11-13 07:57:16 Query bell peppers
000019058 2008-11-13 07:57:16 Notes food
000019058 2008-11-13 07:58:24 Query bell peppers
000019058 2008-11-13 07:58:24 ClickURL http://agalternatives.aers.psu.edu
000019058 2008-11-13 07:59:36 Query growing bell peppers
000019058 2008-11-13 07:59:36 Query growing bell peppers
000019058 2008-11-13 07:59:36 ClickURL http://www.farm-garden.com
000019058 2008-11-13 07:59:36 ClickURL http://www.organicgardentips.com
000019058 2008-11-13 07:59:36 Notes gardening
000019058 2008-11-13 12:31:02 Query tracfone
000019058 2008-11-13 12:31:02 ClickURL http://www.tracfone.com
Elapsed time: 0.00 s
Avg value size: 16.38 bytes
Avg key size: 30.00 bytes
Throughput: 162271.26 bytes/s
Total cells: 12
Throughput: 3498.39 cells/s
See the HQL Documentation: INSERT and the HQL Documentation: DELETE for complete syntax.
DROP TABLE
The DROP TABLE command is used to remove tables from the system. The IF EXISTS option prevents the system from throwing an error if the table does not exist:
hypertable> drop table IF EXISTS foo;
Let's remove one of the example tables:
hypertable> drop table QueryLogByUserID;
hypertable> show tables;
QueryLogByTimestamp
Then let's remove the other:
hypertable> drop table QueryLogByTimestamp;
hypertable> show tables;
GET LISTING & DROP NAMESPACE
Now, we want to get rid of the Tutorial namespace and verify that we have:
hypertable> use "/";
hypertable> get listing;
Tutorial (namespace)
sys (namespace)
hypertable> drop namespace Tutorial;
hypertable> get listing;
sys (namespace)
The sys namespace is used by the Hypertable system and should not be used to contain user tables.
Note that a namespace must be empty (i.e. it must not contain any sub-namespaces or tables) before you can drop it. In this case, since we had already dropped the QueryLogByUserID and QueryLogByTimestamp tables, we could go ahead and drop the Tutorial namespace.
Hadoop MapReduce
In order to run this example, Hadoop needs to be installed, and HDFS and the MapReduce framework need to be up and running. Hypertable builds against Cloudera's CDH3 distribution of Hadoop. See CDH3 Installation for instructions on how to get Hadoop up and running.
Hypertable ships with a jar file, hypertable-x.x.x.x.jar (where x.x.x.x is the hypertable release level, e.g., 0.9.5.5), that contains Hadoop InputFormat and OutputFormat classes that allow MapReduce programs to directly read from and write to tables in Hypertable. In this section, we walk you through an example MapReduce program, WikipediaWordCount, that tokenizes articles in a table called wikipedia that has been loaded with a Wikipedia dump. It reads the article column, tokenizes it, and populates the word column of the same table. Each unique word in the article turns into a qualified column and the value is the number of times the word appears in the article.
Setup
First, exit the Hypertable command line interpreter and download the Wikipedia dump, for example:
$ wget http://cdn.hypertable.com/pub/wikipedia.tsv.gz
Next, jump back into the Hypertable command line interpreter and create the wikipedia table by executing the HQL commands shown below.
CREATE NAMESPACE test;
USE test;
DROP TABLE IF EXISTS wikipedia;
CREATE TABLE wikipedia (
title,
id,
username,
article,
word
);
Now load the compressed Wikipedia dump file directly into the wikipedia table by issuing the following HQL commands:
hypertable> LOAD DATA INFILE "wikipedia.tsv.gz" INTO TABLE wikipedia;
Loading 638,058,135 bytes of input data...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Load complete.
Elapsed time: 78.28 s
Avg value size: 1709.59 bytes
Avg key size: 24.39 bytes
Throughput: 25226728.63 bytes/s (8151017.58 bytes/s)
Total cells: 1138847
Throughput: 14548.46 cells/s
Resends: 8328
Example
In this example, we'll be running the WikipediaWordCount program, which is included in the hypertable-X.X.X.X-examples.jar file that ships with the binary package installation. The following is a link to the source code for this program.
WikipediaWordCount.java(https://github.com/hypertable/hypertable/blob/master/examples/java/org/hypertable/examples/hadoop/mapreduce/WikipediaWordCount.java)
To get an idea of what the data looks like, try the following select:
hypertable> select * from wikipedia where row =^ "Addington";
Addington, Buckinghamshire title Addington, Buckinghamshire
Addington, Buckinghamshire id 377201
Addington, Buckinghamshire username Roleplayer
Addington, Buckinghamshire article {{infobox UK place \n|country = England\n|latitude=51.95095\n|longitude=-0.92177\n|official_name= Addington\n|population = 145 ...
Now exit from the Hypertable command line interpreter and run the WikipediaWordCount MapReduce program:
hypertable> quit
$ hadoop jar /opt/hypertable/current/lib/java/hypertable-*-examples.jar \
  org.hypertable.examples.WikipediaWordCount \
  --namespace=test --columns=article
To verify that it worked, jump back into the Hypertable command line interpreter and try selecting for the word column:
$ /opt/hypertable/current/bin/ht shell
hypertable> select word from wikipedia where row =^ "Addington";
...
Addington, Buckinghamshire word:A 1
Addington, Buckinghamshire word:Abbey 1
Addington, Buckinghamshire word:Abbotts 1
Addington, Buckinghamshire word:According 1
Addington, Buckinghamshire word:Addington 6
Addington, Buckinghamshire word:Adstock 1
Addington, Buckinghamshire word:Aston 1
Addington, Buckinghamshire word:Aylesbury 3
Addington, Buckinghamshire word:BUCKINGHAM 1
Addington, Buckinghamshire word:Bayeux 2
Addington, Buckinghamshire word:Bene 1
Addington, Buckinghamshire word:Bishop 1
...
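Since each word becomes its own column qualifier, a single word's count can also be pulled back directly by naming the qualified column (a small illustrative query; the exact value assumes the job above ran over the same dump):

hypertable> select word:Addington from wikipedia where row = "Addington, Buckinghamshire";
Addington, Buckinghamshire word:Addington 6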
Hadoop Streaming MapReduce
In order to run this example, Hadoop needs to be installed, and HDFS and the MapReduce framework need to be up and running. Hypertable builds against Cloudera's CDH3 distribution of Hadoop. See CDH3 Installation for instructions on how to get Hadoop up and running.
In this example, we'll be running a Hadoop Streaming MapReduce job that uses a Bash script as the mapper and a Bash script as the reducer. Like the example in the previous section, the programs operate on a table called wikipedia that has been loaded with a Wikipedia dump.
Setup
First, exit the Hypertable command line interpreter and download the Wikipedia dump, for example:
$ wget http://cdn.hypertable.com/pub/wikipedia.tsv.gz
Next, jump back into the Hypertable command line interpreter and create the wikipedia table by executing the HQL commands shown below.
CREATE NAMESPACE test;
USE test;
DROP TABLE IF EXISTS wikipedia;
CREATE TABLE wikipedia (
title,
id,
username,
article,
word
);
Now load the compressed Wikipedia dump file directly into the wikipedia table by issuing the following HQL commands:
hypertable> LOAD DATA INFILE "wikipedia.tsv.gz" INTO TABLE wikipedia;
Loading 638,058,135 bytes of input data...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Load complete.
Elapsed time: 78.28 s
Avg value size: 1709.59 bytes
Avg key size: 24.39 bytes
Throughput: 25226728.63 bytes/s (8151017.58 bytes/s)
Total cells: 1138847
Throughput: 14548.46 cells/s
Resends: 8328
The mapper script (tokenize-article.sh) and the reducer script (reduce-word-counts.sh) are shown below.
Example
The following script, tokenize-article.sh, will be used as the mapper script.
#!/usr/bin/env bash
# Input and output fields are tab-delimited (row, column, value).
IFS=$'\t'
read name column article
while [ $? == 0 ] ; do
  if [ "$column" == "article" ] ; then
    # Strip punctuation
    stripped_article=`echo $article | awk 'BEGIN { FS="\t" } { print $NF }' | tr "\!\"#\$&'()*+,-./:;<=>?@[\\\\]^_\{|}~" " " | tr -s " "` ;
    # Split article into words
    echo $stripped_article | awk -v name="$name" 'BEGIN { article=name; FS=" "; } { for (i=1; i<=NF; i++) printf "%s\tword:%s\t1\n", article, $i; }' ;
  fi
  # Read another line
  read name column article
done
exit 0
The following script, reduce-word-counts.sh, will be used as the reducer script.
#!/usr/bin/env bash
# Input and output fields are tab-delimited (row, column, value).
last_article=
last_word=
let total=0
IFS=$'\t'
read article word count
while [ $? == 0 ] ; do
  if [ "$article" == "$last_article" ] && [ "$word" == "$last_word" ] ; then
    let total=$count+total
  else
    if [ "$last_word" != "" ] ; then
      echo -e "$last_article\t$last_word\t$total"
    fi
    let total=$count
    last_word=$word
    last_article=$article
  fi
  read article word count
done
if [ $total -gt 0 ] ; then
  echo -e "$last_article\t$last_word\t$total"
fi
exit 0
To populate the word column of the wikipedia table by tokenizing the article column using the above mapper and reducer scripts, issue the following commands:
hypertable> quit
$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u*.jar \
  -libjars /opt/hypertable/current/lib/java/hypertable-*.jar,/opt/hypertable/current/lib/java/libthrift-*.jar \
  -Dhypertable.mapreduce.namespace=test \
  -Dhypertable.mapreduce.input.table=wikipedia \
  -Dhypertable.mapreduce.output.table=wikipedia \
  -mapper /home/doug/tokenize-article.sh \
  -combiner /home/doug/reduce-word-counts.sh \
  -reducer /home/doug/reduce-word-counts.sh \
  -file /home/doug/tokenize-article.sh \
  -file /home/doug/reduce-word-counts.sh \
  -inputformat org.hypertable.hadoop.mapred.TextTableInputFormat \
  -outputformat org.hypertable.hadoop.mapred.TextTableOutputFormat \
  -input wikipedia -output wikipedia
Input/Output Configuration Properties
The following table lists the job configuration properties that are used to specify, among other things, the input table, output table, and scan specification. These properties can be supplied to a streaming MapReduce job with -Dproperty=value arguments.
Property                                                | Description                                                        | Example Value
hypertable.mapreduce.namespace                          | Namespace for both input and output table                          | /test
hypertable.mapreduce.input.namespace                    | Namespace for input table                                          | /test/input
hypertable.mapreduce.input.table                        | Input table name                                                   | wikipedia
hypertable.mapreduce.input.scan_spec.columns            | Comma separated list of input columns                              | id,title
hypertable.mapreduce.input.scan_spec.options            | Input WHERE clause options                                         | MAX_VERSIONS 1 KEYS_ONLY
hypertable.mapreduce.input.scan_spec.row_interval       | Input row interval                                                 | Dog <= ROW < Kitchen
hypertable.mapreduce.input.scan_spec.timestamp_interval | Timestamp filter                                                   | TIMESTAMP >= 2011-11-21
hypertable.mapreduce.input.include_timestamps           | Emit integer timestamp as the 1st field (nanoseconds since epoch)  | TRUE
hypertable.mapreduce.output.namespace                   | Namespace containing output table                                  | /test/output
hypertable.mapreduce.output.table                       | Output table name                                                  | wikipedia
hypertable.mapreduce.output.mutator_flags               | Flags parameter passed to mutator constructor (1 = NO_LOG_SYNC)    | 1
Column Selection
To run a MapReduce job over a subset of columns from the input table, specify a comma separated list of columns in the hypertable.mapreduce.input.scan_spec.columns Hadoop configuration property. For example,
$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u*.jar \
  -libjars /opt/hypertable/current/lib/java/hypertable-*.jar,/opt/hypertable/current/lib/java/libthrift-*.jar \
  -Dhypertable.mapreduce.namespace=test \
  -Dhypertable.mapreduce.input.table=wikipedia \
  -Dhypertable.mapreduce.input.scan_spec.columns="id,title" \
  -mapper /bin/cat -reducer /bin/cat \
  -inputformat org.hypertable.hadoop.mapred.TextTableInputFormat \
  -input wikipedia -output wikipedia2
Timestamps
To filter the input table with a timestamp predicate, specify the timestamp predicate in the hypertable.mapreduce.input.scan_spec.timestamp_interval Hadoop configuration property. The timestamp predicate is specified using the same format as the timestamp predicate in the WHERE clause of the SELECT statement, as illustrated in the following examples:
- TIMESTAMP < 2010-08-03 12:30:00
- TIMESTAMP >= 2010-08-03 12:30:00
- 2010-08-01 <= TIMESTAMP <= 2010-08-09
To preserve the timestamps from the input table, set the hypertable.mapreduce.input.include_timestamps Hadoop configuration property to true. This will cause the TextTableInputFormat class to produce an additional field (field 0) that represents the timestamp as nanoseconds since the epoch. The following example illustrates how to pass a timestamp predicate into a Hadoop Streaming MapReduce program.
$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u*.jar \
  -libjars /opt/hypertable/current/lib/java/hypertable-*.jar,/opt/hypertable/current/lib/java/libthrift-*.jar \
  -Dhypertable.mapreduce.namespace=test \
  -Dhypertable.mapreduce.input.table=wikipedia \
  -Dhypertable.mapreduce.output.table=wikipedia2 \
  -Dhypertable.mapreduce.input.scan_spec.columns="id,title" \
  -Dhypertable.mapreduce.input.scan_spec.timestamp_interval="2010-08-01 <= TIMESTAMP <= 2010-08-09" \
  -Dhypertable.mapreduce.input.include_timestamps=true \
  -mapper /bin/cat -reducer /bin/cat \
  -inputformat org.hypertable.hadoop.mapred.TextTableInputFormat \
  -outputformat org.hypertable.hadoop.mapred.TextTableOutputFormat \
  -input wikipedia -output wikipedia2
Row Intervals
To restrict the MapReduce to a specific row interval of the input table, a row range can be specified with the hypertable.mapreduce.input.scan_spec.row_interval Hadoop configuration property. The row interval predicate is specified using the same format as the row interval in the WHERE clause of the SELECT statement, as illustrated in the following examples:
- ROW < foo
- ROW >= bar
- bar <= ROW <= 'foo'
The following example illustrates how a row interval is passed into a Hadoop Streaming MapReduce program.
$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u*.jar \
  -libjars /opt/hypertable/current/lib/java/hypertable-*.jar,/opt/hypertable/current/lib/java/libthrift-*.jar \
  -Dhypertable.mapreduce.namespace=test \
  -Dhypertable.mapreduce.input.table=wikipedia \
  -Dhypertable.mapreduce.output.table=wikipedia2 \
  -Dhypertable.mapreduce.input.scan_spec.columns="id,title" \
  -Dhypertable.mapreduce.input.scan_spec.row_interval="Dog <= ROW <= Kitchen" \
  -mapper /bin/cat -reducer /bin/cat \
  -inputformat org.hypertable.hadoop.mapred.TextTableInputFormat \
  -outputformat org.hypertable.hadoop.mapred.TextTableOutputFormat \
  -input wikipedia -output wikipedia2
Options
A subset of the WHERE clause options of the HQL SELECT statement can be specified by supplying the options with the hypertable.mapreduce.input.scan_spec.options Hadoop configuration property. The following options are supported:
- MAX_VERSIONS
- CELL_LIMIT
- KEYS_ONLY
The following example illustrates how to pass options to a Hadoop Streaming MapReduce program.
$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u*.jar \
  -libjars /opt/hypertable/current/lib/java/hypertable-*.jar,/opt/hypertable/current/lib/java/libthrift-*.jar \
  -Dhypertable.mapreduce.namespace=test \
  -Dhypertable.mapreduce.input.table=wikipedia \
  -Dhypertable.mapreduce.output.table=wikipedia2 \
  -Dhypertable.mapreduce.input.scan_spec.options="MAX_VERSIONS 1 KEYS_ONLY CELL_LIMIT 2" \
  -mapper /bin/cat -reducer /bin/cat \
  -inputformat org.hypertable.hadoop.mapred.TextTableInputFormat \
  -outputformat org.hypertable.hadoop.mapred.TextTableOutputFormat \
  -input wikipedia -output wikipedia2
Regular Expression Filtering
Hypertable supports filtering of data using regular expression matching on the row key, column qualifiers, and value. Hypertable uses RE2 for regular expression matching; the complete supported syntax can be found in the RE2 Syntax document.
Example
In this example we'll use a DMOZ dataset which contains the title, description, and a bunch of topic tags for a set of URLs. The domain components of the URL have been reversed so that URLs from the same domain sort together. In the schema, the row key is a URL and the title, description, and topic are column families. Here's a small sample from the dataset:
com.awn.www Title Animation World Network
com.awn.www Description Provides information resources to the international animation community. Features include searchable database archives, monthly magazine, web animation guide, the Animation Village, discussion forums and other useful resources.
com.awn.www Topic:Arts
com.awn.www Topic:Animation
Exit the hypertable shell and download the dataset, which is in .tsv.gz format and can be loaded directly into Hypertable without unzipping:
hypertable> quit
$ wget http://cdn.hypertable.com/pub/dmoz.tsv.gz
Jump back into the hypertable shell and create the dmoz table as follows:
$ /opt/hypertable/current/bin/ht shell
hypertable> USE "/";
hypertable> CREATE TABLE dmoz(Description, Title, Topic, ACCESS GROUP topic(Topic));
hypertable> LOAD DATA INFILE "dmoz.tsv.gz" INTO TABLE dmoz;
Loading 412,265,627 bytes of input data...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
***************************************************
Load complete.
Elapsed time: 242.26 s
Avg value size: 15.09 bytes
Avg key size: 24.76 bytes
Throughput: 6511233.28 bytes/s (1701740.27 bytes/s)
Total cells: 39589037
Throughput: 163414.69 cells/s
Resends: 144786
In the following queries we limit the number of rows returned to 2 for brevity. Suppose you want a subset of the URLs from the domain inria.fr where the first component of the domain doesn't start with the letter 'a'; you could run:
hypertable> SELECT Title FROM dmoz WHERE ROW REGEXP "fr\.inria\.[^a]" LIMIT 2 REVS=1 KEYS_ONLY;
fr.inria.caml
fr.inria.caml/pub/docs/oreilly-book
To look at all topics which start with write (case insensitive):
hypertable> SELECT Topic:/(?i)^write/ FROM dmoz LIMIT 2;
13.141.244.204/writ_den Topic:Writers_Resources
ac.sms.www/clubpage.asp?club=CL001003004000301311 Topic:Writers_Resources
The next example shows how to query for data where the description contains the word game followed by either foosball or halo:
hypertable> SELECT CELLS Description FROM dmoz WHERE VALUE REGEXP "(?i:game.*(foosball|halo)\s)" LIMIT 2 REVS=1;
com.armchairempire.www/Previews/PCGames/planetside.htm Description Preview by Mr. Nash. "So, on the one side the game is sounding pretty snazzy, on the other it sounds sort of like Halo at its core."
com.digitaldestroyers.www Description Video game fans in Spartanburg, South Carolina who like to get together and compete for bragging rights. Also compete with other Halo / Xbox fan clubs.
Atomic Counters
Column families can optionally act as atomic counters by supplying the COUNTER option in the column specification of the CREATE TABLE command. Counter columns are accessed using the same methods as other columns. However, to modify the counter, the value must be formatted specially, as described in the following table.
Value Format Description
['+'] n Increment the counter by n
'-' n Decrement the counter by n
'=' n Reset the counter to n
Example
In this example we create a table of counters called counts that contains a single column family url that acts as an atomic counter for urls. By convention, the row key is the URL with the domain name reversed (so that URLs from the same domain sort next to each other) and the column qualifier is the hour in which the "hit" occurred. The table is created with the following HQL:
hypertable> use "/";
hypertable> create table counts ( url COUNTER );
Let's say we've accumulated url "hit" occurrences in the following .tsv file:
#row column value
org.hypertable.www/ url:2010-10-26_09 +1
org.hypertable.www/ url:2010-10-26_09 +1
org.hypertable.www/download.html url:2010-10-26_09 +1
org.hypertable.www/documentation.html url:2010-10-26_09 +1
org.hypertable.www/download.html url:2010-10-26_09 +1
org.hypertable.www/about.html url:2010-10-26_09 +1
org.hypertable.www/ url:2010-10-26_09 +1
org.hypertable.www/ url:2010-10-26_10 +1
org.hypertable.www/about.html url:2010-10-26_10 +1
org.hypertable.www/ url:2010-10-26_10 +1
org.hypertable.www/download.html url:2010-10-26_10 +1
org.hypertable.www/download.html url:2010-10-26_10 +1
org.hypertable.www/documentation.html url:2010-10-26_10 +1
org.hypertable.www/ url:2010-10-26_10 +1
If we were to load this file with LOAD DATA INFILE into the counts table, a subsequent select would yield the following output:
hypertable> select * from counts;
org.hypertable.www/ url:2010-10-26_09 3
org.hypertable.www/ url:2010-10-26_10 3
org.hypertable.www/about.html url:2010-10-26_09 1
org.hypertable.www/about.html url:2010-10-26_10 1
org.hypertable.www/documentation.html url:2010-10-26_09 1
org.hypertable.www/documentation.html url:2010-10-26_10 1
org.hypertable.www/download.html url:2010-10-26_09 2
org.hypertable.www/download.html url:2010-10-26_10 2
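The '-' and '=' value formats from the table above work the same way. As a hypothetical follow-up (not part of the original load), the following INSERT would decrement the first counter by 1 and reset another to 5, so a subsequent select would show 2 and 5 for those cells:

hypertable> INSERT INTO counts VALUES
("org.hypertable.www/", "url:2010-10-26_09", "-1"),
("org.hypertable.www/download.html", "url:2010-10-26_10", "=5");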
Group Commit
Updates are carried out by the RangeServers through the following steps:
1. Write the update to the commit log (in the DFS)
2. Sync the commit log (in the DFS)
3. Populate in-memory data structure with the update
Under high concurrency, step #2 can become a bottleneck. Distributed filesystems such as HDFS can typically handle a small number of sync operations per second. The Group Commit feature solves this problem by delaying updates, grouping them together, and carrying them out in a batch on some regular interval.
A table can be configured to use group commit by supplying the GROUP_COMMIT_INTERVAL option in the CREATE TABLE statement. The GROUP_COMMIT_INTERVAL option tells the system that updates to this table should be carried out with group commit and also specifies the commit interval in milliseconds. The interval is constrained by the value of the config property Hypertable.RangeServer.CommitInterval, which acts as a lower bound (default is 50ms). The value specified for GROUP_COMMIT_INTERVAL will get rounded up to the nearest multiple of this property value; for example, with the default 50ms lower bound, a value of 120 would be rounded up to 150, while 100 would be used as-is. The following is an example CREATE TABLE statement that creates a table counts set up for group commit operation.
Example
hypertable> CREATE TABLE counts (
url,
domain
) GROUP_COMMIT_INTERVAL=100;