Hypertable - Developer Guide

Date: 2023-02-12 19:07:40

Developer Guide


(http://hypertable.com/documentation/developer_guide/)

 

Table of Contents

  • HQL Tutorial
  • Hadoop MapReduce
  • Hadoop Streaming MapReduce
  • Regular Expression Filtering
  • Atomic Counters
  • Group Commit


 

HQL Tutorial


Introduction


This tutorial shows you how to import a search engine query log into Hypertable, storing the data into tables with different primary keys, and how to issue queries against the tables. You'll need to download the data from http://hypertable.googlecode.com/files/query-log.tsv.gz:

$ mkdir -p hql_tutorial

$ cd hql_tutorial

$ wget http://cdn.hypertable.com/pub/query-log.tsv.gz

The next step is to make sure Hypertable is properly installed (see Installation) and then launch the service. Once you have Hypertable up and running, fire up an interactive session:

$ /opt/hypertable/current/bin/ht shell

 

Welcome to the hypertable command interpreter.

For information about Hypertable, visit http://www.hypertable.org/

 

Type 'help' for a list of commands, or 'help shell' for a

list of shell meta commands.

 

hypertable>

 

Type"help" to display the list of valid HQL commands:

输入”help”得到合法的HQL命令列表:

hypertable> help

 

USE ................Sets the current namespace

CREATE NAMESPACE ...Creates a new namespace

DROP NAMESPACE .....Removes a namespace

EXISTS TABLE .......Check if table exists

CREATE TABLE .......Creates a table

DELETE .............Deletes all or part of a row from a table

DESCRIBE TABLE .....Displays a table's schema

DROP TABLE .........Removes a table

RENAME TABLE .......Renames a table

DUMP TABLE .........Create efficient backup file

ALTER TABLE ........Add/remove column family from existing table

INSERT .............Inserts data into a table

LOAD DATA INFILE ...Loads data from a TSV input file into a table

SELECT .............Selects (and display) cells from a table

SHOW CREATE TABLE ..Displays CREATE TABLE command used to create table

SHOW TABLES ........Displays only the list of tables in the current namespace

GET LISTING ........Displays the list of tables and namespace in the current namespace

 

Statements must be terminated with ';'.  For more information on a specific statement, type 'help <statement>', where <statement> is from the preceding list.

 

USE

First, open the root namespace.  For an explanation of namespaces see Namespaces (http://hypertable.com/documentation/#namespaces).  To open the root namespace issue the following HQL command:

hypertable> use"/";

 

CREATE NAMESPACE

Now create a namespace, Tutorial, within which we'll create our tables:

hypertable> create namespace "Tutorial";

hypertable> use Tutorial;

 

CREATE TABLE

Now that we have created and opened the Tutorial namespace we can create tables within it. In this tutorial we will be loading data into, and querying data from, two separate tables. The first table, QueryLogByUserID, will be indexed by the fields UserID+QueryTime and the second table, QueryLogByTimestamp, will be indexed by the fields QueryTime+UserID. Notice that any identifier that contains non-alphanumeric characters (e.g. '-') must be surrounded by quotes.

hypertable> CREATE TABLE QueryLogByUserID ( Query, ItemRank, ClickURL );

hypertable> CREATE TABLE QueryLogByTimestamp ( Query, ItemRank, ClickURL );

 

See the HQL Documentation: CREATE TABLE for complete syntax.

 

SHOW TABLES

Show all of the tables that exist in the current namespace:


hypertable> show tables;

QueryLogByUserID

QueryLogByTimestamp

 

SHOW CREATE TABLE

Now, issue the SHOW CREATE TABLE command to make sure you got everything right. We didn't have to include the field called 'row' because we'll use that in our LOAD DATA INFILE command later:

hypertable> show create table QueryLogByUserID;

 

CREATE TABLE QueryLogByUserID (

   Query,

   ItemRank,

   ClickURL,

   ACCESS GROUP default (Query, ItemRank, ClickURL)

);

 

And, notice that, by default, a single ACCESS GROUP named "default" is created.  Access groups are a way to physically group columns together on disk.  See the CREATE TABLE documentation for a more detailed description of access groups.

 

LOAD DATA INFILE

Now, let's load some data using the MySQL-like TAB delimited format (TSV). For that, we assume you have the example data in the file query-log.tsv.gz. This file includes an initial header line indicating the format of each line in the file by listing tab delimited column names.  To inspect this file we first quit out of the Hypertable command line interpreter and then use the zcat program (requires gzip package) to display the contents of query-log.tsv.gz:

hypertable> quit

 

$ zcat query-log.tsv.gz

#QueryTime           UserID   Query                                    ItemRank  ClickURL
2008-11-13 00:01:30  2289203  kitchen counter in new orleans           10        http://www.era.com
2008-11-13 00:01:30  2289203  kitchen counter in new orleans           4         http://www.superpages.com
2008-11-13 00:01:30  2289203  kitchen counter in new orleans           5         http://www.superpages.com
2008-11-13 00:01:31  1958633  beads amethyst gemstone                  1         http://www.gemsbiz.com
2008-11-13 00:01:31  3496052  chat
2008-11-13 00:01:33  892003   photo example quarter doubled die coin   5         http://www.coinresource.com
2008-11-13 00:01:33  892003   photo example quarter doubled die coin   5         http://www.coinresource.com
2008-11-13 00:01:35  2251112  radio stations in buffalo                1         http://www.ontheradio.net
2008-11-13 00:01:37  1274922  fafsa renewal                            1         http://www.fafsa.ed.gov
2008-11-13 00:01:37  1274922  fafsa renewal                            1         http://www.fafsa.ed.gov
2008-11-13 00:01:37  441978   find phone numbers                       1         http://www.anywho.com
2008-11-13 00:01:37  441978   find phone numbers                       3         http://www.411.com
...

Now let's load the data file query-log.tsv.gz into the table QueryLogByUserID. The row key is formulated by zero-padding the UserID field out to nine digits and concatenating the QueryTime field. The QueryTime field is used as the internal cell timestamp.  To load the file, first jump back into the Hypertable command line interpreter and then issue the LOAD DATA INFILE command shown below.
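Before jumping back into the shell, you can preview the row key that ROW_KEY_COLUMN="%09UserID"+QueryTime will build for each record with a short pipeline (a sketch only, assuming the column layout shown above; LOAD DATA INFILE does this for you):

$ zcat query-log.tsv.gz | tail -n +2 | head -3 | awk -F'\t' '{ printf "%09d %s\n", $2, $1 }'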

$ /opt/hypertable/current/bin/ht shell

...

hypertable> use Tutorial;

hypertable> LOAD DATA INFILE ROW_KEY_COLUMN="%09UserID"+QueryTime TIMESTAMP_COLUMN=QueryTime "query-log.tsv.gz" INTO TABLE QueryLogByUserID;

 

Loading 7,464,729 bytes of input data...

 

0%  10  20   30   40  50   60   70  80   90   100%

|----|----|----|----|----|----|----|----|----|----|

***************************************************

Load complete.

 

  Elapsed time: 9.84 s

Avg value size: 15.42 bytes

  Avg key size: 29.00 bytes

   Throughput: 4478149.39 bytes/s (764375.74 bytes/s)

  Total cells: 992525

   Throughput: 100822.73 cells/s

      Resends: 0

 

 A quick inspection of the table shows:


hypertable> select * from QueryLogByUserID limit 8;

000000036 2008-11-13 10:30:46  Query     helena ga
000000036 2008-11-13 10:31:34  Query     helena ga
000000036 2008-11-13 10:45:23  Query     checeron s
000000036 2008-11-13 10:46:07  Query     cheveron gas station
000000036 2008-11-13 10:46:34  Query     cheveron gas station richmond virginia
000000036 2008-11-13 10:48:56  Query     cheveron glenside road richmond virginia
000000036 2008-11-13 10:49:05  Query     chevron glenside road richmond virginia
000000036 2008-11-13 10:49:05  ItemRank  1
000000036 2008-11-13 10:49:05  ClickURL  http://yp.yahoo.com
000000053 2008-11-13 15:18:21  Query     mapquest
000000053 2008-11-13 15:18:21  ItemRank  1
000000053 2008-11-13 15:18:21  ClickURL  http://www.mapquest.com

 

  Elapsed time: 0.01 s

Avg value size: 18.08 bytes

  Avg key size: 30.00 bytes

   Throughput: 43501.21 bytes/s

  Total cells: 12

   Throughput: 904.70 cells/s

 

Now let's load the data file query-log.tsv.gz into the table QueryLogByTimestamp. The row key is formulated by concatenating the QueryTime field with the nine digit, zero-padded UserID field. The QueryTime field is used as the internal cell timestamp.

hypertable> LOAD DATA INFILE ROW_KEY_COLUMN=QueryTime+"%09UserID" TIMESTAMP_COLUMN=QueryTime "query-log.tsv.gz" INTO TABLE QueryLogByTimestamp;

 

Loading 7,464,729 bytes of input data...

 

0%  10  20   30   40  50   60   70  80   90   100%

|----|----|----|----|----|----|----|----|----|----|

***************************************************

Load complete.

 

  Elapsed time: 10.18 s

Avg value size: 15.42 bytes

  Avg key size: 29.00 bytes

   Throughput: 4330913.20 bytes/s (739243.98 bytes/s)

  Total cells: 992525

   Throughput: 97507.80 cells/s

      Resends: 0

 And a quick inspection of the table shows:


hypertable> select * from QueryLogByTimestamp limit 4;

2008-11-13 00:01:30 002289203  Query     kitchen counter in new orleans
2008-11-13 00:01:30 002289203  ItemRank  5
2008-11-13 00:01:30 002289203  ClickURL  http://www.superpages.com
2008-11-13 00:01:31 001958633  Query     beads amethyst gemstone
2008-11-13 00:01:31 001958633  ItemRank  1
2008-11-13 00:01:31 001958633  ClickURL  http://www.gemsbiz.com
2008-11-13 00:01:31 003496052  Query     chat
2008-11-13 00:01:33 000892003  Query     photo example quarter doubled die coin
2008-11-13 00:01:33 000892003  ItemRank  5
2008-11-13 00:01:33 000892003  ClickURL  http://www.coinresource.com

 

  Elapsed time: 0.00 s

Avg value size: 18.11 bytes

  Avg key size: 30.00 bytes

   Throughput: 287150.49 bytes/s

  Total cells: 19

   Throughput: 5969.21 cells/s

See the HQL Documentation: LOAD DATA INFILE for complete syntax. 


 

SELECT

Let's start by examining the QueryLogByUserID table. To select all of the data for user ID 003269359 we need to use the starts with operator =^. Remember that the row key is the concatenation of the user ID and the timestamp, which is why we need to use the starts with operator.

 

hypertable> select * from QueryLogByUserID where row =^ '003269359';

003269359 2008-11-13 04:36:34  Query     binibining pilipinas 2008 winners
003269359 2008-11-13 04:36:34  ItemRank  5
003269359 2008-11-13 04:36:34  ClickURL  http://www.missosology.org
003269359 2008-11-13 04:37:34  Query     pawee's kiss and tell
003269359 2008-11-13 04:37:34  ItemRank  3
003269359 2008-11-13 04:37:34  ClickURL  http://www.missosology.org
003269359 2008-11-13 05:07:10  Query     rn jobs in 91405
003269359 2008-11-13 05:07:10  ItemRank  9
003269359 2008-11-13 05:07:10  ClickURL  http://91405.jobs.com
003269359 2008-11-13 05:20:22  Query     rn jobs in 91405
...
003269359 2008-11-13 09:42:49  Query     wound ostomy rn training
003269359 2008-11-13 09:42:49  ItemRank  11
003269359 2008-11-13 09:42:49  ClickURL  http://www.wocn.org
003269359 2008-11-13 09:46:50  Query     pych nurse in encino tarzana hospital
003269359 2008-11-13 09:47:18  Query     encino tarzana hospital
003269359 2008-11-13 09:47:18  ItemRank  2
003269359 2008-11-13 09:47:18  ClickURL  http://www.encino-tarzana.com
003269359 2008-11-13 09:52:42  Query     encino tarzana hospital
003269359 2008-11-13 09:53:08  Query     alhambra hospital
003269359 2008-11-13 09:53:08  ItemRank  1
003269359 2008-11-13 09:53:08  ClickURL  http://www.alhambrahospital.com

 

  Elapsed time: 0.01 s

Avg value size: 19.24 bytes

  Avg key size: 30.00 bytes

   Throughput: 2001847.79 bytes/s

  Total cells: 352

   Throughput: 40651.35 cells/s

 

The result set was fairly large (352 cells), so let's now try selecting just the queries that were issued by the user with ID 003269359 during the hour of 5am. To do this we need to add a TIMESTAMP predicate. Each cell has an internal timestamp and the TIMESTAMP predicate can be used to filter the results based on this timestamp.

 

hypertable> select * from QueryLogByUserID where row =^ '003269359' AND "2008-11-13 05:00:00" <= TIMESTAMP < "2008-11-13 06:00:00";

003269359 2008-11-13 05:07:10  Query     rn jobs in 91405
003269359 2008-11-13 05:07:10  ItemRank  9
003269359 2008-11-13 05:07:10  ClickURL  http://91405.jobs.com
003269359 2008-11-13 05:20:22  Query     rn jobs in 91405
003269359 2008-11-13 05:20:22  ItemRank  16
003269359 2008-11-13 05:20:22  ClickURL  http://www.careerbuilder.com
003269359 2008-11-13 05:34:02  Query     usc university hospital
003269359 2008-11-13 05:34:02  ItemRank  1
003269359 2008-11-13 05:34:02  ClickURL  http://www.uscuh.com
003269359 2008-11-13 05:37:01  Query     rn jobs in san fernando valley
003269359 2008-11-13 05:37:01  ItemRank  7
003269359 2008-11-13 05:37:01  ClickURL  http://www.medhunters.com
003269359 2008-11-13 05:46:22  Query     northridge hospital
003269359 2008-11-13 05:46:22  ItemRank  2
003269359 2008-11-13 05:46:22  ClickURL  http://northridgehospital.org
003269359 2008-11-13 05:53:34  Query     valley presbyterian hospital
003269359 2008-11-13 05:53:34  ItemRank  4
003269359 2008-11-13 05:53:34  ClickURL  http://www.hospital-data.com
003269359 2008-11-13 05:55:36  Query     valley presbyterian hospital website
003269359 2008-11-13 05:55:36  ItemRank  1
003269359 2008-11-13 05:55:36  ClickURL  http://www.valleypres.org
003269359 2008-11-13 05:59:24  Query     mission community hospital
003269359 2008-11-13 05:59:24  ItemRank  1
003269359 2008-11-13 05:59:24  ClickURL  http://www.mchonline.org

 

  Elapsed time: 0.00 s

Avg value size: 18.50 bytes

  Avg key size: 30.00 bytes

   Throughput: 2602086.44 bytes/s

  Total cells: 36

   Throughput: 53651.27 cells/s

Keep in mind that the internal cell timestamp is different than the one embedded in the row key. In this example, they both represent the same time. By specifying the TIMESTAMP_COLUMN option to LOAD DATA INFILE, we extracted the QueryTime field to be used as the internal cell timestamp. If we hadn't supplied that option, the system would have auto-assigned a timestamp. To display the internal cell timestamp, add the DISPLAY_TIMESTAMPS option:

hypertable> select * from QueryLogByUserID limit 5 DISPLAY_TIMESTAMPS;

2008-11-13 10:30:46.000000000  000000036 2008-11-13 10:30:46  Query  helena ga
2008-11-13 10:31:34.000000000  000000036 2008-11-13 10:31:34  Query  helena ga
2008-11-13 10:45:23.000000000  000000036 2008-11-13 10:45:23  Query  checeron s
2008-11-13 10:46:07.000000000  000000036 2008-11-13 10:46:07  Query  cheveron gas station
2008-11-13 10:46:34.000000000  000000036 2008-11-13 10:46:34  Query  cheveron gas station richmond virginia

 

  Elapsed time: 0.00 s

Avg value size: 17.20 bytes

  Avg key size: 30.00 bytes

   Throughput: 207563.76 bytes/s

  Total cells: 5

   Throughput: 4397.54 cells/s

There is no index for the internal cell timestamps, so if we don't include a row =^ expression in our predicate, the system will do a full table scan. This is why we imported the data into a second table QueryLogByTimestamp. This table includes the timestamp as the row key prefix which allows us to efficiently query data over a time interval.

The following query selects all query log data for November 14th, 2008:


hypertable> select * from QueryLogByTimestamp WHERE ROW =^ '2008-11-14';

2008-11-14 00:00:00 001040178  Query     noodle tools
2008-11-14 00:00:00 001040178  ItemRank  1
2008-11-14 00:00:00 001040178  ClickURL  http://www.noodletools.com
2008-11-14 00:00:01 000264655  Query     games.myspace.com
2008-11-14 00:00:01 000264655  ItemRank  1
2008-11-14 00:00:01 000264655  ClickURL  http://games.myspace.com
2008-11-14 00:00:01 000527424  Query     franklinville schools new jersey
2008-11-14 00:00:01 000527424  ItemRank  1
2008-11-14 00:00:01 000527424  ClickURL  http://www.greatschools.net
2008-11-14 00:00:01 000632400  Query     lack of eye contact symptom of...
...
2008-11-14 06:02:33 003676354  Query     baby 20showers
2008-11-14 06:02:35 003378030  Query     task and responsibility matrix
2008-11-14 06:02:35 003378030  ItemRank  2
2008-11-14 06:02:35 003378030  ClickURL  http://im.ncsu.edu
2008-11-14 06:02:36 004578101  Query     jcpenneys
2008-11-14 06:02:37 005120734  Query     ebay
2008-11-14 06:02:40 000957500  Query     buccal fat size of ping pong ball

 

  Elapsed time: 2.37 s

Avg value size: 15.36 bytes

  Avg key size: 30.00 bytes

   Throughput: 1709616.45 bytes/s

  Total cells: 89412

   Throughput: 37689.18 cells/s

 

And to select all query log data for November 14th, 2008 during the hour of 3 am:


hypertable> select * from QueryLogByTimestamp WHERE ROW =^ '2008-11-14 03';

2008-11-14 03:00:00 002512415  Query     ny times
2008-11-14 03:00:00 002512415  ItemRank  1
2008-11-14 03:00:00 002512415  ClickURL  http://www.nytimes.com
2008-11-14 03:00:00 005294906  Query     kickmeto.fosi
2008-11-14 03:00:00 005459226  Query     http://www.dickdyertoyota.com
2008-11-14 03:00:02 000637292  Query     days of our lives
2008-11-14 03:00:02 000637292  ItemRank  3
2008-11-14 03:00:02 000637292  ClickURL  http://www.nbc.com
2008-11-14 03:00:03 002675105  Query     ghetto superstar lyrics
...
2008-11-14 03:59:52 002874080  ClickURL  http://www.paintball-discounters.com
2008-11-14 03:59:53 004292772  Query     drop down menu
2008-11-14 03:59:55 005656539  Query     to buy indian hair to make wigs in new york
2008-11-14 03:59:55 005656539  ItemRank  1
2008-11-14 03:59:55 005656539  ClickURL  http://query.nytimes.com
2008-11-14 03:59:58 004318586  Query     myspace .com

 

  Elapsed time: 0.17 s

Avg value size: 15.37 bytes

  Avg key size: 30.00 bytes

   Throughput: 2267099.06 bytes/s

  Total cells: 8305

   Throughput: 49967.51 cells/s

 

And finally, to select all query log data for November 14th, 2008 during the minute of 3:45 am:


hypertable> select * from QueryLogByTimestamp WHERE ROW =^ '2008-11-14 03:45';

2008-11-14 03:45:00 003895650  Query     ks lottery.
2008-11-14 03:45:00 003895650  ItemRank  2
2008-11-14 03:45:00 003895650  ClickURL  http://www.lotterypost.com
2008-11-14 03:45:00 005036796  Query     http://www.glasgowdailytimes 10-20-2005
2008-11-14 03:45:01 002863052  Query     map quest
2008-11-14 03:45:01 005514285  Query     john bermeo
2008-11-14 03:45:02 002394176  Query     http://www.eggseye.com
2008-11-14 03:45:02 003454227  Query     hawaiian weddig band
2008-11-14 03:45:03 001006089  Query     brokers hiring loan officers in indiana
2008-11-14 03:45:06 000844720  Query     latest design microsoft freeware
...
2008-11-14 03:45:55 003920469  ItemRank  3
2008-11-14 03:45:55 003920469  ClickURL  http://www.pennyblood.com
2008-11-14 03:45:56 002729906  Query     tryaold
2008-11-14 03:45:56 003919348  Query     feathered draped fox fur mandalas
2008-11-14 03:45:56 003919348  ItemRank  8
2008-11-14 03:45:56 003919348  ClickURL  http://www.greatdreams.com
2008-11-14 03:45:56 004803968  Query     -

 

  Elapsed time: 0.02 s

Avg value size: 15.71 bytes

  Avg key size: 30.00 bytes

   Throughput: 305030.80 bytes/s

  Total cells: 130

   Throughput: 6673.51 cells/s

 

See the HQL Documentation: SELECT for complete syntax.


 

ALTER TABLE

The ALTER TABLE command can be used to add and/or remove columns from a table. The following command will add a 'Notes' column in a new access group called 'extra' and will drop column 'ItemRank'.

hypertable> ALTER TABLE QueryLogByUserID ADD(Notes, ACCESS GROUP extra(Notes)) DROP(ItemRank);

 

To verify the change, issue the SHOW CREATE TABLE command:


hypertable> show create table QueryLogByUserID;

 

CREATE TABLE QueryLogByUserID (

  Query,

  ClickURL,

  Notes,

  ACCESS GROUP default (Query, ClickURL),

  ACCESS GROUP extra (Notes)

)

And to verify that the column no longer exists, issue the same SELECT statement we issued above (NOTE: the data for the column still exists in the file system, it will get lazily garbage collected).

 

hypertable> select * from QueryLogByUserID limit 8;

000000036 2008-11-13 10:30:46  Query     helena ga
000000036 2008-11-13 10:31:34  Query     helena ga
000000036 2008-11-13 10:45:23  Query     checeron s
000000036 2008-11-13 10:46:07  Query     cheveron gas station
000000036 2008-11-13 10:46:34  Query     cheveron gas station richmond virginia
000000036 2008-11-13 10:48:56  Query     cheveron glenside road richmond virginia
000000036 2008-11-13 10:49:05  Query     chevron glenside road richmond virginia
000000036 2008-11-13 10:49:05  ClickURL  http://yp.yahoo.com
000000053 2008-11-13 15:18:21  Query     mapquest
000000053 2008-11-13 15:18:21  ClickURL  http://www.mapquest.com

 

  Elapsed time: 0.00 s

Avg value size: 21.50 bytes

  Avg key size: 30.00 bytes

   Throughput: 140595.14 bytes/s

  Total cells: 10

   Throughput: 2730.00 cells/s

 

See HQL Documentation: ALTER TABLE for complete syntax.

 

INSERT & DELETE

Now let's augment the QueryLogByUserID table by adding some information in the Notes column for a few of the queries:

hypertable> INSERT INTO QueryLogByUserID VALUES

("000019058 2008-11-13 07:24:43", "Notes", "animals"),

("000019058 2008-11-13 07:57:16", "Notes", "food"),

("000019058 2008-11-13 07:59:36", "Notes", "gardening");

 

  Elapsed time: 0.01 s

Avg value size: 6.67 bytes

  Total cells: 3

   Throughput: 298.36 cells/s

      Resends: 0

Notice the new data by querying the affected row:

hypertable> select * from QueryLogByUserID where row =^ '000019058';

000019058 2008-11-13 07:24:43  Query     tigers
000019058 2008-11-13 07:24:43  Notes     animals
000019058 2008-11-13 07:57:16  Query     bell peppers
000019058 2008-11-13 07:57:16  Notes     food
000019058 2008-11-13 07:58:24  Query     bell peppers
000019058 2008-11-13 07:58:24  ClickURL  http://agalternatives.aers.psu.edu
000019058 2008-11-13 07:59:36  Query     growing bell peppers
000019058 2008-11-13 07:59:36  Query     growing bell peppers
000019058 2008-11-13 07:59:36  ClickURL  http://www.farm-garden.com
000019058 2008-11-13 07:59:36  ClickURL  http://www.organicgardentips.com
000019058 2008-11-13 07:59:36  Notes     gardening
000019058 2008-11-13 12:31:02  Query     tracfone
000019058 2008-11-13 12:31:02  ClickURL  http://www.tracfone.com

 

  Elapsed time: 0.00 s

Avg value size: 16.38 bytes

  Avg key size: 30.00 bytes

   Throughput: 162271.26 bytes/s

  Total cells: 13

   Throughput: 3498.39 cells/s

Now try deleting one of the notes we just added:

hypertable> delete Notes from QueryLogByUserID where ROW = "000019058 2008-11-13 07:24:43";

 

  Elapsed time: 0.00 s

  Total cells: 1

   Throughput: 256.41 cells/s

      Resends: 0

And verify that the cell was, indeed, deleted:


hypertable> select * from QueryLogByUserID where row =^ '000019058';

000019058 2008-11-13 07:24:43  Query     tigers
000019058 2008-11-13 07:57:16  Query     bell peppers
000019058 2008-11-13 07:57:16  Notes     food
000019058 2008-11-13 07:58:24  Query     bell peppers
000019058 2008-11-13 07:58:24  ClickURL  http://agalternatives.aers.psu.edu
000019058 2008-11-13 07:59:36  Query     growing bell peppers
000019058 2008-11-13 07:59:36  Query     growing bell peppers
000019058 2008-11-13 07:59:36  ClickURL  http://www.farm-garden.com
000019058 2008-11-13 07:59:36  ClickURL  http://www.organicgardentips.com
000019058 2008-11-13 07:59:36  Notes     gardening
000019058 2008-11-13 12:31:02  Query     tracfone
000019058 2008-11-13 12:31:02  ClickURL  http://www.tracfone.com

 

  Elapsed time: 0.00 s

Avg value size: 16.38 bytes

  Avg key size: 30.00 bytes

   Throughput: 162271.26 bytes/s

  Total cells: 12

   Throughput: 3498.39 cells/s

See the HQL Documentation: INSERT and the HQL Documentation: DELETE for complete syntax.

 

DROP TABLE

The DROP TABLE command is used to remove tables from the system. The IF EXISTS option prevents the system from throwing an error if the table does not exist:

hypertable> drop table IF EXISTS foo;

 

Let's remove one of the example tables:


hypertable> drop table QueryLogByUserID;

hypertable> show tables;

QueryLogByTimestamp

Then let's remove the other:


hypertable> drop table QueryLogByTimestamp;

hypertable> show tables;

 

GET LISTING & DROP NAMESPACE


Now, we want to get rid of the Tutorial namespace and verify that we have:

hypertable> use"/";

 

hypertable> get listing;

Tutorial       (namespace)

sys    (namespace)

 

hypertable> drop namespace Tutorial;

 

hypertable> get listing;

sys    (namespace)

 

The sys namespace is used by the Hypertable system and should not be used to contain user tables.

Note that a namespace must be empty (i.e. must not contain any sub-namespaces or tables) before you can drop it. In this case since we had already dropped the QueryLogByUserID and QueryLogByTimestamp tables, we could go ahead and drop the Tutorial namespace.

 

Hadoop MapReduce

In order to run this example, Hadoop needs to be installed and HDFS and the MapReduce framework need to be up and running.  Hypertable builds against Cloudera's CDH3 distribution of Hadoop. See CDH3 Installation (https://ccp.cloudera.com/display/CDHDOC/CDH3+Installation) for instructions on how to get Hadoop up and running.

Hypertable ships with a jar file, hypertable-x.x.x.x.jar (where x.x.x.x is the hypertable release level, e.g., 0.9.5.5), that contains Hadoop InputFormat and OutputFormat classes that allow MapReduce programs to directly read from and write to tables in Hypertable.  In this section, we walk you through an example MapReduce program, WikipediaWordCount, that tokenizes articles in a table called wikipedia that has been loaded with a Wikipedia dump.  It reads the article column, tokenizes it, and populates the word column of the same table. Each unique word in the article turns into a qualified column and the value is the number of times the word appears in the article.
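To make that transformation concrete, here is a rough shell equivalent of what the job does to a single article cell (an illustration only, with a made-up article string; the real logic is the Java class linked further below and handles tokenization differently):

# Toy example: turn one (row, article) pair into (row, word:<token>, count) cells
row='Addington, Buckinghamshire'
article='Addington is a small village near Addington Park'
echo "$article" | tr -cs '[:alnum:]' '\n' | sort | uniq -c | \
  awk -v row="$row" '{ printf "%s\tword:%s\t%d\n", row, $2, $1 }'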

 

Setup

First, exit the Hypertable command line interpreter and download the Wikipedia dump, for example:

$ wget http://cdn.hypertable.com/pub/wikipedia.tsv.gz

Next, jump back into the Hypertable command line interpreter and create the wikipedia table by executing the HQL commands shown below.

CREATE NAMESPACE test;

USE test;

DROP TABLE IF EXISTS wikipedia;

CREATE TABLE wikipedia (

      title,

      id,

      username,

      article,

      word

);

 

Now load the compressed Wikipedia dump file directly into the wikipedia table by issuing the following HQL command:

hypertable> LOADDATA INFILE "wikipedia.tsv.gz" INTO TABLE wikipedia;

 

Loading 638,058,135 bytes of input data...

 

0%  10  20   30   40  50   60   70  80   90   100%

|----|----|----|----|----|----|----|----|----|----|

***************************************************

Load complete.

 

  Elapsed time: 78.28 s

Avg value size: 1709.59 bytes

  Avg key size: 24.39 bytes

   Throughput: 25226728.63 bytes/s (8151017.58 bytes/s)

  Total cells: 1138847

   Throughput: 14548.46 cells/s

      Resends: 8328

 

Example

In this example, we'll be running the WikipediaWordCount program, which is included in the hypertable-X.X.X.X-examples.jar file shipped with the binary package installation. The following is a link to the source code for this program.

WikipediaWordCount.java (https://github.com/hypertable/hypertable/blob/master/examples/java/org/hypertable/examples/hadoop/mapreduce/WikipediaWordCount.java)

 

To get an idea of what the data looks like, try the following select:

hypertable> select * from wikipedia where row =^ "Addington";

Addington, Buckinghamshire     title     Addington, Buckinghamshire
Addington, Buckinghamshire     id        377201
Addington, Buckinghamshire     username  Roleplayer
Addington, Buckinghamshire     article   {{infobox UK place \n|country = England\n|latitude=51.95095\n|longitude=-0.92177\n|official_name= Addington\n|population = 145 ...

 

Now exit from the Hypertable command line interpreter and run the WikipediaWordCount MapReduce program:

hypertable> quit

 

$ hadoop jar /opt/hypertable/current/lib/java/hypertable-*-examples.jar \
    org.hypertable.examples.WikipediaWordCount \
    --namespace=test --columns=article

 

To verify that it worked, jump back into the Hypertable command line interpreter and try selecting for the word column:

$ /opt/hypertable/current/bin/ht shell

 

hypertable> select word from wikipedia where row =^ "Addington";

...

Addington, Buckinghamshire     word:A          1
Addington, Buckinghamshire     word:Abbey      1
Addington, Buckinghamshire     word:Abbotts    1
Addington, Buckinghamshire     word:According  1
Addington, Buckinghamshire     word:Addington  6
Addington, Buckinghamshire     word:Adstock    1
Addington, Buckinghamshire     word:Aston      1
Addington, Buckinghamshire     word:Aylesbury  3
Addington, Buckinghamshire     word:BUCKINGHAM 1
Addington, Buckinghamshire     word:Bayeux     2
Addington, Buckinghamshire     word:Bene       1
Addington, Buckinghamshire     word:Bishop     1
...


Hadoop Streaming MapReduce

In order to run this example, Hadoop needs to be installed and HDFS and the MapReduce framework need to be up and running.  Hypertable builds against Cloudera's CDH3 distribution of Hadoop. See CDH3 Installation (https://ccp.cloudera.com/display/CDHDOC/CDH3+Installation) for instructions on how to get Hadoop up and running.

In this example, we'll be running a Hadoop Streaming MapReduce job that uses a Bash script as the mapper and a Bash script as the reducer.  Like the example in the previous section, the programs operate on a table called wikipedia that has been loaded with a Wikipedia dump.

Setup

First, exit the Hypertable command line interpreter and download the Wikipedia dump, for example:

$ wget http://cdn.hypertable.com/pub/wikipedia.tsv.gz

Next, jump back into the Hypertable command line interpreter and create the wikipedia table by executing the HQL commands shown below.

CREATE NAMESPACE test;

USE test;

DROP TABLE IF EXISTS wikipedia;

CREATE TABLE wikipedia (

      title,

      id,

      username,

      article,

      word

);

Now load the compressed Wikipedia dump file directly into the wikipedia table by issuing the following HQL command:

hypertable> LOAD DATA INFILE "wikipedia.tsv.gz" INTO TABLE wikipedia;

 

Loading 638,058,135 bytes of input data...

 

0%  10  20   30   40  50   60   70  80   90   100%

|----|----|----|----|----|----|----|----|----|----|

***************************************************

Load complete.

 

  Elapsed time: 78.28 s

Avg value size: 1709.59 bytes

  Avg key size: 24.39 bytes

   Throughput: 25226728.63 bytes/s (8151017.58 bytes/s)

  Total cells: 1138847

   Throughput: 14548.46 cells/s

      Resends: 8328

 

The mapper script (tokenize-article.sh) and the reducer script (reduce-word-counts.sh) are shown below.

 

Example

The following script, tokenize-article.sh, will be used as the mapper script.

#!/usr/bin/env bash

# Input fields (row, column, value) are tab-separated
IFS=$'\t'
read name column article

while [ $? == 0 ] ; do

  if [ "$column" == "article" ] ; then

    # Strip punctuation
    stripped_article=`echo $article | awk 'BEGIN { FS="\t" } { print $NF }' | tr "\!\"#\$&'()*+,-./:;<=>?@[\\\\]^_\`{|}~" " " | tr -s " "` ;

    # Split article into words
    echo $stripped_article | awk -v name="$name" 'BEGIN { article=name; FS=" "; } { for (i=1; i<=NF; i++) printf "%s\tword:%s\t1\n", article, $i; }' ;

  fi

  # Read another line
  read name column article

done
exit 0

 

The following script, reduce-word-counts.sh, will be used as the reducer script.

#!/usr/bin/env bash

last_article=
last_word=
let total=0

# Input fields (row, column, count) are tab-separated
IFS=$'\t'
read article word count

while [ $? == 0 ] ; do
    if [ "$article" == "$last_article" ] && [ "$word" == "$last_word" ] ; then
        let total=$count+total
    else
        if [ "$last_word" != "" ] ; then
            # Emit row, column and accumulated count, tab-separated
            echo -e "$last_article\t$last_word\t$total"
        fi
        let total=$count
        last_word=$word
        last_article=$article
    fi
    read article word count
done

if [ $total -gt 0 ] ; then
    echo -e "$last_article\t$last_word\t$total"
fi
exit 0

 

To populate the word column of the wikipedia table by tokenizing the article column using the above mapper and reducer scripts, issue the following command:

hypertable> quit

 

$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u*.jar \
-libjars /opt/hypertable/current/lib/java/hypertable-*.jar,/opt/hypertable/current/lib/java/libthrift-*.jar \
-Dhypertable.mapreduce.namespace=test \
-Dhypertable.mapreduce.input.table=wikipedia \
-Dhypertable.mapreduce.output.table=wikipedia \
-mapper /home/doug/tokenize-article.sh \
-combiner /home/doug/reduce-word-counts.sh \
-reducer /home/doug/reduce-word-counts.sh \
-file /home/doug/tokenize-article.sh \
-file /home/doug/reduce-word-counts.sh \
-inputformat org.hypertable.hadoop.mapred.TextTableInputFormat \
-outputformat org.hypertable.hadoop.mapred.TextTableOutputFormat \
-input wikipedia -output wikipedia

 

Input/Output Configuration Properties

The following table lists the job configuration properties that are used to specify, among other things, the input table, output table, and scan specification. These properties can be supplied to a streaming MapReduce job with -Dproperty=value arguments.

hypertable.mapreduce.namespace
  Description: Namespace for both input and output table
  Example: /test

hypertable.mapreduce.input.namespace
  Description: Namespace for input table
  Example: /test/input

hypertable.mapreduce.input.table
  Description: Input table name
  Example: wikipedia

hypertable.mapreduce.input.scan_spec.columns
  Description: Comma separated list of input columns
  Example: id,title

hypertable.mapreduce.input.scan_spec.options
  Description: Input WHERE clause options
  Example: MAX_VERSIONS 1 KEYS_ONLY

hypertable.mapreduce.input.scan_spec.row_interval
  Description: Input row interval
  Example: Dog <= ROW < Kitchen

hypertable.mapreduce.input.scan_spec.timestamp_interval
  Description: Timestamp filter
  Example: TIMESTAMP >= 2011-11-21

hypertable.mapreduce.input.include_timestamps
  Description: Emit integer timestamp as the 1st field (nanoseconds since epoch)
  Example: TRUE

hypertable.mapreduce.output.namespace
  Description: Namespace containing output table
  Example: /test/output

hypertable.mapreduce.output.table
  Description: Output table name
  Example: wikipedia

hypertable.mapreduce.output.mutator_flags
  Description: flags parameter passed to mutator constructor (1 = NO_LOG_SYNC)
  Example: 1

 

 Column Selection

To run a MapReduce job over a subset of columns from the input table, specify a comma separated list of columns in the hypertable.mapreduce.input.scan_spec.columns Hadoop configuration property.  For example,

$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u*.jar \
-libjars /opt/hypertable/current/lib/java/hypertable-*.jar,/opt/hypertable/current/lib/java/libthrift-*.jar \
-Dhypertable.mapreduce.namespace=test \
-Dhypertable.mapreduce.input.table=wikipedia \
-Dhypertable.mapreduce.input.scan_spec.columns="id,title" \
-mapper /bin/cat -reducer /bin/cat \
-inputformat org.hypertable.hadoop.mapred.TextTableInputFormat \
-input wikipedia -output wikipedia2

 

Timestamps

To filter the input table with a timestamp predicate, specify the timestamp predicate in the hypertable.mapreduce.input.scan_spec.timestamp_interval Hadoop configuration property. The timestamp predicate is specified using the same format as the timestamp predicate in the WHERE clause of the SELECT statement, as illustrated in the following examples:

  • TIMESTAMP < 2010-08-03 12:30:00
  • TIMESTAMP >= 2010-08-03 12:30:00
  • 2010-08-01 <= TIMESTAMP <= 2010-08-09

To preserve the timestamps from the input table, set the hypertable.mapreduce.input.include_timestamps Hadoop configuration property to true. This will cause the TextTableInputFormat class to produce an additional field (field 0) that represents the timestamp as nanoseconds since the epoch. The following example illustrates how to pass a timestamp predicate into a Hadoop Streaming MapReduce program.
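Field 0 produced this way is a plain integer count of nanoseconds since the epoch. If you want to eyeball such values while debugging a streaming job, a snippet along these lines works (a sketch only; assumes GNU date is available):

# Read tab-separated records and print the leading nanosecond timestamp as UTC
while IFS=$'\t' read -r ns rest; do
  echo "$ns -> $(date -u -d "@$(( ns / 1000000000 ))" '+%Y-%m-%d %H:%M:%S')"
done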

 

$ hadoop jar/usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u*.jar \

-libjars/opt/hypertable/current/lib/java/hypertable-*.jar,/opt/hypertable/current/lib/java/libthrift-*.jar\

-Dhypertable.mapreduce.namespace=test\

-Dhypertable.mapreduce.input.table=wikipedia\

-Dhypertable.mapreduce.output.table=wikipedia2\

-Dhypertable.mapreduce.input.scan_spec.columns="id,title"\

-Dhypertable.mapreduce.input.scan_spec.timestamp_interval="2010-08-01<= TIMESTAMP <= 2010-08-09" \

-Dhypertable.mapreduce.input.include_timestamps=true\

-mapper /bin/cat-reducer /bin/cat \

-inputformatorg.hypertable.hadoop.mapred.TextTableInputFormat \

-outputformatorg.hypertable.hadoop.mapred.TextTableOutputFormat \

-input wikipedia-output wikipedia2

 

Row Intervals

To restrict the MapReduce to a specific row interval of the input table, a row range can be specified with the hypertable.mapreduce.input.scan_spec.row_interval Hadoop configuration property. The row interval predicate is specified using the same format as the row interval in the WHERE clause of the SELECT statement, as illustrated in the following examples:

  • ROW < foo
  • ROW >= bar
  • bar <= ROW <= 'foo;'

The following example illustrates how a row interval is passed into a Hadoop Streaming MapReduce program.

$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u*.jar \
-libjars /opt/hypertable/current/lib/java/hypertable-*.jar,/opt/hypertable/current/lib/java/libthrift-*.jar \
-Dhypertable.mapreduce.namespace=test \
-Dhypertable.mapreduce.input.table=wikipedia \
-Dhypertable.mapreduce.output.table=wikipedia2 \
-Dhypertable.mapreduce.input.scan_spec.columns="id,title" \
-Dhypertable.mapreduce.input.scan_spec.row_interval="Dog <= ROW <= Kitchen" \
-mapper /bin/cat -reducer /bin/cat \
-inputformat org.hypertable.hadoop.mapred.TextTableInputFormat \
-outputformat org.hypertable.hadoop.mapred.TextTableOutputFormat \
-input wikipedia -output wikipedia2

 

Options

A subset of the WHERE clause options of the HQL SELECT statement can be specified by supplying the options with the hypertable.mapreduce.input.scan_spec.options Hadoop configuration property. The following options are supported:

  • MAX_VERSIONS
  • CELL_LIMIT
  • KEYS_ONLY

The following example illustrates how to pass options to a Hadoop Streaming MapReduce program.

$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-0.20.2-cdh3u*.jar \
-libjars /opt/hypertable/current/lib/java/hypertable-*.jar,/opt/hypertable/current/lib/java/libthrift-*.jar \
-Dhypertable.mapreduce.namespace=test \
-Dhypertable.mapreduce.input.table=wikipedia \
-Dhypertable.mapreduce.output.table=wikipedia2 \
-Dhypertable.mapreduce.input.scan_spec.options="MAX_VERSIONS 1 KEYS_ONLY CELL_LIMIT 2" \
-mapper /bin/cat -reducer /bin/cat \
-inputformat org.hypertable.hadoop.mapred.TextTableInputFormat \
-outputformat org.hypertable.hadoop.mapred.TextTableOutputFormat \
-input wikipedia -output wikipedia2


Regular Expression Filtering

Hypertable supports filtering of data using regular expression matching on the row key, column qualifiers and value. Hypertable uses RE2 for regular expression matching; the complete supported syntax can be found in the RE2 Syntax document.
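Because these are ordinary regular expressions, a pattern can be sanity-checked against a few sample keys with any regex tool before issuing the query (a rough check only; RE2 and GNU grep -E differ on some constructs, and the sample keys below are made up for illustration):

# Test the row-key pattern used later in this section against sample keys
$ printf '%s\n' fr.inria.caml fr.inria.alice fr.inria.www | grep -E 'fr\.inria\.[^a]'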

 

Example

In this example we'll use a DMOZ dataset which contains title, description and a bunch of topic tags for a set of URLs. The domain components of the URL have been reversed so that URLs from the same domain sort together. In the schema, the row key is a URL and the title, description and topic are column families. Here's a small sample from the dataset:

com.awn.www    Title        Animation World Network
com.awn.www    Description  Provides information resources to the international animation community. Features include searchable database archives, monthly magazine, web animation guide, the Animation Village, discussion forums and other useful resources.
com.awn.www    Topic:Arts
com.awn.www    Topic:Animation
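The reversed-domain row keys above can be produced from an ordinary host name with a small shell one-liner like the following (an illustration only, not part of the dataset tooling):

# Reverse the components of a domain name, e.g. www.awn.com -> com.awn.www
$ echo "www.awn.com" | awk -F'.' '{ for (i = NF; i >= 1; i--) printf "%s%s", $i, (i > 1 ? "." : "\n") }'
com.awn.www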

Exit the hypertable shell and download the dataset, which is in .tsv.gz format and can be loaded directly into Hypertable without unzipping:

hypertable> quit

 

$ wget http://cdn.hypertable.com/pub/dmoz.tsv.gz

 

Jump back into the hypertable shell and create the dmoz table as follows:


/opt/hypertable/current/bin/ht shell

 

hypertable> USE"/";

hypertable> CREATETABLE dmoz(Description, Title, Topic, ACCESS GROUP topic(Topic));

hypertable> LOADDATA INFILE "dmoz.tsv.gz" INTO TABLE dmoz;

 

Loading 412,265,627 bytes of input data...

 

0%  10  20   30   40  50   60   70  80   90   100%

|----|----|----|----|----|----|----|----|----|----|

***************************************************

Load complete.

 

  Elapsed time: 242.26 s

Avg value size: 15.09 bytes

  Avg key size: 24.76 bytes

   Throughput: 6511233.28 bytes/s (1701740.27 bytes/s)

  Total cells: 39589037

   Throughput: 163414.69 cells/s

      Resends: 144786

 

In the following queries we limit the number of rows returned to 2 for brevity. Suppose you want a subset of the URLs from the domain inria.fr where the first component of the domain doesn't start with the letter 'a'; you could run:

hypertable> SELECT Title FROM dmoz WHERE ROW REGEXP "fr\.inria\.[^a]" LIMIT 2 REVS=1 KEYS_ONLY;

 

fr.inria.caml

fr.inria.caml/pub/docs/oreilly-book

 

To look at all topics which start with write (case insensitive):

hypertable> SELECT Topic:/(?i)^write/ FROM dmoz LIMIT 2;

 

13.141.244.204/writ_den        Topic:Writers_Resources

ac.sms.www/clubpage.asp?club=CL001003004000301311      Topic:Writers_Resources

 

The next example shows how to query for data where the description contains the word game followed by either foosball or halo:


hypertable> SELECT CELLS Description FROM dmoz WHERE VALUE REGEXP "(?i:game.*(foosball|halo)\s)" LIMIT 2 REVS=1;

 

com.armchairempire.www/Previews/PCGames/planetside.htm  Description  Preview by Mr. Nash. "So, on the one side the game is sounding pretty snazzy, on the other it sounds sort of like Halo at its core."
com.digitaldestroyers.www  Description  Video game fans in Spartanburg, South Carolina who like to get together and compete for bragging rights. Also compete with other Halo / Xbox fan clubs.


Atomic Counters

Column families can optionally act as atomic counters by supplying the COUNTER option in the column specification of the CREATE TABLE command. Counter columns are accessed using the same methods as other columns. However, to modify the counter, the value must be formatted specially, as described in the following table.

Value Format    Description
['+'] n         Increment the counter by n
'-' n           Decrement the counter by n
'=' n           Reset the counter to n
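The updates compose in order as cells are applied to a counter. A tiny awk sketch (purely illustrative; Hypertable applies these server-side) shows the effect of a sequence of counter update values:

# Apply a sequence of counter updates: +n increments, -n decrements, =n resets
printf '%s\n' +1 +1 -1 =10 +5 | awk '
  /^\+/ { c += substr($0, 2); next }
  /^-/  { c -= substr($0, 2); next }
  /^=/  { c  = substr($0, 2); next }
  END   { print c }'
# prints 15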

Example

In this example we create a table of counters called counts that contains a single column family url that acts as an atomic counter for urls. By convention, the row key is the URL with the domain name reversed (so that URLs from the same domain sort next to each other) and the column qualifier is the hour in which the "hit" occurred. The table is created with the following HQL:

hypertable> use"/";

hypertable> createtable counts ( url COUNTER );

 

Let's say we've accumulated url "hit" occurrences in the following .tsv file:

#row   column value

org.hypertable.www/    url:2010-10-26_09       +1

org.hypertable.www/    url:2010-10-26_09       +1

org.hypertable.www/download.html       url:2010-10-26_09       +1

org.hypertable.www/documentation.html  url:2010-10-26_09       +1

org.hypertable.www/download.html       url:2010-10-26_09       +1

org.hypertable.www/about.html  url:2010-10-26_09       +1

org.hypertable.www/    url:2010-10-26_09       +1

org.hypertable.www/    url:2010-10-26_10       +1

org.hypertable.www/about.html  url:2010-10-26_10       +1

org.hypertable.www/    url:2010-10-26_10       +1

org.hypertable.www/download.html       url:2010-10-26_10       +1

org.hypertable.www/download.html       url:2010-10-26_10       +1

org.hypertable.www/documentation.html  url:2010-10-26_10       +1

org.hypertable.www/    url:2010-10-26_10       +1

 

If we were to load this file with LOAD DATA INFILE into the counts table, a subsequent select would yield the following output:

hypertable> select * from counts;

org.hypertable.www/    url:2010-10-26_09       3

org.hypertable.www/    url:2010-10-26_10       3

org.hypertable.www/about.html  url:2010-10-26_09       1

org.hypertable.www/about.html  url:2010-10-26_10       1

org.hypertable.www/documentation.html  url:2010-10-26_09       1

org.hypertable.www/documentation.html  url:2010-10-26_10       1

org.hypertable.www/download.html       url:2010-10-26_09       2

org.hypertable.www/download.html       url:2010-10-26_10       2

 

Group Commit

Updates are carried out by the RangeServers through the following steps:

1. Write the update to the commit log (in the DFS)

2. Sync the commit log (in the DFS)

3. Populate the in-memory data structure with the update

Under high concurrency, step #2 can become a bottleneck. Distributed filesystems such as HDFS can typically handle only a small number of sync operations per second. The Group Commit feature solves this problem by delaying updates, grouping them together, and carrying them out in a batch on some regular interval.

A table can be configured to use group commit by supplying the GROUP_COMMIT_INTERVAL option in the CREATE TABLE statement. The GROUP_COMMIT_INTERVAL option tells the system that updates to this table should be carried out with group commit and also specifies the commit interval in milliseconds. The interval is constrained by the value of the config property Hypertable.RangeServer.CommitInterval, which acts as a lower bound (default is 50ms). The value specified for GROUP_COMMIT_INTERVAL will get rounded up to the nearest multiple of this property value. The following is an example CREATE TABLE statement that creates a table counts set up for group commit operation.
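As a quick illustration of the rounding rule (assuming the default Hypertable.RangeServer.CommitInterval of 50 ms; a sketch of the arithmetic only):

# Round a requested GROUP_COMMIT_INTERVAL up to the nearest multiple of the
# RangeServer commit interval (default 50 ms); illustration only
commit_interval=50
for requested in 10 50 70 100 120; do
  effective=$(( (requested + commit_interval - 1) / commit_interval * commit_interval ))
  echo "GROUP_COMMIT_INTERVAL=${requested}ms -> effective ${effective}ms"
done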

Example

hypertable> CREATE TABLE counts (

  url,

  domain

) GROUP_COMMIT_INTERVAL=100;