Hive Tutorial(上)(Hive 入门指导)

时间:2022-04-07 23:56:07

用户指导

Hive 指导


概念

Hive是什么?

Hive是一个以Apache Hadoop为基础的数据仓储基础设施。Hadoop为数据的存储和运行在商业机器上提供了可扩展和高容错的性能。

Hive的设计目标是使得数据汇总更加简单和针对大容量数据的查询和分析。它提供SWL来使得用户可以更简单地查询、汇总和数据分析。同时,Hive的SQL为用户提供了多种地方来融合他们自己的方法实现自定义分析,例如User Defined Functions (UDFs)。

Hive不是什么?

Hive不是为事务联机处理设计的。它是用于处理传统数据仓储任务。

获得和开始

至于如何配置Hive,HiveServer2和Beeline的细节,请参考GettingStarted指南。

Books about Hive 展示了一些可以帮助更好开始Hive的书籍。

在接下来的部分我们将提供一份关于系统性能的指导。我们开始描述data types,tables和partitions(跟传统关系型数据库相似)的概念和通过举例帮助了解Hive的能力。

数据单元

为了使得粒度合适,Hive数据采用下面展示的组织结构:

  • Databases:命名空间方法用来避免tables,views,partitions,columns等等的命名冲突。Databases也可以用于加强用户或者一组用户的安全性。
  • Tables: 拥有相同的schema被看成是同种数据单元。下面是page_views表的例子,每一行包括以下的列(schema):
    • timestamp—当网页被浏览时UNIX timestamp一致的INT类型的数据
    • userid  —用来识别浏览该页面的用户的BIGINT类型的数据
    • page_url —获取网页位置的STRING类型的数据
    • referer_url—用于获取用户所在当前页面的位置的STRING类型的数据。
    • IP—用于获取页面请求时的IP地址。
  • Partitions:每个页面都拥有一个或者多个Key来决定数据如何存储。Partitions——除了存储单元——也允许用户高效地识别满足指定标准的行。例如,STRING类型的 data_partition和STRING 类型的country_partition。每一个独一无二的partition key值定义一个Table的partition。例如,US时间的“2009-12-23”是page_views table的一个partition。因此,如果你只想分析2009-12-23的“US”数据,你可以运行在该table相关的partition上,从而提高分析效率。然而需要说明的是,只是因为有一个partition名字为2009-12-23并不意味着它包含所有或者只是该日期的数据。partition用时间命名只是为了方面。维持partition名字和数据内容之间的映射关系是用户的工作。Partition列是虚拟列,本身不是数据的一部分而是在加载中派生出来的。
  • Buckets (or Clusters):基于对表中一些列的hash方法得出值来将每个partition中的数据分到不同的Buckets中。例如,page_views表可能通过userid表中不同于其他partition列的列来bucket。这可以用于有效地获取样本。

需要说明是对于表来说partitioned和bucketed不是必需的,但这些抽象化概念允许系统在查询操作中筛选掉大量数据来提高查询速度。

类型系统

Hive支持原始和复杂数据类型,正如下面多描述的。可以在Hive Data Types中查看更多信息。

原始类型

  • 数据类型是跟表中的列相关的,支持下面的原始类型:
  • Integers
    • TINYINT—1个字节的整型
    • SMALLINT—2个字节的整型
    • INT—4个字节的整型
    • BIGINT—8个字节的整型
  • Boolean type
    • BOOLEAN—TRUE/FALSE
  • Floating point numbers
    • FLOAT—单精度
    • DOUBLE—双精度
  • Fixed point numbers
    • DECIMAL—a fixed point value of user defined scale and precision用户定义的大小和精度的固定的点值
  • String types
    • STRING—指定字符集的字符串
    • VARCHAR—指定字符集最大长度的字符串
    • CHAR—指定字符集和长度的字符串
  • Date and time types
    • TIMESTAMP— 纳秒精度的特定时间点
    • DATE—日期
  • Binary types
    • BINARY—字节序列

类型的层次结构如下(父类是所有子类实例的超类型):

  • Type
    • Primitive Type
      • Number
        • DOUBLE
          • FLOAT
            • BIGINT
              • INT
                • SMALLINT
                  • TINYINT
          • STRING
      • BOOLEAN

类型层级定义了类型在查询语言中的隐性转换。隐性转换允许子类转换成父类。所以当一个查询表达式需要type1但是数据是type2,type1在层级结构中是type2的父类,那么type2可以转换成type1.需要说明的是类型层级允许STRING转换成DOUBLE。

明确的类型转换可以用下面部分#Built In Functions中的cast操作符来实现。

复杂类型

复杂类型可以用原始类型和其他组合类型来组合:

  • Structs:类型里面的元素可以用.符号来获得。举个例子,一个列c的类型是STRUCT{a INT;b INT},那么里面的a可用c.a来访问。
  • Maps (key-value tuples): 元素可用[‘元素名’]来访问。例如,在一个map M中包含一个一个键值对‘group’->gid,那么gid的值可以用M[‘group’]来获得。
  • Arrays (indexable lists): 数组中元素必须是同种类型。元素可以通过[index]来获得。举个例子,A数组拥有元素[‘a’,’b’,’c’],那么A[1]会返回‘b’。

使用原始数据类型和创造复杂类型的架构,任意级别的嵌套类型都可以被创造。例如,对于一个类型,用户可能包含下面的字段:

  • gender—which is a STRING.
  • active—which is a BOOLEAN.

内置操作符和方法

下面列出的操作符和方法不一定是最新的(Hive Operators and UDFs里面有更多最新信息)在 Beeline 或者 Hive CLI, 使用这些命令行获得最新文档:

SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;

区分大小写

所有的Hive关键词都是区分大小写,包括Hive操作和方法名。

内置操作符

  • 关系操作符—通过与传递进来的值进行比较返回TRUE or FALSE

Relational Operator   

Operand   types   

Description

A = B

all primitive types

TRUE if expression A is equivalent to expression B; otherwise FALSE

A != B

all primitive types

TRUE if expression A is not equivalent to expression B; otherwise FALSE

A < B

all primitive types

TRUE if expression A is less than expression B; otherwise FALSE

A <= B

all primitive types

TRUE if expression A is less than or equal to expression B; otherwise FALSE

A > B

all primitive types

TRUE if expression A is greater than expression B] otherwise FALSE

A >= B

all primitive types

TRUE if expression A is greater than or equal to expression B otherwise FALSE

A IS NULL

all types

TRUE if expression A evaluates to NULL otherwise FALSE

A IS NOT NULL

all types

FALSE if expression A evaluates to NULL otherwise TRUE

A LIKE B

strings

TRUE if string A matches the SQL simple regular expression B, otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A (similar to . in posix regular expressions), and the % character in B matches an arbitrary number of characters in A (similar to .* in posix regular expressions). For example, 'foobar' LIKE 'foo' evaluates to FALSE where as 'foobar' LIKE 'foo___' evaluates to TRUE and so does 'foobar' LIKE 'foo%'. To escape % use \ (% matches one % character). If the data contains a semicolon, and you want to search for it, it needs to be escaped, columnValue LIKE 'a\;b'

A RLIKE B

strings

NULL if A or B is NULL, TRUE if any (possibly empty) substring of A matches the Java regular expression B (see Java regular expressions syntax), otherwise FALSE. For example, 'foobar' rlike 'foo' evaluates to TRUE and so does 'foobar' rlike '^f.*r$'.

A REGEXP B

strings

Same as RLIKE

  • 算术操作符—下面的运算符支持各种常见的算术运算符。所有的返回值都为number类型的。

Arithmetic Operators

Operand types

Description

A + B

all number types

Gives the result of adding A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands, for example, since every integer is a float. Therefore, float is a containing type of integer so the + operator on a float and an int will result in a float.

A - B

all number types

Gives the result of subtracting B from A. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands.

A * B

all number types

Gives the result of multiplying A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. Note that if the multiplication causing overflow, you will have to cast one of the operators to a type higher in the type hierarchy.

A / B

all number types

Gives the result of dividing B from A. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. If the operands are integer types, then the result is the quotient of the division.

A % B

all number types

Gives the reminder resulting from dividing A by B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands.

A & B

all number types

Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands.

A | B

all number types

Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands.

A ^ B

all number types

Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands.

~A

all number types

Gives the result of bitwise NOT of A. The type of the result is the same as the type of A.

  • 逻辑运算符 — 下面的操作符支持创建逻辑表达式。所有的结果根据操作对象的布尔值来返回true/false。

Logical Operators

Operands types

Description

A AND B

boolean

TRUE if both A and B are TRUE, otherwise FALSE

A && B

boolean

Same as A AND B

A OR B

boolean

TRUE if either A or B or both are TRUE, otherwise FALSE

A || B

boolean

Same as A OR B

NOT A

boolean

TRUE if A is FALSE, otherwise FALSE

!A

boolean

Same as NOT A

  • 复杂类型中的操作符—下面的操作符提供机制来获得复杂类型中的元素,

Operator

Operand types

Description

A[n]

A is an Array and n is an int

returns the nth element in the array A. The first element has index 0, for example, if A is an array comprising of ['foo', 'bar'] then A[0] returns 'foo' and A[1] returns 'bar'

M[key]

M is a Map<K, V> and key has type K

returns the value corresponding to the key in the map for example, if M is a map comprising of
{'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'} then M['all']
returns 'foobar'

S.x

S is a struct

returns the x field of S, for example, for struct foobar
{int foo, int bar} foobar.foo returns the integer stored in the foo field of
the struct.

内置方法

Return Type

Function Name (Signature)

Description

BIGINT

round(double a)

returns the rounded BIGINT value of the double

BIGINT

floor(double a)

returns the maximum BIGINT value that is equal or less
than the double

BIGINT

ceil(double a)

returns the minimum BIGINT value that is equal or greater
than the double

double

rand(), rand(int seed)

returns a random number (that changes from row to row).
Specifiying the seed will make sure the generated random number sequence is
deterministic.

string

concat(string A, string B,...)

returns the string resulting from concatenating B after
A. For example, concat('foo', 'bar') results in 'foobar'. This function
accepts arbitrary number of arguments and return the concatenation of all of
them.

string

substr(string A, int start)

returns the substring of A starting from start position
till the end of string A. For example, substr('foobar', 4) results in 'bar'

string

substr(string A, int start, int length)

returns the substring of A starting from start position
with the given length, for example,
substr('foobar', 4, 2) results in 'ba'

string

upper(string A)

returns the string resulting from converting all
characters of A to upper case, for example, upper('fOoBaR') results in
'FOOBAR'

string

ucase(string A)

Same as upper

string

lower(string A)

returns the string resulting from converting all
characters of B to lower case, for example, lower('fOoBaR') results in
'foobar'

string

lcase(string A)

Same as lower

string

trim(string A)

returns the string resulting from trimming spaces from
both ends of A, for example, trim(' foobar ') results in 'foobar'

string

ltrim(string A)

returns the string resulting from trimming spaces from
the beginning(left hand side) of A. For example, ltrim(' foobar ') results in
'foobar '

string

rtrim(string A)

returns the string resulting from trimming spaces from
the end(right hand side) of A. For example, rtrim(' foobar ') results in '
foobar'

string

regexp_replace(string A, string B, string C)

returns the string resulting from replacing all substrings
in B that match the Java regular expression syntax(See Java
regular expressions syntax
) with C. For example, regexp_replace('foobar',
'oo|ar', ) returns 'fb'

int

size(Map<K.V>)

returns the number of elements in the map type

int

size(Array<T>)

returns the number of elements in the array type

value
of <type>

cast(<expr> as <type>)

converts the results of the expression expr to
<type>, for example, cast('1' as BIGINT) will convert the string '1' to
it integral representation. A null is returned if the conversion does not
succeed.

string

from_unixtime(int unixtime)

convert the number of seconds from the UNIX epoch
(1970-01-01 00:00:00 UTC) to a string representing the timestamp of that
moment in the current system time zone in the format of "1970-01-01
00:00:00"

string

to_date(string timestamp)

Return the date part of a timestamp string:
to_date("1970-01-01 00:00:00") = "1970-01-01"

int

year(string date)

Return the year part of a date or a timestamp string:
year("1970-01-01 00:00:00") = 1970, year("1970-01-01") =
1970

int

month(string date)

Return the month part of a date or a timestamp string:
month("1970-11-01 00:00:00") = 11, month("1970-11-01") =
11

int

day(string date)

Return the day part of a date or a timestamp string:
day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1

string

get_json_object(string json_string, string path)

Extract json object from a json string based on json path
specified, and return json string of the extracted json object. It will
return null if the input json string is invalid.

  • Hive中的统计函数:

Return Type

Aggregation Function Name
(Signature)

Description

BIGINT

count(*), count(expr), count(DISTINCT expr[, expr_.])

count(*)—Returns the
total number of retrieved rows, including rows containing NULL values;
count(expr)—Returns the number of rows for
which the supplied expression is non-NULL; count(DISTINCT expr[, expr])—Returns the number of rows for which the
supplied expression(s) are unique and non-NULL.

DOUBLE

sum(col), sum(DISTINCT col)

returns the sum of the elements in the group or the sum
of the distinct values of the column in the group

DOUBLE

avg(col), avg(DISTINCT col)

returns the average of the elements in the group or the
average of the distinct values of the column in the group

DOUBLE

min(col)

returns the minimum value of the column in the group

DOUBLE

max(col)

returns the maximum value of the column in the group

语言能力

Hive's
SQL
提供基础 SQL操作. 这些操作是用在表和partition上,这些操作是:

    • 用WHERE来过滤表中的行
    • 用SELECT从表中来选择确定的行
    • 两表联合
    • Ability
      to evaluate aggregations on multiple "group by" columns for the
      data stored in a table.(翻译不出来T T)
    • 将查询的结果存储到另一个表中
    • 将表中内容下载到本地目录
    • 将查询结果存储到hadoop dfs目录
    • 管理tables和partitions(新建,移除和更改)
    • 插入使用自定义 map/reduce作业使用的语言写的自定义脚本

下面是原文


Tutorial

Hive Tutorial


Concepts

What Is Hive

Hive is a data warehousing infrastructure based on Apache Hadoop. Hadoop provides massive scale out and fault tolerance capabilities for data storage and processing on commodity hardware.

Hive is designed to enable easy data summarization, ad-hoc querying and analysis of large volumes of data. It provides SQL which enables users to do ad-hoc querying, summarization and data analysis easily. At the same time, Hive's SQL gives users multiple places to integrate their own functionality to do custom analysis, such as User Defined Functions (UDFs).

What Hive Is NOT

Hive is not designed for online transaction processing.  It is best used for traditional data warehousing tasks.

Getting Started

For details on setting up Hive, HiveServer2, and Beeline, please refer to the GettingStarted guide.

Books about Hive lists some books that may also be helpful for getting started with Hive.

In the following sections we provide a tutorial on the capabilities of the system. We start by describing the concepts of data types, tables, and partitions (which are very similar to what you would find in a traditional relational DBMS) and then illustrate the capabilities of Hive with the help of some examples.

Data Units

In the order of granularity - Hive data is organized into:

  • Databases: Namespaces function to avoid naming conflicts for tables, views, partitions, columns, and so on.  Databases can also be used to enforce security for a user or group of users.
  • Tables: Homogeneous units of data which have the same schema. An example of a table could be page_views table, where each row could comprise of the following columns (schema):
    • timestamp—which is of INT type that corresponds to a UNIX timestamp of when the page was viewed.
    • userid —which is of BIGINT type that identifies the user who viewed the page.
    • page_url—which is of STRING type that captures the location of the page.
    • referer_url—which is of STRING that captures the location of the page from where the user arrived at the current page.
    • IP—which is of STRING type that captures the IP address from where the page request was made.
  • Partitions: Each Table can have one or more partition Keys which determines how the data is stored. Partitions—apart from being storage units—also allow the user to efficiently identify the rows that satisfy a specified criteria; for example, a date_partition of type STRING and country_partition of type STRING. Each unique value of the partition keys defines a partition of the Table. For example, all "US" data from "2009-12-23" is a partition of the page_views table. Therefore, if you run analysis on only the "US" data for 2009-12-23, you can run that query only on the relevant partition of the table, thereby speeding up the analysis significantly. Note however, that just because a partition is named 2009-12-23 does not mean that it contains all or only data from that date; partitions are named after dates for convenience; it is the user's job to guarantee the relationship between partition name and data content! Partition columns are virtual columns, they are not part of the data itself but are derived on load.
  • Buckets (or Clusters): Data in each partition may in turn be divided into Buckets based on the value of a hash function of some column of the Table. For example the page_views table may be bucketed by userid, which is one of the columns, other than the partitions columns, of the page_view table. These can be used to efficiently sample the data.

Note that it is not necessary for tables to be partitioned or bucketed, but these abstractions allow the system to prune large quantities of data during query processing, resulting in faster query execution.

Type System

Hive supports primitive and complex data types, as described below. See Hive Data Types for additional information.

Primitive Types

  • Types are associated with the columns in the tables. The following Primitive types are supported:
  • Integers
    • TINYINT—1 byte integer
    • SMALLINT—2 byte integer
    • INT—4 byte integer
    • BIGINT—8 byte integer
  • Boolean type
    • BOOLEAN—TRUE/FALSE
  • Floating point numbers
    • FLOAT—single precision
    • DOUBLE—Double precision
  • Fixed point numbers
    • DECIMAL—a fixed point value of user defined scale and precision
  • String types
    • STRING—sequence of characters in a specified character set
    • VARCHAR—sequence of characters in a specified character set with a maximum length
    • CHAR—sequence of characters in a specified character set with a defined length
  • Date and time types
    • TIMESTAMP— a specific point in time, up to nanosecond precision
    • DATE—a date
  • Binary types
    • BINARY—a sequence of bytes

The Types are organized in the following hierarchy (where the parent is a super type of all the children instances):

  • Type
    • Primitive Type
      • Number
        • DOUBLE
          • FLOAT
            • BIGINT
              • INT
                • SMALLINT
                  • TINYINT
          • STRING
      • BOOLEAN

This type hierarchy defines how the types are implicitly converted in the query language. Implicit conversion is allowed for types from child to an ancestor. So when a query expression expects type1 and the data is of type2, type2 is implicitly converted to type1 if type1 is an ancestor of type2 in the type hierarchy. Note that the type hierarchy allows the implicit conversion of STRING to DOUBLE.

Explicit type conversion can be done using the cast operator as shown in the #Built In Functions section below.

Complex Types

Complex Types can be built up from primitive types and other composite types using:

  • Structs: the elements within the type can be accessed using the DOT (.) notation. For example, for a column c of type STRUCT {a INT; b INT}, the a field is accessed by the expression c.a
  • Maps (key-value tuples): The elements are accessed using ['element name'] notation. For example in a map M comprising of a mapping from 'group' -> gid the gid value can be accessed using M['group']
  • Arrays (indexable lists): The elements in the array have to be in the same type. Elements can be accessed using the [n] notation where n is an index (zero-based) into the array. For example, for an array A having the elements ['a', 'b', 'c'], A[1] retruns 'b'.

Using the primitive types and the constructs for creating complex types, types with arbitrary levels of nesting can be created. For example, a type User may comprise of the following fields:

  • gender—which is a STRING.
  • active—which is a BOOLEAN.

Built In Operators and Functions

The operators and functions listed below are not necessarily up to date. (Hive Operators and UDFs has more current information.) In Beeline or the Hive CLI, use these commands to show the latest documentation:

SHOW FUNCTIONS;
DESCRIBE FUNCTION <function_name>;
DESCRIBE FUNCTION EXTENDED <function_name>;

Case-insensitive

All Hive keywords are case-insensitive, including the names of Hive operators and functions.

Built In Operators

  • Relational Operators—The following operators compare the passed operands and generate a TRUE or FALSE value, depending on whether the comparison between the operands holds or not.

Relational Operator

Operand types

Description

A = B

all primitive types

TRUE if expression A is equivalent to expression B; otherwise FALSE

A != B

all primitive types

TRUE if expression A is not equivalent to expression B; otherwise FALSE

A < B

all primitive types

TRUE if expression A is less than expression B; otherwise FALSE

A <= B

all primitive types

TRUE if expression A is less than or equal to expression B; otherwise FALSE

A > B

all primitive types

TRUE if expression A is greater than expression B] otherwise FALSE

A >= B

all primitive types

TRUE if expression A is greater than or equal to expression B otherwise FALSE

A IS NULL

all types

TRUE if expression A evaluates to NULL otherwise FALSE

A IS NOT NULL

all types

FALSE if expression A evaluates to NULL otherwise TRUE

A LIKE B

strings

TRUE if string A matches the SQL simple regular expression B, otherwise FALSE. The comparison is done character by character. The _ character in B matches any character in A (similar to . in posix regular expressions), and the % character in B matches an arbitrary number of characters in A (similar to .* in posix regular expressions). For example, 'foobar' LIKE 'foo' evaluates to FALSE where as 'foobar' LIKE 'foo___' evaluates to TRUE and so does 'foobar' LIKE 'foo%'. To escape % use \ (% matches one % character). If the data contains a semicolon, and you want to search for it, it needs to be escaped, columnValue LIKE 'a\;b'

A RLIKE B

strings

NULL if A or B is NULL, TRUE if any (possibly empty) substring of A matches the Java regular expression B (see Java regular expressions syntax), otherwise FALSE. For example, 'foobar' rlike 'foo' evaluates to TRUE and so does 'foobar' rlike '^f.*r$'.

A REGEXP B

strings

Same as RLIKE

  • Arithmetic Operators—The following operators support various common arithmetic operations on the operands. All of them return number types.

Arithmetic Operators

Operand types

Description

A + B

all number types

Gives the result of adding A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands, for example, since every integer is a float. Therefore, float is a containing type of integer so the + operator on a float and an int will result in a float.

A - B

all number types

Gives the result of subtracting B from A. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands.

A * B

all number types

Gives the result of multiplying A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. Note that if the multiplication causing overflow, you will have to cast one of the operators to a type higher in the type hierarchy.

A / B

all number types

Gives the result of dividing B from A. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands. If the operands are integer types, then the result is the quotient of the division.

A % B

all number types

Gives the reminder resulting from dividing A by B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands.

A & B

all number types

Gives the result of bitwise AND of A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands.

A | B

all number types

Gives the result of bitwise OR of A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands.

A ^ B

all number types

Gives the result of bitwise XOR of A and B. The type of the result is the same as the common parent(in the type hierarchy) of the types of the operands.

~A

all number types

Gives the result of bitwise NOT of A. The type of the result is the same as the type of A.

  • Logical Operators — The following operators provide support for creating logical expressions. All of them return boolean TRUE or FALSE depending upon the boolean values of the operands.

Logical Operators

Operands types

Description

A AND B

boolean

TRUE if both A and B are TRUE, otherwise FALSE

A && B

boolean

Same as A AND B

A OR B

boolean

TRUE if either A or B or both are TRUE, otherwise FALSE

A || B

boolean

Same as A OR B

NOT A

boolean

TRUE if A is FALSE, otherwise FALSE

!A

boolean

Same as NOT A

  • Operators on Complex Types—The following operators provide mechanisms to access elements in Complex Types

Operator

Operand types

Description

A[n]

A is an Array and n is an int

returns the nth element in the array A. The first element has index 0, for example, if A is an array comprising of ['foo', 'bar'] then A[0] returns 'foo' and A[1] returns 'bar'

M[key]

M is a Map<K, V> and key has type K

returns the value corresponding to the key in the map for example, if M is a map comprising of
{'f' -> 'foo', 'b' -> 'bar', 'all' -> 'foobar'} then M['all']
returns 'foobar'

S.x

S is a struct

returns the x field of S, for example, for struct foobar
{int foo, int bar} foobar.foo returns the integer stored in the foo field of
the struct.

Built In Functions

Return Type

Function Name (Signature)

Description

BIGINT

round(double a)

returns the rounded BIGINT value of the double

BIGINT

floor(double a)

returns the maximum BIGINT value that is equal or less
than the double

BIGINT

ceil(double a)

returns the minimum BIGINT value that is equal or greater
than the double

double

rand(), rand(int seed)

returns a random number (that changes from row to row).
Specifiying the seed will make sure the generated random number sequence is
deterministic.

string

concat(string A, string B,...)

returns the string resulting from concatenating B after
A. For example, concat('foo', 'bar') results in 'foobar'. This function
accepts arbitrary number of arguments and return the concatenation of all of
them.

string

substr(string A, int start)

returns the substring of A starting from start position
till the end of string A. For example, substr('foobar', 4) results in 'bar'

string

substr(string A, int start, int length)

returns the substring of A starting from start position
with the given length, for example,
substr('foobar', 4, 2) results in 'ba'

string

upper(string A)

returns the string resulting from converting all
characters of A to upper case, for example, upper('fOoBaR') results in
'FOOBAR'

string

ucase(string A)

Same as upper

string

lower(string A)

returns the string resulting from converting all
characters of B to lower case, for example, lower('fOoBaR') results in
'foobar'

string

lcase(string A)

Same as lower

string

trim(string A)

returns the string resulting from trimming spaces from
both ends of A, for example, trim(' foobar ') results in 'foobar'

string

ltrim(string A)

returns the string resulting from trimming spaces from
the beginning(left hand side) of A. For example, ltrim(' foobar ') results in
'foobar '

string

rtrim(string A)

returns the string resulting from trimming spaces from
the end(right hand side) of A. For example, rtrim(' foobar ') results in '
foobar'

string

regexp_replace(string A, string B, string C)

returns the string resulting from replacing all
substrings in B that match the Java regular expression syntax(See Java
regular expressions syntax
) with C. For example, regexp_replace('foobar',
'oo|ar', ) returns 'fb'

int

size(Map<K.V>)

returns the number of elements in the map type

int

size(Array<T>)

returns the number of elements in the array type

value
of <type>

cast(<expr> as <type>)

converts the results of the expression expr to
<type>, for example, cast('1' as BIGINT) will convert the string '1' to
it integral representation. A null is returned if the conversion does not
succeed.

string

from_unixtime(int unixtime)

convert the number of seconds from the UNIX epoch
(1970-01-01 00:00:00 UTC) to a string representing the timestamp of that
moment in the current system time zone in the format of "1970-01-01
00:00:00"

string

to_date(string timestamp)

Return the date part of a timestamp string:
to_date("1970-01-01 00:00:00") = "1970-01-01"

int

year(string date)

Return the year part of a date or a timestamp string:
year("1970-01-01 00:00:00") = 1970, year("1970-01-01") =
1970

int

month(string date)

Return the month part of a date or a timestamp string:
month("1970-11-01 00:00:00") = 11, month("1970-11-01") =
11

int

day(string date)

Return the day part of a date or a timestamp string:
day("1970-11-01 00:00:00") = 1, day("1970-11-01") = 1

string

get_json_object(string json_string, string path)

Extract json object from a json string based on json path
specified, and return json string of the extracted json object. It will
return null if the input json string is invalid.

  • The
    following built in aggregate functions are supported in Hive:

Return Type

Aggregation Function Name
(Signature)

Description

BIGINT

count(*), count(expr), count(DISTINCT expr[, expr_.])

count(*)—Returns the
total number of retrieved rows, including rows containing NULL values;
count(expr)—Returns the number of rows for
which the supplied expression is non-NULL; count(DISTINCT expr[, expr])—Returns the number of rows for which the
supplied expression(s) are unique and non-NULL.

DOUBLE

sum(col), sum(DISTINCT col)

returns the sum of the elements in the group or the sum
of the distinct values of the column in the group

DOUBLE

avg(col), avg(DISTINCT col)

returns the average of the elements in the group or the
average of the distinct values of the column in the group

DOUBLE

min(col)

returns the minimum value of the column in the group

DOUBLE

max(col)

returns the maximum value of the column in the group

Language Capabilities

Hive's
SQL
provides the basic SQL operations. These operations work on tables
or partitions. These operations are:

    • Ability
      to filter rows from a table using a WHERE clause.
    • Ability
      to select certain columns from the table using a SELECT clause.
    • Ability
      to do equi-joins between two tables.
    • Ability
      to evaluate aggregations on multiple "group by" columns for the
      data stored in a table.
    • Ability
      to store the results of a query into another table.
    • Ability
      to download the contents of a table to a local (for example,, nfs)
      directory.
    • Ability
      to store the results of a query in a hadoop dfs directory.
    • Ability
      to manage tables and partitions (create, drop and alter).
    • Ability
      to plug in custom scripts in the language of choice for custom map/reduce
      jobs.