How do I output the results of a HiveQL query to a CSV file?

Time: 2022-03-30 13:46:09

We would like to put the results of a Hive query into a CSV file. I thought the command should look like this:

insert overwrite directory '/home/output.csv' select books from table;

When I run it, it says it completed successfully, but I can never find the file. How do I find this file, or should I be extracting the data in a different way?

Thanks!

10 solutions

#1


132  

Although it is possible to use INSERT OVERWRITE to get data out of Hive, it might not be the best method for your particular case. First let me explain what INSERT OVERWRITE does, then I'll describe the method I use to get tsv files from Hive tables.

According to the manual, your query will store the data in a directory in HDFS. The format will not be csv.

Data written to the filesystem is serialized as text with columns separated by ^A and rows separated by newlines. If any of the columns are not of primitive type, then those columns are serialized to JSON format.

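Since the query writes to a directory in HDFS rather than to a local file, the output from the original question can be located and pulled down with the hdfs CLI. A minimal sketch, assuming the path from the question:

hdfs dfs -ls /home/output.csv                            # despite the name, this is an HDFS directory
hdfs dfs -getmerge /home/output.csv /tmp/output_raw.txt  # merge its part files onto the local disk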

A slight modification (adding the LOCAL keyword) will store the data in a local directory.

INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp' select books from table;

When I run a similar query, here's what the output looks like. (The ^A separators are non-printing characters, which is why the columns appear to run together.)

[lvermeer@hadoop temp]$ ll
total 4
-rwxr-xr-x 1 lvermeer users 811 Aug  9 09:21 000000_0
[lvermeer@hadoop temp]$ head 000000_0 
"row1""col1"1234"col3"1234FALSE
"row2""col1"5678"col3"5678TRUE

Personally, I usually run my query directly through Hive on the command line for this kind of thing, and pipe it into the local file like so:

hive -e 'select books from table' > /home/lvermeer/temp.tsv

That gives me a tab-separated file that I can use. Hope that is useful for you as well.

Based on patch HIVE-3682, I suspect a better solution is available when using Hive 0.11, but I am unable to test this myself. The new syntax should allow the following.

INSERT OVERWRITE LOCAL DIRECTORY '/home/lvermeer/temp' 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' 
select books from table;

Hope that helps.

#2


19  

If you want a CSV file then you can modify Lukas's solution as follows (assuming you are on a Linux box):

hive -e 'select books from table' | sed 's/[[:space:]]\+/,/g' > /home/lvermeer/temp.csv

#3


4  

You should use the CREATE TABLE AS SELECT (CTAS) statement to create a directory in HDFS with the files containing the results of the query. After that you will have to export those files from HDFS to your regular disk and merge them into a single file.

You also might have to do some trickery to convert the files from '\001'-delimited to CSV. You could use a custom CSV SerDe or postprocess the extracted file.

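A minimal sketch of that flow, with hypothetical table and path names (the warehouse location depends on your configuration):

hive -e "CREATE TABLE books_csv ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' STORED AS TEXTFILE AS SELECT books FROM my_table;"
hdfs dfs -getmerge /user/hive/warehouse/books_csv /tmp/books.csv   # export the result files and merge them into one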

#4


3  

If you are using HUE, this is fairly simple as well. Simply go to the Hive editor in HUE, execute your Hive query, then save the result file locally as XLS or CSV, or you can save the result file to HDFS.

#5


3  

I was looking for a similar solution, but the ones mentioned here would not work. My data had all variations of whitespace (space, newline, tab) chars and commas.

To make the column data TSV-safe, I replaced all \t chars in the column data with a space, and executed Python code on the command line to generate a CSV file, as shown below:

hive -e 'tab_replaced_hql_query' |  python -c 'exec("import sys;import csv;reader = csv.reader(sys.stdin, dialect=csv.excel_tab);writer = csv.writer(sys.stdout, dialect=csv.excel)\nfor row in reader: writer.writerow(row)")'
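
For readability, the same pipeline with the Python unrolled into a multi-line form (functionally equivalent to the exec one-liner above):

hive -e 'tab_replaced_hql_query' | python -c '
import sys, csv
reader = csv.reader(sys.stdin, dialect=csv.excel_tab)   # parse the tab-separated input
writer = csv.writer(sys.stdout, dialect=csv.excel)      # emit standard comma-separated CSV
for row in reader:
    writer.writerow(row)
'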

This created a perfectly valid csv. Hope this helps those who come looking for this solution.

#6


3  

You can use the Hive string function CONCAT_WS(string delimiter, string str1, string str2, ..., string strn).

For example:

hive -e "select CONCAT_WS(',', cola, colb, colc, ..., coln) from Mytable" > /home/user/Mycsv.csv
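
Note that CONCAT_WS expects string arguments, so non-string columns generally need an explicit cast; a sketch with hypothetical column types:

hive -e "select CONCAT_WS(',', cola, CAST(colb AS STRING)) from Mytable" > /home/user/Mycsv.csv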

#7


2  

You can use INSERT … DIRECTORY …, as in this example:

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/ca_employees'
SELECT name, salary, address
FROM employees
WHERE state = 'CA';

OVERWRITE and LOCAL have the same interpretations as before and paths are interpreted following the usual rules. One or more files will be written to /tmp/ca_employees, depending on the number of reducers invoked.

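If several reducers run, several part files will appear in that directory; they can be merged locally with something like:

cat /tmp/ca_employees/* > /tmp/ca_employees_merged.txt   # concatenate all part files into one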

#8


2  

I had a similar issue and this is how I was able to address it.

Step 1 - Loaded the data from the Hive table into another table as follows:

DROP TABLE IF EXISTS TestHiveTableCSV;
CREATE TABLE TestHiveTableCSV 
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n' AS
SELECT <column_list> FROM TestHiveTable;

Step 2 - Copied the blob from the Hive warehouse to the new location with the appropriate extension:

Start-AzureStorageBlobCopy `
    -DestContext $destContext `
    -SrcContainer "Source Container" `
    -SrcBlob "hive/warehouse/TestHiveTableCSV/000000_0" `
    -DestContainer "Destination Container" `
    -DestBlob "CSV/TestHiveTable.csv"
#9


1  

The default separator is "^A" (Ctrl-A). In Python, it is "\x01".

When I want to change the delimiter, I select a literal delimiter string between the columns, using SQL like this:

SELECT col1, delimiter, col2, delimiter, col3, ... FROM table

Then, treat delimiter + "^A" as the new delimiter.

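A concrete sketch of this trick, using '|' as the extra literal (a hypothetical choice):

hive -e "SELECT col1, '|', col2, '|', col3 FROM my_table" > /tmp/out.txt
# each row comes out as col1^A|^Acol2^A|^Acol3, so split on the byte sequence "\x01|\x01"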

#10


0  

Similar to Ray's answer above, Hive View 2.0 in Hortonworks Data Platform also allows you to run a Hive query and then save the output as csv.
