What is the best way to query a CSV file w/ double quotes when a custom SerDe is not supported by Impala?

Time: 2021-06-20 20:07:03

I have CSV data with each field surrounded by double quotes. When I created the Hive table I used the SerDe 'com.bizo.hive.serde.csv.CSVSerde'. When that table is queried in Impala I get the error "SerDe not found".


I added the CSV Serde JAR file in /usr/lib/impala/lib folder.


Later I read in the Impala documentation that Impala does not support custom SerDes. In that case, how can I overcome this issue so that my quoted CSV data is handled correctly? I want to use the CSV SerDe because it takes care of commas inside values, where a comma is a legitimate part of a field value.

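For illustration, the data and table definition look roughly like the following; the table and column names here are just placeholders:

-- A sample record: the second field contains a comma inside the quotes,
-- so a naive split on commas would break it into two columns:
--   "1","Smith, John","2021-06-20"

-- Hive DDL along these lines (placeholder names):
CREATE TABLE quoted_csv_table (
   id STRING,
   name STRING,
   created_date STRING)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
STORED AS TEXTFILE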

Thanks a lot


2 Answers

#1


3  

Can you use Hive? If so, here is an approach that might work. CREATE your table as an EXTERNAL TABLE in Hive and use your SerDe in the right place in the CREATE statement (I think you need something like ROW FORMAT SERDE your_serde_here near the end of the CREATE TABLE statement). Before this you might need to do:


ADD JAR 'hdfs:///path/to/your_serde.jar' 

Note that the JAR should be somewhere in HDFS, and the triple slash (///) is needed for it to work...

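As a rough, purely illustrative sketch of that external table (column names and the HDFS location are placeholders; the SerDe class is the one from the question):

-- External table over the raw quoted CSV files; adjust columns and LOCATION
CREATE EXTERNAL TABLE csv_staging (
   id STRING,
   name STRING,
   created_date STRING)
ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
STORED AS TEXTFILE
LOCATION 'hdfs:///path/to/csv/data/'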

Then, still in Hive, duplicate the table into another table that is stored in a format with which Impala can easily work, such as PARQUET. Something like the following does this copying:


CREATE TABLE copy_of_table 
   STORED AS PARQUET AS
   SELECT * FROM your_original_table

Now in Impala do:


INVALIDATE METADATA copy_of_table

You should be all set to happily work with copy_of_table in Impala now.

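For example, a couple of quick sanity checks in impala-shell after the invalidate (same table name as above):

-- Impala reads the Parquet copy directly, no custom SerDe involved
SELECT COUNT(*) FROM copy_of_table;
SELECT * FROM copy_of_table LIMIT 10;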

Let me know whether this works, as I might have to do something like this in the near future.


#2


0  

Within Hive


CREATE TABLE mydb.my_serde_table_impala AS SELECT * FROM mydb.my_serde_table

Within Impala


INVALIDATE METADATA mydb.my_serde_table_impala

Add these steps to whatever process populates or ingests files for the SerDe table, and include dropping the _impala table first before recreating it, as sketched below.

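A minimal sketch of that refresh cycle, using the table names above; adapt it to however the SerDe table is actually loaded:

-- In Hive, after new files have been ingested into the SerDe table:
DROP TABLE IF EXISTS mydb.my_serde_table_impala;
CREATE TABLE mydb.my_serde_table_impala
AS SELECT * FROM mydb.my_serde_table;
-- (Optionally add STORED AS PARQUET, as in the first answer.)

-- Then in Impala, so it sees the recreated table:
INVALIDATE METADATA mydb.my_serde_table_impala;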

Impala bypasses MapReduce, unlike Hive. So Impala can't/doesn't use the SerDe the way MapReduce does.

