Converting a file of JSON objects to a Parquet file

Time: 2022-09-17 07:47:56

Motivation: I want to load the data into Apache Drill. I understand that Drill can handle JSON input, but I want to see how it performs on Parquet data.

Is there any way to do this without first loading the data into Hive, etc and then using one of the Parquet connectors to generate an output file?

3 solutions

#1


5  

Kite has support for importing JSON to both Avro and Parquet formats via its command-line utility, kite-dataset.

First, you would infer the schema of your JSON:

kite-dataset json-schema sample-file.json -o schema.avsc

Then you can use that file to create a Parquet Hive table:

kite-dataset create mytable --schema schema.avsc --format parquet

And finally, you can load your JSON into the dataset.

kite-dataset json-import sample-file.json mytable

You can also import an entire directory stored in HDFS. In that case, Kite will use an MR job to do the import.

#2


2  

You can actually use Drill itself to create a Parquet file from the output of any query.

create table student_parquet as select * from `student.json`;

The above line should be good enough. Drill infers the types from the data in the fields. You can substitute your own query and create a Parquet file.
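Drill's type inference from field data can be illustrated with a small stdlib-only sketch (this is a hypothetical helper for illustration, not Drill's actual implementation):

```python
import json

def infer_column_types(records):
    """Infer a simple type name per column from a list of JSON records,
    similar in spirit to how Drill types fields by inspecting the data."""
    types = {}
    for rec in records:
        for key, value in rec.items():
            t = type(value).__name__
            prev = types.get(key)
            if prev is None:
                types[key] = t
            elif prev != t:
                # Records disagree on this field's type.
                types[key] = "mixed"
    return types

# Newline-delimited JSON, the shape Drill reads from a .json file.
rows = [json.loads(line) for line in [
    '{"name": "joe", "age": 21}',
    '{"name": "ann", "age": 22}',
]]
print(infer_column_types(rows))  # {'name': 'str', 'age': 'int'}
```

Because the types come from the data itself, records with conflicting types for the same field (e.g. a number in one row, a string in the next) are exactly the cases where schema-on-read can surprise you.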

#3


1  

To complement @rahul's answer, you can use Drill to do this, but I needed to add more to the query to get it working out of the box with Drill.

create table dfs.tmp.`filename.parquet` as select * from dfs.`/tmp/filename.json` t

I needed to give it the storage plugin (dfs). The "root" config can read from the whole disk but is not writable, while the tmp config (dfs.tmp) is writable and writes to /tmp, so I wrote there.

But the problem is that if the JSON is nested, or perhaps contains unusual characters, I would get a cryptic

org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR: java.lang.IndexOutOfBoundsException:

If I have a structure that looks like members: {id: 123, name: "joe"}, I would have to change the select to

select members.id as members_id, members.name as members_name

or

select members.id as `members.id`, members.name as `members.name`

to get it to work.

I assume the reason is that Parquet is a columnar store, so you need flat columns. JSON isn't flat by default, so you need to convert it.

The problem is that I have to know my JSON schema and build the select to include all the possibilities. I'd be happy if someone knows a better way to do this.
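The flattening that the explicit SELECT aliases perform can be sketched in plain Python (a hypothetical helper for illustration; it only shows the shape of the transformation, not what Drill does internally):

```python
def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names, mirroring what
    aliasing members.id as `members.id` does: one flat column per leaf."""
    out = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the column name.
            out.update(flatten(value, name + "."))
        else:
            out[name] = value
    return out

row = {"members": {"id": 123, "name": "joe"}}
print(flatten(row))  # {'members.id': 123, 'members.name': 'joe'}
```

Each leaf of the nested structure becomes one flat column, which is the columnar shape Parquet expects; the downside is the same as with the hand-written SELECT: you have to know every possible path in advance.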
