Incrementally adding data to a Parquet table in S3

Time: 2021-12-02 23:12:00

I would like to keep a copy of my log data in Parquet on S3 for ad hoc analytics. I mainly work with Parquet through Spark, which only seems to offer operations for reading and writing whole tables via SQLContext.parquetFile() and SQLContext.saveAsParquetFile().


Is there any way to add data to an existing Parquet table without writing a whole new copy of it, particularly when it is stored in S3?


I know I can create separate tables for the updates and form the union of the corresponding DataFrames in Spark at query time, but I have my doubts about the scalability of that approach.

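For concreteness, the query-time union I have in mind would look roughly like this (the bucket and paths are placeholders):

```scala
// sqlContext is an existing SQLContext; paths below are placeholders.
// Read the base table and the table of updates separately.
val base    = sqlContext.parquetFile("s3n://my-bucket/logs/base")
val updates = sqlContext.parquetFile("s3n://my-bucket/logs/updates")

// Combine them at query time; both must share the same schema.
val logs = base.unionAll(updates)
logs.registerTempTable("logs")
sqlContext.sql("SELECT COUNT(*) FROM logs").show()
```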

I can use something other than Spark if needed.


3 Solutions

#1


The way to append to a Parquet table is to use SaveMode.Append:


`yourDataFrame.write.mode(SaveMode.Append).parquet("/your/file")`

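For example, a daily batch of logs could be appended to an existing dataset on S3 roughly like this (the SparkContext, bucket, paths, and input format are placeholders):

```scala
import org.apache.spark.sql.{SQLContext, SaveMode}

// sc is an existing SparkContext; the bucket and paths are placeholders.
val sqlContext = new SQLContext(sc)

// Load today's batch of log records (JSON is just an example input format).
val newLogs = sqlContext.jsonFile("s3n://my-bucket/incoming/2021-12-02/")

// Append the batch to the existing Parquet table instead of rewriting it;
// each append adds new part files under the same path.
newLogs.write
  .mode(SaveMode.Append)
  .parquet("s3n://my-bucket/logs.parquet")
```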
#2


You don't need to union DataFrames after creating them separately; just supply all the paths relevant to your query to parquetFile(paths) and get back a single DataFrame, as the signature for reading Parquet files, sqlContext.parquetFile(paths: String*), suggests.


Under the hood, in ParquetRelation2, all the .parquet files from all the folders you supply, as well as all the _common_metadata and _metadata files, are collected into a single list and treated equally.

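For example (the paths are placeholders), each batch can be written to its own folder and all of them read back as a single DataFrame:

```scala
// Each batch of logs lives in its own folder; pass every relevant path at once.
val logs = sqlContext.parquetFile(
  "s3n://my-bucket/logs/2021-12-01/",
  "s3n://my-bucket/logs/2021-12-02/"
)
logs.registerTempTable("logs")
```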

#3


Amazon S3 does not support appending to an existing object. S3 is an object store that is not meant for write-intensive workloads; it is optimized for parallel reads.


The only ways of doing this are to slice your data into multiple files, or to use EC2 as a file server and append in one place only.

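A sketch of the slicing approach, with placeholder paths: write each incoming batch under its own Hive-style date prefix, then read the parent prefix back as one table.

```scala
import org.apache.spark.sql.{SQLContext, SaveMode}

// sc is an existing SparkContext; the bucket and paths are placeholders.
val sqlContext = new SQLContext(sc)

// Write each incoming batch under its own dt=... prefix; the batch owns
// that prefix outright, so nothing ever has to be appended in place.
val newLogs = sqlContext.jsonFile("s3n://my-bucket/incoming/2021-12-02/")
newLogs.write
  .mode(SaveMode.Overwrite)
  .parquet("s3n://my-bucket/logs/dt=2021-12-02/")

// Reading the parent prefix discovers every dt=... partition written so far
// and exposes dt as a column.
val allLogs = sqlContext.read.parquet("s3n://my-bucket/logs/")
```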
