How to specify the file size of Hive EXTERNAL TABLE data on S3

Date: 2023-01-08 23:03:18

I can create an EXTERNAL TABLE in Hive where the data is stored in an S3 bucket in Gzip format. However, the files are very large (over 6GB each).


Can Hive be configured to make files in an EXTERNAL TABLE a specific size, say, 512MB?


1 solution

#1


This sounds odd to me; by default, my external tables usually end up with files of around 300MB each. In any case, the easiest way to tune this is to use a partition key (PARTITIONED BY, probably on something timestamp-based), which will force the files to be smaller and has the added advantage of making your data easier to query. You should also consider using a splittable format like Parquet, since then your file size won't really matter.
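As a rough sketch of what that might look like (the table name events_parquet, its columns, the source table raw_events, and the S3 path are all made up for illustration, not taken from the original post):

    -- Hypothetical partitioned, Parquet-backed external table on S3.
    CREATE EXTERNAL TABLE events_parquet (
      user_id STRING,
      payload STRING
    )
    PARTITIONED BY (event_date STRING)
    STORED AS PARQUET
    LOCATION 's3://your-bucket/events_parquet/';

    -- Repopulate it from the existing table using dynamic partitioning,
    -- so each partition (and its output files) stays relatively small.
    SET hive.exec.dynamic.partition=true;
    SET hive.exec.dynamic.partition.mode=nonstrict;

    INSERT OVERWRITE TABLE events_parquet PARTITION (event_date)
    SELECT user_id, payload, to_date(event_ts) AS event_date
    FROM raw_events;

The splittable-format point matters because gzip is not splittable: a single 6GB gzip file must be read end-to-end by one task, whereas Parquet lets Hive split each file across many tasks, so the exact file size stops being a bottleneck.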

