I have partitioned data stored in S3 in Hive format like this:
bucket/year=2017/month=3/date=1/filename.json
bucket/year=2017/month=3/date=2/filename1.json
bucket/year=2017/month=3/date=3/filename2.json
Every partition has around 1,000,000 records. I have created a table and partitions in Athena for this.
Now I am running this query from Athena:
select count(*) from mts_data_1 where year='2017' and month='3' and date='1'
This query takes 1800 seconds to scan 1,000,000 records.
So my question is: how can I improve this query's performance?
1 Solution
#1
I think the problem is that Athena has to read so many files from S3. 250 MB isn't much data, but 1,000,000 files is a lot of files. Athena query performance will improve dramatically if you reduce the number of files, and compressing the aggregated files will help some more. How many files do you need for one day's partition? Even with one-minute resolution, you would need fewer than 1,500 files. Given that the current query time is ~30 minutes, you could easily start with far fewer than that.
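For a concrete sense of what reducing the file count looks like, here is a minimal one-off compaction sketch (my illustration, not part of the original answer), assuming the files are newline-delimited JSON; the bucket, prefix, and output key are placeholders:

```python
# Merge the many small JSON files under one day's partition into a single
# gzipped object, so Athena opens one file instead of thousands.
import gzip
import boto3

s3 = boto3.client("s3")
bucket = "bucket"                         # placeholder bucket name
prefix = "year=2017/month=3/date=1/"      # one day's partition
merged_lines = []

# Collect every small JSON file in the partition.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
    for obj in page.get("Contents", []):
        body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
        merged_lines.append(body.rstrip(b"\n"))

# Write one compressed object back; Athena reads .gz JSON transparently.
s3.put_object(
    Bucket=bucket,
    Key=prefix + "merged-000.json.gz",
    Body=gzip.compress(b"\n".join(merged_lines) + b"\n"),
)
# Afterwards, delete or relocate the original small files, otherwise Athena
# will scan both the merged object and the originals.
```

A one-off script like this helps with an existing backlog; for data that keeps arriving, an ongoing pipeline is the better fix, which is what the options below are for.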
There are many options for aggregating and compressing your records:
- AWS's Kinesis Firehose is a fairly simple way to start on exactly this sort of problem (see the buffering sketch after this list).
- A streaming data processing tool like Apache NiFi would offer a richer set of transformation, aggregation, and compression options. I've written a blog post about using Apache NiFi to stream data to S3 for Athena, covering these same issues.
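If you go the Firehose route, the knobs that matter for Athena are the buffering hints and the compression format. A minimal sketch with boto3, assuming a placeholder stream name and placeholder role and bucket ARNs (not from the original answer):

```python
# Create a delivery stream that buffers incoming records and writes them to S3
# as larger, gzipped objects instead of one tiny file per record.
import boto3

firehose = boto3.client("firehose")
firehose.create_delivery_stream(
    DeliveryStreamName="mts-data-stream",  # placeholder name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-s3-role",  # placeholder
        "BucketARN": "arn:aws:s3:::bucket",                            # placeholder
        "Prefix": "mts_data_1/",       # Firehose appends a time-based key prefix
        "BufferingHints": {
            "SizeInMBs": 128,          # flush when the buffer reaches 128 MB...
            "IntervalInSeconds": 900,  # ...or every 15 minutes, whichever comes first
        },
        "CompressionFormat": "GZIP",
    },
)
```

With a 15-minute buffer interval you end up with on the order of a hundred objects per day rather than hundreds of thousands, comfortably under the ~1,500-file figure above.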