Handling small files in Spark (coalesce vs CombineFileInputFormat)

Date: 2022-02-03 00:52:45

I have a use case where millions of small files in S3 need to be processed by Spark. I see two options for reducing the number of tasks: 1. Use coalesce 2. Extend CombineFileInputFormat
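For reference, option 1 might look roughly like the sketch below. It assumes a live SparkContext is passed in as `sc` and that the caller already knows the total input size; `target_partitions`, `read_coalesced`, and the 128 MiB default are hypothetical choices for illustration, not anything from the Spark API. Note that `coalesce` without a shuffle only merges partitions that already exist, so Spark still has to plan at least one input partition per file before merging them.

```python
import math


def target_partitions(total_bytes, target_partition_bytes):
    """Number of partitions needed so each holds roughly target_partition_bytes."""
    return max(1, math.ceil(total_bytes / target_partition_bytes))


def read_coalesced(sc, path, total_bytes, target_partition_bytes=128 * 1024 * 1024):
    """Read text files, then merge the per-file partitions into fewer, larger ones.

    `sc` is an existing SparkContext; `total_bytes` is assumed to be known
    up front (for example from an S3 inventory listing).
    """
    rdd = sc.textFile(path)  # at least one partition per small file
    # coalesce() defaults to shuffle=False, so partitions are merged without a shuffle
    return rdd.coalesce(target_partitions(total_bytes, target_partition_bytes))
```

With these assumptions, 10 GiB of input at 128 MiB per partition would be coalesced to 80 partitions.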

But I'm not clear about the performance implications of each, or when to prefer one over the other.

Also, CombineFileInputFormat is an abstract class, which means I need to provide my own implementation. But the Spark API (newAPIHadoopRDD) takes the class name as a parameter, and I'm not sure how to pass a configurable maxSplitSize.
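One possible way around both problems, sketched here under assumptions rather than offered as a confirmed answer: Hadoop ships a concrete subclass, CombineTextInputFormat, and when no maximum split size has been set in code, CombineFileInputFormat falls back to the `mapreduce.input.fileinputformat.split.maxsize` property from the job configuration. In PySpark that configuration can be supplied through the `conf` argument of `newAPIHadoopFile`. The helper names `combine_conf` and `read_combined` and the 128 MiB default are hypothetical, and `sc` is assumed to be an existing SparkContext.

```python
# Hadoop property consulted by CombineFileInputFormat when no max split size is set in code.
SPLIT_MAXSIZE_KEY = "mapreduce.input.fileinputformat.split.maxsize"


def combine_conf(max_split_bytes):
    """Build the Hadoop configuration dict; the value is passed as a string."""
    return {SPLIT_MAXSIZE_KEY: str(max_split_bytes)}


def read_combined(sc, path, max_split_bytes=128 * 1024 * 1024):
    """Read many small text files, packing them into splits of at most max_split_bytes.

    Returns an RDD of (byte offset, line) pairs, as produced by CombineTextInputFormat.
    """
    return sc.newAPIHadoopFile(
        path,
        "org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf=combine_conf(max_split_bytes),
    )
```

Because the split size travels in the configuration rather than in the class, it stays configurable per job without writing a subclass.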

1 Solution

#1


Another good option to consider for such scenarios is SparkContext.wholeTextFiles(), which produces one record per file, with the file's path as the key and its content as the value. See the documentation.
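A minimal sketch of how that might be used, assuming an existing SparkContext is passed in as `sc`. The helpers `file_summary` and `summarize_small_files` are hypothetical names, and counting lines is only a stand-in for whatever per-file processing is actually needed. Since each file arrives as a single record, this approach assumes every file fits comfortably in memory.

```python
import posixpath


def file_summary(record):
    """Turn one (path, content) pair into (file name, number of lines)."""
    path, content = record
    return posixpath.basename(path), len(content.splitlines())


def summarize_small_files(sc, path, min_partitions=None):
    """Load a directory of small files and summarize each one.

    wholeTextFiles yields one (path, content) record per file;
    minPartitions is only a suggested minimum number of partitions.
    """
    files = sc.wholeTextFiles(path, minPartitions=min_partitions)
    return files.map(file_summary)
```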
