I have a use case where I have millions of small files in S3 that need to be processed by Spark. I see two options to reduce the number of tasks: 1. Use coalesce 2. Extend CombineFileInputFormat
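For reference, option 1 is roughly what I have in mind below (the bucket path and partition count are just placeholders, not my real values):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Option 1 sketch: read everything, then coalesce down to fewer partitions.
val sc = new SparkContext(new SparkConf().setAppName("coalesce-small-files"))

val lines = sc.textFile("s3a://my-bucket/small-files/*")  // placeholder path

// coalesce without a shuffle only merges existing partitions; Spark still
// lists and opens every small file, it just runs fewer downstream tasks.
val merged = lines.coalesce(1000)
```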
But I'm not clear about the performance implications of the two approaches, or when to use one over the other.
Also, CombineFileInputFormat is an abstract class, which means I need to provide my own implementation. But the Spark API (newAPIHadoopRDD) takes the class name as a parameter, so I'm not sure how to pass a configurable maxSplitSize.
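What I'm imagining (just a sketch, I haven't verified it) is passing the split size through the Hadoop Configuration handed to newAPIHadoopFile, together with a concrete subclass such as CombineTextInputFormat; the path and size below are placeholders:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("combine-small-files"))

// Copy the existing Hadoop config and cap each combined split (in bytes).
val hadoopConf = new Configuration(sc.hadoopConfiguration)
hadoopConf.setLong("mapreduce.input.fileinputformat.split.maxsize", 128L * 1024 * 1024)

// CombineTextInputFormat is a concrete CombineFileInputFormat subclass
// that packs many small files into each split.
val rdd = sc.newAPIHadoopFile(
  "s3a://my-bucket/small-files/*",  // placeholder path
  classOf[CombineTextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  hadoopConf)
```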
1 Answer
#1
Another great option to consider for such scenarios is SparkContext.wholeTextFiles(), which makes one record for each file, with its name as the key and the content as the value -- see Documentation.
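A rough usage sketch (the path and minPartitions value are placeholders to tune for your data):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("whole-text-files"))

// Each record is (file path, entire file content) as strings.
val files = sc.wholeTextFiles("s3a://my-bucket/small-files/", minPartitions = 200)

// Example: file sizes in characters, keyed by path.
files.mapValues(_.length).take(5).foreach(println)
```

Keep in mind that each file becomes a single in-memory record, so this works best when the individual files really are small.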