I have an array A of size 200. A[i] = 1,000,000,000 means I need to write 1 billion entries of value i to the output file(s). For example, given A = [2, 3, 1, ...], the output file(s) should look like this:
0
0
1
1
1
2
2
...
Given such an array A, how can I write the output to files (part-r-00000, part-r-00001, part-r-00002, etc.) using Spark? I am using Spark 2.0.1 with Scala.
Thank you!
1 Answer
#1
0
I would probably approach this with the saveAsTextFile() method, which by default does what you want: it splits the output into multiple files, one file per partition of the RDD.
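A minimal sketch of that approach, under some assumptions: the object name `WriteCounts`, the helper `expand`, the three-element sample array, and the default output path are all hypothetical, and the job is assumed to be launched with spark-submit so that the master is set externally. Each (count, value) pair is placed in its own partition and expanded lazily, so partition i holds only the value i and the numbered part files follow the order of the values.

```scala
import org.apache.spark.sql.SparkSession

object WriteCounts {
  // Lazily yields `count` copies of `value` as text lines, so a
  // billion-entry run is never materialized in memory at once.
  def expand(count: Long, value: Int): Iterator[String] = {
    val line = value.toString
    Iterator.iterate(0L)(_ + 1L).takeWhile(_ < count).map(_ => line)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("WriteCounts").getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical small input; the real array has 200 counts.
    val a: Array[Long] = Array(2L, 3L, 1L)
    // Assumed output location, taken from the first argument if present.
    val outputDir = args.headOption.getOrElse("/tmp/write-counts-output")

    // One (count, value) pair per partition, so partition i holds value i.
    val lines = sc
      .parallelize(a.zipWithIndex.toSeq, numSlices = a.length)
      .flatMap { case (count, value) => expand(count, value) }

    // Writes one part file per partition under outputDir.
    lines.saveAsTextFile(outputDir)
    spark.stop()
  }
}
```

Note that saveAsTextFile() fails if the output directory already exists. As far as I know, it names its files part-00000, part-00001, and so on, rather than part-r-00000.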
Saving an RDD to a file
The maximum size of the files depends on the filesystem used, so, although I'm not 100% sure, I doubt there's an automatic way to pick the right file size.
Based on the code from that example, I would calculate NUM_PARTITIONS before calling .repartition(), using the number of entries and what you know about the filesystem (if you can get that information from system calls), or fall back to some default values.
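One way the NUM_PARTITIONS calculation could look, as a sketch only: the object `PartitionSizing`, the function `numPartitions`, and the idea of a fixed `entriesPerFile` budget are assumptions for illustration, not something the linked example defines. It is a plain ceiling division of the total entry count by the number of entries you are willing to put in one file.

```scala
object PartitionSizing {
  // Smallest number of files such that, with an even spread, no file
  // holds more than entriesPerFile entries (ceiling division).
  def numPartitions(totalEntries: Long, entriesPerFile: Long): Int = {
    require(totalEntries >= 0, "totalEntries must be non-negative")
    require(entriesPerFile > 0, "entriesPerFile must be positive")
    val full = totalEntries / entriesPerFile
    val needed = if (totalEntries % entriesPerFile == 0) full else full + 1L
    // Spark partition counts are Ints, and at least one partition is needed.
    math.min(math.max(needed, 1L), Int.MaxValue.toLong).toInt
  }
}
```

For 200 values of 1 billion entries each and a hypothetical budget of 100 million entries per file, `PartitionSizing.numPartitions(200L * 1000000000L, 100000000L)` gives 2000, which could then be passed to `.repartition()` before `saveAsTextFile()`. Keep in mind that `.repartition()` performs a full shuffle and does not preserve the order of the entries, so the sorted layout from the question would need a separate step.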