Partition an RDD into partitions with a fixed number of elements each

Time: 2021-12-01 23:09:12

In Apache Spark,

repartition(n) - allows partitioning the RDD into exactly n partitions.

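For context, a quick way to see what repartition alone produces (a small sketch using only the standard RDD calls repartition and glom; the exact grouping is shuffle-dependent and varies between runs):

    # repartition(4) balances the four partitions only approximately and
    # gives no control over which elements end up in the same partition.
    C = sc.parallelize([x for x in range(10)], 2)
    print(C.repartition(4).glom().collect())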

But how can the given RDD be partitioned so that every partition (except possibly the last one) holds a specified number of elements, given that the number of elements in the RDD is not known in advance and calling .count() is expensive?

    C = sc.parallelize([x for x in range(10)], 2)
    # Let's say internally, C = [[0, 1, 2, 3, 4, 5], [6, 7, 8, 9]]
    C = someCode(3)

Expected:

    C = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]

1 Answer

#1


Quite easy in pyspark:

    C = sc.parallelize([x for x in range(10)], 2)
    # partitionBy needs a pair RDD, so key each element by itself
    rdd = C.map(lambda x: (x, x))
    # custom partition function: keys 0-9 map to partitions 0-3 in groups of three
    C_repartitioned = rdd.partitionBy(4, lambda x: int(x * 4 / 11)).map(lambda x: x[0]).glom().collect()
    C_repartitioned

    [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
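
To see why the partitioner lands the elements in groups of three, the partition index that int(x * 4 / 11) assigns to each value can be checked in plain Python:

    # partition index assigned to each value 0..9 by the custom partition function
    [int(x * 4 / 11) for x in range(10)]
    # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3]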

It is called custom partitioning. More on that: http://sparkdatasourceapi.blogspot.ru/2016/10/patitioning-in-spark-writing-custom.html

http://baahu.in/spark-custom-partitioner-java-example/
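
The trick above works because the elements are themselves consecutive integers starting at 0, so they can double as partition keys. For arbitrary elements, one possible generalisation (a sketch, not part of the original answer; repartition_fixed_size is a hypothetical helper name) is to key every element by a global index from zipWithIndex and partition on index // chunk_size. Note that zipWithIndex runs a Spark job to learn the partition sizes and the sketch also calls count(), so it does not avoid the count the question hoped to skip:

    import math

    def repartition_fixed_size(rdd, chunk_size):
        # Hypothetical helper: regroup `rdd` into partitions of `chunk_size`
        # elements each; the last partition may hold fewer. Ordering inside
        # a partition is whatever the shuffle delivers, so it is not guaranteed.
        indexed = rdd.zipWithIndex().map(lambda xi: (xi[1], xi[0])).cache()
        n = indexed.count()  # needs the total size
        num_parts = max(1, math.ceil(n / chunk_size))
        return (indexed
                .partitionBy(num_parts, lambda idx: idx // chunk_size)
                .map(lambda kv: kv[1]))

    C = sc.parallelize([x for x in range(10)], 2)
    repartition_fixed_size(C, 3).glom().collect()
    # four partitions of sizes 3, 3, 3, 1, e.g. [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]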
