The difference between groupByKey and aggregateByKey in Spark

1. Function usage

(1) groupByKey usage

groupByKey(numPartitions)
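
As a minimal sketch of the call above, the following groups a small pair RDD into 2 output partitions. The object name, the sample data, the app name, and the local[2] master are all assumptions made for illustration; values are sorted only so the printed output is stable.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object GroupByKeySketch {
  // Hypothetical helper: groups the sample values by key into 2 output partitions.
  def groupedValues(sc: SparkContext): Array[(String, List[Int])] = {
    // Sample (key, value) records spread over 3 input partitions (assumed data).
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)), 3)
    pairs.groupByKey(2)             // RDD[(String, Iterable[Int])]
      .mapValues(_.toList.sorted)   // materialize and sort each group for stable output
      .collect()
      .sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only.
    val spark = SparkSession.builder().appName("groupByKey-sketch").master("local[2]").getOrCreate()
    try groupedValues(spark.sparkContext).foreach(println)
    finally spark.stop()
  }
}
```

Note that groupByKey returns every value for a key as an Iterable, so any reduction (sum, count, and so on) happens only after the shuffle.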

(2) aggregateByKey function prototypes

def aggregateByKey[U: ClassTag](zeroValue: U, partitioner: Partitioner)
    (seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]
def aggregateByKey[U: ClassTag](zeroValue: U, numPartitions: Int)
    (seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]
def aggregateByKey[U: ClassTag](zeroValue: U)
    (seqOp: (U, V) => U, combOp: (U, U) => U): RDD[(K, U)]
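
To make the roles of zeroValue, seqOp, and combOp concrete, here is a small sketch that computes a per-key average using the single-argument overload. The result type U is a (sum, count) pair, which differs from the value type V (Int). The object name, sample data, and local master are assumptions for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object AggregateByKeySketch {
  // Hypothetical helper: per-key average built from (sum, count) accumulators.
  def averages(sc: SparkContext): Array[(String, Double)] = {
    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 6), ("a", 5)), 3)
    val zero = (0, 0) // zeroValue: (running sum, running count)
    val sumCount = pairs.aggregateByKey(zero)(
      (acc, v) => (acc._1 + v, acc._2 + 1), // seqOp: fold one value into a partition-local accumulator
      (x, y) => (x._1 + y._1, x._2 + y._2)  // combOp: merge accumulators from different partitions
    )
    sumCount
      .mapValues { case (sum, count) => sum.toDouble / count }
      .collect()
      .sortBy(_._1)
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only.
    val spark = SparkSession.builder().appName("aggregateByKey-sketch").master("local[2]").getOrCreate()
    try averages(spark.sparkContext).foreach(println)
    finally spark.stop()
  }
}
```

With this sample data, key a averages 3.0 and key b averages 4.0.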

(3) distinct usage

distinct(numPartitions)
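
A short sketch of the call above: it deduplicates a small RDD into 2 output partitions and also reports the resulting partition count. The object name, sample data, and local master are assumptions for illustration.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession

object DistinctSketch {
  // Hypothetical helper: returns the sorted unique values and the number of output partitions.
  def uniqueValues(sc: SparkContext): (Array[Int], Int) = {
    val values = sc.parallelize(Seq(3, 1, 3, 2, 1, 2, 3), 3) // assumed sample data, 3 input partitions
    val unique = values.distinct(2)                          // deduplicate into 2 partitions
    (unique.collect().sorted, unique.getNumPartitions)
  }

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only.
    val spark = SparkSession.builder().appName("distinct-sketch").master("local[2]").getOrCreate()
    try {
      val (vals, parts) = uniqueValues(spark.sparkContext)
      println(vals.mkString(", ") + " in " + parts + " partitions")
    } finally spark.stop()
  }
}
```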

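2. Differences in how they work

(1) groupByKey() shuffles all the data in the RDD, routing each record to a partition according to its key, and only then aggregates.

(2) aggregateByKey() first aggregates the data inside each partition by key, then shuffles those partial results and merges them across partitions. Compared with groupByKey(), it therefore sends far less data through the shuffle, because at most one partial result per key leaves each partition.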

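(3) distinct() also requires a shuffle in the general case. In the RDD API it is implemented on top of reduceByKey, so duplicates inside each partition are dropped before the shuffle and the remaining duplicates are removed afterwards, rather than every record being shuffled first.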
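
The difference between (1) and (2) can be sketched without Spark by modeling partitions as plain Scala collections and counting how many records would cross the shuffle. This is a simplified model under stated assumptions, not Spark's actual implementation: the partition contents are made up, and the operation is a per-key sum with zeroValue 0.

```scala
object ShuffleVolumeSketch {
  // Three hypothetical input partitions of (key, value) records.
  val partitions: Seq[Seq[(String, Int)]] = Seq(
    Seq(("a", 1), ("a", 2), ("b", 3)),
    Seq(("a", 4), ("b", 5), ("b", 6)),
    Seq(("a", 7), ("a", 8), ("a", 9))
  )

  // groupByKey style: every record crosses the shuffle, then values are summed per key.
  // Returns (records shuffled, per-key sums).
  def shuffleAll(): (Int, Map[String, Int]) = {
    val shuffled = partitions.flatten
    val sums = shuffled.groupBy(_._1).map { case (k, recs) => k -> recs.map(_._2).sum }
    (shuffled.size, sums)
  }

  // aggregateByKey style: fold inside each partition first (seqOp),
  // shuffle only the partial results, then merge them (combOp).
  def combineFirst(): (Int, Map[String, Int]) = {
    val partials: Seq[Map[String, Int]] = partitions.map { part =>
      part.foldLeft(Map.empty[String, Int]) { case (acc, (k, v)) =>
        acc.updated(k, acc.getOrElse(k, 0) + v) // seqOp starting from zeroValue 0
      }
    }
    val shuffled = partials.flatMap(_.toSeq) // one record per (partition, key)
    val sums = shuffled.groupBy(_._1).map { case (k, recs) => k -> recs.map(_._2).sum } // combOp
    (shuffled.size, sums)
  }

  def main(args: Array[String]): Unit = {
    println(shuffleAll())
    println(combineFirst())
  }
}
```

Both approaches produce the same sums (31 for a, 14 for b), but in this model the first moves 9 records through the shuffle and the second only 5. The gap grows with the number of repeated keys per partition.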
