1、keyBy: turns an RDD into key-value pairs by applying the given function to each element to derive its key.
scala> val a = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
a: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[123] at parallelize at <console>:21

scala> val b = a.keyBy(_.length)
b: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[124] at keyBy at <console>:23

scala> b.collect
res80: Array[(Int, String)] = Array((3,dog), (6,salmon), (6,salmon), (3,rat), (8,elephant))
2、groupBy(identity): groups elements by their own value, so duplicates fall into the same bucket, producing key-value pairs where the key is the element and the value is the collection of equal elements; a sketch follows below.
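A minimal sketch in the same spark-shell style, reusing the sample list from the keyBy example above; the variable names words and buckets are illustrative, and the commented result only shows the expected shape, not output copied from a real session:

// groupBy(identity) uses each element itself as the grouping key.
val words = sc.parallelize(List("dog", "salmon", "salmon", "rat", "elephant"), 3)
val buckets = words.groupBy(identity)   // RDD[(String, Iterable[String])]
buckets.collect()
// Expected shape (ordering may differ):
// Array((dog,CompactBuffer(dog)), (salmon,CompactBuffer(salmon, salmon)),
//       (rat,CompactBuffer(rat)), (elephant,CompactBuffer(elephant)))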