How to do stratified sampling with Spark DataFrames? [duplicate]

This question already has an answer here:

Stratified sampling in Spark (2 answers)

I'm on Spark 1.3.0 and my data is in DataFrames. I need operations like sampleByKey() and sampleByKeyExact(). I saw the JIRA "Add approximate stratified sampling to DataFrame" (https://issues.apache.org/jira/browse/SPARK-7157), which is targeted for Spark 1.5. Until that lands, what's the easiest way to accomplish the equivalent of sampleByKey() and sampleByKeyExact() on DataFrames? Thanks & Regards, MK

1 answer

#1

Spark 1.1 added the stratified sampling routines sampleByKey and sampleByKeyExact to Spark Core, so since then they have been available without any MLlib dependency.

Both are methods of PairRDDFunctions, so they are only available on key-value RDDs of type RDD[(K, V)]. DataFrames have no notion of keys, so you have to drop down to the underlying RDD and key it yourself, along these lines:

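val df = ... // your DataFrame
// Sampling fraction for each key; every key present in the data needs an entry.
// The key type is Any because Row.apply returns Any.
val fractions: Map[Any, Double] = ...

// Key each Row by its first column, then sample without replacement.
val sample = df.rdd.keyBy(row => row(0)).sampleByKey(false, fractions)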

Note that sample is now an RDD of (key, Row) pairs, not a DataFrame. You can easily convert it back because df already carries the schema, for example with sqlContext.createDataFrame(sample.values, df.schema).
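To tie the steps together, here is a minimal sketch of a helper that keys the rows, samples each stratum, and rebuilds a DataFrame with the original schema. The object name StratifiedSampleSketch, the function stratifiedSample, and its parameters (keyColumn, exact, seed) are made up for illustration and are not part of Spark. It assumes the Spark 1.3 APIs and an existing SQLContext.

```scala
import org.apache.spark.sql.{DataFrame, Row, SQLContext}

// Hypothetical helper, not a Spark API: a sketch under the assumptions stated above.
object StratifiedSampleSketch {

  def stratifiedSample(
      sqlContext: SQLContext,
      df: DataFrame,
      keyColumn: String,             // column whose values define the strata
      fractions: Map[Any, Double],   // must contain an entry for every key in the data
      exact: Boolean = false,        // true: sampleByKeyExact (extra passes over the data)
      seed: Long = 42L): DataFrame = {

    val keyIndex = df.schema.fieldNames.indexOf(keyColumn)
    require(keyIndex >= 0, s"Column '$keyColumn' not found in schema")

    // DataFrame -> RDD[Row] -> RDD[(Any, Row)], keyed by the stratum column.
    val keyed = df.rdd.keyBy((row: Row) => row.get(keyIndex))

    // Sample without replacement within each stratum.
    val sampled =
      if (exact) keyed.sampleByKeyExact(false, fractions, seed)
      else keyed.sampleByKey(false, fractions, seed)

    // Drop the keys and reattach the original schema.
    sqlContext.createDataFrame(sampled.values, df.schema)
  }
}
```

A call might look like StratifiedSampleSketch.stratifiedSample(sqlContext, df, "label", Map[Any, Double]("a" -> 0.1, "b" -> 0.5)), where "label" and the key values are placeholders for your own data. The map is typed Map[Any, Double] because the keys extracted from a Row are of type Any. If you are able to upgrade to Spark 1.5 or later, df.stat.sampleBy should cover the approximate case directly on DataFrames.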
