I want to split a DataFrame based on a boolean column.
I've come up with:
def partition(df: DataFrame, c: Column): (DataFrame, DataFrame) =
(df.filter(c === true), df.filter(c === false))
Note: in my use case, c is a UDF.
Is there a better way?
I'd like:
- to avoid scanning the DataFrame twice
- to avoid ugly boolean tests
Here is an example:
@ val df = sc.parallelize(Seq(1,2,3,4)).toDF("i")
df: org.apache.spark.sql.DataFrame = [i: int]
@ val u = udf((i: Int) => i % 2 == 0)
u: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(<function1>, BooleanType, List(IntegerType))
@ partition(df, u($"i"))
res25: (org.apache.spark.sql.DataFrame, org.apache.spark.sql.DataFrame) = ([i: int], [i: int])
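Conceptually, the question is asking for a single-pass split: one scan that routes each row to one of two outputs based on the predicate. A minimal pure-Python sketch of that idea on plain lists (illustration only — Spark's DataFrame API does not expose a one-scan partition directly, which is exactly the gap the question is about):

```python
def partition_once(rows, pred):
    # One pass over the data: route each row by the boolean predicate,
    # instead of filtering the same collection twice.
    trues, falses = [], []
    for row in rows:
        (trues if pred(row) else falses).append(row)
    return trues, falses

# Same data and predicate as the Spark example above (i % 2 == 0).
print(partition_once([1, 2, 3, 4], lambda i: i % 2 == 0))  # ([2, 4], [1, 3])
```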
1 Answer
#1
Use combineByKey on the boolean column:

data.combineByKey(lambda value: (value, 1),
                  lambda x, value: (x[0] + value, x[1] + 1),
                  lambda x, y: (x[0] + y[0], x[1] + y[1]))
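For intuition: the three lambdas above are combineByKey's createCombiner / mergeValue / mergeCombiners, and together they build a (sum, count) pair per key — an aggregation grouped by the boolean, not a row-level split. A pure-Python simulation of that per-key aggregation (no Spark required; `combine_by_key` and its inputs are illustrative, not Spark API):

```python
def create_combiner(value):
    # First value seen for a key: start a (sum, count) pair.
    return (value, 1)

def merge_value(acc, value):
    # Fold another value for the same key into the running pair.
    return (acc[0] + value, acc[1] + 1)

def merge_combiners(a, b):
    # Combine partial results (in Spark, from different partitions).
    return (a[0] + b[0], a[1] + b[1])

def combine_by_key(pairs):
    # pairs: iterable of (key, value); returns {key: (sum, count)}.
    out = {}
    for k, v in pairs:
        out[k] = merge_value(out[k], v) if k in out else create_combiner(v)
    return out

data = [(True, 2), (False, 1), (True, 4), (False, 3)]
print(combine_by_key(data))  # {True: (6, 2), False: (4, 2)}
```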