I have two dataframes called left and right.
scala> left.printSchema
root
|-- user_uid: double (nullable = true)
|-- labelVal: double (nullable = true)
|-- probability_score: double (nullable = true)
scala> right.printSchema
root
|-- user_uid: double (nullable = false)
|-- real_labelVal: double (nullable = false)
Then I join them with a left outer join to get the joined DataFrame. natjoin is a custom helper function; anyone interested can find it here:
https://gist.github.com/anonymous/f02bd79528ac75f57ae8
scala> val joinedData = natjoin(predictionDataFrame, labeledObservedDataFrame, "left_outer")
scala> joinedData.printSchema
|-- user_uid: double (nullable = true)
|-- labelVal: double (nullable = true)
|-- probability_score: double (nullable = true)
|-- real_labelVal: double (nullable = false)
Since it is a left outer join, the real_labelVal column has nulls when user_uid is not present in right.
scala> val realLabelVal = joinedData.select("real_labelval").distinct.collect
realLabelVal: Array[org.apache.spark.sql.Row] = Array([0.0], [null])
I want to replace the null values in the real_labelVal column with 1.0.
Currently I do the following:
- I find the index of the real_labelval column and use the spark.sql.Row API to set the nulls to 1.0. (This gives me an RDD[Row].)
- Then I apply the schema of the joined dataframe to get the cleaned dataframe.
The code is as follows:
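import org.apache.spark.sql.Row

// Position of real_labelVal in joinedData's schema
val real_labelval_index = 3

// Return a copy of the row with the real_labelVal slot set to 1.0
def replaceNull(row: Row): Row = {
  val rowArray = row.toSeq.toArray
  rowArray(real_labelval_index) = 1.0
  Row.fromSeq(rowArray)
}

// Rewrite only the rows where real_labelVal is null
val cleanRowRDD = joinedData.rdd.map(row => if (row.isNullAt(real_labelval_index)) replaceNull(row) else row)
// Re-apply the original schema to get a DataFrame back
val cleanJoined = sqlContext.createDataFrame(cleanRowRDD, joinedData.schema)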
Is there an elegant or efficient way to do this?
Googling hasn't helped much. Thanks in advance.
1 answer
#1
24
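Have you tried using na? DataFrame.na returns a DataFrameNaFunctions object, and its fill method replaces nulls in the named columns: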
joinedData.na.fill(1.0, Seq("real_labelval"))
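For anyone who wants to try this outside the original job, here is a minimal sketch. It assumes Spark 2.x or later (it uses SparkSession rather than the sqlContext from the question) and a tiny made-up DataFrame standing in for joinedData; the object name FillNullSketch and the helpers sampleJoined and fillRealLabel are hypothetical names chosen for illustration.

```scala
import org.apache.spark.sql.{DataFrame, Row, SparkSession}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

object FillNullSketch {
  // Hypothetical stand-in for joinedData: two rows, one with a null label.
  def sampleJoined(spark: SparkSession): DataFrame = {
    val schema = StructType(Seq(
      StructField("user_uid", DoubleType, nullable = true),
      StructField("real_labelVal", DoubleType, nullable = true)))
    val rows = Seq(Row(1.0, 0.0), Row(2.0, null))
    spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)
  }

  // Replace nulls in real_labelVal with 1.0; other columns are left untouched.
  def fillRealLabel(df: DataFrame): DataFrame =
    df.na.fill(1.0, Seq("real_labelVal"))

  def main(args: Array[String]): Unit = {
    // Local single-threaded session, only for trying the sketch out.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("fill-null-sketch")
      .getOrCreate()
    try fillRealLabel(sampleJoined(spark)).show()
    finally spark.stop()
  }
}
```

Compared with the RDD[Row] approach in the question, this stays in the DataFrame API, so there is no column index to track and no schema to re-apply. If several columns need different defaults, na.fill also accepts a Map from column name to replacement value.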