Convert Spark array features into a flat array

Time: 2021-12-31 21:32:05

Following on from my earlier question, Convert a Spark Vector of features into an array, I've made progress:

// Assumed: SDV aliases Spark's DenseVector, e.g.
// import org.apache.spark.mllib.linalg.{DenseVector => SDV}
def extractUdf = udf((v: SDV) => v.toArray)
val temp: DataFrame = dataWithFeatures.withColumn("extracted_features", extractUdf($"features"))

temp.printSchema()

val featuresArray1: Array[Double] = temp.rdd.map(r => r.getAs[Double](0)).collect
val featuresArray2: Array[Double] = temp.rdd.map(r => r.getAs[Double](1)).collect
val featuresArray3: Array[Double] = temp.rdd.map(r => r.getAs[Double](2)).collect

val allfeatures: Array[Array[Double]] = Array(featuresArray1, featuresArray2, featuresArray3)
val flatfeatures: Array[Double] = allfeatures.flatten
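
As a quick sanity check of what flatten does here, in plain Scala with made-up values (no Spark needed):

```scala
// Three per-column arrays, standing in for the three collect calls above
// (values are made up for illustration).
val cols: Array[Array[Double]] = Array(Array(1.0, 2.0), Array(3.0, 4.0), Array(5.0, 6.0))

// flatten concatenates the inner arrays end to end, in order.
val flat: Array[Double] = cols.flatten
```

So the flattened result keeps the per-column order: all of featuresArray1 first, then featuresArray2, and so on.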

This seems to give the result I want. The extractUdf function turns features: vector into extracted_features:

 |-- features: vector (nullable = true)
 |-- extracted_features: array (nullable = true)
 |    |-- element: double (containsNull = false)

However, I don't understand why my next three lines of code (building featuresArray1, featuresArray2, and featuresArray3) pick up extracted_features rather than any other column in temp (such as features), and how to obtain the array indices (0, 1, 2) in a way that is derived from the number of features rather than hard-coded. Thanks for your help!

1 Answer

#1


Let's say you have a dataframe

+---+-------------+
|id |features     |
+---+-------------+
|1  |[1.0,2.0,3.0]|
|2  |[3.0,4.0,8.0]|
+---+-------------+

with schema

root
 |-- id: integer (nullable = false)
 |-- features: vector (nullable = true)

and you've extracted the vector features into an array column by doing

import org.apache.spark.sql.functions._
import org.apache.spark.mllib.linalg.DenseVector
def extractUdf = udf((v: DenseVector) => v.toArray)
val temp = dataWithFeatures.withColumn("extracted_features", extractUdf($"features"))

which would give

+---+-------------+------------------+
|id |features     |extracted_features|
+---+-------------+------------------+
|1  |[1.0,2.0,3.0]|[1.0, 2.0, 3.0]   |
|2  |[3.0,4.0,8.0]|[3.0, 4.0, 8.0]   |
+---+-------------+------------------+

root
 |-- id: integer (nullable = false)
 |-- features: vector (nullable = true)
 |-- extracted_features: array (nullable = true)
 |    |-- element: double (containsNull = false)

Now referencing elements of the extracted_features array column works just as with other array types in Scala. So you can do

temp.withColumn("firstValue", $"extracted_features"(0))
  .withColumn("secondValue", $"extracted_features"(1))
  .withColumn("thirdValue", $"extracted_features"(2))

which would give you

+---+-------------+------------------+----------+-----------+----------+
|id |features     |extracted_features|firstValue|secondValue|thirdValue|
+---+-------------+------------------+----------+-----------+----------+
|1  |[1.0,2.0,3.0]|[1.0, 2.0, 3.0]   |1.0       |2.0        |3.0       |
|2  |[3.0,4.0,8.0]|[3.0, 4.0, 8.0]   |3.0       |4.0        |8.0       |
+---+-------------+------------------+----------+-----------+----------+

root
 |-- id: integer (nullable = false)
 |-- features: vector (nullable = true)
 |-- extracted_features: array (nullable = true)
 |    |-- element: double (containsNull = false)
 |-- firstValue: double (nullable = true)
 |-- secondValue: double (nullable = true)
 |-- thirdValue: double (nullable = true)
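
The question also asked how to avoid hard-coding the indices 0, 1, 2. One option (a sketch, not part of the original answer) is to read the array length from one row and fold withColumn over the index range, e.g. (0 until n).foldLeft(temp)((df, i) => df.withColumn(s"value_$i", $"extracted_features"(i))). The fold-over-indices pattern itself can be shown in plain Scala; the names rows and value_i below are illustrative:

```scala
// In-memory stand-in for the extracted_features column.
val rows: Seq[Array[Double]] = Seq(Array(1.0, 2.0, 3.0), Array(3.0, 4.0, 8.0))

// Derive the feature count from the data rather than hard-coding 0, 1, 2.
val numFeatures: Int = rows.head.length

// One named "column" per index, mirroring the repeated withColumn calls above.
val columns: Seq[(String, Seq[Double])] =
  (0 until numFeatures).map(i => s"value_$i" -> rows.map(_(i)))

columns.foreach { case (name, values) => println(s"$name: ${values.mkString(", ")}") }
```

With a real DataFrame the same 0 until numFeatures range drives the withColumn calls, so the code no longer breaks when the number of features changes.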

I hope the answer is helpful.
