Using Spark's MLlib routines with pandas DataFrames

I have a fairly large data set (~20GB) stored on disk as a Pandas/PyTables HDFStore, and I want to run random forests and boosted trees on it. Doing this on my local machine takes forever, so I was thinking of farming it out to a Spark cluster I have access to and using MLlib routines instead.

While I have managed to load the pandas DataFrame as a Spark DataFrame, I'm a little confused about how to use it in MLlib routines. I'm not too familiar with MLlib, and it seems that it accepts only LabeledPoint data types.

I would appreciate any ideas / pointers / code that explain how to use (pandas or Spark) DataFrames as input to MLlib algorithms, either directly or indirectly by converting them to supported types.

Thanks.

1 Answer

#1

You need to convert the DataFrame to an RDD[LabeledPoint]. Note that a LabeledPoint is just a (label: Double, features: Vector) pair. Consider a mapping routine that grabs the values from each row:

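import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Column 0 is the label; columns 1..n are the features (all assumed to be doubles).
val rdd = df.rdd.map { row =>
  LabeledPoint(row.getDouble(0), Vectors.dense(row.getDouble(1),..., row.getDouble(n)))
}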

This returns an RDD[LabeledPoint], which you can pass to RandomForest.trainRegressor(...), for example. Have a look at the DataFrame API for details.
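
Since the question starts from pandas, the same mapping can probably be written in PySpark as well. The following is only a rough sketch: split_row and to_labeled_points are hypothetical helper names (not part of Spark), and it assumes every column is numeric and that the label sits in column 0 unless told otherwise.

```python
def split_row(row, label_index=0):
    """Split one row (any sequence of numbers) into (label, features)."""
    values = [float(v) for v in row]
    label = values[label_index]
    # Everything except the label column becomes the feature list.
    features = values[:label_index] + values[label_index + 1:]
    return label, features


def to_labeled_points(spark_df, label_index=0):
    """Convert a Spark DataFrame into an RDD of LabeledPoint.

    Assumption: all columns of spark_df are numeric.
    """
    # Imported here so split_row can be tried without a Spark installation.
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint

    def convert(row):
        label, features = split_row(row, label_index)
        return LabeledPoint(label, Vectors.dense(features))

    # A DataFrame exposes its rows as an RDD through the .rdd attribute.
    return spark_df.rdd.map(convert)
```

The resulting RDD could then be handed to something like pyspark.mllib.tree.RandomForest.trainRegressor(points, categoricalFeaturesInfo={}, numTrees=100), where the number of trees is just a placeholder value.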
