I have a fairly large data set (~20GB) stored on disk as a Pandas/PyTables HDFStore, and I want to run random forests and boosted trees on it. Doing this on my local machine takes forever, so I was thinking of farming it out to a Spark cluster I have access to and using MLlib routines instead.
While I have managed to load the pandas dataframe as a Spark DataFrame, I'm a little confused about how to use it in MLlib routines. I'm not too familiar with MLlib, and it seems to accept only LabeledPoint data types.
I would appreciate any ideas / pointers / code that explain how to use (pandas or Spark) dataframes as input to MLlib algorithms, either directly or indirectly by converting them to supported types.
Thanks.
1 Solution
#1
You need to convert the DataFrame to an RDD[LabeledPoint]. Note that a LabeledPoint is just a (label: Double, features: Vector). Consider a mapping routine that grabs values from each row:
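import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
// Column 0 holds the label; columns 1..n hold the features.
val rdd = df.rdd.map { row =>
  LabeledPoint(row.getDouble(0), Vectors.dense(row.getDouble(1), ..., row.getDouble(n)))
}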
This will return an RDD[LabeledPoint], which you can pass to RandomForest.trainRegressor(...), for example. Have a look at the DataFrame API for details.
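
To make the conversion and the training call concrete, here is a minimal end-to-end sketch. It rests on several assumptions that are not part of the answer above: Spark 2.x or later (for SparkSession), a toy DataFrame standing in for the real data, hypothetical column names ("label", "f1", "f2"), every column already of type Double, and arbitrary hyperparameters chosen only so the example runs.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.tree.RandomForest

object DataFrameToMLlib {
  def main(args: Array[String]): Unit = {
    // Local session for illustration; on a cluster the master is set by spark-submit.
    val spark = SparkSession.builder()
      .appName("df-to-mllib")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for the real data (hypothetical column names, all Double).
    val df = Seq(
      (1.0, 0.5, 2.0),
      (0.0, 1.5, 0.3),
      (1.0, 0.7, 1.9),
      (0.0, 1.4, 0.1)
    ).toDF("label", "f1", "f2")

    // Every column except the label is treated as a feature.
    val featureCols = df.columns.filter(_ != "label")

    // Map each Row to a LabeledPoint, giving an RDD[LabeledPoint].
    val points = df.rdd.map { row =>
      val features = featureCols.map(c => row.getAs[Double](c))
      LabeledPoint(row.getAs[Double]("label"), Vectors.dense(features))
    }

    // Hyperparameters below are illustrative, not tuned.
    val model = RandomForest.trainRegressor(
      points,
      Map[Int, Int](), // categoricalFeaturesInfo: empty means all features are continuous
      10,              // numTrees
      "auto",          // featureSubsetStrategy
      "variance",      // impurity (the regression criterion)
      4,               // maxDepth
      32,              // maxBins
      42               // seed
    )

    // Score one hypothetical feature vector.
    println(model.predict(Vectors.dense(0.6, 2.0)))

    spark.stop()
  }
}
```

Looking columns up by name rather than by position keeps the mapping stable if the column order changes. If some columns are not stored as Double (integers, for instance), they would need to be cast before the getAs[Double] calls.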