I am using Spark MLlib with Hadoop in a big data analytics application. I have a feature set of 41 features plus one label. While training, I want to mix and match features (feature engineering) to find the minimal set of features best suited to my scenario.
To do this, I want to select at training time which features to use while training and testing for model accuracy.
I am doing this:
JavaRDD<LabeledPoint>[] splits = data.randomSplit(new double[] { 0.5, 0.5 });
JavaRDD<LabeledPoint> trainingData = splits[0];
JavaRDD<LabeledPoint> testData = splits[1];
and later training different models using that data:
modelLR = new LogisticRegressionWithLBFGS().setNumClasses(numClasses).run(trainingData.rdd());
modelRF = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed);
modelNB = NaiveBayes.train(trainingData.rdd(), 1.0);
modelGBT = GradientBoostedTrees.train(trainingData, boostingStrategy);
modelDT = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins);
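Since the goal is testing for model accuracy, here is a minimal sketch (not from the original post; it assumes the testData split above and Java 8 lambdas) of how one of these models might be scored on the held-out half. The other models can be evaluated the same way:

// Hedged sketch: accuracy of modelLR on the held-out split.
long correct = testData.filter(p -> modelLR.predict(p.features()) == p.label()).count();
double accuracy = (double) correct / testData.count();
System.out.println("Test accuracy: " + accuracy);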
Now, before training the models with the dataset, I want to filter the data down to the selected features I want to use. Can someone suggest a way to do this from a JavaRDD&lt;LabeledPoint&gt;?
If any more details are needed, please feel free to ask.
1 Answer
#1
Never mind. I figured out the answer on my own.
For anyone interested in doing this, I did something like the following:
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

public static JavaRDD<LabeledPoint> filterData(JavaRDD<LabeledPoint> data, final String filterString) {
    // Resolve the comma-separated feature names once, outside the map closure.
    final String[] featuresInUse = filterString.split(",");
    return data.map(new Function<LabeledPoint, LabeledPoint>() {
        @Override
        public LabeledPoint call(LabeledPoint point) throws Exception {
            double label = point.label();
            double[] features = point.features().toArray();
            // VectorizationProperties maps each feature name to its column
            // index in the original 41-feature vector.
            double[] filteredFeatures = new double[featuresInUse.length];
            for (int i = 0; i < featuresInUse.length; i++) {
                filteredFeatures[i] = features[Integer.parseInt(VectorizationProperties.getProperty(featuresInUse[i]))];
            }
            return new LabeledPoint(label, Vectors.dense(filteredFeatures));
        }
    });
}
This will filter each record and give back a JavaRDD containing only the selected features.
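For example, a hypothetical call (the feature names below are placeholders; they must exist as keys in VectorizationProperties, which maps names to column indices):

JavaRDD<LabeledPoint> selected = filterData(trainingData, "feature1,feature7,feature23");
modelNB = NaiveBayes.train(selected.rdd(), 1.0);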
Please feel free to ask for any details needed to understand further.