Error "x and xtest must have same number of columns" when using randomForest

Time: 2022-12-28 16:13:17

I am getting an error when I am trying to use randomForest in R. When I enter

basic3prox  <- randomForest(activity ~.,data=train,proximity=TRUE,xtest=valid)

where train is a data frame of training data and valid is a data frame of test data, I get the following error:

Error in randomForest.default(m, y, ...) : 
  x and xtest must have same number of columns

But they do have the same number of columns. I used subset() to get them from the same original dataset, and when I run dim() I get:

dim(train)
[1] 3237 563

dim(valid)
[1] 2630 563

So I am at a loss to figure out what is wrong here.

1 solution

#1


4  

No, they don't: train has 562 predictor columns and 1 decision column, so the data passed as xtest must have 562 columns (and the corresponding decision must be passed to the ytest argument).
So the invocation should look like this:

randomForest(activity~.,data=train,proximity=TRUE,
  xtest=valid[,names(valid)!='activity'],ytest=valid[,'activity'])
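To see why dim() is misleading here, the following minimal sketch reproduces the column arithmetic on toy data using base R only. The data frames are hypothetical stand-ins for the question's train and valid; only the column name activity is taken from the question.

```r
# Toy stand-ins for the question's `train` and `valid` (hypothetical data).
set.seed(1)
train <- data.frame(activity = factor(c("a", "b", "a", "b")),
                    f1 = rnorm(4), f2 = rnorm(4))
valid <- data.frame(activity = factor(c("b", "a")),
                    f1 = rnorm(2), f2 = rnorm(2))

# dim() counts the decision column too, so the two frames look compatible.
same_width <- ncol(train) == ncol(valid)

# The formula `activity ~ .` treats every column except `activity` as a predictor.
predictors <- setdiff(names(train), "activity")

# Dropping the decision column from `valid` gives an xtest of matching width.
xtest <- valid[, names(valid) != "activity"]
ytest <- valid[, "activity"]

same_width                          # TRUE
length(predictors)                  # 2, i.e. ncol(train) - 1
ncol(xtest) == length(predictors)   # TRUE
```

In the question's case the same arithmetic gives 563 - 1 = 562 predictors, which is the width xtest needs.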

However, this is a dirty hack which will fail for more complex formulae and thus it shouldn't be used (even the authors tried to prohibit it, as Joran pointed out in comments). The correct, easier and faster way is to use separate objects for predictors and decisions instead of formulae, like this:

randomForest(trainPredictors,trainActivity,proximity=TRUE,
  xtest=testPredictors,ytest=testActivity)
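The answer does not show how trainPredictors, trainActivity, testPredictors and testActivity are built. One possible way to derive them is sketched below, assuming the decision column is named activity as in the question. The toy data and the make_data helper are hypothetical, and the fit runs only if the randomForest package is installed.

```r
# Toy stand-ins for the question's `train` and `valid` (hypothetical data).
set.seed(42)
make_data <- function(n) {
  data.frame(activity = factor(sample(c("walk", "sit"), n, replace = TRUE),
                               levels = c("sit", "walk")),
             f1 = rnorm(n), f2 = rnorm(n), f3 = rnorm(n))
}
train <- make_data(60)
valid <- make_data(30)

# Split each frame into predictors (x) and decision (y).
# drop = FALSE keeps a data frame even if only one predictor remains.
is_decision     <- names(train) == "activity"
trainPredictors <- train[, !is_decision, drop = FALSE]
trainActivity   <- train$activity
testPredictors  <- valid[, !is_decision, drop = FALSE]
testActivity    <- valid$activity

# Sanity checks that catch the mismatch behind the original error.
stopifnot(ncol(trainPredictors) == ncol(testPredictors),
          identical(names(trainPredictors), names(testPredictors)),
          nrow(testPredictors) == length(testActivity))

# Fit only when the randomForest package is available.
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit <- randomForest::randomForest(trainPredictors, trainActivity,
                                    proximity = TRUE,
                                    xtest = testPredictors, ytest = testActivity)
  print(fit$test$confusion)  # confusion matrix on the test set
}
```

As far as the randomForest documentation describes, the forest is not retained by default when xtest is supplied, so keep.forest=TRUE would be needed if predict() is to be called on the fitted object later.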
