Error when using scikit-learn: the model's number of features does not match the input

I am working on a classification problem using RandomForestClassifier. In the code, I split the dataset into training and test sets and then make predictions on the test set.

Here's the code:

from sklearn.ensemble import RandomForestClassifier
from sklearn.cross_validation import train_test_split
import numpy as np
from numpy import genfromtxt, savetxt

a = (np.genfromtxt(open('filepath.csv','r'), delimiter=',', dtype='int')[1:])
a_train, a_test = train_test_split(a, test_size=0.33, random_state=0)


def main():
    target = [x[0] for x in a_train]
    train = [x[1:] for x in a_train]

    rf = RandomForestClassifier(n_estimators=100)
    rf.fit(train, target)
    predicted_probs = [[index + 1, x[1]] for index, x in enumerate(rf.predict_proba(a_test))]

    savetxt('filepath.csv', predicted_probs, delimiter=',', fmt='%d,%f', 
            header='Id,PredictedProbability', comments = '')

if __name__=="__main__":
    main()

On execution, however, I get the following error:

ValueError: Number of features of the model must match the input. Model n_features is 1434 and input n_features is 1435

Any suggestions as to how I should proceed? Thanks.

1 Answer

#1

It looks like you are using a_test directly, without stripping out the output feature.

The model raises this error because it was trained on 1434 input features, but a_test still contains the output column, so every row you pass to predict_proba has 1435 columns (the 1434 features plus the output feature).
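
A quick way to confirm this is to compare column counts. The sketch below uses tiny made-up rows (3 features instead of 1434) with the label in column 0, mirroring the layout in the question; the data values are illustrative only.

```python
# Toy rows: column 0 is the label, the remaining 3 columns are features.
a_train = [[1, 10, 20, 30], [0, 11, 21, 31]]
a_test = [[1, 12, 22, 32]]

train = [row[1:] for row in a_train]  # label stripped, as in the question

n_model_features = len(train[0])   # what the fitted model would expect
n_input_features = len(a_test[0])  # what the raw test rows contain

print(n_model_features, n_input_features)  # 3 4: off by one, like 1434 vs 1435
```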

You can fix this by slicing the output column off a_test, the same way you did for a_train:

test = [x[1:] for x in a_test]

Then use test on the following line:

predicted_probs = [[index + 1, x[1]] for index, x in enumerate(rf.predict_proba(test))]
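
Putting it together, here is a minimal sketch of the corrected flow. It assumes, as in the question, that column 0 holds the label and that there are at least two classes in the training data. The helper names `split_target_features` and `positive_class_probs` are made up for illustration, and scikit-learn is imported inside the second function only so that the splitting helper can be tried without it.

```python
def split_target_features(rows):
    """Split rows laid out as [label, f1, f2, ...] into (labels, features)."""
    target = [row[0] for row in rows]
    features = [list(row[1:]) for row in rows]
    return target, features


def positive_class_probs(train_rows, test_rows, n_estimators=100):
    """Fit on train_rows and return [[id, probability], ...] for test_rows."""
    from sklearn.ensemble import RandomForestClassifier

    y_train, X_train = split_target_features(train_rows)
    _, X_test = split_target_features(test_rows)  # strip the label here too

    rf = RandomForestClassifier(n_estimators=n_estimators, random_state=0)
    rf.fit(X_train, y_train)
    # Column 1 of predict_proba corresponds to rf.classes_[1].
    return [[i + 1, p[1]] for i, p in enumerate(rf.predict_proba(X_test))]
```

Calling `positive_class_probs(a_train, a_test)` with the arrays from the question would then give the model the same 1434 feature columns for both fitting and prediction.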
