Error when using the R predict function: factor levels in the new data do not match the original data

Time: 2021-06-28 16:11:49

I am using R to build a prediction model. However, predict() always gives me an error message such as

Factor levels in the new data do not match those in the original data

I know this happens because some feature levels in the test data are not among the feature levels in the training data. The feature matrix is large, so it is very hard to fix the levels one by one in the test data set. Is there a way to force the feature levels in the test data set to match the existing levels of the training features?

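To make the failure concrete, here is a minimal sketch that reproduces this kind of error. The data frames `train` and `new`, the column names, and the choice of `lm` are made up for illustration and are not from the original question; other model types may word the error differently.

```r
# Illustrative data only: a factor predictor `g` with levels a, b, c.
train <- data.frame(
  y = c(1, 2, 3, 4, 5, 6),
  g = factor(c("a", "b", "c", "a", "b", "c"))
)
fit <- lm(y ~ g, data = train)

# The new data contains level "d", which the model never saw.
new <- data.frame(g = factor(c("a", "d")))

# Capture the error message instead of stopping the script.
msg <- tryCatch(
  {
    predict(fit, newdata = new)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)
```

With `lm`, the captured message typically points at the factor and its unseen level, which is the situation described above.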

1 solution

#1


Here's an example of making a test variable have the same levels as a training variable:

test <- factor(LETTERS[1:5])
training <- factor(LETTERS[4:10])
levels(test)
#[1] "A" "B" "C" "D" "E"

Trying to replace a value where the level is not present:

test[2] <- training[5]
#Warning:
#  In `[<-.factor`(`*tmp*`, 2, value = 5L) :
#  invalid factor level, NA generated

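You can get around this by uniting the factor levels (shown here on the original test, before the failed assignment above turned its second element into NA):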

levels(test) <- union(levels(test), levels(training))
levels(test)
#[1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J"
test
#[1] A B C D E
#Levels: A B C D E F G H I J

Now you can do the previous operation without a warning:

test[2] <- training[5]
test
#[1] A H C D E
#Levels: A B C D E F G H I J

Most likely you can use a similar approach in your case, but I'm not sure about the exact structure of your data.

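For a whole data frame, one possible way to apply the same idea is sketched below. This is only a sketch under assumptions: `align_levels`, `train`, and `test` are hypothetical names, and instead of uniting levels it goes the other direction and re-creates each test factor with the training levels, so values the training data never saw become NA.

```r
# Hypothetical helper: give every factor column of `test` the levels
# found in the matching column of `train`. Values with a level unknown
# to `train` become NA.
align_levels <- function(test, train) {
  for (col in intersect(names(test), names(train))) {
    if (is.factor(train[[col]])) {
      test[[col]] <- factor(as.character(test[[col]]),
                            levels = levels(train[[col]]))
    }
  }
  test
}

# Illustrative data only.
train <- data.frame(g = factor(c("a", "b", "c")), x = 1:3)
test  <- data.frame(g = factor(c("a", "d")), x = 4:5)

aligned <- align_levels(test, train)
levels(aligned$g)  # "a" "b" "c"
is.na(aligned$g)   # FALSE  TRUE
```

Rows that end up with NA still carry no usable information for the model, so they would need to be dropped, imputed, or handled separately before or after calling predict().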
