"Factor has new levels" error for a variable I'm not using

Consider a simple dataset, split into a training and testing set:

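# stringsAsFactors=TRUE keeps y a factor on R >= 4.0.0, where the default changed to FALSE
dat <- data.frame(x=1:5, y=c("a", "b", "c", "d", "e"), z=c(0, 0, 1, 0, 1), stringsAsFactors=TRUE)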
train <- dat[1:4,]
train
#   x y z
# 1 1 a 0
# 2 2 b 0
# 3 3 c 1
# 4 4 d 0
test <- dat[5,]
test
#   x y z
# 5 5 e 1

When I train a logistic regression model to predict z using x and obtain test-set predictions, all is well:

mod <- glm(z~x, data=train, family="binomial")
predict(mod, newdata=test, type="response")
#         5 
# 0.5546394

However, this fails on an equivalent-looking logistic regression model with a "Factor has new levels" error:

mod2 <- glm(z~.-y, data=train, family="binomial")
predict(mod2, newdata=test, type="response")
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) : 
#   factor y has new level e

Since I removed y from my model equation, I'm surprised to see this error message. In my application, dat is very wide, so z~.-y is the most convenient model specification. The simplest workaround I can think of is removing the y variable from my data frame and then training the model with the z~. syntax, but I was hoping for a way to use the original dataset without the need to remove columns.

1 Answer

#1

You could try updating mod2$xlevels[["y"]] in the model object so that it also contains the levels found in the test set:

mod2 <- glm(z~.-y, data=train, family="binomial")
# factor() makes this work whether test$y is a factor or a character vector
mod2$xlevels[["y"]] <- union(mod2$xlevels[["y"]], levels(factor(test$y)))

predict(mod2, newdata=test, type="response")
#         5 
# 0.5546394
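
To see why y still matters after being subtracted from the formula, it can help to inspect the fitted object. The sketch below rebuilds the question's data (stringsAsFactors=TRUE is an added assumption so that y is a factor on any R version) and suggests that y is dropped from the model terms but kept in the model frame, which is why predict() still checks its levels against the new data:

```r
# Rebuild the question's data; stringsAsFactors = TRUE is an assumption
# added here so that y is a factor regardless of the R version.
dat <- data.frame(x = 1:5, y = c("a", "b", "c", "d", "e"),
                  z = c(0, 0, 1, 0, 1), stringsAsFactors = TRUE)
train <- dat[1:4, ]

mod2 <- glm(z ~ . - y, data = train, family = "binomial")

# Terms actually used in the fit: y should not appear here
term_labels <- attr(terms(mod2), "term.labels")
term_labels

# Columns stored in the model frame: y is still carried along
frame_cols <- names(mod2$model)
frame_cols

# Factor levels recorded at fit time; predict() compares newdata against these
y_levels <- mod2$xlevels[["y"]]
y_levels
```

Since the recorded levels come only from the training rows, a test row with y equal to "e" triggers the error unless the levels are extended as above or y is kept out of the model frame altogether.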

Another option is to leave the y column out of the data passed to glm, without removing it from train itself:

# drop=FALSE keeps a data frame even if only one column is left
mod2 <- glm(z~., data=train[, !colnames(train) %in% c("y"), drop=FALSE], family="binomial")
predict(mod2, newdata=test, type="response")
#         5 
# 0.5546394
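
A third possibility, not part of the answer above, is to avoid the dot entirely and build the formula from the column names, so that the original data frame can be used unchanged. This is a minimal sketch using reformulate() and setdiff(); the names predictors, form and mod3 are illustrative, and stringsAsFactors=TRUE is again an added assumption:

```r
# Rebuild the question's data (stringsAsFactors = TRUE is an assumption)
dat <- data.frame(x = 1:5, y = c("a", "b", "c", "d", "e"),
                  z = c(0, 0, 1, 0, 1), stringsAsFactors = TRUE)
train <- dat[1:4, ]
test <- dat[5, ]

# Every column except the response z and the unwanted y becomes a predictor
predictors <- setdiff(names(train), c("y", "z"))
form <- reformulate(predictors, response = "z")  # here this is z ~ x

# y never enters the model frame, so its levels are not checked by predict()
mod3 <- glm(form, data = train, family = "binomial")
pred <- predict(mod3, newdata = test, type = "response")
pred
```

With a wide data frame, the vector passed to setdiff() can list any number of columns to leave out, which keeps the specification about as short as z~.-y.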
