R - cv.glmnet error: Matrices must have same number of columns

Date: 2022-12-28 16:13:35

When running the R cv.glmnet function from the glmnet package on large sparse datasets, I often get the following error:

# Error: Matrices must have same number of columns in .local(x, y, ...)

I have replicated the error with randomly generated data:

library(glmnet)
set.seed(10)

X <- matrix(rbinom(5000, 1, 0.1), nrow = 1000, ncol = 5)
X[, 1] <- 0   # make column 1 all zeros...
X[1, 1] <- 1  # ...except for a single 1

Y <- rep(0, 1000)
Y[1:20] <- 1

model <- cv.glmnet(x = X, y = Y, family = "binomial", alpha = 0.9,
                   standardize = TRUE, nfolds = 4)

This might be related to the initial variable screening (based on the inner product of X and Y). Instead of fixing the coefficient at zero, glmnet drops the variable from the X matrix, and this screening is done separately for each fold. If a variable is dropped in some folds but kept in others, the error appears.
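This hypothesis can be checked directly, without glmnet: the lone nonzero entry of column 1 lands in exactly one fold, so that column is constant (all zeros) in exactly one fold's training data. A minimal sketch (the fold assignment below is my own and only mirrors the spirit of cv.glmnet's internal split):

```r
# Sketch: show that column 1 is constant in exactly one fold's training set.
set.seed(10)
X <- matrix(rbinom(5000, 1, 0.1), nrow = 1000, ncol = 5)
X[, 1] <- 0   # column 1 is all zeros...
X[1, 1] <- 1  # ...except for a single 1 in row 1

# Random 4-fold assignment, similar in spirit to cv.glmnet's internal split
foldid <- sample(rep(1:4, length.out = 1000))

# For each fold k, the training data excludes fold k. Column 1 is
# constant there exactly when observation 1 belongs to fold k.
constant_in_training <- sapply(1:4, function(k) {
  var(X[foldid != k, 1]) == 0
})
sum(constant_in_training)  # exactly one fold sees column 1 as all-zero
```

Whichever fold observation 1 falls into, its training data carries no information on column 1, so any screening based on the training data can disagree across folds.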

Sometimes increasing nfolds helps. This is in line with the hypothesis, since a higher nfolds means each fold is fit on a larger training subset, and there is a smaller chance of the variable being dropped in any of them.
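Besides raising nfolds, another workaround (my own suggestion, not from the glmnet documentation) is to drop near-constant columns before cross-validation, since those are exactly the columns the per-fold screening can disagree on. The threshold below is arbitrary:

```r
# Sketch: filter out columns with very few nonzero entries before cv.glmnet.
set.seed(10)
X <- matrix(rbinom(5000, 1, 0.1), nrow = 1000, ncol = 5)
X[, 1] <- 0
X[1, 1] <- 1
Y <- rep(0, 1000)
Y[1:20] <- 1

min_nonzero <- 5  # arbitrary threshold; tune for your data
keep <- colSums(X != 0) >= min_nonzero
X_filtered <- X[, keep, drop = FALSE]  # drops the problematic column 1

# model <- cv.glmnet(x = X_filtered, y = Y, family = "binomial",
#                    alpha = 0.9, standardize = TRUE, nfolds = 4)
```

A column with a handful of nonzero entries contributes almost nothing to the fit anyway, so filtering it out costs little.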

A few additional notes:

The error appears only for alpha close to 1 (alpha = 1 is equivalent to pure L1 regularization) and with standardization enabled. It does not appear for family = "gaussian".

What do you think could be happening?

1 Solution

#1


This example is problematic, because one variable has a single 1 and the rest are zero. This is a case where logistic regression can diverge (if not regularized), since driving that coefficient to infinity (plus or minus depending on the response) will predict that observation perfectly, and not impact anything else.

Now, the model is regularized, so this should not happen, but it does cause problems. I found that making alpha smaller (toward ridge; 0.5 for this example) made the problem go away.
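A sketch of that fix on the original example (assuming glmnet is installed; alpha lowered from 0.9 to 0.5, everything else unchanged):

```r
library(glmnet)

set.seed(10)
X <- matrix(rbinom(5000, 1, 0.1), nrow = 1000, ncol = 5)
X[, 1] <- 0
X[1, 1] <- 1
Y <- rep(0, 1000)
Y[1:20] <- 1

# alpha = 0.5 mixes in enough ridge penalty that cross-validation completes
model <- cv.glmnet(x = X, y = Y, family = "binomial", alpha = 0.5,
                   standardize = TRUE, nfolds = 4)
```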

The real problem here has to do with the lambda sequence used for each fold, but this gets a little technical. I will try to make a fix to cv.glmnet that makes this problem go away.
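The per-fold lambda issue can be inspected by hand: fit glmnet separately on each fold's training data and compare how long the fitted lambda paths actually are, since glmnet can terminate a path early. A diagnostic sketch (assuming glmnet is installed; the fold split is my own and need not match cv.glmnet's internal one):

```r
library(glmnet)

set.seed(10)
X <- matrix(rbinom(5000, 1, 0.1), nrow = 1000, ncol = 5)
X[, 1] <- 0
X[1, 1] <- 1
Y <- rep(0, 1000)
Y[1:20] <- 1

foldid <- sample(rep(1:4, length.out = 1000))

# Fit glmnet on each fold's training data and record how many lambda
# values each path contains; the paths need not all have the same length.
path_lengths <- sapply(1:4, function(k) {
  fit <- glmnet(X[foldid != k, ], Y[foldid != k], family = "binomial",
                alpha = 0.9, standardize = TRUE)
  length(fit$lambda)
})
path_lengths
```

If the paths differ in length or content across folds, matrices built from per-fold predictions can end up with mismatched column counts, which is consistent with the error message.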

Trevor Hastie (glmnet maintainer)
