Feature selection + cross-validation, but how do I build a ROC curve in R?

Time: 2022-12-07 11:05:56

I'm stuck with the following problem. I divide my data into 10 folds. Each time, I use 1 fold as the test set and the other 9 as the training set (I do this ten times). On each training set, I perform feature selection (a filter method with chi.squared) and then build an SVM model from that training set and the selected features.
So in the end, I get 10 different models (because of the feature selection). But now I want to build a single ROC curve in R for this filter method as a whole. How can I do this?

Silke

1 Solution

#1

You can indeed store the predictions if they are all on the same scale (be especially careful about this as you perform feature selection... some methods may produce scores that are dependent on the number of features) and use them to build a ROC curve. Here is the code I used for a recent paper:

library(pROC)
data(aSAH)

k <- 10          # number of cross-validation folds
n <- nrow(aSAH)  # number of observations
# Randomly assign each observation to one of the k folds
indices <- sample(rep(1:k, ceiling(n / k))[1:n])

all.response <- all.predictor <- aucs <- c()
for (i in 1:k) {
  test  <- aSAH[indices == i, ]  # fold i is the test set
  learn <- aSAH[indices != i, ]  # the remaining k - 1 folds form the training set
  model <- glm(as.numeric(outcome) - 1 ~ s100b + ndka + as.numeric(wfns),
               data = learn, family = binomial(link = "logit"))
  model.pred <- predict(model, newdata = test)
  # Per-fold AUC, kept for comparison with the pooled curve
  aucs <- c(aucs, roc(test$outcome, model.pred)$auc)
  # Pool the true responses and the predictions across folds
  all.response <- c(all.response, test$outcome)
  all.predictor <- c(all.predictor, model.pred)
}

roc(all.response, all.predictor)  # ROC curve over the pooled predictions
mean(aucs)                        # average of the per-fold AUCs

The ROC curve is built from all.response and all.predictor, which are updated at each step. The code also stores the AUC of each fold in aucs for comparison. Both results should be quite similar when the sample size is sufficiently large. With small samples within the cross-validation, however, the per-fold AUCs may be underestimated: the ROC curve built from all the data tends to be smoother and is therefore less underestimated by the trapezoidal rule.
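If you want to draw the curve rather than only compute its AUC, you can plot the pooled roc object directly. A minimal sketch using standard pROC functions (print.auc and ci.auc are part of pROC; the confidence interval is optional):

pooled.roc <- roc(all.response, all.predictor)
plot(pooled.roc, print.auc = TRUE)  # draw the pooled ROC curve, printing its AUC on the plot
ci.auc(pooled.roc)                  # DeLong confidence interval for the pooled AUC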

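To map this back to the question's setup (a chi-squared filter plus an SVM per fold), the same pooling idea applies: select the features on the training folds only, fit the SVM, and collect the test-fold scores. Below is a hedged sketch, assuming the FSelector package for chi.squared and e1071 for the SVM; mydata and outcome are placeholders for your data frame and class column, and keeping the 5 top-ranked features is an arbitrary choice. Class probabilities (probability = TRUE) are used as the score because they live on a common 0-1 scale across folds, which sidesteps the scale caveat above.

library(FSelector)  # chi.squared(), cutoff.k(), as.simple.formula()
library(e1071)      # svm()
library(pROC)

k <- 10
n <- nrow(mydata)   # mydata: placeholder for your own data frame
indices <- sample(rep(1:k, ceiling(n / k))[1:n])

all.response <- all.predictor <- c()
for (i in 1:k) {
  test  <- mydata[indices == i, ]
  learn <- mydata[indices != i, ]
  # Filter feature selection on the training folds only
  weights  <- chi.squared(outcome ~ ., data = learn)
  features <- cutoff.k(weights, 5)  # keep the 5 top-ranked features (arbitrary)
  fit <- svm(as.simple.formula(features, "outcome"), data = learn,
             probability = TRUE)
  pred <- predict(fit, newdata = test, probability = TRUE)
  # Probability of the second class level, on a common 0-1 scale
  pos.prob <- attr(pred, "probabilities")[, levels(learn$outcome)[2]]
  all.response  <- c(all.response, test$outcome)
  all.predictor <- c(all.predictor, pos.prob)
}

roc(all.response, all.predictor)  # one pooled ROC curve for the whole filter method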
