ROC curves for training set and test set in caret cross-validation

Time: 2021-05-21 20:35:49

Is it possible to get ROC curves for the training set and the test set separately, for each fold, in 5-fold cross-validation in caret?

library(caret)
train_control <- trainControl(method = "cv", number = 5, savePredictions = TRUE, classProbs = TRUE)
rfmodel <- train(Species ~ ., data = iris, trControl = train_control, method = "rf")

I can do the following, but I do not know whether it returns the ROC for the training set of Fold1 or for its test set:

library(pROC)
selectedIndices <- rfmodel$pred$Resample == "Fold1"
plot.roc(rfmodel$pred$obs[selectedIndices], rfmodel$pred$setosa[selectedIndices])

1 Answer

#1

It is true that the documentation is not at all clear regarding the contents of rfmodel$pred; I would bet that the predictions included are for the fold used as the test set, but I cannot point to any evidence in the docs. Nevertheless, and regardless of this, you are still missing some points in the way you are trying to get the ROC.

First, let's isolate rfmodel$pred in a separate dataframe for easier handling:

dd <- rfmodel$pred

nrow(dd)
# 450

Why 450 rows? It is because you have tried 3 different parameter sets (in your case just 3 different values for mtry):

rfmodel$results
# output:
  mtry Accuracy Kappa AccuracySD    KappaSD
1    2     0.96  0.94 0.04346135 0.06519202
2    3     0.96  0.94 0.04346135 0.06519202
3    4     0.96  0.94 0.04346135 0.06519202

and 150 rows X 3 settings = 450.
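
As a quick sanity check (a minimal sketch, assuming the dd dataframe defined above), the row count decomposes exactly as expected:

# 150 samples x 3 candidate mtry values
nrow(iris) * length(unique(dd$mtry))
# [1] 450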

Let's have a closer look at the contents of rfmodel$pred:

head(dd)

# result:
    pred    obs setosa versicolor virginica rowIndex mtry Resample
1 setosa setosa  1.000      0.000         0        2    2    Fold1
2 setosa setosa  1.000      0.000         0        3    2    Fold1
3 setosa setosa  1.000      0.000         0        6    2    Fold1
4 setosa setosa  0.998      0.002         0       24    2    Fold1
5 setosa setosa  1.000      0.000         0       33    2    Fold1
6 setosa setosa  1.000      0.000         0       38    2    Fold1
  • Column obs contains the true values
  • The three columns setosa, versicolor, and virginica contain the respective probabilities calculated for each class, and they sum up to 1 for each row
  • Column pred contains the final prediction, i.e. the class with the maximum probability from the three columns mentioned above (both points are verified in the quick check after this list)
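
The last two points are easy to verify directly (a quick check, assuming dd as above; the argmax comparison assumes no exact ties between class probabilities):

probs <- dd[, c("setosa", "versicolor", "virginica")]
# the class probabilities in each row sum to 1
all(abs(rowSums(probs) - 1) < 1e-8)
# pred is the class with the maximum probability in each row
all(as.character(dd$pred) == colnames(probs)[max.col(probs, ties.method = "first")])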

If this were the whole story, your way of plotting the ROC would be OK, i.e.:

selectedIndices <- rfmodel$pred$Resample == "Fold1"
plot.roc(rfmodel$pred$obs[selectedIndices], rfmodel$pred$setosa[selectedIndices])

But this is not the whole story (the mere existence of 450 rows instead of just 150 should already have been a hint): notice the column named mtry; indeed, rfmodel$pred includes the results for all runs of the cross-validation, i.e. for all the parameter settings:

tail(dd)
# result:
         pred       obs setosa versicolor virginica rowIndex mtry Resample
445 virginica virginica      0      0.004     0.996      112    4    Fold5
446 virginica virginica      0      0.000     1.000      113    4    Fold5
447 virginica virginica      0      0.020     0.980      115    4    Fold5
448 virginica virginica      0      0.000     1.000      118    4    Fold5
449 virginica virginica      0      0.394     0.606      135    4    Fold5
450 virginica virginica      0      0.000     1.000      140    4    Fold5
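
A cross-tabulation makes the layout explicit (assuming dd as above): with 5 stratified folds over the 150 iris samples, each fold contributes 30 held-out rows per mtry value:

table(dd$mtry, dd$Resample)
# each (mtry, fold) cell should contain 30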

This is the ultimate reason why your selectedIndices calculation is not correct: it should also include a specific choice of mtry, otherwise the ROC does not make any sense, since it "aggregates" predictions from more than one model:

selectedIndices <- rfmodel$pred$Resample == "Fold1" & rfmodel$pred$mtry == 2
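
Putting it together, here is a sketch of how you could draw the test-fold ROC curve for each of the 5 folds for a fixed mtry (setosa is used as the positive class, as in your snippet; the colors and title are arbitrary choices):

library(pROC)
folds <- paste0("Fold", 1:5)
for (f in folds) {
  idx <- dd$Resample == f & dd$mtry == 2
  # one-vs-rest ROC for class setosa on the held-out fold f
  roc_f <- roc(dd$obs[idx] == "setosa", dd$setosa[idx])
  if (f == "Fold1") {
    plot(roc_f, col = 1, main = "Test-fold ROC curves (mtry = 2, class setosa)")
  } else {
    plot(roc_f, add = TRUE, col = which(folds == f))
  }
}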

---

As I said in the beginning, I bet that the predictions in rfmodel$pred are for the fold used as the test set; indeed, if we compute the accuracies manually, they coincide with the ones reported in rfmodel$results shown above (0.96 for all 3 settings), which we know are for the folds used as test sets (arguably, the respective training accuracies would be 1.0):

for (i in 2:4) {  # mtry values in {2, 3, 4}
  # per-fold accuracy = correct predictions / 30 (each test fold holds 30 samples),
  # averaged over the 5 folds
  acc <- (length(which(dd$pred == dd$obs & dd$mtry == i & dd$Resample == 'Fold1'))/30 +
          length(which(dd$pred == dd$obs & dd$mtry == i & dd$Resample == 'Fold2'))/30 +
          length(which(dd$pred == dd$obs & dd$mtry == i & dd$Resample == 'Fold3'))/30 +
          length(which(dd$pred == dd$obs & dd$mtry == i & dd$Resample == 'Fold4'))/30 +
          length(which(dd$pred == dd$obs & dd$mtry == i & dd$Resample == 'Fold5'))/30
         )/5

  print(acc)
}

# result:
[1] 0.96
[1] 0.96
[1] 0.96
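
For what it's worth, since all 5 test folds here have the same size (30 samples), a grouped mean over dd gives the same numbers in one line (a compact equivalent, not a different computation):

dd$correct <- dd$pred == dd$obs
aggregate(correct ~ mtry, data = dd, FUN = mean)
#   mtry correct
# 1    2    0.96
# 2    3    0.96
# 3    4    0.96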
