ROC curves for training set and test set in caret cross-validation

Time: 2021-05-21 20:35:49

Is it possible to get ROC curves for the training set and the test set separately, for each fold, in 5-fold cross-validation in caret?

library(caret)
train_control <- trainControl(method = "cv", number = 5, savePredictions = TRUE, classProbs = TRUE)
rfmodel <- train(Species ~ ., data = iris, trControl = train_control, method = "rf")

I can do the following, but I do not know whether it returns the ROC for the training set of Fold1 or for its test set:

library(pROC)
selectedIndices <- rfmodel$pred$Resample == "Fold1"
plot.roc(rfmodel$pred$obs[selectedIndices], rfmodel$pred$setosa[selectedIndices])

1 Answer

#1

It is true that the documentation is not at all clear regarding the contents of rfmodel$pred; I would bet that the predictions included are for the fold used as the test set, but I cannot point to any evidence in the docs. Nevertheless, and regardless of this, you are still missing some points in the way you are trying to get the ROC.

First, let's isolate rfmodel$pred in a separate dataframe for easier handling:

dd <- rfmodel$pred

nrow(dd)
# 450

Why 450 rows? It is because you have tried 3 different parameter sets (in your case just 3 different values for mtry):

rfmodel$results
# output:
  mtry Accuracy Kappa AccuracySD    KappaSD
1    2     0.96  0.94 0.04346135 0.06519202
2    3     0.96  0.94 0.04346135 0.06519202
3    4     0.96  0.94 0.04346135 0.06519202

and 150 rows X 3 settings = 450.
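
As a quick sanity check (a minimal sketch, assuming the dd dataframe defined above), the row count decomposes exactly as expected:

# 150 samples x 3 candidate mtry values
nrow(iris) * length(unique(dd$mtry))
# [1] 450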

Let's have a closer look at the contents of rfmodel$pred:

head(dd)

# result:
    pred    obs setosa versicolor virginica rowIndex mtry Resample
1 setosa setosa  1.000      0.000         0        2    2    Fold1
2 setosa setosa  1.000      0.000         0        3    2    Fold1
3 setosa setosa  1.000      0.000         0        6    2    Fold1
4 setosa setosa  0.998      0.002         0       24    2    Fold1
5 setosa setosa  1.000      0.000         0       33    2    Fold1
6 setosa setosa  1.000      0.000         0       38    2    Fold1
  • Column obs contains the true values
  • The three columns setosa, versicolor, and virginica contain the respective probabilities calculated for each class, and they sum up to 1 for each row
  • Column pred contains the final prediction, i.e. the class with the maximum probability from the three columns mentioned above (both points are verified in the quick check after this list)
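
The last two points are easy to verify directly (a quick check, assuming dd as above; the argmax comparison assumes no exact ties between class probabilities):

probs <- dd[, c("setosa", "versicolor", "virginica")]
# the class probabilities in each row sum to 1
all(abs(rowSums(probs) - 1) < 1e-8)
# pred is the class with the maximum probability in each row
all(as.character(dd$pred) == colnames(probs)[max.col(probs, ties.method = "first")])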

If this were the whole story, your way of plotting the ROC would be OK, i.e.:

selectedIndices <- rfmodel$pred$Resample == "Fold1"
plot.roc(rfmodel$pred$obs[selectedIndices], rfmodel$pred$setosa[selectedIndices])

But this is not the whole story (the mere existence of 450 rows instead of just 150 should already have been a hint): notice the column named mtry; indeed, rfmodel$pred includes the results for all runs of the cross-validation, i.e. for all the parameter settings:

tail(dd)
# result:
         pred       obs setosa versicolor virginica rowIndex mtry Resample
445 virginica virginica      0      0.004     0.996      112    4    Fold5
446 virginica virginica      0      0.000     1.000      113    4    Fold5
447 virginica virginica      0      0.020     0.980      115    4    Fold5
448 virginica virginica      0      0.000     1.000      118    4    Fold5
449 virginica virginica      0      0.394     0.606      135    4    Fold5
450 virginica virginica      0      0.000     1.000      140    4    Fold5
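
A cross-tabulation makes the layout explicit (assuming dd as above): with 5 stratified folds over the 150 iris samples, each fold contributes 30 held-out rows per mtry value:

table(dd$mtry, dd$Resample)
# each (mtry, fold) cell should contain 30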

This is the ultimate reason why your selectedIndices calculation is not correct: it should also include a specific choice of mtry, otherwise the ROC does not make any sense, since it "aggregates" predictions from more than one model:

selectedIndices <- rfmodel$pred$Resample == "Fold1" & rfmodel$pred$mtry == 2
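
Putting it together, here is a sketch of how you could draw the test-fold ROC curve for each of the 5 folds for a fixed mtry (setosa is used as the positive class, as in your snippet; the colors and title are arbitrary choices):

library(pROC)
folds <- paste0("Fold", 1:5)
for (f in folds) {
  idx <- dd$Resample == f & dd$mtry == 2
  # one-vs-rest ROC for class setosa on the held-out fold f
  roc_f <- roc(dd$obs[idx] == "setosa", dd$setosa[idx])
  if (f == "Fold1") {
    plot(roc_f, col = 1, main = "Test-fold ROC curves (mtry = 2, class setosa)")
  } else {
    plot(roc_f, add = TRUE, col = which(folds == f))
  }
}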

---

As I said in the beginning, I bet that the predictions in rfmodel$pred are for the fold used as the test set; indeed, if we compute the accuracies manually, they coincide with the ones reported in rfmodel$results shown above (0.96 for all 3 settings), which we know are for the folds used as test sets (arguably, the respective training accuracies would be 1.0):

for (i in 2:4) {  # mtry values in {2, 3, 4}
  # per-fold accuracy = correct predictions / 30 (each test fold holds 30 samples),
  # averaged over the 5 folds
  acc <- (length(which(dd$pred == dd$obs & dd$mtry == i & dd$Resample == 'Fold1'))/30 +
          length(which(dd$pred == dd$obs & dd$mtry == i & dd$Resample == 'Fold2'))/30 +
          length(which(dd$pred == dd$obs & dd$mtry == i & dd$Resample == 'Fold3'))/30 +
          length(which(dd$pred == dd$obs & dd$mtry == i & dd$Resample == 'Fold4'))/30 +
          length(which(dd$pred == dd$obs & dd$mtry == i & dd$Resample == 'Fold5'))/30
         )/5

  print(acc)
}

# result:
[1] 0.96
[1] 0.96
[1] 0.96
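
For what it's worth, since all 5 test folds here have the same size (30 samples), a grouped mean over dd gives the same numbers in one line (a compact equivalent, not a different computation):

dd$correct <- dd$pred == dd$obs
aggregate(correct ~ mtry, data = dd, FUN = mean)
#   mtry correct
# 1    2    0.96
# 2    3    0.96
# 3    4    0.96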
