Plot learning curves with caret and R

Date: 2022-01-30 15:01:52

I would like to study the optimal bias/variance tradeoff for model tuning. I'm using caret for R, which lets me plot a performance metric (AUC, accuracy, ...) against the model's hyperparameters (mtry, lambda, etc.) and automatically picks the maximum. This typically returns a good model, but if I want to dig further and choose a different bias/variance tradeoff I need a learning curve, not a performance curve.

For the sake of simplicity, let's say my model is a random forest, which has just one hyperparameter, 'mtry'.

I would like to plot the learning curves of both the training and test sets. Something like this:

[image: example learning curve]

(red curve is the test set)

On the y axis I put an error metric (the number of misclassified examples, or something like that); on the x axis, 'mtry' or, alternatively, the training set size.

Questions:

  1. Does caret have the functionality to iteratively train models on training-set subsets of different sizes? If I have to code it by hand, how can I do that?

  2. If I want to put the hyperparameter on the x axis, I need all the models trained by caret::train, not just the final model (the one with the best cross-validated performance). Are these "discarded" models still available after train()?

2 Answers

#1


4  

  1. caret will iteratively test lots of CV models for you if you set up the resampling with trainControl() and the candidate parameter values (e.g. mtry) in a tuning grid, typically built with expand.grid(). Both are then passed to the train() function (as the trControl and tuneGrid arguments). The specific tuning parameters (e.g. mtry, ntree) differ for each model type.

  2. Yes, the final train object will contain the error rate (however you specified it) for every candidate parameter value, aggregated over all the folds of your CV.

So you could specify, e.g., a 10-fold CV crossed with a grid of 10 values of mtry, which would be 100 model fits. You might want to go get a cup of tea, or possibly lunch.
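A minimal sketch of that setup (using the built-in iris data as a stand-in for your own; mtry is the only tuned parameter for method = "rf" in caret, and expand.grid() builds the tuning grid):

```r
library(caret)

set.seed(7)

# resampling scheme: 10-fold cross-validation
ctrl <- trainControl(method = "cv", number = 10)

# candidate values of mtry (iris has 4 predictors, so 1:4)
grid <- expand.grid(mtry = 1:4)

# 10 folds x 4 mtry values = 40 model fits
rfFit <- train(Species ~ ., data = iris,
               method = "rf",
               metric = "Accuracy",
               trControl = ctrl,
               tuneGrid = grid)

# resampled performance for every candidate mtry -- the "performance
# curve" caret keeps, even though only the best mtry is refit at the end
rfFit$results[, c("mtry", "Accuracy")]
```

Note that rfFit$results keeps one row per candidate mtry, so the per-parameter performance survives even though only the winning model is retained as rfFit$finalModel.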

If this sounds complicated... there is a very good example here; caret is one of the best-documented packages around.

#2


3  

Here's my code for how I approached plotting a learning curve in R while using the caret package to train the model. I use the Motor Trend Car Road Tests (mtcars) dataset for illustrative purposes. To begin, I randomize and split mtcars into training and test sets: 21 records for training and 13 for the test set. The response feature is mpg in this example.

# load caret and set seed for reproducibility
library(caret)
set.seed(7)

# randomize mtcars
mtcars <- mtcars[sample(nrow(mtcars)),]

# split mtcars data into training and test sets
mtcarsIndex <- createDataPartition(mtcars$mpg, p = .625, list = F)
mtcarsTrain <- mtcars[mtcarsIndex,]
mtcarsTest <- mtcars[-mtcarsIndex,]

# create empty data frame 
learnCurve <- data.frame(m = integer(21),
                         trainRMSE = numeric(21),
                         cvRMSE = numeric(21))

# test data response feature
testY <- mtcarsTest$mpg

# Run algorithms using 10-fold cross validation with 3 repeats
trainControl <- trainControl(method="repeatedcv", number=10, repeats=3)
metric <- "RMSE"

# loop over training examples
for (i in 3:21) {
    learnCurve$m[i] <- i

    # train learning algorithm with size i
    fit.lm <- train(mpg~., data=mtcarsTrain[1:i,], method="lm", metric=metric,
             preProc=c("center", "scale"), trControl=trainControl)        
    learnCurve$trainRMSE[i] <- fit.lm$results$RMSE

    # use trained parameters to predict on test data
    prediction <- predict(fit.lm, newdata = mtcarsTest[,-1])
    rmse <- postResample(prediction, testY)
    learnCurve$cvRMSE[i] <- rmse[1]
}

pdf("LinearRegressionLearningCurve.pdf", width = 7, height = 7, pointsize=12)

# plot learning curves of training set size vs. error measure
# for training set and test set
plot(log(learnCurve$trainRMSE), type = "o", col = "red", xlab = "Training set size",
     ylab = "log(RMSE)", main = "Linear Model Learning Curve")
lines(log(learnCurve$cvRMSE), type = "o", col = "blue")
legend('topright', c("Train error", "Test error"), lty = c(1,1), lwd = c(2.5, 2.5),
       col = c("red", "blue"))

dev.off()

The output plot is as shown below:

[image: linear model learning curve, train vs. test RMSE]
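As an aside, recent versions of caret ship a helper that automates exactly this train-on-growing-fractions loop; it is named learning_curve_dat (spelled learing_curve_dat in older releases, so check your version). A sketch, again using the built-in iris data as a stand-in:

```r
library(caret)
library(ggplot2)

set.seed(7)

# fits the model on growing proportions of the data and records
# training, resampling, and held-out test performance for each size
lc <- learning_curve_dat(dat = iris,
                         outcome = "Species",
                         test_prop = 1/4,
                         method = "rf",
                         metric = "Accuracy",
                         trControl = trainControl(method = "cv", number = 5))

# one smoothed curve per data split (Training / Resampling / Testing)
ggplot(lc, aes(x = Training_Size, y = Accuracy, color = Data)) +
  geom_smooth(method = "loess", span = 0.8) +
  theme_bw()
```

The returned data frame already has the training-set size and the split label as columns, so any plotting layer can be swapped in for the ggplot call.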
