How do you generate a prediction interval from a regression tree that is fit using rpart?
It is my understanding that a regression tree models the response conditional on the mean of the leaf nodes. I don't know how to get the variance for a leaf node from the model, but what I would like to do is simulate using the mean and variance for a leaf node to obtain a prediction interval.
predict.rpart() doesn't offer an interval option.
Example: I fit a tree to the iris data, but predict() has no "interval" type:
> r1 <- rpart(Sepal.Length ~ ., cp = 0.001, data = iris[-nrow(iris), ])
> predict(r1,newdata=iris[nrow(iris),],type = "interval")
Error in match.arg(type) :
'arg' should be one of “vector”, “prob”, “class”, “matrix”
2 Answers
#1
7
It is not clear to me what confidence intervals would mean for regression trees, since they are not classical statistical models like linear models. I see two main uses: characterizing the certainty of the tree itself, or characterizing the precision of the prediction in each leaf of the tree. Here is an answer for each of these possibilities.
Characterizing the certainty of your tree
If you are looking for a confidence value for a split node, then party provides that directly: it uses permutation tests to determine statistically which variables are most important, and it attaches a p-value to each split. This is a significant advantage of party's ctree function over rpart, as explained here.
Confidence intervals for the leaves of the regression tree
Second, if you are looking for an interval for the value in each leaf, then the [0.025, 0.975] quantile interval of the observations in the leaf is most likely what you want. The default plots in party take a similar approach, displaying a boxplot of the output value for each leaf:
library("party")
r2 <- ctree(Sepal.Length ~ .,data=iris)
plot(r2)
Retrieving the corresponding intervals can simply be done by:
iris$leaf <- predict(r2,type="node")
CIleaf <- aggregate(iris[,"Sepal.Length"],by=list(leaf=iris$leaf),quantile,prob=c(0.025,0.25,0.75,0.975))
And it's easy to visualize:
plot(as.factor(CIleaf$leaf),CIleaf[,2],ylab="Sepal length",xlab="Regression tree leaf")
legend("bottomright",c(" 0.975 quantile"," 0.75 quantile"," mean"," 0.25 quantile"," 0.025 quantile"),
pch=c("-","_","_","_","-"),pt.lwd=0.5,pt.cex=c(1,1,2,1,1),xjust=1)
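Since the question is about rpart rather than party, the same per-leaf idea can be sketched for an rpart fit. This is only an illustration under some assumptions: the names trainLeaf, leafStats, nodeTree and newLeaf are made up for the example, the overwrite of frame$yval is a workaround rather than a documented rpart feature, and the normal-approximation interval at the end assumes roughly Gaussian noise within a leaf.

```r
library(rpart)

trainData   <- iris[-150L, ]
predictData <- iris[150L, ]
r1 <- rpart(Sepal.Length ~ ., data = trainData, cp = 0.001)

# r1$where gives, for each training row, the row of r1$frame holding its leaf;
# the row names of r1$frame are the node numbers.
trainLeaf <- as.integer(rownames(r1$frame)[r1$where])

# Per-leaf size, mean, standard deviation and empirical quantiles of the response
leafStats <- do.call(rbind, lapply(
  split(trainData$Sepal.Length, trainLeaf),
  function(y) c(n = length(y), mean = mean(y), sd = sd(y),
                quantile(y, probs = c(0.025, 0.975)))
))

# Workaround: copy the tree and overwrite the fitted values with node numbers,
# so that predict() returns the leaf a new observation falls into.
nodeTree <- r1
nodeTree$frame$yval <- as.integer(rownames(nodeTree$frame))
newLeaf <- predict(nodeTree, newdata = predictData)

# Empirical interval for the leaf of the new observation
newStats <- leafStats[as.character(newLeaf), ]
newStats

# Normal-approximation interval from the leaf mean and standard deviation
normInterval <- unname(newStats["mean"] + c(-1, 1) * qnorm(0.975) * newStats["sd"])
normInterval
```

With few observations per leaf the empirical 2.5% and 97.5% quantiles are crude, so these intervals are best read as rough indications rather than calibrated prediction intervals.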
#2
1
Perhaps one option is a simple bootstrap of your training data?
library(rpart)
library(boot)
trainData <- iris[-150L, ]
predictData <- iris[150L, ]
rboot <- boot(trainData, function(data, idx) {
  bootstrapData <- data[idx, ]
  r1 <- rpart(Sepal.Length ~ ., bootstrapData, cp = 0.001)
  predict(r1, newdata = predictData)
}, 1000L)
quantile(rboot$t, c(0.025, 0.975))
2.5% 97.5%
5.871393 6.766842
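Note that bootstrapping the predicted value captures how much the fitted leaf mean varies between resamples, so the quantiles above are arguably closer to a confidence interval for the prediction than to a prediction interval for a new observation. One possible extension, sketched here as a heuristic and not as part of the answer above, adds a randomly drawn out-of-bag residual to each bootstrap prediction. The names B, sims, oob and res, the seed, and the choice of 200 replicates are assumptions made for the illustration.

```r
library(rpart)

set.seed(1)  # assumption: fixed seed only so the sketch is reproducible
trainData   <- iris[-150L, ]
predictData <- iris[150L, ]

B    <- 200L
sims <- numeric(B)
for (b in seq_len(B)) {
  idx <- sample(nrow(trainData), replace = TRUE)
  fit <- rpart(Sepal.Length ~ ., data = trainData[idx, ], cp = 0.001)

  # Rows not drawn in this resample (out-of-bag) give honest residuals
  oob <- trainData[-unique(idx), ]
  res <- oob$Sepal.Length - predict(fit, newdata = oob)

  # Bootstrap prediction plus one resampled residual
  sims[b] <- predict(fit, newdata = predictData) +
    res[sample.int(length(res), 1L)]
}
predInterval <- quantile(sims, c(0.025, 0.975))
predInterval
```

The resulting interval should be wider than the one from bootstrapping the prediction alone, because it also reflects the scatter of individual observations around the leaf means.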