Does anyone know how gbm in R handles missing values? I can't seem to find any explanation using Google.
5 Answers
#1
11
To explain what gbm does with missing predictors, let's first visualize a single tree of a gbm object.
Suppose you have a gbm object mygbm. Using pretty.gbm.tree(mygbm, i.tree=1) you can visualize the first tree of mygbm, e.g.:
SplitVar SplitCodePred LeftNode RightNode MissingNode ErrorReduction Weight Prediction
0 46 1.629728e+01 1 5 9 26.462908 1585 -4.396393e-06
1 45 1.850000e+01 2 3 4 11.363868 939 -4.370936e-04
2 -1 2.602236e-04 -1 -1 -1 0.000000 271 2.602236e-04
3 -1 -7.199873e-04 -1 -1 -1 0.000000 668 -7.199873e-04
4 -1 -4.370936e-04 -1 -1 -1 0.000000 939 -4.370936e-04
5 20 0.000000e+00 6 7 8 8.638042 646 6.245552e-04
6 -1 3.533436e-04 -1 -1 -1 0.000000 483 3.533436e-04
7 -1 1.428207e-03 -1 -1 -1 0.000000 163 1.428207e-03
8 -1 6.245552e-04 -1 -1 -1 0.000000 646 6.245552e-04
9 -1 -4.396393e-06 -1 -1 -1 0.000000 1585 -4.396393e-06
See the gbm documentation for details. Each row corresponds to a node, and the first (unnamed) column is the node number. We see that each node has a left and a right node (both set to -1 when the node is a leaf). We also see that each node has an associated MissingNode.
To run an observation down the tree, we start at node 0. If an observation has a missing value on SplitVar = 46, then it will be sent down the tree to MissingNode = 9. The tree's prediction for such an observation will be SplitCodePred = -4.396393e-06, which is the same prediction the tree had before any split was made at node zero (Prediction = -4.396393e-06 for node zero).
The procedure is similar for other nodes and split variables.
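To reproduce something like the table above, here is a minimal sketch (the toy data, variable names, and settings are illustrative assumptions, not from the question; it also assumes tree 1 actually splits at the root):

library(gbm)
set.seed(1)
n <- 1000
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- rbinom(n, 1, plogis(d$x1 - d$x2))
d$x1[sample(n, 100)] <- NA  # inject missing values into one predictor
mygbm <- gbm(y ~ x1 + x2, data = d, distribution = "bernoulli",
             n.trees = 50, interaction.depth = 2)
t1 <- pretty.gbm.tree(mygbm, i.tree = 1)
t1
# Row names are node numbers, so node 0's MissingNode points at the row whose
# SplitCodePred is the prediction given to an observation missing the split variable:
miss <- t1$MissingNode[1]
t1[as.character(miss), "SplitCodePred"]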
#2
4
It appears to send missing values to a separate node within each tree. If you have a gbm object called "mygbm" then you'll see by typing "pretty.gbm.tree(mygbm, i.tree = 1)" that for each split in the tree there is a LeftNode, a RightNode, and a MissingNode. This implies that (assuming you have interaction.depth = 1) each tree will have 3 terminal nodes (one for each side of the split and one for where the predictor is missing).
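As a quick check (a sketch, assuming mygbm was fit with interaction.depth = 1):

# Terminal nodes are the rows with SplitVar == -1; a single-split tree should
# have 3 of them: the left leaf, the right leaf, and the missing-value leaf.
t1 <- pretty.gbm.tree(mygbm, i.tree = 1)
sum(t1$SplitVar == -1)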
#3
1
Start with the source code then. Just typing gbm at the console shows you the source code:
function (formula = formula(data), distribution = "bernoulli",
    data = list(), weights, var.monotone = NULL, n.trees = 100,
    interaction.depth = 1, n.minobsinnode = 10, shrinkage = 0.001,
    bag.fraction = 0.5, train.fraction = 1, cv.folds = 0, keep.data = TRUE,
    verbose = TRUE)
{
    mf <- match.call(expand.dots = FALSE)
    m <- match(c("formula", "data", "weights", "offset"), names(mf), 0)
    mf <- mf[c(1, m)]
    mf$drop.unused.levels <- TRUE
    mf$na.action <- na.pass
    mf[[1]] <- as.name("model.frame")
    mf <- eval(mf, parent.frame())
    Terms <- attr(mf, "terms")
    y <- model.response(mf, "numeric")
    w <- model.weights(mf)
    offset <- model.offset(mf)
    var.names <- attributes(Terms)$term.labels
    x <- model.frame(terms(reformulate(var.names)), data, na.action = na.pass)
    response.name <- as.character(formula[[2]])
    if (is.character(distribution))
        distribution <- list(name = distribution)
    cv.error <- NULL
    if (cv.folds > 1) {
        if (distribution$name == "coxph")
            i.train <- 1:floor(train.fraction * nrow(y))
        else i.train <- 1:floor(train.fraction * length(y))
        cv.group <- sample(rep(1:cv.folds, length = length(i.train)))
        cv.error <- rep(0, n.trees)
        for (i.cv in 1:cv.folds) {
            if (verbose)
                cat("CV:", i.cv, "\n")
            i <- order(cv.group == i.cv)
            gbm.obj <- gbm.fit(x[i.train, , drop = FALSE][i, , drop = FALSE],
                y[i.train][i], offset = offset[i.train][i],
                distribution = distribution,
                w = ifelse(w == NULL, NULL, w[i.train][i]),
                var.monotone = var.monotone, n.trees = n.trees,
                interaction.depth = interaction.depth,
                n.minobsinnode = n.minobsinnode, shrinkage = shrinkage,
                bag.fraction = bag.fraction,
                train.fraction = mean(cv.group != i.cv),
                keep.data = FALSE, verbose = verbose,
                var.names = var.names, response.name = response.name)
            cv.error <- cv.error + gbm.obj$valid.error * sum(cv.group == i.cv)
        }
        cv.error <- cv.error/length(i.train)
    }
    gbm.obj <- gbm.fit(x, y, offset = offset, distribution = distribution,
        w = w, var.monotone = var.monotone, n.trees = n.trees,
        interaction.depth = interaction.depth, n.minobsinnode = n.minobsinnode,
        shrinkage = shrinkage, bag.fraction = bag.fraction,
        train.fraction = train.fraction, keep.data = keep.data,
        verbose = verbose, var.names = var.names,
        response.name = response.name)
    gbm.obj$Terms <- Terms
    gbm.obj$cv.error <- cv.error
    gbm.obj$cv.folds <- cv.folds
    return(gbm.obj)
}
<environment: namespace:gbm>
A quick read suggests that the data is put into a model frame and that NAs are handled with na.pass, so in turn see ?na.pass. Reading that, it looks like nothing special is done with them, but you'd probably have to read up on the whole fitting process to see what that means in the long run. It looks like you might also need to look at the code of gbm.fit and so on.
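To see what na.pass changes at the model-frame step, a minimal sketch with toy data (independent of gbm):

d <- data.frame(y = c(1, 0, 1), x = c(2.5, NA, 7.1))
# The usual default na.action (na.omit) drops the incomplete row:
nrow(model.frame(y ~ x, data = d))                       # 2
# na.pass keeps it, so the NA reaches gbm.fit untouched:
nrow(model.frame(y ~ x, data = d, na.action = na.pass))  # 3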
#4
1
The official gbm guide introduces missing values into the test data, so I would assume the package is coded to handle them.
#5
1
The gbm package in particular deals with NAs (missing values) as follows. The algorithm works by building and serially combining classification or regression trees. These so-called base-learner trees are built by divvying observations up into Left and Right splits (@user2332165 is right). There is also a separate Missing node type in gbm: if a row or observation has no value for the split variable, it is sent down that Missing branch. (This differs from the surrogate-split method that CART-style implementations such as rpart use.)
If you want to understand surrogate splitting better, I recommend reading the rpart package vignette.
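For contrast with gbm's Missing node, here is a minimal rpart sketch (using the built-in airquality data, which has NAs in its predictors):

library(rpart)
# rpart routes rows whose primary split variable is missing via surrogate splits;
# summary() reports the surrogate splits chosen at each node.
fit <- rpart(Ozone ~ Solar.R + Wind + Temp, data = airquality)
summary(fit)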