How do I remove training data from party:::ctree models?

I created several ctree models (about 40 to 80) which I want to evaluate rather often.

One issue is that the model objects are very big (40 models require more than 2.8G of memory). It appears that they store the training data, maybe as modelname@data and modelname@responses, and not just the information needed to predict new data.

Most other R learning packages have configurable options whether to include the data in the model object, but I couldn't find any hints in the documentation. I also tried to assign empty ModelEnv objects by

modelname@data <- new("ModelEnv")

but there was no effect on the size of the respective RData file.

Does anyone know whether ctree really stores the training data, and how to remove everything from a ctree model that is irrelevant for new predictions, so that I can fit many models in memory?

Thanks a lot,

Stefan

Thank you for your feedback, that was already very helpful.

I used dput and str to take a deeper look at the object and found that no training data is included in the model, but there is a responses slot, which seems to have the training labels and rownames. Anyways, I noticed that each node has a weight vector for each training sample. After a while of inspecting the code, I ended up googling a bit and found the following comment in the party NEWS log:

         CHANGES IN party VERSION 0.9-13 (2007-07-23)

o   update `mvt.f'

o   improve the memory footprint of RandomForest objects
    substancially (by removing the weights slots from each node).

It turns out that the party package contains a C function, R_remove_weights, which removes these weights. It is defined as follows:

SEXP R_remove_weights(SEXP subtree, SEXP removestats) {
    C_remove_weights(subtree, LOGICAL(removestats)[0]);
    return(R_NilValue);
}

It also works fine:

# cc is my model object

sum(unlist(lapply(slotNames(cc), function (x)  object.size(slot(cc, x)))))
# returns: [1] 2521256
save(cc, file="cc_before.RData")

.Call("R_remove_weights", cc@tree, TRUE, PACKAGE="party")
# returns NULL and removes weights and node statistics

sum(unlist(lapply(slotNames(cc), function (x)  object.size(slot(cc, x)))))
# returns: [1] 1521392
save(cc, file="cc_after.RData")

As you can see, it reduces the object size substantially, from roughly 2.5MB to 1.5MB.

What is strange, though, is that the corresponding RData files are insanely huge, and there is no impact on them:

$ ls -lh cc*
-rw-r--r-- 1 user user 9.6M Aug 24 15:44 cc_after.RData
-rw-r--r-- 1 user user 9.6M Aug 24 15:43 cc_before.RData

Unzipping the file shows the 2.5MB object to occupy nearly 100MB of space:

$ cp cc_before.RData cc_before.gz
$ gunzip cc_before.gz 
$ ls -lh cc_before*
-rw-r--r-- 1 user user  98M Aug 24 15:45 cc_before

Any ideas what could cause this?

2 Solutions

#1

I found a solution to the problem at hand, so I am writing this answer in case anyone runs into the same issue. I'll describe my process, so it might be a bit rambling; bear with me.

With no clue, I thought about nuking slots and removing weights to get the objects as small as possible and at least save some memory in case no fix could be found. So I removed @data and @responses as a start, and prediction still worked fine without them, yet there was no effect on the .RData file size.

I then went the other way round and created an empty ctree model, just plugging the tree into it:

> library(party)

## create reference predictions for the dataset
> predictions.org <- treeresponse(c1, d)

## save tree object for reference
save(c1, file="testSize_c1.RData")

Checking the size of the original object:

$ ls -lh testSize_c1.RData 
-rw-r--r-- 1 user user 9.6M 2011-08-25 14:35 testSize_c1.RData

Now, let's create an empty CTree and copy the tree only:

## extract the tree only 
> c1Tree <- c1@tree

## create empty tree and plug in the extracted one 
> newCTree <- new("BinaryTree")
> newCTree@tree <- c1Tree

## save tree for reference 
save(newCTree, file="testSize_newCTree.RData")

This new tree object is now much smaller:

$ ls -lh testSize_newCTree.RData 
-rw-r--r-- 1 user user 108K 2011-08-25 14:35 testSize_newCTree.RData

However, it can't be used to predict:

## predict with the new tree
> predictions.new <- treeresponse(newCTree, d)
Error in object@cond_distr_response(newdata = newdata, ...) : 
  unused argument(s) (newdata = newdata)

We did not set the @cond_distr_response slot, which is probably what causes the error, so let's copy the original one as well and try to predict again:

## extract cond_distr_response from original tree
> cdr <- c1@cond_distr_response
> newCTree@cond_distr_response <- cdr

## save tree for reference 
save(newCTree, file="testSize_newCTree_with_cdr.RData")

## predict with the new tree
> predictions.new <- treeresponse(newCTree, d)

## check correctness
> identical(predictions.org, predictions.new)
[1] TRUE

This works perfectly, but now the size of the RData file is back at its original value:

$ ls -lh testSize_newCTree_with_cdr.RData 
-rw-r--r-- 1 user user 9.6M 2011-08-25 14:37 testSize_newCTree_with_cdr.RData

Simply printing the slot shows it to be a function bound to an environment:

> c1@cond_distr_response
function (newdata = NULL, mincriterion = 0, ...) 
{
    wh <- RET@get_where(newdata = newdata, mincriterion = mincriterion)
    response <- object@responses
    if (any(response@is_censored)) {
        swh <- sort(unique(wh))
        RET <- vector(mode = "list", length = length(wh))
        resp <- response@variables[[1]]
        for (i in 1:length(swh)) {
            w <- weights * (where == swh[i])
            RET[wh == swh[i]] <- list(mysurvfit(resp, weights = w))
        }
        return(RET)
    }
    RET <- .Call("R_getpredictions", tree, wh, PACKAGE = "party")
    return(RET)
}
<environment: 0x44e8090>

So the answer to the initial question appears to be that the methods of the object bind an environment to it, which is then saved with the object in the corresponding RData file. This might also explain why several packages are loaded when the RData file is read.
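The effect can be reproduced without party. The following is a minimal base-R sketch, not party code: make_predictor, training and coef are hypothetical names chosen purely for illustration. It also suggests why object.size() gave no warning: per its documentation, the environment of a function is not included in the calculation, whereas serialization (which save() relies on) does write the enclosing environment.

```r
# Hypothetical constructor: returns a small function whose enclosing
# environment also holds a large vector (a stand-in for training data).
make_predictor <- function() {
  training <- rnorm(1e6)   # about 8 MB of doubles
  coef <- mean(training)
  function(x) x + coef     # only `coef` is needed to predict
}

f <- make_predictor()

# Serializing the closure also serializes its enclosing environment,
# so `training` is written out along with the function.
size_with_env <- length(serialize(f, NULL))

# Rebind a copy of the function to a fresh environment that holds
# only the value it actually uses.
g <- f
environment(g) <- list2env(list(coef = environment(f)$coef),
                           parent = globalenv())
size_trimmed <- length(serialize(g, NULL))

print(c(with_env = size_with_env, trimmed = size_trimmed))
```

Both functions return the same values, but only the trimmed one is cheap to store. Whether the same rebinding trick is safe for the party methods is untested here, since they refer to several variables in their environment.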

Thus, to get rid of the environment, we can't copy the methods, but we can't predict without them either. The rather "dirty" solution is to emulate the functionality of the original methods and call the underlying C code directly. After some digging through the source code, this is indeed possible. As the code copied above suggests, we need to call get_where, which determines the terminal node of the tree reached by the input. We then need to call R_getpredictions to determine the response from that terminal node for each input sample. The tricky part is that we need to get the data in the right input format and thus have to call the data preprocessing included in ctree:

## create a character string of the formula which was used to fit the tree
## (there might be a neater way to do this)
> library(stringr)
> org.formula <- str_c(
                   do.call(str_c, as.list(deparse(c1@data@formula$response[[2]]))),
                   "~", 
                   do.call(str_c, as.list(deparse(c1@data@formula$input[[2]]))))

## call the internal ctree preprocessing 
> data.dpp <- party:::ctreedpp(as.formula(org.formula), d)

## create the data object necessary for the ctree C code
> data.ivf <- party:::initVariableFrame.df(data.dpp@menv@get("input"), 
                                           trafo = ptrafo)

## now call the tree traversal routine, note that it only requires the tree
## extracted from the @tree slot, not the whole object
> nodeID <- .Call("R_get_nodeID", c1Tree, data.ivf, 0, PACKAGE = "party")

## now determine the respective responses
> predictions.syn <- .Call("R_getpredictions", c1Tree, nodeID, PACKAGE = "party")

## check correctness
> identical(predictions.org, predictions.syn)
[1] TRUE

We now only need to save the extracted tree and the formula string to be able to predict new data:

> save(c1Tree, org.formula, file="testSize_extractedObjects.RData")

We can further remove the unnecessary weights as described in the updated question above:

> .Call("R_remove_weights", c1Tree, TRUE, PACKAGE="party")
> save(c1Tree, org.formula, file="testSize_extractedObjects__removedWeights.RData")

Now let's have a look at the file sizes again:

$ ls -lh testSize_extractedObjects*
-rw-r--r-- 1 user user 109K 2011-08-25 15:31 testSize_extractedObjects.RData
-rw-r--r-- 1 user user  43K 2011-08-25 15:31 testSize_extractedObjects__removedWeights.RData

In the end, instead of 9.6M (compressed), only 43K are required to use the model. I should now be able to fit as many models as I want in my 3G heap space. Hooray!

#2

What you're looking for is to remove slots. A word of caution: this could be rather dangerous given how party functions work with the object.

Nonetheless, take a look at slotNames(yourModel). You can also try object.size(slot(yourModel, slotNameOfInterest)) to examine the size of different slots. You could easily create a sorted table to be sure of the sizes of objects in each slot.
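A sorted table of slot sizes could be built roughly like this. It is only a sketch on a made-up S4 class: DemoModel and its slots are hypothetical stand-ins for yourModel, so that the example runs without party installed.

```r
library(methods)

# Hypothetical S4 class standing in for a fitted model object
setClass("DemoModel",
         slots = c(data = "numeric", tree = "list", label = "character"))
m <- new("DemoModel", data = rnorm(1e5), tree = list(1, 2), label = "demo")

# Size in bytes of each slot, named by slot
slot_sizes <- sapply(slotNames(m),
                     function(s) as.numeric(object.size(slot(m, s))))

# Largest slots first
sorted_sizes <- sort(slot_sizes, decreasing = TRUE)
print(sorted_sizes)
```

Keep in mind that object.size() does not count the environment attached to a function, so a slot holding a closure can look small here and still be large when saved.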

In any case, the slot for data is a ModelEnvFormula (I'll call this "MEF") object. You could create a dummy MEF: dummyMEF <- ModelEnvFormula(1 ~ 1) and then assign that to data: slot(yourModel, "data") <- dummyMEF.

That will nuke that particular slot. You should take a look to see if there are other slots that are causing headaches in terms of the storage - the object.size() function will assist. I agree that it's nice to be able to omit training data from the model object.
