Is there a parallel implementation of GBM in R?

Time: 2021-09-13 21:20:06

I use the gbm library in R and I would like to use all my CPU cores to fit a model.


# excerpt of the gbm.fit() signature
gbm.fit(x, y,
        offset = NULL,
        misc = NULL,
        ...)

2 solutions

#1



As far as I know, both h2o and xgboost have parallel GBM implementations.


For h2o, see e.g. this blog post of theirs from 2013, from which I quote:


At 0xdata we build state-of-the-art distributed algorithms - and recently we embarked on building GBM, an algorithm notorious for being impossible to parallelize, much less distribute. We built the algorithm shown in Elements of Statistical Learning II, Trevor Hastie, Robert Tibshirani, and Jerome Friedman on page 387 (shown at the bottom of this post). Most of the algorithm is straightforward “small” math, but step 2.b.ii says “Fit a regression tree to the targets….”, i.e. fit a regression tree in the middle of the inner loop, for targets that change with each outer loop. This is where we decided to distribute/parallelize.


The platform we build on is H2O, and as talked about in the prior blog has an API focused on doing big parallel vector operations - and for GBM (and also Random Forest) we need to do big parallel tree operations. But not really any tree operation; GBM (and RF) constantly build trees - and the work is always at the leaves of a tree, and is about finding the next best split point for the subset of training data that falls into a particular leaf.


The code can be found on our git: http://0xdata.github.io/h2o/


(Edit: The repo now is at https://github.com/h2oai/.)

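For what it's worth, here is a minimal sketch of fitting a GBM with the h2o R package; the data frame train and its response column "y" are illustrative assumptions, not part of the original post:

library(h2o)

# start a local H2O cluster; nthreads = -1 asks it to use all available cores
h2o.init(nthreads = -1)

# 'train' is assumed to be a data.frame with a response column "y"
train_h2o <- as.h2o(train)

fit <- h2o.gbm(x = setdiff(names(train), "y"),
               y = "y",
               training_frame = train_h2o,
               ntrees = 100)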

The other parallel GBM implementation is, I think, in xgboost. Its description says:


Extreme Gradient Boosting, which is an efficient implementation of gradient boosting framework. This package is its R interface. The package includes efficient linear model solver and tree learning algorithms. The package can automatically do parallel computation on a single machine which could be more than 10 times faster than existing gradient boosting packages. It supports various objective functions, including regression, classification and ranking. The package is made to be extensible, so that users are also allowed to define their own objectives easily.

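As a rough illustration of how that parallelism is exposed in the R package (the data and parameter values below are made up for this sketch; in older xgboost versions the regression objective is spelled "reg:linear" rather than "reg:squarederror"):

library(xgboost)

# illustrative data: a numeric feature matrix and a numeric response
x <- matrix(rnorm(1000 * 20), ncol = 20)
y <- rnorm(1000)

# nthread controls how many CPU cores are used while growing each tree
fit <- xgboost(data = x,
               label = y,
               nrounds = 100,
               nthread = 4,
               objective = "reg:squarederror",
               verbose = 0)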

#2



Well, there cannot be a parallel implementation of GBM in principle, neither in R nor in any other package. And the reason is very simple: the boosting algorithm is by definition sequential.


Consider the following, quoted from The Elements of Statistical Learning, Ch. 10 (Boosting and Additive Trees), pp. 337-339 (emphasis mine):


A weak classifier is one whose error rate is only slightly better than random guessing. The purpose of boosting is to sequentially apply the weak classification algorithm to repeatedly modified versions of the data, thereby producing a sequence of weak classifiers Gm(x), m = 1, 2, . . . , M. The predictions from all of them are then combined through a weighted majority vote to produce the final prediction. [...] Each successive classifier is thereby forced to concentrate on those training observations that are missed by previous ones in the sequence.

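To spell out the sequential dependence (my paraphrase of ESL eq. 10.1, not part of the quote): the final classifier is the weighted vote G(x) = sign(α1G1(x) + α2G2(x) + ... + αMGM(x)), but each Gm is fit on training data reweighted according to the errors of Gm-1. Iteration m therefore cannot begin until iteration m-1 has finished, which is exactly why the M trees cannot be grown in parallel.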

In a picture (ibid, p. 338):


[figure: schematic of AdaBoost, reproduced from ESL p. 338]

In fact, this is frequently noted as a key disadvantage of GBM relative to, say, Random Forest (RF), where the trees are independent and can thus be fitted in parallel (see the bigrf R package).


Hence, the best you can do, as the commenters above have pinpointed, is to use your excess CPU cores to parallelize the cross-validation process...

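For instance, a minimal sketch with gbm itself (the data here are fabricated; the n.cores argument is available in recent gbm versions and parallelizes only the cross-validation folds, not the boosting iterations):

library(gbm)

# illustrative data: a data.frame with a numeric response y and 10 predictors
set.seed(1)
df <- data.frame(y = rnorm(500), matrix(rnorm(500 * 10), ncol = 10))

# the boosting iterations still run sequentially within each model;
# the cv.folds = 5 fold models are farmed out across n.cores = 4 cores
fit <- gbm(y ~ .,
           data = df,
           distribution = "gaussian",
           n.trees = 500,
           cv.folds = 5,
           n.cores = 4)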
