How to speed up parallel cluster processing

Time: 2021-02-26 03:54:00

I'm new to cluster processing and could use some advice on how to better prepare my data and/or my calls to functions from the parallel package. I have read through the parallel package vignettes, so I have a vague idea of what's going on.

The function I want to parallelize calls the 2-D interpolation tool akima::interp. My input consists of 3 matrices (or vectors; it's all the same in R): one contains the x-coordinates, one the y-coordinates, and one the "z", or data values, for a set of sample points. interp uses these to produce interpolated data on a regular grid so I can, e.g., plot the field. Once I have these 3 items set up, I cut them into "chunks" and feed them to clusterApply to execute interp chunk by chunk.

I'm using a Windows 7 machine with an i7 CPU (8 cores). Here's the summary output from Rprof for an input data set with 1e6 points (1000x1000 if you like), mapped onto a 1000x1000 output grid.

So my questions are: 1) It appears that "unserialize" is taking most of the time. What is this operation, and how could its cost be reduced? 2) In general, since each worker loads the default .RData file, would I gain any speed by first saving all input data to .RData so that it doesn't need to be passed to the workers? 3) Is there anything else I'm simply unaware of that I should have done differently?

Note: the sin, atan2, cos, +, max, min functions take place prior to the clusterApply call I make.

Rgames> summaryRprof('bigprof.txt')
$by.self
                   self.time self.pct total.time total.pct
"unserialize"         329.04    99.11     329.04     99.11
"socketConnection"      1.74     0.52       1.74      0.52
"serialize"             0.96     0.29       0.96      0.29
"sin"                   0.06     0.02       0.06      0.02
"atan2"                 0.04     0.01       0.06      0.02
"cos"                   0.04     0.01       0.04      0.01
"+"                     0.02     0.01       0.02      0.01
"max"                   0.02     0.01       0.02      0.01
"min"                   0.02     0.01       0.02      0.01
"row"                   0.02     0.01       0.02      0.01
"writeLines"            0.02     0.01       0.02      0.01

$by.total
                     total.time total.pct self.time self.pct
"mcswirl"                331.98    100.00      0.00     0.00
"clusterApply"           330.00     99.40      0.00     0.00
"staticClusterApply"     330.00     99.40      0.00     0.00
"FUN"                    329.06     99.12      0.00     0.00
"unserialize"            329.04     99.11    329.04    99.11
"lapply"                 329.04     99.11      0.00     0.00
"recvData"               329.04     99.11      0.00     0.00
"recvData.SOCKnode"      329.04     99.11      0.00     0.00
"makeCluster"              1.76      0.53      0.00     0.00
"makePSOCKcluster"         1.76      0.53      0.00     0.00
"newPSOCKnode"             1.76      0.53      0.00     0.00
"socketConnection"         1.74      0.52      1.74     0.52
"serialize"                0.96      0.29      0.96     0.29
"postNode"                 0.96      0.29      0.00     0.00
"sendCall"                 0.96      0.29      0.00     0.00
"sendData"                 0.96      0.29      0.00     0.00
"sendData.SOCKnode"        0.96      0.29      0.00     0.00
"sin"                      0.06      0.02      0.06     0.02
"atan2"                    0.06      0.02      0.04     0.01
"cos"                      0.04      0.01      0.04     0.01
"+"                        0.02      0.01      0.02     0.01
"max"                      0.02      0.01      0.02     0.01
"min"                      0.02      0.01      0.02     0.01
"row"                      0.02      0.01      0.02     0.01
"writeLines"               0.02      0.01      0.02     0.01
"outer"                    0.02      0.01      0.00     0.00
"system"                   0.02      0.01      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 331.98

1 answer

#1


When clusterApply is called, it first sends a task to each of the cluster workers, and then waits for each of them to return the corresponding result. If there are more tasks to do, it repeats that procedure until all of the tasks are complete.
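The round-by-round dispatch described above can be observed with a small experiment. This is only a sketch, assuming a local two-worker socket cluster; the task simply reports the process ID of the worker that ran it.

```r
library(parallel)

# Assumption: a local socket cluster with two workers.
cl <- makeCluster(2)

# Six tasks on two workers; each task returns the process ID
# of the worker that executed it.
pids <- unlist(clusterApply(cl, 1:6, function(i) Sys.getpid()))
stopCluster(cl)

# Tasks are handed out in rounds of length(cl), so the two
# worker IDs alternate: tasks 1, 3, 5 share one worker and
# tasks 2, 4, 6 share the other.
print(pids)
```

Because each round has to finish before the next one starts, one slow task in a round keeps the other worker idle until it completes.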

The function it uses to wait for a result from a particular worker is recvResult, which ultimately calls unserialize to read data from the socket connected to that worker. So if the master process is spending most of its time in unserialize, it is spending most of its time waiting for the cluster workers to return their task results, which is what you would hope to see on the master. If it were spending a lot of time in serialize, that would mean it was spending a lot of time sending the tasks to the workers, which would be a bad sign.

Unfortunately, you can't tell how much of the time in unserialize is spent blocking, waiting for the result data to arrive, and how much is spent actually transferring that data. The results might be quick to compute but huge, or they might take a long time to compute and be tiny: there's no way to tell from the profiling data.
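One way to get at the missing information is to measure on the worker side rather than on the master. The sketch below is not part of the parallel package: slow_task is a hypothetical stand-in for the real per-chunk work (such as a call to akima::interp), and timed_task is a hypothetical wrapper that records the compute time and the serialized size of each result.

```r
library(parallel)

# Hypothetical stand-in for the real per-chunk computation.
slow_task <- function(n) {
  Sys.sleep(0.2)       # pretend to compute for a while
  matrix(0, n, n)      # the result that travels back to the master
}

# Hypothetical wrapper: runs f(n) on the worker and records how long
# it took and how many bytes the serialized result occupies.
timed_task <- function(n, f) {
  elapsed <- system.time(value <- f(n))[["elapsed"]]
  bytes <- length(serialize(value, NULL))
  list(value = value, elapsed = elapsed, bytes = bytes)
}

cl <- makeCluster(2)
res <- clusterApply(cl, c(100, 200), timed_task, f = slow_task)
stopCluster(cl)

worker_seconds <- sapply(res, `[[`, "elapsed")
result_bytes <- sapply(res, `[[`, "bytes")
print(data.frame(worker_seconds, result_bytes))
```

If the worker times add up to roughly the wall-clock time seen on the master, the workers are the bottleneck; if they are much smaller, the time is more likely going into moving large results across the sockets.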

So to make unserialize execute faster, you need to make the workers compute their results faster, or make the results smaller, if that's possible. In addition, it might help to pass the useXDR=FALSE option to makeCluster. That can improve performance by skipping XDR encoding of your data, which makes both serialize and unserialize faster.
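A rough way to check whether useXDR=FALSE matters for a given workload is to time the same job with both settings. This is a sketch under assumptions: big_result is a hypothetical task that is cheap to compute but returns a large matrix, and the cluster is a local two-worker socket cluster. Timings vary by machine, so no particular outcome is implied.

```r
library(parallel)

# Hypothetical task: cheap to compute, large to send back.
big_result <- function(i) matrix(runif(1e6), 1000, 1000)

# Time four tasks on a fresh two-worker cluster created with
# the given useXDR setting.
time_cluster <- function(use_xdr) {
  cl <- makeCluster(2, useXDR = use_xdr)
  on.exit(stopCluster(cl))
  system.time(res <- clusterApply(cl, 1:4, big_result))[["elapsed"]]
}

xdr_seconds <- time_cluster(TRUE)
native_seconds <- time_cluster(FALSE)
print(c(xdr = xdr_seconds, native = native_seconds))
```

The option only makes sense when the master and all workers use the same byte order, which is the case when everything runs on one machine.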

I don't think it will help to save all input data to .RData, since you're not spending much time sending data to the workers, as shown by the short time spent in the serialize function. I suspect that would slow you down a little bit.

The only other advice I can think of is to try using parLapply or clusterApplyLB, rather than clusterApply. I recommend using parLapply unless you have a specific reason to use one of the other functions since parLapply is often the most efficient. clusterApplyLB is useful when you have tasks that take a long but variable length of time to execute.
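The two alternatives are drop-in replacements for clusterApply, as the following sketch shows. variable_task is a hypothetical task whose run time depends on its input, and the delays are made up for illustration.

```r
library(parallel)

# Hypothetical task whose run time varies with its input.
variable_task <- function(s) {
  Sys.sleep(s)
  s * 2
}

cl <- makeCluster(2)
delays <- c(0.4, 0.1, 0.1, 0.1, 0.1)

# parLapply splits the inputs into one chunk per worker up front,
# so each worker receives a single message with all of its tasks.
res_par <- parLapply(cl, delays, variable_task)

# clusterApplyLB sends one task at a time and gives the next task
# to whichever worker finishes first.
res_lb <- clusterApplyLB(cl, delays, variable_task)
stopCluster(cl)

print(unlist(res_par))
print(unlist(res_lb))
```

Both calls return the results in input order, so the choice affects only scheduling and communication overhead, not the output.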
