Reading global variables with foreach in R

Time: 2022-06-01 22:19:59

I am trying to run a foreach loop on a Windows server with a 16-core CPU and 64 GB of RAM using RStudio (with the doParallel package).

The "worker" processes copy all the variables from outside the foreach loop (I observed this by watching the processes start up in Windows Task Manager when the foreach loop runs), which bloats the memory used by each process. I tried declaring some of the especially large variables as global, while ensuring that these variables were only read from, and never written to, inside the foreach loop to avoid conflicts. However, the processes still quickly use up all available memory.

Is there a mechanism to ensure that the "worker" processes do not create copies of some of the "read-only" variables, such as a specific way to declare them?

1 Answer

#1


15  

The doParallel package will auto-export to the workers any variables that are referenced in the foreach loop. If you don't want it to do that, you can use the foreach ".noexport" option to prevent it from auto-exporting particular variables. But if I understand you correctly, your problem is that R is subsequently duplicating some of those variables, which is even more of a problem than usual since it is happening in multiple processes on a single machine.
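
As a rough sketch of how ".noexport" might be used: if each worker already holds its own copy of an object (here created on the workers with clusterEvalQ), listing its name in ".noexport" stops foreach from shipping the master's copy as well. The name `lookup`, its contents, and the two-worker cluster are illustrative assumptions, not something from the question. The sketch also assumes that the loop body falls back to the worker's global environment for names that were not exported, which is how the snow-style backend appears to behave.

```r
library(doParallel)

cl <- makeCluster(2)          # assumption: two local workers
registerDoParallel(cl)

# Hypothetical object: the master copy and the worker copies differ on purpose,
# so we can see which one the loop body actually reads.
lookup <- "master copy"
invisible(clusterEvalQ(cl, { lookup <- "worker copy"; NULL }))

res <- foreach(i = 1:2, .combine = 'c', .noexport = 'lookup') %dopar% {
    lookup                    # not exported, so looked up on the worker itself
}

stopCluster(cl)
print(res)
```

Note that this only controls what is sent to the workers; it does nothing about duplication that happens later inside a worker.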

There isn't a way to declare a variable so that R will never make a duplicate of it. You either need to replace the problem variables with objects from a package like bigmemory so that copies are never made, or you can try modifying the code in such a way as to not trigger the duplication. You can use the tracemem function to help you, since it will print a message whenever the traced object is duplicated.
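
A minimal sketch of using tracemem to see where a duplication is triggered. The vector `x` is just a stand-in for one of the large read-only variables, and the guard assumes that tracemem is only usable when R was built with memory profiling, which `capabilities("profmem")` reports.

```r
x <- runif(1e6)                     # stand-in for a large read-only object
traced <- capabilities("profmem")   # tracemem needs memory profiling support
if (traced) invisible(tracemem(x))  # start reporting duplications of x

y <- x             # no copy yet: both names refer to the same memory
total <- sum(x)    # reading the data does not duplicate it
y[1] <- 0          # writing forces a duplicate; tracemem reports it here

if (traced) untracemem(x)
```

Running the body of a foreach loop sequentially with the large object traced is one way to find the statement that causes the copy.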

However, you may be able to avoid the problem by reducing the data that is needed by the workers. That reduces the amount of data that needs to be copied to each of the workers, as well as decreasing their memory footprint.

Here is a classic example of giving the workers more data than they need:

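library(doParallel)
registerDoParallel(cores=2)  # two workers here; adjust to your machine
x <- matrix(1:100, 10)
# x is referenced in the loop body, so the whole matrix is exported to every worker
foreach(i=1:10, .combine='c') %dopar% {
    mean(x[,i])
}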

Since the matrix x is referenced in the foreach loop, it will be auto-exported to each of the workers, even though each worker only needs a subset of the columns. The simplest solution is to iterate over the actual columns of the matrix rather than over column indices:

# foreach iterates over a matrix by column by default, so xc is one column of x
foreach(xc=x, .combine='c') %dopar% {
    mean(xc)
}

Not only is less data transferred to the workers, but each of the workers only actually needs to have one column in memory at a time, which greatly decreases its memory footprint for large matrices. The xc vector may still end up being duplicated, but it doesn't hurt nearly as much because it is much smaller than x.

Note that this technique only helps when doParallel uses the "snow-derived" functions, such as parLapply and clusterApplyLB, not when it uses mclapply. With mclapply, the technique can even make the loop a bit slower: the forked workers all get the matrix x for free, so there is no point in sending columns to workers that already have the entire matrix. However, on Windows, doParallel can't use mclapply, so this technique is very important.
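
To make the distinction concrete, here is a sketch of how the backend is typically selected. The worker count of 2 is arbitrary, and the branch on `.Platform$OS.type` is only for illustration.

```r
library(doParallel)

cl <- NULL
if (.Platform$OS.type == "windows") {
    # No fork on Windows: start a snow-style cluster of separate R processes.
    cl <- makeCluster(2)
    registerDoParallel(cl)
} else {
    # On Unix-alikes, asking for cores (no cluster object) selects the
    # fork-based mclapply path, where workers inherit the master's memory.
    registerDoParallel(cores = 2)
}

getDoParName()      # reports which backend foreach will use
getDoParWorkers()   # number of workers registered

x <- matrix(1:100, 10)
res <- foreach(xc = x, .combine = 'c') %dopar% mean(xc)

if (!is.null(cl)) stopCluster(cl)
```

The loop itself is identical in both cases; only the way the data reaches the workers differs.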

The important thing is to think about what data is really needed by the workers in order to perform their work and to try to decrease it if possible. Sometimes you can do that by using special iterators, either from the iterators or itertools packages, but you may also be able to do that by changing your algorithm.
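
As one possible illustration of a special iterator, the iter function from the iterators package can hand out blocks of columns, so each task receives a few columns rather than the whole matrix or a single column. The chunk size of 5 and the two-worker cluster are arbitrary assumptions for this sketch.

```r
library(doParallel)
library(iterators)

cl <- makeCluster(2)          # assumption: two local workers
registerDoParallel(cl)

x <- matrix(1:100, 10)

# Each task receives a block of 5 columns instead of the whole matrix.
res <- foreach(xblock = iter(x, by = 'column', chunksize = 5),
               .combine = 'c') %dopar% {
    colMeans(xblock)
}

stopCluster(cl)
print(res)
```

Larger chunks mean fewer, bigger transfers and fewer tasks; smaller chunks keep each worker's footprint lower, so the right size depends on the data.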
