Should you use R's detectCores function to specify the number of cores for parallel processing?

Time: 2021-03-06 13:51:00

In the help for detectCores() it says:

This is not suitable for use directly for the mc.cores argument of mclapply nor specifying the number of cores in makeCluster. First because it may return NA, and second because it does not give the number of allowed cores.

However, I've seen quite a bit of sample code like the following:

library(parallel)

# Seven 1000 x 1000 random matrices to invert in parallel
k <- 1000
m <- lapply(1:7, function(X) matrix(rnorm(k^2), nrow=k))

# One forked worker per detected core, minus one
cl <- makeCluster(detectCores() - 1, type = "FORK")
test <- parLapply(cl, m, solve)
stopCluster(cl)

where detectCores() is used to specify the number of cores in makeCluster.

My use cases involve running parallel processing both on my own multicore laptop (OSX) and on various multicore servers (Linux). So I wasn't sure whether there is a better way to specify the number of cores, or whether that advice about not using detectCores is aimed more at package developers, whose code is meant to run across a wide range of hardware and OS environments.

So in summary:

  • Should you use the detectCores function in R to specify the number of cores for parallel processing?
  • What is the distinction between detected and allowed cores, and when is it relevant?

3 solutions

#1 (score: 13)

I think it's perfectly reasonable to use detectCores as a starting point for the number of workers/processes when calling mclapply or makeCluster. However, there are many reasons that you may want or need to start fewer workers, and even some cases where you can reasonably start more.

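As a minimal sketch of that "starting point" idea (not from the original answer), one can guard against the NA case mentioned in the help text and leave a core free:

library(parallel)

nc <- detectCores()                        # may be NA on some platforms
n_workers <- if (is.na(nc)) 2L else max(1L, nc - 1L)

cl <- makeCluster(n_workers)
# ... parLapply(cl, ...) as usual ...
stopCluster(cl)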

On some hyperthreaded machines it may not be a good idea to set mc.cores=detectCores(), for example. Or if your script is running on an HPC cluster, you shouldn't use any more resources than the job scheduler has allocated to your job. You also have to be careful in nested parallel situations, as when your code may be executed in parallel by a calling function, or you're executing a multithreaded function in parallel. In general, it's a good idea to run some preliminary benchmarks before starting a long job to determine the best number of workers. I usually monitor the benchmark with top to see if the number of processes and threads makes sense, and to verify that the memory usage is reasonable.

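As a sketch of respecting a scheduler allocation (the helper below is illustrative, not a standard API; the environment variable names are the ones mentioned in answer #3):

pick_workers <- function() {
  # Prefer an explicit scheduler allocation if one is visible in the environment
  alloc <- Sys.getenv(c("SLURM_CPUS_PER_TASK", "NSLOTS", "PBS_NUM_PPN"), unset = NA)
  alloc <- suppressWarnings(as.integer(alloc))
  alloc <- alloc[!is.na(alloc)]
  if (length(alloc) > 0) return(alloc[[1]])
  # Otherwise fall back to the hardware count, leaving one core free
  max(1L, parallel::detectCores() - 1L, na.rm = TRUE)
}

pick_workers()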

The advice that you quoted is particularly appropriate for package developers. It's certainly a bad idea for a package developer to always start detectCores() workers when calling mclapply or makeCluster, so it's best to leave the decision up to the end user. At least the package should allow the user to specify the number of workers to start, but arguably detectCores() isn't even a good default value. That's why the default value for mc.cores changed from detectCores() to getOption("mc.cores", 2L) when mclapply was included in the parallel package.

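A small illustration of how that default behaves for an end user (assuming a Unix-alike, since mclapply relies on forking):

library(parallel)

# Unless the user has set the option elsewhere, the shipped default is 2 workers:
getOption("mc.cores", 2L)

# The end user, not the package, can opt in to more workers for their session:
options(mc.cores = max(1L, detectCores() - 1L, na.rm = TRUE))
unique(unlist(mclapply(1:8, function(i) Sys.getpid())))   # typically several distinct PIDs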

I think the real point of the warning that you quoted is that R functions should not assume that they own the whole machine, or that they are the only function in your script that is using multiple cores. If you call mclapply with mc.cores=detectCores() in a package that you submit to CRAN, I expect your package will be rejected until you change it. But if you're the end user, running a parallel script on your own machine, then it's up to you to decide how many cores the script is allowed to use.

#2 (score: 1)

In my case (on a Mac), future::availableCores() works better, because detectCores() reports 160, which is obviously wrong.

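A quick comparison on such a machine might look like this (values are illustrative and depend on the setup):

parallel::detectCores()     # raw count seen by the OS/VM, e.g. 160 here
future::availableCores()    # respects settings and limits, falling back to detectCores()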

#3 (score: 1)

Author of the future package here: The future::availableCores() function acknowledges various HPC environment variables (e.g. NSLOTS, PBS_NUM_PPN, and SLURM_CPUS_PER_TASK) and system and R settings that are used to specify the number of cores available to the process, and if not specified, it'll fall back to parallel::detectCores(). As I, or others, become aware of more settings, I'll be happy to add automatic support also for those; there is an always open GitHub issue for this over at https://github.com/HenrikBengtsson/future/issues/22 (there are some open requests for help).

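A rough sketch of what that means in practice (the Slurm values are hypothetical, not from the answer):

library(future)

# On a plain laptop with none of those variables set, this should agree with
# parallel::detectCores(), because that is the documented fallback:
availableCores()

# Inside a Slurm job submitted with, say, --cpus-per-task=4, the same call is
# expected to honour SLURM_CPUS_PER_TASK and report 4 rather than the node's
# full core count:
#   Rscript -e 'cat(future::availableCores(), "\n")'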

Also, if the sysadm sets environment variable R_FUTURE_AVAILABLECORES_FALLBACK=1 sitewide, then future::availableCores() will return 1, unless explicitly overridden by other means (by the job scheduler, by the user settings, ...). This further protects against software tools taking over all cores by default.

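For instance (the file location is an assumption; R_HOME/etc/Renviron.site is a common choice for site-wide settings):

# In R_HOME/etc/Renviron.site, or another site-wide startup file:
# R_FUTURE_AVAILABLECORES_FALLBACK=1

# In a session with no scheduler allocation, R option, or user setting,
# the fallback then applies instead of the full machine count:
future::availableCores()   # expected to return 1 in that situation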

In other words, if you use future::availableCores() rather than parallel::detectCores(), you can be fairly sure that your code plays nicely in multi-tenant environments (if it turns out that's not enough, please let us know in the above GitHub issue), and that any end user can still control the number of cores without you having to change your code.

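Applied to the snippet from the question, the only change needed is how the worker count is chosen (a sketch, leaving one core free as in the original):

library(parallel)
k <- 1000
m <- lapply(1:7, function(X) matrix(rnorm(k^2), nrow = k))

cl <- makeCluster(max(1, future::availableCores() - 1), type = "FORK")
test <- parLapply(cl, m, solve)
stopCluster(cl)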
