I'm aware that the Amelia R package provides some support for parallel multiple imputation (MI). However, preliminary analysis of my study's data revealed that the data are not multivariate normal, so, unfortunately, I can't use Amelia. Consequently, I've switched to the mice R package for MI, as it can perform MI on data that are not multivariate normal.
Since the MI process via mice is very slow (currently I'm using an AWS m3.large 2-core instance), I've started wondering whether it's possible to parallelize the procedure to save processing time. Based on my review of the mice documentation, the corresponding JSS paper, and mice's source code, it appears that the package doesn't currently support parallel operations. This is unfortunate, because IMHO the MICE algorithm is naturally parallel; thus, a parallel implementation should be relatively straightforward and would yield significant savings in both time and resources.
Question: Has anyone tried to parallelize MI in the mice package, either externally (via R's parallel facilities) or internally (by modifying the source code), and what were the results, if any? Thank you!
1 Answer
#1
2
Recently, I've tried to parallelize multiple imputation (MI) via the mice package externally, that is, by using R's multiprocessing facilities, in particular the parallel package, which ships with the base R distribution. Basically, the solution is to use the mclapply() function to distribute a pre-calculated share of the total number of needed imputations to each core, and then combine the resulting imputed data into a single object. Performance-wise, the results of this approach exceeded my most optimistic expectations: processing time decreased from 1.5 hours to under 7 minutes(!), on just two cores. (I removed one multilevel factor along the way, but it shouldn't have much effect.) Regardless, the result is remarkable!