I have a large data.frame (>4M rows) in which one column contains character strings. I want to perform several string operations/match regular expressions on each text field (e.g. gsub).
I'm wondering how I can speed up these operations. Basically, I'm performing a bunch of
gsub(patternvector," [token] ",tweetDF$textcolumn)
....
I'm running R on an 8GB RAM Mac and tried moving it to the cloud (Amazon EC2 large instance with ~64GB RAM), but it's still not very fast.
I've heard of several packages (bigmemory, ff) and found an overview of High Performance/Parallel Computing for R here.
Does anyone have recommendations for a package most suitable for speeding up string operations? Or know of a source explaining how to apply the standard R string functions (gsub, ...) to the 'objects' created by these 'High Performance Computing' packages?
Thanks for your help!
1 Answer
#1
mclapply or any other function that allows for parallel processing should speed up the task significantly. If you are not using parallel processing, you are only using 1 CPU, no matter how many CPUs your computer has available.
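A minimal sketch of that idea, assuming the tweetDF data frame from the question; the pattern "some_pattern" and the per-core chunking scheme are illustrative placeholders, not from the original post. Note that parallel::mclapply() forks worker processes, so it uses multiple cores on macOS/Linux but falls back to a single core on Windows.

library(parallel)

n_cores <- detectCores()

# Split the character column into one chunk per core.
chunks <- split(tweetDF$textcolumn,
                cut(seq_along(tweetDF$textcolumn), n_cores, labels = FALSE))

# Run the substitution on each chunk in parallel, then reassemble in order.
result <- mclapply(chunks,
                   function(x) gsub("some_pattern", " [token] ", x),  # placeholder pattern
                   mc.cores = n_cores)
tweetDF$textcolumn <- unlist(result, use.names = FALSE)

Each gsub() call is still vectorized within its chunk, so the forking overhead is paid once per core rather than once per row; with several pattern/token pairs, the same chunks can be reused for each substitution.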