R: string operations on a large dataset (how to speed them up?)

Posted: 2022-07-26 13:49:35

I have a large data.frame (>4M rows) in which one column contains character strings. I want to perform several string operations/match regular expressions on each text field (e.g. gsub).

I'm wondering how I can speed up these operations. Basically, I'm performing a bunch of

gsub(patternvector," [token] ",tweetDF$textcolumn)
gsub(patternvector," [token] ",tweetDF$textcolumn)
....

I'm running R on an 8GB RAM Mac and tried to move it to the cloud (Amazon EC2 large instance with ~64GB RAM), but it's not going much faster.

I've heard of the several packages (bigmemory, ff) and found an overview about High Performance/Parallel Computing for R here.

Does anyone have recommendations for a package most suitable for speeding up string operations? Or know of a source explaining how to apply the standard R string functions (gsub, ...) to the 'objects' created by these 'High Performance Computing' packages?

Thanks for your help!

1 Answer

#1


mclapply, or any other function that allows for parallel processing, should speed up the task significantly. If you are not using parallel processing, you are using only 1 CPU, no matter how many CPUs your computer has available.
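A minimal sketch of the idea: split the character column into one chunk per core, run gsub on each chunk in parallel with mclapply, and reassemble the result. The names tweetDF, textcolumn, and patternvector are taken from the question and are assumptions, not tested code. Note that mclapply forks worker processes, so it parallelizes on macOS/Linux but falls back to a single core on Windows.

```r
library(parallel)

# Apply gsub() over a character vector in parallel, one chunk per core.
parallel_gsub <- function(pattern, replacement, x,
                          cores = detectCores()) {
  # Split the vector into contiguous chunks, one per core
  chunks <- split(x, cut(seq_along(x), cores, labels = FALSE))
  # Run gsub() on each chunk in a forked worker
  pieces <- mclapply(chunks,
                     function(chunk) gsub(pattern, replacement, chunk),
                     mc.cores = cores)
  # Stitch the pieces back together in the original order
  unlist(pieces, use.names = FALSE)
}

# Assumed usage, with the asker's (hypothetical) names:
# tweetDF$textcolumn <- parallel_gsub(patternvector, " [token] ",
#                                     tweetDF$textcolumn)
```

Independently of parallelism, it is often worth trying `gsub(..., perl = TRUE)` (or `fixed = TRUE` when the patterns are literal strings), since those options can be considerably faster than the default POSIX regex engine on large vectors.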
