R: string operations on a large dataset (how to speed them up?)

Posted: 2022-07-26 13:49:35

I have a large data.frame (>4M rows) in which one column contains character strings. I want to perform several string operations/match regular expressions on each text field (e.g. gsub).

I'm wondering how I can speed up these operations. Basically, I'm performing a bunch of

gsub(patternvector," [token] ",tweetDF$textcolumn)
gsub(patternvector," [token] ",tweetDF$textcolumn)
....

I'm running R on an 8GB RAM Mac and tried to move it to the cloud (Amazon EC2 large instance with ~64GB RAM), but it's not going much faster.

I've heard of the several packages (bigmemory, ff) and found an overview about High Performance/Parallel Computing for R here.

Does anyone have recommendations for a package most suitable for speeding up string operations? Or know of a source explaining how to apply the standard R string functions (gsub, ...) to the 'objects' created by these 'High Performance Computing' packages?

Thanks for your help!

1 Answer

#1


mclapply, or any other function that allows for parallel processing, should speed up the task significantly. If you are not using parallel processing, you are using only 1 CPU, no matter how many CPUs your computer has available.
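A minimal sketch of the idea: split the character column into one chunk per core, run gsub on each chunk in parallel with mclapply, and reassemble the result. The names tweetDF, textcolumn, and patternvector are taken from the question and are assumptions, not tested code. Note that mclapply forks worker processes, so it parallelizes on macOS/Linux but falls back to a single core on Windows.

```r
library(parallel)

# Apply gsub() over a character vector in parallel, one chunk per core.
parallel_gsub <- function(pattern, replacement, x,
                          cores = detectCores()) {
  # Split the vector into contiguous chunks, one per core
  chunks <- split(x, cut(seq_along(x), cores, labels = FALSE))
  # Run gsub() on each chunk in a forked worker
  pieces <- mclapply(chunks,
                     function(chunk) gsub(pattern, replacement, chunk),
                     mc.cores = cores)
  # Stitch the pieces back together in the original order
  unlist(pieces, use.names = FALSE)
}

# Assumed usage, with the asker's (hypothetical) names:
# tweetDF$textcolumn <- parallel_gsub(patternvector, " [token] ",
#                                     tweetDF$textcolumn)
```

Independently of parallelism, it is often worth trying `gsub(..., perl = TRUE)` (or `fixed = TRUE` when the patterns are literal strings), since those options can be considerably faster than the default POSIX regex engine on large vectors.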
