R: Is it possible to parallelize / speed up the reading of a 20-million-row CSV into R?

Date: 2023-01-15 13:45:03

Once the CSV is loaded via read.csv, it's fairly trivial to use multicore, segue, etc. to play around with the data in the CSV. Reading it in, however, is quite the time sink.

I realise it would be better to use MySQL, etc.

Assume the use of an AWS 8xlarge Cluster Compute instance running R 2.13.

Specs as follows:

Cluster Compute Eight Extra Large specifications:

  • 88 EC2 Compute Units (2 × eight-core Intel Xeon)
  • 60.5 GB of memory
  • 3370 GB of instance storage
  • 64-bit platform
  • I/O Performance: Very High (10 Gigabit Ethernet)

Any thoughts / ideas much appreciated.

3 Answers

#1 (5 votes)

Going parallel might not be needed if you use fread in data.table.

library(data.table)
dt <- fread("myFile.csv")

A comment to this question illustrates its power. Also, here's an example from my own experience:

d1 <- fread('Tr1PointData_ByTime_new.csv')
Read 1048575 rows and 5 (of 5) columns from 0.043 GB file in 00:00:09

I was able to read in 1.04 million rows in under 10s!
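As an aside, more recent versions of data.table also parallelize the parsing inside fread itself. A minimal sketch, assuming data.table >= 1.10.5 (the file name is a placeholder):

library(data.table)

# fread can parse in parallel on recent data.table versions; nThread
# controls the worker count ("myFile.csv" is a placeholder path)
dt <- fread("myFile.csv",
            nThread = parallel::detectCores(),
            verbose = TRUE)  # verbose = TRUE prints timing for each phase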

#2 (3 votes)

Flash or conventional HD storage? If the latter, and you don't know where the file sits on the drives and how it's split, it's very hard to speed things up: multiple simultaneous reads will not be faster than one streamed read. The bottleneck is the disk, not the CPU. There's no way to parallelize this without starting at the storage level of the file.

If it's flash storage, then a solution like Paul Hiemstra's might help, since good flash storage can have excellent random-read performance, close to sequential. Try it... but if it doesn't help, you'll know why.

Also... a fast storage interface doesn't necessarily mean the drives can saturate it. Have you run performance tests on the drives to see how fast they really are?
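A quick sanity check is a rough sequential-read benchmark in base R: stream the file once with readBin and compute throughput (a minimal sketch; the path is a placeholder, and note the OS page cache can inflate repeat runs):

f  <- "myFile.csv"                 # placeholder path
sz <- file.info(f)$size            # file size in bytes
t  <- system.time({
  con <- file(f, "rb")
  # stream the file in 64 MB chunks until EOF
  while (length(readBin(con, raw(), n = 64 * 1024^2)) > 0) {}
  close(con)
})
cat(sprintf("Sequential read: %.1f MB/s\n", sz / 1024^2 / t[["elapsed"]]))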

#3 (2 votes)

What you could do is use scan. Two of its input arguments could prove interesting: n and skip. You just open two or more connections to the file and use skip and n to select the part you want to read. There are some caveats:

  • At some stage, disk I/O might prove to be the bottleneck.
  • I hope that scan does not complain when opening multiple connections to the same file.

But you could give it a try and see if it boosts your read speed; a minimal sketch follows.
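Here's one way that could look, splitting the file into chunks and reading each in a forked worker via scan's skip and n arguments. The file name, the header row, the 20M row count, and the 5-numeric-column layout are all assumptions for illustration:

library(parallel)

# Sketch of the scan()-based split read (assumes a header row and
# 5 numeric columns; "myFile.csv" is a placeholder)
f          <- "myFile.csv"
total_rows <- 20e6
n_workers  <- 8
chunk      <- ceiling(total_rows / n_workers)
cols       <- setNames(rep(list(numeric()), 5), paste0("V", 1:5))

read_chunk <- function(i) {
  # skip the header plus the rows covered by earlier workers; n counts
  # individual values, so it is columns (5) * rows per chunk
  scan(f, what = cols, sep = ",",
       skip = 1 + (i - 1) * chunk,
       n = 5 * chunk, quiet = TRUE)
}

# fork one reader per chunk (mclapply forks, so Linux/macOS only)
parts <- mclapply(seq_len(n_workers), read_chunk, mc.cores = n_workers)
df    <- do.call(rbind, lapply(parts, as.data.frame))

Note that each worker still scans the file from the start to honour skip, which is exactly the disk-I/O caveat above.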
