This question already has an answer here:
- Speeding up the performance of write.table (5 answers)
I have a fairly large data frame in R that I would like to export to SPSS. The file caused me hours of headaches when I first tried to import it into R, but I eventually succeeded using read.fwf() with the options comment.char = "%" (a character not appearing in the file) and fill = TRUE (it was a fixed-width ASCII file, and some rows lacked all variables, which caused error messages).
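For reference, the import call looked roughly like the sketch below; the file name and column widths are hypothetical placeholders, since the question does not give them:

# Placeholder file name and widths; the real file has 48 columns.
df <- read.fwf("fixed_width_data.txt",
               widths = c(10, 8, 8, 12),  # one width per fixed-width column
               comment.char = "%",        # a character never appearing in the file
               fill = TRUE,               # tolerate rows lacking trailing variables
               stringsAsFactors = FALSE)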
Anyway, my data frame currently consists of 3.9 million observations and 48 variables (all character). I can write it to file fairly quickly by splitting it into four sets of 1 million observations each with df2 <- df[1:1000000,] followed by write.table(df2) and so on, but I can't write the entire file in one sweep without the computer locking up and needing a hard reset to come back up.
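Spelled out, that four-slice workaround looks roughly like this (a sketch; the output file names are arbitrary):

n <- nrow(df)  # about 3.9 million rows
starts <- seq(1, n, by = 1000000)
for (i in seq_along(starts)) {
  end <- min(starts[i] + 999999, n)
  write.table(df[starts[i]:end, ], file = sprintf("export_part%d.txt", i))
}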
After hearing anecdotal stories for years about R being unsuited for large datasets, this is the first time I have actually encountered a problem of this kind. I wonder whether there are other approaches (low-level "dumping" of the file directly to disk?), or whether there is some package unknown to me that can handle the export of large files of this type efficiently?
5 Answers
#1
7
At a guess, your machine is short on RAM, and so R is having to use the swap file, which slows things down. If you are being paid to code, then buying more RAM will probably be cheaper than you writing new code.
That said, there are some possibilities. You can export the file to a database and then use that database's facilities for writing to a text file. JD Long's answer to this question tells you how to read in files this way; it shouldn't be too difficult to reverse the process. Alternatively, the bigmemory and ff packages (as mentioned by Davy) could be used for writing such files.
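One concrete way to stage the data in a database (a minimal sketch using the DBI and RSQLite packages, which this answer does not name itself; the table and file names are illustrative):

library(DBI)

# Stage the data frame in an on-disk SQLite database.
con <- dbConnect(RSQLite::SQLite(), "staging.sqlite")
dbWriteTable(con, "bigdata", df)
dbDisconnect(con)

# The database's own tooling can then dump the table to text, e.g.:
#   sqlite3 -csv staging.sqlite "SELECT * FROM bigdata;" > bigdata.csv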
#2
24
1) If your file is all character strings, then write.table() saves it much faster if you first change it to a matrix.
2) Also write it out in chunks of, say, 1,000,000 rows, but always to the same file, using the argument append = TRUE (both points are combined in the sketch below).
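Putting both suggestions together (a sketch; the chunk size and output file name are arbitrary):

m <- as.matrix(df)          # all-character data frame -> character matrix
chunk <- 1000000
starts <- seq(1, nrow(m), by = chunk)
for (s in starts) {
  e <- min(s + chunk - 1, nrow(m))
  write.table(m[s:e, , drop = FALSE],
              file = "export.txt",
              append = (s > 1),      # overwrite on the first chunk, append after
              col.names = (s == 1),  # write the header only once
              row.names = FALSE)
}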
#3
14
Update: After extensive work by Matt Dowle parallelizing and adding other efficiency improvements, fwrite is now as much as 15x faster than write.csv. See the linked answer for more.
Now data.table has an fwrite function, contributed by Otto Seiskari, which seems to be about twice as fast as write.csv in general. See here for some benchmarks.
library(data.table)
fwrite(DF, "output.csv")
Note that row names are excluded, since the data.table type makes no use of them.
#4
7
Though I only use it to read very large files (10+ GB), I believe the ff package has functions for writing extremely large data frames.
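A possible sketch of that route (assuming ff's write.csv.ffdf helper, which this answer does not show itself; note that ff stores strings as factors rather than raw character vectors, and the file name is illustrative):

library(ff)

# ff does not store raw character vectors, so convert columns to factors first.
df_fac <- as.data.frame(lapply(df, factor))
fdf <- as.ffdf(df_fac)  # on-disk ffdf object

# write.csv.ffdf writes the ffdf out in chunks rather than all at once.
write.csv.ffdf(fdf, file = "export.csv")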
#5
7
Well, as the answer with really large files and R often is, it's best to offload this kind of work to a database. SPSS has ODBC connectivity, and the RODBC package provides an interface from R to SQL.
I note that, in the process of checking out my information, I have been scooped.
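A minimal sketch of that route (the DSN and table name are hypothetical placeholders):

library(RODBC)

con <- odbcConnect("my_dsn")  # an ODBC data source configured on the system

# Push the data frame into a database table; SPSS can then read
# the same table over its own ODBC connection.
sqlSave(con, df, tablename = "bigdata", rownames = FALSE)
odbcClose(con)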