I have two data frames, df1 and df2, that each have around 10 million rows and 4 columns. I read them into R using RODBC/sqlQuery with no problems, but when I try to rbind them, I get that most dreaded of R error messages: cannot allocate memory. There have got to be more efficient ways to do an rbind -- anyone have favorite tricks on this they want to share? For instance, I found this example in the documentation for sqldf:
# rbind
a7r <- rbind(a5r, a6r)
a7s <- sqldf("select * from a5s union all select * from a6s")
Is that the best/recommended way to do it?
UPDATE: I got it to work using the crucial dbname = tempfile() argument in the sqldf call above, as JD Long suggests in his answer to this question.
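For reference, a minimal sketch of the call that worked (assuming a5s and a6s are data frames already in the workspace, as in the example above):
a7s <- sqldf("select * from a5s union all select * from a6s",
             dbname = tempfile())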
4 Answers
#1
23
Rather than reading them into R at the beginning and then combining them, you could have SQLite read them and combine them before sending the result to R. That way the files are never individually loaded into R.
# create two sample files
DF1 <- data.frame(A = 1:2, B = 2:3)
write.table(DF1, "data1.dat", sep = ",", quote = FALSE)
rm(DF1)
DF2 <- data.frame(A = 10:11, B = 12:13)
write.table(DF2, "data2.dat", sep = ",", quote = FALSE)
rm(DF2)
# now we do the real work
library(sqldf)
data1 <- file("data1.dat")
data2 <- file("data2.dat")
sqldf(c("select * from data1",
"insert into data1 select * from data2",
"select * from data1"),
dbname = tempfile())
This gives:
> sqldf(c("select * from data1", "insert into data1 select * from data2", "select * from data1"), dbname = tempfile())
A B
1 1 2
2 2 3
3 10 12
4 11 13
This shorter version also works if row order is unimportant (note that union, unlike union all, also removes duplicate rows):
sqldf("select * from data1 union select * from data2", dbname = tempfile())
See the sqldf home page http://sqldf.googlecode.com and ?sqldf for more info. Pay particular attention to the file format arguments, since they are close but not identical to those of read.table. Here we have used the defaults, so it was less of an issue.
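For instance, if the files were semicolon-separated, a hedged sketch of how that might be passed (file.format is documented in ?sqldf; the header and sep values below are assumptions about a hypothetical file layout, not part of the answer's example):
sqldf("select * from data1 union all select * from data2",
      file.format = list(header = TRUE, sep = ";"),
      dbname = tempfile())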
#2
17
Consider the data.table R package for efficient operations on objects with several million records.
Version 1.8.2 of that package offers the rbindlist function, through which you can achieve what you want very efficiently. Thus instead of rbind(a5r, a6r) you can do:
library(data.table)
rbindlist(list(a5r, a6r))
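A self-contained sketch (a5r and a6r below are small stand-ins for the question's data frames); note that rbindlist returns a data.table, which converts back with as.data.frame if you need a plain data.frame:
library(data.table)
a5r <- data.frame(A = 1:3, B = 4:6)    # stand-in for the first large data frame
a6r <- data.frame(A = 7:9, B = 10:12)  # stand-in for the second
a7r <- rbindlist(list(a5r, a6r))       # binds rows without rbind's repeated copying
a7r_df <- as.data.frame(a7r)           # back to a plain data.frame if required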
#3
1
Try creating a data.frame of the desired size up front, then import your data into it using subscripts.
dtf <- as.data.frame(matrix(NA, 10, 10))
dtf1 <- as.data.frame(matrix(1:50, 5, 10, byrow=TRUE))
dtf2 <- as.data.frame(matrix(51:100, 5, 10, byrow=TRUE))
dtf[1:5, ] <- dtf1
dtf[6:10, ] <- dtf2
I guess that rbind grows the object without pre-allocating its dimensions... I'm not positively sure; this is only a guess. I'll comb through "The R Inferno" or "Data Manipulation with R" tonight. Maybe merge will do the trick...
EDIT
And you should bear in mind that (maybe) your system and/or R cannot cope with something that big. Try RevolutionR; maybe you'll manage to spare some time/resources.
#4
1
For completeness in this thread, on the topic of unioning large files: try using shell commands on the files to combine them. On Windows that is the COPY command with the /B flag. Example:
system(command =
  paste0(
    c("cmd.exe /c COPY /Y"
      , '"file_1.csv" /B'
      , '+ "file_2.csv" /B'
      , '"resulting_file.csv" /B'
    ), collapse = " "
  )
)  # system
This requires that the files have no header, the same delimiter, and so on. The speed and versatility of shell commands is sometimes a great benefit, so don't forget CLI commands when mapping out dataflows.
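On Unix-like systems the equivalent would be cat (the same caveats about headers and delimiters apply; the file names here are placeholders):
system("cat file_1.csv file_2.csv > resulting_file.csv")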