How can I merge two datasets efficiently?

Time: 2023-02-02 22:55:21

I am trying to merge two fairly large, but not ridiculously large, datasets (360,000 x 4 and 57,000 x 4) on one common ID. I have tried a regular merge(), merge.data.table(), and sqldf(). Every time I run out of memory (cannot allocate vector of size...). Is there any solution to this, or is R a bad tool for merging data? The head() output is given below (I am trying to merge on STUDENT.NAME):

  ID10    STUDENT.NAME   FATHER.NAME MOTHER.NAME
1    1     DEEKSHITH J       JAYANNA      SWARNA
2    4    MANIKANTHA D       DEVARAJ     MANJULA
3    5        NAGESH T   THIMMAIAH N    SHIVAMMA
4    6    NIZAMUDDIN R NOOR MOHAMMED        BIBI
5    7 PRABHU YELLAPPA      YELLAPPA    MALLAMMA
6    8    SADDAM PASHA   NISAR AHMED     ZAREENA

3 answers

#1


11  

From the nature of your problem, you are almost certainly doing a many-to-many merge, where each student occurs many times in each data frame. You might want to check how many times. If a student occurs twice in each data frame, that one student produces 4 rows. If a student occurs 10 times in each, that student alone produces 100 rows. First check how many rows you'll get. This is the function I use for that:

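count.rows <- function(x,y,v,all=FALSE){
    # Number of occurrences of each key value in x and in y
    tx <- table(x[[v]])
    ty <- table(y[[v]])
    # Key values present in both data frames
    val <- names(tx)[match(names(tx),names(ty),0L) > 0L]
    # Row 1: counts in x, row 2: counts in y, one column per shared key
    cts <- rbind(tx[match(val,names(tx))],ty[match(val,names(ty))])
    colnames(cts) <- val
    # Each shared key contributes (count in x) * (count in y) rows
    sum(apply(cts,2,prod,na.rm=all),na.rm=TRUE)
}
count.rows(DF1,DF2,"STUDENT.NAME")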
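To make the arithmetic concrete, here is a minimal sketch on toy data. The vectors `x_keys` and `y_keys` are hypothetical and only stand in for the STUDENT.NAME columns; the point is that the predicted row count (the sum over shared keys of count in x times count in y) agrees with what merge() actually returns.

```r
# Toy key vectors (hypothetical data, not the asker's)
x_keys <- c("a", "a", "b")
y_keys <- c("a", "a", "b", "c")

# Occurrences of each key on each side
tx <- table(x_keys)
ty <- table(y_keys)
common <- intersect(names(tx), names(ty))

# Rows contributed per shared key: count in x times count in y
per_key <- as.numeric(tx[common]) * as.numeric(ty[common])
expected_rows <- sum(per_key)

# Compare with the size of an actual inner merge
merged <- merge(data.frame(k = x_keys), data.frame(k = y_keys), by = "k")
expected_rows == nrow(merged)
```

Running the same count on the real data before merging shows whether the result can fit in memory at all.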

If you do what you asked of me (read the R documentation), you'll see that the complexity depends on the length of the result. This is not due to the merge algorithm itself, but to the binding together of all the results. If you really want a less memory-hungry solution, you especially need to get rid of that binding. The following algorithm does that for you. I wrote it out in full so you can follow the logic; it can still be optimized. Mind you, it does not give the same result as merge(): it copies all columns of both data frames, so you might want to adapt it a little.

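mymerge <- function(x,y,v){
    # Position of the key column v in each data frame
    ix <- match(v,names(x))
    iy <- match(v,names(y))

    # Extract the keys as character (so differing factor levels cannot
    # misalign the two sides) and sort them, remembering the row order
    xx <- as.character(x[[ix]])
    yy <- as.character(y[[iy]])
    ox <- order(xx)
    oy <- order(yy)
    xx <- xx[ox]
    yy <- yy[oy]

    nx <- length(xx)
    ny <- length(yy)

    # Keys present in both inputs, and the number of rows in the result
    val <- unique(xx)
    val <- val[match(val,yy,0L) > 0L]
    cts <- cbind(table(xx)[val],table(yy)[val])
    dimr <- sum(apply(cts,1,prod),na.rm=TRUE)

    # Preallocate the row-index vectors instead of binding pieces together
    idx <- vector("numeric",dimr)
    idy <- vector("numeric",dimr)
    # One row per shared key: (start of next run, start of this run)
    ndx <- embed(c(which(!duplicated(xx)),nx+1),2)[unique(xx) %in% val,,drop=FALSE]
    ndy <- embed(c(which(!duplicated(yy)),ny+1),2)[unique(yy) %in% val,,drop=FALSE]

    count <- 1
    for(i in seq_len(nrow(ndx))){
        nx <- abs(diff(ndx[i,]))   # rows with this key in x
        ny <- abs(diff(ndy[i,]))   # rows with this key in y
        ll <- nx*ny                # rows this key adds to the result

        idx[count:(count+ll-1)] <-
          rep(ndx[i,2]:(ndx[i,1]-1),ny)

        idy[count:(count+ll-1)] <-
          rep(ndy[i,2]:(ndy[i,1]-1),each=nx)
        count <- count+ll
    }
    # Subset each data frame once, using the collected indices
    x <- x[ox[idx],]
    names(y) <- paste("y.",names(y),sep="")
    x[names(y)] <- y[oy[idy],]
    rownames(x) <- NULL
    x
}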

Some testing code so you can see that it works:

DF1 <- data.frame(
    ID = 1:10,
    STUDENT.NAME=letters[1:10],
    SCORE = 1:10
)
id <- c(3,11,4,6,6,12,1,4,7,10,5,3)
DF2 <- data.frame(
    ID = id,
    STUDENT.NAME=letters[id],
    SCORE = 1:12
)

mymerge(DF1,DF2,"STUDENT.NAME")

Doing the same with two data frames of 0.5 million rows and 4 columns, with up to 10 matches per student name, returns a data frame with 5.8 million rows and 8 columns and gives the following picture of memory usage:

[Figure: memory usage over time during the merge and mymerge calls]

The yellow box is the merge call and the green box is the mymerge call. Memory ranges from 2.3 GB to 3.74 GB, so the merge call uses about 1.45 GB and mymerge a bit over 0.8 GB. Still no "out of memory" errors... The testing code for this is below:

Names <- sapply(
      replicate(120000,sample(letters,4,TRUE),simplify=FALSE),
      paste,collapse="")

DF1 <- data.frame(
    ID10 = 1:500000,
    STUDENT.NAME = sample(Names[1:50000],500000,TRUE),
    FATHER.NAME = sample(letters,500000,TRUE),
    SCORE1 = rnorm(500000),
    stringsAsFactors=FALSE
)

id <- sample(500000,replace=TRUE)
DF2 <- data.frame(
    ID20 = DF1$ID10,
    STUDENT.NAME = DF1$STUDENT.NAME[id],
    SCORE = rnorm(500000),
    SCORE2= rnorm(500000),
    stringsAsFactors=FALSE
)
id2 <- sample(500000,20000)
DF2$STUDENT.NAME[id2] <- sample(Names[100001:120000],20000,TRUE)

gc()
system.time(X <- merge(DF1,DF2,"STUDENT.NAME"))
Sys.sleep(1)
gc()
Sys.sleep(1)
rm(X)
gc()
Sys.sleep(3)
system.time(X <- mymerge(DF1,DF2,"STUDENT.NAME"))
Sys.sleep(1)
gc()
rm(X)
gc()

#2


2  

Have you tried the data.table package? It is more memory efficient and can be many times faster. But, as others have noted, this question has no code provided so it's possible you are just using merge incorrectly.
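As a rough illustration of that suggestion, the sketch below uses toy tables (`DT1`, `DT2`, and their contents are hypothetical) and assumes the data.table package is installed. It first counts duplicated keys on each side, then does an inner join; `allow.cartesian = TRUE` is the documented argument that lets a many-to-many join grow past data.table's default size guard.

```r
library(data.table)

# Toy tables standing in for the two datasets (hypothetical data)
DT1 <- data.table(STUDENT.NAME = c("a", "a", "b"), SCORE1 = 1:3)
DT2 <- data.table(STUDENT.NAME = c("a", "a", "b", "c"), SCORE2 = 4:7)

# Which keys occur more than once on each side?
dup1 <- DT1[, .N, by = STUDENT.NAME][N > 1]
dup2 <- DT2[, .N, by = STUDENT.NAME][N > 1]

# Inner join on the key; allow.cartesian = TRUE permits many-to-many output
res <- merge(DT1, DT2, by = "STUDENT.NAME", allow.cartesian = TRUE)
nrow(res)
```

If the duplicate counts are large, the join result will be large no matter which package performs it, so checking them first is the cheaper step.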

#3


1  

I agree with the other commenters who say this question is lacking in its description (it has neither the code nor a complete description of the data), but I also wonder whether it has already been answered by one of these links:

R: how to rbind two huge data-frames without running out of memory

There is also a citation offered by @G. Grothendieck (who should probably be given a knighthood for his many contributions to R's functionality), especially the part regarding the use of an external file: http://code.google.com/p/sqldf/#Example_6._File_Input

And one final thought: after saving your work, shutting down your computer, restarting with only R running, and loading only your datasets, try a cbind(.... match(..) ) maneuver like this:

cbind(df1,df2[match(df1$STUDENT.NAME,df2$STUDENT.NAME),])

It won't have the same bells and whistles as merge, but it should be fairly memory efficient, and it should succeed if the problem is just fragmented memory in your current session. Note that match() does exact matching, not partial matching. If partial matching was your expectation, you should have said so. Names are notoriously messy when they come from independent sources.
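A small sketch of what that maneuver does, on hypothetical toy data frames: match() returns, for each row of df1, the position of the first exact match in df2, and NA where there is none. The result therefore always has exactly nrow(df1) rows, which is why it stays small, but any further duplicates in df2 are silently ignored.

```r
# Toy data frames (hypothetical, only to show the behaviour)
df1 <- data.frame(STUDENT.NAME = c("a", "b", "c"), SCORE1 = 1:3,
                  stringsAsFactors = FALSE)
df2 <- data.frame(STUDENT.NAME = c("b", "a", "a"), SCORE2 = c(10, 20, 30),
                  stringsAsFactors = FALSE)

# Position of the first exact match in df2 for each df1 row (NA if none)
pos <- match(df1$STUDENT.NAME, df2$STUDENT.NAME)

# One output row per df1 row; unmatched rows get NA in the df2 columns
res <- cbind(df1, df2[pos, "SCORE2", drop = FALSE])
res
```

Here the second "a" row of df2 (SCORE2 = 30) never appears in the result, so this is only equivalent to merge() when the key is unique in df2.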
