比较两个data.frame来找到data.frame 1中不存在的行

I have the following 2 data.frames:

我有以下两个数据。

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

I want to find the row a1 has that a2 doesn't.

我想找到a没有的行a1。

Is there a built in function for this type of operation?

这种类型的操作有内置函数吗?

(p.s: I did write a solution for it, I am simply curious if someone already made a more crafted code)

(p。s:我确实为它编写了一个解决方案，我只是想知道是否有人已经编写了更精巧的代码)

Here is my solution:

这是我的解决方案:

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

rows.in.a1.that.are.not.in.a2  <- function(a1,a2)
{
    a1.vec <- apply(a1, 1, paste, collapse = "")
    a2.vec <- apply(a2, 1, paste, collapse = "")
    a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
    return(a1.without.a2.rows)
}
rows.in.a1.that.are.not.in.a2(a1,a2)

11 个解决方案

#1

This doesn't answer your question directly, but it will give you the elements that are in common. This can be done with Paul Murrell's package compare:

这并不能直接回答你的问题，但它会给你一些共同的元素。这可以通过保罗·默雷尔的一揽子计划进行比较:

library(compare)
a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
comparison <- compare(a1,a2,allowAll=TRUE)
comparison$tM
#  a b
#1 1 a
#2 2 b
#3 3 c

The function compare gives you a lot of flexibility in terms of what kind of comparisons are allowed (e.g. changing order of elements of each vector, changing order and names of variables, shortening variables, changing case of strings). From this, you should be able to figure out what was missing from one or the other. For example (this is not very elegant):

在允许进行何种比较方面，函数compare提供了很大的灵活性(例如，改变每个向量元素的顺序，改变变量的顺序和名称，缩短变量，改变字符串的情况)。从这里，您应该能够找出其中一个或另一个缺失了什么。例如(这不是很优雅):

difference <-
   data.frame(lapply(1:ncol(a1),function(i)setdiff(a1[,i],comparison$tM[,i])))
colnames(difference) <- colnames(a1)
difference
#  a b
#1 4 d
#2 5 e

#2

113

SQLDF provides a nice solution

SQLDF提供了一个很好的解决方案

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])

require(sqldf)

a1NotIna2 <- sqldf('SELECT * FROM a1 EXCEPT SELECT * FROM a2')

And the rows which are in both data frames:

在两个数据帧中的行:

a1Ina2 <- sqldf('SELECT * FROM a1 INTERSECT SELECT * FROM a2')

The new version of dplyr has a function, anti_join, for exactly these kinds of comparisons

新版本的dplyr有一个函数anti_join，用来进行这些比较

require(dplyr) 
anti_join(a1,a2)

And semi_join to filter rows in a1 that are also in a2

和sem_join来过滤a1中的行，a2中的行

semi_join(a1,a2)

#3

It is certainly not efficient for this particular purpose, but what I often do in these situations is to insert indicator variables in each data.frame and then merge:

对于这个特定的目的来说，它当然不是有效的，但是我在这些情况下经常做的是在每个数据a中插入指示器变量，然后合并:

a1$included_a1 <- TRUE
a2$included_a2 <- TRUE
res <- merge(a1, a2, all=TRUE)

missing values in included_a1 will note which rows are missing in a1. similarly for a2.

included_a1中缺失的值将记录a1中缺失的行。同样为a2。

One problem with your solution is that the column orders must match. Another problem is that it is easy to imagine situations where the rows are coded as the same when in fact are different. The advantage of using merge is that you get for free all error checking that is necessary for a good solution.

解决方案的一个问题是列订单必须匹配。另一个问题是，很容易想象在实际情况不同的情况下，这些行被编码为相同的情况。使用merge的好处是，您可以免费检查对于一个好的解决方案来说是必需的所有错误。

#4

In dplyr:

在dplyr:

setdiff(a1,a2)

Basically, setdiff(bigFrame, smallFrame) gets you the extra records in the first table.

基本上，setdiff(bigFrame, smallFrame)为您获取第一个表中的额外记录。

In the SQLverse this is called a

在SQLverse中，这被称为a

For good descriptions of all join options and set subjects, this is one of the best summaries I've seen put together to date: http://www.vertabelo.com/blog/technical-articles/sql-joins

对于所有连接选项和设置主题的良好描述，这是我到目前为止看到的最好的总结之一:http://www.vertabelo.com/blog/technical-articles/sql join

But back to this question - here are the results for the setdiff() code when using the OP's data:

但是回到这个问题——下面是使用OP数据时setdiff()代码的结果:

> a1
  a b
1 1 a
2 2 b
3 3 c
4 4 d
5 5 e

> a2
  a b
1 1 a
2 2 b
3 3 c

> setdiff(a1,a2)
  a b
1 4 d
2 5 e

Or even anti_join(a1,a2) will get you the same results.
For more info: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

甚至anti_join(a1,a2)也会得到相同的结果。更多信息:https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

#5

I wrote a package (https://github.com/alexsanjoseph/compareDF) since I had the same issue.

我写了一个包(https://github.com/sanalexjoseph/comparedf)，因为我遇到了同样的问题。

  > df1 <- data.frame(a = 1:5, b=letters[1:5], row = 1:5)
  > df2 <- data.frame(a = 1:3, b=letters[1:3], row = 1:3)
  > df_compare = compare_df(df1, df2, "row")

  > df_compare$comparison_df
    row chng_type a b
  1   4         + 4 d
  2   5         + 5 e

A more complicated example:

一个更加复杂的例子:

library(compareDF)
df1 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                         "Hornet 4 Drive", "Duster 360", "Merc 240D"),
                 id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Mer"),
                 hp = c(110, 110, 181, 110, 245, 62),
                 cyl = c(6, 6, 4, 6, 8, 4),
                 qsec = c(16.46, 17.02, 33.00, 19.44, 15.84, 20.00))

df2 = data.frame(id1 = c("Mazda RX4", "Mazda RX4 Wag", "Datsun 710",
                         "Hornet 4 Drive", " Hornet Sportabout", "Valiant"),
                 id2 = c("Maz", "Maz", "Dat", "Hor", "Dus", "Val"),
                 hp = c(110, 110, 93, 110, 175, 105),
                 cyl = c(6, 6, 4, 6, 8, 6),
                 qsec = c(16.46, 17.02, 18.61, 19.44, 17.02, 20.22))

> df_compare$comparison_df
    grp chng_type                id1 id2  hp cyl  qsec
  1   1         -  Hornet Sportabout Dus 175   8 17.02
  2   2         +         Datsun 710 Dat 181   4 33.00
  3   2         -         Datsun 710 Dat  93   4 18.61
  4   3         +         Duster 360 Dus 245   8 15.84
  5   7         +          Merc 240D Mer  62   4 20.00
  6   8         -            Valiant Val 105   6 20.22

The package also has an html_output command for quick checking

该包还有一个html_output命令，用于快速检查

df_compare$html_output

df_compare html_output美元

#6

I adapted the merge function to get this functionality. On larger dataframes it uses less memory than the full merge solution. And I can play with the names of the key columns.

我调整了合并函数来获得这个功能。在较大的dataframes上，它使用的内存比完整的merge解决方案少。我可以使用键列的名称。

Another solution is to use the library prob.

另一个解决方案是使用库prob。

#  Derived from src/library/base/R/merge.R
#  Part of the R package, http://www.R-project.org
#
#  This program is free software; you can redistribute it and/or modify
#  it under the terms of the GNU General Public License as published by
#  the Free Software Foundation; either version 2 of the License, or
#  (at your option) any later version.
#
#  This program is distributed in the hope that it will be useful,
#  but WITHOUT ANY WARRANTY; without even the implied warranty of
#  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
#  GNU General Public License for more details.
#
#  A copy of the GNU General Public License is available at
#  http://www.r-project.org/Licenses/

XinY <-
    function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
             notin = FALSE, incomparables = NULL,
             ...)
{
    fix.by <- function(by, df)
    {
        ## fix up 'by' to be a valid set of cols by number: 0 is row.names
        if(is.null(by)) by <- numeric(0L)
        by <- as.vector(by)
        nc <- ncol(df)
        if(is.character(by))
            by <- match(by, c("row.names", names(df))) - 1L
        else if(is.numeric(by)) {
            if(any(by < 0L) || any(by > nc))
                stop("'by' must match numbers of columns")
        } else if(is.logical(by)) {
            if(length(by) != nc) stop("'by' must match number of columns")
            by <- seq_along(by)[by]
        } else stop("'by' must specify column(s) as numbers, names or logical")
        if(any(is.na(by))) stop("'by' must specify valid column(s)")
        unique(by)
    }

    nx <- nrow(x <- as.data.frame(x)); ny <- nrow(y <- as.data.frame(y))
    by.x <- fix.by(by.x, x)
    by.y <- fix.by(by.y, y)
    if((l.b <- length(by.x)) != length(by.y))
        stop("'by.x' and 'by.y' specify different numbers of columns")
    if(l.b == 0L) {
        ## was: stop("no columns to match on")
        ## returns x
        x
    }
    else {
        if(any(by.x == 0L)) {
            x <- cbind(Row.names = I(row.names(x)), x)
            by.x <- by.x + 1L
        }
        if(any(by.y == 0L)) {
            y <- cbind(Row.names = I(row.names(y)), y)
            by.y <- by.y + 1L
        }
        ## create keys from 'by' columns:
        if(l.b == 1L) {                  # (be faster)
            bx <- x[, by.x]; if(is.factor(bx)) bx <- as.character(bx)
            by <- y[, by.y]; if(is.factor(by)) by <- as.character(by)
        } else {
            ## Do these together for consistency in as.character.
            ## Use same set of names.
            bx <- x[, by.x, drop=FALSE]; by <- y[, by.y, drop=FALSE]
            names(bx) <- names(by) <- paste("V", seq_len(ncol(bx)), sep="")
            bz <- do.call("paste", c(rbind(bx, by), sep = "\r"))
            bx <- bz[seq_len(nx)]
            by <- bz[nx + seq_len(ny)]
        }
        comm <- match(bx, by, 0L)
        if (notin) {
            res <- x[comm == 0,]
        } else {
            res <- x[comm > 0,]
        }
    }
    ## avoid a copy
    ## row.names(res) <- NULL
    attr(res, "row.names") <- .set_row_names(nrow(res))
    res
}


XnotinY <-
    function(x, y, by = intersect(names(x), names(y)), by.x = by, by.y = by,
             notin = TRUE, incomparables = NULL,
             ...)
{
    XinY(x,y,by,by.x,by.y,notin,incomparables)
}

#7

Using diffobj package:

使用diffobj包:

library(diffobj)

diffPrint(a1, a2)
diffObj(a1, a2)

#8

You could use the daff package (which wraps the daff.js library using the V8 package):

您可以使用daff包(它包装daff)。使用V8包的js库:

library(daff)

diff_data(data_ref = a2,
          data = a1)

produces the following difference object:

产生以下差异对象:

Daff Comparison: ‘a2’ vs. ‘a1’ 
  First 6 and last 6 patch lines:
   @@   a   b
1 ... ... ...
2       3   c
3 +++   4   d
4 +++   5   e
5 ... ... ...
6 ... ... ...
7       3   c
8 +++   4   d
9 +++   5   e

The diff format is described in Coopy highlighter diff format for tables and should be pretty self-explanatory. The lines with +++ in the first column @@ are the ones which are new in a1 and not present in a2.

diff格式在表的Coopy highlighter diff格式中进行了描述，并且应该很容易理解。在第一列@@中带有++的行是a1中的新行，a2中不存在。

The difference object can be used to patch_data(), to store the difference for documentation purposes using write_diff() or to visualize the difference using render_diff():

差异对象可以用于patch_data()，以便使用write_diff()存储差异，或者使用render_diff()可视化差异:

render_diff(
    diff_data(data_ref = a2,
              data = a1)
)

generates a neat HTML output:

生成一个简洁的HTML输出:

#9

Your example data does not have any duplicates, but your solution handle them automatically. This means that potentially some of the answers won't match to results of your function in case of duplicates.
Here is my solution which address duplicates the same way as yours. It also scales great!

您的示例数据没有任何副本，但是您的解决方案会自动处理它们。这意味着在重复的情况下，有些答案可能与函数的结果不匹配。这是我的解决方案，地址和你的地址一样。它还尺度太棒了!

a1 <- data.frame(a = 1:5, b=letters[1:5])
a2 <- data.frame(a = 1:3, b=letters[1:3])
rows.in.a1.that.are.not.in.a2  <- function(a1,a2)
{
    a1.vec <- apply(a1, 1, paste, collapse = "")
    a2.vec <- apply(a2, 1, paste, collapse = "")
    a1.without.a2.rows <- a1[!a1.vec %in% a2.vec,]
    return(a1.without.a2.rows)
}

library(data.table)
setDT(a1)
setDT(a2)

# no duplicates - as in example code
r <- fsetdiff(a1, a2)
all.equal(r, rows.in.a1.that.are.not.in.a2(a1,a2))
#[1] TRUE

# handling duplicates - make some duplicates
a1 <- rbind(a1, a1, a1)
a2 <- rbind(a2, a2, a2)
r <- fsetdiff(a1, a2, all = TRUE)
all.equal(r, rows.in.a1.that.are.not.in.a2(a1,a2))
#[1] TRUE

It needs data.table 1.9.7 which currently can be installed from source repo

它需要的数据。表1.9.7目前可以从源repo安装。

install.packages("data.table", type = "source",
    repos = "https://Rdatatable.github.io/data.table")

#10

Maybe it is too simplistic, but I used this solution and I find it very useful when I have a primary key that I can use to compare data sets. Hope it can help.

也许它太简单了，但是我使用了这个解决方案，当我有一个主键可以用来比较数据集时，我发现它非常有用。希望它可以帮助。

a1 <- data.frame(a = 1:5, b = letters[1:5])
a2 <- data.frame(a = 1:3, b = letters[1:3])
different.names <- (!a1$a %in% a2$a)
not.in.a2 <- a1[different.names,]

#11

Yet another solution based on match_df in plyr. Here's plyr's match_df:

另一个基于match_df的plyr解决方案。这是plyr match_df:

match_df <- function (x, y, on = NULL) 
{
    if (is.null(on)) {
        on <- intersect(names(x), names(y))
        message("Matching on: ", paste(on, collapse = ", "))
    }
    keys <- join.keys(x, y, on)
    x[keys$x %in% keys$y, , drop = FALSE]
}

We can modify it to negate:

我们可以修改为否定:

library(plyr)
negate_match_df <- function (x, y, on = NULL) 
{
    if (is.null(on)) {
        on <- intersect(names(x), names(y))
        message("Matching on: ", paste(on, collapse = ", "))
    }
    keys <- join.keys(x, y, on)
    x[!(keys$x %in% keys$y), , drop = FALSE]
}

Then:

然后:

diff <- negate_match_df(a1,a2)

#1