Compare two data frames, regardless of column order, to get the non-duplicated rows [duplicate]

Time: 2021-12-17 03:25:20

This question already has an answer here:

I want to compare two data frames and find the rows that appear in both. The order of values across the columns doesn't matter, so (71, 78) and (78, 71) count as the same row. Suppose df1 looks like this:

 V2 V3
 71 78
 90 13
 12 67
 56 32

and df2 looks like this:

V2 V3
89 45
77 88
78 71
90 13

Then the non-duplicated rows from both data frames would be:

12 67
56 32
89 45
77 88

How can I achieve this in a simple way?

3 Solutions

#1


Here's a dplyr solution, which will probably be pretty quick on larger datasets:

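library(dplyr)

df1 <- tibble(v1 = c(71, 90, 12, 56), v2 = c(78, 13, 67, 32))
df2 <- tibble(v1 = c(89, 77, 78, 90), v2 = c(45, 88, 71, 13))

df3 <- bind_rows(df1, df2)

df3 %>%
  rowwise() %>%
  # order-independent key; the separator keeps e.g. (1, 112) and (11, 12) distinct
  mutate(key = paste(min(v1, v2), max(v1, v2), sep = "_")) %>%
  group_by(key) %>%
  mutate(size = n()) %>%
  # keep only rows whose key occurs exactly once
  filter(size == 1)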

This solution only works for two variables (columns). To extend it to more variables, you basically just need to adjust how the key is built.
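One way the key could be built for any number of columns is to sort the values of each row before pasting them together. The following is only a sketch in base R, not part of the original answer: `row_key` is a hypothetical helper name, and it assumes all columns hold values of the same type (here numeric), so that sorting within a row is meaningful.

```r
# Hypothetical helper (not from the original answer): builds an
# order-independent key for each row by sorting the row's values.
row_key <- function(df, sep = "_") {
  apply(as.matrix(df), 1, function(r) paste(sort(r), collapse = sep))
}

# Example data from the question
df1 <- data.frame(V2 = c(71, 90, 12, 56), V3 = c(78, 13, 67, 32))
df2 <- data.frame(V2 = c(89, 77, 78, 90), V3 = c(45, 88, 71, 13))
both <- rbind(df1, df2)

key <- row_key(both)

# Keep the rows whose key occurs exactly once across both data frames
result <- both[!(duplicated(key) | duplicated(key, fromLast = TRUE)), ]
result
```

Because the key is built from the sorted row, the same call should work unchanged for data frames with three or more columns.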

Edit: I misunderstood the problem, as per the comments below.

#2


You could try:

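df3 <- rbind(df1, df2)
df4 <- df3  # untouched copy, used to return the original values
# put the larger value of each row first so that column order no longer matters
df3[] <- cbind(do.call(pmax, df3), do.call(pmin, df3))

# drop every row that occurs more than once after normalizing
df4[!(duplicated(df3) | duplicated(df3, fromLast = TRUE)), ]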
#  V2 V3
#3 12 67
#4 56 32
#5 89 45
#6 77 88
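To see why this works, it can help to look at the intermediate result. The sketch below repeats the same steps on the question's data with the normalized rows and the duplicate flags stored under their own names; `normalized`, `hi`, `lo` and `shared` are illustrative names that do not appear in the answer.

```r
# Example data from the question
df1 <- data.frame(V2 = c(71, 90, 12, 56), V3 = c(78, 13, 67, 32))
df2 <- data.frame(V2 = c(89, 77, 78, 90), V3 = c(45, 88, 71, 13))
df3 <- rbind(df1, df2)

# Row-wise maximum and minimum: every row becomes (larger, smaller),
# so (71, 78) and (78, 71) both turn into (78, 71).
normalized <- data.frame(hi = do.call(pmax, df3), lo = do.call(pmin, df3))

# duplicated() alone flags only the second occurrence of a row;
# combining it with fromLast = TRUE flags the first occurrence as well.
shared <- duplicated(normalized) | duplicated(normalized, fromLast = TRUE)

# Rows that are not shared are the non-duplicated rows
df3[!shared, ]
```

Note that `pmax` and `pmin` only give the full ordering for two columns; with more columns the values in between would also need to be sorted.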

#3


The solution provided below works for your example data. This approach may be inefficient on rather large datasets. Then again, computer time is cheap. :)

df1 <- read.table(text = " V2 V3
 71 78
 90 13
 12 67
 56 32", header = TRUE)

df2 <- read.table(text = "V2 V3
89 45
77 88
78 71
90 13", header = TRUE)

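# Returns the row `x` as a one-row data frame, or a row of NAs if some row of
# `ca` consists only of values that also occur in `x` (column order ignored).
throwoutFunction <- function(x, ca) {
  # one column per row of `ca`: is each of its values present in `x`?
  find.duplicates <- apply(ca, MARGIN = 1, FUN = function(y, x) y %in% x, x = x)
  # TRUE for each row of `ca` whose values are all found in `x`
  filter.duplicates <- apply(find.duplicates, MARGIN = 2, all)
  if (any(filter.duplicates)) {
    data.frame(V2 = NA, V3 = NA)
  } else {
    data.frame(V2 = x[1], V3 = x[2])
  }
}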
out1 <- do.call("rbind", apply(df1, MARGIN = 1, FUN = throwoutFunction, ca = df2))

out2 <- do.call("rbind", apply(df2, MARGIN = 1, FUN = throwoutFunction, ca = df1))

out <- na.omit(rbind(out1, out2))
rownames(out) <- 1:nrow(out)
out

  V2 V3
1 12 67
2 56 32
3 89 45
4 77 88
