Merge in R produces more rows than one of the data frames

Time: 2021-11-10 21:45:48

I have two data frames: the first contains 9994 rows and the second contains 60431 rows. I want to merge them so that the result contains the combined columns of both data frames but still has only 9994 rows.

However, I get more than 9994 rows upon merge. How can I make sure this does not happen?

df1 = readRDS('data1.RDS')
nrow(df1)
# [1] 9994

df2 = readRDS('data2.RDS')
nrow(df2)
# [1] 60431

df = merge(df1,df2,by=c("col1","col2"))
nrow(df)
# [1] 10057

df = merge(df1,df2,by=c("col1","col2"),all.x=TRUE)
nrow(df)
# [1] 10057
nrow(na.omit(df))
# [1] 10057

EDIT: Following akrun's comment, I checked, and yes, there are duplicate {col1,col2} combinations in the second data frame.

nrow(unique(df2[,c("col1","col2")]))
# [1] 60263
nrow(df2)
# [1] 60431

How can I keep only one row from a data frame when there are multiple rows for the same {col1,col2} combination? After the merge, I would like to have exactly 9994 rows.

1 solution

#1


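This should work. Be sure to sort df2 first so that you select the right rows: duplicated() flags every occurrence of a {col1,col2} combination after the first, so only the first row for each combination is kept.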

df = merge(
  df1,
  # keep only the first df2 row for each {col1,col2} combination
  df2[!duplicated(df2[, c("col1","col2")]), ],
  by=c("col1","col2"),
  all.x=TRUE
)
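
The advice to sort df2 first can be sketched as follows. This is only an illustration under an assumed rule: the toy values, the column y, and the choice "keep the row with the largest y" are hypothetical and should be replaced by whatever criterion fits the real data.

```r
# Toy stand-ins for df1 and df2 (hypothetical values)
df1 <- data.frame(col1 = c("a", "b"), col2 = c(1, 2), x = c(10, 20))
df2 <- data.frame(col1 = c("a", "a", "b"), col2 = c(1, 1, 2), y = c(100, 101, 200))

# Sort so that the preferred row (assumed here: largest y) comes first within each key
df2_sorted <- df2[order(df2$col1, df2$col2, -df2$y), ]

# !duplicated() keeps the first row per key, which is now the preferred one
df2_unique <- df2_sorted[!duplicated(df2_sorted[, c("col1", "col2")]), ]

merged <- merge(df1, df2_unique, by = c("col1", "col2"), all.x = TRUE)

# With unique keys in df2_unique, the left join cannot add rows
stopifnot(nrow(merged) == nrow(df1))
```

Because every key in df2_unique is unique, all.x=TRUE returns exactly one row per row of df1.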
