Is there a more elegant way to find duplicated records?

Time: 2021-10-21 20:24:30

I've got 81,000 records in my test data frame, and duplicated() is showing me that 2,039 are identical matches. One answer to "Find duplicated rows (based on 2 columns) in Data Frame in R" suggests a method for creating a smaller data frame of just the duplicate records. This works for me, too:

dup <- data.frame(as.numeric(duplicated(df$var))) # creates a data frame with a binary variable flagging duplicated rows
colnames(dup) <- c("dup") # renames the column for simplicity
df2 <- cbind(df, dup) # binds it to the original df
df3 <- subset(df2, dup == 1) # subsets df using the binary variable for duplicates

But it seems, as the poster noted, inelegant. Is there a cleaner way to get the same result: a view of just those records that are duplicates?

In my case I'm working with scraped data, and I need to figure out whether the duplicates exist in the original source or were introduced by my scraping.

2 Answers

#1


2  

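duplicated(df) will give you a logical vector (every value is either TRUE or FALSE) that flags each row repeating an earlier row. You can then use it as an index to the rows of your data frame. Passing a single column, such as df$var, checks for repeats in that column only.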

# indx will contain TRUE values wherever in df$var there is a duplicate
indx <- duplicated(df$var)
df[indx, ]  #note the comma 

You can put it all together in one line:

df[duplicated(df$var), ]  # again, the comma, to indicate we are selecting rows
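
One thing to keep in mind: duplicated() marks only the second and later occurrences of a value, so the first copy of each repeated record is left out of the result above. If you want to see every row involved in a duplication, a minimal sketch is to combine a forward and a backward pass using the fromLast argument. The toy data frame and the names toy, is_dup and all_copies are made up for illustration.

```r
# Hypothetical toy data; "var" stands in for the column being checked.
toy <- data.frame(id = 1:6, var = c("a", "b", "a", "c", "b", "a"))

# The forward pass flags later occurrences; the backward pass flags earlier ones.
is_dup <- duplicated(toy$var) | duplicated(toy$var, fromLast = TRUE)

# Keep every row whose var value occurs more than once,
# then sort so that copies of the same value sit next to each other.
all_copies <- toy[is_dup, ]
all_copies <- all_copies[order(all_copies$var), ]
print(all_copies)
```

Seeing all copies side by side may make it easier to judge whether a repeat was already in the source or came from the scrape.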

#2


-1  

doops <- duplicated(df$var)  # TRUE for each repeat of an earlier value
uniques <- df[!doops, ]      # a logical index also works when there are no duplicates
duplicates <- df[doops, ]

This is the logic I generally use when I am trying to remove duplicate entries from a data frame.
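
Whether you index with which() positions or with the logical vector directly matters in one edge case: when nothing is duplicated, which() returns an empty vector, and negating an empty index selects zero rows rather than all of them. A small sketch of the difference, using a made-up data frame named clean:

```r
# Hypothetical data frame with no repeated values in "var".
clean <- data.frame(id = 1:3, var = c("x", "y", "z"))

doops <- which(duplicated(clean$var))   # integer(0): nothing is duplicated
via_which <- clean[-doops, ]            # negating an empty index keeps no rows

is_dup <- duplicated(clean$var)         # FALSE FALSE FALSE
via_logical <- clean[!is_dup, ]         # keeps all three rows

nrow(via_which)
nrow(via_logical)
```

With the logical index, the "uniques" result stays correct whether or not any duplicates exist.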
