Find and merge duplicate rows in a data.frame, ignoring column order

Date: 2022-03-09 17:03:35

I have a data.frame with 1,000 rows and 3 columns. It contains a large number of duplicates and I've used plyr to combine the duplicate rows and add a count for each combination as explained in this thread.

Here's an example of what I have now (I still also have the original data.frame with all of the duplicates if I need to start from there):

   name1    name2    name3     total
1  Bob      Fred     Sam       30
2  Bob      Joe      Frank     20
3  Frank    Sam      Tom       25
4  Sam      Tom      Frank     10
5  Fred     Bob      Sam       15

However, column order doesn't matter. I just want to know how many rows have the same three entries, in any order. How can I combine the rows that contain the same entries, ignoring order? In this example I would want to combine rows 1 and 5, and rows 3 and 4.

2 solutions

#1


Define another column that's a "sorted paste" of the names, which would have the same value of "Bob~Fred~Sam" for rows 1 and 5. Then aggregate based on that.

Brief code snippet (assumes the original data frame is dd). We create a lookup column holding the sorted, pasted names, sum the total column for each combination, and then filter down to one row per unique combination:

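# Order-independent key: sort the three names within each row, then paste them
dd$lookup <- apply(dd[, c("name1", "name2", "name3")], 1,
                   function(x) paste(sort(x), collapse = "~"))
# Sum the totals for each key
tab1 <- tapply(dd$total, dd$lookup, sum)
# Keep the first row seen for each key
ee <- dd[match(unique(dd$lookup), dd$lookup), ]
# Attach the summed total to each kept row
ee$newtotal <- as.numeric(tab1)[match(ee$lookup, names(tab1))]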

ee now holds one row per unique combination, with the summed counts in newtotal (the total column still holds the count of the first row seen for that combination). No external packages are needed, and you can inspect the result at every stage of the process.
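
To see what the lookup key looks like on the sample data from the question, here is a minimal sketch. It assumes the five example rows are typed in by hand as dd; the data frame construction is an illustration, not part of the original answer.

```r
# Sample data from the question, entered by hand (an assumption for illustration)
dd <- data.frame(
  name1 = c("Bob", "Bob", "Frank", "Sam", "Fred"),
  name2 = c("Fred", "Joe", "Sam", "Tom", "Bob"),
  name3 = c("Sam", "Frank", "Tom", "Frank", "Sam"),
  total = c(30, 20, 25, 10, 15),
  stringsAsFactors = FALSE
)

# Order-independent key: sort the names within each row, then paste them
dd$lookup <- apply(dd[, c("name1", "name2", "name3")], 1,
                   function(x) paste(sort(x), collapse = "~"))

# Sum the totals per key
tab1 <- tapply(dd$total, dd$lookup, sum)

print(dd$lookup)  # one key per row
print(tab1)       # one summed total per distinct key
```

Rows 1 and 5 both get the key "Bob~Fred~Sam", and rows 3 and 4 both get "Frank~Sam~Tom", so tab1 has three entries.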

(Minor update to help OP:) And if you want a cleaned-up version of the final answer:

outdf = with(ee,data.frame(name1,name2,name3,
                           total=newtotal,stringsAsFactors=FALSE))

This gives you a neat data frame with the three all-important name columns, and with the aggregated totals in a column called total rather than newtotal.
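
The question mentions that the original data frame with all of the duplicates is still available. If you would rather start from there, the same key can be counted directly. This is only a sketch: raw, key, and counts are hypothetical names, and the rows in raw are made up for illustration.

```r
# Hypothetical raw data: one row per observation, duplicates not yet combined
raw <- data.frame(
  name1 = c("Bob", "Fred", "Sam", "Frank", "Bob"),
  name2 = c("Fred", "Bob", "Tom", "Sam", "Joe"),
  name3 = c("Sam", "Sam", "Frank", "Tom", "Frank"),
  stringsAsFactors = FALSE
)

# Same order-independent key: sort the names within each row, then paste them
key <- apply(raw, 1, function(x) paste(sort(x), collapse = "~"))

# Count the rows per key and return the result as a data frame
counts <- as.data.frame(table(lookup = key), responseName = "total",
                        stringsAsFactors = FALSE)
print(counts)
```

Here table() does the counting that plyr did in the first pass, so no intermediate totals are needed.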

#2


Sort the index columns, then use ddply to aggregate and sum:

Define the data:

dat <- "   name1    name2    name3     total
1  Bob      Fred     Sam       30
2  Bob      Joe      Frank     20
3  Frank    Sam      Tom       25
4  Sam      Tom      Frank     10
5  Fred     Bob      Sam       15"

x <- read.table(text=dat, header=TRUE)

Create a copy:

xx <- x

Use apply to sort the names within each row, then aggregate:

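# Sort the three names within each row so that their order no longer matters
xx[, -4] <- t(apply(xx[, -4], 1, sort))
library(plyr)
# Group by the sorted names and sum every numeric column (here: total)
ddply(xx, .(name1, name2, name3), numcolwise(sum))
#   name1 name2 name3 total
# 1   Bob Frank   Joe    20
# 2   Bob  Fred   Sam    45
# 3 Frank   Sam   Tom    35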
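
If you prefer not to load plyr, the aggregation step can also be written with base R's aggregate(). This is a sketch of an alternative, not part of the original answer; it assumes the same sample data, typed in by hand, and res is a hypothetical name.

```r
# Sample data from the question, entered by hand (an assumption for illustration)
x <- data.frame(
  name1 = c("Bob", "Bob", "Frank", "Sam", "Fred"),
  name2 = c("Fred", "Joe", "Sam", "Tom", "Bob"),
  name3 = c("Sam", "Frank", "Tom", "Frank", "Sam"),
  total = c(30, 20, 25, 10, 15),
  stringsAsFactors = FALSE
)

# Work on a copy and sort the names within each row
xx <- x
xx[, -4] <- t(apply(xx[, -4], 1, sort))

# Sum total for each distinct combination of the sorted names
res <- aggregate(total ~ name1 + name2 + name3, data = xx, FUN = sum)
print(res)
```

The result contains the same three combinations and sums as the ddply call, although the rows may come back in a different order.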
