Remove duplicates, keeping the entry with the largest absolute value

Time: 2021-02-28 04:41:06

Let's say I have four samples: id=1, 2, 3, and 4, with one or more measurements on each of those samples:

> a <- data.frame(id=c(1,1,2,2,3,4), value=c(1,2,3,-4,-5,6))
> a
  id value
1  1     1
2  1     2
3  2     3
4  2    -4
5  3    -5
6  4     6

I want to remove duplicates, keeping only one entry per ID - the one having the largest absolute value of the "value" column. I.e., this is what I want:

> a[c(2,4,5,6), ]
  id value
2  1     2
4  2    -4
5  3    -5
6  4     6

How might I do this in R?

6 answers

#1


29  

 aa <- a[order(a$id, -abs(a$value) ), ] #sort by id and reverse of abs(value)
 aa[ !duplicated(aa$id), ]              # take the first row within each id
  id value
2  1     2
4  2    -4
5  3    -5
6  4     6
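
The same sort-then-deduplicate idea can be wrapped in a small function so it works for other column names. This is only a sketch: the helper name `keep_max_abs` and its `id_col` / `value_col` parameters are assumptions for illustration and do not appear in the answer above. Because `order()` leaves unresolved ties in their original order, a tie in absolute value keeps the row that appears first in the input.

```r
# Hypothetical helper (name and parameters are illustrative, not from the answer).
keep_max_abs <- function(df, id_col = "id", value_col = "value") {
  # Sort by id, then by descending absolute value within each id.
  sorted <- df[order(df[[id_col]], -abs(df[[value_col]])), , drop = FALSE]
  # duplicated() flags every row after the first one seen for an id,
  # so negating it keeps exactly one row per id.
  sorted[!duplicated(sorted[[id_col]]), , drop = FALSE]
}

a <- data.frame(id = c(1, 1, 2, 2, 3, 4), value = c(1, 2, 3, -4, -5, 6))
result <- keep_max_abs(a)
print(result)
```

Any extra columns in the data frame are carried along, since whole rows are selected.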

#2


9  

A data.table approach might be in order if your data set is very large:

library(data.table)

aDT <- as.data.table(a)
setkey(aDT,"id")

# by = .EACHI evaluates j once per id (needed since data.table 1.9.4)
aDT[J(unique(id)), list(value = value[which.max(abs(value))]), by = .EACHI]


Or a slightly slower, but still fast, alternative:

library(data.table)
as.data.table(a)[, .SD[which.max(abs(value))], by=id]

This version returns all the columns of a, in case there are more in the real dataset.

#3


9  

Check out ?aggregate:

aggregate(value~id,a,function(x) x[which.max(abs(x))])

I like the answer by @DWin, but I would like to show how this could also work with metadata:

aa<-merge(aggregate(value~id,a,function(x) x[which.max(abs(x))]),a)
# If the max value is repeated within a single id, merge() returns one row per repeat;
# the next line drops rows that are exact duplicates.
aa[!duplicated(aa),]

I couldn't help myself and created one last answer:

do.call(rbind,lapply(split(a,a$id),function(x) x[which.max(abs(x$value)),]))

#4


5  

Another approach (though the code might look a little cumbersome) is to use ave():

a[which(abs(a$value) == ave(a$value, a$id, 
                            FUN=function(x) max(abs(x)))), ]
#   id value
# 2  1     2
# 4  2    -4
# 5  3    -5
# 6  4     6
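
One thing to be aware of with the ave() comparison: if two rows of the same id share the largest absolute value, both are returned. The sketch below uses made-up data (`b` is not from the question) to show this, and adds a `!duplicated()` step as one possible way to keep only the first tied row.

```r
# Hypothetical data with a tie: id 1 has values 2 and -2.
b <- data.frame(id = c(1, 1, 2), value = c(2, -2, 3))

# Per-row group maximum of the absolute value.
max_abs <- ave(abs(b$value), b$id, FUN = max)

# Every row whose absolute value equals its group maximum (ties included).
tied <- b[abs(b$value) == max_abs, ]

# Keep only the first such row per id.
first_only <- tied[!duplicated(tied$id), ]

print(tied)
print(first_only)
```

Whether tied rows should be kept or collapsed depends on the data, so the last step is optional.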

#5


3  

library(plyr)
ddply(a, .(id), function(x) return(x[which(abs(x$value)==max(abs(x$value))),]))

#6


1  

Here is a dplyr approach:

library(dplyr)
a %>% 
        group_by(id) %>%
        top_n(1, abs(value))

# A tibble: 4 x 2
# Groups:   id [4]
#     id value
#  <dbl> <dbl>
#1     1     2
#2     2    -4
#3     3    -5
#4     4     6
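
In more recent dplyr releases, top_n() is documented as superseded by slice_max(). The following is a sketch of the same query written that way, assuming dplyr 1.0.0 or later is installed. Setting `with_ties = FALSE` is a choice made here, not part of the original answer, so that exactly one row per id comes back even when absolute values tie.

```r
library(dplyr)

a <- data.frame(id = c(1, 1, 2, 2, 3, 4), value = c(1, 2, 3, -4, -5, 6))

res <- a %>%
  group_by(id) %>%
  # Keep the single row with the largest |value| in each group.
  slice_max(abs(value), n = 1, with_ties = FALSE) %>%
  ungroup()

print(res)
```

With the default `with_ties = TRUE`, tied rows would all be kept, which matches the behaviour of top_n().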
