Let's say I have four samples: id=1, 2, 3, and 4, with one or more measurements on each of those samples:
> a <- data.frame(id=c(1,1,2,2,3,4), value=c(1,2,3,-4,-5,6))
> a
  id value
1  1     1
2  1     2
3  2     3
4  2    -4
5  3    -5
6  4     6
I want to remove duplicates, keeping only one entry per id: the one with the largest absolute value in the "value" column. That is, this is what I want:
> a[c(2,4,5,6), ]
  id value
2  1     2
4  2    -4
5  3    -5
6  4     6
How might I do this in R?
6 solutions
#1
29
aa <- a[order(a$id, -abs(a$value)), ]  # sort by id, then by descending abs(value)
aa[!duplicated(aa$id), ]               # take the first row within each id
  id value
2  1     2
4  2    -4
5  3    -5
6  4     6
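The sort-then-deduplicate idea above can be wrapped in a small helper. This is only a sketch: the function name keep_max_abs and its arguments are made up for illustration, and it assumes the value column is numeric and contains no NA.

```r
# Hypothetical helper (not part of the original answer): for each id,
# keep the row whose value has the largest absolute value.
keep_max_abs <- function(df, id_col = "id", value_col = "value") {
  # Sort by id, then by descending absolute value
  ord <- order(df[[id_col]], -abs(df[[value_col]]))
  sorted <- df[ord, , drop = FALSE]
  # duplicated() flags every row after the first within each id
  sorted[!duplicated(sorted[[id_col]]), , drop = FALSE]
}

a <- data.frame(id = c(1, 1, 2, 2, 3, 4), value = c(1, 2, 3, -4, -5, 6))
res <- keep_max_abs(a)
print(res)
```

Because duplicated() keeps only the first row per id, ties in abs(value) are resolved by whichever tied row sorts first.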
#2
9
A data.table approach might be in order if your data set is very large:
library(data.table)
aDT <- as.data.table(a)
setkey(aDT,"id")
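# by = .EACHI makes j run once per id; data.table >= 1.9.4 no longer does this implicitly
aDT[J(unique(id)), list(value = value[which.max(abs(value))]), by = .EACHI]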
Or a somewhat slower, but still fast, alternative:
library(data.table)
as.data.table(a)[, .SD[which.max(abs(value))], by=id]
This version returns all the columns of a, in case there are more in the real dataset.
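To illustrate the point about extra columns, here is a minimal sketch with an additional column. The column name label and its contents are made up for the example; they are not in the original data.

```r
library(data.table)

# Same data as in the question, plus a made-up metadata column `label`
a2 <- data.table(id    = c(1, 1, 2, 2, 3, 4),
                 value = c(1, 2, 3, -4, -5, 6),
                 label = c("a", "b", "c", "d", "e", "f"))

# .SD holds all non-grouping columns for the current id,
# so the whole row travels with the selected value
res <- a2[, .SD[which.max(abs(value))], by = id]
print(res)
```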
#3
9
Check out ?aggregate:
aggregate(value~id,a,function(x) x[which.max(abs(x))])
I like the answer by @DWin, but I would like to show how this could also work with metadata:
aa <- merge(aggregate(value ~ id, a, function(x) x[which.max(abs(x))]), a)
# Without the next line, an id whose maximum value occurs more than once keeps duplicate rows.
aa[!duplicated(aa), ]
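The duplicate-maximum case mentioned in the comment can be reproduced with a small example. The data frame b below is made up for illustration and is not from the question.

```r
# Made-up data: id 1 has its maximum value (2) recorded twice
b <- data.frame(id = c(1, 1, 1, 2), value = c(1, 2, 2, -4))

best <- aggregate(value ~ id, b, function(x) x[which.max(abs(x))])
merged <- merge(best, b)                  # joins on the shared columns id and value
print(nrow(merged))                       # 3: the tied row for id 1 appears twice
deduped <- merged[!duplicated(merged), ]
print(nrow(deduped))                      # 2: one row per id
```

Note that !duplicated(merged) only drops rows that are identical in every column. If tied rows differed in some metadata column, both would survive, and deduplicating on merged$id instead may be closer to what is wanted.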
I couldn't help myself and wrote one last answer:
do.call(rbind,lapply(split(a,a$id),function(x) x[which.max(abs(x$value)),]))
#4
5
Another approach (though the code might look a little cumbersome) is to use ave():
a[which(abs(a$value) == ave(a$value, a$id,
                            FUN=function(x) max(abs(x)))), ]
#   id value
# 2  1     2
# 4  2    -4
# 5  3    -5
# 6  4     6
#5
3
library(plyr)
ddply(a, .(id), function(x) return(x[which(abs(x$value)==max(abs(x$value))),]))
#6
1
Here is a dplyr approach:
library(dplyr)
a %>%
  group_by(id) %>%
  top_n(1, abs(value))
# A tibble: 4 x 2
# Groups:   id [4]
#      id value
#   <dbl> <dbl>
# 1     1     2
# 2     2    -4
# 3     3    -5
# 4     4     6
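Note that top_n() keeps all tied rows, and in dplyr 1.0.0 and later it is marked as superseded by slice_max(). A minimal sketch of the same idea with slice_max(), assuming dplyr >= 1.0.0 is installed, might look like this:

```r
library(dplyr)

a <- data.frame(id = c(1, 1, 2, 2, 3, 4), value = c(1, 2, 3, -4, -5, 6))

res <- a %>%
  group_by(id) %>%
  # with_ties = FALSE returns exactly one row per id, even when abs(value) ties
  slice_max(abs(value), n = 1, with_ties = FALSE) %>%
  ungroup()
print(res)
```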