I have a dataset with multiple observations per subject, and I want to create a subset that contains only the observation with the maximum value for each subject. For example, given the data set below:
    ID <- c(1,1,1,2,2,2,2,3,3)
    Value <- c(2,3,5,2,5,8,17,3,5)
    Event <- c(1,1,2,1,2,1,2,2,2)
    group <- data.frame(Subject=ID, pt=Value, Event=Event)
Subjects 1, 2 and 3 have maximum pt values of 5, 17 and 5 respectively. How can I first find the largest pt value for each subject, and then put that observation into another data frame? The resulting subset should contain only the row with the largest pt value for each subject.
7 Answers
#1
47
Here's a data.table solution:
    require(data.table) ## 1.9.2
    group <- as.data.table(group)
If you want to keep all the entries corresponding to max values of pt within each group:
    group[group[, .I[pt == max(pt)], by=Subject]$V1]
    #    Subject pt Event
    # 1:       1  5     2
    # 2:       2 17     2
    # 3:       3  5     2
If you'd like just the first max value of pt:
    group[group[, .I[which.max(pt)], by=Subject]$V1]
    #    Subject pt Event
    # 1:       1  5     2
    # 2:       2 17     2
    # 3:       3  5     2
In this case, it doesn't make a difference, as there aren't multiple maximum values within any group in your data.
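To see where the two variants diverge, here is a minimal sketch on made-up data with a tie. The table `tied` and the names `all_max` and `first_max` are illustrative and not part of the original answer.

```r
library(data.table)

# Hypothetical data: subject 1 reaches its maximum pt of 5 in two rows
tied <- data.table(Subject = c(1, 1, 1, 2, 2),
                   pt      = c(2, 5, 5, 3, 8),
                   Event   = c(1, 1, 2, 1, 2))

# Keeps every row that attains the group maximum (both tied rows for subject 1)
all_max <- tied[tied[, .I[pt == max(pt)], by = Subject]$V1]

# Keeps only the first row that attains the group maximum
first_max <- tied[tied[, .I[which.max(pt)], by = Subject]$V1]

nrow(all_max)    # 3
nrow(first_max)  # 2
```

Which one is appropriate depends on whether tied observations should all be carried into the subset.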
#2
25
The most intuitive method is to use the group_by and top_n functions in dplyr:
    library(dplyr)
    group %>% group_by(Subject) %>% top_n(1, pt)
The result is:
    Source: local data frame [3 x 3]
    Groups: Subject [3]
      Subject    pt Event
        (dbl) (dbl) (dbl)
    1       1     5     2
    2       2    17     2
    3       3     5     2
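In dplyr 1.0.0 and later, top_n is superseded by slice_max. The sketch below shows roughly the same selection with that function, assuming a dplyr version that provides it. The name `max_rows` is illustrative, and `with_ties = FALSE` is a choice made here to return one row per subject; the default keeps ties.

```r
library(dplyr)

group <- data.frame(Subject = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                    pt      = c(2, 3, 5, 2, 5, 8, 17, 3, 5),
                    Event   = c(1, 1, 2, 1, 2, 1, 2, 2, 2))

# One row per subject: the row with the largest pt
max_rows <- group %>%
  group_by(Subject) %>%
  slice_max(pt, n = 1, with_ties = FALSE) %>%
  ungroup()

max_rows
```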
#3
19
A shorter solution using data.table:
    setDT(group)[, .SD[which.max(pt)], by=Subject]
    #    Subject pt Event
    # 1:       1  5     2
    # 2:       2 17     2
    # 3:       3  5     2
#4
6
I wasn't sure what you wanted to do about the Event column, but if you want to keep that as well, how about:
    isIDmax <- with(group, ave(pt, Subject, FUN=function(x) seq_along(x)==which.max(x)))==1
    group[isIDmax, ]
    #   Subject pt Event
    # 3       1  5     2
    # 7       2 17     2
    # 9       3  5     2
Here we use ave to look at the "pt" column for each "Subject". Within each group we find the position of the maximum value, and then turn that into a logical vector we can use to subset the original data.frame.
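A related base R idiom, offered as a sketch and not as part of the original answer, is to let ave compute each group's maximum and compare it to the column directly. Unlike the which.max version, this keeps all tied rows. The name `max_rows` is illustrative.

```r
group <- data.frame(Subject = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                    pt      = c(2, 3, 5, 2, 5, 8, 17, 3, 5),
                    Event   = c(1, 1, 2, 1, 2, 1, 2, 2, 2))

# ave() returns, for every row, the maximum pt of that row's subject
group_max <- ave(group$pt, group$Subject, FUN = max)

# Keep the rows whose pt equals their subject's maximum
max_rows <- group[group$pt == group_max, ]
max_rows
```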
#5
3
A dplyr solution:
    library(dplyr)
    ID <- c(1,1,1,2,2,2,2,3,3)
    Value <- c(2,3,5,2,5,8,17,3,5)
    Event <- c(1,1,2,1,2,1,2,2,2)
    group <- data.frame(Subject=ID, pt=Value, Event=Event)
    group <- group_by(group, Subject)
    summarize(group, max.pt = max(pt))
This yields the following data frame:
      Subject max.pt
    1       1      5
    2       2     17
    3       3      5
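Note that this summary contains only Subject and the maximum, so the Event column is dropped. If the full observations are needed, one possible approach (a sketch using base R's aggregate and merge, not something the answer itself proposes) is to join the per-subject maxima back onto the original data. The names `maxes` and `full_rows` are illustrative.

```r
group <- data.frame(Subject = c(1, 1, 1, 2, 2, 2, 2, 3, 3),
                    pt      = c(2, 3, 5, 2, 5, 8, 17, 3, 5),
                    Event   = c(1, 1, 2, 1, 2, 1, 2, 2, 2))

# Per-subject maximum of pt (columns: Subject, pt)
maxes <- aggregate(pt ~ Subject, data = group, FUN = max)

# Join on the shared columns Subject and pt to recover the matching rows
full_rows <- merge(group, maxes, by = c("Subject", "pt"))
full_rows
```

If a subject has tied maxima, the merge returns every tied row.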
#6
2
Using base R:
    do.call(rbind, lapply(split(group, as.factor(group$Subject)), function(x) {
      return(x[which.max(x$pt), ])
    }))
#7
-1
Another option is slice:
    library(dplyr)
    group %>% group_by(Subject) %>% slice(which.max(pt))
    #   Subject    pt Event
    #     <dbl> <dbl> <dbl>
    # 1       1     5     2
    # 2       2    17     2
    # 3       3     5     2