I have a data set with several time assessments for each participant. I want to select the last assessment for each participant. My dataset looks like this:
ID week outcome
1 2 14
1 4 28
1 6 42
4 2 14
4 6 46
4 9 64
4 9 71
4 12 85
9 2 14
9 4 28
9 6 51
9 9 66
9 12 84
I want to select only the last observation/assessment for each participant, but the only indicator of ordering I have is the week number. How can I do this in R (or Excel)?
thanks in advance,
niki
6 Answers
#1
11
Here is one base-R approach:
do.call("rbind",
by(df, INDICES=df$ID, FUN=function(DF) DF[which.max(DF$week), ]))
ID week outcome
1 1 6 42
4 4 12 85
9 9 12 84
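(For the record, the answers in this thread all refer to a data frame `df` that was never posted; a minimal reconstruction of the question's table, so the snippets can actually be run, would be:)

```r
# The question's data, reconstructed as a base data frame (assumed layout)
df <- data.frame(
  ID      = rep(c(1, 4, 9), times = c(3, 5, 5)),
  week    = c(2, 4, 6,  2, 6, 9, 9, 12,  2, 4, 6, 9, 12),
  outcome = c(14, 28, 42,  14, 46, 64, 71, 85,  14, 28, 51, 66, 84)
)
```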
Alternatively, the data.table package offers a succinct and expressive language for performing data-frame manipulations of this type:
library(data.table)
dt <- data.table(df, key="ID")
dt[, .SD[which.max(outcome), ], by=ID]
# ID week outcome
# [1,] 1 6 42
# [2,] 4 12 85
# [3,] 9 12 84
# Same but much faster.
# (Actually, only the same as long as there are no ties for max(outcome)..)
dt[ dt[,outcome==max(outcome),by=ID][[2]] ] # same, but much faster.
# If there are ties for max(outcome), the following will still produce
# the same results as the method using .SD, but will be faster
i1 <- dt[,which.max(outcome), by=ID][[2]]
i2 <- dt[,.N, by=ID][[2]]
dt[i1 + cumsum(i2) - i2,]
Finally, here is a plyr-based solution:
library(plyr)
ddply(df, .(ID), function(X) X[which.max(X$week), ])
# ID week outcome
# 1 1 6 42
# 2 4 12 85
# 3 9 12 84
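Not part of the original answer, but on current R the same idea can be sketched with dplyr as well (the data frame is rebuilt below so the chunk runs on its own; `slice_max` keeps the row with the largest `week` within each group):

```r
library(dplyr)

# The question's data, rebuilt so this chunk is self-contained
df <- data.frame(
  ID      = rep(c(1, 4, 9), times = c(3, 5, 5)),
  week    = c(2, 4, 6,  2, 6, 9, 9, 12,  2, 4, 6, 9, 12),
  outcome = c(14, 28, 42,  14, 46, 64, 71, 85,  14, 28, 51, 66, 84)
)

# One row per ID: the one with the largest week (first row of any ties)
last_obs <- df %>%
  group_by(ID) %>%
  slice_max(week, n = 1, with_ties = FALSE) %>%
  ungroup()
```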
#2
9
If you're just looking for the last observation per participant ID, then two lines of base R will do it. I'm always up for a simple base solution when possible, though it's great to have more than one way to solve a problem.
dat <- dat[order(dat$ID, dat$Week), ]        # Sort by ID and week
dat[!duplicated(dat$ID, fromLast = TRUE), ]  # Keep the last observation per ID
ID Week Outcome
3 1 6 42
8 4 12 85
13 9 12 84
#3
2
I can play this game. I ran some benchmarks on the differences between lapply, sapply, and by, among other things. It appears that the more control you have over data types and the more basic the operation, the faster it is (e.g., lapply is generally faster than sapply, and as.numeric(lapply(...)) will be faster still). With that in mind, the following produces the same results as above and may be faster than the rest.
df[cumsum(as.numeric(lapply(split(df$week, df$ID), which.max))), ]
Explanation: we only want which.max applied to week within each ID; that is what the lapply handles. We only need the vector of these within-group positions, so coerce it to numeric. The result is the vector (3, 5, 5). We then need to add the positions of the prior maxes, which cumsum accomplishes (here each group's max is its last row, so the positions accumulate to absolute row numbers).
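The intermediate values can be inspected step by step (a sketch; the data frame is rebuilt here so the chunk runs on its own):

```r
# The question's data, rebuilt so this chunk is self-contained
df <- data.frame(
  ID      = rep(c(1, 4, 9), times = c(3, 5, 5)),
  week    = c(2, 4, 6,  2, 6, 9, 9, 12,  2, 4, 6, 9, 12),
  outcome = c(14, 28, 42,  14, 46, 64, 71, 85,  14, 28, 51, 66, 84)
)

s   <- split(df$week, df$ID)             # per-ID week vectors
rel <- as.numeric(lapply(s, which.max))  # within-group position of each max: 3 5 5
idx <- cumsum(rel)                       # 3 8 13 -- absolute rows, since each max is its group's last row
df[idx, ]                                # the last assessment per participant
```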
It should be noted that this solution is not general, because of the cumsum. It requires that the frame be sorted on ID and week before execution; I hope you see why (and know that with(df, order(ID, week)) in the row index would achieve that). In any case, it can still fail when the max is not unique, because which.max only takes the first one. So my solution begs the question a bit, but that goes without saying: we're extracting very specific information for a very specific example, and our solutions can't be fully general (even though the methods are important to understand in general).
I'll leave it to trinker to update his comparisons!
#4
2
Another option in base: df[rev(rownames(df)),][match(unique(df$ID), rev(df$ID)), ]
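Unpacked (and with the data rebuilt so the chunk runs on its own): reversing the frame puts each ID's last row first, and match then finds that first occurrence per ID:

```r
# The question's data, rebuilt so this chunk is self-contained
df <- data.frame(
  ID      = rep(c(1, 4, 9), times = c(3, 5, 5)),
  week    = c(2, 4, 6,  2, 6, 9, 9, 12,  2, 4, 6, 9, 12),
  outcome = c(14, 28, 42,  14, 46, 64, 71, 85,  14, 28, 51, 66, 84)
)

rdf  <- df[rev(rownames(df)), ]           # reverse the row order
hits <- match(unique(df$ID), rev(df$ID))  # first position of each ID among the reversed IDs
rdf[hits, ]                               # i.e., the original last row per ID
```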
#5
2
This answer uses the data.table package. It should be very fast, even with larger data sets.
setkey(DT, ID, week) # Ensure it's sorted.
DT[DT[, .I[.N], by = ID][, V1]]
Explanation: .I is an integer vector holding the row locations for each group (here the group is ID), and .N is a length-one integer vector containing the number of rows in the group. So the "inner" DT[.] extracts the location of the last row of each group, relying on the data being sorted by ID and week, and the "outer" DT[.] then subsets DT at those locations.
For comparison (because it's not posted elsewhere), here's how you can generate the original data so that you can run the code:
DT <-
data.table(
ID = c(rep(1, 3), rep(4, 5), rep(9, 5)),
week = c(2,4,6, 2,6,9,9,12, 2,4,6,9,12),
outcome = c(14,28,42, 14,46,64,71,85, 14,28,51,66,84))
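With DT built and keyed as above, the same "last row per group" result can also be written more readably via .SD (typically slower on large data, but it makes the intent obvious); the chunk below is self-contained:

```r
library(data.table)

DT <- data.table(
  ID = c(rep(1, 3), rep(4, 5), rep(9, 5)),
  week = c(2,4,6, 2,6,9,9,12, 2,4,6,9,12),
  outcome = c(14,28,42, 14,46,64,71,85, 14,28,51,66,84))
setkey(DT, ID, week)            # ensure rows are sorted within each ID

last <- DT[, .SD[.N], by = ID]  # .SD[.N] = last row of each group's sub-table
```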
#6
1
I've been trying to use split and tapply a bit more to become better acquainted with them. I know this question has already been answered, but I thought I'd add another solution using split (pardon the ugliness; I'm more than open to feedback for improvement; I thought maybe tapply could be used to shorten the code):
sdf <- with(df, split(df, ID))
max.week <- sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))
data.frame(t(mapply(function(x, y) y[x, ], max.week, sdf)))
I also figured that with seven answers here, the question was ripe for a benchmark. The results may surprise you (using rbenchmark with R 2.14.1 on a Windows 7 machine):
# library(rbenchmark)
# benchmark(
# DATA.TABLE= {dt <- data.table(df, key="ID")
# dt[, .SD[which.max(outcome),], by=ID]},
# DO.CALL={do.call("rbind",
# by(df, INDICES=df$ID, FUN=function(DF) DF[which.max(DF$week),]))},
# PLYR=ddply(df, .(ID), function(X) X[which.max(X$week), ]),
# SPLIT={sdf <-with(df, split(df, ID))
# max.week <- sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))
# data.frame(t(mapply(function(x, y) y[x, ], max.week, sdf)))},
# MATCH.INDEX=df[rev(rownames(df)),][match(unique(df$ID), rev(df$ID)), ],
# AGGREGATE=df[cumsum(aggregate(week ~ ID, df, which.max)$week), ],
# #WHICH.MAX.INDEX=df[sapply(unique(df$ID), function(x) which.max(x==df$ID)), ],
# BRYANS.INDEX = df[cumsum(as.numeric(lapply(split(df$week, df$ID),
# which.max))), ],
# SPLIT2={sdf <-with(df, split(df, ID))
# df[cumsum(sapply(seq_along(sdf), function(x) which.max(sdf[[x]][, 'week']))),
# ]},
# TAPPLY=df[tapply(seq_along(df$ID), df$ID, function(x){tail(x,1)}),],
# columns = c( "test", "replications", "elapsed", "relative", "user.self","sys.self"),
# order = "test", replications = 1000, environment = parent.frame())
test replications elapsed relative user.self sys.self
6 AGGREGATE 1000 4.49 7.610169 2.84 0.05
7 BRYANS.INDEX 1000 0.59 1.000000 0.20 0.00
1 DATA.TABLE 1000 20.28 34.372881 11.98 0.00
2 DO.CALL 1000 4.67 7.915254 2.95 0.03
5 MATCH.INDEX 1000 1.07 1.813559 0.51 0.00
3 PLYR 1000 10.61 17.983051 5.07 0.00
4 SPLIT 1000 3.12 5.288136 1.81 0.00
8 SPLIT2 1000 1.56 2.644068 1.28 0.00
9 TAPPLY 1000 1.08 1.830508 0.88 0.00
Edit 1: I omitted the WHICH.MAX solution, as it does not return the correct results, and added an AGGREGATE solution I wanted to try (compliments of Bryan Goodrich), as well as an updated version of the split approach, SPLIT2, which uses cumsum (I liked that move).
Edit 2: Dason also chimed in with a tapply solution, which I threw into the test; it fared pretty well too.