R data.table: computing on panel data

Time: 2021-05-27 23:23:42

I have panel data (subject/year) for which I would like to only keep subjects who appear the maximum number of times per year. The data set is large so I am using the data.table package. Is there a more elegant solution than what I have tried below?

library(data.table)

DT <- data.table(SUBJECT=c(rep('John',3), rep('Paul',2), 
                           rep('George',3), rep('Ringo',2), 
                           rep('John',2), rep('Paul',4), 
                           rep('George',2), rep('Ringo',4)), 
                 YEAR=c(rep(2011,10), rep(2012,12)), 
                 HEIGHT=rnorm(22), 
                 WEIGHT=rnorm(22))
DT

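# Count rows per subject within each year, then find the largest count per year
DT[, COUNT := .N, by='SUBJECT,YEAR']
DT[, MAXCOUNT := max(COUNT), by='YEAR']

# Keep only the subjects that reach the yearly maximum
DT <- DT[COUNT==MAXCOUNT]
# := works by reference, so there is no need to reassign when dropping the helper columns
DT[, c('COUNT','MAXCOUNT') := NULL]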
DT

1 answer

#1


14  

I'm not sure you'll view this as elegant, but how about:

DT[, COUNT := .N, by='SUBJECT,YEAR']
DT[, .SD[COUNT == max(COUNT)], by='YEAR']

That's essentially how to apply by to the i expression as @SenorO commented. You'll still need [,COUNT:=NULL] afterwards but for one temporary column rather than two.
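
To make the clean-up step concrete, here is a minimal end-to-end sketch using the question's example data. The names res and set.seed(1) are my own additions for illustration. Note that the grouped .SD query returns a new table, so the temporary COUNT column ends up in both the result and the original DT and has to be dropped from each.

```r
library(data.table)

set.seed(1)  # assumption: only here so the random columns are reproducible
DT <- data.table(SUBJECT = c(rep('John',3), rep('Paul',2),
                             rep('George',3), rep('Ringo',2),
                             rep('John',2), rep('Paul',4),
                             rep('George',2), rep('Ringo',4)),
                 YEAR = c(rep(2011,10), rep(2012,12)),
                 HEIGHT = rnorm(22),
                 WEIGHT = rnorm(22))

# Temporary column, added to DT by reference
DT[, COUNT := .N, by = 'SUBJECT,YEAR']

# New table holding the filtered rows; DT itself is not filtered
res <- DT[, .SD[COUNT == max(COUNT)], by = 'YEAR']

# Drop the temporary column from both the result and the original table
res[, COUNT := NULL]
DT[, COUNT := NULL]

# Rows kept per subject and year
res[, .N, by = .(YEAR, SUBJECT)]
```

With this data, John and George are kept for 2011 and Paul and Ringo for 2012.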

We do discourage .SD for speed reasons, though. Hopefully we'll get to this feature request soon so that advice can be dropped: FR#2330 Optimize .SD[i] query to keep the elegance but make it faster unchanged.
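
In the meantime, one commonly used way to sidestep .SD here is to collect row numbers with .I per group and subset once. This is a different technique from the one above, not something the feature request provides, and the names keep, res_I and res_SD are mine. A rough sketch on the question's data:

```r
library(data.table)

set.seed(1)  # assumption: only here so the random columns are reproducible
DT <- data.table(SUBJECT = c(rep('John',3), rep('Paul',2),
                             rep('George',3), rep('Ringo',2),
                             rep('John',2), rep('Paul',4),
                             rep('George',2), rep('Ringo',4)),
                 YEAR = c(rep(2011,10), rep(2012,12)),
                 HEIGHT = rnorm(22),
                 WEIGHT = rnorm(22))

DT[, COUNT := .N, by = .(SUBJECT, YEAR)]

# .I holds the row numbers of DT for the current group, so this collects
# the positions of the rows whose count equals the yearly maximum
keep <- DT[, .I[COUNT == max(COUNT)], by = YEAR]$V1

# A single subset of the big table by row number, no .SD materialised per group
res_I <- DT[keep]

# The .SD version, for comparison only
res_SD <- DT[, .SD[COUNT == max(COUNT)], by = YEAR]

DT[, COUNT := NULL]
```

Both results should contain the same rows; the .I version keeps DT's original column order, whereas the .SD version puts YEAR first. Whether it is actually faster depends on your data and data.table version, so time it on your own table.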

A different approach is as follows. It's faster and idiomatic but may be considered less elegant.

# Create a small aggregate table first. No need to use := on the big table.
i = DT[, .N, by='SUBJECT,YEAR']

# Find the even smaller subset. (Do as much as we can on the small aggregate.)
i = i[, .SD[N==max(N)], by=YEAR]

# Finally join the small subset of key values to the big table
setkey(DT, YEAR, SUBJECT)
DT[i]
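
If you would rather not re-key (and so re-sort) the big table, the same join can be written with the on= argument, assuming a data.table version recent enough to support it. A sketch on the question's data; agg, top and res are illustrative names of mine:

```r
library(data.table)

set.seed(1)  # assumption: only here so the random columns are reproducible
DT <- data.table(SUBJECT = c(rep('John',3), rep('Paul',2),
                             rep('George',3), rep('Ringo',2),
                             rep('John',2), rep('Paul',4),
                             rep('George',2), rep('Ringo',4)),
                 YEAR = c(rep(2011,10), rep(2012,12)),
                 HEIGHT = rnorm(22),
                 WEIGHT = rnorm(22))

# Small aggregate: one row per subject and year
agg <- DT[, .N, by = .(SUBJECT, YEAR)]

# Even smaller: only the subjects with the most rows in each year
top <- agg[, .SD[N == max(N)], by = YEAR]

# Ad hoc join on the named columns; DT is neither keyed nor reordered
res <- DT[top, on = .(YEAR, SUBJECT)]

# The join carries N over from the small table; drop it if it is not wanted
res[, N := NULL]
```

Using .SD on the small aggregate is cheap, so the speed concern above applies mainly to the big table.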

Something similar is here.
