How do I get the length of the current group when grouping in data.table?

Date: 2021-04-02 07:36:34

I know this can be achieved with other packages, but I'm trying to do it in data.table (as it seems to be the fastest for grouping).

library(data.table)
dt = data.table(a=c(1,2,2,3))
dt[,length(a),by=a]

results in

   a V1
1: 1  1
2: 2  1
3: 3  1

whereas

library(plyr)  # provides ddply() and .()
df = data.frame(a=c(1,2,2,3))
ddply(df,.(a),summarise,V1=length(a))

produces

  a V1
1 1  1
2 2  2
3 3  1

which is a more sensible result. I'm wondering why data.table does not give the same result, and how it can be achieved.

1 Answer

#1


The data.table way to do this is to use the special variable .N, which holds the number of rows in the current group (other special variables include .SD and .BY, in version 1.8.2, and .I and .GRP, available from version 1.8.3; all are documented in ?data.table):

library(data.table)
dt = data.table(a=c(1,2,2,3))

dt[, .N, by = a]
#    a N
# 1: 1 1
# 2: 2 2
# 3: 3 1
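
If the group size is wanted as a column on the original table rather than as a summary, .N can be combined with := in the same way. This is only a minimal sketch; the column name n is an assumption chosen for illustration:

```r
library(data.table)

dt = data.table(a = c(1, 2, 2, 3))

# Add a column (hypothetical name "n") holding the size of each row's group.
# := modifies dt by reference, so no reassignment is needed.
dt[, n := .N, by = a]

# dt$n is now 1, 2, 2, 1: both rows with a == 2 carry a group size of 2.
dt$n
```

Unlike dt[, .N, by = a], which returns one row per group, this keeps all four rows of dt.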

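Your attempt returned 1 for every group because, inside j, a grouping column is a length-1 vector holding the current group's value, so length(a) is always 1. To see this for yourself, run the following and check the values of a and length(a) at each browser prompt: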

dt[, browser(), by = a]
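
The same check can be made without the interactive browser. The sketch below assumes an extra, non-grouping column b and the output names len_a, len_b and n, none of which appear in the original question; they are there only to contrast a grouping column with an ordinary one:

```r
library(data.table)

# b is a hypothetical second column that is NOT used for grouping
dt2 = data.table(a = c(1, 2, 2, 3), b = 1:4)

res = dt2[, .(len_a = length(a),  # grouping column: one value per group
              len_b = length(b),  # ordinary column: one value per row in the group
              n     = .N),        # number of rows in the group
          by = a]

# len_a is 1 for every group, while len_b and n are 1, 2, 1
res
```

So length() does give the group size when applied to a column that is not in by, but .N is the more direct way to ask for it.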
