I have no experience with data.table, so I don't know if there is a solution to my question (30 minutes on Google gave no answer at least), but here it goes.
With data.frame I often use the following command to check the number of observations of a unique value:
df$Obs=with(df, ave(v1, ID-Date, FUN=function(x) length(unique(x))))
Is there any corresponding method when working with data.table?
1 solution
Yes, there is. Happily, you've asked about one of the newest features of data.table, added in v1.8.2:
:= by group is now implemented (FR#1491) and sub-assigning to a new column by reference now adds the column automatically (initialized with NA where the sub-assign doesn't touch) (FR#1997). := by group can be combined with all types of i, so := by group includes grouping by i as well as by by. Since := by group is by reference, it should be significantly faster than any method that (directly or indirectly) cbinds the grouped results to DT, since no copy of the (large) DT is made at all. It's a short and natural syntax that can be compounded with other queries.
DT[,newcol:=sum(colB),by=colA]
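To make the two behaviours in that note concrete, here is a minimal sketch. The toy table and the flag column are made up for illustration; colA, colB and newcol simply mirror the names used in the quoted example.

```r
library(data.table)

# Hypothetical toy table; colA and colB mirror the names in the quoted example.
DT <- data.table(colA = c("a", "a", "b", "b", "b"),
                 colB = c(1, 2, 3, 4, 5))

# := by group: newcol is added in place, and each row receives
# the sum of colB for its own colA group.
DT[, newcol := sum(colB), by = colA]

# Sub-assigning to a new column: only the rows selected by i are written,
# and the rows that are not touched are initialised with NA.
DT[colA == "b", flag := TRUE]

print(DT)
```

Neither call assigns the result back to DT, because the table is modified by reference rather than copied.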
In your example, if I understand correctly, it should be something like:
DT[, Obs:=.N, by=ID-Date]
instead of:
df$Obs=with(df, ave(v1, ID-Date, FUN=function(x) length(unique(x))))
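One thing worth checking before swapping one for the other: .N is the number of rows in each group, whereas length(unique(x)) inside ave() counts the distinct values of v1. The two agree only when v1 has no repeats within a group. The sketch below shows both side by side. It assumes ID-Date stands for a single grouping column, renamed ID_Date here so that it is a syntactic R name; the data and the ObsUnique column are invented for illustration.

```r
library(data.table)

# Hypothetical data: ID_Date stands in for the question's grouping column.
DT <- data.table(ID_Date = c("x", "x", "x", "y", "y"),
                 v1      = c(10, 10, 20, 30, 30))

# Number of rows in each group, as in the suggestion above.
DT[, Obs := .N, by = ID_Date]

# Number of distinct v1 values in each group, which is what
# ave(..., FUN = function(x) length(unique(x))) computes.
DT[, ObsUnique := length(unique(v1)), by = ID_Date]

print(DT)
```

If the distinct count is what you are after, the second form is probably the closer translation of the data.frame command.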
Note that := by group scales well for large data sets (and for smaller data sets with a lot of small groups).
See ?":=" and search the data.table tag for "reference".