data.table - select the first n rows within each group [duplicate]

Time: 2022-12-13 09:08:01

This question already has an answer here:

As simple as it is, I don't know a data.table solution to select the first n rows in groups in a data table. Can you please help me out?

2 answers

#1


31  

As an alternative:

dt[, .SD[1:3], cyl]
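
One caveat worth checking before relying on .SD[1:3]: as far as I know, when a group has fewer than three rows, indexing past the end of the group pads the result with a row of NA values, whereas head() simply returns the rows that exist. A minimal sketch, using a hypothetical toy table called small (it is not part of the original post):

```r
library(data.table)

# Hypothetical toy table: group "a" has 4 rows, group "b" has only 2
small <- data.table(g = c("a", "a", "a", "a", "b", "b"), x = 1:6)

# .SD[1:3] asks for rows 1..3 in every group, so group "b" gets a padded NA row
padded <- small[, .SD[1:3], by = g]

# Capping the index at the group size (.N) avoids the padding
capped <- small[, .SD[seq_len(min(.N, 3L))], by = g]

# head() never returns more rows than the group contains
viaHead <- small[, head(.SD, 3), by = g]
```

If every group is known to have at least n rows, the shorter .SD[1:3] form is fine as written.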

When you look at speed on the example dataset, the head method is on par with the .I method of @eddi. Comparing with the microbenchmark package:

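library(data.table)
library(microbenchmark)
dt <- data.table(mtcars)  # example dataset, as in the other answer
microbenchmark(head = dt[, head(.SD, 3), cyl],
               SD = dt[, .SD[1:3], cyl],
               I = dt[dt[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")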

results in:

Unit: relative
 expr      min       lq     mean   median       uq       max neval cld
 head 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000    10  a 
   SD 2.156562 2.319538 2.306065 2.365190 2.318540 2.1908401    10   b
    I 1.001810 1.029511 1.007371 1.018514 1.016583 0.9442973    10  a 

However, data.table is specifically designed for large datasets, so it makes sense to run the comparison again on a large one:

# creating a dataset of 30 million rows (3 cyl groups x 1e7 sampled rows each)
largeDT <- dt[,.SD[sample(.N, 1e7, replace = TRUE)], cyl]
# running the benchmark on the large dataset
microbenchmark(head = largeDT[, head(.SD, 3), cyl],
               SD = largeDT[, .SD[1:3], cyl], 
               I = largeDT[largeDT[, .I[1:3], cyl]$V1],
               times = 10, unit = "relative")

results in:

Unit: relative
 expr      min       lq     mean   median       uq     max neval cld
 head 2.279753 2.194702 2.221330 2.177774 2.276986 2.33876    10   b
   SD 2.060959 2.187486 2.312009 2.236548 2.568240 2.55462    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.00000    10  a 

Now the .I method is clearly the fastest one.
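
The .I method is easier to follow when split into its two steps: the grouped call returns only row numbers, and the full rows are then fetched with a single subset, which is presumably why it scales well. A minimal sketch, assuming the same dt built from mtcars; the names idx, firstThree and viaHead are illustrative only:

```r
library(data.table)

dt <- data.table(mtcars)

# Step 1: .I holds the row numbers of dt that belong to the current group,
# so .I[1:3] gives the positions of the first three rows of each cyl group
idx <- dt[, .I[1:3], by = cyl]

# Step 2: the unnamed result column is called V1; use it to subset dt once
firstThree <- dt[idx$V1]

# The head() approach selects the same rows, but moves cyl to the front
viaHead <- dt[, head(.SD, 3), by = cyl]
```

Note that firstThree keeps the columns in their original order, while the grouped head() and .SD results start with the grouping column.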


Update 2016-02-12:

With the most recent development version of the data.table package, the .I method still wins. Whether the .SD method or the head() method is faster seems to depend on the size of the dataset. Now the benchmark gives:

Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 2.093240 3.166974 3.473216 3.771612 4.136458 3.052213    10   b
   SD 1.840916 1.939864 2.658159 2.786055 3.112038 3.411113    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a 

However, with a somewhat smaller (but still quite large) dataset, the picture changes:

largeDT2 <- dt[,.SD[sample(.N, 1e6, replace = TRUE)], cyl]

The benchmark is now slightly in favor of the head method over the .SD method:

Unit: relative
 expr      min       lq     mean   median       uq      max neval cld
 head 1.808732 1.917790 2.087754 1.902117 2.340030 2.441812    10   b
   SD 1.923151 1.937828 2.150168 2.040428 2.413649 2.436297    10   b
    I 1.000000 1.000000 1.000000 1.000000 1.000000 1.000000    10  a 

#2


6  

We can use head() on .SD:

library(data.table)

dt <- data.table(mtcars)

> dt[, head(.SD, 3), by = "cyl"]

   cyl  mpg  disp  hp drat    wt  qsec vs am gear carb
1:   6 21.0 160.0 110 3.90 2.620 16.46  0  1    4    4
2:   6 21.0 160.0 110 3.90 2.875 17.02  0  1    4    4
3:   6 21.4 258.0 110 3.08 3.215 19.44  1  0    3    1
4:   4 22.8 108.0  93 3.85 2.320 18.61  1  1    4    1
5:   4 24.4 146.7  62 3.69 3.190 20.00  1  0    4    2
6:   4 22.8 140.8  95 3.92 3.150 22.90  1  0    4    2
7:   8 18.7 360.0 175 3.15 3.440 17.02  0  0    3    2
8:   8 14.3 360.0 245 3.21 3.570 15.84  0  0    3    4
9:   8 16.4 275.8 180 3.07 4.070 17.40  0  0    3    3
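
Since the question asks for the first n rows in general, the same call can be wrapped so that n and the grouping column are parameters. A minimal sketch; firstN is a hypothetical helper written for illustration, not a data.table function:

```r
library(data.table)

dt <- data.table(mtcars)

# Hypothetical helper: first n rows of tbl within each group.
# groupCols is a character vector of column names passed on to by
firstN <- function(tbl, n, groupCols) {
  tbl[, head(.SD, n), by = groupCols]
}

# First two rows per cyl group, with groups in order of first appearance
res <- firstN(dt, 2, "cyl")
```

This relies on tbl having no column named n or groupCols, since column names would take precedence inside the data.table call.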
