如何提取每组前n行?

时间:2021-07-10 09:08:48

I have a data.table dt. This data.table is sorted first by column date (my grouping variable), then by column age:

我有一个data.table dt。此data.table首先按列日期(我的分组变量)排序,然后按列年龄排序:

library(data.table)
setkeyv(dt, c("date", "age")) # Sorts table first by column "date" then by "age"
> dt
         date age     name
1: 2000-01-01   3   Andrew
2: 2000-01-01   4      Ben
3: 2000-01-01   5  Charlie
4: 2000-01-02   6     Adam
5: 2000-01-02   7      Bob
6: 2000-01-02   8 Campbell

My question is: I am wondering if it's possible to extract the first 2 rows for each unique date? Or phrased more generally:

我的问题是:我想知道是否可以为每个唯一日期提取前两行?或者更一般地说:

How to extract the first n rows within each group?

如何提取每组中的前n行?

In this example, the result in dt.f would be:

在这个例子中,dt.f中的结果是:

> dt.f = ???????? # function of dt to extract the first 2 rows per unique date
> dt.f
         date age   name
1: 2000-01-01   3 Andrew
2: 2000-01-01   4    Ben
3: 2000-01-02   6   Adam
4: 2000-01-02   7    Bob

p.s. Here is the code to create the aforementioned data.table:

附:这是创建上述data.table的代码:

install.packages("data.table")
library(data.table)
date <- c("2000-01-01","2000-01-01","2000-01-01",
    "2000-01-02","2000-01-02","2000-01-02")
age <- c(3,4,5,6,7,8)
name <- c("Andrew","Ben","Charlie","Adam","Bob","Campbell")
dt <- data.table(date, age, name)
setkeyv(dt,c("date","age")) # Sorts table first by column "date" then by "age"

2 个解决方案

#1


35  

yep, just use .SD and index it as needed.

是的,只需使用.SD并根据需要对其进行索引。

  DT[, .SD[1:2], by=date]

           date age   name
  1: 2000-01-01   3 Andrew
  2: 2000-01-01   4    Ben
  3: 2000-01-02   6   Adam
  4: 2000-01-02   7    Bob

Edited as per @eddi's suggestion.

@eddi's suggestion is spot on:

@ eddi的建议是:

Use this instead, for speed:

使用它代替速度:

  DT[DT[, .I[1:2], by = date]$V1]

  # using a slightly larger data set
  > microbenchmark(SDstyle=DT[, .SD[1:2], by=date], IStyle=DT[DT[, .I[1:2], by = date]$V1], times=200L)
  Unit: milliseconds
      expr       min        lq    median        uq      max neval
   SDstyle 13.567070 16.224797 22.170302 24.239881 88.26719   200
    IStyle  1.675185  2.018773  2.168818  2.269292 11.31072   200

#2


0  

Probably not the fastest method, but it provides some flexibility if you don't use keyed variables and need some more flexibility. By changing the selected Row.ID the number of first objects can be adjusted as needed.

可能不是最快的方法,但如果您不使用键控变量并需要更多灵活性,它可以提供一些灵活性。通过更改选定的Row.ID,可以根据需要调整第一个对象的数量。

dt[, .( age
        , name
        , Row.ID = rank(age)
        )
   , by = list(date)][Row.ID %in% (1:2), .(date
                                           , age
                                           , name
                                           )]

#1


35  

yep, just use .SD and index it as needed.

是的,只需使用.SD并根据需要对其进行索引。

  DT[, .SD[1:2], by=date]

           date age   name
  1: 2000-01-01   3 Andrew
  2: 2000-01-01   4    Ben
  3: 2000-01-02   6   Adam
  4: 2000-01-02   7    Bob

Edited as per @eddi's suggestion.

@eddi's suggestion is spot on:

@ eddi的建议是:

Use this instead, for speed:

使用它代替速度:

  DT[DT[, .I[1:2], by = date]$V1]

  # using a slightly larger data set
  > microbenchmark(SDstyle=DT[, .SD[1:2], by=date], IStyle=DT[DT[, .I[1:2], by = date]$V1], times=200L)
  Unit: milliseconds
      expr       min        lq    median        uq      max neval
   SDstyle 13.567070 16.224797 22.170302 24.239881 88.26719   200
    IStyle  1.675185  2.018773  2.168818  2.269292 11.31072   200

#2


0  

Probably not the fastest method, but it provides some flexibility if you don't use keyed variables and need some more flexibility. By changing the selected Row.ID the number of first objects can be adjusted as needed.

可能不是最快的方法,但如果您不使用键控变量并需要更多灵活性,它可以提供一些灵活性。通过更改选定的Row.ID,可以根据需要调整第一个对象的数量。

dt[, .( age
        , name
        , Row.ID = rank(age)
        )
   , by = list(date)][Row.ID %in% (1:2), .(date
                                           , age
                                           , name
                                           )]