R: Lagging a variable by group produces different lag values within a group

Time: 2022-06-14 16:21:49

I have data grouped by an id variable, with multiple distinct observations per quarter and a different group size for each id:

library(dplyr)
library(data.table)
library(lubridate)

v2 <- sample(1:100, 15)
df <- data.frame(qy = c(rep('2016-01-01', 5), rep('2016-04-01', 5), rep('2016-10-01', 5)),
                 id = c(rep(c('a','a','b','b','c'), 3)),
                 value_t = c(0,0,1,1,0,1,1,0,0,0,0,0,1,1,1),
                 value2_t = c(v2))
df$qy <- ymd(df$qy)
df <- df %>% arrange(id, qy)
> df
   qy          id      value_t value2_t
1  2016-01-01  a       0       49
2  2016-01-01  a       0        4
3  2016-01-01  b       1        5
4  2016-01-01  b       1       48
5  2016-01-01  c       0       32
6  2016-04-01  a       1       81
7  2016-04-01  a       1        6
8  2016-04-01  b       0       71
9  2016-04-01  b       0       47
10 2016-04-01  c       0       78
11 2016-10-01  a       0       31
12 2016-10-01  a       0       10
13 2016-10-01  b       1       37
14 2016-10-01  b       1       63
15 2016-10-01  c       1       36

I tried to create two lag variables, grouped by id, with lags of t-1 and t-2 respectively:

setDT(df)[order(qy), paste0('value_t', 1:2) := shift(value_t, 1:2) , by = id]

Although I've grouped by id, the result is not what I want: the lag variables are just row-by-row rolling lags within each id group, so a row can pick up the value of another observation from the same quarter:

> df
   qy          id      value_t value2_t value_t1 value_t2
1: 2016-01-01  a       0       49       NA       NA
2: 2016-01-01  a       0        4        0       NA
3: 2016-04-01  a       1       81        0        0
4: 2016-04-01  a       1        6        1        0
5: 2016-10-01  a       0       31        1        1
6: 2016-10-01  a       0       10        0        1
7: 2016-01-01  b       1        5       NA       NA
8: 2016-01-01  b       1       48        1       NA
9: 2016-04-01  b       0       71        1        1
10: 2016-04-01  b       0       47        0        1
11: 2016-10-01  b       1       37        0        0
12: 2016-10-01  b       1       63        1        0
13: 2016-01-01  c       0       32       NA       NA
14: 2016-04-01  c       0       78        0       NA
15: 2016-10-01  c       1       36        0        0

I would like the lag variables to lag by quarter within each id, even though there are multiple observations per quarter, as follows:

> df
   qy          id      value_t value2_t value_t1 value_t2
1  2016-01-01  a       0       49       NA       NA
2  2016-01-01  a       0        4       NA       NA
3  2016-04-01  a       1       81        0       NA
4  2016-04-01  a       1        6        0       NA
5  2016-10-01  a       0       31        1        0
6  2016-10-01  a       0       10        1        0
7  2016-01-01  b       1        5       NA       NA
8  2016-01-01  b       1       48       NA       NA
9  2016-04-01  b       0       71        1       NA
10 2016-04-01  b       0       47        1       NA
11 2016-10-01  b       1       37        0        1
12 2016-10-01  b       1       63        0        1
13 2016-01-01  c       0       32       NA       NA
14 2016-04-01  c       0       78        0       NA
15 2016-10-01  c       1       36        0        0

Any suggestions, in data.table or dplyr in particular, would be greatly appreciated!

Update: Thanks all for your comments. I believe David A. is correct that the main issue is the varying id group size, and I've updated the question to highlight this.

2 solutions

#1

We can create a subset of the data frame with one row per unique combination of qy and id, create the lag columns value_t1 and value_t2 on that subset, and then join the result back to the original data frame.

library(dplyr)
library(data.table)
library(lubridate)

# Create example data frame
set.seed(123)

v2 <- sample(1:100, 15)
df <- data.frame(qy = c(rep('2016-01-01', 5), rep('2016-04-01', 5), rep('2016-10-01', 5)),
                 id = c(rep(c('a','a','b','b','c'), 3)),
                 value_t = c(0,0,1,1,0,1,1,0,0,0,0,0,1,1,1),
                 value2_t = c(v2))
df$qy <- ymd(df$qy)
df <- df %>% arrange(id, qy)

# Process the data
df2 <- df %>%
  distinct(id, qy, .keep_all = TRUE) %>%
  group_by(id) %>%
  mutate(value_t1 = lag(value_t, n = 1L),
         value_t2 = lag(value_t, n = 2L)) %>%
  select(-value_t, -value2_t) %>%
  ungroup() %>%
  left_join(df, ., by = c("qy", "id")) 

df2
#            qy id value_t value2_t value_t1 value_t2
# 1  2016-01-01  a       0       29       NA       NA
# 2  2016-01-01  a       0       79       NA       NA
# 3  2016-04-01  a       1        5        0       NA
# 4  2016-04-01  a       1       50        0       NA
# 5  2016-10-01  a       0       87        1        0
# 6  2016-10-01  a       0       98        1        0
# 7  2016-01-01  b       1       41       NA       NA
# 8  2016-01-01  b       1       86       NA       NA
# 9  2016-04-01  b       0       83        1       NA
# 10 2016-04-01  b       0       51        1       NA
# 11 2016-10-01  b       1       60        0        1
# 12 2016-10-01  b       1       94        0        1
# 13 2016-01-01  c       0       91       NA       NA
# 14 2016-04-01  c       0       42        0       NA
# 15 2016-10-01  c       1        9        0        0
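
Since the question also asks about data.table, the same subset-lag-join idea can be sketched there as well. This is only a sketch under assumptions: it rebuilds the example data without value2_t, the names DT and per_period are introduced here for illustration, and it assumes value_t is constant within each id and quarter, as it is in the example.

```r
library(data.table)

# Example data from the question (value2_t omitted for brevity)
DT <- data.table(
  qy      = as.Date(rep(c("2016-01-01", "2016-04-01", "2016-10-01"), each = 5)),
  id      = rep(c("a", "a", "b", "b", "c"), 3),
  value_t = c(0, 0, 1, 1, 0,  1, 1, 0, 0, 0,  0, 0, 1, 1, 1)
)

# One row per (id, qy), ordered by time within id.
# Assumption: value_t does not vary within an id/quarter, so the first row is enough.
per_period <- unique(DT, by = c("id", "qy"))[order(id, qy)]

# Lag across quarters within each id
per_period[, paste0("value_t", 1:2) := shift(value_t, 1:2), by = id]

# Update join: copy the lag columns back onto every original row
DT[per_period, on = .(id, qy),
   `:=`(value_t1 = i.value_t1, value_t2 = i.value_t2)]

# Sort for display, matching the layout used in the question
setorder(DT, id, qy)
DT
```

On the example data this gives the value_t1 and value_t2 columns requested in the question. The update join modifies DT by reference, so no copy of the full table is made.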

#2

You can write your own time_lag function using rle (Run Length Encoding) and apply it to the columns:

library(dplyr)

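time_lag = function(x, time_var, k = 1){
  # Lengths of the consecutive runs of equal time values (time_var must be sorted)
  run_lengths = rle(as.character(time_var))$lengths
  # Number of rows to shift by: total size of the first k periods.
  # head() avoids an NA (and an error in rep) when k exceeds the number of periods.
  shift_N = sum(head(run_lengths, k))
  # Pad the front with NAs and drop the same number of elements from the end
  return(c(rep(NA, shift_N), head(x, length(x) - shift_N)))
}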

df %>%
  group_by(id) %>%
  mutate(value_t1 = time_lag(value_t, qy),
         value_t2 = time_lag(value_t, qy, 2),
         value_t3 = time_lag(value_t, qy, 3))

Result:

# A tibble: 15 x 7
# Groups:   id [3]
           qy     id value_t value2_t value_t1 value_t2 value_t3
       <date> <fctr>   <dbl>    <int>    <dbl>    <dbl>    <dbl>
 1 2016-01-01      a       0        7       NA       NA       NA
 2 2016-01-01      a       0       25       NA       NA       NA
 3 2016-04-01      a       1      100        0       NA       NA
 4 2016-04-01      a       1       20        0       NA       NA
 5 2016-10-01      a       0        1        1        0       NA
 6 2016-10-01      a       0       59        1        0       NA
 7 2016-01-01      b       1       76       NA       NA       NA
 8 2016-01-01      b       1       73       NA       NA       NA
 9 2016-04-01      b       0       69        1       NA       NA
10 2016-04-01      b       0       86        1       NA       NA
11 2016-10-01      b       1       85        0        1       NA
12 2016-10-01      b       1       40        0        1       NA
13 2016-01-01      c       0       49       NA       NA       NA
14 2016-04-01      c       0       82        0       NA       NA
15 2016-10-01      c       1       43        0        0       NA

Notes:

  • time_lag assumes that time_var is sorted and that k >= 0

  • time_lag first calculates the rle of time_var and adds up the lengths of the first k runs of equal time values. Let's call this sum shift_N

  • It then prepends shift_N NAs to the vector x and removes shift_N elements from its end

  • rle only accepts a plain atomic vector and rejects a Date, hence the as.character

  • Inside mutate() on a data frame grouped with dplyr::group_by, custom functions are applied per group, so no extra work is needed there
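
One caveat worth spelling out: shifting by a fixed number of rows lines values up by position, so it gives the desired result only when every quarter has the same number of rows within an id, as in the example data. Below is a minimal sketch of a variant that looks values up by period index instead. The function name period_lag and the toy data are hypothetical, introduced here for illustration. It still assumes rows are sorted by time within each id and that the value is constant within an id and quarter (it takes the first value of each period).

```r
library(dplyr)

# Hypothetical helper: lag x by k observed periods rather than by k rows
period_lag <- function(x, time_var, k = 1) {
  periods   <- unique(time_var)              # distinct periods, in order of appearance
  period_id <- match(time_var, periods)      # period index of every row
  first_val <- x[match(periods, time_var)]   # first value of x in each period
  target    <- period_id - k                 # period to read the lagged value from
  target[target < 1] <- NA                   # no earlier period available
  first_val[target]
}

# Toy data where id "a" has 1 row in the first quarter and 3 in the second
toy <- data.frame(
  id      = c("a", "a", "a", "a", "b", "b"),
  qy      = as.Date(c("2016-01-01", "2016-04-01", "2016-04-01", "2016-04-01",
                      "2016-01-01", "2016-04-01")),
  value_t = c(0, 1, 1, 1, 1, 0)
)

res <- toy %>%
  group_by(id) %>%
  mutate(value_t1 = period_lag(value_t, qy),
         value_t2 = period_lag(value_t, qy, 2)) %>%
  ungroup()

res
```

Here all three second-quarter rows of id "a" receive the first-quarter value 0 in value_t1, whereas a fixed one-row shift would give the last two of them a value from their own quarter. Like the answers above, this lags by observed periods, not by calendar quarters, so a missing quarter is not treated as a gap.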
