Group lag/lead in R and dplyr

Time: 2022-06-14 16:22:01

I'm having trouble lagging Date when the data is grouped by Team.

Data:

 df <- data.frame(Team = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D", "D"),
             Date = c("2016-05-10","2016-05-10", "2016-05-10", "2016-05-10",
                      "2016-05-12", "2016-05-12", "2016-05-12",
                      "2016-05-15","2016-05-15",
                      "2016-05-30", "2016-05-30"), 
             Points = c(1,4,3,2,1,5,6,1,2,3,9)
             )

Team      Date       Points
 A     2016-05-10      1
 A     2016-05-10      4
 A     2016-05-10      3
 A     2016-05-10      2
 B     2016-05-12      1
 B     2016-05-12      5
 B     2016-05-12      6
 C     2016-05-15      1
 C     2016-05-15      2
 D     2016-05-30      3
 D     2016-05-30      9

Expected result:

Team      Date       Points   Date_Lagged
 A     2016-05-10      1          NA
 A     2016-05-10      4          NA
 A     2016-05-10      3          NA
 A     2016-05-10      2          NA
 B     2016-05-12      1      2016-05-10 
 B     2016-05-12      5      2016-05-10 
 B     2016-05-12      6      2016-05-10 
 C     2016-05-15      1      2016-05-12
 C     2016-05-15      2      2016-05-12
 D     2016-05-30      3      2016-05-15
 D     2016-05-30      9      2016-05-15

I'm scratching my head after realising that the following isn't the correct solution:

df %>% group_by(Date) %>% mutate(Date_lagged = lag(Date))  

Any idea how to fix it?

2 solutions

#1


7  

By default, lag() offsets by n = 1. However, 'Team' and 'Date' contain duplicate elements, so lagging row by row mostly returns the same date. To get the expected output, we need to take the distinct rows of 'Team' and 'Date', create 'Date_Lagged' as the lag of 'Date', and then right_join (or left_join) the result with the original dataset.

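library(dplyr)

distinct(df, Team, Date) %>%
        mutate(Date_Lagged = lag(Date)) %>%
        # join back on both keys so every original row gets its lagged date
        right_join(df, by = c("Team", "Date")) %>%
        select(Team, Date, Points, Date_Lagged)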
#   Team       Date Points Date_Lagged
#1     A 2016-05-10      1        <NA>
#2     A 2016-05-10      4        <NA>
#3     A 2016-05-10      3        <NA>
#4     A 2016-05-10      2        <NA>
#5     B 2016-05-12      1  2016-05-10
#6     B 2016-05-12      5  2016-05-10
#7     B 2016-05-12      6  2016-05-10
#8     C 2016-05-15      1  2016-05-12
#9     C 2016-05-15      2  2016-05-12
#10    D 2016-05-30      3  2016-05-15
#11    D 2016-05-30      9  2016-05-15
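
To make the role of distinct() more concrete, here is a minimal sketch that prints the intermediate lookup table and contrasts it with the grouped attempt from the question. It assumes dplyr is installed and that Date is stored as character (stringsAsFactors = FALSE); the names lookup and wrong are only illustrative.

```r
library(dplyr)

df <- data.frame(
  Team = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D", "D"),
  Date = c("2016-05-10", "2016-05-10", "2016-05-10", "2016-05-10",
           "2016-05-12", "2016-05-12", "2016-05-12",
           "2016-05-15", "2016-05-15",
           "2016-05-30", "2016-05-30"),
  Points = c(1, 4, 3, 2, 1, 5, 6, 1, 2, 3, 9),
  stringsAsFactors = FALSE  # assumption: keep Date as plain character
)

# One row per Team/Date pair, so lag() steps back to the previous team's date
lookup <- distinct(df, Team, Date) %>%
  mutate(Date_Lagged = lag(Date))
lookup
#   Team       Date Date_Lagged
# 1    A 2016-05-10        <NA>
# 2    B 2016-05-12  2016-05-10
# 3    C 2016-05-15  2016-05-12
# 4    D 2016-05-30  2016-05-15

# The grouped attempt lags inside each Date group, where all values are equal,
# so each row gets either NA (first row of the group) or its own date
wrong <- df %>%
  group_by(Date) %>%
  mutate(Date_lagged = lag(Date)) %>%
  ungroup()
```

Joining lookup back to df on Team and Date then repeats each lagged date once per original row.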

Or we can also do:

df %>% 
    mutate(Date_Lagged = rep(lag(unique(Date)), table(Date)))
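
Note that this one-liner relies on the rows already being sorted by Date, with each date in one contiguous block, because rep() simply lays the lagged values out in order. If that cannot be guaranteed, a possible variant is to look each row up with match() instead. This is only a sketch: the names df_shuffled, dates and prev_dates are made up for illustration, and it assumes ISO-formatted character dates, which sort chronologically.

```r
library(dplyr)

df <- data.frame(
  Team = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D", "D"),
  Date = c("2016-05-10", "2016-05-10", "2016-05-10", "2016-05-10",
           "2016-05-12", "2016-05-12", "2016-05-12",
           "2016-05-15", "2016-05-15",
           "2016-05-30", "2016-05-30"),
  Points = c(1, 4, 3, 2, 1, 5, 6, 1, 2, 3, 9),
  stringsAsFactors = FALSE  # assumption: keep Date as plain character
)

# Hypothetical shuffled copy, to show that row order no longer matters
df_shuffled <- df[c(11, 1, 5, 8, 2, 6, 9, 3, 7, 10, 4), ]

dates <- sort(unique(df_shuffled$Date))  # distinct dates in ascending order
prev_dates <- lag(dates)                 # previous distinct date, NA for the first

# match() finds each row's position in `dates`; index prev_dates with it
res <- df_shuffled %>%
  mutate(Date_Lagged = prev_dates[match(Date, dates)])
```

On the sorted df from the question this should give the same Date_Lagged column as the rep() version.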

#2


3  

You can do this with base R too, for example using rle:

with(rle(as.character(df$Date)), rep(c(NA, head(values, -1)), lengths))
# [1] NA           NA           NA           NA           "2016-05-10" "2016-05-10"
# [7] "2016-05-10" "2016-05-12" "2016-05-12" "2016-05-15" "2016-05-15"
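
As a small follow-up sketch, the same rle() idea can be stored as a column and converted to Date class. Like the call above, it assumes rows with the same date are contiguous, since rle() only detects consecutive runs; the variable name runs is illustrative.

```r
df <- data.frame(
  Team = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D", "D"),
  Date = c("2016-05-10", "2016-05-10", "2016-05-10", "2016-05-10",
           "2016-05-12", "2016-05-12", "2016-05-12",
           "2016-05-15", "2016-05-15",
           "2016-05-30", "2016-05-30"),
  Points = c(1, 4, 3, 2, 1, 5, 6, 1, 2, 3, 9),
  stringsAsFactors = FALSE  # assumption: keep Date as plain character
)

# Run-length encode the dates: one value and one length per contiguous block
runs <- rle(as.character(df$Date))

# Shift the run values down by one (NA for the first run), expand each value
# back to its run length, and convert the result to Date class
df$Date_Lagged <- as.Date(rep(c(NA, head(runs$values, -1)), runs$lengths))
```

Storing the result as Date rather than character makes later date arithmetic possible, for example as.Date(df$Date) - df$Date_Lagged for the gap in days.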
