
Time: 2022-09-08 16:17:45

I have a dataframe with 3 fields (ID, Date, Value), and for each ID I want the row with the minimum date among all rows whose date is later than the latest date with Value == 0 or Value == 2. I don't know if that is clear, but I think the example below will make it so.

To be more precise my dataframe looks like:


ID       Date     Value
1     19960708      3
1     19960901      2
1     19960916      3
1     19970901      0
1     19971001      3
1     19971231      3
1     19980101      3
2     19900806      3
2     19901215      3
2     19910629      0
2     19911007      3
2     19911201      3
2     19990901      3
2     20001001      3

I would like to have as output:


ID     Date       Value
1     19971001      3
2     19911007      3

This output is because, for ID 1, the latest date with Value == 0 is 19970901, so I take the minimum of the dates later than 19970901. The same goes for ID 2.

I'm trying to use dplyr to filter the dataframe, but I don't get the output I want. Below is the code I'm using:


  df %>% 
    group_by(ID) %>%
    filter(DATE > max(df[VALUE==0 | VALUE==2,]$DATE)) %>%
    filter(DATE == min(DATE))

The output is not what I expect: it works for the first ID, but then it uses that same date for the comparisons in the other IDs. The output I get is:


ID     Date       Value
1     19971001      3
2     19990901      3

How can I fix this? Thanks.


3 Answers



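You don't need to reference df again inside the first filter() call. Inside a grouped filter(), bare column names already refer to the current group, whereas df[...] always looks at the whole data frame: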



df %>% 
  group_by(ID) %>%
  filter(Date > max(Date[Value == 0 | Value == 2])) %>%
  filter(Date == min(Date))

# A tibble: 2 x 3
# Groups:   ID [2]
#      ID     Date Value
#   <int>    <int> <int>
# 1     1 19971001     3
# 2     2 19911007     3
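
One thing to watch for: if some ID has no row with Value 0 or 2, max() is called on an empty vector and returns -Inf (with a warning), so the filter above keeps that ID's earliest row instead of dropping it. Here is a minimal sketch of one way to guard against that, assuming dplyr >= 1.0.0 for slice_min(). The toy data and the name `result` are made up for illustration:

```r
library(dplyr)

# Toy data with the same columns as the question; ID 3 has no 0 or 2.
df1 <- data.frame(
  ID    = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  Date  = c(19960901, 19960916, 19970901, 19971001, 19971231,
            19910629, 19911007, 19911201, 19990901, 20001001),
  Value = c(2, 3, 0, 3, 3, 0, 3, 3, 3, 3)
)

result <- df1 %>%
  group_by(ID) %>%
  # drop IDs that never have a 0 or 2, so max() below is never called on an empty vector
  filter(any(Value %in% c(0, 2))) %>%
  # keep rows dated after the latest 0/2 row of the same ID
  filter(Date > max(Date[Value %in% c(0, 2)])) %>%
  # earliest of the remaining rows per ID
  slice_min(Date, n = 1) %>%
  ungroup()
```

With this toy data, `result` should hold one row each for ID 1 and ID 2, and nothing for ID 3.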



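data.table solution:

library(data.table)

fun1 <- function(data, var = "Value", afterMax = c(0, 2), findMin = 3) {
    max_id <- max(which(data[[var]] %in% afterMax))
    tmp    <- which(data[[var]] %in% findMin)
    min_id <- min(tmp[tmp > max_id])
    # NOTE: the original post is cut off after the line above; returning the
    # selected row and the grouped call below are an assumed completion
    data[min_id, ]
}

setDT(df)[, fun1(.SD), by = ID]

#   ID     Date Value
#1:  1 19971001     3
#2:  2 19911007     3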

For a dplyr solution, use:


df %>% group_by(ID) %>% do(fun1(.))



Another possible approach with data.table:



library(data.table)
setDT(df)

# make sure df is sorted in ascending order by ID and Date
setorder(df, ID, Date)

df[df[, .I[max(which(Value==0 | Value==2)) + 1L], by=.(ID)]$V1]


   ID     Date Value
1:  1 19971001     3
2:  2 19911007     3

With the above approach, if an ID has no 0 or 2, a row with NA in every column (including ID) appears for that ID. If you would rather keep the ID and show NA only in the other columns, you can use this:


setDT(df1, keep.rownames=TRUE)[, rn := as.integer(rn)]
setorder(df1, ID, Date)
df1[df1[, .(rn=.I[max(which(Value==0 | Value==2)) + 1L]), by=.(ID)], on=.(ID, rn)][, 
    rn := NULL]


   ID     Date Value
1:  1 19971001     3
2:  2 19911007     3
3:  3       NA    NA

Data:


df <- read.table(text="ID       Date     Value
1     19960708      3
1     19960901      2
1     19960916      3
1     19970901      0
1     19971001      3
1     19971231      3
1     19980101      3
2     19900806      3
2     19901215      3
2     19910629      0
2     19911007      3
2     19911201      3
2     19990901      3
2     20001001      3", header=TRUE)

df1 <- read.table(text="ID       Date     Value
1     19960708      3
1     19960901      2
1     19960916      3
1     19970901      0
1     19971001      3
1     19971231      3
1     19980101      3
2     19900806      3
2     19901215      3
2     19910629      0
2     19911007      3
2     19911201      3
2     19990901      3
2     20001001      3
3     19990901      3
3     20001001      3", header=TRUE)
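
For comparison, the same per-ID logic can be sketched in base R without any packages. This is only an illustration under stated assumptions: the helper name `first_after_marker` and the shortened toy data are hypothetical and not part of the answers above.

```r
# Toy data with the same columns as the question; ID 3 has no 0 or 2.
df1 <- data.frame(
  ID    = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3),
  Date  = c(19960901, 19960916, 19970901, 19971001, 19971231,
            19910629, 19911007, 19911201, 19990901, 20001001),
  Value = c(2, 3, 0, 3, 3, 0, 3, 3, 3, 3)
)

# Hypothetical helper: for the rows of one ID, return the earliest row
# dated after the latest row whose Value is one of `markers`.
first_after_marker <- function(d, markers = c(0, 2)) {
  marker_dates <- d$Date[d$Value %in% markers]
  if (length(marker_dates) == 0) return(d[0, ])  # no 0/2 row: contribute nothing
  later <- d[d$Date > max(marker_dates), ]
  if (nrow(later) == 0) return(later)            # nothing after the marker row
  later[later$Date == min(later$Date), ]
}

# Split by ID, apply the helper to each piece, and stack the results.
result <- do.call(rbind, lapply(split(df1, df1$ID), first_after_marker))
rownames(result) <- NULL
print(result)
```

Unlike the data.table join above, this sketch simply omits IDs that have no 0 or 2 rather than returning an NA row for them; which behaviour is preferable depends on what you need downstream.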


