I have a data frame in R and need to filter within ID groups (not a simple count or sum)

Time: 2022-09-08 16:17:45

I have a data frame with 3 fields (ID, Date, Value). For each ID, I want the row with the earliest date among the rows dated after the latest row where Value == 0 or Value == 2. I don't know if that is clear, but I think the example below will make it so.

To be more precise, my data frame looks like this:

ID       Date     Value
1     19960708      3
1     19960901      2
1     19960916      3
1     19970901      0
1     19971001      3
1     19971231      3
1     19980101      3
2     19900806      3
2     19901215      3
2     19910629      0
2     19911007      3
2     19911201      3
2     19990901      3
2     20001001      3

I would like to get this output:

ID     Date       Value
1     19971001      3
2     19911007      3

This is the expected output because, for ID 1, the latest date with Value == 0 or Value == 2 is 19970901 (the Value == 2 row, 19960901, is earlier). I therefore take the row with the minimum date among the rows later than 19970901. The same applies to ID 2.

I'm trying to use dplyr to filter the data frame, but I don't get the output I want. This is the code I'm using:

  df %>% 
    group_by(ID) %>%
    filter(DATE > max(df[VALUE==0 | VALUE==2,]$DATE)) %>%
    filter(DATE == min(DATE))

The output is not what I expect: it works for the first ID, but then it appears to use that same date for the comparisons in the other IDs. The output I get is:

ID     Date       Value
1     19971001      3
2     19990901      3

How can I fix this? Thanks.

3 solutions

#1


4  

You don't need to refer to df again inside the first filter() call. Use the column names directly so that they are evaluated within each group:

library(dplyr)

df %>% 
  group_by(ID) %>%
  filter(Date > max(Date[Value == 0 | Value == 2])) %>%
  filter(Date == min(Date))

# A tibble: 2 x 3
# Groups:   ID [2]
#      ID     Date Value
#   <int>    <int> <int>
# 1     1 19971001     3
# 2     2 19911007     3
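
To illustrate why the grouped column reference matters, here is a minimal base R sketch (not part of the original answer) that contrasts a single cutoff computed over the whole data frame with one cutoff per ID. The names `flagged`, `global_cutoff` and `per_id_cutoff` are made up for this illustration, and the data is re-created from the question so the snippet runs on its own.

```r
# Question's data, re-created so the snippet is self-contained.
df <- read.table(text = "ID Date Value
1 19960708 3
1 19960901 2
1 19960916 3
1 19970901 0
1 19971001 3
1 19971231 3
1 19980101 3
2 19900806 3
2 19901215 3
2 19910629 0
2 19911007 3
2 19911201 3
2 19990901 3
2 20001001 3", header = TRUE)

# Rows whose Value is 0 or 2 (illustrative helper name).
flagged <- df$Value %in% c(0, 2)

# One cutoff for the whole data frame: the grouping by ID is ignored.
global_cutoff <- max(df$Date[flagged])

# One cutoff per ID, analogous to max(Date[Value == 0 | Value == 2])
# evaluated after group_by(ID).
per_id_cutoff <- tapply(df$Date[flagged], df$ID[flagged], max)

print(global_cutoff)
print(per_id_cutoff)
```

With a single cutoff of 19970901, the earliest later row for ID 2 would be 19990901, which is the unexpected value reported in the question. With the per-ID cutoff of 19910629, it is 19911007.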

#2


0  

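data.table solution (df1 here holds the data from the question):

library(data.table)

# Within one ID's rows (ordered by Date): find the last row whose value is in
# afterMax, then return the first later row whose value is in findMin.
fun1 <- function(data, var = "Value", afterMax = c(0,2), findMin = 3) {
    max_id <- max(which(data[[var]] %in% afterMax))   # last row with a 0 or 2
    tmp    <- which(  data[[var]] %in% findMin )       # candidate rows
    min_id <- min(tmp[tmp>max_id])                     # first candidate after max_id
    return(data[min_id,])
}

setDT(df1)[,fun1(.SD),by="ID"]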

result:

#   ID     Date Value
#1:  1 19971001     3
#2:  2 19911007     3

For a dplyr solution, use:

df1 %>% group_by(ID) %>% do(.,fun1(.))
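
fun1 as written assumes that every ID contains at least one 0 or 2 and at least one matching row after it. As a hedged sketch of how that assumption could be relaxed, the hypothetical variant below (`fun1_safe` is not from the original answer) returns zero rows for such groups. It is applied here with base R's split/lapply instead of data.table or dplyr, on a copy of the data that includes an extra ID 3 with no 0 or 2.

```r
# Data with an extra ID 3 that has no Value of 0 or 2.
df1 <- read.table(text = "ID Date Value
1 19960708 3
1 19960901 2
1 19960916 3
1 19970901 0
1 19971001 3
1 19971231 3
1 19980101 3
2 19900806 3
2 19901215 3
2 19910629 0
2 19911007 3
2 19911201 3
2 19990901 3
2 20001001 3
3 19990901 3
3 20001001 3", header = TRUE)

# Hypothetical guarded variant of fun1: returns zero rows when a group has
# no afterMax value, or when no findMin row follows the last afterMax row.
fun1_safe <- function(data, var = "Value", afterMax = c(0, 2), findMin = 3) {
  hits <- which(data[[var]] %in% afterMax)
  if (length(hits) == 0) return(data[0, ])
  tmp <- which(data[[var]] %in% findMin)
  tmp <- tmp[tmp > max(hits)]
  if (length(tmp) == 0) return(data[0, ])
  data[min(tmp), ]
}

# Assumes rows are already ordered by Date within each ID, as in df1.
res <- do.call(rbind, lapply(split(df1, df1$ID), fun1_safe))
rownames(res) <- NULL
print(res)
```

In this sketch ID 3 is simply dropped from the result. Whether to drop such IDs or keep them with NA values is a design choice that depends on what the downstream code expects.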

#3


0  

Another possible approach with data.table:

library(data.table)
setDT(df)

# ensure that df is sorted in ascending order by ID and Date
setorder(df, ID, Date)

df[df[, .I[max(which(Value==0 | Value==2)) + 1L], by=.(ID)]$V1]

output:

   ID     Date Value
1:  1 19971001     3
2:  2 19911007     3

With the approach above, if an ID has no row with Value 0 or 2, that ID produces a row in which every column is NA. If you would rather keep the ID and show NA only in the other columns, you can use this (df1 is the data with an extra ID 3, listed under "data" below):

setDT(df1, keep.rownames=TRUE)[, rn := as.integer(rn)]
setorder(df1, ID, Date)
df1[df1[, .(rn=.I[max(which(Value==0 | Value==2)) + 1L]), by=.(ID)], on=.(ID, rn)][, 
    rn := NULL]

output:

   ID     Date Value
1:  1 19971001     3
2:  2 19911007     3
3:  3       NA    NA

data:

df <- read.table(text="ID       Date     Value
1     19960708      3
1     19960901      2
1     19960916      3
1     19970901      0
1     19971001      3
1     19971231      3
1     19980101      3
2     19900806      3
2     19901215      3
2     19910629      0
2     19911007      3
2     19911201      3
2     19990901      3
2     20001001      3", header=TRUE)

df1 <- read.table(text="ID       Date     Value
1     19960708      3
1     19960901      2
1     19960916      3
1     19970901      0
1     19971001      3
1     19971231      3
1     19980101      3
2     19900806      3
2     19901215      3
2     19910629      0
2     19911007      3
2     19911201      3
2     19990901      3
2     20001001      3
3     19990901      3
3     20001001      3", header=TRUE)
