How do I subset a data.frame?

Time: 2022-10-11 01:38:38

I have a data set like this

a <- data.frame(var1 = c("patientA", "patientA", "patientA", "patientB", "patientB", "patientB", "patientB"),
                var2 = as.Date(c("2015-01-02","2015-01-04","2015-02-02","2015-02-06","2015-01-02","2015-01-07","2015-04-02")),
                var3 = c(F, T, F, F, F, T, F)               
                )
sequ <- rle(as.character(a$var1))
a$sequ <- sequence(sequ$lengths)

producing

> a
      var1       var2  var3 sequ
1 patientA 2015-01-02 FALSE    1
2 patientA 2015-01-04  TRUE    2
3 patientA 2015-02-02 FALSE    3
4 patientB 2015-02-06 FALSE    1
5 patientB 2015-01-02 FALSE    2
6 patientB 2015-01-07  TRUE    3
7 patientB 2015-04-02 FALSE    4

How can I subset/filter this data set so that, for each patient (var1), I get the row where var3 == TRUE plus all rows whose var2 date is later than the date in that row? I tried

subset(a, (var3 == TRUE) & (var2 > var3))

but this does not produce a correct result set. The correct one is

#       var1       var2  var3 sequ
# 1 patientA 2015-01-04  TRUE    2
# 2 patientA 2015-02-02 FALSE    3
# 3 patientB 2015-02-06 FALSE    1
# 4 patientB 2015-01-07  TRUE    3
# 5 patientB 2015-04-02 FALSE    4

3 solutions

#1


You may try data.table. Here, we convert the 'data.frame' to a 'data.table' with setDT(a). Then, grouped by 'var1', we build a logical index of the 'var2' elements that are greater than or equal to the 'var2' element for which 'var3' is TRUE, and use it to subset each group's data (.SD).

library(data.table)
setDT(a)[,.SD[var2 >= var2[var3]], var1]
#       var1       var2  var3 sequ
#1: patientA 2015-01-04  TRUE    2
#2: patientA 2015-02-02 FALSE    3
#3: patientB 2015-02-06 FALSE    1
#4: patientB 2015-01-07  TRUE    3
#5: patientB 2015-04-02 FALSE    4

An option using base R (assuming that the data is ordered by 'var1' and that each patient has exactly one row where 'var3' is TRUE)

a[with(a, var2>=rep(var2[var3], table(var1))),]
#      var1       var2  var3 sequ
#2 patientA 2015-01-04  TRUE    2
#3 patientA 2015-02-02 FALSE    3
#4 patientB 2015-02-06 FALSE    1
#6 patientB 2015-01-07  TRUE    3
#7 patientB 2015-04-02 FALSE    4
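
The base R one-liner above relies on the rows being ordered by 'var1'. As a hedged sketch (not part of the original answer), the ordering assumption can be dropped by using ave() to broadcast, within each patient, the row position of the first row where 'var3' is TRUE. The names ref and res are introduced here for illustration, and the sketch assumes every patient has at least one such row.

```r
# Data from the question (sequ column omitted; it is not needed here)
a <- data.frame(
  var1 = c("patientA", "patientA", "patientA",
           "patientB", "patientB", "patientB", "patientB"),
  var2 = as.Date(c("2015-01-02", "2015-01-04", "2015-02-02", "2015-02-06",
                   "2015-01-02", "2015-01-07", "2015-04-02")),
  var3 = c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
)

# For every row, the position of its patient's first var3 == TRUE row
ref <- ave(seq_len(nrow(a)), a$var1,
           FUN = function(i) i[a$var3[i]][1])

# Keep rows dated on or after that reference row's date
res <- a[a$var2 >= a$var2[ref], ]
res
```

Because the comparison is done row by row against a$var2[ref], the rows keep their original order and row names.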

#2


I add a column with the date when var3 is TRUE, filter based on it, then drop it at the end.

library(dplyr)

a %>%
    group_by(var1) %>%
    mutate(truedate = first(var2[var3])) %>%
    filter(var2 >= truedate) %>%
    select(-truedate)

# Source: local data frame [5 x 4]
# Groups: var1

#       var1       var2  var3 sequ
# 1 patientA 2015-01-04  TRUE    2
# 2 patientA 2015-02-02 FALSE    3
# 3 patientB 2015-02-06 FALSE    1
# 4 patientB 2015-01-07  TRUE    3
# 5 patientB 2015-04-02 FALSE    4
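
As a variation on the pipeline above, the helper column can be skipped, because filter() evaluates its condition per group on a grouped data frame. This is a sketch rather than the answerer's code: res is a name introduced for illustration, ungroup() is added so the result is no longer grouped, and each patient is assumed to have at least one row where var3 is TRUE.

```r
library(dplyr)

# Data from the question (sequ column omitted; it is not needed here)
a <- data.frame(
  var1 = c("patientA", "patientA", "patientA",
           "patientB", "patientB", "patientB", "patientB"),
  var2 = as.Date(c("2015-01-02", "2015-01-04", "2015-02-02", "2015-02-06",
                   "2015-01-02", "2015-01-07", "2015-04-02")),
  var3 = c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
)

res <- a %>%
  group_by(var1) %>%
  # within each patient, compare every date with the first var3 == TRUE date
  filter(var2 >= first(var2[var3])) %>%
  ungroup()
res
```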

#3


A base R solution: first, don't bother with the rle/sequ column. Instead, sort your data by patient and date:

a <- a[order(a$var1,a$var2),]

Find the selected rows:

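myrows <- tapply(
  seq_len(nrow(a)),   # row positions in the sorted data
  a$var1,             # grouped by patient
  function(ivec){
    # position of this patient's row where var3 is TRUE
    istar <- ivec[a$var3[ivec]]
    # keep that row and every row after it (later dates, thanks to the sort)
    ivec[ivec>=istar]
  })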

Subset with a[unlist(myrows),].
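
Putting the three steps above together, here is a minimal end-to-end sketch. The data is re-created so the block runs on its own, and res is a name introduced for illustration. It assumes exactly one row per patient has var3 set to TRUE. Because of the sort, rows come back in date order within each patient, so the row order differs from the expected output in the question.

```r
# Data from the question (without the rle/sequ column)
a <- data.frame(
  var1 = c("patientA", "patientA", "patientA",
           "patientB", "patientB", "patientB", "patientB"),
  var2 = as.Date(c("2015-01-02", "2015-01-04", "2015-02-02", "2015-02-06",
                   "2015-01-02", "2015-01-07", "2015-04-02")),
  var3 = c(FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
)

# Step 1: sort by patient, then by date
a <- a[order(a$var1, a$var2), ]

# Step 2: per patient, keep the TRUE row's position and all later positions
myrows <- tapply(
  seq_len(nrow(a)),
  a$var1,
  function(ivec) {
    istar <- ivec[a$var3[ivec]]
    ivec[ivec >= istar]
  })

# Step 3: subset by the collected row positions
res <- a[unlist(myrows), ]
res
```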
