Filtering multiple columns on multiple conditions with dplyr

Date: 2021-08-13 22:25:35

I've searched on SO trying to find a solution to no avail. So here it is. I have a data frame with many columns, some of which are numerical and should be non-negative. I want to clean the data since some values in these numerical columns are negative. What I can do now is extract the column names of these columns with a regular expression. But I am not sure how to implement the filtering of rows based on these columns.


To give an example, let's say:


library(dplyr)
df <- read.table(text = 
  "id   sth1    tg1_num   sth2    tg2_num    others   
  1     dave    2         ca      35         new
  2     tom     5         tn      -3         old
  3     jane    -3        al       0         new
  4     leroy   0         az      25         old
  5     jerry   4         mi      55        old", header=TRUE)
pattern <- "_num$"
ind <- grep(pattern, colnames(df))
target_columns <- colnames(df)[ind]
df <- df %>% filter(target_columns >= 0) # this is wrong, but it's what I want to do

What I want to get out from this filtering is the following:


id   sth1 tg1_num   sth2 tg2_num others
1    dave       2     ca      35    new
4   leroy       0     az      25    old
5   jerry       4     mi      55    old

where rows 2 and 3 are filtered out because, for those rows, at least one of tg1_num and tg2_num contains a negative number.


7 answers

#1



This is a very awkward use of dplyr, but it might be true to the spirit:


> df %>% mutate(m = do.call(pmin, select(df, ends_with("_num"))))
  id  sth1 tg1_num sth2 tg2_num others  m
1  1  dave       2   ca      35    new  2
2  2   tom       5   tn      -3    old -3
3  3  jane      -3   al       0    new -3
4  4 leroy       0   az      25    old  0
5  5 jerry       4   mi      55    old  4

From there you can add a filter(m >= 0) to get the answer you want. If there were a rowMins analogous to rowMeans then that would simplify this significantly.

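As an aside, a rowMins of exactly this kind does exist in the matrixStats package. A minimal sketch, assuming matrixStats is installed and using the question's df:

```r
library(dplyr)
library(matrixStats)  # provides rowMins() for numeric matrices

# Row-wise minimum over the *_num columns, then keep rows where it is >= 0
df %>%
  filter(rowMins(as.matrix(select(., ends_with("_num")))) >= 0)
```

Note that rowMins() wants a numeric matrix, hence the as.matrix() around the selected columns.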

> rowMins <- function(df) { do.call(pmin, df) }
> df %>% mutate(m = rowMins(select(df, ends_with("_num"))))
  id  sth1 tg1_num sth2 tg2_num others  m
1  1  dave       2   ca      35    new  2
2  2   tom       5   tn      -3    old -3
3  3  jane      -3   al       0    new -3
4  4 leroy       0   az      25    old  0
5  5 jerry       4   mi      55    old  4

I don't know how efficient this is, though, and nesting the select seems really ugly.


EDIT3: Using ideas cribbed from other solutions/comments (h/t @Vlo) I can speed mine up a lot. (Unfortunately, a similar optimization speeds up @Vlo's solution even more. EDIT4: whoops, misread the chart; mine is the fastest, so no more on this.)


df %>% select(ends_with("_num")) %>% rowMins %>% {df[. >= 0,]}

EDIT: out of curiosity, did some microbenchmarking on some of the solutions (EDIT2: Added more solutions)


microbenchmark(rowmins(df), rowmins2(df), reducer(df), sapplyer(df), grepapply(df), tchotchke(df), withrowsums(df), reducer2(df))

Unit: microseconds
            expr       min         lq      mean    median        uq       max
     rowmins(df)  1373.452  1431.9700  1732.188  1576.043  1729.410  5147.847
    rowmins2(df)   836.885   875.9900  1015.364   913.285  1038.729  2510.339
     reducer(df)   990.096  1058.6645  1217.264  1201.159  1297.997  3103.809
    sapplyer(df) 14119.236 14939.8755 16820.701 15952.057 16612.709 66023.721
   grepapply(df) 12907.657 13686.2325 14517.140 14485.520 15146.294 17291.779
   tchotchke(df)  2770.818  2939.6425  3114.233  3036.926  3172.325  4098.161
 withrowsums(df)  1526.227  1627.8185  1819.220  1722.430  1876.360  3025.095
    reducer2(df)   900.524   943.1265  1087.025  1003.820  1109.188  3869.993

And here are the definitions I used


rowmins <- function(df) {
  df %>%
    mutate(m = rowMins(select(df, ends_with("_num")))) %>%
    filter(m >= 0) %>%
    select(-m)
}

rowmins2 <- function(df) {
  df %>% select(ends_with("_num")) %>% rowMins %>% {df[. >= 0,]}
}

reducer <- function(df) {
  df %>%
    select(matches("_num$")) %>%
    lapply(">=", 0) %>%
    Reduce(f = "&", .) %>%
    which %>%
    slice(.data = df)
}

reducer2 <- function(df) {
  df %>%
    select(matches("_num$")) %>%
    lapply(">=", 0) %>%
    Reduce(f = "&", .) %>%
    {df[.,]}
}

sapplyer <- function(df) {
  nums <- sapply(df, is.numeric)
  df[apply(df[, nums], MARGIN=1, function(x) all(x >= 0)), ]
}

grepapply <- function(df) {
  cond <- df[, grepl("_num$", colnames(df))] >= 0
    df[apply(cond, 1, function(x) {prod(x) == 1}), ]
}

tchotchke <- function(df) {
  pattern <- "_num$"
  ind <- grep(pattern, colnames(df))
  target_columns <- colnames(df)[ind]
  desired_rows <- sapply(target_columns, function(x) which(df[,x]<0), simplify=TRUE)
  as.vector(unique(unlist(desired_rows)))
}

withrowsums <- function(df) {
  df %>% mutate(m = rowSums(select(df, ends_with("_num")) >= 0)) %>% filter(m == 2) %>% select(-m) # >= 0 so rows containing zeros are kept; m == 2 assumes exactly two _num columns
}


df <- data.frame(id=1:10000, sth1=sample(LETTERS, 10000, replace=T), tg1_num=runif(10000,-1,1), tg2_num=runif(10000,-1, 1))

#2



Here's a possible vectorized solution


ind <- grep("_num$", colnames(df))
df[!rowSums(df[ind] < 0),]
#   id  sth1 tg1_num sth2 tg2_num others
# 1  1  dave       2   ca      35    new
# 4  4 leroy       0   az      25    old
# 5  5 jerry       4   mi      55    old

The idea here is to create a logical matrix using the < function (it is a generic with a data.frame method, so it returns a data-frame-like structure back). Then we use rowSums to count the matched conditions per row (> 0 - at least one negative value, 0 - none). Then we use the ! function to convert that count to a logical vector: > 0 becomes FALSE, while 0 becomes TRUE. Finally, we subset according to that vector.

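To make the intermediate steps concrete with the question's sample df (each commented result follows directly from the data above):

```r
ind <- grep("_num$", colnames(df))

df[ind] < 0                  # logical data frame: TRUE where a value is negative
rowSums(df[ind] < 0)         # negatives per row: 0 1 1 0 0
!rowSums(df[ind] < 0)        # TRUE where that count is 0: TRUE FALSE FALSE TRUE TRUE
df[!rowSums(df[ind] < 0), ]  # rows 1, 4 and 5
```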

#3



I wanted to see if this was possible using standard evaluation with dplyr's filter_. It turns out it can be done with the help of interp from lazyeval, following the example code on this page. Essentially, you create a list of interp conditions and pass it to the .dots argument of filter_.


library(lazyeval)

dots <- lapply(target_columns, function(cols){
    interp(~y >= 0, .values = list(y = as.name(cols)))
})

filter_(df, .dots = dots)   

  id  sth1 tg1_num sth2 tg2_num others
1  1  dave       2   ca      35    new
2  4 leroy       0   az      25    old
3  5 jerry       4   mi      55    old

Update


Starting with dplyr_0.7, this can be done directly with filter_at and all_vars (no lazyeval needed).


df %>%
     filter_at(vars(target_columns), all_vars(. >= 0) )

  id  sth1 tg1_num sth2 tg2_num others
1  1  dave       2   ca      35    new
2  4 leroy       0   az      25    old
3  5 jerry       4   mi      55    old
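In dplyr 1.0.4 and later, the scoped filter_at()/all_vars() verbs are superseded; the same filter is written with if_all() (a sketch assuming a recent dplyr):

```r
# all_of() selects by the character vector of names;
# if_all() requires the predicate to hold in every selected column.
df %>%
  filter(if_all(all_of(target_columns), ~ .x >= 0))
```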

#4



Using base R to get your result:


cond <- df[, grepl("_num$", colnames(df))] >= 0
df[apply(cond, 1, function(x) {prod(x) == 1}), ]

  id  sth1 tg1_num sth2 tg2_num others
1  1  dave       2   ca      35    new
4  4 leroy       0   az      25    old
5  5 jerry       4   mi      55    old

Edit: this assumes you have multiple "_num" columns. With just one, df[, grepl("_num$", colnames(df))] drops to a vector, so the apply() call no longer works (unless you subset with drop = FALSE).


#5



First we create an index of all numeric columns. Then we keep only the rows where every numeric column is greater than or equal to zero. There is no need to check the column names; the id column is numeric too, but it is always positive, so including it does no harm.


nums <- sapply(df, is.numeric)
df[apply(df[, nums], MARGIN = 1, function(x) all(x >= 0)), ]

Output:


  id  sth1 tg1_num sth2 tg2_num others
1  1  dave       2   ca      35    new
4  4 leroy       0   az      25    old
5  5 jerry       4   mi      55    old

#6



Here is my ugly solution. Suggestions/criticisms welcome.


df %>% 
  # Select the columns we want
  select(matches("_num$")) %>%
  # Convert every column to logical if >= 0
  lapply(">=", 0) %>%
  # Reduce all the sublist with AND 
  Reduce(f = "&", .) %>%
  # Convert the one vector of logical into numeric
  # index since slice can't deal with logical. 
  # Can simply write `{df[.,]}` here instead,
  # which is probably faster than which + slice
  # Edit: This is not true. which + slice is faster than `[` in this case
  which %>%
  slice(.data = df)

  id  sth1 tg1_num sth2 tg2_num others
1  1  dave       2   ca      35    new
2  4 leroy       0   az      25    old
3  5 jerry       4   mi      55    old

#7



This will give you a vector of the row indices that contain values less than 0:


desired_rows <- sapply(target_columns, function(x) which(df[,x]<0), simplify=TRUE)
desired_rows <- as.vector(unique(unlist(desired_rows)))

Then to get a df of your desired rows:


setdiff(df, df[desired_rows,])
  id  sth1 tg1_num sth2 tg2_num others
1  1  dave       2   ca      35    new
2  4 leroy       0   az      25    old
3  5 jerry       4   mi      55    old
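One editorial caveat, not from the original answer: the seemingly simpler df[-desired_rows, ] silently returns zero rows when desired_rows is empty, because df[-integer(0), ] is the same as df[integer(0), ]. A safer sketch that also avoids the data-frame setdiff:

```r
# Keep every row whose index is not in desired_rows; works even when
# desired_rows is integer(0), i.e. no negative values were found.
df[setdiff(seq_len(nrow(df)), desired_rows), ]
```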
