I've searched SO for a solution, to no avail, so here it is. I have a data frame with many columns, some of which are numerical and should be non-negative. I want to clean the data, since some values in these numerical columns are negative. What I can do now is extract the names of these columns with a regular expression. But I am not sure how to filter rows based on those columns.
To give an example, let's say:
library(dplyr)
df <- read.table(text =
"id sth1 tg1_num sth2 tg2_num others
1 dave 2 ca 35 new
2 tom 5 tn -3 old
3 jane -3 al 0 new
4 leroy 0 az 25 old
5 jerry 4 mi 55 old", header=TRUE)
pattern <- "_num$"
ind <- grep(pattern, colnames(df))
target_columns <- colnames(df)[ind]
df <- df %>% filter(target_columns >= 0) # this is wrong, but it's what I want to do
What I want to get out of this filtering is the following:
id sth1 tg1_num sth2 tg2_num others
1 dave 2 ca 35 new
4 leroy 0 az 25 old
5 jerry 4 mi 55 old
where rows 2 and 3 are filtered out because, in those rows, at least one of tg1_num and tg2_num contains a negative number.
7 Answers
#1
4
This is a very awkward use of dplyr, but it might be true to the spirit:
> df %>% mutate(m = do.call(pmin, select(df, ends_with("_num"))))
id sth1 tg1_num sth2 tg2_num others m
1 1 dave 2 ca 35 new 2
2 2 tom 5 tn -3 old -3
3 3 jane -3 al 0 new -3
4 4 leroy 0 az 25 old 0
5 5 jerry 4 mi 55 old 4
From there you can add a filter(m >= 0) to get the answer you want (the complete pipeline is spelled out below). If there were a rowMins analogous to rowMeans, that would simplify this significantly.
> rowMins <- function(df) { do.call(pmin, df) }
> df %>% mutate(m = rowMins(select(df, ends_with("_num"))))
id sth1 tg1_num sth2 tg2_num others m
1 1 dave 2 ca 35 new 2
2 2 tom 5 tn -3 old -3
3 3 jane -3 al 0 new -3
4 4 leroy 0 az 25 old 0
5 5 jerry 4 mi 55 old 4
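For completeness, adding the filter and dropping the helper column gives exactly the result the question asked for (this is the rowmins function in the benchmark definitions below):

df %>%
  mutate(m = rowMins(select(df, ends_with("_num")))) %>%
  filter(m >= 0) %>%
  select(-m)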
I don't know how efficient this is, though. And nesting the select seems really ugly.
EDIT3: Using ideas cribbed from other solutions/comments (h/t to @Vlo) I can speed mine up a lot. (Unfortunately, a similar optimization speeds up @Vlo's solution even more. EDIT4: Whoops, misread the chart; mine is the fastest. OK, no more on this.)
df %>% select(ends_with("_num")) %>% rowMins %>% {df[. >= 0,]}
EDIT: out of curiosity, I did some microbenchmarking on some of the solutions (EDIT2: added more solutions):
library(microbenchmark)
microbenchmark(rowmins(df), rowmins2(df), reducer(df), sapplyer(df), grepapply(df), tchotchke(df), withrowsums(df), reducer2(df))
Unit: microseconds
expr min lq mean median uq max
rowmins(df) 1373.452 1431.9700 1732.188 1576.043 1729.410 5147.847
rowmins2(df) 836.885 875.9900 1015.364 913.285 1038.729 2510.339
reducer(df) 990.096 1058.6645 1217.264 1201.159 1297.997 3103.809
sapplyer(df) 14119.236 14939.8755 16820.701 15952.057 16612.709 66023.721
grepapply(df) 12907.657 13686.2325 14517.140 14485.520 15146.294 17291.779
tchotchke(df) 2770.818 2939.6425 3114.233 3036.926 3172.325 4098.161
withrowsums(df) 1526.227 1627.8185 1819.220 1722.430 1876.360 3025.095
reducer2(df) 900.524 943.1265 1087.025 1003.820 1109.188 3869.993
And here are the definitions I used:
rowmins <- function(df) {
df %>%
mutate(m = rowMins(select(df, ends_with("_num")))) %>%
filter(m >= 0) %>%
select(-m)
}
rowmins2 <- function(df) {
df %>% select(ends_with("_num")) %>% rowMins %>% {df[. >= 0,]}
}
reducer <- function(df) {
df %>%
select(matches("_num$")) %>%
lapply(">=", 0) %>%
Reduce(f = "&", .) %>%
which %>%
slice(.data = df)
}
reducer2 <- function(df) {
df %>%
select(matches("_num$")) %>%
lapply(">=", 0) %>%
Reduce(f = "&", .) %>%
{df[.,]}
}
sapplyer <- function(df) {
nums <- sapply(df, is.numeric)
df[apply(df[, nums], MARGIN=1, function(x) all(x >= 0)), ]
}
grepapply <- function(df) {
cond <- df[, grepl("_num$", colnames(df))] >= 0
df[apply(cond, 1, function(x) {prod(x) == 1}), ]
}
tchotchke <- function(df) {
pattern <- "_num$"
ind <- grep(pattern, colnames(df))
target_columns <- colnames(df)[ind]
desired_rows <- sapply(target_columns, function(x) which(df[,x]<0), simplify=TRUE)
as.vector(unique(unlist(desired_rows)))
}
withrowsums <- function(df) {
  # m counts strictly positive "_num" values per row; m == 2 assumes exactly
  # two "_num" columns (and, unlike the question, drops rows containing a 0)
  df %>%
    mutate(m = rowSums(select(df, ends_with("_num")) > 0)) %>%
    filter(m == 2) %>%
    select(-m)
}
# data frame used for the benchmarks above
df <- data.frame(id=1:10000, sth1=sample(LETTERS, 10000, replace=T), tg1_num=runif(10000,-1,1), tg2_num=runif(10000,-1, 1))
#2
6
Here's a possible vectorized solution:
ind <- grep("_num$", colnames(df))
df[!rowSums(df[ind] < 0),]
# id sth1 tg1_num sth2 tg2_num others
# 1 1 dave 2 ca 35 new
# 4 4 leroy 0 az 25 old
# 5 5 jerry 4 mi 55 old
The idea here is to create a logical matrix using the < function (it is a generic function with a data.frame method, so it accepts the data frame and returns a matrix-like structure back). Then we use rowSums to count the matched conditions per row (> 0: at least one negative value; 0: none). Then we use the ! function to convert that count to a logical vector: 0 becomes TRUE (keep the row), while > 0 becomes FALSE (drop it). Finally, we subset according to that vector.
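To make the intermediate steps concrete on the example data:

df[ind] < 0            # TRUE wherever a value is negative
rowSums(df[ind] < 0)   # negatives per row: 0 1 1 0 0
!rowSums(df[ind] < 0)  # TRUE where the count is zero: TRUE FALSE FALSE TRUE TRUE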
#3
4
I wanted to see whether this was possible using standard evaluation with dplyr's filter_. It turns out it can be done with the help of interp from lazyeval, following the example code on this page. Essentially, you have to create a list of the interp conditions, which you then pass to the .dots argument of filter_.
library(lazyeval)
dots <- lapply(target_columns, function(cols){
interp(~y >= 0, .values = list(y = as.name(cols)))
})
filter_(df, .dots = dots)
id sth1 tg1_num sth2 tg2_num others
1 1 dave 2 ca 35 new
2 4 leroy 0 az 25 old
3 5 jerry 4 mi 55 old
Update
Starting with dplyr 0.7, this can be done directly with filter_at and all_vars (no lazyeval needed).
df %>%
filter_at(vars(target_columns), all_vars(. >= 0) )
id sth1 tg1_num sth2 tg2_num others
1 1 dave 2 ca 35 new
2 4 leroy 0 az 25 old
3 5 jerry 4 mi 55 old
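(A further update: in dplyr 1.0.4 and later, filter_at is itself superseded; assuming a recent dplyr, the same filter can be written with if_all:)

df %>%
  filter(if_all(all_of(target_columns), ~ .x >= 0))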
#4
1
Using base R to get your result:
cond <- df[, grepl("_num$", colnames(df))] >= 0
df[apply(cond, 1, function(x) {prod(x) == 1}), ]
id sth1 tg1_num sth2 tg2_num others
1 1 dave 2 ca 35 new
4 4 leroy 0 az 25 old
5 5 jerry 4 mi 55 old
Edit: this assumes you have multiple "_num" columns. It won't work if you have just one _num column, because the single-column subset drops to a vector and the apply over rows then fails; see the sketch below.
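A possible fix (a sketch): subsetting with drop = FALSE keeps the one-column case a data frame, and all(x) is an equivalent, clearer test than prod(x) == 1:

cond <- df[, grepl("_num$", colnames(df)), drop = FALSE] >= 0
df[apply(cond, 1, all), ]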
#5
1
First we create an index of all numeric columns. Then we keep only the rows where every numeric column is greater than or equal to zero. So there is no need to check the column names; note that this also checks id, which is assumed to be always positive.
nums <- sapply(df, is.numeric)
df[apply(df[, nums], MARGIN = 1, function(x) all(x >= 0)), ]
Output:
id sth1 tg1_num sth2 tg2_num others
1 1 dave 2 ca 35 new
4 4 leroy 0 az 25 old
5 5 jerry 4 mi 55 old
#6
1
Here is my ugly solution. Suggestions/criticisms welcome:
df %>%
# Select the columns we want
select(matches("_num$")) %>%
# Convert every column to logical (TRUE where the value is >= 0)
lapply(">=", 0) %>%
# Reduce all the sublists with AND
Reduce(f = "&", .) %>%
# Convert the one vector of logical into numeric
# index since slice can't deal with logical.
# Can simply write `{df[.,]}` here instead,
# which is probably faster than which + slice
# Edit: This is not true. which + slice is faster than `[` in this case
which %>%
slice(.data = df)
id sth1 tg1_num sth2 tg2_num others
1 1 dave 2 ca 35 new
2 4 leroy 0 az 25 old
3 5 jerry 4 mi 55 old
#7
0
This will give you a vector of the indices of rows that contain values less than 0:
desired_rows <- sapply(target_columns, function(x) which(df[,x]<0), simplify=TRUE)
desired_rows <- as.vector(unique(unlist(desired_rows)))
Then to get a df of your desired rows:
setdiff(df, df[desired_rows,])
id sth1 tg1_num sth2 tg2_num others
1 1 dave 2 ca 35 new
2 4 leroy 0 az 25 old
3 5 jerry 4 mi 55 old
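Note that setdiff here is dplyr's data-frame method (dplyr is loaded in the question). For reference, a direct negative subset is equivalent, but it needs a guard for the case with no negative rows, since df[-integer(0), ] would return zero rows:

# hypothetical alternative to setdiff; guard the empty case
if (length(desired_rows) > 0) df[-desired_rows, ] else df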