I have to filter a data frame, using as the criterion whether a row contains the string RTB. I'm using dplyr.
# %.% was dplyr's original chaining operator; current versions use %>%
d.del <- df %>%
  group_by(TrackingPixel) %>%
  summarise(MonthDelivery = as.integer(sum(Revenue))) %>%
  arrange(desc(MonthDelivery))
I know I can use the function filter in dplyr, but I don't know exactly how to tell it to check the content of a string.
In particular, I want to check the content of the column TrackingPixel. If the string contains the label RTB, I want to remove the row from the result.
3 Answers
#1
151
The answer to the question was already posted by @latemail in the comments above. The second and subsequent arguments of filter are logical conditions, so you can use a regular-expression match such as grepl there, like this:
dplyr::filter(df, !grepl("RTB",TrackingPixel))
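To connect this with the pipeline from the question, here is a minimal sketch. The data frame is made up for illustration, since the original data was not posted; only the column names TrackingPixel and Revenue come from the question.

```r
library(dplyr)

# Hypothetical data standing in for the asker's df
df <- data.frame(
  TrackingPixel = c("RTB_A", "Direct_B", "RTB_A", "Direct_C", "Direct_B"),
  Revenue       = c(10.5, 20.2, 5.1, 7.9, 1.3),
  stringsAsFactors = FALSE
)

d.del <- df %>%
  filter(!grepl("RTB", TrackingPixel)) %>%  # drop rows whose label contains "RTB"
  group_by(TrackingPixel) %>%
  summarise(MonthDelivery = as.integer(sum(Revenue))) %>%
  arrange(desc(MonthDelivery))

print(d.del)
```

Filtering before group_by means the RTB rows never enter the sums. Filtering after summarise would give the same result here, because the test is on the grouping column itself.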
Since you have not provided the original data, I will add a toy example using the mtcars data set. Imagine you are only interested in cars produced by Mazda or Toyota.
mtcars$type <- rownames(mtcars)
dplyr::filter(mtcars, grepl('Toyota|Mazda', type))
mpg cyl disp hp drat wt qsec vs am gear carb type
1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4
2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
3 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corolla
4 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 Toyota Corona
If you would like to do it the other way round, namely excluding Toyota and Mazda cars, the filter command looks like this:
dplyr::filter(mtcars, !grepl('Toyota|Mazda', type))
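Note that grepl treats its first argument as a regular expression by default. If the label should match regardless of case, or contains characters that are special in a regex, base R's ignore.case and fixed arguments may help. A small sketch with made-up labels (the vector below is purely illustrative):

```r
# Hypothetical labels for illustration
labels <- c("RTB_1", "rtb_2", "R.T.B", "RXT")

case_sensitive   <- grepl("RTB", labels)                      # only "RTB_1"
case_insensitive <- grepl("RTB", labels, ignore.case = TRUE)  # also "rtb_2"
as_regex         <- grepl("R.T", labels)                      # "." matches any character
as_literal       <- grepl("R.T", labels, fixed = TRUE)        # literal "R.T" only

print(rbind(case_sensitive, case_insensitive, as_regex, as_literal))
```

Any of these logical vectors can be negated with ! and passed to filter in the same way as above.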
#2
55
Solution
You can use str_detect from the stringr package, which is part of the tidyverse. str_detect returns TRUE or FALSE for each element of the given vector, depending on whether it matches the pattern. You can filter on this logical vector. See "Introduction to stringr" for details about the stringr package.
library(tidyverse)
# ─ Attaching packages ──────────────────── tidyverse 1.2.1 ─
# ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
# ✔ tibble 1.4.2 ✔ dplyr 0.7.4
# ✔ tidyr 0.7.2 ✔ stringr 1.2.0
# ✔ readr 1.1.1 ✔ forcats 0.3.0
# ─ Conflicts ───────────────────── tidyverse_conflicts() ─
# ✖ dplyr::filter() masks stats::filter()
# ✖ dplyr::lag() masks stats::lag()
mtcars$type <- rownames(mtcars)
mtcars %>%
  filter(str_detect(type, 'Toyota|Mazda'))
# mpg cyl disp hp drat wt qsec vs am gear carb type
# 1 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4
# 2 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 Mazda RX4 Wag
# 3 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 Toyota Corolla
# 4 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 Toyota Corona
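For the exclusion asked about in the question, the condition can be negated with !. One difference from grepl that may matter is the handling of missing values: str_detect returns NA for an NA input, and filter drops rows whose condition is NA, whereas grepl returns FALSE for NA. A minimal sketch, assuming a made-up data frame with one missing label:

```r
library(dplyr)
library(stringr)

# Hypothetical data with a missing label
df <- data.frame(
  TrackingPixel = c("RTB_A", "Direct_B", NA),
  stringsAsFactors = FALSE
)

# str_detect() gives NA for the missing value, so filter() drops that row too
kept_stringr <- df %>% filter(!str_detect(TrackingPixel, "RTB"))

# grepl() gives FALSE for the missing value, so the negated test keeps that row
kept_grepl <- df %>% filter(!grepl("RTB", TrackingPixel))

print(nrow(kept_stringr))
print(nrow(kept_grepl))
```

Which behaviour is preferable depends on whether rows with a missing label should survive the filter.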
The good things about stringr
We should rather use stringr::str_detect() than base::grepl(), for the following reasons.
- The functions provided by the stringr package start with the prefix str_, which makes the code easier to read.
- The first argument of the stringr functions is always the string (character vector) to operate on, followed by the parameters. (Thank you, Paolo)
object <- "stringr"
# The stringr functions share the same prefix `str_`.
# The first argument is always the object.
stringr::str_count(object) # -> 7
stringr::str_sub(object, 1, 3) # -> "str"
stringr::str_detect(object, "str") # -> TRUE
stringr::str_replace(object, "str", "") # -> "ingr"
# The base function names have no common prefix.
# The position of the object argument also varies.
base::nchar(object) # -> 7
base::substr(object, 1, 3) # -> "str"
base::grepl("str", object) # -> TRUE
base::sub("str", "", object) # -> "ingr"
Benchmark
The results of the benchmark test are as follows. For a large data frame, str_detect is faster: about 1.5 times faster than grepl in this test (10.9 s versus 16.5 s elapsed over 10 replications).
library(rbenchmark)
library(tidyverse)
# The data. Data expo 09. ASA Statistics Computing and Graphics
# http://stat-computing.org/dataexpo/2009/the-data.html
df <- read_csv("Downloads/2008.csv")
print(dim(df))
# [1] 7009728 29
benchmark(
  "str_detect" = {df %>% filter(str_detect(Dest, 'MCO|BWI'))},
  "grepl" = {df %>% filter(grepl('MCO|BWI', Dest))},
  replications = 10,
  columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self"))
# test replications elapsed relative user.self sys.self
# 2 grepl 10 16.480 1.513 16.195 0.248
# 1 str_detect 10 10.891 1.000 9.594 1.281
#3
0
If you want to find the string in any given column, have a look at "Remove row if any column contains a specific string". It is basically about using filter_at or filter_all.
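A minimal sketch of that approach, using a made-up two-column data frame (the column names a and b are hypothetical). The scoped verbs filter_all and filter_at are combined with any_vars or all_vars; since dplyr 1.0.0 they are marked as superseded but remain available.

```r
library(dplyr)

# Hypothetical data: "RTB" may appear in either column
df <- data.frame(
  a = c("RTB_1", "x", "y"),
  b = c("p", "q", "has RTB"),
  stringsAsFactors = FALSE
)

# Keep rows where any column contains "RTB"
with_rtb <- df %>% filter_all(any_vars(grepl("RTB", .)))

# Remove rows where any column contains "RTB"
# (equivalently: keep rows where every column lacks it)
without_rtb <- df %>% filter_all(all_vars(!grepl("RTB", .)))

# Restrict the check to selected columns with filter_at()
only_a <- df %>% filter_at(vars(a), any_vars(grepl("RTB", .)))

print(without_rtb)
```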