Filtering rows that contain a specific string using dplyr

Time: 2022-10-03 19:40:15

I need to filter a data frame, using as the criterion those rows that contain the string RTB. I'm using dplyr.

library(dplyr)

# %.% was dplyr's original pipe operator; current versions use %>%.
d.del <- df %>%
  group_by(TrackingPixel) %>%
  summarise(MonthDelivery = as.integer(sum(Revenue))) %>%
  arrange(desc(MonthDelivery))

I know I can use the filter function in dplyr, but I don't know exactly how to tell it to check the content of a string.

In particular I want to check the content in the column TrackingPixel. If the string contains the label RTB I want to remove the row from the result.

3 solutions

#1


151  

The answer to the question was already posted by @latemail in the comments above. The second and subsequent arguments of filter are logical conditions, so you can put a regular-expression match with grepl there, like this:

dplyr::filter(df, !grepl("RTB", TrackingPixel))
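
As a minimal sketch of that call on concrete data: the data frame below is a hypothetical stand-in (only the column names TrackingPixel and Revenue come from the question), and fixed = TRUE is an optional addition that makes grepl treat "RTB" as a literal substring instead of a regular expression.

```r
library(dplyr)

# Hypothetical stand-in for the asker's data; only the column names
# TrackingPixel and Revenue are taken from the question.
df <- data.frame(
  TrackingPixel = c("RTB_display", "email_banner", "search_RTB", "affiliate"),
  Revenue = c(10.5, 20.0, 5.25, 7.75),
  stringsAsFactors = FALSE
)

# Keep only rows whose TrackingPixel does NOT contain "RTB".
# fixed = TRUE matches "RTB" literally rather than as a regex.
no_rtb <- dplyr::filter(df, !grepl("RTB", TrackingPixel, fixed = TRUE))
print(no_rtb)
```

Note that grepl returns FALSE for NA values, so with the negation any rows where TrackingPixel is NA would be kept.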

Since you have not provided the original data, I will add a toy example using the mtcars data set. Imagine you are only interested in cars produced by Mazda or Toyota.

mtcars$type <- rownames(mtcars)
dplyr::filter(mtcars, grepl('Toyota|Mazda', type))

   mpg cyl  disp  hp drat    wt  qsec vs am gear carb           type
1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4      Mazda RX4
2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  Mazda RX4 Wag
3 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 Toyota Corolla
4 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1  Toyota Corona

If you would like to do it the other way round, namely excluding Toyota and Mazda cars, the filter command looks like this:

dplyr::filter(mtcars, !grepl('Toyota|Mazda', type))
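
Applied to the pipeline from the question, the exclusion filter could go before the grouping step, so that RTB rows never reach the summary. This is only a sketch: the data is made up for illustration, and it is written with the %>% pipe.

```r
library(dplyr)

# Hypothetical data; the column names follow the question.
df <- data.frame(
  TrackingPixel = c("RTB_a", "email", "email", "search", "RTB_b", "search"),
  Revenue = c(100.4, 20.6, 30.2, 5.9, 50, 4.3),
  stringsAsFactors = FALSE
)

d.del <- df %>%
  filter(!grepl("RTB", TrackingPixel)) %>%   # drop rows containing "RTB"
  group_by(TrackingPixel) %>%
  summarise(MonthDelivery = as.integer(sum(Revenue))) %>%
  arrange(desc(MonthDelivery))

print(d.del)
```

Filtering first also means the excluded rows are not summed at all, which avoids wasted work on large inputs.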

#2


55  

Solution

You can use str_detect from the stringr package, which is part of the tidyverse. str_detect returns TRUE or FALSE for each element of the given vector, indicating whether that element matches the pattern. You can then filter on this logical vector. See Introduction to stringr for details about the stringr package.

library(tidyverse)
# ─ Attaching packages ──────────────────── tidyverse 1.2.1 ─
# ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
# ✔ tibble  1.4.2     ✔ dplyr   0.7.4
# ✔ tidyr   0.7.2     ✔ stringr 1.2.0
# ✔ readr   1.1.1     ✔ forcats 0.3.0
# ─ Conflicts ───────────────────── tidyverse_conflicts() ─
# ✖ dplyr::filter() masks stats::filter()
# ✖ dplyr::lag()    masks stats::lag()

mtcars$type <- rownames(mtcars)
mtcars %>%
  filter(str_detect(type, 'Toyota|Mazda'))
#    mpg cyl  disp  hp drat    wt  qsec vs am gear carb           type
# 1 21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4      Mazda RX4
# 2 21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4  Mazda RX4 Wag
# 3 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1 Toyota Corolla
# 4 21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1  Toyota Corona
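
A few variations on the same idea may be useful; this is a sketch rather than part of the original answer. It negates the result to exclude rows, uses stringr's fixed() modifier for a literal match, and uses regex(ignore_case = TRUE) for a case-insensitive match. The variable names (mt, others, toyota, mazda) are just illustrative.

```r
library(dplyr)
library(stringr)

mt <- mtcars
mt$type <- rownames(mt)

# Exclude rows: negate the logical vector returned by str_detect.
others <- mt %>% filter(!str_detect(type, "Toyota|Mazda"))

# fixed() treats the pattern as a literal string, not a regular expression.
toyota <- mt %>% filter(str_detect(type, fixed("Toyota")))

# regex(ignore_case = TRUE) makes the match case-insensitive.
mazda <- mt %>% filter(str_detect(type, regex("mazda", ignore_case = TRUE)))

print(c(others = nrow(others), toyota = nrow(toyota), mazda = nrow(mazda)))
```

Unlike grepl, str_detect returns NA for NA input, and filter drops rows where the condition is NA, so the two approaches can differ on columns with missing values.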

The advantages of stringr

We should prefer stringr::str_detect() to base::grepl(), for the following reasons.

  • The functions provided by the stringr package start with the prefix str_, which makes the code easier to read.
  • The first argument of stringr functions is always the data (a string or character vector), followed by the parameters. (Thank you, Paolo.)

object <- "stringr"
# stringr: all functions share the prefix `str_`,
# and the first argument is always the object.
stringr::str_count(object) # -> 7
stringr::str_sub(object, 1, 3) # -> "str"
stringr::str_detect(object, "str") # -> TRUE
stringr::str_replace(object, "str", "") # -> "ingr"
# base R: the function names have nothing in common,
# and the position of the object argument varies.
base::nchar(object) # -> 7
base::substr(object, 1, 3) # -> "str"
base::grepl("str", object) # -> TRUE
base::sub("str", "", object) # -> "ingr"

Benchmark

The benchmark results are as follows. For a large data frame, str_detect is faster (about 1.5 times faster than grepl in this run).

library(rbenchmark)
library(tidyverse)

# The data. Data expo 09. ASA Statistics Computing and Graphics 
# http://stat-computing.org/dataexpo/2009/the-data.html
df <- read_csv("Downloads/2008.csv")
print(dim(df))
# [1] 7009728      29

benchmark(
  "str_detect" = {df %>% filter(str_detect(Dest, 'MCO|BWI'))},
  "grepl" = {df %>% filter(grepl('MCO|BWI', Dest))},
  replications = 10,
  columns = c("test", "replications", "elapsed", "relative", "user.self", "sys.self"))
# test replications elapsed relative user.self sys.self
# 2      grepl           10  16.480    1.513    16.195    0.248
# 1 str_detect           10  10.891    1.000     9.594    1.281
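
If the 2008 flights file is not at hand, a rough comparison can be run on synthetic data. The sketch below is an assumption-laden stand-in, not a reproduction of the benchmark above: the data frame df_sim and its airport codes are made up, the timing uses base system.time instead of rbenchmark, and the numbers will vary by machine and package version.

```r
library(dplyr)
library(stringr)

set.seed(1)
# Synthetic stand-in for the flights data: 100,000 random airport codes.
codes <- c("MCO", "BWI", "JFK", "LAX", "ORD", "SFO")
df_sim <- data.frame(
  Dest = sample(codes, 1e5, replace = TRUE),
  stringsAsFactors = FALSE
)

# Time each approach once; assignment inside system.time keeps the result.
t_str <- system.time(
  res_str <- df_sim %>% filter(str_detect(Dest, "MCO|BWI"))
)
t_grepl <- system.time(
  res_grepl <- df_sim %>% filter(grepl("MCO|BWI", Dest))
)

# Both approaches should select exactly the same rows.
stopifnot(identical(res_str$Dest, res_grepl$Dest))

# Elapsed seconds for each approach (machine dependent).
print(c(str_detect = t_str[["elapsed"]], grepl = t_grepl[["elapsed"]]))
```

A single timed run is noisy, so for a real comparison it is better to repeat the measurement, as the benchmark call above does with replications = 10.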

#3


0  

If you want to find the string in any given column, have a look at

Remove row if any column contains a specific string

It is basically about using filter_at or filter_all.
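
As a rough sketch of what that could look like: the example below removes every row in which any column contains "RTB". The data frame is hypothetical. The first form uses the scoped verb filter_all with all_vars; the second uses if_any, which assumes dplyr 1.0.4 or later, where the scoped verbs are superseded.

```r
library(dplyr)

# Hypothetical data with the string spread over different columns.
df <- data.frame(
  a = c("RTB-1", "x", "y"),
  b = c("p", "q", "RTB-2"),
  stringsAsFactors = FALSE
)

# Scoped-verb form: keep rows where ALL columns are free of "RTB".
kept_scoped <- filter_all(df, all_vars(!grepl("RTB", .)))

# if_any form (assumes dplyr >= 1.0.4): drop rows where ANY column matches.
kept_if_any <- filter(df, !if_any(everything(), ~ grepl("RTB", .x)))

print(kept_scoped)
print(kept_if_any)
```

Both calls keep only the second row here; filter_at works the same way but takes a column selection, for example vars(a, b), as its second argument.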
