grep: is there an AND operator?

Time: 2023-02-04 17:03:15

Suppose I have the following data frame:

User.Id    Tags
34234      imageUploaded,people.jpg,more,comma,separated,stuff
34234      imageUploaded
12345      people.jpg

How might I use grep (or some other tool) to only grab rows that include both "imageUploaded" and "people"? In other words, how might I create a subset that includes just the rows with the strings "imageUploaded" AND "people.jpg", regardless of order.

I have tried:

data.people<-data[grep("imageUploaded|people.jpg",results$Tags),]
data.people<-data[grep("imageUploaded?=people.jpg",results$Tags),]

Is there an AND operator? Or perhaps another way to get the intended result?

4 solutions

#1


17  

Thanks to this answer, this regex seems to work. You want to use grepl(), which returns a logical vector that you can use to index into your data object. I won't claim to fully understand the inner workings of the regex, but regardless:

x <- c("imageUploaded,people.jpg,more,comma,separated,stuff", "imageUploaded", "people.jpg")

grepl("(?=.*imageUploaded)(?=.*people\\.jpg)", x, perl = TRUE)
#-----
# [1]  TRUE FALSE FALSE
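
For readers who, like the answer's author, are unsure why the pattern works: each (?=...) is a zero-width lookahead, which checks that its sub-pattern can match from the current position without consuming any characters. Because neither lookahead moves the position, both are tested from the same starting point, and that is what gives AND semantics regardless of order. A small sketch follows; the fourth string, with the tags reversed, is my own addition and not part of the question.

```r
x <- c("imageUploaded,people.jpg,more,comma,separated,stuff",
       "imageUploaded",
       "people.jpg",
       "people.jpg,imageUploaded")  # hypothetical extra row: tags in reverse order

# Each lookahead on its own behaves like a plain "contains" test
has_upload <- grepl("(?=.*imageUploaded)", x, perl = TRUE)
has_people <- grepl("(?=.*people\\.jpg)", x, perl = TRUE)

# Chaining the lookaheads requires both to succeed at the same position
both <- grepl("(?=.*imageUploaded)(?=.*people\\.jpg)", x, perl = TRUE)

# The combined pattern agrees with AND-ing the two separate tests
identical(both, has_upload & has_people)
both
```

The reversed-order row is matched as well, which a sequential pattern such as "imageUploaded.*people\\.jpg" would not do.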

#2


11  

I love @Chase's answer, and it makes good sense to me, but it can be a bit dangerous to use constructs that one doesn't totally understand.

This answer is meant to reassure anyone who'd like to use @thelatemail's more straightforward approach that it works just as well and is completely competitive speedwise. It's certainly what I'd use in this case. (It's also reassuring that the more sophisticated Perl-compatible-regex pays no performance cost for its power and easy extensibility.)

library(rbenchmark)
x <- paste0(sample(letters, 1e6, replace = TRUE), ## A longer vector of
            sample(letters, 1e6, replace = TRUE)) ## possible matches

## Both methods give identical results
tlm <- grepl("a", x, fixed=TRUE) & grepl("b", x, fixed=TRUE)
pat <- "(?=.*a)(?=.*b)"
Chase <- grepl(pat, x, perl=TRUE)
identical(tlm, Chase)
# [1] TRUE    

## Both methods are similarly fast
benchmark(
    thelatemail = grepl("a", x, fixed=TRUE) & grepl("b", x, fixed=TRUE),
    Chase = grepl(pat, x, perl=TRUE))
#          test replications elapsed relative user.self sys.self
# 2       Chase          100    9.89    1.105      9.80     0.10
# 1 thelatemail          100    8.95    1.000      8.47     0.48
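
Applied to the data frame from the question, the two-grepl() approach benchmarked above might look like the following sketch. The data frame is rebuilt here from the question's sample rows, and the names data and data.people follow the question; keep is a name I introduced for illustration.

```r
# Sample data reconstructed from the question
data <- data.frame(
  User.Id = c(34234, 34234, 12345),
  Tags = c("imageUploaded,people.jpg,more,comma,separated,stuff",
           "imageUploaded",
           "people.jpg"),
  stringsAsFactors = FALSE
)

# fixed = TRUE treats each pattern as a literal string, so "." needs no escaping
keep <- grepl("imageUploaded", data$Tags, fixed = TRUE) &
        grepl("people.jpg", data$Tags, fixed = TRUE)

# Keep only the rows where both tags are present
data.people <- data[keep, ]
data.people
```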

#3


8  

For readability's sake, you could just do:

x <- c(
       "imageUploaded,people.jpg,more,comma,separated,stuff",
       "imageUploaded",
       "people.jpg"
       )

xmatches <- intersect(
                      grep("imageUploaded",x,fixed=TRUE),
                      grep("people.jpg",x,fixed=TRUE)
                     )
x[xmatches]
# [1] "imageUploaded,people.jpg,more,comma,separated,stuff"
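
If the number of required tags grows, the same idea extends without writing one call per tag. A minimal sketch, assuming the tags are held in a character vector; the names required, hits and all_present are mine, not from the answer, and grepl() is used in place of grep() so the results can be combined with &.

```r
x <- c("imageUploaded,people.jpg,more,comma,separated,stuff",
       "imageUploaded",
       "people.jpg")

required <- c("imageUploaded", "people.jpg", "comma")  # hypothetical tag list

# One logical vector per tag, then AND them together element-wise
hits <- lapply(required, grepl, x = x, fixed = TRUE)
all_present <- Reduce(`&`, hits)

x[all_present]
```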

#4


1  

Below is an alternative to grep using Hadley's stringr::str_detect(), which avoids the need for perl = TRUE (@jan-stanstrup). Additionally, dplyr::filter() returns the matching rows of the data frame itself, so you never need to leave the data frame.

library(stringr)
library(dplyr)
 x <- data.frame(User.Id =c(34234,34234,12345), 
                 Tags=c("imageUploaded,people.jpg,more,comma,separated,stuff",
                        "imageUploaded",
                        "people.jpg"))

 data.people <- x %>% filter(str_detect(Tags,"(?=.*imageUploaded)(?=.*people\\.jpg)"))
 data.people

# returns
#  User.Id                                                Tags
# 1   34234 imageUploaded,people.jpg,more,comma,separated,stuff

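A simpler pattern works if "people.jpg" always comes after "imageUploaded" in the string. Note that str_extract() returns the matched text (or NA where there is no match) rather than a logical, and that it should be given the Tags column, not the whole data frame:

str_extract(x$Tags,"imageUploaded.*people\\.jpg")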
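
The order dependence of that simpler pattern is easy to check in base R. In the sketch below, the second string is a made-up example with the tags reversed; the sequential pattern misses it, while the lookahead pattern still matches. The names tags, ordered and unordered are introduced here for illustration only.

```r
tags <- c("imageUploaded,people.jpg,more",
          "people.jpg,imageUploaded")  # hypothetical: same tags, reversed order

# Sequential pattern: requires "imageUploaded" to appear before "people.jpg"
ordered <- grepl("imageUploaded.*people\\.jpg", tags)

# Lookahead pattern: order does not matter
unordered <- grepl("(?=.*imageUploaded)(?=.*people\\.jpg)", tags, perl = TRUE)

ordered
unordered
```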
