How do I use back references with "grep" in R?

I am looking for an elegant way of returning back references using regular expressions in R. Let me explain:

Let's say I want to find strings that start with a month name:

x <- c("May, 1, 2011", "30 June 2011")
grep("May|^June", x, value=TRUE)
[1] "May, 1, 2011"

This works, but I really want to isolate the month (i.e. "May"), not the entire matched string.

So, one can use gsub to return the back reference via the replacement argument. But this has two problems:

You have to wrap the pattern inside ".*(pattern).*" so that the substitution occurs on the entire string.
Rather than returning NA for non-matched strings, gsub returns the original string. This is clearly not what I desire.

The code and results:

gsub(".*(^May|^June).*", "\\1", x) 
[1] "May"          "30 June 2011"

I could probably code a workaround by doing all kinds of additional checks, but this quickly becomes very messy.

To be crystal clear, the desired results should be:

[1] "May"          NA

Is there an easy way of achieving this?

3 Solutions

#1

The stringr package has a function exactly for this purpose:

library(stringr)
x <- c("May, 1, 2011", "30 June 2011", "June 2012")
str_extract(x, "May|^June")
# [1] "May"  NA     "June"

It's a fairly thin wrapper around regexpr, but stringr generally makes string handling easier by being more consistent than base R functions.
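
If the capture group itself is wanted rather than the whole match, stringr's str_match may be a closer fit. The following is a minimal sketch, not part of the original answer; it assumes the pattern can be anchored as "^(May|June)" so that both months must start the string, which differs slightly from the "May|^June" pattern used above.

```r
library(stringr)

x <- c("May, 1, 2011", "30 June 2011", "June 2012")

# str_match returns a character matrix: column 1 holds the full match,
# column 2 the first capture group; rows without a match are NA.
parts <- str_match(x, "^(May|June)")

# Keep only the first back reference (the month name).
month <- parts[, 2]
print(month)
```

Non-matching strings come back as NA here, which is the behaviour asked for in the question.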

#2

regexpr is similar to grep, but returns the position and length of the (first) match in each string:

> x <- c("May, 1, 2011", "30 June 2011", "June 2012")
> m <- regexpr("May|^June", x)
> m
[1]  1 -1  1
attr(,"match.length")
[1]  3 -1  4

This means that the first string had a match of length 3 starting at position 1, the second string had no match, and the third string had a match of length 4 starting at position 1.

To extract the matches, you could use something like:

> m[m < 0] = NA
> substr(x, m, m + attr(m, "match.length") - 1)
[1] "May"  NA     "June"
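
The approach above returns the whole match. For an actual capture group, base R's regexec together with regmatches is one possible route. This is a sketch under assumptions, not the answerer's method: the pattern "^(May|June)" anchors both months to the start of the string, and the variable names groups and month are made up for illustration.

```r
x <- c("May, 1, 2011", "30 June 2011", "June 2012")

# regexec records the position of the full match and of each capture group.
m <- regexec("^(May|June)", x)

# regmatches yields one character vector per input string:
# c(full match, group 1) on a match, character(0) otherwise.
groups <- regmatches(x, m)

# Take the first capture group, substituting NA where nothing matched.
month <- vapply(
  groups,
  function(g) if (length(g) >= 2) g[2] else NA_character_,
  character(1)
)
print(month)
# [1] "May"  NA     "June"
```

Mapping the empty results to NA_character_ explicitly keeps the output the same length as the input.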

#3

The gsubfn package is more general than the grep and regexpr functions and provides ways to return the back references; see its strapply function.
