Is there an R function to escape a string for regex characters?

Time: 2022-12-04 00:16:26

I want to build a regular expression by substituting in some strings to search for. Those strings need to be escaped before they go into the regex, so that the pattern still works when a search string contains regex metacharacters.

Some languages have a function that does this for you (e.g. Python's re.escape: https://*.com/a/10013356/1900520). Does R have such a function?

For example (escape is a made-up function):

x = "foo[bar]"
y = escape(x) # y should now be "foo\\[bar\\]"

2 answers

#1


13  

I've written an R version of Perl's quotemeta function:

library(stringr)
quotemeta <- function(string) {
  str_replace_all(string, "(\\W)", "\\\\\\1")
}

I always use the Perl flavor of regexps, so this works for me. I don't know whether it works with R's default ("normal") regexps.

Edit: I found the source explaining why this works. It's in the Quoting Metacharacters section of the perlre manpage:

This was once used in a common idiom to disable or quote the special meanings of regular expression metacharacters in a string that you want to use for a pattern. Simply quote all non-"word" characters:

$pattern =~ s/(\W)/\\$1/g;

As you can see, the R code above is a direct translation of this same substitution (after a trip through backslash hell). The manpage also says (emphasis mine):

Unlike some other regular expression languages, there are no backslashed symbols that aren't alphanumeric.

which reinforces my point that this solution is only guaranteed for PCRE.
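
For readers without stringr, the same substitution can be sketched in base R with the PCRE engine selected explicitly. This is only an illustration: quotemeta_base is a hypothetical name, and the sample strings are made up.

```r
# Base-R sketch of the substitution above; quotemeta_base is a hypothetical name.
quotemeta_base <- function(string) {
  # Prefix every non-word character with a backslash, using the PCRE engine.
  gsub("(\\W)", "\\\\\\1", string, perl = TRUE)
}

x <- "foo[bar]"
pattern <- quotemeta_base(x)

# The escaped pattern matches the literal text "foo[bar]" ...
hit <- grepl(pattern, "see foo[bar] here", perl = TRUE)

# ... and no longer treats [bar] as a character class, so "foob" is not a match.
miss <- grepl(pattern, "see foob here", perl = TRUE)
```

Passing perl = TRUE both when escaping and when matching keeps the example inside the one engine this answer is guaranteed for.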

#2


7  

Apparently there is a function called escapeRegex in the Hmisc package. The function itself has the following definition for an input value of 'string':

gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
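
To try that definition without installing Hmisc, the quoted gsub() call can be wrapped in a small function. This is a sketch only: escape_regex is a hypothetical name, not the package's own export, and the inputs are made up.

```r
# Hypothetical wrapper around the gsub() call quoted above.
escape_regex <- function(string) {
  gsub("([.|()\\^{}+$*?]|\\[|\\])", "\\\\\\1", string)
}

escaped <- escape_regex(c("a.b*c", "foo[bar]"))

# With the dot escaped, "a.b" only matches a literal dot.
literal_only <- grepl(escape_regex("a.b"), c("a.b", "axb"))
```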

My previous answer:

I'm not sure if there is a built-in function, but you could write one to do what you want. This one creates a vector of the values you want to replace and a vector of what to replace them with, then loops through them making the necessary replacements.

re.escape <- function(strings){
    vals <- c("\\\\", "\\[", "\\]", "\\(", "\\)", 
              "\\{", "\\}", "\\^", "\\$","\\*", 
              "\\+", "\\?", "\\.", "\\|")
    replace.vals <- paste0("\\\\", vals)
    for(i in seq_along(vals)){
        strings <- gsub(vals[i], replace.vals[i], strings)
    }
    strings
}

Some output

> test.strings <- c("What the $^&(){}.*|?", "foo[bar]")
> re.escape(test.strings)
[1] "What the \\$\\^&\\(\\)\\{\\}\\.\\*\\|\\?"
[2] "foo\\[bar\\]"  
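
As a sketch of how the escaped strings might then be dropped into a larger regex, as the question asks, the block below joins several escaped search terms into one alternation. re.escape is copied from above so the block runs on its own; the terms and haystack vectors are made-up examples.

```r
# Copy of re.escape from the answer above, repeated so this block is self-contained.
re.escape <- function(strings) {
  vals <- c("\\\\", "\\[", "\\]", "\\(", "\\)",
            "\\{", "\\}", "\\^", "\\$", "\\*",
            "\\+", "\\?", "\\.", "\\|")
  replace.vals <- paste0("\\\\", vals)
  for (i in seq_along(vals)) {
    strings <- gsub(vals[i], replace.vals[i], strings)
  }
  strings
}

# Hypothetical search terms that contain regex metacharacters.
terms <- c("foo[bar]", "a.b", "1+1")

# Escape each term, then join them into a single alternation pattern.
pattern <- paste(re.escape(terms), collapse = "|")

# Only the elements containing one of the literal terms should match.
haystack <- c("x foo[bar] y", "axb", "1+1=2", "11")
matches <- grepl(pattern, haystack)
```

Without the escaping step, "a.b" would also match "axb" and "1+1" would match "11".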
