Difference between (^|\\s)([A-Z]{1,3})(\\s|$) and \\b[A-Z]{1,2}\\b regular expressions in R

Date: 2022-05-27 20:13:48

I'm trying to clean out some short strings (1-3 letters) stored in a column of an R data frame. Specifically, consider the following R script:

df = data.frame( "original" = c("ABCDE FG H",
                            "IJKL MN OPQRS", 
                            "TUV WX YZ AAAA"))
df$filter1 = gsub("(^|\\s)[A-Z]{1,2}($|\\s)", " ", df$original)
df$filter2 = gsub("\\b[A-Z]{1,2}\\b", " ", df$original)

> df

        original |     filter1 |      filter2 |
1     ABCDE FG H |     ABCDE H | ABCDE        |
2  IJKL MN OPQRS |  IJKL OPQRS | IJKL   OPQRS |
3 TUV WX YZ AAAA | TUV YZ AAAA | TUV   AAAA   |

I don't understand why the first filter, (^|\\s)[A-Z]{1,2}($|\\s), doesn't replace "H" in the first row or "YZ" in the third one. I expected the same result as with \\b[A-Z]{1,2}\\b (the filter2 column). Please don't worry about the multiple spaces; they aren't important to me (unless they are the cause of the problem :)).

I thought the problem might be the "globality" of the operation, that is, that once it finds the first match it doesn't replace the second one. But that isn't the case, as the following replacement shows:

> gsub("A", "X", "AAAABBBBCCCDDDDAAAAAAAEEE")
[1] "XXXXBBBBCCCDDDDXXXXXXXEEE"

So, why are the results different?

1 Answer

#1


The point is that gsub can only find non-overlapping matches. FG is the first expected match and H the second, but those two matches would overlap, because both need the space between them: once "(^|\\s)[A-Z]{1,2}($|\\s)" has consumed the trailing space after FG, H no longer matches the pattern.

Look: ABCDE FG H is scanned from left to right. The expression matches " FG " (including both surrounding spaces), which leaves the regex index right before H. Only this letter remains to be matched, but (^|\\s) requires a whitespace character or the start of the string, and there is neither at this location.
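The scan described above can be inspected directly by listing what each pattern actually matches. This is only a small sketch: the variable names are made up for illustration, and perl = TRUE is an assumption added here so that both patterns run on the same engine (the question used the default one).

```r
x <- "ABCDE FG H"

# Consuming pattern: each match includes the whitespace on both sides
consuming_hits <- regmatches(
  x, gregexpr("(^|\\s)[A-Z]{1,2}($|\\s)", x, perl = TRUE)
)[[1]]

# Word-boundary pattern: the boundaries are zero-width, so only letters match
boundary_hits <- regmatches(
  x, gregexpr("\\b[A-Z]{1,2}\\b", x, perl = TRUE)
)[[1]]

print(consuming_hits)  # [1] " FG "
print(boundary_hits)   # [1] "FG" "H"
```

The first result contains a single match that includes the space before H, which is why nothing is left for H to match against.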

To "fix" this while keeping the same logic, you can use a PCRE regex (perl=TRUE) in gsub with lookarounds:

df$filter1 = gsub("(^|\\s)[A-Z]{1,2}(?=$|\\s)", " ", df$original, perl=TRUE)

or

df$filter1 = gsub("(?<!\\S)[A-Z]{1,2}(?!\\S)", " ", df$original, perl=TRUE)

and if you need to actually consume (that is, remove) the spaces, just add \\s* before and/or after the pattern.
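As a rough check of the two lookaround patterns and the \\s* variant, the sketch below runs them on the strings from the question. The names fix_a, fix_b, fix_c and the squish helper are hypothetical and not part of the answer; squish only collapses repeated whitespace so the results are easier to compare.

```r
x <- c("ABCDE FG H", "IJKL MN OPQRS", "TUV WX YZ AAAA")

# Lookahead leaves the trailing space available for the next match
fix_a <- gsub("(^|\\s)[A-Z]{1,2}(?=$|\\s)", " ", x, perl = TRUE)

# Lookbehind plus lookahead: no whitespace is consumed at all
fix_b <- gsub("(?<!\\S)[A-Z]{1,2}(?!\\S)", " ", x, perl = TRUE)

# Hypothetical helper: collapse runs of whitespace and trim both ends
squish <- function(s) trimws(gsub("\\s+", " ", s))

# Variant with \\s* in front and an empty replacement, so the
# whitespace before each short token is removed together with it
fix_c <- gsub("\\s*(?<!\\S)[A-Z]{1,2}(?!\\S)", "", x, perl = TRUE)

print(squish(fix_a))
print(squish(fix_b))
print(fix_c)
```

Note that fix_c would leave a leading space if a short token sat at the very start of a string, so wrapping it in trimws may still be worthwhile.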

The second expression, "\\b[A-Z]{1,2}\\b", contains word boundaries. These are zero-width assertions that do not consume any text, so the regex engine can match both FG and H, since the spaces are not consumed.
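One caveat that may matter if the real data contains punctuation (an assumption, since the question only shows letters and spaces): a word boundary also occurs next to characters such as a hyphen, whereas the whitespace lookarounds only accept whitespace or the string ends. A minimal sketch with a made-up input:

```r
x <- "AB-CDE F"  # hypothetical input, not taken from the question

# \\b sees a boundary between "B" and "-", so "AB" is removed
by_boundary <- gsub("\\b[A-Z]{1,2}\\b", "", x, perl = TRUE)

# (?!\\S) rejects "AB" because the next character "-" is not whitespace
by_space <- gsub("(?<!\\S)[A-Z]{1,2}(?!\\S)", "", x, perl = TRUE)

print(by_boundary)  # [1] "-CDE "
print(by_space)     # [1] "AB-CDE "
```

Which behavior is preferable depends on whether hyphenated tokens should count as separate words.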
