I'm trying to clean some short strings (1-3 letters) stored in a column of an R data frame. Specifically, consider the following R script:
df = data.frame( "original" = c("ABCDE FG H",
"IJKL MN OPQRS",
"TUV WX YZ AAAA"))
df$filter1 = gsub("(^|\\s)[A-Z]{1,2}($|\\s)", " ", df$original)
df$filter2 = gsub("\\b[A-Z]{1,2}\\b", " ", df$original)
> df
original | filter1 | filter2 |
1 ABCDE FG H | ABCDE H | ABCDE |
2 IJKL MN OPQRS | IJKL OPQRS | IJKL OPQRS|
3 TUV WX YZ AAAA | TUV YZ AAAA| TUV AAAA |
I don't understand why the first filter, (^|\\s)[A-Z]{1,2}($|\\s), doesn't replace "H" in the first row or "YZ" in the third one. I expected the same result as with \\b[A-Z]{1,2}\\b (the filter2 column). Please don't worry about the multiple spaces; they aren't important to me (unless they are the problem :)).
I thought the problem might be the "globality" of the operation, that is, that once it replaces the first match it doesn't replace the second one. But that isn't the case, as the following replacement shows:
> gsub("A", "X", "AAAABBBBCCCDDDDAAAAAAAEEE")
[1] "XXXXBBBBCCCDDDDXXXXXXXEEE"
So why are the results different?
1 Answer
The point is that gsub can only match non-overlapping strings. The first expected match is FG with the spaces around it, and the second is H with the space before it. Those two matches would have to share the space between FG and H, so they overlap. Once "(^|\\s)[A-Z]{1,2}($|\\s)" has consumed the trailing space after FG, H simply does not match the pattern.
Look: ABCDE FG H is analyzed from left to right. The expression matches FG together with the space on each side, and the regex index is then right before H. Only this letter is left to match, but (^|\\s) requires a whitespace character or the start of the string, and there is neither at this location.
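To make the consumed text visible, one option is to wrap each whole match in brackets instead of replacing it with a space. This is only an illustrative sketch: the extra outer capture group and the variable names x and marked are my additions, not part of the original code.

```r
x <- c("ABCDE FG H", "TUV WX YZ AAAA")

# The outer group captures the whole match, including the whitespace
# taken by (^|\\s) and ($|\\s), so "[\\1]" marks what was consumed.
marked <- gsub("((^|\\s)[A-Z]{1,2}($|\\s))", "[\\1]", x)

print(marked)
# [1] "ABCDE[ FG ]H"      "TUV[ WX ]YZ AAAA"
```

The brackets enclose both spaces around FG and WX, so nothing is left in front of H or YZ for (^|\\s) to match.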
To "fix" this while keeping the same logic, you can use a PCRE regex (gsub with perl=TRUE) with lookarounds:
df$filter1 = gsub("(^|\\s)[A-Z]{1,2}(?=$|\\s)", " ", df$original, perl=TRUE)
or
df$filter1 = gsub("(?<!\\S)[A-Z]{1,2}(?!\\S)", " ", df$original, perl=TRUE)
If you need to actually consume (and thus remove) the spaces, just add \\s* before and/or after the pattern.
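As a rough sketch of that last suggestion, the lookaround pattern can be prefixed with \\s* and the match replaced with an empty string, which also removes the space in front of each short token. The column name cleaned is hypothetical, and the data frame simply repeats the sample data from the question.

```r
df <- data.frame(original = c("ABCDE FG H",
                              "IJKL MN OPQRS",
                              "TUV WX YZ AAAA"))

# \\s* consumes any whitespace before the short token; the lookarounds
# (?<!\\S) and (?!\\S) only assert and do not consume anything.
df$cleaned <- gsub("\\s*(?<!\\S)[A-Z]{1,2}(?!\\S)", "", df$original, perl = TRUE)

print(df$cleaned)
# [1] "ABCDE"      "IJKL OPQRS" "TUV AAAA"
```

Note that a short token at the very start of a string would leave a leading space behind, which trimws() could remove if that matters.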
The second expression, "\\b[A-Z]{1,2}\\b", contains word boundaries. These are zero-width assertions that do not consume text, so the regex engine can match both FG and H, since the spaces are not consumed.
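The same bracket-marking trick can be used to check this. In the sketch below (the capture group and the variable names x and marked_wb are illustrative additions), the brackets enclose only the letters, and the spaces stay outside the matches:

```r
x <- "ABCDE FG H"

# \\b asserts a position and consumes nothing, so each match is just the
# one or two letters and the following token can still be matched.
marked_wb <- gsub("(\\b[A-Z]{1,2}\\b)", "[\\1]", x)

print(marked_wb)
# [1] "ABCDE [FG] [H]"
```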