Regular expression and negative lookahead matching

I am trying to write a regular expression whose matching pattern excludes certain strings. It should remove all number-only and alphanumeric strings, and also remove all punctuation marks, but keep certain meaningful strings (911, K-12, K9, E-COMMERCE, etc.).

I figured I need to use a negative lookahead and specify what needs to be skipped. The pattern works almost as needed, but it fails for a couple of inputs. Below are the code and the results of the matching; where a result is wrong, I've noted what it should be. The cases I can't figure out are strings that combine punctuation, digits, and letters. Any help is greatly appreciated. Thanks.

blah <- c('ASDF911 2346', 'E-COMMERCE', 'AMAZON E-COMMERCE', 'K-12 89752 911', '65426 -', 'TEACHERK-12', 'K9 OFFICER', 'WORK - K-9564', 'DEVELOPER C++', ' C+ C +5', 'DEFAULT - 456')
gsub('(^| )(?!(911|E[-]COMMERCE|K[-]12|C[+]{1,2}))([[:punct:]]|[0-9]+|([0-9]+[A-Z]+|[A-Z]+[0-9]+)[0-9A-Z]*)', ' ', blah, perl = TRUE)

" "                     # OK
"E-COMMERCE"            # OK
"AMAZON E-COMMERCE"     # OK
"K-12  911"             # OK
"  "                    # OK
"TEACHERK-12"           # this should be "  "
"K9 OFFICER"            # OK
"WORK K-9564"           # this should be "WORK   "
"DEVELOPER C++"         # OK
" C+ C 5"               # this should be " C+ C "
"DEFAULT  "             # OK

1 Answer

#1

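It would be easier to match both the whitelisted keywords and the strings to remove, and then replace each match with the captured whitelisted keyword (which is empty when a removable string matched instead):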

gsub('(?:\\b(911\\b|E-COMMERCE\\b|K-12\\b|C\\b[+]{0,2})|[[:punct:]]|[A-Z-]*[0-9][A-Z0-9-]*)', '\\1', blah, perl = TRUE)

Output:

" "
"E-COMMERCE"
"AMAZON E-COMMERCE"
"K-12  911"
" "
""
" OFFICER"   # Should this really be "K9 OFFICER"?
"WORK  "
"DEVELOPER C++"
" C+ C "
"DEFAULT  "

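The "match both, keep the capture" idea can be seen on a smaller case. The sketch below is only an illustration, not the answer's pattern: the vector `demo` and the two-item whitelist (911 and K9, both named in the question as strings to keep) are assumptions chosen for brevity.

```r
# Hypothetical input, chosen only to illustrate the idiom.
demo <- c("CALL 911 NOW", "K9 OFFICER", "ROOM 42B")

# First alternative: a whitelisted keyword, captured as group 1.
# Second alternative: any other run of letters/digits containing a digit.
keep_or_drop <- "\\b(911|K9)\\b|[A-Z]*[0-9][A-Z0-9]*"

# Replacing with \\1 puts a whitelisted keyword back unchanged; when the
# second alternative matched, group 1 is unset and the match is removed.
demo_out <- gsub(keep_or_drop, "\\1", demo, perl = TRUE)
print(demo_out)
```

Because the whitelist alternative is tried first at each position, "911" and "K9" survive, while "42B" is consumed by the second alternative and replaced with nothing. Adding `K9\\b` to the capture group of the pattern above would presumably keep "K9 OFFICER" in the same way.
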
\b is a word boundary. It is a zero-width assertion that matches the empty string at either edge of a sequence of word characters ([A-Za-z0-9_]). It is an optimized version of (?<!\w)(?=\w)|(?<=\w)(?!\w).
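As a quick check of how \b behaves, the following sketch tests a few strings against `\b911\b` and against the same pattern with each \b spelled out as the lookaround expression. The vector `wb_input` and the helper variable `wb` are illustrative assumptions, not part of the original code.

```r
# Hypothetical test strings.
wb_input <- c("911", "ASDF911", "911 CALL", "9110")

# \b on both sides: 911 must not touch another word character.
wb_short <- grepl("\\b911\\b", wb_input, perl = TRUE)

# The same test with \b written out as lookarounds.
wb <- "(?:(?<!\\w)(?=\\w)|(?<=\\w)(?!\\w))"
wb_long <- grepl(paste0(wb, "911", wb), wb_input, perl = TRUE)

print(wb_short)
print(wb_long)
```

Both calls agree: "911" and "911 CALL" match, while "ASDF911" and "9110" do not, because 911 is glued to another word character there. This is why the answer's pattern leaves the 911 in 'ASDF911 2346' to the digit-string alternative.
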

[A-Z-]*[0-9][A-Z0-9-]* matches strings of letters, digits and dashes, with at least one digit in them.
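To see which tokens this sub-pattern accepts, it can be tested on its own. In the sketch below the `^` and `$` anchors are added only so that each whole token is tested; they are not part of the pattern in the answer, and the vector `tokens` is an assumption assembled from fragments of the question's data.

```r
# Tokens taken from the question's examples, tested one at a time.
tokens <- c("2346", "ASDF911", "K-9564", "OFFICER", "E-COMMERCE")

# Letters and dashes, then a mandatory digit, then letters/digits/dashes.
has_digit <- grepl("^[A-Z-]*[0-9][A-Z0-9-]*$", tokens, perl = TRUE)
print(has_digit)
```

The first three tokens contain a digit and match, so the answer's gsub call removes them; "OFFICER" and "E-COMMERCE" contain no digit and are left to the other alternatives.
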

http://ideone.com/E3TUU5
