How do I match n words in a regular expression?

Time: 2022-12-01 16:40:25

After scratching my head and extensive googling, I can't seem to get this right.

I have this sample string:

test = "true sales are expected to be between 50% and 60% higher than those reported for the previous corresponding year. the main reason is blah blah. the fake sales are expected to be in the region of between 25% and 35% lower."

I'm trying to determine whether the 'true' sales were higher or lower. Using R and the 'stringr' library, I'm trying it as follows:

test = "true sales are expected to be between 50% and 60% higher than those reported for the previous corresponding year. the main reason is blah blah. the fake sales are expected to be in the region of between 25% and 35% lower."
positive.regex = "(sales).*?[0-9]{1,3}% higher"
negative.regex = "(sales).*?[0-9]{1,3}% lower"

Which yields the following results:

str_extract(test,positive.regex)
[1] "sales are expected to be between 50% and 60% higher"
str_extract(test,negative.regex)
[1] "sales are expected to be between 50% and 60% higher than those reported for the previous corresponding year. the main reason is blah blah. the fake sales are expected to be in the region of between 25% and 35% lower"

I'm trying to find a way to limit the number of words matched between (sales) and '% higher' or '% lower', so that the negative regex won't match. That is, I know I need to replace '.*?' with something that matches whole words rather than characters, and limit the number of those words to something like 3-5. How can I do this?

4 solutions

#1


2  

You have to ensure that the words higher or lower do not occur in the .*? part of your regex. One way to do this is to use a negative lookahead assertion:

positive.regex = "sales(?:(?!higher|lower).)*[0-9]{1,3}% higher"
negative.regex = "sales(?:(?!higher|lower).)*[0-9]{1,3}% lower"

Explanation:

(?:      # Match...
 (?!     #  (unless we're at the start of the word
  higher #   "higher"
 |       #   or
  lower  #   "lower"
 )       #  )
 .       # any character
)*       # Repeat any number of times.
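
To see how these patterns behave on the sample string, here is a minimal sketch in base R. It assumes `perl = TRUE` (base R's default engine does not support lookaheads), and the `"true "` prefix is my own addition rather than part of the patterns above: without it, the negative pattern still finds a match, just in the "fake sales" sentence.

```r
test <- "true sales are expected to be between 50% and 60% higher than those reported for the previous corresponding year. the main reason is blah blah. the fake sales are expected to be in the region of between 25% and 35% lower."

# Tempered gap: any characters, but never running past "higher" or "lower".
gap <- "(?:(?!higher|lower).)*"

# Unanchored negative pattern, as in the answer above.
unanchored.negative <- paste0("sales", gap, "[0-9]{1,3}% lower")

# Assumption: prefixing "true " ties the match to the true-sales sentence.
true.positive <- paste0("true sales", gap, "[0-9]{1,3}% higher")
true.negative <- paste0("true sales", gap, "[0-9]{1,3}% lower")

# The unanchored pattern skips ahead and matches the fake-sales sentence.
fake.hit <- regmatches(test, regexpr(unanchored.negative, test, perl = TRUE))

is.higher <- grepl(true.positive, test, perl = TRUE)  # TRUE
is.lower  <- grepl(true.negative, test, perl = TRUE)  # FALSE
```

Whether anchoring on "true sales" is appropriate depends on how consistently the real text uses that wording.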

#2


1  

This uses the gsubfn package. It finds occurrences of the indicated regexp and then checks whether the match contains at most max.words words, returning the match only if it does:

library(gsubfn)

max.words <- 11
num.words <- function(x) length(strsplit(x, "\\s+")[[1]])

strapply(test, "(sales.*?\\d+% (higher|lower))", function(x, y) 
    if (num.words(x) <= max.words) x)

If desired we could expand the if statement to limit it to "higher" or "lower":

strapply(test, "(sales.*?\\d+% (higher|lower))", function(x, y) 
    if (num.words(x) <= max.words && y == "higher") x)

The function could alternatively be written in formula notation like this (in the case of the last one above):

strapply(test, "(sales.*?\\d+% (higher|lower))", 
    ... ~ if (num.words(..1) <= max.words && ..2 == "higher") ..1)
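
The word limit can also be written into the pattern itself, which is closer to what the question asks for. The following is only a sketch in base R, assuming a "word" is any run of non-whitespace characters; the names `word.gap` and the limit of 8 are illustrative choices, not taken from the answer above. In the sample string, seven words sit between "sales" and "60% higher", so a 3-5 limit would be too tight.

```r
test <- "true sales are expected to be between 50% and 60% higher than those reported for the previous corresponding year. the main reason is blah blah. the fake sales are expected to be in the region of between 25% and 35% lower."

# Hypothetical limit on the number of whole words allowed in the gap.
max.words <- 8

# Between 0 and max.words whitespace-separated words, matched lazily.
word.gap <- paste0("(?:\\s+\\S+){0,", max.words, "}?\\s+")

positive.regex <- paste0("sales", word.gap, "[0-9]{1,3}% higher")
negative.regex <- paste0("sales", word.gap, "[0-9]{1,3}% lower")

pos.hit <- regmatches(test, regexpr(positive.regex, test, perl = TRUE))
neg.found <- grepl(negative.regex, test, perl = TRUE)
```

With this limit, `pos.hit` holds the true-sales phrase and `neg.found` is `FALSE`, because "35% lower" is eleven words away from the nearest "sales". A larger limit would let the negative pattern match the fake-sales sentence.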

#3


0  

Why not use a regular expression that matches both? You can then check if the last word was "higher" or "lower".

r <- "sales.*?[0-9]{1,3}% (higher|lower)"
str_match_all(test,r)
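
The check on the last word could look roughly like this. The sketch uses base R instead of stringr so that it runs without extra packages, and it assumes the first match in the text is the one about true sales, which holds for the sample string but not necessarily in general.

```r
test <- "true sales are expected to be between 50% and 60% higher than those reported for the previous corresponding year. the main reason is blah blah. the fake sales are expected to be in the region of between 25% and 35% lower."

r <- "sales.*?[0-9]{1,3}% (higher|lower)"

# All non-overlapping matches, in order of appearance.
hits <- regmatches(test, gregexpr(r, test, perl = TRUE))[[1]]

# Drop everything up to the last space, leaving the final word of each match.
direction <- sub(".* ", "", hits)

# Assumption: the first match belongs to the true-sales sentence.
true.direction <- direction[1]
```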

#4


0  

If you simply used this:

true sales.+higher

... it would work but for the fact that it might end up matching if later the sentence said "fake sales are higher" as well. So to get around that, use this:

true sales.+higher.+fake

If the above matches, then true sales are indeed higher. If the following matches:

true sales.+lower.+fake

Then true sales are lower. It is a bit crude of course. You might want to replace the dot with [\s\S] in order to include line breaks as well. Hope this helps.
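
As a rough illustration, the two checks could be run in base R like this, using `[\s\S]` in place of the dot. `perl = TRUE` is my choice here, and the approach assumes the text always mentions "fake" somewhere after the true-sales sentence; if it does not, neither pattern matches.

```r
test <- "true sales are expected to be between 50% and 60% higher than those reported for the previous corresponding year. the main reason is blah blah. the fake sales are expected to be in the region of between 25% and 35% lower."

# "higher" must appear after "true sales" and before a later "fake".
true.higher <- grepl("true sales[\\s\\S]+higher[\\s\\S]+fake", test, perl = TRUE)

# "lower" only occurs after "fake" in the sample, so this does not match.
true.lower <- grepl("true sales[\\s\\S]+lower[\\s\\S]+fake", test, perl = TRUE)
```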
