Regex to match the longer string in an OR

Time: 2021-08-19 04:56:59

Motivation

I'm parsing addresses and need to get the address and the country in separate matches, but the countries might have aliases, e.g.:

UK == United Kingdom, 
US == USA == United States,
Korea == South Korea, 

and so on...

Explanation

So, what I do is create a big regex with all possible country names (at least the ones more likely to appear) separated by the OR operator, like this:

germany|us|france|chile

But the problem is with multi-word country names and their shorter versions, like:

Republic of Moldova and Moldova

Using this as example, we have the string:

'Somewhere in Moldova, bla bla, 12313, Republic of Moldova'

What I want to get from this:

'Somewhere in Moldova, bla bla, 12313'
'Republic of Moldova'

But this is what I get:

'Somewhere in Moldova, bla bla, 12313, Republic of'
'Moldova'

Regex

As there are several cases, here is what I'm using so far:

^(.*),? \(?(republic of moldova|moldova)\)?(.*[\d\-]+.*|,.*[:/].*)?$

As we might have fax, phone, zip codes or something else after the country name - which I don't care about - I use the last matching group to remove them:

(.*[\d\-]+.*|,.*[:/].*)?

Also, sometimes the country name comes enclosed in parentheses, so I have \(? and \)? around the second match group, and all the countries go inside it:

(republic of moldova|moldova|...)
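
The behaviour described above can be reproduced with a short script. This is only a sketch: the pattern is cut down to the two Moldova alternatives, and the re.IGNORECASE flag is an assumption (the alternatives are written in lowercase while the address is not).

```python
import re

# Pattern from the question, reduced to the two Moldova alternatives.
# Assumption: matching is case-insensitive.
PATTERN = re.compile(
    r'^(.*),? \(?(republic of moldova|moldova)\)?(.*[\d\-]+.*|,.*[:/].*)?$',
    re.IGNORECASE,
)

address = 'Somewhere in Moldova, bla bla, 12313, Republic of Moldova'
match = PATTERN.match(address)

base_address = match.group(1)  # what is left in front of the country
country = match.group(2)       # the alternative that matched

print(repr(base_address))  # 'Somewhere in Moldova, bla bla, 12313, Republic of'
print(repr(country))       # 'Moldova'
```

The longer alternative is listed first, yet the shorter one wins, which suggests that the order of the alternatives is not what decides the result here.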

Question

The thing is, when one entry is a substring of a longer one, the shorter is chosen over the longer, and the remainder stays in the base_address string. Is there a way to tell the regex to choose the longest possible match when two alternatives match?

Edit

  1. I'm using Python with the built-in re module.
  2. As suggested by m.buettner, changing the first matching group from (.*) to (.*?) indeed fixes the current issue, but it also creates another. Consider another example:

    'Department of Chemistry, National University of Singapore, 4512436 Singapore'

Matches:

'Department of Chemistry, National University of'
'Singapore'

Here it matches too soon now.
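
A sketch of that second case, again with the pattern reduced to a single alternative and with re.IGNORECASE assumed. It also prints the third group, to show where the real country ends up:

```python
import re

# Same pattern as in the question, but with a lazy first group (.*?).
# Assumptions: only one alternative is listed, matching is case-insensitive.
PATTERN = re.compile(
    r'^(.*?),? \(?(singapore)\)?(.*[\d\-]+.*|,.*[:/].*)?$',
    re.IGNORECASE,
)

address = ('Department of Chemistry, National University of Singapore, '
           '4512436 Singapore')
match = PATTERN.match(address)

print(repr(match.group(1)))  # 'Department of Chemistry, National University of'
print(repr(match.group(2)))  # 'Singapore'
print(repr(match.group(3)))  # ', 4512436 Singapore'
```

The lazy group stops at the first Singapore it can use, and the trailing "junk" group then swallows the rest of the string, including the actual country.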

2 Solutions

#1


6  

Your problem is greediness.

The .* right at the beginning tries to match as much as possible, that is, everything up to the end of the string. But then the rest of your pattern fails. So the engine backtracks: it discards the last character matched by .* and tries the rest of the pattern again (which still fails). The engine repeats this process (fail, backtrack by one character, try again) until the rest of the pattern can finally match. The first time this happens is when .* has matched everything up to the final Moldova (so .* is still consuming Republic of). At that position the alternation cannot match republic of moldova, so it gladly matches moldova and returns that as the result.

The simplest solution is to make the repetition ungreedy:

^(.*?)...

Note that the question mark right after a quantifier does not mean "optional", but makes it "ungreedy". This simply reverses the behaviour: the engine first tries to leave out the .* completely, and in the process of backtracking it includes one more character after every failed attempt to match the rest of the pattern.
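
The difference between the two quantifiers is easiest to see on a toy string. The string and the patterns below are made up for illustration and are not the ones from the question:

```python
import re

text = 'a, moldova, b, moldova'

# Greedy: (.*) starts with the whole string and gives characters back
# until ", moldova" can match, so it stops at the LAST occurrence.
greedy = re.match(r'^(.*), (moldova)', text)

# Ungreedy: (.*?) starts empty and grows one character at a time,
# so it stops at the FIRST occurrence.
lazy = re.match(r'^(.*?), (moldova)', text)

print(repr(greedy.group(1)))  # 'a, moldova, b'
print(repr(lazy.group(1)))    # 'a'
```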

EDIT:

There are usually better alternatives to ungreediness. As you stated in a comment, the ungreedy solution brings another problem: countries in earlier parts of the string might be matched. What you can do instead is use lookarounds that ensure there are no word characters (letters, digits, underscore) before or after the country. That means a country name is only matched if it is surrounded by commas or by either end of the string:

^(.*),?(?<!\w)[ ][(]?(c|o|u|n|t|r|i|e|s)[)]?(?![ ]*\w)(.*[\d\-]+.*|,.*[:/].*)?$

Since lookarounds are not actually part of the match, they do not interfere with the rest of your pattern - they simply check a condition at a specific position in the match. The two lookarounds I have added ensure that:

  1. There is no word character before the mandatory space preceding the country.
  2. There is no word character after the country that is separated from it by nothing but spaces.

Note that I've wrapped spaces in a character class, as well as the literal parentheses (instead of escaping them). Neither is necessary, but I prefer these readability-wise, so they are just a suggestion.
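
Here is a sketch of the lookaround pattern applied to both example addresses. It assumes that the (c|o|u|n|t|r|i|e|s) placeholder stands for the list of country alternatives and that matching is case-insensitive; the COUNTRIES string is a made-up sample list.

```python
import re

# Assumption: the placeholder group holds the country alternatives.
COUNTRIES = r'republic of moldova|moldova|singapore'
PATTERN = re.compile(
    r'^(.*),?(?<!\w)[ ][(]?(' + COUNTRIES + r')[)]?(?![ ]*\w)'
    r'(.*[\d\-]+.*|,.*[:/].*)?$',
    re.IGNORECASE,
)

moldova = PATTERN.match(
    'Somewhere in Moldova, bla bla, 12313, Republic of Moldova')
# The greedy (.*) keeps the comma (,? is then free to match nothing),
# so it is stripped here.
base_address = moldova.group(1).rstrip(', ')
country = moldova.group(2)
print(repr(base_address))  # 'Somewhere in Moldova, bla bla, 12313'
print(repr(country))       # 'Republic of Moldova'

# In this address the last country follows a digit, not a comma,
# so the lookbehind rejects it and nothing matches at all.
singapore = PATTERN.match(
    'Department of Chemistry, National University of Singapore, '
    '4512436 Singapore')
print(singapore)  # None
```

So under these assumptions the pattern gets the Moldova address right, while an address like the Singapore one, where the country is not preceded by a comma, would need separate handling.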

EDIT 2:

As abarnert mentioned in a comment, how about not using a regex-only solution?

You could split the string on commas, trim every result, and check each one against your list of countries (possibly using a regex). If any component of your address is the same as one of your countries, you can return that. If there are multiple matches, then at least you can detect the ambiguity and deal with it properly.
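
A minimal sketch of that approach, using plain string comparison instead of a regex. The COUNTRY_ALIASES table and the split_country function are hypothetical names invented for this example, and the alias list is only a sample:

```python
# Hypothetical alias table: lowercase alias -> canonical country name.
COUNTRY_ALIASES = {
    'moldova': 'Moldova',
    'republic of moldova': 'Moldova',
    'singapore': 'Singapore',
    'uk': 'United Kingdom',
    'united kingdom': 'United Kingdom',
}


def split_country(address):
    """Return (base_address, countries) for a comma-separated address.

    countries holds the canonical name of every part that is exactly
    a known alias; more than one entry signals an ambiguity.
    """
    rest, countries = [], []
    for part in address.split(','):
        part = part.strip()
        key = part.strip('()').lower()  # tolerate "(Moldova)"
        if key in COUNTRY_ALIASES:
            countries.append(COUNTRY_ALIASES[key])
        else:
            rest.append(part)
    return ', '.join(rest), countries


base, countries = split_country(
    'Somewhere in Moldova, bla bla, 12313, Republic of Moldova')
print(repr(base))  # 'Somewhere in Moldova, bla bla, 12313'
print(countries)   # ['Moldova']
```

Because whole components are compared, "Somewhere in Moldova" is left alone. A component such as "4512436 Singapore" would not be recognised by this exact comparison and would need an extra rule, for example stripping leading digits first.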

#2


0  

Sort the alternatives: build the regex programmatically from an array of names sorted from longest to shortest. Then wrap the whole alternation in an atomic group (the PCRE engine has them; I don't know whether Python's re engine does too). Because of the atomic group, the regex engine never backtracks to try another alternative inside it, and since the alternatives are sorted, the match will always be the longest one.
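
The sorting step could look roughly like this. The name list and the build_alternation helper are hypothetical. The atomic group is left out on purpose, because Python's re module only accepts the (?>...) syntax from version 3.11 on:

```python
import re

# Hypothetical sample list; some names are prefixes of longer ones.
names = ['united states', 'united states of america',
         'moldova', 'republic of moldova']


def build_alternation(names):
    """Join the escaped names with |, longest name first."""
    ordered = sorted(names, key=len, reverse=True)
    return '|'.join(re.escape(name) for name in ordered)


unsorted_pattern = re.compile('|'.join(re.escape(n) for n in names))
sorted_pattern = re.compile(build_alternation(names))

text = 'united states of america'
# At a given position the engine takes the first alternative that fits.
print(unsorted_pattern.search(text).group())  # united states
print(sorted_pattern.search(text).group())    # united states of america
```

Sorting only decides which alternative wins at one position. It does not change how far a preceding .* has already consumed the string.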

Tada.
