是否存在可以确定一种常规语言是否与另一种常规语言匹配的任何输入匹配的算法?

时间:2022-05-30 11:30:19

Let's say we have regular expressions:

假设我们有正则表达式:

  • Hello W.*rld
  • 你好W. * rld
  • Hello World
  • 你好,世界
  • .* World
  • 。*世界
  • .* W.*
  • 。* W. *

I would like to minimize the number of regexes required to match arbitrary input.

我想最小化匹配任意输入所需的正则表达式的数量。

To do that, I need to find if one regular expression matches any input matched by another expression. Is that possible?

为此,我需要查找一个正则表达式是否与另一个表达式匹配的任何输入匹配。那可能吗?

Billy3

Billy3

4 个解决方案

#1


11  

Any regular expression can be linked to a DFA - you can minimize the DFA and since the minimal form is unique, you can decide whether two expressions are equivalent. Dani Cricco pointed out the Hopcroft O(n log n) algorithm. There is another improved algorithm by Hopcroft and Craft which tests the equivalence of two DFAs in O(n).

任何正则表达式都可以链接到DFA - 您可以最小化DFA,并且由于最小形式是唯一的,因此您可以决定两个表达式是否相同。 Dani Cricco指出了Hopcroft O(n log n)算法。 Hopcroft和Craft还有另一种改进算法,用于测试O(n)中两个DFA的等价性。

For a good survey on the matter and an interesting approach to this, I reccomend the paper Testing the Equivalence of Regular Languages, from arXiv.

对于这个问题的一个很好的调查和一个有趣的方法,我推荐了arXiv的测试正则语言的等价性的论文。

Later edit: if you are interested in inclusion rather than equivalence for regular expressions, I have come across a paper that might be of interest: Inclusion Problem for Regular Expressions - I have only skimmed through it but it seems to contain a polynomial time algorithm to the problem.

稍后编辑:如果你对正则表达式的包含而不是等价感兴趣,我会遇到一篇可能感兴趣的论文:正则表达式的包含问题 - 我只是略过它但它似乎包含一个多项式时间算法问题。

#2


2  

Yes.

是。

The problem of equivalence of two regular languages is decidable.

两种常规语言的等同性问题是可判定的。

Sketch of an algorithm:

算法草图:

  • minimize both DFAs
  • 最小化两个DFA
  • check if they are isomorph
  • 检查它们是否是同形体

#3


2  

Sure!. A regular expression can be represented as an FSM (Finite State Machine) and there are technically infinite number of FSM that can recognize the same string.

当然!。正则表达式可以表示为FSM(有限状态机),并且技术上无限数量的FSM可以识别相同的字符串。

Isomorphism is the name that describes if two FSM are equivalent. There are a couple of algorigthm to minimize an FSM. For example the Hopcroft minimization algorithm can minimize two FSM in O(n log n), on an n state automaton.

同构是描述两个FSM是否等效的名称。有一些算法可以最小化FSM。例如,Hopcroft最小化算法可以在n状态自动机上最小化O(n log n)中的两个FSM。

#4


0  

This problem is called "inclusion" or "subsumption" of regular expressions, because what you are asking for, is whether the set of words matched by one regexp includes (or subsumes) the set of words matched by the other regex. Equality is a different question which usually means whether two regexps matches exactly the same words, i.e. that they are functionally equivalent. For example "a*" includes "aa*", while they are not equal.

这个问题被称为正则表达式的“包含”或“包含”,因为您要求的是一个正则表达式匹配的单词集是否包括(或包含)由另一个正则表达式匹配的单词集。平等是一个不同的问题,通常意味着两个正则表达式是否匹配完全相同的单词,即它们在功能上是等同的。例如,“a *”包括“aa *”,而它们不相等。

All known algorithms for regexp inclusion are the worst case take time exponential in the size of the regexp. But the standard algorithm is like this:

所有已知的regexp包含算法都是regexp大小的最坏情况需要时间指数。但标准算法是这样的:

Input r1 and r2 Output Yes if r1 includes r2

输入r1和r2如果r1包含r2,则输出Yes

  1. Create DFA(r1) and DFA(r2)
  2. 创建DFA(r1)和DFA(r2)
  3. Create Neg(DFA(r1)) (which matches exactly those words r1 dont match)
  4. 创建Neg(DFA(r1))(与r1不匹配的那些单词完全匹配)
  5. Create Neg(DFA(r1))x DFA(r2) (which matches exactly those words matched by Neg(DFA(r1)) and DFA(r2))
  6. 创建Neg(DFA(r1))x DFA(r2)(与Neg(DFA(r1))和DFA(r2)匹配的单词完全匹配)
  7. Check that the automaton made in 3. does not match any word
  8. 检查3.中的自动机是否与任何单词匹配

This works, since what you are checking is that there are no words matched by r2 that are not matched by r1.

这是有效的,因为您正在检查的是r2没有与r1不匹配的单词匹配。

#1


11  

Any regular expression can be linked to a DFA - you can minimize the DFA and since the minimal form is unique, you can decide whether two expressions are equivalent. Dani Cricco pointed out the Hopcroft O(n log n) algorithm. There is another improved algorithm by Hopcroft and Craft which tests the equivalence of two DFAs in O(n).

任何正则表达式都可以链接到DFA - 您可以最小化DFA,并且由于最小形式是唯一的,因此您可以决定两个表达式是否相同。 Dani Cricco指出了Hopcroft O(n log n)算法。 Hopcroft和Craft还有另一种改进算法,用于测试O(n)中两个DFA的等价性。

For a good survey on the matter and an interesting approach to this, I reccomend the paper Testing the Equivalence of Regular Languages, from arXiv.

对于这个问题的一个很好的调查和一个有趣的方法,我推荐了arXiv的测试正则语言的等价性的论文。

Later edit: if you are interested in inclusion rather than equivalence for regular expressions, I have come across a paper that might be of interest: Inclusion Problem for Regular Expressions - I have only skimmed through it but it seems to contain a polynomial time algorithm to the problem.

稍后编辑:如果你对正则表达式的包含而不是等价感兴趣,我会遇到一篇可能感兴趣的论文:正则表达式的包含问题 - 我只是略过它但它似乎包含一个多项式时间算法问题。

#2


2  

Yes.

是。

The problem of equivalence of two regular languages is decidable.

两种常规语言的等同性问题是可判定的。

Sketch of an algorithm:

算法草图:

  • minimize both DFAs
  • 最小化两个DFA
  • check if they are isomorph
  • 检查它们是否是同形体

#3


2  

Sure!. A regular expression can be represented as an FSM (Finite State Machine) and there are technically infinite number of FSM that can recognize the same string.

当然!。正则表达式可以表示为FSM(有限状态机),并且技术上无限数量的FSM可以识别相同的字符串。

Isomorphism is the name that describes if two FSM are equivalent. There are a couple of algorigthm to minimize an FSM. For example the Hopcroft minimization algorithm can minimize two FSM in O(n log n), on an n state automaton.

同构是描述两个FSM是否等效的名称。有一些算法可以最小化FSM。例如,Hopcroft最小化算法可以在n状态自动机上最小化O(n log n)中的两个FSM。

#4


0  

This problem is called "inclusion" or "subsumption" of regular expressions, because what you are asking for, is whether the set of words matched by one regexp includes (or subsumes) the set of words matched by the other regex. Equality is a different question which usually means whether two regexps matches exactly the same words, i.e. that they are functionally equivalent. For example "a*" includes "aa*", while they are not equal.

这个问题被称为正则表达式的“包含”或“包含”,因为您要求的是一个正则表达式匹配的单词集是否包括(或包含)由另一个正则表达式匹配的单词集。平等是一个不同的问题,通常意味着两个正则表达式是否匹配完全相同的单词,即它们在功能上是等同的。例如,“a *”包括“aa *”,而它们不相等。

All known algorithms for regexp inclusion are the worst case take time exponential in the size of the regexp. But the standard algorithm is like this:

所有已知的regexp包含算法都是regexp大小的最坏情况需要时间指数。但标准算法是这样的:

Input r1 and r2 Output Yes if r1 includes r2

输入r1和r2如果r1包含r2,则输出Yes

  1. Create DFA(r1) and DFA(r2)
  2. 创建DFA(r1)和DFA(r2)
  3. Create Neg(DFA(r1)) (which matches exactly those words r1 dont match)
  4. 创建Neg(DFA(r1))(与r1不匹配的那些单词完全匹配)
  5. Create Neg(DFA(r1))x DFA(r2) (which matches exactly those words matched by Neg(DFA(r1)) and DFA(r2))
  6. 创建Neg(DFA(r1))x DFA(r2)(与Neg(DFA(r1))和DFA(r2)匹配的单词完全匹配)
  7. Check that the automaton made in 3. does not match any word
  8. 检查3.中的自动机是否与任何单词匹配

This works, since what you are checking is that there are no words matched by r2 that are not matched by r1.

这是有效的,因为您正在检查的是r2没有与r1不匹配的单词匹配。