How would one efficiently match one input string against any number of regular expressions?
One scenario where this might be useful is with REST web services. Let's assume that I have come up with a number of URL patterns for a REST web service's public interface:
- /user/with-id/{userId}
- /user/with-id/{userId}/profile
- /user/with-id/{userId}/preferences
- /users
- /users/who-signed-up-on/{date}
- /users/who-signed-up-between/{fromDate}/and/{toDate}
- …

where {…} are named placeholders (like regular expression capturing groups).
Note: This question is not about whether the above REST interface is well-designed or not. (It probably isn't, but that shouldn't matter in the context of this question.)
It may be assumed that placeholders usually do not appear at the very beginning of a pattern (but they could). It can also be safely assumed that it is impossible for any string to match more than one pattern.
Now the web service receives a request. Of course, one could sequentially match the requested URI against one URL pattern, then against the next one, and so on; but that probably won't scale well for a larger number of patterns that must be checked.
Are there any efficient algorithms for this?
Inputs:
- An input string
- A set of "mutually exclusive" regular expressions (i.e., no input string may match more than one expression)
Output:
- The regular expression (if any) that the input string matched against.
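For reference, the naive sequential approach mentioned above might look like this (a minimal Python sketch; the compiled pattern list is hypothetical and abbreviated):

```python
import re

# Hypothetical patterns compiled from the URL templates above;
# {userId} etc. become named capturing groups.
PATTERNS = [
    re.compile(r"^/user/with-id/(?P<userId>[^/]+)$"),
    re.compile(r"^/user/with-id/(?P<userId>[^/]+)/profile$"),
    re.compile(r"^/users/who-signed-up-on/(?P<date>[^/]+)$"),
]

def match_sequentially(uri):
    # O(n) in the number of patterns: try each one until something matches.
    for pattern in PATTERNS:
        m = pattern.match(uri)
        if m:
            return pattern, m.groupdict()
    return None, None
```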
5 Answers
#1
10
The Aho-Corasick algorithm is a very fast algorithm for matching an input string against a set of patterns (actually keywords) that are preprocessed and organized in a trie to speed up matching.
There are variations of the algorithm that support regex patterns (e.g., http://code.google.com/p/esmre/, just to name one) that are probably worth a look.
Or, you could split the URLs into chunks, organize them in a tree, then split the URL to match and walk the tree one chunk at a time. The {userId} can be treated as a wildcard, or required to match some specific format (e.g., be an int).
When you reach a leaf, you know which URL you matched.
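A minimal Python sketch of that chunk-tree idea (the dict-based node layout and the "*" wildcard marker are assumptions of this sketch, not part of Aho-Corasick or esmre):

```python
# Build a tree keyed by path chunks; "{...}" chunks act as wildcards.
# Each leaf stores the original URL template that was matched.
def build_tree(templates):
    root = {}
    for template in templates:
        node = root
        for chunk in template.strip("/").split("/"):
            key = "*" if chunk.startswith("{") else chunk  # wildcard marker
            node = node.setdefault(key, {})
        node["$leaf"] = template  # remember which template ends here
    return root

def match_url(root, url):
    node = root
    for chunk in url.strip("/").split("/"):
        if chunk in node:            # a literal chunk wins over a wildcard
            node = node[chunk]
        elif "*" in node:            # fall back to the {placeholder} branch
            node = node["*"]
        else:
            return None
    return node.get("$leaf")

tree = build_tree(["/user/with-id/{userId}", "/user/with-id/{userId}/profile", "/users"])
print(match_url(tree, "/user/with-id/42/profile"))  # -> /user/with-id/{userId}/profile
```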
#2
4
The standard solution for matching multiple regular expressions against an input stream is a lexer generator such as Flex (there are lots of these available, typically several for each programming language).
These tools take a set of regular expressions associated with "tokens" (think of tokens as just names for whatever a regular expression matches) and generate efficient finite-state automata that match all the regexes at the same time. This is linear time with a very small constant in the size of the input stream; it is hard to ask for "faster" than this. You feed it a character stream, and it emits the token name of the regex that matches "best" (this handles the case where two regexes can match the same string; see the lexer generator's documentation for the exact definition), and advances the stream past what was recognized. So you can apply it again and again to match the input stream as a series of tokens.
Different lexer generators will allow you to capture different bits of the recognized stream in different ways, so that, after recognizing a token, you can pick out the part you care about (e.g., for a literal string in quotes, you only care about the string content, not the quotes).
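A rough Python approximation of the idea (not a real generated lexer; it leans on re's alternation of named groups, and unlike a true lexer generator Python's re picks the first matching alternative rather than the longest match):

```python
import re

# Each token name paired with its regex; a lexer generator would compile
# these into a single DFA. Here re's alternation approximates that.
TOKEN_SPECS = [
    ("NUMBER", r"\d+"),
    ("WORD",   r"[A-Za-z-]+"),
    ("SLASH",  r"/"),
]
MASTER = re.compile("|".join("(?P<%s>%s)" % (name, rx) for name, rx in TOKEN_SPECS))

def tokenize(stream):
    pos = 0
    while pos < len(stream):
        m = MASTER.match(stream, pos)
        if not m:
            raise ValueError("no token matches at position %d" % pos)
        yield m.lastgroup, m.group()   # token name and the matched text
        pos = m.end()                  # advance the stream by what was recognized

print(list(tokenize("/users/42")))
# -> [('SLASH', '/'), ('WORD', 'users'), ('SLASH', '/'), ('NUMBER', '42')]
```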
#3
3
If there is a hierarchy in the URL structure, it should be used to maximize performance. Only a URL that starts with /user/ can match any of the first three patterns, and so on.
I suggest storing the patterns in a tree corresponding to the URL hierarchy, where each node matches one level of the hierarchy. To match a URL, test it against all roots of the tree; in this example those are just the nodes with regexes for "user" and "users". URLs that match are then tested against the children of those nodes, until a match is found in a leaf node. A successful match can be returned as the list of nodes from the root to the leaf. Named groups with property values such as {user-id} can be fetched from the nodes of the successful match.
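A sketch of such a tree in Python, assuming a simple illustrative Node type (the class and function names are made up for this sketch):

```python
import re

class Node:
    def __init__(self, pattern, children=()):
        self.regex = re.compile(pattern + "$")  # one path level per node
        self.children = list(children)

def match(url, roots):
    segments = url.strip("/").split("/")
    def walk(nodes, i, path):
        for node in nodes:
            m = node.regex.match(segments[i])
            if not m:
                continue
            if i == len(segments) - 1 and not node.children:
                return path + [(node, m.groupdict())]   # leaf: full match
            if node.children and i + 1 < len(segments):
                result = walk(node.children, i + 1, path + [(node, m.groupdict())])
                if result:
                    return result
        return None
    return walk(roots, 0, [])

roots = [
    Node("user", [Node("with-id", [Node(r"(?P<userId>\d+)")])]),
    Node("users"),
]
result = match("/user/with-id/42", roots)
print(result and [gd for _, gd in result])  # -> [{}, {}, {'userId': '42'}]
```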
#4
1
Use named expressions and the OR operator, i.e. "(?P<re1>...)|(?P<re2>...)|...".
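In Python that could look like the following (the group names and patterns are illustrative):

```python
import re

# Combine all mutually exclusive patterns into one alternation; the name
# of the group that matched tells you which pattern won.
COMBINED = re.compile(
    r"(?P<user_by_id>^/user/with-id/[^/]+$)"
    r"|(?P<all_users>^/users$)"
    r"|(?P<signups_on>^/users/who-signed-up-on/[^/]+$)"
)

def which_pattern(uri):
    m = COMBINED.match(uri)
    return m.lastgroup if m else None

print(which_pattern("/users"))  # -> all_users
```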
#5
1
At first I thought I couldn't see any good optimization for this process.
However, if you have a really large number of regexes, you might want to partition them (I'm not sure if this is technically partitioning).
What I suggest is this:
Suppose that you have 20 possible URLs that start with user:
/user/with-id/X
/user/with-id/X/preferences # instead of preferences, you could have another 10 possibilities like /friends, /history, etc
Then, you also have 20 possible URLs starting with users:
/users/who-signed-up-on
/users/who-signed-up-on-between #others: /registered-for, /i-might-like, etc
And the list goes on for /products, /companies, etc., instead of users.
What you could do in this case is use "multi-level" matching.
First, match the start of the string. You'd be matching for /products, /companies, /users, one at a time, ignoring the rest of the string. This way, you don't have to test all 100 possibilities.
After you know the URL starts with /users, you can match only the possible URLs that start with users.
This way, you would avoid a lot of unneeded matches. You won't match the string against all the /products possibilities.
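A Python sketch of that partitioning (the bucket contents are illustrative):

```python
import re

# Bucket the compiled regexes by the first path segment so that only
# the relevant bucket is scanned for a given request.
BUCKETS = {
    "user": [
        re.compile(r"^/user/with-id/(?P<userId>[^/]+)$"),
        re.compile(r"^/user/with-id/(?P<userId>[^/]+)/preferences$"),
    ],
    "users": [
        re.compile(r"^/users/who-signed-up-on/(?P<date>[^/]+)$"),
    ],
}

def dispatch(uri):
    first = uri.strip("/").split("/", 1)[0]   # e.g. "user" or "users"
    for pattern in BUCKETS.get(first, ()):    # only this bucket is tried
        m = pattern.match(uri)
        if m:
            return pattern, m.groupdict()
    return None, None

print(dispatch("/user/with-id/42/preferences")[1])  # -> {'userId': '42'}
```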