We are trying to get rid of boost::regex and its awful performance. According to this benchmark, Oniguruma is the best overall.
We have multiple regexps (which change all the time) that we apply to strings ranging from medium (100 chars) to huge (1k chars), so it's a very heterogeneous environment.
Has any of you used it successfully? Would you recommend going for one of the more "standard" ones such as PCRE or RE2?
Thanks!
2 Answers
#1
7
the two kinds of implementation (FSA, i.e. finite-state automaton, and BT, i.e. backtracking) have quite different behaviours, which you can see in the right-hand column (email) there.
oniguruma is generally fast, but has the possibility of running slowly if you're "unlucky" with a particular regexp. that's because it's a backtracking algorithm.
in contrast, while re2 is generally a little slower, it doesn't have the same risk - its time will never[*] explode in the same way (it doesn't have worst case exponential behaviour).
so it depends on the details. if you're confident that your regexps will be safe, or are willing to detect and abort slow matches, oniguruma makes sense. but personally i would be inclined to pay a little more (not much more) in running time for the safety of re2.
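to make the difference concrete, here is a minimal sketch (not code from oniguruma or re2) that matches the toy regexp `^(a|aa)*c$` in both styles and counts the work done. The names `backtrack` and `simulate` and the step counters are hypothetical, chosen only for this illustration. On an input of n `a` characters with no trailing `c`, the backtracking version retries every way of splitting the run of a's, while the state-set simulation reads each character once.

```cpp
#include <cstddef>
#include <string>

// Backtracking match of ^(a|aa)*c$ from position pos.
// Every call is counted in steps so the amount of retrying is visible.
bool backtrack(const std::string& s, std::size_t pos, unsigned long long& steps) {
    ++steps;
    const std::size_t n = s.size();
    // Alternative 1: consume "a", then continue the loop.
    if (pos < n && s[pos] == 'a' && backtrack(s, pos + 1, steps)) return true;
    // Alternative 2: consume "aa", then continue the loop.
    if (pos + 1 < n && s[pos] == 'a' && s[pos + 1] == 'a' &&
        backtrack(s, pos + 2, steps)) return true;
    // Leave the loop: exactly one 'c' must remain.
    return pos + 1 == n && s[pos] == 'c';
}

// Automaton-style match of the same regexp: keep the set of live states
// and advance all of them together, one input character at a time.
bool simulate(const std::string& s, unsigned long long& steps) {
    bool at_loop = true;    // at the start of (a|aa)*
    bool mid_aa = false;    // after the first 'a' of the "aa" alternative
    bool accepted = false;  // after the final 'c'
    for (char ch : s) {
        bool next_loop = false, next_mid = false, next_acc = false;
        if (at_loop) {
            ++steps;
            if (ch == 'a') { next_loop = true; next_mid = true; }
            else if (ch == 'c') { next_acc = true; }
        }
        if (mid_aa) {
            ++steps;
            if (ch == 'a') next_loop = true;
        }
        // The accepting state has no outgoing transitions, so any
        // character read after 'c' drops it.
        at_loop = next_loop;
        mid_aa = next_mid;
        accepted = next_acc;
    }
    return accepted;
}
```

for 25 a's in a row, the backtracking version makes hundreds of thousands of calls before giving up, while the simulation examines fewer than 50 state transitions. real engines are far more sophisticated than this, but the shape of the two curves is the point.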
for more on this see http://swtch.com/~rsc/regexp/regexp1.html (by the re2 author).
[*] well, maybe never is too strong. i think features like backreferences (matching previous matches) and lookahead would force a BT approach; as far as i know re2 avoids the problem by not supporting them at all. either way it's still safer on most regexps.
#2
5
I've done a benchmark with the following libraries:
- Boost
- re2
- Oniguruma
The benchmark consisted of running a series of tests that make heavy use of very heterogeneous regexps (grouping, non-grouping, long ones (484 characters), short ones, pipes, \?, *, ., etc.), applied to texts ranging from a few characters to around 8k characters.
Each time a regexp match was computed, I stored the regexp and incremented a millisecond counter accumulating the time spent evaluating that regexp (each one is called multiple times).
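The bookkeeping described above can be sketched roughly as follows. This is not the actual benchmark harness: `RegexTimer`, `RegexStats` and `timed_search` are hypothetical names, and `std::regex` is used only as a stand-in so the example needs nothing beyond the standard library. The same wrapper shape would apply around a Boost, re2 or Oniguruma call.

```cpp
#include <chrono>
#include <map>
#include <regex>
#include <string>

// Accumulated cost of one regexp across all the times it was evaluated.
struct RegexStats {
    explicit RegexStats(const std::string& pattern) : compiled(pattern) {}
    std::regex compiled;                 // compiled once, reused afterwards
    std::chrono::nanoseconds total{0};   // time spent matching only
    unsigned long calls = 0;
};

class RegexTimer {
public:
    // Runs a search for `pattern` in `text` and adds the elapsed time
    // to the counter kept for that pattern.
    bool timed_search(const std::string& pattern, const std::string& text) {
        auto it = stats_.find(pattern);
        if (it == stats_.end()) {
            // Compilation happens here, outside the timed region.
            it = stats_.emplace(pattern, RegexStats(pattern)).first;
        }
        RegexStats& st = it->second;

        const auto start = std::chrono::steady_clock::now();
        const bool found = std::regex_search(text, st.compiled);
        const auto stop = std::chrono::steady_clock::now();

        st.total += std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start);
        ++st.calls;
        return found;
    }

    const std::map<std::string, RegexStats>& stats() const { return stats_; }

private:
    std::map<std::string, RegexStats> stats_;
};
```

Summing `total` over all entries gives a per-library figure comparable to the totals reported here, and sorting the entries by `total` shows which individual regexps dominate. Whether compilation time belongs inside the timed region is a design choice; this sketch leaves it out.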
Here is the total time spent on all regexps for each library:
- Boost: 98840 ms
- re2: 51197 ms
- Oniguruma: 16095 ms
- re2 (NO CAPTURE*, see below): 16162 ms
*We (almost) always want to capture the content of groups in our regexps, and re2 performs horribly when it captures a group (see here). You don't see it that much in the overall result above because when the group cannot be captured, re2 performs well. For example, on this regexp (executed many times):
^((?:https?://)?(?:[a-z0-9\-]{1,63}\.)+(?:[a-z0-9\-]{1,63}))(?:[^\?]*).*$
here are the results for each lib:
- Boost: 140 ms
- re2: 5663 ms
- Oniguruma: 53 ms
- re2 (NO CAPTURE): 37 ms
See the drop for re2 from 5663 ms to 37 ms.
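For readers unsure what "capture" versus "no capture" means at the call site, here is a small sketch using the regexp above. It uses `std::regex` as a stand-in so it compiles without extra libraries, which means it shows only the two call shapes and does not reproduce re2's timings. The helper names `url_regex`, `matches_url` and `capture_prefix` are hypothetical. With re2 the analogous distinction is, as far as I know, between calling `RE2::FullMatch` with no output arguments and calling it with a pointer that receives the group.

```cpp
#include <regex>
#include <string>

// The regexp from the benchmark, compiled once on first use.
const std::regex& url_regex() {
    static const std::regex re(
        R"(^((?:https?://)?(?:[a-z0-9\-]{1,63}\.)+(?:[a-z0-9\-]{1,63}))(?:[^\?]*).*$)");
    return re;
}

// "No capture": only ask whether the whole text matches.
bool matches_url(const std::string& text) {
    return std::regex_match(text, url_regex());
}

// "Capture": also extract group 1 (optional scheme plus host name).
bool capture_prefix(const std::string& text, std::string& prefix) {
    std::smatch m;
    if (!std::regex_match(text, m, url_regex())) return false;
    prefix = m[1].str();
    return true;
}
```

Calling `capture_prefix` on `http://www.example.com/path?x=1` stores `http://www.example.com` in `prefix`; `matches_url` answers the yes/no question without reporting where the group is.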
tl;dr
So my conclusion is that for our use, Oniguruma is clearly superior.
But if you don't need to capture groups, re2 is a better choice, since I found that its API is easier to use.