How good is Oniguruma compared with other cross-platform regexp libraries?

Date: 2020-12-08 12:09:31

We are trying to get rid of boost::regex and its awful performance. According to this benchmark, Oniguruma is the best overall.

We have multiple regexps (which change all the time) that we apply to strings ranging from medium (100 chars) to huge (1k chars)... so it's a very heterogeneous environment.

Have any of you used it with success? Do you recommend going for the more "standard" ones like PCRE or RE2?

Thanks!

2 answers

#1


7  

the two kinds of implementation, finite-state automaton (FSA) and backtracking (BT), have quite different behaviours, which you can see in the right-hand column (email) of that benchmark.

oniguruma is generally fast, but has the possibility of running slowly if you're "unlucky" with a particular regexp. that's because it's a backtracking algorithm.

in contrast, while re2 is generally a little slower, it doesn't have the same risk - its time will never[*] explode in the same way (it doesn't have worst case exponential behaviour).
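
to make the automaton side concrete, here is a minimal sketch of the state-set idea in Python. it is not re2's actual implementation; the function name, the hand-built transition table and the state numbers are all made up for illustration, and the table encodes the toy regexp (a+)+b as a full match. the point is only that the work per input character is bounded by the number of states, so there is nothing to explode.

```python
def nfa_match(text, transitions, start, accept):
    """Full-match `text` by tracking the set of live NFA states.

    No backtracking: every input character is looked at once per live
    state, so the step count is at most len(text) * number_of_states.
    """
    states = {start}
    steps = 0
    for ch in text:
        next_states = set()
        for state in states:
            steps += 1
            next_states |= transitions.get((state, ch), set())
        states = next_states
        if not states:  # no live state left, the match has failed
            break
    return accept in states, steps


# Hypothetical hand-built table for (a+)+b:
#   state 0 = start, state 1 = "inside a group of a's", state 2 = accept.
# "Continue the current group" and "start a new group" both land in
# state 1, so all the alternatives a backtracker would try one by one
# are merged into a single live state here.
NFA = {
    (0, "a"): {1},
    (1, "a"): {1},
    (1, "b"): {2},
}

if __name__ == "__main__":
    print(nfa_match("a" * 30 + "b", NFA, 0, 2))
    print(nfa_match("a" * 30, NFA, 0, 2))
```

on a failing input of 30 a's this sketch does 30 steps, one per character, and it stays linear however long the input gets.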

so it depends on details. if you're confident that your regexps will be safe, or are willing to detect and abort slow matches, oniguruma makes sense. but personally i would be inclined to pay a little more (not much more) for the security of re2.
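
as a rough illustration of "detect and abort slow matches", here is a toy backtracking matcher in Python with a step budget. it is hard-wired to the regexp (a+)+b and is only a sketch of the idea, not how oniguruma works internally; `backtrack_match`, `BudgetExceeded` and the `budget` parameter are names i made up for this example.

```python
class BudgetExceeded(Exception):
    """Raised when the matcher has used more steps than allowed."""


def backtrack_match(text, budget):
    """Backtracking match of (a+)+b at the start of `text`.

    Returns (matched, steps). Raises BudgetExceeded once more than
    `budget` group attempts have been made.
    """
    steps = 0

    def group(pos):
        # Match one "a+" starting at pos, then either another group or "b".
        nonlocal steps
        end = pos
        while end < len(text) and text[end] == "a":
            end += 1
        # Greedy: try the longest run of a's first, then give characters back.
        for stop in range(end, pos, -1):
            steps += 1
            if steps > budget:
                raise BudgetExceeded(steps)
            if group(stop) or (stop < len(text) and text[stop] == "b"):
                return True
        return False

    matched = group(0)
    return matched, steps


if __name__ == "__main__":
    print(backtrack_match("a" * 10 + "b", budget=10_000))  # lucky input
    print(backtrack_match("a" * 10, budget=10_000))        # unlucky input
    try:
        backtrack_match("a" * 30, budget=10_000)
    except BudgetExceeded:
        print("aborted: too many steps")
```

for n a's with no trailing b this toy matcher needs 2^n - 1 steps before it can report failure, which is the kind of "unlucky" case meant above. the budget turns that into a bounded cost: the caller gets an exception and can treat the input as a non-match or log the offending regexp.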

for more on this see http://swtch.com/~rsc/regexp/regexp1.html (by the re2 author).

[*] well, maybe never is too strong. features such as backreferences (matching previous matches) and lookahead cannot be handled by the automaton approach; as far as i know re2 does not fall back on a BT approach for them but simply does not support them. for the regexps it does accept, it's still the safer choice.

#2


5  

I've done a benchmark with the following libraries:

  • Boost
  • re2
  • Oniguruma

The benchmark consisted of running a series of tests that made heavy use of very heterogeneous regexps (with and without grouping, long ones (484 characters), short ones, pipes, \?, *, ., etc.), applied to texts ranging from a few characters to around 8k characters.

Each time a regexp match was computed, I stored the regexp and incremented a milliseconds counter accumulating the time spent computing the regexp (called multiple times).
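
For anyone who wants to reproduce this kind of measurement, the bookkeeping can be sketched roughly as below. This is not the harness I used; it is a minimal Python version in which the standard `re` module stands in for the C++ libraries, and `elapsed_ms` and `timed_search` are names invented for the example.

```python
import re
import time
from collections import defaultdict

# Accumulated matching time per regexp, in milliseconds.
elapsed_ms = defaultdict(float)


def timed_search(pattern, text):
    """Run one search and add its wall-clock time to the regexp's counter."""
    compiled = re.compile(pattern)  # compiled outside the timed region
    start = time.perf_counter()
    result = compiled.search(text)
    elapsed_ms[pattern] += (time.perf_counter() - start) * 1000.0
    return result


if __name__ == "__main__":
    for text in ["http://www.example.com/a?b=1", "no match here", "x" * 8000]:
        timed_search(r"https?://[a-z0-9.\-]+", text)
        timed_search(r"\d+", text)
    # Slowest regexps first.
    for pattern, ms in sorted(elapsed_ms.items(), key=lambda kv: -kv[1]):
        print(f"{ms:10.3f} ms  {pattern}")
```

Keying the counter by the regexp string means repeated calls with the same regexp add up, which is what makes a single slow regexp stand out in the totals.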

Here is the total time spent on all regexps for each library:

  • Boost: 98840 ms
  • re2: 51197 ms
  • Oniguruma: 16095 ms
  • re2 (NO CAPTURE*, see below): 16162 ms

*We (almost) always want to capture the content of groups in a regexp, and re2 performs horribly when it captures a group (see here). You don't see that so much in the totals above because re2 performs well when the group cannot be captured. For example, on this regexp (executed a lot of times):

^((?:https?://)?(?:[a-z0-9\-]{1,63}\.)+(?:[a-z0-9\-]{1,63}))(?:[^\?]*).*$

here are the results for each library:

  • Boost: 140 ms
  • re2: 5663 ms
  • Oniguruma: 53 ms
  • re2 (NO CAPTURE): 37 ms

See the drop for re2 from 5663 ms to 37 ms.
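
To show what the two variants ask of the engine, here is a small Python sketch using the regexp above. Python's `re` is only a stand-in and will not reproduce re2's timings; the helper names and the sample URL are made up, and the non-capturing variant is my guess at what a "no capture" run looks like (same regexp, outer group turned into `(?:...)`, only a yes/no answer requested).

```python
import re

# The regexp from the benchmark: group 1 captures scheme + host.
URL_CAPTURE = re.compile(
    r"^((?:https?://)?(?:[a-z0-9\-]{1,63}\.)+(?:[a-z0-9\-]{1,63}))(?:[^\?]*).*$"
)

# Same regexp with the outer group made non-capturing (assumed variant).
URL_NO_CAPTURE = re.compile(
    r"^(?:(?:https?://)?(?:[a-z0-9\-]{1,63}\.)+(?:[a-z0-9\-]{1,63}))(?:[^\?]*).*$"
)


def host_part(url):
    """Return the captured scheme + host, or None if the regexp fails."""
    match = URL_CAPTURE.match(url)
    return match.group(1) if match else None


def looks_like_url(url):
    """Only answer whether the regexp matches; nothing is extracted."""
    return URL_NO_CAPTURE.match(url) is not None


if __name__ == "__main__":
    sample = "http://www.example.com/path?x=1"
    print(host_part(sample))       # http://www.example.com
    print(looks_like_url(sample))  # True
```

In the capturing case the engine has to report where group 1 starts and ends, while the second form only needs a boolean, which is the difference the two re2 rows above are measuring.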

tl;dr

So my conclusion is that for our use, Oniguruma is clearly superior.

But if you don't need to capture groups, re2 is a better choice since I found that its API is easier to use.
