I'm working on an app where we need to pull out key information from text. The catch is the text comes from OCRed documents, so there can be OCR recognition errors, noise, garbage characters, etc. Also, the text on the document can be in a million different formats depending on the source.
So, we use lots of regular expressions to pull out the text. We noticed that at high volume this hammers the CPUs on our servers. I've tried pre-compiling the regexes and caching them, with no improvement. The profiler shows 65% of the runtime is spent in calls to Regex.Match().
Reading up on regexes, I see that catastrophic backtracking is a known performance issue.
Let's say I have an expression like this (this is a simple one just to illustrate the general format of our regexes -- others can contain many more keywords and formats):
(.*) KEYWORD1 AND (.* KEYWORD2)
When I step through with Regex Coach, I see it does a lot of backtracking to match the string.
Can this type of regex be conceptually improved? We only run against a subset of the entire document (a smaller blob of text), but the preprocessing that pulls out the blob isn't perfect either, by nature.
So, yeah, pretty much anything can appear before "KEYWORD1" and anything can appear in front of "KEYWORD2", etc.
We can't restrict to A-Z instead of .*, since in the OCR world letters can sometimes be mistaken for numbers (e.g. Illene = I11ene), or we can get garbage characters in there due to OCR recognition errors.
1 Answer
#1
Yes, this type of regex can easily be optimized.
You optimize them by replacing the regex with the intended code. That is to say, two substring searches. If the position of " KEYWORD1 AND " is smaller than that of "KEYWORD2", then you've got a match.
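In C# (which the post's Regex.Match() suggests), a minimal sketch of this might look like the following. The class, method, and constant names are invented for illustration; it just mirrors the example pattern `(.*) KEYWORD1 AND (.* KEYWORD2)` with two IndexOf scans, which are linear and cannot backtrack:

```csharp
using System;

static class KeywordMatcher
{
    // The two literal pieces of the example regex.
    const string Marker1 = " KEYWORD1 AND ";
    const string Marker2 = " KEYWORD2";

    public static bool IsMatch(string input)
    {
        // Plain linear scan for the first marker; no backtracking possible.
        int i = input.IndexOf(Marker1, StringComparison.Ordinal);
        if (i < 0) return false;

        // The regex only needs " KEYWORD2" somewhere after the first marker.
        return input.IndexOf(Marker2, i + Marker1.Length, StringComparison.Ordinal) >= 0;
    }
}
```

If you also need the captured groups, the same two indices give you the substrings before Marker1 and between the markers via Substring.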
For extra speed, you can use optimized substring searches, but that's almost certainly not needed. Just eliminating the regex will give a massive speed boost.
[edit] Ok, so there are 400, and some of them are slightly more complicated. The pattern remains the same: substantial substrings with little variation, which can be located efficiently. If you know that "PART OF" occurs in your input, checking whether "AS PART OF" occurs can be done in approximately one nanosecond. And if "PART OF" doesn't occur, you don't need to check at all whether "AS PART OF" occurs.
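To make that concrete, here's a hedged sketch of gating a group of related patterns behind one shared substring check (the patterns shown are illustrative stand-ins, not the actual 400 from the app):

```csharp
using System;
using System.Text.RegularExpressions;

static class PartOfPrefilter
{
    // Illustrative patterns that all contain the literal "PART OF".
    static readonly Regex[] Patterns =
    {
        new Regex(@".*AS PART OF.*", RegexOptions.Compiled),
        new Regex(@".*IS PART OF.*", RegexOptions.Compiled),
    };

    public static bool AnyMatch(string input)
    {
        // One cheap scan for the shared literal gates all related regexes:
        // if "PART OF" is absent, none of these patterns can match.
        if (input.IndexOf("PART OF", StringComparison.Ordinal) < 0)
            return false;

        foreach (var pattern in Patterns)
            if (pattern.IsMatch(input))
                return true;
        return false;
    }
}
```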
Now, 400 regexes is not much. If you had 40,000, it would be worthwhile to automate checking for common substrings. At the moment, you might just run each regex in turn against the other 399 regex strings to get a first cut: .*PART OF.* will match ".*AS PART OF.*".
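A small sketch of that "first cut" idea, treating each pattern's source text as an input for the others (a heuristic, not an exact subsumption test):

```csharp
using System;
using System.Text.RegularExpressions;

class SubsumptionCheck
{
    static void Main()
    {
        var patterns = new[] { ".*PART OF.*", ".*AS PART OF.*" };

        // If pattern i matches the *source text* of pattern j, any input
        // containing j's literal also contains i's, so i can act as a gate.
        for (int i = 0; i < patterns.Length; i++)
            for (int j = 0; j < patterns.Length; j++)
                if (i != j && Regex.IsMatch(patterns[j], patterns[i]))
                    Console.WriteLine($"\"{patterns[i]}\" covers \"{patterns[j]}\"");
        // Prints: ".*PART OF.*" covers ".*AS PART OF.*"
    }
}
```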
For the same reason, you don't need other optimizations either. With 40,000 regexes to match, I'd calculate the frequency of each letter pair. E.g. the input FOO AS PART OF BAR has letter pairs FO, OO, AS, PA, AR (twice), RT, OF, BA. This cannot match .*FOR EXAMPLE.* as the letter pair EX is missing.
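Here's a hedged sketch of that letter-pair pre-filter (the helper names are invented; the answer only describes the idea): a literal can only occur in the input if every one of its within-word letter pairs occurs in the input.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class BigramFilter
{
    // Within-word letter pairs, e.g. "FOO AS PART OF BAR" ->
    // FO, OO, AS, PA, AR, RT, OF, BA (pairs spanning a space are skipped).
    static HashSet<string> Bigrams(string s) =>
        new HashSet<string>(
            Enumerable.Range(0, Math.Max(0, s.Length - 1))
                      .Select(i => s.Substring(i, 2))
                      .Where(p => !p.Contains(" ")));

    // Cheap rule-out: "FOR EXAMPLE" needs the pair EX, which
    // "FOO AS PART OF BAR" lacks, so its regex never has to run.
    public static bool MightContain(string input, string literal) =>
        Bigrams(literal).IsSubsetOf(Bigrams(input));
}
```

In practice you'd compute the input's bigram set once and test all 40,000 required literals against it, only running the regexes that survive the filter.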