How can I optimize regular expression performance?

Date: 2021-11-20 13:20:31

I have a very long regular expression. My regex is a combination of around 5000 or more phrases.

The text I am running the regex against is also large, around 5 KB.

Because both the regex and the input text are huge, the regex takes at least 2 minutes to execute, which is not acceptable in my project.

So, I would like to know how I can optimize this. One way I can think of is to split the regex and use multiple threads to minimize the execution time. Is this the correct option or is there any other way?

Part of my regex looks like this :

(ACS|ADDR.com Technologies|ADP private limited|ADP|ADP India private limited|AIT Software Services PTE limited|AMK Technologies private limited|ANMSoft Technologies private limited|ANZ Information Technology private limited|ASD Global India private Limited|ASD India private Limited|ASM Technologies private limited|AXA Group Solutions India private limited|AXA technology India limited|Aarkay Infonet private limited|AbsolutData Research and Analytics private limited|Accenture India private limited|Accenture Services India|Accenture Services P Limited|Accenture Services private Limited|Accenture|Accenture Software Private Limited|Accurum India private limited|AceTechnologies Inc|Aclat Inc|AcmeCeeYess Softech Private Limited|Adaequare India private limited|Adaequare Info private limited|Adea International private limited|Adea Technologies|Adeptra|Aditi Technologies|Adobe Systems|Adroit Business Solutions|Adroit and Claretdene Infotech private limited|Affron Infotech|Agile Software Enterprise private limited|Agilent Technologies International private limited|Akebono Soft Technologies private limited|AkebonoSoft Technologies private limited|Akmin Technologies|Algorhythm Technologies private limited|Allsec Technologies private limited|Alphonso Informex private limited|Altria Client Services|Altruist India private limited|Amdocs|Amdocs Development Center India private limited|Amdocs Development Centre India|American CyberSystems|American Express Service India private limited|American Stock Exchange|Amrok Securities|Anish Information Technology private limited|Ankhnet Informations private limited|Apex Technologies private limited|AppLabs|AppLabs Technologies private limited|Appshark India|Apptix Software private limited|Aquila Technologies|Arcot R and D Software private limited|Arsin Systems private limited|Ascendum Solutions private limited|AskMe Software private limited|Atos Origin private limited|Atos Origin|Atos Origin India private limited|Aurigo Software Technologies 
private limited|Aurona Technologies private limited|Autopower Software Solutions|Aztecsoft|BMC Software India private limited|Balasai Net private limited|Bayon Solutions private limited|Beachwood Computing Limited|Birlasoft limited|Blue Bird Technologies private limited|Blue Fountain Media private limited|Blue Star InfoTech|Boden Inc|Boston|Braahamam Net Solutions private limited|Braahmam Net Solutions private limited|Brain Soft technology private limited|Brigade Corporation Private Limited|Business Link Automation India private limited|BusinessLink Automation private limited|C Ahead Info Technologies India private limited|C.D.I Corporation|CCG India private limited|CEM Solutions|CGI Information Systems and Management Consultants private limited|CGI Information Systems private limited|CGI Information System and Management Consultants private limited|CGI Information and Management private limited|CGI Netvorks|CISCO Systems India private limited|CMC Limited|COMSYS Inc|CORE SHELL TECHNOLOGIES|CRC Software India private limited|CRV Executive Search private limited|CS Software Solutions private Limited|CSC India private Limited|CSS Corp private limited|Cambridge Solutions Limited|Cambridge Solutions|Cambridge Solutions Sdn. Bhd|Candor Ind. private limited|Candor India private limited|Canvas Creatives private limited|Canvera|Capgemini Business Service India Limited|Capgemini private)

I am using C# for this stuff.

Please enlighten !!!!

5 Solutions

#1


8  

You can greatly improve the performance of this regex by prepending \b at the beginning:

\b(ACS| ... |Z)

With \b in front, the engine only attempts the alternation at word boundaries instead of at every character position.
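
A minimal sketch of the effect, written in Python rather than C# (the only code elsewhere in this thread is Python). The short phrase list and the sample text are made up for illustration. The phrases are escaped so that the `.` in a name like `C.D.I Corporation` is matched literally, and they are sorted longest-first, which is an assumption added here so that `ADP` does not pre-empt `ADP India private limited`.

```python
import re

# Hypothetical subset of the phrase list from the question.
phrases = ["ADP", "ADP India private limited", "C.D.I Corporation", "Boston"]

# Escape metacharacters and put longer phrases first (assumption, see above).
alternation = "|".join(
    re.escape(p) for p in sorted(phrases, key=len, reverse=True)
)

unanchored = re.compile("(" + alternation + ")")
anchored = re.compile(r"\b(" + alternation + ")")

text = "She left XADP for ADP India private limited in Boston."

# Without \b the engine can start a match in the middle of "XADP".
print(unanchored.findall(text))
# With \b a match can only start at a word boundary.
print(anchored.findall(text))
```

The anchored version skips the `ADP` inside `XADP`, because there is no word boundary between `X` and `A`.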

#2


7  

You can optimize a regex by using atomic grouping or using possessive quantifiers where possible.

Also, if you have stuff like .* or .+ in your regex, which can be real memory/runtime hogs, replace them with (possessive) character classes (again, if possible).
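
A small Python illustration of the character-class part of this advice. The sample text and both patterns are invented here, not taken from the question. A greedy `.*` first runs to the end of the line and then backtracks, while a negated character class stops at the first delimiter. Python's `re` module only accepts atomic groups `(?>...)` and possessive quantifiers such as `*+` from version 3.11 on, so this sketch deliberately avoids them.

```python
import re

text = 'name="Accenture" city="Pune"'

# Greedy dot: consumes the rest of the line, then backtracks to the last quote.
greedy = re.compile(r'"(.*)"')

# Negated character class: can never run past the closing quote.
narrow = re.compile(r'"([^"]*)"')

print(greedy.search(text).group(1))
print(narrow.findall(text))
```

The greedy pattern swallows everything between the first and the last quote, while the character-class pattern returns the two quoted values separately and needs no backtracking to do so.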

For more specific answers, you'll need to post your regex.

Good luck!

#3


7  

One optimization would be to extract common prefixes. Change occurrences like

(This is some text|This is some other text)

to

This is some (text|other text)

This should also be done on every level. Change occurrences like

ABCD|ADCB|BACD|BADC|BCAD|BCDA|BDAC|BDCA|CABD

to

A(BCD|DCB)|B(A(CD|DC)|C(AD|DA)|D(AC|CA))|CABD

The point of this optimization is that the regex engine won't have to test the same characters multiple times.

It can be achieved by sorting the phrases and comparing successive elements. Be careful not to split at metacharacters: you don't want to split in the middle of .* or \. (an escaped dot).

Another way would be to use a trie to find the prefixes. This is more robust, but a little more complicated.
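
The trie approach could be sketched roughly as follows, in Python. The helper names `build_trie` and `trie_to_regex` are hypothetical, and the sketch assumes the inputs are literal phrases (each character is escaped individually, so there is no risk of splitting a metacharacter sequence). It uses non-capturing groups `(?:...)` where the hand-written example above uses plain parentheses.

```python
import re

def build_trie(phrases):
    """Build a nested-dict trie; the empty-string key marks the end of a phrase."""
    root = {}
    for phrase in phrases:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node[""] = {}
    return root

def trie_to_regex(node):
    """Turn a trie node into a regex fragment with shared prefixes factored out."""
    ends_here = "" in node
    branches = [
        re.escape(ch) + trie_to_regex(node[ch])
        for ch in sorted(k for k in node if k != "")
    ]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]
    group = "(?:" + "|".join(branches) + ")"
    # If a phrase ends at this node, whatever follows is optional.
    return group + "?" if ends_here else group

phrases = ["ABCD", "ADCB", "BACD", "BADC", "BCAD", "BCDA", "BDAC", "BDCA", "CABD"]
pattern = trie_to_regex(build_trie(phrases))
print(pattern)
```

For the nine strings from the example above this prints `(?:A(?:BCD|DCB)|B(?:A(?:CD|DC)|C(?:AD|DA)|D(?:AC|CA))|CABD)`. When one phrase is a prefix of another (such as `ADP` and `ADP India private limited`), the remainder becomes an optional group, and because `?` is greedy the longer phrase is preferred.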

#4


2  

I know it's old, but still...

"OR" rules (and, for that matter, all the standard rules: concatenation, repetition, and alternation) don't require manual optimization. Most regex engines optimize them while compiling the pattern. Sometimes the opposite is true: having too many groups may hurt performance, because the engine has to save each group's match.

What really hurts performance is lookahead and lookbehind rules, which are not used in your query.

In this case, the author could add a '\b' rule at the beginning and end of the query to require whole-word matching, which would significantly limit the places where the engine starts matching.
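
A short Python sketch of whole-word matching with `\b` on both sides, using an invented three-phrase list and sample text. It also uses a non-capturing group `(?:...)`, which fits the remark above that the engine has to save each capturing group's match.

```python
import re

# Hypothetical subset of the phrase list from the question.
phrases = ["Accenture", "Amdocs", "Boston"]
alternation = "|".join(re.escape(p) for p in phrases)

# (?:...) groups the alternatives without storing a submatch.
whole_word = re.compile(r"\b(?:" + alternation + r")\b")

text = "Bostonian firms: Amdocs, Accenture, and NeoAmdocs."
print(whole_word.findall(text))
```

`Bostonian` and `NeoAmdocs` are rejected because the phrase is not delimited by word boundaries on both sides. Note that a trailing `\b` only acts as a whole-word check when the phrase itself ends in a word character.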

#5


0  

An example with Python (there is also a C-tool to optimize regular expressions at https://github.com/ksx123/regex-optimization):

import hachoir_regex
optimized = hachoir_regex.parse("(ACS|ADDR.com Technologies|ADP private limited|ADP|ADP India private limited|AIT Software Services PTE limited|AMK Technologies private limited|ANMSoft Technologies private limited|ANZ Information Technology private limited|ASD Global India private Limited|ASD India private Limited|ASM Technologies private limited|AXA Group Solutions India private limited|AXA technology India limited|Aarkay Infonet private limited|AbsolutData Research and Analytics private limited|Accenture India private limited|Accenture Services India|Accenture Services P Limited|Accenture Services private Limited|Accenture|Accenture Software Private Limited|Accurum India private limited|AceTechnologies Inc|Aclat Inc|AcmeCeeYess Softech Private Limited|Adaequare India private limited|Adaequare Info private limited|Adea International private limited|Adea Technologies|Adeptra|Aditi Technologies|Adobe Systems|Adroit Business Solutions|Adroit and Claretdene Infotech private limited|Affron Infotech|Agile Software Enterprise private limited|Agilent Technologies International private limited|Akebono Soft Technologies private limited|AkebonoSoft Technologies private limited|Akmin Technologies|Algorhythm Technologies private limited|Allsec Technologies private limited|Alphonso Informex private limited|Altria Client Services|Altruist India private limited|Amdocs|Amdocs Development Center India private limited|Amdocs Development Centre India|American CyberSystems|American Express Service India private limited|American Stock Exchange|Amrok Securities|Anish Information Technology private limited|Ankhnet Informations private limited|Apex Technologies private limited|AppLabs|AppLabs Technologies private limited|Appshark India|Apptix Software private limited|Aquila Technologies|Arcot R and D Software private limited|Arsin Systems private limited|Ascendum Solutions private limited|AskMe Software private limited|Atos Origin private limited|Atos Origin|Atos Origin India private 
limited|Aurigo Software Technologies private limited|Aurona Technologies private limited|Autopower Software Solutions|Aztecsoft|BMC Software India private limited|Balasai Net private limited|Bayon Solutions private limited|Beachwood Computing Limited|Birlasoft limited|Blue Bird Technologies private limited|Blue Fountain Media private limited|Blue Star InfoTech|Boden Inc|Boston|Braahamam Net Solutions private limited|Braahmam Net Solutions private limited|Brain Soft technology private limited|Brigade Corporation Private Limited|Business Link Automation India private limited|BusinessLink Automation private limited|C Ahead Info Technologies India private limited|C.D.I Corporation|CCG India private limited|CEM Solutions|CGI Information Systems and Management Consultants private limited|CGI Information Systems private limited|CGI Information System and Management Consultants private limited|CGI Information and Management private limited|CGI Netvorks|CISCO Systems India private limited|CMC Limited|COMSYS Inc|CORE SHELL TECHNOLOGIES|CRC Software India private limited|CRV Executive Search private limited|CS Software Solutions private Limited|CSC India private Limited|CSS Corp private limited|Cambridge Solutions Limited|Cambridge Solutions|Cambridge Solutions Sdn. Bhd|Candor Ind. private limited|Candor India private limited|Canvas Creatives private limited|Canvera|Capgemini Business Service India Limited|Capgemini private)")
len(str(optimized)) # has length 3048

By comparison, the original string has length 3399. The bigger the string gets, the more optimizations are possible. This uses the hachoir-regex library. You could use this in addition to adding \b, as proposed.
