I face a challenge to match the input in the following format:
我面临着以下列格式匹配输入的挑战:
- The input consists of key=value pairs. The key starts with slash. The value may be a number or a string in quotes.
- The value may optionally contain escaped quotes, that is quote following by a quote (""). Such escaped quote should be considered a part of value. There is no need to check that escaped quotes are balanced (e.g. ends by another escaped quote).
输入由键=值对组成。关键从斜线开始。该值可以是引号中的数字或字符串。
该值可以选择包含转义引号,即引号后跟引号(“”)。这种逃逸的报价应被视为价值的一部分。无需检查转义的报价是否平衡(例如,以另一个转义报价结束)。
The regular expression should match the given key=value part of the sequence and should not break for long inputs (e.g. value is 10000 characters).
正则表达式应该与序列的给定key = value部分匹配,并且对于长输入不应该中断(例如,值为10000个字符)。
First I came to this solution:
首先我来到这个解决方案:
/(\w+)=(\d+|"(?:""|[^"])+"(?!"))
and it performs not bad, however it fails in Java6 with *Error for long inputs (cashes regexplanet for example). I tried to improve it a bit to run faster:
并且它执行得不错,但是在Java6中,对于长输入(例如cashes regexplanet),*Error失败了。我尝试改进它以便更快地运行:
/(\w+)=(\d+|"(?:""|[^"]+)+"(?!"))
but then if input is not matching, it enters endless loop in backtracking trying to match it.
但是如果输入不匹配,它会在回溯中进入无限循环,试图匹配它。
Then I came to this regex:
然后我来到这个正则表达式:
/(\w+)=(\d+|".+?(?<!")(?:"")*"(?!"))
which is performing slower, but it seems to solve the task.
表现较慢,但它似乎解决了这个任务。
Can anyone suggest a better / faster regex?
任何人都可以建议更好/更快的正则表达式?
Sample input:
/mol_type="protein" /transl_table=11 /note="[CDS] (""multi
line)" nn /organism="""Some"" Sequence" nn /organism="Some ""Sequence"""
/translation="MHPSSSRIPHIAVVGVSAIFPGSLDAHGFWRDILSGTDLITDVPSTHWLVE
DYYDPDPSAPDKTYAKRGAFLKDVPFDPLEWGVPPSIVPATDTTQLLALIVAKRVLEDAAQGQFE
SMSRERMSVILGVTSAQELLASMVSRIQRPVWAKALRDLGYPEDEVKRACDKIAGNYVPWQESSF
PGLLGNVVAGRIANRLDLGGTNCVTDAACASSLSAMSMAINELALGQSDLVIAGGCDTMNDAFMY
MCFSKTPALSKSGDCRPFSDKADGTLLGEGIAMVALKRLDDAERDGDRVYAVIRGIGSSSDGRSK
SVYAPVPEGQAKALRRTYAAAGYGPETVELMEAHGTGTKAGDAAEFEGLRAMFDESGREDRQWCA
LGSVKSQIGHTKAAAGAAGLFKAIMALHHKVLPPTIKVDKPNPKLDIEKTAFYLNTQARPWIRPG
DHPRRASVSSFGFGGSNFHVALEEYTGPAPKAWRVRALPAELFLLSADTPAALADRARALAKEAE
VPEILRFLARESVLSFDASRPARLGLCATDEADLRKKLEQVAAHLEARPEQALSAPLVHCASGEA
PGRVAFLFPGQGSQYVGMGADALMTFDPARAAWDAAAGVAIADAPLHEVVFPRPVFSDEDRAAQE
ARLRETRWAQPAIGATSLAHLALLAALGVRAEAFAGHSFGEITALHAAGALSAADLLRVARRRGE
LRTLGQVVDHLRASLPAAGPAASASPAAAASVPKASTAAVPAVASVAAPGAAEVERVVMAVVAET
TGYPAEMLGLQMELESDLGIDSIKRVEILSAVRDRTPGLSEVDASALAQLRTLGQVVDHLRASLP
AASAGPAVAAPAAKAPAVAAPTGVSGATPGAAEVERVVMAVVAETTGYPAEMLGLQMELESDLGI
DSIKRVEILSAVRDRTPGLAEVDASALAQLRTLGQVVDHLRASLGPAAVTAGAAPAEPAEEPAST
PLGRWTLVEEPAPAAGLAMPGLFDAGTLVITGHDAIGPALVAALAARGIAAEYAPAVPRGARGAV
FLGGLRELATADAALAVHREAFLAAQAIAAKPALFVTVQDTGGDFGLAGSDRAWVGGLPGLVKTA
ALEWPEASCRAIDLERAGRSDGELAEAIASELLSGGVELEIGLRADGRRTTPRSVRQDAQPGPLP
LGPSDVVVASGGARGVTAATLIALARASHARFALLGRTALEDEPAACRGADGEAALKAALVKAAT
SAGQRVTPAEIGRSVAKILANREVRATLDAIRAAGGEALYVPVDVNDARAVAAALDGVRGALGPV
TAIVHGAGVLADKLVAEKTVEQFERVFSTKVDGLRALLGATAGDPLKAIVLFSSIAARGGNKGQC
DYAMANEVLNKVAAAEAARRPGCRVKSLGWGPWQGGMVNAALEAHFAQLGVPLIPLAAGAKMLLD
ELCDASGDRGARGQGGAPPGAVELVLGAEPKALAAQGHGGRVALAVRADRATHPYLGDHAINGVP
VVPVVIALEWFARAARACRPDLVVTELRDVRVLRGIKLAAYESGGEVFRVDCREVSNGHGAVLAA
ELRGPQGALHYAATIQMQQPEGRVAPKGPAAPELGPWPAGGELYDGRTLFHGRDFQVIRRLDGVS
RDGIAGTVVGLREAGWVAQPWKTDPAALDGGLQLATLWTQHVLGGAALPMSVGALHTFAEGPSDG
PLRAVVRGQIVARDRTKADIAFVDDRGSLVAELRDVQYVLRPDTARGQA"
/note="primer of Streptococcus pneumoniae
Expected output (from regexhero.net):
预期输出(来自regexhero.net):
3 个解决方案
#1
3
In order to fail in a reasonable time you need, indeed, to avoid catastrophic backtracking. This can be done using atomic grouping (?>...)
:
为了在合理的时间内失败,你确实需要避免灾难性的回溯。这可以使用原子分组(?> ...)来完成:
/(\w+)=(\d+|"(?>(?>""|[^"]+)+)"(?!"))
# (?>(?>""|[^"]+)+)
(?> # throw away the states created by (...)+
(?> # throw away the states created by [^"]+
""|[^"]+
)+
)
Your issue when using (?:""|[^"]+)+
on a string that will never match, is linked to the fact that each time you match a new [^"]
character the regex engine can choose to use the inner or outer +
quantifier.
在一个永远不会匹配的字符串上使用(?:“”| [^“] +)+的问题与每次匹配新的[^”]字符的事实相关联,正则表达式引擎可以选择使用内部或外部+量词。
This leads to a lot of possibilities for backtracking, and before returning a failure the engine has to try them all.
这导致回溯的很多可能性,并且在返回故障之前,引擎必须尝试所有这些。
We know that if we haven't found a match by the time the engine reaches the end, we never will: all we need to do is throw away the backtracking positions to avoid the issue, and that's what atomic grouping is for.
我们知道,如果我们在引擎到达终点时没有找到匹配,我们永远不会:我们需要做的就是抛弃回溯位置以避免问题,这就是原子分组的用途。
See a DEMO: 24 steps on failure, while preserving the speed on the successful cases (not a real benchmarking tool, but catastrophic backtracking would be pretty easy to spot)
看一个DEMO:失败的24个步骤,同时保留成功案例的速度(不是真正的基准测试工具,但灾难性的回溯很容易发现)
#2
2
Your initial regex was already quite good, but it was more complicated than necessary, leading to catastrophic backtracking.
你的初始正则表达式已经非常好了,但它比必要的更复杂,导致灾难性的回溯。
You should use
你应该用
/(\w+)=(\d+|"(?:""|[^"])*"(?!"))
See it live on regex101.com.
在regex101.com上查看。
Explanation:
/ # Slash
(\w+) # Indentifier --> Group 1
= # Equals sign
( # Group 2:
\d+ # Either a number
| # or
"(?:""|[^"])*" # a quoted string
(?!") # unless another quote follows
) # End of group 2
#3
2
How about this one:
这个怎么样:
/(\w+)=("(?:[^"]|"")*"|\d+)
(Note that the /
is part of the regex here. Escape it as appropriate for your host language.)
(请注意,/是正则表达式的一部分。根据您的宿主语言进行转义。)
If your regex engine supports it (Java does), make the *
possessive:
如果您的正则表达式引擎支持它(Java确实如此),请使*占有:
/(\w+)=("(?:[^"]|"")*+"|\d+)
After some debugging the latter expression can be improved to:
经过一些调试后,后一个表达式可以改进为:
/(\w+)=("(?:""|[^"]*+)*+"|\d++)
Note the double *+)*+
which allows matching contiguous text in one step while not being susceptible to catastrophic backtracking.
请注意double * +)* +,它允许在一个步骤中匹配连续文本,同时不易受到灾难性回溯的影响。
#1
3
In order to fail in a reasonable time you need, indeed, to avoid catastrophic backtracking. This can be done using atomic grouping (?>...)
:
为了在合理的时间内失败,你确实需要避免灾难性的回溯。这可以使用原子分组(?> ...)来完成:
/(\w+)=(\d+|"(?>(?>""|[^"]+)+)"(?!"))
# (?>(?>""|[^"]+)+)
(?> # throw away the states created by (...)+
(?> # throw away the states created by [^"]+
""|[^"]+
)+
)
Your issue when using (?:""|[^"]+)+
on a string that will never match, is linked to the fact that each time you match a new [^"]
character the regex engine can choose to use the inner or outer +
quantifier.
在一个永远不会匹配的字符串上使用(?:“”| [^“] +)+的问题与每次匹配新的[^”]字符的事实相关联,正则表达式引擎可以选择使用内部或外部+量词。
This leads to a lot of possibilities for backtracking, and before returning a failure the engine has to try them all.
这导致回溯的很多可能性,并且在返回故障之前,引擎必须尝试所有这些。
We know that if we haven't found a match by the time the engine reaches the end, we never will: all we need to do is throw away the backtracking positions to avoid the issue, and that's what atomic grouping is for.
我们知道,如果我们在引擎到达终点时没有找到匹配,我们永远不会:我们需要做的就是抛弃回溯位置以避免问题,这就是原子分组的用途。
See a DEMO: 24 steps on failure, while preserving the speed on the successful cases (not a real benchmarking tool, but catastrophic backtracking would be pretty easy to spot)
看一个DEMO:失败的24个步骤,同时保留成功案例的速度(不是真正的基准测试工具,但灾难性的回溯很容易发现)
#2
2
Your initial regex was already quite good, but it was more complicated than necessary, leading to catastrophic backtracking.
你的初始正则表达式已经非常好了,但它比必要的更复杂,导致灾难性的回溯。
You should use
你应该用
/(\w+)=(\d+|"(?:""|[^"])*"(?!"))
See it live on regex101.com.
在regex101.com上查看。
Explanation:
/ # Slash
(\w+) # Indentifier --> Group 1
= # Equals sign
( # Group 2:
\d+ # Either a number
| # or
"(?:""|[^"])*" # a quoted string
(?!") # unless another quote follows
) # End of group 2
#3
2
How about this one:
这个怎么样:
/(\w+)=("(?:[^"]|"")*"|\d+)
(Note that the /
is part of the regex here. Escape it as appropriate for your host language.)
(请注意,/是正则表达式的一部分。根据您的宿主语言进行转义。)
If your regex engine supports it (Java does), make the *
possessive:
如果您的正则表达式引擎支持它(Java确实如此),请使*占有:
/(\w+)=("(?:[^"]|"")*+"|\d+)
After some debugging the latter expression can be improved to:
经过一些调试后,后一个表达式可以改进为:
/(\w+)=("(?:""|[^"]*+)*+"|\d++)
Note the double *+)*+
which allows matching contiguous text in one step while not being susceptible to catastrophic backtracking.
请注意double * +)* +,它允许在一个步骤中匹配连续文本,同时不易受到灾难性回溯的影响。