(F) Lex,我如何匹配否定?

时间:2022-03-23 09:40:30

Some language grammars use negations in their rules. For example, in the Dart specification the following rule is used:

一些语言语法在他们的规则中使用否定。例如,在Dart规范中使用了以下规则:

~('\'|'"'|'$'|NEWLINE)

Which means match anything that is not one of the rules inside the parenthesis. Now, I know in flex I can negate character rules (ex: [^ab] , but some of the rules I want to negate could be more complicated than a single character so I don't think I could use character rules for that. For example I may need to negate the sequence '"""' for multiline strings but I'm not sure what the way to do it in flex would be.

这意味着匹配任何不在括号内的规则。现在,我知道我在flex可以否定字符规则(例:[^ ab],但有些规则我想否定可能比一个更复杂的所以我不认为我可以用字符规则。例如,我可能需要否定多行字符串的序列“”“”,但我不确定在flex中使用的方法是什么。

1 个解决方案

#1


7  

(TL;DR: Skip down to the bottom for a practical answer.)

(TL;DR:跳到下面去找一个实用的答案。)

The inverse of any regular language is a regular language. So in theory it is possible to write the inverse of a regular expression as a regular expression. Unfortunately, it is not always easy.

任何正规语言的逆都是一种正规语言。所以理论上可以把正则表达式的逆写成正则表达式。不幸的是,这并不总是那么容易。

The """ case, at least, is not too difficult.

至少,“”这个例子并不太难。

First, let's be clear about what we are trying to match.

首先,让我们明确一下我们想要匹配的是什么。

Strictly speaking "not """" would mean "any string other than """". But that would include, for example, x""".

严格地说“not”“”将意味着“除了”“之外的任何字符串”。但这包括,例如,x"

So it might be tempting to say that we're looking for "any string which does not contain """". (That is, the inverse of .*""".*). But that's not quite correct either. The typical usage is to tokenise an input like:

因此,我们可能会倾向于说,我们正在寻找“任何不包含”的字符串。(也就是,*“。*”的倒数。)但这也不完全正确。典型的用法是让输入像:

"""This string might contain " or ""."""

If we start after the initial """ and look for the longest string which doesn't contain """, we will find:

如果我们从最初的“”开始,寻找不包含“”的最长字符串,我们会发现:

This string might contain " or "".""

whereas what we wanted was:

而我们想要的是:

This string might contain " or "".

So it turns out that we need "any string which does not end with " and which doesn't contain """", which is actually the conjunction of two inverses: (~.*" ∧ ~.*""".*)

因此,我们需要“任何不以”结尾的字符串,而不包含“”,这实际上是两个逆的连接:(~。*“∧~ *”。””。*)

It's (relatively) easy to produce a state diagram for that:(F) Lex,我如何匹配否定?

它(相对)很容易产生一个状态图:

(Note that the only difference between the above and the state diagram for "any string which does not contain """" is that in that state diagram, all the states would be accepting, and in this one states 1 and 2 are not accepting.)

(请注意,上面和状态图之间唯一的区别是“任何不包含”的字符串“”“是在状态图中,所有的状态都是接受的,在这个状态中1和2是不接受的。)

Now, the challenge is to turn that back into a regular expression. There are automated techniques for doing that, but the regular expressions they produce are often long and clumsy. This case is simple, though, because there is only one accepting state and we need only describe all the paths which can end in that state:

现在的挑战是把它变成一个正则表达式。有一些自动化的技术可以做到这一点,但是它们所产生的正则表达式往往又长又笨拙。这个例子很简单,因为只有一个接受状态,我们只需要描述所有可以在这个状态下结束的路径:

([^"]|\"([^"]|\"[^"]))*

This model will work for any simple string, but it's a little more complicated when the string is not just a sequence of the same character. For example, suppose we wanted to match strings terminated with END rather than """. Naively modifying the above pattern would result in:

这个模型对于任何简单的字符串都适用,但是当字符串不只是同一个字符的序列时,它就会稍微复杂一些。例如,假设我们想要匹配以结束而不是“”结尾的字符串。天真地修改上面的模式会导致:

([^E]|E([^N]|N[^D]))*   <--- DON'T USE THIS

but that regular expression will match the string

但是正则表达式将匹配字符串。

ENENDstuff which shouldn't have been matched

The real state diagram we're looking for is (F) Lex,我如何匹配否定?

我们要找的状态图是。

and one way of writing that as a regular expression is:

作为正则表达式的一种写法是

([^E]|E(E|NE)*([^EN]|N[^ED]))

Again, I produced that by tracing all the ways to end up in state 0:

再一次,我通过追踪所有在状态0中结束的方式产生了:

[^E] stays in state 0
E    in state 1:
     (E|NE)*: stay in state 1
     [^EN]: back to state 0
     N[^ED]:back to state 0 via state 2

This can be a lot of work, both to produce and to read. And the results are error-prone. (Formal validation is easier with the state diagrams, which are small for this class of problems, rather than with the regular expressions which can grow to be enormous).

这可以是很多工作,无论是生产还是阅读。结果是容易出错的。(通过状态图,正式的验证更容易,因为状态图对于此类问题来说是小的,而不是能够增长到巨大的正则表达式)。


A practical and scalable solution

Practical Flex rulesets use start conditions to solve this kind of problem. For example, here is how you might recognize python triple-quoted strings:

实用的Flex规则集使用启动条件来解决这类问题。例如,以下是您如何识别python三引号字符串:

%x TRIPLEQ
start \"\"\"
end   \"\"\"
%%

{start}        { BEGIN( TRIPLEQ ); /* Note: no return, flex continues */ }

<TRIPLEQ>.|\n  { /* Append the next token to yytext instead of
                  * replacing yytext with the next token
                  */
                 yymore();
                 /* No return yet, flex continues */
               }
<TRIPLEQ>{end} { /* We've found the end of the string, but
                  * we need to get rid of the terminating """
                  */
                 yylval.str = malloc(yyleng - 2);
                 memcpy(yylval.str, yytext, yyleng - 3);
                 yylval.str[yyleng - 3] = 0;
                 return STRING;
               }

This works because the . rule in start condition TRIPLEQ will not match " if the " is part of a string matched by {end}; flex always chooses the longest match. It could be made more efficient by using [^"]+|\"|\n instead of .|\n, because that would result in longer matches and consequently fewer calls to yymore(); I didn't write it that way above simply for clarity.

这是因为。在开始条件下,TRIPLEQ将不匹配“如果”是与{end}匹配的字符串的一部分;flex总是选择最长的匹配。它可以更有效的利用[^]+ | \”| | \ n \ n代替,因为这将导致更长的匹配,从而减少调用yymore();我没有把它写得太清楚。

This model is much easier to extend. In particular, if we wanted to use <![CDATA[ as the start and ]]> as the terminator, we'd only need to change the definitions

这个模型更容易扩展。特别是,如果我们想使用 作为终结者,我们只需要改变定义。

start "<![CDATA["
end   "]]>"

(and possibly the optimized rule inside the start condition, if using the optimization suggested above.)

(如果使用上面建议的优化,则可能在开始条件中优化规则)。

#1


7  

(TL;DR: Skip down to the bottom for a practical answer.)

(TL;DR:跳到下面去找一个实用的答案。)

The inverse of any regular language is a regular language. So in theory it is possible to write the inverse of a regular expression as a regular expression. Unfortunately, it is not always easy.

任何正规语言的逆都是一种正规语言。所以理论上可以把正则表达式的逆写成正则表达式。不幸的是,这并不总是那么容易。

The """ case, at least, is not too difficult.

至少,“”这个例子并不太难。

First, let's be clear about what we are trying to match.

首先,让我们明确一下我们想要匹配的是什么。

Strictly speaking "not """" would mean "any string other than """". But that would include, for example, x""".

严格地说“not”“”将意味着“除了”“之外的任何字符串”。但这包括,例如,x"

So it might be tempting to say that we're looking for "any string which does not contain """". (That is, the inverse of .*""".*). But that's not quite correct either. The typical usage is to tokenise an input like:

因此,我们可能会倾向于说,我们正在寻找“任何不包含”的字符串。(也就是,*“。*”的倒数。)但这也不完全正确。典型的用法是让输入像:

"""This string might contain " or ""."""

If we start after the initial """ and look for the longest string which doesn't contain """, we will find:

如果我们从最初的“”开始,寻找不包含“”的最长字符串,我们会发现:

This string might contain " or "".""

whereas what we wanted was:

而我们想要的是:

This string might contain " or "".

So it turns out that we need "any string which does not end with " and which doesn't contain """", which is actually the conjunction of two inverses: (~.*" ∧ ~.*""".*)

因此,我们需要“任何不以”结尾的字符串,而不包含“”,这实际上是两个逆的连接:(~。*“∧~ *”。””。*)

It's (relatively) easy to produce a state diagram for that:(F) Lex,我如何匹配否定?

它(相对)很容易产生一个状态图:

(Note that the only difference between the above and the state diagram for "any string which does not contain """" is that in that state diagram, all the states would be accepting, and in this one states 1 and 2 are not accepting.)

(请注意,上面和状态图之间唯一的区别是“任何不包含”的字符串“”“是在状态图中,所有的状态都是接受的,在这个状态中1和2是不接受的。)

Now, the challenge is to turn that back into a regular expression. There are automated techniques for doing that, but the regular expressions they produce are often long and clumsy. This case is simple, though, because there is only one accepting state and we need only describe all the paths which can end in that state:

现在的挑战是把它变成一个正则表达式。有一些自动化的技术可以做到这一点,但是它们所产生的正则表达式往往又长又笨拙。这个例子很简单,因为只有一个接受状态,我们只需要描述所有可以在这个状态下结束的路径:

([^"]|\"([^"]|\"[^"]))*

This model will work for any simple string, but it's a little more complicated when the string is not just a sequence of the same character. For example, suppose we wanted to match strings terminated with END rather than """. Naively modifying the above pattern would result in:

这个模型对于任何简单的字符串都适用,但是当字符串不只是同一个字符的序列时,它就会稍微复杂一些。例如,假设我们想要匹配以结束而不是“”结尾的字符串。天真地修改上面的模式会导致:

([^E]|E([^N]|N[^D]))*   <--- DON'T USE THIS

but that regular expression will match the string

但是正则表达式将匹配字符串。

ENENDstuff which shouldn't have been matched

The real state diagram we're looking for is (F) Lex,我如何匹配否定?

我们要找的状态图是。

and one way of writing that as a regular expression is:

作为正则表达式的一种写法是

([^E]|E(E|NE)*([^EN]|N[^ED]))

Again, I produced that by tracing all the ways to end up in state 0:

再一次,我通过追踪所有在状态0中结束的方式产生了:

[^E] stays in state 0
E    in state 1:
     (E|NE)*: stay in state 1
     [^EN]: back to state 0
     N[^ED]:back to state 0 via state 2

This can be a lot of work, both to produce and to read. And the results are error-prone. (Formal validation is easier with the state diagrams, which are small for this class of problems, rather than with the regular expressions which can grow to be enormous).

这可以是很多工作,无论是生产还是阅读。结果是容易出错的。(通过状态图,正式的验证更容易,因为状态图对于此类问题来说是小的,而不是能够增长到巨大的正则表达式)。


A practical and scalable solution

Practical Flex rulesets use start conditions to solve this kind of problem. For example, here is how you might recognize python triple-quoted strings:

实用的Flex规则集使用启动条件来解决这类问题。例如,以下是您如何识别python三引号字符串:

%x TRIPLEQ
start \"\"\"
end   \"\"\"
%%

{start}        { BEGIN( TRIPLEQ ); /* Note: no return, flex continues */ }

<TRIPLEQ>.|\n  { /* Append the next token to yytext instead of
                  * replacing yytext with the next token
                  */
                 yymore();
                 /* No return yet, flex continues */
               }
<TRIPLEQ>{end} { /* We've found the end of the string, but
                  * we need to get rid of the terminating """
                  */
                 yylval.str = malloc(yyleng - 2);
                 memcpy(yylval.str, yytext, yyleng - 3);
                 yylval.str[yyleng - 3] = 0;
                 return STRING;
               }

This works because the . rule in start condition TRIPLEQ will not match " if the " is part of a string matched by {end}; flex always chooses the longest match. It could be made more efficient by using [^"]+|\"|\n instead of .|\n, because that would result in longer matches and consequently fewer calls to yymore(); I didn't write it that way above simply for clarity.

这是因为。在开始条件下,TRIPLEQ将不匹配“如果”是与{end}匹配的字符串的一部分;flex总是选择最长的匹配。它可以更有效的利用[^]+ | \”| | \ n \ n代替,因为这将导致更长的匹配,从而减少调用yymore();我没有把它写得太清楚。

This model is much easier to extend. In particular, if we wanted to use <![CDATA[ as the start and ]]> as the terminator, we'd only need to change the definitions

这个模型更容易扩展。特别是,如果我们想使用 作为终结者,我们只需要改变定义。

start "<![CDATA["
end   "]]>"

(and possibly the optimized rule inside the start condition, if using the optimization suggested above.)

(如果使用上面建议的优化,则可能在开始条件中优化规则)。