Flex：匹配signed int与加/减

I have a flex file that returns me tokens depending on an input file. I currently have it set up to return a token INT when I see a signed int in flex, and a token ADD or SUB when I see an addition or subtraction.

我有一个flex文件,根据输入文件返回令牌。我当前设置为当我在flex中看到一个带符号的int时返回一个令牌INT,当我看到一个加法或减法时,它就是一个令牌ADD或SUB。

I have a macro defined for a signed int in flex

我在flex中为signed int定义了一个宏

signedint = [+-]?[0-9]+

When flex sees this pattern, it returns an INT token.

当flex看到这个模式时,它返回一个INT标记。

However, I have run into the problem of being able to distinguish a signed integer from an addition or subtraction. For instance,

但是,我遇到了能够区分有符号整数和加法或减法的问题。例如,

5 + 2

returns 3 tokens: INT, ADD, INT. This is fine.

返回3个标记:INT,ADD,INT。这可以。

When the spaces are missing, however, I'm running into trouble:

然而,当空间丢失时,我遇到了麻烦:

5+2
5 +2

Either of these return just two tokens (2 ints), since flex sees 5 as a signedint (correctly) and returns it, and then sees '+2' and sees another signed int (it doesn't return anything for whitespace).

其中任何一个只返回两个标记(2个整数),因为flex将5视为signedint(正确)并返回它,然后看到'+2'并看到另一个signed int(它不返回任何空格)。

Any ideas on how to get it to distinguish between a signed int and an addition/subtraction?

关于如何区分有符号的int和加/减的任何想法?

2 个解决方案

#1

The difficulty of solving this problem at the lexical level is the reason that many languages don't even try. The parser can easily distinguish between a unary prefix + or - operator and a binary infix operator with the same spelling, and aside from one corner case (see below), there is no real semantic difference between -2 considered as a prefix minus followed by an integer constant, and -2 considered as a single token. In the parser, you can use constant folding to evaluate constant sub-expressions if you don't want the operators to appear in the AST.

在词汇层面解决这个问题的难度是许多语言甚至都没有尝试的原因。解析器可以很容易地区分一元前缀+或 - 运算符和具有相同拼写的二进制中缀运算符,除了一个角点情况(见下文)之外,-2之间没有真正的语义差异被视为前缀减去后跟一个整数常量,-2被视为单个标记。在解析器中,如果不希望运算符出现在AST中,则可以使用常量折叠来计算常量子表达式。

It is only possible to distinguish between infix and prefix operators during the lexical scan by maintaining lexical state, which effectively duplicates part of the parsing algorithm into a hand-built state machine in the lexical scanner. In the case of ordinary arithmetic expressions, it is only a small part of the parsing algorithm, but even so it is not pretty and complicates the verification of correctness of the lexer/parser combination.

通过维持词法状态,只能在词法扫描期间区分中缀和前缀运算符,这有效地将部分解析算法复制到词法扫描器中的手工构建状态机中。在普通算术表达式的情况下,它只是解析算法的一小部分,但即便如此,它并不漂亮,并且使得词法分析器/解析器组合的正确性验证变得复杂。

Leaving out operator precedence and associativity, which are not relevant here, the grammar for an arithmetic expression can be reduced to something like the following:

省略运算符优先级和关联性(这里不相关),算术表达式的语法可以简化为以下内容:

expr: term
    | expr INFIX term
term: CONSTANT | VARIABLE
    | '(' expr ')'
    | PREFIX term

(That omits postfix operators, including function calls and array subscripts, but the principle is not affected by that.)

(省略了后缀运算符,包括函数调用和数组下标,但原理不受此影响。)

From that grammar, it is easy to derive FIRST, LAST and FOLLOW sets. term can only start with a terminal, and expr can only start with a term, so they both have the same FIRST set:

从该语法中,很容易得出FIRST,LAST和FOLLOW集合。 term只能从一个终端开始,而expr只能以一个术语开头,所以它们都有相同的FIRST集合:

FIRST(expr) = FIRST(term) = { (, PREFIX, CONSTANT, VARIABLE }

By similar reasoning, the LAST sets of term and expr are also identical:

通过类似的推理,最后一组term和expr也是相同的:

LAST(expr) = LAST(term) = { ), CONSTANT, VARIABLE }

And finally, the FOLLOW sets for the non-terminals, based on the observation that a term only appears at the end of an expr, and the expr is either at the end of the input, or appears in the grammar followed by a terminal:

最后,FOLLOW为非终端设置,基于观察一个术语仅出现在expr的末尾,并且expr要么在输入的末尾,要么出现在语法后跟一个终端:

FOLLOW(term) = FOLLOW(expr) = { ), INFIX, $ }

(where $ is the end-of-input mark.)

(其中$是输入结束标记。)

All that lets us compute FOLLOW sets for terminals, using the observation that for each terminal A in LAST of non-terminal V can only be followed by a terminal in FOLLOW(V). (In some grammars that might overestimate the FOLLOW sets, but it doesn't matter in this case.) Which finally gives us:

所有这些让我们计算终端的FOLLOW集合,使用观察结果,对于非终端V的LAST中的每个终端A,只能在FOLLOW(V)中跟随终端。 (在一些可能高估FOLLOW集合的语法中,但在这种情况下并不重要。)最终给了我们:

 terminal           can be followed by
 --------           ------------------
 INFIX              PREFIX, (, CONSTANT, VARIABLE
 (                  PREFIX, (, CONSTANT, VARIABLE
 PREFIX             PREFIX, (, CONSTANT, VARIABLE
 )                  INFIX, ), $
 CONSTANT           INFIX, ), $
 VARIABLE           INFIX, ), $

In short, PREFIX and INFIX never occur in the same context. If the previous token was INFIX, PREFIX or ( (or there was no previous token), then an operator must be PREFIX. Otherwise, the operator must be INFIX. And we can implement that in flex using two start conditions: one for the case where we might see a CONSTANT, and the other for the case where we cannot legally see a CONSTANT. The first one is the INITIAL state.

简而言之,PREFIX和INFIX永远不会出现在相同的上下文中。如果前一个标记是INFIX,PREFIX或((或者没有先前的标记),那么操作符必须是PREFIX。否则,操作符必须是INFIX。我们可以使用两个启动条件在flex中实现它:一个用于case在那里我们可能会看到一个CONSTANT,而另一个我们不能合法地看到一个CONSTANT。第一个是INITIAL状态。

That translates into the following little flex description:

这转化为以下小的flex描述:

%x FINAL
%%

<INITIAL>[-+]?[[:digit:]]+  {
                              BEGIN(FINAL); return CONSTANT;
                            }
<INITIAL>[[:alpha:]][[:alnum:]]* {
                              BEGIN(FINAL); return VARIABLE;
                            }
<INITIAL>[(]                  return yytext[0];
<INITIAL>[-]                  return PREFIX_MINUS;
<INITIAL>[+]                  return PREFIX_PLUS;

<FINAL>[)]                    return yytext[0];
<FINAL>[-+*/] BEGIN(INITIAL); return yytext[0];

<*>.                          /* Signal invalid token */

That's not complete, of course. It leaves out the setting of yylval and the handling of input which is not part of the expression (including newline and other whitespace).

当然,这还不完整。它省略了yylval的设置和输入的处理,它不是表达式的一部分(包括换行符和其他空格)。

Although this solution works, it's easy to see why leaving the issue to the parser is preferred:

虽然此解决方案有效,但很容易理解为什么将问题留给解析器是首选:

%%
[-+*/()]                 return yytext[0];
[[:digit:]]+             return CONSTANT;
[[:alpha:]][[:alnum:]]*  return VARIABLE;

But there is one corner case which needs to be handled carefully. In an N-bit 2's complement representation, it is possible to represent -2^N but it is not possible to represent +2^N, since the maximum positive number is 2^N−1. If signed integers are deferred to the parser as expressions, it is vital that the integer 2^N be permitted, even though it will not fit into the signed integer type being used.

但是有一个角落需要仔细处理。在N比特2的补码表示中,可以表示-2N但是不可能表示+ 2N,因为最大正数是2N-1。如果有符号整数作为表达式推迟到解析器,则允许整数2N至关重要,即使它不适合所使用的有符号整数类型。

That can be accomplished by using an unsigned integer type to pass integer values to the parser, which in turn means that the parser will need to detect the overflow condition.

这可以通过使用无符号整数类型将整数值传递给解析器来实现,这反过来意味着解析器将需要检测溢出条件。

As it happens, that's not how C handles this case, which leads to an interesting anomaly. In C (as above), an integer constant does not include a sign; -2 is two tokens. But integer constants do include an (implicit) type, which in the case of a decimal integer is the smallest signed integer type which can hold the value of the constant. Since unary negation preserves type, the result is that on a 32-bit machine, -2147483648 is of type long (or long long) even though it is representable as an int. This occasionally causes confusion.

碰巧,这不是C处理这种情况的方式,这会导致一个有趣的异常现象。在C中(如上所述),整数常量不包括符号; -2是两个令牌。但是整数常量确实包含一个(隐式)类型,在十进制整数的情况下,它是可以保存常量值的最小有符号整数类型。由于一元否定保留了类型,结果是在32位机器上,-2147483648属于long(或long long)类型,即使它可以表示为int。这偶尔会引起混淆。

#2

-1

signedint = \s*[+-]?\s*[0-9]+\s*

This should work

这应该工作

#1