Checking for wrong identifier patterns in flex

Date: 2022-03-23 09:40:42

I am just trying to learn flex, and here is some sample flex code that detects identifiers and digits. I want to improve the code so that it identifies wrong identifier and digit patterns (for example: 1var, 12.2.2, 5. etc.). How can I detect them? Which changes do I have to make in the code?

My sample code is given below:

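%{
#include <stdio.h>   /* printf */
%}

%option noyywrap

ID       [a-zA-Z][a-zA-Z0-9]*
DIGIT    [0-9]

%%
[ \t]+   { /* skip spaces and tabs */ }
{ID}     { printf("\n identifier found"); }
{DIGIT}  { printf("\nDigit found"); }
.        { /* ignore any other character */ }
%%

int main(int argc, char *argv[]) {
    yylex();
    return 0;
}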

1 solution

#1


This is not a trivial question, because which errors the lexer detects is very much part of the overall design of a language processing system, and depends on the syntactic and lexical structure of the language. Some elements that, on inspection, seem like lexical errors may turn out not to be. It really depends on the nature of the language; for example, in Fortran, spaces have no meaning, and there is the famous example:

        DO 10 I = 1.10

Is this the keyword DO, the label 10, the identifier I, the operator = and the number 1.10? Actually, no: because spaces are ignored, it is the identifier DO10I, the operator = and the number 1.10; whereas

        DO 10 I = 1,10

does have the keyword DO, followed by the label 10, the identifier I, and so on.

So sometimes, when seeing the sequence 123abc, you cannot automatically assume it is just an invalid identifier. Sometimes it is better to return it as the two valid tokens NUMBER and IDENTIFIER and leave it to the parser to report any errors that result. The areas where this approach needs care are exponents in floating-point constants and integer ranges. An example of an exponent would be:

-1234.457E+12

This has a letter embedded in a number, and would need to be returned as some kind of NUMBER token. Similarly, the overloading of the sign operators causes problems for lexical error detection. The previous number has two signs, - and +. If they are recognised as part of the number, when do the symbols - and + get recognised as the SUBTRACT and ADD tokens? Take for example this expression:

i=i-1;

Is this IDENTIFIER, EQUALS, IDENTIFIER, NUMBER? No, of course not. So we cannot always assume that -1 is just a NUMBER.
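
To make the point about signs concrete, here is a minimal hand-written sketch in C of a scanner that never folds a leading sign into a number, and accepts a sign only directly after the exponent letter. The token names and the function next_token are my own illustrative choices, not part of flex or of the code in the question.

```c
#include <ctype.h>
#include <stddef.h>

/* Token kinds; the names are illustrative, not taken from any real grammar. */
typedef enum {
    T_END, T_IDENTIFIER, T_NUMBER, T_EQUALS,
    T_ADD, T_SUBTRACT, T_SEMICOLON, T_UNKNOWN
} TokenKind;

/*
 * Scan one token starting at *p, advance *p past it, and return its kind.
 * A leading sign is always returned as its own token; a sign is consumed as
 * part of a number only when it directly follows the exponent letter.
 */
TokenKind next_token(const char **p)
{
    const char *s = *p;
    TokenKind kind;

    while (*s == ' ' || *s == '\t')                 /* skip blanks */
        s++;

    if (*s == '\0') {
        kind = T_END;
    } else if (isalpha((unsigned char)*s)) {
        while (isalnum((unsigned char)*s))
            s++;
        kind = T_IDENTIFIER;
    } else if (isdigit((unsigned char)*s)) {
        while (isdigit((unsigned char)*s))
            s++;
        if (s[0] == '.' && isdigit((unsigned char)s[1])) {   /* fraction */
            s++;
            while (isdigit((unsigned char)*s))
                s++;
        }
        if (*s == 'E' || *s == 'e') {                        /* exponent */
            const char *e = s + 1;
            if (*e == '+' || *e == '-')
                e++;
            if (isdigit((unsigned char)*e)) {   /* commit only if digits follow */
                while (isdigit((unsigned char)*e))
                    e++;
                s = e;
            }
        }
        kind = T_NUMBER;
    } else {
        switch (*s++) {
        case '=': kind = T_EQUALS;    break;
        case '+': kind = T_ADD;       break;
        case '-': kind = T_SUBTRACT;  break;
        case ';': kind = T_SEMICOLON; break;
        default:  kind = T_UNKNOWN;   break;
        }
    }

    *p = s;
    return kind;
}
```

Called repeatedly on i=i-1; this yields IDENTIFIER, EQUALS, IDENTIFIER, SUBTRACT, NUMBER, SEMICOLON. On 123abc it yields NUMBER followed by IDENTIFIER, and whether that pair is legal is left to the parser. On -1234.457E+12 it yields SUBTRACT followed by a single NUMBER, so a parser would have to treat the leading minus as a unary operator.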

The integer range mentioned earlier, which in many languages (Pascal in particular) is written as 1..8, using two dots between the lower and upper bounds, causes difficulties when handling floating-point constants like 1.2.
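
One way to cope with this, sketched below in C under my own assumptions, is to consume a dot as part of a number only when a digit follows it, so that the two dots of a range are left for the next token. The helper name number_length is hypothetical.

```c
#include <ctype.h>
#include <stddef.h>

/*
 * Return the length of the numeric lexeme at the start of s, or 0 if s does
 * not start with a digit. A '.' is consumed only when a digit follows it, so
 * the ".." of an integer range is left for the next token.
 */
size_t number_length(const char *s)
{
    const char *p = s;

    if (!isdigit((unsigned char)*p))
        return 0;
    while (isdigit((unsigned char)*p))          /* integer part */
        p++;
    if (p[0] == '.' && isdigit((unsigned char)p[1])) {
        p++;                                    /* the decimal point */
        while (isdigit((unsigned char)*p))      /* fraction digits */
            p++;
    }
    return (size_t)(p - s);
}
```

With this rule, 1..8 gives a number of length 1 (just the 1), while 1.2 gives a number of length 3. Note that 12.2.2 is simply split after 12.2, and 5. after 5, so neither is reported as an error at this level; the leftover dot becomes the next token's problem.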

So the question "How do I check for ill-formed identifiers and numbers in a lexer?" is quite loaded, and suggests the asker may not yet have fully absorbed the subject area. Questions like this are often posed in class tests, as they are a good way for the instructor to see whether the student has a deeper knowledge of language processing, or just answers superficially by attempting to write patterns for such objects.

As just mentioned, the naive answer would be just to write regular expression patterns to match the examples of invalid lexemes.

For example, I could add the patterns:

[0-9]+\.[0-9]+(\.[0-9]+)+    {printf("Bad float: %s\n", yytext);}
[0-9]+[a-zA-Z][a-zA-Z0-9]*   {printf("Bad Identifier: %s\n", yytext);}

But most compilers do not do this. The only lexical errors most compilers detect are unclosed strings and comments. This is also why most languages do not allow newlines in strings: an unclosed string can then be detected easily.
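
As a rough illustration of why the newline rule helps, here is a small C sketch that scans a double-quoted string and gives up as soon as it reaches a newline or the end of the input. The function name string_literal_length and the backslash escape convention are assumptions of the sketch, not something taken from a particular compiler.

```c
#include <stddef.h>

/*
 * s should point at the opening double quote of a string literal.
 * Returns the length of the literal including both quotes, or -1 if s does
 * not start with a quote, or if a newline or the end of input is reached
 * before the closing quote (an unclosed string).
 * A backslash escapes the character after it (an assumption of this sketch).
 */
long string_literal_length(const char *s)
{
    const char *p;

    if (*s != '"')
        return -1;
    p = s + 1;                              /* skip the opening quote */

    while (*p != '\0' && *p != '\n') {
        if (*p == '"')
            return (long)(p - s) + 1;       /* include the closing quote */
        if (*p == '\\' && p[1] != '\0' && p[1] != '\n')
            p++;                            /* skip the escaped character */
        p++;
    }
    return -1;                              /* unclosed string */
}
```

Because the scan stops at the first newline, an unclosed string is reported on the line where it starts instead of swallowing the rest of the file.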
