How to write a regular expression in lex that excludes a specific set of words

Date: 2021-06-10 09:39:48

I've recently started learning lex, so I was practicing and decided to make a program which recognises a declaration of a normal variable. (Sort of)


This is my code:


%{
#include <stdio.h>
%}
dataType "int"|"float"|"char"|"String"
alphaNumeric [_\*a-zA-Z][0-9]*
space [ ]
variable {dataType}{space}{alphaNumeric}+
%option noyywrap
%%
{variable} printf("ok");
. printf("incorrect");
%%
int main(){
yylex();
}

Some cases where the output should be ok:


int var3
int _varR3
int _AA3_

And if I type int float as input, it returns ok, which is wrong because both are reserved words.


So my question is: what should I modify to make my expression reject the 'dataType' words after the space?


Thank you.

2 Answers

#1



This is really not the way to solve this particular problem.


The usual way of doing it would be to write separate pattern rules to recognize keywords and variable names. (Plus a pattern rule to ignore whitespace.) That means that the tokenizer will return two tokens for the input int var3. Recognizing that the two tokens are a valid declaration is the responsibility of the parser, which will repeatedly call the tokenizer in order to parse the token stream.

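A minimal sketch of that two-rule approach (standalone, printing token names instead of returning token codes to a parser; the KEYWORD/IDENTIFIER labels are just for illustration):

```
%{
#include <stdio.h>
%}
%option noyywrap

dataType    int|float|char|String
id          [[:alpha:]_][[:alnum:]_]*

%%

{dataType}      { printf("KEYWORD(%s)\n", yytext); }
{id}            { printf("IDENTIFIER(%s)\n", yytext); }
[[:space:]]+    { /* ignore whitespace between tokens */ }
.               { printf("unexpected character: %s\n", yytext); }

%%

int main(void) {
    return yylex();
}
```

Both the keyword rule and the identifier rule match `int` with the same length, and (f)lex breaks that tie in favor of the rule listed first, so the keyword rule must come before the identifier rule. A real tokenizer would `return` token codes to the parser instead of printing.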

However, if you really want to recognize two words as a single token, it is certainly possible. (F)lex does not allow negative lookaheads in regular expressions, but you can use the pattern matching precedence rule to capture erroneous tokens.


For example, you could do something like this:


dataType       int|float|char|String
id             [[:alpha:]_][[:alnum:]_]*

%%

{dataType}[[:blank:]]+{dataType}   { puts("Error: two types"); }
{dataType}[[:blank:]]+{id}         { puts("Valid declaration"); }

  /* ...  more rules ... */

The above uses POSIX character classes instead of writing out the possible characters. See man isalpha for a list of POSIX character classes; the character class component [:xxxxx:] contains exactly the characters accepted by the isxxxxx standard library function. I fixed the pattern so that it allows more than one space between the dataType and the id, and simplified the pattern for ids.


#2



A preliminary consideration: Typically, detection of the construction you point out is not done at the lexing phase, but at the parsing phase. On yacc/bison, for instance, you would have a rule that only matches a "type" token followed by an "identifier" token.

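A bison grammar fragment for such a rule might look like this (a sketch, omitting the C prologue and error handling; the TYPE and IDENTIFIER token names are hypothetical, and the lexer is assumed to return them for keywords and variable names respectively):

```
%token TYPE IDENTIFIER

%%

declaration
    : TYPE IDENTIFIER      { printf("valid declaration\n"); }
    ;
```

Because TYPE and IDENTIFIER are distinct tokens, an input like int float can never match this rule, and the parser reports a syntax error with no extra work in the lexer.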

To achieve that with lex/flex though, you could consider playing around with the negation (^) and trailing context (/) operators. Or...

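The trailing-context idea could be sketched like this, reusing the dataType and space definitions from the question (a rules-section fragment, not a complete scanner; the / operator makes the rule match the first word only when the lookahead part follows, without consuming it):

```
{dataType}/{space}+{dataType}    { printf("error: reserved word used as a variable name\n"); }
```

Note that when this rule fires, only the first {dataType} is consumed; the lookahead text is scanned again by whatever rules follow.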

If you're running flex, perhaps simply surrounding all your regexes with parentheses and passing the -l flag would do the trick. Note that there are a few differences between lex and flex, as described in the Flex manual.

