Flex RegEx: find strings that do not start with a pattern

Date: 2022-08-19 09:38:19

I'm writing a lexer to scan a modified version of an INI file.

I need to recognize the declaration of variables, comments and strings (between double quotes) to be assigned to a variable. For example, this is correct:

# this is a comment
var1 = "string value"

I've managed to recognize these tokens by putting the # at the beginning of the comment regular expression and the " at both ends of the string regular expression, but I don't want to do it this way because later on, in Bison, the token values I get are exactly # this is a comment and "string value". Instead I want this is a comment (without the #) and string value (without the quotes).

These are the regular expressions that I currently use:

[a-zA-Z][a-zA-Z0-9]*    { return TOKEN_VAR_NAME; }
["][^\n\r]*["]          { return TOKEN_STRING;   }
[#][^\n\r]*             { return TOKEN_COMMENT;  }

Obviously there can be any number of white spaces, as well as tabs, inside the string, the comment and between the variable name and the =.

How could I achieve the result I want?

Maybe it will be easier if I show you a complete example of a correct input file and also the grammar rules I use with Flex and Bison.

Correct input file example:

[section1]
var1 = "string value"
var2 = "var1 = text"
# this is a comment
# var5 = "some text" this is also a valid comment

These are the regular expressions for the lexer:

"["                     { return TOKEN::SECTION_START; } 
"]"                     { return TOKEN::SECTION_END; }
"="                     { return TOKEN::ASSIGNMENT; }
[#][^\n\r]*             { return TOKEN::COMMENT; }
[a-zA-Z][a-zA-Z0-9]*    { *m_yylval = yytext; return TOKEN::ID; }
["][^\n\r]*["]          { *m_yylval = yytext; return TOKEN::STRING; }

And these are the syntax rules:

input   : input line
        | line
        ;

line    : section
        | value
        | comment
        ;

section : SECTION_START ID SECTION_END      { createNewSection($2); }
        ;

value   : ID ASSIGNMENT STRING              { addStringValue($1, $3); }
        ;

comment : COMMENT                           { addComment($1); }
        ;

1 Answer

#1

To do that, you have to treat " and # as separate tokens (so they are scanned individually, apart from the text you are scanning now) and use a %s or %x start condition to change which patterns the scanner accepts once it has read one of them.

This has a drawback: you will receive # as an individual token before the comment, and " as a token before and after the string contents, and you'll have to cope with that in your grammar. This complicates both the grammar and the scanner, so I'd discourage you from following this approach.

There is a better solution: write a routine that strips the delimiters, and keep the scanner simple by matching the whole input string in yytext and simply doing

m_yylval = unescapeString(yytext);  /* drop the " chars */  
return STRING; 

or

m_yylval = uncomment(yytext); /* drop the # at the beginning */
return COMMENT;  /* return EOL if you are trying the example at the end */

in the corresponding rule actions, which become part of the yylex() function.
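The answer names unescapeString() and uncomment() without showing them. As a minimal sketch only, assuming the semantic value is a std::string (as the *m_yylval = yytext assignments in the question suggest), the two helpers might look like this. Both bodies are hypothetical; note that this unescapeString() only strips the surrounding quotes and does not interpret backslash escapes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: return the matched text without its surrounding
// double quotes. Assumes `text` is what the STRING rule matched, so it
// should start and end with '"'; anything else is returned unchanged.
std::string unescapeString(const char *text)
{
    assert(text != nullptr);
    std::string s(text);
    if (s.size() >= 2 && s.front() == '"' && s.back() == '"')
        return s.substr(1, s.size() - 2);
    return s;
}

// Hypothetical helper: return the matched text without the leading '#'
// and without the spaces or tabs that immediately follow it.
std::string uncomment(const char *text)
{
    assert(text != nullptr);
    std::string s(text);
    if (s.empty() || s.front() != '#')
        return s;
    std::size_t start = s.find_first_not_of(" \t", 1);
    return start == std::string::npos ? std::string() : s.substr(start);
}
```

With helpers like these, the STRING action from the question could become something along the lines of *m_yylval = unescapeString(yytext); return TOKEN::STRING; while the patterns themselves stay unchanged.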

Note

As comments are normally ignored, the simplest thing is to discard them with a rule like:

"#".*         ; /* ignored */

in your flex file. With an empty action, the generated scanner does not return a token; it discards the text just matched and continues scanning.

Note 2

You probably haven't taken into account that your parser allows lines of the form:

var = "data"

before any

[section]

line, so you'll run into trouble when addStringValue(...) is called before any section has been created. One possible solution is to modify your grammar so that the file is divided into sections, each of which must begin with a section line, like:

compilation: file comments ;

file: file section
    | ; /* empty */

section: section_header section_body;

section_header: comments '[' ident ']' EOL ;

section_body: section_body comments assignment
    | ; /* empty */

comments: comments COMMENT
    | ; /* empty */

This is complicated by the fact that you want to process the comments. If you were to ignore them (by using an empty action, ;, in the flex scanner), the grammar would be:

file: empty_lines file section
    | ; /* empty */

empty_lines: empty_lines EOL
    | ; /* empty */

section: header body ;

header: '[' IDENT ']' EOL ;

body: body assignment
    | ; /* empty */

assignment: IDENT '=' strings EOL
    | EOL ; /* empty lines or lines with comments */

strings: 
      strings unit
    | unit ;

unit: STRING
    | IDENT
    | NUMBER ;

This way, apart from comments (which are ignored) and blank space, the first thing allowed in your file is a section header. (EOLs are not considered blank space: we cannot ignore them, because they terminate lines.)
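To make the problem from Note 2 concrete, here is a rough sketch of what the semantic-action side might look like. The question does not show how createNewSection() and addStringValue() are implemented, so the Section and IniData types below are purely hypothetical. The point is only that addStringValue() has nothing to attach a value to until a section exists, which is why it helps to let the grammar guarantee that a section header comes first.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical storage for the parsed file; not taken from the question.
struct Section {
    std::string name;
    std::map<std::string, std::string> values;
};

struct IniData {
    std::vector<Section> sections;

    // Called from the section rule: starts a new, empty section.
    void createNewSection(const std::string &name) {
        sections.push_back(Section{name, {}});
    }

    // Called from the value rule: stores key = value in the latest section.
    // Throws if no section exists yet, which is exactly the situation the
    // original grammar permits for an input that starts with var = "data".
    void addStringValue(const std::string &key, const std::string &value) {
        if (sections.empty())
            throw std::logic_error("assignment before any [section]");
        sections.back().values[key] = value;
    }
};
```

With the restructured grammar, the check in addStringValue() should never fire, because an assignment can only be reduced inside a section body; keeping the check is still a cheap safety net.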

