I'm writing a lexer to scan a modified version of an INI file.
I need to recognize the declaration of variables, comments and strings (between double quotes) to be assigned to a variable. For example, this is correct:
# this is a comment
var1 = "string value"
I've managed to recognize these tokens by putting the # at the beginning of the comment regular expression and the " at both ends of the string regular expression, but I don't want to do it this way because later on, in Bison, the token values I get are exactly # this is a comment and "string value". Instead I want this is a comment (without the #) and string value (without the ").
These are the regular expressions that I currently use:
[a-zA-Z][a-zA-Z0-9]* { return TOKEN_VAR_NAME; }
["][^\n\r]*["] { return TOKEN_STRING; }
[#][^\n\r]* { return TOKEN_COMMENT; }
Obviously there can be any number of spaces and tabs inside the string, inside the comment, and between the variable name and the =.
How could I achieve the result I want?
Maybe it will be easier if I show you a complete example of a correct input file and also the grammar rules I use with Flex and Bison.
Correct input file example:
[section1]
var1 = "string value"
var2 = "var1 = text"
# this is a comment
# var5 = "some text" this is also a valid comment
These are the regular expressions for the lexer:
"[" { return TOKEN::SECTION_START; }
"]" { return TOKEN::SECTION_END; }
"=" { return TOKEN::ASSIGNMENT; }
[#][^\n\r]* { return TOKEN::COMMENT; }
[a-zA-Z][a-zA-Z0-9]* { *m_yylval = yytext; return TOKEN::ID; }
["][^\n\r]*["] { *m_yylval = yytext; return TOKEN::STRING; }
And these are the syntax rules:
input : input line
| line
;
line : section
| value
| comment
;
section : SECTION_START ID SECTION_END { createNewSection($2); }
;
value : ID ASSIGNMENT STRING { addStringValue($1, $3); }
;
comment : COMMENT { addComment($1); }
;
1 Answer
To do that you would have to treat " and # as separate tokens (so that they are scanned as individual tokens, distinct from the ones you are scanning now) and use a %s or %x start condition to change which patterns the scanner accepts once it has read one of those tokens.
This has a drawback: you will receive # as an individual token before the comment, and " before and after the string contents, and you will have to deal with that in your grammar. It complicates both the grammar and the scanner, so I would discourage this approach.
There is a better solution: write a routine that strips the delimiters and keep the scanner simple, letting it match the whole token in yytext and then simply doing
*m_yylval = unescapeString(yytext); /* drop the " chars */
return STRING;
or
*m_yylval = uncomment(yytext); /* drop the # at the beginning */
return COMMENT; /* return EOL if you are trying the example at the end */
in the rule actions, which end up inside the yylex() function.
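The answer names unescapeString and uncomment but does not show them. A minimal sketch of what they might look like follows, assuming the semantic value is a std::string and that strings contain no escape sequences (neither is confirmed by the question). The behavior for unquoted input and the trimming of blanks after the # are my own assumptions.

```cpp
#include <string>

// Hypothetical helper: strips the surrounding double quotes from a matched
// string token such as "string value". Text that is not wrapped in quotes is
// returned unchanged. Escape sequences are NOT interpreted in this sketch.
std::string unescapeString(const std::string& text) {
    if (text.size() >= 2 && text.front() == '"' && text.back() == '"') {
        return text.substr(1, text.size() - 2);
    }
    return text;
}

// Hypothetical helper: drops the leading '#' of a matched comment token and
// any spaces or tabs that immediately follow it.
std::string uncomment(const std::string& text) {
    std::string::size_type pos = 0;
    if (pos < text.size() && text[pos] == '#') {
        ++pos;
    }
    while (pos < text.size() && (text[pos] == ' ' || text[pos] == '\t')) {
        ++pos;
    }
    return text.substr(pos);
}
```

With these, the token "string value" (quotes included) would yield string value, and # this is a comment would yield this is a comment, which is what the question asks for.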
Note
As comments are normally ignored, the best thing is to discard them with a rule like:
"#".* ; /* ignored */
in your flex file. With an empty action the generated scanner does not return; it discards the token just read and keeps scanning.
Note 2
You have probably not taken into account that your parser accepts lines of the form var = "data" in front of any [section] line, so you will run into trouble calling addStringValue(...) when no section has been created yet. One possible solution is to modify your grammar so that the file is split into sections, each of which must begin with a section line, like:
compilation: file comments ;
file: file section
| ; /* empty */
section: section_header section_body ;
section_header: comments '[' IDENT ']' EOL ;
section_body: section_body comments assignment
| ; /* empty */
comments: comments COMMENT
| ; /* empty */
This is complicated by the fact that you want to process the comments. If you were to ignore them (by using the empty action ; in the flex scanner), the grammar would be:
file: empty_lines file section
| ; /* empty */
empty_lines: empty_lines EOL
| ; /* empty */
section: header body ;
header: '[' IDENT ']' EOL ;
body: body assignment
| ; /* empty */
assignment: IDENT '=' strings EOL
| EOL ; /* empty lines or lines with comments */
strings:
strings unit
| unit ;
unit: STRING
| IDENT
| NUMBER ;
This way the first thing allowed in your file, apart from comments (which are ignored) and blank space, is a section header. (EOLs are not considered blank space: we cannot ignore them, because they terminate lines.)
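Going back to the problem Note 2 describes (an assignment appearing before any [section] line), a guard in the semantic actions could complement the grammar change. The sketch below is only an illustration: the IniStore class, its layout, and the bool return value are assumptions of mine. Only the names createNewSection and addStringValue come from the question.

```cpp
#include <map>
#include <string>

// Hypothetical in-memory store for the parsed file.
class IniStore {
public:
    // Would be called from the "section" rule: opens (or reopens) a section.
    void createNewSection(const std::string& name) {
        current_ = name;
        hasSection_ = true;
        sections_[name];  // make sure the section exists even if it stays empty
    }

    // Would be called from the "value" rule. Returns false and stores nothing
    // when no section has been opened yet, so the caller can report an error.
    bool addStringValue(const std::string& key, const std::string& value) {
        if (!hasSection_) {
            return false;
        }
        sections_[current_][key] = value;
        return true;
    }

    // Looks up a stored value; returns an empty string when it is missing.
    std::string get(const std::string& section, const std::string& key) const {
        auto s = sections_.find(section);
        if (s == sections_.end()) {
            return std::string();
        }
        auto v = s->second.find(key);
        if (v == s->second.end()) {
            return std::string();
        }
        return v->second;
    }

private:
    std::map<std::string, std::map<std::string, std::string>> sections_;
    std::string current_;
    bool hasSection_ = false;  // stays false until the first section header
};
```

A grammar that forces every file to start with a section header makes the false branch unreachable, but the check keeps the store safe if the grammar is relaxed later.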