Flex/Bison: yytext skips over a value

Time: 2020-12-24 09:40:14

I've been racking my brain for two days trying to figure out why the program is behaving this way. For a class project, I'm trying to write a program that parses an address and outputs it a certain way. Before I actually get to the output portion of the program, I just wanted to make sure my Bison-fu was actually correct and outputting some debugging information correctly.


It looks as if Flex and Bison are cooperating with each other nicely, as expected, but for some reason, when I get to the parsing of the third line of the address, yytext just skips over the zip code and goes straight to the new line.


Below is a stripped-down version of my Flex and Bison files, which I tested and which still produces the same output as the full version:


[19:45]<Program4> $ cat scan.l
%option noyywrap
%option nounput
%option noinput

%{
#include <stdlib.h>
#include "y.tab.h"
#include "program4.h"
%}

%%
[\ \t]+                 { /* Eat whitespace */}
[\n]                    { return EOLTOKEN; }
","                     { return COMMATOKEN; }
[0-9]+                  { return INTTOKEN; }
[A-Za-z]+               { return NAMETOKEN; }
[A-Za-z0-9]+            { return IDENTIFIERTOKEN; }

%%
/*This area just occupies space*/
[19:45]<Program4> $ cat parse.y

%{
#include <stdlib.h>
#include <stdio.h>
#include "program4.h"
%}

%union {int num; char id[20]; }
%start locationPart
%expect 0
%token <num> NAMETOKEN
%token <num> EOLTOKEN
%token <num> INTTOKEN
%token <num> COMMATOKEN
%type <id> townName zipCode stateCode

%%

/* Entire block */
locationPart:           townName COMMATOKEN stateCode zipCode EOLTOKEN
{ printf("Rule 12: LP: TN COMMA SC ZC EOL: %s\n", yytext); }
| /* bad location part */
{ printf("Rule 13: LP: Bad location part: %s\n", yytext); }

/* Lil tokens */
townName:               NAMETOKEN
{ printf("Rule 23: TN: NAMETOKEN: %s\n", yytext); }

stateCode:              NAMETOKEN
{ printf("Rule 24: SC: NAMETOKEN: %s\n", yytext); }

zipCode:                INTTOKEN DASHTOKEN INTTOKEN
{ printf("Rule 25: ZC: INT DASH INT: %s\n", yytext); }
                    | INTTOKEN
{ printf("Rule 26: ZC: INT: %s\n", yytext); }

%%

int yyerror (char const *s){
  extern int yylineno; //Defined in lex

  fprintf(stderr, "ERROR: %s at symbol \"%s\"\n at line %d.\n", s, yytext, 
[19:45]<Program4> $ cat addresses/zip.txt
Rockford, HI 12345
[19:45]<Program4> $ parser < addresses/zip.txt
Operating in parse mode.

Rule 23: TN: NAMETOKEN: Rockford
Rule 26: ZC: INT:


Parse successful!
[19:46]<Program4> $

As you can see near the bottom, it prints Rule 26: ZC: INT: but fails to print the five-digit zip code. It's as if the program skips the number and stores the newline instead. Any ideas why it won't store and print the zip code?



  • yytext is defined as an extern in my .h file (not posted here);
  • I am using the -vdy flags to compile the parse.c file

3 Answers



Because yytext is a global variable, it is overwritten every time the scanner matches a new token, so you have to copy it in your lex script. In a pure parser it is no longer global, but it is still reused and passed as a parameter, so it is still incorrect to use its value the way you are attempting.


Also, don't use it in bison; instead use $n, where n is the position of the symbol in the rule. You probably need to change the %union directive to something like

%union {
    int number;
    char *name;
}

So in the flex file, if you want to capture the text do something like


[A-Za-z]+               { yylval.name = strdup(yytext); return NAMETOKEN; }

and remember, do not use yytext in bison, it's an internal thing used by the lexer.
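The reason for copying can be seen without flex at all. The following is a minimal sketch, assuming a hypothetical scan() function that reuses a single buffer in the way the scanner reuses yytext. The names scan_buf, scan, copy_text and demo_copy_survives are illustrative only and are not part of flex or bison.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the scanner's text buffer (plays the role of yytext). */
static char scan_buf[64];

/* Hypothetical stand-in for one scanner call: overwrite the shared buffer. */
static const char *scan(const char *lexeme) {
    snprintf(scan_buf, sizeof scan_buf, "%s", lexeme);
    return scan_buf;
}

/* Heap copy of a string; same effect as strdup(text). */
static char *copy_text(const char *text) {
    size_t len = strlen(text) + 1;
    char *copy = malloc(len);
    assert(copy != NULL);
    memcpy(copy, text, len);
    return copy;
}

/* Returns 1 if the borrowed pointer was clobbered while the copy survived. */
static int demo_copy_survives(void) {
    const char *borrowed = scan("12345");  /* like keeping yytext itself */
    char *owned = copy_text(borrowed);     /* like yylval.name = strdup(yytext) */
    scan("\n");                            /* the next scanner call reuses the buffer */
    int ok = strcmp(borrowed, "\n") == 0 && strcmp(owned, "12345") == 0;
    free(owned);
    return ok;
}
```

The borrowed pointer still points at the shared buffer, so it sees whatever the most recent scan() call wrote; only the heap copy keeps the original lexeme.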


Then, since you have declared a type for the zip code, the action can print the semantic values directly:


/* Entire block */
locationPart:           townName COMMATOKEN stateCode zipCode EOLTOKEN {
    printf("Rule 12: LP: TN COMMA SC ZC EOL: town:%s, stateCode:%s zip-code:%s\n", $1, $3, $4);
}



If you want to trace the workings of your parser, you are much better off enabling bison's trace feature. It's really easy. Just add the -t or --debug flag to the bison command to generate the code, and then add a line to actually produce the tracing:


/* This assumes you have #included the parse.tab.h header */
int main(void) {
#if YYDEBUG
   yydebug = 1;
#endif
   return yyparse();
}

This is explained in the Bison manual; the #if lets your program compile if you leave off the -t flag. While on the subject of flags, I strongly suggest you do not use the -y flag; it is for compiling old Yacc programs which relied on certain obsolete features. If you don't use -y, then bison will use the basename of your .y file with extensions .tab.c and .tab.h for the generated files.


Now, your bison file says that some of your tokens have semantic types, but your flex actions do not set semantic values for these tokens and your bison actions don't use the semantic values. Instead, you simply print the value of yytext. If you think about this a bit, you should be able to see why it won't work. Bison is a lookahead parser; it makes its parsing decisions based on the current parsing state and a peek at the next token (if necessary). It peeks at the next token by calling the lexer. And when you call the lexer, it changes the value of yytext.


Bison (unlike other yacc implementations) doesn't always peek at the next token. But in your zipcode rule, it has no alternative, since it cannot tell whether the next token is a - or not without looking at it. In this case, it is not a dash; it is a newline. So guess what yytext contains when you print it out in the zipcode action.
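The effect of that peek can be reproduced in a few lines of plain C. This is only a sketch under stated assumptions: next_token() is a hand-written stand-in for the flex scanner, cur_text plays the role of yytext, and parse_zip() imitates just the zipCode decision rather than bison's real LALR machinery. All of these names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

enum tok { TOK_INT, TOK_DASH, TOK_EOL, TOK_END };

/* Hypothetical scanner state; cur_text plays the role of yytext. */
static const char *input_pos;
static char cur_text[32];

/* Tiny scanner: digit runs, '-', and anything else treated as end of line.
   Like a real scanner, it overwrites cur_text on every call. */
static enum tok next_token(void) {
    size_t n = 0;
    while (*input_pos == ' ') input_pos++;
    if (*input_pos == '\0') { cur_text[0] = '\0'; return TOK_END; }
    if (*input_pos >= '0' && *input_pos <= '9') {
        while (*input_pos >= '0' && *input_pos <= '9' && n < sizeof cur_text - 1)
            cur_text[n++] = *input_pos++;
        cur_text[n] = '\0';
        return TOK_INT;
    }
    cur_text[0] = *input_pos;
    cur_text[1] = '\0';
    return *input_pos++ == '-' ? TOK_DASH : TOK_EOL;
}

/* After INT, peek one token to choose between "INT DASH INT" (rule 25)
   and "INT" (rule 26). Stores the text visible when the rule is chosen
   in seen and returns the rule number, or 0 on a syntax error. */
static int parse_zip(const char *input, char *seen, size_t cap) {
    input_pos = input;
    if (next_token() != TOK_INT) return 0;
    enum tok lookahead = next_token();   /* this peek overwrites cur_text */
    if (lookahead == TOK_DASH) {
        if (next_token() != TOK_INT) return 0;
        snprintf(seen, cap, "%s", cur_text);
        return 25;
    }
    snprintf(seen, cap, "%s", cur_text); /* holds the lookahead, not the zip */
    return 26;
}
```

For the input "12345\n", the text visible when rule 26 is chosen is the newline that was read as lookahead, which matches the empty-looking "Rule 26: ZC: INT:" line in the question's output.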

If your tokenizer were to save the text in the id semantic value member (which is what it is for) then your parser would be able to access the semantic values as $1, $2, ...
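Because the id member in the question's %union is a fixed 20-byte array, the copy has to be bounded. Here is a minimal sketch of such a copy, assuming a union laid out like the one in the question; union semantic_value and set_id are hypothetical names used only for illustration. In a flex action the equivalent call would be snprintf(yylval.id, sizeof yylval.id, "%s", yytext), and the tokens would presumably need to be declared with <id> rather than <num> for $1 to have the right type.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Mirrors the %union from the question: { int num; char id[20]; } */
union semantic_value {
    int  num;
    char id[20];
};

/* Copy a lexeme into the fixed-size id member, truncating if it does not fit.
   Returns 1 if the whole lexeme fit, 0 if it was truncated. */
static int set_id(union semantic_value *val, const char *lexeme) {
    int needed = snprintf(val->id, sizeof val->id, "%s", lexeme);
    return needed >= 0 && (size_t)needed < sizeof val->id;
}
```

snprintf always NUL-terminates within the given size, so an over-long town name is cut to 19 characters instead of overflowing the union.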



The problem is here:


zipCode:              INTTOKEN DASHTOKEN INTTOKEN     { /* case 25 */ }
                    | INTTOKEN                        { /* case 26 */ }

The parser doesn't know which rule to take, 25 or 26, until it has read the next token to see whether it is a DASHTOKEN. By the time the action code is executed, yytext has already been overwritten.

The easiest way to handle this is to have a production that takes the INTTOKENs and returns what was in yytext[] in malloc()'d memory. Something like:

/* Assumes inttoken and zipCode use a char * member of the %union */
zipCode:              inttoken DASHTOKEN inttoken
                              { printf("Rule 25: zip is %s-%s\n", $1, $3); }
                    | inttoken
                              { printf("Rule 26: zip is %s\n", $1); }

inttoken: INTTOKEN { $$ = strdup(yytext); }
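Since the strings returned this way live in malloc()'d memory, some action eventually has to free them. One possible shape for that, sketched here with a hypothetical join_zip helper that is not part of the original answer, is to build the final zip string and release the two pieces in the same place:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: build "12345-6789" (or just "12345" when ext is NULL)
   in freshly allocated memory, then free the inputs, which are assumed to be
   heap copies such as those produced by strdup(). The caller frees the result. */
static char *join_zip(char *base, char *ext) {
    size_t len = strlen(base) + (ext ? strlen(ext) + 1 : 0) + 1;
    char *out = malloc(len);
    assert(out != NULL);
    if (ext)
        snprintf(out, len, "%s-%s", base, ext);
    else
        snprintf(out, len, "%s", base);
    free(base);
    free(ext);   /* free(NULL) is a no-op */
    return out;
}
```

Under these assumptions, the rule 25 action could be written as $$ = join_zip($1, $3) and the rule 26 action as $$ = join_zip($1, NULL), leaving a single string for the enclosing rule to print and free.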


