Why are multi-line comments in flex/bison so evasive?

I'm trying to parse C-style multi-line comments in my flex (.l) file:

%s ML_COMMENT
%%

...

<INITIAL>"/*"                   BEGIN(ML_COMMENT);
<ML_COMMENT>"*/"                BEGIN(INITIAL);  
<ML_COMMENT>[.\n]+              { }

I'm not returning any token and my grammar (.y) doesn't address comments in any way.

When I run my executable, I get a parse error:

$ ./a.out
/*
abc 
def
Parse error: parse error
$ echo "/* foo */" | ./a.out
Parse error: parse error

(My yyerror function does a printf("Parse error: %s\n"), which is where the first half of the redundant error message comes from).

I can see why the second example fails since the entirety of the input is a comment, and since comments are ignored by the grammar, there are no statements. Thus the input isn't a valid program. But the first part throws a parse error before I even finish the comment.

Also confusing:

$ ./a.out
/* foo */
a = b;
Parse error: parse error

In this case, the comment is closed prior to actual valid input (which, without the comment, parses just fine). The failure actually occurs after parsing "a", not after attempting to parse the assignment "a = b;". If I enter "a" on its own line, it still throws an error.

Given that the error message is a parser error and not a scanner error, is there something crucial I'm missing in my .y file? Or am I doing something wrong in my scanner rules that propagates over to the parser side?

EDIT: Per @Rudi's suggestion, I turned on debugging and found:

$ ./a.out
Starting parse
Entering state 0
Reading a token: /*
foo
Next token is 44 (IDENTIFER)
Shifting token 44 (IDENTIFER), Entering state 4
Reducing via rule 5 (line 130), IDENTIFER  -> identifier
state stack now 0
Entering state 5

I turned off debugging and found that /* foo */ = bar; indeed parses the same as foo = bar;. I'm using flex 2.5.4; it doesn't give me any warnings about the stateful rules I'm attempting to use.

4 Answers

#1

I think you need to declare ML_COMMENT as an exclusive start condition so that only the ML_COMMENT rules are active while you are inside a comment: use %x ML_COMMENT instead of %s ML_COMMENT.

Otherwise, rules with no start condition stay active inside the comment, which is why foo in your debug trace is scanned as an identifier.

#2

Parsing comments this way can lead to errors because:

You need to add start conditions to all of your lex rules.
It becomes even more complex if you also want to handle // comments.
You still run the risk that the scanner merges two comments, including everything in between.

In my parser, I handle comments like this. First define lex rules for the start of the comment, like this:

\/\*     {
         if (!SkipComment())
            return(-1);
         }

\/\/     {
         if (!SkipLine())
            return(-1);
         }

then write the SkipComment and SkipLine functions. They need to consume all the input until the end of the comment is found (this is rather old code so forgive me the somewhat archaic constructions):

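bool SkipComment (void)
{
int Key;

Key=!EOF;   /* any non-EOF value; the first real character is read at the bottom of the loop */
while (true)
   {
   if (Key==EOF)
      {
      /* yyerror("Unexpected EOF within comment."); */
      break;
      }
   switch ((char)Key)
      {
      case '*' :
         Key=input();
         if ((char)Key=='/') return true;   /* found the closing star-slash */
         else                continue;      /* re-examine Key, so a run of stars before the slash still works */
         break;
      case '\n' :
         ++LineNr;                          /* keep the line counter in step */
         break;
      }
   Key=input();
   }

return false;   /* reached EOF before the comment was closed */
}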

bool SkipLine (void)
{
int Key;

Key=!EOF;
while (true)
   {
   if (Key==EOF)
      return true;
   switch ((char)Key)
      {
      case '\n' :
         unput('\n');
         return true;
         break;
      }
   Key=input();
   }

return false;
}
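
The scanning logic above depends on flex's input(), so it is awkward to try in isolation. As a rough sketch only, the same idea can be written against a plain string; skip_block_comment is a hypothetical helper introduced here for illustration, not part of flex or of the code above. It assumes the caller has already consumed the opening /* and it stops at the first closing delimiter, so two separate comments are never merged.

```c
#include <stddef.h>

/* Hypothetical stand-alone helper (not part of flex).
   `p` points just past an opening slash-star. Returns a pointer just
   past the first closing star-slash, or NULL if the string ends first.
   If `line_count` is non-NULL, each newline seen inside the comment
   increments it, mirroring the ++LineNr bookkeeping above. */
const char *skip_block_comment(const char *p, int *line_count)
{
    while (*p != '\0') {
        if (p[0] == '*' && p[1] == '/')
            return p + 2;              /* first closing delimiter wins */
        if (*p == '\n' && line_count != NULL)
            ++*line_count;
        ++p;
    }
    return NULL;                       /* unterminated comment */
}
```

A NULL result corresponds to the "unexpected EOF within comment" case, which the caller can report however it likes.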

#3

Besides the %x vs %s problem, the . in [.\n] matches only a literal . rather than 'any character other than newline' as a bare . does, so the rule never consumes ordinary comment text. You want a rule like

<ML_COMMENT>.|"\n"     { /* do nothing */ }

instead

#4

I found this description of the C language grammar (actually just the lexer) very useful. I think it is mostly the same as Patrick's answer, but slightly different.

http://www.lysator.liu.se/c/ANSI-C-grammar-l.html
