如何处理flex中的嵌套注释

时间:2021-03-15 14:30:32

I am working on writing a flex scanner for a language supporting nested comment like this:

我正在为一种支持嵌套注释的语言编写flex扫描器:

/*
/**/
*/

I use to work on ocaml/ocamllex that support recursive calling lex scanner very elegent. But I am now switching to c++/flex, how to handle such nested comment?

我曾经在ocaml/ocamllex上工作,它支持递归调用lex扫描器,非常优雅。但是我现在正在切换到c++/flex,如何处理这种嵌套的注释呢?

3 个解决方案

#1


3  

Assuming that only comments can be nested in comments, a stack is a very expensive solution for what could be achieved with a simple counter. For example:

假设只有注释可以嵌套在注释中,那么对于可以通过简单计数器实现的内容,堆栈是非常昂贵的解决方案。例如:

%x SC_COMMENT

%%
  int comment_nesting = 0;  /* Line 4 */

"/*"             { BEGIN(SC_COMMENT); }
<SC_COMMENT>{
  "/*"           { ++comment_nesting; }
  "*"+"/"        { if (comment_nesting) --comment_nesting;
                   else BEGIN(INITIAL); }
  "*"+           ; /* Line 11 */
  [^/*\n]+       ; /* Line 12 */
  [/]            ; /* Line 13 */
  \n             ; /* Line 14 */
}

Some explanations:

一些解释:

Line 4: Indented lines before the first rule are inserted at the top of the yylex function where they can be used to declare and initialize local variables. We use this to initialize the comment nesting depth to 0 on every call to yylex. The invariant which must be maintained is that comment_nesting is always 0 in the INITIAL state.

第4行:第一个规则之前的缩进行被插入到yylex函数的顶部,在那里它们可以用来声明和初始化本地变量。我们使用它将注释嵌套深度初始化到每一次对yylex的调用。必须保持不变的是,注释嵌套在初始状态中始终为0。

Lines 11-13: A simpler solution would have been the single pattern .|\n. , but that would result in every comment character being treated as a separate subtoken. Even though the corresponding action does nothing, this would have caused the scan loop to be broken and the action switch statement to be executed for every character. So it is usually better to try to match several characters at once.

第11-13行:更简单的解决方案应该是单一模式。,但这将导致将每个注释字符视为单独的子令牌。即使相应的操作什么都不做,这也会导致扫描循环被破坏,并且对每个字符执行操作开关语句。所以通常最好是同时匹配几个字符。

We need to be careful about / and * characters, though; we can only ignore those asterisks which we are certain are not part of the */ which terminates the (possibly nested) comment. Hence lines 11 and 12. (Line 12 won't match a sequence of asterisks which is followed by a / because those will already have been matched by the pattern above, at line 9.) And we need to ignore / if it is not followed by a *. Hence line 13.

不过,我们需要小心/和*字符;我们只能忽略那些我们确信不属于*/的星号,它们终止(可能嵌套)注释。因此第11行和第12行。(第12行不匹配后面跟着a /的星号序列,因为在第9行,这些星号已经与上面的模式匹配。)我们需要忽略/如果后面没有*。因此13号线。

Line 14: However, it can also be sub-optimal to match too large a token.

第14行:但是,如果匹配一个太大的令牌,也可能是次优的。

First, flex is not optimized for large tokens, and comments can be very large. If flex needs to refill its buffer in the middle of a token, it will retain the open token in the new buffer, and then rescan from the beginning of the token.

首先,flex没有针对大型令牌进行优化,注释可能非常大。如果flex需要在一个令牌中间重新填充它的缓冲区,那么它将在新的缓冲区中保留打开的令牌,然后从令牌开始重新扫描。

Second, flex scanners can automatically track the current line number, and they do so relatively efficiently. The scanner checks for newlines only in tokens matched by patterns which could possibly match a newline. But the entire match needs to be scanned.

其次,flex扫描器可以自动跟踪当前的行号,而且它们的跟踪效率相对较高。扫描器只在与可能与换行匹配的模式匹配的标记中检查换行。但是整个匹配需要扫描。

We reduce the impact of both of these issues by matching newline characters inside comments as individual tokens. (Line 14, also see line 12) This limits the yylineno scan to a single character, and it also limits the expected length of internal comment tokens. The comment itself might be very large, but each line is likely to be limited to a reasonable length, thus avoiding the potentially quadratic rescan on buffer refill.

我们通过将注释中的换行字符匹配为单个令牌来减少这两个问题的影响。(第14行,也参见第12行)这将ylineno扫描限制为单个字符,并且它还限制了内部注释标记的预期长度。注释本身可能非常大,但是每一行都很可能被限制到一个合理的长度,从而避免了缓冲区重新填充的潜在二次扫描。

#2


2  

I resolve this problem by using yy_push_state , yy_pop_state and start condition like this :

我使用yy_push_state、yy_pop_state和start条件来解决这个问题,如下所示:

%x comment 
%%
"/*" {
  yy_push_state(comment);
}
<comment>{
  "*/" {
    yy_pop_state();
  }

  "/*" {
    yy_push_state(comment);
  }
}
%%

In this way, I can handle any level of nested comment.

通过这种方式,我可以处理任何级别的嵌套注释。

#3


0  

Same way you would any nested stuff: Stacks

就像你喜欢任何嵌套的东西一样:栈

Stack the openings until their respective closings are found. If the stack is uneven at the end, you're missing either an opening or a closing element.

堆叠开口,直到找到它们各自的关闭。如果堆栈在最后是不均匀的,那么您将丢失一个打开或关闭元素。

#1


3  

Assuming that only comments can be nested in comments, a stack is a very expensive solution for what could be achieved with a simple counter. For example:

假设只有注释可以嵌套在注释中,那么对于可以通过简单计数器实现的内容,堆栈是非常昂贵的解决方案。例如:

%x SC_COMMENT

%%
  int comment_nesting = 0;  /* Line 4 */

"/*"             { BEGIN(SC_COMMENT); }
<SC_COMMENT>{
  "/*"           { ++comment_nesting; }
  "*"+"/"        { if (comment_nesting) --comment_nesting;
                   else BEGIN(INITIAL); }
  "*"+           ; /* Line 11 */
  [^/*\n]+       ; /* Line 12 */
  [/]            ; /* Line 13 */
  \n             ; /* Line 14 */
}

Some explanations:

一些解释:

Line 4: Indented lines before the first rule are inserted at the top of the yylex function where they can be used to declare and initialize local variables. We use this to initialize the comment nesting depth to 0 on every call to yylex. The invariant which must be maintained is that comment_nesting is always 0 in the INITIAL state.

第4行:第一个规则之前的缩进行被插入到yylex函数的顶部,在那里它们可以用来声明和初始化本地变量。我们使用它将注释嵌套深度初始化到每一次对yylex的调用。必须保持不变的是,注释嵌套在初始状态中始终为0。

Lines 11-13: A simpler solution would have been the single pattern .|\n. , but that would result in every comment character being treated as a separate subtoken. Even though the corresponding action does nothing, this would have caused the scan loop to be broken and the action switch statement to be executed for every character. So it is usually better to try to match several characters at once.

第11-13行:更简单的解决方案应该是单一模式。,但这将导致将每个注释字符视为单独的子令牌。即使相应的操作什么都不做,这也会导致扫描循环被破坏,并且对每个字符执行操作开关语句。所以通常最好是同时匹配几个字符。

We need to be careful about / and * characters, though; we can only ignore those asterisks which we are certain are not part of the */ which terminates the (possibly nested) comment. Hence lines 11 and 12. (Line 12 won't match a sequence of asterisks which is followed by a / because those will already have been matched by the pattern above, at line 9.) And we need to ignore / if it is not followed by a *. Hence line 13.

不过,我们需要小心/和*字符;我们只能忽略那些我们确信不属于*/的星号,它们终止(可能嵌套)注释。因此第11行和第12行。(第12行不匹配后面跟着a /的星号序列,因为在第9行,这些星号已经与上面的模式匹配。)我们需要忽略/如果后面没有*。因此13号线。

Line 14: However, it can also be sub-optimal to match too large a token.

第14行:但是,如果匹配一个太大的令牌,也可能是次优的。

First, flex is not optimized for large tokens, and comments can be very large. If flex needs to refill its buffer in the middle of a token, it will retain the open token in the new buffer, and then rescan from the beginning of the token.

首先,flex没有针对大型令牌进行优化,注释可能非常大。如果flex需要在一个令牌中间重新填充它的缓冲区,那么它将在新的缓冲区中保留打开的令牌,然后从令牌开始重新扫描。

Second, flex scanners can automatically track the current line number, and they do so relatively efficiently. The scanner checks for newlines only in tokens matched by patterns which could possibly match a newline. But the entire match needs to be scanned.

其次,flex扫描器可以自动跟踪当前的行号,而且它们的跟踪效率相对较高。扫描器只在与可能与换行匹配的模式匹配的标记中检查换行。但是整个匹配需要扫描。

We reduce the impact of both of these issues by matching newline characters inside comments as individual tokens. (Line 14, also see line 12) This limits the yylineno scan to a single character, and it also limits the expected length of internal comment tokens. The comment itself might be very large, but each line is likely to be limited to a reasonable length, thus avoiding the potentially quadratic rescan on buffer refill.

我们通过将注释中的换行字符匹配为单个令牌来减少这两个问题的影响。(第14行,也参见第12行)这将ylineno扫描限制为单个字符,并且它还限制了内部注释标记的预期长度。注释本身可能非常大,但是每一行都很可能被限制到一个合理的长度,从而避免了缓冲区重新填充的潜在二次扫描。

#2


2  

I resolve this problem by using yy_push_state , yy_pop_state and start condition like this :

我使用yy_push_state、yy_pop_state和start条件来解决这个问题,如下所示:

%x comment 
%%
"/*" {
  yy_push_state(comment);
}
<comment>{
  "*/" {
    yy_pop_state();
  }

  "/*" {
    yy_push_state(comment);
  }
}
%%

In this way, I can handle any level of nested comment.

通过这种方式,我可以处理任何级别的嵌套注释。

#3


0  

Same way you would any nested stuff: Stacks

就像你喜欢任何嵌套的东西一样:栈

Stack the openings until their respective closings are found. If the stack is uneven at the end, you're missing either an opening or a closing element.

堆叠开口,直到找到它们各自的关闭。如果堆栈在最后是不均匀的,那么您将丢失一个打开或关闭元素。