flex -l longest pattern match policy - not here?

Time: 2021-12-22 09:44:04

I have two lex rules and was wondering why I never matched the second rule. Instead rule 1 always fired upon the pattern 2005-05-09-11.23.04.790000

<data>[-]?[0-9]*[.][0-9]*  {
            comma = 0;
            printf("DEBUG: data 1 %s\n", yytext);
            strcat(data_line, yytext);
        }
<data>[0-9]{4}[-][01][0-9][-][0-3][0-9][-][0-9]{2}[.][0-9]{2}[.][0-9]{2}[.][0-9]{6} {
            printf("DEBUG: data 2[%s]\n", yytext);
            /* 1996-07-15-hh.00.00 */
        }

I thought flex/lex would follow the longest-match rule?

Interestingly, flex without the -l (lex compatibility) option behaves "right", or at least the way I want it to behave.

1 solution

#1


This is one of several "gotchas" related to Posix-/lex- compatibility [Note 1]. For historical reasons, the (Posix-standard) lex regular expression dialect differs from (Posix-standard) EREs ("extended regular expressions"), even though Posix uses the same abbreviation to describe the lex dialect.

The difference is the precedence of the brace-repetition operator. In standard EREs, and pretty well every other regular expression variety I know of, abc{3} would match abccc. And that's how it is interpreted by flex, too, unless you specify the -l or --posix flags. If you request lex-compatibility, the precedence of the brace operator becomes lower than that of concatenation, so abc{3} matches abcabcabc.
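The two readings of abc{3} can be illustrated outside flex. The following is only a sketch using Python's re module, which (like EREs and default flex) gives the brace operator high precedence; the lex-compatibility reading is written out by hand with explicit parentheses, and the helper name `full` is made up for this example.

```python
import re

# Python's re binds {n} tightly, as POSIX EREs and default flex do.
ERE_READING = r"abc{3}"      # 'ab' followed by exactly three 'c'

# Hand-written equivalent of how lex-compatibility mode reads abc{3}:
# the repetition applies to the whole preceding concatenation.
LEX_READING = r"(abc){3}"    # 'abc' repeated exactly three times


def full(pattern, text):
    """Return True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    for text in ("abccc", "abcabcabc"):
        print(text, full(ERE_READING, text), full(LEX_READING, text))
    # prints:
    # abccc True False
    # abcabcabc False True
```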

If you want to write regexes that work with either variety, you must parenthesize all (or almost all) uses of the brace-repetition operator. So your second pattern would need to be written as:

[0-9]{4}[-][01][0-9][-][0-3][0-9][-]([0-9]{2})[.]([0-9]{2})[.]([0-9]{2})[.]([0-9]{6})

As originally written, your second pattern won't match the specified input in lex-compatibility mode, so the first rule is left to match fragments of it, such as -11.23.
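To see why the unparenthesized pattern fails under the low-precedence reading, here is a rough Python model. It is not flex itself: `lex_mode_reading` is a hypothetical helper that rewrites a pattern so each {n} repeats everything before it, and it assumes the pattern has no parentheses, no alternation, and no braces inside bracket expressions (true for the question's rule, not in general).

```python
import re

TIMESTAMP = "2005-05-09-11.23.04.790000"

# The second rule from the question, without the <data> start condition.
RULE_2 = (r"[0-9]{4}[-][01][0-9][-][0-3][0-9][-]"
          r"[0-9]{2}[.][0-9]{2}[.][0-9]{2}[.][0-9]{6}")

# The parenthesized rewrite suggested in the answer.
RULE_2_PORTABLE = (r"[0-9]{4}[-][01][0-9][-][0-3][0-9][-]"
                   r"([0-9]{2})[.]([0-9]{2})[.]([0-9]{2})[.]([0-9]{6})")


def lex_mode_reading(pattern):
    """Model the low-precedence brace operator (assumption-laden sketch).

    Each {n} is made to repeat the entire concatenation to its left.
    Only valid for patterns without parentheses or alternation.
    """
    rewritten = ""
    for part in re.split(r"(\{[0-9]+\})", pattern):
        if re.fullmatch(r"\{[0-9]+\}", part):
            # Wrap everything seen so far, then apply the repetition to it.
            rewritten = "(?:" + rewritten + ")" + part
        else:
            rewritten += part
    return rewritten


def full(pattern, text):
    """Return True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    print(full(RULE_2, TIMESTAMP))                    # high-precedence reading
    print(full(lex_mode_reading(RULE_2), TIMESTAMP))  # modelled lex reading
    print(full(RULE_2_PORTABLE, TIMESTAMP))           # parenthesized rewrite
```

Under the high-precedence reading both spellings match the timestamp, while the modelled lex reading of the original does not. The parenthesized version is not passed through `lex_mode_reading` because the helper does not understand groups; inside each group the brace has only a single bracket expression to its left, so the two readings coincide there.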

For what it's worth, the other postfix repetition operators -- +, * and ? -- have the normal high precedence in lex mode. (In a way, this inconsistency makes the behaviour of brace-repetition even more confusing.)

Another gotcha with braces in lex-mode is that when they are used as macro replacement, no implicit parentheses are added. So in flex:

foo     [fF][oO][oO]
%%
{foo}+  {
          /* yytext is some number of case-insensitive repetitions of foo */
        }

whereas in lex-compatibility mode

foo     [fF][oO][oO]
%%
{foo}+  {
          /* yytext is an 'f' or 'F' followed by at least two 'o' or 'O's */
        }
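The effect of the missing implicit parentheses can be sketched as plain text substitution. This is a simplified Python model rather than flex's actual expansion code; `expand` and `DEFINITIONS` are names invented for the illustration, and the regex used to recognise {name} is an assumption about what a definition name looks like.

```python
import re

# Stand-in for the definitions section of the scanner.
DEFINITIONS = {"foo": "[fF][oO][oO]"}


def expand(pattern, definitions, parenthesize):
    """Replace each {name} with its definition.

    With parenthesize=True the definition is wrapped in parentheses,
    modelling default flex; with False it is pasted in verbatim,
    modelling lex-compatibility mode.
    """
    def substitute(match):
        body = definitions[match.group(1)]
        return "(" + body + ")" if parenthesize else body

    return re.sub(r"\{([A-Za-z_][A-Za-z0-9_-]*)\}", substitute, pattern)


FLEX_STYLE = expand("{foo}+", DEFINITIONS, parenthesize=True)   # ([fF][oO][oO])+
LEX_STYLE = expand("{foo}+", DEFINITIONS, parenthesize=False)   # [fF][oO][oO]+


def full(pattern, text):
    """Return True if pattern matches the whole of text."""
    return re.fullmatch(pattern, text) is not None


if __name__ == "__main__":
    for text in ("fooFOO", "fOOOO"):
        print(text, full(FLEX_STYLE, text), full(LEX_STYLE, text))
    # prints:
    # fooFOO True False
    # fOOOO False True
```

Wrapping a definition's body in parentheses yourself, for example defining foo as ([fF][oO][oO]), sidesteps the difference in the same way that parenthesizing brace repetitions does.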

Notes:

  1. The -l (and --posix) flags are options I recommend avoiding. Only use them when absolutely necessary to compile legacy code developed to the lex standard.