I'm interested in learning how to write a lexer generator like flex. I've been reading "Compilers: Principles, Techniques, and Tools" (the "dragon book"), and I have a basic idea of how flex works.
My initial approach is this: the user will supply a hash map from regexes to token enums. The program will loop through the regexes one by one, in the order given, and check whether each matches the start of the string (I could prepend a ^ to each regex to implement this). If one matches, I add the token for that regex to the program's list of tokens.
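To make the question concrete, here is a minimal sketch of that first-match-wins loop. It's in Python with the stdlib re module rather than the Rust regex crate, and the Tok names and rules are made up for illustration; re.Pattern.match() is already anchored at the given position, so no explicit ^ is needed.

```python
import re
from enum import Enum, auto

class Tok(Enum):
    IF = auto()
    IDENT = auto()
    NUMBER = auto()

# Rules are tried in order; the first one that matches wins.
RULES = [
    (re.compile(r"if"), Tok.IF),
    (re.compile(r"[A-Za-z_]\w*"), Tok.IDENT),
    (re.compile(r"\d+"), Tok.NUMBER),
]

def lex_first_match(src):
    tokens = []
    pos = 0
    while pos < len(src):
        if src[pos].isspace():
            pos += 1
            continue
        for pattern, tok in RULES:
            m = pattern.match(src, pos)  # anchored at pos
            if m:
                tokens.append((tok, m.group()))
                pos = m.end()
                break
        else:
            raise ValueError(f"no rule matches at offset {pos}")
    return tokens
```

Note that this scheme splits "ifa" into if followed by a, which is exactly the tie-breaking problem raised below.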
My first question is, is this the most efficient way to do it? Currently I have to loop through each regex, but in theory I could construct a DFA from all of the regexes combined and step through that more efficiently. However, there will be some overhead from creating this DFA.
My second question is, how would I implement the longest-match tie-breaking rule, like flex does? That is, I want to match ifa as a single identifier, and not as the keyword if followed by the letter a. I don't see any efficient way to do this with regexes. I think I'd have to loop through all of the regexes, try to match each one, and if more than one matches, take the longest result. However, if I converted the regexes to a DFA (that is, my own DFA data structure), I could keep stepping through characters until the DFA has no more possible transition edges. At that point, I can take the last time I passed through an accepting state as the actual end of the token, since that should be the longest match.
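The try-them-all variant can be sketched as below (Python stdlib re for illustration; the Tok names and rules are invented). Using a strict greater-than comparison means that on a length tie the earlier rule wins, which is how flex breaks ties between rules of equal length.

```python
import re
from enum import Enum, auto

class Tok(Enum):
    IF = auto()
    IDENT = auto()

RULES = [
    (re.compile(r"if"), Tok.IF),
    (re.compile(r"[A-Za-z_]\w*"), Tok.IDENT),
]

def lex_longest(src):
    tokens = []
    pos = 0
    while pos < len(src):
        if src[pos].isspace():
            pos += 1
            continue
        best_end, best_tok = pos, None
        for pattern, tok in RULES:
            m = pattern.match(src, pos)
            # Strict > keeps the earlier rule on a tie, so the
            # keyword still beats the identifier for exactly "if".
            if m and m.end() > best_end:
                best_end, best_tok = m.end(), tok
        if best_tok is None:
            raise ValueError(f"no rule matches at offset {pos}")
        tokens.append((best_tok, src[pos:best_end]))
        pos = best_end
    return tokens
```

This gives the desired result for ifa, at the cost of running every regex at every token boundary.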
Both of my questions point to writing my own translator from regex to a DFA. Is this required, or can I still do this efficiently with plain regex (implemented by a standard library) and still get the longest match?
EDIT: I kept the regex engine I'm using out of this because I wanted a general answer, but I'm using Rust's regex library: http://static.rust-lang.org/doc/master/regex/index.html
1 Answer
#1
Timewise, it's much more efficient to compile all the regexes down into a single automaton that matches all patterns in parallel. It might blow up the space usage significantly, though (DFAs can have exponentially many states relative to the pattern sizes), so it's worth investigating whether this will hurt.
Typically, the way you'd implement maximal-munch (matching the longest string you can) is to run the matching automaton as normal. Keep track of the index of the last match that you find. When the automaton enters a dead state and stops, you can then output the substring from the beginning of the token up through the match point, then jump back in the input sequence to the point right after the match finished. This can be done pretty efficiently and without much slowdown at all.
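That last-accepting-state bookkeeping can be sketched with a tiny hand-built DFA. This is an illustrative toy, not the answer's actual code: the table below combines just the patterns if and [A-Za-z_]\w*, with state 2 given IF priority over IDENT.

```python
from enum import Enum, auto

class Tok(Enum):
    IF = auto()
    IDENT = auto()

# States: 0 = start, 1 = saw "i", 2 = saw "if", 3 = other identifier.
# State 2 accepts IF (higher priority); states 1 and 3 accept IDENT.
ACCEPT = {1: Tok.IDENT, 2: Tok.IF, 3: Tok.IDENT}

def _step(state, ch):
    ident = ch.isalnum() or ch == "_"
    if state == 0:
        if ch == "i":
            return 1
        return 3 if (ch.isalpha() or ch == "_") else None
    if state == 1 and ch == "f":
        return 2
    return 3 if ident else None  # states 1, 2, 3 continue as identifier

def lex_dfa(src):
    tokens = []
    pos = 0
    while pos < len(src):
        if src[pos].isspace():
            pos += 1
            continue
        state, i = 0, pos
        last_accept = None  # (end index, token) at the last accepting state
        while i < len(src):
            state = _step(state, src[i])
            if state is None:  # dead: no outgoing transition
                break
            i += 1
            if state in ACCEPT:
                last_accept = (i, ACCEPT[state])
        if last_accept is None:
            raise ValueError(f"no token at offset {pos}")
        end, tok = last_accept
        tokens.append((tok, src[pos:end]))
        pos = end  # resume right after the last accepting position
    return tokens
```

A real generator would build the transition table from the user's regexes (via NFA construction and subset construction) instead of hand-coding it, but the run loop is the same.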
In case it helps, here are some lecture slides from a compilers course I taught that explores scanning techniques.
Hope this helps!