如何解析缩进(python风格)?

时间:2022-11-06 20:09:25

How would you define your parser and lexer rules to parse a language that uses indentation for defining scope.

如何定义解析器和lexer规则来解析使用缩进定义范围的语言。

I have already googled and found a clever approach for parsing it by generating INDENT and DEDENT tokens in the lexer.

我已经在google上搜索并发现了一种通过在lexer中生成缩进和缩进令牌来解析它的聪明方法。

I will go deeper on this problem and post an answer if I come to something interesting, but I would like to see other approaches to the problem.

如果我发现了一些有趣的东西,我将深入研究这个问题并给出一个答案,但是我希望看到解决这个问题的其他方法。

EDIT: As Charlie pointed out, there is already another thread very similar if not the same. Should my post be deleted?

编辑:正如Charlie指出的,已经有另一个非常相似的线程了。我的帖子应该被删除吗?

2 个解决方案

#1


10  

This is kind of hypothetical, as it would depend on what technology you have for your lexer and parser, but the easiest way would seem to be to have BEGINBLOCK and ENDBLOCK tokens analogous to braces in C. Using the "offsides rule" your lexer needs to keep track of a stack of indendtation levels. When the indent level increases, emit a BEGINBLOCK for the parser; when the indentation level decreases, emit ENDBLOCK and pop levels off the stack.

这是假设,因为它将取决于你的技术为你的词法分析程序和解析器,但最简单的方法似乎是BEGINBLOCK和ENDBLOCK令牌类似于c使用括号“越位规则”你的词法分析程序需要跟踪一堆indendtation水平。当缩进级别增加时,为解析器发出开始块;当缩进级别降低时,从堆栈中发出端块和弹出级别。

Here's another discussion of this on SO, btw.

这是另一个关于这个的讨论,顺便说一句。

#2


1  

Also you can track somewhere in lexer how many ident items are preceding first line and pass it to parser. Most interesting part would be trying to pass it to parser correctly :) If your parser uses lookahead (here I mean parser may query for variable number of tokens before it really going to match even one) then trying to pass it through one global variable seems to be very bad idea (because lexer can slip on next line and change value of indent counter while parser is still trying to parse previous line). Also globals are evil in many other cases ;) Marking first line 'real' token in someway with indent counter is more reasonable. I can't give you exact example (I don't even know what parser and lexer generators are you going to use if any...) but something like storing data on first line tokens (it could be non comfortable if you can't easily get such token from parser) or saving custom data (map that links tokens to indent, array where every line in source code as index and indent value as element value) seems to be enough. One downside of this approach is additional complexity to parser that will need to distinguish between ident values and change its behavior based on it. Something like LOOKAHEAD({ yourConditionInJava }) for JavaCC may work here but it is NOT a very good idea. A lot of additional tokens in your approach seems to be less evil thing to use :)

您还可以在lexer中跟踪多少ident项在第一行之前,并将其传递给解析器。最有趣的部分就会试图把它正确地解析器:)如果您的解析器使用超前(这里我指的是解析器可能查询变量数量的令牌才真的会匹配一)然后试图通过一个全局变量似乎是非常糟糕的主意(因为词法分析程序可以滑下一行并更改缩进值计数器在解析器仍在试图解析前一行)。此外,全球变暖在许多其他情况下都是邪恶的;用缩进计数器以某种方式标记第一行“真实”标记更合理。我不能给你确切的例子(我甚至不知道解析器和词法分析程序生成器你打算使用如果任何…)但是类似存储数据第一行标记(可以是不舒服的,如果你不能很容易地获得这样的令牌从解析器)或保存自定义数据(地图链接标记缩进,数组中每一行源代码作为指数和缩进值元素值)似乎是足够的。这种方法的一个缺点是解析器的复杂性,它需要区分ident值并根据ident值改变其行为。类似于JavaCC的LOOKAHEAD({yourConditionInJava})在这里可能行得通,但这不是一个好主意。在你的方法中有很多额外的代币似乎不那么坏:)

As another alternative I would suggest is to mix this two approaches. You could generate additional tokens only when indent counter changes its value on next line. It is like artificial BEGIN and END token. In this way you may lower number of 'artificial' tokens in your stream fed into parser from lexer. Only your parser grammar should be adjusted to understand additional tokens...

作为另一种选择,我建议将这两种方法混合使用。只有当缩进计数器在下一行更改其值时,才可以生成其他令牌。它就像人工的开始和结束标记。这样,您可以减少从lexer输入到解析器的流中的“人工”令牌的数量。只有您的解析器语法需要调整以理解其他标记…

I didn't tried this (have no real experience with such languages parsing), just sharing my thoughts about possible solutions. Checking already built parsers for this kinds of languages could be of great value for you. Open source is your friend ;)

我没有尝试过这种方法(没有实际的语言解析经验),只是分享了我对可能的解决方案的想法。检查已经为这类语言构建的解析器可能对您有很大的价值。开源是你的朋友;)

#1


10  

This is kind of hypothetical, as it would depend on what technology you have for your lexer and parser, but the easiest way would seem to be to have BEGINBLOCK and ENDBLOCK tokens analogous to braces in C. Using the "offsides rule" your lexer needs to keep track of a stack of indendtation levels. When the indent level increases, emit a BEGINBLOCK for the parser; when the indentation level decreases, emit ENDBLOCK and pop levels off the stack.

这是假设,因为它将取决于你的技术为你的词法分析程序和解析器,但最简单的方法似乎是BEGINBLOCK和ENDBLOCK令牌类似于c使用括号“越位规则”你的词法分析程序需要跟踪一堆indendtation水平。当缩进级别增加时,为解析器发出开始块;当缩进级别降低时,从堆栈中发出端块和弹出级别。

Here's another discussion of this on SO, btw.

这是另一个关于这个的讨论,顺便说一句。

#2


1  

Also you can track somewhere in lexer how many ident items are preceding first line and pass it to parser. Most interesting part would be trying to pass it to parser correctly :) If your parser uses lookahead (here I mean parser may query for variable number of tokens before it really going to match even one) then trying to pass it through one global variable seems to be very bad idea (because lexer can slip on next line and change value of indent counter while parser is still trying to parse previous line). Also globals are evil in many other cases ;) Marking first line 'real' token in someway with indent counter is more reasonable. I can't give you exact example (I don't even know what parser and lexer generators are you going to use if any...) but something like storing data on first line tokens (it could be non comfortable if you can't easily get such token from parser) or saving custom data (map that links tokens to indent, array where every line in source code as index and indent value as element value) seems to be enough. One downside of this approach is additional complexity to parser that will need to distinguish between ident values and change its behavior based on it. Something like LOOKAHEAD({ yourConditionInJava }) for JavaCC may work here but it is NOT a very good idea. A lot of additional tokens in your approach seems to be less evil thing to use :)

您还可以在lexer中跟踪多少ident项在第一行之前,并将其传递给解析器。最有趣的部分就会试图把它正确地解析器:)如果您的解析器使用超前(这里我指的是解析器可能查询变量数量的令牌才真的会匹配一)然后试图通过一个全局变量似乎是非常糟糕的主意(因为词法分析程序可以滑下一行并更改缩进值计数器在解析器仍在试图解析前一行)。此外,全球变暖在许多其他情况下都是邪恶的;用缩进计数器以某种方式标记第一行“真实”标记更合理。我不能给你确切的例子(我甚至不知道解析器和词法分析程序生成器你打算使用如果任何…)但是类似存储数据第一行标记(可以是不舒服的,如果你不能很容易地获得这样的令牌从解析器)或保存自定义数据(地图链接标记缩进,数组中每一行源代码作为指数和缩进值元素值)似乎是足够的。这种方法的一个缺点是解析器的复杂性,它需要区分ident值并根据ident值改变其行为。类似于JavaCC的LOOKAHEAD({yourConditionInJava})在这里可能行得通,但这不是一个好主意。在你的方法中有很多额外的代币似乎不那么坏:)

As another alternative I would suggest is to mix this two approaches. You could generate additional tokens only when indent counter changes its value on next line. It is like artificial BEGIN and END token. In this way you may lower number of 'artificial' tokens in your stream fed into parser from lexer. Only your parser grammar should be adjusted to understand additional tokens...

作为另一种选择,我建议将这两种方法混合使用。只有当缩进计数器在下一行更改其值时,才可以生成其他令牌。它就像人工的开始和结束标记。这样,您可以减少从lexer输入到解析器的流中的“人工”令牌的数量。只有您的解析器语法需要调整以理解其他标记…

I didn't tried this (have no real experience with such languages parsing), just sharing my thoughts about possible solutions. Checking already built parsers for this kinds of languages could be of great value for you. Open source is your friend ;)

我没有尝试过这种方法(没有实际的语言解析经验),只是分享了我对可能的解决方案的想法。检查已经为这类语言构建的解析器可能对您有很大的价值。开源是你的朋友;)