How can I construct a clean, Python-like grammar in ANTLR?

G'day!

How can I construct a simple ANTLR grammar handling multi-line expressions without the need for either semicolons or backslashes?

I'm trying to write a simple DSL for expressions:

# sh style comments
ThisValue = 1
ThatValue = ThisValue * 2
ThisOtherValue = (1 + 2 + ThisValue * ThatValue)
YetAnotherValue = MAX(ThisOtherValue, ThatValue)

Overall, I want my application to provide the script with some initial named values and pull out the final result. I'm getting hung up on the syntax, however. I'd like to support multiple line expressions like the following:

# Note: no backslashes required to continue expression, as we're in brackets
# Note: no semicolon required at end of expression, either
ThisValueWithAReallyLongName = (ThisOtherValueWithASimilarlyLongName
                               +AnotherValueWithAGratuitouslyLongName)

I started off with an ANTLR grammar like this:

exprlist
    : ( assignment_statement | empty_line )* EOF!
    ;
assignment_statement
    : assignment NL!?
    ;
empty_line
    : NL;
assignment
    : ID '=' expr
    ;

// ... and so on

It seems simple, but I'm already in trouble with the newlines:

warning(200): *Question.g:11:20: Decision can match input such as "NL" using multiple alternatives: 1, 2
As a result, alternative(s) 2 were disabled for that input

Graphically, in org.antlr.works.IDE:

Decision Can Match NL Using Multiple Alternatives http://img.skitch.com/20090723-ghpss46833si9f9ebk48x28b82.png

I've kicked the grammar around, but I always end up violating at least one of the expected behaviors:

A newline is not required at the end of the file

Empty lines are acceptable

Everything in a line from a pound sign onward is discarded as a comment

Assignments end with end-of-line, not semicolons

Expressions can span multiple lines if wrapped in brackets

I can find example ANTLR grammars with many of these characteristics. I find that when I cut them down to limit their expressiveness to just what I need, I end up breaking something. Others are too simple, and I break them as I add expressiveness.

Which angle should I take with this grammar? Can you point to any examples that aren't either trivial or full Turing-complete languages?

3 Solutions

#1

I would let your tokenizer do the heavy lifting rather than mixing your newline rules into your grammar:

Count parentheses, brackets, and braces, and don't generate NL tokens while any group is unclosed. That gives you line continuations for free, without your grammar being any the wiser.

Always generate an NL token at the end of the file, whether or not the last line ends with a '\n' character. Then you don't have to worry about the special case of a statement without an NL: statements always end with an NL.
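
To make the two points above concrete, here is a minimal sketch of such a tokenizer, written in plain Python rather than as an ANTLR lexer. The token names (ID, NUMBER, OP, OPEN, CLOSE, NL), the regular expression, and the function name tokenize are assumptions made up for illustration; they are not taken from the question's grammar or from ANTLR itself.

```python
import re

# Hypothetical token pattern for the DSL in the question; names are illustrative.
TOKEN_RE = re.compile(r"""
    (?P<COMMENT>\#[^\n]*)        # sh-style comment, discarded
  | (?P<NL>\r?\n)                # end of line
  | (?P<WS>[ \t]+)               # horizontal whitespace, discarded
  | (?P<NUMBER>\d+)
  | (?P<ID>[A-Za-z_]\w*)
  | (?P<OPEN>[(\[{])
  | (?P<CLOSE>[)\]}])
  | (?P<OP>[=+\-*/,])
""", re.VERBOSE)


def tokenize(text):
    """Yield (kind, value) pairs, hiding newlines inside open brackets."""
    depth = 0          # number of currently unclosed (, [ or {
    last_kind = "NL"   # treat the start of input as following a newline
    pos = 0
    while pos < len(text):
        match = TOKEN_RE.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = match.end()
        kind, value = match.lastgroup, match.group()
        if kind in ("COMMENT", "WS"):
            continue
        if kind == "OPEN":
            depth += 1
        elif kind == "CLOSE":
            depth = max(depth - 1, 0)
        elif kind == "NL":
            if depth > 0:
                continue  # inside brackets: the line break is a continuation
            value = "\n"  # normalize \r\n to \n
        last_kind = kind
        yield kind, value
    if last_kind != "NL":
        yield "NL", "\n"  # synthetic NL so the last statement is terminated


if __name__ == "__main__":
    for token in tokenize("A = (1\n     + 2)\nB = A"):
        print(token)
```

With a token stream like this, the parser never sees a newline inside brackets and always sees one after the last statement, so the grammar can require NL after every assignment.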

The second point would let you simplify your grammar to something like this:

exprlist
    : ( assignment_statement | empty_line )* EOF!
    ;
assignment_statement
    : assignment NL
    ;
empty_line
    : NL
    ;
assignment
    : ID '=' expr
    ;

#2

How about this?

exprlist
    : (expr)? (NL+ expr)* NL!? EOF!
    ;
expr 
    : assignment | ...
    ;
assignment
    : ID '=' expr
    ;

#3

I assume you chose to make NL optional, because the last statement in your input code doesn't have to end with a newline.

While it makes a lot of sense, you are making life a lot harder for your parser. Separator tokens (like NL) should be cherished, as they disambiguate and reduce the chance of conflicts.

In your case, the parser doesn't know whether it should parse "assignment NL" or "assignment empty_line". There are many ways to solve it, but most of them are just band-aids for an unwise design choice.

My recommendation is an innocent hack: Make NL mandatory, and always append NL to the end of your input stream!

It may seem a little unsavory, but in reality it will save you a lot of future headaches.
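
As a rough illustration of that hack, the input can be normalized before it ever reaches the lexer. This is only a sketch in Python; the helper name ensure_trailing_newline is hypothetical, and appending a newline unconditionally would work just as well, since the grammar accepts empty lines.

```python
def ensure_trailing_newline(source: str) -> str:
    """Return source, guaranteed to end with a newline character."""
    if source.endswith("\n"):
        return source
    return source + "\n"  # terminate the last statement with an NL


if __name__ == "__main__":
    script = "ThisValue = 1\nThatValue = ThisValue * 2"
    print(repr(ensure_trailing_newline(script)))
```

The normalized text is then handed to the parser, whose assignment_statement rule can demand a mandatory NL.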
