I've posted this on the D newsgroup some months ago, but for some reason, the answer never really convinced me, so I thought I'd ask it here.
几个月前,我在D新闻组发布了这篇文章,但出于某种原因,答案从未真正说服我,所以我想我应该在这里问一下。
The grammar of D is apparently context-free.
显然,D的语法是上下文无关的。
The grammar of C++, however, isn't (even without macros). (Please read this carefully!)
然而,c++的语法不是(即使没有宏)。(请仔细阅读这个!)
Now granted, I know nothing (officially) about compilers, lexers, and parsers. All I know is from what I've learned on the web.
And here is what (I believe) I have understood regarding context, in not-so-technical lingo:
现在,我对编译器、lexer和解析器一无所知。我所知道的一切都是我在网上学到的。以下是我对上下文(我相信)的理解,用不太专业的术语来说:
The grammar of a language is context-free if and only if you can always understand the meaning (though not necessarily the exact behavior) of a given piece of its code without needing to "look" anywhere else.
一种语言的语法是上下文无关的,前提是您始终能够理解它的代码的某一部分的含义(尽管不一定是确切的行为),而不需要“查看”其他任何地方。
Or, in even less rigor:
或者更不严格地说:
The grammar cannot be context-free if I need I can't tell the type of an expression just by looking at it.
语法不能是上下文无关的,如果我需要,我不能仅仅通过查看它就知道表达式的类型。
So, for example, C++ fails the context-free test because the meaning of confusing<sizeof(x)>::q < 3 > (2)
depends on the value of q
.
因此,例如,c++没有通过上下文无关的测试,因为混淆
So far, so good.
到目前为止还好。
Now my question is: Can the same thing be said of D?
现在我的问题是,同样的事情也可以用D来描述吗?
In D, hashtables can be created through a Value[Key]
declaration, for example
在D中,可以通过Value[Key]声明来创建散列表
int[string] peoplesAges; // Maps names to ages
Static arrays can be defined in a similar syntax:
静态数组可以用类似的语法定义:
int[3] ages; // Array of 3 elements
And templates can be used to make them confusing:
模板可以用来使它们混淆:
template Test1(T...)
{
alias int[T[0]] Test;
}
template Test2(U...)
{
alias int[U] Test2; // LGTM
}
Test1!(5) foo;
Test1!(int) bar;
Test2!(int) baz; // Guess what? It's invalid code.
This means that I cannot tell the meaning of T[0]
or U
just by looking at it (i.e. it could be a number, it could be a data type, or it could be a tuple of God-knows-what). I can't even tell if the expression is grammatically valid (since int[U]
certainly isn't -- you can't have a hashtable with tuples as keys or values).
这意味着我不能仅仅通过观察T[0]或U的含义(也就是说,它可以是一个数字,也可以是一个数据类型,或者是一个天知道的元组)。我甚至不知道这个表达式是否语法有效(因为int[U]肯定是无效的——您不能有一个以元组作为键或值的hashtable)。
Any parsing tree that I attempt to make for Test
would fail to make any sense (since it would need to know whether the node contains a data type versus a literal or an identifier) unless it delays the result until the value of T
is known (making it context-dependent).
我尝试为测试构建的任何解析树都没有任何意义(因为它需要知道节点是包含数据类型,还是包含文本类型或标识符),除非它延迟结果,直到T的值已知(使其与上下文相关)。
Given this, is D actually context-free, or am I misunderstanding the concept?
Why/why not?
Update:
I just thought I'd comment: It's really interesting to see the answers, since:
我只是想评论一下:看到这些答案真的很有趣,因为:
- Some answers claim that C++ and D can't be context-free
- 一些回答声称c++和D不能是上下文无关的
- Some answers claim that C++ and D are both context-free
- 一些答案声称c++和D都是上下文无关的。
- Some answers support the claim that C++ is context-sensitive while D isn't
- 有些答案支持这样的说法,即c++是上下文敏感的,而D不是
- No one has yet claimed that C++ is context-free while D is context-sensitive :-)
- 还没有人声称c++是上下文无关的,而D是上下文敏感的:-)
I can't tell if I'm learning or getting more confused, but either way, I'm kind of glad I asked this... thanks for taking the time to answer, everyone!
我不知道我是在学习还是变得更困惑,但不管怎样,我都很高兴我问了这个……谢谢大家花时间来回答这个问题!
7 个解决方案
#1
41
Being context free is first a property of generative grammars. It means that what a non-terminal can generate will not depend on the context in which the non-terminal appears (in non context-free generative grammar, the very notion of "string generated by a given non-terminal" is in general difficult to define). This doesn't prevent the same string of symbols to be generated by two non-terminals (so for the same strings of symbols to appear in two different contexts with a different meaning) and has nothing to do with type checking.
上下文*首先是生成语法的属性。这意味着一个非终结符可以生成什么并不取决于非终结符出现的上下文(在非上下文无关生成语法中,“给定的非终结符生成的字符串”的概念通常很难定义)。这并不能阻止由两个非终端生成的相同的符号串(因此,相同的符号串出现在具有不同含义的两个不同上下文中),并且与类型检查无关。
It is common to extend the context-free definition from grammars to language by stating that a language is context-free if there is at least one context free grammar describing it.
如果一种语言至少有一个上下文无关语法来描述它,那么将上下文无关的定义从语法扩展到语言是很常见的。
In practice, no programming language is context-free because things like "a variable must be declared before it is used" can't be checked by a context-free grammar (they can be checked by some other kinds of grammars). This isn't bad, in practice the rules to be checked are divided in two: those you want to check with the grammar and those you check in a semantic pass (and this division also allows for better error reporting and recovery, so you sometimes want to accept more in the grammar than what would be possible in order to give your users better diagnostics).
在实践中,没有一种编程语言是与上下文无关的,因为像“必须在使用变量之前声明它”这样的事情不能由上下文无关的语法检查(它们可以由其他一些语法检查)。这不是坏的,在实践中检查规则分为两种:那些你想检查语法和你检查在语义传递(这部门还允许更好的错误报告和恢复,所以你有时想接受更多的语法比可能为了给用户更好的诊断)。
What people means by stating that C++ isn't context-free is that doing this division isn't possible in a convenient way (with convenient including as criteria "follows nearly the official language description" and "my parser generator tool support that kind of division"; allowing the grammar to be ambiguous and the ambiguity to be resolved by the semantic check is an relatively easy way to do the cut for C++ and follow quite will the C++ standard, but it is inconvenient when you are relying on tools which don't allow ambiguous grammars, when you have such tools, it is convenient).
人们说c++不是上下文无关的意思是,做这种划分是不可能的,用一种方便的方式(包括作为标准“几乎遵循官方语言描述”和“我的解析器生成器工具支持这种划分”);允许语法歧义和模糊语义检查需要解决的是一个相对简单的方法削减为c++和遵循将c++标准,但它是不方便当你依赖工具,不允许模糊语法,当你有这样的工具,很方便)。
I don't know enough about D to know if there is or not a convenient cut of the language rules in a context-free grammar with semantic checks, but what you show is far from proving the case there isn't.
我不太了解D,不知道在上下文无关的语法中是否存在或不存在语义检查的语言规则,但是您所展示的远远不能证明没有。
#2
20
The property of being context free is a very formal concept; you can find a definition here. Note that it applies to grammars: a language is said to be context free if there is at least one context free grammar that recognizes it. Note that there may be other grammars, possibly non context free, that recognize the same language.
无上下文属性是一个非常正式的概念;你可以在这里找到一个定义。注意,它适用于语法:如果至少有一个上下文无关语法可以识别语言,那么语言就是无上下文的。注意,可能还有其他语法(可能是非上下文无关的)可以识别相同的语言。
Basically what it means is that the definition of a language element cannot change according to which elements surround it. By language elements I mean concepts like expressions and identifiers and not specific instances of these concepts inside programs, like a + b
or count
.
它的基本意思是,语言元素的定义不能根据它周围的元素而改变。所谓语言元素,我指的是表达式和标识符之类的概念,而不是程序中这些概念的具体实例,比如a + b或count。
Let's try and build a concrete example. Consider this simple COBOL statement:
让我们尝试构建一个具体的示例。考虑一下这个简单的COBOL语句:
01 my-field PICTURE 9.9 VALUE 9.9.
Here I'm defining a field, i.e. a variable, which is dimensioned to hold one integral digit, the decimal point, and one decimal digit, with initial value 9.9 . A very incomplete grammar for this could be:
这里我定义了一个字段,也就是一个变量,它的维度包含一个整数,小数点和一个十进制数字,初始值是9。9。一种非常不完整的语法可以是:
field-declaration ::= level-number identifier 'PICTURE' expression 'VALUE' expression '.'
expression ::= digit+ ( '.' digit+ )
Unfortunately the valid expressions that can follow PICTURE
are not the same valid expressions that can follow VALUE
. I could rewrite the second production in my grammar as follows:
不幸的是,可以跟踪图像的有效表达式并不是可以遵循值的相同有效表达式。我可以在我的语法中重写第二个版本,如下所示:
'PICTURE' expression ::= digit+ ( '.' digit+ ) | 'A'+ | 'X'+
'VALUE' expression ::= digit+ ( '.' digit+ )
This would make my grammar context-sensitive, because expression
would be a different thing according to whether it was found after 'PICTURE'
or after 'VALUE'
. However, as it has been pointed out, this doesn't say anything about the underlying language. A better alternative would be:
这将使我的语法对上下文敏感,因为在“图片”或“值”之后,表达式会是不同的。然而,正如已经指出的,这并没有说明底层语言的任何问题。更好的选择是:
field-declaration ::= level-number identifier 'PICTURE' format 'VALUE' expression '.'
format ::= digit+ ( '.' digit+ ) | 'A'+ | 'X'+
expression ::= digit+ ( '.' digit+ )
which is context-free.
这是上下文无关。
As you can see this is very different from your understanding. Consider:
正如你所看到的,这与你的理解有很大的不同。考虑:
a = b + c;
There is very little you can say about this statement without looking up the declarations of a,b and c, in any of the languages for which this is a valid statement, however this by itself doesn't imply that any of those languages is not context free. Probably what is confusing you is the fact that context freedom is different from ambiguity. This a simplified version of your C++ example:
对于这句话,你几乎不可能不去查阅a、b和c的声明,在任何一种语言中,这是一个有效的声明,然而,这本身并不意味着任何这些语言都不是上下文无关的。可能让你困惑的是语境*和歧义是不同的。这是您的c++示例的简化版本:
a < b > (c)
This is ambiguous in that by looking at it alone you cannot tell whether this is a function template call or a boolean expression. The previous example on the other hand is not ambiguous; From the point of view of grammars it can only be interpreted as:
这是不明确的,通过单独查看它,您无法判断这是函数模板调用还是布尔表达式。另一方面,前面的例子并不含糊;从语法的角度来看,它只能解释为:
identifier assignment identifier binary-operator identifier semi-colon
In some cases you can resolve ambiguities by introducing context sensitivity at the grammar level. I don't think this is the case with the ambiguous example above: in this case you cannot eliminate the ambiguity without knowing whether a is a template or not. Note that when such information is not available, for instance when it depends on a specific template specialization, the language provides ways to resolve ambiguities: that is why you sometimes have to use typename
to refer to certain types within templates or to use template
when you call member function templates.
在某些情况下,您可以通过在语法级别引入上下文敏感性来解决歧义。我不认为这是上面的模棱两可的例子:在这种情况下,如果不知道a是否是模板,就无法消除歧义。注意,当这些信息不可用,比如当它取决于特定的模板特殊化,语言提供了方法来解决歧义:这就是为什么有时你必须使用typename指某些类型模板或使用模板内当你调用成员函数模板。
#3
8
There are already a lot of good answers, but since you are uninformed about grammars, parsers and compilers etc, let me demonstrate this by an example.
已经有很多好的答案了,但是由于您对语法、解析器和编译器等一无所知,让我通过一个示例来说明这一点。
First, the concept of grammars are quite intuitive. Imagine a set of rules:
首先,语法的概念是很直观的。想象一套规则:
S -> a T
T -> b G t
T -> Y d
b G -> a Y b
Y -> c
Y -> lambda (nothing)
And imagine you start with S
. The capital letters are non-terminals and the small letters are terminals. This means that if you get a sentence of all terminals, you can say the grammar generated that sentence as a "word" in the language. Imagine such substitutions with the above grammar (The phrase between *phrase* is the one being replaced):
假设从s开始,大写字母是非终端,小写字母是终端。这意味着,如果你得到了所有终端的句子,你可以说语法生成的句子作为语言中的一个“词”。想象这样的替换与上面的语法(短语之间的短语*是被替换的):
*S* -> a *T* -> a *b G* t -> a a *Y* b t -> a a b t
So, I could create aabt
with this grammar.
我可以用这个语法创建aabt。
Ok, back to main line.
好,回到主线。
Let us assume a simple language. You have numbers, two types (int and string) and variables. You can do multiplication on integers and addition on strings but not the other way around.
让我们假设一种简单的语言。您有数字、两种类型(int和string)和变量。你可以对整数进行乘法运算,也可以对字符串进行加法运算,而不是反过来。
First thing you need, is a lexer. That is usually a regular grammar (or equal to it, a DFA, or equally a regular expression) that matches the program tokens. It is common to express them in regular expressions. In our example:
首先你需要的是一个lexer。这通常是一个与程序令牌匹配的正则语法(或等于它、DFA或同样的正则表达式)。用正则表达式表达它们是很常见的。在我们的示例:
(I'm making these syntaxes up)
(我在合成这些语法)
number: [1-9][0-9]* // One digit from 1 to 9, followed by any number
// of digits from 0-9
variable: [a-zA-Z_][a-zA-Z_0-9]* // You get the idea. First a-z or A-Z or _
// then as many a-z or A-Z or _ or 0-9
// this is similar to C
int: 'i' 'n' 't'
string: 's' 't' 'r' 'i' 'n' 'g'
equal: '='
plus: '+'
multiply: '*'
whitespace: (' ' or '\n' or '\t' or '\r')* // to ignore this type of token
So, now you got a regular grammar, tokenizing your input, but it understands nothing of the structure.
现在你有了一个常规的语法,标记你的输入,但是它对结构一无所知。
Then you need a parser. The parser, is usually a context free grammar. A context free grammar means, in the grammar you only have single nonterminals on the left side of grammar rules. In the example in the beginning of this answer, the rule
然后需要一个解析器。解析器通常是无上下文语法。上下文无关语法意味着,在语法中,在语法规则的左侧只有一个非终结符。在这个答案开头的例子中,规则
b G -> a Y b
makes the grammar context-sensitive because on the left you have b G
and not just G
. What does this mean?
使语法上下文敏感因为在左边有b G而不是G,这是什么意思?
Well, when you write a grammar, each of the nonterminals have a meaning. Let's write a context-free grammar for our example (| means or. As if writing many rules in the same line):
当你写语法时,每个非终结符都有意义。让我们为我们的示例编写一个上下文无关的语法(|表示or。就好像在一行里写了很多规则):
program -> statement program | lambda
statement -> declaration | executable
declaration -> int variable | string variable
executable -> variable equal expression
expression -> integer_type | string_type
integer_type -> variable multiply variable |
variable multiply number |
number multiply variable |
number multiply number
string_type -> variable plus variable
Now this grammar can accept this code:
现在这个语法可以接受这个代码:
x = 1*y
int x
string y
z = x+y
Grammatically, this code is correct. So, let's get back to what context-free means. As you can see in the example above, when you expand executable
, you generate one statement of the form variable = operand operator operand
without any consideration which part of code you are at. Whether the very beginning or middle, whether the variables are defined or not, or whether the types match, you don't know and you don't care.
从语法上讲,这段代码是正确的。让我们回到上下文无关的意思。正如您在上面的示例中所看到的,当您展开可执行文件时,您将生成一个form variable = operand operator操作数的语句,而无需考虑您所在的代码的哪个部分。无论是最开始的还是中间的,变量是否被定义,或者类型是否匹配,你都不知道,也不在乎。
Next, you need semantics. This is were context-sensitive grammars come into play. First, let me tell you that in reality, no one actually writes a context sensitive grammar (because parsing it is too difficult), but rather bit pieces of code that the parser calls when parsing the input (called action routines. Although this is not the only way). Formally, however, you can define all you need. For example, to make sure you define a variable before using it, instead of this
接下来,您需要语义。这是上下文敏感的语法。首先,让我告诉您,实际上,实际上没有人编写上下文敏感的语法(因为解析太困难),而是在解析输入时解析器调用的代码片段(称为动作例程)。虽然这不是唯一的方法)。但是,在形式上,您可以定义所有需要的内容。例如,要确保在使用变量之前定义一个变量,而不是使用它。
executable -> variable equal expression
you have to have something like:
你必须有这样的东西:
declaration some_code executable -> declaration some_code variable equal expression
more complex though, to make sure the variable
in declaration matches the one being calculated.
但更复杂的是,要确保声明中的变量与正在计算的变量匹配。
Anyway, I just wanted to give you the idea. So, all these things are context-sensitive:
不管怎样,我只是想告诉你们这个想法。所有这些都是上下文相关的
- Type checking
- 类型检查
- Number of arguments to function
- 函数的参数数
- default value to function
- 默认值功能
- if
member
exists inobj
in code:obj.member
- 如果成员存在于obj中的代码:object .member
- Almost anything that's not like: missing
;
or}
- 几乎任何不像的东西:想念;或}
I hope you got an idea what are the differences (If you didn't, I'd be more than happy to explain).
我希望你能了解它们的区别(如果你不知道,我很乐意解释)。
So in summary:
因此在总结:
- Lexer uses a regular grammar to tokenize input
- Lexer使用常规语法来标记输入
- Parser uses a context-free grammar to make sure the program is in correct structure
- 解析器使用上下文无关的语法来确保程序处于正确的结构中
- Semantic analyzer uses a context-sensitive grammar to do type-checking, parameter matching etc etc
- 语义分析器使用上下文敏感的语法进行类型检查、参数匹配等。
It is not necessarily always like that though. This just shows you how each level needs to get more powerful to be able to do more stuff. However, each of the mentioned compiler levels could in fact be more powerful.
但也不一定总是这样。这只是向您展示了每个级别需要如何变得更强大才能做更多的事情。然而,上面提到的每个编译器级别实际上都可能更强大。
For example, one language that I don't remember, used array subscription and function call both with parentheses and therefore it required the parser to go look up the type (context-sensitive related stuff) of the variable and determine which rule (function_call or array_substitution) to take.
例如,有一种语言我不记得了,它同时使用了数组订阅和函数调用,并使用了括号,因此它要求解析器查找变量的类型(上下文敏感的相关内容),并确定采用哪个规则(function_call或array_replace)。
If you design a language with lexer that has regular expressions that overlap, then you would need to also look up the context to determine which type of token you are matching.
如果您设计的语言具有具有重叠的正则表达式的lexer,那么您还需要查找上下文来确定您正在匹配的是哪种类型的令牌。
To get to your question! With the example you mentioned, it is clear that the c++ grammar is not context-free. The language D, I have absolutely no idea, but you should be able to reason about it now. Think of it this way: In a context free grammar, a nonterminal can expand without taking into consideration anything, BUT the structure of the language. Similar to what you said, it expands, without "looking" anywhere else.
来回答你的问题!通过您提到的示例,很明显c++语法不是上下文无关的。语言D,我完全不知道,但是你现在应该可以对它进行推理。可以这样想:在上下文无关的语法中,非终结符可以在不考虑任何因素的情况下展开,只考虑语言的结构。与你所说的类似,它会扩张,而不会“看”其他地方。
A familiar example would be natural languages. For example in English, you say:
一个常见的例子是自然语言。例如在英语中,你说:
sentence -> subject verb object clause
clause -> .... | lambda
Well, sentence
and clause
are nonterminals here. With this grammar you can create these sentences:
句子和子句在这里是不定式。有了这个语法,你就可以造出以下句子:
I go there because I want to
or
或
I jump you that I is air
As you can see, the second one has the correct structure, but is meaningless. As long as a context free grammar is concerned, the meaning doesn't matter. It just expands verb
to whatever verb without "looking" at the rest of the sentence.
正如你所看到的,第二个结构是正确的,但是没有意义。只要上下文无关语法,意义就不重要了。它只是把动词扩展成任何动词,而不“看”句子的其余部分。
So if you think D has to at some point check how something was defined elsewhere, just to say the program is structurally correct, then its grammar is not context-free. If you isolate any part of the code and it still can say that it is structurally correct, then it is context-free.
所以如果你认为D必须在某个点检查其他地方是如何定义的,只是说程序结构正确,那么它的语法就不是上下文无关的。如果你将代码的任何部分分离出来,它仍然可以说它在结构上是正确的,那么它就是上下文无关的。
#4
7
The grammar cannot be context-free if I need I can't tell the type of an expression just by looking at it.
语法不能是上下文无关的,如果我需要,我不能仅仅通过查看它就知道表达式的类型。
No, that's flat out wrong. The grammar cannot be context-free if you can't tell if it is an expression just by looking at it and the parser's current state (am I in a function, in a namespace, etc).
不,完全错了。如果仅仅通过查看语法和解析器的当前状态(函数中的I、名称空间中的I等等)就不能判断语法是否是表达式,那么语法就不能是上下文无关的。
The type of an expression, however, is a semantic meaning, not syntactic, and the parser and the grammar do not give a penny about types or semantic validity or whether or not you can have tuples as values or keys in hashmaps, or if you defined that identifier before using it.
一个表达式的类型,然而,是一个语义,句法,语法解析器和不给一分钱类型或语义有效性或是否可以有元组值或键在hashmap或如果你定义的标识符之前使用它。
The grammar doesn't care what it means, or if that makes sense. It only cares about what it is.
语法并不关心它的含义,或者它是否有意义。它只关心它是什么。
#5
7
There is a construct in D's lexer:
D的lexer中有一个结构:
string ::= q" Delim1 Chars newline Delim2 "
where Delim1 and Delim2 are matching identifiers, and Chars does not contain newline Delim2.
其中Delim1和Delim2是匹配标识符,而Chars不包含换行Delim2。
This construct is context sensitive, therefore D's lexer grammar is context sensitive.
这个构造是上下文敏感的,因此D的lexer语法是上下文敏感的。
It's been a few years since I've worked with D's grammar much, so I can't remember all the trouble spots off the top of my head, or even if any of them make D's parser grammar context sensitive, but I believe they do not. From recall, I would say D's grammar is context free, not LL(k) for any k, and it has an obnoxious amount of ambiguity.
我使用D的语法已经有好几年了,所以我记不起我脑子里所有的问题点,即使它们中的任何一个使D的解析器语法上下文敏感,但我认为它们不敏感。从回忆起,我认为D的语法是上下文无关的,而不是k,它有一个令人讨厌的歧义。
#6
6
To answer the question of if a programming language is context free you must first decide where to draw the line between syntax and semantics. As an extreme example, it is illegal in C for a program to use the value of some kinds of integers after they have been allowed to overflow. Clearly this can't be checked at compile time, let alone parse time:
要回答编程语言是否是上下文无关的问题,您必须首先决定在语法和语义之间划分界限。作为一个极端的例子,在C中,程序在允许某些类型的整数溢出后使用它们的值是非法的。显然,在编译时不能检查这一点,更不用说解析时间了:
void Fn() {
int i = INT_MAX;
FnThatMightNotReturn(); // halting problem?
i++;
if(Test(i)) printf("Weeee!\n");
}
As a less extreme example that others have pointed out, deceleration before use rules can't be enforced in a context free syntax so if you wish to keep your syntax pass context free, then that must be deferred to the next pass.
作为一个不太极端的例子,其他人已经指出,在使用规则之前的减速不能在上下文无关的语法中强制执行,因此如果您希望保持语法传递上下文的*,那么必须将其延迟到下一个传递。
As a practical definition, I would start with the question of: Can you correctly and unambiguously determine the parse tree of all correct programs using a context free grammar and, for all incorrect programs (that the language requires be rejected), either reject them as syntactically invalid or produce a parse tree that the later passes can identify as invalid and reject?
作为一个实际的定义,我想开始的问题:你能正确和明确确定的解析树使用上下文无关语法和所有正确的程序,对所有不正确的程序(语言需要被拒绝),要么拒绝他们语法无效或产生一个解析树,后来通过确定为无效并拒绝吗?
Given that the most correct spec for the D syntax is a parser (IIRC an LL parser) I strongly suspect that it is in fact context free by the definition I suggested.
考虑到D语法最正确的规范是解析器(IIRC和LL解析器),我强烈怀疑根据我建议的定义,它实际上是上下文无关的。
Note: the above says nothing about what grammar the language documentation or a given parser uses, only if a context free grammar exists. Also, the only full documentation on the D language is the source code of the compiler DMD.
注意:上面没有说明语言文档或给定解析器使用什么语法,只有在存在上下文无关语法的情况下。此外,关于D语言的唯一完整文档是编译器DMD的源代码。
#7
4
These answers are making my head hurt.
这些回答让我头疼。
First of all, the complications with low level languages and figuring out whether they are context-free or not, is that the language you write in is often processed in many steps.
首先,低层次语言的复杂性,并弄清楚它们是否与上下文无关,这就是您所编写的语言通常在许多步骤中被处理。
In C++ (order may be off, but that shouldn't invalidate my point):
在c++中(顺序可能是off,但这不应该使我的观点无效):
- it has to process macros and other preprocessor stuffs
- 它必须处理宏和其他预处理程序。
- it has to interpret templates
- 它必须解释模板
- it finally interprets your code.
- 它最终会解释您的代码。
Because the first step can change the context of the second step and the second step can change the context of the third step, the language YOU write in (including all of these steps) is context sensitive.
因为第一步可以改变第二步的上下文,而第二步可以改变第三步的上下文,所以您所编写的语言(包括所有这些步骤)是上下文敏感的。
The reason people will try and defend a language (stating it is context-free) is, because the only exceptions that adds context are the traceable preprocessor statements and template calls. You only have to follow two restricted exceptions to the rules to pretend the language is context-free.
人们尝试和保护一种语言(声明它是上下文无关的)的原因是,添加上下文的惟一例外是可跟踪的预处理器语句和模板调用。您只需遵循规则的两个受限异常,就可以假装语言是上下文无关的。
Most languages are context-sensitive overall, but most languages only have these minor exceptions to being context-free.
大多数语言都是上下文敏感的,但是大多数语言只有这些小的例外,没有上下文。
#1
41
Being context free is first a property of generative grammars. It means that what a non-terminal can generate will not depend on the context in which the non-terminal appears (in non context-free generative grammar, the very notion of "string generated by a given non-terminal" is in general difficult to define). This doesn't prevent the same string of symbols to be generated by two non-terminals (so for the same strings of symbols to appear in two different contexts with a different meaning) and has nothing to do with type checking.
上下文*首先是生成语法的属性。这意味着一个非终结符可以生成什么并不取决于非终结符出现的上下文(在非上下文无关生成语法中,“给定的非终结符生成的字符串”的概念通常很难定义)。这并不能阻止由两个非终端生成的相同的符号串(因此,相同的符号串出现在具有不同含义的两个不同上下文中),并且与类型检查无关。
It is common to extend the context-free definition from grammars to language by stating that a language is context-free if there is at least one context free grammar describing it.
如果一种语言至少有一个上下文无关语法来描述它,那么将上下文无关的定义从语法扩展到语言是很常见的。
In practice, no programming language is context-free because things like "a variable must be declared before it is used" can't be checked by a context-free grammar (they can be checked by some other kinds of grammars). This isn't bad, in practice the rules to be checked are divided in two: those you want to check with the grammar and those you check in a semantic pass (and this division also allows for better error reporting and recovery, so you sometimes want to accept more in the grammar than what would be possible in order to give your users better diagnostics).
在实践中,没有一种编程语言是与上下文无关的,因为像“必须在使用变量之前声明它”这样的事情不能由上下文无关的语法检查(它们可以由其他一些语法检查)。这不是坏的,在实践中检查规则分为两种:那些你想检查语法和你检查在语义传递(这部门还允许更好的错误报告和恢复,所以你有时想接受更多的语法比可能为了给用户更好的诊断)。
What people means by stating that C++ isn't context-free is that doing this division isn't possible in a convenient way (with convenient including as criteria "follows nearly the official language description" and "my parser generator tool support that kind of division"; allowing the grammar to be ambiguous and the ambiguity to be resolved by the semantic check is an relatively easy way to do the cut for C++ and follow quite will the C++ standard, but it is inconvenient when you are relying on tools which don't allow ambiguous grammars, when you have such tools, it is convenient).
人们说c++不是上下文无关的意思是,做这种划分是不可能的,用一种方便的方式(包括作为标准“几乎遵循官方语言描述”和“我的解析器生成器工具支持这种划分”);允许语法歧义和模糊语义检查需要解决的是一个相对简单的方法削减为c++和遵循将c++标准,但它是不方便当你依赖工具,不允许模糊语法,当你有这样的工具,很方便)。
I don't know enough about D to know if there is or not a convenient cut of the language rules in a context-free grammar with semantic checks, but what you show is far from proving the case there isn't.
我不太了解D,不知道在上下文无关的语法中是否存在或不存在语义检查的语言规则,但是您所展示的远远不能证明没有。
#2
20
The property of being context free is a very formal concept; you can find a definition here. Note that it applies to grammars: a language is said to be context free if there is at least one context free grammar that recognizes it. Note that there may be other grammars, possibly non context free, that recognize the same language.
无上下文属性是一个非常正式的概念;你可以在这里找到一个定义。注意,它适用于语法:如果至少有一个上下文无关语法可以识别语言,那么语言就是无上下文的。注意,可能还有其他语法(可能是非上下文无关的)可以识别相同的语言。
Basically what it means is that the definition of a language element cannot change according to which elements surround it. By language elements I mean concepts like expressions and identifiers and not specific instances of these concepts inside programs, like a + b
or count
.
它的基本意思是,语言元素的定义不能根据它周围的元素而改变。所谓语言元素,我指的是表达式和标识符之类的概念,而不是程序中这些概念的具体实例,比如a + b或count。
Let's try and build a concrete example. Consider this simple COBOL statement:
让我们尝试构建一个具体的示例。考虑一下这个简单的COBOL语句:
01 my-field PICTURE 9.9 VALUE 9.9.
Here I'm defining a field, i.e. a variable, which is dimensioned to hold one integral digit, the decimal point, and one decimal digit, with initial value 9.9 . A very incomplete grammar for this could be:
这里我定义了一个字段,也就是一个变量,它的维度包含一个整数,小数点和一个十进制数字,初始值是9。9。一种非常不完整的语法可以是:
field-declaration ::= level-number identifier 'PICTURE' expression 'VALUE' expression '.'
expression ::= digit+ ( '.' digit+ )
Unfortunately the valid expressions that can follow PICTURE
are not the same valid expressions that can follow VALUE
. I could rewrite the second production in my grammar as follows:
不幸的是,可以跟踪图像的有效表达式并不是可以遵循值的相同有效表达式。我可以在我的语法中重写第二个版本,如下所示:
'PICTURE' expression ::= digit+ ( '.' digit+ ) | 'A'+ | 'X'+
'VALUE' expression ::= digit+ ( '.' digit+ )
This would make my grammar context-sensitive, because expression
would be a different thing according to whether it was found after 'PICTURE'
or after 'VALUE'
. However, as it has been pointed out, this doesn't say anything about the underlying language. A better alternative would be:
这将使我的语法对上下文敏感,因为在“图片”或“值”之后,表达式会是不同的。然而,正如已经指出的,这并没有说明底层语言的任何问题。更好的选择是:
field-declaration ::= level-number identifier 'PICTURE' format 'VALUE' expression '.'
format ::= digit+ ( '.' digit+ ) | 'A'+ | 'X'+
expression ::= digit+ ( '.' digit+ )
which is context-free.
这是上下文无关。
As you can see this is very different from your understanding. Consider:
正如你所看到的,这与你的理解有很大的不同。考虑:
a = b + c;
There is very little you can say about this statement without looking up the declarations of a,b and c, in any of the languages for which this is a valid statement, however this by itself doesn't imply that any of those languages is not context free. Probably what is confusing you is the fact that context freedom is different from ambiguity. This a simplified version of your C++ example:
对于这句话,你几乎不可能不去查阅a、b和c的声明,在任何一种语言中,这是一个有效的声明,然而,这本身并不意味着任何这些语言都不是上下文无关的。可能让你困惑的是语境*和歧义是不同的。这是您的c++示例的简化版本:
a < b > (c)
This is ambiguous in that by looking at it alone you cannot tell whether this is a function template call or a boolean expression. The previous example on the other hand is not ambiguous; From the point of view of grammars it can only be interpreted as:
这是不明确的,通过单独查看它,您无法判断这是函数模板调用还是布尔表达式。另一方面,前面的例子并不含糊;从语法的角度来看,它只能解释为:
identifier assignment identifier binary-operator identifier semi-colon
In some cases you can resolve ambiguities by introducing context sensitivity at the grammar level. I don't think this is the case with the ambiguous example above: in this case you cannot eliminate the ambiguity without knowing whether a is a template or not. Note that when such information is not available, for instance when it depends on a specific template specialization, the language provides ways to resolve ambiguities: that is why you sometimes have to use typename
to refer to certain types within templates or to use template
when you call member function templates.
在某些情况下,您可以通过在语法级别引入上下文敏感性来解决歧义。我不认为这是上面的模棱两可的例子:在这种情况下,如果不知道a是否是模板,就无法消除歧义。注意,当这些信息不可用,比如当它取决于特定的模板特殊化,语言提供了方法来解决歧义:这就是为什么有时你必须使用typename指某些类型模板或使用模板内当你调用成员函数模板。
#3
8
There are already a lot of good answers, but since you are uninformed about grammars, parsers and compilers etc, let me demonstrate this by an example.
已经有很多好的答案了,但是由于您对语法、解析器和编译器等一无所知,让我通过一个示例来说明这一点。
First, the concept of grammars are quite intuitive. Imagine a set of rules:
首先,语法的概念是很直观的。想象一套规则:
S -> a T
T -> b G t
T -> Y d
b G -> a Y b
Y -> c
Y -> lambda (nothing)
And imagine you start with S
. The capital letters are non-terminals and the small letters are terminals. This means that if you get a sentence of all terminals, you can say the grammar generated that sentence as a "word" in the language. Imagine such substitutions with the above grammar (The phrase between *phrase* is the one being replaced):
假设从s开始,大写字母是非终端,小写字母是终端。这意味着,如果你得到了所有终端的句子,你可以说语法生成的句子作为语言中的一个“词”。想象这样的替换与上面的语法(短语之间的短语*是被替换的):
*S* -> a *T* -> a *b G* t -> a a *Y* b t -> a a b t
So, I could create aabt
with this grammar.
我可以用这个语法创建aabt。
Ok, back to main line.
好,回到主线。
Let us assume a simple language. You have numbers, two types (int and string) and variables. You can do multiplication on integers and addition on strings but not the other way around.
让我们假设一种简单的语言。您有数字、两种类型(int和string)和变量。你可以对整数进行乘法运算,也可以对字符串进行加法运算,而不是反过来。
First thing you need, is a lexer. That is usually a regular grammar (or equal to it, a DFA, or equally a regular expression) that matches the program tokens. It is common to express them in regular expressions. In our example:
首先你需要的是一个lexer。这通常是一个与程序令牌匹配的正则语法(或等于它、DFA或同样的正则表达式)。用正则表达式表达它们是很常见的。在我们的示例:
(I'm making these syntaxes up)
(我在合成这些语法)
number: [1-9][0-9]* // One digit from 1 to 9, followed by any number
// of digits from 0-9
variable: [a-zA-Z_][a-zA-Z_0-9]* // You get the idea. First a-z or A-Z or _
// then as many a-z or A-Z or _ or 0-9
// this is similar to C
int: 'i' 'n' 't'
string: 's' 't' 'r' 'i' 'n' 'g'
equal: '='
plus: '+'
multiply: '*'
whitespace: (' ' or '\n' or '\t' or '\r')* // to ignore this type of token
So, now you got a regular grammar, tokenizing your input, but it understands nothing of the structure.
现在你有了一个常规的语法,标记你的输入,但是它对结构一无所知。
Then you need a parser. The parser, is usually a context free grammar. A context free grammar means, in the grammar you only have single nonterminals on the left side of grammar rules. In the example in the beginning of this answer, the rule
然后需要一个解析器。解析器通常是无上下文语法。上下文无关语法意味着,在语法中,在语法规则的左侧只有一个非终结符。在这个答案开头的例子中,规则
b G -> a Y b
makes the grammar context-sensitive because on the left you have b G
and not just G
. What does this mean?
使语法上下文敏感因为在左边有b G而不是G,这是什么意思?
Well, when you write a grammar, each of the nonterminals have a meaning. Let's write a context-free grammar for our example (| means or. As if writing many rules in the same line):
当你写语法时,每个非终结符都有意义。让我们为我们的示例编写一个上下文无关的语法(|表示or。就好像在一行里写了很多规则):
program -> statement program | lambda
statement -> declaration | executable
declaration -> int variable | string variable
executable -> variable equal expression
expression -> integer_type | string_type
integer_type -> variable multiply variable |
variable multiply number |
number multiply variable |
number multiply number
string_type -> variable plus variable
Now this grammar can accept this code:
现在这个语法可以接受这个代码:
x = 1*y
int x
string y
z = x+y
Grammatically, this code is correct. So, let's get back to what context-free means. As you can see in the example above, when you expand executable
, you generate one statement of the form variable = operand operator operand
without any consideration which part of code you are at. Whether the very beginning or middle, whether the variables are defined or not, or whether the types match, you don't know and you don't care.
从语法上讲,这段代码是正确的。让我们回到上下文无关的意思。正如您在上面的示例中所看到的,当您展开可执行文件时,您将生成一个form variable = operand operator操作数的语句,而无需考虑您所在的代码的哪个部分。无论是最开始的还是中间的,变量是否被定义,或者类型是否匹配,你都不知道,也不在乎。
Next, you need semantics. This is were context-sensitive grammars come into play. First, let me tell you that in reality, no one actually writes a context sensitive grammar (because parsing it is too difficult), but rather bit pieces of code that the parser calls when parsing the input (called action routines. Although this is not the only way). Formally, however, you can define all you need. For example, to make sure you define a variable before using it, instead of this
接下来,您需要语义。这是上下文敏感的语法。首先,让我告诉您,实际上,实际上没有人编写上下文敏感的语法(因为解析太困难),而是在解析输入时解析器调用的代码片段(称为动作例程)。虽然这不是唯一的方法)。但是,在形式上,您可以定义所有需要的内容。例如,要确保在使用变量之前定义一个变量,而不是使用它。
executable -> variable equal expression
you have to have something like:
你必须有这样的东西:
declaration some_code executable -> declaration some_code variable equal expression
more complex though, to make sure the variable
in declaration matches the one being calculated.
但更复杂的是,要确保声明中的变量与正在计算的变量匹配。
Anyway, I just wanted to give you the idea. So, all these things are context-sensitive:
不管怎样,我只是想告诉你们这个想法。所有这些都是上下文相关的
- Type checking
- 类型检查
- Number of arguments to function
- 函数的参数数
- default value to function
- 默认值功能
- if
member
exists inobj
in code:obj.member
- 如果成员存在于obj中的代码:object .member
- Almost anything that's not like: missing
;
or}
- 几乎任何不像的东西:想念;或}
I hope you got an idea what are the differences (If you didn't, I'd be more than happy to explain).
我希望你能了解它们的区别(如果你不知道,我很乐意解释)。
So in summary:
因此在总结:
- Lexer uses a regular grammar to tokenize input
- Lexer使用常规语法来标记输入
- Parser uses a context-free grammar to make sure the program is in correct structure
- 解析器使用上下文无关的语法来确保程序处于正确的结构中
- Semantic analyzer uses a context-sensitive grammar to do type-checking, parameter matching etc etc
- 语义分析器使用上下文敏感的语法进行类型检查、参数匹配等。
It is not necessarily always like that though. This just shows you how each level needs to get more powerful to be able to do more stuff. However, each of the mentioned compiler levels could in fact be more powerful.
但也不一定总是这样。这只是向您展示了每个级别需要如何变得更强大才能做更多的事情。然而,上面提到的每个编译器级别实际上都可能更强大。
For example, one language that I don't remember, used array subscription and function call both with parentheses and therefore it required the parser to go look up the type (context-sensitive related stuff) of the variable and determine which rule (function_call or array_substitution) to take.
例如,有一种语言我不记得了,它同时使用了数组订阅和函数调用,并使用了括号,因此它要求解析器查找变量的类型(上下文敏感的相关内容),并确定采用哪个规则(function_call或array_replace)。
If you design a language with lexer that has regular expressions that overlap, then you would need to also look up the context to determine which type of token you are matching.
如果您设计的语言具有具有重叠的正则表达式的lexer,那么您还需要查找上下文来确定您正在匹配的是哪种类型的令牌。
To get to your question! With the example you mentioned, it is clear that the c++ grammar is not context-free. The language D, I have absolutely no idea, but you should be able to reason about it now. Think of it this way: In a context free grammar, a nonterminal can expand without taking into consideration anything, BUT the structure of the language. Similar to what you said, it expands, without "looking" anywhere else.
来回答你的问题!通过您提到的示例,很明显c++语法不是上下文无关的。语言D,我完全不知道,但是你现在应该可以对它进行推理。可以这样想:在上下文无关的语法中,非终结符可以在不考虑任何因素的情况下展开,只考虑语言的结构。与你所说的类似,它会扩张,而不会“看”其他地方。
A familiar example would be natural languages. For example in English, you say:
一个常见的例子是自然语言。例如在英语中,你说:
sentence -> subject verb object clause
clause -> .... | lambda
Well, sentence
and clause
are nonterminals here. With this grammar you can create these sentences:
句子和子句在这里是不定式。有了这个语法,你就可以造出以下句子:
I go there because I want to
or
或
I jump you that I is air
As you can see, the second one has the correct structure, but is meaningless. As long as a context free grammar is concerned, the meaning doesn't matter. It just expands verb
to whatever verb without "looking" at the rest of the sentence.
正如你所看到的,第二个结构是正确的,但是没有意义。只要上下文无关语法,意义就不重要了。它只是把动词扩展成任何动词,而不“看”句子的其余部分。
So if you think D has to at some point check how something was defined elsewhere, just to say the program is structurally correct, then its grammar is not context-free. If you isolate any part of the code and it still can say that it is structurally correct, then it is context-free.
所以如果你认为D必须在某个点检查其他地方是如何定义的,只是说程序结构正确,那么它的语法就不是上下文无关的。如果你将代码的任何部分分离出来,它仍然可以说它在结构上是正确的,那么它就是上下文无关的。
#4
7
The grammar cannot be context-free if I need I can't tell the type of an expression just by looking at it.
语法不能是上下文无关的,如果我需要,我不能仅仅通过查看它就知道表达式的类型。
No, that's flat out wrong. The grammar cannot be context-free if you can't tell if it is an expression just by looking at it and the parser's current state (am I in a function, in a namespace, etc).
不,完全错了。如果仅仅通过查看语法和解析器的当前状态(函数中的I、名称空间中的I等等)就不能判断语法是否是表达式,那么语法就不能是上下文无关的。
The type of an expression, however, is a semantic meaning, not syntactic, and the parser and the grammar do not give a penny about types or semantic validity or whether or not you can have tuples as values or keys in hashmaps, or if you defined that identifier before using it.
一个表达式的类型,然而,是一个语义,句法,语法解析器和不给一分钱类型或语义有效性或是否可以有元组值或键在hashmap或如果你定义的标识符之前使用它。
The grammar doesn't care what it means, or if that makes sense. It only cares about what it is.
语法并不关心它的含义,或者它是否有意义。它只关心它是什么。
#5
7
There is a construct in D's lexer:
D的lexer中有一个结构:
string ::= q" Delim1 Chars newline Delim2 "
where Delim1 and Delim2 are matching identifiers, and Chars does not contain newline Delim2.
其中Delim1和Delim2是匹配标识符,而Chars不包含换行Delim2。
This construct is context sensitive, therefore D's lexer grammar is context sensitive.
这个构造是上下文敏感的,因此D的lexer语法是上下文敏感的。
It's been a few years since I've worked with D's grammar much, so I can't remember all the trouble spots off the top of my head, or even if any of them make D's parser grammar context sensitive, but I believe they do not. From recall, I would say D's grammar is context free, not LL(k) for any k, and it has an obnoxious amount of ambiguity.
我使用D的语法已经有好几年了,所以我记不起我脑子里所有的问题点,即使它们中的任何一个使D的解析器语法上下文敏感,但我认为它们不敏感。从回忆起,我认为D的语法是上下文无关的,而不是k,它有一个令人讨厌的歧义。
#6
6
To answer the question of if a programming language is context free you must first decide where to draw the line between syntax and semantics. As an extreme example, it is illegal in C for a program to use the value of some kinds of integers after they have been allowed to overflow. Clearly this can't be checked at compile time, let alone parse time:
要回答编程语言是否是上下文无关的问题,您必须首先决定在语法和语义之间划分界限。作为一个极端的例子,在C中,程序在允许某些类型的整数溢出后使用它们的值是非法的。显然,在编译时不能检查这一点,更不用说解析时间了:
void Fn() {
int i = INT_MAX;
FnThatMightNotReturn(); // halting problem?
i++;
if(Test(i)) printf("Weeee!\n");
}
As a less extreme example that others have pointed out, deceleration before use rules can't be enforced in a context free syntax so if you wish to keep your syntax pass context free, then that must be deferred to the next pass.
作为一个不太极端的例子,其他人已经指出,在使用规则之前的减速不能在上下文无关的语法中强制执行,因此如果您希望保持语法传递上下文的*,那么必须将其延迟到下一个传递。
As a practical definition, I would start with the question of: Can you correctly and unambiguously determine the parse tree of all correct programs using a context free grammar and, for all incorrect programs (that the language requires be rejected), either reject them as syntactically invalid or produce a parse tree that the later passes can identify as invalid and reject?
作为一个实际的定义,我想开始的问题:你能正确和明确确定的解析树使用上下文无关语法和所有正确的程序,对所有不正确的程序(语言需要被拒绝),要么拒绝他们语法无效或产生一个解析树,后来通过确定为无效并拒绝吗?
Given that the most correct spec for the D syntax is a parser (IIRC an LL parser) I strongly suspect that it is in fact context free by the definition I suggested.
考虑到D语法最正确的规范是解析器(IIRC和LL解析器),我强烈怀疑根据我建议的定义,它实际上是上下文无关的。
Note: the above says nothing about what grammar the language documentation or a given parser uses, only if a context free grammar exists. Also, the only full documentation on the D language is the source code of the compiler DMD.
注意:上面没有说明语言文档或给定解析器使用什么语法,只有在存在上下文无关语法的情况下。此外,关于D语言的唯一完整文档是编译器DMD的源代码。
#7
4
These answers are making my head hurt.
这些回答让我头疼。
First of all, the complications with low level languages and figuring out whether they are context-free or not, is that the language you write in is often processed in many steps.
首先,低层次语言的复杂性,并弄清楚它们是否与上下文无关,这就是您所编写的语言通常在许多步骤中被处理。
In C++ (order may be off, but that shouldn't invalidate my point):
在c++中(顺序可能是off,但这不应该使我的观点无效):
- it has to process macros and other preprocessor stuffs
- 它必须处理宏和其他预处理程序。
- it has to interpret templates
- 它必须解释模板
- it finally interprets your code.
- 它最终会解释您的代码。
Because the first step can change the context of the second step and the second step can change the context of the third step, the language YOU write in (including all of these steps) is context sensitive.
因为第一步可以改变第二步的上下文,而第二步可以改变第三步的上下文,所以您所编写的语言(包括所有这些步骤)是上下文敏感的。
The reason people will try and defend a language (stating it is context-free) is, because the only exceptions that adds context are the traceable preprocessor statements and template calls. You only have to follow two restricted exceptions to the rules to pretend the language is context-free.
人们尝试和保护一种语言(声明它是上下文无关的)的原因是,添加上下文的惟一例外是可跟踪的预处理器语句和模板调用。您只需遵循规则的两个受限异常,就可以假装语言是上下文无关的。
Most languages are context-sensitive overall, but most languages only have these minor exceptions to being context-free.
大多数语言都是上下文敏感的,但是大多数语言只有这些小的例外,没有上下文。