How much time would it take to write a C++ compiler using lex/yacc?
Where can I get started with it?
13 answers
#1
21
Many C++ constructs cannot be parsed by a bison/yacc parser (for example, distinguishing between a declaration and a function call in some circumstances). Additionally, the interpretation of tokens sometimes requires input from the parser, particularly in C++0x. The handling of the character sequence >>, for example, depends crucially on parsing context.
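To make the >> point concrete, here is a minimal sketch of a lexer that needs help from outside to decide what >> means. It is a toy, not how any real compiler does it: the function name `lex_with_context` and the `template_names` set (standing in for what the parser and symbol table would know) are assumptions made up for this illustration, and it ignores cases such as a greater-than comparison inside a template argument list.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Toy lexer: splits ">>" into two ">" tokens only while a template
// argument list is open. `template_names` is a hypothetical stand-in for
// information that really comes from the parser/symbol table.
std::vector<std::string> lex_with_context(const std::string& src,
                                          const std::set<std::string>& template_names) {
    std::vector<std::string> tokens;
    int depth = 0;  // number of currently open template argument lists
    std::size_t i = 0;
    while (i < src.size()) {
        unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) { ++i; continue; }
        if (std::isalnum(c) || c == '_') {
            std::size_t start = i;
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) {
                ++i;
            }
            std::string word = src.substr(start, i - start);
            tokens.push_back(word);
            // A '<' after a known template name opens an argument list.
            std::size_t j = i;
            while (j < src.size() && std::isspace(static_cast<unsigned char>(src[j]))) ++j;
            if (template_names.count(word) && j < src.size() && src[j] == '<') {
                tokens.push_back("<");
                ++depth;
                i = j + 1;
            }
            continue;
        }
        if (c == '>') {
            if (depth > 0) {                      // closes a template argument list
                tokens.push_back(">");
                --depth;
                ++i;
            } else if (i + 1 < src.size() && src[i + 1] == '>') {
                tokens.push_back(">>");           // ordinary shift operator
                i += 2;
            } else {
                tokens.push_back(">");
                ++i;
            }
            continue;
        }
        tokens.push_back(std::string(1, src[i]));  // any other single character
        ++i;
    }
    return tokens;
}
```

With `{"vector"}` as the known template names, `vector<vector<int>> v` ends in two separate > tokens, while `a >> b` keeps a single >> token. A plain lex specification has no natural place for that kind of state.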
Those two tools are very poor choices for parsing C++. To parse C++ correctly you would have to add a lot of special cases that escape the basic framework those tools rely on. It would take you a long time, and even then your parser would likely have weird bugs.
yacc and bison are LALR(1) parser generators, which are not sophisticated enough to handle C++ effectively. As other people have pointed out, most C++ compilers now use a recursive descent parser, and several other answers have pointed at good solutions for writing your own.
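For readers who have not seen one, this is roughly what a hand-written recursive descent parser looks like. It is a sketch for a toy arithmetic grammar, nowhere near C++; the class name `ExprParser` and the grammar itself are assumptions chosen for the illustration. The idea is one function per grammar rule, each consuming input and calling the functions for the rules it refers to.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Toy grammar (not C++):
//   expr   := term   ('+' term)*
//   term   := factor ('*' factor)*
//   factor := NUMBER | '(' expr ')'
class ExprParser {
public:
    explicit ExprParser(std::string text) : src_(std::move(text)) {}

    // Parses the whole input and returns the value of the expression.
    long parse() {
        long value = expr();
        skip_spaces();
        if (pos_ != src_.size()) throw std::runtime_error("trailing input");
        return value;
    }

private:
    long expr() {
        long value = term();
        while (peek() == '+') { ++pos_; value += term(); }
        return value;
    }
    long term() {
        long value = factor();
        while (peek() == '*') { ++pos_; value *= factor(); }
        return value;
    }
    long factor() {
        if (peek() == '(') {
            ++pos_;
            long value = expr();
            if (peek() != ')') throw std::runtime_error("expected ')'");
            ++pos_;
            return value;
        }
        if (!std::isdigit(static_cast<unsigned char>(peek()))) {
            throw std::runtime_error("expected number");
        }
        long value = 0;
        while (pos_ < src_.size() && std::isdigit(static_cast<unsigned char>(src_[pos_]))) {
            value = value * 10 + (src_[pos_++] - '0');
        }
        return value;
    }
    // Skips whitespace and returns the next character, or '\0' at end of input.
    char peek() {
        skip_spaces();
        return pos_ < src_.size() ? src_[pos_] : '\0';
    }
    void skip_spaces() {
        while (pos_ < src_.size() && std::isspace(static_cast<unsigned char>(src_[pos_]))) ++pos_;
    }

    std::string src_;
    std::size_t pos_ = 0;
};
```

Because each rule is ordinary code, a parser written this way can consult whatever context it likes (symbol tables, lookahead of any length) at any point, which is the flexibility a table-driven LALR(1) parser lacks.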
C++ templates are no good for handling strings, even constant ones (though this may be fixed in C++0x, I haven't researched carefully), but if they were, you could pretty easily write a recursive descent parser in the C++ template language. I find that rather amusing.
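As a hedged footnote to that remark: this is not the template language proper, but assuming a C++14 (or later) compiler, `constexpr` functions can walk a constant string while the program is being compiled. The sketch below evaluates a toy grammar of numbers, + and * inside a `static_assert`; the names `Result`, `parse_number`, `parse_term` and `parse_expr` are made up for this example.

```cpp
#include <cassert>

// Value parsed so far plus a pointer to the unread rest of the string.
struct Result {
    int value;
    const char* rest;
};

// number := digit+
constexpr Result parse_number(const char* s) {
    int v = 0;
    while (*s >= '0' && *s <= '9') {
        v = v * 10 + (*s - '0');
        ++s;
    }
    return Result{v, s};
}

// term := number ('*' number)*
constexpr Result parse_term(const char* s) {
    Result r = parse_number(s);
    while (*r.rest == '*') {
        Result rhs = parse_number(r.rest + 1);
        r.value = r.value * rhs.value;
        r.rest = rhs.rest;
    }
    return r;
}

// expr := term ('+' term)*
constexpr int parse_expr(const char* s) {
    Result r = parse_term(s);
    while (*r.rest == '+') {
        Result rhs = parse_term(r.rest + 1);
        r.value = r.value + rhs.value;
        r.rest = rhs.rest;
    }
    return r.value;
}

// Checked by the compiler, not at run time.
static_assert(parse_expr("2+3*4") == 14, "parsed at compile time");
```

If the `static_assert` compiles, the string was parsed by the compiler itself.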
#2
10
It sounds like you're pretty new to parsing/compiler creation. If that's the case, I'd highly recommend not starting with C++. It's a monster of a language.
Either invent a trivial toy language of your own, or model your project on something much smaller and simpler. I saw a Lua parser whose grammar definition was about a page long. That would be a much more reasonable starting point.
#3
10
It will probably take you years, and you'll probably switch to some other parser generator in the process.
Parsing C++ is notoriously error-prone. The grammar is not fully LR-parsable, as many parts are context-sensitive. You won't be able to get it working right in flex/yacc, or at least it'll be really awkward to implement. There are only two front-ends I know of that get it right. Your best bet is to use one of these and focus on writing the back-end. That's where the interesting stuff is anyway :-).
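A classic instance of that context sensitivity is the statement `T * x;`. If T names a type, it declares x as a pointer; if T is a variable, it is a multiplication whose result is thrown away. The sketch below only shows that the grammar alone cannot decide. `classify` and `known_types` are hypothetical names, and the function handles nothing but statements of exactly this shape.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <string>

enum class StatementKind { Declaration, Expression };

// Classifies statements of the shape "X * y;" only. The answer depends
// entirely on `known_types`, i.e. on declarations seen earlier, not on
// the token sequence itself.
StatementKind classify(const std::string& statement,
                       const std::set<std::string>& known_types) {
    std::istringstream in(statement);
    std::string first;
    in >> first;  // first whitespace-separated word, e.g. "T"
    return known_types.count(first) ? StatementKind::Declaration
                                    : StatementKind::Expression;
}
```

The same text, `T * x;`, comes back as a declaration when T is in the set and as an expression when it is not, so the parser has to be fed by semantic analysis while it is still parsing.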
Existing C++ Front Ends:
- The EDG front-end is used by most of the commercial vendors (Intel, Portland Group, etc.) in their compilers. It costs money, but it's very thorough. People pay big bucks for it because they don't want to deal with the pain of writing their own C++ parser.
- GCC's C++ front-end is thorough enough for production code, but you'd have to figure out how to integrate this into your project. I believe it's fairly involved to separate it from GCC. This would also be GPL, but I'm not sure whether that's a problem for you. You can use the GCC front-end in your project via gcc_xml, but this will only give you XML for classes, functions, namespaces, and typedefs. It won't give you a syntax tree for the code.
- Another possibility is to use clang, but their C++ support is currently spotty. It'll be nice to see them get all the bugs out, but if you look at their C++ status page you'll notice there are more than a few test cases that still break. Take heed -- clang is a big project. If it's taking these guys years to implement a C++ front-end, it's going to take you longer.
- Others have mentioned ANTLR, and there is a C++ grammar available for it, but I'm skeptical. I haven't heard of an ANTLR front end being used in any major compilers, though I do believe it's used in the NetBeans IDE. It might be suitable for an IDE, but I'm skeptical that you'd be able to use it on production code.
#4
6
A long time, and lex and yacc won't help
If you have the skills to write a compiler for such a large language, you will not need the small amount of help that lex and yacc give you. In fact, while lex is OK it may take longer to use yacc, as it's not really quite powerful enough for C or C++, and you can end up spending far more time getting it to work right than it would take to just write a recursive descent parser.
I believe lex and yacc are best used for simple grammars, or when it is worth the extra effort to have a nicely readable grammar file, perhaps because the grammar is experimental and subject to change.
For that matter, the entire parser is possibly not the major part of your job, depending on exactly what goals you have for the code generator.
#5
3
Firstly, the "flex" tag on SO is about Adobe's product, not the lexer generator. Secondly, Bjarne Stroustrup is on record as saying he wished he had implemented Cfront (the first C++ compiler) using recursive descent rather than a table-driven tool. And thirdly, to answer your question directly: lots. If you feel you need to write one, take a look at ANTLR. It's not my favourite tool, but there are already C++ parsers for it.
#6
3
This is a non-trivial problem, and it would take quite a lot of time to do correctly. For one thing, the grammar for C++ is not completely parseable by an LALR parser generator such as yacc. You can handle subsets of the language, but getting the entire language specification correct is tricky.
You're not the first person to think that this is fun. Here's a nice blog-style article on the topic: Parsing C++
Here's an important quote from the article:
"After lots of investigation, I decided that writing a parser/analysis-tool for C++ is sufficiently difficult that it's beyond what I want to do as a hobby."
The problem with that article is that it's a bit old, and several of the links are broken. Here are some links to some other resources on the topic of writing C++ parsers:
- ANTLR Grammars (contain several grammars for C++)
- A YACC-able C++ 2.1 Grammar and the resulting ambiguities
- Parsing and Processing C++ Code (Wikipedia)
#7
3
As others have already said, yacc is a poor choice for implementing a C++ parser. One can do it; the original GCC did so, before the GCC team got disgusted with how hard it was to maintain and extend. (Flex might be OK as a lexer.)
Some say recursive descent parsers are best, because Bjarne Stroustrup said so. Our experience is that GLR parsing is the right answer for this, and our GLR-based C++ front end is a nice proof, as is the Elsa front end. Our front end has been used in anger on millions of lines of C++ (including Microsoft and GCC dialects) to carry out program analyses and massive source code transformation.
But what is not emphasized enough is that parsing is just a very small portion of what it takes to build a compiler, especially for C++. You need to also build symbol tables ("what does this identifier mean in this context?") and to do that you need to encode essentially most of several hundred pages of the C++ standard. We believe that the foundation on which we build compiler-like tools, DMS, is extremely good for doing this, and it took us over a man-year to get just this part right.
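To give a feel for the starting point only, here is a minimal sketch of a scoped symbol table. It is an assumption-laden toy (the names `SymbolTable` and `SymbolKind` are invented here, and it is not how DMS or any production front end works): real C++ name lookup also has to deal with namespaces, class scopes, overloading, templates, and argument-dependent lookup, which is where the hundreds of pages of the standard come in.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

enum class SymbolKind { Type, Variable, Function, Unknown };

// Toy symbol table: a stack of scopes, innermost scope last.
class SymbolTable {
public:
    SymbolTable() { enter_scope(); }  // start with a global scope

    void enter_scope() { scopes_.emplace_back(); }

    // The global scope is never popped.
    void leave_scope() {
        if (scopes_.size() > 1) scopes_.pop_back();
    }

    void declare(const std::string& name, SymbolKind kind) {
        scopes_.back()[name] = kind;
    }

    // Answers "what does this identifier mean in this context?" by
    // searching from the innermost scope outwards.
    SymbolKind lookup(const std::string& name) const {
        for (auto scope = scopes_.rbegin(); scope != scopes_.rend(); ++scope) {
            auto found = scope->find(name);
            if (found != scope->end()) return found->second;
        }
        return SymbolKind::Unknown;
    }

private:
    std::vector<std::map<std::string, SymbolKind>> scopes_;
};
```

Even this toy shows why lookup is contextual: the same name can be a type in an outer scope and a variable in an inner one, and the answer changes again when the inner scope is left.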
But then you have the rest of the compiler to consider:
- Preprocessor
- AST construction
- Semantic analysis and type checking
- Control, Data flow, and pointer analysis
- Basic code generation
- Optimizations
- Register allocation
- Final Code Generation
- Debugging support
I keep saying this: building a parser (the BNF part) for a language is like climbing the foothills of the Himalayas. Building a full compiler is like climbing Everest. Pretty much any clod can do the former (although C++ is right at the edge). Only the really serious do the latter, and only when extremely well prepared.
Expect building a C++ compiler to take you years.
(The SD C++ front end handles lexing, parsing, AST generation, symbol tables, some type checking, and regeneration of compilable source text from the AST, including the original comments, for the major C++ dialects. It has been developed over a period of some 6 years).
EDIT: May, 2015. The original answer was written in 2010; we now have 11 years invested, taking us up through C++14. The point is that it is an endless, big effort to build one of these.
#8
2
Lex and yacc will not be enough. You also need a linker, an assembler, and a C preprocessor. It depends on how you do it. How many pre-made components do you plan to use? You need to get the description of the syntax and its tokens from somewhere.
For example, if you use LLVM, you can proceed faster. It already provides a lot of tools: assembler, linker, optimiser, and so on. You can get a C preprocessor from the Boost project. You also need to create a test suite to test your compiler automatically.
It can take a year if you work on it each day, or much less if you have more talent and motivation.
#9
2
Unless you have already written several other compilers, C++ is not a language you even want to start writing a compiler from scratch for. The language has a lot of places where the meaning requires a lot of context before the situation can be disambiguated.
Even if you have lots of experience writing compilers you are looking at several years for a team of developers. This is just to parse the code correctly into an intermediate format. Writing the backend to generate code is yet another specialized task (though you could steal the gcc backend).
If you search Google for "C++ grammars" there are a couple around to get you started.
C++ LEX Tokens: http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxLexer.l
C++ YACC Grammar: http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxGrammar.y
http://www.computing.surrey.ac.uk/research/dsrg/fog/CxxTester.y
#10
1
A C++ compiler is very complicated. To implement enough of C++ to be compatible with most C++ code out there would take several developers a couple of years full time. clang is a compiler project being funded by Apple to develop a new compiler for C, C++, and Objective-C, with several full-time developers, and the C++ support is still very far from being complete after a couple of years of development.
#11
1
A few years - if you can get a research grant to rewrite a new lex/yacc :-)
People keep chasing their tails on this a lot - starting with Stroustrup, who always fancied being a language "designer" rather than an actual compiler writer (remember that his C++ was a mere codegen for ages and would still be there if it wasn't for gcc and other folks).
The core issue is that real research on parser generators pretty much ceased to exist once CPUs became fast enough to handle functional languages and brute-force recursive descent. Recursive descent is the last resort when you don't know what to do - it does an exhaustive search till it nabs one "rule" that fires. Once you are content with that, you kind of lose interest in researching how to do it efficiently.
What you'd essentially need is a reasonable middle ground - like LALR(2) with fixed, limited backtracking (plus a static checker to yell if the "designer" splurges into a nondeterministic tree) and also limited and partitioned symbol table feedback (modern parsers need to be concurrency-friendly).
Sounds like a research grant proposal, doesn't it :-) Now if we'd find someone to actually fund it, that would be something :-))
#12
0
Recursive descent is a good choice for parsing C++. GCC and clang use it.
The Elsa parser (and my ellcc compiler) use the Elkhound GLR parser generator.
In either case, writing a C++ compiler is a BIG job.
#13
0
Well, what do you mean by "write a compiler"?
I doubt any one guy has made a true C++ compiler that took it all the way down to assembly code, but I have used lex and yacc to make a C compiler, and I have also done it without them.
Using both, you can make a compiler that leaves out the semantics in a couple of days, but figuring out how to use them can easily take weeks or months. Figuring out how to make a compiler at all will take weeks or months no matter what. The figure I remember is that, once you know how it works, it took a few days with lex and yacc and a few weeks without, but the second approach gave better results and fewer bugs, so it's really questionable whether they are worth using at all.
The 'semantics' is the actual code production. That can be very simple code that's just enough to work and might not take long at all, or you could spend your whole life doing optimization on it.
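As a rough idea of the "very simple code" end of that range, here is a sketch that turns an expression tree into instructions for an imaginary stack machine. Everything in it is an assumption for illustration: the `Node` layout, the helper names `literal` and `binary`, and the instruction names `push`, `add` and `mul` belong to no real assembler or compiler.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Expression tree node: op is '+' or '*' for operators, 0 for a literal.
struct Node {
    char op;
    int value;  // used only when op == 0
    std::unique_ptr<Node> lhs;
    std::unique_ptr<Node> rhs;
};

std::unique_ptr<Node> literal(int v) {
    return std::unique_ptr<Node>(new Node{0, v, nullptr, nullptr});
}

std::unique_ptr<Node> binary(char op, std::unique_ptr<Node> l, std::unique_ptr<Node> r) {
    return std::unique_ptr<Node>(new Node{op, 0, std::move(l), std::move(r)});
}

// Post-order walk: emit code for both operands, then the operator.
void emit(const Node& n, std::vector<std::string>& out) {
    if (n.op == 0) {
        out.push_back("push " + std::to_string(n.value));
        return;
    }
    emit(*n.lhs, out);
    emit(*n.rhs, out);
    out.push_back(n.op == '+' ? "add" : "mul");
}
```

For the tree of 1 + 2 * 3 this emits push 1, push 2, push 3, mul, add. There is no register allocation and no optimization here, which is exactly the part you could spend a lifetime on.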
With C++ the big issue is templates, but there are so many little issues and rules that I can't imagine someone ever wanting to do this. Even if you DO finish, you won't necessarily have binary compatibility, i.e. output that a linker or the OS recognizes as a runnable program. There's more to it than just C++: its standard is hard enough to pin down, and there are yet more standards to worry about which are even less widely available.