As a pet project, I'd like to attempt to implement a basic language of my own design that can be used as a web-scripting language. It's trivial to run a C++ program as an Apache CGI, so the real work lies in how to parse an input file containing both non-code (HTML/CSS markup) and server-side code.
In my undergrad compiler course, we used Flex and Bison to generate a scanner and a parser for a simple language. We were given a copy of the grammar and wrote a parser that translated the simple language to a simple assembly for a virtual machine. The Flex scanner tokenizes the input and passes the tokens to the Bison parser.
The difference between that and what I'd like to do is that, like PHP, this language could have plain HTML markup and the scripting language interspersed, as in the following:
<p>Hello,
<? echo "World" ?>
</p>
Am I incorrect in assuming that it would be efficient to parse the input file as follows:
- Scan input until a script start tag is found ('<?').
- A second scanner tokenizes the server-side script section of the input file (from the open tag '<?' to the close tag '?>') and passes the tokens to the parser, which has no need to know about the markup in the file.
- Control is returned to the first scanner, which continues this general pattern.
Basically, the first scanner only differentiates between Markup (which is returned directly to the browser unmodified) and code, which is passed to the second scanner, which in turn tokenizes the code and passes the tokens to the parser.
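The two-stage idea described above can be sketched in C++ as a splitter that separates markup from code before any tokenizing happens. This is only a minimal sketch under assumptions: `Segment` and `splitTemplate` are hypothetical names, the tags are taken to be '<?' and '?>', and the tags are assumed never to appear inside string literals in the script.

```cpp
#include <string>
#include <vector>

// One piece of the input file: either raw markup or server-side code.
struct Segment {
    bool isCode;       // true for text found between "<?" and "?>"
    std::string text;  // the segment's contents, without the tags
};

// Hypothetical helper: splits a template into markup and code segments.
// Assumption: "<?" and "?>" never occur inside string literals in the code.
std::vector<Segment> splitTemplate(const std::string& input) {
    std::vector<Segment> segments;
    std::size_t pos = 0;
    while (pos < input.size()) {
        std::size_t open = input.find("<?", pos);
        if (open == std::string::npos) {
            // No more script tags: everything left is markup.
            segments.push_back({false, input.substr(pos)});
            break;
        }
        if (open > pos) {
            // Markup that precedes the script start tag.
            segments.push_back({false, input.substr(pos, open - pos)});
        }
        std::size_t codeStart = open + 2;
        std::size_t close = input.find("?>", codeStart);
        if (close == std::string::npos) {
            // Unterminated tag: treat the rest of the file as code.
            segments.push_back({true, input.substr(codeStart)});
            break;
        }
        segments.push_back({true, input.substr(codeStart, close - codeStart)});
        pos = close + 2;  // resume scanning markup after the close tag
    }
    return segments;
}
```

A driver could then write each markup segment straight to standard output and hand each code segment to the second scanner.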
If this is not a solid design pattern, how do languages such as PHP handle scanning input and parsing code efficiently?
2 Solutions
#1
6
You want to look at flex's start conditions. For example:
%x PHP
%%
"<?"            { BEGIN(PHP); }      /* enter the PHP state */
<PHP>[a-zA-Z]+  { return PHP_TOKEN; }
<PHP>"?>"       { BEGIN(INITIAL); }  /* INITIAL is state 0 */
[a-zA-Z]+       { return HTML_TOKEN; }
You start off in state 0 (which flex also names INITIAL) and use the BEGIN macro to change states. To match a regular expression only while in a particular state, prefix it with the state name surrounded by angle brackets.
In the example above, "PHP" is a state (a start condition, which must be declared in the definitions section with %s or %x). "PHP_TOKEN" and "HTML_TOKEN" are %tokens defined by your yacc file.
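To make the state switching concrete, here is a hand-written C++ sketch of what those rules do. It is not what flex generates: `Token`, `TokenKind`, and `scan` are hypothetical names, and the `mode` variable stands in for the start condition.

```cpp
#include <cctype>
#include <string>
#include <vector>

enum class TokenKind { HtmlToken, PhpToken };

struct Token {
    TokenKind kind;
    std::string text;
};

// Hypothetical hand-written counterpart of the rules above.
// `mode` plays the role of the flex start condition.
std::vector<Token> scan(const std::string& input) {
    enum class Mode { Initial, Php };
    Mode mode = Mode::Initial;
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < input.size()) {
        if (mode == Mode::Initial && input.compare(i, 2, "<?") == 0) {
            mode = Mode::Php;      // like BEGIN(PHP)
            i += 2;
        } else if (mode == Mode::Php && input.compare(i, 2, "?>") == 0) {
            mode = Mode::Initial;  // like BEGIN(0)
            i += 2;
        } else if (std::isalpha(static_cast<unsigned char>(input[i]))) {
            // A run of letters becomes one token; its kind depends on the mode.
            std::size_t start = i;
            while (i < input.size() &&
                   std::isalpha(static_cast<unsigned char>(input[i]))) {
                ++i;
            }
            TokenKind kind = (mode == Mode::Php) ? TokenKind::PhpToken
                                                 : TokenKind::HtmlToken;
            tokens.push_back({kind, input.substr(start, i - start)});
        } else {
            ++i;  // skip anything else (flex's default rule would echo it)
        }
    }
    return tokens;
}
```

The same letters produce a different token kind depending on the current mode, which is the effect the state-prefixed rules achieve in flex.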
#2
2
PHP doesn't use a separate scanner for the markup. It simply writes to the output buffer while in markup mode and switches to parsing when in code mode. You don't need a two-pass scanner; you can do this with a single flex lexer.
If you are interested in how PHP itself works, download the source (try the PHP4 source; it is a lot easier to understand). What you want to look at is in the Zend directory: zend_language_scanner.l.
Having written something similar myself, I would really recommend reconsidering the Flex and Bison route and going with something more modern like ANTLR. It is a lot easier to use and to understand (the macros employed in a lex grammar get very confusing and hard to read), and it has a built-in debugger (ANTLRWorks), so you don't have to spend hours looking at 3 MB debug files. It also supports many target languages (Java, C#, C, Python, ActionScript) and has an excellent book and a very good website that should get you up and running in no time.