I am attempting to write an interpreted programming language which will read in files and output a bytecode-like format which can then be executed by a virtual machine.
My original plan was:
- Begin by loading the contents of the file into the memory of my interpreter,
- read each line (where a word is defined as being a series of letters followed by a space or escape character),
- using regular expressions, determine the parameters of the word (for example, IF myvalue = 5 THEN will become IF, myvalue, 5),
- convert each of these words to the byte form (i.e. IF would become 0x09),
- execute these bytes one by one (as the executor will understand that IF, or 0x09, is followed by two bytes).
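For illustration, here is a minimal sketch (in plain C) of the dispatch loop I have in mind for the last step. Only the 0x09 opcode for IF comes from my example above; the other opcode, the operand meanings, and the skip behaviour are placeholders.
#include <stdio.h>
/* Hypothetical opcodes; only 0x09 (IF) is from the example above. */
enum { OP_HALT = 0x00, OP_IF = 0x09 };
/* Execute a byte stream in which IF is followed by two operand bytes:
   assumed here to be a variable slot and a constant to compare against. */
void run(const unsigned char *code, size_t len, int *vars){
    size_t pc = 0;
    while (pc < len){
        switch (code[pc++]){
        case OP_IF: {
            unsigned char slot  = code[pc++]; /* first operand byte  */
            unsigned char value = code[pc++]; /* second operand byte */
            if (vars[slot] != value)
                pc += 1; /* placeholder: skip the next byte on false */
            break;
        }
        case OP_HALT:
            return;
        default:
            fprintf(stderr, "bad opcode 0x%02x\n", code[pc - 1]);
            return;
        }
    }
}
So the stream {0x09, 0x00, 0x05} would test variable slot 0 against 5, as in IF myvalue = 5 THEN.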
I have been told that regular expressions are a terrible way to go about doing this, but I'm unsure whether this is a good or bad way of implementing an interpreted language.
This is mainly just for experience, so I don't mind if it isn't exactly performance-friendly, though that would be a benefit.
What is the best way of implementing my interpreter, and are there any examples (written in plain old C)?
4 Answers
#1 (score: 5)
The reason why people tell you that regular expressions aren't the best idea for a case like this is because regular expressions take more time to evaluate, and the regular expression language has many limitations and quirks that make it unsuitable for many applications.
You should be aware that this is just a knee-jerk reaction that many programmers have (including myself), regardless of whether or not regular expressions would actually be suitable for the application. This stems from people trying to do too much with regular expressions, such as trying to parse HTML.
Many compilers use a basic single-pass tokenization algorithm. The tokenizer has a very basic idea of what can be used as a delimiter, how constants and identifiers should be treated, etc. The tokenizer will then just iterate through the input quickly and emit a string of tokens that can then be easily parsed.
For a basic application, such as parsing tokens, the relatively minor penalty from using regular expressions isn't something to worry about. However as I said, there are sometimes peculiarities with how regular expressions work that might limit the things that your tokenizer can do, though that work can usually be offloaded to a later point in the compiler pipeline. Everything that a tokenizer is expected to do should be representable by standard regular expressions though.
It should be noted that when you involve regular expressions directly in your code, things can get a little hairy. You have to determine how the regular expressions should be applied to the input, where input will be delineated, and so on. You'll also incur the penalty of compiling and evaluating the regular expressions.
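To make that glue logic concrete, here is a minimal sketch using POSIX <regex.h> (an assumption; no particular regex library was named). Note the separate compile step, which is the compilation penalty mentioned above, and that the caller still has to decide where the input is delineated and how far to advance.
#include <regex.h>
#include <stdio.h>
int main(void){
    regex_t ident;
    /* Compile once up front; this is the compilation cost. */
    if (regcomp(&ident, "^[A-Za-z_][A-Za-z_0-9]*", REG_EXTENDED) != 0)
        return 1;
    const char *input = "myvalue = 5";
    regmatch_t m;
    /* Evaluate at the current position; the '^' anchor keeps the
       match at the front, and rm_eo tells us how far to advance. */
    if (regexec(&ident, input, 1, &m, 0) == 0)
        printf("identifier: %.*s\n", (int)(m.rm_eo - m.rm_so), input + m.rm_so);
    regfree(&ident);
    return 0;
}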
There are some projects, such as lex, which make use of regular expressions because they provide a simple, concise way to describe a grammar, which the tool can then make use of internally in whatever representation it chooses. They will also handle all of the glue logic for you; you merely have to describe the grammar that it should use via regular expressions.
When such a generator is used, it can change any regular expression into code that represents what the expression actually means. If it sees the expression [0-9], it can just replace that with a call to isdigit, an equivalent switch statement, or some other representation. This makes the generated code much more efficient than any inline use of regular expressions could achieve.
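As an illustrative sketch (not actual generator output): where such a tool sees [0-9]+, it might emit something like this loop instead of ever evaluating a regular expression at run time.
#include <ctype.h>
#include <stddef.h>
/* Hand-coded equivalent of the regular expression [0-9]+ :
   returns the length of the digit run at the start of s. */
size_t scan_digits(const char *s){
    size_t n = 0;
    while (isdigit((unsigned char)s[n]))
        n++;
    return n;
}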
So my thought is this: if you want to use regular expressions to parse your language, go all the way and create a scanner description for flex/lex to generate the tokenizer for you. However, if you're actually writing it entirely yourself, you'd be better off with a logically simpler approach like the one I described.
I thought it would be fun to write up a sample tokenizer that doesn't use regular expressions, so here it is. I wrote it in C-like C++. The only C++ feature I used was the standard vector and string, but I did it in such a way that you could easily drop in C variants.
#include <vector>
#include <string>
#include <ctype.h>
#include <string.h> //for memcpy, strlen, strncmp
#include <stdio.h>  //for printf
#include <stdlib.h> //for atoll, atof
typedef std::vector<std::string> string_list;
typedef std::vector<long long > int_list;
typedef std::vector<long double> float_list;
std::string substr(const char* value, size_t length){
std::string v;
v.resize(length);
memcpy(&v[0], value, length * sizeof(char));
return v;
}
long long string_to_int(const char* value, size_t length){
return atoll(substr(value, length).c_str());
}
long double string_to_float(const char* value, size_t length){
return atof(substr(value, length).c_str());
}
void int_list_add(int_list& list, long long value){
list.push_back(value);
}
void string_list_add(string_list& list, const char* value, size_t length){
list.push_back(substr(value, length));
}
void float_list_add(float_list& list, long double value){
list.push_back(value);
}
size_t int_list_last(int_list& list){
return list.size();
}
size_t string_list_last(string_list& list){
return list.size();
}
size_t float_list_last(float_list& list){
return list.size();
}
typedef struct{
string_list identifiers;
string_list constants_string;
int_list constants_int;
float_list constants_float;
size_t id;
} *state, state_value;
state tok_state_create(){
state ret = new state_value;
ret->id = 0;
return ret;
}
void tok_state_destroy(state t_state){
delete t_state;
}
const char* tok_state_read_identifier(state t_state, size_t id){
return t_state->identifiers[id - 1].c_str();
}
const char* tok_state_read_string(state t_state, size_t id){
return t_state->constants_string[id - 1].c_str();
}
long long tok_state_read_int(state t_state, size_t id){
return t_state->constants_int[id - 1];
}
long double tok_state_read_float(state t_state, size_t id){
return t_state->constants_float[id - 1];
}
const char* punct_tokens[] = { "Not A Token (Dummy)",
".", ",", "<", "<<", ">", ">>",
";", "+", "-", "/", "*", "!", "%", "^",
"&", "(", ")", "=", "==", "[", "]", "{",
"}", "?", ":", "|", "||", "&&", "~", 0
};
const char* key_tokens[] = { "Not A Token (Dummy)",
"if", "while", "do", "then", "end", 0
};
typedef enum{
TOK_TYPE_INTEGER = 500,
TOK_TYPE_FLOAT,
TOK_TYPE_STRING,
TOK_TYPE_IDENTIFIER,
TOK_TYPE_NONE
} tok_type;
const char* get_token_from_id(size_t id){
if (id < 100){
return punct_tokens[id];
}
if (id < 200){
return key_tokens[id - 100];
}
if (id >= 500){
switch (id){
case TOK_TYPE_INTEGER: return "Integer Constant";
case TOK_TYPE_FLOAT: return "Float Constant ";
case TOK_TYPE_STRING: return "String Constant ";
case TOK_TYPE_IDENTIFIER: return "Identifier ";
case TOK_TYPE_NONE: return "Unknown ";
default:
break;
}
}
return "Not A Token (Dummy)";
}
int is_identifier_char(char c){
if (isalpha(c) || c == '_'){
return 1;
}
return 0;
}
size_t read_punct_token(const char* input, size_t size){
size_t max_len = 0;
size_t token_id = 0;
for (size_t i = 1; punct_tokens[i] != 0; ++i){
size_t len = strlen(punct_tokens[i]);
if (len > max_len && len <= size && strncmp(punct_tokens[i], input, len) == 0){
max_len = len;
if (i == 1 && size > 1 && isdigit(input[1])){
return 0; //Special case for floats
}
token_id = i;
}
}
return token_id;
}
size_t read_key_token(const char* input, size_t size){
size_t max_len = 0;
size_t token_id = 0;
for (size_t i = 1; key_tokens[i] != 0; ++i){
size_t len = strlen(key_tokens[i]);
//Also require that the keyword is not just the prefix of a longer
//identifier (e.g. "iffy"), by checking the following character.
if (len > max_len && len <= size && strncmp(key_tokens[i], input, len) == 0
&& (len == size || !is_identifier_char(input[len]))){
max_len = len;
token_id = i + 100;
}
}
return token_id;
}
size_t is_punct_token_char(char c){
for (size_t i = 1; punct_tokens[i] != 0; ++i){
if (punct_tokens[i][0] == c){
return 1;
}
}
return 0;
}
void add_token(state t_state, tok_type type, const char* string, size_t length){
switch (type){
case TOK_TYPE_INTEGER:
int_list_add(t_state->constants_int, string_to_int(string, length));
t_state->id = int_list_last(t_state->constants_int);
break;
case TOK_TYPE_FLOAT:
float_list_add(t_state->constants_float, string_to_float(string, length));
t_state->id = float_list_last(t_state->constants_float);
break;
case TOK_TYPE_STRING:
string_list_add(t_state->constants_string, string, length);
t_state->id = string_list_last(t_state->constants_string);
break;
case TOK_TYPE_IDENTIFIER:
string_list_add(t_state->identifiers, string, length);
t_state->id = string_list_last(t_state->identifiers);
break;
default:
//Do some error here
break;
}
}
size_t get_token(state t_state, char** input, size_t *size){
if (t_state->id != 0){
size_t id = t_state->id;
t_state->id = 0;
return id;
}
char* base = *input;
size_t padding = 0;
size_t length = 0;
tok_type type = TOK_TYPE_NONE;
while (*size > 0){
if (isspace(*base)){
base++;
(*size)--;
}
else{
break;
}
}
size_t tok = read_punct_token(base, *size);
if (tok){
size_t len = strlen(get_token_from_id(tok));
*input = base + len;
*size -= len;
return tok;
}
tok = read_key_token(base, *size);
if (tok){
size_t len = strlen(get_token_from_id(tok));
*input = base + len;
*size -= len;
return tok;
}
while (*size - length > 0){
if (length == 0 && type == TOK_TYPE_NONE){
if (is_identifier_char(*base)){
type = TOK_TYPE_IDENTIFIER;
length++;
}
else if (*base == '"'){
type = TOK_TYPE_STRING;
padding = 1;
base++;
(*size)--;
}
else if (*base == '.' && *size > 1 && isdigit(base[1])){
type = TOK_TYPE_FLOAT;
}
else if (isdigit(*base)){
type = TOK_TYPE_INTEGER;
}
else if (is_punct_token_char(*base)){
tok = read_punct_token(base, *size);
if (tok){
size_t len = strlen(punct_tokens[tok]);
*input += len;
*size -= len;
return tok;
}
else{
//do error
}
}
}
else{
if (!isspace(base[length]) || type == TOK_TYPE_STRING){
switch (type){
case TOK_TYPE_INTEGER:
if (isdigit(base[length])){
length++;
continue;
}
else if (base[length] == '.' || tolower(base[length]) == 'e'){
type = TOK_TYPE_FLOAT;
length++;
continue;
}
break;
case TOK_TYPE_FLOAT:
if (isdigit(base[length]) || base[length] == '.' || base[length] == 'e'){
length++;
continue;
}
break;
case TOK_TYPE_STRING:
if (base[length] != '"'){
length++;
continue;
}
break;
case TOK_TYPE_IDENTIFIER:
if (is_identifier_char(base[length])){
length++;
continue;
}
break;
default:
break;
}
}
//We only get here if this is a space or any of the switch cases didn't continue.
add_token(t_state, type, base, length);
*input = base + length + padding;
*size -= length + padding;
return type;
}
}
if (type != TOK_TYPE_NONE){
//Flush a token that runs to the very end of the input; without this,
//a trailing identifier or constant would be silently dropped.
add_token(t_state, type, base, length);
*input = base + length + padding;
*size -= length + padding;
return type;
}
*input = base + length + padding;
*size -= length + padding;
return 0;
}
int main(){
const char* input = "if(1+1==4)then print\"hi!\";end";
state s = tok_state_create();
size_t size = strlen(input);
size_t token;
size_t token_prev = 0;
printf("Token\tMeaning\n\n");
while ((token = get_token(s, (char**)&input, &size)) != 0){
if (token_prev < 500){
if (token < 500){
printf("%d\t%s\n", token, get_token_from_id(token));
}
else{
printf("%d\t%s #", token, get_token_from_id(token));
}
}
else{
printf("%d\t", token);
switch (token_prev){
case TOK_TYPE_IDENTIFIER: printf("%s\n", tok_state_read_identifier(s, token)); break;
case TOK_TYPE_STRING: printf("%s\n", tok_state_read_string(s, token)); break;
case TOK_TYPE_INTEGER: printf("%lld\n", tok_state_read_int(s, token)); break;
case TOK_TYPE_FLOAT: printf("%Lf\n", tok_state_read_float(s, token)); break;
}
}
token_prev = token;
}
tok_state_destroy(s);
}
This will print:
Token Meaning
101 if
16 (
500 Integer Constant #1 1
8 +
500 Integer Constant #2 1
19 ==
500 Integer Constant #3 4
17 )
104 then
503 Identifier #1 print
502 String Constant #1 hi!
7 ;
105 end
#2 (score: 1)
As others have mentioned, there's nothing wrong with regex. Here's a typical path:
- Parse your source file into tokens
- Generate an abstract syntax tree from your stream of parsed tokens
- Walk your AST to generate your bytecode
Flex/bison are great tools for steps 1 and 2.
You will probably also need some sort of symbol table for variable definitions.
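As a hint of what that might look like, here is a minimal linear-scan symbol table in C; a real one would probably hash names and track scopes, but the interface (name in, slot out) is the essential part. The sizes and names here are arbitrary.
#include <string.h>
#define MAX_SYMS 256
typedef struct{
    const char *names[MAX_SYMS];
    int count;
} symtab;
/* Return the slot for name, interning it on first use; -1 if full. */
int symtab_slot(symtab *t, const char *name){
    for (int i = 0; i < t->count; i++)
        if (strcmp(t->names[i], name) == 0)
            return i;
    if (t->count == MAX_SYMS)
        return -1;
    t->names[t->count] = name; /* caller must keep the string alive */
    return t->count++;
}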
Many compiler design courses end up implementing a compiler for some subset of the C language; you could try looking at open-courseware sites for an actual class on this.
#3 (score: 0)
As the other answers have said, there's nothing wrong with regular expressions per se. But more than this, regular expressions are the natural representation of character-level syntax.
All the major (and many, if not most, minor) languages use regular expressions to define the appearance of the different input tokens, at least as a formalism in designing the language.
But there may indeed be a performance penalty for certain tricks provided by regex libraries. This is because features like backreferences cannot be handled by the finite automaton that a pure regular expression compiles to; they force the engine into a more general, slower matching strategy.
Regular expressions (without fancy extras like backreferences) can be transformed into a finite automaton or state machine, and that can be implemented with a much simpler (i.e. faster) controlling function.
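A sketch of what that transformation yields: a hand-built state machine equivalent to the regular expression [0-9]+(\.[0-9]+)? (the pattern is just an example). The controlling function is one table lookup per character, with no backtracking.
#include <ctype.h>
#include <stddef.h>
/* States for [0-9]+(\.[0-9]+)? : 0=start, 1=int, 2=dot, 3=frac; -1=dead. */
static const int next_state[4][2] = {
    /*           digit  '.' */
    /* start */ {  1,   -1 },
    /* int   */ {  1,    2 },
    /* dot   */ {  3,   -1 },
    /* frac  */ {  3,   -1 },
};
/* Returns the length of the longest number at the start of s, or 0. */
size_t match_number(const char *s){
    int state = 0;
    size_t i = 0, last_accept = 0;
    while (s[i]){
        int cls = isdigit((unsigned char)s[i]) ? 0 : (s[i] == '.' ? 1 : -1);
        if (cls < 0 || (state = next_state[state][cls]) < 0)
            break;
        i++;
        if (state == 1 || state == 3) /* accepting states */
            last_accept = i;
    }
    return last_accept;
}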
At the very least, for efficiency, you should try to pre-compile your language's defining expressions and use the compiled matcher objects instead of constructing a new one dynamically each time. If your regex library doesn't offer this, it may be time to do some library shopping, or to read up on automata theory.
#4 (score: 0)
Any interpreter or compiler of a sophisticated language ("has expressions with (...) or compound nested statements like do-while, if-then-else") requires that you build a parser to extract the (usually recursive) structure of the code.
You've gotten lots of answers here saying "regular expressions are good for tokens". Yes, in many classically built compilers and interpreters, people write regular expressions to describe the shape of individual tokens (identifiers, keywords, numeric and string constants, comments) and hand them to a lexer generator (like Flex or many others) that combines these into an efficient finite-state machine. This works because the "syntax" of tokens is almost always pretty simple. That very simplicity means that you can hand-code a token lexer yourself and produce practical results for small to medium-size languages. (At some point, e.g. COBOL, the sheer size of the language starts to overwhelm you, and it is hard to avoid using a lexer generator if you want to stay sane.)
What hasn't been discussed is real parsing, discovering structure assuming we already have tokens built somehow. Using regex for tokens IS NOT PARSING. And regexes cannot be used for parsing; they cannot recognize nested structures, which is what is needed for those "sophisticated" constructs. People unfamiliar with parsing make this mistake repeatedly.
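A one-function illustration of why: checking that parentheses balance requires unbounded counting, which is exactly the capability a finite-state regex lacks.
/* Returns 1 if the parentheses in s are balanced, 0 otherwise.
   The depth counter can grow without bound -- no finite number of
   regex states can track it. */
int balanced(const char *s){
    int depth = 0;
    for (; *s; s++){
        if (*s == '(') depth++;
        else if (*s == ')' && --depth < 0) return 0;
    }
    return depth == 0;
}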
If you want to succeed, you need to learn how to build a parser.
For complex languages, there are parser generators (YACC, JavaCC, many others), analogous to lexer generators, that will take a BNF grammar and generate a parser for you. If you want to write a compiler with such tools, you typically attach parser actions to grammar-rule recognition points, often to build a tree for later processing.
You can also hand-code a parser for modest-size languages. This is a set of mutually recursive procedures, one for each grammar rule, that analyze the code (using recursion to handle nested constructs). You can also attach parse actions to the procedures. Since these procedures recognize grammar constructs, this is essentially the same trick as you would apply if you were using a parser generator. This scheme can be trivially extended to handle the "parsing" (lexing) of tokens. If you go down this path completely, there are no regular expressions, anywhere.
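Here is a compressed sketch of that mutual-recursion scheme for a toy grammar (expr -> term (('+'|'-') term)*, term -> NUMBER | '(' expr ')'). The grammar and the cursor-based "lexer" are assumptions for illustration, and error handling is omitted; note how recursion handles the nesting.
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>
static const char *p; /* cursor into the source text */
static long expr(void); /* forward declaration: the rules are mutually recursive */
static void skip_ws(void){ while (isspace((unsigned char)*p)) p++; }
/* term -> NUMBER | '(' expr ')' */
static long term(void){
    skip_ws();
    if (*p == '('){ /* recursing into expr handles nested constructs */
        p++;
        long v = expr();
        skip_ws();
        if (*p == ')') p++; /* error handling omitted */
        return v;
    }
    return strtol(p, (char **)&p, 10);
}
/* expr -> term (('+'|'-') term)* */
static long expr(void){
    long v = term();
    for (;;){
        skip_ws();
        if (*p == '+'){ p++; v += term(); }
        else if (*p == '-'){ p++; v -= term(); }
        else return v;
    }
}
int main(void){
    p = "1 + (2 - 3) + 40";
    printf("%ld\n", expr()); /* prints 40 */
    return 0;
}
Because the procedures compute values as they recognize constructs, this is also the smallest example of the attach-actions-to-rules idea: swap the arithmetic for tree-building (or byte-code emission) and the structure stays the same.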
It is possible to build an interpreter by executing interpreter-actions in the grammar-rule recognition places/in the hand-coded recursive parsing procedures. It won't run very fast because it will continually reparse the source code.
It is better to build an (abstract) syntax tree (AST) representing the program, and then build an interpreter that walks the AST to execute it. The AST is constructed by attaching tree-building actions as parser actions.
If you want to produce byte-code, then you have a classic code generation problem. These are generally best solved by building an AST, walking the AST almost as an interpreter would, and spitting out byte codes that are intended to achieve the purpose of the interpreter at each point. You can build an on-the-fly code generator by producing byte codes in the parser actions. It is harder to generate "good" code this way, because the code generator cannot see enough context to handle special cases well.
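A sketch of that AST walk: each node emits code for its children first, then a byte (or a few) for itself, so a stack machine finds its operands already pushed. The node layout and opcode values are invented for illustration.
#include <stddef.h>
/* Hypothetical stack-machine opcodes. */
enum { OP_PUSH = 0x01, OP_ADD = 0x02, OP_MUL = 0x03 };
typedef struct node{
    int op;    /* OP_PUSH uses value; the others use left/right */
    int value;
    struct node *left, *right;
} node;
/* Post-order emission into out; *len tracks bytes written. */
void emit(const node *n, unsigned char *out, size_t *len){
    if (n->op == OP_PUSH){
        out[(*len)++] = OP_PUSH;
        out[(*len)++] = (unsigned char)n->value; /* one-byte operand */
        return;
    }
    emit(n->left, out, len);  /* operands first... */
    emit(n->right, out, len);
    out[(*len)++] = (unsigned char)n->op; /* ...then the operator */
}
For 1 + 2 * 3 this produces PUSH 1, PUSH 2, PUSH 3, MUL, ADD.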
You can fumble your way through all of this. Your best approach is to get a compiler book, or take a compiler class. Then all of this will be much clearer. Certainly if you fumble your way through this, you will have a much better appreciation of the kinds of machinery the compiler people use to do this. (Google my essay on "Life After Parsing").