When should I use a parser?

Time: 2022-09-10 23:52:31

I have had problems using regexes to divide code up into functional components. They can break, or they can take a long time to finish. The experience raises a question:

"When should I use a parser?"

8 solutions

#1


9  

You should use a parser when you are interested in the lexical or semantic meaning of text, when patterns can vary. Parsers are generally overkill when you are simply looking to match or replace patterns of characters, regardless of their functional meaning.

In your case, you seem to be interested in the meaning behind the text ("functional components" of code), so a parser would be the better choice. Parsers can, however, internally make use of regex, so they should not be regarded as mutually exclusive.

A "parser" does not automatically mean it has to be complicated, however. For example, if you are interested in C code blocks, you could simply parse nested groups of { and }. This parser would only be interested in two tokens ('{' and '}') and the blocks of text between them.

However, a simple regex comparison is not sufficient here because of the nested semantics. Take the following code:

void Foo(bool Bar)
{
    if(Bar)
    {
        f();
    }
    else
    {
        g();
    }
}

A parser will understand the overall scope of Foo, as well as each inner scope contained within Foo (the if and else blocks). As it encounters each '{' token, it "understands" its meaning. A simple search, however, does not understand the meaning behind the text and may interpret the following as a block, which we of course know is not correct:

{
    if(Bar)
    {
        f();
    }
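
To make the two-token parser described above concrete, here is a minimal sketch in Python. The function name `find_blocks` and the `CODE` constant are made up for illustration, and the sketch assumes that no braces appear inside string literals or comments, which a real C parser would have to handle.

```python
def find_blocks(source):
    """Return a sorted list of (start, end, depth) for every balanced {...} block."""
    stack = []   # offsets of '{' tokens that have not been closed yet
    blocks = []
    for i, ch in enumerate(source):
        if ch == "{":
            stack.append(i)
        elif ch == "}":
            if not stack:
                raise ValueError(f"unmatched '}}' at offset {i}")
            start = stack.pop()
            # len(stack) is the nesting depth of the block that just closed
            blocks.append((start, i, len(stack)))
    if stack:
        raise ValueError(f"unmatched '{{' at offset {stack[-1]}")
    return sorted(blocks)


CODE = """void Foo(bool Bar)
{
    if(Bar)
    {
        f();
    }
    else
    {
        g();
    }
}
"""

if __name__ == "__main__":
    for start, end, depth in find_blocks(CODE):
        print(depth, repr(CODE[start:end + 1]))
```

Because every '}' is paired with the most recent open '{', the body of Foo is reported as one outer block containing two inner blocks, and the truncated fragment shown above is rejected as unbalanced rather than accepted as a block.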

#2


3  

You need a parser when:

  1. the language is not regular (Wikipedia)
  2. you need a parse tree (more generally, when you need to execute actions contextually)
  3. the resulting regular expression is too obscure or complex

My 2 cents.

#3


2  

There are a few compelling use cases for parsers over regular expressions. You should use a parser instead of a regular expression:

  • Whenever the kinds of expressions you'd like to work with are more complex than a few semantic entities (tags, variables, phone numbers, etc.).
  • Whenever you need to know the semantic meaning of text instead of merely matching a pattern. For example, if you're trying to match all possible ways of writing a phone number, a parser is probably better than a regex. If you're trying to match a specific pattern that happens to correspond to a phone number, a regex is probably fine.
  • Whenever input can't be guaranteed to be well-formed.
  • If you're working entirely within the structure of a well-defined language that has a syntax specification (C#, XML, C++, Ruby, etc.), there's already going to be a parser, so you have some work done for you.

#4


2  

The Dragon Book has a small section about what you can't use Regular Expressions for:

  • They can't detect repetition of a string, meaning you can't match constructs like 'wcw', where w is the same succession of symbols
  • You can only detect a fixed number of repetitions or an unspecified number of repetitions, which is to say you can't use an already parsed token to determine the number of repetitions, something like: 'n s1 s2 ... sn'
  • "Regular Expressions can't be used to describe balanced or nested constructs, [like] the set of strings of all balanced parentheses"

For the first two points, there's a simple explanation: you can't capture a substring so that you can match it again later. If you could, then you would be using a parser. Just think of how you would use regular expressions for those cases, and you will intuitively come to the conclusion that you can't. :)
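
As an illustration of the 'n s1 s2 ... sn' case, here is a small Python sketch in which a token that has already been read (the count) decides how many items must follow. The function name `parse_counted` and the whitespace-separated input format are assumptions made for this example.

```python
def parse_counted(text):
    """Parse 'n s1 s2 ... sn': a count followed by exactly that many items."""
    tokens = text.split()
    if not tokens or not tokens[0].isdigit():
        raise ValueError("expected a leading count")
    count = int(tokens[0])   # the token already parsed...
    items = tokens[1:]
    if len(items) != count:  # ...determines how many items are allowed
        raise ValueError(f"expected {count} items, found {len(items)}")
    return items


if __name__ == "__main__":
    print(parse_counted("3 a b c"))
```

The check `len(items) != count` is the step with no counterpart in a plain regular expression: the pattern would need to change depending on text it has already matched.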

For the third point, it's the same as the problem in K&R of parsing string literals. You can't just say a string literal is whatever lies between the first '"' and the second '"', because what happens when there's an escaped quote (\")?
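
A hand-written scanner deals with the escaped quote by keeping track of where it is in the literal. The following Python sketch is only an illustration: `read_string_literal` is a made-up name, and it keeps the character after a backslash verbatim instead of decoding C escape sequences such as \n.

```python
def read_string_literal(text, start=0):
    """Return (body, index_after_closing_quote) for a literal beginning at text[start]."""
    if start >= len(text) or text[start] != '"':
        raise ValueError("expected an opening double quote")
    chars = []
    i = start + 1
    while i < len(text):
        ch = text[i]
        if ch == "\\":
            if i + 1 >= len(text):
                break                    # backslash at end of input: unterminated
            chars.append(text[i + 1])    # escaped character is kept as-is
            i += 2
        elif ch == '"':
            return "".join(chars), i + 1  # an unescaped quote closes the literal
        else:
            chars.append(ch)
            i += 1
    raise ValueError("unterminated string literal")


if __name__ == "__main__":
    print(read_string_literal(r'"say \"hi\"" rest'))
```

The scanner only treats a '"' as the end of the literal when it was not preceded by an unconsumed backslash, which is exactly the distinction the "first quote to second quote" rule misses.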

As for the relation to Russell's paradox, I think your hunch is right, because the problem is regex's limited introspection capabilities. The book has references to the proofs. If you want, I can look them up for you.

#5


1  

You need to use a parser as soon as you have a problem that regular expressions are not meant to (or simply can't) solve. Matching balanced or unbalanced parentheses recursively, for instance, is one of those problems. Even though some flavours, like PCRE, get you very far, they don't beat a hand-written parser.

#6


1  

Here are some use cases, courtesy of Steve Yegge: Rich Programmer Food.

#7


0  

Your question is a bit vague, but I guess my opinion is that when your regex becomes complicated or takes too long, and you have a reasonably defined "language" to deal with, a parser will be easier.

I don't think you can draw a line in the sand and say that anything on one side can be done by regex, and on the other side you need a parser. It depends on the situation.

#8


0  

There are things that a regex cannot do but a parser can. For example:

Start ::= (Inner);
Inner ::= Start | x;

A regular expression can't do that because it can't track whether there are the same number of opening and closing parentheses. That is why, when you need to tokenize and parse a large file, you are expected to use a parser, while a regex can simply find specific patterns inside the file.
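
For comparison, a recursive-descent recognizer for the two-rule grammar above fits in a few lines of Python. This is a sketch under assumptions: the name `matches_start` is invented, the semicolons in the grammar are read as rule terminators rather than input characters, and whitespace is not allowed in the input.

```python
def matches_start(text):
    """Return True if text can be derived from Start ::= "(" Inner ")"."""
    pos = 0

    def expect(ch):
        nonlocal pos
        if pos >= len(text) or text[pos] != ch:
            raise ValueError(f"expected {ch!r} at offset {pos}")
        pos += 1

    def start():
        # Start ::= ( Inner )
        expect("(")
        inner()
        expect(")")

    def inner():
        # Inner ::= Start | x
        if pos < len(text) and text[pos] == "(":
            start()
        else:
            expect("x")

    try:
        start()
    except ValueError:
        return False
    return pos == len(text)   # reject trailing input


if __name__ == "__main__":
    for sample in ["(x)", "((x))", "((x)", "(x))"]:
        print(sample, matches_start(sample))
```

Each function mirrors one grammar rule, and the call stack does the counting of open parentheses that the regular expression has no way to do.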
