What does "two-level regular expressions" mean?

Time: 2022-10-12 15:47:21

I understand basic regular expressions, but I'm unsure what the quote below means (it is about how to implement a wiki parser). Could anyone provide some pseudocode to enlighten me?

Two-level regular expressions

This is a very popular approach. It's pretty fast, as it scans the raw text exactly two times.

The idea is to create two kinds of regular expressions: one to split the text into blocks of different kinds (paragraphs, headings, lists, preformatted blocks, etc.), and then a different character-level regular expression to process each kind of block.

Quote from: http://www.wikicreole.org/wiki/CommonWikiParsingTechniques

2 answers

#1


5  

It means not trying to accomplish multiple tasks in a single Regex, but to split it into two tasks (two levels); splitting first, then handling each token separately.

My opinion is that people often unnecessarily try to make a single Regex do too much at once, instead of making things much simpler by splitting the work into separate tasks like this.
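Since the question asked for pseudocode, here is a rough Python sketch of how the two levels could fit together for a wiki parser. It is only an illustration: the markup it accepts (`=` headings, `* ` list items, `{{{ ... }}}` preformatted blocks, `**bold**`, `//italic//`) is loosely Creole-like and is my assumption, and the names `BLOCK_RE`, `INLINE_RE`, `render_inline` and `render` are made up for this example.

```python
import re
from html import escape

# Level 1: a single block-level regex that splits the raw text into blocks.
# The markup handled here is an assumption made for this sketch.
BLOCK_RE = re.compile(
    r"^(?P<marks>=+)[ \t]*(?P<title>.+)$"              # heading line
    r"|^\{\{\{\n(?P<pre>[\s\S]*?)\n\}\}\}$"            # preformatted block
    r"|(?P<list>(?:^\*[ \t]+.+\n?)+)"                  # run of "* item" lines
    r"|(?P<para>(?:^(?!=|\*[ \t]|\{\{\{).+\n?)+)",     # everything else
    re.MULTILINE,
)

# Level 2: a character-level regex applied inside a block.
INLINE_RE = re.compile(r"\*\*(?P<bold>.+?)\*\*|//(?P<italic>.+?)//")


def render_inline(text):
    """Escape HTML, then convert **bold** and //italic// spans."""
    def repl(m):
        if m.group("bold") is not None:
            return f"<strong>{m.group('bold')}</strong>"
        return f"<em>{m.group('italic')}</em>"
    return INLINE_RE.sub(repl, escape(text))


def render(wiki_text):
    """First pass finds the blocks; second pass handles each block's text."""
    out = []
    for m in BLOCK_RE.finditer(wiki_text):
        if m.group("title") is not None:
            level = len(m.group("marks"))
            title = m.group("title").rstrip("= \t")
            out.append(f"<h{level}>{render_inline(title)}</h{level}>")
        elif m.group("pre") is not None:
            # Preformatted text is escaped but gets no inline processing.
            out.append(f"<pre>{escape(m.group('pre'))}</pre>")
        elif m.group("list") is not None:
            items = [line[1:].strip() for line in m.group("list").splitlines()]
            lis = "".join(f"<li>{render_inline(item)}</li>" for item in items)
            out.append(f"<ul>{lis}</ul>")
        else:
            para = " ".join(m.group("para").split())
            out.append(f"<p>{render_inline(para)}</p>")
    return "\n".join(out)


sample = (
    "= Title\n\n"
    "Some **bold** and //italic// text.\n\n"
    "* one\n* **two**\n\n"
    "{{{\nraw **text**\n}}}\n"
)
print(render(sample))
```

Note how the block level decides which inline rules apply: the `**text**` inside the preformatted block is left alone, which would be awkward to express in one big regex.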

#2


3  

It looks like "two-level regular expressions" is a (slightly ambiguous) term for a technique I've recommended in a few answers here for parsing input that is slightly difficult but still a regular language.

An example is getting all the img src= URLs from an HTML page. It's possible (but rather messy) to do this in a single regular expression. It makes more sense to use one regular expression to get all the <img> tags (capturing the whole tag), then use a different regular expression to get src="http://some-url-here.com" from each match. This makes the code far more readable, and you're still only scanning the text twice.
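A minimal Python sketch of that img example, under a few assumptions: the names `IMG_TAG_RE`, `SRC_RE` and `image_urls` are made up for illustration, only quoted `src` values are handled, and a `>` inside an attribute value would break the first pattern.

```python
import re

# Level 1: capture each whole <img ...> tag.
IMG_TAG_RE = re.compile(r"<img\b[^>]*>", re.IGNORECASE)

# Level 2: pull the quoted src value out of a single tag.
# The lookbehind keeps attributes such as data-src from matching.
SRC_RE = re.compile(
    r"""(?<![\w-])src\s*=\s*(?:"([^"]*)"|'([^']*)')""",
    re.IGNORECASE,
)


def image_urls(html):
    """Return the src value of every <img> tag that has one."""
    urls = []
    for tag in IMG_TAG_RE.findall(html):
        m = SRC_RE.search(tag)
        if m:
            urls.append(m.group(1) if m.group(1) is not None else m.group(2))
    return urls


page = (
    '<p><img class="a" src="http://some-url-here.com/a.png">'
    "<IMG SRC='b.jpg' alt=\"x\"><img alt=\"no source\"></p>"
)
print(image_urls(page))
```

Each pattern stays short enough to read at a glance, and the second one only ever sees the text of a single tag.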
