Why does PLY treat regular expressions differently from Python / re?

Time: 2022-01-23 14:32:08

Some background:

I am writing a parser to retrieve information from sites that use a markup language. Standard libraries such as wikitools, ... do not work for me: I need to be more specific, and adapting them to my needs puts a layer of complexity between me and the problem. With Python and "simple" regexes I had difficulty identifying the dependencies between the different "tokens" of the markup language in a transparent manner, so I ended up at PLY at the end of this journey.

Now it seems that PLY identifies tokens via regex differently from plain Python, but I can't find anything about it. I don't want to move on before I understand how PLY determines the tokens within its lexer (otherwise I would have no control over the logic I depend on and would fail at a later stage).

Here we go:

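import re  # needed for re.UNICODE below
import ply.lex as lex

text = r'--- 123456 ---'
token1 = r'-- .* --'

# PLY requires a module-level tuple of token names
tokens = (
    'TEST',
)
t_TEST = token1

lexer = lex.lex(reflags=re.UNICODE, debug=1)
lexer.input(text)
for tok in lexer:
    # %-formatting prints the same line under Python 2 and Python 3
    print("%s %s %s %s" % (tok.type, tok.value, tok.lineno, tok.lexpos))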

results in:

lex: tokens   = ('TEST',)
lex: literals = ''
lex: states   = {'INITIAL': 'inclusive'}
lex: Adding rule t_TEST -> '-- .* --' (state 'INITIAL')
lex: ==== MASTER REGEXS FOLLOW ====
lex: state 'INITIAL' : regex[0] = '(?P<t_TEST>-- .* --)'
TEST --- 123456 --- 1 0

The last line is surprising. I would have expected the first and the last - to be missing from --- 123456 --- if the lexer behaved like "search" (and no match at all if it behaved like "match"). This matters because otherwise -- cannot be distinguished from --- (or == from ===), i.e. headlines, enumerations, ... cannot be differentiated.

So why does PLY behave differently from standard Python regex, and how? I couldn't find anything in the documentation or here.

I would guess the problem is my understanding of PLY: the tool has been around for quite some time, so this behavior is presumably intentional. The only somewhat related information I could find deals with different groups but does not explain why the regexes themselves are matched differently. I found nothing in ply-hack either.

Am I overlooking something stupidly simple?

For comparison, here is standard Python regex:

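import re

text = r'--- 123456 ---'
token1 = r'-- .* --'

p = re.compile(token1)

# search() scans the string for the first position where the pattern matches
m = p.search(text)
if m:
    print('Match found:  %s' % m.group())
else:
    print('No match')

# match() only succeeds if the pattern matches at the start of the string
m = p.match(text)
if m:
    print('Match found:  %s' % m.group())
else:
    print('No match')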

gives:

Match found:  -- 123456 --
No match

(As expected: the first is the result of "search", the second of "match".)

My setup: I am working with Spyder; this is the terminal display at startup:

Python 2.7.5+ (default, Sep 19 2013, 13:49:51) 
[GCC 4.8.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.

Imported NumPy 1.7.1, SciPy 0.12.0, Matplotlib 1.2.1
Type "scientific" for more details.

Thanks for your time and help.

1 answer

#1


2  

The answer to "ply lexmatch regular expression has different groups than a usual re" helps here too. In lex.py:

c = re.compile("(?P<%s>%s)" % (fname,f.__doc__), re.VERBOSE | self.reflags)

Notice the VERBOSE flag. It means the re engine ignores whitespace characters in your regexps (unless they are escaped or placed inside a character class). So r'-- .* --' really means r'--.*--', which indeed matches a string like '--- foobar ---' completely. See the documentation of re.VERBOSE for more details.
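
The effect can be reproduced with plain re, without PLY. The sketch below uses the pattern and text from the question; the variable names are only illustrative, and the two "fixed" patterns are one possible way to keep the spaces literal, not something prescribed by PLY.

```python
import re

text = '--- 123456 ---'
pattern = r'-- .* --'

# Plain compile: the spaces are literal, so match() fails at position 0.
plain_match = re.compile(pattern).match(text)

# Compiled with re.VERBOSE, as in the lex.py line quoted above: the
# unescaped spaces are dropped, so the pattern behaves like r'--.*--'.
verbose_match = re.compile(pattern, re.VERBOSE).match(text)

# Two ways to keep a literal space under re.VERBOSE:
# escape it with a backslash, or put it inside a character class.
escaped = re.compile(r'--\ .*\ --', re.VERBOSE)
in_class = re.compile(r'--[ ].*[ ]--', re.VERBOSE)

print(plain_match)                     # None
print(verbose_match.group())           # --- 123456 ---
print(escaped.match(text))             # None
print(escaped.search(text).group())    # -- 123456 --
print(in_class.search(text).group())   # -- 123456 --
```

Writing the token rule with escaped spaces (for example t_TEST = r'--\ .*\ --') should therefore make the spaces count again in PLY. Note that with this input the rule would then no longer match at position 0, just like "match" in the plain re comparison.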
