Some background:
I am writing a parser to retrieve information from sites that use a markup language. Standard libraries such as wikitools, ... do not work for me, as I need to be more specific, and adapting them to my needs puts a layer of complexity between me and the problem. Python + "simple" regex got me into difficulties identifying the dependencies between the different "tokens" of the markup language in a transparent manner - so obviously I needed to arrive at PLY at the end of this journey.
Now it seems that PLY identifies tokens via regex differently than plain Python does - but I can't find anything documented about it. I don't want to move on before I understand how PLY determines the tokens within its lexer (otherwise I would have no control over the logic I am depending on and would fail at a later stage).
Here we go:
import re
import ply.lex as lex

text = r'--- 123456 ---'
token1 = r'-- .* --'

tokens = (
    'TEST',
)
t_TEST = token1

lexer = lex.lex(reflags=re.UNICODE, debug=1)
lexer.input(text)
for tok in lexer:
    print tok.type, tok.value, tok.lineno, tok.lexpos
results in:
lex: tokens = ('TEST',)
lex: literals = ''
lex: states = {'INITIAL': 'inclusive'}
lex: Adding rule t_TEST -> '-- .* --' (state 'INITIAL')
lex: ==== MASTER REGEXS FOLLOW ====
lex: state 'INITIAL' : regex[0] = '(?P<t_TEST>-- .* --)'
TEST --- 123456 --- 1 0
The last line is surprising - I would have expected the first and the last '-' to be missing in '--- 123456 ---' in case it is comparable to "search" (and nothing at all in case it is comparable to "match"). Obviously this is important, as then '--' cannot be distinguished from '---' (or '==' from '==='), i.e. headlines, enumerations, ... cannot be differentiated.
So why does PLY behave differently from standard Python / regex? (And how? - I couldn't find anything about it in the documentation, or here at *.)
I would guess it is more a gap in my understanding of PLY, as the tool has been around for quite some time already, i.e. this behavior is presumably there by intention. The only somewhat related information I could find deals with different groups, but does not explain a different behavior in how the regexes themselves are matched. I found nothing in ply-hack either.
Am I overlooking something stupid simple?
For comparison purposes, here is standard Python / regex:
import re

text = r'--- 123456 ---'
token1 = r'-- .* --'
p = re.compile(token1)

m = p.search(text)
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'

m = p.match(text)
if m:
    print 'Match found: ', m.group()
else:
    print 'No match'
gives:
Match found: -- 123456 --
No match
(as expected: the first is the result of "search", the second of "match")
My settings: I am working with Spyder - this is the terminal display at startup:
Python 2.7.5+ (default, Sep 19 2013, 13:49:51)
[GCC 4.8.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Imported NumPy 1.7.1, SciPy 0.12.0, Matplotlib 1.2.1
Type "scientific" for more details.
Thanks for your time and help.
1 Answer
The answer to ply lexmatch regular expression has different groups than a usual re helps here too. In lex.py:
c = re.compile("(?P<%s>%s)" % (fname,f.__doc__), re.VERBOSE | self.reflags)
Notice the VERBOSE flag. It means the re engine ignores the whitespace characters in your regexps. So r'-- .* --' really means r'--.*--', which indeed matches a string like '--- foobar ---' completely. See the documentation of re.VERBOSE for more details.
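
To see the effect in isolation, here is a minimal sketch using plain re only, reusing the text and pattern from the question; the second compile mimics the (?P<name>...) wrapper and the flags that the lex.py line above applies:

import re

text = r'--- 123456 ---'
token1 = r'-- .* --'

# Plain compile: the space is significant, so only the inner part matches.
plain = re.compile(token1)
print plain.search(text).group()     # -> '-- 123456 --'

# Roughly what the quoted lex.py line builds: the rule wrapped in a named
# group and compiled with re.VERBOSE, which strips the unescaped spaces, so
# the pattern behaves like r'--.*--' and matches the whole string.
ply_style = re.compile("(?P<t_TEST>%s)" % token1, re.VERBOSE | re.UNICODE)
print ply_style.match(text).group()  # -> '--- 123456 ---'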
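
A minimal sketch of one possible workaround, assuming you want the space to stay significant: since the lex.py version quoted above always ORs in re.VERBOSE, protect the space inside the rule itself, e.g. with a character class [ ] (or a backslash escape). The token name TEST and the rest of the setup are taken from the question; the t_error rule is an addition here so the leftover '-' characters are skipped instead of raising an error:

import re
import ply.lex as lex

tokens = ('TEST',)

# '[ ]' (or '\ ') survives re.VERBOSE, so '---' alone no longer satisfies the
# rule, while '-- 123456 --' still does.
t_TEST = r'--[ ].*[ ]--'

def t_error(t):
    # skip characters the single rule does not cover (the outer '-' here)
    t.lexer.skip(1)

lexer = lex.lex(reflags=re.UNICODE)
lexer.input(r'--- 123456 ---')
for tok in lexer:
    print tok.type, tok.value, tok.lexpos   # -> TEST -- 123456 -- 1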