I am using PLY and have noticed a strange discrepancy between the regex match object stored in t.lexer.lexmatch and a match produced by a pattern compiled in the usual way with the re module. The group(x) indices seem to be off by 1.
I have defined a simple lexer to illustrate the behavior I am seeing:
import ply.lex as lex
tokens = ('CHAR',)
def t_CHAR(t):
    r'.'
    t.value = t.lexer.lexmatch
    return t
l = lex.lex()
(I get a warning about t_error but ignore it for now.) Now I feed some input into the lexer and get a token:
l.input('hello')
l.token()
I get a LexToken(CHAR,<_sre.SRE_Match object at 0x100fb1eb8>,1,0). I want to look at the match object:
m = _.value
So now I look at the groups:
m.group() => 'h', as I expect.
m.group(0) => 'h', as I expect.
m.group(1) => 'h', yet I would expect it to not have such a group.
Compare this to creating such a regular expression manually:
import re
p = re.compile(r'.')
m2 = p.match('hello')
This gives different groups:
m2.group() => 'h', as I expect.
m2.group(0) => 'h', as I expect.
m2.group(1) gives IndexError: no such group, as I expect.
Does anyone know why this discrepancy exists?
2 solutions
#1
4
In PLY 3.4, this happens because of how the rule expressions are converted from docstrings into patterns.
Looking at the source really does help - line 746 of lex.py:
c = re.compile("(?P<%s>%s)" % (fname,f.__doc__), re.VERBOSE | self.reflags)
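Each rule's regex is wrapped in a named group, so group(1) exists and, for a single rule like t_CHAR, holds the same text as group(0). I wouldn't recommend relying on something like this between versions - this is just part of the magic of how PLY works.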
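To see the effect without PLY, here is a minimal sketch that uses only the re module and the same wrapping shape as the lex.py line quoted above. The variable names (fname, rule_pattern, wrapped, plain) are made up for illustration; only the "(?P<%s>%s)" format string comes from the PLY source.

```python
import re

# Hypothetical stand-ins for the rule name and docstring of t_CHAR.
fname = "t_CHAR"
rule_pattern = r"."

# Same wrapping shape as the quoted lex.py line: the rule's regex
# is placed inside a named group.
wrapped = re.compile("(?P<%s>%s)" % (fname, rule_pattern), re.VERBOSE)
plain = re.compile(rule_pattern)

m_wrapped = wrapped.match("hello")
m_plain = plain.match("hello")

print(m_wrapped.group(0))         # 'h', the whole match
print(m_wrapped.group(1))         # 'h', the named wrapper group
print(m_wrapped.group("t_CHAR"))  # 'h', same group addressed by name
print(m_wrapped.lastgroup)        # 't_CHAR'

try:
    m_plain.group(1)
except IndexError as exc:
    # The unwrapped pattern has no capturing groups at all.
    print("plain pattern:", exc)
```

So the extra group is the wrapper itself, which would explain why a match taken from the lexer has one more group than the hand-compiled pattern.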
#2
1
It seems to me that the matching group index depends on the position of the token function in the file, as if the groups were accumulated across all the declared token regexes:
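def t_MYTOKEN1(t):
    r'matchit(\w+)'
    # group index observed when this rule is declared first
    t.value = t.lexer.lexmatch.group(1)
    return t

def t_MYTOKEN2(t):
    r'matchit(\w+)'
    # same pattern declared later: the group index shifts
    t.value = t.lexer.lexmatch.group(2)
    return t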
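The accumulation can be reproduced with plain re. This is a sketch under the assumption that the lexer joins the wrapped rule patterns into one alternation; PLY's real construction may differ in detail between versions. The rule names t_FOO and t_BAR, their patterns, and the helper first_inner_group are hypothetical.

```python
import re

# Hypothetical rules: (function name, docstring regex), in declaration order.
rules = [
    ("t_FOO", r"foo(\w+)"),
    ("t_BAR", r"bar(\w+)"),
]

# Assumption: each rule is wrapped in a named group and the results
# are joined with '|' into a single master pattern.
master = re.compile(
    "|".join("(?P<%s>%s)" % (name, pattern) for name, pattern in rules),
    re.VERBOSE,
)

# Group numbers run across the whole master pattern, so the wrapper of
# the second rule is group 3 and its own (\w+) is group 4.
print(dict(master.groupindex))

def first_inner_group(m):
    # m.lastindex is the index of the wrapper group of the rule that
    # matched; the rule's first own group sits right after it.
    return m.group(m.lastindex + 1)

m = master.match("barbaz")
print(m.lastgroup)           # 't_BAR'
print(m.group(2))            # None: belongs to t_FOO, which did not match
print(first_inner_group(m))  # 'baz'
```

Computing the index relative to m.lastindex, or using named groups inside the rule's own regex, avoids hard-coding a number that changes whenever rules are added or reordered.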