Search pattern includes square brackets

Time: 2021-11-20 19:24:48

I am trying to search for exact words in a file. I read the file line by line and loop through the lines to find the exact words. Because the in keyword is not suitable for finding exact words, I am using a regex pattern.

import re

def findWord(w):
    # Returns the bound search method of a case-insensitive, whole-word pattern.
    return re.compile(r'\b({0})\b'.format(w), flags=re.IGNORECASE).search

The problem with this function is that it doesn't recognize square brackets, as in [xyz].

For example

findWord('data_var_cod[0]')('Cod_Byte1 = DATA_VAR_COD[0]') 

returns None, whereas

findWord('data_var_cod')('Cod_Byte1 = DATA_VAR_COD') 

returns <_sre.SRE_Match object at 0x0000000015622288>


Can anybody please help me tweak the regex pattern?

3 Solutions

#1


1  

This happens because the regex engine treats the square brackets as a character class, which is regex syntax. To get rid of this problem you need to escape the regex metacharacters in your keyword. You can use the re.escape function:

def findWord(w):
    return re.compile(r'\b({0})\b'.format(re.escape(w)), flags=re.IGNORECASE).search
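To make the effect of escaping concrete, here is a minimal sketch using the keyword and line from the question. The variable names are only illustrative, and the word boundaries are left out on purpose so that the only difference between the two patterns is the escaping.

```python
import re

keyword = 'data_var_cod[0]'
line = 'Cod_Byte1 = DATA_VAR_COD[0]'

# Unescaped: "[0]" is a character class that matches the single character "0".
unescaped = re.compile(keyword, flags=re.IGNORECASE)
# Escaped: the brackets are matched as literal text.
escaped = re.compile(re.escape(keyword), flags=re.IGNORECASE)

print(unescaped.search(line))                        # None
print(unescaped.search('DATA_VAR_COD0') is not None) # True
print(escaped.search(line).group(0))                 # DATA_VAR_COD[0]
```

The unescaped pattern silently matches a different string (the keyword followed by a literal 0 instead of the bracketed index), which is why the original function returned None for the bracketed input.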

Also, as a more Pythonic way to get all matches, you can use re.findall(), which returns a list of matches, or re.finditer(), which returns an iterator of match objects.
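As a rough illustration of the two calls, the sketch below runs both on a made-up line that contains the keyword twice. The sample text is an assumption for demonstration, not data from the question.

```python
import re

text = 'x = DATA_VAR_COD[0] + data_var_cod[0]'
pattern = re.compile(re.escape('data_var_cod[0]'), flags=re.IGNORECASE)

# findall returns the matched strings as a list (no groups in this pattern).
matches = pattern.findall(text)
print(matches)  # ['DATA_VAR_COD[0]', 'data_var_cod[0]']

# finditer yields match objects, so positions are available as well.
positions = [(m.start(), m.group(0)) for m in pattern.finditer(text)]
print(positions)  # [(4, 'DATA_VAR_COD[0]'), (22, 'data_var_cod[0]')]
```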

But this approach is still not complete, because a word boundary \b only matches between a word character and a non-word character (or the start or end of the string). If your keyword begins or ends with a non-word character such as [ or ], the \b next to it may never match:

>>> ss = 'hello string [processing] in python.'
>>> re.compile(r'\b({0})\b'.format(re.escape('[processing]')), flags=re.IGNORECASE).search(ss)
>>>
>>> re.compile(r'({})'.format(re.escape('[processing]')), flags=re.IGNORECASE).search(ss).group(0)
'[processing]'

So I suggest removing the word boundaries if your keywords begin or end with non-word characters.

But as a more general approach you can use the following regex, which uses a non-capturing group and a positive lookahead to match words that are surrounded by spaces or that come at the start or end of the string:

r'(?: |^)({})(?=[. ]|$)'
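A small sketch of how that pattern might be wired into a helper follows. The function name find_token is hypothetical, and the pattern is used without any trailing space after the lookahead. Note that the leading space is consumed by the non-capturing group, so the keyword itself is read from group 1.

```python
import re

def find_token(w):
    # Keyword must follow a space or the start of the string, and be
    # followed by a space, a dot, or the end of the string.
    pattern = r'(?: |^)({})(?=[. ]|$)'.format(re.escape(w))
    return re.compile(pattern, flags=re.IGNORECASE).search

ss = 'hello string [processing] in python.'
print(find_token('[processing]')(ss).group(1))   # [processing]
print(find_token('python')(ss).group(1))         # python
print(find_token('process')(ss))                 # None
print(find_token('data_var_cod[0]')('Cod_Byte1 = DATA_VAR_COD[0]').group(1))  # DATA_VAR_COD[0]
```

This only treats spaces, a dot, and the string edges as separators; other punctuation around a keyword would need to be added to the pattern.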

#2


1  

That's because [ and ] have special meaning in a regex. You should escape the string you're looking for:

re.escape(regex)

will escape the string for you. Change your code to:

return re.compile(r'\b({0})\b'.format(re.escape(w)), flags=re.IGNORECASE).search
                                      ↑↑↑↑↑↑↑↑↑

You can see what re.escape does for your string, for example:

>>> w = '[xyz]'
>>> print(re.escape(w))
\[xyz\]

#3


0  

You need a "smart" way of building the regex:


def findWord(w):
    # Escape the keyword so that [ ] and other metacharacters match literally.
    esc = re.escape(w)
    # Add \b only on the sides where the keyword has a word character.
    if re.match(r'\w', w) and re.search(r'\w$', w):
        return re.compile(r'\b{0}\b'.format(esc), flags=re.IGNORECASE).search
    if not re.match(r'\w', w) and not re.search(r'\w$', w):
        return re.compile(r'{0}'.format(esc), flags=re.IGNORECASE).search
    if not re.match(r'\w', w) and re.search(r'\w$', w):
        return re.compile(r'{0}\b'.format(esc), flags=re.IGNORECASE).search
    if re.match(r'\w', w) and not re.search(r'\w$', w):
        return re.compile(r'\b{0}'.format(esc), flags=re.IGNORECASE).search

The problem is that some of your keywords will have a word character only at the start, others only at the end, most will have word characters on both ends, and some will have non-word characters on both ends. To check the word boundary effectively, you need to know whether a word character is present at the start and at the end of the keyword.

Thus, with re.match(r'\w', w) we can check whether the keyword starts with a word character and, if it does, add \b to the start of the pattern. With re.search(r'\w$', w) we can check whether the keyword ends with a word character and, if it does, add \b to the end.

If you have multiple keywords to check a string against, you can check this post of mine.
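The linked post is not reproduced here, so the following is only a sketch of one common way to handle several keywords at once, not necessarily the method from that post. It swaps the adaptive \b logic for the lookarounds (?<!\w) and (?!\w), which require that no word character touches either end of the match. The helper name build_keyword_search and the keyword list are hypothetical.

```python
import re

def build_keyword_search(keywords):
    # Longest keywords first, so 'data_var_cod[0]' wins over 'data_var_cod'.
    ordered = sorted(keywords, key=len, reverse=True)
    alternatives = '|'.join(re.escape(k) for k in ordered)
    # No word character may directly precede or follow the match.
    pattern = r'(?<!\w)(?:{0})(?!\w)'.format(alternatives)
    return re.compile(pattern, flags=re.IGNORECASE).findall

find_all = build_keyword_search(['data_var_cod', 'data_var_cod[0]', '[processing]'])
print(find_all('Cod_Byte1 = DATA_VAR_COD[0]'))           # ['DATA_VAR_COD[0]']
print(find_all('hello string [processing] in python.'))  # ['[processing]']
print(find_all('my_data_var_cod = 1'))                   # []
```

One behavioral difference from the adaptive \b version: a bracketed keyword glued to a word character (for example a[processing]b) is rejected here, whereas the adaptive version applies no boundary check on a side that ends with a non-word character.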
