Case-insensitive text query with pymongo

This is how I query data from my MongoDB database using pymongo:

import re

def is_philippine_facebook(self, facebook_user):
    is_philippine = False
    db_server = self.ConfigSectionMap('db_server')
    database_name = db_server['database']
    db = self.client[database_name]
    # collection_name is not defined in this snippet; presumably set elsewhere
    cursor = db[collection_name].find({
        'isPhilippine': True,
        # Unanchored, case-insensitive match on '@' + facebook_user
        'facebook_user': re.compile('@' + facebook_user, re.IGNORECASE)
    })
    for document in cursor:
        if document is not None:
            is_philippine = True
            break
    return is_philippine

I want to find records with a given facebook_user, matching case-insensitively. However, the query returns many incorrect results. For example, if facebook_user is WWF, records with WWF_XYZ are also returned.

How can I fix this? Thanks.

2 Solutions

#1

It sounds like you want a word boundary, \b:

'facebook_user': re.compile('@'+ facebook_user +'\\b', re.IGNORECASE)

So if you supply WWF or wwf then it only matches up to the end of the "word" and not beyond it.
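The effect of the word boundary can be checked locally without a database. This is a minimal sketch using Python's re module only; the helper name build_pattern and the sample strings are made up for illustration, and it assumes the stored values look like "@name".

```python
import re

def build_pattern(facebook_user):
    # '@' + name + word boundary, matched case-insensitively
    return re.compile('@' + facebook_user + r'\b', re.IGNORECASE)

pattern = build_pattern('WWF')

# Hypothetical sample values, not taken from the original data
samples = ['@WWF', '@wwf', '@WWF_XYZ', '@WWFan', 'contact @wwf today']

# Keep only the samples where the pattern is found
matches = [s for s in samples if pattern.search(s)]
print(matches)  # ['@WWF', '@wwf', 'contact @wwf today']
```

"@WWF_XYZ" is rejected because "_" counts as a word character, so there is no boundary between "F" and "_".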

As a note, case-insensitive searches, and searches not anchored with the caret ^ to the beginning of the string, require a full collection scan and are not very efficient.

If you are matching from the beginning of a string, you should use the caret, and you should probably store a case-normalized copy of the value as a document property for searching, so that you do not need the "case insensitive" option either. Both of these are required for an index to be used on a search. See $regex in the documentation.
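One way to follow this advice is sketched below, under stated assumptions: the field name facebook_user_lower and both helper functions are hypothetical, and the stored value is assumed to begin with "@name". The helpers only build plain dictionaries, so they run without a MongoDB server. How much the query planner gains from the index when the pattern ends in \b has not been verified here.

```python
import re

def with_normalized_field(document):
    # Return a copy of the document with a lowercased search field added.
    # 'facebook_user_lower' is a hypothetical field name.
    normalized = dict(document)
    normalized['facebook_user_lower'] = document['facebook_user'].lower()
    return normalized

def build_anchored_query(facebook_user):
    # Anchored with ^ and case-sensitive, searching the lowercased field.
    # re.escape guards against regex metacharacters in the name.
    pattern = '^@' + re.escape(facebook_user.lower()) + r'\b'
    return {'isPhilippine': True, 'facebook_user_lower': {'$regex': pattern}}

doc = with_normalized_field({'facebook_user': '@WWF', 'isPhilippine': True})
query = build_anchored_query('WWF')
print(doc['facebook_user_lower'])
print(query['facebook_user_lower']['$regex'])

# With pymongo, assuming `collection` is an existing Collection object:
#   collection.create_index('facebook_user_lower')
#   found = collection.find_one(query) is not None
```

Using find_one here also avoids looping over a cursor just to test whether a match exists.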

#2

Use the following fix:

re.compile(r'@{0}\b'.format(facebook_user), re.IGNORECASE)

See the regex demo.

Pattern details:

@WWF - a literal @WWF

\b - a word boundary (requires a char other than letter, digit or _, or end of string after @WWF)

If facebook_user may contain special characters, you need to use:

re.compile(r'(?<!\w)@{0}(?!\w)'.format(re.escape(facebook_user)), re.IGNORECASE)

However, the facebook_user seems to only contain word chars, so a word boundary should really suffice in this case.
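For names that do contain special characters, the difference between the two patterns can be seen in a small local check. This is a sketch with made-up names ("c++", "john.doe") and hypothetical helper names; it uses only Python's re module.

```python
import re

def lookaround_pattern(facebook_user):
    # Escaped name, guarded by lookarounds instead of \b
    return re.compile(
        r'(?<!\w)@{0}(?!\w)'.format(re.escape(facebook_user)),
        re.IGNORECASE,
    )

def boundary_pattern(facebook_user):
    # Escaped name followed by a plain word boundary
    return re.compile(r'@{0}\b'.format(re.escape(facebook_user)), re.IGNORECASE)

# A name ending in a non-word character: \b cannot match after '+'
# at the end of the string, but the negative lookahead can.
print(bool(boundary_pattern('c++').search('@c++')))    # False
print(bool(lookaround_pattern('c++').search('@C++')))  # True

# Without re.escape, '.' matches any character
unescaped = re.compile('@john.doe' + r'\b', re.IGNORECASE)
print(bool(unescaped.search('@johnXdoe')))                       # True
print(bool(lookaround_pattern('john.doe').search('@johnXdoe')))  # False
```

For a plain name such as WWF, both patterns reject "@WWF_XYZ", which is why the simpler \b version is enough in the original case.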
