
Time: 2021-06-26 19:21:16

How can I do case insensitive string comparison in Python?


I would like to encapsulate the comparison of regular strings to a repository string in a very simple and Pythonic way. I would also like to be able to look up values in a dict hashed by strings using regular Python strings.


10 answers



Assuming ASCII strings:


string1 = 'Hello'
string2 = 'hello'

if string1.lower() == string2.lower():
    print("The strings are the same (case insensitive)")
else:
    print("The strings are not the same (case insensitive)")



Comparing strings in a case-insensitive way seems like something that's trivial, but it's not. I will be using Python 3, since Python 2 is underdeveloped here.

The first thing to note is that case-removing conversions in Unicode aren't trivial. There is text for which text.lower() != text.upper().lower(), such as "ß":

"ß".lower()
#>>> 'ß'

"ß".upper().lower()
#>>> 'ss'

But let's say you wanted to caselessly compare "BUSSE" and "Buße". Heck, you probably also want to compare "BUSSE" and "BUẞE" equal - that's the newer capital form. The recommended way is to use casefold:


help(str.casefold)
#>>> Help on method_descriptor:
#>>>
#>>> casefold(...)
#>>>     S.casefold() -> str
#>>>
#>>>     Return a version of S suitable for caseless comparisons.
#>>>

Do not just use lower. If casefold is not available, doing .upper().lower() helps (but only somewhat).
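As a quick check of the point above, here is a minimal sketch comparing lower() and casefold() on the three spellings mentioned. The variable names are mine, and the strings use escapes only so the code points are unambiguous.

```python
# The three spellings from the text; "\u00df" is ß and "\u1e9e" is the capital ẞ.
words = ["BUSSE", "Bu\u00dfe", "BU\u1e9eE"]

# lower() keeps ß as it is, so more than one distinct form survives.
lowered = {w.lower() for w in words}

# casefold() maps both ß and ẞ to "ss", so all three collapse to one form.
folded = {w.casefold() for w in words}

print(len(lowered))
print(folded)
```

If lower() were enough, both sets would contain a single element.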


Then you should consider accents. If your font renderer is good, you probably think "ê" == "ê" - but they are not equal:

"ê" == "ê"   # U+00EA on the left; "e" followed by U+0302 on the right
#>>> False

This is because they are actually different sequences of code points:


import unicodedata

[unicodedata.name(char) for char in "ê"]
#>>> ['LATIN SMALL LETTER E WITH CIRCUMFLEX']

[unicodedata.name(char) for char in "ê"]
#>>> ['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']

The simplest way to deal with this is unicodedata.normalize. You probably want to use NFKD normalization, but feel free to check the documentation. Then one does


unicodedata.normalize("NFKD", "ê") == unicodedata.normalize("NFKD", "ê")
#>>> True
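Because the two forms look identical on screen, the same comparison may be easier to follow with explicit escapes. This is just a sketch; the variable names are my own.

```python
import unicodedata

precomposed = "\u00ea"  # LATIN SMALL LETTER E WITH CIRCUMFLEX
combining = "e\u0302"   # "e" followed by COMBINING CIRCUMFLEX ACCENT

# The raw strings differ: one code point versus two.
same_raw = precomposed == combining

# After normalizing both to the same form (decomposed or composed) they agree.
same_nfd = unicodedata.normalize("NFD", precomposed) == unicodedata.normalize("NFD", combining)
same_nfc = unicodedata.normalize("NFC", precomposed) == unicodedata.normalize("NFC", combining)

print(same_raw, same_nfd, same_nfc)
```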

To finish up, here this is expressed as functions:


import unicodedata

def normalize_caseless(text):
    return unicodedata.normalize("NFKD", text.casefold())

def caseless_equal(left, right):
    return normalize_caseless(left) == normalize_caseless(right)
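The question also asks about dict lookups. One possible approach, sketched here under the assumption that all keys are strings, is to normalize keys on the way in and on the way out. CaselessDict is a hypothetical name of mine, not something from the standard library.

```python
import unicodedata


def normalize_caseless(text):
    # Same helper as in the answer: casefold, then NFKD-normalize.
    return unicodedata.normalize("NFKD", text.casefold())


class CaselessDict(dict):
    """Hypothetical helper: a dict that normalizes string keys before storing or looking up."""

    def __setitem__(self, key, value):
        super().__setitem__(normalize_caseless(key), value)

    def __getitem__(self, key):
        return super().__getitem__(normalize_caseless(key))

    def __contains__(self, key):
        return super().__contains__(normalize_caseless(key))


repo = CaselessDict()
repo["Bu\u00dfe"] = 1      # stored under the normalized key
print(repo["BUSSE"])       # looked up through a differently cased spelling
print("busse" in repo)
```

This sketch only overrides item assignment, item access, and membership. Methods such as get(), update(), and the constructor bypass the normalization, so a fuller version would need to cover those too.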



Using Python 2, calling .lower() on each string or Unicode object...


string1.lower() == string2.lower()

...will work most of the time, but indeed doesn't work in the situations @tchrist has described.


Assume we have a file called unicode.txt containing the two strings Σίσυφος and ΣΊΣΥΦΟΣ. With Python 2:


>>> utf8_bytes = open("unicode.txt", 'r').read()
>>> print repr(utf8_bytes)
'\xce\xa3\xce\xaf\xcf\x83\xcf\x85\xcf\x86\xce\xbf\xcf\x82\n\xce\xa3\xce\x8a\xce\xa3\xce\xa5\xce\xa6\xce\x9f\xce\xa3\n'
>>> u = utf8_bytes.decode('utf8')
>>> print u
Σίσυφος
ΣΊΣΥΦΟΣ
>>> first, second = u.splitlines()
>>> print first.lower()
σίσυφος
>>> print second.lower()
σίσυφοσ
>>> first.lower() == second.lower()
False
>>> first.upper() == second.upper()
True

The Σ character has two lowercase forms, ς and σ, and .lower() won't help compare them case-insensitively.


However, as of Python 3, lower() turns a capital Σ at the end of a word into the final form ς, so calling lower() on both strings will work correctly:

>>> s = open('unicode.txt', encoding='utf8').read()
>>> print(s)
Σίσυφος
ΣΊΣΥΦΟΣ
>>> first, second = s.splitlines()
>>> print(first.lower())
σίσυφος
>>> print(second.lower())
σίσυφος
>>> first.lower() == second.lower()
True
>>> first.upper() == second.upper()
True

So if you care about edge-cases like the three sigmas in Greek, use Python 3.
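For what it's worth, casefold() also treats the sigma forms as equal in Python 3, without depending on where the sigma sits in the word. A small sketch, with the two strings from unicode.txt written as escapes (the code points are taken from the byte dump above):

```python
first = "\u03a3\u03af\u03c3\u03c5\u03c6\u03bf\u03c2"   # Σίσυφος
second = "\u03a3\u038a\u03a3\u03a5\u03a6\u039f\u03a3"  # ΣΊΣΥΦΟΣ

# casefold() maps the final sigma to the regular small sigma,
# so both strings end up with the same folded form.
same = first.casefold() == second.casefold()
final_sigma_folded = "\u03c2".casefold()

print(same)
print(final_sigma_folded == "\u03c3")
```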


(For reference, Python 2.7.3 and Python 3.3.0b1 are shown in the interpreter printouts above.)




Section 3.13 of the Unicode standard defines algorithms for caseless matching.


X.casefold() == Y.casefold() in Python 3 implements the "default caseless matching" (D144).


Casefolding does not preserve the normalization of strings in all instances and therefore the normalization needs to be done ('å' vs. 'å'). D145 introduces "canonical caseless matching":


import unicodedata

def NFD(text):
    return unicodedata.normalize('NFD', text)

def canonical_caseless(text):
    return NFD(NFD(text).casefold())

NFD() is called twice for very infrequent edge cases involving the U+0345 character.




>>> 'å'.casefold() == 'å'.casefold()  # U+00E5 vs. 'a' followed by U+030A
False
>>> canonical_caseless('å') == canonical_caseless('å')
True

There is also compatibility caseless matching (D146) for cases such as '㎒' (U+3392), and "identifier caseless matching" to simplify and optimize caseless matching of identifiers.
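Compatibility caseless matching can be sketched in the same style as canonical_caseless. D146 compares NFKD(toCasefold(NFKD(toCasefold(NFD(X))))); the function names below are my own, and this is a rough translation of that definition rather than a vetted implementation.

```python
import unicodedata


def NFD(text):
    return unicodedata.normalize("NFD", text)


def NFKD(text):
    return unicodedata.normalize("NFKD", text)


def compatibility_caseless(text):
    # D146: NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
    return NFKD(NFKD(NFD(text).casefold()).casefold())


square_mhz = "\u3392"  # the single-character '㎒' from the text

# Plain casefolding leaves the square symbol alone, so it does not match "mhz".
plain_match = square_mhz.casefold() == "mhz".casefold()

# With the compatibility decomposition applied, it matches "MHZ" in any case.
compat_match = compatibility_caseless(square_mhz) == compatibility_caseless("MHZ")

print(plain_match, compat_match)
```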




How about converting to lowercase first? You can use string.lower().




I saw this solution here using regex.


import re

if re.search('mandy', 'Mandy Pande', re.IGNORECASE):
    # is True, so this branch runs
    print("match")

It works well with accents:


In [42]: if re.search("ê","ê", re.IGNORECASE):
....:        print(1)
....:
1

However, it doesn't handle case-insensitive matching of Unicode characters such as "ß". Thank you @Rhymoid for pointing that out; my understanding is that the exact symbol is needed for the match to succeed. The output is as follows:


In [36]: "ß".lower()
Out[36]: 'ß'
In [37]: "ß".upper()
Out[37]: 'SS'
In [38]: "ß".upper().lower()
Out[38]: 'ss'
In [39]: if re.search("ß","ßß", re.IGNORECASE):
....:        print(1)
....:
1
In [40]: if re.search("SS","ßß", re.IGNORECASE):
....:        print(1)
....:
In [41]: if re.search("ß","SS", re.IGNORECASE):
....:        print(1)
....:
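If only a plain substring search is needed, one way around this limitation might be to casefold both strings and use the in operator instead of a regex. A minimal sketch; contains_caseless is a hypothetical helper name of mine.

```python
def contains_caseless(needle, haystack):
    # Hypothetical helper: substring test on casefolded copies of both strings.
    return needle.casefold() in haystack.casefold()


print(contains_caseless("SS", "\u00df\u00df"))    # "SS" inside "ßß"
print(contains_caseless("\u00df", "SS"))          # "ß" inside "SS"
print(contains_caseless("mandy", "Mandy Pande"))
```

This gives up regex features entirely, so it only fits cases where the pattern is a literal string.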



The usual approach is to uppercase the strings or lower case them for the lookups and comparisons. For example:


>>> "hello".upper() == "HELLO".upper()
True



def insenStringCompare(s1, s2):
    """ Method that takes two strings and returns True or False, based
        on if they are equal, regardless of case."""
    try:
        return s1.lower() == s2.lower()
    except AttributeError:
        print("Please only pass strings into this method.")
        print("You passed a %s and %s" % (s1.__class__, s2.__class__))



If you have lists of strings and you want to compare the strings in the different lists case-insensitively, here is my solution.


list1 = [each.lower() for each in list1]
list2 = [each.lower() for each in list2]

After doing that, you can make the string comparison easily.
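To make the list case concrete, here is a small sketch with made-up sample data. It uses casefold() rather than lower(), which is my substitution based on the earlier answers, and shows both an order-sensitive comparison and a set intersection.

```python
# Sample lists invented for illustration; "\u00df" is ß.
list1 = ["Apple", "BANANA", "Stra\u00dfe"]
list2 = ["apple", "banana", "STRASSE"]

folded1 = [s.casefold() for s in list1]
folded2 = [s.casefold() for s in list2]

# Element-by-element comparison, order matters.
same_order = folded1 == folded2

# Items the two lists have in common, ignoring case and order.
common = set(folded1) & set(folded2)

print(same_order)
print(sorted(common))
```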




I've used this to accomplish something more useful for comparing two strings:


def strings_iequal(first, second):
    try:
        return first.upper() == second.upper()
    except AttributeError:
        if not first:
            if not second:
                return True

Update: As noted by gerrit, this answer has some bugs. This was years ago and I no longer remember what I used it for. I do recall writing tests, but what good are they now!




