How do I do a case-insensitive string comparison?

Date: 2021-06-26 19:21:16

How can I do case-insensitive string comparison in Python?

I would like to encapsulate the comparison of a regular string to a repository string in a very simple and Pythonic way. I would also like to be able to look up values in a dict keyed by strings using regular Python strings.

10 answers

#1


411  

Assuming ASCII strings:

    string1 = 'Hello'
    string2 = 'hello'

    if string1.lower() == string2.lower():
        print("The strings are the same (case insensitive)")
    else:
        print("The strings are not the same (case insensitive)")

#2


327  

Comparing strings in a case-insensitive way seems trivial, but it's not. I will be using Python 3, since Python 2 is underdeveloped here.

The first thing to note is that case-removing conversions in Unicode aren't trivial. There is text for which text.lower() != text.upper().lower(), such as "ß":

    "ß".lower()
    #>>> 'ß'

    "ß".upper().lower()
    #>>> 'ss'

But let's say you wanted to caselessly compare "BUSSE" and "Buße". Heck, you probably also want "BUSSE" and "BUẞE" to compare equal - that's the newer capital form. The recommended way is to use casefold:

    help(str.casefold)
    #>>> Help on method_descriptor:
    #>>>
    #>>> casefold(...)
    #>>>     S.casefold() -> str
    #>>>
    #>>>     Return a version of S suitable for caseless comparisons.
    #>>>

Do not just use lower(). If casefold is not available, doing .upper().lower() helps (but only somewhat).
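As a small illustration of the difference, the sketch below compares the three spellings mentioned above under lower() and under casefold(). The variable names are made up for the example, and the strings are written with escapes only so the exact code points are unambiguous.

```python
# "\u00df" is ß (LATIN SMALL LETTER SHARP S); "\u1e9e" is ẞ, its capital form.
words = ["BUSSE", "Bu\u00dfe", "BU\u1e9eE"]

lowered = {word.lower() for word in words}
folded = {word.casefold() for word in words}

print(len(lowered))  # lower() leaves more than one distinct form
print(len(folded))   # casefold() maps all three spellings to a single form
```

With lower(), the ß survives, so "BUSSE" still differs from the other two; casefold() expands it to "ss".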

Then you should consider accents. If your font renderer is good, you probably think "ê" == "ê" - but it isn't:

    "ê" == "ê"
    #>>> False

This is because they are actually different sequences of code points:

    import unicodedata

    [unicodedata.name(char) for char in "ê"]
    #>>> ['LATIN SMALL LETTER E WITH CIRCUMFLEX']

    [unicodedata.name(char) for char in "ê"]
    #>>> ['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']

The simplest way to deal with this is unicodedata.normalize. You probably want to use NFKD normalization, but feel free to check the documentation. Then one does:

    unicodedata.normalize("NFKD", "ê") == unicodedata.normalize("NFKD", "ê")
    #>>> True
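To make the choice of normalization form a little more concrete, here is a minimal sketch contrasting the canonical forms (NFC, NFD) with the compatibility form NFKD. The ligature example is my own addition, not part of the answer above.

```python
import unicodedata

precomposed = "\u00ea"   # ê as a single code point
decomposed = "e\u0302"   # e followed by COMBINING CIRCUMFLEX ACCENT
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI

# Canonical forms: NFC composes, NFD decomposes.
nfc_matches = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_matches = unicodedata.normalize("NFD", precomposed) == decomposed

# NFD leaves the ligature alone; NFKD also replaces it with plain "fi".
nfd_splits_ligature = unicodedata.normalize("NFD", ligature) == "fi"
nfkd_splits_ligature = unicodedata.normalize("NFKD", ligature) == "fi"

print(nfc_matches, nfd_matches, nfd_splits_ligature, nfkd_splits_ligature)
```

Whether the extra folding done by NFKD is wanted depends on the application: it makes more strings compare equal, which is usually the point of a loose comparison.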

To finish up, here it is expressed as functions:

    import unicodedata

    def normalize_caseless(text):
        return unicodedata.normalize("NFKD", text.casefold())

    def caseless_equal(left, right):
        return normalize_caseless(left) == normalize_caseless(right)
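The question also asks about looking up values in a dict. One possible approach, sketched here under my own assumptions, is a mapping that normalizes every key on the way in and out. CaselessDict is a hypothetical name, not something from the standard library, and normalize_caseless is repeated so the block stands alone.

```python
import unicodedata
from collections.abc import MutableMapping


def normalize_caseless(text):
    return unicodedata.normalize("NFKD", text.casefold())


class CaselessDict(MutableMapping):
    """Hypothetical dict-like class whose string keys are compared caselessly."""

    def __init__(self, *args, **kwargs):
        # Maps normalized key -> (key as last written, value).
        self._store = {}
        self.update(dict(*args, **kwargs))

    def __setitem__(self, key, value):
        self._store[normalize_caseless(key)] = (key, value)

    def __getitem__(self, key):
        return self._store[normalize_caseless(key)][1]

    def __delitem__(self, key):
        del self._store[normalize_caseless(key)]

    def __iter__(self):
        # Iterate over the keys in the spelling they were stored with.
        return (original for original, _ in self._store.values())

    def __len__(self):
        return len(self._store)


headers = CaselessDict({"Content-Type": "text/html"})
print(headers["content-type"])
```

Subclassing MutableMapping rather than dict means get(), update(), `in` and the other helper methods all go through the normalizing __getitem__ and __setitem__.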

#3


49  

Using Python 2, calling .lower() on each string or Unicode object...

    string1.lower() == string2.lower()

...will work most of the time, but indeed doesn't work in the situations @tchrist has described.

Assume we have a file called unicode.txt containing the two strings Σίσυφος and ΣΊΣΥΦΟΣ. With Python 2:

    >>> utf8_bytes = open("unicode.txt", 'r').read()
    >>> print repr(utf8_bytes)
    '\xce\xa3\xce\xaf\xcf\x83\xcf\x85\xcf\x86\xce\xbf\xcf\x82\n\xce\xa3\xce\x8a\xce\xa3\xce\xa5\xce\xa6\xce\x9f\xce\xa3\n'
    >>> u = utf8_bytes.decode('utf8')
    >>> print u
    Σίσυφος
    ΣΊΣΥΦΟΣ

    >>> first, second = u.splitlines()
    >>> print first.lower()
    σίσυφος
    >>> print second.lower()
    σίσυφοσ
    >>> first.lower() == second.lower()
    False
    >>> first.upper() == second.upper()
    True

The Σ character has two lowercase forms, ς and σ, and .lower() won't help compare them case-insensitively.

However, as of Python 3, lower() converts a word-final Σ to ς and any other Σ to σ, so calling lower() on both strings will work correctly:

    >>> s = open('unicode.txt', encoding='utf8').read()
    >>> print(s)
    Σίσυφος
    ΣΊΣΥΦΟΣ

    >>> first, second = s.splitlines()
    >>> print(first.lower())
    σίσυφος
    >>> print(second.lower())
    σίσυφος
    >>> first.lower() == second.lower()
    True
    >>> first.upper() == second.upper()
    True

So if you care about edge cases like the three sigmas in Greek, use Python 3.

(For reference, Python 2.7.3 and Python 3.3.0b1 are shown in the interpreter printouts above.)
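For completeness, a short Python 3 sketch of how the three sigmas behave under lower() versus casefold(). The strings are the two from unicode.txt, written as escapes so the code points are explicit; the variable names are only for illustration.

```python
upper = "\u03a3\u038a\u03a3\u03a5\u03a6\u039f\u03a3"  # ΣΊΣΥΦΟΣ
mixed = "\u03a3\u03af\u03c3\u03c5\u03c6\u03bf\u03c2"  # Σίσυφος

# lower() treats a word-final Σ specially, so whole words agree.
words_agree = upper.lower() == mixed.lower()

# The two lowercase sigmas on their own still differ under lower().
sigmas_agree_lower = "\u03c2".lower() == "\u03c3".lower()  # ς vs σ

# casefold() maps ς, σ and Σ to the same character.
sigmas_agree_fold = (
    "\u03c2".casefold() == "\u03c3".casefold() == "\u03a3".casefold()
)

print(words_agree, sigmas_agree_lower, sigmas_agree_fold)
```

So lower() is enough for this particular pair of words, while casefold() is the safer choice when the strings may already contain either lowercase sigma in any position.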

#4


14  

Section 3.13 of the Unicode standard defines algorithms for caseless matching.

X.casefold() == Y.casefold() in Python 3 implements the "default caseless matching" (D144).

Casefolding does not preserve the normalization of strings in all instances, and therefore the normalization needs to be done ('å' vs. 'å'). D145 introduces "canonical caseless matching":

    import unicodedata

    def NFD(text):
        return unicodedata.normalize('NFD', text)

    def canonical_caseless(text):
        return NFD(NFD(text).casefold())

NFD() is called twice for very infrequent edge cases involving the U+0345 character.

Example:

    >>> 'å'.casefold() == 'å'.casefold()
    False
    >>> canonical_caseless('å') == canonical_caseless('å')
    True

There is also compatibility caseless matching (D146) for cases such as '㎒' (U+3392), and "identifier caseless matching" to simplify and optimize caseless matching of identifiers.
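Compatibility caseless matching can be sketched the same way. The function name compatibility_caseless is my own; the nesting follows the D146 definition as I read it, NFKD(toCasefold(NFKD(toCasefold(NFD(X))))), so treat this as an illustration rather than a reference implementation.

```python
import unicodedata


def NFD(text):
    return unicodedata.normalize("NFD", text)


def NFKD(text):
    return unicodedata.normalize("NFKD", text)


def canonical_caseless(text):
    return NFD(NFD(text).casefold())


def compatibility_caseless(text):
    # Hypothetical helper following D146:
    # NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
    return NFKD(NFKD(NFD(text).casefold()).casefold())


square_mhz = "\u3392"  # SQUARE MHZ, the single-character form of "MHz"

canonical_match = canonical_caseless(square_mhz) == canonical_caseless("mhz")
compat_match = compatibility_caseless(square_mhz) == compatibility_caseless("mhz")

print(canonical_match, compat_match)
```

Canonical matching leaves the square symbol untouched, while the compatibility variant decomposes it into letters before the final fold.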

#5


3  

How about converting to lowercase first? You can use string.lower().

#6


3  

I saw this solution here using regex.

    import re

    if re.search('mandy', 'Mandy Pande', re.IGNORECASE):
        print("matched")  # the search succeeds, so this branch runs

It works well with accents:

    In [42]: if re.search("ê","ê", re.IGNORECASE):
    ....:        print(1)
    ....:
    1

However, it does not handle case-insensitive matching for all Unicode characters. Thank you @Rhymoid for pointing that out; as I understand it, the match needs the exact symbol to succeed. The output is as follows:

    In [36]: "ß".lower()
    Out[36]: 'ß'

    In [37]: "ß".upper()
    Out[37]: 'SS'

    In [38]: "ß".upper().lower()
    Out[38]: 'ss'

    In [39]: if re.search("ß","ßß", re.IGNORECASE):
    ....:        print(1)
    ....:
    1

    In [40]: if re.search("SS","ßß", re.IGNORECASE):
    ....:        print(1)
    ....:

    In [41]: if re.search("ß","SS", re.IGNORECASE):
    ....:        print(1)
    ....:
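Note that re.search looks for the pattern anywhere in the string, so it is a substring test rather than an equality test. If the regex route is wanted for comparing whole strings, a minimal sketch could look like this; regex_iequal is a hypothetical helper name.

```python
import re


def regex_iequal(left, right):
    # Hypothetical helper: escape `left` so it is matched literally, and
    # require the match to span the whole of `right`.
    return re.fullmatch(re.escape(left), right, re.IGNORECASE) is not None


print(regex_iequal("mandy", "Mandy"))        # whole strings, differing in case
print(regex_iequal("mandy", "Mandy Pande"))  # only a prefix matches
```

re.escape keeps characters such as "." or "*" in the first string from being interpreted as regex syntax. The ß limitation shown above applies to this helper as well.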

#7


2  

The usual approach is to uppercase or lowercase the strings for the lookups and comparisons. For example:

    >>> "hello".upper() == "HELLO".upper()
    True
    >>>

#8


0  

    def insenStringCompare(s1, s2):
        """ Method that takes two strings and returns True or False, based
            on if they are equal, regardless of case."""
        try:
            return s1.lower() == s2.lower()
        except AttributeError:
            print("Please only pass strings into this method.")
            print("You passed a %s and %s" % (s1.__class__, s2.__class__))

#9


-6  

If you have lists of strings and you want to compare the strings in the different lists case-insensitively, here is my solution.

    list1 = [each.lower() for each in list1]
    list2 = [each.lower() for each in list2]

After doing that, you can compare the strings easily.
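As one example of what such a comparison between two lists might look like, here is a small sketch that finds the strings of one list that also occur in another, ignoring case. common_caseless is a hypothetical helper, and it uses casefold() rather than lower() for the reasons given in the earlier answers.

```python
def common_caseless(list1, list2):
    # Hypothetical helper: strings from list1 that also occur in list2,
    # compared after casefolding. The original spelling from list1 is kept.
    folded = {item.casefold() for item in list2}
    return [item for item in list1 if item.casefold() in folded]


shared = common_caseless(["Apple", "banana", "Cherry"], ["APPLE", "cherry", "kiwi"])
print(shared)
```

Folding the second list into a set once avoids re-folding it for every element of the first list.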

#10


-7  

I've used this to accomplish something more useful for comparing two strings:

    def strings_iequal(first, second):
        try:
            return first.upper() == second.upper()
        except AttributeError:
            if not first:
                if not second:
                    return True

Update: As noted by gerrit, this answer has some bugs. This was years ago and I no longer remember what I used it for. I do recall writing tests, but what good are they now!
