How can I do case insensitive string comparison in Python?
I would like to encapsulate the comparison of a regular string to a repository string in a simple, Pythonic way. I would also like to be able to look up values in a dict keyed by such strings using regular Python strings.
10 solutions
#1
411
Assuming ASCII strings:
    string1 = 'Hello'
    string2 = 'hello'

    if string1.lower() == string2.lower():
        print("The strings are the same (case insensitive)")
    else:
        print("The strings are not the same (case insensitive)")
#2
327
Comparing strings in a case-insensitive way seems trivial, but it's not. I will be using Python 3, since Python 2 is underdeveloped here.
The first thing to note is that case-removing conversions in Unicode aren't trivial. There is text for which text.lower() != text.upper().lower(), such as "ß":
    "ß".lower()
    #>>> 'ß'

    "ß".upper().lower()
    #>>> 'ss'
But let's say you wanted to caselessly compare "BUSSE" and "Buße". Heck, you probably also want "BUSSE" and "BUẞE" to compare equal; that's the newer capital form. The recommended way is to use casefold:
    help(str.casefold)
    #>>> Help on method_descriptor:
    #>>>
    #>>> casefold(...)
    #>>>     S.casefold() -> str
    #>>>
    #>>>     Return a version of S suitable for caseless comparisons.
    #>>>
Do not just use lower. If casefold is not available, doing .upper().lower() helps (but only somewhat).
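As a small illustration of why .upper().lower() only helps somewhat, here is a sketch for Python 3.3 or later (where str.casefold exists). The helper names are made up for this example, and the special characters are written as escapes so that copy-pasting cannot alter them:

```python
# Compare the three strategies on the German examples from the text.
pairs = [
    ("BUSSE", "Bu\u00dfe"),   # "Buße", with the lowercase sharp s
    ("BUSSE", "BU\u1e9eE"),   # "BUẞE", with the capital sharp s
]

def lower_equal(a, b):
    return a.lower() == b.lower()

def upper_lower_equal(a, b):
    return a.upper().lower() == b.upper().lower()

def casefold_equal(a, b):
    return a.casefold() == b.casefold()

for a, b in pairs:
    print(a, b, lower_equal(a, b), upper_lower_equal(a, b), casefold_equal(a, b))
# BUSSE Buße False True True
# BUSSE BUẞE False False True
```

Only casefold treats all three spellings as equal; the capital ẞ survives .upper() unchanged and then lowercases to ß rather than to ss.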
Then you should consider accents. If your font renderer is good, you probably think "ê" == "ê", but they are not equal:
    "\u00ea" == "e\u0302"  # precomposed "ê" vs. "e" followed by a combining circumflex
    #>>> False
This is because they are actually different sequences of code points:
    import unicodedata

    [unicodedata.name(char) for char in "\u00ea"]
    #>>> ['LATIN SMALL LETTER E WITH CIRCUMFLEX']

    [unicodedata.name(char) for char in "e\u0302"]
    #>>> ['LATIN SMALL LETTER E', 'COMBINING CIRCUMFLEX ACCENT']
The simplest way to deal with this is unicodedata.normalize. You probably want to use NFKD normalization, but feel free to check the documentation. Then one does
    unicodedata.normalize("NFKD", "\u00ea") == unicodedata.normalize("NFKD", "e\u0302")
    #>>> True
To finish up, here is this expressed as functions:
    import unicodedata

    def normalize_caseless(text):
        return unicodedata.normalize("NFKD", text.casefold())

    def caseless_equal(left, right):
        return normalize_caseless(left) == normalize_caseless(right)
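The question also asks about dict lookups. One possible sketch, not part of the original answer, applies the same normalization to every key on the way in and on the way out. The class name CaselessDict and its methods are hypothetical, and it only covers the basic mapping operations:

```python
import unicodedata


def normalize_caseless(text):
    # Same idea as in the answer: casefold, then NFKD-normalize.
    return unicodedata.normalize("NFKD", text.casefold())


class CaselessDict:
    """Hypothetical wrapper: a mapping whose string keys ignore case."""

    def __init__(self, items=()):
        self._data = {}
        for key, value in dict(items).items():
            self[key] = value

    def __setitem__(self, key, value):
        self._data[normalize_caseless(key)] = value

    def __getitem__(self, key):
        return self._data[normalize_caseless(key)]

    def __contains__(self, key):
        return normalize_caseless(key) in self._data

    def get(self, key, default=None):
        return self._data.get(normalize_caseless(key), default)


lookup = CaselessDict({"Bu\u00dfe": 1})   # key is "Buße"
print(lookup["BUSSE"])       # 1
print("busse" in lookup)     # True
print(lookup.get("other"))   # None
```

A wrapper is used instead of subclassing dict because dict's constructor and update() would bypass an overridden __setitem__; the original spelling of each key is not kept in this sketch.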
#3
49
Using Python 2, calling .lower() on each string or Unicode object...
    string1.lower() == string2.lower()
...will work most of the time, but indeed doesn't work in the situations @tchrist has described.
Assume we have a file called unicode.txt containing the two strings Σίσυφος and ΣΊΣΥΦΟΣ. With Python 2:
    >>> utf8_bytes = open("unicode.txt", 'r').read()
    >>> print repr(utf8_bytes)
    '\xce\xa3\xce\xaf\xcf\x83\xcf\x85\xcf\x86\xce\xbf\xcf\x82\n\xce\xa3\xce\x8a\xce\xa3\xce\xa5\xce\xa6\xce\x9f\xce\xa3\n'
    >>> u = utf8_bytes.decode('utf8')
    >>> print u
    Σίσυφος
    ΣΊΣΥΦΟΣ
    >>> first, second = u.splitlines()
    >>> print first.lower()
    σίσυφος
    >>> print second.lower()
    σίσυφοσ
    >>> first.lower() == second.lower()
    False
    >>> first.upper() == second.upper()
    True
The Σ character has two lowercase forms, ς and σ, and .lower() won't help compare them case-insensitively.
However, as of Python 3, lower() handles the word-final sigma: the last Σ in ΣΊΣΥΦΟΣ lowercases to ς and the others to σ, so calling lower() on both strings works correctly:
    >>> s = open('unicode.txt', encoding='utf8').read()
    >>> print(s)
    Σίσυφος
    ΣΊΣΥΦΟΣ
    >>> first, second = s.splitlines()
    >>> print(first.lower())
    σίσυφος
    >>> print(second.lower())
    σίσυφος
    >>> first.lower() == second.lower()
    True
    >>> first.upper() == second.upper()
    True
So if you care about edge-cases like the three sigmas in Greek, use Python 3.
(For reference, Python 2.7.3 and Python 3.3.0b1 are shown in the interpreter printouts above.)
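A short sketch of how this looks in Python 3.3 or later, as far as I understand its behaviour. It also shows that lower() still distinguishes σ and ς when the letters stand alone, which is one reason casefold() may be the safer choice:

```python
first = "Σίσυφος"
second = "ΣΊΣΥΦΟΣ"

# Whole words: lower() applies the final-sigma rule, so both approaches agree.
print(first.lower() == second.lower())        # True
print(first.casefold() == second.casefold())  # True

# Isolated letters: lower() keeps σ and ς distinct, casefold() maps both to σ.
print("Σ".lower() == "ς".lower())        # False
print("Σ".casefold() == "ς".casefold())  # True
```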
#4
14
Section 3.13 of the Unicode standard defines algorithms for caseless matching.
X.casefold() == Y.casefold() in Python 3 implements the "default caseless matching" (D144).
Casefolding does not preserve the normalization of strings in all instances, and therefore normalization needs to be done ('å' vs. 'å'). D145 introduces "canonical caseless matching":
    import unicodedata

    def NFD(text):
        return unicodedata.normalize('NFD', text)

    def canonical_caseless(text):
        return NFD(NFD(text).casefold())
NFD() is called twice for very infrequent edge cases involving the U+0345 character.
Example:
    >>> '\u00e5'.casefold() == 'a\u030a'.casefold()  # precomposed 'å' vs. 'a' + combining ring above
    False
    >>> canonical_caseless('\u00e5') == canonical_caseless('a\u030a')
    True
There is also compatibility caseless matching (D146) for cases such as '㎒' (U+3392), and "identifier caseless matching" to simplify and optimize caseless matching of identifiers.
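Compatibility caseless matching can be sketched in the same style. This follows my reading of D146, which compares NFKD(toCasefold(NFKD(toCasefold(NFD(X))))) for both strings; the function name compatibility_caseless is made up for this example:

```python
import unicodedata


def NFD(text):
    return unicodedata.normalize('NFD', text)


def NFKD(text):
    return unicodedata.normalize('NFKD', text)


def compatibility_caseless(text):
    # D146: NFKD(toCasefold(NFKD(toCasefold(NFD(X)))))
    return NFKD(NFKD(NFD(text).casefold()).casefold())


# U+3392 is the single character SQUARE MHZ.
print('\u3392'.casefold() == 'MHZ'.casefold())                            # False
print(compatibility_caseless('\u3392') == compatibility_caseless('MHZ'))  # True
```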
#5
3
How about converting to lowercase first? You can use string.lower().
#6
3
I saw this solution here using regex.
    import re

    if re.search('mandy', 'Mandy Pande', re.IGNORECASE):
        pass  # the condition is True
It works well with accents:
    In [42]: if re.search("ê", "ê", re.IGNORECASE):
       ....:     print(1)
       ....:
    1
However, it doesn't handle case-insensitive matching for all Unicode characters. Thank you @Rhymoid for pointing that out; as I understand it, the match needs the exact symbol to succeed. The output is as follows:
    In [36]: "ß".lower()
    Out[36]: 'ß'
    In [37]: "ß".upper()
    Out[37]: 'SS'
    In [38]: "ß".upper().lower()
    Out[38]: 'ss'
    In [39]: if re.search("ß", "ßß", re.IGNORECASE):
       ....:     print(1)
       ....:
    1
    In [40]: if re.search("SS", "ßß", re.IGNORECASE):
       ....:     print(1)
       ....:
    In [41]: if re.search("ß", "SS", re.IGNORECASE):
       ....:     print(1)
       ....:
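Note that re.search looks for a substring, so it answers "does this occur somewhere", not "are these equal". If the regex route is still wanted for an equality test, a minimal sketch could use re.fullmatch (Python 3.4+) together with re.escape. The helper name regex_iequal is made up here, and the Unicode limitations shown above still apply:

```python
import re


def regex_iequal(left, right):
    """Hypothetical helper: True if the whole of right matches left, ignoring case."""
    # re.escape keeps characters such as "." from being treated as regex syntax.
    return re.fullmatch(re.escape(left), right, re.IGNORECASE) is not None


print(regex_iequal("mandy", "MANDY"))        # True
print(regex_iequal("mandy", "Mandy Pande"))  # False
print(regex_iequal("a.b", "AxB"))            # False
```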
#7
2
The usual approach is to uppercase the strings or lower case them for the lookups and comparisons. For example:
    >>> "hello".upper() == "HELLO".upper()
    True
#8
0
    def insenStringCompare(s1, s2):
        """Method that takes two strings and returns True or False, based on
        if they are equal, regardless of case."""
        try:
            return s1.lower() == s2.lower()
        except AttributeError:
            print("Please only pass strings into this method.")
            print("You passed a %s and %s" % (s1.__class__, s2.__class__))
#9
-6
If you have lists of strings and want to compare the strings in different lists case-insensitively, here is my solution.
    list1 = [each.lower() for each in list1]
    list2 = [each.lower() for each in list2]
After doing that, you can compare the strings easily.
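For instance, once both lists are folded, finding the entries they share becomes a set lookup. The following is only a sketch; the function name caseless_common is made up, and it uses casefold() (Python 3.3+) rather than lower() for the reasons given in the answers above:

```python
def caseless_common(list1, list2):
    """Return the items of list1 that also occur in list2, ignoring case."""
    folded = {item.casefold() for item in list2}
    return [item for item in list1 if item.casefold() in folded]


print(caseless_common(["Apple", "banana", "Cherry"], ["APPLE", "cherry", "date"]))
# ['Apple', 'Cherry']
```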
#10
-7
I've used this to accomplish something more useful for comparing two strings:
    def strings_iequal(first, second):
        try:
            return first.upper() == second.upper()
        except AttributeError:
            if not first:
                if not second:
                    return True
        # Note: every other non-string case falls through and returns None.
Update: As noted by gerrit, this answer has some bugs. This was years ago and I no longer remember what I used it for. I do recall writing tests, but what good are they now!