If I have a Python Unicode string that contains combining characters, len
reports a value that does not correspond to the number of characters "seen".
For example, if I have a string with combining overlines and underlines such as u'A\u0332\u0305BC', len(u'A\u0332\u0305BC') reports 5, but the displayed string is only 3 characters long.
How do I get the "visible" length of a Unicode string containing combining glyphs in Python, that is, the number of distinct positions occupied by the string the user sees?
3 solutions
#1
4
The unicodedata module has a function combining that can be used to determine if a single character is a combining character. If it returns 0, you can count the character as non-combining.
import unicodedata
len(u''.join(ch for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0))
or, slightly simpler:
sum(1 for ch in u'A\u0332\u0305BC' if unicodedata.combining(ch) == 0)
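The one-liner above can be wrapped in a small helper for reuse. This is only a sketch: the name count_base_characters is made up for illustration, and it counts non-combining code points, which is not always the same as what a renderer displays.

```python
import unicodedata

def count_base_characters(text):
    """Return the number of code points in text that are not combining marks.

    Hypothetical helper name; it only wraps the sum() expression above.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

sample = u'A\u0332\u0305BC'
print(len(sample))                    # 5 code points in total
print(count_base_characters(sample))  # 3 non-combining code points
```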
#2
4
If you have a regex flavor that supports matching grapheme clusters, you can use \X.
While the default Python re module does not support \X, Matthew Barnett's regex module does:
>>> import regex
>>> len(regex.findall(r'\X', u'A\u0332\u0305BC'))
3
On Python 2, you need to use the u prefix in the pattern:
>>> regex.findall(u'\\X', u'A\u0332\u0305BC')
[u'A\u0332\u0305', u'B', u'C']
>>> len(regex.findall(u'\\X', u'A\u0332\u0305BC'))
3
#3
2
Combining characters are not the only zero-width characters:
>>> import unicodedata
>>> sum(1 for ch in u'\u200c' if unicodedata.combining(ch) == 0)
1
("\u200c" is the zero-width non-joiner, a non-printing character.)
In this case the regex module does not work either:
>>> len(regex.findall(r'\X', u'\u200c'))
1
I found the wcwidth package, which handles the above case correctly:
>>> from wcwidth import wcswidth
>>> wcswidth(u'A\u0332\u0305BC')
3
>>> wcswidth(u'\u200c')
0
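If installing a third-party package is not an option, a rough standard-library approximation is to skip characters by general category instead of by combining class. This is a sketch under assumptions: the helper name and the choice of categories (Mn, Me, Cf) are mine, not something wcwidth does, and every remaining character is counted as one position, so wide East Asian characters are undercounted.

```python
import unicodedata

# Assumption: nonspacing marks (Mn), enclosing marks (Me) and format
# characters (Cf, which includes the zero-width non-joiner) take up no
# position of their own. Everything else is counted as one position.
ZERO_WIDTH_CATEGORIES = {"Mn", "Me", "Cf"}

def approx_visible_length(text):
    """Count code points whose general category is not treated as zero-width."""
    return sum(
        1 for ch in text
        if unicodedata.category(ch) not in ZERO_WIDTH_CATEGORIES
    )

print(approx_visible_length(u'A\u0332\u0305BC'))  # 3
print(approx_visible_length(u'\u200c'))           # 0
```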
But wcswidth still doesn't seem to work with user 596219's example:
>>> wcswidth('각')
4
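One possible explanation, and it is only an assumption since the original example is not shown here, is that the string consists of decomposed Hangul jamo rather than a single precomposed syllable. In that case, normalizing to NFC before counting may help. The sketch below uses only unicodedata; the helper name is hypothetical, and it counts characters, not terminal columns (a Hangul syllable is usually drawn two columns wide).

```python
import unicodedata

def composed_length(text):
    """Normalize to NFC, then count the non-combining code points."""
    composed = unicodedata.normalize("NFC", text)
    return sum(1 for ch in composed if unicodedata.combining(ch) == 0)

# Assumed input: the syllable U+AC01 written as three conjoining jamo.
jamo = u'\u1100\u1161\u11a8'
print(len(jamo))              # 3 code points before normalization
print(composed_length(jamo))  # 1 after NFC composition
```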