I have a Unicode string, and I want to check whether each byte is a continuation byte or a start byte, so that I can count the number of Unicode characters with a simple program:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
def arg(str):
    i = 0
    j = 0
    print i
    for test in str:
        print test
        value = int(test,16)
        if (value & 0xc0) != 0x80:
            j=j+1
            print "hello"
    print j
    #return j
def main():
    print "inside main"
    new = "象形字"
    charlen = len(new)
    print charlen
    tes = new.decode('utf-8')
    declen = len(tes)
    print declen
    data = tes.encode('utf-8')
    # print self_len
    enclen = len(data)
    print enclen
    print data
    arg(data)
if __name__ == "__main__":
    main()
Running the code gives the following error:
象形字[Decode error - output not utf-8]
Traceback (most recent call last):
File "/Users/laxmi518/Documents/laxmi/code/C/python-c/python_unicode.py", line 69, in <module>
main()
File "/Users/laxmi518/Documents/laxmi/code/C/python-c/python_unicode.py", line 52, in main
arg(data)
File "/Users/laxmi518/Documents/laxmi/code/C/python-c/python_unicode.py", line 16, in arg
value = int(test,16)
ValueError: invalid literal for int() with base 16: '\xe8'
[Finished in 0.1s with exit code 1]
1 answer
UTF-8 bytes are not hex strings. They are just bytes, and Python displays bytes outside the printable ASCII range using the string literal escape syntax. That is only a debugging display notation; the value '\xe8' is a single byte, not the four characters shown.
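To make the distinction concrete, here is a small sketch (the variable names `raw` and `text` are mine, not from the question) comparing a single escaped byte with a piece of hexadecimal text. It should behave the same on Python 2 and Python 3:

```python
raw = b"\xe8"    # one byte, written with an escape in the source code
text = "e8"      # two characters of hexadecimal text

print(len(raw))            # 1
print(len(text))           # 2
print(int(text, 16))       # 232, int() parses hexadecimal *text*
print(bytearray(raw)[0])   # 232, the numeric value of the byte itself
```

int(..., 16) only makes sense for the second kind of value, which is why it rejects the raw byte.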
Use the ord() function to get the numerical value of a byte:
value = ord(test)
With that change, running your script in a terminal on Mac OS X (configured for UTF-8) outputs:
inside main
9
3
9
象形字
0
?
hello
?
?
?
hello
?
?
?
hello
?
?
3
The question marks are generated by the terminal. Printing a single byte from a UTF-8 byte stream means you are printing an incomplete UTF-8 sequence, so the terminal doesn't know what to do with it and produces a placeholder character instead.
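You can reproduce roughly what the terminal runs into from inside Python. This is only an illustration under my own assumptions (the names `lead`, `complete` and `placeholder` are made up for the example): decoding a lone lead byte fails, and asking the decoder to substitute errors yields the replacement character U+FFFD, which is comparable to the terminal's placeholder.

```python
# -*- coding: utf-8 -*-

data = u"象".encode("utf-8")   # three bytes: e8 b1 a1
lead = data[0:1]               # slicing keeps a bytes object on Python 2 and 3

try:
    lead.decode("utf-8")
    complete = True
except UnicodeDecodeError:
    complete = False           # a lone lead byte is not a valid sequence

# The "replace" error handler substitutes U+FFFD for the undecodable byte.
placeholder = lead.decode("utf-8", "replace")

print(complete)                   # False
print(placeholder == u"\ufffd")   # True
```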
Instead of printing test directly, print the output of the repr() function:
print repr(test)
to get a \xhh hex notation for those bytes instead:
inside main
9
3
9
象形字
0
'\xe8'
hello
'\xb1'
'\xa1'
'\xe5'
hello
'\xbd'
'\xa2'
'\xe5'
hello
'\xad'
'\x97'
3
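If all you need is the count, the same bit test can be wrapped in a small function that returns the number instead of printing along the way. This is a sketch rather than a drop-in replacement for your script: the name count_code_points is hypothetical, and I iterate over a bytearray (my choice, not something the question uses) because it yields integers on both Python 2 and Python 3, so ord() is not needed.

```python
# -*- coding: utf-8 -*-

def count_code_points(data):
    """Count characters in UTF-8 encoded bytes by skipping continuation bytes."""
    count = 0
    # bytearray yields integers on both Python 2 and Python 3.
    for value in bytearray(data):
        # Continuation bytes have the bit pattern 10xxxxxx.
        if (value & 0xC0) != 0x80:
            count += 1
    return count

data = u"象形字".encode("utf-8")
print(len(data))                 # 9 bytes
print(count_code_points(data))   # 3 characters
```

This assumes the input is valid UTF-8; for a stray continuation byte or a truncated sequence the count will not match what decoding would give you. If you already have the decoded unicode object, len() on it is usually simpler.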