Three demonstrations of character encoding in python

Date: 2021-06-26 15:43:59

In the default python2 terminal environment, Chinese text is handled with a GB-family encoding

% python2
>>> x = "中国"
>>> x
'\xd6\xd0\xb9\xfa'
# Each Chinese character occupies two bytes, so this is not utf-8
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
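# python's default encoding is ascii, but ascii cannot encode or decode Chinese, so the bytes above must have been produced by some other encoding (the one the terminal uses)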
>>> x.decode("ascii")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd6 in position 0: ordinal not in range(128)
>>> x.decode()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd6 in position 0: ordinal not in range(128)
# The two commands above confirm that ascii cannot decode Chinese
>>> x.decode("gb2312")
u'\u4e2d\u56fd'
>>> x.decode("gbk")
u'\u4e2d\u56fd'
>>> x.decode("gb18030")
u'\u4e2d\u56fd'
# The three commands above confirm that a GB-family encoding is actually in use
>>> exit()
%
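
The session above depends on the terminal's encoding. As a rough, terminal-independent sketch of the same checks, the following assumes Python 3, where bytes plays the role of python2's str and the encoding step that the terminal performed is written out explicitly. The variable names are illustrative and not from the original session.

```python
# Python 3 sketch: bytes stands in for python2's str.
text = "中国"

# Encode explicitly with gbk, mimicking what a GB-encoded terminal hands to python2.
raw = text.encode("gbk")
print(raw)  # b'\xd6\xd0\xb9\xfa'

# ascii cannot decode any byte above 0x7f, so this attempt fails.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False
print(ascii_ok)

# gbk and gb18030 extend gb2312, so all three decode these bytes identically.
decoded = [raw.decode(name) for name in ("gb2312", "gbk", "gb18030")]
print(decoded)
```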

In the default ipython terminal environment, Chinese text is handled with utf-8

% ipython2
In [1]: x = "中国"
In [2]: x
Out[2]: '\xe4\xb8\xad\xe5\x9b\xbd'
# Under ipython, Chinese text is utf-8 encoded by default (three bytes per character here)
In [3]: import sys
In [4]: sys.getdefaultencoding()
Out[4]: 'ascii'
# The python default encoding is still ascii, so ipython does not get its utf-8 behavior by changing this value
In [5]: x.decode("utf-8")
Out[5]: u'\u4e2d\u56fd'
# utf-8 decodes the Chinese text correctly
In [6]: x.decode("gbk")
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-7-98ccb03bca37> in <module>()
----> 1 x.decode("gbk")
UnicodeDecodeError: 'gbk' codec can't decode bytes in position 2-3: illegal multibyte sequence
# A GB-family encoding, by contrast, cannot decode these bytes
In [7]: exit
%
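
The ipython behavior can be sketched the same way. This is a minimal illustration assuming Python 3, with bytes again standing in for python2's str and the utf-8 encoding step made explicit; the names are hypothetical.

```python
# Python 3 sketch: bytes stands in for python2's str.
text = "中国"

# Encode explicitly with utf-8, as the ipython terminal did in the session above.
raw = text.encode("utf-8")
print(raw)       # b'\xe4\xb8\xad\xe5\x9b\xbd'
print(len(raw))  # 6

# Decoding utf-8 bytes with gbk pairs the bytes up wrongly and hits an invalid pair.
try:
    raw.decode("gbk")
    gbk_error = None
except UnicodeDecodeError as exc:
    gbk_error = exc
print(gbk_error)

# Decoding with the encoding that produced the bytes restores the text.
print(raw.decode("utf-8") == text)
```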

How the encoding scheme affects byte-based calculations

% ipython2
In [34]: x = "abcde今天是星期六12345"
In [35]: len(x)
28
In [36]: type(x)
<type 'str'>
In [37]: x.find("今天")
5
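# This is a byte offset; it equals the character index here only because the five preceding characters are single-byte ascii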
In [38]: x.find("星期六")
14
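# Counted in bytes: Chinese text is utf-8 encoded here, so each Chinese character takes 3 bytes (5 + 3*3 = 14). Following the first section, the default python terminal would be expected to give 11, because each Chinese character takes 2 bytes there (5 + 3*2 = 11)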
In [39]: x.find("今天".decode("utf-8"))
5
In [40]: x.find("星期六".decode("utf-8"))
8
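# With a unicode argument, the search is done on unicode and the result is a character index (8 instead of 14). Note that python2 has to decode x implicitly for this, which only succeeds when the session's default encoding can decode it; with a default encoding of 'ascii' these two calls would raise UnicodeDecodeError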
In [41]: x.decode("utf-8").find("今天".decode("utf-8"))
5
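In [42]: x.decode("utf-8").find("星期六".decode("utf-8"))
8
# Both sides are decoded explicitly, so the index is a character count: 5 ascii characters + 3 Chinese characters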
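
The difference between byte offsets and character indexes can be checked side by side. The sketch below assumes Python 3, where str is always unicode and bytes corresponds to python2's str; byte_offset is a hypothetical helper introduced only for this illustration.

```python
text = "abcde今天是星期六12345"
needle = "星期六"

def byte_offset(haystack, sub, encoding):
    """Return the position of sub in haystack, measured in encoded bytes."""
    return haystack.encode(encoding).find(sub.encode(encoding))

print(len(text.encode("utf-8")))           # 28
print(byte_offset(text, needle, "utf-8"))  # 14
print(byte_offset(text, needle, "gbk"))    # 11
print(text.find(needle))                   # 8
```

The same search yields 14, 11, or 8 depending on whether it runs over utf-8 bytes, gbk bytes, or unicode characters, which is why byte-based positions should not be mixed with character-based ones.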

Summary

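  • Keep in mind that unicode is a character set, while gb, utf-8, and the like are encoding schemes; decode goes "encoded bytes → character set", and encode goes "character set → encoded bytes"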
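
The two directions can be illustrated with a short round trip. This is a sketch assuming Python 3, where str holds unicode code points and bytes holds the encoded result; the variable names are illustrative.

```python
text = "中国"  # str: a sequence of unicode code points

# The code points are the same regardless of which encoding is used later.
code_points = [hex(ord(ch)) for ch in text]
print(code_points)  # ['0x4e2d', '0x56fd']

for encoding in ("utf-8", "gbk"):
    raw = text.encode(encoding)  # character set -> encoded bytes
    back = raw.decode(encoding)  # encoded bytes -> character set
    print(encoding, raw, back == text)
```

The bytes differ per encoding, but decoding with the matching encoding always returns the same code points.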