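I recently had to revisit and practice Python encoding and decoding for a data-mining project. My notes are summarized below; if I have missed or misstated anything, corrections are welcome.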
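1. A brief history of character encoding
For background, see the following article, which is unusually clear and easy to follow:
"The Story of Character Encoding (the differences between ASCII, ANSI, Unicode, and UTF-8)"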
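2. Introduction to encoding and decoding in Python
In Python 2, a str object is a byte string, comparable to byte[] in Java. str objects are what a program exchanges with the outside world: the console, files, and the network.
For example, ASCII is enough for English but not for Chinese, which needs an encoding such as gbk or gb2312. Bytes encoded as ascii, gbk, or gb2312 are all stored in str objects.
Internally, Python 2 represents text with unicode objects, comparable to String in Java. Note that Python's internal unicode representation differs slightly from the Unicode standard itself; those differences can be ignored for now.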
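The distinction above can be sketched in a few lines. This is a minimal illustration written for Python 3 (an assumption, since the text describes Python 2), where bytes plays the role of Python 2's str and str plays the role of Python 2's unicode. The variable names are made up for the example.

```python
# Python 3 sketch: bytes ~ Python 2 str, str ~ Python 2 unicode.
text = "示例"                       # text object holding 2 characters

gbk_bytes = text.encode("gbk")      # encode: text -> byte string
assert isinstance(gbk_bytes, bytes)

# decode: byte string -> text; the argument names the bytes' current encoding
assert gbk_bytes.decode("gbk") == text

# ASCII is enough for English...
assert "ab".encode("ascii") == b"ab"

# ...but it cannot represent Chinese characters.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(ascii_ok)
```

The same byte string only becomes readable text again when it is decoded with the encoding that produced it, which is the theme of the examples in section 2.3.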
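2.1 Common modules for encoding and decoding in Python:
codecs and chardet
2.2 Common methods for encoding and decoding in Python:
myStr.decode : decodes myStr into a unicode object; the argument names the encoding that myStr is currently in.
myUnicode.encode : encodes myUnicode into a str object; the argument names the encoding to use.
isinstance(s, str) : tests whether s is an ordinary (byte) string
isinstance(s, unicode) : tests whether s is a unicode object
codecs.encode(obj[, encoding[, errors]]) : if no encoding is given, ASCII is used by default. obj should be a unicode object.
codecs.decode(obj[, encoding[, errors]]) : if no encoding is given, ASCII is used by default.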
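The functions listed above can be tried out as follows. This is a sketch for Python 3 rather than the Python 2 used in this article, so the isinstance checks use bytes and str in place of str and unicode. The encoding is always passed explicitly, because the default differs between Python versions.

```python
import codecs

text = "示例"

# codecs.encode / codecs.decode take the object first, then the encoding name.
raw = codecs.encode(text, "gbk")
assert isinstance(raw, bytes)    # Python 2 equivalent: isinstance(raw, str)
assert isinstance(text, str)     # Python 2 equivalent: isinstance(text, unicode)

restored = codecs.decode(raw, "gbk")

# The optional errors argument decides what happens to characters the
# target encoding cannot represent; "replace" substitutes "?" for each one.
lossy = codecs.encode(text, "ascii", "replace")

print(restored == text, lossy)
```

With the default errors value ("strict"), the last call would raise UnicodeEncodeError instead of returning a lossy result.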
2.3 Usage examples:
>>> import chardet
>>> myStr = "ab"
>>> print len(myStr)
2
>>> print chardet.detect(myStr)
{'confidence': 1.0, 'encoding': 'ascii'}
>>> myStr = u"ab"
>>> myUEStr = u"ab"
>>> print len(myUEStr)
2
>>> print chardet.detect(myUEStr)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "build\bdist.win32\egg\chardet\__init__.py", line 25, in detect
ValueError: Expected a bytes object, not a unicode object
Cause of the error: chardet.detect() does not accept a unicode object as its argument; it expects a byte string.
>>> myCnStr = "示例"
>>> print len(myCnStr)
4
>>> print chardet.detect(myCnStr)
{'confidence': 0.5475, 'encoding': 'windows-1252'}
>>> myUCStr = u"示例"
>>> print len(myUCStr)
2
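Analysis: len() on a unicode object counts characters, not bytes, so the two-character string has length 2. The byte string myCnStr has length 4 because the console's GBK encoding stores each of these Chinese characters in 2 bytes (see section 1 of this article, on the history of encodings).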
>>> print myCnStr.decode('windows-1252')
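Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'gbk' codec can't encode character u'\xca' in position 0: illegal multibyte sequence
Cause of the error: at this point I assumed it was simply because the Windows cmd window uses the gbk codec by default (right-click the cmd window > Properties > Options > Current code page). The decode itself succeeds; it is print that fails, when it implicitly encodes the resulting unicode string for the console.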
>>> print myCnStr.decode('windows-1252').encode('gb18030')
????????
>>> print myCnStr.decode('windows-1252').encode('gbk')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'gbk' codec can't encode character u'\xca' in position 0: illegal multibyte sequence
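Analysis of the errors: these two failures puzzled me. Converting myCnStr to unicode and then encoding it as GBK seemed reasonable, so why would GBK reject it? The decode does succeed, but 'windows-1252' is the wrong encoding: it turns the four bytes into four unrelated Latin characters such as u'\xca', and GBK has no representation for them. The low confidence value returned by chardet.detect(myCnStr) was the sign that its guess could not be trusted.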
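The failure can be reproduced without chardet or a Windows console. The sketch below is written for Python 3 and assumes, as in the session above, that the original bytes were produced by encoding the text as GBK.

```python
text = "示例"
raw = text.encode("gbk")          # the 4 bytes a GBK console would produce

# Decoding with the wrong codec "succeeds" but yields unrelated Latin characters.
garbled = raw.decode("windows-1252")
assert len(garbled) == 4 and garbled != text

# Those characters have no GBK representation, hence the UnicodeEncodeError.
try:
    garbled.encode("gbk")
    reencoded = True
except UnicodeEncodeError:
    reencoded = False

# Undoing the wrong step with the same codec recovers the original bytes.
assert garbled.encode("windows-1252") == raw

print(reencoded)
```

A decode that raises no error is therefore not proof that the encoding was guessed correctly.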
Next I tried decoding with gb18030, which worked. Decoding with gbk or gb2312 works as well.
>>> print myCnStr.decode('gb18030')
示例
Analysis: the unicode string decoded from myCnStr displays correctly in the cmd window, presumably because print encodes it with the console's GBK codec.
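When the encoding of incoming bytes is uncertain, one option is to try a short list of candidates in order. The helper below is a hypothetical sketch for Python 3 (decode_with_fallback is not part of the standard library or of chardet), and the candidate list is an assumption that should be adapted to the data at hand.

```python
def decode_with_fallback(raw, encodings=("utf-8", "gb18030")):
    """Return (text, encoding) for the first candidate that decodes raw cleanly."""
    for name in encodings:
        try:
            return raw.decode(name), name
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings matched")


sample = "示例".encode("gbk")
print(decode_with_fallback(sample, ("ascii", "gb18030")))
```

The order of the candidates matters: strict codecs such as ascii or utf-8 should come first, because permissive single-byte codecs such as windows-1252 accept almost any input and would mask the correct answer.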
>>> print myCnStr.decode('gb18030').encode('gbk')
示例
>>> print len(myCnStr.decode('gb18030'))
2
>>> print len(myCnStr.decode('gb18030').encode('gbk'))
4
Analysis: the length after gbk encoding is 4, which confirms that the gbk-encoded value is a byte string.
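The relationship between character count and byte count can be checked directly. This is a small Python 3 sketch; the utf-8 figure is added for comparison and does not come from the session above.

```python
text = "示例"

lengths = {
    "characters": len(text),                  # length of the text object
    "gbk bytes": len(text.encode("gbk")),     # 2 bytes per character here
    "utf-8 bytes": len(text.encode("utf-8")), # 3 bytes per character here
}

print(lengths)
```

The text object always reports 2, while the byte length depends entirely on which encoding produced the bytes.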
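3. Accessing MySQL from Python
(Verified to work on Windows 7.)
Please treat the following with caution, and point out any mistakes you find. I still have some doubts about how I configured and used MySQL.
Module: MySQLdb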
*******************************************my-default.ini*******************************************
======================> (the lines shown in blue in the original post are the ones I added)
# For advice on how to change settings please see
# http://dev.mysql.com/doc/refman/5.6/en/server-configuration-defaults.html
# *** DO NOT EDIT THIS FILE. It's a template which will be copied to the
# *** default location during install, and will be replaced if you
# *** upgrade to a newer version of MySQL.
[mysqld]
# Remove leading # and set to the amount of RAM for the most important data
# cache in MySQL. Start at 70% of total RAM for dedicated server, else 10%.
# innodb_buffer_pool_size = 128M
# Remove leading # to turn on a very important data integrity option: logging
# changes to the binary log between backups.
# log_bin
# These are commonly set, remove the # and set as required.
basedir = H:\Software\mysql-5.6.21-win32\mysql-5.6.21-win32
datadir = H:\\Software\mysql-5.6.21-win32\mysql-5.6.21-win32\data
# port = .....
# server_id = .....
# Remove leading # to set options mainly useful for reporting servers.
# The server defaults are faster for transactions and fast SELECTs.
# Adjust sizes as needed, experiment to find the optimal values.
# join_buffer_size = 128M
# sort_buffer_size = 2M
# read_rnd_buffer_size = 2M
sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES
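# default-character-set is a client-side option; under [mysqld] on MySQL 5.5 and later, use character_set_server instead
# default-character-set = utf8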
character_set_server = utf8
[mysqld_safe]
default-character-set = utf8
[client]
default-character-set = utf8
*****************************************************************************************************
**************************************Python code example**********************************************
#-*- encoding=utf-8 -*-
# ======> Python only looks for "coding", the "=" (or ":") after it, and the encoding name; the -*- markers are optional
import MySQLdb
import chardet
# The next three lines make Python 2 use utf-8 as its default codec
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
... # unrelated code omitted
try:
    myConn = MySQLdb.connect(host='localhost', user='root', passwd='rootPWD', db='myDB', port=3306, charset='gbk')
    myCur = myConn.cursor()
    # Placeholders are filled in by the driver, which also handles quoting
    mysqlCmd = "insert into myTable values(%s, %s)"
    myParam = (strName, float(value))
    n = myCur.execute(mysqlCmd, myParam)
    myConn.commit()
    myCur.close()
    myConn.close()
except MySQLdb.Error, e:
    print "Mysql Error %d: %s" % (e.args[0], e.args[1])
*****************************************************************************************************
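The remark about the first line of the script can be checked against the pattern that PEP 263 gives for the source-encoding declaration. The sketch below is for Python 3; declared_encoding is a hypothetical helper written for this illustration, not an interpreter API.

```python
import re

# Pattern given in PEP 263 for the source-encoding declaration.
CODING_RE = re.compile(r"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(line):
    """Return the encoding named in a comment line, or None if there is none."""
    match = CODING_RE.match(line)
    return match.group(1) if match else None


for line in ("#-*- encoding=utf-8 -*-", "# coding: gbk", "# just a comment"):
    print(repr(line), "->", declared_encoding(line))
```

The declaration is only honored when it appears on the first or second line of the source file.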
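A few notes:
(1) When reading a table in the mysql client with select * from myTable, if Chinese fields show up as garbled text, run set names gbk and then try the select again.
(2) MySQL supports character sets at four levels: server, database, table, and column.
mysql> show variables like 'collation_%';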
+----------------------+-------------------+
| Variable_name | Value |
+----------------------+-------------------+
| collation_connection | gbk_chinese_ci |
| collation_database | latin1_swedish_ci |
| collation_server | latin1_swedish_ci |
+----------------------+-------------------+
3 rows in set (0.00 sec)
mysql> show variables like 'character_set_%';
+--------------------------+------------------------------------------------------------------------------+
| Variable_name            | Value                                                                        |
+--------------------------+------------------------------------------------------------------------------+
| character_set_client     | gbk                                                                          |
| character_set_connection | gbk                                                                          |
| character_set_database   | latin1                                                                       |
| character_set_filesystem | binary                                                                       |
| character_set_results    | gbk                                                                          |
| character_set_server     | latin1                                                                       |
| character_set_system     | utf8                                                                         |
| character_sets_dir       | H:\A_Tian_Jie\Software\mysql-5.6.21-win32\mysql-5.6.21-win32\share\charsets\ |
+--------------------------+------------------------------------------------------------------------------+
8 rows in set (0.00 sec)