How to fix UnicodeDecodeError: 'gbk' codec can't decode byte

Date: 2021-11-22 09:29:17

Problem:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 199: illegal multibyte sequence

I ran into this while working through the naive Bayes chapter of 《机器学习实战》 (Machine Learning in Action).

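Solution:
The second paragraph of email\ham\23.txt contains a stray question mark that makes decoding fail. After deleting the '?', the code runs to completion.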

The rest of this post is rather long-winded; feel free to skip it.

A UnicodeDecodeError means that decoding to Unicode failed: you are handling a string of bytes in some encoding and trying to decode it into Unicode, and something went wrong partway through.
http://www.crifan.com/summary_python_unicodedecode_error_possible_reasons_and_solutions/
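To make "decoding" concrete, here is a minimal sketch. The sample bytes and the helper name try_decode are my own, made up for illustration and not taken from the book's code. The same bytes decode cleanly under Latin-1, where 0xAE is the ® sign, but fail under GBK, where 0xAE is treated as the first byte of a two-byte character and the byte after it (a space) is not a valid second byte.

```python
# Hypothetical sample: ASCII text containing a single 0xAE byte.
raw = b"SciFinance\xae is"


def try_decode(data, codec):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode(codec), None
    except UnicodeDecodeError as exc:
        return None, exc


text_latin1, err_latin1 = try_decode(raw, "latin-1")
text_gbk, err_gbk = try_decode(raw, "gbk")

print(repr(text_latin1))  # the text, with 0xAE decoded as a Latin-1 character
print(err_gbk)            # the UnicodeDecodeError raised by the gbk codec
```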
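The error message shows that it is the 'gbk' codec that failed. The usual fix is to change the encoding used when opening the file; see http://www.cnblogs.com/arctique/p/5699620.html for details. Whether that works is something you have to try for yourself, because the problem in this algorithm's data is an odd one. Now to the main topic: how I tracked down the UnicodeDecodeError in the naive Bayes example.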
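As a rough sketch of that "change the encoding" approach: open() accepts encoding and errors arguments, so the file does not have to be read with the system default codec. The temporary file below is a made-up stand-in for email\ham\23.txt, and Latin-1 is only my guess at a codec that fits it; I have not verified the real file's encoding.

```python
import os
import tempfile

# Hypothetical stand-in for the problem file: ASCII text plus one 0xAE byte.
sample_bytes = b"SciFinance\xae is a derivatives pricing tool."

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "23.txt")
    with open(path, "wb") as f:
        f.write(sample_bytes)

    # Option 1: name a codec that can decode every byte in the file.
    with open(path, encoding="latin-1") as f:
        text_latin1 = f.read()

    # Option 2: keep a stricter codec but silently drop undecodable bytes.
    with open(path, encoding="utf-8", errors="ignore") as f:
        text_ignored = f.read()

print(repr(text_latin1))
print(repr(text_ignored))
```

Option 2 throws information away, so it is only reasonable when the dropped bytes do not matter, as with a word-frequency classifier like this one.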
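I edit the code in Spyder and then import the module in IPython to run it. I tried the various fixes that Baidu turned up and still got the error. When something fails, I usually break the function apart and run it step by step in IPython, printing as I go. That is where it got interesting: when I ran the code on a single file in IPython, it worked fine.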

...: file = open('email\\ham\\9.txt')
...: fileRead = file.read()
...: wordList = bayes.textParse(fileRead)
...: docList.append(wordList)
...: fullText.extend(wordList)
...: classList.append(1)
...:
C:\ProgramData\Anaconda3\lib\re.py:212: FutureWarning: split() requires a non-empty pattern match.
  return _compile(pattern, flags).split(string, maxsplit)

So the code itself is fine, and the problem must lie in the data being read. Further testing showed:

In [9]: for i in range(1,26):
   ...:     file = open('email\\spam\\%d.txt' %i)
   ...:     fileRead = file.read()
   ...:     wordList = bayes.textParse(fileRead)
   ...:     docList.append(wordList)
   ...:     fullText.extend(wordList)
   ...:     classList.append(1)
   ...:
C:\ProgramData\Anaconda3\lib\re.py:212: FutureWarning: split() requires a non-empty pattern match.
  return _compile(pattern, flags).split(string, maxsplit)

In [10]: for i in range(1,26):
    ...:     file = open('email\\ham\\%d.txt' %i)
    ...:     fileRead = file.read()
    ...:     wordList = bayes.textParse(fileRead)
    ...:     docList.append(wordList)
    ...:     fullText.extend(wordList)
    ...:     classList.append(1)
    ...:
C:\ProgramData\Anaconda3\lib\re.py:212: FutureWarning: split() requires a non-empty pattern match.
  return _compile(pattern, flags).split(string, maxsplit)
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-10-92edce10b119> in <module>()
      1 for i in range(1,26):
      2     file = open('email\\ham\\%d.txt' %i)
----> 3     fileRead = file.read()
      4     wordList = bayes.textParse(fileRead)
      5     docList.append(wordList)

UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 199: illegal multibyte sequence

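Every file in email\spam decodes without trouble, but something in email\ham raises the error, so I ran the ham files one at a time, starting from 1.txt: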
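Going through the files one by one can also be scripted. The following is only a sketch of the idea, not what I actually ran: the function name find_undecodable and the three temporary files are made up for the demo. Each file is read as raw bytes and decoded explicitly, so the check does not depend on the system's default encoding.

```python
import os
import tempfile


def find_undecodable(paths, encoding="gbk"):
    """Return (path, position, byte) for every file that fails to decode."""
    failures = []
    for path in paths:
        with open(path, "rb") as f:
            data = f.read()
        try:
            data.decode(encoding)
        except UnicodeDecodeError as exc:
            # exc.start is the offset of the first byte that could not be decoded.
            failures.append((path, exc.start, data[exc.start]))
    return failures


# Demo on throwaway files; only 2.txt contains a byte that GBK rejects.
demo_files = {
    "1.txt": b"plain ascii",
    "2.txt": b"bad \xae byte",
    "3.txt": b"more ascii",
}

with tempfile.TemporaryDirectory() as tmp_dir:
    paths = []
    for name, content in demo_files.items():
        path = os.path.join(tmp_dir, name)
        with open(path, "wb") as f:
            f.write(content)
        paths.append(path)
    failures = find_undecodable(paths)

failed_names = [(os.path.basename(p), pos, hex(b)) for p, pos, b in failures]
print(failed_names)
```

For the book's data, the list of paths would be something like ['email\\ham\\%d.txt' % i for i in range(1, 26)].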

In [33]:
    ...: file = open('email\\ham\\23.txt')
    ...: fileRead = file.read()
    ...: wordList = bayes.textParse(fileRead)
    ...: docList.append(wordList)
    ...: fullText.extend(wordList)
    ...: classList.append(1)
    ...:
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-33-598ed4b80c9a> in <module>()
      1
      2 file = open('email\\ham\\23.txt')
----> 3 fileRead = file.read()
      4 wordList = bayes.textParse(fileRead)
      5 docList.append(wordList)

UnicodeDecodeError: 'gbk' codec can't decode byte 0xae in position 199: illegal multibyte sequence

That located the problem: in the source file 23.txt, the sentence "SciFinance?is a derivatives pricing and risk model development tool that automatically generates C/C++ and GPU-enabled source code from concise, high-level model specifications. No parallel computing or CUDA programming expertise is required." contains a question mark that should not be there. After deleting it, the code above runs normally.
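The '?' an editor shows may simply be how it renders a byte it cannot display, so it can help to look at the raw bytes around the reported position before deleting anything. A minimal sketch, with a made-up sample in place of the real file (where the reported position was 199) and a hypothetical helper context_around:

```python
def context_around(data, pos, width=10):
    """Return the raw bytes from `width` before `pos` to `width` after it."""
    start = max(0, pos - width)
    return data[start:pos + width + 1]


# Hypothetical stand-in for the contents of 23.txt.
sample = b"SciFinance\xae is a derivatives pricing tool."

try:
    sample.decode("gbk")
    bad_pos = None
except UnicodeDecodeError as exc:
    bad_pos = exc.start  # offset of the byte the codec rejected

snippet = context_around(sample, bad_pos)
print(bad_pos, snippet)
```

For the real file, sample would instead be the result of open('email\\ham\\23.txt', 'rb').read().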