Traceback (most recent call last):
File "C:/Users/rohanhm.2014/PycharmProjects/untitled1/abc", line 11, in <module>
docs2 = [[w.lower() for w in doc]for doc in docs]
File "C:/Users/rohanhm.2014/PycharmProjects/untitled1/abc", line 11, in <listcomp>
docs2 = [[w.lower() for w in doc]for doc in docs]
File "C:/Users/rohanhm.2014/PycharmProjects/untitled1/", line 11, in <listcomp>
docs2 = [[w.lower() for w in doc]for doc in docs]
File "C:\Python34\lib\site-packages\nltk\corpus\reader\util.py", line 291, in iterate_from
['PROJECT', 'FINAL', 'REPORT', 'Revision', 'History', 'Date', 'Version', 'Author', 'Validated', 'by', 'Purpose', '4', '-', 'Dec', '-', '13', '0', '.', '1', 'EA', 'Initial', 'Document', '1', '/', '8', '/', '2014', '0', '.', '2', 'EA', '&', 'AHE', 'Combined', 'the', 'copy', 'for', 'both', 'MOE', 'and', 'MOA', '.', '1', '/', '8', '/', '2014', '0', '.', '3']
tokens = self.read_block(self._stream)
File "C:\Python34\lib\site-packages\nltk\corpus\reader\plaintext.py", line 117, in _read_word_block
words.extend(self._word_tokenizer.tokenize(stream.readline()))
File "C:\Python34\lib\site-packages\nltk\data.py", line 1095, in readline
new_chars = self._read(readsize)
File "C:\Python34\lib\site-packages\nltk\data.py", line 1322, in _read
chars, bytes_decoded = self._incr_decode(bytes)
File "C:\Python34\lib\site-packages\nltk\data.py", line 1352, in _incr_decode
return self.decode(bytes, 'strict')
File "C:\Python34\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 50: invalid continuation byte
I am trying to preprocess text using NLTK, but I keep running into this error. Any suggestions would be helpful.
1 solution
#1
It would help to see some of your code. That said, my guess is that your corpus reader should be using an encoding other than UTF-8, probably latin-1: the byte 0xe9 named in the error is "é" in latin-1, but in UTF-8 it is only valid as the start of a multi-byte sequence.
import nltk
corpus = nltk.corpus.reader.PlaintextCorpusReader(
    "/path/to/files", r'.*', encoding='latin-1')
See also here: UnicodeDecodeError, invalid continuation byte
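To check the encoding guess without NLTK, here is a minimal sketch using only the standard library. The sample bytes and the helper name decode_with_fallback are made up for illustration; they are not from the question or from NLTK.

```python
# Hypothetical sample: "café au lait" encoded as latin-1, so "é" is the single byte 0xe9.
raw = "caf\xe9 au lait".encode("latin-1")

def decode_with_fallback(data, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first candidate encoding that decodes cleanly."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            # Same error class as in the traceback; try the next candidate.
            continue
    raise ValueError("none of the candidate encodings could decode the data")

text, used = decode_with_fallback(raw)
print(used, repr(text))
```

Note that latin-1 maps every possible byte to a character, so it never raises. A successful latin-1 decode therefore does not prove the file really is latin-1; inspect the decoded text to confirm that accented characters look right.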