UnicodeDecodeError: 'utf8' codec can't decode byte 0xb5 in position 894: invalid start byte

Date: 2022-01-30 10:41:31

I am using scikit-learn for a project. While performing feature extraction (working_with_text_data tutorial) I get UnicodeDecodeError: 'utf8' codec can't decode byte.

I am using Python 2.7.8 and built scikit-learn with make.

from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(dataset.data)
print(X_train_counts.shape)

How can I resolve this?

1 answer

#1


When loading the data with the load_files function, set the encoding to latin1:

from sklearn.datasets import load_files
twenty_train = load_files('path/to/folder', encoding='latin1')

In sklearn/datasets/twenty_newsgroups.py:

function _download_20newsgroups
...
load_files(train_path, encoding='latin1')
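To see why switching the encoding makes the error go away, here is a minimal sketch using only the standard library. The sample bytes and the helper try_decode are hypothetical, made up for illustration. Byte 0xb5 is a continuation byte in UTF-8, so it cannot start a character, while in Latin-1 it is the micro sign. Note that Latin-1 maps every byte to some character, so decoding with it never fails; that alone does not prove the files were actually written in Latin-1.

```python
# Hypothetical sample bytes; 0xb5 is the micro sign in Latin-1.
raw = b"wavelength: 5 \xb5m"


def try_decode(data, encoding):
    """Return (text, None) on success or (None, error) on failure."""
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        return None, exc


utf8_text, utf8_error = try_decode(raw, "utf-8")
latin1_text, latin1_error = try_decode(raw, "latin-1")

# The exception records where decoding stopped and why.
print(utf8_error)
print(utf8_error.start)

# Latin-1 decodes the same bytes without complaint.
print(repr(latin1_text))
```

The start attribute of the exception gives the offset of the offending byte, which is the same number reported as "position" in the error message from the question.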
