Python 3.6 UTF-8 UnicodeEncodeError

Time: 2021-10-04 22:29:43
#!/usr/bin/env python3
import glob
import xml.etree.ElementTree as ET
filenames = glob.glob("C:\\Users\\####\\Desktop\\BNC2\\[A00-ZZZ]*.xml")
out_lines = []
for filename in filenames:
    with open(filename, 'r', encoding="utf-8") as content:
        tree = ET.parse(content)
        root = tree.getroot()
        for w in root.iter('w'):
            lemma = w.get('hw')
            pos = w.get('pos')
            tag = w.get('c5')

            out_lines.append(w.text + "," + lemma + "," + pos + "," + tag)

with open("C:\\Users\\####\\Desktop\\bnc.txt", "w") as out_file:
    for line in out_lines:
        line = bytes(line, 'utf-8').decode('utf-8', 'ignore')
        out_file.write("{}\n".format(line))

Gives the error:


UnicodeEncodeError: 'charmap' codec can't encode character '\u2192' in position 0: character maps to <undefined>


I thought this line would have solved that...


line = bytes(line, 'utf-8').decode('utf-8', 'ignore')


2 solutions

#1


3  

You need to specify the encoding when opening the output file, same as you did with the input file:


with open("C:\\Users\\####\\Desktop\\bnc.txt", "w", encoding="utf-8") as out_file:
    for line in out_lines:
        out_file.write("{}\n".format(line))
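To see why the original write fails and why the bytes(...).decode(...) line does not help, here is a minimal sketch. It assumes cp1252 as a stand-in for the Windows default file encoding (the 'charmap' codec in the traceback points to a legacy code page, but the exact one depends on the system locale), and the temporary file name is made up for illustration.

```python
import os
import tempfile

line = "\u2192,arrow"  # U+2192 RIGHTWARDS ARROW, the character from the traceback

# Encoding to UTF-8 and decoding straight back is a no-op: the result is
# still a str holding exactly the same characters.
round_tripped = bytes(line, "utf-8").decode("utf-8", "ignore")

# Assumption: cp1252 stands in for the default code page used by open()
# when no encoding is given. It has no mapping for U+2192.
try:
    line.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# With an explicit UTF-8 encoding the write succeeds regardless of locale.
path = os.path.join(tempfile.mkdtemp(), "bnc_demo.txt")  # hypothetical file
with open(path, "w", encoding="utf-8") as out_file:
    out_file.write("{}\n".format(line))

with open(path, "r", encoding="utf-8") as in_file:
    read_back = in_file.read()
```

The error comes from the file object encoding the text at write time, so cleaning the string beforehand cannot help; only the encoding passed to open() matters.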

#2


-1  

If your script has multiple reads and writes and you want one particular encoding (say UTF-8) for all of them, you can change the default encoding as well:


import sys
reload(sys)
sys.setdefaultencoding('UTF8')

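Use this only when the script has multiple reads/writes, and do it at the beginning of the script. Note that this is a Python 2 technique: in Python 3, reload is not a builtin and sys.setdefaultencoding does not exist, so it does not apply to the Python 3.6 script in the question.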
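For a Python 3 script like the one in the question, a similar "set it once" effect can be had without touching interpreter defaults. The sketch below is only one possible approach, not something from the original answer: open_utf8 is a hypothetical helper name, and the temporary file is made up for the demonstration.

```python
import functools
import os
import tempfile

# Hypothetical helper: the built-in open() with encoding="utf-8" preset,
# so every read and write in the script uses the same encoding.
open_utf8 = functools.partial(open, encoding="utf-8")

path = os.path.join(tempfile.mkdtemp(), "demo.txt")  # hypothetical file

with open_utf8(path, "w") as out_file:
    out_file.write("\u2192\n")

with open_utf8(path) as in_file:
    content = in_file.read()
```

Defining the helper once at the top of the script keeps the encoding choice in a single place while leaving every other open() call in the process unaffected.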


Changing default encoding of Python?
