#!/usr/bin/env python3
import glob
import xml.etree.ElementTree as ET

filenames = glob.glob("C:\\Users\\####\\Desktop\\BNC2\\[A00-ZZZ]*.xml")
out_lines = []
for filename in filenames:
    with open(filename, 'r', encoding="utf-8") as content:
        tree = ET.parse(content)
        root = tree.getroot()
        for w in root.iter('w'):
            lemma = w.get('hw')
            pos = w.get('pos')
            tag = w.get('c5')
            out_lines.append(w.text + "," + lemma + "," + pos + "," + tag)

with open("C:\\Users\\####\\Desktop\\bnc.txt", "w") as out_file:
    for line in out_lines:
        line = bytes(line, 'utf-8').decode('utf-8', 'ignore')
        out_file.write("{}\n".format(line))
Gives the error:
UnicodeEncodeError: 'charmap' codec can't encode character '\u2192' in position 0: character maps to undefined
I thought this line would have solved that...
line = bytes(line, 'utf-8').decode('utf-8', 'ignore')
2 Answers
#1
3
You need to specify the encoding when opening the output file, same as you did with the input file:
with open("C:\\Users\\####\\Desktop\\bnc.txt", "w", encoding="utf-8") as out_file:
    for line in out_lines:
        out_file.write("{}\n".format(line))
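To illustrate why the explicit encoding matters: without it, open() falls back to the platform's preferred encoding, and the 'charmap' codec in the traceback suggests a legacy Windows code page. The sketch below assumes cp1252 as that code page (the question does not say which one was in effect). It shows that U+2192 cannot be encoded there but can be in UTF-8, and that the bytes(...).decode(...) round trip from the question returns the string unchanged, so it cannot prevent the error.

```python
arrow = "\u2192"  # RIGHTWARDS ARROW, the character named in the traceback

# Assumption: cp1252 stands in for the Windows default code page.
try:
    arrow.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# UTF-8 can represent every Unicode code point, including this one.
utf8_bytes = arrow.encode("utf-8")

# The round trip from the question: encode to UTF-8, decode from UTF-8.
# The result is the same str, so the arrow is still there when the
# file object later tries to encode it with its own codec.
round_tripped = bytes(arrow, "utf-8").decode("utf-8", "ignore")

print(cp1252_ok, utf8_bytes, round_tripped == arrow)
```

The encoding step that fails happens inside out_file.write(), which is why the fix belongs in the open() call rather than in the string handling.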
#2
-1
If your script has multiple reads and writes and you want one particular encoding (say, UTF-8) for all of them, you can change the default encoding instead:
import sys
reload(sys)
sys.setdefaultencoding('UTF8')
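Use this only when the script has multiple reads and writes, and do it at the beginning of the script. Note that this is a Python 2 technique: in Python 3, reload is no longer a builtin and sys.setdefaultencoding does not exist, so it does not apply to the python3 script in the question.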
Changing default encoding of Python?
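sys.setdefaultencoding was a Python 2 facility and is not available under Python 3, which the question's shebang targets. One possible way to get a similar "same encoding everywhere" effect in Python 3 is to wrap open() once and reuse the wrapper. This is only a sketch: the name open_utf8 and the temporary demo file are hypothetical, not part of the original script.

```python
import functools
import os
import tempfile

# Hypothetical helper: the built-in open() with encoding="utf-8" pre-filled.
open_utf8 = functools.partial(open, encoding="utf-8")

# Write and read back a line containing the character from the traceback,
# using a throwaway file instead of the paths from the question.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
with open_utf8(path, "w") as out_file:
    out_file.write("\u2192,arrow\n")
with open_utf8(path) as in_file:
    content = in_file.read()

print(content == "\u2192,arrow\n")
```

Every read and write that goes through open_utf8 then uses UTF-8 regardless of the platform default, while plain open() calls elsewhere are unaffected.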