I'm trying to parse a bunch of XML files with the library xml.dom.minidom, to extract some data and put it in a text file. Most of the XML files parse fine, but for some of them I get the following error when calling minidom.parseString():
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5189: ordinal not in range(128)
It happens for some other non-ascii characters too. My question is: what are my options here? Am I supposed to somehow strip/replace all those non-English characters before being able to parse the XML files?
5 Solutions
#1
8
Try to decode it:
> print u'abcdé'.encode('utf-8')
> abcdé
> print u'abcdé'.encode('utf-8').decode('utf-8')
> abcdé
#2
3
In case your string is 'str':
xmldoc = minidom.parseString(u'{0}'.format(str).encode('utf-8'))
This worked for me.
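A minimal sketch of the same idea, written in Python 3 syntax rather than the Python 2 of the original post. The sample XML and the names `xml_text` and `xml_bytes` are made up for illustration. It encodes the text to UTF-8 bytes first and hands the bytes to parseString():

```python
from xml.dom import minidom

# Hypothetical sample document containing U+2019 (right single quotation mark).
xml_text = u"<note>It\u2019s fine</note>"

# Encode to UTF-8 bytes so the parser receives a byte string, not text.
xml_bytes = xml_text.encode("utf-8")
xmldoc = minidom.parseString(xml_bytes)

# The parsed text node still holds the original non-ASCII character.
note_text = xmldoc.documentElement.firstChild.data
```

No characters have to be stripped or replaced; the non-ASCII character survives the round trip.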
#3
2
Minidom doesn't directly support parsing Unicode strings; it's something that has historically had poor support and standardisation. Many XML tools recognise only byte streams as something an XML parser can consume.
If you have plain files, you should either read them in as byte strings (not Unicode!) and pass that to parseString(), or just use parse(), which will read a file directly.
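Both options might look roughly like this. This is a sketch in Python 3 syntax; the temporary sample file and all variable names are assumptions for the demo, not part of the original answer:

```python
import os
import tempfile
from xml.dom import minidom

# Hypothetical sample file, written as raw UTF-8 bytes for the demo.
sample = u'<?xml version="1.0" encoding="utf-8"?><note>It\u2019s fine</note>'.encode("utf-8")
fd, path = tempfile.mkstemp(suffix=".xml")
with os.fdopen(fd, "wb") as fh:
    fh.write(sample)

# Option 1: read the file in binary mode and pass the bytes to parseString().
with open(path, "rb") as fh:
    doc_from_bytes = minidom.parseString(fh.read())

# Option 2: let parse() open and read the file itself.
doc_from_file = minidom.parse(path)

text_from_bytes = doc_from_bytes.documentElement.firstChild.data
text_from_file = doc_from_file.documentElement.firstChild.data

# Clean up the demo file.
os.remove(path)
```

In both cases the parser works from the raw bytes and uses the encoding named in the XML declaration.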
#4
0
I know the O.P. asked about parsing strings, but I had the same exception upon writing the DOM model to a file via Document.writexml(...). In case people with that (related) problem land here, I will offer my solution.
My code which was throwing the UnicodeEncodeError looked like:
with tempfile.NamedTemporaryFile(delete=False) as fh:
    dom.writexml(fh, encoding="utf-8")
Note that the "encoding" param only affects the XML header and has no effect on the treatment of the data. To fix it, I changed it to:
with tempfile.NamedTemporaryFile(delete=False) as fh:
    fh = codecs.lookup("utf-8")[3](fh)
    dom.writexml(fh, encoding="utf-8")
This wraps the file handle with an instance of encodings.utf_8.StreamWriter, which handles the data as UTF-8 rather than ASCII, and the UnicodeEncodeError went away. I got the idea from reading the source of xml.dom.minidom.Node.toprettyxml(...).
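A self-contained sketch of that wrapping step, in Python 3 syntax. The in-memory buffer and the sample document are stand-ins for illustration, and codecs.getwriter("utf-8") is used here as a more readable way to get the same StreamWriter class as codecs.lookup("utf-8")[3]:

```python
import codecs
import io
from xml.dom import minidom

# Hypothetical sample document containing a non-ASCII character.
dom = minidom.parseString(u"<note>It\u2019s fine</note>".encode("utf-8"))

# An in-memory binary buffer stands in for the temporary file.
raw = io.BytesIO()

# Wrap the binary handle so that everything written to it is encoded as UTF-8.
writer = codecs.getwriter("utf-8")(raw)
dom.writexml(writer, encoding="utf-8")

# The underlying buffer now holds UTF-8 encoded bytes.
written = raw.getvalue()
```

The encoding argument to writexml() still only sets the declaration in the header; the wrapper is what actually encodes the data.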
#5
-2
I have encountered this error a few times, and my hacky way of dealing with it is just to do this:
def getCleanString(word):
    # Keep only the characters that convert to a byte string; drop the rest.
    clean = ""
    for character in word:
        try:
            clean = clean + str(character)
        except UnicodeEncodeError:
            pass  # this happens if the character is non-ASCII unicode
    return clean
Of course, this is probably a dumb way of doing it, but it gets the job done for me, and doesn't cost me anything in speed.
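For comparison, roughly the same effect can be had without a character-by-character loop by encoding with the "ignore" error handler. This is a sketch in Python 3 syntax, not the answerer's code; the name `get_clean_string` is hypothetical, and like the loop above it silently discards data:

```python
def get_clean_string(word):
    # Drop every character that has no ASCII encoding, then return text again.
    return word.encode("ascii", "ignore").decode("ascii")


# Hypothetical input with two non-ASCII characters (U+2019 and U+00E9).
cleaned = get_clean_string(u"It\u2019s caf\u00e9")
```

Since this throws away the apostrophe and the accented letter, it only suits cases where losing those characters is acceptable.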