How to parse a unicode string with minidom?

I'm trying to parse a bunch of XML files with the xml.dom.minidom library, to extract some data and put it in a text file. Most of the files parse fine, but for some of them I get the following error when calling minidom.parseString():

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2019' in position 5189: ordinal not in range(128)

It happens for some other non-ascii characters too. My question is: what are my options here? Am I supposed to somehow strip/replace all those non-English characters before being able to parse the XML files?

5 Answers

#1

Try to decode it:

> print u'abcdé'.encode('utf-8')
> abcdÃ©

> print u'abcdé'.encode('utf-8').decode('utf-8')
> abcdé
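
A minimal sketch of why the first print shows abcdÃ©, written in Python 3 syntax rather than the Python 2 of the answer. It assumes the terminal interpreted the UTF-8 bytes as Latin-1, which the answer does not state.

```python
# Python 3 sketch; the Latin-1 terminal is an assumption, not stated in the answer.
text = "abcd\u00e9"                 # 'abcdé' as a unicode string
raw = text.encode("utf-8")          # UTF-8 bytes: 'é' becomes two bytes
mojibake = raw.decode("latin-1")    # reading those bytes with the wrong codec
round_trip = raw.decode("utf-8")    # decoding with the right codec restores the text
print(mojibake)
print(round_trip)
```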

#2

If your unicode string is stored in a variable named str:

xmldoc = minidom.parseString(u'{0}'.format(str).encode('utf-8'))

This worked for me.
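
The same idea as a self-contained sketch in Python 3 syntax: encode the text to UTF-8 bytes before handing it to the parser. The sample document and the variable names are made up for illustration.

```python
from xml.dom import minidom

# Hypothetical input: XML text containing the character from the error message.
xml_text = "<note>it\u2019s fine</note>"

# Encode to UTF-8 bytes before handing the document to the parser.
xmldoc = minidom.parseString(xml_text.encode("utf-8"))
content = xmldoc.documentElement.firstChild.data
print(repr(content))
```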

#3

Minidom doesn't directly support parsing Unicode strings; it's something that has historically had poor support and standardisation. Many XML tools recognise only byte streams as something an XML parser can consume.

If you have plain files, you should either read them in as byte strings (not Unicode!) and pass that to parseString(), or just use parse() which will read a file directly.
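
A short sketch of both options, in Python 3 syntax. The sample file and its contents are hypothetical and are created only so the example is self-contained.

```python
import os
import tempfile
from xml.dom import minidom

# Hypothetical sample file, written as UTF-8 bytes for this demonstration.
payload = '<?xml version="1.0" encoding="utf-8"?><note>it\u2019s fine</note>'.encode("utf-8")
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(payload)
    path = tmp.name

# Option 1: read the file as a byte string and pass it to parseString().
with open(path, "rb") as fh:
    doc_from_bytes = minidom.parseString(fh.read())

# Option 2: let parse() open and read the file itself.
doc_from_file = minidom.parse(path)

text_from_bytes = doc_from_bytes.documentElement.firstChild.data
text_from_file = doc_from_file.documentElement.firstChild.data
os.remove(path)
```

In both cases the parser picks up the encoding from the XML declaration, so no manual decoding is needed.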

#4

I know the O.P. asked about parsing strings, but I had the same exception upon writing the DOM model to a file via Document.writexml(...). In case people with that (related) problem land here, I will offer my solution.

My code which was throwing the UnicodeEncodeError looked like:

with tempfile.NamedTemporaryFile(delete=False) as fh:
    dom.writexml(fh, encoding="utf-8")

Note that the "encoding" parameter only affects the XML header and has no effect on how the data itself is encoded. To fix it, I changed it to:

import codecs

with tempfile.NamedTemporaryFile(delete=False) as fh:
    # Wrap the handle in a UTF-8 StreamWriter (same class as codecs.lookup("utf-8")[3])
    fh = codecs.getwriter("utf-8")(fh)
    dom.writexml(fh, encoding="utf-8")

This wraps the file handle in an instance of encodings.utf_8.StreamWriter, which writes the data as UTF-8 rather than ASCII, and the UnicodeEncodeError went away. I got the idea from reading the source of xml.dom.minidom.Node.toprettyxml(...).
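
For comparison, a sketch of the same fix in Python 3 syntax, where the file can simply be opened in text mode with an explicit encoding. The sample document and the use of tempfile.mkstemp are choices made for this illustration, not part of the original code.

```python
import os
import tempfile
from xml.dom import minidom

# Hypothetical document containing a non-ASCII character.
dom = minidom.parseString("<note>it\u2019s fine</note>".encode("utf-8"))

# Open the file in text mode with an explicit encoding so non-ASCII data is
# written as UTF-8; encoding= on writexml() only sets the XML declaration.
fd, path = tempfile.mkstemp(suffix=".xml")
with open(fd, "w", encoding="utf-8") as fh:
    dom.writexml(fh, encoding="utf-8")

# Read the result back as bytes to check what actually landed on disk.
with open(path, "rb") as fh:
    written = fh.read()
os.remove(path)
```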

#5

I've run into this error a few times, and my hacky way of dealing with it is just to do this:

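def getCleanString(word):
    # Keep only the characters that can be converted to an ASCII str
    clean = ""  # not named "str", which would shadow the built-in str()
    for character in word:
        try:
            clean += str(character)
        except UnicodeEncodeError:
            pass  # non-ASCII unicode character: drop it
    return clean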

Of course, this is probably a dumb way of doing it, but it gets the job done for me, and doesn't cost me anything in speed.
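
A more compact way to get the same effect, sketched in Python 3 syntax, is to let the codec skip the characters it cannot encode. The name get_clean_string and the sample input are hypothetical; like the loop above, this silently drops every non-ASCII character.

```python
def get_clean_string(word):
    # Hypothetical helper: drop every character that has no ASCII encoding.
    return word.encode("ascii", "ignore").decode("ascii")

cleaned = get_clean_string("it\u2019s caf\u00e9")
print(cleaned)
```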
