Why is ElementTree raising a ParseError?

I have been trying to parse a file with xml.etree.ElementTree:

import xml.etree.ElementTree as ET
from xml.etree.ElementTree import ParseError

def analyze(xml):
    it = ET.iterparse(open(xml, "rb"))  # open() instead of the Python 2-only file()
    count = 0
    last = None

    try:
        for (ev, el) in it:  # parsing happens lazily, during iteration
            count += 1
            last = el

    except ParseError:
        print("catastrophic failure")
        print("last successful: {0}".format(last))

    print('count: {0}'.format(count))

This is of course a simplified version of my code, but it is enough to break my program. If I remove the try/except block, I get this error with some files:

Traceback (most recent call last):
  File "<pyshell#22>", line 1, in <module>
    from yparse import analyze; analyze('file.xml')
  File "C:\Python27\yparse.py", line 10, in analyze
    for (ev, el) in it:
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1258, in next
    self._parser.feed(data)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1624, in feed
    self._raiseerror(v)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
    raise err
ParseError: reference to invalid character number: line 1, column 52459

The results are deterministic, though: if a file works, it always works. If a file fails, it always fails, and always at the same point.

The strangest thing is I'm using the trace to find out if I have any malformed XML that's breaking the parser. I then isolate the node that caused the failure. But when I create an XML file containing that node and a few of its neighbors, the parsing works!

This doesn't seem to be a size problem either. I have managed to parse much larger files with no problems.

Any ideas?

4 Solutions

#1

As @John Machin suggested, the files in question do have dubious numeric entities in them, though the error messages seem to be pointing at the wrong place in the text. Perhaps the streaming nature and buffering are making it difficult to report accurate positions.

In fact, all of these entities appear in the text:

set(['&#x08;', '&#x0E;', '&#x1E;', '&#x1C;', '&#x18;', '&#x04;', '&#x0A;', '&#x0C;', '&#x16;', '&#x14;', '&#x06;', '&#x00;', '&#x10;', '&#x02;', '&#x0D;', '&#x1D;', '&#x0F;', '&#x09;', '&#x1B;', '&#x05;', '&#x15;', '&#x01;', '&#x03;'])

Most of these are not allowed: of the references listed, XML 1.0 permits only &#x09;, &#x0A; and &#x0D;. This parser is strict about that, so you'll need to either find a more lenient parser or pre-process the XML.
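
One way the pre-processing could look is sketched below. This is only an illustration written for Python 3, not something from the original answer: the names `is_valid_xml_char` and `strip_invalid_char_refs` are made up for this example, and it assumes the whole document fits in memory as a string. Note that dropping the references changes the data, so whether that is acceptable depends on what the text is used for.

```python
import re
import xml.etree.ElementTree as ET

# Matches decimal (&#1;) and hexadecimal (&#x01;) character references.
CHAR_REF = re.compile(r"&#(?:([0-9]+)|x([0-9a-fA-F]+));")


def is_valid_xml_char(num):
    """Return True if code point `num` is allowed by the XML 1.0 Char rule."""
    return (
        num in (0x9, 0xA, 0xD)
        or 0x20 <= num <= 0xD7FF
        or 0xE000 <= num <= 0xFFFD
        or 0x10000 <= num <= 0x10FFFF
    )


def strip_invalid_char_refs(text, replacement=""):
    """Replace references to forbidden characters; keep valid ones as-is."""
    def fix(match):
        dec, hexa = match.groups()
        num = int(dec) if dec is not None else int(hexa, 16)
        return match.group(0) if is_valid_xml_char(num) else replacement
    return CHAR_REF.sub(fix, text)


if __name__ == "__main__":
    dirty = "<a>ok&#x09;tab&#x01;&#0;bad&#xFFFF;</a>"
    clean = strip_invalid_char_refs(dirty)
    print(clean)
    root = ET.fromstring(clean)  # parses now that the bad references are gone
    print(repr(root.text))
```

Passing something like `replacement="?"` instead of the empty string keeps a visible marker where a reference was removed.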

#2

Here are some ideas:

(0) Explain "a file" and "occasionally": do you really mean it works sometimes and fails sometimes with the same file?

Do the following for each failing file:

(1) Find out what is in the file at the point that it is complaining about:

text = open("the_file.xml", "rb").read()
err_col = 52459  # column number reported in the ParseError message
print(repr(text[err_col-50:err_col+100]))  # should include the error text
print(repr(text[:50]))  # show the XML declaration

(2) Throw your file at a web-based XML validation service e.g. http://www.validome.org/xml/ or http://validator.aborla.net/

and edit your question to display your findings.
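
As a variation on step (1), the position can be read from the exception instead of being copied out of the message: `ParseError` carries a `position` attribute holding a `(line, column)` tuple. The sketch below is written for Python 3, and the helper names `find_parse_error` and `context` are invented for this example. It assumes the line number is 1-based, the column is 0-based, and lines are separated by "\n"; treat the slice as approximate.

```python
import io
import xml.etree.ElementTree as ET


def find_parse_error(source):
    """Run iterparse over `source` (a binary file object).

    Returns the (line, column) of the first ParseError, or None if the
    whole document parses.
    """
    try:
        for _event, elem in ET.iterparse(source):
            elem.clear()  # we only care whether parsing succeeds
    except ET.ParseError as err:
        return err.position
    return None


def context(text, position, width=20):
    """Return the text surrounding a (line, column) position."""
    line, column = position
    bad_line = text.split("\n")[line - 1]
    return bad_line[max(0, column - width):column + width]


if __name__ == "__main__":
    xml_text = "<a>&#1;</a>"
    pos = find_parse_error(io.BytesIO(xml_text.encode("utf-8")))
    print(pos)
    if pos is not None:
        print(repr(context(xml_text, pos)))
```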

Update: Here is the minimal xml file that illustrates your problem:

[badcharref.xml]
<a>&#1;</a>

[Python 2.7.1 output]
>>> import xml.etree.ElementTree as ET
>>> it = ET.iterparse(file("badcharref.xml"))
>>> for ev, el in it:
...     print el.tag
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\python27\lib\xml\etree\ElementTree.py", line 1258, in next
    self._parser.feed(data)
  File "C:\python27\lib\xml\etree\ElementTree.py", line 1624, in feed
    self._raiseerror(v)
  File "C:\python27\lib\xml\etree\ElementTree.py", line 1488, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: reference to invalid character number: line 1, column 3
>>>

Not all valid Unicode characters are valid in XML. See the XML 1.0 Specification.

You may wish to examine your files using regexes like r'&#([0-9]+);' and r'&#x([0-9A-Fa-f]+);': convert the matched text to an integer code point and check it against the valid ranges from the spec, i.e. #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]

... or maybe the numeric character reference is syntactically invalid, e.g. not terminated by a ';', or '&#' followed by something that is not a digit, etc.

Update 2: I was wrong; the number in the ElementTree error message counts Unicode code points, not bytes. See the code below and the snippets of output from running it over the two bad files.

# coding: ascii
# Find numeric character references that refer to Unicode code points
# that are not valid in XML.
# Get byte offsets for seeking etc in undecoded file bytestreams.
# Get unicode offsets for checking against ElementTree error message,
# **IF** your input file is small enough. 

BYTE_OFFSETS = True
import sys, re, codecs
fname = sys.argv[1]
print fname
if BYTE_OFFSETS:
    text = open(fname, "rb").read()
else:
    # Assumes file is encoded in UTF-8.
    text = codecs.open(fname, "rb", "utf8").read()
rx = re.compile("&#([0-9]+);|&#x([0-9a-fA-F]+);")
endpos = len(text)
pos = 0
while pos < endpos:
    m = rx.search(text, pos)
    if not m: break
    mstart, mend = m.span()
    target = m.group(1)
    if target:
        num = int(target)
    else:
        num = int(m.group(2), 16)
    # #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    if not(num in (0x9, 0xA, 0xD) or 0x20 <= num <= 0xD7FF
    or 0xE000 <= num <= 0xFFFD or 0x10000 <= num <= 0x10FFFF):
        print mstart, m.group()
    pos = mend

Output:

comments.xml
6615405 &#x10;
10205764 &#x00;
10213901 &#x00;
10213936 &#x00;
10214123 &#x00;
13292514 &#x03;
...
155656543 &#x1B;
155656564 &#x1B;
157344876 &#x10;
157722583 &#x10;

posts.xml
7607143 &#x1F;
12982273 &#x1B;
12982282 &#x1B;
12982292 &#x1B;
12982302 &#x1B;
12982310 &#x1B;
16085949 &#x1C;
16085955 &#x1C;
...
36303479 &#x12;
36303494 &#xFFFF; <<=== whoops
38942863 &#x10;
...
785292911 &#x08;
801282472 &#x13;
848911592 &#x0B;
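
The script above reads the whole file into memory, which gets awkward for files of this size. Here is a rough Python 3 variant of the same byte-offset scan that memory-maps the file instead, so the OS pages it in as the regex runs. It is a sketch rather than a drop-in replacement: `invalid_char_refs` and `is_valid_xml_char` are names chosen for this example, and it reports byte offsets only.

```python
import mmap
import os
import re
import tempfile

# Bytes pattern: decimal or hexadecimal numeric character references.
CHAR_REF = re.compile(rb"&#(?:([0-9]+)|x([0-9a-fA-F]+));")


def is_valid_xml_char(num):
    """Return True if code point `num` is allowed by the XML 1.0 Char rule."""
    return (
        num in (0x9, 0xA, 0xD)
        or 0x20 <= num <= 0xD7FF
        or 0xE000 <= num <= 0xFFFD
        or 0x10000 <= num <= 0x10FFFF
    )


def invalid_char_refs(path):
    """Return [(byte_offset, reference), ...] for references XML 1.0 forbids."""
    found = []
    with open(path, "rb") as f:
        # Map the file read-only; an empty file cannot be mapped.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            for m in CHAR_REF.finditer(mm):
                dec, hexa = m.groups()
                num = int(dec) if dec is not None else int(hexa, 16)
                if not is_valid_xml_char(num):
                    found.append((m.start(), m.group(0).decode("ascii")))
    return found


if __name__ == "__main__":
    # Small self-contained demo on a temporary file.
    path = os.path.join(tempfile.mkdtemp(), "demo.xml")
    with open(path, "wb") as out:
        out.write(b"<a>&#x10;ok&#65;&#xFFFF;</a>")
    for offset, ref in invalid_char_refs(path):
        print(offset, ref)
```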

#3

I'm not sure if this answers your question, but if you want to catch the ParseError raised by ElementTree without importing it separately, you can refer to it through the module alias:

except ET.ParseError:
    print("catastrophic failure")
    print("last successful: {0}".format(last))

Source: http://effbot.org/zone/elementtree-13-intro.htm

#4

It may also be worth noting that you can catch the error, rather than letting it stop your program, by using the same try/except approach you already use later in the function. Place the statement:

it = ET.iterparse(open(xml, "rb"))

inside a try/except block:

try:
    it = ET.iterparse(open(xml, "rb"))
except Exception:
    print('iterparse error')

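Of course, this will not fix your XML file or your pre-processing, but it can help identify which file is causing the error if you are parsing many of them. Note that iterparse parses lazily: as the traceback in the question shows, the ParseError is raised while iterating, so the loop over the iterator must also be inside the try block to catch it.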
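
To make the "which file is failing" idea concrete, here is a small Python 3 sketch. The function name `check_files` and the demo file names are invented for this example. Since iterparse does its parsing during iteration, the loop sits inside the try block along with the call that opens the file.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def check_files(paths):
    """Return {path: error message} for every file that fails to parse."""
    failures = {}
    for path in paths:
        try:
            with open(path, "rb") as f:
                # The parse runs inside this loop, not in iterparse() itself.
                for _event, elem in ET.iterparse(f):
                    elem.clear()  # keep memory use low on large files
        except (OSError, ET.ParseError) as err:
            failures[path] = str(err)
    return failures


if __name__ == "__main__":
    # Demo: one good file, one with a forbidden character reference.
    folder = tempfile.mkdtemp()
    good = os.path.join(folder, "good.xml")
    bad = os.path.join(folder, "bad.xml")
    with open(good, "wb") as out:
        out.write(b"<a>fine</a>")
    with open(bad, "wb") as out:
        out.write(b"<a>&#1;</a>")
    for name, message in check_files([good, bad]).items():
        print(name, "->", message)
```

Catching only `OSError` and `ET.ParseError`, rather than everything, means unrelated bugs still surface normally.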
