xml.etree.ElementTree与lxml.etree:不同的内部节点表示?

时间:2020-12-11 18:26:28

I have been transforming some of my original xml.etree.ElementTree (ET) code to lxml.etree (lxmlET). Luckily there are a lot of similarities between the two. However, I did stumble upon some strange behaviour that I cannot find written down in any documentation. It considers the internal representation of descendant nodes.

我一直在将一些原始的xml.etree.ElementTree(ET)代码转换为lxml.etree(lxmlET)。幸运的是,两者之间有很多相似之处。但是,我偶然发现了一些我在任何文档中都找不到的奇怪行为。它考虑后代节点的内部表示。

In ET, iter() is used to iterate over all descendants of an Element, optionally filtered by tag name. Because I could not find any details about this in the documentation, I expected similar behaviour for lxmlET. The thing is that from testing I conclude that in lxmlET, there is a different internal representation of a tree.

在ET中,iter()用于迭代Element的所有后代,可选地按标记名称进行过滤。因为我在文档中找不到任何有关此内容的详细信息,所以我预计lxmlET会出现类似的行为。问题是,通过测试我得出结论,在lxmlET中,有一个不同的树内部表示。

In the example below, I iterate over nodes in a tree and print each node's children, but in addition I also create all different combinations of those children and print those. This means, if an element has children ('A', 'B', 'C') I create alterations, namely trees [('A'), ('A', 'B'), ('A', 'C'), ('B'), ('B', 'C'), ('C')].

在下面的示例中,我迭代树中的节点并打印每个节点的子节点,但此外我还创建了这些子节点的所有不同组合并打印它们。这意味着,如果一个元素有子元素('A','B','C'),我会创建更改,即树[('A'),('A','B'),('A',' C'),('B'),('B','C'),('C')]。

# import lxml.etree as ET
import xml.etree.ElementTree as ET
from itertools import combinations
from copy import deepcopy


def get_combination_trees(tree):
    children = list(tree)
    for i in range(1, len(children)):
        for combination in combinations(children, i):
            new_combo_tree = ET.Element(tree.tag, tree.attrib)
            for recombined_child in combination:
                new_combo_tree.append(recombined_child)
                # when using lxml a deepcopy is required to make this work (or make change in parse_xml)
                # new_combo_tree.append(deepcopy(recombined_child))
            yield new_combo_tree

    return None


def parse_xml(tree_p):
    for node in ET.fromstring(tree_p):
        if not node.tag == 'node_main':
            continue
        # replace by node.xpath('.//node') for lxml (or use deepcopy in get_combination_trees)
        for subnode in node.iter('node'):
            children = list(subnode)
            if children:
                print('-'.join([child.attrib['id'] for child in children]))
            else:
                print(f'node {subnode.attrib["id"]} has no children')

            for combo_tree in get_combination_trees(subnode):
                combo_children = list(combo_tree)
                if combo_children:
                    print('-'.join([child.attrib['id'] for child in combo_children]))    

    return None


s = '''<root>
  <node_main>
    <node id="1">
      <node id="2" />
      <node id="3">
        <node id="4">
          <node id="5" />
        </node>
        <node id="6" />
      </node>
    </node>
  </node_main>
</root>
'''

parse_xml(s)

The expected output here is the id's of the children of each node joined together with a hyphen, and also all possible combinations of the children (cf. supra) in a top-down breadth-first fashion.

这里的预期输出是用连字符连接在一起的每个节点的子节点的id,以及以自上而下的广度优先方式的子节点的所有可能组合(参见上文)。

2-3
2
3
node 2 has no children
4-6
4
6
5
node 5 has no children
node 6 has no children

However, when you use the lxml module instead of xml (uncomment the import for lxmlET and comment the import for ET), and run the code you'll see that the output is

但是,当您使用lxml模块而不是xml时(取消注释lxmlET的导入并注释ET的导入),并运行代码,您将看到输出是

2-3
2
3
node 2 has no children

So the deeper descendant nodes are never visited. This can be circumvented by either:

因此,永远不会访问更深层次的后代节点。这可以通过以下两种方式规避:

  1. using deepcopy (comment/uncomment relevant part in get_combination_trees()), or
  2. 使用deepcopy(在get_combination_trees()中注释/取消注释相关部分),或
  3. using for subnode in node.xpath('.//node') in parse_xml() instead of iter().
  4. 在parse_xml()中使用node.xpath('.// node')中的子节点而不是iter()。

So I know that there is a way around this, but I am mainly wondering what is happening?! It took me ages to debug this, and I can't find any documentation on it. What is going on, what is the actual underlying difference here between the two modules? And what is the most efficient work-around when working with very large trees?

所以我知道有一种解决方法,但我主要想知道发生了什么?!我花了很长时间来调试它,我找不到任何文档。发生了什么,两个模块之间的实际底层差异是什么?使用非常大的树木时,最有效的解决方法是什么?

3 个解决方案

#1


2  

While Louis's answer is correct and I completely agree that modifying a data structure as you traverse it generally a Bad Idea(tm), you also asked why the code works with xml.etree.ElementTree and not lxml.etree and there is a very reasonable explanation for that.

虽然路易斯的答案是正确的,我完全同意在你遍历它时修改数据结构通常是一个坏主意(tm),你也问为什么代码适用于xml.etree.ElementTree而不是lxml.etree并且有一个非常合理的对此的解释。

Implementation of .append in xml.etree.ElementTree

This library is implemented directly in Python and could vary depending on which Python runtime you're using. Assuming you're using CPython, the implementation you're looking for is implemented in vanilla Python:

该库直接在Python中实现,可能因您使用的Python运行时而异。假设您正在使用CPython,您正在寻找的实现是在vanilla Python中实现的:

def append(self, subelement):
    """Add *subelement* to the end of this element.
    The new element will appear in document order after the last existing
    subelement (or directly after the text, if it's the first subelement),
    but before the end tag for this element.
    """
    self._assert_is_element(subelement)
    self._children.append(subelement)

The last line is the only part we're concerned with. As it turns out, self._children is initialized towards the top of that file as:

最后一行是我们唯一关注的部分。事实证明,self._children被初始化为该文件的顶部,如下所示:

self._children = []

So adding a child to a tree is just appending an element to a list. Intuitively, that's exactly what you're looking for (in this case) and the implementation behaves in a completely unsurprising way.

因此,将一个子项添加到树只是将一个元素附加到列表中。直观地说,这正是您正在寻找的(在这种情况下),并且实现的行为完全不令人惊讶。

Implementation .append in lxml.etree

lxml is implemented as a mix of Python, non-trivial Cython, and C code so spelunking through it was significantly harder than the pure-Python implementation. First off, .append is implemented as:

lxml是作为Python,非平凡的Cython和C代码的混合实现的,因此通过它进行探索比纯Python实现要困难得多。首先,.append实现为:

def append(self, _Element element not None):
    u"""append(self, element)
    Adds a subelement to the end of this element.
    """
    _assertValidNode(self)
    _assertValidNode(element)
    _appendChild(self, element)

_appendChild is implemented over in apihelper.pxi:

_appendChild在apihelper.pxi中实现:

cdef int _appendChild(_Element parent, _Element child) except -1:
    u"""Append a new child to a parent element.
    """
    c_node = child._c_node
    c_source_doc = c_node.doc
    # prevent cycles
    if _isAncestorOrSame(c_node, parent._c_node):
        raise ValueError("cannot append parent to itself")
    # store possible text node
    c_next = c_node.next
    # move node itself
    tree.xmlUnlinkNode(c_node)
    tree.xmlAddChild(parent._c_node, c_node)
    _moveTail(c_next, c_node)
    # uh oh, elements may be pointing to different doc when
    # parent element has moved; change them too..
    moveNodeToDocument(parent._doc, c_source_doc, c_node)
    return 0

There's definitely a bit more going on here. In particular, lxml explicitly removes the node from the tree and then adds it elsewhere. This prevents you from accidentally creating a cyclic XML graph while manipulating nodes (which is something you could probably do with the xml.etree version).

肯定会有更多的事情发生在这里。特别是,lxml显式地从树中删除该节点,然后将其添加到其他位置。这可以防止您在操作节点时意外创建循环XML图形(这可能是您对xml.etree版本可能执行的操作)。

Workarounds for lxml

Now that we know that xml.etree copies nodes when appending but lxml.etree moves them, why do those workarounds work? Based on the tree.xmlUnlinkNode method (which is actually defined in C inside of libxml2), unlinking just messes with a bunch of pointers. So, anything that copies node metadata will do the trick. Because all of the metadata we care about are direct fields on the xmlNode struct, anything that shallow copies nodes will do the trick

现在我们知道xml.etree在追加时会复制节点,但是lxml.etree会移动它们,为什么这些变通办法有效呢?基于tree.xmlUnlinkNode方法(实际上是在libxml2中的C语言中定义),取消链接只是用一堆指针混淆。因此,复制节点元数据的任何事情都可以解决问题。因为我们关心的所有元数据都是xmlNode结构上的直接字段,所以浅层复制节点的任何东西都可以解决问题

  • copy.deepcopy() definitely works
  • copy.deepcopy()绝对有效
  • node.xpath returns nodes wrapped in proxy elements which happens to shallow copy the tree metadata
  • node.xpath返回包含在代理元素中的节点,这些节点恰好浅层复制树元数据
  • copy.copy() also does the trick
  • copy.copy()也可以解决问题
  • If you don't need your combinations to actually be in an official tree, setting new_combo_tree = [] also gives you list appending just like xml.etree.
  • 如果您不需要将组合实际存在于官方树中,则设置new_combo_tree = []也会为您提供与xml.etree类似的列表。

If you're really concerned about performance and large trees, I'd probably start with shallow copying with copy.copy() although you should absolutely profile a few different options and see which one works best for you.

如果你真的关心性能和大树,我可能会开始使用copy.copy()进行浅层复制,尽管你应该绝对描述一些不同的选项,看看哪一个最适合你。

#2


3  

Copying Problem

In general, the safe thing to do when you are manipulating an XML tree and want to copy information in multiple places in the tree (by opposition to moving information from one place to another) is to perform a deep copy operation on those elements rather than just add them to their new location. The vast majority of XML parsing libraries that produce trees require you to perform a deep copy if you want to copy structures around. They just won't give you the results you want if you do not deep copy. lxml is one such library that requires you to deep copy the structures you want to copy.

通常,当您操作XML树并希望在树中的多个位置复制信息(通过反对将信息从一个地方移动到另一个地方)时,安全的事情是对这些元素执行深度复制操作,而不是只需将它们添加到新位置即可。如果要复制结构,绝大多数生成树的XML解析库都要求您执行深层复制。如果你不进行深层复制,他们就不会给你想要的结果。 lxml就是这样一个库,它要求您深度复制要复制的结构。

The fact that xml.etree.ElementTree works in a way such that .append effectively allows you to have the same element in two places in the tree is definitely unusual in my experience.

事实上,xml.etree.ElementTree的工作方式使得.append有效地允许你在树中的两个地方拥有相同的元素,这在我的经验中绝对是不寻常的。

Walking-while-Modifying Problem

You mentioned that for subnode in node.xpath('.//node') also solves you problem. Note that if you use for subnode in list(node.iter('node')), you'll get the same result. What is going on here is that using list(node.iter('node')) or node.xpath('.//node') or using deepcopy to copy the nodes instead of moving them protect you against another problem with your code: you are walking a structure while modifying it.

你提到过node.xpath('.// node')中的子节点也解决了你的问题。请注意,如果您在list(node.iter('node'))中使用子节点,则会得到相同的结果。这里发生的是使用list(node.iter('node'))或node.xpath('.// node')或使用deepcopy复制节点而不是移动它们可以保护您免受代码的另一个问题:你正在修改它时走一个结构。

node.iter('node') creates an iterator that goes over the XML structure as you iterate it. If you wrap it in list(), then the structure is walked immediately and the result put in a list. So you've effectively taken a snapshot of the structure before you walk it. That prevents your walking operation from being affected by changes to the tree. If you do node.xpath('.//node') you are also obtaining a snapshot of the tree before you walk it because that method returns a list of nodes. And if you do a deepcopy of the nodes and append the copy of the node instead of appending the original node, then you are not modifying the tree you are walking while you are walking it.

node.iter('node')创建一个迭代器,在迭代它时遍历XML结构。如果将它包装在list()中,则立即走结构并将结果放入列表中。所以你在走之前已经有效地拍摄了结构的快照。这可以防止您的步行操作受到树的更改的影响。如果你执行node.xpath('.// node'),那么在你走之前你也会获得树的快照,因为该方法返回一个节点列表。如果您对节点执行深度复制并附加节点的副本而不是附加原始节点,那么您在行走时不会修改正在行走的树。

Whether you can get away with using XPath or using node.xpath('.//node') instead of using deepcopy depends on what you plan to do with your combinations. The code you show in your question prints the combinations to the screen as soon you create them. They look fine when you print them, but if you do not use a deepcopy for creating them, then as soon as you create a new combination, the old one will get messed up because any node that appeared in the old combination and needs to appear in the new one will be moved instead of copied.

您是否可以使用XPath或使用node.xpath('.// node')而不是使用deepcopy取决于您计划对组合执行的操作。您在问题中显示的代码会在创建后立即将组合打印到屏幕上。它们在打印时看起来很好,但是如果你不使用深度复制来创建它们,那么只要你创建一个新的组合,旧的组合就会搞砸,因为旧组合中出现的任何节点都需要出现在新的一个将被移动而不是复制。


And what is the most efficient work-around when working with very large trees?

使用非常大的树木时,最有效的解决方法是什么?

It depends on the specifics of your application and the data you need to parse. You gave one example which is a small document but you ask about "large trees". What applies to small documents does not necessarily transfer to large documents. You can optimize for case X but if case X happens only extremely rarely in real data, then your optimization may not pan out. In some cases, it may actually be harmful.

这取决于您的应用程序的细节和您需要解析的数据。你举了一个例子,这是一个小文件,但你问的是“大树”。什么适用于小型文件不一定转移到大型文件。您可以针对案例X进行优化,但如果案例X在实际数据中极少发生,那么您的优化可能不会成功。在某些情况下,它实际上可能是有害的。

In one application of mine, I had to replace references to some structures with the structures themselves. A simplified illustration would be a document that contains elements like <define id="...">...</def> and references like <ref idref="..."/>. Every instance of ref would have to be replaced with the define it points to. In general, this may mean copying a single define multiple times but sometimes a define may be referred by only one ref so one optimization was to detect this and skip the deep copy in those cases where there was only one reference. I got this optimization "for free" because the application already required recording each instance of ref and define for other purposes. If I've had to add bookkeeping just for this optimization, it is not clear it would have been worth it.

在我的一个应用程序中,我不得不用结构本身替换对某些结构的引用。简化图示将是包含 ... 等元素以及 等引用的文档。每个ref实例都必须用它指向的define替换。通常,这可能意味着多次复制单个定义,但有时定义可能仅由一个引用引用,因此一个优化是检测到这一点并在只有一个引用的情况下跳过深层副本。我“免费”获得了这个优化,因为应用程序已经需要记录每个ref实例并为其他目的定义。如果我不得不为这个优化添加簿记,那么不清楚它是否值得。

#3


0  

At the beginning I didn't think there was such a difference (neither did I check), but both @supersam654 and @Louis answers pinpointed it very clearly.

一开始我并没有想到会有这样的差异(我也没有检查过),但是@ supersam654和@Louis的答案都非常明确地指出了它。

But code that is dependent on internal representation (rather than interface) of stuff that it uses, doesn't seem right (from design PoV) to me. Also, as I was asking in my comment: combo_children seems totally useless:

但是依赖于它所使用的东西的内部表示(而不是接口)的代码似乎不对(从设计PoV)到我。另外,正如我在评论中提到的那样:combo_children似乎完全没用:

  1. Get child nodes combo (as a list)
  2. 获取子节点组合(作为列表)
  3. Append each node from the list as a child to combo_children
  4. 将作为子项的列表中的每个节点附加到combo_children
  5. Return combo_children
  6. 返回combo_children
  7. Get combo_children children (as a list)
  8. 获取combo_children子项(作为列表)
  9. Use the list (combo)
  10. 使用列表(组合)

when things could be easily done:

当事情可以轻松完成:

  1. Get child nodes combo (as a list)
  2. 获取子节点组合(作为列表)
  3. Return the list
  4. 返回列表
  5. Use the list (combo)
  6. 使用列表(组合)

Apparently, the combo_children approach was also exposing the behavioral difference between the modules.

显然,combo_children方法也暴露了模块之间的行为差​​异。

code_orig_lxml.py:

code_orig_lxml.py:

import lxml.etree as ET
#import xml.etree.ElementTree as ET
from itertools import combinations
from copy import deepcopy


def get_combination_trees(tree):
    children = list(tree)
    for i in range(1, len(children)):
        for combination in combinations(children, i):
            #new_combo_tree = ET.Element(tree.tag, tree.attrib)
            #for recombined_child in combination:
                #new_combo_tree.append(recombined_child)
                # when using lxml a deepcopy is required to make this work (or make change in parse_xml)
                # new_combo_tree.append(deepcopy(recombined_child))
            #yield new_combo_tree
            yield combination

    return None


def parse_xml(tree_p):
    for node in ET.fromstring(tree_p):
        if not node.tag == 'node_main':
            continue
        # replace by node.xpath('.//node') for lxml (or use deepcopy in get_combination_trees)
        for subnode in node.iter('node'):
            children = list(subnode)
            if children:
                print('-'.join([child.attrib['id'] for child in children]))
            else:
                print(f'node {subnode.attrib["id"]} has no children')

            #for combo_tree in get_combination_trees(subnode):
            for combo_children in get_combination_trees(subnode):
                #combo_children = list(combo_tree)
                if combo_children:
                    print('-'.join([child.attrib['id'] for child in combo_children]))

    return None


s = '''<root>
  <node_main>
    <node id="1">
      <node id="2" />
      <node id="3">
        <node id="4">
          <node id="5" />
        </node>
        <node id="6" />
      </node>
    </node>
  </node_main>
</root>
'''

parse_xml(s)

Notes:

笔记:

  • This is your code with the changes above
  • 这是您的代码,其中包含上述更改
  • I didn't removed anything, instead just commented stuff (which would generate the smallest diff between the old and new versions)
  • 我没有删除任何东西,只是评论了东西(这将产生旧版本和新版本之间的最小差异)

Output:

输出:

(py36x86_test) e:\Work\Dev\*\q050749937>"e:\Work\Dev\VEnvs\py36x86_test\Scripts\python.exe" code_orig_lxml.py
2-3
2
3
node 2 has no children
4-6
4
6
5
node 5 has no children
node 6 has no children

While I was investigating, I modified your code further, to:

在我调查的过程中,我进一步修改了代码,以便:

  • Fix the issue
  • 解决问题
  • Improve printing
  • 改善印刷
  • Make it modular
  • 使其模块化
  • Use both parsing methods, to make differences between them clearer
  • 使用两种解析方法,使它们之间的差异更加清晰

xml_data.py:

xml_data.py:

DATA = """<root>
  <node_main>
    <node id="1">
      <node id="2" />
      <node id="3">
        <node id="4">
          <node id="5" />
        </node>
        <node id="6" />
      </node>
    </node>
  </node_main>
</root>
"""

code.py:

code.py:

import sys
import xml.etree.ElementTree as xml_etree_et
import lxml.etree as lxml_etree
from itertools import combinations
from xml_data import DATA


MAIN_NODE_NAME = "node_main"


def get_children_combinations(tree):
    children = list(tree)
    for i in range(1, len(children)):
        yield from combinations(children, i)


def get_tree(xml_str, parse_func, tag=None):
    root_node = parse_func(xml_str)
    if tag:
        return [item for item in root_node if item.tag == tag]
    return [root_node]


def process_xml(xml_node):
    for node in xml_node.iter("node"):
        print(f"\nNode ({node.tag}, {node.attrib['id']})")
        children = list(node)
        if children:
            print("    Children: " + " - ".join([child.attrib["id"] for child in children]))

        for children_combo in get_children_combinations(node):
            if children_combo:
                print("    Combo: " + " - ".join([child.attrib["id"] for child in children_combo]))


def main():
    parse_funcs = (xml_etree_et.fromstring, lxml_etree.fromstring)
    for func in parse_funcs:
        print(f"\nParsing xml using: {func.__module__} {func.__name__}")
        nodes = get_tree(DATA, func, tag=MAIN_NODE_NAME)
        for node in nodes:
            print(f"\nProcessing node: {node.tag}")
            process_xml(node)


if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()

Output:

输出:

(py36x86_test) e:\Work\Dev\*\q050749937>"e:\Work\Dev\VEnvs\py36x86_test\Scripts\python.exe" code.py
Python 3.6.2 (v3.6.2:5fd33b5, Jul  8 2017, 04:14:34) [MSC v.1900 32 bit (Intel)] on win32


Parsing xml using: xml.etree.ElementTree XML

Processing node: node_main

Node (node, 1)
    Children: 2 - 3
    Combo: 2
    Combo: 3

Node (node, 2)

Node (node, 3)
    Children: 4 - 6
    Combo: 4
    Combo: 6

Node (node, 4)
    Children: 5

Node (node, 5)

Node (node, 6)

Parsing xml using: lxml.etree fromstring

Processing node: node_main

Node (node, 1)
    Children: 2 - 3
    Combo: 2
    Combo: 3

Node (node, 2)

Node (node, 3)
    Children: 4 - 6
    Combo: 4
    Combo: 6

Node (node, 4)
    Children: 5

Node (node, 5)

Node (node, 6)

#1


2  

While Louis's answer is correct and I completely agree that modifying a data structure as you traverse it generally a Bad Idea(tm), you also asked why the code works with xml.etree.ElementTree and not lxml.etree and there is a very reasonable explanation for that.

虽然路易斯的答案是正确的,我完全同意在你遍历它时修改数据结构通常是一个坏主意(tm),你也问为什么代码适用于xml.etree.ElementTree而不是lxml.etree并且有一个非常合理的对此的解释。

Implementation of .append in xml.etree.ElementTree

This library is implemented directly in Python and could vary depending on which Python runtime you're using. Assuming you're using CPython, the implementation you're looking for is implemented in vanilla Python:

该库直接在Python中实现,可能因您使用的Python运行时而异。假设您正在使用CPython,您正在寻找的实现是在vanilla Python中实现的:

def append(self, subelement):
    """Add *subelement* to the end of this element.
    The new element will appear in document order after the last existing
    subelement (or directly after the text, if it's the first subelement),
    but before the end tag for this element.
    """
    self._assert_is_element(subelement)
    self._children.append(subelement)

The last line is the only part we're concerned with. As it turns out, self._children is initialized towards the top of that file as:

最后一行是我们唯一关注的部分。事实证明,self._children被初始化为该文件的顶部,如下所示:

self._children = []

So adding a child to a tree is just appending an element to a list. Intuitively, that's exactly what you're looking for (in this case) and the implementation behaves in a completely unsurprising way.

因此,将一个子项添加到树只是将一个元素附加到列表中。直观地说,这正是您正在寻找的(在这种情况下),并且实现的行为完全不令人惊讶。

Implementation .append in lxml.etree

lxml is implemented as a mix of Python, non-trivial Cython, and C code so spelunking through it was significantly harder than the pure-Python implementation. First off, .append is implemented as:

lxml是作为Python,非平凡的Cython和C代码的混合实现的,因此通过它进行探索比纯Python实现要困难得多。首先,.append实现为:

def append(self, _Element element not None):
    u"""append(self, element)
    Adds a subelement to the end of this element.
    """
    _assertValidNode(self)
    _assertValidNode(element)
    _appendChild(self, element)

_appendChild is implemented over in apihelper.pxi:

_appendChild在apihelper.pxi中实现:

cdef int _appendChild(_Element parent, _Element child) except -1:
    u"""Append a new child to a parent element.
    """
    c_node = child._c_node
    c_source_doc = c_node.doc
    # prevent cycles
    if _isAncestorOrSame(c_node, parent._c_node):
        raise ValueError("cannot append parent to itself")
    # store possible text node
    c_next = c_node.next
    # move node itself
    tree.xmlUnlinkNode(c_node)
    tree.xmlAddChild(parent._c_node, c_node)
    _moveTail(c_next, c_node)
    # uh oh, elements may be pointing to different doc when
    # parent element has moved; change them too..
    moveNodeToDocument(parent._doc, c_source_doc, c_node)
    return 0

There's definitely a bit more going on here. In particular, lxml explicitly removes the node from the tree and then adds it elsewhere. This prevents you from accidentally creating a cyclic XML graph while manipulating nodes (which is something you could probably do with the xml.etree version).

肯定会有更多的事情发生在这里。特别是,lxml显式地从树中删除该节点,然后将其添加到其他位置。这可以防止您在操作节点时意外创建循环XML图形(这可能是您对xml.etree版本可能执行的操作)。

Workarounds for lxml

Now that we know that xml.etree copies nodes when appending but lxml.etree moves them, why do those workarounds work? Based on the tree.xmlUnlinkNode method (which is actually defined in C inside of libxml2), unlinking just messes with a bunch of pointers. So, anything that copies node metadata will do the trick. Because all of the metadata we care about are direct fields on the xmlNode struct, anything that shallow copies nodes will do the trick

现在我们知道xml.etree在追加时会复制节点,但是lxml.etree会移动它们,为什么这些变通办法有效呢?基于tree.xmlUnlinkNode方法(实际上是在libxml2中的C语言中定义),取消链接只是用一堆指针混淆。因此,复制节点元数据的任何事情都可以解决问题。因为我们关心的所有元数据都是xmlNode结构上的直接字段,所以浅层复制节点的任何东西都可以解决问题

  • copy.deepcopy() definitely works
  • copy.deepcopy()绝对有效
  • node.xpath returns nodes wrapped in proxy elements which happens to shallow copy the tree metadata
  • node.xpath返回包含在代理元素中的节点,这些节点恰好浅层复制树元数据
  • copy.copy() also does the trick
  • copy.copy()也可以解决问题
  • If you don't need your combinations to actually be in an official tree, setting new_combo_tree = [] also gives you list appending just like xml.etree.
  • 如果您不需要将组合实际存在于官方树中,则设置new_combo_tree = []也会为您提供与xml.etree类似的列表。

If you're really concerned about performance and large trees, I'd probably start with shallow copying with copy.copy() although you should absolutely profile a few different options and see which one works best for you.

如果你真的关心性能和大树,我可能会开始使用copy.copy()进行浅层复制,尽管你应该绝对描述一些不同的选项,看看哪一个最适合你。

#2


3  

Copying Problem

In general, the safe thing to do when you are manipulating an XML tree and want to copy information in multiple places in the tree (by opposition to moving information from one place to another) is to perform a deep copy operation on those elements rather than just add them to their new location. The vast majority of XML parsing libraries that produce trees require you to perform a deep copy if you want to copy structures around. They just won't give you the results you want if you do not deep copy. lxml is one such library that requires you to deep copy the structures you want to copy.

通常,当您操作XML树并希望在树中的多个位置复制信息(通过反对将信息从一个地方移动到另一个地方)时,安全的事情是对这些元素执行深度复制操作,而不是只需将它们添加到新位置即可。如果要复制结构,绝大多数生成树的XML解析库都要求您执行深层复制。如果你不进行深层复制,他们就不会给你想要的结果。 lxml就是这样一个库,它要求您深度复制要复制的结构。

The fact that xml.etree.ElementTree works in a way such that .append effectively allows you to have the same element in two places in the tree is definitely unusual in my experience.

事实上,xml.etree.ElementTree的工作方式使得.append有效地允许你在树中的两个地方拥有相同的元素,这在我的经验中绝对是不寻常的。

Walking-while-Modifying Problem

You mentioned that for subnode in node.xpath('.//node') also solves you problem. Note that if you use for subnode in list(node.iter('node')), you'll get the same result. What is going on here is that using list(node.iter('node')) or node.xpath('.//node') or using deepcopy to copy the nodes instead of moving them protect you against another problem with your code: you are walking a structure while modifying it.

你提到过node.xpath('.// node')中的子节点也解决了你的问题。请注意,如果您在list(node.iter('node'))中使用子节点,则会得到相同的结果。这里发生的是使用list(node.iter('node'))或node.xpath('.// node')或使用deepcopy复制节点而不是移动它们可以保护您免受代码的另一个问题:你正在修改它时走一个结构。

node.iter('node') creates an iterator that goes over the XML structure as you iterate it. If you wrap it in list(), then the structure is walked immediately and the result put in a list. So you've effectively taken a snapshot of the structure before you walk it. That prevents your walking operation from being affected by changes to the tree. If you do node.xpath('.//node') you are also obtaining a snapshot of the tree before you walk it because that method returns a list of nodes. And if you do a deepcopy of the nodes and append the copy of the node instead of appending the original node, then you are not modifying the tree you are walking while you are walking it.

node.iter('node')创建一个迭代器,在迭代它时遍历XML结构。如果将它包装在list()中,则立即走结构并将结果放入列表中。所以你在走之前已经有效地拍摄了结构的快照。这可以防止您的步行操作受到树的更改的影响。如果你执行node.xpath('.// node'),那么在你走之前你也会获得树的快照,因为该方法返回一个节点列表。如果您对节点执行深度复制并附加节点的副本而不是附加原始节点,那么您在行走时不会修改正在行走的树。

Whether you can get away with using XPath or using node.xpath('.//node') instead of using deepcopy depends on what you plan to do with your combinations. The code you show in your question prints the combinations to the screen as soon you create them. They look fine when you print them, but if you do not use a deepcopy for creating them, then as soon as you create a new combination, the old one will get messed up because any node that appeared in the old combination and needs to appear in the new one will be moved instead of copied.

您是否可以使用XPath或使用node.xpath('.// node')而不是使用deepcopy取决于您计划对组合执行的操作。您在问题中显示的代码会在创建后立即将组合打印到屏幕上。它们在打印时看起来很好,但是如果你不使用深度复制来创建它们,那么只要你创建一个新的组合,旧的组合就会搞砸,因为旧组合中出现的任何节点都需要出现在新的一个将被移动而不是复制。


And what is the most efficient work-around when working with very large trees?

使用非常大的树木时,最有效的解决方法是什么?

It depends on the specifics of your application and the data you need to parse. You gave one example which is a small document but you ask about "large trees". What applies to small documents does not necessarily transfer to large documents. You can optimize for case X but if case X happens only extremely rarely in real data, then your optimization may not pan out. In some cases, it may actually be harmful.

这取决于您的应用程序的细节和您需要解析的数据。你举了一个例子,这是一个小文件,但你问的是“大树”。什么适用于小型文件不一定转移到大型文件。您可以针对案例X进行优化,但如果案例X在实际数据中极少发生,那么您的优化可能不会成功。在某些情况下,它实际上可能是有害的。

In one application of mine, I had to replace references to some structures with the structures themselves. A simplified illustration would be a document that contains elements like <define id="...">...</def> and references like <ref idref="..."/>. Every instance of ref would have to be replaced with the define it points to. In general, this may mean copying a single define multiple times but sometimes a define may be referred by only one ref so one optimization was to detect this and skip the deep copy in those cases where there was only one reference. I got this optimization "for free" because the application already required recording each instance of ref and define for other purposes. If I've had to add bookkeeping just for this optimization, it is not clear it would have been worth it.

在我的一个应用程序中,我不得不用结构本身替换对某些结构的引用。简化图示将是包含 ... 等元素以及 等引用的文档。每个ref实例都必须用它指向的define替换。通常,这可能意味着多次复制单个定义,但有时定义可能仅由一个引用引用,因此一个优化是检测到这一点并在只有一个引用的情况下跳过深层副本。我“免费”获得了这个优化,因为应用程序已经需要记录每个ref实例并为其他目的定义。如果我不得不为这个优化添加簿记,那么不清楚它是否值得。

#3


0  

At the beginning I didn't think there was such a difference (neither did I check), but both @supersam654 and @Louis answers pinpointed it very clearly.

一开始我并没有想到会有这样的差异(我也没有检查过),但是@ supersam654和@Louis的答案都非常明确地指出了它。

But code that is dependent on internal representation (rather than interface) of stuff that it uses, doesn't seem right (from design PoV) to me. Also, as I was asking in my comment: combo_children seems totally useless:

但是依赖于它所使用的东西的内部表示(而不是接口)的代码似乎不对(从设计PoV)到我。另外,正如我在评论中提到的那样:combo_children似乎完全没用:

  1. Get child nodes combo (as a list)
  2. 获取子节点组合(作为列表)
  3. Append each node from the list as a child to combo_children
  4. 将作为子项的列表中的每个节点附加到combo_children
  5. Return combo_children
  6. 返回combo_children
  7. Get combo_children children (as a list)
  8. 获取combo_children子项(作为列表)
  9. Use the list (combo)
  10. 使用列表(组合)

when things could be easily done:

当事情可以轻松完成:

  1. Get child nodes combo (as a list)
  2. 获取子节点组合(作为列表)
  3. Return the list
  4. 返回列表
  5. Use the list (combo)
  6. 使用列表(组合)

Apparently, the combo_children approach was also exposing the behavioral difference between the modules.

显然,combo_children方法也暴露了模块之间的行为差​​异。

code_orig_lxml.py:

code_orig_lxml.py:

import lxml.etree as ET
#import xml.etree.ElementTree as ET
from itertools import combinations
from copy import deepcopy


def get_combination_trees(tree):
    children = list(tree)
    for i in range(1, len(children)):
        for combination in combinations(children, i):
            #new_combo_tree = ET.Element(tree.tag, tree.attrib)
            #for recombined_child in combination:
                #new_combo_tree.append(recombined_child)
                # when using lxml a deepcopy is required to make this work (or make change in parse_xml)
                # new_combo_tree.append(deepcopy(recombined_child))
            #yield new_combo_tree
            yield combination

    return None


def parse_xml(tree_p):
    for node in ET.fromstring(tree_p):
        if not node.tag == 'node_main':
            continue
        # replace by node.xpath('.//node') for lxml (or use deepcopy in get_combination_trees)
        for subnode in node.iter('node'):
            children = list(subnode)
            if children:
                print('-'.join([child.attrib['id'] for child in children]))
            else:
                print(f'node {subnode.attrib["id"]} has no children')

            #for combo_tree in get_combination_trees(subnode):
            for combo_children in get_combination_trees(subnode):
                #combo_children = list(combo_tree)
                if combo_children:
                    print('-'.join([child.attrib['id'] for child in combo_children]))

    return None


s = '''<root>
  <node_main>
    <node id="1">
      <node id="2" />
      <node id="3">
        <node id="4">
          <node id="5" />
        </node>
        <node id="6" />
      </node>
    </node>
  </node_main>
</root>
'''

parse_xml(s)

Notes:

笔记:

  • This is your code with the changes above
  • 这是您的代码,其中包含上述更改
  • I didn't removed anything, instead just commented stuff (which would generate the smallest diff between the old and new versions)
  • 我没有删除任何东西,只是评论了东西(这将产生旧版本和新版本之间的最小差异)

Output:

输出:

(py36x86_test) e:\Work\Dev\*\q050749937>"e:\Work\Dev\VEnvs\py36x86_test\Scripts\python.exe" code_orig_lxml.py
2-3
2
3
node 2 has no children
4-6
4
6
5
node 5 has no children
node 6 has no children

While I was investigating, I modified your code further, to:

在我调查的过程中,我进一步修改了代码,以便:

  • Fix the issue
  • 解决问题
  • Improve printing
  • 改善印刷
  • Make it modular
  • 使其模块化
  • Use both parsing methods, to make differences between them clearer
  • 使用两种解析方法,使它们之间的差异更加清晰

xml_data.py:

xml_data.py:

DATA = """<root>
  <node_main>
    <node id="1">
      <node id="2" />
      <node id="3">
        <node id="4">
          <node id="5" />
        </node>
        <node id="6" />
      </node>
    </node>
  </node_main>
</root>
"""

code.py:

code.py:

import sys
import xml.etree.ElementTree as xml_etree_et
import lxml.etree as lxml_etree
from itertools import combinations
from xml_data import DATA


MAIN_NODE_NAME = "node_main"


def get_children_combinations(tree):
    children = list(tree)
    for i in range(1, len(children)):
        yield from combinations(children, i)


def get_tree(xml_str, parse_func, tag=None):
    root_node = parse_func(xml_str)
    if tag:
        return [item for item in root_node if item.tag == tag]
    return [root_node]


def process_xml(xml_node):
    for node in xml_node.iter("node"):
        print(f"\nNode ({node.tag}, {node.attrib['id']})")
        children = list(node)
        if children:
            print("    Children: " + " - ".join([child.attrib["id"] for child in children]))

        for children_combo in get_children_combinations(node):
            if children_combo:
                print("    Combo: " + " - ".join([child.attrib["id"] for child in children_combo]))


def main():
    parse_funcs = (xml_etree_et.fromstring, lxml_etree.fromstring)
    for func in parse_funcs:
        print(f"\nParsing xml using: {func.__module__} {func.__name__}")
        nodes = get_tree(DATA, func, tag=MAIN_NODE_NAME)
        for node in nodes:
            print(f"\nProcessing node: {node.tag}")
            process_xml(node)


if __name__ == "__main__":
    print("Python {:s} on {:s}\n".format(sys.version, sys.platform))
    main()

Output:

输出:

(py36x86_test) e:\Work\Dev\*\q050749937>"e:\Work\Dev\VEnvs\py36x86_test\Scripts\python.exe" code.py
Python 3.6.2 (v3.6.2:5fd33b5, Jul  8 2017, 04:14:34) [MSC v.1900 32 bit (Intel)] on win32


Parsing xml using: xml.etree.ElementTree XML

Processing node: node_main

Node (node, 1)
    Children: 2 - 3
    Combo: 2
    Combo: 3

Node (node, 2)

Node (node, 3)
    Children: 4 - 6
    Combo: 4
    Combo: 6

Node (node, 4)
    Children: 5

Node (node, 5)

Node (node, 6)

Parsing xml using: lxml.etree fromstring

Processing node: node_main

Node (node, 1)
    Children: 2 - 3
    Combo: 2
    Combo: 3

Node (node, 2)

Node (node, 3)
    Children: 4 - 6
    Combo: 4
    Combo: 6

Node (node, 4)
    Children: 5

Node (node, 5)

Node (node, 6)