Converting XML to CSV using Python's zip and list comprehensions

I've been looking at other questions here in SO about zip and the magic * which have helped me a lot in understanding how it works. For example:

Why does x,y = zip(*zip(a,b)) work in Python?

How does zip(*[iter(s)]*n) work in Python?

Zip as a list comprehension

XML to csv(-like) format

Even though I still have to think a little about what's actually happening I have a better understanding now. So what I'm trying to achieve is to convert an xml document into csv. That last link above gets really close to what I want to do, however my source xml doesn't have the most consistent structure, and that's where I'm hitting a wall. Here's an example of my source xml (simplified for the sake of this example):

<?xml version="1.0" encoding="utf-8"?>
<root>
    <child>
        <Name>John</Name>
        <Surname>Doe</Surname>
        <Phone>123456</Phone>
        <Phone>654321</Phone>
        <Fax>111111</Fax>
    </child>
    <child>
        <Name>Tom</Name>
        <Surname>Cat</Surname>
        <Phone>98765</Phone>
        <Phone>56789</Phone>
        <Phone>00000</Phone>
    </child>
</root>

As you can see I can have 2 or more of the same elements under <child>. Also, if a certain element has no value, it won't even exist (like on the second <child> where there's no <Fax>).

This is the code I currently have:

from xml.etree import ElementTree as etree  # assumed import; lxml's etree would behave the same way here

data = etree.parse('test.xml').findall(".//child")
tags = ('Name', 'Surname', 'Phone', 'Fax')

for child in data:
    for a in zip(*[child.findall(x) for x in tags]):
        print([x.text for x in a])

>> Result:

['John', 'Doe', '123456', '111111']

Although this gives me a format I can use to write a csv, it has two problems:

1. It skips the 2nd child because it doesn't have the <Fax> element (I suppose). If I only search for elements that exist in both children by setting tags = ('Name', 'Surname'), then I have 2 lists back (great!)

2. That first child actually has 2 phone numbers but only one is returned.

From what I could test, values start to disappear as soon as zip(*...) comes into play. How could I set a default value so that empty values are kept?

Update: to make it more clear what I intend to do, here's the expected output format (CSV with semicolon separator, where multiple values in each field are split by a comma):

John;Doe;123456,654321;111111;
Tom;Cat;98765,56789,00000;;

Thanks!

2 Answers

#1

I hacked this together. Read the csv module's documentation and change accordingly if you want a more specific format.

import sys
from collections import Counter
from csv import DictWriter
from io import StringIO
from xml.etree import ElementTree

xml_str = '''
<?xml version="1.0" encoding="utf-8"?>
<root>
    <child>
        <Name>John</Name>
        <Surname>Doe</Surname>
        <Phone>123456</Phone>
        <Phone>654321</Phone>
        <Fax>111111</Fax>
    </child>
    <child>
        <Name>Tom</Name>
        <Surname>Cat</Surname>
        <Phone>98765</Phone>
        <Phone>56789</Phone>
        <Phone>00000</Phone>
    </child>
</root>
'''

root = ElementTree.parse(StringIO(xml_str.strip()))
entry_list = []
for child_tag in root.iterfind("child"):
    # Count how often each tag occurs under this <child>.
    tag_count = Counter(tag.tag for tag in child_tag)

    # Repeated tags get numbered keys ("Phone 1", "Phone 2", ...);
    # tags that occur once keep their plain name.
    seen = Counter()
    tmp_dict = {}
    for tag in child_tag:
        if tag_count[tag.tag] > 1:
            seen[tag.tag] += 1
            key = "%s %d" % (tag.tag, seen[tag.tag])
        else:
            key = tag.tag
        tmp_dict[key] = tag.text

    entry_list.append(tmp_dict)

field_order = ["Name", "Surname", "Phone 1", "Phone 2", "Phone 3", "Fax"]
# Known fields sort by their position in field_order; unknown fields go last.
field_check = lambda q: field_order.index(q) if q in field_order else sys.maxsize

all_fields = sorted(set().union(*entry_list), key=field_check)

with open("test.csv", "w", newline="") as file_h:
    writer = DictWriter(file_h, all_fields, restval="", extrasaction="ignore", dialect="excel", lineterminator="\n")
    writer.writeheader()
    writer.writerows(entry_list)

#2

You say, in regards to your first problem, that "[i]f I only search for elements that exist in both children ... I have 2 lists back," implying that the lack of output for the second child has something to do with interaction between the two child nodes. That's not the case. The aspect of the behavior of zip that you appear to be overlooking is that zip stops processing its arguments after it's exhausted the shortest one.

Consider the output of the following simplification of your code:

for child in data:
    print([child.findall(x) for x in tags])

The output will be (omitting memory addresses):

[[<Element 'Name'>], [<Element 'Surname'>], [<Element 'Phone'>, <Element 'Phone'>], [<Element 'Fax'>]]
[[<Element 'Name'>], [<Element 'Surname'>], [<Element 'Phone'>, <Element 'Phone'>, <Element 'Phone'>], []]

Notice that the second list has an empty sublist (because the second child has no Fax node). This means that when you zip those sublists together the process stops immediately and returns an empty list; on its first pass it's already exhausted one of the sublists. That's why your second child is omitted in the output; it has nothing to do with elements being shared between children.
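
A minimal sketch of that behavior, with plain strings standing in for the Element objects (the lists below are illustrative stand-ins, not output of the original code):

```python
# Stand-in for the per-tag findall() results of the second <child>:
# Name, Surname, Phone, and an empty list for the missing Fax.
second_child = [["Tom"], ["Cat"], ["98765", "56789", "00000"], []]

# One argument is empty, so zip is exhausted before yielding anything.
rows = list(zip(*second_child))
print(rows)  # []

# Leaving out the empty Fax list gives one row, cut to the shortest list.
rows_without_fax = list(zip(*second_child[:3]))
print(rows_without_fax)  # [('Tom', 'Cat', '98765')]
```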

The same principle of zip's behavior explains your second problem. Notice that the first output list above consists of four elements: a list of length one for three of your tags and a list of length two with the two phone elements. When you zip those together, the process again stops after exhausting any of the sublists. In this case, the shortest sublist has length one, so the result only draws one element from the phone sublist.
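
If the goal is to keep going past the shortest sublist, the standard library offers itertools.zip_longest (izip_longest on Python 2), which pads exhausted arguments with a fillvalue. A small sketch, again with plain strings as stand-ins; using an empty string as the padding is an assumption about what a "default value" should be here:

```python
from itertools import zip_longest

# Stand-in for the per-tag findall() results of the first <child>.
first_child = [["John"], ["Doe"], ["123456", "654321"], ["111111"]]

# zip_longest runs until the longest argument is exhausted and fills
# the gaps with fillvalue instead of stopping early.
padded_rows = list(zip_longest(*first_child, fillvalue=""))
print(padded_rows)
# [('John', 'Doe', '123456', '111111'), ('', '', '654321', '')]
```

This keeps every value, but it spreads one child over several rows, which may not be the shape you want for a CSV file.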

I'm not sure exactly what you want your output to look like, but if you're simply trying to construct, for each child node, a list containing the text of each element in that node, you can do something like:

for child in data:
    print([x.text for x in child])

That will produce:

['John', 'Doe', '123456', '654321', '111111']
['Tom', 'Cat', '98765', '56789', '00000']
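
For the semicolon-separated layout described in the question's update, one possible approach is to skip zip entirely: look up each tag per child, join repeated values with a comma, and let the csv module handle the delimiter. This is only a sketch under assumptions: the helper name child_to_row is made up for illustration, the XML is inlined rather than read from test.xml, and a missing tag is written as an empty field.

```python
import csv
import io
from xml.etree import ElementTree

xml_str = """<root>
    <child>
        <Name>John</Name>
        <Surname>Doe</Surname>
        <Phone>123456</Phone>
        <Phone>654321</Phone>
        <Fax>111111</Fax>
    </child>
    <child>
        <Name>Tom</Name>
        <Surname>Cat</Surname>
        <Phone>98765</Phone>
        <Phone>56789</Phone>
        <Phone>00000</Phone>
    </child>
</root>"""

tags = ("Name", "Surname", "Phone", "Fax")


def child_to_row(child, tags):
    """Hypothetical helper: one field per tag, repeated tags joined by commas."""
    return [",".join(el.text or "" for el in child.findall(tag)) for tag in tags]


root = ElementTree.fromstring(xml_str)
rows = [child_to_row(child, tags) for child in root.findall("child")]

# Write to an in-memory buffer; swap in open("test.csv", "w", newline="")
# to write a real file.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
writer.writerows(rows)
print(buffer.getvalue())
# John;Doe;123456,654321;111111
# Tom;Cat;98765,56789,00000;
```

Unlike the sample in the question, this does not emit a trailing semicolon after the last field; the empty Fax value for the second child shows up as an empty final field.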
