I've been looking at other questions here in SO about zip and the magic * which have helped me a lot in understanding how it works. For example:
- Why does x,y = zip(*zip(a,b)) work in Python?
- How does zip(*[iter(s)]*n) work in Python?
- Zip as a list comprehension
- XML to csv(-like) format
Even though I still have to think a little about what's actually happening I have a better understanding now. So what I'm trying to achieve is to convert an xml document into csv. That last link above gets really close to what I want to do, however my source xml doesn't have the most consistent structure, and that's where I'm hitting a wall. Here's an example of my source xml (simplified for the sake of this example):
<?xml version="1.0" encoding="utf-8"?>
As you can see, I can have 2 or more of the same element under <child>. Also, if a certain element has no value, it won't even exist (like in the second <child>, where there's no <Fax>).
This is the code I currently have:
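# assuming the standard-library ElementTree (lxml.etree offers the same calls)
import xml.etree.ElementTree as etree

data = etree.parse(open('test.xml')).findall(".//child")
tags = ('Name', 'Surname', 'Phone', 'Fax')
for child in data:
    # zip(*lists) turns the per-tag lists into per-row tuples
    for a in zip(*[child.findall(x) for x in tags]):
        print([x.text for x in a])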
>> Result:
['John', 'Doe', '123456', '111111']
Although this gives me a format I can use to write a csv, it has two problems:
- It skips the 2nd child because it doesn't have the <Fax> element (I suppose). If I only search for elements that exist in both children by setting tags = ('Name', 'Surname'), then I get 2 lists back (great!)
- The first child actually has 2 phone numbers, but only one is returned
From what I could test, stuff starts to disappear when zip* comes into play... How could I maybe set a default value so I can keep empty values?
Update: to make it more clear what I intend to do, here's the expected output format (CSV with semicolon separator, where multiple values in each field are split by a comma):
2 Answers
I hacked this together. Read the csv module's documentation and change accordingly if you want a more specific format.
import sys
from csv import DictWriter
from StringIO import StringIO  # Python 2 module
from xml.etree import ElementTree

# The sample XML from the question goes inside this string (body omitted here).
xml_str = \
"""<?xml version="1.0" encoding="utf-8"?>
...
"""
root = ElementTree.parse(StringIO(xml_str.strip()))
entry_list = []
for child_tag in root.iterfind("child"):
    child_tags = child_tag.getchildren()
    # Count how often each tag occurs under this <child>.
    tag_count = {}
    for tag in child_tags:
        tag_count[tag.tag] = tag_count.get(tag.tag, 0) + 1
    # Running counters for the tags that occur more than once.
    m_count = dict([(key, 0) for (key, val) in tag_count.items() if val > 1])
    # Repeated tags become "Phone 1", "Phone 2", ...; unique tags keep their name.
    enum = lambda x: ("%s%s" % (x.tag, (" %d" % m_count.setdefault(x.tag, m_count.pop(x.tag) + 1)) if(tag_count[x.tag] > 1) else ""), x.text)
    tmp_dict = dict([enum(tag) for tag in child_tags])
    entry_list.append(tmp_dict)
field_order = ["Name", "Surname", "Phone 1", "Phone 2", "Phone 3", "Fax"]
# Fields that are not listed in field_order sort last.
field_check = lambda q: field_order.index(q) if(field_order.count(q)) else sys.maxint
all_fields = list(reduce(lambda x, y: x | set(y.keys()), entry_list, set([])))
all_fields.sort(cmp=lambda x, y: field_check(x) - field_check(y))
with open("test.csv", "w") as file_h:
    writer = DictWriter(file_h, all_fields, restval="", extrasaction="ignore", dialect="excel", lineterminator="\n")
    writer.writerow(dict(zip(all_fields, all_fields)))  # header row
    writer.writerows(entry_list)  # one row per <child>
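The script above is Python 2 code (the StringIO module, sys.maxint, reduce as a builtin and the cmp= sort argument do not exist in Python 3). What follows is only a rough Python 3 sketch of the same idea: number the repeated tags so that each one gets its own column. SAMPLE and numbered_fields are hypothetical names, and the XML is a stand-in rebuilt from the values quoted in this thread, not the asker's actual file.

```python
import csv
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical stand-in for the question's XML (the real sample is not shown).
SAMPLE = """<root>
  <child><Name>John</Name><Surname>Doe</Surname>
    <Phone>123456</Phone><Phone>654321</Phone><Fax>111111</Fax></child>
  <child><Name>Tom</Name><Surname>Cat</Surname>
    <Phone>98765</Phone><Phone>56789</Phone><Phone>00000</Phone></child>
</root>"""


def numbered_fields(child):
    """Map each sub-element to a column name, numbering tags that repeat."""
    totals = Counter(el.tag for el in child)
    seen = Counter()
    row = {}
    for el in child:
        if totals[el.tag] > 1:
            seen[el.tag] += 1
            key = "%s %d" % (el.tag, seen[el.tag])  # e.g. "Phone 2"
        else:
            key = el.tag
        row[key] = el.text
    return row


root = ET.fromstring(SAMPLE)
entries = [numbered_fields(child) for child in root.iter("child")]

# Preferred column order; any field not listed here sorts last.
field_order = ["Name", "Surname", "Phone 1", "Phone 2", "Phone 3", "Fax"]
present = set().union(*entries)
all_fields = sorted(
    present,
    key=lambda name: field_order.index(name) if name in field_order else len(field_order),
)

buffer = io.StringIO()
writer = csv.DictWriter(buffer, all_fields, restval="", lineterminator="\n")
writer.writeheader()
writer.writerows(entries)
csv_text = buffer.getvalue()
print(csv_text)
```

Writing to an in-memory buffer keeps the sketch free of side effects; swapping it for open("test.csv", "w", newline="") would write a file instead.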
You say, in regard to your first problem, that "[i]f I only search for elements that exist in both children ... I have 2 lists back," implying that the lack of output for the second child has something to do with interaction between the two child nodes. That's not the case. The aspect of zip's behavior that you appear to be overlooking is that zip stops processing its arguments as soon as it has exhausted the shortest one.
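To see that truncation in isolation, here is a tiny sketch with plain lists. The variable names and values are made up for illustration; they only mimic the shape of the per-tag lists.

```python
# zip stops as soon as its shortest argument is exhausted.
names = ["John"]
phones = ["123456", "654321"]
faxes = []  # like a child with no <Fax> element

pairs = list(zip(names, phones))            # the second phone is dropped
nothing = list(zip(names, phones, faxes))   # an empty argument yields no tuples

print(pairs)    # [('John', '123456')]
print(nothing)  # []
```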
Consider the output of the following simplification of your code:
for child in data:
    print([child.findall(x) for x in tags])
The output will be (omitting memory addresses):
[[<Element 'Name'>], [<Element 'Surname'>], [<Element 'Phone'>, <Element 'Phone'>], [<Element 'Fax'>]]
[[<Element 'Name'>], [<Element 'Surname'>], [<Element 'Phone'>, <Element 'Phone'>, <Element 'Phone'>], []]
Notice that the second list has an empty sublist (because the second child has no Fax node). This means that when you zip those sublists together, the process stops immediately and returns an empty list; on its first pass it has already exhausted one of the sublists. That's why your second child is omitted from the output; it has nothing to do with elements being shared between children.
The same principle of zip's behavior explains your second problem. Notice that the first output list above consists of four elements: a list of length one for three of your tags and a list of length two with the two phone elements. When you zip those together, the process again stops after exhausting any of the sublists. In this case, the shortest sublist has length one, so the result only draws one element from the phone sublist.
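If the goal is to keep the extra values and fill the holes with a default, one option (not something the question's code uses) is itertools.zip_longest, which pads the shorter arguments with a fillvalue instead of stopping. A minimal Python 3 sketch follows; SAMPLE is a hypothetical stand-in for the question's XML, rebuilt from the values quoted in this thread.

```python
from itertools import zip_longest
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the question's XML (the real sample is not shown).
SAMPLE = """<root>
  <child><Name>John</Name><Surname>Doe</Surname>
    <Phone>123456</Phone><Phone>654321</Phone><Fax>111111</Fax></child>
  <child><Name>Tom</Name><Surname>Cat</Surname>
    <Phone>98765</Phone><Phone>56789</Phone><Phone>00000</Phone></child>
</root>"""

tags = ("Name", "Surname", "Phone", "Fax")
root = ET.fromstring(SAMPLE)

rows = []
for child in root.findall(".//child"):
    # One list of text values per tag; a missing tag gives an empty list.
    columns = [[el.text for el in child.findall(tag)] for tag in tags]
    # zip_longest keeps going until the longest list is exhausted,
    # padding the shorter ones with fillvalue.
    rows.extend(zip_longest(*columns, fillvalue=""))

for row in rows:
    print(row)
```

With this approach each extra phone number lands on its own row, with empty strings in the other columns, so whether it fits depends on the CSV layout you are after.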
I'm not sure exactly what you want your output to look like, but if you're simply trying to construct, for each child node, a list containing the text of each element in that node, you can do something like:
for child in data:
    print([x.text for x in child])
That will produce:
['John', 'Doe', '123456', '654321', '111111']
['Tom', 'Cat', '98765', '56789', '00000']
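For the format described in the question's update (semicolon-separated fields, with multiple values inside a field joined by commas), zip may not be needed at all. The sketch below is a guess at that layout, since the exact expected output is not shown; SAMPLE is again a hypothetical stand-in for the question's XML, and the code targets Python 3.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the question's XML (the real sample is not shown).
SAMPLE = """<root>
  <child><Name>John</Name><Surname>Doe</Surname>
    <Phone>123456</Phone><Phone>654321</Phone><Fax>111111</Fax></child>
  <child><Name>Tom</Name><Surname>Cat</Surname>
    <Phone>98765</Phone><Phone>56789</Phone><Phone>00000</Phone></child>
</root>"""

tags = ("Name", "Surname", "Phone", "Fax")
root = ET.fromstring(SAMPLE)

buffer = io.StringIO()
writer = csv.writer(buffer, delimiter=";", lineterminator="\n")
writer.writerow(tags)  # header row
for child in root.findall(".//child"):
    # Join every value found for a tag with commas; a missing tag
    # produces an empty field instead of dropping the whole row.
    writer.writerow(
        [",".join(el.text or "" for el in child.findall(tag)) for tag in tags]
    )

csv_text = buffer.getvalue()
print(csv_text)
```

Because the delimiter is a semicolon, the commas inside a field do not trigger quoting by the csv module.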