I have an XML file of the form:
<NewDataSet>
<Root>
<Phonemic>and</Phonemic>
<Phonetic>nd</Phonetic>
<Description/>
<Start>0</Start>
<End>8262</End>
</Root>
<Root>
<Phonemic>comfortable</Phonemic>
<Phonetic>comfetebl</Phonetic>
<Description>adj</Description>
<Start>61404</Start>
<End>72624</End>
</Root>
</NewDataSet>
I need to process it so that, for instance, when the user inputs nd, the program matches it against the <Phonetic> tag and returns and from the <Phonemic> part. I thought that if I could convert the XML file to a dictionary, I would be able to iterate over the data and find information when needed.
I searched and found xmltodict, which is used for the same purpose:
import xmltodict
with open(r'path\to\1.xml', encoding='utf-8', errors='ignore') as fd:
    obj = xmltodict.parse(fd.read())
Running this gives me an OrderedDict:
>>> obj
OrderedDict([('NewDataSet', OrderedDict([('Root', [OrderedDict([('Phonemic', 'and'), ('Phonetic', 'nd'), ('Description', None), ('Start', '0'), ('End', '8262')]), OrderedDict([('Phonemic', 'comfortable'), ('Phonetic', 'comfetebl'), ('Description', 'adj'), ('Start', '61404'), ('End', '72624')])])]))])
Unfortunately, this hasn't made things simpler, and I am not sure how to go about implementing the program with the new data structure. For example, to access nd I'd have to write:
obj['NewDataSet']['Root'][0]['Phonetic']
which is ridiculously complicated. I tried to turn it into a regular dictionary with dict(), but because it is nested, the inner layers remain OrderedDicts, and my data is very large.
3 Solutions
#1
5
If you are accessing this as obj['NewDataSet']['Root'][0]['Phonetic'], IMO, you are not doing it right.
Instead, you can do the following:
obj = obj["NewDataSet"]
# Intended to ensure that root_elements is always a list.
# Note: this checks the type of obj, not of obj["Root"].
root_elements = obj["Root"] if type(obj) == OrderedDict else [obj["Root"]]
for element in root_elements:
    print(element["Phonetic"])
Even though this code looks much longer, the advantage is that it becomes a lot more compact and modular once you start dealing with sufficiently large XML.
PS: I had the same issues with xmltodict. But compared with parsing the XML files using xml.etree.ElementTree, xmltodict was much easier to work with, as the code base was smaller and I didn't have to deal with other inanities of the xml module.
EDIT
The following code works for me:
import xmltodict
from collections import OrderedDict
xmldata = """<NewDataSet>
<Root>
<Phonemic>and</Phonemic>
<Phonetic>nd</Phonetic>
<Description/>
<Start>0</Start>
<End>8262</End>
</Root>
<Root>
<Phonemic>comfortable</Phonemic>
<Phonetic>comfetebl</Phonetic>
<Description>adj</Description>
<Start>61404</Start>
<End>72624</End>
</Root>
</NewDataSet>"""
obj = xmltodict.parse(xmldata)
obj = obj["NewDataSet"]
# Intended to ensure that root_elements is always a list.
# Note: this checks the type of obj, not of obj["Root"].
root_elements = obj["Root"] if type(obj) == OrderedDict else [obj["Root"]]
for element in root_elements:
    print(element["Phonetic"])
#2
0
Mu's answer worked for me; the only thing I had to change was the tricky step that ensures root_elements is always a list:
import xmltodict
from collections import OrderedDict
xmldata = """<NewDataSet>
<Root>
<Phonemic>and</Phonemic>
<Phonetic>nd</Phonetic>
<Description/>
<Start>0</Start>
<End>8262</End>
</Root>
<Root>
<Phonemic>comfortable</Phonemic>
<Phonetic>comfetebl</Phonetic>
<Description>adj</Description>
<Start>61404</Start>
<End>72624</End>
</Root>
</NewDataSet>"""
obj = xmltodict.parse(xmldata)
obj = obj["NewDataSet"]
# Ensure that root_elements is always a list: if obj["Root"] is already a list,
# use it as-is; otherwise wrap it in a single-element list.
root_elements = obj["Root"] if isinstance(obj["Root"], list) else [obj["Root"]]
for element in root_elements:
    print(element["Phonetic"])
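Building on the list normalization above, the lookup the question asks for (input nd, get back and) could be sketched roughly as follows. This is only an illustration, not part of the original answers: build_lookup is a hypothetical helper name, and the parsed literal simply mirrors the structure printed in the question so the snippet runs without the XML file.

```python
def build_lookup(parsed):
    """Map each Phonetic value to its Phonemic value (hypothetical helper)."""
    roots = parsed["NewDataSet"]["Root"]
    # A single <Root> is parsed as one mapping rather than a list, so wrap it.
    if not isinstance(roots, list):
        roots = [roots]
    return {entry["Phonetic"]: entry["Phonemic"] for entry in roots}


# Same shape as the parsed output shown in the question.
parsed = {
    "NewDataSet": {
        "Root": [
            {"Phonemic": "and", "Phonetic": "nd", "Description": None,
             "Start": "0", "End": "8262"},
            {"Phonemic": "comfortable", "Phonetic": "comfetebl",
             "Description": "adj", "Start": "61404", "End": "72624"},
        ]
    }
}

lookup = build_lookup(parsed)
print(lookup.get("nd"))  # and
```

Building the dictionary once and then calling lookup.get(user_input) avoids walking the whole list for every query, which may matter for large files. If two entries share the same Phonetic value, the later one wins in this sketch.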
#3
0
You can actually avoid the conversion to OrderedDict by passing an additional keyword parameter:
obj = xmltodict.parse(xmldata, dict_constructor=dict)
parse forwards keyword arguments to _DictSAXHandler, and dict_constructor is set to OrderedDict by default.
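If the data has already been parsed into nested OrderedDicts, another option is to convert it afterwards. The question notes that a plain dict() call only converts the outermost layer; a recursive walk handles the inner ones too. This is a minimal sketch using only the standard library, assuming the structure contains nothing but mappings, lists, and scalar values; to_plain is a hypothetical helper name, not part of xmltodict.

```python
from collections import OrderedDict


def to_plain(node):
    """Recursively convert nested mappings to plain dicts (hypothetical helper)."""
    if isinstance(node, dict):  # OrderedDict is a dict subclass
        return {key: to_plain(value) for key, value in node.items()}
    if isinstance(node, list):
        return [to_plain(item) for item in node]
    return node  # strings, None, and other scalars are returned unchanged


nested = OrderedDict([
    ("NewDataSet", OrderedDict([
        ("Root", [OrderedDict([("Phonemic", "and"), ("Phonetic", "nd")])]),
    ])),
])

plain = to_plain(nested)
print(type(plain["NewDataSet"]["Root"][0]).__name__)  # dict
```

Passing dict_constructor=dict at parse time avoids this extra pass entirely, so the helper is mainly useful when the parsing step is outside your control.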