Parsing an XML file with an ordered dictionary

I have an xml file of the form:

<NewDataSet>
    <Root>
        <Phonemic>and</Phonemic>
        <Phonetic>nd</Phonetic>
        <Description/>
        <Start>0</Start>
        <End>8262</End>
    </Root>
    <Root>
        <Phonemic>comfortable</Phonemic>
        <Phonetic>comfetebl</Phonetic>
        <Description>adj</Description>
        <Start>61404</Start>
        <End>72624</End>
    </Root>
</NewDataSet>

I need to process it so that, for instance, when the user inputs nd, the program matches it against the <Phonetic> tag and returns the word "and" from the corresponding <Phonemic> tag. I thought that if I could convert the XML file to a dictionary, I would be able to iterate over the data and look up information when needed.

I searched and found xmltodict which is used for the same purpose:

import xmltodict
with open(r'path\to\1.xml', encoding='utf-8', errors='ignore') as fd:
    obj = xmltodict.parse(fd.read())

Running this gives me an ordered dict:

>>> obj
OrderedDict([('NewDataSet', OrderedDict([('Root', [OrderedDict([('Phonemic', 'and'), ('Phonetic', 'nd'), ('Description', None), ('Start', '0'), ('End', '8262')]), OrderedDict([('Phonemic', 'comfortable'), ('Phonetic', 'comfetebl'), ('Description', 'adj'), ('Start', '61404'), ('End', '72624')])])]))])

Now this unfortunately hasn't made things simpler and I am not sure how to go about implementing the program with the new data structure. For example to access nd I'd have to write:

obj['NewDataSet']['Root'][0]['Phonetic']

which is ridiculously complicated. I tried to turn it into a regular dictionary with dict(), but because the structure is nested, only the outermost layer is converted and the inner layers remain OrderedDicts. My data is also very large.

3 Solutions

#1

If you are accessing this as obj['NewDataSet']['Root'][0]['Phonetic'], IMO, you are not doing it right.

Instead, you can do the following:

obj = obj["NewDataSet"]
root_elements = obj["Root"] if type(obj) == OrderedDict else [obj["Root"]] 
# Above step ensures that root_elements is always a list
for element in root_elements:
    print(element["Phonetic"])

Even though this code looks longer, the advantage is that it becomes much more compact and modular once you start dealing with sufficiently large XML.

PS: I had the same issues with xmltodict. Still, compared with parsing the XML files using xml.etree.ElementTree, xmltodict was much easier to work with: the code base was smaller, and I didn't have to deal with the other inanities of the xml module.
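For comparison, here is a rough sketch of the same lookup done with the standard library's xml.etree.ElementTree instead of xmltodict. It is not from the original answer: `find_phonemic` is a hypothetical helper name, and the XML is a shortened copy of the sample from the question.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample data from the question.
xmldata = """<NewDataSet>
    <Root>
        <Phonemic>and</Phonemic>
        <Phonetic>nd</Phonetic>
    </Root>
    <Root>
        <Phonemic>comfortable</Phonemic>
        <Phonetic>comfetebl</Phonetic>
    </Root>
</NewDataSet>"""

# fromstring returns the top-level <NewDataSet> element.
dataset = ET.fromstring(xmldata)


def find_phonemic(dataset, phonetic):
    """Return the Phonemic text of the first <Root> whose Phonetic matches."""
    for root in dataset.iter("Root"):
        if root.findtext("Phonetic") == phonetic:
            return root.findtext("Phonemic")
    return None  # no matching entry


print(find_phonemic(dataset, "nd"))
```

This scans every <Root> on each call, so it is fine for occasional lookups but not for repeated queries over a large file.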

EDIT

The following code works for me:

import xmltodict
from collections import OrderedDict

xmldata = """<NewDataSet>
    <Root>
        <Phonemic>and</Phonemic>
        <Phonetic>nd</Phonetic>
        <Description/>
        <Start>0</Start>
        <End>8262</End>
    </Root>
    <Root>
        <Phonemic>comfortable</Phonemic>
        <Phonetic>comfetebl</Phonetic>
        <Description>adj</Description>
        <Start>61404</Start>
        <End>72624</End>
    </Root>
</NewDataSet>"""

obj = xmltodict.parse(xmldata)
obj = obj["NewDataSet"]
root_elements = obj["Root"] if type(obj) == OrderedDict else [obj["Root"]] 
# Above step ensures that root_elements is always a list
for element in root_elements:
    print(element["Phonetic"])

#2

Mu's answer worked for me. The only thing I had to change was the tricky step that ensures root_elements is always a list:

import xmltodict
from collections import OrderedDict

xmldata = """<NewDataSet>
    <Root>
        <Phonemic>and</Phonemic>
        <Phonetic>nd</Phonetic>
        <Description/>
        <Start>0</Start>
        <End>8262</End>
    </Root>
    <Root>
        <Phonemic>comfortable</Phonemic>
        <Phonetic>comfetebl</Phonetic>
        <Description>adj</Description>
        <Start>61404</Start>
        <End>72624</End>
    </Root>
</NewDataSet>"""

obj = xmltodict.parse(xmldata)
obj = obj["NewDataSet"]
root_elements = obj["Root"] if isinstance(obj["Root"], list) else [obj["Root"]]
# The step above ensures that root_elements is always a list:
# if obj["Root"] is already a list, use it as is; otherwise wrap the single element in a list.
for element in root_elements:
    print(element["Phonetic"])
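
To get from the list of Root entries to the lookup the question asks for (input nd, get back "and"), one option is to build an index keyed by the Phonetic value. This is a minimal sketch and not part of the original answers: `as_list` and `build_phonetic_index` are hypothetical helper names, and `parsed` is a hand-written literal that mirrors the structure xmltodict.parse returned in the question, so the snippet runs without the XML file.

```python
def as_list(value):
    """Return value unchanged if it is a list, otherwise wrap it in a list."""
    return value if isinstance(value, list) else [value]


def build_phonetic_index(parsed):
    """Map each Phonetic string to its Phonemic string."""
    roots = as_list(parsed["NewDataSet"]["Root"])
    return {root["Phonetic"]: root["Phonemic"] for root in roots}


# Hand-written stand-in for the result of xmltodict.parse (assumption).
parsed = {
    "NewDataSet": {
        "Root": [
            {"Phonemic": "and", "Phonetic": "nd", "Description": None,
             "Start": "0", "End": "8262"},
            {"Phonemic": "comfortable", "Phonetic": "comfetebl",
             "Description": "adj", "Start": "61404", "End": "72624"},
        ]
    }
}

index = build_phonetic_index(parsed)
print(index["nd"])
```

Building the index once makes each later lookup a single dictionary access, which matters when the data is large. If two entries share the same Phonetic value, the later one overwrites the earlier one in this sketch.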

#3

You can actually avoid the conversion to OrderedDict by setting an additional keyword parameter:

obj = xmltodict.parse(xmldata, dict_constructor=dict)

parse forwards its keyword arguments to _DictSAXHandler, where dict_constructor is set to OrderedDict by default.
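
If the data has already been parsed into nested OrderedDicts, a recursive conversion is another way to get plain dictionaries at every level, which a single dict() call does not do. This is a sketch under that assumption: `to_plain` is a hypothetical helper name, and `nested` is a shortened hand-written copy of the output shown in the question.

```python
from collections import OrderedDict


def to_plain(value):
    """Recursively convert mappings to dict and process lists element-wise."""
    if isinstance(value, dict):  # OrderedDict is a dict subclass
        return {key: to_plain(item) for key, item in value.items()}
    if isinstance(value, list):
        return [to_plain(item) for item in value]
    return value  # strings, None, and other leaves are returned unchanged


# Shortened stand-in for the nested OrderedDict from the question (assumption).
nested = OrderedDict([
    ("NewDataSet", OrderedDict([
        ("Root", [
            OrderedDict([("Phonemic", "and"), ("Phonetic", "nd"),
                         ("Description", None)]),
        ]),
    ])),
])

plain = to_plain(nested)
print(type(plain["NewDataSet"]["Root"][0]))
```

Passing dict_constructor=dict at parse time avoids this extra pass over the data, so the recursive version is mainly useful when the parsing step cannot be changed.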
