How do I parse XML in Python?

Time: 2021-05-04 20:44:42

I have many rows in a database that contain XML, and I'm trying to write a Python script that goes through those rows and counts how many instances of a particular node attribute show up. For instance, my tree looks like:

<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>

How can I access the attribute values 1 and 2 in the XML using Python?

12 answers

#1


547  

I suggest ElementTree. There are other compatible implementations of the same API, such as lxml and cElementTree (the latter in the Python standard library itself); in this context, what they chiefly add is even more speed. The ease of programming depends on the API, which ElementTree defines.

After building an Element instance e from the XML, e.g. with the XML function, or by parsing a file with something like


import xml.etree.ElementTree
e = xml.etree.ElementTree.parse('thefile.xml').getroot()

or any of the many other ways shown in the ElementTree documentation, you just do something like:

for atype in e.findall('bar/type'):  # path is relative to the root <foo>
    print(atype.get('foobar'))

and similar, usually pretty simple, code patterns.
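To tie this back to the question's goal of counting attribute values, here is a minimal sketch using only the standard library. The function name count_attribute and the constant SAMPLE are illustrative choices, not part of the original answer.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Sample document from the question.
SAMPLE = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""


def count_attribute(xml_text, tag, attribute):
    """Count how often each value of `attribute` appears on `tag` elements."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    # iter() walks the whole tree, so nesting depth does not matter.
    for element in root.iter(tag):
        value = element.get(attribute)
        if value is not None:  # skip elements that lack the attribute
            counts[value] += 1
    return counts


if __name__ == "__main__":
    print(count_attribute(SAMPLE, "type", "foobar"))
```

Using iter() rather than findall() avoids hard-coding the path to the type elements, which may or may not be what you want if the same tag appears at several levels.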

#2


353  

minidom is the quickest to get going with and is pretty straightforward:

XML:


<data>
    <items>
        <item name="item1"></item>
        <item name="item2"></item>
        <item name="item3"></item>
        <item name="item4"></item>
    </items>
</data>

PYTHON:


from xml.dom import minidom
xmldoc = minidom.parse('items.xml')
itemlist = xmldoc.getElementsByTagName('item')
print(len(itemlist))
print(itemlist[0].attributes['name'].value)
for s in itemlist:
    print(s.attributes['name'].value)

OUTPUT:

4
item1
item1
item2
item3
item4
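Applied to the XML from the question, the same minidom approach might look like the sketch below. It parses from a string with parseString instead of a file; the helper name attribute_values is an assumption made for illustration.

```python
from xml.dom import minidom

# Sample document from the question.
SAMPLE = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""


def attribute_values(xml_text, tag, attribute):
    """Return the value of `attribute` for every `tag` element, in document order."""
    doc = minidom.parseString(xml_text)
    # getElementsByTagName searches the whole document, not just direct children.
    return [node.getAttribute(attribute) for node in doc.getElementsByTagName(tag)]


if __name__ == "__main__":
    print(attribute_values(SAMPLE, "type", "foobar"))  # ['1', '2']
```

Note that getAttribute returns an empty string when the attribute is missing, so elements without foobar would show up as '' in the list.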

#3


190  

You can use BeautifulSoup:

>>> from bs4 import BeautifulSoup
>>> x = """<foo>
...    <bar>
...       <type foobar="1"/>
...       <type foobar="2"/>
...    </bar>
... </foo>"""
>>> y = BeautifulSoup(x, "html.parser")  # name the parser explicitly
>>> y.foo.bar.type["foobar"]
'1'

>>> y.foo.bar.find_all("type")
[<type foobar="1"></type>, <type foobar="2"></type>]

>>> y.foo.bar.find_all("type")[0]["foobar"]
'1'
>>> y.foo.bar.find_all("type")[1]["foobar"]
'2'

#4


71  

There are many options out there. cElementTree looks excellent if speed and memory usage are an issue. It has very little overhead compared to simply reading in the file using readlines.

The relevant metrics can be found in the table below, copied from the cElementTree website:

library                          time       space
xml.dom.minidom (Python 2.1)     6.3 s      80000K
gnosis.objectify                 2.0 s      22000K
xml.dom.minidom (Python 2.4)     1.4 s      53000K
ElementTree 1.2                  1.6 s      14500K
ElementTree 1.2.4/1.3            1.1 s      14500K
cDomlette (C extension)          0.540 s    20500K
PyRXPU (C extension)             0.175 s    10850K
libxml2 (C extension)            0.098 s    16000K
readlines (read as utf-8)        0.093 s    8850K
cElementTree (C extension)  -->  0.047 s    4900K  <--
readlines (read as ascii)        0.032 s    5050K

As pointed out by @jfs, cElementTree comes bundled with Python:

  • Python 2: from xml.etree import cElementTree as ElementTree.
  • Python 3: from xml.etree import ElementTree (the accelerated C version is used automatically).

#5


35  

lxml.objectify is really simple.

Taking your sample text:

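from collections import defaultdict

from lxml import objectify

# Sample XML from the question.
text = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""

count = defaultdict(int)

root = objectify.fromstring(text)

# root.bar.type iterates over all sibling <type> elements.
for item in root.bar.type:
    count[item.attrib.get("foobar")] += 1

print(dict(count))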

Output:

{'1': 1, '2': 1}

#6


27  

I suggest xmltodict for simplicity.

It parses your XML to an OrderedDict:

>>> e = '''<foo>
...          <bar>
...              <type foobar="1"/>
...              <type foobar="2"/>
...          </bar>
...        </foo>'''

>>> import xmltodict
>>> result = xmltodict.parse(e)
>>> result

OrderedDict([(u'foo', OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))]))])

>>> result['foo']

OrderedDict([(u'bar', OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])]))])

>>> result['foo']['bar']

OrderedDict([(u'type', [OrderedDict([(u'@foobar', u'1')]), OrderedDict([(u'@foobar', u'2')])])])

#7


17  

Python has an interface to the expat XML parser:

xml.parsers.expat

It's a non-validating parser, so XML that is well-formed but violates its DTD will not be caught (malformed XML still raises an error). But if you know your file is correct, then this is pretty good: you'll probably get the exact info you want, and you can discard the rest on the fly.

from xml.parsers import expat

stringofxml = """<foo>
    <bar>
        <type arg="value" />
        <type arg="value" />
        <type arg="value" />
    </bar>
    <bar>
        <type arg="value" />
    </bar>
</foo>"""
count = 0

def start(name, attr):
    # Called once for every opening tag.
    global count
    if name == 'type':
        count += 1

p = expat.ParserCreate()
p.StartElementHandler = start
p.Parse(stringofxml)

print(count)  # prints 4
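The example above counts elements; to count attribute values as the question asks, the start handler can look at its second argument, which expat passes as a dict of attribute names to values. The following is a sketch under that assumption, with the names count_attribute and SAMPLE chosen for illustration. It uses a closure instead of a global counter.

```python
from collections import Counter
from xml.parsers import expat

# Sample document from the question.
SAMPLE = """<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""


def count_attribute(xml_text, tag, attribute):
    """Count values of `attribute` on `tag` elements without building a tree."""
    counts = Counter()

    def on_start(name, attrs):
        # attrs maps attribute names to their string values.
        if name == tag and attribute in attrs:
            counts[attrs[attribute]] += 1

    parser = expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)  # True marks this as the final chunk of input
    return counts


if __name__ == "__main__":
    print(count_attribute(SAMPLE, "type", "foobar"))
```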

#8


8  

Here is some very simple but effective code using cElementTree.

import sys

try:
    # Python 2, and Python 3 before 3.9
    import xml.etree.cElementTree as ET
except ImportError:
    # Python 3.9+ removed cElementTree; ElementTree uses the C accelerator itself
    import xml.etree.ElementTree as ET


def exit_err(message):
    # Helper assumed by the original snippet: report the error and exit.
    sys.exit(message)


def find_in_tree(tree, node):
    found = tree.find(node)
    if found is None:
        print("No %s in file" % node)
        found = []
    return found


# Parse an XML file (specify the path)
def_file = "xml_file_name.xml"
try:
    dom = ET.parse(def_file)
    root = dom.getroot()
except (OSError, ET.ParseError):
    exit_err("Unable to open and parse input definition file: " + def_file)

# Find node 'myNode'; iterating over the result yields its child nodes
fwdefs = find_in_tree(root, "myNode")

Source:

http://www.snip2code.com/Snippet/991/python-xml-parse?fromPage=1

#9


7  

Just to add another possibility, you can use untangle, a simple XML-to-Python-object library. Here is an example:

Installation

pip install untangle

Usage

Your XML file (slightly changed):

<foo>
   <bar name="bar_name">
      <type foobar="1"/>
   </bar>
</foo>

Accessing the attributes with untangle:

import untangle

obj = untangle.parse('/path_to_xml_file/file.xml')

print(obj.foo.bar['name'])
print(obj.foo.bar.type['foobar'])

The output will be:

bar_name
1

More information about untangle can be found in its documentation. Also, if you are curious, lists of tools for working with XML in Python are available online; you will see that the most common ones were already mentioned in previous answers.

#10


6  

I find Python's xml.dom and xml.dom.minidom quite easy to use. Keep in mind that a DOM holds the whole document in memory, so it isn't a good fit for large amounts of XML; if your input is fairly small, it will work fine.
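For inputs too large for a DOM, one standard-library alternative (not part of the original answer) is xml.etree.ElementTree.iterparse, which processes elements as they are read. The sketch below is a minimal illustration; the function name count_attribute_streaming is hypothetical, and the in-memory BytesIO stands in for a real file.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Sample document from the question, as bytes to mimic a file on disk.
SAMPLE = b"""<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>"""


def count_attribute_streaming(source, tag, attribute):
    """Count attribute values while parsing incrementally.

    `source` may be a file name or a binary file object.
    """
    counts = Counter()
    # "end" events fire once an element has been read completely.
    for _event, element in ET.iterparse(source, events=("end",)):
        if element.tag == tag:
            value = element.get(attribute)
            if value is not None:
                counts[value] += 1
        # Drop the element's children, text, and attributes once handled.
        element.clear()
    return counts


if __name__ == "__main__":
    print(count_attribute_streaming(io.BytesIO(SAMPLE), "type", "foobar"))
```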

#11


4  

I might suggest declxml.

Full disclosure: I wrote this library because I was looking for a way to convert between XML and Python data structures without needing to write dozens of lines of imperative parsing/serialization code with ElementTree.

With declxml, you use processors to declaratively define the structure of your XML document and how to map between XML and Python data structures. Processors are used for both serialization and parsing, as well as for a basic level of validation.

Parsing into Python data structures is straightforward:

import declxml as xml

xml_string = """
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>
"""

processor = xml.dictionary('foo', [
    xml.dictionary('bar', [
        xml.array(xml.integer('type', attribute='foobar'))
    ])
])

xml.parse_from_string(processor, xml_string)

Which produces the output:

{'bar': {'foobar': [1, 2]}}

You can also use the same processor to serialize data to XML:

data = {'bar': {
    'foobar': [7, 3, 21, 16, 11]
}}

xml.serialize_to_string(processor, data, indent='    ')

Which produces the following output:

<?xml version="1.0" ?>
<foo>
    <bar>
        <type foobar="7"/>
        <type foobar="3"/>
        <type foobar="21"/>
        <type foobar="16"/>
        <type foobar="11"/>
    </bar>
</foo>

If you want to work with objects instead of dictionaries, you can define processors to transform data to and from objects as well.

import declxml as xml

class Bar:

    def __init__(self):
        self.foobars = []

    def __repr__(self):
        return 'Bar(foobars={})'.format(self.foobars)


xml_string = """
<foo>
   <bar>
      <type foobar="1"/>
      <type foobar="2"/>
   </bar>
</foo>
"""

processor = xml.dictionary('foo', [
    xml.user_object('bar', Bar, [
        xml.array(xml.integer('type', attribute='foobar'), alias='foobars')
    ])
])

xml.parse_from_string(processor, xml_string)

Which produces the following output:

{'bar': Bar(foobars=[1, 2])}

#12


3  

import xml.etree.ElementTree as ET
data = '''<foo>
           <bar>
               <type foobar="1"/>
               <type foobar="2"/>
          </bar>
       </foo>'''
tree = ET.fromstring(data)
lst = tree.findall('bar/type')
for item in lst:
    print(item.get('foobar'))

This prints the value of each foobar attribute.
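Since the question mentions XML stored in many database rows, the same findall pattern can be run over each row and the results accumulated. The sketch below assumes a hypothetical SQLite table docs with a text column body; the real schema and database driver are not given in the question, so treat those names as placeholders.

```python
import sqlite3
import xml.etree.ElementTree as ET
from collections import Counter


def count_attribute_in_rows(rows, path, attribute):
    """Accumulate attribute-value counts over rows of the form (xml_text,)."""
    counts = Counter()
    for (xml_text,) in rows:
        root = ET.fromstring(xml_text)
        # path is relative to each document's root, e.g. 'bar/type'.
        for element in root.findall(path):
            value = element.get(attribute)
            if value is not None:
                counts[value] += 1
    return counts


# Hypothetical in-memory table standing in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (body TEXT)")
doc = '<foo><bar><type foobar="1"/><type foobar="2"/></bar></foo>'
conn.executemany("INSERT INTO docs (body) VALUES (?)", [(doc,), (doc,)])

totals = count_attribute_in_rows(
    conn.execute("SELECT body FROM docs"), "bar/type", "foobar"
)
print(totals)
conn.close()
```

Passing the cursor directly keeps only one row's XML in memory at a time, which matters if the table is large.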
