Converting XML to JSON using Python?

Time: 2022-08-21 08:59:50

I've seen a fair share of ungainly XML->JSON code on the web, and having interacted with Stack's users for a bit, I'm convinced that this crowd can help more than the first few pages of Google results can.

So, we're parsing a weather feed, and we need to populate weather widgets on a multitude of web sites. We're looking now into Python-based solutions.

This public weather.com RSS feed is a good example of what we'd be parsing (our actual weather.com feed contains additional information because of a partnership with them).

In a nutshell, how should we convert XML to JSON using Python?

15 solutions

#1


44  

There is no "one-to-one" mapping between XML and JSON, so converting one to the other necessarily requires some understanding of what you want to do with the results.

That being said, Python's standard library has several modules for parsing XML (including DOM, SAX, and ElementTree). As of Python 2.6, support for converting Python data structures to and from JSON is included in the json module.

So the infrastructure is there.
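
As a minimal sketch of that infrastructure, the standard library alone can do a basic conversion. The mapping used here (attributes under "@name" keys, mixed text under "#text", repeated tags collected into lists) is only one assumed convention, and the helper names element_to_dict and xml_to_json are made up for illustration:

```python
import json
import xml.etree.ElementTree as ET


def element_to_dict(elem):
    """Convert one Element into plain Python data (one possible mapping)."""
    # Attributes become "@name" keys so they cannot clash with child tags.
    node = {"@" + name: value for name, value in elem.attrib.items()}
    for child in elem:
        value = element_to_dict(child)
        if child.tag in node:
            # A repeated tag is collected into a list.
            if not isinstance(node[child.tag], list):
                node[child.tag] = [node[child.tag]]
            node[child.tag].append(value)
        else:
            node[child.tag] = value
    text = (elem.text or "").strip()
    if not node:
        return text  # leaf element without attributes: just its text
    if text:
        node["#text"] = text
    return node


def xml_to_json(xml_string):
    """Parse an XML string and serialize the resulting structure as JSON."""
    root = ET.fromstring(xml_string)
    return json.dumps({root.tag: element_to_dict(root)})


print(xml_to_json("<e><a>text</a><a>text</a></e>"))
```

Whether attributes, whitespace, or element order matter depends on what the JSON consumer needs, which is why a mapping like this has to be chosen rather than assumed.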

#2


214  

xmltodict (full disclosure: I wrote it) can help you convert your XML to a dict+list+string structure, following this "standard". It is Expat-based, so it's very fast and doesn't need to load the whole XML tree in memory.

Once you have that data structure, you can serialize it to JSON:

import xmltodict, json

o = xmltodict.parse('<e> <a>text</a> <a>text</a> </e>')
json.dumps(o) # '{"e": {"a": ["text", "text"]}}'
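
The point about not loading the whole tree can be illustrated with the standard library as well. The sketch below does not use xmltodict; it relies on xml.etree.ElementTree.iterparse, and the function name iter_items_as_json and the sample feed are assumptions made for the example:

```python
import io
import json
import xml.etree.ElementTree as ET


def iter_items_as_json(source, tag):
    """Yield one JSON string per <tag> element as the parser reaches it."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            # Flat record: child tag -> child text.
            record = {child.tag: (child.text or "") for child in elem}
            yield json.dumps(record)
            elem.clear()  # drop the children of the element just handled


feed = io.BytesIO(
    b"<feed>"
    b"<item><city>Oslo</city><temp>3</temp></item>"
    b"<item><city>Rome</city><temp>18</temp></item>"
    b"</feed>"
)
for line in iter_items_as_json(feed, "item"):
    print(line)
```

Each record is emitted as soon as its closing tag is seen, so memory use stays roughly proportional to one item rather than to the whole document.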

#3


12  

You can use the xmljson library to convert using different XML JSON conventions.

For example, this XML:

<p id="1">text</p>

translates via the BadgerFish convention into this:

{
  'p': {
    '@id': 1,
    '$': 'text'
  }
}

and via the GData convention into this (attributes are not supported):

{
  'p': {
    '$t': 'text'
  }
}

... and via the Parker convention into this (attributes are not supported):

{
  'p': 'text'
}
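
To make the difference between two of these conventions concrete, here is a hand-written sketch using only the standard library. It is not how xmljson is implemented; the function names badgerfish and parker are chosen for illustration, and unlike the output above this sketch keeps every value as a string instead of converting "1" to a number:

```python
import json
import xml.etree.ElementTree as ET


def _add_child(node, tag, value):
    """Store value under tag, switching to a list when the tag repeats."""
    if tag not in node:
        node[tag] = value
    elif isinstance(node[tag], list):
        node[tag].append(value)
    else:
        node[tag] = [node[tag], value]


def badgerfish(elem):
    """BadgerFish-style: attributes under '@name', text under '$'."""
    node = {"@" + name: value for name, value in elem.attrib.items()}
    text = (elem.text or "").strip()
    if text:
        node["$"] = text
    for child in elem:
        _add_child(node, child.tag, badgerfish(child))
    return node


def parker(elem):
    """Parker-style: attributes are dropped, leaf elements become their text."""
    if len(elem) == 0:
        return (elem.text or "").strip()
    node = {}
    for child in elem:
        _add_child(node, child.tag, parker(child))
    return node


p = ET.fromstring('<p id="1">text</p>')
print(json.dumps({p.tag: badgerfish(p)}))  # {"p": {"@id": "1", "$": "text"}}
print(json.dumps({p.tag: parker(p)}))      # {"p": "text"}
```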

It's possible to convert from XML to JSON and from JSON to XML using the same conventions:

>>> import json, xmljson
>>> from lxml.etree import fromstring, tostring
>>> xml = fromstring('<p id="1">text</p>')
>>> json.dumps(xmljson.badgerfish.data(xml))
'{"p": {"@id": 1, "$": "text"}}'
>>> xmljson.parker.etree({'ul': {'li': [1, 2]}})
# Creates [<ul><li>1</li><li>2</li></ul>]

Disclosure: I wrote this library. Hope it helps future searchers.

#4


5  

Here's the code I built for that. There's no parsing of the contents, just plain conversion.

from xml.dom import minidom
import simplejson as json
def parse_element(element):
    dict_data = dict()
    if element.nodeType == element.TEXT_NODE:
        dict_data['data'] = element.data
    if element.nodeType not in [element.TEXT_NODE, element.DOCUMENT_NODE, 
                                element.DOCUMENT_TYPE_NODE]:
        for item in element.attributes.items():
            dict_data[item[0]] = item[1]
    if element.nodeType not in [element.TEXT_NODE, element.DOCUMENT_TYPE_NODE]:
        for child in element.childNodes:
            child_name, child_dict = parse_element(child)
            if child_name in dict_data:
                try:
                    dict_data[child_name].append(child_dict)
                except AttributeError:
                    dict_data[child_name] = [dict_data[child_name], child_dict]
            else:
                dict_data[child_name] = child_dict 
    return element.nodeName, dict_data

if __name__ == '__main__':
    dom = minidom.parse('data.xml')
    # the with-block closes the file even if serialization raises
    with open('data.json', 'w') as f:
        f.write(json.dumps(parse_element(dom), sort_keys=True, indent=4))

#5


4  

You may want to have a look at http://designtheory.org/library/extrep/designdb-1.0.pdf. This project starts off with an XML to JSON conversion of a large library of XML files. There was much research done in the conversion, and the most simple intuitive XML -> JSON mapping was produced (it is described early in the document). In summary, convert everything to a JSON object, and put repeating blocks as a list of objects.

Here, "objects" means key/value pairs (dictionary in Python, hashmap in Java, object in JavaScript).

There is no mapping back to XML that yields an identical document, because it is unknown whether a key/value pair was an attribute or a <key>value</key> element; that information is lost.
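
A small sketch of that information loss, assuming a simplified version of the mapping described above (the helper name flatten is made up, and repeated blocks are left out for brevity):

```python
import xml.etree.ElementTree as ET


def flatten(elem):
    """Merge attributes and child elements into one dict of key/value pairs."""
    if len(elem) == 0 and not elem.attrib:
        return (elem.text or "").strip()
    node = dict(elem.attrib)
    for child in elem:
        node[child.tag] = flatten(child)
    return node


as_attribute = ET.fromstring('<user name="Ann"/>')
as_element = ET.fromstring("<user><name>Ann</name></user>")

# Two different documents collapse to the same structure.
print(flatten(as_attribute))
print(flatten(as_element))
```

Both calls produce the same dictionary, so nothing in the result says which XML form to regenerate.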

If you ask me, attributes are a hack to begin with; then again, they worked well for HTML.

#6


4  

There is a method to transport XML-based markup as JSON which allows it to be losslessly converted back to its original form. See http://jsonml.org/.

It's a kind of XSLT of JSON. I hope you find it helpful.
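
As a rough sketch of the JsonML idea, each element becomes an array holding the tag name, an optional attribute object, and then the children in document order, with text kept as plain strings. The function name to_jsonml is an assumption, and this version ignores comments, processing instructions, and namespaces:

```python
import json
import xml.etree.ElementTree as ET


def to_jsonml(elem):
    """Encode an Element as [tag, {attributes}?, child-or-text, ...]."""
    node = [elem.tag]
    if elem.attrib:
        node.append(dict(elem.attrib))
    if elem.text:
        node.append(elem.text)
    for child in elem:
        node.append(to_jsonml(child))
        if child.tail:
            # Text that follows a child element belongs to the parent.
            node.append(child.tail)
    return node


doc = ET.fromstring('<p id="1">Hello <b>world</b>!</p>')
print(json.dumps(to_jsonml(doc)))
```

Because element order and mixed text are preserved, the array form can be turned back into equivalent markup, which a plain dict mapping cannot guarantee.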

#7


3  

Well, probably the simplest way is to parse the XML into dictionaries and then serialize them with simplejson.

#8


2  

While the built-in libs for XML parsing are quite good I am partial to lxml.

But for parsing RSS feeds, I'd recommend Universal Feed Parser, which can also parse Atom. Its main advantage is that it can digest most feeds, even malformed ones.

Python 2.6 already includes a JSON parser, but a newer version with improved speed is available as simplejson.

With these tools, building your app shouldn't be that difficult.

#9


2  

I'd suggest not going for a direct conversion. Convert XML to an object, then from the object to JSON.

In my opinion, this gives a cleaner definition of how the XML and JSON correspond.

It takes time to get right and you may even write tools to help you with generating some of it, but it would look roughly like this:

# xml_node is expected to be an lxml element, since .xpath() is an lxml API
class Channel:
  def __init__(self):
    self.items = []
    self.title = ""

  def from_xml( self, xml_node ):
    self.title = xml_node.xpath("title/text()")[0]
    for x in xml_node.xpath("item"):
      item = Item()
      item.from_xml( x )
      self.items.append( item )

  def to_json( self ):
    # builds a JSON-ready dict; pass the result to json.dumps to get a string
    retval = {}
    retval['title'] = self.title
    retval['items'] = []
    for x in self.items:
      retval['items'].append( x.to_json() )
    return retval

class Item:
  def __init__(self):
    ...

  def from_xml( self, xml_node ):
    ...

  def to_json( self ):
    ...
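
For comparison, here is a compact, runnable variant of the same idea using dataclasses and the standard-library ElementTree instead of lxml. The Item fields (title, link) and the sample feed are hypothetical; the answer above leaves Item unspecified:

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import asdict, dataclass, field


@dataclass
class Item:
    title: str = ""
    link: str = ""  # hypothetical field, not taken from the answer above

    @classmethod
    def from_xml(cls, node):
        return cls(
            title=node.findtext("title", default=""),
            link=node.findtext("link", default=""),
        )


@dataclass
class Channel:
    title: str = ""
    items: list = field(default_factory=list)

    @classmethod
    def from_xml(cls, node):
        return cls(
            title=node.findtext("title", default=""),
            items=[Item.from_xml(x) for x in node.findall("item")],
        )

    def to_json(self):
        # asdict() recurses into the nested Item dataclasses.
        return json.dumps(asdict(self))


rss = (
    "<channel><title>Weather</title>"
    "<item><title>Today</title><link>http://example.com/today</link></item>"
    "</channel>"
)
channel = Channel.from_xml(ET.fromstring(rss))
print(channel.to_json())
```

The intermediate objects are the place to decide which fields the widgets actually need, independent of how the feed happens to lay them out.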

#10


2  

When I do anything with XML in python I almost always use the lxml package. I suspect that most people use lxml. You could use xmltodict but you will have to pay the penalty of parsing the XML again.

To convert XML to JSON with lxml, you:

  1. Parse the XML document with lxml
  2. Convert the lxml tree to a dict
  3. Convert the dict to JSON

I use the following class in my projects. Use the toJson method.

from lxml import etree 
import json


class Element:
    '''
    Wrapper on the etree.Element class.  Extends functionality to output element
    as a dictionary.
    '''

    def __init__(self, element):
        '''
        :param: element a normal etree.Element instance
        '''
        self.element = element

    def toDict(self):
        '''
        Returns the element as a dictionary.  This includes all child elements.
        '''
        rval = {
            self.element.tag: {
                'attributes': dict(self.element.items()),
            },
        }
        for child in self.element:
            rval[self.element.tag].update(Element(child).toDict())
        return rval


class XmlDocument:
    '''
    Wraps lxml to provide:
        - cleaner access to some common lxml.etree functions
        - converter from XML to dict
        - converter from XML to json
    '''
    def __init__(self, xml = '<empty/>', filename=None):
        '''
        There are two ways to initialize the XmlDocument contents:
            - String
            - File

        You don't have to initialize the XmlDocument during instantiation
        though.  You can do it later with the 'set' method.  If you choose to
        initialize later XmlDocument will be initialized with "<empty/>".

        :param: xml Set this argument if you want to parse from a string.
        :param: filename Set this argument if you want to parse from a file.
        '''
        self.set(xml, filename) 

    def set(self, xml=None, filename=None):
        '''
        Use this to set or reset the contents of the XmlDocument.

        :param: xml Set this argument if you want to parse from a string.
        :param: filename Set this argument if you want to parse from a file.
        '''
        if filename is not None:
            self.tree = etree.parse(filename)
            self.root = self.tree.getroot()
        else:
            self.root = etree.fromstring(xml)
            self.tree = etree.ElementTree(self.root)


    def dump(self):
        etree.dump(self.root)

    def getXml(self):
        '''
        return document as a string
        '''
        return etree.tostring(self.root)

    def xpath(self, xpath):
        '''
        Return elements that match the given xpath.

        :param: xpath
        '''
        return self.tree.xpath(xpath)

    def nodes(self):
        '''
        Return all elements
        '''
        return self.root.iter('*')

    def toDict(self):
        '''
        Convert to a python dictionary
        '''
        return Element(self.root).toDict()

    def toJson(self, indent=None):
        '''
        Convert to JSON
        '''
        return json.dumps(self.toDict(), indent=indent)


if __name__ == "__main__":
    xml='''<system>
    <product>
        <demod>
            <frequency value='2.215' units='MHz'>
                <blah value='1'/>
            </frequency>
        </demod>
    </product>
</system>
'''
    doc = XmlDocument(xml)
    print(doc.toJson(indent=4))

The output from the built-in main is:

{
    "system": {
        "attributes": {}, 
        "product": {
            "attributes": {}, 
            "demod": {
                "attributes": {}, 
                "frequency": {
                    "attributes": {
                        "units": "MHz", 
                        "value": "2.215"
                    }, 
                    "blah": {
                        "attributes": {
                            "value": "1"
                        }
                    }
                }
            }
        }
    }
}

Which is a transformation of this xml:

<system>
    <product>
        <demod>
            <frequency value='2.215' units='MHz'>
                <blah value='1'/>
            </frequency>
        </demod>
    </product>
</system>

#11


1  

Try jsonpickle or, if you're using feedparser, feed_parser_to_json.py.

#12


1  

I found that for simple XML snippets, using a regular expression saves trouble. For example:

# <user><name>Happy Man</name>...</user>
import re
# [^<]+ also matches names containing spaces, which \w+ would miss
names = re.findall(r'<name>([^<]+)</name>', xml_string)
# do something with names

To do it by XML parsing, as @Dan said, there is no one-size-fits-all solution because the data differs. My suggestion is to use lxml. Although it does not go all the way to JSON, lxml.objectify gives quite good results:

>>> from lxml import objectify
>>> root = objectify.fromstring("""
... <root xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
...   <a attr1="foo" attr2="bar">1</a>
...   <a>1.2</a>
...   <b>1</b>
...   <b>true</b>
...   <c>what?</c>
...   <d xsi:nil="true"/>
... </root>
... """)

>>> print(objectify.dump(root))
root = None [ObjectifiedElement]
    a = 1 [IntElement]
      * attr1 = 'foo'
      * attr2 = 'bar'
    a = 1.2 [FloatElement]
    b = 1 [IntElement]
    b = True [BoolElement]
    c = 'what?' [StringElement]
    d = None [NoneElement]
      * xsi:nil = 'true'

#13


1  

My answer addresses the specific (and somewhat common) case where you don't really need to convert the entire xml to json, but what you need is to traverse/access specific parts of the xml, and you need it to be fast, and simple (using json/dict-like operations).

Approach

For this, it is important to note that parsing an xml to etree using lxml is super fast. The slow part in most of the other answers is the second pass: traversing the etree structure (usually in python-land), converting it to json.

Which leads me to the approach I found best for this case: parsing the xml using lxml, and then wrapping the etree nodes (lazily), providing them with a dict-like interface.

Code

Here's the code:

from collections.abc import Mapping  # plain "collections" no longer exports Mapping on current Python 3
import lxml.etree

class ETreeDictWrapper(Mapping):

    def __init__(self, elem, attr_prefix = '@', list_tags = ()):
        self.elem = elem
        self.attr_prefix = attr_prefix
        self.list_tags = list_tags

    def _wrap(self, e):
        if isinstance(e, str):  # was "basestring" on Python 2
            return e
        if len(e) == 0 and len(e.attrib) == 0:
            return e.text
        return type(self)(
            e,
            attr_prefix = self.attr_prefix,
            list_tags = self.list_tags,
        )

    def __getitem__(self, key):
        if key.startswith(self.attr_prefix):
            return self.elem.attrib[key[len(self.attr_prefix):]]
        else:
            subelems = [ e for e in self.elem.iterchildren() if e.tag == key ]
            if len(subelems) > 1 or key in self.list_tags:
                return [ self._wrap(x) for x in subelems ]
            elif len(subelems) == 1:
                return self._wrap(subelems[0])
            else:
                raise KeyError(key)

    def __iter__(self):
        return iter(set( k.tag for k in self.elem) |
                    set( self.attr_prefix + k for k in self.elem.attrib ))

    def __len__(self):
        return len(self.elem) + len(self.elem.attrib)

    # defining __contains__ is not necessary, but improves speed
    def __contains__(self, key):
        if key.startswith(self.attr_prefix):
            return key[len(self.attr_prefix):] in self.elem.attrib
        else:
            return any( e.tag == key for e in self.elem.iterchildren() )


def xml_to_dictlike(xmlstr, attr_prefix = '@', list_tags = ()):
    t = lxml.etree.fromstring(xmlstr)
    return ETreeDictWrapper(
        t,
        attr_prefix = attr_prefix,
        list_tags = set(list_tags),
    )

This implementation is not complete, e.g., it doesn't cleanly support cases where an element has both text and attributes, or both text and children (only because I didn't need it when I wrote it...) It should be easy to improve it, though.

Speed

In my specific use case, where I needed to process only specific elements of the xml, this approach gave a surprising and striking speedup by a factor of 70 (!) compared to using @Martin Blech's xmltodict and then traversing the dict directly.

Bonus

As a bonus, since our structure is already dict-like, we get another alternative implementation of xml2json for free. We just need to pass our dict-like structure to json.dumps. Something like:

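import json

def xml_to_json(xmlstr, **kwargs):
    x = xml_to_dictlike(xmlstr, **kwargs)
    # json.dumps only serializes real dicts, so convert each wrapper via default=dict
    return json.dumps(x, default=dict)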

If your xml includes attributes, you'd need to use some alphanumeric attr_prefix (e.g. "ATTR_"), to ensure the keys are valid json keys.

I haven't benchmarked this part.

#14


1  

This stuff here is actively maintained and so far is my favorite: xml2json in python

#15


1  

To anyone who may still need this: here's newer, simple code to do this conversion.

from xml.etree import ElementTree as ET


def parseXmlToJson(xml):
  # Recursively build a dict keyed by child tag; leaf elements map to their text.
  response = {}

  for child in list(xml):
    if len(list(child)) > 0:
      response[child.tag] = parseXmlToJson(child)
    else:
      response[child.tag] = child.text or ''

    # one-liner equivalent
    # response[child.tag] = parseXmlToJson(child) if len(list(child)) > 0 else child.text or ''

  return response


# Define the function before calling it, and pass the root element rather than the ElementTree.
xml    = ET.parse('FILE_NAME.xml').getroot()
parsed = parseXmlToJson(xml)
