Splitting a large XML file in Python

Time: 2021-11-06 15:44:17

I'm looking to split a huge XML file into smaller pieces. I'd like to scan through the file looking for a specific tag, grab everything between its opening and closing tags, save that to a file, and then continue through the rest of the file.

My issue is finding a clean way to note the start and end of the tags, so that I can grab the text inside as I scan through the file with "for line in f".

I'd rather not use sentinel variables. Is there a pythonic way to get this done?

The file is too big to read into memory.

5 solutions

#1


9  

There are two common ways to handle XML data.

One is called DOM, which stands for Document Object Model. This style of XML parsing is probably what you have seen when looking at documentation, because it reads the entire XML into memory to create the object model.

The second is called SAX, which is a streaming method. The parser starts reading the XML and sends signals to your code about certain events, e.g. when a new start tag is found.

So SAX is clearly what you need for your situation. SAX-style parsers can be found in the Python standard library under xml.sax and xml.parsers.expat.
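As a rough illustration of the SAX approach described above, here is a minimal sketch that streams a document and re-serializes each element with a given tag as its own chunk. The class name TagSplitter and the on_chunk callback are hypothetical names chosen for this example; only xml.sax and xml.sax.saxutils.XMLGenerator come from the standard library. It assumes namespace processing is left at its default (off).

```python
import io
import xml.sax
from xml.sax.saxutils import XMLGenerator


class TagSplitter(xml.sax.ContentHandler):
    """Hypothetical helper: hands each <tag>...</tag> subtree to on_chunk as text."""

    def __init__(self, tag, on_chunk):
        super().__init__()
        self.tag = tag
        self.on_chunk = on_chunk  # called with one XML string per matching element
        self.depth = 0            # nesting depth inside the element being copied
        self.buffer = None        # text buffer for the current chunk
        self.writer = None        # XMLGenerator that re-serializes the chunk

    def startElement(self, name, attrs):
        if self.depth == 0 and name == self.tag:
            # A new target element begins: start a fresh chunk.
            self.buffer = io.StringIO()
            self.writer = XMLGenerator(self.buffer, encoding="utf-8")
        if self.writer is not None:
            self.depth += 1
            self.writer.startElement(name, attrs)

    def characters(self, content):
        if self.writer is not None:
            self.writer.characters(content)

    def endElement(self, name):
        if self.writer is None:
            return
        self.writer.endElement(name)
        self.depth -= 1
        if self.depth == 0:
            # The target element is closed: deliver the chunk and reset.
            self.on_chunk(self.buffer.getvalue())
            self.buffer = self.writer = None
```

To split a real file, something like xml.sax.parse("huge.xml", TagSplitter("item", save_chunk)) should work, where save_chunk is a function of your own that writes each string to a new file. Only one chunk is held in memory at a time.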

#2


6  

I have had success using the cElementTree.iterparse method for a similar task.

I had a large XML document with repeated entries tagged 'resFrame', and I wanted to keep only the entries with a specific id.

The source document had this structure:

<snapDoc>
  <bucket>....</bucket>
  <bucket>....</bucket>
  <bucket>....</bucket>
  ...
  <resFrame><id>234234</id>.....</resFrame>
  <frame><id>344234</id>.....</frame>
  <resFrame>...</resFrame>
  <frame>...</frame>
</snapDoc>

I used the following script to create a smaller document with the same structure, all of the bucket entries, and only the resFrame entries with a specific id.

#!/usr/bin/env python3

# xml.etree.cElementTree was removed in Python 3.9; ElementTree now uses
# the C accelerator automatically when it is available.
import xml.etree.ElementTree as ElementTree

start = '''<?xml version="1.0" encoding="UTF-8"?>
<snapDoc>'''

def main():
    print(start)
    context = ElementTree.iterparse('snap.xml', events=("start", "end"))
    context = iter(context)
    event, root = next(context)  # the first event is the start of the root element

    for event, elem in context:
        if event == "end":
            if elem.tag == 'bucket':  # write out all <bucket> entries
                elem.tail = None
                print(ElementTree.tostring(elem, encoding="unicode"))
            if elem.tag == 'resFrame':
                # only write out resFrame entries with this id;
                # findtext returns None instead of raising if <id> is missing
                if elem.findtext("id") == ":4:39644:482:-1:1":
                    elem.tail = None
                    print(ElementTree.tostring(elem, encoding="unicode"))
            if elem.tag in ['bucket', 'frame', 'resFrame']:
                root.clear()  # done with this section: clear the tree to save memory
    print("</snapDoc>")

main()

#3


6  

You might consider using ElementTree's iterparse function for this situation.
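For the splitting task in the question, a minimal iterparse sketch might look like the following. The function name split_by_tag and the default output pattern part_00000.xml are assumptions made for this example, not anything from the question. It also assumes the target elements are not nested inside each other.

```python
import xml.etree.ElementTree as ET


def split_by_tag(source, tag, name_pattern="part_{:05d}.xml"):
    """Write every <tag> element found in source to its own file.

    Returns the list of file names written.
    """
    written = []
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event: start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            elem.tail = None  # drop whitespace that follows the element
            filename = name_pattern.format(len(written))
            ET.ElementTree(elem).write(
                filename, encoding="utf-8", xml_declaration=True
            )
            written.append(filename)
            root.clear()  # release everything parsed so far
    return written
```

Note that memory is only released when a matching element ends, so a long run of non-matching elements between matches would still accumulate under the root.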

#4


1  

How serendipitous! Will Larson just made a good post about Handling Very Large CSV and XML File in Python.

The main takeaways seem to be to use the xml.sax module, as Van mentioned, and to write some higher-level helper functions that abstract away the details of the low-level SAX API.
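One way such a helper could look is sketched below. It is not taken from the linked post; for_each_element and _ElementTextHandler are hypothetical names, and the sketch assumes the target elements are not nested inside each other.

```python
import xml.sax


class _ElementTextHandler(xml.sax.ContentHandler):
    """Calls callback(attrs_dict, text) for each element named tag."""

    def __init__(self, tag, callback):
        super().__init__()
        self.tag = tag
        self.callback = callback
        self.attrs = None  # attributes of the element being collected, if any
        self.parts = []    # text fragments seen inside that element

    def startElement(self, name, attrs):
        if name == self.tag:
            self.attrs = dict(attrs.items())
            self.parts = []

    def characters(self, content):
        # The parser may deliver text in several pieces, so collect them all.
        if self.attrs is not None:
            self.parts.append(content)

    def endElement(self, name):
        if name == self.tag and self.attrs is not None:
            self.callback(self.attrs, "".join(self.parts))
            self.attrs = None


def for_each_element(source, tag, callback):
    """Stream source (a file name or file object) and report each tag element."""
    xml.sax.parse(source, _ElementTextHandler(tag, callback))
```

The caller then only supplies a tag name and a function, and never touches the handler class directly.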

#5


0  

This is an old but very good article from Uche Ogbuji's (also very good) Python & XML column: "Decomposition, Process, Recomposition". It covers your exact question and uses the standard library's sax module, as the other answers have suggested.
