How do I fetch an XML document and parse it with Python Twisted?

I want a fast way to grab a URL and parse it while streaming. Ideally this should be super fast. My language of choice is Python. I have an intuition that twisted can do this but I'm at a loss to find an example.

2 solutions

#1

If you need to handle HTTP responses in a streaming fashion, there are a few options.

You can do it via downloadPage:

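from xml.sax import make_parser
from twisted.web.client import downloadPage

class StreamingXMLParser:
    """File-like object that feeds each chunk to an incremental SAX parser."""

    def __init__(self):
        self._parser = make_parser()
        # Call self._parser.setContentHandler(...) here to act on parse events.

    def write(self, data):
        self._parser.feed(data)

    def close(self):
        # Signal end of input so the parser can finish the document.
        self._parser.close()

parser = StreamingXMLParser()
d = downloadPage(url, parser)
# d fires when the response is completely received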

This works because downloadPage writes the response body to the file-like object passed to it. Here, an object with write and close methods satisfies that requirement, but it parses the data incrementally as XML instead of writing it to disk.
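
To make the role of the file-like object concrete, here is a minimal sketch that needs only the standard library. It attaches a hypothetical `TagCounter` content handler (not part of the original answer) to the SAX parser and feeds the parser by hand in small chunks, standing in for the `write` calls that downloadPage would make as data arrives. The `parse_in_chunks` helper and its chunk size are assumptions for illustration.

```python
from xml.sax import make_parser
from xml.sax.handler import ContentHandler


class TagCounter(ContentHandler):
    """Hypothetical handler: counts start tags as they are parsed."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


class StreamingXMLParser:
    """File-like object: write() feeds the parser, close() finishes it."""

    def __init__(self, handler):
        self._parser = make_parser()
        self._parser.setContentHandler(handler)

    def write(self, data):
        self._parser.feed(data)

    def close(self):
        self._parser.close()


def parse_in_chunks(document, chunk_size=8):
    """Simulate a streamed download by writing the document piece by piece."""
    handler = TagCounter()
    sink = StreamingXMLParser(handler)
    for start in range(0, len(document), chunk_size):
        sink.write(document[start:start + chunk_size])
    sink.close()
    return handler.counts


counts = parse_in_chunks(b"<feed><item/><item/></feed>")
```

The chunks deliberately split tags in the middle; the incremental parser buffers partial input, which is what makes this approach suitable for data arriving off the network.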

Another approach is to hook into things at the HTTPPageGetter level. HTTPPageGetter is the protocol used internally by getPage.

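from xml.sax import make_parser
from twisted.internet import reactor
from twisted.web.client import HTTPClientFactory, HTTPPageGetter

class StreamingXMLParsingHTTPClient(HTTPPageGetter):
    def connectionMade(self):
        HTTPPageGetter.connectionMade(self)
        self._parser = make_parser()

    def handleResponsePart(self, data):
        # Called with each chunk of the response body as it arrives.
        self._parser.feed(data)

    def handleResponseEnd(self):
        # Signal end of input so the parser can finish the document.
        self._parser.close()
        # Whatever you pass to handleResponse will be the result of the Deferred below.
        self.handleResponse(None)

factory = HTTPClientFactory(url)
factory.protocol = StreamingXMLParsingHTTPClient
# host and port must be those of the server named in url
reactor.connectTCP(host, port, factory)
d = factory.deferred
# d fires when the response is completely received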

Finally, there will be a new HTTP client API soon. Since this isn't part of any release yet, it's not as directly useful as the previous two approaches, but it's somewhat nicer, so I'll include it to give you an idea of what the future will bring. :) The new API lets you specify a protocol to receive the response body. So you'd do something like this:

from xml.sax import make_parser
from twisted.internet import reactor
from twisted.internet.defer import Deferred
from twisted.internet.protocol import Protocol
from twisted.web.client import Agent

class StreamingXMLParser(Protocol):
    def __init__(self):
        self.done = Deferred()

    def connectionMade(self):
        self._parser = make_parser()

    def dataReceived(self, data):
        self._parser.feed(data)

    def connectionLost(self, reason):
        # The body is finished: close the parser, then fire the Deferred.
        self._parser.close()
        self.done.callback(None)

agent = Agent(reactor)
# On Python 3 the method and url must be bytes.
d = agent.request(b'GET', url, headers, None)

def cbRequest(response):
    # You can look at the response headers here if you like.
    protocol = StreamingXMLParser()
    response.deliverBody(protocol)
    return protocol.done

d.addCallback(cbRequest)
# d fires when the response is fully received and parsed

#2

You only need to parse a single URL? Then don't worry. Use urllib2 to open the connection and pass the file handle into ElementTree.
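
As a rough sketch of this suggestion: urllib2 is the Python 2 module, and on Python 3 the equivalent call is `urllib.request.urlopen`. The helper names `parse_stream` and `fetch_and_parse` are made up for illustration, and the in-memory sample stands in for a real network response so the snippet runs offline.

```python
import io
import xml.etree.ElementTree as ET
from urllib.request import urlopen


def parse_stream(fileobj):
    """Parse a whole XML document from any object with a read() method."""
    return ET.parse(fileobj).getroot()


def fetch_and_parse(url):
    # urlopen returns a file-like response, so it can be handed
    # straight to ElementTree without reading it into a string first.
    with urlopen(url) as response:
        return parse_stream(response)


# Offline stand-in for a network response.
sample = io.BytesIO(b"<feed><item id='1'/><item id='2'/></feed>")
root = parse_stream(sample)
ids = [item.get("id") for item in root.iter("item")]
```

Note that `ET.parse` builds the full tree before returning, so nothing is available until the whole response has been read.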

Variations you can try would be to use ElementTree's incremental parser or to use iterparse, but that depends on what your real requirements are. There's "super fast" but there's also "fast enough."
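
Both variations can be sketched with the standard library alone. The first function uses `iterparse` on a file-like object; the second uses `XMLPullParser`, which is the incremental parser in current Python 3 releases and may not be the one the answer had in mind. The function names, the `item` element and the `id` attribute are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET


def iter_item_ids(fileobj):
    """Yield the id of each <item> as soon as its end tag is parsed."""
    for _event, elem in ET.iterparse(fileobj, events=("end",)):
        if elem.tag == "item":
            yield elem.get("id")
            elem.clear()  # release the element's contents once handled


def pull_item_ids(chunks):
    """Same idea with XMLPullParser, for data that arrives in pieces."""
    parser = ET.XMLPullParser(events=("end",))
    ids = []

    def drain():
        for _event, elem in parser.read_events():
            if elem.tag == "item":
                ids.append(elem.get("id"))

    for chunk in chunks:
        parser.feed(chunk)
        drain()
    parser.close()
    drain()  # pick up anything the parser reported only at close
    return ids


data = b"<feed><item id='1'/><item id='2'/></feed>"
streamed = list(iter_item_ids(io.BytesIO(data)))
pulled = pull_item_ids([data[:10], data[10:25], data[25:]])
```

`iterparse` pulls data from the file object itself, so it suits a blocking handle such as a urlopen response; `XMLPullParser` is fed by the caller, so it fits code that receives chunks from elsewhere.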

Only when you start handling multiple simultaneous connections should you look at Twisted or multithreading.
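
For the multithreading route, one possible shape is a small thread pool that opens and parses several sources at once. This is only a sketch under assumptions: `count_items`, `count_all` and the `item` element are hypothetical, and the in-memory openers stand in for real network calls so the snippet runs offline.

```python
import io
import xml.etree.ElementTree as ET
from concurrent.futures import ThreadPoolExecutor


def count_items(open_source):
    """Open one source, parse it, and count its <item> elements."""
    with open_source() as fileobj:
        root = ET.parse(fileobj).getroot()
        return sum(1 for _ in root.iter("item"))


def count_all(openers, max_workers=4):
    # map() returns results in the same order as the inputs.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(count_items, openers))


# Offline stand-ins; in real use each opener might be
# functools.partial(urllib.request.urlopen, some_url).
openers = [
    lambda: io.BytesIO(b"<feed><item/></feed>"),
    lambda: io.BytesIO(b"<feed><item/><item/><item/></feed>"),
]
totals = count_all(openers)
```

Threads work here because each worker spends most of its time waiting on I/O; for a very large number of connections, an event-driven approach such as Twisted avoids the per-thread overhead.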
