Incrementally processing a large XML file over HTTPS?

Time: 2023-01-19 16:55:12

I've got to download, process, and store an 8GB XML file from a secure web server. I could download the file using the WebRequest class, but this will take a VERY long time. Also, I know that the file is structured in such a way that it suits processing in discrete chunks.


How can I 'stream' this file such that I only get bite-size pieces which I can work on, without having to get the whole stream at one time?


Edit

I forgot to mention - we are hosted on Azure. An idea that comes to mind is to provision a worker role which just downloads large files and can take as long as it wants. How feasible would that be?


4 Answers

#1 (score: 3)

8 GB is a large workload. To protect myself from rework and to scale effectively, I would decouple the XML file download from its processing.


While downloading as a stream, I would write some sort of stream identifier to persistent storage and schedule each atomic unit of work by placing a message with its relevant data on a queue. This would let you recover if the download goes south for any reason, or if a unit of work fails, without the processing interfering with the download.

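A rough sketch of that idea (not the poster's actual code): stream the response, persist each chunk, and drop a message on a queue so a separate worker can process it later. SaveChunk and EnqueueWorkItem are hypothetical placeholders for whatever persistent storage and queue you pick (e.g. blob storage plus an Azure Storage queue), and in practice you would align chunk boundaries with the file's logical structure rather than a fixed byte count.

```csharp
// Sketch only: SaveChunk and EnqueueWorkItem are placeholders, not real SDK calls.
using System.Net;

class DecoupledDownloader
{
    const int ChunkSize = 4 * 1024 * 1024; // 4 MB per unit of work (arbitrary)

    static void Main()
    {
        var request = WebRequest.Create("https://example.com/huge.xml"); // placeholder URL

        using (var response = request.GetResponse())
        using (var stream = response.GetResponseStream())
        {
            var buffer = new byte[ChunkSize];
            int chunkIndex = 0;
            int read;
            // Read the response as it arrives; each chunk becomes one atomic unit of work.
            while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
            {
                string chunkId = "huge.xml/" + chunkIndex++;
                SaveChunk(chunkId, buffer, read);   // persist the raw bytes first
                EnqueueWorkItem(chunkId);           // then schedule processing via a queue message
            }
        }
    }

    static void SaveChunk(string chunkId, byte[] data, int count)
    {
        // Placeholder: write the chunk to blob storage, local disk, etc.
    }

    static void EnqueueWorkItem(string chunkId)
    {
        // Placeholder: put a message referencing the chunk on a queue for a worker to pick up.
    }
}
```

Persisting the chunk before enqueueing the message means a failed worker can simply re-read the chunk and retry without touching the download.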

#2 (score: 2)

I'm using HttpWebRequest, BeginGetResponse, then GetResponseStream.


You can then read the stream in chunks as it trickles down, via stream.BeginRead.


Here's a much too complicated example: http://stuff.seans.com/2009/01/05/using-httpwebrequest-for-asynchronous-downloads/

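For reference, a minimal sketch of that pattern (error handling, timeouts, and response disposal omitted; the URL is a placeholder):

```csharp
using System;
using System.IO;
using System.Net;
using System.Threading;

class AsyncDownload
{
    static readonly ManualResetEvent Done = new ManualResetEvent(false);
    static readonly byte[] ReadBuffer = new byte[64 * 1024];

    static void Main()
    {
        var request = (HttpWebRequest)WebRequest.Create("https://example.com/huge.xml");
        request.BeginGetResponse(OnResponse, request);
        Done.WaitOne(); // keep the process alive until the download completes
    }

    static void OnResponse(IAsyncResult ar)
    {
        var request = (HttpWebRequest)ar.AsyncState;
        var response = request.EndGetResponse(ar);
        var stream = response.GetResponseStream();
        stream.BeginRead(ReadBuffer, 0, ReadBuffer.Length, OnRead, stream);
    }

    static void OnRead(IAsyncResult ar)
    {
        var stream = (Stream)ar.AsyncState;
        int read = stream.EndRead(ar);
        if (read > 0)
        {
            // Process ReadBuffer[0..read) here while the rest of the file is still downloading.
            stream.BeginRead(ReadBuffer, 0, ReadBuffer.Length, OnRead, stream);
        }
        else
        {
            stream.Close();
            Done.Set();
        }
    }
}
```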

#3 (score: 1)

If you need to process the file sequentially, just open an XmlReader on the response stream and read the data as needed.

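A minimal sketch of that approach, assuming the document is essentially a flat list of `<record>` elements (the element name is made up here):

```csharp
using System;
using System.Net;
using System.Xml;

class SequentialXmlProcessing
{
    static void Main()
    {
        var request = WebRequest.Create("https://example.com/huge.xml"); // placeholder URL

        using (var response = request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var reader = XmlReader.Create(stream))
        {
            // XmlReader is forward-only and pulls from the network as it goes,
            // so only the current record ever needs to be held in memory.
            while (reader.ReadToFollowing("record"))
            {
                using (var subtree = reader.ReadSubtree())
                {
                    var doc = new XmlDocument();
                    doc.Load(subtree);              // materialize just this one record
                    Process(doc.DocumentElement);
                }
            }
        }
    }

    static void Process(XmlElement record)
    {
        // Placeholder for per-record work.
        Console.WriteLine(record.OuterXml);
    }
}
```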

If you need random access to the file (i.e. reading from the middle), you may need to do more work to create a seekable stream (if the server supports the Range header in the request), or simply download the whole file as you do now.

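If the server does honor range requests, a sketch of pulling just a slice out of the middle might look like this (offsets and URL are placeholders):

```csharp
using System.IO;
using System.Net;

class RangeRead
{
    static byte[] ReadRange(string url, long from, long to)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.AddRange(from, to); // ask for bytes from..to only (HTTP Range header)

        using (var response = request.GetResponse())
        using (var stream = response.GetResponseStream())
        using (var buffer = new MemoryStream())
        {
            stream.CopyTo(buffer);  // a 206 Partial Content response contains only that range
            return buffer.ToArray();
        }
    }
}
```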

Please note that 8GB is a large amount of data, and downloading it completely will take a lot of time regardless of how you read it.


#4 (score: 1)

You could upload the XML file to a block blob and download it from there. This blog post might help: http://blogs.msdn.com/b/kwill/archive/2011/05/30/asynchronous-parallel-block-blob-transfers-with-progress-change-notification.aspx

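A rough sketch of consuming the blob afterwards, assuming the legacy WindowsAzure.Storage client library (the connection string, container, and blob names are placeholders); since OpenRead gives you a stream, the same XmlReader technique from #3 applies:

```csharp
using System.Xml;
using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Blob;

class BlobDownload
{
    static void Main()
    {
        var account = CloudStorageAccount.Parse("<storage connection string>");
        var blob = account.CreateCloudBlobClient()
                          .GetContainerReference("mycontainer")
                          .GetBlockBlobReference("huge.xml");

        // OpenRead streams the blob down as it is read, so the whole 8 GB never
        // has to sit in memory at once.
        using (var stream = blob.OpenRead())
        using (var reader = XmlReader.Create(stream))
        {
            while (reader.Read())
            {
                // process nodes / chunks as they arrive
            }
        }
    }
}
```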

Hope this helps.

