Most memory-efficient way to save a binary file from the web using Python 2.6?

Time: 2021-06-18 20:40:20

I'm trying to download (and save) a binary file from the web using Python 2.6 and urllib.

As I understand it, read(), readline() and readlines() are the three ways to read a file-like object. Since a binary file isn't really broken up by newlines, read() and readlines() read the whole file into memory.

Is choosing a random read() buffer size the most efficient way to limit memory usage during this process?

i.e.

import urllib
import os

title = 'MyFile'
downloadurl = 'http://somedomain.com/myfile.avi'
webFile = urllib.urlopen(downloadurl)
mydirpath = os.path.join('c:', os.sep, 'mydirectory',
                         downloadurl.split('/')[-1])

if not os.path.exists(mydirpath):
    print "Downloading...%s" % title
    localFile = open(mydirpath, 'wb')
    data = webFile.read(1000000) #1MB at a time
    while data:
        localFile.write(data)
        data = webFile.read(1000000) #1MB at a time
    webFile.close()
    localFile.close()
    print "Finished downloading: %s" % title
else:
    print "%s already exists." % mydirypath

I chose read(1000000) arbitrarily because it worked and kept RAM usage down. I assume that if I were working with a raw network buffer, picking an arbitrary amount would be bad, since the buffer might run dry if the transfer rate was too low. But it seems urllib is already handling the lower-level buffering for me.

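Incidentally, the same loop can be written a little more compactly with iter() and a sentinel, so the read call only appears once. A sketch, reusing the placeholder URL from above:

import functools
import urllib

downloadurl = 'http://somedomain.com/myfile.avi'
webFile = urllib.urlopen(downloadurl)
localFile = open(downloadurl.split('/')[-1], 'wb')
# iter() calls read(1000000) repeatedly until it returns '' (EOF).
for chunk in iter(functools.partial(webFile.read, 1000000), ''):
    localFile.write(chunk)
webFile.close()
localFile.close()
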
With that in mind, is choosing an arbitrary number fine? Is there a better way?

Thanks.

2 Answers

#1


You should use urllib.urlretrieve for this. It will handle everything for you.

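For example (a minimal sketch; the URL and local filename are placeholders):

import urllib

# urlretrieve streams the response to a local file in fixed-size
# chunks, so the whole download never has to fit in memory at once.
urllib.urlretrieve('http://somedomain.com/myfile.avi', 'myfile.avi')
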
#2


Instead of using your own read-write loop, you should probably check out the shutil module. The copyfileobj function lets you set the buffer size. The most efficient buffer size varies from situation to situation; even copying the same source file to the same destination can perform differently from run to run because of network conditions.

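A sketch of what that might look like (the 16 KB buffer size is an arbitrary example, not a recommendation):

import shutil
import urllib

webFile = urllib.urlopen('http://somedomain.com/myfile.avi')  # placeholder URL
localFile = open('myfile.avi', 'wb')
# The optional third argument is the buffer size in bytes; copyfileobj
# reads from webFile and writes to localFile in chunks of that size.
shutil.copyfileobj(webFile, localFile, 16384)
webFile.close()
localFile.close()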