How to download a file with urllib3?

Time: 2020-12-18 18:10:53

This is based on another question on this site: What's the best way to download file using urllib3. However, I cannot comment there, so I am asking a new question:

How to download a (larger) file with urllib3?

I tried to use the same code that works with urllib2 (Download file from web in Python 3), but it fails with urllib3:

import shutil
import urllib3

http = urllib3.PoolManager()

with http.request('GET', url) as r, open(path, 'wb') as out_file:       
    #shutil.copyfileobj(r.data, out_file) # this writes a zero file
    shutil.copyfileobj(r.data, out_file)

This fails with: 'bytes' object has no attribute 'read'.

I then tried to use the code from that question, but it gets stuck in an infinite loop because data is always an empty b'', never None:

http = urllib3.PoolManager()
r = http.request('GET', url)

with open(path, 'wb') as out:
    while True:
        data = r.read(4096)         
        if data is None:
            break
        out.write(data)
r.release_conn()

However, if I read everything into memory, the file is downloaded correctly:

http = urllib3.PoolManager()
r = http.request('GET', url)
with open(path, 'wb') as out:
    out.write(r.data)

I do not want to do this, as I may be downloading very large files. It is unfortunate that the urllib documentation does not cover the best practice on this topic.

(Also, please do not suggest requests or urllib2, because they are not flexible enough when it comes to self-signed certificates.)
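
With urllib3 this is not a problem, because a PoolManager can be pointed straight at a custom CA bundle, or told to skip verification. Roughly what I have in mind, where the certificate path is only a placeholder:

import urllib3

# Verify against a self-signed certificate / private CA bundle
# (the path is a placeholder, not a real file).
http = urllib3.PoolManager(
    cert_reqs='CERT_REQUIRED',
    ca_certs='/path/to/self-signed-ca.pem',
)

# Or, less safely, skip certificate verification entirely.
insecure_http = urllib3.PoolManager(cert_reqs='CERT_NONE')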

1 solution

#1

You were very close; the piece that was missing is setting preload_content=False (this will be the default in an upcoming version). Also, you can treat the response as a file-like object rather than using the .data attribute (which is a magic property that will hopefully be deprecated someday).

- with http.request('GET', url) ...
+ with http.request('GET', url, preload_content=False) ...

This code should work:

import shutil
import urllib3

http = urllib3.PoolManager()

with http.request('GET', url, preload_content=False) as r, open(path, 'wb') as out_file:       
    shutil.copyfileobj(r, out_file)

urllib3's response object also respects the io interface, so you can do things like...

import io
response = http.request(..., preload_content=False)
buffered_response = io.BufferedReader(response, 2048)
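
Since the wrapper then behaves like an ordinary binary file object, anything that consumes file objects can be pointed at it too. A quick sketch, reusing path from the question:

import shutil

with open(path, 'wb') as out_file:
    # copyfileobj() just calls read() on the wrapper in chunks.
    shutil.copyfileobj(buffered_response, out_file)
response.release_conn()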

As long as you add preload_content=False to any of your three attempts and treat the response as a file-like object, they should all work.
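
For example, the second attempt only needs preload_content=False and a corrected end-of-stream check, since read() returns an empty bytes object (not None) once the body is exhausted. A sketch along those lines:

import urllib3

http = urllib3.PoolManager()
r = http.request('GET', url, preload_content=False)
with open(path, 'wb') as out:
    while True:
        data = r.read(4096)   # arbitrary chunk size
        if not data:          # b'' signals end of stream; None never occurs
            break
        out.write(data)
r.release_conn()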

"It is unfortunate that the urllib documentation does not cover the best practice on this topic."

You're totally right; I hope you'll consider helping us document this use case by sending a pull request here: https://github.com/shazow/urllib3
