Python Web Crawler 011 (Advanced Features): Proxy Support, So the Crawler Can Fetch Google, YouTube, and Similar Sites

Original post: http://www.aobosir.com/blog/2016/12/25/python-Web-crawler-proxy-support/

Operating system: Windows 10, 64-bit
Python version: Python 2.7.10
IDE used for Python development: PyCharm 2016 04
urllib version used: urllib2

Note: I am using Python 2 here, not Python 3.

1. Preface

For network reasons, this part of the post cannot be displayed here. Please see the original post: http://www.aobosir.com/blog/2016/12/25/python-Web-crawler-proxy-support/

2. Testing

urllib2 can be used to add proxy support:

proxy = ...
opener = urllib2.build_opener()
proxy_params = {urlparse.urlparse(url).scheme: proxy}
opener.add_handler(urllib2.ProxyHandler(proxy_params))
response = opener.open(request)
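
The fragment above leaves `proxy`, `url`, and `request` undefined. As a minimal sketch, the same steps can be wrapped in a helper. `build_proxy_opener` is a hypothetical name that does not appear in the original post, and the import fallback is an assumption added only so the sketch also runs on Python 3, where these modules live under `urllib.request` and `urllib.parse`. The URL and proxy address are the illustrative values used later in this post. Nothing is sent over the network until `opener.open()` is called.

```python
try:                                   # Python 2, as used in this post
    import urllib2
    from urlparse import urlparse
except ImportError:                    # Python 3 fallback (assumption, not in the original)
    import urllib.request as urllib2
    from urllib.parse import urlparse


def build_proxy_opener(url, proxy):
    """Return an opener that sends requests for url's scheme through proxy."""
    opener = urllib2.build_opener()
    # key the proxy by the URL's scheme, e.g. {'https': '127.0.0.1:1080'}
    proxy_params = {urlparse(url).scheme: proxy}
    opener.add_handler(urllib2.ProxyHandler(proxy_params))
    return opener


# Build the opener only; no request is made here.
opener = build_proxy_opener('https://www.google.co.jp/', '127.0.0.1:1080')
proxy_handlers = [h for h in opener.handlers
                  if isinstance(h, urllib2.ProxyHandler)]
```

Inspecting `proxy_handlers` is a quick way to confirm which proxy mapping the opener will use before any request is made.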

3. Code

The complete code is as follows.

#-*- coding:utf-8 -*-

import urllib2
import chardet
import urlparse

def download(url, user_agent='wswp', proxy=None, num_retries=2):
    print 'Downloading: ', url
    headers = {'User-agent' : user_agent}
    request = urllib2.Request(url, headers=headers)

    opener = urllib2.build_opener()
    if proxy:
        # send requests for this URL's scheme through the proxy
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))
    try:
        html = opener.open(request).read()
        # detect the page encoding, then re-encode as GB18030
        charset = chardet.detect(html)['encoding']
        if charset == 'GB2312' or charset == 'gb2312':
            html = html.decode('GBK').encode('GB18030')
        else:
            html = html.decode(charset).encode('GB18030')
    except urllib2.URLError as e:
        print 'Download error', e.reason
        html = None
        if num_retries > 0:
            if hasattr(e, 'code') and 500 <= e.code < 600:
                # recursively retry 5xx HTTP errors
                return download(url, user_agent, proxy, num_retries-1)
    return html
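
The retry rule in download() repeats the request only when the server answered with a 5xx status and retries remain. Errors that carry no HTTP code, such as a failed DNS lookup, are not retried. Below is a minimal sketch of that decision as a standalone function that can be checked offline; `should_retry` and `FakeHTTPError` are hypothetical names used only for illustration.

```python
def should_retry(error, num_retries):
    """Mirror the retry condition used in download()."""
    # a plain URLError has no 'code' attribute; an HTTPError does
    code = getattr(error, 'code', None)
    return num_retries > 0 and code is not None and 500 <= code < 600


class FakeHTTPError(Exception):
    """Stand-in for urllib2.HTTPError so the sketch runs without a network."""
    def __init__(self, code=None):
        Exception.__init__(self)
        if code is not None:
            self.code = code


retry_503 = should_retry(FakeHTTPError(503), 2)   # server error, retries left
retry_404 = should_retry(FakeHTTPError(404), 2)   # client error
retry_dns = should_retry(FakeHTTPError(), 2)      # no HTTP status at all
retry_out = should_retry(FakeHTTPError(503), 0)   # retries exhausted
```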

4. Running

How do we use this updated download() function? What exactly should be passed as its proxy parameter?

If we run it directly:

>>> download('https://www.google.co.jp/')

Output:

Downloading:  https://www.google.co.jp/
Download error [Errno 11002] getaddrinfo failed

Now we start the proxy.

The proxy's local address and port are 127.0.0.1:1080.

So we set the proxy parameter of download() to 127.0.0.1:1080:

>>> download('https://www.google.co.jp/', proxy='127.0.0.1:1080')

This successfully prints the page source of Google Japan.

5. Explaining the key parts of the code

    if proxy:
        proxy_params = {urlparse.urlparse(url).scheme: proxy}
        opener.add_handler(urllib2.ProxyHandler(proxy_params))

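If the caller supplies a value for the proxy parameter, the code inside the if block runs.

urlparse.urlparse(url).scheme returns the protocol of the URL, for example http, https, or ftp.

So in this case the statement proxy_params = {urlparse.urlparse(url).scheme: proxy} is equivalent to proxy_params = {'https': '127.0.0.1:1080'}. In other words, proxy_params is a dictionary that maps the protocol to the proxy's address and port.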
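
To see what the scheme lookup produces, here is a small sketch that builds the mapping for two URLs. `make_proxy_params` is a hypothetical helper name, `http://example.com/` is just a placeholder URL, and the import fallback is an assumption added so the snippet also runs on Python 3.

```python
try:
    from urlparse import urlparse        # Python 2, as used in this post
except ImportError:
    from urllib.parse import urlparse    # Python 3 fallback (assumption)


def make_proxy_params(url, proxy):
    """Map the URL's scheme to the proxy address, as download() does."""
    return {urlparse(url).scheme: proxy}


https_params = make_proxy_params('https://www.google.co.jp/', '127.0.0.1:1080')
http_params = make_proxy_params('http://example.com/', '127.0.0.1:1080')
```

Because the dictionary holds a single key, this handler only applies to requests whose scheme matches the original URL. A crawler that fetches both http and https pages through the same opener would probably want both keys in the mapping.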

The urllib2 package provides the ProxyHandler class, which lets requests go through a proxy.

So the complete code above has the same effect as the following short snippet:

#coding=utf8

import urllib2
import chardet

proxy = urllib2.ProxyHandler({'https': '127.0.0.1:1080'})
opener = urllib2.build_opener(proxy)
html = opener.open('https://www.google.co.jp/').read()
charset = chardet.detect(html)['encoding']

print html.decode(charset).encode('GB18030')
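
Both listings pass the charset reported by chardet straight to `decode()`. chardet can report `None` when it cannot identify the encoding, and in that case the call fails. The following is a hedged sketch of the decoding step with a fallback. `to_gb18030`, the UTF-8 default, and the choice to replace undecodable bytes instead of raising are all assumptions, not part of the original code.

```python
def to_gb18030(raw, charset, fallback='utf-8'):
    """Decode raw bytes with the detected charset and re-encode as GB18030."""
    if not charset:
        charset = fallback            # assumption: treat undetected pages as UTF-8
    if charset.lower() == 'gb2312':
        charset = 'GBK'               # GBK is a superset of GB2312
    # 'replace' substitutes undecodable bytes (the original code decodes strictly)
    return raw.decode(charset, 'replace').encode('GB18030')


sample = u'\u4e2d\u6587'.encode('gb2312')   # bytes as a GB2312 page would send them
converted = to_gb18030(sample, 'GB2312')
undetected = to_gb18030(b'hello', None)     # simulates chardet reporting None
```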

Reference:

http://outofmemory.cn/code-snippet/2625/python-urllib2-usage-ProxyHandler-through-Proxy-call-wangye

Please visit: http://www.aobosir.com/
