Python Web Scraping [requests]: HTTP for Humans

Installation

 pip install requests

Source code

git clone git://github.com/kennethreitz/requests.git

Import

import requests

Making requests

GET request

r = requests.get('https://api.github.com/events')

POST request

r = requests.post('http://httpbin.org/post', data = {'key':'value'})

Other HTTP methods

>>> r = requests.put('http://httpbin.org/put', data = {'key':'value'})

>>> r = requests.delete('http://httpbin.org/delete')

>>> r = requests.head('http://httpbin.org/get')

>>> r = requests.options('http://httpbin.org/get')

Passing URL parameters

1. GET requests with parameters

>>> payload = {'key1': 'value1', 'key2': 'value2'}

>>> r = requests.get("http://httpbin.org/get", params=payload)

>>> print(r.url)

http://httpbin.org/get?key2=value2&key1=value1

Passing a list as a parameter value

>>> payload = {'key1': 'value1', 'key2': ['value2', 'value3']}

>>> r = requests.get('http://httpbin.org/get', params=payload)

>>> print(r.url)

http://httpbin.org/get?key1=value1&key2=value2&key2=value3
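
To check how params will be encoded without sending anything, one option is to prepare the request offline. This is only a minimal sketch: it relies on requests.Request and its prepare() method, reuses the httpbin.org URL from above, and adds a key3 entry (an assumption made for illustration) to show what happens to None values.

```python
import requests

# key3 is a made-up entry used only to show how None is handled.
payload = {'key1': 'value1', 'key2': ['value2', 'value3'], 'key3': None}

# Build and prepare the request without sending it over the network.
prepared = requests.Request('GET', 'http://httpbin.org/get', params=payload).prepare()

# List values become repeated keys; keys whose value is None are left out.
print(prepared.url)
# http://httpbin.org/get?key1=value1&key2=value2&key2=value3
```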

2. POST requests

To send data in the request body, use the data parameter. It accepts a dictionary, a string, or a file-like object.

A dictionary is sent as form-encoded data:

>>> payload = {'key1': 'value1', 'key2': 'value2'}

>>> r = requests.post("http://httpbin.org/post", data=payload)

>>> print(r.text)

{

  ...

  "form": {

    "key2": "value2",

    "key1": "value1"

  },

  ...

}

Sending JSON (application/json)

>>> import json

>>> url = 'https://api.github.com/some/endpoint'

>>> payload = {'some': 'data'}

>>> r = requests.post(url, data=json.dumps(payload))
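
One detail worth checking here is the Content-Type header. The sketch below compares the manual json.dumps approach with the json parameter (available since Requests 2.4.2) by preparing both requests offline. The endpoint is the placeholder URL from the example above, and the variable names are chosen for illustration.

```python
import json
import requests

url = 'https://api.github.com/some/endpoint'
payload = {'some': 'data'}

# Option 1: serialize by hand. The body is JSON text, but Requests does not
# set a Content-Type header for a plain string body.
manual = requests.Request('POST', url, data=json.dumps(payload)).prepare()

# Option 2: pass the dict via json=. Requests serializes it and labels it.
automatic = requests.Request('POST', url, json=payload).prepare()

print(manual.headers.get('Content-Type'))     # None
print(automatic.headers.get('Content-Type'))  # application/json
```

If the server insists on a JSON content type, either use json=payload or add the header yourself through the headers parameter.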

Streaming uploads

with open('massive-body', 'rb') as f:

    requests.post('http://some.url/streamed', data=f)

Chunk-encoded requests

def gen():

    yield 'hi'

    yield 'there'

requests.post('http://some.url/chunked', data=gen())

To upload files, use the files parameter, which sends multipart-encoded data. It takes a dictionary of the form {'name': file-like-object} (or {'name': ('filename', fileobj)}):

>>> url = 'http://httpbin.org/post'

>>> files = {'file': open('report.xls', 'rb')}

>>> r = requests.post(url, files=files)

>>> r.text

{

  ...

  "files": {

    "file": "<censored...binary...data>"

  },

  ...

}

You can also set the filename, content_type, and headers explicitly:

 >>> url = 'http://httpbin.org/post'

 >>> files = {'file': ('report.xls', open('report.xls', 'rb'), 'application/vnd.ms-excel', {'Expires': ''})}

 >>> r = requests.post(url, files=files)

 >>> print(r.text)

 {

   "args": {},

   "data": "",

   "files": {

     "file": "1\t2\r\n"

   },

   "form": {},

   "headers": {

     "Content-Type": "multipart/form-data; boundary=e0f9ff1303b841498ae53a903f27e565",

     "Host": "httpbin.org",

     "User-Agent": "python-requests/2.2.1 CPython/2.7.3 Windows/7",

   },

   "url": "http://httpbin.org/post"

 }
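
To see what such an upload looks like on the wire without touching the network or the disk, a request can be prepared offline. In this sketch an in-memory io.BytesIO object stands in for open('report.xls', 'rb'); the file contents are made up for illustration.

```python
import io
import requests

# In-memory stand-in for a real file; the contents are invented.
fake_file = io.BytesIO(b'1\t2\r\n')
files = {'file': ('report.xls', fake_file, 'application/vnd.ms-excel')}

prepared = requests.Request('POST', 'http://httpbin.org/post', files=files).prepare()

# Requests chooses the multipart boundary and writes it into Content-Type.
print(prepared.headers['Content-Type'].split(';')[0])  # multipart/form-data

# The encoded body carries the field name, the filename, and the part's type.
print(b'filename="report.xls"' in prepared.body)       # True
```

With real files, opening them in a with block keeps the handles from leaking after the upload.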

You can upload multiple files in one request. Take, for example, a file input that accepts multiple values:

<input type="file" name="images" multiple="true" required="true"/>

Put the files in a list of tuples, where each tuple has the form (form_field_name, file_info):

>>> url = 'http://httpbin.org/post'

>>> multiple_files = [('images', ('foo.png', open('foo.png', 'rb'), 'image/png')),

                      ('images', ('bar.png', open('bar.png', 'rb'), 'image/png'))]

>>> r = requests.post(url, files=multiple_files)

>>> r.text

{

  ...

  'files': {'images': 'data:image/png;base64,iVBORw ....'}

  'Content-Type': 'multipart/form-data; boundary=3131623adb2043caaeb5538cc7aa0b3a',

  ...

}

Response content

>>> import requests

>>> r = requests.get('https://api.github.com/events')

>>> r.text

'[{"repository":{"open_issues":0,"url":"https://github.com/...

Encoding

>>> r.encoding

'utf-8'

>>> r.encoding = 'ISO-8859-1'

A common pattern is to decode the raw bytes yourself:

r.content.decode('utf-8')

Binary response content

For non-text responses, access the body as bytes:

>>> r.content

b'[{"repository":{"open_issues":0,"url":"https://github.com/...

Requests automatically decodes gzip and deflate transfer-encoded response data for you.

For example, to create an image from the binary data returned by a request:

>>> from PIL import Image

>>> from io import BytesIO

>>> i = Image.open(BytesIO(r.content))

JSON response content

Requests has a built-in JSON decoder:

>>> import requests

>>> r = requests.get('https://api.github.com/events')

>>> r.json()

[{'repository': {'open_issues': 0, 'url': 'https://github.com/...

Raw response content

Make sure to set stream=True on the request:

>>> r = requests.get('https://api.github.com/events', stream=True)

>>> r.raw

<requests.packages.urllib3.response.HTTPResponse object at 0x101194810>

>>> r.raw.read(10)

b'\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\x03'

In general, though, save the stream to a file like this:

with open(filename, 'wb') as fd:

    for chunk in r.iter_content(chunk_size):

        fd.write(chunk)
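
The snippet above leaves filename and chunk_size undefined. A fuller sketch might wrap it in a helper; save_stream is a hypothetical name, and the hand-built Response below is only an offline stand-in for requests.get(url, stream=True), not something real code would normally construct.

```python
import io
import os
import tempfile
import requests


def save_stream(response, path, chunk_size=8192):
    """Write a streamed response body to path and return the byte count."""
    written = 0
    with open(path, 'wb') as fd:
        for chunk in response.iter_content(chunk_size=chunk_size):
            fd.write(chunk)
            written += len(chunk)
    return written


# Offline stand-in for a streamed response: a bare Response object whose
# raw attribute is an in-memory file holding 20000 bytes.
fake = requests.Response()
fake.status_code = 200
fake.raw = io.BytesIO(b'x' * 20000)

target = os.path.join(tempfile.mkdtemp(), 'download.bin')
print(save_stream(fake, target))  # 20000
```

Because the body is written chunk by chunk, memory use stays near chunk_size regardless of how large the download is.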

Custom headers

Simply pass a dict to the headers parameter:

>>> url = 'https://api.github.com/some/endpoint'

>>> headers = {'user-agent': 'my-app/0.0.1'}

>>> r = requests.get(url, headers=headers)

Response status codes

>>> r = requests.get('http://httpbin.org/get')

>>> r.status_code

200

Requests also provides a built-in status code lookup object:

>>> r.status_code == requests.codes.ok

True

If a request returns an error status, raise_for_status() raises an exception. When the status code is 200, it returns None:

>>> bad_r = requests.get('http://httpbin.org/status/404')

>>> bad_r.status_code

404

>>> bad_r.raise_for_status()

Traceback (most recent call last):

  File "requests/models.py", line 832, in raise_for_status

    raise http_error

requests.exceptions.HTTPError: 404 Client Error

Response headers

The server's response headers are exposed as a Python dictionary:

>>> r.headers

{

    'content-encoding': 'gzip',

    'transfer-encoding': 'chunked',

    'connection': 'close',

    'server': 'nginx/1.0.4',

    'x-runtime': '148ms',

    'etag': '"e1ca502697e5c9317743dc078f67693f"',

    'content-type': 'application/json'

}

HTTP header names are case-insensitive, so the headers can be accessed with any capitalization:

>>> r.headers['Content-Type']

'application/json'

>>> r.headers.get('content-type')

'application/json'
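
This behaviour comes from the CaseInsensitiveDict class in requests.structures, which can also be used on its own. A small sketch with a made-up header set:

```python
from requests.structures import CaseInsensitiveDict

# Illustrative header set; r.headers on a real response is the same type.
headers = CaseInsensitiveDict({'Content-Type': 'application/json'})

print(headers['content-type'])      # application/json
print(headers.get('CONTENT-TYPE'))  # application/json
print('Content-type' in headers)    # True

# Iteration returns the casing that was used when the key was set.
print(list(headers))                # ['Content-Type']
```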

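There is one more special case: a server may send the same header several times, each with a different value. Requests combines them into a single field, appending each subsequent value to the combined value in order, separated by commas.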

Cookie

>>> url = 'http://example.com/some/cookie/setting/url'

>>> r = requests.get(url)

>>> r.cookies['example_cookie_name']

'example_cookie_value'

Sending cookies with a request

>>> url = 'http://httpbin.org/cookies'

>>> cookies = dict(cookies_are='working')

>>> r = requests.get(url, cookies=cookies)

>>> r.text

'{"cookies": {"cookies_are": "working"}}'

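Cookies are returned in a RequestsCookieJar, which behaves like a dictionary but offers a more complete interface, suitable for use across multiple domains or paths. A cookie jar can also be passed in to a request: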

>>> jar = requests.cookies.RequestsCookieJar()

>>> jar.set('tasty_cookie', 'yum', domain='httpbin.org', path='/cookies')

>>> jar.set('gross_cookie', 'blech', domain='httpbin.org', path='/elsewhere')

>>> url = 'http://httpbin.org/cookies'

>>> r = requests.get(url, cookies=jar)

>>> r.text

'{"cookies": {"tasty_cookie": "yum"}}'

Converting cookies to a dictionary

>>> requests.utils.dict_from_cookiejar(r.cookies)

{'BAIDUID': '84722199DF8EDC372D549EC56CA1A0E2:FG=1', 'BD_HOME': '', 'BDSVRTM': ''}

Converting a dictionary to a CookieJar:

requests.utils.cookiejar_from_dict(cookie_dict, cookiejar=None, overwrite=True)
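
The two helpers can be combined into a round trip. The sketch below runs offline; the cookie names and values are invented for illustration.

```python
import requests

# Made-up cookies for the demonstration.
cookie_dict = {'sessionid': 'abc123', 'theme': 'dark'}

# dict -> RequestsCookieJar
jar = requests.utils.cookiejar_from_dict(cookie_dict)
print(jar['theme'])  # dark

# RequestsCookieJar -> dict
print(requests.utils.dict_from_cookiejar(jar) == cookie_dict)  # True

# With overwrite=False, names already in the jar keep their current values
# and only new names are added.
requests.utils.cookiejar_from_dict(
    {'theme': 'light', 'lang': 'en'}, cookiejar=jar, overwrite=False)
print(jar['theme'], jar['lang'])  # dark en
```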

Session objects

Requests provides a Session class that persists cookies across requests, which is useful for accessing pages that require login:

s = requests.Session()

s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')

r = s.get("http://httpbin.org/cookies")

print(r.text)

# '{"cookies": {"sessioncookie": "123456789"}}'

Sessions can also supply default data to the request methods. This is done by setting attributes on the Session object:

s = requests.Session()

s.auth = ('user', 'pass')

s.headers.update({'x-test': 'true'})

# both 'x-test' and 'x-test2' are sent

s.get('http://httpbin.org/headers', headers={'x-test2': 'true'})

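Any dictionaries you pass to a request method are merged with the session-level values that are set.

Method-level parameters override session parameters.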
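
The merge and override rules can be observed offline with Session.prepare_request, which applies the session settings to a request without sending it. This is a minimal sketch; the header names follow the example above.

```python
import requests

s = requests.Session()
s.headers.update({'x-test': 'true'})

# An extra method-level header is merged with the session headers.
merged = s.prepare_request(
    requests.Request('GET', 'http://httpbin.org/headers', headers={'x-test2': 'true'}))
print(merged.headers['x-test'], merged.headers['x-test2'])  # true true

# The same key at method level wins, but only for that one request.
overridden = s.prepare_request(
    requests.Request('GET', 'http://httpbin.org/headers', headers={'x-test': 'false'}))
print(overridden.headers['x-test'])  # false
print(s.headers['x-test'])           # true
```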

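Note, however, that method-level parameters are not persisted across requests, even when using a session. The example below sends the cookie only with the first request, not the second: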

s = requests.Session()

r = s.get('http://httpbin.org/cookies', cookies={'from-my': 'browser'})

print(r.text)

# '{"cookies": {"from-my": "browser"}}'

r = s.get('http://httpbin.org/cookies')

print(r.text)

# '{"cookies": {}}'

Sessions as context managers

with requests.Session() as s:

    s.get('http://httpbin.org/cookies/set/sessioncookie/123456789')

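SSL certificate verification

Requests verifies SSL certificates for HTTPS requests, just like a web browser. SSL verification is enabled by default, and Requests raises an SSLError if it cannot verify the certificate: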

>>> requests.get('https://requestb.in')

requests.exceptions.SSLError: hostname 'requestb.in' doesn't match either of '*.herokuapp.com', 'herokuapp.com'

SSL is not set up on that domain, so the request fails. GitHub, however, does have it:

>>> requests.get('https://github.com', verify=True)

<Response [200]>

You can pass verify the path to a CA_BUNDLE file or to a directory containing certificates of trusted CAs:

>>> requests.get('https://github.com', verify='/path/to/certfile')

This can also be persisted on a session:

s = requests.Session()

s.verify = '/path/to/certfile'

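If verify is set to a directory path, the directory must have been processed with the c_rehash utility supplied with OpenSSL.

To skip certificate verification, set verify to False: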

>>> requests.get('https://kennethreitz.org', verify=False)

<Response [200]>

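By default, verify is set to True. The verify option applies only to host certificates.

Client-side certificates

Specify a local certificate as a single file (containing the private key and the certificate, in PEM format) or as a tuple of both files' paths: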

>>> requests.get('https://kennethreitz.org', cert=('/path/client.cert', '/path/client.key'))

Or persisted on a session:

s = requests.Session()

s.cert = '/path/client.cert'

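The private key of a local certificate must be unencrypted. Currently, Requests does not support encrypted keys.

If you specify a wrong path or an invalid certificate, you get an SSLError: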

>>> requests.get('https://kennethreitz.org', cert='/wrong_path/client.pem')

SSLError: [Errno 336265225] _ssl.c:347: error:140B0009:SSL routines:SSL_CTX_use_PrivateKey_file:PEM lib

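CA certificates

By default, Requests bundles a set of root CAs that it trusts, sourced from the Mozilla trust store. However, these are updated only once per Requests release. This means that if you pin a Requests version, your certificates can become very out of date.

Since Requests 2.4.0, if the certifi package is installed on the system, Requests attempts to use its certificates. This lets users update their trusted certificates without changing code.

For the sake of security, we recommend updating certifi frequently!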

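Body content workflow

By default, the body of the response is downloaded immediately when you make a request. You can override this behaviour with the stream parameter, deferring the download of the body until you access the Response.content attribute: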

tarball_url = 'https://github.com/kennethreitz/requests/tarball/master'

r = requests.get(tarball_url, stream=True)

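At this point only the response headers have been downloaded and the connection remains open, which lets us make content retrieval conditional: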

if int(r.headers['content-length']) < TOO_LONG:

  content = r.content

  ...

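You can further control the workflow with the Response.iter_content and Response.iter_lines methods, or read the undecoded body from the underlying urllib3.HTTPResponse object at Response.raw.

If you set stream to True when making a request, Requests cannot release the connection back to the pool unless you consume all the data or call Response.close. This can lead to inefficient use of connections. If you find yourself partially reading response bodies (or not reading them at all) while using stream=True, consider making the request inside a with statement, which ensures the connection is always closed: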

with requests.get('http://httpbin.org/get', stream=True) as r:

    pass  # Do things with the response here.

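Event hooks

Requests has a hook system that you can use to manipulate portions of the request process or to handle signal events.

Available hooks:

response: the response generated from a request.

You can assign a hook function on a per-request basis by passing a {hook_name: callback_function} dictionary to the hooks request parameter: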

hooks=dict(response=print_url)

The callback_function receives a chunk of data as its first argument (for the response hook, the Response object):

def print_url(r, *args, **kwargs):

    print(r.url)

>>> requests.get('http://httpbin.org', hooks=dict(response=print_url))

http://httpbin.org

<Response [200]>

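Custom authentication

Any callable passed as the auth argument to a request method gets a chance to modify the request before it is dispatched.

Authentication implementations are subclasses of requests.auth.AuthBase. Requests ships two common authentication schemes: HTTPBasicAuth and HTTPDigestAuth.

Suppose we have a web service that responds only if the X-Pizza header is set to a password value: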

from requests.auth import AuthBase

class PizzaAuth(AuthBase):

    """Attaches HTTP Pizza Authentication to the given Request object."""

    def __init__(self, username):

        # setup any auth-related data here

        self.username = username

    def __call__(self, r):

        # modify and return the request

        r.headers['X-Pizza'] = self.username

        return r

>>> requests.get('http://pizzabin.org/admin', auth=PizzaAuth('kenneth'))

<Response [200]>
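
The same pattern can be checked offline, since the auth callable runs while the request is being prepared. The sketch below uses a hypothetical TokenAuth class and a made-up token; it is not an authentication scheme that Requests itself provides.

```python
import requests
from requests.auth import AuthBase


class TokenAuth(AuthBase):
    """Hypothetical scheme: sends a token in the Authorization header."""

    def __init__(self, token):
        self.token = token

    def __call__(self, r):
        # r is the PreparedRequest about to be sent; modify and return it.
        r.headers['Authorization'] = 'Bearer ' + self.token
        return r


# Prepare the request without sending it to inspect what auth attached.
prepared = requests.Request(
    'GET', 'http://httpbin.org/headers', auth=TokenAuth('secret-token')).prepare()
print(prepared.headers['Authorization'])  # Bearer secret-token
```

An instance can also be assigned to a session's auth attribute so that every request made through that session carries the header.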

Streaming requests

With stream set to True, you can iterate over the response with iter_lines:

import json

import requests

r = requests.get('http://httpbin.org/stream/20', stream=True)

for line in r.iter_lines():

    # filter out keep-alive new lines

    if line:

        decoded_line = line.decode('utf-8')

        print(json.loads(decoded_line))

When using decode_unicode=True with Response.iter_lines() or Response.iter_content(), provide a fallback encoding in case the server does not supply one; otherwise decoding can fail:

r = requests.get('http://httpbin.org/stream/20', stream=True)

if r.encoding is None:

    r.encoding = 'utf-8'

for line in r.iter_lines(decode_unicode=True):

    if line:

        print(json.loads(line))

Proxies

Pass the proxies argument to any request method:

import requests

proxies = {

  "http": "http://10.10.1.10:3128",

  "https": "http://10.10.1.10:1080",

}

requests.get("http://example.org", proxies=proxies)

You can also configure proxies through the HTTP_PROXY and HTTPS_PROXY environment variables:

$ export HTTP_PROXY="http://10.10.1.10:3128"

$ export HTTPS_PROXY="http://10.10.1.10:1080"

$ python

>>> import requests

>>> requests.get("http://example.org")

If your proxy requires HTTP Basic Auth, use the http://user:password@host/ syntax:

proxies = {

    "http": "http://user:pass@10.10.1.10:3128/",

}

To set a proxy for a specific scheme and host, use scheme://hostname as the key. It matches requests to that scheme and exact hostname.

proxies = {'http://10.20.1.128': 'http://10.10.1.10:5323'}

Proxy URLs must include the scheme.
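
To see which entry of a proxies dictionary would apply to a given URL, the select_proxy helper in requests.utils can be called directly. It is a utility that Requests uses internally rather than a prominently documented API, so treat this as an exploratory sketch. The addresses are the ones from the examples above.

```python
from requests.utils import select_proxy

proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
    'http://10.20.1.128': 'http://10.10.1.10:5323',
}

# A scheme://hostname key takes priority over the bare scheme key.
print(select_proxy('http://10.20.1.128/path', proxies))  # http://10.10.1.10:5323

# Other hosts fall back to the entry for their scheme.
print(select_proxy('http://example.org/', proxies))      # http://10.10.1.10:3128
print(select_proxy('https://example.org/', proxies))     # http://10.10.1.10:1080
```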

SOCKS proxies

Installation

 pip install requests[socks]

proxies = {

    'http': 'socks5://user:pass@host:port',

    'https': 'socks5://user:pass@host:port'

}

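Redirection and history

By default, Requests performs location redirection for all verbs except HEAD.

You can use the history attribute of the Response object to track redirection.

Response.history is a list of the Response objects that were created in order to complete the request. The list is sorted from the oldest to the most recent response.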

>>> r = requests.get('http://github.com')

>>> r.url

'https://github.com/'

>>> r.status_code

200

>>> r.history

[<Response [301]>]

If you are using GET, OPTIONS, POST, PUT, PATCH, or DELETE, you can disable redirection handling with the allow_redirects parameter:

>>> r = requests.get('http://github.com', allow_redirects=False)

>>> r.status_code

301

>>> r.history

[]

If you are using HEAD, you can enable redirection as well:

>>> r = requests.head('http://github.com', allow_redirects=True)

>>> r.url

'https://github.com/'

>>> r.history

[<Response [301]>]

Timeouts

Use the timeout parameter to stop waiting for a response after a given number of seconds:

>>> requests.get('http://github.com', timeout=0.001)

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)

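timeout is not a time limit on the entire response download. Rather, an exception is raised if the server has not issued a response for timeout seconds (more precisely, if no bytes have been received on the underlying socket for timeout seconds). If no timeout is specified explicitly, requests do not time out.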
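
Since requests never time out unless asked to, one common workaround is a session that fills in a default. The sketch below is one possible approach, not something Requests provides: TimeoutSession is a hypothetical name, and the (3.05, 27) pair is an arbitrary (connect timeout, read timeout) tuple in seconds.

```python
import requests


class TimeoutSession(requests.Session):
    """Hypothetical Session that applies a default timeout when none is given."""

    def __init__(self, timeout=(3.05, 27)):
        super().__init__()
        # (connect timeout, read timeout) in seconds; the values are arbitrary.
        self.default_timeout = timeout

    def request(self, method, url, **kwargs):
        # Respect an explicit timeout; otherwise fall back to the default.
        kwargs.setdefault('timeout', self.default_timeout)
        return super().request(method, url, **kwargs)
```

With this in place, TimeoutSession().get(url) uses the default pair, while passing timeout=1 on a call still overrides it for that request.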

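Errors and exceptions

In the event of a network problem (for example, DNS failure or a refused connection), Requests raises a ConnectionError exception.

If the HTTP request returns an unsuccessful status code, Response.raise_for_status() raises an HTTPError exception.

If a request times out, a Timeout exception is raised.

If a request exceeds the configured maximum number of redirections, a TooManyRedirects exception is raised.

All exceptions that Requests explicitly raises inherit from requests.exceptions.RequestException.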
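
Put together, the exceptions above might be handled as in the following sketch. safe_get is a hypothetical helper, and the printed messages are placeholders for whatever logging or retry logic an application needs.

```python
import requests


def safe_get(url, timeout=5):
    """Return the Response on success, or None after reporting the failure."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        return r
    except requests.exceptions.Timeout:
        # Listed before ConnectionError because ConnectTimeout inherits from both.
        print('timed out')
    except requests.exceptions.ConnectionError:
        print('network problem')
    except requests.exceptions.HTTPError as exc:
        print('bad status:', exc.response.status_code)
    except requests.exceptions.RequestException as exc:
        # Catch-all for anything else Requests raises, such as an invalid URL.
        print('request failed:', exc)
    return None


# A URL without a scheme is rejected before any network traffic happens.
result = safe_get('not-a-url')
print(result)  # None
```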
