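[Python Web Scraping] Basic Usage of the Requests Library

Date: 2024-02-22 13:54:04

Compared with urllib, which was used earlier, the requests module offers a more convenient API (under the hood it is a wrapper around urllib3).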

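#GET requests
GET is the default HTTP request method.
     * It normally carries no request body.
     * The data travels in the URL, so its size is bounded by the URL length limits of browsers and servers (often a few KB); the HTTP specification itself does not set a 1K limit.
     * The data is visible in the browser's address bar.

Common ways a GET request is issued:
       1. Entering a URL directly in the browser's address bar always sends a GET request.
       2. Clicking a hyperlink on a page also sends a GET request.
       3. Submitting a form uses GET by default, but the form can be set to use POST.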


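#POST requests
(1). The data does not appear in the address bar.
(2). The protocol sets no upper limit on the data size (a server may still impose one).
(3). There is a request body.
(4). In a form submission, non-ASCII characters such as Chinese in the request body are URL-encoded (percent-encoded).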
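Point (4) can be checked without sending anything. The following is a minimal sketch using the standard library's urllib.parse.urlencode, which produces the same application/x-www-form-urlencoded format that a form POST uses. The field name and the value 熊 are made up for illustration.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical form data containing a Chinese character.
form = {"name": "熊"}

# urlencode() percent-encodes the UTF-8 bytes of non-ASCII characters.
body = urlencode(form)
print(body)  # name=%E7%86%8A

# Decoding the body recovers the original value.
decoded = parse_qs(body)
print(decoded)  # {'name': ['熊']}
```

The three %XX groups are the three UTF-8 bytes of the character, which is why Chinese text takes more space in the body than it does on screen.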


#Note: requests.post() is used the same way as requests.get(); the difference is that requests.post() is normally given a data argument, which holds the request body.

Basic GET request

import requests

response = requests.get('http://httpbin.org/get')
print(response.text)

Output:

{
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.18.4"
  },
  "origin": "223.71.166.246",
  "url": "http://httpbin.org/get"
}

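GET request with parameters

#Requests usually need to carry request headers, which are the key to making a script look like a browser. Commonly useful headers:
Host
Referer #Large sites often use this header to determine where a request came from (which page linked to the current one)
User-Agent #Identifies the client (browser and engine); set it to mimic a browser request
Cookie #Cookies are sent as a request header, but requests has a dedicated cookies parameter for them, so there is no need to put them in headers={}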

Method 1: write the parameters directly into the URL

import requests

response = requests.get('http://httpbin.org/get?name=xiong&age=25')
print(response.text)

Output:

{
  "args": {
    "age": "25",
    "name": "xiong"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.18.4"
  },
  "origin": "223.71.166.246",
  "url": "http://httpbin.org/get?name=xiong&age=25"
}

Method 2: pass a dict through the params argument

import requests

data = {
    'name': 'xiong',
    'age': 25
}
response = requests.get('http://httpbin.org/get', params=data)
print(response.text)

Output:

{
  "args": {
    "age": "25",
    "name": "xiong"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.18.4"
  },
  "origin": "223.71.166.246",
  "url": "http://httpbin.org/get?name=xiong&age=25"
}
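Both methods end up sending the same URL. As a small sketch (assuming only that requests is installed), building a Request object and calling prepare() shows the final URL without touching the network:

```python
import requests

params = {"name": "xiong", "age": 25}

# prepare() builds the request that would be sent, but does not send it.
prepared = requests.Request("GET", "http://httpbin.org/get", params=params).prepare()
print(prepared.url)  # http://httpbin.org/get?name=xiong&age=25
```

Passing a dict is usually the safer choice, because requests takes care of encoding the values for the query string.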

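Summary of the attributes and methods of the response object returned by requests.get()

response = requests.get(url="http://www.baidu.com", params=None)     # get(url, params=None, **kwargs)

response.text       # response body decoded to str (e.g. the page HTML)

response.content        # response body as raw bytes, e.g. for an image such as http://ww1.sinaimg.cn/large/007nuqGAly1g1yst34oyaj30ia0qedh5.jpg

response.encoding       # encoding used to decode response.text; can be read or assigned

response.apparent_encoding      # encoding guessed from the body content

response.status_code        # HTTP status code of the response

response.headers        # response headers

response.cookies        # cookies set by the response, returned as an object

response.cookies.get_dict()        # the cookies as a plain dict

response.url        # final URL of the response

response.history        # list of earlier responses if the request was redirected

response.close()        # close the response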

Parsing JSON

#Parse JSON
import json

import requests

response = requests.get('http://httpbin.org/get')

res1 = json.loads(response.text)  # works, but more verbose

res2 = response.json()  # parse the JSON body directly

print(res1 == res2)  # True

Downloading a small image

import requests

response = requests.get('https://github.com/favicon.ico')
with open('img.ico', 'wb') as f:
    f.write(response.content)

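Downloading a large video file

#stream parameter: fetch the body piece by piece. If a video is, say, 100 GB, loading it all with response.content and writing it to a file in one go is not practical.

import requests

response = requests.get('https://gss3.baidu.com/6LZ0ej3k1Qd3ote6lo7D0j9wehsv/tieba-smallvideo-transcode/1767502_56ec685f9c7ec542eeaf6eac93a65dc7_6fe25cd1347c_3.mp4',
                        stream=True)

with open('b.mp4', 'wb') as f:
    # iter_content() defaults to chunk_size=1 (one byte at a time), so pass a larger size
    for chunk in response.iter_content(chunk_size=8192):
        f.write(chunk)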
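The streaming pattern can be tried without a real video. The sketch below is only an illustration under a few assumptions: it serves a made-up 50,000-byte payload from a throwaway local server (the Handler class, PAYLOAD, and the download() helper are all hypothetical names), then streams it to a temporary file in 8 KB chunks.

```python
import http.server
import os
import tempfile
import threading

import requests

PAYLOAD = b"x" * 50_000  # stand-in for a large video file


class Handler(http.server.BaseHTTPRequestHandler):
    """Hypothetical local endpoint that serves PAYLOAD for any GET."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Length", str(len(PAYLOAD)))
        self.end_headers()
        self.wfile.write(PAYLOAD)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def download(url, path, chunk_size=8192):
    """Stream url to path in chunks and return the number of bytes written."""
    written = 0
    with requests.Session() as session:
        session.trust_env = False  # ignore proxy environment variables for localhost
        with session.get(url, stream=True, timeout=10) as response:
            response.raise_for_status()
            with open(path, "wb") as f:
                for chunk in response.iter_content(chunk_size=chunk_size):
                    written += f.write(chunk)
    return written


server = http.server.HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
try:
    url = f"http://127.0.0.1:{server.server_port}/video.mp4"
    target = os.path.join(tempfile.mkdtemp(), "b.mp4")
    size = download(url, target)
finally:
    server.shutdown()
    server.server_close()
print(size)  # 50000
```

Only one chunk is held in memory at a time, so the same download() helper would work for a file far larger than the available RAM.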

Adding headers

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36'
}
response = requests.get('https://www.baidu.com/', headers=headers)
print(response.status_code)
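To see what the headers argument actually changes, the request can be prepared offline and inspected. This is a rough sketch, not the only way to do it; the shortened User-Agent string is a placeholder rather than a real browser string.

```python
import requests

# The User-Agent that requests sends when none is supplied.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # e.g. python-requests/2.x

headers = {"User-Agent": "Mozilla/5.0 (placeholder browser string)"}
session = requests.Session()

# prepare_request() merges the session's default headers with the per-request ones.
prepared = session.prepare_request(
    requests.Request("GET", "https://www.baidu.com/", headers=headers)
)
print(prepared.headers["User-Agent"])       # the custom value wins
print(prepared.headers["Accept-Encoding"])  # defaults are still present
```

The default python-requests User-Agent is easy for a site to recognize, which is why overriding it is usually the first header a scraper sets.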

Basic POST request

import requests

data = {'name': 'xiong'}
response = requests.post('http://httpbin.org/post', data=data)
print(response.text)

Output:

{
  "args": {},
  "data": "",
  "files": {},
  "form": {
    "name": "xiong"
  },
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Content-Length": "10",
    "Content-Type": "application/x-www-form-urlencoded",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.18.4"
  },
  "json": null,
  "origin": "223.71.166.246",
  "url": "http://httpbin.org/post"
}
The same POST request with custom headers:

import requests

data = {'name': 'xiong'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36'
}
response = requests.post('http://httpbin.org/post', data=data, headers=headers)
print(response.json())

Output:

{'args': {}, 'data': '', 'files': {}, 'form': {'name': 'xiong'}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Connection': 'close', 'Content-Length': '10', 'Content-Type': 'application/x-www-form-urlencoded', 'Host': 'httpbin.org', 'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.117 Safari/537.36'}, 'json': None, 'origin': '223.71.166.246', 'url': 'http://httpbin.org/post'}
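The data argument is what becomes the request body. A minimal offline sketch, again using a prepared request so that nothing is sent, shows the body and the headers that requests derives from it:

```python
import requests

data = {"name": "xiong"}
prepared = requests.Request("POST", "http://httpbin.org/post", data=data).prepare()

# A dict passed as data is form-encoded into the body.
print(prepared.body)                       # name=xiong
print(prepared.headers["Content-Type"])    # application/x-www-form-urlencoded
print(prepared.headers["Content-Length"])  # 10
```

The Content-Length is simply the byte length of the encoded body, so it changes whenever the form values change.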

Response attributes

import requests

response = requests.get('http://www.jianshu.com')
print(type(response.status_code), response.status_code)
print(type(response.headers), response.headers)
print(type(response.cookies), response.cookies)
print(type(response.url), response.url)
print(type(response.history), response.history)

Output:

<class 'int'> 403
<class 'requests.structures.CaseInsensitiveDict'> {'Date': 'Sat, 21 Apr 2018 02:16:27 GMT', 'Server': 'Tengine', 'Content-Type': 'text/html', 'Transfer-Encoding': 'chunked', 'Strict-Transport-Security': 'max-age=31536000; includeSubDomains; preload', 'Content-Encoding': 'gzip', 'X-Via': '1.1 PSbjwjBGP2oc238:9 (Cdn Cache Server V2.0), 1.1 PSgxnnwt6jp78:4 (Cdn Cache Server V2.0), 1.1 PSbjhkwlwa80:0 (Cdn Cache Server V2.0)', 'Connection': 'close'}
<class 'requests.cookies.RequestsCookieJar'> <RequestsCookieJar[]>
<class 'str'> https://www.jianshu.com/
<class 'list'> [<Response [301]>]

File upload

import requests

# Open the file in binary mode; the with block closes it after the upload
with open('img.ico', 'rb') as f:
    files = {'file': f}
    response = requests.post('http://httpbin.org/post', files=files)
print(response.text)

Output:

{
  "args": {},
  "data": "",
  "files": {
    "file": "data:application/octet-stream;base64,
  },
  "form": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Connection": "close",
    "Content-Length": "6661",
    "Content-Type": "multipart/form-data; boundary=4ba9cec7ffee4873b4a00164473f792f",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.18.4"
  },
  "json": null,
  "origin": "223.71.166.246",
  "url": "http://httpbin.org/post"
}
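The multipart/form-data Content-Type in the output is generated by requests from the files argument. A small sketch that shows this offline, using an in-memory stand-in for img.ico (the four bytes are arbitrary, not a real icon):

```python
import io

import requests

# In-memory stand-in for open('img.ico', 'rb'); the bytes are arbitrary.
fake_file = io.BytesIO(b"\x00\x01\x02\x03")
files = {"file": ("img.ico", fake_file)}

prepared = requests.Request("POST", "http://httpbin.org/post", files=files).prepare()
content_type = prepared.headers["Content-Type"]
print(content_type)  # multipart/form-data; boundary=<random hex string>

# The file name travels inside the multipart body.
has_filename = b'filename="img.ico"' in prepared.body
print(has_filename)  # True
```

The boundary is generated fresh for each request, so it should not be set by hand in headers.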

Getting cookies

import requests

response = requests.get('https://www.baidu.com')
print(response.cookies)
for key, value in response.cookies.items():
    print(key + '=' + value)

Output:

<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
BDORZ=27315

Maintaining a session

import requests

s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456')
response = s.get('http://httpbin.org/cookies')
print(response.text)

Output:

{
  "cookies": {
    "number": "123456"
  }
}
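What the Session does here is keep a cookie jar and attach it to every later request. The sketch below illustrates the idea offline: instead of receiving the cookie from httpbin, it sets the same cookie by hand and then inspects the request the session would send.

```python
import requests

session = requests.Session()

# Set the cookie by hand instead of receiving it from /cookies/set/number/123456.
session.cookies.set("number", "123456")

# prepare_request() applies the session's cookie jar to the outgoing request.
prepared = session.prepare_request(
    requests.Request("GET", "http://httpbin.org/cookies")
)
cookie_header = prepared.headers.get("Cookie")
print(cookie_header)  # number=123456
```

Two separate requests.get() calls share no such jar, which is why the cookie would be lost without a Session.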

Certificate verification

1. If the site's certificate cannot be verified, the request raises an error (requests.exceptions.SSLError)

import requests

response = requests.get('https://www.12306.cn')
print(response.status_code)

2. Skip certificate verification with verify=False; the request returns 200 but also prints a warning

import requests

response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

3. Suppress the warning

import requests
import urllib3

urllib3.disable_warnings()
response = requests.get('https://www.12306.cn', verify=False)
print(response.status_code)

4. Use a local client certificate

import requests

response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)

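Proxy settings

import requests

# The dict key is the scheme of the target URL; the value is the proxy's own address.
# Most local proxies speak plain HTTP, so the value usually starts with http:// for both keys.
proxies = {
    'http': 'http://127.0.0.1:9743',
    'https': 'http://127.0.0.1:9743'
}

response = requests.get('https://www.taobao.com', proxies=proxies)
print(response.status_code)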

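Proxy with a username and password

import requests

proxies = {
    'http': 'http://user:password@127.0.0.1:9743',
    # the target URL below is https://, so an 'https' entry is needed for the proxy to be used
    'https': 'http://user:password@127.0.0.1:9743',
}

response = requests.get('https://www.taobao.com', proxies=proxies)
print(response.status_code)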
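requests chooses a proxy entry by the scheme of the target URL, not by the scheme of the proxy address. A short sketch using requests.utils.select_proxy, the helper requests itself uses for this lookup, makes the rule visible without needing a running proxy:

```python
from requests.utils import select_proxy

proxies = {
    "http": "http://127.0.0.1:9743",
    "https": "http://127.0.0.1:9743",
}

# An https:// target picks the 'https' entry.
chosen = select_proxy("https://www.taobao.com", proxies)
print(chosen)  # http://127.0.0.1:9743

# With only an 'http' entry, an https:// target gets no proxy at all.
http_only = {"http": "http://user:password@127.0.0.1:9743"}
missing = select_proxy("https://www.taobao.com", http_only)
print(missing)  # None
```

So a proxies dict with only an 'http' key silently sends https:// requests directly, which is easy to miss when testing.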

To use a SOCKS proxy, install an extra dependency first:

pip install "requests[socks]"

Then configure the proxy:

import requests

proxies = {
    'http': 'socks5://127.0.0.1:9742',
    'https': 'socks5://127.0.0.1:9742',
}

response = requests.get('https://www.taobao.com', proxies=proxies)
print(response.status_code)

Setting a timeout

import requests

# timeout is in seconds; 0.01 is deliberately tiny so that the request times out
response = requests.get('https://www.taobao.com', timeout=0.01)
print(response.status_code)

Error handling

import requests
from requests.exceptions import Timeout

try:
    response = requests.get('https://www.taobao.com', timeout=0.01)
    print(response.status_code)
except Timeout:  # covers both ConnectTimeout and ReadTimeout
    print('time out')

Authentication

Method 1:

import requests
from requests.auth import HTTPBasicAuth

r = requests.get('http://127.0.0.1:8080', auth=HTTPBasicAuth('user', '123'))
print(r.status_code)

Method 2:

import requests

r = requests.get('http://127.0.0.1:8080', auth=('user', '123'))
print(r.status_code)
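Both methods produce the same Authorization header, since a (user, password) tuple is shorthand for HTTP Basic auth. A minimal sketch that inspects the prepared request, so no server on 127.0.0.1:8080 is needed:

```python
import base64

import requests

prepared = requests.Request(
    "GET", "http://127.0.0.1:8080", auth=("user", "123")
).prepare()
auth_header = prepared.headers["Authorization"]
print(auth_header)  # Basic dXNlcjoxMjM=

# The token is just base64("user:123"): encoded, not encrypted.
expected = "Basic " + base64.b64encode(b"user:123").decode("ascii")
print(auth_header == expected)  # True
```

Because the credentials are only base64-encoded, Basic auth should be used over HTTPS whenever the password matters.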

Exception handling

import requests
from requests.exceptions import HTTPError, RequestException, Timeout

try:
    response = requests.get('http://httpbin.org/get', timeout=0.01)
    response.raise_for_status()  # raises HTTPError for 4xx/5xx responses
    print(response.status_code)
except Timeout:  # covers both ConnectTimeout and ReadTimeout
    print('Timeout')
except HTTPError:
    print('HTTP ERROR')
except RequestException:
    print('Error')
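The order of the except clauses matters because the requests exceptions form a hierarchy, with RequestException at the top. The sketch below checks those relationships directly; classify() is a hypothetical helper that uses the same most-specific-first order as the try/except above.

```python
from requests.exceptions import (
    ConnectTimeout,
    HTTPError,
    ReadTimeout,
    RequestException,
    Timeout,
)

# Both timeout flavours share the Timeout base class.
print(issubclass(ConnectTimeout, Timeout))      # True
print(issubclass(ReadTimeout, Timeout))         # True
# Everything ultimately derives from RequestException.
print(issubclass(Timeout, RequestException))    # True
print(issubclass(HTTPError, RequestException))  # True


def classify(exc):
    """Hypothetical helper: most specific classes are tested first."""
    if isinstance(exc, Timeout):
        return "Timeout"
    if isinstance(exc, HTTPError):
        return "HTTP ERROR"
    if isinstance(exc, RequestException):
        return "Error"
    return "not a requests error"


print(classify(ConnectTimeout()))  # Timeout
print(classify(HTTPError()))       # HTTP ERROR
```

If RequestException were listed first, it would catch every case and the more specific branches would never run.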

 

References:

http://www.cnblogs.com/0bug/p/8899841.html

Official documentation: http://www.python-requests.org/en/master/_modules/requests/exceptions/#RequestException