Crawler - Day 02 - Fetching and Analysis

Time: 2021-09-19 19:17:55
###Page fetching###
1. urllib3
    A powerful, easy-to-use HTTP client that fills gaps in the Python standard library, such as connection pooling.
    Install: pip install urllib3
    Usage:
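import urllib3

# PoolManager handles connection pooling and reuse across requests
http = urllib3.PoolManager()
response = http.request('GET', 'http://news.qq.com')
print(response.headers)
# The body arrives as bytes; 'gbk' is assumed to be this page's encoding
result = response.data.decode('gbk')
print(result)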
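
The hard-coded 'gbk' in the usage example only works while the page really is GBK-encoded. As a minimal sketch, assuming the server declares its charset in the Content-Type header, the encoding can be read from that header instead of being guessed. The helper names charset_from_content_type and decode_body are hypothetical and not part of urllib3; only the standard library is used.

```python
from email.message import Message


def charset_from_content_type(content_type, default='utf-8'):
    """Return the charset named in a Content-Type value, or `default`."""
    if not content_type:
        return default
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() returns the lowercased charset, or None if absent
    return msg.get_content_charset() or default


def decode_body(data, content_type, default='utf-8'):
    """Decode response bytes using the declared charset (hypothetical helper)."""
    charset = charset_from_content_type(content_type, default)
    # errors='replace' substitutes undecodable bytes instead of raising
    return data.decode(charset, errors='replace')


# Offline demonstration with GBK-encoded sample bytes
sample = '新闻'.encode('gbk')
print(decode_body(sample, 'text/html; charset=GBK'))
```

With urllib3 this could be called as decode_body(response.data, response.headers.get('Content-Type')). Pages that declare their charset only in an HTML meta tag would still fall back to the default.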
 
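Sending HTTPS requests
Install the dependency: pip install certifi
import certifi
import urllib3

# Verify server certificates against certifi's CA bundle
http = urllib3.PoolManager(cert_reqs='CERT_REQUIRED', ca_certs=certifi.where())
# Use an https:// URL so that the certificate is actually checked
resp = http.request('GET', 'https://news.baidu.com/')
print(resp.data.decode('utf-8'))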
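
To illustrate what 'CERT_REQUIRED' means, here is a small sketch that uses only the standard library's ssl module rather than urllib3 itself. It shows that a default client context requires a valid server certificate and checks the host name. The variable name ctx is illustrative.

```python
import ssl

# A default client-side context, intended for verifying servers
ctx = ssl.create_default_context()

# CERT_REQUIRED: the handshake fails unless the server presents a
# certificate that chains to a trusted CA
print(ctx.verify_mode == ssl.CERT_REQUIRED)

# check_hostname: the certificate must also match the requested host name
print(ctx.check_hostname)
```

In the urllib3 call, cert_reqs asks for this verification, while ca_certs=certifi.where() selects the CA bundle that counts as trusted.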
 
####Passing parameters
import urllib3
from urllib.parse import urlencode

http = urllib3.PoolManager()
args = {'wd': '人民币'}
# Wrong: formatting the dict directly does not percent-encode the query
# url = 'http://www.baidu.com/s?%s' % (args)
url = 'http://www.baidu.com/s?%s' % (urlencode(args))
print(url)
# resp = http.request('GET', url)
# print(resp.data.decode('utf-8'))

headers = {
    'Accept': 'text/javascript, application/javascript, application/ecmascript, application/x-ecmascript, */*; q=0.01',
    # 'br' is left out: urllib3 can only decode Brotli when a Brotli package is installed
    'Accept-Encoding': 'gzip, deflate',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Connection': 'keep-alive',
    'Host': 'www.baidu.com',
    # Header values cannot carry raw Chinese characters, so percent-encode the query
    'Referer': 'https://www.baidu.com/s?' + urlencode(args),
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Safari/537.36"
}
# For GET requests, urllib3 encodes `fields` into the query string
resp = http.request('GET', 'http://www.baidu.com/s', fields=args, headers=headers)
print(resp.data.decode('utf-8'))
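
Why urlencode matters for the query string: non-ASCII values such as 人民币 are sent as percent-encoded UTF-8 bytes, and the encoding can be reversed on the other side. The following is a minimal standard-library sketch; build_search_url is a hypothetical helper name, not a urllib3 function.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(base, params):
    """Append percent-encoded query parameters to `base` (hypothetical helper)."""
    return base + '?' + urlencode(params)


url = build_search_url('http://www.baidu.com/s', {'wd': '人民币'})
print(url)
# → http://www.baidu.com/s?wd=%E4%BA%BA%E6%B0%91%E5%B8%81

# Round trip: parse_qs decodes the percent-encoded value back to text
query = parse_qs(urlsplit(url).query)
print(query['wd'][0])
```

Each Chinese character becomes three %XX groups because it takes three bytes in UTF-8.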