Python Web Scraping Series - First Steps: Scraping Travel Reviews
The scrapers in this series are currently built on the requests package. Its documentation is a convenient place to look things up:
http://docs.python-requests.org/en/master/
POST content formats
To scrape the product reviews of a travel site, analysis shows that the JSON file has to be requested with POST. In short:
- GET appends the data to be sent directly to the URL
- POST sends the data separately, in the request body
Content sent via POST generally comes in one of three forms: form, json, and multipart; the first two are covered here.
1. Content in form
Content-Type: application/x-www-form-urlencoded
Put the content in a dict and pass it to the data parameter.

```python
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post(url, data=payload)
```
2. Content in JSON
Content-Type: application/json
Convert the dict to JSON and pass it to the data parameter.

```python
payload = {'some': 'data'}
r = requests.post(url, data=json.dumps(payload))
```
Or pass the dict to the json parameter.

```python
payload = {'some': 'data'}
r = requests.post(url, json=payload)
```
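The difference between the two is easy to see without touching any server: requests lets you build a PreparedRequest and inspect the body and headers it would send. A minimal sketch (the URL is a placeholder; nothing is actually sent):

```python
import requests

url = 'https://example.com/api'  # placeholder, never contacted
payload = {'some': 'data'}

# Prepare (without sending) two POST requests to compare the encodings.
form_req = requests.Request('POST', url, data=payload).prepare()
json_req = requests.Request('POST', url, json=payload).prepare()

# data= produces a form-encoded body.
print(form_req.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form_req.body)                     # some=data

# json= serializes the dict and sets the JSON content type.
# (Depending on the requests version the body may be bytes, so decode it.)
body = json_req.body
if isinstance(body, bytes):
    body = body.decode('utf-8')
print(json_req.headers['Content-Type'])  # application/json
print(body)                              # {"some": "data"}
```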
HTTP Header overview
A new request may need a type (e.g. POST), a URL, request headers, and a request body. Now let's talk about the request body of a new POST request.
Reference: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
Accept
It can be used to specify certain media types which are acceptable for the response.
The asterisk "*" character means all types: "*/*" indicates all media types and "type/*" indicates all subtypes of that type.
A media range may carry a quality factor, written ";q=" followed by a qvalue between 0 and 1, expressing relative preference. The default "q" is 1.
Accept: audio/*; q=0.2, audio/basic
If more than one media range applies to a given type, the most specific reference has precedence.
Accept: text/*, text/html, text/html;level=1, */*
In this example, "text/html;level=1" has the highest precedence.
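To make the q-value ordering concrete, here is a deliberately minimal sketch of a q-value parser. It handles only the ";q=" parameter and ignores specificity tie-breaking and the rest of the Accept grammar, so treat it as an illustration, not an RFC-compliant negotiator:

```python
def parse_accept(header):
    """Return (media_range, q) pairs sorted by descending preference."""
    ranges = []
    for part in header.split(','):
        media, _, params = part.strip().partition(';')
        q = 1.0  # the default quality factor is 1
        for param in params.split(';'):
            name, _, value = param.strip().partition('=')
            if name == 'q':
                q = float(value)
        ranges.append((media.strip(), q))
    return sorted(ranges, key=lambda pair: pair[1], reverse=True)

print(parse_accept('audio/*; q=0.2, audio/basic'))
# [('audio/basic', 1.0), ('audio/*', 0.2)]
```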
Content-Length
The size of the request body (the entity-body), in bytes (octets).
For example, the form data is like this:

```
type: all
currentPage: 3
productId:
```
And the Request Body you send is like this:
```
type=all&currentPage=3&productId=
```
So the Content-Length is 33.
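You can reproduce this count with the standard library: urllib.parse.urlencode produces the same form-encoded body that requests sends for data=:

```python
from urllib.parse import urlencode

# The form fields from the example above; an empty string keeps
# "productId=" in the body with no value.
form = {'type': 'all', 'currentPage': '3', 'productId': ''}
body = urlencode(form)
print(body)       # type=all&currentPage=3&productId=
print(len(body))  # 33  -> the Content-Length
```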
User-Agent
Search the Internet for different User-Agents.
Here is some simple code for reference.
```python
import requests

def getCommentStr():
    url = r"https://package.com/user/comment/product/queryComments.json"
    header = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0',
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': r'en-US,en;q=0.5',
        'Accept-Encoding': r'gzip, deflate, br',
        'Content-Type': r'application/x-www-form-urlencoded; charset=UTF-8',
        'X-Requested-With': r'XMLHttpRequest',
        'Content-Length': '65',  # requests computes this automatically if omitted
        'DNT': '1',
        'Connection': r'keep-alive',
        'TE': r'Trailers'
    }
    params = {
        'pageNo': '2',
        'pageSize': '10',
        'productId': '2590732030',
        'rateStatus': 'ALL',
        'type': 'all'
    }
    r = requests.post(url, headers=header, data=params)
    print(r.text)

getCommentStr()
```
Tips
- For cookies, you can use the browser's editing tools to delete the cookies from each request one at a time and work out which ones are actually unnecessary.
- While testing code, I prefer to save the scraped data as a str first, which also spares the server repeated requests.
Processing the scraped content
This part mainly covers the BeautifulSoup library and regular expressions.
1. BeautifulSoup
The official bs4 documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
First install it from the terminal with pip install bs4; you can also install the lxml parser (pip install lxml) or the html5lib parser.
```python
soup = bs4.BeautifulSoup(t, 'lxml')
tagList = soup.find_all('div', attrs={'class': 'content'})
tagList = soup.find_all('div', attrs={'class': re.compile("(content)|()")})
```

Here t is the text to be parsed and lxml is the parser.
tagList receives the div tags whose class is "content"; a compiled regular-expression object can also be used as the attribute value.
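A self-contained sketch of these calls on a made-up snippet of HTML (using the stdlib 'html.parser' backend here so no extra parser needs to be installed):

```python
import bs4

# Made-up sample: two review divs and one ad div.
t = '''
<div class="content">Great trip!</div>
<div class="ad">Buy now</div>
<div class="content">Would go again.</div>
'''
soup = bs4.BeautifulSoup(t, 'html.parser')
tagList = soup.find_all('div', attrs={'class': 'content'})
print([tag.get_text() for tag in tagList])  # ['Great trip!', 'Would go again.']
```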
2. Regular expressions
Import re before using regular expressions; see my notes for the basic syntax.
Extracting matches
Match against the target text t:
```python
useful = re.findall(r'有用<em>\d+</em>', t)
```
Or build a regular-expression object and use it:
```python
usefulRE = re.compile(r'有用<em>\d+</em>')
useful = usefulRE.findall(t)
```
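A quick self-contained demo on a made-up snippet of the kind of HTML being scraped ('有用' is the "useful" vote label in the reviews):

```python
import re

# Sample text standing in for the scraped page.
t = '<span>有用<em>12</em></span> ... <span>有用<em>3</em></span>'
usefulRE = re.compile(r'有用<em>\d+</em>')
useful = usefulRE.findall(t)
print(useful)  # ['有用<em>12</em>', '有用<em>3</em>']
```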
Replacing matches
Replace text with the replace() function:

```python
newUseful.append(useful[i].replace('有用<em>', '').replace('</em>', ''))
```
Replace text with a regular expression:

```python
newScoreA.append(re.sub(r'[^\d]', '', scoreA[i]))
```
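As an aside, a capture group can extract the digits in a single step, instead of matching the whole tag and then stripping the markup afterwards (the sample text below is made up):

```python
import re

t = '<span>有用<em>12</em></span> <span>有用<em>3</em></span>'
# The parentheses capture only the digits, so findall returns them directly.
counts = re.findall(r'有用<em>(\d+)</em>', t)
print(counts)  # ['12', '3']
```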