Python Web Scraping Series - First Steps: Scraping Travel Reviews
The scrapers in this series are currently built on the requests package. Its documentation is a convenient place to look things up:
http://docs.python-requests.org/en/master/
POST content formats
To scrape the product reviews of a travel site, analysis shows that the JSON file has to be requested with POST. In short:
- GET appends the data to be sent directly to the URL
- POST sends the data separately, in the request body
Content sent via POST generally comes in one of three forms: form, json, and multipart; the first two are covered here.
1. Content in form
Content-Type: application/x-www-form-urlencoded
Put the content in a dict and pass it to the data parameter.

```python
payload = {'key1': 'value1', 'key2': 'value2'}
r = requests.post(url, data=payload)
```
2. Content in JSON
Content-Type: application/json
Convert the dict to JSON and pass it to the data parameter.

```python
payload = {'some': 'data'}
r = requests.post(url, data=json.dumps(payload))
```
Or pass the dict to the json parameter.

```python
payload = {'some': 'data'}
r = requests.post(url, json=payload)
```
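The difference between the two is easy to see without touching any server: requests lets you build a PreparedRequest and inspect the body and headers it would send. A minimal sketch (the URL is a placeholder; nothing is actually sent):

```python
import requests

url = 'https://example.com/api'  # placeholder, never contacted
payload = {'some': 'data'}

# Prepare (without sending) two POST requests to compare the encodings.
form_req = requests.Request('POST', url, data=payload).prepare()
json_req = requests.Request('POST', url, json=payload).prepare()

# data= produces a form-encoded body.
print(form_req.headers['Content-Type'])  # application/x-www-form-urlencoded
print(form_req.body)                     # some=data

# json= serializes the dict and sets the JSON content type.
# (Depending on the requests version the body may be bytes, so decode it.)
body = json_req.body
if isinstance(body, bytes):
    body = body.decode('utf-8')
print(json_req.headers['Content-Type'])  # application/json
print(body)                              # {"some": "data"}
```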
HTTP Header overview
A new request may need a type (e.g. POST), a URL, request headers, and a request body. Now let's talk about the request body of a new POST request.
Reference: https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html
Accept
It can be used to specify certain media types which are acceptable for the response.
The asterisk "*" character means all types: "*/*" indicates all media types and "type/*" indicates all subtypes of that type.
A media range may carry a quality factor, written ";q=" followed by a qvalue between 0 and 1, expressing relative preference. The default "q" is 1.
Accept: audio/*; q=0.2, audio/basic
If more than one media range applies to a given type, the most specific reference has precedence.
Accept: text/*, text/html, text/html;level=1, */*
In this example, "text/html;level=1" has the highest precedence.
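To make the q-value ordering concrete, here is a deliberately minimal sketch of a q-value parser. It handles only the ";q=" parameter and ignores specificity tie-breaking and the rest of the Accept grammar, so treat it as an illustration, not an RFC-compliant negotiator:

```python
def parse_accept(header):
    """Return (media_range, q) pairs sorted by descending preference."""
    ranges = []
    for part in header.split(','):
        media, _, params = part.strip().partition(';')
        q = 1.0  # the default quality factor is 1
        for param in params.split(';'):
            name, _, value = param.strip().partition('=')
            if name == 'q':
                q = float(value)
        ranges.append((media.strip(), q))
    return sorted(ranges, key=lambda pair: pair[1], reverse=True)

print(parse_accept('audio/*; q=0.2, audio/basic'))
# [('audio/basic', 1.0), ('audio/*', 0.2)]
```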
Content-Length
The size of the request body (the entity-body), in bytes (octets).
For example, the form data is like this:

```
type: all
currentPage: 3
productId:
```
And the Request Body you send is like this:
```
type=all&currentPage=3&productId=
```
So the Content-Length is 33.
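You can reproduce this count with the standard library: urllib.parse.urlencode produces the same form-encoded body that requests sends for data=:

```python
from urllib.parse import urlencode

# The form fields from the example above; an empty string keeps
# "productId=" in the body with no value.
form = {'type': 'all', 'currentPage': '3', 'productId': ''}
body = urlencode(form)
print(body)       # type=all&currentPage=3&productId=
print(len(body))  # 33  -> the Content-Length
```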
User-Agent
Search the Internet for different User-Agents.
Here is some simple code for reference.
```python
import requests

def getCommentStr():
    url = r"https://package.com/user/comment/product/queryComments.json"
    header = {
        'User-Agent': r'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:62.0) Gecko/20100101 Firefox/62.0',
        'Accept': r'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': r'en-US,en;q=0.5',
        'Accept-Encoding': r'gzip, deflate, br',
        'Content-Type': r'application/x-www-form-urlencoded; charset=UTF-8',
        'X-Requested-With': r'XMLHttpRequest',
        'Content-Length': '65',  # requests computes this automatically if omitted
        'DNT': '1',
        'Connection': r'keep-alive',
        'TE': r'Trailers'
    }
    params = {
        'pageNo': '2',
        'pageSize': '10',
        'productId': '2590732030',
        'rateStatus': 'ALL',
        'type': 'all'
    }
    r = requests.post(url, headers=header, data=params)
    print(r.text)

getCommentStr()
```
Tips
- For cookies, you can use the browser's editing tools to delete the cookies from each request one at a time and work out which ones are actually unnecessary.
- While testing code, I prefer to save the scraped data as a str first, which also spares the server repeated requests.
Processing the scraped content
This part mainly covers the BeautifulSoup library and regular expressions.
1. BeautifulSoup
The official bs4 documentation:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
First install it from the terminal with pip install bs4; you can also install the lxml parser (pip install lxml) or the html5lib parser.
```python
soup = bs4.BeautifulSoup(t, 'lxml')
tagList = soup.find_all('div', attrs={'class': 'content'})
tagList = soup.find_all('div', attrs={'class': re.compile("(content)|()")})
```

Here t is the text to be parsed and lxml is the parser.
tagList receives the div tags whose class is "content"; a compiled regular-expression object can also be used as the attribute value.
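A self-contained sketch of these calls on a made-up snippet of HTML (using the stdlib 'html.parser' backend here so no extra parser needs to be installed):

```python
import bs4

# Made-up sample: two review divs and one ad div.
t = '''
<div class="content">Great trip!</div>
<div class="ad">Buy now</div>
<div class="content">Would go again.</div>
'''
soup = bs4.BeautifulSoup(t, 'html.parser')
tagList = soup.find_all('div', attrs={'class': 'content'})
print([tag.get_text() for tag in tagList])  # ['Great trip!', 'Would go again.']
```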
2. Regular expressions
Import re before using regular expressions; see my notes for the basic syntax.
Extracting matches
Match against the target text t:
```python
useful = re.findall(r'有用<em>\d+</em>', t)
```
Or build a regular-expression object and use it:
```python
usefulRE = re.compile(r'有用<em>\d+</em>')
useful = usefulRE.findall(t)
```
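A quick self-contained demo on a made-up snippet of the kind of HTML being scraped ('有用' is the "useful" vote label in the reviews):

```python
import re

# Sample text standing in for the scraped page.
t = '<span>有用<em>12</em></span> ... <span>有用<em>3</em></span>'
usefulRE = re.compile(r'有用<em>\d+</em>')
useful = usefulRE.findall(t)
print(useful)  # ['有用<em>12</em>', '有用<em>3</em>']
```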
Replacing matches
Replace text with the replace() function:

```python
newUseful.append(useful[i].replace('有用<em>', '').replace('</em>', ''))
```
Replace text with a regular expression:

```python
newScoreA.append(re.sub(r'[^\d]', '', scoreA[i]))
```
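As an aside, a capture group can extract the digits in a single step, instead of matching the whole tag and then stripping the markup afterwards (the sample text below is made up):

```python
import re

t = '<span>有用<em>12</em></span> <span>有用<em>3</em></span>'
# The parentheses capture only the digits, so findall returns them directly.
counts = re.findall(r'有用<em>(\d+)</em>', t)
print(counts)  # ['12', '3']
```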