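Scraping the Zhubajie (zbj.com) Website
1. Site analysis
First, enter saas in the search box.
The fields we want are the price, title, rating, sales volume, positive reviews, and company name. When using XPath, the path copied from the browser does not match the structure of the HTML that the request actually returns, so we locate each field by its class attribute instead.
2. Implementation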
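The class-based lookup described above can be sketched on a small fragment. The markup below is hypothetical (the `card` class and the sample values are assumptions, not taken from zbj.com); only the `price` and `name-pic-box` class names mirror the ones used later in this post.

```python
from lxml import etree

# Hypothetical markup for illustration; the real page structure may differ.
SAMPLE_HTML = """
<div id="list">
  <div class="card">
    <div class="price"><span>3000</span></div>
    <div class="name-pic-box"><a>SaaS system development</a></div>
  </div>
  <div class="card">
    <div class="price"><span>500</span></div>
    <div class="name-pic-box"><a>SaaS mini program</a></div>
  </div>
</div>
"""

tree = etree.HTML(SAMPLE_HTML)
# Select each result card first, then query relative to it with a leading "."
cards = tree.xpath('//div[@class="card"]')

rows = []
for card in cards:
    # ".//" searches only inside the current card, regardless of nesting depth
    price = card.xpath('.//div[@class="price"]/span/text()')[0]
    title = card.xpath('.//div[@class="name-pic-box"]/a/text()')[0]
    rows.append((price, title))

print(rows)
```

Because each field is found by class relative to its card, the lookup keeps working even if the number of wrapper `div` elements above the card changes.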
import pandas as pd
import requests
from lxml import etree
url = 'https://shijiazhuang.zbj.com/search/service/?kw=saas&r=2'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36'
}
resp = requests.get(url=url, headers=headers)
html = etree.HTML(resp.text)
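# NOTE: the attribute test inside [@] is missing from this expression as published;
# restore it from the page source before running, otherwise lxml rejects the XPath.
datas = html.xpath('//*[@]/div/div[3]/div/div[3]/div[4]/div[1]/div')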
info_list = []
for data in datas:
    # The path shown in the browser differs from the path in the returned HTML,
    # so every field is looked up by class, relative to the current result card.
    price = data.xpath('.//div[@class="price"]/span/text()')[0]  # price
    title = data.xpath('.//div[@class="name-pic-box"]/a/text()')[0]  # title
    score = data.xpath('.//div[@class="fraction"]/span[1]/text()')[0]  # rating
    sale = data.xpath('.//div[@class="sales"]//span[@class="num"]/text()')[0]  # sales volume
    good = data.xpath('.//div[@class="evaluate"]//span[@class="num"]/text()')[0]  # positive reviews
    com_name = data.xpath('.//div[@class="shop-info text-overflow-line"]/text()')[0]  # company name
    info = {
        '价格': price,  # price
        '标题': title,  # title
        '评分': score,  # rating
        '销量': sale,  # sales volume
        '好评': good,  # positive reviews
        '公司名': com_name  # company name
    }
    info_list.append(info)
pd.DataFrame(info_list).to_csv('../data/猪八戒.csv')
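One fragile spot in the loop above is the `[0]` index: if a result card lacks one of the fields, `xpath()` returns an empty list and the script stops with an `IndexError`. A minimal sketch of a more forgiving lookup follows. The helper name `first_text`, the `card` class, and the sample markup are all hypothetical and only for illustration.

```python
from lxml import etree


def first_text(node, xpath_expr, default=""):
    """Return the first text match under node, stripped, or default if none."""
    matches = node.xpath(xpath_expr)
    return matches[0].strip() if matches else default


# Hypothetical card that has a price but no rating block.
SAMPLE_HTML = """
<div class="card">
  <div class="price"><span> 800 </span></div>
</div>
"""

card = etree.HTML(SAMPLE_HTML).xpath('//div[@class="card"]')[0]

price = first_text(card, './/div[@class="price"]/span/text()')
score = first_text(card, './/div[@class="fraction"]/span[1]/text()', default="N/A")

print(price, score)
```

With a helper like this, a missing field becomes a placeholder value in the CSV rather than an exception that discards everything scraped so far.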
3. Viewing the results
Open the saved CSV file to check the result.
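The saved file can also be inspected from Python instead of a spreadsheet. The sketch below is a self-contained round trip with hypothetical rows written to a temporary file, not the real scraped data. It assumes the same default `to_csv` behaviour as the script, which writes the row index as the first column, so `index_col=0` is used when reading it back. The `utf-8-sig` encoding is an optional addition that often helps Excel display Chinese column names correctly.

```python
import os
import tempfile

import pandas as pd

# Hypothetical rows standing in for scraped results.
sample = pd.DataFrame([
    {"价格": "3000", "标题": "SaaS system development"},
    {"价格": "500", "标题": "SaaS mini program"},
])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv")
    # The index is written as the first column, as in the script above
    sample.to_csv(path, encoding="utf-8-sig")
    # index_col=0 turns that first column back into the index
    loaded = pd.read_csv(path, index_col=0, encoding="utf-8-sig")

print(loaded.head())
```

Note that `read_csv` infers column types, so numeric-looking strings such as the prices come back as integers.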