【python】: Scraping job listings from a recruitment site with a crawler script

Date: 2024-02-17 19:10:01

Method:

1. Under a single job position, the site displays several pages of listings; crawl the link of each page into s_url in turn;

2. Within each page_x page, collect and save the links to its 15 individual job postings into list_url;

3. Open each of those links and extract the fields we want, e.g. title, content, salary;

4. Save the extracted information and write it out to a CSV text file.

Code:

from lxml import etree
import requests
import random
import time

# Target listing page to scrape
url = "https://www.lagou.com/zhaopin/Java/?labelWords=label"
# Set a User-Agent header to mimic a real browser and get past basic anti-scraping checks
head = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36'}

res = requests.get(url, headers=head).content.decode("utf-8")
tree = etree.HTML(res)
# Collect the pagination links (page1, page2, ...) from the pager on this page
s_url = tree.xpath("//div[@class='pager_container']/a[position()>2 and position()<7]/@href")
print('s_url=', s_url)
# Visit each listing page in turn
for x in s_url:
    res = requests.get(x, headers=head).content.decode("utf-8")
    tree = etree.HTML(res)
    print('x==', x)
    # Grab the links of the 15 job postings on the current page
    list_url = tree.xpath("//div[@class='s_position_list ']/ul/li[position()<=15]/div/div[1]/div/a/@href")
    print('list_url=', list_url)
    # Open each posting and extract its title, description and salary
    for y in list_url:
        r01 = requests.get(y, headers=head).content.decode("utf-8")
        html01 = etree.HTML(r01)
        print('y==', y)

        title = html01.xpath("string(//div[@class='job-name'])")
        print('title===', title)
        content = html01.xpath("string(//div[@class='job-detail'])")
        print('content===', content)
        salary = html01.xpath("string(/html/body/div[5]/div/div[1]/dd/h3/span[1])")
        print('salary===', salary)
        # Pause between requests so the site is less likely to flag us;
        # a random delay works better than a fixed one
        time.sleep(random.uniform(3, 7))
        # Append the scraped fields to the output file
        with open("cn-blog.csv", "a+", encoding="utf-8") as file:
            file.write(title + "\n")
            file.write(content + "\n")
            file.write(salary + "\n")
            file.write("*" * 50 + "\n")
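Note that the writes above produce a plain text file with a "*" separator rather than real CSV. A minimal sketch of proper CSV output using Python's built-in csv module (the save_row helper is hypothetical; the three fields match the variables in the loop above):

import csv

# A minimal sketch: one posting per CSV row; csv.writer quotes
# commas and newlines inside fields automatically.
def save_row(title, content, salary, path="cn-blog.csv"):
    with open(path, "a+", encoding="utf-8", newline="") as file:
        csv.writer(file).writerow([title, content, salary])

# In the loop above, the four file.write(...) calls could be
# replaced with: save_row(title, content, salary)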

Summary:

1. Set the header info and sleep between requests so the site is less likely to identify the crawler (some requests still get blocked, but most of the data can be fetched); a sketch of randomizing both is shown below;
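A minimal sketch of that idea, assuming a hypothetical polite_get helper that picks a random User-Agent from a small pool and sleeps a random interval after each request (the UA strings are just examples):

import random
import time
import requests

# Example User-Agent pool; rotating them makes consecutive
# requests look less uniform to the server.
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3534.4 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
]

def polite_get(url):
    head = {"user-agent": random.choice(UA_POOL)}
    res = requests.get(url, headers=head)
    time.sleep(random.uniform(3, 7))  # random pause between requests
    return res.content.decode("utf-8")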

2. To grab a run of sibling elements under one parent with XPath, use the positional predicate [position()>x and position()<y], as in the demo below.
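A self-contained demo of how that predicate slices siblings (the inline HTML here is made up for illustration):

from lxml import etree

html = etree.HTML("<ul><li>a</li><li>b</li><li>c</li><li>d</li><li>e</li></ul>")
# position() is 1-based: >1 and <5 keeps items 2 through 4
print(html.xpath("//li[position()>1 and position()<5]/text()"))
# prints ['b', 'c', 'd']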