[Python] Getting Started with the Scrapy Crawler Framework

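Overview:

　　This article is an introduction to the Scrapy framework and shows how to use it to scrape information from web pages.

　　Example project: scraping the Tencent recruitment page https://hr.tencent.com/position.php?&start=

　　Development environment: Windows 10, Python 3.5, Scrapy 1.5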

I. Installation

　　> pip install scrapy

　　// If the installation fails, see https://blog.csdn.net/dapenghehe/article/details/51548079

　　// or download and install Twisted.
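
　　A quick way to confirm that the installation worked is to ask Python whether the packages can be found. The following is a minimal sketch using only the standard library; the helper name missing_packages is hypothetical and is not part of Scrapy.

```python
import importlib.util


def missing_packages(names=("scrapy", "twisted")):
    """Return the top-level package names that cannot be found
    in the current Python environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("Scrapy and Twisted were both found.")
```

　　If twisted shows up as missing, that points to the Twisted step mentioned above.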

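II. Creating the project (scrapy startproject)

　　1. Before you start scraping, you must create a new Scrapy project. Change to the directory where you want to keep it and run the following command (tencentSpider is the project name):

　　　　> scrapy startproject tencentSpider

　　2. Enter the project directory (tencentSpider).

　　　　The project directory contains the following:

　　　　scrapy.cfg: the project's configuration file.

　　　　tencentSpider/: the project's Python module; your code is imported from here.

　　　　tencentSpider/spiders/: the directory that holds the spider code (this is where most of the editing happens).

　　　　tencentSpider/items.py: the item definitions, that is, the fields to be scraped.

　　　　tencentSpider/middlewares.py: the project's middlewares.

　　　　tencentSpider/pipelines.py: the project's item pipelines.

　　　　tencentSpider/settings.py: the project's settings file.

　　At this point the project skeleton is in place; the next step is to write the crawler code.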

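III. Defining the target (tencentSpider/items.py)

　　Decide which URL to scrape and which information you need, then define the corresponding fields in items.py.

　　This project scrapes the job title, detail-page link, category, headcount, location and publication date from https://hr.tencent.com/position.php?&start=.

　　1. Open items.py in the tencentSpider directory.

　　2. An Item defines structured data fields for holding the scraped data. It behaves like a Python dict but adds some protection against mistakes: for example, assigning to a field that was never declared raises a KeyError.

　　3. To define an Item, create a class that subclasses scrapy.Item and declare class attributes of type scrapy.Field (this is loosely comparable to an ORM mapping).

　　4. Next, create a TencentspiderItem class to build the item model.

　　The code for items.py is as follows: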

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class TencentspiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    # Job title
    title = scrapy.Field()
    # Detail-page link
    link = scrapy.Field()
    # Category
    cate = scrapy.Field()
    # Headcount
    num = scrapy.Field()
    # Location
    address = scrapy.Field()
    # Publication date
    date = scrapy.Field()
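
　　The protection that an Item adds over a plain dict can be pictured with a small stand-in class. This is only a sketch of the idea, not Scrapy's implementation: StrictRecord and JobRecord are hypothetical names, and only item-style assignment is guarded here.

```python
class StrictRecord(dict):
    """Hypothetical stand-in for an Item: a dict that only accepts declared fields."""

    fields = ()

    def __setitem__(self, key, value):
        # Reject keys that were never declared, so typos fail loudly.
        if key not in self.fields:
            raise KeyError(
                "%s does not support field: %s" % (type(self).__name__, key)
            )
        super().__setitem__(key, value)


class JobRecord(StrictRecord):
    # The same six fields that items.py declares
    fields = ("title", "link", "cate", "num", "address", "date")


record = JobRecord()
record["title"] = "Backend Engineer"  # declared field: accepted

try:
    record["titel"] = "typo"  # undeclared field: rejected
    rejected = False
except KeyError as exc:
    rejected = True
    print("Rejected:", exc)
```

　　With a plain dict the misspelled key would be stored silently and the mistake would only surface later in the output file.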

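IV. Writing the spider (spiders/tencent.py)

　　1. Scraping the data

　　　　① In the directory that contains scrapy.cfg, run the command below. It creates a spider named tencent in the tencentSpider/spiders directory and restricts it to the given domain (you can also create the file by hand, using the basic layout shown in ②):

　　　　　　> scrapy genspider tencent "hr.tencent.com"

　　　　② Open tencent.py in the tencentSpider/spiders directory. The default code is: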

# -*- coding: utf-8 -*-
import scrapy


class TencentSpider(scrapy.Spider):
    name = 'tencent'
    allowed_domains = ['hr.tencent.com']
    start_urls = ['http://hr.tencent.com/']

    def parse(self, response):
        pass

　　　　③ Write the spider. The basic approach: build the paginated URLs, parse the content with XPath, and hand the items to the pipeline:

# -*- coding: utf-8 -*-
import scrapy

from tencentSpider.items import TencentspiderItem


class TencentSpider(scrapy.Spider):
    # Name of the spider
    name = 'tencent'
    allowed_domains = ["hr.tencent.com"]

    # Base URL; the page offset is appended to it
    url = "https://hr.tencent.com/position.php?&start="
    offset = 0
    # Entry URL for the first request
    start_urls = [url + str(offset)]

    def parse(self, response):
        info_ls = response.xpath('//tr[contains(@class, "odd")] | //tr[contains(@class, "even")]')

        for each in info_ls:
            # Create the item object
            item = TencentspiderItem()

            # extract_first() returns the default instead of raising
            # IndexError when a cell has no text
            # Job title
            item['title'] = each.xpath("./td/a/text()").extract_first(default="")
            # Detail-page link, resolved against the current page URL
            item['link'] = response.urljoin(each.xpath("./td/a/@href").extract_first(default=""))
            # Category
            item['cate'] = each.xpath('./td[2]/text()').extract_first(default="")
            # Headcount
            item['num'] = each.xpath('./td[3]/text()').extract_first(default="")
            # Location
            item['address'] = each.xpath('./td[4]/text()').extract_first(default="")
            # Publication date
            item['date'] = each.xpath('./td[5]/text()').extract_first(default="")

            # Hand the item over to the pipeline
            yield item

        # Follow the pagination. The offset grows by 10 per page, so
        # offsets 0 to 90 cover the 100 records scraped here.
        if self.offset < 90:
            self.offset += 10
            # After a page has been processed, request the next page
            yield scrapy.Request(self.url + str(self.offset), callback=self.parse)
        else:
            print("[ALL_END: crawl finished]")
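
　　The two ideas in the spider, building the paginated URLs and pulling the cells out of each table row, can be tried without Scrapy or network access. The sketch below is an approximation using only the standard library: SAMPLE is made-up markup that merely imitates the shape of the listing table, page_urls and parse_rows are hypothetical helpers, and ElementTree's limited XPath support stands in for Scrapy's selectors. It also assumes 10 rows per page, as the offset step in the spider suggests.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

BASE_URL = "https://hr.tencent.com/position.php?&start="

# Made-up rows for illustration only; not real data from the site.
SAMPLE = """
<table>
  <tr class="h"><td>Title</td><td>Category</td><td>Headcount</td><td>Location</td><td>Date</td></tr>
  <tr class="even"><td><a href="position_detail.php?id=1">Backend Engineer</a></td><td>Technology</td><td>2</td><td>Shenzhen</td><td>2018-07-01</td></tr>
  <tr class="odd"><td><a href="position_detail.php?id=2">Product Manager</a></td><td></td><td>1</td><td>Beijing</td><td>2018-07-02</td></tr>
</table>
"""


def page_urls(base=BASE_URL, page_size=10, max_items=100):
    """Return one listing URL per page: offsets 0, page_size, ... below max_items."""
    return [base + str(offset) for offset in range(0, max_items, page_size)]


def parse_rows(markup, page_url):
    """Extract one dict per data row, skipping rows whose class is neither odd nor even."""
    root = ET.fromstring(markup.strip())
    rows = []
    for tr in root.iter("tr"):
        css_class = tr.get("class", "")
        if "odd" not in css_class and "even" not in css_class:
            continue  # header or footer row
        cells = tr.findall("td")
        anchor = cells[0].find("a")
        rows.append({
            "title": anchor.text or "",
            # Resolve the relative href against the page it was found on
            "link": urljoin(page_url, anchor.get("href", "")),
            "cate": cells[1].text or "",  # an empty cell yields "" rather than an error
            "num": cells[2].text or "",
            "address": cells[3].text or "",
            "date": cells[4].text or "",
        })
    return rows


urls = page_urls()
rows = parse_rows(SAMPLE, urls[0])
print(len(urls), "pages;", len(rows), "rows on the sample page")
```

　　Note how the second sample row has an empty category cell: indexing the first match directly would fail there, which is why a default value is useful when extracting text.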

　　　　④ Edit the configuration file (settings.py; only the relevant part is shown).

　　　　　　The three settings that need to change are:

# Whether to obey robots.txt; set to False for this project
ROBOTSTXT_OBEY = False

# Default request headers
DEFAULT_REQUEST_HEADERS = {
    'User-Agent': 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0;)',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    # 'Accept-Language': 'en',
}

# Enable the item pipeline
ITEM_PIPELINES = {
    'tencentSpider.pipelines.TencentspiderPipeline': 300,
}
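
　　ROBOTSTXT_OBEY controls whether Scrapy consults a site's robots.txt before requesting a page. To illustrate what such a check decides, here is a minimal sketch with the standard library's urllib.robotparser. The rules in SAMPLE_ROBOTS and the helper is_allowed are hypothetical, and this is not how Scrapy implements the check internally.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; not the real robots.txt of any site.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


listing_ok = is_allowed(SAMPLE_ROBOTS, "*", "https://example.com/position.php")
private_ok = is_allowed(SAMPLE_ROBOTS, "*", "https://example.com/private/data.html")
print("listing allowed:", listing_ok)
print("private allowed:", private_ok)
```

　　With ROBOTSTXT_OBEY = True, requests that the site's rules disallow are dropped; with False, as in this project, no such filtering takes place.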

　　　　⑤ Write the pipeline file pipelines.py:

　　　　　　The pipeline here saves the data to a file in JSON format:

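# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import json


class TencentspiderPipeline(object):
    def open_spider(self, spider):
        # Open the output file when the spider starts and begin a JSON array
        self.save_file = open("res_info.json", "w", encoding="utf8")
        self.save_file.write("[\n")
        self.first_item = True

    def process_item(self, item, spider):
        # Write a separator before every item except the first, so the
        # file is a valid JSON array without a dummy trailing element
        if not self.first_item:
            self.save_file.write(",\n")
        self.first_item = False
        self.save_file.write(json.dumps(dict(item), ensure_ascii=False))
        return item

    def close_spider(self, spider):
        # Close the JSON array and the file when the spider finishes
        self.save_file.write("\n]")
        self.save_file.close()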
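
　　A pipeline of this kind can be checked without running a crawl by driving the same open, write and close steps by hand and loading the file back. The sketch below does that with plain dicts and a temporary directory; JsonArrayWriter and roundtrip are hypothetical helpers written for this illustration, not Scrapy APIs.

```python
import json
import os
import tempfile


class JsonArrayWriter:
    """Hypothetical helper that writes dicts to a file as one JSON array."""

    def __init__(self, path):
        self.path = path
        self.file = None
        self.first_item = True

    def open(self):
        self.file = open(self.path, "w", encoding="utf8")
        self.file.write("[\n")

    def write_item(self, item):
        # Separator goes before every item except the first
        if not self.first_item:
            self.file.write(",\n")
        self.first_item = False
        self.file.write(json.dumps(item, ensure_ascii=False))

    def close(self):
        self.file.write("\n]")
        self.file.close()


def roundtrip(items):
    """Write items to a temporary file and return what json.load reads back."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "res_info.json")
        writer = JsonArrayWriter(path)
        writer.open()
        for item in items:
            writer.write_item(item)
        writer.close()
        with open(path, encoding="utf8") as handle:
            return json.load(handle)


sample_items = [{"title": "后台开发工程师", "num": "2"}, {"title": "Product Manager", "num": "1"}]
loaded = roundtrip(sample_items)
print(loaded == sample_items)
```

　　Because ensure_ascii=False is used, Chinese text is stored readably in the file, which is why the file is opened with an explicit utf8 encoding.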

　　　　⑥ Run the spider:

　　　　　　> scrapy crawl tencent

　　　　⑦ Check the result by opening the data file res_info.json.
