Basic Usage of the Scrapy Framework

Date: 2023-03-08 17:03:16

PyCharm + Scrapy

It has been more than half a year since I last used Scrapy, so it's time to pick it back up.

Straight to the scraping target:

Start URL: http://quotes.toscrape.com/

Goal: parse the quote text, author, and tags from every entry on every page, and save them to a JSON file or a MongoDB database.


Open a command line and run:

scrapy startproject quotetutorial      # generates a project named quotetutorial in the current directory

Then cd quotetutorial, and run:

scrapy genspider quotes quotes.toscrape.com      # create a spider for the target site
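Once the spider is filled in (below), it can be run from the project root. The built-in `-o` flag writes the yielded items to a feed file, which already covers the "save to a JSON file" goal without any custom pipeline (assumes Scrapy is installed and you are inside the project directory):

```shell
# Run the spider named 'quotes' and export all yielded items as JSON
scrapy crawl quotes -o quotes.json
```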

The project structure now looks like this:

(screenshot: project directory structure)

A quick explanation of the pieces:

items: defines the Item classes that hold the scraped data

settings: configuration variables

pipelines: process the Items extracted by the Spider; typical uses include cleaning HTML data, validating scraped data (checking that an Item contains certain fields), deduplicating and dropping items, and saving results to a file or database

middlewares: middleware components

spiders > quotes: the spider module

Next, edit quotes.py:

# -*- coding: utf-8 -*-
import scrapy
from quotetutorial.items import QuotetutorialItem


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        quotes = response.css('.quote')
        for quote in quotes:
            item = QuotetutorialItem()
            text = quote.css('.text::text').extract_first()
            author = quote.css('.author::text').extract_first()
            tags = quote.css('.tags .tag::text').extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item

        # Extract the next-page link and follow it
        next_page = response.css('.pager .next a::attr(href)').extract_first()
        if next_page:
            url = response.urljoin(next_page)  # resolve the relative link
            yield scrapy.Request(url=url, callback=self.parse)  # call back into parse
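The `response.urljoin` call resolves the relative `href` extracted from the pager against the current page's URL. Its behavior matches the stdlib `urljoin`, which can be checked in isolation (the URLs below are just illustrative values):

```python
from urllib.parse import urljoin

# response.urljoin(next_page) behaves like urljoin(response.url, next_page)
base = 'http://quotes.toscrape.com/page/1/'
print(urljoin(base, '/page/2/'))  # → http://quotes.toscrape.com/page/2/
```

This is also why the code must check `next_page` before joining: passing `None` into the join raises an error rather than returning a falsy URL.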

Then pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from pymongo import MongoClient
from scrapy.exceptions import DropItem


class TextPipeline(object):
    """Process each item: cap the size of the text field."""

    def __init__(self):
        self.limit = 50

    def process_item(self, item, spider):
        if item['text']:
            if len(item['text']) > self.limit:
                item['text'] = item['text'][0:self.limit].rstrip() + '...'
            return item
        else:
            raise DropItem('Missing Text')


class MongoPipeline(object):
    """Save items to a MongoDB database."""

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        self.client = MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def process_item(self, item, spider):
        name = item.__class__.__name__
        self.db[name].insert_one(dict(item))  # insert() is deprecated in pymongo
        return item

    def close_spider(self, spider):
        self.client.close()
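The truncation rule in TextPipeline can be checked without running Scrapy at all. A minimal sketch of the same logic as a plain function (the function name `truncate` is mine, not part of the project):

```python
def truncate(text, limit=50):
    """Mirror TextPipeline: cap text at `limit` characters, strip any
    trailing whitespace at the cut, and append an ellipsis."""
    if len(text) > limit:
        return text[:limit].rstrip() + '...'
    return text

print(truncate('short quote'))   # unchanged: under the limit
print(truncate('x' * 60))        # 50 x's followed by '...'
```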

Then items.py:

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class QuotetutorialItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

Then edit settings.py:

SPIDER_MODULES = ['quotetutorial.spiders']
NEWSPIDER_MODULE = 'quotetutorial.spiders'

MONGO_URI = 'localhost'
MONGO_DB = 'quotestutorial'

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'quotetutorial.pipelines.TextPipeline': 300,      # lower number = higher priority, runs first
    'quotetutorial.pipelines.MongoPipeline': 400,
}
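Scrapy orders the entries of ITEM_PIPELINES by their numeric value and chains `process_item` calls in ascending order, each pipeline receiving the item the previous one returned. A sketch of just the ordering (not Scrapy's actual internals):

```python
# Pipelines run in ascending priority-number order
ITEM_PIPELINES = {
    'quotetutorial.pipelines.TextPipeline': 300,
    'quotetutorial.pipelines.MongoPipeline': 400,
}

order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)
print(order)  # TextPipeline first, then MongoPipeline
```

So here the text is truncated by TextPipeline before MongoPipeline writes it to the database.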

One thing worth noting here:

Scrapy ships with its own data-extraction mechanism, called Selectors, which parse HTML using either XPath or CSS expressions; the usage is the same as ordinary selectors.

Here is the same parse method with the CSS selectors swapped for XPath:

    def parse(self, response):
        quotes = response.xpath(".//*[@class='quote']")
        for quote in quotes:
            item = QuotetutorialItem()
            # CSS equivalents:
            # text = quote.css('.text::text').extract_first()
            # author = quote.css('.author::text').extract_first()
            # tags = quote.css('.tags .tag::text').extract()
            text = quote.xpath(".//span[@class='text']/text()").extract()[0]
            author = quote.xpath(".//span/small[@class='author']/text()").extract()[0]
            tags = quote.xpath(".//div[@class='tags']/a/text()").extract()
            item['text'] = text
            item['author'] = author
            item['tags'] = tags
            yield item
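These particular XPath expressions use only simple path steps and `@class` predicates, so they can be tried outside Scrapy with the stdlib's ElementTree. The snippet below is a hypothetical, well-formed stand-in for one quote block (real pages are rarely valid XML, which is why Scrapy's Selectors use a forgiving HTML parser instead):

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for one quote entry on the page
html = """
<div>
  <div class="quote">
    <span class="text">Quote one</span>
    <span><small class="author">Author A</small></span>
    <div class="tags"><a>life</a><a>books</a></div>
  </div>
</div>
"""

root = ET.fromstring(html)
for quote in root.findall(".//*[@class='quote']"):
    text = quote.find(".//span[@class='text']").text
    author = quote.find(".//span/small[@class='author']").text
    tags = [a.text for a in quote.findall(".//div[@class='tags']/a")]
    print(text, author, tags)  # → Quote one Author A ['life', 'books']
```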