This article is just a personal note.
If you don't know Scrapy yet, you can refer to:
- the Scrapy official site
- the Scrapy official documentation
- the Scrapy documentation in Chinese
- the scrapy-splash project git repository
- my personal ScrapyDemo project repository
Preparation
- First, complete a basic Scrapy project
- Install Docker
  - On Windows: download the installer package and run it
  - On macOS: download the installer package and run it (I tried installing with brew, but the install and startup process turned out to be very convoluted, so in the end I just used the installer package)
  - On CentOS 7, run:
    yum install docker
  - On RHEL, run:
    yum install --setopt=obsoletes=0 docker-ce-17.03.2.ce-1.el7.centos.x86_64 docker-ce-selinux-17.03.2.ce-1.el7.centos.noarch
- Install scrapy-splash:
    pip install scrapy-splash
- Start the Docker service
  - On CentOS 7:
    service docker start
  - On Windows: just open the Docker application
  - On macOS: just open the Docker application
- Pull the Splash image:
    docker pull scrapinghub/splash
- Run the image:
    docker run -p 8050:8050 scrapinghub/splash
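Once the container is up, a quick sanity check confirms that Splash answers on the mapped port. Below is a minimal sketch using only the Python standard library; it assumes the default 8050 port mapping from the command above and uses render.html, Splash's plain HTML-rendering endpoint:

import urllib.request

# Ask Splash to render a page and return the resulting HTML.
# 'wait=0.5' gives the page half a second to run its JavaScript first.
check_url = ('http://localhost:8050/render.html'
             '?url=http://example.com&wait=0.5')
with urllib.request.urlopen(check_url, timeout=30) as resp:
    print(resp.status)           # expect 200 if Splash is healthy
    print(resp.read()[:200])     # first bytes of the rendered HTML

Alternatively, opening http://localhost:8050 in a browser shows Splash's small built-in test UI.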
Configure the Splash service (all of the following goes into settings.py):
- Add the Splash server address:
    SPLASH_URL = 'http://localhost:8050'
- Add the Splash middlewares to DOWNLOADER_MIDDLEWARES:
    DOWNLOADER_MIDDLEWARES = {
        'scrapy_splash.SplashCookiesMiddleware': 723,
        'scrapy_splash.SplashMiddleware': 725,
        'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
    }
- Enable SplashDeduplicateArgsMiddleware:
    SPIDER_MIDDLEWARES = {
        'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
    }
- Set a custom DUPEFILTER_CLASS:
    DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
- Set a custom cache storage backend (needed only if you use Scrapy's HTTP cache):
    HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
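With these settings in place, a request can reach Splash in two equivalent ways: through the SplashRequest helper used in the example below, or through a plain scrapy.Request carrying a 'splash' key in its meta, which is what SplashMiddleware actually reads. A minimal sketch of the low-level form, with a placeholder URL:

import scrapy

# Equivalent to SplashRequest('http://example.com', callback, args={'wait': 0.5}):
# scrapy_splash.SplashMiddleware picks up meta['splash'] and reroutes the
# request through the Splash server configured by SPLASH_URL.
request = scrapy.Request('http://example.com', meta={
    'splash': {
        'args': {'wait': 0.5},       # arguments forwarded to the Splash endpoint
        'endpoint': 'render.html',   # which Splash endpoint to use
    },
})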
Example
import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    start_urls = ["http://example.com", "http://example.com/foo"]

    def start_requests(self):
        for url in self.start_urls:
            # Render each page through Splash, waiting 0.5 s for JavaScript
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # ...
        pass
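The parse callback is left open above. As a purely illustrative sketch of what it might do (the spider name, the CSS selector, and the yielded fields are my own assumptions, not part of the original example):

import scrapy
from scrapy_splash import SplashRequest

class TitleSpider(scrapy.Spider):
    name = 'title_example'  # hypothetical name, for illustration only
    start_urls = ['http://example.com']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        # response.text holds the HTML as rendered by Splash, so content
        # generated by JavaScript is visible to ordinary CSS selectors.
        yield {
            'url': response.url,
            'title': response.css('title::text').get(),
        }

Run it as usual with scrapy crawl title_example; the middleware transparently routes every SplashRequest through the server at SPLASH_URL.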