Getting started with Scrapy (Scrapy + scrapy-splash) for crawling JavaScript-rendered pages

Date: 2021-10-14 08:55:40

This post is just my personal notes.

If you are new to Scrapy, see:

Scrapy official site

Scrapy official documentation

Scrapy documentation in Chinese

scrapy-splash project repository

My personal ScrapyDemo project

Preparation
  • Finish a simple Scrapy project first
  • Install Docker

    • On Windows, download and run the installer
    • On macOS, download and run the installer (I tried installing via brew, but the install and startup process was so convoluted that I ended up using the installer instead)
    • On CentOS 7, run:

      yum install docker

    • On RHEL, run:

      yum install --setopt=obsoletes=0 docker-ce-17.03.2.ce-1.el7.centos.x86_64 docker-ce-selinux-17.03.2.ce-1.el7.centos.noarch
    
  • Install scrapy-splash

    pip install scrapy-splash
    
  • Start the Docker service

    • CentOS 7

      service docker start

    • On Windows, just open the app

    • On macOS, just open the app
  • Pull the image

    docker pull scrapinghub/splash
    
  • Run the image

    docker run -p 8050:8050 scrapinghub/splash
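
    Once the container is up, you can sanity-check Splash outside of Scrapy by requesting its render.html endpoint directly. A small sketch follows; the helper name is mine, and it assumes Splash is listening on localhost:8050:

    ```python
    from urllib.parse import urlencode

    def splash_render_url(page_url, wait=0.5, splash="http://localhost:8050"):
        """Build a URL for Splash's render.html endpoint.

        render.html returns the page HTML after JavaScript has executed;
        'wait' tells Splash how long to pause after the page loads.
        """
        query = urlencode({"url": page_url, "wait": wait})
        return f"{splash}/render.html?{query}"

    # With the container running, fetching the rendered page could look like:
    #   from urllib.request import urlopen
    #   html = urlopen(splash_render_url("http://example.com")).read().decode()
    ```

    If that fetch already shows the dynamic content, Splash is working and only the Scrapy-side configuration remains.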
    
  • Configure the Splash service (all of the following goes in settings.py):

    • Add the Splash server address:

      SPLASH_URL = 'http://localhost:8050'

    • Add the Splash middlewares to DOWNLOADER_MIDDLEWARES:

      DOWNLOADER_MIDDLEWARES = {
          'scrapy_splash.SplashCookiesMiddleware': 723,
          'scrapy_splash.SplashMiddleware': 725,
          'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
      }
      
    • Enable SplashDeduplicateArgsMiddleware:

      SPIDER_MIDDLEWARES = {
          'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
      }
      
    • Set a custom DUPEFILTER_CLASS:

      DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
      
    • Set a custom cache storage backend:

      HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
      
  • Example

    import scrapy
    from scrapy_splash import SplashRequest

    class MySpider(scrapy.Spider):
        name = 'example'
        allowed_domains = ['example.com']
        start_urls = ["http://example.com", "http://example.com/foo"]

        def start_requests(self):
            # Route every start URL through Splash so JavaScript is rendered;
            # 'wait' gives the page time to finish executing its scripts.
            for url in self.start_urls:
                yield SplashRequest(url, self.parse, args={'wait': 0.5})

        def parse(self, response):
            # response.text here is the HTML after JavaScript has run
            # ...
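
    Beyond args={'wait': ...}, scrapy-splash can also run a Lua script through Splash's execute endpoint, which helps when a page needs scrolling, clicking, or a longer wait before its content appears. A sketch of that pattern (the script below is illustrative, not from the original notes):

    ```python
    # A minimal Lua script for Splash's 'execute' endpoint: load the page,
    # give JavaScript time to settle, then return the rendered HTML.
    lua_script = """
    function main(splash, args)
        splash:go(args.url)
        splash:wait(1.0)
        return {html = splash:html()}
    end
    """

    # Inside a spider, the request would be sent roughly like this
    # (requires scrapy-splash, so it is left commented here):
    # yield SplashRequest(url, self.parse,
    #                     endpoint='execute',
    #                     args={'lua_source': lua_script})
    ```

    With endpoint='execute', Splash runs the script instead of a plain render, and whatever the script returns under html becomes the response body Scrapy sees.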