分布式爬虫原理
首先我们来看一下scrapy的单机架构:
可以看到,scrapy单机模式,通过一个scrapy引擎通过一个调度器,将Requests队列中的request请求发给下载器,进行页面的爬取。
那么多台主机协作的关键是共享一个爬取队列。
所以,单主机的爬虫架构如下图所示:
前文提到,分布式爬虫的关键是共享一个requests队列,维护该队列的主机称为master,而从机则负责数据的抓取,数据处理和数据存储,所以分布式爬虫架构如下图所示:
MasterSpider 对 start_urls 中的 urls 构造 request,获取 response
MasterSpider 将 response 解析,获取目标页面的 url, 利用 redis 对 url 去重并生成待爬 request 队列
SlaveSpider 读取 redis 中的待爬队列,构造 request
SlaveSpider 发起请求,获取目标页面的 response
Slavespider 解析 response,获取目标数据,写入生产数据库
增加并发
并发是指同时处理数量。其有全局限制和局部(每个网站)的限制。
Scrapy 默认的全局并发限制对同时爬取大量网站的情况并不适用。 增加多少取决于爬虫能占用多少 CPU。 一般开始可以设置为 100 。
不过最好的方式是做一些测试,获得 Scrapy 进程占取 CPU 与并发数的关系。 为了优化性能,应该选择一个能使CPU占用率在80%-90%的并发数。
Redis 远程连接
安装完成后,redis默认是不能被远程连接的,此时要修改配置文件/etc/redis.conf
# bind 127.0.0.1
修改后,重启redis服务器
Windows的小伙伴儿 pip是安装Scrapy可能会出现问题。推荐使用anaconda 、不然还是老老实实用Linux吧
conda install scrapy 或者 pip install scrapy
安装Scrapy-Redis
conda install scrapy-redis 或者 pip install scrapy-redis
开始之前我们得知道scrapy-redis的一些配置:PS 这些配置是写在Scrapy项目的settings.py中的!
- settings.py
# -*- coding: utf-8 -*- # Scrapy settings for companyNews project # # For simplicity, this file contains only settings considered important or # commonly used. You can find more settings consulting the documentation: # # http://doc.scrapy.org/en/latest/topics/settings.html # http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html # http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html from DBSetting import host_redis,port_redis,db_redis,password_redis BOT_NAME = 'companyNews' SPIDER_MODULES = ['companyNews.spiders'] NEWSPIDER_MODULE = 'companyNews.spiders' #-----------------------日志文件配置----------------------------------- #日志文件名 #LOG_FILE = "dg.log" #日志文件级别 LOG_LEVEL = 'WARNING' # Obey robots.txt rules # robots.txt 是遵循 Robot协议 的一个文件,它保存在网站的服务器中,它的作用是,告诉搜索引擎爬虫, # 本网站哪些目录下的网页 不希望 你进行爬取收录。在Scrapy启动后,会在第一时间访问网站的 robots.txt 文件, # 然后决定该网站的爬取范围。 # ROBOTSTXT_OBEY = True # ------------------------全局并发数的一些配置:------------------------------- # Configure maximum concurrent requests performed by Scrapy (default: 16) # 默认 Request 并发数:16 # CONCURRENT_REQUESTS = 32 # 默认 Item 并发数:100 # CONCURRENT_ITEMS = 100 # The download delay setting will honor only one of: # 默认每个域名的并发数:16 #CONCURRENT_REQUESTS_PER_DOMAIN = 16 # 每个IP的最大并发数:0表示忽略 # CONCURRENT_REQUESTS_PER_IP = 0 # Configure a delay for requests for the same website (default: 0) # See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay # See also autothrottle settings and docs #DOWNLOAD_DELAY 会影响 CONCURRENT_REQUESTS,不能使并发显现出来,设置下载延迟 #DOWNLOAD_DELAY = 3 # Disable cookies (enabled by default) #禁用cookies # COOKIES_ENABLED = True # COOKIES_DEBUG = True # Disable Telnet Console (enabled by default) #TELNETCONSOLE_ENABLED = False # Crawl responsibly by identifying yourself (and your website) on the user-agent #USER_AGENT = 'haoduofuli (+http://www.yourdomain.com)' # Override the default request headers: DEFAULT_REQUEST_HEADERS = { 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8', 'Accept-Language': 'en', } # Enable or disable spider middlewares # See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html SPIDER_MIDDLEWARES = { 'companyNews.middlewares.UserAgentmiddleware': 401, 'companyNews.middlewares.ProxyMiddleware':426, } # Enable or disable downloader middlewares # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html DOWNLOADER_MIDDLEWARES = { 'companyNews.middlewares.UserAgentmiddleware': 400, 'companyNews.middlewares.ProxyMiddleware':425, # 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware':423, # 'companyNews.middlewares.CookieMiddleware': 700, } MYEXT_ENABLED=True # 开启扩展 IDLE_NUMBER=10 # 配置空闲持续时间单位为 360个 ,一个时间单位为5s # Enable or disable extensions # See http://scrapy.readthedocs.org/en/latest/topics/extensions.html # 在 EXTENSIONS 配置,激活扩展 EXTENSIONS = { # 'scrapy.extensions.telnet.TelnetConsole': None, 'companyNews.extensions.RedisSpiderSmartIdleClosedExensions': 500, } # Configure item pipelines # See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html # 注意:自定义pipeline的优先级需高于Redispipeline,因为RedisPipeline不会返回item, # 所以如果RedisPipeline优先级高于自定义pipeline,那么自定义pipeline无法获取到item ITEM_PIPELINES = { #将清除的项目在redis进行处理,# 将RedisPipeline注册到pipeline组件中(这样才能将数据存入Redis) # 'scrapy_redis.pipelines.RedisPipeline': 400, 'companyNews.pipelines.companyNewsPipeline': 300,# 自定义pipeline视情况选择性注册(可选) } # Enable and configure the AutoThrottle extension (disabled by default) # See http://doc.scrapy.org/en/latest/topics/autothrottle.html #AUTOTHROTTLE_ENABLED = True # The initial download delay #AUTOTHROTTLE_START_DELAY = 5 # The maximum download delay to be set in case of high latencies #AUTOTHROTTLE_MAX_DELAY = 60 # The average number of requests Scrapy should be sending in parallel to # each remote server #AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0 # Enable showing throttling stats for every response received: #AUTOTHROTTLE_DEBUG = False # Enable and configure HTTP caching (disabled by default) # See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings # ----------------scrapy默认已经自带了缓存,配置如下----------------- # 打开缓存 #HTTPCACHE_ENABLED = True # 设置缓存过期时间(单位:秒) #HTTPCACHE_EXPIRATION_SECS = 0 # 缓存路径(默认为:.scrapy/httpcache) #HTTPCACHE_DIR = 'httpcache' # 忽略的状态码 #HTTPCACHE_IGNORE_HTTP_CODES = [] # 缓存模式(文件缓存) #HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage' #-----------------Scrapy-Redis分布式爬虫相关设置如下-------------------------- # Enables scheduling storing requests queue in redis. #启用Redis调度存储请求队列,使用Scrapy-Redis的调度器,不再使用scrapy的调度器 SCHEDULER = "scrapy_redis.scheduler.Scheduler" # Ensure all spiders share same duplicates filter through redis. #确保所有的爬虫通过Redis去重,使用Scrapy-Redis的去重组件,不再使用scrapy的去重组件 DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter" # 默认请求序列化使用的是pickle 但是我们可以更改为其他类似的。PS:这玩意儿2.X的可以用。3.X的不能用 # SCHEDULER_SERIALIZER = "scrapy_redis.picklecompat" # 使用优先级调度请求队列 (默认使用), # 使用Scrapy-Redis的从请求集合中取出请求的方式,三种方式择其一即可: # 分别按(1)请求的优先级/(2)队列FIFO/(先进先出)(3)栈FILO 取出请求(先进后出) # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.PriorityQueue' # 可选用的其它队列 SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.FifoQueue' # SCHEDULER_QUEUE_CLASS = 'scrapy_redis.queue.LifoQueue' # Don't cleanup redis queues, allows to pause/resume crawls. #不清除Redis队列、这样可以暂停/恢复 爬取, # 允许暂停,redis请求记录不会丢失(重启爬虫不会重头爬取已爬过的页面) #SCHEDULER_PERSIST = True #----------------------redis的地址配置------------------------------------- # Specify the full Redis URL for connecting (optional). # If set, this takes precedence over the REDIS_HOST and REDIS_PORT settings. # 指定用于连接redis的URL(可选) # 如果设置此项,则此项优先级高于设置的REDIS_HOST 和 REDIS_PORT # REDIS_URL = 'redis://root:密码@主机IP:端口' # REDIS_URL = 'redis://root:123456@192.168.8.30:6379' REDIS_URL = 'redis://root:%s@%s:%s'%(password_redis,host_redis,port_redis) # 自定义的redis参数(连接超时之类的) REDIS_PARAMS={'db': db_redis} # Specify the host and port to use when connecting to Redis (optional). # 指定连接到redis时使用的端口和地址(可选) #REDIS_HOST = '127.0.0.1' #REDIS_PORT = 6379 #REDIS_PASS = '19940225' # REDIRECT_ENABLED = False # # HTTPERROR_ALLOWED_CODES = [302, 301] # # DEPTH_LIMIT = 3 #------------------------------------------------------------------------------------------------ # 最大空闲时间防止分布式爬虫因为等待而关闭 # 这只有当上面设置的队列类是SpiderQueue或SpiderStack时才有效 # 并且当您的蜘蛛首次启动时,也可能会阻止同一时间启动(由于队列为空) # SCHEDULER_IDLE_BEFORE_CLOSE = 10 # 序列化项目管道作为redis Key存储 # REDIS_ITEMS_KEY = '%(spider)s:items' # 默认使用ScrapyJSONEncoder进行项目序列化 # You can use any importable path to a callable object. # REDIS_ITEMS_SERIALIZER = 'json.dumps' # 自定义redis客户端类 # REDIS_PARAMS['redis_cls'] = 'myproject.RedisClient' # 如果为True,则使用redis的'spop'进行操作。 # 如果需要避免起始网址列表出现重复,这个选项非常有用。开启此选项urls必须通过sadd添加,否则会出现类型错误。 # REDIS_START_URLS_AS_SET = False # RedisSpider和RedisCrawlSpider默认 start_usls 键 # REDIS_START_URLS_KEY = '%(name)s:start_urls' # 设置redis使用utf-8之外的编码 # REDIS_ENCODING = 'latin1'
Nice配置文件写到这儿。我们来做一些基本的反爬虫设置
最基本的一个切换UserAgent!
首先在项目文件中新建一个useragent.py用来写一堆 User-Agent(可以去网上找更多,也可以用下面这些现成的)
- useragent.py
# -*- coding: utf-8 -*- agents = [ "Mozilla/5.0 (Linux; U; Android 2.3.6; en-us; Nexus S Build/GRK39F) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", "Avant Browser/1.2.789rel1 (http://www.avantbrowser.com)", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/532.5 (KHTML, like Gecko) Chrome/4.0.249.0 Safari/532.5", "Mozilla/5.0 (Windows; U; Windows NT 5.2; en-US) AppleWebKit/532.9 (KHTML, like Gecko) Chrome/5.0.310.0 Safari/532.9", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US) AppleWebKit/534.7 (KHTML, like Gecko) Chrome/7.0.514.0 Safari/534.7", "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/9.0.601.0 Safari/534.14", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.14 (KHTML, like Gecko) Chrome/10.0.601.0 Safari/534.14", "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US) AppleWebKit/534.20 (KHTML, like Gecko) Chrome/11.0.672.2 Safari/534.20", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.27 (KHTML, like Gecko) Chrome/12.0.712.0 Safari/534.27", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.1 (KHTML, like Gecko) Chrome/13.0.782.24 Safari/535.1", "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.120 Safari/535.2", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7", "Mozilla/5.0 (Windows; U; Windows NT 6.0 x64; en-US; rv:1.9pre) Gecko/2008072421 Minefield/3.0.2pre", "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.10) Gecko/2009042316 Firefox/3.0.10", "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-GB; rv:1.9.0.11) Gecko/2009060215 Firefox/3.0.11 (.NET CLR 3.5.30729)", "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6 GTB5", "Mozilla/5.0 (Windows; U; Windows NT 5.1; tr; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 ( .NET CLR 3.5.30729; .NET4.0E)", "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1", "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1", "Mozilla/5.0 (Windows NT 5.1; rv:5.0) Gecko/20100101 Firefox/5.0", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a2) Gecko/20110622 Firefox/6.0a2", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0.1) Gecko/20100101 Firefox/7.0.1", "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100815 Minefield/4.0b4pre", "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0 )", "Mozilla/4.0 (compatible; MSIE 5.5; Windows 98; Win 9x 4.90)", "Mozilla/5.0 (Windows; U; Windows XP) Gecko MultiZilla/1.6.1.0a", "Mozilla/2.02E (Win95; U)", "Mozilla/3.01Gold (Win95; I)", "Mozilla/4.8 [en] (Windows NT 5.1; U)", "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.4) Gecko Netscape/7.1 (ax)", "HTC_Dream Mozilla/5.0 (Linux; U; Android 1.5; en-ca; Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.2; U; de-DE) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/234.40.1 Safari/534.6 TouchPad/1.0", "Mozilla/5.0 (Linux; U; Android 1.5; en-us; sdk Build/CUPCAKE) AppleWebkit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", "Mozilla/5.0 (Linux; U; Android 1.5; en-us; htc_bahamas Build/CRB17) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", "Mozilla/5.0 (Linux; U; Android 2.1-update1; de-de; HTC Desire 1.19.161.5 Build/ERE27) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", "Mozilla/5.0 (Linux; U; Android 1.5; de-ch; HTC Hero Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", "Mozilla/5.0 (Linux; U; Android 2.1; en-us; HTC Legend Build/cupcake) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", "Mozilla/5.0 (Linux; U; Android 1.5; de-de; HTC Magic Build/PLAT-RC33) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1 FirePHP/0.3", "Mozilla/5.0 (Linux; U; Android 1.6; en-us; HTC_TATTOO_A3288 Build/DRC79) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", "Mozilla/5.0 (Linux; U; Android 1.0; en-us; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2", "Mozilla/5.0 (Linux; U; Android 1.5; en-us; T-Mobile G1 Build/CRB43) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari 525.20.1", "Mozilla/5.0 (Linux; U; Android 1.5; en-gb; T-Mobile_G2_Touch Build/CUPCAKE) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Droid Build/FRG22D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Milestone Build/ SHOLS_U2_01.03.1) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", "Mozilla/5.0 (Linux; U; Android 2.0.1; de-de; Milestone Build/SHOLS_U2_01.14.0) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2", "Mozilla/5.0 (Linux; U; Android 0.5; en-us) AppleWebKit/522 (KHTML, like Gecko) Safari/419.3", "Mozilla/5.0 (Linux; U; Android 1.1; en-gb; dream) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2", "Mozilla/5.0 (Linux; U; Android 2.0; en-us; Droid Build/ESD20) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", "Mozilla/5.0 (Linux; U; Android 2.1; en-us; Nexus One Build/ERD62) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Mobile Safari/530.17", "Mozilla/5.0 (Linux; U; Android 2.2; en-us; Sprint APA9292KT Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", "Mozilla/5.0 (Linux; U; Android 2.2; en-us; ADR6300 Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", "Mozilla/5.0 (Linux; U; Android 2.2; en-ca; GT-P1000M Build/FROYO) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1", "Mozilla/5.0 (Linux; U; Android 3.0.1; fr-fr; A500 Build/HRI66) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13", "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/525.10 (KHTML, like Gecko) Version/3.0.4 Mobile Safari/523.12.2", "Mozilla/5.0 (Linux; U; Android 1.6; es-es; SonyEricssonX10i Build/R1FA016) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", "Mozilla/5.0 (Linux; U; Android 1.6; en-us; SonyEricssonX10i Build/R1AA056) AppleWebKit/528.5 (KHTML, like Gecko) Version/3.1.2 Mobile Safari/525.20.1", ]
现在我们来重写一下Scrapy的下载中间件
在项目中新建一个middlewares.py的文件(如果你使用的新版本的Scrapy,在新建的时候会有这么一个文件,直接用就好了)
首先导入UserAgentMiddleware毕竟我们要重写它啊!
import json ##处理json的包 import redis #Python操作redis的包 import random #随机选择 from .useragent import agents #导入前面的 from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware #UserAegent中间件 from scrapy.downloadermiddlewares.retry import RetryMiddleware #重试中间件
开写:
class UserAgentmiddleware(UserAgentMiddleware): def process_request(self, request, spider): agent = random.choice(agents) request.headers["User-Agent"] = agent
第一行:定义了一个类UserAgentmiddleware继承自UserAgentMiddleware
第二行:定义了函数process_request(request, spider)为什么定义这个函数,因为Scrapy每一个request通过中间 件都会调用这个方法。
第三行:随机选择一个User-Agent
第四行:设置request的User-Agent为我们随机的User-Agent
_Y(o)Y一个中间件写完了!哈哈 是不是So easy!
需要登陆的,我们需要重写Cookie中间件!分布式爬虫啊!你不能手动的给每个Spider写一个Cookie吧。而且你还不会知道这个Cookie到底有没有失效。所以我们需要维护一个Cookie池(这个cookie池用redis)。
好!来理一理思路,维护一个Cookie池最基本需要具备些什么功能呢?
获取Cookie
更新Cookie
删除Cookie
判断Cookie是否可用进行相对应的操作(比如重试)
好,我们先做前三个对Cookie进行操作。
首先我们在项目中新建一个cookies.py的文件用来写我们需要对Cookie进行的操作。
首先日常导入我们需要的文件:
import requests import json import redis import logging from .settings import REDIS_URL ##获取settings.py中的REDIS_URL
首先我们把登陆用的账号密码 以Key:value的形式存入redis数据库。不推荐使用db0(这是Scrapy-redis默认使用的,账号密码单独使用一个db进行存储。)
解决第一个问题:获取Cookie:
import requests import json import redis import logging from .settings import REDIS_URL logger = logging.getLogger(__name__) ##使用REDIS_URL链接Redis数据库, deconde_responses=True这个参数必须要,数据会变成byte形式 完全没法用 reds = redis.Redis.from_url(REDIS_URL, db=2, decode_responses=True) login_url = 'http://haoduofuli.pw/wp-login.php' ##获取Cookie def get_cookie(account, password): s = requests.Session() payload = { 'log': account, 'pwd': password, 'rememberme': "forever", 'wp-submit': "登录", 'redirect_to': "http://http://www.haoduofuli.pw/wp-admin/", " } response = s.post(login_url, data=payload) cookies = response.cookies.get_dict() logger.warning("获取Cookie成功!(账号为:%s)" % account) return json.dumps(cookies)
这段很好懂吧。
使用requests模块提交表单登陆获得Cookie,返回一个通过Json序列化后的Cookie(如果不序列化,存入Redis后会变成Plain Text格式的,后面取出来Cookie就没法用啦。)
第二个问题:将Cookie写入Redis数据库(分布式呀,当然得要其它其它Spider也能使用这个Cookie了)
def init_cookie(red, spidername): redkeys = reds.keys() for user in redkeys: password = reds.get(user) if red.get("%s:Cookies:%s--%s" % (spidername, user, password)) is None: cookie = get_cookie(user, password) red.set("%s:Cookies:%s--%s"% (spidername, user, password), cookie)
使用我们上面建立的redis链接获取redis db2中的所有Key(我们设置为账号的哦!),再从redis中获取所有的Value(我设成了密码哦!)
判断这个spider和账号的Cookie是否存在,不存在 则调用get_cookie函数传入从redis中获取到的账号密码的cookie;
保存进redis,Key为spider名字和账号密码,value为cookie。
重写cookie中间件;估摸着吧!聪明的小伙儿看了上面重写User-Agent的方法,十之八九也知道怎么重写Cookie中间件了。
好啦,现在继续写middlewares.py啦!
class CookieMiddleware(RetryMiddleware): def __init__(self, settings, crawler): RetryMiddleware.__init__(self, settings) self.rconn = redis.from_url(settings['REDIS_URL'], db=1, decode_responses=True)##decode_responses设置取出的编码为str init_cookie(self.rconn, crawler.spider.name) @classmethod def from_crawler(cls, crawler): return cls(crawler.settings, crawler) def process_request(self, request, spider): redisKeys = self.rconn.keys() while len(redisKeys) > 0: elem = random.choice(redisKeys) if spider.name + ':Cookies' in elem: cookie = json.loads(self.rconn.get(elem)) request.cookies = cookie request.meta["accountText"] = elem.split("Cookies:")[-1] break
第一行:不说
第二行第三行得说一下 这玩意儿叫重载,有啥用呢:
也不扯啥子高深问题了,小伙伴儿可能发现,当你继承父类之后;子类是不能用 def init()方法的,不过重载父类之后就能用啦!
第四行:settings[‘REDIS_URL’]是个什么鬼?这是访问scrapy的settings。怎么访问的?下面说
第五行:往redis中添加cookie。第二个参数就是spidername的获取方法(其实就是字典啦!)
@classmethod def from_crawler(cls, crawler): return cls(crawler.settings, crawler)
这个貌似不好理解,作用看下面:
这样是不是一下就知道了??
至于访问settings的方法官方文档给出了详细的方法:
http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/settings.html#how-to-access-settings
下面就是完整的middlewares.py文件:
# -*- coding: utf-8 -*- # Define here the models for your spider middleware # # See documentation in: # http://doc.scrapy.org/en/latest/topics/spider-middleware.html from scrapy import signals import json import redis import random from .useragent import agents from .cookies import init_cookie, remove_cookie, update_cookie from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware from scrapy.downloadermiddlewares.retry import RetryMiddleware import logging logger = logging.getLogger(__name__) class UserAgentmiddleware(UserAgentMiddleware): def process_request(self, request, spider): agent = random.choice(agents) request.headers["User-Agent"] = agent class CookieMiddleware(RetryMiddleware): def __init__(self, settings, crawler): RetryMiddleware.__init__(self, settings) self.rconn = redis.from_url(settings['REDIS_URL'], db=1, decode_responses=True)##decode_responses设置取出的编码为str init_cookie(self.rconn, crawler.spider.name) @classmethod def from_crawler(cls, crawler): return cls(crawler.settings, crawler) def process_request(self, request, spider): redisKeys = self.rconn.keys() while len(redisKeys) > 0: elem = random.choice(redisKeys) if spider.name + ':Cookies' in elem: cookie = json.loads(self.rconn.get(elem)) request.cookies = cookie request.meta["accountText"] = elem.split("Cookies:")[-1] break #else: #redisKeys.remove(elem) #def process_response(self, request, response, spider): #""" #下面的我删了,各位小伙伴可以尝试以下完成后面的工作 #你需要在这个位置判断cookie是否失效 #然后进行相应的操作,比如更新cookie 删除不能用的账号 #写不出也没关系,不影响程序正常使用, #"""
MasterSpider
# coding: utf-8 from scrapy import Item, Field from scrapy.spiders import Rule from scrapy_redis.spiders import RedisCrawlSpider from scrapy.linkextractors import LinkExtractor from redis import Redis from time import time from urllib.parse import urlparse, parse_qs, urlencode class MasterSpider(RedisCrawlSpider): name = 'ebay_master' redis_key = 'ebay:start_urls' ebay_main_lx = LinkExtractor(allow=(r'http://www.ebay.com/sch/allcategories/all-categories', )) ebay_category2_lx = LinkExtractor(allow=(r'http://www.ebay.com/sch/[^\s]*/\d+/i.html', r'http://www.ebay.com/sch/[^\s]*/\d+/i.html?_ipg=\d+&_pgn=\d+', r'http://www.ebay.com/sch/[^\s]*/\d+/i.html?_pgn=\d+&_ipg=\d+',)) rules = ( Rule(ebay_category2_lx, callback='parse_category2', follow=False), Rule(ebay_main_lx, callback='parse_main', follow=False), ) def __init__(self, *args, **kwargs): domain = kwargs.pop('domain', '') # self.allowed_domains = filter(None, domain.split(',')) super(MasterSpider, self).__init__(*args, **kwargs) def parse_main(self, response): pass data = response.xpath("//div[@class='gcma']/ul/li/a[@class='ch']") for d in data: try: item = LinkItem() item['name'] = d.xpath("text()").extract_first() item['link'] = d.xpath("@href").extract_first() yield self.make_requests_from_url(item['link'] + r"?_fsrp=1&_pppn=r1&scp=ce2") except: pass def parse_category2(self, response): data = response.xpath("//ul[@id='ListViewInner']/li/h3[@class='lvtitle']/a[@class='vip']") redis = Redis() for d in data: # item = LinkItem() try: self._filter_url(redis, d.xpath("@href").extract_first()) except: pass try: next_page = response.xpath("//a[@class='gspr next']/@href").extract_first() except: pass else: # yield self.make_requests_from_url(next_page) new_url = self._build_url(response.url) redis.lpush("test:new_url", new_url) # yield self.make_requests_from_url(new_url) # yield Request(url, headers=self.headers, callback=self.parse2) def _filter_url(self, redis, url, key="ebay_slave:start_urls"): is_new_url = bool(redis.pfadd(key + "_filter", url)) if is_new_url: redis.lpush(key, url) def _build_url(self, url): parse = urlparse(url) query = parse_qs(parse.query) base = parse.scheme + '://' + parse.netloc + parse.path if '_ipg' not in query.keys() or '_pgn' not in query.keys() or '_skc' in query.keys(): new_url = base + "}) else: new_url = base + "?" + urlencode({"_ipg": query['_ipg'][0], "_pgn": int(query['_pgn'][0]) + 1}) return new_url class LinkItem(Item): name = Field() link = Field()
MasterSpider 继承来自 scrapy-redis 组件下的 RedisCrawlSpider,相比 scrapy框架 有了以下变化:
- redis_key
该爬虫的 start_urls 的存放容器由原先的 Python list 改至 redis list,所以此处需要 redis_key 存放 redis list 的 key - rules
- rules 是含有多个 Rule 对象的 tuple
- Rule 对象实例化常用的三个参数:link_extractor / callback / follow
- link_extractor 是一个 LinkExtractor 对象。 其定义了如何从爬取到的页面提取链接
- callback 是一个 callable 或 string (该spider中同名的函数将会被调用)。 从 link_extractor中每获取到链接时将会调用该函数。该回调函数接受一个response作为其第一个参数, 并返回一个包含 Item 以及(或) Request 对象(或者这两者的子类)的列表(list)。rules中的规则如果callback没有指定,则使用默认的parse函数进行解析,如果指定了,那么使用自定义的解析函数
- follow 是一个布尔(boolean)值,指定了根据该规则从response提取的链接是否需要跟进。 如果 callback 为None, follow 默认设置为 True ,否则默认为 False 。
- process_links 处理所有的链接的回调,用于处理从response提取的links,通常用于过滤(参数为link列表)
- process_request 链接请求预处理(添加header或cookie等)
- ebay_main_lx / ebay_category2_lx
LinkExtractor 对象- allow (a regular expression (or list of)) – 必须要匹配这个正则表达式(或正则表达式列表)的URL才会被提取。如果没有给出(或为空), 它会匹配所有的链接。
- deny 排除正则表达式匹配的链接(优先级高于allow)
- allow_domains 允许的域名(可以是str或list)
- deny_domains 排除的域名(可以是str或list)
- restrict_xpaths: 取满足XPath选择条件的链接(可以是str或list)
- restrict_css 提取满足css选择条件的链接(可以是str或list)
- tags 提取指定标签下的链接,默认从a和area中提取(可以是str或list)
- attrs 提取满足拥有属性的链接,默认为href(类型为list)
- unique 链接是否去重(类型为boolean)
- process_value 值处理函数(优先级大于allow)
- parse_main / parse_category2
- 用于解析符合对应 rule 的 url 的 response 的方法
- _filter_url / _build_url
- 一些有关 url 的工具方法
- LinkItem
- 继承自 Item 对象
- Item 对象是种简单的容器,用于保存爬取到得数据。 其提供了类似于 dict 的 API 以及用于声明可用字段的简单语法。
SlaveSpider
# coding: utf-8 from scrapy import Item, Field from scrapy_redis.spiders import RedisSpider class SlaveSpider(RedisSpider): name = "ebay_slave" redis_key = "ebay_slave:start_urls" def parse(self, response): item = ProductItem() item["price"] = response.xpath("//span[contains(@id,'prcIsum')]/text()").extract_first() item["item_id"] = response.xpath("//div[@id='descItemNumber']/text()").extract_first() item["seller_name"] = response.xpath("//span[@class='mbg-nw']/text()").extract_first() item["sold"] = response.xpath("//span[@class='vi-qtyS vi-bboxrev-dsplblk vi-qty-vert-algn vi-qty-pur-lnk']/a/text()").extract_first() item["cat_1"] = response.xpath("//li[@class='bc-w'][1]/a/span/text()").extract_first() item["cat_2"] = response.xpath("//li[@class='bc-w'][2]/a/span/text()").extract_first() item["cat_3"] = response.xpath("//li[@class='bc-w'][3]/a/span/text()").extract_first() item["cat_4"] = response.xpath("//li[@class='bc-w'][4]/a/span/text()").extract_first() yield item class ProductItem(Item): name = Field() price = Field() sold = Field() seller_name = Field() pl_id = Field() cat_id = Field() cat_1 = Field() cat_2 = Field() cat_3 = Field() cat_4 = Field() item_id = Field()
SlaveSpider 继承自 RedisSpider,属性与方法相比 MasterSpider 简单了不少,少了 rules 与其他,但大致功能都比较类似
SlaveSpider 从 ebay_slave:start_urls 下读取构建好的目标页面的 request,对 response 解析出目标数据,以 ProductItem 的形式输出数据
IP proxy
给请求添加代理有2种方式,第一种是重写你的爬虫类的start_request方法,第二种是添加download中间件。
重写start_request方法
我在我的爬虫类中重写了start_requests方法:
反爬虫一个最常用的方法的就是限制 ip。为了避免最坏的情况,可以利用代理服务器来爬取数据,scrapy 设置代理服务器只需要在请求前设置 Request 对象的 meta 属性,添加 proxy 值即可,
可以通过中间件来实现:
1、------------------------------- class ProxyMiddleware(object): def process_request(self, request, spider): proxy = 'http://178.33.6.236:3128' # 代理服务器 request.meta['proxy'] = proxy proxy_user_pass=b'test:test'#用户名:密码(bytes形式) request.headers['Proxy-Authorization'] = b'Basic '+base64.b64encode(proxy_user_pass) 2、------------------------------- from scrapy.downloadermiddlewares.httpproxy import HttpProxyMiddleware class ProxyMiddleware(HttpProxyMiddleware): def process_request(self, request, spider): proxy = 'http://%s'%ip request.meta['proxy'] = proxy proxy_user_pass=b'test:test' request.headers['Proxy-Authorization'] = b'Basic '+base64.b64encode(proxy_user_pass)
在setting文件中添加
DOWNLOADER_MIDDLEWARES = { '项目名.spider同级文件名.文件名.ProxyMiddleware': 543, }
另外,也可以使用大量的 IP Proxy 建立起代理 IP 池,请求时随机调用来避免更严苛的 IP 限制机制,方法类似 User-Agent 池
URL Filter
正常业务逻辑下,爬虫不会对重复爬取同一个页面两次。所以爬虫默认都会对重复请求进行过滤,但当爬虫体量达到千万级时,默认的过滤器占用的内存将会远远超乎你的想象。
为了解决这个问题,可以通过一些算法来牺牲一点点过滤的准确性来换取更小的空间复杂度
Bloom Filter
Bloom Filter可以用于检索一个元素是否在一个集合中。它的优点是空间效率和查询时间都远远超过一般的算法,缺点是有一定的误识别率和删除困难。
Hyperloglog
HyperLogLog是一个基数估计算法。其空间效率非常高,1.5K内存可以在误差不超过2%的前提下,用于超过10亿的数据集合基数估计。
这两种算法都是合适的选择,以 Hyperloglog 为例
由于 redis 已经提供了支持 hyperloglog 的数据结构,所以只需对此数据结构进行操作即可
MasterSpider 下的 _filter_url 实现了过滤 URL 的功能
def _filter_url(self, redis, url, key="ebay_slave:start_urls"): is_new_url = bool(redis.pfadd(key + "_filter", url)) if is_new_url: redis.lpush(key, url)
当 redis.pfadd() 执行时,一个 url 尝试插入 hyperloglog 结构中,如果 url 存在返回 0,反之返回 1。由此来判断是否要将该 url 存放至待爬队列