Scraping job listings from Zhaopin (智联招聘)

Crawl plan: crawl 30 pages for each job category.

Page-count check:

Locate the following pager element to make the check (note the 30 shown below):

....
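
The check itself can be sketched in a few lines. This is only an illustration under assumptions: the function name should_crawl and the constant MAX_PAGES are made up here, and the pager text is assumed to be a plain page number such as "30".

```python
MAX_PAGES = 30  # from the crawl plan: 30 pages per job category


def should_crawl(page_text, max_pages=MAX_PAGES):
    """Return True if the pager text is a page number within the limit.

    page_text is whatever the pager selector returned; it may be None
    when the element is missing, or non-numeric text such as "...".
    """
    try:
        return 1 <= int(page_text.strip()) <= max_pages
    except (AttributeError, ValueError):
        # AttributeError: page_text was None; ValueError: not a number
        return False
```

Treating a missing or non-numeric pager as "stop" keeps the spider from crashing on pages whose layout differs from the one inspected.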

Locate this element to reach each job's detail page:

jobs = response.css("td.zwmc>div>a")
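
To see which links the selector td.zwmc>div>a targets without running Scrapy, the same match can be approximated with the standard library. This is a stand-in technique (html.parser instead of Scrapy's CSS selectors), and the sample HTML and example.com URLs are invented for illustration; the real page markup is not shown in these notes.

```python
from html.parser import HTMLParser


class JobLinkParser(HTMLParser):
    """Collect href values of <a> elements matching td.zwmc > div > a."""

    def __init__(self):
        super().__init__()
        self.stack = []  # (tag, classes) for each currently open element
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "a" and len(self.stack) >= 2 and attrs.get("href"):
            parent, grandparent = self.stack[-1], self.stack[-2]
            # Direct child of a <div> that is a direct child of td.zwmc
            if (parent[0] == "div" and grandparent[0] == "td"
                    and "zwmc" in grandparent[1]):
                self.links.append(attrs["href"])
        self.stack.append((tag, classes))

    def handle_endtag(self, tag):
        # Pop back to the matching open tag (tolerates unclosed inner tags)
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break


# Hypothetical markup: one job-title cell and one company cell
SAMPLE_HTML = """
<table><tr>
  <td class="zwmc"><div><a href="http://jobs.example.com/1.htm">Python Engineer</a></div></td>
  <td class="gsmc"><a href="http://company.example.com/9.htm">Some Company</a></td>
</tr></table>
"""


def extract_job_links(html):
    parser = JobLinkParser()
    parser.feed(html)
    return parser.links
```

Only the link inside the zwmc cell is returned; the company link is skipped because its parent is not a div under td.zwmc.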

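Parsing the response returned for the start URLs (the spider code further down reads them from the Redis key zhilian:start_urls; myspider:start_urls is the name used in the scrapy-redis examples):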

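def parse(self,response):

1. Check the page number

2. Parse the page

(i) Extract the job URLs

(ii) Yield a request for each one, with parsejob() as the callback that handles the next step

(iii) Extract the next-page URL and yield a request for it

def parsejob(self,response):

1. Extract the job's details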
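
The order of the steps above can be sketched without Scrapy as a generator that yields (callback name, URL) pairs. The function plan_requests and its parameters are hypothetical names for this illustration, and the example.com URLs in the usage lines are placeholders; urljoin is used on the assumption that some hrefs on the page may be relative.

```python
from urllib.parse import urljoin


def plan_requests(page_url, page_number, job_hrefs, next_href, max_pages=30):
    """Yield (callback_name, url) pairs in the order the outline describes."""
    # 1. Check the page number
    if page_number > max_pages:
        return
    # 2(i)-(ii). One request per job link, handled by parsejob()
    for href in job_hrefs:
        yield ("parsejob", urljoin(page_url, href))
    # 2(iii). One request for the next results page, handled by parse()
    if next_href:
        yield ("parse", urljoin(page_url, next_href))


# Usage with placeholder values
for callback, url in plan_requests(
    "http://sou.example.com/list?p=1", 1, ["/job/1.htm"], "?p=2"
):
    print(callback, url)
```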

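Observations from the Scrapy log:

This line is the important one. It shows that a request was sent and the page was downloaded with status 200:

[scrapy.core.engine] DEBUG: Crawled (200) <GET http://..............>

This line should be the parsing result. It shows that an item was extracted from that response:

[scrapy.core.scraper] DEBUG: Scraped from <200 http://..............>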
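
To compare how many pages were downloaded with how many items were produced, the two kinds of log lines can be counted. The sketch below is only an illustration: count_events is a made-up helper, and the sample lines use placeholder example.com URLs in the same shape as the log lines above.

```python
import re
from collections import Counter

# Matches the two log messages discussed above
LOG_RE = re.compile(
    r"\[scrapy\.core\.(?:engine|scraper)\] DEBUG: "
    r"(?P<event>Crawled|Scraped from)\b"
)


def count_events(lines):
    """Count 'Crawled' and 'Scraped from' messages in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.search(line)
        if match:
            counts[match.group("event")] += 1
    return dict(counts)


# Placeholder log lines
SAMPLE_LOG = [
    "[scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com/list> (referer: None)",
    "[scrapy.core.engine] DEBUG: Crawled (200) <GET http://example.com/job/1> (referer: http://example.com/list)",
    "[scrapy.core.scraper] DEBUG: Scraped from <200 http://example.com/job/1>",
]
```

A large gap between the two counts is expected here, since list pages are crawled but only detail pages yield items.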

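Situations encountered while running the spider:

Test conditions:

DOWNLOAD_DELAY = 5

With no request carrying an explicit callback, the local spider raised an error: raise NotImplementedError

Cause: the parse(self,response): function was commented out.

Only the parsejob(self,response): function was left to handle responses. Responses for the start URLs go to parse() by default, and the base class's parse() raises NotImplementedError.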
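
The failure can be reproduced in miniature. The classes below are simplified stand-ins written for this note, not Scrapy's actual code; they only mimic the rule that a response without an explicit callback is handed to parse().

```python
class BaseSpider:
    """Simplified stand-in for a spider base class (not Scrapy's real class)."""

    def parse(self, response):
        raise NotImplementedError(
            f"{type(self).__name__}.parse callback is not defined"
        )


class JobOnlySpider(BaseSpider):
    """A spider whose parse() was commented out, leaving only parsejob()."""

    def parsejob(self, response):
        return {"jobname": response}


def handle(spider, response, callback=None):
    """Dispatch a response: without a callback it falls back to parse()."""
    callback = callback or spider.parse
    return callback(response)
```

Calling handle(spider, response, callback=spider.parsejob) works, while handle(spider, response) raises NotImplementedError, which matches the error seen in the test.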

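Result:

The spider did not stop running. There is an internal mechanism for idle spiders (probably Scrapy's spider_idle signal, which scrapy-redis uses to keep the spider waiting on Redis; I forgot the exact name and will fill it in next time I see it) that exists precisely for this distributed case. The following comment from the scrapy-redis settings explains it:

# Max idle time to prevent the spider from being closed when distributed crawling.
# This only works if queue class is SpiderQueue or SpiderStack,
# and may also block the same time when your spider start at the first time (because the queue is empty).
#SCHEDULER_IDLE_BEFORE_CLOSE = 10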
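
For context, a settings.py fragment along these lines would combine the test condition with the idle setting. Only DOWNLOAD_DELAY = 5 and the commented SCHEDULER_IDLE_BEFORE_CLOSE line come from these notes; the other values are assumptions based on the usual scrapy-redis setup, and the Redis address is a placeholder.

```python
# settings.py (sketch, not the project's actual settings file)

# From the test conditions above: wait 5 seconds between requests
DOWNLOAD_DELAY = 5

# Assumed scrapy-redis wiring: scheduler and duplicate filter backed by Redis
SCHEDULER = "scrapy_redis.scheduler.Scheduler"
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"

# The idle timeout only applies to SpiderQueue or SpiderStack (see the comment above)
SCHEDULER_QUEUE_CLASS = "scrapy_redis.queue.SpiderQueue"
SCHEDULER_IDLE_BEFORE_CLOSE = 10

# Placeholder address; point this at the shared Redis instance
REDIS_URL = "redis://localhost:6379"
```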

Spider code:

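import scrapy
from scrapy_redis.spiders import RedisSpider

# Absolute XPath of the pager, copied from the page; it breaks if the layout changes
PAGER = "//body/div[3]/div[3]/div[2]/form/div[1]/div[1]/div[3]/ul"
MAX_PAGES = 30


class MySpider(RedisSpider):
    name = "zhilian"
    # Redis key that supplies the start URLs
    redis_key = "zhilian:start_urls"
    allowed_domains = ["jobs.zhaopin.com", "sou.zhaopin.com"]

    def parse(self, response):
        # 1. Check the page number; stop if the pager is missing or not numeric
        pagenum = response.xpath(PAGER + "/li[6]/a/text()").extract_first()
        try:
            page = int(pagenum)
        except (TypeError, ValueError):
            return
        if page <= MAX_PAGES:
            # 2(i)-(ii). Follow each job link to its detail page
            for joburl in response.css("td.zwmc>div>a::attr(href)").extract():
                yield scrapy.Request(response.urljoin(joburl), callback=self.parsejob)
            # 2(iii). Request the next results page, if there is one
            nextpage = response.xpath(PAGER + "/li[11]/a/@href").extract_first()
            if nextpage:
                yield scrapy.Request(response.urljoin(nextpage), callback=self.parse)

    def parsejob(self, response):
        # Extract the job details (only the job title for now)
        yield {
            "jobname": response.xpath("//body/div[5]/div[1]/div[1]/h1/text()").extract_first(),
        }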
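
A RedisSpider sits idle until a URL is pushed to its redis_key, so the crawl has to be seeded. The sketch below shows the idea under assumptions: seed_start_urls and FakeRedis are names made up for this note, and the URLs are placeholders. With the redis-py package, a real client such as redis.Redis() exposes the same lpush call and can be passed in place of the fake.

```python
START_URLS_KEY = "zhilian:start_urls"  # same value as redis_key in the spider


def seed_start_urls(client, urls, key=START_URLS_KEY):
    """Push each URL onto the Redis list the spider reads its start URLs from."""
    pushed = 0
    for url in urls:
        client.lpush(key, url)
        pushed += 1
    return pushed


class FakeRedis:
    """In-memory stand-in so the sketch runs without a Redis server."""

    def __init__(self):
        self.lists = {}

    def lpush(self, key, value):
        # LPUSH inserts at the head of the list
        self.lists.setdefault(key, []).insert(0, value)
        return len(self.lists[key])


# Usage with placeholder URLs
client = FakeRedis()
seed_start_urls(client, ["http://example.com/list?p=1"])
```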
