Data Collection: Beike New Homes

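1. Create the crawler project

scrapy startproject <project_name>
scrapy startproject baiduspide

2. Create the spider file (spider name and domain) [run inside the project directory]

cd <project_name>
scrapy genspider <spider_name> <domain>
scrapy genspider baidu baidu.com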

3. Define the data items [in items.py]
4. Write the spider file: the parse() function parses each response
5. Run the spider

cd <project_directory>
scrapy crawl <spider_name>
scrapy crawl baidu -o baidu.csv

6. scrapy shell

response.xpath("//div[@id='u1']/a")
response.xpath("//div[@id='u1']/a/text()")

response.xpath("//div[@id='u1']/a")[0].xpath("text()")
response.xpath("//div[@id='u1']/a")[0].xpath("@href")

scrapy shell

response.xpath("//div[@id='u1']/a")

response.css("#u1 a::text").get()

response.css("#u1 a::attr(href)").get()

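The approach is to send a POST request with FormRequest to simulate a login. Once the request completes, an XPath expression checks whether a logout link appears on the page; if it does, the login succeeded.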

scrapy runspider

Beike new homes example

import scrapy


class BeikehouseItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    name = scrapy.Field()  # development name
    addr = scrapy.Field()  # address
    price = scrapy.Field()  # price
    house_price = scrapy.Field()  # house price

The spider file under spiders/

import scrapy
from beikeHouse.items import BeikehouseItem


class BeikeSpider(scrapy.Spider):
    name = 'beike'
    allowed_domains = ['/loupan']
    # start_urls = ['/loupan/pg1']  # starting URL

    # Crawl multiple pages: pages 1 through 9
    start_urls = ['/pg' + str(i) for i in range(1, 10)]

    def parse(self, response):
        # Each listing row matches the XPath //ul[@class='resblock-list-wrapper']/li
        for i in response.xpath("//ul[@class='resblock-list-wrapper']/li"):
            item = BeikehouseItem()  # instantiate the item container
            item['name'] = i.xpath("./div[@class='resblock-desc-wrapper']/div[@class='resblock-name']/a/text()").get()
            # default="" avoids calling strip() on None when the title is missing
            item['addr'] = i.xpath("div/a[1]/@title").get(default="").strip()
            item['price'] = i.xpath("./div[@class='resblock-desc-wrapper']/div[@class='resblock-price']/div[@class='main-price']/span/text()").get()
            item['house_price'] = i.xpath("./div[@class='resblock-desc-wrapper']/div[@class='resblock-price']/div[@class='second']/text()").get()
            yield item

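Running

A crawler written with the Scrapy framework is several components working together, so it has to be started with a command: "scrapy crawl <spider name>" or "scrapy runspider <spider file>". The command that runs the spider must be entered in the project root directory, i.e. the house directory.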

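Debugging
The best approach is to make the Scrapy project support breakpoint debugging in PyCharm. To enable this, create a new file in the Scrapy project root directory and write the following code in it.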

from scrapy import cmdline

cmdline.execute("scrapy crawl beike -o ".split())
# beike is the spider name (the spider's name attribute)
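The string passed to cmdline.execute is split into an argument list, so the `-o` option needs a file name right after it. A small sketch of building that list explicitly follows; `build_crawl_argv` is a hypothetical helper and `beike.csv` is an assumed output file name that the original does not specify.

```python
def build_crawl_argv(spider_name, output_path=None):
    """Return the argument list for `scrapy crawl`, adding -o only when a path is given."""
    argv = ["scrapy", "crawl", spider_name]
    if output_path:
        argv += ["-o", output_path]
    return argv


def main():
    # Imported here so the helper above can be checked without starting a crawl.
    from scrapy import cmdline

    # beike.csv is an assumed output file name.
    cmdline.execute(build_crawl_argv("beike", "beike.csv"))


print(build_crawl_argv("beike", "beike.csv"))
```

Calling `main()` from the launcher file starts the crawl in the same way as typing the command in a terminal.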

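Before starting a debug run, click next to the code you suspect is faulty to add a breakpoint; clicking again removes it.
Then switch to the file, right-click, and choose "Debug main"; the program starts and jumps straight to the breakpoint. Click the step-down button or press F8 to step through the code, inspecting variable values at any time, as shown in Figure 7-27. These are the basic debugging techniques. You can also switch to the file, right-click, and choose "Run main" to run the program directly, which saves typing the run command every time.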
