Note: most of this content is based on http://www.cnblogs.com/voidsky/p/5490798.html, but the original post does not store the data in a database.
First, create a project called douban9fen:
kuku@ubuntu:~/pachong$ scrapy startproject douban9fen
New Scrapy project 'douban9fen', using template directory '/usr/local/lib/python2.7/dist-packages/scrapy/templates/project', created in:
    /home/kuku/pachong/douban9fen

You can start your first spider with:
    cd douban9fen
    scrapy genspider example example.com
kuku@ubuntu:~/pachong$ cd douban9fen/
First, decide which information to scrape. There are three fields: (1) book title, (2) rating, (3) author.
Next, open https://www.douban.com/doulist/1264675/ in Firefox to analyze the page.
Press F12 to open the developer tools for the page.
Following steps 1, 2, and 3, inspect the div that each target element lives in, then close the developer tools.
Then right-click the page, choose View Page Source, and search the source for the div tag from step 3 whose class is bd doulist-subject.
Working from the largest container down to the smallest, we first use bd doulist-subject to locate each book's block, then loop over each block and extract the individual fields, using the selectors below.
XPath for each book's outer block:
'//div[@class="bd doulist-subject"]'
XPath for the title:
'div[@class="title"]/a/text()'
XPath for the rating:
'div[@class="rating"]/span[@class="rating_nums"]/text()'
Regex for the author (a regular expression is more convenient here):
'<div class="abstract">(.*?)<br'
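Before writing the spider, it is worth sanity-checking these selectors interactively. A minimal scrapy shell session might look like the following sketch (the exact values depend on the live page):

kuku@ubuntu:~/pachong$ scrapy shell https://www.douban.com/doulist/1264675/
>>> books = response.xpath('//div[@class="bd doulist-subject"]')
>>> len(books)                                                # one selector per book block on the page
>>> books[0].xpath('div[@class="title"]/a/text()').extract()  # first book's title
>>> books[0].xpath('div[@class="rating"]/span[@class="rating_nums"]/text()').extract()  # its rating
>>> import re
>>> re.search('<div class="abstract">(.*?)<br', books[0].extract(), re.S).group(1)      # its author line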
With the analysis done, we can write the code.
kuku@ubuntu:~/pachong/douban9fen$ ls
douban9fen scrapy.cfg
kuku@ubuntu:~/pachong/douban9fen$ tree douban9fen/
douban9fen/
├── __init__.py
├── items.py
├── pipelines.py
├── settings.py
└── spiders
    └── __init__.py
kuku@ubuntu:~/pachong/douban9fen/douban9fen/spiders$ vim db_9fen_spider.py
Add the following content:
# -*- coding:utf8 -*-
import scrapy
import re

class Db9fenSpider(scrapy.Spider):
    name = "db9fen"
    allowed_domains = ["douban.com"]
    start_urls = ["https://www.douban.com/doulist/1264675/"]

    # parse the scraped page
    def parse(self, response):
        # print response.body
        ninefenbook = response.xpath('//div[@class="bd doulist-subject"]')
        for each in ninefenbook:
            # title: strip the spaces and newlines padding the anchor text
            title = each.xpath('div[@class="title"]/a/text()').extract()[0]
            title = title.replace(' ', '').replace('\n', '')
            print title
            # author: regex over the block's raw HTML, as analyzed above
            author = re.search('<div class="abstract">(.*?)<br', each.extract(), re.S).group(1)
            author = author.replace(' ', '').replace('\n', '')
            print author
            # rating
            rate = each.xpath('div[@class="rating"]/span[@class="rating_nums"]/text()').extract()[0]
            print rate
Save the file.
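(Optionally, the spider can already be tested at this point with Scrapy's built-in command, run from the project root; the main.py below is just a convenience wrapper around it.)

kuku@ubuntu:~/pachong/douban9fen$ scrapy crawl db9fen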
To make the spider easy to run, we will create a main.py file:
kuku@ubuntu:~/pachong/douban9fen/douban9fen/spiders$ cd ../..
kuku@ubuntu:~/pachong/douban9fen$ vim main.py
Add the following content:
# -*- coding:utf8 -*-
import scrapy.cmdline as cmd

cmd.execute('scrapy crawl db9fen'.split())  # db9fen matches the name attribute in db_9fen_spider.py
Save the file.
Now we can give it a try, running from the project root (the directory that contains scrapy.cfg):
kuku@ubuntu:~/pachong/douban9fen$ python main.py
At this point, however, only the first page is scraped. Inspect the "next page" element on the page:
you can see it sits under a span tag with class="next". We only need to extract that link and then crawl it in turn:
'//span[@class="next"]/link/@href'
After extracting the link, how should our Scrapy spider handle it?
We can use yield: the spider then schedules a request for that URL automatically, and the response is handled by the same parse function:
yield scrapy.http.Request(url,callback=self.parse)
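The href extracted here can be passed to Request as-is. If a relative href were returned instead, it would first have to be joined against the page URL; assuming Scrapy 1.0 or later, a defensive variant (my addition, not the original code) would be:

yield scrapy.http.Request(response.urljoin(url), callback=self.parse)  # urljoin handles both relative and absolute hrefs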
Now edit db_9fen_spider.py and add the following to parse(), after the for loop.
nextpage = response.xpath('//span[@class="next"]/link/@href').extract()
if nextpage:
    print nextpage
    next = nextpage[0]
    print next
    yield scrapy.http.Request(next, callback=self.parse)
Some readers may wonder what next = nextpage[0] means. The variable nextpage is a list holding a single link string; next = nextpage[0] takes that string out of the list and assigns it to next. (Note that this shadows Python's built-in next(); a name like next_url would be safer.)
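As a side note, Scrapy 1.0 and later also provide extract_first(), which returns the first match directly, or None when there is no next page, avoiding both the indexing and the shadowing. A sketch of that alternative, assuming a recent enough Scrapy:

next_url = response.xpath('//span[@class="next"]/link/@href').extract_first()
if next_url:
    yield scrapy.http.Request(next_url, callback=self.parse)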
Now we can define the fields we want to scrape in the items file:
kuku@ubuntu:~/pachong/douban9fen/douban9fen$ vim items.py
Edit items.py so that it contains:
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy
from scrapy import Field

class Douban9FenItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()
    author = Field()
    rate = Field()
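A scrapy.Item behaves like a dict restricted to its declared fields, which is exactly how the spider below fills it. A quick illustration (the title value here is made up):

item = Douban9FenItem()
item['title'] = u'some title'  # hypothetical value
print item['title']            # reads back like a dict
# item['price'] = 10           # would raise KeyError: the field is not declared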
With the fields defined, we edit db_9fen_spider.py again so that the three scraped values are stored as attributes on an instance of the item class from items.py.
kuku@ubuntu:~/pachong/douban9fen/douban9fen$ cd spiders/
kuku@ubuntu:~/pachong/douban9fen/douban9fen/spiders$ vim db_9fen_spider.py
# -*- coding:utf8 -*-
import scrapy
import re
from douban9fen.items import Douban9FenItem

class Db9fenSpider(scrapy.Spider):
    name = "db9fen"
    allowed_domains = ["douban.com"]
    start_urls = ["https://www.douban.com/doulist/1264675/"]

    # parse the scraped page
    def parse(self, response):
        # print response.body
        ninefenbook = response.xpath('//div[@class="bd doulist-subject"]')
        for each in ninefenbook:
            item = Douban9FenItem()
            title = each.xpath('div[@class="title"]/a/text()').extract()[0]
            title = title.replace(' ', '').replace('\n', '')
            print title
            item['title'] = title
            author = re.search('<div class="abstract">(.*?)<br', each.extract(), re.S).group(1)
            author = author.replace(' ', '').replace('\n', '')
            print author
            item['author'] = author
            rate = each.xpath('div[@class="rating"]/span[@class="rating_nums"]/text()').extract()[0]
            print rate
            item['rate'] = rate
            yield item
        # follow the "next page" link and parse it with this same function
        nextpage = response.xpath('//span[@class="next"]/link/@href').extract()
        if nextpage:
            # print nextpage
            next = nextpage[0]
            # print next
            yield scrapy.http.Request(next, callback=self.parse)
Edit settings.py and add the database configuration:
USER_AGENT = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; zh-CN; rv:1.8.1.14) Gecko/20080404 Firefox/44.0.2'

# start MySQL database configure setting
MYSQL_HOST = 'localhost'
MYSQL_DBNAME = 'douban9fen'
MYSQL_USER = 'root'
MYSQL_PASSWD = 'openstack'
# end of MySQL database configure setting

ITEM_PIPELINES = {
    'douban9fen.pipelines.Douban9FenPipeline': 300,
}
Note that MySQL must already be installed. The database name is douban9fen, so we first need to create the douban9fen database. (As written, the pipeline below hardcodes the same connection values, so the MYSQL_* settings above are for reference.)
kuku@ubuntu:~/pachong/douban9fen/douban9fen/spiders$ mysql -uroot -p
Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 46
Server version: 5.5.52-0ubuntu0.14.04.1 (Ubuntu)

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> create database douban9fen;
Query OK, 1 row affected (0.00 sec)
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| csvt04             |
| douban9fen         |
| doubandianying     |
| mysql              |
| performance_schema |
| web08              |
+--------------------+
7 rows in set (0.00 sec)
We can see that the database was created successfully.
mysql> use douban9fen;
Next, create the table:
mysql> create table douban9fen (
    -> id int(4) not null primary key auto_increment,
    -> title varchar(100) not null,
    -> author varchar(40) not null,
    -> rate varchar(20) not null
    -> ) CHARACTER SET utf8 COLLATE utf8_general_ci;
Query OK, 0 rows affected (0.04 sec)
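You can confirm the schema with describe (the output follows directly from the create table statement above):

mysql> describe douban9fen;
+--------+--------------+------+-----+---------+----------------+
| Field  | Type         | Null | Key | Default | Extra          |
+--------+--------------+------+-----+---------+----------------+
| id     | int(4)       | NO   | PRI | NULL    | auto_increment |
| title  | varchar(100) | NO   |     | NULL    |                |
| author | varchar(40)  | NO   |     | NULL    |                |
| rate   | varchar(20)  | NO   |     | NULL    |                |
+--------+--------------+------+-----+---------+----------------+
4 rows in set (0.00 sec)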
Edit pipelines.py to store the data in the database:
kuku@ubuntu:~/pachong/douban9fen/douban9fen$ vim pipelines.py
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

# store the scraped data in the MySQL database
from twisted.enterprise import adbapi
import MySQLdb
import MySQLdb.cursors

class Douban9FenPipeline(object):
    # database connection parameters
    def __init__(self):
        dbargs = dict(
            host='127.0.0.1',
            db='douban9fen',
            user='root',
            passwd='openstack',
            cursorclass=MySQLdb.cursors.DictCursor,
            charset='utf8',
            use_unicode=True,
        )
        self.dbpool = adbapi.ConnectionPool('MySQLdb', **dbargs)

    def process_item(self, item, spider):
        # run the insert asynchronously on the connection pool
        res = self.dbpool.runInteraction(self.insert_into_table, item)
        return item

    # the target table must be created beforehand
    def insert_into_table(self, conn, item):
        conn.execute(
            'insert into douban9fen(title, author, rate) values (%s, %s, %s)',
            (item['title'], item['author'], item['rate'])
        )
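One caveat: runInteraction executes the insert asynchronously on a thread pool and returns a Deferred, so as written any SQL error is silently discarded. A small hedge (my addition, not in the original post) is to attach an errback to that Deferred:

    def process_item(self, item, spider):
        d = self.dbpool.runInteraction(self.insert_into_table, item)
        d.addErrback(self._handle_error, item)  # log failures instead of dropping them
        return item

    # hypothetical helper: print the twisted Failure and the offending item
    def _handle_error(self, failure, item):
        print failure
        print item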
After editing the files above, return to the project root:
kuku@ubuntu:~/pachong/douban9fen/douban9fen$ cd ..
kuku@ubuntu:~/pachong/douban9fen$
and run main.py again:
kuku@ubuntu:~/pachong/douban9fen$ python main.py
The spider runs, printing each book's title, author, and rating as it works through the pages.
Open MySQL and check whether the data has been written to the database:
kuku@ubuntu:~/pachong/douban9fen$ mysql -uroot -p
Enter the password openstack to log in.
mysql> show databases;
+--------------------+
| Database           |
+--------------------+
| information_schema |
| csvt04             |
| douban9fen         |
| doubandianying     |
| mysql              |
| performance_schema |
| web08              |
+--------------------+
7 rows in set (0.00 sec)
mysql> use douban9fen;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> show tables;
+----------------------+
| Tables_in_douban9fen |
+----------------------+
| douban9fen           |
+----------------------+
1 row in set (0.00 sec)
mysql> select * from douban9fen;
The output shows that the data was written to the database successfully.
This article comes from the "lefteva" blog; please retain this attribution: http://lefteva.blog.51cto.com/11892835/1874863