Python Crawler Series: Taking It to the Next Level

Date: 2021-11-15 19:49:31

This post mainly covers installing and using the Scrapy framework.

 

Installing the Scrapy Framework

From the command line, change into the C:\Anaconda2\Scripts directory and run: conda install Scrapy
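If you want to confirm the install worked, a quick check from a Python shell is enough (just importing the package and printing its version string):

# Confirm that Scrapy is importable after the conda install
import scrapy
print(scrapy.__version__)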

 

Creating a Scrapy Project

1) Change into the directory where you want the project to live and run the command scrapy startproject <project name> to create it.


The new project's directory layout and contents:

demo/
    scrapy.cfg
    demo/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            ...

These files are:

  • scrapy.cfg: the project's configuration file
  • demo/: the project's Python module.
  • demo/items.py: the project's items file, where you define the fields you want to scrape.
  • demo/pipelines.py: the project's pipelines file, where you define how the scraped data is stored.
  • demo/settings.py: the project's settings file, where Scrapy components are customized; it is fairly involved and can be skipped for now.
  • demo/spiders/: the directory for spider code, where the actual crawling is implemented.

 

Defining the Crawler Files

1) Define the Item

# items.py
import scrapy

class DoubanItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    url = scrapy.Field()
    rate = scrapy.Field()
    tag = scrapy.Field()
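A Scrapy Item behaves much like a dictionary, except that only the declared fields can be set. A quick usage sketch with made-up values (assuming the project module is named douban, as in the spider below):

# Usage sketch (hypothetical values): a DoubanItem is filled and read like a dict
from douban.items import DoubanItem

item = DoubanItem()
item['title'] = u'Example Movie'
item['rate'] = u'9.0'
print(dict(item))            # {'title': u'Example Movie', 'rate': u'9.0'}
# item['year'] = 2021        # would raise KeyError: 'year' is not a declared Field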

2) Define the Spider

# coding: utf-8
import re
import json
import urllib
import sys

import scrapy
from douban.items import DoubanItem

# Python 2 hack: make implicit str/unicode conversions (e.g. str() on the
# Chinese tag names below) use UTF-8 instead of ASCII
reload(sys)
sys.setdefaultencoding('utf-8')


class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["douban.com"]
    start_urls = [
        "https://movie.douban.com/j/search_subjects?type=movie&tag=热门&sort=recommend&page_limit=1000&page_start=0",
    ]

    def start_requests(self):
        # one request per Douban movie tag
        reqs = []
        tags = [u'热门', u'最新', u'经典', u'豆瓣高分', u'冷门佳片', u'华语', u'欧美',
                u'韩国', u'日本', u'动作', u'喜剧', u'爱情', u'科幻', u'悬疑', u'恐怖', u'文艺']

        for i in tags:
            url = 'https://movie.douban.com/j/search_subjects?type=movie&tag=' + str(i) + '&sort=recommend&page_limit=1000&page_start=0'
            req = scrapy.Request(url)
            reqs.append(req)

        return reqs

    def parse(self, response):
        html = response.body
        url = response.url
        # recover the tag this response belongs to from the request URL
        tag = re.findall(u'tag=(.*?)&', url)[0]
        tag = urllib.unquote(tag)

        dictt = json.loads(html)
        dd = dictt['subjects']
        items = []
        for a in dd:
            pre_item = DoubanItem()   # the original used TutorialItem(), which is not defined here
            pre_item['url'] = a['url']
            pre_item['title'] = a['title']
            pre_item['rate'] = a['rate']
            pre_item['tag'] = tag
            items.append(pre_item)

        return items
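As an aside, the same spider is often written with yield instead of building up the reqs and items lists. The sketch below (not part of the original project files) rewrites the two methods above in that style, with a shortened tag list; the behaviour is otherwise the same:

# Alternative sketch: the same two methods using yield, the more common Scrapy idiom
class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["douban.com"]

    def start_requests(self):
        tags = [u'热门', u'最新', u'经典']   # shortened here; use the full tag list from above
        for i in tags:
            url = ('https://movie.douban.com/j/search_subjects?type=movie&tag='
                   + str(i) + '&sort=recommend&page_limit=1000&page_start=0')
            yield scrapy.Request(url)

    def parse(self, response):
        tag = urllib.unquote(re.findall(u'tag=(.*?)&', response.url)[0])
        for a in json.loads(response.body)['subjects']:
            item = DoubanItem()
            item['url'] = a['url']
            item['title'] = a['title']
            item['rate'] = a['rate']
            item['tag'] = tag
            yield item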

3) Define the Pipeline

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymongo

class DoubanPipeline(object):

    def process_item(self, item, spider):
        # read the MongoDB connection info from settings.py
        host = spider.settings.get('MONGO_HOST')
        port = spider.settings.get('MONGO_PORT')
        client = pymongo.MongoClient(host, port)
        db = client[spider.settings.get('MONGO_DB')]
        db[spider.settings.get('MONGO_COLL')].insert(dict(item))
        return item
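Opening a new MongoClient for every single item works, but a common refinement is to connect once per spider run using the open_spider/close_spider hooks. A sketch of that variant, still reading the MONGO_* values defined in the settings (next step), and using insert_one, the pymongo 3+ name for what insert() does above:

# Sketch: connect once per spider run instead of once per item
import pymongo

class DoubanPipeline(object):

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(
            spider.settings.get('MONGO_HOST'),
            spider.settings.get('MONGO_PORT'))
        db = self.client[spider.settings.get('MONGO_DB')]
        self.coll = db[spider.settings.get('MONGO_COLL')]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.coll.insert_one(dict(item))   # pymongo 3+ equivalent of insert()
        return item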

4) Define the Settings

MONGO_HOST = "127.0.0.1"   # host IP
MONGO_PORT = 27017         # port
MONGO_DB = "Spider"        # database name
MONGO_COLL = "douban"      # collection name
# MONGO_USER = "Ryana"
# MONGO_PSW = "123456"
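One more thing: as the comment at the top of pipelines.py reminds you, the pipeline also has to be registered in settings.py via ITEM_PIPELINES, otherwise process_item is never called. Assuming the project module is named douban:

# settings.py — enable the pipeline (the number is its order; lower runs first)
ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}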

 

Running the Spider

Change into the project folder and run scrapy crawl spiderName (here: scrapy crawl dmoz). I also recommend Robomongo, a MongoDB GUI client; the results look like the figure below.

[Figure: crawl results viewed in Robomongo]
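If you would rather check the data from Python instead of Robomongo, a small pymongo sketch works too (plugging in the MONGO_* values from settings.py):

# Quick look at what the spider wrote to MongoDB
import pymongo

client = pymongo.MongoClient("127.0.0.1", 27017)   # MONGO_HOST / MONGO_PORT
coll = client["Spider"]["douban"]                   # MONGO_DB / MONGO_COLL

for doc in coll.find().limit(3):                    # print a few stored movies
    print(doc)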