(6) Python Crawler: Scraping City Weather Forecasts from Tianqi (tianqi.com) with Scrapy and Saving the Data to MySQL

Date: 2024-03-22 07:13:50

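I. This crawler project is built with Scrapy, so Scrapy must already be installed. If it is not, see my post from a few days ago on installing Scrapy offline with pip.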

1. Developing a crawler with Scrapy usually starts with creating a Scrapy project. The following command creates one:

scrapy startproject PythonScrapyWeather  (PythonScrapyWeather is the project name)

2. Next, create the spider file Weathers.py with the following commands:
# Enter the project directory
cd PythonScrapyWeather
# Create the spider file
scrapy genspider Weathers tianqi.com  (the spider named Weathers is generated as the file Weathers.py)

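II. Overview of the project files:

(1) __init__.py
This is the project's initialization file. It marks the directory as a Python package and can hold initialization code. The spiders directory is a Python package as well.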

(2) items.py
The project's data container file. It defines the Item classes that describe the data we want to scrape.

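(3) pipelines.py
The project's pipeline file. It further processes the data defined in items.py: it receives instances of the Item classes, cleans the data, and saves it to a file or a database.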

(4) settings.py
The project's settings file, which holds configuration such as the user agent, cookies, and the download delay.

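(5) spiders folder
This folder holds the spiders themselves, that is, the crawling part of the project.

The spider file Weathers.py

name is the spider's name and its unique identifier

allowed_domains limits the domains the spider may crawl, and start_urls lists the URLs where crawling starts

parse is the callback that carries out the actual processing of each response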

III. Writing the crawler code

(1) Weathers.py: handles the network requests and the scraping logic

import scrapy
# The package name must match your own project name; the listings in this post use Python_Scrapy_Weather
from Python_Scrapy_Weather.items import PythonScrapyWeatherItem

"""
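There are two ways to crawl multiple pages.

1) Collect a list of sub-page URLs from one or more index pages, and have parse() request each sub-page in the list in turn.

2) Crawl recursively, which is comparatively simple. In Scrapy, once the start page and the crawl rules (the rules attribute of CrawlSpider) are defined, recursive crawling happens automatically.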
"""
class WeathersSpider(scrapy.Spider):
    # Basic spider settings
    name = 'Weathers'
    allowed_domains = ['tianqi.com']
    start_urls = ["https://www.tianqi.com/chinacity.html"]

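    # Handle the response that lists the province and municipality URLs. The request is not written out here, but it has already been sent: the default start_requests() method runs automatically.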

    def parse(self, response):
        url = "https://www.tianqi.com"
        # Province and municipality links (extracted here but not used below)
        allProvince_list = response.xpath('//div[@class="citybox"]/h2/a/@href').extract()
        # City links
        allCity_list = response.xpath('//div[@class="citybox"]/span/h3/a/@href').extract()
        print("*******allCity_list", allCity_list)
        for city_name in allCity_list:
            city_url = url + city_name
            print("city_url*******", city_url)
            # Request each city page; the response is handled by subpage_content()
            yield scrapy.Request(city_url, callback=self.subpage_content)

    # Handle the response for each city page
    def subpage_content(self, response):
        print("response", response.status)

        try:
            # Select the required HTML elements with XPath
            province_Data = response.xpath('//div[@class="left"]/div[6]')

            for province_name in province_Data:
                province_Name = province_name.xpath('div/h2/text()').extract()[0]

                weather_Detail_Data = response.xpath('//div[@class="left"]')
                for weather_detail in weather_Detail_Data:
                    # Create a fresh item for every record and fill in its fields
                    item = PythonScrapyWeatherItem()
                    item["province_Name"] = province_Name
                    item["city_Name"] = weather_detail.xpath('dl/dd[@class ="name"]/h2/text()').extract()[0]
                    item["date"] = weather_detail.xpath('dl/dd[@class="week"]/text()').extract()[0]
                    item["temperature"] = weather_detail.xpath('dl/dd[@class="weather"]/span/text()').extract()[0]
                    item["weather_condition"] = weather_detail.xpath('dl/dd[@class="weather"]/span/b/text()').extract()[0]
                    item["air_quality"] = weather_detail.xpath('dl/dd[@class="kongqi"]/h5/text()').extract()[0]
                    # yield rather than return, so every matched record is emitted, not only the first
                    yield item
        except IndexError:
            # extract()[0] raises IndexError when an XPath matches nothing
            self.logger.warning("Missing field on %s (status %s)", response.url, response.status)
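
parse() builds each city URL by prepending the site root to the extracted href. That works as long as every href is a root-relative path such as /beijing/. As a hedged alternative, the standard library's urljoin resolves relative, root-relative, and already-absolute links alike; Scrapy's response.urljoin(href) does the same against the page URL. The hrefs below are made up for illustration and are not taken from the real page.

```python
from urllib.parse import urljoin

# The start URL used by the spider
BASE_URL = "https://www.tianqi.com/chinacity.html"


def absolute_urls(base_url, hrefs):
    """Resolve each href against base_url, the way a browser would."""
    return [urljoin(base_url, href) for href in hrefs]


# Hypothetical hrefs, for illustration only
sample_hrefs = ["/beijing/", "shanghai/", "https://www.tianqi.com/taiyuan/"]
city_urls = absolute_urls(BASE_URL, sample_hrefs)
print(city_urls)
```

Plain concatenation would turn the second and third sample hrefs into invalid URLs, which is the case this helper is meant to cover.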

(2) items.py: defines the fields of the item object (the format is fixed)

import scrapy

class PythonScrapyWeatherItem(scrapy.Item):

    # define the fields for your item here like:

    province_Name=scrapy.Field()
    city_Name = scrapy.Field()
    date = scrapy.Field()
    temperature = scrapy.Field()
    weather_condition = scrapy.Field()
    air_quality = scrapy.Field()
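
The pipeline is described earlier as the place where item data is cleaned before it is saved. As a minimal sketch of such a step, the hypothetical helper clean_item below strips surrounding whitespace from every string field. It works on a plain dict with the same field names (a scrapy.Item can be converted with dict(item)); the sample values are made up.

```python
def clean_item(item):
    """Strip surrounding whitespace from every string field; leave other values alone."""
    return {
        key: value.strip() if isinstance(value, str) else value
        for key, value in item.items()
    }


# Made-up raw values, shaped like the fields defined in items.py
raw = {"city_Name": "  Taiyuan\n", "temperature": " 12 ", "air_quality": None}
cleaned = clean_item(raw)
print(cleaned)
```

Returning a new dict rather than editing the item in place keeps the raw values available for logging if a later step fails.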

(3) settings.py

If requests return no data, adjust the following settings in this file.

Enable the downloader middleware

DOWNLOADER_MIDDLEWARES = {
    'Python_Scrapy_Weather.middlewares.PythonScrapyWeatherDownloaderMiddleware': 543,
}

Set the request headers

DEFAULT_REQUEST_HEADERS = {
    'User-Agent': "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36",
}

Enable cookies

COOKIES_ENABLED = True

Configure a proxy pool (IP_PROXY is a custom setting name, not one built into Scrapy)

IP_PROXY
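
Two settings that a project like this commonly needs are not listed above. A pipeline only runs if it is registered in ITEM_PIPELINES, and the download delay mentioned earlier is set with DOWNLOAD_DELAY. The fragment below is a sketch under assumptions: the pipeline path is inferred from the module and class names used in this post, and the priority 300 and the one-second delay are illustrative values.

```python
# settings.py (fragment)

# Register the pipeline so Scrapy passes every scraped item to it.
# The number (0-1000) sets the order when several pipelines are enabled.
ITEM_PIPELINES = {
    "Python_Scrapy_Weather.pipelines.PythonScrapyWeatherPipeline": 300,  # assumed path
}

# Wait about one second between requests to the same site (illustrative value)
DOWNLOAD_DELAY = 1
```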

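(4) pipelines.py: saves the data we need to a MySQL database

Prerequisite: MySQL is already installed. Using the commands below, I created a database named dataSave_Sql and a table named Weathers beforehand.

Related database commands: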

mysql -u root -p

Enter the password: 123456

show databases;

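create database dataSave_Sql;  (creates a database named dataSave_Sql)

use datasave_sql;  (database names are case-sensitive on Linux by default, so there use the exact case given at creation)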

Create a table named Weathers:

create table Weathers(
    id int primary key auto_increment,
    province_Name varchar(100),
    city_Name varchar(100),
    date varchar(100),
    temperature varchar(100),
    weather_condition varchar(100),
    air_quality varchar(100));

import pymysql
class PythonScrapyWeatherPipeline(object):
    # Connect to the database when the pipeline is created
    def __init__(self):
        self.db_connect = pymysql.connect(
            host='localhost',
            user='root',
            password='root',
            db='datasave_sql',
            charset="utf8",
            port=3306,
            use_unicode=False)
        self.cursor=self.db_connect.cursor()
        # Simple query to confirm the connection works
        self.cursor.execute("SELECT VERSION()")

    def process_item(self, item, spider):
        # Insert the item into the Weathers table; the %s placeholders let pymysql escape the values
        sql = ('INSERT INTO Weathers(province_Name,city_Name,date,temperature,weather_condition,air_quality) '
               'VALUES(%s,%s,%s,%s,%s,%s)')
        try:
            self.cursor.execute(sql, (item["province_Name"], item["city_Name"], item["date"],
                                      item["temperature"], item["weather_condition"], item["air_quality"]))
            self.db_connect.commit()
            print(self.cursor.rowcount, "row(s) inserted.")

        except Exception as e:
            print("Error here >>>>>>>>>>>>>", e, "<<<<<<<<<<<<< error here")
            self.db_connect.rollback()
        return item

    # Close the database connection when the spider finishes
    def close_spider(self, spider):
        self.cursor.close()
        self.db_connect.close()
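
Passing values as query parameters, rather than formatting them into the SQL string, keeps a value that contains a quote from breaking the statement. The sketch below shows the idea with Python's built-in sqlite3 module as a stand-in for MySQL, so it runs without a server. sqlite3 uses ? placeholders where pymysql uses %s. The helper insert_weather and the sample row are hypothetical.

```python
import sqlite3

# In-memory database with the same columns as the Weathers table
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Weathers ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "province_Name TEXT, city_Name TEXT, date TEXT, "
    "temperature TEXT, weather_condition TEXT, air_quality TEXT)"
)

FIELDS = ("province_Name", "city_Name", "date",
          "temperature", "weather_condition", "air_quality")


def insert_weather(connection, item):
    """Insert one item using placeholders; return the number of rows written."""
    sql = (
        "INSERT INTO Weathers (" + ", ".join(FIELDS) + ") "
        "VALUES (?, ?, ?, ?, ?, ?)"
    )
    values = tuple(item[field] for field in FIELDS)
    with connection:  # commits on success, rolls back if an exception is raised
        cursor = connection.execute(sql, values)
    return cursor.rowcount


# Made-up sample; the double quotes would break a string-formatted statement
sample = {
    "province_Name": "Shanxi", "city_Name": "Taiyuan", "date": "2024-03-22",
    "temperature": "12", "weather_condition": 'Sunny "clear"', "air_quality": "Good",
}
inserted = insert_weather(conn, sample)
rows = conn.execute("SELECT city_Name, weather_condition FROM Weathers").fetchall()
print(inserted, rows)
```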

(5) MYSQL_Text.py

Queries the saved data from the database by condition

import pymysql

def select_Data():

    db_connect = pymysql.connect(
        host='localhost',
        user='root',
        password='root',
        db='datasave_sql',
        charset="utf8",
        port=3306,
        use_unicode=False)
    cursor = db_connect.cursor()
    cursor.execute("SELECT VERSION()")
    # Filter on province_Name (id is the integer primary key); 山西 is Shanxi province
    sql_province = 'SELECT * FROM Weathers WHERE province_Name = %s'
    cursor.execute(sql_province, ("山西",))
    result = cursor.fetchall()  # fetch all matching rows
    print(result)
    cursor.close()
    db_connect.close()

if __name__ == "__main__":
    select_Data()
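
With use_unicode=False, pymysql is expected to return text columns as bytes rather than str, so the printed rows can be hard to read. Assuming that is the case, a small hypothetical helper such as decode_row can turn each row back into text. The sample row below is made up and only mimics the first three columns of the Weathers table.

```python
def decode_row(row, encoding="utf-8"):
    """Return a copy of row with every bytes value decoded to str."""
    return tuple(
        value.decode(encoding) if isinstance(value, bytes) else value
        for value in row
    )


# Made-up row shaped like (id, province_Name, city_Name)
raw_rows = [(1, "山西".encode("utf-8"), "太原".encode("utf-8"))]
decoded = [decode_row(row) for row in raw_rows]
print(decoded)
```

Connecting without use_unicode=False should make this step unnecessary, since the driver would then decode the text itself.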

