This chapter covers scraping proxy IP addresses, as groundwork for more advanced crawling later.
1. Preparation
Start by analyzing which fields we want and how the page is structured. The fields to scrape are the IP address, port, server location, speed, survival time, and so on.
Take a look at the structure of the page.
As you can see, the content we want sits inside a table, so we can scrape it row by row and finally store it in a database.
2. Create the project
Create a project in the terminal:
scrapy startproject collectips
Then generate the spider file:
cd collectips
scrapy genspider xici www.xicidaili.com
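After these two commands the project directory should look roughly like this (depending on your Scrapy version there may also be a middlewares.py):

collectips/
    scrapy.cfg
    collectips/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            xici.py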
3. Define the fields (items)
Define the fields to scrape. We scrape seven fields in total; the code is below:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class CollectipsItem(scrapy.Item):
    ip = scrapy.Field()             # proxy IP address
    port = scrapy.Field()           # port
    city = scrapy.Field()           # server location
    High = scrapy.Field()           # anonymity level
    types = scrapy.Field()          # proxy type
    speed = scrapy.Field()          # speed
    Survival_time = scrapy.Field()  # survival time
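An Item behaves like a dict that only accepts the fields declared above, which catches typos in field names early. A quick interactive sketch (not part of the project code):

>>> from collectips.items import CollectipsItem
>>> item = CollectipsItem(ip='127.0.0.1', port='8080')
>>> item['ip']
'127.0.0.1'
>>> item['country'] = 'CN'   # raises KeyError: 'CollectipsItem does not support field: country'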
4. The main spider code
Now let's write the main spider code:
# -*- coding: utf-8 -*-
import scrapy

from collectips.items import CollectipsItem


class XiciSpider(scrapy.Spider):
    name = 'xici'
    allowed_domains = ['www.xicidaili.com']
    start_urls = ['http://www.xicidaili.com/']

    def start_requests(self):
        # build one request per listing page and hand them all to the scheduler
        reqs = []
        for i in range(1, 206):
            req = scrapy.Request("http://www.xicidaili.com/nn/%s" % i)
            reqs.append(req)
        return reqs

    # every response is passed here for field extraction
    def parse(self, response):
        ip_list = response.xpath("//table[@id='ip_list']")
        trs = ip_list.xpath("tr")
        items = []
        for ip in trs[1:]:  # skip the header row
            pre_item = CollectipsItem()
            pre_item["ip"] = ip.xpath("td[2]/text()")[0].extract()
            pre_item["port"] = ip.xpath("td[3]/text()")[0].extract()
            pre_item["city"] = ip.xpath("string(td[4])")[0].extract().strip()
            pre_item["High"] = ip.xpath("td[5]/text()")[0].extract()
            pre_item["types"] = ip.xpath("td[6]/text()")[0].extract()
            pre_item["speed"] = ip.xpath("td[7]/div/@title").re(r'\d{0,2}\.\d{0,}')[0]
            pre_item["Survival_time"] = ip.xpath("td[9]/text()")[0].extract()
            items.append(pre_item)
        return items  # returned items are passed on to the pipeline
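Before launching the full crawl it is worth checking the XPaths against a single listing page in scrapy shell. A rough check, assuming the site is reachable and the table id has not changed:

scrapy shell "http://www.xicidaili.com/nn/1"
>>> rows = response.xpath("//table[@id='ip_list']/tr")
>>> len(rows)                                      # header row plus one row per proxy
>>> rows[1].xpath("td[2]/text()").extract_first()  # should print an IP address
>>> rows[1].xpath("td[3]/text()").extract_first()  # should print a port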
5. Storing with a pipeline
The scraped fields are passed on to the pipeline and stored in the database. The code is as follows:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class CollectipsPipeline(object):

    def process_item(self, item, spider):
        DBKWARGS = spider.settings.get('DBKWARGS')
        con = pymysql.connect(**DBKWARGS)
        cur = con.cursor()
        sql = 'insert into xici values (%s,%s,%s,%s,%s,%s,%s)'
        lis = (item['ip'], item['port'], item['city'], item['High'],
               item['types'], item['speed'], item['Survival_time'])
        try:
            cur.execute(sql, lis)
        except Exception as e:
            print("insert err:", e)
            con.rollback()
        else:
            con.commit()
        cur.close()
        con.close()
        return item
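Note that the pipeline assumes the test database already contains a table named xici with seven columns in the same order as the INSERT. The column names and types are not fixed by the code above; the one-off script below is just one possible way to create the table, with column names of my own choosing kept close to the item fields:

import pymysql

# one-off helper: create the target table used by CollectipsPipeline
# column names/types are assumptions; the INSERT only needs seven columns in this order
con = pymysql.connect(db='test', user='root', passwd='12345', host='localhost', charset='utf8')
cur = con.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS xici (
        ip            VARCHAR(32),
        port          VARCHAR(16),
        city          VARCHAR(128),
        high          VARCHAR(32),
        types         VARCHAR(32),
        speed         VARCHAR(32),
        survival_time VARCHAR(32)
    ) DEFAULT CHARSET=utf8
""")
con.commit()
cur.close()
con.close()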
6. Configure the settings file
Configure the settings file, filling in the headers and cookie parameters so that requests masquerade as a normal browser.
# -*- coding: utf-8 -*-

BOT_NAME = 'collectips'

SPIDER_MODULES = ['collectips.spiders']
NEWSPIDER_MODULE = 'collectips.spiders'

DBKWARGS = {'db': 'test', "user": "root", "passwd": "12345", "host": "localhost",
            "use_unicode": True, "charset": "utf8"}

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'

ROBOTSTXT_OBEY = True

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
    'Connection': 'keep-alive',
    'Cookie': '_free_proxy_session=BAh7B0kiD3Nlc3Npb25faWQGOgZFVEkiJTdiMTgzYWM4YWMxMWZjYTU3MDJmY2FkMmQ1N2U5ZmQ1BjsAVEkiEF9jc3JmX3Rva2VuBjsARkkiMWI1TlVCcEpxOTZnUW5vN1pEM2NJWFQvblFQeDN4RkN6UVhodTRPN3FDUnM9BjsARg%3D%3D--4ab7b3d3c06537aad8e1c2275589d7fe091b4a0a; __guid=264997385.1762204003176204800.1521079489470.1633; monitor_count=3; Hm_lvt_0cf76c77469e965d2957f0553e6ecf59=1521079495; Hm_lpvt_0cf76c77469e965d2957f0553e6ecf59=1521080766',
    'Cache-Control': 'max-age=0',
    'DNT': '1',
    'Host': 'www.xicidaili.com',
    'Referer': 'http://www.xicidaili.com/',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}

ITEM_PIPELINES = {
    'collectips.pipelines.CollectipsPipeline': 300,
}
7. Run the crawler
scrapy crawl xici
I ended up with about 10,000 records here; give it a try if you are interested.
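To double-check what actually landed in MySQL, a quick count with the same connection settings as DBKWARGS will do (again just a sketch):

import pymysql

# count the proxies stored by the pipeline
con = pymysql.connect(db='test', user='root', passwd='12345', host='localhost', charset='utf8')
cur = con.cursor()
cur.execute("SELECT COUNT(*) FROM xici")
print(cur.fetchone()[0])
cur.close()
con.close()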