This chapter covers scraping proxy IP addresses, as groundwork for more advanced crawling later.
1. Preparation
Start by analyzing which fields we want and how the page is structured. The fields to scrape are the IP address, port, server location, speed, survival time, and so on.
Take a look at the structure of the page.
As you can see, the content we want sits inside a table, so we can scrape it row by row and finally store it in a database.
2. Create the project
Create a project in the terminal:
scrapy startproject collectips
Then generate the spider file:
cd collectips
scrapy genspider xici www.xicidaili.com
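After these two commands the project directory should look roughly like this (depending on your Scrapy version there may also be a middlewares.py):

collectips/
    scrapy.cfg
    collectips/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
            xici.py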
3. Define the fields (items)
Define the fields to scrape. We scrape seven fields in total; the code is below:
# -*- coding: utf-8 -*-
# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html
import scrapy


class CollectipsItem(scrapy.Item):
    ip = scrapy.Field()             # proxy IP address
    port = scrapy.Field()           # port
    city = scrapy.Field()           # server location
    High = scrapy.Field()           # anonymity level
    types = scrapy.Field()          # proxy type
    speed = scrapy.Field()          # speed
    Survival_time = scrapy.Field()  # survival time
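An Item behaves like a dict that only accepts the fields declared above, which catches typos in field names early. A quick interactive sketch (not part of the project code):

>>> from collectips.items import CollectipsItem
>>> item = CollectipsItem(ip='127.0.0.1', port='8080')
>>> item['ip']
'127.0.0.1'
>>> item['country'] = 'CN'   # raises KeyError: 'CollectipsItem does not support field: country'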
4. The main spider code
Now let's write the main spider code:
# -*- coding: utf-8 -*-
import scrapy

from collectips.items import CollectipsItem


class XiciSpider(scrapy.Spider):
    name = 'xici'
    allowed_domains = ['www.xicidaili.com']
    start_urls = ['http://www.xicidaili.com/']

    def start_requests(self):
        # build one request per listing page and hand them all to the scheduler
        reqs = []
        for i in range(1, 206):
            req = scrapy.Request("http://www.xicidaili.com/nn/%s" % i)
            reqs.append(req)
        return reqs

    # every response is passed here for field extraction
    def parse(self, response):
        ip_list = response.xpath("//table[@id='ip_list']")
        trs = ip_list.xpath("tr")
        items = []
        for ip in trs[1:]:  # skip the header row
            pre_item = CollectipsItem()
            pre_item["ip"] = ip.xpath("td[2]/text()")[0].extract()
            pre_item["port"] = ip.xpath("td[3]/text()")[0].extract()
            pre_item["city"] = ip.xpath("string(td[4])")[0].extract().strip()
            pre_item["High"] = ip.xpath("td[5]/text()")[0].extract()
            pre_item["types"] = ip.xpath("td[6]/text()")[0].extract()
            pre_item["speed"] = ip.xpath("td[7]/div/@title").re(r'\d{0,2}\.\d{0,}')[0]
            pre_item["Survival_time"] = ip.xpath("td[9]/text()")[0].extract()
            items.append(pre_item)
        return items  # returned items are passed on to the pipeline
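Before launching the full crawl it is worth checking the XPaths against a single listing page in scrapy shell. A rough check, assuming the site is reachable and the table id has not changed:

scrapy shell "http://www.xicidaili.com/nn/1"
>>> rows = response.xpath("//table[@id='ip_list']/tr")
>>> len(rows)                                      # header row plus one row per proxy
>>> rows[1].xpath("td[2]/text()").extract_first()  # should print an IP address
>>> rows[1].xpath("td[3]/text()").extract_first()  # should print a port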
5. Storing with a pipeline
The scraped fields are passed on to the pipeline and stored in the database. The code is as follows:
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class CollectipsPipeline(object):

    def process_item(self, item, spider):
        DBKWARGS = spider.settings.get('DBKWARGS')
        con = pymysql.connect(**DBKWARGS)
        cur = con.cursor()
        sql = 'insert into xici values (%s,%s,%s,%s,%s,%s,%s)'
        lis = (item['ip'], item['port'], item['city'], item['High'],
               item['types'], item['speed'], item['Survival_time'])
        try:
            cur.execute(sql, lis)
        except Exception as e:
            print("insert err:", e)
            con.rollback()
        else:
            con.commit()
        cur.close()
        con.close()
        return item
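Note that the pipeline assumes the test database already contains a table named xici with seven columns in the same order as the INSERT. The column names and types are not fixed by the code above; the one-off script below is just one possible way to create the table, with column names of my own choosing kept close to the item fields:

import pymysql

# one-off helper: create the target table used by CollectipsPipeline
# column names/types are assumptions; the INSERT only needs seven columns in this order
con = pymysql.connect(db='test', user='root', passwd='12345', host='localhost', charset='utf8')
cur = con.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS xici (
        ip            VARCHAR(32),
        port          VARCHAR(16),
        city          VARCHAR(128),
        high          VARCHAR(32),
        types         VARCHAR(32),
        speed         VARCHAR(32),
        survival_time VARCHAR(32)
    ) DEFAULT CHARSET=utf8
""")
con.commit()
cur.close()
con.close()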
6. Configure the settings file
Configure the settings file, filling in the headers and cookie parameters so that requests masquerade as a normal browser.
# -*- coding: utf-8 -*-

BOT_NAME = 'collectips'

SPIDER_MODULES = ['collectips.spiders']
NEWSPIDER_MODULE = 'collectips.spiders'

DBKWARGS = {'db': 'test', "user": "root", "passwd": "12345", "host": "localhost",
            "use_unicode": True, "charset": "utf8"}

USER_AGENT = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'

ROBOTSTXT_OBEY = True

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, sdch',
    'Accept-Language': 'zh-CN,zh;q=0.8,en;q=0.6,zh-TW;q=0.4',
    'Connection': 'keep-alive',
    'Cookie': '_free_proxy_session=BAh7B0kiD3Nlc3Npb25faWQGOgZFVEkiJTdiMTgzYWM4YWMxMWZjYTU3MDJmY2FkMmQ1N2U5ZmQ1BjsAVEkiEF9jc3JmX3Rva2VuBjsARkkiMWI1TlVCcEpxOTZnUW5vN1pEM2NJWFQvblFQeDN4RkN6UVhodTRPN3FDUnM9BjsARg%3D%3D--4ab7b3d3c06537aad8e1c2275589d7fe091b4a0a; __guid=264997385.1762204003176204800.1521079489470.1633; monitor_count=3; Hm_lvt_0cf76c77469e965d2957f0553e6ecf59=1521079495; Hm_lpvt_0cf76c77469e965d2957f0553e6ecf59=1521080766',
    'Cache-Control': 'max-age=0',
    'DNT': '1',
    'Host': 'www.xicidaili.com',
    'Referer': 'http://www.xicidaili.com/',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.87 Safari/537.36'
}

ITEM_PIPELINES = {
    'collectips.pipelines.CollectipsPipeline': 300,
}
7. Run the crawler
scrapy crawl xici
I ended up with about 10,000 records here; give it a try if you are interested.
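To double-check what actually landed in MySQL, a quick count with the same connection settings as DBKWARGS will do (again just a sketch):

import pymysql

# count the proxies stored by the pipeline
con = pymysql.connect(db='test', user='root', passwd='12345', host='localhost', charset='utf8')
cur = con.cursor()
cur.execute("SELECT COUNT(*) FROM xici")
print(cur.fetchone()[0])
cur.close()
con.close()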