This article walks through a working example of a Python 3 crawler that downloads high-resolution League of Legends desktop wallpapers. It is shared here for reference; the details follow.
Scraping high-resolution League of Legends desktop wallpapers with Scrapy
Source code: https://github.com/snowyme/loldesk
Python 3 and Scrapy must be installed before starting; installation is not covered here, so look it up if you need to.
First, create the project:
```
scrapy startproject loldesk
```
This generates the project's directory structure.
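For orientation, the generated layout looks roughly like the sketch below; the exact files vary slightly between Scrapy versions (middlewares.py, for instance, only appears in newer releases), so treat this as a sketch rather than a listing from the original post:

```
loldesk/
├── scrapy.cfg          # deploy configuration
└── loldesk/
    ├── __init__.py
    ├── items.py        # item definitions (edited below)
    ├── middlewares.py
    ├── pipelines.py    # item pipelines (edited below)
    ├── settings.py     # project settings (edited below)
    └── spiders/
        └── __init__.py # spider modules go here
```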
The fields to scrape are defined first. In items.py, this project needs the image name and the image URLs, plus a field for the saved file paths that the pipeline below fills in:
```python
import scrapy


class LoldeskItem(scrapy.Item):
    name = scrapy.Field()         # title of the wallpaper album
    imgurl = scrapy.Field()       # list of image URLs in the album
    image_paths = scrapy.Field()  # filled in by the images pipeline below
```
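Items behave like dicts, so a quick sanity check is easy, for example in a `scrapy shell` session. The values here are made up purely for illustration:

```python
from loldesk.items import LoldeskItem

# illustrative values only, not taken from the site
item = LoldeskItem(name='garen', imgurl=['http://www.win4000.com/example.jpg'])
print(item['name'])  # garen
print(dict(item))    # {'name': 'garen', 'imgurl': [...]}
```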
Next, create the spider file loldesk.py in the spiders directory and write the main crawling code:
```python
import scrapy
from loldesk.items import LoldeskItem


class LoldeskSpider(scrapy.Spider):
    name = "loldesk"
    allowed_domains = ["www.win4000.com"]
    # entry page listing the wallpaper albums
    start_urls = [
        'http://www.win4000.com/zt/lol.html'
    ]

    def parse(self, response):
        # each <li> in the left bar links to one wallpaper album
        for img in response.css(".left_bar ul li"):
            imgurl = img.css("a::attr(href)").extract_first()
            yield scrapy.Request(str(imgurl), callback=self.content)
        # follow the next page of the album list, if there is one
        next_url = response.css(".next::attr(href)").extract_first()
        if next_url is not None:
            yield response.follow(next_url, callback=self.parse)

    def content(self, response):
        item = LoldeskItem()
        item['name'] = response.css(".pic-large::attr(title)").extract_first()
        item['imgurl'] = response.css(".pic-large::attr(src)").extract()
        yield item
        # compare the current page number against the album's total page count
        next_url = response.css(".pic-next-img a::attr(href)").extract_first()
        allnum = response.css(".ptitle em::text").extract_first()
        if next_url is not None:
            thisnum = next_url[-6:-5]  # the single digit before ".html"
            if int(allnum) > int(thisnum):
                # next picture in the album
                yield response.follow(next_url, callback=self.content)
```
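One caveat: the `next_url[-6:-5]` slice only works while the page number is a single digit. A slightly more robust variant, assuming detail-page URLs end in `_<page>.html` (an assumption about win4000's URL scheme, not something the original post verifies), could look like this:

```python
import re


def page_number(url):
    # assumes URLs like http://www.win4000.com/wallpaper_detail_154520_2.html;
    # falls back to 1 if no trailing "_<digits>.html" is found
    m = re.search(r'_(\d+)\.html$', url)
    return int(m.group(1)) if m else 1
```

With that helper, the check becomes `if int(allnum) > page_number(next_url):` and keeps working past page 9.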
With the image URLs and names collected, the next step is to download the images through Scrapy's ImagesPipeline (which requires Pillow) and save them locally, in pipelines.py:
```python
import re

from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request


class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # schedule one download per image URL, carrying the album name along
        for image_url in item['imgurl']:
            yield Request(image_url, meta={'item': item['name']})

    def file_path(self, request, response=None, info=None):
        name = request.meta['item']
        # strip characters that are illegal or unwanted in directory names
        name = re.sub(r'[?\\*|“<>:/()0123456789]', '', name)
        image_guid = request.url.split('/')[-1]
        filename = u'full/{0}/{1}'.format(name, image_guid)
        return filename

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Item contains no images')
        item['image_paths'] = image_paths
        return item
```
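To see what `file_path` produces, here is a quick standalone check of the sanitising step; the album title and image URL are hypothetical, chosen only to exercise the regex:

```python
import re

name = "英雄联盟 盖伦 高清壁纸(1/9)"  # hypothetical album title
name = re.sub(r'[?\\*|“<>:/()0123456789]', '', name)
url = 'http://pic1.win4000.com/wallpaper/example_a.jpg'  # hypothetical image URL
print(u'full/{0}/{1}'.format(name, url.split('/')[-1]))
# -> full/英雄联盟 盖伦 高清壁纸/example_a.jpg
```

The returned path is always relative; Scrapy joins it onto the IMAGES_STORE directory configured next.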
Finally, set the storage directory and enable the pipeline in settings.py:
```python
# where downloaded images are stored
IMAGES_STORE = 'F:/python/loldesk'
# enable the custom images pipeline
ITEM_PIPELINES = {
    'loldesk.pipelines.MyImagesPipeline': 300,
}
```
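Two more settings can matter, depending on your Scrapy version: recent releases set `ROBOTSTXT_OBEY = True` by default, which can stop a crawl before it starts, and some sites reject Scrapy's default user agent. Whether win4000.com needs either tweak is an assumption; a hedged sketch:

```python
# only if the crawl is blocked; check the site's terms before disabling this
ROBOTSTXT_OBEY = False
# a browser-like user agent, in case the default one is rejected (assumption)
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'
```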
Run the crawler from the project root:
```
scrapy crawl loldesk
```
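If you also want the scraped items (names, URLs, saved paths) written to disk alongside the images, Scrapy's standard `-o` flag can export them:

```
scrapy crawl loldesk -o items.json
```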
Done! A total of 128 folders were scraped.
I hope this article is helpful to anyone working on Python programming.
Original article: https://blog.csdn.net/ziwoods/article/details/84321188