This article walks through a file-download feature implemented with the Python crawler framework Scrapy. It is shared here for reference; the details are as follows:
When we write an ordinary script, we take a file's download URL from a website, fetch it, and write the data to disk ourselves. That code has to be written by hand every time and is rarely reusable. To avoid reinventing the wheel, Scrapy ships with a very smooth way to download files: enable the built-in FilesPipeline in settings.py, point FILES_STORE at an output directory, and yield items whose file_urls field lists the URLs to fetch. Only a few lines of code are needed, as the example below shows.
mat.py
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from weidashang.items import matplotlib

class MatSpider(scrapy.Spider):
    name = "mat"
    allowed_domains = ["matplotlib.org"]
    start_urls = ['https://matplotlib.org/examples']

    def parse(self, response):
        # Collect the link to each example's page, then follow it so the script can be downloaded.
        le = LinkExtractor(restrict_css='div.toctree-wrapper.compound li.toctree-l2')
        for link in le.extract_links(response):
            yield scrapy.Request(url=link.url, callback=self.example)

    def example(self, response):
        # On each example page, grab the "source code" link and join it with the page URL
        # to form the complete download URL.
        href = response.css('a.reference.external::attr(href)').extract_first()
        url = response.urljoin(href)
        example = matplotlib()
        example['file_urls'] = [url]
        return example
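The yielded item is all the FilesPipeline needs: it reads the URLs from the item's file_urls field, downloads each one, and records the results in the item's files field. Roughly what a processed item ends up looking like (the field layout follows the Scrapy docs; the concrete URL, path, and checksum are invented for illustration):

# Illustrative only -- the values are made up, the field layout is FilesPipeline's.
{
    'file_urls': ['https://matplotlib.org/examples/animation/animate_decay.py'],
    'files': [{
        'url': 'https://matplotlib.org/examples/animation/animate_decay.py',
        'path': 'animation/animate_decay.py',  # relative to FILES_STORE, per file_path() below
        'checksum': '0123456789abcdef...',     # checksum of the downloaded body
    }],
}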
pipelines.py
from os.path import basename, dirname, join
from urllib.parse import urlparse

from scrapy.pipelines.files import FilesPipeline

class MyFilePlipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        path = urlparse(request.url).path
        return join(basename(dirname(path)), basename(path))
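The override matters because the stock FilesPipeline names each downloaded file after a hash of its URL, under a full/ subdirectory, which makes the saved examples hard to browse. The method above keeps the last directory name plus the file name instead. A quick sanity check of that logic, using a made-up URL:

from os.path import basename, dirname, join
from urllib.parse import urlparse

# Hypothetical URL, purely to illustrate the path logic.
url = 'https://matplotlib.org/examples/animation/animate_decay.py'
path = urlparse(url).path                             # '/examples/animation/animate_decay.py'
print(join(basename(dirname(path)), basename(path)))  # animation/animate_decay.py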
settings.py
ITEM_PIPELINES = {
    'weidashang.pipelines.MyFilePlipeline': 1,
}
FILES_STORE = 'examples_src'
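ITEM_PIPELINES registers the custom pipeline (the number is just its ordering priority), and FILES_STORE is the root directory the downloaded files are written into; the path returned by file_path() is joined under it. If the hash-based default file names are acceptable, the built-in pipeline can be enabled directly instead of the subclass, for example:

# Alternative: use Scrapy's stock FilesPipeline; files then land under
# FILES_STORE/full/ with hash-based names.
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'examples_src'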
items.py
from scrapy import Item, Field

class matplotlib(Item):
    file_urls = Field()
    files = Field()
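file_urls and files are the exact field names FilesPipeline looks for by default: the spider fills file_urls, the pipeline fills files. Should different names be preferred, they can be remapped in settings.py; a sketch with hypothetical field names:

# Optional: point FilesPipeline at differently named item fields (the names here are invented).
FILES_URLS_FIELD = 'src_urls'      # field the pipeline reads download URLs from
FILES_RESULT_FIELD = 'downloaded'  # field the pipeline writes download results into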
run.py
from scrapy.cmdline import execute
execute(['scrapy', 'crawl', 'mat', '-o', 'example.json'])
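run.py is just a convenience wrapper so the crawl can be started from an IDE; it is equivalent to running scrapy crawl mat -o example.json from the project root, where -o additionally exports the scraped items (the file_urls/files records) to example.json.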
I hope this article is helpful to readers working on Python programming.
Original post: https://www.cnblogs.com/lei0213/p/8098180.html