[Python Web Scraping Series] Scraping Images with requests

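Common Python 3 tools for web scraping include the third-party library requests and the standard-library module urllib.request. This article shows how to use requests to download the images on a web page. The method only works for static pages; pages whose content is loaded dynamically by JavaScript are not covered.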

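Prerequisites:

A basic understanding of the requests module, including methods such as get and post and attributes such as status_code and history.
Familiarity with BeautifulSoup for locating and filtering text; commonly used methods include find_all and select.
Basic file operations, such as checking whether a folder exists and creating a new folder.
Saving an image by writing the bytes of a requests response to a file with write.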
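
The file-handling prerequisites can be tried out without touching the network. The following is a minimal sketch, assuming a hypothetical helper named save_bytes and a temporary directory in place of a real download folder; the bytes written are placeholder data, not a real image.

```python
import os
import tempfile


def save_bytes(folder, name, data):
    """Create folder if needed and write data to folder/name in binary mode."""
    # exist_ok=True makes a separate os.path.exists check unnecessary
    os.makedirs(folder, exist_ok=True)
    path = os.path.join(folder, name)
    with open(path, 'wb') as f:  # 'wb' = write bytes; the with block closes the file
        f.write(data)
    return path


# Usage with a throwaway directory and placeholder bytes (not a real image)
base = tempfile.mkdtemp()
folder = os.path.join(base, 'photo')
existed_before = os.path.exists(folder)
saved = save_bytes(folder, '1.png', b'placeholder')
print(existed_before, os.path.exists(folder), saved)
```

In a real download, data would be the content attribute of the response returned by requests.get.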

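Getting started:

This example uses Biquge (http://www.biquzi.com/) and downloads the novel covers shown on the page.

The basic workflow for scraping images is:

Send a GET request to the page with requests and obtain the response --> filter the response with BeautifulSoup to extract the image links --> create a folder to store the images --> request each image and write its bytes to a file in that folder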

Press F12 to open the browser's developer tools and inspect the page structure.


The element that holds an image looks like this:

<img src="http://www.biquzi.com/files/article/image/0/703/703s.jpg" alt="斗战狂潮" width="120" height="150">

Only the image link (the value of the src attribute) matters here; everything else is filtered out.
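
To illustrate the filtering step on its own, here is a small sketch that parses the tag quoted above from a string instead of a live page. The variable names are illustrative; find_all and select are the two BeautifulSoup methods mentioned in the prerequisites.

```python
from bs4 import BeautifulSoup

# The <img> tag quoted above, as a string rather than a downloaded page
html = ('<img src="http://www.biquzi.com/files/article/image/0/703/703s.jpg" '
        'alt="斗战狂潮" width="120" height="150">')

soup = BeautifulSoup(html, 'html.parser')

# find_all returns every <img> tag; .get('src') returns None when the attribute is missing
links = [img.get('src') for img in soup.find_all('img') if img.get('src')]

# select takes a CSS selector; 'img[src]' only matches tags that carry a src attribute
links_css = [img['src'] for img in soup.select('img[src]')]

print(links)
```

Both approaches yield the same list of links; which one to use is mostly a matter of taste.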

The complete code follows:

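import os
import time
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

url = 'http://www.biquzi.com/'
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(url, headers=headers, timeout=10)  # send headers so the request is less likely to be blocked
response.raise_for_status()  # stop early if the page itself could not be fetched
soup = BeautifulSoup(response.content, 'html.parser')
items = soup.find_all('img')
folder_path = './photo/'
os.makedirs(folder_path, exist_ok=True)  # create the folder if it does not exist yet

for index, item in enumerate(items, start=1):
    src = item.get('src')  # image link, or None if the tag has no src attribute
    if not src:
        continue
    img_url = urljoin(url, src)  # resolve relative links against the page URL
    img = requests.get(img_url, headers=headers, timeout=10)  # request the image itself
    if img.status_code != 200:  # skip images that could not be fetched
        continue
    img_name = folder_path + str(index) + '.png'
    with open(img_name, 'wb') as file:  # write the image data as bytes; the with block closes the file
        file.write(img.content)
    print('Image %d downloaded' % index)
    time.sleep(1)  # pause between requests
print('Scraping finished')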

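After a successful run, the downloaded images are in the ./photo/ folder, named by their position on the page (1.png, 2.png, and so on).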
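
The script saves every file with a .png extension, although the cover in the example is a .jpg. One possible refinement, sketched here under assumptions, is to take the extension from the image URL and fall back to a default when the URL has none. The helper name image_filename and the fallback '.jpg' are hypothetical and not part of the original script.

```python
import os
from urllib.parse import urlparse


def image_filename(index, img_url, default_ext='.jpg'):
    """Build a numbered file name that keeps the extension found in the URL."""
    path = urlparse(img_url).path    # the path part only, without query string or fragment
    ext = os.path.splitext(path)[1]  # empty string when the path has no extension
    return str(index) + (ext.lower() or default_ext)


cover_url = 'http://www.biquzi.com/files/article/image/0/703/703s.jpg'
name = image_filename(1, cover_url)
print(name)
```

The URL is only a hint: a server can serve any format under any name, so a stricter version would inspect the Content-Type response header instead.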
