Example 1: Crawling the Baidu Baike entry "网络爬虫" (web crawler)
The libraries used are Python's built-in urllib, re, and random, plus BeautifulSoup, which is installed separately.
Step 1: Use urllib to percent-encode the Chinese part of the page URL.
import urllib.parse  # in Python 3, quote() lives in the urllib.parse submodule

s = "网络爬虫"
x = urllib.parse.quote(s)  # percent-encode the Chinese keyword
print(x)  # %E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB
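The reverse direction works too: urllib.parse.unquote turns a percent-encoded string back into Chinese, which makes a handy sanity check. A minimal sketch, not part of the original post:

import urllib.parse

encoded = urllib.parse.quote("网络爬虫")
print(urllib.parse.unquote(encoded))  # 网络爬虫 -- round-trips back to the original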
Step 2: Set up the base URL and the starting entry.
from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random

base_url = "https://baike.baidu.com"
# his records the browsing history; the last element is the page to visit next
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]
Step 3: Parse with lxml and use find to pick out the matching element.
url = base_url + his[-1]
html = urlopen(url).read().decode('utf-8')
soup = BeautifulSoup(html, features="lxml")
print(soup.find('h1').get_text(), 'url:', his[-1])  # page title plus its relative URL
Running this fetches https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711 and prints:

网络爬虫 url: /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711
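If Baidu ever changes the page layout and no h1 is present, soup.find('h1') returns None and get_text() raises AttributeError. A defensive variant, sketched here as an addition to the original:

h1 = soup.find('h1')
if h1 is not None:
    print(h1.get_text(), 'url:', his[-1])
else:
    print('no <h1> found on', url)  # layout changed or the page failed to load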
Step 4: Filter the links with a regular-expression match.
sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})
# print(sub_urls)
if len(sub_urls) != 0:
    his.append(random.sample(sub_urls, 1)[0]['href'])  # follow one linked entry at random
else:
    his.pop()  # dead end: step back to the previous page
print(his)
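The pattern "/item/(%.{2})+$" only accepts hrefs whose path after /item/ consists entirely of percent-encoded bytes, so links carrying a trailing numeric id (like the starting page's /5162711) are rejected. A quick illustration, with example strings of my own:

import re

pat = re.compile("/item/(%.{2})+$")
print(bool(pat.search("/item/%E7%BD%91%E7%BB%9C")))          # True: pure percent-encoding
print(bool(pat.search("/item/%E7%BD%91%E7%BB%9C/5162711")))  # False: trailing numeric id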
Putting it together: loop the steps above 20 times.

his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]

for i in range(20):
    url = base_url + his[-1]
    html = urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(html, features='lxml')
    print(i, soup.find('h1').get_text(), 'url:', his[-1])
    # find valid urls
    sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})
    if len(sub_urls) != 0:
        his.append(random.sample(sub_urls, 1)[0]['href'])  # jump to a random linked entry
    else:
        his.pop()  # no valid links: back up one step
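In practice baike.baidu.com sometimes rejects requests that carry no browser-like User-Agent. If the loop above starts failing, one hedged workaround (the header value is my own choice, not from the original) is to build the request explicitly:

from urllib.request import Request, urlopen

def fetch(path):
    # Assumption: a generic browser-like User-Agent is enough; adjust if the site still blocks it
    req = Request(base_url + path, headers={"User-Agent": "Mozilla/5.0"})
    return urlopen(req).read().decode('utf-8')

html = fetch(his[-1])  # drop-in replacement for the urlopen(...) line in the loop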
Example 2: Using requests to scrape images from the National Geographic China site
1. Inspect the page and find the class that wraps the image lists:
from bs4 import BeautifulSoup
import requests

url = "http://www.ngchina.com.cn/animals/"
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
img_ul = soup.find_all('ul', {"class": "img_list"})  # every <ul class="img_list"> on the page
print(img_ul)
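Before downloading anything, it helps to confirm the selector still matches. A small sketch of my own; the class name may change if the site is redesigned:

for ul in img_ul:
    print(len(ul.find_all("img")), "images in this <ul>")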
2. Create a directory to save the images:
import os

os.makedirs("./img", exist_ok=True)  # exist_ok=True: no error if the directory is already there
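An equivalent one-liner with pathlib, purely as a stylistic alternative to the original:

from pathlib import Path

Path("./img").mkdir(exist_ok=True)  # same effect as os.makedirs above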
3. Loop over the image links under that class, call requests again, and download and save each image:
for ul in img_ul:
    imgs = ul.find_all("img")
    print(imgs)
    for img in imgs:
        url = img["src"]
        r = requests.get(url, stream=True)  # stream=True: download in chunks instead of all at once
        image_name = url.split("/")[-1]
        with open('./img/%s' % image_name, "wb") as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)
        print("Save %s" % image_name)

Summary: web scraping is not easy; keep at it.
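One practical extension: img src attributes are not always absolute http URLs, so a hedged helper (name and logic are my own sketch, assuming the same page structure) can resolve relative paths and skip anything that is not downloadable:

import requests
from urllib.parse import urljoin

def download_image(page_url, src, folder="./img"):
    # Hypothetical helper, not from the original post
    full = urljoin(page_url, src)  # resolve relative src values against the page URL
    if not full.startswith("http"):
        return  # skip data: URIs and empty lazy-load placeholders
    r = requests.get(full, stream=True)
    if r.status_code != 200:
        return  # skip broken links instead of saving an error page
    name = full.split("/")[-1]
    with open('%s/%s' % (folder, name), "wb") as f:
        for chunk in r.iter_content(chunk_size=128):
            f.write(chunk)
    print("Save %s" % name)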