Python Web Scraping: Lesson 1

Date: 2021-04-15 20:05:04

Example 1: Crawling the Baidu Baike entry for 网络爬虫 (web crawler)

Libraries used: Python's built-in urllib, re, and random, plus BeautifulSoup, which must be installed separately (as the beautifulsoup4 package; the lxml parser used below also needs to be installed).

Step 1: Use urllib to percent-encode the Chinese portion of the page URL

import urllib.parse   # import the parse submodule explicitly; a bare "import urllib" is not enough

s = "网络爬虫"
x = urllib.parse.quote(s)   # percent-encode the Chinese characters as UTF-8 bytes
print(x)
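The encoding is reversible: urllib.parse.unquote turns the percent-escapes back into the original characters. A quick round-trip check (this snippet is mine, not part of the original lesson):

import urllib.parse

# quote() encodes each character as UTF-8 bytes, one %XX escape per byte;
# unquote() reverses the process exactly
encoded = urllib.parse.quote("网络爬虫")
print(encoded)                         # %E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB
print(urllib.parse.unquote(encoded))   # 网络爬虫

Note that the encoded string is exactly the middle segment of the entry path used in the next step.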

Step 2: Set up the starting URL

from bs4 import BeautifulSoup
from urllib.request import urlopen
import re
import random

base_url = "https://baike.baidu.com"
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]   # history stack of visited page paths
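The his list acts as a browsing-history stack: the last element is the current page path, appending follows a link, and popping backtracks when a page offers no usable links. A toy illustration with placeholder paths (/item/A and /item/B are made up):

history = ["/item/A"]        # start page
history.append("/item/B")    # follow a link: /item/B is now the current page
history.pop()                # dead end: backtrack
print(history[-1])           # /item/A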

Step 3: Parse with lxml and use find to pull out the heading

url = base_url + his[-1]

html = urlopen(url).read().decode('utf-8')
soup = BeautifulSoup(html, features="lxml")
print(soup.find('h1').get_text(), 'url:', his[-1])   # the <h1> element holds the entry title
The assembled URL is https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711, and the print statement outputs:

网络爬虫 url: /item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711
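If the plain urlopen call is ever blocked or served an error page, a common workaround is to send a browser-like User-Agent header via urllib.request.Request; a sketch (the header value is just an example string):

from urllib.request import Request, urlopen

url = "https://baike.baidu.com/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})   # pretend to be a browser
html = urlopen(req).read().decode('utf-8')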

Step 4: Filter the links with a regular expression

# collect links that open in a new tab and whose href is entirely percent-encoded
sub_urls = soup.find_all("a",
                         {"target": "_blank",
                          "href": re.compile("/item/(%.{2})+$")})
# print(sub_urls)
if len(sub_urls) != 0:
    # follow one of the matching links at random
    his.append(random.sample(sub_urls, 1)[0]['href'])
else:
    # dead end: backtrack to the previous page
    his.pop()
print(his)
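The pattern /item/(%.{2})+$ accepts an href only when everything after /item/ is a run of %XX escapes reaching the end of the string, so plain-ASCII paths and paths with a trailing numeric id are rejected. A quick demonstration (the sample hrefs are made up):

import re

pattern = re.compile("/item/(%.{2})+$")
print(bool(pattern.search("/item/%E8%9C%98%E8%9B%9B")))         # True: all %XX escapes
print(bool(pattern.search("/item/%E7%88%AC%E8%99%AB/5162711"))) # False: trailing numeric id
print(bool(pattern.search("/item/abc")))                        # False: plain ASCII path

Note the second case: the starting page's own path would not match, so the random walk never selects it as a new link.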
Putting it together: the same steps looped 20 times
his = ["/item/%E7%BD%91%E7%BB%9C%E7%88%AC%E8%99%AB/5162711"]
for i in range(20):
    url = base_url + his[-1]
    html = urlopen(url).read().decode('utf-8')
    soup = BeautifulSoup(html, features='lxml')
    print(i, soup.find('h1').get_text(), 'url:', his[-1])
    # find valid urls
    sub_urls = soup.find_all("a", {"target": "_blank", "href": re.compile("/item/(%.{2})+$")})

    if len(sub_urls) != 0:   # fixed: the original tested != 1, which backtracks even when a link exists and crashes when none do
        his.append(random.sample(sub_urls, 1)[0]['href'])
    else:
        his.pop()
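Two easy refinements for this loop, sketched under the assumption that revisiting pages is undesirable: remember visited hrefs in a set, and pause between requests to stay polite. pick_unvisited and the 1-second delay are my additions, not part of the original code:

import random
import time

visited = set()

def pick_unvisited(hrefs):
    # hrefs: the candidate '/item/...' paths extracted from sub_urls
    fresh = [h for h in hrefs if h not in visited]
    if not fresh:
        return None            # dead end: caller should backtrack with his.pop()
    choice = random.choice(fresh)
    visited.add(choice)
    time.sleep(1)              # arbitrary 1 s pause between requests
    return choice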
        

Example 2: Using requests to scrape images from the National Geographic China site

1. Inspect the page structure and locate the class that holds the image lists:

from bs4 import BeautifulSoup
import requests

url = "http://www.ngchina.com.cn/animals/"
html = requests.get(url).text
soup = BeautifulSoup(html, 'lxml')
img_ul = soup.find_all('ul', {"class": "img_list"})   # every <ul class="img_list"> on the page
print(img_ul)
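An equivalent one-line query is a CSS selector: soup.select('ul.img_list img') returns the <img> tags inside every <ul class="img_list"> directly, collapsing the two find_all calls into one pass. Reusing the soup object from the snippet above:

# CSS-selector alternative: grab the <img> tags in one query
imgs = soup.select('ul.img_list img')
print(len(imgs))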

2. Create a directory to save the images into

import os
os.makedirs("./img", exist_ok=True)   # exist_ok=True: no error if the directory already exists

3. Loop over the image links inside that class, fetch each image with requests, and save the downloads:

for ul in img_ul:
    imgs = ul.find_all("img")
    print(imgs)
    for img in imgs:
        url = img["src"]
        r = requests.get(url, stream=True)   # stream=True: download in chunks rather than all at once
        image_name = url.split("/")[-1]      # use the last path segment as the file name
        with open('./img/%s' % image_name, "wb") as f:
            for chunk in r.iter_content(chunk_size=128):
                f.write(chunk)
        print("Save %s" % image_name)
Summary: scraping is not easy; keep at it.