Notes on Web Scraping with Python (《Python网络数据采集》): BeautifulSoup

Date: 2022-12-14 07:35:31

I. Your First Web Scraper

All of the examples use Python 3.

A simple example:

from urllib.request import urlopen

html = urlopen("http://pythonscraping.com/pages/page1.html")
print(html.read())

In Python 2.x this functionality lived in the urllib2 library; in Python 3.x, urllib2 was renamed urllib and split into submodules: urllib.request, urllib.parse, and urllib.error.
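A small, network-free sketch of where two of those submodules fit (the URLs are the book's sample pages, reused here purely for illustration):

```python
from urllib.parse import urlparse, urljoin
from urllib.error import HTTPError, URLError  # the exception classes raised by urllib.request

# urllib.parse holds the URL utilities that Python 2 kept in the urlparse module
parts = urlparse("http://pythonscraping.com/pages/page1.html")
print(parts.netloc)  # pythonscraping.com
print(parts.path)    # /pages/page1.html

# urljoin resolves a relative link (like the image paths used later) against a base URL
print(urljoin("http://pythonscraping.com/pages/page3.html", "../img/gifts/img1.jpg"))
```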

 

II. BeautifulSoup

        1. Using BeautifulSoup

Notes: 1. Install the module with pip install beautifulsoup4.

           2. Build a reliable network connection and handle the exceptions the program may raise.

For example:

from urllib.error import HTTPError
from urllib.request import urlopen
from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError:
        return None
    try:
        bsobj = BeautifulSoup(html.read(), "html.parser")
        title = bsobj.body.h1
    except AttributeError:
        return None
    return title


title = getTitle("http://pythonscraping.com/pages/page1.html")
if title is None:
    print("title was not found")
else:
    print(title)
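The AttributeError branch above guards against the page lacking the expected tag. A minimal offline sketch (the inline HTML is made up for illustration) of why that guard is needed:

```python
from bs4 import BeautifulSoup

# A page with no h1: find() returns None, and chaining off None raises AttributeError
bsobj = BeautifulSoup("<html><body><p>no h1 here</p></body></html>", "html.parser")
print(bsobj.find("h1"))  # None
try:
    print(bsobj.body.h1.get_text())
except AttributeError:
    print("title was not found")
```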

 

        2. A scraper can extract specific content by the value of a tag's class attribute

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/warandpeace.html")

bsobj = BeautifulSoup(html, "html.parser")

# Use findAll on the bsobj object to extract every span tag whose class attribute is "red"
contentList = bsobj.findAll("span", {"class": "red"})

for content in contentList:
    print(content.get_text())
    print('\n')
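The same class-based selection can be tried without a network connection; this sketch uses a made-up inline snippet standing in for the warandpeace.html page:

```python
from bs4 import BeautifulSoup

# Inline snippet standing in for warandpeace.html (contents are illustrative)
html_doc = """
<html><body>
<span class="red">Heavens! what a virulent attack!</span>
<span class="green">the prince</span>
<span class="red">Anna Pavlovna</span>
</body></html>
"""
bsobj = BeautifulSoup(html_doc, "html.parser")
# Only the spans whose class attribute is "red" are returned
for content in bsobj.findAll("span", {"class": "red"}):
    print(content.get_text())
```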

 

        3. Navigating trees

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://pythonscraping.com/pages/page3.html")
bsobj = BeautifulSoup(html, "html.parser")


# Find child tags
for child in bsobj.find("table", {"id": "giftList"}).children:
    print(child)

# Find sibling tags
for sibling in bsobj.find("table", {"id": "giftList"}).tr.next_siblings:
    print(sibling)

for h2title in bsobj.findAll("h2"):
    print(h2title.get_text())

print(bsobj.find("img", {"src": "../img/gifts/img1.jpg"}).parent.previous_sibling.get_text())
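The children/siblings behavior can also be checked offline; the table below is made up in the shape of the book's giftList table:

```python
from bs4 import BeautifulSoup

# A made-up table mimicking the giftList structure, for offline experimentation
html_doc = """<table id="giftList">
<tr><th>Item Title</th></tr>
<tr><td>Vegetable Basket</td></tr>
<tr><td>Russian Nesting Dolls</td></tr>
</table>"""
bsobj = BeautifulSoup(html_doc, "html.parser")

# .children yields direct children only (including whitespace strings), so filter by tag name
rows = [c for c in bsobj.find("table", {"id": "giftList"}).children if c.name == "tr"]
print(len(rows))  # 3

# .next_siblings starts after the tag itself, so the header row is skipped
for sibling in bsobj.find("table", {"id": "giftList"}).tr.next_siblings:
    if sibling.name == "tr":
        print(sibling.td.get_text())
```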

 

        4. Regular expressions and BeautifulSoup

from urllib.request import urlopen
from bs4 import BeautifulSoup
import re

html = urlopen("http://pythonscraping.com/pages/page3.html")
bsobj = BeautifulSoup(html, "html.parser")
# findAll returns a list of the matching img tags (not a dictionary)
images = bsobj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])
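The same pattern matching can be verified offline; the img tags below are made up to mimic the page3.html layout:

```python
import re
from bs4 import BeautifulSoup

# Made-up img tags in the shape of page3.html's image paths
html_doc = """
<img src="../img/gifts/img1.jpg">
<img src="../img/gifts/img2.jpg">
<img src="../img/logo.jpg">
"""
bsobj = BeautifulSoup(html_doc, "html.parser")
# re.compile lets findAll match the src attribute against a pattern instead of an exact string
images = bsobj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})
for image in images:
    print(image["src"])  # the logo does not match
```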