Python Web Crawling and Information Extraction (Part 2): The BeautifulSoup Library

Getting started with BeautifulSoup

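The main job of the BeautifulSoup library is to parse HTML and XML documents and to expose the parsed result as a tree that can be searched and traversed.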

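import requests
from bs4 import BeautifulSoup as bs

r = requests.get("https://www.python123.io/ws/demo.html")
demo = r.text  # the page's HTML source as a string
soup = bs(demo, "html.parser")  # the second argument selects the parser
print(soup)
print(soup.prettify())  # adds line breaks and indentation so the markup is easier to read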

BeautifulSoup parsers

Parser	Usage	Requirement
bs4's HTML parser	BeautifulSoup(mk, "html.parser")	Install the bs4 library
lxml's HTML parser	BeautifulSoup(mk, "lxml")	pip install lxml
lxml's XML parser	BeautifulSoup(mk, "xml")	pip install lxml
html5lib's parser	BeautifulSoup(mk, "html5lib")	pip install html5lib
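
The table can be turned into a small sketch that prefers one parser and falls back to the built-in one when the preferred parser is not installed. The helper name parse_with_fallback and the sample markup are made up for illustration; FeatureNotFound is the exception bs4 raises when it cannot find the requested parser.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# Sample markup (hypothetical); the <p> tag is deliberately left unclosed.
markup = "<p class='title'><b>Hello</b>"

# html.parser ships with Python, so this call needs nothing beyond bs4.
builtin_soup = BeautifulSoup(markup, "html.parser")
print(builtin_soup.b.string)


def parse_with_fallback(text, preferred="lxml"):
    """Try the preferred parser; use html.parser if it is not available."""
    try:
        return BeautifulSoup(text, preferred)
    except FeatureNotFound:
        return BeautifulSoup(text, "html.parser")


soup = parse_with_fallback(markup)
print(soup.b.string)
```

Different parsers may repair broken markup in different ways, so it is usually worth naming the parser explicitly rather than relying on whatever happens to be installed.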

Basic elements of BeautifulSoup

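Element	Description
Tag	A tag, the most basic unit of information
Name	The tag's name; accessed as .name
Attributes	The tag's attributes, organized as a dict; accessed as .attrs
NavigableString	The non-attribute string inside a tag; accessed as .string
Comment	A comment inside a tag (a special kind of NavigableString)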
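
To make the element types concrete, here is a minimal sketch that parses a one-line snippet and inspects each kind of object. The HTML string is invented for the example and is not the demo page.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# A made-up snippet containing a tag, attributes, text, and a comment.
html = '<p class="course" id="intro">Learn <b>Python</b><!-- draft note --></p>'
soup = BeautifulSoup(html, "html.parser")

p = soup.p
print(isinstance(p, Tag))  # soup.p is a Tag object
print(p.name)              # the tag's name
print(p.attrs)             # attributes as a dict; class is stored as a list

bold_text = p.b.string     # NavigableString inside <b>
comment = p.contents[-1]   # the last child of <p> is the comment
print(type(bold_text).__name__, type(comment).__name__)
```

Because Comment is a subclass of NavigableString, code that must tell ordinary text from comments should check for Comment first.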

Getting a tag

soup.title  # get the <title> tag
tag = soup.a  # the <a> tag defines a hyperlink
# If the document contains several matching tags, only the first one is returned.

# Tag names
soup.a.name  # "a"
soup.a.parent.name  # "p"
soup.p.parent.name  # "body"

# Tag attributes
tag.attrs  # a dict
tag.attrs["class"]  # look up a value in the dict
tag.attrs["href"]

# Strings inside a tag
soup.p.string
soup.a.string
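
One detail worth seeing in code: .string only returns text when a tag has a single chain of children, and attribute lookups can fail when the attribute is absent. The sketch below uses an invented two-paragraph snippet to show both cases.

```python
from bs4 import BeautifulSoup

# Made-up markup loosely shaped like the demo page.
html = (
    '<p class="title"><b>Course list</b></p>'
    '<p class="course">Start with <a href="/basic" id="link1">Basic Python</a>.</p>'
)
soup = BeautifulSoup(html, "html.parser")

first_p, second_p = soup.find_all("p")

single = first_p.string          # <p> -> <b> -> text: the text is returned
mixed = second_p.string          # text plus a tag plus text: None
full_text = second_p.get_text()  # concatenates every string under the tag

link = soup.a
href = link["href"]              # shorthand for link.attrs["href"]; KeyError if absent
missing = link.get("title")      # .get() returns None instead of raising

print(single, mixed, full_text, href, missing)
```

When a tag may contain nested markup, get_text() is usually the safer way to read its text.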

Traversing HTML content with bs4

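The tree can be traversed in three directions: upward (toward ancestors), downward (toward descendants), and sideways (among siblings).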

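Attribute	Description
.contents	List of child nodes; every direct child of the tag is stored in the list
.children	Iterator over child nodes; like .contents, but meant for looping over direct children
.descendants	Iterator over all descendant nodes (children, grandchildren, and so on)
.parent	The node's parent tag
.parents	Iterator over the node's ancestors: parent, grandparent, and above
.next_sibling	The next sibling node in HTML document order
.previous_sibling	The previous sibling node in HTML document order
.next_siblings	Iterator over all following sibling nodes in HTML document order
.previous_siblings	Iterator over all preceding sibling nodes in HTML document order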

soup.head
soup.head.contents  # returns a list
soup.body.contents  # use len() to count the children, or loop over the list with for ... in ...
soup.a.parent
soup.a.next_sibling  # may be a text node rather than a tag
soup.a.next_sibling.next_sibling
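
The downward and sibling attributes return text nodes as well as tags, which is easy to miss. The following sketch, built on an invented snippet, counts the children of a paragraph and filters out the text nodes with isinstance.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Made-up markup: a paragraph holding two links separated by text.
html = '<body><p>Intro <a id="link1">Basic</a> and <a id="link2">Advanced</a>.</p></body>'
soup = BeautifulSoup(html, "html.parser")
p = soup.p

child_count = len(p.contents)  # text nodes count as children too
child_tags = [c.name for c in p.children if isinstance(c, Tag)]
descendant_strings = [str(d) for d in p.descendants if not isinstance(d, Tag)]

first_link = soup.a
after_first = first_link.next_sibling          # the text between the two links
second_link = after_first.next_sibling         # the second <a> tag
later_tags = [s["id"] for s in first_link.next_siblings if isinstance(s, Tag)]

print(child_count, child_tags, repr(after_first), second_link["id"], later_tags)
```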

Standard code for upward traversal

soup = bs(demo, "html.parser")
for parent in soup.a.parents:
    # The last ancestor is the BeautifulSoup object itself, whose name is "[document]".
    if parent is None:
        print(parent)
    else:
        print(parent.name)
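
To see what the loop visits without fetching the demo page, the same traversal can be run on a small inline document. The HTML below is a stand-in written for this example.

```python
from bs4 import BeautifulSoup

# A tiny stand-in document with one link nested three levels deep.
html = "<html><body><p><a href='/basic'>Basic Python</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Walk from the <a> tag up to the root of the tree.
ancestor_names = [parent.name for parent in soup.a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```

The final entry is the BeautifulSoup object, which sits above the html tag and has no parent of its own.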

General approach to information extraction:

# Extract the URL links from the demo page
for link in soup.find_all("a"):  # .find_all(name, attrs, recursive=True, string)
    print(link.get("href"))
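
In practice, pages often contain relative links and anchors with no href at all. One possible way to handle both is sketched below; the function name extract_links, the sample markup, and the base URL are assumptions made for the example, and urljoin comes from the standard library.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that has an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    for a in soup.find_all("a", href=True):  # href=True skips anchors without href
        links.append(urljoin(base_url, a["href"]))
    return links


# Made-up markup: a relative link, an anchor without href, and an absolute link.
sample = (
    '<p><a href="/course/1">One</a> '
    '<a name="top">no link</a> '
    '<a href="http://example.com/x">X</a></p>'
)
links = extract_links(sample, "http://example.org/index.html")
print(links)
```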

The .find_all() method

# Searches the tree and returns a list-like result.
# name: string used to match tag names
soup.find_all('a')
soup.find_all(['a', 'b'])  # note: pass a list to match several names
soup.find_all(True)  # all tags
import re  # regular expressions
for tag in soup.find_all(re.compile('b')):  # tags whose name contains "b"
    print(tag.name)
# attrs: string used to match tag attribute values
soup.find_all('p', 'course')  # <p> tags whose class is "course"
soup.find_all(id='link1')
soup.find_all(string='Basic Python')
soup.find_all(string=re.compile('python'))  # all strings containing "python"
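
The remaining parameters are easier to judge with results in hand. This sketch runs several find_all() calls against an invented document shaped like the demo page; class_, limit, and recursive are standard find_all() arguments, while the markup itself is only an assumption.

```python
import re

from bs4 import BeautifulSoup

# Made-up markup modeled on the demo page.
html = (
    "<html><body>"
    '<p class="title"><b>Python courses</b></p>'
    '<p class="course"><a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a></p>'
    "</body></html>"
)
soup = BeautifulSoup(html, "html.parser")

b_names = [t.name for t in soup.find_all(re.compile("^b"))]  # names starting with "b"
course_ps = soup.find_all("p", class_="course")              # keyword form of the class filter
first_link_only = soup.find_all("a", limit=1)                # stop after one match
direct_links = soup.find_all("a", recursive=False)           # direct children of soup only

exact = [str(s) for s in soup.find_all(string="Basic Python")]
lower_case = soup.find_all(string=re.compile("python"))      # regex matching is case-sensitive
any_case = soup.find_all(string=re.compile("python", re.I))  # re.I ignores case

print(b_names, len(course_ps), len(direct_links), exact, len(lower_case), len(any_case))
```

Note that string matching with a regular expression is case-sensitive unless the pattern says otherwise, which is why the lowercase pattern finds nothing in this snippet.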

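Finally, because find_all() is used so often, it has a shorthand: calling a tag or the soup object directly does the same thing. For example, soup.find_all(...) can be written as soup(...), and tag.find_all(...) as tag(...).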
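
A quick check of the shorthand, using a two-link snippet invented for the example:

```python
from bs4 import BeautifulSoup

# Made-up markup with two links inside one paragraph.
html = '<p><a id="link1">Basic</a> <a id="link2">Advanced</a></p>'
soup = BeautifulSoup(html, "html.parser")

long_form = soup.find_all("a")
short_form = soup("a")                 # calling the object is the same as find_all()
same = all(x is y for x, y in zip(long_form, short_form))

nested = soup.p("a", id="link2")       # the shorthand also works on a tag
print(same, len(short_form), nested[0]["id"])
```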

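Related methods: .find(), .find_parent(), .find_parents(), .find_next_sibling(), .find_next_siblings(), .find_previous_sibling(), .find_previous_siblings(). They accept the same arguments as .find_all(); the singular forms return one result and the plural forms return a list.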
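
As a rough illustration of how these methods differ from find_all(), the sketch below exercises a few of them on an invented snippet.

```python
from bs4 import BeautifulSoup

# Made-up markup: one paragraph with two sibling links.
html = (
    "<body>"
    '<p class="course"><a id="link1">Basic</a> and <a id="link2">Advanced</a></p>'
    "</body>"
)
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")                 # first match only, not a list
nothing = soup.find("table")           # None when nothing matches

parent_p = first.find_parent("p")      # nearest <p> ancestor
nxt = first.find_next_sibling("a")     # skips the text node between the links
prev = nxt.find_previous_sibling("a")  # back to the first link
later = first.find_next_siblings("a")  # list of all following <a> siblings

print(first["id"], nothing, parent_p["class"], nxt["id"], prev["id"], len(later))
```

Since .find() returns None when there is no match, it is worth checking the result before accessing attributes on it.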
