
Installation
pip3 install beautifulsoup4
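Parsers
Parser | Usage | Advantages | Disadvantages |
---|---|---|---|
Python standard library | BeautifulSoup(markup,'html.parser') | Built into Python, moderate speed, tolerant of malformed documents | Poor tolerance of malformed documents in versions before Python 2.7.3 or 3.2.2 |
lxml HTML parser | BeautifulSoup(markup,'lxml') | Fast, tolerant of malformed documents | Requires a C library to be installed |
lxml XML parser | BeautifulSoup(markup,'xml') | Fast, the only parser that supports XML | Requires a C library to be installed |
html5lib | BeautifulSoup(markup,'html5lib') | Most tolerant, parses documents the way a browser does, produces HTML5-style output | Slow, an external Python dependency |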
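The table suggests a simple selection rule: use lxml when it is installed and fall back to the built-in parser otherwise. A minimal sketch of that rule follows; `make_soup` is a hypothetical helper name, not part of bs4, and the order of preference is an assumption.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(markup, preferred=("lxml", "html.parser")):
    """Build a soup with the first parser in `preferred` that is installed."""
    for parser in preferred:
        try:
            return BeautifulSoup(markup, parser)
        except FeatureNotFound:
            continue  # this parser is not installed; try the next one
    raise RuntimeError("none of the requested parsers is available")


# The markup is deliberately left unclosed; both parsers repair it.
soup = make_soup("<p>Hello <b>world")
print(soup.b.string)  # world
```

bs4 raises `FeatureNotFound` when the requested parser is missing, so the same script can run on machines with and without lxml.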
Basic usage
html = """
<html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="title" name="dormouse"> <b>The Dormouse's story</b></p><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <!-- Elsie --></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.prettify())
The parser automatically completes the missing closing tags:
<html dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title" name="dormouse">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
<!-- Elsie -->
</a>
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
; and they lived at the bottom of a well
</p>
<p class="story">
...story go on...
</p>
</body>
</html>
print(soup.title.string)
This prints the title of the HTML document:
The Dormouse's story
Tag selectors
Selecting elements
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title)
print(type(soup.title))
print(soup.head)
print(soup.p)
Output:
<title>The Dormouse's story</title>
<class 'bs4.element.Tag'>
<head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head>
<p class="title" name="dormouse"> <b>The Dormouse's story</b></p> # only the first p tag is returned
Getting the tag name
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.title.name)
title
Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.attrs['name'])
print(soup.p['name'])
Both forms return the value of the attribute:
dormouse
dormouse
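The two forms above behave differently when an attribute is missing. The sketch below uses a small made-up fragment and `html.parser` so it runs without extra dependencies; it shows `Tag.get()` as the non-raising alternative and that `class` comes back as a list.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="title big" name="dormouse">text</p>', "html.parser")
p = soup.p

print(p["name"])           # dormouse
print(p.get("id"))         # None: missing attribute, no exception
print(p.get("id", "n/a"))  # n/a: explicit default
print(p["class"])          # ['title', 'big']: class is multi-valued

try:
    p["id"]                # subscript access raises for a missing attribute
except KeyError:
    print("p has no id attribute")
```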
Getting content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.b.string)
The Dormouse's story
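`.string` only works when a tag has exactly one child. As a small illustration (the fragment is made up, and `html.parser` is used so nothing extra needs installing), a tag with mixed content returns `None`, while `get_text()` joins all nested text:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Once upon a time <b>three sisters</b></p>", "html.parser")

print(soup.b.string)      # three sisters
print(soup.p.string)      # None: <p> has two children, a string and <b>
print(soup.p.get_text())  # Once upon a time three sisters
```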
Nested selection
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.head.title.string)
The Dormouse's story
Child and descendant nodes
html = '''<html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse\'s story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were\n <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well\n </p> <p class="story"> ...story go on...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.contents)
['Once upon a time there were three little sisters;and their names were\n ', <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>, 'and', <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>, '; and they lived at the bottom of a well\n ']
children is an iterator:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.children)
for i,child in enumerate(soup.p.children):
    print(i,child)
<list_iterator object at 0x7fe986ba07f0>
0 Once upon a time there were three little sisters;and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>
2 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
3 and
4 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
5 ; and they lived at the bottom of a well
html = '''<html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse\'s story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were\n <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well\n </p> <p class="story"> ...story go on...</p>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.p.descendants)
for i,child in enumerate(soup.p.descendants):
    print(i,child)
Descendant (grandchild) nodes are printed as well:
<generator object descendants at 0x7fe986c11468>
0 Once upon a time there were three little sisters;and their names were
1 <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>
2
3 <span>Elsie </span>
4 Elsie
5 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>
6 Lacie
7 and
8 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
9 Tillie
10 ; and they lived at the bottom of a well
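The difference between the two is easier to see on a tiny tree. The sketch below uses a made-up fragment and `html.parser`: `children` yields only direct children, while `descendants` walks the whole subtree in document order, text nodes included.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup("<div><p>one <b>two</b></p></div>", "html.parser")
div = soup.div

children = list(div.children)        # direct children only: [<p>]
descendants = list(div.descendants)  # <p>, 'one ', <b>, 'two'

print(len(children))     # 1
print(len(descendants))  # 4

# Keep only the tags and skip the text nodes.
tag_names = [node.name for node in descendants if isinstance(node, Tag)]
print(tag_names)         # ['p', 'b']
```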
Parent and ancestor nodes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.a.parent)
Output:
<p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p>
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.parent)))
Output:
[(0, 'Once upon a time there were three little sisters;and their names were\n '), (1, <a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a>), (2, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (3, 'and'), (4, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (5, '; and they lived at the bottom of a well\n ')]
print(list(enumerate(soup.a.parents)))
All ancestors are listed; the last entry is the root of the document:
[(0, <p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p>), (1, <body><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
</body>), (2, <html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
</body></html>), (3, <html dir="ltr" lang="en"><head><meta charset="utf-8"/> <title>The Dormouse's story</title> </head><body><p class="story">Once upon a time there were three little sisters;and their names were
<a class="sister" href="http://example.com/elsie" id="link1"> <span>Elsie </span></a><a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>and<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>; and they lived at the bottom of a well
</p> <p class="story"> ...story go on...</p>
</body></html>)]
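Printing every ancestor in full is hard to read, which is why the last two entries above look identical. A shorter sketch (made-up fragment, `html.parser`) prints only the names and shows that the final ancestor is the `BeautifulSoup` object itself, whose name is `[document]`:

```python
from bs4 import BeautifulSoup

markup = "<html><body><p><a href='#'>link</a></p></body></html>"
soup = BeautifulSoup(markup, "html.parser")

# Walk upward from <a> and collect only the tag names.
names = [parent.name for parent in soup.a.parents]
print(names)  # ['p', 'body', 'html', '[document]']
```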
Sibling nodes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(list(enumerate(soup.a.next_siblings)))
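Output:
```html
[(0, <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>), (1, 'and'), (2, <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>), (3, '; and they lived at the bottom of a well\n ')]
```
`print(list(enumerate(soup.a.previous_siblings)))`
> `[(0, 'Once upon a time there were three little sisters;and their names were\n ')]`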
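Siblings include text nodes, so the next sibling of a tag is often a string rather than a tag. A minimal sketch (made-up fragment, `html.parser`) contrasts `next_sibling` with `find_next_sibling()`, which skips to the next matching tag:

```python
from bs4 import BeautifulSoup

markup = '<p><a id="link1">Elsie</a> and <a id="link2">Lacie</a></p>'
soup = BeautifulSoup(markup, "html.parser")
first = soup.a

print(repr(first.next_sibling))            # ' and ': a text node
print(first.find_next_sibling("a")["id"])  # link2: next <a> tag

# All following <a> siblings, text nodes filtered out.
following_ids = [a["id"] for a in first.find_next_siblings("a")]
print(following_ids)                       # ['link2']
```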
## Standard selectors
### find_all(name,attrs,recursive,text,**kwargs)
Searches the document by tag name, attributes, or text content.
#### name
```py
html = """
<div class="panel">
<div class="panel-heading">
<h4>Helllo</h4>
</div>
<div class="panel-body">
<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
<ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>
</div>
</div>
"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all('ul'))
print(type(soup.find_all('ul')[0]))
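```
Output:
```html
[<ul class="list" id="list-1">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>, <ul class="list list-small" id="list-2">
<li class="element">Foo</li>
<li class="element">Bar</li>
</ul>]
<class 'bs4.element.Tag'>
```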
```py
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.find_all('ul'):
    print(ul.find_all('li'))
```
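Output:
```html
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
```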
attrs
html = '''
<div class="panel">\n <div class="panel-heading">\n <h4>Helllo</h4>\n </div>\n <div class="panel-body">\n <ul class="list" id="list-1" name=elements>\n <li class="element">Foo</li>\n <li class="element">Bar</li>\n <li class="element">Jay</li>\n </ul>\n <ul class="list list-small" id="list-2">\n <li class="element">Foo</li>\n <li class="element">Bar</li>\n </ul>\n </div>\n</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(attrs={'id':'list-1'}))
print(soup.find_all(attrs={'name':'elements'}))
Output:
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
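When you know the id or class, you can also search with keyword arguments (class_ has a trailing underscore because class is a Python keyword):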
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(id='list-1'))
[<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>]
print(soup.find_all(class_='element'))
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
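One detail worth illustrating: `class_` matches any single class of a multi-class element. The sketch below uses a reduced version of the list markup (an assumption made for brevity) with `html.parser`, and also shows the `limit` argument of `find_all()`.

```python
from bs4 import BeautifulSoup

markup = (
    '<ul class="list" id="list-1"></ul>'
    '<ul class="list list-small" id="list-2"></ul>'
)
soup = BeautifulSoup(markup, "html.parser")

# "list" matches both elements, including the one with two classes.
ids_list = [ul["id"] for ul in soup.find_all(class_="list")]
print(ids_list)   # ['list-1', 'list-2']

# "list-small" matches only the second element.
ids_small = [ul["id"] for ul in soup.find_all(class_="list-small")]
print(ids_small)  # ['list-2']

# limit stops the search after the given number of matches.
ids_first = [ul["id"] for ul in soup.find_all("ul", attrs={"class": "list"}, limit=1)]
print(ids_first)  # ['list-1']
```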
text
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find_all(text='Foo'))
['Foo', 'Foo']
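In Beautiful Soup 4.4.0 and later the same argument is also available under the name `string`, and it accepts a compiled regular expression as well as an exact string. A minimal sketch, assuming a reasonably recent bs4 and a made-up list fragment:

```python
import re

from bs4 import BeautifulSoup

markup = "<ul><li>Foo</li><li>Bar</li><li>Baz</li></ul>"
soup = BeautifulSoup(markup, "html.parser")

# Exact match on the text node.
exact = [str(s) for s in soup.find_all(string="Foo")]
print(exact)    # ['Foo']

# Regular expression match on text nodes.
pattern = [str(s) for s in soup.find_all(string=re.compile(r"^Ba"))]
print(pattern)  # ['Bar', 'Baz']

# Combined with a tag name, the tags are returned instead of the strings.
tags = soup.find_all("li", string=re.compile(r"^Ba"))
print(tags)     # [<li>Bar</li>, <li>Baz</li>]
```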
find(name,attrs,recursive,text,**kwargs)
find returns the first matching element; find_all returns all matching elements
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.find('ul'))
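<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>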
print(type(soup.find('ul')))
<class 'bs4.element.Tag'>
print(type(soup.find('page')))
When nothing matches, the result is None:
<class 'NoneType'>
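Because `find()` can return `None`, chaining an attribute access onto it fails when the tag is absent. A small defensive sketch (made-up fragment, `html.parser`):

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<ul><li>Foo</li></ul>", "html.parser")

page = soup.find("page")
if page is None:
    # Handle the missing tag instead of calling page.get_text()
    text = "no <page> tag found"
else:
    text = page.get_text()
print(text)

# find_all() never returns None; an empty result is an empty list.
print(soup.find_all("page"))  # []
```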
CSS selectors
Pass a CSS selector string directly to select() to make a selection
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
print(soup.select('.panel .panel-heading'))
print(soup.select('ul li'))
print(soup.select('#list-2 .element'))
print(soup.select('ul')[0])
Output:
[<div class="panel-heading">
<h4>Helllo</h4>
</div>]
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>, <li class="element">Foo</li>, <li class="element">Bar</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
<ul class="list" id="list-1" name="elements">
<li class="element">Foo</li>
<li class="element">Bar</li>
<li class="element">Jay</li>
</ul>
Iterating over the results:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul.select('li'))
Output:
[<li class="element">Foo</li>, <li class="element">Bar</li>, <li class="element">Jay</li>]
[<li class="element">Foo</li>, <li class="element">Bar</li>]
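When only the first match is needed, `select_one()` plays the same role for CSS selectors that `find()` plays for `find_all()`. The sketch below uses a reduced copy of the panel markup (an assumption made for brevity) and `html.parser`; it also shows the child combinator `>`.

```python
from bs4 import BeautifulSoup

markup = (
    '<div class="panel-body">'
    '<ul class="list" id="list-1">'
    '<li class="element">Foo</li><li class="element">Bar</li></ul>'
    '<ul class="list list-small" id="list-2">'
    '<li class="element">Jay</li></ul>'
    '</div>'
)
soup = BeautifulSoup(markup, "html.parser")

first = soup.select_one("ul li")           # first match only
print(first.get_text())                    # Foo

small = soup.select("ul.list-small > li")  # direct <li> children of that <ul>
print([li.get_text() for li in small])     # ['Jay']

print(soup.select_one("#missing"))         # None when nothing matches
```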
Getting attributes
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for ul in soup.select('ul'):
    print(ul['id'])
    print(ul.attrs['id'])
Output:
list-1
list-1
list-2
list-2
Getting content
from bs4 import BeautifulSoup
soup = BeautifulSoup(html,'lxml')
for li in soup.select('li'):
    print(li.get_text())
Output:
Foo
Bar
Jay
Foo
Bar
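Real pages usually carry stray whitespace around text. A short sketch (made-up fragment, `html.parser`) of the `strip` and separator arguments of `get_text()`, plus the `stripped_strings` generator:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<li> Foo <b>Bar</b> </li>", "html.parser")
li = soup.li

print(repr(li.get_text()))           # ' Foo Bar ': whitespace kept
print(li.get_text(strip=True))       # FooBar: each piece stripped, then joined
print(li.get_text(" ", strip=True))  # Foo Bar: joined with a separator
print(list(li.stripped_strings))     # ['Foo', 'Bar']
```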
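Summary:
- Prefer the lxml parser; fall back to html.parser when necessary
- Tag selectors offer weak filtering but are fast
- Use find() and find_all() to match a single result or multiple results
- If you are familiar with CSS selectors, use select()
- Remember the common ways to get attribute values and text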