BeautifulSoup是一个可以从HTML或XML文件中提取数据的Python库
通过beautifulsoup4预防XSS攻击
借助beautifulsoup4将用户输入内容进行过滤
实际使用时需要采用单例模式
步骤:
- 实例化对象,对页面进行解析
- 查找目标标签
- 将非法标签进行清空
- 获取处理后字符串
直接操作标签
示例:
content = '''
<div id="i1">
<img src="" id="img">
</div>
<div id="i2"></div>
<script>alert('Hi!')</script>
'''
soup = BeautifulSoup(content, 'html.parser') # <class 'bs4.BeautifulSoup'>
script_tag = soup.find('script') # <class 'bs4.element.Tag'>
script_tag.clear()
script_tag.hidden = True
content = soup.decode() # 将对象转换为一个字符串
print(content)
输出结果:
<div id="i1">
<img src="" id="img">
</div>
<div id="i2"></div>
操作属性
通过.attrs
获取属性字典,在字典中进行操作
示例:
content = '''
<div id="i1">
<img src="" id="img">
</div>
<div id="i2"></div>
<script>alert('Hi!')</script>
'''
soup = BeautifulSoup(content, 'html.parser')
img_tag = soup.find('img')
del img_tag.attrs['id']
content = soup.decode()
print(content)
输出结果:
<div id="i1">
<img src="">
</div>
<div id="i2"></div>
<script>alert('Hi!')</script>
设置白名单
示例:
from bs4 import BeautifulSoup
content = '''
<div id="i1">
<img src="" id="img">
</div>
<div id="i2" class="c1"></div>
<script>alert('Hi!')</script>
'''
tag_p = {
# 允许使用的标签和允许的属性
'div': ['class', ],
'img': ['src', ],
}
soup = BeautifulSoup(content, 'html.parser') # <class 'bs4.BeautifulSoup'>
# 开始过滤
for tag in soup.find_all():
if tag.name in tag_p:
pass
else: # 不在白名单中的标签进行清除
tag.hidden = True
tag.clear()
continue
for k in list(tag.attrs.keys()): # 注意要先将dict.keys转换成列表
if k in tag_p[tag.name]:
pass
else:
del tag.attrs[k]
content = soup.decode()
print(content)
输出结果:
<div>
<img src=""/>
</div>
<div class="c1"></div>
方法
findChildren = findAll = find_all
findChild = find = find_all[0]
tag.clear 将选定标签中内容清空(标签还在)
tag.hidden = True 将标签去掉(内容还在)
tag.attrs 获取一个字典,key: value