Creating and displaying the original content

The third-party lxml parser is used here to speed up parsing.
```python
import bs4
from bs4 import BeautifulSoup

html_str = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1"><!-- Elsie --></a>,
<a href="http://example.com/lacie" class="sister" id="link2"><!-- Lacie --></a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

soup = BeautifulSoup(html_str, 'lxml')
print(soup.prettify())
```
Extracting tag content and attributes
Accessing a tag by name returns a Tag object; every tag in the document can be reached this way, and by default the first matching tag is returned. The name attribute gives the tag's name, and printing a Tag shows its full markup.
```python
print(soup.name)
print(soup.title.name)
print(soup.title)
print(soup.a)
```
attrs returns all attributes as a dictionary; indexing with 'class' returns the class names of the selected Tag.
```python
print(soup.p['class'])
print(soup.p.attrs)
```
The text inside a tag is exposed through string, which is a NavigableString object.
```python
print(soup.p.string)
print(type(soup.p.string))
```
When a tag contains only a comment, string returns the comment's content as a Comment object, which is useful for telling comments apart from ordinary text.
```python
print(soup.a.string)
print(type(soup.a.string))
```
Navigating the document tree
A node's contents attribute returns its direct children as a list; you can iterate over it with a for loop, or read each child's content directly through its string attribute.
```python
print(soup.body.contents)
```
The children attribute yields the direct children as a generator, similar to contents.
```python
for i in soup.body.children:
    print(i)
```
The descendants attribute yields children, grandchildren, and all deeper nodes.
```python
for i in soup.body.descendants:
    print(i)
```
The strings attribute yields the text content of all descendant nodes.
```python
print(soup.strings)
for text in soup.strings:
    print(text)
```
The stripped_strings attribute yields the same text but with newlines and surrounding whitespace removed.
```python
for text in soup.stripped_strings:
    print(text)
```
The parent node: parent
```python
print(soup.title)
print(soup.title.parent)
```
The parents attribute yields all ancestor nodes; only their names are printed here, since the full content would be too long.
```python
for i in soup.a.parents:
    print(i.name)
```
Sibling nodes: next_sibling and previous_sibling; next_siblings and previous_siblings iterate over all of them.
```python
print(soup.p.next_sibling.next_sibling)
print(soup.p.previous_sibling)
```
Preceding and following nodes in parse order: next_element, next_elements, and so on.
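These accessors walk the document in parse order (depth-first), not just across siblings. A minimal standalone sketch of next_element and next_elements, using the built-in html.parser so the snippet has no lxml dependency:

```python
from bs4 import BeautifulSoup

# A stripped-down version of the document used above.
html_str = '<p class="title"><b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(html_str, 'html.parser')

# next_element follows parse order: from <p> it goes to the child <b>,
# then to the text inside <b> -- unlike next_sibling, which skips children.
p = soup.p
print(p.next_element.name)          # b
print(p.next_element.next_element)  # The Dormouse's story

# next_elements iterates over everything that follows in parse order.
for el in p.next_elements:
    print(repr(el))
```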
BeautifulSoup's search methods
They include find_all, find, find_parents, and others; only find_all is demonstrated here.
find_all's name parameter matches tag names, and it can be combined with regular expressions.
```python
import re

# Match by tag name, regular expression, list, True, or a filter function.
print(soup.find_all('b'))

for tag in soup.find_all(re.compile("^b")):
    print(tag.name)

print(soup.find_all(["a", "b"]))

for tag in soup.find_all(True):
    print(tag.name)

def hasClass_Id(tag):
    return tag.has_attr('class') and tag.has_attr('id')

print(soup.find_all(hasClass_Id))

# Searching with keyword arguments (kwargs)
print(soup.find_all(id='link2'))
print(soup.find_all(href=re.compile("elsie")))
print(soup.find_all(id=True))
print(soup.find_all("a", class_="sister"))
print(soup.find_all(href=re.compile("elsie"), id='link1'))

data_soup = BeautifulSoup('<div data-foo="value">foo!</div>', 'lxml')
print(data_soup.find_all(attrs={"data-foo": "value"}))

# Filtering by text content with the text parameter
print(soup.find_all(text="Elsie"))
print(soup.find_all(text=["Tillie", "Elsie", "Lacie"]))
print(soup.find_all(text=re.compile("Dormouse")))
print(soup.find_all("a", text="Elsie"))

# Limiting the number of results with limit
print(soup.find_all("a", limit=2))

# Restricting the search to direct children with recursive
print(soup.find_all("title"))
print(soup.find_all("title", recursive=False))
```
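The other search methods mentioned above work the same way: find takes the same filters as find_all but returns only the first match (or None), while find_parents filters a node's ancestors. A minimal sketch on a cut-down document (html.parser, so no lxml needed):

```python
from bs4 import BeautifulSoup

html_str = """
<html><body>
<p class="story">Once upon a time there were three little sisters:
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
</p>
</body></html>
"""
soup = BeautifulSoup(html_str, 'html.parser')

# find() returns the first matching Tag, or None when nothing matches
# (find_all() would return an empty list instead).
first_a = soup.find('a', class_='sister')
print(first_a['id'])        # link1
print(soup.find('table'))   # None

# find_parents() filters the ancestors of a node, nearest first.
link = soup.find(id='link1')
for ancestor in link.find_parents(['p', 'body']):
    print(ancestor.name)    # p, then body
```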
Searching with CSS selectors
```python
# Find title tags anywhere in the document
print(soup.select("title"))
# Find title tags by walking down level by level
print(soup.select("html head title"))

# Direct child selectors
# title tags directly under head
print(soup.select("head > title"))
# the tag with id="link1" directly under a p
print(soup.select("p > #link1"))

# Sibling selectors
# all class="sister" siblings after id="link1"
print(soup.select("#link1 ~ .sister"))
# the class="sister" sibling immediately after id="link1"
print(soup.select("#link1 + .sister"))

# Class, id, and attribute selectors
print(soup.select(".sister"))
print(soup.select("[class~=sister]"))
print(soup.select("#link1"))
print(soup.select("a#link2"))
print(soup.select('a[href]'))
print(soup.select('a[href="http://example.com/elsie"]'))
print(soup.select('a[href^="http://example.com/"]'))
print(soup.select('a[href$="tillie"]'))
print(soup.select('a[href*=".com/el"]'))
```
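select() always returns a list of Tag objects, so the usual accessors (get_text(), attribute indexing) apply to its results; select_one() returns just the first match, playing the same role for CSS selectors that find() plays for find_all(). A small standalone sketch (html.parser):

```python
from bs4 import BeautifulSoup

html_str = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html_str, 'html.parser')

# select() gives back Tag objects; read text and attributes as usual.
for a in soup.select('a.sister'):
    print(a.get_text(), a['href'])

# select_one() returns the first match (or None), convenient when
# the selector is expected to match a single element.
first = soup.select_one('#link2')
print(first.get_text())   # Lacie
```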