BeautifulSoup: find, findAll, children, and descendants

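The BeautifulSoup library provides methods and attributes for parsing HTML; it maps an HTML page onto a tree.

1. findAll looks up tags by name and attributes and returns a list of the matches (findAll is the older spelling of find_all; both names refer to the same method)

For example: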

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/warandpeace.html")
bsObj = BeautifulSoup(html, "html.parser")

nameList = bsObj.findAll("span", {"class": "green"})
for name in nameList:
    print(name.get_text())

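.get_text() strips every tag from the element it is called on and returns a string containing only the text.
With .get_text(), the script prints just the text of each green span. Without it, print(name) prints each tag in full, including the <span class="green"> markup.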
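
The difference between printing a tag and printing its text can be checked offline. The sketch below is only an illustration: the HTML string is made up (it is not the real warandpeace.html page), and the variable names with_markup and text_only are assumptions introduced here.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, invented for illustration (not the real page).
html = '<p><span class="green">Anna <b>Pavlovna</b> Scherer</span> spoke first.</p>'
bsObj = BeautifulSoup(html, "html.parser")

# find_all is the current spelling of findAll; both names call the same method.
name = bsObj.find_all("span", {"class": "green"})[0]

with_markup = str(name)       # the whole tag, markup included
text_only = name.get_text()   # nested tags removed, text kept

print(with_markup)
print(text_only)
```

Note that .get_text() also removes the nested <b> tag, so it is usually best called last, once no more tag-based searching is needed.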

find returns only the first of the results that findAll would return. Here that is: Anna Pavlovna Scherer
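
A minimal sketch of the relationship between the two calls, assuming a made-up HTML fragment (the span contents and the "blue" class are hypothetical, chosen only to show the no-match case):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment for illustration; it only mimics the sample page.
html = """
<div>
  <span class="green">Anna Pavlovna Scherer</span>
  <span class="red">Well, Prince</span>
  <span class="green">Prince Vasili</span>
</div>
"""
bsObj = BeautifulSoup(html, "html.parser")

# find_all (same method as findAll) returns a list-like collection of Tags.
all_green = bsObj.find_all("span", {"class": "green"})
# find returns a single Tag: the first match in document order.
first_green = bsObj.find("span", {"class": "green"})
# When nothing matches, find returns None, while find_all returns an empty list.
missing = bsObj.find("span", {"class": "blue"})

print([s.get_text() for s in all_green])
print(first_green.get_text())
print(missing)
```

Because find can return None, it is worth checking the result before calling .get_text() on it.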

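2. In an HTML page, a tag can have child tags and descendant tags. A child is exactly one level below its parent, while descendants are the tags at every level below it. For example, tr is a child of table, whereas tr, th, td, img, and span are all descendants of table.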

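In general, BeautifulSoup functions operate on the descendants of the current tag. For example, bsObj.body.h1 selects the first h1 tag among the descendants of body; it never looks at tags outside body.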
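
This scoping can be sketched on a small inline document. The HTML here is an assumption invented for the example; heading and outside are illustrative names.

```python
from bs4 import BeautifulSoup

# Hypothetical document: one h1 nested inside a div, one directly under body.
html = (
    "<html><head><title>Page</title></head>"
    "<body><div><h1>First</h1></div><h1>Second</h1></body></html>"
)
bsObj = BeautifulSoup(html, "html.parser")

# The first h1 anywhere below body, even though it is nested inside a div.
heading = bsObj.body.h1
# title lives in head, not under body, so the lookup finds nothing.
outside = bsObj.body.title

print(heading.get_text())
print(outside)
```

The attribute form bsObj.body.h1 behaves like bsObj.body.find("h1"), which is why it returns the first match in document order rather than the first direct child.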

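.children yields only the direct children of a tag (all of them: for a table, every tr directly under it, not just the first one), whereas .descendants yields everything at every level below the tag. Both are attributes that you iterate over, not functions, so they are written without parentheses.

First, .children: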

from urllib.request import urlopen
from bs4 import BeautifulSoup

html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")

for child in bsObj.find("table",{"id":"giftList"}).children:
    print(child)

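This prints each direct child of the giftList table, that is, its rows. If .children is replaced with .descendants, the output changes as follows: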

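Each child tag is printed first, immediately followed by its own descendants, level by level, before the next child is visited.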
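
The traversal order is easier to see on a tiny table. This is a sketch under stated assumptions: the HTML is a made-up, whitespace-free stand-in for the giftList table, and child_names and descendant_names are names introduced here for illustration.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical, whitespace-free stand-in for the giftList table.
html = (
    '<table id="giftList">'
    "<tr><th>Item</th></tr>"
    "<tr><td>Vegetable Basket</td></tr>"
    "</table>"
)
bsObj = BeautifulSoup(html, "html.parser")
table = bsObj.find("table", {"id": "giftList"})

# Keep only Tag objects; both iterators also yield text nodes.
child_names = [c.name for c in table.children if isinstance(c, Tag)]
descendant_names = [d.name for d in table.descendants if isinstance(d, Tag)]

print(child_names)       # direct children only
print(descendant_names)  # each row followed by the cells inside it
```

On a real page the whitespace between tags shows up as extra text nodes in both iterators, which is why the isinstance filter is used here.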

This is my current understanding; corrections and suggestions are welcome.
