I am trying to do a breadth-first search on a Beautiful Soup tree. I know we can do a depth-first search with Beautiful Soup like this:
html = """SOME HTML FILE"""
soup = BeautifulSoup(html)
for child in soup.recursiveChildGenerator():
    # do some stuff here
    pass
But I have no idea how to do a breadth-first search. Does anyone have an idea or a suggestion?
Thanks for your help.
2 answers
#1
0
Use the .children generator of each element to append its children to your breadth-first queue:
from bs4 import BeautifulSoup
import requests
html = requests.get("https://*.com/questions/44798715/").text
soup = BeautifulSoup(html, "html5lib")
queue = [([], soup)] # queue of (path, element) pairs
while queue:
    path, element = queue.pop(0)
    if hasattr(element, 'children'):  # check for leaf elements
        for child in element.children:
            queue.append((path + [child.name if child.name is not None else type(child)],
                          child))
    # do stuff
    print(path, repr(element.string[:50]) if element.string else type(element))
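As a variation on the loop above, here is a minimal sketch that swaps the list for a collections.deque (so removing from the front is O(1)) and tracks depth instead of the full path. The function name bfs_with_depth and the sample markup are made up for illustration; they are not part of the original answer.

```python
from collections import deque

from bs4 import BeautifulSoup
from bs4.element import Tag


def bfs_with_depth(root):
    """Yield (depth, node) pairs in breadth-first order, starting below root."""
    queue = deque((1, child) for child in root.children)
    while queue:
        depth, node = queue.popleft()  # O(1), unlike list.pop(0)
        yield depth, node
        if isinstance(node, Tag):  # only tags have children; strings are leaves
            queue.extend((depth + 1, child) for child in node.children)


sample = "<div><p>a<b>x</b></p><span>c</span></div>"  # hypothetical markup
soup = BeautifulSoup(sample, "html.parser")

# Keep only the tags so the level-by-level order is easy to read.
tags = [(depth, node.name) for depth, node in bfs_with_depth(soup)
        if isinstance(node, Tag)]
print(tags)
```

Both siblings at depth 2 (p and span) are reported before the nested b at depth 3, which is what distinguishes this from the depth-first order.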
#2
0
To traverse an HTML document parsed by BeautifulSoup with DFS or BFS, do the following:
solution.py:
import bs4
from bs4 import BeautifulSoup
html = """
<div>root
<div>child1
<div>child4
</div>
<div>child5
</div>
</div>
<div>child2
</div>
<div>child3
<div>child6
</div>
</div>
</div>
"""
Append these lines to solution.py:
def visit(node):
    if isinstance(node, bs4.element.Tag):
        # careful: bs4.BeautifulSoup is a subclass of Tag, so it lands here too
        print(type(node), 'tag:', node.name)
    elif isinstance(node, bs4.element.NavigableString):
        # careful: bs4.element.CData and bs4.element.Comment are subclasses too
        print(type(node), repr(node.string))
    else:
        print(type(node), 'UNKNOWN')
And:
def dfs(html):
    bs = BeautifulSoup(html, 'html.parser')
    # <class 'bs4.BeautifulSoup'> [document]
    visit(bs)
    for child in bs.recursiveChildGenerator():
        visit(child)

def bfs(html):
    bs = BeautifulSoup(html, 'html.parser')
    # <class 'bs4.BeautifulSoup'> [document]
    visit(bs)
    for child in recursiveChildGeneratorBfs(bs):
        visit(child)

def recursiveChildGeneratorBfs(bs):
    root = bs
    stack = [root]  # used as a FIFO queue despite the name
    while len(stack) != 0:
        node = stack.pop(0)
        if node is not bs:
            yield node
        if hasattr(node, 'children'):
            for child in node.children:
                stack.append(child)
In the IPython console:
In [1]: run solution.py
BFS:
In [2]: bfs(html)
<class 'bs4.BeautifulSoup'> tag: [document]
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.NavigableString'> 'root\n '
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.NavigableString'> 'child1\n '
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.NavigableString'> 'child2\n '
<class 'bs4.element.NavigableString'> 'child3\n '
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.NavigableString'> 'child4\n '
<class 'bs4.element.NavigableString'> 'child5\n '
<class 'bs4.element.NavigableString'> 'child6\n '
DFS:
In [3]: dfs(html)
<class 'bs4.BeautifulSoup'> tag: [document]
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> 'root\n '
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> 'child1\n '
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> 'child4\n '
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> 'child5\n '
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> 'child2\n '
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> 'child3\n '
<class 'bs4.element.Tag'> tag: div
<class 'bs4.element.NavigableString'> 'child6\n '
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.NavigableString'> '\n'
<class 'bs4.element.NavigableString'> '\n'
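The two listings are hard to compare because of all the whitespace strings. The following is a small sketch, not part of solution.py, that keeps only the tags and collects their ids in each order. The markup and the id values are invented for the example; .descendants is the bs4 name for what recursiveChildGenerator() returns.

```python
from collections import deque

from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup: ids are added only so each div can be told apart.
doc = ("<div id='root'><div id='c1'><div id='c4'></div></div>"
       "<div id='c2'></div></div>")
soup = BeautifulSoup(doc, "html.parser")

# Depth-first: .descendants walks the tree in document order.
dfs_ids = [node["id"] for node in soup.descendants if isinstance(node, Tag)]

# Breadth-first: a FIFO queue of tags, visited level by level.
bfs_ids = []
queue = deque([soup])
while queue:
    node = queue.popleft()
    for child in node.children:
        if isinstance(child, Tag):  # skip text nodes entirely
            bfs_ids.append(child["id"])
            queue.append(child)

print("DFS:", dfs_ids)
print("BFS:", bfs_ids)
```

In the depth-first list c4 comes right after its parent c1, while in the breadth-first list it comes last, after every tag of the level above.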
See: documentation