Python - How to extract elements between multiple tags

Date: 2023-02-09 17:36:28

Working HTML:

<h2> Heading 1 </h2>
<h3> Subheading 1.1 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a>
<h3> Subheading 1.2 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a> | <a href="#">Link 4</a>
<h3> Subheading 1.3 </h3>
<a href="#">Link 1</a>
<h2> Heading 2 </h2>
<h3> Subheading 2.1 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2</a>
<h3> Subheading 2.2 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a>
<h3> Subheading 2.3 </h3>
<a href="#">Link 1</a>
<h2> Heading 3 </h2>

Problem: I want to extract the h3 tags between each pair of h2 tags, and also all the anchors that follow each h3 tag.

What I have:

from bs4 import BeautifulSoup

soup = BeautifulSoup("""<h2> Heading 1 </h2>
<h3> Subheading 1.1 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a>
<h3> Subheading 1.2 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a> | <a href="#">Link 4</a>
<h3> Subheading 1.3 </h3>
<a href="#">Link 1</a>
<h2> Heading 2 </h2>
<h3> Subheading 2.1 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2</a>
<h3> Subheading 2.2 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a>
<h3> Subheading 2.3 </h3>
<a href="#">Link 1</a>
<h2> Heading 3 </h2>""", 'html5lib')

for row in soup.find_all("h2"):
    print(row.text)
    print(row.find_next('h3'))
    print('################')

Current result:

################
 Heading 1 
<h3> Subheading 1.1 </h3>
################
 Heading 2 
<h3> Subheading 2.1 </h3>
################
 Heading 3 
None
################

Wanted result:

################
Heading 1 
Subheading 1.1
Link 1
Link 2
Link 3
--------
Subheading 1.2 
Link 1
Link 2
Link 3
Link 4
--------
Subheading 1.3 
Link 1
################
Heading 2 
Subheading 2.1 
Link 1
Link 2
--------
Subheading 2.2 
Link 1
Link 2
--------
Subheading 2.3 
Link 1
################

Or something like that

1 solution

#1


This works!

s = """

<h2> Heading 1 </h2>
<h3> Subheading 1.1 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a>
<h3> Subheading 1.2 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a> | <a href="#">Link 4</a>
<h3> Subheading 1.3 </h3>
<a href="#">Link 1</a>
<h2> Heading 2 </h2>
<h3> Subheading 2.1 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2</a>
<h3> Subheading 2.2 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a>
<h3> Subheading 2.3 </h3>
<a href="#">Link 1</a>
<h2> Heading 3 </h2>

"""

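from bs4 import BeautifulSoup as bs

# Name the parser explicitly so the result does not depend on which parsers are installed
soup = bs(s, 'html.parser')

for i in soup.find_all('h2'):
    print(i.text)
    # Walk the siblings after this h2 until the next h2 starts a new section
    for j in i.next_siblings:
        if j.name == 'h2': break
        if j.name == 'h3':
            print('\t' + j.text)
            # Print the anchors that follow this h3, stopping at the next heading
            for k in j.next_siblings:
                if k.name in ('h2', 'h3'): break
                if k.name == 'a':
                    print('\t\t' + k.text)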
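
If you would rather have the data in a structure than printed directly, here is a minimal sketch of a single-pass variant. It assumes the headings and links are all flat siblings as in the question, and that heading texts are unique. The names group_links and HTML are made up for this example, and the sample markup is shortened. It relies on find_all returning matches in document order when given a list of tag names.

```python
from bs4 import BeautifulSoup

# Shortened copy of the question's markup (assumption: same flat layout)
HTML = """
<h2> Heading 1 </h2>
<h3> Subheading 1.1 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2 </a> | <a href="#">Link 3</a>
<h3> Subheading 1.2 </h3>
<a href="#">Link 1</a>
<h2> Heading 2 </h2>
<h3> Subheading 2.1 </h3>
<a href="#">Link 1</a> | <a href="#">Link 2</a>
<h2> Heading 3 </h2>
"""


def group_links(html):
    """Return {h2 text: {h3 text: [link texts]}} in document order."""
    soup = BeautifulSoup(html, "html.parser")
    grouped = {}
    sections = None  # dict of h3 entries for the current h2
    links = None     # list of link texts for the current h3
    # One pass over all three tag types, in the order they appear
    for tag in soup.find_all(["h2", "h3", "a"]):
        text = tag.get_text(strip=True)
        if tag.name == "h2":
            sections = grouped.setdefault(text, {})
            links = None  # anchors directly under an h2 are ignored
        elif tag.name == "h3" and sections is not None:
            links = sections.setdefault(text, [])
        elif tag.name == "a" and links is not None:
            links.append(text)
    return grouped


if __name__ == "__main__":
    for heading, sections in group_links(HTML).items():
        print("################")
        print(heading)
        for index, (subheading, links) in enumerate(sections.items()):
            if index:
                print("--------")
            print(subheading)
            for link in links:
                print(link)
```

Because the loop never looks at siblings, it should also cope with the tags being wrapped in other elements, though I have only tried to match the flat layout from the question.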
