Removing script tags inside p tags with BeautifulSoup

Time: 2021-10-22 00:25:47

I have written some code that extracts the text content of paragraphs:

from bs4 import BeautifulSoup

# Parse the saved page; naming the parser avoids bs4's "no parser specified" warning
with open('MUFC.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Print the text of every <p> element
paragraphs = soup.find_all('p')
for p in paragraphs:
    print(p.get_text())

However, some of the p tags contain script tags, something like:

<p>
<script>
.....
</script>
</p>

I don't want the script contents in the output. Is there a way to set a condition so that get_text() ignores certain tags?

1 solution

#1


6  

First, remove all script tags from the tree, then get the text:

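from bs4 import BeautifulSoup

with open('MUFC.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

# Remove every <script> element from the tree before reading any text
for script in soup.find_all('script'):
    script.extract()

# With the scripts gone, get_text() returns only the paragraphs' own text
paragraphs = soup.find_all('p')
for paragraph in paragraphs:
    print(paragraph.get_text(strip=True))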
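
To try the approach without the MUFC.html file, here is a minimal self-contained sketch of the same idea. The inline sample markup, the `tracker` variable inside it, and the helper name `paragraph_texts` are all made up for illustration and do not come from the original page; the parser choice (`html.parser`) is also an assumption.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for MUFC.html
html = """
<p>First paragraph.<script>var tracker = 1;</script></p>
<p>Second paragraph.</p>
"""


def paragraph_texts(markup):
    """Return the text of each <p>, with any <script> elements removed first."""
    soup = BeautifulSoup(markup, "html.parser")
    # Detach every <script> element from the tree
    for script in soup.find_all("script"):
        script.extract()
    # Collect the stripped text of each remaining paragraph
    return [p.get_text(strip=True) for p in soup.find_all("p")]


texts = paragraph_texts(html)
print(texts)
```

Note that `extract()` modifies the parsed tree, so the scripts are gone for any later processing of `soup` as well. If the removed tags are never needed again, `decompose()` is a similar call that also destroys them.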
