I have written some code that extracts the text content of paragraphs:
from bs4 import BeautifulSoup

# Parse the saved page, naming the parser explicitly
with open('MUFC.html') as f:
    soup = BeautifulSoup(f, 'html.parser')
a_tag = soup.find_all('p')
# print(a_tag)
for x in a_tag:
    print(x.get_text())
But there are some script tags inside the p tags, something like:
<p>
<script>
.....
</script>
</p>
I don't want their contents in the output. Is there a way to make get_text() ignore certain tags?
1 Solution
#1
First, remove all script tags from the tree, and then get the text:
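from bs4 import BeautifulSoup

with open('MUFC.html') as f:
    soup = BeautifulSoup(f, 'html.parser')
# Detach every <script> from the tree so its contents never reach get_text()
for script in soup.find_all('script'):
    script.extract()
paragraphs = soup.find_all('p')
for paragraph in paragraphs:
    print(paragraph.get_text(strip=True))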
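To try the same idea without the original file, here is a minimal sketch that runs on an inline HTML string. The sample markup, the `paragraph_texts` helper name, and the choice to also drop `style` tags are assumptions made for illustration; they are not part of the question. It uses `decompose()` rather than `extract()`: both remove the tag from the tree, but `extract()` returns the removed tag while `decompose()` discards it, which is enough when the tag is never needed again.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for MUFC.html.
HTML = """
<p>First paragraph.<script>var a = 1;</script></p>
<p><script>track();</script>Second paragraph.</p>
"""


def paragraph_texts(html):
    """Return the text of each <p>, without any <script>/<style> content."""
    soup = BeautifulSoup(html, "html.parser")
    # Remove the unwanted tags (and everything inside them) from the tree.
    for tag in soup.find_all(["script", "style"]):
        tag.decompose()
    # With the tags gone, get_text() only sees the remaining paragraph text.
    return [p.get_text(strip=True) for p in soup.find_all("p")]


texts = paragraph_texts(HTML)
for text in texts:
    print(text)
```

Note that this modifies the parsed tree in place, so if the script tags are needed later, parse the document a second time or keep the tags returned by `extract()` instead.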