I currently have two functions in Python to extract the HTML <body> text and return it as a bag of words. They give equivalent output. I also clean up various tags that would otherwise give me garbage text (e.g. <script> code).
import re
from bs4 import BeautifulSoup, SoupStrainer

def html_to_bow_bs(text):
    if text is None or len(text) == 0:
        return []
    soup = BeautifulSoup(text, "lxml", parse_only=SoupStrainer('body'))
    # Remove all irrelevant tags
    for elem in soup.findAll(['script', 'style', 'a']):
        elem.extract()
    body_text = soup.findAll("body")
    if len(body_text) == 0:
        return []
    # Encoding. Remove extra whitespace and unprintable characters
    the_text = body_text[0].get_text().encode('utf-8')
    the_text = str(the_text)
    the_text = the_text.strip()
    the_text = re.sub(r'[^\x00-\x7F]+', ' ', the_text)
    return [w.lower() for w in the_text.split()]
def html_to_bow_bs_lxml(text):
    if text is None or len(text) == 0:
        return []
    body_re = re.findall('<body(.*?)</body>', text, flags=re.DOTALL)
    if len(body_re) == 0:
        return []
    fragment = body_re[0]
    # Remove irrelevant tags
    fragment = re.sub(r'<script.*?</script>', ' ', fragment, flags=re.DOTALL)
    fragment = re.sub(r'<style.*?</style>', ' ', fragment, flags=re.DOTALL)
    text = "<body" + fragment + "</body>"
    soup = BeautifulSoup(text, "lxml")
    if soup is None:
        return []
    # Remove more irrelevant tags
    for elem in soup.findAll(['a']):
        elem.extract()
    # Encoding. Remove extra whitespace and unprintable characters
    the_text = soup.get_text().encode('utf-8')
    the_text = str(the_text)
    the_text = the_text.strip()
    the_text = re.sub(r'[^\x00-\x7F]+', ' ', the_text)
    return [w.lower() for w in the_text.split()]
My main requirement is matching output: the set of words from html_to_bow_bs_lxml(text) must match that from html_to_bow_bs(text). Currently, both are on a par in running time; for 330 pages, they take about 20 seconds (slow!). If I replace the last soup.findAll(['a'])...extract() in my second function with regexes, I can shave 6 seconds off my time. Replacing BeautifulSoup altogether with lxml.etree can shave an additional 10 seconds, bringing the total run time down to about 3-4 seconds. However:
- when replacing the <a> removal with regexes, the output doesn't always match;
- when replacing BeautifulSoup, either the output doesn't match or my program crashes during processing because of poorly-formed HTML.

How can I increase speed while maintaining correctness?
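For reference, here is a minimal sketch of the direction I mean by "replacing BeautifulSoup altogether" - a reconstruction, not the exact code I benchmarked - using lxml.html, whose HTML parser is more tolerant of poorly-formed markup than a strict lxml.etree parse:

import re
import lxml.html
from lxml import etree

def html_to_bow_lxml(text):
    # Hypothetical lxml-only variant of the functions above.
    if not text:
        return []
    try:
        root = lxml.html.document_fromstring(text)
    except etree.ParserError:  # empty or hopeless input
        return []
    body = root.find('body')
    if body is None:
        return []
    # Same tag clean-up as the BeautifulSoup versions
    for elem in body.xpath('.//script | .//style | .//a'):
        elem.drop_tree()
    the_text = body.text_content().encode('utf-8')
    the_text = re.sub(r'[^\x00-\x7F]+', ' ', the_text)
    return [w.lower() for w in the_text.split()]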
I've seen various recommendations on * for extracting HTML with Python generally, but these date back a few years (e.g. 2012), and there have understandably been many updates to the libraries since then.
(I've also tried pyquery, but it doesn't always extract the body correctly.)
2 Solutions
#1
You've done a lot to make it fast - the soup strainer and the lxml parser are usually the first things to try when optimizing parsing with BeautifulSoup.
Here are some improvements to this particular code.
Remove the body existence check:

body_text = soup.findAll("body")
if len(body_text) == 0:
    return []

and use find() instead.
Replace if text is None or len(text)==0: with just if not text:.
Strip via get_text(strip=True).
The improved code:
def html_to_bow_bs(text):
    if not text:
        return []
    soup = BeautifulSoup(text, "lxml", parse_only=SoupStrainer('body'))
    # Remove all irrelevant tags
    for elem in soup.find_all(['script', 'style', 'a']):
        elem.extract()
    body = soup.find("body")
    if not body:
        return []
    the_text = body.get_text(strip=True).encode('utf-8')
    the_text = re.sub(r'[^\x00-\x7F]+', ' ', the_text)
    return [w.lower() for w in the_text.split()]
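As a quick sanity check - a made-up sample page, not one from the question - both versions should agree on well-formed input:

sample = "<html><body>Hello World<script>var x = 1;</script></body></html>"
print(html_to_bow_bs(sample))       # ['hello', 'world']
print(html_to_bow_bs_lxml(sample))  # ['hello', 'world']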
These are just micro-improvements, and I don't think they are going to change the overall performance picture. What I would also look into:
- running the script via pypy (beautifulsoup4 is compatible, but you would not be able to use the lxml parser - try it with html.parser or html5lib; see the snippet after this list). You might win a lot without even modifying the code at all.
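For example, switching to html.parser - the pure-Python parser that ships with the standard library, so it runs under PyPy - is a one-line change (sketch):

# Pure-Python parser, no C extension required:
soup = BeautifulSoup(text, "html.parser", parse_only=SoupStrainer('body'))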
#2
Using the requests module and bs4
This is the simplest way to print the main text.
import requests
from bs4 import BeautifulSoup

url = "yourUrl"
r = requests.get(url)
soup = BeautifulSoup(r.content, 'lxml')
items = soup.find_all('body')
for item in items:
    print(item.text)
Note: if you print the whole body, it will also print jQuery and JavaScript code in case any is in there; see the snippet below for filtering it out.
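A minimal way to strip that out before printing, reusing the soup object from the snippet above (a sketch):

# Drop script/style tags so only human-readable text remains:
for tag in soup.find_all(['script', 'style']):
    tag.extract()
for item in soup.find_all('body'):
    print(item.get_text())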