So I've decided to parse content from a website; for example, http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx
I want to parse the ingredients into a text file. The ingredients are located in:
<div class="ingredients" style="margin-top: 10px;">
and within this, each ingredient is stored between
<li class="plaincharacterwrap">
Someone was nice enough to provide code using regex, but it gets confusing when you're modifying it from site to site. So I wanted to use Beautiful Soup, since it has a lot of built-in features, except I'm confused about how to actually do it.
Code:
import re
import urllib2, sys
from BeautifulSoup import BeautifulSoup, NavigableString

try:
    html = urllib2.urlopen("http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx")
except IOError:
    # IOError comes from the fetch, not from soup.find()
    print 'IO error'
    sys.exit(1)

soup = BeautifulSoup(html)
# find the div that holds the ingredient list
ingrdiv = soup.find('div', attrs={'class': 'ingredients'})
Is this roughly how you get started? I want to find the actual div class and then parse out all of the ingredients located within the li class.
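Something like this is what I'm imagining for the next step (an untested sketch; I'm guessing at the findAll call, and ingredients.txt is just a placeholder name):

# untested sketch: grab every <li class="plaincharacterwrap"> inside the div
items = ingrdiv.findAll('li', attrs={'class': 'plaincharacterwrap'})
with open('ingredients.txt', 'w') as outf:
    for li in items:
        outf.write(li.getText().strip() + '\n')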
Any help would be appreciated! Thanks!
2 Answers
#1 (4 votes)
import urllib2
import BeautifulSoup

def main():
    url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
    data = urllib2.urlopen(url).read()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
    fname = 'PorkChopsRecipe.txt'
    with open(fname, 'w') as outf:
        outf.write('\n'.join(ingreds))

if __name__=="__main__":
    main()
results in
1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste
Follow-up response to @eyquem:
from time import clock
import urllib
import re
import BeautifulSoup
import lxml.html
start = clock()
url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()
print "Loading took", (clock()-start), "s"
# by regex
start = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
print "Regex parse took", (clock()-start), "s"
# by BeautifulSoup
start = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
res2 = '\n'.join(s.getText().strip() for s in ingreds.findAll('li'))
print "BeautifulSoup parse took", (clock()-start), "s - same =", (res2==res1)
# by lxml
start = clock()
lx = lxml.html.fromstring(data)
ingreds = lx.xpath('//div[@class="ingredients"]//li/text()')
res3 = '\n'.join(s.strip() for s in ingreds)
print "lxml parse took", (clock()-start), "s - same =", (res3==res1)
gives
Loading took 1.09091222621 s
Regex parse took 0.000432703726233 s
BeautifulSoup parse took 0.28126133314 s - same = True
lxml parse took 0.0100940499505 s - same = True
Regex is much faster (except when it's wrong), but if you consider loading the page and parsing it together, BeautifulSoup is still only about 20% of the total runtime. If you are terribly concerned about speed, I recommend lxml instead.
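To illustrate the "except when it's wrong" caveat with a made-up fragment: if the site switches quote style or drops the \r\n line breaks, the pattern silently returns nothing, while the parser-based versions keep working. A tiny sketch (the HTML string here is hypothetical):

import re
import BeautifulSoup

# hypothetical markup: same content, but single quotes and no \r\n
snippet = "<div class='ingredients'><li class='plaincharacterwrap'>1 cup chicken broth</li></div>"

pat = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
print pat.findall(snippet)        # [] -- the regex misses, silently

bs = BeautifulSoup.BeautifulSoup(snippet)
print [li.getText().strip() for li in bs.findAll('li')]   # [u'1 cup chicken broth']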
#2 (2 votes)
Yes, a special regex pattern must be written for every site.
But I think that:
1- the processing done with Beautiful Soup must be adapted to every site, too.
2- regexes are not so complicated to write, and with a little practice, it can be done quickly.
I am curious to see what kind of processing must be done with Beautiful Soup to obtain the same results that I obtained in a few minutes. Once upon a time, I tried to learn Beautiful Soup, but I couldn't make sense of the mess. I should try again; now I am a little more skilled in Python. But regexes have been OK and sufficient for me until now.
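To illustrate point 1 with a made-up example: on a site with different markup, the Beautiful Soup calls would have to change in just the same way the regex would (the ul/id structure below is hypothetical):

import BeautifulSoup

# hypothetical markup from some other site
page = '<ul id="ingredient-list"><li> 2 eggs </li><li> 1 cup flour </li></ul>'
bs = BeautifulSoup.BeautifulSoup(page)
ul = bs.find('ul', {'id': 'ingredient-list'})
print [li.getText().strip() for li in ul.findAll('li')]   # [u'2 eggs', u'1 cup flour']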
Here's the code for this new site:
import urllib
import re

url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
sock = urllib.urlopen(url)
ch = sock.read()
sock.close()

# jump to the ingredients section first, then scan with the regex from there
x = ch.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
print '\n'.join(patingr.findall(ch, x))
EDIT
I downloaded and installed BeautifulSoup and ran a comparison with regex.
I don't think I made any error in my comparison code:
import urllib
import re
from time import clock
import BeautifulSoup
url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()
te = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
t1 = clock()-te
te = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
res2 = '\n'.join(ingreds)
t2 = clock()-te
print res1
print
print res2
print
print 'res1==res2 is ',res1==res2
print '\nRegex :',t1
print '\nBeautifulSoup :',t2
print '\nBeautifulSoup execution time / Regex execution time ==',t2/t1
result
1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste
1/4 cup olive oil
1 cup chicken broth
2 cloves garlic, minced
1 tablespoon paprika
1 tablespoon garlic powder
1 tablespoon poultry seasoning
1 teaspoon dried oregano
1 teaspoon dried basil
4 thick cut boneless pork chops
salt and pepper to taste
res1==res2 is True
Regex : 0.00210892725193
BeautifulSoup : 2.32453566026
BeautifulSoup execution time / Regex execution time == 1102.23605776
No comment!
EDIT 2
I realized that in my code I don't use a regex alone; I employ a method that uses a regex together with find().
It's the method I use when I resort to regexes, because in some cases it speeds up the processing, thanks to the function find(), which runs extremely rapidly.
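In general form, the method looks like this (a sketch; the helper name findall_after is mine):

import re

def findall_after(pat, text, marker):
    # str.find() runs extremely fast; use it to jump straight to the
    # region of interest, then let the regex scan only from that offset on
    pos = text.find(marker)
    if pos == -1:
        return []              # marker absent: no point scanning
    return pat.findall(text, pos)

# usage, with the pattern from the snippets below:
# patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
# print '\n'.join(findall_after(patingr, data, 'Ingredients</h3>'))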
To know what we are comparing, we need the following snippets.
In snippets 3 and 4, I took into account Achim's remarks from another thread of posts: using re.IGNORECASE and re.DOTALL, and ["\'] instead of ".
These snippets are kept separate because they must be executed in different files to obtain reliable results: I don't know why, but if all of them are executed in the same file, certain resulting times are strongly different (0.00075 instead of 0.0022, for example).
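For what it's worth, the stdlib timeit module is built for exactly this kind of micro-benchmark and sidesteps some of these effects; a sketch of how the first snippet could be timed with it (assuming Python 2.6+ and that data and patingr are defined at module level):

import timeit

# run the scan 100 times and average, instead of trusting a single clock() delta;
# the setup string pulls patingr and data in from the calling module
t = timeit.timeit("patingr.findall(data)",
                  setup="from __main__ import patingr, data",
                  number=100)
print "mean per run:", t / 100, "s"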
import urllib
import re
import BeautifulSoup
from time import clock
url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
data = urllib.urlopen(url).read()
# Simple regex , without x
te = clock()
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res0 = '\n'.join(patingr.findall(data))
t0 = clock()-te
print '\nSimple regex , without x :',t0
and
# Simple regex , with x
te = clock()
x = data.find('Ingredients</h3>')
patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
res1 = '\n'.join(patingr.findall(data,x))
t1 = clock()-te
print '\nSimple regex , with x :',t1
and
# Regex with flags , without x and y
te = clock()
patingr = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                     flags=re.DOTALL|re.IGNORECASE)
res10 = '\n'.join(patingr.findall(data))
t10 = clock()-te
print '\nRegex with flags , without x and y :',t10
and
# Regex with flags , with x and y
te = clock()
x = data.find('Ingredients</h3>')
y = data.find('h3>\r\n Footnotes</h3>\r\n')
patingr = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                     flags=re.DOTALL|re.IGNORECASE)
res11 = '\n'.join(patingr.findall(data,x,y))
t11 = clock()-te
print '\nRegex with flags , with x and y :',t11
and
# BeautifulSoup
te = clock()
bs = BeautifulSoup.BeautifulSoup(data)
ingreds = bs.find('div', {'class': 'ingredients'})
ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
res2 = '\n'.join(ingreds)
t2 = clock()-te
print '\nBeautifulSoup :',t2
result
Simple regex , without x : 0.00230488284125
Simple regex , with x : 0.00229121279385
Regex with flags , without x and y : 0.00758719458758
Regex with flags , with x and y : 0.00183724493364
BeautifulSoup : 2.58728860791
The use of x has no influence on the speed for a simple regex.
The regex with flags, without x and y, takes longer to execute, and its result isn't the same as the others', because it catches a supplementary chunk of text. That's why, in a real application, it is the regex with flags and with x/y that should be used.
The more complicated regex with flags and with x and y runs about 20% faster.
Well, the results are not very much changed, with or without x/y.
So my conclusion is the same:
the use of a regex, resorting to find() or not, remains roughly 1000 times faster than BeautifulSoup, and I estimate it is 100 times faster than lxml (I didn't install lxml).
To what you wrote, Hugh, I would say:
When a regex is wrong, it is neither faster nor slower. It doesn't run.
When a regex is wrong, the coder makes it right, that's all.
I don't understand why 95% of the people on *.com want to persuade the other 5% that regexes must not be employed to analyse HTML or XML or anything else. I say "analyse", not "parse". As far as I understand it, a parser first analyses the WHOLE of a text and then displays the content of the elements we want. A regex, on the contrary, goes straight to what is searched for; it doesn't build a tree of the HTML/XML text, or do whatever else a parser does that I don't know very well.
So, I am very satisfied with regexes. I have no problem writing even very long REs, and regexes let me run programs that must react rapidly after analysing a text. BS or lxml would work, but that would be a hassle.
I would have other comments to make, but I have no time for a subject in which, in fact, I let others do as they prefer.