Python: using Beautiful Soup to process HTML for specific content

Date: 2022-11-29 12:25:08

I decided to parse content from a website. For example,


I want to parse the ingredients into a text file. The ingredients are located in:


<div class="ingredients" style="margin-top: 10px;">

and within this, each ingredient is stored between


<li class="plaincharacterwrap">

  • Someone was nice enough to provide code using regex, but it gets confusing when you are modifying it from site to site. So I wanted to use Beautiful Soup, since it has a lot of built-in features. Except I am confused about how to actually do it.



    import re
    import urllib2, sys
    from BeautifulSoup import BeautifulSoup, NavigableString
    html = urllib2.urlopen("")
    soup = BeautifulSoup(html)
    try:
            ingrdiv = soup.find('div', attrs={'class': 'ingredients'})
    except IOError:
            print 'IO error'

    Is this kind of how you get started? I want to find the actual div class and then parse out all those ingredients located within the li class.


    Any help would be appreciated! Thanks!


    2 answers



    import urllib2
    import BeautifulSoup
    def main():
        url = ""
        data = urllib2.urlopen(url).read()
        bs = BeautifulSoup.BeautifulSoup(data)
        ingreds = bs.find('div', {'class': 'ingredients'})
        ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
        fname = 'PorkChopsRecipe.txt'
        with open(fname, 'w') as outf:
            # one ingredient per line
            outf.write('\n'.join(ingreds))
    if __name__=="__main__":
        main()

    results in


    1/4 cup olive oil
    1 cup chicken broth
    2 cloves garlic, minced
    1 tablespoon paprika
    1 tablespoon garlic powder
    1 tablespoon poultry seasoning
    1 teaspoon dried oregano
    1 teaspoon dried basil
    4 thick cut boneless pork chops
    salt and pepper to taste
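For readers who do not have Beautiful Soup installed, the same two-step idea (find the div with class "ingredients", then collect the text of each li inside it) can be sketched with Python 3's standard-library html.parser. This is not the answer's code: the IngredientParser class, the extract_ingredients helper, and the SAMPLE_HTML snippet are made up for illustration, and the snippet only imitates the markup described in the question.

```python
from html.parser import HTMLParser

# Hypothetical snippet shaped like the markup described in the question.
SAMPLE_HTML = """
<div class="ingredients" style="margin-top: 10px;">
  <ul>
    <li class="plaincharacterwrap">
        1/4 cup olive oil</li>
    <li class="plaincharacterwrap">
        1 cup chicken broth</li>
  </ul>
</div>
<div class="footnotes"><ul><li>Not an ingredient</li></ul></div>
"""


class IngredientParser(HTMLParser):
    """Collect the text of every <li> inside <div class="ingredients">."""

    def __init__(self):
        super().__init__()
        self.div_depth = 0   # > 0 while inside the ingredients div
        self.in_li = False   # True while inside an <li> of that div
        self.items = []

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            classes = (dict(attrs).get("class") or "").split()
            # Count nested divs too, so the matching </div> is found.
            if self.div_depth or "ingredients" in classes:
                self.div_depth += 1
        elif tag == "li" and self.div_depth:
            self.in_li = True
            self.items.append("")

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "li":
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.items[-1] += data


def extract_ingredients(html):
    parser = IngredientParser()
    parser.feed(html)
    parser.close()
    return [item.strip() for item in parser.items]


if __name__ == "__main__":
    print("\n".join(extract_ingredients(SAMPLE_HTML)))
```

The list item in the second div is skipped because it sits outside the ingredients div, which is the same scoping that bs.find(...) followed by findAll('li') gives in the answer.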


    Follow-up response to @eyquem:


    from time import clock
    import urllib
    import re
    import BeautifulSoup
    import lxml.html
    start = clock()
    url = ''
    data = urllib.urlopen(url).read()
    print "Loading took", (clock()-start), "s"
    # by regex
    start = clock()
    x = data.find('Ingredients</h3>')
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    res1 = '\n'.join(patingr.findall(data,x))
    print "Regex parse took", (clock()-start), "s"
    # by BeautifulSoup
    start = clock()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    res2 = '\n'.join(s.getText().strip() for s in ingreds.findAll('li'))
    print "BeautifulSoup parse took", (clock()-start), "s  - same =", (res2==res1)
    # by lxml
    start = clock()
    lx = lxml.html.fromstring(data)
    ingreds = lx.xpath('//div[@class="ingredients"]//li/text()')
    res3 = '\n'.join(s.strip() for s in ingreds)
    print "lxml parse took", (clock()-start), "s  - same =", (res3==res1)


    Loading took 1.09091222621 s
    Regex parse took 0.000432703726233 s
    BeautifulSoup parse took 0.28126133314 s  - same = True
    lxml parse took 0.0100940499505 s  - same = True

    Regex is much faster (except when it's wrong); but if you consider loading the page and parsing it together, BeautifulSoup is still only 20% of the runtime. If you are terribly concerned about speed, I recommend lxml instead.
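The timing script above relies on time.clock, which was removed in Python 3.8. A minimal sketch of the same kind of measurement on Python 3, using time.perf_counter, might look like the following. The timed and by_regex helpers and the SAMPLE string are hypothetical stand-ins (the real page is not reproduced here), and taking the best of several runs is a design choice of this sketch, not something the answer did.

```python
import re
from time import perf_counter

# Hypothetical stand-in for the downloaded page.
SAMPLE = (
    '<h3>Ingredients</h3>\r\n'
    '<li class="plaincharacterwrap">\r\n    1/4 cup olive oil</li>\r\n'
    '<li class="plaincharacterwrap">\r\n    1 cup chicken broth</li>\r\n'
)

PATTERN = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')


def timed(func, *args, repeat=5):
    """Run func(*args) `repeat` times; return (last result, best elapsed seconds)."""
    best = float("inf")
    result = None
    for _ in range(repeat):
        start = perf_counter()
        result = func(*args)
        best = min(best, perf_counter() - start)
    return result, best


def by_regex(data):
    # Same approach as the script above: skip ahead with find(), then findall().
    start_at = data.find('Ingredients</h3>')
    return PATTERN.findall(data, start_at)


if __name__ == "__main__":
    items, seconds = timed(by_regex, SAMPLE)
    print(items)
    print("Regex parse took", seconds, "s")
```

Any other extraction function with the same signature can be passed to timed in place of by_regex to compare approaches on equal terms.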




    Yes, a special regex pattern must be written for every site.


    But I think that

    1- the processing done with Beautiful Soup must be adapted to every site, too.

    2- regexes are not so complicated to write, and with a little practice it can be done quickly.


    I am curious to see what kind of processing must be done with Beautiful Soup to obtain the same results that I obtained in a few minutes. Once upon a time, I tried to learn Beautiful Soup, but I didn't understand anything in that mess. I should try again, now that I am a little more skilled in Python. But regexes have been OK and sufficient for me until now.

    Here's the code for this new site:


    import urllib
    import re
    url = ''
    sock = urllib.urlopen(url)
    ch = sock.read()
    x = ch.find('Ingredients</h3>')
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    print '\n'.join(patingr.findall(ch,x))



    I downloaded and installed BeautifulSoup and ran a comparison with regex.


    I don't think I made any errors in my comparison code:


    import urllib
    import re
    from time import clock
    import BeautifulSoup
    url = ''
    data = urllib.urlopen(url).read()
    te = clock()
    x = data.find('Ingredients</h3>')
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    res1 = '\n'.join(patingr.findall(data,x))
    t1 = clock()-te
    te = clock()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
    res2 = '\n'.join(ingreds)
    t2 = clock()-te
    print res1
    print res2
    print 'res1==res2 is ',res1==res2
    print '\nRegex :',t1
    print '\nBeautifulSoup :',t2
    print '\nBeautifulSoup execution time / Regex execution time ==',t2/t1



    1/4 cup olive oil
    1 cup chicken broth
    2 cloves garlic, minced
    1 tablespoon paprika
    1 tablespoon garlic powder
    1 tablespoon poultry seasoning
    1 teaspoon dried oregano
    1 teaspoon dried basil
    4 thick cut boneless pork chops
    salt and pepper to taste
    1/4 cup olive oil
    1 cup chicken broth
    2 cloves garlic, minced
    1 tablespoon paprika
    1 tablespoon garlic powder
    1 tablespoon poultry seasoning
    1 teaspoon dried oregano
    1 teaspoon dried basil
    4 thick cut boneless pork chops
    salt and pepper to taste
    res1==res2 is  True
    Regex : 0.00210892725193
    BeautifulSoup : 2.32453566026
    BeautifulSoup execution time / Regex execution time == 1102.23605776

    No comment!


    EDIT 2

    I realized that my code doesn't use a regex alone; it uses a method that combines a regex with find().


    It's the method I use when I resort to regexes, because it speeds up processing in some cases, thanks to find(), which runs extremely fast.


    To know what we are comparing, we need the following code snippets.


    In snippets 3 and 4, I took into account remarks made by Achim in another thread: using re.IGNORECASE and re.DOTALL, and ["\'] instead of ".


    These snippets are kept separate because they must be executed in different files to obtain reliable results: I don't know why, but if they are all executed in the same file, some of the resulting times are very different (0.00075 instead of 0.0022, for example).
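One possible explanation for the same-file difference (an assumption, not something checked against the original page or timings): every timed section includes its own re.compile call, and the re module keeps a cache of compiled patterns, so compiling an identical pattern a second time in the same process costs far less than the first time. The sketch below separates compile time from everything else; compile_seconds is a hypothetical helper, and re.purge is used only to make the first call start from an empty cache.

```python
import re
from time import perf_counter

PATTERN_TEXT = '<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n'


def compile_seconds(pattern_text):
    """Return (compiled pattern, seconds spent inside re.compile)."""
    start = perf_counter()
    compiled = re.compile(pattern_text)
    return compiled, perf_counter() - start


if __name__ == "__main__":
    re.purge()  # empty the module-level cache so the first call really compiles
    first, cold = compile_seconds(PATTERN_TEXT)
    second, warm = compile_seconds(PATTERN_TEXT)
    print("cold compile:", cold, "s")
    print("warm compile:", warm, "s")
    print("same object returned from cache:", first is second)
```

If this is the cause, compiling each pattern once outside the timed region would make single-file and separate-file measurements comparable.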


    import urllib
    import re
    import BeautifulSoup
    from time import clock
    url = ''
    data = urllib.urlopen(url).read()
    # Simple regex , without x
    te = clock()
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    res0 = '\n'.join(patingr.findall(data))
    t0 = clock()-te
    print '\nSimple regex , without x :',t0


    # Simple regex , with x
    te = clock()
    x = data.find('Ingredients</h3>')
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    res1 = '\n'.join(patingr.findall(data,x))
    t1 = clock()-te
    print '\nSimple regex , with x :',t1


    # Regex with flags , without x and y
    te = clock()
    patingr = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                         re.DOTALL|re.IGNORECASE)
    res10 = '\n'.join(patingr.findall(data))
    t10 = clock()-te
    print '\nRegex with flags , without x and y :',t10


    # Regex with flags , with x and y 
    te = clock()
    x = data.find('Ingredients</h3>')
    y = data.find('h3>\r\n                    Footnotes</h3>\r\n')
    patingr = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                         re.DOTALL|re.IGNORECASE)
    res11 = '\n'.join(patingr.findall(data,x,y))
    t11 = clock()-te
    print '\nRegex with flags , with x and y :',t11


    # BeautifulSoup
    te = clock()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
    res2 = '\n'.join(ingreds)
    t2 = clock()-te
    print '\nBeautifulSoup                      :',t2



    Simple regex , without x           : 0.00230488284125
    Simple regex , with x              : 0.00229121279385
    Regex with flags , without x and y : 0.00758719458758
    Regex with flags , with x and y    : 0.00183724493364
    BeautifulSoup                      : 2.58728860791

    The use of x has no influence on the speed for a simple regex.


    The regex with flags, without x and y, takes longer to execute, and its result isn't the same as the others, because it catches an extra chunk of text. That's why, in a real application, it is the regex with flags and x/y that should be used.
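To illustrate why the x/y bounds change the result, here is a small Python 3 sketch. The PAGE string is invented (the real page is not available here) and simply assumes a footnotes list that reuses the same li class; the ingredients helper is hypothetical. The pos and endpos arguments of a compiled pattern's findall restrict the search to the slice between the two find() positions.

```python
import re

# Hypothetical page: an ingredients list followed by a footnotes list that
# happens to use the same <li> class.
PAGE = (
    '<h3>Ingredients</h3>\r\n'
    '<li class="plaincharacterwrap">\r\n    1/4 cup olive oil</li>\r\n'
    "<LI CLASS='plaincharacterwrap'>\r\n    1 cup chicken broth</LI>\r\n"
    '<h3>Footnotes</h3>\r\n'
    '<li class="plaincharacterwrap">\r\n    Nutrition data is approximate</li>\r\n'
)

ITEM = re.compile(
    '<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
    re.DOTALL | re.IGNORECASE,
)


def ingredients(page):
    x = page.find('Ingredients</h3>')   # start of the search window
    y = page.find('Footnotes</h3>')     # end of the search window
    if x == -1:
        return []
    if y == -1:
        y = len(page)                   # no footnotes: search to the end
    return ITEM.findall(page, x, y)


if __name__ == "__main__":
    print(ITEM.findall(PAGE))   # unbounded: also picks up the footnote
    print(ingredients(PAGE))    # bounded: ingredients only
```

The -1 checks matter because str.find returns -1 when the marker is missing, and passing -1 as endpos would silently produce an empty search window.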

    The more complicated regex with flags and with x and y takes about 20% less time.


    Well, the results don't change very much, with or without x/y.

    So my conclusion is the same:


    the use of a regex, whether or not it resorts to find(), remains roughly 1000 times faster than BeautifulSoup, and I estimate 100 times faster than lxml (I didn't install lxml).



    To what you wrote, Hugh, I would say:


    When a regex is wrong, it is neither faster nor slower. It doesn't run.


    When a regex is wrong, the coder fixes it, that's all.
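One case is worth separating from a pattern that fails to compile: a pattern that is valid but no longer matches the markup still runs, and findall simply returns an empty list. The sketch below shows this with the thread's pattern on Python 3. The CHANGED string is a hypothetical variation of the markup (Unix line endings and an extra attribute), invented for illustration.

```python
import re

PATTERN = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')

ORIGINAL = '<li class="plaincharacterwrap">\r\n    salt and pepper to taste</li>\r\n'
# Hypothetical variation: same content, but "\n" line endings and an extra attribute.
CHANGED = '<li id="i10" class="plaincharacterwrap">\n    salt and pepper to taste</li>\n'


def extract(html):
    """Return the captured item texts; an empty list means nothing matched."""
    return PATTERN.findall(html)


print(extract(ORIGINAL))  # the item is captured
print(extract(CHANGED))   # no exception is raised, but nothing is captured
```

A scraper built this way may want to treat an empty result as a signal to re-check the pattern against the current page.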


    I don't understand why 95% of the people on *.com want to persuade the other 5% that regexes must not be used to analyse HTML or XML or anything else. I say "analyse", not "parse". As far as I understand it, a parser first analyses the WHOLE of a text and then displays the content of the elements we want. By contrast, a regex goes straight to what is searched for; it doesn't build the tree of the HTML/XML text, or do whatever else a parser does that I don't know very well.

    So, I am very satisfied with regexes. I have no problem writing even very long REs, and regexes let me run programs that must react quickly after analysing a text. BS or lxml would work, but that would be a hassle.

    I would have other comments to make, but I have no time for a subject in which, in fact, I let others do as they prefer.



