Python: processing specific HTML content with Beautiful Soup

Date: 2022-11-29 12:25:08

I've decided to parse content from a website, for example http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx

I want to parse the ingredients into a text file. The ingredients are located in:

<div class="ingredients" style="margin-top: 10px;">

and within it, each ingredient is wrapped in

<li class="plaincharacterwrap">
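
Putting those pieces together, the relevant markup presumably looks something like this (a sketch, not the exact page source; the <ul> wrapper is an assumption):

    <div class="ingredients" style="margin-top: 10px;">
        <ul>
            <li class="plaincharacterwrap">1/4 cup olive oil</li>
            <li class="plaincharacterwrap">1 cup chicken broth</li>
            <!-- ... more ingredients ... -->
        </ul>
    </div>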

  • Someone was nice enough to provide code using regex, but it gets confusing when you are modifying it from site to site. So I wanted to use Beautiful Soup, since it has a lot of built-in features, except I'm confused about how to actually do it.

    Code:

    import urllib2
    from BeautifulSoup import BeautifulSoup

    try:
        # the network call is what can actually raise IOError
        html = urllib2.urlopen("http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx").read()
    except IOError:
        print 'IO error'
        raise SystemExit

    soup = BeautifulSoup(html)
    ingrdiv = soup.find('div', attrs={'class': 'ingredients'})

    Is this roughly how you get started? I want to find the div by its class and then parse out all the ingredients located within the li elements.

    Any help would be appreciated! Thanks!

    2 Answers

    #1 (4 votes)

    import urllib2
    import BeautifulSoup
    
    def main():
        url = "http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx"
        data = urllib2.urlopen(url).read()
        bs = BeautifulSoup.BeautifulSoup(data)
    
        ingreds = bs.find('div', {'class': 'ingredients'})
        ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
    
        fname = 'PorkChopsRecipe.txt'
        with open(fname, 'w') as outf:
            outf.write('\n'.join(ingreds))
    
    if __name__=="__main__":
        main()
    

    results in

    1/4 cup olive oil
    1 cup chicken broth
    2 cloves garlic, minced
    1 tablespoon paprika
    1 tablespoon garlic powder
    1 tablespoon poultry seasoning
    1 teaspoon dried oregano
    1 teaspoon dried basil
    4 thick cut boneless pork chops
    salt and pepper to taste
    

    .
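
    If you're on a newer Python stack, the same extraction can be written with BeautifulSoup 4 (bs4); this is a sketch that assumes the page still serves the same markup (div.ingredients wrapping li.plaincharacterwrap):

    from urllib.request import urlopen
    from bs4 import BeautifulSoup

    url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
    soup = BeautifulSoup(urlopen(url).read(), 'html.parser')

    # same structure as above: find the ingredients div, then its <li> items
    div = soup.find('div', class_='ingredients')
    ingreds = [li.get_text(strip=True) for li in div.find_all('li')]

    with open('PorkChopsRecipe.txt', 'w') as outf:
        outf.write('\n'.join(ingreds))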


    Follow-up response to @eyquem:

    from time import clock
    import urllib
    import re
    import BeautifulSoup
    import lxml.html
    
    start = clock()
    url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
    data = urllib.urlopen(url).read()
    print "Loading took", (clock()-start), "s"
    
    # by regex
    start = clock()
    x = data.find('Ingredients</h3>')
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    res1 = '\n'.join(patingr.findall(data,x))
    print "Regex parse took", (clock()-start), "s"
    
    # by BeautifulSoup
    start = clock()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    res2 = '\n'.join(s.getText().strip() for s in ingreds.findAll('li'))
    print "BeautifulSoup parse took", (clock()-start), "s  - same =", (res2==res1)
    
    # by lxml
    start = clock()
    lx = lxml.html.fromstring(data)
    ingreds = lx.xpath('//div[@class="ingredients"]//li/text()')
    res3 = '\n'.join(s.strip() for s in ingreds)
    print "lxml parse took", (clock()-start), "s  - same =", (res3==res1)
    

    gives

    Loading took 1.09091222621 s
    Regex parse took 0.000432703726233 s
    BeautifulSoup parse took 0.28126133314 s  - same = True
    lxml parse took 0.0100940499505 s  - same = True
    

    Regex is much faster (except when it's wrong); but if you consider loading the page and parsing it together, parsing with BeautifulSoup still accounts for only about 20% of the total runtime. If you are terribly concerned about speed, I recommend lxml instead.
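
    For completeness, here is the lxml variant from the benchmark as a standalone script (a sketch; it reuses the same XPath as above):

    import urllib
    import lxml.html

    url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
    lx = lxml.html.fromstring(urllib.urlopen(url).read())

    # one XPath selects the text of every <li> under the ingredients div
    for s in lx.xpath('//div[@class="ingredients"]//li/text()'):
        print s.strip()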

    #2 (2 votes)

    Yes, a special regex pattern must be written for every site.

    But I think that

    1- the processing done with Beautiful Soup must be adapted to every site, too;

    2- regexes are not so complicated to write, and with a little practice it can be done quickly.

    I am curious to see what kind of processing must be done with Beautiful Soup to obtain the same results that I got in a few minutes. Once upon a time I tried to learn Beautiful Soup, but I didn't understand anything of that mess. I should try again; I am a little more skilled in Python now. But regexes have been OK and sufficient for me so far.

    Here's the code for this new site:

    import urllib
    import re
    
    url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
    
    sock = urllib.urlopen(url)
    ch = sock.read()
    sock.close()
    
    x = ch.find('Ingredients</h3>')  # offset of the heading; passed to findall() as its pos argument
    
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    
    print '\n'.join(patingr.findall(ch,x))
    

    .

    EDIT

    I downloaded and installed BeautifulSoup and ran a comparison with regex.

    I don't think I made any errors in my comparison code.

    import urllib
    import re
    from time import clock
    import BeautifulSoup
    
    url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
    data = urllib.urlopen(url).read()
    
    
    te = clock()
    x = data.find('Ingredients</h3>')
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    res1 = '\n'.join(patingr.findall(data,x))
    t1 = clock()-te
    
    te = clock()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
    res2 = '\n'.join(ingreds)
    t2 = clock()-te
    
    print res1
    print
    print res2
    print
    print 'res1==res2 is ',res1==res2
    
    print '\nRegex :',t1
    print '\nBeautifulSoup :',t2
    print '\nBeautifulSoup execution time / Regex execution time ==',t2/t1
    

    result

    1/4 cup olive oil
    1 cup chicken broth
    2 cloves garlic, minced
    1 tablespoon paprika
    1 tablespoon garlic powder
    1 tablespoon poultry seasoning
    1 teaspoon dried oregano
    1 teaspoon dried basil
    4 thick cut boneless pork chops
    salt and pepper to taste
    
    1/4 cup olive oil
    1 cup chicken broth
    2 cloves garlic, minced
    1 tablespoon paprika
    1 tablespoon garlic powder
    1 tablespoon poultry seasoning
    1 teaspoon dried oregano
    1 teaspoon dried basil
    4 thick cut boneless pork chops
    salt and pepper to taste
    
    res1==res2 is  True
    
    Regex : 0.00210892725193
    
    BeautifulSoup : 2.32453566026
    
    BeautifulSoup execution time / Regex execution time == 1102.23605776
    

    No comment!

    .

    EDIT 2

    I realized that in my code I don't use a regex alone; I employ a method that uses a regex together with find().

    It's the method I use when I resort to regexes, because in some cases it speeds up the processing. That is thanks to the function find(), which runs extremely fast.
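
    The idea, as a minimal self-contained sketch (with toy data, not the real page): str.find() does a fast substring scan, and its result can be handed to the compiled pattern's findall() as the pos argument, so the regex only has to match from that offset onwards:

    import re

    data = 'preamble ... Ingredients</h3> <li>oil</li> <li>broth</li>'
    pat = re.compile('<li>(.+?)</li>')

    x = data.find('Ingredients</h3>')  # cheap scan for the anchor string
    print pat.findall(data, x)         # matching starts at offset x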

    To know what we are comparing, we need the following pieces of code.

    In codes 3 and 4, I took into account the remarks of Achim in another thread: using re.IGNORECASE and re.DOTALL, and ["\'] instead of ".

    These pieces of code are kept separate because they must be executed in different files to obtain reliable results: I don't know why, but if all of them are executed in the same file, some of the resulting times are strongly different (0.00075 instead of 0.0022, for example).
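
    (A plausible remedy for such unstable one-shot measurements is Python's timeit module, which repeats the statement and averages out warm-up and scheduling noise. A sketch, assuming data, patingr and x are defined as in the snippets below:)

    import timeit

    # repeat the regex extraction many times instead of taking a single clock() delta
    t = timeit.timeit('patingr.findall(data, x)',
                      setup='from __main__ import patingr, data, x',
                      number=1000)
    print 'average per call:', t / 1000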

    import urllib
    import re
    import BeautifulSoup
    from time import clock
    
    url = 'http://allrecipes.com/Recipe/Slow-Cooker-Pork-Chops-II/Detail.aspx'
    data = urllib.urlopen(url).read()
    
    # Simple regex , without x
    te = clock()
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    res0 = '\n'.join(patingr.findall(data))
    t0 = clock()-te
    
    print '\nSimple regex , without x :',t0
    

    and

    # Simple regex , with x
    te = clock()
    x = data.find('Ingredients</h3>')
    patingr = re.compile('<li class="plaincharacterwrap">\r\n +(.+?)</li>\r\n')
    res1 = '\n'.join(patingr.findall(data,x))
    t1 = clock()-te
    
    print '\nSimple regex , with x :',t1
    

    and

    # Regex with flags , without x and y
    te = clock()
    patingr = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                         flags=re.DOTALL|re.IGNORECASE)
    res10 = '\n'.join(patingr.findall(data))
    t10 = clock()-te
    
    print '\nRegex with flags , without x and y :',t10
    

    and

    # Regex with flags , with x and y 
    te = clock()
    x = data.find('Ingredients</h3>')
    y = data.find('h3>\r\n                    Footnotes</h3>\r\n')
    patingr = re.compile('<li class=["\']plaincharacterwrap["\']>\r\n +(.+?)</li>\r\n',
                         flags=re.DOTALL|re.IGNORECASE)
    res11 = '\n'.join(patingr.findall(data,x,y))
    t11 = clock()-te
    
    print '\nRegex with flags , with x and y    :',t11
    

    and

    # BeautifulSoup
    te = clock()
    bs = BeautifulSoup.BeautifulSoup(data)
    ingreds = bs.find('div', {'class': 'ingredients'})
    ingreds = [s.getText().strip() for s in ingreds.findAll('li')]
    res2 = '\n'.join(ingreds)
    t2 = clock()-te
    
    print '\nBeautifulSoup                      :',t2
    

    result

    Simple regex , without x           : 0.00230488284125
    
    Simple regex , with x              : 0.00229121279385
    
    Regex with flags , without x and y : 0.00758719458758
    
    Regex with flags , with x and y    : 0.00183724493364
    
    BeautifulSoup                      : 2.58728860791
    

    The use of x has no influence on the speed for a simple regex.

    The regex with flags, without x and y, takes longer to execute, and its result isn't the same as the others, because it catches an extra chunk of text. That's why, in a real application, the regex with flags and x/y is the one that should be used.

    The more complicated regex with flags and with x and y takes about 20% less time.

    Well, the results don't change very much, with or without x/y.

    So my conclusion is the same:

    the use of a regex, whether resorting to find() or not, remains roughly 1000 times faster than BeautifulSoup, and I estimate roughly 100 times faster than lxml (I didn't install lxml).

    .

    In reply to what you wrote, Hugh, I would say:

    When a regex is wrong, it is neither faster nor slower. It simply doesn't run.

    When a regex is wrong, the coder makes it right, that's all.

    I don't understand why 95% of the people on *.com want to persuade the other 5% that regexes must not be employed to analyse HTML or XML or anything else. I say "analyse", not "parse". As far as I understand it, a parser first analyses the WHOLE of a text and then exposes the content of the elements we want. A regex, on the contrary, goes straight to what is being searched for; it doesn't build a tree of the HTML/XML text, or do whatever else a parser does that I don't know very well.

    So, I am very satisfied with regexes. I have no problem writing even very long REs, and regexes let me write programs that must react rapidly after analysing a text. BS or lxml would work, but it would be a hassle.

    I would have other comments to make, but I have no time for a subject in which, in fact, I let others do as they prefer.
