Best library to parse HTML with Python 3 and example?

Date: 2022-03-13 15:46:15

I'm completely new to Python and am using Python 3.1 on Windows (pywin). I need to parse some HTML, essentially to extract values between specific HTML tags, and I'm confused by my array of options; everything I find is suited for Python 2.x. I've read raves about Beautiful Soup, HTML5Lib and lxml, but I cannot figure out how to install any of these on Windows.

Questions:

  1. What HTML parser do you recommend?
  2. How do I install it? (Be gentle, I'm completely new to Python and remember I'm on Windows)
  3. Do you have a simple example on how to use the recommended library to snag HTML from a specific URL and return a value out of, say, something like this:

    <div class="foo"><table><tr><td>foo</td></tr></table><a class="link" href='/blahblah'>Link</a></div>

(say we want to return "/blahblah")

5 solutions

#1


5  

Web-scraping in Python 3 is currently very poorly supported; all the decent libraries work only with Python 2. If you must web scrape in Python, use Python 2.

Although Beautiful Soup is oft recommended (every question regarding web scraping with Python in Stack Overflow suggests it), it's not as good for Python 3 as it is for Python 2; I couldn't even install it as the installation code was still Python 2.

As for adequate and simple-to-install solutions for Python 3, you can try the standard library's HTML parser (the html.parser module); although quite barebones, it comes with Python 3.
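
To make that concrete, here is a minimal sketch of pulling "/blahblah" out of the snippet from the question using only html.parser. The class name LinkHrefParser and the SAMPLE string are made up for this illustration; they are not part of the library.

```python
from html.parser import HTMLParser


class LinkHrefParser(HTMLParser):
    """Hypothetical helper: collect href values of <a> tags with a given class."""

    def __init__(self, wanted_class):
        super().__init__()
        self.wanted_class = wanted_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a value may be None.
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if self.wanted_class in classes and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])


# The snippet from the question, used here as stand-in input.
SAMPLE = ('<div class="foo"><table><tr><td>foo</td></tr></table>'
          "<a class=\"link\" href='/blahblah'>Link</a></div>")

parser = LinkHrefParser("link")
parser.feed(SAMPLE)
parser.close()
print(parser.hrefs)
```

For a live page, the string passed to feed() could come from urllib.request.urlopen(url).read() decoded to text; which encoding to decode with depends on the page, so that part is left out here.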

#2


6  

If your HTML is well formed, you have many options, such as SAX and DOM. If it is not well formed, you need a fault-tolerant parser such as Beautiful Soup, Element Tidy, or lxml's HTML parser. No parser is perfect; when presented with a variety of broken HTML, sometimes I have to try more than one. lxml and ElementTree use a mostly compatible API that is more of a standard than Beautiful Soup's.

In my opinion, lxml is the best module for working with XML documents, but the ElementTree included with Python is still pretty good. In the past I have used Beautiful Soup to convert HTML to XML and construct an ElementTree for processing the data.
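
As a rough illustration of the ElementTree route, the snippet from the question happens to be well formed, so the bundled xml.etree.ElementTree can parse it directly. This is only a sketch under that assumption; broken real-world HTML would raise a ParseError here and would need one of the fault-tolerant parsers first. SAMPLE is a stand-in string for this example.

```python
import xml.etree.ElementTree as ET

# The snippet from the question; it parses only because it is well-formed XML.
SAMPLE = ('<div class="foo"><table><tr><td>foo</td></tr></table>'
          "<a class=\"link\" href='/blahblah'>Link</a></div>")

root = ET.fromstring(SAMPLE)

# ElementTree supports a limited XPath subset, including attribute predicates.
link = root.find(".//a[@class='link']")
href = link.get("href") if link is not None else None
print(href)
```

The same find/get calls also work on trees built by lxml, which is what the "mostly compatible API" remark above refers to.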

#3


4  

BeautifulSoup, with its version 3.1.0.1 (January 2009), also works with Python 3.x.

I do not have direct experience with BeautifulSoup under Py3k (although this should change soon...). I just read, however, that version 3.1.0 of Beautiful Soup does significantly worse on real-world HTML than its previous versions, so I may wait if possible (i.e. stay with Python 2.6 a bit longer).

#4


4  

I'm currently using lxml, and on Windows I used the installation binary from http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml.

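import lxml.html
# Parse an HTML string into an element tree (the argument is elided here).
page = lxml.html.fromstring(...)
# xpath() returns a list of matches; take the first text node of <title>.
title = page.xpath('//head/title/text()')[0]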

#5


4  

I know this is way late, but for future reference, Beautiful Soup 4.3.2 is available as of Oct. 2013.

http://www.crummy.com/software/BeautifulSoup/bs4/download/

It is compatible with Python 3.
