Language/libraries for downloading and parsing web pages?

Date: 2022-03-30 22:47:20

What language and libraries are suitable for a script to parse and download small numbers of web resources?

For example, some websites publish pseudo-podcasts, but not as proper RSS feeds; they just publish an MP3 file regularly with a web page containing the playlist. I want to write a script to run regularly and parse the relevant pages for the link and playlist info, download the MP3, and put the playlist in the MP3 tags so it shows up nicely on my iPod. There are a bunch of similar applications that I could write too.

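
The parse step described above can be sketched with nothing but the Python standard library (Python being one of the candidates below). The page structure here (an `<a>` link to the MP3 plus an `<ol>` of track titles) and the sample HTML are invented; a real site would need its own extraction rules:

```python
# Rough sketch of the scraping step, standard library only.
# The page layout (one .mp3 link, playlist in <li> items) is a
# made-up example, not any particular site's real structure.
from html.parser import HTMLParser


class PlaylistPage(HTMLParser):
    """Collects the first .mp3 link and the text of every <li> item."""

    def __init__(self):
        super().__init__()
        self.mp3_url = None
        self.tracks = []
        self._in_li = False

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.mp3_url is None:
            href = dict(attrs).get("href", "")
            if href.endswith(".mp3"):
                self.mp3_url = href
        elif tag == "li":
            self._in_li = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_li = False

    def handle_data(self, data):
        if self._in_li and data.strip():
            self.tracks.append(data.strip())


page = """
<h1>Show 42</h1>
<a href="/audio/show42.mp3">Download</a>
<ol><li>Intro</li><li>Interview</li><li>Outro</li></ol>
"""

pp = PlaylistPage()
pp.feed(page)
print(pp.mp3_url)  # /audio/show42.mp3
print(pp.tracks)   # ['Intro', 'Interview', 'Outro']
```

The download itself would be `urllib.request.urlretrieve`; writing the playlist into the MP3 tags needs a third-party library (mutagen is one option in Python).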
What language would you recommend? I would like the script to run on Windows and MacOS. Here are some alternatives:

  • JavaScript. Just so I could use jQuery for the parsing. I don't know if jQuery works outside a browser, though.
  • Python. Probably has good library support for doing what I want, but I don't love Python syntax.
  • Ruby. I've done simple stuff (manual parsing) in Ruby before.
  • Clojure. Because I want to spend a bit of time with it.

What's your favourite language and libraries for doing this? And why? Are there any nice jQuery-like libraries for other languages?

10 Answers

#1


7  

If you want to spend some time with Clojure (a very good idea IMO!), give Enlive a shot. The GitHub description reads:

a selector-based (à la CSS) templating and transformation system for Clojure — Read more

In addition to being useful for templating, it's a capable webscraping library; see the initial part of this tutorial for some simple scraping examples. (The third one is the New York Times homepage, so actually not as simple as all that.)

There are other tutorials available on the Web if you look for them; Enlive itself comes with some docs / examples. (Plus the code is < 1000 lines in total and very readable, though I suppose this might be less so for someone new to the language.)

#2


6  

Clojure link dumps covering Enlive (based on TagSoup) and agents for parallel downloads. (Roundups / link dumps aren't pretty, but I did spend some time googling for different libs. Spidering/crawling can be very easy or pretty involved depending on the structure of the sites crawled: HTML, XHTML, etc.)

http://blog.bestinclass.dk/index.php/2009/10/functional-social-webscraping/

http://nakkaya.com/2009/12/17/mashups-using-clojure/

http://freegeek.in/blog/2009/10/downloading-a-bunch-of-files-in-parallel-using-clojure-agents/

http://blog.maryrosecook.com/post/46601664/Writing-an-mp3-crawler-in-Clojure

http://gnuvince.wordpress.com/2008/11/18/fetching-web-comics-with-clojure-part-2/

http://htmlparser.sourceforge.net/

http://nakkaya.com/2009/11/23/converting-html-to-compojure-dsl/

http://www.bestinclass.dk/index.php/2009/10/functional-social-webscraping/

Apache HTTP client:

http://github.com/rnewman/clj-apache-http

http://github.com/heyZeus/clj-web-crawler

http://japhr.blogspot.com/2009/01/clojure-http-clientclj.html

#3


5  

Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/) is a good Python library for this. It specializes in dealing with malformed markup.

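A minimal illustration of what the answer describes: Beautiful Soup tolerates sloppy markup (unclosed tags, etc.) and still gives you a simple search API. The HTML fragment below is invented:

```python
# Beautiful Soup parses this fragment despite the unclosed <p> tags
# and still lets us pull out the links. The markup is a made-up example.
from bs4 import BeautifulSoup

html = '<p>Episode 42 <a href="show42.mp3">download</a><br><p>enjoy'
soup = BeautifulSoup(html, "html.parser")

links = [a["href"] for a in soup.find_all("a")]
print(links)  # ['show42.mp3']
```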
#4


4  

In Ruby you also have Nokogiri. Nokogiri (鋸) is an HTML, XML, SAX, and Reader parser. Among Nokogiri's many features is the ability to search documents via XPath or CSS3 selectors.

#5


2  

As Mikael S mentioned, Hpricot is a great Ruby HTML parser. However, for page retrieval, you may consider using a screen-scraping library like scRUBYt or Mechanize.

#6


1  

I highly recommend using Ruby and the hpricot library.

#7


1  

You should really give Python a shot.

When I design a crawler, I usually reproduce the same pattern.

For each step there is a worker, which picks its data from a container (mainly a queue), and there is a container between each type of worker. After the first connection to the target site, all types of workers can be threaded, so we have to use synchronization for accessing these queues.

  1. Connector: the Session object from the requests library is remarkable.
  2. Loader: with multiple threaded Loaders, multiple requests can be launched in no time.
  3. Parser: XPath is used intensively on each etree object created with lxml.
  4. Validator: a set of assertions and heuristics to check the validity of parsed data.
  5. Archiver: it depends on what is stored, how much, and how fast, but NoSQL is often the easiest way to store the retrieved data; for example, MongoDB with pymongo.
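
The worker/queue pattern above can be sketched with the standard library alone: `queue.Queue` supplies the containers (it is already thread-safe, which is the synchronization mentioned), and threads play the workers. The `fetch` and `parse` functions below are stand-ins for the real requests/lxml code:

```python
# Toy version of the worker/queue pipeline: each stage pulls from an
# input queue and pushes to an output queue. queue.Queue handles the
# locking. fetch() and parse() are stubs for requests/lxml calls.
import queue
import threading

urls = queue.Queue()     # feeds the loaders
pages = queue.Queue()    # loaders -> parsers
records = queue.Queue()  # parsed output

def fetch(url):
    return f"<html>{url}</html>"  # stub for requests.Session().get(url)

def parse(page):
    # stub for XPath extraction with lxml
    return page.removeprefix("<html>").removesuffix("</html>")

def loader():
    while True:
        url = urls.get()
        if url is None:  # poison pill: shut this worker down
            break
        pages.put(fetch(url))

def parser():
    while True:
        page = pages.get()
        if page is None:
            break
        records.put(parse(page))

loaders = [threading.Thread(target=loader) for _ in range(2)]
parsers = [threading.Thread(target=parser) for _ in range(2)]
for t in loaders + parsers:
    t.start()

for u in ["a", "b", "c"]:
    urls.put(u)
for _ in loaders:          # one pill per loader
    urls.put(None)
for t in loaders:
    t.join()
for _ in parsers:          # loaders done; now shut down parsers
    pages.put(None)
for t in parsers:
    t.join()

results = sorted(records.queue)
print(results)  # ['a', 'b', 'c']
```

The poison-pill shutdown keeps the example deterministic; a long-running crawler would more likely use `Queue.task_done()`/`join()` and keep the workers alive.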

#8


0  

I would probably do this with PHP, cURL, and phpQuery... but there are a lot of different ways.

#9


0  

What do you really want to do? If you want to learn Clojure||Ruby||C, do that. If you just want to get it done, do whatever is fastest for you. And at the very least, when you say Clojure and library you are also saying Java and library; there are lots, and some are very good (I don't know which they are, though). The same was said for Ruby and Python above. So what do you want to do?

#10


0  

For a jQuery-like CSS selector library in Perl, take a look at pQuery.

Also have a look at this previous SO question for examples of HTML parsing & scraping in many languages.

/I3az/
