What is the best way to parse a web page in Ruby?

I have been looking at XML and HTML libraries on RubyForge for a simple way to pull data out of a web page. For example, if I want to parse a user page on *, how can I get the data into a usable format?

Say I want to parse my own user page for my current reputation score and badge listing. I tried to convert the source retrieved from my user page into XML, but the conversion failed due to a missing div. I know I could do a string compare and find the text I'm looking for, but there has to be a much better way of doing this.

I want to incorporate this into a simple script that spits out my user data at the command line, and possibly expand it into a GUI application.

6 Answers

#1

Hpricot is over!

Use Nokogiri now.

#2

Unfortunately * is claiming to be XML but actually isn't. Hpricot however can parse this tag soup into a tree of elements for you.

require 'hpricot'
require 'open-uri'

doc = Hpricot(open("http://*.com/users/19990/armin-ronacher"))
reputation = (doc / "td.summaryinfo div.summarycount").text.gsub(/[^\d]+/, "").to_i

And so forth.
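
The `gsub(/[^\d]+/, "").to_i` step in the snippet above is what turns the displayed text into a number. A small standalone sketch of just that step, using made-up input strings and a hypothetical helper name `digits_to_i`:

```ruby
# Hypothetical helper: drop every non-digit character, then convert
# what is left to an Integer.
def digits_to_i(text)
  text.gsub(/[^\d]+/, "").to_i
end

puts digits_to_i("1,234")         # thousands separator removed
puts digits_to_i(" 12,345 rep ")  # surrounding text and whitespace removed
puts digits_to_i("n/a")           # no digits at all: empty string converts to 0
```

One caveat of this approach: it discards decimal points and suffixes too, so an abbreviated value such as "1.5k" would come out as 15 rather than 1500.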

#3

Try Hpricot; it's, well... awesome.

I've used it several times for screen scraping.

#4

I always really like what Ilya Grigorik writes, and he wrote up a nice post about using hpricot.

I also read this post a while back and it looks like it would be useful for you.

Haven't done either myself, so YMMV but these seem pretty useful.

#5

Something I ran into when trying to do this before is that few web pages are well-formed XML documents. Hpricot may be able to deal with that (I haven't used it), but when I was doing a similar project in the past (using Python and its library's built-in parsing functions), it helped to have a pre-processor to clean up the HTML. I used the Python bindings for HTML Tidy for this, and it made life a lot easier. Ruby bindings are here but I haven't tried them.

Good luck!

#6

This seems to be an old topic, but here is a newer answer. An example of getting the reputation:

#!/usr/bin/env ruby

require 'rubygems'
require 'hpricot'
require 'open-uri'

user = "619673/100kg"
html = "http://*.com/users/%s?tab=reputation"

page = html % user
puts page

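doc = Hpricot(open(page))
# Collect the text of each matching element. Iterating over the elements
# avoids String#each, which no longer exists in Ruby 1.9 and later.
pars = doc.search("div[@class='subheader user-full-tab-header']/h1/span[@class='count']").map do |el|
  el.inner_text
end

# to_s guards against a nil when nothing matched
puts "reputation " + pars[0].to_s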
