How do I extract links from HTML using a regular expression?

I want to extract links from google.com. The HTML looks like this:

<a href="http://www.test.com/" class="l"

It took me around five minutes to find a regex that works, using www.rubular.com. It is:

"(.*?)" class="l"

The code is:

require "open-uri"
url = "http://www.google.com/search?q=ruby"

source = open(url).read()
links = source.scan(/"(.*?)" class="l"/) 

links.each { |link| puts #{link} 
}

The problem is that it is not outputting the websites' links.

3 solutions

#1

Those links actually have class=l, not class="l". By the way, to figure this out I added some logging to the script so that you can see the output at various stages and debug it. I searched the page source for the string you were expecting and didn't find it, which is why your regex failed. So I looked for the string that is actually there and changed the regex accordingly. Debugging skills are handy.

require "open-uri"
url = "http://www.google.com/search?q=ruby"

# On Ruby 3.0 and later, plain open(url) no longer fetches URLs; use URI.open.
source = URI.open(url).read

puts "--- PAGE SOURCE ---"
puts source

links = source.scan(/<a.+?href="(.+?)".+?class=l/)

puts "--- FOUND THIS MANY LINKS ---"
puts links.size

puts "--- PRINTING LINKS ---"
# scan returns one array per match (one entry per capture group), so flatten first.
links.flatten.each do |link|
  puts "- #{link}"
end

I also improved your regex. You are looking for some text that starts with the opening of an a tag (<a), then some characters that you don't care about (.+?), an href attribute (href="), the contents of the href attribute that you want to capture ((.+?)), some spaces or other attributes (.+?), and lastly the class attribute (class=l).
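
To see those pieces at work without hitting the network, here is a minimal sketch that runs the same pattern against a made-up snippet. The sample markup and its URLs are assumptions for illustration; what Google actually returns will differ.

```ruby
# Hypothetical sample markup, not real Google output.
sample = '<a href="http://www.test.com/" class=l>Test</a> ' \
         '<a href="http://www.example.org/" class=l>Example</a>'

# Same pattern as in the script above.
pattern = /<a.+?href="(.+?)".+?class=l/

# scan returns one array per match, holding that match's capture groups.
nested = sample.scan(pattern)

# flatten turns the one-element arrays into plain strings.
links = nested.flatten

links.each { |link| puts "- #{link}" }
```

Because the pattern has a single capture group, each match comes back as a one-element array, which is why the sketch flattens the result before printing.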

I have .+? in three places there. The . means any character, the + means there must be one or more of the thing right before it, and the ? means that the .+ should match as short a string as possible.
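
The effect of the trailing ? is easiest to see side by side. This is a small sketch on a made-up string (the text variable is just an illustration), comparing the lazy .+? with the greedy .+:

```ruby
# Made-up input with two quoted values on one line.
text = 'href="first" href="second"'

# Lazy: stops at the first closing quote it can.
lazy = text[/href="(.+?)"/, 1]

# Greedy: runs on to the last closing quote in the string.
greedy = text[/href="(.+)"/, 1]

puts lazy    # first
puts greedy  # first" href="second
```

Without the ?, a single match can swallow several attributes or even several tags, which is why the lazy form is used in the pattern above.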

#2

To put it bluntly, the problem is that you're using regexes. HTML is what is known as a context-free language, while regular expressions can only recognize the class of languages known as regular languages.

What you should do is send the page data to a parser that can handle HTML code, such as Hpricot, and then walk the parse tree you get from the parser.

#3

What am I doing wrong?

You're trying to parse HTML with regex. Don't do that. Regular expressions cannot cover the range of syntax allowed even by valid XHTML, let alone real-world tag soup. Use an HTML parser library such as Hpricot.
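
As a rough illustration of how brittle this is, the sketch below feeds the regex from answer #1 four spellings of the same link. The markup variants are made up for the example; all of them are things a browser would accept.

```ruby
# The regex from answer #1.
pattern = /<a.+?href="(.+?)".+?class=l/

# Hypothetical variants of one link; only the attribute syntax changes.
variants = {
  "href first, unquoted class" => '<a href="http://www.test.com/" class=l>',
  "class before href"          => '<a class=l href="http://www.test.com/">',
  "single-quoted href"         => "<a href='http://www.test.com/' class=l>",
  "quoted class"               => '<a href="http://www.test.com/" class="l">'
}

# Collect whatever the regex manages to extract from each variant.
results = variants.transform_values { |html| html.scan(pattern).flatten }

results.each { |label, found| puts "#{label}: #{found.inspect}" }
```

Only the first variant yields the URL; the other three come back empty. An HTML parser treats all four as the same element, which is the point of using one.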

FWIW, when I fetch ‘http://www.google.com/search?q=ruby’ I do not receive ‘class="l"’ anywhere in the returned markup. Perhaps it depends on which local Google you are using and/or whether you are logged in or otherwise have a Google cookie. (Your script, like me, would not.)
