My goal is to find the first result in the Google search results and collect the site link, so I built this script:
require 'hpricot'
require 'open-uri'
require 'mechanize'
query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)
search_results = search_results.body
doc = Hpricot(search_results)
site = doc.search("a")[16,1]
url = site.to_s
puts url
I get a string like this:
url = <a href="http://en.wikipedia.org/wiki/Gallon" dir="ltr" class="l"><em>Gallon</em> - Wikipedia, the free encyclopedia</a>
But I need only the link (http://en.wikipedia.org/wiki/Gallon), not all the HTML code. How can I do it? I am using these gems:
require 'hpricot'
require 'open-uri'
require 'mechanize'
6 Solutions
#1
6
You can get the value of an attribute like this:
(doc/"a")[16].attributes['href']
but I have to say that the magic number 16 seems brittle.
You are also not supposed to scrape the search results; consider using the Custom Search API instead.
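As a rough illustration of the Custom Search API suggestion, here is a minimal sketch that builds a request URL for the Custom Search JSON API and pulls the first result's link out of a response body. The helper names search_uri and first_link, the placeholder key and engine ID, and the sample response are all assumptions made up for this example; it uses only Ruby's standard library and does not perform the HTTP request itself.

```ruby
require 'json'
require 'uri'

# Hypothetical helper: build the request URI for the Custom Search JSON API.
# api_key and engine_id are placeholders you would get from your own account.
def search_uri(query, api_key:, engine_id:)
  uri = URI("https://www.googleapis.com/customsearch/v1")
  uri.query = URI.encode_www_form(key: api_key, cx: engine_id, q: query)
  uri
end

# Hypothetical helper: return the link of the first result, or nil if none.
def first_link(json_body)
  items = JSON.parse(json_body)["items"] || []
  first = items.first
  first && first["link"]
end

# Made-up response body, trimmed to the fields used here (not real API output).
sample = '{"items":[{"title":"Gallon - Wikipedia","link":"http://en.wikipedia.org/wiki/Gallon"}]}'

puts search_uri("gallon", api_key: "YOUR_KEY", engine_id: "YOUR_CX")
puts first_link(sample)  # prints http://en.wikipedia.org/wiki/Gallon
```

Fetching the URL (for example with Net::HTTP.get) and passing the body to first_link would avoid relying on the position of a link in the HTML.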
#2
6
Since Mechanize includes Nokogiri, you can skip Hpricot altogether. Parsing the page a second time with Hpricot slows your code down unnecessarily, because you are effectively doing the same thing twice.
require 'mechanize'
query = gets.chomp
agent = Mechanize.new
page = agent.get("http://www.google.co.il/")
search_form = page.form_with(:name => "f")
search_form.field_with(:name => "q").value = query.to_s
search_results = agent.submit(search_form)
puts search_results.links[16].href
#3
1
Instead of converting to a string with url = site.to_s,
use url = site[0].attributes['href']
#4
0
Try using:
site = doc.search("a[@href]")[16,1]
#5
0
Watir is a reasonable choice for checking the layout of a web page.
require 'rubygems'
require 'watir'
#Launching browser windows and navigating to google
browser = Watir::Browser.new
browser.goto("http://www.google.co.il/")
#Logging to console if a link with href = http://en.wikipedia.org/wiki/Gallon present
puts browser.link(:href, "http://en.wikipedia.org/wiki/Gallon").exists?
#6
0
Since the input is always going to follow the same format, you could just do:
url.split("href=\"").last.split("\"").first
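One caveat with the split approach is that it returns unrelated text when the string contains no href attribute. A minimal alternative sketch, assuming the attribute is always double-quoted as in the question's output, uses a regular expression with a capture group and returns nil when nothing matches. The helper name extract_href is made up for this example.

```ruby
# Hypothetical helper: return the value of the first double-quoted href
# attribute in the given HTML string, or nil if there is none.
def extract_href(html)
  html[/href="([^"]*)"/, 1]
end

url = '<a href="http://en.wikipedia.org/wiki/Gallon" dir="ltr" class="l"><em>Gallon</em> - Wikipedia, the free encyclopedia</a>'

puts extract_href(url)                   # prints http://en.wikipedia.org/wiki/Gallon
puts extract_href("<b>no link</b>").inspect  # prints nil
```

This is still string matching rather than HTML parsing, so reading the attribute from the parsed element, as in the other answers, remains the more robust option.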