I want to get data from this page:
我想从这个页面获取数据:
http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?trackingNumber=0656887000494793
But that page forwards to:
但该页面转发到:
http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber?execution=eXs1
So, when I use open
, from OpenUri, to try and fetch the data, it throws a RuntimeError
error saying HTTP redirection loop:
因此,当我从OpenUri使用open来尝试获取数据时,它会抛出一个RuntimeError错误,说HTTP重定向循环:
I'm not really sure how to get that data after it redirects and throws that error.
我不确定如何在重定向并抛出该错误后获取该数据。
3 个解决方案
#1
22
You need a tool like Mechanize. From it's description:
你需要像Mechanize这样的工具。从它的描述:
The Mechanize library is used for automating interaction with websites. Mechanize automatically stores and sends cookies, follows redirects, can follow links, and submit forms. Form fields can be populated and submitted. Mechanize also keeps track of the sites that you have visited as a history.
Mechanize库用于自动与网站交互。 Mechanize自动存储和发送cookie,遵循重定向,可以跟踪链接和提交表单。可以填充和提交表单字段。 Mechanize还会跟踪您作为历史记录访问过的站点。
which is exactly what you need. So,
这正是你需要的。所以,
sudo gem install mechanize
then
require 'mechanize'
agent = WWW::Mechanize.new
page = agent.get "http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber trackingNumber=0656887000494793"
page.content # Get the resulting page as a string
page.body # Get the body content of the resulting page as a string
page.search(".somecss") # Search for specific elements by XPath/CSS using nokogiri
and you're ready to rock 'n' roll.
而你已经准备好摇滚了。
#2
1
The site seems to be doing some of the redirection logic with sessions. If you don't send back the session cookies they are sending on the first request you will end up in a redirect loop. IMHO it's a crappy implementation on their part.
该网站似乎正在使用会话进行一些重定向逻辑。如果您没有发回他们在第一次请求时发送的会话cookie,您将最终进入重定向循环。恕我直言,这是他们的一个糟糕的实施。
However, I tried to pass the cookies back to them, but I didn't get it to work, so I can't be completely sure that that is all that's going on here.
但是,我试图将cookie传递给他们,但我没有让它工作,所以我不能完全确定这就是这里发生的一切。
#3
1
While mechanize is a wonderful tool I prefer to "cook" my own thing.
虽然机械化是一个很好的工具,但我更喜欢“烹饪”我自己的东西。
If you are serious about parsing you can take a look at this code. It serves to crawl thousands of site on an international level everyday and as far as I have researched and tweaked there isn't a more stable approach to this that also allows you to highly customize later on your needs.
如果您认真解析,可以查看此代码。它每天在国际层面上抓取成千上万的网站,据我所研究和调整,没有更稳定的方法,这也允许您以后高度定制您的需求。
require "open-uri"
require "zlib"
require "nokogiri"
require "sanitize"
require "htmlentities"
require "readability"
def crawl(url_address)
self.errors = Array.new
begin
begin
url_address = URI.parse(url_address)
rescue URI::InvalidURIError
url_address = URI.decode(url_address)
url_address = URI.encode(url_address)
url_address = URI.parse(url_address)
end
url_address.normalize!
stream = ""
timeout(8) { stream = url_address.open(SHINSO_HEADERS) }
if stream.size > 0
url_crawled = URI.parse(stream.base_uri.to_s)
else
self.errors << "Server said status 200 OK but document file is zero bytes."
return
end
rescue Exception => exception
self.errors << exception
return
end
# extract information before html parsing
self.url_posted = url_address.to_s
self.url_parsed = url_crawled.to_s
self.url_host = url_crawled.host
self.status = stream.status
self.content_type = stream.content_type
self.content_encoding = stream.content_encoding
self.charset = stream.charset
if stream.content_encoding.include?('gzip')
document = Zlib::GzipReader.new(stream).read
elsif stream.content_encoding.include?('deflate')
document = Zlib::Deflate.new().deflate(stream).read
#elsif stream.content_encoding.include?('x-gzip') or
#elsif stream.content_encoding.include?('compress')
else
document = stream.read
end
self.charset_guess = CharGuess.guess(document)
if not self.charset_guess.blank? and (not self.charset_guess.downcase == 'utf-8' or not self.charset_guess.downcase == 'utf8')
document = Iconv.iconv("UTF-8", self.charset_guess, document).to_s
end
document = Nokogiri::HTML.parse(document,nil,"utf8")
document.xpath('//script').remove
document.xpath('//SCRIPT').remove
for item in document.xpath('//*[translate(@src, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz")]')
item.set_attribute('src',make_absolute_address(item['src']))
end
document = document.to_s.gsub(/<!--(.|\s)*?-->/,'')
self.content = Nokogiri::HTML.parse(document,nil,"utf8")
end
#1
22
You need a tool like Mechanize. From it's description:
你需要像Mechanize这样的工具。从它的描述:
The Mechanize library is used for automating interaction with websites. Mechanize automatically stores and sends cookies, follows redirects, can follow links, and submit forms. Form fields can be populated and submitted. Mechanize also keeps track of the sites that you have visited as a history.
Mechanize库用于自动与网站交互。 Mechanize自动存储和发送cookie,遵循重定向,可以跟踪链接和提交表单。可以填充和提交表单字段。 Mechanize还会跟踪您作为历史记录访问过的站点。
which is exactly what you need. So,
这正是你需要的。所以,
sudo gem install mechanize
then
require 'mechanize'
agent = WWW::Mechanize.new
page = agent.get "http://www.canadapost.ca/cpotools/apps/track/personal/findByTrackNumber trackingNumber=0656887000494793"
page.content # Get the resulting page as a string
page.body # Get the body content of the resulting page as a string
page.search(".somecss") # Search for specific elements by XPath/CSS using nokogiri
and you're ready to rock 'n' roll.
而你已经准备好摇滚了。
#2
1
The site seems to be doing some of the redirection logic with sessions. If you don't send back the session cookies they are sending on the first request you will end up in a redirect loop. IMHO it's a crappy implementation on their part.
该网站似乎正在使用会话进行一些重定向逻辑。如果您没有发回他们在第一次请求时发送的会话cookie,您将最终进入重定向循环。恕我直言,这是他们的一个糟糕的实施。
However, I tried to pass the cookies back to them, but I didn't get it to work, so I can't be completely sure that that is all that's going on here.
但是,我试图将cookie传递给他们,但我没有让它工作,所以我不能完全确定这就是这里发生的一切。
#3
1
While mechanize is a wonderful tool I prefer to "cook" my own thing.
虽然机械化是一个很好的工具,但我更喜欢“烹饪”我自己的东西。
If you are serious about parsing you can take a look at this code. It serves to crawl thousands of site on an international level everyday and as far as I have researched and tweaked there isn't a more stable approach to this that also allows you to highly customize later on your needs.
如果您认真解析,可以查看此代码。它每天在国际层面上抓取成千上万的网站,据我所研究和调整,没有更稳定的方法,这也允许您以后高度定制您的需求。
require "open-uri"
require "zlib"
require "nokogiri"
require "sanitize"
require "htmlentities"
require "readability"
def crawl(url_address)
self.errors = Array.new
begin
begin
url_address = URI.parse(url_address)
rescue URI::InvalidURIError
url_address = URI.decode(url_address)
url_address = URI.encode(url_address)
url_address = URI.parse(url_address)
end
url_address.normalize!
stream = ""
timeout(8) { stream = url_address.open(SHINSO_HEADERS) }
if stream.size > 0
url_crawled = URI.parse(stream.base_uri.to_s)
else
self.errors << "Server said status 200 OK but document file is zero bytes."
return
end
rescue Exception => exception
self.errors << exception
return
end
# extract information before html parsing
self.url_posted = url_address.to_s
self.url_parsed = url_crawled.to_s
self.url_host = url_crawled.host
self.status = stream.status
self.content_type = stream.content_type
self.content_encoding = stream.content_encoding
self.charset = stream.charset
if stream.content_encoding.include?('gzip')
document = Zlib::GzipReader.new(stream).read
elsif stream.content_encoding.include?('deflate')
document = Zlib::Deflate.new().deflate(stream).read
#elsif stream.content_encoding.include?('x-gzip') or
#elsif stream.content_encoding.include?('compress')
else
document = stream.read
end
self.charset_guess = CharGuess.guess(document)
if not self.charset_guess.blank? and (not self.charset_guess.downcase == 'utf-8' or not self.charset_guess.downcase == 'utf8')
document = Iconv.iconv("UTF-8", self.charset_guess, document).to_s
end
document = Nokogiri::HTML.parse(document,nil,"utf8")
document.xpath('//script').remove
document.xpath('//SCRIPT').remove
for item in document.xpath('//*[translate(@src, "ABCDEFGHIJKLMNOPQRSTUVWXYZ", "abcdefghijklmnopqrstuvwxyz")]')
item.set_attribute('src',make_absolute_address(item['src']))
end
document = document.to_s.gsub(/<!--(.|\s)*?-->/,'')
self.content = Nokogiri::HTML.parse(document,nil,"utf8")
end