I'm implementing a tool that needs to crawl a website. I'm using Anemone to crawl, and on each page Anemone yields I use Boilerpipe and Nokogiri to handle the HTML, etc.
My problem is: if I get a 500 Internal Server Error, Nokogiri fails because there is no page to parse.
Anemone.crawl(name) do |anemone|
  anemone.on_every_page do |page|
    # only process pages that exist and were actually found
    if !page.nil? && !page.not_found?
      result = Boilerpipe.extract(page.url, {:output => :htmlFragment, :extractor => :ArticleExtractor})
      doc = Nokogiri::HTML.parse(result)
    end
  end
end
In the case above, if there is a 500 Internal Server Error, the application raises an error at Nokogiri::HTML.parse(). I want to avoid this problem: if the server returns an error, I want to skip that page and continue crawling.
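To be concrete, what I'm after is roughly the following behavior (a sketch only; it assumes Anemone's page.code exposes the numeric HTTP status of the fetch):

require 'anemone'

Anemone.crawl(name) do |anemone|
  anemone.on_every_page do |page|
    # skip anything the server did not deliver successfully (404, 500, ...)
    next unless page.code && (200..299).cover?(page.code)

    result = Boilerpipe.extract(page.url, {:output => :htmlFragment, :extractor => :ArticleExtractor})
    doc = Nokogiri::HTML.parse(result)
  end
end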
Is there any way to handle 500 Internal Server Error and 404 Not Found with these tools?
Kind regards, Hugo
2 Answers
#1 (score: 5)
require 'net/http'
require 'uri'

# get the response for the link
res = Net::HTTP.get_response(URI.parse(url))
# if it returns a good code
if res.code.to_i >= 200 && res.code.to_i < 400 # good codes will be between 200 - 399
  # do something with the url
else
  # skip the object
  next
end
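Note that next only works inside a block or loop, so in your case this check would sit inside anemone.on_every_page. A rough sketch of the wiring (it re-fetches each URL with Net::HTTP, which is wasteful but keeps the check independent of Anemone's internals; page.url is assumed to be a URI, as Anemone stores it):

require 'anemone'
require 'net/http'
require 'uri'

Anemone.crawl(name) do |anemone|
  anemone.on_every_page do |page|
    res = Net::HTTP.get_response(URI.parse(page.url.to_s))
    # skip pages whose status falls outside 200 - 399
    next unless res.code.to_i.between?(200, 399)

    result = Boilerpipe.extract(page.url, {:output => :htmlFragment, :extractor => :ArticleExtractor})
    doc = Nokogiri::HTML.parse(result)
  end
end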
#2 (score: 0)
I ran into a similar problem. The question and the reply are here:
How to handle 404 errors with Nokogiri
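Independent of the link, the usual rescue-based pattern looks roughly like this (a sketch; StandardError is a deliberately broad catch here, and you would narrow it to whatever Boilerpipe and Nokogiri actually raise in your setup):

begin
  result = Boilerpipe.extract(page.url, {:output => :htmlFragment, :extractor => :ArticleExtractor})
  doc = Nokogiri::HTML.parse(result)
rescue StandardError => e
  # the fetch or parse failed (e.g. an error page came back); log it and move on
  warn "Skipping #{page.url}: #{e.message}"
end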