How to handle 500 Internal Server Error and 404 Page Not Found in Anemone, Boilerpipe and Nokogiri

Date: 2022-10-10 16:53:52

I'm implementing a tool that needs to crawl a website. I'm using Anemone to crawl, and on each page Anemone visits I'm using Boilerpipe and Nokogiri to handle the HTML formatting, etc.

My problem is: if I get a 500 Internal Server Error, Nokogiri fails because there is no page to parse.

Anemone.crawl(name) do |anemone|
  anemone.on_every_page do |page|
    # Only process pages that exist and were not reported as 404
    unless page.nil? || page.not_found?
      result = Boilerpipe.extract(page.url, {:output => :htmlFragment, :extractor => :ArticleExtractor})
      doc = Nokogiri::HTML.parse(result)
    end
  end
end

In the case above, if there is a 500 Internal Server Error, the application raises an error in Nokogiri::HTML.parse(). I want to avoid this problem. If the server returns an error, I want to skip that page and continue with the rest of the crawl.

Is there any way to handle 500 Internal Server Error and 404 Page Not Found with these tools?

Kind regards, Hugo

2 answers

#1


5  

require 'net/http'

# get the response for the link
res = Net::HTTP.get_response(URI.parse(url))

# good status codes are between 200 and 399
if res.code.to_i >= 200 && res.code.to_i < 400
  # do something with the url
else
  # skip this page (assumes the snippet runs inside a loop or block)
  next
end
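
The same status check can be pulled into a small helper so the crawl block stays short. This is only a sketch: `success_status?` and `fetch_if_ok` are hypothetical names, not part of Anemone, Boilerpipe or Nokogiri, and it assumes that a failed request should be skipped in the same way as a bad status code.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: true for 2xx and 3xx status codes, false otherwise
# (including nil or non-numeric input, which to_i turns into 0).
def success_status?(code)
  (200...400).cover?(code.to_i)
end

# Hypothetical helper: returns the response body, or nil when the server
# answers with an error status (404, 500, ...) or the request itself fails.
def fetch_if_ok(url)
  res = Net::HTTP.get_response(URI.parse(url))
  success_status?(res.code) ? res.body : nil
rescue SocketError, SystemCallError, Timeout::Error, URI::InvalidURIError
  nil
end
```

Anemone pages also expose the response status as `page.code`, so inside `on_every_page` the same predicate could guard the block, for example `next unless success_status?(page.code)`, before Boilerpipe or Nokogiri is called.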

#2


0  

I ran into a similar problem. The question and its answer are here:

How to handle 404 errors with Nokogiri
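
Another way to skip bad pages is to rescue the error instead of checking the status up front. The sketch below assumes the page is fetched with Ruby's `open-uri`, which raises `OpenURI::HTTPError` for 4xx and 5xx responses. `html_or_nil` is a hypothetical helper name, and the fetch is passed in as a block so the rescue logic stays separate from the HTTP call.

```ruby
require 'open-uri'

# Hypothetical helper: runs the given block (which fetches HTML) and returns
# its result, or nil when the server responded with an HTTP error status.
def html_or_nil
  yield
rescue OpenURI::HTTPError => e
  warn "skipping page: #{e.message}"
  nil
end

# Usage (assumes network access and that Nokogiri is loaded):
#   html = html_or_nil { URI.open(url).read }
#   doc  = Nokogiri::HTML.parse(html) if html
```

Returning nil keeps the caller simple: the page is parsed only when there is HTML to parse, and the crawl moves on otherwise.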

