Scrubyt gives a 404 error when clicking a link using the _detail method

Time: 2021-12-07 09:45:35

This may be similar to the problem in my two earlier questions, but I'm trying to use the _detail command to follow each link automatically so I can scrape the details page for each individual event.

The code I'm using is:


require 'rubygems'
require 'scrubyt'

nuffield_data = Scrubyt::Extractor.define do
  fetch 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php'

  event do
    title 'The Coast of Mayo'
    link_url
    event_detail do
      dates "1-4 October"
      times "7:30pm"
    end
  end

  next_page "Next Page", :limit => 20
end

nuffield_data.to_xml.write($stdout, 1)

Is there any way to print the URL that event_detail is trying to access? The error doesn't include the URL that returned the 404.

Update: I think the link may be a relative link. Could that be causing the problem, and if so, how should I deal with it?

4 Answers

#1


1  

I had the same issue with relative links and fixed it like this: you have to set the :resolve parameter to the correct base URL.

  event do
    title 'The Coast of Mayo'
    link_url
    event_detail :resolve => 'http://www.nuffieldtheatre.co.uk/cn/events' do
      dates "1-4 October"
      times "7:30pm"
    end
  end
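
To see how a relative link turns into a full address, it can help to resolve one by hand. The sketch below uses Ruby's standard URI.join, not scrubyt's own resolution logic, and the relative href is a made-up example; the real link on the listing page may look different.

```ruby
require 'uri'

# Resolve a (possibly relative) href against the page it was found on.
# This is plain standard-library behaviour, not how scrubyt does it internally.
def absolute_url(base, href)
  URI.join(base, href).to_s
end

# The listing page fetched in the question.
listing_url = 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php'

# Hypothetical relative href, standing in for whatever link_url extracts.
relative_href = 'event_details.php?id=123'

puts absolute_url(listing_url, relative_href)
```

Printing the resolved address for a link copied from the page source is a quick way to check which base URL the detail pages actually need.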

#2


1  

    sudo gem install ruby-debug

This gives you access to a nice Ruby debugger. Start the debugger by altering your script:

    require 'rubygems'
    require 'ruby-debug'
    Debugger.start
    Debugger.settings[:autoeval] = true if Debugger.respond_to?(:settings)

    require 'scrubyt'

    nuffield_data = Scrubyt::Extractor.define do
      fetch 'http://www.nuffieldtheatre.co.uk/cn/events/event_listings.php'

      event do
        title 'The Coast of Mayo'
        link_url
        event_detail do
          dates "1-4 October"
          times "7:30pm"
        end
      end

      next_page "Next Page", :limit => 2

    end

    nuffield_data.to_xml.write($stdout,1)

Then find out where scrubyt is throwing an exception - in this case:

    /Library/Ruby/Gems/1.8/gems/scrubyt-0.3.4/lib/scrubyt/core/navigation/fetch_action.rb:52:in `fetch'

Find the scrubyt gem on your system, and add a rescue clause to the method in question so that the end of the method looks like this:

      if @@current_doc_protocol == 'file'
        @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(open(@@current_doc_url).read))
      else
        @@hpricot_doc = Hpricot(PreFilterDocument.br_to_newline(@@mechanize_doc.body))
        store_host_name(self.get_current_doc_url)   # in case we're on a new host
      end
    rescue
      debugger
      self # the self is here because debugger doesn't like being at the end of a method
    end

Now run the script again and you should be dropped into the debugger when the exception is raised. Type this at the debug prompt to see what the offending URL is:

    @@current_doc_url

You can also add a debugger statement anywhere in that method if you want to check what is going on. For example, you may want to add one between lines 51 and 52 of this method to see how the URL being requested changes and why.
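
If a debugger is not available, a similar rescue clause can simply report the URL and re-raise. The following is a minimal standalone sketch of that pattern; fetch_page and fetch_with_url_report are hypothetical names for illustration, not scrubyt methods, and the failure is simulated.

```ruby
# Hypothetical stand-in for a fetch that can fail; not scrubyt's real method.
def fetch_page(url)
  raise '404 => Not Found' if url.include?('missing')
  '<html>ok</html>'
end

# Wrap the fetch so that any failure reports the URL before re-raising.
def fetch_with_url_report(url)
  fetch_page(url)
rescue StandardError => e
  warn "Fetch failed for #{url}: #{e.message}"
  raise
end

fetch_with_url_report('http://example.com/ok')
```

Because the error is re-raised, the script still stops at the same place; the only difference is that the failing URL is written to standard error first.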

This is basically how I figured out the answers to your previous questions.

Good luck.

#3


0  

Sorry, I have no idea why this would be nil; every time I have run it, it returns a URL. The method self.fetch requires a URL, which you should be able to access as the local variable doc_url. If that also returns nil, maybe you should post the code where you have included the debugger call.

#4


0  

I've tried to access doc_url, but that also seems to return nil. When I have access to my server (later in the day) I'll post the code with the debugging bit in it.
