Web crawler in Rails: extract links and download files from a web page

Time: 2022-10-12 21:16:14

I'm using RoR. I will specify a link to a web page in my application, and here is what I want to do:

(1) I want to extract all the links on the web page.

(2) Find out whether they are links to PDF files (basically a pattern match).

(3) I want to download the files those links point to (a PDF, for example) and store them on my system.

I tried using Anemone, but it crawls the entire website, which overshoots my needs. Also, how do I download the files at the corresponding links?
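
If you do want to stay with Anemone, it can be told not to follow links beyond the page you give it. Here is a minimal sketch, assuming Anemone's :depth_limit option restricts the crawl to the start page and using open-uri for the download; the URL is a placeholder and error handling is omitted:

require 'anemone'
require 'open-uri'

# Crawl only the starting page (depth_limit: 0), pick out links ending in .pdf,
# and save each one locally, named after the last path segment.
Anemone.crawl('http://www.example.com/downloads', depth_limit: 0) do |anemone|
  anemone.on_every_page do |page|
    page.links.each do |link|
      next unless link.to_s =~ /\.pdf\z/i
      File.open(File.basename(link.path), 'wb') do |f|
        f.write(URI.open(link.to_s).read)
      end
    end
  end
end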

Cheers

2 Answers

#1 (8 votes)

Have a look at Nokogiri as well.

require 'nokogiri'
require 'open-uri'

# Fetch and parse the page (URI.open replaces the old Kernel#open from open-uri)
doc = Nokogiri::HTML(URI.open('http://www.thatwebsite.com/downloads'))

doc.css('a').each do |link|
  href = link['href']
  next unless href =~ /\.pdf\z/i   # only follow links that end in .pdf

  begin
    # Note: relative hrefs need to be resolved against the page URL first
    File.open(File.basename(href), 'wb') do |file|
      file.write(URI.open(href).read)
    end
  rescue => ex
    puts "Something went wrong: #{ex.message}"
  end
end

You might want to do some better exception catching, but I think you get the idea :)
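
For example, you could rescue the specific errors open-uri tends to raise instead of catching everything. A minimal sketch; the URL and output filename are placeholders:

require 'open-uri'

href = 'http://www.thatwebsite.com/files/report.pdf'  # placeholder URL

begin
  File.open('report.pdf', 'wb') { |f| f.write(URI.open(href).read) }
rescue OpenURI::HTTPError => ex
  puts "Server returned an error for #{href}: #{ex.message}"
rescue SocketError, Errno::ECONNREFUSED => ex
  puts "Could not reach the host: #{ex.message}"
end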

#2 (1 vote)

Have you tried scrapi? You can scrape the page with CSS selectors.

Ryan Bates also made a screencast about it.

To download the files you can use open-uri:

require 'open-uri'

url = "http://example.com/document.pdf"
# Read the remote file and save it to disk
contents = URI.open(url).read
File.open("document.pdf", "wb") { |f| f.write(contents) }
