
Time: 2022-04-30 02:43:04

I am trying to speed up a Ruby algorithm. I have a Rails app that uses Active Record and Nokogiri to visit a list of URLs stored in a database, scrape the main image from each page, and save it under the image attribute associated with that URL.


This Rails task usually takes about 2:30 s to complete, and I am trying to speed it up as a learning exercise. Would it be possible to use C through RubyInline and raw SQL to achieve the desired result? My only issue is that if I use C, I lose the database connection that Active Record provided in Ruby, and I have no idea how to write SQL queries alongside the C code that will properly connect to my db.


Has anyone had experience with this, or does anyone know whether it's even possible? This is primarily a learning exercise. Here is the code that I want to translate into C and SQL, if you are interested:


require "open-uri"

task :getimg => :environment do
    stories = FeedEntry.all

    stories.each do |story|
        if story.image.nil?
            url = story.url
            # Kernel#open no longer opens URLs in Ruby 3+, so use URI.open
            doc = Nokogiri::HTML(URI.open(url))

            if doc.at_css(".full-width img")
                img = doc.at_css(".full-width img")[:src]
                story.image = img
            elsif doc.at_css(".body-width img")
                img = doc.at_css(".body-width img")[:src]
                story.image = img
            elsif doc.at_css(".body-narrow-width img")
                img = doc.at_css(".body-narrow-width img")[:src]
                story.image = img
            elsif doc.at_css(".caption img")
                img = doc.at_css(".caption img")[:src]
                story.image = img
            elsif doc.at_css(".cnnArticleGalleryPhotoContainer img")
                img = doc.at_css(".cnnArticleGalleryPhotoContainer img")[:src]
                story.image = img
            elsif doc.at_css(".cnn_strylftcntnt div img")
                img = doc.at_css(".cnn_strylftcntnt div img")[:src]
                story.image = img
            elsif doc.at_css(".cnn_stryimg640captioned img")
                img = doc.at_css(".cnn_stryimg640captioned img")[:src]
                story.image = img
            else
                # do nothing
            end

            # persist the image found above (a no-op if nothing changed)
            story.save
        end
    end
end

I would appreciate any and all help and insights in this matter. Thank you in advance!!


2 Solutions



Speed of DB Saving

I've written a web crawler in Ruby, and I found that one of the bottlenecks that can affect performance is the actual creation of the row in the database. It's faster to have a single mass insert at the end of extracting all URLs than to have multiple individual inserts (at least for Postgres).


So instead of calling YourModel.save! for every URL you visit, push every image URL onto an array that keeps track of what you need to save to the database. Then, once you've finished scraping all the links, do a mass insert of all the image links through a single SQL command.


require "open-uri"

# CONN is a constant declared at the top of your file; it holds the database connection
CONN = ActiveRecord::Base.connection

to_insert = []
stories.each do |story|
    url = story.url
    doc = Nokogiri::HTML(URI.open(url))

    img = doc.at_css("img")
    next unless img # skip pages without any image

    # quote the value so it becomes a valid, escaped SQL string literal
    to_insert.push "(#{CONN.quote(img[:src])})"
end

# notice the mass insert at the end
sql = "INSERT INTO your_table (img_url) VALUES #{to_insert.join(", ")}"
CONN.execute sql unless to_insert.empty?
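
Each value interpolated into the VALUES list has to be a quoted SQL string literal, otherwise the statement is invalid (and open to injection). Here is a minimal sketch of how the single statement can be assembled. The helper names `quote_literal` and `bulk_insert_sql` are made up for illustration, and `quote_literal` is only a simplified stand-in that doubles single quotes; in a Rails app you would rather rely on the connection's own `quote` method.

```ruby
# Simplified stand-in for the database adapter's quoting (assumption:
# standard SQL string literals, where a single quote is escaped by doubling it).
def quote_literal(value)
  "'" + value.to_s.gsub("'", "''") + "'"
end

# Builds one multi-row INSERT statement, or nil when there is nothing to insert.
# The table and column names are trusted identifiers, not user input.
def bulk_insert_sql(table, column, values)
  return nil if values.empty?

  rows = values.map { |value| "(#{quote_literal(value)})" }
  "INSERT INTO #{table} (#{column}) VALUES #{rows.join(', ')}"
end

image_urls = ["http://example.com/a.jpg", "http://example.com/o'neil.jpg"]
puts bulk_insert_sql("your_table", "img_url", image_urls)
```

Returning nil for an empty list lets the caller skip the database round trip entirely when no images were found.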

"Speed Up" Downloading

The downloading of links will also be a bottleneck. Thus, the best option would be to create a thread pool, where each thread is allocated a partition of URLs from the database to scrape. This way, you will never be stuck waiting for a single page to download before you do any real processing.


Some pseudo-ish Ruby code:


number_of_workers = 10
threads = (1..number_of_workers).map do |worker|
    Thread.new do
        begin
            urls_to_scrape_for_this_thread = [] # ...list of urls to scrape...
            until urls_to_scrape_for_this_thread.empty?
                url = urls_to_scrape_for_this_thread.shift # take one url from the list
                # download and process url here
            end
        rescue => e
            puts "========================================"
            puts "Thread # #{worker} error"
            puts "#{e.message}"
            puts "#{e.backtrace}"
            puts "======================================="
            raise e
        end
    end
end
threads.each(&:join) # wait for every worker to finish
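
As a variation on fixed partitions, the workers can also pull URLs from a shared thread-safe queue, so that a slow page does not leave one thread with a backlog while the others sit idle. The following is only a sketch under assumptions: `scrape_in_parallel` and the `fetch` callable are hypothetical names, and `fetch` stands in for whatever downloads a page and extracts the image URL.

```ruby
# Fixed-size worker pool fed by a thread-safe queue (Queue is built into Ruby).
# `fetch` is any callable that takes a URL and returns the scraped value.
def scrape_in_parallel(urls, number_of_workers: 10, fetch:)
  queue = Queue.new
  urls.each { |url| queue << url }
  queue.close # once drained, pop returns nil and the workers stop

  results = Queue.new # thread-safe collection of [url, result] pairs
  workers = Array.new(number_of_workers) do
    Thread.new do
      while (url = queue.pop)
        begin
          results << [url, fetch.call(url)]
        rescue => e
          results << [url, e] # keep the error instead of killing the worker
        end
      end
    end
  end
  workers.each(&:join)

  Array.new(results.size) { results.pop }.to_h
end

# Usage with a fake fetcher, so the sketch runs without network access.
fake_fetch = ->(url) { "#{url}/image.jpg" }
images = scrape_in_parallel(%w[http://a.example http://b.example],
                            number_of_workers: 2, fetch: fake_fetch)
puts images.inspect
```

The returned hash maps each URL to its result (or to the exception raised for it), so the database writes can happen afterwards on the calling thread, for example as one mass insert.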



  1. Are the URLs remote? If so, first benchmark the task to measure the network latency. If that's the bottleneck, changing your code or your choice of language will not help much.


  2. How many FeedEntry records do you have in your database? I suggest using FeedEntry.find_each instead of FeedEntry.all.each, because the former loads 1000 entries (the default batch size) into memory, processes them, and then loads the next 1000, while the latter loads all entries into memory before iterating over them, which requires more memory and increases GC cycles.


  3. If the bottleneck is neither of the above, then maybe it's the DOM node searching that is slow. You can find the (only one?) img node, then check its parent node, or its grandparent node if necessary, and update your entries accordingly.


     # define the helper before it is used
     def needed?(image_node)
       parent_node = image_node.parent
       parent_class = parent_node['class']
       return true if parent_class == 'full-width'
       return true if parent_class == 'body-width'
       return true if parent_class == 'body-narrow-width'
       return true if parent_class == 'caption'
       return true if parent_class == 'cnnArticleGalleryPhotoContainer'
       return true if parent_class == 'cnn_stryimg640captioned'
       # node_type returns an integer constant, so compare the tag name instead
       return false unless parent_node.name == 'div'
       parent_node.parent['class'] == 'cnn_strylftcntnt'
     end

     image_node = doc.at_css('img')
     story.update image: image_node['src'] if image_node && needed?(image_node)
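
To find out which of the points above is the actual bottleneck, each phase can be timed separately with Ruby's standard Benchmark module. This is a rough sketch under assumptions: `time_phases` and the three callables are hypothetical names, and the stubs in the usage lines only simulate the work. In the real task, `download` could wrap `URI.open(url).read`, `parse` could wrap `Nokogiri::HTML(html)`, and `save` could wrap the record update.

```ruby
require "benchmark"

# Times the download, parse, and save phases for a single URL and
# returns the elapsed wall-clock seconds for each phase.
def time_phases(url, download:, parse:, save:)
  timings = {}
  html = nil
  doc = nil
  timings[:download] = Benchmark.realtime { html = download.call(url) }
  timings[:parse]    = Benchmark.realtime { doc = parse.call(html) }
  timings[:save]     = Benchmark.realtime { save.call(doc) }
  timings
end

# Usage with stand-in callables; the sleep simulates network latency.
timings = time_phases(
  "http://example.com/story",
  download: ->(_url) { sleep 0.05; "<html></html>" },
  parse:    ->(html) { html.length },
  save:     ->(_doc) { true }
)
timings.each { |phase, seconds| puts format("%-9s %.4f s", phase, seconds) }
```

If the download phase dominates, as it does with these stubs, rewriting the parsing or the database code in C is unlikely to change the total time much.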


