I am trying to speed up a Ruby algorithm. I have a Rails app that uses Active Record and Nokogiri to visit a list of URLs stored in a database, scrape the main image from each page, and save it under the image attribute associated with that URL.
This Rails task usually takes about 2 minutes 30 seconds to complete, and I am trying to speed it up as a learning exercise. Would it be possible to use C through RubyInline, together with raw SQL, to achieve the same result? My only issue is that if I drop down to C I lose the database connection that Active Record gives me in Ruby, and I have no idea how to write SQL queries alongside the C code that will properly connect to my database.
Has anyone had experience with this, or does anyone know whether it's even possible? I'm doing this primarily as a learning exercise. Here is the code that I want to translate into C and SQL, if you are interested:
require 'open-uri'

task :getimg => :environment do
  stories = FeedEntry.all
  stories.each do |story|
    if story.image.nil?
      url = story.url
      doc = Nokogiri::HTML(open(url))
      if doc.at_css(".full-width img")
        img = doc.at_css(".full-width img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".body-width img")
        img = doc.at_css(".body-width img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".body-narrow-width img")
        img = doc.at_css(".body-narrow-width img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".caption img")
        img = doc.at_css(".caption img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".cnnArticleGalleryPhotoContainer img")
        img = doc.at_css(".cnnArticleGalleryPhotoContainer img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".cnn_strylftcntnt div img")
        img = doc.at_css(".cnn_strylftcntnt div img")[:src]
        story.image = img
        story.save!
      elsif doc.at_css(".cnn_stryimg640captioned img")
        img = doc.at_css(".cnn_stryimg640captioned img")[:src]
        story.image = img
        story.save!
      end
    else
      # do nothing -- the story already has an image
    end
  end
end
I would appreciate any and all help and insights in this matter. Thank you in advance!!
2 Answers
#1
Speed of DB Saving
I've written a web crawler in Ruby, and I found that one of the bottlenecks that can affect performance is the actual creation of the rows in the database. It's faster to do a single mass insert at the end of extracting all the URLs than to do many individual inserts (at least for Postgres).
So instead of calling YourModel.save! for every URL you visit, just push every URL onto an array that keeps track of the URLs you need to save to the database. Then, once you've finished scraping all the links, do a mass insert of all the image links through a single SQL command.
require 'open-uri'

# CONN is a constant declared at the top of your file that holds
# the database connection
CONN = ActiveRecord::Base.connection

to_insert = []
stories.each do |story|
  url = story.url
  doc = Nokogiri::HTML(open(url))
  img_url = doc.at_css("img")[:src]
  to_insert.push "(#{CONN.quote(img_url)})"
end

# notice the mass insert at the end
sql = "INSERT INTO your_table (img_url) VALUES #{to_insert.join(", ")}"
CONN.execute sql
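The snippet above inserts the image URLs into a table of their own. For the original task, where each existing FeedEntry row needs its image column filled in, the same idea becomes one bulk UPDATE at the end. A minimal sketch, assuming the table is named feed_entries and the column is image (adjust to your schema):

# Collect [story.id, img_url] pairs inside the scraping loop
# (updates << [story.id, img_url]), then update everything in one statement.
updates = []
# ... scraping loop fills `updates` here ...

conn  = ActiveRecord::Base.connection
cases = updates.map { |id, img| "WHEN #{id.to_i} THEN #{conn.quote(img)}" }.join(" ")
ids   = updates.map { |id, _| id.to_i }.join(", ")
conn.execute("UPDATE feed_entries SET image = CASE id #{cases} END WHERE id IN (#{ids})") unless updates.empty?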
"Speed Up" Downloading
The downloading of links will also be a bottleneck. Thus, the best option would be to create a thread pool, where each thread is allocated a partition of URLs from the database to scrape. This way, you will never be stuck waiting for a single page to download before you do any real processing.
Some pseudo-ish Ruby code:
number_of_workers = 10
(1..number_of_workers).each do |worker|
  Thread.new do
    begin
      urls_to_scrape_for_this_thread = [...list of urls to scrape...]
      until urls_to_scrape_for_this_thread.empty?
        url = urls_to_scrape_for_this_thread.shift
        scrape(url)
      end
    rescue => e
      puts "========================================"
      puts "Thread ##{worker} error"
      puts e.message
      puts e.backtrace
      puts "========================================"
      raise e
    end
  end
end
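A more concrete version of the same idea, sketched under a few assumptions: it runs inside the same :environment rake task, uses a shared thread-safe Queue instead of pre-partitioned URL lists, and joins the workers so the task doesn't exit before they finish. The CSS selector is just a stand-in for whatever extraction logic you settle on, and the database pool in database.yml should be at least as large as the number of workers:

require 'open-uri'

queue = Queue.new
FeedEntry.where(image: nil).find_each { |story| queue << story }

workers = 10.times.map do
  Thread.new do
    # Each worker checks out its own connection from the Active Record pool.
    ActiveRecord::Base.connection_pool.with_connection do
      while (story = (queue.pop(true) rescue nil))  # non-blocking pop; nil once the queue is empty
        doc = Nokogiri::HTML(open(story.url))
        img = doc.at_css(".full-width img")         # stand-in selector
        story.update(image: img[:src]) if img
      end
    end
  end
end
workers.each(&:join)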
#2
- Are the URLs remote? If so, first benchmark the task to see how much of the time is network latency (see the sketch after this list). If that's the bottleneck, it has little to do with your code or your choice of language.
- How many FeedEntry records do you have in your database? I suggest using FeedEntry.find_each instead of FeedEntry.all.each, because the former loads 1,000 entries into memory at a time, processes them, and then loads the next 1,000, while the latter loads all entries into memory and then iterates over them, which requires more memory and increases GC cycles.
- If the bottleneck is neither of the above, then maybe it's the DOM node searching that is slow. You can find the (only one?) img node, then check its parent or grandparent node as necessary, and update your entries accordingly:

image_node = doc.at_css('img')
story.update image: image_node['src'] if needed?(image_node)

def needed?(image_node)
  parent_node  = image_node.parent
  parent_class = parent_node['class']
  return true if parent_class == 'full-width'
  return true if parent_class == 'body-width'
  return true if parent_class == 'body-narrow-width'
  return true if parent_class == 'caption'
  return true if parent_class == 'cnnArticleGalleryPhotoContainer'
  return true if parent_class == 'cnn_stryimg640captioned'
  return false unless parent_node.name == 'div'  # node_type returns an integer, so compare the tag name
  return true if parent_node.parent['class'] == 'cnn_strylftcntnt'
  false
end
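For the first point, a quick way to check is to time only the HTTP fetch for a small sample of entries before changing anything else. A minimal sketch (the sample size of 20 is arbitrary):

require 'benchmark'
require 'open-uri'

# Time just the network fetch for a handful of stories to see whether
# latency dominates the ~2.5 minutes.
sample  = FeedEntry.limit(20)
elapsed = Benchmark.realtime do
  sample.each { |story| open(story.url).read }
end
puts "Average fetch time: #{(elapsed / sample.size).round(2)} s per URL"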