How do I get everything inside a div using Nokogiri?

Time: 2022-11-25 15:13:44

I am using Nokogiri to scrape a site that looks like this:

<div class="BOX">
  <div class="apple">This is an apple.</div>
  <p>Apple a day, doctor away</p>
</div>

<div class="BOX">
  <div class="iphone">This is an iPhone.</div>
  <div class="android">This is an Android.</div>
  <a href="www.apple.com">Apple home page</a>
  <p>Snoop Lion has both. He's rich.</p>
</div>

I would like to scrape everything within the "BOX" div. Each "BOX" has its own unique divs and HTML tags, with no apparent patterns. How would I do this?

My first attempt looked like this:

require 'open-uri'
require 'nokogiri'

doc = Nokogiri::HTML(URI.open('http://www.examplesite.com'))
doc.css('BOX').each do |box|
  puts box.content
end

But it returns nothing. May I please have an explanation of what's going on?

2 solutions

#1


3  

I think you should use the #inner_html method instead of #content. Your CSS class selector is also wrong: a class selector needs a leading dot. The code should look like this:

require 'nokogiri'

doc = Nokogiri::HTML::Document.parse <<-eot
<div class="BOX">
  <div class="apple">This is an apple.</div>
  <p>Apple a day, doctor away</p>
</div>

<div class="BOX">
  <div class="iphone">This is an iPhone.</div>
  <div class="android">This is an Android.</div>
  <a href="www.apple.com">Apple home page</a>
  <p>Snoop Lion has both. He's rich.</p>
</div>
eot

doc.css('.BOX').each do |n|
  # puts prints the markup as-is; p would print an escaped, quoted string
  puts n.inner_html
end

output:

  <div class="apple">This is an apple.</div>
  <p>Apple a day, doctor away</p>

  <div class="iphone">This is an iPhone.</div>
  <div class="android">This is an Android.</div>
  <a href="www.apple.com">Apple home page</a>
  <p>Snoop Lion has both. He's rich.</p>

#content gives you only the text of each div node, with the HTML tags inside it stripped out. See below:

doc.css('.BOX').each do |n|
  puts n.content
end

output:

  This is an apple.
  Apple a day, doctor away

  This is an iPhone.
  This is an Android.
  Apple home page
  Snoop Lion has both. He's rich.

#2


4  

You missed a dot (.).

Without the dot, the selector matches a <BOX> tag. To match an element with class="BOX", prefix the class name with a dot.

doc.css('.BOX').each do |box|
  #      ^-- here
  puts box.content
end
