I want to remove all text from an HTML page that I load with Nokogiri. For example, if a page has the following:
<body><script>var x = 10;</script><div>Hello</div><div><h1>Hi</h1></div></body>
I want to process it with Nokogiri, strip the text, and get back HTML like the following:
<body><script>var x = 10;</script><div></div><div><h1></h1></div></body>
(That is, remove the actual h1 text, text between divs, text in p elements etc, but keep the tags. Also, don't remove text in the script tags.)
1 Answer
require 'nokogiri'
html = "<body><script>var x = 10;</script><div>Hello</div><div><h1>Hi</h1></div></body>"
hdoc = Nokogiri::HTML(html)
hdoc.xpath('//*[text()]').each do |el|
  el.content = '' unless el.name == "script"
end
puts hdoc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body>
#=> <script>var x = 10;</script><div></div>
#=> <div><h1></h1></div>
#=> </body></html>
Warning: As you did not specify how to handle a case like <div>foo<h1>bar</h1></div>
the above may or may not do what you expect: setting content on the <div> replaces all of its children, so the nested <h1> is removed along with the text. Alternatively, the following removes only the text nodes and may match your needs:
hdoc.xpath('//text()').each do |el|
  el.remove unless el.parent.name == "script"
end
Update
Here's a more elegant solution using a single XPath expression to select all text nodes whose parent is not a <script>
element. I've added more text nodes to show how it handles them.
require 'nokogiri'
hdoc = Nokogiri::HTML <<ENDHTML
<body>
<script>var x = 10;</script>
<div>Hello</div>
<div>foo<h1>Hi</h1>bar</div>
</body>
ENDHTML
hdoc.xpath( '//text()[not(parent::script)]' ).each{ |text| text.remove }
puts hdoc
#=> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
#=> <html><body>
#=> <script>var x = 10;</script><div></div>
#=> <div><h1></h1></div>
#=> </body></html>
With Ruby 1.9 or later, the core line can be written more simply:
hdoc.xpath( '//text()[not(parent::script)]' ).each(&:remove)