There are lots of examples of how to strip HTML tags from a document using Ruby; Hpricot and Nokogiri both have inner_text methods that remove all HTML for you easily and quickly.
What I am trying to do is the opposite: remove all the text from an HTML document, leaving just the tags and their attributes.
I considered looping through the document and setting inner_html to nil on each element, but you would really have to do this in reverse: the first element (the root) has an inner_html containing the entire rest of the document. Ideally I'd have to start at the innermost element and set inner_html to nil while moving up through the ancestors.
Does anyone know a neat little trick for doing this efficiently? I was thinking a regex might do it, but probably not as efficiently as an HTML tokenizer/parser would.
4 Solutions
#1
38
This works too:
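require 'nokogiri'

doc = Nokogiri::HTML(your_html)  # your_html is the HTML source as a String
doc.xpath("//text()").remove     # remove every text node, leaving tags and attributes
puts doc.to_html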
#2
3
You can scan the string to create an array of "tokens", and then select only those that are HTML tags:
>> some_html
=> "<div>foo bar</div><p>I like <em>this</em> stuff <a href='http://foo.bar'> long time</a></p>"
>> some_html.scan(/<\/?[^>]+>|[\w\|`~!@#\$%^&*\(\)\-_\+=\[\]{}:;'",\.\/?]+|\s+/).select { |t| t =~ /<\/?[^>]+>/ }.join("")
=> "<div></div><p><em></em><a href='http://foo.bar'></a></p>"
==Edit==
Or even better, just scan for HTML tags ;)
>> some_html.scan(/<\/?[^>]+>/).join("")
=> "<div></div><p><em></em><a href='http://foo.bar'></a></p>"
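The scan approach above can be wrapped in a small method. The sketch below is illustrative only: the names TAG_PATTERN and tags_only and the sample strings are assumptions, not part of the answer. It also shows one limitation worth knowing about: the pattern treats the first ">" as the end of a tag, so a ">" inside a quoted attribute value cuts the tag short.

```ruby
# Same pattern as in the answer: "<", an optional "/", then everything up to the next ">".
TAG_PATTERN = /<\/?[^>]+>/

# Hypothetical helper: keep only the substrings that look like tags.
def tags_only(html)
  html.scan(TAG_PATTERN).join
end

simple = "<div>foo bar</div><p>I like <em>this</em> stuff</p>"
puts tags_only(simple)
# prints: <div></div><p><em></em></p>

# A ">" inside an attribute value ends the match early.
tricky = %q{<a title="a > b">link</a>}
puts tags_only(tricky)
# prints: <a title="a ></a>
```

For markup you control this is probably fine; for arbitrary HTML, a parser-based approach avoids this kind of edge case.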
#3
3
To grab everything not in a tag, you can use nokogiri like this:
doc.search('//text()').text
Of course, that will grab stuff like the contents of <script> or <style> tags, so you could also remove blacklisted tags:
blacklist = ['title', 'script', 'style']
nodelist = doc.search('//text()')
# Subtract the text nodes that sit directly inside each blacklisted tag
blacklist.each do |tag|
  nodelist -= doc.search('//' + tag + '/text()')
end
nodelist.text
You could also whitelist if you preferred, but that's probably going to be more time-intensive:
whitelist = ['p', 'span', 'strong', 'i', 'b'] # The list goes on and on...
nodelist = Nokogiri::XML::NodeSet.new(doc)
# Collect the text nodes that sit directly inside each whitelisted tag
whitelist.each do |tag|
  nodelist += doc.search('//' + tag + '/text()')
end
nodelist.text
You could also just build a huge XPath expression and do one search. I honestly don't know which way is faster, or if there is even an appreciable difference.
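One way to build that single expression is to join one path per tag with the XPath union operator "|". This is only a sketch: the helper name text_xpath_for is hypothetical, and it makes no claim about which variant is faster.

```ruby
# Hypothetical helper: build one XPath expression that selects the text
# nodes directly inside any of the given tags, joined with the union operator.
def text_xpath_for(tags)
  tags.map { |tag| "//#{tag}/text()" }.join(' | ')
end

whitelist = ['p', 'span', 'strong', 'i', 'b']
puts text_xpath_for(whitelist)
# prints: //p/text() | //span/text() | //strong/text() | //i/text() | //b/text()
```

The resulting string could then be passed to a single call such as doc.search(text_xpath_for(whitelist)).text in place of the loop, though the nodes may come back in a different order than the loop produces.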
#4
0
I just came up with this, but @andre-r's solution is so much better!
#!/usr/bin/env ruby
require 'nokogiri'

# Blank out the content of every text node, then serialize the document.
def strip_text(html)
  Nokogiri(html).tap { |doc|
    doc.traverse do |node|
      node.content = nil if node.text?
    end
  }.to_s
end

require 'test/unit'
require 'yaml'

class TestHTMLStripping < Test::Unit::TestCase
  def test_that_all_text_gets_stripped_from_the_document
    dirty, clean = YAML.load DATA
    assert_equal clean, strip_text(dirty)
  end
end
__END__
---
- |
  <!DOCTYPE html>
  <html xmlns='http://www.w3.org/1999/xhtml' xml:lang='en' lang='en'>
  <head>
  <meta http-equiv='Content-type' content='text/html; charset=UTF-8' />
  <title>Test HTML Document</title>
  <meta http-equiv='content-language' content='en' />
  </head>
  <body>
  <h1>Test <abbr title='Hypertext Markup Language'>HTML</abbr> Document</h1>
  <div class='main'>
  <p>
  <strong>Test</strong> <abbr title='Hypertext Markup Language'>HTML</abbr> <em>Document</em>
  </p>
  </div>
  </body>
  </html>
- |
  <!DOCTYPE html>
  <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
  <head>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title></title>
  <meta http-equiv="content-language" content="en">
  </head>
  <body><h1><abbr title="Hypertext Markup Language"></abbr></h1><div class="main"><p><strong></strong><abbr title="Hypertext Markup Language"></abbr><em></em></p></div></body>
  </html>