I wrote a web crawler in Ruby and I'm using Nokogiri::HTML to parse the page. I need to print the page out, and while messing around in IRB I noticed a pretty_print method. However, it takes a parameter and I can't figure out what it wants.
My crawler is caching the HTML of the webpages and writing it to files on my local machine. I would like to "pretty print" the HTML so that it looks nice and properly formatted when I do so.
6 answers
#1
20
By "pretty printing" an HTML page I presume you mean that you want to reformat the HTML structure with proper indentation. Nokogiri doesn't support this; the pretty_print method is for the "pp" library, and its output is useful for debugging only.
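To illustrate what the parameter is: pp calls pretty_print with a PP (PrettyPrint) instance, conventionally named q. The sketch below uses a hypothetical Page class, not anything from Nokogiri, only to show that calling convention with the standard library.

```ruby
require 'pp'
require 'stringio'

# Hypothetical class, used only to show what pretty_print receives.
class Page
  def initialize(url, links)
    @url = url
    @links = links
  end

  # pp passes in a PP instance (q); the method describes the object's
  # layout through it rather than printing anything itself.
  def pretty_print(q)
    q.group(2, "#<Page #{@url}", ">") do
      q.breakable   # a space, or a line break if the line gets too long
      q.pp(@links)  # delegate to the nested object's own pretty_print
    end
  end
end

out = StringIO.new
PP.pp(Page.new("http://example.com", ["/a", "/b"]), out)
puts out.string
```

So the parameter is not something you would normally construct by hand; you call pp(obj) and the library supplies it.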
There are several projects that understand HTML well enough to be able to reformat it without destroying whitespace that is actually significant (the famous one is HTML Tidy), but by googling I've found this post titled "Pretty printing XHTML with Nokogiri and XSLT".
It comes down to this:
require 'nokogiri'
xsl = Nokogiri::XSLT(File.open("pretty_print.xsl"))
html = Nokogiri(File.open("source.html"))
puts xsl.apply_to(html).to_s
It requires you, of course, to download the linked xsl file to your filesystem. I've tried it very quickly on my machine and it works like a charm.
#2
69
The answer by @mislav is somewhat wrong. Nokogiri does support pretty-printing if you:
- Parse the document as XML
- Instruct Nokogiri to ignore whitespace-only nodes ("blanks") during parsing
- Use to_xhtml or to_xml to specify pretty-printing parameters
In action:
html = '<section>
<h1>Main Section 1</h1><p>Intro</p>
<section>
<h2>Subhead 1.1</h2><p>Meat</p><p>MOAR MEAT</p>
</section><section>
<h2>Subhead 1.2</h2><p>Meat</p>
</section></section>'
require 'nokogiri'
doc = Nokogiri::XML(html,&:noblanks)
puts doc
#=> <section>
#=>   <h1>Main Section 1</h1>
#=>   <p>Intro</p>
#=>   <section>
#=>     <h2>Subhead 1.1</h2>
#=>     <p>Meat</p>
#=>     <p>MOAR MEAT</p>
#=>   </section>
#=>   <section>
#=>     <h2>Subhead 1.2</h2>
#=>     <p>Meat</p>
#=>   </section>
#=> </section>
puts doc.to_xhtml( indent:3, indent_text:"." )
#=> <section>
#=> ...<h1>Main Section 1</h1>
#=> ...<p>Intro</p>
#=> ...<section>
#=> ......<h2>Subhead 1.1</h2>
#=> ......<p>Meat</p>
#=> ......<p>MOAR MEAT</p>
#=> ...</section>
#=> ...<section>
#=> ......<h2>Subhead 1.2</h2>
#=> ......<p>Meat</p>
#=> ...</section>
#=> </section>
#3
4
You can try REXML:
require "rexml/document"
doc = REXML::Document.new(xml)
doc.write($stdout, 2)
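A slightly fuller sketch of the REXML route, in case you want the result as a string (to write to a cache file) rather than on $stdout. The sample xml value is made up, and I'm assuming the formatter's compact setting is what you want for keeping short text-only elements on one line; REXML also expects well-formed XML, so it may not cope with real-world HTML.

```ruby
require 'rexml/document'
require 'rexml/formatters/pretty'
require 'stringio'

xml = "<a><b>text</b><c/></a>"  # sample input, for illustration only
doc = REXML::Document.new(xml)

formatter = REXML::Formatters::Pretty.new(2)  # indent nested elements by 2 spaces
formatter.compact = true                      # try to keep short text-only elements inline

out = StringIO.new
formatter.write(doc, out)  # write the formatted document into the buffer
puts out.string
```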
#4
4
This worked for me:
pretty_html = Nokogiri::HTML(html).to_xhtml(indent: 3)
I tried the REXML version above, but it corrupted some of my documents. And I hate to bring xslt into a new project. Both feel antiquated. :)
#5
2
My solution was to add a print method onto the actual Nokogiri objects. After you run the code in the snippet below, you should be able to just write node.print, and it'll pretty-print the contents. No XSLT required :-)
Nokogiri::XML::Node.class_eval do
  # Print every Node by default (will be overridden by CharacterData)
  define_method :should_print? do
    true
  end

  # Duplicate this node, replace the contents of the duplicated node with a
  # newline. With this content substitution, the #to_s method conveniently
  # returns a string with the opening tag (e.g. `<a href="foo">`) on the first
  # line and the closing tag on the second (e.g. `</a>`, provided that the
  # current node is not a self-closing tag).
  #
  # Now, print the open tag preceded by the correct amount of indentation, then
  # recursively print this node's children (with extra indentation), and then
  # print the close tag (if there is a closing tag)
  define_method :print do |indent=0|
    duplicate = self.dup
    duplicate.content = "\n"
    open_tag, close_tag = duplicate.to_s.split("\n")
    puts((" " * indent) + open_tag)
    self.children.select(&:should_print?).each { |child| child.print(indent + 2) }
    puts((" " * indent) + close_tag) if close_tag
  end
end

Nokogiri::XML::CharacterData.class_eval do
  # Only print CharacterData if there's non-whitespace content
  define_method :should_print? do
    content =~ /\S+/
  end

  # Replace every run of consecutive whitespace characters by a single space;
  # precede the output by a certain amount of indentation; print this text.
  # (gsub rather than sub, so that all runs are collapsed, not just the first.)
  define_method :print do |indent=0|
    puts((" " * indent) + to_s.strip.gsub(/\s+/, ' '))
  end
end
#6
-4
Why don't you try the pp method?
require 'pp'
pp some_var