Python: How to pretty-print HTML to a file

Time: 2022-10-23 21:45:09

I am using lxml.html to generate some HTML. I want to pretty print (with indentation) my final result into an html file. How do I do that?

This is what I have tried so far (I am relatively new to Python and lxml):

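import lxml.html as lh
from lxml.html import builder as E

# Build an outer <div class="scroll"> that contains an inner <div class="scrollContainer">
sliderRoot = lh.Element("div", E.CLASS("scroll"), style="overflow-x: hidden; overflow-y: hidden;")
scrollContainer = lh.Element("div", E.CLASS("scrollContainer"), style="width: 4340px;")
sliderRoot.append(scrollContainer)
# encoding="unicode" makes tostring() return str, so print() works on Python 3
print(lh.tostring(sliderRoot, pretty_print=True, method="html", encoding="unicode"))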

As you can see, I am passing the pretty_print=True argument. I thought that would give indented code, but it doesn't really help. This is the output:

<div style="overflow-x: hidden; overflow-y: hidden;" class="scroll"><div style="width: 4340px;" class="scrollContainer"></div></div>

9 answers

#1


66  

I ended up using BeautifulSoup directly. That is something lxml.html.soupparser uses for parsing HTML.

BeautifulSoup has a prettify method that does exactly what it says it does. It prettifies the HTML with proper indents and everything.

BeautifulSoup will NOT fix the HTML, so broken code remains broken. But in this case, since the code is being generated by lxml, the HTML should be at least semantically correct.

For the example given in my question, I have to do this:

from bs4 import BeautifulSoup as bs  # BeautifulSoup 4; the old "BeautifulSoup" (v3) package is Python 2 only
root = lh.tostring(sliderRoot)       # convert the generated HTML to a string
soup = bs(root, "html.parser")       # parse it with BeautifulSoup
prettyHTML = soup.prettify()         # prettify the HTML

#2


23  

Though my answer might not be helpful now, I am leaving it here as a reference for anybody else in the future.

lxml.html.tostring(), indeed, doesn't pretty print the provided HTML in spite of pretty_print=True.

However, lxml.etree, the "sibling" of lxml.html, handles it well.

So one might use it as follows:

from lxml import etree, html

document_root = html.fromstring("<html><body><h1>hello world</h1></body></html>")
print(etree.tostring(document_root, encoding='unicode', pretty_print=True))

The output is like this:

<html>
  <body>
    <h1>hello world</h1>
  </body>
</html>
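
Since the question asks about getting the result into a file, here is a minimal sketch that writes the pretty-printed string from etree.tostring() to disk. The helper name write_pretty_html and the file name pretty.html are illustrative assumptions, not part of lxml. Note that etree.tostring() serializes as XML by default, so void elements come out self-closed (for example <br/>).

```python
import os
import tempfile

from lxml import etree, html


def write_pretty_html(root, path):
    """Serialize an lxml element with indentation and write it to `path`."""
    # encoding="unicode" returns str rather than bytes
    pretty = etree.tostring(root, encoding="unicode", pretty_print=True)
    with open(path, "w", encoding="utf-8") as f:
        f.write(pretty)
    return pretty


document_root = html.fromstring("<html><body><h1>hello world</h1></body></html>")
# "pretty.html" is a hypothetical file name; a temp directory keeps the demo self-contained
out_path = os.path.join(tempfile.mkdtemp(), "pretty.html")
pretty = write_pretty_html(document_root, out_path)
print(pretty)
```

Opening the file in text mode with an explicit encoding keeps the written bytes predictable regardless of the platform default.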

#3


3  

If you store the HTML as an unformatted string, in a variable html_string, it can be done using beautifulsoup4 as follows:

from bs4 import BeautifulSoup
print(BeautifulSoup(html_string, 'html.parser').prettify())

#4


2  

Under the hood, lxml uses libxml2 to serialize the tree back into a string. Here is the relevant snippet of code that determines whether to append a newline after closing a tag:

    xmlOutputBufferWriteString(buf, ">");
    if ((format) && (!info->isinline) && (cur->next != NULL)) {
        if ((cur->next->type != HTML_TEXT_NODE) &&
            (cur->next->type != HTML_ENTITY_REF_NODE) &&
            (cur->parent != NULL) &&
            (cur->parent->name != NULL) &&
            (cur->parent->name[0] != 'p')) /* p, pre, param */
            xmlOutputBufferWriteString(buf, "\n");
    }
    return;

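So a newline is written after a closing tag only if formatting is enabled, the element is not an inline tag, it is followed by a sibling node (cur->next != NULL) that is neither a text node nor an entity reference, and its parent's name does not start with 'p' (p, pre, param). In the question's example neither div has a following sibling, so this branch never writes a newline.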
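
To make the condition easier to follow, here is a rough Python paraphrase of the C snippet above. It is only a sketch of that one branch, not lxml or libxml2 API: the function name, its parameters, and the string node-type labels are all invented for illustration.

```python
from typing import Optional


def writes_newline_after_close(format_output: bool,
                               is_inline: bool,
                               next_type: Optional[str],
                               parent_name: Optional[str]) -> bool:
    """Mirror the quoted libxml2 condition (hypothetical helper, not a real API).

    next_type is None when the node has no following sibling (cur->next == NULL);
    parent_name is None when the node has no parent or the parent has no name.
    """
    # Outer `if`: formatting on, not an inline element, and a sibling follows
    if not format_output or is_inline or next_type is None:
        return False
    # Inner `if`: sibling is not text/entity ref, and parent is not p/pre/param
    return (next_type not in ("text", "entity_ref")
            and parent_name is not None
            and not parent_name.startswith("p"))


# The inner <div> from the question has no following sibling, so no newline
print(writes_newline_after_close(True, False, None, "div"))
# An element followed by another element inside <body> gets a newline
print(writes_newline_after_close(True, False, "element", "body"))
```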

#5


1  

Couldn't you just pipe it into HTML Tidy? Either from the shell or through os.system().
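
A minimal sketch of that idea, using subprocess rather than os.system() so the output can be captured. It assumes the tidy command-line tool is installed and on PATH, and that the -indent, -quiet and --tidy-mark options behave as in current HTML Tidy releases; the helper name tidy_html is made up for this example.

```python
import shutil
import subprocess
from typing import Optional


def tidy_html(raw_html: str) -> Optional[str]:
    """Pipe raw_html through the `tidy` CLI; return None if tidy is not installed."""
    if shutil.which("tidy") is None:
        return None
    proc = subprocess.run(
        ["tidy", "-indent", "-quiet", "--tidy-mark", "no"],
        input=raw_html,
        capture_output=True,
        text=True,
    )
    # Assumption: tidy exits with 1 for warnings and 2 for errors,
    # and only the error case leaves no usable output.
    if proc.returncode >= 2:
        raise RuntimeError(proc.stderr)
    return proc.stdout


pretty = tidy_html("<title>Page Title</title><p>Some text here</p>")
print(pretty if pretty is not None else "tidy is not installed")
```

Unlike the other answers, Tidy also repairs the markup (for example by adding a missing doctype, html and body), so the result may contain more than the fragment you passed in.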

#6


1  

If adding one more dependency is not a problem, you can use the html5print package. Its advantage over the other solutions is that it also beautifies CSS and JavaScript code embedded in the HTML document.

To install it, execute:

pip install html5print

Then, you can either use it as a command:

html5-print ugly.html -o pretty.html

or as Python code:

from html5print import HTMLBeautifier
html = '<title>Page Title</title><p>Some text here</p>'
print(HTMLBeautifier.beautify(html, 4))

#7


0  

If you don't care about quirky HTMLness (that is, unless you absolutely must support those hordes of Netscape 2.0-using clients, for whom having <br> instead of <br /> is a must), you can always change your method to "xml", which seems to work. This is probably a bug in lxml or in libxml, but I couldn't find the reason for it.

#8


0  

Not really my code; I picked it up somewhere:

def indent(elem, level=0):
    # Insert newline + indentation into .text/.tail in place, recursively
    i = '\n' + level * '  '
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + '  '
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for child in elem:
            indent(child, level+1)
        # child is now the last child: dedent its tail so the parent's closing tag lines up
        if not child.tail or not child.tail.strip():
            child.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i

I use it with:

indent(page)
tostring(page)
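
The answer does not say where page and tostring come from. As a self-contained illustration, the sketch below assumes the standard library's xml.etree.ElementTree, which has the same .text/.tail model as lxml; the sample document is made up, and the loop variable is renamed to child for readability.

```python
import xml.etree.ElementTree as ET


def indent(elem, level=0):
    """Add newline/indent whitespace to .text and .tail in place, recursively."""
    pad = "\n" + level * "  "
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = pad + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = pad
        for child in elem:
            indent(child, level + 1)
        # The last child's tail precedes the parent's closing tag: dedent it
        if not child.tail or not child.tail.strip():
            child.tail = pad
    elif level and (not elem.tail or not elem.tail.strip()):
        elem.tail = pad


page = ET.fromstring("<html><body><h1>hello world</h1><p>text</p></body></html>")
indent(page)
pretty = ET.tostring(page, encoding="unicode", method="html")
print(pretty)
```

On Python 3.9 and later, xml.etree.ElementTree also ships a built-in ET.indent() function that does the same job without a hand-written helper.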

#9


0  

I tried both BeautifulSoup's prettify and html5print's HTMLBeautifier, but since I'm using yattag to generate the HTML, it seems more appropriate to use its indent function, which produces nicely indented output.

from yattag import indent

rawhtml = "String with some HTML code..."

result = indent(
    rawhtml,
    indentation = '    ',
    newline = '\r\n',
    indent_text = True
)

print(result)
