How can I make lxml's parser preserve whitespace outside the root element?

Time: 2023-01-14 08:54:16

I am using lxml to manipulate some existing XML documents, and I want to introduce as little diff noise as possible. Unfortunately, by default lxml.etree.XMLParser doesn't preserve whitespace before or after the root element of a document:

>>> import lxml.etree
>>> xml = '\n    <etaoin>shrdlu</etaoin>\n'
>>> lxml.etree.tostring(lxml.etree.fromstring(xml), encoding='unicode')
'<etaoin>shrdlu</etaoin>'
>>> lxml.etree.tostring(lxml.etree.fromstring(xml), encoding='unicode') == xml
False

Is this possible with lxml? Does the underlying libxml2 support it?

2 Answers

#1


Capture the leading and trailing whitespace with a regex, then add it back to the serialized string when you're done.

#2


I don't know of any XML library that will do this for you. But if you really need it, using a regex sounds like a decent idea.

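>>> import re
>>> from lxml import etree
>>> xml = '\n    <etaoin>shrdlu</etaoin>\n'
>>> head, tail = re.findall(r"^\s*|\s*$", xml)[:2]  # leading and trailing whitespace
>>> root = etree.fromstring(xml)
>>> out = head + etree.tostring(root, encoding='unicode') + tail  # serialize to str, not bytes
>>> out == xml
True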
