How can namespace information be preserved when parsing HTML with lxml?

Time: 2022-08-26 12:07:05
>>> from lxml.etree import HTML, tostring
>>> tostring(HTML('<fb:like>'))
'<html><body><like/></body></html>'

Note how the tag turns from <fb:like> to simply <like>.

This makes it much harder to process pages that incorporate XFBML with lxml. (The same thing happens to <g:plusone></g:plusone>.)

Any help is appreciated.

2 Answers

#1


Try adding the namespace prefix definitions that are missing. Otherwise lxml drops the prefixes, supposedly to make things easier for you.

Most likely the sites you try to parse will not contain these namespace definitions, so you should add them.

Something like this: xmlns:adlcp="http://xxx/yy/zzz"
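As a rough illustration of what a declared prefix looks like to lxml, here is a minimal sketch. It assumes the markup is well-formed enough for lxml's XML parser (etree.fromstring) rather than HTML(), and the namespace URI is a made-up placeholder, not the real XFBML one. It does not show what HTML() does with the same declaration.

```python
from lxml import etree

# Hypothetical namespace URI, used only for illustration.
FB_NS = "http://example.com/ns/fb"

# The prefix is declared on the root element, as the answer suggests.
markup = (
    '<html xmlns:fb="' + FB_NS + '">'
    "<body><fb:like/></body></html>"
)

# Parsed as XML, so the markup must be well-formed.
root = etree.fromstring(markup)

# Namespaced elements are looked up with the {uri}localname form.
like = root.find(".//{%s}like" % FB_NS)

print(like.prefix)                      # prefix bound to the element
print(etree.QName(like).localname)      # local part of the tag name
print(etree.tostring(root).decode())    # serialized tree, prefix included
```

With the declaration in place, the element keeps its fb prefix when the tree is serialized again.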

#2


One way to fix this issue is to patch libxml2.

Referring to the source code of libxml2 2.9.2 (https://git.gnome.org/browse/libxml2/tree/?id=v2.9.2): in SAX2.c (https://git.gnome.org/browse/libxml2/tree/SAX2.c?id=v2.9.2), the internal SAX parser used to create the DOM tree, attributes with xmlns are not parsed as namespace declarations in HTML mode (line 1699); instead they are parsed like any other attribute (line 1740). Consequently, it makes sense to adjust line 1622, which splits the name into prefix and local part. Change:

name = xmlSplitQName(ctxt, fullname, &prefix);

into

if (!ctxt->html) {
    name = xmlSplitQName(ctxt, fullname, &prefix);
} else {
    name = xmlStrdup(fullname);
    prefix = NULL;
}

Then libxml2 will consider tags such as <o:p> to be for elements with name o:p, that is, the colon is included in the element name with no special meaning. This is the correct interpretation in HTML. For example, the HTML5 specification says:

In the HTML syntax, namespace prefixes and namespace declarations do not have the same effect as in XML. For instance, the colon has no special meaning in HTML element names.
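To make the effect of the proposed patch concrete, here is a small Python model of the name-handling step. It is only a sketch: split_qname and element_name are hypothetical helpers written for this illustration, and split_qname is a simplified stand-in for xmlSplitQName, not a port of it.

```python
def split_qname(fullname):
    """Simplified stand-in for xmlSplitQName: 'prefix:local' -> (prefix, local)."""
    prefix, sep, local = fullname.partition(":")
    if sep and prefix and local:
        return prefix, local
    # No usable colon: the whole string is the local name.
    return None, fullname


def element_name(fullname, html_mode):
    """Model of the patched branch: split only when not in HTML mode."""
    if not html_mode:
        return split_qname(fullname)
    # HTML mode: the colon has no special meaning, keep the name whole.
    return None, fullname


print(element_name("o:p", html_mode=False))  # XML mode: prefix and local part
print(element_name("o:p", html_mode=True))   # HTML mode: full name, no prefix
```

In XML mode the name is split into a prefix and a local part, while in HTML mode the tag name o:p stays intact, which is the behaviour the patch aims for.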

Hopefully this change will be approved for a future version of libxml2. There is an open bug report (https://bugzilla.gnome.org/show_bug.cgi?id=654146).
