I have a node like this:
<a class="someclass">
Wie
<em>Messi</em>
einen kleinen Jungen stehen lässt
</a>
How do I construct an XPath to get ["Wie Messi einen kleinen Jungen stehen lässt"] instead of ["Wie", "Messi", "einen kleinen Jungen stehen lässt"]?
I am using Python's lxml.html module with XPath.
I tried these combinations:
//a/node()/text()
//a/descendant::*/text()
//a/text()
But none of them helped. Any solutions?
I was thinking of another approach where I somehow get the "inner HTML" of the <a> element (which in the above case would be "Wie <em>Messi</em> einen kleinen Jungen stehen lässt") and remove the <em> tags from it.
Still trying to figure out how to get innerHTML (JavaScript, anyone?) from XPath.
2 Answers
#1
4
XPath is a selection language, so what it can do is select nodes. If there are separate nodes in the input then you will get a list of separate nodes as the selection result.
You'll need the help of your host language - Python in this case - to do things beyond that scope (like merging text nodes into a single string).
You need to find all <a> elements and join their individual text descendants. That's easy enough to do:
from lxml import etree

doc = etree.parse("path/to/file")
for a in doc.xpath("//a"):
    # itertext() yields every text node under <a>, in document order
    print(" ".join([t.strip() for t in a.itertext()]))
prints
Wie Messi einen kleinen Jungen stehen lässt
As paul correctly points out in the comments below, you can use XPath's normalize-space() and the whole thing gets even simpler.
for a in doc.xpath("//a"):
    print(a.xpath("normalize-space()"))
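For anyone who wants to try both variants without a file on disk, here is a minimal sketch that runs them against the markup from the question. It assumes the snippet is parsed with lxml.html.document_fromstring (the code above parses a file with lxml.etree instead), and the variable names are only illustrative.

```python
from lxml import html

# The markup from the question, parsed as an HTML document (an assumption;
# the answer's own code reads a file with lxml.etree).
snippet = """<a class="someclass">
Wie
<em>Messi</em>
einen kleinen Jungen stehen lässt
</a>"""
doc = html.document_fromstring(snippet)
a = doc.xpath("//a")[0]

# Selecting text nodes yields one string per node, whitespace included.
fragments = [t.strip() for t in doc.xpath("//a//text()")]

# Approach 1: join the text descendants in Python.
joined = " ".join(t.strip() for t in a.itertext())

# Approach 2: let XPath collapse the whitespace of the string value.
normalized = a.xpath("normalize-space()")

print(fragments)
print(joined)
print(normalized)
```

One difference worth knowing: joining with " " puts a space between every pair of text nodes, so markup like Mes<em>si</em> would come out as "Mes si", whereas normalize-space() only collapses whitespace that is actually in the source and would keep "Messi".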
#2
1
If you get the string value of the <a> node instead of using text(), you will get the concatenation of all of its descendant text nodes, instead of individual text nodes.
Try simply using
//a
and reading the node as a string in your host language. In Python you can use a DOM function as mentioned by @Tomalak to obtain the string value. With lxml.html you can use .text_content() on an element:
tree.xpath("//a")[0].text_content()
Within XPath, you can use a type conversion function:
string(//a)
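A rough sketch of both options on the question's markup, assuming it is parsed with lxml.html (the variable names are made up for illustration). Note that neither call trims anything: the line breaks from the source stay in the result, and string(//a) only looks at the first <a> in the document.

```python
from lxml import html

doc = html.document_fromstring(
    '<a class="someclass">\nWie\n<em>Messi</em>\n'
    'einen kleinen Jungen stehen lässt\n</a>'
)

# text_content() exists on lxml.html elements (not on plain lxml.etree ones).
raw = doc.xpath("//a")[0].text_content()

# string(//a) gives the string value of the first matching <a> only.
via_xpath = doc.xpath("string(//a)")

# Both keep the source whitespace; collapse it in Python if needed.
collapsed = " ".join(raw.split())

print(repr(raw))
print(collapsed)
```

If the whitespace should be collapsed inside XPath instead, normalize-space(//a) does the same job as the split/join line.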