使用Ruby on Rails从解析的HTML生成Xpath

时间:2023-02-05 23:10:15

Given the following example HTML:

给出以下示例HTML:

<table cellpadding="4" cellspacing="0" border="0" width="100%">
  <tbody>
    <tr bgcolor="#FFE4D8" valign="top">
    <td>in the next 20 minutes you will learn how to create a winter landscape. For this excersize you do need to have only a basic experience in Lightwave, so lets just start with it.<br>
  </tbody>
</table>

How could I auto generate an Xpath expression to the tag that contains "20 minutes"; in the same manner that Firepath does. Is this possible to do from within Ruby?

如何自动为包含“20分钟”的标签生成Xpath表达式;就像Firepath一样。这可以从Ruby中做到吗?

1 个解决方案

#1


0  

Assuming the text is not broken into different tags, you could find the lowest leaf node by

假设文本没有分成不同的标记,您可以找到最低的叶节点

//*[contains(text(),'20 minutes')]

You can then generate the string of parents by adding /.. at the end of the XPath until you got the root element html. At every step you will also need to get the position of the element by

然后,您可以通过在XPath结尾添加/ ..生成父项字符串,直到获得根元素html。在每一步中,您还需要获取元素的位置

//*[contains(text(),'20 minutes')]/position()

and for higher elements

并为更高的元素

//*[contains(text(),'20 minutes')]/../position()

After you know each tag name and position, you can build the path

在了解每个标记名称和位置后,您可以构建路径

/html[1]/body[1]/div[x]/table[y]/tbody[z]/tr[1]/td[1]

With x, y, z being placeholders.

x,y,z是占位符。

Since I don't know ruby, I cannot provide source code, but this will be an easy algorithm. The good thing is that you can implement this with any DOM parser, that knows XPath. It may be possible to optimize it considerably, if the DOM parser has a method for returning the parent of a node, because selecting the parent in XPath on its own for every step is slow and not viable for many/long documents.

由于我不知道ruby,我无法提供源代码,但这将是一个简单的算法。好处是你可以用任何知道XPath的DOM解析器来实现它。如果DOM解析器具有返回节点的父节点的方法,则可以大大优化它,因为在每个步骤中在XPath中单独选择父节点是缓慢的并且对于许多/长文档是不可行的。

It looks like REXML for ruby support the parent() method.

看起来像ruby的REXML支持parent()方法。

#1


0  

Assuming the text is not broken into different tags, you could find the lowest leaf node by

假设文本没有分成不同的标记,您可以找到最低的叶节点

//*[contains(text(),'20 minutes')]

You can then generate the string of parents by adding /.. at the end of the XPath until you got the root element html. At every step you will also need to get the position of the element by

然后,您可以通过在XPath结尾添加/ ..生成父项字符串,直到获得根元素html。在每一步中,您还需要获取元素的位置

//*[contains(text(),'20 minutes')]/position()

and for higher elements

并为更高的元素

//*[contains(text(),'20 minutes')]/../position()

After you know each tag name and position, you can build the path

在了解每个标记名称和位置后,您可以构建路径

/html[1]/body[1]/div[x]/table[y]/tbody[z]/tr[1]/td[1]

With x, y, z being placeholders.

x,y,z是占位符。

Since I don't know ruby, I cannot provide source code, but this will be an easy algorithm. The good thing is that you can implement this with any DOM parser, that knows XPath. It may be possible to optimize it considerably, if the DOM parser has a method for returning the parent of a node, because selecting the parent in XPath on its own for every step is slow and not viable for many/long documents.

由于我不知道ruby,我无法提供源代码,但这将是一个简单的算法。好处是你可以用任何知道XPath的DOM解析器来实现它。如果DOM解析器具有返回节点的父节点的方法,则可以大大优化它,因为在每个步骤中在XPath中单独选择父节点是缓慢的并且对于许多/长文档是不可行的。

It looks like REXML for ruby support the parent() method.

看起来像ruby的REXML支持parent()方法。