Generating an XPath from parsed HTML using Ruby on Rails

Time: 2023-02-05 23:10:15

Given the following example HTML:


<table cellpadding="4" cellspacing="0" border="0" width="100%">
    <tr bgcolor="#FFE4D8" valign="top">
    <td>in the next 20 minutes you will learn how to create a winter landscape. For this excersize you do need to have only a basic experience in Lightwave, so lets just start with it.<br>

How could I automatically generate an XPath expression for the tag that contains "20 minutes", in the same manner that Firepath does? Is this possible from within Ruby?


1 Solution



Assuming the text is not broken into different tags, you could find the lowest leaf node by


//*[contains(text(),'20 minutes')]
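Since the question asks about Ruby, here is a minimal sketch of evaluating this expression with REXML. It rests on a few assumptions: REXML is an XML parser, so the question's fragment is assumed to have been tidied into well-formed XML first (end tags added, `<br>` written as `<br/>`), and the cell text is shortened for illustration.

```ruby
require "rexml/document"

# Assumption: the question's fragment, tidied into well-formed XML
# (REXML cannot parse an unclosed <br> or missing end tags).
markup = <<~XML
  <table cellpadding="4" cellspacing="0" border="0" width="100%">
    <tr bgcolor="#FFE4D8" valign="top">
      <td>in the next 20 minutes you will learn how to create a winter landscape.<br/></td>
    </tr>
  </table>
XML

doc = REXML::Document.new(markup)

# First element, in document order, whose first text child contains the phrase.
match = REXML::XPath.first(doc, "//*[contains(text(),'20 minutes')]")

puts match.name
```

The returned object is a `REXML::Element`, so its tag name and parent can be read directly in the following steps.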

You can then generate the chain of parents by appending /.. to the end of the XPath until you reach the root element html. At every step you will also need to get the position of the element by


//*[contains(text(),'20 minutes')]/position()

and for higher elements


//*[contains(text(),'20 minutes')]/../position()
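One caveat, as far as I know: XPath 1.0 engines (REXML is one) do not accept `position()` as a path step, so the expressions above may not run as written there. A common substitute is to count the preceding siblings that share the element's tag name and add one, which is what `count(preceding-sibling::td) + 1` would express for a `td`. The sketch below does the same count through REXML's node API; the helper name `position_among_same_name` and the sample row are made up for illustration.

```ruby
require "rexml/document"

# Hypothetical helper: 1-based position of +element+ among the siblings
# that share its tag name (the index used in a step such as td[2]).
def position_among_same_name(element)
  position = 1
  sibling = element.previous_element
  while sibling
    position += 1 if sibling.name == element.name
    sibling = sibling.previous_element
  end
  position
end

# Made-up sample row: two td cells with a th in between.
doc = REXML::Document.new("<tr><td>a</td><th>x</th><td>b</td></tr>")
second_td = doc.root.elements[3] # third child element (1-based index)

puts position_among_same_name(second_td)
```

Counting only same-named siblings matters because a step like `td[2]` means the second `td` child, not the second child of any kind.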

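After you know each tag name and position, you can build the path. For the example above it would look something like

/html/body/table[x]/tr[y]/td[z]

with x, y, z being placeholders for the positions found at each step.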


Since I don't know Ruby, I cannot provide source code, but the algorithm is simple. The good thing is that you can implement it with any DOM parser that supports XPath. It can probably be optimized considerably if the DOM parser has a method for returning the parent of a node, because running a separate XPath query to select the parent at every step is slow and does not scale to many or long documents.


It looks like REXML for Ruby supports the parent() method.
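Putting the steps together, a rough sketch of the whole algorithm with REXML could look like the following. It walks upward with `parent` instead of re-querying with `/..`, and it indexes every step by the element's position among same-named siblings. The helper names `sibling_index` and `absolute_xpath` and the sample document are assumptions for illustration, and the input is again assumed to be well-formed XML.

```ruby
require "rexml/document"

# Hypothetical helper: 1-based index of +element+ among same-named siblings.
def sibling_index(element)
  same_name = element.parent.children.select do |child|
    child.is_a?(REXML::Element) && child.name == element.name
  end
  same_name.index { |child| child.equal?(element) } + 1
end

# Hypothetical helper: build an absolute, fully indexed path by following
# parent links until the document node is reached.
def absolute_xpath(element)
  steps = []
  node = element
  until node.nil? || node.is_a?(REXML::Document)
    steps.unshift("#{node.name}[#{sibling_index(node)}]")
    node = node.parent
  end
  "/" + steps.join("/")
end

# Made-up, well-formed sample document.
doc = REXML::Document.new(<<~XML)
  <html><body>
    <p>intro</p>
    <table>
      <tr><td>first cell</td><td>in the next 20 minutes you will learn<br/></td></tr>
    </table>
  </body></html>
XML

target = REXML::XPath.first(doc, "//*[contains(text(),'20 minutes')]")
path = absolute_xpath(target)
puts path
```

Indexing every step, even where there is only one sibling, keeps the sketch short; a tidier variant could omit `[1]` when the tag name is unique among its siblings. If Nokogiri is available, its `Node#path` method returns a comparable path directly and copes with real-world HTML.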




