I am trying to use YQL to extract a portion of HTML from a series of web pages. The pages themselves have slightly different structures (so a Yahoo Pipes "Fetch Page" with its "Cut content" feature does not work well), but the fragment I am interested in always has the same class attribute.
If I have an HTML page like this:
<html>
  <body>
    <div class="foo">
      <p>Wolf</p>
      <ul>
        <li>Dog</li>
        <li>Cat</li>
      </ul>
    </div>
  </body>
</html>
and use a YQL expression like this:
SELECT * FROM html
WHERE url="http://example.com/containing-the-fragment-above"
AND xpath="//div[@class='foo']"
what I get back are the (apparently unordered?) DOM elements, whereas what I want is the HTML content itself. I've tried SELECT content as well, but that only selects the textual content. I want HTML. Is this possible?
3 answers
#1
8
You could write a little Open Data Table to send out a normal YQL html table query and stringify the result. Something like the following:
<?xml version="1.0" encoding="UTF-8" ?>
<table xmlns="http://query.yahooapis.com/v1/schema/table.xsd">
  <meta>
    <sampleQuery>select * from {table} where url="http://finance.yahoo.com/q?s=yhoo" and xpath='//div[@id="yfi_headlines"]/div[2]/ul/li/a'</sampleQuery>
    <description>Retrieve HTML document fragments</description>
    <author>Peter Cowburn</author>
  </meta>
  <bindings>
    <select itemPath="result.html" produces="JSON">
      <inputs>
        <key id="url" type="xs:string" paramType="variable" required="true"/>
        <key id="xpath" type="xs:string" paramType="variable" required="true"/>
      </inputs>
      <execute><![CDATA[
        var results = y.query("select * from html where url=@url and xpath=@xpath", {url:url, xpath:xpath}).results.*;
        var html_strings = [];
        for each (var item in results) html_strings.push(item.toXMLString());
        response.object = {html: html_strings};
      ]]></execute>
    </select>
  </bindings>
</table>
You could then query against that custom table with a YQL query like:
use "http://url.to/your/datatable.xml" as html.tostring;
select * from html.tostring where
url="http://finance.yahoo.com/q?s=yhoo"
and xpath='//div[@id="yfi_headlines"]/div[2]/ul/li'
Edit: Just realised this is a pretty old question that was bumped; at least an answer is here, eventually, for anyone stumbling on the question. :)
#2
2
I had exactly the same problem. The only way I have got around it is to avoid YQL and just use regular expressions to match the start and end tags :/. Not the best solution, but if the HTML is relatively unchanging and the pattern always runs from, say, <div class='name'> to <div class='just_after'>, then you can get away with that and take the HTML in between.
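A minimal Python sketch of the "match the start and end markers" idea described above. The helper name html_between and the sample page are made up for illustration, and the approach assumes both markers appear literally in the page, which is exactly the fragility this answer warns about.

```python
import re

def html_between(page, start_marker, end_marker):
    """Return the raw text between the first start marker and the next end marker, or None."""
    # Markers are treated as literal strings, so they are escaped before use.
    pattern = re.compile(
        re.escape(start_marker) + r"(.*?)(?=" + re.escape(end_marker) + r")",
        re.DOTALL,  # let "." span line breaks inside the fragment
    )
    match = pattern.search(page)
    return match.group(1) if match else None

# Hypothetical page: the wanted block is followed by a known "just_after" block.
page = """<html><body>
<div class='name'><p>Wolf</p><ul><li>Dog</li><li>Cat</li></ul></div>
<div class='just_after'>footer</div>
</body></html>"""

fragment = html_between(page, "<div class='name'>", "<div class='just_after'>")
print(fragment)
```

Note that the captured text still ends with the closing </div> of the first block, because the match runs right up to the second marker. Any change in quoting or attribute order in the page breaks the match.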
#3
0
YQL converts the page into XML, then does your XPath on it, then takes the DOMNodeList and serializes that back to XML for your output (and then converts to JSON if needed). You can't access the original data.
Why can't you deal with XML instead of HTML?
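The parse, XPath, serialize sequence described in this answer can be sketched locally with Python's standard library. This is only an illustration, not how YQL is implemented: it assumes the page is well-formed XML (as the sample in the question happens to be), and fragments_by_class is a hypothetical helper name. Real-world HTML would need a tolerant HTML parser first.

```python
import xml.etree.ElementTree as ET

# The sample page from the question; it is well-formed, so an XML parser accepts it.
PAGE = """<html>
<body>
<div class="foo">
<p>Wolf</p>
<ul>
<li>Dog</li>
<li>Cat</li>
</ul>
</div>
</body>
</html>"""

def fragments_by_class(markup, class_name):
    """Parse the markup, select matching div nodes, and serialize each one back to markup."""
    root = ET.fromstring(markup)
    # ElementTree supports a limited XPath subset, including [@attr='value'] predicates.
    nodes = root.findall(f".//div[@class='{class_name}']")
    # tostring() also emits the whitespace that follows the element, hence strip().
    return [ET.tostring(node, encoding="unicode").strip() for node in nodes]

fragments = fragments_by_class(PAGE, "foo")
print(fragments[0])
```

The serialized string is rebuilt from the parsed tree rather than copied from the source bytes, which mirrors the point above: the result is markup for the selected nodes, not the original data.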