I'm using cURL to retrieve information from Wikipedia. So far I've been successful in retrieving basic text information, but I'd really like to retrieve it as HTML.
Here is my code:
// First request: use Yahoo BOSS search to find the Wikipedia article for $article_name
$s = curl_init();
$url = 'http://boss.yahooapis.com/ysearch/web/v1/site:en.wikipedia.org+'.$article_name.'?appid=myID';
curl_setopt($s, CURLOPT_URL, $url);
curl_setopt($s, CURLOPT_HEADER, false);
curl_setopt($s, CURLOPT_RETURNTRANSFER, 1);
$rs = curl_exec($s);
$rs = Zend_Json::decode($rs);
$rs = $rs['ysearchresponse']['resultset_web'];
$rs = array_shift($rs); // take the first search result
$article = str_replace('http://en.wikipedia.org/wiki/', '', $rs['url']);
// Second request: fetch the article's wikitext from the MediaWiki API
$url = 'http://en.wikipedia.org/w/api.php?';
$url .= 'format=json';
$url .= sprintf('&action=query&titles=%s&rvprop=content&prop=revisions&redirects=1', $article);
curl_setopt($s, CURLOPT_URL, $url);
curl_setopt($s, CURLOPT_HEADER, false);
curl_setopt($s, CURLOPT_RETURNTRANSFER, 1);
$rs = curl_exec($s);
//curl_close( $s );
$rs = Zend_Json::decode($rs);
$rs = array_pop(array_pop(array_pop($rs))); // dig down to the single page entry
$rs = array_shift($rs['revisions']);        // latest revision
$articleText = $rs['*'];                    // raw wikitext, not HTML
However, the text retrieved this way isn't suitable for display :( It's all raw wiki markup, in this kind of format:
'''Aix-les-Bains''' is a [[Communes of France|commune]] in the [[Savoie]] [[Departments of France|department]] in the [[Rhône-Alpes]] [[regions of France|region]] in southeastern [[France]].
It lies near the [[Lac du Bourget]], {{convert|9|km|mi|abbr=on}} by rail north of [[Chambéry]].
==History== ''Aix'' derives from [[Latin]] ''Aquae'' (literally, "waters"; ''cf'' [[Aix-la-Chapelle]] (Aachen) or [[Aix-en-Provence]]), and Aix was a bath during the [[Roman Empire]], even before it was renamed ''Aquae Gratianae'' to commemorate the [[Emperor Gratian]], who was assassinated not far away, in [[Lyon]], in [[383]]. Numerous Roman remains survive. [[Image:IMG 0109 Lake Promenade.jpg|thumb|left|Lac du Bourget Promenade]]
How do I get the HTML of the Wikipedia article?
UPDATE: Thanks, but I'm kinda new to this, and right now I'm trying to run an XPath query (albeit for the first time) and can't seem to get any results. I actually need to know a couple of things here:
- How do I request just a part of an article?
- How do I get the HTML of the article requested?
I went through this URL on data mining from Wikipedia. It suggested making a second request to the Wikipedia API with the retrieved wikitext as a parameter, which should return the HTML, although that hasn't worked so far :( I don't want to just grab the whole article as a mess of HTML and dump it. Basically, what my application does is this: you have some locations and cities pinpointed on a map; you click on a city marker, and it requests details of the city via AJAX to be shown in an adjacent div. This information I wish to get from Wikipedia dynamically. I'll worry about handling cities that don't have an article later on; at this point I just need to make sure it's working.
Does anyone know of a nice working example that does what I'm looking for, i.e. reads and parses selected portions of a Wikipedia article?
According to the URL provided, I should POST the wikitext to the Wikipedia API location for it to return parsed HTML. The issue is that if I POST the information, I get no response, just an error saying I'm denied access. However, if I send the wikitext as GET, it parses with no issue. But of course that fails when I have way too much text to fit in the query string.
Is this a problem with the Wikipedia API? I've been hacking at it for two days now with no luck at all :(
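For reference, here is a hedged sketch of what that second request might look like as a proper POST to the MediaWiki `action=parse` module. The variable $wikiText is an assumption standing in for the raw markup fetched earlier; using CURLOPT_POSTFIELDS is what makes cURL send an actual POST body, which may be what is missing if a plain POST gets rejected:

```php
// Sketch: POST wikitext to the MediaWiki parse API to get HTML back.
// Assumes $wikiText already holds the raw markup fetched earlier.
$s = curl_init('http://en.wikipedia.org/w/api.php');
curl_setopt($s, CURLOPT_POST, true);
curl_setopt($s, CURLOPT_RETURNTRANSFER, true);
// Sending the fields as a urlencoded string keeps the request a simple
// application/x-www-form-urlencoded POST rather than multipart form data.
curl_setopt($s, CURLOPT_POSTFIELDS, http_build_query(array(
    'action' => 'parse',
    'format' => 'json',
    'text'   => $wikiText,   // can be arbitrarily long in a POST body
)));
$response = Zend_Json::decode(curl_exec($s));
curl_close($s);
$html = $response['parse']['text']['*']; // parsed HTML of the wikitext
```

Unlike a GET query string, the POST body has no practical length limit, so this should also sidestep the "too much text" failure.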
4 Answers
#1
The simplest solution would probably be to grab the page itself (e.g. http://en.wikipedia.org/wiki/Combination) and then extract the content of <div id="content">, potentially with an XPath query.
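A minimal sketch of that approach, assuming $html already holds the full page markup fetched with cURL (the id "content" matches the Wikipedia skin this answer refers to; it may need adjusting if the markup changes):

```php
// Sketch: extract <div id="content"> from a fetched Wikipedia page.
// Assumes $html holds the full page markup retrieved with cURL.
$doc = new DOMDocument();
libxml_use_internal_errors(true);   // real-world HTML is rarely valid XML
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
$nodes = $xpath->query('//div[@id="content"]');
if ($nodes->length > 0) {
    // saveXML() with a node argument serializes just that subtree.
    $contentHtml = $doc->saveXML($nodes->item(0));
}
```

If the XPath query returns no results, the usual culprit is that the id in the query doesn't match the fetched markup exactly, so inspecting the raw $html first is worthwhile.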
#2
There is a PEAR Wiki Filter that I have used and it does a very decent job.
Phil
#3
Try looking at the printable version of the Wikipedia article in question.
In other words, change this line of your source code:
$url.=sprintf('&action=query&titles=%s&rvprop=content&prop=revisions&redirects=1', $article);
to something like:
$url.=sprintf('&action=query&titles=%s&printable=yes&redirects=1', $article);
Disclaimer: Have not tested, and this is just a guess at how your API might work.
#4
As far as I understand it, the Wikipedia software converts the wiki markup into HTML when the page is requested, so with your current method you'll need to deal with the raw markup in the results yourself.
A good place to start is the MediaWiki API. You can also use http://pear.php.net/package/Text_Wiki to format the results retrieved via cURL.
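If Text_Wiki is an option, a hedged sketch of converting the wikitext you already retrieve might look like this. It assumes the Text_Wiki and Text_Wiki_Mediawiki PEAR packages are installed, and that $articleText holds the wikitext from your existing code; note that Text_Wiki's coverage of MediaWiki templates like {{convert}} is limited, so treat it as a starting point:

```php
// Sketch: render fetched wikitext to XHTML with PEAR's Text_Wiki.
// Requires: pear install Text_Wiki Text_Wiki_Mediawiki
require_once 'Text/Wiki.php';

$wiki = Text_Wiki::factory('Mediawiki');          // parser for MediaWiki syntax
$html = $wiki->transform($articleText, 'Xhtml');  // wikitext in, XHTML out
echo $html;
```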