How do I parse Wikipedia XML with PHP? I tried simplepie, but got nothing. Here is the link whose data I want to get:
http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml
Edit code:
<?php
define("EMAIL_ADDRESS", "youlichika@hotmail.com");
$ch = curl_init();
$cv = curl_version();
$user_agent = "curl {$cv['version']} ({$cv['host']}) libcurl/{$cv['version']} {$cv['ssl_version']} zlib/{$cv['libz_version']} <" . EMAIL_ADDRESS . ">";
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
curl_setopt($ch, CURLOPT_ENCODING, "deflate, gzip, identity");
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
curl_setopt($ch, CURLOPT_URL, "http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml");
$xml = curl_exec($ch);
$xml_reader = new XMLReader();
$xml_reader->xml($xml, "UTF-8");
echo $xml->api->query->pages->page->rev;
?>
3 Answers
#1
7
I generally use a combination of CURL and XMLReader
to parse XML generated by the MediaWiki API.
Note that you must include your e-mail address in the User-Agent
header, or else the API script will respond with HTTP 403 Forbidden.
Here is how I initialize the CURL handle:
define("EMAIL_ADDRESS", "my@email.com");
$ch = curl_init();
$cv = curl_version();
$user_agent = "curl {$cv['version']} ({$cv['host']}) libcurl/{$cv['version']} {$cv['ssl_version']} zlib/{$cv['libz_version']} <" . EMAIL_ADDRESS . ">";
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
curl_setopt($ch, CURLOPT_ENCODING, "deflate, gzip, identity");
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
You can then use this code, which grabs the XML and constructs a new XMLReader object in $xml_reader:
curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
curl_setopt($ch, CURLOPT_URL, "http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml");
$xml = curl_exec($ch);
$xml_reader = new XMLReader();
$xml_reader->xml($xml, "UTF-8");
EDIT: Here is a working example:
<?php
define("EMAIL_ADDRESS", "youlichika@hotmail.com");
$ch = curl_init();
$cv = curl_version();
$user_agent = "curl {$cv['version']} ({$cv['host']}) libcurl/{$cv['version']} {$cv['ssl_version']} zlib/{$cv['libz_version']} <" . EMAIL_ADDRESS . ">";
curl_setopt($ch, CURLOPT_USERAGENT, $user_agent);
curl_setopt($ch, CURLOPT_COOKIEFILE, "cookies.txt");
curl_setopt($ch, CURLOPT_COOKIEJAR, "cookies.txt");
curl_setopt($ch, CURLOPT_ENCODING, "deflate, gzip, identity");
curl_setopt($ch, CURLOPT_HEADER, FALSE);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
curl_setopt($ch, CURLOPT_URL, "http://en.wikipedia.org/w/api.php?action=query&generator=allpages&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re&prop=revisions&rvprop=content&format=xml");
$xml = curl_exec($ch);
$xml_reader = new XMLReader();
$xml_reader->xml($xml, "UTF-8");
function extract_first_rev(XMLReader $xml_reader)
{
    while ($xml_reader->read()) {
        if ($xml_reader->nodeType == XMLReader::ELEMENT) {
            if ($xml_reader->name == "rev") {
                $content = htmlspecialchars_decode($xml_reader->readInnerXML(), ENT_QUOTES);
                return $content;
            }
        } else if ($xml_reader->nodeType == XMLReader::END_ELEMENT) {
            if ($xml_reader->name == "page") {
                throw new Exception("Unexpectedly found `</page>`");
            }
        }
    }
    throw new Exception("Reached the end of the XML document without finding revision content");
}
$latest_rev = array();
while ($xml_reader->read()) {
    if ($xml_reader->nodeType == XMLReader::ELEMENT) {
        if ($xml_reader->name == "page") {
            $latest_rev[$xml_reader->getAttribute("title")] = extract_first_rev($xml_reader);
        }
    }
}
function parse($rev)
{
    global $ch;
    curl_setopt($ch, CURLOPT_HTTPGET, TRUE);
    curl_setopt($ch, CURLOPT_URL, "http://en.wikipedia.org/w/api.php?action=parse&text=" . rawurlencode($rev) . "&prop=text&format=xml");
    sleep(3); // be polite to the API between requests
    $xml = curl_exec($ch);
    $xml_reader = new XMLReader();
    $xml_reader->xml($xml, "UTF-8");
    while ($xml_reader->read()) {
        if ($xml_reader->nodeType == XMLReader::ELEMENT) {
            if ($xml_reader->name == "text") {
                $html = htmlspecialchars_decode($xml_reader->readInnerXML(), ENT_QUOTES);
                return $html;
            }
        }
    }
    throw new Exception("Failed to parse");
}
foreach ($latest_rev as $title => $rev) {
    echo parse($rev) . "\n";
}
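As an aside, if you only need the revision text, asking the same API for format=json and decoding it with json_decode can be simpler than streaming XML. A minimal sketch (the URL parameters mirror the query above; error handling and the User-Agent caveat from earlier still apply, and the "*" key holds the content in this response format):

```php
<?php
// Same query as above, but asking the API for JSON instead of XML.
$url = "http://en.wikipedia.org/w/api.php?action=query&generator=allpages"
     . "&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re"
     . "&prop=revisions&rvprop=content&format=json";

$json = file_get_contents($url);  // or reuse the cURL handle from above
$data = json_decode($json, true); // decode into associative arrays

foreach ($data["query"]["pages"] as $page) {
    echo $page["title"] . "\n";
    echo $page["revisions"][0]["*"] . "\n"; // wikitext of the latest revision
}
```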
#2
1
You could use SimpleXML:
$xml = simplexml_load_file($url);
See example here: http://php.net/manual/en/simplexml.examples-basic.php
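Applied to the query URL from the question, a minimal sketch might look like this (it assumes the element structure the API returns, `<api><query><pages><page>...`, and the User-Agent requirement from the other answer still applies):

```php
<?php
$url = "http://en.wikipedia.org/w/api.php?action=query&generator=allpages"
     . "&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re"
     . "&prop=revisions&rvprop=content&format=xml";

$xml = simplexml_load_file($url); // root element is <api>

// Walk the <page> elements; each revision's wikitext is the text of <rev>.
foreach ($xml->query->pages->page as $page) {
    echo $page["title"] . "\n";                 // attributes via array access
    echo (string) $page->revisions->rev . "\n"; // element text via string cast
}
```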
Or DOM:
$xml = new DomDocument;
$xml->load($url);
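With DOM, an XPath query is a convenient way to pull out the `<rev>` nodes once the document is loaded. A sketch under the same assumptions about the response structure:

```php
<?php
$url = "http://en.wikipedia.org/w/api.php?action=query&generator=allpages"
     . "&gaplimit=2&gapfilterredir=nonredirects&gapfrom=Re"
     . "&prop=revisions&rvprop=content&format=xml";

$doc = new DOMDocument();
$doc->load($url);

$xpath = new DOMXPath($doc);
// Select every <rev> element anywhere in the response.
foreach ($xpath->query("//rev") as $rev) {
    echo $rev->textContent . "\n"; // the revision's wikitext
}
```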
Or XMLReader, for huge XML documents that you don't want to read entirely into memory.