JSON parsing: if Wikipedia has multiple options, pick the first page to display

Time: 2022-10-29 20:43:39

The following code grabs the first paragraph from a Wikipedia page.

<?php
// action=parse: get parsed text
// page=Baseball: from the page Baseball
// format=json: in JSON format
// prop=text: send the text content of the article
// section=0: top content of the page

// search term from the query string (not yet used in the URL below)
$find = isset($_GET['find']) ? $_GET['find'] : null;

$url = 'https://en.wikipedia.org/w/api.php?action=parse&page=baseball&format=json&prop=text&section=0';
$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_USERAGENT, "TestScript"); // required by wikipedia.org server; use YOUR user agent with YOUR contact information. (otherwise your IP might get blocked)
$c = curl_exec($ch);
if ($c === false) {
    die('Request failed: ' . curl_error($ch));
}

$json = json_decode($c);
if (!isset($json->parse->text->{'*'})) {
    die('No parsed text in the response');
}

$content = $json->{'parse'}->{'text'}->{'*'}; // get the main text content of the query (it's parsed HTML)

// pattern that matches every plain <p>...</p> paragraph in the section
$pattern = '#<p>(.*?)</p>#s'; // http://www.phpbuilder.com/board/showthread.php?t=10352690
if (preg_match_all($pattern, $content, $matches))
{
    // $matches[0] holds the paragraphs including the wrapping <p> tags
    echo "Wikipedia:<br>";
    print strip_tags(implode("\n\n", $matches[1])); // All matched paragraphs, joined, without the HTML tags.
}
?>

The issue is that sometimes I want to make the title a variable in PHP so I can "search" for the information, but my query isn't always going to be a legitimate Wikipedia page.

For example, when the above code searches for baseball, there is a page for baseball. But when I search for "mandarin", it shows:

Mandarin may refer to any of the following:

But it doesn't show any options.

My question is, is there a way to check to see if the page exists, and if not, get a list of options from Wikipedia that it could be, then pick the first page to display?

1 Answer

#1

Back in the '80s, when referring to parsing XML and HTML documents, Nancy Reagan cried out:

Just Say No to REGEX!

Wait a minute! I might be mistaken on that. I think she may have said, "Just Say No to Drugs!" and she probably wasn't thinking about XML or HTML documents when she said it. But if she had been, I'm sure she would agree with me that parsing XML and HTML is better done with PHP's DOMDocument class, for two reasons:

  • Regular expressions aren't very reliable for that purpose. A single character can throw them off, and any change the webmaster makes to the markup can render your regex patterns useless.
  • Regular expressions are slow, especially if you have to get multiple items from the document. The DOMDocument model parses the document once, and then all the data is contained in an object for easy access.
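
To make the second point concrete, here is a minimal sketch, assuming the HTML fragment is already available as a string. The helper names loadHtmlDocument(), firstParagraphText() and linkTitles(), and the sample markup, are made up for illustration. The fragment is parsed once and then queried twice:

```php
<?php
// Hypothetical helper: parse an HTML string once; returns null on failure.
function loadHtmlDocument(string $html): ?DOMDocument
{
    if ($html === '') {
        return null; // loadHTML() rejects an empty string
    }
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $loaded ? $doc : null;
}

// Hypothetical helper: text of the first non-empty <p>, or null if there is none.
function firstParagraphText(DOMXPath $xpath): ?string
{
    foreach ($xpath->query('//p') as $p) {
        $text = trim($p->textContent);
        if ($text !== '') {
            return $text;
        }
    }
    return null;
}

// Hypothetical helper: the title attribute of every link that has one.
function linkTitles(DOMXPath $xpath): array
{
    $titles = [];
    foreach ($xpath->query('//a[@title]') as $a) {
        $titles[] = $a->getAttribute('title');
    }
    return $titles;
}

// Made-up sample markup, loosely shaped like a parsed article lead.
$html = '<p class="mw-empty-elt"></p>'
      . '<p><b>Baseball</b> is a <a href="/wiki/Bat-and-ball_games" title="Bat-and-ball games">bat-and-ball</a> sport.</p>';

$doc = loadHtmlDocument($html);
if ($doc === null) {
    exit("Could not parse the HTML\n");
}
$xpath = new DOMXPath($doc); // one parse, many queries
echo firstParagraphText($xpath), "\n";
print_r(linkTitles($xpath));
```

Unlike the '#<p>(.*?)</p>#s' pattern, the //p query also matches paragraphs that carry attributes, which is one reason the DOM route tends to survive markup changes better.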

I went to the "Mandarin" page and found the following:

<h2>
    <span class="editsection">[<a href="/w/index.php?title=Mandarin&amp;action=edit&amp;section=1" title="Edit section: Officials">edit</a>]</span>
    <span class="mw-headline" id="Officials">Officials</span>
</h2>
<ul>
    <li><a href="/wiki/Mandarin_(bureaucrat)" title="Mandarin (bureaucrat)">Mandarin (bureaucrat)</a>, a bureaucrat of Imperial China (the original meaning of the word), Vietnam, and by analogy, any senior government bureaucrat</li>
</ul>

You can get the first link using the following code:

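$doc = new DOMDocument();
// load the HTML string into the document object
if ($data === '' || ! @$doc->loadHTML($data)) {
    return FALSE;
}
// create an XPath object using the document object as the parameter
$xpath = new DOMXPath($doc);
// The span with class "editsection" only wraps the "edit" link,
// so select the links inside the list items instead.
$query = "//ul/li/a";
// XPath queries return a DOMNodeList
$res = $xpath->query($query);
// item(0) is null when nothing matched, so check the length first
$link = $res->length > 0 ? $res->item(0)->getAttribute('href') : null;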

Once you have the URL, it's a simple matter to request the next page. As far as testing whether a page has this information or not, I think you can figure that out.
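
As a rough sketch of both steps, assuming the candidate pages appear as list-item links whose href starts with /wiki/ (as in the Mandarin excerpt above): the function names firstArticleHref(), titleFromHref() and buildParseUrl() are hypothetical, and a null result is used here as the "this page has no such list" test. A normal article may also contain list links, so treat that check as a heuristic only.

```php
<?php
// Hypothetical helper: href of the first list-item link pointing at /wiki/..., or null.
function firstArticleHref(string $html): ?string
{
    if ($html === '') {
        return null;
    }
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return null;
    }
    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query("//ul/li/a[starts-with(@href, '/wiki/')]");
    if ($nodes === false || $nodes->length === 0) {
        return null; // no candidate links found
    }
    return $nodes->item(0)->getAttribute('href');
}

// Hypothetical helper: "/wiki/Mandarin_(bureaucrat)" becomes "Mandarin_(bureaucrat)".
function titleFromHref(string $href): string
{
    return rawurldecode(substr($href, strlen('/wiki/')));
}

// Hypothetical helper: same API call as in the question, for an arbitrary title.
function buildParseUrl(string $title): string
{
    return 'https://en.wikipedia.org/w/api.php?' . http_build_query([
        'action'  => 'parse',
        'page'    => $title,
        'format'  => 'json',
        'prop'    => 'text',
        'section' => 0,
    ]);
}

// Sample taken from the Mandarin excerpt above (description shortened).
$sample = '<h2><span class="editsection">[<a href="/w/index.php?title=Mandarin&amp;action=edit&amp;section=1" title="Edit section: Officials">edit</a>]</span>'
        . '<span class="mw-headline" id="Officials">Officials</span></h2>'
        . '<ul><li><a href="/wiki/Mandarin_(bureaucrat)" title="Mandarin (bureaucrat)">Mandarin (bureaucrat)</a>, a bureaucrat of Imperial China</li></ul>';

$href = firstArticleHref($sample);
if ($href !== null) {
    echo buildParseUrl(titleFromHref($href)), "\n"; // URL to request next
}
```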

If you're going to be doing this sort of thing, it's well worth your while to learn about the DOMDocument class and how to write XPath queries.

EDIT:

The variable $data is just a string containing the HTML from the page.
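
In terms of the question's code, $data would be the parsed HTML stored under parse, then text, then * in the JSON response. A small sketch of pulling it out defensively follows; extractParsedHtml() is a hypothetical name, both sample bodies are hand-written, and the error shape is an assumption about what the API returns for a page that does not exist.

```php
<?php
// Hypothetical helper: returns the parsed HTML, or null when the response has none
// (for example an error response or a body that is not valid JSON).
function extractParsedHtml(string $jsonBody): ?string
{
    $json = json_decode($jsonBody, true);
    if (!is_array($json) || !isset($json['parse']['text']['*'])) {
        return null;
    }
    return $json['parse']['text']['*'];
}

// Hand-written sample bodies, not real API output.
$okBody    = '{"parse":{"title":"Mandarin","text":{"*":"<p>Mandarin may refer to any of the following:</p>"}}}';
$errorBody = '{"error":{"code":"missingtitle"}}'; // assumed shape for a missing page

$data = extractParsedHtml($okBody);
if ($data === null) {
    echo "No page content\n";
} else {
    echo $data, "\n";
}
```

Checking for null before handing $data to loadHTML() also covers the "does the page exist" part of the question.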
