First, I know that I can get the HTML of a webpage with:
file_get_contents($url);
What I am trying to do is get a specific link element in the page (found in the head).
For example:
<link type="text/plain" rel="service" href="/service.txt" /> (the element could close with just >)
My question is: How can I get that specific element with the "rel" attribute equal to "service" so I can get the href?
My second question is: Should I also get the "base" element? Does it apply to the "link" element? I am trying to follow the standard.
Also, the HTML might have errors. I have no control over how my users code their pages.
3 Answers
#1
3
Using PHP's DOMDocument, this should do it (untested):
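$file = file_get_contents($url);
libxml_use_internal_errors(true); // collect parse errors from malformed HTML instead of emitting warnings
$doc = new DOMDocument();
$doc->loadHTML($file);
$head = $doc->getElementsByTagName('head')->item(0);
if ($head !== null) { // the page may have no <head> at all
    foreach ($head->getElementsByTagName('link') as $l) {
        // rel is a space-separated, case-insensitive list of tokens
        $rels = preg_split('/\s+/', strtolower(trim($l->getAttribute('rel'))));
        if (in_array('service', $rels, true)) {
            echo $l->getAttribute('href');
        }
    }
}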
#2
0
You should get the Base element, but know how it works and its scope.
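To make the scope point concrete: a `<base href>` in the document changes how every relative URL in that document is resolved, and that includes the href of a `<link>` element. Below is a minimal sketch of that idea using only PHP's built-in DOM extension. The function names `resolveUrl` and `findServiceHref` and the example URLs are my own (hypothetical), and the resolver is deliberately simplified: it does not collapse `../` segments or handle query-only references, so treat it as an illustration rather than a complete RFC 3986 implementation.

```php
<?php
// Hypothetical helper: resolve $href against $base.
// Simplified on purpose; "../" segments are NOT normalised.
function resolveUrl(string $base, string $href): string
{
    // Already absolute (starts with a scheme such as "http:"): use as-is.
    if (preg_match('#^[a-z][a-z0-9+.-]*:#i', $href)) {
        return $href;
    }
    $b = parse_url($base);
    $origin = $b['scheme'] . '://' . $b['host']
            . (isset($b['port']) ? ':' . $b['port'] : '');
    if (strpos($href, '//') === 0) {   // protocol-relative: //host/path
        return $b['scheme'] . ':' . $href;
    }
    if (strpos($href, '/') === 0) {    // root-relative: /path
        return $origin . $href;
    }
    // Path-relative: keep the base path up to and including its last "/".
    $path = $b['path'] ?? '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

// Hypothetical helper: return the resolved href of the first
// <link rel="service"> in <head>, or null if there is none.
function findServiceHref(string $html, string $pageUrl): ?string
{
    if ($html === '') {
        return null;
    }
    libxml_use_internal_errors(true);  // tolerate malformed user HTML
    $doc = new DOMDocument();
    $loaded = $doc->loadHTML($html);
    libxml_clear_errors();
    if (!$loaded) {
        return null;
    }
    $xpath = new DOMXPath($doc);

    // The first <base href> (if any) replaces the page URL as the base.
    $base = $pageUrl;
    $baseNode = $xpath->query('//base[@href]')->item(0);
    if ($baseNode !== null) {
        $base = resolveUrl($pageUrl, $baseNode->getAttribute('href'));
    }

    foreach ($xpath->query('//head/link[@href]') as $link) {
        // rel is a space-separated, case-insensitive token list.
        $rels = preg_split('/\s+/', strtolower(trim($link->getAttribute('rel'))));
        if (in_array('service', $rels, true)) {
            return resolveUrl($base, $link->getAttribute('href'));
        }
    }
    return null;
}
```

You would call it as `findServiceHref(file_get_contents($url), $url)`. Passing the page URL in matters because, when the page has no `<base>`, the page's own URL is what relative hrefs resolve against.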
In truth, when I have to screen-scrape, I use phpquery. This is an older PHP port of jQuery... and while that may sound like something of a dumb concept, it is awesome for document traversal... and doesn't require well-formed XHTML.
http://code.google.com/p/phpquery/
#3
0
I'm working with Selenium under Java for web application testing. It provides very nice features for document traversal using CSS selectors.
Have a look at "How to use Selenium with PHP".
But this setup might be too complex for your needs if you only want to extract this one link.