I want to download a page from the web. This works in an ordinary browser like Firefox, but when I use file_get_contents() the server refuses: it replies that it understands the request but does not allow such downloads.
So what should I do? I think I have seen some scripts (in Perl) that make themselves look like a real browser by setting a user agent and cookies, so that the server treats the script as a real web browser.
Does anyone know how this can be done?
4 Solutions
#1
Use cURL.
<?php
// create curl resource
$ch = curl_init();
// set url
curl_setopt($ch, CURLOPT_URL, "example.com");
//return the transfer as a string
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
// set the UA
curl_setopt($ch, CURLOPT_USERAGENT, 'My App (http://www.example.com/)');
// Alternatively, lie, and pretend to be a browser
// curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)');
// $output contains the output string
$output = curl_exec($ch);
// close curl resource to free up system resources
curl_close($ch);
?>
#2
Yeah, cURL is pretty good at getting page content. I use it with classes like DOMDocument and DOMXPath to grind the content into a usable form.
function __construct($useragent,$url)
{
    $this->useragent='Firefox (WindowsXP) - Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.'.$useragent;
    $this->url=$url;
    $ch = curl_init();
    // send the full user agent string built above, not just the raw argument
    curl_setopt($ch, CURLOPT_USERAGENT, $this->useragent);
    curl_setopt($ch, CURLOPT_URL,$url);
    // treat HTTP status codes >= 400 as failures
    curl_setopt($ch, CURLOPT_FAILONERROR, true);
    // follow redirects and set the Referer header automatically while doing so
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_AUTOREFERER, true);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER,true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);
    $html= curl_exec($ch);
    $dom = new DOMDocument();
    // @ suppresses the warnings that malformed real-world HTML triggers
    @$dom->loadHTML($html);
    $this->xpath = new DOMXPath($dom);
}
...
public function displayResults($site)
{
    $data=$this->path[0]->length;
    for($i=0;$i<$data;$i++)
    {
        $delData=$this->path[0]->item($i);
        //setting the href and title properties
        $urlSite=$delData->getElementsByTagName('a')->item(0)->getAttribute('href');
        $titleSite=$delData->getElementsByTagName('a')->item(0)->nodeValue;
        //setting the number of saves
        $saves=$delData->getElementsByTagName('span')->item(0)->nodeValue;
        if ($saves==NULL)
        {
            $saves=0;
        }
        //build the array
        $this->newSiteBookmark[$i]['source']='delicious.com';
        $this->newSiteBookmark[$i]['url']=$urlSite;
        $this->newSiteBookmark[$i]['title']=$titleSite;
        $this->newSiteBookmark[$i]['saves']=$saves;
    }
}
The latter is part of a class that scrapes data from delicious.com. Not very legal, though.
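To make the DOMDocument/DOMXPath step concrete, here is a minimal, self-contained sketch that parses an HTML string instead of a live page. The function name extract_links() and the sample markup are made up for illustration; they are not part of the class above.

```php
<?php
// Hypothetical helper: return the href and text of every link in an HTML string.
function extract_links(string $html): array
{
    $dom = new DOMDocument();
    // Real-world markup is rarely valid, so collect parser warnings
    // internally instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($dom);
    $links = [];
    // Select every <a> element that carries an href attribute.
    foreach ($xpath->query('//a[@href]') as $a) {
        $links[] = [
            'url'   => $a->getAttribute('href'),
            'title' => trim($a->textContent),
        ];
    }
    return $links;
}

// Sample markup (an assumption, standing in for the output of curl_exec()).
$sample = '<html><body><div class="post"><a href="http://www.example.com/">Example</a> <span>12</span></div></body></html>';
print_r(extract_links($sample));
```

In a real script you would pass the string returned by curl_exec() to the helper instead of $sample.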
#3
This answer takes your comment on Rich's answer into account.
The site is probably checking whether you are a real user by looking at the HTTP referer or the User-Agent string. Try setting these on your cURL handle:
//pretend you came from their site already
curl_setopt($ch, CURLOPT_REFERER, 'http://domainofthesite.com');
//pretend you are Firefox 3.0.6 running on Windows Vista
curl_setopt($ch, CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6');
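Putting the pieces together, the sketch below sets the referer, the user agent, and a cookie file in one go, and checks the result of the request. The function names browser_like_options() and fetch_page() and the cookie file location are assumptions for illustration. Whether a given site accepts the request depends on what it actually checks.

```php
<?php
// Hypothetical helper: build cURL options that resemble a browser request.
function browser_like_options(string $url, string $referer, string $cookieFile): array
{
    return [
        CURLOPT_URL            => $url,
        CURLOPT_RETURNTRANSFER => true,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_TIMEOUT        => 10,
        CURLOPT_REFERER        => $referer,
        CURLOPT_USERAGENT      => 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6',
        // Read cookies from, and save them to, the same file so that they
        // persist across requests.
        CURLOPT_COOKIEFILE     => $cookieFile,
        CURLOPT_COOKIEJAR      => $cookieFile,
    ];
}

// Hypothetical helper: fetch a page, or throw if the request fails.
function fetch_page(string $url, string $referer, string $cookieFile): string
{
    $ch = curl_init();
    curl_setopt_array($ch, browser_like_options($url, $referer, $cookieFile));

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('Request failed: ' . curl_error($ch));
    }

    // A refusal such as 403 Forbidden still returns a body, so check the status too.
    $status = curl_getinfo($ch, CURLINFO_HTTP_CODE);
    if ($status >= 400) {
        throw new RuntimeException("Server answered with HTTP status $status");
    }
    return $body;
}
```

A call could look like fetch_page('http://www.example.com/', 'http://www.example.com/', sys_get_temp_dir() . '/cookies.txt'). The options are built in a separate function so they can be inspected without sending a request.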
#4
Another way to do it (though others have pointed out a better way) is to use PHP's fopen() function, like so:
$handle = fopen("http://www.example.com/", "r");//open specified URL for reading
It's especially useful if cURL isn't available.
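If the server rejects the default request, fopen() and file_get_contents() can also send a custom user agent and referer through a stream context. The sketch below assumes allow_url_fopen is enabled. The helper name browser_like_context() is made up for illustration.

```php
<?php
// Hypothetical helper: build a stream context that sends browser-like headers.
function browser_like_context(string $userAgent, string $referer)
{
    return stream_context_create([
        'http' => [
            'method'     => 'GET',
            'user_agent' => $userAgent,
            'header'     => "Referer: $referer\r\n",
            'timeout'    => 10,
        ],
    ]);
}

$context = browser_like_context(
    'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.6) Gecko/2009011913 Firefox/3.0.6',
    'http://www.example.com/'
);

// Both functions accept the context. The calls are commented out so that the
// sketch does not touch the network when run as-is.
// $html   = file_get_contents('http://www.example.com/', false, $context);
// $handle = fopen('http://www.example.com/', 'r', false, $context);
```

This keeps the original file_get_contents() approach from the question, at the cost of less control than cURL offers (for example, no built-in cookie jar).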