100 points: asking the experts how to fetch a web page's source.

The URL is: http://googleads.g.doubleclick.net/pagead/sdo?client=dist-aff-pub-8581564299250417&output=html&dt=1246961633609&format=js_sdo&h=35&w=500&correlator=1246961633609&same_win=2&logo=left&rl_pos=right&cts_mode=rs&num_cts=2&box_h=26&box_w=215&u_h=768&u_w=1024&u_ah=738&u_aw=1024&u_cd=32&u_tz=480&u_his=7&u_java=true&u_nplug=0&u_nmime=0&frm=0&lmt=1246961632&url=http://127.0.0.1/search.html&dtd=0

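Typing the URL above directly into a browser opens the page normally, so the referer should not need to be considered when fetching it.
The methods I tried, all of which failed:
curl: failed
fsockopen: failed
file_get_contents: failed

I'm out of ideas. I've spent three days on this and still can't get it to work. Any help is appreciated.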

23 solutions

#1

What puzzles me most is that a fairly popular scraping program, 火车头采集器, can fetch the content of the URL above correctly.

#2

Does fopen work?

#3

No, it doesn't. This is driving me crazy.

#4

Following this thread.

#5

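When I open this page in a browser I get nothing at all.
When scraping, you need to forge the Referer, the other request headers, and the User-Agent.
If necessary, you also have to set cookies, obtain the page's session,
and handle any captcha the page generates at random.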
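
The advice above can be sketched with PHP's HTTP stream wrapper. This is a minimal illustration, not a fix confirmed to work against this URL: the helper names `build_request_headers` and `fetch_with_headers`, and every sample header value, are assumptions made up for the example.

```php
<?php
/**
 * Build a list of HTTP header lines that imitate a browser request.
 * Hypothetical helper: the name and parameters are illustrative only.
 *
 * @param string $userAgent value for the User-Agent header
 * @param string $referer   value for the Referer header ('' to omit it)
 * @param array  $cookies   cookie name => value pairs (may be empty)
 * @return array header lines without trailing CRLF
 */
function build_request_headers($userAgent, $referer = '', array $cookies = array())
{
    $headers = array('User-Agent: ' . $userAgent);
    if ($referer !== '') {
        $headers[] = 'Referer: ' . $referer;
    }
    if ($cookies) {
        $pairs = array();
        foreach ($cookies as $name => $value) {
            $pairs[] = $name . '=' . $value;
        }
        // All cookies travel in a single Cookie header, separated by "; ".
        $headers[] = 'Cookie: ' . implode('; ', $pairs);
    }
    return $headers;
}

/**
 * Fetch $url with the given header lines. Returns the body, or false on failure.
 */
function fetch_with_headers($url, array $headers)
{
    $context = stream_context_create(array(
        'http' => array(
            'method'  => 'GET',
            // The http wrapper accepts the headers as one CRLF-joined string.
            'header'  => implode("\r\n", $headers) . "\r\n",
            'timeout' => 10,
        ),
    ));
    return @file_get_contents($url, false, $context);
}
```

A call might look like `fetch_with_headers($url, build_request_headers('Mozilla/5.0', 'http://127.0.0.1/search.html', array('sid' => 'abc')))`, where the cookie name and value are placeholders to be replaced with whatever the browser actually sends.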

#6

$url = 'http://googleads.g.doubleclick.net/pagead/sdo?client=dist-aff-pub-8581564299250417&output=html&dt=1246961633609&format=js_sdo&h=35&w=500&correlator=1246961633609&same_win=2&logo=left&rl_pos=right&cts_mode=rs&num_cts=2&box_h=26&box_w=215&u_h=768&u_w=1024&u_ah=738&u_aw=1024&u_cd=32&u_tz=480&u_his=7&u_java=true&u_nplug=0&u_nmime=0&frm=0&lmt=1246961632&url=http://127.0.0.1/search.html&dtd=0';

// Options for the http stream wrapper: send a GET with a browser-like User-Agent.
$opts = array(
	'http'=>array(
		'method'=>'GET',
		'header'=>'User-Agent: Mozilla'
	)
);

// Second argument: do not search the include path. Third: the stream context.
echo $html = file_get_contents($url,false,stream_context_create($opts));

What is there worth downloading on this page? -_-
It looks more like an ad.

#7

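A failure has to have a reason! Paste the URL into a browser and view the source; whatever is in there, all three of your methods should be able to download it. If necessary, use an HTTP monitoring tool to check whether there are cookie or Referer restrictions.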

#8

+1 to yctin.

#9

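Just capture the traffic, or look at the request headers in Firebug, then send exactly the same headers over a socket. There is no way that fails to fetch the page.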
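
As a rough sketch of that approach, the helper below rebuilds a raw request from header lines copied out of a capture tool. The names `build_raw_request` and `send_raw_request` are hypothetical, and using HTTP/1.0 is an assumption made here so the server does not answer with a chunked body that would need decoding.

```php
<?php
/**
 * Build a raw HTTP/1.0 GET request from a URL and captured header lines.
 * Hypothetical helper; returns false if the URL has no host.
 */
function build_raw_request($url, array $headerLines)
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return false;
    }
    $path = isset($parts['path']) ? $parts['path'] : '/';
    if (isset($parts['query'])) {
        $path .= '?' . $parts['query'];
    }
    $host = $parts['host'] . (isset($parts['port']) ? ':' . $parts['port'] : '');

    $request  = "GET " . $path . " HTTP/1.0\r\n";
    $request .= "Host: " . $host . "\r\n";
    foreach ($headerLines as $line) {
        // Normalise line endings copied from the capture tool.
        $request .= rtrim($line) . "\r\n";
    }
    // The empty line after the last header ends the header section.
    return $request . "Connection: close\r\n\r\n";
}

/**
 * Send the request over a plain socket and return the raw response
 * (status line, headers and body), or false on failure.
 */
function send_raw_request($url, array $headerLines)
{
    $request = build_raw_request($url, $headerLines);
    if ($request === false) {
        return false;
    }
    $parts = parse_url($url);
    $port  = isset($parts['port']) ? $parts['port'] : 80;
    $fp    = @fsockopen($parts['host'], $port, $errno, $errstr, 30);
    if (!$fp) {
        return false;
    }
    fwrite($fp, $request);
    $response = stream_get_contents($fp);
    fclose($fp);
    return $response;
}
```

Pass in the captured lines except `Host` and `Connection`, which the helper writes itself, for example `send_raw_request($url, array('User-Agent: Mozilla/5.0', 'Accept: */*'))`.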

#10

Quoting yctin's reply (#6):

PHP code$url='http://googleads.g.doubleclick.net/pagead/sdo?client=dist-aff-pub-8581564299250417&output=html&dt=1246961633609&format=js_sdo&h=35&w=500&correlator=1246961633609&same_win=2&logo=left&rl_pos=right&cts_mode=rs&num_cts=2&box_h=26&box_w=215&u_h=768&u_w=1024&u_ah=738&u_aw=1024&u_cd=32&u_tz=480&u_his=7&u_java=true&u_nplug=0&u_nmime=0&frm=0&lmt=1246961632&url=http://127.0.0.1/search.html&…

Bump. Your approach is very good. Could you explain what the other two parameters mean? I couldn't find them in the manual. Thanks.
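
On the two extra parameters: in `file_get_contents($url, false, $context)` the second argument is `$use_include_path`, which only matters for local files and is therefore `false` here, and the third is a stream context built by `stream_context_create`, which carries per-protocol options such as the HTTP method and headers. The sketch below shows both in use and also reports the HTTP status, so a failure yields more than just `false`. The helper names `parse_status_code` and `fetch_and_report` are hypothetical.

```php
<?php
/**
 * Extract the numeric status code from a status line such as
 * "HTTP/1.1 403 Forbidden". Returns 0 if the line is not a status line.
 * Hypothetical helper for illustration.
 */
function parse_status_code($statusLine)
{
    if (preg_match('#^HTTP/\d+(?:\.\d+)?\s+(\d{3})#', $statusLine, $m)) {
        return (int) $m[1];
    }
    return 0;
}

/**
 * Fetch $url and return array(status code, body).
 * The status code is 0 when no HTTP response was received at all.
 */
function fetch_and_report($url)
{
    $context = stream_context_create(array(
        'http' => array(
            'method'        => 'GET',
            'header'        => "User-Agent: Mozilla\r\n",
            // Return the body even for 4xx/5xx responses.
            'ignore_errors' => true,
        ),
    ));
    // 2nd argument: $use_include_path, meaningless for URLs, so false.
    // 3rd argument: the stream context carrying the HTTP options above.
    $body = @file_get_contents($url, false, $context);

    // After an HTTP request PHP fills $http_response_header in this scope.
    // If redirects were followed it holds several status lines; keep the last.
    $status = 0;
    if (!empty($http_response_header)) {
        foreach ($http_response_header as $line) {
            $code = parse_status_code($line);
            if ($code !== 0) {
                $status = $code;
            }
        }
    }
    return array($status, $body);
}
```

Calling `list($status, $body) = fetch_and_report($url);` and printing `$status` shows whether the server refused the request (for example with a 403) or whether the connection itself failed.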

#11

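Quoting foolbirdflyfirst's reply (#9):

Just capture the traffic, or look at the request headers in Firebug, then send exactly the same headers over a socket. There is no way that fails to fetch the page.

Hi, I really did test the socket approach, and it really doesn't work.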

#12

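Quoting yemingwy's reply (#7):

A failure has to have a reason! Paste the URL into a browser and view the source; whatever is in there, all three of your methods should be able to download it. If necessary, use an HTTP monitoring tool to check whether there are cookie or Referer restrictions.

It opens when typed directly into the IE address bar, so doesn't that prove there is no Referer check?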

#13

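Quoting lovewangya's reply (#11):

Quoting foolbirdflyfirst's reply (#9):
Just capture the traffic, or look at the request headers in Firebug, then send exactly the same headers over a socket. There is no way that fails to fetch the page.

Hi, I really did test the socket approach, and it really doesn't work.

That is probably because you didn't send a User-Agent. How can you say sockets don't work? If you are sending and receiving HTTP, what would you use other than a socket?
The browser sends its requests over a socket too, so whatever the browser can do, you can do with fsockopen.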

#14

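What does the failure actually show? As far as I remember, these three methods are the most mainstream ones. OP, please post the error output.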
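
One way to produce that error output from a socket-based attempt is to split the raw response and print its status line and headers. This is only a sketch: `split_response` is a hypothetical helper, and `$raw` is assumed to be the complete response read from the socket.

```php
<?php
/**
 * Split a raw HTTP response into its status line, header lines and body.
 * Hypothetical helper for illustration.
 *
 * @param string $raw complete response as read from the socket
 * @return array with keys 'status' (string), 'headers' (array), 'body' (string)
 */
function split_response($raw)
{
    // Headers and body are separated by the first empty line.
    $pos = strpos($raw, "\r\n\r\n");
    if ($pos === false) {
        // No header/body separator: treat everything as body.
        return array('status' => '', 'headers' => array(), 'body' => $raw);
    }
    $lines  = explode("\r\n", substr($raw, 0, $pos));
    $status = array_shift($lines);
    return array(
        'status'  => $status,
        'headers' => $lines,
        'body'    => substr($raw, $pos + 4),
    );
}
```

Printing `$parts['status']` and `$parts['headers']` from `$parts = split_response($raw);` gives readers of the thread something concrete to diagnose, such as a 302 with a `Location` header or a 403.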

#15

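The OP doesn't seem to know how a browser works.
Everything can be simulated. As long as the browser can open the page, cURL and fsockopen can fetch exactly the same source.
You need to simulate data such as the SID, cookies, and tokens.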

#16

Following along to learn.

#17

Learning from this...
There is probably a problem with the OP's test code.

#18

Heh, it only means the right method hasn't been found yet.

#19

What are you scraping that for? Practice? That can be done with plain HTML, or you can use Google's public AJAX API.

#20



<?php
// Fetch a URL over a raw socket, following Location redirects.
function get_url($url)
{
	$url	= str_replace(" ", "%20", $url);
	if(substr($url, 0, 7) != "http://")
	{
		return false;
	}
	// Split the URL into host and path (including the query string).
	$pos	= strpos($url, "/", 7);
	if($pos === false)
	{
		$host	= substr($url, 7);
		$path	= "/";
	}
	else
	{
		$host	= substr($url, 7, $pos - 7);
		$path	= substr($url, $pos);
	}
	// HTTP/1.0 keeps the server from replying with a chunked body,
	// which this function does not decode.
	$http_header	= "GET ".$path." HTTP/1.0\r\n";
	$http_header	.= "User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; Alexa Toolbar)\r\n";
	$http_header	.= "Host: ".$host."\r\n";
	$http_header	.= "Connection: close\r\n";
	$http_header	.= "Cache-Control: no-cache\r\n";
	$http_header	.= "\r\n";
	$fp		= @fsockopen($host, 80, $errno, $errstr, 30);
	if (!$fp)
	{
		return false;
	}
	fwrite($fp, $http_header);
	// Read the response headers line by line.
	while(!feof($fp))
	{
		$tmp_stream		= fgets($fp, 1280);
		if(strtolower(substr($tmp_stream, 0, 9)) == "location:")
		{
			//---- redirected ----
			$remote_url	= trim(substr($tmp_stream, 9));
			if(substr($remote_url, 0, 7) != "http://")
			{
				if(substr($remote_url, 0, 1) == "/")
				{
					// Absolute path on the same host: keep the leading slash.
					$remote_url	= "http://".$host.$remote_url;
				}
				else
				{
					// Relative to the current directory; drop the query string
					// first so a "/" inside it is not mistaken for a path separator.
					$base	= "http://".$host.$path;
					$qpos	= strpos($base, "?");
					if($qpos !== false)
					{
						$base	= substr($base, 0, $qpos);
					}
					$remote_url	= substr($base, 0, strrpos($base, "/"))."/".$remote_url;
				}
			}
			fclose($fp);
			return get_url($remote_url);
		}
		if($tmp_stream == "\r\n")
		{
			//---- end of headers ----
			break;
		}
	}
	// Everything after the blank line is the body.
	$data	= "";
	while(!feof($fp))
	{
		$data	.= fgets($fp, 1280);
	}
	fclose($fp);
	return $data;
}

echo get_url("http://googleads.g.doubleclick.net/pagead/sdo?client=dist-aff-pub-8581564299250417&output=html&dt=1246961633609&format=js_sdo&h=35&w=500&correlator=1246961633609&same_win=2&logo=left&rl_pos=right&cts_mode=rs&num_cts=2&box_h=26&box_w=215&u_h=768&u_w=1024&u_ah=738&u_aw=1024&u_cd=32&u_tz=480&u_his=7&u_java=true&u_nplug=0&u_nmime=0&frm=0&lmt=1246961632&url=http://127.0.0.1/search.html&dtd=0");
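
The redirect handling in the function above is the part most likely to go wrong, so here is a separate sketch of resolving a `Location` value against the URL that returned it. The name `resolve_location` is hypothetical, and the helper is deliberately limited: it does not normalise `../` segments and does not handle protocol-relative values that start with `//`.

```php
<?php
/**
 * Resolve a Location header value against the URL that produced it.
 * Hypothetical helper; $baseUrl is assumed to be an absolute http(s) URL.
 */
function resolve_location($baseUrl, $location)
{
    // Already an absolute URL: use it unchanged.
    if (preg_match('#^https?://#i', $location)) {
        return $location;
    }
    $parts  = parse_url($baseUrl);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');

    // Root-relative: keep the leading slash.
    if (substr($location, 0, 1) === '/') {
        return $origin . $location;
    }
    // Relative: replace everything after the last "/" of the base path.
    // The query string is not part of $parts['path'], so slashes inside it
    // cannot confuse the lookup.
    $path = isset($parts['path']) ? $parts['path'] : '/';
    $dir  = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $location;
}
```

Working from `parse_url` rather than from string offsets matters for the URL in this thread, because its query string itself contains `http://127.0.0.1/search.html`.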

#21

This requires cURL support; enable the extension in your php.ini yourself.



<?php
/**
 * @desc Fetch a URL while sending a Referer header
 * @author shadu###foxmail.com
 * @param string $url the page to fetch
 * @param string $ref the page the request claims to come from
 */
function get_UrlContent($url,$ref) {
	$ch = curl_init ( $url );
	curl_setopt ( $ch, CURLOPT_TIMEOUT, 10 );
	curl_setopt ( $ch, CURLOPT_RETURNTRANSFER, 1 );
	curl_setopt ( $ch, CURLOPT_FOLLOWLOCATION, true );
	curl_setopt ( $ch, CURLOPT_USERAGENT, "Mozilla/4.0 (compatible; MSIE 5.01; Windows NT 5.0)" );
	curl_setopt ( $ch, CURLOPT_REFERER, $ref);
	$content = curl_exec ( $ch );
	curl_close ( $ch );
	return $content;
}

/**
 * @desc My guess is that the OP wants to embed Google ad code in http://127.0.0.1/search.html,
 *       so it is best to set $ref; otherwise the ad revenue won't be credited to you :)
 */
$ref = 'http://127.0.0.1/search.html';

$url = "http://googleads.g.doubleclick.net/pagead/sdo?client=dist-aff-pub-8581564299250417&output=html"
		."&dt=1246961633609&format=js_sdo&h=35&w=500&correlator=1246961633609&same_win=2&logo=left&rl_pos=right"
		."&cts_mode=rs&num_cts=2&box_h=26&box_w=215&u_h=768&u_w=1024&u_ah=738&u_aw=1024&u_cd=32&u_tz=480&u_his=7"
		."&u_java=true&u_nplug=0&u_nmime=0&frm=0&lmt=1246961632"
		."&url=$ref&dtd=0";

echo get_UrlContent($url,$ref);
?>

#22



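<input type="text" id="goo" />
<input type="button" id="but" />

<script type="text/javascript">
var tmp=document.getElementById('but');
tmp.onclick=function(){
   var obj=document.getElementById('goo');
   // Encode the query so spaces and special characters survive in the URL.
   window.open('http://www.google.cn/search?hl=zh-CN&q='+encodeURIComponent(obj.value));
};
</script>

Adjust it to your own requirements and it should do.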

#23

Following!
