[ Crawler ] Techniques for Keeping a Crawler from Being Blocked

Technique 1: Simulate a real Request (use a random UserAgent, a random Proxy, and random time intervals to get past the wall)

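Prepare a UserAgent array and a Proxy array, pair their entries at random, and send the request with that pair. Typically there is a ScrapManager that wraps a UserAgentManager and a ProxyManager. Note that when polling through a list of pages, you need to Sleep for a while between requests.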

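// Consts.RandInt() is the author's own helper (not shown in this post); it is assumed to return a random number of seconds.
Thread.Sleep(Consts.RandInt() * 1000);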

public class ScrapManager
{
    public static void Load()
    {
        ProxyManager.Load();
        UserAgentManager.Load();
    }

    public static void Next()
    {
        ProxyManager.Next();
        UserAgentManager.Next();
    }
}

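public class ProxyManager
{
    public static string Proxy = " your proxy ";

    public static void Load()
    {
        // Load the Proxy array here.
    }

    public static void Next()
    {
        // Pick a random entry from the Proxy array and assign it to Proxy.
    }
}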

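public class UserAgentManager
{
    public static string UserAgent = "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)";

    public static void Load()
    {
        // Load the UserAgent array here.
    }

    public static void Next()
    {
        // Pick a random entry from the UserAgent array and assign it to UserAgent.
    }
}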
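
The Load and Next bodies above are left empty in the post. As a rough sketch of one way to fill them in, the hypothetical RandomPool class below keeps a list of strings and picks one at random. The names RandomPool and ProxyManagerSketch and the file proxies.txt (one proxy per line) are assumptions for illustration, not part of the original code.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper: holds a list of values and picks one at random.
// Intended for single-threaded use, since System.Random is not thread-safe.
public class RandomPool
{
    private static readonly Random Rng = new Random();
    private readonly List<string> _items = new List<string>();

    // The entry selected by the most recent call to Next (null if the pool is empty).
    public string Current { get; private set; }

    // Replaces the pool with the given lines, ignoring blank ones, then picks an entry.
    public void Load(IEnumerable<string> lines)
    {
        _items.Clear();
        _items.AddRange(lines.Select(l => l.Trim()).Where(l => l.Length > 0));
        Next();
    }

    // Picks a random entry from the pool.
    public void Next()
    {
        Current = _items.Count == 0 ? null : _items[Rng.Next(_items.Count)];
    }
}

// Sketch of a ProxyManager built on the pool; a UserAgentManager would look the same.
public static class ProxyManagerSketch
{
    private static readonly RandomPool Pool = new RandomPool();

    public static string Proxy
    {
        get { return Pool.Current; }
    }

    public static void Load()
    {
        // "proxies.txt" is an assumed file name with one proxy address per line.
        Pool.Load(File.ReadLines("proxies.txt"));
    }

    public static void Next()
    {
        Pool.Next();
    }
}
```

Because the proxy and the UserAgent are each drawn independently, every call to Next on both managers yields a fresh random pairing, which is the behaviour Technique 1 describes.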

string HtmlContent  = string.Empty;

// Request
HttpWebRequest m_HttpWebRequest = (HttpWebRequest)WebRequest.Create("your link");

// Proxy
m_HttpWebRequest.Proxy = new WebProxy(ProxyManager.Proxy, true);

// UserAgent
m_HttpWebRequest.UserAgent = UserAgentManager.UserAgent;
m_HttpWebRequest.Method = "GET";
m_HttpWebRequest.Timeout = -1; // -1 means no timeout: the request waits indefinitely

// Response
// Dispose both the response and the reader once the content has been read.
using (HttpWebResponse m_HttpWebResponse = (HttpWebResponse)m_HttpWebRequest.GetResponse())
using (StreamReader reader = new StreamReader(m_HttpWebResponse.GetResponseStream()))
{
    HtmlContent = reader.ReadToEnd();
}
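
To show how the pieces might fit together when polling a list of pages, here is a minimal sketch of the loop: rotate the Proxy/UserAgent pair, fetch, then sleep for a random interval. The PollingLoop class and its parameters are hypothetical. The rotate delegate stands in for ScrapManager.Next, and fetch stands in for a method wrapping the request code above.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Threading;

// Hypothetical polling loop; not part of the original post.
public static class PollingLoop
{
    private static readonly Random Rng = new Random();

    // urls:   the pages to visit
    // rotate: switches to a new Proxy/UserAgent pair (for example ScrapManager.Next)
    // fetch:  downloads one URL and returns its HTML
    // minSeconds/maxSeconds: inclusive bounds of the random pause between requests
    public static Dictionary<string, string> Run(
        IEnumerable<string> urls,
        Action rotate,
        Func<string, string> fetch,
        int minSeconds,
        int maxSeconds)
    {
        var pages = new Dictionary<string, string>();
        foreach (string url in urls)
        {
            rotate();
            try
            {
                pages[url] = fetch(url);
            }
            catch (WebException)
            {
                // This pair failed (dead proxy, refused request, ...).
                // Skip the page; the next iteration uses a different pair.
            }

            // Random.Next excludes the upper bound, hence the + 1.
            Thread.Sleep(Rng.Next(minSeconds, maxSeconds + 1) * 1000);
        }
        return pages;
    }
}
```

A call might look like PollingLoop.Run(urls, ScrapManager.Next, Fetch, 2, 8), where Fetch is an assumed method containing the HttpWebRequest code and the 2 to 8 second range is only an example.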

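Summary: as long as you keep things random, you generally won't be blocked completely. The limiting factor is the number of proxies you have on hand, and you need a lot of them. I have 14 proxies myself and still find it a bit of a struggle.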

Technique 2: Nest the original page in an Iframe and scrape it from the front end (for HTML pages that load their content via script)

Reference: http://www.cnblogs.com/VincentDao/archive/2013/02/05/2892466.html

Summary: fairly simple to implement, and well suited to scraping sites that load their data via script.

Technique 3: Forge Cookies (to counter the blocking measures of certain portals)

// Cookie
CookieContainer m_CookieContainer = new CookieContainer();
m_HttpWebRequest.CookieContainer = m_CookieContainer;
m_HttpWebRequest.CookieContainer.Add(new Cookie() { Name = "key", Value = "value", Domain = "www.example.com" });

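Summary: you can use the IE Developer Tools, Firefox, or Chrome to inspect the Cookies in the request and the response. This technique generally handles scraping pages from online stores and social networks.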
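
When the browser tools show a whole Cookie request header, adding each pair by hand gets tedious. The sketch below is one possible way to turn such a header string into a CookieContainer. CookieHeaderImporter is a hypothetical helper, and the header text and domain in the usage note are placeholders.

```csharp
using System;
using System.Net;

// Hypothetical helper; not part of the original post.
public static class CookieHeaderImporter
{
    // Parses a header value such as "key=value; other=1" into a CookieContainer
    // whose cookies are bound to the given domain and the root path.
    // Note: System.Net.Cookie rejects values that contain ',' or ';'.
    public static CookieContainer FromHeader(string cookieHeader, string domain)
    {
        var container = new CookieContainer();
        string[] parts = cookieHeader.Split(new[] { ';' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (string part in parts)
        {
            string pair = part.Trim();
            int eq = pair.IndexOf('=');
            if (eq <= 0)
            {
                continue; // no name before '=' (or no '=' at all): skip this piece
            }

            string name = pair.Substring(0, eq).Trim();
            string value = pair.Substring(eq + 1).Trim();
            container.Add(new Cookie(name, value, "/", domain));
        }
        return container;
    }
}
```

Usage would be along the lines of m_HttpWebRequest.CookieContainer = CookieHeaderImporter.FromHeader("key=value; other=1", "www.example.com");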

Technique 4: Use Selenium to drive a browser and scrape the page

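Summary: this has the lowest probability of being blocked and handles the shortcomings exposed by the techniques above well. It does demand a fairly high skill level from the Dev.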
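
The post gives no code for this technique. A minimal sketch using the Selenium WebDriver C# bindings might look like the following. It assumes the Selenium.WebDriver package is referenced and that Chrome with a matching driver is installed. SeleniumFetch is a hypothetical name.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;

// Hypothetical helper; not part of the original post.
public static class SeleniumFetch
{
    // Opens the URL in a real Chrome instance and returns the page source
    // as the browser sees it once the page has loaded.
    public static string GetRenderedHtml(string url)
    {
        IWebDriver driver = new ChromeDriver();
        try
        {
            driver.Navigate().GoToUrl(url);
            return driver.PageSource;
        }
        finally
        {
            // Always close the browser, even if navigation throws.
            driver.Quit();
        }
    }
}
```

Data that scripts load after the initial page load may not be in PageSource yet, so a real crawler would probably need to wait for a specific element before reading it.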
