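Preface:
First, my apologies: I have been busy since the Spring Festival and did not keep this series up to date.
Recently, as the number of source sites we monitor has grown, some of them have added anti-crawling measures. As a result, the small crawler service in our SupportYun system frequently gets its IP banned and cannot collect data.
This is where the IP proxies that a fellow cnblogs reader suggested earlier come into play.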
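IP proxy pool design:
After reviewing and comparing a number of sources, I decided to start by crawling the free proxies published on the major IP proxy websites and to build my own IP proxy pool from them.
In the end I crawled five reasonably good IP proxy sites:
1. Xici Proxy (西刺代理)
2. Kuaidaili (快代理)
3. Bige Proxy (逼格代理)
4. proxy360
5. 66 Free Proxy (66免费代理)
The IP proxy pool design is as follows:
Put simply, every source site that is known to have anti-crawling measures is tagged, and all crawler services are modified so that, for a source with that tag, they first fetch a random usable proxy IP from the pool and only then crawl the data.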
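The tagging rule above can be sketched roughly as follows. This is a minimal illustration, not code from the project: SiteSource, HasAntiCrawl and ProxySelector are hypothetical names, and getProxy stands in for whatever returns a random "ip:port" entry from the pool.

```csharp
using System;
using System.Net;

// Hypothetical model of a monitored source site (names are assumptions).
public class SiteSource
{
    public string Url { get; set; }
    // The "tag" from the text: true when the site is known to block crawlers.
    public bool HasAntiCrawl { get; set; }
}

public static class ProxySelector
{
    // Returns a WebProxy for tagged sources, or null to signal direct access.
    public static WebProxy SelectProxy(SiteSource source, Func<string> getProxy)
    {
        if (source == null) throw new ArgumentNullException("source");
        if (!source.HasAntiCrawl) return null;          // untagged: crawl directly

        string entry = getProxy();
        if (string.IsNullOrEmpty(entry)) return null;   // pool is empty: caller decides

        string[] parts = entry.Split(':');
        int port;
        if (parts.Length != 2 || !int.TryParse(parts[1], out port)) return null;
        if (port < 1 || port > 65535) return null;
        return new WebProxy(parts[0], port);
    }
}
```

A crawler would call SelectProxy once per request and assign the result to the request's Proxy property only when it is not null.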
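Installing Redis:
First, we need a server on which to deploy our Redis service (leaving clustering aside for now).
I have never been fond of opening a little black window and typing command after command. In my personal opinion, the GUI is one of the major factors behind the rapid development of computing (no flames, please).
After some searching I found a simple Redis installer (Windows version, trivially easy to install) at the following address: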
http://download.csdn.net/detail/cb511612371/9784687
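I also found a cnblogs post that explains the Redis configuration file; here it is for reference: http://www.cnblogs.com/kreo/p/4423362.html
For the record, I only raised the memory limit, allowed connections from external hosts, and set a password; I changed little else.
Note that the configuration file is in the installation directory and is named Redis.window-server.conf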
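Those three changes correspond to directives like the ones below. This is only a sketch: the values are placeholders I have assumed, not the settings actually used, and a Redis instance that accepts external connections should always sit behind a strong password and a firewall.

```
# Cap the memory Redis may use (placeholder value)
maxmemory 256mb
# Listen on all interfaces so that external hosts can connect
bind 0.0.0.0
# Require clients to authenticate (replace with your own password)
requirepass myRedisPassword
```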
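As most readers will know, the C# Redis driver ServiceStack.Redis can be installed straight from NuGet. The catch is that it went commercial from version 4.0 onward, with a limit of 6000 requests per hour, so you either download version 3.9 or consider another driver such as StackExchange.Redis.
I use ServiceStack V3.9; here is the download link: http://download.csdn.net/detail/cb511612371/9784626
Below is the RedisManageService I wrote on top of ServiceStack. The business requirements are simple, so it only uses a handful of APIs; please bear with it.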
using System;
using System.Collections.Generic;
using System.Configuration;
using ServiceStack.Redis;

/// <summary>
/// Redis operations service built on ServiceStack.Redis.
/// Only set storage is used at present.
/// </summary>
public class RedisManageService
{
    private static readonly string redisAddress = ConfigurationManager.AppSettings["RedisAddress"];
    // NOTE: the port argument was lost from the original listing; 6379, the Redis default, is assumed here.
    private const int redisPort = 6379;
    private static readonly string redisPassword = "myRedisPassword";

    /// <summary>
    /// Gets one random item from a set.
    /// </summary>
    public static string GetRandomItemFromSet(RedisSetNameEnum setName)
    {
        using (RedisClient client = new RedisClient(redisAddress, redisPort, redisPassword))
        {
            var result = client.GetRandomItemFromSet(setName.ToString());
            if (result == null)
            {
                throw new Exception("Redis set " + setName.ToString() + " has no data left!");
            }
            return result;
        }
    }

    /// <summary>
    /// Removes the given item from a set.
    /// </summary>
    public static void RemoveItemFromSet(RedisSetNameEnum setName, string value)
    {
        using (RedisClient client = new RedisClient(redisAddress, redisPort, redisPassword))
        {
            client.RemoveItemFromSet(setName.ToString(), value);
        }
    }

    /// <summary>
    /// Adds one item to a set.
    /// </summary>
    public static void AddItemToSet(RedisSetNameEnum setName, string value)
    {
        using (RedisClient client = new RedisClient(redisAddress, redisPort, redisPassword))
        {
            client.AddItemToSet(setName.ToString(), value);
        }
    }

    /// <summary>
    /// Adds a list of items to a set.
    /// </summary>
    public static void AddItemListToSet(RedisSetNameEnum setName, List<string> values)
    {
        using (RedisClient client = new RedisClient(redisAddress, redisPort, redisPassword))
        {
            client.AddRangeToSet(setName.ToString(), values);
        }
    }

    /// <summary>
    /// Checks whether a value already exists in a set.
    /// </summary>
    public static bool JudgeItemInSet(RedisSetNameEnum setName, string value)
    {
        using (RedisClient client = new RedisClient(redisAddress, redisPort, redisPassword))
        {
            // Ask the server for membership instead of enumerating the whole set on the client.
            return client.SetContainsItem(setName.ToString(), value);
        }
    }

    /// <summary>
    /// Gets the number of items in a set.
    /// </summary>
    public static long GetSetCount(RedisSetNameEnum setName)
    {
        using (RedisClient client = new RedisClient(redisAddress, redisPort, redisPassword))
        {
            return client.GetSetCount(setName.ToString());
        }
    }
}
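RedisSetNameEnum is used throughout the service but is not shown in this post. A minimal sketch, assuming it is a plain enum whose member names double as the Redis set keys; only ProxyPool appears in the code here, and the RedisSetNames helper is hypothetical.

```csharp
using System;

// Assumed shape of the enum; only ProxyPool appears in the article's code.
public enum RedisSetNameEnum
{
    ProxyPool
}

public static class RedisSetNames
{
    // The service passes setName.ToString() to Redis, so the member name is the key.
    public static string ToKey(RedisSetNameEnum setName)
    {
        return setName.ToString();
    }
}
```

Keeping the keys in an enum means a mistyped set name fails at compile time rather than silently creating a second set.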
Implementing the free proxy IP crawling service:
First we design the simplest possible IpProxy object:
/// <summary>
/// IP proxy object
/// </summary>
public class IpProxy
{
    /// <summary>
    /// IP address
    /// </summary>
    public string Address { get; set; }

    /// <summary>
    /// Port
    /// </summary>
    public int Port { get; set; }
}
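The pool stores each proxy as an "address:port" string, so it helps to convert between the two forms in one place. A minimal sketch, assuming IPv4 addresses; ProxyEntry is a hypothetical helper and is not part of the original project.

```csharp
using System;
using System.Net;

public static class ProxyEntry
{
    // Formats address and port as "address:port", the form stored in the pool.
    public static string Format(string address, int port)
    {
        return address + ":" + port;
    }

    // Parses "address:port"; returns false instead of throwing on bad input.
    public static bool TryParse(string entry, out string address, out int port)
    {
        address = null;
        port = 0;
        if (string.IsNullOrWhiteSpace(entry)) return false;

        string[] parts = entry.Trim().Split(':');
        if (parts.Length != 2) return false;

        IPAddress ignored;
        if (!IPAddress.TryParse(parts[0], out ignored)) return false;

        int parsedPort;
        if (!int.TryParse(parts[1], out parsedPort)) return false;
        if (parsedPort < 1 || parsedPort > 65535) return false;

        address = parts[0];
        port = parsedPort;
        return true;
    }
}
```

Using TryParse when reading entries back avoids an exception if a malformed string ever ends up in the set.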
Next we implement a Redis-backed service for operating on the IP proxy pool:
/// <summary>
/// Redis-backed proxy pool management service
/// </summary>
public class PoolManageService
{
    /// <summary>
    /// Gets one random proxy from the pool.
    /// </summary>
    public static string GetProxy()
    {
        string result = string.Empty;
        try
        {
            result = RedisManageService.GetRandomItemFromSet(RedisSetNameEnum.ProxyPool);
            if (result != null)
            {
                // Entries are stored as "address:port".
                string[] parts = result.Split(new[] { ':' });
                if (!HttpHelper.IsAvailable(parts[0], int.Parse(parts[1])))
                {
                    // Dead proxy: drop it from the pool and try another one.
                    DeleteProxy(result);
                    return GetProxy();
                }
            }
        }
        catch (Exception e)
        {
            LogUtils.ErrorLog(new Exception("Failed to get a proxy from the proxy pool", e));
        }
        return result;
    }

    /// <summary>
    /// Deletes one proxy from the pool.
    /// </summary>
    public static void DeleteProxy(string value)
    {
        try
        {
            RedisManageService.RemoveItemFromSet(RedisSetNameEnum.ProxyPool, value);
        }
        catch (Exception e)
        {
            LogUtils.ErrorLog(new Exception("Failed to delete a proxy from the proxy pool", e));
        }
    }

    /// <summary>
    /// Adds one proxy to the pool if it is currently usable.
    /// </summary>
    public static void Add(IpProxy proxy)
    {
        try
        {
            if (HttpHelper.IsAvailable(proxy.Address, proxy.Port))
            {
                RedisManageService.AddItemToSet(RedisSetNameEnum.ProxyPool, proxy.Address + ":" + proxy.Port.ToString());
            }
        }
        catch (Exception e)
        {
            LogUtils.ErrorLog(new Exception("Failed to add a proxy to the proxy pool", e));
        }
    }
}
It provides three simple methods: add a proxy IP, delete a proxy IP, and get a random proxy IP.
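HttpHelper.IsAvailable is called above but its code is not included in this post. One plausible way to write such a check is sketched below; it is an assumption, not the project's actual implementation. ProxyProbe, testUrl and timeoutMs are hypothetical names: the idea is simply to request a known page through the proxy and treat any failure as "not usable".

```csharp
using System;
using System.Net;

public static class ProxyProbe
{
    // Returns true if a GET to testUrl through the proxy answers HTTP 200
    // within timeoutMs milliseconds. Both parameters are illustrative.
    public static bool IsAvailable(string address, int port, string testUrl, int timeoutMs)
    {
        try
        {
            var request = (HttpWebRequest)WebRequest.Create(testUrl);
            request.Proxy = new WebProxy(address, port);
            request.Method = "GET";
            request.Timeout = timeoutMs;
            request.ReadWriteTimeout = timeoutMs;
            using (var response = (HttpWebResponse)request.GetResponse())
            {
                return response.StatusCode == HttpStatusCode.OK;
            }
        }
        catch (Exception)
        {
            // Refused connection, timeout, proxy error, bad arguments: all count as unusable.
            return false;
        }
    }
}
```

Free proxies die quickly, so a short timeout keeps one slow proxy from stalling GetProxy; the right value depends on how much latency the crawler can tolerate.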
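We also need a crawler service to fetch the free proxy IP data we want:
using System;
using System.Collections.Generic;
using System.Net;
using System.Threading;
using System.Threading.Tasks;
using HtmlAgilityPack;

/// <summary>
/// IP pool crawling spider
/// TODO: proxy sites change quickly; keep an eye on the log monitoring
/// </summary>
public class IpPoolSpider
{
    public void Initial()
    {
        ThreadPool.QueueUserWorkItem(Downloadproxy360);
        ThreadPool.QueueUserWorkItem(DownloadproxyBiGe);
        ThreadPool.QueueUserWorkItem(Downloadproxy66);
        ThreadPool.QueueUserWorkItem(Downloadxicidaili);
        // Note: Downkuaidaili is defined in this class but is not queued in the original listing.
    }

    // Download the Xici proxy HTML pages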
    public void Downloadxicidaili(object DATA)
    {
        try
        {
            List<string> list = new List<string>()
            {
                "http://www.xicidaili.com/nt/",
                "http://www.xicidaili.com/nn/",
                "http://www.xicidaili.com/wn/",
                "http://www.xicidaili.com/wt/"
            };
            foreach (var utlitem in list)
            {
                // NOTE: the page bounds were lost from the original listing; 1 and 5 are placeholder assumptions.
                for (int i = 1; i < 5; i++)
                {
                    string url = utlitem + i.ToString();
                    // This site is itself crawled through a proxy taken from the pool.
                    var ipProxy = PoolManageService.GetProxy();
                    if (string.IsNullOrEmpty(ipProxy))
                    {
                        LogUtils.ErrorLog(new Exception("The IP proxy pool has no usable proxy IP at the moment"));
                        return;
                    }
                    var ip = ipProxy;
                    WebProxy webproxy;
                    if (ipProxy.Contains(":"))
                    {
                        string[] parts = ipProxy.Split(new[] { ':' });
                        ip = parts[0];
                        var port = int.Parse(parts[1]);
                        webproxy = new WebProxy(ip, port);
                    }
                    else
                    {
                        webproxy = new WebProxy(ip);
                    }
                    string html = HttpHelper.DownloadHtml(url, webproxy);
                    if (string.IsNullOrEmpty(html))
                    {
                        LogUtils.ErrorLog(new Exception("Proxy site URL: " + url + " could not be reached"));
                        continue;
                    }

                    HtmlDocument doc = new HtmlDocument();
                    doc.LoadHtml(html);
                    HtmlNode node = doc.DocumentNode;
                    string xpathstring = "//tr[@class='odd']";
                    HtmlNodeCollection collection = node.SelectNodes(xpathstring);
                    if (collection == null)
                    {
                        continue; // SelectNodes returns null when nothing matches
                    }
                    foreach (var item in collection)
                    {
                        var proxy = new IpProxy();
                        string xpath = "td[2]";
                        proxy.Address = item.SelectSingleNode(xpath).InnerHtml;
                        xpath = "td[3]";
                        proxy.Port = int.Parse(item.SelectSingleNode(xpath).InnerHtml);
                        Task.Run(() =>
                        {
                            PoolManageService.Add(proxy);
                        });
                    }
                }
            }
        }
        catch (Exception e)
        {
            LogUtils.ErrorLog(new Exception("Failed to download the Xici proxy IP list", e));
        }
    }

    // Download Kuaidaili
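    public void Downkuaidaili(object DATA)
    {
        try
        {
            string url = "http://www.kuaidaili.com/proxylist/";
            // NOTE: the page bounds were lost from the original listing; 1 and 5 are placeholder assumptions.
            for (int i = 1; i < 5; i++)
            {
                string pageUrl = url + i.ToString() + "/";
                string html = HttpHelper.DownloadHtml(pageUrl, null);
                if (string.IsNullOrEmpty(html))
                {
                    LogUtils.ErrorLog(new Exception("Proxy site URL: " + pageUrl + " could not be reached"));
                    continue;
                }
                string xpath = "//tbody/tr";
                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml(html);
                HtmlNode node = doc.DocumentNode;
                HtmlNodeCollection collection = node.SelectNodes(xpath);
                if (collection == null)
                {
                    continue; // SelectNodes returns null when nothing matches
                }
                foreach (var item in collection)
                {
                    var proxy = new IpProxy();
                    proxy.Address = item.FirstChild.InnerHtml;
                    xpath = "td[2]";
                    proxy.Port = int.Parse(item.SelectSingleNode(xpath).InnerHtml);
                    Task.Run(() =>
                    {
                        PoolManageService.Add(proxy);
                    });
                }
            }
        }
        catch (Exception e)
        {
            LogUtils.ErrorLog(new Exception("Failed to download the Kuaidaili IP list", e));
        }
    }

    // Download proxy360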
    public void Downloadproxy360(object DATA)
    {
        try
        {
            string url = "http://www.proxy360.cn/default.aspx";
            string html = HttpHelper.DownloadHtml(url, null);
            if (string.IsNullOrEmpty(html))
            {
                LogUtils.ErrorLog(new Exception("Proxy site URL: " + url + " could not be reached"));
                return;
            }
            HtmlDocument doc = new HtmlDocument();
            doc.LoadHtml(html);
            string xpathstring = "//div[@class='proxylistitem']";
            HtmlNode node = doc.DocumentNode;
            HtmlNodeCollection collection = node.SelectNodes(xpathstring);
            if (collection == null)
            {
                return; // SelectNodes returns null when nothing matches
            }

            foreach (var item in collection)
            {
                var proxy = new IpProxy();
                // NOTE: the child index was lost from the original listing; 1 is a placeholder assumption.
                var childnode = item.ChildNodes[1];
                xpathstring = "span[1]";
                proxy.Address = childnode.SelectSingleNode(xpathstring).InnerHtml.Trim();
                xpathstring = "span[2]";
                proxy.Port = int.Parse(childnode.SelectSingleNode(xpathstring).InnerHtml);
                Task.Run(() =>
                {
                    PoolManageService.Add(proxy);
                });
            }
        }
        catch (Exception e)
        {
            LogUtils.ErrorLog(new Exception("Failed to download the proxy360 IP list", e));
        }
    }

    // Download Bige Proxy
    public void DownloadproxyBiGe(object DATA)
    {
        try
        {
            List<string> list = new List<string>()
            {
                "http://www.bigdaili.com/dailiip/1/{0}.html",
                "http://www.bigdaili.com/dailiip/2/{0}.html",
                "http://www.bigdaili.com/dailiip/3/{0}.html",
                "http://www.bigdaili.com/dailiip/4/{0}.html"
            };
            foreach (var utlitem in list)
            {
                // NOTE: the page bounds were lost from the original listing; 1 and 5 are placeholder assumptions.
                for (int i = 1; i < 5; i++)
                {
                    string url = String.Format(utlitem, i);
                    string html = HttpHelper.DownloadHtml(url, null);
                    if (string.IsNullOrEmpty(html))
                    {
                        LogUtils.ErrorLog(new Exception("Proxy site URL: " + url + " could not be reached"));
                        continue;
                    }

                    HtmlDocument doc = new HtmlDocument();
                    doc.LoadHtml(html);
                    HtmlNode node = doc.DocumentNode;
                    string xpathstring = "//tbody/tr";
                    HtmlNodeCollection collection = node.SelectNodes(xpathstring);
                    if (collection == null)
                    {
                        continue; // SelectNodes returns null when nothing matches
                    }
                    foreach (var item in collection)
                    {
                        var proxy = new IpProxy();
                        var xpath = "td[1]";
                        proxy.Address = item.SelectSingleNode(xpath).InnerHtml;
                        xpath = "td[2]";
                        proxy.Port = int.Parse(item.SelectSingleNode(xpath).InnerHtml);
                        Task.Run(() =>
                        {
                            PoolManageService.Add(proxy);
                        });
                    }
                }
            }
        }
        catch (Exception e)
        {
            LogUtils.ErrorLog(new Exception("Failed to download the Bige Proxy IP list", e));
        }
    }

    // Download 66 Free Proxy
    public void Downloadproxy66(object DATA)
    {
        try
        {
            List<string> list = new List<string>()
            {
                "http://www.66ip.cn/areaindex_35/index.html",
                "http://www.66ip.cn/areaindex_35/2.html",
                "http://www.66ip.cn/areaindex_35/3.html"
            };
            foreach (var utlitem in list)
            {
                string url = utlitem;
                string html = HttpHelper.DownloadHtml(url, null);
                if (string.IsNullOrEmpty(html))
                {
                    LogUtils.ErrorLog(new Exception("Proxy site URL: " + url + " could not be reached"));
                    break;
                }

                HtmlDocument doc = new HtmlDocument();
                doc.LoadHtml(html);
                HtmlNode node = doc.DocumentNode;
                string xpathstring = "//table[@bordercolor='#6699ff']/tr";
                HtmlNodeCollection collection = node.SelectNodes(xpathstring);
                if (collection == null)
                {
                    continue; // SelectNodes returns null when nothing matches
                }
                foreach (var item in collection)
                {
                    var proxy = new IpProxy();
                    var xpath = "td[1]";
                    proxy.Address = item.SelectSingleNode(xpath).InnerHtml;
                    // Skip the header row, whose first cell contains "ip".
                    if (proxy.Address.Contains("ip"))
                    {
                        continue;
                    }
                    xpath = "td[2]";
                    proxy.Port = int.Parse(item.SelectSingleNode(xpath).InnerHtml);
                    Task.Run(() =>
                    {
                        PoolManageService.Add(proxy);
                    });
                }
            }
        }
        catch (Exception e)
        {
            LogUtils.ErrorLog(new Exception("Failed to download the 66 Free Proxy IP list", e));
        }
    }
}
There is nothing remarkable in this code, so I won't walk through it in detail.
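All five download methods share one shape: select rows, read an address cell and a port cell, and hand the pair to the pool. One possible way to factor that out is sketched below. ProxyRowReader is a hypothetical helper, not part of the project; it is written against plain delegates, so it does not depend on the HTML parser, and it skips header rows and malformed cells instead of letting int.Parse throw.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

public static class ProxyRowReader
{
    // rows: whatever the HTML parser returned (may be null when nothing matched).
    // getAddress / getPort: pull the raw cell text out of one row.
    public static List<KeyValuePair<string, int>> Extract<TRow>(
        IEnumerable<TRow> rows, Func<TRow, string> getAddress, Func<TRow, string> getPort)
    {
        var found = new List<KeyValuePair<string, int>>();
        if (rows == null) return found;

        foreach (TRow row in rows)
        {
            string address = (getAddress(row) ?? string.Empty).Trim();
            string portText = (getPort(row) ?? string.Empty).Trim();

            IPAddress ignored;
            int port;
            // Header rows and malformed cells are skipped rather than aborting the page.
            if (!IPAddress.TryParse(address, out ignored)) continue;
            if (!int.TryParse(portText, out port) || port < 1 || port > 65535) continue;

            found.Add(new KeyValuePair<string, int>(address, port));
        }
        return found;
    }
}
```

With HtmlAgilityPack the two delegates would be lambdas such as `row => row.SelectSingleNode("td[1]").InnerText`, so each site only has to supply its URL list and cell positions.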
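As mentioned earlier, my crawler services are all deployed as Windows services. I used to rely on a Timer to loop at a fixed interval; this time I brought in the Quartz.NET job scheduling framework, which makes the code look a little cleaner.
Quartz.NET can be installed directly from NuGet.
First, write the overall scheduling job class for the proxy pool, ProxyPoolTotalJob, which implements the IJob interface: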
/// <summary>
/// Overall scheduling job for the proxy pool
/// </summary>
class ProxyPoolTotalJob : IJob
{
    public void Execute(IJobExecutionContext context)
    {
        var spider = new IpPoolSpider();
        spider.Initial();
    }
}
Next is the Run() method, which is called from OnStart:
private static void Run()
{
    try
    {
        StdSchedulerFactory factory = new StdSchedulerFactory();
        IScheduler scheduler = factory.GetScheduler();
        scheduler.Start();
        IJobDetail job = JobBuilder.Create<ProxyPoolTotalJob>().WithIdentity("job1", "group1").Build();
        ITrigger trigger = TriggerBuilder.Create()
            .WithIdentity("trigger1", "group1")
            .StartNow()
            .WithSimpleSchedule(
                x => x
                    .WithIntervalInMinutes(28) // once every 28 minutes
                    .RepeatForever()
            ).Build();
        scheduler.ScheduleJob(job, trigger);
    }
    catch (SchedulerException se)
    {
        Console.WriteLine(se);
    }
}
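For comparison, the Timer-based loop mentioned earlier might look roughly like this. It is a sketch using System.Threading.Timer, not my earlier production code; FixedIntervalRunner is a hypothetical name.

```csharp
using System;
using System.Threading;

// Runs an action immediately and then again at a fixed interval.
public sealed class FixedIntervalRunner : IDisposable
{
    private readonly Timer timer;

    public FixedIntervalRunner(Action work, TimeSpan interval)
    {
        if (work == null) throw new ArgumentNullException("work");
        // dueTime = zero starts the first run at once; period = interval repeats indefinitely.
        timer = new Timer(_ =>
        {
            try
            {
                work();
            }
            catch (Exception e)
            {
                // Swallow and report, so that one failed run does not stop later runs.
                Console.Error.WriteLine(e);
            }
        }, null, TimeSpan.Zero, interval);
    }

    public void Dispose()
    {
        timer.Dispose();
    }
}
```

The equivalent of the schedule above would be `new FixedIntervalRunner(() => new IpPoolSpider().Initial(), TimeSpan.FromMinutes(28))`. Note that this timer does not wait for the previous run to finish, so runs can overlap if the work takes longer than the interval.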
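Finally, when fetching HTML pages from sites with anti-crawling measures, use a proxy IP. I trust everyone knows how: just set the Proxy property of the webRequest.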
webRequest.Proxy = new WebProxy(ip, port);
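With that, we have a free proxy IP pool built on Redis. Our crawler service, whose IP kept getting banned, is back at full strength and off collecting new data again.
This is an original article, and all the code is pasted from my own project. Please credit the source when reposting.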