Inflating article view counts with a C# crawler and proxies

Date: 2021-09-10 23:16:37

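1. How do you maintain a proxy IP pool?

To use proxy IPs you first need a pool of them that is both large enough and reliable enough. While you are just learning and experimenting, the only option is to scrape them from free proxy sites. Without a decent number of proxies, inflating an article's view count is very slow, so the first job is to keep your own proxy IP pool in good shape.

Of the sites I have used, Xici (西刺代理) and 66ip are the more reliable ones. Xici seems to have anti-crawling measures: I was blocked once, though I can't tell whether that was a problem with the site or an anti-crawling policy. These two sites yield roughly 2 or 3 usable proxies per minute, which already counts as a decent rate. data5u, Kuaidaili (快代理) and ip3366 rarely update their web listings, and their proxies are less likely to work. Kuaidaili also requires a User-Agent header when you fetch its pages, and I found that with the header set, the ports in the returned list differ from the ones shown on the web page. Amusing, right? That is how free services are; otherwise they would charge for them. Paid proxies are not perfectly stable either, but they are certainly much better than free ones.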

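Maintaining proxy quality

Proxies scraped from the web pages must be verified before they go into the pool. The simplest check is to send a request through the proxy and see whether the status code is 200. The free sources I recommend are still the two above, Xici and 66ip; compared with the other free sources, they offer more proxies and more of them actually work.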
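The check described above could be sketched roughly as follows. This is only an illustration under my own assumptions: the `ProxyChecker` class, its method names and the test URL passed in by the caller are hypothetical and do not come from the original article, which uses `HttpWebRequest` in its own code further down.

```csharp
using System;
using System.Net;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper: validates a scraped "host:port" entry before it is stored.
public static class ProxyChecker
{
    // Splits "host:port" into its parts; returns false for malformed entries.
    public static bool TryParse(string entry, out string host, out int port)
    {
        host = null;
        port = 0;
        if (string.IsNullOrWhiteSpace(entry)) return false;
        int index = entry.LastIndexOf(':');
        if (index <= 0 || index == entry.Length - 1) return false;
        host = entry.Substring(0, index).Trim();
        return host.Length > 0
            && int.TryParse(entry.Substring(index + 1), out port)
            && port > 0 && port <= 65535;
    }

    // Returns true only if a GET sent through the proxy answers 200 within the timeout.
    public static async Task<bool> IsAliveAsync(string entry, string testUrl, TimeSpan timeout)
    {
        if (!TryParse(entry, out string host, out int port)) return false;
        var handler = new HttpClientHandler { Proxy = new WebProxy(host, port), UseProxy = true };
        using (var client = new HttpClient(handler) { Timeout = timeout })
        {
            try
            {
                using (var response = await client.GetAsync(testUrl))
                {
                    return response.StatusCode == HttpStatusCode.OK;
                }
            }
            catch (HttpRequestException) { return false; }  // refused connection, DNS failure, ...
            catch (TaskCanceledException) { return false; } // timed out
        }
    }
}
```

A caller would await something like `ProxyChecker.IsAliveAsync("1.2.3.4:8080", "https://example.com/", TimeSpan.FromSeconds(10))` and store the entry only when it returns true. The address and URL here are placeholders.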

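How to store proxies

I store the valid proxies in Redis. A set is the best data structure for this, because it does not allow the same IP to be stored twice. There is no way to know how long a proxy stays valid: some last tens of seconds, others tens of minutes. While using them, keep track of the IPs that fail repeatedly, and remove an IP from the set once its failures reach a threshold. Since the lifetime is unknown, proxies should be used promptly; a timer can fetch them from Redis at regular intervals.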
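The failure bookkeeping can be sketched in memory, independent of Redis. The `FailureTracker` class below is a hypothetical illustration, not the article's code; in particular, clearing the count on success is my own assumption about what "fails repeatedly" should mean.

```csharp
using System.Collections.Concurrent;

// Hypothetical helper: counts failures per proxy and reports when one should be evicted.
public sealed class FailureTracker
{
    private readonly ConcurrentDictionary<string, int> _failures =
        new ConcurrentDictionary<string, int>();
    private readonly int _maxFailures;

    public FailureTracker(int maxFailures)
    {
        _maxFailures = maxFailures;
    }

    // Records one failure. Returns true when the proxy has reached the limit,
    // which is the caller's cue to remove it from the Redis set.
    public bool RecordFailure(string proxy)
    {
        // AddOrUpdate returns the new count for this key.
        int count = _failures.AddOrUpdate(proxy, 1, (key, old) => old + 1);
        if (count < _maxFailures) return false;
        _failures.TryRemove(proxy, out _);
        return true;
    }

    // Assumption: a success clears the history, so only repeated failures evict a proxy.
    public void RecordSuccess(string proxy)
    {
        _failures.TryRemove(proxy, out _);
    }
}
```

With a limit of 3, the third call to `RecordFailure` for the same proxy returns true, and the caller would then delete that entry from the set.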

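2. Common anti-crawling mechanisms

The principle behind anti-crawling is deciding whether a request comes from a real user. For important data, sites combine several mechanisms to make crawling costly or even impossible: header field checks, IP limits, cookies and so on.

IP limits

To deter crawlers, some sites limit the access frequency per IP. There are two aspects to this. One is speed, which you can handle by calling Thread.Sleep to pause for a while between requests. The other is the number of requests per IP, which you can handle by sending requests through the scraped free proxies.

Header checks

User-Agent: the user agent string. This one is simple: collect some common browser User-Agent strings and pick one at random for each request.

Referer: the URL from which the target link was reached. Sites use it for image hotlink protection, but the Referer can of course be forged as well.

Cookie: after login or other user actions, the server returns cookie data, and a request without cookies is easily identified as forged. You can set the cookies locally with JS based on what the server returns, but in practice it is not that simple, because encryption and decryption are usually involved. This is one of the hard parts of crawling.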

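3. Using proxy IPs to inflate the view count of a CSDN article

CSDN view counts are fairly easy to inflate, provided you have enough proxies; with too few, it is very slow. The previous article, "c#批量抓取免费代理并验证有效性" (batch-scraping free proxies in C# and verifying them), already scraped proxies from several free sites, so I won't go over that again and will simply reuse its results here.

(1) I send the requests in batches from multiple threads, which is more efficient; the proxies are divided evenly among the threads.
(2) A timer periodically fetches the proxies from Redis.
(3) A ConcurrentDictionary from the System.Collections.Concurrent namespace counts the failures per proxy; once a proxy reaches a certain number of failures, it is deleted from the pool.

The code only implements the main functionality. Its weak point is that there are too few proxies, so it is not very efficient.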
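The even split in step (1) can be sketched as a small standalone helper. `ProxyBatcher` and `Split` are hypothetical names; the arithmetic mirrors what the main code does (roughly `batchSize` items per worker, with the last worker taking the remainder), but this is an illustration rather than the author's exact implementation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: divides a list into one batch per worker.
public static class ProxyBatcher
{
    public static List<List<T>> Split<T>(IReadOnlyList<T> items, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));
        var batches = new List<List<T>>();
        if (items.Count == 0) return batches;

        // One worker per full batch, but always at least one worker.
        int workerCount = Math.Max(1, items.Count / batchSize);
        int perWorker = items.Count / workerCount;
        for (int i = 0; i < workerCount; i++)
        {
            int start = i * perWorker;
            // The last worker also takes whatever is left after the even split.
            int length = (i == workerCount - 1) ? items.Count - start : perWorker;
            var batch = new List<T>(length);
            for (int j = start; j < start + length; j++) batch.Add(items[j]);
            batches.Add(batch);
        }
        return batches;
    }
}
```

For 25 proxies and a batch size of 10, this produces two batches of 12 and 13 entries, so every proxy is assigned to exactly one worker.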

The result looks like this:

[Screenshot: the C# crawler inflating article view counts through proxies]

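Last night I picked an article to try it on (any article with a view counter, on CSDN or elsewhere, will do). It took quite a while to inflate, mainly because there were too few proxies.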


The main code is as follows:

using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Net;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    // 1 while a batch is in flight, 0 otherwise; prevents overlapping timer ticks.
    static int batchRunning;
    // Failure count per proxy; the key is the "ip:port" string.
    static readonly ConcurrentDictionary<string, int> failStatis = new ConcurrentDictionary<string, int>();
    static readonly string refreshLink = "https://blog.csdn.net/baijifeilong/article/details/80734388";
    static string requestSuccessKey, requestFailKey;
    // Random is not thread-safe, so one shared instance is used behind a lock.
    static readonly Random random = new Random();
    // A proxy is removed from Redis once it has failed more than this many times.
    const int MaxFailures = 6;

    static void Main(string[] args)
    {
        ThreadPool.SetMinThreads(500, 100);
        requestSuccessKey = "list_request_success" + DateTime.Now.ToString("HHmm");
        requestFailKey = "list_request_fail" + DateTime.Now.ToString("HHmm");
        // Every 30 seconds, fetch the current proxies from Redis and start a batch.
        // The using block keeps the timer alive until Enter is pressed.
        using (var timer = new Timer(async state => await RunBatch(), "processing timer event", 0, 1000 * 30))
        {
            Console.ReadLine();
        }
    }

    static async Task RunBatch()
    {
        // Skip this tick if the previous batch has not finished yet.
        if (Interlocked.CompareExchange(ref batchRunning, 1, 0) != 0) return;
        try
        {
            // Fetch the proxies.
            List<string> proxyIps = RedisHelper.GetProxy();
            // About 10 proxies per worker; the last worker also takes the remainder.
            int workerCount = Math.Max(1, proxyIps.Count / 10);
            int requestCount = proxyIps.Count / workerCount;
            var workers = new List<Task>();
            for (int i = 0; i < workerCount; i++)
            {
                int start = i * requestCount;
                int length = (i == workerCount - 1) ? proxyIps.Count - start : requestCount;
                List<string> tempList = proxyIps.GetRange(start, length);
                workers.Add(Task.Run(() => Finish(tempList)));
            }
            // The batch counts as finished only when every worker is done.
            await Task.WhenAll(workers);
        }
        catch (Exception ex)
        {
            Console.WriteLine("Batch failed: " + ex.Message);
        }
        finally
        {
            Interlocked.Exchange(ref batchRunning, 0);
        }
    }

    public static async Task Finish(List<string> proxyIps)
    {
        foreach (string ip in proxyIps)
        {
            int index = ip.IndexOf(":");
            string ipAddress = ip.Substring(0, index);
            int ipPort = int.Parse(ip.Substring(index + 1));
            // Random pause of 1 to 3 seconds without blocking a thread.
            await Task.Delay(NextInt(1, 4) * 1000);
            await Get(ipAddress, ipPort, 10000, RandomUserAgent(), refreshLink, () =>
            {
                RedisHelper.AddRequestOk(requestSuccessKey, ip + " " + DateTime.Now.ToShortTimeString(), true);
                Console.ForegroundColor = ConsoleColor.White;
                Console.WriteLine(ip + " success");
            },
            error =>
            {
                RedisHelper.AddRequestOk(requestFailKey, ip + " " + DateTime.Now.ToShortTimeString(), false);
                // Atomically increment the failure count for this proxy.
                int fails = failStatis.AddOrUpdate(ip, 1, (key, oldValue) => oldValue + 1);
                Console.ForegroundColor = ConsoleColor.Red;
                Console.WriteLine(ipAddress + " " + error + " failed " + fails + " time(s)");
                if (fails > MaxFailures)
                {
                    RedisHelper.RemoveSetValue(ip);
                    failStatis.TryRemove(ip, out _);
                }
            });
        }
    }

    static int NextInt(int min, int max)
    {
        lock (random)
        {
            return random.Next(min, max);
        }
    }

    static readonly string[] UserAgents = new string[] {
        "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "JUC (Linux; U; 2.3.7; zh-cn; MB200; 320*480) UCWEB7.9.3.103/139/999",
        "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0a1) Gecko/20110623 Firefox/7.0a1 Fennec/7.0a1",
        "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
        "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
        "Mozilla/5.0 (iPhone; U; CPU iPhone OS 3_0 like Mac OS X; en-us) AppleWebKit/420.1 (KHTML, like Gecko) Version/3.0 Mobile/1A542a Safari/419.3",
        "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_0 like Mac OS X; en-us) AppleWebKit/532.9 (KHTML, like Gecko) Version/4.0.5 Mobile/8A293 Safari/6531.22.7",
        "Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10",
        "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
        "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
        "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
        "Mozilla/5.0 (X11; U; Linux x86_64; zh-CN; rv:1.9.2.10) Gecko/20100922 Ubuntu/10.10 (maverick) Firefox/3.6.10",
        "Mozilla/5.0 (Windows NT 5.1; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.50",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.57.2 (KHTML, like Gecko) Version/5.1.7 Safari/534.57.2",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/30.0.1599.101 Safari/537.36",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; LBBROWSER)",
        "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; WOW64; Trident/5.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C; .NET4.0E; QQBrowser/7.0.3698.400)",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.122 UBrowser/4.0.3214.0 Safari/537.36",
        "Mozilla/5.0 (Linux; U; Android 2.2.1; zh-cn; HTC_Wildfire_A3333 Build/FRG83D) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
        "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999",
        "Openwave/ UCWEB7.0.2.37/28/999",
        "NOKIA5700/ UCWEB7.0.2.37/28/999",
        "UCWEB7.0.2.37/28/999",
        "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    };

    // Picks one of the User-Agent strings above at random.
    private static string RandomUserAgent()
    {
        return UserAgents[NextInt(0, UserAgents.Length)];
    }

    public static async Task Get(string proxyIp, int proxyPort, int timeout, string userAgent, string url, Action success, Action<string> fail)
    {
        HttpWebRequest request = null;
        try
        {
            request = (HttpWebRequest)WebRequest.Create(url);
            // Note: on .NET Framework, Timeout only applies to the synchronous GetResponse().
            request.Timeout = timeout;
            request.UserAgent = userAgent;
            request.Proxy = new WebProxy(proxyIp, proxyPort);

            using (var response = (HttpWebResponse)await request.GetResponseAsync())
            {
                if (response.StatusCode == HttpStatusCode.OK)
                {
                    success();
                }
                else
                {
                    fail(response.StatusCode + ":" + response.StatusDescription);
                }
            }
        }
        catch (Exception ex)
        {
            fail(ex.Message);
        }
        finally
        {
            request?.Abort();
        }
    }
}

RedisHelper.cs

using System;
using System.Collections.Generic;
using StackExchange.Redis;

public class RedisHelper
{
    private const string DefaultConnectionString = "127.0.0.1:6379,defaultDatabase=3";
    // Name of the Redis set that holds the verified proxies.
    public const string RedisSetKeySuccess = "set_success_ip";

    // Lazy<T> creates the shared connection once, in a thread-safe way.
    private static readonly Lazy<ConnectionMultiplexer> Redis =
        new Lazy<ConnectionMultiplexer>(() => GetManager());

    private static ConnectionMultiplexer Manager
    {
        get { return Redis.Value; }
    }

    private static ConnectionMultiplexer GetManager(string connectionString = null)
    {
        if (string.IsNullOrEmpty(connectionString))
        {
            connectionString = DefaultConnectionString;
        }
        return ConnectionMultiplexer.Connect(connectionString);
    }

    // Appends a request record to the list named by key. Callers choose the
    // success or failure list through key; isSuccess does not change the behavior.
    public static void AddRequestOk(string key, string value, bool isSuccess)
    {
        var db = Manager.GetDatabase();
        db.ListLeftPush(key, value);
    }

    // Returns every proxy currently stored in the set.
    public static List<string> GetProxy()
    {
        List<string> result = new List<string>();
        var db = Manager.GetDatabase();
        var values = db.SetMembers(RedisSetKeySuccess);
        foreach (var value in values)
        {
            result.Add(value.ToString());
        }
        return result;
    }

    // Adds a proxy; returns false if it was already in the set.
    public static bool InsertSet(string value)
    {
        var db = Manager.GetDatabase();
        return db.SetAdd(RedisSetKeySuccess, value);
    }

    // Removes a proxy; returns false if it was not in the set.
    public static bool RemoveSetValue(string value)
    {
        var db = Manager.GetDatabase();
        return db.SetRemove(RedisSetKeySuccess, value);
    }
}

Original link: https://www.cnblogs.com/zhangmumu/p/9275383.html