Fetching and Parsing Web Pages with a C# Crawler

Notes on C# Web Crawlers

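I. Preface

After studying web crawling in C#, I was struck by how capable C# is and how much fun crawlers are, so I am writing down what I learned for later study and review. There are two programs here. The first is a simple page fetcher that downloads a page and parses it with regular expressions to extract the information we want. The second is an enhanced crawler that drives a headless browser.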

II. A C# Web Page Crawler

There are several key points:

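First: the source to crawl

Before crawling a site, we need to know what its pages contain and which information matters. From that we can work out which pages, and how many, need to be fetched. Pages can be traversed in two ways, depth-first or breadth-first, until every page we want has been fetched.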
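The difference between the two traversal orders can be sketched with a queue and a stack. This is a minimal illustration under stated assumptions: the `CrawlOrder` class and the in-memory `Links` table are hypothetical stand-ins for a real site, and no network access is involved.

```csharp
using System;
using System.Collections.Generic;

public static class CrawlOrder
{
    // Hypothetical in-memory "site": each page maps to the links found on it.
    public static readonly Dictionary<string, string[]> Links = new Dictionary<string, string[]>
    {
        { "/",   new[] { "/a", "/b" } },
        { "/a",  new[] { "/a1", "/a2" } },
        { "/b",  new[] { "/b1" } },
        { "/a1", new string[0] },
        { "/a2", new string[0] },
        { "/b1", new string[0] },
    };

    // Breadth-first: a FIFO queue visits pages level by level.
    public static List<string> BreadthFirst(string start)
    {
        var visited = new HashSet<string> { start };
        var order = new List<string>();
        var frontier = new Queue<string>();
        frontier.Enqueue(start);
        while (frontier.Count > 0)
        {
            var page = frontier.Dequeue();
            order.Add(page);
            foreach (var link in Links[page])
                if (visited.Add(link)) frontier.Enqueue(link);
        }
        return order;
    }

    // Depth-first: a LIFO stack follows one branch to its end before backtracking.
    public static List<string> DepthFirst(string start)
    {
        var visited = new HashSet<string>();
        var order = new List<string>();
        var frontier = new Stack<string>();
        frontier.Push(start);
        while (frontier.Count > 0)
        {
            var page = frontier.Pop();
            if (!visited.Add(page)) continue;   // already crawled
            order.Add(page);
            var links = Links[page];
            // Push in reverse so the first link on the page is explored first.
            for (int i = links.Length - 1; i >= 0; i--)
                if (!visited.Contains(links[i])) frontier.Push(links[i]);
        }
        return order;
    }
}
```

Both methods track visited pages so that a page linked from several places is fetched only once; only the container that holds the frontier differs.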

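Second: processing the source

What we fetch is the whole page, and much of it is information we do not need. There are two ways to extract the useful part: one is regular expressions, the other is to treat the page as a DOM (Document Object Model) tree and query it with XPath expressions.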
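The two approaches can be compared side by side on a small sample. This is only a sketch: the `LinkExtraction` class and the sample markup are made up for illustration, and `XmlDocument` is used here because the sample is well-formed XML. Real-world HTML is often not well-formed, so an HTML-aware parser would usually be needed for the XPath route.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using System.Xml;

public static class LinkExtraction
{
    // Made-up, well-formed sample markup used only for this illustration.
    public const string Sample =
        "<div id=\"list\"><a href=\"/hotel/1\">One</a><a href=\"/about\">About</a><a href=\"/hotel/2\">Two</a></div>";

    // Regular expression: match anchors whose href starts with /hotel/.
    public static List<string> WithRegex(string html)
    {
        var result = new List<string>();
        var pattern = "<a[^>]+href=\"(?<href>/hotel/[^\"]+)\"[^>]*>(?<text>.*?)</a>";
        foreach (Match m in Regex.Matches(html, pattern))
            result.Add(m.Groups["href"].Value + "=" + m.Groups["text"].Value);
        return result;
    }

    // XPath: load the markup as a DOM tree and select the same anchors.
    public static List<string> WithXPath(string xhtml)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xhtml);
        var result = new List<string>();
        foreach (XmlNode a in doc.SelectNodes("//div[@id='list']/a[starts-with(@href,'/hotel/')]"))
            result.Add(a.Attributes["href"].Value + "=" + a.InnerText);
        return result;
    }
}
```

The regular expression works on raw text and needs no parsing step, while the XPath query expresses the structure ("anchors directly under the list container") and tends to be easier to keep correct when the markup changes.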

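Third: threads are needed during processing

More precisely, asynchronous tasks, so we need to learn how the async and await pair works.

a) The await operator can only be used inside a method marked async;
b) await is applied to an awaitable object, most commonly a Task or Task<T>;
c) When method A calls method B, and B awaits method C, B is suspended at the await until C's asynchronous operation completes and only then continues. If A does not await B, A keeps running without waiting for B to finish.

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class Program
    {
        static void Main(string[] args)
        {
            Test();                          // returns at its first await, so Main continues
            Console.WriteLine("Test End!");
            Console.ReadLine();
        }

        // async void is used here only so that Main cannot wait on it;
        // prefer async Task in real code.
        static async void Test()
        {
            await Test1();
            Console.WriteLine("Test1 End!");
        }

        static Task Test1()
        {
            Thread.Sleep(1000);              // runs synchronously on the caller's thread
            Console.WriteLine("create task in test1");
            return Task.Run(() =>
            {
                Thread.Sleep(3000);
                Console.WriteLine("Test1");
            });
        }
    }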

This behaves roughly like the following code:

    using System;
    using System.Threading;
    using System.Threading.Tasks;

    class Program
    {
        static void Main(string[] args)
        {
            Test();
            Console.WriteLine("Test End!");
            Console.ReadLine();
        }

        static void Test()
        {
            var test1 = Test1();
            // The rest of the method becomes a continuation that runs once test1 has finished.
            Task.Run(() =>
            {
                test1.Wait();
                Console.WriteLine("Test1 End!");
            });
        }

        static Task Test1()
        {
            Thread.Sleep(1000);
            Console.WriteLine("create task in test1");
            return Task.Run(() =>
            {
                Thread.Sleep(3000);
                Console.WriteLine("Test1");
            });
        }
    }

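Fourth: lambda expressions appear everywhere in C#, so we need a solid understanding of them.

For example:

    cityCrawler.OnStart += (s, e) =>
    {
        Console.WriteLine("Crawler started fetching: " + e.Uri.ToString());
    };

Only with a solid grasp of lambda expressions can we read and use code like this comfortably.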

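Fifth:

Our crawler impersonates a browser when it requests pages. If it sends a large number of requests, the server will notice, so we route the requests through proxies to work around this.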
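One common way to spread requests over several proxies is simple round-robin rotation. The sketch below is an assumption about how that could look, not something the original program does: `ProxyRotator` is a hypothetical helper, and the addresses passed to it are placeholders.

```csharp
using System;
using System.Net;
using System.Threading;

// Hypothetical helper: hands out proxies from a fixed list in round-robin order.
public sealed class ProxyRotator
{
    private readonly string[] _proxies;
    private int _next = -1;

    public ProxyRotator(params string[] proxies)
    {
        if (proxies == null || proxies.Length == 0)
            throw new ArgumentException("At least one proxy address is required.", "proxies");
        _proxies = proxies;
    }

    // Thread-safe: Interlocked.Increment lets parallel crawler tasks share one rotator.
    public string NextAddress()
    {
        int n = Interlocked.Increment(ref _next);
        // Casting to uint keeps the index valid even if the counter wraps around.
        return _proxies[(int)((uint)n % (uint)_proxies.Length)];
    }

    // Wraps the next address in a WebProxy, ready to assign to a request's Proxy property.
    public WebProxy NextProxy()
    {
        return new WebProxy(NextAddress());
    }
}
```

A crawler could call `NextProxy()` once per request and assign the result to `request.Proxy`, so consecutive requests leave through different addresses.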

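Sixth:

Understanding EventHandler. A handler for EventHandler<TEventArgs> takes two parameters: the sender (the object that raised the event) and an event-args object (an instance of a class we define ourselves, whose type is supplied as the delegate's generic type argument).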
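The two parameters can be seen in a small self-contained sketch. `PageEventArgs`, `DemoCrawler`, and `EventDemo` are hypothetical names chosen for this illustration; they only mirror the pattern and are not the classes of the crawler described here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical event-args class carrying the data that handlers should receive.
public class PageEventArgs : EventArgs
{
    public Uri Uri { get; private set; }
    public PageEventArgs(Uri uri) { Uri = uri; }
}

public class DemoCrawler
{
    // EventHandler<PageEventArgs> means: void Handler(object sender, PageEventArgs e)
    public event EventHandler<PageEventArgs> OnStart;

    public void Start(Uri uri)
    {
        var handler = OnStart;   // copy first to avoid a race with unsubscription
        if (handler != null) handler(this, new PageEventArgs(uri));
    }
}

public static class EventDemo
{
    public static List<string> Run()
    {
        var log = new List<string>();
        var crawler = new DemoCrawler();
        // In the lambda, s is the sender and e is the event-args object.
        crawler.OnStart += (s, e) => log.Add(object.ReferenceEquals(s, crawler) + " " + e.Uri);
        crawler.Start(new Uri("http://example.com/"));
        return log;
    }
}
```

Calling `EventDemo.Run()` records one entry showing that the sender is the crawler instance itself and that the URL travelled inside the event-args object.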

Seventh: concurrent processing.

    Parallel.For(0, 2, (i) =>
    {
        var hotel = hotelList[i];
        hotelCrawler.Start(hotel.Uri);   // the returned Task is not awaited here
    });

The For method is declared as follows:

    // Summary:
    //     Executes a for (For in Visual Basic) loop in which iterations may run in parallel.
    //
    // Parameters:
    //   fromInclusive:
    //     The start index, inclusive.
    //
    //   toExclusive:
    //     The end index, exclusive.
    //
    //   body:
    //     The delegate that is invoked once per iteration.
    //
    // Returns:
    //     A structure that contains information about which portion of the loop completed.
    public static ParallelLoopResult For(int fromInclusive, int toExclusive, Action<int> body);
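A minimal sketch of using Parallel.For safely when each iteration produces a result. `ParallelDemo` and `FakeFetch` are hypothetical; `FakeFetch` merely stands in for a real page download so that the example needs no network.

```csharp
using System;
using System.Threading.Tasks;

public static class ParallelDemo
{
    // Stand-in for fetching one page; a real crawler would issue an HTTP request here.
    private static int FakeFetch(string url)
    {
        return url.Length;
    }

    public static int[] FetchAll(string[] urls)
    {
        var results = new int[urls.Length];
        // Iterations may run on different threads and in any order,
        // so each iteration writes only to its own slot of the array.
        ParallelLoopResult loop = Parallel.For(0, urls.Length, i =>
        {
            results[i] = FakeFetch(urls[i]);
        });
        if (!loop.IsCompleted) throw new InvalidOperationException("Loop stopped early.");
        return results;
    }
}
```

Parallel.For returns only after all iterations have finished, so the array is fully populated when `FetchAll` returns. Writing to a shared list from the loop body without synchronization would not be safe.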

Eighth: timing

    var watch = new Stopwatch();
    watch.Start();

    // ... code being timed ...

    watch.Stop();
    var milliseconds = watch.ElapsedMilliseconds;
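The same Stopwatch pattern can be wrapped in a small reusable helper. This is a sketch; `Timing` and `Measure` are illustrative names that do not appear in the original program.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Runs the action and returns the elapsed wall-clock time in milliseconds.
    public static long Measure(Action action)
    {
        var watch = Stopwatch.StartNew();   // creates and starts the stopwatch in one call
        action();
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }
}
```

For example, `Timing.Measure(() => DoWork())` returns how long the hypothetical `DoWork` took, without repeating the start/stop boilerplate at every call site.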

Ninth: impersonating a browser

    public async Task<string> Start(Uri uri, string proxy = null)
    {
        return await Task.Run(() =>
        {
            var pageSource = string.Empty;
            try
            {
                if (this.OnStart != null) this.OnStart(this, new OnStartEventArgs(uri));
                var watch = new Stopwatch();
                watch.Start();
                var request = (HttpWebRequest)WebRequest.Create(uri);
                request.Accept = "*/*";
                request.ServicePoint.Expect100Continue = false; // skip the 100-Continue handshake to speed up loading
                request.ServicePoint.UseNagleAlgorithm = false; // disable the Nagle algorithm to speed up loading
                request.AllowWriteStreamBuffering = false;      // disable write buffering to speed up loading
                request.Headers.Add(HttpRequestHeader.AcceptEncoding, "gzip,deflate"); // accept gzip/deflate-compressed pages
                request.ContentType = "application/x-www-form-urlencoded"; // content type and encoding
                request.AllowAutoRedirect = false;              // do not follow redirects automatically
                // Set the User-Agent to pose as Google Chrome
                request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36";
                request.Timeout = 5000;                         // request timeout: 5 seconds
                request.KeepAlive = true;                       // enable persistent connections
                request.Method = "GET";                         // use the GET method
                if (proxy != null) request.Proxy = new WebProxy(proxy); // route through a proxy server to mask the request origin
                request.CookieContainer = this.CookiesContainer; // attach the cookie container
                request.ServicePoint.ConnectionLimit = int.MaxValue; // maximum number of connections
                using (var response = (HttpWebResponse)request.GetResponse()) // get the response
                {
                    // Add the cookies to the container to preserve the login state
                    foreach (Cookie cookie in response.Cookies) this.CookiesContainer.Add(cookie);

                    // Pick a decompressing stream according to Content-Encoding, or read the raw stream
                    var contentEncoding = (response.ContentEncoding ?? string.Empty).ToLower();
                    Stream stream = response.GetResponseStream();
                    if (contentEncoding.Contains("gzip"))
                        stream = new GZipStream(stream, CompressionMode.Decompress);
                    else if (contentEncoding.Contains("deflate"))
                        stream = new DeflateStream(stream, CompressionMode.Decompress);

                    // Disposing the reader also disposes the underlying streams
                    using (StreamReader reader = new StreamReader(stream, Encoding.UTF8))
                    {
                        pageSource = reader.ReadToEnd();
                    }
                }
                request.Abort();
                watch.Stop();
                var threadId = System.Threading.Thread.CurrentThread.ManagedThreadId; // ID of the thread running this task
                var milliseconds = watch.ElapsedMilliseconds; // time taken by the request
                if (this.OnCompleted != null) this.OnCompleted(this, new OnCompletedEventArgs(uri, threadId, milliseconds, pageSource));
            }
            catch (Exception ex)
            {
                if (this.OnError != null) this.OnError(this, new OnErrorEventArgs(uri, ex));
            }
            return pageSource;
        });
    }

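This is the core of our crawler: we impersonate a browser, set the request parameters, send the request to the server, and then parse and display the result. The whole flow is straightforward.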
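The decompression step of the method above can be tried in isolation, without any network access. The following is a sketch under stated assumptions: `BodyDecoder`, `Decode`, and `Gzip` are hypothetical names, and `Gzip` exists only to produce sample input of the kind a server might send.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;

public static class BodyDecoder
{
    // Decodes a response body according to its Content-Encoding value.
    public static string Decode(byte[] body, string contentEncoding)
    {
        var encoding = (contentEncoding ?? string.Empty).ToLowerInvariant();
        Stream stream = new MemoryStream(body);
        if (encoding.Contains("gzip"))
            stream = new GZipStream(stream, CompressionMode.Decompress);
        else if (encoding.Contains("deflate"))
            stream = new DeflateStream(stream, CompressionMode.Decompress);
        // Disposing the reader also disposes the wrapped streams.
        using (var reader = new StreamReader(stream, Encoding.UTF8))
            return reader.ReadToEnd();
    }

    // Sample-data helper: gzip-compress a string.
    public static byte[] Gzip(string text)
    {
        using (var buffer = new MemoryStream())
        {
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress))
            {
                var bytes = Encoding.UTF8.GetBytes(text);
                gzip.Write(bytes, 0, bytes.Length);
            }
            return buffer.ToArray();   // ToArray still works after the stream is closed
        }
    }
}
```

Round-tripping a string through `Gzip` and `Decode(..., "gzip")` should give back the original text, while an empty or missing encoding leaves the bytes to be read as plain UTF-8.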

Tenth: keeping track of the Task's state.

    // Constructor of OnCompletedEventArgs
    public OnCompletedEventArgs(Uri uri, int threadId, long milliseconds, string pageSource)
    {
        this.Uri = uri;
        this.ThreadId = threadId;
        this.Milliseconds = milliseconds;
        this.PageSource = pageSource;
    }

    // Constructor of OnErrorEventArgs
    public OnErrorEventArgs(Uri uri, Exception exception)
    {
        this.Uri = uri;
        this.Exception = exception;
    }

    // Constructor of OnStartEventArgs
    public OnStartEventArgs(Uri uri)
    {
        this.Uri = uri;
    }

We have three states: started, completed, and failed. Each is exposed as an event backed by a delegate, which keeps the calling code abstract and convenient.

    public event EventHandler<OnStartEventArgs> OnStart;         // raised when the crawler starts
    public event EventHandler<OnCompletedEventArgs> OnCompleted; // raised when the crawler completes
    public event EventHandler<OnErrorEventArgs> OnError;         // raised when the crawler fails

    cityCrawler.OnStart += (s, e) =>
    {
        Console.WriteLine("Crawler started fetching: " + e.Uri.ToString());
    };

    cityCrawler.OnError += (s, e) =>
    {
        Console.WriteLine("Crawler error at: " + e.Uri.ToString() + ", exception message: " + e.Exception.Message);
    };

    cityCrawler.OnCompleted += (s, e) =>
    {
        // Use a regular expression to extract the data from the page source
        var links = Regex.Matches(e.PageSource, @"<a[^>]+href=""*(?<href>/hotel/[^>\s]+)""\s*[^>]*>(?<text>(?!.*img).*?)</a>", RegexOptions.IgnoreCase);
        foreach (Match match in links)
        {
            var city = new City
            {
                CityName = match.Groups["text"].Value,
                Uri = new Uri("http://hotels.ctrip.com" + match.Groups["href"].Value)
            };
            // Contains relies on City.Equals; unless City overrides it,
            // a newly created instance never matches an existing one.
            if (!cityList.Contains(city)) cityList.Add(city); // add the item to the generic list
            Console.WriteLine(city.CityName + "|" + city.Uri); // print the city name and URL
        }
        Console.WriteLine("===============================================");
        Console.WriteLine("Crawl finished! " + links.Count + " cities in total.");
        Console.WriteLine("Elapsed: " + e.Milliseconds + " ms");
        Console.WriteLine("Thread: " + e.ThreadId);
        Console.WriteLine("URL: " + e.Uri.ToString());
    };
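The handler above uses cityList.Contains to skip duplicates, which only works if City defines value equality. The City class is not shown in the text, so the version below is a hypothetical sketch of how equality by URL could be defined; `CityDedup` is likewise an illustrative helper.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical City type; the original class definition is not shown in the text.
public class City
{
    public string CityName { get; set; }
    public Uri Uri { get; set; }

    // Two cities are treated as the same when they point at the same URL.
    public override bool Equals(object obj)
    {
        var other = obj as City;
        return other != null && object.Equals(Uri, other.Uri);
    }

    public override int GetHashCode()
    {
        return Uri == null ? 0 : Uri.GetHashCode();
    }
}

public static class CityDedup
{
    // Keeps the first occurrence of each city and preserves the input order.
    public static List<City> Distinct(IEnumerable<City> cities)
    {
        var seen = new HashSet<City>();
        var result = new List<City>();
        foreach (var city in cities)
            if (seen.Add(city)) result.Add(city);   // Add returns false for duplicates
        return result;
    }
}
```

With Equals and GetHashCode overridden this way, both List<City>.Contains and HashSet<City> recognize two separately constructed City objects with the same URL as duplicates. The HashSet variant also avoids a linear scan of the list for every match.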

Eleventh: proxy servers and test servers.

    // Check whether the proxy IP is in effect: http://1212.ip138.com/ic.asp
    // Check the crawler's current User-Agent: http://www.whatismyuseragent.net

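III. An Enhanced Web Crawler

Building on the simple version, this time we do not hand-craft browser-like HTTP requests; instead we use dedicated tools to load and parse the pages for us.

First, we need four DLLs:

Next, as before, we start by defining an interface:

    public interface ICrawler
    {
        event EventHandler<OnStartEventArgs> OnStart;         // raised when the crawler starts
        event EventHandler<OnCompletedEventArgs> OnCompleted; // raised when the crawler completes
        event EventHandler<OnErrorEventArgs> OnError;         // raised when the crawler fails
        Task Start(Uri uri, Script script, Operation operation); // start the crawler
    }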

Then we need two tools, PhantomJS and Selenium. The former is a headless browser built on WebKit that loads and renders the page; the latter is a browser automation tool, originally built for automated testing, that makes the server see what looks like a real person operating the page.

    private PhantomJSOptions _options;        // PhantomJS options
    private PhantomJSDriverService _service;  // Selenium driver service configuration

    public StrongCrawler(string proxy = null)
    {
        this._options = new PhantomJSOptions(); // options object for PhantomJS
        // Initialize the Selenium service, passing the directory that contains phantomjs.exe
        this._service = PhantomJSDriverService.CreateDefaultService(Environment.CurrentDirectory);
        _service.IgnoreSslErrors = true;         // ignore certificate errors
        _service.WebSecurity = false;            // disable web security
        _service.HideCommandPromptWindow = true; // hide the console window
        _service.LoadImages = false;             // do not load images
        _service.LocalToRemoteUrlAccess = true;  // allow local resources to access remote URLs
        _options.AddAdditionalCapability(@"phantomjs.page.settings.userAgent", "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36");
        if (proxy != null)
        {
            _service.ProxyType = "HTTP";         // use an HTTP proxy
            _service.Proxy = proxy;              // proxy IP and port
        }
        else
        {
            _service.ProxyType = "none";         // no proxy
        }
    }

Next comes the main part, the asynchronous Task operation:

    public async Task Start(Uri uri, Script script, Operation operation)
    {
        await Task.Run(() =>
        {
            if (OnStart != null) this.OnStart(this, new OnStartEventArgs(uri));
            var driver = new PhantomJSDriver(_service, _options); // create the PhantomJS WebDriver
            try
            {
                var watch = DateTime.Now;
                driver.Navigate().GoToUrl(uri.ToString()); // request the URL
                if (script != null) driver.ExecuteScript(script.Code, script.Args); // run the JavaScript code
                if (operation.Action != null) operation.Action.Invoke(driver);
                var driverWait = new WebDriverWait(driver, TimeSpan.FromMilliseconds(operation.Timeout)); // wait timeout in milliseconds
                if (operation.Condition != null) driverWait.Until(operation.Condition);
                var threadId = System.Threading.Thread.CurrentThread.ManagedThreadId; // ID of the thread running this task
                // TotalMilliseconds is the whole elapsed time; Milliseconds is only the 0-999 component
                var milliseconds = (int)DateTime.Now.Subtract(watch).TotalMilliseconds;
                var pageSource = driver.PageSource; // the page's DOM as markup
                if (this.OnCompleted != null) this.OnCompleted.Invoke(this, new OnCompletedEventArgs(uri, threadId, milliseconds, pageSource, driver));
            }
            catch (Exception ex)
            {
                if (this.OnError != null) this.OnError.Invoke(this, new OnErrorEventArgs(uri, ex));
            }
            finally
            {
                driver.Close();
                driver.Quit();
            }
        });
    }

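This differs from the simple version in several ways. First, there is no return value. Second, we use the PhantomJSDriver in place of the HTTP request we built ourselves, and configure it with a few options. Third, the events are raised with Invoke(). On a delegate, Invoke() is the same as calling the delegate directly: the handlers run synchronously on the current thread, which here is the thread-pool thread executing the Task. It does not hand the call over to the main thread.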
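To check on which thread handlers raised through Invoke() actually run, a small experiment can help. This is only a sketch; `InvokeDemo` and its members are hypothetical and independent of the crawler classes.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class InvokeDemo
{
    public static event EventHandler<EventArgs> OnCompleted;

    // Returns { thread that raised the event,
    //           thread seen by the handler when raised via Invoke,
    //           thread seen by the handler when the delegate is called directly }.
    public static int[] Run()
    {
        int viaInvoke = -1, viaCall = -1, current = -1;
        EventHandler<EventArgs> handler = (s, e) => current = Thread.CurrentThread.ManagedThreadId;
        OnCompleted += handler;
        int raiser = Task.Run(() =>
        {
            OnCompleted.Invoke(null, EventArgs.Empty);   // explicit Invoke
            viaInvoke = current;
            OnCompleted(null, EventArgs.Empty);          // direct call
            viaCall = current;
            return Thread.CurrentThread.ManagedThreadId;
        }).Result;
        OnCompleted -= handler;
        return new[] { raiser, viaInvoke, viaCall };
    }
}
```

All three values returned by `Run()` are the same thread ID: the handler runs on whichever thread raises the event, whether the delegate is called through Invoke() or directly. Handlers that touch shared state or UI elements therefore need their own synchronization.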

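PhantomJS is a headless WebKit browser that is scriptable through a JavaScript API. It supports a range of web standards: DOM handling, CSS selectors, JSON, Canvas, and SVG.

Selenium supports many drivers, both official and third-party. Besides desktop browsers, there are also drivers for iPhone and Android.

The desktop drivers are all browser-based and fall into two types:

One kind is the real-browser driver

For example, Safari and Firefox (ff) are driven through a browser plug-in, while IE and Chrome are driven through a separate binary. These drivers launch the real browser and control it through its low-level interfaces, so they simulate real user scenarios most faithfully and are mainly used for web compatibility testing.

The other kind is the pseudo-browser driver

The pseudo-browsers Selenium supports include HtmlUnit and PhantomJS. Neither is a real desktop browser and neither has a GUI; they are browser-like programs that can parse HTML and run JavaScript. They do not display the page on screen, but they support element lookup, JavaScript execution, and so on. With no GUI to draw, they generally run faster than a real browser and are mainly used for functional testing.

HtmlUnit is a browser-like program implemented in Java. It is included in Selenium Server, needs no separate driver, and can be instantiated directly. Its JavaScript engine is Rhino.

PhantomJS is an independent third-party browser-like application that can execute HTML, JavaScript, CSS, and so on. Its driver is GhostDriver, which has been bundled into the main executable since the 1.8 release, so downloading the main program is enough. Its JavaScript engine is WebKit's JavaScriptCore, not Chrome's V8.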

Now for the main function. Here we define an Operation class whose purpose is to describe the actions of a normal user so that Selenium can carry them out.

    var operation = new Operation
    {
        Action = (x) => {
            // Use the Selenium driver to click the page's "hotel reviews" (酒店评论) tab
            x.FindElement(By.XPath("//*[@id='commentTab']")).Click();
        },
        Condition = (x) => {
            // Check whether the Ajax-loaded reviews have finished loading;
            // "点评载入中" is the page's own "reviews loading" placeholder text
            return x.FindElement(By.XPath("//*[@id='commentList']")).Displayed && x.FindElement(By.XPath("//*[@id='hotel_info_comment']/div[@id='commentList']")).Displayed && !x.FindElement(By.XPath("//*[@id='hotel_info_comment']/div[@id='commentList']")).Text.Contains("点评载入中");
        },
        Timeout = 5000
    };
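The Condition and Timeout pair above boils down to polling: re-evaluate a condition until it holds or time runs out. The sketch below shows that idea without Selenium. `Poller` and `WaitUntil` are hypothetical names, and returning a bool on timeout is a simplification (Selenium's WebDriverWait throws an exception instead).

```csharp
using System;
using System.Diagnostics;
using System.Threading;

public static class Poller
{
    // Re-evaluates the condition until it returns true or the timeout elapses.
    // Returns true on success and false on timeout.
    public static bool WaitUntil(Func<bool> condition, int timeoutMs, int pollMs = 50)
    {
        var watch = Stopwatch.StartNew();
        while (true)
        {
            if (condition()) return true;
            if (watch.ElapsedMilliseconds >= timeoutMs) return false;
            Thread.Sleep(pollMs);   // pause between checks to avoid a busy loop
        }
    }
}
```

A caller might write `Poller.WaitUntil(() => PageIsReady(), 5000)`, where `PageIsReady` is whatever check suits the page, such as testing that a placeholder text has disappeared.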

Finally, the parsing method:

    private static void HotelCrawler(OnCompletedEventArgs e)
    {
        //Console.WriteLine(e.PageSource);
        //File.WriteAllText(Environment.CurrentDirectory + "\\cc.html", e.PageSource, Encoding.UTF8);
        var hotelName = e.WebDriver.FindElement(By.XPath("//*[@id='J_htl_info']/div[@class='name']/h2[@class='cn_n']")).Text;
        var address = e.WebDriver.FindElement(By.XPath("//*[@id='J_htl_info']/div[@class='adress']")).Text;
        var price = e.WebDriver.FindElement(By.XPath("//*[@id='div_minprice']/p[1]")).Text;
        var score = e.WebDriver.FindElement(By.XPath("//*[@id='divCtripComment']/div[1]/div[1]/span[3]/span")).Text;
        var reviewCount = e.WebDriver.FindElement(By.XPath("//*[@id='commentTab']/a")).Text;
        var comments = e.WebDriver.FindElement(By.XPath("//*[@id='hotel_info_comment']/div[@id='commentList']/div[1]/div[1]/div[1]"));
        var currentPage = Convert.ToInt32(comments.FindElement(By.XPath("div[@class='c_page_box']/div[@class='c_page']/div[contains(@class,'c_page_list')]/a[@class='current']")).Text);
        var totalPage = Convert.ToInt32(comments.FindElement(By.XPath("div[@class='c_page_box']/div[@class='c_page']/div[contains(@class,'c_page_list')]/a[last()]")).Text);
        var messages = comments.FindElements(By.XPath("div[@class='comment_detail_list']/div"));
        var nextPage = Convert.ToInt32(comments.FindElement(By.XPath("div[@class='c_page_box']/div[@class='c_page']/div[contains(@class,'c_page_list')]/a[@class='current']/following-sibling::a[1]")).Text);

        Console.WriteLine();
        Console.WriteLine("Name: " + hotelName);
        Console.WriteLine("Address: " + address);
        Console.WriteLine("Price: " + price);
        Console.WriteLine("Score: " + score);
        Console.WriteLine("Review count: " + reviewCount);
        Console.WriteLine("Pages: current (" + currentPage + ") next (" + nextPage + ") total (" + totalPage + ") per page (" + messages.Count + ")");
        Console.WriteLine();
        Console.WriteLine("===============================================");
        Console.WriteLine();
        Console.WriteLine("Reviews:");
        foreach (var message in messages)
        {
            Console.WriteLine("Account: " + message.FindElement(By.XPath("div[contains(@class,'user_info')]/p[@class='name']")).Text);
            Console.WriteLine("Room type: " + message.FindElement(By.XPath("div[@class='comment_main']/p/a")).Text);
            // Show at most the first 50 characters; Substring(0, 50) alone throws on shorter reviews
            var content = message.FindElement(By.XPath("div[@class='comment_main']/div[@class='comment_txt']/div[1]")).Text;
            Console.WriteLine("Content: " + content.Substring(0, Math.Min(50, content.Length)) + "....");
            Console.WriteLine();
            Console.WriteLine();
        }
        Console.WriteLine();
        Console.WriteLine("===============================================");
        Console.WriteLine("URL: " + e.Uri.ToString());
        Console.WriteLine("Elapsed: " + e.Milliseconds + " ms");
    }

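As you can see, we used PhantomJS + Selenium to parse the DOM and obtain the data we wanted.

C# is convenient and easy to debug, but it has shortcomings too: the code is verbose, and fetching pages takes a lot of work. Next time we will use Python, a language well suited to crawling, to do the same job.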
