[Crawler Notes] Getting Started with Web Crawlers

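After a fair amount of stumbling I can now crawl some data, so I suppose I have one foot in the door. There is, of course, still a long way to go.

I had been writing Python 3 for a while during an internship, so I went straight to a MOOC video course to learn crawling. Here is the link: (link to the introductory crawler course). After watching the video you will have a first idea of how crawlers work. I also read a lot of technical blog posts. For a technology this mature, a Baidu/Google search usually turns up plenty of good posts to learn from.

Python 3 has urllib but no urllib2. More precisely, urllib2 has not disappeared: Python 2's urllib and urllib2 were merged into a single package, urllib, whose submodules include urllib.request, urllib.parse and urllib.error. For more detail on the differences, see this link: (link on the urllib differences between Python 3 and Python 2)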
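To make the Python 2 to Python 3 mapping concrete, here is a small sketch using only the standard library. It does not touch the network; the User-Agent value is a shortened placeholder, and the Python 2 names in the comments are the ones these functions replaced.

```python
import urllib.error    # Python 2: the exceptions lived in urllib2 (urllib2.URLError)
import urllib.parse    # Python 2: urlparse.urljoin, urllib.urlencode
import urllib.request  # Python 2: urllib2.Request, urllib2.urlopen

url = "http://acm.nyist.net/JudgeOnline/problemset.php?page=1"

# Split a URL into its components.
parts = urllib.parse.urlparse(url)

# Build a query string and resolve it against the current page URL.
query = urllib.parse.urlencode({"page": 2})
next_url = urllib.parse.urljoin(url, "?" + query)

# Creating a Request object only describes the request; nothing is sent
# until it is passed to urllib.request.urlopen().
request = urllib.request.Request(next_url, headers={"User-Agent": "Mozilla/5.0"})

print(parts.netloc)
print(next_url)
print(request.get_method())
```

A query-only reference such as "?page=2" keeps the path of the base URL and replaces just its query, which is why urljoin works well for pagination links.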

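Knowledge only sticks once you put it into practice:

First, you should have a rough picture of a crawler's overall architecture. This matters a lot: it is like the overall plan for a task, and with it in hand you will not go far wrong.

1. URL manager: keeps track of the URLs to crawl, split into those still waiting and those already crawled.

2. Page downloader (urllib): downloads the HTML of the page at a given URL to the local machine.

3. Page parser (BeautifulSoup): parses the HTML/XML page into a structured tree, the DOM (Document Object Model), from which the useful data is extracted.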
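The way the three parts fit together can be sketched without any network access. Everything below is an illustrative assumption rather than the code used in this post: FAKE_SITE is a made-up in-memory site on example.com, download() stands in for the urllib downloader, and parse() uses regular expressions in place of BeautifulSoup so the sketch needs only the standard library.

```python
import re
from urllib.parse import urljoin

# Hypothetical in-memory "site" so the sketch runs offline.
FAKE_SITE = {
    "http://example.com/list?page=1":
        '<a href="?page=2">next</a><a href="item.php?id=1">A</a>',
    "http://example.com/list?page=2":
        '<a href="?page=1">prev</a><a href="item.php?id=2">B</a>',
}


class UrlManager:
    """Tracks URLs waiting to be crawled and URLs already crawled."""

    def __init__(self):
        self.new_urls = set()
        self.old_urls = set()

    def add(self, url):
        # Ignore URLs that are already queued or already visited.
        if url not in self.new_urls and url not in self.old_urls:
            self.new_urls.add(url)

    def has_new(self):
        return bool(self.new_urls)

    def get(self):
        url = self.new_urls.pop()
        self.old_urls.add(url)
        return url


def download(url):
    # Stand-in downloader: a real one would call urllib.request.urlopen(url).
    return FAKE_SITE.get(url, "")


def parse(base_url, html):
    # Stand-in parser: returns (absolute pagination links, item names).
    links = re.findall(r'href="(\?page=\d+)"', html)
    items = re.findall(r'<a href="item\.php\?id=\d+">([^<]*)</a>', html)
    return [urljoin(base_url, link) for link in links], items


def crawl(root_url):
    manager = UrlManager()
    manager.add(root_url)
    collected = []
    while manager.has_new():
        url = manager.get()
        new_links, items = parse(url, download(url))
        for link in new_links:
            manager.add(link)
        collected.extend(items)
    return sorted(collected)


print(crawl("http://example.com/list?page=1"))
```

Because the manager remembers visited URLs, the "prev" link on the second page does not send the crawler back to the first page, so the loop terminates.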


Of course, each of these parts has plenty of depth to dig into. Only the overall framework is covered here.

"""
    Crawl the problem names from http://acm.nyist.net/JudgeOnline/problemset.php
"""
import urllib.request
import urllib.parse
from bs4 import BeautifulSoup
import re
"""
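    Here the crawler is written as a class, MySpider.
    1. Sets serve as the URL manager: new_urls holds the URLs waiting to be crawled, and old_urls holds the URLs already crawled.
    2. url_downloader() is the page downloader: given a URL, it downloads that page's HTML/XML.
    3. page_resolver() is the page parser: given an HTML/XML string, it extracts the useful information.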
"""
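class MySpider(object):
    def __init__(self, root_url):
        # Per-instance sets, so separate spiders do not share state.
        self.new_urls = set()  # URLs waiting to be crawled
        self.old_urls = set()  # URLs already crawled
        self.new_urls.add(root_url)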
    
    def url_downloader(self, url):
        req = urllib.request.Request(url)
        req.add_header("User-Agent", "Mozilla/5.0........ Firefox/50.0")
        # The HTTP method is not a header; a Request without data is sent as GET.
        req.add_header("Host", "acm.nyist.net")
        req.add_header("Referer", "http://acm.nyist.net/JudgeOnline/problemset.php")
        """
            The values for the request headers can be read from your own browser, for example:
            Host: acm.nyist.net
            User-Agent: Mozilla/5.0 (Windows NT 10.0; rv:50.0) Gecko/20100101 Firefox/50.0
            Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
            Accept-Language: zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3
            Accept-Encoding: gzip, deflate
            Cookie: __utma=1.777807425.1476802115.1485247339.1485251033.16; __utmz=1.1476802115.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); _gscu_771983383=7991431907sfcl37; PHPSESSID=1e816be3352a5e380b670153ccb7f0bd; __utmc=1
            Connection: keep-alive
            Upgrade-Insecure-Requests: 1
            Cache-Control: max-age=0
        """
        # Close the connection once the body has been read.
        with urllib.request.urlopen(req) as response:
            return response.read()
        
    def page_resolver(self, page_content):
        # BeautifulSoup parses the HTML into a tree that can be searched.
        soup = BeautifulSoup(page_content, 'html.parser', from_encoding='utf-8')
        # Problem links look like problem.php?pid=<number>.
        problem = soup.find_all('a', href=re.compile(r'problem\.php\?pid=\d+'))
        # Append the problem names to problem.txt; "with" closes the file even on error.
        with open('problem.txt', 'a', encoding='utf-8') as _file:
            for item in problem:
                print(item.get_text(), file=_file)
        # Pagination links look like ?page=<number>; queue the ones not seen yet.
        page_url = soup.find_all('a', href=re.compile(r'\?page=\d+'))
        print(page_url)
        for item in page_url:
            newurl = item['href']
            newfullurl = urllib.parse.urljoin("http://acm.nyist.net/JudgeOnline/problemset.php", newurl)
            if newfullurl not in self.new_urls and newfullurl not in self.old_urls:
                self.new_urls.add(newfullurl)
        
    # crawl schedules the whole run: take a URL, download the page, parse it.
    def crawl(self):
        while self.new_urls:
            url = self.new_urls.pop()
            self.old_urls.add(url)
            page = self.url_downloader(url)
            self.page_resolver(page)


initurl = "http://acm.nyist.net/JudgeOnline/problemset.php?page=1"
spider = MySpider(initurl)
spider.crawl()

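Notes:

1. Viewing the request information: the request headers can be read from the network panel of your browser's developer tools.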


2. For BeautifulSoup, a Baidu/Google search will turn up good technical blog posts to get started with.
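As a starting point, here is a minimal BeautifulSoup sketch that runs on an inline HTML string instead of a downloaded page. The HTML snippet and the problem titles in it are made up for illustration; only the two regular expressions are taken from the crawler above. It assumes the third-party bs4 package is installed.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical HTML, shaped loosely like a problem list with one pagination link.
html = """
<table>
  <tr><td><a href="problem.php?pid=1">A+B Problem</a></td></tr>
  <tr><td><a href="problem.php?pid=2">Bracket Matching</a></td></tr>
</table>
<a href="?page=2">2</a>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all() accepts a compiled regex as an attribute filter and keeps
# the tags whose href contains a match.
problem_links = soup.find_all("a", href=re.compile(r"problem\.php\?pid=\d+"))
titles = [a.get_text() for a in problem_links]

# Tag attributes are read with dictionary-style access.
page_links = [a["href"] for a in soup.find_all("a", href=re.compile(r"\?page=\d+"))]

print(titles)
print(page_links)
```

When the input is already a str, as here, from_encoding is not needed; it only matters when BeautifulSoup is given raw bytes.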

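Of course, MySpider leaves plenty of room for improvement. For example, this approach cannot crawl every site, such as sites that require login information. Also, when a page builds its content with JavaScript, the downloaded source contains the JS code, which this approach does not execute, so that content cannot be parsed out.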
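For sites that need a login, one common standard-library approach is to keep cookies across requests with http.cookiejar. The sketch below only builds the pieces and sends nothing. The login URL and the form field names "username" and "password" are hypothetical; a real site defines its own, which you can find by inspecting its login form.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# The cookie jar stores whatever cookies the server sets (e.g. a session id);
# the opener sends them back automatically on later requests.
cookie_jar = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cookie_jar))
opener.addheaders = [("User-Agent", "Mozilla/5.0")]


def build_login_request(login_url, username, password):
    # Hypothetical form fields; supplying data makes urllib send a POST.
    form = urllib.parse.urlencode({"username": username, "password": password})
    return urllib.request.Request(login_url, data=form.encode("utf-8"))


login_request = build_login_request("http://example.com/login", "user", "secret")
print(login_request.get_method())

# With network access, the flow would be:
#   opener.open(login_request)               # server sets the session cookie
#   page = opener.open(protected_url).read() # cookie is sent automatically
```

This does not help with pages whose content is produced by JavaScript; those need a tool that actually executes the scripts.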

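In short, different pages call for different crawling methods; no single approach covers them all.

All in all, this is only the most basic of introductions to crawlers. There is still a long way to go.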

