Crawler Series 1: Analysis of a Simple Python Crawler

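I've decided to write a short series on web crawlers. This is the first post, covering the basic principles of a crawler along with a simple example.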

1. A simple crawler for a single web page

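The crawler below downloads all the images on a single Baidu Tieba page. The code consists of two main functions: getHtml() fetches the HTML content for a page url, and getImage() parses that HTML for image addresses and downloads each image.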

The code is as follows:

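import re
import urllib.request  # Python 3; in Python 2 these calls lived directly under urllib

def getHtml(url):
    """Fetch the HTML content of the page at the given url."""
    page = urllib.request.urlopen(url)  # open the page
    content = page.read()  # read the page content (bytes)
    # Assumes the page is UTF-8; decoding lets the str regex below match
    return content.decode('utf-8', 'ignore')

def getImage(html):
    """Parse the html for image addresses and download each image."""
    regx = r'src="(.+?\.jpg)" pic_ext'  # regular expression that captures each image url
    imgreg = re.compile(regx)
    imglist = imgreg.findall(html)
    x = 0
    for imgurl in imglist:
        filepath = 'F:\\Downloads\\' + str(x) + '.jpg'
        urllib.request.urlretrieve(imgurl, filepath)  # download the image to local disk
        x += 1
    print('completed!')

html = getHtml('http://tieba.baidu.com/p/2505265675')
getImage(html)  # returns nothing, so there is no result to assign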
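To see what the regular expression in getImage() actually captures without touching the network, here is a small sketch in Python 3 run against a hand-written HTML fragment. The fragment, the example.com urls, and the names find_image_urls and STRICT_PATTERN are made up for illustration; only ORIGINAL_PATTERN comes from the code above.

```python
import re

# Hypothetical HTML fragment, one tag per line. Only the first and third
# <img> tags carry the pic_ext attribute that the pattern requires.
SAMPLE_HTML = '\n'.join([
    '<img class="BDE_Image" src="http://example.com/a/one.jpg" pic_ext="jpeg">',
    '<img class="avatar" src="http://example.com/b/two.jpg" width="80">',
    '<img class="BDE_Image" src="http://example.com/c/three.jpg" pic_ext="jpeg">',
])

# The pattern used by getImage() above.
ORIGINAL_PATTERN = re.compile(r'src="(.+?\.jpg)" pic_ext')

# Assumed stricter variant: [^"] keeps the capture inside one attribute value.
STRICT_PATTERN = re.compile(r'src="([^"]+?\.jpg)" pic_ext')

def find_image_urls(html, pattern=ORIGINAL_PATTERN):
    """Return the urls captured by the pattern, in page order."""
    return pattern.findall(html)

for index, url in enumerate(find_image_urls(SAMPLE_HTML)):
    print(index, url)
```

One caveat worth knowing: when all the tags sit on a single line, the lazy `.+?` can run from one tag's src into a later tag before it finds `.jpg" pic_ext`, so a capture may contain more than one url. The stricter variant avoids that, since it cannot cross a double quote.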

2. A framework for crawling multiple pages

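Only the basic idea is covered here. Step one: pick a start page; a site's home page works fine. Step two: parse all the links on that start page and crawl the content behind each of them. Step three repeats this recursively: parse the internal links of every sub-page the crawler reaches, and keep crawling any page that has not been crawled yet. On the open web this process effectively never runs out of new pages.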
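Step two, pulling the links out of a page, can be sketched with the standard library alone. This is a minimal illustration in Python 3, not part of the original post: LinkCollector and extract_links are hypothetical names, and the sample HTML and example.com urls are invented.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base url."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        for name, value in attrs:
            if name == 'href' and value:
                # urljoin turns relative links into absolute ones
                self.links.append(urljoin(self.base_url, value))

def extract_links(html, base_url):
    """Return all absolute link targets found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links

sample = (
    '<a href="/news/1.html">one</a> '
    '<a href="http://example.org/x">two</a> '
    '<a name="top">no href here</a>'
)
print(extract_links(sample, 'http://example.com/index.html'))
```

Resolving each href against the page's own url matters because most sites link to their own pages with relative paths, and the crawler needs full urls to fetch them.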

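I'll borrow a snippet from Xie Ke on Zhihu to illustrate. The crawler starts fetching from the start page, initial_page. url_queue holds the pages waiting to be crawled, and seen holds every page already discovered, so that no page is queued twice.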

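import queue  # Python 3 name of the Python 2 Queue module

initial_page = "http://www.renminribao.com"

url_queue = queue.Queue()
seen = set()

seen.add(initial_page)
url_queue.put(initial_page)

# store() and extract_urls() are left undefined here: they stand for
# "save this page" and "list the urls this page links to".
while not url_queue.empty():
    current_url = url_queue.get()  # take the first url out of the queue
    store(current_url)  # save the page this url points to
    for next_url in extract_urls(current_url):  # the urls this page links to
        if next_url not in seen:
            seen.add(next_url)
            url_queue.put(next_url)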
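The loop above leaves store() and extract_urls() undefined, so it cannot run as written. Below is a runnable sketch in Python 3 of the same breadth-first idea, with both helpers passed in as arguments and a small in-memory link graph standing in for the web. FAKE_WEB, crawl, and the example.com urls are assumptions for illustration. It also swaps the thread-safe queue for collections.deque, which is enough for a single-threaded loop.

```python
from collections import deque

# Hypothetical link graph standing in for the web: page -> pages it links to.
FAKE_WEB = {
    'http://example.com/': ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a': ['http://example.com/b', 'http://example.com/c'],
    'http://example.com/b': ['http://example.com/'],
    'http://example.com/c': [],
}

def crawl(initial_page, extract_urls, store):
    """Breadth-first crawl that stores every reachable page exactly once."""
    seen = {initial_page}
    url_queue = deque([initial_page])
    while url_queue:
        current_url = url_queue.popleft()  # oldest discovered page first
        store(current_url)
        for next_url in extract_urls(current_url):
            if next_url not in seen:  # skip pages already discovered
                seen.add(next_url)
                url_queue.append(next_url)
    return seen

stored = []
crawl('http://example.com/', lambda url: FAKE_WEB.get(url, []), stored.append)
print(stored)
```

Even though page b links back to the start page and both the start page and a link to b, each page is stored once. The seen set is what keeps the cycle from looping forever.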

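When writing a real crawler, we usually also restrict it to a domain and skip any link that points outside that domain. Many excellent frameworks can handle multi-page crawling; for Python, I recommend Scrapy.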
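A domain restriction can be as small as one predicate applied before a url is queued. The sketch below is in Python 3; in_allowed_domain and the sample urls are hypothetical, and treating subdomains as part of the allowed domain is an assumption you may not want.

```python
from urllib.parse import urlparse

def in_allowed_domain(url, allowed_domain):
    """True if the url's host is allowed_domain or one of its subdomains."""
    host = urlparse(url).hostname or ''  # hostname is None for e.g. mailto: links
    return host == allowed_domain or host.endswith('.' + allowed_domain)

candidates = [
    'http://example.com/news',
    'http://blog.example.com/post/1',
    'http://notexample.com/',
    'http://other.org/page',
    'mailto:someone@example.com',
]
allowed = [u for u in candidates if in_allowed_domain(u, 'example.com')]
print(allowed)
```

Comparing the parsed host, rather than searching the whole url for the domain string, keeps lookalikes such as notexample.com out. In Scrapy the same restriction is usually expressed through a spider's allowed_domains attribute.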

秒客网
