A Small Python Crawler (an Application of Regular Expressions)

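Goal: use regular expressions to write a crawler that downloads every image on a web page.

Approach:
1. Fetch the page source
2. Extract the image URLs
3. Download the images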

Step 1: open the URL and fetch the page source

[root@node1 python]# mkdir image
[root@node1 python]# cd image
[root@node1 image]# vim getHtml.py
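#!/usr/bin/python
# Python 2 script: urllib.urlopen and the print statement do not exist in Python 3
import re
import urllib

def getHtml(url):
        # Open the URL and return the raw page source as a string
        html = urllib.urlopen(url)
        scode = html.read()
        return scode

print getHtml('http://tieba.baidu.com/p/1762577651')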
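For readers on Python 3, a rough equivalent of this step might look like the sketch below. It is not the article's code: the name `get_html`, the `encoding` parameter, and the UTF-8 default are assumptions made for illustration, since the article does not say which charset the page uses.

```python
import urllib.request


def get_html(url, encoding="utf-8"):
    """Fetch a URL and return its body decoded to text."""
    # urlopen returns a response object that can be used as a context manager.
    with urllib.request.urlopen(url) as response:
        raw = response.read()  # bytes in Python 3
    # errors="replace" avoids an exception if the page is not valid UTF-8
    # (the charset is an assumption, not something the article specifies).
    return raw.decode(encoding, errors="replace")


# Example call (needs network access), using the URL from the article:
# print(get_html('http://tieba.baidu.com/p/1762577651'))
```

Decoding to text up front lets the later regular-expression step work on `str` patterns; matching against raw bytes would require a bytes pattern instead.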

Step 2: extract the image URLs (regex matching)

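Study the fetched source to see how the image URLs are constructed, then pull them out with a regular expression. In the source, an image tag looks like this: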

<img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/sign=2e8f3ca53af33a879e6d0012f65d1018/4ece3bc79f3df8dc2ab63004cd11728b46102899.jpg" width="560" height="400" changedsize="true">

The part we want is http://imgsrc.baidu.com/xxxxxxx.jpg

#!/usr/bin/python
import re
import urllib

def getHtml(url):
        html = urllib.urlopen(url)
        scode = html.read()
        return scode

def getImage(source):
        # Capture everything between src=" and the .jpg" that precedes width=
        pattern = r'src="(.*?\.jpg)" width='
        imgre = re.compile(pattern)
        images = re.findall(imgre,source)
        return images

source = getHtml('http://tieba.baidu.com/p/1762577651')
print getImage(source)
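The pattern can be checked on its own, without touching the network, by running it against the sample tag quoted earlier. The following is a small Python 3 sketch; `SAMPLE`, `IMG_PATTERN`, and `find_images` are names invented for this illustration.

```python
import re

# The sample <img> tag quoted in the article, split across lines for width.
SAMPLE = ('<img class="BDE_Image" src="http://imgsrc.baidu.com/forum/w%3D580/'
          'sign=2e8f3ca53af33a879e6d0012f65d1018/'
          '4ece3bc79f3df8dc2ab63004cd11728b46102899.jpg" '
          'width="560" height="400" changedsize="true">')

# (.*?\.jpg) is non-greedy: it stops at the first .jpg that is
# immediately followed by the closing quote and " width=".
IMG_PATTERN = re.compile(r'src="(.*?\.jpg)" width=')


def find_images(source):
    """Return every captured .jpg URL found in the HTML text."""
    return IMG_PATTERN.findall(source)


urls = find_images(SAMPLE)
print(urls)
```

Because the pattern has exactly one capture group, `findall` returns a list of the captured URLs rather than the full matches, which is why the surrounding `src="` and `width=` text does not appear in the result.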

Step 3: download the extracted images

The previous step stored the image URLs in a list, so all that remains is to loop over that list.

#!/usr/bin/python
import re
import urllib

def getHtml(url):
        html = urllib.urlopen(url)
        scode = html.read()
        return scode

def getImage(source):
        pattern = r'src="(.*?\.jpg)" width='
        imgre = re.compile(pattern)
        images = re.findall(imgre,source)
        # Download each image; every one is written to the same file name
        for i in images:
                urllib.urlretrieve(i,'1.jpg')

source = getHtml('http://tieba.baidu.com/p/1762577651')
getImage(source)

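This has a problem, though: every image is saved as 1.jpg, so each download overwrites the previous one and only a single image survives. One more step is needed: give each image its own name.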

#!/usr/bin/python
import re
import urllib

def getHtml(url):
        html = urllib.urlopen(url)
        scode = html.read()
        return scode

def getImage(source):
        pattern = r'src="(.*?\.jpg)" width='
        imgre = re.compile(pattern)
        images = re.findall(imgre,source)
        # Use a counter so each image gets its own file name
        x = 0
        for i in images:
                urllib.urlretrieve(i,'%s.jpg' % x)
                x+=1

source = getHtml('http://tieba.baidu.com/p/1762577651')
getImage(source)
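The same numbering idea could be written in Python 3 roughly as follows. This is a sketch under assumptions rather than the article's code: `download_images`, `dest_dir`, and the injectable `fetch` parameter are hypothetical names, with `fetch` defaulting to the standard library's `urllib.request.urlretrieve`.

```python
import os
import urllib.request


def download_images(urls, dest_dir=".", fetch=urllib.request.urlretrieve):
    """Save each URL as 0.jpg, 1.jpg, ... in dest_dir; return the paths."""
    saved = []
    # enumerate supplies the counter that the article keeps by hand in x.
    for index, url in enumerate(urls):
        path = os.path.join(dest_dir, "%s.jpg" % index)
        fetch(url, path)  # called as fetch(url, filename), like urlretrieve
        saved.append(path)
    return saved


# Example call (needs network access); image_urls stands for the list
# produced by the regex step:
# download_images(image_urls, dest_dir=".")
```

Passing the download function in as a parameter is purely a convenience for trying the loop without a network connection; with the default argument it behaves like the hand-written counter loop.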

Result:

[root@node1 image]# python getHtml.py
[root@node1 image]# ls
11.jpg  13.jpg  15.jpg  17.jpg  19.jpg  20.jpg  3.jpg  5.jpg  7.jpg  9.jpg  10.jpg
12.jpg  14.jpg  16.jpg  18.jpg  1.jpg   2.jpg   4.jpg  6.jpg  8.jpg  getHtml.py

秒客网
