Python Crawler (6): Regular Expressions (Part 3)

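Let me work through one more example to reinforce our understanding of regular expressions. Going back to the second-hand housing page we downloaded earlier: in practice we don't need the whole page, so let's improve the program to filter the page and keep only the content we need. Open the page in Chrome, right-click, and choose Inspect.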

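In the page source we can locate the content we need. To debug the program, we can test the regular expressions we write at http://tool.oschina.net/regex/ before putting them into the code.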

For houseinfo: pattern=r'data-el="region">(.+?)</div>'

For price: pattern=r'<div class="totalPrice"><span>\d+</span>万'
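If the online tester is unavailable, the same check can be done locally. The sketch below runs both patterns against a small, made-up HTML fragment (SAMPLE_HTML is an assumption that only imitates the shape of the listing markup; the real page may differ). It also shows that re.findall() returns just the captured group when the pattern has one, and the whole match when it has none.

```python
import re

# Hypothetical fragment that imitates the listing markup; the real page may differ.
SAMPLE_HTML = (
    '<div class="houseInfo"><a href="#" data-el="region">Sunny Garden</a>'
    ' | 2 rooms | 89.5 sqm</div>'
    '<div class="totalPrice"><span>520</span>万</div>'
)

info_pattern = r'data-el="region">(.+?)</div>'
price_pattern = r'<div class="totalPrice"><span>\d+</span>万'

# The pattern has one capture group, so findall() returns only the group,
# lazily matched up to the first </div>.
info_list = re.findall(info_pattern, SAMPLE_HTML)

# The pattern has no capture group, so findall() returns the entire match,
# including the surrounding tags.
price_list = re.findall(price_pattern, SAMPLE_HTML)

print(info_list)
print(price_list)
```

The leftover `</a>` in the first result and the tags around the price in the second are the redundant parts that still need to be cleaned up.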


The text our regular expressions extract still contains redundant parts, so slicing is a natural way to trim it. Here is the source:

 from urllib import request

 import re

 def HTMLspider(url,startPage,endPage):
     # Build the URL of each page and send a request for it
     for page in range(startPage,endPage+1):
         filename="page_" + str(page) + ".html"
         # Assemble the full URL
         fullurl=url + str(page)
         # Call loadPage() to send the request and fetch the HTML page
         html=loadPage(fullurl,filename)

 def loadPage(fullurl,filename):
     # Fetch the page
     response=request.urlopen(fullurl)
     Html=response.read().decode('utf-8')
     #print(Html)
     # Regex that extracts the listing details
     info_pattern=r'data-el="region">(.+?)</div>'
     info_list=re.findall(info_pattern,Html)
     #print(info_list)
     # Regex that extracts the listing price
     price_pattern=r'<div class="totalPrice"><span>\d+</span>万'
     price_list=re.findall(price_pattern,Html)
     #print(price_list)
     writePage(price_list,info_list,filename)

 def writePage(price_list,info_list,filename):
     """
     Format the extracted prices and listing details and print them.
     Saving to local disk is left commented out below.
     """
     list1=[]
     list2=[]
     for i in price_list:
         # Drop the 30-character prefix and 8-character suffix, keeping only the digits
         i='-------------->>>>>Price:' + i[30:-8] + '万'
         list1.append(i)
         #print(i[30:-8])
     for j in info_list:
         # Replace the leftover closing tag with padding
         j=j.replace('</a>',' '*10)
         j=j[:10] + ' '*5 +  '---------->>>>>Detail information:  ' + j[10:] + ' '*5
         list2.append(j)
     #print(j)
     for each in zip(list2,list1):
         print(each)
     print("Saving "+filename)
     #with open(filename,'wb') as f:
     #    f.write(html)
     print("--"*30)

 if __name__=="__main__":
     # Read the first and last page to download; remember to convert them to int
     startPage=int(input("Enter the start page: "))
     endPage=int(input("Enter the end page: "))
     url="https://sh.lianjia.com/ershoufang/"
     HTMLspider(url,startPage,endPage)
     print("Download complete!")
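The slice bounds 30 and -8 used in writePage() are not magic numbers: they are the lengths of the fixed text before and after the digits in each price match. The sketch below checks this on a made-up match string (raw is an assumption shaped like an entry of price_list) and, as one possible alternative, shows a capture group doing the same job without any slicing.

```python
import re

# Hypothetical match text, shaped like one entry of price_list.
raw = '<div class="totalPrice"><span>520</span>万'

prefix = '<div class="totalPrice"><span>'
suffix = '</span>万'

# The slice indices are simply the lengths of the fixed parts.
sliced = raw[len(prefix):-len(suffix)]

# Alternative: wrap \d+ in a capture group so findall() returns only the digits.
grouped = re.findall(r'<div class="totalPrice"><span>(\d+)</span>万', raw)

print(len(prefix), len(suffix))
print(sliced, grouped)
```

Slicing by fixed offsets breaks silently if the markup changes even slightly, so the capture-group form is arguably the more robust of the two.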

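Running the program prints the results to the terminal. That is all I do here, but you could also serialize the scraped content with json.dumps() and write it to a local file.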
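As a rough sketch of the json.dumps() idea, the two lists could be paired into records and written out as JSON. The helper names listings_to_json and save_listings, and the "info"/"price" keys, are hypothetical and not part of the original program.

```python
import json

def listings_to_json(info_list, price_list):
    """Pair details with prices and serialize them as a JSON string."""
    records = [
        {"info": info, "price": price}
        for info, price in zip(info_list, price_list)
    ]
    # ensure_ascii=False keeps Chinese text readable instead of \uXXXX escapes.
    return json.dumps(records, ensure_ascii=False, indent=2)

def save_listings(info_list, price_list, filename):
    """Write the serialized records to a local UTF-8 text file."""
    with open(filename, "w", encoding="utf-8") as f:
        f.write(listings_to_json(info_list, price_list))
```

For example, save_listings(list2, list1, "page_1.json") could be called from writePage() in place of the commented-out file write.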

There are in fact other ways to do this kind of data extraction, which will be covered in later posts.
