Extracting the list of URLs on a page with Python

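This article walks through an example of extracting the list of URLs on a web page with Python, shared here for reference. The implementation is as follows: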

    from bs4 import BeautifulSoup
    import time
    import urllib.request

    # URLs already handled elsewhere; scanpage() skips any href found here
    websiteurls = {}

    def scanpage(url):
        """Collect the links on `url` that contain `url` and print each one's HTTP status."""
        start = time.time()
        html = urllib.request.urlopen(url).read()
        # Name the parser explicitly so the result does not depend on what is installed
        soup = BeautifulSoup(html, "html.parser")

        # Keep each matching href once; the value later holds its status code
        page_urls = {}
        for link in soup.find_all("a", href=True):
            href = link.get("href")
            if url in href and href not in page_urls and href not in websiteurls:
                page_urls[href] = 0

        n = 0
        for href in page_urls:
            t2 = time.time()
            try:
                # Request each link only once and keep its status code
                page_urls[href] = urllib.request.urlopen(href).getcode()
            except Exception:
                print("connect failed")
            else:
                print(n, href, page_urls[href])
                print(time.time() - t2)
            n += 1
        print("total is " + repr(n) + " links")
        print(time.time() - start)

    scanpage("http://news.163.com/")
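
The script above keeps only links whose href contains the start URL, so relative links such as /sports/ are skipped. As a rough sketch of one way to include them, the example below resolves every href against the page URL with urljoin and keeps each same-host address once. It is not part of the original script: the names LinkCollector and extract_site_urls are made up for illustration, and it swaps BeautifulSoup for the standard library's html.parser so that it runs without extra packages.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class LinkCollector(HTMLParser):
    """Hypothetical helper: collects the href value of every <a> tag fed to it."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def extract_site_urls(html, base_url):
    """Return the unique same-host URLs linked from html, in page order."""
    collector = LinkCollector()
    collector.feed(html)
    base_host = urlparse(base_url).netloc
    seen = set()
    urls = []
    for href in collector.hrefs:
        # Resolve relative links against the page URL and drop any #fragment
        absolute, _fragment = urldefrag(urljoin(base_url, href))
        parts = urlparse(absolute)
        same_site = parts.scheme in ("http", "https") and parts.netloc == base_host
        if same_site and absolute not in seen:
            seen.add(absolute)
            urls.append(absolute)
    return urls


if __name__ == "__main__":
    # Small inline page so the sketch runs without network access
    sample = """
    <a href="/sports/">Sports</a>
    <a href="http://news.163.com/tech/">Tech</a>
    <a href="/sports/#top">Sports again</a>
    <a href="http://example.com/">Elsewhere</a>
    <a href="javascript:void(0)">Script link</a>
    """
    for u in extract_site_urls(sample, "http://news.163.com/"):
        print(u)
```

Comparing hosts rather than testing whether the start URL appears inside the href also avoids matching an off-site link that merely carries the start URL in its query string. The returned list could then be passed to the status-check loop in place of the dictionary keys.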

I hope this article is helpful for your Python programming.
