This article demonstrates four ways to get all the links on the current page with Python, shared here for reference. The details are as follows:
'''
Get all links on the current page
'''
import re

import requests
from bs4 import BeautifulSoup
from lxml import etree
from selenium import webdriver
from selenium.webdriver.common.by import By

url = 'http://www.testweb.com'
r = requests.get(url)
r.encoding = 'gb2312'  # encoding of the example page; adjust for your target site

# Method 1: re (crude and brute-force!)
# Matches the value of every quoted href attribute, whatever tag it belongs to
matches = re.findall(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')", r.text)
for link in matches:
    print(link)
print()

# Method 2: BeautifulSoup4 (DOM tree)
soup = BeautifulSoup(r.text, 'lxml')
for a in soup.find_all('a', href=True):  # href=True skips <a> tags that have no href
    print(a['href'])
print()

# Method 3: lxml.etree (XPath)
tree = etree.HTML(r.text)
for link in tree.xpath("//@href"):  # every href attribute, not only those on <a> tags
    print(link)
print()

# Method 4: selenium (has to open a browser!)
driver = webdriver.Firefox()
driver.get(url)
for link in driver.find_elements(By.TAG_NAME, "a"):
    print(link.get_attribute("href"))
driver.quit()  # end the browser session
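The four methods do not all return the same set of links: the regular expression and the `//@href` XPath pick up every `href` attribute on the page, while the BeautifulSoup and selenium versions look only at `<a>` tags. The sketch below illustrates that difference without any network access or third-party packages. The sample HTML and the helper names (`links_by_regex`, `AnchorCollector`, `links_by_anchor`) are made up for illustration, and the standard-library `html.parser` is used here only as a stand-in for the `<a>`-only methods.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample page, used only for illustration
SAMPLE_HTML = """
<html><head><link rel="stylesheet" href="/static/site.css"></head>
<body>
<a href="http://www.testweb.com/news/1.html">News</a>
<a href='/about.html'>About</a>
<a name="top">No href here</a>
</body></html>
"""

# Same pattern as in method 1 above
HREF_RE = re.compile(r"(?<=href=\").+?(?=\")|(?<=href=\').+?(?=\')")


def links_by_regex(html):
    """Return every quoted href value, whatever tag it belongs to."""
    return HREF_RE.findall(html)


class AnchorCollector(HTMLParser):
    """Collect href values from <a> tags only."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:  # skip anchors without an href
                self.links.append(href)


def links_by_anchor(html):
    """Return the href of every <a> tag that has one."""
    parser = AnchorCollector()
    parser.feed(html)
    return parser.links


if __name__ == "__main__":
    print(links_by_regex(SAMPLE_HTML))   # includes the stylesheet link
    print(links_by_anchor(SAMPLE_HTML))  # <a> tags only
```

If only navigational links are wanted, the `<a>`-based methods are probably the safer choice; the regex and `//@href` variants are useful when stylesheet and other resource links should be included too.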
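Note: if the page contains an iframe, none of the tags in the page loaded inside that iframe can be obtained with the four methods as written above, because the framed page is a separate document loaded from the iframe's src. In that case, fetch each iframe as well: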
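# Then open every iframe and look for all the <a> tags inside it
# (continues the script above: reuses re, requests, BeautifulSoup, soup and url)
from urllib.parse import urljoin

all_urls = set()  # unique site roots found in the iframes
for iframe in soup.find_all('iframe', src=True):
    url_ifr = urljoin(url, iframe['src'])  # the iframe's src, resolved against the page URL
    rr = requests.get(url_ifr)
    rr.encoding = 'gb2312'
    soup_ifr = BeautifulSoup(rr.text, 'lxml')
    for a in soup_ifr.find_all('a', href=True):
        link = a['href']
        # Keep only 'http://host', i.e. everything before the first '/' after the scheme
        m = re.match(r'http://.*?(?=/)', link)
        # print(link)
        if m:
            all_urls.add(m.group(0))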
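The pattern `http://.*?(?=/)` used in the iframe loop keeps only the scheme and host of each link. It skips `https://` links and any link with no `/` after the host. As one possible alternative (not the original author's method), the same reduction can be sketched with the standard library's `urllib.parse.urlsplit`. The helper name `site_root` is hypothetical.

```python
import re
from urllib.parse import urlsplit


def site_root(link):
    """Return 'scheme://host' for an absolute http(s) link, else None."""
    parts = urlsplit(link)
    if parts.scheme in ("http", "https") and parts.netloc:
        return f"{parts.scheme}://{parts.netloc}"
    return None  # relative links, javascript:, mailto: and so on


def site_root_regex(link):
    """The original regex approach, wrapped in a function for comparison."""
    m = re.match(r'http://.*?(?=/)', link)
    return m.group(0) if m else None


if __name__ == "__main__":
    samples = [
        "http://www.testweb.com/news/1.html",
        "http://www.testweb.com",
        "https://www.testweb.com/login",
        "/about.html",
    ]
    for s in samples:
        print(s, site_root(s), site_root_regex(s))
```

In the loop above, `all_urls.add(m.group(0))` could then be replaced by adding `site_root(link)` whenever it is not `None`, assuming `https` links and bare host URLs should be counted as well.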
I hope this article helps you with your Python programming.
Original link: http://www.cnblogs.com/hhh5460/p/5044038.html