I've been learning Python recently and, along the way, looking into web crawlers. Here are my notes on crawler basics (based on Python 2.7):
Three ways to fetch page data:
# encoding=utf-8
import urllib2

def download1(url):
    # read() fetches the entire response by default;
    # read(100) would fetch only the first 100 bytes
    return urllib2.urlopen(url).read()

def download2(url):
    # readlines() returns the response as a list of lines
    return urllib2.urlopen(url).readlines()

def download3(url):
    # read the response line by line
    response = urllib2.urlopen(url)
    while True:
        line = response.readline()
        if not line:
            break
        print line

url = "http://www.baidu.com"
download3(url)
All three are based on the urllib2 module and are fairly straightforward.
Masquerading as a browser
Many sites now use anti-crawling measures to keep their data from being scraped. To keep a crawler working in that situation, there are two approaches I've learned so far: one is to add a random header, the other is to use a framework to simulate a real browser; the idea behind both is much the same.
Adding a random header:
import urllib2

def download(url):
    # header = {"User-Agent": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)"}
    header = {"User-Agent": "UCWEB7.0.2.37/28/999"}
    request = urllib2.Request(url=url, headers=header)
    # add another (custom) header
    request.add_header("name", "zhangsan")
    # send the request
    response = urllib2.urlopen(request)
    print "result: " + str(response.code)
    print response.read()

download("http://www.baidu.com")
We can pick the header at random; the two above simulate IE and the mobile UC browser respectively. There are plenty of User-Agent strings online to choose from for random simulation. Below is a partial list collected from the web (a random-selection sketch follows the list):
pcUserAgent = {
    "safari 5.1 - MAC": "Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_6_8; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "safari 5.1 - Windows": "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-us) AppleWebKit/534.50 (KHTML, like Gecko) Version/5.1 Safari/534.50",
    "IE 9.0": "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)",
    "IE 8.0": "Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0; Trident/4.0)",
    "IE 7.0": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)",
    "IE 6.0": "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Firefox 4.0.1 - MAC": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Firefox 4.0.1 - Windows": "Mozilla/5.0 (Windows NT 6.1; rv:2.0.1) Gecko/20100101 Firefox/4.0.1",
    "Opera 11.11 - MAC": "Opera/9.80 (Macintosh; Intel Mac OS X 10.6.8; U; en) Presto/2.8.131 Version/11.11",
    "Opera 11.11 - Windows": "Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11",
    "Chrome 17.0 - MAC": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_0) AppleWebKit/535.11 (KHTML, like Gecko) Chrome/17.0.963.56 Safari/535.11",
    "Maxthon": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Maxthon 2.0)",
    "Tencent TT": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; TencentTraveler 4.0)",
    "The World 2.x": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "The World 3.x": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; The World)",
    "sogou 1.x": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Trident/4.0; SE 2.X MetaSr 1.0; SE 2.X MetaSr 1.0; .NET CLR 2.0.50727; SE 2.X MetaSr 1.0)",
    "360": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; 360SE)",
    "Avant": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; Avant Browser)",
    "Green Browser": "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)"
}
mobileUserAgent = {
    "iOS 4.33 - iPhone": "Mozilla/5.0 (iPhone; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "iOS 4.33 - iPod Touch": "Mozilla/5.0 (iPod; U; CPU iPhone OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "iOS 4.33 - iPad": "Mozilla/5.0 (iPad; U; CPU OS 4_3_3 like Mac OS X; en-us) AppleWebKit/533.17.9 (KHTML, like Gecko) Version/5.0.2 Mobile/8J2 Safari/6533.18.5",
    "Android N1": "Mozilla/5.0 (Linux; U; Android 2.3.7; en-us; Nexus One Build/FRF91) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Android QQ": "MQQBrowser/26 Mozilla/5.0 (Linux; U; Android 2.3.7; zh-cn; MB200 Build/GRJ22; CyanogenMod-7) AppleWebKit/533.1 (KHTML, like Gecko) Version/4.0 Mobile Safari/533.1",
    "Android Opera": "Opera/9.80 (Android 2.3.4; Linux; Opera Mobi/build-1107180945; U; en-GB) Presto/2.8.149 Version/11.10",
    "Android Pad Moto Xoom": "Mozilla/5.0 (Linux; U; Android 3.0; en-us; Xoom Build/HRI39) AppleWebKit/534.13 (KHTML, like Gecko) Version/4.0 Safari/534.13",
    "BlackBerry": "Mozilla/5.0 (BlackBerry; U; BlackBerry 9800; en) AppleWebKit/534.1+ (KHTML, like Gecko) Version/6.0.0.337 Mobile Safari/534.1+",
    "WebOS HP Touchpad": "Mozilla/5.0 (hp-tablet; Linux; hpwOS/3.0.0; U; en-US) AppleWebKit/534.6 (KHTML, like Gecko) wOSBrowser/233.70 Safari/534.6 TouchPad/1.0",
    "Nokia N97": "Mozilla/5.0 (SymbianOS/9.4; Series60/5.0 NokiaN97-1/20.0.019; Profile/MIDP-2.1 Configuration/CLDC-1.1) AppleWebKit/525 (KHTML, like Gecko) BrowserNG/7.1.18124",
    "Windows Phone Mango": "Mozilla/5.0 (compatible; MSIE 9.0; Windows Phone OS 7.5; Trident/5.0; IEMobile/9.0; HTC; Titan)",
    "UC": "UCWEB7.0.2.37/28/999",
    "UC standard": "NOKIA5700/ UCWEB7.0.2.37/28/999",
    "UCOpenwave": "Openwave/ UCWEB7.0.2.37/28/999",
    "UC Opera": "Mozilla/4.0 (compatible; MSIE 6.0; ) Opera/UCWEB7.0.2.37/28/999"
}
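With the two dictionaries above in scope, picking a header at random is straightforward; here is a minimal sketch (only pcUserAgent and mobileUserAgent come from the listing, the function name is mine):
import random
import urllib2

def download_with_random_ua(url):
    # merge both pools and pick one User-Agent per request
    agents = pcUserAgent.values() + mobileUserAgent.values()
    request = urllib2.Request(url, headers={"User-Agent": random.choice(agents)})
    return urllib2.urlopen(request).read()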
The second approach is to use the selenium testing framework to drive a real browser; a minimal example:
import selenium.webdriver  # drives a real browser to simulate visits

def get_page_source():
    target_url = "http://www.baidu.com"
    driver = selenium.webdriver.Chrome()  # launch a Chrome instance
    driver.get(target_url)                # visit the link
    page_source = driver.page_source      # grab the rendered HTML
    print page_source
Selenium works by invoking a browser driver installed on the OS; if the matching driver isn't configured, it fails with an error saying the driver executable cannot be found.
If you use webdriver.Chrome(), you need a chromedriver binary: download and unzip it, note its location, and change the code to:
driver = selenium.webdriver.Chrome(chrome_driver_path)  # path to the chromedriver binary
Consistent encoding
This mainly concerns transmitting Chinese text: if Chinese isn't URL-encoded before transmission, the server will receive garbage. Encode it like this:
import urllib

words = {"name": "zhangsan", "address": "上海"}
print urllib.urlencode(words)                   # URL-encode
print urllib.unquote(urllib.urlencode(words))   # URL-decode
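urlencode() only takes a dict of parameters; for a single string, urllib.quote / urllib.unquote do the same job. A quick sketch (the sample string is arbitrary):
# encoding=utf-8
import urllib

city = "上海"                  # a Chinese byte string to embed in a URL
encoded = urllib.quote(city)   # percent-encode the UTF-8 bytes
print encoded                  # %E4%B8%8A%E6%B5%B7
print urllib.unquote(encoded)  # decode back to the original bytes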
GET/POST requests
GET and POST differ mainly in how parameters are passed: GET appends them to the URL, while POST wraps them in the request body.
Use Python's Flask framework to build a simple server:
from flask import Flask, request

app = Flask(__name__)

@app.route('/')
def hello_world():
    return 'Hello World!'

@app.route("/login", methods=["POST"])
def login():
    # POST parameters arrive in the request body (form data)
    name = request.form.to_dict().get("name", "")
    age = request.form.to_dict().get("age", "")
    return name + "-------" + age

@app.route("/query", methods=["GET"])
def query():
    # GET parameters arrive in the query string
    age = request.args.get("age", "")
    return "this age is " + age

if __name__ == '__main__':
    app.run("127.0.0.1", port=8090)
A GET request against it:
import urllib
import urllib2

words = {"age": "23"}
request = urllib2.Request(url="http://127.0.0.1:8090/query?" + urllib.urlencode(words))
response = urllib2.urlopen(request)
print response.read()
And a POST request:
import urllib
import urllib2

info = {"name": "Tom张", "age": "20"}
info = urllib.urlencode(info)  # the POST body also needs URL encoding
request = urllib2.Request("http://127.0.0.1:8090/login")
request.add_data(info)         # attaching data turns this into a POST request
response = urllib2.urlopen(request)
print response.read()
Downloading images
import urllib
urllib.urlretrieve(image_url, local_save_path)  # (source URL, local save path)
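A quick usage sketch; the image URL and file name here are placeholders:
import urllib

img_url = "http://example.com/some_image.png"  # placeholder URL
urllib.urlretrieve(img_url, "some_image.png")  # saves the file in the working directory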
Proxies and local proxies
When several crawlers share a single IP, they all stop working the moment that IP gets banned. This has spawned a whole ecosystem: search Taobao for keywords like "vps" and you'll find all kinds of commercial proxy offerings.
Of course, we can also use free proxies [list checked 2018-04-21 16:14]:
https://www.kuaidaili.com/free/ ## Kuaidaili (快代理)
http://www.xicidaili.com/ ## Xicidaili (西刺代理)
Using a proxy from Python:
import urllib2

http_proxy = urllib2.ProxyHandler({"http": "117.90.3.126:9000"})  # proxy IP and port
opener = urllib2.build_opener(http_proxy)
request = urllib2.Request("http://www.baidu.com")
response = opener.open(request)
print response.read()
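Building on this, a minimal sketch of rotating across a small proxy pool (the second address is a placeholder; in practice the pool would be filled from the free lists above and each entry verified first):
import random
import urllib2

# placeholder pool; fill from a free proxy list and health-check entries before use
proxy_pool = ["117.90.3.126:9000", "118.24.56.78:8888"]

def open_with_random_proxy(url):
    proxy = random.choice(proxy_pool)  # a different exit IP per request
    opener = urllib2.build_opener(urllib2.ProxyHandler({"http": proxy}))
    return opener.open(url).read()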
Redirects
1. Check whether a URL was redirected:
import urllib2

# check whether a URL gets redirected
def url_is_redirect(url):
    response = urllib2.urlopen(url)
    return response.geturl() != url

print url_is_redirect("http://www.baidu.cn")
2. If it was redirected, grab the new address:
import urllib2

class RedirectHandler(urllib2.HTTPRedirectHandler):
    def http_error_302(self, req, fp, code, msg, headers):
        res = urllib2.HTTPRedirectHandler.http_error_302(self, req, fp, code, msg, headers)
        res.status = code             # the status code that came back
        res.newurl = res.geturl()     # the URL we were redirected to
        print res.newurl, res.status  # inspect the redirect target
        return res

opener = urllib2.build_opener(RedirectHandler)
opener.open("http://www.baidu.cn")
Cookies
Fetching pages that are tied together by a session requires cookies.
1. Getting cookies:
# encoding=utf-8
import urllib2
import cookielib

# create a cookie jar to hold the cookies
cookie = cookielib.CookieJar()
# build a handler that stores response cookies in the jar
handler = urllib2.HTTPCookieProcessor(cookie)
# build an opener that uses the handler
opener = urllib2.build_opener(handler)
response = opener.open("http://www.baidu.com")
for data in cookie:
    print data.name + "--" + data.value
The output looks like:
BAIDUID--2643F48FC95482FF4ECAD2EBC7DBE11E:FG=1
BIDUPSID--2643F48FC95482FF4ECAD2EBC7DBE11E
H_PS_PSSID--1466_21088_18560_22158
PSTM--1524360190
BDSVRTM--0
BD_HOME--0
2. Saving cookies to a file (and reading them back):
# encoding=utf-8
import urllib2
import cookielib

file_path = "cookie.txt"
cookie = cookielib.LWPCookieJar(file_path)     # a cookie jar bound to a file
handler = urllib2.HTTPCookieProcessor(cookie)  # handler that fills the jar
opener = urllib2.build_opener(handler)
response = opener.open("http://www.baidu.com")
cookie.save(ignore_expires=True, ignore_discard=True)
After it runs, cookie.txt holds our saved cookies.
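To reuse the saved cookies on a later run, a minimal sketch (assuming cookie.txt was written by the code above):
# encoding=utf-8
import urllib2
import cookielib

cookie = cookielib.LWPCookieJar()
cookie.load("cookie.txt", ignore_discard=True, ignore_expires=True)  # read cookies back from disk
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open("http://www.baidu.com")  # this request now carries the saved cookies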
That's about it for the basics; I'll come back and update the rest over time.