This post is about a problem caused by unfamiliarity with a function's parameters, but with the right analysis tool it could be pinned down quickly (I highly recommend the packet-capture tool Fiddler).
The details are as follows:
While scraping data from a certain app (all of the app's data is fetched over HTTP), I analyzed the requests with Fiddler and copied the request header fields into my Python program.
The code was as follows:
import requests

url = 'http://xxx?startDate=2017-10-19&endDate=2017-10-19&pageIndex=1&limit=50&sort=datetime&order=desc'
headers = {
    "Host": "xxx.com",
    "Connection": "keep-alive",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.59 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "http://app.jg.eastmoney.com/html_Report/index.html",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en-us,en",
    "Cookie": "xxx"
}
r = requests.get(url, headers)
print(r.text)
The request went through, but the response was:

{"Id":"6202c187-2fad-46e8-b4c6-b72ac8de0142","ReturnMsg":"加载失败!"}

(The ReturnMsg "加载失败!" means "load failed!".)
In other words, the request was detected as abnormal and blocked.
So I went back to Fiddler to look at the record of the request Python had just sent (the parts masked as xxx are the Host and the URL).
When I inspected the request details, my header fields had not been applied at all, and the User-Agent plainly read python-requests! (Any site or app with even basic anti-scraping in place will block requests whose User-Agent identifies a Java or Python program.)
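A side note on capturing this traffic at all: Fiddler only sees what flows through its local proxy, which listens on 127.0.0.1:8888 by default, so the Python requests have to be routed through it. A minimal sketch, assuming Fiddler's default port:

import requests

# Fiddler listens on 127.0.0.1:8888 by default; routing the request
# through it makes the Python traffic show up in the capture list.
proxies = {
    "http": "http://127.0.0.1:8888",
    "https": "http://127.0.0.1:8888",
}
# verify=False sidesteps certificate errors on HTTPS when Fiddler's
# root certificate has not been installed as trusted.
r = requests.get(url, proxies=proxies, verify=False)  # url as defined above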
The full header details were as follows:
GET http://xxx?startDate=2017-10-19&endDate=2017-10-19&pageIndex=1&limit=50&sort=datetime&order=desc
&Host=xxx.com
&Connection=keep-alive
&Accept=application%2Fjson%2C+text%2Fjavascript%2C+%2A%2F%2A%3B+q%3D0.01
&User-Agent=Mozilla%2F5.0+%28Windows+NT+6.1%3B+WOW64%29+AppleWebKit%2F537.36+%28KHTML%2C+like+Gecko%29+Chrome%2F29.0.1547.59+Safari%2F537.36
&X-Requested-With=XMLHttpRequest
&Referer=xxx
&Accept-Encoding=gzip%2Cdeflate
&Accept-Language=en-us%2Cen
&Cookie=xxx
HTTP/1.1
Host: xxx.com
User-Agent: python-requests/2.18.4
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive

HTTP/1.1 200 OK
Server: nginx/1.2.2
Date: Sat, 21 Oct 2017 06:07:21 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 75
Connection: keep-alive
Cache-Control: private
X-AspNetMvc-Version: 5.2
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
At first I didn't notice anything, but once I read the request URL all the way to the end, I saw that the program had appended my header fields to the URL as query parameters.
That meant I had passed the header information to the request function incorrectly.
I went back over how the Requests library's headers parameter works and found the one-line mistake: when calling requests.get(), the keyword headers= has to be written out explicitly.
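The underlying reason is the signature of requests.get, which is get(url, params=None, **kwargs): a bare second positional argument binds to params, so the dict was URL-encoded into the query string. A minimal sketch of the pitfall, using the public echo service httpbin.org purely for illustration:

import requests

headers = {"User-Agent": "Mozilla/5.0"}

# Wrong: the dict binds to the positional params argument and is
# urlencoded into the query string instead of being sent as headers.
r = requests.get("http://httpbin.org/get", headers)
print(r.url)              # ...?User-Agent=Mozilla%2F5.0
print(r.request.headers)  # User-Agent is still python-requests/...

# Right: pass it as the headers keyword argument.
r = requests.get("http://httpbin.org/get", headers=headers)
print(r.request.headers)  # User-Agent is now Mozilla/5.0

Checking r.request.headers like this is also a quick way to see what was actually sent, even without opening Fiddler.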
The corrected code:
import requests

url = 'http://xxx?startDate=2017-10-19&endDate=2017-10-19&pageIndex=1&limit=50&sort=datetime&order=desc'
headers = {
    "Host": "xxx.com",
    "Connection": "keep-alive",
    "Accept": "application/json, text/javascript, */*; q=0.01",
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.59 Safari/537.36",
    "X-Requested-With": "XMLHttpRequest",
    "Referer": "http://app.jg.eastmoney.com/html_Report/index.html",
    "Accept-Encoding": "gzip,deflate",
    "Accept-Language": "en-us,en",
    "Cookie": "xxx"
}
r = requests.get(url, headers=headers)
Then I checked the request in Fiddler again.
This time the headers sent from Python were correct; the request details were as follows:
GET http://xxx?startDate=2017-10-19&endDate=2017-10-19&pageIndex=1&limit=50&sort=datetime&order=desc HTTP/1.1
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.59 Safari/537.36
Accept-Encoding: gzip,deflate
Accept: application/json, text/javascript, */*; q=0.01
Connection: keep-alive
Host: xxx.com
X-Requested-With: XMLHttpRequest
Referer: http://xxx
Accept-Language: en-us,en
Cookie: xxx

HTTP/1.1 200 OK
Server: nginx/1.2.2
Date: Sat, 21 Oct 2017 06:42:21 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 75
Connection: keep-alive
Cache-Control: private
X-AspNetMvc-Version: 5.2
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Then I sent the request from the Python program once more; it went through, but the response was still:

{"Id":"6202c187-2fad-46e8-b4c6-b72ac8de0142","ReturnMsg":"加载失败!"}
Since cookies generally expire after a short time, I updated the Cookie value, and this time the request succeeded.
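Since the cookie keeps going stale, one option worth considering is letting Requests manage it: a requests.Session stores any Set-Cookie values the server sends and replays them on later requests. A minimal sketch, reusing the url and headers from the code above, and assuming the app issues its cookies on some entry page (the http://xxx/entry URL below is hypothetical):

import requests

# Drop the hand-copied Cookie so the session's cookie jar, rather than
# a stale hard-coded header, supplies it.
headers.pop("Cookie", None)

session = requests.Session()
session.headers.update(headers)  # browser-like headers on every request

# Hypothetical bootstrap request to whatever page issues the cookies.
session.get('http://xxx/entry')

# Set-Cookie values from the response above live in session.cookies and
# are sent automatically here, so nothing expires inside the source code.
r = session.get(url)
print(r.text)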
One thing to keep in mind: when scraping with a program, always make sure the headers are really being set; if this app's anti-scraping defenses had banned the IP instead, things would have gotten much messier.
Original post: http://www.cnblogs.com/Jacck/p/7704832.html