I. Introduction
Pages to be scraped fall into two categories: static and dynamic. Scraping static pages is routine; with Douban Top250, for example, everything displayed is present in the HTML source. On a dynamic page, much of the content never appears in the HTML source at all, which is typical when the page is built with JavaScript.
A static page example:
The Douban Top250 page: https://movie.douban.com/top250?start=25&filter=
Press F12 and use Inspect (select the title 触不可及, The Intouchables); DevTools jumps straight to the element holding that title in the source.
Then right-click and open View Page Source.
The title text appears in the page source, so when scraping Douban Top250 we are dealing with a static page and can scrape this URL directly (that is, we can treat this URL as the real address).
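As a quick check (an addition of mine, not in the original post), requests alone can confirm this: if the title string occurs in the raw HTML, no JavaScript rendering was involved. The host is restored from context here, since the post shows only the path.
#coding:utf-8
import requests
url = "https://movie.douban.com/top250?start=25&filter="
headers = {'User-Agent': 'Mozilla/5.0'} # Douban tends to reject the default requests User-Agent
r = requests.get(url, headers=headers)
print(r.status_code)
print('触不可及' in r.text) # True means the title lives in the static HTML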
A dynamic page example:
The 万科 (Vanke) edit-history page: https://baike.baidu.com/historylist/%E4%B8%87%E7%A7%91/6141470#page1
On this page the Inspect tool can still locate the element, and a common mistake is to conclude from that alone that the page is static. The proper test is, again, to right-click and view the page source.
There is a block of code at the end of the HTML, and if you flip to other pages and view their source, that final block is still identical: the per-page data is not in the source.
We therefore conclude that this is a dynamic page. The page URL is not where the data is actually stored, so we need to find the real address. Two techniques are available for scraping dynamic pages.
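Before that, a quick hedged sanity check (my addition): fetch the raw HTML with requests and test whether a timestamp you can actually see in the rendered page occurs in it. The sample timestamp below is purely hypothetical; paste one from your own browser.
#coding:utf-8
import requests
url = "https://baike.baidu.com/historylist/%E4%B8%87%E7%A7%91/6141470" # host restored from context
headers = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=headers)
print(r.status_code)
visible_time = '2021-04-19 10:00' # hypothetical: copy any update time shown in the browser
print(visible_time in r.text) # False suggests the rows are injected by JavaScript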
II. Techniques for Scraping Dynamic Pages
1. Parse the real address and scrape it
1. Press F12 and open the Network tab.
2. Look for an xhr or json resource.
Clicking the Type column makes this easier, since it groups resources by type.
The Name of the xhr file in the screenshot below looks a lot like the address we used when scraping Douban.
Click that file to inspect it.
It holds a pile of data, with attributes such as times, ids, and names, so we conclude that this file's URL is the real address.
We can therefore scrape this address.
The code is as follows (two notes: the original post stripped the URL host and the Host header value, which are restored below from context as baike.baidu.com; the tk parameter and the Cookie are tied to the author's session and will have expired, so copy fresh values from your own DevTools):
#coding:utf-8
import requests
url = "/api/wikiui/gethistorylist?tk=10273b583b2d59dc49cbcfc1eb5cf5a3&lemmaId=6141470&from=3&count=1&size=25"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
'Host': 'baike.baidu.com',
'Cookie': 'BIDUPSID=66D853C540ACF3CA684E9852E1DCF1DB; PSTM=1590825717; BAIDUID=66D853C540ACF3CAE0B26F235ABD45ED:FG=1; BK_SEARCHLOG=%7B%22key%22%3A%5B%22hex%22%2C%22%E8%B5%9E%E5%8A%A9%E4%BA%BA%22%2C%22%E8%B5%9E%E5%8A%A9%E5%98%89%E5%AE%BE%22%2C%22SAML%22%2C%22%E7%94%A8%E6%88%B7%22%2C%22Rainbow%E8%A1%A8%E6%94%BB%E5%87%BB%E7%BB%95%E8%BF%87%E6%9C%80%E5%A4%A7%E5%A4%B1%E8%B4%A5%E7%99%BB%E5%BD%95%E9%99%90%E5%88%B6%22%2C%22%E7%89%B9%E6%9D%83%E5%8D%87%E7%BA%A7%E7%BD%91%E7%BB%9C%E5%AE%89%E5%85%A8%22%2C%22%E7%89%B9%E6%9D%83%E5%8D%87%E7%BA%A7%22%2C%22%E8%AE%A4%E5%8F%AF%22%2C%22%E5%90%8C%E8%B4%A8%E6%80%A7%22%5D%7D; H_WISE_SIDS=154770_153759_156158_155553_149355_156816_156287_150775_154259_148867_156096_154606_153243_153629_157262_157236_154172_156417_153065_156516_127969_154174_158527_150346_155803_146734_158745_131423_154037_107316_158054_158876_154189_155344_155255_157171_157790_144966_157401_154619_157814_158716_156726_157418_147551_157118_158367_158505_158589_157696_154639_154270_157472_110085_157006; BDUSS=klPNzU4NldvdkJOUG5DWk1zUWd0VUJPQjV4c3U2bHoxblBnS0NmcEtDT0hKR2hnRVFBQUFBJCQAAAAAAAAAAAEAAABOxNvnSnNrMjEyAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIeXQGCHl0Bgf; BDUSS_BFESS=klPNzU4NldvdkJOUG5DWk1zUWd0VUJPQjV4c3U2bHoxblBnS0NmcEtDT0hKR2hnRVFBQUFBJCQAAAAAAAAAAAEAAABOxNvnSnNrMjEyAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIeXQGCHl0Bgf; __yjs_duid=1_420ea1f33b6915a9547439f2725ac2811617111425851; Hm_lvt_55b574651fcae74b0a9f1cf9c8d7c93a=1617721790,1617721848,1617721949,1618794964; BAIDUID_BFESS=07052B6E9DBAC2EF4CB760778355D0A6:FG=1; __yjs_st=2_ZDc0Y2QxNzIyMmRmYzlmMjdjNjBmZmYzNGQ1MjcwN2IzMTU5Mzk5ODZmZGUxODJiYjQxMGU1NDQ2MGQwNjNmMmYzOTc1YmY4ZDcxZjgxYjNhYjI4M2I0ODU5ZTNmNGEyNGM4NDE0NTFmZDc3NjBjZGU0YWRmMTgwNmQxZjNhMzllNzE1MTYyZjNmYTMwNmNlNTZmMmEyNmI0MzJkMGY3MGI0Zjc3NGE0N2QxZDY2NjkxMjAwMmIyYzIyODg3NjkzNDQ3NWQ3ZjYyOWZiNTExMGEwZDRiOTVjNGZiNDg2NGU4ZDYzYmVmMzkxMTU3OTAyOGU5OWU3N2Q3YjRiMDk2N183Xzc0YjEzMTY3; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; BDRCVFR[VbLWpd7HGum]=mk3SLVN4HKm; H_PS_PSSID=33839_33822_31254_33849_33760_33607_26350_33893; ab_sr=1.0.0_Y2YwMDFlM2ZmMTMzODEzNmYxNzE5M2M0YTk0ZDQ1ZTM1YTZjMzE5OTVkMzFlM2M2YzcxOWI3ZDBmODNlOTU4YTA5NDY1YTUwNTYwMTI4OTMyYWZiM2VlNGE5MzhiZmI2NDNjZmRjMmE4YWE1NTFmZGZmMjg5NWIzMWI4OTM1ODE=',
}
r = requests.get(url,headers=headers)
print(r.status_code)
print(r.text)
The request succeeds, but the returned text is not yet what we want; we need to parse the data as JSON.
The code is as follows:
#coding:utf-8
import requests
import json
url = "/api/wikiui/gethistorylist?tk=10273b583b2d59dc49cbcfc1eb5cf5a3&lemmaId=6141470&from=3&count=1&size=25"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
'Host': 'baike.baidu.com',
'Cookie': 'BIDUPSID=66D853C540ACF3CA684E9852E1DCF1DB; PSTM=1590825717; BAIDUID=66D853C540ACF3CAE0B26F235ABD45ED:FG=1; BK_SEARCHLOG=%7B%22key%22%3A%5B%22hex%22%2C%22%E8%B5%9E%E5%8A%A9%E4%BA%BA%22%2C%22%E8%B5%9E%E5%8A%A9%E5%98%89%E5%AE%BE%22%2C%22SAML%22%2C%22%E7%94%A8%E6%88%B7%22%2C%22Rainbow%E8%A1%A8%E6%94%BB%E5%87%BB%E7%BB%95%E8%BF%87%E6%9C%80%E5%A4%A7%E5%A4%B1%E8%B4%A5%E7%99%BB%E5%BD%95%E9%99%90%E5%88%B6%22%2C%22%E7%89%B9%E6%9D%83%E5%8D%87%E7%BA%A7%E7%BD%91%E7%BB%9C%E5%AE%89%E5%85%A8%22%2C%22%E7%89%B9%E6%9D%83%E5%8D%87%E7%BA%A7%22%2C%22%E8%AE%A4%E5%8F%AF%22%2C%22%E5%90%8C%E8%B4%A8%E6%80%A7%22%5D%7D; H_WISE_SIDS=154770_153759_156158_155553_149355_156816_156287_150775_154259_148867_156096_154606_153243_153629_157262_157236_154172_156417_153065_156516_127969_154174_158527_150346_155803_146734_158745_131423_154037_107316_158054_158876_154189_155344_155255_157171_157790_144966_157401_154619_157814_158716_156726_157418_147551_157118_158367_158505_158589_157696_154639_154270_157472_110085_157006; BDUSS=klPNzU4NldvdkJOUG5DWk1zUWd0VUJPQjV4c3U2bHoxblBnS0NmcEtDT0hKR2hnRVFBQUFBJCQAAAAAAAAAAAEAAABOxNvnSnNrMjEyAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIeXQGCHl0Bgf; BDUSS_BFESS=klPNzU4NldvdkJOUG5DWk1zUWd0VUJPQjV4c3U2bHoxblBnS0NmcEtDT0hKR2hnRVFBQUFBJCQAAAAAAAAAAAEAAABOxNvnSnNrMjEyAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIeXQGCHl0Bgf; __yjs_duid=1_420ea1f33b6915a9547439f2725ac2811617111425851; Hm_lvt_55b574651fcae74b0a9f1cf9c8d7c93a=1617721790,1617721848,1617721949,1618794964; BAIDUID_BFESS=07052B6E9DBAC2EF4CB760778355D0A6:FG=1; __yjs_st=2_ZDc0Y2QxNzIyMmRmYzlmMjdjNjBmZmYzNGQ1MjcwN2IzMTU5Mzk5ODZmZGUxODJiYjQxMGU1NDQ2MGQwNjNmMmYzOTc1YmY4ZDcxZjgxYjNhYjI4M2I0ODU5ZTNmNGEyNGM4NDE0NTFmZDc3NjBjZGU0YWRmMTgwNmQxZjNhMzllNzE1MTYyZjNmYTMwNmNlNTZmMmEyNmI0MzJkMGY3MGI0Zjc3NGE0N2QxZDY2NjkxMjAwMmIyYzIyODg3NjkzNDQ3NWQ3ZjYyOWZiNTExMGEwZDRiOTVjNGZiNDg2NGU4ZDYzYmVmMzkxMTU3OTAyOGU5OWU3N2Q3YjRiMDk2N183Xzc0YjEzMTY3; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; BDRCVFR[VbLWpd7HGum]=mk3SLVN4HKm; H_PS_PSSID=33839_33822_31254_33849_33760_33607_26350_33893; ab_sr=1.0.0_Y2YwMDFlM2ZmMTMzODEzNmYxNzE5M2M0YTk0ZDQ1ZTM1YTZjMzE5OTVkMzFlM2M2YzcxOWI3ZDBmODNlOTU4YTA5NDY1YTUwNTYwMTI4OTMyYWZiM2VlNGE5MzhiZmI2NDNjZmRjMmE4YWE1NTFmZGZmMjg5NWIzMWI4OTM1ODE=',
}
r = requests.get(url,headers=headers)
print(r.status_code)
print(r.text)
json_data = json.loads(r.text)
print(json_data)
comment_list = json_data['data']
print(comment_list)
You can now see the JSON-parsed result, and also inspect what sits under data after parsing. A pattern emerges from the output: the update times we want live in the data dict, under its pages dict, under the key '3', which holds a list of dicts, each carrying an auditTime field.
The code is as follows:
#coding:utf-8
import requests
import json
url = "/api/wikiui/gethistorylist?tk=10273b583b2d59dc49cbcfc1eb5cf5a3&lemmaId=6141470&from=3&count=1&size=25"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
'Host': 'baike.baidu.com',
'Cookie': 'BIDUPSID=66D853C540ACF3CA684E9852E1DCF1DB; PSTM=1590825717; BAIDUID=66D853C540ACF3CAE0B26F235ABD45ED:FG=1; BK_SEARCHLOG=%7B%22key%22%3A%5B%22hex%22%2C%22%E8%B5%9E%E5%8A%A9%E4%BA%BA%22%2C%22%E8%B5%9E%E5%8A%A9%E5%98%89%E5%AE%BE%22%2C%22SAML%22%2C%22%E7%94%A8%E6%88%B7%22%2C%22Rainbow%E8%A1%A8%E6%94%BB%E5%87%BB%E7%BB%95%E8%BF%87%E6%9C%80%E5%A4%A7%E5%A4%B1%E8%B4%A5%E7%99%BB%E5%BD%95%E9%99%90%E5%88%B6%22%2C%22%E7%89%B9%E6%9D%83%E5%8D%87%E7%BA%A7%E7%BD%91%E7%BB%9C%E5%AE%89%E5%85%A8%22%2C%22%E7%89%B9%E6%9D%83%E5%8D%87%E7%BA%A7%22%2C%22%E8%AE%A4%E5%8F%AF%22%2C%22%E5%90%8C%E8%B4%A8%E6%80%A7%22%5D%7D; H_WISE_SIDS=154770_153759_156158_155553_149355_156816_156287_150775_154259_148867_156096_154606_153243_153629_157262_157236_154172_156417_153065_156516_127969_154174_158527_150346_155803_146734_158745_131423_154037_107316_158054_158876_154189_155344_155255_157171_157790_144966_157401_154619_157814_158716_156726_157418_147551_157118_158367_158505_158589_157696_154639_154270_157472_110085_157006; BDUSS=klPNzU4NldvdkJOUG5DWk1zUWd0VUJPQjV4c3U2bHoxblBnS0NmcEtDT0hKR2hnRVFBQUFBJCQAAAAAAAAAAAEAAABOxNvnSnNrMjEyAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIeXQGCHl0Bgf; BDUSS_BFESS=klPNzU4NldvdkJOUG5DWk1zUWd0VUJPQjV4c3U2bHoxblBnS0NmcEtDT0hKR2hnRVFBQUFBJCQAAAAAAAAAAAEAAABOxNvnSnNrMjEyAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIeXQGCHl0Bgf; __yjs_duid=1_420ea1f33b6915a9547439f2725ac2811617111425851; Hm_lvt_55b574651fcae74b0a9f1cf9c8d7c93a=1617721790,1617721848,1617721949,1618794964; BAIDUID_BFESS=07052B6E9DBAC2EF4CB760778355D0A6:FG=1; __yjs_st=2_ZDc0Y2QxNzIyMmRmYzlmMjdjNjBmZmYzNGQ1MjcwN2IzMTU5Mzk5ODZmZGUxODJiYjQxMGU1NDQ2MGQwNjNmMmYzOTc1YmY4ZDcxZjgxYjNhYjI4M2I0ODU5ZTNmNGEyNGM4NDE0NTFmZDc3NjBjZGU0YWRmMTgwNmQxZjNhMzllNzE1MTYyZjNmYTMwNmNlNTZmMmEyNmI0MzJkMGY3MGI0Zjc3NGE0N2QxZDY2NjkxMjAwMmIyYzIyODg3NjkzNDQ3NWQ3ZjYyOWZiNTExMGEwZDRiOTVjNGZiNDg2NGU4ZDYzYmVmMzkxMTU3OTAyOGU5OWU3N2Q3YjRiMDk2N183Xzc0YjEzMTY3; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; BDRCVFR[VbLWpd7HGum]=mk3SLVN4HKm; H_PS_PSSID=33839_33822_31254_33849_33760_33607_26350_33893; ab_sr=1.0.0_Y2YwMDFlM2ZmMTMzODEzNmYxNzE5M2M0YTk0ZDQ1ZTM1YTZjMzE5OTVkMzFlM2M2YzcxOWI3ZDBmODNlOTU4YTA5NDY1YTUwNTYwMTI4OTMyYWZiM2VlNGE5MzhiZmI2NDNjZmRjMmE4YWE1NTFmZGZmMjg5NWIzMWI4OTM1ODE=',
}
r = requests.get(url,headers=headers)
print(r.status_code)
print(r.text)
json_data = json.loads(r.text)
print(json_data)
comment_list = json_data['data']
print(comment_list)
comment_list = json_data['data']['pages']['3']
for eachone in comment_list:
    time1 = eachone['auditTime']
    print(time1)
We successfully get timestamps, but raw Unix timestamps are still not the result we want; let's convert them to Beijing time. (Note that time.localtime converts using the machine's local timezone, so the printout is Beijing time only on a machine set to UTC+8; a timezone-pinned variant is sketched at the end of this technique.)
The code is as follows:
#coding:utf-8
import requests
import json
import time
url = "/api/wikiui/gethistorylist?tk=10273b583b2d59dc49cbcfc1eb5cf5a3&lemmaId=6141470&from=3&count=1&size=25"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36',
'Host': 'baike.baidu.com',
'Cookie': 'BIDUPSID=66D853C540ACF3CA684E9852E1DCF1DB; PSTM=1590825717; BAIDUID=66D853C540ACF3CAE0B26F235ABD45ED:FG=1; BK_SEARCHLOG=%7B%22key%22%3A%5B%22hex%22%2C%22%E8%B5%9E%E5%8A%A9%E4%BA%BA%22%2C%22%E8%B5%9E%E5%8A%A9%E5%98%89%E5%AE%BE%22%2C%22SAML%22%2C%22%E7%94%A8%E6%88%B7%22%2C%22Rainbow%E8%A1%A8%E6%94%BB%E5%87%BB%E7%BB%95%E8%BF%87%E6%9C%80%E5%A4%A7%E5%A4%B1%E8%B4%A5%E7%99%BB%E5%BD%95%E9%99%90%E5%88%B6%22%2C%22%E7%89%B9%E6%9D%83%E5%8D%87%E7%BA%A7%E7%BD%91%E7%BB%9C%E5%AE%89%E5%85%A8%22%2C%22%E7%89%B9%E6%9D%83%E5%8D%87%E7%BA%A7%22%2C%22%E8%AE%A4%E5%8F%AF%22%2C%22%E5%90%8C%E8%B4%A8%E6%80%A7%22%5D%7D; H_WISE_SIDS=154770_153759_156158_155553_149355_156816_156287_150775_154259_148867_156096_154606_153243_153629_157262_157236_154172_156417_153065_156516_127969_154174_158527_150346_155803_146734_158745_131423_154037_107316_158054_158876_154189_155344_155255_157171_157790_144966_157401_154619_157814_158716_156726_157418_147551_157118_158367_158505_158589_157696_154639_154270_157472_110085_157006; BDUSS=klPNzU4NldvdkJOUG5DWk1zUWd0VUJPQjV4c3U2bHoxblBnS0NmcEtDT0hKR2hnRVFBQUFBJCQAAAAAAAAAAAEAAABOxNvnSnNrMjEyAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIeXQGCHl0Bgf; BDUSS_BFESS=klPNzU4NldvdkJOUG5DWk1zUWd0VUJPQjV4c3U2bHoxblBnS0NmcEtDT0hKR2hnRVFBQUFBJCQAAAAAAAAAAAEAAABOxNvnSnNrMjEyAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAIeXQGCHl0Bgf; __yjs_duid=1_420ea1f33b6915a9547439f2725ac2811617111425851; Hm_lvt_55b574651fcae74b0a9f1cf9c8d7c93a=1617721790,1617721848,1617721949,1618794964; BAIDUID_BFESS=07052B6E9DBAC2EF4CB760778355D0A6:FG=1; __yjs_st=2_ZDc0Y2QxNzIyMmRmYzlmMjdjNjBmZmYzNGQ1MjcwN2IzMTU5Mzk5ODZmZGUxODJiYjQxMGU1NDQ2MGQwNjNmMmYzOTc1YmY4ZDcxZjgxYjNhYjI4M2I0ODU5ZTNmNGEyNGM4NDE0NTFmZDc3NjBjZGU0YWRmMTgwNmQxZjNhMzllNzE1MTYyZjNmYTMwNmNlNTZmMmEyNmI0MzJkMGY3MGI0Zjc3NGE0N2QxZDY2NjkxMjAwMmIyYzIyODg3NjkzNDQ3NWQ3ZjYyOWZiNTExMGEwZDRiOTVjNGZiNDg2NGU4ZDYzYmVmMzkxMTU3OTAyOGU5OWU3N2Q3YjRiMDk2N183Xzc0YjEzMTY3; BDORZ=FFFB88E999055A3F8A630C64834BD6D0; BDRCVFR[VbLWpd7HGum]=mk3SLVN4HKm; H_PS_PSSID=33839_33822_31254_33849_33760_33607_26350_33893; ab_sr=1.0.0_Y2YwMDFlM2ZmMTMzODEzNmYxNzE5M2M0YTk0ZDQ1ZTM1YTZjMzE5OTVkMzFlM2M2YzcxOWI3ZDBmODNlOTU4YTA5NDY1YTUwNTYwMTI4OTMyYWZiM2VlNGE5MzhiZmI2NDNjZmRjMmE4YWE1NTFmZGZmMjg5NWIzMWI4OTM1ODE=',
}
r = requests.get(url,headers=headers)
print(r.status_code)
print(r.text)
json_data = json.loads(r.text)
print(json_data)
comment_list = json_data['data']
print(comment_list)
comment_list = json_data['data']['pages']['3']
for eachone in comment_list:
    time1 = eachone['auditTime']
    print(time1)
    time_tuple = time.localtime(time1)
    bj_time = time.strftime("%Y/%m/%d %H:%M:%S", time_tuple)
    print("Beijing time:", bj_time)
Beijing time is displayed successfully, which achieves our goal.
Verification successful.
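To round off this technique, here is a compact sketch (my addition, not from the original post) that folds the steps above into one function and pins the offset to UTC+8, so the output is Beijing time no matter what timezone the machine uses. It reuses the url and headers defined above; the page key '3' is simply the one observed in this response.
#coding:utf-8
import requests
from datetime import datetime, timezone, timedelta
BEIJING = timezone(timedelta(hours=8)) # fixed UTC+8, independent of the local timezone
def fetch_update_times(url, headers):
    # Fetch the history-list JSON and yield formatted audit times.
    r = requests.get(url, headers=headers, timeout=10)
    r.raise_for_status() # fail loudly on a bad status code
    json_data = r.json() # equivalent to json.loads(r.text)
    for item in json_data['data']['pages']['3']: # page key observed in the response above
        yield datetime.fromtimestamp(item['auditTime'], BEIJING).strftime('%Y/%m/%d %H:%M:%S')
# Usage, with the url and headers from the code above:
# for t in fetch_update_times(url, headers):
#     print('Beijing time:', t)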
2. Simulate the browser and scrape
This method uses the browser's rendering engine directly: as the page is displayed, the browser parses the HTML, applies the CSS styles, and executes the JavaScript. During the crawl it opens a browser, loads the page, drives the browser through each page automatically, and grabs the data along the way. Put simply, browser rendering turns scraping a dynamic page into scraping a static one.
Selenium is a browser-automation library with Python bindings that can drive several different browsers; here we use Firefox.
The code is as follows:
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('https://www.baidu.com') # the post opens Baidu's homepage; host restored from context
The program throws an error. Recent Selenium releases no longer ship a browser driver, so Firefox's geckodriver has to be installed before Selenium can drive the browser.
The fix:
1. Download geckodriver (download address: https://github.com/mozilla/geckodriver/releases) and unzip it.
2. Put the unzipped geckodriver.exe in the Firefox installation directory, e.g. C:\Program Files\Mozilla Firefox, and add that directory (C:\Program Files\Mozilla Firefox) to the Path environment variable.
Then restart the machine and run it again; the result is as follows:
Firefox does open, but the result is disappointing: Baidu is not displayed, because the executable path has not been set.
The fix: change webdriver.Firefox() to webdriver.Firefox(executable_path=r'path to geckodriver.exe').
The code is as follows:
from selenium import webdriver
browser = webdriver.Firefox(executable_path=r'C:\Program Files\Mozilla Firefox\geckodriver.exe')
browser.get("https://www.baidu.com")
We can see that Baidu, the page we wanted, now loads successfully. At this point Selenium is configured.
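Optionally (another addition of mine): if you do not need to watch the browser, Firefox can run headless under the same geckodriver setup. A minimal sketch:
from selenium import webdriver
options = webdriver.FirefoxOptions()
options.headless = True # render pages without opening a visible window
browser = webdriver.Firefox(options=options, executable_path=r'C:\Program Files\Mozilla Firefox\geckodriver.exe')
browser.get("https://www.baidu.com")
print(browser.title) # confirms the page loaded even though no window is shown
browser.quit()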
Let's now continue and scrape the 万科 update times.
Open the 万科 page with this same method (just change the URL passed to browser.get() to the 万科 URL):
Press F12 again and use Inspect to find the element.
Looking at the time elements in the screenshot above, they all sit inside the <tbody> tag, and each is additionally under a class named submitTime. So here we use WebDriver's methods for locating elements on the page.
Extension (a note on newer Selenium versions follows this list):
# All of these locate a single element and return the first match. To get every match,
# add an "s" to element (find_elements_...), which returns a list of matching elements.
element = driver.find_element_by_id() # locate by the tag's id attribute; takes the id value
driver.find_element_by_name() # locate by the tag's name attribute; takes the name value
driver.find_element_by_xpath() # locate by XPath; takes an XPath expression
driver.find_element_by_link_text() # locate a link by its full text; takes the complete text
driver.find_element_by_partial_link_text() # locate a link by part of its text; takes partial text
driver.find_element_by_tag_name() # locate by tag name; takes the tag name
driver.find_element_by_class_name() # locate by the tag's class; takes the class value
driver.find_element_by_css_selector() # locate via a CSS selector; takes selector syntax
# The returned WebElement object can itself be used for further nested lookups.
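One caveat worth adding (not in the original post): all of the find_element_by_* helpers listed above were removed in Selenium 4.3; newer code uses a single find_element/find_elements method together with a By locator. A minimal equivalent sketch:
from selenium import webdriver
from selenium.webdriver.common.by import By
driver = webdriver.Firefox() # Selenium 4.6+ can fetch geckodriver itself via Selenium Manager
driver.get("https://baike.baidu.com/historylist/%E4%B8%87%E7%A7%91/6141470#page1")
first = driver.find_element(By.CLASS_NAME, 'submitTime') # first match only
all_cells = driver.find_elements(By.CLASS_NAME, 'submitTime') # list of all matches
print(first.text, len(all_cells))
driver.quit()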
The code is as follows:
from selenium import webdriver
browser = webdriver.Firefox(executable_path=r'C:\Program Files\Mozilla Firefox\geckodriver.exe')
browser.get("https://baike.baidu.com/historylist/%E4%B8%87%E7%A7%91/6141470#page1")
comment = browser.find_element_by_tag_name('tbody')
print(comment.text)
We successfully get the content under <tbody>.
Next, we grab the content inside the class named submitTime (which is exactly the content we want).
The code is as follows:
from selenium import webdriver
browser = webdriver.Firefox(executable_path=r'C:\Program Files\Mozilla Firefox\geckodriver.exe')
browser.get("https://baike.baidu.com/historylist/%E4%B8%87%E7%A7%91/6141470#page1")
comment = browser.find_element_by_tag_name('tbody')
content = comment.find_element_by_class_name('submitTime')
print(content.text)
Only the first update time comes back. As noted in the extension above, find_element returns just the first match; change element to elements to get all matching results.
The code is as follows:
from selenium import webdriver
browser = webdriver.Firefox(executable_path=r'C:\Program Files\Mozilla Firefox\geckodriver.exe')
browser.get("https://baike.baidu.com/historylist/%E4%B8%87%E7%A7%91/6141470#page1")
comment = browser.find_element_by_tag_name('tbody')
content = comment.find_elements_by_class_name('submitTime')
for each_item in content:
    print(each_item.text)
We successfully obtain all the update times. Note that once element becomes elements, the return value is a list, so we loop over it to print each entry.
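As a final hardening step (my addition): the rows are injected by JavaScript, so they may not be in the DOM the instant get() returns. An explicit wait is safer than assuming they are, and quitting the driver releases the browser:
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
browser = webdriver.Firefox(executable_path=r'C:\Program Files\Mozilla Firefox\geckodriver.exe')
try:
    browser.get("https://baike.baidu.com/historylist/%E4%B8%87%E7%A7%91/6141470#page1")
    # Wait up to 10 seconds for at least one submitTime cell to appear in the DOM.
    WebDriverWait(browser, 10).until(EC.presence_of_element_located((By.CLASS_NAME, 'submitTime')))
    for cell in browser.find_elements_by_class_name('submitTime'):
        print(cell.text)
finally:
    browser.quit() # always close the browser, even on error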
Verification successful. That completes the walkthrough.