Data collected: each region's name and its link:
![python3 [crawler in practice] scraping Anjuke with selenium](https://image.shishitao.com:8440/aHR0cHM6Ly91cGxvYWQtaW1hZ2VzLmppYW5zaHUuaW8vdXBsb2FkX2ltYWdlcy82MTUyNTk1LWExMTlmOWNhMjg4YjM5MzAucG5nP2ltYWdlTW9ncjIvYXV0by1vcmllbnQvc3RyaXAlN0NpbWFnZVZpZXcyLzIvdy8xMjQw.png?w=700&webp=1)
Anjuke details
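- At first I tried to scrape the site directly with the requests library, but no data came back: the response was just an error page saying the requested page had a problem.
- I have covered how to use selenium in an earlier post on my blog.
- The code: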
```python
# -*- coding: utf-8 -*-
# @Time :
# @Author :
# @Email :
# @File :
import re
import time

from openpyxl import Workbook
from selenium import webdriver

num0 = 1  # row counter; row 1 holds the header
baseurl = 'https://www.anjuke.com/sy-city.html'

wb = Workbook()
ws = wb.active
ws.title = '安居客'  # sheet title: "Anjuke"
ws.cell(row=1, column=1).value = '城市链接'  # header: city link
ws.cell(row=1, column=2).value = '城市名称'  # header: city name


def gethtml():
    # Assumes Selenium 4.6 or newer, which locates a matching chromedriver
    # itself. On Selenium 3 the original call was
    # webdriver.Chrome("chromedriver.exe").
    browser = webdriver.Chrome()
    try:
        browser.get(baseurl)
        time.sleep(5)  # give the page time to load
        # Scroll down the page: window.scrollBy(0, scrollStep),
        # where scrollStep is the distance of each scroll step
        browser.execute_script('window.scrollBy(0,3000)')
        browser.execute_script('window.scrollBy(0,5000)')
        return browser.page_source
    finally:
        browser.quit()  # always close the browser


def parseHotBook(html):
    # Capture everything between <a href=" and </a>, i.e. 'link">name'
    reg_author = re.compile(r'<a href="(.*?)</a>')
    global num0
    for info in reg_author.findall(html):
        verinfo = info.split('">', 1)
        if len(verinfo) != 2:
            continue  # anchor does not have the 'link">name' shape
        # Drop trailing attributes such as ' class="hot' from the link
        link = verinfo[0].split('"')[0]
        name = verinfo[1]
        print(link, name)
        num0 = num0 + 1
        ws.cell(row=num0, column=1).value = link
        ws.cell(row=num0, column=2).value = name
    wb.save('安居客2.xlsx')
    print('爬取成功')  # "scrape succeeded"


if __name__ == '__main__':
    html = gethtml()
    parseHotBook(html)
```
The stored text still has some flaws, because it is extracted with a regular expression that does not match very strictly.
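One possible way to tighten the extraction is to parse the anchors with a real HTML parser instead of a regular expression. The following is only a minimal sketch using the standard library's `html.parser`, not the approach used in this post. The names `AnchorCollector` and `extract_city_links` are hypothetical, and the filter (keep hosts ending in `.anjuke.com` but not starting with `www.`) is an assumption about what city URLs look like, which should be checked against the real page.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class AnchorCollector(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip()
            if text:
                self.links.append((self._href, text))
            self._href = None


def extract_city_links(html):
    """Return unique (link, name) pairs that look like city pages.

    Assumption: city pages live on subdomains such as
    beijing.anjuke.com, so www.anjuke.com links are skipped.
    """
    parser = AnchorCollector()
    parser.feed(html)
    seen, result = set(), []
    for href, name in parser.links:
        host = urlparse(href).netloc
        is_city = host.endswith(".anjuke.com") and not host.startswith("www.")
        if is_city and href not in seen:
            seen.add(href)
            result.append((href, name))
    return result


# Made-up sample markup, only to exercise the function
sample = (
    '<a href="https://beijing.anjuke.com" class="hot">北京</a>'
    '<a href="https://shanghai.anjuke.com">上海</a>'
    '<a href="https://beijing.anjuke.com">北京</a>'
    '<a href="https://www.anjuke.com/about">关于</a>'
)
print(extract_city_links(sample))
```

Because the parser reads attributes separately from the link text, extra attributes such as `class="hot"` never leak into the stored values, and repeated cities are written only once.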
Here is the scraped output:
![python3 [crawler in practice] scraping Anjuke with selenium](https://image.shishitao.com:8440/aHR0cHM6Ly91cGxvYWQtaW1hZ2VzLmppYW5zaHUuaW8vdXBsb2FkX2ltYWdlcy82MTUyNTk1LWQ3MDRjZTg0MTQ5NTRkMTgucG5nP2ltYWdlTW9ncjIvYXV0by1vcmllbnQvc3RyaXAlN0NpbWFnZVZpZXcyLzIvdy8xMjQw.png?w=700&webp=1)
Scraped Anjuke content