1. Configure WebDriver
Download the Chrome browser driver (chromedriver) and set it up.
import time
import random
from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

if __name__ == '__main__':
    options = webdriver.ChromeOptions()
    options.binary_location = r'c:\users\hhh\appdata\local\google\chrome\application\谷歌浏览器.exe'
    # driver = webdriver.Chrome(executable_path=r'd:\360chrome\chromedriver\chromedriver.exe')
    driver = webdriver.Chrome(options=options)
    # Take the Java column as an example
    driver.get('https://www.csdn.net/nav/java')
    # Scroll to the bottom repeatedly so the page lazy-loads more entries
    for i in range(1, 20):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(2)
2. Get the URLs
from bs4 import BeautifulSoup
from lxml import etree

html = etree.HTML(driver.page_source)
# soup = BeautifulSoup(html, 'lxml')
# soup_herf = soup.find_all("#feedlist_id > li:nth-child(1) > div > div > h2 > a")
# soup_herf
title = html.xpath('//*[@id="feedlist_id"]/li/div/div/h2/a/@href')
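The XPath above pulls the href attribute from every article link in the feed list in one call. A minimal, self-contained sketch of the same technique, using made-up markup that stands in for CSDN's real page structure:

```python
from lxml import etree

# Hypothetical HTML mimicking the structure the XPath expects
page = """
<ul id="feedlist_id">
  <li><div><div><h2><a href="https://blog.csdn.net/a1">Post 1</a></h2></div></div></li>
  <li><div><div><h2><a href="https://blog.csdn.net/a2">Post 2</a></h2></div></div></li>
</ul>
"""

tree = etree.HTML(page)
# @href at the end of the path returns the attribute values as strings
hrefs = tree.xpath('//*[@id="feedlist_id"]/li/div/div/h2/a/@href')
print(hrefs)  # → ['https://blog.csdn.net/a1', 'https://blog.csdn.net/a2']
```

Because the path ends in `/@href`, the result is already a list of URL strings, with no extra attribute lookup needed.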
As you can see, this scrapes a large batch of URLs in one go, and very quickly.
3. Write to Redis
After importing the redis package, configure the Redis port and database, then write the URLs with the rpush function.
Start Redis first.
import redis

# decode_responses=True returns str instead of bytes; db=1 selects database 1
r_link = redis.Redis(port=6379, host='localhost', decode_responses=True, db=1)

for u in title:
    print("About to write {}".format(u))
    r_link.rpush("csdn_url", u)
    print("{} written successfully!".format(u))

print('=' * 30, '\n', "Total URLs written: {}".format(len(title)), '\n', '=' * 30)
Done!
In Redis Desktop Manager you can see that both the scraping and the writing are very fast.
To consume the URLs, just pop them off the list with rpop.
one_url = r_link.rpop("csdn_url")
while one_url:
    print("{} popped!".format(one_url))
    one_url = r_link.rpop("csdn_url")
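Since rpush and rpop both work on the right end of the list, the pair behaves as a stack: the last URL written is the first one popped. A quick sketch of that ordering, with a plain collections.deque standing in for the Redis list so no server is needed:

```python
from collections import deque

urls = deque()

# rpush: append to the right end of the list
for u in ["url1", "url2", "url3"]:
    urls.append(u)

# rpop: remove from the right end as well, so order reverses (LIFO)
popped = []
while urls:
    popped.append(urls.pop())

print(popped)  # → ['url3', 'url2', 'url1']
```

If first-in-first-out order matters for your crawler, pair rpush with lpop instead, which consumes from the opposite end of the list.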
That concludes this article on scraping popular CSDN URLs with Python and storing them in Redis. For more on scraping URLs with Python, please search 服务器之家's earlier articles or continue browsing the related articles below, and we hope you will keep supporting 服务器之家!
原文链接:https://blog.csdn.net/Rex__404/article/details/115366167