jianshu-crawl: Crawling the entire Jianshu site with Scrapy + Selenium (download)

[File attributes]:

File name: jianshu-crawl: Crawling the entire Jianshu site with Scrapy + Selenium

File size: 57KB

File format: ZIP

Last updated: 2024-04-21 15:16:57

Python

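Crawling the entire Jianshu site with Scrapy + Selenium.

Environment: Ubuntu 18.04, Python 3.8, Scrapy 2.1.

Fields crawled: title, author, author avatar, publication date, content, article link, article ID.

Approach: analyze the URL rule of Jianshu articles; request pages with Selenium; extract the required data with XPath; store the data in MySQL as a first pass (to improve storage efficiency).

Implementation. Preparation: create the Scrapy project, then create the CrawlSpider spider file along with the pipelines and middleware. First step: analyze the URLs of Jianshu articles. The URL rule is jianshu.com/p/<article ID>; set this rule in the CrawlSpider: class JsSpider(CrawlSpider): name = 'js' allowed_domains = ['jianshu.com']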
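The URL rule described above can be sketched with nothing but the standard library. This is a minimal illustration and not the project's own code: the helper name extract_article_id and the assumption that an article ID is exactly 12 lowercase alphanumeric characters are hypothetical, since the description only gives the shape jianshu.com/p/<article ID>. In a CrawlSpider, a pattern like this would typically be passed to a Rule via LinkExtractor(allow=...).

```python
import re
from typing import Optional

# Assumption (not stated in the source): an article ID is 12 characters
# drawn from 0-9 and a-z. The source only gives jianshu.com/p/<article ID>.
ARTICLE_URL_PATTERN = re.compile(
    r"jianshu\.com/p/(?P<article_id>[0-9a-z]{12})(?:[/?#]|$)"
)


def extract_article_id(url: str) -> Optional[str]:
    """Return the article ID if the URL looks like an article page, else None."""
    match = ARTICLE_URL_PATTERN.search(url)
    return match.group("article_id") if match else None


if __name__ == "__main__":
    # Placeholder URLs for illustration only.
    for candidate in (
        "https://www.jianshu.com/p/0123456789ab",
        "https://www.jianshu.com/u/0123456789ab",
    ):
        print(candidate, "->", extract_article_id(candidate))
```

Keeping the pattern in one named constant means the same rule can decide which links to follow and also supply the article ID field for storage.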


[File preview]:
jianshu-crawl-master
----jianshu_crawl()
--------middlewares.py(1KB)
--------spiders()
--------__init__.py(0B)
--------pipelines.py(3KB)
--------__pycache__()
--------start.py(71B)
--------settings.py(3KB)
--------items.py(414B)
----README.md(8KB)
----.idea()
--------misc.xml(297B)
--------workspace.xml(14KB)
--------vcs.xml(180B)
--------dataSources.xml(770B)
--------dataSources.local.xml(2KB)
--------inspectionProfiles()
--------dataSources()
--------modules.xml(278B)
--------jianshu_crawl.iml(326B)
----scrapy.cfg(269B)
----img()
--------image-20200508174922373.png(23KB)
