Python Crawler in Practice, Part 2 | Scraping a Completed Novel from a Novel Site

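As everyone knows, reading novels on novel websites means putting up with all kinds of ads, and downloading a novel costs either money or a membership. So why not write a novel crawler that grabs every chapter from the site and assembles them into one complete txt file? Wouldn't that be pleasant!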

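The first crawler:

Figure: output of the first crawler

The first crawler handles the changing chapter URL in the urlChange() function by counting up through the page numbers. In the end, however, Xiaodong discovered that the novel's chapter pages are not numbered consecutively. Ouch!!! Haha~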

# Name: scrape novel content
# Author: DYBOY 小东
# Date: 2017-09-07

'''
Novel URL: http://www.quanshuwang.com/book/44/44683/
Chapter URLs, starting with chapter 1:
    http://www.quanshuwang.com/book/44/44683/15379609.html
    http://www.quanshuwang.com/book/44/44683/15379610.html
    http://www.quanshuwang.com/book/44/44683/15380350.html

'''

import requests

from bs4 import BeautifulSoup
# Basic imports above

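# Return the title + body text of one chapter page
def getContent(content_url):
    res = requests.get(content_url, timeout=10)
    res.encoding = 'gbk'  # decode the page as GBK
    soup = BeautifulSoup(res.text, 'html.parser')
    # Note: lstrip/rstrip remove any leading/trailing characters found in the
    # given string (a character set), not a literal prefix or suffix
    title = soup.select('.jieqi_title')[0].text.lstrip('章 节目录 ')
    content = soup.select('#content')[0].text.lstrip('style5();').rstrip('style6();')
    both = title + content
    return both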

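def urlChange():
    i = 0
    url = 'http://www.quanshuwang.com/book/44/44683/153'
    # The with-block closes the file even if a request fails midway
    with open("dldl.txt", 'w', encoding='utf-8') as f:
        # range() excludes its end value, so 80351 is needed to reach 15380350
        for num in range(79609, 80351):
            curl = url + str(num) + '.html'
            contents = getContent(curl)
            print(contents, file=f)
            i = i + 1
            print(i)  # progress counter
    print('ok!!!')

# MAIN
urlChange()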

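So how do we solve this? Every chapter page has a "next chapter" button, so we can grab the URL of the next chapter from the current page and simply repeat!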
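To make the idea concrete, here is a minimal sketch of pulling the next-chapter link out of a page using only the standard library (html.parser stands in for BeautifulSoup here so the snippet runs without extra packages). It assumes the link is an a element whose class list contains next. NextLinkParser, find_next_url, the sample HTML and the example.com URLs are all hypothetical names made up for illustration, and the real page markup may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class NextLinkParser(HTMLParser):
    """Remember the href of the first <a> tag whose class list contains 'next'."""

    def __init__(self):
        super().__init__()
        self.href = None

    def handle_starttag(self, tag, attrs):
        if tag != 'a' or self.href is not None:
            return
        attr_map = dict(attrs)
        classes = (attr_map.get('class') or '').split()
        if 'next' in classes:
            self.href = attr_map.get('href')


def find_next_url(page_url, html):
    """Return the absolute URL of the next chapter, or None if no link is found."""
    parser = NextLinkParser()
    parser.feed(html)
    if parser.href is None:
        return None
    # urljoin leaves absolute links alone and resolves relative ones
    return urljoin(page_url, parser.href)


# Hypothetical page fragment, not copied from the real site
sample_html = (
    '<div><a class="prev" href="0.html">prev</a>'
    '<a class="next" href="2.html">next</a></div>'
)
print(find_next_url('http://example.com/book/1.html', sample_html))
```

Resolving the link against the current page URL is a defensive choice: it costs nothing when the site already emits absolute links and avoids a broken request when it emits relative ones.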

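Here is the upgraded second crawler, V1.1:

Figure: much better this way

Here getContent() uses recursion: after saving a chapter it calls itself with the next chapter's URL. OK, this upgraded, diamond-studded, crystal, luxury-edition crawler is done!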

# Name: scrape novel content V1.1
# Author: DYBOY 小东
# Date: 2017-09-07

'''
Novel URL: http://www.quanshuwang.com/book/44/44683/
Chapter URLs, starting with chapter 1:
    http://www.quanshuwang.com/book/44/44683/15379609.html
    http://www.quanshuwang.com/book/44/44683/15380350.html

'''

import requests

from bs4 import BeautifulSoup
# Basic imports above



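# Save one chapter to the global file f, then recurse into the next chapter
def getContent(content_url, i):
    i = i + 1
    header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
    res = requests.get(content_url, headers=header, timeout=10)
    res.encoding = 'gbk'  # decode the page as GBK
    soup = BeautifulSoup(res.text, 'html.parser')
    titles = soup.select('.jieqi_title')
    bodies = soup.select('#content')
    # Base case: stop when the page does not look like a chapter page
    # (assumed to be what follows the last chapter)
    if not titles or not bodies:
        return
    # Note: lstrip/rstrip remove a set of characters, not a literal prefix/suffix
    title = titles[0].text.lstrip('章 节目录 ')
    content = bodies[0].text.lstrip('style5();').rstrip('style6();')
    both = title + content
    print(both, file=f)
    print(i)  # progress counter
    next_links = soup.select('.next')
    # Base case: no next-chapter link on this page
    if not next_links:
        return
    return getContent(next_links[0]['href'], i)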


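# MAIN
# f is a module-level global that getContent() writes to;
# the with-block closes it even if the crawl stops with an error
with open("dldl2.txt", 'w', encoding='utf-8') as f:
    i = 0
    getContent('http://www.quanshuwang.com/book/44/44683/15379609.html', i)
print('ok!')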
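The recursive getContent() above adds one stack frame per chapter, and CPython's default recursion limit is 1000, so a very long novel could end in a RecursionError. As a hedged alternative, the same follow-the-next-link idea can be written as a loop. In this sketch, crawl and fetch_chapter are hypothetical names that are not part of the original script: fetch_chapter is any callable that returns (text, next_url) for a URL, which lets the loop be tried against an in-memory fake site. In a real run it would wrap the requests and BeautifulSoup calls shown above. The seen set and the max_chapters cap are assumptions added to guard against a next link that loops back to an earlier page.

```python
import io


def crawl(start_url, fetch_chapter, out, max_chapters=5000):
    """Follow next-chapter links iteratively, writing each chapter to out.

    fetch_chapter(url) must return (text, next_url); next_url is None
    when there is no further chapter. Returns the number of chapters written.
    """
    seen = set()
    url = start_url
    count = 0
    while url is not None and url not in seen and count < max_chapters:
        seen.add(url)
        text, next_url = fetch_chapter(url)
        print(text, file=out)
        count += 1
        url = next_url
    return count


# Hypothetical in-memory "site": URL -> (chapter text, next URL)
fake_site = {
    'ch1': ('Chapter 1 ...', 'ch2'),
    'ch2': ('Chapter 2 ...', 'ch3'),
    'ch3': ('Chapter 3 ...', 'ch1'),  # last page links back to an earlier one
}

buffer = io.StringIO()
written = crawl('ch1', fake_site.__getitem__, buffer)
print(written)
```

Passing the fetch function in as a parameter keeps the traversal logic separate from the network and parsing code, so the loop can be checked offline before pointing it at the real site.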

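You are welcome to download the code from GitHub!

Note: this is an original article; please cite this article's address when reposting!

Author QQ: 1099718640

CSDN blog homepage: http://blog.csdn.net/dyboy2017

GitHub open-source project: https://github.com/dyboy2017/novel_spider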

秒客网
