Python Example: Scraping Data from HTML and Saving It as TXT

This example scrapes joke data from Pengfu (捧腹网), where the jokes are stored in the HTML itself rather than served as JSON.

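Inspecting the Pengfu joke pages with the browser's developer tools shows that the data is embedded in the HTML page rather than loaded as JSON, so it can be extracted directly with soup.select(). The complete script is shown below.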
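To see what soup.select() returns before running the full script, here is a minimal offline sketch. The sample markup and the helper name extract_jokes are assumptions made for illustration; only the two selectors come from the script, and the real page layout may differ. It uses Python's built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that imitates the structure the selectors expect.
SAMPLE_HTML = """
<div class="list-item">
  <h1 class="dp-b"><a href="#">First title</a></h1>
  <div class="content-img">First joke,
  on two lines.</div>
</div>
<div class="list-item">
  <h1 class="dp-b"><a href="#">Second title</a></h1>
  <div class="content-img">Second joke.</div>
</div>
"""


def extract_jokes(html_text):
    """Return a list of (title, content) tuples found in html_text."""
    soup = BeautifulSoup(html_text, "html.parser")
    titles = soup.select("h1[class=dp-b]")  # h1 whose class attribute is exactly dp-b
    bodies = soup.select(".content-img")    # any element carrying the content-img class
    # zip() stops at the shorter list, so a page with fewer items
    # than expected does not raise IndexError.
    return [
        (t.get_text().strip().replace("\n", ""),
         b.get_text().strip().replace("\n", ""))
        for t, b in zip(titles, bodies)
    ]


jokes = extract_jokes(SAMPLE_HTML)
for title, content in jokes:
    print(title, "|", content)
```

Each call to select() returns a list of matching elements in document order, which is why the script can pair the titles and bodies by index.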

import requests
from bs4 import BeautifulSoup

restr = ''
for j in range(1, 51):     # fetch 50 pages in total
    url = 'https://www.pengfu.com/xiaohua_' + str(j) + '.html'
    res = requests.get(url)
    res.encoding = 'utf-8'   # alternatively: html_doc = str(res.content, 'utf-8')
    soup = BeautifulSoup(res.text, 'lxml')
    h1 = soup.select('h1[class=dp-b]')   # joke titles
    con = soup.select('.content-img')    # joke bodies
    # take at most 10 jokes per page; min() avoids an IndexError on shorter pages
    for i in range(min(10, len(h1), len(con))):
        rh1 = 'Joke title: ' + h1[i].text.strip().replace('\n', '')
        rcon = 'Joke content: ' + con[i].text.strip().replace('\n', '')
        restr += rh1 + '\n' + rcon + '\n\n'
        print('Reading joke ' + str(i + 1) + ' on page ' + str(j) + '...')

print('Saving...')
# the with block closes the file automatically, even if write() fails
with open('捧腹网笑话500条.txt', 'w', encoding='utf-8') as f:
    f.write(restr)
print('Saved!')
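The formatting and saving step can also be tried on its own, without any network access. The following is a minimal sketch using only the standard library; the helper names format_jokes and save_text, the sample data, and the temporary output path are assumptions for illustration and are not part of the script above.

```python
import os
import tempfile


def format_jokes(jokes):
    """Build the text body: a title line, a content line, then a blank line."""
    parts = []
    for title, content in jokes:
        parts.append("Joke title: " + title + "\n"
                     + "Joke content: " + content + "\n\n")
    return "".join(parts)


def save_text(path, text):
    # The with statement closes the file even if write() raises.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)


# Hypothetical sample data, including non-ASCII text to exercise UTF-8.
sample = [("标题一", "内容一"), ("Title two", "Content two")]
out_path = os.path.join(tempfile.mkdtemp(), "jokes_demo.txt")
save_text(out_path, format_jokes(sample))

# Read the file back to confirm what was written.
with open(out_path, encoding="utf-8") as f:
    saved = f.read()
print(saved)
```

Passing encoding="utf-8" explicitly matters here because the default encoding of open() depends on the platform, and the scraped text is Chinese.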
