Goal
- Scrape jokes from 糗事百科 (Qiushibaike)
- Show one joke each time Enter is pressed
- Enter the number of pages to read; press 'Q' or 'q' to quit
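The "one joke per Enter, 'q' to quit" behaviour can be sketched independently of any scraping. This is only an illustration: the function name show_one_at_a_time and its read/write parameters are hypothetical and exist so the loop can be tried without a keyboard.

```python
def show_one_at_a_time(jokes, read=input, write=print):
    """Print jokes one by one and return how many were shown.

    `read` and `write` default to the built-ins; they are parameters
    only so this sketch can be driven by a simulated session.
    """
    shown = 0
    for joke in jokes:
        write(joke)
        shown += 1
        # Enter (an empty string) continues; 'q' or 'Q' stops.
        if read().strip().lower() == 'q':
            break
    return shown


# Simulated session: Enter after the first joke, 'Q' after the second.
keys = iter(['', 'Q'])
printed = []
count = show_one_at_a_time(['joke 1', 'joke 2', 'joke 3'],
                           read=lambda: next(keys),
                           write=printed.append)
print(count, printed)
```

Called with only the list of jokes, the same function reads from the keyboard and prints to the terminal.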
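Approach
- Target site: 糗事百科 (Qiushibaike)
- Fetch the pages with requests (see the official requests tutorial)
- Parse the pages with the bs4 module and extract the content (see the official bs4 tutorial)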
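The parsing step can be tried offline before touching the network. The sketch below runs bs4 against a small hand-written HTML string; SAMPLE_HTML and extract_jokes are hypothetical names, and the markup only mimics the assumption that each joke sits in a div with class "content". It uses Python's built-in 'html.parser' backend so that html5lib does not have to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; it only imitates the structure the scraper expects.
SAMPLE_HTML = """
<div class="author">someone</div>
<div class="content">
<span>First joke.</span>
</div>
<div class="content">
<span>Second joke.</span>
</div>
"""


def extract_jokes(html):
    """Return the text of every <div class="content"> in `html`."""
    # 'html.parser' ships with Python, so no extra parser package is needed.
    soup = BeautifulSoup(html, 'html.parser')
    divs = soup.find_all('div', class_='content')
    # get_text() keeps the newlines around the <span>, so remove them.
    return [div.get_text().replace('\n', '') for div in divs]


jokes = extract_jokes(SAMPLE_HTML)
print(jokes)
```

If the real page uses a different class name or nesting, only the find_all call needs to change.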
Code:
    import requests
    from bs4 import BeautifulSoup


    def get_content(pages):
        """Return a list of jokes scraped from the first `pages` pages."""
        # User agent. The HTTP header name is 'User-Agent', not 'user_agent'.
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) '
                                 'AppleWebKit/537.36 (KHTML, like Gecko) '
                                 'Chrome/54.0.2840.87 Safari/537.36'}
        content_list = []
        for page in range(1, pages + 1):  # how many pages to read
            url = 'http://www.qiushibaike.com/text/page/' + str(page) + '/?s=4928950'
            response = requests.get(url, headers=headers)  # fetch the page
            html = response.text
            soup = BeautifulSoup(html, 'html5lib')  # parse the page (needs the html5lib package)
            jokes = soup.find_all('div', class_='content')
            for each in jokes:
                each_joke = each.get_text()
                joke = each_joke.replace('\n', '')  # remove newlines
                content_list.append(joke)
        return content_list  # list of jokes


    if __name__ == "__main__":
        answer = input("How many pages do you want to read?\n"
                       "If you want to quit, just press 'q'.\n")
        if answer.strip().lower() != 'q':  # 'q' or 'Q' at the prompt quits right away
            number = int(answer)  # number of pages to read
            print()  # blank line for readability
            for paragraph in get_content(number):
                print(paragraph)
                user_input = input()
                if user_input.lower() == 'q':  # press 'q' or 'Q' to quit
                    break
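One detail of the cleanup step is worth a note: replace('\n', '') deletes line breaks outright, so two words separated only by a newline end up glued together. That rarely matters for Chinese text, but it does for text that relies on spaces. A possible alternative, shown only as a sketch (normalize_joke is a hypothetical helper, not part of the script above), collapses every run of whitespace into a single space:

```python
def normalize_joke(text):
    """Collapse every run of whitespace into a single space."""
    # str.split() with no arguments splits on any whitespace run and
    # ignores leading/trailing whitespace; join() rebuilds one clean line.
    return ' '.join(text.split())


raw = '\n\nWhy did the chicken\ncross the road?\n\n'
glued = raw.replace('\n', '')   # what the script's cleanup step does
clean = normalize_joke(raw)     # whitespace-collapsing alternative
print(repr(glued))
print(repr(clean))
```

Here glued reads "chickencross" while clean keeps the space between the two words.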
Result:
References:
http://www.jianshu.com/p/19c846daccb3
Crawler tutorial by 静谧: https://cuiqingcai.com/990.html
Joke-scraping reference: http://www.jianshu.com/p/0e7d1c80b8c3