1. Requirements
Scrape the content, title, publication time, source, comment count, and editor of every article in Sina's domestic news feed.
2. Approach
Sina's news is presented as a scrolling, paginated list. First, find the link to each news article and scrape the article's content;
next, find the pagination links and collect every article link on each page;
finally, iterate over all pages to scrape every article.
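As a small illustration of the pagination step above: the rolling-news API takes the page number as a query parameter, so each page's URL can be produced from a single template. The template below mirrors the one used in the full code further down; its exact query string is an assumption about Sina's API, and `page_urls` is a helper name introduced here for illustration.

```python
# Sketch of the pagination step: one URL template, formatted per page.
# The query string is an assumption about Sina's rolling-news API.
PAGE_URL = ('http://api.roll.news.sina.com.cn/zt_list?channel=news'
            '&cat_1=gnxw&show_num=22&format=json&page={}')

def page_urls(n_pages):
    # Return the URL for each of the first n_pages pages.
    return [PAGE_URL.format(i) for i in range(1, n_pages + 1)]

urls = page_urls(3)
```

Each URL in `urls` can then be fed to the list-parsing function, which in turn hands every article link to the detail scraper.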
3. Method
Write the scraper in a functional style; roughly three functions are needed.
4. Environment
Jupyter Notebook; modules likely needed: requests, re, BeautifulSoup, datetime, json.
5. Problems encountered
The pagination API and the comment-count API are not easy to locate.
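The comment count lives behind a separate API whose URL embeds a newsid taken from the article URL, and whose response can arrive wrapped in a JSONP callback. A minimal sketch of both steps, using a made-up article URL and a fabricated response body (the `jsonp_123(...)` wrapper and the counts are illustrative, not a real API response):

```python
import re
import json

# Made-up article URL following Sina's doc-i<id>.shtml pattern (illustrative only).
newsurl = 'http://news.sina.com.cn/c/nd/2017-01-01/doc-ifyfuzny1234567.shtml'

# Step 1: pull the newsid out of the article URL.
m = re.search(r'doc-i(.+)\.shtml', newsurl)
newsid = m.group(1)

# Step 2: strip the JSONP callback wrapper before parsing the payload.
# The response text below is a fabricated example of the wrapper shape.
raw = 'jsonp_123({"result": {"count": {"total": 42}}})'
payload = raw[raw.index('(') + 1 : raw.rindex(')')]
jd = json.loads(payload)
total = jd['result']['count']['total']
```

Slicing between the first `(` and the last `)` is a bit more robust than stripping a hard-coded callback name, since the callback name can vary.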
6. Code
import requests
import json
import re
from datetime import datetime
from bs4 import BeautifulSoup

def getCommentCount(newsurl):
    # Comment counts come from a separate JSON API keyed by the article's newsid.
    commentURL = ('http://comment5.news.sina.com.cn/page/info?version=1&format=json'
                  '&channel=gn&newsid=comos-{}&group=undefined&compress=0'
                  '&ie=utf-8&oe=utf-8&page=1&page_size=3&t_size=3&h_size=3&thread=1')
    # Extract the newsid from the article URL, e.g. .../doc-i<newsid>.shtml
    m = re.search('doc-i(.+).shtml', newsurl)
    newsid = m.group(1)
    # Alternative: newsid = newsurl.split('/')[-1].lstrip('doc-i').rstrip('.shtml')
    comments = requests.get(commentURL.format(newsid))
    jd = json.loads(comments.text)
    return jd['result']['count']['total']

def getNewsdetail(newsurl):
    # Scrape one article page: title, source, time, body, editor, comment count.
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    result['title'] = soup.select('.main-title')[0].text
    result['newssource'] = soup.select('.date-source a')[0].text
    timesource = soup.select('.date-source span')[0].text
    result['dt'] = datetime.strptime(timesource, '%Y年%m月%d日 %H:%M')
    # Drop the last three <p> tags (the editor line and footer).
    result['article'] = '\n'.join(p.text.strip() for p in soup.select('.article p')[:-3])
    result['editor'] = soup.select('.show_author')[0].text.lstrip('责任编辑:').rstrip(' ')
    result['comments'] = getCommentCount(newsurl)
    return result

def ParseListLinks(url):
    # Fetch one page of the rolling-news API and scrape every article on it.
    newsdetails = []
    res = requests.get(url)
    res.encoding = 'utf-8'
    # The API returns JSONP: strip the "newsloadercallback(...)" wrapper before parsing.
    jd = json.loads(res.text.lstrip(' newsloadercallback(').rstrip(');'))
    for ent in jd['result']['data']:
        newsdetails.append(getNewsdetail(ent['url']))
    return newsdetails

url = ('http://api.roll.news.sina.com.cn/zt_list?channel=news&cat_1=gnxw'
       '&cat_2==gdxw1||=gatxw||=zs-pl||=mtjj&level==1||=2&show_ext=1'
       '&show_all=1&show_num=22&tag=1&format=json&page={}')
news_total = []
for i in range(5):
    if i != 2:  # page 2 is skipped here
        newsurl = url.format(i)
        newsary = ParseListLinks(newsurl)
        news_total.extend(newsary)

import pandas
df = pandas.DataFrame(news_total)
df.head()
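Once news_total is in a DataFrame, it can be persisted for later analysis. Below is a minimal sketch writing it to an in-memory SQLite database; the table name 'news' and the sample rows are illustrative assumptions standing in for real scraped data.

```python
import sqlite3
import pandas

# Fabricated sample rows standing in for the scraped news_total list.
news_total = [
    {'title': 'Example headline A', 'comments': 10},
    {'title': 'Example headline B', 'comments': 3},
]
df = pandas.DataFrame(news_total)

# Write the frame to SQLite; if_exists='replace' recreates the table on rerun.
with sqlite3.connect(':memory:') as conn:
    df.to_sql('news', conn, if_exists='replace', index=False)
    back = pandas.read_sql_query('SELECT * FROM news', conn)
```

For a file-backed database, replace ':memory:' with a filename; df.to_excel or df.to_csv work just as well for one-off exports.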