Sina Weibo normally requires a login to scrape, but the mobile site m.weibo.cn simplifies things: a status page there exposes the weibo id directly in its URL.
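As a quick sketch, the numeric id is just the trailing digits of the mobile status URL (the URL below is only an illustration):

```python
import re

# A sample m.weibo.cn status URL; the digits after /status/ are the weibo id
url = 'https://m.weibo.cn/status/4054483400791767'
weibo_id = re.search(r'/status/(\d+)', url).group(1)
print(weibo_id)  # 4054483400791767
```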
Inspecting how Weibo delivers comments shows they are loaded dynamically as JSON, so the json module is used to parse the responses.
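To sketch that parsing step without any network traffic, here is the response shape the script relies on, using a mocked JSON body (the 'data' and 'text' field names match the code below):

```python
import json

# Mocked body in the shape the m.weibo.cn comments endpoint returns:
# a top-level "data" list whose elements carry the comment in "text"
mock_body = json.dumps({
    "data": [
        {"text": "first comment"},
        {"text": "second comment"},
    ]
})

parsed = json.loads(mock_body)
texts = [item["text"] for item in parsed["data"]]
print(texts)  # ['first comment', 'second comment']
```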
A separate text-cleaning function strips the noisy characters that clutter Weibo comments (reply prefixes, forward chains, embedded markup).
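The idea can be sketched with plain regexes (a simplified stand-in for the lxml-based function in the full script below):

```python
import re

def clean_comment(raw):
    # The API returns comment text with HTML markup; drop the tags first
    text = re.sub(r'<[^>]+>', '', raw)
    # Then remove "回复@user:" / "回覆@user:" reply prefixes
    # and "//@user ..." forward chains
    for pattern in (r'回复@.*?:', r'回覆@.*?:', r'//@.*'):
        text = re.sub(pattern, '', text)
    return text

sample = '回复<a href="/u/123">@someone</a>:不错!//@other:转发微博'
print(clean_comment(sample))  # -> 不错!
```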
Since the scraper itself is the whole point of the script, everything is organized as functions, which makes later optimization and new features easier to add.
# -*- coding:gbk -*-
import re
import requests
import json
from lxml import html

# Test weibo id: 4054483400791767
comments = []

def get_page(weibo_id):
    # Read comments_count off the status page to estimate how many pages to fetch
    url = 'https://m.weibo.cn/status/{}'.format(weibo_id)
    page_text = requests.get(url).text
    regcount = r'"comments_count": (.*?),'
    comments_count = re.findall(regcount, page_text)[-1]
    comments_count_number = int(comments_count)
    page = int(comments_count_number / 10)
    return page - 1

def opt_comment(comment):
    # Strip HTML markup, then remove reply prefixes and forward chains
    tree = html.fromstring(comment)
    strcom = tree.xpath('string(.)')
    reg1 = r'回复@.*?:'
    reg2 = r'回覆@.*?:'
    reg3 = r'//@.*'
    newstr = ''
    comment1 = re.subn(reg1, newstr, strcom)[0]
    comment2 = re.subn(reg2, newstr, comment1)[0]
    comment3 = re.subn(reg3, newstr, comment2)[0]
    return comment3

def get_responses(id, page):
    url = "https://m.weibo.cn/api/comments/show?id={}&page={}".format(id, page)
    response = requests.get(url)
    return response

def get_weibo_comments(response):
    json_response = json.loads(response.text)
    for i in range(0, len(json_response['data'])):
        comment = opt_comment(json_response['data'][i]['text'])
        comments.append(comment)

weibo_id = input("Enter a weibo id; the first 5 pages of comments will be returned: ")
weibo_id = int(weibo_id)
print('\n')
page = get_page(weibo_id)
for page in range(1, page + 1):
    response = get_responses(weibo_id, page)
    get_weibo_comments(response)

for com in comments:
    print(com)
print(len(comments))
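One detail worth noting in get_page: the page count comes from truncating division by 10 (the API serves 10 comments per page) minus one, so the last partial page is never requested. The arithmetic in isolation:

```python
# Same arithmetic as get_page: 10 comments per page, truncated, minus one
comments_count = 57
pages = int(comments_count / 10) - 1
print(pages)  # 4 -> the main loop then requests pages 1 through 4
```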
That wraps up this walkthrough of scraping Sina Weibo comments with Python. I hope it helps; if you have any questions, leave me a message and I will reply promptly. Many thanks, as always, for your support of the 服务器之家 site!
Original article: https://blog.csdn.net/Joliph/article/details/77334354