工作原因需要爬取微博上相关微博内容以及评论。直接scrapy上手,发现有部分重复的内容出现。(标题重复,内容重复,但是url不重复)
1.scrapy爬取微博内容
为了降低爬取难度,直接爬取微博的移动端:(电脑访问到移动版本微博,之后F12调出控制台来操作)
点击搜索栏:输入相关搜索关键词:
可以看到微博的开始搜索URL为:https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall
我们要搜索的“范冰冰” 其实做了URL编码:
class SinaspiderSpider(scrapy.Spider):
name = 'weibospider'
allowed_domains = ['m.weibo.cn']
start_urls = ['https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall']
Referer = {"Referer": "https://m.weibo.cn/p/searchall?containerid=100103type%3D1%26q%3D"+quote("范冰冰")}
def start_requests(self):
yield Request(url="https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D"+quote("范冰冰")+"&page_type=searchall&page=1",headers=self.Referer,meta={"page":1,"keyword":"范冰冰"})
之后我们滚动往下拉发现url是有规律的:
https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=2
https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=3
https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=4
https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=5
https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=6
https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=7
在原来的基础上新增了一个参数“&page=2” 这些参数从哪里来的呢?我们如何判断多少页的时候就没有了呢?
打开我们最开始的那条URL:
复制这段json,然后通过下面两个网站格式化一下,便于我们观察规律:
Unicode 转中文:http://www.atool.org/chinese2unicode.php
Json在线格式化:http://tool.oschina.net/codeformat/json
在线工具有特别丰富的功能让我们更好的查看json:
我们发现JSON中保存着我们要的页面信息:
其他的信息一次类推在JSON或者URL中观察:
微博爬取parse函数:
def parse(self, response):
base_url = "https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D"+quote("范冰冰")+"&page_type=searchall&page="
results = json.loads(response.text,encoding="utf-8")
page = response.meta.get("page")
keyword = response.meta.get("keyword")
# 下一页
next_page = results.get("data").get("cardlistInfo").get("page")
if page != next_page:
yield Request(url=base_url+str(next_page), headers=self.Referer, meta={"page":next_page,"keyword":keyword})
result = results.get("data").get("cards")
# 获取微博
for j in result:
card_type = j.get("card_type")
show_type = j.get("show_type")
# 过滤
if show_type ==1 and card_type ==11 :
for i in j.get("card_group"):
reposts_count = i.get("mblog").get("reposts_count")
comments_count = i.get("mblog").get("comments_count")
attitudes_count = i.get("mblog").get("attitudes_count")
# 过滤到评论 转发 喜欢都为0 的微博
if reposts_count and comments_count and attitudes_count:
message_id = i.get("mblog").get("id")
status_url = "https://m.weibo.cn/comments/hotflow?id=%s&mid=%s&max_id_type=0"
# 返回微博评论爬取
yield Request(url=status_url%(message_id,message_id),callback=self.commentparse, meta={"keyword":keyword,"message_id":message_id})
title = keyword
status_url = "https://m.weibo.cn/status/%s"
# response1 = requests.get(status_url%message_id)
if i.get("mblog").get("page_info"):
content = i.get("mblog").get("page_info").get("page_title")
content1 = i.get("mblog").get("page_info").get("content1")
content2 = i.get("mblog").get("page_info").get("content2")
else:
content = ""
content1 = ""
content2 = ""
text = i.get("mblog").get("text").encode(encoding="utf-8")
textLength = i.get("mblog").get("textLength")
isLongText = i.get("mblog").get("isLongText")
create_time = i.get("mblog").get("created_at")
spider_time = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
user = i.get("mblog").get("user").get("screen_name")
message_url = i.get("scheme")
longText = i.get("mblog").get("longText").get("longTextContent") if isLongText else ""
reposts_count = reposts_count
comments_count = comments_count
attitudes_count = attitudes_count
weiboitemloader = WeiBoItemLoader(item=WeibopachongItem())
weiboitemloader.add_value("title",title )
weiboitemloader.add_value("message_id",message_id )
weiboitemloader.add_value("content",content )
weiboitemloader.add_value("content1",content1 )
weiboitemloader.add_value("content2",content2 )
weiboitemloader.add_value("text",text )
weiboitemloader.add_value("textLength",textLength )
weiboitemloader.add_value("create_time",create_time )
weiboitemloader.add_value("spider_time",spider_time )
weiboitemloader.add_value("user1",user )
weiboitemloader.add_value("message_url",message_url )
weiboitemloader.add_value("longText1",longText )
weiboitemloader.add_value("reposts_count",reposts_count )
weiboitemloader.add_value("comments_count",comments_count )
weiboitemloader.add_value("attitudes_count",attitudes_count )
yield weiboitemloader.load_item()
2.scrapy爬取微博评论
评论在微博正文中往下拉鼠标可以获得URL规律,下面是微博评论解析函数:
def commentparse(self,response):
status_after_url = "https://m.weibo.cn/comments/hotflow?id=%s&mid=%s&max_id=%s&max_id_type=%s"
message_id = response.meta.get("message_id")
keyword = response.meta.get("keyword")
results = json.loads(response.text, encoding="utf-8")
if results.get("ok"):
max_id = results.get("data").get("max_id")
max_id_type = results.get("data").get("max_id_type")
if max_id:
# 评论10 个为一段,下一段在上一段JSON中定义:
yield Request(url=status_after_url%(message_id,message_id,str(max_id),str(max_id_type)),callback=self.commentparse,meta={"keyword":keyword,"message_id":message_id})
datas = results.get("data").get("data")
for data in datas:
text1 = data.get("text")
like_count = data.get("like_count")
user1 = data.get("user").get("screen_name")
user_url = data.get("user").get("profile_url")
emotion = SnowNLP(text1).sentiments
weibocommentitem = WeiboCommentItem()
weibocommentitem["title"] = keyword
weibocommentitem["message_id"] = message_id
weibocommentitem["text1"] = text1
weibocommentitem["user1"] = user1
weibocommentitem["user_url"] = user_url
weibocommentitem["emotion"] = emotion
yield weibocommentitem
最后异步存入MYSQL:
item:
import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Compose
def get_First(values):
if values is not None:
return values[0]
class WeiBoItemLoader(ItemLoader):
default_output_processor = Compose(get_First)
class WeibopachongItem(scrapy.Item):
# define the fields for your item here like:
# name = scrapy.Field()
title = scrapy.Field()
message_id = scrapy.Field()
content = scrapy.Field()
content1 = scrapy.Field()
content2 = scrapy.Field()
text = scrapy.Field()
textLength = scrapy.Field()
create_time = scrapy.Field()
spider_time = scrapy.Field()
user1 = scrapy.Field()
message_url = scrapy.Field()
longText1 = scrapy.Field()
reposts_count = scrapy.Field()
comments_count = scrapy.Field()
attitudes_count = scrapy.Field()
def get_insert_sql(self):
insert_sql = """
insert into t_public_opinion_realtime_weibo(title,message_id,content,content1,content2,text,textLength,create_time,spider_time,user1,message_url,longText1,reposts_count,comments_count,attitudes_count)values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
"""
parms = (self["title"],self["message_id"],self["content"],self["content1"],self["content2"],self["text"],self["textLength"],self["create_time"],self["spider_time"],self["user1"],self["message_url"],self["longText1"],self["reposts_count"],self["comments_count"],self["attitudes_count"])
return insert_sql, parms
class WeiboCommentItem(scrapy.Item):
title = scrapy.Field()
message_id = scrapy.Field()
text1 = scrapy.Field()
user1 = scrapy.Field()
user_url = scrapy.Field()
emotion = scrapy.Field()
def get_insert_sql(self):
insert_sql = """
insert into t_public_opinion_realtime_weibo_comment(title,message_id,text1,user1,user_url,emotion)
values (%s,%s,%s,%s,%s,%s)
"""
parms = (self["title"],self["message_id"],self["text1"],self["user1"],self["user_url"],self["emotion"])
return insert_sql, parms
Pipline:异步插入:
# 插入
class MysqlTwistedPipline(object):
def __init__(self,dbpool):
self.dbpool=dbpool
@classmethod
def from_settings(cls,setting):
dbparms=dict(
host=setting["MYSQL_HOST"],
db=setting["MYSQL_DBNAME"],
user=setting["MYSQL_USER"],
passwd=setting["MYSQL_PASSWORD"],
charset='utf8mb4',
cursorclass=MySQLdb.cursors.DictCursor,
use_unicode=True,
)
dbpool=adbapi.ConnectionPool("MySQLdb",**dbparms)
return cls(dbpool)
#mysql异步插入执行
def process_item(self, item, spider):
query=self.dbpool.runInteraction(self.do_insert,item)
query.addErrback(self.handle_error,item,spider)
def handle_error(self,failure,item,spider):
#处理异步插入的异常
print (failure)
def do_insert(self,cursor,item):
insert_sql,parms=item.get_insert_sql()
print(parms)
cursor.execute(insert_sql, parms)
按照规则来写爬虫还是难免有重复:
所以需要在插入内容前对数据进行去重处理
3.scrapy+Redis实现对重复微博的过滤
这里使用Redis中的Set集合来实现,也可以用Python中的Set来做,数据量不大的情况下,Redis中Set有Sadd方法,当成功插入数据后,会返回1。如果插入重复数据则会返回0。
redis_db = redis.Redis(host='127.0.0.1', port=6379, db=0)
result = redis_db.sadd("wangliuqi","12323")
print(result)
result1 = redis_db.sadd("wangliuqi","12323")
print(result1)
结果:=========》》》》》》》》
1
0
在Scrapy中新增一个pipline,然后对每一个要保存的item进行判断,如果是重复的微博则对其进行丢弃操作:
RemoveReDoPipline:
class RemoveReDoPipline(object):
def __init__(self,host):
self.conn = MySQLdb.connect(host, 'root', 'root', 'meltmedia', charset="utf8", use_unicode=True)
self.redis_db = redis.Redis(host='127.0.0.1', port=6379, db=0)
sql = "SELECT message_id FROM t_public_opinion_realtime_weibo"
# 获取全部的message_id,这是区分是不是同一条微博的标识
df = pd.read_sql(sql, self.conn)
# 全部放入Redis中
for mid in df['message_id'].get_values():
self.redis_db.sadd("weiboset", mid)
# 获取setting文件配置
@classmethod
def from_settings(cls,setting):
host=setting["MYSQL_HOST"]
return cls(host)
def process_item(self, item, spider):
# 只对微博的Item过滤,微博评论不需要过滤直接return:
if isinstance(item,WeibopachongItem):
if self.redis_db.sadd("weiboset",item["message_id"]):
return item
else:
print("重复内容:", item['text'])
raise DropItem("same title in %s" % item['text'])
else:
return item
最后别忘了在setting文件中把pipline配置进去,并且要配置到保存数据pipline前面才可以。否则起不到过滤效果:
ITEM_PIPELINES = {
'weibopachong.pipelines.MysqlTwistedPipline': 200,
'weibopachong.pipelines.RemoveReDoPipline': 100,
}