Crawling Sina Weibo for a Keyword with Scrapy, and Filtering Duplicate Content Across Different URLs

Date: 2024-03-29 16:55:05

For work I needed to crawl Weibo posts and their comments for a given keyword. Going straight at it with Scrapy, I found that duplicate content kept showing up (same title, same content, but different URLs), so Scrapy's built-in request de-duplication does not catch it.

Contents

  1. Crawling Weibo posts with Scrapy
  2. Crawling Weibo comments with Scrapy
  3. Filtering duplicate posts with Scrapy + Redis


1. Crawling Weibo posts with Scrapy

To make crawling easier, we target Weibo's mobile site directly (open the mobile version of Weibo in a desktop browser, then press F12 to open the developer console).


Click the search bar and enter the keyword:

We can see that the initial search URL is: https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall

The keyword we search for, "范冰冰", is in fact URL-encoded:
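As a quick sanity check (a standalone sketch, not part of the spider), urllib.parse.quote reproduces exactly the encoded containerid value seen above:

from urllib.parse import quote

# "范冰冰" percent-encoded as UTF-8
print(quote("范冰冰"))                 # %E8%8C%83%E5%86%B0%E5%86%B0
# The whole containerid value; '=' and '&' are encoded as %3D and %26
print(quote("100103type=1&q=范冰冰"))  # 100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0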

import scrapy
from scrapy import Request
from urllib.parse import quote
# (json, datetime, SnowNLP and the item/loader classes used below are also imported at module level)

class SinaspiderSpider(scrapy.Spider):
    name = 'weibospider'
    allowed_domains = ['m.weibo.cn']
    start_urls = ['https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall']
    # Referer header sent with every request; the mobile API expects it
    Referer = {"Referer": "https://m.weibo.cn/p/searchall?containerid=100103type%3D1%26q%3D"+quote("范冰冰")}

    def start_requests(self):
        yield Request(url="https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D"+quote("范冰冰")+"&page_type=searchall&page=1",
                      headers=self.Referer, meta={"page": 1, "keyword": "范冰冰"})


As we scroll down the results, we find that the request URLs follow a pattern:

 https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=2
 https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=3
 https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=4
 https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=5
 https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=6
 https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D%E8%8C%83%E5%86%B0%E5%86%B0&page_type=searchall&page=7

Each request adds a "&page=N" parameter to the original URL. Where does this parameter come from, and how do we know when there are no more pages?

Open the very first URL:


Copy the returned JSON and format it with the two sites below to make the structure easier to read:

Unicode to Chinese: http://www.atool.org/chinese2unicode.php

Online JSON formatter: http://tool.oschina.net/codeformat/json

These online tools offer rich features for inspecting the JSON:


We find that the JSON contains the paging information we need:


The other fields can be found in the same way by inspecting the JSON or the URL.
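For reference, the parts of the response that the spider relies on look roughly like this (an abbreviated sketch based on the structure observed at the time; the exact fields may have changed since):

{
  "ok": 1,
  "data": {
    "cardlistInfo": { "page": 2 },
    "cards": [
      {
        "card_type": 11,
        "show_type": 1,
        "card_group": [
          {
            "scheme": "https://m.weibo.cn/status/...",
            "mblog": {
              "id": "...",
              "text": "...",
              "reposts_count": 3,
              "comments_count": 5,
              "attitudes_count": 8,
              "user": { "screen_name": "..." }
            }
          }
        ]
      }
    ]
  }
}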

The parse function for crawling posts:

    def parse(self, response):
        base_url = "https://m.weibo.cn/api/container/getIndex?containerid=100103type%3D1%26q%3D"+quote("范冰冰")+"&page_type=searchall&page="
        results = json.loads(response.text)
        page = response.meta.get("page")
        keyword = response.meta.get("keyword")
        # Next page
        next_page = results.get("data").get("cardlistInfo").get("page")
        if page != next_page:
            yield Request(url=base_url+str(next_page), headers=self.Referer, meta={"page":next_page,"keyword":keyword})
        result = results.get("data").get("cards")
        # Iterate over the returned cards to extract posts
        for j in result:
            card_type = j.get("card_type")
            show_type = j.get("show_type")
            # Keep only the card type that contains search results
            if show_type ==1 and card_type ==11 :
                for i in j.get("card_group"):
                    reposts_count = i.get("mblog").get("reposts_count")
                    comments_count = i.get("mblog").get("comments_count")
                    attitudes_count = i.get("mblog").get("attitudes_count")
                    # Only keep posts that have reposts, comments, and likes
                    if reposts_count and comments_count and attitudes_count:
                        message_id = i.get("mblog").get("id")
                        status_url = "https://m.weibo.cn/comments/hotflow?id=%s&mid=%s&max_id_type=0"
                        # Request the post's comments (parsed by commentparse)
                        yield Request(url=status_url%(message_id,message_id),callback=self.commentparse, meta={"keyword":keyword,"message_id":message_id})
                        title = keyword
                        status_url = "https://m.weibo.cn/status/%s"
                        if i.get("mblog").get("page_info"):
                            content = i.get("mblog").get("page_info").get("page_title")
                            content1 = i.get("mblog").get("page_info").get("content1")
                            content2 = i.get("mblog").get("page_info").get("content2")
                        else:
                            content = ""
                            content1 = ""
                            content2 = ""
                        text = i.get("mblog").get("text").encode(encoding="utf-8")
                        textLength = i.get("mblog").get("textLength")
                        isLongText = i.get("mblog").get("isLongText")
                        create_time = i.get("mblog").get("created_at")
                        spider_time =  datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")
                        user = i.get("mblog").get("user").get("screen_name")
                        message_url = i.get("scheme")
                        longText = i.get("mblog").get("longText").get("longTextContent") if isLongText else ""
                        weiboitemloader = WeiBoItemLoader(item=WeibopachongItem())
                        weiboitemloader.add_value("title",title )
                        weiboitemloader.add_value("message_id",message_id )
                        weiboitemloader.add_value("content",content )
                        weiboitemloader.add_value("content1",content1 )
                        weiboitemloader.add_value("content2",content2 )
                        weiboitemloader.add_value("text",text )
                        weiboitemloader.add_value("textLength",textLength )
                        weiboitemloader.add_value("create_time",create_time )
                        weiboitemloader.add_value("spider_time",spider_time )
                        weiboitemloader.add_value("user1",user )
                        weiboitemloader.add_value("message_url",message_url )
                        weiboitemloader.add_value("longText1",longText )
                        weiboitemloader.add_value("reposts_count",reposts_count )
                        weiboitemloader.add_value("comments_count",comments_count )
                        weiboitemloader.add_value("attitudes_count",attitudes_count )
                        yield weiboitemloader.load_item()

2. Crawling Weibo comments with Scrapy

The comment API's URL pattern can be found by opening a post and scrolling down through its comments while watching the network requests. Below is the comment parsing function:

    def commentparse(self,response):
        status_after_url = "https://m.weibo.cn/comments/hotflow?id=%s&mid=%s&max_id=%s&max_id_type=%s"
        message_id = response.meta.get("message_id")
        keyword = response.meta.get("keyword")
        results = json.loads(response.text)
        if results.get("ok"):
            max_id = results.get("data").get("max_id")
            max_id_type = results.get("data").get("max_id_type")
            if max_id:
                # Comments come in pages of 10; the next page's max_id is given in the previous response
                yield Request(url=status_after_url%(message_id,message_id,str(max_id),str(max_id_type)),callback=self.commentparse,meta={"keyword":keyword,"message_id":message_id})
            datas = results.get("data").get("data")
            for data in datas:
                text1 = data.get("text")
                like_count = data.get("like_count")
                user1 = data.get("user").get("screen_name")
                user_url = data.get("user").get("profile_url")
                emotion = SnowNLP(text1).sentiments
                weibocommentitem = WeiboCommentItem()
                weibocommentitem["title"] = keyword
                weibocommentitem["message_id"] = message_id
                weibocommentitem["text1"] = text1
                weibocommentitem["user1"] = user1
                weibocommentitem["user_url"] = user_url
                weibocommentitem["emotion"] = emotion
                yield weibocommentitem
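The emotion field uses SnowNLP's sentiment score, a value between 0 and 1 where values closer to 1 indicate more positive text. A minimal standalone example (requires pip install snownlp; the sample sentences are only illustrations):

from snownlp import SnowNLP

# .sentiments returns a probability in [0, 1]; higher means more positive
print(SnowNLP("这部电影真好看").sentiments)        # typically close to 1
print(SnowNLP("太糟糕了,完全不推荐").sentiments)   # typically close to 0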

Finally, the items are stored in MySQL asynchronously:

The item definitions (items.py):

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Compose

def get_First(values):
    # Take the first collected value for each field
    if values is not None:
        return values[0]

class WeiBoItemLoader(ItemLoader):
    default_output_processor = Compose(get_First)

class WeibopachongItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    message_id = scrapy.Field()
    content = scrapy.Field()
    content1 = scrapy.Field()
    content2 = scrapy.Field()
    text = scrapy.Field()
    textLength = scrapy.Field()
    create_time = scrapy.Field()
    spider_time = scrapy.Field()
    user1 = scrapy.Field()
    message_url = scrapy.Field()
    longText1 = scrapy.Field()
    reposts_count = scrapy.Field()
    comments_count = scrapy.Field()
    attitudes_count = scrapy.Field()

    def get_insert_sql(self):
        insert_sql = """
        insert into  t_public_opinion_realtime_weibo(title,message_id,content,content1,content2,text,textLength,create_time,spider_time,user1,message_url,longText1,reposts_count,comments_count,attitudes_count)values (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)
        """
        parms = (self["title"],self["message_id"],self["content"],self["content1"],self["content2"],self["text"],self["textLength"],self["create_time"],self["spider_time"],self["user1"],self["message_url"],self["longText1"],self["reposts_count"],self["comments_count"],self["attitudes_count"])
        return insert_sql, parms

class WeiboCommentItem(scrapy.Item):
    title = scrapy.Field()
    message_id = scrapy.Field()
    text1 = scrapy.Field()
    user1 = scrapy.Field()
    user_url = scrapy.Field()
    emotion = scrapy.Field()

    def get_insert_sql(self):
        insert_sql = """
        insert into  t_public_opinion_realtime_weibo_comment(title,message_id,text1,user1,user_url,emotion)
        values (%s,%s,%s,%s,%s,%s)
        """
        parms = (self["title"],self["message_id"],self["text1"],self["user1"],self["user_url"],self["emotion"])
        return insert_sql, parms

The pipeline for asynchronous inserts (pipelines.py):

import MySQLdb
import MySQLdb.cursors
from twisted.enterprise import adbapi

# Asynchronous MySQL insert pipeline
class MysqlTwistedPipline(object):
    def __init__(self, dbpool):
        self.dbpool = dbpool
    @classmethod
    def from_settings(cls,setting):
        dbparms=dict(
                host=setting["MYSQL_HOST"],
                db=setting["MYSQL_DBNAME"],
                user=setting["MYSQL_USER"],
                passwd=setting["MYSQL_PASSWORD"],
                charset='utf8mb4',
                cursorclass=MySQLdb.cursors.DictCursor,
                use_unicode=True,
        )
        dbpool=adbapi.ConnectionPool("MySQLdb",**dbparms)
        return cls(dbpool)
    # Run the insert asynchronously via Twisted's database connection pool
    def process_item(self, item, spider):
        query = self.dbpool.runInteraction(self.do_insert, item)
        query.addErrback(self.handle_error, item, spider)
        return item

    def handle_error(self, failure, item, spider):
        # Handle exceptions raised by the asynchronous insert
        print(failure)
    def do_insert(self,cursor,item):
        insert_sql,parms=item.get_insert_sql()
        print(parms)
        cursor.execute(insert_sql, parms)
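from_settings above reads the connection details from the Scrapy settings; the keys it expects would look roughly like this in settings.py (values are placeholders, adjust to your own environment):

# settings.py (placeholder values)
MYSQL_HOST = "127.0.0.1"
MYSQL_DBNAME = "meltmedia"
MYSQL_USER = "root"
MYSQL_PASSWORD = "root"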

Even with these rules in place, the crawler still picks up duplicates:


So the data needs to be de-duplicated before it is inserted.

3. Filtering duplicate posts with Scrapy + Redis

Here we use a Redis set (a plain Python set would also work when the data volume is small). Redis's SADD command returns 1 when a new member is added and 0 when the member already exists:

import redis

redis_db = redis.Redis(host='127.0.0.1', port=6379, db=0)
result = redis_db.sadd("wangliuqi", "12323")
print(result)
result1 = redis_db.sadd("wangliuqi", "12323")
print(result1)

Output:

        1
        0

We add a new pipeline in Scrapy that checks every item about to be saved and drops it if the post is a duplicate:

RemoveReDoPipline:

import MySQLdb
import pandas as pd
import redis
from scrapy.exceptions import DropItem
from weibopachong.items import WeibopachongItem  # adjust to your project's items module

class RemoveReDoPipline(object):
    def __init__(self, host):
        self.conn = MySQLdb.connect(host, 'root', 'root', 'meltmedia', charset="utf8", use_unicode=True)
        self.redis_db = redis.Redis(host='127.0.0.1', port=6379, db=0)
        sql = "SELECT message_id FROM t_public_opinion_realtime_weibo"
        # message_id uniquely identifies a post; load every id already stored in MySQL
        df = pd.read_sql(sql, self.conn)
        # Seed the Redis set with the existing ids
        for mid in df['message_id'].values:
            self.redis_db.sadd("weiboset", mid)
    # Read the MySQL host from the Scrapy settings
    @classmethod
    def from_settings(cls, setting):
        host = setting["MYSQL_HOST"]
        return cls(host)

    def process_item(self, item, spider):
        # Only post items need filtering; comment items pass through unchanged
        if isinstance(item, WeibopachongItem):
            # sadd returns 1 for a new message_id and 0 if it is already in the set
            if self.redis_db.sadd("weiboset", item["message_id"]):
                return item
            else:
                print("Duplicate content:", item['text'])
                raise DropItem("same title in %s" % item['text'])
        else:
            return item

Finally, don't forget to register the pipeline in the settings file. It must run before the pipeline that saves the data, i.e., it needs a lower priority number, otherwise the filtering has no effect:

ITEM_PIPELINES = {
   'weibopachong.pipelines.MysqlTwistedPipline': 200,
   'weibopachong.pipelines.RemoveReDoPipline': 100,
}

Reference: https://www.jianshu.com/p/f03479b9222d