What do you cache when the data is constantly changing? (Using Twitter as an example)

Posted: 2022-09-15 17:58:06

I've been spending some time looking into caching (with Redis and Memcached mostly) and am having a hard time figuring out where exactly to use caching when your data is constantly changing.

Take Twitter for example (just read "Making Twitter 10000% faster"). How would you (or do they) cache their data when a large percentage of their database records are constantly changing?

Say Twitter has these models: User, Tweet, Follow, Favorite.

Someone may post a Tweet that gets retweeted once in a day, and another that's retweeted a thousand times in a day. For that 1000x-retweet case, since there are about 24 * 60 == 1440 minutes in a day, that means the Tweet is updated almost every minute (say it got 440 favorites as well). Same with following someone: Charlie Sheen has even attracted 1 million Twitter followers in 1 day. It doesn't seem worth it to cache in these cases, but maybe that's just because I haven't reached that level yet.

Say also that the average Twitter user tweets/follows/favorites at least once a day. That means in the naive intro-Rails schema case, the users table is updated at least once a day (tweet_count, etc.). This case makes sense for caching the user profile.

But for the 1000x Tweets and 1M followers examples above, what are recommended practices when it comes to caching data?

Specifically (assuming memcached or redis, and using a purely JSON API (no page/fragment caching)):

  • Do you cache individual Tweets/records?
  • Or do you cache chunks of records via pagination (e.g. redis lists of 20 each)?
  • Or do you cache both the records individually and in pages (viewing a single tweet vs. a JSON feed)?
  • Or do you cache lists of Tweets for each different scenario: home timeline tweets, user tweets, user favorite tweets, etc? Or all of the above?
  • Or are you breaking the data into chunks from "most volatile (newest)" to "last few days" to "old", where "old" data is cached with a longer expiration time or in discrete paginated lists or something, and the newest records are not cached at all? (i.e., if the data is time-dependent like Tweets, do you treat older records differently once you know they won't change much? See the sketch after this list.)
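
For what it's worth, here is a minimal sketch of that last idea (age-tiered expiration), assuming the redis-py client; the key scheme and the fetch_page_from_db helper are hypothetical, purely for illustration:

    import json
    import redis

    r = redis.Redis()

    def fetch_page_from_db(user_id, page):
        # Hypothetical stand-in for the real database query.
        return [{"id": page * 20 + i, "text": "..."} for i in range(20)]

    def get_user_tweets_page(user_id, page):
        """Cache each page of 20 tweets under its own key, with a TTL that
        grows the further back (and the more stable) the page is."""
        key = "user:%d:tweets:page:%d" % (user_id, page)
        cached = r.get(key)
        if cached is not None:
            return json.loads(cached)
        tweets = fetch_page_from_db(user_id, page)
        if page == 0:
            ttl = 60           # newest page: changes constantly, expire fast
        elif page < 5:
            ttl = 60 * 60      # "last few days": changes occasionally
        else:
            ttl = 24 * 3600    # old pages: effectively immutable
        r.setex(key, ttl, json.dumps(tweets))
        return tweets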

What I don't understand is the relationship between how much the data changes and whether you should cache it (and deal with the complexity of expiring the cache). It seems like Twitter could be caching the different user tweet feeds and the home timeline per user, but invalidating the cache every time someone favorites/tweets/retweets would mean updating all those cache items (and possibly cached lists of records), which at some point seems like it would make invalidating the cache counterproductive.
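
To make that worry concrete, this is roughly what naive invalidation looks like (a sketch, again assuming redis-py; follower_ids_of is a made-up stand-in for a real query). One tweet from an account with a million followers turns into a million cache deletions:

    import redis

    r = redis.Redis()

    def follower_ids_of(user_id):
        # Hypothetical: in reality this is a (potentially huge) DB query.
        return range(1_000_000)  # e.g. the Charlie Sheen case

    def on_new_tweet(author_id):
        # Naive invalidation: every follower's cached home timeline is
        # dropped, so a single write causes N cache misses that must all
        # be rebuilt from the database.
        for follower_id in follower_ids_of(author_id):
            r.delete("home_timeline:%d" % follower_id)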

What are the recommended strategies for caching data that changes as much as this?

2 Answers

#1


3  

Not saying that Twitter does it like this (although I'm pretty sure it's related), but: I recently got acquainted with CQRS + Event Sourcing (http://martinfowler.com/bliki/CQRS.html + http://martinfowler.com/eaaDev/EventSourcing.html).

Basically: reads and writes are entirely separated at the application as well as the persistence level (CQRS), and every write to the system is processed as an event which can be subscribed to (event sourcing). There's more to it (such as being able to replay the entire event stream, which is incredibly useful for implementing new functionality later on), but this is the relevant part.

Following this, the general practice is that a Read Model (think: an in-memory cache) is recreated whenever the responsible Projector (i.e., the component that projects an event onto a read model) receives a new event of an event type it is subscribed to.

In this case an event could be TweetHandled, which would be handled by all subscribers, among them a RecentTweetsPerUserProjector, a TimelinePerUserProjector, etc., each updating its respective ReadModel.

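Not claiming this is Twitter's implementation either, but a toy in-memory version of the pattern might look like the following; the event name and projector names come from the paragraph above, everything else is invented for illustration:

    from collections import defaultdict

    EVENT_LOG = []                   # append-only event store (toy version)
    SUBSCRIBERS = defaultdict(list)

    def subscribe(event_type, handler):
        SUBSCRIBERS[event_type].append(handler)

    def publish(event_type, payload):
        EVENT_LOG.append((event_type, payload))  # every write is an event
        for handler in SUBSCRIBERS[event_type]:
            handler(payload)                     # projectors update read models

    # Two read models (think: cache entries) kept current by projectors.
    RECENT_TWEETS = defaultdict(list)  # user_id -> last 20 tweets
    TIMELINES = defaultdict(list)      # follower_id -> home timeline

    def recent_tweets_per_user_projector(e):
        RECENT_TWEETS[e["author_id"]] = ([e["text"]] + RECENT_TWEETS[e["author_id"]])[:20]

    def timeline_per_user_projector(e):
        for follower_id in e["follower_ids"]:
            TIMELINES[follower_id] = ([e["text"]] + TIMELINES[follower_id])[:800]

    subscribe("TweetHandled", recent_tweets_per_user_projector)
    subscribe("TweetHandled", timeline_per_user_projector)

    publish("TweetHandled", {"author_id": 1, "text": "hello", "follower_ids": [2, 3]})
    # Reads now come straight from RECENT_TWEETS / TIMELINES: eventually
    # consistent and never invalidated, because the event itself updated them.
    # Replaying EVENT_LOG through a new projector builds a new read model
    # retroactively.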

The result is a collection of ReadModels that are eventually consistent and don't need any invalidation, i.e., the updated writes and the resulting events are themselves the trigger for updating the ReadModels in the first place.

I agree that in the end a Read Model for Charlie Sheen would get updated a lot (although this updating can be very efficient), so the cache advantage is probably pretty low. However, look at the average postings per time unit for the average user, and the picture is entirely different.

Some influential people in the DDD / CQRS / event-sourcing scene: Greg Young, Udi Dahan.

The concepts are pretty 'profound', so don't expect to completely grok them in an hour (at least I didn't). Perhaps this recent mindmap on related concepts is useful too: http://www.mindmeister.com/de/181195534/cqrs-ddd-links

Yeah, I'm pretty enthusiastic about this, if you didn't notice already :)

#2


0  

My humble 2 cents: Redis allows you to operate on its data structures, meaning that you can do in-memory operations faster than touching a relational database every time.

So, the "cache" can be altered in place, meaning it doesn't get invalidated as much as you are expecting.

In my project, I periodically load 500K records into sorted sets, and then run statistical reports only by doing range queries over them, which brought the average report execution time to under 2s.
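
As a sketch of what that pattern looks like with redis-py (the key name and the use of timestamps as scores are my own assumptions, just for illustration):

    import time
    import redis

    r = redis.Redis()

    # Periodic load: score each record by its timestamp, so that time-range
    # queries become ZRANGEBYSCORE calls instead of SQL scans.
    def load_records(records):
        r.zadd("report:events", {rec_id: ts for rec_id, ts in records})

    def report_for_range(start_ts, end_ts):
        return r.zrangebyscore("report:events", start_ts, end_ts)

    now = time.time()
    load_records([("order:1", now - 3600), ("order:2", now - 60)])
    print(report_for_range(now - 300, now))  # only records from the last 5 minutes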
