MongoDB vs Cassandra for aggregating, searching, and analyzing lots of logs

Date: 2022-10-17 21:44:40

I'm working on a project that does log aggregation and analytics as part of a bigger project. I don't know which database to choose for handling these logs. Lately I'm going back and forth between MongoDB and Cassandra, but I'm sure there are others that fit my needs as well. Which one should I choose and why?

The project is at a very early stage right now, but here are the requirements so far:

  • logs are in the syslog format
  • queries are mostly on a small string that is currently part of the message, but I will move it into a separate field. There will also be filters based on date, severity, or tag. Very rarely, people will search for an arbitrary string within the message.
  • hourly analytics on some of the log entries
  • keep the logs for a configurable amount of time
  • more will come, I'm sure :) That's why I think NoSQL is more appropriate, because we can change the schema.
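As a rough illustration of the requirements above, the sketch below parses an RFC 3164-style syslog line into separate fields (so that filters on date, severity, or tag do not have to scan the message) and counts entries per hour. The regular expression, the field names, and the `year` parameter (classic syslog timestamps omit the year) are assumptions made for the example, not part of the project.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line shape: "<PRI>Mmm dd hh:mm:ss host tag[pid]: message"
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<tag>[^\s:\[]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<message>.*)$"
)


def parse_syslog(line, year):
    """Return a dict of structured fields, or None if the line does not match."""
    m = SYSLOG_RE.match(line)
    if m is None:
        return None
    pri = int(m.group("pri"))  # PRI = facility * 8 + severity
    ts = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
    return {
        "facility": pri // 8,
        "severity": pri % 8,
        "timestamp": ts,
        "host": m.group("host"),
        "tag": m.group("tag"),
        "message": m.group("message"),
    }


def hourly_counts(entries):
    """Count entries per (hour, severity) bucket."""
    counts = Counter()
    for e in entries:
        hour = e["timestamp"].replace(minute=0, second=0, microsecond=0)
        counts[(hour, e["severity"])] += 1
    return counts
```

Whichever database is chosen, doing this split at ingestion time means the common filters hit dedicated fields, and only the rare free-text search has to touch `message`.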

We expect the database to grow to several TB of data (at ~50K inserts per second), so sharding is a must. Queries are infrequent, because they are mainly run by the developers of the bigger project, but a result needs to come back within a few seconds.
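To put the insert rate in perspective, here is a back-of-the-envelope calculation. The 200-byte average entry size is purely an assumption for illustration; the real figure depends on the messages and on storage overhead.

```python
def daily_volume(inserts_per_second, avg_entry_bytes):
    """Return (rows per day, raw bytes per day) for a sustained insert rate."""
    seconds_per_day = 24 * 60 * 60
    rows = inserts_per_second * seconds_per_day
    return rows, rows * avg_entry_bytes


rows, size_bytes = daily_volume(50_000, 200)  # 200 bytes is an assumed average
print(rows)              # 4320000000 rows per day
print(size_bytes / 1e9)  # 864.0 GB per day, before indexes and replication
```

Under that assumption, a sustained 50K inserts per second reaches the terabyte range in a little over a day, so the retention period largely determines the total size.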

Right now, storage is shared (and slow) across all the machines. So for scalability, I suppose we need to make the best use of memory and multithreading for sharding to make sense.

My basic understanding so far is that MongoDB has more features, such as regex matching and sorting of results, and is easier to set up with a decent configuration, while Cassandra seems more scalable (by simply adding servers) and also has a few neat features, like putting a TTL on data.
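On the retention requirement specifically, here is a hedged sketch of how a configurable period could be expressed in each system: Cassandra through a table-level `default_time_to_live`, and MongoDB (2.2 and later) through a TTL index with `expireAfterSeconds`. The table name `logs` and the field name `timestamp` are hypothetical, and the function only builds the settings; it does not connect to either database.

```python
def retention_settings(days, table="logs", field="timestamp"):
    """Express one retention period in both databases' terms.

    `table` and `field` are hypothetical names used for illustration.
    """
    ttl_seconds = days * 24 * 60 * 60
    return {
        # Cassandra: default TTL applied to every row written afterwards.
        "cql": f"ALTER TABLE {table} WITH default_time_to_live = {ttl_seconds};",
        # MongoDB: a TTL index on a date field; with pymongo this would be
        # collection.create_index(keys, **options).
        "mongo_index_keys": [(field, 1)],
        "mongo_index_options": {"expireAfterSeconds": ttl_seconds},
    }


settings = retention_settings(30)
print(settings["cql"])
```

In both cases the period lives in configuration rather than in application code, which matches the "configurable amount of time" requirement.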

3 answers

#1


5  

Sparse columnar datastores such as Apache Cassandra are excellent at aggregating time series data. See the following articles for examples:

#2


2  

MongoDB does sound like a good fit for your requirements. Here's why:

  • indices: since you want to run occasional queries, it's nice not to have to maintain indices in your app or run a separate search application (Lucene).
  • scales well (built-in sharding support, replication)
  • writes can be asynchronous (fire-and-forget was the default in older drivers; you can make them synchronous), that is, non-blocking and fast. You might lose a few in certain failure scenarios, but for logs and analytics that wouldn't make a difference.
  • fairly powerful query API (not relational, no joins, but better than the other NoSQL key-value stores, and it sounds more powerful than what Cassandra offers).
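To make the query API point concrete, here is a minimal sketch that builds a MongoDB filter document for the kinds of queries described in the question. The field names (`key`, `severity`, `tag`, `timestamp`, `message`) and the suggested compound index are assumptions; the function only builds plain Python data that could be passed to a driver call such as pymongo's `collection.find(...)`.

```python
import re
from datetime import datetime

# Assumed compound index for the common case: exact key, newest first.
SUGGESTED_INDEX = [("key", 1), ("timestamp", -1)]


def build_log_filter(key=None, max_severity=None, tag=None,
                     start=None, end=None, text=None):
    """Build a MongoDB filter document from optional criteria."""
    query = {}
    if key is not None:
        query["key"] = key
    if max_severity is not None:
        # Syslog severities run from 0 (emergency) to 7 (debug),
        # so "at least this severe" means a number <= the threshold.
        query["severity"] = {"$lte": max_severity}
    if tag is not None:
        query["tag"] = tag
    if start is not None or end is not None:
        time_range = {}
        if start is not None:
            time_range["$gte"] = start
        if end is not None:
            time_range["$lt"] = end
        query["timestamp"] = time_range
    if text is not None:
        # Rare free-text search; an unanchored regex cannot use an ordinary
        # index efficiently, so it is best combined with the filters above.
        query["message"] = {"$regex": re.escape(text)}
    return query


example = build_log_filter(key="abc123", max_severity=3,
                           start=datetime(2022, 10, 17),
                           end=datetime(2022, 10, 18))
print(example)
```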

You might even find a configuration that works without sharding. For example, by default it syncs to disk every 60 seconds, which means up to 60 seconds of writes are buffered, reducing IO. I've tried it with half a terabyte of data on a single machine, and queries on a single indexed field ran in roughly 100-200 ms.

#3


0  

Given that your system will be a high write throughput application I would recommend Cassandra.

I have put together a high-level overview of the differences between MongoDB and Cassandra here: https://scalegrid.io/blog/cassandra-vs-mongodb/
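For the write-heavy case, one common way to lay out time series in Cassandra is to bucket rows into partitions by a key plus a day, with the timestamp as a clustering column. The sketch below is an illustration under assumptions, not a recommended schema: the table and column names are hypothetical, and the Python code only mimics how rows would group into partitions.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical CQL table this layout corresponds to. A real table would
# likely add a unique clustering column (such as a timeuuid) so that two
# entries with the same tag and timestamp do not overwrite each other.
SCHEMA = """
CREATE TABLE logs (
    tag text, day text, ts timestamp, severity int, message text,
    PRIMARY KEY ((tag, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC);
"""


def partition_key(tag, ts):
    """Bucket by tag and calendar day so no single partition grows unbounded."""
    return (tag, ts.strftime("%Y-%m-%d"))


def group_into_partitions(entries):
    """Group entries by partition key, newest first within each partition."""
    partitions = defaultdict(list)
    for e in entries:
        partitions[partition_key(e["tag"], e["ts"])].append(e)
    for rows in partitions.values():
        rows.sort(key=lambda e: e["ts"], reverse=True)
    return dict(partitions)
```

With this layout, a query for one tag on one day reads a single partition in timestamp order, while writes for different tags and days spread across the cluster.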
