After analyzing some gigabytes of logfiles with grep and the like, I was wondering how to make this easier by logging into a database instead. What database would be appropriate for this purpose? A vanilla SQL database works, of course, but it provides lots of transactional guarantees etc. which you don't need here, and which might make it slow if you work with gigabytes of data and very fast insertion rates. So a NoSQL database could be the right answer (compare this answer for some suggestions). Some requirements for the database would be:
- Ability to cope with gigabytes or maybe even terabytes of data
- Fast insertion
- Multiple indices on each entry should be possible (e.g. time, session id, URL etc.); a minimal relational sketch of this is shown after this list
- If possible, it should store the data in a compressed form, since logfiles are usually extremely repetitive.
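To illustrate the indexing requirement, here is a minimal sketch of one row per log line with separate indices on time, session and URL. The table and column names are invented, and sqlite3 is used only because it ships with Python; any SQL database would look similar:

    import sqlite3

    # Hypothetical schema: one row per log line, indexed by time, session and URL.
    conn = sqlite3.connect("logs.db")
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS log_entries (
            ts         TEXT,   -- ISO-8601 timestamp
            session_id TEXT,
            url        TEXT,
            line       TEXT    -- the raw (or remaining) log line
        );
        CREATE INDEX IF NOT EXISTS idx_ts      ON log_entries (ts);
        CREATE INDEX IF NOT EXISTS idx_session ON log_entries (session_id);
        CREATE INDEX IF NOT EXISTS idx_url     ON log_entries (url);
    """)

    # Fast insertion: one transaction per batch instead of one per line.
    rows = [("2011-05-10T12:00:01", "abc123", "/index.html", "GET /index.html 200")]
    with conn:
        conn.executemany("INSERT INTO log_entries VALUES (?, ?, ?, ?)", rows)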
Update: There are already some SO questions on this: Database suggestion for processing/reporting on large amount of log file type data and What are good NoSQL and non-relational database solutions for audit/logging database. However, I am curious which databases fulfill which requirements.
3 Answers
#1
5
After having tried a lot of nosql solutions, my best bets would be:
- riak + riak search for great scalability
- unnormalized data in mysql/postgresql
- mongoDB if you don't mind waiting
- couchdb if you KNOW what you're searching for
Riak + Riak Search scale easily (REALLY!) and allow you free-form queries over your data. You can also easily mix data schemas, and maybe even compress data by using Innostore as a backend.
MongoDB is annoying to scale over several gigabytes of data if you really want to use indexes and not slow down to a crawl. It is really fast in terms of single-node performance and offers easy index creation, but as soon as your working data set doesn't fit in memory anymore, it becomes a problem...
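As a rough sketch of the index creation mentioned above (field names are made up, and this assumes the pymongo driver and a local mongod):

    from pymongo import MongoClient, ASCENDING

    # Hypothetical collection of parsed log lines.
    coll = MongoClient("mongodb://localhost:27017").logdb.entries

    coll.insert_one({"ts": "2011-05-10T12:00:01",
                     "session_id": "abc123",
                     "url": "/index.html",
                     "line": "GET /index.html 200"})

    # Secondary indexes; each one has to stay in RAM to remain fast.
    coll.create_index([("ts", ASCENDING)])
    coll.create_index([("session_id", ASCENDING), ("ts", ASCENDING)])
    coll.create_index([("url", ASCENDING)])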
mysql/postgresql are still pretty fast and allow free-form queries thanks to the usual B+tree indexes. Look at Postgres partial indexes if some of the fields don't show up in every record. They also offer compressed tables, and since the schema is fixed you don't store the field names over and over again with every row (which is what usually happens in a lot of the nosql solutions).
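A small sketch of such a partial index, assuming psycopg2, the invented log_entries table from above, and a session_id field that is often missing:

    import psycopg2

    # Connection parameters are placeholders.
    conn = psycopg2.connect("dbname=logs user=loguser")
    with conn, conn.cursor() as cur:
        # Index only the rows that actually have a session id,
        # which keeps the index small for sparse fields.
        cur.execute("""
            CREATE INDEX IF NOT EXISTS idx_session_partial
            ON log_entries (session_id)
            WHERE session_id IS NOT NULL
        """)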
CouchDB is nice if you already know the queries you want to see; its incremental map/reduce based views are a great system for that.
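For example, a view that counts hits per URL could be stored as a design document over plain HTTP. The database name, view name and fields are made up; this uses the requests library against a local CouchDB:

    import requests

    couch = "http://localhost:5984/logs"

    # Design document with an incrementally maintained map/reduce view.
    requests.put(couch + "/_design/stats", json={
        "views": {
            "hits_by_url": {
                "map": "function(doc) { if (doc.url) emit(doc.url, 1); }",
                "reduce": "_count",
            }
        }
    })

    # Query the precomputed view, grouped by URL.
    print(requests.get(couch + "/_design/stats/_view/hits_by_url",
                       params={"group": "true"}).json())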
#2
3
There are a lot of different options that you could look into. You could use Hive for your analytics and Flume to consume and load the log files. MongoDB might also be a good option for you; take a look at this article on log analytics with MongoDB, Ruby, and Google Charts.
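As a rough idea of the Hive side of that setup (table layout, HDFS path and connection details are invented; this assumes the PyHive client and tab-separated logs delivered to HDFS by Flume):

    from pyhive import hive

    cur = hive.Connection(host="localhost").cursor()

    # External table over the directory Flume writes into.
    cur.execute(r"""
        CREATE EXTERNAL TABLE IF NOT EXISTS log_entries (
            ts STRING, session_id STRING, url STRING, line STRING
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
        LOCATION '/flume/weblogs'
    """)

    # Example report: requests per URL.
    cur.execute("SELECT url, COUNT(*) FROM log_entries GROUP BY url")
    print(cur.fetchall())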
#3
1
Depending on your needs, Splunk might be a good option. It is more than just a database, and you get all kinds of reporting with it. Plus it is designed to be a log file replacement, so they have already solved the scaling issues.