For one of my projects, I have to enter a big-ish collection of events into a database for later processing and I am trying to decide which DBMS would be best for my purpose.
I have:
- About 400,000,000 discrete events at the moment
- About 600 GB of data that will be stored in the DB
These events come in a variety of formats, but I estimate the count of individual attributes to be about 5,000. Most events only contain values for about 100 attributes each. The attribute values are to be treated as arbitrary strings and, in some cases, integers.
The events will eventually be consolidated into a single time series. While they do have some internal structure, there are no references to other events, which - I believe - means that I don't need an object DB or some ORM system.
My requirements:
- Open source license - I may have to tweak it a bit.
- Scalability by being able to expand to multiple servers, although only one system will be used at first.
- Fast queries - updates are not that critical.
- Mature drivers/bindings for C/C++, Java and Python, preferably with a license that plays well with others - I'd rather not commit myself to anything because of a technical decision. I think most DB drivers are fine here, but it is worth mentioning anyway.
- Availability for Linux.
- It would be nice, but not necessary, if it were also available for Windows.
My ideal DB for this would allow me to retrieve all the events from a specified time period with a single query.
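To make that concrete, here is a minimal sketch of the kind of single-query retrieval I mean, using SQLite in Python (the table and column names are made up for illustration; the real schema is still open):

```python
import sqlite3

# Illustrative only: an events table keyed by timestamp, with an index
# so time-range queries do not scan the whole table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts INTEGER, payload TEXT)")
conn.execute("CREATE INDEX idx_events_ts ON events (ts)")
conn.executemany("INSERT INTO events VALUES (?, ?)",
                 [(100, "a"), (150, "b"), (200, "c"), (250, "d")])

# Retrieve all events from a specified time period with a single query.
rows = conn.execute(
    "SELECT ts, payload FROM events WHERE ts BETWEEN ? AND ?", (100, 200)
).fetchall()
print(rows)  # [(100, 'a'), (150, 'b'), (200, 'c')]
```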
What I have found/considered so far:
- PostgreSQL with an increased page size can apparently have up to 6,000 columns per table. If my estimate of the attribute count is not far off, it might do.
- MySQL seems to have a limit of 4,000 columns per table. I could use multiple tables with a bit of SQL-fu, but I'd rather not.
- MongoDB is what I am currently leaning towards. It would allow me to preserve the internal structure of the events while still being able to query them. Its API also seems quite straightforward. I have no idea how well it does performance-wise, though - at least on a single server.
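As a sketch of how I imagine the MongoDB option would look: each event becomes one document carrying only the attributes it actually has, and a time-range query uses a `$gte`/`$lt` filter. The field names below are my own invention, and I emulate the filter with plain dicts so it runs without a server:

```python
from datetime import datetime

# Illustrative event documents: a timestamp plus an arbitrary subset
# of the ~5,000 possible attributes (field names are made up).
events = [
    {"ts": datetime(2013, 1, 1), "attrs": {"color": "red", "size": 3}},
    {"ts": datetime(2013, 1, 5), "attrs": {"shape": "round"}},
    {"ts": datetime(2013, 2, 1), "attrs": {"color": "blue"}},
]

# The MongoDB query would be: {"ts": {"$gte": start, "$lt": end}}
# Here the same half-open range filter is applied in plain Python.
start, end = datetime(2013, 1, 1), datetime(2013, 2, 1)
selected = [e for e in events if start <= e["ts"] < end]
print(len(selected))  # 2
```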
- OpenTSDB and its metric collection framework sound interesting. I could use a single time series for each attribute (which might help with some of my processing), have the attribute value as a tag, and additionally tag the entries to associate them with a specific event. It probably has a steeper learning curve than the three above, from both an administrator's and an application programmer's point of view. No idea about its performance.
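For illustration, the mapping I have in mind could emit OpenTSDB telnet-style `put` lines like the ones below. The metric and tag naming scheme is my own sketch, not an established convention, and a real deployment would need escaping, since OpenTSDB only allows a limited character set in tag values:

```python
def event_to_puts(event_id, ts, attrs):
    """Render one event as OpenTSDB telnet-style 'put' lines: one series
    per attribute, the attribute value carried as a tag (values must be
    numeric, so a constant 1 is stored), plus an event-id tag to tie the
    datapoints back together. Naming scheme is illustrative only."""
    lines = []
    for name, value in sorted(attrs.items()):
        lines.append(
            f"put event.attr.{name} {ts} 1 value={value} event_id={event_id}"
        )
    return lines

for line in event_to_puts(42, 1356998400, {"color": "red", "size": "3"}):
    print(line)
# put event.attr.color 1356998400 1 value=red event_id=42
# put event.attr.size 1356998400 1 value=3 event_id=42
```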
- Use HBase directly. This might fit my requirements better than OpenTSDB, although - judging from my past experience with Hadoop - the administration overhead is probably still higher than with the first three options.
There are probably other databases that could do it, so feel free to let me know - I would appreciate any suggestion or comment that might help me with this.
PS: I only have minimal experience as a DB administrator, so I apologise for any misconceptions.
2 Answers
#1
Using tables with thousands of columns is madness, especially when most of them are empty, as you said.
You should first look into converting your data-structure from this:
table_1
-------
event_id
attribute_1
attribute_2
[...]
attribute_5000
into something like this:
table_1      event_values       attributes
--------     ------------       ----------
event_id     event_id           attribute_id
             attribute_id       attribute_type
             attribute_value
which can be used with any RDBMS (your only constraints then would be the total database size and performance).
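A minimal sketch of that narrow layout in SQLite (column and value names are illustrative): each event stores only the attributes it actually has, instead of a 5,000-column row.

```python
import sqlite3

# Sketch of the attribute/value layout above; names are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attributes (attribute_id INTEGER PRIMARY KEY,
                         name TEXT, attribute_type TEXT);
CREATE TABLE table_1 (event_id INTEGER PRIMARY KEY, ts INTEGER);
CREATE TABLE event_values (event_id INTEGER REFERENCES table_1,
                           attribute_id INTEGER REFERENCES attributes,
                           attribute_value TEXT);
""")
conn.execute("INSERT INTO attributes VALUES (1, 'color', 'string')")
conn.execute("INSERT INTO table_1 VALUES (10, 1000)")
conn.execute("INSERT INTO event_values VALUES (10, 1, 'red')")

# Reassemble one event's attributes with a join - only stored
# attribute/value pairs come back, never thousands of empty columns.
row = conn.execute("""
    SELECT a.name, v.attribute_value
    FROM event_values v JOIN attributes a USING (attribute_id)
    WHERE v.event_id = 10
""").fetchone()
print(row)  # ('color', 'red')
```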
#2
It is probably very late for an answer, but here is what I do.
I use HDF5 as my time series repository. It has a number of effective and fast compression styles which can be mixed and matched. It can be used with a number of different programming languages. It is available on Windows as well as Linux.
I use boost::date_time for the timestamp field. This allows a large variety of datetime-based computations.
In the financial realm, I then create specific data structures for each of bars, ticks, trades, quotes, ...
I created a number of custom iterators and used standard template library algorithms to be able to efficiently search for specific values or ranges of time-based records. The selections can then be loaded into memory.
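The search pattern is ordinary binary search over timestamp-sorted records; my C++ version does this with STL algorithms like std::lower_bound/std::upper_bound over the custom iterators. A stand-in sketch in Python, using bisect over an in-memory list:

```python
from bisect import bisect_left, bisect_right

# Records sorted by timestamp; bisect gives O(log n) range lookup,
# analogous to std::lower_bound/std::upper_bound over sorted iterators.
records = [(100, "tick"), (150, "trade"), (200, "quote"), (250, "bar")]
timestamps = [ts for ts, _ in records]

def time_range(lo, hi):
    """Return all records with lo <= timestamp <= hi."""
    return records[bisect_left(timestamps, lo):bisect_right(timestamps, hi)]

print(time_range(150, 200))  # [(150, 'trade'), (200, 'quote')]
```

The selected slice can then be loaded into memory for further processing, as described above.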