存储时间序列数据,关系数据还是非关系数据?

时间:2021-10-11 16:55:15

I am creating a system which polls devices for data on varying metrics such as CPU utilisation, disk utilisation, temperature etc. at (probably) 5 minute intervals using SNMP. The ultimate goal is to provide visualisations to a user of the system in the form of time-series graphs.

我正在创建一个系统,该系统可以在使用SNMP的5分钟间隔内,根据不同的指标,如CPU使用率、磁盘利用率、温度等,来投票。最终目标是以时间序列图的形式为系统用户提供可视化。

I have looked at using RRDTool in the past, but rejected it as storing the captured data indefinitely is important to my project, and I want higher level and more flexible access to the captured data. So my question is really:

我曾经考虑过使用RRDTool,但是拒绝使用它,因为无限期地存储捕获的数据对我的项目很重要,我希望对捕获的数据有更高的级别和更灵活的访问。我的问题是

What is better, a relational database (such as MySQL or PostgreSQL) or a non-relational or NoSQL database (such as MongoDB or Redis) with regard to performance when querying data for graphing.

更好的是关系数据库(如MySQL或PostgreSQL)或非关系数据库或NoSQL数据库(如MongoDB或Redis)在查询数据以获取图形时的性能。

Relational

Given a relational database, I would use a data_instances table, in which would be stored every instance of data captured for every metric being measured for all devices, with the following fields:

给定一个关系数据库,我将使用data_instances表,在该表中,将存储为所有设备所捕获的每个度量的数据的每个实例,包括以下字段:

Fields: id fk_to_device fk_to_metric metric_value timestamp

字段:id fk_to_device fk_to_metric metric_value timestamp

When I want to draw a graph for a particular metric on a particular device, I must query this singular table filtering out the other devices, and the other metrics being analysed for this device:

当我想在一个特定的设备上绘制一个特定指标的图时,我必须查询这个奇异的表过滤掉其他设备,以及对这个设备进行分析的其他指标:

SELECT metric_value, timestamp FROM data_instances
    WHERE fk_to_device=1 AND fk_to_metric=2

The number of rows in this table would be:

该表中的行数为:

d * m_d * f * t

where d is the number of devices, m_d is the accumulative number of metrics being recorded for all devices, f is the frequency at which data is polled for and t is the total amount of time the system has been collecting data.

其中d为设备数量,m_d为所有设备记录的累积度量数,f为数据轮询的频率,t为系统收集数据的总时间。

For a user recording 10 metrics for 3 devices every 5 minutes for a year, we would have just under 5 million records.

对于一个一年内每5分钟记录3台设备10个指标的用户来说,我们将有不到500万的记录。

Indexes

Without indexes on fk_to_device and fk_to_metric scanning this continuously expanding table would take too much time. So indexing the aforementioned fields and also timestamp (for creating graphs with localised periods) is a requirement.

在fk_to_device和fk_to_metric扫描上没有索引,这个不断扩展的表将花费太多时间。因此,对上述字段和时间戳(用于创建具有本地化周期的图)进行索引是一种需求。

Non-Relational (NoSQL)

MongoDB has the concept of a collection, unlike tables these can be created programmatically without setup. With these I could partition the storage of data for each device, or even each metric recorded for each device.

MongoDB具有集合的概念,与无需设置就可以以编程方式创建的表不同。有了这些,我可以为每个设备划分数据存储,甚至为每个设备记录的每个度量。

I have no experience with NoSQL and do not know if they provide any query performance enhancing features such as indexing, however the previous paragraph proposes doing most of the traditional relational query work in the structure by which the data is stored under NoSQL.

我没有使用NoSQL的经验,也不知道它们是否提供了任何查询性能增强特性(如索引),但是上一段建议在NoSQL下存储数据的结构中执行大多数传统的关系查询工作。

Undecided

Would a relational solution with correct indexing reduce to a crawl within the year? Or does the collection based structure of NoSQL approaches (which matches my mental model of the stored data) provide a noticeable benefit?

具有正确索引的关系解决方案是否会在一年内减少为爬行?或者NoSQL方法的基于集合的结构(它与我的存储数据的心理模型相匹配)是否提供了明显的好处?

10 个解决方案

#1


138  

Definitely Relational. Unlimited flexibility and expansion.

绝对的关系。无限的灵活性和扩张。

Two corrections, both in concept and application, followed by an elevation.

在概念和应用上都做了两次修正,然后是一次提升。

Correction

  1. It is not "filtering out the un-needed data"; it is selecting only the needed data. Yes, of course, if you have an Index to support the columns identified in the WHERE clause, it is very fast, and the query does not depend on the size of the table (grabbing 1,000 rows from a 16 billion row table is instantaneous).

    它不是“过滤掉不必要的数据”;它只选择所需的数据。是的,当然,如果您有一个索引来支持WHERE子句中标识的列,那么它是非常快的,并且查询不依赖于表的大小(从160亿行表获取1000行是瞬时的)。

  2. Your table has one serious impediment. Given your description, the actual PK is (Device, Metric, DateTime). (Please don't call it TimeStamp, that means something else, but that is a minor issue.) The uniqueness of the row is identified by:

    你的桌子有一个严重的障碍。根据您的描述,实际的PK是(设备、度量、DateTime)。(请不要称它为时间戳,这意味着其他事情,但这是一个小问题。)行的唯一性由:

       (Device, Metric, DateTime)
    
    • The Id column does nothing, it is totally and completely redundant.

      Id列什么都不做,它完全是冗余的。

      • An Id column is never a Key (duplicate rows, which are prohibited in a Relational database, must be prevented by other means).
      • Id列从来不是键(关系数据库中禁止重复的行,必须通过其他方法加以阻止)。
      • The Id column requires an additional Index, which obviously impedes the speed of INSERT/DELETE, and adds to the disk space used.

        Id列需要一个额外的索引,这显然会阻碍插入/删除的速度,并增加所使用的磁盘空间。

      • You can get rid of it. Please.

        你可以摆脱它。请。

Elevation

  1. Now that you have removed the impediment, you may not have recognised it, but your table is in Sixth Normal Form. Very high speed, with just one Index on the PK. For understanding, read this answer from the What is Sixth Normal Form ? heading onwards.

    既然你已经消除了障碍,你可能还没有意识到它,但是你的桌子是第六个正常的形式。非常高的速度,只有一个指标在PK上。为了理解,请阅读这个答案从什么是第六范式?标题起。

    • (I have one index only, not three; on the Non-SQLs you may need three indices).

      (我只有一个索引,不是三个;对于非sql,您可能需要三个索引)。

    • I have the exact same table (without the Id "key", of course). I have an additional column Server. I support multiple customers remotely.

      我有相同的表(当然没有Id“key”)。我还有一个附加的列服务器。我远程支持多个客户。

      (Server, Device, Metric, DateTime)

      (服务器、设备指标,DateTime)

    The table can be used to Pivot the data (ie. Devices across the top and Metrics down the side, or pivoted) using exactly the same SQL code (yes, switch the cells). I use the table to erect an unlimited variety of graphs and charts for customers re their server performance.

    该表可用于数据透视。通过使用完全相同的SQL代码(是的,切换单元格),在顶部设置设备,在底部设置度量值(或者转向)。我使用该表来为客户建立各种各样的图形和图表,以反映他们的服务器性能。

    • Monitor Statistics Data Model.
      (Too large for inline; some browsers cannot load inline; click the link. Also that is the obsolete demo version, for obvious reasons, I cannot show you commercial product DM.)

      监控统计数据模型。(太大,内联;有些浏览器不能内联加载;点击链接。此外,这是过时的演示版本,出于显而易见的原因,我无法向您展示商业产品DM。

    • It allows me to produce Charts Like This, six keystrokes after receiving a raw monitoring stats file from the customer, using a single SELECT command. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. Of course, there is no limit to the number of stats matrices, and thus the charts. (Used with the customer's kind permission.)

      它允许我生成这样的图表,在收到客户的原始监控数据文件后,使用一个SELECT命令,6个按键。注意到混搭;操作系统和服务器在同一图表上;各种各样的轴心。当然,统计矩阵的数量是没有限制的,因此图表也没有限制。(经客户许可使用)

    • Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the IDEF1X Notation helpful.

      不熟悉建模关系数据库标准的读者可能会发现IDEF1X符号很有帮助。

One More Thing

一件事

Last but not least, SQL is a IEC/ISO/ANSI Standard. The freeware is actually Non-SQL; it is fraudulent to use the term SQL if they do not provide the Standard. They may provide "extras", but they are absent the basics.

最后,SQL是一个IEC/ISO/ANSI标准。freeware实际上是非sql的;如果术语SQL不提供标准,则使用它们是欺诈的。他们可能会提供额外的服务,但他们缺乏基本的服务。

#2


19  

Found very interesting the above answers. Trying to add a couple more considerations here.

发现上面的答案很有趣。这里还需要考虑几个问题。

1) Data aging

1)数据老化

Time-series management usually need to create aging policies. A typical scenario (e.g. monitoring server CPU) requires to store:

时间序列管理通常需要创建老化策略。一个典型的场景(例如监视服务器CPU)需要存储:

  • 1-sec raw samples for a short period (e.g. for 24 hours)

    1秒的样品(例如24小时)

  • 5-min detail aggregate samples for a medium period (e.g. 1 week)

    5分钟内,中期(例1周)

  • 1-hour detail over that (e.g. up to 1 year)

    1小时的细节(例如最多1年)

Although relational models make it possible for sure (my company implemented massive centralized databases for some large customers with tens of thousands of data series) to manage it appropriately, the new breed of data stores add interesting functionalities to be explored like:

虽然关系模型可以确保(我的公司为一些拥有成千上万数据系列的大客户实现了大量集中的数据库)对其进行适当的管理,但是新一代的数据存储增加了有趣的功能,如:

  • automated data purging (see Redis' EXPIRE command)

    自动数据清理(参见Redis的EXPIRE命令)

  • multidimensional aggregations (e.g. map-reduce jobs a-la-Splunk)

    多维聚合(例如map-reduce作业a-la-Splunk)

2) Real-time collection

2)实时收集

Even more importantly some non-relational data stores are inherently distributed and allow for a much more efficient real-time (or near-real time) data collection that could be a problem with RDBMS because of the creation of hotspots (managing indexing while inserting in a single table). This problem in the RDBMS space is typically solved reverting to batch import procedures (we managed it this way in the past) while no-sql technologies have succeeded in massive real-time collection and aggregation (see Splunk for example, mentioned in previous replies).

更重要的是,一些非关系数据存储本身是分布式的,并且允许更高效的实时(或接近实时)数据收集,这可能是RDBMS的问题,因为创建了热点(在一个表中插入时管理索引)。RDBMS空间中的这个问题通常可以解决返回到批导入过程(过去我们是这样管理的),而非sql技术已经成功地实现了大量的实时收集和聚合(参见前面的回复中提到的Splunk)。

#3


7  

You table has data in single table. So relational vs non relational is not the question. Basically you need to read a lot of sequential data. Now if you have enough RAM to store a years worth data then nothing like using Redis/MongoDB etc.

表中有单个表中的数据。所以关系和非关系不是问题。基本上你需要阅读大量的顺序数据。如果你有足够的内存来存储一年的数据,那就不像使用Redis/MongoDB之类的东西了。

Mostly NoSQL databases will store your data on same location on disk and in compressed form to avoid multiple disk access.

大多数NoSQL数据库将把数据存储在磁盘上和压缩格式的相同位置,以避免多个磁盘访问。

NoSQL does the same thing as creating the index on device id and metric id, but in its own way. With database even if you do this the index and data may be at different places and there would be a lot of disk IO.

NoSQL做的事情与在设备id和度量id上创建索引是一样的,但是是以它自己的方式。使用数据库,即使这样做,索引和数据也可能位于不同的位置,并且会有很多磁盘IO。

Tools like Splunk are using NoSQL backends to store time series data and then using map reduce to create aggregates (which might be what you want later). So in my opinion to use NoSQL is an option as people have already tried it for similar use cases. But will a million rows bring the database to crawl (maybe not , with decent hardware and proper configurations).

Splunk等工具使用NoSQL backends来存储时间序列数据,然后使用map reduce创建聚合(这可能是您稍后需要的)。因此,在我看来,使用NoSQL是一种选择,因为人们已经在类似的用例中尝试过了。但是一百万行将会使数据库爬行(可能不会,有适当的硬件和适当的配置)。

#4


3  

If you are looking at GPL packages, RRDTool is a good one to look at. It is a good tool for storing, extracting and graphing times-series data. Your use-case looks exactly like time-series data.

如果您正在查看GPL包,RRDTool是一个很好的工具。它是存储、提取和绘制时间序列数据的好工具。您的用例看起来完全像时间序列数据。

#5


3  

Create a file, name it 1_2.data. weired idea? what you get:

创建一个文件,命名为1_2.data。堰主意吗?你得到什么:

  • You save up to 50% of space because you don't need to repeat the fk_to_device and fk_to_metric value for every data point.
  • 您可以节省50%的空间,因为您不需要为每个数据点重复fk_to_device和fk_to_metric值。
  • You save up even more space because you don't need any indices.
  • 你节省了更多的空间,因为你不需要任何指标。
  • Save pairs of (timestamp,metric_value) to the file by appending the data so you get a order by timestamp for free. (assuming that your sources don't send out of order data for a device)
  • 通过添加数据将(timestamp,metric_value)对保存到文件中,这样您就可以通过timestamp免费获得订单。(假设您的消息源不发送设备的订单数据)

=> Queries by timestamp run amazingly fast because you can use binary search to find the right place in the file to read from.

=>查询通过时间戳运行得非常快,因为您可以使用二进制搜索在文件中找到要读取的正确位置。

if you like it even more optimized start thinking about splitting your files like that;

如果你喜欢它甚至更优化开始考虑像这样分割你的文件;

  • 1_2_january2014.data
  • 1 _2_january2014.data
  • 1_2_february2014.data
  • 1 _2_february2014.data
  • 1_2_march2014.data
  • 1 _2_march2014.data

or use kdb+ from http://kx.com because they do all this for you:) column-oriented is what may help you.

或者使用来自http://kx.com的kdb+,因为它们都是为您做的:)面向列是可能对您有所帮助的。

There is a cloud-based column-oriented solution popping up, so you may want to have a look at: http://timeseries.guru

现在出现了一个基于云的面向列的解决方案,所以您可能想看看:http://timeseries.guru

#6


2  

This is a problem we've had to solve at ApiAxle. We wrote up a blog post on how we did it using Redis. It hasn't been out there for very long but it's proving to be effective.

这是我们在apishaft必须解决的问题。我们写了一篇关于如何使用Redis的博文。它已经很久没有出现了,但是它被证明是有效的。

I've also used RRDTool for another project which was excellent.

我还用RRDTool做了另一个非常好的项目。

#7


2  

I think that the answer for this kind of question should mainly revolve about the way your Database utilize storage. Some Database servers use RAM and Disk, some use RAM only (optionally Disk for persistency), etc. Most common SQL Database solutions are using memory+disk storage and writes the data in a Row based layout (every inserted raw is written in the same physical location). For timeseries stores, in most cases the workload is something like: Relatively-low interval of massive amount of inserts, while reads are column based (in most cases you want to read a range of data from a specific column, representing a metric)

我认为,这类问题的答案应该主要围绕数据库利用存储的方式。有些数据库服务器使用RAM和磁盘,有些只使用RAM(可选的磁盘用于持久性),等等。大多数常见的SQL数据库解决方案都使用内存+磁盘存储,并将数据写入基于行的布局中(每个插入的原始数据都写入相同的物理位置)。对于timeseries存储而言,在大多数情况下工作负载是这样的:大量插入的相对低间隔,而读取是基于列的(在大多数情况下,您希望从特定的列中读取一系列数据,表示一个度量)

I have found Columnar Databases (google it, you'll find MonetDB, InfoBright, parAccel, etc) are doing terrific job for time series.

我发现柱状数据库(谷歌it, MonetDB, InfoBright, parAccel等)在时间序列方面做得非常好。

As for your question, which personally I think is somewhat invalid (as all discussions using the fault term NoSQL - IMO): You can use a Database server that can talk SQL on one hand, making your life very easy as everyone knows SQL for many years and this language has been perfected over and over again for data queries; but still utilize RAM, CPU Cache and Disk in a Columnar oriented way, making your solution best fit Time Series

至于你的问题,我个人认为有点无效(因为所有讨论使用故障项NoSQL -国际海事组织):你可以使用一个数据库服务器,可以讨论SQL一方面,使得你的生活非常容易众所周知SQL多年,这种语言已经完善了一遍又一遍数据查询;但是仍然要以柱状的方式利用RAM、CPU缓存和磁盘,使您的解决方案最适合时间序列。

#8


2  

5 Millions of rows is nothing for today's torrential data. Expect data to be in the TB or PB in just a few months. At this point RDBMS do not scale to the task and we need the linear scalability of NoSql databases. Performance would be achieved for the columnar partition used to store the data, adding more columns and less rows kind of concept to boost performance. Leverage the Open TSDB work done on top of HBASE or MapR_DB, etc.

5百万行对于今天的大量数据来说是毫无意义的。预计数据将在短短几个月内出现在TB或PB中。此时,RDBMS不适合这个任务,我们需要NoSql数据库的线性可扩展性。对于用于存储数据的柱状分区的性能,增加了更多的列和较少的行类型来提高性能。利用在HBASE或MapR_DB上完成的打开的TSDB工作。

#9


1  

I face similar requirements regularly, and have recently started using Zabbix to gather and store this type of data. Zabbix has its own graphing capability, but it's easy enough to extract the data out of Zabbix's database and process it however you like. If you haven't already checked Zabbix out, you might find it worth your time to do so.

我经常遇到类似的需求,最近开始使用Zabbix收集和存储这类数据。Zabbix有自己的图形处理能力,但是从Zabbix的数据库中提取数据并按照自己的方式进行处理非常容易。如果你还没有检查Zabbix,你可能会发现花时间检查是值得的。

#10


0  

You should look into Time series database. It was created for this purpose.

您应该查看时间序列数据库。它就是为了这个目的而创建的。

A time series database (TSDB) is a software system that is optimized for handling time series data, arrays of numbers indexed by time (a datetime or a datetime range).

时间序列数据库(TSDB)是为处理时间序列数据而优化的软件系统,即按时间(datetime或datetime range)索引的数字数组。

Popular example of time-series database InfluxDB

时间序列数据库的流行示例

#1


138  

Definitely Relational. Unlimited flexibility and expansion.

绝对的关系。无限的灵活性和扩张。

Two corrections, both in concept and application, followed by an elevation.

在概念和应用上都做了两次修正,然后是一次提升。

Correction

  1. It is not "filtering out the un-needed data"; it is selecting only the needed data. Yes, of course, if you have an Index to support the columns identified in the WHERE clause, it is very fast, and the query does not depend on the size of the table (grabbing 1,000 rows from a 16 billion row table is instantaneous).

    它不是“过滤掉不必要的数据”;它只选择所需的数据。是的,当然,如果您有一个索引来支持WHERE子句中标识的列,那么它是非常快的,并且查询不依赖于表的大小(从160亿行表获取1000行是瞬时的)。

  2. Your table has one serious impediment. Given your description, the actual PK is (Device, Metric, DateTime). (Please don't call it TimeStamp, that means something else, but that is a minor issue.) The uniqueness of the row is identified by:

    你的桌子有一个严重的障碍。根据您的描述,实际的PK是(设备、度量、DateTime)。(请不要称它为时间戳,这意味着其他事情,但这是一个小问题。)行的唯一性由:

       (Device, Metric, DateTime)
    
    • The Id column does nothing, it is totally and completely redundant.

      Id列什么都不做,它完全是冗余的。

      • An Id column is never a Key (duplicate rows, which are prohibited in a Relational database, must be prevented by other means).
      • Id列从来不是键(关系数据库中禁止重复的行,必须通过其他方法加以阻止)。
      • The Id column requires an additional Index, which obviously impedes the speed of INSERT/DELETE, and adds to the disk space used.

        Id列需要一个额外的索引,这显然会阻碍插入/删除的速度,并增加所使用的磁盘空间。

      • You can get rid of it. Please.

        你可以摆脱它。请。

Elevation

  1. Now that you have removed the impediment, you may not have recognised it, but your table is in Sixth Normal Form. Very high speed, with just one Index on the PK. For understanding, read this answer from the What is Sixth Normal Form ? heading onwards.

    既然你已经消除了障碍,你可能还没有意识到它,但是你的桌子是第六个正常的形式。非常高的速度,只有一个指标在PK上。为了理解,请阅读这个答案从什么是第六范式?标题起。

    • (I have one index only, not three; on the Non-SQLs you may need three indices).

      (我只有一个索引,不是三个;对于非sql,您可能需要三个索引)。

    • I have the exact same table (without the Id "key", of course). I have an additional column Server. I support multiple customers remotely.

      我有相同的表(当然没有Id“key”)。我还有一个附加的列服务器。我远程支持多个客户。

      (Server, Device, Metric, DateTime)

      (服务器、设备指标,DateTime)

    The table can be used to Pivot the data (ie. Devices across the top and Metrics down the side, or pivoted) using exactly the same SQL code (yes, switch the cells). I use the table to erect an unlimited variety of graphs and charts for customers re their server performance.

    该表可用于数据透视。通过使用完全相同的SQL代码(是的,切换单元格),在顶部设置设备,在底部设置度量值(或者转向)。我使用该表来为客户建立各种各样的图形和图表,以反映他们的服务器性能。

    • Monitor Statistics Data Model.
      (Too large for inline; some browsers cannot load inline; click the link. Also that is the obsolete demo version, for obvious reasons, I cannot show you commercial product DM.)

      监控统计数据模型。(太大,内联;有些浏览器不能内联加载;点击链接。此外,这是过时的演示版本,出于显而易见的原因,我无法向您展示商业产品DM。

    • It allows me to produce Charts Like This, six keystrokes after receiving a raw monitoring stats file from the customer, using a single SELECT command. Notice the mix-and-match; OS and server on the same chart; a variety of Pivots. Of course, there is no limit to the number of stats matrices, and thus the charts. (Used with the customer's kind permission.)

      它允许我生成这样的图表,在收到客户的原始监控数据文件后,使用一个SELECT命令,6个按键。注意到混搭;操作系统和服务器在同一图表上;各种各样的轴心。当然,统计矩阵的数量是没有限制的,因此图表也没有限制。(经客户许可使用)

    • Readers who are unfamiliar with the Standard for Modelling Relational Databases may find the IDEF1X Notation helpful.

      不熟悉建模关系数据库标准的读者可能会发现IDEF1X符号很有帮助。

One More Thing

一件事

Last but not least, SQL is a IEC/ISO/ANSI Standard. The freeware is actually Non-SQL; it is fraudulent to use the term SQL if they do not provide the Standard. They may provide "extras", but they are absent the basics.

最后,SQL是一个IEC/ISO/ANSI标准。freeware实际上是非sql的;如果术语SQL不提供标准,则使用它们是欺诈的。他们可能会提供额外的服务,但他们缺乏基本的服务。

#2


19  

Found very interesting the above answers. Trying to add a couple more considerations here.

发现上面的答案很有趣。这里还需要考虑几个问题。

1) Data aging

1)数据老化

Time-series management usually need to create aging policies. A typical scenario (e.g. monitoring server CPU) requires to store:

时间序列管理通常需要创建老化策略。一个典型的场景(例如监视服务器CPU)需要存储:

  • 1-sec raw samples for a short period (e.g. for 24 hours)

    1秒的样品(例如24小时)

  • 5-min detail aggregate samples for a medium period (e.g. 1 week)

    5分钟内,中期(例1周)

  • 1-hour detail over that (e.g. up to 1 year)

    1小时的细节(例如最多1年)

Although relational models make it possible for sure (my company implemented massive centralized databases for some large customers with tens of thousands of data series) to manage it appropriately, the new breed of data stores add interesting functionalities to be explored like:

虽然关系模型可以确保(我的公司为一些拥有成千上万数据系列的大客户实现了大量集中的数据库)对其进行适当的管理,但是新一代的数据存储增加了有趣的功能,如:

  • automated data purging (see Redis' EXPIRE command)

    自动数据清理(参见Redis的EXPIRE命令)

  • multidimensional aggregations (e.g. map-reduce jobs a-la-Splunk)

    多维聚合(例如map-reduce作业a-la-Splunk)

2) Real-time collection

2)实时收集

Even more importantly some non-relational data stores are inherently distributed and allow for a much more efficient real-time (or near-real time) data collection that could be a problem with RDBMS because of the creation of hotspots (managing indexing while inserting in a single table). This problem in the RDBMS space is typically solved reverting to batch import procedures (we managed it this way in the past) while no-sql technologies have succeeded in massive real-time collection and aggregation (see Splunk for example, mentioned in previous replies).

更重要的是,一些非关系数据存储本身是分布式的,并且允许更高效的实时(或接近实时)数据收集,这可能是RDBMS的问题,因为创建了热点(在一个表中插入时管理索引)。RDBMS空间中的这个问题通常可以解决返回到批导入过程(过去我们是这样管理的),而非sql技术已经成功地实现了大量的实时收集和聚合(参见前面的回复中提到的Splunk)。

#3


7  

You table has data in single table. So relational vs non relational is not the question. Basically you need to read a lot of sequential data. Now if you have enough RAM to store a years worth data then nothing like using Redis/MongoDB etc.

表中有单个表中的数据。所以关系和非关系不是问题。基本上你需要阅读大量的顺序数据。如果你有足够的内存来存储一年的数据,那就不像使用Redis/MongoDB之类的东西了。

Mostly NoSQL databases will store your data on same location on disk and in compressed form to avoid multiple disk access.

大多数NoSQL数据库将把数据存储在磁盘上和压缩格式的相同位置,以避免多个磁盘访问。

NoSQL does the same thing as creating the index on device id and metric id, but in its own way. With database even if you do this the index and data may be at different places and there would be a lot of disk IO.

NoSQL做的事情与在设备id和度量id上创建索引是一样的,但是是以它自己的方式。使用数据库,即使这样做,索引和数据也可能位于不同的位置,并且会有很多磁盘IO。

Tools like Splunk are using NoSQL backends to store time series data and then using map reduce to create aggregates (which might be what you want later). So in my opinion to use NoSQL is an option as people have already tried it for similar use cases. But will a million rows bring the database to crawl (maybe not , with decent hardware and proper configurations).

Splunk等工具使用NoSQL backends来存储时间序列数据,然后使用map reduce创建聚合(这可能是您稍后需要的)。因此,在我看来,使用NoSQL是一种选择,因为人们已经在类似的用例中尝试过了。但是一百万行将会使数据库爬行(可能不会,有适当的硬件和适当的配置)。

#4


3  

If you are looking at GPL packages, RRDTool is a good one to look at. It is a good tool for storing, extracting and graphing times-series data. Your use-case looks exactly like time-series data.

如果您正在查看GPL包,RRDTool是一个很好的工具。它是存储、提取和绘制时间序列数据的好工具。您的用例看起来完全像时间序列数据。

#5


3  

Create a file, name it 1_2.data. weired idea? what you get:

创建一个文件,命名为1_2.data。堰主意吗?你得到什么:

  • You save up to 50% of space because you don't need to repeat the fk_to_device and fk_to_metric value for every data point.
  • 您可以节省50%的空间,因为您不需要为每个数据点重复fk_to_device和fk_to_metric值。
  • You save up even more space because you don't need any indices.
  • 你节省了更多的空间,因为你不需要任何指标。
  • Save pairs of (timestamp,metric_value) to the file by appending the data so you get a order by timestamp for free. (assuming that your sources don't send out of order data for a device)
  • 通过添加数据将(timestamp,metric_value)对保存到文件中,这样您就可以通过timestamp免费获得订单。(假设您的消息源不发送设备的订单数据)

=> Queries by timestamp run amazingly fast because you can use binary search to find the right place in the file to read from.

=>查询通过时间戳运行得非常快,因为您可以使用二进制搜索在文件中找到要读取的正确位置。

if you like it even more optimized start thinking about splitting your files like that;

如果你喜欢它甚至更优化开始考虑像这样分割你的文件;

  • 1_2_january2014.data
  • 1 _2_january2014.data
  • 1_2_february2014.data
  • 1 _2_february2014.data
  • 1_2_march2014.data
  • 1 _2_march2014.data

or use kdb+ from http://kx.com because they do all this for you:) column-oriented is what may help you.

或者使用来自http://kx.com的kdb+,因为它们都是为您做的:)面向列是可能对您有所帮助的。

There is a cloud-based column-oriented solution popping up, so you may want to have a look at: http://timeseries.guru

现在出现了一个基于云的面向列的解决方案,所以您可能想看看:http://timeseries.guru

#6


2  

This is a problem we've had to solve at ApiAxle. We wrote up a blog post on how we did it using Redis. It hasn't been out there for very long but it's proving to be effective.

这是我们在apishaft必须解决的问题。我们写了一篇关于如何使用Redis的博文。它已经很久没有出现了,但是它被证明是有效的。

I've also used RRDTool for another project which was excellent.

我还用RRDTool做了另一个非常好的项目。

#7


2  

I think that the answer for this kind of question should mainly revolve about the way your Database utilize storage. Some Database servers use RAM and Disk, some use RAM only (optionally Disk for persistency), etc. Most common SQL Database solutions are using memory+disk storage and writes the data in a Row based layout (every inserted raw is written in the same physical location). For timeseries stores, in most cases the workload is something like: Relatively-low interval of massive amount of inserts, while reads are column based (in most cases you want to read a range of data from a specific column, representing a metric)

我认为,这类问题的答案应该主要围绕数据库利用存储的方式。有些数据库服务器使用RAM和磁盘,有些只使用RAM(可选的磁盘用于持久性),等等。大多数常见的SQL数据库解决方案都使用内存+磁盘存储,并将数据写入基于行的布局中(每个插入的原始数据都写入相同的物理位置)。对于timeseries存储而言,在大多数情况下工作负载是这样的:大量插入的相对低间隔,而读取是基于列的(在大多数情况下,您希望从特定的列中读取一系列数据,表示一个度量)

I have found Columnar Databases (google it, you'll find MonetDB, InfoBright, parAccel, etc) are doing terrific job for time series.

我发现柱状数据库(谷歌it, MonetDB, InfoBright, parAccel等)在时间序列方面做得非常好。

As for your question, which personally I think is somewhat invalid (as all discussions using the fault term NoSQL - IMO): You can use a Database server that can talk SQL on one hand, making your life very easy as everyone knows SQL for many years and this language has been perfected over and over again for data queries; but still utilize RAM, CPU Cache and Disk in a Columnar oriented way, making your solution best fit Time Series

至于你的问题,我个人认为有点无效(因为所有讨论使用故障项NoSQL -国际海事组织):你可以使用一个数据库服务器,可以讨论SQL一方面,使得你的生活非常容易众所周知SQL多年,这种语言已经完善了一遍又一遍数据查询;但是仍然要以柱状的方式利用RAM、CPU缓存和磁盘,使您的解决方案最适合时间序列。

#8


2  

5 Millions of rows is nothing for today's torrential data. Expect data to be in the TB or PB in just a few months. At this point RDBMS do not scale to the task and we need the linear scalability of NoSql databases. Performance would be achieved for the columnar partition used to store the data, adding more columns and less rows kind of concept to boost performance. Leverage the Open TSDB work done on top of HBASE or MapR_DB, etc.

5百万行对于今天的大量数据来说是毫无意义的。预计数据将在短短几个月内出现在TB或PB中。此时,RDBMS不适合这个任务,我们需要NoSql数据库的线性可扩展性。对于用于存储数据的柱状分区的性能,增加了更多的列和较少的行类型来提高性能。利用在HBASE或MapR_DB上完成的打开的TSDB工作。

#9


1  

I face similar requirements regularly, and have recently started using Zabbix to gather and store this type of data. Zabbix has its own graphing capability, but it's easy enough to extract the data out of Zabbix's database and process it however you like. If you haven't already checked Zabbix out, you might find it worth your time to do so.

我经常遇到类似的需求,最近开始使用Zabbix收集和存储这类数据。Zabbix有自己的图形处理能力,但是从Zabbix的数据库中提取数据并按照自己的方式进行处理非常容易。如果你还没有检查Zabbix,你可能会发现花时间检查是值得的。

#10


0  

You should look into Time series database. It was created for this purpose.

您应该查看时间序列数据库。它就是为了这个目的而创建的。

A time series database (TSDB) is a software system that is optimized for handling time series data, arrays of numbers indexed by time (a datetime or a datetime range).

时间序列数据库(TSDB)是为处理时间序列数据而优化的软件系统,即按时间(datetime或datetime range)索引的数字数组。

Popular example of time-series database InfluxDB

时间序列数据库的流行示例