Efficiently storing time-series data: MySQL or flat files? Many tables (or files), or queries with WHERE conditions?

Time: 2021-06-15 16:58:02

What's the best way to store time-series data from thousands (soon possibly millions) of real-world hardware sensors? The sensors themselves differ: some capture just one variable, some up to a dozen. I need to store these values every hour, and I don't want to delete data that is older than x, i.e. the data will just keep growing.

Currently, I use a MySQL database to store these time series (it also serves a web frontend that shows nice time-series graphs for every sensor). I have one table per sensor, which currently comes to about 11000 tables in total. Each table has a layout like "timestamp, value1, [value2] ...".

The database's main workload is selects (every time somebody looks at the graphs) rather than inserts/updates (once an hour). The select query for showing a graph is simply "SELECT * FROM $sensor_id ORDER BY timestamp", so getting the info out of my select statements is pretty simple/efficient.

However, having that many tables already presents problems when backing up the database, because I run into LOCK limits (e.g. "mysqldump: Got error: 23: Out of resources when opening file './database/table_xyz.MYD' (Errcode: 24) when using LOCK TABLES"). I can work around that error, but obviously it got me thinking...

So, the real question, broken down into sub-questions:

  • How bad is my approach of having one table for every sensor? What if instead of a few thousand tables I had a few million (I might have to deal with that many sensors in the near future)?
  • Is storing all sensors' data in one combined table, with an extra column holding the sensor_id, a better approach, or would that slow down my select statement by a lot (SELECT * FROM all_sensors WHERE sensor_id='$sensor_id')? Keep in mind that different sensors measure different things, so this table would have a few dozen columns instead of the one-to-a-few each table has when every sensor gets its own table.
  • I also thought about storing the time-series data NOT in MySQL but in flat (CSV) files. The graphing library I use for the frontend (dygraphs) handles CSV files fine (plus it would give me the option of making them available for download, which would be a bonus but is not a requirement currently). I would still need the database for other frontend-related things, but it would mean having a few dozen tables instead of 11000 (or even more if we add more sensors).
  • If I create one file for every table, I would probably run into filesystem limits eventually (this is an ext3 partition, so there's the ~32k-files-per-directory limit). So the same question as above applies here too: should I store everything in one large file that holds all sensors' data? That would probably slow down my reads even more, as the graphing library would need to read a much, much bigger file into memory every time someone looks at a graph.
What would you do?

Thanks!

1 solution

#1

To answer this question, we must first analyse the real issue you're facing.

The real issue is finding the most efficient combination of writing and retrieving the data.

Let's review your conclusions:

  • thousands of tables - well, that defeats the purpose of a database and makes it harder to work with. You also gain nothing: there is still disk seeking involved, this time with many file descriptors in use. You also have to know the table names, and there are thousands of them. It's also difficult to extract data, which is what databases are for - structuring the data so that you can easily cross-reference records. Thousands of tables is not efficient from a performance point of view, and not efficient from a usability point of view. Bad choice.

    成千上万的表 - 嗯,这违反了数据库的目的,并使其更难以使用。你也什么也得不到。仍然涉及磁盘搜索,这次使用了许多文件描述符。您还必须知道表名,并且有数千个。提取数据也是很困难的,这就是数据库的用途 - 以一种您可以轻松交叉引用记录的方式构建数据。成千上万的表 - 从perf不高效。观点看法。从使用的角度来看效率不高。糟糕的选择。

  • a csv file - it is probably excellent for fetching the data if you need the entire contents at once, but it's nowhere near as good for manipulating or transforming the data. Given that you rely on a specific layout, you have to be extra careful while writing to CSV. If this grows to thousands of CSV files, you haven't done yourself a favor. You removed all the overhead of SQL (which isn't that big), but you gained nothing for retrieving parts of the data set. You also have problems fetching historic data or cross-referencing anything. Bad choice.

    一个csv文件 - 如果你需要一次全部内容,它可能非常适合获取数据。但是,对于操纵或转换数据来说远远不够好。鉴于您依赖于特定布局 - 在写入CSV时必须格外小心。如果这种情况增长到数千个CSV文件,那么你并没有帮忙。您删除了SQL的所有开销(这不是那么大),但您没有采取任何措施来检索数据集的部分内容。您在获取历史数据或交叉引用任何内容时也会遇到问题。糟糕的选择。

The ideal scenario would be being able to access any part of the data set in an efficient and quick way without any kind of structure change.

And this is exactly the reason why we use relational databases and why we dedicate entire servers with a lot of RAM to those databases.

In your case, you are using MyISAM tables (the .MYD file extension). It's an old storage format that worked great for the low-end hardware of its day. But these days we have excellent, fast machines, which is why we use InnoDB and allow it to use a lot of RAM so that I/O costs are reduced. The variable that controls this is called innodb_buffer_pool_size - googling it will produce meaningful results.

To answer the question - an efficient, workable solution would be to use one table where you store sensor information (id, title, description) and another where you store sensor readings, and to allocate sufficient RAM or sufficiently fast storage (an SSD). The tables would look like this:

CREATE TABLE sensors ( 
    id int unsigned not null auto_increment,
    sensor_title varchar(255) not null,
    description varchar(255) not null,
    date_created datetime,
    PRIMARY KEY(id)
) ENGINE = InnoDB DEFAULT CHARSET = UTF8;

CREATE TABLE sensor_readings (
    id int unsigned not null auto_increment,
    sensor_id int unsigned not null,
    date_created datetime,
    reading_value varchar(255), -- note: this column's value might vary, I do not know what data type you need to hold value(s)
    PRIMARY KEY(id),
    FOREIGN KEY (sensor_id) REFERENCES sensors (id) ON DELETE CASCADE
) ENGINE = InnoDB DEFAULT CHARSET = UTF8;
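
With this two-table layout, the per-sensor graph query stays simple, and a composite index keeps it fast even at millions of rows. A sketch of what I'd add (the index name is my own choice, and the `?` is a bound parameter - bind the sensor id instead of string-interpolating $sensor_id, which also closes an SQL injection hole):

```sql
-- Composite index so per-sensor, time-ordered reads don't scan the whole table:
CREATE INDEX idx_sensor_date ON sensor_readings (sensor_id, date_created);

-- The graph query for one sensor:
SELECT date_created, reading_value
FROM sensor_readings
WHERE sensor_id = ?
ORDER BY date_created;
```

With that index in place, MySQL can satisfy both the WHERE filter and the ORDER BY from the index itself instead of sorting the result set.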

InnoDB, by default, uses one shared tablespace file for the entire database/installation, which alleviates the problem of exceeding the file-descriptor limits of the OS / filesystem. Several million, or even tens of millions of records should not be a problem if you allocate 5-6 GB of RAM to hold the working data set in memory - that would give you quick access to the data.

默认情况下,InnoDB使用一个平面文件进行整个数据库/安装。这缓解了超出OS /文件系统的文件描述符限制的问题。如果要分配5-6 GB的RAM来保存内存中的工作数据,那么几个甚至数千万个记录应该不成问题 - 这样可以快速访问数据。
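The buffer-pool sizing mentioned above is a server-level setting, not something you change per query. A minimal sketch of checking and setting it (the 6G figure is only an example for a box that can dedicate that much RAM to MySQL - tune it to your hardware):

```sql
-- See what the server is currently using (value is in bytes):
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- The persistent setting goes in my.cnf under the [mysqld] section, e.g.:
--   innodb_buffer_pool_size = 6G
-- then restart the server (older MySQL versions cannot resize it online).
```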

If I were to design such a system, this is the first approach I would take (personally). From there it's easy to adjust depending on what you need to do with the information.

如果我要设计这样一个系统,这是我会做的第一个方法(个人)。从那里开始,根据您对该信息的需求,可以轻松调整。
