What is the best way to store a large number of data points?
For example, temperature values measured every minute at many locations?
A SQL database with one row per data point doesn't seem very efficient.
3 solutions
#1
3
I would like to know why you consider it "not efficient". You probably need to explain your data model and schema to give better context for the scenario.
Storing multiple data points in a single row, when they are unrelated and should stand on their own, is not a good approach. Lumping them together results in counter-intuitive, quirky queries to pull out the data points you need for a given scenario.
We have done work at a power station before, collecting a wide variety of gas and electrical parameters from various systems and metering equipment, all of which needed to be monitored and aggregated. Depending on the type of parameter, readings arrive anywhere from every 3-5 minutes to every 30-60 minutes. This naturally results in millions of records per month.
The key is indexing the tables properly so that their physical order is tied to the sequence in which the records arrive (a clustered index). New pages and extents are then created and filled sequentially by incoming data, which should prevent massive page splits and reshuffling.
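As a rough illustration of keeping physical order tied to arrival order, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is only a stand-in chosen so the example runs anywhere; it has no pages-and-extents clustered index as such, but a WITHOUT ROWID table stores rows inside the primary-key B-tree, which behaves similarly. The table name `reading`, its columns, and the sample values are assumptions made for this sketch, not the schema used at the power station.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# WITHOUT ROWID stores the rows inside the primary-key B-tree, so rows are
# kept in (recorded_at, parameter_id) order: SQLite's closest equivalent
# to a clustered index. All names here are hypothetical.
conn.execute("""
    CREATE TABLE reading (
        recorded_at  TEXT    NOT NULL,  -- ISO-8601 text sorts chronologically
        parameter_id INTEGER NOT NULL,
        value        REAL    NOT NULL,
        PRIMARY KEY (recorded_at, parameter_id)
    ) WITHOUT ROWID
""")

# Readings arrive in time order, so each insert lands at the end of the tree.
rows = [
    ("2024-01-01T00:00:00", 1, 230.1),
    ("2024-01-01T00:00:00", 2, 49.9),
    ("2024-01-01T00:05:00", 1, 229.8),
    ("2024-01-01T00:05:00", 2, 50.0),
]
conn.executemany("INSERT INTO reading VALUES (?, ?, ?)", rows)

# A time-window query is a contiguous range scan over the primary key.
window = conn.execute(
    "SELECT recorded_at, parameter_id, value FROM reading "
    "WHERE recorded_at >= ? AND recorded_at < ? "
    "ORDER BY recorded_at, parameter_id",
    ("2024-01-01T00:05:00", "2024-01-01T00:10:00"),
).fetchall()
```

Putting the timestamp first in the key is the design choice that matters: new rows always sort after existing ones, so the engine appends rather than splitting pages in the middle of the table.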
#2
2
The key question may be: how do you need to access them later?
If you need to associate each point with a timestamp and location ID, and later retrieve individual measurements by time or time range and location from multiple clients, a database may indeed be the most efficient option for retrieval.
OTOH, if your client loads and processes a whole day of data for one location at a time, storing the data in one file per location and day reduces dependencies and may be easier.
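The file-per-location-and-day layout could look something like the sketch below. The directory scheme (`<root>/<location>/<YYYY-MM-DD>.csv`), the CSV format, and the helper names `day_file`, `append_reading` and `load_day` are all assumptions made for illustration; any format the client can read in one pass would do.

```python
import csv
import tempfile
from datetime import datetime
from pathlib import Path


def day_file(root: Path, location_id: str, ts: datetime) -> Path:
    """Path of the file holding one location's readings for one day."""
    return root / location_id / f"{ts:%Y-%m-%d}.csv"


def append_reading(root: Path, location_id: str, ts: datetime, temperature: float) -> None:
    """Append a single reading to the matching location/day file."""
    path = day_file(root, location_id, ts)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", newline="") as f:
        csv.writer(f).writerow([ts.isoformat(), temperature])


def load_day(root: Path, location_id: str, day: datetime) -> list:
    """Load every reading for one location and day; empty list if no file."""
    path = day_file(root, location_id, day)
    if not path.exists():
        return []
    with path.open(newline="") as f:
        return [(datetime.fromisoformat(t), float(v)) for t, v in csv.reader(f)]


# Usage with a throwaway directory and made-up readings.
root = Path(tempfile.mkdtemp())
append_reading(root, "site-a", datetime(2024, 1, 1, 0, 0), 3.5)
append_reading(root, "site-a", datetime(2024, 1, 1, 0, 1), 3.6)
append_reading(root, "site-a", datetime(2024, 1, 2, 0, 0), 2.9)
day = load_day(root, "site-a", datetime(2024, 1, 1))
```

Loading a day is then a single sequential file read with no server involved, but any query that cuts across locations or days has to open many files.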
Other concerns are backups and archival, and whether your users can or should deal with those themselves.
#3
1
A table like this may work:
LocationID, Temperature, Timestamp
I don't see why this wouldn't be efficient. This is what databases are for, after all.
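To make the three-column table concrete, here is a small sketch using Python's built-in sqlite3 module as a stand-in database. The column names come from the answer above; the table name `Measurement`, the index `idx_location_time`, the column types, and the sample rows are assumptions. At one reading per minute, each location adds 1,440 rows per day.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One row per data point, with the three columns suggested above.
conn.execute("""
    CREATE TABLE Measurement (
        LocationID  INTEGER NOT NULL,
        Temperature REAL    NOT NULL,
        Timestamp   TEXT    NOT NULL   -- ISO-8601 text sorts chronologically
    )
""")

# Hypothetical index so that "one location, one time range" is a range scan.
conn.execute(
    "CREATE INDEX idx_location_time ON Measurement (LocationID, Timestamp)"
)

conn.executemany(
    "INSERT INTO Measurement VALUES (?, ?, ?)",
    [
        (1, 3.5, "2024-01-01T00:00:00"),
        (1, 3.6, "2024-01-01T00:01:00"),
        (2, 7.2, "2024-01-01T00:00:00"),
        (1, 2.9, "2024-01-02T00:00:00"),
    ],
)

# Count and average for location 1 on 2024-01-01.
count, avg = conn.execute(
    "SELECT COUNT(*), AVG(Temperature) FROM Measurement "
    "WHERE LocationID = ? AND Timestamp >= ? AND Timestamp < ?",
    (1, "2024-01-01T00:00:00", "2024-01-02T00:00:00"),
).fetchone()
```

The rows are narrow, so the row count alone says little about efficiency; what matters is whether an index matches the way the data is queried.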