Storing large SQL datasets with a variable number of columns

Time: 2021-04-05 23:45:55

In America’s Cup yachting, we generate large datasets where at every time-stamp (e.g. 100Hz) we need to store maybe 100-1000 channels of sensor data (e.g. speed, loads, pressures). We store this in MS SQL Server and need to be able to retrieve subsets of channels of the data for analysis, and perform queries such as the maximum pressure on a particular sensor in a test, or over an entire season.

The set of channels to be stored stays the same for several thousand time-stamps, but day-to-day will change as new sensors are added, renamed, etc... and depending on testing, racing or simulating, the number of channels can vary greatly.

The textbook way to structure the SQL tables would probably be:

OPTION 1

ChannelNames
+-----------+-------------+
| ChannelID | ChannelName |
+-----------+-------------+
| 50        | Pressure    |
| 51        | Speed       |
| ...       | ...         |
+-----------+-------------+

Sessions
+-----------+---------------+-------+----------+
| SessionID |   Location    | Boat  | Helmsman |
+-----------+---------------+-------+----------+
| 789       | San Francisco | BoatA |  SailorA |
| 790       | San Francisco | BoatB |  SailorB |
| ...       | ...           | ...   |          |
+-----------+---------------+-------+----------+

SessionTimestamps
+-------------+-------------+------------------------+
| SessionID   | TimestampID | DateTime               |
+-------------+-------------+------------------------+
| 789         |       12345 | 2013/08/17 10:30:00:00 |
| 789         |       12346 | 2013/08/17 10:30:00:01 |
| ...         |       ...   | ...                    |
+-------------+-------------+------------------------+

ChannelData
+-------------+-----------+-----------+
| TimestampID | ChannelID | DataValue |
+-------------+-----------+-----------+
| 12345       | 50        | 1015.23   |
| 12345       | 51        | 12.23     |
| ...         | ...       | ...       |
+-------------+-----------+-----------+
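
For reference, the kind of query mentioned above (the maximum pressure on a sensor over a season) maps onto this structure quite directly; a rough sketch, with an illustrative date range standing in for "a season":

SELECT MAX(cd.DataValue) AS MaxPressure
FROM ChannelData AS cd
JOIN SessionTimestamps AS st ON st.TimestampID = cd.TimestampID
JOIN ChannelNames AS cn ON cn.ChannelID = cd.ChannelID
WHERE cn.ChannelName = 'Pressure'
  AND st.DateTime >= '2013-01-01' AND st.DateTime < '2014-01-01';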

This structure is neat but inefficient. Each DataValue requires three storage fields, and at each time-stamp we need to INSERT 100-1000 rows.

If we always had the same channels, it would be more sensible to use one row per time-stamp, with a structure like this:

OPTION 2

+-----------+------------------------+----------+-------+----------+--------+-----+
| SessionID | DateTime               | Pressure | Speed | LoadPt   | LoadSb | ... |
+-----------+------------------------+----------+-------+----------+--------+-----+
| 789       | 2013/08/17 10:30:00:00 | 1015.23  | 12.23 | 101.12   | 98.23  | ... |
| 789       | 2013/08/17 10:30:00:01 | 1012.51  | 12.44 | 100.33   | 96.82  | ... |
| ...       | ...                    | ...      |       |          |        |     |
+-----------+------------------------+----------+-------+----------+--------+-----+

However, the channels change every day, and over the months the number of columns would grow and grow, with most cells ending up empty. We could create a new table for every new Session, but it doesn't feel right to be using a table name as a variable, and it would ultimately result in tens of thousands of tables; it also becomes very difficult to query over a season when the data is spread across multiple tables.

Another option would be:

OPTION 3

+-----------+------------------------+----------+----------+----------+----------+-----+
| SessionID | DateTime               | Channel1 | Channel2 | Channel3 | Channel4 | ... |
+-----------+------------------------+----------+----------+----------+----------+-----+
| 789       | 2013/08/17 10:30:00:00 | 1015.23  |    12.23 | 101.12   | 98.23    | ... |
| 789       | 2013/08/17 10:30:00:01 | 1012.51  |    12.44 | 100.33   | 96.82    | ... |
| ...       | ...                    | ...      |          |          |          |     |
+-----------+------------------------+----------+----------+----------+----------+-----+

with a look-up from Channel column IDs to channel names – but this requires an EXEC or eval to execute a pre-constructed query to obtain the channel we want – because SQL isn’t designed to have column names as variables. On the plus side, we can re-use columns when channels change, but there will still be many empty cells because the table has to be as wide as the largest number of channels we ever encounter. Using a SPARSE table may help here, but I am uncomfortable with the EXEC/eval issue above.
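
To illustrate the EXEC concern, retrieving a single named channel under Option 3 would need something along these lines (the wide table and the mapping table here are hypothetical):

DECLARE @ColumnName sysname, @Sql nvarchar(max);

-- Look up which generic column currently holds the 'Pressure' channel
SELECT @ColumnName = ColumnName
FROM ChannelColumnMap            -- hypothetical ChannelN-to-name lookup
WHERE ChannelName = 'Pressure';

-- The column name has to be spliced into the query text before executing it
SET @Sql = N'SELECT MAX(' + QUOTENAME(@ColumnName) + N') AS MaxValue FROM WideChannelData WHERE SessionID = @SessionID;';
EXEC sp_executesql @Sql, N'@SessionID int', @SessionID = 789;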

What is the right solution to this problem, that achieves efficiency of storage, inserts and queries?

2 solutions

#1


I would go with Option 1.

Data integrity comes first; optimization (if needed) comes second.

Other options would eventually have a lot of NULL values and other problems stemming from not being normalized. Managing the data and making efficient queries would be difficult.

Besides, there is a limit on the number of columns a table can have: 1,024. So with 1,000 sensors/channels you are already dangerously close to that limit. Even if you made the table a wide table, which allows 30,000 columns, there is still a limit on the size of a row: 8,060 bytes per row. And there are performance considerations on top of that.

I would not use wide tables in this case, even if I were sure that the data for each row would never exceed 8,060 bytes and that the growing number of channels would never exceed 30,000.

I don't see a problem with inserting 100-1000 rows in Option 1 versus 1 row in the other options. To do such an INSERT efficiently, don't issue 1,000 individual INSERT statements; do it in bulk. In various places in my system I use the following two approaches:

1) Build one long INSERT statement

INSERT INTO ChannelData (TimestampID, ChannelID, DataValue) VALUES
(12345, 50, 1015.23),
(12345, 51, 12.23),
...
(), (), (), (), ........... ();

that contains 1000 rows and execute it as a normal INSERT in one transaction, rather than as 1,000 separate transactions (check the syntax details; note that SQL Server limits a VALUES row constructor to 1,000 rows per statement, so larger batches have to be split).

2) Have a stored procedure that accepts a table-valued parameter. Call the procedure, passing 1,000 rows as a table.

CREATE TYPE [dbo].[ChannelDataTableType] AS TABLE(
    [TimestampID] [int] NOT NULL,
    [ChannelID] [int] NOT NULL,
    [DataValue] [float] NOT NULL
)
GO

CREATE PROCEDURE [dbo].[InsertChannelData]
    -- Add the parameters for the stored procedure here
    @ParamRows dbo.ChannelDataTableType READONLY
AS
BEGIN
    -- SET NOCOUNT ON added to prevent extra result sets from
    -- interfering with SELECT statements.
    SET NOCOUNT ON;

    BEGIN TRANSACTION;
    BEGIN TRY

        INSERT INTO [dbo].[ChannelData]
            ([TimestampID],
            [ChannelID],
            [DataValue])
        SELECT
            TT.[TimestampID]
            ,TT.[ChannelID]
            ,TT.[DataValue]
        FROM
            @ParamRows AS TT
        ;

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        ROLLBACK TRANSACTION;
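        -- Note: the error is silently swallowed here; in practice you would
        -- typically re-throw it (e.g. with THROW) or log it before returning.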
    END CATCH;

END
GO
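
Calling the procedure from T-SQL would look roughly like this (from application code you would normally fill the table-valued parameter through your client library instead; the sample rows are illustrative):

DECLARE @Rows dbo.ChannelDataTableType;

INSERT INTO @Rows (TimestampID, ChannelID, DataValue) VALUES
    (12345, 50, 1015.23),
    (12345, 51, 12.23);

EXEC dbo.InsertChannelData @ParamRows = @Rows;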

If possible, accumulate data from several timestamps before inserting to make the batches larger. You should experiment with your system to find the optimal batch size. I use batches of around 10K rows with the stored procedure approach.

If your data is coming from the sensors 100 times a second, I would first dump the incoming raw data into some very simple CSV file(s) and have a parallel background process insert it into the database in chunks. In other words, keep a buffer for the incoming data, so that if the server can't cope with the incoming volume you don't lose data.
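
One way the background process could then load each finished chunk is a BULK INSERT into a staging table; a minimal sketch, assuming a staging table that matches the file layout (the file path and table name are illustrative):

BULK INSERT dbo.ChannelDataStaging       -- hypothetical staging table
FROM 'D:\incoming\channeldata_chunk_0001.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2, TABLOCK);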

Based on your comments that some channels are likely to be more interesting and queried repeatedly, while others are less interesting, here is one optimization I would consider. In addition to the single ChannelData table for all channels, have another table InterestingChannelData. ChannelData would hold the whole set of data, just in case; InterestingChannelData would hold a subset covering only the most interesting channels. It should be much smaller and faster to query. In any case, this is an optimization (denormalization/data duplication) built on top of a properly normalized structure.
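
A sketch of how that copy might be populated, assuming the set of interesting channels is itself kept in a small table (the names below are illustrative):

INSERT INTO dbo.InterestingChannelData (TimestampID, ChannelID, DataValue)
SELECT cd.TimestampID, cd.ChannelID, cd.DataValue
FROM dbo.ChannelData AS cd
WHERE cd.ChannelID IN (SELECT ChannelID FROM dbo.InterestingChannels);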

#2


Is your process like this:

  1. Generate data during the day
  2. Analyse data afterwards

If these are separate activities, you might want to consider using different 'insert' and 'select' schemas. You could create a schema that is fast for inserting on the boat, and afterwards batch-upload the data into an analysis-optimised schema. This requires a transformation step (where, for example, you map generic column names onto useful column names).
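
For example, if the on-boat 'insert' schema looked like Option 3, the transformation into the normalized analysis schema could be an UNPIVOT plus a join onto a column-name mapping table; a rough sketch (all names are illustrative, and in practice the mapping would probably need to be per session, since generic columns get reused):

INSERT INTO dbo.ChannelData (TimestampID, ChannelID, DataValue)
SELECT u.TimestampID, m.ChannelID, u.DataValue
FROM dbo.RawWideData                          -- fast, generic 'insert' table
UNPIVOT (DataValue FOR ChannelColumn IN (Channel1, Channel2, Channel3)) AS u
JOIN dbo.ChannelColumnMap AS m                -- maps generic columns to real channels
    ON m.ColumnName = u.ChannelColumn;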

This is along the lines of data warehousing and data marts. In this kind of design, you batch load and optimise the schema for reporting. Does your current daily upload have much of a window?
