I am about to import around 500 million rows of telemetry data into SQL Server 2008 R2, and I want to make sure I get the indexing/schema right to allow for fast searches of the data. I've been working with databases for a while but nothing on this scale. I'm hoping I can describe my data and the application, and someone can advise me on a good strategy for indexing it.
The data is instrument readings from a data collection system, and has 3 columns: SentTime (datetime2(3)), Topic (nvarchar(255)), and Value (float). The SentTime precision is to the millisecond, and is NOT unique. There are around 400 distinct Topics (e.g. "Voltage1", "PumpPressure", etc.) in the data, and my plan was to break out the data into about 30 tables, each with 10-15 columns, grouped into logical groupings like Voltages, Pressures, Temperatures, etc., each with their own SentTime column.
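To make that concrete, here is a rough sketch of what one of those grouped tables might look like (table and column names are just illustrative, not the real ones):

```sql
-- Hypothetical sketch of one grouped table (names are illustrative).
-- Each logical group (Voltages, Pressures, Temperatures, ...) would get
-- its own SentTime column plus 10-15 value columns pivoted out of the
-- original Topic/Value pairs.
CREATE TABLE dbo.Voltages
(
    SentTime datetime2(3) NOT NULL,
    Voltage1 float NULL,
    Voltage2 float NULL,
    Voltage3 float NULL
    -- ... more voltage-related topics, 10-15 columns per table
);
```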
A typical search will be to retrieve various Values (could be across several tables) for a given time range. Another possible search will be to retrieve all times/values for a given value range and topic. The user interface will show coarse graphs of the data, to allow the user to find the interesting data and export it to Excel or CSV.
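Roughly, the two query shapes would look something like this (again assuming the illustrative grouped-table layout sketched above):

```sql
-- Search 1: various values for a given time range (possibly across tables)
SELECT SentTime, Voltage1, Voltage2
FROM dbo.Voltages
WHERE SentTime >= '2012-06-01T00:00:00'
  AND SentTime <  '2012-06-02T00:00:00';

-- Search 2: all times/values for a given value range and topic
SELECT SentTime, PumpPressure
FROM dbo.Pressures
WHERE PumpPressure BETWEEN 80.0 AND 120.0;
```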
My main question is, if I add an index on SentTime alone, will that speed up searches for a given time range? Would it be better to make a composite index on time and value, since the time is not unique? Is there any point in adding a unique primary key? Is there any other overall strategy or schema I should be looking at for this application?
Another note, I will not be inserting any data once the import is done, so no need to worry about the insertion overhead of indexes.
1 Answer
#1
It seems that you'll be doing a lot of range searches over the SentTime column. In that case, I would create a clustered index on SentTime; with a nonclustered index there would be the overhead of key lookups (to retrieve the additional columns). It doesn't matter that SentTime is not unique; the engine will add a uniquifier to it.
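For example (a sketch; the table name is illustrative):

```sql
-- Clustered index on SentTime: the rows themselves are stored in SentTime
-- order, so a time-range predicate becomes a contiguous range scan with
-- no key lookups. Duplicate SentTime values get a hidden 4-byte uniquifier.
CREATE CLUSTERED INDEX IX_Voltages_SentTime
    ON dbo.Voltages (SentTime);
```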
Does the Topic column have to be nvarchar; why not a varchar?
My relational self will punish me for this, but it seems that you don't need an additional PK. The data is read-only, right?
One more thought: check out the sparse columns feature; it seems like a perfect fit for your scenario. A table can have a large number of sparse columns (up to 10,000 if I'm not mistaken), they can be grouped and manipulated as XML via a column set, and the main point is that NULLs are almost free storage-wise.
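A minimal sketch of what that could look like, with one sparse column per Topic (column names are illustrative):

```sql
-- One wide table instead of ~30 grouped tables; NULLs in SPARSE columns
-- cost almost no storage. The optional column set exposes all non-NULL
-- sparse columns of a row as a single XML fragment.
CREATE TABLE dbo.Readings
(
    SentTime     datetime2(3) NOT NULL,
    Voltage1     float SPARSE NULL,
    PumpPressure float SPARSE NULL,
    -- ... one sparse column per Topic (about 400 in total)
    AllTopics    xml COLUMN_SET FOR ALL_SPARSE_COLUMNS
);
```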