SQL Server clustered index on a "write-once" table

Time: 2022-10-25 02:46:51

I have a fairly unique table in a SQL Server database that doesn't follow 'typical' usage conventions and am looking for some advice regarding the clustered index.

This is a made-up example, but follows the real data pretty closely.

The table has a three-column primary key, whose columns are really foreign keys to other tables, and a fourth column that contains the relevant data. For this example, let's say that the table looks like this:

CREATE TABLE [dbo].[WordCountsForPage](
 [AuthorID] [int] NOT NULL,
 [BookID] [int] NOT NULL,
 [PageNumber] [int] NOT NULL,
 [WordCount] [int] NOT NULL
)

So, we have a somewhat hierarchical primary key, with the unique data being that fourth field.
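
The CREATE TABLE statement above does not show the key itself. A minimal sketch of the three-column clustered primary key described here might look like the following; the constraint name PK_WordCountsForPage is made up for illustration and is not taken from the real schema.

```sql
-- Sketch only: the constraint name is hypothetical.
-- The column order follows the hierarchy described above:
-- author, then book, then page.
ALTER TABLE [dbo].[WordCountsForPage]
ADD CONSTRAINT [PK_WordCountsForPage]
    PRIMARY KEY CLUSTERED ([AuthorID], [BookID], [PageNumber]);
```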

In the real application, there are a total of 2.8 billion possible records, and no more than that. The records are created on the fly as the data is calculated over time, and realistically probably only 1/4 of those records will ever actually be calculated. They are stored in the DB since the calculation is an expensive operation, and we only want to do it once for each unique combination.

Today, the data is read thousands of times a minute, but (at least for now) there are also hundreds of inserts per minute as the table populates itself (and this will continue for quite some time). I would say that there are 10 reads for every insert (today).

I am wondering if we are taking a performance hit on all of those inserts because of the clustered index.

The clustered index makes sense "long term" since the table will eventually become read-only, but it will take some time to get there.

I suppose I could make the index non-clustered during the heavy insert period, and change it to clustered as the table becomes populated, but how do you determine when the cross-over point would be (and how can I notify myself in the future that the 'time has come')?
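
One way to watch for such a cross-over point might be to compare reads and writes per index over time and to track how many rows exist so far. The queries below are only a sketch: they assume the example table name dbo.WordCountsForPage, and any threshold for "the time has come" would still have to be chosen by hand.

```sql
-- Sketch only: reads versus writes per index on the example table.
-- The counters in this view reset when the SQL Server instance restarts.
SELECT i.name AS index_name,
       s.user_seeks + s.user_scans + s.user_lookups AS reads,
       s.user_updates AS writes
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS s
       ON s.object_id = i.object_id
      AND s.index_id = i.index_id
      AND s.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID(N'dbo.WordCountsForPage');

-- Rows stored so far, to compare against the 2.8 billion possible records.
SELECT SUM(p.row_count) AS rows_so_far
FROM sys.dm_db_partition_stats AS p
WHERE p.object_id = OBJECT_ID(N'dbo.WordCountsForPage')
  AND p.index_id IN (0, 1);  -- 0 = heap, 1 = clustered index
```

Running these on a schedule and logging the results would show whether the insert rate is actually tapering off.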

What I really need is a convertible index that crosses over from non-clustered to clustered at some magical time in the future.

Any suggestions for how to handle this one?

1 solution

#1


3  

Actually, I would not bother starting with a non-clustered index and converting it to a clustered one later on (the conversion alone is a really messy affair!).
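
To give a rough idea of what that conversion involves, here is a sketch, assuming the primary key was created as non-clustered under the hypothetical name PK_WordCountsForPage. Building the clustered index rewrites the whole table in key order, which is why this is painful on a table with hundreds of millions of rows.

```sql
-- Sketch only: the constraint name is hypothetical.
-- Step 1: drop the non-clustered primary key. The table remains a heap,
-- and uniqueness is not enforced until step 2 completes.
ALTER TABLE [dbo].[WordCountsForPage]
DROP CONSTRAINT [PK_WordCountsForPage];

-- Step 2: re-create the key as clustered. This rewrites the entire
-- table in (AuthorID, BookID, PageNumber) order.
ALTER TABLE [dbo].[WordCountsForPage]
ADD CONSTRAINT [PK_WordCountsForPage]
    PRIMARY KEY CLUSTERED ([AuthorID], [BookID], [PageNumber]);
```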

As the Queen of Indexing, Kimberly Tripp, explains in her post The Clustered Index Debate Continues..., having a clustered index on a table can actually improve your INSERT performance!

Inserts are faster in a clustered table (but only in the "right" clustered table) than compared to a heap. The primary problem here is that lookups in the IAM/PFS to determine the insert location in a heap are slower than in a clustered table (where insert location is known, defined by the clustered key). Inserts are faster when inserted into a table where order is defined (CL) and where that order is ever-increasing.

A heap is a table which has no clustered index defined on it.

Considering this, and the effort and trouble it takes to go from heap to a table with a clustered index - I wouldn't even bother. Just define your indices, and start using that table!
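
Note that the quote above applies to the "right" clustered table, where the key order is ever-increasing. Whether rows arrive in (AuthorID, BookID, PageNumber) order in your case is something I can't tell from the question, so it may be worth checking how fragmented the clustered index gets as the table fills. A minimal sketch, assuming the example table name:

```sql
-- Sketch only: fragmentation of the clustered index (index_id = 1)
-- on the example table, using the cheapest scan mode.
SELECT ps.index_id,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(),
         OBJECT_ID(N'dbo.WordCountsForPage'),
         1,          -- clustered index
         NULL,       -- all partitions
         'LIMITED'   -- scan mode
     ) AS ps;
```

If fragmentation climbs quickly during the heavy insert period, an occasional index rebuild or a lower fill factor are the usual things to look at.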
