I'm not a DBA ("Good!", you'll be thinking in a moment.)
I have a table of logging data with these characteristics and usage patterns:
- A datetime column for storing log timestamps whose value is ever-increasing and mostly (but only mostly) unique
- Frequent-ish inserts (say, a dozen a minute), only at the end of the timestamp range (new data being logged)
- Infrequent deletes, in bulk, from the beginning of the timestamp range (old data being cleared)
- No updates at all
- Frequent-ish selects using the timestamp column as the primary criterion, along with secondary criteria on other columns
- Infrequent selects using other columns as the criteria (and not including the timestamp column)
- A good amount of data, but nowhere near enough that I'm worried much about storage space
Additionally, there is currently a daily maintenance window during which I could do table optimization.
I frankly don't expect this table to challenge the server it's going to be on even if I mis-index it a bit, but nevertheless it seemed like a good opportunity to ask for some input on SQL Server clustered indexes.
I know that clustered indexes determine the storage of the actual table data (the data is stored in the leaf nodes of the index itself), and that non-clustered indexes are separate pointers into the data. So in query terms, a clustered index is going to be faster than a non-clustered index -- once we've found the index value, the data is right there. There are costs on insert and delete (and of course an update changing the clustered index column's value would be particularly costly).
But I read in this answer that deletes leave gaps that don't get cleaned up until/unless the index is rebuilt.
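How much space deleted rows actually leave behind can be measured rather than guessed. The query below is a minimal sketch that reads page fullness and fragmentation from sys.dm_db_index_physical_stats; the table name dbo.AppLog is a placeholder assumed for illustration, not something taken from the question.

```sql
-- Assumption: dbo.AppLog is a placeholder name for the logging table.
-- SAMPLED mode is used because avg_page_space_used_in_percent is NULL
-- in the default LIMITED mode.
SELECT  i.name AS index_name,
        ps.index_type_desc,
        ps.page_count,
        ps.avg_fragmentation_in_percent,
        ps.avg_page_space_used_in_percent
FROM    sys.dm_db_index_physical_stats(
            DB_ID(), OBJECT_ID(N'dbo.AppLog'), NULL, NULL, 'SAMPLED') AS ps
JOIN    sys.indexes AS i
        ON  i.object_id = ps.object_id
        AND i.index_id  = ps.index_id;
```

Running it before and after a bulk delete would show whether the pages at the start of the range are really left part-empty.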
All of this suggests to me that I should:
- Put a clustered index on the timestamp column with a 100% fill-factor
- Put non-clustered indexes on any other column that may be used as a criterion in a query that doesn't also involve the clustered column (which may be any of them in my case)
- Schedule the bulk deletes to occur during the daily maintenance interval
- Schedule a rebuild of the clustered index to occur immediately after the bulk delete
- Relax and get out more
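For concreteness, the technical steps in that list might look roughly like the T-SQL below. This is only a sketch: the table, column, and index names are invented for illustration, and the 30-day retention period is an arbitrary example.

```sql
-- Assumptions: all object names are placeholders, and the 30-day
-- retention period is an arbitrary example.
CREATE TABLE dbo.AppLog (
    LoggedAt  DATETIME      NOT NULL,
    Source    VARCHAR(50)   NOT NULL,
    Severity  TINYINT       NOT NULL,
    Message   NVARCHAR(MAX) NULL
);

-- Clustered index on the timestamp. It is not declared UNIQUE because the
-- values are only mostly unique; SQL Server adds a hidden uniquifier to
-- rows with duplicate keys.
CREATE CLUSTERED INDEX CIX_AppLog_LoggedAt
    ON dbo.AppLog (LoggedAt)
    WITH (FILLFACTOR = 100);

-- One non-clustered index for a column that is queried without the timestamp.
CREATE NONCLUSTERED INDEX IX_AppLog_Source
    ON dbo.AppLog (Source);

-- Maintenance window: clear old rows, then rebuild the clustered index.
DELETE FROM dbo.AppLog
WHERE  LoggedAt < DATEADD(DAY, -30, GETDATE());

ALTER INDEX CIX_AppLog_LoggedAt ON dbo.AppLog
    REBUILD WITH (FILLFACTOR = 100);
```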
Am I wildly off base there? Do I need to frequently rebuild the index like that to avoid lots of wasted space? Are there other obvious (to a DBA) things I should be doing?
Thanks in advance.
4 Answers
#1
3
I agree with putting the clustered index on the timestamp column. My query would be on the fillfactor: 100% gives the best read performance at the expense of write performance, and you may be hurt by page splits. Choosing a lower fillfactor will delay page splitting at the expense of read performance, so it's a fine balancing act to get the best for your situation.
After the bulk deletes, it's worth rebuilding the indexes and updating statistics. This not only keeps performance up but also resets the indexes to the specified fillfactor.
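As a rough sketch of that post-delete maintenance step (the table name dbo.AppLog and the fill factor of 90 are assumptions chosen for illustration, not recommendations):

```sql
-- Assumptions: dbo.AppLog is a placeholder table name, and 90 is an
-- arbitrary example fill factor.
ALTER INDEX ALL ON dbo.AppLog
    REBUILD WITH (FILLFACTOR = 90);

-- A rebuild already refreshes the statistics that belong to the rebuilt
-- indexes; this statement covers column statistics not tied to an index.
UPDATE STATISTICS dbo.AppLog WITH FULLSCAN, COLUMNS;
```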
Finally, yes, put nonclustered indexes on other appropriate columns, but only on ones that are very selective (e.g. not bit fields). But remember: the more indexes there are, the more write performance is affected.
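One simple way to judge whether a column is selective enough to be worth indexing is to compare its distinct values against the row count. A minimal sketch, assuming a hypothetical table dbo.AppLog with a Source column:

```sql
-- Assumption: dbo.AppLog and its Source column are placeholder names.
-- A ratio close to 1 means nearly every value is distinct (selective);
-- a ratio close to 0 means few distinct values, as with a bit column.
SELECT COUNT(DISTINCT Source) * 1.0 / NULLIF(COUNT(*), 0) AS selectivity
FROM   dbo.AppLog;
```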
#2
5
Contrary to what a lot of people believe, having a good clustered index on a table can actually make operations like INSERTs faster - yes, faster!
Check out the seminal blog post The Clustered Index Debate Continues.... by Kimberly Tripp - the ultimate indexing queen.
She mentions (around the middle of the article):
Inserts are faster in a clustered table (but only in the "right" clustered table) than compared to a heap. The primary problem here is that lookups in the IAM/PFS to determine the insert location in a heap are slower than in a clustered table (where insert location is known, defined by the clustered key). Inserts are faster when inserted into a table where order is defined (CL) and where that order is ever-increasing.
The crucial point is: only with the right clustered index will you be able to reap the benefits - when a clustered index is unique, narrow, stable and optimally ever-increasing. This is best served with an INT IDENTITY column.
Kimberly Tripp also has a great article on how to pick the best possible clustering key for your tables, and what criteria it should fulfil - see her post entitled Ever-increasing clustering key - the Clustered Index Debate..........again!
If you have such a column - e.g. a surrogate primary key - use that for your clustering key and you should see very nice performance on your table - even on lots of INSERTs.
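A minimal sketch of that layout, with an INT IDENTITY surrogate key as the clustering key. All object names here are hypothetical, and the extra index on the timestamp is an assumption based on the question's stated query pattern rather than part of Kimberly Tripp's advice:

```sql
-- Assumption: all names are placeholders for illustration.
CREATE TABLE dbo.AppLog (
    LogId     INT IDENTITY(1,1) NOT NULL,
    LoggedAt  DATETIME          NOT NULL,
    Message   NVARCHAR(MAX)     NULL,
    CONSTRAINT PK_AppLog PRIMARY KEY CLUSTERED (LogId)
);

-- The question's main search criterion is the timestamp, so in this
-- layout it would need its own non-clustered index.
CREATE NONCLUSTERED INDEX IX_AppLog_LoggedAt
    ON dbo.AppLog (LoggedAt);
```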
#3
3
There are two "best practice" ways to index a high-traffic logging table:
- an integer identity column as a primary clustered key
- a uniqueidentifier column as primary key, with DEFAULT NEWSEQUENTIALID()
Both methods allow SQL Server to grow the table efficiently, because it knows that the index tree will grow in a particular direction.
I would not put any other indexes on the table, or schedule rebuilds of the index, unless there is a specific performance issue.
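The second option from the list above could be declared roughly as follows. This is a sketch with placeholder names; making the primary key clustered is an assumption, since the answer only says "primary key" for this variant.

```sql
-- Assumption: all names are placeholders, and the primary key is assumed
-- to be clustered. NEWSEQUENTIALID() can only be used in a DEFAULT
-- constraint on a uniqueidentifier column.
CREATE TABLE dbo.AppLog (
    LogId     UNIQUEIDENTIFIER NOT NULL
              CONSTRAINT DF_AppLog_LogId DEFAULT NEWSEQUENTIALID(),
    LoggedAt  DATETIME         NOT NULL,
    Message   NVARCHAR(MAX)    NULL,
    CONSTRAINT PK_AppLog PRIMARY KEY CLUSTERED (LogId)
);

-- LogId is omitted from the insert so that the default supplies it.
INSERT INTO dbo.AppLog (LoggedAt, Message)
VALUES (GETDATE(), N'example row');
```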
#4
0
The obvious answer is that it depends on how you will query it. The point of an index is to reduce the number of comparisons needed when selecting data. The clustered index helps when you consider what data you will load together and the blocking factor of the storage (you can load a bunch of data in a 64k block with one read). If you include an ID and a datetime as the primary key but don't use them in your selection criteria, they will do nothing but hinder your performance. This is why people usually drop indexes before bulk-loading data.
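In SQL Server, one way to get the effect of dropping an index around a bulk load is to disable it and rebuild it afterwards. A sketch under assumptions: dbo.AppLog, dbo.AppLog_Staging, and IX_AppLog_Source are all hypothetical names.

```sql
-- Assumptions: dbo.AppLog, dbo.AppLog_Staging, and IX_AppLog_Source are
-- placeholder names. Only the non-clustered index is disabled; disabling
-- the clustered index would make the table inaccessible.
ALTER INDEX IX_AppLog_Source ON dbo.AppLog DISABLE;

INSERT INTO dbo.AppLog (LoggedAt, Source, Severity, Message)
SELECT LoggedAt, Source, Severity, Message
FROM   dbo.AppLog_Staging;

-- Rebuilding re-enables the index and builds it in one pass.
ALTER INDEX IX_AppLog_Source ON dbo.AppLog REBUILD;
```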