We have a legacy SQL Server database (2005 and 2008).
All of the primary keys in the tables are uniqueidentifier columns.
The tables currently have no clustered index, and we are running into performance issues on tables with only 750k records. This is the first database I've worked on with unique identifiers as the sole primary key, and I've never seen SQL Server be this slow at returning data.
I don't want to create a clustered index on the uniqueidentifier column, as the values are not sequential and would therefore slow the apps down when inserting data.
We cannot remove the uniqueidentifier, as it is used for remote-site record identity management.
I had thought about adding a big integer identity column to the tables, creating the clustered index on this column, and including the unique identifier column.
i.e.
int identity - first column, to maintain insert speeds
unique identifier - to ensure the application keeps working as expected
The goal is to improve the identity query and joined table query performance.
Q1: Will this improve the query performance of the db or will it slow it down?
Q2: Is there an alternative to this that I haven't listed?
Thanks Pete
Edit: The performance issues are with retrieving data through SELECT statements, especially when a few of the more "transactional / changing" tables are joined together.
Edit 2: The joins between tables are generally between primary keys and foreign keys. For tables that have foreign keys, those columns are included in the non-clustered index to make it more covering.
None of the tables has other columns that would make a good clustered index.
I'm leaning more towards adding an additional identity column on each of the high load tables and then including the current Guid PK column within the clustered index to provide the best query performance.
Edit 3: I would estimate that 80% of the queries are performed on primary and foreign keys alone through the data access mechanism. Generally our data model has lazy-loaded objects which perform the query when accessed; these queries use the object's id and the PK column. We have a large number of user-driven data exclusion / inclusion queries which use the foreign key columns as a filter, based on criteria of the form "for type X, exclude the following ids". The remaining 20% are WHERE clauses on enum (int) or date-range columns; very few text-based queries are performed in the system.
Where possible I have already added covering indexes for the heaviest queries, but I'm still disappointed by the performance. As bluefooted says, the data is being stored as a heap.
4 Answers
#1
4
If you don't have a clustered index on the table, it is being stored as a heap rather than a b-tree. Heap data access is absolutely atrocious in SQL Server so you definitely need to add a clustered index.
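If you want to confirm which tables are currently heaps, a catalog query along these lines should work on SQL Server 2005 and later. It is only a sketch: it relies on `sys.indexes` holding a row with `type = 0` for every table that has no clustered index, and the column aliases are my own.

```sql
-- List every user table that is stored as a heap (no clustered index).
SELECT s.name AS SchemaName,
       t.name AS TableName
FROM sys.tables  AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
JOIN sys.indexes AS i ON i.object_id = t.object_id
WHERE i.type = 0              -- 0 = HEAP
ORDER BY s.name, t.name;
```

Any table returned here has no clustered index, whatever nonclustered indexes or primary key constraints it carries.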
I agree with your analysis that the GUID column is a poor choice for clustering, especially since you don't have the ability to use NEWSEQUENTIALID(). You could create a new artificial integer key if you like, but if there is another column or combination of columns that would make sense as a clustered index, that is fine as well.
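A minimal sketch of the artificial integer key option, assuming a hypothetical table `dbo.Orders` whose GUID primary key is already enforced by a nonclustered index. The table, column, and index names are placeholders, not taken from your schema.

```sql
-- 1) Add a narrow, ever-increasing surrogate column.
--    Existing rows are numbered automatically when the column is added.
ALTER TABLE dbo.Orders
    ADD RowId BIGINT IDENTITY(1, 1) NOT NULL;

-- 2) Cluster the table on the new column. The GUID primary key stays in
--    place as a nonclustered index, so the application can keep using it.
CREATE UNIQUE CLUSTERED INDEX CIX_Orders_RowId
    ON dbo.Orders (RowId);
```

Building the clustered index rewrites the table and rebuilds its existing nonclustered indexes, so this is probably best done in a maintenance window. Lookups by GUID then become a nonclustered seek followed by a key lookup, unless the nonclustered index covers the query.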
Do you have a field that is used frequently for range scans? Which columns are used for joins? Is there a combination of columns that also uniquely identifies the row aside from the GUID? Posting a sample of the data model would help us to suggest a good candidate for clustering.
#2
2
I'm not sure where your GUIDs come from, but if they're being generated during the insert, using NEWSEQUENTIALID() in SQL Server instead of NEWID() will help you avoid fragmentation issues during inserts.
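For illustration, this is roughly what server-side generation looks like. `NEWSEQUENTIALID()` can only be used in a DEFAULT constraint on a uniqueidentifier column, so the value has to be assigned by the server. The table and constraint names below are hypothetical.

```sql
-- Hypothetical table: the GUID key is assigned by a DEFAULT constraint.
CREATE TABLE dbo.Customer
(
    CustomerId UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_Customer_CustomerId DEFAULT NEWSEQUENTIALID()
        CONSTRAINT PK_Customer PRIMARY KEY CLUSTERED,
    Name NVARCHAR(100) NOT NULL
);

-- The key column is omitted, so the default supplies an increasing GUID.
INSERT INTO dbo.Customer (Name) VALUES (N'Example');
```

This only helps if the application can let the server assign the key. GUIDs created on the client with Guid.NewGuid() remain random.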
Regarding the choice of a clustered index, as Kimberly L. Tripp states here: "the most important factors in choosing a clustered index are that it's unique, narrow and static (ever-increasing has other benefits to minimizing splits)." A GUID falls short on the narrow requirement when compared to an INT or even BIGINT.
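To put rough numbers on "narrow", the storage sizes of the three candidate key types can be checked directly:

```sql
-- Storage size of each candidate clustering key type, in bytes.
SELECT DATALENGTH(NEWID())           AS GuidBytes,    -- 16
       DATALENGTH(CAST(1 AS INT))    AS IntBytes,     -- 4
       DATALENGTH(CAST(1 AS BIGINT)) AS BigIntBytes;  -- 8
```

Because the clustering key is carried in every row of every nonclustered index on the table, the difference adds up. On a 750k-row table, a GUID clustering key costs about 750,000 × (16 − 4) = 9,000,000 extra bytes per nonclustered index compared with an INT, before any page overhead.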
Kimberly also has an excellent article on GUIDs as PRIMARY KEYs and/or the clustering key.
#3
1
It's not 100% clear to me: is your number 1 access pattern to query the tables by the GUID or by other columns? And when joining to other tables, what columns (and data types) are most often used?
I can't really give you any solid recommendations until I understand more about how these GUIDs are used. I realize you said they're primary keys, but that doesn't guarantee they are used as the primary conditions on queries or in joins.
UPDATE
Now that I know a little more, I have a crazy suggestion. Do cluster those tables on the GUIDs, but set the fill factor to 60%. This will ameliorate the page split problem and give you better performance querying on those puppies.
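A sketch of what that could look like, assuming a hypothetical table `dbo.Orders` with GUID key column `OrderId` (the names are placeholders). Fill factor is only applied when an index is built or rebuilt, so the free space has to be restored periodically.

```sql
-- Cluster on the GUID, leaving about 40% of each leaf page free
-- so that random inserts have room before a page has to split.
CREATE UNIQUE CLUSTERED INDEX CIX_Orders_OrderId
    ON dbo.Orders (OrderId)
    WITH (FILLFACTOR = 60);

-- Check fragmentation from time to time...
SELECT index_id, avg_fragmentation_in_percent, page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.Orders'), NULL, NULL, 'LIMITED');

-- ...and rebuild to restore the free space once the pages have filled up.
ALTER INDEX CIX_Orders_OrderId ON dbo.Orders
    REBUILD WITH (FILLFACTOR = 60);
```

If the primary key is already enforced by a nonclustered index on the same column, that index becomes redundant once the clustered one exists. The trade-off is that pages kept 60% full take more space on disk and in memory, so scans read more pages.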
As for using Guid.NewGuid(), it seems that you can generate sequential GUIDs in C# after all. I found the following code here on SO:
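    using System;
    using System.Runtime.InteropServices;

    // The wrapper class name is illustrative; the original snippet showed only the members.
    public static class GuidUtil
    {
        // P/Invoke into the Windows RPC runtime (Windows only).
        [DllImport("rpcrt4.dll", SetLastError = true)]
        static extern int UuidCreateSequential(out Guid guid);

        public static Guid SequentialGuid()
        {
            const int RPC_S_OK = 0;
            Guid g;
            // Fall back to a random GUID if the sequential call does not succeed.
            if (UuidCreateSequential(out g) != RPC_S_OK)
                return Guid.NewGuid();
            return g;
        }
    }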
NEWSEQUENTIALID() is essentially a wrapper for UuidCreateSequential, with some byte shuffling applied, so GUIDs generated directly on the client may not sort in the same order on the server. If you can't use this directly on the client, you can make a quick round-trip to the server to get a new sequential id from there, perhaps with a "dispenser" table and a stored procedure to do the job.
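One way the "dispenser" idea might look is sketched below. The table, procedure, and parameter names are all hypothetical, and `GO` is the batch separator used by SSMS and sqlcmd rather than a T-SQL statement.

```sql
-- Dispenser table: its only job is to let the DEFAULT constraint mint a GUID.
CREATE TABLE dbo.GuidDispenser
(
    Id UNIQUEIDENTIFIER NOT NULL
        CONSTRAINT DF_GuidDispenser_Id DEFAULT NEWSEQUENTIALID()
);
GO

CREATE PROCEDURE dbo.GetSequentialGuid
    @NewGuid UNIQUEIDENTIFIER OUTPUT
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @Issued TABLE (Id UNIQUEIDENTIFIER NOT NULL);

    -- Insert a row of defaults and capture the generated value.
    INSERT INTO dbo.GuidDispenser
    OUTPUT inserted.Id INTO @Issued (Id)
    DEFAULT VALUES;

    SELECT @NewGuid = Id FROM @Issued;

    -- Remove the row again so the dispenser table stays empty.
    DELETE FROM dbo.GuidDispenser WHERE Id = @NewGuid;
END;
GO
```

A caller would then run something like `DECLARE @g UNIQUEIDENTIFIER; EXEC dbo.GetSequentialGuid @NewGuid = @g OUTPUT;` before inserting the real row. Whether the extra round-trip is acceptable depends on how often new records are created.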
#4
0
You don't indicate what your performance issues are. If the worst performing action is an INSERT, then maybe your solution is right. If it's something else, then I'd look at how the clustered index can help that.
You might look at the existing indexes on the table and the queries that use them. You may be able to select an index that, while degrading INSERTs slightly, provides a greater benefit to the current performance-problem areas.
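As a starting point for that review, the index usage DMV (available from SQL Server 2005) shows how often each index has been used for seeks, scans, and lookups versus how often it has had to be maintained. This is a sketch; the aliases are my own, and the counters are reset whenever the instance restarts.

```sql
-- Reads (seeks, scans, lookups) versus writes (updates) per index
-- for user tables in the current database.
SELECT OBJECT_NAME(i.object_id) AS TableName,
       i.name                   AS IndexName,
       i.type_desc              AS IndexType,
       us.user_seeks,
       us.user_scans,
       us.user_lookups,
       us.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS us
       ON us.object_id   = i.object_id
      AND us.index_id    = i.index_id
      AND us.database_id = DB_ID()
WHERE OBJECTPROPERTY(i.object_id, 'IsUserTable') = 1
ORDER BY TableName, i.index_id;
```

An index with many seeks on the columns used for joins or range filters is a plausible clustering candidate, while one with high `user_updates` and almost no reads mostly adds insert cost.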