I've got a really large table (10+ million rows) that is starting to show signs of performance degradation for queries. Since this table will probably double or triple in size relatively soon I'm looking into partitioning the table to squeeze out some query performance.
The table looks something like this:
CREATE TABLE [my_data] (
[id] [int] IDENTITY(1,1) NOT NULL,
[topic_id] [int] NULL,
[data_value] [decimal](19, 5) NULL
)
So, a bunch of values for any given topic. Queries on this table will always be by topic ID, so there's a clustered index on (id, topic_id).
Anyway, since topic IDs aren't bounded (any number of topics could be added) I'd like to try partitioning this table on a modulus function of the topic IDs. So something like:
topic_id % 4 == 0 => partition 0
topic_id % 4 == 1 => partition 1
topic_id % 4 == 2 => partition 2
topic_id % 4 == 3 => partition 3
However, I haven't seen any way to tell "create partition function" or "create partition scheme" to perform this operation when deciding on a partition.
Is this even possible? How can we make a partition function based on an operation performed on the input value?
4 Solutions
#1
5
You just need to create your modulus column as a PERSISTED computed column.
Blue Peter style, here's one I made earlier (although I'm not 100% sure I have the partition values clause right):
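-- RANGE LEFT: each boundary value belongs to the partition on its left,
-- so the remainders 0, 1, 2 and 3 each land in their own partition.
CREATE PARTITION FUNCTION [PF_PartitonFour] (int)
AS RANGE LEFT
FOR VALUES (
0,
1,
2)
GO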
CREATE PARTITION SCHEME [PS_PartitionFourScheme]
AS PARTITION [PF_PartitonFour]
TO ([TestPartitionGroup1],
[TestPartitionGroup2],
[TestPartitionGroup3],
[TestPartitionGroup4])
GO
CREATE TABLE [my_data] (
[id] [int] IDENTITY(1,1) NOT NULL,
[topic_id] [int] NULL,
[data_value] [decimal](19, 5) NULL,
-- A computed column must be PERSISTED to be used as the partitioning column.
[PartitionElement] AS ([topic_id] % 4) PERSISTED
) ON [PS_PartitionFourScheme] (PartitionElement);
GO
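One way to check the partition values clause is to ask SQL Server which partition each remainder maps to. This is a minimal sketch, assuming the partition function and table above have been created under the names shown; `$PARTITION` is built in, and the column aliases are only illustrative.

```sql
-- Which partition does each possible remainder land in?
SELECT r.remainder,
       $PARTITION.[PF_PartitonFour](r.remainder) AS partition_number
FROM (SELECT 0 AS remainder UNION ALL SELECT 1
      UNION ALL SELECT 2 UNION ALL SELECT 3) AS r;

-- Row counts per partition once the table holds data.
SELECT $PARTITION.[PF_PartitonFour]([PartitionElement]) AS partition_number,
       COUNT(*) AS row_count
FROM [my_data]
GROUP BY $PARTITION.[PF_PartitonFour]([PartitionElement])
ORDER BY partition_number;
```

If two remainders report the same partition_number, the FOR VALUES list or the LEFT/RIGHT choice needs adjusting.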
#2
3
Hash partitioning is not available in SQL Server 2005/2008. You must use range partitioning.
That being said, you should be aware that partitioning is primarily a storage option, see Partitioned Table and Index Concepts:
Partitioning makes large tables or indexes more manageable, because partitioning enables you to manage and access subsets of data quickly and efficiently, while maintaining the integrity of a data collection. By using partitioning, an operation such as loading data from an OLTP to an OLAP system takes only seconds, instead of the minutes and hours the operation takes in earlier versions of SQL Server. Maintenance operations that are performed on subsets of data are also performed more efficiently because these operations target only the data that is required, instead of the whole table.
As you can see, the MSDN introduction to partitioning focuses on maintenance, manageability and data load. In my experience partitioning gives, at best, zero performance gain, especially in SQL 2005; usually it degrades performance. To improve performance you should use a correct clustered index and properly designed non-clustered indexes.
In SQL 2008 there are improvements in the parallel operators with regard to partitions if they are properly distributed from an IO point of view; see Designing Partitions to Improve Query Performance. Their benefits are marginal, though, and overshadowed by the benefits of a properly designed set of clustered and non-clustered indexes. Case in point: a clustered index on (id, topic_id), where id is an identity, is useful solely for single-item lookups by id. On the other hand, a clustered index on (topic_id, id) would benefit any query that looks for specific topic(s). I don't know your system requirements or the queries you run, but performance problems at 10M rows on such a narrow table smell like an indexing and querying issue, not a partitioning issue.
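For illustration, the re-keyed clustered index might look like the following. It is a sketch only: the index name [CIX_my_data_topic] is made up, and it assumes any existing clustered index on (id, topic_id) has been dropped first, since a table can have only one.

```sql
-- Hypothetical index name; drop the existing clustered index before running this.
CREATE CLUSTERED INDEX [CIX_my_data_topic]
ON [my_data] ([topic_id], [id]);

-- A per-topic query can now seek to the topic and read its rows as one contiguous range.
SELECT [id], [data_value]
FROM [my_data]
WHERE [topic_id] = 42;  -- 42 is an arbitrary example topic
```

Comparing the execution plan of the per-topic query before and after the change should show whether the scan has turned into a seek.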
#3
0
From the documentation, it seems like you have to give values to the function:
To create 4 partitions...
CREATE PARTITION FUNCTION myRangePF1 (int)
AS RANGE LEFT FOR VALUES (1, 100, 1000);
Couldn't you just do your computations before this call, find the proper values to split on, and substitute them into the call? Or am I missing why you want to use the modulus? Since your IDs may have gaps, you may need some statistics to work out where to partition.
-- @low, @Med and @High must be declared and assigned earlier in the same batch.
CREATE PARTITION FUNCTION myRangePF1 (int)
AS RANGE LEFT FOR VALUES (@low, @Med, @High);
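As a rough sketch of how the split points could be found, NTILE can cut the existing topic IDs into four roughly equal groups of rows. It assumes the my_data table from the question; the aliases quartile and upper_boundary are illustrative only.

```sql
-- Split the rows into four roughly equal groups ordered by topic_id
-- and report the highest topic_id in each group.
SELECT q.quartile,
       MAX(q.topic_id) AS upper_boundary
FROM (SELECT topic_id,
             NTILE(4) OVER (ORDER BY topic_id) AS quartile
      FROM my_data
      WHERE topic_id IS NOT NULL) AS q
GROUP BY q.quartile
ORDER BY q.quartile;
```

With RANGE LEFT, the upper_boundary values of the first three groups would be the candidates for the three boundary values. If one topic holds a very large share of the rows, two adjacent groups can report the same upper_boundary, in which case the boundaries have to be picked by hand.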
#4
0
10 million rows isn't that many for SQL Server to handle; regular index design would probably solve this without the need for partitioning. As has been noted, try clustering on different sets of columns; clustering on (topic_id, id) seems like something to test out, especially if most queries have topic_id as a criterion. A clustered index like that has approximately the same effect as partitioning, at least in that it groups the related rows of data together on disk and allows a range scan to fetch them quickly.
If that design works, all you have to worry about is fragmentation from inserts, but that's manageable. After getting the indexing right, make sure you have enough RAM, and that you don't have a disk bottleneck.
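To keep an eye on that fragmentation, something along these lines could be run periodically. It is a sketch that assumes the table is named my_data and lives in the current database; sys.dm_db_index_physical_stats and ALTER INDEX are standard, but whether to reorganize or rebuild depends on how fragmented the index turns out to be.

```sql
-- Report fragmentation for every index (and partition) on the table.
SELECT i.name AS index_name,
       ps.partition_number,
       ps.avg_fragmentation_in_percent,
       ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'my_data'), NULL, NULL, 'LIMITED') AS ps
JOIN sys.indexes AS i
  ON i.object_id = ps.object_id
 AND i.index_id = ps.index_id;

-- Lightweight defragmentation; use REBUILD instead when fragmentation is heavy.
ALTER INDEX ALL ON [my_data] REORGANIZE;
```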