The database I'm working with is currently over 100 GiB and promises to grow much larger over the next year or so. I'm trying to design a partitioning scheme that will work with my dataset but thus far have failed miserably. My problem is that queries against this database will typically test the values of multiple columns in this one large table, ending up in result sets that overlap in an unpredictable fashion.
Everyone (the DBAs I'm working with) warns against having tables over a certain size and I've researched and evaluated the solutions I've come across but they all seem to rely on a data characteristic that allows for logical table partitioning. Unfortunately, I do not see a way to achieve that given the structure of my tables.
Here's the structure of our two main tables to put this into perspective.
Table: Case
Columns:
Year
Type
Status
UniqueIdentifier
PrimaryKey
etc.
Table: Case_Participant
Columns:
Case.PrimaryKey
LastName
FirstName
SSN
DLN
OtherUniqueIdentifiers
Note that any of the columns above can be used as query parameters.
3 Answers
#1
Rather than guess, measure. Collect usage statistics (which queries actually run), look at the engine's own statistics such as sys.dm_db_index_usage_stats, and then make an informed decision: the partitioning that best balances data size and gives the best affinity for the most frequently run queries will be a good candidate. Of course you'll have to compromise.
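On SQL Server, a usage survey along the lines suggested above might look like the following. This is only a sketch: it aggregates seeks, scans, and lookups per index since the last restart, and the ordering column is one plausible proxy for "most often run".

```sql
-- Which indexes are actually being used, and how (SQL Server)?
-- Counters reset on instance restart, so collect over a representative period.
SELECT o.name AS table_name,
       i.name AS index_name,
       s.user_seeks,
       s.user_scans,
       s.user_lookups,
       s.user_updates,
       s.last_user_seek
FROM sys.dm_db_index_usage_stats AS s
JOIN sys.indexes AS i
  ON i.object_id = s.object_id AND i.index_id = s.index_id
JOIN sys.objects AS o
  ON o.object_id = s.object_id
WHERE s.database_id = DB_ID()   -- current database only
  AND o.type = 'U'              -- user tables only
ORDER BY s.user_seeks + s.user_scans + s.user_lookups DESC;
```

Pairing this with a trace or Query Store capture of the actual query texts tells you which columns the frequent queries filter on, which is the input the partitioning decision needs.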
Also don't forget that partitioning is per index, not per table (the 'table' itself is just one of the indexes, the clustered one), so the question is not only what to partition on, but which indexes to partition and what partitioning function to use. The clustered indexes on the two tables are obviously the most likely candidates (there is not much sense in partitioning just a non-clustered index while leaving the clustered one unpartitioned), so unless you're considering a redesign of your clustered keys, the question is really what partitioning function to choose for your clustered indexes.
If I were to venture a guess, I'd say that for any data that accumulates over time (like 'cases' with a 'year'), the most natural partition is the sliding window.
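A minimal T-SQL sketch of such a yearly sliding window for the Case table, assuming SQL Server and placing all partitions on [PRIMARY] (filegroup layout, boundary years, and index names here are illustrative assumptions, not from the original post):

```sql
-- One partition per year; RANGE RIGHT puts each boundary year in its own partition.
CREATE PARTITION FUNCTION pfCaseYear (int)
    AS RANGE RIGHT FOR VALUES (2006, 2007, 2008, 2009);

CREATE PARTITION SCHEME psCaseYear
    AS PARTITION pfCaseYear ALL TO ([PRIMARY]);

-- Rebuilding the clustered index on the scheme partitions the table itself.
CREATE CLUSTERED INDEX IX_Case_Year
    ON [Case] (Year, PrimaryKey)
    ON psCaseYear (Year);

-- Sliding the window later: add a boundary for the new year...
ALTER PARTITION FUNCTION pfCaseYear () SPLIT RANGE (2010);
-- ...and age out the oldest year by SWITCHing its partition to an archive
-- table and MERGEing its boundary away.
```

The payoff is that old years can be switched out (a metadata-only operation) instead of deleted row by row, which is usually the main reason to bother with a sliding window.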
#2
If you have no other choice, you can partition by key modulo the number of partition tables. Let's say you want to partition into 10 tables. You would define tables:
Case00
Case01
...
Case09
And partition your data by UniqueIdentifier or PrimaryKey modulo 10, placing each record in the corresponding table (depending on how your UniqueIdentifier values are generated, you might need to start allocating ids manually).
When performing a query, you will need to run the same query on all tables, and use UNION to merge the result sets into a single query result.
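A stripped-down sketch of this manual modulo scheme, showing 2 of the 10 tables (column list simplified; names are assumptions):

```sql
-- One table per modulo bucket; the target table for a row is PrimaryKey % 10.
-- Routing typically lives in application code or an INSTEAD OF trigger.
CREATE TABLE Case00 (PrimaryKey int PRIMARY KEY, Year int, Type int, Status int);
CREATE TABLE Case01 (PrimaryKey int PRIMARY KEY, Year int, Type int, Status int);
-- ... Case02 through Case09 defined the same way.

-- Querying: the same predicate against every table, merged into one result.
-- UNION ALL is preferable to plain UNION here: the modulo split guarantees
-- the branches are disjoint, so UNION would pay for needless de-duplication.
SELECT PrimaryKey, Year, Type, Status FROM Case00 WHERE Year = 2008 AND Status = 1
UNION ALL
SELECT PrimaryKey, Year, Type, Status FROM Case01 WHERE Year = 2008 AND Status = 1
-- UNION ALL ... through Case09
;
```

Wrapping the UNION ALL in a view keeps the fan-out query in one place instead of repeating it in every caller.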
It's not as good as partitioning the tables based on some logical separation that corresponds to the expected queries, but it's better than hitting the size limit of a table.
#3
Another possible thing to look at (before partitioning) is your model.
Is your database normalized? Are there further steps that could improve performance through different choices in normalization, denormalization, or partial normalization? Is there an option to transform the data into a Kimball-style dimensional star model, which is optimal for reporting/querying?
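For this data, one conceivable star shape (all names here are illustrative, not a recommendation for this exact schema) is a fact row per case participation with the query columns pushed into dimensions:

```sql
-- Dimensions hold the descriptive attributes queries filter on.
CREATE TABLE DimCase (
    CaseKey int PRIMARY KEY,
    Year    int,
    Type    varchar(50),
    Status  varchar(50)
);
CREATE TABLE DimParticipant (
    ParticipantKey int PRIMARY KEY,
    LastName  varchar(100),
    FirstName varchar(100),
    SSN       char(11),
    DLN       varchar(20)
);
-- The fact table is narrow: just the keys linking the dimensions.
CREATE TABLE FactCaseParticipant (
    CaseKey        int REFERENCES DimCase (CaseKey),
    ParticipantKey int REFERENCES DimParticipant (ParticipantKey)
);
```

Whether this helps depends on the workload: a star pays off for reporting-style aggregation, less so for point lookups by SSN or DLN, which indexes on the base tables already serve.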
If you aren't going to drop partitions of the table (sliding window, as mentioned) or treat different partitions differently (you say any columns can be used in the query), I'm not sure what you are trying to get out of the partitioning that you won't already get out of your indexing strategy.
I'm not aware of any table limits on rows. AFAIK, the number of rows is limited only by available storage.