I'm designing a database that will need to be optimized for maximum speed.
All of the database's data is generated once from what I call an input database (which holds the data I'm editing: mainly polylines, markers, etc. for Google Maps).
So the database is not subject to editing, but it needs to hold as much data as it can so that results (routes across town, custom polylines, etc.) can be displayed to the user quickly.
The question is: will choosing smaller data types (for example, smallint over int) improve performance or hurt it? Space is not really a problem: after some quick calculations, the database will not exceed 200 MB, and no table will have more than 100,000 rows (the average will be around 5,000).
I'm asking this because I've read some articles around the internet: some say that smaller data types improve performance, others say that they hurt it because additional processing must be done. I'm aware that for smaller databases the difference is probably not noticeable, but I'm interested in every bit because I'm expecting many requests, which will trigger a lot more queries.
The hosting environment is gonna be Windows Server 2008 R2 with SQL Server 2008 R2.
EDIT 1: Just to give you an example because I don't have a proper table structure yet: I'm going to have a table which will hold public transportation lines (somewhere around 200), identified by a unique number in real life, and which is going to be referenced in all sorts of tables and on which all sorts of operations are going to be made. These referencing tables will hold the largest amount of data.
Because lines have unique numbers, I have thought of 3 examples of designs:
- The PK is the line number, with data type smallint.
- The PK is the line number, with data type int.
- The PK is something different (an identity column, for example) and the line number is stored in a separate field.
- Just for the sake of argument, because I used this in the 'input database', which is not subject to optimization: the PK is a GUID (16 bytes). If you like, you can compare how bad this is relative to the others, if it really is bad.
So keep in mind that the PK is going to be referenced in at least 15 tables, some of which will have over 50,000 rows (the rest averaging 5,000, as I said above). These tables are going to be subject to constant querying and manipulation, and I'm interested in every bit of speed that I can get.
I can detail this even more if you need. Thanks
EDIT 2: And another question related to this came to my mind, think it fits into this discussion:
Will I see any performance improvements in this specific scenario if I use native SQL queries from inside my .NET application rather than using LINQ to SQL? I know LINQ is strongly optimized and generates very good queries performance-wise, but still, sure worth asking. Thanks again.
3 Solutions
#1
4
Can you point to some articles that say that smaller data types = more processing? Keeping in mind that even with SSDs most workloads today are I/O-bound (or memory-bound) and not CPU-bound.
Particularly in cases where the PK is going to be referenced in many tables, it will be beneficial to use the smallest data type possible. In this case, if that's a SMALLINT then that's what I would use (though you say there are about 200 values, so theoretically you could use TINYINT, which is half the size and supports 0-255). Where you need to exercise caution is if you aren't 100% sure that there will always be ~200 values. Once you need 256 you're going to have to change the data type in all of the affected tables, and this is going to be a pain. So sometimes a trade-off is made between accommodating future growth and squeezing out the absolute most performance today. If you don't know for certain that you will never exceed 255 or 32,767 values then I would probably just use an INT. Unless you also don't know that you won't ever exceed 2 billion values, in which case you would use BIGINT.
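The sizing rule in the paragraph above can be sketched in a few lines of Python. This is only an illustration, not part of the original answer: the ranges and byte sizes are the documented SQL Server ones, while the function name `smallest_int_type` and the `headroom` growth factor are hypothetical names introduced here.

```python
import math

# (name, minimum, maximum, storage bytes) for SQL Server integer types.
INT_TYPES = [
    ("TINYINT", 0, 255, 1),
    ("SMALLINT", -2**15, 2**15 - 1, 2),
    ("INT", -2**31, 2**31 - 1, 4),
    ("BIGINT", -2**63, 2**63 - 1, 8),
]


def smallest_int_type(max_value, min_value=0, headroom=1.0):
    """Return (type name, bytes) of the narrowest type covering the range.

    headroom is a hypothetical growth factor: 2.0 means "plan for twice
    as many values as there are today".
    """
    target = math.ceil(max_value * headroom)
    for name, low, high, size in INT_TYPES:
        if low <= min_value and target <= high:
            return name, size
    raise ValueError("range does not fit in BIGINT")


if __name__ == "__main__":
    print(smallest_int_type(200))                # fits TINYINT today
    print(smallest_int_type(200, headroom=2.0))  # doubling forces SMALLINT
```

With roughly 200 line numbers, any growth allowance beyond about 25% already rules out TINYINT, which is the trade-off described above.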
The difference between INT/SMALLINT/TINYINT is going to be more noticeable in disk space than in performance. (And if you're on Enterprise, the differences in both disk space and performance can be offset quite a bit using data compression - particularly while your INT values all fit within SMALLINT/TINYINT, though in the latter case it really will be negligible because the values are unique.) On the other hand, the difference between any of these and GUID is going to be much more noticeable in both performance and disk space. Marc gave some great links from Kimberly; I wrote this article in 2003 and while it's a little dated it does contain most of the salient points that are still relevant today.
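To put rough numbers on the disk-space difference, here is a back-of-the-envelope sketch in Python. It counts only the raw bytes of the referencing key column; row overhead, indexes, page structure, and compression are all ignored. The figure of 15 referencing tables comes from the question, and 50,000 rows per table is a deliberately pessimistic assumption (the question says most tables average about 5,000). The function names are hypothetical.

```python
# Storage size in bytes of each candidate key type in SQL Server.
KEY_BYTES = {
    "TINYINT": 1,
    "SMALLINT": 2,
    "INT": 4,
    "BIGINT": 8,
    "UNIQUEIDENTIFIER": 16,  # the GUID type
}


def raw_key_bytes(type_name, referencing_tables, rows_per_table):
    """Bytes used by the key column alone across all referencing tables."""
    return KEY_BYTES[type_name] * referencing_tables * rows_per_table


def compare(referencing_tables=15, rows_per_table=50_000):
    """Map each candidate type to its total raw key bytes."""
    return {
        name: raw_key_bytes(name, referencing_tables, rows_per_table)
        for name in KEY_BYTES
    }


if __name__ == "__main__":
    for name, total in compare().items():
        print(f"{name:<17}{total / 1_000_000:>6.2f} MB")
```

Under these assumptions the gap between SMALLINT and INT is about 1.5 MB in total, while a GUID key costs about 9 MB more than an INT, before counting any index that repeats the key.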
Another trade-off that sometimes needs to be considered (though not in your specific case, it seems) is whether values need to be unique across multiple systems. This is where you might need to sacrifice some performance in order to meet business requirements. In a lot of cases folks take the easy way and resign themselves to GUID. But there are other solutions too, such as identity ranges, a central custom sequence generator, and the new SEQUENCE object in SQL Server 2012. I wrote about SEQUENCE back in 2010 when the first public beta of SQL Server 2012 was released.
#2
0
I think you will need to provide some more details about the table structure and sample queries that will be running against it. Based on the information that you have provided, I believe that the impact of choosing smaller data types will be just a couple of percent, and I would suggest paying more attention to the indexes that you will have. SQL Server does a good job of suggesting which indexes to create, through the execution plans for your queries and the Database Engine Tuning Advisor tool.
#3
-2
One suggestion that I have is to incorporate a decimal datatype instead of using a combination of fields. For example, instead of having a table with Date (YYYYMMDD), Store (SSSS), and Item (IIII), I would recommend...YYYYMMDD.SSSSIIII. Especially when querying multiple tables with this same key combination, it dramatically improves processing time.
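As an illustration of the packing this answer describes, here is a minimal Python sketch that combines the three parts into one YYYYMMDD.SSSSIIII value and splits it again. The function names `pack_key` and `unpack_key` are hypothetical, and the sketch assumes store and item numbers of at most four digits, as in the example above.

```python
from decimal import Decimal


def pack_key(date_yyyymmdd, store, item):
    """Pack date, store and item into one YYYYMMDD.SSSSIIII decimal."""
    if not 10000101 <= date_yyyymmdd <= 99991231:
        raise ValueError("date must be an 8-digit YYYYMMDD integer")
    if not (0 <= store <= 9999 and 0 <= item <= 9999):
        raise ValueError("store and item must each fit in 4 digits")
    # Build from a string so the value is exact (no binary float rounding).
    return Decimal(f"{date_yyyymmdd:08d}.{store:04d}{item:04d}")


def unpack_key(key):
    """Split a packed key back into (date, store, item)."""
    date_part = int(key)                          # integer part: YYYYMMDD
    fraction = int((key - date_part) * 10**8)     # eight digits: SSSSIIII
    store, item = divmod(fraction, 10_000)
    return date_part, store, item


if __name__ == "__main__":
    key = pack_key(20240131, 12, 7)
    print(key)
    print(unpack_key(key))
```

In SQL Server such a value needs a DECIMAL(16,8) column, which occupies 9 bytes.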