When designing a schema for a DB (e.g. MySQL) the question arises whether or not to completely normalize the tables.
On one hand joins (and foreign key constraints, etc.) are very slow, and on the other hand you get redundant data and the potential for inconsistency.
Is "optimize last" the correct approach here? i.e. create a by-the-book normalized DB and then see what can be denormalized to achieve the optimal speed gain.
My fear, regarding this approach, is that I will settle on a DB design that might not be fast enough - but at that stage refactoring the schema (while supporting existing data) would be very painful. This is why I'm tempted to just temporarily forget everything I learned about "proper" RDBMS practices, and try the "flat table" approach for once.
Should the fact that this DB is going to be insert-heavy affect the decision?
9 Answers
#1
29
A philosophical answer: Sub-optimal (relational) databases are rife with insert, update, and delete anomalies. These all lead to inconsistent data, resulting in poor data quality. If you can't trust the accuracy of your data, what good is it? Ask yourself this: Do you want the right answers slower or do you want the wrong answers faster?
As a practical matter: get it right before you get it fast. We humans are very bad at predicting where bottlenecks will occur. Make the database great, measure the performance over a decent period of time, then decide if you need to make it faster. Before you denormalize and sacrifice accuracy, try other techniques: can you get a faster server, connection, db driver, etc.? Might stored procedures speed things up? How are the indexes and their fill factors? If those and other performance and tuning techniques do not do the trick, only then consider denormalization. Then measure the performance to verify that you got the increase in speed that you "paid for". Make sure that you are performing optimization, not pessimization.
[edit]
Q: So if I optimize last, can you recommend a reasonable way to migrate data after the schema is changed? If, for example, I decide to get rid of a lookup table - how can I migrate existing databases to this new design?
A: Sure.
- Make a backup.
- Make another backup to a different device.
- Create new tables with "select into newtable from oldtable..." type commands. You'll need to do some joins to combine previously distinct tables.
- Drop the old tables.
- Rename the new tables.
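As a minimal sketch of steps 3-5 above (using SQLite in memory for portability; in MySQL you would write `CREATE TABLE ... AS SELECT` or `INSERT INTO ... SELECT`, and the table names here are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Original normalized design: orders reference a status lookup table.
cur.executescript("""
CREATE TABLE status (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, status_id INTEGER REFERENCES status(id));
INSERT INTO status VALUES (1, 'open'), (2, 'shipped');
INSERT INTO orders VALUES (10, 1), (11, 2);

-- Step 3: build the new table by joining the previously distinct tables.
CREATE TABLE orders_new AS
SELECT o.id, s.name AS status
FROM orders o JOIN status s ON s.id = o.status_id;

-- Steps 4-5: drop the old tables, rename the new one into place.
DROP TABLE orders;
DROP TABLE status;
ALTER TABLE orders_new RENAME TO orders;
""")

print(cur.execute("SELECT id, status FROM orders ORDER BY id").fetchall())
# → [(10, 'open'), (11, 'shipped')]
```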
BUT... consider a more robust approach:
Create some views on your fully normalized tables right now. Those views (virtual tables, "windows" on the data... ask me if you want to know more about this topic) would have the same defining query as step three above. When you write your application or DB-layer logic, use the views (at least for read access; updatable views are... well, interesting). Then if you denormalize later, create a new table as above, drop the view, and rename the new base table to whatever the view was called. Your application/DB-layer won't know the difference.
There's actually more to this in practice, but this should get you started.
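A small sketch of that view-based indirection (SQLite again, with made-up table names): the reading code only ever touches `customer_orders`, so the base tables can later be denormalised behind it without the reader noticing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
INSERT INTO customers VALUES (1, 'Ada');
INSERT INTO orders VALUES (100, 1, 9.5);

-- The application reads only through this view.
CREATE VIEW customer_orders AS
SELECT o.id AS order_id, c.name AS customer, o.total
FROM orders o JOIN customers c ON c.id = o.customer_id;
""")
before = cur.execute("SELECT * FROM customer_orders").fetchall()

# Later, denormalise: materialise the view into a base table and swap it in.
cur.executescript("""
CREATE TABLE customer_orders_flat AS SELECT * FROM customer_orders;
DROP VIEW customer_orders;
DROP TABLE orders;
DROP TABLE customers;
ALTER TABLE customer_orders_flat RENAME TO customer_orders;
""")
after = cur.execute("SELECT * FROM customer_orders").fetchall()
assert before == after  # the reading code can't tell the difference
```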
#2
13
The usage pattern of your database (insert-heavy vs. reporting-heavy) will definitely affect your normalization. Furthermore, you may want to look at your indexing, etc. if you are seeing a significant slowdown with normalized tables. Which version of MySQL are you using?
In general, an insert-heavy database should be more normalized than a reporting-heavy database. However, YMMV of course...
#3
7
A normal design is the place to start; get it right, first, because you may not need to make it fast.
The concern about time-costly joins is often based on experience with poor designs. As the design becomes more normal, the number of tables in the design usually increases while the number of columns and rows in each table decreases, the number of unions in the design increases as the number of joins decreases, indices become more useful, etc. In other words: good things happen.
And normalization is only one way to end up with a normal design...
#4
4
Is "optimize last" the correct approach here? i.e. create a by-the-book normalized DB and then see what can be denormalized to achieve the optimal speed gain.
I'd say, yes. I've had to deal with badly structured DBs too many times to condone 'flat table' ones without a good deal of thought.
Actually, inserts usually behave well on fully normalized DBs, so if it is insert-heavy this shouldn't be a factor.
#5
4
On an insert-heavy database, I'd definitely start with normalized tables. If you have performance problems with queries, I'd first try to optimize the query and add useful indexes.
Only if this does not help should you try denormalized tables. Be sure to benchmark both inserts and queries before and after denormalization, since it's likely that you are slowing down your inserts.
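A minimal harness along those lines (SQLite in memory, invented tables; the absolute numbers are meaningless here, only the before/after comparison on your real schema and workload matters):

```python
import sqlite3
import time

def timed(cur, sql, rows):
    """Run one statement for every row and return elapsed seconds."""
    start = time.perf_counter()
    cur.executemany(sql, rows)
    return time.perf_counter() - start

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE events (id INTEGER PRIMARY KEY, type_id INTEGER);
CREATE TABLE events_flat (id INTEGER PRIMARY KEY, type_name TEXT);
""")

# Normalized: each row carries only a small foreign-key value.
rows = [(i, i % 3) for i in range(10_000)]
t_norm = timed(cur, "INSERT INTO events VALUES (?, ?)", rows)

# Denormalized: every row repeats the wider text value.
flat = [(i, "type-%d" % (i % 3)) for i in range(10_000)]
t_flat = timed(cur, "INSERT INTO events_flat VALUES (?, ?)", flat)

print(f"normalized inserts: {t_norm:.4f}s, denormalized inserts: {t_flat:.4f}s")
```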
#6
4
The general design approach for this issue is to first completely normalise your database to 3rd normal form, then denormalise as appropriate for performance and ease of access. This approach tends to be the safest, as you are making a specific decision by design rather than not normalising by default.
The 'as appropriate' is the tricky bit that takes experience. Normalising is a fairly 'by-rote' procedure that can be taught; knowing where to denormalise is less precise, depends upon the application usage and business rules, and will consequently differ from application to application. All your denormalisation decisions should be defensible to a fellow professional.
For example, if I have a one-to-many relationship from A to B, I would in most circumstances leave this normalised. But if I know that the business only ever has, say, two occurrences of B for each A, that this is highly unlikely to change, that there is limited data in the B record, and that the B data will usually be pulled back with the A record, then I would most likely extend the A record with two occurrences of the B fields. Of course most passing DBAs will then immediately flag this up as a possible design issue, so you must be able to convincingly argue your justification for denormalisation.
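Concretely, the normalised and denormalised versions of that A/B example might look like this (hypothetical names, SQLite syntax for runnability):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
-- Normalised: B rows live in their own table.
CREATE TABLE a (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE b (id INTEGER PRIMARY KEY, a_id INTEGER REFERENCES a(id), value TEXT);

-- Denormalised alternative: exactly two B occurrences folded into A.
-- Defensible only if "two B's per A" is a stable business rule.
CREATE TABLE a_denorm (
    id INTEGER PRIMARY KEY,
    name TEXT,
    b1_value TEXT,
    b2_value TEXT
);
INSERT INTO a_denorm VALUES (1, 'first', 'x', 'y');
""")

# Reading A with its B data is now a single-table query, no join.
print(cur.execute("SELECT name, b1_value, b2_value FROM a_denorm").fetchall())
# → [('first', 'x', 'y')]
```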
It should be apparent from this that denormalisation should be the exception. In any production database I would expect the vast majority of it - 95% plus - to be in 3rd normal form, with just a handful of denormalised structures.
#7
4
Where did you get the idea that "joins (and foreign key constraints, etc.) are very slow"? It's a very vague statement, and usually, IMO, there are no performance problems.
#8
4
Denormalisation is only rarely needed on an operational system. One system I did the data model for had 560 tables or thereabouts (at the time it was the largest J2EE system built in Australasia) and had just 4 pieces of denormalised data. Two of the items were denormalised search tables designed to facilitate complex search screens (one was a materialised view) and the other two were added in response to specific performance requirements.
Don't prematurely optimise a database with denormalised data. That's a recipe for ongoing data integrity problems. Also, always use database triggers to manage the denormalised data - don't rely on the application to do it.
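For instance, a trigger keeping a denormalised order total in sync might look like this (a sketch with invented tables; SQLite trigger syntax shown, MySQL's is similar):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL DEFAULT 0);
CREATE TABLE order_lines (order_id INTEGER, amount REAL);

-- The database, not the application, maintains the denormalised total.
CREATE TRIGGER order_lines_ai AFTER INSERT ON order_lines
BEGIN
    UPDATE orders SET total = total + NEW.amount WHERE id = NEW.order_id;
END;
""")
cur.execute("INSERT INTO orders (id) VALUES (1)")
cur.executemany("INSERT INTO order_lines VALUES (1, ?)", [(2.5,), (7.5,)])

print(cur.execute("SELECT total FROM orders WHERE id = 1").fetchone()[0])
# → 10.0
```

In production you would also need UPDATE and DELETE triggers on `order_lines` so the total stays correct under every modification path.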
Finally, if you need to improve reporting performance, consider building a data mart or other separate denormalised structure for reporting. Reports that combine requirements of a real-time view of aggregates calculated over large volumes of data are rare and tend to only occur in a handful of lines of business. Systems that can do this tend to be quite fiddly to build and therefore expensive.
You will almost certainly only have a small number of reports that genuinely need up-to-the-minute data, and they will almost always be operational reports like to-do lists or exception reports that work on small amounts of data. Anything else can be pushed to the data mart, for which a nightly refresh is probably sufficient.
#9
2
I don't know what you mean about creating a database by-the-book because most books I've read about databases include a topic about optimization which is the same thing as denormalizing the database design.
It's a balancing act, so don't optimize prematurely. The reason is that a denormalized database design tends to become difficult to work with. You'll need some metrics, so do some stress-testing on the database in order to decide whether or not you want to denormalize.
So normalize for maintainability but denormalize for optimization.