Is "normalize for correctness, denormalize for performance" right when it comes to databases?

Time: 2021-08-07 04:15:44

Normalization leads to many essential and desirable characteristics, including aesthetic pleasure. Besides, it is also theoretically "correct". In this context, denormalization is applied as a compromise, a correction made to achieve performance. Is there any reason other than performance for which a database might be denormalized?

14 Answers

#1


78  

The two most common reasons to denormalize are:

  1. Performance
  2. Ignorance

The former should be verified with profiling, while the latter should be corrected with a rolled-up newspaper ;-)

I would say a better mantra would be "normalize for correctness, denormalize for speed - and only when necessary".

#2


32  

To fully understand the import of the original question, you have to understand something about team dynamics in systems development, and the kind of behavior (or misbehavior) different roles / kinds of people are predisposed to. Normalization is important because it isn't just a dispassionate debate of design patterns -- it also has a lot to do with how systems are designed and managed over time.

Database people are trained that data integrity is a paramount issue. We like to think in terms of 100% correctness of data, so that once data is in the DB, you don't have to think about or deal with it ever being logically wrong. This school of thought places a high value on normalization, because it causes (forces) a team to come to grips with the underlying logic of the data & system. To consider a trivial example -- does a customer have just one name & address, or could he have several? Someone needs to decide, and the system will come to depend on that rule being applied consistently. That sounds like a simple issue, but multiply that issue by 500x as you design a reasonably complex system and you will see the problem -- rules can't just exist on paper, they have to be enforced. A well-normalized database design (with the additional help of uniqueness constraints, foreign keys, check values, logic-enforcing triggers etc.) can help you have a well-defined core data model and data-correctness rules, which is really important if you want the system to work as expected when many people work on different parts of the system (different apps, reports, whatever) and different people work on the system over time. Or to put it another way -- if you don't have some way to define and operationally enforce a solid core data model, your system will suck.

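
The constraint toolkit mentioned above can be sketched concretely. Using SQLite via Python as a stand-in DBMS (the `customer` / `customer_address` tables are hypothetical), the point is that the "one customer, many addresses" decision lives in the schema and is enforced by the engine, not by convention:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked

# Core model: the "can a customer have several addresses?" decision,
# made explicit and enforced by the schema rather than by convention.
conn.executescript("""
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE              -- uniqueness constraint
);
CREATE TABLE customer_address (
    address_id  INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL
                REFERENCES customer(customer_id),       -- foreign key
    kind        TEXT NOT NULL
                CHECK (kind IN ('billing', 'shipping')) -- check value
);
""")

conn.execute("INSERT INTO customer VALUES (1, 'Acme Corp')")
conn.execute("INSERT INTO customer_address VALUES (10, 1, 'billing')")

# An address for a nonexistent customer is rejected by the engine itself:
try:
    conn.execute("INSERT INTO customer_address VALUES (11, 99, 'shipping')")
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Any app, report, or ad-hoc script hitting this database gets the same rule applied, which is exactly the "operationally enforce" point above.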
Other people (often, less experienced developers) don't see it this way. They see the database as at best a tool that's enslaved to the application they're developing, or at worst a bureaucracy to be avoided. (Note that I'm saying "less experienced" developers. A good developer will have the same awareness of the need for a solid data model and data correctness as a database person. They might differ on what's the best way to achieve that, but in my experience are reasonably open to doing those things in a DB layer as long as the DB team knows what they're doing and can be responsive to the developers). These less experienced folks are often the ones who push for denormalization, as more or less an excuse for doing a quick & dirty job of designing and managing the data model. This is how you end up getting database tables that are 1:1 with application screens and reports, each reflecting a different developer's design assumptions, and a complete lack of sanity / coherence between the tables. I've experienced this several times in my career. It is a disheartening and deeply unproductive way to develop a system.

So one reason people have a strong feeling about normalization is that the issue is a stand-in for other issues they feel strongly about. If you are sucked into a debate about normalization, think about the underlying (non-technical) motivation that the parties may be bringing to the debate.

Having said that, here's a more direct answer to the original question :)

It is useful to think of your database as consisting of a core design that is as close as possible to a logical design -- highly normalized and constrained -- and an extended design that addresses other issues like stable application interfaces and performance.

You should want to constrain and normalize your core data model, because to not do that compromises the fundamental integrity of the data and all the rules / assumptions your system is being built upon. If you let those issues get away from you, your system can get crappy pretty fast. Test your core data model against requirements and real-world data, and iterate like mad until it works. This step will feel a lot more like clarifying requirements than building a solution, and it should. Use the core data model as a forcing function to get clear answers on these design issues for everyone involved.

Complete your core data model before moving on to the extended data model. Use it and see how far you can get with it. Depending on the amount of data, number of users and patterns of use, you may never need an extended data model. See how far you can get with indexing plus the 1,001 performance-related knobs you can turn in your DBMS.

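
As a sketch of the most basic of those knobs: check the query plan before and after adding an index (SQLite with a hypothetical `orders` table), and only if this class of tuning runs out should denormalization even come up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY,"
             " customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(i, i % 100, float(i)) for i in range(1000)])

def plan(sql):
    # EXPLAIN QUERY PLAN reports whether SQLite scans the table or uses an index
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT total FROM orders WHERE customer_id = 42"
print("before:", plan(query))   # typically a full table scan

conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")
print("after: ", plan(query))   # now a search using idx_orders_customer
```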
If you truly tap out the performance-management capabilities of your DBMS, then you need to look at extending your data model in a way that adds denormalization. Note this is not about denormalizing your core data model, but rather adding new resources that handle the denorm data. For example, if there are a few huge queries that crush your performance, you might want to add a few tables that precompute the data those queries would produce -- essentially pre-executing the query. It is important to do this in a way that maintains the coherence of the denormalized data with the core (normalized) data. For example, in DBMS's that support them, you can use a MATERIALIZED VIEW to make the maintenance of the denorm data automatic. If your DBMS doesn't have this option, then maybe you can do it by creating triggers on the tables where the underlying data exists.

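
A minimal sketch of the trigger approach, using SQLite (which has no MATERIALIZED VIEW) and hypothetical `order_line` / `order_total` tables; the trigger keeps the precomputed totals coherent with the core data automatically:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_line (order_id INTEGER, amount REAL);

-- Denormalized "extended" table: precomputed per-order totals.
CREATE TABLE order_total (order_id INTEGER PRIMARY KEY, total REAL NOT NULL);

-- The trigger maintains the denorm data whenever core data changes.
CREATE TRIGGER order_line_ins AFTER INSERT ON order_line
BEGIN
    INSERT OR IGNORE INTO order_total (order_id, total)
        VALUES (NEW.order_id, 0);
    UPDATE order_total SET total = total + NEW.amount
        WHERE order_id = NEW.order_id;
END;
""")

conn.execute("INSERT INTO order_line VALUES (1, 10.0)")
conn.execute("INSERT INTO order_line VALUES (1, 5.0)")
print(conn.execute("SELECT total FROM order_total WHERE order_id = 1")
          .fetchone()[0])
```

A real system would need matching UPDATE and DELETE triggers too; the point of the sketch is only that the coherence logic lives in one place, next to the data.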
There is a world of difference between selectively denormalizing a database in a coherent manner to deal with a realistic performance challenge vs. just having a weak data design and using performance as a justification for it.

When I work with low-to-medium experienced database people and developers, I insist that they produce an absolutely normalized design ... and only later involve a small number of more experienced people in a discussion of selective denormalization. Denormalization is more or less always bad in your core data model. Outside the core, there is nothing at all wrong with denormalization if you do it in a considered and coherent way.

In other words, denormalizing from a normal design to one that preserves the normal while adding some denormal -- that deals with the physical reality of your data while preserving its essential logic -- is fine. Designs that don't have a core of normal design -- that shouldn't even be called de-normalized, because they were never normalized in the first place, because they were never consciously designed in a disciplined way -- are not fine.

Don't accept the terminology that a weak, undisciplined design is a "denormalized" design. I believe the confusion between intentionally / carefully denormalized data vs. plain old crappy database design that results in denormal data because the designer was a careless idiot is the root cause of many of the debates about denormalization.

#3


14  

Denormalization normally means some improvement in retrieval efficiency (otherwise, why do it at all), but at a huge cost in complexity of validating the data during modify (insert, update, sometimes even delete) operations. Most often, the extra complexity is ignored (because it is too damned hard to describe), leading to bogus data in the database, which is often not detected until later - such as when someone is trying to work out why the company went bankrupt and it turns out that the data was self-inconsistent because it was denormalized.

I think the mantra should go "normalize for correctness, denormalize only when senior management offers to give your job to someone else", at which point you should accept the opportunity to go to pastures new since the current job may not survive as long as you'd like.

Or "denormalize only when management sends you an email that exonerates you for the mess that will be created".

Of course, this assumes that you are confident of your abilities and value to the company.

#4


11  

Mantras almost always oversimplify their subject matter. This is a case in point.

The advantages of normalizing are more than merely theoretical or aesthetic. For every departure from a normal form, for 2NF and beyond, there is an update anomaly that occurs when you don't follow the normal form and that goes away when you do follow it. Departure from 1NF is a whole different can of worms, and I'm not going to deal with it here.

These update anomalies generally fall into three groups: inserting new data, updating existing data, and deleting rows. You can generally work your way around these anomalies with clever, tricky programming. The question then is whether the benefit of that clever, tricky programming was worth the cost. Sometimes the cost is bugs. Sometimes the cost is loss of adaptability. Sometimes the cost is, believe it or not, bad performance.
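
A concrete sketch of such an update anomaly, with a hypothetical `emp_dept` table that violates 2NF because `dept_location` depends only on the department, not on the whole (employee, department) key:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 2NF violation: dept_location is repeated once per employee.
conn.execute("CREATE TABLE emp_dept (emp TEXT, dept TEXT, dept_location TEXT)")
conn.executemany("INSERT INTO emp_dept VALUES (?, ?, ?)", [
    ("alice", "sales", "Boston"),
    ("bob",   "sales", "Boston"),
])

# The "clever, tricky programming" workaround is remembering to update every
# copy. Update only one row and the stored facts contradict each other:
conn.execute("UPDATE emp_dept SET dept_location = 'Chicago'"
             " WHERE emp = 'alice'")
locations = {row[0] for row in conn.execute(
    "SELECT DISTINCT dept_location FROM emp_dept WHERE dept = 'sales'")}
print(locations)  # the same department now appears to have two locations
```

In a normalized design the location would live once in a `dept` table, and this class of bug could not exist.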

If you learn the various normal forms, you should consider your learning incomplete until you understand the accompanying update anomaly.

The problem with "denormalize" as a guideline is that it doesn't tell you what to do. There are myriad ways to denormalize a database. Most of them are unfortunate, and that's putting it charitably. One of the dumbest ways is to simply denormalize one step at a time, every time you want to speed up some particular query. You end up with a crazy mish mosh that cannot be understood without knowing the history of the application.

A lot of denormalizing steps that "seemed like a good idea at the time" turn out later to be very bad moves.

Here's a better alternative, when you decide not to fully normalize: adopt some design discipline that yields certain benefits, even when that design discipline departs from full normalization. As an example, there is star schema design, widely used in data warehousing and data marts. This is a far more coherent and disciplined approach than merely denormalizing by whimsy. There are specific benefits you'll get out of a star schema design, and you can contrast them with the update anomalies you will suffer because star schema design contradicts normalized design.

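
A minimal star schema sketch (hypothetical fact and dimension names), showing the disciplined shape the paragraph describes and the typical join-and-aggregate query it is built for:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive attributes, deliberately denormalized.
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY,
                          year INT, month INT, day INT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY,
                          name TEXT, category TEXT);

-- Fact table: one row per sale at the chosen granularity,
-- foreign keys out to each dimension.
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    quantity    INTEGER,
    revenue     REAL
);
""")
conn.execute("INSERT INTO dim_date VALUES (20210807, 2021, 8, 7)")
conn.execute("INSERT INTO dim_product VALUES (1, 'widget', 'hardware')")
conn.execute("INSERT INTO fact_sales VALUES (20210807, 1, 3, 29.97)")

# The typical query shape: join facts to dimensions, group by attributes.
rows = conn.execute("""
    SELECT d.year, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d    ON f.date_key = d.date_key
    JOIN dim_product p ON f.product_key = p.product_key
    GROUP BY d.year, p.category
""").fetchall()
print(rows)
```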
In general, many people who design star schemas are building a secondary database, one that does not interact with the OLTP application programs. One of the hardest problems in keeping such a database current is the so called ETL (Extract, Transform, and Load) processing. The good news is that all this processing can be collected in a handful of programs, and the application programmers who deal with the normalized OLTP database don't have to learn this stuff. There are tools out there to help with ETL, and copying data from a normalized OLTP database to a star schema data mart or warehouse is a well understood case.

Once you have built a star schema, and if you have chosen your dimensions well, named your columns wisely, and especially chosen your granularity well, using this star schema with OLAP tools like Cognos or Business Objects turns out to be almost as easy as playing a video game. This permits your data analysts to focus on analysing the data instead of learning how the container of the data works.

There are other designs besides star schema that depart from normalization, but star schema is worth a special mention.

#5


6  

Data warehouses in a dimensional model are often modelled in a (denormalized) star schema. These kinds of schemas are not (normally) used for online production or transactional systems.

The underlying reason is performance, but the fact/dimensional model also allows for a number of temporal features like slowly changing dimensions which are doable in traditional ER-style models, but can be incredibly complex and slow (effective dates, archive tables, active records, etc).

#6


5  

Don't forget that each time you denormalize part of your database, your capacity to further adapt it decreases, as risks of bugs in code increases, making the whole system less and less sustainable.

Good luck!

#7


4  

Database normalization isn't just for theoretical correctness, it can help to prevent data corruption. I certainly would NOT denormalize for "simplicity" as @aSkywalker suggests. Fixing and cleaning corrupted data is anything but simple.

#8


4  

Normalization has nothing to do with performance. I can't really put it better than Erwin Smout did in this thread: What is the resource impact from normalizing a database?

Most SQL DBMSs have limited support for changing the physical representation of data without also compromising the logical model, so unfortunately that's one reason why you may find it necessary to denormalize. Another is that many DBMSs don't have good support for multi-table integrity constraints, so as a workaround to implement those constraints you may be forced to put extraneous attributes into some tables.
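
A sketch of that workaround, with hypothetical `orders` / `shipment` tables: the child table carries an extraneous `customer_id` so that a composite foreign key can enforce a rule that spans two tables ("a shipment's customer must match its order's customer"), which a plain single-column FK cannot express:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    UNIQUE (order_id, customer_id)   -- exists only to be an FK target
);
-- Workaround: shipment repeats customer_id (the "extraneous attribute")
-- purely so the composite FK below can enforce the cross-table rule.
CREATE TABLE shipment (
    shipment_id INTEGER PRIMARY KEY,
    order_id    INTEGER NOT NULL,
    customer_id INTEGER NOT NULL,
    FOREIGN KEY (order_id, customer_id)
        REFERENCES orders(order_id, customer_id)
);
""")
conn.execute("INSERT INTO orders VALUES (1, 7)")
conn.execute("INSERT INTO shipment VALUES (100, 1, 7)")      # matches: accepted
try:
    conn.execute("INSERT INTO shipment VALUES (101, 1, 8)")  # mismatch
except sqlite3.IntegrityError as e:
    print("rejected:", e)
```

Note the redundancy this introduces is exactly the kind of deliberate, constraint-driven departure from normalization the answer is describing.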

#9


3  

You don't normalize for 'correctness' per se. Here is the thing:

Denormalized tables have the benefit of increased performance but require redundancy and more developer brain power.

Normalized tables have the benefit of reduced redundancy and greater ease of development, but cost performance.

It's almost like a classic balanced equation. So depending on your needs (such as how many users are hammering your database server), you should stick with normalized tables unless denormalization is really needed. It is, however, easier and less costly for development to go from normalized to denormalized than vice versa.

#10


1  

No way. Keep in mind that what you're supposed to be normalizing is your relations (logical level), not your tables (physical level).

#11


1  

Denormalized data is much more often found at places where not enough normalization was done.

My mantra is 'normalize for correctness, eliminate for performance'. RDBMSs are very flexible tools, but optimized for the OLTP situation. Replacing the RDBMS with something simpler (e.g. objects in memory with a transaction log) can help a lot.

#12


1  

I take issue with the assertion by folks here that normalized databases are always associated with simpler, cleaner, more robust code. It is certainly true that there are many cases where a fully normalized design is associated with simpler code than a partially denormalized one, but at best this is a guideline, not a law of physics.

Someone once defined a word as the skin of a living idea. In CS, you could say an object or table is defined in terms of the needs of the problem and the existing infrastructure, instead of being a platonic reflection of an ideal object. In theory, there will be no difference between theory and practice, but in practice, you do find variations from theory. This saying is particularly interesting for CS because one of the focuses of the field is to find these differences and to handle them in the best way possible.

Taking a break from the DB side of things and looking at the coding side, object-oriented programming has saved us from a lot of the evils of spaghetti coding by grouping closely related code together under an object-class name that has an English meaning that is easy to remember and that somehow fits with all of the code it is associated with. If too much information is clustered together, then you end up with large amounts of complexity within each object, and it is reminiscent of spaghetti code. If you make the clusters too small, then you can't follow threads of logic without searching through large numbers of objects with very little information in each object, which has been referred to as "macaroni code".

If you look at the trade-off between the ideal object size on the programming side of things and the object size that results from normalizing your database, I will give a nod to those who would say it is often better to choose based on the database and then work around that choice in code. Especially because you have the ability in some cases to create objects from joins with Hibernate and technologies like that. However, I would stop far short of saying this is an absolute rule. Any OR-mapping layer is written with the idea of making the most complex cases simpler, possibly at the expense of adding complexity to the most simple cases. And remember that complexity is not measured in units of size, but rather in units of complexity. There are all sorts of different systems out there. Some are expected to grow to a few thousand lines of code and stay there forever. Others are meant to be the central portal to a company's data and could theoretically grow in any direction without constraint. Some applications manage data that is read millions of times for every update. Others manage data that is only read for audit and ad-hoc purposes. In general the rules are:

  • Normalization is almost always a good idea in medium-sized apps or larger when data on both sides of the split can be modified and the potential modifications are independent of each other.

  • Updating or selecting from a single table is generally simpler than working with multiple tables; however, with a well-written OR mapper, this difference can be minimized for a large part of the data model space. Working with straight SQL, this is almost trivial to work around for an individual use case, albeit in a non-object-oriented way.

  • Code needs to be kept relatively small to be manageable, and one effective way to do this is to divide the data model and build a service-oriented architecture around the various pieces of the data model. The goal of an optimal state of data (de)normalization should be thought of within the paradigm of your overall complexity management strategy.

In complex object hierarchies there are complexities that you don't see on the database side, like the cascading of updates. If you model relational foreign keys and crosslinks with an object ownership relationship, then when updating the object, you have to decide whether to cascade the update or not. This can be more complex than it would be in SQL because of the difference between doing something once and doing something correctly always, sort of like the difference between loading a data file and writing a parser for that type of file. The code that cascades an update or delete in C++, Java, or whatever will need to make the decision correctly for a variety of different scenarios, and the consequences of mistakes in this logic can be pretty serious. It remains to be proven that this can never be simplified with a bit of flexibility on the SQL side, enough to make the SQL complexities worthwhile.
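
As a sketch of that "bit of flexibility on the SQL side": the cascade decision can be declared once in the schema instead of being re-implemented (and possibly mis-implemented) in every code path that deletes a parent (hypothetical tables, SQLite):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE parent (id INTEGER PRIMARY KEY);
-- The cascade decision is stated once, in the schema, for every caller.
CREATE TABLE child (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES parent(id) ON DELETE CASCADE
);
""")
conn.execute("INSERT INTO parent VALUES (1)")
conn.execute("INSERT INTO child VALUES (10, 1)")
conn.execute("INSERT INTO child VALUES (11, 1)")

conn.execute("DELETE FROM parent WHERE id = 1")
print(conn.execute("SELECT COUNT(*) FROM child").fetchone()[0])  # 0
```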

There is also a point deserving delineation with one of the normalization precepts. A central argument for normalization in databases is the idea that data duplication is always bad. This is frequently true, but cannot be followed slavishly, especially when there are different owners for the different pieces of a solution. I saw a situation once in which one group of developers managed a certain type of transactions, and another group of developers supported auditability of these transactions, so the second group wrote a service which scraped several tables whenever a transaction occurred and created a denormalized snapshot record stating, in effect, what the state of the system was at the time of the transaction. This scenario stands as an interesting use case (for the data duplication part of the question at least), but it is actually part of a larger category of issues. Data consistency requirements will often put certain constraints on the structure of data in the database that can make error handling and troubleshooting simpler by making some of the incorrect cases impossible. However, this can also have the impact of "freezing" portions of data, because changing that subset of the data would cause past transactions to become invalid under the consistency rules. Obviously some sort of versioning system is required to sort this out, so the obvious question is whether to use a normalized versioning system (effective and expiration times) or a snapshot-based approach (value as of transaction time). There are several internal structure questions for the normalized version that you don't have to worry about with the snapshot approach, like:

  • Can date range queries be done efficiently even for large tables?
  • Is it possible to guarantee non-overlap of date ranges?
  • Is it possible to trace status events back to operator, transaction, or reason for change? (probably yes, but this is additional overhead)
  • By creating a more complicated versioning system, are you putting the right owners in charge of the right data?
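
As a sketch of the first two questions (hypothetical `price_version` table): non-overlap of effective/expiration ranges usually cannot be declared as a simple column constraint in most SQL DBMSs, so it has to be checked by a query (or an exclusion constraint where the DBMS supports one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE price_version (
    item TEXT, price REAL, effective TEXT, expires TEXT)""")
conn.executemany("INSERT INTO price_version VALUES (?, ?, ?, ?)", [
    ("widget", 9.99,  "2021-01-01", "2021-06-30"),
    ("widget", 10.99, "2021-06-01", "2021-12-31"),  # overlaps the first row
])

# Two versions of the same item overlap when each starts before the other
# ends. This check must run as a query; it is not a declarative constraint.
overlaps = conn.execute("""
    SELECT a.rowid, b.rowid
    FROM price_version a JOIN price_version b
      ON a.item = b.item AND a.rowid < b.rowid
     AND a.effective < b.expires AND b.effective < a.expires
""").fetchall()
print(overlaps)  # non-empty: the two widget versions overlap
```

The snapshot approach sidesteps this entirely, which is the trade-off the bullet list is weighing.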

I think the optimal goal here is to learn not only what is correct in theory, but why it is correct, and what are the consequences of violations, then when you are in the real world, you can decide which consequences are worth taking to gain which other benefits. That is the real challenge of design.

#13


0  

Reporting system and transaction system have different requirements.

For a transaction system, I would recommend always using normalization for data correctness.

For a reporting system, use normalization unless denormalization is required for some reason, such as ease of ad-hoc queries, performance, etc.

#14


-2  

Simplicity? Not sure if Steven is gonna swat me with his newspaper, but where I hang, sometimes the denormalized tables help the reporting/readonly guys get their jobs done without bugging the database/developers all the time...

简洁?我不知道Steven是否会用他的报纸打我,但是在我挂的地方,有时候那些非规范化的表格可以帮助报告/只读的人完成他们的工作,而不会一直干扰数据库/开发人员……

#1


78  

The two most common reasons to denormalize are:

反规范化最常见的两个原因是:

  1. Performance
  2. 性能
  3. Ignorance
  4. 无知

The former should be verified with profiling, while the latter should be corrected with a rolled-up newspaper ;-)

前者应通过剖析加以验证,而后者应通过卷起的报纸予以纠正;-)

I would say a better mantra would be "normalize for correctness, denormalize for speed - and only when necessary"

我认为更好的口头禅应该是“规范正确性,规范速度——只有在必要的时候”

#2


32  

To fully understand the import of the original question, you have to understand something about team dynamics in systems development, and the kind of behavior (or misbehavior) different roles / kinds of people are predisposed to. Normalization is important because it isn't just a dispassionate debate of design patterns -- it also has a lot to do with how systems are designed and managed over time.

要充分理解原始问题的含义,您必须了解系统开发中的团队动态,以及不同角色/类型的人的行为(或不当行为)。规范化很重要,因为它不仅是对设计模式的冷静辩论——它还与随着时间的推移如何设计和管理系统有很大关系。

Database people are trained that data integrity is a paramount issue. We like to think in terms of 100% correctness of data, so that once data is in the DB, you don't have to think about or deal with it ever being logically wrong. This school of thought places a high value on normalization, because it causes (forces) a team to come to grips with the underlying logic of the data & system. To consider a trivial example -- does a customer have just one name & address, or could he have several? Someone needs to decide, and the system will come to depend on that rule being applied consistently. That sounds like a simple issue, but multiply that issue by 500x as you design a reasonably complex system and you will see the problem -- rules can't just exist on paper, they have to be enforced. A well-normalized database design (with the additional help of uniqueness constraints, foreign keys, check values, logic-enforcing triggers etc.) can help you have a well-defined core data model and data-correctness rules, which is really important if you want the system to work as expected when many people work on different parts of the system (different apps, reports, whatever) and different people work on the system over time. Or to put it another way -- if you don't have some way to define and operationally enforce a solid core data model, your system will suck.

数据库人员被训练成数据完整性是最重要的问题。我们喜欢用100%的数据正确性来思考,所以一旦数据出现在数据库中,你就不必考虑或处理逻辑上的错误。这一学派非常重视规范化,因为规范化会迫使团队掌握数据和系统的基本逻辑。考虑一个简单的例子——一个客户只有一个名字和地址,或者他可以有几个?有些人需要做出决定,系统将逐渐依赖于该规则被持续应用。这听起来是一个简单的问题,但是当你设计一个相当复杂的系统时,你会发现问题——规则不能仅仅存在于纸上,它们必须被执行。well-normalized数据库设计(额外帮助的唯一性约束、外键,检查值,logic-enforcing触发器等)可以帮助你有一个明确的核心数据模型和data-correctness规则,这是很重要的如果你想让系统正常工作时,很多人工作在系统的不同部分(不同应用程序、报告等等)和不同的人工作在系统上。或者换句话说——如果您没有某种方法来定义和执行一个可靠的核心数据模型,那么您的系统将会很糟糕。

Other people (often, less experienced developers) don't see it this way. They see the database as at best a tool that's enslaved to the application they're developing, or at worst a bureaucracy to be avoided. (Note that I'm saying "less experienced" developers. A good developer will have the same awareness of the need for a solid data model and data correctness as a database person. They might differ on what's the best way to achieve that, but in my experience are reasonably open to doing those things in a DB layer as long as the DB team knows what they're doing and can be responsive to the developers). These less experienced folks are often the ones who push for denormalization, as more or less an excuse for doing a quick & dirty job of designing and managing the data model. This is how you end up getting database tables that are 1:1 with application screens and reports, each reflecting a different developer's design assumptions, and a complete lack of sanity / coherence between the tables. I've experienced this several times in my career. It is a disheartening and deeply unproductive way to develop a system.


So one reason people have a strong feeling about normalization is that the issue is a stand-in for other issues they feel strongly about. If you are sucked into a debate about normalization, think about the underlying (non-technical) motivation that the parties may be bringing to the debate.


Having said that, here's a more direct answer to the original question :)


It is useful to think of your database as consisting of a core design that is as close as possible to a logical design -- highly normalized and constrained -- and an extended design that addresses other issues like stable application interfaces and performance.


You should want to constrain and normalize your core data model, because to not do that compromises the fundamental integrity of the data and all the rules / assumptions your system is being built upon. If you let those issues get away from you, your system can get crappy pretty fast. Test your core data model against requirements and real-world data, and iterate like mad until it works. This step will feel a lot more like clarifying requirements than building a solution, and it should. Use the core data model as a forcing function to get clear answers on these design issues for everyone involved.


Complete your core data model before moving on to the extended data model. Use it and see how far you can get with it. Depending on the amount of data, number of users and patterns of use, you may never need an extended data model. See how far you can get with indexing plus the 1,001 performance-related knobs you can turn in your DBMS.

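As a sketch of how far indexing alone can go (Python with the built-in sqlite3 module; the orders table is made up), EXPLAIN QUERY PLAN shows SQLite switching from a full table scan to an index search once a suitable index exists, with no change to the query or the data model:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)")

# Without an index on customer_id, the planner can only scan the table:
before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 7").fetchall()
print(before)   # the detail column reads roughly "SCAN orders"

conn.execute("CREATE INDEX idx_orders_customer ON orders(customer_id)")

# Same query, no application change: the plan becomes an index search.
after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 7").fetchall()
print(after)    # the detail column mentions idx_orders_customer
```

The exact plan text varies by DBMS and version, but the exercise is the same everywhere: measure, index, re-measure, before reaching for schema changes.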

If you truly tap out the performance-management capabilities of your DBMS, then you need to look at extending your data model in a way that adds denormalization. Note this is not about denormalizing your core data model, but rather adding new resources that handle the denorm data. For example, if there are a few huge queries that crush your performance, you might want to add a few tables that precompute the data those queries would produce -- essentially pre-executing the query. It is important to do this in a way that maintains the coherence of the denormalized data with the core (normalized) data. For example, in DBMS's that support them, you can use a MATERIALIZED VIEW to make the maintenance of the denorm data automatic. If your DBMS doesn't have this option, then maybe you can do it by creating triggers on the tables where the underlying data exists.

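Here is a rough sketch of the trigger approach in a DBMS without materialized views (SQLite is one), via Python's built-in sqlite3 module. The sale/summary tables are invented; the idea is that the denormalized total lives outside the core model but is kept coherent with it automatically:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sale (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL,
    amount     REAL    NOT NULL
);
-- Denormalized extension table: the pre-executed "total per product" query.
CREATE TABLE product_sales_total (
    product_id INTEGER PRIMARY KEY,
    total      REAL NOT NULL DEFAULT 0
);
-- Triggers on the underlying (normalized) table maintain the denorm data:
CREATE TRIGGER sale_ai AFTER INSERT ON sale
BEGIN
    UPDATE product_sales_total
    SET total = total + NEW.amount
    WHERE product_id = NEW.product_id;
END;
CREATE TRIGGER sale_ad AFTER DELETE ON sale
BEGIN
    UPDATE product_sales_total
    SET total = total - OLD.amount
    WHERE product_id = OLD.product_id;
END;
""")
conn.execute("INSERT INTO product_sales_total (product_id) VALUES (1)")
for amount in (10.0, 15.0, 5.0):
    conn.execute("INSERT INTO sale (product_id, amount) VALUES (1, ?)", (amount,))
conn.execute("DELETE FROM sale WHERE amount = 5.0")

print(conn.execute("SELECT total FROM product_sales_total").fetchone())  # (25.0,)
```

In a DBMS that supports it, a MATERIALIZED VIEW expresses the same intent with less hand-written maintenance code, which is why it is the first choice when available.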

There is a world of difference between selectively denormalizing a database in a coherent manner to deal with a realistic performance challenge vs. just having a weak data design and using performance as a justification for it.


When I work with low-to-medium experienced database people and developers, I insist they produce an absolutely normalized design ... then later may involve a small number of more experienced people in a discussion of selective denormalization. Denormalization is more or less always bad in your core data model. Outside the core, there is nothing at all wrong with denormalization if you do it in a considered and coherent way.


In other words, denormalizing from a normal design to one that preserves the normal while adding some denormal -- that deals with the physical reality of your data while preserving its essential logic -- is fine. Designs that don't have a core of normal design -- that shouldn't even be called de-normalized, because they were never normalized in the first place, because they were never consciously designed in a disciplined way -- are not fine.


Don't accept the terminology that a weak, undisciplined design is a "denormalized" design. I believe the confusion between intentionally / carefully denormalized data vs. plain old crappy database design that results in denormal data because the designer was a careless idiot is the root cause of many of the debates about denormalization.


#3


14  

Denormalization normally means some improvement in retrieval efficiency (otherwise, why do it at all), but at a huge cost in complexity of validating the data during modify (insert, update, sometimes even delete) operations. Most often, the extra complexity is ignored (because it is too damned hard to describe), leading to bogus data in the database, which is often not detected until later - such as when someone is trying to work out why the company went bankrupt and it turns out that the data was self-inconsistent because it was denormalized.

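The failure mode described above is easy to reproduce. A toy sketch in Python with the built-in sqlite3 module (the order/line-item schema is invented): a redundant stored total is updated by one code path but not another, and the database quietly becomes self-inconsistent:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (
    order_id     INTEGER PRIMARY KEY,
    stored_total REAL NOT NULL          -- redundant copy of SUM(order_line.price)
);
CREATE TABLE order_line (
    order_id INTEGER NOT NULL,
    price    REAL NOT NULL
);
""")
conn.execute("INSERT INTO orders VALUES (1, 30.0)")
conn.executemany("INSERT INTO order_line VALUES (1, ?)", [(10.0,), (20.0,)])

# One code path reprices a line item and forgets the redundant total:
conn.execute("UPDATE order_line SET price = 25.0 WHERE order_id = 1 AND price = 20.0")

stored = conn.execute("SELECT stored_total FROM orders WHERE order_id = 1").fetchone()[0]
actual = conn.execute("SELECT SUM(price) FROM order_line WHERE order_id = 1").fetchone()[0]
print(stored, actual)   # 30.0 vs 35.0 -- which one is the truth?
```

Nothing in the schema flags the mismatch; it surfaces only later, in a report or an audit, exactly as the answer warns.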

I think the mantra should go "normalize for correctness, denormalize only when senior management offers to give your job to someone else", at which point you should accept the opportunity to go to pastures new since the current job may not survive as long as you'd like.


Or "denormalize only when management sends you an email that exonerates you for the mess that will be created".


Of course, this assumes that you are confident of your abilities and value to the company.


#4


11  

Mantras almost always oversimplify their subject matter. This is a case in point.


The advantages of normalizing are more than merely theoretical or aesthetic. For every departure from 2NF and beyond, there is an update anomaly that occurs when you don't follow the normal form and that goes away when you do. Departure from 1NF is a whole different can of worms, and I'm not going to deal with it here.


These update anomalies generally fall into inserting new data, updating existing data, and deleting rows. You can usually work your way around these anomalies with clever, tricky programming. The question then is whether the benefit of that clever, tricky programming was worth the cost. Sometimes the cost is bugs. Sometimes the cost is loss of adaptability. Sometimes the cost is, believe it or not, actually bad performance.


If you learn the various normal forms, you should consider your learning incomplete until you understand the accompanying update anomaly.

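As a concrete illustration of such an anomaly, here is the textbook enrollment example sketched in Python with the built-in sqlite3 module (data made up): the instructor is a fact about the course alone, but because it is repeated on every enrollment row, an update can hit some copies and miss others:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE enrollment (
    student    TEXT NOT NULL,
    course     TEXT NOT NULL,
    instructor TEXT NOT NULL      -- depends on course, not on (student, course)
)""")
conn.executemany("INSERT INTO enrollment VALUES (?, ?, ?)", [
    ("alice", "db101", "Codd"),
    ("bob",   "db101", "Codd"),
])

# The "clever, tricky programming" misses a copy: only one row is updated.
conn.execute("""UPDATE enrollment SET instructor = 'Date'
                WHERE course = 'db101' AND student = 'alice'""")

# Now db101 appears to have two instructors at once:
print(conn.execute(
    "SELECT COUNT(DISTINCT instructor) FROM enrollment WHERE course = 'db101'"
).fetchone())
```

Normalizing the instructor into its own course table makes this state unrepresentable, which is the whole point of the normal form.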

The problem with "denormalize" as a guideline is that it doesn't tell you what to do. There are myriad ways to denormalize a database. Most of them are unfortunate, and that's putting it charitably. One of the dumbest ways is to simply denormalize one step at a time, every time you want to speed up some particular query. You end up with a crazy mish mosh that cannot be understood without knowing the history of the application.


A lot of denormalizing steps that "seemed like a good idea at the time" turn out later to be very bad moves.


Here's a better alternative, when you decide not to fully normalize: adopt some design discipline that yields certain benefits, even when that design discipline departs from full normalization. As an example, there is star schema design, widely used in data warehousing and data marts. This is a far more coherent and disciplined approach than merely denormalizing by whimsy. There are specific benefits you'll get out of a star schema design, and you can contrast them with the update anomalies you will suffer because star schema design contradicts normalized design.

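A minimal star-schema sketch (Python with the built-in sqlite3 module; table and column names invented): one fact table of measures, joined to denormalized dimension tables by surrogate keys:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
-- The fact table holds measures plus keys into the dimensions:
CREATE TABLE fact_sales (
    date_key    INTEGER REFERENCES dim_date(date_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    units       INTEGER,
    revenue     REAL
);
""")
conn.execute("INSERT INTO dim_date VALUES (20240101, 2024, 1)")
conn.execute("INSERT INTO dim_product VALUES (1, 'widget', 'hardware')")
conn.execute("INSERT INTO fact_sales VALUES (20240101, 1, 3, 29.97)")

# The typical OLAP query shape: aggregate facts, slice by dimension attributes.
rows = conn.execute("""
    SELECT d.year, p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY d.year, p.category
""").fetchall()
print(rows)
```

The dimensions are deliberately denormalized (category sits right on the product row), but the design is a discipline with known trade-offs, not denormalization by whimsy.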

In general, many people who design star schemas are building a secondary database, one that does not interact with the OLTP application programs. One of the hardest problems in keeping such a database current is the so called ETL (Extract, Transform, and Load) processing. The good news is that all this processing can be collected in a handful of programs, and the application programmers who deal with the normalized OLTP database don't have to learn this stuff. There are tools out there to help with ETL, and copying data from a normalized OLTP database to a star schema data mart or warehouse is a well understood case.


Once you have built a star schema, and if you have chosen your dimensions well, named your columns wisely, and especially chosen your granularity well, using this star schema with OLAP tools like Cognos or Business Objects turns out to be almost as easy as playing a video game. This permits your data analysts to focus on analysing the data instead of learning how the container of the data works.


There are other designs besides star schema that depart from normalization, but star schema is worth a special mention.


#5


6  

Data warehouses in a dimensional model are often modelled in a (denormalized) star schema. These kinds of schemas are not (normally) used for online production or transactional systems.


The underlying reason is performance, but the fact/dimensional model also allows for a number of temporal features like slowly changing dimensions which are doable in traditional ER-style models, but can be incredibly complex and slow (effective dates, archive tables, active records, etc).

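A rough sketch of one such temporal feature, a Type 2 slowly changing dimension, in Python with the built-in sqlite3 module (schema, keys, and dates invented): a change closes the old row and opens a new one, so facts can still join to the attribute values that were in effect at the time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,   -- surrogate key, one per version
    customer_id  INTEGER NOT NULL,      -- business key, stable across versions
    region       TEXT NOT NULL,
    valid_from   TEXT NOT NULL,
    valid_to     TEXT NOT NULL          -- '9999-12-31' marks the current row
)""")
# Customer 42 moved from EMEA to APAC mid-2023; both versions are kept:
conn.execute("INSERT INTO dim_customer VALUES (1, 42, 'EMEA', '2020-01-01', '2023-06-30')")
conn.execute("INSERT INTO dim_customer VALUES (2, 42, 'APAC', '2023-07-01', '9999-12-31')")

def region_as_of(conn, customer_id, day):
    # Point-in-time lookup: which version was in effect on `day`?
    return conn.execute("""
        SELECT region FROM dim_customer
        WHERE customer_id = ? AND valid_from <= ? AND ? <= valid_to
    """, (customer_id, day, day)).fetchone()[0]

print(region_as_of(conn, 42, "2022-05-01"), region_as_of(conn, 42, "2024-01-01"))
```

Doing the equivalent in a fully normalized OLTP model means effective dates, archive tables, and so on, which is exactly the complexity the dimensional approach packages up.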

#6


5  

Don't forget that each time you denormalize part of your database, your capacity to further adapt it decreases, as the risk of bugs in code increases, making the whole system less and less sustainable.


Good luck!


#7


4  

Database normalization isn't just for theoretical correctness, it can help to prevent data corruption. I certainly would NOT denormalize for "simplicity" as @aSkywalker suggests. Fixing and cleaning corrupted data is anything but simple.


#8


4  

Normalization has nothing to do with performance. I can't really put it better than Erwin Smout did in this thread: What is the resource impact from normalizing a database?


Most SQL DBMSs have limited support for changing the physical representation of data without also compromising the logical model, so unfortunately that's one reason why you may find it necessary to denormalize. Another is that many DBMSs don't have good support for multi-table integrity constraints, so as a workaround to implement those constraints you may be forced to put extraneous attributes into some tables.

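One common form of that workaround, sketched in Python with the built-in sqlite3 module (the order/warehouse rule is invented): a redundant column plus a composite foreign key against a UNIQUE pair lets the DBMS enforce a rule that spans two tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE orders (
    order_id     INTEGER PRIMARY KEY,
    warehouse_id INTEGER NOT NULL,
    UNIQUE (order_id, warehouse_id)          -- target for the composite FK
);
CREATE TABLE order_line (
    order_id     INTEGER NOT NULL,
    warehouse_id INTEGER NOT NULL,           -- the "extraneous" redundant column
    sku          TEXT NOT NULL,
    -- Rule enforced: a line's warehouse must match its order's warehouse.
    FOREIGN KEY (order_id, warehouse_id)
        REFERENCES orders (order_id, warehouse_id)
);
""")
conn.execute("INSERT INTO orders VALUES (1, 7)")
conn.execute("INSERT INTO order_line VALUES (1, 7, 'SKU-1')")      # matches: OK
try:
    conn.execute("INSERT INTO order_line VALUES (1, 8, 'SKU-2')")  # wrong warehouse
except sqlite3.IntegrityError as exc:
    print("rejected:", exc)
```

The `warehouse_id` on `order_line` is logically redundant, but it buys a declarative constraint the DBMS could not otherwise express.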

#9


3  

You don't normalize for 'correctness' per se. Here is the thing:


Denormalized tables have the benefit of increased performance but require redundancy and more developer brain power.


Normalized tables have the benefit of reducing redundancy and increasing ease of development, but at a cost in performance.


It's almost like a classic balanced equation. So depending on your needs (such as how many users are hammering your database server), you should stick with normalized tables unless denormalization is really needed. It is, however, easier and less costly for development to go from normalized to denormalized than vice versa.


#10


1  

No way. Keep in mind that what you're supposed to be normalizing is your relations (logical level), not your tables (physical level).


#11


1  

Denormalized data is much more often found at places where not enough normalization was done.


My mantra is 'normalize for correctness, eliminate for performance'. RDBMSs are very flexible tools, but optimized for the OLTP situation. Replacing the RDBMS with something simpler (e.g. objects in memory with a transaction log) can help a lot.


#12


1  

I take issue with the assertion by folks here that normalized databases are always associated with simpler, cleaner, more robust code. It is certainly true that there are many cases where fully normalized designs are associated with simpler code than partially denormalized designs, but at best this is a guideline, not a law of physics.


Someone once defined a word as the skin of a living idea. In CS, you could say an object or table is defined in terms of the needs of the problem and the existing infrastructure, instead of being a platonic reflection of an ideal object. In theory, there will be no difference between theory and practice, but in practice, you do find variations from theory. This saying is particularly interesting for CS because one of the focuses of the field is to find these differences and to handle them in the best way possible.


Taking a break from the DB side of things and looking at the coding side, object-oriented programming has saved us from a lot of the evils of spaghetti coding by grouping closely related code together under an object-class name that has an English meaning, is easy to remember, and somehow fits with all of the code it is associated with. If too much information is clustered together, then you end up with large amounts of complexity within each object, and it is reminiscent of spaghetti code. If you make the clusters too small, then you can't follow threads of logic without searching through large numbers of objects with very little information in each object, which has been referred to as "macaroni code".


If you look at the trade-off between the ideal object size on the programming side of things and the object size that results from normalizing your database, I will give a nod to those who would say it is often better to choose based on the database and then work around that choice in code, especially because in some cases you can create objects from joins with Hibernate and similar technologies. However, I would stop far short of saying this is an absolute rule. Any OR-mapping layer is written with the idea of making the most complex cases simpler, possibly at the expense of adding complexity to the most simple cases. And remember that complexity is not measured in units of size, but rather in units of complexity. There are all sorts of different systems out there. Some are expected to grow to a few thousand lines of code and stay there forever. Others are meant to be the central portal to a company's data and could theoretically grow in any direction without constraint. Some applications manage data that is read millions of times for every update. Others manage data that is only read for audit and ad-hoc purposes. In general the rules are:


  • Normalization is almost always a good idea in medium-sized apps or larger when data on both sides of the split can be modified and the potential modifications are independent of each other.


  • Updating or selecting from a single table is generally simpler than working with multiple tables, however with a well-written OR, this difference can be minimized for a large part of the data model space. Working with straight SQL, this is almost trivial to work around for an individual use case, albeit it in a non-object-oriented way.


  • Code needs to be kept relatively small to be manage-able and one effective way to do this is to divide the data model and build a service-oriented architecture around the various pieces of the data model. The goal of an optimal state of data (de)normalization should be thought of within the paradigm of your overall complexity management strategy.


In complex object hierarchies there are complexities that you don't see on the database side, like the cascading of updates. If you model relational foreign keys and crosslinks with an object-ownership relationship, then when updating the object you have to decide whether or not to cascade the update. This can be more complex than it would be in SQL because of the difference between doing something once and doing something correctly always, somewhat like the difference between loading a data file and writing a parser for that type of file. The code that cascades an update or delete in C++, Java, or whatever will need to make the decision correctly for a variety of different scenarios, and the consequences of mistakes in this logic can be pretty serious. It remains to be proven that this can never be simplified with a bit of flexibility on the SQL side, enough to make any SQL complexities worthwhile.

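For comparison, here is that decision pushed down to the SQL layer, sketched in Python with the built-in sqlite3 module (schema invented): a declarative ON DELETE CASCADE makes the cascade decision once, and the DBMS applies it consistently every time:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE parent (id INTEGER PRIMARY KEY);
CREATE TABLE child (
    id        INTEGER PRIMARY KEY,
    parent_id INTEGER NOT NULL
        REFERENCES parent(id) ON DELETE CASCADE   -- the cascade rule, declared once
);
""")
conn.execute("INSERT INTO parent VALUES (1)")
conn.executemany("INSERT INTO child VALUES (?, 1)", [(10,), (11,)])

# Deleting the parent removes its children too, with no hand-written cascade code:
conn.execute("DELETE FROM parent WHERE id = 1")
print(conn.execute("SELECT COUNT(*) FROM child").fetchone())
```

Whether this belongs in the database or the application layer is exactly the trade-off the paragraph above describes; the declarative form simply removes one class of "did every code path remember?" bugs.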

There is also a point deserving delineation with one of the normalization precepts. A central argument for normalization in databases is the idea that data duplication is always bad. This is frequently true, but cannot be followed slavishly, especially when there are different owners for the different pieces of a solution. I once saw a situation in which one group of developers managed a certain type of transaction, and another group supported auditability of those transactions, so the second group wrote a service that scraped several tables whenever a transaction occurred and created a denormalized snapshot record stating, in effect, what the state of the system was at the time of the transaction. This scenario stands as an interesting use case (for the data-duplication part of the question, at least), but it is actually part of a larger category of issues. Data consistency requirements will often put certain constraints on the structure of data in the database that can make error handling and troubleshooting simpler by making some of the incorrect cases impossible. However, this can also have the effect of "freezing" portions of the data, because changing that subset of the data would cause past transactions to become invalid under the consistency rules. Obviously some sort of versioning system is required to sort this out, so the obvious question is whether to use a normalized versioning system (effective and expiration times) or a snapshot-based approach (value as of transaction time). There are several internal-structure questions for the normalized version that you don't have to worry about with the snapshot approach, like:


  • Can date range queries be done efficiently even for large tables?
  • Is it possible to guarantee non-overlap of date ranges?
  • Is it possible to trace status events back to operator, transaction, or reason for change? (probably yes, but this is additional overhead)
  • By creating a more complicated versioning system, are you putting the right owners in charge of the right data?

I think the optimal goal here is to learn not only what is correct in theory, but why it is correct, and what are the consequences of violations, then when you are in the real world, you can decide which consequences are worth taking to gain which other benefits. That is the real challenge of design.


#13


0  

Reporting systems and transaction systems have different requirements.


For a transaction system, I would recommend always using normalization for data correctness.


For a reporting system, use normalization unless denormalization is required for some reason, such as ease of ad-hoc querying or performance.


#14


-2  

Simplicity? Not sure if Steven is gonna swat me with his newspaper, but where I hang, sometimes the denormalized tables help the reporting/read-only guys get their jobs done without bugging the database people/developers all the time...
