When creating a database structure, what are good guidelines to follow or good ways to determine how far a database should be normalized? Should you create an un-normalized database and split it apart as the project progresses? Should you create it fully normalized and combine tables as needed for performance?


You want to start designing a normalized database up to 3rd normal form. As you develop the business logic layer you may decide you have to denormalize a bit, but never, never go below the 3rd form. Always keep it 1st and 2nd form compliant. You want to denormalize for simplicity of code, not for performance. Use indexes and stored procedures for that :)


The reason not to "normalize as you go" is that you would have to modify code you have already written almost every time you modify the database design.


There are a couple of good articles:





@GrizzlyGuru A wise man once told me "normalize till it hurts, denormalize till it works".


It hasn't failed me yet :)


I disagree about starting with it in un-normalized form, however. In my experience it's been easier to adapt your application to deal with a less normalized database than a more normalized one. It could also lead to situations where it's working "well enough", so you never get around to normalizing it (until it's too late!).




Normalization means eliminating redundant data. In other words, an un-normalized or de-normalized database is a database where the same information is repeated in multiple places. This means you have to write more complex update statements to ensure you update the same data everywhere; otherwise you get inconsistent data, which in turn means the output of queries is unreliable.


This is a pretty huge problem, so I would say denormalization hurts, not the other way around.
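To make the update problem concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`orders_flat`, `customers`, `orders`) are made up for illustration and do not come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Denormalized: the customer's city is repeated on every order row.
conn.execute(
    "CREATE TABLE orders_flat (order_id INTEGER PRIMARY KEY, customer TEXT, city TEXT)"
)
conn.executemany(
    "INSERT INTO orders_flat VALUES (?, ?, ?)",
    [(1, "alice", "Oslo"), (2, "alice", "Oslo"), (3, "bob", "Rome")],
)

# An update that misses one copy leaves the data contradicting itself.
conn.execute("UPDATE orders_flat SET city = 'Bergen' WHERE order_id = 1")
flat_cities = conn.execute(
    "SELECT DISTINCT city FROM orders_flat WHERE customer = 'alice' ORDER BY city"
).fetchall()

# Normalized: the city is stored once, so a single UPDATE covers every order.
conn.execute("CREATE TABLE customers (name TEXT PRIMARY KEY, city TEXT)")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, "
    "customer TEXT REFERENCES customers(name))"
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [("alice", "Oslo"), ("bob", "Rome")])
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, "alice"), (2, "alice"), (3, "bob")])
conn.execute("UPDATE customers SET city = 'Bergen' WHERE name = 'alice'")
normalized_cities = conn.execute(
    "SELECT DISTINCT c.city FROM orders o JOIN customers c ON c.name = o.customer "
    "WHERE o.customer = 'alice'"
).fetchall()

print(flat_cities)        # two conflicting cities for alice
print(normalized_cities)  # a single city
```

In the flat table nothing stops the two rows for the same customer from disagreeing; in the normalized layout there is only one place the city can live.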


In some cases you may deliberately decide to denormalize specific parts of a database, if you judge that the benefit outweighs the extra work in updating data and the risk of data corruption. One example is data warehouses, where data is aggregated for performance reasons and is often not updated after the initial entry, which reduces the risk of inconsistencies.


But in general, be wary of denormalizing for performance. For example, the performance benefit of a denormalized join can typically be achieved by using a materialized view (called an indexed view in SQL Server), which will be as fast as querying a denormalized table but still protects the consistency of the data.
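As a rough illustration of the idea, the sketch below uses Python's `sqlite3` module. SQLite has neither materialized nor indexed views, so this is an emulation rather than the real feature: a summary table (`order_totals`, a hypothetical name) that the database keeps in sync with a trigger. It only handles inserts; updates and deletes would need their own triggers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE order_lines (
    line_id  INTEGER PRIMARY KEY,
    order_id INTEGER NOT NULL,
    amount   INTEGER NOT NULL
);
-- Stand-in for a materialized view: one precomputed total per order.
CREATE TABLE order_totals (
    order_id INTEGER PRIMARY KEY,
    total    INTEGER NOT NULL
);
-- The database, not the application, keeps the summary in sync on insert.
CREATE TRIGGER order_lines_after_insert AFTER INSERT ON order_lines
BEGIN
    INSERT OR IGNORE INTO order_totals (order_id, total) VALUES (NEW.order_id, 0);
    UPDATE order_totals SET total = total + NEW.amount WHERE order_id = NEW.order_id;
END;
""")

conn.executemany(
    "INSERT INTO order_lines (order_id, amount) VALUES (?, ?)",
    [(1, 10), (1, 15), (2, 7)],
)

# Reading the summary is a plain lookup; no aggregation at query time.
precomputed = conn.execute(
    "SELECT order_id, total FROM order_totals ORDER BY order_id"
).fetchall()
# The same numbers derived from the normalized source of truth.
recomputed = conn.execute(
    "SELECT order_id, SUM(amount) FROM order_lines GROUP BY order_id ORDER BY order_id"
).fetchall()

print(precomputed == recomputed)
```

On a database with real materialized or indexed views you would declare the view and let the engine do this bookkeeping for you.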




Jeff has a pretty good overview of his philosophy on his blog: Maybe normalization isn't normal. The main thing is: don't overdo normalization. But I think an even bigger point to take away is that it probably doesn't matter too much. Unless you're running the next Google, you probably won't notice much of a difference until your application grows.




Database normalization, I feel, is an art form.


You don't want to over-normalize your database, because you will have too many tables and it will cause queries for even simple objects to take longer than they should.


A good rule of thumb I follow is to normalize information that is repeated over and over again.


For example, if you are creating a contact management application, it would make sense to have Address (Street, City, State, Zip, etc.) as its own table.


However, if you have only 2 types of contacts, business or personal, do you need a contact type table if you know you are only going to have 2? For me, no.
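One way that trade-off might look, sketched with Python's `sqlite3` module: addresses get their own table, while the two contact types are a plain column guarded by a `CHECK` constraint instead of a lookup table. The schema and names are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default
conn.executescript("""
CREATE TABLE contacts (
    contact_id   INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    -- Only two known types, so no separate contact_type table.
    contact_type TEXT NOT NULL CHECK (contact_type IN ('business', 'personal'))
);
-- Addresses repeat per contact, so they live in their own table.
CREATE TABLE addresses (
    address_id INTEGER PRIMARY KEY,
    contact_id INTEGER NOT NULL REFERENCES contacts(contact_id),
    street TEXT, city TEXT, state TEXT, zip TEXT
);
""")

conn.execute("INSERT INTO contacts VALUES (1, 'Acme Ltd', 'business')")
conn.executemany(
    "INSERT INTO addresses (contact_id, street, city, state, zip) VALUES (?, ?, ?, ?, ?)",
    [(1, "1 Main St", "Springfield", "IL", "62701"),
     (1, "9 Dock Rd", "Portland", "OR", "97201")],
)

# A type outside the two allowed values is refused by the CHECK constraint.
try:
    conn.execute("INSERT INTO contacts VALUES (2, 'Bob', 'other')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

address_count = conn.execute(
    "SELECT COUNT(*) FROM addresses WHERE contact_id = 1"
).fetchone()[0]
print(rejected, address_count)
```

If a third contact type ever shows up, the constraint has to be changed in the schema, which is the price of skipping the lookup table.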


I would start by first figuring out the datatypes you need. Use a modeling program like Visio to help. You don't want to start with a non-normalized database, because you will eventually normalize. Start by putting objects in their logical groupings; as you see data repeated, take that data into a new table. I would keep up with that process until you feel you have the database designed.


Let testing tell you if you need to combine tables. A well-written query can cover any over-normalization.




I believe starting with an un-normalized database and moving toward normalized as you progress is usually the easiest way to get started. To the question of how far to normalize, my philosophy is to normalize until it starts to hurt. That may sound a little flippant, but it generally is a good way to gauge how far to take it.




Having a normalized database will give you the most flexibility and the easiest maintenance. I always start with a normalized database and then un-normalize only when there is a real-life problem that needs addressing.


I view this similarly to code performance i.e. write maintainable, flexible code and make compromises for performance when you know that there is a performance problem.




The original poster never described in what situation the database will be used. If it's going to be any type of data warehousing project where at some point you will need cubes (OLAP) processing data for some front-end, it would be wiser to start off with star schema (fact tables + dimension) rather than looking into normalization. The Kimball books will be of great help in this case.
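For readers who have not seen one, a star schema can be sketched roughly like this with Python's `sqlite3` module. The fact and dimension tables (`fact_sales`, `dim_product`, `dim_date`) and their columns are invented for the example and are not taken from the Kimball books.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: descriptive attributes, deliberately kept wide and flat.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
-- Fact table: one row per measured event, pointing at the dimensions.
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    date_key    INTEGER REFERENCES dim_date(date_key),
    quantity    INTEGER,
    revenue     INTEGER
);
INSERT INTO dim_product VALUES (1, 'widget', 'hardware'), (2, 'manual', 'books');
INSERT INTO dim_date VALUES (20240101, 2024, 1), (20240201, 2024, 2);
INSERT INTO fact_sales VALUES
    (1, 20240101, 3, 30),
    (1, 20240201, 1, 10),
    (2, 20240201, 2, 8);
""")

# Typical reporting query: aggregate the facts, slice by a dimension attribute.
revenue_by_category = conn.execute("""
    SELECT p.category, SUM(f.revenue)
    FROM fact_sales f
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category
    ORDER BY p.category
""").fetchall()
print(revenue_by_category)
```

Every report is one hop from the fact table to whichever dimensions it needs, which is what makes the layout convenient for OLAP-style front-ends.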




I agree that it is typically better to start out with a normalized DB and then denormalize to solve very specific problems, but I'd probably start at Boyce-Codd Normal Form instead of 3rd Normal Form.




The truth is that "it depends." It depends on a lot of factors including:


  • Code (Hand-coded or Tool driven (like ETL packages))

  • Primary Application (Transaction Processing, Data Warehousing, Reporting)

  • Type of Database (MySQL, DB2, Oracle, Netezza, etc.)

  • Database Architecture (Tabular, Columnar)

  • DBA Quality (proactive, reactive, inactive)

  • Expected Data Quality (do you want to enforce data quality at the application level or the database level?)



I agree that you should normalise as much as possible and only denormalise if absolutely necessary for performance. And with materialised views or caching schemes this is often not necessary.


The thing to bear in mind is that by normalising your model you are giving the database more information on how to constrain your data, so that you can remove the risk of update anomalies that can occur in incompletely normalised models.


If you denormalise, then you either need to live with the fact that you may get update anomalies or you need to implement the constraint validation yourself in your application code. This takes away a lot of the benefit of using a DBMS, which lets you define these constraints declaratively.
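A small sketch of what "declaratively" buys you, using Python's `sqlite3` module. The `departments` and `employees` tables and the `try_insert` helper are hypothetical; the point is only that the rules are written once in the schema and the database enforces them for every client.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves foreign keys unenforced by default
conn.execute(
    "CREATE TABLE departments (dept_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE)"
)
conn.execute(
    "CREATE TABLE employees (emp_id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
    "dept_id INTEGER NOT NULL REFERENCES departments(dept_id))"
)
conn.execute("INSERT INTO departments VALUES (1, 'Engineering')")
conn.execute("INSERT INTO employees VALUES (1, 'Ada', 1)")


def try_insert(sql):
    """Run one statement and report whether the database accepted it."""
    try:
        conn.execute(sql)
        return "accepted"
    except sqlite3.IntegrityError:
        return "rejected"


# No validation code in the application: the schema refuses bad rows.
orphan = try_insert("INSERT INTO employees VALUES (2, 'Bob', 99)")        # no such department
duplicate = try_insert("INSERT INTO departments VALUES (2, 'Engineering')")  # name already used
print(orphan, duplicate)
```

If the department name were instead copied onto every employee row, neither check could be expressed this simply and the application would have to police it.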


So assuming the same quality of code, denormalising may not actually give you better performance.


Another thing to mention is that hardware is cheap these days, so throwing extra processing power at the problem is often more cost-effective than accepting the potential costs of cleaning up corrupted data.




Often if you normalize as far as your other software will let you, you'll be done.


For example, when using Object-Relational mapping technology, you'll have a rich set of semantics for various many-to-one and many-to-many relationships. Under the hood that'll provide join tables with what is effectively a 2-column composite primary key. While relatively rare, true normalization can give you relations whose key spans 3 or more columns. In cases like this, I prefer to stick with the O/R and roll my own code to avoid the various DB anomalies.
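To show the two shapes being contrasted, here is a sketch in Python's `sqlite3` module: a typical many-to-many join table keyed on two columns, and a ternary relation keyed on three. The table names (`student_course`, `supplier_part_project`) are illustrative assumptions, not something a particular O/R mapper generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Many-to-many join table of the kind an O/R mapper typically manages:
-- the key is the pair of foreign keys.
CREATE TABLE student_course (
    student_id INTEGER NOT NULL,
    course_id  INTEGER NOT NULL,
    PRIMARY KEY (student_id, course_id)
);
-- Ternary relation: the fact only makes sense with all three columns together.
CREATE TABLE supplier_part_project (
    supplier_id INTEGER NOT NULL,
    part_id     INTEGER NOT NULL,
    project_id  INTEGER NOT NULL,
    PRIMARY KEY (supplier_id, part_id, project_id)
);
""")

conn.execute("INSERT INTO supplier_part_project VALUES (1, 10, 100)")
conn.execute("INSERT INTO supplier_part_project VALUES (1, 10, 200)")  # same pair, other project

# Repeating the full three-column key is refused by the database.
try:
    conn.execute("INSERT INTO supplier_part_project VALUES (1, 10, 100)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM supplier_part_project").fetchone()[0]
print(duplicate_rejected, row_count)
```

The second table is the kind of relation that mapping tools tend to handle less gracefully than the first.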




Just try to use common sense.


Also, some say (and I have to agree with them) that if you find yourself joining 6 (the magic number) tables together in most of your queries, not including reporting-related ones, then you might consider denormalizing a bit.




Don't forget The mother of all database normalization debates on Coding Horror (summarized on the High Scalability blog).
