SQL Server normalization strategy: varchar vs int Identity

I'm just wondering what the optimal solution is here.

Say I have a normalized database. The primary key of the whole system is a varchar. What I'm wondering is whether I should relate this varchar to an int for normalization, or leave it as it is. It's simpler to leave it as a varchar, but an int might be more optimal.

For instance I can have

People
======================
name      varchar(10)   
DoB       DateTime    
Height    int  

Phone_Number
======================
name      varchar(10)   
number    varchar(15)

Or I could have

People
======================
id        int Identity   
name      varchar(10)   
DoB       DateTime  
Height    int  

Phone_Number
======================
id        int   
number    varchar(15)

Add several other one-to-many relationships of course.
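
For illustration, the second layout could be sketched roughly as follows. This is only a minimal sketch and it makes several assumptions: it uses SQLite through Python's sqlite3 module as a stand-in for SQL Server, INTEGER PRIMARY KEY stands in for int Identity, DoB is stored as text, and the sample row is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked

conn.executescript("""
CREATE TABLE People (
    id     INTEGER PRIMARY KEY,   -- assumption: stand-in for int IDENTITY
    name   VARCHAR(10) NOT NULL,
    DoB    TEXT,                  -- assumption: SQLite has no DateTime type
    Height INTEGER
);
CREATE TABLE Phone_Number (
    id     INTEGER NOT NULL REFERENCES People(id),
    number VARCHAR(15) NOT NULL
);
""")

# The database generates the key; the application never invents it.
cur = conn.execute(
    "INSERT INTO People (name, DoB, Height) VALUES (?, ?, ?)",
    ("Bob", "1980-01-01", 180),
)
bob_id = cur.lastrowid

# One person, many phone numbers, all linked through the int key.
conn.executemany(
    "INSERT INTO Phone_Number (id, number) VALUES (?, ?)",
    [(bob_id, "555-0100"), (bob_id, "555-0101")],
)

rows = conn.execute("""
    SELECT p.name, n.number
    FROM People p
    JOIN Phone_Number n ON n.id = p.id
    ORDER BY n.number
""").fetchall()
print(rows)
```

Every additional one-to-many table would carry the same int column instead of a copy of the name.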

What do you all think? Which is better and why?

7 answers

#1

Can you really use names as primary keys? Isn't there a high risk of several people with the same name?

If you really are so lucky that your name attribute can be used as a primary key, then by all means use it. Often, though, you will have to make something up, such as a customer_id.

And finally: "NAME" is a reserved word in at least one DBMS, so consider using something else, e.g. fullname.

#2

I believe that the majority of people who have developed real-world database applications of any significant size will tell you that surrogate keys are the only realistic solution.
I know the academic community will disagree, but that is the difference between theoretical purity and practicality.

Any reasonably sized query that has to join tables that use non-surrogate keys, where some of the tables have composite primary keys, quickly becomes unmaintainable.

#3

Using any kind of non-synthetic data (i.e. anything from the user, as opposed to generated by the application) as a PK is problematic: you have to worry about culture/localization differences and case sensitivity (and other issues depending on DB collation), and it can cause data problems if and when that user-entered data ever changes.

Using non-user-generated data is much easier and much safer: sequential GUIDs (or non-sequential ones if your DB doesn't support them or you don't care about page splits), or identity ints if you don't need GUIDs.

Regarding duplicate data: I don't see how using non-synthetic keys protects you from that. You still have issues where the user enters "Bob Smith" instead of "Bob K. Smith" or "Smith, Bob" or "bob smith" etc. Duplication management is necessary (and pretty much identical) regardless of whether your key is synthetic or non-synthetic, and non-synthetic keys have a host of other potential issues that synthetic keys neatly avoid.
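
A small sketch of the collation point, assuming SQLite (via Python's sqlite3) as a stand-in for SQL Server and two hypothetical tables that differ only in the collation of their name key. The exact behavior in SQL Server depends on the collation configured there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical tables: same natural key, different collation.
CREATE TABLE People_binary (name TEXT PRIMARY KEY);                 -- case-sensitive
CREATE TABLE People_nocase (name TEXT COLLATE NOCASE PRIMARY KEY);  -- case-insensitive
""")

variants = ["Bob Smith", "bob smith", "Smith, Bob"]
accepted = {"People_binary": [], "People_nocase": []}

for table in accepted:
    for v in variants:
        try:
            conn.execute(f"INSERT INTO {table} (name) VALUES (?)", (v,))
            accepted[table].append(v)
        except sqlite3.IntegrityError:
            # The key constraint treated this spelling as a duplicate.
            pass

print(accepted)
```

Under either collation the key still accepts "Smith, Bob" as a different person, so the natural key does not remove the need for duplicate management.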

Many projects don't need to worry about these issues (tightly constrained collation choices avoid many of them, for example), but in general I prefer synthetic keys. This is not to say you can't be successful with organic (natural) keys; clearly you can. But for many projects they're not the better choice.

#4

I think if your VARCHAR were larger, you would notice that you're duplicating quite a bit of data throughout the database. If you went with a numeric ID column instead, you would duplicate far less data when adding foreign key columns to other tables.

Moreover, textual data is a royal pain when it comes to comparisons. Your life is much easier when you're doing WHERE id = user_id rather than WHERE name LIKE inputname (or something similar).

#5

If the "name" field really is appropriate as a primary key, then use it. The database will not become more normalized by creating a surrogate key in that case. You will get some duplicate strings for foreign keys, but that is not a normalization issue, since the FK constraint guarantees integrity on strings just as it would on surrogate keys.
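
To illustrate the integrity point, here is a rough sketch of the first layout with the varchar as both primary key and foreign key. It assumes SQLite via Python's sqlite3 in place of SQL Server, and the names and numbers are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked

conn.executescript("""
CREATE TABLE People (
    name   VARCHAR(10) PRIMARY KEY,
    DoB    TEXT,
    Height INTEGER
);
CREATE TABLE Phone_Number (
    name   VARCHAR(10) NOT NULL REFERENCES People(name),
    number VARCHAR(15) NOT NULL
);
""")

conn.execute("INSERT INTO People (name) VALUES ('Alice')")
conn.execute("INSERT INTO Phone_Number (name, number) VALUES ('Alice', '555-0100')")

# A phone number for a name that is not in People violates the FK.
try:
    conn.execute("INSERT INTO Phone_Number (name, number) VALUES ('Mallory', '555-0199')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

print(orphan_rejected)
```

The string is repeated in Phone_Number, but the constraint keeps it consistent with People, which is the sense in which the repetition is not a normalization problem.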

However, you have not explained what the "name" is. In practice, a string is very seldom appropriate as a primary key. If it is the name of a person, it won't work as a PK, since more than one person can have the same name, people can change their names, and so on.

#6

One thing that others don't seem to have mentioned is that joins on int fields tend to perform better than joins on varchar fields.

And I would definitely always use a surrogate key rather than names (of people or businesses), because names are never unique over time. In our database, for instance, we have 164 names that each occur more than 100 times. This clearly shows the danger of using a name as a key field.
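
A check like this can be run against existing data before trusting a name column as a key. The following is a minimal sketch with made-up rows, assuming SQLite via Python's sqlite3; the GROUP BY / HAVING query itself is standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE People (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")

# Invented sample data containing repeated names.
conn.executemany(
    "INSERT INTO People (name) VALUES (?)",
    [("John Smith",), ("Mary Jones",), ("John Smith",),
     ("John Smith",), ("Mary Jones",), ("Ann Lee",)],
)

# List every name that occurs more than once, most frequent first.
duplicates = conn.execute("""
    SELECT name, COUNT(*) AS instances
    FROM People
    GROUP BY name
    HAVING COUNT(*) > 1
    ORDER BY instances DESC, name
""").fetchall()
print(duplicates)
```

If the query returns any rows, the column cannot serve as a primary key without further disambiguation.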

#7

The original question is not one of normalization. If you have a normalized database, as you stated, then you do not need to change it for normalization reasons.

There are really two issues in your question. The first is whether ints or varchars are preferable for use as primary keys and foreign keys. The second is whether you can use the natural keys given in the problem definition, or whether you should generate a synthetic key (surrogate key) to take the place of the natural key.

ints are a little more concise than varchars, and a little more efficient for such things as index processing. But the difference is not overwhelming. You should probably not make your decision on this basis alone.

The question of whether the natural key provided really works as a natural key or not is much more significant. The problem of duplicates in a "name" column is not the only problem. There is also the problem of what happens when a person changes her name. This problem probably doesn't surface in the example you've given, but it does surface in lots of other database applications. An example would be the transcript over four years of all the courses taken by a student. A woman might get married and change her name in the course of four years, and now you're stuck.

You either have to leave the name unchanged, in which case it no longer agrees with the real world, or update it retroactively in all the courses the person took, which makes the database disagree with the printed rosters made at the time.
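
With a synthetic key the rename touches a single row, and the course records keep pointing at the same student. The sketch below shows only that mechanical part. It assumes SQLite via Python's sqlite3, and the Student and Enrollment tables and their contents are hypothetical. It does not address whether the old name should also be kept to match rosters printed at the time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces FKs when asked

conn.executescript("""
-- Hypothetical tables for the transcript example.
CREATE TABLE Student (
    id   INTEGER PRIMARY KEY,   -- synthetic key, never changes
    name TEXT NOT NULL          -- attribute, free to change
);
CREATE TABLE Enrollment (
    student_id INTEGER NOT NULL REFERENCES Student(id),
    course     TEXT NOT NULL
);
""")

sid = conn.execute("INSERT INTO Student (name) VALUES ('Jane Doe')").lastrowid
conn.executemany(
    "INSERT INTO Enrollment (student_id, course) VALUES (?, ?)",
    [(sid, "Math 101"), (sid, "Physics 201")],
)

# The name change is one UPDATE on one row; no Enrollment row is touched.
conn.execute("UPDATE Student SET name = 'Jane Roe' WHERE id = ?", (sid,))

transcript = conn.execute("""
    SELECT s.name, e.course
    FROM Student s
    JOIN Enrollment e ON e.student_id = s.id
    ORDER BY e.course
""").fetchall()
print(transcript)
```

Had the name been the key, the same change would have needed to reach every referencing row, for example through a cascading update.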

If you do decide on a synthetic key, you now have to decide whether or not the application is going to reveal the value of the synthetic key to the user community. That's another whole can of worms, and beyond the scope of this discussion.
