Many-to-many relationships: use an associative table or delimited values in a column?

Update 2009.04.24

The main point of my question is not developer confusion and what to do about it.

The point is to understand when delimited values are the right solution.

I've seen delimited data used in commercial product databases (Ektron lol).

SQL Server even has an XML datatype, so that could be used for the same purpose as delimited fields.

/end Update

The application I'm designing has some many-to-many relationships. In the past, I've often used associative tables to represent these in the database. This has caused some confusion to the developers.

Here's an example DB structure:

Document
---------------
ID (PK)
Title
CategoryIDs (varchar(4000))


Category
------------
ID (PK)
Title

There is a many-to-many relationship between Document and Category.

In this implementation, Document.CategoryIDs is a big pipe-delimited list of CategoryIDs.

To me, this is bad because it requires use of substring matching in queries -- which cannot make use of indexes. I think this will be slow and will not scale.

With that model, to get all Documents for a Category, you would need something like the following:

select * from Document where '|' + CategoryIDs + '|' like '%|' + cast(@targetCategoryId as varchar(10)) + '|%'

My solution is to create an associative table as follows:

Document_Category
-------------------------------
DocumentID (PK)
CategoryID (PK)
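
For illustration, here is a minimal sketch of that associative table and the "all Documents for a Category" lookup. It uses Python's built-in sqlite3 module rather than SQL Server so it can be run as-is; the sample rows, the index name, and the documents_for_category helper are assumptions made up for the example, not part of the original design.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Document (ID INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Category (ID INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Document_Category (
        DocumentID INTEGER NOT NULL REFERENCES Document(ID),
        CategoryID INTEGER NOT NULL REFERENCES Category(ID),
        PRIMARY KEY (DocumentID, CategoryID)
    );
    -- Hypothetical index so lookups by category can avoid a full scan.
    CREATE INDEX IX_Document_Category_CategoryID
        ON Document_Category (CategoryID, DocumentID);

    -- Made-up sample data.
    INSERT INTO Document VALUES (1, 'Spec'), (2, 'Memo'), (3, 'FAQ');
    INSERT INTO Category VALUES (10, 'Legal'), (20, 'Public');
    INSERT INTO Document_Category VALUES (1, 10), (1, 20), (2, 10), (3, 20);
""")


def documents_for_category(category_id):
    """Return (ID, Title) for every document linked to the category."""
    return conn.execute(
        """
        SELECT d.ID, d.Title
        FROM Document AS d
        JOIN Document_Category AS dc ON dc.DocumentID = d.ID
        WHERE dc.CategoryID = ?
        ORDER BY d.ID
        """,
        (category_id,),
    ).fetchall()


legal_docs = documents_for_category(10)
print(legal_docs)
```

The lookup is an equality match on CategoryID, so no substring matching is involved.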

This is confusing to the developers. Is there some elegant alternate solution that I'm missing?

I'm assuming there will be thousands of rows in Document. Category may be like 40 rows or so. The primary concern is query performance. Am I over-engineering this?

Is there a case where it's preferred to store lists of IDs in database columns rather than pushing the data out to an associative table?

Consider also that we may need to create many-to-many relationships among documents. This would suggest an associative table Document_Document. Is that the preferred design or is it better to store the associated Document IDs in a single column?
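
A rough sketch of what a Document_Document table might look like, again using Python's sqlite3 module so it runs as-is. The RelatedDocumentID column name, the "store each pair once, smaller ID first" rule, and the related_documents helper are all assumptions for illustration; a directed relationship would drop the CHECK constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Document (ID INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Document_Document (
        DocumentID INTEGER NOT NULL REFERENCES Document(ID),
        RelatedDocumentID INTEGER NOT NULL REFERENCES Document(ID),
        PRIMARY KEY (DocumentID, RelatedDocumentID),
        -- Assumption: links are undirected, so each pair is stored once,
        -- with the smaller ID first.
        CHECK (DocumentID < RelatedDocumentID)
    );
    -- Made-up sample data.
    INSERT INTO Document VALUES (1, 'Spec'), (2, 'Memo'), (3, 'FAQ');
    INSERT INTO Document_Document VALUES (1, 2), (2, 3);
""")


def related_documents(document_id):
    """Return IDs linked to document_id, whichever side of the pair it is on."""
    rows = conn.execute(
        """
        SELECT RelatedDocumentID FROM Document_Document WHERE DocumentID = ?
        UNION
        SELECT DocumentID FROM Document_Document WHERE RelatedDocumentID = ?
        ORDER BY 1
        """,
        (document_id, document_id),
    ).fetchall()
    return [row[0] for row in rows]


related_to_memo = related_documents(2)
print(related_to_memo)
```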

Thanks.

9 answers

#1

The Document_Category table in your design is certainly the correct way to approach the problem. If it's possible, I would suggest that you educate the developers instead of coming up with a suboptimal solution (and taking a performance hit, and not having referential integrity).

Your other options may depend on the database you're using. For example, in SQL Server you can have an XML column that would allow you to store your array in a pre-defined schema and then do joins based on the contents of that field. Other database systems may have something similar.

#2

This is confusing to the developers.

Get better developers. That is the right approach.

#3

Your suggestion IS the elegant, powerful, best practice solution.

Since I don't think the other answers said the following strongly enough, I'm going to do it.

If your developers 1) can't understand how to model a many-to-many relationship in a relational database, and 2) strongly insist on storing your CategoryIDs as delimited character data,

Then they ought to immediately lose all database design privileges. At the very least, they need an actual experienced professional to join their team who has the authority to stop them from doing something this unwise and can give them the database design training they are completely lacking.

Last, you should not refer to them as "database developers" again until they are properly up to speed, as this is a slight to those of us who actually are competent developers & designers.

I hope this answer is very helpful to you.

Update

The main point of my question is not developer confusion and what to do about it.

The point is to understand when delimited values are the right solution.

Delimited values are the wrong solution except in extremely rare cases. If individual values will ever be queried, inserted, deleted, or updated, that proves it was the wrong decision, because you have to parse and touch all the other values just to work with the desired one. By doing this you're violating first (!!!) normal form (this phrase should sound to you like an unbelievably vile expletive). Using XML to do the same thing is wrong, too. Storing delimited values or multi-value XML in a column could make sense when it is treated as an indivisible and opaque "property bag" that is NOT queried on by the database but is always sent whole to another consumer (perhaps a web server or an EDI recipient).
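
To make the "parse and touch all the other values" point concrete, here is a small sketch comparing the two ways of removing one category from one document. The remove_category_delimited helper and the '|10|20|30|' storage format (leading and trailing pipes) are assumptions for the example; sqlite3 stands in for the real database.

```python
import sqlite3


def remove_category_delimited(category_ids, target_id):
    """Remove one ID from a pipe-delimited string such as '|10|20|30|'."""
    # Every stored ID is parsed and rewritten, even though only one is wanted.
    ids = [part for part in category_ids.split("|") if part]
    kept = [part for part in ids if part != str(target_id)]
    return ("|" + "|".join(kept) + "|") if kept else ""


updated = remove_category_delimited("|10|20|30|", 20)
print(updated)

# With an associative table, the same change is a single-row DELETE.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Document_Category ("
    "DocumentID INTEGER, CategoryID INTEGER, "
    "PRIMARY KEY (DocumentID, CategoryID))"
)
conn.executemany(
    "INSERT INTO Document_Category VALUES (?, ?)",
    [(1, 10), (1, 20), (1, 30)],
)
conn.execute(
    "DELETE FROM Document_Category WHERE DocumentID = ? AND CategoryID = ?",
    (1, 20),
)
remaining = [
    row[0]
    for row in conn.execute(
        "SELECT CategoryID FROM Document_Category "
        "WHERE DocumentID = 1 ORDER BY CategoryID"
    )
]
print(remaining)
```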

This takes me back to my initial comment. Developers who think violating first normal form is a good idea are very inexperienced developers in my book.

I will grant there are some pretty sophisticated non-relational data storage implementations out there using text property bags (such as Facebook(?) and other multi-million user sites running on thousands of servers). Well, when your database, user base, and transactions per second are big enough to need that, you'll have the money to develop it. In the meantime, stick with best practice.

#4

It's almost always a big mistake to use comma-separated IDs.
RDBMSs are designed to store relationships.

#5

My solution is to create an associative table as follows: ... This is confusing to the developers.

Really? This is Database 101. If this is confusing to them, then maybe they need to step away from their wizard-generated code and learn some basic DB normalization.

What you propose is the right solution!!

#6

The many-to-many mapping you are doing is fine and normalized. It also allows for other data to be added later if needed. For example, say you wanted to add a time that the category was added to the document.

I would suggest having a surrogate primary key on the document_category table as well, plus a Unique(documentid, categoryid) constraint if that makes sense.
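
A sketch of how that might look, run through Python's sqlite3 module. The ID surrogate column and the AddedAt column (the "time the category was added" idea) are hypothetical names chosen for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Document_Category (
        ID INTEGER PRIMARY KEY,  -- surrogate key (hypothetical name)
        DocumentID INTEGER NOT NULL,
        CategoryID INTEGER NOT NULL,
        -- Hypothetical extra attribute of the relationship itself.
        AddedAt TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
        UNIQUE (DocumentID, CategoryID)
    );
    INSERT INTO Document_Category (DocumentID, CategoryID) VALUES (1, 10);
""")

# The unique constraint still prevents linking the same pair twice.
try:
    conn.execute(
        "INSERT INTO Document_Category (DocumentID, CategoryID) VALUES (1, 10)"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

added_at = conn.execute(
    "SELECT AddedAt FROM Document_Category "
    "WHERE DocumentID = 1 AND CategoryID = 10"
).fetchone()[0]
print(duplicate_rejected, added_at)
```

A delimited column has nowhere to put a per-link attribute like AddedAt without inventing a second level of delimiters.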

Why are the developers confused?

#7

The 'this is confusing to the developers' design means you have under-educated developers. It is the better relational database design - you should use it if at all possible.

If you really want to use the list structure, then use a DBMS that understands them. Examples of such databases would be the U2 (Unidata, Universe) DBMS, which are (or were, once upon a long time ago) based on the Pick DBMS. There are likely to be other similar DBMS providers.

#8

This is the classic object-relational mapping problem. The developers are probably not stupid, just inexperienced or unaccustomed to doing things the right way. Shouting "3NF!" over and over again won't convince them of the right way.

I suggest you ask your developers to explain to you how they would get a count of documents by category using the pipe-delimited approach. It would be a nightmare, whereas the link table makes it quite simple.
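
As a sketch of that exercise: with the link table the count is one GROUP BY, while the pipe-delimited version has to pull every row out and split it in application code (or in a hand-written split routine inside the database). The sample data is made up and sqlite3 stands in for the real database.

```python
import sqlite3
from collections import Counter

# Link-table version: the database does the counting.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Document_Category (
        DocumentID INTEGER,
        CategoryID INTEGER,
        PRIMARY KEY (DocumentID, CategoryID)
    );
    INSERT INTO Document_Category VALUES
        (1, 10), (1, 20), (2, 10), (3, 20), (3, 30);
""")
link_counts = conn.execute(
    "SELECT CategoryID, COUNT(*) FROM Document_Category "
    "GROUP BY CategoryID ORDER BY CategoryID"
).fetchall()

# Pipe-delimited version: one CategoryIDs string per document, all of which
# must be fetched and parsed before anything can be counted.
delimited_rows = ["|10|20|", "|10|", "|20|30|"]
delimited_counts = Counter(
    int(part) for row in delimited_rows for part in row.split("|") if part
)

print(link_counts)
print(sorted(delimited_counts.items()))
```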

#9

The number one reason that my developers try this "comma-delimited values in a database column" approach is that they have a perception that adding a new table to address the need for multiple values will take too long to add to the data model and the database.

Most of them know that their workaround is bad for all kinds of reasons, but they choose this suboptimal method simply because they can. They can do this and maybe never get caught, or they will get caught much later in the project when it is too expensive and risky to fix it. Why do they do this? Because their performance is measured solely on speed and not on quality or compliance.

It could also be, as on one of my projects, that the developers had a table to put the multi values in but were under the impression that duplicating that data in the parent table would speed up performance. They were wrong and they were called out on it.

So while you do need an answer to how to handle these costly, risky, and business-confidence damaging tricks, you should also try to find the reason why the developers believe that taking this course of action is better in the short and the long run for the project and company. Then fix both the perception and the data structures.

Yes, it could just be laziness, malicious intent, or cluelessness, but I'm betting most of the time developers do this stuff because they are constantly being told "just get it done". We on the data model and database design sides need to ensure that we aren't sending the wrong message about how responsive we can be to requests to fulfill a business requirement for a new entity/table/piece of information.

We should also see that data people need to be constantly monitoring the "as-built" part of our data architectures.

Personally, I never authorize the use of comma-delimited values in a relational database. It is actually faster to build a new table than to build a parsing routine that creates, updates, and manages multiple values in a column, and that also has to deal with all the anomalies introduced because sometimes the data has embedded commas, too.
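
A tiny sketch of the embedded-comma anomaly, using made-up values and two hypothetical helpers (to_delimited, from_delimited) of the kind such a parsing routine would need:

```python
def to_delimited(values):
    """Naive 'store a list in one column' encoder."""
    return ",".join(values)


def from_delimited(text):
    """Naive decoder for the same column."""
    return text.split(",")


original = ["R&D", "Sales, EMEA"]  # the second value contains a comma
round_tripped = from_delimited(to_delimited(original))

print(round_tripped)  # ['R&D', 'Sales', ' EMEA']
```

Two values went in and three came out. Fixing that means adding escaping or quoting rules, which is exactly the extra parsing work a separate table avoids.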

Bottom line, don't do comma delimited values, but find out why the developers want to do it and fix that problem.
