在RDBMS中实现灵活的关系 - 真正的权衡是什么？

I have a bunch of products with a bunch of different possible attributes for each product. E.g. Product A has a name, size, color, shape. Product B has a name, calories, sugar, etc. One way to solve this is like:

我有一堆产品,每种产品都有许多不同的可能属性。例如。产品A具有名称,大小,颜色,形状。产品B有名称,卡路里,糖等。解决这个问题的一种方法是:

1) Create tables

1)创建表格

Products (id, name)
Attributes (id, name)
Product_Attributes (product_id, attribute_id, value as string)

This allows for maximum flexibility, but I have heard a lot of people recommend against this although I am not sure why. I mean, if those tables were called Teams, Players, Team_Players we would all agree that this is proper relational design.

这允许最大的灵活性,但我听到很多人建议不要这样,虽然我不知道为什么。我的意思是,如果这些表被称为团队,玩家,Team_Players,我们都同意这是适当的关系设计。

Everyone who explains to me why this is bad does so in the context of a completely flexible relational design where you don't ever create real tables past a basic few basic initial tables (e.g. object, attribute, object_attribute)-- which I think we all can agree is bad. But this is a much more limited and contained version of that (only Products, not every object in the system), so I don't think it is fair to group these two architectures together.

每个向我解释为什么这么糟糕的人都会在完全灵活的关系设计环境中这样做,你不会创建真正的表通过基本的几个基本初始表(例如object,attribute,object_attribute) - 我认为我们所有人都同意是坏事。但这是一个更加有限和包含的版本(只有产品,而不是系统中的每个对象),所以我认为将这两种架构组合在一起并不公平。

What issues have you encountered (experience or theoretical) that makes this design so bad?

您遇到的哪些问题(经验或理论上的问题)使这个设计如此糟糕?

2) Another way to solve this is to create a Product table with a bunch of columns like Size, Color, Shape, Weight, Sugar, etc and then include some extra columns at the end to give us some flexibility. This will create generally sparse rows filled mostly with NULLs. People tend to like this approach, but my question is how many columns can you have before this approach loses its performance benefits? If you have 200 columns, I imagine this is no longer a smart move, but what about 100 columns? 50 columns? 25 columns?

2)另一种解决方法是创建一个包含大量列,如大小,颜色,形状,重量,糖等的Product表,然后在末尾包含一些额外的列以提供一些灵活性。这将创建通常稀疏的行,主要填充NULL。人们倾向于喜欢这种方法,但我的问题是,在此方法失去性能优势之前,您可以拥有多少列?如果你有200列,我想这不再是一个聪明的举动,但100列呢? 50列? 25列?

3) The final approach I know about is to store all of the attributes as a blob (JSON perhaps) in a single column of the Products table. I like this approach but it doesn't feel right. Queries are hard. And if you want to be able to easily change the name of an attribute later, you either have to parse every record individually or have them keyed in your blob by some id. If you go the id path then you will need another table Attributes and things start to look like approach #1 from above except you won't be able to join the attribute_id with your blob, so I hope you didn't want to query anything by attribute name.

3)我所知道的最后一种方法是将所有属性作为blob(可能是JSON)存储在Products表的单个列中。我喜欢这种方法,但感觉不对。查询很难。如果您希望以后能够轻松更改属性的名称,则必须单独解析每个记录,或者通过某个ID将它们键入blob中。如果你去id路径然后你将需要另一个表属性,事情开始看起来像上面的方法#1,除了你将无法与你的blob加入attribute_id,所以我希望你不想查询任何东西按属性名称。

What I like about this approach though is you can query one product and in your code you can easily access all the properties it has -- fast. And if you delete a product, you won't have to cleanup other tables -- easy to stay consistent.

我喜欢这种方法的是你可以查询一个产品,在你的代码中你可以快速地访问它拥有的所有属性。如果删除产品,则无需清理其他表 - 易于保持一致。

4) I have read some things about being able to index strongly typed xml formats in some RDBMSs, but I honestly don't know much about this approach.

4)我已经阅读了一些关于能够在一些RDBMS中索引强类型xml格式的东西,但老实说我对这种方法并不了解。

I am stuck. I feel like approach #1 is the best bet, but everything I read says that way stinks. What is the right way to think about this problem to be able to decide what is the best method for a given situation? More ideas than what I have listed are obviously welcomed!

我被卡住了。我认为方法#1是最好的选择,但我读到的所有内容都说这样的方式很糟糕。考虑这个问题的正确方法是什么,以便能够确定给定情况下最佳方法是什么?显然,欢迎比我列出的更多想法!

4 个解决方案

#1

You can probably find a great deal about this topic by doing a Google search on "entity attribute value antipattern".

您可以通过对“实体属性值反模式”进行Google搜索来找到关于此主题的大量内容。

One of the issues with this approach is that you end up mixing meta-data with actual data. Your "attribute" has to now tell the database what exactly is held in the "value" column. This can make it very difficult to handle this data in front-ends, reporting software, etc.

这种方法的一个问题是您最终将元数据与实际数据混合在一起。您的“属性”现在必须告诉数据库“值”列中的确切内容。这可能使得在前端,报告软件等中处理这些数据变得非常困难。

Second, you're going to have a very hard time actually enforcing any data integrity in the database. When your product has an attribute of "weight" what's to stop someone from putting "22 inches" in the value? Or a non-numeric value completely. You might say, "Well, my application will handle that." Then you need to change your application every time that you want to add a new attribute because the application needs to know how to handle it. If you're going to go through all of that work, just add a new column.

其次,您将很难在数据库中实际执行任何数据完整性。当你的产品具有“重量”属性时,阻止某人将“22英寸”放入值中的是什么?或者完全是非数字值。你可能会说,“好吧,我的应用程序将处理这个问题。”然后,每次要添加新属性时都需要更改应用程序,因为应用程序需要知道如何处理它。如果您要完成所有这些工作,只需添加一个新列。

Third, how do you enforce that a given product has all of the attributes that it needs? In a row you can make column NOT NULL and they are then required to get that row into the database. You can't enforce that in the EAV model.

第三,如何强制给定产品具有所需的所有属性?在一行中,您可以使列NOT NULL,然后需要将该行放入数据库。您无法在EAV模型中强制执行此操作。

Fourth, this kind of a model usually leads to a lot of confusion. People aren't sure what "attributes" are supported, or they duplicate an attribute, or they forget to handle an attribute when creating a report. As an example, if I have an attribute for "Weight(kg)" and another attribute for "Weight(lbs)" and someone asks me, "What's the heaviest product in your database?" I'd better remember that I need to check both attributes.

第四,这种模式通常会导致很多混乱。人们不确定支持哪些“属性”,或者他们复制属性,或者在创建报告时忘记处理属性。例如,如果我有“Weight(kg)”属性和“Weight(lbs)”的另一个属性,有人问我,“数据库中最重的产品是什么?”我最好记住,我需要检查两个属性。

Fifth, this model usually also leads to laziness. Hey, there's no reason to actually do any analysis of the products that our system can handle, because whatever comes along we'll just add some attributes. In my experience, companies are much better off doing the analysis required to create a good database design rather than fall back on an antipattern like this. You'll learn things about the database, the application, and likely the business as well.

第五,这种模式通常也会导致懒惰。嘿,没有理由对我们的系统可以处理的产品进行任何分析,因为无论出现什么,我们只会添加一些属性。根据我的经验,公司在创建良好的数据库设计所需的分析方面要好得多,而不是像这样的反模式。您将了解有关数据库,应用程序以及可能的业务的信息。

Sixth, it might take a LOT of joins to get a single row of data for a given product. You can return the attributes as separate rows, but now you have to come up with customized list boxes to list those products, etc. Similarly, writing search queries against this model can be very difficult and in both of these situations you're likely to have performance issues.

第六,为给定产品获取单行数据可能需要很多连接。您可以将属性作为单独的行返回,但现在您必须提供自定义列表框以列出这些产品等。同样,针对此模型编写搜索查询可能非常困难,并且在这两种情况下您都可能有性能问题。

These are just a few of the problems which I've encountered over the years. I'm sure that there are others.

这些只是我多年来遇到的一些问题。我确信还有其他人。

What the correct solution is for your system depends a lot on the specifics of your business and application. Rather than a sparse row, you might consider using subtype tables if your products fall into a few categories that share common attributes.

对您的系统而言,正确的解决方案取决于您的业务和应用程序的具体细节。如果您的产品属于共享公共属性的几个类别,则可以考虑使用子类型表而不是稀疏行。

#2

There are many problems with flexible data models but the first one that is likely to bite you is the fact that queries get unwieldy very quickly. For example, if you wanted to get the Size attribute for every product, the query is relatively easy.

灵活的数据模型存在许多问题,但第一个可能会让您感到困惑的问题是查询很快就变得难以处理。例如,如果要获取每个产品的“大小”属性,则查询相对容易。

SELECT p.name product_name, 
       pa.value product_size
  FROM product p    
         left outer join product_attribute pa on (p.product_id = pa.product_id)
         left outer join attribute a on (pa.attribute_id = a.attribute_id and 
                                         a.name          = 'size')

If you want to get the size and some other attribute like color, things get trickier

如果你想获得尺寸和其他一些属性,比如颜色,事情会变得棘手

SELECT p.name product_name, 
       pa_size.value product_size
       pa_color.value product_color
  FROM product p    
         left outer join product_attribute pa_size on (p.product_id = pa_size.product_id)
         left outer join product_attribute pa_color on (p.product_id = pa_size.product_id)
         left outer join attribute a_size on (pa_size.attribute_id = a.attribute_id and 
                                              a_size.name          = 'size')
         left outer join attribute a_color on (pa_color.attribute_id = a.attribute_id and
                                              a_color.name         = 'color')

Very quickly, when you start wanting to grab 10 attributes or write complex searches (show me products where the color is blue and the size is medium), the queries start to get very complicated both for developers to write and maintain and for the database optimizer to generate the query plan for. If you're joining 30 tables together, the optimizer would have to prune the tree of plans it considers very, very quickly to be able to generate a query plan in a reasonable time frame. That tends to lead the optimizer to discard promising paths too early and to generate less than optimal paths for many of your queries.

非常快,当你开始想要获取10个属性或编写复杂的搜索(向我展示颜色为蓝色且大小为中等的产品)时,查询开始变得非常复杂,无论是开发人员编写和维护,还是数据库优化器生成查询计划。如果您将30个表连接在一起,优化器必须非常快速地修剪它所考虑的计划树,以便能够在合理的时间范围内生成查询计划。这往往导致优化器过早地丢弃有前景的路径,并为您的许多查询生成不太理想的路径。

This, in turn, means that you very quickly get to a point where new development is bottlenecked because developers can't get their queries right or developers can't get their queries to return quickly enough. Whatever time you saved up front by not gathering the requirements to determine what the valid attributes are quickly gets used up with the 47th iteration of "Why can't I get the data I want out of this putrid data model?"

反过来,这意味着您很快就会遇到新的开发受到瓶颈的问题,因为开发人员无法正确地查询或者开发人员无法让他们的查询足够快地返回。无论你何时没有收集确定有效属性的要求,以及第47次迭代“我为什么不能从这个腐烂的数据模型中得到我想要的数据?

Beyond this cost to developers, you end up creating a lot of costs for the organization as a whole.

除了开发人员的这笔成本之外,您最终会为整个组织创造大量成本。

No query tool is going to handle this sort of data model well. So all the users that can currently fire up their favorite query tool and run some reports out of your database are now stuck waiting for developers to write their reports and do their extracts for them.

没有任何查询工具可以很好地处理这种数据模型。因此,目前可以启动他们最喜欢的查询工具并从数据库中运行一些报告的所有用户现在都在等待开发人员编写他们的报告并为他们提取摘录。

Data quality becomes very hard to enforce. It becomes very hard to check conditions that involve multiple attributes (i.e. if a product's size Medium then the weight must be between 1 and 10 pounds, if a product's height is specified then a width is required as well) so people don't make those checks. They don't write the reports to identify where these sorts of rules are violated. So the data ends up being a bit bucket of data that downstream processes decide they can't use because it isn't sufficiently complete.

数据质量变得非常难以实施。检查涉及多个属性的条件变得非常困难(例如,如果产品的尺寸为中等,那么重量必须在1到10磅之间,如果指定了产品的高度,那么也需要宽度)所以人们不会制作那些检查。他们不会编写报告来确定违反这些规则的位置。因此,数据最终成为下游流程决定无法使用的数据桶,因为它不够完整。

You're moving too much of the initial requirements discussion off into the future when understanding the core entities will likely lead to a much better design overall. If you can't agree on a set of attributes that the first version of the product needs to support, you don't really understand what that version is supposed to do. Even if you successfully code a very generic application, that means that it is going to require a lot of time to configure once you've built it (because someone will have to figure out what attributes it supports at that point). And then you'll discover when the application is being configured that you missed a ton of requirements that only became clear when the attributes were defined-- you can't know that width is required if height is specified if you don't know whether they're going to store height or width in the first place.
In the worst case, the response to this problem during configuration is to immediately determine that you need to provide a flexible way to specify business rules and to specify workflows so that the people configuring the application can quickly code their business rules when they add new attributes and so that they can control the flow of the application by grouping attributes together or skipping certain pages (i.e. have a page where make & model are required if the product type is car, skip that page if now). But in order to do that, you're going to end up building an entire development environment. And you're going to push the job of actually coding the application to the folks that are configuring the product. Unless you happen to be really good at building development environments, and unless the people configuring the product are really developers, this doesn't end well.

当您了解核心实体可能会带来更好的整体设计时,您将把初始需求讨论过多地转移到未来。如果您无法就产品的第一个版本需要支持的一组属性达成一致,那么您并不真正了解该版本应该执行的操作。即使您成功编写了一个非常通用的应用程序代码,这意味着一旦您构建它就需要花费大量时间进行配置(因为有人必须弄清楚它在那时支持哪些属性)。然后你会发现,在配置应用程序的时候,你错过了大量的要求,这些要求只有在定义了属性时才变得清晰 - 如果你不知道是否指定了高度,则无法知道是否需要宽度他们首先要存储高度或宽度。在最坏的情况下,配置期间对此问题的响应是立即确定您需要提供一种灵活的方式来指定业务规则并指定工作流,以便配置应用程序的人员可以在添加新属性时快速编写业务规则因此,他们可以通过将属性组合在一起或跳过某些页面来控制应用程序的流程(例如,如果产品类型是car,则有一个需要make和model的页面,如果现在则跳过该页面)。但为了做到这一点,你最终将构建一个完整的开发环境。而且您将把实际编写应用程序的工作推给正在配置产品的人员。除非你恰好擅长构建开发环境,除非配置产品的人真的是开发人员,否则这并不是很好。

#3

I mean, if those tables were called Teams, Players, Team_Players we would all agree that this is proper relational design.

我的意思是,如果这些表被称为团队,玩家,Team_Players,我们都同意这是适当的关系设计。

No, we wouldn't. Here's why.

不,我们不会。这就是原因。

You started with this.

你从这开始。

Products (id, name)
Attributes (id, name)
Product_Attributes (product_id, attribute_id, value as string)

Let's drop the id numbers, so we can see what's really going on. (Longer column names for clarity.)

让我们删除身份证号码,这样我们就能看到真正发生的事情。 (为了清楚起见,列名更长。)

Products (product_name)
Attributes (attribute_name)
Product_Attributes (product_name, attribute_name, value as string)

And translating that to teams and players . . .

并将其转化为团队和玩家。。。

Teams (team_name)
Players (player_name)
Team_Players (team_name, player_name, value as string)

So for sample data we might have

因此,对于我们可能有的样本数据

Team                   Player             Value
--
St. Louis Cardinals    Boggs, Mitchell    ?
St. Louis Cardinals    Carpenter, Chris   ?
St. Louis Cardinals    Franklin, Ryan     ?
St. Louis Cardinals    Garcia, Jaime      ?

What on earth belongs in place of the question marks? Let's say we want to record number of games played. Now the sample data looks like this.

究竟什么代替了问号?假设我们想要记录比赛的数量。现在样本数据看起来像这样。

Team                   Player             Value
--
St. Louis Cardinals    Boggs, Mitchell    23
St. Louis Cardinals    Carpenter, Chris   15
St. Louis Cardinals    Franklin, Ryan     19
St. Louis Cardinals    Garcia, Jaime      14

Want to store batting average, too? You can't. Not only can you not store batting average along with games played, you can't tell by looking at the database whether Mitch Boggs played in 23 games, had 23 hits, scored 23 runs, had 23 "at bats", had 23 singles, or struck out 23 times.

想要存储击球率吗?你不能。你不仅可以通过观看数据库来判断米奇博格斯是否参加了23场比赛,23次命中,23次投球,23次“击球”,23次单打,或者击出23次。

#4

The reason why this approach is so bad is that you don't know how may times you have to join to the table to get all the attributes. Plus joining to the same table 20 times tends to create a performance block of massive proportions. I am assuming that Products wil be at the heart of your system and thus be a critical place for performance.

这种方法如此糟糕的原因是你不知道你有多少时间必须加入到表中才能获得所有属性。加上同一张桌子20次往往会形成一个巨大比例的性能块。我假设产品将成为您系统的核心,因此是性能的关键所在。

Now you say that the product attributes will be drastically different. I disagree. There will be many attributes that are common to a large number of your products things like price, units, size, color, dimemnsions, weight. Those should be in the product table as common properties. These are also the ones that the user is most likely to be searching for when picking a product.

现在你说产品属性将完全不同。我不同意。大量的产品会有许多属性,如价格,单位,尺寸,颜色,尺寸,重量等。那些应该在产品表中作为常见属性。这些也是用户在挑选产品时最有可能搜索的内容。

Other properties are useful as a description but not for most anything else (they won't be searched on or put into the order details). Put those in a description or notes field.

其他属性可用作描述,但不适用于大多数其他属性(不会搜索它们或将其放入订单详细信息中)。将它们放在描述或注释字段中。

Finally you are left with the few attributes which might be different. But how different are they? Are they common to a partiuclar type of product (books have these attributes, cameras have these), then a related table for that type of product might work well.

最后,您将得到一些可能不同的属性。但他们有多么不同?它们对于一种类型的产品是常见的(书籍具有这些属性,相机具有这些属性),那么该类型产品的相关表可能运行良好。

Once you have done your job and figured all this out, then add the flexibility of an EAV table if you still need one. The steps above should cover 98+% of the real requirements.

一旦完成了工作并想出了所有这些,然后添加EAV表的灵活性,如果你还需要它。上述步骤应涵盖98%以上的实际需求。

(Also it's kind of hard to design the order details table if you don't know the attribute fields you need to record for the order - you can't rely on the products table for that)

(如果您不知道需要为订单记录的属性字段,那么设计订单详细信息表也很困难 - 您不能依赖于产品表)

(oh and I agree wholeheartedly with what @Tom H is saying as well.)

(哦,我完全同意@Tom H所说的话。)

#1