Let's say you're modeling an entity that has many attributes (2400+), far more than the physical column limit of a given database engine (e.g., ~1000 for SQL Server). Knowing nothing about the relative importance of these data points (which ones are hot/used most often) besides the domain/candidate keys, how would you implement it?
A) EAV. (boo... Native relational tools thrown out the window.)
B) Go straight across. The first table has a primary key and 1000 columns, right up to the limit. The next table holds the next 1000 columns, foreign-keyed to the first. The last table holds the remaining 400, also foreign-keyed.
C) Stripe evenly across ceil(n / limit) tables. Each table has an equal number of columns, foreign-keyed to the first table: 800, 800, 800.
D) Something else...
And why?
Edit: This is more of a philosophical/generic question, not tied to any specific limits or engines.
Edit^2: As many have pointed out, the data was probably not normalized. Per usual, business constraints at the time made deep research an impossibility.
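For concreteness, the arithmetic behind options B and C can be sketched as follows (a rough illustration only; `split_columns`, `n_attrs`, and `col_limit` are my own names, and the numbers are the hypothetical ones from the question):

```python
import math

def split_columns(n_attrs: int, col_limit: int, even: bool) -> list[int]:
    """Column counts per table: option B (fill to the limit) vs option C (stripe evenly)."""
    n_tables = math.ceil(n_attrs / col_limit)
    if not even:
        # Option B: fill each table to the limit, remainder goes in the last table.
        return [col_limit] * (n_tables - 1) + [n_attrs - col_limit * (n_tables - 1)]
    # Option C: stripe as evenly as possible across the same number of tables.
    base, extra = divmod(n_attrs, n_tables)
    return [base + 1] * extra + [base] * (n_tables - extra)

print(split_columns(2400, 1000, even=False))  # [1000, 1000, 400]
print(split_columns(2400, 1000, even=True))   # [800, 800, 800]
```

Either way, each non-first table also needs a key column for the join back to the first, so in practice you'd leave a little headroom under the limit.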
8 Answers
#1
5
My solution: investigate further. Specifically, establish whether the table is truly normalised (at 2400 columns this seems highly unlikely).
If not, restructure until it is fully normalised (at which point there are likely to be fewer than 1000 columns per table).
If it is already fully normalised, establish (as far as possible) approximate frequencies of population for each attribute. Place the most commonly occurring attributes on the "home" table for the entity, use 2 or 3 additional tables for the less frequently populated attributes. (Try to make frequency of occurrence the criteria for determining which fields should go on which tables.)
Only consider EAV for extremely sparsely populated attributes (preferably, not at all).
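The frequency-driven split described above could be sketched like this (a sketch under assumptions: the attribute names, population frequencies, and the tiny limit of 2 are all invented for illustration):

```python
COL_LIMIT = 1000  # assumed engine limit; leave headroom for key columns in practice

def partition_by_frequency(freqs: dict[str, float], limit: int = COL_LIMIT):
    """Rank attributes by how often they're populated; the most common land on the
    first ("home") table, the rest spill into additional tables of at most `limit`."""
    ranked = sorted(freqs, key=freqs.get, reverse=True)
    return [ranked[i:i + limit] for i in range(0, len(ranked), limit)]

# Hypothetical example: 5 attributes, limit of 2 columns per table.
tables = partition_by_frequency(
    {"name": 0.99, "status": 0.95, "fax": 0.02, "email": 0.80, "pager": 0.01},
    limit=2,
)
print(tables)  # [['name', 'status'], ['email', 'fax'], ['pager']]
```

The payoff is that most queries touch only the home table, and the rarely populated tables stay small if you only insert rows when at least one of their attributes is present.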
#2
6
Use SQL Server's Sparse Columns for up to 30,000 columns. The great advantage over EAV or XML is that you can use Filtered Indexes in conjunction with sparse columns, for very efficient searches over common attributes.
#3
4
Without much knowledge in this area, I think an entity with so many attributes really needs a redesign. By that I mean splitting the big thing into smaller parts that are logically connected.
#4
2
The key item to me is this piece:
Knowing nothing about the relative importance of these data points (which ones are hot/used most often)
If you have an idea of which fields are more important, I would put those more important fields in the "native" table and let an EAV structure handle the rest.
The thing is, without this information you're really shooting blind anyway. Whether you have 2400 fields or just 24, you ought to have some kind of idea about the meaning (and therefore the relative importance, or at least the logical groupings) of your data points.
#5
1
I'd use a one-to-many attribute table with a foreign key back to the entity.
E.g.
entities: id,
attrs: id, entity_id, attr_name, value
ADDED
Or as Butler Lampson would say, "all problems in Computer Science can be solved by another level of indirection"
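As a rough sketch of the schema above, exercised with SQLite (the table and column names come from the answer; the sample data and the index name are invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE entities (id INTEGER PRIMARY KEY);
    CREATE TABLE attrs (
        id        INTEGER PRIMARY KEY,
        entity_id INTEGER NOT NULL REFERENCES entities(id),
        attr_name TEXT NOT NULL,
        value     TEXT
    );
    -- One index covers lookups by attribute name across all entities.
    CREATE INDEX ix_attrs_name ON attrs (attr_name, value);
""")
con.execute("INSERT INTO entities (id) VALUES (1)")
con.executemany(
    "INSERT INTO attrs (entity_id, attr_name, value) VALUES (?, ?, ?)",
    [(1, "colour", "red"), (1, "weight", "12kg")],
)
row = con.execute(
    "SELECT value FROM attrs WHERE entity_id = ? AND attr_name = ?",
    (1, "colour"),
).fetchone()
print(row[0])  # red
```

The usual caveat applies: every value lands in one TEXT column, so the database can no longer enforce types, NOT NULL, or check constraints per attribute.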
#6
0
I would rotate the columns and make them rows. Rather than a column containing the attribute's name as a string (nvarchar), you could make it a foreign key back to a lookup table that contains a list of all the possible attributes.
Rotating it in this way means you:
- you don't have masses of tables to record the details of just one item
- you don't have massively wide tables
- you store only the info you need, thanks to the rotation (if you don't want to store a particular attribute, just don't insert that row)
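A minimal sketch of that lookup-table variant in SQLite (the table and column names here are my own invention): the attribute name lives once in the lookup table, and each value row carries only a small foreign key:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE attribute_defs (      -- lookup table of all possible attributes
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL UNIQUE
    );
    CREATE TABLE item_values (         -- one row per (item, attribute) actually stored
        item_id      INTEGER NOT NULL,
        attribute_id INTEGER NOT NULL REFERENCES attribute_defs(id),
        value        TEXT,
        PRIMARY KEY (item_id, attribute_id)
    );
""")
con.execute("INSERT INTO attribute_defs (id, name) VALUES (1, 'colour')")
con.execute("INSERT INTO item_values VALUES (42, 1, 'blue')")
row = con.execute("""
    SELECT d.name, v.value
    FROM item_values v JOIN attribute_defs d ON d.id = v.attribute_id
    WHERE v.item_id = 42
""").fetchone()
print(row)  # ('colour', 'blue')
```

Compared with storing the name as an nvarchar on every row, this keeps the value rows narrow and makes renaming an attribute a one-row update in the lookup table.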
#7
0
- I'd look at the data model a lot more carefully. Is it in 3rd normal form? Are there groups of attributes that should be logically grouped together into their own tables?
- Assuming it is normalized and the entity truly has 2400+ attributes, I wouldn't be so quick to boo an EAV model. IMHO, it's the best, most flexible solution for the situation you've described. It gives you built-in support for sparse data and good search speed, since the values for any given attribute can all be found in a single index.
#8
0
I would use a vertical approach (increase the number of rows) instead of a horizontal one (increase the number of columns).
You can sketch this approach as:
table: id, property_name, property_value
The advantage of this approach is that there's no need to alter or create a table when you introduce a new property/column.