My requirements are:
- Need to be able to dynamically add user-defined fields (UDFs) of any data type
- Need to be able to query UDFs quickly
- Need to be able to do calculations on UDFs based on data type
- Need to be able to sort UDFs based on data type
Other information:
- I'm looking primarily for performance
- There are a few million Master records which can have UDF data attached
- When I last checked, there were over 50 million UDF records in our current database
- Most of the time, a UDF is only attached to a few thousand of the Master records, not all of them
- UDFs are not joined or used as keys. They're just data used for queries or reports
Options:
-
Create a big table with StringValue1, StringValue2... IntValue1, IntValue2,... etc. I hate this idea, but will consider it if someone can tell me it is better than other ideas and why.
-
Create a dynamic table which adds a new column on demand as needed. I also don't like this idea since I feel performance would be slow unless you indexed every column.
-
Create a single table containing UDFName, UDFDataType, and Value. When a new UDF gets added, generate a View which pulls just that data and parses it into whatever type is specified. Items which don't meet the parsing criteria return NULL.
-
Create multiple UDF tables, one per data type. So we'd have tables for UDFStrings, UDFDates, etc. Probably would do the same as #2 and auto-generate a View anytime a new field gets added.
-
XML DataTypes? I haven't worked with these before but have seen them mentioned. Not sure if they'd give me the results I want, especially with performance.
-
Something else?
13 Answers
#1
44
If performance is the primary concern, I would go with #6... a table per UDF (really, this is a variant of #2). This answer is specifically tailored to this situation and the description of the data distribution and access patterns described.
Pros:
-
Because you indicate that some UDFs have values for only a small portion of the overall data set, a separate table would give you the best performance, because that table will be only as large as it needs to be to support the UDF. The same holds true for the related indices.
-
You also get a speed boost by limiting the amount of data that has to be processed for aggregations or other transformations. Splitting the data out into multiple tables lets you perform some of the aggregating and other statistical analysis on the UDF data, then join that result to the master table via foreign key to get the non-aggregated attributes.
-
You can use table/column names that reflect what the data actually is.
-
You have complete control to use data types, check constraints, default values, etc. to define the data domains. Don't underestimate the performance hit resulting from on-the-fly data type conversion. Such constraints also help RDBMS query optimizers develop more effective plans.
-
Should you ever need to use foreign keys, built-in declarative referential integrity is rarely out-performed by trigger-based or application-level constraint enforcement.
Cons:
-
This could create a lot of tables. Enforcing schema separation and/or a naming convention would alleviate this.
-
There is more application code needed to operate the UDF definition and management. I expect this is still less code than is needed for the original options 1, 3, and 4.
Other considerations:
-
If there is anything about the nature of the data that would make it sensible to group UDFs, that should be encouraged. That way, those data elements can be combined into a single table. For example, let's say you have UDFs for color, size, and cost. The tendency in the data is that most instances of this data look like
'red', 'large', 45.03
rather than
NULL, 'medium', NULL
In such a case, you won't incur a noticeable speed penalty by combining the 3 columns in 1 table, because few values would be NULL and you avoid making 2 more tables, which is 2 fewer joins needed when you need to access all 3 columns.
-
If you hit a performance wall from a UDF that is heavily populated and frequently used, then that UDF should be considered for inclusion in the master table.
-
Logical table design can take you to a certain point, but when the record counts get truly massive, you should also start looking at what table partitioning options are provided by your RDBMS of choice.
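A minimal sketch of the table-per-UDF layout described above, using SQLite via Python purely for illustration (the `UdfContractEndDate` table and all column names are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Master table plus one narrow table per UDF; each UDF table only
# holds rows for the master records that actually have a value.
cur.executescript("""
CREATE TABLE Master (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE UdfContractEndDate (
    MasterId INTEGER PRIMARY KEY REFERENCES Master(Id),
    Value    DATE NOT NULL
);
CREATE INDEX IX_UdfContractEndDate_Value ON UdfContractEndDate(Value);
""")

cur.execute("INSERT INTO Master VALUES (1, 'Acme'), (2, 'Globex')")
cur.execute("INSERT INTO UdfContractEndDate VALUES (1, '2025-06-30')")

# Filter/aggregate on the small UDF table, then join back to Master
# for the non-aggregated attributes.
rows = cur.execute("""
    SELECT m.Name, u.Value
    FROM UdfContractEndDate u
    JOIN Master m ON m.Id = u.MasterId
    WHERE u.Value < '2026-01-01'
""").fetchall()
print(rows)  # [('Acme', '2025-06-30')]
```

Note the UDF table stays proportional to the few thousand records that carry the field, not the few million Master records, and the native `DATE` column sorts and compares without any casting.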
#2
22
I have written about this problem a lot. The most common solution is the Entity-Attribute-Value antipattern, which is similar to what you describe in your option #3. Avoid this design like the plague.
What I use for this solution when I need truly dynamic custom fields is to store them in a blob of XML, so I can add new fields at any time. But to make it speedy, also create additional tables for each field you need to search or sort on (you don't need a table per field--just a table per searchable field). This is sometimes called an inverted index design.
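A rough sketch of that inverted-index idea, using a JSON blob instead of XML for brevity and SQLite via Python (the `priority` field and table names are made up for illustration):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One blob column holds all custom fields; a separate narrow table
# indexes only the field we need to search or sort on.
cur.executescript("""
CREATE TABLE entities (id INTEGER PRIMARY KEY, body TEXT);
CREATE TABLE index_priority (
    entity_id INTEGER REFERENCES entities(id),
    priority  INTEGER
);
CREATE INDEX ix_priority ON index_priority(priority);
""")

def save(entity_id, fields):
    # Write the blob, and keep the searchable field's index row in sync.
    cur.execute("INSERT INTO entities VALUES (?, ?)",
                (entity_id, json.dumps(fields)))
    if "priority" in fields:
        cur.execute("INSERT INTO index_priority VALUES (?, ?)",
                    (entity_id, fields["priority"]))

save(1, {"priority": 5, "color": "red"})
save(2, {"color": "blue"})  # no searchable field -> no index row

ids = [r[0] for r in cur.execute(
    "SELECT entity_id FROM index_priority WHERE priority > 3")]
print(ids)  # [1]
```

Only searchable fields pay the cost of an extra table; everything else rides along in the blob for free.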
You can read an interesting article from 2009 about this solution here: http://backchannel.org/blog/friendfeed-schemaless-mysql
Or you can use a document-oriented database, where it's expected that you have custom fields per document. I'd choose Solr.
#3
9
I would most probably create a table of the following structure:
- varchar Name
- varchar Type
- decimal NumberValue
- varchar StringValue
- date DateValue
The exact types of course depend on your needs (and of course on the DBMS you are using). You could also use the NumberValue (decimal) field for ints and booleans. You may need other types as well.
You need some link to the Master records which own the value. It's probably easiest and fastest to create a user-fields table for each master table and add a simple foreign key. This way you can filter master records by user fields easily and quickly.
You may want to have some kind of metadata information. So you end up with the following:
Table UdfMetaData
- int id
- varchar Name
- varchar Type
Table MasterUdfValues
- int Master_FK
- int MetaData_FK
- decimal NumberValue
- varchar StringValue
- date DateValue
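A quick sketch of that metadata/values pair, again using SQLite via Python (the `Weight` UDF and the master record id are hypothetical; non-matching type columns are left NULL, as in the layout above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE UdfMetaData (
    id   INTEGER PRIMARY KEY,
    Name TEXT,
    Type TEXT
);
CREATE TABLE MasterUdfValues (
    Master_FK   INTEGER,
    MetaData_FK INTEGER REFERENCES UdfMetaData(id),
    NumberValue REAL,
    StringValue TEXT,
    DateValue   DATE
);
""")

# Define one UDF, then store its value in the column matching its type.
cur.execute("INSERT INTO UdfMetaData VALUES (1, 'Weight', 'number')")
cur.execute("INSERT INTO MasterUdfValues VALUES (42, 1, 12.5, NULL, NULL)")

# Typed filtering stays a plain column comparison, with no casting.
rows = cur.execute("""
    SELECT v.Master_FK, v.NumberValue
    FROM MasterUdfValues v
    JOIN UdfMetaData m ON m.id = v.MetaData_FK
    WHERE m.Name = 'Weight' AND v.NumberValue > 10
""").fetchall()
print(rows)  # [(42, 12.5)]
```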
Whatever you do, I would not change the table structure dynamically. It is a maintenance nightmare. I would also not use XML structures; they are much too slow.
#4
8
This sounds like a problem that might be better solved by a non-relational solution, like MongoDB or CouchDB.
They both allow for dynamic schema expansion while allowing you to maintain the tuple integrity you seek.
I agree with Bill Karwin, the EAV model is not a performant approach for you. Using name-value pairs in a relational system is not intrinsically bad, but only works well when the name-value pair makes a complete tuple of information. When using it forces you to dynamically reconstruct a table at run-time, all kinds of things start to get hard. Querying becomes an exercise in pivot maintenance, or forces you to push the tuple reconstruction up into the object layer.
You can't determine whether a null or missing value is a valid entry or lack of entry without embedding schema rules in your object layer.
You lose the ability to efficiently manage your schema. Is a 100-character varchar the right type for the "value" field? 200 characters? Should it be nvarchar instead? It can be a hard trade-off, and one that ends with you having to place artificial limits on the dynamic nature of your set. Something like "you can only have x user-defined fields and each can only be y characters long."
With a document-oriented solution, like MongoDB or CouchDB, you maintain all attributes associated with a user within a single tuple. Since joins are not an issue, life is happy, as neither of these two does well with joins, despite the hype. Your users can define as many attributes as they want (or you will allow) at lengths that don't get hard to manage until you reach about 4MB.
If you have data that requires ACID-level integrity, you might consider splitting the solution, with the high-integrity data living in your relational database and the dynamic data living in a non-relational store.
#5
6
Even if you provide for a user adding custom columns, it will not necessarily be the case that querying on those columns will perform well. There are many aspects that go into query design that allow them to perform well, the most important of which is the proper specification of what should be stored in the first place. Thus, fundamentally, is it that you want to allow users to create schema without thought as to specifications and be able to quickly derive information from that schema? If so, then it is unlikely that any such solution will scale well, especially if you want to allow the user to do numerical analysis on the data.
Option 1
IMO this approach gives you schema with no knowledge as to what the schema means, which is a recipe for disaster and a nightmare for report designers. I.e., you must have the metadata to know what column stores what data. If that metadata gets messed up, it has the potential to hose your data. Plus, it makes it easy to put the wrong data in the wrong column. ("What? String1 contains the name of convents? I thought it was Charlie Sheen's favorite drugs.")
Option 3, 4, 5
IMO, requirements 2, 3, and 4 eliminate any variation of an EAV. If you need to query, sort or do calculations on this data, then an EAV is Cthulhu's dream and your development team's and DBA's nightmare. EAVs will create a bottleneck in terms of performance and will not give you the data integrity you need to quickly get to the information you want. Queries will quickly turn into crosstab Gordian knots.
Option 2, 6
That really leaves one choice: gather specifications and then build out the schema.
If the client wants the best performance on data they wish to store, then they need to go through the process of working with a developer to understand their needs so that it is stored as efficiently as possible. It could still be stored in a table separate from the rest of the tables, with code that dynamically builds a form based on the schema of the table. If you have a database that allows for extended properties on columns, you could even use those to help the form builder use nice labels, tooltips etc., so that all that is necessary is to add the schema. Either way, to build and run reports efficiently, the data needs to be stored properly. If the data in question will have lots of nulls, some databases have the ability to store that type of information efficiently. For example, SQL Server 2008 has a feature called Sparse Columns specifically for data with lots of nulls.
If this were only a bag of data on which no analysis, filtering, or sorting was to be done, I'd say some variation of an EAV might do the trick. However, given your requirements, the most efficient solution will be to get the proper specifications, even if you store these new columns in separate tables and build forms dynamically off those tables.
Sparse Columns
#6
4
- Create multiple UDF tables, one per data type. So we'd have tables for UDFStrings, UDFDates, etc. Probably would do the same as #2 and auto-generate a View anytime a new field gets added
According to my research, multiple tables based on the data type are not going to help you with performance. Especially if you have bulk data, like 20K or 25K records with 50+ UDFs. Performance was the worst.
You should go with a single table with multiple columns like:
varchar Name
varchar Type
decimal NumberValue
varchar StringValue
date DateValue
#7
4
This is a problematic situation, and none of the solutions appears "right". However, option 1 is probably the best, both in terms of simplicity and in terms of performance.
This is also the solution used in some commercial enterprise applications.
EDIT
Another option that is available now, but didn't exist (or at least wasn't mature) when the question was originally asked, is to use JSON fields in the DB.
Many relational DBs now support JSON-based fields (that can include a dynamic list of sub-fields) and allow querying on them.
postgres
mysql
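As an illustration of the JSON-field approach, here is a sketch using SQLite's JSON functions via Python (Postgres uses `->>` on `jsonb` and MySQL uses `JSON_EXTRACT`, but the idea is the same; the `color`/`weight` fields are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE master (id INTEGER PRIMARY KEY, udfs TEXT)")

# All custom fields live in one JSON column; adding a new field
# requires no schema change at all.
cur.execute("""INSERT INTO master VALUES
    (1, '{"color": "red", "weight": 12.5}'),
    (2, '{"color": "blue"}')""")

# Query into the JSON. On engines that support it, an expression
# index on json_extract(udfs, '$.color') can make this fast.
rows = cur.execute("""
    SELECT id FROM master
    WHERE json_extract(udfs, '$.color') = 'red'
""").fetchall()
print(rows)  # [(1,)]
```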
#8
2
I've had experience of 1, 3 and 4, and they all end up either messy, with it not being clear what the data is, or really complicated, with some sort of soft categorisation to break the data down into dynamic types of record.
I'd be tempted to try XML. You should be able to enforce schemas against the contents of the XML to check data typing etc., which will help in holding different sets of UDF data. In newer versions of SQL Server you can index on XML fields, which should help out with the performance. (See http://blogs.technet.com/b/josebda/archive/2009/03/23/sql-server-2008-xml-indexing.aspx for example.)
#9
2
If you're using SQL Server, don't overlook the sql_variant type. It's pretty fast and should do your job. Other databases might have something similar.
XML datatypes are not so good for performance reasons. If you're doing calculations on the server then you're constantly having to deserialize these.
Option 1 sounds bad and looks cruddy, but performance-wise it can be your best bet. I have created tables with columns named Field00-Field99 before, because you just can't beat the performance. You might need to consider your INSERT performance too, in which case this is also the one to go for. You can always create Views on this table if you want it to look neat!
#10
1
I've managed this very successfully in the past using none of these options (option 6? :) ).
I create a model for the users to play with (stored as XML and exposed via a custom modelling tool) and, from the model, generated tables and views to join the base tables with the user-defined data tables. So each type would have a base table with core data and a user table with user-defined fields.
Take a document as an example: typical fields would be name, type, date, author, etc. This would go in the core table. Then users would define their own special document types with their own fields, such as contract_end_date, renewal_clause, blah blah blah. For that user-defined document there would be the core document table and the xcontract table, joined on a common primary key (so the xcontract's primary key is also a foreign key to the primary key of the core table). Then I would generate a view to wrap these two tables. Performance when querying was fast. Additional business rules can also be embedded into the views. This worked really well for me.
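That core-plus-extension layout might look like this sketch, shown with SQLite via Python (the `document`/`xcontract` names come from the answer; the columns are hypothetical):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE document (
    id     INTEGER PRIMARY KEY,
    name   TEXT,
    author TEXT
);
-- Extension table shares the core table's primary key (1:1).
CREATE TABLE xcontract (
    id                INTEGER PRIMARY KEY REFERENCES document(id),
    contract_end_date DATE,
    renewal_clause    TEXT
);
-- Generated view hides the join from callers.
CREATE VIEW v_contract AS
    SELECT d.id, d.name, d.author, x.contract_end_date, x.renewal_clause
    FROM document d JOIN xcontract x ON x.id = d.id;
""")

cur.execute("INSERT INTO document VALUES (1, 'NDA', 'alice')")
cur.execute("INSERT INTO xcontract VALUES (1, '2025-12-31', 'auto-renew')")

rows = cur.execute(
    "SELECT name, contract_end_date FROM v_contract").fetchall()
print(rows)  # [('NDA', '2025-12-31')]
```

Reports and queries target the view, so the split between core and user-defined columns stays invisible to consumers.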
#11
0
SharePoint uses option 1 and has reasonable performance.
#12
0
In the comments I saw you saying that the UDF fields are to dump imported data that is not properly mapped by the user.
Perhaps another option is to track the number of UDFs made by each user and force them to reuse fields by saying they can use 6 (or some other equally random limit) custom fields, tops.
When you are faced with a database structuring problem like this, it is often best to go back to the basic design of the application (the import system in your case) and put a few more restraints on it.
Now what I would do is option 4 (EDIT) with the addition of a link to users:
general_data_table
id
...
udfs_linked_table
id
general_data_id
udf_id
udfs_table
id
name
type
owner_id --> Use this to filter for the current user and limit their UDFs
string_link_id --> link table for string fields
int_link_id
type_link_id
Now make sure to make views to optimize performance and get your indexes right. This level of normalization makes the DB footprint smaller, but your application more complex.
#13
0
Our database powers a SaaS app (helpdesk software) where users have over 7k "custom fields". We use a combined approach:
-
a (EntityID, FieldID, Value) table for searching the data
-
a JSON field in the entities table that holds all entity values, used for displaying the data. (This way you don't need a million JOINs to get the values.)
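A condensed sketch of that combined layout (the EAV table serves search, the JSON column serves display; all names and field ids are illustrative):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE entities (EntityID INTEGER PRIMARY KEY, CustomFields TEXT);
CREATE TABLE entity_values (EntityID INTEGER, FieldID INTEGER, Value TEXT);
CREATE INDEX ix_ev ON entity_values(FieldID, Value);
""")

fields = {"17": "urgent", "23": "hardware"}
cur.execute("INSERT INTO entities VALUES (1, ?)", (json.dumps(fields),))
for field_id, value in fields.items():
    # Same data written twice: one row per field for searching...
    cur.execute("INSERT INTO entity_values VALUES (1, ?, ?)",
                (int(field_id), value))

# Search hits the indexed EAV table...
hit = cur.execute(
    "SELECT EntityID FROM entity_values WHERE FieldID = 17 AND Value = 'urgent'"
).fetchone()
# ...display reads all fields back in one row, with no joins.
shown = json.loads(cur.execute(
    "SELECT CustomFields FROM entities WHERE EntityID = ?", hit).fetchone()[0])
print(hit, shown)
```

The trade-off is keeping the two copies in sync on every write, in exchange for cheap reads on both paths.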
You could further split #1 to have a "table per datatype", as this answer suggests; that way you can even index your UDFs.
P.S. A couple of words to defend the "Entity-Attribute-Value" approach everyone keeps bashing. We have used #1 without #2 for decades and it worked just fine. Sometimes it's a business decision. Do you have time to rewrite your app and redesign the DB, or can you throw a couple of bucks at cloud servers, which are really cheap these days? By the way, when we were using the #1 approach, our DB was holding millions of entities, accessed by hundreds of thousands of users, and a 16GB dual-core DB server was doing just fine (really an "r3" VM on AWS).
#1
44
If performance is the primary concern, I would go with #6... a table per UDF (really, this is a variant of #2). This answer is specifically tailored to this situation and the description of the data distribution and access patterns described.
如果表现是首要考虑的话,我会选择#6……每个UDF都有一个表(实际上,这是#2的一个变体)。这个答案是专门针对这种情况以及描述的数据分布和访问模式的。
Pros:
-
Because you indicate that some UDFs have values for a small portion of the overall data set, a separate table would give you the best performance because that table will be only as large as it needs to be to support the UDF. The same holds true for the related indices.
因为您指出,某些UDF对整个数据集的一小部分具有值,所以单独的表将提供最佳性能,因为该表的大小仅为支持UDF所需的大小。相关指数也是如此。
-
You also get a speed boost by limiting the amount of data that has to be processed for aggregations or other transformations. Splitting the data out into multiple tables lets you perform some of the aggregating and other statistical analysis on the UDF data, then join that result to the master table via foreign key to get the non-aggregated attributes.
您还可以通过限制必须为聚合或其他转换处理的数据量来提高速度。将数据分割成多个表可以让您对UDF数据执行一些聚合和其他统计分析,然后通过外键将结果连接到主表,以获得非聚合属性。
-
You can use table/column names that reflect what the data actually is.
可以使用表/列名称来反映数据的实际情况。
-
You have complete control to use data types, check constraints, default values, etc. to define the data domains. Don't underestimate the performance hit resulting from on-the-fly data type conversion. Such constraints also help RDBMS query optimizers develop more effective plans.
您可以完全控制使用数据类型、检查约束、默认值等来定义数据域。不要低估即时数据类型转换带来的性能损失。这些约束也有助于RDBMS查询优化器开发更有效的计划。
-
Should you ever need to use foreign keys, built-in declarative referential integrity is rarely out-performed by trigger-based or application level constraint enforcement.
如果您需要使用外键,内置的声明式引用完整性很少被基于触发器的或应用程序级别的约束强制执行超越。
Cons:
-
This could create a lot of tables. Enforcing schema separation and/or a naming convention would alleviate this.
这可以创建很多表。实施模式分离和/或命名约定将缓解这一问题。
-
There is more application code needed to operate the UDF definition and management. I expect this is still less code needed than for the original options 1, 3, & 4.
操作UDF定义和管理需要更多的应用程序代码。我期望这仍然比原始选项1、3和4所需的代码要少。
Other Considerations:
-
If there is anything about the nature of the data that would make sense for the UDFs to be grouped, that should be encouraged. That way, those data elements can be combined into a single table. For example, let's say you have UDFs for color, size, and cost. The tendency in the data is that most instances of this data looks like
如果有任何关于数据的性质的东西,可以对udf进行分组,那么应该鼓励它。这样,这些数据元素就可以组合成一个表。例如,假设您有颜色、大小和成本的udf。数据中的趋势是这个数据的大多数实例看起来是这样的
'red', 'large', 45.03
rather than
而不是
NULL, 'medium', NULL
In such a case, you won't incur a noticeable speed penalty by combining the 3 columns in 1 table because few values would be NULL and you avoid making 2 more tables, which is 2 fewer joins needed when you need to access all 3 columns.
在这种情况下,您不会因为将3个列合并到一个表中而导致明显的速度损失,因为很少有值是空的,并且您会避免生成2个表,当您需要访问所有3个列时,需要减少2个连接。
-
If you hit a performance wall from a UDF that is heavily populated and frequently used, then that should be considered for inclusion in the master table.
如果您遇到一个UDF的性能墙,该性能墙是大量填充和频繁使用的,那么应该考虑将其包含在主表中。
-
Logical table design can take you to a certain point, but when the record counts get truly massive, you also should start looking at what table partitioning options are provided by your RDBMS of choice.
逻辑表设计可以将您带到某个点,但是当记录计数变得非常庞大时,您还应该开始研究您所选择的RDBMS提供了哪些表分区选项。
#2
22
I have written about this problem a lot. The most common solution is the Entity-Attribute-Value antipattern, which is similar to what you describe in your option #3. Avoid this design like the plague.
我已经写了很多关于这个问题的文章。最常见的解决方案是实体-属性-值反模式,这与您在选项#3中描述的类似。尽量避免这种设计。
What I use for this solution when I need truly dynamic custom fields is to store them in a blob of XML, so I can add new fields at any time. But to make it speedy, also create additional tables for each field you need to search or sort on (you don't a table per field--just a table per searchable field). This is sometimes called an inverted index design.
当我需要真正的动态定制字段时,我使用这个解决方案的方法是将它们存储在一个XML blob中,这样我就可以在任何时候添加新的字段。但是为了让它更快,还需要为每个需要搜索或排序的字段创建额外的表(每个字段不是一个表,而是一个可搜索字段的表)。这有时被称为反向索引设计。
You can read an interesting article from 2009 about this solution here: http://backchannel.org/blog/friendfeed-schemaless-mysql
您可以从2009年阅读一篇关于这个解决方案的有趣文章:http://backchannel.org/blog/friendfeed- schemalmysql -mysql。
Or you can use a document-oriented database, where it's expected that you have custom fields per document. I'd choose Solr.
或者您可以使用面向文档的数据库,在这个数据库中,每个文档都有自定义字段。我就选择Solr。
#3
9
I would most probably create a table of the following structure:
我很可能会创建如下结构的表:
- varchar Name
- varchar名字
- varchar Type
- varchar类型
- decimal NumberValue
- 十进制NumberValue
- varchar StringValue
- varchar StringValue
- date DateValue
- 日期DateValue
The exact types of course depend on your needs (and of course on the dbms you are using). You could also use the NumberValue (decimal) field for int's and booleans. You may need other types as well.
当然,具体类型取决于您的需求(当然也取决于您使用的dbms)。您还可以对int和布尔值使用NumberValue (decimal)字段。您可能还需要其他类型。
You need some link to the Master records which own the value. It's probably easiest and fastest to create a user fields table for each master table and add a simple foreign key. This way you can filter master records by user fields easily and quickly.
您需要一些到拥有该值的主记录的链接。为每个主表创建一个用户字段表并添加一个简单的外键可能是最简单和最快的。通过这种方式,您可以轻松快速地通过用户字段过滤主记录。
You may want to have some kind of meta data information. So you end up with the following:
你可能想要一些元数据信息。所以你的结局是:
Table UdfMetaData
表UdfMetaData
- int id
- int id
- varchar Name
- varchar名字
- varchar Type
- varchar类型
Table MasterUdfValues
表MasterUdfValues
- int Master_FK
- int Master_FK
- int MetaData_FK
- int MetaData_FK
- decimal NumberValue
- 十进制NumberValue
- varchar StringValue
- varchar StringValue
- date DateValue
- 日期DateValue
Whatever you do, I would not change the table structure dynamically. It is a maintenance nightmare. I would also not use XML structures, they are much too slow.
无论您做什么,我都不会动态地更改表结构。这是一场维护噩梦。我也不会使用XML结构,它们太慢了。
#4
8
This sounds like a problem that might be better solved by a non-relational solution, like MongoDB or CouchDB.
这听起来像是可以通过非关系解决方案(如MongoDB或CouchDB)更好地解决的问题。
They both allow for dynamic schema expansion while allowing you to maintain the tuple integrity you seek.
它们都允许动态模式扩展,同时允许您维护所寻求的元组完整性。
I agree with Bill Karwin, the EAV model is not a performant approach for you. Using name-value pairs in a relational system is not intrinsically bad, but only works well when the name-value pair make a complete tuple of information. When using it forces you to dynamically reconstruct a table at run-time, all kinds of things start to get hard. Querying becomes an exercise in pivot maintenance or forces you to push the tuple reconstruction up into the object layer.
我同意Bill Karwin的观点,EAV模式对你来说不是一种表现方式。在关系系统中使用名称-值对本身并不是坏事,但只有当名称-值对构成一个完整的信息元组时才有效。当使用它强制您在运行时动态重构一个表时,所有的事情开始变得困难。查询成为枢轴维护中的一个练习,或者迫使您将元组重构推到对象层中。
You can't determine whether a null or missing value is a valid entry or lack of entry without embedding schema rules in your object layer.
如果不在对象层中嵌入模式规则,就无法确定空值或缺失值是有效的条目还是缺少条目。
You lose the ability to efficiently manage your schema. Is a 100-character varchar the right type for the "value" field? 200-characters? Should it be nvarchar instead? It can be a hard trade-off and one that ends with you having to place artificial limits on the dynamic nature of your set. Something like "you can only have x user-defined fields and each can only be y characters long.
您失去了有效管理模式的能力。100字的varchar是“值”字段的正确类型吗?200 -字符?应该换成nvarchar吗?这可能是一种艰难的权衡,最后你不得不对你的设置的动态特性进行人工限制,比如“你只能有x用户定义的字段,每个字段只能是y字符长。”
With a document-oriented solution, like MongoDB or CouchDB, you maintain all attributes associated with a user within a single tuple. Since joins are not an issue, life is happy, as neither of these two does well with joins, despite the hype. Your users can define as many attributes as they want (or you will allow) at lengths that don't get hard to manage until you reach about 4MB.
使用面向文档的解决方案(如MongoDB或CouchDB),您可以在单个元组中维护与用户关联的所有属性。因为加入并不是一个问题,生活是快乐的,因为这两个人在加入加入时都做得不好,尽管宣传得天花乱坠。您的用户可以定义任意数量的属性(或者您将允许),长度在达到4MB之前不会很难管理。
If you have data that requires ACID-level integrity, you might consider splitting the solution, with the high-integrity data living in your relational database and the dynamic data living in a non-relational store.
如果您有需要acid级别的完整性的数据,您可以考虑将解决方案拆分,将高完整性数据放在关系数据库中,动态数据放在非关系存储中。
#5
6
Even if you provide for a user adding custom columns, it will not necessarily be the case that querying on those columns will perform well. There are many aspects that go into query design that allow them to perform well, the most important of which is the proper specification on what should be stored in the first place. Thus, fundamentally, is it that you want to allow users to create schema without thought as to specifications and be able to quickly derive information from that schema? If so, then it is unlikley that any such solution will scale well especially if you want to allow the user to do numerical analysis on the data.
即使您提供给用户添加自定义列,也不一定是查询这些列的情况。查询设计中有许多方面允许它们很好地执行,其中最重要的是关于应该首先存储的内容的适当规范。因此,从根本上说,您是否希望允许用户创建模式而不考虑规范,并能够从该模式中快速获取信息?如果是这样,那么任何这样的解决方案都不可能很好地扩展,尤其是如果您希望允许用户对数据进行数值分析的话。
Option 1
IMO this approach gives you schema with no knowledge as to what the schema means which is a recipe for disaster and a nightmare for report designers. I.e., you must have the meta data to know what column stores what data. If that metadata gets messed up, it has the potential to hose your data. Plus, it makes it easy to put the wrong data in the wrong column. ("What? String1 contains the name of convents? I thought it was Chalie Sheen's favorite drugs.")
在我看来,这种方法为您提供了模式,但您并不了解模式的含义,这将导致灾难,并成为报表设计人员的梦魇。即。,必须有元数据才能知道哪些列存储哪些数据。如果元数据搞砸了,它就有可能把你的数据冲洗掉。另外,它可以很容易地将错误的数据放在错误的列中。(“什么?弦1包含对流的名字?我以为那是查莉·辛最喜欢的药。
Option 3,4,5
IMO, requirements 2, 3, and 4 eliminate any variation of an EAV. If you need to query, sort or do calculations on this data, then an EAV is Cthulhu's dream and your development team's and DBA's nightmare. EAV's will create a bottleneck in terms of performance and will not give you the data integrity you need to quickly get to the information you want. Queries will quickly turn to crosstab Gordian knots.
Options 2, 6
That really leaves one choice: gather specifications and then build out the schema.
If the client wants the best performance on the data they wish to store, then they need to go through the process of working with a developer to understand their needs so that the data is stored as efficiently as possible. It could still be stored in a table separate from the rest of the tables, with code that dynamically builds a form based on the schema of the table. If you have a database that allows for extended properties on columns, you could even use those to help the form builder use nice labels, tooltips, etc., so that all that is necessary is to add the schema.
Either way, to build and run reports efficiently, the data needs to be stored properly. If the data in question will have lots of nulls, some databases have the ability to store that type of information efficiently. For example, SQL Server 2008 has a feature called Sparse Columns specifically for data with lots of nulls.
If this were only a bag of data on which no analysis, filtering, or sorting was to be done, I'd say some variation of an EAV might do the trick. However, given your requirements, the most efficient solution will be to get the proper specifications even if you store these new columns in separate tables and build forms dynamically off those tables.
如果这只是一堆不需要进行分析、过滤或排序的数据,我认为EAV的一些变体可能会起到作用。但是,考虑到您的需求,最有效的解决方案将是获得适当的规范,即使您将这些新列存储在单独的表中,并从这些表中动态地构建表单。
Sparse Columns
#6
4
- Create multiple UDF tables, one per data type. So we'd have tables for UDFStrings, UDFDates, etc. Probably would do the same as #2 and auto-generate a View anytime a new field gets added
According to my research, multiple tables based on data type are not going to help performance, especially if you have bulk data, like 20K or 25K records with 50+ UDFs. Performance there was the worst.
You should go with a single table with multiple columns, like:
varchar Name
varchar Type
decimal NumberValue
varchar StringValue
date DateValue
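A minimal sketch of that single-table layout, with SQLite standing in for the real database (table and column names follow the answer's list; the data is illustrative). The point is that a numeric filter or sort hits the typed `NumberValue` column directly, with no CAST or parsing:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE udf (
  master_id   INTEGER,
  name        TEXT,
  type        TEXT,
  NumberValue REAL,
  StringValue TEXT,
  DateValue   TEXT
);
-- each row fills only the value column matching its declared type
INSERT INTO udf (master_id, name, type, NumberValue) VALUES
  (1, 'Credit', 'number', 650),
  (2, 'Credit', 'number', 720);
""")

# numeric comparison works natively on the typed column
rows = con.execute("""
SELECT master_id, NumberValue FROM udf
WHERE name = 'Credit' AND NumberValue > 700
""").fetchall()
print(rows)  # [(2, 720.0)]
```

An index on `(name, NumberValue)` (and the equivalents for the other value columns) would keep such queries fast at the 50-million-row scale mentioned in the question.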
#7
4
This is a problematic situation, and none of the solutions appears "right". However option 1 is probably the best both in terms of simplicity and in terms of performance.
This is also the solution used in some commercial enterprise applications.
EDIT
Another option that is available now, but didn't exist (or at least wasn't mature) when the question was originally asked, is to use JSON fields in the DB.
Many relational DBs now support JSON-based fields (which can include a dynamic list of sub-fields) and allow querying on them:
postgres
mysql
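As a hedged sketch of the JSON-field idea, here is what it looks like in SQLite (whose JSON functions mirror the Postgres/MySQL ones in spirit; this assumes a SQLite build with the JSON1 functions, which is standard in recent Python distributions — the column and key names are made up for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE master (id INTEGER PRIMARY KEY, udfs TEXT)")
con.executemany("INSERT INTO master VALUES (?, ?)",
                [(1, '{"region": "East", "credit": 700}'),
                 (2, '{"region": "West", "credit": 650}')])

# query directly into the dynamic sub-fields of the JSON document
rows = con.execute("""
SELECT id, json_extract(udfs, '$.credit') FROM master
WHERE json_extract(udfs, '$.credit') >= 700
""").fetchall()
print(rows)  # [(1, 700)]
```

Postgres additionally lets you put a GIN index on a `jsonb` column, and MySQL lets you index a generated column over a JSON path, which is what makes this option viable at scale.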
#8
2
I've had experience of 1, 3 and 4, and they all end up either messy, with it not being clear what the data is, or really complicated, with some sort of soft categorisation to break the data down into dynamic types of record.
I'd be tempted to try XML; you should be able to enforce schemas against the contents of the XML to check data typing etc., which will help hold different sets of UDF data. In newer versions of SQL Server you can index XML fields, which should help performance (see http://blogs.technet.com/b/josebda/archive/2009/03/23/sql-server-2008-xml-indexing.aspx, for example).
#9
2
If you're using SQL Server, don't overlook the sql_variant type. It's pretty fast and should do your job. Other databases might have something similar.
XML datatypes are not so good for performance reasons. If you're doing calculations on the server then you're constantly having to deserialize them.
Option 1 sounds bad and looks cruddy, but performance-wise it can be your best bet. I have created tables with columns named Field00-Field99 before, because you just can't beat the performance. You might need to consider your INSERT performance too, in which case this is also the one to go for. You can always create Views on this table if you want it to look neat!
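A small sketch of the Field00-Field99 approach plus a tidying view, using SQLite and made-up field assignments (in practice the mapping of FieldNN to a logical UDF would live in a metadata table):

```python
import sqlite3

con = sqlite3.connect(":memory:")

# generate a wide FieldNN table -- only 10 slots here to keep the sketch short
cols = ", ".join(f"Field{i:02d} TEXT" for i in range(10))
con.execute(f"CREATE TABLE master_udf (master_id INTEGER PRIMARY KEY, {cols})")
con.execute("INSERT INTO master_udf (master_id, Field00, Field01) "
            "VALUES (1, 'East', '700')")

# a view gives the anonymous slots meaningful names and types
# without touching the base table
con.execute("""
CREATE VIEW v_customer_udf AS
SELECT master_id,
       Field00 AS region,
       CAST(Field01 AS INTEGER) AS credit
FROM master_udf
""")

rows = con.execute("SELECT region, credit FROM v_customer_udf").fetchall()
print(rows)  # [('East', 700)]
```

The CAST in the view is the weak spot: it runs per row at query time, which is the trade-off you accept for the flat, fast base table.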
#10
1
I've managed this very successfully in the past using none of these options (option 6? :) ).
I create a model for the users to play with (stored as XML and exposed via a custom modelling tool), and from the model I generate tables and views to join the base tables with the user-defined data tables. So each type has a base table with core data and a user table with user-defined fields.
Take a document as an example: typical fields would be name, type, date, author, etc. These would go in the core table. Then users would define their own special document types with their own fields, such as contract_end_date, renewal_clause, blah blah blah. For that user-defined document there would be the core document table and the xcontract table, joined on a common primary key (so the xcontract's primary key is also a foreign key to the primary key of the core table). Then I would generate a view to wrap these two tables. Performance when querying was fast. Additional business rules can also be embedded into the views. This worked really well for me.
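The core-table / extension-table / wrapping-view arrangement described above can be sketched like this in SQLite (the table names follow the answer's example; the row data is invented):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE document (id INTEGER PRIMARY KEY, name TEXT, author TEXT);

-- extension table shares the core table's primary key,
-- so its PK is simultaneously an FK to document(id)
CREATE TABLE xcontract (
  id INTEGER PRIMARY KEY REFERENCES document(id),
  contract_end_date TEXT,
  renewal_clause TEXT
);

-- generated view wraps the pair so reports see one logical table
CREATE VIEW v_contract AS
SELECT d.id, d.name, d.author, x.contract_end_date, x.renewal_clause
FROM document d JOIN xcontract x ON x.id = d.id;

INSERT INTO document VALUES (1, 'ACME supply deal', 'Jo');
INSERT INTO xcontract VALUES (1, '2025-12-31', 'auto-renew');
""")

rows = con.execute("SELECT name, contract_end_date FROM v_contract").fetchall()
print(rows)  # [('ACME supply deal', '2025-12-31')]
```

Because every user-defined field is a real typed column in the extension table, the sorting and calculation requirements fall out for free.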
#11
0
SharePoint uses option 1 and has reasonable performance.
#12
0
In the comments I saw you saying that the UDF fields are for dumping imported data that is not properly mapped by the user.
Perhaps another option is to track the number of UDFs created by each user and force them to reuse fields by saying they can use 6 (or some other equally arbitrary limit) custom fields, tops.
When you are faced with a database structuring problem like this it is often best to go back to the basic design of the application (import system in your case) and put a few more restraints on it.
Now what I would do is option 4 (EDIT) with the addition of a link to users:
general_data_table
id
...
udfs_linked_table
id
general_data_id
udf_id
udfs_table
id
name
type
owner_id --> Use this to filter for the current user and limit their UDFs
string_link_id --> link table for string fields
int_link_id
type_link_id
Now make sure to make views to optimize performance and get your indexes right. This level of normalization makes the DB footprint smaller, but your application more complex.
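Here is a hedged sketch of that normalized layout in SQLite (I've modelled the "link table per type" pieces as udf_strings / udf_ints value tables, which is one reasonable reading of the string_link_id / int_link_id columns above; all names and data are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE general_data (id INTEGER PRIMARY KEY);
CREATE TABLE udfs (
  id       INTEGER PRIMARY KEY,
  name     TEXT,
  type     TEXT,
  owner_id INTEGER          -- filter by current user to limit their UDFs
);
-- one value table per datatype keeps each value natively typed
CREATE TABLE udf_strings (udf_id INTEGER, general_data_id INTEGER, value TEXT);
CREATE TABLE udf_ints    (udf_id INTEGER, general_data_id INTEGER, value INTEGER);
""")
con.execute("INSERT INTO general_data VALUES (1)")
con.execute("INSERT INTO udfs VALUES (10, 'Credit', 'int', 42)")
con.execute("INSERT INTO udf_ints VALUES (10, 1, 700)")

# resolve a named UDF for a given user down to its typed value
rows = con.execute("""
SELECT g.id, i.value
FROM general_data g
JOIN udf_ints i ON i.general_data_id = g.id
JOIN udfs u     ON u.id = i.udf_id
WHERE u.name = 'Credit' AND u.owner_id = 42
""").fetchall()
print(rows)  # [(1, 700)]
```

The extra joins are the cost of the smaller footprint the answer mentions; views over this shape hide them from the application.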
#13
0
Our database powers a SaaS app (helpdesk software) where users have over 7k "custom fields". We use a combined approach:
- a (EntityID, FieldID, Value) table for searching the data
- a JSON field in the entities table that holds all entity values, used for displaying the data (this way you don't need a million JOINs to get the values).
You could further split #1 to have a "table per datatype" like this answer suggests, this way you can even index your UDFs.
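The combined approach can be sketched as follows: search through the narrow, indexed EAV table, then display from the JSON blob with no joins (SQLite stands in for the real database; field IDs and names are invented for the example):

```python
import json
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE entities (id INTEGER PRIMARY KEY, custom_fields TEXT);
CREATE TABLE entity_values (entity_id INTEGER, field_id INTEGER, value TEXT);
CREATE INDEX ix_val ON entity_values (field_id, value);
""")

fields = {"region": 1, "credit": 2}
udfs = {"region": "East", "credit": "700"}

# write once into both shapes: narrow rows for search, JSON for display
con.execute("INSERT INTO entities VALUES (1, ?)", (json.dumps(udfs),))
con.executemany("INSERT INTO entity_values VALUES (1, ?, ?)",
                [(fields[k], v) for k, v in udfs.items()])

# 1) search hits the indexed EAV table
hit = con.execute(
    "SELECT entity_id FROM entity_values WHERE field_id = ? AND value = ?",
    (fields["region"], "East")).fetchone()

# 2) display reads every field back from the JSON column, join-free
doc = json.loads(con.execute(
    "SELECT custom_fields FROM entities WHERE id = ?", hit).fetchone()[0])
print(doc["credit"])  # prints 700
```

The duplication between the two shapes is deliberate; keeping them in sync on write is the price for cheap search and cheap display.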
P.S. A couple of words to defend the "Entity-Attribute-Value" approach everyone keeps bashing. We used #1 without #2 for decades and it worked just fine. Sometimes it's a business decision: do you have time to rewrite your app and redesign the db, or can you throw a couple of bucks at cloud servers, which are really cheap these days? By the way, when we were using the #1 approach alone, our DB was holding millions of entities, accessed by hundreds of thousands of users, and a 16GB dual-core db server was doing just fine (really an "r3" vm on AWS).