
Time: 2022-10-09 13:00:24

My requirements are:


  • Need to be able to dynamically add User-Defined fields of any data type
  • Need to be able to query UDFs quickly
  • Need to be able to do calculations on UDFs based on datatype
  • Need to be able to sort UDFs based on datatype

Other Information:


  • I'm looking for performance primarily
  • There are a few million Master records which can have UDF data attached
  • When I last checked, there were over 50 million UDF records in our current database
  • Most of the time, a UDF is only attached to a few thousand of the Master records, not all of them
  • UDFs are not joined or used as keys. They're just data used for queries or reports



Options:


  1. Create a big table with StringValue1, StringValue2... IntValue1, IntValue2,... etc. I hate this idea, but will consider it if someone can tell me it is better than other ideas and why.


  2. Create a dynamic table which adds a new column on demand as needed. I also don't like this idea since I feel performance would be slow unless you indexed every column.


  3. Create a single table containing UDFName, UDFDataType, and Value. When a new UDF gets added, generate a View which pulls just that data and parses it into whatever type is specified. Items which don't meet the parsing criteria return NULL.


  4. Create multiple UDF tables, one per data type. So we'd have tables for UDFStrings, UDFDates, etc. Probably would do the same as #2 and auto-generate a View anytime a new field gets added


  5. XML DataTypes? I haven't worked with these before but have seen them mentioned. Not sure if they'd give me the results I want, especially with performance.


  6. Something else?


13 solutions



If performance is the primary concern, I would go with #6... a table per UDF (really, this is a variant of #2). This answer is specifically tailored to this situation and the description of the data distribution and access patterns described.



Pros:

  1. Because you indicate that some UDFs have values for a small portion of the overall data set, a separate table would give you the best performance because that table will be only as large as it needs to be to support the UDF. The same holds true for the related indices.


  2. You also get a speed boost by limiting the amount of data that has to be processed for aggregations or other transformations. Splitting the data out into multiple tables lets you perform some of the aggregating and other statistical analysis on the UDF data, then join that result to the master table via foreign key to get the non-aggregated attributes.


  3. You can use table/column names that reflect what the data actually is.


  4. You have complete control to use data types, check constraints, default values, etc. to define the data domains. Don't underestimate the performance hit resulting from on-the-fly data type conversion. Such constraints also help RDBMS query optimizers develop more effective plans.


  5. Should you ever need to use foreign keys, built-in declarative referential integrity is rarely out-performed by trigger-based or application level constraint enforcement.



Cons:

  1. This could create a lot of tables. Enforcing schema separation and/or a naming convention would alleviate this.


  2. There is more application code needed to operate the UDF definition and management. I expect this is still less code needed than for the original options 1, 3, & 4.
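
To make the table-per-UDF idea concrete, here is a minimal sketch. It uses SQLite through Python's sqlite3 module purely as a stand-in for whatever RDBMS you run, and the names master, udf_warranty_months, and the index name are hypothetical, not taken from the question. It shows a narrow, typed, indexed table that holds rows only for the masters that actually have the UDF, plus an aggregate and a sorted join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE master (id INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- One narrow table per UDF; it holds rows only for masters that have a value.
    CREATE TABLE udf_warranty_months (
        master_id INTEGER PRIMARY KEY REFERENCES master(id),
        value     INTEGER NOT NULL CHECK (value >= 0)
    );
    CREATE INDEX ix_udf_warranty_months_value ON udf_warranty_months(value);
""")

conn.executemany("INSERT INTO master VALUES (?, ?)",
                 [(1, "widget"), (2, "gadget"), (3, "gizmo")])
conn.executemany("INSERT INTO udf_warranty_months VALUES (?, ?)",
                 [(1, 12), (3, 36)])

# Aggregate on the small UDF table alone, with no type conversion.
avg_months = conn.execute(
    "SELECT AVG(value) FROM udf_warranty_months").fetchone()[0]

# Sort by the typed value, then join back to master for the other attributes.
rows = conn.execute("""
    SELECT m.name, u.value
    FROM udf_warranty_months AS u
    JOIN master AS m ON m.id = u.master_id
    ORDER BY u.value DESC
""").fetchall()
```

The master row without a warranty value (gadget) simply has no row in the UDF table, so the table and its index stay as small as the UDF's actual population.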


Other Considerations:

  1. If there is anything about the nature of the data that would make sense for the UDFs to be grouped, that should be encouraged. That way, those data elements can be combined into a single table. For example, let's say you have UDFs for color, size, and cost. The tendency in the data is that most instances of this data look like


     'red', 'large', 45.03 

    rather than


     NULL, 'medium', NULL

    In such a case, you won't incur a noticeable speed penalty by combining the 3 columns in 1 table because few values would be NULL and you avoid making 2 more tables, which is 2 fewer joins needed when you need to access all 3 columns.


  2. If you hit a performance wall from a UDF that is heavily populated and frequently used, then that should be considered for inclusion in the master table.


  3. Logical table design can take you to a certain point, but when the record counts get truly massive, you also should start looking at what table partitioning options are provided by your RDBMS of choice.




I have written about this problem a lot. The most common solution is the Entity-Attribute-Value antipattern, which is similar to what you describe in your option #3. Avoid this design like the plague.


What I use for this solution when I need truly dynamic custom fields is to store them in a blob of XML, so I can add new fields at any time. But to make it speedy, also create additional tables for each field you need to search or sort on (you don't need a table per field, just a table per searchable field). This is sometimes called an inverted index design.

You can read an interesting article from 2009 about this solution here: http://backchannel.org/blog/friendfeed-schemaless-mysql
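
As a rough illustration of the blob-plus-index-table idea (not a reproduction of the linked article's schema), the sketch below keeps every custom field in an XML blob and maintains one extra table for a single searchable field. SQLite via Python's sqlite3 stands in for the real database, and the names entity, idx_color, and save_entity are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- All custom fields live in one XML blob per entity.
    CREATE TABLE entity (id INTEGER PRIMARY KEY, custom_xml TEXT NOT NULL);

    -- Side table only for the field we need to search or sort on.
    CREATE TABLE idx_color (
        color     TEXT NOT NULL,
        entity_id INTEGER NOT NULL REFERENCES entity(id),
        PRIMARY KEY (color, entity_id)
    );
""")

def save_entity(entity_id, fields):
    """Store all fields in the blob and refresh the index row for 'color'."""
    root = ET.Element("fields")
    for name, value in fields.items():
        ET.SubElement(root, name).text = str(value)
    blob = ET.tostring(root, encoding="unicode")
    with conn:  # one transaction keeps blob and index in step
        conn.execute("INSERT OR REPLACE INTO entity VALUES (?, ?)",
                     (entity_id, blob))
        conn.execute("DELETE FROM idx_color WHERE entity_id = ?", (entity_id,))
        if "color" in fields:
            conn.execute("INSERT INTO idx_color VALUES (?, ?)",
                         (fields["color"], entity_id))

save_entity(1, {"color": "red", "size": "large"})
save_entity(2, {"color": "blue", "note": "fragile"})
save_entity(3, {"color": "red"})

# Search through the index table, then read the full blob for a hit.
red_ids = [row[0] for row in conn.execute(
    "SELECT entity_id FROM idx_color WHERE color = ? ORDER BY entity_id",
    ("red",))]
blob = conn.execute("SELECT custom_xml FROM entity WHERE id = 1").fetchone()[0]
size_of_first = ET.fromstring(blob).findtext("size")
```

Fields such as size and note cost nothing beyond the blob; only a field that must be searched or sorted earns its own table.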


Or you can use a document-oriented database, where it's expected that you have custom fields per document. I'd choose Solr.




I would most probably create a table of the following structure:


  • varchar Name
  • varchar Type
  • decimal NumberValue
  • varchar StringValue
  • date DateValue

The exact types of course depend on your needs (and of course on the DBMS you are using). You could also use the NumberValue (decimal) field for ints and booleans. You may need other types as well.

You need some link to the Master records which own the value. It's probably easiest and fastest to create a user fields table for each master table and add a simple foreign key. This way you can filter master records by user fields easily and quickly.


You may want to have some kind of meta data information. So you end up with the following:


Table UdfMetaData


  • int id
  • varchar Name
  • varchar Type

Table MasterUdfValues


  • int Master_FK
  • int MetaData_FK
  • decimal NumberValue
  • varchar StringValue
  • date DateValue
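
The two tables above could be sketched roughly as follows. This uses SQLite through Python's sqlite3 as a stand-in for your DBMS; the snake_case names, the master table, the CHECK constraint, and the sample rows are assumptions added for illustration. The query filters and sorts master records on a numeric UDF through its typed column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE master (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE udf_meta_data (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        type TEXT NOT NULL CHECK (type IN ('number', 'string', 'date'))
    );
    CREATE TABLE master_udf_values (
        master_fk    INTEGER NOT NULL REFERENCES master(id),
        meta_data_fk INTEGER NOT NULL REFERENCES udf_meta_data(id),
        number_value NUMERIC,
        string_value TEXT,
        date_value   TEXT,  -- ISO-8601 text; SQLite has no native date type
        PRIMARY KEY (master_fk, meta_data_fk)
    );
""")
conn.executemany("INSERT INTO master VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])
conn.executemany("INSERT INTO udf_meta_data VALUES (?, ?, ?)",
                 [(10, "cost", "number"), (11, "color", "string")])
conn.executemany(
    "INSERT INTO master_udf_values VALUES (?, ?, ?, ?, ?)",
    [(1, 10, 45.03, None, None),
     (2, 10, 9.5, None, None),
     (1, 11, None, "red", None)])

# Filter and sort master records by the numeric UDF "cost".
by_cost = conn.execute("""
    SELECT m.name, v.number_value
    FROM master AS m
    JOIN master_udf_values AS v ON v.master_fk = m.id
    JOIN udf_meta_data AS d ON d.id = v.meta_data_fk
    WHERE d.name = 'cost' AND v.number_value > 5
    ORDER BY v.number_value
""").fetchall()
```

Because each value sits in a column of its own type, the comparison and the sort need no string parsing.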

Whatever you do, I would not change the table structure dynamically. It is a maintenance nightmare. I would also not use XML structures; they are much too slow.




This sounds like a problem that might be better solved by a non-relational solution, like MongoDB or CouchDB.


They both allow for dynamic schema expansion while allowing you to maintain the tuple integrity you seek.


I agree with Bill Karwin, the EAV model is not a performant approach for you. Using name-value pairs in a relational system is not intrinsically bad, but only works well when the name-value pairs make a complete tuple of information. When using it forces you to dynamically reconstruct a table at run-time, all kinds of things start to get hard. Querying becomes an exercise in pivot maintenance or forces you to push the tuple reconstruction up into the object layer.

You can't determine whether a null or missing value is a valid entry or lack of entry without embedding schema rules in your object layer.


You lose the ability to efficiently manage your schema. Is a 100-character varchar the right type for the "value" field? 200 characters? Should it be nvarchar instead? It can be a hard trade-off and one that ends with you having to place artificial limits on the dynamic nature of your set. Something like "you can only have x user-defined fields and each can only be y characters long."

With a document-oriented solution, like MongoDB or CouchDB, you maintain all attributes associated with a user within a single tuple. Since joins are not an issue, life is happy, as neither of these two does well with joins, despite the hype. Your users can define as many attributes as they want (or you will allow) at lengths that don't get hard to manage until you reach about 4MB.


If you have data that requires ACID-level integrity, you might consider splitting the solution, with the high-integrity data living in your relational database and the dynamic data living in a non-relational store.




Even if you provide for a user adding custom columns, it will not necessarily be the case that querying on those columns will perform well. There are many aspects that go into query design that allow them to perform well, the most important of which is the proper specification on what should be stored in the first place. Thus, fundamentally, is it that you want to allow users to create schema without thought as to specifications and be able to quickly derive information from that schema? If so, then it is unlikely that any such solution will scale well, especially if you want to allow the user to do numerical analysis on the data.


Option 1

IMO this approach gives you schema with no knowledge as to what the schema means, which is a recipe for disaster and a nightmare for report designers. I.e., you must have the metadata to know what column stores what data. If that metadata gets messed up, it has the potential to hose your data. Plus, it makes it easy to put the wrong data in the wrong column. ("What? String1 contains the name of convents? I thought it was Charlie Sheen's favorite drugs.")


Option 3,4,5

IMO, requirements 2, 3, and 4 eliminate any variation of an EAV. If you need to query, sort or do calculations on this data, then an EAV is Cthulhu's dream and your development team's and DBA's nightmare. EAVs will create a bottleneck in terms of performance and will not give you the data integrity you need to quickly get to the information you want. Queries will quickly turn into crosstab Gordian knots.


Option 2,6

That really leaves one choice: gather specifications and then build out the schema.


If the client wants the best performance on data they wish to store, then they need to go through the process of working with a developer to understand their needs so that it is stored as efficiently as possible. It could still be stored in a table separate from the rest of the tables with code that dynamically builds a form based on the schema of the table. If you have a database that allows for extended properties on columns, you could even use those to help the form builder use nice labels, tooltips etc. so that all that was necessary is to add the schema. Either way, to build and run reports efficiently, the data needs to be stored properly. If the data in question will have lots of nulls, some databases have the ability to store that type of information. For example, SQL Server 2008 has a feature called Sparse Columns specifically for data with lots of nulls.


If this were only a bag of data on which no analysis, filtering, or sorting was to be done, I'd say some variation of an EAV might do the trick. However, given your requirements, the most efficient solution will be to get the proper specifications even if you store these new columns in separate tables and build forms dynamically off those tables.


Sparse Columns




  1. Create multiple UDF tables, one per data type. So we'd have tables for UDFStrings, UDFDates, etc. Probably would do the same as #2 and auto-generate a View anytime a new field gets added

According to my research, multiple tables based on the data type are not going to help you with performance, especially if you have bulk data, like 20K or 25K records with 50+ UDFs. Performance was the worst.

You should go with a single table with multiple columns like:


varchar Name
varchar Type
decimal NumberValue
varchar StringValue
date DateValue



This is a problematic situation, and none of the solutions appears "right". However, option 1 is probably the best both in terms of simplicity and in terms of performance.


This is also the solution used in some commercial enterprise applications.




Another option that is available now, but didn't exist (or at least wasn't mature) when the question was originally asked, is to use JSON fields in the DB.


Many relational DBs now support JSON-based fields (which can include a dynamic list of sub-fields) and allow querying on them.
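
A small sketch of what querying a JSON column can look like. It assumes SQLite built with its JSON functions (json_extract), accessed through Python's sqlite3; other engines expose different syntax for the same idea. The table, column, and index names are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE master (id INTEGER PRIMARY KEY, udf TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO master VALUES (?, ?)",
    [(1, json.dumps({"color": "red", "cost": 45})),
     (2, json.dumps({"cost": 9})),
     (3, json.dumps({"color": "blue"}))])

# An expression index lets the engine seek on one JSON member.
conn.execute(
    "CREATE INDEX ix_master_udf_cost ON master (json_extract(udf, '$.cost'))")

# Rows without a "cost" member yield NULL and are filtered out.
costs = conn.execute("""
    SELECT id, json_extract(udf, '$.cost') AS cost
    FROM master
    WHERE json_extract(udf, '$.cost') IS NOT NULL
    ORDER BY cost
""").fetchall()
```

New fields need no schema change at all; only fields that are searched or sorted often would need an index like the one above.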








I've had experience of 1, 3 and 4, and they all end up either messy, with it not being clear what the data is, or really complicated, with some sort of soft categorisation to break the data down into dynamic types of record.

I'd be tempted to try XML. You should be able to enforce schemas against the contents of the XML to check data typing etc., which will help with holding different sets of UDF data. In newer versions of SQL Server you can index XML fields, which should help performance (see, for example, http://blogs.technet.com/b/josebda/archive/2009/03/23/sql-server-2008-xml-indexing.aspx).



If you're using SQL Server, don't overlook the sql_variant type. It's pretty fast and should do your job. Other databases might have something similar.

XML datatypes are not so good for performance reasons. If you're doing calculations on the server then you're constantly having to deserialize these.


Option 1 sounds bad and looks cruddy, but performance-wise can be your best bet. I have created tables with columns named Field00-Field99 before because you just can't beat the performance. You might need to consider your INSERT performance too, in which case this is also the one to go for. You can always create Views on this table if you want it to look neat!
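
For illustration, here is one way the wide-table-plus-view idea might look. SQLite via Python's sqlite3 stands in for the real database, and the names master_udf, udf_column_map, and master_udf_named are hypothetical. A small metadata table records which generic column backs which UDF, and a generated view exposes the columns under readable names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Generic, pre-allocated columns (two of each kind here, for brevity).
    CREATE TABLE master_udf (
        master_id     INTEGER PRIMARY KEY,
        string_value1 TEXT,
        string_value2 TEXT,
        int_value1    INTEGER,
        int_value2    INTEGER
    );
    -- Metadata records which generic column backs which user-defined field.
    CREATE TABLE udf_column_map (
        udf_name    TEXT PRIMARY KEY,
        column_name TEXT NOT NULL UNIQUE
    );
    INSERT INTO udf_column_map VALUES ('color', 'string_value1'),
                                      ('quantity', 'int_value1');
""")

# Build a view that exposes the generic columns under their UDF names.
# Identifiers come from our own metadata here; real code must validate them
# before splicing them into SQL.
mapping = conn.execute(
    "SELECT udf_name, column_name FROM udf_column_map ORDER BY udf_name"
).fetchall()
select_list = ", ".join(f'{col} AS "{name}"' for name, col in mapping)
conn.execute(f"CREATE VIEW master_udf_named AS "
             f"SELECT master_id, {select_list} FROM master_udf")

conn.executemany(
    "INSERT INTO master_udf (master_id, string_value1, int_value1) "
    "VALUES (?, ?, ?)",
    [(1, "red", 7), (2, "blue", 3)])

named = conn.execute(
    "SELECT master_id, color, quantity FROM master_udf_named ORDER BY quantity"
).fetchall()
```

Adding a UDF then means claiming a free column in the map and regenerating the view, with no change to the table itself.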




I've managed this very successfully in the past using none of these options (option 6? :) ).


I created a model for the users to play with (stored as XML and exposed via a custom modelling tool), and from the model generated tables and views that join the base tables with the user-defined data tables. So each type would have a base table with core data and a user table with user-defined fields.


Take a document as an example: typical fields would be name, type, date, author, etc. This would go in the core table. Then users would define their own special document types with their own fields, such as contract_end_date, renewal_clause, blah blah blah. For that user-defined document there would be the core document table and the xcontract table, joined on a common primary key (so the xcontract primary key is also a foreign key to the primary key of the core table). Then I would generate a view to wrap these two tables. Performance when querying was fast. Additional business rules can also be embedded into the views. This worked really well for me.
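
A minimal sketch of the core-table, extension-table, and generated-view arrangement from the document example. It uses SQLite through Python's sqlite3 as a stand-in; the view name contract and the sample rows are assumptions, and in the real system this DDL would be generated from the user's model rather than written by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE document (
        id     INTEGER PRIMARY KEY,
        name   TEXT NOT NULL,
        type   TEXT NOT NULL,
        author TEXT
    );
    -- Generated from the user's model: same primary key as the core row.
    CREATE TABLE xcontract (
        id                INTEGER PRIMARY KEY REFERENCES document(id),
        contract_end_date TEXT,
        renewal_clause    TEXT
    );
    -- Generated view that presents both tables as one record type.
    CREATE VIEW contract AS
        SELECT d.id, d.name, d.author, x.contract_end_date, x.renewal_clause
        FROM document AS d
        JOIN xcontract AS x ON x.id = d.id;
""")
conn.execute("INSERT INTO document VALUES (1, 'Lease', 'contract', 'ann')")
conn.execute("INSERT INTO document VALUES (2, 'Memo', 'memo', 'bob')")
conn.execute("INSERT INTO xcontract VALUES (1, '2025-12-31', 'auto-renew')")

contracts = conn.execute(
    "SELECT name, contract_end_date FROM contract ORDER BY contract_end_date"
).fetchall()
```

Documents of other types (the memo here) never appear in the contract view, because they have no row in the extension table.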




SharePoint uses option 1 and has reasonable performance.




In the comments I saw you saying that the UDF fields are to dump imported data that is not properly mapped by the user.


Perhaps another option is to track the number of UDFs made by each user and force them to reuse fields by saying they can use 6 (or some other equally random limit) custom fields tops.


When you are faced with a database structuring problem like this, it is often best to go back to the basic design of the application (the import system in your case) and put a few more constraints on it.


Now what I would do is option 4 (EDIT) with the addition of a link to users:




owner_id --> Use this to filter for the current user and limit their UDFs
string_link_id --> link table for string fields

Now make sure to make views to optimize performance and get your indexes right. This level of normalization makes the DB footprint smaller, but your application more complex.




Our database powers a SaaS app (helpdesk software) where users have over 7k "custom fields". We use a combined approach:


  1. An (EntityID, FieldID, Value) table for searching the data
  2. A JSON field in the entities table that holds all entity values, used for displaying the data (this way you don't need a million JOINs to get the values).

You could further split #1 to have a "table per datatype", as this answer suggests; this way you can even index your UDFs.
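
The combined approach might be sketched like this, with SQLite through Python's sqlite3 standing in for the production database. The names entity, entity_field_value, and save are hypothetical, and storing every value as text is a simplification of this sketch, not a claim about the real schema.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Display copy: every custom value of an entity in one JSON document.
    CREATE TABLE entity (id INTEGER PRIMARY KEY, custom_json TEXT NOT NULL);

    -- Search copy: one (EntityID, FieldID, Value) row per value.
    CREATE TABLE entity_field_value (
        entity_id INTEGER NOT NULL REFERENCES entity(id),
        field_id  INTEGER NOT NULL,
        value     TEXT NOT NULL,
        PRIMARY KEY (entity_id, field_id)
    );
    CREATE INDEX ix_field_value ON entity_field_value (field_id, value);
""")

def save(entity_id, fields):
    """Write both copies in one transaction so they cannot drift apart."""
    with conn:
        conn.execute("INSERT OR REPLACE INTO entity VALUES (?, ?)",
                     (entity_id, json.dumps(fields)))
        conn.execute("DELETE FROM entity_field_value WHERE entity_id = ?",
                     (entity_id,))
        conn.executemany(
            "INSERT INTO entity_field_value VALUES (?, ?, ?)",
            [(entity_id, field_id, str(value))
             for field_id, value in fields.items()])

save(1, {7: "urgent", 9: "EU"})
save(2, {7: "low"})
save(3, {7: "urgent"})

# Searching touches only the narrow, indexed table.
urgent_ids = [row[0] for row in conn.execute(
    "SELECT entity_id FROM entity_field_value "
    "WHERE field_id = ? AND value = ? ORDER BY entity_id", (7, "urgent"))]

# Displaying an entity is a single-row read with no joins.
display = json.loads(
    conn.execute("SELECT custom_json FROM entity WHERE id = 1").fetchone()[0])
```

The price of the fast display path is that every write updates two places, which is why both statements share one transaction here.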


P.S. A couple of words in defense of the "Entity-Attribute-Value" approach everyone keeps bashing. We used #1 without #2 for decades and it worked just fine. Sometimes it's a business decision: do you have time to rewrite your app and redesign the DB, or can you throw a couple of bucks at cloud servers, which are really cheap these days? By the way, when we were using approach #1 alone, our DB was holding millions of entities, accessed by hundreds of thousands of users, and a 16GB dual-core DB server was doing just fine (really an "r3" VM on AWS).




