Storing structured data in a database column?

Time: 2022-08-08 16:59:52

I have been debating with a coworker whether it is a good idea to store structured data (such as XML or JSON) in a database column instead of creating subtables. For example, say we need to store information about questions. There are two types of question: multiple choice and rating (rate from 1 to 10, for example). I would typically create a structure like the one below:

Table                   |   Columns
------------------------------------------------------
Question                | ID, Title, QuestionTypeId
Question_MultipleChoice | QuestionId, Choice
Question_Rating         | QuestionId, Min, Max
QuestionTypes           | ID, TypeName
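For concreteness, here is a minimal sketch of that normalized layout using Python's built-in `sqlite3` module. The table and column names come from the listing above; the column types, the constraints, the sample rows, and the helper names `build_normalized_db` and `load_question_joined` are assumptions made up for illustration.

```python
import sqlite3

# Table and column names follow the layout above; types and constraints
# are assumptions for this sketch. "Min"/"Max" are quoted to keep them
# visually distinct from the SQL functions of the same name.
SCHEMA = """
CREATE TABLE QuestionTypes (
    ID       INTEGER PRIMARY KEY,
    TypeName TEXT NOT NULL UNIQUE
);
CREATE TABLE Question (
    ID             INTEGER PRIMARY KEY,
    Title          TEXT NOT NULL,
    QuestionTypeId INTEGER NOT NULL REFERENCES QuestionTypes(ID)
);
CREATE TABLE Question_MultipleChoice (
    QuestionId INTEGER NOT NULL REFERENCES Question(ID),
    Choice     TEXT NOT NULL
);
CREATE TABLE Question_Rating (
    QuestionId INTEGER PRIMARY KEY REFERENCES Question(ID),
    "Min"      INTEGER NOT NULL,
    "Max"      INTEGER NOT NULL
);
"""


def build_normalized_db() -> sqlite3.Connection:
    """Create an in-memory database with two hypothetical sample questions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute(
        "INSERT INTO QuestionTypes (ID, TypeName) "
        "VALUES (1, 'MultipleChoice'), (2, 'Rating')"
    )
    conn.execute(
        "INSERT INTO Question (ID, Title, QuestionTypeId) "
        "VALUES (1, 'Favorite color?', 1), (2, 'Rate our service', 2)"
    )
    conn.executemany(
        "INSERT INTO Question_MultipleChoice (QuestionId, Choice) VALUES (?, ?)",
        [(1, "Red"), (1, "Blue")],
    )
    conn.execute(
        'INSERT INTO Question_Rating (QuestionId, "Min", "Max") VALUES (2, 1, 10)'
    )
    conn.commit()
    return conn


def load_question_joined(conn, question_id):
    """Fetch one question with its type-specific details via JOINs."""
    row = conn.execute(
        """
        SELECT q.Title, t.TypeName, r."Min", r."Max"
        FROM Question q
        JOIN QuestionTypes t ON t.ID = q.QuestionTypeId
        LEFT JOIN Question_Rating r ON r.QuestionId = q.ID
        WHERE q.ID = ?
        """,
        (question_id,),
    ).fetchone()
    if row is None:
        return None
    title, type_name, min_value, max_value = row
    # Choices live in their own table, so they need a second query.
    choices = [
        choice
        for (choice,) in conn.execute(
            "SELECT Choice FROM Question_MultipleChoice "
            "WHERE QuestionId = ? ORDER BY rowid",
            (question_id,),
        )
    ]
    return {
        "title": title,
        "type": type_name,
        "choices": choices,
        "min": min_value,
        "max": max_value,
    }
```

Reading a single question back already takes a three-table join plus a second query, which is the cost the single-table proposal is trying to avoid.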

My coworker believes it would be better to store the information in a single Question table with a column for the sub-information. For example:

Question
----------------
ID
Title
SubInfo  <-- JSON

The argument is that this would make queries simpler and possibly faster by avoiding JOINs. Are there reasons that this type of database structure should be avoided? It seems that if you need to query based on the data in the SubInfo column this would be a bad idea, but if that is not needed, is this a reasonable database structure?
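As a rough sketch of the single-table alternative, the following again uses Python's `sqlite3`. The JSON is stored in a plain TEXT column purely as an assumption of this sketch; the keys inside `SubInfo` (`type`, `choices`, `min`, `max`) and the helper names are hypothetical.

```python
import json
import sqlite3


def build_json_db() -> sqlite3.Connection:
    """Create the single Question table with a JSON SubInfo column."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Question ("
        "  ID INTEGER PRIMARY KEY,"
        "  Title TEXT NOT NULL,"
        "  SubInfo TEXT NOT NULL"  # JSON serialized as text (assumption)
        ")"
    )
    # Hypothetical sample data; the SubInfo keys are invented for this sketch.
    questions = [
        (1, "Favorite color?", {"type": "MultipleChoice", "choices": ["Red", "Blue"]}),
        (2, "Rate our service", {"type": "Rating", "min": 1, "max": 10}),
    ]
    conn.executemany(
        "INSERT INTO Question (ID, Title, SubInfo) VALUES (?, ?, ?)",
        [(qid, title, json.dumps(sub)) for qid, title, sub in questions],
    )
    conn.commit()
    return conn


def load_question_json(conn, question_id):
    """One row, no JOINs: the application decodes SubInfo itself."""
    row = conn.execute(
        "SELECT Title, SubInfo FROM Question WHERE ID = ?", (question_id,)
    ).fetchone()
    if row is None:
        return None
    title, sub_info = row
    return {"title": title, **json.loads(sub_info)}


def questions_offering_choice(conn, choice):
    """Filtering on SubInfo contents: every row is fetched and parsed in the app."""
    matches = []
    for qid, sub_info in conn.execute("SELECT ID, SubInfo FROM Question ORDER BY ID"):
        if choice in json.loads(sub_info).get("choices", []):
            matches.append(qid)
    return matches
```

Loading a question is a single-row read, but `questions_offering_choice` shows the trade-off raised in the question: without database-side JSON support, filtering on the contents of `SubInfo` degrades to scanning and parsing every row.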

3 answers

#1


2  

Speaking very personally, surveys are one case where I think normalizing nothing and storing the JSON pretty much as-is is the better option.

Without it, you're going to end up with all sorts of bizarre use-cases that you'll eventually want to manage down the road. In addition to tidy multiple choice questions of all sorts, you'll also need to manage that "Other" answer in them, conditional questions, conditional groups of questions; the list goes on and on. What's more, surveys are, like other forms of data, subject to change, and things go from gawdawful to nuclear when they do.

The merit of JSON is that, since surveys are conceptually independent from one another, you've little to no need for referential integrity from one to the next, so you might as well store the entire tree of questions and options as one JSON blob, and worry about formatting it in your app.


The same goes for each submitted answer, for that matter: take the original blob, mark the relevant answer as selected and so forth within it, and store the resulting JSON as-is, rather than storing references to the original questions alongside whatever was answered. This will allow you to readily keep track of what users actually answered, as opposed to whatever the current version of the survey says, and to do so irrespective of how much the survey has diverged since it was originally answered.
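The snapshot idea can be sketched in a few lines of Python. The survey structure here (`questions`, `id`, `options`, `label`) and the `selected` flag are hypothetical names chosen for illustration, not a format the answer prescribes.

```python
import copy
import json


def snapshot_answer(survey: dict, selections: dict) -> dict:
    """Return a copy of the survey tree with the respondent's picks marked.

    `selections` maps a question id to the label of the chosen option.
    The original survey is left untouched.
    """
    snapshot = copy.deepcopy(survey)
    for question in snapshot["questions"]:
        chosen = selections.get(question["id"])
        for option in question["options"]:
            option["selected"] = option["label"] == chosen
    return snapshot


# Hypothetical survey definition.
survey = {
    "title": "Lunch",
    "questions": [
        {
            "id": "q1",
            "text": "Pick one",
            "options": [{"label": "Pizza"}, {"label": "Salad"}],
        }
    ],
}

# This string is what would be written to the answers table as-is.
stored = json.dumps(snapshot_answer(survey, {"q1": "Salad"}))

# The survey is edited later; the stored answer is unaffected.
survey["questions"][0]["options"][1]["label"] = "Soup"
restored = json.loads(stored)
```

Because the stored answer carries its own copy of the questions and options, later edits to the survey cannot change what a respondent is recorded as having seen and chosen.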

If you need to mine the answers later, note that Postgres lets you index JSON using GIN indexes on the whole field (for jsonb columns) and B-tree indexes on expressions.

#2


1  

JSON and XML are essentially data types.

So, if your chosen DB supports that data type and has an appropriate set of matching operators, then all is good.

If you plan to stick XML or JSON in a DB and declare it to be a string, that is definitely not recommended. A string is a string; it is neither JSON nor XML.

For example, the equality operator for a JSON data type knows (or should know) that {"firstName": "John", "lastName": "Smith"} = {"lastName": "Smith", "firstName": "John"} is true.

The equality operator for strings returns false for that, and so on.

Do not expect much from a DB if it cannot tell whether two things are equal.
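The difference between the two kinds of equality is easy to see outside a database as well. In this small sketch Python's `json` module stands in for a database JSON type (an assumption for illustration only), and `json_equal` is a hypothetical helper name.

```python
import json

# The same JSON object written with its keys in a different order.
a = '{"firstName": "John", "lastName": "Smith"}'
b = '{"lastName": "Smith", "firstName": "John"}'


def json_equal(left: str, right: str) -> bool:
    """Compare two JSON documents by value rather than by their text."""
    return json.loads(left) == json.loads(right)


strings_match = a == b               # character-by-character comparison
documents_match = json_equal(a, b)   # key order is irrelevant for objects
```

A column declared as a plain string only ever gets the first kind of comparison.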

#3


0  

We're considering doing both for a similar problem. You could store a blob in the 'Question' table so that you don't have the n+1 issue when trying to retrieve a question with all of its answers, but also keep the 'Answers' table so you can write queries like:


-- Find every question that offers a given choice.
SELECT q.*
FROM Questions q
WHERE EXISTS (
    SELECT a.question_id
    FROM Answers a
    WHERE a.question_id = q.id
      AND a.Choice = 'SomeAnswer'
);

If Questions and Answers don't change often, updating both tables on inserts and updates will work fine.

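One way to keep the blob and the Answers table in step is to write both inside a single transaction. The sketch below does that with Python's `sqlite3`; the table names and the `question_id` and `Choice` columns come from the query above, while the `title` and `answers_json` columns and the helper names are assumptions for illustration.

```python
import json
import sqlite3


def build_hybrid_db() -> sqlite3.Connection:
    """Create a Questions table holding a JSON blob plus a queryable Answers table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Questions (
            id           INTEGER PRIMARY KEY,
            title        TEXT NOT NULL,
            answers_json TEXT NOT NULL  -- denormalized copy (assumed column)
        );
        CREATE TABLE Answers (
            question_id INTEGER NOT NULL REFERENCES Questions(id),
            Choice      TEXT NOT NULL
        );
        """
    )
    return conn


def save_question(conn, question_id, title, choices):
    """Insert or update a question, rewriting the blob and the rows together."""
    with conn:  # one transaction: both writes commit, or both roll back
        conn.execute(
            "INSERT OR REPLACE INTO Questions (id, title, answers_json) "
            "VALUES (?, ?, ?)",
            (question_id, title, json.dumps({"choices": choices})),
        )
        conn.execute("DELETE FROM Answers WHERE question_id = ?", (question_id,))
        conn.executemany(
            "INSERT INTO Answers (question_id, Choice) VALUES (?, ?)",
            [(question_id, choice) for choice in choices],
        )


def questions_with_choice(conn, choice):
    """Run the EXISTS query from the answer and return the matching ids."""
    rows = conn.execute(
        """
        SELECT q.id
        FROM Questions q
        WHERE EXISTS (
            SELECT a.question_id
            FROM Answers a
            WHERE a.question_id = q.id AND a.Choice = ?
        )
        ORDER BY q.id
        """,
        (choice,),
    )
    return [qid for (qid,) in rows]
```

Every write touches both tables, which is why this approach suits data that changes rarely.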

I don't think I would put the min/max rating in a separate table, though.
