Same data, two different ways to store it

The two tables below can both hold the same data: a full year, including some arbitrary info about each month.

table1 (one row = one month)
------
id
month
year
info

table2 (one row = one year)
------
id
year
jan_info
feb_info
mar_info
apr_info
may_info
jun_info
jul_info
aug_info
sep_info
oct_info
nov_info
dec_info

Table A (table1)

Seems more intuitive because the month is numeric, but it has 12x more rows for a full year of data (one row per month). The rows are also smaller (fewer columns).

Table B (table2)

Has 12x fewer rows for a full year of data, but single rows are much larger, and it is possibly more difficult to add more arbitrary info for a month.
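To make the row-count comparison concrete, here is a minimal sketch that builds both layouts and fills them with ten years of data for a single series. It uses Python's built-in sqlite3 module rather than MySQL purely so it runs anywhere; the column types, the year range, and the placeholder info values are assumptions.

```python
import sqlite3

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

conn = sqlite3.connect(":memory:")

# table1: one row per month.
conn.execute(
    "CREATE TABLE table1 ("
    "id INTEGER PRIMARY KEY, month INTEGER, year INTEGER, info TEXT)"
)

# table2: one row per year, one column per month.
month_columns = ", ".join(f"{m}_info TEXT" for m in MONTHS)
conn.execute(
    f"CREATE TABLE table2 (id INTEGER PRIMARY KEY, year INTEGER, {month_columns})"
)

placeholders = ", ".join("?" for _ in MONTHS)
for year in range(2001, 2011):  # ten hypothetical years
    infos = [f"info {year}-{month:02d}" for month in range(1, 13)]
    # Twelve inserts for the row-per-month layout...
    conn.executemany(
        "INSERT INTO table1 (month, year, info) VALUES (?, ?, ?)",
        [(month, year, info) for month, info in enumerate(infos, start=1)],
    )
    # ...and a single insert for the row-per-year layout.
    conn.execute(
        f"INSERT INTO table2 VALUES (NULL, ?, {placeholders})",
        (year, *infos),
    )

rows_table1 = conn.execute("SELECT COUNT(*) FROM table1").fetchone()[0]
rows_table2 = conn.execute("SELECT COUNT(*) FROM table2").fetchone()[0]
print(rows_table1, rows_table2)  # 120 10
```

For one series the ratio is always 12 to 1; a real data set with many series per year would scale both counts by the same factor.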

In a real-world test scenario I set up, there were 12,000 rows in table1 for 10 years of data, where table2 had 150. I realize fewer is better, generally speaking, but ALWAYS? I'm afraid that I'm overlooking some caveat that I'll find later if I commit to one way. I haven't even considered disk usage or which query might be faster. What does MySQL prefer? Is there a "correct" way? Or is there a "better" way?

Thanks for your input!


6 answers

#1

Don't think about how to store it, think about how you use it. And also think about how it might change in the future. The storage structure should reflect use.


The first option is more normalized than the second, so I would tend to prefer it. It has the benefit of being easy to change, for example if every month suddenly needed a second piece of information stored about it. Usually this kind of structure is easier to populate, but not always. Think about where the data is coming from.
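As a rough illustration of the "easy to change" point, the sketch below adds a second piece of per-month information to each layout. It runs against SQLite through Python's sqlite3 module as a stand-in for MySQL, and the new `note` field is a hypothetical name chosen only for this example.

```python
import sqlite3

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 ("
    "id INTEGER PRIMARY KEY, month INTEGER, year INTEGER, info TEXT)"
)
month_columns = ", ".join(f"{m}_info TEXT" for m in MONTHS)
conn.execute(
    f"CREATE TABLE table2 (id INTEGER PRIMARY KEY, year INTEGER, {month_columns})"
)

# Row-per-month layout: one new column covers every month.
conn.execute("ALTER TABLE table1 ADD COLUMN note TEXT")

# Row-per-year layout: the same change needs twelve new columns.
for m in MONTHS:
    conn.execute(f"ALTER TABLE table2 ADD COLUMN {m}_note TEXT")


def column_names(table):
    """Return the column names of a table via SQLite's table_info pragma."""
    return [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]


table1_columns = column_names("table1")
table2_columns = column_names("table2")
print(len(table1_columns), len(table2_columns))  # 5 26
```

Every further per-month field repeats the pattern: one more column for table1, twelve more for table2.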

If you're only using this data for reports and you don't need to aggregate data across months, use the second option.


It really depends on what the data is for and where it comes from. Generally, though, the first option is better.


#2

12000 rows for 10 years of data? I'd say that scales pretty well, since 12000 rows is next to nothing for a decent DBMS.

How are you using the database? Are you sure you really need to worry about optimizations?


If you need to store data that is specific to a month, then you should absolutely store a row for each month. It's a much cleaner approach than the one with a column for each month.

#3

"In a real world test scenerio I set up, there were 12,000 rows in table1 for 10 years of data, where table2 had 150."


How? There would have to be 80 months in a year for that to be the case.
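The arithmetic behind that objection, spelled out as a tiny Python check (the two row counts are the ones quoted from the question; the variable names are just for illustration):

```python
MONTHS_PER_YEAR = 12

rows_table1 = 12_000  # row-per-month count quoted in the question
rows_table2 = 150     # row-per-year count quoted in the question

# Ratio implied by the quoted figures: how many "months" each year row would cover.
implied_months_per_year = rows_table1 / rows_table2

# What table2 should hold if every year row really covers twelve month rows.
expected_rows_table2 = rows_table1 // MONTHS_PER_YEAR

print(implied_months_per_year, expected_rows_table2)  # 80.0 1000
```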


#4

Since this is an optimisation problem, the standard optimisation answer applies: it depends.

What do you want to do with your data?


Table A is the normal form in which one would store this kind of data.


For special cases Table B might come in handy, but I'd need to think hard to find a good example.


So either go with A or give us some details about what you want to do with the data.


A note on disk space: total disk space is a non-issue, except for extremely large tables. If anything, it is the disk space read per select that matters, and that should be less for the Table A design in most cases.

A note on math: if you divide 12000 by 12 and get 150 as a result, something is wrong.

#5

How are you using the data? If you are often doing a report that splits the data out by month, the second is easier to query (and probably faster, but you need to test for yourself). It is less normalized, but honestly, when was the last time we added a new month to the year?
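A sketch of the report case, assuming the report wants one line per year with a column per month. With table2 that is a plain select; with table1 the rows have to be pivoted, here with conditional aggregation. SQLite (via Python's sqlite3) stands in for MySQL, and the two years and the sample values are made up.

```python
import sqlite3

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 ("
    "id INTEGER PRIMARY KEY, month INTEGER, year INTEGER, info TEXT)"
)
month_columns = ", ".join(f"{m}_info TEXT" for m in MONTHS)
conn.execute(
    f"CREATE TABLE table2 (id INTEGER PRIMARY KEY, year INTEGER, {month_columns})"
)

placeholders = ", ".join("?" for _ in MONTHS)
for year in (2009, 2010):  # hypothetical sample years
    values = [f"{year}-{m}" for m in MONTHS]
    conn.executemany(
        "INSERT INTO table1 (month, year, info) VALUES (?, ?, ?)",
        [(i, year, v) for i, v in enumerate(values, start=1)],
    )
    conn.execute(
        f"INSERT INTO table2 VALUES (NULL, ?, {placeholders})", (year, *values)
    )

# Row-per-year layout: the report is a straight select.
select_list = ", ".join(f"{m}_info" for m in MONTHS)
wide_report = conn.execute(
    f"SELECT year, {select_list} FROM table2 ORDER BY year"
).fetchall()

# Row-per-month layout: pivot twelve rows into twelve columns per year.
pivot_list = ", ".join(
    f"MAX(CASE WHEN month = {i} THEN info END) AS {m}_info"
    for i, m in enumerate(MONTHS, start=1)
)
long_report = conn.execute(
    f"SELECT year, {pivot_list} FROM table1 GROUP BY year ORDER BY year"
).fetchall()

print(wide_report == long_report)  # True
```

Both queries return the same rows; the difference is how much SQL it takes to get there, and whether one is faster on real data would still need to be measured.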

#6

In general I'd say one record per month as the more general solution.


One important issue is whether "info" is and must logically always be a single field. If there are really several pieces of data per month, or if it's at all likely that in the future there will be, then putting them all in one table gets to be a major pain.

Another question is what you will do with this data. You don't say what "info" is, so just for purposes of discussion let's suppose it's "sales for the month". Will you ever want to ask, "In what months did we have over $1,000,000 in sales?" With one record per month, this is an easy query: "select year, month from sales where month_sales>1000000". Now try doing that with the year table: "select year, 'Jan' from year_sales where jan_sales>1000000 union select year, 'Feb' from year_sales where feb_sales>1000000 union select year, 'Mar' from year_sales where mar_sales>1000000 union ..." etc. Or maybe you'd prefer "select year, case when jan_sales>1000000 then 'Jan=yes' else 'Jan=no' end, case when feb_sales>1000000 then 'Feb=yes' else 'Feb=no' end ... for the remaining months ... from year_sales where jan_sales>1000000 or feb_sales>1000000 or mar_sales>1000000 ..." Yuck.
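The two queries above can be tried side by side. This is a sketch against SQLite (through Python's sqlite3 module) rather than MySQL; the table and column names follow the answer, while the sales figures, the single year, and storing `month` as a short text label are assumptions made for the demo.

```python
import sqlite3

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (year INTEGER, month TEXT, month_sales INTEGER)")
sales_columns = ", ".join(f"{m.lower()}_sales INTEGER" for m in MONTHS)
conn.execute(f"CREATE TABLE year_sales (year INTEGER, {sales_columns})")

# Made-up figures: only Mar and Nov of 2010 exceed 1,000,000.
figures = {m: 500_000 for m in MONTHS}
figures["Mar"] = 1_200_000
figures["Nov"] = 2_000_000

conn.executemany("INSERT INTO sales VALUES (2010, ?, ?)", list(figures.items()))
placeholders = ", ".join("?" for _ in MONTHS)
conn.execute(
    f"INSERT INTO year_sales VALUES (2010, {placeholders})",
    [figures[m] for m in MONTHS],
)

# One record per month: a single, short WHERE clause.
per_month = conn.execute(
    "SELECT year, month FROM sales WHERE month_sales > 1000000"
).fetchall()

# One record per year: twelve SELECTs glued together with UNION.
union_sql = " UNION ".join(
    f"SELECT year, '{m}' FROM year_sales WHERE {m.lower()}_sales > 1000000"
    for m in MONTHS
)
per_year = conn.execute(union_sql).fetchall()

print(sorted(per_month))  # [(2010, 'Mar'), (2010, 'Nov')]
print(sorted(per_year))   # [(2010, 'Mar'), (2010, 'Nov')]
```

The UNION is generated in a loop here only to keep the example short; written by hand it is the twelve-branch statement the answer complains about.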

Having many small records is not that much more of a resource burden than having fewer but bigger records. Yes, the total disk space requirement will surely be more because of per-record overhead, and index searches will be somewhat slower because the index will be larger. But the difference is likely to be minor, and frankly there are so many factors in database performance that this sort of thing is hard to predict.


But I have to admit that I just faced a very similar problem and went the other way: I needed a set of flags for each day of the week, saying "are you working on this day". I wrestled with whether to create a separate table with one record per day, but I ended up putting seven fields into a single record. My thinking is that there will never be additional data for each day without some radical change in the design, and I have no reason to ever want to look at just one day. The days are used for calculating a schedule and assigning due dates, so I can't imagine, in the context of this application, ever wanting to say "give me all the people who are working on Tuesday". But I can readily imagine the same data in a different application being used with precisely that question.

