What is the more efficient (in terms of query performance) database table design - long or wide?
I.e., this
id  size  price
1   S     12.4
1   M     23.1
1   L     33.3
2   S     3.3
2   M     5.3
2   L     11.0
versus this
id  S     M     L
1   12.4  23.1  33.3
2   3.3   5.3   11.0
Generally (I reckon) it comes down to comparing the performance of GROUP BY with that of selecting the columns directly:
SELECT AVG(price) FROM table GROUP BY size
or
SELECT AVG(S), AVG(M), AVG(L) FROM table
The second one is a bit longer to write (when there are many columns), but how do the two compare in performance? If possible, what are the general advantages and disadvantages of each of these table formats?
3 answers
#1
6
First of all, these are two different data models suitable for different purposes.
That being said, I'd expect [1] the second model to be faster for aggregation, simply because the data is packed more compactly and therefore needs less I/O:
- The GROUP BY in the first model can be satisfied by a full scan on the index {size, price}. The alternative to an index is too slow when the data is too large to fit in RAM.
- The query in the second model can be satisfied by a full table scan. No index needed [2].
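To make the two access paths concrete, here is a minimal sketch using Python's built-in sqlite3 module and the sample data from the question. The table names `prices_long` and `prices_wide` and the index name `idx_size_price` are made up for illustration, and SQLite is only a stand-in: the plans chosen by other DBMSs, and the plan text itself, will differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Long model: one row per (id, size). Names are hypothetical.
cur.execute("CREATE TABLE prices_long (id INTEGER, size TEXT, price REAL)")
cur.executemany(
    "INSERT INTO prices_long VALUES (?, ?, ?)",
    [(1, "S", 12.4), (1, "M", 23.1), (1, "L", 33.3),
     (2, "S", 3.3), (2, "M", 5.3), (2, "L", 11.0)],
)
# Index on {size, price}, so the GROUP BY can be answered from the index alone.
cur.execute("CREATE INDEX idx_size_price ON prices_long (size, price)")

# Wide model: one row per id, one column per size. No secondary index.
cur.execute(
    "CREATE TABLE prices_wide (id INTEGER PRIMARY KEY, S REAL, M REAL, L REAL)"
)
cur.executemany(
    "INSERT INTO prices_wide VALUES (?, ?, ?, ?)",
    [(1, 12.4, 23.1, 33.3), (2, 3.3, 5.3, 11.0)],
)

long_query = "SELECT size, AVG(price) FROM prices_long GROUP BY size"
wide_query = "SELECT AVG(S), AVG(M), AVG(L) FROM prices_wide"

# Both queries yield the same three averages.
long_avg = dict(cur.execute(long_query))
s, m, l = cur.execute(wide_query).fetchone()
wide_avg = {"S": s, "M": m, "L": l}
print(long_avg)
print(wide_avg)

# Show how SQLite intends to execute each query (last column is the detail text).
for query in (long_query, wide_query):
    for row in cur.execute("EXPLAIN QUERY PLAN " + query):
        print(row[-1])
```

On six rows the timing difference is meaningless; the point is only to compare the plans. Any performance conclusion needs a representative data volume.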
Since the first approach requires table + index and the second one just the table, the cache utilization is better in the second case. Even if we disregard caching and compare the index (without table) in the first model with the table in the second model, I suspect the index will be larger than the table, simply because it physically records the size and has unused "holes" typical for B-Trees (though the same is true for the table if it is clustered).
And finally, the second model does not have the index maintenance overhead, which could impact the INSERT/UPDATE/DELETE performance.
Other than that, you can consider caching the SUM and COUNT in a separate table containing just one row. Update both the SUM and COUNT via triggers whenever a row is inserted, updated or deleted in the main table. You can then easily get the current AVG, simply by dividing SUM by COUNT.
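A rough sketch of that trigger-maintained cache, again with sqlite3. The names `prices`, `price_stats`, `total`, `n` and the trigger names are invented for this example, and it assumes `price` is never NULL. Trigger syntax varies between DBMSs, so treat this as an outline rather than portable SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prices (id INTEGER, size TEXT, price REAL NOT NULL);

-- Single-row cache holding the running SUM and COUNT (hypothetical names).
CREATE TABLE price_stats (total REAL NOT NULL, n INTEGER NOT NULL);
INSERT INTO price_stats VALUES (0, 0);

CREATE TRIGGER prices_after_insert AFTER INSERT ON prices BEGIN
    UPDATE price_stats SET total = total + NEW.price, n = n + 1;
END;

CREATE TRIGGER prices_after_update AFTER UPDATE OF price ON prices BEGIN
    UPDATE price_stats SET total = total - OLD.price + NEW.price;
END;

CREATE TRIGGER prices_after_delete AFTER DELETE ON prices BEGIN
    UPDATE price_stats SET total = total - OLD.price, n = n - 1;
END;
""")

conn.executemany(
    "INSERT INTO prices VALUES (?, ?, ?)",
    [(1, "S", 12.4), (1, "M", 23.1), (1, "L", 33.3),
     (2, "S", 3.3), (2, "M", 5.3), (2, "L", 11.0)],
)
# Exercise the update and delete triggers as well.
conn.execute("UPDATE prices SET price = 10.4 WHERE id = 1 AND size = 'S'")
conn.execute("DELETE FROM prices WHERE id = 2 AND size = 'L'")

# Cached average: a single-row read instead of a scan of the main table.
cached_avg = conn.execute("SELECT total / n FROM price_stats").fetchone()[0]
# Reference value computed the slow way, for comparison only.
true_avg = conn.execute("SELECT AVG(price) FROM prices").fetchone()[0]
print(cached_avg, true_avg)
```

Two caveats with this design: every write to the main table now also updates the cache row, which can become a contention point under concurrent writes, and a floating-point running total can drift slightly from a freshly computed SUM over many updates.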
[1] But you should really measure on representative amounts of data to be sure.
[2] Since there is no WHERE clause in your query, all rows will be scanned. Indexes are only useful for getting a relatively small subset of a table's rows (and sometimes for index-only scans). As a rough rule of thumb, if more than 10% of the rows in the table are needed, indexes won't help and the DBMS will often opt for a full table scan even when indexes are available.
#2
2
The first option results in more rows and will generally be slower than the second option.
However, as Deltalima also indicated, the first option is more flexible, not only when it comes to different query options, but also if you one day need to extend the table with other sizes, colors, etc.
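The extensibility difference can be sketched as follows. The table names and the new size `XL` with its price of 41.0 are invented for illustration: in the long format a new size is just another row, while in the wide format it needs a schema change.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prices_long (id INTEGER, size TEXT, price REAL);
INSERT INTO prices_long VALUES (1, 'S', 12.4), (1, 'M', 23.1), (1, 'L', 33.3);

CREATE TABLE prices_wide (id INTEGER PRIMARY KEY, S REAL, M REAL, L REAL);
INSERT INTO prices_wide VALUES (1, 12.4, 23.1, 33.3);
""")

# Long format: adding a size is plain data. Existing queries such as
# "SELECT size, AVG(price) ... GROUP BY size" pick it up unchanged.
conn.execute("INSERT INTO prices_long VALUES (1, 'XL', 41.0)")
long_sizes = [
    row[0]
    for row in conn.execute("SELECT DISTINCT size FROM prices_long ORDER BY size")
]

# Wide format: adding a size is a schema change, and every query that
# lists the size columns explicitly has to be edited as well.
conn.execute("ALTER TABLE prices_wide ADD COLUMN XL REAL")
conn.execute("UPDATE prices_wide SET XL = 41.0 WHERE id = 1")
wide_columns = [row[1] for row in conn.execute("PRAGMA table_info(prices_wide)")]

print(long_sizes)
print(wide_columns)
```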
Unless you have a very large dataset or need ultra-fast lookup time, you'll probably be better off with the first option.
If you do have or need a very large dataset, you may be better off creating a table with pre-calculated summary values.
#3
1
The long format is more flexible in use. It allows you to filter on size, for example:
SELECT MAX(price) FROM table WHERE size='L'
It also allows for indexing on size and on id. This speeds up the GROUP BY and any queries where other tables are joined on id and/or size, such as a product stock table.
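As an illustration of the filtering and joining described above, here is a small sqlite3 sketch. The `stock` table, its `quantity` column and its contents are hypothetical, as are the table and index names. Whether an index actually speeds up a given query depends on the data volume and the DBMS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Long-format price table, keyed on (id, size), plus an index on size.
CREATE TABLE prices (
    id INTEGER, size TEXT, price REAL,
    PRIMARY KEY (id, size)
);
CREATE INDEX idx_prices_size ON prices (size);

-- Hypothetical stock table sharing the same (id, size) key.
CREATE TABLE stock (
    id INTEGER, size TEXT, quantity INTEGER,
    PRIMARY KEY (id, size)
);

INSERT INTO prices VALUES
    (1, 'S', 12.4), (1, 'M', 23.1), (1, 'L', 33.3),
    (2, 'S', 3.3),  (2, 'M', 5.3),  (2, 'L', 11.0);
INSERT INTO stock VALUES (1, 'L', 4), (2, 'L', 0), (2, 'S', 7);
""")

# Filter on size: only the 'L' rows are relevant.
max_large = conn.execute(
    "SELECT MAX(price) FROM prices WHERE size = 'L'"
).fetchone()[0]

# Join on id and size: prices of the items that are currently in stock.
in_stock = conn.execute("""
    SELECT p.id, p.size, p.price, s.quantity
    FROM prices AS p
    JOIN stock AS s ON s.id = p.id AND s.size = p.size
    WHERE s.quantity > 0
    ORDER BY p.id, p.size
""").fetchall()

print(max_large)
print(in_stock)
```

With the wide format, the same join would need one branch or CASE expression per size column, because the size is encoded in the column name rather than stored as a value.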