I was wondering what the best approach would be for the following situation:
I have an Orders table in a database that contains all orders. These are literally ALL orders, including the finished ones, which are just flagged as 'complete'. From the open orders I want to calculate some figures (open amount, open items, etc.). Which would be better performance-wise:
Keep 1 Orders table with ALL orders, including the complete/archived ones, and do calculations by filtering the 'complete' flag?
Or should I create another table, e.g. 'Orders_Archive', so that the Orders table would only contain open orders that I use for the calculations?
Is there any (clear) performance difference in these approaches?
(BTW, I'm on a PostgreSQL database.)
4 Solutions
#1
5
Or should I create another table, e.g. 'Orders_Archive', so that the Orders table would only contain open orders that I use for the calculations?
Yes. They call that data warehousing. Folks do this because moving rarely used history out of the transaction system speeds it up. First, the tables are physically smaller and process faster. Second, a long-running history report doesn't interfere with transactional processing.
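A minimal sketch of moving completed rows into an archive table in a single statement. It assumes PostgreSQL 9.1 or later (for data-modifying CTEs), a boolean `completed` column on `orders`, and an `orders_archive` table with the same column layout; none of these names come from the original schema.

```sql
-- Assumption: orders_archive mirrors orders column for column.
CREATE TABLE IF NOT EXISTS orders_archive (LIKE orders INCLUDING ALL);

-- Move completed orders atomically: the rows deleted from orders
-- are inserted into orders_archive within the same statement.
WITH moved AS (
    DELETE FROM orders
    WHERE completed          -- hypothetical boolean flag
    RETURNING *
)
INSERT INTO orders_archive
SELECT * FROM moved;
```

If other tables reference `orders` through foreign keys, the delete will fail or cascade depending on how those constraints are defined, so this would need adapting to the real schema.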
Is there any (clear) performance difference in these approaches?
Yes. Bonus. You can restructure your history so that it's no longer in 3NF (for updating) but in a Star Schema (for reporting). The advantages are huge.
Buy Kimball's The Data Warehouse Toolkit book to learn more about star schema design and migrating history out of active tables into warehouse tables.
#2
7
This is a common problem in database design: The question of whether to separate or "archive" records that are no longer "active".
The most common approaches are:
- Everything in one table, mark orders as "complete" as appropriate. Pros: Simplest solution (both code- and structure-wise), good flexibility (e.g. easy to "resurrect" orders). Cons: Tables can get quite large, a problem both for queries and for e.g. backups.
- Archive old stuff to a separate table. Solves the problems of the first approach, at the cost of greater complexity.
- Use a table with value-based partitioning. That means logically (to the application) everything is in one table, but behind the scenes the DBMS puts rows into separate areas depending on the value(s) in some column(s). You'd probably use the "complete" column, or the "order completion date", for the partitioning.
The last approach kind of combines the good parts of the first two, but needs support in the DBMS and is more complex to set up.
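As a rough illustration of the partitioning approach in PostgreSQL, here is a sketch using declarative list partitioning on the completion flag. It assumes PostgreSQL 11 or later (primary keys on partitioned tables and automatic row movement on update); the table layout and column names are made up for the example.

```sql
-- Hypothetical schema; column names are assumptions for illustration.
CREATE TABLE orders (
    order_id   bigint        NOT NULL,
    orderdate  date          NOT NULL,
    amount     numeric(12,2) NOT NULL,
    completed  boolean       NOT NULL DEFAULT false,
    PRIMARY KEY (order_id, completed)  -- the key must include the partition column
) PARTITION BY LIST (completed);

CREATE TABLE orders_open     PARTITION OF orders FOR VALUES IN (false);
CREATE TABLE orders_complete PARTITION OF orders FOR VALUES IN (true);

-- With a filter on the partition column, the planner can skip
-- the orders_complete partition.
SELECT count(*) AS open_orders, sum(amount) AS open_amount
FROM orders
WHERE completed = false;
```

One trade-off to be aware of: because the primary key has to include `completed`, uniqueness of `order_id` alone is no longer enforced by this constraint.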
Note:
Tables that only store "archived" data are commonly referred to as "archive tables". Some DBMS even provide special storage engines for these tables (e.g. MySQL), which are optimized to allow quick retrieval and good storage efficiency, at the cost of slow changes/inserts.
#3
3
Never split off or separate current/archived data. It is simply incorrect. It may be called "data warehousing" or a bucket of fish, but it is wrong, unnecessary, and creates problems which were not otherwise present. The result is:
- everyone who queries the data now has to look for it in two places rather than one
- and worse, do the addition of aggregated values manually (in Excel or whatever)
- you introduce anomalies in the key and lose integrity (uniqueness would otherwise be enforced by a single db constraint)
- when a Completed Order (or many) needs to be changed, you have to fish it out of the "warehouse" and put it back in the "database"
If, and only if the response on the table is slow, then address that, and enhance the speed. Only. Nothing else. This (in every case I have seen) is an indexing error (a missing index or the incorrect columns or the incorrect sequence of columns are all errors). Generally, all you will need is the IsComplete column in an index, along with whatever your users use to search most frequently, to in/exclude Open/Complete Orders.
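The indexing advice above could look something like this. It is only a sketch: `iscomplete`, `customer_id`, and `amount` are hypothetical columns standing in for the flag, for whatever users search on most often, and for the value being summed.

```sql
-- Hypothetical columns: iscomplete (boolean) plus the most
-- frequently searched column, here assumed to be customer_id.
CREATE INDEX orders_iscomplete_customer_ix
    ON orders (iscomplete, customer_id);

-- Open-order totals for one customer; the filter uses equality
-- on both leading index columns.
SELECT count(*) AS open_orders, sum(amount) AS open_amount
FROM orders
WHERE iscomplete = false
  AND customer_id = 42;   -- illustrative value
```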
Now, if your dbms platform cannot handle large tables, or large result sets, that is a different problem, and you have to use whatever methods are available in the tool. But as a database design issue, it is simply wrong; there is no need to create a duplicate, populate it, and maintain it (with all the ensuing problems) except if you are limited by your platform.
Both last year and this, as part of an ordinary performance assignment, I have consolidated such split tables with billions of rows (and had to resolve all the duplicate row problems that allegedly "did not exist", yeah right, 2 days just for that). The consolidated tables with the corrected indices were faster than the split tables; the excuse that "billions of rows slowed the table down" was completely false. The users love me because they no longer have to use two tools and query two "databases" to get what they need.
#4
1
Since you are using PostgreSQL, you can take advantage of a partial index. Suppose you often query unfinished orders by orderdate; you can define the index like this:
create index order_orderdate_unfinished_ix on orders ( orderdate )
where completed is null or completed = 'f';
With that condition, PostgreSQL will not index the completed orders, which saves disk space and makes the index much faster because it contains only a small amount of data. So you get the benefit without the hassle of table separation.
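For the partial index to be considered, the query's WHERE clause has to imply the index predicate. A simple way to check, sketched here with an illustrative date, is to repeat the predicate in the query and look at the plan; whether the planner actually chooses the index still depends on table statistics.

```sql
-- The WHERE clause repeats the index predicate so the planner can
-- prove that the partial index covers every row the query needs.
EXPLAIN
SELECT orderdate, count(*) AS open_orders
FROM orders
WHERE (completed IS NULL OR completed = 'f')
  AND orderdate >= DATE '2011-01-01'   -- illustrative date
GROUP BY orderdate;
```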
When you separate data into ORDERS and ORDERS_ARCHIVE, you will have to adjust existing reports. If you have lots of reports, that can be painful.
See the full description of partial indexes on this page: http://www.postgresql.org/docs/9.0/static/indexes-partial.html
EDIT: for archiving, I prefer to create another database with identical schema, then move the old data from transaction db to this archive db.