Database structure for storing historical data

Preface: The other day I was thinking about the database structure for a new application and realized that we need an efficient way to store historical data. I would like someone else to take a look and see whether there are any problems with this structure. This method of storing data has almost certainly been invented before, but I have no idea whether it has a name, and the Google searches I tried didn't turn anything up.

Problem: Let's say you have a table for orders, and each order is related to a row in a customers table for the customer who placed it. In a normal database structure you might expect something like this:

orders
------
orderID
customerID


customers
---------
customerID
address
address2
city
state
zip

Pretty straightforward: the orders table has a foreign key, customerID, which is the primary key of the customers table. But if we run a report over the orders table, we join the customers table to the orders table, which brings back the current record for that customer ID. What if the customer's address was different when the order was placed and has since been changed? Now our order no longer reflects the customer's address at the time the order was placed. Basically, by changing the customer record, we just changed all history for that customer.
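The problem can be reproduced in a few lines. This is only a sketch, assuming SQLite through Python's sqlite3 module; the sample rows and the trimmed-down column list are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customerID INTEGER PRIMARY KEY,
        address    TEXT,
        city       TEXT
    );
    CREATE TABLE orders (
        orderID    INTEGER PRIMARY KEY,
        customerID INTEGER REFERENCES customers(customerID)
    );
    INSERT INTO customers VALUES (1, '1 Old Street', 'Springfield');
    INSERT INTO orders VALUES (100, 1);
""")

# The report joins each order to whatever the customer row says right now.
REPORT = """
    SELECT o.orderID, c.address
    FROM orders o
    JOIN customers c ON c.customerID = o.customerID
"""

before = conn.execute(REPORT).fetchall()

# The customer moves after the order was placed.
conn.execute("UPDATE customers SET address = '2 New Avenue' WHERE customerID = 1")

after = conn.execute(REPORT).fetchall()

print(before)  # address the order was actually placed with
print(after)   # the same order now reports the new address
```

The same report returns a different address for order 100 before and after the update, even though the order itself never changed.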

Now there are several ways around this, one of which would be to copy the record when an order is created. What I have come up with, though, is what I think would be an easier and perhaps slightly more elegant way to do this, with the added bonus of logging every change.

What if I used a structure like this instead:

orders
------
orderID
customerID
customerHistoryID


customers
---------
customerID
customerHistoryID


customerHistory
--------
customerHistoryID
customerID
address
address2
city
state
zip
updatedBy
updatedOn

Please forgive the formatting, but I think you can see the idea. Any time a customer is inserted or updated, a new customerHistory row is written with the next customerHistoryID, and the customers table is updated to point at that latest customerHistoryID. The orders table now points not only to the customerID (which lets you see all revisions of the customer record) but also to the customerHistoryID, which identifies one specific revision of the record. Now the order reflects the state of the data at the time the order was created.
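A minimal sketch of this structure, assuming SQLite through Python's sqlite3 module. The helper names `save_customer` and `place_order`, the sample data, and the shortened column list are hypothetical and only there to show the flow.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customerHistory (
        customerHistoryID INTEGER PRIMARY KEY AUTOINCREMENT,
        customerID        INTEGER NOT NULL,
        address           TEXT,
        city              TEXT,
        updatedBy         TEXT,
        updatedOn         TEXT DEFAULT CURRENT_TIMESTAMP
    );
    CREATE TABLE customers (
        customerID        INTEGER PRIMARY KEY,
        customerHistoryID INTEGER REFERENCES customerHistory(customerHistoryID)
    );
    CREATE TABLE orders (
        orderID           INTEGER PRIMARY KEY,
        customerID        INTEGER REFERENCES customers(customerID),
        customerHistoryID INTEGER REFERENCES customerHistory(customerHistoryID)
    );
""")


def save_customer(customer_id, address, city, user):
    """Write a new revision and point the customers row at it (hypothetical helper)."""
    with conn:  # both statements commit or roll back together
        cur = conn.execute(
            "INSERT INTO customerHistory (customerID, address, city, updatedBy) "
            "VALUES (?, ?, ?, ?)",
            (customer_id, address, city, user),
        )
        history_id = cur.lastrowid
        updated = conn.execute(
            "UPDATE customers SET customerHistoryID = ? WHERE customerID = ?",
            (history_id, customer_id),
        )
        if updated.rowcount == 0:  # first revision: the customer does not exist yet
            conn.execute(
                "INSERT INTO customers (customerID, customerHistoryID) VALUES (?, ?)",
                (customer_id, history_id),
            )
    return history_id


def place_order(order_id, customer_id):
    """Pin the order to the customer's current revision (hypothetical helper)."""
    with conn:
        conn.execute(
            "INSERT INTO orders (orderID, customerID, customerHistoryID) "
            "SELECT ?, customerID, customerHistoryID FROM customers WHERE customerID = ?",
            (order_id, customer_id),
        )


save_customer(1, "1 Old Street", "Springfield", "alice")
place_order(100, 1)
save_customer(1, "2 New Avenue", "Springfield", "bob")
place_order(101, 1)

# Each order joins to the revision that was current when it was placed.
order_addresses = conn.execute("""
    SELECT o.orderID, h.address
    FROM orders o
    JOIN customerHistory h ON h.customerHistoryID = o.customerHistoryID
    ORDER BY o.orderID
""").fetchall()

# The history table doubles as a change log for the customer.
audit = conn.execute(
    "SELECT updatedBy FROM customerHistory WHERE customerID = 1 "
    "ORDER BY customerHistoryID"
).fetchall()

print(order_addresses)
print(audit)
```

Order 100 keeps the old address and order 101 gets the new one, without copying any address columns into the orders table.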

By adding updatedBy and updatedOn columns to the customerHistory table, you also get an "audit log" of the data, so you can see who made each change and when.

One potential downside could be deletes, but I am not really worried about that here, as nothing should ever be deleted. Even so, the same effect could be achieved with an activeFlag or something like it, depending on the domain of the data.

My thought is that all tables would use this structure. Any time historical data is retrieved, it would be joined against the history table using the customerHistoryID to show the state of the data for that particular order.

Retrieving a list of customers is easy: it just takes a join from the customers table to the customerHistory table on customerHistoryID.
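As a rough illustration of that join, assuming SQLite through Python's sqlite3 module and a cut-down version of the tables with invented sample rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customerHistory (
        customerHistoryID INTEGER PRIMARY KEY,
        customerID        INTEGER NOT NULL,
        address           TEXT
    );
    CREATE TABLE customers (
        customerID        INTEGER PRIMARY KEY,
        customerHistoryID INTEGER REFERENCES customerHistory(customerHistoryID)
    );
    -- Customer 1 has two revisions; customer 2 has one.
    INSERT INTO customerHistory VALUES
        (1, 1, '1 Old Street'),
        (2, 2, '9 Elm Road'),
        (3, 1, '2 New Avenue');
    INSERT INTO customers VALUES (1, 3), (2, 2);
""")

# customers holds one row per customer, so the join yields only the latest revision.
current = conn.execute("""
    SELECT c.customerID, h.address
    FROM customers c
    JOIN customerHistory h ON h.customerHistoryID = c.customerHistoryID
    ORDER BY c.customerID
""").fetchall()

print(current)
```

Superseded revisions stay in customerHistory but never show up in the current list, because no customers row points at them.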

Can anyone see any problems with this approach, either from a design standpoint or for performance reasons? Remember, no matter what I do I need to make sure that the historical data is preserved so that subsequent updates to records do not change history. Is there a better way? Is this a known idea that has a name or any documentation?

Thanks for any help.

Update: This is a very simplified example of what I am really going to have. My real application will have "orders" with several foreign keys to other tables: origin/destination location information, customer information, facility information, user information, etc. It has been suggested a couple of times that I could copy the information into the order record at that point, and I have seen it done this way many times, but this would result in a record with hundreds of columns, which really isn't feasible in this case.

7 Answers

#1

When I've encountered such problems, one alternative is to make the order itself the history table. It functions the same way but is a little easier to follow.

orders
------
orderID
customerID
address
City
state
zip



customers
---------
customerID
address
City
state
zip
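A small sketch of how the copy could happen when the order is created, assuming SQLite through Python's sqlite3 module. The `place_order` helper and the sample row are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customerID INTEGER PRIMARY KEY,
        address TEXT, city TEXT, state TEXT, zip TEXT
    );
    CREATE TABLE orders (
        orderID    INTEGER PRIMARY KEY,
        customerID INTEGER REFERENCES customers(customerID),
        address TEXT, city TEXT, state TEXT, zip TEXT
    );
    INSERT INTO customers VALUES (1, '1 Old Street', 'Springfield', 'IL', '62701');
""")


def place_order(order_id, customer_id):
    """Copy the customer's current address into the new order row (hypothetical helper)."""
    with conn:
        conn.execute(
            "INSERT INTO orders (orderID, customerID, address, city, state, zip) "
            "SELECT ?, customerID, address, city, state, zip "
            "FROM customers WHERE customerID = ?",
            (order_id, customer_id),
        )


place_order(100, 1)

# A later change to the customer does not touch the order's own copy.
conn.execute("UPDATE customers SET address = '2 New Avenue' WHERE customerID = 1")

shipped = conn.execute("SELECT orderID, address FROM orders").fetchall()
print(shipped)
```

No join is needed to read the address an order was placed with; the cost is the extra columns on every order row.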

EDIT: If the number of columns gets too high for your liking, you can separate it out however you like.

If you do go with the other option and use history tables, you should consider using bitemporal data, since you may have to deal with the possibility that historical data needs to be corrected. For example, a customer changes his current address from A to B, but you also have to correct the address on an existing order that is currently being fulfilled.

Also, if you are using MS SQL Server you might want to consider indexed views. They let you trade a small incremental insert/update performance decrease for a large select performance increase. If you're not using MS SQL Server, you can replicate this using triggers and tables.

#2

When you are designing your data structures, be very careful to store the correct relationships, not something that is merely similar to the correct relationships. If the address for an order needs to be maintained, that is because the address is part of the order, not the customer. Likewise, unit prices are part of the order, not the product, etc.

Try an arrangement like this:

Customer
--------
CustomerId (PK)
Name
AddressId (FK)
PhoneNumber
Email

Order
-----
OrderId (PK)
CustomerId (FK)
ShippingAddressId (FK)
BillingAddressId (FK)
TotalAmount

Address
-------
AddressId (PK)
AddressLine1
AddressLine2
City
Region
Country
PostalCode

OrderLineItem
-------------
OrderId (PK) (FK)
OrderItemSequence (PK)
ProductId (FK)
UnitPrice
Quantity

Product
-------
ProductId (PK)
Price

etc.
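One way to read this arrangement, sketched with SQLite through Python's sqlite3 module. It rests on an assumption the answer does not spell out: Address rows are never updated in place, so a move inserts a new Address row and repoints the customer. Prices are stored as integer cents, the sample data is invented, and only a subset of the columns is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Address (
        AddressId    INTEGER PRIMARY KEY,
        AddressLine1 TEXT,
        City         TEXT
    );
    CREATE TABLE Customer (
        CustomerId INTEGER PRIMARY KEY,
        Name       TEXT,
        AddressId  INTEGER REFERENCES Address(AddressId)
    );
    CREATE TABLE Product (
        ProductId INTEGER PRIMARY KEY,
        Price     INTEGER  -- cents
    );
    -- "Order" is a reserved word, so it has to be quoted.
    CREATE TABLE "Order" (
        OrderId           INTEGER PRIMARY KEY,
        CustomerId        INTEGER REFERENCES Customer(CustomerId),
        ShippingAddressId INTEGER REFERENCES Address(AddressId)
    );
    CREATE TABLE OrderLineItem (
        OrderId           INTEGER REFERENCES "Order"(OrderId),
        OrderItemSequence INTEGER,
        ProductId         INTEGER REFERENCES Product(ProductId),
        UnitPrice         INTEGER,  -- cents, copied from Product at order time
        Quantity          INTEGER,
        PRIMARY KEY (OrderId, OrderItemSequence)
    );

    INSERT INTO Address  VALUES (1, '1 Old Street', 'Springfield');
    INSERT INTO Customer VALUES (1, 'Ann', 1);
    INSERT INTO Product  VALUES (7, 500);

    -- Place the order: reference the address row, copy the price.
    INSERT INTO "Order" VALUES (100, 1, 1);
    INSERT INTO OrderLineItem
        SELECT 100, 1, ProductId, Price, 2 FROM Product WHERE ProductId = 7;

    -- Later: the customer moves (new Address row) and the price goes up.
    INSERT INTO Address VALUES (2, '2 New Avenue', 'Springfield');
    UPDATE Customer SET AddressId = 2 WHERE CustomerId = 1;
    UPDATE Product  SET Price = 650 WHERE ProductId = 7;
""")

order_view = conn.execute("""
    SELECT a.AddressLine1, li.UnitPrice * li.Quantity AS LineTotal
    FROM "Order" o
    JOIN Address a        ON a.AddressId = o.ShippingAddressId
    JOIN OrderLineItem li ON li.OrderId  = o.OrderId
    WHERE o.OrderId = 100
""").fetchall()

print(order_view)
```

The order still shows the address and unit price it was placed with, because both belong to the order rather than to the customer or the product.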

If you truly need to store history for something, like tracking changes to an order over time, then you should do that with a log or audit table, not with your transaction tables.

#3

Normally orders simply store the information as it is at the time of the order. This is especially true of things like part numbers, part names and prices, as well as customer address and name. Then you don't have to join to five or six tables to get the information that can be stored in one. This is not denormalization, as you actually need to have the information as it existed at the time of the order. I also think that having this information in the order and order detail (stores the individual items ordered) tables is less risky in terms of accidental changes to the data.

Your order table would not have hundreds of columns. You would have an order table and an order detail table because of the one-to-many relationship. The order table would include order number, customer ID (so you can search for everything this customer has ever ordered even if the name changed), customer name, customer address (note you don't need city, state, zip etc.; put the address in one field), order date and possibly a few other fields that relate directly to the order at a top level. Then you have an order detail table that has order number, detail_id, part number, part description (this can be a consolidation of a bunch of fields like size, color etc., or you can separate out the most common), number of items, unit type, price per unit, taxes, total price, ship date, status. You put one entry in for each item ordered.

#4

I myself like to keep it simple. I would use two tables, a customer table and a customer history table. If you have the key (e.g. customerId) in the history table there is no reason to make a joining table; a select on that key will give you all records.

You also don't have audit information (e.g. date modified, who modified, etc.) in the history table as you show it; I expect you want this.

So mine would look something like this:

CustomerTable  (this contains current customer information)
CustID (distinct non null)
...all customer information fields

CustomerHistoryTable
CustId (not distinct non null)
...all customer information fields
DateOfChange 
WhoChanged

The DateOfChange field is the date the customer table was changed from the values in this record to the values in a more recent record, or to the current values in CustomerTable.

Your orders table just needs a CustomerID. If you need to find the customer information as of the time of the order, it is a simple select.
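A sketch of what that select might look like, assuming SQLite through Python's sqlite3 module. It relies on things the answer leaves open, so treat them as assumptions: the order carries an order date, dates are ISO-8601 strings, and an order placed on the day of a change gets the new values. The `address_at` helper and the single `Address` column are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CustomerTable (
        CustID  INTEGER PRIMARY KEY,
        Address TEXT
    );
    CREATE TABLE CustomerHistoryTable (
        CustID       INTEGER NOT NULL,
        Address      TEXT,   -- the values that were replaced
        DateOfChange TEXT,   -- when they were replaced
        WhoChanged   TEXT
    );
    INSERT INTO CustomerTable VALUES (1, '3 Third Road');
    INSERT INTO CustomerHistoryTable VALUES
        (1, '1 First Street',  '2020-03-01', 'alice'),
        (1, '2 Second Avenue', '2021-06-15', 'bob');
""")


def address_at(cust_id, order_date):
    """Return the address in effect on order_date (hypothetical helper)."""
    # The history row that was replaced soonest after the order date
    # holds the values that were current on that date.
    row = conn.execute(
        "SELECT Address FROM CustomerHistoryTable "
        "WHERE CustID = ? AND DateOfChange > ? "
        "ORDER BY DateOfChange LIMIT 1",
        (cust_id, order_date),
    ).fetchone()
    if row is None:
        # No later change exists, so the current record applies.
        row = conn.execute(
            "SELECT Address FROM CustomerTable WHERE CustID = ?", (cust_id,)
        ).fetchone()
    return row[0] if row else None


print(address_at(1, "2020-01-10"))
print(address_at(1, "2020-12-01"))
print(address_at(1, "2022-01-01"))
```

Note that the lookup needs two steps (history first, then the current table) because the current values live in a different table from the superseded ones.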

#5

What you want is called a data warehouse. Since data warehouses are OLAP and not OLTP, it is recommended to have as many columns as you need in order to achieve your goals. In your case the orders table in the data warehouse will have 11 fields so that it holds a 'snapshot' of each order as it comes in, regardless of later updates to user accounts.

Wiley - The Data Warehouse Toolkit, Second Edition

It's a good start.

#6

Our payroll system uses effective dates in many tables. The ADDRESSES table is keyed on EMPLID and EFFDT. This allows us to track every change to an employee's address. You could use the same logic to track historical addresses for customers. Your queries would simply need to include a clause that compares the order date to the customer's effective dates to find the address that was in effect at the time of the order. For example:

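-- For each order, pick the customer row that was in effect on the order date.
select o.orderID, c.customerID, c.address, c.city, c.state, c.zip
from orders o
join customers c on c.customerID = o.customerID
where c.effdt = (
   -- latest effective date on or before the order date
   select max(c1.effdt) from customers c1
   where c1.customerID = c.customerID and c1.effdt <= o.orderdt
)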

The objective is to select the most recent row in customers whose effective date is on or before the date of the order. The same strategy could be used to keep historical information on product prices.
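To try the effective-date query end to end, here is a sketch using SQLite through Python's sqlite3 module. The table layout (customers keyed on customerID plus effdt, ISO-8601 date strings) and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customerID INTEGER,
        effdt      TEXT,   -- date this version of the row takes effect
        address    TEXT,
        PRIMARY KEY (customerID, effdt)
    );
    CREATE TABLE orders (
        orderID    INTEGER PRIMARY KEY,
        customerID INTEGER,
        orderdt    TEXT
    );
    INSERT INTO customers VALUES
        (1, '2020-01-01', '1 Old Street'),
        (1, '2021-06-15', '2 New Avenue');
    INSERT INTO orders VALUES
        (100, 1, '2020-09-30'),
        (101, 1, '2021-06-15'),
        (102, 1, '2022-02-02');
""")

rows = conn.execute("""
    SELECT o.orderID, c.address
    FROM orders o
    JOIN customers c ON c.customerID = o.customerID
    WHERE c.effdt = (
        SELECT MAX(c1.effdt) FROM customers c1
        WHERE c1.customerID = c.customerID AND c1.effdt <= o.orderdt
    )
    ORDER BY o.orderID
""").fetchall()

print(rows)
```

Order 100 resolves to the old address, while orders placed on or after the second effective date resolve to the new one. Nothing has to be written to the order itself beyond its date.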

#7

If you are genuinely interested in such problems, I can only suggest you take a serious look at "Temporal Data and the Relational Model".

Warning 1: There is no SQL in there, and almost anything you think you know about the relational model will be claimed a falsehood. With good reason.

Warning 2: You are expected to think, and think hard.

Warning 3: The book is about what the solution for this particular family of problems ought to look like, but as the introduction says, it is not about any technology available today.

That said, the book is genuine enlightenment. At the very least, it helps to make it clear that the solution for such problems will not be found in SQL as it stands today, or in ORMs as those stand today, for that matter.
