What are the common design approaches taken in loading data from a typical Entity-Relationship OLTP database model into a Kimball star schema Data Warehouse/Marts model?
- Do you use a staging area to perform the transformation and then load into the warehouse?
- How do you link data between the warehouse and the OLTP database?
- Where/How do you manage the transformation process - in the database as sprocs, dts/ssis packages, or SQL from application code?
5 Answers
#1
8
Personally, I tend to work as follows:
- Design the data warehouse first. In particular, design the tables that are needed as part of the DW, ignoring any staging tables.
- Design the ETL, using SSIS, but sometimes with SSIS calling stored procedures in the involved databases.
- If any staging tables are required as part of the ETL, fine, but at the same time make sure they get cleaned up. A staging table used only as part of a single series of ETL steps should be truncated after those steps are completed, with or without success.
- I have the SSIS packages refer to the OLTP database at least to pull data into the staging tables. Depending on the situation, they may process the OLTP tables directly into the data warehouse. All such queries are performed WITH(NOLOCK).
- Document, Document, Document. Make it clear what inputs are used by each package, and where the output goes. Make sure to document the criteria by which the inputs are selected (last 24 hours? since last success? new identity values? all rows?)
This has worked well for me, though I admit I haven't done many of these projects, nor any really large ones.
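The extract-criteria and staging-cleanup steps above can be sketched in miniature. The following is an illustrative sketch only (Python with sqlite3 standing in for the OLTP database and warehouse; all table and column names are hypothetical, not from any real system): it pulls only rows changed since the last successful load into a staging table, loads them into the warehouse, then truncates the staging table.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE oltp_orders (order_id INTEGER PRIMARY KEY, amount REAL,
                              modified_at TEXT);
    CREATE TABLE stg_orders  (order_id INTEGER, amount REAL);
    CREATE TABLE dw_fact_orders (order_id INTEGER, amount REAL);
    -- High-water mark of the last successful load.
    CREATE TABLE etl_log (last_success TEXT);
    INSERT INTO etl_log VALUES ('2024-01-01T00:00:00');
    INSERT INTO oltp_orders VALUES
        (1, 10.0, '2023-12-31T09:00:00'),   -- already loaded last time
        (2, 25.0, '2024-01-02T12:00:00');   -- new since last success
""")

# Extract: only rows modified since the last successful run
# (the "since last success?" criterion from the list above).
db.execute("""
    INSERT INTO stg_orders
    SELECT order_id, amount FROM oltp_orders
    WHERE modified_at > (SELECT last_success FROM etl_log)
""")

# Load: move staged rows into the warehouse fact table.
db.execute("INSERT INTO dw_fact_orders SELECT * FROM stg_orders")

# Clean up: empty the staging table once the steps are done.
db.execute("DELETE FROM stg_orders")

loaded = db.execute("SELECT order_id FROM dw_fact_orders").fetchall()
print(loaded)  # [(2,)] - only the row changed after the high-water mark
```

In a real SSIS package the extract and load would be data-flow tasks and the cleanup an Execute SQL task, but the shape of the logic is the same.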
#2
2
I'm currently working on a small/mid-size data warehouse. We're adopting some of the concepts that Kimball puts forward, i.e. the star schema with fact and dimension tables. We structure it so that facts only join to dimensions (not fact to fact or dimension to dimension - but this is our choice, not saying it's the way it should be done), so we flatten all dimension joins to the fact table.
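To illustrate what "facts only join to dimensions" looks like, here is a minimal sketch (Python with sqlite3; the table and column names are invented for the example): the fact table carries a surrogate key for every dimension it needs, so every join in a query radiates out from the fact table.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date     (date_key INTEGER PRIMARY KEY, calendar_date TEXT);
    -- The fact table references each dimension directly, so queries
    -- never need fact-to-fact or dimension-to-dimension joins.
    CREATE TABLE fact_sales (customer_key INTEGER, date_key INTEGER, amount REAL);
    INSERT INTO dim_customer VALUES (1, 'Acme');
    INSERT INTO dim_date VALUES (20240102, '2024-01-02');
    INSERT INTO fact_sales VALUES (1, 20240102, 99.5);
""")

# Every join goes fact -> dimension.
row = db.execute("""
    SELECT c.name, d.calendar_date, f.amount
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_key = f.customer_key
    JOIN dim_date d     ON d.date_key     = f.date_key
""").fetchone()
print(row)  # ('Acme', '2024-01-02', 99.5)
```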
We use SSIS to move the data from the production DB -> source DB -> staging DB -> reporting DB (we probably could have used fewer DBs, but that's the way it's fallen).
SSIS is really nice as it lets you structure your data flows very logically. We use a combination of SSIS components and stored procs; one nice feature of SSIS is the ability to provide SQL commands as a transform between a source/destination data flow. This means we can call stored procs on every row if we want, which can be useful (albeit a bit slower).
We're also using a new SQL Server 2008 feature called change data capture (CDC), which allows you to audit all changes on a table (you can specify which columns you want to track in those tables). We use it on the production DB to tell what has changed, so we can move just those records across to the source DB for processing.
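CDC itself is a SQL Server engine feature (enabled per table, with change rows read via system functions), so it can't be shown here directly. The sketch below only imitates the *consumption* side of the pattern in Python/sqlite3, with invented table names: a change table records which keys were touched, and the ETL moves only those rows instead of rescanning the whole production table.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE prod_product (product_id INTEGER PRIMARY KEY, price REAL);
    -- Stand-in for a CDC change table: one row per captured change.
    CREATE TABLE cdc_product_changes (product_id INTEGER, operation TEXT);
    CREATE TABLE src_product (product_id INTEGER PRIMARY KEY, price REAL);

    INSERT INTO prod_product VALUES (1, 5.0), (2, 7.5), (3, 9.0);
    -- Only products 2 and 3 changed since the last run.
    INSERT INTO cdc_product_changes VALUES (2, 'update'), (3, 'insert');
""")

# Move only the changed rows across to the source DB.
db.execute("""
    INSERT OR REPLACE INTO src_product
    SELECT p.product_id, p.price
    FROM prod_product p
    JOIN cdc_product_changes c ON c.product_id = p.product_id
""")

moved = db.execute("SELECT product_id FROM src_product ORDER BY 1").fetchall()
print(moved)  # [(2,), (3,)] - product 1 was untouched, so it isn't moved
```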
#3
2
I agree with the highly rated answer but thought I'd add the following:
* Do you use a staging area to perform the transformation and then load into the warehouse?
Whether staging is required depends on the type of transformation. Staging offers the benefit of breaking the ETL into more manageable chunks, and it also provides a working area where the data can be manipulated without affecting the warehouse. It can help to have (at least) some dimension lookup tables in the staging area which store the keys from the OLTP system alongside the key of the latest dimension record, to use as a lookup when loading your fact records. The transformation happens in the ETL process itself, but it may or may not require some staging to help it along the way.
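The dimension-lookup idea can be sketched concretely. In this illustrative example (Python/sqlite3, hypothetical names), a staging lookup table maps each OLTP business key to the current surrogate key in the dimension, and the fact load resolves keys through it:

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- Staging lookup: OLTP business key -> latest surrogate key in the dim.
    CREATE TABLE stg_dim_customer_lookup (business_key TEXT, customer_key INTEGER);
    CREATE TABLE stg_fact_sales (business_key TEXT, amount REAL);
    CREATE TABLE fact_sales (customer_key INTEGER, amount REAL);

    INSERT INTO stg_dim_customer_lookup VALUES ('CUST-001', 42);
    INSERT INTO stg_fact_sales VALUES ('CUST-001', 120.0);
""")

# Resolve each incoming fact's business key to the current surrogate key
# before it lands in the warehouse fact table.
db.execute("""
    INSERT INTO fact_sales (customer_key, amount)
    SELECT l.customer_key, s.amount
    FROM stg_fact_sales s
    JOIN stg_dim_customer_lookup l ON l.business_key = s.business_key
""")

fact = db.execute("SELECT * FROM fact_sales").fetchone()
print(fact)  # (42, 120.0) - the fact row carries the surrogate key
```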
* How do you link data between the warehouse and the OLTP database?
It is useful to load the business keys (or actual primary keys, if available) into the data warehouse as a reference back to the OLTP system. Also, auditing in the DW process should record the lineage of each piece of data by recording which load process loaded it.
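A minimal sketch of that lineage pattern (Python/sqlite3; the `etl_load` table and column names are invented for illustration): each warehouse row keeps the OLTP business key plus the id of the load run that brought it in, so any row can be traced both back to its source and back to the process that loaded it.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    -- One row per ETL run; its id is stamped onto everything that run loads.
    CREATE TABLE etl_load (load_id INTEGER PRIMARY KEY, started_at TEXT);
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY,
                               business_key TEXT, name TEXT, load_id INTEGER);
""")

# Register this run and capture its id.
cur = db.execute("INSERT INTO etl_load (started_at) VALUES ('2024-01-02T03:00:00')")
load_id = cur.lastrowid

# Every loaded row carries the OLTP business key and the run's id.
db.execute(
    "INSERT INTO dim_customer (business_key, name, load_id) VALUES (?, ?, ?)",
    ("CUST-001", "Acme", load_id),
)

row = db.execute("SELECT business_key, load_id FROM dim_customer").fetchone()
print(row)  # ('CUST-001', 1)
```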
* Where/How do you manage the transformation process - in the database as sprocs, dts/ssis packages, or SQL from application code?
This would typically be in SSIS packages, but it is often more performant to transform in the source query. Unfortunately that makes the source query quite complicated to understand, and therefore to maintain, so if performance is not an issue then transforming in the SSIS code is best. This is also another reason for having a staging area, as it lets you make more joins in the source query between different tables.
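As a small illustration of "transforming in the source query" (Python/sqlite3 sketch with hypothetical tables): the join and denormalization happen set-based inside the SQL that feeds the data flow, rather than row by row in package components afterwards.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER,
                         amount REAL);
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE stg_orders_flat (order_id INTEGER, region TEXT, amount REAL);

    INSERT INTO customers VALUES (1, 'EMEA');
    INSERT INTO orders VALUES (100, 1, 50.0);
""")

# The transformation (denormalizing the customer region onto the order)
# is pushed into the source query itself - one set-based statement.
db.execute("""
    INSERT INTO stg_orders_flat
    SELECT o.order_id, c.region, o.amount
    FROM orders o
    JOIN customers c ON c.customer_id = o.customer_id
""")

flat = db.execute("SELECT * FROM stg_orders_flat").fetchone()
print(flat)  # (100, 'EMEA', 50.0)
```

The trade-off the answer describes is that as more transformations pile into one query like this, it becomes harder to read and maintain than the equivalent sequence of discrete SSIS transforms.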
#4
1
John Saunders' process explanation is a good one.
If you are looking to implement a data warehouse project in SQL Server, you will find all the information you require for delivering the entire project in the excellent text "The Microsoft Data Warehouse Toolkit".
Funnily enough, one of the authors is Ralph Kimball :-)
#5
0
You may want to take a look at Data Vault Modeling. It claims to solve some longer-term issues, like changing attributes.