I have denormalized data (coming from a file) that needs to be imported into parent-child tables. The source data is something like this:
Account#  Name      Membership  Email
101       J Burns   Gold        alpha@foo.com
101       J Burns   Gold        bravo@foo.com
101       J Burns   Gold        charlie@yay.com
227       H Gordon  Silver      red@color.com
350       B Clyde   Silver      italian@food.com
350       B Clyde   Silver      mexican@food.com
Which SSIS components or techniques should I use to read the first three columns into a parent table, and the fourth column (Email) into a child table? I am permitted to take any of several approaches to the parent key:
- Directly use the Account# as the primary key
- Use a surrogate key generated by SSIS during the import process
- Configure an identity primary key
I'm sure I've listed my primary key options in increasing order of difficulty. I'd be interested in knowing how to do the first and the last option - I'll infer how to achieve the middle option. To emphasize again, I'm interested in a decidedly SSIS solution; I'm looking for an answer that uses the language of SSIS, rather than a procedural, technology neutral answer.
My question is somewhat similar to another SO question, whose answer is of uncertain viability. I'm hoping more detailed guidance can be given. I already know how to solve this problem by creating a "staging" middle step, where the parent-child separation is handled with straight SQL. However, I'm curious about how this can be done without that kind of middle step.
It seems to me this kind of import would be so common, that there would be a well-published formulaic way to handle it - a technique that SSIS excels at. As yet, I've not quite seen any straight up answer to this.
Update #1: Based on comments, I've adjusted the sample data to be more obviously denormalized. I also removed "flat" from "flat file," so that semantics don't interfere with the question.
Update #2: I've amplified my interest in a solution spoken in the language of SSIS.
2 Answers
#1
29
Here is one possible option that you can consider for loading parent-child data. This option consists of two steps. In the first step, read the source file and write data to the parent table. In the second step, read the source file again and use a Lookup transformation to fetch the parent info in order to write data to the child table. The following example uses the data provided in the question. This example was created using SSIS 2008 R2 and a SQL Server 2008 database.
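As a rough illustration of the two-step logic only (this is plain Python, not SSIS, and it is a sketch under stated assumptions rather than the package itself): the tab delimiter, the function names and the in-memory dictionary standing in for the identity column are all hypothetical. The first function plays the role of the Sort step that drops duplicates, and the second plays the role of the Lookup step.

```python
import csv
import io

# Sample rows from the question. The tab delimiter and the header name
# "Account" are assumptions made for this sketch.
SOURCE = (
    "Account\tName\tMembership\tEmail\n"
    "101\tJ Burns\tGold\talpha@foo.com\n"
    "101\tJ Burns\tGold\tbravo@foo.com\n"
    "101\tJ Burns\tGold\tcharlie@yay.com\n"
    "227\tH Gordon\tSilver\tred@color.com\n"
    "350\tB Clyde\tSilver\titalian@food.com\n"
    "350\tB Clyde\tSilver\tmexican@food.com\n"
)

PARENT_COLUMNS = ("Account", "Name", "Membership")


def read_rows(text):
    """Read the delimited source into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def load_parents(rows):
    """Step 1: sort on the parent columns and drop duplicate rows."""
    distinct = sorted({tuple(row[c] for c in PARENT_COLUMNS) for row in rows})
    # Hypothetical stand-in for an identity column: ids follow insert order.
    return {key: parent_id for parent_id, key in enumerate(distinct, start=1)}


def load_children(rows, parent_ids):
    """Step 2: re-read the rows and look up the parent id for each email."""
    return [
        (parent_ids[tuple(row[c] for c in PARENT_COLUMNS)], row["Email"])
        for row in rows
    ]


rows = read_rows(SOURCE)
parents = load_parents(rows)
children = load_children(rows, parents)
```

Reading the source twice keeps each pass simple: the first pass never needs to know about emails, and the second pass never needs to decide whether a parent already exists.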
Step-by-Step process:
- Create a sample flat file named Source.txt as shown in screenshot #1.
- In the SQL database, create two tables named dbo.Parent and dbo.Child using the scripts given under the SQL Scripts section. Both tables have an auto-generated identity column.
- On the package, place an OLE DB connection to connect to the SQL Server and a Flat File connection to read the source file, as shown in screenshot #2. Configure the flat file connection as shown in screenshots #3 - #9.
- On the Control Flow tab, place two Data Flow Tasks as shown in screenshot #10.
- Inside the data flow task named Parent, place a Flat File source, a Sort transformation and an OLE DB destination as shown in screenshot #11.
- Configure the flat file source as shown in screenshots #12 and #13. This source reads the rows from the flat file.
- Configure the Sort transformation as shown in screenshot #14. We need to eliminate the duplicate values (the Sort transformation offers a "Remove rows with duplicate sort values" option) so that only unique records are inserted into the parent table dbo.Parent.
- Configure the OLE DB destination as shown in screenshots #15 and #16. We need to insert the data into the parent table dbo.Parent.
- Inside the data flow task named Child, place a Flat File source, a Lookup transformation and an OLE DB destination as shown in screenshot #17.
- Configure the flat file source as shown in screenshots #12 and #13. This configuration is the same as the flat file source in the previous data flow task.
- Configure the Lookup transformation as shown in screenshots #18 - #20. We need to find the parent id from the table dbo.Parent using the key columns present in the file. The key columns here are Account, Name and Membership, since those are the file columns stored in dbo.Parent. If the file happened to have a unique column, you could use that column alone to fetch the parent id.
- Configure the OLE DB destination as shown in screenshots #21 and #22. We need to insert the Email column along with the parent id into the table dbo.Child.
- Screenshot #23 shows data in the tables before the package execution.
- Screenshots #24 and #25 show sample package execution.
- Screenshot #26 shows data in the tables after the package execution.
Hope that helps.
SQL Scripts:
-- Child table: one row per email address, linked to its parent by ParentId.
CREATE TABLE [dbo].[Child](
    [ChildId] [int] IDENTITY(1,1) NOT NULL,
    [ParentId] [int] NULL,
    [Email] [varchar](21) NULL,
    CONSTRAINT [PK_Child] PRIMARY KEY CLUSTERED ([ChildId] ASC)
) ON [PRIMARY]
GO

-- Parent table: one row per distinct Account / Name / Membership combination.
CREATE TABLE [dbo].[Parent](
    [ParentId] [int] IDENTITY(1,1) NOT NULL,
    [Account] [varchar](12) NULL,
    [Name] [varchar](12) NULL,
    [Membership] [varchar](14) NULL,
    CONSTRAINT [PK_Parent] PRIMARY KEY CLUSTERED ([ParentId] ASC)
) ON [PRIMARY]
GO
Screenshots #1 - #26: (images not included)
#2
0
If the data is sorted and Account# is an integer I would:
Insert the emails into a table (add an auto-increment column; it's a best practice).
1 101 alpha@foo.com
2 101 bravo@foo.com
3 101 charlie@yay.com
etc.
Then I would insert the other records into a parent table:
- using Account# as the primary key
- omitting the email addresses
- skipping duplicates (easy if the data is sorted).
If you have a foreign key relationship set up, you will need to do the second step (the parent insert) first, to avoid creating orphan records.
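To make the ordering concrete, here is a minimal sketch in Python with an in-memory SQLite database standing in for SQL Server (an assumption made purely so the example is self-contained; it is not an SSIS package). The table and column names are hypothetical. Account# is used directly as the parent's primary key, parents are loaded before children, and a deliberately orphaned child row is rejected by the foreign key.

```python
import sqlite3
from itertools import groupby

# Rows from the question, already sorted by account number.
ROWS = [
    (101, "J Burns", "Gold", "alpha@foo.com"),
    (101, "J Burns", "Gold", "bravo@foo.com"),
    (101, "J Burns", "Gold", "charlie@yay.com"),
    (227, "H Gordon", "Silver", "red@color.com"),
    (350, "B Clyde", "Silver", "italian@food.com"),
    (350, "B Clyde", "Silver", "mexican@food.com"),
]

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE Parent (
        Account    INTEGER PRIMARY KEY,   -- Account# used directly as the key
        Name       TEXT,
        Membership TEXT
    );
    CREATE TABLE Child (
        ChildId INTEGER PRIMARY KEY AUTOINCREMENT,
        Account INTEGER NOT NULL REFERENCES Parent(Account),
        Email   TEXT
    );
""")

# Parents first: because the data is sorted, each group of consecutive rows
# with the same account yields exactly one parent row.
for account, group in groupby(ROWS, key=lambda row: row[0]):
    _, name, membership, _ = next(group)
    conn.execute("INSERT INTO Parent VALUES (?, ?, ?)", (account, name, membership))

# Children second: every source row contributes one email.
conn.executemany(
    "INSERT INTO Child (Account, Email) VALUES (?, ?)",
    [(row[0], row[3]) for row in ROWS],
)
conn.commit()

parent_count = conn.execute("SELECT COUNT(*) FROM Parent").fetchone()[0]
child_rows = conn.execute(
    "SELECT ChildId, Account, Email FROM Child ORDER BY ChildId"
).fetchall()

# A child whose account has no parent row violates the foreign key.
try:
    conn.execute(
        "INSERT INTO Child (Account, Email) VALUES (?, ?)", (999, "orphan@foo.com")
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
```

Skipping duplicates with a grouped pass relies on the input being sorted by account; with unsorted input the same account could appear in more than one group and the second parent insert would fail on the primary key.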
My two cents: I don't know what your requirements are but it seems a bit over-normalized. If there is a small limit on the number of email addresses, I would consider adding several email columns to the main table...for speed and simplicity.