将一个大的非规范化表拆分为规范化数据库

时间:2021-06-13 16:58:32

I have a large (5 million row, 300+ column) csv file I need to import into a staging table in SQL Server, then run a script to split each row up and insert data into the relevant tables in a normalized db. The format of the source table looks something like this:

我有一个大的(500万行,300 +列)csv文件,我需要导入到SQL Server中的临时表中,然后运行一个脚本来分割每一行并将数据插入到规范化数据库中的相关表中。源表的格式如下所示:

(fName, lName, licenseNumber1, licenseIssuer1, licenseNumber2, licenseIssuer2..., specialtyName1, specialtyState1, specialtyName2, specialtyState2..., identifier1, identifier2...)

(fName,lName,licenseNumber1,licenseIssuer1,licenseNumber2,licenseIssuer2 ...,specialName1,specialtyState1,specialtyName2,specialtyState2 ...,identifier1,identifier2 ......)

There are 50 licenseNumber/licenseIssuer columns, 15 specialtyName/specialtyState columns, and 15 identifier columns. There is always at least one of each of those, but the remaining 49 or 14 could be null. The first identifier is unique, but is not used as the primary key of the Person in our schema.

有50个licenseNumber / licenseIssuer列,15个specialName / specialtyState列和15个标识符列。每个中至少有一个,但剩下的49或14可以为空。第一个标识符是唯一的,但不用作架构中Person的主键。

My database schema looks like this

我的数据库架构如下所示

People(ID int Identity(1,1))
Names(ID int, personID int, lName varchar, fName varchar)
Licenses(ID int, personID int, number varchar, issuer varchar)
Specialties(ID int, personID int, name varchar, state varchar)
Identifiers(ID int, personID int, value)

The database will already be populated with some People before adding the new ones from the csv.

在从csv添加新数据库之前,数据库已经填充了一些人员。

What is the best way to approach this?

解决这个问题的最佳方法是什么?

I have tried iterating over the staging table one row at a time with select top 1:

我已经尝试一次迭代登台表一行,选择前1:

WHILE EXISTS (Select top 1 * from staging)
BEGIN
    INSERT INTO People Default Values
    SET @LastInsertedID = SCOPE_IDENTITY() -- might use the output clause to get this instead

    INSERT INTO Names (personID, lName, fName) 
    SELECT top 1 @LastInsertedID, lName, fName from staging

    INSERT INTO Licenses(personID, number, issuer)
    SELECT top 1 @LastInsertedID, licenseNumber1, licenseIssuer1 from staging

    IF (select top 1 licenseNumber2 from staging) is not null
    BEGIN
        INSERT INTO Licenses(personID, number, issuer)
        SELECT top 1 @LastInsertedID, licenseNumber2, licenseIssuer2 from staging
    END

    -- Repeat the above 49 times, etc...

    DELETE top 1 from staging
END

One problem with this approach is that it is prohibitively slow, so I refactored it to use a cursor. This works and is significantly faster, but has me declaring 300+ variables for Fetch INTO.

这种方法的一个问题是它非常慢,所以我重构它以使用游标。这有效并且明显更快,但我已经为Fetch INTO声明了300多个变量。

Is there a set-based approach that would work here? That would be preferable, as I understand that cursors are frowned upon, but I'm not sure how to get the identity from the INSERT into the People table for use as a foreign key in the others without going row-by-row from the staging table.

是否有基于集合的方法可以在这里工作?这是更好的,因为我理解游标是不受欢迎的,但我不知道如何从INSERT中获取身份到People表中作为外键使用,而不是逐行排除临时表。

Also, how could I avoid copy and pasting the insert into the Licenses table? With a cursor approach I could try:

另外,我怎样才能避免将插入文件复制并粘贴到Licenses表中?使用光标方法,我可以尝试:

FETCH INTO ...@LicenseNumber1, @LicenseIssuer1, @LicenseNumber2, @LicenseIssuer2...
INSERT INTO #LicenseTemp (number, issuer) Values
(@LicenseNumber1, @LicenseIssuer1),
(@LicenseNumber2, @LicenseIssuer2),
... Repeat 48 more times...
.
.
.
INSERT INTO Licenses(personID, number, issuer)
SELECT @LastInsertedID, number, issuer
FROM #LicenseTEMP
WHERE number is not null

There still seems to be some redundant copy and pasting there, though.

但是,似乎仍有一些冗余的副本和粘贴。

To summarize the questions, I'm looking for idiomatic approaches to:

总结这些问题,我正在寻找惯用的方法:

  1. Break up one large staging table into a set of normalized tables, retrieving the Primary Key/identity from one table and using it as the foreign key in the others
  2. 将一个大型临时表拆分为一组规范化表,从一个表中检索主键/标识,并将其用作其他表中的外键

  3. Insert multiple rows into the normalized tables that come from many repeated columns in the staging table with less boilerplate/copy and paste (Licenses and Specialties above)
  4. 将多行插入到规范表中,这些表来自登台表中的许多重复列,其中样板/复制和粘贴较少(上面的许可证和特殊功能)

Short of discreet answers, I'd also be very happy with pointers towards resources and references that could assist me in figuring this out.

如果没有谨慎的答案,我也会非常满意对资源和参考资料的指示,这可以帮助我解决这个问题。

2 个解决方案

#1


3  

Ok, I'm not an SQL Server expert, but here's the "strategy" I would suggest.

好吧,我不是SQL Server专家,但这是我建议的“策略”。

Calculate the personId on the staging table As @Shnugo suggested before me, calculating the personId in the staging table will ease the next steps

计算登台表上的personId正如@Shnugo在我之前建议的那样,计算登台表中的personId将简化后续步骤

Use a sequence for the personID From SQL Server 2012 you can define sequences. If you use it for every person insert, you'll never risk an overlapping of IDs. If you have (as it seems) personId that were loaded before the sequence you can create the sequence with the first free personID as starting value

使用personID的序列从SQL Server 2012中,您可以定义序列。如果您为每个人插入使用它,您将永远不会冒重叠ID的风险。如果你有(在它看来)在序列之前加载的personId你可以用第一个free personID作为起始值创建序列

Create a numbers table Create an utility table keeping numbers from 1 to n (you need n to be at least 50.. you can look at this question for some implementations)

创建一个数字表创建一个实用程序表,保持从1到n的数字(你需要n至少50 ..你可以看一些实现的这个问题)

Use set logic to do the insert I'd avoid cursor and row-by-row logic: you are right that it is better to limit the number of accesses to the table, but I'd say that you should strive to limit it to one access for target table.

使用set logic进行插入我会避免使用游标和逐行逻辑:你是对的,最好限制对表的访问次数,但是我要说你应该努力将它限制为目标表的一次访问。

You could proceed like these:

你可以像这样继续:

People:

 INSERT INTO People (personID) 
 SELECT personId from staging;

Names:

 INSERT INTO Names (personID, lName, fName) 
 SELECT personId, lName, fName from staging;

Licenses: here we'll need the Number table

许可证:这里我们需要Number表

 INSERT INTO Licenses (personId, number, issuer)
 SELECT * FROM (
    SELECT personId, 
           case nbrs.n 
                when 1 then licenseNumber1 
                when 2 then licenseNumber2
                ...
                when 50 then licenseNumber50
            end as licenseNumber,    
           case nbrs.n 
                when 1 then licenseIssuer1 
                when 2 then licenseIssuer2
                ...
                when 50 then licenseIssuer50
            end as licenseIssuer
      from staging 
           cross join 
           (select n from numbers where n>=1 and n<=50) nbrs
  ) WHERE licenseNumber is not null;

Specialties:

 INSERT INTO Specialties(personId, name, state)
 SELECT * FROM (
    SELECT personId, 
           case nbrs.n 
                when 1 then specialtyName1
                when 2 then specialtyName2
                ...
                when 15 then specialtyName15
            end as specialtyName,    
           case nbrs.n 
                when 1 then specialtyState1
                when 2 then specialtyState2
                ...
                when 15 then specialtyState15
            end as specialtyState
      from staging 
           cross join 
           (select n from numbers where n>=1 and n<=15) nbrs
 ) WHERE specialtyName is not null;

Identifiers:

 INSERT INTO Identifiers(personId, value)
 SELECT * FROM (
    SELECT personId, 
           case nbrs.n 
                when 1 then identifier1
                when 2 then identifier2
                ...
                when 15 then identifier15
            end as value
      from staging 
           cross join 
           (select n from numbers where n>=1 and n<=15) nbrs
 ) WHERE value is not null;

Hope it helps.

希望能帮助到你。

#2


1  

You say: but the staging table could be modified

你说:但是可以修改临时表

I would

  • add a PersonID INT NOT NULL column and fill it with DENSE_RANK() OVER(ORDER BY fname,lname)

    添加一个PersonID INT NOT NULL列并用DENSE_RANK()OVER(ORDER BY fname,lname)填充它

  • add an index to this PersonID

    为此PersonID添加索引

  • use this ID in combination with GROUP BY to fill your People table

    将此ID与GROUP BY结合使用以填充您的People表

  • do the same with your names table

    对你的名字表做同样的事情

  • And then use this ID for a set-based insert into your three side tables

    然后将此ID用于基于集合的插入到您的三个边表中

Do it like this

像这样做

SELECT AllTogether.PersonID, AllTogether.TheValue
FROM
(
           SELECT PersonID,SomeValue1 AS TheValue FROM StagingTable
 UNION ALL SELECT PersonID,SomeValue2             FROM StagingTable
 UNION ALL ... 
) AS AllTogether
WHERE AllTogether.TheValue IS NOT NULL

UPDATE

You say: might cause a conflict with IDs that already exist in the People table

您说:可能会导致与People表中已存在的ID发生冲突

You did not tell anything about existing People...

你没有告诉现有的人......

Is there any sure and unique mark to identify them? Use a simple

是否有任何确定和独特的标记来识别它们?用一个简单的

UPDATE StagingTable SET PersonID=xyz WHERE ...

to set existing PersonIDs into your staging table and then use something like

将现有的PersonID设置到您的临时表中,然后使用类似的东西

UPDATE StagingTable 
SET PersonID=DENSE RANK() OVER(...) + MaxExistingID
WHERE PersonID IS NULL

to set new IDs for PersonIDs still being NULL.

为PersonID设置新ID仍为NULL。

#1


3  

Ok, I'm not an SQL Server expert, but here's the "strategy" I would suggest.

好吧,我不是SQL Server专家,但这是我建议的“策略”。

Calculate the personId on the staging table As @Shnugo suggested before me, calculating the personId in the staging table will ease the next steps

计算登台表上的personId正如@Shnugo在我之前建议的那样,计算登台表中的personId将简化后续步骤

Use a sequence for the personID From SQL Server 2012 you can define sequences. If you use it for every person insert, you'll never risk an overlapping of IDs. If you have (as it seems) personId that were loaded before the sequence you can create the sequence with the first free personID as starting value

使用personID的序列从SQL Server 2012中,您可以定义序列。如果您为每个人插入使用它,您将永远不会冒重叠ID的风险。如果你有(在它看来)在序列之前加载的personId你可以用第一个free personID作为起始值创建序列

Create a numbers table Create an utility table keeping numbers from 1 to n (you need n to be at least 50.. you can look at this question for some implementations)

创建一个数字表创建一个实用程序表,保持从1到n的数字(你需要n至少50 ..你可以看一些实现的这个问题)

Use set logic to do the insert I'd avoid cursor and row-by-row logic: you are right that it is better to limit the number of accesses to the table, but I'd say that you should strive to limit it to one access for target table.

使用set logic进行插入我会避免使用游标和逐行逻辑:你是对的,最好限制对表的访问次数,但是我要说你应该努力将它限制为目标表的一次访问。

You could proceed like these:

你可以像这样继续:

People:

 INSERT INTO People (personID) 
 SELECT personId from staging;

Names:

 INSERT INTO Names (personID, lName, fName) 
 SELECT personId, lName, fName from staging;

Licenses: here we'll need the Number table

许可证:这里我们需要Number表

 INSERT INTO Licenses (personId, number, issuer)
 SELECT * FROM (
    SELECT personId, 
           case nbrs.n 
                when 1 then licenseNumber1 
                when 2 then licenseNumber2
                ...
                when 50 then licenseNumber50
            end as licenseNumber,    
           case nbrs.n 
                when 1 then licenseIssuer1 
                when 2 then licenseIssuer2
                ...
                when 50 then licenseIssuer50
            end as licenseIssuer
      from staging 
           cross join 
           (select n from numbers where n>=1 and n<=50) nbrs
  ) WHERE licenseNumber is not null;

Specialties:

 INSERT INTO Specialties(personId, name, state)
 SELECT * FROM (
    SELECT personId, 
           case nbrs.n 
                when 1 then specialtyName1
                when 2 then specialtyName2
                ...
                when 15 then specialtyName15
            end as specialtyName,    
           case nbrs.n 
                when 1 then specialtyState1
                when 2 then specialtyState2
                ...
                when 15 then specialtyState15
            end as specialtyState
      from staging 
           cross join 
           (select n from numbers where n>=1 and n<=15) nbrs
 ) WHERE specialtyName is not null;

Identifiers:

 INSERT INTO Identifiers(personId, value)
 SELECT * FROM (
    SELECT personId, 
           case nbrs.n 
                when 1 then identifier1
                when 2 then identifier2
                ...
                when 15 then identifier15
            end as value
      from staging 
           cross join 
           (select n from numbers where n>=1 and n<=15) nbrs
 ) WHERE value is not null;

Hope it helps.

希望能帮助到你。

#2


1  

You say: but the staging table could be modified

你说:但是可以修改临时表

I would

  • add a PersonID INT NOT NULL column and fill it with DENSE_RANK() OVER(ORDER BY fname,lname)

    添加一个PersonID INT NOT NULL列并用DENSE_RANK()OVER(ORDER BY fname,lname)填充它

  • add an index to this PersonID

    为此PersonID添加索引

  • use this ID in combination with GROUP BY to fill your People table

    将此ID与GROUP BY结合使用以填充您的People表

  • do the same with your names table

    对你的名字表做同样的事情

  • And then use this ID for a set-based insert into your three side tables

    然后将此ID用于基于集合的插入到您的三个边表中

Do it like this

像这样做

SELECT AllTogether.PersonID, AllTogether.TheValue
FROM
(
           SELECT PersonID,SomeValue1 AS TheValue FROM StagingTable
 UNION ALL SELECT PersonID,SomeValue2             FROM StagingTable
 UNION ALL ... 
) AS AllTogether
WHERE AllTogether.TheValue IS NOT NULL

UPDATE

You say: might cause a conflict with IDs that already exist in the People table

您说:可能会导致与People表中已存在的ID发生冲突

You did not tell anything about existing People...

你没有告诉现有的人......

Is there any sure and unique mark to identify them? Use a simple

是否有任何确定和独特的标记来识别它们?用一个简单的

UPDATE StagingTable SET PersonID=xyz WHERE ...

to set existing PersonIDs into your staging table and then use something like

将现有的PersonID设置到您的临时表中,然后使用类似的东西

UPDATE StagingTable 
SET PersonID=DENSE RANK() OVER(...) + MaxExistingID
WHERE PersonID IS NULL

to set new IDs for PersonIDs still being NULL.

为PersonID设置新ID仍为NULL。