I have a SQL Server table with three columns:
Table1
col1 int
col2 int
col3 string
I have a unique constraint defined for all three columns (col1, col2, col3)
Now, I have a .csv file from which I want to add records to this table, and the .csv file can contain duplicate records.
I have looked at various options for avoiding duplicates in the above scenario. Below are the three options that are working well for me. Please have a look and share your thoughts on the pros and cons of each method so I can choose the best one.
Option #1:
Avoiding duplicates in the first place, i.e., while adding objects to the collection from the CSV file. I used a HashSet<T> for this and overrode the following methods on type T:
public override int GetHashCode()
{
    return col1.GetHashCode() + col2.GetHashCode() + col3.GetHashCode();
}

public override bool Equals(object obj)
{
    var other = obj as T;
    if (other == null)
    {
        return false;
    }
    return col1 == other.col1
        && col2 == other.col2
        && col3 == other.col3;
}
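For context, here is a minimal sketch of how this could be used while reading the file. The Record class stands in for the actual type T above (with those same overrides), the file name is made up, and the CSV parsing is deliberately naive (no header or quoting handling):

using System.Collections.Generic;
using System.IO;

// Record is assumed to be the type T above, carrying the Equals/GetHashCode overrides.
var uniqueRecords = new HashSet<Record>();

foreach (string line in File.ReadLines("input.csv"))
{
    string[] parts = line.Split(',');

    // Add returns false when an equal record is already in the set,
    // so duplicates from the CSV are silently skipped.
    uniqueRecords.Add(new Record
    {
        col1 = int.Parse(parts[0]),
        col2 = int.Parse(parts[1]),
        col3 = parts[2]
    });
}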
Option #2:
Having a List<T> instead of a HashSet<T>, and removing duplicates after all the objects are added to the List<T>:
List<T> distinctObjects = allObjects
    .GroupBy(x => new { x.col1, x.col2, x.col3 })
    .Select(g => g.First())
    .ToList();
Option #3:
Removing duplicates after all the objects are added to a DataTable:
public static DataTable RemoveDuplicatesRows(DataTable dataTable)
{
    IEnumerable<DataRow> uniqueRows = dataTable.AsEnumerable().Distinct(DataRowComparer.Default);
    DataTable dataTable2 = uniqueRows.CopyToDataTable();
    return dataTable2;
}
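A small usage sketch for this approach, assuming the CSV is first loaded into a DataTable with matching columns (the file name and parsing are simplifications, not part of the original question):

using System;
using System.Data;
using System.IO;

var table = new DataTable();
table.Columns.Add("col1", typeof(int));
table.Columns.Add("col2", typeof(int));
table.Columns.Add("col3", typeof(string));

foreach (string line in File.ReadLines("input.csv"))
{
    string[] parts = line.Split(',');
    table.Rows.Add(int.Parse(parts[0]), int.Parse(parts[1]), parts[2]);
}

// DataRowComparer.Default compares rows column by column, so exact duplicates are dropped.
DataTable distinct = RemoveDuplicatesRows(table);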
Although I have not compared their running times, I prefer option #1 because I remove duplicates as the first step, so I move ahead only with what is required.
Please share your views so I can choose the best one.
Thanks a lot!
2 Answers
#1 (5 votes)
I like option 1: the HashSet<T> provides a fast way of avoiding duplicates before ever sending them to the DB. You should implement a better GetHashCode, e.g. using Skeet's implementation from "What is the best algorithm for an overridden System.Object.GetHashCode?".
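For example, a sketch along the lines of that answer; the 17/23 constants and the null check are my choices, not something from the question:

public override int GetHashCode()
{
    unchecked // overflow is fine for hash codes; let the arithmetic wrap
    {
        int hash = 17;
        hash = hash * 23 + col1.GetHashCode();
        hash = hash * 23 + col2.GetHashCode();
        hash = hash * 23 + (col3 == null ? 0 : col3.GetHashCode());
        return hash;
    }
}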
But there's a problem: what if the table already contains data that could duplicate rows in your CSV? A simple HashSet would only really work if you first copied the whole table down. You could do just that, but to solve this, I might pair option 1 with a temporary (holding) table and an insert statement like the one from "Skip-over/ignore duplicate rows on insert":
INSERT dbo.Table1(col1, col2, col3)
SELECT col1, col2, col3
FROM dbo.tmp_holding_Table1 AS t
WHERE NOT EXISTS (SELECT 1 FROM dbo.Table1 AS d
WHERE col1 = t.col1
AND col2 = t.col2
AND col3 = t.col3);
With this combination, the volume of data transferred to/from your DB is minimized.
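As a rough sketch of that pairing (not the original answer's code), the de-duplicated rows could be bulk-copied into the holding table and the INSERT above run immediately afterwards; the connection string, the holding-table name, and the idea of passing a DataTable are all assumptions:

using System.Data;
using System.Data.SqlClient;

public static void LoadAndMerge(DataTable distinctRows, string connectionString)
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open();

        // Bulk-load the already de-duplicated rows into the holding table.
        using (var bulkCopy = new SqlBulkCopy(connection))
        {
            bulkCopy.DestinationTableName = "dbo.tmp_holding_Table1";
            bulkCopy.WriteToServer(distinctRows);
        }

        // Move only the rows that are not already in the target table.
        using (var command = connection.CreateCommand())
        {
            command.CommandText =
                @"INSERT dbo.Table1 (col1, col2, col3)
                  SELECT col1, col2, col3
                  FROM dbo.tmp_holding_Table1 AS t
                  WHERE NOT EXISTS (SELECT 1 FROM dbo.Table1 AS d
                                    WHERE d.col1 = t.col1
                                      AND d.col2 = t.col2
                                      AND d.col3 = t.col3);";
            command.ExecuteNonQuery();
        }
    }
}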
#2 (0 votes)
Another solution could be the IGNORE_DUP_KEY = { ON | OFF } option when creating or rebuilding an index. This prevents errors when inserting duplicate rows; instead, SQL Server generates the warning "Duplicate key was ignored.".
CREATE TABLE dbo.MyTable (Col1 INT, Col2 INT, Col3 INT);
GO
CREATE UNIQUE INDEX IUN_MyTable_Col1_Col2_Col3
ON dbo.MyTable (Col1,Col2,Col3)
WITH (IGNORE_DUP_KEY = ON);
GO
INSERT dbo.MyTable (Col1,Col2,Col3)
VALUES (1,11,111);
INSERT dbo.MyTable (Col1,Col2,Col3)
SELECT 1,11,111 UNION ALL
SELECT 2,22,222 UNION ALL
SELECT 3,33,333;
INSERT dbo.MyTable (Col1,Col2,Col3)
SELECT 2,22,222 UNION ALL
SELECT 3,33,333;
GO
/*
(1 row(s) affected)
(2 row(s) affected)
Duplicate key was ignored.
*/
SELECT * FROM dbo.MyTable;
/*
Col1 Col2 Col3
----------- ----------- -----------
1 11 111
2 22 222
3 33 333
*/
Note: because you have a UNIQUE constraint, if you try to change the index options with ALTER INDEX
ALTER INDEX IUN_MyTable_Col1_Col2_Col3
ON dbo.MyTable
REBUILD WITH (IGNORE_DUP_KEY = ON)
you will get the following error:
Msg 1979, Level 16, State 1, Line 1
Cannot use index option ignore_dup_key to alter index 'IUN_MyTable_Col1_Col2_Col3' as it enforces a primary or unique constraint.
So, if you choose this solution the options are:
1) Create another UNIQUE index and then drop the UNIQUE constraint (this option requires more storage space, but a UNIQUE index/constraint stays active the whole time), or
2) Drop the UNIQUE constraint and create a UNIQUE index with the WITH (IGNORE_DUP_KEY = ON) option (I wouldn't recommend this last option).