I have a database where several hundred records have been duplicated. However the duplicated information is not the same across all fields. For any two lines, the first line will contain information in some fields while the duplicate line's fields are blank; but then for other fields, the duplicate (second) line will contain information while the first line's fields are blank. For example, it looks like this:
ID   Deleted  Reference  Name   Case_Date   Outcome  Outcome_Date
100  False    A123       Chris  2000-01-01  Yes
101  False    A123       Chris                       2000-03-31
The ID column is a unique primary key for the record. The Reference column is the one by which I can identify the duplicates. However as you can see, the first record (100) contains information in Case_Date and Outcome, but the second record (101) contains an Outcome_Date.
What I want to do is copy as much information as possible into just one record of each pair, and then mark the duplicate as deleted (I use a soft delete: records are not actually removed from the table, just flagged by setting the Deleted column to True). With the above example, I want it to look like this:
ID   Deleted  Reference  Name   Case_Date      Outcome  Outcome_Date
100  False    A123       Chris  2000-01-01     Yes      2000-03-31
101  True     A123       Chris  (2000-01-01)*  (Yes)*   2000-03-31
* Technically it will not be necessary to also copy information into the blank fields of the record which will be marked as deleted, but I figure it's easier to just copy everything and then mark the "second" record as the duplicate, rather than trying to work out which one contains more information and which one contains less.
I am also aware that it will be easier to run a separate SQL command for each column than to try to do them all at once. The columns shown above are a simplified example, and the information which may or may not be present across each column differs.
My select query for the record set of duplicates is:
SELECT *
FROM [Cohorts]
WHERE [Deleted] = False
  AND ([CaseType] = "Female" OR [CaseType] = "Family")
  AND [Reference] Is Not Null
  AND [Reference] In (SELECT [Reference]
                      FROM [Cohorts] As Tmp
                      WHERE [Deleted] = False
                        AND ([CaseType] = "Female" OR [CaseType] = "Family")
                      GROUP BY [Reference]
                      HAVING Count(*) > 1)
ORDER BY [Reference];
This will return all (Female/Family) records in the table [Cohorts] where there exists more than one record with the same Reference (and where the records have not been marked as deleted).
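The selection logic described above can be checked on a few toy rows. This is only a sketch: it uses Python's sqlite3 module as a stand-in for JET, so the dialect is SQLite's (0/1 instead of False/True, single-quoted strings), and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Cohorts (
    ID INTEGER PRIMARY KEY, Deleted INTEGER, CaseType TEXT,
    Reference TEXT, Name TEXT)""")
conn.executemany("INSERT INTO Cohorts VALUES (?, ?, ?, ?, ?)", [
    (100, 0, "Female", "A123", "Chris"),
    (101, 0, "Female", "A123", "Chris"),  # live duplicate of 100
    (102, 0, "Family", "B456", "Sam"),
    (103, 1, "Family", "B456", "Sam"),    # already soft-deleted, so 102 is alone
    (104, 0, "Male", "C789", "Alex"),
    (105, 0, "Male", "C789", "Alex"),     # duplicated, but CaseType is excluded
])

# SQLite rendering of the duplicate-finding query (hypothetical toy schema).
DUPLICATES_SQL = """
SELECT ID FROM Cohorts
WHERE Deleted = 0
  AND CaseType IN ('Female', 'Family')
  AND Reference IS NOT NULL
  AND Reference IN (SELECT Reference FROM Cohorts
                    WHERE Deleted = 0
                      AND CaseType IN ('Female', 'Family')
                    GROUP BY Reference
                    HAVING COUNT(*) > 1)
ORDER BY Reference, ID
"""
duplicate_ids = [row[0] for row in conn.execute(DUPLICATES_SQL)]
print(duplicate_ids)  # [100, 101]
```

Only the A123 pair comes back: B456 has a single live record, and the C789 pair is filtered out by CaseType.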
I'm running my queries from VBA via ADO, so can execute UPDATE statements. My database is an Access-compatible .mdb using the JET engine.
Grateful if anyone could suggest a suitable SQL command which I can run per column in order to populate the NULL fields with the values of the non-NULL fields from the relevant duplicate records. It's a bit beyond my SQL understanding at present! Thanks.
1 solution
My first UPDATE JOIN ever, hope it works (untested):
UPDATE Cohorts AS t1
INNER JOIN Cohorts AS t2
        ON t1.Reference = t2.Reference
SET t1.[Name]       = IIf(t1.[Name] Is Null, t2.[Name], t1.[Name]),
    t1.Case_Date    = IIf(t1.Case_Date Is Null, t2.Case_Date, t1.Case_Date),
    t1.Outcome      = IIf(t1.Outcome Is Null, t2.Outcome, t1.Outcome),
    t1.Outcome_Date = IIf(t1.Outcome_Date Is Null, t2.Outcome_Date, t1.Outcome_Date),
    t1.Deleted      = IIf(t1.ID < t2.ID, False, True)
WHERE t1.ID <> t2.ID;
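The merge logic can be tried out on the example rows before touching the real table. The following is a minimal sketch that uses Python's sqlite3 module as a stand-in for JET, so the SQL dialect is SQLite's, not Access's (JET has no COALESCE or CASE and would use IIf with Is Null tests instead). The in-memory table, the sample rows, and the helper name merge_duplicates are assumptions made for illustration. It runs one UPDATE per column, as the question suggests, and keeps the lowest ID in each Reference group as the surviving record; the CaseType filter from the question is left out for brevity.

```python
import sqlite3

# Columns to back-fill; a fixed list, so building SQL from these names is safe.
COLUMNS = ["Name", "Case_Date", "Outcome", "Outcome_Date"]

def merge_duplicates(conn):
    """Fill NULL fields from live duplicates, then soft-delete the higher IDs."""
    for col in COLUMNS:
        # One UPDATE per column: keep the existing value, else borrow the first
        # non-NULL value from another non-deleted row with the same Reference.
        conn.execute(f"""
            UPDATE Cohorts
            SET {col} = COALESCE({col},
                (SELECT t2.{col} FROM Cohorts AS t2
                 WHERE t2.Reference = Cohorts.Reference
                   AND t2.ID <> Cohorts.ID
                   AND t2.Deleted = 0
                   AND t2.{col} IS NOT NULL
                 ORDER BY t2.ID LIMIT 1))
            WHERE Deleted = 0
        """)
    # Flag every row that has a live duplicate with a lower ID.
    conn.execute("""
        UPDATE Cohorts SET Deleted = 1
        WHERE Deleted = 0
          AND EXISTS (SELECT 1 FROM Cohorts AS t2
                      WHERE t2.Reference = Cohorts.Reference
                        AND t2.Deleted = 0
                        AND t2.ID < Cohorts.ID)
    """)

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE Cohorts (ID INTEGER PRIMARY KEY, Deleted INTEGER,
    Reference TEXT, Name TEXT, Case_Date TEXT, Outcome TEXT, Outcome_Date TEXT)""")
conn.executemany("INSERT INTO Cohorts VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (100, 0, "A123", "Chris", "2000-01-01", "Yes", None),
    (101, 0, "A123", "Chris", None, None, "2000-03-31"),
])
merge_duplicates(conn)
rows = conn.execute("SELECT * FROM Cohorts ORDER BY ID").fetchall()
for row in rows:
    print(row)
```

After the call, both rows carry the Case_Date, Outcome and Outcome_Date values, and only record 101 has Deleted set, which matches the desired result shown in the question.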
Edit: Alternative solution:
Create a copy table, then do an INSERT ... SELECT:
INSERT INTO CohortsCopy (Deleted, Reference, [Name], Case_Date, Outcome, Outcome_Date)
SELECT IIf(t2.ID Is Null Or t1.ID < t2.ID, False, True),
       IIf(t1.Reference Is Null, t2.Reference, t1.Reference),
       IIf(t1.[Name] Is Null, t2.[Name], t1.[Name]),
       IIf(t1.Case_Date Is Null, t2.Case_Date, t1.Case_Date),
       IIf(t1.Outcome Is Null, t2.Outcome, t1.Outcome),
       IIf(t1.Outcome_Date Is Null, t2.Outcome_Date, t1.Outcome_Date)
FROM Cohorts AS t1
LEFT JOIN Cohorts AS t2
       ON (t1.Reference = t2.Reference AND t1.ID <> t2.ID);
Then either rename the copy table, or copy its rows back to the original table.
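The copy-table route can be rehearsed the same way. This sketch again uses Python's sqlite3 module in place of JET, so the SQL is SQLite dialect. It assumes duplicates come strictly in pairs (with three or more rows per Reference the self-join would produce several rows per record), and, unlike the column list above, it also carries ID across so the result can be compared row by row. The third sample row is invented to show that a record without a duplicate is copied unchanged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Cohorts (ID INTEGER PRIMARY KEY, Deleted INTEGER, Reference TEXT,
                      Name TEXT, Case_Date TEXT, Outcome TEXT, Outcome_Date TEXT);
CREATE TABLE CohortsCopy (ID INTEGER PRIMARY KEY, Deleted INTEGER, Reference TEXT,
                          Name TEXT, Case_Date TEXT, Outcome TEXT, Outcome_Date TEXT);
INSERT INTO Cohorts VALUES
    (100, 0, 'A123', 'Chris', '2000-01-01', 'Yes', NULL),
    (101, 0, 'A123', 'Chris', NULL, NULL, '2000-03-31'),
    (102, 0, 'B456', 'Sam', '2001-05-05', 'No', '2001-06-01');
""")

# LEFT JOIN keeps records that have no duplicate (t2 columns are then NULL).
conn.execute("""
INSERT INTO CohortsCopy (ID, Deleted, Reference, Name, Case_Date, Outcome, Outcome_Date)
SELECT t1.ID,
       CASE WHEN t2.ID IS NULL OR t1.ID < t2.ID THEN 0 ELSE 1 END,
       t1.Reference,
       COALESCE(t1.Name, t2.Name),
       COALESCE(t1.Case_Date, t2.Case_Date),
       COALESCE(t1.Outcome, t2.Outcome),
       COALESCE(t1.Outcome_Date, t2.Outcome_Date)
FROM Cohorts AS t1
LEFT JOIN Cohorts AS t2
       ON t1.Reference = t2.Reference AND t1.ID <> t2.ID
""")

copied = conn.execute("SELECT * FROM CohortsCopy ORDER BY ID").fetchall()
for row in copied:
    print(row)
```

Working in a copy leaves the original table untouched until the merged rows have been inspected, which is the main attraction of this variant.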