Help using a SQL aggregate query to detect duplicates

Date: 2021-02-18 04:18:47

I have a table whose records contain a person's information and the filename the information originated from, so the table looks like this:

|Table|
|id, first-name, last-name, ssn, filename|
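
For reference, a rough DDL sketch of that layout (the column types here are assumptions, not from the original post):

CREATE TABLE [Table] (
    id           INT PRIMARY KEY,
    [first-name] VARCHAR(100),
    [last-name]  VARCHAR(100),
    ssn          VARCHAR(11),
    [filename]   VARCHAR(255)
)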

I also have a stored procedure that provides some analytics for the files in the system, and I'm trying to add information to that stored procedure to shed light on the possibility of duplicates.

Here is the current stored procedure

SELECT [filename],
       COUNT([filename]) as totalRecords,
       COUNT(closedleads.id) as closedRecords,
       ROUND(/* calcs percent of records closed in a file */)
FROM table
LEFT OUTER JOIN closedleads ON closedleads.leadid = table.id
GROUP BY [filename]

What I want to add is the ability to see the number of possible duplicates, defined as records with matching SSNs, and I am at a loss as to how I could perform a count on a subquery or join and include it in the result set. Can anyone provide some pointers?

What I'm trying to do is add something like this to my procedure above

SELECT COUNT(
    SELECT COUNT(*) FROM Table T1
    INNER JOIN Table T2 on T1.SSN = T2.SSN
    WHERE T1.id != T2.id
) as PossibleDuplicates

What I'm looking for is to merge this code with my procedure above so I can get all of the same data in one query, with the number of possible duplicates broken down by filename: for each filename I'd get the number of records, the number of records closed, and the number of possible duplicates.

EDIT:

I'm very close to my desired goal, but I'm failing on the last little bit: getting the number of possible duplicates BY filename. Here is my query:

select [q1].[filename], [q1].leads, [q1].closed, [q2].dups
FROM (
    SELECT [filename], count([filename]) as leads,
    count(closedleads.id) as closed
    FROM Table
    left join closedleads on closedleads.leadid = Table.id
    group by [filename]
) as [q1]
INNER JOIN (
    select count([ssn]) as dups, [filename] from Table
    group by [ssn], [filename]
    having count([ssn]) > 1
) as [q2] on [q1].[filename] = [q2].[filename]

This works, but it is showing multiple results for each filename with values of 2-5 instead of summing the total count of possible duplicates.

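One way to collapse [q2] to a single row per filename is to total up the extra rows per duplicated SSN before joining; a rough sketch of that idea, using the same table and aliases as above:

INNER JOIN (
    select [filename], sum(cnt - 1) as dups      -- extra rows for each duplicated SSN
    from (
        select [filename], [ssn], count(*) as cnt
        from Table
        group by [filename], [ssn]
        having count(*) > 1
    ) as perSsn
    group by [filename]
) as [q2] on [q1].[filename] = [q2].[filename]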

Working Query

Hey everyone, thanks for all the help. This is what I eventually ended up with, and it works exactly as I wanted:

select [q1].[filename], [q1].leads, [q1].closed, [q2].dups,
        round(([q1].closed / [q1].leads), 3) as percentClosed
FROM (
    SELECT [filename], count([filename]) as leads,
    count(closedleads.id) as closed
    FROM Table
    left join closedleads on closedleads.leadid = Table.id
    and [filename] is not null
    group by [filename]
) as [q1]
INNER JOIN (
    select [filename], count(*) - count(distinct [ssn]) as dups 
            from Table
            group by [filename]
) as [q2] on [q1].[filename] = [q2].[filename]
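
(For each filename, [q2]'s count(*) - count(distinct [ssn]) gives the number of extra rows whose SSN already appeared in that file; for example, a file whose SSNs are 1, 1, 2, 2, 2, 3 would report 3 dups.)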

4 Answers

#1


0  

I think the existing answers don't quite understand your question. I think I do but it's not completely specified yet. Is it a duplicate if the same SSN appears in two different files or only within the same file? Because you group by filename, that becomes the grain.

The output of your query looks like:

StateFarm1, 500,   50,    10%,   <your new value goes here>
AllState2,  100,   90,    90%    <your new value goes here>

So if you have the same SSN in those two files, you have 1 duplicate. On which row do you show the 1: the AllState row or the StateFarm row? If you say both, invariably someone will SUM that column and get a doubling of the results.

Now, what if you have a Geico row with the same SSN: is that 1 duplicate or 2? And again, which row?

I know this isn't a final answer, but these questions do highlight that the question as it stands is unanswerable... you clear this up and I'll update the answer,

please no downvotes in the meantime

Addendum

I believe the only thing you are missing is a DISTINCT.

select [q1].[filename], [q1].leads, [q1].closed, [q2].dups
FROM (
    SELECT [filename], count([filename]) as leads,
    count(closedleads.id) as closed
    FROM Table
    left join closedleads on closedleads.leadid = Table.id
    group by [filename]
) as [q1]
INNER JOIN (
    select count( DISTINCT [ssn]) as dups, [filename] from Table  -- <---- here
    group by [ssn], [filename]
    having count([ssn]) > 1
) as [q2] on [q1].[filename] = [q2].[filename]

#2


3  

You'll probably want to make use of a HAVING clause somewhere, e.g.:

    LEFT JOIN (
        SELECT SSN, COUNT(SSN) - 1 DupeCount FROM Table T1
        GROUP BY SSN
        HAVING COUNT(SSN) > 1 ) AS PossibleDuplicates
    ON table.ssn = PossibleDuplicates.SSN

If you want to include 0 possible duplicates (rather than null) you actually don't need the HAVING clause, just the left join.

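A minimal sketch of that variant (the outer column list here is illustrative): without the HAVING filter every SSN shows up in the subquery, so unique SSNs get a DupeCount of 0 instead of NULL.

    SELECT T.id, T.[filename], PossibleDuplicates.DupeCount
    FROM Table T
    LEFT JOIN (
        SELECT SSN, COUNT(SSN) - 1 DupeCount FROM Table
        GROUP BY SSN ) AS PossibleDuplicates
    ON T.ssn = PossibleDuplicates.SSN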

#3


1  

Edit - Updated with an example that matches your question better

Here's an example if I understand correctly.

create table #table  (id int,ssn varchar(10))

insert into #table values(1,'10')
insert into #table values(2,'10')

insert into #table values(3,'11')
insert into #table values(4,'12')


insert into #table values(5,'11')

insert into #table values(6,'13')


select sum(cnt)
from (
select count(distinct ssn) as cnt
from #table
group by ssn 
having count(*)>1
) dups

You shouldn't need to self-join the table if you group by ssn and then pull back only the ssn's where you have more than one.

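(With the sample rows above, SSNs '10' and '11' each appear twice, so the inner query returns two rows and the outer SUM comes back as 2.)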

#4


0  

You don't need the outer COUNT: your inner SELECT COUNT(*)... will return just one number, a count of records with a duplicate SSN but a different id.

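Put differently, a minimal sketch of that point:

SELECT COUNT(*) AS PossibleDuplicates
FROM Table T1
INNER JOIN Table T2 ON T1.SSN = T2.SSN
WHERE T1.id != T2.id
-- note: this counts every duplicated pair twice (once in each direction);
-- use T1.id < T2.id instead if each pair should only be counted once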
