Situation
On a Microsoft SQL Server 2008 instance I have a table with about 2 million rows (this should never have happened, but we inherited the situation). A sample follows:
usernum. | phone | email
1 | 123 | user1@local.com
2 | 123 | user2@local.com
3 | 245 | user3@local.com
4 | 678 | user3@local.com
Aim
I would like to create a table that looks like this. The idea is that if 'phone' or 'email' is the same, they are assigned the same group number.
groupnum |usernum. | phone | email
1 | 1 | 123 | user1@local.com
1 | 2 | 123 | user2@local.com
2 | 3 | 245 | user3@local.com
2 | 4 | 678 | user3@local.com
Tried so far
So far I have created a simple python script that conceptually does the following:
- for each usernum in the table
-- assign a group number
-- also assign the group number to all rows where phone or email is the same as this row
-- do not assign the group number if usernum already processed (else we would do things double)
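The steps above can be sketched in memory roughly as follows. This is only an illustration of the described loop, not the actual script (which is not shown): the function name `assign_groups_single_pass`, the tuple layout, and the two lookup dictionaries are assumptions made for the example. Building the dictionaries once avoids rescanning the table for every row.

```python
from collections import defaultdict


def assign_groups_single_pass(rows):
    """rows: list of (usernum, phone, email). Returns {usernum: groupnum}."""
    # Index users by phone and by email so matches are found by lookup,
    # not by rescanning every row.
    by_phone = defaultdict(list)
    by_email = defaultdict(list)
    for usernum, phone, email in rows:
        by_phone[phone].append(usernum)
        by_email[email].append(usernum)

    group_of = {}
    next_group = 1
    for usernum, phone, email in rows:
        if usernum in group_of:
            continue  # already processed, do not assign again
        # Give the new group number to this row and to every row sharing
        # its phone or email, without overwriting earlier assignments.
        for other in by_phone[phone] + by_email[email]:
            group_of.setdefault(other, next_group)
        next_group += 1
    return group_of


rows = [
    (1, "123", "user1@local.com"),
    (2, "123", "user2@local.com"),
    (3, "245", "user3@local.com"),
    (4, "678", "user3@local.com"),
]
print(assign_groups_single_pass(rows))
# {1: 1, 2: 1, 3: 2, 4: 2}
```

Note that this single pass only looks at direct matches of the current row; it does not follow chains where one user links two otherwise unrelated users.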
Problem
The Python script basically has to check, for each row, whether there are duplicates for phone or email. Although this is perfectly fine for maybe 10,000 records or so, it is too slow for 2 million records. I think this is possible to do in T-SQL, which should be much faster than my Python script using pyodbc.
The big question thus is how to do this in SQL.
2 solutions
#1
1
Just noticed you said email or phone is a duplicate. For that, I think you would need to decide which field has priority in cases where a user could be joined on either field. Or you could potentially split the update into a few batches that assign group numbers based on phone AND email first, then email (when not already matched), then phone (when not already matched), as such:
-- one group row per distinct (phone, email) pair;
-- assuming groupNum is an identity column on yourGroupsTable
insert into yourGroupsTable (phone, email)
select distinct phone, email
from yourUserTable;

-- assign group nums with priority on matching phone AND email
update u
set groupNum = g.groupNum
from yourUserTable u
join yourGroupsTable g
    on u.phone = g.phone
   and u.email = g.email;
It occurs to me now that this would not work: because of the select distinct, every row would simply join to the yourGroupsTable row for its own (phone, email) pair. I also came across a scenario where I'm unsure what your expected outcome would be (and it is too big for a comment). What happens in this instance, with your test data slightly modified:
groupnum |usernum. | phone | email
1 | 1 | 123 | user1@local.com
1 | 2 | 123 | user2@local.com
? | 3 | 245 | user3@local.com
? | 4 | 678 | user3@local.com
? | 5 | 245 | user7@local.com
? | 6 | 678 | user7@local.com
What would the group numbers be in the above case?
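One possible reading, which the question does not confirm, is that the rule is transitive: if user 3 shares an email with user 4 and a phone with user 5, all of them end up in one group. Under that assumption, the grouping amounts to finding connected components, and a union-find structure is one common way to compute them. The sketch below is in Python rather than T-SQL, and the name `group_transitively` and the tuple layout are hypothetical.

```python
def group_transitively(rows):
    """rows: list of (usernum, phone, email). Returns {usernum: groupnum}.

    Assumption: sharing a phone OR an email links two users, and links chain.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Link every user to its phone value and its email value; users that
    # share either value then fall into the same component.
    for usernum, phone, email in rows:
        union(("user", usernum), ("phone", phone))
        union(("user", usernum), ("email", email))

    # Number the components in order of first appearance.
    group_of = {}
    root_to_group = {}
    for usernum, _, _ in rows:
        root = find(("user", usernum))
        group_of[usernum] = root_to_group.setdefault(root, len(root_to_group) + 1)
    return group_of


modified_rows = [
    (1, "123", "user1@local.com"),
    (2, "123", "user2@local.com"),
    (3, "245", "user3@local.com"),
    (4, "678", "user3@local.com"),
    (5, "245", "user7@local.com"),
    (6, "678", "user7@local.com"),
]
print(group_transitively(modified_rows))
# {1: 1, 2: 1, 3: 2, 4: 2, 5: 2, 6: 2}
```

With the transitive reading, users 3 to 6 collapse into a single group. If the intended rule is not transitive, the expected numbers for those rows would need to be defined some other way.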
#2
0
Your Python script is a good approach. If you want to move this into MySQL, write a stored procedure that checks, before inserting a record, whether a matching record already exists in the table.
If it exists, get that row's groupnum and assign it to the new record; if not, assign a new groupnum.
But I am still a little confused.
What if the records look like this:
5 | 678 | user1@local.com
5 | 678 | user1@local.com
What should happen in this case?
I assume that both columns (phone and email) are considered when assigning groupnum.
If my assumption is correct, then go with the MySQL procedure.