I have a table set up as follows:
id
origin
destination
carrier_id
so a typical row could be:
100: London Manchester 366
Now each route goes both ways, so there shouldn't be a row like this:
233: Manchester London 366
since that's essentially the same route (for my purposes, anyway).
Unfortunately, though, I have wound up with a handful of duplicates. I have over 50,000 routes made up of around 2,000 points of origin (or destination, however you want to look at it) in the table, so I'm thinking that looping through each point of origin to find duplicates would be insane.
So I don't even know where to start trying to figure out a query to identify them. Any ideas?
3 Answers
#1
I think you just need a double join; the following will identify all the "duplicate" records joined together.
Here's an example.
Say SELECT * FROM FLIGHTS
yielded:
id  origin     destination  carrierid
1   toronto    quebec       1
2   quebec     toronto      2
3   edmonton   calgary      3
4   calgary    edmonton     4
5   hull       vancouver    5
6   vancouver  edmonton     6
7   edmonton   toronto      7
9   edmonton   quebec       8
10  toronto    edmonton     9
11  quebec     edmonton     10
12  calgary    lethbridge   11
So there's a bunch of duplicates (4 of the routes are duplicates of some other route).
select *
from flights t1 inner join flights t2 on t1.origin = t2.destination
AND t2.origin = t1.destination
would yield just the duplicates:
id  origin    destination  carrierid  id  origin    destination  carrierid
1   toronto   quebec       1          2   quebec    toronto      2
2   quebec    toronto      2          1   toronto   quebec       1
3   edmonton  calgary      3          4   calgary   edmonton     4
4   calgary   edmonton     4          3   edmonton  calgary      3
7   edmonton  toronto      7          10  toronto   edmonton     9
9   edmonton  quebec       8          11  quebec    edmonton     10
10  toronto   edmonton     9          7   edmonton  toronto      7
11  quebec    edmonton     10         9   edmonton  quebec       8
At that point you might just delete one row from each pair, say the one with the higher id, so a single copy of each route survives:
delete from flights
where id in (
    select t1.id
    from flights t1 inner join flights t2 on t1.origin = t2.destination
        and t2.origin = t1.destination
    where t1.id > t2.id
)
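One caveat, since the question never says which database this is: MySQL in particular will refuse a DELETE whose subquery reads from the table being deleted from (error 1093). If that bites you, the usual workaround is to wrap the subquery in a derived table, roughly like this (a sketch under that MySQL assumption):

delete from flights
where id in (
    select id from (
        select t1.id
        from flights t1 inner join flights t2 on t1.origin = t2.destination
            and t2.origin = t1.destination
        where t1.id > t2.id
    ) dupes
)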
Good luck!
#2
Bummer! Off the top of my head (and in pseudo-SQL):
select * from (
select id, concat(origin, '_', destination, '_', carrier_id) as key from ....
union
select id, concat(destination, '_', origin, '_', carrier_id) as key from ....
) having count(key) > 1;
For the records above, you'd end up with:
100, London_Manchester_366
100, Manchester_London_366
233, Manchester_London_366
233, London_Manchester_366
That's really, really hackish, and doesn't give you exactly what you're after - it only narrows it down. Maybe it'll give you a starting point? Maybe it'll give someone else some ideas they can offer to help you too.
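If you want something you can actually run, here is a minimal concrete sketch of the same idea, assuming MySQL and the flights table from answer #1: LEAST/GREATEST build a direction-independent key, so the two orderings of a route collapse into one group.

select least(origin, destination)    as city_a,
       greatest(origin, destination) as city_b,
       carrier_id,
       group_concat(id)              as ids,
       count(*)                      as copies
from flights
group by least(origin, destination), greatest(origin, destination), carrier_id
having count(*) > 1;

For the question's two rows this should come back as a single group: London, Manchester, 366 with ids 100 and 233.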
#3
If you don't mind a little shell scripting, and if you can get a dump of the input in the form you've shown here... here's my sample input:
100: London Manchester 366
121: London CityA 240
144: Manchester CityA 300
150: CityA CityB 90
233: Manchester London 366
You might be able to do something like this:
$ cat m.txt | awk '{ if ($2 < $3) print $2, $3, $1; else print $3, $2, $1}' | sort
CityA CityB 150:
CityA London 121:
CityA Manchester 144:
London Manchester 100:
London Manchester 233:
So that you at least have the pairs grouped together. Not sure what would be the best move from there.
Okay, here's a beast of a command line:
$ cat m.txt | awk '{ if ($2 < $3) print $2, $3, $1; else print $3, $2, $1}' | (sort; echo "") | awk '{ if (fst == $1 && snd == $2) { printf "%s%s", num, $3 } else { print fst, snd; fst = $1; snd = $2; num = $3} }' | grep "^[0-9]"
150:151:150:255:CityA CityB
100:233:London Manchester
where m.txt has these new contents:
100: London Manchester 366
121: London CityA 240
144: Manchester CityA 300
150: CityA CityB 90
151: CityB CityA 90
233: Manchester London 366
255: CityA CityB 90
Perl probably would have been a better choice than awk, but here goes: First we sort the two city names and put the ID at the end of the string, which I did in the first section. Then we sort those to group pairs together, and we have to tack on an extra line for the awk script to finish up. Then, we loop over each line in the file. If we see a new pair of cities, we print the cities we previously saw, and we store the new cities and the new ID. If we see the same cities we saw last time, then we print out the ID of the previous line and the ID of this line. Finally, we grep only lines beginning with a number so that we discard non-duplicated pairs.
If a pair occurs more than twice, you'll get a duplicate ID, but that's not such a big deal.
Clear as mud?