I have a table with the following columns:
URL_ID
URL_ADDR
URL_Time
I want to remove duplicates on the URL_ADDR column using a MySQL query.
Is it possible to do such a thing without using any programming?
7 answers
#1
31
Consider the following test case:
CREATE TABLE mytb (url_id int, url_addr varchar(100));
INSERT INTO mytb VALUES (1, 'www.google.com');
INSERT INTO mytb VALUES (2, 'www.microsoft.com');
INSERT INTO mytb VALUES (3, 'www.apple.com');
INSERT INTO mytb VALUES (4, 'www.google.com');
INSERT INTO mytb VALUES (5, 'www.cnn.com');
INSERT INTO mytb VALUES (6, 'www.apple.com');
Our test table now contains:
SELECT * FROM mytb;
+--------+-------------------+
| url_id | url_addr          |
+--------+-------------------+
|      1 | www.google.com    |
|      2 | www.microsoft.com |
|      3 | www.apple.com     |
|      4 | www.google.com    |
|      5 | www.cnn.com       |
|      6 | www.apple.com     |
+--------+-------------------+
6 rows in set (0.00 sec)
Then we can use the multiple-table DELETE syntax as follows:
DELETE t2
FROM mytb t1
JOIN mytb t2 ON (t2.url_addr = t1.url_addr AND t2.url_id > t1.url_id);
... which will delete the duplicate entries, leaving only the row with the lowest url_id for each URL:
SELECT * FROM mytb;
+--------+-------------------+
| url_id | url_addr          |
+--------+-------------------+
|      1 | www.google.com    |
|      2 | www.microsoft.com |
|      3 | www.apple.com     |
|      5 | www.cnn.com       |
+--------+-------------------+
4 rows in set (0.00 sec)
UPDATE - Further to new comments above:
If the duplicate URLs do not all have the same format, you may want to apply the REPLACE() function to remove the www. or http:// parts. For example:
DELETE t2
FROM mytb t1
JOIN mytb t2 ON (REPLACE(t2.url_addr, 'www.', '') =
REPLACE(t1.url_addr, 'www.', '') AND
t2.url_id > t1.url_id);
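The query above only strips www.; to ignore http:// as well, the REPLACE() calls can be nested. This is a minimal sketch against the same mytb test table, and it assumes those are the only two variations you need to normalise. Note that REPLACE() removes every occurrence of the substring, not just a leading one, so it is only suitable for the comparison, not for rewriting the stored value.

```sql
-- Sketch: compare URLs with both 'http://' and 'www.' stripped out.
-- Keeps the row with the lowest url_id for each normalised URL.
DELETE t2
FROM   mytb t1
JOIN   mytb t2
  ON   REPLACE(REPLACE(t2.url_addr, 'http://', ''), 'www.', '') =
       REPLACE(REPLACE(t1.url_addr, 'http://', ''), 'www.', '')
 AND   t2.url_id > t1.url_id;
```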
#2
8
You may want to try the method mentioned at http://labs.creativecommons.org/2010/01/12/removing-duplicate-rows-in-mysql/.
ALTER IGNORE TABLE your_table ADD UNIQUE INDEX `tmp_index` (URL_ADDR);
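One caveat: the IGNORE clause of ALTER TABLE was removed in MySQL 5.7.4, so the statement above only works on older servers. On newer versions a similar effect can probably be had with INSERT IGNORE into a copy of the table. The sketch below is one possible way to do it; the names your_table_dedup and your_table_old are made up for illustration, and it assumes URL_ADDR is a type that can carry a unique index without a prefix length.

```sql
-- Sketch: emulate ALTER IGNORE ... ADD UNIQUE INDEX on MySQL 5.7.4+.
-- your_table_dedup / your_table_old are hypothetical names.
CREATE TABLE your_table_dedup LIKE your_table;

ALTER TABLE your_table_dedup ADD UNIQUE INDEX tmp_index (URL_ADDR);

-- Rows that would violate the unique index are silently skipped,
-- so the first row seen (lowest URL_ID) for each URL_ADDR is kept.
INSERT IGNORE INTO your_table_dedup
SELECT * FROM your_table ORDER BY URL_ID;

-- Swap the tables; drop your_table_old once the result has been checked.
RENAME TABLE your_table TO your_table_old,
             your_table_dedup TO your_table;
```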
#3
5
This will leave the rows with the highest URL_ID for a particular URL_ADDR:
DELETE FROM your_table
WHERE URL_ID NOT IN
    (SELECT ID FROM
        (SELECT MAX(URL_ID) AS ID
         FROM your_table
         WHERE URL_ID IS NOT NULL
         GROUP BY URL_ADDR) X);
/* Sounds like you would need to GROUP BY a calculated form,
   e.g. using REPLACE to strip out www; see Daniel's answer. */
(The derived table 'X' is to avoid the error "You can't specify target table 'tablename' for update in FROM clause")
#4
3
Well, you could always:
- create a temporary table;
- INSERT INTO ... SELECT DISTINCT into the temp table from the original table;
- clear the original table;
- INSERT INTO ... SELECT into the original table from the temp table;
- drop the temp table.
It's clumsy and awkward, and requires several queries (not to mention privileges), but it will do the trick if you don't find another solution.
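The steps above could look roughly like this, using the mytb test table from answer #1. The temp table name mytb_dedup is made up for illustration. One deliberate change: a plain SELECT DISTINCT over all columns would not collapse anything here, because duplicate URLs have different url_id values, so this sketch groups by url_addr and keeps the lowest url_id instead.

```sql
-- Sketch of the temp-table approach; mytb_dedup is a hypothetical name.
CREATE TEMPORARY TABLE mytb_dedup (url_id int, url_addr varchar(100));

-- One row per url_addr, keeping the lowest url_id.
INSERT INTO mytb_dedup
SELECT MIN(url_id), url_addr
FROM   mytb
GROUP  BY url_addr;

-- Clear the original table and copy the de-duplicated rows back.
DELETE FROM mytb;

INSERT INTO mytb
SELECT url_id, url_addr FROM mytb_dedup;

DROP TEMPORARY TABLE mytb_dedup;
```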
#5
1
Daniel Vassallo, how would this be done for multiple columns?
DELETE t2 FROM directory1 t1 JOIN directory1 t2 ON (t2.page = t1.page AND t2.parentTopic = t1.parentTopic AND t2.title = t1.title AND t2.description = t1.description AND t2.linktype = t1.linktype AND t2.priority = t1.priority AND t2.linkID > t1.linkID);
Maybe like this?
#6
0
You can GROUP BY URL_ADDR, which will effectively give you one row per distinct value in the URL_ADDR field. Note that this only selects the de-duplicated rows; it does not delete anything from the table.
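SELECT
    MIN(URL_ID)   AS URL_ID,    -- lowest URL_ID in each group
    URL_ADDR,
    MIN(URL_Time) AS URL_Time   -- smallest URL_Time in each group
FROM
    some_table
GROUP BY
    URL_ADDR;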
Enjoy!
#7
0
This will work provided that your URL_ID column is unique.
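DELETE FROM url WHERE URL_ID IN (
SELECT URL_ID FROM (
SELECT DISTINCT a.URL_ID
FROM url a INNER JOIN (
SELECT URL_ADDR, MAX(URL_ID) MaxURLId
FROM url
GROUP BY URL_ADDR
HAVING COUNT(*) > 1) b ON a.URL_ID <> b.MaxURLId AND a.URL_ADDR = b.URL_ADDR
) x  /* derived table, as in answer #3, to avoid the "can't specify target table" error */
);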