How do I clean up my join_table and remove duplicate entries?

Time: 2021-05-20 07:34:30

I have two models, Question and Tag, with a HABTM (has_and_belongs_to_many) association between them through the join table questions_tags.

Feast your eyes on this badboy:

1.9.3p392 :011 > Question.count
   (852.1ms)  SELECT COUNT(*) FROM "questions" 
 => 417 
1.9.3p392 :012 > Tag.count
   (197.8ms)  SELECT COUNT(*) FROM "tags" 
 => 601 
1.9.3p392 :013 > Question.connection.execute("select count(*) from questions_tags").first["count"].to_i
   (648978.7ms)  select count(*) from questions_tags
 => 39919778 

I am assuming that the questions_tags join table contains a bunch of duplicate records - otherwise, I have no idea why it would be so large.
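
A rough sanity check supports this. With 417 questions and 601 tags there can be at most 417 × 601 distinct (question_id, tag_id) pairs, so nearly every row must be redundant. This assumes each pair should appear only once and that every row references an existing question and tag. A minimal Ruby sketch of that arithmetic, using the counts from the console output above (the variable names are only illustrative):

```ruby
# Counts copied from the console session above.
question_count = 417
tag_count      = 601
join_row_count = 39_919_778

# Upper bound on distinct (question_id, tag_id) pairs,
# assuming every row points at an existing question and tag.
max_unique_pairs = question_count * tag_count

# Lower bound on the number of redundant rows under that assumption.
min_duplicate_rows = join_row_count - max_unique_pairs

puts max_unique_pairs    # 250617
puts min_duplicate_rows  # 39669161
```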

How do I clean up that join table so that it contains only unique rows? Or, for that matter, how do I check whether it contains duplicate records at all?

Edit 1

I am using PostgreSQL. This is the schema for the questions_tags join table:

  create_table "questions_tags", :id => false, :force => true do |t|
    t.integer "question_id"
    t.integer "tag_id"
  end

  add_index "questions_tags", ["question_id"], :name => "index_questions_tags_on_question_id"
  add_index "questions_tags", ["tag_id"], :name => "index_questions_tags_on_tag_id"

2 Solutions

#1


2  

I'm adding this as a new answer because it is quite different from my previous one. This approach doesn't assume that the join table has an id column. It creates a new table, selects the unique rows into it, then drops the old table and renames the new one. On a table this size, that should be much faster than anything involving a subselect. Note that the rebuilt table does not inherit the old table's indexes, so they need to be recreated afterwards.

foo=# select * from questions_tags;
 question_id | tag_id
-------------+--------
           1 |      2
           2 |      1
           2 |      2
           1 |      1
           1 |      1
(5 rows)

foo=# select distinct question_id, tag_id into questions_tags_tmp from questions_tags;
SELECT 4
foo=# select * from questions_tags_tmp;
 question_id | tag_id
-------------+--------
           2 |      2
           1 |      2
           2 |      1
           1 |      1
(4 rows)

foo=# drop table questions_tags;
DROP TABLE
foo=# alter table questions_tags_tmp rename to questions_tags;
ALTER TABLE
foo=# select * from questions_tags;
 question_id | tag_id
-------------+--------
           2 |      2
           1 |      2
           2 |      1
           1 |      1
(4 rows)
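
As for checking whether duplicates exist before rebuilding: grouping by both columns and keeping the groups with more than one row exposes them, which in SQL would be `SELECT question_id, tag_id, COUNT(*) FROM questions_tags GROUP BY question_id, tag_id HAVING COUNT(*) > 1;`. The same logic as a plain Ruby sketch over the five sample rows above, with an in-memory array standing in for the table (`duplicate_pairs` and `distinct_rows` are illustrative helper names, not part of Rails or any library):

```ruby
# Sample rows from the psql session above, as [question_id, tag_id] pairs.
rows = [[1, 2], [2, 1], [2, 2], [1, 1], [1, 1]]

# Rough equivalent of GROUP BY question_id, tag_id HAVING COUNT(*) > 1:
# maps each pair that occurs more than once to its number of occurrences.
def duplicate_pairs(rows)
  counts = Hash.new(0)
  rows.each { |pair| counts[pair] += 1 }
  counts.select { |_pair, count| count > 1 }
end

# Rough equivalent of SELECT DISTINCT question_id, tag_id.
def distinct_rows(rows)
  rows.uniq
end

p duplicate_pairs(rows)     # only the pair [1, 1] is reported, with a count of 2
p distinct_rows(rows).size  # 4, matching "SELECT 4" in the session above
```

On the real table the check should be run in the database rather than by loading 39 million rows into Ruby; the sketch only shows what the query computes.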

#2


1  

Delete tag associations that reference a nonexistent tag

DELETE  FROM questions_tags
WHERE   NOT EXISTS ( SELECT  1 
                 FROM    tags
                 WHERE   tags.id = questions_tags.tag_id);

Delete tag associations that reference a nonexistent question

DELETE  FROM questions_tags
WHERE   NOT EXISTS ( SELECT  1 
                 FROM    questions
                 WHERE   questions.id = questions_tags.question_id);

Delete duplicate tag associations

-- questions_tags has no id or user_id column (see the schema in the question),
-- so each physical row is identified by PostgreSQL's ctid system column.
-- For every (question_id, tag_id) pair, the row with the lowest ctid is kept.
DELETE  FROM questions_tags
USING   questions_tags qt2
WHERE   questions_tags.question_id = qt2.question_id AND
        questions_tags.tag_id = qt2.tag_id AND
        questions_tags.ctid > qt2.ctid;

Note:

Please test these SQL statements in your development environment before running them in production.
