I have two models, Question and Tag, with a HABTM (has_and_belongs_to_many) association between them through the join table questions_tags.
Feast your eyes on this bad boy:
1.9.3p392 :011 > Question.count
(852.1ms) SELECT COUNT(*) FROM "questions"
=> 417
1.9.3p392 :012 > Tag.count
(197.8ms) SELECT COUNT(*) FROM "tags"
=> 601
1.9.3p392 :013 > Question.connection.execute("select count(*) from questions_tags").first["count"].to_i
(648978.7ms) select count(*) from questions_tags
=> 39919778
I am assuming that the questions_tags join table contains a bunch of duplicate records; otherwise, I have no idea why it would be so large.
How do I clean up that join table so that it only has unique content? Or how do I even check whether there are duplicate records in there?
Edit 1
I am using PostgreSQL. This is the schema for the join table questions_tags:
create_table "questions_tags", :id => false, :force => true do |t|
  t.integer "question_id"
  t.integer "tag_id"
end
add_index "questions_tags", ["question_id"], :name => "index_questions_tags_on_question_id"
add_index "questions_tags", ["tag_id"], :name => "index_questions_tags_on_tag_id"
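The counts above already answer part of the question. As a rough sketch, assuming every row in questions_tags references an existing question and an existing tag, the number of distinct pairs cannot exceed questions times tags, so anything beyond that must be a repeat. The constant name DUPLICATE_PAIRS_SQL is made up for illustration; the query it holds lists the pairs that are stored more than once.

```ruby
# Counts copied from the console session in the question.
questions = 417
tags      = 601
join_rows = 39_919_778

# Upper bound on distinct (question_id, tag_id) pairs, assuming no row
# points at a question or tag that no longer exists.
max_distinct_pairs = questions * tags
min_duplicate_rows = join_rows - max_distinct_pairs

puts "at most #{max_distinct_pairs} distinct pairs"    # 250617
puts "at least #{min_duplicate_rows} redundant rows"   # 39669161

# Hypothetical constant name. The query shows the 20 most repeated pairs.
DUPLICATE_PAIRS_SQL = <<-SQL
  SELECT question_id, tag_id, COUNT(*) AS copies
  FROM questions_tags
  GROUP BY question_id, tag_id
  HAVING COUNT(*) > 1
  ORDER BY copies DESC
  LIMIT 20
SQL

# In a Rails console, following the style used in the question:
#   Question.connection.execute(DUPLICATE_PAIRS_SQL).each { |row| p row }
```

On a table this size the GROUP BY has to scan every row, so expect it to take at least as long as the count(*) shown above.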
2 Answers
#1
2
I'm adding this as a new answer since it's quite different from my last one. This one doesn't assume that you have an id column on the join table. It creates a new table, selects the unique rows into it, then drops the old table and renames the new one. This should be much faster than anything involving a subselect.
foo=# select * from questions_tags;
question_id | tag_id
-------------+--------
1 | 2
2 | 1
2 | 2
1 | 1
1 | 1
(5 rows)
foo=# select distinct question_id, tag_id into questions_tags_tmp from questions_tags;
SELECT 4
foo=# select * from questions_tags_tmp;
question_id | tag_id
-------------+--------
2 | 2
1 | 2
2 | 1
1 | 1
(4 rows)
foo=# drop table questions_tags;
DROP TABLE
foo=# alter table questions_tags_tmp rename to questions_tags;
ALTER TABLE
foo=# select * from questions_tags;
question_id | tag_id
-------------+--------
2 | 2
1 | 2
2 | 1
1 | 1
(4 rows)
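One thing the session above does not show: a table created with SELECT ... INTO has no indexes, and DROP TABLE removes the two indexes listed in the question's schema along with the old table. A minimal sketch of the full sequence follows, with the two CREATE INDEX statements added as an assumption about what the asker wants restored. REBUILD_STATEMENTS is a made-up name, and running everything in a single transaction is a suggestion, not part of the original answer.

```ruby
# Statements for the rebuild, in order. The first three mirror the psql
# session above; the two CREATE INDEX lines are an addition that restores
# the indexes from the schema posted in the question.
REBUILD_STATEMENTS = [
  "SELECT DISTINCT question_id, tag_id INTO questions_tags_tmp FROM questions_tags",
  "DROP TABLE questions_tags",
  "ALTER TABLE questions_tags_tmp RENAME TO questions_tags",
  "CREATE INDEX index_questions_tags_on_question_id ON questions_tags (question_id)",
  "CREATE INDEX index_questions_tags_on_tag_id ON questions_tags (tag_id)"
]

# In a Rails console (assumed), PostgreSQL's transactional DDL lets the whole
# swap succeed or roll back as a unit:
#   conn = Question.connection
#   conn.transaction { REBUILD_STATEMENTS.each { |sql| conn.execute(sql) } }

# Standalone, this just prints the statements so they can be pasted into psql.
REBUILD_STATEMENTS.each { |sql| puts "#{sql};" }
```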
#2
1
Delete tag associations that reference a nonexistent tag:
DELETE FROM questions_tags
WHERE NOT EXISTS ( SELECT 1
FROM tags
WHERE tags.id = questions_tags.tag_id);
Delete tag associations that reference a nonexistent question:
DELETE FROM questions_tags
WHERE NOT EXISTS ( SELECT 1
FROM questions
WHERE questions.id = questions_tags.question_id);
Delete duplicate tag associations, keeping the row with the lowest id for each pair (this assumes the join table has an id column, which the schema posted in the question does not):
DELETE FROM questions_tags
USING ( SELECT qt3.tag_id, qt3.question_id, MIN(qt3.id) AS id
FROM questions_tags qt3
GROUP BY qt3.tag_id, qt3.question_id
) qt2
WHERE questions_tags.tag_id = qt2.tag_id AND
questions_tags.question_id = qt2.question_id AND
questions_tags.id != qt2.id;
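Because the table in the question was created with :id => false, the duplicate-removal statement above has no id column to compare. One possible workaround, offered here as an untested sketch and not as this answer's method, is to compare PostgreSQL's ctid system column, which identifies a row's physical location. DEDUPE_BY_CTID_SQL is a hypothetical name.

```ruby
# Hypothetical constant name. For every (question_id, tag_id) pair, a row is
# deleted when another row with the same pair has a smaller ctid, so the row
# with the smallest ctid in each group survives.
# Rows with a NULL question_id or tag_id never match the = comparisons and
# are left untouched.
DEDUPE_BY_CTID_SQL = <<-SQL
  DELETE FROM questions_tags a
  USING questions_tags b
  WHERE a.question_id = b.question_id
    AND a.tag_id = b.tag_id
    AND a.ctid > b.ctid
SQL

# In a Rails console, following the style used in the question:
#   Question.connection.execute(DEDUPE_BY_CTID_SQL)
puts DEDUPE_BY_CTID_SQL
```

With roughly 40 million rows and many copies of each pair, this self-join compares every copy against every other copy in its group, so rebuilding the table from distinct rows is likely to be the faster route.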
Note: Please test these SQL statements in your development environment before running them in production.