Estimate/speed up a huge table self-join on MySQL

Time: 2021-02-08 15:52:22

I have a huge table:

 CREATE TABLE `messageline` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `hash` bigint(20) DEFAULT NULL,
  `quoteLevel` int(11) DEFAULT NULL,
  `messageDetails_id` bigint(20) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `FK2F5B707BF7C835B8` (`messageDetails_id`),
  KEY `hash_idx` (`hash`),
  KEY `quote_level_idx` (`quoteLevel`),
  CONSTRAINT `FK2F5B707BF7C835B8` FOREIGN KEY (`messageDetails_id`) REFERENCES `messagedetails` (`id`) ON DELETE NO ACTION ON UPDATE NO ACTION
) ENGINE=InnoDB AUTO_INCREMENT=401798068 DEFAULT CHARSET=utf8 COLLATE=utf8_bin

I need to find duplicate lines this way:

create table foundline AS
select ml.messagedetails_id, ml.hash, ml.quotelevel
from messageline ml,
     messageline ml1
where ml1.hash = ml.hash
  and ml1.messagedetails_id!=ml.messagedetails_id

But this query has been running for more than a day already, which is too long; a few hours would be acceptable. How can I speed this up? Thanks.

Explain:

+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
| id | select_type | table | type | possible_keys | key      | key_len | ref           | rows      | Extra       |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+
|  1 | SIMPLE      | ml    | ALL  | hash_idx      | NULL     | NULL    | NULL          | 401798409 |             |
|  1 | SIMPLE      | ml1   | ref  | hash_idx      | hash_idx | 9       | skryb.ml.hash |         1 | Using where |
+----+-------------+-------+------+---------------+----------+---------+---------------+-----------+-------------+

2 solutions

#1

You can find your duplicates like this:

SELECT messagedetails_id, COUNT(*) c
FROM messageline ml
GROUP BY messagedetails_id HAVING c > 1;

If it is still too slow, add a condition to split the query into ranges over an indexed field:

WHERE messagedetails_id < 100000
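
A minimal sketch of automating that batching, assuming the MySQL Connector/Python driver; the connection settings are placeholders (the schema name skryb is taken from the EXPLAIN output above). It walks messagedetails_id in fixed-width ranges so each query only touches a bounded slice of the index:

# Batch the duplicate scan over an indexed column so each query
# covers a bounded range instead of the whole ~400M-row table.
import mysql.connector

BATCH = 1_000_000  # range width per query; tune to your hardware

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="skryb")
cur = conn.cursor()

cur.execute("SELECT MAX(messagedetails_id) FROM messageline")
max_id = cur.fetchone()[0] or 0

duplicates = []
for lo in range(0, max_id + 1, BATCH):
    cur.execute(
        "SELECT messagedetails_id, COUNT(*) c "
        "FROM messageline "
        "WHERE messagedetails_id >= %s AND messagedetails_id < %s "
        "GROUP BY messagedetails_id HAVING c > 1",
        (lo, lo + BATCH),
    )
    duplicates.extend(cur.fetchall())

cur.close()
conn.close()

Each batch stays on the messagedetails_id index, so no single query has to scan the full table.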

#2

Is it required to do this solely with SQL? For such a number of records you would be better off breaking this down into two steps:

  1. First run the following query
    
     CREATE TABLE duplicate_hashes
     SELECT * FROM (
       SELECT hash, GROUP_CONCAT(id) AS ids, COUNT(*) AS cnt,
       COUNT(DISTINCT messagedetails_id) AS cnt_message_details,
       GROUP_CONCAT(DISTINCT messagedetails_id) as messagedetails_ids
       FROM messageline GROUP BY hash HAVING cnt > 1 ORDER BY NULL
     ) tmp 
     WHERE cnt > cnt_message_details
     
    This will give you the duplicate IDs for each hash, and since you have an index on the hash field, the grouping will be relatively fast. By counting distinct messagedetails_id values and comparing them to the total count, you implicitly fulfill the requirement for different messagedetails_id values:
    
     where ml1.hash = ml.hash
     and ml1.messagedetails_id!=ml.messagedetails_id
     
  2. Use a script to check each record of the duplicate_hashes table; a sketch of such a script follows this list.
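
A minimal sketch of such a checking script, again assuming MySQL Connector/Python with placeholder connection settings; it reads each duplicate_hashes row and splits the GROUP_CONCAT columns so every hash's duplicate ids can be handled one group at a time (the keep/drop policy at the end is only an illustration):

# Walk duplicate_hashes and expand the GROUP_CONCAT columns so the
# duplicate line ids of each hash can be processed group by group.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="app",
                               password="secret", database="skryb")
cur = conn.cursor()

cur.execute("SELECT hash, ids, messagedetails_ids FROM duplicate_hashes")
for hash_value, ids, md_ids in cur.fetchall():
    line_ids = sorted(int(i) for i in ids.split(","))
    detail_ids = [int(i) for i in md_ids.split(",")]
    # Illustrative policy: keep the lowest line id, mark the rest.
    keep, drop = line_ids[0], line_ids[1:]
    print(hash_value, "keep", keep, "drop", drop, "details", detail_ids)

cur.close()
conn.close()

One caveat: GROUP_CONCAT output is truncated at group_concat_max_len (1024 bytes by default), so for hashes with many duplicates you may need to raise that setting before running step 1.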
