How do I efficiently find duplicate blob rows in MySQL?

Time: 2022-03-23 19:42:33

I have a table of the form

CREATE TABLE data
(
   pk INT PRIMARY KEY AUTO_INCREMENT,
   dt BLOB
);

It has about 160,000 rows and about 2GB of data in the blob column (avg. 14kb per blob). Another table has foreign keys into this table.

Something like 3,000 of the blobs are identical. So what I want is a query that will give me a remap table (duplicate pk → surviving pk) that will allow me to remove the duplicates.

The naive approach took about an hour on 30-40k rows:

SELECT a.pk, MIN(b.pk) 
    FROM data AS a 
    JOIN data AS b
  ON a.dt=b.dt
  WHERE b.pk < a.pk
  GROUP BY a.pk;
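
Once a remap table is populated from a query like the one above, repointing the referencing table and deleting the duplicates could look like the following sketch. Note that `child`, `ref`, and `remap` are hypothetical names, since the referencing table isn't shown in the question:

```sql
-- Hypothetical names: child(ref) is the table with foreign keys into data(pk),
-- remap(dup_pk, keep_pk) holds pairs from the duplicate-finding query.
UPDATE child
  JOIN remap ON child.ref = remap.dup_pk
   SET child.ref = remap.keep_pk;

-- Then the duplicate blobs can go (MySQL multi-table DELETE syntax).
DELETE data
  FROM data
  JOIN remap ON data.pk = remap.dup_pk;
```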

I happen to have, for other reasons, a table that has the sizes of the blobs:

CREATE TABLE sizes
(
   fk INT,  -- note: non-unique
   sz INT
   -- other cols
);

By building one index on fk and another on sz, the direct query from that table takes about 24 sec with 50k rows:

SELECT da.pk, MIN(db.pk) 
  FROM data AS da
  JOIN data AS db
  JOIN sizes AS sa
  JOIN sizes AS sb
  ON
        sa.sz=sb.sz
    AND da.pk=sa.fk
    AND db.pk=sb.fk
  WHERE
        sb.fk<sa.fk
    AND da.dt=db.dt 
  GROUP BY da.pk;

However, that is doing a full table scan on da (the data table). Given that the hit rate should be fairly low, I'd think that an index scan would be better. With that in mind I added a third copy of data as a fifth join to get that, and lost about 3 sec.
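
As an alternative to adding an extra self-join to coax the optimizer, MySQL's index hints can request an index access path directly. Whether it actually beats the full scan here would need checking with EXPLAIN; this is a sketch of the same query using a hint:

```sql
-- Sketch: FORCE INDEX tells the optimizer to use the named index on da.
-- Verify with EXPLAIN whether this actually avoids the full table scan.
SELECT da.pk, MIN(db.pk)
  FROM data AS da FORCE INDEX (PRIMARY)
  JOIN sizes AS sa ON da.pk = sa.fk
  JOIN sizes AS sb ON sa.sz = sb.sz AND sb.fk < sa.fk
  JOIN data  AS db ON db.pk = sb.fk
 WHERE da.dt = db.dt
 GROUP BY da.pk;
```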

OK, so for the question: am I going to do much better than the second SELECT? If so, how?

A bit of a corollary: if I have a table whose key columns get very heavy use but whose remaining columns should only rarely be used, will I ever be better off adding another join of that table to encourage an index scan over a full table scan?


Xgc on #mysql@irc.freenode.net points out that adding a utility table like sizes, but with a unique constraint on fk, might help a lot. Some fun with triggers and whatnot might even make it not too bad to keep up to date.
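
A minimal sketch of that idea, assuming a hypothetical table name blob_sizes (the trigger bodies are single statements, so no DELIMITER change is needed):

```sql
-- Hypothetical helper table: exactly one size row per data row.
CREATE TABLE blob_sizes
(
   fk INT PRIMARY KEY,   -- unique, unlike sizes.fk
   sz INT,
   INDEX (sz)
);

-- Keep it up to date automatically as data changes.
CREATE TRIGGER data_size_ins AFTER INSERT ON data
  FOR EACH ROW INSERT INTO blob_sizes (fk, sz) VALUES (NEW.pk, LENGTH(NEW.dt));

CREATE TRIGGER data_size_upd AFTER UPDATE ON data
  FOR EACH ROW UPDATE blob_sizes SET sz = LENGTH(NEW.dt) WHERE fk = NEW.pk;

CREATE TRIGGER data_size_del AFTER DELETE ON data
  FOR EACH ROW DELETE FROM blob_sizes WHERE fk = OLD.pk;
```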

1 solution

#1
You can always use a hashing function (MD5 or SHA1) for your data and then compare the hashes.

The question is whether you can save the hashes in your database.
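
One way to apply that suggestion inside MySQL itself — a sketch, assuming you can ALTER the table, and relying on the final byte-wise blob comparison to guard against hash collisions (dt_hash and idx_data_dt_hash are hypothetical names):

```sql
-- Add, populate, and index a hash column (SHA1 yields 40 hex chars).
ALTER TABLE data ADD COLUMN dt_hash CHAR(40);
UPDATE data SET dt_hash = SHA1(dt);
CREATE INDEX idx_data_dt_hash ON data (dt_hash);

-- Remap query: join on the cheap fixed-width hash first,
-- then confirm with the blob comparison to rule out collisions.
SELECT a.pk, MIN(b.pk)
  FROM data AS a
  JOIN data AS b
    ON a.dt_hash = b.dt_hash
   AND a.dt = b.dt
 WHERE b.pk < a.pk
 GROUP BY a.pk;
```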
