MySQL query performance with subselects - tables with millions of rows

Time: 2021-05-27 09:17:05

I know there have been a lot of questions about SQL query performance improvement, but I was not able to use the answers to those questions to improve my queries' performance (enough).

Since I wanted something more flexible than rsync & fslint, I've written a little Java tool that walks file trees and stores paths & checksums in a MySQL database.

You'll find my table structure here: http://code.google.com/p/directory-scanner/source/browse/trunk/sql/create_table.sql - at first I only had one table, but then I thought I could save a lot of space if I moved the redundant, quite long directory path strings into a separate table and made it a 1:n relationship.
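
For reference, a minimal sketch of what the two tables look like - the column types here are guesses from the description and from the EXPLAIN output further down (the key_len of 23 on files_sha1 suggests a nullable VARBINARY(20)); see the linked create_table.sql for the real definitions:

CREATE TABLE directories (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    path VARCHAR(1000) NOT NULL  -- the long, redundant strings moved out of files
);

CREATE TABLE files (
    id           INT AUTO_INCREMENT PRIMARY KEY,
    dir_id       INT NOT NULL,   -- n:1 relationship to directories
    scanDir_id   INT,
    filename     VARCHAR(255) NOT NULL,
    size         BIGINT NOT NULL,
    scandate     DATETIME NOT NULL,
    lastmodified DATETIME NOT NULL,
    sha1         VARBINARY(20),  -- raw 20-byte SHA-1 checksum
    KEY dir_id (dir_id),
    KEY scanDir_id (scanDir_id),
    FOREIGN KEY (dir_id) REFERENCES directories (id)
);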

I've defined those two indexes:

CREATE INDEX files_sha1 ON files (sha1);
CREATE INDEX files_size ON files (size);

Now the queries that bug me are these: http://code.google.com/p/directory-scanner/source/browse/trunk/sql/reporingQueries.sql

The worst of them is the last one, which should with very high probability always return an empty set (it looks for sha1 collisions and files mistakenly inserted multiple times):

SELECT 
    d.path, 
    d.id, 
    f.filename, 
    f.id, 
    f.size, 
    f.scandate, 
    f.sha1, 
    f.lastmodified 
FROM files f 
INNER JOIN directories d 
    ON d.id = f.dir_id 
WHERE EXISTS ( /* same sha1 but different size */ 
    SELECT ff.id 
    FROM files ff 
    WHERE ff.sha1 = f.sha1 
    AND ff.size <> f.size 
) 
OR EXISTS ( /* files with same name and path but different id */ 
    SELECT ff2.id 
    FROM files ff2 
    INNER JOIN directories dd2 
        ON dd2.id = ff2.dir_id 
    WHERE ff2.id <> f.id 
    AND ff2.filename = f.filename 
    AND dd2.path = d.path 
) 
ORDER BY f.sha1

It ran well enough, in less than a second, as long as I had only 20k rows (after creating my indexes), but now that I have 750k rows it literally runs for hours, and MySQL completely occupies one of my CPU cores the whole time.

EXPLAIN for this query gives this result:

id ; select_type ; table ; type ; possible_keys ; key ; key_len ; ref ; rows ; filtered ; Extra
1 ; PRIMARY ; d ; ALL ; PRIMARY ; NULL ; NULL ; NULL ; 56855 ; 100.0 ; Using temporary; Using filesort
1 ; PRIMARY ; f ; ref ; dir_id ; dir_id ; 4 ; files.d.id ; 13 ; 100.0 ; Using where
3 ; DEPENDENT SUBQUERY ; dd2 ; ALL ; PRIMARY ; NULL ; NULL ; NULL ; 56855 ; 100.0 ; Using where
3 ; DEPENDENT SUBQUERY ; ff2 ; ref ; dir_id ; dir_id ; 4 ; files.dd2.id ; 13 ; 100.0 ; Using where
2 ; DEPENDENT SUBQUERY ; ff ; ref ; files_sha1 ; files_sha1 ; 23 ; files.f.sha1 ; 1 ; 100.0 ; Using where

My other queries are also not quick with 750k rows, but they at least finish within 15 minutes or so (however, I would like them to work with millions of rows as well...).

UPDATE: Thanks radashk for the comment, but the indexes you suggested seem to have been created automatically by MySQL -->

"Table","Non_unique","Key_name","Seq_in_index","Column_name","Collation","Cardinality","Sub_part","Packed","Null","Index_type","Comment","Index_comment"
"files","0","PRIMARY","1","id","A","698397","NULL","NULL",,"BTREE",,
"files","1","dir_id","1","dir_id","A","53722","NULL","NULL",,"BTREE",,
"files","1","scanDir_id","1","scanDir_id","A","16","NULL","NULL","YES","BTREE",,
"files","1","files_sha1","1","sha1","A","698397","NULL","NULL","YES","BTREE",,
"files","1","files_size","1","size","A","174599","NULL","NULL",,"BTREE",,

UPDATE2: Thanks Eugen Rieck! I consider your answer a good replacement for this query; since it will most likely return an empty set anyway, I will just select the data to display to the user (to describe the problem) later, in another query. To make me really happy, it would be great if someone could take a look at my other queries as well :D

UPDATE3: The answer from Justin Swanhart inspired me to the following solution: instead of running queries to check for directories and files that were unintentionally inserted multiple times, just create unique constraints like these:

ALTER TABLE directories ADD CONSTRAINT uc_dir_path UNIQUE (path);
ALTER TABLE files ADD CONSTRAINT uc_files UNIQUE(dir_id, filename);

However, I wonder how much this would negatively affect the performance of insert statements - could somebody comment on this, please?
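
(For illustration: with the constraints in place, a separate duplicate pre-check at insert time isn't strictly needed, since the constraint itself catches duplicates. A sketch of the two common patterns - the column list is assumed from the queries above, not taken from the tool:)

-- Variant 1: silently skip rows that would violate uc_files
INSERT IGNORE INTO files (dir_id, filename, size, scandate, sha1, lastmodified)
VALUES (?, ?, ?, ?, ?, ?);

-- Variant 2: refresh the existing row when the same (dir_id, filename) turns up again
INSERT INTO files (dir_id, filename, size, scandate, sha1, lastmodified)
VALUES (?, ?, ?, ?, ?, ?)
ON DUPLICATE KEY UPDATE
    size = VALUES(size),
    scandate = VALUES(scandate),
    sha1 = VALUES(sha1),
    lastmodified = VALUES(lastmodified);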

UPDATE4:

ALTER TABLE directories ADD CONSTRAINT uc_dir_path UNIQUE (path);

doesn't work, since the path values are too long for a unique index:

ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes
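
(A common workaround would be a unique index on a fixed-width hash of the path instead - not what I ended up doing below; the column and constraint names here are made up for illustration:)

ALTER TABLE directories ADD COLUMN path_sha1 BINARY(20);
UPDATE directories SET path_sha1 = UNHEX(SHA1(path));  -- backfill existing rows
ALTER TABLE directories ADD CONSTRAINT uc_dir_path_sha1 UNIQUE (path_sha1);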

UPDATE5:

Okay, this is the solution I'm going to use to replace the query quoted in my initial question above:

For the first part, finding sha1 collisions, I will use this:

SELECT sha1
FROM files
GROUP BY sha1
HAVING COUNT(*)>1
AND MIN(size)<>MAX(size)

And if it returns anything, I will select the details with another query using WHERE sha1 = ?
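
That follow-up query would look something like this (a sketch based on the columns of the original query):

SELECT d.path, d.id, f.filename, f.id, f.size, f.scandate, f.sha1, f.lastmodified
FROM files f
INNER JOIN directories d ON d.id = f.dir_id
WHERE f.sha1 = ?
ORDER BY d.path, f.filename;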

I guess this query will run best with this index defined:

CREATE INDEX sha1_size ON files (sha1, size);

For verifying that no duplicated directories exist, I will use this query, since MySQL doesn't allow a unique constraint there (see UPDATE4 above):

SELECT path
FROM directories
GROUP BY path
HAVING COUNT(*)>1

And for the duplicated files I will try to create this constraint:

CREATE UNIQUE INDEX filename_dir ON files (filename, dir_id);

This runs quite fast (15 to 20 sec), and I don't need to create other indexes beforehand to make it faster. Also, the error message contains the details I need to display the problem to the user (which is unlikely to happen anyway, since I check for those things before inserting).
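
(For illustration: when duplicates do exist, MySQL rejects the index creation with an error that names the offending values, roughly like the following - the entry shown is made up:)

ERROR 1062 (23000): Duplicate entry 'report.txt-4711' for key 'filename_dir'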

Now there are only 5 more queries left to make perform in less time ;) Thanks for the great help so far, Eugen & Justin!

UPDATE6: Okay, since it's been a few days since the last response from anybody, I'm just going to accept Justin's answer, since that was the one that helped me the most. I incorporated what I learned from both of you into my app and released version 0.0.4 here: http://code.google.com/p/directory-scanner/downloads/detail?name=directory-scanner-0.0.4-jar-with-dependencies.jar

3 Answers

#1

While I can't verify without building and populating your tables, I'd try something like:

-- This checks the SHA1 collisions
SELECT
  MIN(id) AS id
FROM files
GROUP BY sha1
HAVING COUNT(*)>1
AND MIN(size)<>MAX(size)

-- This checks for directory duplicates
SELECT
  MIN(path) AS path
FROM directories
GROUP BY path
HAVING COUNT(*)>1

-- This checks for file duplicates
SELECT
  MIN(f.id) AS id
FROM files AS f
INNER JOIN files AS ff 
   ON f.dir_id=ff.dir_id
   AND f.filename=ff.filename
GROUP BY f.id
HAVING COUNT(*)>1

Run one after the other.

Edit

The 3rd query was bogus - sorry for that.

#2

Instead of subqueries, try using a UNION of two well-indexed queries with joins.

First you will need two indexes (based on the schema in the create_table.sql you provided):

ALTER TABLE files ADD KEY (sha1, size);
ALTER TABLE files ADD KEY (filename, dir_id);

Then you will need to rewrite the query:

(SELECT 
    d.path, 
    d.id, 
    f.filename, 
    f.id, 
    f.size, 
    f.scandate, 
    f.sha1, 
    f.lastmodified 
FROM files f 
INNER JOIN directories d 
    ON d.id = f.dir_id 
INNER JOIN files files2
    USING(sha1)
WHERE files2.size != f.size)

UNION

(SELECT 
    d.path, 
    d.id, 
    f.filename, 
    f.id, 
    f.size, 
    f.scandate, 
    f.sha1, 
    f.lastmodified 
FROM files f 
INNER JOIN directories d 
    ON d.id = f.dir_id
INNER JOIN files files2
    ON files2.id != f.id
   AND files2.filename = f.filename
INNER JOIN directories d2
   ON files2.dir_id = d2.id
  AND d2.path = d.path)

#3

You are doing a kind of cross join in your second subquery. Try changing the second subquery to:

SELECT ff2.id 
FROM files ff2 
WHERE ff2.id <> f.id 
AND ff2.dir_id = d.id 
AND ff2.filename = f.filename 

and create an index over (dir_id, filename) on the files table.
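
In SQL that would be something like (the index name here is arbitrary):

CREATE INDEX files_dir_filename ON files (dir_id, filename);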
