How to find duplicate files in a large file system while avoiding a MemoryError

Time: 2021-11-27 01:13:44

I am trying to avoid duplicates in my mp3 collection (quite large). I want to check for duplicates by comparing file contents instead of looking for identical file names. I have written the code below to do this, but it throws a MemoryError after about a minute. Any suggestions on how I can get this working?


import os
import hashlib

walk = os.walk('H:\MUSIC NEXT GEN')

mySet = set()
dupe  = []

hasher = hashlib.md5()

for dirpath, subdirs, files in walk:
    for f in files:
        fileName =  os.path.join(dirpath, f)
        with open(fileName, 'rb') as mp3:
            buf = mp3.read()
            hasher.update(buf)
            hashKey = hasher.hexdigest()
            print hashKey
            if hashKey in mySet:
                dupe.append(fileName)
            else:
                mySet.add(hashKey)


print 'Dupes: ' + str(dupe)

2 Solutions

#1


You probably have a huge file that can't be read all at once the way you try with mp3.read(). Read smaller chunks instead. Putting the hashing into a small function also helps keep your main program clean. Here's a function I've been using for a while now (just slightly polished) in a tool probably similar to yours:


import hashlib

def filehash(filename):
    with open(filename, mode='rb') as file:
        hasher = hashlib.md5()
        while True:
            buffer = file.read(1 << 20)
            if not buffer:
                return hasher.hexdigest()
            hasher.update(buffer)
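
For example, calling it on a single file (the path here is purely illustrative) returns the file's 32-character MD5 hex digest:

digest = filehash(r'H:\MUSIC NEXT GEN\some_track.mp3')  # hypothetical file, for illustration only
print(digest)  # prints the 32-character MD5 hex digest of the file's contents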

Update: A readinto version:


buffer = bytearray(1 << 20)
def filehash(filename):
    with open(filename, mode='rb') as file:
        hasher = hashlib.md5()
        while True:
            n = file.readinto(buffer)
            if not n:
                return hasher.hexdigest()
            hasher.update(buffer if n == len(buffer) else buffer[:n])

With a 1 GB file already cached in memory and ten runs each, the readinto version took 5.35 seconds on average and the read version 6.07 seconds. In both versions, the Python process occupied about 10 MB of RAM during the run.


I'll probably stick with the read version, as I prefer its simplicity, and because in my real use cases the data isn't already cached in RAM and I use sha256 (so the overall time goes up significantly, making readinto's small advantage even less relevant).

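A sketch of how filehash could be wired into the question's os.walk loop; the seen dict mapping each digest to the first matching path is just one possible way to track duplicates:

import os

seen = {}   # hex digest -> first path seen with that content
dupes = []  # later files whose content matches an earlier one

for dirpath, subdirs, files in os.walk(r'H:\MUSIC NEXT GEN'):
    for f in files:
        path = os.path.join(dirpath, f)
        digest = filehash(path)   # fresh MD5 per file, read in 1 MiB chunks
        if digest in seen:
            dupes.append(path)
        else:
            seen[digest] = path

print('Dupes:', dupes)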

#2


hasher.update appends the new data to everything hashed so far, so you end up computing one running hash over all files combined rather than a separate hash per file. You should create a new hasher for each file.

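As a minimal illustration, here is the question's loop with the hasher created per file; the whole-file read is left as is, so the MemoryError concern from #1 still applies:

for dirpath, subdirs, files in walk:
    for f in files:
        fileName = os.path.join(dirpath, f)
        hasher = hashlib.md5()            # new hasher for each file, so digests are independent
        with open(fileName, 'rb') as mp3:
            hasher.update(mp3.read())     # still reads the whole file at once; see #1 for chunked reading
        hashKey = hasher.hexdigest()
        if hashKey in mySet:
            dupe.append(fileName)
        else:
            mySet.add(hashKey)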
