Okay. So I have about 250,000 high-resolution images. What I want to do is go through all of them and find the ones that are corrupted. If you know what 4scrape is, then you know the nature of the images.
Corrupted, to me, means that when the image is loaded into Firefox, it says:
The image “such and such image” cannot be displayed, because it contains errors.
Now, I could select all of my 250,000 images (~150 GB) and drag-and-drop them into Firefox. That would be bad, though, because I don't think Mozilla designed Firefox to open 250,000 tabs. No, I need a way to programmatically check whether an image is corrupted.
Does anyone know a PHP or Python library which can do something along these lines? Or an existing piece of software for Windows?
I have already removed obviously corrupted images (such as ones that are 0 bytes) but I'm about 99.9% sure that there are more diseased images floating around in my throng of a collection.
5 solutions
#1
23
An easy way would be to try loading and verifying the files with PIL (Python Imaging Library).
from PIL import Image
v_image = Image.open(file)  # file: path to the image being checked
v_image.verify()  # raises an exception if the file looks broken
Catch the exceptions...
From the documentation:
im.verify()
Attempts to determine if the file is broken, without actually decoding the image data. If this method finds any problems, it raises suitable exceptions. This method only works on a newly opened image; if the image has already been loaded, the result is undefined. Also, if you need to load the image after using this method, you must reopen the image file.
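Putting the snippet and the documentation note together, a folder scan might look roughly like the sketch below. It assumes Pillow (the maintained fork of PIL) is installed, and the names is_image_ok and find_corrupted are made up for illustration. Because verify() does not decode the pixel data, the sketch reopens each file and calls load() as a second pass, which is slower but more likely to catch files that were cut short.

```python
from pathlib import Path

from PIL import Image


def is_image_ok(path):
    """Return True if Pillow can both verify and fully decode the file."""
    try:
        # Pass 1: structural check on a freshly opened image, no decoding.
        with Image.open(path) as im:
            im.verify()
        # Pass 2: verify() leaves the image unusable, so reopen and decode.
        with Image.open(path) as im:
            im.load()
    except Exception:
        # Pillow raises different exception types depending on the format
        # and the kind of damage, so anything raised counts as "not OK".
        return False
    return True


def find_corrupted(folder):
    """Return a sorted list of files under folder that fail is_image_ok."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and not is_image_ok(p)
    )
```

Note that find_corrupted flags every file Pillow cannot read, including non-image files, and that passing this check does not guarantee Firefox will display the image.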
#2
6
I suggest you check out ImageMagick for this: http://www.imagemagick.org/
It includes a tool called identify, which you can either call from a script and read its stdout, or you can use the programming interface provided.
#3
5
In PHP, with exif_imagetype():
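if (exif_imagetype($filename) === false)
{
    // exif_imagetype() only checks the signature in the first bytes of the file
    unlink($filename); // no recognized image signature: treat as corrupted and delete
}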
EDIT: Or you can try to fully load the image with ImageCreateFromString():
if (ImageCreateFromString(file_get_contents($filename)) === false)
{
    unlink($filename); // image is corrupted
}
An image resource will be returned on success. FALSE is returned if the image type is unsupported, the data is not in a recognized format, or the image is corrupt and cannot be loaded.
#4
3
If your exact requirement is that the image displays correctly in Firefox, you may have a difficult time: the only way to be sure would be to link against the exact same image-loading source code that Firefox uses.
Basic image corruption (file is incomplete) can be detected simply by trying to open the file using any number of image libraries.
However, many images can fail to display simply because they stretch a part of the file format that the particular viewer you are using can't handle (GIF in particular has a lot of these edge cases, but you can also find JPEG and the rare PNG files that can only be displayed in specific viewers). There are also some ugly JPEG edge cases where the file appears to be uncorrupted in viewer X, but in reality the file has been cut short and only displays correctly because very little information has been lost. (Firefox can show some cut-off JPEGs correctly [you get a grey bottom], but with others Firefox seems to load them halfway and then displays the error message instead of the partial image.)
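The "cut short" case can sometimes be spotted without decoding anything, because JPEG, PNG and GIF each define a marker that should close the file. The sketch below is only a rough heuristic using the Python standard library; FORMATS and looks_truncated are hypothetical names. It reads just the first and last few bytes, so it is cheap to run over a large collection.

```python
# (signature at the start of the file, marker expected at the very end)
FORMATS = {
    "jpeg": (b"\xff\xd8", b"\xff\xd9"),                      # SOI ... EOI
    "png": (b"\x89PNG\r\n\x1a\n", b"IEND\xae\x42\x60\x82"),  # IEND chunk + CRC
    "gif": (b"GIF8", b"\x3b"),                               # GIF87a/GIF89a ... trailer
}


def looks_truncated(path):
    """Return True or False for known formats, None if the type is unknown."""
    with open(path, "rb") as f:
        head = f.read(8)
        f.seek(0, 2)               # jump to the end to learn the file size
        size = f.tell()
        f.seek(max(0, size - 12))  # re-read only the last few bytes
        tail = f.read()
    for signature, trailer in FORMATS.values():
        if head.startswith(signature):
            return not tail.endswith(trailer)
    return None
```

This is not a full check: some valid files carry extra bytes after the end marker and will be reported as truncated, and damage in the middle of a file goes unnoticed. It is probably best used as a fast first pass before a full decode.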
#5
0
You could use ImageMagick if it is available:
If you want to check a whole folder:
identify "./myfolder/*" >log.txt 2>&1
If you want to check just one file:
identify myfile.jpg
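If you would rather drive identify from Python than read through log.txt, a minimal sketch could look like this. It assumes that identify exits with a non-zero status when it cannot read a file; identify_ok and its command parameter are hypothetical names. The command parameter exists because some installations expose the tool as `magick identify` rather than plain `identify`.

```python
import subprocess


def identify_ok(path, command=("identify",)):
    """Run an identify-style command on path; True if it exits with status 0."""
    result = subprocess.run(
        [*command, str(path)],
        stdout=subprocess.DEVNULL,  # the image description is not needed here
        stderr=subprocess.DEVNULL,  # error text is ignored, only the status counts
    )
    return result.returncode == 0
```

For example, `identify_ok("myfile.jpg")` or `identify_ok("myfile.jpg", command=("magick", "identify"))`. If the command is not installed, subprocess.run raises FileNotFoundError. Starting one process per file for 250,000 images will be slow, so the whole-folder form above may be the better choice for a first pass.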