I have written a Java program for compression. I compressed some text files, and their size went down after compression. But when I tried to compress a PDF file, I did not see any change in file size after compression.
So I want to know which other kinds of files will not get smaller after compression.
Thanks, Sunil Kumar Sahoo
16 answers
#1
File compression works by removing redundancy. Therefore, files that contain little redundancy compress badly or not at all.
The kind of files with no redundancy that you're most likely to encounter is files that have already been compressed. In the case of PDF, that would specifically be PDFs that consist mainly of images which are themselves in a compressed image format like JPEG.
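To make the redundancy point concrete, here is a minimal sketch using Python's standard zlib module. The word list, the seed and the sample sizes are assumptions picked for illustration; they are not from the question.

```python
import os
import random
import zlib

# Hypothetical sample data: text drawn from a tiny vocabulary is highly
# redundant, while os.urandom() output has essentially no redundancy.
rng = random.Random(0)  # fixed seed so the text is reproducible
words = ["file", "size", "compress", "redundancy", "data", "image", "text", "pdf"]
text = " ".join(rng.choice(words) for _ in range(5000)).encode("ascii")
noise = os.urandom(len(text))


def ratio(data: bytes) -> float:
    """Compressed size divided by original size (smaller means better)."""
    return len(zlib.compress(data, 9)) / len(data)


text_ratio = ratio(text)
noise_ratio = ratio(noise)
print(f"redundant text: {text_ratio:.2f}")
print(f"random bytes:   {noise_ratio:.2f}")
```

The text should come out at a small fraction of its size, while the random bytes should stay at roughly 1.00 or slightly above.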
#2
Compressed files will not reduce their size after compression.
#3
JPEG/GIF/AVI/MPEG/MP3 and other already compressed files won't change much after compression. You may see a small decrease in file size.
#4
The only files that cannot be compressed are random ones - truly random bits, or as approximated by the output of a compressor.
However, for any algorithm in general, there are many files that cannot be compressed by it but can be compressed well by another algorithm.
#5
Five years later, I have at least some real statistics to show for this.
I've generated 17439 multi-page PDF files with PrinceXML that total 4858 MB. Running zip -r archive pdf_folder gives me an archive.zip that is 4542 MB. That's 93.5% of the original size, so it's not worth doing to save space.
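For anyone who wants to check the arithmetic, here is a small sketch. The two sizes come from the post; the helper name size_ratio is made up for this example.

```python
def size_ratio(original: float, compressed: float) -> float:
    """Return the compressed size as a percentage of the original size."""
    return 100.0 * compressed / original


original_size = 4858    # total size of the PDFs, from the post
compressed_size = 4542  # size of archive.zip, from the post

percent = size_ratio(original_size, compressed_size)
saved = original_size - compressed_size
print(f"{percent:.1f}% of the original size, {saved} saved")
```

So the archive saves 316 out of 4858 units, about 6.5%.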
#6
Generally you cannot compress data that has already been compressed. You might even end up with a compressed size that is larger than the input.
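A quick way to see this is to feed a compressor its own output a few times. The following is only a sketch with zlib; the sample text, the seed and the number of passes are arbitrary assumptions.

```python
import random
import zlib

# Hypothetical input: reproducible text built from a small vocabulary.
rng = random.Random(1)
words = ["alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta", "theta"]
data = " ".join(rng.choice(words) for _ in range(5000)).encode("ascii")

# sizes[0] is the original size, sizes[k] the size after k passes of zlib.
sizes = [len(data)]
for _ in range(4):
    data = zlib.compress(data, 9)
    sizes.append(len(data))

print(sizes)
```

Only the first pass should give a real saving. Later passes typically stay about the same or grow by a few bytes, because each pass adds its own header and checksum to data that has no redundancy left.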
#7
You will probably have difficulty compressing encrypted files too as they are essentially random and will (typically) have few repeating blocks.
#8
PDF files are already compressed. They use the following compression algorithms:
- LZW (Lempel-Ziv-Welch)
- FLATE (ZIP, in PDF 1.2)
- JPEG and JPEG2000 (PDF version 1.5)
- CCITT (the facsimile standard, Group 3 or 4)
- JBIG2 compression (PDF version 1.4)
- RLE (Run Length Encoding)
Depending on which tool created the PDF and which PDF version it targets, different types of compression are used. You can compress it further using a more efficient algorithm, or lose some quality by converting images to low-quality JPEGs.
There is a great link on this here
#9
Files encrypted with a good algorithm like IDEA or DES in CBC mode don't compress anymore regardless of their original content. That's why encryption programs first compress and only then run the encryption.
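The ordering argument can be tried out with a sketch like the one below. Note the assumptions: toy_encrypt is a made-up stand-in built from a SHA-256 keystream, not IDEA, DES or any real cipher, and it is not secure. It only serves to produce random-looking output with the standard library.

```python
import hashlib
import zlib


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream.

    Illustration only: a stand-in for a real cipher, NOT secure.
    Applying it twice with the same key restores the input.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


plaintext = b"files that contain redundancy compress well. " * 400
key = b"hypothetical key"

# Order 1: compress first, then encrypt the (small) compressed data.
compress_then_encrypt = toy_encrypt(zlib.compress(plaintext, 9), key)
# Order 2: encrypt first, then try to compress the random-looking result.
encrypt_then_compress = zlib.compress(toy_encrypt(plaintext, key), 9)

print(len(plaintext), len(compress_then_encrypt), len(encrypt_then_compress))
```

Compressing first should give a tiny result, while compressing after encryption should leave the data at about its original size.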
#10
Media files don't tend to compress well. JPEG and MPEG don't compress, while you may be able to compress .png files.
#11
Files that are already compressed usually can't be compressed any further. For example mp3, jpg, flac, and so on. You could even end up with bigger files, because re-compressing adds its own file header.
#12
Really, it all depends on the algorithm that is used. An algorithm that is specifically tailored to use the frequency of letters found in common English words will do fairly poorly when the input file does not match that assumption.
In general, PDFs contain images and the like that are already compressed, so they will not compress much further. Your algorithm is probably only able to eke out meagre savings, if any, based on the text strings contained in the PDF?
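As a rough illustration of a compressor tuned to one kind of input, the sketch below primes zlib with a preset dictionary of English words. This is a swapped-in technique, not a letter-frequency model, and the dictionary and both samples are invented for the example.

```python
import zlib

# Hypothetical preset dictionary: phrases the compressor "expects" to see.
ENGLISH_DICT = b"the quick brown fox jumps over the lazy dog "


def deflate(data: bytes, zdict: bytes = b"") -> bytes:
    """Compress data with zlib, optionally primed with a preset dictionary."""
    if zdict:
        comp = zlib.compressobj(level=9, zdict=zdict)
    else:
        comp = zlib.compressobj(level=9)
    return comp.compress(data) + comp.flush()


english = b"the lazy dog jumps over the quick brown fox"  # matches the assumption
digits = b"3141592653589793238462643383279502884197"      # does not match it

for label, sample in (("english", english), ("digits", digits)):
    plain = len(deflate(sample))
    tuned = len(deflate(sample, ENGLISH_DICT))
    print(f"{label}: input={len(sample)} plain={plain} tuned={tuned}")
```

The tuned compressor should help on the English sample and give no benefit on the digits, which is the mismatch described above.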
#13
Simple answer: compressed files (otherwise we could reduce file sizes to 0 by compressing multiple times :). Many file formats already apply compression, and you might find that the file size shrinks by less than 1% when compressing movies, mp3s, jpegs, etc.
#14
You can add all Office 2007 file formats to the list (of @waqasahmed):
Since the Office 2007 .docx and .xlsx (etc.) files are actually zipped .xml files, you might not see much size reduction in them either.
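Here is a sketch of why such containers barely shrink. It builds a ZIP archive in memory with one XML member and then compresses the archive again. The tag names, words and member path are assumptions, and the result is not a valid .docx, only a ZIP file with XML inside.

```python
import io
import random
import zipfile
import zlib

rng = random.Random(2)
tags = ["w:p", "w:r", "w:t", "w:b", "w:i"]  # hypothetical tag names
words = ["report", "table", "figure", "total", "summary", "draft", "final", "note"]


def fake_document_xml(elements: int) -> bytes:
    """Build a repetitive XML-like payload (not real Office markup)."""
    parts = []
    for _ in range(elements):
        tag = rng.choice(tags)
        parts.append(f"<{tag}>{rng.choice(words)}</{tag}>")
    return "".join(parts).encode("utf-8")


xml = fake_document_xml(4000)

# Step 1: what the office suite does, store the XML deflated inside a ZIP.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("word/document.xml", xml)
container = buffer.getvalue()

# Step 2: what your own compressor sees, an already deflated container.
recompressed = zlib.compress(container, 9)

print(len(xml), len(container), len(recompressed))
```

The first step should shrink the XML a lot. The second step should leave the container at about the same size.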
#15
- Truly random data
- An approximation thereof, made by a cryptographically strong hash function or cipher, e.g.:
  AES-CBC(any input)
  from hashlib import md5
  "".join(md5(str(i).encode()).hexdigest() for i in range(...))
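A runnable variation on the hash idea, as a sketch only. The count of 2000 digests is an arbitrary assumption, and md5_stream is a name made up here. One caveat worth checking for yourself: hex encoding spends a whole byte on every 4 bits of digest, so a hex string like the one in this answer should still shrink to somewhere around half, while the raw digest bytes should not shrink at all.

```python
import hashlib
import zlib


def md5_stream(n: int) -> bytes:
    """Concatenate the raw MD5 digests of str(0), str(1), ..., str(n - 1)."""
    return b"".join(hashlib.md5(str(i).encode("ascii")).digest() for i in range(n))


raw = md5_stream(2000)             # 2000 is an arbitrary choice for this sketch
hexed = raw.hex().encode("ascii")  # same data, hex-encoded as in the answer

raw_ratio = len(zlib.compress(raw, 9)) / len(raw)
hex_ratio = len(zlib.compress(hexed, 9)) / len(hexed)
print(f"raw digests: {raw_ratio:.2f}")
print(f"hex digests: {hex_ratio:.2f}")
```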
#16
Any lossless compression algorithm, provided it makes some inputs smaller (as the name compression suggests), will also make some other inputs larger.
Otherwise, the set of all input sequences up to a given length L could be mapped to the (much) smaller set of all sequences of length less than L without collisions (because the compression must be lossless and reversible), a possibility that the pigeonhole principle excludes.
So, there are infinitely many files that do NOT get smaller after compression and, moreover, a file does not need to be a high-entropy file for that to happen :)
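The counting behind the pigeonhole argument can be checked by brute force for a small length. This is a sketch over bit strings; the value of L is arbitrary and kept small so the enumeration stays fast.

```python
from itertools import product


def count_bit_strings(length: int) -> int:
    """Count the distinct bit strings of exactly the given length by enumeration."""
    return sum(1 for _ in product("01", repeat=length))


L = 10  # arbitrary small length for this sketch

inputs = sum(count_bit_strings(n) for n in range(L + 1))  # lengths 0..L
shorter = sum(count_bit_strings(n) for n in range(L))     # lengths 0..L-1

print(f"inputs of length <= {L}: {inputs}")
print(f"outputs of length < {L}: {shorter}")
```

There are 2047 possible inputs but only 1023 possible shorter outputs, so a reversible compressor cannot shorten all of them: at least two inputs would have to share an output.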