Uncompressed file size using zlib's gzip file access functions

Date: 2021-04-11 20:00:39

Using the Linux command-line tool gzip, I can get the uncompressed size of a compressed file with gzip -l.

I couldn't find any function like that in the zlib manual section "gzip File Access Functions".

I found a solution at http://www.abeel.be/content/determine-uncompressed-size-gzip-file that involves reading the last 4 bytes of the file, but I am avoiding it for now because I would prefer to use the library's own functions.

1 answer

#1


14  

There is no reliable way to get the uncompressed size of a gzip file without decompressing, or at least decoding the whole thing. There are three reasons.

First, the only information about the uncompressed length is four bytes at the end of the gzip file (stored in little-endian order). By necessity, that is the length modulo 2^32. So if the uncompressed length is 4 GB or more, you won't know what the length is. You can only be certain that the uncompressed length is less than 4 GB if the compressed length is less than something like 2^32 / 1032 + 18, or around 4 MB. (1032 is the maximum compression factor of deflate.)
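
As a rough illustration of this first point, here is a minimal C sketch that reads those four trailing bytes and computes the size bound mentioned above. The function names gzip_trailer_isize and gzip_safe_compressed_limit are hypothetical helpers, not part of zlib, and the sketch assumes the file is a single gzip stream with nothing after it.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper (not a zlib function): read the last four bytes of
   the file as a little-endian 32-bit value. Returns 0 on success and -1
   on failure. The value is only the uncompressed length modulo 2^32 of
   the final gzip stream, and only if nothing follows that stream. */
int gzip_trailer_isize(const char *path, uint32_t *isize)
{
    unsigned char b[4];
    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return -1;
    int ok = fseek(f, -4L, SEEK_END) == 0 && fread(b, 1, 4, f) == 4;
    fclose(f);
    if (!ok)
        return -1;
    *isize = (uint32_t)b[0] | (uint32_t)b[1] << 8 |
             (uint32_t)b[2] << 16 | (uint32_t)b[3] << 24;
    return 0;
}

/* Hypothetical helper: the answer's rough bound 2^32 / 1032 + 18 on the
   compressed size, below which the uncompressed length must be under
   4 GB. Evaluates to 4161808 bytes, i.e. about 4 MB. */
uint64_t gzip_safe_compressed_limit(void)
{
    return (UINT64_C(1) << 32) / 1032 + 18;
}
```

Even when the compressed file is smaller than that limit, the value read this way is still subject to the second and third problems described in this answer.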

Second, and this is worse, a gzip file may actually be a concatenation of multiple gzip streams. Other than decoding, there is no way to find where each gzip stream ends in order to look at the four-byte uncompressed length of that piece. (Which may be wrong anyway due to the first reason.)

Third, gzip files will sometimes have junk after the end of the gzip stream (usually zeros). Then the last four bytes are not the length.

So gzip -l doesn't really work anyway. As a result, there is no point in providing that function in zlib.

pigz has an option that does in fact decode the entire input in order to get the actual uncompressed length: pigz -lt, which guarantees the right answer. pigz -l does what gzip -l does, which may be wrong.
