Why does chunking with GNU parallel improve gzip's compressed size?

Date: 2022-05-13 23:47:10

File under: "Unexpected Efficiency Dept."

The first 90 million numbers take up about 761MB, as output by:

 seq 90000000

According to man parallel, it can speed up gzip's compression of big files by chopping the input into chunks and using different CPUs to compress them. So even though gzip is single-threaded, this technique makes it multi-threaded:

seq 90000000  | parallel --pipe --recend '' -k gzip -9 >bigfile.gz

Took 46 seconds, on an Intel Core i3-2330M (4) @ 2.2GHz.

Pipe the same output to plain old gzip:

seq 90000000  | gzip -9 > bigfile2.gz

Took 80 seconds, on the same CPU. Now the surprise:

ls -log bigfile*.gz

Output:

-rw-rw-r-- 1 200016306 Jul  3 17:27 bigfile.gz
-rw-rw-r-- 1 200381681 Jul  3 17:30 bigfile2.gz

300K larger? That didn't look right. First I checked with zdiff whether the files had the same contents -- yes, the same. I'd have supposed any compressor would do better with a continuous data stream than a chunked one. Why isn't bigfile2.gz smaller than bigfile.gz?
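
The same contents check can be scripted; here is a minimal Python sketch of it, assuming only the two filenames from above. It streams both files so the 761MB of decompressed output never has to sit in memory at once:

    import gzip

    # Compare the decompressed contents of two gzip files chunk by chunk.
    # Python's gzip module reads multi-member files (like the parallel
    # output) transparently, so both streams decompress end to end.
    def same_contents(path_a, path_b, chunk=1 << 20):
        with gzip.open(path_a, "rb") as a, gzip.open(path_b, "rb") as b:
            while True:
                ca, cb = a.read(chunk), b.read(chunk)
                if ca != cb:
                    return False
                if not ca:  # both streams exhausted at the same point
                    return True

    print(same_contents("bigfile.gz", "bigfile2.gz"))  # expect: True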

3 Answers

#1

The reason is that for this particular, rather unusual input, smaller deflate blocks are better than larger ones. By default gzip uses larger deflate blocks, since that works best for normal input data. The parallel command is forcing a few smaller deflate blocks by breaking up the input every 1 MB, resulting in a small gain. Though most of the blocks are still the same size.
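
To make the mechanism concrete, here is a minimal Python sketch (an illustration, not what GNU parallel actually runs) of the chunked layout: each ~1 MB slice is compressed as an independent gzip member, and since concatenated gzip members form one valid gzip stream, the pieces can simply be glued together:

    import zlib

    # Compress each ~1 MB slice as an independent gzip member and
    # concatenate the members; the result is still one valid .gz stream.
    # Each member starts a fresh deflate stream, so a block boundary is
    # forced at every chunk edge, as described above.
    def gzip_in_chunks(data: bytes, chunk: int = 1 << 20, level: int = 9) -> bytes:
        members = []
        for i in range(0, len(data), chunk):
            co = zlib.compressobj(level, zlib.DEFLATED, 31)  # wbits=31: gzip framing
            members.append(co.compress(data[i:i + chunk]) + co.flush())
        return b"".join(members)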

You can do much better by setting a smaller block size for every block, using zlib's memLevel parameter in deflateInit2(). Here I compress the same output in a single thread each time, using memLevel values from 9 down to 2, where a smaller memLevel means a smaller deflate block size (note that zlib does a little better than your gzip at its default level); a sketch reproducing the sweep follows the list:

  • 9 - 199688429
  • 8 - 198554111 (default)
  • 7 - 191582070
  • 6 - 184880482
  • 5 - 181295029
  • 4 - 180137425 (optimum for this input)
  • 3 - 181176610
  • 2 - 185759115
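
The sweep is reproducible without writing C, since Python's zlib module passes memLevel straight through to deflateInit2(). A self-contained sketch, assuming nothing beyond the standard library (it regenerates the seq output itself, so it is slow, and sizes may differ from the list above by a few gzip header bytes):

    import zlib

    # Sweep memLevel from 9 down to 2, compressing the output of
    # `seq 90000000` each time. Per the block sizes quoted below
    # (16383 symbols at memLevel 8, 1023 at memLevel 4), a block holds
    # 2**(memLevel + 6) - 1 symbols, so smaller memLevel = smaller blocks.
    def compressed_size(mem_level: int, n: int = 90_000_000) -> int:
        co = zlib.compressobj(9, zlib.DEFLATED, 31, mem_level)  # wbits=31: gzip framing
        total, lines = 0, []
        for i in range(1, n + 1):
            lines.append(f"{i}\n")
            if len(lines) == 1_000_000:  # feed the stream a few MB at a time
                total += len(co.compress("".join(lines).encode()))
                lines.clear()
        total += len(co.compress("".join(lines).encode()))
        return total + len(co.flush())

    for m in range(9, 1, -1):
        print(m, compressed_size(m))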

The optimum memLevel for this data turns out to be 4, for which the compressed data is 12 MB (9%) smaller than for the default memLevel of 8. For memLevel 8, the deflate block size is 16383 symbols, whereas for memLevel 4, the deflate block size is 1023 symbols. One symbol is either a literal byte or a match.

The improvement comes from the extremely regular nature of the input, resulting in a regular sequence of match and literal commands. The smaller the block size, the fewer distinct commands appear, which then takes fewer bits to code each of them. This is still true for memLevel 3, but by then the overhead of the code description at the beginning of each deflate block cancels the improvement from having fewer distinct codes.

zopfli is a deflate compressor that optimizes the block sizes and the commands selected, and managed to compress this input to 100,656,812 bytes. It took three and a half hours, though! zopfli is invoked by pigz at compression level 11 (pigz -11).

#2

The effect is likely due to compression block size. Compressing the same input stream with a range of settings like this:

for i in {1..9}; do seq 90000000 | gzip -$i >$i.gz; done

gives file sizes that reach a minimum at gzip -5:

-rw-r--r-- 1 203473375 Jul  4 16:39 1.gz
-rw-r--r-- 1 201160853 Jul  4 16:40 2.gz
-rw-r--r-- 1 200181562 Jul  4 16:40 3.gz
-rw-r--r-- 1 204266147 Jul  4 16:40 4.gz
-rw-r--r-- 1 199144028 Jul  4 16:40 5.gz
-rw-r--r-- 1 199688429 Jul  4 16:40 6.gz
-rw-r--r-- 1 199689546 Jul  4 16:41 7.gz
-rw-r--r-- 1 200376213 Jul  4 16:41 8.gz
-rw-r--r-- 1 200381681 Jul  4 16:42 9.gz

That's not far off gzip's default of -6.

#3

I think what differs is the frequency of dictionary building. That is the balance between speed and compression efficiency, as with gzip vs lzma.

I guess it happens more often in the split case, so the dictionary stays closer to the numbers that follow it.

There is a 20-minute lecture on YouTube: Raul Fraile: How GZIP compression works | JSConf EU 2014.
