压缩唯一的数据流

I've got a large number of integer arrays. Each one has a few thousand integers in it, and each integer is generally the same as the one before it or is different by only a single bit or two. I'd like to shrink each array down as small as possible to reduce my disk IO.

我有大量的整数数组。每个整数都有几千个整数,每个整数通常与之前的整数相同,或者只有一个或两个不同。我想尽量缩小每个阵列,以减少我的磁盘IO。

Zlib shrinks it to about 25% of its original size. That's nice, but I don't think its algorithm is particularly well suited for the problem. Does anyone know a compression library or simple algorithm that might perform better for this type of information?

Zlib将其缩小至原始尺寸的约25%。这很好,但我不认为它的算法特别适合这个问题。有没有人知道压缩库或简单的算法可能会更好地执行此类信息?

Update: zlib after converting it to an array of xor deltas shrinks it to about 20% of the original size.

更新:将zlib转换为xor deltas数组后,将其缩小到原始大小的20%左右。

7 个解决方案

#1

If most of the integers really are the same as the previous, and the inter-symbol difference can usually be expressed as a single bit flip, this sounds like a job for XOR.

如果大多数整数与前一个完全相同,并且符号间的差异通常可以表示为单个位翻转,这听起来像是XOR的工作。

Take an input stream like:

获取输入流,如:

and output:

a bit of pseudo code

一点伪代码

compressed[0] = uncompressed[0]
loop
  compressed[i] = uncompressed[i-1] ^ uncompressed[i]

We've now reduced most of the output to 0, even when a high bit is changed. The RLE compression in any other tool you use will have a field day with this. It'll work even better on 32-bit integers, and it can still encode a radically different integer popping up in the stream. You're saved the bother of dealing with bit-packing yourself, as everything remains an int-sized quantity.

我们现在已将大部分输出减少到0,即使更改了高位也是如此。您使用的任何其他工具中的RLE压缩都会有一个字段日。它在32位整数上工作得更好,它仍然可以编码流中突然出现的完全不同的整数。你节省了处理自己打包的麻烦,因为一切都是一个int大小的数量。

When you want to decompress:

当你想要解压缩时:

uncompressed[0] = compressed[0]
loop
  uncompressed[i] = uncompressed[i-1] ^ compressed[i]

This also has the advantage of being a simple algorithm that is going to run really, really fast, since it is just XOR.

这也是一个简单算法的优点,它将真正,非常快地运行,因为它只是异或。

#2

Have you considered Run-length encoding?

你考虑过游程编码吗?

Or try this: Instead of storing the numbers themselves, you store the differences between the numbers. 1 1 2 2 2 3 5 becomes 1 0 1 0 0 1 2. Now most of the numbers you have to encode are very small. To store a small integer, use an 8-bit integer instead of the 32-bit one you'll encode on most platforms. That's a factor of 4 right there. If you do need to be prepared for bigger gaps than that, designate the high-bit of the 8-bit integer to say "this number requires the next 8 bits as well".

或者尝试这样:您可以存储数字之间的差异,而不是自己存储数字。 1 1 2 2 2 3 5变为1 0 1 0 0 1 2.现在,您必须编码的大多数数字都非常小。要存储一个小整数,请使用8位整数,而不是大多数平台上要编码的32位整数。这就是4的因素。如果你确实需要为更大的间隙做好准备,请指定8位整数的高位来说“这个数字也需要接下来的8位”。

You can combine that with run-length encoding for even better compression ratios, depending on your data.

您可以将其与行程编码相结合,以获得更好的压缩率,具体取决于您的数据。

Neither of these options is particularly hard to implement, and they all run very fast and with very little memory (as opposed to, say, bzip).

这些选项都没有特别难以实现,并且它们都运行得非常快且内存很少(与bzip相反)。

#3

You want to preprocess your data -- reversibly transform it to some form that is better-suited to your back-end data compression method, first. The details will depend on both the back-end compression method, and (more critically) on the properties you expect from the data you're compressing.

您希望预处理数据 - 首先将其可逆地转换为更适合您的后端数据压缩方法的形式。细节将取决于后端压缩方法,以及(更关键的)您希望从您正在压缩的数据中获得的属性。

In your case, zlib is a byte-wise compression method, but your data comes in (32-bit?) integers. You don't need to reimplement zlib yourself, but you do need to read up on how it works, so you can figure out how to present it with easily compressible data, or if it's appropriate for your purposes at all.

在您的情况下,zlib是一种逐字节压缩方法,但您的数据是(32位?)整数。您不需要自己重新实现zlib,但是您需要了解它是如何工作的,这样您就可以弄清楚如何使用易于压缩的数据来呈现它,或者它是否适合您的目的。

Zlib implements a form of Lempel-Ziv coding. JPG and many others use Huffman coding for their backend. Run-length encoding is popular for many ad hoc uses. Etc., etc. ...

Zlib实现了Lempel-Ziv编码的一种形式。 JPG和许多其他人使用霍夫曼编码作为他们的后端。行程编码在许多临时用途中很流行。等等......

#4

Perhaps the answer is to pre-filter the arrays in a way analogous to the Filtering used to create small PNG images. Here are some ideas right off the top of my head. I've not tried these approaches, but if you feel like playing, they could be interesting.

也许答案是以类似于用于创建小PNG图像的过滤的方式预过滤阵列。这是我的头脑中的一些想法。我没有尝试过这些方法,但如果你想玩,他们可能很有趣。

Break your ints up each into 4 bytes, so i₀, i₁, i₂, ..., i_n becomes b_0,0, b_0,1, b_0,2, b_0,3, b_1,0, b_1,1, b_1,2, b_1,3, ..., b_n,0, b_n,1, b_n,2, b_n,3. Then write out all the b_i,0s, followed by the b_i,1s, b_i,2s, and b_i,3s. If most of the time your numbers differ only by a bit or two, you should get nice long runs of repeated bytes, which should compress really nicely using something like Run-length Encoding or zlib. This is my favourite of the methods I present.

将每个中心分成4个字节,因此i0,i1,i2,...,in变为b0,0,b0,1,b0,2,b0,3,b1,0,b1,1,b1,2, b1,3,...,bn,0,bn,1,bn,2,bn,3。然后写出所有bi,0,然后是bi,1s,bi,2s和bi,3s。如果大多数时候你的数字只有一两个差别,你应该得到很好的长时间重复字节,这应该使用像运行长度编码或zlib这样很好地压缩。这是我最喜欢的方法。
If the integers in each array are closely-related to the one before, you could maybe store the original integer, followed by diffs against the previous entry - this should give a smaller set of values to draw from, which typically results in a more compressed form.

如果每个数组中的整数与之前的整数紧密相关,则可以存储原始整数,然后对前一个条目进行差异 - 这应该提供一组较小的值来绘制,这通常会导致压缩更多形成。
If you have various bits differing, you still may have largish differences, but if you're more likely to have large numeric differences that correspond to (usually) one or two bits differing, you may be better off with a scheme where you create ahebyte array - use the first 4 bytes to encode the first integer, and then for each subsequent entry, use 0 or more bytes to indicate which bits should be flipped - storing 0, 1, 2, ..., or 31 in the byte, with a sentinel (say 32) to indicate when you're done. This could result the raw number of bytes needed to represent and integer to something close to 2 on average, which most bytes coming from a limited set (0 - 32). Run that stream through zlib, and maybe you'll be pleasantly surprised.

如果你有各种不同的位,你仍然可能有较大的差异,但如果你更可能有大的数字差异对应(通常)一个或两个位不同,你可能会更好的与你创建ahebyte的方案array - 使用前4个字节对第一个整数进行编码,然后对于每个后续条目,使用0个或更多个字节来指示应该翻转哪些位 - 在字节中存储0,1,2,...或31,用哨兵(比方说32)表示你什么时候完成。这可能导致表示所需的原始字节数和平均值接近2的整数,大多数字节来自有限集(0 - 32)。通过zlib运行该流,也许你会感到惊喜。

#5

Did you try bzip2 for this? http://bzip.org/

你为此尝试过bzip2吗? http://bzip.org/

It's always worked better than zlib for me.

对我来说,它总是比zlib更好用。

#6

Since your concern is to reduce disk IO, you'll want to compress each integer array independently, without making reference to other integer arrays.

由于您关注的是减少磁盘IO,因此您需要独立压缩每个整数数组,而不引用其他整数数组。

A common technique for your scenario is to store the differences, since a small number of differences can be encoded with short codewords. It sounds like you need to come up with your own coding scheme for differences, since they are multi-bit differences, perhaps using an 8 bit byte something like this as a starting point:

您的方案的常用技术是存储差异,因为可以使用短代码字编码少量差异。听起来你需要为差异提出自己的编码方案,因为它们是多位差异,可能使用像这样的8位字节作为起点:

1 bit to indicate that a complete new integer follows, or that this byte encodes a difference from the last integer,

1位表示跟随一个完整的新整数,或者该字节编码与最后一个整数的差值,

1 bit to indicate that there are more bytes following, recording more single bit differences for the same integer.

1位表示后面有更多字节,为同一整数记录更多单个位差。

6 bits to record the bit number to switch from your previous integer.

6位用于记录要从前一个整数切换的位数。

If there are more than 4 bits different, then store the integer.

如果有不同的4位,则存储整数。

This scheme might not be appropriate if you also have a lot of completely different codes, since they'll take 5 bytes each now instead of 4.

如果你也有很多完全不同的代码,这个方案可能不合适,因为它们现在每个需要5个字节而不是4个字节。

#7

"Zlib shrinks it by a factor of about 4x." means that a file of 100K now takes up negative 300K; that's pretty impressive by any definition :-). I assume you mean it shrinks it by 75%, i.e., to 1/4 its original size.

“Zlib将它缩小了大约4倍。”意味着100K的文件现在占用负300K;任何定义都令人印象深刻:-)。我认为你的意思是它缩小了75%,即缩小到原始尺寸的1/4。

One possibility for an optimized compression is as follows (it assumes a 32-bit integer and at most 3 bits changing from element to element).

优化压缩的一种可能性如下(它假定32位整数,并且最多3位从元素到元素变化)。

Output the first integer (32 bits).

输出第一个整数(32位)。

Output the number of bit changes (n=0-3, 2 bits).

输出位数变化(n = 0-3,2位)。

Output n bit specifiers (0-31, 5 bits each).

输出n位说明符(0-31,每位5位)。

Worst case for this compression is 3 bit changes in every integer (2+5+5+5 bits) which will tend towards 17/32 of original size (46.875% compression).

这种压缩的最坏情况是每个整数(2 + 5 + 5 + 5位)中的3位变化,这将倾向于原始大小的17/32(46.875%压缩)。

I say "tends towards" since the first integer is always 32 bits but, for any decent sized array, that first integer would be negligable.

我说“倾向于”,因为第一个整数总是32位,但对于任何体面大小的数组,第一个整数都是可以忽略的。

Best case is a file of identical integers (no bit changes for every integer, just the 2 zero bits) - this will tend towards 2/32 of original size (93.75% compression).

最好的情况是一个相同整数的文件(每个整数没有位变化,只有2个零位) - 这将趋向于原始大小的2/32(93.75%压缩)。

Where you average 2 bits different per consecutive integer (as you say is your common case), you'll get 2+5+5 bits per integer which will tend towards 12/32 or 62.5% compression.

在每个连续整数平均2位不同的地方(正如你所说的那样),你会得到每个整数2 + 5 + 5位,这将趋向于12/32或62.5%压缩。

Your break-even point (if zlib gives 75% compression) is 8 bits per integer which would be

你的收支平衡点(如果zlib给出75%压缩)是每个整数8位

single-bit changes (2+5 = 7 bits) : 80% of the transitions.

单比特变化(2 + 5 = 7比特):80%的转变。

double-bit changes (2+5+5 = 12 bits) : 20% of the transitions.

双位变化(2 + 5 + 5 = 12位):20%的转换。

This means your average would have to be 1.2 bit changes per integer to make this worthwhile.

这意味着你的平均值必须是每个整数1.2位更改才能使这个值得。

One thing I would suggest looking at is 7zip - this has a very liberal licence and you can link it with your code (I think the source is available as well).

我建议看一下7zip - 它有一个非常*的许可证,你可以将它与你的代码链接(我认为源代码也是可用的)。

I notice (for my stuff anyway) it performs much better than WinZip on a Windows platform so it may also outperform zlib.

我注意到(对于我的东西无论如何)它在Windows平台上比WinZip表现得更好,所以它也可能胜过zlib。

#1