Compressing an integer array in Java

I have some extremely large arrays of integers that I would like to compress.
However, the way to do it in Java is to use something like this:

int[] myIntArray;  // the large array to compress
ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream(1024);
ObjectOutputStream objectOutputStream = new ObjectOutputStream(new DeflaterOutputStream(byteArrayOutputStream));
objectOutputStream.writeObject(myIntArray);
objectOutputStream.close();  // finishes the deflater so the last compressed bytes are written

Note that Java first needs to convert the int array to bytes. I know that is fast, but it still has to create a whole new byte array, scan through the entire original int array, convert each value to bytes, and copy it into the new byte array.

Is there any way to skip the byte conversion and make it compress the integers right away?

6 Answers

#1

Skip the ObjectOutputStream and just store the ints directly as four bytes each. DataOutputStream.writeInt for instance is an easy way to do it.
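
As a rough sketch of that suggestion, the ints can be written one by one through a DataOutputStream that sits on top of the deflater. The class name IntArrayCompressor, the length prefix, and the BufferedOutputStream layer are assumptions made for this illustration, not part of the original answer.

```java
import java.io.BufferedOutputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper: compresses an int[] without going through ObjectOutputStream.
final class IntArrayCompressor {

    // Writes the array length, then each int as four big-endian bytes, through a deflater.
    static byte[] compress(int[] values) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new DeflaterOutputStream(sink)))) {
            out.writeInt(values.length); // length prefix (an assumption of this sketch)
            for (int v : values) {
                out.writeInt(v);
            }
        } // closing the stream finishes the deflater and flushes everything into sink
        return sink.toByteArray();
    }

    // Reverses compress(): inflates the bytes and reads the ints back.
    static int[] decompress(byte[] compressed) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new InflaterInputStream(new ByteArrayInputStream(compressed)))) {
            int[] values = new int[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readInt();
            }
            return values;
        }
    }
}
```

The ints are streamed straight into the compressor, so no intermediate uncompressed byte array the size of the input is built. Only the compressed result is held in memory.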

#2

Hmm. A general-purpose compression algorithm won't necessarily do a good job compressing an array of binary values, unless there's a lot of redundancy. You might do better to develop something of your own, based on what you know about the data.

What is it that you're actually trying to compress?

#3

You could use the representation used by Protocol Buffers. Each integer is represented by 1-5 bytes, depending on its magnitude.

Additionally, the new "packed" representation means you get basically a bit of "header" to say how big it is (and which field it's in) and then just the data. That's probably what ObjectOutputStream does as well, but it's a recent innovation in PB :)

Note that this will compress based on magnitude, not based on how often the integer has been seen. That will dramatically affect whether it's useful for you or not.
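
To make the 1-5 byte idea concrete, here is a minimal sketch of a base-128 variable-length encoding in plain Java. It does not use the Protocol Buffers library; the class VarIntCodec and its methods are hypothetical names for this illustration. The sketch treats each int as an unsigned 32-bit value, so negative numbers always take five bytes (zigzag encoding is a common variation if small negative values are frequent).

```java
import java.io.ByteArrayOutputStream;

// Hypothetical helper illustrating a base-128 varint encoding of ints.
final class VarIntCodec {

    // Encodes each int in 1 to 5 bytes: 7 payload bits per byte, low bits first,
    // with the high bit set on every byte except the last one of a value.
    static byte[] encode(int[] values) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int v : values) {
            while ((v & ~0x7F) != 0) {
                out.write((v & 0x7F) | 0x80);
                v >>>= 7; // unsigned shift, so negative values terminate too
            }
            out.write(v);
        }
        return out.toByteArray();
    }

    // Decodes exactly count ints from bytes produced by encode().
    static int[] decode(byte[] bytes, int count) {
        int[] values = new int[count];
        int pos = 0;
        for (int i = 0; i < count; i++) {
            int result = 0;
            int shift = 0;
            byte b;
            do {
                b = bytes[pos++];
                result |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            values[i] = result;
        }
        return values;
    }
}
```

With this scheme values below 128 cost one byte each, while values that need all 32 bits cost five, which is one byte more than the fixed-width form. That is why the magnitude of your data decides whether it helps.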

#4

A byte array is not going to save you much memory unless you make it a byte array holding unsigned ints, which is very dangerous in Java. It will replace memory overhead with extra processing time for the checking the code has to do at each step. This may be all right for data storage, but there are already data storage solutions out there.
Unless you are doing this for serialization purposes, I think that you are wasting your time.

#5

In your example, you are writing the compressed stream to the ByteArrayOutputStream. Your compressed array needs to exist somewhere, and if the destination is memory, then ByteArrayOutputStream is your likely choice. You could also write the stream to a socket or file. In that case, you wouldn't duplicate the stream in memory. If your array is 800MB and you're running in 1GB of memory, you could easily write the array to a compressed file with the example you included. The change would be replacing the ByteArrayOutputStream with a file stream.
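
A sketch of that change, assuming the destination is a file: the only difference from the code in the question is that the deflater writes to a file stream instead of a ByteArrayOutputStream. The class name CompressedArrayFile, the use of java.nio.file.Files, and the buffering layer are choices made for this illustration.

```java
import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

// Hypothetical helper: stores an int[] in a deflate-compressed file.
final class CompressedArrayFile {

    // Serializes the array through a deflater straight into the file,
    // so the compressed bytes are never collected in memory.
    static void write(int[] values, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(
                new DeflaterOutputStream(
                        new BufferedOutputStream(Files.newOutputStream(file))))) {
            out.writeObject(values);
        }
    }

    // Reads the array back from a file produced by write().
    static int[] read(Path file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(
                new InflaterInputStream(
                        new BufferedInputStream(Files.newInputStream(file))))) {
            return (int[]) in.readObject();
        }
    }
}
```

The same wrapping works for a socket: pass the socket's output stream where Files.newOutputStream(file) appears here.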

The ObjectOutputStream format is actually fairly efficient. It will not duplicate your array in memory, and has special code for efficiently writing arrays.

Do you want to work with the compressed array in memory? Would your data lend itself well to a sparse array? Sparse arrays are good when you have large gaps in your data.

#6

If the array of ints is guaranteed to have no duplicates, you can use a java.util.BitSet instead.

As its base implementation is an array of bits, with each bit indicating if a certain integer is present or not in the BitSet, its memory usage is quite low, therefore needing less space to be serialized.
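
A minimal sketch of the BitSet approach follows. Besides the no-duplicates condition, it assumes the values are non-negative (BitSet rejects negative indices) and that the original order of the array does not matter, since a BitSet only records which values are present. The class name IntSetPacker is hypothetical.

```java
import java.util.BitSet;

// Hypothetical helper: stores a set of distinct, non-negative ints as bits.
final class IntSetPacker {

    // Sets one bit per value and returns the bits as a compact byte array.
    static byte[] pack(int[] distinctNonNegative) {
        BitSet bits = new BitSet();
        for (int v : distinctNonNegative) {
            bits.set(v); // throws IndexOutOfBoundsException if v is negative
        }
        return bits.toByteArray();
    }

    // Rebuilds the values in ascending order; the original order is not preserved.
    static int[] unpack(byte[] packed) {
        return BitSet.valueOf(packed).stream().toArray();
    }
}
```

The packed size depends on the largest value rather than on how many values there are, so this pays off for dense ranges and can be worse than a plain int array when a few very large values are present.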
