What is the best compression algorithm that allows random reads/writes in a file?

Time: 2022-12-21 13:28:55

I know that any adaptive compression algorithms would be out of the question.

And I know Huffman encoding would be out of the question.

Does anyone have a better compression algorithm that would allow random reads/writes?

I think you could use any compression algorithm if you write it in blocks, but ideally I would not like to have to decompress a whole block at a time. But if you have suggestions on an easy way to do this and how to know the block boundaries, please let me know. If this is part of your solution, please also let me know what you do when the data you want to read crosses a block boundary.

In the context of your answers please assume the file in question is 100GB, and sometimes I'll want to read the first 10 bytes, and sometimes I'll want to read the last 19 bytes, and sometimes I'll want to read 17 bytes in the middle.

7 Answers

#1


17  

I am stunned at the number of responses that imply that such a thing is impossible.

Have these people never heard of "compressed file systems", which have been around since before Microsoft was sued in 1993 by Stac Electronics over compressed file system technology?

I hear that LZS and LZJB are popular algorithms for people implementing compressed file systems, which necessarily require both random-access reads and random-access writes.

Perhaps the simplest and best thing to do is to turn on file system compression for that file, and let the OS deal with the details. But if you insist on handling it manually, perhaps you can pick up some tips by reading about NTFS transparent file compression.

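For example, on Windows the NTFS flavour of this is just a per-file attribute; a minimal sketch in Python, assuming an NTFS volume and the stock compact.exe utility (the path below is a made-up placeholder):

    import subprocess

    def enable_ntfs_compression(path):
        """Mark `path` as compressed on NTFS; subsequent reads/writes stay random access."""
        # compact.exe ships with Windows; /C sets the compressed attribute on the file.
        subprocess.run(["compact", "/C", path], check=True)

    enable_ntfs_compression(r"D:\data\huge_100gb.bin")   # hypothetical 100GB file
    # NTFS then compresses/decompresses fixed-size compression units under the hood,
    # so seeking to byte 50_000_000_000 and reading 17 bytes works as usual.

(On other platforms, filesystems such as Btrfs or ZFS offer the same kind of transparent compression.)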

Also check out: "*: Compression formats with good support for random access within archives?"

#2


4  

The razip format supports random access reads with better performance than gzip/bzip2 which have to be tweaked for this support:

http://sourceforge.net/projects/razip/

#3


3  

A dictionary-based compression scheme, with each dictionary entry's code being encoded with the same size, will result in being able to begin reading at any multiple of the code size, and writes and updates are easy if the codes make no use of their context/neighbors.

If the encoding includes a way of distinguishing the start or end of codes then you do not need the codes to be the same length, and you can start reading anywhere in the middle of the file. This technique is more useful if you're reading from an unknown position in a stream.

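A toy sketch of the fixed-size-code case (my own illustration, assuming a prebuilt dictionary where every 8-byte sequence maps to a 2-byte code): because the codes are all the same width, the byte range you want maps directly to a range of codes, and only those codes get decoded.

    import struct

    BLOCK = 8    # uncompressed bytes per dictionary entry
    CODE = 2     # fixed on-disk size of every code

    # Hypothetical prebuilt dictionary: 8-byte sequence <-> 16-bit code.
    seq_to_code = {b"ABCDEFGH": 0, b"12345678": 1}
    code_to_seq = {v: k for k, v in seq_to_code.items()}

    def compress(data):
        return b"".join(struct.pack("<H", seq_to_code[data[i:i + BLOCK]])
                        for i in range(0, len(data), BLOCK))

    def read_at(blob, offset, length):
        """Random read: decode only the codes covering [offset, offset + length)."""
        first, last = offset // BLOCK, (offset + length - 1) // BLOCK
        plain = b"".join(code_to_seq[struct.unpack_from("<H", blob, i * CODE)[0]]
                         for i in range(first, last + 1))
        return plain[offset - first * BLOCK:][:length]

    blob = compress(b"ABCDEFGH12345678ABCDEFGH")
    assert read_at(blob, 6, 4) == b"GH12"   # spans two codes, decodes just those two

An in-place write is equally local: re-encode the affected 8-byte blocks and overwrite their codes, since no code depends on its neighbours.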

#4


2  

I think Stephen Denne might be onto something here. Imagine:

  • zip-like compression of sequences to codes

  • a dictionary mapping code -> sequence

  • file will be like a filesystem
    • each write generates a new "file" (a sequence of bytes, compressed according to dictionary)

    • "filesystem" keeps track of which "file" belongs to which bytes (start, end)

    • each "file" is compressed according to dictionary

    • reads work filewise, uncompressing and retrieving bytes according to "filesystem"

    • writes make "files" invalid, new "files" are appended to replace the invalidated ones

  • this system will need:
    • defragmentation mechanism of filesystem

    • compacting dictionary from time to time (removing unused codes)

  • done properly, housekeeping could be done when nobody is looking (idle time) or by creating a new file and "switching" eventually

One positive effect would be that the dictionary would apply to the whole file. If you can spare the CPU cycles, you could periodically check for sequences overlapping "file" boundaries and then regroup them.

This idea is for truly random reads. If you are only ever going to read fixed sized records, some parts of this idea could get easier.

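A very rough sketch of the bookkeeping described above (my own illustration, with zlib standing in for the shared dictionary coder): an extent table maps uncompressed byte ranges to appended, individually compressed "files"; a write appends a new extent, which shadows the old one; a read decompresses only the "files" that overlap the requested range.

    import zlib

    class CompressedLog:
        def __init__(self):
            self.log = []        # appended compressed "files"
            self.extents = []    # (start, end, index into self.log), newest last

        def write(self, start, data):
            self.log.append(zlib.compress(data))
            self.extents.append((start, start + len(data), len(self.log) - 1))

        def read(self, offset, length):
            out, filled = bytearray(length), [False] * length
            for start, end, idx in reversed(self.extents):       # newest extent wins
                lo, hi = max(start, offset), min(end, offset + length)
                if lo >= hi or all(filled[lo - offset:hi - offset]):
                    continue
                seg = zlib.decompress(self.log[idx])              # only this "file"
                for i in range(lo, hi):
                    if not filled[i - offset]:
                        out[i - offset] = seg[i - start]
                        filled[i - offset] = True
            return bytes(out)

    f = CompressedLog()
    f.write(0, b"A" * 100)
    f.write(40, b"B" * 10)                  # overwrite: old bytes 40..49 are now shadowed
    assert f.read(38, 5) == b"AABBB"

The defragmentation and dictionary-compaction housekeeping from the list above would run over self.log and self.extents during idle time.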

#5


1  

I don't know of any compression algorithm that allows random reads, never mind random writes. If you need that sort of ability, your best bet would be to compress the file in chunks rather than as a whole.

我不知道任何允许随机读取的压缩算法,更不用说随机写入。如果你需要那种能力,那么最好的办法就是将文件压缩成块而不是整体。

e.g.
We'll look at the read-only case first. Let's say you break up your file into 8K chunks. You compress each chunk and store each compressed chunk sequentially. You will need to record where each compressed chunk is stored and how big it is. Then, say you need to read N bytes starting at offset O. You will need to figure out which chunk it's in (O / 8K), decompress that chunk and grab those bytes. The data you need may span multiple chunks, so you have to deal with that scenario.

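A minimal read-only sketch of exactly this scheme (assuming zlib for the per-chunk compression and an in-memory index of (position, size) pairs):

    import zlib

    CHUNK = 8 * 1024    # 8K of uncompressed data per chunk

    def compress_chunked(data):
        blob, index = b"", []                        # index[i] = (offset, size) of chunk i
        for i in range(0, len(data), CHUNK):
            c = zlib.compress(data[i:i + CHUNK])
            index.append((len(blob), len(c)))
            blob += c
        return blob, index

    def read(blob, index, offset, n):
        out = b""
        for i in range(offset // CHUNK, (offset + n - 1) // CHUNK + 1):
            pos, size = index[i]                     # locate and decompress each needed chunk
            out += zlib.decompress(blob[pos:pos + size])
        return out[offset % CHUNK:][:n]              # trim, even when the read spans chunks

    data = bytes(range(256)) * 1024                  # 256 KB of sample data
    blob, index = compress_chunked(data)
    assert read(blob, index, 8190, 10) == data[8190:8200]   # request straddles two chunks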

Things get complicated when you want to be able to write to the compressed file. You have to deal with compressed chunks getting bigger and smaller. You may need to add some extra padding to each chunk in case it expands (it's still the same size uncompressed, but different data will compress to different sizes). You may even need to move chunks if the compressed data is too big to fit back in the original space it was given.

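For the write case, one common dodge is the padding idea from the paragraph above: give each chunk a fixed-size slot with slack, so rewriting it only touches that slot. A hedged sketch (the slot size is an arbitrary guess, and the chunk index would also record the real compressed length):

    import zlib

    CHUNK = 8 * 1024
    SLOT = 6 * 1024      # fixed on-disk slot per chunk: compressed bytes + padding (a guess)

    def rewrite_chunk(f, chunk_no, plain):
        """Recompress one 8K chunk in place without shifting its neighbours."""
        c = zlib.compress(plain)
        if len(c) > SLOT:
            raise NotImplementedError("chunk outgrew its slot; it would have to be relocated")
        f.seek(chunk_no * SLOT)
        f.write(c.ljust(SLOT, b"\0"))    # pad the slot so the following chunks never move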

This is basically how compressed file systems work. You might be better off turning on file system compression for your files and just read/write to them normally.

#6


1  

Compression is all about removing redundancy from the data. Unfortunately, it's unlikely that the redundancy is going to be distributed with monotonous evenness throughout the file, and that's about the only scenario in which you could expect compression and fine-grained random access.

However, you could get close to random access by maintaining an external list, built during the compression, which shows the correspondence between chosen points in the uncompressed datastream and their locations in the compressed datastream. You'd obviously have to choose a method where the translation scheme between the source stream and its compressed version does not vary with the location in the stream (i.e. no LZ77 or LZ78; instead you'd probably want to go for Huffman or byte-pair encoding.) Obviously this would incur a lot of overhead, and you'd have to decide on just how you wanted to trade off between the storage space needed for "bookmark points" and the processor time needed to decompress the stream starting at a bookmark point to get the data you're actually looking for on that read.

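With zlib in particular, one hedged way to build such a bookmark list is to force a full flush at each chosen point: the compressor state is reset there, so a raw-deflate decompressor can be restarted at that compressed offset (which is what makes an LZ77-based coder acceptable here despite the caveat above). A sketch of my own, not a ready-made library feature:

    import zlib

    def compress_with_bookmarks(chunks):
        """Return the compressed blob plus (uncompressed offset, compressed offset) bookmarks."""
        co = zlib.compressobj(6, zlib.DEFLATED, -15)    # raw deflate, no header
        blob, marks, plain_pos = b"", [(0, 0)], 0
        for chunk in chunks:
            blob += co.compress(chunk) + co.flush(zlib.Z_FULL_FLUSH)   # state reset = bookmark
            plain_pos += len(chunk)
            marks.append((plain_pos, len(blob)))
        return blob + co.flush(zlib.Z_FINISH), marks

    def read_from_bookmark(blob, marks, k):
        """Start decompressing at bookmark k instead of at the beginning of the file."""
        _, comp_off = marks[k]
        return zlib.decompressobj(-15).decompress(blob[comp_off:])

    blob, marks = compress_with_bookmarks([b"first block ", b"second block ", b"third block"])
    assert read_from_bookmark(blob, marks, 1).startswith(b"second block")

The trade-off the paragraph describes is visible here: more bookmarks mean a bigger marks list and slightly worse compression (every full flush costs a few bytes and empties the match window), but less data to decompress per read.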

As for random-access writing... that's all but impossible. As already noted, compression is about removing redundancy from the data. If you try to replace a stretch of data that compressed well because it was redundant with new data that doesn't have the same redundancy, the new data simply isn't going to fit.

However, depending on how much random-access writing you're going to do -- you may be able to simulate it by maintaining a sparse matrix representing all data written to the file after the compression. On all reads, you'd check the matrix to see if you were reading an area that you had written to after the compression. If not, then you'd go to the compressed file for the data.

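A hedged sketch of that overlay: a plain dict stands in for the "sparse matrix", and reads consult it byte by byte before falling back to the compressed base (the base reader below is just a placeholder lambda):

    class WriteOverlay:
        """Simulate random-access writes on top of a read-only compressed base."""
        def __init__(self, read_base):
            self.read_base = read_base   # callable (offset, length) -> bytes from the compressed file
            self.overlay = {}            # sparse map: absolute offset -> overwritten byte value

        def write(self, offset, data):
            for i, b in enumerate(data):
                self.overlay[offset + i] = b

        def read(self, offset, length):
            base = self.read_base(offset, length)
            return bytes(self.overlay.get(offset + i, base[i]) for i in range(length))

    store = WriteOverlay(lambda off, n: b"x" * n)   # placeholder for the real compressed reader
    store.write(5, b"HELLO")
    assert store.read(3, 6) == b"xxHELL"            # mixes base bytes with overlaid bytes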

#7


-1  

No compression scheme will allow fine-grained random access, for two related reasons:

  • you can't know exactly how far into the compressed file your desired piece of data is, therefore

  • there is no way to know where a symbol starts (at what bit position for Huffman, worse for arithmetic coding).

I can only suggest treating the file like a broadcast stream and inserting frequent synchronization / position markers, with obvious overhead (the sync marks not only take up space themselves, but complicate the encoding because it has to avoid "accidental" sync marks!). Alternatively, to keep a seek from turning into something like a binary search (though you can usually make a better first guess than the middle), you could include a "table of contents" at the start or end of the file.

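For the "table of contents" variant, one possible layout (my own sketch, not any standard format) is: compressed chunks first, then the serialized index, then a fixed-size footer holding the index's offset, so a reader finds everything with two seeks from the end of the file.

    import json, struct, zlib

    def pack(chunks, path):
        """Layout: [compressed chunks][JSON table of contents][8-byte offset of the TOC]."""
        with open(path, "wb") as f:
            toc = []
            for chunk in chunks:
                c = zlib.compress(chunk)
                toc.append({"pos": f.tell(), "size": len(c), "plain": len(chunk)})
                f.write(c)
            toc_pos = f.tell()
            f.write(json.dumps(toc).encode())
            f.write(struct.pack("<Q", toc_pos))      # fixed-size footer

    def load_toc(path):
        with open(path, "rb") as f:
            f.seek(-8, 2)                             # footer: where does the TOC start?
            toc_pos, = struct.unpack("<Q", f.read(8))
            f.seek(toc_pos)
            return json.loads(f.read()[:-8])          # TOC bytes, minus the trailing footer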

As for random-access writing... I can't think of any neat solution :(
