This is similar to a previous question, but the answers there don't satisfy my needs and my question is slightly different:
I currently use gzip compression for some very large files which contain sorted data. When the files are not compressed, binary search is a handy and efficient way to support seeking to a location in the sorted data.
But when the files are compressed, things get tricky. I recently found out about zlib's Z_FULL_FLUSH option, which can be used during compression to insert "sync points" in the compressed output (inflateSync() can then begin reading from various points in the file). This is OK, though files I already have would have to be recompressed to add this feature (and strangely gzip doesn't have an option for this, but I'm willing to write my own compression program if I must).
It seems from one source that even Z_FULL_FLUSH is not a perfect solution... not only is it not supported by all gzip archives, but the very idea of detecting sync points in archives may produce false positives (either by coincidence with the magic number for sync points, or due to the fact that Z_SYNC_FLUSH also produces sync points, but they are not usable for random access).
Is there a better solution? I'd like to avoid having auxiliary files for indexing if possible, and explicit, default support for quasi-random access would be helpful (even if it's large-grained--like being able to start reading at each 10 MB interval). Is there another compression format with better support for random reads than gzip?
Edit: As I mentioned, I wish to do binary search in the compressed data. I don't need to seek to a specific (uncompressed) position--only to seek with some coarse granularity within the compressed file. I just want support for something like "Decompress the data starting roughly 50% (25%, 12.5%, etc.) of the way into this compressed file."
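To make the goal concrete, here is a minimal Python sketch of the Z_FULL_FLUSH idea, assuming the data is stored as a raw deflate stream and that I scan for the 4-byte flush marker instead of keeping an index. The function names are made up for illustration, and the scan can hit the false positives mentioned above, so failures have to be tolerated.

    import zlib

    SYNC_MARKER = b"\x00\x00\xff\xff"  # trailer of the empty stored block that a flush emits

    def compress_with_sync_points(chunks):
        """Compress an iterable of byte chunks, ending each with a full flush."""
        c = zlib.compressobj(wbits=-15)            # raw deflate, no gzip/zlib wrapper
        out = bytearray()
        for chunk in chunks:
            out += c.compress(chunk)
            out += c.flush(zlib.Z_FULL_FLUSH)      # byte-aligns output and clears the window
        out += c.flush(zlib.Z_FINISH)
        return bytes(out)

    def read_from_fraction(blob, fraction):
        """Start decompressing roughly `fraction` of the way into the compressed data."""
        pos = blob.find(SYNC_MARKER, int(len(blob) * fraction))
        while pos != -1:
            d = zlib.decompressobj(wbits=-15)
            try:
                # Only works if this marker came from Z_FULL_FLUSH; a Z_SYNC_FLUSH point or a
                # coincidental byte pattern will normally fail here (the false-positive caveat).
                return d.decompress(blob[pos + len(SYNC_MARKER):])
            except zlib.error:
                pos = blob.find(SYNC_MARKER, pos + 1)
        raise ValueError("no usable sync point found")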
12 Answers
#1
17
I don't know of any compressed file format which would support random access to a specific location in the uncompressed data (well, except for multimedia formats), but you can brew your own.
For example, bzip2 compressed files are composed of independent compressed blocks of size <1 MB uncompressed, which are delimited by sequences of magic bytes, so you could parse the bzip2 file, get the block boundaries and then just uncompress the right block. This would need some indexing to remember where the blocks start.
Still, I think the best solution would be to split your file into chunks of your choice, and then compress them with some archiver, like zip or rar, which supports random access to individual files in the archive.
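A rough sketch of the split-and-archive idea (the chunk size, the member naming, and the use of Python's zipfile module are my own choices for illustration):

    import zipfile

    CHUNK = 10 * 1024 * 1024  # 10 MB of uncompressed data per archive member (arbitrary choice)

    def build_archive(src_path, zip_path):
        # Split the uncompressed file into fixed-size chunks, one zip member each.
        with open(src_path, "rb") as src, zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            i = 0
            while True:
                data = src.read(CHUNK)
                if not data:
                    break
                zf.writestr("chunk%05d" % i, data)
                i += 1

    def read_chunk(zip_path, uncompressed_offset):
        # Random access: only the member containing the offset is decompressed.
        with zipfile.ZipFile(zip_path) as zf:
            name = "chunk%05d" % (uncompressed_offset // CHUNK)
            data = zf.read(name)
            return data[uncompressed_offset % CHUNK:]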
#2
30
Take a look at dictzip. It is compatible with gzip and allows coarse random access.
An excerpt from its man page:
dictzip compresses files using the gzip(1) algorithm (LZ77) in a manner which is completely compatible with the gzip file format. An extension to the gzip file format (Extra Field, described in 2.3.1.1 of RFC 1952) allows extra data to be stored in the header of a compressed file. Programs like gzip and zcat will ignore this extra data. However, [dictzcat --start] will make use of this data to perform pseudo-random access on the file.
The dictzip package is available in Ubuntu, and its source code is in dictd-*.tar.gz. Its license is GPL. You are free to study it.
Update:
I improved dictzip to have no file size limit. My implementation is under the MIT license.
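To give a feel for how the coarse access works, here is a rough sketch of pulling a single chunk out of a dictzip file. The 'RA' subfield layout (version, chunk length, chunk count, then 16-bit compressed chunk sizes, all little-endian) is my reading of the dictd sources and should be treated as an assumption; the sketch also ignores the optional FCOMMENT/FHCRC header fields.

    import struct
    import zlib

    def read_dictzip_chunk(path, chunk_index):
        with open(path, "rb") as f:
            header = f.read(12)                        # 10-byte gzip header plus XLEN
            flg = header[3]
            if not flg & 0x04:
                raise ValueError("no FEXTRA field; probably not a dictzip file")
            xlen = struct.unpack("<H", header[10:12])[0]
            extra = f.read(xlen)

            # Locate the 'RA' subfield inside the gzip Extra Field.
            sizes, pos = None, 0
            while pos + 4 <= len(extra):
                si = extra[pos:pos + 2]
                slen = struct.unpack("<H", extra[pos + 2:pos + 4])[0]
                if si == b"RA":
                    data = extra[pos + 4:pos + 4 + slen]
                    ver, chlen, chcnt = struct.unpack("<HHH", data[:6])  # chlen = uncompressed bytes per chunk
                    sizes = struct.unpack("<%dH" % chcnt, data[6:6 + 2 * chcnt])
                pos += 4 + slen
            if sizes is None:
                raise ValueError("no RA subfield found")

            if flg & 0x08:                             # skip FNAME, if present
                while f.read(1) not in (b"\x00", b""):
                    pass

            # Each chunk ends with a full flush, so it inflates on its own as raw deflate.
            f.seek(sum(sizes[:chunk_index]), 1)        # relative seek past earlier chunks
            raw = f.read(sizes[chunk_index])
            return zlib.decompressobj(wbits=-15).decompress(raw)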
#3
7
Solutions exist for providing random access to gzip and bzip2 archives:
- gzip zran.c from the ghostscript source code
- bzip2 seek-bzip by James Taylor
(I'm looking for something for 7zip)
#4
7
The .xz file format (which uses LZMA compression) seems to support this:
Random-access reading: The data can be split into independently compressed blocks. Every .xz file contains an index of the blocks, which makes limited random-access reading possible when the block size is small enough.
This should be sufficient for your purpose. A drawback is that the API of liblzma (for interacting with these containers) does not seem that well-documented, so it may take some effort figuring out how to randomly access blocks.
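One low-tech way to get a similar effect without touching liblzma's block API is to write each fixed-size chunk as its own .xz stream and remember where each stream starts; concatenated .xz streams still form a valid .xz file, so normal tools can decompress the whole thing. A hedged Python sketch (the chunk size and the side index are my own choices):

    import lzma

    CHUNK = 10 * 1024 * 1024  # uncompressed bytes per independent .xz stream

    def compress_indexed(src_path, xz_path):
        offsets = []                                  # compressed offset where each stream starts
        with open(src_path, "rb") as src, open(xz_path, "wb") as dst:
            while True:
                data = src.read(CHUNK)
                if not data:
                    break
                offsets.append(dst.tell())
                dst.write(lzma.compress(data))        # one complete .xz stream per chunk
        return offsets                                # keep in memory or in a small side file

    def read_block(xz_path, offsets, i):
        with open(xz_path, "rb") as f:
            f.seek(offsets[i])
            end = offsets[i + 1] if i + 1 < len(offsets) else None
            raw = f.read() if end is None else f.read(end - offsets[i])
            return lzma.decompress(raw)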
#5
4
bgzip can compress files in a gzip variant which is indexable (and can be decompressed by gzip). This is used in some bioinformatics applications, together with the tabix indexer.
See explanations here: http://blastedbio.blogspot.fr/2011/11/bgzf-blocked-bigger-better-gzip.html, and here: http://www.htslib.org/doc/tabix.html.
I don't know to what extent it is adaptable to other applications.
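For what it's worth, BGZF is just a series of small self-contained gzip members whose compressed size is recorded in a 'BC' extra subfield, so a reader can hop from block to block without decompressing anything. A rough sketch, assuming the 'BC' subfield sits first in the extra field (as bgzip writes it, if I read the SAM/BGZF spec correctly):

    import gzip
    import struct

    def bgzf_block_offsets(path):
        """Walk a BGZF file and return the compressed offset of every block."""
        offsets = []
        with open(path, "rb") as f:
            while True:
                start = f.tell()
                header = f.read(18)                  # gzip header + XLEN + 'BC' subfield
                if len(header) < 18:
                    break
                # BSIZE = total block size minus 1, stored as uint16 at bytes 16..17
                bsize = struct.unpack("<H", header[16:18])[0] + 1
                offsets.append(start)
                f.seek(start + bsize)
        return offsets

    def read_bgzf_block(path, offsets, i):
        with open(path, "rb") as f:
            f.seek(offsets[i])
            end = offsets[i + 1] if i + 1 < len(offsets) else None
            raw = f.read() if end is None else f.read(end - offsets[i])
            return gzip.decompress(raw)              # each block is an ordinary gzip member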
#6
3
I'm not sure if this would be practical in your exact situation, but couldn't you just gzip each large file into smaller files, say 10 MB each? You would end up with a bunch of files: file0.gz, file1.gz, file2.gz, etc. Based on a given offset within the original large file, you could search in the file named "file" + (offset / 10485760) + ".gz". The offset within the uncompressed archive would be offset % 10485760.
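The arithmetic from this answer as a tiny sketch (the file naming scheme is the one suggested above; the loop just handles reads that cross a 10 MB boundary):

    import gzip

    CHUNK = 10485760  # 10 MB of uncompressed data per piece

    def read_at(offset, length):
        """Read `length` uncompressed bytes starting at `offset` in the original file."""
        out = b""
        while len(out) < length:
            name = "file%d.gz" % (offset // CHUNK)
            with gzip.open(name, "rb") as f:
                f.seek(offset % CHUNK)               # cheap: only this 10 MB piece is decompressed
                data = f.read(length - len(out))
            if not data:
                break
            out += data
            offset += len(data)
        return out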
#7
3
Because lossless compression works better on some areas than others, if you store compressed data into blocks of convenient length BLOCKSIZE, even though each block has exactly the same number of compressed bytes, some compressed blocks will expand to a much longer piece of plaintext than others.
You might look at "Compression: A Key for Next-Generation Text Retrieval Systems" by Nivio Ziviani, Edleno Silva de Moura, Gonzalo Navarro, and Ricardo Baeza-Yates in Computer magazine, November 2000: http://doi.ieeecomputersociety.org/10.1109/2.881693
Their decompressor takes 1, 2, or 3 whole bytes of compressed data and decompresses (using a vocabulary list) into a whole word. One can directly search the compressed text for words or phrases, which turns out to be even faster than searching uncompressed text.
Their decompressor lets you point to any word in the text with a normal (byte) pointer and start decompressing immediately from that point.
You can give every word a unique 2 byte code, since you probably have less than 65,000 unique words in your text. (There are almost 13,000 unique words in the KJV Bible). Even if there are more than 65,000 words, it's pretty simple to assign the first 256 two-byte code "words" to all possible bytes, so you can spell out words that aren't in the lexicon of the 65,000 or so "most frequent words and phrases". (The compression gained by packing frequent words and phrases into two bytes is usually worth the "expansion" of occasionally spelling out a word using two bytes per letter). There are a variety of ways to pick a lexicon of "frequent words and phrases" that will give adequate compression. For example, you could tweak a LZW compressor to dump "phrases" it uses more than once to a lexicon file, one line per phrase, and run it over all your data. Or you could arbitrarily chop up your uncompressed data into 5 byte phrases in a lexicon file, one line per phrase. Or you could chop up your uncompressed data into actual English words, and put each word -- including the space at the beginning of the word -- into the lexicon file. Then use "sort --unique" to eliminate duplicate words in that lexicon file. (Is picking the perfect "optimum" lexicon wordlist still considered NP-hard?)
Store the lexicon at the beginning of your huge compressed file, pad it out to some convenient BLOCKSIZE, and then store the compressed text -- a series of two byte "words" -- from there to the end of the file. Presumably the searcher will read this lexicon once and keep it in some quick-to-decode format in RAM during decompression, to speed up decompressing "two byte code" to "variable-length phrase". My first draft would start with a simple one line per phrase list, but you might later switch to storing the lexicon in a more compressed form using some sort of incremental coding or zlib.
You can pick any random even byte offset into the compressed text, and start decompressing from there. I don't think it's possible to make a finer-grained random access compressed file format.
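A toy version of the two-byte word-code scheme, just to make the decode-from-any-even-offset property concrete; it omits the escape mechanism for out-of-lexicon words described above, and all names are illustrative:

    import struct

    def build_lexicon(text):
        # One entry per distinct token, space included, as suggested above.
        words = sorted(set(" " + w for w in text.split(" ")))
        assert len(words) < 65536
        return words, {w: i for i, w in enumerate(words)}

    def encode(text, code_of):
        return b"".join(struct.pack("<H", code_of[" " + w]) for w in text.split(" "))

    def decode_from(blob, byte_offset, words, n_codes):
        # Any even byte offset is a valid starting point.
        assert byte_offset % 2 == 0
        codes = struct.iter_unpack("<H", blob[byte_offset:byte_offset + 2 * n_codes])
        return "".join(words[c] for (c,) in codes)

    words, code_of = build_lexicon("to be or not to be")
    blob = encode("to be or not to be", code_of)
    print(decode_from(blob, 4, words, 3))   # decodes starting at the third word: " or not to"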
#8
1
Two possible solutions:
- Let the OS deal with compression: create and mount a compressed file system (SquashFS, clicfs, cloop, cramfs, e2compr or whatever) containing all your text files, and don't do anything about compression in your application program.
- Use clicfs directly on each text file (one clicfs per text file) instead of compressing a filesystem image. Think of "mkclicfs mytextfile mycompressedfile" being "gzip <mytextfile >mycompressedfile" and "clicfs mycompressedfile directory" as a way of getting random access to the data via the file "directory/mytextfile".
#9
1
I don't know if it's been mentioned yet, but the Kiwix project has done great work in this regard. Through their program Kiwix, they offer random access to ZIM file archives. Good compression, too. The project originated when there was a demand for offline copies of Wikipedia (which has reached above 100 GB in uncompressed form, with all media included). They have successfully taken a 25 GB file (a single-file embodiment of Wikipedia without most of the media) and compressed it to a measly 8 GB ZIM file archive. And through the Kiwix program, you can call up any page of Wikipedia, with all associated data, faster than you can surf the net.
Even though the Kiwix program is a technology based around the Wikipedia database structure, it proves that you can have excellent compression ratios and random access simultaneously.
#10
1
This is a very old question but it looks like zindex could provide a good solution (although I don't have much experience with it)
#11
0
razip supports random access with better performance than gzip/bzip2, which have to be tweaked for this kind of support by trading away compression for merely "ok" random access:
http://sourceforge.net/projects/razip/
#12
0
I am the author of an open-source tool for compressing a particular type of biological data. This tool, called starch, splits the data by chromosome and uses those divisions as indices for fast access to compressed data units within the larger archive.
Per-chromosome data are transformed to remove redundancy in genomic coordinates, and the transformed data are compressed with either bzip2 or gzip algorithms. The offsets, metadata and compressed genomic data are concatenated into one file.
Source code is available from our GitHub site. We have compiled it under Linux and Mac OS X.
For your case, you could store (10 MB, or whatever) offsets in a header to a custom archive format. You parse the header, retrieve the offsets, and incrementally fseek through the file by current_offset_sum + header_size.
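A sketch of that last suggestion: a made-up container whose header lists the compressed size of each fixed-size unit, so a reader can seek to unit i by summing the sizes before it. This is not the starch format, just an illustration of the header-of-offsets idea.

    import gzip
    import struct

    CHUNK = 10 * 1024 * 1024  # 10 MB of uncompressed data per unit

    def write_archive(src_path, out_path):
        # Compress fixed-size units, then prepend a header of per-unit compressed sizes.
        units = []
        with open(src_path, "rb") as src:
            while True:
                data = src.read(CHUNK)
                if not data:
                    break
                units.append(gzip.compress(data))
        with open(out_path, "wb") as out:
            out.write(struct.pack("<I", len(units)))
            for u in units:
                out.write(struct.pack("<Q", len(u)))
            for u in units:
                out.write(u)

    def read_unit(path, i):
        with open(path, "rb") as f:
            (count,) = struct.unpack("<I", f.read(4))
            sizes = struct.unpack("<%dQ" % count, f.read(8 * count))
            header_size = 4 + 8 * count
            f.seek(header_size + sum(sizes[:i]))      # i.e. current_offset_sum + header_size
            return gzip.decompress(f.read(sizes[i]))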