I want to emulate the functionality of gzcat | tail -n.
This would be helpful for times when there are huge files (a few GB or so). Can I tail the last few lines of such a file without reading it from the beginning? I suspect this won't be possible, since I'd guess that with gzip the encoding of any given point depends on all the preceding text.
But I'd still like to hear whether anyone has tried doing something similar - maybe by investigating a compression algorithm that could provide such a feature.
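For reference, the straightforward pipeline looks like this; the problem is just that gzip has to decompress the entire file before tail ever sees the last lines (a minimal sketch, assuming a GNU/Linux shell and an illustrative file name):

    # decompresses every byte of huge.log.gz just to print the last 100 lines
    gzcat huge.log.gz | tail -n 100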
6 Answers
#1
No, you can't. The zipping algorithm works on streams and adapts its internal codings to what the stream contains to achieve its high compression ratio.
Without knowing what the contents of the stream are before a certain point, it's impossible to know how to go about de-compressing from that point on.
Any algorithm which allows you to de-compress arbitrary parts of it will require multiple passes over the data to compress it.
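A quick way to convince yourself of this is to try decompressing from an arbitrary offset; a minimal sketch, assuming GNU coreutils and an illustrative file name:

    # skip the first 100 MiB and try to decompress the remainder - this fails,
    # because the middle of a gzip stream has no header and its codes depend
    # on everything that came before
    dd if=huge.log.gz bs=1M skip=100 2>/dev/null | gzip -dc > /dev/null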
#2
BGZF is used to create the indexed, gzip-compressed BAM files produced by Samtools. These are randomly accessible.
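For illustration, this is roughly how that random access is used in practice; a sketch assuming samtools is installed, with an illustrative, coordinate-sorted BAM file:

    # BAM files are BGZF-compressed; the .bai index lets samtools jump straight
    # to a region without decompressing the whole file
    samtools index aln.bam                     # writes aln.bam.bai
    samtools view aln.bam chr1:10000-20000     # decompresses only the blocks covering that region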
#3
If you have control over what goes into the file in the first place, and if it's anything like a ZIP file, you could store chunks of predetermined size with filenames in increasing numerical order and then just decompress the last chunk/file.
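A minimal sketch of that idea, assuming GNU coreutils and the Info-ZIP zip/unzip tools, with illustrative file names:

    # write fixed-size chunks with increasing numeric names, then archive them
    split -b 64M -d -a 6 huge.log chunk.        # chunk.000000, chunk.000001, ...
    zip chunks.zip chunk.*

    # later: pick the last entry by name and decompress only that one
    last=$(zipinfo -1 chunks.zip | sort | tail -n 1)
    unzip -p chunks.zip "$last" | tail -n 20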
#4
If it's an option, then bzip2 might be a better compression algorithm to use for this purpose.
Bzip2 uses a block compression scheme. As such, if you take a chunk from the end of your file which you are sure is large enough to contain all of the last block, then you can recover it with bzip2recover.
The block size is selectable at the time the file is written. In fact, that's what you're choosing when you pass -1 (or --fast) through -9 (or --best) as compression options, which correspond to block sizes of 100k to 900k. The default is 900k.
The bzip2 command-line tools don't give you a nice, friendly way to do this in a pipeline, but then, given that bzip2 is not stream oriented, perhaps that's not surprising.
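A rough sketch of that recovery trick, assuming GNU tail plus the standard bzip2/bzip2recover tools, with illustrative file names and sizes:

    # grab the last few MB - enough to be sure it contains at least the whole last block
    tail -c 8M huge.log.bz2 > tailpart.bz2
    # bzip2recover scans for block boundaries and writes each intact block it finds
    # to its own small .bz2 file (rec*tailpart.bz2); a block that started before the
    # fragment simply isn't recovered
    bzip2recover tailpart.bz2
    # decompress the recovered blocks in order and tail the result
    bzip2 -dc rec*tailpart.bz2 | tail -n 20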
#5
zindex creates and queries an index on a compressed, line-based text file in a time- and space-efficient way.
#6
An example of a fully gzip-compatible pseudo-random access format is dictzip:
For compression, the file is divided up into "chunks" of data, each chunk is less than 64kB. [...]
To perform random access on the data, the offset and length of the data are provided to library routines. These routines determine the chunk in which the desired data begins and decompress that chunk. Consecutive chunks are decompressed as necessary.
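The same idea can be approximated with plain gzip, since a gzip file may consist of many concatenated members, each independently decompressible. A conceptual sketch of the chunk-plus-index approach (not the dictzip format itself), assuming GNU coreutils and illustrative names:

    # compress the file as a sequence of small, independent gzip members
    split -b 64K -d -a 6 huge.log part.
    for p in part.*; do gzip -c "$p"; done > huge.log.gz    # zcat still reproduces the whole file

    # given a separate index of member byte offsets (which is what dictzip's
    # chunk table provides), you can start reading at the last member's offset
    # and decompress only that member; $offset here is a hypothetical value from such an index
    tail -c +$((offset + 1)) huge.log.gz | gzip -dc | tail -n 20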