How to merge two large files

Date: 2022-07-18 22:48:02

Suppose I have two files of 100 GB each. I want to merge them into one and then delete the originals. On Linux we can use

cat file1 file2 > final_file

But that needs to read two big files and then write an even bigger one. Is it possible to just append one file to the other, so that no bulk IO is required? Since a file's metadata contains its location and length, I am wondering whether the merge could be done by changing the metadata alone, so that no data copying happens.

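For reference, the metadata in question can be inspected with `stat` (assuming GNU coreutils): it records the file's size and allocated blocks, but the actual block map lives inside the filesystem and is not exposed for this kind of editing.

```shell
# Show the size and block allocation that the inode records for a file.
# %s = size in bytes, %b = blocks allocated, %B = the unit %b is counted in.
stat -c '%s bytes, %b blocks of %B bytes' file1
```
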
3 solutions

#1

Can you merge two files without writing one file onto the other?

Only in obscure theory. Since disk storage is always based on blocks, and filesystems therefore store things on block boundaries, you could only append one file to another without rewriting if the first file ended perfectly on a block boundary. There are some rare filesystem configurations that use tail packing, but that would only help if the first file were already using the tail block of the previous file.

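As a sketch (assuming GNU coreutils, where `stat -f -c %S` prints the filesystem's fundamental block size), you can check whether the first file happens to end on a block boundary:

```shell
# Does file1 end exactly on a filesystem block boundary?
size=$(stat -c %s file1)        # file size in bytes
bs=$(stat -f -c %S file1)       # filesystem fundamental block size
if [ $((size % bs)) -eq 0 ]; then
    echo "file1 ends on a block boundary"
else
    echo "file1 leaves $((bs - size % bs)) bytes unused in its last block"
fi
```
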
Unless that perfect scenario occurs or your filesystem is able to mark a partial block in the middle of a file (I've never heard of this), this won't work. Just to kick the edge case around: there's also no way, short of changing the kernel interface, to make such a call (re: Link to a specific inode)

Can we make this better than doubling the size of both files?

Yes, we can use the append (>>) operation instead.

cat file2 >> file1

That will still result in using all of the space consumed by file2 twice over until we can delete it.

Can we avoid using extra space?

No. Unless somebody comes back with something I don't know, you're basically out of luck there. It's possible to truncate a file, forgetting about the existence of its end, but there is no way to forget about the existence of its start, unless we go back to modifying inodes directly and altering the kernel's interface to the filesystem, since that's definitely not a POSIX operation.

What about writing a little bit at a time, then deleting what we wrote?

No again. Since we can't chop the start off a file, we'd have to rewrite everything from the point of interest all the way to the end of the file. That would be very costly in IO, and only useful after we've already read half the file.

What about sparse files?

Maybe! Sparse files allow us to store a long string of zeroes without using up nearly that much space. If we were to read file2 in large chunks starting at the end, we could write those blocks to the end of file1. file1 would immediately look (and read) as if it were the combined size of both, but it would be corrupted until we were done, because everything we hadn't yet written would read as zeroes.

Explaining all this is another answer in itself, but if you can do a sparse allocation, you would be able to use only your chunk read size plus a little extra disk space to perform this operation. For a reference on sparse blocks in the middle of files, see http://lwn.net/Articles/357767/ or do a search for the term SEEK_HOLE.

Why is this "maybe" instead of "yes"? Two reasons: you'd have to write your own tool (at least we're on the right site for that), and sparse files are not universally respected by filesystems and other processes alike. Fortunately you probably won't have to worry about other processes respecting your file, but you will have to worry about setting the right flags and making sure your filesystem is amenable. Last of all, you'll still be reading and re-writing the full length of file2, which isn't what you want. This method does mean you can do the append using only a small amount of extra disk space, though, rather than at least 2*file2 worth of space.

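A minimal sketch of that back-to-front copy (the helper name `sparse_append` and the chunk size are mine, not from any library; it assumes a filesystem that supports sparse files, so that writing past EOF leaves a hole rather than allocating zeroes):

```python
import os

def sparse_append(dst_path, src_path, chunk=64 * 1024 * 1024):
    """Append src to dst back to front, shrinking src as we go.

    Writing the final chunk of src at its final offset in dst first
    leaves a hole (unallocated zeroes) behind it; each later write
    fills that hole from the back, while truncate() releases the
    corresponding tail blocks of src. Peak extra disk usage is roughly
    one chunk. Until the loop finishes, dst reads as corrupted: the
    not-yet-written middle is all zeroes.
    """
    base = os.path.getsize(dst_path)
    end = os.path.getsize(src_path)
    with open(dst_path, "r+b") as dst, open(src_path, "r+b") as src:
        while end > 0:
            start = max(0, end - chunk)
            src.seek(start)
            data = src.read(end - start)
            dst.seek(base + start)   # past EOF on the first pass -> hole
            dst.write(data)
            src.truncate(start)      # hand src's tail blocks back
            end = start
    os.remove(src_path)
```

Whether `truncate()` actually frees blocks mid-operation, and whether the seek-past-EOF write really stays sparse, depends on the filesystem, so verify with `du` on a small test before trusting this on 100 GB files.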
#2

You can do it like this

cat file2 >> file1

file1 will become the full content.

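Spelled out with the cleanup step, and with the delete guarded on a successful append:

```shell
# Append file2 onto file1 in place, then delete file2.
# Until rm completes, file2's data exists twice on disk.
cat file2 >> file1 && rm file2
```
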
#3

No, it is not possible to merge (on Linux) two big files by working on their meta-data.

You might consider using some kind of database for your work instead.

As Alexandre noticed, you can append one big file to another, but this still requires a lot of data copying.
