How do I read a block of data from a file, and then read from that block into a vector?

Time: 2022-04-20 03:04:03

Suppose I have a file that holds x records. One 'block' holds m records, so the total number of blocks in the file is n = x/m. If I know the size of one record, say b bytes (so one block is b*m bytes), I can read a complete block at once using the read() system call (is there any other method?). Now, how do I read each record from this block and put each record into a vector as a separate element?

The reason I want to do this in the first place is to reduce the number of disk I/O operations, since, from what I have learned, disk I/O operations are much more expensive. Or will it take the same amount of time as reading record by record from the file and putting each record directly into the vector? Reading block by block, I will have only n disk I/Os, whereas reading record by record I will have x.
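
For reference, the block-at-a-time approach described in the question could look roughly like the sketch below. It assumes a POSIX system and a hypothetical fixed-size `Record` of 16 bytes (so b = 16); the struct, the function name `read_records` and the parameter `records_per_block` (m) are illustrative names, not part of the original question.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical fixed-size record; the real layout depends on the file format.
struct Record {
    char bytes[16];  // b = 16 bytes per record (an assumption)
};

// Reads the file one block (records_per_block records) at a time with read()
// and appends every complete record to the returned vector.
std::vector<Record> read_records(const std::string& path,
                                 std::size_t records_per_block) {
    assert(records_per_block > 0);
    std::vector<Record> records;

    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) return records;  // could not open: return an empty vector

    std::vector<char> block(records_per_block * sizeof(Record));
    for (;;) {
        // read() may return fewer bytes than requested, so fill the block in a loop.
        std::size_t filled = 0;
        while (filled < block.size()) {
            ssize_t n = read(fd, block.data() + filled, block.size() - filled);
            if (n <= 0) break;  // 0 = end of file, < 0 = error
            filled += static_cast<std::size_t>(n);
        }

        // Copy each complete record out of the block into the vector.
        std::size_t complete = filled / sizeof(Record);
        for (std::size_t i = 0; i < complete; ++i) {
            Record r;
            std::memcpy(&r, block.data() + i * sizeof(Record), sizeof(Record));
            records.push_back(r);
        }

        if (filled < block.size()) break;  // short block: end of file or error
    }

    close(fd);
    return records;
}
```

Calling `read_records("data.bin", m)` issues roughly n = x/m read() calls instead of x. Whether that translates into fewer physical disk operations depends on the operating system's own caching and read-ahead.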

Thanks.

2 solutions

#1


3  

You should consider using mmap() instead of reading your files using read().

What's nice about mmap is that the file contents are simply mapped into your process's address space, so you can treat them as if you already had a pointer to them. By inspecting the memory and treating it as an array, or by copying data out with memcpy(), you implicitly perform read operations, but only as necessary: the operating system's virtual memory subsystem is smart enough to do this very efficiently.
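
A minimal sketch of this idea, assuming a POSIX system and the same kind of hypothetical fixed-size 16-byte `Record` as in the question; the function name `load_records_mmap` is made up for illustration. The whole file is mapped read-only and the records are copied into a vector with a single memcpy().

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Hypothetical fixed-size record; the real layout depends on the file format.
struct Record {
    char bytes[16];
};

// Maps the whole file read-only and copies every complete record into a vector.
std::vector<Record> load_records_mmap(const std::string& path) {
    std::vector<Record> records;

    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) return records;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {  // mmap rejects a zero length
        close(fd);
        return records;
    }
    std::size_t size = static_cast<std::size_t>(st.st_size);

    void* base = mmap(nullptr, size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);  // the mapping stays valid after the descriptor is closed
    if (base == MAP_FAILED) return records;

    // Touching the mapped bytes is what triggers the actual reads (page faults).
    std::size_t count = size / sizeof(Record);
    assert(count * sizeof(Record) <= size);
    records.resize(count);
    if (count > 0) std::memcpy(records.data(), base, count * sizeof(Record));

    munmap(base, size);
    return records;
}
```

If the records only need to be inspected, the copy into the vector can be skipped and the mapping used directly as an array, as long as it is kept mapped while in use.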

The only likely reason to avoid mmap is if you are running on a 32-bit OS and the file size exceeds 2 gigabytes (or is slightly less than that). In this case the OS may have trouble allocating address space for your mmap-ed memory. On a 64-bit OS, using mmap should never be a problem.

Also, mmap can be cumbersome if you are writing a lot of data and the size of the data is not known upfront. Other than that, it is usually simpler, and often faster, than using read.

Actually, most modern operating systems rely on mmap extensively. For example, on Linux, to execute a binary, the executable is simply mmap-ed and executed from memory as if it had been copied there by read, but without actually reading it up front.

#2


2  

Reading a block at a time won't necessarily reduce the number of I/O operations at all. The standard library already does buffering as it reads data from a file, so you do not (normally) expect to see an actual disk input operation every time you attempt to read from a stream (or anything close).

It's still possible that reading a block at a time would reduce the number of I/O operations. If your block is larger than the buffer the stream uses by default, then you'd expect to see fewer I/O operations used to read the data. On the other hand, you can accomplish the same thing by simply adjusting the size of the buffer used by the stream (which is probably a lot easier).
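
One way to adjust the stream's buffer is sketched below, under a few assumptions: the records are a hypothetical fixed-size 16-byte `Record`, and `read_records_buffered` and `buffer_bytes` are illustrative names. Whether `pubsetbuf` is honored is implementation-defined; with libstdc++ it takes effect when called before the file is opened, which is why the call comes before `open()` here.

```cpp
#include <cassert>
#include <cstddef>
#include <fstream>
#include <ios>
#include <string>
#include <vector>

// Hypothetical fixed-size record; the real layout depends on the file format.
struct Record {
    char bytes[16];
};

// Reads record by record, but through a stream whose buffer is buffer_bytes large,
// so the library can fetch many records per underlying I/O operation.
std::vector<Record> read_records_buffered(const std::string& path,
                                          std::size_t buffer_bytes) {
    assert(buffer_bytes > 0);

    // Declared before the stream so that it outlives the stream that uses it.
    std::vector<char> buffer(buffer_bytes);

    std::ifstream in;
    // Implementation-defined: the library may ignore this request.
    in.rdbuf()->pubsetbuf(buffer.data(),
                          static_cast<std::streamsize>(buffer.size()));
    in.open(path, std::ios::binary);

    std::vector<Record> records;
    Record r;
    // read() fails on a trailing partial record, which is then dropped.
    while (in.read(r.bytes, sizeof r.bytes)) {
        records.push_back(r);
    }
    return records;
}
```

With this, the reading loop stays record-oriented and only the buffer size changes, for example `read_records_buffered("data.bin", b * m)` to make the buffer match one block.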
