getline while reading a file vs. reading the whole file and then splitting on newline characters

Time: 2021-07-22 15:41:01

I want to process each line of a file on a hard disk. Is it better to load the file as a whole and then split it on newline characters (using boost), or is it better to use getline()? My question is: does getline() read a single line when called (resulting in multiple hard-disk accesses), or does it read the whole file and hand it out line by line?

6 Answers

#1 (score: 5)

getline will call read() as a system call somewhere deep in the guts of the C library. Exactly how many times it is called, and how it is called, depends on the C library design. But most likely there is no distinct difference between reading a line at a time and reading the whole file, because the OS at the bottom layer will read (at least) one disk block at a time, and most likely at least a "page" (4KB), if not more.

Further, unless you do nearly nothing with your string after you have read it (e.g. you are writing something like "grep", so mostly just reading the file to find a string), it is unlikely that the overhead of reading a line at a time is a large part of the time you spend.

But the "load the whole file in one go" has several, distinct, problems:

  1. You don't start processing until you have read the whole file.
  2. You need enough memory to read the entire file into memory - what if the file is a few hundred GB in size? Does your program fail then?

Don't try to optimise something unless you have used profiling to prove that it's part of why your code is running slow. You are just causing more problems for yourself.

Edit: So, I wrote a program to measure this, since I think it's quite interesting.

And the results are definitely interesting - to make the comparison fair, I created three large files of 1297984192 bytes each (by copying all source files in a directory with about a dozen different source files, then copying this file several times over to "multiply" it, until it took over 1.5 seconds to run the test, which is how long I think you need to run things to make sure the timing isn't too susceptible to random "network packet came in" or some other outside influences taking time out of the process).

I also decided to measure the system and user time used by the process.

$ ./bigfile
Lines=24812608
Wallclock time for mmap is 1.98 (user:1.83 system: 0.14)
Lines=24812608
Wallclock time for getline is 2.07 (user:1.68 system: 0.389)
Lines=24812608
Wallclock time for readwhole is 2.52 (user:1.79 system: 0.723)
$ ./bigfile
Lines=24812608
Wallclock time for mmap is 1.96 (user:1.83 system: 0.12)
Lines=24812608
Wallclock time for getline is 2.07 (user:1.67 system: 0.392)
Lines=24812608
Wallclock time for readwhole is 2.48 (user:1.76 system: 0.707)

Here are the three different functions to read the file (there's also some code to measure time and so on, of course, but to reduce the size of this post I chose not to post all of that - and I played around with the ordering to see if that made any difference, so the results above are not in the same order as the functions here).
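
They are pasted out of a larger program, so presumably (an assumption - the post doesn't show them) they need roughly the following headers and declarations to compile on a POSIX system:

#include <iostream>
#include <fstream>
#include <sstream>
#include <string>
#include <cstdlib>

#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

using namespace std;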

void func_readwhole(const char *name)
{
    string fullname = string("bigfile_") + name;
    ifstream f(fullname.c_str());

    if (!f) 
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }

    f.seekg(0, ios::end);
    streampos size = f.tellg();

    f.seekg(0, ios::beg);

    char* buffer = new char[size];
    f.read(buffer, size);
    if (f.gcount() != size)
    {
        cerr << "Read failed ...\n";
        exit(1);
    }

    stringstream ss;
    ss.rdbuf()->pubsetbuf(buffer, size);

    int lines = 0;
    string str;
    while(getline(ss, str))
    {
        lines++;
    }

    f.close();


    cout << "Lines=" << lines << endl;

    delete [] buffer;
}

void func_getline(const char *name)
{
    string fullname = string("bigfile_") + name;
    ifstream f(fullname.c_str());

    if (!f) 
    {
        cerr << "could not open file for " << fullname << endl;
        exit(1);
    }

    string str;
    int lines = 0;

    while(getline(f, str))
    {
        lines++;
    }

    cout << "Lines=" << lines << endl;

    f.close();
}

void func_mmap(const char *name)
{
    char *buffer;

    string fullname = string("bigfile_") + name;
    int f = open(fullname.c_str(), O_RDONLY);

    off_t size = lseek(f, 0, SEEK_END);

    lseek(f, 0, SEEK_SET);

    buffer = (char *)mmap(NULL, size, PROT_READ, MAP_PRIVATE, f, 0);


    stringstream ss;
    ss.rdbuf()->pubsetbuf(buffer, size);

    int lines = 0;
    string str;
    while(getline(ss, str))
    {
        lines++;
    }

    munmap(buffer, size);
    close(f);
    cout << "Lines=" << lines << endl;
}
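
The timing and driver code was omitted from the post. Below is a minimal sketch of how the wall-clock, user and system times could have been collected on POSIX - this is a guess at the method and a hypothetical driver, not the author's actual code, and it assumes it lives in the same file as the three functions above (test-file names are assumed too):

#include <sys/resource.h>   // getrusage
#include <sys/time.h>       // gettimeofday

struct Times { double wall, user, sys; };

// Snapshot wall-clock time plus the process's accumulated user/system CPU time.
static Times snapshot()
{
    timeval tv;
    gettimeofday(&tv, nullptr);
    rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return { tv.tv_sec + tv.tv_usec / 1e6,
             ru.ru_utime.tv_sec + ru.ru_utime.tv_usec / 1e6,
             ru.ru_stime.tv_sec + ru.ru_stime.tv_usec / 1e6 };
}

// Hypothetical driver: the real one, its ordering and the test-file names were not posted.
int main()
{
    Times a = snapshot();
    func_getline("getline");          // assumed to read a file named "bigfile_getline"
    Times b = snapshot();
    cout << "Wallclock time for getline is " << b.wall - a.wall
         << " (user:" << b.user - a.user << " system: " << b.sys - a.sys << ")" << endl;
}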

#2 (score: 2)

The OS will read a whole block of data (depending on how the disk is formatted, typically 4-8k at a time) and do some of the buffering for you. Let the OS take care of it for you, and read the data in the way that makes sense for your program.
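
If you do want the library to pull larger chunks per underlying read, one simple knob is to hand the stream a bigger buffer and still consume it line by line. This is only a sketch - whether a filebuf honours a user-supplied buffer is implementation-defined, and the file name here is made up:

#include <fstream>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

int main()
{
    vector<char> buf(1 << 20);                     // ask for a 1 MB stream buffer
    ifstream f;
    f.rdbuf()->pubsetbuf(buf.data(), buf.size());  // must be done before open() to have any effect
    f.open("input.txt");                           // hypothetical file name

    string line;
    long lines = 0;
    while (getline(f, line))
        lines++;

    cout << "Lines=" << lines << endl;
}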

#3 (score: 1)

The fstreams are buffered reasonably. The underlying accesses to the hard disk by the OS are buffered reasonably. The hard disk itself has a reasonable buffer. You most surely will not trigger more hard disk accesses if you read the file line by line. Or character by character, for that matter.

So there is no reason to load the whole file into a big buffer and work on that buffer, because it already is in a buffer. And there often is no reason to buffer one line at a time, either. Why allocate memory to buffer something in a string that is already buffered in the ifstream? If you can, work on the stream directly; don't bother tossing everything around twice or more from one buffer to the next. Unless doing so helps readability, and/or your profiler told you that disc access is slowing your program down significantly.

#4 (score: 0)

If it's a small file on disk, it's probably more efficient to read the entire file and parse it line by line, rather than reading one line at a time - that would take lots of disk accesses.

#5 (score: 0)

I believe the C++ idiom would be to read the file line-by-line, and create a line-based container as you read the file. Most likely the iostreams (getline) will be buffered enough that you won't notice a significant difference.
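
A minimal sketch of that idiom - building a line-based container as the file is read (the file name is hypothetical):

#include <fstream>
#include <string>
#include <vector>

using namespace std;

int main()
{
    ifstream f("input.txt");        // hypothetical file name
    vector<string> lines;           // line-based container built while reading

    string line;
    while (getline(f, line))
        lines.push_back(line);

    // ... process the lines afterwards ...
}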

However, for very large files you may get better performance by reading larger chunks of the file (not the whole file at once) and splitting internally as newlines are found.
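
A rough sketch of that chunked approach (the chunk size and file name are arbitrary; the carry string handles lines that straddle a chunk boundary):

#include <cstddef>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>

using namespace std;

int main()
{
    ifstream f("input.txt", ios::binary);   // hypothetical file name
    vector<char> chunk(1 << 20);            // read roughly 1 MB at a time
    string carry;                           // unfinished line carried over from the previous chunk
    long lines = 0;

    while (f)
    {
        f.read(chunk.data(), chunk.size());
        size_t got = static_cast<size_t>(f.gcount());

        size_t start = 0;
        for (size_t i = 0; i < got; i++)
        {
            if (chunk[i] == '\n')
            {
                string line = carry + string(chunk.data() + start, i - start);
                carry.clear();
                // ... process line here ...
                lines++;
                start = i + 1;
            }
        }
        carry.append(chunk.data() + start, got - start);
    }
    if (!carry.empty())
        lines++;                            // last line had no trailing newline

    cout << "Lines=" << lines << endl;
}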

If you want to know specifically which method is faster and by how much, you'll have to profile your code.

#6 (score: 0)

It's better to fetch all the data if it can be accommodated in memory, because whenever you request I/O your programme loses the processor and is put into a wait queue.

However, if the file is big then it's better to read only as much data at a time as is required for processing, because a bigger read operation takes much longer to complete than small ones, and the CPU's process-switching time is much smaller than the time to read the entire file.
