Why does Java read a large file faster than C++?

Date: 2022-04-02 01:16:03

I have a 2 GB file (inputfile.txt) in which every line is a single word, like this:

apple
red
beautiful
smell
spark
input

I need to write a program that reads every word in the file and prints the word count. I wrote it in both Java and C++, and the result is surprising: Java runs 2.3 times faster than C++. My code is as follows:

C++:

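#include <time.h>     // clock_gettime (POSIX)
#include <cstdio>
#include <fstream>
#include <iostream>
#include <string>
using namespace std;

// Nanoseconds per second. NANO was not defined in the posted snippet;
// this value is assumed from how it is used below.
const double NANO = 1e9;

int main() {
    struct timespec ts, te;
    double cost;
    clock_gettime(CLOCK_REALTIME, &ts);

    ifstream fin("inputfile.txt");
    string word;
    int count = 0;
    // operator>> skips leading whitespace and extracts one whitespace-delimited token.
    while(fin >> word) {
        count++;
    }
    cout << count << endl;

    clock_gettime(CLOCK_REALTIME, &te);
    cost = te.tv_sec - ts.tv_sec + (double)(te.tv_nsec-ts.tv_nsec)/NANO;
    printf("Run time: %-15.10f s\n", cost);

    return 0;
}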

Output:

5e+08
Run time: 69.311 s

Java:

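import java.io.BufferedReader;
import java.io.FileReader;

// The class name is a placeholder; the posted snippet showed only main().
public class WordCount {
    public static void main(String[] args) throws Exception {

        long startTime = System.currentTimeMillis();

        int count = 0;
        // try-with-resources closes the reader even if readLine() throws.
        try (BufferedReader br = new BufferedReader(new FileReader("inputfile.txt"))) {
            // readLine() returns one line without its terminator, or null at end of file.
            while (br.readLine() != null) {
                count++;
            }
        }
        System.out.println(count);

        long endTime = System.currentTimeMillis();
        // Integer division: the elapsed time is truncated to whole seconds.
        System.out.println("Run time : " + (endTime - startTime)/1000 + "s");
    }
}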

Output:

5.0E8
Run time: 29 s

Why is Java faster than C++ in this situation, and how do I improve the performance of C++?

5 Answers

#1


64  

You aren't comparing the same thing. The Java program reads lines, splitting on the newline, while the C++ program reads whitespace-delimited "words", which is a little extra work.

Try istream::getline.
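To make the suggestion concrete, here is a minimal sketch of a line-based count. It assumes the free function std::getline with a std::string rather than the member istream::getline with a fixed char buffer, simply to avoid choosing a maximum line length; the function name count_lines_getline is made up for this illustration.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Counts lines one at a time, which is what the Java readLine() loop does.
// Returns -1 if the file cannot be opened.
std::int64_t count_lines_getline(const std::string& path) {
    std::ifstream fin(path);
    if (!fin) return -1;

    std::string line;
    std::int64_t count = 0;
    // std::getline only looks for '\n'; it does not classify whitespace.
    while (std::getline(fin, line)) {
        ++count;
    }
    return count;
}
```

Calling count_lines_getline("inputfile.txt") in place of the >> loop should give a like-for-like comparison with the Java program. The 64-bit counter is just a precaution for very large files.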

Later

You might also try an elementary read operation that fills a byte array, and then scan that array for newlines.

Even later

On my old Linux notebook, jdk1.7.0_21 and don't-tell-me-it's-old 4.3.3 take about the same time when the C++ version uses getline. (We have established that reading words is slower.) There isn't much difference between -O0 and -O2, which doesn't surprise me, given the simplicity of the code in the loop.

Last note: As I suggested, fin.read(buffer,LEN) with LEN = 1MB and using memchr to scan for '\n' results in another speed improvement of about 20%, which makes C (there isn't any C++ left by now) faster than Java.
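The read-plus-memchr approach could look roughly like the sketch below. The function name count_newlines_chunked and the chunk_size parameter are placeholders for this illustration; the 1 MB default follows the LEN value mentioned above. Note that it counts '\n' bytes, so a final line without a trailing newline is not counted.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <fstream>
#include <string>
#include <vector>

// Counts '\n' bytes by reading fixed-size chunks and scanning each with memchr.
// Returns -1 if the file cannot be opened.
std::int64_t count_newlines_chunked(const std::string& path,
                                    std::size_t chunk_size = 1 << 20) {
    std::ifstream fin(path, std::ios::binary);
    if (!fin) return -1;

    std::vector<char> buffer(chunk_size);
    std::int64_t count = 0;
    while (true) {
        fin.read(buffer.data(), static_cast<std::streamsize>(buffer.size()));
        const std::size_t got = static_cast<std::size_t>(fin.gcount());
        if (got == 0) break;  // nothing was read: end of file or an error

        const char* p = buffer.data();
        const char* end = p + got;
        while (p < end) {
            const void* hit = std::memchr(p, '\n', static_cast<std::size_t>(end - p));
            if (hit == nullptr) break;
            ++count;
            p = static_cast<const char*>(hit) + 1;  // continue after the newline
        }
    }
    return count;
}
```

Because only single bytes are counted, a newline never needs to be stitched together across a chunk boundary, which keeps the loop simple.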

#2


7  

There are a number of significant differences in the way the two languages handle I/O, any of which can make a difference, one way or another.

Perhaps the first (and most important) question is: how is the data encoded in the text file? If it is single-byte characters (ISO 8859-1 or UTF-8), then Java has to convert it into UTF-16 before processing; depending on the locale, C++ may (or may not) also convert or do some additional checking.

As has been pointed out (partially, at least), in C++, >> uses a locale-specific isspace, while getline simply compares against '\n', which is probably faster. (Typical implementations of isspace use a bitmap, which means an additional memory access for each character.)

Optimization levels and specific library implementations may also vary. It's not unusual in C++ for one library implementation to be 2 or 3 times faster than another.

Finally, one more significant difference: C++ distinguishes between text files and binary files. You've opened the file in text mode; this means that it will be "preprocessed" at the lowest level, before even the extraction operators see it. This depends on the platform: on Unix platforms, the "preprocessing" is a no-op; on Windows, it converts CRLF pairs into '\n', which will have a definite impact on performance. If I recall correctly (I've not used Java for some years), Java expects higher-level functions to handle this, so functions like readLine will be slightly more complicated. Just guessing here, but I suspect that the additional logic at the higher level costs less at runtime than the buffer preprocessing at the lower level. (If you are testing under Windows, you might experiment with opening the file in binary mode in C++. This should make no difference in the behavior of the program when you use >>; any extra CR will be considered white space. With getline, you'll have to add logic to your code to remove any trailing '\r'.)
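As a rough illustration of the binary-mode experiment, the sketch below opens the file with std::ios::binary and strips a trailing '\r' after each getline. The helper names strip_trailing_cr and count_words_binary are invented for this example, and skipping empty lines is an assumption about what should count as a word.

```cpp
#include <cstdint>
#include <fstream>
#include <string>

// Removes one trailing '\r', which is what is left of a CRLF line ending
// when the stream does no text-mode translation.
void strip_trailing_cr(std::string& line) {
    if (!line.empty() && line.back() == '\r') {
        line.pop_back();
    }
}

// Counts non-empty lines of a file opened in binary mode.
// Returns -1 if the file cannot be opened.
std::int64_t count_words_binary(const std::string& path) {
    std::ifstream fin(path, std::ios::binary);
    if (!fin) return -1;

    std::string line;
    std::int64_t count = 0;
    while (std::getline(fin, line)) {
        strip_trailing_cr(line);
        if (!line.empty()) {
            ++count;  // assumption: a blank line is not a word
        }
    }
    return count;
}
```

Without the strip step, a line that is blank in a CRLF file would arrive as the one-character string "\r" and be counted as a word.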

#3


5  

I would suspect that the main difference is that java.io.BufferedReader performs better than std::ifstream because it buffers, while the ifstream does not. The BufferedReader reads large chunks of the file in advance and hands them to your program from RAM when you call readLine(), while the std::ifstream only reads a few bytes at a time when you prompt it to by calling the >> operator.

Sequential access of large amounts of data from the hard drive is usually much faster than accessing many small chunks one at a time.

A fairer comparison would be to compare std::ifstream to the unbuffered java.io.FileReader.

#4


4  

I am not an expert in C++, but at least the following factors affect performance:

  1. OS-level caching of the file.
  2. For Java, you are using a buffered reader, and the buffer size defaults to a page or something. I am not sure how C++ streams do this.
  3. Since the file is so big, the JIT would probably kick in, and it probably compiles the Java bytecode better than a C++ compiler with no optimization turned on.

Since I/O is the major cost here, I guess 1 and 2 are the major reasons.
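On the buffer-size point, one way to experiment on the C++ side is to hand the stream a larger buffer through pubsetbuf. The sketch below is only that, an experiment: the effect of this call is implementation-defined, and calling it before open() is what libstdc++ is known to need. The name count_lines_big_buffer and the 1 MB default are assumptions for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <string>
#include <vector>

// Counts lines using a caller-sized stream buffer.
// Returns -1 if the file cannot be opened.
std::int64_t count_lines_big_buffer(const std::string& path,
                                    std::size_t buffer_size = 1 << 20) {
    // Declared before the stream so that it outlives the stream.
    std::vector<char> buffer(buffer_size);

    std::ifstream fin;
    // Implementation-defined: the library may ignore the request.
    // It is made before open() because some libraries only honor it then.
    fin.rdbuf()->pubsetbuf(buffer.data(),
                           static_cast<std::streamsize>(buffer.size()));
    fin.open(path);
    if (!fin) return -1;

    std::string line;
    std::int64_t count = 0;
    while (std::getline(fin, line)) {
        ++count;
    }
    return count;
}
```

Timing this against the default-buffer version on the same machine is the only way to tell whether the larger buffer matters for a given library.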

#5


2  

I would also try using mmap instead of standard file reads/writes. This lets your OS handle the reading and writing while your application is only concerned with the data.

There's no situation where C++ can't be faster than Java, but sometimes it takes a lot of work from very talented people. I don't think this one should be too hard to beat, though, as it is a straightforward task.

The Windows equivalent of mmap is described in File Mapping (MSDN).
