Java 8 Streams: Why is the parallel stream slower?

Date: 2023-02-06 19:03:52

I am playing with Java 8's streams and cannot understand the performance results I am getting.

I have a 2-core CPU (Intel i7-3520M), Windows 8 x64, and 64-bit Java 8 update 5. I am doing a simple map over a stream/parallel stream of Strings and found that the parallel version is somewhat slower.

When I run this code:

String[] array = new String[1000000];
Arrays.fill(array, "AbabagalamagA");

Stream<String> stream = Arrays.stream(array);

long time1 = System.nanoTime();

List<String> list = stream.map((x) -> x.toLowerCase()).collect(Collectors.toList());

long time2 = System.nanoTime();

System.out.println((time2 - time1) / 1000000f);

... I am getting a result of somewhere around 600. This version, which uses a parallel stream:

String[] array = new String[1000000];
Arrays.fill(array, "AbabagalamagA");

Stream<String> stream = Arrays.stream(array).parallel();

long time1 = System.nanoTime();

List<String> list = stream.map((x) -> x.toLowerCase()).collect(Collectors.toList());

long time2 = System.nanoTime();

System.out.println((time2 - time1) / 1000000f);

... gives me something around 900.

Shouldn't the parallel version be faster, considering the fact that I have 2 CPU cores? Could someone give me a hint as to why the parallel version is slower?

3 Answers

#1 (98 votes)

There are several issues going on here in parallel, as it were.

The first is that solving a problem in parallel always involves performing more actual work than doing it sequentially. Overhead is involved in splitting the work among several threads and joining or merging the results. Problems like converting short strings to lower-case are small enough that they are in danger of being swamped by the parallel splitting overhead.

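
To see that splitting overhead in isolation, here is a minimal sketch of mine (not code from the original post; class and method names are my own) that compares a sequential and a parallel sum of very cheap per-element work:

```java
import java.util.stream.IntStream;

public class ParallelOverheadDemo {

    // Summing small ints is very cheap per element, so the cost of
    // splitting the range across threads and merging the partial sums
    // can rival the work itself.
    static long sumSequential(int n) {
        return IntStream.range(0, n).asLongStream().sum();
    }

    static long sumParallel(int n) {
        return IntStream.range(0, n).asLongStream().parallel().sum();
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        long t0 = System.nanoTime();
        long a = sumSequential(n);
        long t1 = System.nanoTime();
        long b = sumParallel(n);
        long t2 = System.nanoTime();
        // Both versions compute the same sum; the timings vary by machine,
        // and, as this answer goes on to explain, a one-shot measurement
        // like this is not a trustworthy benchmark in itself.
        System.out.printf("sequential %.1f ms, parallel %.1f ms, same result: %b%n",
                (t1 - t0) / 1e6, (t2 - t1) / 1e6, a == b);
    }
}
```

On small or cheap workloads the parallel timing here often fails to beat the sequential one, which is the point.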
The second issue is that benchmarking Java programs is very subtle, and it is very easy to get confusing results. Two common issues are JIT compilation and dead-code elimination. Short benchmarks often finish before or during JIT compilation, so they're not measuring peak throughput; indeed, they might be measuring the JIT itself. When compilation occurs is somewhat non-deterministic, so it may cause results to vary wildly as well.

For small, synthetic benchmarks, the workload often computes results that are thrown away. JIT compilers are quite good at detecting this and eliminating code that doesn't produce results that are used anywhere. This probably isn't happening in this case, but if you tinker around with other synthetic workloads, it can certainly happen. Of course, if the JIT eliminates the benchmark workload, it renders the benchmark useless.

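
As an illustration of that hazard (a sketch of mine, not code from the question): a loop whose result is discarded may be eliminated entirely, whereas returning an accumulated result keeps the computation observable, which is exactly why JMH benchmark methods return their results:

```java
public class DeadCodeDemo {

    // Risky: the computed square roots are never used, so the JIT is
    // free to eliminate the whole loop, and a timing of this method
    // may end up measuring nothing at all.
    static void discardResult(int[] data) {
        for (int x : data) {
            Math.sqrt(x); // value thrown away
        }
    }

    // Safer: accumulate and return the result so it is observably used.
    static double useResult(int[] data) {
        double acc = 0.0;
        for (int x : data) {
            acc += Math.sqrt(x);
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] data = {1, 4, 9};
        System.out.println(useResult(data)); // 1.0 + 2.0 + 3.0 = 6.0
    }
}
```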
I strongly recommend using a well-developed benchmarking framework such as JMH instead of hand-rolling one of your own. JMH has facilities to help avoid common benchmarking pitfalls, including these, and it's pretty easy to set up and run. Here's your benchmark converted to use JMH:

package com.*.questions;

import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

public class SO23170832 {
    @State(Scope.Benchmark)
    public static class BenchmarkState {
        static String[] array;
        static {
            array = new String[1000000];
            Arrays.fill(array, "AbabagalamagA");
        }
    }

    @GenerateMicroBenchmark
    @OutputTimeUnit(TimeUnit.SECONDS)
    public List<String> sequential(BenchmarkState state) {
        return
            Arrays.stream(state.array)
                  .map(x -> x.toLowerCase())
                  .collect(Collectors.toList());
    }

    @GenerateMicroBenchmark
    @OutputTimeUnit(TimeUnit.SECONDS)
    public List<String> parallel(BenchmarkState state) {
        return
            Arrays.stream(state.array)
                  .parallel()
                  .map(x -> x.toLowerCase())
                  .collect(Collectors.toList());
    }
}

I ran this using the command:

java -jar dist/microbenchmarks.jar ".*SO23170832.*" -wi 5 -i 5 -f 1

(The options indicate five warmup iterations, five benchmark iterations, and one forked JVM.) During its run, JMH emits lots of verbose messages, which I've elided. The summary results are as follows.

Benchmark                       Mode   Samples         Mean   Mean error    Units
c.s.q.SO23170832.parallel      thrpt         5        4.600        5.995    ops/s
c.s.q.SO23170832.sequential    thrpt         5        1.500        1.727    ops/s

Note that results are in ops per second, so it looks like the parallel run was about three times faster than the sequential run. But my machine has only two cores. Hmmm. And the mean error per run is actually larger than the mean runtime! WAT? Something fishy is going on here.

This brings us to a third issue. Looking more closely at the workload, we can see that it allocates a new String object for each input, and it also collects the results into a list, which involves lots of reallocation and copying. I'd guess that this will result in a fair amount of garbage collection. We can see this by rerunning the benchmark with GC messages enabled:

java -verbose:gc -jar dist/microbenchmarks.jar ".*SO23170832.*" -wi 5 -i 5 -f 1

This gives results like:

[GC (Allocation Failure)  512K->432K(130560K), 0.0024130 secs]
[GC (Allocation Failure)  944K->520K(131072K), 0.0015740 secs]
[GC (Allocation Failure)  1544K->777K(131072K), 0.0032490 secs]
[GC (Allocation Failure)  1801K->1027K(132096K), 0.0023940 secs]
# Run progress: 0.00% complete, ETA 00:00:20
# VM invoker: /Users/src/jdk/jdk8-b132.jdk/Contents/Home/jre/bin/java
# VM options: -verbose:gc
# Fork: 1 of 1
[GC (Allocation Failure)  512K->424K(130560K), 0.0015460 secs]
[GC (Allocation Failure)  933K->552K(131072K), 0.0014050 secs]
[GC (Allocation Failure)  1576K->850K(131072K), 0.0023050 secs]
[GC (Allocation Failure)  3075K->1561K(132096K), 0.0045140 secs]
[GC (Allocation Failure)  1874K->1059K(132096K), 0.0062330 secs]
# Warmup: 5 iterations, 1 s each
# Measurement: 5 iterations, 1 s each
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.*.questions.SO23170832.parallel
# Warmup Iteration   1: [GC (Allocation Failure)  7014K->5445K(132096K), 0.0184680 secs]
[GC (Allocation Failure)  7493K->6346K(135168K), 0.0068380 secs]
[GC (Allocation Failure)  10442K->8663K(135168K), 0.0155600 secs]
[GC (Allocation Failure)  12759K->11051K(139776K), 0.0148190 secs]
[GC (Allocation Failure)  18219K->15067K(140800K), 0.0241780 secs]
[GC (Allocation Failure)  22167K->19214K(145920K), 0.0208510 secs]
[GC (Allocation Failure)  29454K->25065K(147456K), 0.0333080 secs]
[GC (Allocation Failure)  35305K->30729K(153600K), 0.0376610 secs]
[GC (Allocation Failure)  46089K->39406K(154624K), 0.0406060 secs]
[GC (Allocation Failure)  54766K->48299K(164352K), 0.0550140 secs]
[GC (Allocation Failure)  71851K->62725K(165376K), 0.0612780 secs]
[GC (Allocation Failure)  86277K->74864K(184320K), 0.0649210 secs]
[GC (Allocation Failure)  111216K->94203K(185856K), 0.0875710 secs]
[GC (Allocation Failure)  130555K->114932K(199680K), 0.1030540 secs]
[GC (Allocation Failure)  162548K->141952K(203264K), 0.1315720 secs]
[Full GC (Ergonomics)  141952K->59696K(159232K), 0.5150890 secs]
[GC (Allocation Failure)  105613K->85547K(184832K), 0.0738530 secs]
1.183 ops/s

Note: the lines beginning with # are normal JMH output lines. All the rest are GC messages. This is just the first of the five warmup iterations, which precedes five benchmark iterations. The GC messages continued in the same vein during the rest of the iterations. I think it's safe to say that the measured performance is dominated by GC overhead and that the results reported should not be believed.

At this point it's unclear what to do. This is purely a synthetic workload. It clearly involves very little CPU time doing actual work compared to allocation and copying. It's hard to say what you really are trying to measure here. One approach would be to come up with a different workload that is in some sense more "real." Another approach would be to change the heap and GC parameters to avoid GC during the benchmark run.

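
For the latter approach, one could pin the heap size so it never needs to grow during the run; the flag values below are illustrative only (pick sizes appropriate for your machine), and `-verbose:gc` is kept so you can confirm whether any collections still occur:

```shell
# Fix the initial and maximum heap at the same (generous) size so the
# benchmark can allocate without triggering collections mid-measurement.
# The 4g figure is an arbitrary example, not a recommendation.
java -Xms4g -Xmx4g -verbose:gc -jar dist/microbenchmarks.jar ".*SO23170832.*" -wi 5 -i 5 -f 1
```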
#2 (14 votes)

When doing benchmarks, you should pay attention to the JIT compiler and be aware that timing behavior can change when the JIT kicks in. If I add a warm-up phase to your test program, the parallel version is a bit faster than the sequential version. Here are the results:

Warmup...
Benchmark...
Run 0:  sequential 0.12s  -  parallel 0.11s
Run 1:  sequential 0.13s  -  parallel 0.08s
Run 2:  sequential 0.15s  -  parallel 0.08s
Run 3:  sequential 0.12s  -  parallel 0.11s
Run 4:  sequential 0.13s  -  parallel 0.08s

Following is the complete source code that I used for this test.

import java.util.Arrays;
import java.util.stream.Collectors;

// Wrapped in a class (name mine) with the imports it needs, so it compiles as-is.
public class StreamWarmupTest {

    public static void main(String... args) {
        String[] array = new String[1000000];
        Arrays.fill(array, "AbabagalamagA");
        System.out.println("Warmup...");
        for (int i = 0; i < 100; ++i) {
            sequential(array);
            parallel(array);
        }
        System.out.println("Benchmark...");
        for (int i = 0; i < 5; ++i) {
            System.out.printf("Run %d:  sequential %s  -  parallel %s\n",
                i,
                test(() -> sequential(array)),
                test(() -> parallel(array)));
        }
    }

    private static void sequential(String[] array) {
        Arrays.stream(array).map(String::toLowerCase).collect(Collectors.toList());
    }

    private static void parallel(String[] array) {
        Arrays.stream(array).parallel().map(String::toLowerCase).collect(Collectors.toList());
    }

    private static String test(Runnable runnable) {
        long start = System.currentTimeMillis();
        runnable.run();
        long elapsed = System.currentTimeMillis() - start;
        return String.format("%4.2fs", elapsed / 1000.0);
    }
}

#3 (8 votes)

Using multiple threads to process your data has some initial setup costs, e.g. initializing the thread pool. These costs may outweigh the gain from using those threads, especially if the runtime is already quite low. Additionally, if there is contention, e.g. other threads running, background processes, etc., the performance of parallel processing can decrease further.

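
Parallel streams run on the common fork/join pool, whose capacity you can inspect; here is a small sketch of mine (class name my own) showing that capacity relative to the machine's core count:

```java
import java.util.concurrent.ForkJoinPool;

public class CommonPoolInfo {

    public static void main(String[] args) {
        // Parallel streams execute on ForkJoinPool.commonPool() by default.
        // Its parallelism is typically (available cores - 1), because the
        // thread that invokes the terminal operation also joins in the work.
        System.out.println("common pool parallelism: "
                + ForkJoinPool.commonPool().getParallelism());
        System.out.println("available processors: "
                + Runtime.getRuntime().availableProcessors());
    }
}
```

On the asker's 2-core machine this leaves very little headroom for parallel speedup, so fixed setup and coordination costs weigh comparatively more.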
This issue is not new for parallel processing. This article gives some details in light of Java 8's parallel() and some more things to consider: http://java.dzone.com/articles/think-twice-using-java-8
