Does Java scale much worse than C# across many cores?

I am testing spawning many threads that run the same function on a 32-core server, in both Java and C#. I run the application with 1000 iterations of the function, batched across 1, 2, 4, 8, 16 or 32 threads using a thread pool.

At 1, 2, 4, 8 and 16 concurrent threads, Java is at least twice as fast as C#. However, as the number of threads increases the gap closes, and by 32 threads C# has nearly the same average run time, while Java occasionally takes 2000 ms (both languages usually run in about 400 ms). Java gets worse, with massive spikes in the time taken per thread iteration.

EDIT: This is Windows Server 2008.

EDIT2: I have changed the code below to use the ExecutorService thread pool. I have also installed Java 7.

I have set the following options on the HotSpot VM:

-XX:+UseConcMarkSweepGC -Xmx 6000

but it still hasn't made things any better. The only difference between the code is that I'm using the thread pool below, and for the C# version we use:

http://www.codeproject.com/Articles/7933/Smart-Thread-Pool

Is there a way to make the Java version more optimised? Perhaps you could explain why I am seeing this massive degradation in performance?

Is there a more efficient Java threadpool?

(Please note, I do not mean by changing the test function)

import java.io.DataOutputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.PrintStream;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadPoolExecutor;

public class PoolDemo {

    static long FastestMemory = 2000000;
    static long SlowestMemory = 0;
    static long TotalTime;
    static int[] FileArray;
    static DataOutputStream outs;
    static FileOutputStream fout;
    static Byte myByte = 0;

  public static void main(String[] args) throws InterruptedException, FileNotFoundException {

        int Iterations = Integer.parseInt(args[0]);
        int ThreadSize = Integer.parseInt(args[1]);

        FileArray = new int[Iterations];
        fout = new FileOutputStream("server_testing.csv");

        // fixed pool, unlimited queue
        ExecutorService service = Executors.newFixedThreadPool(ThreadSize);
        ThreadPoolExecutor executor = (ThreadPoolExecutor) service;

        for(int i = 0; i<Iterations; i++) {
          Task t = new Task(i);
          executor.execute(t);
        }

        for(int j=0; j<FileArray.length; j++){
            new PrintStream(fout).println(FileArray[j] + ",");
        }
      }

  private static class Task implements Runnable {

    private int ID;

    public Task(int index) {
      this.ID = index;
    }

    public void run() {
        long Start = System.currentTimeMillis();

        int Size1 = 100000;
        int Size2 = 2 * Size1;
        int Size3 = Size1;

        byte[] list1 = new byte[Size1];
        byte[] list2 = new byte[Size2];
        byte[] list3 = new byte[Size3];

        for(int i=0; i<Size1; i++){
            list1[i] = myByte;
        }

        for (int i = 0; i < Size2; i=i+2)
        {
            list2[i] = myByte;
        }

        for (int i = 0; i < Size3; i++)
        {
            byte temp = list1[i];
            byte temp2 = list2[i];
            list3[i] = temp;
            list2[i] = temp;
            list1[i] = temp2;
        }

        long Finish = System.currentTimeMillis();
        long Duration = Finish - Start;
        TotalTime += Duration;
        FileArray[this.ID] = (int)Duration;
        System.out.println("Individual Time " + this.ID + " \t: " + (Duration) + " ms");


        if(Duration < FastestMemory){
            FastestMemory = Duration;
        }
        if (Duration > SlowestMemory)
        {
            SlowestMemory = Duration;
        }
    }
  }
}

5 Answers

#1

Summary

Below are the original response, update 1, and update 2. Update 1 talks about dealing with the race conditions around the test statistic variables by using concurrency structures. Update 2 is a much simpler way of dealing with the race condition issue. Hopefully no more updates from me - sorry for the length of the response but multithreaded programming is complicated!

Original Response

"The only difference between the code is that I'm using the below threadpool"

I would say that is an absolutely huge difference. It's difficult to compare the performance of the two languages when their thread pool implementations are completely different blocks of code, written in user space. The thread pool implementation could have enormous impact on performance.

You should consider using Java's own built-in thread pools. See ThreadPoolExecutor and the entire java.util.concurrent package of which it is part. The Executors class has convenient static factory methods for pools and is a good higher level interface. All you need is JDK 1.5+, though the newer, the better. The fork/join solutions mentioned by other posters are also part of this package - as mentioned, they require 1.7+.
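
As a rough illustration of the built-in pool, here is a minimal sketch that submits work to a fixed-size pool through Executors, waits for every result, and then shuts the pool down. The class name BuiltInPoolSketch, the method runAll, and the trivial workload are hypothetical stand-ins, not part of the original program.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical demo class; the names and the trivial workload are illustrative only.
class BuiltInPoolSketch {

    // Runs taskCount trivial tasks on a fixed pool and returns the sum of their results.
    static long runAll(int taskCount, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<Long>> tasks = new ArrayList<Callable<Long>>();
            for (int i = 0; i < taskCount; i++) {
                final long id = i;
                tasks.add(new Callable<Long>() {
                    public Long call() {
                        return id;   // stand-in for the real work
                    }
                });
            }
            long sum = 0;
            // invokeAll blocks until every submitted task has completed.
            for (Future<Long> f : pool.invokeAll(tasks)) {
                sum += f.get();
            }
            return sum;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdown();   // lets the worker threads exit so the JVM can terminate
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(1000, 4));
    }
}
```

Using Callable and Future here is one design choice among several; it keeps each result private to its task until it is collected, so no shared counters are needed.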

Update 1 - Addressing race conditions by using concurrency structures

You have race conditions around the setting of FastestMemory, SlowestMemory, and TotalTime. For the first two, you do the < and > test and then the assignment as separate steps. This is not atomic; another thread can certainly update these values between the test and the assignment. The += on TotalTime is also non-atomic: it is a read, an add, and a write in disguise.

Here are some suggested fixes.

TotalTime

The goal here is a threadsafe, atomic += of TotalTime.

// At the top of everything
import java.util.concurrent.atomic.AtomicLong;  

...    

// In PoolDemo
static AtomicLong TotalTime = new AtomicLong();    

...    

// In Task, where you currently do the TotalTime += piece
TotalTime.addAndGet (Duration);

FastestMemory / SlowestMemory

The goal here is testing and updating FastestMemory and SlowestMemory each in an atomic step, so no thread can slip in between the test and update steps to cause a race condition.

Simplest approach:

Protect the testing and setting of the variables using the class itself as a monitor. Every thread must synchronize on the same monitor so that the update is atomic and its result is visible to the other threads (thanks @A.H. for catching this). The class object is a convenient choice because everything is static.

// In Task
synchronized (PoolDemo.class) {
    if (Duration < FastestMemory) {
        FastestMemory = Duration;
    }

    if (Duration > SlowestMemory) {
        SlowestMemory = Duration;
    }
}
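
The same idea can be packaged in a small holder class so that the lock and the fields it guards live together. This is only a sketch under that assumption; TimingStats and its method names are hypothetical and do not appear in the original code.

```java
// Hypothetical helper; names are illustrative and not part of the original code.
class TimingStats {
    private long fastest = Long.MAX_VALUE;
    private long slowest = Long.MIN_VALUE;
    private long total = 0;

    // One lock (this object) guards all three fields, so each update is
    // atomic and visible to any thread that later takes the same lock.
    synchronized void record(long duration) {
        total += duration;
        if (duration < fastest) {
            fastest = duration;
        }
        if (duration > slowest) {
            slowest = duration;
        }
    }

    synchronized long fastest() { return fastest; }
    synchronized long slowest() { return slowest; }
    synchronized long total()   { return total; }

    public static void main(String[] args) throws InterruptedException {
        final TimingStats stats = new TimingStats();
        Thread[] workers = new Thread[8];
        for (int t = 0; t < workers.length; t++) {
            final int base = t * 100;
            workers[t] = new Thread(new Runnable() {
                public void run() {
                    for (int i = 1; i <= 100; i++) {
                        stats.record(base + i);   // made-up durations
                    }
                }
            });
            workers[t].start();
        }
        for (Thread w : workers) {
            w.join();
        }
        System.out.println(stats.fastest() + " " + stats.slowest() + " " + stats.total());
    }
}
```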

Intermediate approach:

You may not like taking the whole class for the monitor, or exposing the monitor by using the class, etc. You could use a separate, dedicated monitor object instead. Visibility is guaranteed as long as every read and write of FastestMemory and SlowestMemory happens while holding that same monitor; marking the fields volatile additionally protects any reads made outside the lock.

// In PoolDemo
static final Object _monitor = new Object();  // dedicated lock object
static volatile long FastestMemory = 2000000;
static volatile long SlowestMemory = 0;

...

// In Task
synchronized (PoolDemo._monitor) {
    if (Duration < FastestMemory) {
        FastestMemory = Duration;
    }

    if (Duration > SlowestMemory) {
        SlowestMemory = Duration;
    }
}

Advanced approach:

Here we use the java.util.concurrent.atomic classes instead of monitors. Under contention this may perform better than the synchronized approach. Try it and see.

// At the top of everything
import java.util.concurrent.atomic.AtomicLong;    

. . . . 

// In PoolDemo
static AtomicLong FastestMemory = new AtomicLong(2000000);
static AtomicLong SlowestMemory = new AtomicLong(0);

. . . . .

// In Task
// Retry only while our value is still an improvement and the CAS lost a race.
long temp = FastestMemory.get();
while (Duration < temp && !FastestMemory.compareAndSet(temp, Duration)) {
    temp = FastestMemory.get();
}

temp = SlowestMemory.get();
while (Duration > temp && !SlowestMemory.compareAndSet(temp, Duration)) {
    temp = SlowestMemory.get();
}
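
If Java 8 or later is available (an assumption; the question itself mentions Java 7), AtomicLong.accumulateAndGet can express the same compare-and-set retry loop in one call. The class name AtomicMinMax and its record method are hypothetical.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch; requires Java 8+ for accumulateAndGet and method references.
class AtomicMinMax {
    static final AtomicLong fastest = new AtomicLong(Long.MAX_VALUE);
    static final AtomicLong slowest = new AtomicLong(Long.MIN_VALUE);

    // Each call atomically replaces the stored value with min/max(stored, duration),
    // retrying internally if another thread updated the value in between.
    static void record(long duration) {
        fastest.accumulateAndGet(duration, Math::min);
        slowest.accumulateAndGet(duration, Math::max);
    }

    public static void main(String[] args) {
        record(400);
        record(2000);
        record(350);
        System.out.println(fastest.get() + " " + slowest.get());
    }
}
```

Note that the two fields are still updated independently, so a reader can briefly observe a new minimum alongside an old maximum; for end-of-run statistics that is usually acceptable.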

Let me know what happens after this. It may not fix your problem, but the race condition around the very variables that track your performance is too dangerous to ignore.

I originally posted this update as a comment but moved it here so that I would have room to show code. This update has been through a few iterations - thanks to A.H. for catching a bug I had in an earlier version. Anything in this update supersedes anything in the comment.

Last but not least, an excellent source covering all this material is Java Concurrency in Practice, the best book on Java concurrency, and one of the best Java books overall.

Update 2 - Addressing race conditions in a much simpler way

I recently noticed that your current code will never terminate unless you add executor.shutdown(). That is, the non-daemon threads living in that pool must be terminated or else the JVM will never exit, even after main returns. This got me thinking: since we have to wait for all tasks to finish anyway, why not compare their durations after they have finished, and thus bypass the concurrent updating of FastestMemory, etc. altogether? This is simpler and could be faster; there's no more locking or CAS overhead, and you are already iterating over FileArray at the end anyway.

The other thing we can take advantage of is that your concurrent updating of FileArray is perfectly safe, since each thread is writing to a separate cell, and since there is no reading of FileArray during the writing of it.

With that, you make the following changes:

// In PoolDemo
// This part is the same, just so you know where we are
for(int i = 0; i<Iterations; i++) {
    Task t = new Task(i);
    executor.execute(t);
}

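// CHANGES BEGIN HERE
// Needs: import java.util.concurrent.TimeUnit;
// shutdown() stops new submissions; awaitTermination() then blocks until all
// tasks finish or the 10-second timeout expires, whichever comes first.
executor.shutdown();
executor.awaitTermination(10, TimeUnit.SECONDS);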

for(int j=0; j<FileArray.length; j++){
    long duration = FileArray[j];
    TotalTime += duration;

    if (duration < FastestMemory) {
        FastestMemory = duration;
    }

    if (duration > SlowestMemory) {
        SlowestMemory = duration;
    }

    new PrintStream(fout).println(FileArray[j] + ",");
}

. . . 

// In Task
// Ending of Task.run() now looks like this
long Finish = System.currentTimeMillis();
long Duration = Finish - Start;
FileArray[this.ID] = (int)Duration;
System.out.println("Individual Time " + this.ID + " \t: " + (Duration) + " ms");
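
Put together, the approach might look like the following self-contained sketch: each task writes only to its own array cell, the main thread waits for the pool to terminate, and only then computes the statistics single-threaded. PostRunStats, collect, summarize, and the placeholder durations are all hypothetical names and values chosen for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; the placeholder "durations" stand in for real measurements.
class PostRunStats {

    // Each task writes only to its own cell, so no two threads touch the same element.
    static int[] collect(int tasks, int threads) {
        final int[] durations = new int[tasks];
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < tasks; i++) {
            final int id = i;
            pool.execute(new Runnable() {
                public void run() {
                    durations[id] = 10 + id;   // placeholder for a measured duration
                }
            });
        }
        pool.shutdown();
        try {
            // Fail loudly instead of silently reading a half-filled array.
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("tasks did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return durations;
    }

    // Single-threaded pass: returns {total, fastest, slowest}.
    static long[] summarize(int[] durations) {
        long total = 0;
        long fastest = Long.MAX_VALUE;
        long slowest = Long.MIN_VALUE;
        for (int d : durations) {
            total += d;
            fastest = Math.min(fastest, d);
            slowest = Math.max(slowest, d);
        }
        return new long[] { total, fastest, slowest };
    }

    public static void main(String[] args) {
        long[] r = summarize(collect(1000, 32));
        System.out.println("total=" + r[0] + " fastest=" + r[1] + " slowest=" + r[2]);
    }
}
```

Checking the return value of awaitTermination is a deliberate choice here: if the timeout expires, the array may still be incomplete, and the statistics would be wrong.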

Give this approach a shot as well.

You should definitely be checking your C# code for similar race conditions.

#2

...but Java occasionally takes 2000ms...

And

    byte[] list1 = new byte[Size1];
    byte[] list2 = new byte[Size2];
    byte[] list3 = new byte[Size3];

The hiccups are probably the garbage collector cleaning up your arrays. If you really want to tune that, I suggest you use some kind of cache for the arrays.
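
One possible form of such a cache, sketched here as an assumption rather than a tested fix, is a per-thread buffer held in a ThreadLocal, so that each pool thread allocates its array once and reuses it across tasks. BufferCache and borrow are hypothetical names, and this changes the test function, which the question asked to avoid, so treat it as a diagnostic experiment.

```java
import java.util.Arrays;

// Hypothetical sketch of a per-thread array cache; names are illustrative.
class BufferCache {
    private static final int SIZE = 100000;

    // Each thread lazily gets its own buffer the first time it calls get().
    private static final ThreadLocal<byte[]> BUFFER = new ThreadLocal<byte[]>() {
        @Override
        protected byte[] initialValue() {
            return new byte[SIZE];
        }
    };

    // Returns the calling thread's buffer, zeroed so it behaves like a fresh array.
    static byte[] borrow() {
        byte[] buf = BUFFER.get();
        Arrays.fill(buf, (byte) 0);
        return buf;
    }

    public static void main(String[] args) {
        byte[] first = borrow();
        byte[] second = borrow();
        System.out.println(first == second);   // same thread, same array instance
    }
}
```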

Edit

This one

   System.out.println("Individual Time " + this.ID + " \t: " + (Duration) + " ms");

does one or more synchronized operations internally, so your highly "concurrent" code will be serialized quite effectively at this point. Just remove it and retest.
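
If the per-task lines are still wanted, one option is to build the report from the recorded durations after all tasks have finished and print it once. The sketch below assumes the durations are already in an array such as FileArray; DeferredReport and buildReport are hypothetical names.

```java
// Hypothetical sketch: format all per-task lines in one pass, outside the timed tasks.
class DeferredReport {

    static String buildReport(int[] durations) {
        StringBuilder sb = new StringBuilder();
        for (int id = 0; id < durations.length; id++) {
            sb.append("Individual Time ").append(id).append(" \t: ")
              .append(durations[id]).append(" ms\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A single print call from one thread, so worker threads never contend on System.out.
        System.out.print(buildReport(new int[] { 12, 15 }));
    }
}
```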

#3

While @sparc_spread's answer is great, another thing I've noticed is this:

I run the application with 1000 iterations of the function

Note that the HotSpot JVM runs a function in interpreted mode for roughly its first 1,500 invocations in client mode, and for roughly its first 10,000 invocations in server mode. Computers with that many cores are automatically considered "servers" by the HotSpot JVM.

That would mean that C# JIT-compiles the function (and runs it as machine code) before Java does, and so has a chance of better performance while the function runs. Try increasing the iterations to 20,000 and start measuring from the 10,000th iteration.
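
A minimal sketch of that measurement scheme: run the workload untimed for a warm-up phase, then time only the later runs. WarmupHarness and measure are hypothetical names, and the 10,000 figure is taken from the text above rather than verified against any particular JVM.

```java
// Hypothetical sketch of a warm-up-then-measure loop; names are illustrative.
class WarmupHarness {

    // Runs work `warmup` times untimed, then returns the nanosecond timings
    // of `measured` further runs.
    static long[] measure(Runnable work, int warmup, int measured) {
        for (int i = 0; i < warmup; i++) {
            work.run();
        }
        long[] nanos = new long[measured];
        for (int i = 0; i < measured; i++) {
            long start = System.nanoTime();
            work.run();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    public static void main(String[] args) {
        Runnable work = new Runnable() {
            public void run() {
                byte[] a = new byte[100000];
                for (int i = 0; i < a.length; i++) {
                    a[i] = (byte) i;
                }
            }
        };
        long[] timings = measure(work, 10000, 10000);
        long sum = 0;
        for (long t : timings) {
            sum += t;
        }
        System.out.println("average ns per run: " + (sum / timings.length));
    }
}
```

System.nanoTime is used instead of System.currentTimeMillis because individual runs may be shorter than the millisecond clock's resolution.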

The rationale here is that the JVM collects profiling data to decide how best to JIT-compile. It trusts that your function is going to be run many times, so it accepts a "slow bootstrapping" phase in exchange for a faster runtime overall. Or in their words, "20% of the functions run 80% of the time", so why JIT them all?

#4

Are you using Java 6? Java 7 comes with features to improve performance in parallel programming:

http://www.oracle.com/technetwork/articles/java/fork-join-422606.html
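
For orientation, a small fork/join sketch using the Java 7 ForkJoinPool and RecursiveTask classes is shown below. It sums an array by splitting it recursively; it is not a port of the benchmark, and SumTask, parallelSum, and the threshold value are hypothetical choices.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Hypothetical fork/join example: recursive array sum. Names and threshold are illustrative.
class SumTask extends RecursiveTask<Long> {
    private static final int THRESHOLD = 10000;   // below this size, sum sequentially

    private final int[] data;
    private final int from;
    private final int to;

    SumTask(int[] data, int from, int to) {
        this.data = data;
        this.from = from;
        this.to = to;
    }

    @Override
    protected Long compute() {
        if (to - from <= THRESHOLD) {
            long sum = 0;
            for (int i = from; i < to; i++) {
                sum += data[i];
            }
            return sum;
        }
        int mid = (from + to) >>> 1;
        SumTask left = new SumTask(data, from, mid);
        SumTask right = new SumTask(data, mid, to);
        left.fork();                        // schedule the left half asynchronously
        long rightSum = right.compute();    // compute the right half in this thread
        return left.join() + rightSum;
    }

    static long parallelSum(int[] data) {
        ForkJoinPool pool = new ForkJoinPool();   // defaults to one worker per available processor
        try {
            return pool.invoke(new SumTask(data, 0, data.length));
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        int[] data = new int[1000000];
        java.util.Arrays.fill(data, 1);
        System.out.println(parallelSum(data));
    }
}
```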

#5

You could also look at the ExecutorService, created using Executors.newFixedThreadPool(noOfCores) or a similar method.
