Performance of a concurrent program degrading as the number of threads increases?

Time: 2021-12-22 06:59:13

I have been running the code below on a quad-core computer. The average running times over 100 iterations, for different numbers of threads in the ExecutorService, are as follows:

1 thread = 78404.95

2 threads = 174995.14

4 threads = 144230.23

But according to what I have studied, 2*(no. of cores) threads should give the optimal result for the program. That is clearly not the case here: bizarrely, my program gives the best time with a single thread.

Code :

import java.util.Collections;
import java.util.Random;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class TestHashSet {

    public static void main(String argv[]){
        Set<Integer> S = Collections.newSetFromMap(new ConcurrentHashMap<Integer,Boolean>());
        S.add(1);
        S.add(2);
        S.add(3);
        S.add(4);
        S.add(5);
        long  startTime = System.nanoTime();
        ExecutorService executor = Executors.newFixedThreadPool(8);
        int Nb = 0;
        for(int i = 0;i<10;i++){
            User runnable = new User(S);
            executor.execute(runnable);

            Nb = Thread.getAllStackTraces().keySet().size();
        }
        executor.shutdown();
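        try {
            executor.awaitTermination(Long.MAX_VALUE, TimeUnit.DAYS);
        } catch (InterruptedException e) {
            // Restore the interrupt flag and report the interruption.
            Thread.currentThread().interrupt();
            e.printStackTrace();
        }
        long endTime = System.nanoTime();
        // Elapsed time in microseconds (nanoseconds * 0.001), then the thread count last sampled in the loop.
        System.out.println(0.001*(endTime-startTime)+" And "+Nb);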
    }
}
class User implements Runnable{
    Set<Integer> S;
    User(Set<Integer> S){
        this.S = S;
    }
    @Override
    public void run() {
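        // Thread-confined set holding 5..14.
        Set<Integer> t = Collections.newSetFromMap(new ConcurrentHashMap<Integer,Boolean>());
        for(int i = 0;i<10;i++){
            t.add(i+5);
        }
        // Remove every element of the shared set that is not in t.
        S.retainAll(t);
        // Thread-confined set holding 0..9; note that it is never used below.
        Set<Integer> t2 = Collections.newSetFromMap(new ConcurrentHashMap<Integer,Boolean>());
        for(int i = 0;i<10;i++){
            t2.add(i);
        }
        // Add all elements of t (not t2) to the shared set.
        S.addAll(t);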
        /*
        ConcurrentHashSet<Integer> D = new ConcurrentHashSet<Integer>();
        for(int i=0;i<10;i++){
            D.add(i+3);
        }
        S.difference(D);
        */
    }
}

Update: If I increase the number of queries per thread to 1000, the 4-threaded run performs better than the single-threaded one. I think the overhead was higher than the run time when I used only about 4 queries per thread; as the number of queries increased, the run time became greater than the overhead. Thanks.

1 Answer

#1


But 5 Threads Supposed to increase the performance..?

That's what >>you<< suppose. But in fact, there are no guarantees that adding threads will increase performance.

But according to what I have studied 2*(no of cores) of threads should give optimal result ...

If you read that somewhere, then you either misread it or it is plain wrong.

The reality is that the number of threads for optimal performance is highly dependent on the nature of your application, and also on the hardware you are running on.


Based on a cursory reading of your code, it appears that this is a benchmark to test how well Java deals with multi-threaded access and updates to a shared set (S). Each thread does some operations on a thread-confined set, then uses the entries of that set to either add entries to the shared set or remove entries from it.

The problem is that the addAll and retainAll calls are likely to be concurrency bottlenecks. A set based on ConcurrentHashMap will give better concurrent performance for point access / update to the set than one based on HashMap. However, addAll and retainAll perform N such operations, on the same entries that the other threads are operating on. Given the nature of this pattern of operations, you are likely to get significant contention within the different regions of the ConcurrentHashMap. That is likely to lead to one thread blocking another ... and a slowdown.

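To make the "N such operations" point concrete, here is a minimal sketch that counts how many point updates a single addAll triggers on the backing map. The names BulkOpCount, CountingMap and putsForAddAll are hypothetical and not part of your code. It relies on my understanding that the set returned by Collections.newSetFromMap calls add (and therefore put on the backing map) once per element.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class BulkOpCount {
    // Hypothetical helper: a ConcurrentHashMap that counts calls to put().
    static class CountingMap extends ConcurrentHashMap<Integer, Boolean> {
        private static final long serialVersionUID = 1L;
        final AtomicInteger puts = new AtomicInteger();

        @Override
        public Boolean put(Integer key, Boolean value) {
            puts.incrementAndGet();
            return super.put(key, value);
        }
    }

    // Returns the number of put() calls caused by one addAll of n elements.
    static int putsForAddAll(int n) {
        CountingMap map = new CountingMap();
        Set<Integer> shared = Collections.newSetFromMap(map);
        Integer[] values = new Integer[n];
        for (int i = 0; i < n; i++) {
            values[i] = i + 5; // same values as the set t in the question
        }
        shared.addAll(Arrays.asList(values));
        return map.puts.get();
    }

    public static void main(String[] args) {
        System.out.println("put() calls for one addAll of 10 elements: " + putsForAddAll(10));
    }
}
```

Each of those point updates is a separate chance to collide with another thread that is updating the same entries.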

Update : If I increase no of queries per thread, 4-threaded is performing better than single-threaded. I think overhead has been higher than run-time when I used only about 4 queries per thread, and as no of queries increased, run-time is now greater than overhead.

I assume that you mean that you are increasing the number of hash map entries. This is likely to reduce the average contention, given the way that ConcurrentHashMap works. (The class divides the map into regions, and arranges that operations involving entries in different regions incur the minimum possible contention overheads. By increasing the number of distinct entries, you are reducing the probability that two simultaneous operations will lead to contention.)


So, returning to the "2 x no of cores" factoid.

I suspect that the sources you have been reading don't actually say that this gives you optimal performance. I suspect that they really say that:

  • "2 x no of cores" is a good starting point ... and you need to tune it for your application / problem / hardware, and/or

  • don't go above "2 x no of cores" for a compute intensive task ... because it is unlikely to help.

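As a rough illustration of treating the rule of thumb as a starting point rather than an answer, the sketch below derives an initial pool size from the core count that the JVM reports. PoolSizing and startingPoolSize are hypothetical names, and the factor of 2 is just the value under discussion, not a recommendation.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class PoolSizing {
    // Hypothetical helper: factor x (cores reported by the JVM), never below 1.
    static int startingPoolSize(int factor) {
        int cores = Runtime.getRuntime().availableProcessors();
        return Math.max(1, factor * cores);
    }

    public static void main(String[] args) {
        int threads = startingPoolSize(2);
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        System.out.println("Starting point: " + threads + " threads; tune from here by measuring");
        executor.shutdown();
    }
}
```

The value you finally settle on should come from measuring your actual workload with several pool sizes around this starting point.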

In your example, it is most likely that the main source of the contention is in the updates to the shared set / map ... and the overheads of ensuring that they happen atomically.

You can also get contention at a lower level; i.e. contention for memory bandwidth (RAM read/write) and memory cache contention. Whether that happens will depend on the specs of the hardware you are running on ...


The final thing to note is that your benchmark is flawed in that it does not allow for various VM warmup effects ... such as JIT compilation. The fact that your 2 thread times are more than double the 1 thread times points to that issue.

There are other questionable aspects about your benchmarking:

  • The amount of work done by the run() method is too small.

  • This benchmark does not appear to be representative of a real-world use-case. Measuring speed-up in a totally fictitious (nonsense) algorithm is not going to give you any clues about how a real algorithm is likely to perform when you scale the thread count.

  • Running the tests on a 4 core machine means that you probably wouldn't have enough data points to draw scientifically meaningful conclusions ... assuming that the benchmark was sound.

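For what it is worth, here is a rough sketch of how the warm-up and "too little work" problems could be reduced. It runs the same retainAll / addAll pattern many times per task, discards a number of warm-up runs, and only then averages the timed runs. The class name WarmedUpBenchmark, the constants, and the run counts are all assumptions of mine, chosen only for illustration. A dedicated harness such as JMH handles these effects more rigorously.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class WarmedUpBenchmark {
    static final int TASKS = 8;            // assumed number of tasks per run
    static final int OPS_PER_TASK = 1_000; // assumed retainAll/addAll pairs per task

    // One run: TASKS tasks on a fixed pool, all updating one shared set.
    // Returns the elapsed wall-clock time in nanoseconds.
    static long runOnce(int threads) {
        Set<Integer> shared = Collections.newSetFromMap(new ConcurrentHashMap<Integer, Boolean>());
        for (int i = 1; i <= 5; i++) {
            shared.add(i);
        }
        ExecutorService executor = Executors.newFixedThreadPool(threads);
        long start = System.nanoTime();
        for (int t = 0; t < TASKS; t++) {
            executor.execute(() -> {
                Set<Integer> local = new HashSet<>(); // thread-confined set holding 5..14
                for (int i = 0; i < 10; i++) {
                    local.add(i + 5);
                }
                for (int i = 0; i < OPS_PER_TASK; i++) {
                    shared.retainAll(local);
                    shared.addAll(local);
                }
            });
        }
        executor.shutdown();
        try {
            executor.awaitTermination(1, TimeUnit.HOURS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("benchmark interrupted", e);
        }
        return System.nanoTime() - start;
    }

    // Discards warmupRuns runs, then returns the average of measuredRuns runs in microseconds.
    static double averageMicros(int threads, int warmupRuns, int measuredRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            runOnce(threads);
        }
        long total = 0;
        for (int i = 0; i < measuredRuns; i++) {
            total += runOnce(threads);
        }
        return total / 1_000.0 / measuredRuns;
    }

    public static void main(String[] args) {
        for (int threads : new int[] {1, 2, 4}) {
            System.out.printf("%d thread(s): %.2f us%n", threads, averageMicros(threads, 20, 50));
        }
    }
}
```

Even with warm-up, each timed run still includes the cost of starting the pool's threads, so the numbers are only meaningful relative to one another.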


Having said that, the 2 to 4 thread slowdown that you seem to be seeing is not unexpected ... to me.
