Why are 50 threads faster than 4?

Time: 2021-01-15 01:12:05

#include <windows.h>

// MAX_THREADS (4 or 50 in the two tests) is defined elsewhere in the program.
// `volatile` keeps the compiler from optimising the busy loop away.
DWORD WINAPI MyThreadFunction(LPVOID lpParam) {
    volatile auto x = 1;
    for (auto i = 0; i < 800000000 / MAX_THREADS; ++i) {
        x += i / 3;
    }
    return 0;
}

This function is run in MAX_THREADS threads.
I ran the tests on an Intel Core 2 Duo under Windows 7 with MS Visual Studio 2012, using the Concurrency Visualizer, with MAX_THREADS=4 and with MAX_THREADS=50.
test1 (4 threads) completed in 7.1 seconds, but test2 (50 threads) completed in 5.8 seconds, while test1 had more context switches than test2.
I ran the same tests on an Intel Core i5 under Mac OS 10.7.5 and got the same results.
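
The launcher isn't shown in the post; presumably it looks something like the minimal sketch below (the handle array, the CreateThread arguments, and the wait are assumptions, not the poster's actual code):

#include <windows.h>

#define MAX_THREADS 4   // the two tests use 4 and 50

DWORD WINAPI MyThreadFunction(LPVOID lpParam);  // as defined above

int main() {
    HANDLE threads[MAX_THREADS];
    for (int i = 0; i < MAX_THREADS; ++i) {
        // default stack size, no argument, start immediately, thread id not needed
        threads[i] = CreateThread(NULL, 0, MyThreadFunction, NULL, 0, NULL);
    }
    // The measured time is essentially how long this wait takes: the test is
    // only finished once the slowest of the MAX_THREADS workers is done.
    WaitForMultipleObjects(MAX_THREADS, threads, TRUE, INFINITE);
    for (int i = 0; i < MAX_THREADS; ++i) {
        CloseHandle(threads[i]);
    }
    return 0;
}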

5 Answers

#1 (38 votes)

I decided to benchmark this myself on my 4-core machine. I directly compared 4 threads with 50 threads by interleaving 100 tests of each. I used my own numbers so that I had a reasonable execution time for each task.
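
The harness itself isn't shown here; a minimal sketch of the interleaving idea, using std::thread and std::chrono (the workload figure below is just a placeholder, not the numbers I actually used), would look roughly like this:

#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

// Illustrative workload: a fixed total split evenly across the threads.
static const long long TOTAL_WORK = 800000000LL;   // placeholder figure

static void worker(long long iterations) {
    volatile long long x = 1;
    for (long long i = 0; i < iterations; ++i) x += i / 3;
}

static double run_once(int num_threads) {
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back(worker, TOTAL_WORK / num_threads);
    for (auto &t : threads) t.join();
    return std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
}

int main() {
    // Interleave the two configurations so background load affects both equally.
    for (int rep = 0; rep < 100; ++rep) {
        std::cout << "4,"  << run_once(4)  << "\n";
        std::cout << "50," << run_once(50) << "\n";
    }
}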

The result was as you described. The 50-thread version is marginally faster. Here is a box plot of my results:

[box plot: run times of the 4-thread and 50-thread tests, 100 samples each]

Why? I think this comes down to thread scheduling. The task is not complete until all threads have done their work, and each thread must do a quarter of the job. Because your process shares the system with other processes, if any single thread is switched out in favour of another process, this delays the entire task: while we are waiting for that last thread to finish, all the other cores are idle. Note how the time distribution of the 4-thread test is much wider than that of the 50-thread test, as we might expect.

When you use 50 threads, each thread has less to do. Because of this, any delays in a single thread will have a less significant effect on the total time. When the scheduler is busy rationing cores out to lots of short threads, a delay on one core can be compensated by giving these threads time on another core. The total effect of latency on one core is not as much of a show-stopper.
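
As a back-of-the-envelope illustration (not part of the benchmark): schedule equal chunks greedily on 4 cores and give exactly one chunk an extra delay. With 4 chunks the whole delay lands on the finish time; with 50 chunks the other cores absorb most of it.

#include <algorithm>
#include <iostream>
#include <vector>

// Toy model: `chunks` equal pieces of 1 unit of total work, scheduled greedily
// on 4 cores; chunk 0 suffers an extra delay. Returns the finish time (makespan).
static double makespan(int chunks, double delay) {
    std::vector<double> core(4, 0.0);
    for (int c = 0; c < chunks; ++c) {
        double piece = 1.0 / chunks + (c == 0 ? delay : 0.0);
        // give the piece to the core that frees up first
        *std::min_element(core.begin(), core.end()) += piece;
    }
    return *std::max_element(core.begin(), core.end());
}

int main() {
    // With a 0.1-unit delay: 4 chunks finish at 0.35, 50 chunks at ~0.28.
    std::cout << "4 chunks:  " << makespan(4, 0.1) << "\n";
    std::cout << "50 chunks: " << makespan(50, 0.1) << "\n";
}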

So it would seem that in this case the extra context-switching is not the biggest factor. While the gain is small, it appears to be beneficial to swamp the thread scheduler a little bit, given that the processing is much more significant than the context-switching. As with everything, you must find the correct balance for your application.


[edit] Out of curiosity I ran a test overnight while my computer wasn't doing much else. This time I used 200 samples per test. Again, tests were interleaved to reduce the impact of any localised background tasks.

The first plot of these results is for low thread-counts (up to 3 times the number of cores). You can see how some choices of thread count are quite poor... That is, anything that is not a multiple of the number of cores, and especially odd values.

[box plot: thread counts up to 3 times the number of cores]

The second plot is for higher thread-counts (from 3 times the number of cores up to 60).

[box plot: thread counts from 3 times the number of cores up to 60]

Above, you can see a definite downward trend as the thread-count increases. You can also see the spread of results narrow as the thread-count increases.

In this test, it's interesting to note that the performance of the 4-thread and 50-thread tests was about the same, and the spread of results in the 4-thread test was not as wide as in my original test. Because the computer wasn't doing much else, it could dedicate time to the tests. It would be interesting to repeat the test while placing one core under 75% load.

And just to keep things in perspective, consider this:

[box plot, included for perspective]


[Another edit] After posting my last lot of results, I noticed that the jumbled box plot showed a trend for those tests that were multiples of 4, but the data was a little hard to see.

I decided to do a test with only multiples of four, and thought I may as well find the point of diminishing returns at the same time. So I used thread counts that are powers of 2, up to 1024. I would have gone higher, but Windows bugged out at around 1400 threads.

The result is rather nice, I think. In case you wonder what the little circles are, those are the median values. I chose them instead of the red line that I used previously because they show the trend more clearly.

[box plot: power-of-two thread counts up to 1024, medians marked with small circles]

It seems that in this particular case, the pay dirt lies somewhere between 50 and 150 threads. After that, the benefit quickly drops away, and we're entering the territory of excessive thread management and context-switching.

The results might vary significantly with a longer or shorter task. In this case, it was a task involving a lot of pointless arithmetic which took approximately 18 seconds to compute on a single core.

By tuning only the number of threads, I was able to shave an extra 1.5% to 2% off the median execution time of the 4-thread version.

#2 (3 votes)

It all depends on what your threads are doing.

Your computer can only concurrently run as many threads as there are cores in the system. This includes virtual cores via features like Hyper-threading.
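
If you want to see that limit from code, the standard library will tell you (the value is only a hint and may be 0 if it cannot be determined):

#include <iostream>
#include <thread>

int main() {
    // Number of hardware threads (physical cores, or logical cores with
    // Hyper-threading); 0 means the value could not be determined.
    std::cout << std::thread::hardware_concurrency() << "\n";
}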

CPU-bound

If your threads are CPU-bound (meaning they spend the vast majority of their time doing calculations on data that is in memory), you will see little improvement by increasing the number of threads above the number of cores. You actually lose efficiency with more threads running, because of the added overhead of having to context-switch the threads on and off the CPU cores.

I/O-bound

Where (#threads > #cores) will help, is when your threads are I/O-bound, meaning they spend most of their time waiting on I/O, (hard disk, network, other hardware, etc.) In this case, a thread that is blocked waiting on I/O to complete will be pulled off the CPU, and a thread that is actually ready to do something will be put on instead.
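
A quick way to see this is to replace the arithmetic with a simulated blocking wait. In the sketch below, sleep_for stands in for real I/O, and the wall-clock time stays close to the sleep duration even with far more threads than cores:

#include <chrono>
#include <iostream>
#include <thread>
#include <vector>

int main() {
    const int num_threads = 50;   // far more threads than cores
    auto start = std::chrono::steady_clock::now();

    std::vector<std::thread> threads;
    for (int i = 0; i < num_threads; ++i)
        threads.emplace_back([] {
            // Stand-in for blocking I/O: the thread is off the CPU while it
            // waits, so another thread can run in the meantime.
            std::this_thread::sleep_for(std::chrono::milliseconds(100));
        });
    for (auto &t : threads) t.join();

    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
        std::chrono::steady_clock::now() - start).count();
    std::cout << num_threads << " 'I/O-bound' threads took " << ms << " ms\n";
}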

The way to get highest efficiency is to always keep the CPU busy with a thread that's actually doing something. (Not waiting on something, and not context-switching to other threads.)

#3 (3 votes)

I took some code that I had "laying about" for some other purposes and re-used it - so please beware that it's not "pretty", nor is it supposed to be a good example of how you should do this.

Here's the code I came up with (this is on a Linux system, so I'm using pthreads and I removed the "WINDOWS-isms"):

#include <iostream>
#include <pthread.h>
#include <cstring>
#include <cstdlib>      // strtol

int MAX_THREADS = 4;

void * MyThreadFunction(void *) {
    volatile auto x = 1;
    for (auto i = 0; i < 800000000 / MAX_THREADS; ++i) {
        x += i / 3;
    }
    return 0;
}


using namespace std;

int main(int argc, char **argv)
{
    // Parse "-t <count>" to set the number of threads.
    for (int i = 1; i < argc; i++)
    {
        if (strcmp(argv[i], "-t") == 0 && argc > i + 1)
        {
            i++;
            MAX_THREADS = strtol(argv[i], NULL, 0);
            if (MAX_THREADS == 0)
            {
                cerr << "Hmm, seems like end is not a number..." << endl;
                return 1;
            }
        }
    }
    cout << "Using " << MAX_THREADS << " threads" << endl;

    // Start the workers, then wait for all of them to finish.
    pthread_t *thread_id = new pthread_t[MAX_THREADS];
    for (int i = 0; i < MAX_THREADS; i++)
    {
        int rc = pthread_create(&thread_id[i], NULL, MyThreadFunction, NULL);
        if (rc != 0)
        {
            cerr << "Huh? Pthread couldn't be created. rc=" << rc << endl;
        }
    }
    for (int i = 0; i < MAX_THREADS; i++)
    {
        pthread_join(thread_id[i], NULL);
    }
    delete [] thread_id;
}

Running this with a variety of number of threads:

[MatsP@linuxhost junk]$ g++ -Wall -O3 -o thread_speed thread_speed.cpp -std=c++0x -lpthread
[MatsP@linuxhost junk]$ time ./thread_speed -t 4
Using 4 threads

real    0m0.448s
user    0m1.673s
sys 0m0.004s
[MatsP@linuxhost junk]$ time ./thread_speed -t 50
Using 50 threads

real    0m0.438s
user    0m1.683s
sys 0m0.008s
[MatsP@linuxhost junk]$ time ./thread_speed -t 1
Using 1 threads

real    0m1.666s
user    0m1.658s
sys 0m0.004s
[MatsP@linuxhost junk]$ time ./thread_speed -t 2
Using 2 threads

real    0m0.847s
user    0m1.670s
sys 0m0.004s
[MatsP@linuxhost junk]$ time ./thread_speed -t 50
Using 50 threads

real    0m0.434s
user    0m1.670s
sys 0m0.005s

As you can see, the "user" time stays almost identical. I actually tried a lot of other values too, but the results are the same, so I won't bore y'all with a dozen more runs that show almost the same thing.

This is running on a quad core processor, so you can see that the "more than 4 threads" times show the same "real" time as with "4 threads".

I doubt very much there is anything different in how Windows deals with threads.

I also compiled the code with #define MAX_THREADS 50, and again with 4. It made no difference compared to the code posted - this was just to cover the alternative where the compiler could optimize around a compile-time constant.

By the way, the fact that my code runs some three to ten times faster suggests that the originally posted code was built in debug mode?

#4 (2 votes)

I did some tests a while ago on Windows (Vista 64 Ultimate), on a 4-core/8-thread i7. I used similar 'counting' code, submitted as tasks to a threadpool with varying numbers of threads, but always with the same total amount of work. The threads in the pool were given a low priority so that all the tasks were queued up before the threads, and the timing, started. Obviously, the box was otherwise idle (~1% CPU used up on services etc.).
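
My actual test code isn't included here; a rough C++ sketch of the same setup (a fixed number of workers draining a pre-filled queue of identical counting tasks, with the low-priority trick and the tick counter left out) might look like this:

#include <chrono>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One task: "counting to 10000000".
static void counting_task() {
    volatile long long x = 0;
    for (long long i = 0; i < 10000000; ++i) x += i;
}

// Run `num_tasks` identical tasks on `num_threads` workers; return elapsed ms.
static long long run_pool(int num_threads, int num_tasks) {
    std::queue<int> tasks;
    for (int i = 0; i < num_tasks; ++i) tasks.push(i);
    std::mutex m;

    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (;;) {
                {
                    std::lock_guard<std::mutex> lock(m);
                    if (tasks.empty()) return;
                    tasks.pop();
                }
                counting_task();
            }
        });
    }
    for (auto &w : workers) w.join();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}

int main() {
    for (int threads : {8, 32, 128, 400})
        std::cout << threads << " threads: " << run_pool(threads, 400) << " ms\n";
}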

8 tests,
400 tasks,
counting to 10000000,
using 8 threads:
Ticks: 2199
Ticks: 2184
Ticks: 2215
Ticks: 2153
Ticks: 2200
Ticks: 2215
Ticks: 2200
Ticks: 2230
Average: 2199 ms

8 tests,
400 tasks,
counting to 10000000,
using 32 threads:
Ticks: 2137
Ticks: 2121
Ticks: 2153
Ticks: 2138
Ticks: 2137
Ticks: 2121
Ticks: 2153
Ticks: 2137
Average: 2137 ms

8 tests,
400 tasks,
counting to 10000000,
using 128 threads:
Ticks: 2168
Ticks: 2106
Ticks: 2184
Ticks: 2106
Ticks: 2137
Ticks: 2122
Ticks: 2106
Ticks: 2137
Average: 2133 ms

8 tests,
400 tasks,
counting to 10000000,
using 400 threads:
Ticks: 2137
Ticks: 2153
Ticks: 2059
Ticks: 2153
Ticks: 2168
Ticks: 2122
Ticks: 2168
Ticks: 2138
Average: 2137 ms

With tasks that take a long time, and with very little cache to swap out on a context-change, the number of threads used makes hardly any difference to the overall run time.

#5 (0 votes)

The problem you encounter is tightly bound to the way you subdivide the workload of your process. To make efficient use of a multicore system on a multitasking OS, you must ensure that there is work remaining for all the cores for as long as possible during your process's lifetime.

Consider the situation where your 4-thread process executes on 4 cores and, because of the system load, one of the cores manages to finish its share 50% earlier than the others: for the remaining run time, the CPU can only devote 3/4 of its processing power to your process, since only 3 threads remain. In the same CPU-load scenario, but with many more threads, the workload is split into many more subtasks, which can be distributed more finely between the cores, all other things being equal (*).

This example illustrates that the timing difference is not actually due to the number of threads, but rather to the way the work has been divided, which in the latter case is much more resilient to uneven availability of the cores. The same programme built with only 4 threads, but where the work is abstracted into a series of small tasks that the threads pull as soon as they become free, would certainly produce similar or even better results on average, even though there would be the overhead of managing the task queue.
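
A minimal sketch of that task-queue idea with only 4 threads (the chunk count and workload are illustrative): a shared atomic counter hands out small chunks, so a thread that gets delayed simply ends up doing fewer of them.

#include <atomic>
#include <thread>
#include <vector>

int main() {
    const long long total_iterations = 800000000LL;
    const int num_chunks = 200;                        // many small tasks
    const long long chunk = total_iterations / num_chunks;
    std::atomic<int> next_chunk(0);

    auto worker = [&] {
        // Each thread keeps pulling chunks until the counter is exhausted.
        for (;;) {
            int c = next_chunk.fetch_add(1);
            if (c >= num_chunks) return;
            volatile long long x = 1;
            for (long long i = 0; i < chunk; ++i) x += i / 3;
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < 4; ++t) threads.emplace_back(worker);
    for (auto &t : threads) t.join();
}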

The finer granularity of a process task set gives it better flexibility.


(*) On a highly loaded system, the many-threads approach might not be as beneficial: the unused core would actually be allocated to other OS processes, hence lightening the load on the three other cores still possibly used by your process.
