Multicore + Hyperthreading - how are threads distributed?

Date: 2022-03-13 21:03:24

I was reading a review of the new Intel Atom 330, where they noted that Task Manager shows 4 cores - two physical cores, plus two more simulated by Hyperthreading.

Suppose you have a program with two threads. Suppose also that these are the only threads doing any work on the PC, everything else is idle. What is the probability that the OS will put both threads on the same core? This has huge implications for program throughput.

If the answer is anything other than 0%, are there any mitigation strategies other than creating more threads?

I expect there will be different answers for Windows, Linux, and Mac OS X.


Using sk's answer as Google fodder, then following the links, I found the GetLogicalProcessorInformation function in Windows. It speaks of "logical processors that share resources. An example of this type of resource sharing would be hyperthreading scenarios." This implies that jalf is correct, but it's not quite a definitive answer.
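
For anyone who wants to poke at this themselves, here is a rough, untested sketch of how GetLogicalProcessorInformation could be called from C# via P/Invoke to see which logical processors share a physical core. The struct layout is hand-written to match the documented native layout (the 16-byte union is flattened into two placeholder fields), so treat it as an illustration rather than a definitive implementation.

using System;
using System.Runtime.InteropServices;

class CoreTopology
{
    const int RelationProcessorCore = 0;    // LOGICAL_PROCESSOR_RELATIONSHIP value for a physical-core entry

    [StructLayout(LayoutKind.Sequential)]
    struct SYSTEM_LOGICAL_PROCESSOR_INFORMATION
    {
        public UIntPtr ProcessorMask;       // bitmask of logical processors covered by this entry
        public int Relationship;
        public long Reserved1;              // the native 16-byte union, flattened for simplicity
        public long Reserved2;
    }

    [DllImport("kernel32.dll", SetLastError = true)]
    static extern bool GetLogicalProcessorInformation(IntPtr buffer, ref uint returnLength);

    static void Main()
    {
        uint len = 0;
        GetLogicalProcessorInformation(IntPtr.Zero, ref len);      // first call just reports the required size
        IntPtr buffer = Marshal.AllocHGlobal((int)len);
        try
        {
            if (!GetLogicalProcessorInformation(buffer, ref len))
                throw new System.ComponentModel.Win32Exception();

            int size = Marshal.SizeOf(typeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
            for (int offset = 0; offset + size <= len; offset += size)
            {
                var entry = (SYSTEM_LOGICAL_PROCESSOR_INFORMATION)Marshal.PtrToStructure(
                    new IntPtr(buffer.ToInt64() + offset), typeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION));
                if (entry.Relationship != RelationProcessorCore)
                    continue;

                ulong mask = entry.ProcessorMask.ToUInt64();
                int logical = 0;
                for (ulong m = mask; m != 0; m &= m - 1) logical++;  // count set bits
                // more than one logical processor in a core entry means hyperthreaded siblings
                Console.WriteLine("physical core mask 0x{0:X}: {1} logical processor(s)", mask, logical);
            }
        }
        finally
        {
            Marshal.FreeHGlobal(buffer);
        }
    }
}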

8 Answers

#1


8  

Linux has quite a sophisticated thread scheduler which is HT aware. Some of its strategies include:

Passive load balancing: If a physical CPU is running more than one task, the scheduler will attempt to run any new tasks on a second physical processor.

Active load balancing: If there are 3 tasks, 2 on one physical CPU and 1 on the other, then when the second physical processor goes idle the scheduler will attempt to migrate one of the tasks to it.

It does this while attempting to keep thread affinity, because when a thread migrates to another physical processor it will have to refill all levels of cache from main memory, causing a stall in the task.

So to answer your question (on Linux at least): given 2 threads on a dual-core hyperthreaded machine, each thread will run on its own physical core.
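
If you want to check that claim yourself, a low-tech way is to have each busy thread report which logical CPU it is currently running on via glibc's sched_getcpu(), and then map those logical CPU numbers to physical cores with /proc/cpuinfo. Here is a rough sketch in C# (assuming Mono or .NET on Linux; depending on the runtime you may need to import "libc.so.6" instead of "libc"):

using System;
using System.Collections.Generic;
using System.Runtime.InteropServices;
using System.Threading;

class WhereDoMyThreadsRun
{
    // glibc: returns the logical CPU the calling thread is currently on
    [DllImport("libc", SetLastError = true)]
    static extern int sched_getcpu();

    static void Busy(object name)
    {
        var seen = new HashSet<int>();
        double x = 1.0;
        for (long i = 0; i < 500000000L; i++)       // stay CPU-bound for a while
        {
            x = x * 1.0000001 % 10.0;
            if ((i & 0xFFFFFF) == 0)
                seen.Add(sched_getcpu());           // sample the current logical CPU occasionally
        }
        Console.WriteLine("{0} ran on logical CPUs: {1} (ignore {2})", name, string.Join(",", seen), x);
    }

    static void Main()
    {
        var t1 = new Thread(Busy); t1.Start("thread 1");
        var t2 = new Thread(Busy); t2.Start("thread 2");
        t1.Join();
        t2.Join();
    }
}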

#2


5  

A sane OS will try to schedule computationally intensive tasks on their own cores, but problems arise when you start context switching them. Modern OSes still have a tendency to schedule things onto whichever core has no work at scheduling time, but this can result in processes in parallel applications getting swapped from core to core fairly liberally. For parallel apps, you do not want this, because you lose data the process might've been using in the caches on its core. People use processor affinity to control for this, but on Linux, the semantics of sched_setaffinity() can vary a lot between distros/kernels/vendors, etc.

If you're on Linux, you can portably control processor affinity with the Portable Linux Processor Affinity Library (PLPA). This is what OpenMPI uses internally to make sure processes get scheduled to their own cores in multicore and multisocket systems; they've just spun off the module as a standalone project. OpenMPI is used at Los Alamos among a number of other places, so this is well-tested code. I'm not sure what the equivalent is under Windows.
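
If you would rather make the call directly instead of going through PLPA, the underlying primitive on Linux is sched_setaffinity(2). The sketch below pins the calling thread from C# via P/Invoke; it assumes a glibc-based system with at most 64 logical CPUs, so a single 64-bit mask stands in for cpu_set_t (again, you may need "libc.so.6" instead of "libc" depending on the runtime):

using System;
using System.Runtime.InteropServices;

class PinToCpu
{
    // glibc wrapper: pid == 0 means "the calling thread"; the mask is a plain CPU bitmask here
    [DllImport("libc", SetLastError = true)]
    static extern int sched_setaffinity(int pid, UIntPtr cpusetsize, ref ulong mask);

    static void PinCurrentThreadTo(int logicalCpu)
    {
        ulong mask = 1UL << logicalCpu;
        int rc = sched_setaffinity(0, new UIntPtr((uint)sizeof(ulong)), ref mask);
        if (rc != 0)
            Console.Error.WriteLine("sched_setaffinity failed, errno = {0}", Marshal.GetLastWin32Error());
    }

    static void Main()
    {
        PinCurrentThreadTo(0);      // e.g. keep the main thread on logical CPU 0
        Console.WriteLine("affinity request issued");
    }
}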

#3


5  

I have been looking for some answers on thread scheduling on Windows, and have some empirical information that I'll post here for anyone who may stumble across this post in the future.

I wrote a simple C# program that launches two threads. On my quad core Windows 7 box, I saw some surprising results.

When I did not force affinity, Windows spread the workload of the two threads across all four cores. There are two lines of code that are commented out - one that binds a thread to a CPU, and one that suggests an ideal CPU. The suggestion seemed to have no effect, but setting thread affinity did cause Windows to run each thread on their own core.

To see the results best, compile this code using the freely available compiler csc.exe that comes with the .NET Framework 4.0 client, and run it on a machine with multiple cores. With the processor affinity line commented out, Task Manager showed the threads spread across all four cores, each running at about 50%. With affinity set, the two threads maxed out two cores at 100%, with the other two cores idling (which is what I expected to see before I ran this test).

EDIT: I initially found some differences in performance with these two configurations. However, I haven't been able to reproduce them, so I edited this post to reflect that. I still found the thread affinity interesting since it wasn't what I expected.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Diagnostics;
using System.Runtime.InteropServices;
using System.Threading.Tasks;

class Program
{
    [DllImport("kernel32")]
    static extern int GetCurrentThreadId();

    static void Main(string[] args)
    {
        Task task1 = Task.Factory.StartNew(() => ThreadFunc(1));
        Task task2 = Task.Factory.StartNew(() => ThreadFunc(2));
        Stopwatch time = Stopwatch.StartNew();
        Task.WaitAll(task1, task2);
        Console.WriteLine(time.Elapsed);
    }

    static void ThreadFunc(int cpu)
    {
        int cur = GetCurrentThreadId();
        var me = Process.GetCurrentProcess().Threads.Cast<ProcessThread>().Where(t => t.Id == cur).Single();
        //me.ProcessorAffinity = (IntPtr)cpu;     //using this line of code binds a thread to each core
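        //(note: ProcessorAffinity is a bitmask, so cpu == 1 allows only logical CPU 0 and cpu == 2 allows only logical CPU 1)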
        //me.IdealProcessor = cpu;                //seems to have no effect

        //do some CPU / memory bound work
        List<int> ls = new List<int>();
        ls.Add(10);
        for (int j = 1; j != 30000; ++j)
        {
            ls.Add((int)ls.Average());
        }
    }
}

#4


3  

This is a very good and relevant question. As we all know, a hyper-threaded core is not a real CPU/core. Instead, it is a virtual CPU/core (from now on I'll say core). The Windows CPU scheduler as of Windows XP is supposed to be able to distinguish hyperthreaded (virtual) cores from real cores. You might imagine then that in this perfect world it handles them 'just right' and it is not an issue. You would be wrong.

Microsoft's own recommendation for optimizing a Windows 2008 BizTalk server recommends disabling HyperThreading. This suggests, to me, that the handling of hyper-threaded cores isn't perfect and sometimes threads get a time slice on a hyper-threaded core and suffer the penalty (a fraction of the performance of a real core, 10% I'd guess, and Microsoft guesses 20-30%).

Microsoft article reference where they suggest disabling HyperThreading to improve server efficiency: http://msdn.microsoft.com/en-us/library/cc615012(BTS.10).aspx

It is the SECOND recommendation, right after updating the BIOS; that is how important they consider it. They say:

FROM MICROSOFT:

"Disable hyper-threading on BizTalk Server and SQL Server computers

It is critical hyper-threading be turned off for BizTalk Server computers. This is a BIOS setting, typically found in the Processor settings of the BIOS setup. Hyper-threading makes the server appear to have more processors/processor cores than it actually does; however hyper-threaded processors typically provide between 20 and 30% of the performance of a physical processor/processor core. When BizTalk Server counts the number of processors to adjust its self-tuning algorithms; the hyper-threaded processors cause these adjustments to be skewed which is detrimental to overall performance."

Now, they do say it is due to it throwing off the self-tuning algorithms, but then go on to mention contention problems (suggesting it is a larger scheduling issue, at least to me). Read it as you will, but I think it says it all. HyperThreading was a good idea back when we had single-CPU systems, but in this multi-core world it is now just a complication that can hurt performance.

Instead of completely disabling HyperThreading, you can use programs like Process Lasso (free) to set default CPU affinities for critical processes, so that their threads never get allocated to virtual CPUs.

So.... I don't think anyone really knows just how well the Windows CPU Scheduler handles virtual CPUs, but I think it is safe to say that XP handles it worst, and they've gradually improved it since then, but it still isn't perfect. In fact, it may NEVER be perfect, because the OS doesn't have any knowledge of which threads are best to put on these slower virtual cores. That may be the issue there, and why Microsoft recommends disabling HyperThreading in server environments.

Also remember that even WITHOUT HyperThreading, there is the issue of 'core thrashing'. If you can keep a thread on a single core, that's a good thing, as it reduces the core-change penalties.
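
As a sketch of the "keep threads off the virtual CPUs" idea (this is only an illustration, not what Process Lasso does internally): if you assume that hyperthread siblings are numbered adjacently, i.e. logical CPUs 0/1 sit on the first physical core, 2/3 on the second, and so on (common, but worth verifying with GetLogicalProcessorInformation), you can restrict the whole process to every other logical CPU:

using System;
using System.Diagnostics;

class PhysicalCoresOnly
{
    static void Main()
    {
        // set bits 0, 2, 4, ...: one logical CPU per physical core,
        // assuming hyperthread siblings are adjacent (0/1, 2/3, ...)
        long mask = 0;
        for (int cpu = 0; cpu < Environment.ProcessorCount; cpu += 2)
            mask |= 1L << cpu;

        Process me = Process.GetCurrentProcess();
        me.ProcessorAffinity = (IntPtr)mask;    // the whole process now avoids the hyperthreaded siblings
        Console.WriteLine("Process affinity mask set to 0x{0:X}", mask);
    }
}

Setting ProcessorAffinity on the Process object affects every thread in the process; per-thread control is also possible through ProcessThread.ProcessorAffinity, as in the C# example in answer #3 above.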

#5


2  

You can make sure both threads get scheduled for the same execution units by giving them a processor affinity. This can be done in either windows or unix, via either an API (so the program can ask for it) or via administrative interfaces (so an administrator can set it). E.g. in WinXP you can use the Task Manager to limit which logical processor(s) a process can execute on.

Otherwise, the scheduling will be essentially random and you can expect a 25% usage on each logical processor.

#6


2  

The probability is essentially 0% that the OS won't utilize as many physical cores as possible. Your OS isn't stupid. Its job is to schedule everything, and it knows full well what cores it has available. If it sees two CPU-intensive threads, it will make sure they run on two physical cores.

Edit: Just to elaborate a bit, for high-performance stuff, once you get into MPI or other serious parallelization frameworks, you definitely want to control what runs on each core.

The OS will make a sort of best-effort attempt to utilize all cores, but it doesn't have the long-term information that you do, that "this thread is going to run for a very long time", or that "we're going to have this many threads executing in parallel". So it can't make perfect decisions, which means that your thread will get assigned to a new core from time to time, which means you'll run into cache misses and similar, which costs a bit of time. For most purposes, it's good enough, and you won't even notice the performance difference. And it also plays nice with the rest of the system, if that matters. (On someone's desktop system, that's probably fairly important. In a grid with a few thousand CPUs dedicated to this task, you don't particularly want to play nice, you just want to use every clock cycle available.)

So for large-scale HPC stuff, yes, you'll want each thread to stay on one core, fixed. But for most smaller tasks, it won't really matter, and you can trust the OS's scheduler.

#7


1  

I don't know about the other platforms, but in the case of Intel, they publish a lot of info on threading on their Intel Software Network. They also have a free newsletter (The Intel Software Dispatch) that you can subscribe to via email, and it has had a lot of such articles lately.

#8


0  

The chance that the OS will dispatch 2 active threads to the same core is zero unless the threads were tied to a specific core (thread affinity).

The reasons behind this are mostly HW related:

  • The OS (and the CPU) wants to use as little power as possible, so it will run the tasks as efficiently as possible in order to enter a low-power state ASAP.

  • Running everything on the same core will cause it to heat up much faster. In pathological conditions, the processor may overheat and reduce its clock to cool down. Excessive heat also causes CPU fans to spin faster (think laptops) and create more noise.

  • The system is never actually idle. ISRs and DPCs run every ms (on most modern OSes).

  • Performance degradation due to threads hopping from core to core is negligible in 99.99% of workloads.

  • In all modern processors the last-level cache is shared, so switching cores isn't so bad.

  • For multi-socket (NUMA) systems, the OS will minimize hopping from socket to socket so a process stays "near" its memory controller. This is a complex domain when optimizing for such systems (tens/hundreds of cores).

BTW, the way the OS knows the CPU topology is via ACPI - an interface provided by the BIOS.

To sum things up, it all boils down to system power considerations (battery life, power bill, noise from cooling solution).
