Seeking an explanation for a thread synchronization performance problem

Posted: 2022-06-11 00:51:42

When using kernel objects to synchronize threads running on different CPUs, is there perhaps some extra runtime cost when using Windows Server 2008 R2 relative to other OS's?


Edit: And as found out via the answer, the question should also include the phrase, "when running at lower CPU utilization levels." I included more information in my own answer to this question.


Background

I work on a product that uses shared memory and semaphores for communication between processes (when the two processes are running on the same machine). Reports of performance problems on Windows Server 2008 R2 (which I shorten to Win2008R2 after this) led me to find that sharing a semaphore between two threads on Win2008R2 was relatively slow compared to other OS’s.


Reproducing it

I was able to reproduce it by running the following bit of code concurrently on two threads:


for ( i = 0; i < N; i++ )
  {
  WaitForSingleObject( globalSem, INFINITE );
  ReleaseSemaphore( globalSem, 1, NULL );
  }

Testing with a machine that would dual boot into Windows Server 2003 R2 SP2 and Windows Server 2008 R2, the above snippet would run about 7 times faster on the Win2003R2 machine versus the Win2008R2 (3 seconds for Win2003R2 and 21 seconds for Win2008R2).


Simple Version of the Test

The following is the full version of the aforementioned test:


#include <windows.h>
#include <stdio.h>
#include <time.h>


HANDLE gSema4;                    // semaphore shared by both threads
int    gIterations = 1000000;     // iterations per thread

DWORD WINAPI testthread( LPVOID tn )
{
   int count = gIterations;

   // Each iteration acquires and immediately releases the shared semaphore.
   while ( count-- )
      {
      WaitForSingleObject( gSema4, INFINITE );
      ReleaseSemaphore( gSema4, 1, NULL );
      }

   return 0;
}


int main( int argc, char* argv[] )
{
   DWORD    threadId;
   clock_t  ct;
   HANDLE   threads[2];

   // Binary semaphore: initial count 1, maximum count 1.
   gSema4 = CreateSemaphore( NULL, 1, 1, NULL );

   ct = clock();
   threads[0] = CreateThread( NULL, 0, testthread, NULL, 0, &threadId );
   threads[1] = CreateThread( NULL, 0, testthread, NULL, 0, &threadId );

   WaitForMultipleObjects( 2, threads, TRUE, INFINITE );

   printf( "Total time = %ld\n", (long)( clock() - ct ) );

   CloseHandle( threads[0] );
   CloseHandle( threads[1] );
   CloseHandle( gSema4 );
   return 0;
}

More Details

I updated the test so that each thread runs a single iteration and then forces a switch to the other thread on every loop. Each thread signals the next thread to run at the end of each loop (round-robin style), as sketched below. I also updated it to use a spinlock as an alternative to the semaphore (which is a kernel object).

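Here is a minimal sketch of how that round-robin handshake can be enforced with two semaphores. It is not the exact enhanced test; the names (gPing, gPong) and structure are illustrative assumptions, but the ping-pong pattern is the same: each thread waits on its own semaphore and then releases its partner's, forcing a switch on every iteration.

#include <windows.h>

#define ITERATIONS 1000000

static HANDLE gPing, gPong;   /* hypothetical names for the two handshake semaphores */

/* Thread A: wait on gPing, then hand control to thread B via gPong. */
static DWORD WINAPI threadA( LPVOID unused )
{
   int i;
   for ( i = 0; i < ITERATIONS; i++ )
      {
      WaitForSingleObject( gPing, INFINITE );
      ReleaseSemaphore( gPong, 1, NULL );
      }
   return 0;
}

/* Thread B: mirror image of thread A. */
static DWORD WINAPI threadB( LPVOID unused )
{
   int i;
   for ( i = 0; i < ITERATIONS; i++ )
      {
      WaitForSingleObject( gPong, INFINITE );
      ReleaseSemaphore( gPing, 1, NULL );
      }
   return 0;
}

int main( void )
{
   DWORD  id;
   HANDLE threads[2];

   gPing = CreateSemaphore( NULL, 1, 1, NULL );  /* thread A runs first   */
   gPong = CreateSemaphore( NULL, 0, 1, NULL );  /* thread B waits for A  */

   threads[0] = CreateThread( NULL, 0, threadA, NULL, 0, &id );
   threads[1] = CreateThread( NULL, 0, threadB, NULL, 0, &id );
   WaitForMultipleObjects( 2, threads, TRUE, INFINITE );

   CloseHandle( threads[0] );
   CloseHandle( threads[1] );
   CloseHandle( gPing );
   CloseHandle( gPong );
   return 0;
}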

All machines I tested on were 64-bit machines. I compiled the test mostly as 32-bit. If built as 64-bit, it ran a bit faster overall and changed the ratios some, but the final result was the same. In addition to Win2008R2, I also ran against Windows 7 Enterprise SP 1, Windows Server 2003 R2 Standard SP 2, Windows Server 2008 (not R2), and Windows Server 2012 Standard.


  • Running the test on a single CPU was significantly faster (“forced” by setting thread affinity with SetThreadAffinityMask and checked with GetCurrentProcessorNumber). Not surprisingly, it was faster on all OS’s when using a single CPU, but the ratio between the multi-CPU and single-CPU runs with kernel-object synchronization was much higher on Win2008R2. The typical ratio for all machines except Win2008R2 was 2x to 4x (running on multiple CPUs took 2 to 4 times longer). But on Win2008R2, the ratio was 9x.

  • However ... I was not able to reproduce the slowdown on all Win2008R2 machines. I tested on 4 machines, and it showed up on 3 of them. So I cannot help but wonder if there is some kind of configuration setting or performance tuning option that might affect this. I have read performance tuning guides, looked through various settings, and changed various settings (e.g., background service vs foreground app) with no difference in behavior.

  • It does not seem to be necessarily tied to switching between physical cores. I originally suspected that it was somehow tied to the cost of accessing global data on different cores repeatedly. But when running a version of the test that uses a simple spinlock for synchronization (not a kernel object), running the individual threads on different CPUs was reasonably fast on all OS types. The ratio of the multi-CPU semaphore sync test vs the multi-CPU spinlock test was typically 10x to 15x. But for the Win2008R2 Standard Edition machines, the ratio was 30x. (See the sketch of the affinity pinning and spinlock variant after this list.)
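As mentioned above, here is a minimal sketch of the two techniques referred to in this list: pinning each thread to a single CPU with SetThreadAffinityMask, and a simple user-mode spinlock built on InterlockedCompareExchange in place of the kernel semaphore. The names (gLock, spinthread, etc.) are illustrative assumptions, not the exact code used for the measurements.

#include <windows.h>

static volatile LONG gLock = 0;   /* 0 = free, 1 = held (hypothetical spinlock) */

/* Busy-wait until the lock is acquired; never enters the kernel. */
static void spin_acquire( volatile LONG *lock )
{
   while ( InterlockedCompareExchange( lock, 1, 0 ) != 0 )
      YieldProcessor();            /* pause hint; stays in user mode */
}

static void spin_release( volatile LONG *lock )
{
   InterlockedExchange( lock, 0 );
}

/* Thread body: pin to the CPU passed in, then hammer the spinlock. */
static DWORD WINAPI spinthread( LPVOID arg )
{
   int cpu = (int)(INT_PTR)arg;
   int i;

   /* Restrict this thread to a single CPU; GetCurrentProcessorNumber()
      can be used to verify where it actually runs. */
   SetThreadAffinityMask( GetCurrentThread(), (DWORD_PTR)1 << cpu );

   for ( i = 0; i < 1000000; i++ )
      {
      spin_acquire( &gLock );
      spin_release( &gLock );
      }
   return 0;
}

int main( void )
{
   DWORD  id;
   HANDLE threads[2];

   /* Run the two spinning threads on CPUs 0 and 1. */
   threads[0] = CreateThread( NULL, 0, spinthread, (LPVOID)(INT_PTR)0, 0, &id );
   threads[1] = CreateThread( NULL, 0, spinthread, (LPVOID)(INT_PTR)1, 0, &id );
   WaitForMultipleObjects( 2, threads, TRUE, INFINITE );
   CloseHandle( threads[0] );
   CloseHandle( threads[1] );
   return 0;
}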

Here are some actual numbers from the updated test (times are in milliseconds):


+----------------+-----------+---------------+----------------+
|       OS       | 2 cpu sem |   1 cpu sem   | 2 cpu spinlock |
+----------------+-----------+---------------+----------------+
| Windows 7      | 7115 ms   | 1960 ms (3.6) | 504 ms (14.1)  |
| Server 2008 R2 | 20640 ms  | 2263 ms (9.1) | 866 ms (23.8)  |
| Server 2003    | 3570 ms   | 1766 ms (2.0) | 452 ms (7.9)   |
+----------------+-----------+---------------+----------------+

Each of the 2 threads in the test ran 1 million iterations. Those tests were all run on identical machines. The Server 2008 R2 and Server 2003 numbers are from a dual-boot machine. The Win 7 machine has the exact same specs but was a different physical machine. The machine in this case is a Lenovo T420 laptop with a Core i5-2520M 2.5GHz. Obviously not a server-class machine, but I get similar results on true server-class hardware. The numbers in parentheses are the ratio of the first column to the given column.


Any explanation for why this one OS would seem to introduce extra expense for kernel level synchronization across CPUs? Or do you know of some configuration/tuning parameter that might affect this?


While it would make this already exceedingly verbose post even longer, I could post the enhanced version of the test code that the above numbers came from if anyone wants it. It would show the enforced round-robin logic and the spinlock version of the test.


Extended Background

This is an attempt to answer some of the inevitable questions about why things are done this way. I'm the same way ... when I read a post, I often wonder why the poster is even asking. So here are some attempts to clarify:


  • What is the application? It is a database server. In some situations, customers run the client application on the same machine as the server. In that case, it is faster to use shared memory for communications (versus sockets). This question is related to the shared memory comm.

  • Is the workload really that dependent on events? Well ... the shared memory comm is implemented using named semaphores. The client signals a semaphore, the server reads the data, and the server signals a semaphore for the client when the response is ready. On other platforms, it is blindingly fast. On Win2008R2, it is not. It is also very dependent on the customer application. If they write it with lots of small requests to the server, then there is a lot of communication between the two processes. (A sketch of this handshake is given after this list.)

  • Can a lightweight lock be used? Possibly. I am already looking at that. But it is independent of the original question.
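To make the handshake described above concrete, here is a minimal client-side sketch under assumed names (the mapping name, semaphore names, and buffer layout are all illustrative, not the product's actual protocol). It assumes a cooperating server process has opened the same named objects: the client writes a request into shared memory, signals a named request semaphore, and waits on a named reply semaphore that the server signals when the response is ready.

#include <windows.h>
#include <string.h>

int main( void )
{
   /* Hypothetical object names; error checking omitted for brevity. */
   HANDLE hMap     = CreateFileMapping( INVALID_HANDLE_VALUE, NULL,
                                        PAGE_READWRITE, 0, 4096,
                                        "Local\\DemoSharedMem" );
   HANDLE hRequest = CreateSemaphore( NULL, 0, 1, "Local\\DemoRequestSem" );
   HANDLE hReply   = CreateSemaphore( NULL, 0, 1, "Local\\DemoReplySem" );
   char  *buf      = (char *)MapViewOfFile( hMap, FILE_MAP_ALL_ACCESS, 0, 0, 0 );

   /* One request/response round trip. */
   strcpy( buf, "small request" );           /* write request into shared memory   */
   ReleaseSemaphore( hRequest, 1, NULL );    /* tell the server a request is ready */
   WaitForSingleObject( hReply, INFINITE );  /* block until the server has replied */
   /* ... read the response from buf ... */

   UnmapViewOfFile( buf );
   CloseHandle( hRequest );
   CloseHandle( hReply );
   CloseHandle( hMap );
   return 0;
}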

4 Answers

#1


3  

Pulled from the comments into an answer:


Maybe the server is not set to the high-performance power plan? Win2k8 might have a different default. Many servers aren't by default, and this hits performance very hard.


The OP confirmed this as the root cause.


This is a funny cause for this behavior. The idea flashed up in my head while I was doing something completely different.


#2


0  

It could well be that the OS installation configuration varies. Perhaps the slow system is configured to disallow multiple threads from your process from being scheduled simultaneously. If some other high-priority process were always (or mostly) ready to run, the only alternative would be for your threads to run sequentially, not in parallel.


#3


0  

I'm adding this additional "answer" information here rather than burying it in my overly long OP. @usr pointed me in the right direction with the power management options suggestion. The contrived test in the OP as well as the original problem involves a lot of handshaking between different threads. The handshaking in the real world app was across different processes, but testing showed the results do not differ if it is threads or processes doing the handshaking. The sharing of the semaphore (kernel sync object) across the CPUs seems to be greatly affected in Windows Server 2008 R2 by the power settings when running at low (e.g., 5% to 10%) CPU usage. My understanding of this at this point is purely based on measuring and timing applications.


A related question on Serverfault talks about this some as well.


The Test Settings

OS Power Options Setting

The default power plan for Windows Server 2008 R2 is "Balanced". Changing it to the "High Performance" option helped the performance of this test quite a bit. In particular, one setting under "Change advanced power settings" seems to be the critical one. The advanced settings have an option under Processor power management called Minimum processor state. The default value for this under the Balanced plan seems to be 5%. Changing that to 100% in my testing was the key.

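For reference, the same changes can presumably be made from an elevated command prompt with powercfg, using its standard aliases (listed by powercfg /aliases). The following is a sketch of the equivalent commands, not something verified on the test machines:

:: Switch to the High Performance plan (SCHEME_MIN = minimum power savings):
powercfg /setactive SCHEME_MIN

:: Or raise "Minimum processor state" to 100% for the current plan (AC power) and reapply it:
powercfg /setacvalueindex SCHEME_CURRENT SUB_PROCESSOR PROCTHROTTLEMIN 100
powercfg /setactive SCHEME_CURRENT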

BIOS Setting

In addition, a BIOS setting affected this test greatly. I'm sure this varies a lot across hardware, but the primary machine I tested on has a setting named "CPU Power Management". The description of the BIOS setting is, "Enables or disables the power saving feature that stop (sic) the microprocessor clock automatically when there are no system activities." I changed this option to "Disabled".


Empirical Results

The two test cases shown are:


  • (a) Simple. A modified version of the one included in the OP. This simple test enforced round-robin switching at every iteration between two threads on two CPUs. Each thread ran 1 million iterations (thus, there were 2 million context switches across CPUs).

  • (b) Real World. The real world client/server test where a client was making many "small" requests of the server via shared memory and synchronized with global named semaphores.

The three test scenarios are:


  • (i) Balanced. Default installation of Windows Server 2008 R2, which uses the Balanced power plan.

  • (ii) HighPerf. I changed the power option from "Balanced" to "High Performance". Equivalently, the same results were obtained by setting the Minimum processor state option described above to 100% (from 5%).

  • (iii) BIOS. I disabled the CPU Power Management BIOS option as described above and also left the High Performance power option selected.

The times given are in seconds:


╔════════════════╦═════════════╦═══════════════╦════════════╗
║                ║ (i)Balanced ║ (ii) HighPerf ║ (iii) BIOS ║
╠════════════════╬═════════════╬═══════════════╬════════════╣
║ (a) Simple     ║ 21.4 s      ║ 9.2 s         ║ 4.0 s      ║
║ (b) Real World ║ 9.3 s       ║ 2.2 s         ║ 1.7 s      ║
╚════════════════╩═════════════╩═══════════════╩════════════╝

So after both changes were made (OS and BIOS), both the real world test and the contrived test ran about 5 times faster than under the default installation and default BIOS settings.



While I was testing these cases, I sometimes encountered a result I could not explain. When the CPU was busy (some background process would fire up), the test would run faster. I would file it away in my head and be puzzled for a while. But now it makes sense. When another process ran, it would bump the CPU usage past whatever threshold was needed to keep it in a high-power state, and the context switches would be fast. I still do not know exactly what aspect is slow (the primary cost is buried in the WaitForSingleObject call), but the end results now all kind of make sense.


#4


-3  

This isn't a reasonable benchmark; your semaphores are always frobbed in the same process (and so presumably on the same CPU/core). An important part of the cost of locking in real-world cases is the memory accesses involved when different CPUs/cores fight over exclusive access to the memory area (which bounces back and forth between caches). Look for some more real-world benchmarks (sorry, not my area), or (even better) measure (some cut-down version of) your application with (contrived, but realistic) test data.


[Test data for benchmarks should never be the data used for testing or regression testing: the latter pokes at (probably rarely used) corner cases, whereas you want "typical" runs for benchmarking.]

