Lock-free programming is a challenge, not just because of the complexity of the task itself, but also because of how difficult the subject can be to penetrate in the first place.
The flowchart above can be read as a series of questions: 1. Is your program multithreaded (or does it have interrupt handlers)? Yes. 2. Do those threads access shared memory? Yes. 3. Can the threads block each other (that is, is there some way a thread could be locked up indefinitely)? No. If your answers match, the code qualifies as lock-free. In this sense, the "lock" in lock-free programming does not refer only to mutexes, but to any mechanism by which the whole application could become locked up, whether by deadlock, by livelock, or even by hypothetical thread scheduling decisions made by your worst enemy.
Lock-Free Programming Techniques
When you attempt to satisfy the conditions of lock-free programming, a whole family of techniques comes up: atomic operations, memory barriers, and avoiding the ABA problem, to name a few.
Atomic Read-Modify-Write Operations

Atomic operations are ones which manipulate memory in a way that appears indivisible: no thread can observe the operation half-complete. On modern processors, many operations are already atomic. For example, aligned reads and writes of simple types (int, char, bool, and so on) are usually atomic; whether wider types such as long long qualify depends on the platform. Read-modify-write (RMW) operations go a step further, letting you perform more complex transactions atomically. They are especially useful when a lock-free algorithm must support multiple writers, because when multiple threads attempt an RMW on the same address, they effectively line up and perform those operations one at a time. I have already touched on RMW operations on this blog, such as when implementing a lightweight mutex (http://preshing.com/20120226/roll-your-own-lightweight-mutex), a recursive mutex (http://preshing.com/20120305/implementing-a-recursive-mutex) and a lightweight logging system (http://preshing.com/20120522/lightweight-in-memory-logging).
Examples of RMW operations include _InterlockedIncrement on Win32, OSAtomicAdd32 on iOS, and std::atomic<int>::fetch_add in C++11. Be aware that the C++11 atomic standard does not guarantee that the implementation will be lock-free on every platform, so it is best to check by calling std::atomic<>::is_lock_free.
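As a small sketch of how an RMW such as fetch_add serializes multiple writers (the function names here are my own, chosen for illustration):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Several writers increment one shared counter. Because fetch_add is an
// atomic read-modify-write, the increments are serialized and none are lost.
// (std::atomic<int>::is_lock_free() can confirm the platform's support.)
std::atomic<int> g_counter(0);

void writer(int count)
{
    for (int i = 0; i < count; i++)
        g_counter.fetch_add(1);   // atomic RMW: read, add, write as one step
}

int run_writers(int threads, int count)
{
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; t++)
        pool.emplace_back(writer, count);
    for (auto& th : pool)
        th.join();
    return g_counter.load();
}
```

With a plain (non-atomic) int, some increments would typically be lost under contention; the atomic RMW guarantees the final count is exact.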
Different CPU families support RMW in different ways. Processors such as PowerPC and ARM expose load-link/store-conditional instructions, which effectively let you implement your own RMW primitive at a low level.

Compare-And-Swap Loops

Perhaps the most often-discussed RMW operation is compare-and-swap (CAS). On Win32, CAS is provided by intrinsics such as _InterlockedCompareExchange. Often, programmers perform CAS in a loop to repeatedly attempt a transaction. This pattern typically involves copying a shared variable to a local variable, performing some speculative work on it, and then attempting to publish the changes back to the shared variable using CAS.
void LockFreeQueue::push(Node* newHead)
{
    for (;;)
    {
        // Copy a shared variable (m_Head) to a local.
        Node* oldHead = m_Head;

        // Do some speculative work, not yet visible to other threads.
        newHead->next = oldHead;

        // Next, attempt to publish our changes to the shared variable.
        // If the shared variable hasn't changed, the CAS succeeds and we return.
        // Otherwise, repeat.
        if (_InterlockedCompareExchange(&m_Head, newHead, oldHead) == oldHead)
            return;
    }
}
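The same CAS-loop pattern can be sketched portably with C++11 atomics; this stand-in stack type and its names are mine, not the original Win32 code:

```cpp
#include <atomic>

// A minimal lock-free push using a CAS loop (C++11).
struct Node
{
    int value;
    Node* next;
};

std::atomic<Node*> m_Head(nullptr);

void push(Node* newHead)
{
    for (;;)
    {
        // Copy the shared variable to a local.
        Node* oldHead = m_Head.load();

        // Speculative work, not yet visible to other threads.
        newHead->next = oldHead;

        // Attempt to publish the change. If m_Head is still oldHead,
        // the CAS succeeds and we are done; otherwise, retry.
        if (m_Head.compare_exchange_weak(oldHead, newHead))
            return;
    }
}
```

compare_exchange_weak is the conventional choice inside a retry loop, since any spurious failure simply triggers another iteration.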
Sequential Consistency

Sequential consistency means that all threads agree on the order in which memory operations occur, and that this order is consistent with the order of operations in the program source code. Under sequential consistency, the kind of memory reordering I demonstrated in my previous post (http://preshing.com/20120515/memory-reordering-caught-in-the-act) cannot occur. A simple (and hopelessly impractical) way to achieve sequential consistency is to disable compiler optimizations and force all threads to run on a single processor: a single processor never sees its own memory operations out of order, no matter when threads are scheduled. Some programming languages offer sequential consistency even for optimized code running on multiple processors; in C++11, you can achieve it by declaring all shared variables as C++11 atomic types with their default memory ordering.
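The r1/r2 experiment referred to here can be reconstructed as a minimal C++11 sketch (the thread scaffolding and function names are my own):

```cpp
#include <atomic>
#include <thread>

// Two threads each store to one variable, then load the other. With C++11
// atomics at their default (sequentially consistent) ordering, the outcome
// r1 == 0 && r2 == 0 is impossible.
std::atomic<int> X(0), Y(0);
int r1, r2;

void thread1()
{
    X.store(1);      // seq_cst by default
    r1 = Y.load();
}

void thread2()
{
    Y.store(1);
    r2 = X.load();
}

void run_once()
{
    X.store(0);
    Y.store(0);
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
}
```

If X and Y were plain ints instead, both loads could observe 0 on a real multiprocessor; the atomic types rule that outcome out.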
Because C++11 atomic types guarantee sequential consistency by default, the outcome r1 = r2 = 0 is impossible. To achieve this, the compiler emits additional instructions behind the scenes, typically memory fences and/or RMW operations. Those additional instructions may make the code less efficient than it would be if the programmer dealt with memory ordering directly.
Memory Ordering

As the flowchart suggests, any time you do lock-free programming for a multicore system (or any symmetric multiprocessor, SMP) and your environment does not guarantee sequential consistency, you must consider how to prevent memory reordering.
Acquire semantics prevent memory reordering of a read with any operation which follows it in program order; release semantics prevent reordering of a write with any operation which precedes it. These semantics are particularly suitable in producer/consumer situations, especially when there is a single producer and a single consumer. I'll discuss this further in a future post.
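A minimal single-producer/single-consumer sketch of acquire and release semantics in C++11 (the variable names are mine):

```cpp
#include <atomic>
#include <thread>

int g_payload = 0;                  // plain data written by the producer
std::atomic<bool> g_ready(false);   // guard flag

void producer()
{
    g_payload = 42;                                   // write the data first...
    g_ready.store(true, std::memory_order_release);   // ...then publish it
}

int consumer()
{
    // The acquire load synchronizes with the release store, so once the
    // flag is observed true, the payload write is guaranteed visible.
    while (!g_ready.load(std::memory_order_acquire)) {}
    return g_payload;
}
```

The release store keeps the payload write from being reordered after it, and the acquire load keeps the payload read from being reordered before it; together they make the handoff safe without a full fence.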
Different Processors Have Different Memory Models

Different CPU families have different habits when it comes to memory reordering (Different CPU families have different habits); the exact rules are documented by each CPU vendor for its hardware. For example, PowerPC and ARM processors may reorder memory stores relative to the instructions themselves, while Intel's x86/64 family and AMD's processors normally do not. We say the former processors have a more relaxed memory model.
There is a temptation to abstract away such hardware details, especially since C++11 now offers a standard, portable way to write lock-free code. But for the time being, most lock-free programmers should have at least some appreciation of platform differences. One key point to remember is that at the x86/64 instruction level, every load from memory comes with acquire semantics and every store to memory provides release semantics (at least for non-SSE instructions and non-write-combined memory). As a result, it has been common in the past to write lock-free code which works on x86/64 but fails on other processors.
[This article was also published in …]
- Anthony Williams’ blog and his book, C++ Concurrency in Action
- Dmitriy V’jukov’s website and various forum discussions
- Bartosz Milewski’s blog
- Charles Bloom’s Low-Level Threading series on his blog
- Doug Lea’s JSR-133 Cookbook
- Howells and McKenney’s memory-barriers.txt document
- Hans Boehm’s collection of links about the C++11 memory model
- Herb Sutter’s Effective Concurrency series
Lightweight In-Memory Logging
When debugging multithreaded code, it’s not always easy to determine which codepath was taken. You can’t always reproduce the bug while stepping through the debugger, nor can you always sprinkle printfs throughout the code, as you might in a single-threaded program. There might be millions of events before the bug occurs, and printf can easily slow the application to a crawl, mask the bug, or create a spam fest in the output log.
One way of attacking such problems is to instrument the code so that events are logged to a circular buffer in memory. This is similar to adding printfs, except that only the most recent events are kept in the log, and the performance overhead can be made very low using lock-free techniques.
Here’s one possible implementation. I’ve written it specifically for Windows in 32-bit C++, but you could easily adapt the idea to other platforms. The header file contains the following:
#include <windows.h>
#include <intrin.h>
namespace Logger
{
struct Event
{
DWORD tid; // Thread ID
const char* msg; // Message string
DWORD param; // A parameter which can mean anything you want
};
static const int BUFFER_SIZE = 65536; // Must be a power of 2
extern Event g_events[BUFFER_SIZE];
extern LONG g_pos;
inline void Log(const char* msg, DWORD param)
{
// Get next event index
LONG index = _InterlockedIncrement(&g_pos);
// Write an event at this index
Event* e = g_events + (index & (BUFFER_SIZE - 1)); // Wrap to buffer size
e->tid = ((DWORD*) __readfsdword(24))[9]; // Get thread ID
e->msg = msg;
e->param = param;
}
}
#define LOG(m, p) Logger::Log(m, p)
And you must place the following in a .cpp file.
namespace Logger
{
Event g_events[BUFFER_SIZE];
LONG g_pos = -1;
}
This is perhaps one of the simplest examples of lock-free programming which actually does something useful. There’s a single macro, LOG, which writes to the log. It uses _InterlockedIncrement, an atomic operation which I’ve talked about in previous posts, for thread safety. There are no readers. You are meant to be the reader when you inspect the process in the debugger, such as when the program crashes, or when the bug is otherwise caught.
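For comparison, here is a portable sketch of the same logger built on a C++11 atomic counter instead of _InterlockedIncrement; the PortableLogger name and the hashed thread ID are my own substitutes for the Win32-specific parts:

```cpp
#include <atomic>
#include <cstdint>
#include <functional>
#include <thread>

namespace PortableLogger
{
    struct Event
    {
        std::size_t tid;        // hashed thread ID
        const char* msg;        // message string
        std::uint32_t param;    // arbitrary parameter
    };

    const std::uint32_t BUFFER_SIZE = 65536;   // must be a power of 2
    Event g_events[BUFFER_SIZE];
    std::atomic<std::uint32_t> g_pos(std::uint32_t(-1));

    inline void Log(const char* msg, std::uint32_t param)
    {
        // Atomically claim the next slot; fetch_add returns the old value,
        // so +1 mirrors _InterlockedIncrement's return of the new value.
        std::uint32_t index = g_pos.fetch_add(1) + 1;
        Event* e = g_events + (index & (BUFFER_SIZE - 1));  // wrap to buffer size
        e->tid = std::hash<std::thread::id>()(std::this_thread::get_id());
        e->msg = msg;
        e->param = param;
    }
}
```

The structure is identical: one atomic increment claims a slot, and the rest of the event is written without synchronization, which is safe for a debugger-only reader.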
Using It to Debug My Previous Post
My previous post, Memory Reordering Caught In the Act, contains a sample program which demonstrates a specific type of memory reordering. There are two semaphores, beginSema1 and beginSema2, which are used to repeatedly kick off two worker threads.
While I was preparing the post, there was only a single beginSema
shared by both threads. To verify that the experiment was valid, I added a makeshift assert to the worker threads. Here’s the Win32 version:
DWORD WINAPI thread1Func(LPVOID param)
{
    MersenneTwister random(1);
    for (;;)
    {
        WaitForSingleObject(beginSema, INFINITE);  // Wait for signal
        while (random.integer() % 8 != 0) {}       // Random delay
        // ----- THE TRANSACTION! -----
        if (X != 0) DebugBreak();                  // Makeshift assert
        X = 1;
        _ReadWriteBarrier();                       // Prevent compiler reordering only
        r1 = Y;
        ReleaseSemaphore(endSema, 1, NULL);        // Notify transaction complete
    }
    return 0;  // Never returns
}
Surprisingly, this “assert” got hit, which means that X did not hold its expected value of 0 at the start of the experiment. This puzzled me, since, as I explained in that post, the semaphores are supposed to guarantee the initial values X = 0 and Y = 0 are completely propagated at this point.
I needed more visibility on what was going on, so I added the LOG macro in a few strategic places. Note that the integer parameter can be used to log any value you want. In the second LOG statement below, I use it to log the initial value of X. Similar changes were made in the other worker thread.
for (;;)
{
    LOG("wait", 0);
    WaitForSingleObject(beginSema, INFINITE);  // Wait for signal
    while (random.integer() % 8 != 0) {}       // Random delay
    // ----- THE TRANSACTION! -----
    LOG("X ==", X);
    if (X != 0) DebugBreak();                  // Makeshift assert
    X = 1;
    _ReadWriteBarrier();                       // Prevent compiler reordering only
    r1 = Y;
    ReleaseSemaphore(endSema, 1, NULL);        // Notify transaction complete
}
And in the main thread:
for (int iterations = 1; ; iterations++)
{
    // Reset X and Y
    LOG("reset vars", 0);
    X = 0;
    Y = 0;
    // Signal both threads
    ReleaseSemaphore(beginSema, 1, NULL);
    ReleaseSemaphore(beginSema, 1, NULL);
    // Wait for both threads
    WaitForSingleObject(endSema, INFINITE);
    WaitForSingleObject(endSema, INFINITE);
    // Check if there was a simultaneous reorder
    LOG("check vars", 0);
    if (r1 == 0 && r2 == 0)
    {
        detected++;
        printf("%d reorders detected after %d iterations\n", detected, iterations);
    }
}
The next time the “assert” was hit, I checked the contents of the log simply by watching the expressions Logger::g_pos and Logger::g_events in the Watch window.
In this case, the assert was hit fairly quickly. Only 17 events were logged in total (0 - 16). The final three events made the problem obvious: a single worker thread had managed to iterate twice before the other thread got a chance to run. In other words, thread1 had stolen the extra semaphore count which was intended to kick off thread2! Splitting this semaphore into two separate semaphores fixed the bug.
This example was relatively simple, involving a small number of events. In some games I’ve worked on, we’ve used this kind of technique to track down more complex problems. It’s still possible for this technique to mask a bug; for example, when memory reordering is the issue. But even if so, that may tell you something about the problem.
Tips on Viewing the Log
The g_events array is only big enough to hold the latest 65536 events. You can adjust this number to your liking, but at some point, the index counter g_pos will have to wrap around. For example, if g_pos has reached a value of 3630838, you can find the last log entry by taking this value modulo 65536. Using interactive Python:
>>> 3630838 % 65536
26358
When breaking, you may also find that “CXX0017: Error: symbol not found” is sometimes shown in the Watch window, as seen here:
This usually means that the debugger’s current thread and stack frame context is inside an external DLL instead of your executable. You can often fix it by double-clicking a different stack frame in the Call Stack window and/or a different thread in the Threads window. If all else fails, you can always add the context operator to your Watch expression, explicitly telling the debugger which module to use to resolve these symbols:
One convenient detail about this implementation is that the event log is stored in a global array. This allows the log to show up in crash dumps, via an automated crash reporting system for example, even when limited minidump flags are used.
What Makes This Lightweight?
In this implementation, I strived to make the LOG macro as non-intrusive as reasonably possible. Besides being lock-free, this is mainly achieved through copious use of compiler intrinsics, which avoid the overhead of DLL function calls for certain functions. For example, instead of calling InterlockedIncrement, which involves a call into kernel32.dll, I used the intrinsic function _InterlockedIncrement (with an underscore).
Similarly, instead of getting the current thread ID from GetCurrentThreadId, I used the compiler intrinsic __readfsdword to read the thread ID directly from the Thread Information Block (TIB), an undocumented but well-known data structure in Win32.
You may question whether such micro-optimizations are justified. However, after building several makeshift logging systems, usually to handle millions of events in high-performance, multi-threaded code, I’ve come to believe that the less intrusive you can make it, the better. As a result of these micro-optimizations, the LOG macro compiles down to a few machine instructions, all inlined, with no function calls, no branching and no blocking:
This technique is attractive because it is very easy to integrate. There are many ways you could adapt it, depending on your needs, especially if performance is less of a concern. You could add timestamps and stack traces. You could introduce a dedicated thread to spool the event log to disk, though this would require much more sophisticated synchronization than the single atomic operation used here.
After adding such features, the technique would begin to resemble Microsoft’s Event Tracing for Windows (ETW) framework, so if you’re willing to go that far, it might be interesting to look at ETW’s support for user-mode provider events instead.