Introduction
Here I’m going to show you some racy code which I wrote, that worked OK on AArch64, but broke on X86_64. Since general opinion has it that “The Arm® architecture has a more relaxed memory model than X86_64”, that seemed odd and worth explaining: if that were always true, you’d expect races to show up on AArch64 but not on X86_64.
What Problem is the Code Solving?
In the Little OpenMP® runtime we implement “static steal” scheduling for OpenMP parallel loops with a nonmonotonic schedule.
Since that seems rather complicated, I’ll unpack it a bit.
What is a nonmonotonic schedule?
Consider an OpenMP for loop like this (assumed to be executed in a parallel region, so there are many threads all executing the same code until they reach a work-sharing construct such as this loop)
#pragma omp for schedule(dynamic)
for (auto i=0; i<1000; ++i) {
...
}
That will cause the iterations of the loop to be distributed dynamically to the available threads, so that all iterations are executed, and no iteration is executed more than once. However, the standard does not say exactly how that dynamic allocation of iterations to threads should occur, leaving that up to the compiler and runtime.
Now consider this code (also inside a parallel region)
auto myMaxIdx = -1; // Per-thread variable on the
// thread's stack
#pragma omp for schedule(dynamic)
for (auto i=0; i<1000; i++) {
if (i < myMaxIdx) {
abort(); // Non-monotonic behaviour detected!
} else {
myMaxIdx = i;
}
}
Do you expect that it could ever call abort()?
That will depend on whether the OpenMP implementation is allocating the iterations to threads in a monotonic fashion, i.e. whether a single thread sees the iterations it executes in order. Of course, by specifying that this is a work-sharing loop, the programmer has already accepted that multiple iterations can be executing concurrently in different threads, and that iterations will complete in some utterly random, potentially nonmonotonic, order; but here we’re looking at a slightly different issue, which is the order the iterations are seen within a single thread.
Prior to OpenMP 4.5, the standard said nothing about monotonicity, though there was an implicit assumption among some users that a dynamic schedule would be implemented by having a single, shared, counter which each thread would atomically increment when it required more iterations. Of course, such an implementation is naturally monotonic.
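To make that concrete, here is a minimal sketch of such a counter-based dynamic schedule (this is illustrative code, not any particular runtime's implementation; workerLoop and runLoopBody are names invented for the sketch). Because each thread claims iterations by atomically incrementing a single shared counter, the iterations any one thread sees are necessarily in increasing order.
#include <atomic>
#include <cstdint>

std::atomic<uint32_t> nextIteration{0};

static void runLoopBody(uint32_t i) {
  // Stands in for the user's loop body.
  (void)i;
}

// Each thread runs this until all iterations have been claimed.
void workerLoop(uint32_t totalIterations) {
  for (;;) {
    // Atomically claim the next unclaimed iteration.
    auto i = nextIteration.fetch_add(1, std::memory_order_relaxed);
    if (i >= totalIterations) {
      return; // Every iteration has been handed out.
    }
    runLoopBody(i);
  }
}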
Since this is detectable (as we showed above), in OpenMP 4.5 the OpenMP standard added the idea of “schedule modifiers”, and, in particular, the monotonic and nonmonotonic schedule modifiers, but without specifying the monotonicity of the simple dynamic schedule. However, as of OpenMP 5.0, the default dynamic schedule is now explicitly allowed to be nonmonotonic. (See the OpenMP 5.0 standard).
So, to answer our question, as of OpenMP 5.0, that code might call abort() when run on a standard conforming OpenMP implementation1.
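For reference, the 4.5 modifiers let the programmer ask explicitly for the behaviour they want; a sketch (loop bodies elided as before):
#pragma omp for schedule(monotonic: dynamic)    // each thread must see its
for (auto i=0; i<1000; ++i) {                   // iterations in increasing order
  ...
}

#pragma omp for schedule(nonmonotonic: dynamic) // the runtime may hand a thread
for (auto i=0; i<1000; ++i) {                   // iterations in any order
  ...
}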
Why do we want to allow a nonmonotonic schedule?
Because it has no restrictions other than those of correctness, the nonmonotonic:dynamic schedule allows the use of any implementation, and so allows more experimentation with clever implementations.
For instance, a runtime can implement a nonmonotonic:dynamic schedule more efficiently than a monotonic:dynamic one, since the additional freedom allows it to remove the potentially heavily contended atomic operations on a single shared counter. As we have seen before (in “Atomics in AArch64”), having many threads performing an atomic operation on a single cacheline can be very slow. Of course, here we’d expect to be executing user code (the loop body) between each atomic, so things aren’t as bad as in the µ-benchmark we used there, but reducing overhead in the OpenMP runtime is worthwhile since it allows more code to execute well in OpenMP.2
Alternatively, a runtime can use other more complicated implementations, such as maintaining profile information about a given loop and using that to choose a good allocation of iterations to threads if the same loop is re-executed, which may be able to achieve even better performance if each execution of the loop has similar properties and the cost of profiling and choosing a schedule is low.
You can see the improvement achieved by one implementation in this graph (taken from slide 5 of my presentation on the OpenMP booth at SC183)
Here we can see that when there is only a small amount of work inside each loop iteration, the cost of monotonic:dynamic scheduling can be much larger than that of the nonmonotonic:dynamic implementation.
That presentation also shows that nonmonotonic:dynamic can outperform both monotonic:dynamic and static schedules in a real code (MPAS-O).
There is also a whole chapter on “Scheduling Parallel Loops” in “the book”, which shows more comparisons between different schedules and has references to some of the work in this field.
How do we implement the nonmonotonic:dynamic schedule?
The implementation we are attempting is based on the idea that we initially perform an allocation of iterations to threads which is the same as the one we’d perform for a balanced static schedule, but then, if there is actual load imbalance at runtime, threads which have no iterations left to execute steal some from a thread which still has iterations it has not yet started4. This is described as a “static steal” implementation; it is based on the scheme used by Cilk (and Threading Building Blocks) for task scheduling.
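As an illustration of the initial allocation, here is one reasonable way (a sketch; not necessarily how LOMP divides the remainder) to compute the contiguous block [base, end) which thread t initially owns when N iterations are shared among numThreads threads:
#include <algorithm>
#include <cstdint>
#include <utility>

// Returns the half-open range [base, end) initially owned by thread t.
// The first (N % numThreads) threads get one extra iteration so that the
// split is as balanced as a static schedule would be.
std::pair<uint32_t, uint32_t> initialBlock(uint32_t t, uint32_t N,
                                           uint32_t numThreads) {
  auto chunk = N / numThreads;
  auto extra = N % numThreads;
  auto base = t * chunk + std::min(t, extra);
  auto end = base + chunk + (t < extra ? 1 : 0);
  return {base, end};
}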
This approach has many advantages:-
If the work is well balanced, it will perform almost the same as a static schedule5.
If there is external interference (for instance if threads are stolen by the operating system), the schedule can adapt.
Instead of a single location being accessed atomically, there is one for each thread, so contention is lower.
The total number of atomic operations required on the critical path can be reduced.
To implement this we give each thread an instance of a class something like this:6-
// The real code may use uint64_t and __uint128_t.
typedef uint32_t unsignedType;
typedef uint64_t pairType;
class contiguousWork {
// The bounds here are based on a "less than" calculation,
// since the empty (0,0) case can occur.
union CACHE_ALIGNED {
// For some reason GCC requires that we name the struct,
// while LLVM is happy for it to be anonymous, so we name
// it, and then have to type a little more in a few places.
struct {
std::atomic<unsignedType> atomicBase;
std::atomic<unsignedType> atomicEnd;
} ab;
pairType pair;
std::atomic<pairType> atomicPair;
};
auto setBase(unsignedType nb, std::memory_order order) {
return ab.atomicBase.store(nb, order);
}
void assign(unsignedType b, unsignedType e) {
// No need for atomicity here; we're copying into a
// local value.
ab.atomicBase.store(b, std::memory_order_relaxed);
ab.atomicEnd.store (e, std::memory_order_relaxed);
}
public:
contiguousWork() {}
contiguousWork(unsignedType b, unsignedType e) {
assign(b,e);
}
auto getBase(std::memory_order order =
std::memory_order_acquire) const {
return ab.atomicBase.load(order);
}
auto getEnd(std::memory_order order =
std::memory_order_acquire) const {
return ab.atomicEnd.load(order);
}
contiguousWork(contiguousWork * other) {
// N.B. NOT loaded atomically over both parts, but that's
// fine, since when we update we'll use a wide CAS, so if
// it changed at all we'll see it.
assign(other->getBase(), other->getEnd());
}
~contiguousWork() {}
auto getIterations() const {
return getEnd() - getBase();
}
bool trySteal(unsignedType * basep, unsignedType * endp);
bool incrementBase(unsignedType * oldp);
};
Although this looks complicated, it’s really just a pair of atomically accessed unsigned integers (base and end) placed next to each other. Each thread has its own instance of this class. The thread which owns the work executes iterations one at a time from the base, while other threads which have no work to execute can steal a group of iterations from the end.
Thus the owning thread never modifies the end, while stealing threads (“thieves”) never modify the base.
Clearly, thieves must use atomic operations on the end, since many of them may be attempting to steal from the same victim simultaneously. However, since the owner is the only thread updating the base, it does not need to use an atomic operation to increment the base. At least, that is what we’re trying to achieve, to reduce the number of atomic operations on the normal, fast-path.
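To show how these pieces are intended to fit together, here is a hypothetical sketch of the loop a worker thread might run; runMyShare and runIteration are names invented for the sketch, and victim selection and loop-termination detection (which the real runtime handles) are glossed over.
// A sketch only: 'mine' is this thread's contiguousWork, 'others' points to
// the other threads' instances.
template <typename Body>
void runMyShare(contiguousWork * mine, contiguousWork * others, int numOthers,
                Body runIteration) {
  unsignedType i;
  // Fast path: the owner claims its own iterations, one at a time, from the
  // base, without needing an atomic read-modify-write.
  while (mine->incrementBase(&i)) {
    runIteration(i);
  }
  // Slow path: nothing left locally, so try to steal a block of iterations
  // from the end of another thread's work.
  unsignedType base, end;
  for (int v = 0; v < numOthers; ++v) {
    if (others[v].trySteal(&base, &end)) {
      for (auto j = base; j != end; ++j) {
        runIteration(j);
      }
    }
  }
}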
The owner does, however, have to tread carefully, since there is a potential race between a thief stealing the last available iteration, and the owner attempting to claim it.
The code for the owner to increment the base, then, looks like this:-
// This is only executed by the thread which owns this set of
// contiguousWork, and that thread is the only one which ever
// changes the base value, so we don't need to use an atomic
// add here.
bool contiguousWork::incrementBase(unsignedType * basep) {
auto oldBase = getBase();
auto oldEnd = getEnd();
// Have we run out of iterations?
if (oldBase >= oldEnd) {
return false;
}
// Update the base value.
setBase(oldBase + 1, std::memory_order_release);
// Load the end again, so that we can see if it changed
// while we were incrementing the base. This thread never
// moves the end, but other threads do.
auto newEnd = getEnd();
// Did someone steal the work we thought we were going to
// take?
if (newEnd == oldBase) {
// Someone stole our last iteration while we were trying
// to claim it.
return false;
}
else {
// We got it. It doesn't matter if the end moved down while
// we were incrementing the base, as long as it is still
// above the base we claimed. (Say we're claiming
// iteration zero, while some other thread steals
// iterations [100,200], that's fine. It doesn't impact on
// us claiming iteration zero.)
*basep = oldBase;
return true;
}
}
Since the bug is here, rather than in the stealing code, I won’t show that (you can look at it in the CpuFun code snippets repository if you want to). At present it uses a std::atomic::compare_exchange_weak operation to update the end while also ensuring that the base hasn’t moved while it’s doing that7.
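For completeness, though, here is a sketch of what such a trySteal could look like, written to match that description; the code in the repository differs in detail, stealing half of what remains is an arbitrary choice for illustration, and the memory orders shown are just one plausible choice. It relies on the union in the class above so that the (base, end) pair can be updated with a single wide compare_exchange.
bool contiguousWork::trySteal(unsignedType * basep, unsignedType * endp) {
  for (;;) {
    // Snapshot the victim's bounds (not atomic as a pair; see the comment in
    // the constructor above which copies from a pointer).
    auto oldBase = getBase();
    auto oldEnd = getEnd();
    if (oldBase >= oldEnd) {
      return false;                               // Nothing left to steal.
    }
    auto amount = (oldEnd - oldBase + 1) / 2;     // Take about half of what remains.
    // Reading .pair below relies on the union layout, as the class is
    // clearly designed to allow.
    contiguousWork expected(oldBase, oldEnd);         // What we believe is there.
    contiguousWork desired(oldBase, oldEnd - amount); // Same base, lower end.
    // Wide CAS over both values at once: it fails if the owner moved the base,
    // or another thief moved the end, since we read them; in that case retry.
    if (atomicPair.compare_exchange_weak(expected.pair, desired.pair,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed)) {
      *basep = oldEnd - amount;                   // We own [oldEnd-amount, oldEnd).
      *endp = oldEnd;
      return true;
    }
  }
}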
What Happens?
When we run the code as above, it works fine on AArch64, but fails (some iterations are executed by more than one thread) on X86_64.
As we said before, this is counter-intuitive: we are clearly seeing a race, but we’d expect it to show up on the AArch64 machine with its more relaxed memory semantics, rather than on the X86_64 machine, where the memory model is more restrictive.
How is it Supposed to Work?
The race condition that we are trying to avoid occurs when the owner thread is incrementing the base to claim the final iteration, while a thief is also pulling end downwards attempting to steal that same, final, iteration.
The first possible race would be like this
Base, End   Owner                            Thief
5, 6        oldBase = getBase(); (value 5)   Read end (6)
5, 6        setBase(oldBase+1)
6, 6                                         Store end-1 (5)
6, 5
If that happened, both the thief and owner would think that they owned iteration 5. Oops!
However, we know this can’t happen, because the thief’s attempt to store is performed by an atomic compare_exchange operation, which is checking that the value of end is still what it was before (here 5).
The second possible race would be like this
Base, End   Owner                       Thief
5, 6        oldBase = getBase() // 5
                                        Read end // 6
                                        Store end-1 // 5
5, 5        setBase(oldBase+1) // 6
6, 5
Again, both threads think they own iteration 5. But our code is trying to prevent this by checking whether the end value changed while the owner was performing the store: we re-read end after updating the base, so that we can tell whether the last iteration was stolen while the owner thread was updating the base.
So the intent is to have the code work like this:-
Base, End   Owner                                   Thief
5, 6        oldBase = getBase() // 5
            oldEnd = getEnd() // 6
                                                    Read end // 6
                                                    Store end-1 // 5
5, 5        setBase(oldBase+1) // 6
6, 5        newEnd = getEnd() // 5
            if (newEnd == oldBase) race detected
However, since we’re seeing this race happening, clearly this isn’t working!
So, What Is Going On?
To understand that we need to look in detail at the memory ordering of the operations, the instructions that are generated, and the architecture documents.
Memory Ordering
We can see that our code is using std::memory_order_acquire for both getBase() and getEnd(), and std::memory_order_release for the setBase() operation, so the loads with acquire ensure that no subsequent loads float above them, while the store with release ensures that all preceding stores have completed before it does.
Diving into the Architectures
AArch64
Here’s the code generated by LLVM for AArch64
contiguousWork::incrementBase(unsigned int*): // @contiguousWork::incrementBase(unsigned int*)
add x9, x0, #4 // =4
ldar w8, [x0]
ldar w10, [x9]
cmp w8, w10
b.hs .LBB0_2
add w10, w8, #1 // =1
stlr w10, [x0]
ldar w9, [x9]
cmp w9, w8
b.ne .LBB0_3
.LBB0_2:
mov w0, wzr
ret
.LBB0_3:
mov w0, #1
str w8, [x1]
ret
We can see that the AArch64 code is using ldar and stlr instructions to enforce the memory ordering we have requested. If we search for information about those instructions, we find this. It explains that the instructions do what we asked for, but that they also enforce stronger ordering than we had asked for:-
Load-Acquire (LDAR): All loads and stores that are after an LDAR in program order, and that match the shareability domain of the target address, must be observed after the LDAR.
Store-Release (STLR): All loads and stores preceding an STLR that match the shareability domain of the target address must be observed before the STLR.
So, they are enforcing a strict ordering, preventing both loads and stores from moving over these instructions, whereas we only asked for ordering between the load operations themselves, and between the store operations themselves. We haven’t specified any ordering (other than the obvious, semantically required, one of data dependence) between loads and stores.
X86_64
Here’s the code generated by LLVM for X86_64
contiguousWork::incrementBase(unsigned int*): # @contiguousWork::incrementBase(unsigned int*)
mov eax, dword ptr [rdi]
mov ecx, dword ptr [rdi + 4]
cmp eax, ecx
jae .LBB0_1
lea ecx, [rax + 1]
mov dword ptr [rdi], ecx
mov ecx, dword ptr [rdi + 4]
cmp ecx, eax
je .LBB0_4
mov dword ptr [rsi], eax
.LBB0_4:
cmp ecx, eax
setne al
ret
.LBB0_1:
xor eax, eax
ret
We can see that there are no atomic instructions or memory fences being used. The simple mov (the mnemonic for both load and store) is being used for all of the memory operations.
That is OK because the stricter X86_64 memory model already guarantees that loads are ordered with respect to loads and stores are ordered with respect to stores, which is what we asked for.
So, Where’s the Bug?
We know that the X86_64 code is breaking, but the AArch64 code works, so it’s worth thinking a bit more about the additional ordering that is being enforced by the AArch64 instructions as compared to what X86_64 is doing to see if that gives us a hint.
As we saw, the AArch64 code is preventing ldar-tagged loads and stlr-tagged stores from moving past each other; however, the X86_64 memory model allows a store to be reordered with a later load from a different address.
So what’s happening is that the read of end, which in the code comes after the update of base, is sometimes floating above it, as there is no ordering between these two instructions
mov dword ptr [rdi], ecx # store base
mov ecx, dword ptr [rdi + 4] # load end
In effect the X86_64 CPU is sometimes executing the code as if it had been written like this
bool contiguousWork::incrementBase(unsignedType * basep) {
auto oldBase = getBase();
auto oldEnd = getEnd();
// Have we run out of iterations?
if (oldBase >= oldEnd) {
return false;
}
// Load the end again, so that we can see if it changed
// while we were incrementing the base. This thread never
// moves the end, but other threads do.
auto newEnd = getEnd();
/**** Possible undetected update of end here ***/
// Update the base value.
setBase(oldBase + 1, std::memory_order_release);
// Did someone steal the work we thought we were going to
// take?
if (newEnd == oldBase) {
// Someone stole our last iteration while we were trying
// to claim it.
return false;
}
else {
// We got it. It doesn't matter if the end moved down while
// we were incrementing the base, as long as it is still
// above the base we claimed. (Say we're claiming
// iteration zero, while some other thread steals
// iterations [100,200], that's fine. It doesn't impact on
// us claiming iteration zero.)
*basep = oldBase;
return true;
}
}
Clearly, when executed like that, there is the possibility that the thief can get in and change the end value between this code reading end (and storing it in newEnd) and the update of the base value. Since the victim’s store happens after the thief’s, the thief’s compare_exchange will still succeed, so it believes it has successfully stolen the last iteration, while, at the same time, the victim won’t see the thief’s store, so believes that it has successfully claimed the same iteration.
Ouch!
Of course, this is not a bug anywhere other than in our code! We have not told the compiler that we need any ordering here, so this is our fault, not that of the compiler or hardware implementation.
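In fact, the C++ memory model explicitly allows this kind of reordering. A minimal sketch of the classic “store buffering” litmus test shows the shape of the problem (this is illustrative code, not from LOMP; the names x, y, thread0, and thread1 are invented for the sketch):
#include <atomic>
#include <thread>

std::atomic<int> x{0}, y{0};
int r0, r1;

void thread0() {
  x.store(1, std::memory_order_release);   // like our setBase(..., release)
  r0 = y.load(std::memory_order_acquire);  // like our second getEnd()
}

void thread1() {
  y.store(1, std::memory_order_release);
  r1 = x.load(std::memory_order_acquire);
}

int main() {
  std::thread t0(thread0), t1(thread1);
  t0.join();
  t1.join();
  // With release/acquire the outcome r0 == 0 && r1 == 0 is permitted, because
  // nothing orders a store with a later load of a different location. On
  // X86_64 these compile to plain movs and that outcome is observable; on
  // AArch64, compiled to the stlr/ldar we saw above, the architecture happens
  // to forbid it, which is exactly why the bug only showed up on X86_64.
  // Making all four operations seq_cst forbids it everywhere.
  return 0;
}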
What’s the Solution?
Clearly we need to enforce stricter ordering between those two instructions, which we can do by telling the compiler that the store to base must have sequential consistency (i.e. ensure that all memory operations before it have completed before it does, and that none from after it have started until it has finished). That’s a simple fix; instead of writing
// Update the base value.
setBase(oldBase + 1, std::memory_order_release);
we write
// Update the base value.
setBase(oldBase + 1, std::memory_order_seq_cst);
If we compile the code with that fix, the AArch64 code remains identical, but the X86_64 code now looks like this:-
contiguousWork::incrementBase(unsigned int*): # @contiguousWork::incrementBase(unsigned int*)
mov eax, dword ptr [rdi]
mov ecx, dword ptr [rdi + 4]
cmp eax, ecx
jae .LBB0_1
lea ecx, [rax + 1]
xchg dword ptr [rdi], ecx # store to base
mov ecx, dword ptr [rdi + 4]
cmp ecx, eax
je .LBB0_4
mov dword ptr [rsi], eax
.LBB0_4:
cmp ecx, eax
setne al
ret
.LBB0_1:
xor eax, eax
ret
You can see that the store to base, which was a simple mov, has been replaced with an xchg, which is always an atomic operation in X86 architectures (despite not having a lock prefix). The compiler is probably using it here to save a few bytes of code space8. Since all atomic operations on X86 architectures enforce full memory barriers, this achieves the effect we asked for. And, indeed, the bug has been vanquished.
Isn’t There an Easier Way to Find the Bug?
The answer here is the all-purpose answer in High Performance Computing: “It depends”.
Certainly there are race detection tools, such as Intel®’s Inspector, and GCC’s or LLVM’s -fsanitize=thread option.
However, Intel Inspector doesn’t seem to be installed on the X86 machine I have access to, and nor are the tsan libraries needed to make the compiler option work!
If I force the bug into the AArch64 code (by explicitly tagging the relevant store as having relaxed memory semantics), then the bug does show up, but not when run under LLVM’s sanitizer…
So, definitely worth a try, but not a perfect answer.
What Have We Learned?
Handling thread interactions without using locks is hard.
Which architecture’s memory model is more restrictive is not always easy to tell.
Although only one thread is writing the base location we have still ended up needing an atomic operation (or memory fence, which can have significant performance impact), so this may all have been a waste of time from the performance point of view9.
Acknowledgements
This work used the Isambard UK National Tier-2 HPC Service, operated by GW4 and the UK Met Office, and funded by EPSRC (EP/T022078/1).
Thanks to the people who reviewed this for me before publication.
I have tried to use appropriate copyright and trademark symbols at the first mention of the relevant name, however it is likely I have missed some. So “Other names may be the property of others.”
We cannot assert that the loop will definitely abort, since monotonic sequences are a subset of nonmonotonic ones, and a nonmonotonic implementation might just happen to generate a monotonic sequence in any specific execution. Also, compiler and runtime writers don’t like breaking user code which used to work, so may be loath to make the change, even though the standard allows it. <rant> This aligns with the general view of programmers that a language standard is something that affects a compiler, but not their code, and that if their code works with one compiler that means any compiler it doesn’t work with is clearly broken. Whereas, in reality, a language standard is a contract that binds both the compiler and the programmer. The compiler only has to make the code work if the code obeys the rules in the standard. </rant>
OpenMP already supports a “guided” schedule, which is another way to reduce the number of atomic operations required to schedule the loop; however, that will not work well if the final few iterations are the longest running, for instance in a triangular increasing loop, like this:-
#pragma omp for
for (auto i=0; i<N; i++) {
for (auto j=0; j<i; j++) {
constant_work();
}
}
Similarly, one could tell the OpenMP implementation to block the iterations, but that might introduce significant load imbalance (for instance with the increasing triangular case shown, since all of the most expensive iterations will be aggregated together), and also reduces the available parallelism.
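For reference, those alternatives look like this in the source (a sketch only; the chunk size of 16 is an arbitrary choice for illustration):
#pragma omp for schedule(guided)       // chunks of decreasing size
for (auto i=0; i<N; i++) {
  ...
}

#pragma omp for schedule(dynamic, 16)  // fixed chunks of 16 iterations
for (auto i=0; i<N; i++) {
  ...
}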
Full details of the machine and test case are on slide 24 of the presentation, so not repeated here.
There is a lot more fun to be had here in deciding whom to steal from and how many iterations of those they have not yet started to steal, but none of that is relevant to this post.
A compile-time known static schedule still has the advantage that the compiler can generate better code for it than for any dynamic schedule, but other than that…
The full code has additional complexity to handle detecting when the loop has completed, handling different width loop counters, choosing appropriate double width types, and so on. You can see it in the LOMP library code in loops.h and loops.cc.
If you’re thinking “Hmm, couldn’t it use the same trick and check the base before and after rather than using a double-width compare_exchange?”, you’re right, it could, and that should be faster. It’s another optimisation that hasn’t got into the code yet!
It would certainly also be possible to insert an mfence instruction to explicitly describe the full memory fence, but that would be an extra instruction.
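At the C++ level, the way to ask for that would be something like the sketch below (the exact instruction is up to the compiler, though LLVM and GCC typically emit an mfence for a sequentially consistent fence on X86_64; whether this is sufficient rests on the same kind of reasoning as the seq_cst store fix).
// Keep the release store, but follow it with an explicit full fence before
// re-reading the end.
setBase(oldBase + 1, std::memory_order_release);
std::atomic_thread_fence(std::memory_order_seq_cst);
auto newEnd = getEnd();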
Yup, I should run the scheduling benchmarks again!
I think the fix for the bug just happened to work on x86_64, and isn't guaranteed to work on every architecture. In particular, notice that if you had made the second call to getEnd be sequentially consistent instead of setBase, then the assembly would have been unchanged and the bug would have remained, despite the explanation making it sound like that would have fixed it too. The problem is that making an operation sequentially consistent has no effect on operations on different variables that aren't sequentially consistent. I think the best way to fix the problem is just to make both of those functions be sequentially consistent. Then the C++ standard guarantees that they'll remain in order even on other compilers and architectures you haven't thought of, and it shouldn't make your code any slower, since the generated assembly will still be what it is now.
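In code, that suggestion amounts to something like:
// Make both halves of the check sequentially consistent, so the standard
// itself guarantees their ordering on any conforming compiler and architecture.
setBase(oldBase + 1, std::memory_order_seq_cst);
auto newEnd = getEnd(std::memory_order_seq_cst);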