In addition to sheer performance, we might add that algorithms that are supposed to be "wait free" can't really be said to be wait free if the atomic operations they rely on are implemented with a loop under the hood. So, on Armv8 they aren't really wait free, on Armv8.1 they are.
I don't think I entirely agree with that, since it seems to me that the details of the implementation of the atomic operation don't affect the blocking semantics of the algorithm. It seems perverse to say that an implementation of an atomic which uses compare-and-swap [CAS], so necessarily has a loop, is different from one that implements the same operation inside a single instruction. Either of the implementations has to ensure that the relevant data is not updated by any other thread (or DMA access) during the update, and that enforces the required serialisation. The CAS implementation may be less efficient since the relevant cache line may bounce around the machine more, but it still has the requisite global forward progress guarantee, since for my CAS to fail, someone else's must have succeeded.
Of course, there may be other issues if the implementation was based on taking a single, global, lock which locked out all such operations independent of the address being updated, since that would destroy the global forward progress guarantee (as updates on one address could then prevent updates on another from ever happening).
Upon re-reading that comment it seems that you are confusing wait free with lock free? The global progress guarantee is only enough to make an algorithm lock free. As the Wikipedia says: "An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes". Or are you saying that the single instruction version really is doing a loop that isn't visible in the assembly code?
The definition of wait-free to which you (implicitly) point says nothing about having no polling loops, rather it says that there should be guaranteed per-thread progress. The description (and the related paper) also point out that they use compare-and-swap (CAS) which, since it can fail, does not need an internal loop even when one is forced to emulate it using LL-SC. Therefore changes in the underlying implementation of other atomic operations would not affect this.
"[t]he definition of wait-free to which you (implicitly) point says nothing about having no polling loops"
This is just plain wrong. If an operation depends on a polling loop, it can not be guaranteed to be completed within a bounded number of steps.
Or, as it says here: "A wait-free data structure is a lock-free data structure with the additional property that every thread accessing the data structure can make complete its operation within a bounded number of steps, regardless of the behaviour of other threads. Algorithms that can involve an unbounded number of retries due to clashes with other threads are thus not wait-free."
A loop that keeps retrying when it's intercepted by another thread is not wait-free. It may be lock free, but not wait free. Or, as you state yourself: "...rather it says that there should be guaranteed per-thread progress". While the thread is in the polling loop, it is not making progress.
I mis-phrased, what I meant was that loops are permitted as long as there is an upper bound on the number of iterations, so you are polling, but know that there is a an upper bound. For instance, you could allocate each thread a time-slot, but make it wait (and poll) until that slot becomes available. What is required is that there is a guarantee of thread progress within a bounded time, not that there are no loops.
The paper which WIkipedia claims made this popular (http://www.cs.technion.ac.il/~erez/Papers/wfquque-ppopp.pdf) has threads helping each other out, so clearly the progress of one thread will depend on how many operations other threads have created which it then performs, so there is an upper bound on how long it will be delayed, but it is executing a loop...
The critical issue then becomes whether the atomic instructions themselves guarantee any forward progress of the thread performing the operation, a property which is often deeply buried in the architecture or programming manual!
I see. So what you are saying is that the single instruction atomic increment might itself involve a similar unbounded process but we don't know. Fair enough. I guess it does not matter much if you have access to Armv8.1 anyway, since the single instruction version is much faster as well.
But anyone who writes a loop in his own code, who claim it's wait free, would be corrected. I don't really see the difference between writing code that includes a loop, and writing code that is translated to a loop by the compiler.
Where ARM LDX/STX really shine is in more complex operations. You can do pretty-much-arbitrary operations on up to 128b of data, atomically and locklessly, without suffering from the ABA problem, and with hardware forward progress guarantees.
This is _great_ for things like MPMC queues. Much of the classical complexity gets dropped and you end up with surprisingly readable (and performant) code.
(Ditto for atomic updates of things like status words and such. Expressing things like "if field A, increment field B, if overflow set flag C" (anonymized from an actual usecase) is simple with LDX/STX but between difficult and impossible in classical atomics (depending on how much you care about forward progress and ABA issues))
Unfortunately this was DOA the moment ARM recommended using intrinsics for these operations. You can't access the full power of LDX/STX using intrinsics, mainly due to the memory access requirements (you can't access other memory between the LDX/STX, which compilers won't guarantee), which means that you end up having to rewrite operations into load / CAS loops... which drops all of the niceities of LDX/STX on the floor (no forward progress guarantees, ABA pops up as an issue again, etc, etc) and ends up with them being worse than classical atomics.
Still great if you don't mind writing (extended) asm though. Especially in applications where you expect the atomics to be lightly loaded (and hence throughput more important than time in the critical section) (e.g. doing atomic updates to semi-random locations in a 100MB array)
The ISA here is very similar to Intel's Transactional Synchronization Extensions (TSX), so, for instance, the OpenMP specification already has a way for you to ask for transactional locks. I'd expect the use cases to look similar too, so my talk from 2018 remains relevant. https://www.openmp.org/wp-content/uploads/SC18-BoothTalks-Cownie.pdf
TME avoids some of this... or would if anyone could actually get it correct. I'm not exactly encouraged given the number of times Intel has tried and failed here.
That being said, TME still has major issues with lockless code given the whole "lack of progress guarantees" thing. Far too often the "fallback" code ends up hurting the fastpath for various reasons.
Again, consider my previous example of "if field A, increment field B, if overflow set flag C". With LDX/STX, this is doable in a fairly simple manner. Fastpath performance is about as good as you can get here (so far as I am aware), and contended performance is alright. (Depends on the details of the cluster cache coherency.)
With TME without forward progress guarantees... is there a way I'm missing to avoid really nasty tail behavior here?
LDX/STX:
extern struct foo_s *loc;
register struct foo_s foo;
do {
foo = LDX(loc);
foo.b += !!foo.a;
foo.c |= (!!foo.a & !foo.b);
} while (!STX(foo, loc));
(Note that other cores, even ones accessing the same location, continue to make forward progress even if a core stalls. Also note that this behaves fairly gracefully even under heavy contention... a little too weighted towards fairness at the expense of throughput sometimes (cache line pingpong between cores where overall throughput would be better if a cache line stuck around on a single core long enough to batch several such operations), but still decent.)
TME:
extern struct foo_s *loc;
struct lock_s lock;
if (xbegin()) {
take_lock_full(&lock);
} else if (is_locked(&lock)) {
xabort();
}
dmb(); //...I don't know if this is necessary? I don't see the memory ordering implications of TME noted anywhere.
if (loc->a) {
loc->b += 1;
if (!loc->b) {
loc->c = true;
}
}
dmb(); //...I don't know if this is necessary? I don't see the memory ordering implications of TME noted anywhere.
if (is_locked(&lock)) {
release_lock_full(&lock);
} else {
xcommit();
}
(Note that a single core being stalled can stall the world here, if it gets stalled while holding the lock. Also note that the worst-case behavior is significantly worse than the 'good' case.)
This is a whole lot better than a "standard" lock, assuming that your hardware supports it. Unfortunately, it's still rather worse than LDX/STX in several ways. For one, either your lock is fine-grained in which case you're using a fair bit of memory for locks or it's coarse-grained in which case your fallback performance is terrible. (You can mitigate this somewhat by using something like Rust's parking lot which essentially does a hash of address -> corresponding lock... but this itself can get slow.) For another, you still have more cache traffic than the LDX/STX, which hurts the fastpath (you're reading the lock still...), and your slowpath is now even worse than a 'normal' lock, much less LDX/STX. Ditto, you have two barriers (either explicit, or implicit, depending on how exactly TME ends up defined), which can further hurt throughput. And this also has really nasty dynamic failure modes (e.g. program running, thread needs to fall back to slowpath... which then takes the lock, which then causes other threads to need to fall back to slowpath, etc, etc.).
I am excited for true transactional memory because it allows you to easily get non-terrible performance in a relatively easy-to-reason-about manner. 'TME' defined in a manner where an implementation can "support" it but simply reject all transactions... not so much. You still have all of the nasty reasoning of a 'standard' solution, and _additional_ edge cases and failure modes from the TME portion (especially given that control flow is different inside a transaction - how do you debug when you have a failure that's only present inside a transaction, given that a debugger will almost certainly end up failing all transactions? This is even a problem with LDX/STX... though at least with LDX/STX you have exactly one store and it's at the very end.). And meanwhile it's _still_ not as fast as standard atomics.
(Also, am I reading those TME intrinsics correctly in that you need to check for (status == 0 || (status & _TMFAILURE_RTRY))? If so... eww. Both from a source code perspective (you can't just do a if(xbegin()), you need to pull it out into a variable or wrap xbegin in e.g. an inline function) and from an assembly perspective (I think the cleanest test is cbz + tbz ..., #15). Wish they had a DO_NOT_RETRY bit instead, in which case you could just check said bit...)
TBH, I don't have a good theory for that. The simple {load,store} x {modified,unmodified} to a single line as you move that line across the machine graph doesn't show anything there (the "Placement" test in the loadsStores.cc benchmark), and nor does the similar graph for increased sharing with load/store (the "Sharing" test in the same BM).
Does it go away if you pin threads to specific cores? Does it go away if you pin all threads to one of the two clusters? I wonder if e.g. the OS is trying to load-balance between the two clusters and as such bounces a thread between the two clusters in the odd case. (As in this particular benchmark temporarily stalling a couple of cores would _increase_ overall throughput.)
(I've seen this sort of 'stalling increasing effective throughput' behavior a few times now... where e.g. core A issues an SGI to all the other cores then core A's task makes a bunch of progress because suddenly a normally-contended memory location is uncontended until the other cores process said SGI. Requires a relatively slow SGI however, and doesn't work if e.g. you're using locks (as likely one of the other cores currently holds said lock).)
"Does it go away if you pin threads to specific cores?"
From the Boilerplate section : "In the experiments here we are running one thread/core and tightly binding the threads to a single logicalCPU."
So, no, it doesn't go away because that's what I was already measuring.
In general, it is certainly the case that applying sensible back-off under contention when implementing CMPXCHG (or LDX/STX) atomics can improve machine throughput.
There are benchmarks for that in atomics.cc code; see the three different implementation of FP increment;
The benchmark can also show you stats about the number of times the CMPXCHG fails in the FP add case.
What's happening here is that each failed cmpxchg (or STX) represents a data movement that was non-productive (it will have to be repeated), since we're data-movement bound, reducing such wasted movements increases overall performance.
However, you need to be careful about what you wish for, since we already know that the highest throughput can be achieved by being completely unfair, so if you asked a malicious demon for a machine with the highest possible contended atomic throughput, they would give you a single core machine!
In reality many cores need to execute the atomic, so we need implementations that work for that.
In addition to sheer performance, we might add that algorithms that are supposed to be "wait free" can't really be said to be wait free if the atomic operations they rely on are implemented with a loop under the hood. So, on Armv8 they aren't really wait free, on Armv8.1 they are.
I don't think I entirely agree with that, since it seems to me that the details of the implementation of the atomic operation don't affect the blocking semantics of the algorithm. It seems perverse to say that an implementation of an atomic which uses compare-and-swap [CAS], so necessarily has a loop, is different from one that implements the same operation inside a single instruction. Either of the implementations has to ensure that the relevant data is not updated by any other thread (or DMA access) during the update, and that enforces the required serialisation. The CAS implementation may be less efficient since the relevant cache line may bounce around the machine more, but it still has the requisite global forward progress guarantee, since for my CAS to fail, someone else's must have succeeded.
Of course, there may be other issues if the implementation was based on taking a single, global, lock which locked out all such operations independent of the address being updated, since that would destroy the global forward progress guarantee (as updates on one address could then prevent updates on another from ever happening).
Upon re-reading that comment it seems that you are confusing wait free with lock free? The global progress guarantee is only enough to make an algorithm lock free. As the Wikipedia says: "An algorithm is wait-free if every operation has a bound on the number of steps the algorithm will take before the operation completes". Or are you saying that the single instruction version really is doing a loop that isn't visible in the assembly code?
The definition of wait-free to which you (implicitly) point says nothing about having no polling loops, rather it says that there should be guaranteed per-thread progress. The description (and the related paper) also point out that they use compare-and-swap (CAS) which, since it can fail, does not need an internal loop even when one is forced to emulate it using LL-SC. Therefore changes in the underlying implementation of other atomic operations would not affect this.
Sorry, forgot the link: https://www.justsoftwaresolutions.co.uk/threading/non_blocking_lock_free_and_wait_free.html
"[t]he definition of wait-free to which you (implicitly) point says nothing about having no polling loops"
This is just plain wrong. If an operation depends on a polling loop, it can not be guaranteed to be completed within a bounded number of steps.
Or, as it says here: "A wait-free data structure is a lock-free data structure with the additional property that every thread accessing the data structure can make complete its operation within a bounded number of steps, regardless of the behaviour of other threads. Algorithms that can involve an unbounded number of retries due to clashes with other threads are thus not wait-free."
A loop that keeps retrying when it's intercepted by another thread is not wait-free. It may be lock free, but not wait free. Or, as you state yourself: "...rather it says that there should be guaranteed per-thread progress". While the thread is in the polling loop, it is not making progress.
I mis-phrased, what I meant was that loops are permitted as long as there is an upper bound on the number of iterations, so you are polling, but know that there is a an upper bound. For instance, you could allocate each thread a time-slot, but make it wait (and poll) until that slot becomes available. What is required is that there is a guarantee of thread progress within a bounded time, not that there are no loops.
The paper which WIkipedia claims made this popular (http://www.cs.technion.ac.il/~erez/Papers/wfquque-ppopp.pdf) has threads helping each other out, so clearly the progress of one thread will depend on how many operations other threads have created which it then performs, so there is an upper bound on how long it will be delayed, but it is executing a loop...
The critical issue then becomes whether the atomic instructions themselves guarantee any forward progress of the thread performing the operation, a property which is often deeply buried in the architecture or programming manual!
I see. So what you are saying is that the single instruction atomic increment might itself involve a similar unbounded process but we don't know. Fair enough. I guess it does not matter much if you have access to Armv8.1 anyway, since the single instruction version is much faster as well.
But anyone who writes a loop in his own code, who claim it's wait free, would be corrected. I don't really see the difference between writing code that includes a loop, and writing code that is translated to a loop by the compiler.
Where ARM LDX/STX really shine is in more complex operations. You can do pretty-much-arbitrary operations on up to 128b of data, atomically and locklessly, without suffering from the ABA problem, and with hardware forward progress guarantees.
This is _great_ for things like MPMC queues. Much of the classical complexity gets dropped and you end up with surprisingly readable (and performant) code.
(Ditto for atomic updates of things like status words and such. Expressing things like "if field A, increment field B, if overflow set flag C" (anonymized from an actual usecase) is simple with LDX/STX but between difficult and impossible in classical atomics (depending on how much you care about forward progress and ABA issues))
Unfortunately this was DOA the moment ARM recommended using intrinsics for these operations. You can't access the full power of LDX/STX using intrinsics, mainly due to the memory access requirements (you can't access other memory between the LDX/STX, which compilers won't guarantee), which means that you end up having to rewrite operations into load / CAS loops... which drops all of the niceities of LDX/STX on the floor (no forward progress guarantees, ABA pops up as an issue again, etc, etc) and ends up with them being worse than classical atomics.
Still great if you don't mind writing (extended) asm though. Especially in applications where you expect the atomics to be lightly loaded (and hence throughput more important than time in the critical section) (e.g. doing atomic updates to semi-random locations in a 100MB array)
In this context the upcoming support for Transactional Memory Extensions (TME) in the V9-a architecture seems relevant. That will allow much more general "atomic" sections with ABA protection,(but not progress guarantees; you need back-off to something else [normally a lock]). See https://developer.arm.com/documentation/101028/0012/16--Transactional-Memory-Extension--TME--intrinsics
The ISA here is very similar to Intel's Transactional Synchronization Extensions (TSX), so, for instance, the OpenMP specification already has a way for you to ask for transactional locks. I'd expect the use cases to look similar too, so my talk from 2018 remains relevant. https://www.openmp.org/wp-content/uploads/SC18-BoothTalks-Cownie.pdf
TME avoids some of this... or would if anyone could actually get it correct. I'm not exactly encouraged given the number of times Intel has tried and failed here.
That being said, TME still has major issues with lockless code given the whole "lack of progress guarantees" thing. Far too often the "fallback" code ends up hurting the fastpath for various reasons.
Again, consider my previous example of "if field A, increment field B, if overflow set flag C". With LDX/STX, this is doable in a fairly simple manner. Fastpath performance is about as good as you can get here (so far as I am aware), and contended performance is alright. (Depends on the details of the cluster cache coherency.)
With TME without forward progress guarantees... is there a way I'm missing to avoid really nasty tail behavior here?
LDX/STX:
extern struct foo_s *loc;
register struct foo_s foo;
do {
foo = LDX(loc);
foo.b += !!foo.a;
foo.c |= (!!foo.a & !foo.b);
} while (!STX(foo, loc));
(Note that other cores, even ones accessing the same location, continue to make forward progress even if a core stalls. Also note that this behaves fairly gracefully even under heavy contention... a little too weighted towards fairness at the expense of throughput sometimes (cache line pingpong between cores where overall throughput would be better if a cache line stuck around on a single core long enough to batch several such operations), but still decent.)
TME:
extern struct foo_s *loc;
struct lock_s lock;
if (xbegin()) {
take_lock_full(&lock);
} else if (is_locked(&lock)) {
xabort();
}
dmb(); //...I don't know if this is necessary? I don't see the memory ordering implications of TME noted anywhere.
if (loc->a) {
loc->b += 1;
if (!loc->b) {
loc->c = true;
}
}
dmb(); //...I don't know if this is necessary? I don't see the memory ordering implications of TME noted anywhere.
if (is_locked(&lock)) {
release_lock_full(&lock);
} else {
xcommit();
}
(Note that a single core being stalled can stall the world here, if it gets stalled while holding the lock. Also note that the worst-case behavior is significantly worse than the 'good' case.)
This is a whole lot better than a "standard" lock, assuming that your hardware supports it. Unfortunately, it's still rather worse than LDX/STX in several ways. For one, either your lock is fine-grained in which case you're using a fair bit of memory for locks or it's coarse-grained in which case your fallback performance is terrible. (You can mitigate this somewhat by using something like Rust's parking lot which essentially does a hash of address -> corresponding lock... but this itself can get slow.) For another, you still have more cache traffic than the LDX/STX, which hurts the fastpath (you're reading the lock still...), and your slowpath is now even worse than a 'normal' lock, much less LDX/STX. Ditto, you have two barriers (either explicit, or implicit, depending on how exactly TME ends up defined), which can further hurt throughput. And this also has really nasty dynamic failure modes (e.g. program running, thread needs to fall back to slowpath... which then takes the lock, which then causes other threads to need to fall back to slowpath, etc, etc.).
I am excited for true transactional memory because it allows you to easily get non-terrible performance in a relatively easy-to-reason-about manner. 'TME' defined in a manner where an implementation can "support" it but simply reject all transactions... not so much. You still have all of the nasty reasoning of a 'standard' solution, and _additional_ edge cases and failure modes from the TME portion (especially given that control flow is different inside a transaction - how do you debug when you have a failure that's only present inside a transaction, given that a debugger will almost certainly end up failing all transactions? This is even a problem with LDX/STX... though at least with LDX/STX you have exactly one store and it's at the very end.). And meanwhile it's _still_ not as fast as standard atomics.
(Also, am I reading those TME intrinsics correctly in that you need to check for (status == 0 || (status & _TMFAILURE_RTRY))? If so... eww. Both from a source code perspective (you can't just do a if(xbegin()), you need to pull it out into a variable or wrap xbegin in e.g. an inline function) and from an assembly perspective (I think the cleanest test is cbz + tbz ..., #15). Wish they had a DO_NOT_RETRY bit instead, in which case you could just check said bit...)
Until we have a TME implementation to poke at it is hard to know how well it will work. (Duh!).
Note, too, that Intel seem to have completely given up on TSX (to the extend of removing it from existing CPUs with a u-code update), https://en.wikipedia.org/wiki/Transactional_Synchronization_Extensions#:~:text=In%20June%202021%2C%20Intel%20published,Performance%20Monitoring%20Unit%20(PMU).
Great read!
Any idea what might be causing the odd-even throughput fluctuations?
TBH, I don't have a good theory for that. The simple {load,store} x {modified,unmodified} to a single line as you move that line across the machine graph doesn't show anything there (the "Placement" test in the loadsStores.cc benchmark), and nor does the similar graph for increased sharing with load/store (the "Sharing" test in the same BM).
Does it go away if you pin threads to specific cores? Does it go away if you pin all threads to one of the two clusters? I wonder if e.g. the OS is trying to load-balance between the two clusters and as such bounces a thread between the two clusters in the odd case. (As in this particular benchmark temporarily stalling a couple of cores would _increase_ overall throughput.)
(I've seen this sort of 'stalling increasing effective throughput' behavior a few times now... where e.g. core A issues an SGI to all the other cores then core A's task makes a bunch of progress because suddenly a normally-contended memory location is uncontended until the other cores process said SGI. Requires a relatively slow SGI however, and doesn't work if e.g. you're using locks (as likely one of the other cores currently holds said lock).)
"Does it go away if you pin threads to specific cores?"
From the Boilerplate section : "In the experiments here we are running one thread/core and tightly binding the threads to a single logicalCPU."
So, no, it doesn't go away because that's what I was already measuring.
In general, it is certainly the case that applying sensible back-off under contention when implementing CMPXCHG (or LDX/STX) atomics can improve machine throughput.
There are benchmarks for that in atomics.cc code; see the three different implementation of FP increment;
{'f', doFPIncrement, "float (no backoff)"},
{'e', doFPIncrementRE, "float (random e**x backoff)"},
{'t', doFPIncrementTTAS, "float (TTAS)"}};
(TTAS == "Test and Test&Set")).
The benchmark can also show you stats about the number of times the CMPXCHG fails in the FP add case.
What's happening here is that each failed cmpxchg (or STX) represents a data movement that was non-productive (it will have to be repeated), since we're data-movement bound, reducing such wasted movements increases overall performance.
However, you need to be careful about what you wish for, since we already know that the highest throughput can be achieved by being completely unfair, so if you asked a malicious demon for a machine with the highest possible contended atomic throughput, they would give you a single core machine!
In reality many cores need to execute the atomic, so we need implementations that work for that.