Better get those compiler flags right!
In addition to sheer performance, we might add that algorithms that are supposed to be wait-free can't really be called wait-free if the atomic operations they rely on are implemented with a retry loop under the hood. So on Armv8.0, where atomics are built from LDX/STX loops, they aren't really wait-free; on Armv8.1, with its single-instruction atomics, they are.
Where ARM LDX/STX really shine is in more complex operations. You can do pretty-much-arbitrary operations on up to 128 bits of data, atomically and locklessly, without suffering from the ABA problem, and with hardware forward progress guarantees.
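To make the distinction concrete, here is a minimal C++ sketch (the function names are made up for illustration). The first helper is a single atomic read-modify-write: when built with the Armv8.1 atomics enabled (for example `-march=armv8.1-a` or `-march=armv8-a+lse` on GCC/Clang) a compiler can emit one LDADD-family instruction for it, while a baseline Armv8.0 build typically gets a load-exclusive/store-exclusive retry loop instead. The second helper spells the retry loop out at the source level; it is lock-free but not wait-free, no matter what the flags are.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// One atomic RMW. Whether this becomes a single instruction or a
// retry loop on AArch64 depends on the target flags, not on the source.
inline std::uint64_t bump_rmw(std::atomic<std::uint64_t>& c) {
    return c.fetch_add(1, std::memory_order_relaxed);
}

// Explicit CAS loop: a thread can lose the race an unbounded number
// of times, so this is lock-free but not wait-free.
inline std::uint64_t bump_cas(std::atomic<std::uint64_t>& c) {
    std::uint64_t old = c.load(std::memory_order_relaxed);
    while (!c.compare_exchange_weak(old, old + 1, std::memory_order_relaxed)) {
        // On failure 'old' has been refreshed with the current value; retry.
    }
    return old;
}

// Usage sketch: both helpers return the value seen before the increment.
inline void demo_bump() {
    std::atomic<std::uint64_t> c{0};
    const std::uint64_t first  = bump_rmw(c);
    const std::uint64_t second = bump_cas(c);
    assert(first == 0 && second == 1 && c.load() == 2);
    (void)first;
    (void)second;
}
```

Checking the generated assembly is the only reliable way to see which form a given build actually produced.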
This is _great_ for things like MPMC queues. Much of the classical complexity gets dropped and you end up with surprisingly readable (and performant) code.
(Ditto for atomic updates of things like status words. Expressing something like "if field A, increment field B, if overflow set flag C" (anonymized from an actual use case) is simple with LDX/STX, but somewhere between difficult and impossible with classical atomics, depending on how much you care about forward progress and ABA issues.)
Unfortunately this was DOA the moment ARM recommended using intrinsics for these operations. You can't access the full power of LDX/STX through intrinsics, mainly because of the memory access requirements: you can't access other memory between the LDX and the STX, and compilers won't guarantee that. So you end up having to rewrite operations into load / CAS loops, which drops all of the niceties of LDX/STX on the floor (no forward progress guarantees, ABA pops up as an issue again, etc.) and leaves them worse than classical atomics.
Still great if you don't mind writing (extended) asm though, especially in applications where you expect the atomics to be lightly loaded (and hence throughput matters more than time in the critical section), e.g. doing atomic updates to semi-random locations in a 100MB array.
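For reference, a rough sketch of what the load / CAS rewrite looks like for a status-word update of the "if field A, increment field B, if overflow set flag C" kind. The bit layout, the 8-bit width of B, and every name here are invented for illustration, since the real use case is anonymized. Note that the compare only looks at the value of the word and that the loop can retry indefinitely under contention, which is exactly the loss of guarantees described above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical layout (assumption, not from the original use case):
//   bit 0      : A, enable flag
//   bit 1      : C, sticky overflow flag
//   bits 8..15 : B, 8-bit counter
constexpr std::uint32_t kA      = 1u << 0;
constexpr std::uint32_t kC      = 1u << 1;
constexpr unsigned      kBShift = 8;
constexpr std::uint32_t kBMask  = 0xFFu << kBShift;

// Pure function describing the update we want applied atomically.
inline std::uint32_t next_status(std::uint32_t w) {
    if (!(w & kA)) return w;                        // A clear: leave word alone
    std::uint32_t b = ((w & kBMask) >> kBShift) + 1;
    if (b > 0xFFu) {                                // B overflowed its 8 bits
        b = 0;
        w |= kC;
    }
    return (w & ~kBMask) | (b << kBShift);
}

// Load / CAS loop. Returns the word as it was before the update.
inline std::uint32_t update_status(std::atomic<std::uint32_t>& s) {
    std::uint32_t old = s.load(std::memory_order_relaxed);
    for (;;) {
        const std::uint32_t desired = next_status(old);
        if (desired == old) return old;             // A clear: no store needed
        if (s.compare_exchange_weak(old, desired,
                                    std::memory_order_acq_rel,
                                    std::memory_order_relaxed))
            return old;
        // CAS failed: 'old' now holds the current word; recompute and retry.
    }
}

// Usage sketch: counter at 0xFE, two updates, the second one overflows.
inline void demo_status() {
    std::atomic<std::uint32_t> s{kA | (0xFEu << kBShift)};
    update_status(s);
    const std::uint32_t after_one = s.load();
    update_status(s);
    const std::uint32_t after_two = s.load();
    assert(after_one == (kA | (0xFFu << kBShift)));
    assert(after_two == (kA | kC));
    (void)after_one;
    (void)after_two;
}
```

With LDX/STX in hand-written asm the body of `next_status` would sit directly between the exclusive load and the exclusive store; the CAS version has to snapshot, compute, and then hope nothing changed.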
Any idea what might be causing the odd-even throughput fluctuations?