Presenting In-Node Scaling
The previous post discussed issues which apply to any presentation of scaling results. This post describes a different mistake that can be made when presenting in-node scaling on CPUs that support simultaneous multi-threading (SMT, also known as “hyper-threading”), though the same principles can also apply in a Message Passing Interface (MPI) code.
To demonstrate how confusing an SMT system can be, we’ll look at a simple theoretical model like this:-
A problem with an Amdahl serial fraction of 2.5% (which, when running in SMT mode, is still executed by a single SMT thread in the context of the number of threads/core being used[1]).
Running on an SMT machine which can support up to 4 SMT threads per core, where we assume that if the CPU throughput is 1 with a single thread, it can achieve 1.5x that throughput when executing two SMT threads and 2x with four threads[2].
Note that the Amdahl serial fraction is, mostly, a property of the problem, not the machine, while the SMT performance is much more hardware related[3].
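Putting numbers on that model (one plausible formalisation, my reconstruction rather than anything spelled out in the original, though it reproduces the crossover points quoted below): with C cores, t threads/core, serial fraction f = 0.025, and per-thread throughput r(t), so that r(1) = 1, r(2) = 0.75 and r(4) = 0.5, giving a core total throughput of t·r(t) = 1, 1.5 and 2, the time for a problem that takes 1s serially is

time(C, t) = f/r(t) + (1 − f)/(C·t·r(t))

where the serial term is divided by r(t) because, per the first assumption above, the serial code still runs on a single SMT thread at the t-threads/core rate.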
Results
If we look at the theoretical performance of this problem on such a machine, ensuring that in the 2 and 4 threads/core cases we use whole cores, simple mathematics gives us performance data that looks like this (taking the original, serial, run time as 1s):-
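As a sanity check on that arithmetic, here is a minimal Python sketch of the formalisation above (the `run_time` function and its constants are my naming for illustration, not anything from the original post):

```python
# Theoretical run time under the model sketched above: the Amdahl
# serial fraction F runs on one SMT thread at the per-thread rate R[t];
# the parallel remainder is spread across all cores * t threads.
F = 0.025                      # Amdahl serial fraction
R = {1: 1.0, 2: 0.75, 4: 0.5}  # per-thread throughput at t threads/core

def run_time(cores, tpc):
    """Time for a problem that takes 1s when run serially."""
    r = R[tpc]
    return F / r + (1 - F) / (cores * tpc * r)

for cores in (1, 2, 4, 8, 16, 32):
    row = ", ".join(f"{t}T/core {run_time(cores, t):.3f}s" for t in (1, 2, 4))
    print(f"{cores:2d} cores: {row}")
```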
If we plot that as a parallel efficiency, we get this:-
That seems to be saying that 1T/core is always best.
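The sketch above reproduces that impression: holding the thread count fixed and computing parallel efficiency as speedup divided by threads (reusing `run_time` from the sketch above), 1T/core comes out on top every time:

```python
# Parallel efficiency with *threads* on the x-axis (the misleading view).
# Each mode only needs threads // t cores, so the 1T/core runs are
# silently handed t times as much hardware as the 4T/core runs.
for threads in (4, 8, 16, 32):
    effs = ", ".join(
        f"{t}T/core {1 / (threads * run_time(threads // t, t)):.2f}"
        for t in (1, 2, 4)
    )
    print(f"{threads:2d} threads: {effs}")
```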
BUT, it’s not comparing like with like, because we’re using a whole core to run a single thread in the 1T/core case, whereas with higher numbers of threads/core we’re using fewer cores for a given number of threads. That gives the 1T/core case a huge advantage, since it has all the execution resources of the core to itself, such as the full cache and memory bandwidth. Therefore what we should have on the x-axis in all of these graphs is cores, not threads; then we’ll be able to see the impact of the SMT implementation when comparing cases that use the same hardware resources.
Now we can see what is really going on, and that, within this range of cores, 1T/core never wins. Below around 10 cores, 4T/core performs best, and from there up to 32 cores 2T/core is beating 1T/core. (Though, as the serial component becomes more important, 1T/core is going to win fairly soon after that!)
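(For the record, those crossovers fall straight out of the sketched model: time(C, 4) = time(C, 2) where the extra serial cost, 0.05 − 1/30 = 1/60, balances the parallel saving, (0.65 − 0.4875)/C = 0.1625/C, i.e. at C = 9.75 cores; and time(C, 2) = time(C, 1) at C = 0.325/(1/120) = 39 cores, which is indeed “fairly soon after” the 32-core edge of the plot. The per-core efficiency being compared is simply 1/(C·time(C, t)).)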
What Have We Learned?
Just because you explicitly force the number of threads in your OpenMP® code to investigate its scaling doesn’t mean that threads are the right independent variable to use on the x-axis.
When dealing with SMT CPUs be very careful about what to put on the x-axis; it is almost certain to need to be cores.
The x-axis needs to reflect the available hardware being used, not a software abstraction like “threads”, or “Message Passing Interface (MPI) processes”.
Although I have discussed this in terms of in-node scaling, one could clearly get equally confused in an MPI environment where MPI processes either share hardware resources (run on the same node), or don’t (run one process/node).
Thanks
Roger Shepherd for commenting on drafts of this article.
[1] It is certainly possible to argue that serial code should always be run on a CPU that has only one SMT thread executing. However, ensuring that idling threads are not still running (from the point of view of the hardware) may require system calls to sleep and wake the threads. These are themselves expensive, and can cost more time than simply running the serial code more slowly. Also, since this is all intended to be illustrative of the plotting issues (not the details of SMT performance), I don’t think this issue is of critical importance here!
[2] These performance figures are within the range produced by real implementations. For instance, Marvell’s TX3 processor (which was never released as a stand-alone CPU) claimed 1.79x throughput for 2T/C and 2.21x for 4T/C when executing code with high cycles/instruction. (Marvell ThunderX3 Time to Shine at Hot Chips 32).
[3] Yes, I know, many factors can affect the Amdahl serial fraction, including hardware properties and decisions taken in low-level software, but the point of this article is about how to present results sanely when dealing with the machines we’re given, and the choice of problem is merely illustrative.