To sched_yield() Or Not To sched_yield()?
Is Linux' sched_yield() useful while polling, or merely "sound and fury signifying nothing"?
Introduction
This post is prompted by a tweet from @science_dot that seems to be asserting that the Linux sched_yield() system call is useful under over-subscription when a thread is waiting at a barrier¹. I am dubious, so have written a small benchmark to investigate the issue.
What Problem Are We Trying to Solve?
Suppose that we have a parallel code in which there are multiple threads, but more threads than there are logicalCPUs² available to the code. As a result we have over-subscription, and multiple software threads are sharing a hardware thread. The choice of which thread to execute is then made by the scheduler in the Linux kernel, which will try to be fair and give each of the n threads 1/n of the CPU time.
Normally that is what we want; there is no reason to prefer one thread making progress on the parallel problem over another. However, when one thread has finished all of its work and has to wait for others to complete theirs (i.e. it is at a barrier of some sort), we do not want the polling thread to consume CPU time that could better be used to help solve the problem.
Therefore we want to find a way to ensure that while a thread is waiting (be it at a barrier, for a lock, or for “any other reason why”³) it is sleeping and not running. The most obvious way to ensure this would be to tell the kernel explicitly that the thread is waiting, and then have the final thread to arrive at the barrier (or the thread releasing the lock, or …) execute a system call to wake the sleeping thread(s). This is all supported by the Linux futex() system call.

So, why not sleep in the kernel using a futex()?
There are a few reasons not to call futex() immediately:

- Adding a futex complicates the code. We have to ensure that at least the final thread to arrive makes the futex call to wake the others.
- Waking threads via a futex takes time, whereas we hope that for threads that share an L1 cache polling a flag can be very fast.
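For concreteness, here is a minimal sketch of the futex() approach (my illustration, not the benchmark's code; it assumes Linux, and that std::atomic<int> has the same size and layout as the 32-bit futex word, which holds in practice). Since C++20 the same pattern is available portably via std::atomic::wait()/notify_all().

#include <atomic>
#include <climits>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Sleep in the kernel while *flag still holds the value `expected`.
static long futexWait(std::atomic<int> *flag, int expected) {
  return syscall(SYS_futex, flag, FUTEX_WAIT, expected, nullptr, nullptr, 0);
}

// Wake every thread sleeping on *flag.
static long futexWakeAll(std::atomic<int> *flag) {
  return syscall(SYS_futex, flag, FUTEX_WAKE, INT_MAX, nullptr, nullptr, 0);
}

// Waiting threads sleep rather than consuming CPU time.
static void waitAtBarrier(std::atomic<int> *flag) {
  while (flag->load(std::memory_order_acquire) == 0)
    futexWait(flag, 0); // returns immediately if *flag is no longer 0
}

// The final thread to arrive sets the flag and wakes the sleepers.
static void releaseBarrier(std::atomic<int> *flag) {
  flag->store(1, std::memory_order_release);
  futexWakeAll(flag);
}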
Scheduling Model
Before we try to force the kernel scheduler to do what we want, we need to have a model of how we think it behaves. The simple, and “obvious” model is that it works something like this:-
For each priority level we imagine a single queue of runnable threads, where the next thread to execute will be the one at the front of the queue, and when a thread is de-scheduled it will be placed at the back of the queue.
In our case, all of our threads will have the same priority, so we’re only interested in a single queue.
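As a toy sketch of that model (purely illustrative; the names and the fixed time-slice loop are my inventions and, as we shall see, this is not how the real scheduler behaves):

#include <deque>
#include <iostream>
#include <string>

int main() {
  // The imagined single queue of runnable threads at our priority level.
  std::deque<std::string> runQueue{"worker", "poller"};
  for (int slice = 0; slice < 4; ++slice) {
    std::string current = runQueue.front(); // the thread at the front runs...
    runQueue.pop_front();
    std::cout << "slice " << slice << ": running " << current << '\n';
    runQueue.push_back(current); // ...and goes to the back when de-scheduled
  }
  // In this model a yield would simply be an early push_back, handing the
  // rest of the time-slice to the thread now at the front.
  return 0;
}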
What Is sched_yield()?

sched_yield() is a Linux system call documented here. At first sight it seems ideal:-

Description

sched_yield() causes the calling thread to relinquish the CPU. The thread is moved to the end of the queue for its static priority and a new thread gets to run.
This appears to be describing the view we just outlined, in which case, by calling sched_yield() our polling thread will give up its time-slice and be moved to the back of the scheduling queue, so will not run again until the thread doing useful work has consumed its time-slice and been descheduled. Thus we’ll give almost all of the CPU time to the worker thread.
Diagrammatically, what we hope inserting sched_yield() in our polling loop will achieve is that the execution on the logicalCPU of interest will change from something like this:-

[diagram: the worker and polling threads alternating time-slices]

to something like this:-

[diagram: the worker thread running almost all of the time, with the polling thread scheduled only briefly]
A Simple Benchmark
Our benchmark uses two threads, forcing them onto the same logicalCPU by using the sched_setaffinity call (a sketch of which appears after the code below). In the computation thread we perform a not entirely trivial calculation (counting the number of points which fall inside the Mandelbrot set when a 1000x1000 grid is placed over it, and doing that 100 times), while in the interfering thread we have three different possible functions, each of which polls, waiting for the computation to finish:-
// Poll continuously; time the whole wait.
static void interfere(std::atomic<bool> *flag, statistic *stat) {
  TIME_BLOCK(stat);
  while (!*flag)
    ;
}

// Poll, calling sched_yield() directly each iteration; time each yield.
static void interfereWithSchedYield(std::atomic<bool> *flag, statistic *stat) {
  while (!*flag) {
    TIME_BLOCK(stat);
    sched_yield();
  }
}

// Poll, yielding via the C++ standard interface; time each yield.
static void interfereWithStdYield(std::atomic<bool> *flag, statistic *stat) {
  while (!*flag) {
    TIME_BLOCK(stat);
    std::this_thread::yield();
  }
}
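The pinning mentioned above can be done like this (a sketch; the helper name is mine, and with glibc you need to compile with -D_GNU_SOURCE to get the CPU_* macros):

#include <sched.h> // sched_setaffinity, CPU_ZERO, CPU_SET

// Restrict the calling thread to the single logicalCPU `cpu`.
static void pinToLogicalCPU(int cpu) {
  cpu_set_t affinity;
  CPU_ZERO(&affinity);
  CPU_SET(cpu, &affinity);
  sched_setaffinity(0, sizeof(affinity), &affinity); // pid 0 == this thread
}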
We also measure the CPU time (adding together both user and system time) consumed by each thread for the interesting region.
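On Linux that combined user and system time can be read from the thread’s CPU-time clock (a sketch; the post does not show the benchmark’s actual measurement machinery):

#include <cstdint>
#include <ctime>

// Returns the CPU time (user + system) consumed so far by the calling thread.
static uint64_t threadCpuTimeNs() {
  timespec ts;
  clock_gettime(CLOCK_THREAD_CPUTIME_ID, &ts);
  return uint64_t(ts.tv_sec) * 1000000000u + uint64_t(ts.tv_nsec);
}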
Results
Hmm, it looks as if yielding (either via sched_yield() directly, or via the C++ standard interface) is achieving absolutely nothing; whether we have yields in the polling loop or not, our interference thread is getting half of the CPU time, so the elapsed time doubles.
What Is Going On?
The answer is hidden away in the sched_yield() man page, where we find this at the bottom of the NOTES section:-

sched_yield() is intended for use with real-time scheduling policies (i.e., SCHED_FIFO or SCHED_RR). Use of sched_yield() with nondeterministic scheduling policies such as SCHED_OTHER is unspecified and very likely means your application design is broken.
A little research finds a useful article in Wikipedia on the Completely Fair Scheduler (CFS), which has been Linux’ standard scheduler for normal, non-real-time, threads since 2007.
The CFS attempts to do what its name suggests, so if a thread yields before it has had its fair share of the CPU time, it will be rescheduled immediately. Thus what is happening looks something like this:-

[diagram: despite the yields, the two threads continue to alternate time-slices]
We can also see this if we look at how long the sched_yield() system calls take; what we hope is that they take a long time, since that shows that the productive thread is running while the interference thread yields. Instead, what we see looks like this (for the interfereWithSchedYield case):-
Count     Min        Mean       Max       Total    SD
5.56 M    400.00 ns  857.71 ns  11.32 ms  4.77 s   66.24 µs
We can see that there is at least one case where the yield does what we wanted (when the useful thread runs for over 10ms), but that in most cases the time is very short (significantly less than 1µs). These represent the immediate rescheduling of this thread by the kernel.
Deeper analysis here seems an ideal case for the perf-sched tool; unfortunately, that requires access to some system “files”⁴ which are not accessible to unprivileged users on the machine I am using.
What Does This Show?
It certainly doesn’t show that Jeff is wrong in general. There can be many, more complicated, cases where sched_yield() may help, for instance if there is general machine over-subscription but threads are not tied to specific cores, or if there is over-subscription due to other jobs on the machine. His example is in the BLIS library, and there adding sched_yield() clearly worked. However, even there it seems that the developers think that sched_yield() may not be the ultimate solution⁵.
I certainly agree completely with the idea that any polling should include backoff, ideally ultimately to waiting in a way that the kernel supports. (See “How to wait”, section 6.7 in “the book” for further discussion in the context of locks).
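As a concrete (and entirely illustrative) sketch of such backoff, assuming C++20 for std::atomic::wait(), and with arbitrary stage thresholds:

#include <atomic>
#include <thread>

// Poll briefly, then yield, then sleep in the kernel until notified.
static void waitWithBackoff(std::atomic<bool> &flag) {
  for (int spins = 0; !flag.load(std::memory_order_acquire); ++spins) {
    if (spins < 100) {
      // Stage 1: pure spin (ideally with the hardware hint discussed below).
    } else if (spins < 1000) {
      std::this_thread::yield(); // Stage 2: invite the scheduler to intervene
    } else {
      flag.wait(false); // Stage 3: sleep in the kernel until notified
    }
  }
}

The releasing thread must then execute flag.store(true) followed by flag.notify_all(), so that threads sleeping in the final stage are woken.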
We have also said nothing here about improving the performance when we have code running in multiple logicalCPUs that share a single physical core. In this case it is definitely worth using the architecture-appropriate hint to tell the CPU hardware that this thread is happy to give up time. (For X86_64, this would be a pause (or tpause) instruction; for Arm® architectures, the yield instruction.)
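A sketch of a portable wrapper for that hint (the wrapper and its name are my own; only the instructions come from the text above):

#if defined(__x86_64__)
#include <immintrin.h>
static inline void cpuRelax() { _mm_pause(); } // the x86 pause instruction
#elif defined(__aarch64__)
static inline void cpuRelax() { asm volatile("yield"); } // the Arm yield hint
#else
static inline void cpuRelax() {} // no hint available; plain spin
#endif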
Conclusions
- You should read man pages to the end before writing code.
- Linux scheduling is more complicated than you probably thought, if you had thought about it at all.
- sched_yield() doesn’t immediately seem to be the magic solution to the problem of how to poll with low impact. On the other hand, if you’re going to be burning CPU while polling anyway, doing it via sched_yield() is probably only marginally worse than doing nothing.
- Any code that polls while doing nothing about backoff is probably a performance problem waiting to happen.
A Hint for OpenMP® Users
The OpenMP specification includes the OMP_WAIT_POLICY envirable, which can be used to request that waiting threads do not consume CPU time. See the OMP_WAIT_POLICY description in the standard. If you are expecting over-subscription, using the passive wait policy may help.
Similarly, the LLVM OpenMP runtime library (also used by vendor compilers which are based on LLVM) has a default backoff strategy whereby it polls for some time and then backs off to waiting on a kernel resource. The time it polls before waiting in the kernel is 200ms by default (which seems rather long!), but it can be controlled by setting the KMP_BLOCKTIME envirable. If you are expecting over-subscription, setting that to 0 seems a worthwhile experiment, at least!
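For example (assuming a bash-style shell):

$ export OMP_WAIT_POLICY=passive   # ask waiting threads not to consume CPU time
$ export KMP_BLOCKTIME=0           # LLVM runtime: back off to the kernel immediately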
Acknowledgements and Reproducibility Details
This work used the Isambard UK National Tier-2 HPC Service, operated by GW4 and the UK Met Office, and funded by EPSRC (EP/T022078/1)
The kernel used for these measurements is
$ uname -a
Linux xcil01 4.12.14-197.7_5.0.99-cray_ari_s #1 SMP Wed Mar 17 20:29:55 UTC 2021 (c63ba6f) aarch64 aarch64 aarch64 GNU/Linux
The machine used is
$ lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 4
Core(s) per socket: 32
Socket(s): 2
NUMA node(s): 2
Vendor ID: Cavium
Model: 2
Model name: ThunderX2 99xx
Stepping: 0x1
BogoMIPS: 400.00
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 32768K
NUMA node0 CPU(s): 0-31,64-95,128-159,192-223
NUMA node1 CPU(s): 32-63,96-127,160-191,224-255
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics cpuid asimdrdm
The code for the microbenchmark is in the sched_yield.cc file in the LOMP microBM directory. (At present [March 2022] only in a branch; it will be moved into “main” later).
2. “logicalCPU” is the Linux term for the piece of hardware that can execute a thread. In a hardware implementation with no Simultaneous Multi-Threading (SMT) that is a single CPU. In an SMT machine there is a logicalCPU for each SMT hardware thread.

4. “Error: No permissions to read /sys/kernel/debug/tracing/events/sched/sched_switch”
I recently encountered this issue, and the workaround I have used is to sleep for a tiny amount of time (the minimal duration allowed by the clock, which, if you use the right clock, is much shorter than the time it actually takes for the kernel to put a task to sleep and wake it up).

This is more expensive than a true scheduler yield would be, since the kernel has to read the TSC to set up the sleep deadline. But it does fulfill the intended purpose: since the kernel cannot keep executing a sleeping task, it will switch to other tasks while waiting for the sleep duration to elapse.
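A sketch of that workaround (my reconstruction from the description above; the choice of CLOCK_MONOTONIC and the 1ns request are assumptions, and the kernel rounds the request up to its actual timer resolution):

#include <ctime>

// Request the smallest possible sleep; the kernel must de-schedule us,
// so another runnable task gets the CPU even if the nap is rounded up.
static void tinySleep() {
  timespec request{0, 1}; // 1 nanosecond
  clock_nanosleep(CLOCK_MONOTONIC, 0, &request, nullptr);
}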