Introduction
This post is prompted by a complaint that “the OpenMP® parallel reduction is giving the wrong answer” without any discussion of what the “right” answer is. For sum reductions this is not immediately apparent…
The Problem
The mathematical problem that the code is trying to solve seems very simple:
We are simply adding together all of the values in an array of size n, i.e. computing total = a[0] + a[1] + ... + a[n-1].
Although this captures the arithmetic properties, in many cases the array does not really exist; what we're dealing with is the online version of the problem, in which the values to be accumulated arrive one at a time, for instance one for each iteration of some loop which calculates a value. While this is not a huge difference, and does not affect most of what I discuss, algorithms which require visibility of all the values before performing the accumulation may not be feasible in general because of the additional storage they require.
The Code
The obvious C/C++ code[1] for this looks something like this:-
float serTot(int n, float const *a) {
    float total = 0.0;
    // #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += a[i];
    return total;
}
We initialise the running total to zero, and then add all of the values in the vector together. For a parallel version we simply uncomment the OpenMP directive.
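The parTot used in the test below is assumed to be exactly that: the same loop with the directive enabled. A minimal sketch (the body is mine, though the name parTot comes from the test code):-

float parTot(int n, float const *a) {
    float total = 0.0;
    // The only difference from serTot: the reduction directive is active.
    #pragma omp parallel for reduction(+:total)
    for (int i = 0; i < n; i++)
        total += a[i];
    return total;
}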
Running a Test
Suppose I set up a test like this:-
#include <stdio.h>
#include <omp.h>

static void initArray(int n, float *a) {
    a[0] = 1.0;
    a[n-1] = -1.0;
    for (int i = 1; i < n-1; i++) {
        a[i] = 2.e-8;
    }
}

int main(int, char **) {
    enum {
        arraySize = 100002
    };
    float *data = new float[arraySize];
    initArray(arraySize, data);
    printf("omp_get_max_threads() %d\n", omp_get_max_threads());
    printf("Serial total: %g, parallel total %g, mathematical result %g\n",
           serTot(arraySize, data), parTot(arraySize, data), (arraySize-2)*2.e-8);
    return 0;
}
What Result Do You Expect?
Clearly, the mathematical result should be 1 + 100000 * 2.e-8 - 1 = 0.002, but what about the serial result, which is often the one people think is “the right answer”?
Because I’m asking, you’d be right to be suspicious; the choice of values to sum may also look “interesting”…
$ clang++ -fopenmp -O3 sumReduction.cxx
$ ./a.out
omp_get_max_threads() 8
Serial total: 0, parallel total 0.00174981, mathematical result 0.002
As you can see, the serial total is zero, while the parallel total (here from eight threads) is nearer to the true, mathematical, result, though still not spot on.
What is Going On?
The Serial Case
The issue here is that we’re running into one of the properties of floating point numbers, which is that they have a limited number of mantissa bits. We’re using IEEE 754 32-bit numbers, which are implemented by most machines[2], so are almost certainly what you’ll get for a float on your machine. Since the 32-bit format has 24 bits of mantissa, adding a number to another one that is more than a factor of 2**24 larger will simply not change the larger value. As 2**-24 ~= 5.96e-8, the sum 1.0 + 2.e-8 will always give the result 1.0. Therefore none of the additions of 2.e-8 has any effect and the final summation is 1.0 - 1.0 = 0.0.
Of course, that is almost certainly not the result that was desired.
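If you want to check that absorption directly, a minimal test (my code, not part of the post’s example) looks like this:-

#include <stdio.h>

int main(void)
{
    float one = 1.0f;
    float small = 2.e-8f;   // well below 2**-24, which is half a ULP of 1.0f
    // The addition is completely absorbed, so this prints "lost".
    printf("%s\n", (one + small) == one ? "lost" : "kept");
    return 0;
}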
Another way to think about this is that floating point addition (unlike the mathematical version) is not associative. The order in which the operations are performed can matter, so (a+b)+c is not guaranteed to give the same value as a+(b+c).
We can see that even in simple code like this
#include <stdio.h>

int main()
{
    float small = 3.e-8;
    float one = 1.0;
    printf("(3.e-8+3.e-8)+1.0=%12.10f 3.e-8+(3.e-8+1.0)=%12.10f\n",
           (small+small)+one, small+(small+one));
    return 0;
}
(which you can see on Compiler Explorer here). When executed it gives this output
(3.e-8+3.e-8)+1.0=1.0000001192 3.e-8+(3.e-8+1.0)=1.0000000000
This demonstrates that the order of additions affects the result even in very simple cases.[3]
The Parallel Case
We have just seen that the order of additions can affect the result, and the whole point of executing in parallel is to allow multiple operations to proceed simultaneously, which changes the overall order of the operations. It should therefore be no surprise that the result of the parallel reduction is different from the serial result.
Here, the OpenMP runtime has clearly chosen to implement the reduction by performing a per-thread reduction and then combining the results from each thread at the barrier at the end of the loop. This is not required by the OpenMP standard, and, indeed, at least the LLVM runtime also has support for using an atomic operation to update the single, shared, result variable with every addition. However, when there are many iterations, as there are here, the per-thread reduction implementation is preferred.
As a result, each thread will have its own, private, version of the variable total, initialised to zero, and will accumulate all of the values for the iterations it executes. Therefore in most of the threads the addition of the small values will happen successfully, and, by the time these per-thread totals are added together, each of them will be large enough to make a contribution to the final value, rather than being so small that it is ignored.
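In other words, the runtime is effectively doing something like this hand-written version (a sketch of the strategy only; the name parTotByHand is mine, and the real runtime combines the partial results rather more efficiently than a critical section):-

float parTotByHand(int n, float const *a) {
    float total = 0.0;
    #pragma omp parallel
    {
        float myTotal = 0.0;        // per-thread private accumulator
        #pragma omp for
        for (int i = 0; i < n; i++)
            myTotal += a[i];
        #pragma omp critical        // combine the per-thread partial sums
        total += myTotal;
    }
    return total;
}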
Assuming a static schedule for the loop (which is the common default), we’d expect an iteration split like this (the handling of the “spare” two iterations is not specified by the OpenMP standard, but allocating one each to two different threads is optimal, so it is extremely likely to be the behaviour of a sane compiler/runtime):-
T   Count   (First : Last)
0 : 12501   (    0 :  12500)
1 : 12501   (12501 :  25001)
2 : 12500   (25002 :  37501)
3 : 12500   (37502 :  50001)
4 : 12500   (50002 :  62501)
5 : 12500   (62502 :  75001)
6 : 12500   (75002 :  87501)
7 : 12500   (87502 : 100001)
The first thread will still lose all of its small values, so will simply sum to 1.0, and the final thread will lose some bits when the -1.0 is added, but each of the other threads will accumulate all of its values, contributing roughly 12500 * 2.e-8 = 2.5e-4 each. We’d therefore expect this eight-thread case to sum to approximately 7/8 of the correct result, i.e. 0.00175, which is, indeed, close to the 0.00174981 which we see.
In general we’d expect to lose the contributions from the first thread, so the (threads-1)/threads
formula should apply, and, if we force the number of threads, we can see that it does (e.g. for 50 threads we get 0.00196004, which is close to 0.002 * 49/50 = 0.00196)
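Forcing the thread count is simply a matter of setting the standard OMP_NUM_THREADS environment variable before running the same binary, for instance:-

$ OMP_NUM_THREADS=50 ./a.out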
What Can I Do About This?
Don’t just ignore it by arguing that your sum reductions are on double
s, and that they have enough mantissa bits so it isn’t an issue. If you are complaining that the parallel reduction gives a different result from that given by the serial code, clearly the order of operations is affecting the result!
If you are accumulating floats, then changing the type of the accumulator (total in the code above) to double may be enough to fix the problem, since then the accumulator has many more mantissa bits available. We can see that it does fix this example:
With double accumulator.
Serial total: 0.002, parallel total 0.002, mathematical result 0.002
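The change really is that small; for the serial version it amounts to something like this sketch (the name serTotWide is mine), and similarly for the reduction variable in the parallel version:-

float serTotWide(int n, float const *a) {
    double total = 0.0;          // accumulate in double...
    for (int i = 0; i < n; i++)
        total += a[i];
    return (float)total;         // ...and only narrow back to float at the end
}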
However if you are accumulating doubles, using a wider floating point type may be more problematic (see Portable Support for 128b Floats in C/C++).
If you are operating on CPUs and using the LLVM OpenMP runtime, then the KMP_DETERMINISTIC_REDUCTION envirable (see here) may help, though it only guarantees that the result will be consistent between runs with the same number of threads, and it obviously requires that a static schedule is used to ensure that each thread always sees the same set of values to accumulate. This does not ensure the same result as the serial code.
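For instance, assuming the runtime’s usual boolean envirable syntax (check the linked documentation for your runtime version), something like:-

$ KMP_DETERMINISTIC_REDUCTION=true OMP_NUM_THREADS=8 ./a.out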
If you want to make larger changes, then using the unum floating point representation (see here) for the accumulation operations might help, or you could consider serialising the reduction and buffering the values (which may be impossible for large reductions) so that you can apply a more complicated scheme to choose a good order of operations. (For instance, sorting the leaf-level contributions by their absolute magnitude, then adding the two smallest, inserting that result back into the list of values to be reduced at the right place to maintain the sort, and iterating those two operations until you have a single result seems likely to give more accurate results, though at significant cost.)
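As a concrete illustration of that last idea, here is a sketch (my code, certainly not a tuned implementation) which uses a min-heap ordered by absolute magnitude rather than an explicitly sorted list, but implements the same smallest-first ordering; note that it needs all of the values up front, so it only applies when the buffering mentioned above is feasible:-

#include <cmath>
#include <queue>
#include <vector>

// Repeatedly combine the two values of smallest magnitude until one remains.
float smallestFirstSum(std::vector<float> values) {
    // Order the heap so that the value of smallest magnitude is always on top.
    auto absGreater = [](float x, float y) { return std::fabs(x) > std::fabs(y); };
    std::priority_queue<float, std::vector<float>, decltype(absGreater)>
        heap(absGreater, std::move(values));

    if (heap.empty())
        return 0.0f;
    while (heap.size() > 1) {
        float a = heap.top(); heap.pop();
        float b = heap.top(); heap.pop();
        heap.push(a + b);   // re-insert the partial sum in magnitude order
    }
    return heap.top();
}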
What Did We Learn?
Floating point arithmetic is more fun than you might expect.
If you’re complaining that OpenMP reductions give “the wrong answer”, you are wrong. What’s happening is that OpenMP is exposing a problem that you already had.
References
What every computer scientist should know about floating-point arithmetic by David Goldberg. This is the classic paper, which certainly lives up to its title. If you have not read it, then do so now!
The code for the test is available here.
Footnotes
1. I know; in C++ one could write this more idiomatically using std::vector and a foreach loop, or std::reduce (though that explicitly says that the result is non-deterministic for non-associative operators; see here).
I don’t expect you to be on an IBM Z-series mainframe, and, if you are, you’ll see this problem even worse, since the classical IBM floating point format has fewer mantissa bits!
3. You can also see the other effect, that 2.e-8 is not precisely representable in this format, and that we’re still losing many mantissa bits that represented useful values when doing the summation. Hence the result 1.0000001192 rather than 1.00000012.