Do You Really Need 128b Floating Point?
I am not going to get into this discussion, since there are clearly people who believe that they do, even though the most precisely known physical constant is the fine structure constant (known to ~12 decimal digits1, while the 128b IEEE float gives ~34 decimal digits).
The Problem
IEEE P754 specifies a 128b floating point format, but there is a problem in how that is mapped to C language types.
In effect this is because of a historical quirk, which is that the 8087 FP chip for the Intel® 8086 operated internally in an 80b format, which could then also be exposed to the user.
Therefore when C compilers for the 8086 with 8087 (and the succeeding X86 processors) were mapping floating point types supported by the hardware to C type-names, the obvious mapping was
float -> 32b
double -> 64b
long double -> 80b
That leaves no obvious type to use for the 128b float format which was specified later (well, OK, maybe “long long double”, but people clearly didn’t like that).
As a result compilers came up with their own solutions, which leads to non-portable code.
There is now a standard for this, however, neither the GNU Compiler Collection (GCC) 11.x nor LLVM 13.x seem to implement it entirely, and, to make things more fun, they each implement different things.
Since this is an area that also depends on the target machine architecture and the support provided by OS system libraries (libc, libm), the same compiler may also behave differently even on the same architecture when targeting different OSes.
Overall, if one is trying to write code that uses the 128b IEEE float type it is hard to write it in a way that does not bind the code to a specific compiler, architecture and OS, something that is clearly a bad idea!
Standard Solution
Since 2015, the C standard has had a more general solution which is described in draft in ISO/IEC TS 18661-3 (“Floating-point extensions for C - Part 3 - Interchange and extended types”), which was revised in Annex X (normative) in 2019 (“IEC60559 interchange and extended types”).
It can be summarised (for the 128b case) like this :-
So we’re all done, right? The implementers have had six years, so this is surely in all the compilers I need to use, and I can just use it.
Note that the standard does not require that printf (and friends) accept _Float128 arguments, or that scanf (and friends) read such a number. Instead you are supposed to use
int strfromf128(char * restrict s, size_t n,
const char * restrict format, _Float128 fp);
and
_Float128 strtof128(const char * restrict nptr,
char ** restrict endptr);
The format string passed to strtof128 must only include a format (i.e. start with a %) , and have a format specification from (a,A,e,E,f,F,g,G). So you are expected to buffer the output into a string before passing it to printf itself, should you want to mix it into output produced by a printf call.
Before we pack up, though, we should check what compilers actually do implement. Here we’ll look at GCC’s gcc and LLVM’s clang, since they are the most widely used compilers, and LLVM forms the basis for many vendor compilers2. We’ll also check both X86_64 and the Arm 64b architecture (AArch64) targets, looking at Linux and MacOS3.
Gnu Compiler Collection (GCC)
GCC seems to have set the trend here in gcc, though precisely what it does depends on the target architecture and OS.
Support for appropriate constants and functions is provided in <quadmath.h> which is part of the GCC distribution.
Compiler Explorer can show us what’s going on.
E..g. When targeting Linux on an X86_64, this code
long double ldone = (long double) 1;
__float128 F128one = (__float128) 1;
compiles to
ldone:
.quad 0x8000000000000000 # x86_fp80 1
.short 0x3fff
.zero 6
F128one:
.quad 0x0000000000000000 # fp128 1
.quad 0x3fff000000000000
Here you can see the two different constant formats, with the 80b (i.e. 10B) constant followed by six bytes of padding for the long double type, and the full, 16B fp128 constant.
However, on non-X86 platforms gcc may define the long double type as mapping to the IEEE 128b float. For instance, Compiler Explorer shows us that, when targeting Linux on AArch64, gcc compiles this code
long double ldone = (long double) 1;
to
ldone:
.word 0
.word 0
.word 0
.word 1073676288
(i.e. a 128b IEEE constant) .
However, if we try to compile the previous code for MacOS, we discover that the __float128 type is not supported here by gcc, despite the compiler demonstrably being able to generate code for 128b floats if the environment allows that.
Note that on platforms where it supports a 128b float, gcc does accept the standard type name _Float128, and the standard constant suffix F128 in addition to its own versions (__float128 and q).
LLVM
clang follows gcc to some extent, while appearing to ignore the standard completely.
Where Does That Leave Us?
There is no consistent set of names and tags that we can use across the full set of (gcc, clang) x (X86_64, AArch64), so we will have to provide a non-standard, but consistent, set of names which we can then map to the underlying implementation.
There are four main things we need to be able to translate from our code to whichever underlying syntax is required.
The typename itself
The format of constants
The names of functions and constant values from libm
The handling of I/O (tags in printf strings, the name of the function to convert a string to a number).
Typename
While it is tempting to use _Float128, in the hope that we can do nothing here when compiling with a compiler which supports the standard (e.g. gcc) and add typedef long double _Float128; elsewhere, it’s probably not a good idea, since _Float128 is a reserved typename.
I have therefore chosen to go with FP128, which we’ll also use throughout for our tags and so on.
Format of Constants
Since there is no consistency here, we need to use a macro wrapped around our constant declarations that can add the appropriate syntactic tag for the target environment.
I’ve chosen to call that macro FP128_CONST, so you would write FP128_CONST(1.0), or, more realistically, something like
#define M_EFP128 FP128_CONST(2.718281828459045235360287471352662498)
Names of Functions and Constants
Append FP128, so we will appear to have functions like
FP128 sinFP128(FP128);
I say “appear”, though, since in reality we’ll be calling an inline shim function which calls the appropriate underlying function, (here, likely either sinq on X86 with quadmath, or sinl on AArch64).
The header will provide appropriate constant macros (such as the M_EFP128 shown above).
Handling I/O
Input
The input side can be handled relatively simply, since we can provide
FP128 strtoFP128(char const *, char **);
and map it directly to the appropriate underlying function.
Output
While we would hope that there is a way to output 128b floats through printf, it turns out that, in general, it isn’t.
On X86_64, if we use gcc and the <quadmath.h> approach although ‘q’ is used as a suffix for constants, in a printf string it is equivalent to ‘l’, and describes formatting of the 80b long double type.
#include <stdio.h>
#include <stdint.h>
#define __STDC_WANT_IEC_60559_TYPES_EXT__ 1
#include <float.h>
_Float128 d2 = 1.0q;
long double ld2 = 2.0l;
// __float128 d3 = 3.0q;
int main (int argc, char ** argv) {
printf("_Float128 %%qg => %qg\n", d2);
printf("_Float128 %%Qg => %Qg\n", d2);
printf("_Float128 %%Lg => %Lg\n", d2);
printf("long double %%qg => %qg\n", ld2);
printf("long double %%Lg => %Lg\n", ld2);
}
gives this output when compiled with gcc 11.2 for X86_64 on compiler explorer.
Program returned: 0
Program stdout
_Float128 %qg => 5.13001e-4937
_Float128 %Qg => %Qg
_Float128 %Lg => 5.13001e-4937
long double %qg => 2
long double %Lg => 2
However, on Linux (though not, apparently, in the Compiler Explorer environment), the ‘Q’ tag does work!
On other architectures (where there is no 80b float, and the long double type is the 128b float), then the ‘l’ (or ‘L’) suffix will work, but for portable code we still need to avoid it. We therefore provide a restricted FP128_snprintf which follows the restrictions of quadmath_snprintf (since that’s the most restrictive case).
Even with that restriction, we also need to know the appropriate tag character, so provide FP128_FMT_TAG, so that you can write code like
printf("_Float128 %%" FP128_FMT_TAG "g => %" FP128_FMT_TAG " g\n", d2);
Here we are using printf, which is slightly dubious (as we saw from the Compiler Explorer case).
Output Conclusion
If you really want portable code you should not use printf (and friends) for handling FP128 output, but rather the specific, restricted, FP128_snprintf function, however, if you’re prepared to take a slight risk, then you can use standard printf provided that you use the FP128_FMT_TAG in your format strings.
Namespace Pollution
Since the portability header is going to map directly to the underlying implementation through the use of static inline functions and macros, we necessarily have to include any necessary underlying definitions. That means that on X86_64, the header will also #include <quadmath.h>, and, therefore the functions and definitions from there will also appear when our portability header is used.
I don’t see that as a huge problem, though it does mean that you might think that you’ve abstracted away from quadmath, but until you actually try to compile with another compiler or for a different platform, you could have missed some references to quadmath functions and not noticed.
Conclusions
You can write portable code for this, but it’s far from easy.
MacOS on their AArch64 machines does not support FP128 at all, and, as a result, neither do compilers configured for that environment. So the best you can do at present if you really want to run code with 128b floats on these machines is to use the X86_64 emulation environment.
Where is the Code?
The code is all freely available at https://github.com/UoB-HPC/PortableFP128, and is licensed with a permissive license (Apache License v2.0 with LLVM Exceptions). The core of the code is the single, self-contained, header file pfp128.h, though there is also a test to exercise the interfaces and a Makefile.
Caveats
The code has only been tested with GCC and LLVM, and only on MacOS and Linux; other compilers (such as AMD’s. Arm’s, Cray’s, Intel’s, or NVidia’s) may require additional work or configuration. In particular, finding the <quadmath.h> header on X86_64 may need additional include paths when using compilers other than GCC, as may finding libquadmath at link time. (Though you would need to do that to escape GCC even if you were not adding the additional portability features).
Sources of Information
Wikipedia article on “Quadruple Precision”
CPPreference website’s page on Experimental C Features (including CF3P annex from 2019)
Legal Stuff
I have tried to use appropriate copyright and trademark symbols at the first mention of the relevant name, however it is likely I have missed some. So “Other names may be the property of others.”
This work used the Isambard UK National Tier-2 HPC Service (http://gw4.ac.uk/isambard/) operated by GW4 and the UK Met Office, and funded by EPSRC (EP/T022078/1)
At least AMD®, Arm®, Intel, and NVIDIA®
I don’t have access to a Microsoft® Windows machine, so I haven't tested there.
Swift is getting 128-bit FP at some time in the future. One suspects that will coincide with MacOS getting beefed up.
If you're simulating physics from black holes down to quarks then you need higher precision because you're doing arithmetic between values of vastly different scale. In that case you might even need 256-bit fp.