In this blog I’m going to show you a small difference between the C/C++ calling convention used on AArch64 processors running MacOS and the one used on Linux. That difference changes what you can get away with in the AArch64/MacOS environment compared with almost everywhere else.
If all of your code follows the relevant standards, then you’re fine. If you’re cheating with function pointers you may not be!
Variadic (Ellipsis) Functions
C (and by inheritance, C++) support the idea of a single function which can be called with an arbitrary number of arguments (a “variadic”, or “ellipsis”, function). The canonical example of this is
    int printf(char const *fmt, ...);
which takes a format string followed by the values which will be printed. The number of required arguments depends on the format string, so it cannot be known when the function itself is compiled.
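To make the mechanism concrete, here is a minimal sketch of a variadic function written with the standard `<stdarg.h>` macros. The function `sum_ints` is a hypothetical example of mine, not part of the test code discussed later; it uses an explicit count where `printf` would use the format string.

```c
#include <stdarg.h>

/* Hypothetical example: add up `count` int arguments passed via the ellipsis. */
int sum_ints(int count, ...)
{
    va_list ap;
    int total = 0;

    va_start(ap, count);            /* begin after the last named parameter */
    for (int i = 0; i < count; i++)
        total += va_arg(ap, int);   /* fetch the next argument as an int */
    va_end(ap);

    return total;
}
```

A call such as `sum_ints(3, 10, 20, 30)` returns 60. The callee only learns how many arguments to read from `count`, which is exactly why the compiler cannot know the argument list when it compiles the function.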
How is a call to a variadic function generated?
In all the environments of which I am aware, apart from AArch64/MacOS, a call to a variadic function generates the same code as that which would be generated if the function prototype explicitly matched the type of the arguments present at the call site. In effect the variadic property of the function being called is used solely to allow the code to pass the semantic check that a function is called with the correct number and type of arguments. It does not affect anything else.
That means that, although it is not standard-conforming, code like this, where a non-variadic function is called through a pointer to a variadic function, will work OK:
    void foo(int *args, int a1, int a2, int a3);
    ...
    void (*fpE)(int *, ...) = (void (*)(int *, ...))foo;
    fpE(&args, 10, 11, 12);
Similarly, code which does the opposite (invoking a variadic function through a pointer to a non-variadic one) also works:
    void foo_ellipsis(int *args, ...);
    ...
    void (*fpNE)(int *, int, int, int) = (void (*)(int *, int, int, int))foo_ellipsis;
    fpNE(&args, 7, 8, 9);
We can see this if we look at a small sample code in Compiler Explorer [1]. Directly at the link you’ll see code compiled for the x86_64 architecture (but you can easily change that by choosing a different compiler). For now just compare the code generated for the call to the non-variadic function
    mov rdi, r14
    mov esi, 1
    mov edx, 2
    mov ecx, 3
    call foo(int*, int, int, int)
with that generated for the call to the variadic version
    mov rdi, r14
    mov esi, 4
    mov edx, 5
    mov ecx, 6
    xor eax, eax
    call foo_ellipsis(int*, ...)
We can see that the arguments are passed in the same registers (rdi, esi, edx, ecx) when calling either function.
You can play with the code in Compiler Explorer to check my assertion that this is true on many architectures, and even execute the x86_64 code to see the (non-standard-conformant) test case run successfully.
Why bother us if this all works?
The reason this is an issue at all is that it doesn’t work this way on AArch64/MacOS. You may have checked the AArch64 compilers in Compiler Explorer and seen code like this,
    add x0, sp, #4 // =4
    mov w1, #1
    mov w2, #2
    mov w3, #3
    bl foo(int*, int, int, int)
    ...
    add x0, sp, #4 // =4
    mov w1, #4
    mov w2, #5
    mov w3, #6
    bl foo_ellipsis(int*, ...)
which shows the same properties as that on x86_64: the arguments are being passed in the same places whether or not this is a variadic function, so that’s all good, right?
But… and it’s a big BUT, the calling convention on AArch64/MacOS is not like this. Here the compiler doesn’t load the arguments which match the ellipsis into registers, but rather puts them onto the stack. Then the va_list code extracts them from there. As a result the test code fails when run natively on the MacOS M1 machines.
    $ clang broken_ellipsis.c
    $ file ./a.out
    ./a.out: Mach-O 64-bit executable arm64
    $ ./a.out
    ./a.out
    foo(&args,1,2,3) sees 1,2,3 ***OK***
    foo_ellipsis(&args,4,5,6) sees 4,5,6 ***OK***
    foo_ellipsis called via void(*)(int *,int,int,int)(&args,7,8,9) sees 4,5,6 ***FAIL***
    foo called via void (*)(int *,...)(&args,10,11,12) sees -1,-1,1873948785 ***FAIL***
    $
Unfortunately I haven’t found an obvious way to ask any of the compilers in Compiler Explorer to compile for AArch64/MacOS, so you’ll have to believe me that the code generated there (by LLVM) looks like this
    sub x0, x29, #36 ; =36
    mov w1, #1
    mov w2, #2
    mov w3, #3
    bl _foo
    ...
    sub x0, x29, #36 ; =36
    mov x9, sp
    mov x8, #4
    str x8, [x9]
    mov x8, #5
    str x8, [x9, #8]
    mov x8, #6
    str x8, [x9, #16]
    bl _foo_ellipsis
Here you can see that the way the arguments are being passed is different. When calling foo(int *, int, int, int), the arguments are passed in registers (x0, w1, w2, w3), but when calling foo_ellipsis(int *, ...) only the first argument is passed in a register (x0), while all of the others are stored onto the stack and will be picked up relative to the caller’s stack from inside foo_ellipsis.
Given the different way the arguments are passed, it should be no surprise that this non-standard-conformant function pointer casting fails, since a function which is expecting its arguments in registers won’t find them if they aren’t there!
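If you really do need to reach a fixed-arity function through a variadic function pointer, one conforming option is a small variadic adapter that unpacks the arguments itself. The sketch below uses hypothetical names (`store3`, `store3_variadic`) rather than the functions from my test program, and it assumes the caller always passes exactly three ints after the pointer.

```c
#include <stdarg.h>

/* Hypothetical fixed-arity function: records its three int arguments. */
void store3(int *out, int a1, int a2, int a3)
{
    out[0] = a1;
    out[1] = a2;
    out[2] = a3;
}

/* Variadic adapter. Because this function really is variadic, a pointer of
 * type void (*)(int *, ...) can point at it without any cast. It reads
 * exactly three ints with va_arg and forwards them through an ordinary,
 * correctly typed call. */
void store3_variadic(int *out, ...)
{
    va_list ap;

    va_start(ap, out);
    int a1 = va_arg(ap, int);
    int a2 = va_arg(ap, int);
    int a3 = va_arg(ap, int);
    va_end(ap);

    store3(out, a1, a2, a3);
}
```

With this in place, `void (*fp)(int *, ...) = store3_variadic; fp(out, 10, 11, 12);` is an ordinary variadic call, so the compiler and the callee agree on where the arguments live whichever calling convention is in use.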
Why would Apple break this?
The first point is that they haven’t broken it; any code which fails because of this was already broken. You just didn’t realise it.
It is important to realise that a language standard is not just something which applies to a compiler writer. Rather it’s a contract between you (and your code) and the compiler writer (and their code). They promise to have your code run correctly (with the semantics specified in the standard), provided that your code obeys all of the rules and restrictions the standard lays out. If you step beyond those, the compiler is free to do anything. (The standard comment in the room when we were standardising High Performance Fortran was always “undefined behaviour, up to and including starting world war 3”). The fact that all other compilers implement the undefined behaviour in the same way, and that that is convenient to you, does not let you escape this constraint.
The second point is that changing this can improve the performance and reduce the overhead of using a variadic function. If you look at the x86_64 code for the foo_ellipsis function and compare it with that for foo, you’ll see that the implementation of foo is four instructions long, whereas foo_ellipsis is ~60 instructions including a number of conditional branches. Similarly, the AArch64/Linux code is 3 instructions for foo, and 64 for foo_ellipsis, whereas for the AArch64/MacOS code, foo is still 3 instructions, but foo_ellipsis is now only 26 instructions long.
Of course, whether the overhead of calling variadic functions actually matters will depend on your code, but maybe Apple know of some places inside critical applications where printf performance is important!
How did I fall into this hole?
In the implementation of the Little OpenMP* Runtime (LOMP) the runtime has to handle the OpenMP fork operation, where it must apply a function which has been created by the compiler to represent the outlined body of a parallel region. In the LLVM interface, the number of arguments to that function will depend on the number of shared variables which are accessed in the parallel region, since a pointer to each such variable is passed as an argument. The compiler generates the code for these outlined body functions as normal functions with a fixed number of arguments (which is correct, since each such function does take a specific number of arguments).
However, the OpenMP runtime function which implements the fork operation, and which causes all of the relevant threads to invoke the outlined body function, is a variadic function. It is passed a function pointer, the number of arguments and the arguments themselves (via the ellipsis). That is fine too: the compiler calls it as a variadic function, and it was defined as one.
The problem comes at the point where the runtime has to apply the function in each thread, and it was there that I was cheating and pretending that the outlined function body is variadic, when it is not.
That code [2] is actually horribly non-standard-conformant anyway, but it does now work, and it manages to call the outlined body without using any assembler code on Arm (32b and 64b), RISC-V and x86_64. (I think it’ll also work on the Power architecture, but haven’t tried it yet :-); SPARC may be more “fun” if it’s still using register-windows.)
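For contrast, here is a sketch of one way to do this kind of dispatch while staying inside the standard: collect the pointers from the ellipsis into an array, then switch on the argument count and call through a pointer of the fixed-arity type the function was really defined with. This is not the LOMP code. All of the names (`generic_fn`, `invoke_outlined`, `fork_call`, `body2`, `MAX_SHARED`) are hypothetical, and the limit of three arguments is an assumption made to keep the example short; a real runtime would need many more cases or a different mechanism.

```c
#include <stdarg.h>

#define MAX_SHARED 3   /* assumption: tiny limit, just for this sketch */

/* Only used to store the pointer; it is never called through this type. */
typedef void (*generic_fn)(void);

/* Call fn with argc pointer arguments by casting it back to the fixed-arity
 * type it was defined with. Returns 0 on success, -1 if argc is outside the
 * range this sketch handles. */
int invoke_outlined(generic_fn fn, int argc, void **argv)
{
    switch (argc) {
    case 0: ((void (*)(void))fn)(); return 0;
    case 1: ((void (*)(void *))fn)(argv[0]); return 0;
    case 2: ((void (*)(void *, void *))fn)(argv[0], argv[1]); return 0;
    case 3: ((void (*)(void *, void *, void *))fn)(argv[0], argv[1], argv[2]);
            return 0;
    default: return -1;
    }
}

/* Variadic entry point in the spirit of a fork call: the pointers arrive via
 * the ellipsis and are copied into an array. Callers must pass each one as
 * a void * so that it matches the va_arg type used here. */
int fork_call(generic_fn fn, int argc, ...)
{
    void *argv[MAX_SHARED];
    va_list ap;

    if (argc < 0 || argc > MAX_SHARED)
        return -1;

    va_start(ap, argc);
    for (int i = 0; i < argc; i++)
        argv[i] = va_arg(ap, void *);
    va_end(ap);

    return invoke_outlined(fn, argc, argv);
}

/* Hypothetical outlined body with two shared variables: *b = *a + 1. */
void body2(void *a, void *b)
{
    *(int *)b = *(int *)a + 1;
}
```

A call looks like `fork_call((generic_fn)body2, 2, (void *)&x, (void *)&y)`. The cost is one switch case per supported arity, but every call goes through a pointer whose type matches the function’s definition, so it does not depend on how any particular platform passes variadic arguments.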
What Did We Just Learn?
Just because my code works everywhere doesn’t mean it is standard-conformant.
It’s my job when writing code to obey the language standard, which applies as much to my code as it does to the compiler.
The AArch64/MacOS calling convention is different from the AArch64/Linux one, and that can bite non-conformant code even if it does work everywhere else.
What’s Coming Next?
In the next issue I’ll talk about timers and some “interesting” features of the emulated x86_64 environment that could bite you.
Many thanks to Matt Godbolt for creating Compiler Explorer, which is a wonderful tool for this kind of investigation, and generally useful if you want to see how a compiler is mangling (sorry, I mean “optimising”) your code!