Introduction
This blog describes the use of macros in C/C++ to handle many similar operations with a small amount of code. This may seem somewhat off-topic for CpuFun, since there is nothing CPU-architecture-specific here, but it seems worth discussing: it is a useful style when writing things like hardware benchmarks, parallel runtime libraries, or machine simulators, where one is dealing with many similar operations, and I haven’t seen it explicitly discussed elsewhere [1].
Since I’m writing about macros, I’ll also mention a few other things you can do with them that people often seem to miss.
What Problem Are We Trying to Solve?
Suppose that I want to write a benchmark to measure the performance of atomic operations on some machines. One of the potentially interesting aspects is whether the actual atomic operation being performed affects the performance at all. I therefore need to have separate tests for many different atomic operations, but most of the relevant code is identical.
Similarly, if I were writing a parallel runtime library that supports atomic operations, it would need a function for each operation supported at the higher level, but again, most of the code is associated with handling the atomicity, not the operation itself.
Or, maybe I’m writing a simulator, and need to write code to emulate all of the ALU operations. Again, most of the code for each emulation function is identical apart from the precise operation being performed.
Or, perhaps I am writing a tracing library, and want the user to be able to add new trace points without having to explicitly add code to print them, or define a new enum for each one.
In all of these cases I effectively have a list of things that I want to use to generate a set of functions, or to generate a table and some enumerations, but I don’t want to have to write the common code more than once, or find all of the places where changes need to be made when I add more things.
The Solution
C/C++ Macros
We all know about macros. You use #define foo bah and then any instance of foo is replaced by bah. And, of course, you can define macros with arguments:
#define TIMES2(val) (2*(val))
How Does That Help?
The key point to notice is that macro expansion happens repeatedly until the text is no longer changing, so, with the definitions above, TIMES2(foo) is expanded first to (2*(foo)) and then again to (2*(bah)).
That doesn’t seem very magical, but now consider what happens if we do something like this and use a macro to apply another macro:
#define FOREACH_WEEBLE(m) \
m(elephant) \
m(whale) \
m(dragon)
We can then use the FOREACH_WEEBLE macro to apply another macro to all of the weebles we want to handle.
Since we have abstracted the operation (the m macro) from the list of items to which it is applied, we can now do something like this [2], and apply more than one operation to the set of weebles:-
#define ADD_ENUM(c) c,
enum Weebles {
FOREACH_WEEBLE(ADD_ENUM)
};
#define ADD_NAME(c) STRINGIFY(c),
static char const * weebleNames[] = {
FOREACH_WEEBLE(ADD_NAME)
};
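For reference, once the preprocessor has run, those two uses expand to the equivalent of this (assuming the STRINGIFY macro defined near the end of this blog is in scope):
enum Weebles {
  elephant,
  whale,
  dragon,
};
static char const * weebleNames[] = {
  "elephant",
  "whale",
  "dragon",
};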
Now creating a new weeble, defining an enum value for it, and adding it to the table of name strings only requires adding a single line to the FOREACH_WEEBLE macro. It’s not even necessary to understand what adding that line does! Simply adding it ought to do all that is needed. [3]
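Because the enum and the name table are generated from the same list, they always stay in step, so a (hypothetical) helper like this, living in the same file as the definitions above, just works:
#include <stdio.h>

void printWeeble(enum Weebles w) {
  // The enum value indexes the parallel table of names generated above.
  printf("%s\n", weebleNames[w]);
}
// e.g. printWeeble(whale) prints "whale".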
Going Further
One obvious, and useful, extension is to pass more information into the underlying expansion macro, so that the FOREACH_ macro has a little bit more in it.
As an example, the code that generates the atomic operations required by the LLVM® OpenMP® runtime interface, in atomics.cc in the “Little OpenMP Runtime” (LOMP) [4], uses FOREACH macros like these
// Operations which can be expressed as std::atomic<type>
// operator op=(type operand).
// The choices of macros here are somewhat determined by what
// makes sense for integers vs floats, and, also what
// implementation is possible.
// For instance C++ (<C++20) does not have float atomics,
// multiply/divide, or && and || atomics on any type.
#define FOREACH_ADD_OPERATION(macro, type, typetag) \
macro(type, typetag, +, add, false) \
macro(type, typetag, -, sub, false)
#define FOREACH_MUL_OPERATION(macro, type, typetag) \
macro(type, typetag, doMul, mul, false) \
macro(type, typetag, doDiv, div, false) \
macro(type, typetag, doSub, sub_rev, true) \
macro(type, typetag, doDiv, div_rev, true)
It then also uses another level of expansion:-
#define FOREACH_STD_INTEGER_TYPE(expansionMacro, leafMacro) \
expansionMacro(leafMacro, int8_t, fixed1) \
expansionMacro(leafMacro, uint8_t, fixed1u) \
expansionMacro(leafMacro, int16_t, fixed2) \
expansionMacro(leafMacro, uint16_t, fixed2u) \
expansionMacro(leafMacro, int32_t, fixed4) \
expansionMacro(leafMacro, uint32_t, fixed4u) \
expansionMacro(leafMacro, int64_t, fixed8) \
expansionMacro(leafMacro, uint64_t, fixed8u)
#define FOREACH_FP_TYPE(expansionMacro, leafMacro) \
expansionMacro(leafMacro, float, float4) \
expansionMacro(leafMacro, double, float8)
#define FOREACH_COMPLEX_TYPE(expansionMacro, leafMacro) \
expansionMacro(leafMacro, std::complex<float>, cmplx4)\
expansionMacro(leafMacro, std::complex<double>, cmplx8)
... etc ...
The final expansion then looks like this, where we’re using the macros to sweep over both the operations and the types:-
FOREACH_STD_INTEGER_TYPE(FOREACH_ADD_OPERATION,
expandInlineBinaryOp)
FOREACH_STD_INTEGER_TYPE(FOREACH_BITLOGICAL_OPERATION,
expandInlineBinaryOp)
... etc ...
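To see how the two levels compose, consider just the floating-point types with the add/subtract operations; the outer FOREACH produces one inner FOREACH per type, which in turn produces one leaf-macro invocation per (type, operation) pair:
// FOREACH_FP_TYPE(FOREACH_ADD_OPERATION, expandInlineBinaryOp)
// expands to
//   FOREACH_ADD_OPERATION(expandInlineBinaryOp, float, float4)
//   FOREACH_ADD_OPERATION(expandInlineBinaryOp, double, float8)
// which expands to
//   expandInlineBinaryOp(float, float4, +, add, false)
//   expandInlineBinaryOp(float, float4, -, sub, false)
//   expandInlineBinaryOp(double, float8, +, add, false)
//   expandInlineBinaryOp(double, float8, -, sub, false)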
I’ve omitted the low-level macros like expandInlineBinaryOp, even though they’re where the actual code gets generated, because they’re not the point of this blog, which is the use of macros to simplify your code. By all means look at them if you’re interested in the implementation of atomic operations.
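If you just want a feel for the shape of such a leaf macro, here is a minimal sketch of one possible version. This is my illustration, not LOMP’s actual code: the real functions implement the __kmpc_atomic_* interface (so have different names and extra parameters), and this sketch ignores the reversed flag and casts to std::atomic<type>, which the comment above already notes only works for some types and operations.
// Illustrative only: generate one function per (type, operation) pair by
// treating *target as a std::atomic<type> and applying operator op=.
// The ## pastes e.g. "+" and "=" into "+=", and also builds the name.
// The cast is a simplification; it assumes the layouts match.
#include <atomic>
#include <cstdint>

#define expandInlineBinaryOp(type, typetag, op, opname, reversed)   \
  void atomicOp_##typetag##_##opname(type *target, type operand) {  \
    *reinterpret_cast<std::atomic<type> *>(target) op##= operand;   \
  }

// With the FOREACH macros above, this one line generates sixteen functions:
// atomicOp_fixed1_add, atomicOp_fixed1_sub, ..., atomicOp_fixed8u_sub.
FOREACH_STD_INTEGER_TYPE(FOREACH_ADD_OPERATION, expandInlineBinaryOp)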
The end result of all of that is that the 188 lines of code in that file (along with 577 lines of comment and 46 blank lines :-), as of February 2022) generate 152 __kmpc_atomic_* functions. While that is still less than half of the 348 which the LLVM OpenMP runtime library provides on AArch64, it covers the most common cases.
But, Can’t I Do All This With C++ Templates?
C++ templates are certainly useful as a way of writing a single piece of code that can easily be re-used with different types, or even operations. However, it seems hard to me to use a template to encapsulate the outermost driver that specifies which versions of the template are required. Therefore I find templates very useful for providing the underlying implementation, but use a FOREACH macro to drive the instantiation of the template.
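As a purely illustrative sketch of that division of labour (all the names here are made up, not from any real library): the template holds the single implementation, and the FOREACH macro decides which concrete entry points get generated, so adding a type remains a one-line change.
#include <atomic>
#include <cstdint>

// The implementation is written once, as a template...
template <typename T>
void atomicAdd(std::atomic<T> &target, T operand) {
  target += operand;   // the single, shared implementation
}

// ...and a FOREACH macro drives which concrete entry points exist.
#define FOREACH_ADD_TYPE(m) \
  m(int32_t, fixed4)        \
  m(uint32_t, fixed4u)      \
  m(int64_t, fixed8)        \
  m(uint64_t, fixed8u)

// Each line of the FOREACH generates one named, non-template entry point.
#define EXPAND_ADD_ENTRYPOINT(type, typetag)                   \
  void atomic_##typetag##_add(std::atomic<type> *t, type v) {  \
    atomicAdd(*t, v);                                          \
  }

FOREACH_ADD_TYPE(EXPAND_ADD_ENTRYPOINT)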
Other Useful Macros
_Pragma
This is not really a macro (strictly, it’s a preprocessor operator), but it is useful, hardly new (it has been in C since C99, so it can now drink alcohol in the USA), and seems not to be well known.
What Does It Do?
It lets you generate #pragma directives from macros. That may not seem very useful, but it allows you to remove a whole load of macro-processing conditionals from your code.
So, instead of having to write something like this [5] on every loop that you want unrolled:
#if defined(__clang__)
# pragma unroll
#elif defined (__GNUC__)
# pragma GCC unroll
#endif
for (...whatever...) {
... loop body ...
}
you can define a macro in a target-dependent header, like this:
#if defined(__clang__)
# define UNROLL_LOOP _Pragma("unroll")
#elif defined(__GNUC__)
# define UNROLL_LOOP _Pragma("GCC unroll")
... other compilers here ...
#endif
Then in the code itself all you need to write is
UNROLL_LOOP
for (...whatever...) {
... loop body ...
}
When combined with STRINGIFY (see below) you can also pass in arguments, something like this for the LLVM version (the GCC one is left as an exercise for the reader):
#define UNROLL_LOOP(count) _Pragma(STRINGIFY(unroll count))
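So, for example, with the Clang definition above, UNROLL_LOOP(8) expands to _Pragma("unroll 8"), which the compiler treats exactly as if #pragma unroll 8 had been written on that line (n, a, and b below are just placeholders):
UNROLL_LOOP(8)   // becomes _Pragma("unroll 8"), i.e. #pragma unroll 8
for (int i = 0; i < n; i++) {
    a[i] += b[i];
}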
This approach is also useful for guarding OpenMP pragmas, since you can potentially avoid a whole pile of
#if defined(_OPENMP)
# pragma omp parallel
#endif
type code, by using a macro like
#if defined(_OPENMP)
# define OMP(...) _Pragma (STRINGIFY(omp __VA_ARGS__))
#else
# define OMP(...)
#endif
Then you can simply write
OMP(parallel)
without requiring any guards.
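Clauses pass straight through as well, including ones containing commas, since OMP() is variadic and STRINGIFY (below) keeps the commas; for example (sum, a, and n are just placeholders):
double sum = 0.0;
OMP(parallel for reduction(+ : sum) schedule(static, 16))
for (int i = 0; i < n; i++) {
    sum += a[i];
}
When _OPENMP is not defined, the OMP(...) line simply expands to nothing and the loop runs serially.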
Of course, the compiler should ignore OpenMP pragmas when not compiling for OpenMP; however, when invoked with the -Wall flag, many compilers (at least GCC and icc) give warnings about pragmas which they ignored. So, if compiling with all warnings enabled and having no warnings is your style, you need to guard each OpenMP pragma somehow.
STRINGIFY
This macro expands any macros in its arguments and converts the result (including commas) into a string.
// Expand a macro and convert the result into a string
#define STRINGIFY1(...) #__VA_ARGS__
#define STRINGIFY(...) STRINGIFY1(__VA_ARGS__)
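The two levels are what make that expansion happen: the # operator stringifies its argument exactly as written, without expanding it, so the outer macro exists purely to expand the argument before handing it to the inner one. With the foo macro from the start of the blog:
STRINGIFY1(foo)  // => "foo"  (# stringifies the argument as written)
STRINGIFY(foo)   // => "bah"  (the extra level expands foo first)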
Summary
Macros may be more powerful than you had realised! [6]
Macros can be used to reduce the amount of code you have to write, and, more importantly, to avoid replication of source code which then has to be maintained. When you find a bug in it, there’s only one place you need to fix, not all the places to which the same code has been copied and pasted!
Macros can be used to make compilation-time guards less intrusive.
Lazy programmers (is there any other sort?) should like the fact that there is less code to write. (Those who are paid by the number of lines they write will disagree, of course).
[1] Yup, I know, my Google-fu is low, and it may be well known and described in a bunch of other places too!
[2] STRINGIFY is defined near the end of the blog (scroll up slightly from here if you want to see it).
[3] You can see this style being used by the statistics collection code in the LLVM OpenMP runtime, in the KMP_FOREACH_{COUNTER,TIMER} macros in kmp_stats.h. These macros make it easy to add new timers or counters without having to explicitly add more code to print them.
[4] The implementation choices which drive the design of the Little OpenMP runtime are described in the book High Performance Parallel Runtimes. The code is all available on GitHub.
[5] OK, so there may be a standard way to request loop unrolling in C++2x, but I couldn’t find it (see footnote [1] :-)). Before you tell me, I do know that OpenMP is working on loop transformations which provide a compiler-independent syntax; see Michael Kruse’s slides from 2020, chapter 9 (page 219) in OpenMP 5.2 [PDF], or the related text in OpenMP 5.1.
[6] If you think the macro stuff that I’m showing here is extreme, then take a look at CHEAT (“C Header Embedded Automated Testing or something like that.”):
“Imagine a source file including a header file. Then imagine the header file including the source file that included it. Now imagine doing that three times in a row within the same header file. Proceed to imagine redefining all of the identifiers each time. Finally imagine doing all of that with preprocessor directives. What you ended up with is CHEAT.”