Introduction
This time around I’m going to look at high-resolution timers and a few oddities in the way the x86_64 emulation on the M1 presents itself that lead to some potential “gotchas”.
I’ve created a small program to demonstrate the issues I discuss here, which you can download to run yourself. It’s nominalFrequency.cc, which can be compiled with no defines, flags, or whatever by your favourite C++ compiler for AArch64 or x86_64 (e.g. clang++ nominalFrequency.cc). The code is effectively that used in the LOMP implementation in src/stats-timing.cc, which is itself used by the micro-benchmarks in the LOMP microBM directory.
Why Worry About Timers?
For micro-benchmarks it is useful to have high-resolution, low-overhead timers, ideally ones which we can access in a single instruction. The most portable approach is to use std::chrono::steady_clock (following the advice to avoid std::chrono::high_resolution_clock), but we can see that it is implemented via calls into a runtime library, so it has non-trivial overhead (it will significantly affect register allocation and so on). It is therefore worth going straight to the hardware if we can.
Timer Properties
Before we look at the timer implementations let’s consider the properties we want from a timer:
Invariance: it always ticks at the same rate.
Monotonicity: it never runs backwards.
High resolution: it can resolve small time intervals.
Low interference: inserting the code to read the timer doesn’t hugely change the execution of the code being timed.
Low overhead: reading the timer is fast.
Synchronisation: it is synchronised between different logical CPUs (so that we can take the difference between time-points in different threads and get a sensible elapsed time).
If we can access the timer in a single instruction, that helps with interference and overhead, though we still have to be careful [1].
AArch64 Timer
In AArch64, we can use the mrs instruction to read the cntvct_el0 system register, which contains the timer count, and also to read the cntfrq_el0 register, which tells us the frequency at which the counter increments.
// Setup functions we need for accessing the high resolution
// clock
#define GENERATE_READ_SYSTEM_REGISTER(ResultType, FuncName, Reg)   \
  inline ResultType FuncName() {                                   \
    uint64_t Res;                                                  \
    __asm__ volatile("mrs \t%0," #Reg : "=r"(Res));                \
    return Res;                                                    \
  }

GENERATE_READ_SYSTEM_REGISTER(uint64_t, readCycleCount, cntvct_el0)
GENERATE_READ_SYSTEM_REGISTER(uint32_t, getHRFreq, cntfrq_el0)
#undef GENERATE_READ_SYSTEM_REGISTER
Aside from the issue that the rate at which this timer ticks seems relatively low [2] (“with a frequency typically in the range of 1MHz to 50MHz.”), this is all quite easy; the clock is monotonic, it is common across all of the available logical CPUs, we can read the clock easily, and we can also easily discover the rate at which it ticks.
x86_64 Timer
Things here are somewhat more “fun”. Reading the timer is still a single instruction (rdtsc), but obtaining its properties is harder, and those properties have evolved over time.
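For completeness, here’s one way to read the time-stamp counter from C++ (a sketch using inline asm; compilers also provide intrinsics such as __rdtsc):

```cpp
#include <cstdint>

#if defined(__x86_64__)
// rdtsc returns the low half of the 64-bit counter in eax and the
// high half in edx, so we glue the two halves back together.
static inline uint64_t readCycleCount() {
  uint32_t lo, hi;
  __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
  return (uint64_t(hi) << 32) | lo;
}
#endif
```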
In the first implementations the counter counted “CPU clocks”; however, since the CPU clock rate can be moved up and down by the power management system, that means it was not measuring elapsed, wall-clock, time. That changed a few years ago, but we must check whether that change is implemented on the CPU on which we’re running before using the timer for wall-clock time measurement.
There is also no simple, generally agreed upon, method to find out the rate at which the timer is incrementing, even when it is invariant.
Doing those checks and trying to extract the properties of the CPU brings us to the fun that is cpuid.
What is cpuid?
cpuid is the instruction that is used to obtain information about the details of the implementation of an x86_64 (or IA32) CPU. Unfortunately, although most vendors implement the instruction, the details of how to use it differ between them. Useful documents if you want to go deeper here are the Intel Software Developer’s Manual (volume 2A has the entry for the cpuid instruction) and the AMD64 Architecture Programmer’s Manual (see appendices D and E of Volume 3) [3].
The vendors do all provide a compatible way to discover which vendor’s CPU you are using, so that you can then choose the appropriate way to use cpuid to discover more information, and at least AMD* and Intel* do manage to have some common interfaces.
Here’s some simple code to give us low-level access to cpuid. This may also be available as a compiler intrinsic, but this asm code works with GCC and LLVM, at least.
/* cpuid fun. Here since we need to check the sanity of the
* time-stamp-counter.
*/
struct cpuid_t {
uint32_t eax;
uint32_t ebx;
uint32_t ecx;
uint32_t edx;
};
static inline void x86_cpuid(int leaf, int subleaf,
struct cpuid_t * p) {
__asm__ __volatile__("cpuid"
: "=a"(p->eax), "=b"(p->ebx),
"=c"(p->ecx), "=d"(p->edx)
: "a"(leaf), "c"(subleaf));
}
Once we have that we can extract the brand name like this :-
static std::string CPUBrandName() {
cpuid_t cpuinfo;
uint32_t intBuffer[4];
char * buffer = (char *)&intBuffer[0];
// All of the X86 vendors agree on this leaf.
// But, what you read here then determines how you
// should interpret other leaves.
x86_cpuid(0x00000000, 0, &cpuinfo);
intBuffer[0] = cpuinfo.ebx;
intBuffer[1] = cpuinfo.edx;
intBuffer[2] = cpuinfo.ecx;
buffer[12] = char(0);
return buffer;
}
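The longer model string (the “Model:” lines shown later) comes from leaves 80000002H–80000004H, which both vendors support. Here’s a sketch (the function name is mine, and the cpuid helper is repeated so the fragment stands alone; this is not the exact LOMP code):

```cpp
#include <cstdint>
#include <cstring>
#include <string>

#if defined(__x86_64__)
struct cpuid_t {
  uint32_t eax, ebx, ecx, edx;
};

static inline void x86_cpuid(uint32_t leaf, uint32_t subleaf,
                             struct cpuid_t * p) {
  __asm__ __volatile__("cpuid"
                       : "=a"(p->eax), "=b"(p->ebx),
                         "=c"(p->ecx), "=d"(p->edx)
                       : "a"(leaf), "c"(subleaf));
}

// Each of the three leaves returns sixteen characters of the model
// ("brand") string, in register order eax, ebx, ecx, edx.
static std::string CPUModelName() {
  char buffer[49]; // 3 * 16 characters plus a terminating NUL.
  for (int i = 0; i < 3; i++) {
    cpuid_t info;
    x86_cpuid(0x80000002 + i, 0, &info);
    memcpy(&buffer[16 * i], &info, 16);
  }
  buffer[48] = 0;
  return buffer;
}
#endif
```

A production version should first check via leaf 80000000H that leaf 80000004H actually exists, just as the invariant-TSC check below does for its leaf.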
Invariant TSC
Both AMD and Intel use cpuid leaf 80000007H, edx bit 8, to tell us whether the TSC clock measures time (is invariant) or counts CPU clock ticks. Of course, older processors may not even support this leaf, so we have to check that first!
static bool haveInvariantTSC() {
// These leaves are common to Intel and AMD.
cpuid_t cpuinfo;
// Does the leaf that can tell us that exist?
x86_cpuid(0x80000000, 0, &cpuinfo);
if (cpuinfo.eax < 0x80000007) {
// This processor cannot even tell us whether it
// has invariantTSC!
return false;
}
// At least the CPU can tell us whether it supports an
// invariant TSC.
x86_cpuid(0x80000007, 0, &cpuinfo);
return (cpuinfo.edx & (1 << 8)) != 0;
}
What is the rdtsc unit?
We’ve seen how to discover whether it is sane to use rdtsc for elapsed time, but we don’t yet know the time which each tick represents. Since it does rather matter whether “1” means “1s” or “1ns”, we need to find that out.
Intel have specified a cpuid leaf that gives us that information (leaf 15H); however, they only did so relatively recently, and I have yet to see a CPU that implements this. (The code checks for it, and will use it if it can, but it’s clearly not a general solution, and obviously hasn’t been tested :-)).
They do, though, provide a nominal frequency in their model name, for instance
Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
While there is no specification that requires that this is the same as the time-stamp counter rate, so far I haven’t seen an Intel processor where it is different.
AMD do not seem to have any way to find this out via cpuid (they do not encode it in the model name string, where they prefer to boast about the number of cores rather than the clock rate), so all we can do there is work it out for ourselves by comparing the count we get from rdtsc with another timer which we trust (i.e. std::chrono::steady_clock).
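That calibration can be as simple as bracketing a short sleep with both clocks and dividing. A sketch (the helper names are mine, and the rdtsc read is repeated so the fragment stands alone; this is not the exact LOMP code):

```cpp
#include <chrono>
#include <cstdint>
#include <thread>

#if defined(__x86_64__)
static inline uint64_t readCycleCount() {
  uint32_t lo, hi;
  __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi));
  return (uint64_t(hi) << 32) | lo;
}

// Estimate the rdtsc tick rate (in Hz) by comparing it with
// std::chrono::steady_clock over a short interval.
static double measureTSCFrequency() {
  using namespace std::chrono;
  auto startTime = steady_clock::now();
  uint64_t startTicks = readCycleCount();
  std::this_thread::sleep_for(milliseconds(100));
  uint64_t ticks = readCycleCount() - startTicks;
  double elapsed = duration<double>(steady_clock::now() - startTime).count();
  return double(ticks) / elapsed;
}
#endif
```

A longer interval gives a better estimate; 100ms is a compromise between accuracy and start-up time.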
What Do We See on the M1 Under x86_64 Emulation?
Well, what would you expect? There is no Intel silicon here, so what brand should it show?
Here’s what it does show:-
Brand: GenuineIntel
Model: VirtualApple @ 2.50GHz
That’s slightly unexpected, but you can see why Apple would want to claim to be GenuineIntel: existing code for x86_64 MacOS is very likely only to know how to decode Intel’s cpuid interface, since that is all it has probably seen. As the whole point of the emulation is to support such code without change, the Apple emulation wants to show that code what it expects.
As with a real Intel implementation it’s also telling us the nominal clock-rate, so we’re all done, right? Our existing timer code can use that and it’ll all “just work”. Before I answer that, let’s look at what we see in the native AArch64 environment.
What Do We See on the M1 in the AArch64 Environment?
As we saw above, the information about the timer here is simple to obtain, and we see this :-
From high resolution timer frequency (cntfrq_el0) 24.00 MHz => 41.67 ns
That seems entirely plausible, but it makes the emulated x86_64 environment look suspicious. Here we have a unit of ~42ns [4], but there we have one of 1/2.5GHz = 400ps. Since it seems unlikely that the emulated environment has access to a higher resolution clock than the underlying hardware, what we’re seeing there seems odd.
What’s Going On?
Rather than trusting the information we got from cpuid, we can check the units of the rdtsc clock by comparing it with std::chrono::steady_clock (as we have to do anyway on AMD).
If we do that we see something like this:-
Brand: GenuineIntel
Model: VirtualApple @ 2.50GHz
...
Sanity check against std::chrono::steady_clock gives frequency 999.98 MHz => 1.00 ns
So… although the brand name is GenuineIntel, the rdtsc clock unit is not the one we’d expect from the nominal CPU frequency in the model name. And, if we had believed that, the times we measured would all have been 2.5x too small!
However, that’s not all. Even 1ns is much smaller than the 41.67ns units used by the underlying hardware.
Difference between units and actual rate
What we’re actually seeing is that the units in which the clock time is measured are different from the rate at which it ticks, so each change in the clock is not one tick, but some larger number.
We can try to work out what that is by using code like this to see how small a change in tick count we can see:-
// Try to see whether the clock actually ticks at the same rate
// as its value is enumerated in. Consider a clock whose value
// is enumerated in seconds, but which only changes once an
// hour...
// Just because a clock has a fine interval, that doesn't mean
// it can resolve to that level.
static uint64_t measureClockGranularity() {
  // If the clock is very slow, this might not work...
  uint64_t delta = std::numeric_limits<uint64_t>::max();
  for (int i = 0; i < 50; i++) {
    uint64_t m1 = readCycleCount();
    uint64_t m2 = readCycleCount();
    uint64_t m3 = readCycleCount();
    uint64_t m4 = readCycleCount();
    uint64_t m5 = readCycleCount();
    uint64_t m6 = readCycleCount();
    uint64_t m7 = readCycleCount();
    uint64_t m8 = readCycleCount();
    uint64_t m9 = readCycleCount();
    uint64_t m10 = readCycleCount();
    auto d = (m2 - m1);
    if (d != 0)
      delta = std::min(d, delta);
    d = (m3 - m2);
    if (d != 0)
      delta = std::min(d, delta);
    d = (m4 - m3);
    if (d != 0)
      delta = std::min(d, delta);
    // ... Code elided to keep this example manageable ...
    // ... It computes all the other differences the same way ...
  }
  return delta;
}
If we add that to our code, we see what is really happening :-
x86_64 processor:
Brand: GenuineIntel
Model: VirtualApple @ 2.50GHz
Invariant TSC: True
cpuid leaf 15H is not supported
From measurement frequency 999.13 MHz => 1.00 ns
Sanity check against std::chrono::steady_clock gives frequency 999.98 MHz => 1.00 ns
Measured granularity = 41 ticks => 24.37 MHz, 41.04 ns
Which shows us that although the units in which time is measured are ns, the clock can only resolve 41ns at best, which aligns with the underlying hardware clock we saw on the AArch64 side:-
AArch64 processor:
From high resolution timer frequency (cntfrq_el0) 24.00 MHz => 41.67 ns
Sanity check against std::chrono::steady_clock gives frequency 23.90 MHz => 41.85 ns
Measured granularity = 1 tick => 24.00 MHz, 41.67 ns
We can also run the code on a variety of other x86_64 processors, to see what they do…
x86_64 processor:
Brand: GenuineIntel
Model: Intel(R) Xeon(R) Gold 6230 CPU @ 2.10GHz
Invariant TSC: True
cpuid leaf 15H does not give frequency
From model name string frequency 2.10 GHz => 476.19 ps
Sanity check against std::chrono::steady_clock gives frequency 2.09 GHz => 477.33 ps
Measured granularity = 60 ticks => 35.00 MHz, 28.57 ns
x86_64 processor:
Brand: GenuineIntel
Model: Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
Invariant TSC: True
cpuid leaf 15H is not supported
From model name string frequency 2.10 GHz => 476.19 ps
Sanity check against std::chrono::steady_clock gives frequency 2.10 GHz => 477.29 ps
Measured granularity = 42 ticks => 50.00 MHz, 20.00 ns
x86_64 processor:
Brand: AuthenticAMD
Model: AMD EPYC 7742 64-Core Processor
Invariant TSC: True
cpuid leaf 15H is not supported
From measurement frequency 2.25 GHz => 444.46 ps
Sanity check against std::chrono::steady_clock gives frequency 2.25 GHz => 444.43 ps
Measured granularity = 22 ticks => 102.27 MHz, 9.78 ns
That shows us that, as we expected, all of the Intel processors we’re testing do use the same units for the rdtsc time as the nominal frequency in their model name, but that on all of the x86_64 implementations the actual resolution of the timer is lower than the units in which it is measured. So, although the resolution of the M1’s clock is the lowest, it’s not as far out of line as it at first appeared to be.
What Did We Just Learn?
If you’re measuring time using rdtsc in the emulated x86_64 environment on the M1, be very careful. The times you’re seeing may be 2.5x smaller than reality!
The x86_64 emulation on the M1 can mislead you in places where the hardware behaviour is under-specified.
Even without considering the out-of-order intricacies, timers are more complicated than you might reasonably expect.
What’s Coming Next?
I’m not sure.
I’m about done with M1 stuff (at least until something else bites me), so maybe something on the memory behaviour of a variety of machines that affects broadcasts, barriers, locks and so on [5].
[1] I am not going to go into the intricacies of exactly what it means to insert a timer in an out-of-order processor’s instruction stream, because it’s slightly off-topic, and discussing it made this blog too long.
[2] This aspect is being fixed by Arm; they have a newer specification which sets the frequency at 1GHz. (See Developments in the Arm A-Profile Architecture: Armv8.6-A.)
[3] I couldn’t find these architecture documents online in HTML, just as PDFs, which are harder to reference.
[4] It seems unlikely that this 42 is “The answer to life, the universe, and everything”, but maybe it is!
On big.LITTLE systems cntfrq returns the counter frequency, but that of the biggest processor. For instance, with the RK3588, we are reading the frequency of the A76 on the A55 cores!
How can we improve accuracy?
I see a lot of examples on the internet for AArch64 where an isb instruction is placed before (or before and after) the mrs instruction to read the cntvct register, such as:
asm volatile("isb; mrs %0, cntvct_el0" : "=r" (ticks));
or
asm volatile("isb; mrs %0, cntvct_el0; isb" : "=r" (ticks));
When I try this, I get unusual results. For long periods the ticks value does not increment, and at other times it goes backwards (the values get smaller). I’m testing on an Apple M1.