More M1 fun: Hardware Information

Where are we and how do we know?

Introduction

In the last post, I discussed some of the features of the Apple M1/MacOS environment, why it is more complicated than you might hope, and one route to being able to build native (AArch64) code.

Here I’ll discuss some hardware properties that are different on the M1 from on x86_64 machines (and from some other AArch64 machines) which may affect your code.

Hardware Properties

If you are used to Linux, you probably expect to be able to find out a lot about your machine by looking in /proc/cpuinfo, or using lscpu. For instance, on the Isambard1 login node, we see this :-

-bash-4.2$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                36
On-line CPU(s) list:   0-35
Thread(s) per core:    1
Core(s) per socket:    18
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2695 v4 @ 2.10GHz
Stepping:              1
CPU MHz:               1199.963
CPU max MHz:           3300.0000
CPU min MHz:           1200.0000
BogoMIPS:              4190.39
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              46080K
NUMA node0 CPU(s):     0-17
NUMA node1 CPU(s):     18-35
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch epb cat_l3 cdp_l3 invpcid_single intel_ppin intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts

However, lscpu is not supported on MacOS (at least, not as far as my Google-fu can discover); that is hardly surprising, since it is effectively a summary of /proc/cpuinfo, which is a Linux invention.

What we do have on MacOS is the sysctl command, which will show us similar information (though in a less compact form). Here we’re asking for hardware and machine information:-

$ sysctl -a hw machdep.cpu
hw.ncpu: 8
hw.byteorder: 1234
hw.memsize: 8589934592
hw.activecpu: 8
hw.physicalcpu: 8
hw.physicalcpu_max: 8
hw.logicalcpu: 8
hw.logicalcpu_max: 8
hw.cputype: 16777228
hw.cpusubtype: 2
hw.cpu64bit_capable: 1
hw.cpufamily: 458787763
hw.cpusubfamily: 2
hw.cacheconfig: 8 1 1 0 0 0 0 0 0 0
hw.cachesize: 3660136448 65536 4194304 0 0 0 0 0 0 0
hw.pagesize: 16384
hw.pagesize32: 16384
hw.cachelinesize: 128
hw.l1icachesize: 131072
hw.l1dcachesize: 65536
hw.l2cachesize: 4194304
hw.tbfrequency: 24000000
hw.packages: 1
hw.osenvironment: 
hw.ephemeral_storage: 0
hw.use_recovery_securityd: 0
hw.use_kernelmanagerd: 1
hw.serialdebugmode: 0
hw.optional.floatingpoint: 1
hw.optional.watchpoint: 4
hw.optional.breakpoint: 6
hw.optional.neon: 1
hw.optional.neon_hpfp: 1
hw.optional.neon_fp16: 1
hw.optional.armv8_1_atomics: 1
hw.optional.armv8_crc32: 1
hw.optional.armv8_2_fhm: 1
hw.optional.armv8_2_sha512: 1
hw.optional.armv8_2_sha3: 1
hw.optional.amx_version: 2
hw.optional.ucnormal_mem: 1
hw.optional.arm64: 1
hw.targettype: J274
machdep.cpu.cores_per_package: 8
machdep.cpu.core_count: 8
machdep.cpu.logical_per_package: 8
machdep.cpu.thread_count: 8
machdep.cpu.brand_string: Apple M1

This information is also available via the sysctl family of system calls and, in particular, the sysctlbyname call. So, where on an x86 Linux machine we might either have used the cpuid instruction or parsed /proc/cpuinfo ourselves to find the CPU model/brand name, here we can use code like this to extract it :-

$ cat cpuname.cc
#include <sys/sysctl.h>
#include <string>
#include <stdio.h>

std::string getModel() {
  char buffer[64]; /* Should be long enough for any brand string! */
  size_t len = sizeof(buffer);

  /* sysctlbyname fills the buffer and updates len; it fails
   * (returning -1) if the buffer is too small. */
  if (sysctlbyname("machdep.cpu.brand_string", &buffer[0], &len, 0, 0) == 0) {
    return &buffer[0];
  }
  return "**UNKNOWN**";
}

int main (int , char ** ) {
  printf("Running on '%s'\n", getModel().c_str());
  return 0;
}
$ g++ cpuname.cc
$ ./a.out
Running on 'Apple M1'
$

Notable Differences

You can see that sysctl is telling us some information that wasn’t obvious in the lscpu information. In particular:-

hw.pagesize: 16384 
hw.cachelinesize: 128

On x86_64 the standard cacheline size is 64B (even though this is not architecturally required, all the implementations stick to it), and the default pagesize is 4KiB. Similarly, some other AArch64 processors (such as the Marvell ThunderX2) also follow the x86_64 style with 64B cachelines and 4KiB pages. Therefore much code is written assuming that 64B is the natural and only size a cacheline can be. Here we can see that that presumption is wrong!

Why should I care?

The different cacheline size means that code which is trying to optimise data placement, either by ensuring that items share a cacheline or by ensuring that they are in different lines, will almost certainly need to be recompiled with a different cacheline size specified if it is to achieve the highest performance. That affects things such as parallel runtime libraries, where the layout of data is important when building fast barriers, or implementing locks.

Similarly, if allocation of data to VM pages is important, the code has to know the appropriate page size.

These differences between the underlying hardware and the x86_64 emulation also mean that, when these optimisations matter, code running under the x86_64 emulation will not perform as well as it could, since it will be assuming a 64B line size while the underlying hardware uses 128B lines. Of course, for the highest performance you would not want to be executing under the emulation at all, but just tweaking this might help a little.

Topology

For optimising parallel code, it is important to know about the NUMA properties of the machine, as well as the caches. Luckily Brice and his team at Inria have already ported hwloc to MacOS and it is available as a native executable via brew, so in this case the tools we expect are available.

Here is how lstopo sees my Mac Mini M1. You can see that it is a single NUMA domain with 8 cores, no Simultaneous-MultiThreading, and the two levels of cache that we also saw in the sysctl output above.

What Did We Just Learn?

  1. The ways to find out about the hardware on MacOS are not the same as those on Linux.

  2. The hardware properties of the M1 differ from those of x86_64 machines in ways other than the instruction set, and that matters too.

  3. We can find out what we need to know.

What’s Coming Next

This post seemed long enough without the discussion I had intended to put here about an interesting feature of the C/C++ ABI on MacOS/AArch64 (where it differs from Linux/AArch64 as well as from other architectures). Therefore I’ll give you that in the next post!

1

This work used the Isambard UK National Tier-2 HPC Service (http://gw4.ac.uk/isambard/), operated by GW4 and the UK Met Office, and funded by EPSRC (EP/P020224/1).