On big.LITTLE systems, cntfrq returns the counter frequency, but it is that of the fastest core.
For instance, on the RK3588, the A55 cores report the frequency of the A76 cores!
How can I improve accuracy?
That behaviour seems entirely sane to me. Since a thread may migrate between the two cores, you want to have both cores deliver time in the same units, which should be the smaller ticks that make sense on the faster core. AFAICT that is the behaviour you are describing.
Of course, the actual period that can be resolved by the physical timer on each core may be different, but the units in which it is expressed, and, we hope, the starting point for the clocks want to be the same everywhere.
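To make the "same units everywhere" point concrete, here is a minimal sketch of the conversion a program would apply to counter deltas. It assumes the frequency value reported by the counter is the one to divide by, whichever core the thread happens to be on. The function name ticks_to_ns is hypothetical, and the 24 MHz figure in the comment is only an example value.

```c
#include <assert.h>
#include <stdint.h>

/* Convert a tick count to nanoseconds, given the counter frequency in Hz.
 * Splitting into whole seconds and a remainder avoids overflowing 64 bits
 * for any realistic frequency (the remainder term is safe below ~18 GHz). */
static uint64_t ticks_to_ns(uint64_t ticks, uint64_t freq_hz)
{
    assert(freq_hz != 0);
    uint64_t secs = ticks / freq_hz;
    uint64_t rem  = ticks % freq_hz;
    return secs * 1000000000ull + (rem * 1000000000ull) / freq_hz;
}

/* Example: at 24 MHz, 12 ticks correspond to 500 ns. */
```

Because the units are common to all cores, the same call works for a delta whose start and end were read on different cores.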
I see a lot of examples on the internet for AArch64 where an isb instruction is placed before (or before and after) the mrs instruction to read the cntvct register, such as:
asm volatile("isb; mrs %0, cntvct_el0" : "=r" (ticks));
or
asm volatile("isb; mrs %0, cntvct_el0; isb" : "=r" (ticks));
When I try this, I get unusual results. For long periods the ticks value does not increment, and at other times it goes backwards (values get smaller). I'm testing on an Apple M1.
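One way to narrow this down is to count how often consecutive reads go backwards, repeat, or advance. The sketch below is only a diagnostic, not a fix. It assumes ticks is a 64-bit variable (the snippets above do not show its declaration), and the names read_ticks, tick_stats and sample_ticks are made up for illustration. On non-AArch64 targets it falls back to clock_gettime so that it still compiles.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Read a tick counter into a 64-bit value. */
static inline uint64_t read_ticks(void)
{
#if defined(__aarch64__)
    uint64_t ticks;
    /* isb first, so the counter read is not executed ahead of earlier code */
    __asm__ volatile("isb\n\tmrs %0, cntvct_el0" : "=r"(ticks) : : "memory");
    return ticks;
#else
    /* Fallback for other targets: monotonic clock in nanoseconds */
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
#endif
}

struct tick_stats {
    uint64_t backwards; /* reading smaller than the previous one */
    uint64_t repeats;   /* reading equal to the previous one */
    uint64_t advances;  /* reading larger than the previous one */
};

/* Take n back-to-back readings and classify each against its predecessor. */
static struct tick_stats sample_ticks(unsigned n)
{
    struct tick_stats s = {0, 0, 0};
    uint64_t prev = read_ticks();
    for (unsigned i = 0; i < n; i++) {
        uint64_t now = read_ticks();
        if (now < prev)
            s.backwards++;
        else if (now == prev)
            s.repeats++;
        else
            s.advances++;
        prev = now;
    }
    return s;
}
```

Many repeats are expected if the counter ticks much more slowly than the core clock. A non-zero backwards count would be the surprising part and worth reporting along with how ticks is declared and printed.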
Note, "Intel® 64 and IA-32 Architectures, Software Developer’s Manual, Volume 3B: System Programming Guide, Part 2" does say that:
> On certain processors, the TSC frequency may not be the same as the frequency in the brand string.
It seems that rdtsc does a pipeline flush on most CPUs. It would have been nice to have one which didn't and also had an input register for locating it in the scoreboard. Decode, issue and retire times would also be handy, if only to 32 bit resolution.
My impression is that the documentation for rdtsc (as opposed to rdtscp) is rather vague about what instruction fencing is enforced.
There is information at https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf (rather old) which suggests one approach. I believe that only cpuid is guaranteed to be an instruction fence.
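As I read that paper, the pattern is cpuid followed by rdtsc at the start of the timed region, and rdtscp followed by cpuid at the end. A rough user-space sketch for GCC or Clang on x86-64 follows. The names stamp_begin and stamp_end are hypothetical, and the paper itself measures inside a kernel module with interrupts disabled, so a user-space version like this is still exposed to preemption and migration. The non-x86 branch exists only so the sketch compiles elsewhere.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__x86_64__)
/* Start stamp: cpuid serializes, then rdtsc reads the counter. */
static inline uint64_t stamp_begin(void)
{
    uint32_t lo = 0, hi; /* lo doubles as cpuid leaf 0 on input */
    __asm__ volatile("cpuid\n\t"
                     "rdtsc"
                     : "+a"(lo), "=d"(hi)
                     :
                     : "rbx", "rcx", "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* End stamp: rdtscp waits for earlier instructions, then cpuid keeps
 * later instructions from starting before the counter has been read. */
static inline uint64_t stamp_end(void)
{
    uint32_t lo, hi;
    __asm__ volatile("rdtscp\n\t"
                     "mov %%eax, %0\n\t"
                     "mov %%edx, %1\n\t"
                     "xor %%eax, %%eax\n\t"
                     "cpuid"
                     : "=&r"(lo), "=&r"(hi)
                     :
                     : "rax", "rbx", "rcx", "rdx", "cc", "memory");
    return ((uint64_t)hi << 32) | lo;
}
#else
/* Fallback so the sketch builds on other targets; not a cycle counter. */
static inline uint64_t stamp_begin(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
static inline uint64_t stamp_end(void)
{
    return stamp_begin();
}
#endif
```

The code under test goes between a stamp_begin() call and a stamp_end() call, and the difference is the elapsed count. The cpuid overhead is large, so it is worth measuring an empty region first and subtracting that.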
From my POV, I am more interested in timing data transfers, so memory fencing is more interesting for my tests than instruction fencing. But, given the apparent resolution, timing single operations is hard anyway.
Anandtech clearly manage something, though, for instance https://www.anandtech.com/show/16535/intel-core-i7-11700k-review-blasting-off-with-rocket-lake/3