Re: Strange: Rosetta faster than M1

Gerriet M. Denkmann

On 23 Sep 2022, at 07:51, Quincey Morris <quinceymorris@...> wrote:

OK, this is embarrassing. I’m so used to looking at Swift code these days, that my brain just automatically translated your code into Swift, and I never realized it was C. So the Swift forums aren’t where I should have suggested, although I see over there that you might be getting some kind of answer anyway. It’s an interesting question.
As you see, I got an excellent in-depth answer and also some very helpful tips for improvements.
So no reason to be embarrassed at all.
Rather I have to thank you for a very good and fruitful suggestion.


On Sep 21, 2022, at 17:49, Quincey Morris <quinceymorris@...> wrote:

I really think you should start by asking this question over in the Swift forums, in case there is some compiler-specific answer that can immediately explain the result. It could be a compiler code generation deficiency, but there are many other possibilities. For example, a colleague of mine speculated that there could be reasons why the M1 Mac ran the Rosetta translation on a *performance* core, but ran the Apple Silicon version on an *efficiency* core.

You can also investigate the performance yourself, by running (say) the Time Profiler template in Instruments. It’s unclear which instrument might provide informative results, so you might need to try a couple of different templates, focussing on different things.

This might ultimately be an Apple support question, but I’d imagine there are numerous people on the Swift forums who’d enjoy puzzling out the answer. :)

On Sep 20, 2022, at 22:59, Gerriet M. Denkmann <gerriet@...> wrote:

On 20 Sep 2022, at 19:42, Alex Zavatone via <zav@...> wrote:

It might seem like a primitive approach, but logging with time stamps should be able to highlight where the suckyness is. Run a log that displays the time delta from the last logging statement so that you are only looking at the deltas. Then run each version and see where the slowness is. That should tell you, right?
I did this:
typedef uint32_t limb;
typedef uint64_t bigLimb;

const uint len = 50000;
const int shiftLimb = sizeof(limb) * 8;

limb *someArray = malloc( len * sizeof(limb) );
bigLimb someBig = 0;

for (bigLimb factor = 1; factor < len; factor++ )
{
    for (uint idx = 0 ; idx < len ; idx++)
    {
        someBig += factor * someArray[idx] ;
        someArray[idx] = (limb)(someBig);
        someBig >>= shiftLimb;
    }
}

and ran it in Release mode (-Os = Fastest, Smallest).
(In Debug mode (-O0), Rosetta time = M1 time.)

                                 Rosetta    M1       Rosetta time / M1 time
with "someBig >>= shiftLimb":    1.8        3.35     0.54
without the shift:               1.32       0.924    1.43

So it seems that Rosetta optimizes shifts way better than Apple Silicon.

Which kind of looks like a bug.
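For anyone who wants to reproduce the comparison, here is a minimal self-contained sketch of the loops above. Several things in it are assumptions rather than part of the original test: the names `run_loops` and `time_run` are made up, the array is zero-initialised with the first limb set to 1 (the original reads uninitialised `malloc` memory), the shift is switched by a runtime flag instead of being removed from the source, and `clock()` measures CPU time rather than wall-clock time. Any of these may change what the compiler generates, so the timings need not match the ones quoted above.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

typedef uint32_t limb;
typedef uint64_t bigLimb;

/* The nested loops from the post. With the shift enabled, this multiplies
   the multi-limb number in `array` by every factor from 1 to len - 1.
   Returns whatever is left in the accumulator. */
static bigLimb run_loops(limb *array, uint32_t len, int withShift)
{
    const int shiftLimb = sizeof(limb) * 8;
    bigLimb someBig = 0;

    assert(array != NULL);
    for (bigLimb factor = 1; factor < len; factor++) {
        for (uint32_t idx = 0; idx < len; idx++) {
            someBig += factor * array[idx];
            array[idx] = (limb)someBig;
            if (withShift) someBig >>= shiftLimb;
        }
    }
    return someBig;
}

/* CPU seconds for one run over a fresh array; -1 if len is 0 or allocation fails. */
static double time_run(uint32_t len, int withShift)
{
    if (len == 0) return -1.0;
    limb *array = calloc(len, sizeof(limb));
    if (array == NULL) return -1.0;
    array[0] = 1;                       /* assumption: start from the number 1 */

    clock_t start = clock();
    volatile bigLimb sink = run_loops(array, len, withShift);
    clock_t end = clock();
    (void)sink;                         /* keeps the result observably used */

    free(array);
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling `time_run(50000, 1)` and `time_run(50000, 0)` from `main`, once in an x86_64 build run under Rosetta and once in a native arm64 build, gives the four numbers for the table. As a sanity check, running `run_loops` with the shift on 14 limbs starting from 1 should leave 13! in the array.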

