Re: Strange: Rosetta faster than M1


Gerriet M. Denkmann
 

On 23 Sep 2022, at 09:28, Jack Brindle via groups.io <jackbrindle@...> wrote:

For those of us not in that forum, what was the explanation? You have our attention with this one...
Steve Cannon explained that:
• Arm has “madd”, which does a multiply and an add in a single instruction, but it has a latency of more than one cycle.
• x86_64 has no equivalent integer multiply-add instruction.

Using multiply-add is generally a good idea (that’s why the Arm compiler chose it), but in a very short loop (as in my case) the added latency sadly makes things slower.
And so the native M1 code ended up being 60% slower than x86_64 + Rosetta.
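
For illustration, a minimal sketch of the pattern (the function and names are hypothetical, not the benchmark code itself):

#include <stdint.h>

/* Each iteration reads the previous value of acc, so the
   multiply-accumulate forms a loop-carried dependency chain.
   On AArch64, clang typically folds it into a single
   "madd acc, factor, elem, acc", which puts the madd latency
   (several cycles) on the critical path. On x86_64 the imul
   does not depend on acc; only a one-cycle add is in the chain. */
uint64_t accumulate(const uint32_t *a, uint32_t n, uint64_t factor)
{
    uint64_t acc = 0;
    for (uint32_t i = 0; i < n; i++)
        acc += factor * a[i];
    return acc;
}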

Maybe the optimiser for Apple Silicon should consider not using “madd” in very short loops.



On Sep 22, 2022, at 9:26 PM, Gerriet M. Denkmann <gerriet@...> wrote:



On 23 Sep 2022, at 07:51, Quincey Morris <quinceymorris@...> wrote:

OK, this is embarrassing. I’m so used to looking at Swift code these days that my brain just automatically translated your code into Swift, and I never realized it was C. So the Swift forums aren’t where I should have suggested, although I see over there that you might be getting some kind of answer anyway. It’s an interesting question.
As you see, I got an excellent, in-depth answer and also some very helpful tips for improvements.
So there is no reason to be embarrassed at all.
Rather, I have to thank you for a very good and fruitful suggestion.

Gerriet.



On Sep 21, 2022, at 17:49, Quincey Morris <quinceymorris@...> wrote:

I really think you should start by asking this question over in the Swift forums, in case there is some compiler-specific answer that can immediately explain the result. It could be a compiler code generation deficiency, but there are many other possibilities. For example, a colleague of mine speculated that there could be reasons why the M1 Mac ran the Rosetta translation on a *performance* core, but ran the Apple Silicon version on an *efficiency* core.
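
One way to test the core-placement hypothesis (a sketch; pthread_set_qos_class_self_np is a real macOS API, but the surrounding code is hypothetical):

#include <pthread/qos.h>
#include <stdio.h>

int main(void)
{
    /* Requesting the highest QoS class strongly biases the
       scheduler toward the performance (P) cores on Apple Silicon.
       Comparing timings with and without this call would show
       whether core placement explains the difference. */
    if (pthread_set_qos_class_self_np(QOS_CLASS_USER_INTERACTIVE, 0) != 0)
        fprintf(stderr, "could not set QoS class\n");

    /* ... run the benchmark here ... */
    return 0;
}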

You can also investigate the performance yourself, by running (say) the Time Profiler template in Instruments. It’s unclear which instrument might provide informative results, so you might need to try a couple of different templates, focussing on different things.

This might ultimately be an Apple support question, but I’d imagine there are numerous people on the Swift forums who’d enjoy puzzling out the answer. :)

On Sep 20, 2022, at 22:59, Gerriet M. Denkmann <gerriet@...> wrote:



On 20 Sep 2022, at 19:42, Alex Zavatone via groups.io <zav@...> wrote:

It might seem like a primitive approach, but logging with time stamps should be able to highlight where the suckiness is. Run a log that displays the time delta from the last logging statement, so that you are only looking at the deltas. Then run each version and see where the slowness is. That should tell you, right?
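
A minimal sketch of that kind of delta logging (assuming POSIX clock_gettime, which macOS has had since 10.12; the stage labels are hypothetical):

#include <stdio.h>
#include <time.h>

static double last_ts;

/* Monotonic clock, in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

/* Prints the time elapsed since the previous call, so the log
   shows only the deltas between stages. */
static void log_delta(const char *label)
{
    double t = now_seconds();
    printf("%-16s +%.6f s\n", label, last_ts ? t - last_ts : 0.0);
    last_ts = t;
}

int main(void)
{
    log_delta("start");
    /* ... stage of the code under test ... */
    log_delta("after stage 1");
    return 0;
}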
I did this:
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t limb;
typedef uint64_t bigLimb;

int main(void)
{
    const unsigned len = 50000;
    const int shiftLimb = sizeof(limb) * 8;   /* 32: one limb's worth of bits */

    limb *someArray = malloc( len * sizeof(limb) );
    for (unsigned i = 0; i < len; i++)
        someArray[i] = i;   /* initialise: reading malloc'ed memory uninitialised is undefined */

    bigLimb someBig = 0;

    for (bigLimb factor = 1; factor < len; factor++ )
    {
        for (unsigned idx = 0 ; idx < len ; idx++)
        {
            /* multiply-accumulate: clang emits "madd" on AArch64 */
            someBig += factor * someArray[idx] ;
            someArray[idx] = (limb)(someBig);   /* keep the low limb */
            someBig >>= shiftLimb;              /* carry into the next limb */
        }
    }

    free( someArray );
    return 0;
}

and ran it in Release mode (-Os = Fastest, Smallest).
(In Debug mode (-O0), Rosetta time equals M1 time.)

with "someBig >>= shiftLimb”:
Rosetta M1 Rosetta time / M1 time
1.8 3.35 0.54
without the shift:
1.32 0.924 1.43

So it seems that Rosetta optimizes shifts way better than Apple Silicon.

Which kind of looks like a bug.

Gerriet.