Strange: Rosetta faster than M1


Gary L. Wade
 

If you haven’t already, submit a feedback request with your source, Xcode version, and platform used that demonstrate the issue.
--
Gary


Chris Ridd
 

On 22 Sep 2022, at 01:49, Quincey Morris <quinceymorris@...> wrote:

I really think you should start by asking this question over in the Swift forums, in case there is some compiler-specific answer that can immediately explain the result. It could be a compiler code generation deficiency, but there are many other possibilities. For example, a colleague of mine speculated that there could be reasons why the M1 Mac ran the Rosetta translation on a *performance* core, but ran the Apple Silicon version on an *efficiency* core.
I was wondering about the P/E cores too. Perhaps you can avoid that difference by using GCD + QoS to explicitly run the calculation on the desired kind of core. Though I guess we don’t actually know if Rosetta always runs high-QoS Intel threads on high-QoS M1 threads.

But yes, Gerriet’s good spot about the use of >>= looks like something for the compiler/codegen folks.

Chris
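
A minimal sketch of the QoS idea above, not something posted in the thread. It swaps GCD for a direct call that sets the QoS class of the current thread, which keeps the example short. The helper name request_qos is hypothetical. QoS is only a hint to the scheduler, so this does not guarantee a particular kind of core, and on platforms other than macOS the helper does nothing.

```c
#if defined(__APPLE__)
#include <pthread.h>
#include <pthread/qos.h>
#endif

/* Hypothetical helper: asks the system to treat the calling thread as
   high-priority (want_performance != 0) or background work.
   Returns 0 on success, an errno-style code on failure.
   Outside macOS it is a no-op that reports success. */
int request_qos(int want_performance)
{
#if defined(__APPLE__)
    qos_class_t cls = want_performance ? QOS_CLASS_USER_INTERACTIVE
                                       : QOS_CLASS_BACKGROUND;
    return pthread_set_qos_class_self_np(cls, 0);
#else
    (void)want_performance;
    return 0;
#endif
}
```

Calling request_qos(1) at the start of main in both the native and the Rosetta build, then timing again, would be one way to check whether core selection plays a role.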


Gerriet M. Denkmann
 

On 23 Sep 2022, at 09:28, Jack Brindle via groups.io <jackbrindle@...> wrote:

For those of us not in that forum, what was the explanation? You have our attention with this one...
Steve Cannon explained that:
• Arm has “madd”, which does multiply and add in one go, but it has a latency of more than one cycle.
• x86_64 doesn't have a multiply-add instruction.

Using multiply-add is generally a good idea (that’s why the Arm compiler chooses it), but in a very short loop (as in my case) the added latency sadly makes things slower.
And so the native M1 code ended up being 60 % slower than x86_64 + Rosetta.

Maybe the optimiser for Apple Silicon should consider not using “madd” in very short loops.
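
The following is a sketch of how the explanation above could be explored at the source level; it was not posted in the thread. The function names mul_pass_fused and mul_pass_split are hypothetical. The second variant passes the product through an empty inline-asm statement (a GCC/Clang extension), which makes the product opaque to the optimiser and so is intended to stop it from fusing the multiply and the add. Whether that actually changes the generated instructions or the timing on a given compiler has not been measured here.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint32_t limb;
typedef uint64_t bigLimb;

/* The inner loop from the thread. The compiler is free to combine the
   multiply and the add into a single multiply-add on arm64.
   Assumes factor < 2^32 so the 64-bit accumulator cannot overflow. */
bigLimb mul_pass_fused(limb *a, size_t len, bigLimb factor)
{
    bigLimb carry = 0;
    for (size_t i = 0; i < len; i++) {
        carry += factor * a[i];
        a[i] = (limb)carry;   /* keep the low 32 bits */
        carry >>= 32;         /* high 32 bits carry into the next limb */
    }
    return carry;
}

/* Same arithmetic, but the product is computed separately and hidden
   behind an empty asm statement, so the multiply does not have to wait
   for the carry from the previous iteration. */
bigLimb mul_pass_split(limb *a, size_t len, bigLimb factor)
{
    bigLimb carry = 0;
    for (size_t i = 0; i < len; i++) {
        bigLimb product = factor * a[i];
        __asm__ volatile("" : "+r"(product)); /* optimisation barrier only */
        carry += product;
        a[i] = (limb)carry;
        carry >>= 32;
    }
    return carry;
}
```

Both functions produce identical results; timing them against each other on the native build would show whether the multiply-add latency is really what dominates the loop.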




Jack Brindle
 

For those of us not in that forum, what was the explanation? You have our attention with this one...

Jack


Gerriet M. Denkmann
 

On 23 Sep 2022, at 07:51, Quincey Morris <quinceymorris@...> wrote:

OK, this is embarrassing. I’m so used to looking at Swift code these days, that my brain just automatically translated your code into Swift, and I never realized it was C. So the Swift forums aren’t where I should have suggested — although I see over there that you might be getting some kind of answer anyway. It’s an interesting question.
As you see, I got an excellent in-depth answer and also some very helpful tips for improvements.
So no reason to be embarrassed at all.
Rather, I have to thank you for a very good and fruitful suggestion.

Gerriet.




Quincey Morris
 

OK, this is embarrassing. I’m so used to looking at Swift code these days, that my brain just automatically translated your code into Swift, and I never realized it was C. So the Swift forums aren’t where I should have suggested — although I see over there that you might be getting some kind of answer anyway. It’s an interesting question.

Does this mean that Swift is some kind of cult that has taken over my brain? Am I auto-translating the entire world into Swift? Can I get rehab for this?

Admittedly, the people who might be interested in this overlap pretty well with Swift engineers, but I’m sorry I didn’t make a better suggestion.


Quincey Morris
 

I really think you should start by asking this question over in the Swift forums, in case there is some compiler-specific answer that can immediately explain the result. It could be a compiler code generation deficiency, but there are many other possibilities. For example, a colleague of mine speculated that there could be reasons why the M1 Mac ran the Rosetta translation on a *performance* core, but ran the Apple Silicon version on an *efficiency* core.

You can also investigate the performance yourself, by running (say) the Time Profiler template in Instruments. It’s unclear which instrument might provide informative results, so you might need to try a couple of different templates, focussing on different things.

This might ultimately be an Apple support question, but I’d imagine there are numerous people on the Swift forums who’d enjoy puzzling out the answer. :)


Gerriet M. Denkmann
 

On 20 Sep 2022, at 19:42, Alex Zavatone via groups.io <zav@...> wrote:

It might seem like a primitive approach, but logging with time stamps should be able to highlight where the suckyness is. Run a log that displays the time delta from the last logging statement so that you are only looking at the deltas. Then run each version and see where the slowness is. That should tell you, right?
I did this:
typedef uint32_t limb;
typedef uint64_t bigLimb;

const uint len = 50000;
const int shiftLimb = sizeof(limb) * 8;

limb *someArray = malloc( len * sizeof(limb) );
bigLimb someBig = 0;

for (bigLimb factor = 1; factor < len; factor++ )
{
    for (uint idx = 0 ; idx < len ; idx++)
    {
        someBig += factor * someArray[idx] ;
        someArray[idx] = (limb)(someBig);
        someBig >>= shiftLimb;
    }
}

and ran it in Release mode (-Os = Fastest, Smallest).
(In Debug mode (-O0), Rosetta time = M1 time.)

                                  Rosetta   M1      Rosetta time / M1 time
with "someBig >>= shiftLimb":     1.8       3.35    0.54
without the shift:                1.32      0.924   1.43

So it seems that Rosetta optimizes shifts way better than Apple Silicon.

Which kind of looks like a bug.

Gerriet.


Alex Zavatone
 

It might seem like a primitive approach, but logging with time stamps should be able to highlight where the suckyness is. Run a log that displays the time delta from the last logging statement so that you are only looking at the deltas. Then run each version and see where the slowness is. That should tell you, right?

Cheers,
Alex Zavatone


Gerriet M. Denkmann
 

On 19 Sep 2022, at 19:21, Tom Landrum <tomlandrum@...> wrote:

Perhaps something is cached from the first run? Do you get the same results if you reverse the order?
The results are stable and repeatable and do not depend on the order.

Gerriet.


Gerriet M. Denkmann
 

On 19 Sep 2022, at 20:07, Alex Zavatone via groups.io <zav@...> wrote:

What is it doing? We would need to compare the tasks it’s doing so that we could estimate where any slowness would be.
It computes the factorial of 300 000 (i.e. 1 * 2 * 3 * … * 299 999 * 300 000), a somewhat large integer.


Have you profiled it using Xcode’s profiler?
Yes, but I could not see anything unusual.

Can you add log statements so that you can see where the operations are slowed down?
Probably not. But maybe you could tell me how I can persuade Instruments to use the Rosetta version?
Then maybe I could see where the two versions differ.

Gerriet.
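
For readers who want something concrete to profile, here is one way a factorial over 32-bit limbs might be written. It is a sketch under assumptions, not the actual tool from this thread: the function name factorial_limbs, the growth strategy and the little-endian limb order are all illustrative choices.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint32_t limb;
typedef uint64_t bigLimb;

/* Computes n! as a little-endian array of 32-bit limbs.
   On success stores a malloc'ed array in *out (caller frees) and
   returns the number of limbs used; returns 0 if allocation fails. */
size_t factorial_limbs(uint32_t n, limb **out)
{
    size_t cap = 16, used = 1;
    limb *a = malloc(cap * sizeof *a);
    if (a == NULL) return 0;
    a[0] = 1;

    for (bigLimb factor = 2; factor <= n; factor++) {
        bigLimb carry = 0;
        /* Multiply the whole number by factor, limb by limb. */
        for (size_t i = 0; i < used; i++) {
            carry += factor * a[i];
            a[i] = (limb)carry;
            carry >>= 32;
        }
        /* A non-zero carry becomes a new most significant limb. */
        if (carry != 0) {
            if (used == cap) {
                limb *grown = realloc(a, 2 * cap * sizeof *a);
                if (grown == NULL) { free(a); return 0; }
                a = grown;
                cap *= 2;
            }
            a[used++] = (limb)carry;
        }
    }
    *out = a;
    return used;
}
```

The inner loop has the same multiply, add, truncate and shift pattern that the later messages in this thread measure.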


Glenn L. Austin
 

It's a bit unusual for emulated code to run faster, but not impossible.

Depending upon the operation, the order of execution could end up pre-loading values or accessing devices in such a way that the emulated code isn't blocked waiting for a device/memory while the non-emulated code has to wait.

-- 
Glenn L. Austin, Computer Wizard and Race Car Driver         <><
<http://www.austinsoft.com>



Tom Landrum
 

Perhaps something is cached from the first run? Do you get the same results if you reverse the order?

Tom


Alex Zavatone
 

What is it doing? We would need to compare the tasks it’s doing so that we could estimate where any slowness would be.

Have you profiled it using Xcode’s profiler? Can you add log statements so that you can see where the operations are slowed down?

Happy Monday,
Alex Zavatone


Gerriet M. Denkmann
 

I have a simple C command line tool.

This takes 16 seconds when I run it in Xcode with “My Mac”.
But only 11 seconds when run with “My Mac (Rosetta)”

How can this be?
I always assumed that emulating Intel code on an M1 must of course be slower than running native M1 code directly.

Am I making some silly mistake?

Gerriet.