Strange: Rosetta faster than M1
Gerriet M. Denkmann
I have a simple C command line tool.
This takes 16 seconds when I run it in Xcode with “My Mac”,
but only 11 seconds when run with “My Mac (Rosetta)”.
How can this be?
I always assumed that emulating Intel code on an M1 must of course be slower than running native M1 code directly.
Am I making some silly mistake?
Gerriet.
Alex Zavatone
What is it doing? We would need to compare the tasks it’s doing so that we could estimate where any slowness would be.
Have you profiled it using Xcode’s profiler? Can you add log statements so that you can see where the operations are slowed down?
Happy Monday,
Alex Zavatone
Tom Landrum
Perhaps something is cached from the first run? Do you get the same results if you reverse the order?
Tom
Glenn L. Austin
It's a bit unusual for emulated code to run faster, but not impossible.
Depending upon the operation, the order of execution could end up pre-loading values or accessing devices in such a way that the emulated code isn't blocked waiting for a device/memory while the non-emulated code has to wait.
Gerriet M. Denkmann
On 19 Sep 2022, at 20:07, Alex Zavatone via groups.io <zav@...> wrote:
> What is it doing? We would need to compare the tasks it’s doing so that we could estimate where any slowness would be.

It computes the factorial of 300 000 (i.e. 1 * 2 * 3 * … * 299 999 * 300 000), a somewhat large integer.

> Have you profiled it using Xcode’s profiler?

Yes, but I could not see anything unusual.

> Can you add log statements so that you can see where the operations are slowed down?

Probably not. But maybe you could tell me how I can persuade Instruments to use the Rosetta version?
Then maybe I could see where the two versions differ.
Gerriet.
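For context: a computation like this typically stores the big integer as an array of 32-bit “limbs” and multiplies it in place by each successive factor, propagating carries. A minimal self-contained sketch along those lines (illustrative only; all names here are invented, not the thread’s actual code):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

typedef uint32_t limb;      /* one 32-bit digit of the big number */
typedef uint64_t bigLimb;   /* wide type holding digit * factor + carry */

/* Multiply the big number in digits[0..used-1] (little-endian limbs)
   by factor, in place; returns the new limb count. */
static unsigned mulSmall(limb *digits, unsigned used, bigLimb factor)
{
    bigLimb carry = 0;
    for (unsigned i = 0; i < used; i++) {
        carry += factor * (bigLimb)digits[i];
        digits[i] = (limb)carry;
        carry >>= 32;
    }
    while (carry) {                      /* append leftover carry limbs */
        digits[used++] = (limb)carry;
        carry >>= 32;
    }
    return used;
}

int main(void)
{
    enum { maxLimbs = 200000 };          /* ample for 300000! (~157000 limbs) */
    limb *digits = calloc(maxLimbs, sizeof(limb));
    if (digits == NULL) return 1;
    digits[0] = 1;
    unsigned used = 1;
    for (bigLimb n = 2; n <= 300000; n++)
        used = mulSmall(digits, used, n);
    printf("300000! occupies %u 32-bit limbs\n", used);
    free(digits);
    return 0;
}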
Alex Zavatone
It might seem like a primitive approach, but logging with time stamps should be able to highlight where the suckyness is. Run a log that displays the time delta from the last logging statement so that you are only looking at the deltas. Then run each version and see where the slowness is. That should tell you, right?
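A minimal sketch of that idea in C (illustrative; assumes the POSIX clock_gettime is available, and the helper name is invented):

#include <stdio.h>
#include <time.h>

/* Log the elapsed time since the previous call, so only deltas show up. */
static void logDelta(const char *label)
{
    static double last = 0.0;
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    double now = ts.tv_sec + ts.tv_nsec / 1e9;
    fprintf(stderr, "%-24s +%.6f s\n", label, last == 0.0 ? 0.0 : now - last);
    last = now;
}

Running the same binary both ways from Terminal (the x86_64 slice of a universal binary launches under Rosetta with arch -x86_64 ./tool) then makes the two logs directly comparable.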
Cheers,
Alex Zavatone
Gerriet M. Denkmann
On 20 Sep 2022, at 19:42, Alex Zavatone via groups.io <zav@...> wrote:
> It might seem like a primitive approach, but logging with time stamps should be able to highlight where the suckyness is. Run a log that displays the time delta from the last logging statement so that you are only looking at the deltas. Then run each version and see where the slowness is. That should tell you, right?

I did this:
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

typedef uint32_t limb;
typedef uint64_t bigLimb;
const uint len = 50000;
const int shiftLimb = sizeof(limb) * 8;        // 32: one limb’s worth of bits
limb *someArray = malloc( len * sizeof(limb) );
bigLimb someBig = 0;
for (bigLimb factor = 1; factor < len; factor++ )
{
    for (uint idx = 0 ; idx < len ; idx++)
    {
        someBig += factor * someArray[idx];    // multiply-accumulate with carry
        someArray[idx] = (limb)(someBig);      // store the low limb
        someBig >>= shiftLimb;                 // keep the carry
    }
}
and ran it in Release mode (-Os = Fastest, Smallest).
(In Debug mode (-O0) the Rosetta time equals the M1 time.)
with "someBig >>= shiftLimb”:
Rosetta M1 Rosetta time / M1 time
1.8 3.35 0.54
without the shift:
1.32 0.924 1.43
So it seems that Rosetta handles this shift way better than the native Apple Silicon code, which kind of looks like a bug.
Gerriet.
Quincey Morris
I really think you should start by asking this question over in the Swift forums, in case there is some compiler-specific answer that can immediately explain the result. It could be a compiler code generation deficiency, but there are many other possibilities. For example, a colleague of mine speculated that there could be reasons why the M1 Mac ran the Rosetta translation on a *performance* core, but ran the Apple Silicon version on an *efficiency* core.
You can also investigate the performance yourself, by running (say) the Time Profiler template in Instruments. It’s unclear which instrument might provide informative results, so you might need to try a couple of different templates, focussing on different things.
This might ultimately be an Apple support question, but I’d imagine there are numerous people on the Swift forums who’d enjoy puzzling out the answer. :)
Quincey Morris
OK, this is embarrassing. I’m so used to looking at Swift code these days, that my brain just automatically translated your code into Swift, and I never realized it was C. So the Swift forums aren’t where I should have suggested — although I see over there that you might be getting some kind of answer anyway. It’s an interesting question.
Does this mean that Swift is some kind of cult that has taken over my brain? Am I auto translating the entire world into Swift? Can I get rehab for this?
Admittedly, the people who might be interested in this overlap pretty well with Swift engineers, but I’m sorry I didn’t make a better suggestion.
Gerriet M. Denkmann
On 23 Sep 2022, at 07:51, Quincey Morris <quinceymorris@...> wrote:
> OK, this is embarrassing. I’m so used to looking at Swift code these days, that my brain just automatically translated your code into Swift, and I never realized it was C. So the Swift forums aren’t where I should have suggested — although I see over there that you might be getting some kind of answer anyway. It’s an interesting question.

As you see, I got an excellent in-depth answer and also some very helpful tips for improvements.
So no reason to be embarrassed at all.
Rather I have to thank you for a very good and fruitful suggestion.
Gerriet.
Jack Brindle
For those of us not in that forum, what was the explanation? You have our attention with this one...
Jack
Gerriet M. Denkmann
On 23 Sep 2022, at 09:28, Jack Brindle via groups.io <jackbrindle@...> wrote:
> For those of us not in that forum, what was the explanation? You have our attention with this one...

Steve Cannon explained that:
• ARM has “madd”, which does a multiply and an add in one instruction, but with a latency of more than one cycle.
• x86_64 does not have an integer multiply-add instruction.
Using multiply-add is generally a good idea (that’s why the arm64 compiler chooses it), but in a very short loop (as in my case) the added latency sadly makes things slower.
And so the native M1 code ended up being 60% slower than x86_64 + Rosetta.
Maybe the optimiser for Apple Silicon should consider not using “madd” in very short loops.
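To make the mechanism concrete: on arm64 the loop-carried carry chain pays the full madd latency each iteration, while on x86_64 the multiply can start early and only the cheap add sits on the chain. A hedged sketch of one possible workaround, using an empty inline-asm barrier that may stop clang from fusing the mul and add into a madd (whether it actually wins is an assumption to benchmark, and all names here are invented):

#include <stdint.h>

typedef uint32_t limb;
typedef uint64_t bigLimb;

/* Same inner loop as the benchmark, but with an optimization barrier on the
   product so the compiler keeps the multiply off the carry chain instead of
   fusing it into a single madd. Illustrative only. */
static void mulAccumulate(limb *a, unsigned len, bigLimb factor)
{
    bigLimb carry = 0;
    for (unsigned idx = 0; idx < len; idx++) {
        bigLimb prod = factor * (bigLimb)a[idx];
        __asm__ volatile("" : "+r"(prod));   /* barrier: prod stays in a register */
        carry += prod;                       /* only this add is on the chain    */
        a[idx] = (limb)carry;
        carry >>= 32;
    }
}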
Chris Ridd
On 22 Sep 2022, at 01:49, Quincey Morris <quinceymorris@...> wrote:
> For example, a colleague of mine speculated that there could be reasons why the M1 Mac ran the Rosetta translation on a *performance* core, but ran the Apple Silicon version on an *efficiency* core.

I was wondering about the P/E cores too. Perhaps you can avoid that difference by using GCD + QOS to explicitly run the calculation on the desired kind of core. Though I guess we don’t actually know if Rosetta always runs high-QOS Intel threads on high-QOS M1 threads.

But yes, Gerriet’s good spot about the use of >>= looks like something for the compiler/codegen folks.
Chris
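A minimal sketch of that QOS idea in C (assuming the pthread QOS API from <pthread/qos.h>; the helper name is invented):

#include <pthread/qos.h>
#include <stdio.h>

/* Ask the scheduler to treat the current thread as high QOS, which on
   Apple Silicon makes a performance core the likely placement. This is a
   request, not a guarantee of core assignment. */
static void requestPerformanceCore(void)
{
    int err = pthread_set_qos_class_self_np(QOS_CLASS_USER_INTERACTIVE, 0);
    if (err != 0)
        fprintf(stderr, "pthread_set_qos_class_self_np: error %d\n", err);
}

Calling something like this at the top of each variant before the hot loop would at least remove the P/E-core question from the comparison.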
Gary L. Wade
If you haven’t already, submit a feedback request with the source, Xcode version, and platform that demonstrate the issue.
--
Gary