Huge Memory Bandwidth, but not for every Block

One highly intriguing aspect of the M1 Max, maybe less so for the M1 Pro, is the massive memory bandwidth that is available to the SoC.

Apple was keen to market their 400GB/s figure during the launch, but this number is so far out there that it leaves a lot of open questions as to how the chip is able to take advantage of this kind of bandwidth, so it’s one of the first things to investigate.

Starting off with our memory latency tests, the new M1 Max changes system memory behaviour quite significantly compared to what we’ve seen on the M1. On the core and L2 side of things, there haven’t been any changes, and we consequently don’t see many differences in the results – it’s still a 3.2GHz peak core with 128KB of L1D at a 3-cycle load-to-load latency, and a 12MB L2 cache.

Where things are quite different is when we enter the system cache (SLC): instead of 8MB, on the M1 Max it’s now 48MB large, and also a lot more noticeable in the latency graph. While much larger, it’s also evidently slower than the M1’s SLC – the exact figures here depend on the access pattern, but even the linear chain access shows that data has to travel a longer distance than on the M1 and the corresponding A-series chips.

DRAM latency goes up this generation, even though on paper the M1 Max’s memory is faster in terms of frequency and bandwidth. At a comparable 128MB test depth, the new chip is roughly 15ns slower. The larger SLC, the more complex chip fabric, and possibly worse timings on the part of the new LPDDR5 memory could all add to the regression we’re seeing here. In practical terms, because the SLC is so much bigger this generation, workload latencies should still be lower on the M1 Max due to the higher cache hit rates, so performance shouldn’t regress.
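To make the hit-rate argument concrete, here is a minimal sketch of a simple average-latency model for accesses that miss the L2. The function name and every latency and hit-rate value below are illustrative assumptions rather than measurements from our tests; the only figure carried over from the text is the roughly 15ns DRAM penalty.

```python
def average_latency_ns(hit_rate: float, slc_ns: float, dram_ns: float) -> float:
    """Average latency of an L2 miss: SLC hits cost slc_ns, SLC misses cost dram_ns."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * slc_ns + (1.0 - hit_rate) * dram_ns


# Illustrative placeholder values, NOT measured data.
# The smaller SLC is modelled with a lower hit rate and lower latency.
m1_avg = average_latency_ns(hit_rate=0.30, slc_ns=40.0, dram_ns=100.0)

# The larger SLC is modelled as slower, with DRAM 15 ns slower as well,
# but with a higher hit rate thanks to the extra capacity.
m1_max_avg = average_latency_ns(hit_rate=0.60, slc_ns=50.0, dram_ns=115.0)

print(f"M1 (assumed inputs):     {m1_avg:.1f} ns")
print(f"M1 Max (assumed inputs): {m1_max_avg:.1f} ns")
```

With these made-up inputs the larger cache wins despite the slower DRAM, but the outcome depends entirely on how much a given workload's hit rate actually improves.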

A lot of people in the HPC audience were extremely intrigued to see a chip with such massive bandwidth – not because they care about the GPU or other offload engines of the SoC, but because of the possibility of the CPUs having access to such immense bandwidth, something that is otherwise only achievable on larger server-class CPUs that cost a multiple of what the new MacBook Pros are sold at. It was also one of the first things I tested – to see exactly how much bandwidth the CPU cores have access to.

Unfortunately, the news here isn’t the best-case scenario that we hoped for, as the M1 Max isn’t able to fully saturate the SoC bandwidth from just the CPU side.

From a single-core perspective, meaning from a single software thread, things are quite impressive for the chip, as it’s able to stress the memory fabric up to 102GB/s. This outperforms any other design in the industry by multiple factors. We had already noted that the M1 chip was able to fully saturate its memory bandwidth with a single core and that the bottleneck had been the DRAM itself. On the M1 Max, it seems that we’re hitting the limit of what a core can do – or more precisely, a limit of what the CPU cluster can do.

The little hump between 12MB and 64MB should be the 48MB SLC. The reduction in bandwidth at the 12MB mark signals that the core is somehow limited in bandwidth when evicting cache lines from the L2 out to the rest of the memory system. Our test here consists of reading, modifying, and writing back cache lines, with a 1:1 R/W ratio.

Going from one thread to two, what the system is actually doing is spreading the workload across the two performance clusters of the SoC, so each thread is on its own cluster and has full access to that cluster’s 12MB of L2. The “hump” after 12MB shrinks, now ending earlier at around 24MB, which makes sense as the 48MB SLC is now shared between two cores. Bandwidth here increases to 186GB/s.

Adding a third thread introduces a bit of an imbalance across the clusters, and DRAM bandwidth goes to 204GB/s. A fourth thread lands us at 224GB/s, and this appears to be the limit of what the performance cores can achieve on the SoC fabric, as adding further cores and threads beyond this point does not increase the bandwidth to DRAM at all. It’s only when the E-cores, which sit in their own cluster, are added in that the bandwidth is able to jump up again, to a maximum of 243GB/s.
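The diminishing returns are easier to see when the figures quoted above are laid out as per-step gains. The following is a small sketch; the labels and the helper name `incremental_gains` are our own, and the last entry is not a single extra thread but the maximum reached once the E-cores are added.

```python
# Aggregate DRAM bandwidth in GB/s per configuration, as quoted in the text.
MEASURED_GBPS = [
    ("1 thread", 102),
    ("2 threads", 186),
    ("3 threads", 204),
    ("4 threads", 224),
    ("all cores incl. E-cores", 243),
]


def incremental_gains(samples):
    """Return (label, total, gain over the previous configuration) per entry."""
    rows = []
    previous = 0
    for label, total in samples:
        rows.append((label, total, total - previous))
        previous = total
    return rows


for label, total, gain in incremental_gains(MEASURED_GBPS):
    print(f"{label:<24} {total:>4} GB/s  (+{gain} GB/s)")
```

The second thread, which lands on the other performance cluster, adds most of what the first one did, while every later step adds far less.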

While 243GB/s is massive, and overshadows any other design in the industry, it’s still quite far from the 409GB/s the chip is theoretically capable of (the figure Apple rounds down to 400GB/s). More importantly for the M1 Max, it’s only slightly higher than the 204GB/s limit of the M1 Pro, so from a CPU-only workload perspective, it doesn’t appear to make sense to get the Max if one is focused just on CPU bandwidth.
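For reference, the peak figures can be reproduced with a short calculation. This sketch assumes memory configurations that are not stated in the text above: LPDDR5 at 6400MT/s on a 512-bit interface for the M1 Max and a 256-bit interface for the M1 Pro. The function name is our own.

```python
def peak_bandwidth_gbps(bus_width_bits: int, transfers_per_sec: float) -> float:
    """Theoretical peak in GB/s: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * transfers_per_sec / 1e9


# Assumed configurations (not given in the text): LPDDR5-6400,
# 512-bit bus on the M1 Max and 256-bit bus on the M1 Pro.
LPDDR5_6400 = 6400e6  # transfers per second

m1_max_peak = peak_bandwidth_gbps(512, LPDDR5_6400)
m1_pro_peak = peak_bandwidth_gbps(256, LPDDR5_6400)

cpu_measured = 243  # GB/s, CPU-only maximum quoted in the text
cpu_share = cpu_measured / m1_max_peak

print(f"M1 Max peak: {m1_max_peak:.1f} GB/s")
print(f"M1 Pro peak: {m1_pro_peak:.1f} GB/s")
print(f"CPU-only share of M1 Max peak: {cpu_share:.0%}")
```

Under these assumptions the CPU complex reaches a bit under 60% of the M1 Max's peak, and only modestly exceeds what the M1 Pro's interface could deliver in total.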

That raises the question: why does the M1 Max have such massive bandwidth? The GPU naturally comes to mind; however, in my testing, I’ve had extreme trouble finding workloads that would stress the GPU sufficiently to take advantage of the available bandwidth. Granted, this is also an issue of lacking workloads, but for actual 3D rendering and benchmarks, I haven’t seen the GPU use more than 90GB/s (measured via system performance counters). While I’m sure there’s some productivity workload out there where the GPU is able to stretch its legs, we haven’t been able to identify one yet.

That leaves everything else on the SoC: the media engines, the NPU, and workloads that simply stress all parts of the chip at the same time. The new media engines on the M1 Pro and Max are now able to decode and encode ProRes RAW formats. Our sample clip is a 5K 12-bit file with a bitrate of 1.59Gbps, and the M1 Max is not only able to play it back in real time, it’s able to do so at multiple times the normal speed, with seamless, immediate seeking. Doing the same thing on my 5900X machine results in single-digit frame rates. The SoC DRAM bandwidth while seeking around was at around 40-50GB/s – I imagine that workloads that stress the CPU, GPU, and media engines all at the same time would be able to take advantage of the full system memory bandwidth, and allow the M1 Max to stretch its legs and differentiate itself more from the M1 Pro and other systems.
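Note that the clip's bitrate is quoted in gigabits per second while DRAM traffic is quoted in gigabytes per second. A quick conversion, sketched below with a helper name of our own, shows how small the compressed stream is next to the memory traffic observed. The comparison uses the stream's nominal 1x rate; the actual read rate during fast playback and seeking is not known from the text.

```python
def gbps_to_gbytes_per_sec(gigabits_per_sec: float) -> float:
    """Convert a bitrate in Gbit/s into GB/s (8 bits per byte)."""
    return gigabits_per_sec / 8


clip_stream = gbps_to_gbytes_per_sec(1.59)  # compressed stream at 1x playback
DRAM_TRAFFIC_RANGE = (40, 50)               # GB/s observed while seeking

# How many times larger the observed DRAM traffic is than the 1x stream rate.
ratios = tuple(traffic / clip_stream for traffic in DRAM_TRAFFIC_RANGE)

print(f"Compressed stream: {clip_stream:.3f} GB/s")
print(f"DRAM traffic is {ratios[0]:.0f}x to {ratios[1]:.0f}x the 1x stream rate")
```

The gap of two orders of magnitude suggests that most of the traffic comes from decoded frames and intermediate buffers rather than from reading the compressed file itself, though that interpretation is ours and not something the counters confirm.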


492 Comments


  • sthambi - Wednesday, November 3, 2021

    Hi Anand, I stumbled across your blog post, and I enjoyed reading it. I'm a professional video editor and photographer. I ordered the 32 core, 64GB, M1 Pro Max for $3900. I'm upgrading from the iMac 5K, late 2015 model. I personally feel like I am overkilling my configuration. I don't want to look back 2 years from now and feel like I lost 4k, and now Apple doubled again. Do you think I really need this heavy a configuration to use Premiere Pro CC, max 5K video editing, and Canon raw images, with Creative Cloud applications running simultaneously? What would you recommend that can help me save money and not compromise on the performance? Is my decision of going full configuration bad?
  • MykeM - Sunday, November 14, 2021

    Read the byline (the names under the headlines). The site’s namesake, Anand, left a few years ago. He no longer writes here. The people replacing him are every bit as capable but none of them are actually named Anand.
  • razer555 - Thursday, November 4, 2021

    https://www.youtube.com/watch?v=OMgCsvcMIaQ
    https://www.youtube.com/watch?v=mN8ve8Hp4I4

    Anandtech, your graphics tests seem wrong.
  • Sheepshot - Sunday, November 7, 2021

    Anand tech = Apple shills.

    M1 beats both the M1 Pro and Max in power efficiency. It draws 50% of the watts but provides almost 65-70% of the performance in most relevant benches.
  • Hrunga_Zmuda - Sunday, November 7, 2021

    Shills?

    The 90s called and want your insult back.
  • evernessince - Wednesday, November 10, 2021

    HWUB just did a review of the M1 Pro in actual applications and performance is good but not nearly as impressive as Anand suggests. These chips are competitive with laptop chips but you certainly don't need to bust out server class components as suggested in the article. Performance is very good in certain areas and in others it's very poor. Most of the time it's about as good as x86 laptop chips. GPU is decent but given the price, you can get much much more performance on x86 at a much lower price.
  • Motti.shneor - Sunday, November 14, 2021

    I think I heard M1 Max and M1 Pro have different number of CPU cores? Here you say they're identical?

    Also, I keep asking myself why a tech visionary like yourself doesn't see the "big picture" and the bold transitional step in computing taken here.

    For me, that sheer "horse power" means very little - I'm using a 1st generation M1 Mac-Mini, with medium configuration, beside an i9 MacBookPro from 2020 - and the Mini is SO MUCH BETTER in each and every way and meaning (except of course for the terrible bugs and deteriorating quality and bad behavior at boot time, the EFI and such)

    As a power user, and Mac/iOS software engineer/tech-lead for over 35 years, and with my pack of 400 applications installed, some native, some emulated, and my 0.5TB library of photos and 0.5TB library of music.... well, with all this, I can testify that MacMini "feels" 5 times faster than the MBP, in most everything I DO. Maybe it'll fail on benchmarks, but I couldn't care less. Rebuild a project? snap. Export a video while rescaling it? Immediate! heavy image conversion? no time. Launch a heavy app? before you know it. It FEELS very very fast, and that's 1st generation.

    What I think IS IMPORTANT and not being said by anyone, is that the whole mode of computing goes back from "general purpose" into "specialized hardware". You can no longer appreciate a computer by its linear-programming CPU cycles, and if you do - you just get a completely wrong evaluation.

    Moreover - you CAN'T just "port some general C code from somewhere" and expect it to run fast. You MUST be using system APIs at SOME LEVEL, that will dispatch your work onto specialized hardware, so you gain from all those monstrous engines under the hood. If you will just compile some neural-networks engine or drag it over in Python or something, it'll crawl and it will suck. But if you use Vision Framework from Apple, you'll have jaw-dropping performance. You MUST build software FOR the M1, to have the software shine. This is a paradigm shift, that contradicts everything we've seen in the last 40 years (moving from custom hardware into general-purpose computing devices).

    If history is to repeat itself like so many times in the past - soon enough all the competitors in the Computing arena will be forced into similar changes, so as not to lose market share - and we'll have a very strange market, much harder to compare - because the Apple guys will always bring in Apple software highly optimized to use the hardware, and the "other" guys will pull their "specialized" software for their special processors...

    I am quite thrilled, and I really want to have one of them M1 Max machines, just to feel them a little.

    Despite the long threads underneath, I think Gaming is not even secondary in the list of important aspects - And I also predict that Game makers will skip the Mac in the future just like today. It's not because they don't like it, but because of tradition, and because of the high priced entry point of Powerful Mac computers. Still - Corporate-America is buying MBPs like mad, and they'll keep doing that in the coming 3-5 years.
  • stevenLu - Tuesday, December 7, 2021

    I am a fan of Apple technology. And I am only glad to read news about their development and some new technologies. Another new technology is used by Lucid Reality Labs. You can read more on their website https://lucidrealitylabs.com/blog/5-vr-headsets-ma...
  • Cloakstar - Tuesday, December 21, 2021

    One reason this M1 Max performs so well is that even though the CPU is in control, the memory hierarchy is more GPU+CPU than the typical CPU+GPU, so the typical APU memory bottleneck is gone. :D AMD APUs, for example, are highly memory bound, doubling in performance when you go from 1 stick of RAM to 4 sticks with bank+channel interleaving.
  • wr3@k0n - Friday, December 24, 2021

    For $3499, no shit it's competing with server grade, if it comes with that price. Though PC still ends up being cheaper and infinitely more repairable and upgradeable. This article doesn't address many of the drawbacks of the Apple ecosystem and it will take more than "close to PC" performance.
