Huge Memory Bandwidth, but not for every Block

One highly intriguing aspect of the M1 Max, and perhaps less so of the M1 Pro, is the massive memory bandwidth that is available to the SoC.

Apple was keen to market its 400GB/s figure during the launch, but this number is so far out there that it leaves a lot of questions open as to how the chip is able to take advantage of this kind of bandwidth, so it’s one of the first things to investigate.

Starting off with our memory latency tests, the new M1 Max changes system memory behaviour quite significantly compared to what we’ve seen on the M1. On the core and L2 side of things, there haven’t been any changes, and we consequently don’t see many alterations in the results: it’s still a 3.2GHz peak core with 128KB of L1D at 3 cycles of load-load latency, and a 12MB L2 cache.
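To put the L1D figure in absolute terms, the cycle count can be converted to time at the quoted peak clock. This is a minimal sketch; the helper name `cycles_to_ns` is ours and purely illustrative.

```python
def cycles_to_ns(cycles: float, clock_ghz: float) -> float:
    """Convert a latency in core cycles to nanoseconds at a given clock."""
    # One cycle at f GHz lasts 1/f nanoseconds.
    return cycles / clock_ghz


# 3-cycle load-load latency at the 3.2GHz peak clock quoted above.
l1d_latency_ns = cycles_to_ns(3, 3.2)
print(f"L1D load-load latency: {l1d_latency_ns:.4f} ns")
```

At the peak clock this works out to just under one nanosecond per dependent load.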

Where things are quite different is when we enter the system cache: instead of 8MB, on the M1 Max it’s now 48MB, and it’s also a lot more noticeable in the latency graph. While being much larger, it’s also evidently slower than the M1 SLC. The exact figures here depend on access pattern, but even the linear chain access shows that data has to travel a longer distance than on the M1 and the corresponding A-series chips.

DRAM latency goes up this generation, even though on paper the M1 Max’s memory is faster in terms of frequency and bandwidth. At a comparable 128MB test depth, the new chip is roughly 15ns slower. The larger SLCs, the more complex chip fabric, and possibly worse timings on the part of the new LPDDR5 memory could all add to the regression we’re seeing here. In practical terms, because the SLC is so much bigger this generation, workload latencies should still be lower for the M1 Max due to the higher cache hit rates, so performance shouldn’t regress.
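The trade-off between a slower DRAM path and a higher SLC hit rate can be sketched with a simple two-level average-latency model. Everything below is an illustration under stated assumptions: the model ignores the L1 and L2 caches, and every latency and hit rate passed in is a hypothetical placeholder rather than a measurement. Only the 15ns DRAM penalty comes from the text above.

```python
def average_latency_ns(hit_rate: float, cache_ns: float, dram_ns: float) -> float:
    """Average latency of accesses that either hit the cache or go to DRAM."""
    return hit_rate * cache_ns + (1.0 - hit_rate) * dram_ns


def break_even_hit_rate(old_avg_ns: float, new_cache_ns: float, new_dram_ns: float) -> float:
    """Hit rate at which the new design matches the old average latency.

    Solves h * new_cache_ns + (1 - h) * new_dram_ns == old_avg_ns for h.
    """
    return (new_dram_ns - old_avg_ns) / (new_dram_ns - new_cache_ns)


# Hypothetical placeholder values, NOT measurements from this article.
OLD_HIT_RATE, OLD_CACHE_NS, OLD_DRAM_NS = 0.50, 40.0, 100.0
NEW_CACHE_NS = 50.0                 # assumed: a slower, larger cache
NEW_DRAM_NS = OLD_DRAM_NS + 15.0    # the ~15ns regression noted above

old_avg = average_latency_ns(OLD_HIT_RATE, OLD_CACHE_NS, OLD_DRAM_NS)
needed = break_even_hit_rate(old_avg, NEW_CACHE_NS, NEW_DRAM_NS)
print(f"old average: {old_avg:.1f} ns, break-even hit rate: {needed:.1%}")
```

Under this model, any hit rate above the break-even value gives a lower average latency than before, which is the sense in which a larger cache can offset slower DRAM.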

A lot of people in the HPC audience were extremely intrigued to see a chip with such massive bandwidth, not because they care about the GPU or other offload engines of the SoC, but because of the possibility of the CPUs having access to such immense bandwidth, something that is otherwise only possible on larger server-class CPUs that cost a multiple of what the new MacBook Pros are sold at. It was also one of the first things I tested: exactly how much bandwidth the CPU cores have access to.

Unfortunately, the news here isn’t the best-case scenario that we hoped for, as the M1 Max isn’t able to fully saturate the SoC bandwidth from just the CPU side.

From a single-core perspective, meaning from a single software thread, things are quite impressive for the chip, as it’s able to stress the memory fabric up to 102GB/s. This is extremely impressive and outperforms any other design in the industry by multiple factors. We had already noted that the M1 chip was able to fully saturate its memory bandwidth with a single core, and that the bottleneck had been the DRAM itself. On the M1 Max, it seems that we’re hitting the limit of what a core can do, or more precisely, a limit to what the CPU cluster can do.

The little hump between 12MB and 64MB should be the 48MB SLC. The reduction in bandwidth at the 12MB mark signals that the core is somehow limited in bandwidth when evicting cache lines back to the upper memory system. Our test here consists of reading, modifying, and writing back cache lines, with a 1:1 R/W ratio.
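The access pattern described above can be sketched as follows. This is not the tool used for the measurements, only an illustration of how traffic is counted for a 1:1 read/write test: touching one byte per cache line forces the hardware to fetch the whole line and later write it back. The function names and the 128-byte line size are assumptions, and in Python the interpreter overhead dominates, so the resulting number says nothing about the hardware.

```python
import time


def rmw_pass(buf: bytearray, line_bytes: int) -> int:
    """Read, modify and write back one byte in every cache line of buf.

    Returns the number of lines touched.
    """
    lines = 0
    for offset in range(0, len(buf), line_bytes):
        buf[offset] = (buf[offset] + 1) & 0xFF  # load, modify, store
        lines += 1
    return lines


def rmw_bandwidth_gb_s(buffer_bytes: int, line_bytes: int = 128, passes: int = 4) -> float:
    """Nominal read+write traffic per second for repeated passes over a buffer."""
    buf = bytearray(buffer_bytes)
    start = time.perf_counter()
    lines = 0
    for _ in range(passes):
        lines += rmw_pass(buf, line_bytes)
    elapsed = time.perf_counter() - start
    # Each touched line is counted once as a read and once as a write-back.
    traffic_bytes = 2 * lines * line_bytes
    return traffic_bytes / elapsed / 1e9


if __name__ == "__main__":
    for depth_mb in (1, 8, 32):
        print(f"{depth_mb} MB: {rmw_bandwidth_gb_s(depth_mb << 20):.2f} GB/s (nominal)")
```

Sweeping the buffer size is what produces the test-depth axis: once the buffer no longer fits in a given cache level, every pass has to stream lines in from, and evict them to, the next level down.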

Going from 1 core/thread to 2, what the system is actually doing is spreading the workload across the two performance clusters of the SoC, so both threads are on their own cluster and have full access to the 12MB of L2. The “hump” after 12MB reduces in size, ending earlier now at +24MB, which makes sense as the 48MB SLC is now shared amongst two cores. Bandwidth here increases to 186GB/s.
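The reasoning about the shrinking hump is simple division, sketched here under the assumption that the active threads split the SLC evenly and each stream its own buffer. The helper name is illustrative.

```python
SLC_MB = 48  # system level cache size quoted above


def slc_share_mb(active_threads: int) -> float:
    """SLC capacity available to each thread if it is split evenly."""
    return SLC_MB / active_threads


for threads in (1, 2):
    print(f"{threads} thread(s): {slc_share_mb(threads):.0f} MB of SLC each")
```

With two threads each getting half of the cache, the per-thread test depth at which data starts spilling to DRAM halves as well.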

Adding a third thread introduces a bit of an imbalance across the clusters, and DRAM bandwidth goes to 204GB/s. A fourth thread lands us at 224GB/s, and this appears to be the limit of what the CPUs are able to achieve on the SoC fabric, as adding additional cores and threads beyond this point does not increase the bandwidth to DRAM at all. It’s only when the E-cores, which are in their own cluster, are added in that the bandwidth is able to jump up again, to a maximum of 243GB/s.
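How quickly the scaling flattens is easier to see when each result is compared against perfect linear scaling of the single-thread figure. The sketch below uses only the bandwidth numbers quoted in the text; the names are illustrative.

```python
# Bandwidth figures quoted in the text, in GB/s, keyed by the number
# of threads running on the performance cores.
P_CORE_BANDWIDTH_GB_S = {1: 102, 2: 186, 3: 204, 4: 224}
ALL_CORES_BANDWIDTH_GB_S = 243  # with the E-core cluster added in


def scaling_efficiency(threads: int) -> float:
    """Measured bandwidth divided by linear scaling of one thread."""
    single = P_CORE_BANDWIDTH_GB_S[1]
    return P_CORE_BANDWIDTH_GB_S[threads] / (single * threads)


for n in sorted(P_CORE_BANDWIDTH_GB_S):
    bw = P_CORE_BANDWIDTH_GB_S[n]
    print(f"{n} thread(s): {bw} GB/s, {scaling_efficiency(n):.0%} of linear")

e_core_gain = ALL_CORES_BANDWIDTH_GB_S - P_CORE_BANDWIDTH_GB_S[4]
print(f"E-cores add {e_core_gain} GB/s on top of four P-core threads")
```

The second thread still scales at roughly nine tenths of linear, while the third and fourth add only about 20GB/s each, consistent with a per-cluster or fabric limit rather than a per-core one.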

While 243GB/s is massive, and overshadows any other design in the industry, it’s still quite far from the 409GB/s the chip is capable of. More importantly for the M1 Max, it’s only slightly higher than the 204GB/s limit of the M1 Pro, so from a CPU-only workload perspective, it doesn’t appear to make sense to get the Max if one is focused just on CPU bandwidth.
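These ratios can be made concrete with a little arithmetic. The sketch below assumes a 512-bit LPDDR5 interface running at 6400MT/s, which is our assumption for where the 409GB/s figure comes from rather than something stated in this section; the remaining numbers are the ones quoted above.

```python
def peak_bandwidth_gb_s(bus_width_bits: int, transfers_mt_s: int) -> float:
    """Theoretical DRAM bandwidth: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * transfers_mt_s / 1000


ASSUMED_BUS_WIDTH_BITS = 512   # assumption, not stated in this section
ASSUMED_DATA_RATE_MT_S = 6400  # assumption: LPDDR5-6400

CPU_MAX_GB_S = 243    # best CPU-only result on the M1 Max
CHIP_PEAK_GB_S = 409  # M1 Max figure quoted above
M1_PRO_GB_S = 204     # M1 Pro limit quoted above

theoretical = peak_bandwidth_gb_s(ASSUMED_BUS_WIDTH_BITS, ASSUMED_DATA_RATE_MT_S)
print(f"theoretical peak: {theoretical:.1f} GB/s")
print(f"CPU share of peak: {CPU_MAX_GB_S / CHIP_PEAK_GB_S:.0%}")
print(f"M1 Max CPU result vs M1 Pro limit: {CPU_MAX_GB_S / M1_PRO_GB_S:.2f}x")
```

In other words, the CPUs reach roughly three fifths of the chip's peak, and only about a fifth more than the M1 Pro's total.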

That begs the question: why does the M1 Max have such massive bandwidth? The GPU naturally comes to mind; however, in my testing, I’ve had extreme trouble finding workloads that would stress the GPU sufficiently to take advantage of the available bandwidth. Granted, this is also an issue of lacking workloads, but for actual 3D rendering and benchmarks, I haven’t seen the GPU use more than 90GB/s (measured via system performance counters). While I’m sure there’s some productivity workload out there where the GPU is able to stretch its legs, we haven’t been able to identify it yet.

That leaves everything else on the SoC: the media engine, the NPU, and workloads that simply stress all parts of the chip at the same time. The new media engines on the M1 Pro and Max are now able to decode and encode ProRes RAW formats. The above clip is a 5K 12-bit sample with a bitrate of 1.59Gbps, and the M1 Max is not only able to play it back in real time, it’s able to do so at multiple times the speed, with seamless immediate seeking. Doing the same thing on my 5900X machine results in single-digit frame rates. The SoC DRAM bandwidth while seeking around was around 40-50GB/s. I imagine that workloads that stress the CPU, GPU, and media engines all at the same time would be able to take advantage of the full system memory bandwidth, allowing the M1 Max to stretch its legs and differentiate itself more from the M1 Pro and other systems.
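For a sense of scale, the clip's compressed bitrate can be compared with the DRAM traffic observed while seeking. This is a rough sketch that assumes Gbps means 10^9 bits per second; note that the bitrate describes real-time playback, whereas the DRAM figure was observed during faster-than-real-time seeking, so the ratio is only indicative.

```python
def gbps_to_gb_per_s(gigabits_per_second: float) -> float:
    """Convert gigabits per second to gigabytes per second."""
    return gigabits_per_second / 8


CLIP_BITRATE_GBPS = 1.59           # compressed stream, from the text
DRAM_TRAFFIC_GB_S = (40.0, 50.0)   # observed range while seeking

stream_gb_s = gbps_to_gb_per_s(CLIP_BITRATE_GBPS)
low_ratio = DRAM_TRAFFIC_GB_S[0] / stream_gb_s
high_ratio = DRAM_TRAFFIC_GB_S[1] / stream_gb_s
print(f"compressed stream: {stream_gb_s:.3f} GB/s")
print(f"DRAM traffic is {low_ratio:.0f}x to {high_ratio:.0f}x the stream rate")
```

The compressed input is about 0.2GB/s, so nearly all of the observed DRAM traffic comes from something other than reading the file itself.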


492 Comments


  • OreoCookie - Friday, October 29, 2021 - link

    You shouldn't mix the M1 Pro and M1 Max: the article was about the Max. The Pro makes some concessions and it looks like there are some workloads where you can saturate its memory bandwidth … but only barely so. Even then, the M1 Pro would have much, much more memory bandwidth than any laptop CPU available today (and any x86 on the horizon).

    And I think you should include the L2 cache here, which is larger than the SL cache on the Pro, and still significant in the Max (28 MB vs. 48 MB).

    I still think you are nitpicking: memory bandwidth is a strength of the M1 Pro and Max, not a weakness. The extra cache in AMD's Zen 3D will not change the landscape in this respect either.
  • richardnpaul - Friday, October 29, 2021 - link

    The article does describe the differences between the two on the front page and runs comparisons throughout the benchmarks, whilst it's titled to be about the Max I found that it really basically covered both chips, the focus was on what benefits if any the Max brings over the Pro, so I felt it natural to include what I now see is a confusing reference to 24MB because you don't know what's going on in my head 😁

    From what I could tell the SL cache was not described like a typical L3 cache but I guess you could think of it more like that, so I was thinking of it as almost like an L4 cache (thus my comment about its placement in the die: it's next to the memory controllers and the GPU blocks, and quite far away from the CPU cores themselves, so there will be a larger penalty for access vs a typical L3 which would be very close to the CPU core blocks. I've gone back and looked again and it's not as far away as I first thought, as I'd mistaken where the CPU cores were)

    Total cache is 72MB (76MB including the efficiency cores' L2, and anything in the GPU). The AMD desktop Zen 3 chip has 36MB and will be 100MB with the V-Cache, so certainly in the same ballpark really, as in it's a lot currently (but I'm sure that we'll see the famed 1GB in the next decade). The M1 Max is crazy huge for a laptop, which is why I compare it to the desktop Zen 3, and also because nothing else is really comparable with 8 cores.

    I don't think it's a weakness, it's pretty huge for a 10TF GPU and an 8 core CPU (plus whatever the NPU etc. pull through it). I'm just not a fan of the compromises involved, such as RAM that can't be upgraded; though a 512bit interface would necessitate quite a few PCB layers to achieve with modular RAM.
  • Oxford Guy - Friday, October 29, 2021 - link

    Apple pioneered the disposable closed system with the original Mac.

    It was so extreme that Jobs used outright bait and switch fraud to sucker the tech press with speech synthesis. The only Mac to be sold at the time of the big unveiling had 128K and was not expandable. Jobs used a 512K prototype without informing the press so he could run speech synthesis — software that also did not come with the Mac (another deception).

    Non-expandable RAM isn’t a bug to Apple’s management; it’s a very highly-craved feature.
  • techconc - Thursday, October 28, 2021 - link

    You're exactly right. Here's what Affinity Photo has to say about it...

    "The #M1Max is the fastest GPU we have ever measured in the @affinitybyserif Photo benchmark. It outperforms the W6900X — a $6000, 300W desktop part — because it has immense compute performance, immense on-chip bandwidth and immediate transfer of data on and off the GPU (UMA)."
  • richardnpaul - Thursday, October 28, 2021 - link

    They're right, which is why you see SAM these days on the newer AMD stuff (Resizable BAR) and why Nvidia did the custom interface tech with IBM and are looking to do the same in servers with ARM to leverage these kinds of performance gains. It's also the reason why AMD bought ATI in the first place all those years ago; the whole failed heterogeneous compute (it must be galling for some at AMD that Apple have executed on this promise so well).
  • techconc - Thursday, October 28, 2021 - link

    You clearly don't understand what drives performance. You have a very limited view which looks only at the TFLOPs metric and not at the entire system. Performance comes from the following 3 things: High compute performance (TFLOPS), fast on-chip bandwidth and fast transfer on and off the GPU.

    As an example, Andy Somerfield, lead for Affinity Photo app had the following to say regarding the M1 Max with their application:
    "The #M1Max is the fastest GPU we have ever measured in the @affinitybyserif Photo benchmark. It outperforms the W6900X — a $6000, 300W desktop part — because it has immense compute performance, immense on-chip bandwidth and immediate transfer of data on and off the GPU (UMA)."

    This is comparing the M1 Max GPU to a $6000, 300W part and the M1 Max handily outperforms it. In terms of TFLOPS, the 6900XT has more than 2x the power. Yet, the high speed and efficient design of the shared memory on the M1 Max allows it to outperform this more expensive part in actual practice. It does so while using just a fraction of the power. That does make the M1 Max pretty special.
  • richardnpaul - Thursday, October 28, 2021 - link

    Yes TFLOPs is a very simple metric and doesn't directly tell you much about performance, but it's a general guide (Nvidia got more out of their hardware compared to AMD for example and have until the 6800 series if you only looked at the TFLOPS figures.) Please, tell me more about what I think and understand /s

    It's fastest for their scenario and for their implementation. It may be, and is very likely, that there's some specific bottleneck that they are hitting with the W6900X that isn't a problem with the implementation details of the M1 Pro/Max chips. Their issue seems to be interconnect bandwidth, they're constantly moving data back and forth between the CPU and GPU and with the M1 chips they don't need to do that, saving huge amounts of time because the PCI-E bus adds a lot of latency from what I understand so you really don't want to transfer back and forth over it (and maybe you don't need to, maybe you can do something differently in the software implementation, maybe you can't and it's just a problem that's much more efficiently done on this kind of architecture I don't know and wouldn't be able to comment knowing nothing about the software or problem that it solves. What I don't take at face value is one person/company saying use our software as it's amazing on only this hardware; I mean a la Oracle right?)

    When it comes to gaming performance, the 6900XT or the RTX 3080 seem to put this chip in its place, based on the benchmarks we saw (in fact, the mobile 3080 is basically just an RTX 3070, so even more so, which could be because of all sorts of issues already highlighted). You could say that the GPU isn't good as a GPU but is great at one task as a highly parallel co-processor for one piece of software; if that's the software you want to use then great for you, but if you want to use the GPU for actual GPU tasks it might underwhelm (though in a laptop format and for this little power draw of ~120W max it's not going to do that for a few years, which is the point that you're making and I'm not disputing - Apple will obviously launch new replacements which will put this in the shade in time).
  • Hrunga_Zmuda - Tuesday, October 26, 2021 - link

    From the developers of Affinity Photo:

    "The #M1Max is the fastest GPU we have ever measured in the @affinitybyserif Photo benchmark. It outperforms the W6900X — a $6000, 300W desktop part — because it has immense compute performance, immense on-chip bandwidth and immediate transfer of data on and off the GPU (UMA)."

    Ahem, a laptop that tops out at not much more than the top GPU. That is bananas!
  • buta8 - Wednesday, October 27, 2021 - link

    Please tell me how to monitor the CPU Bandwidth - Intra-cacheline R&W?
