Huge Memory Bandwidth, but not for every Block

One highly intriguing aspect of the M1 Max, maybe less so for the M1 Pro, is the massive memory bandwidth that is available for the SoC.

Apple was keen to market their 400GB/s figure during the launch, but this number is so wild and out there that there’s just a lot of questions left open as to how the chip is able to take advantage of this kind of bandwidth, so it’s one of the first things to investigate.

Starting off with our memory latency tests, the new M1 Max changes system memory behaviour quite significantly compared to what we’ve seen on the M1. On the core and L2 side of things, there haven’t been any changes and we consequently don’t see much alterations in terms of the results – it’s still a 3.2GHz peak core with 128KB of L1D at 3 cycles load-load latencies, and a 12MB L2 cache.

Where things are quite different is when we enter the system cache, instead of 8MB, on the M1 Max it’s now 48MB large, and also a lot more noticeable in the latency graph. While being much larger, it’s also evidently slower than the M1 SLC – the exact figures here depend on access pattern, but even the linear chain access shows that data has to travel a longer distance than the M1 and corresponding A-chips.

DRAM latency, even though on paper is faster for the M1 Max in terms of frequency on bandwidth, goes up this generation. At a 128MB comparable test depth, the new chip is roughly 15ns slower. The larger SLCs, more complex chip fabric, as well as possible worse timings on the part of the new LPDDR5 memory all could add to the regression we’re seeing here. In practical terms, because the SLC is so much bigger this generation, workloads latencies should still be lower for the M1 Max due to the higher cache hit rates, so performance shouldn’t regress.

A lot of people in the HPC audience were extremely intrigued to see a chip with such massive bandwidth – not because they care about GPU or other offload engines of the SoC, but because the possibility of the CPUs being able to have access to such immense bandwidth, something that otherwise is only possible to achieve on larger server-class CPUs that cost a multitude of what the new MacBook Pros are sold at. It was also one of the first things I tested out – to see exactly just how much bandwidth the CPU cores have access to.

Unfortunately, the news here isn’t the best case-scenario that we hoped for, as the M1 Max isn’t able to fully saturate the SoC bandwidth from just the CPU side;

From a single core perspective, meaning from a single software thread, things are quite impressive for the chip, as it’s able to stress the memory fabric to up to 102GB/s. This is extremely impressive and outperforms any other design in the industry by multiple factors, we had already noted that the M1 chip was able to fully saturate its memory bandwidth with a single core and that the bottleneck had been on the DRAM itself. On the M1 Max, it seems that we’re hitting the limit of what a core can do – or more precisely, a limit to what the CPU cluster can do.

The little hump between 12MB and 64MB should be the SLC of 48MB in size, the reduction in BW at the 12MB figure signals that the core is somehow limited in bandwidth when evicting cache lines back to the upper memory system. Our test here consists of reading, modifying, and writing back cache lines, with a 1:1 R/W ratio.

Going from 1 core/threads to 2, what the system is actually doing is spreading the workload across the two performance clusters of the SoC, so both threads are on their own cluster and have full access to the 12MB of L2. The “hump” after 12MB reduces in size, ending earlier now at +24MB, which makes sense as the 48MB SLC is now shared amongst two cores. Bandwidth here increases to 186GB/s.

Adding a third thread there’s a bit of an imbalance across the clusters, DRAM bandwidth goes to 204GB/s, but a fourth thread lands us at 224GB/s and this appears to be the limit on the SoC fabric that the CPUs are able to achieve, as adding additional cores and threads beyond this point does not increase the bandwidth to DRAM at all. It’s only when the E-cores, which are in their own cluster, are added in, when the bandwidth is able to jump up again, to a maximum of 243GB/s.

While 243GB/s is massive, and overshadows any other design in the industry, it’s still quite far from the 409GB/s the chip is capable of. More importantly for the M1 Max, it’s only slightly higher than the 204GB/s limit of the M1 Pro, so from a CPU-only workload perspective, it doesn’t appear to make sense to get the Max if one is focused just on CPU bandwidth.

That begs the question, why does the M1 Max have such massive bandwidth? The GPU naturally comes to mind, however in my testing, I’ve had extreme trouble to find workloads that would stress the GPU sufficiently to take advantage of the available bandwidth. Granted, this is also an issue of lacking workloads, but for actual 3D rendering and benchmarks, I haven’t seen the GPU use more than 90GB/s (measured via system performance counters). While I’m sure there’s some productivity workload out there where the GPU is able to stretch its legs, we haven’t been able to identify them yet.

That leaves everything else which is on the SoC, media engine, NPU, and just workloads that would simply stress all parts of the chip at the same time. The new media engine on the M1 Pro and Max are now able to decode and encode ProRes RAW formats, the above clip is a 5K 12bit sample with a bitrate of 1.59Gbps, and the M1 Max is not only able to play it back in real-time, it’s able to do it at multiple times the speed, with seamless immediate seeking. Doing the same thing on my 5900X machine results in single-digit frames. The SoC DRAM bandwidth while seeking around was at around 40-50GB/s – I imagine that workloads that stress CPU, GPU, media engines all at the same time would be able to take advantage of the full system memory bandwidth, and allow the M1 Max to stretch its legs and differentiate itself more from the M1 Pro and other systems.

M1 Pro & M1 Max: Performance Laptop Chips Power Behaviour: No Real TDP, but Wide Range
POST A COMMENT

493 Comments

View All Comments

  • richardnpaul - Wednesday, October 27, 2021 - link

    I'm not saying that it's not great and energy efficient marvel of technology (although you're forgetting that the compared part is Zen3 mobile 35W part which has 12MB rather than 32MB of L3 and that's partly because its a small die on 7nm).

    They mentioned Metal they mentioned how they can't get direct comparative results, this is one of the downsides of this, and the others from Apple, chip, great as it is it has drawbacks that hamper it which are nothing to to do with the architecture.
    Reply
  • OreoCookie - Wednesday, October 27, 2021 - link

    I don’t think I’m forgetting anything here. I am just saying that Anandtech should compare the M1 Max against actual products rather than speculate how it compares to future products like Alderlake or Zen 3 with V cache. Your claim was that the article “comes across as a fanboi article”, and I am just saying that they are just giving the chip a great review because in their low-level benchmarks it outclasses the competition in virtually every way. That’s not fanboi-ism, it is just rooted in fact.

    And yes, they explained the issue with APIs and the lack of optimization of games for the Mac. Given that Mac users either aren’t gamers or (if they are gamers) tend to not use their Macs for gaming, we can argue how important that drawback actually is. In more GPU compute-focussed benchmarks (e. g. by Affinity that make cross-platform creativity apps), the results of the GPU seem very impressive.
    Reply
  • richardnpaul - Thursday, October 28, 2021 - link

    My main disagreement was not them comparing with with Zen3, but more that I felt that they failed to adequately cover how the change would impact this use case scenario between M1 versions given that comparing Zen2 to Zen3 has been covered (and AMD have already said that the Vcache will mainly impact gaming and server workloads by around 15% on average) and shown in these specific use cases to have quite a large benefit and I'd just wanted that kind of abstract logical analysis of how the Max might be more positively positioned for this or these use cases above say the original M1. (I know that they mentioned in the article that they didn't have the M1 anymore and the actual AMD 5900HS device is dead which has severely impacted their testing here.

    I come to Anandtech specifically for the more indepth coverage that you don't get elsewhere and I come for all the hardware articles irrespective of brand because I'm interested in technology not brand names which is why I dislike articles that come across as biased (whilst it'll never be intentionally biased we're all human at the end of the day and it's hard not to let the excitement of novel tech cloud our judgement).
    Reply
  • richardnpaul - Wednesday, October 27, 2021 - link

    Also my comparison was AMD to AMD between generations and how it might apply to increasing the cache sizes of the M1 and the positive improvement it might have on performance in situations using the GPU such as gaming. Reply
  • Ppietra - Wednesday, October 27, 2021 - link

    You are so focused on a fringe case that you don’t stop to think that "maybe" there are other things happening besides "gluing" a CPU and GPU on the same silicon, fighting for memory bandwidth. Unified memory architecture plus CPU and GPU sharing data over the system cache, has an impact on memory bandwidth needs.
    Besides this, looking at data that it is provided, we seem to be far from saturating memory bandwidth on a regular basis.
    It would be interesting though to actually see how applications behave when truly optimised for this hardware and not just ported with some compatibility abstraction layer in the middle. Affinity Photo would probably be the best example.
    Reply
  • richardnpaul - Wednesday, October 27, 2021 - link

    This is exactly what I wanted coving in the article. If the GPU and CPU are hitting the memory subsystem they are going to be competing for cache hits. My point was that Zen3 (desktop) showed a large positive correlation between doubling the cache (or unifying it into a single blob in reality) and increased FPS in games and that that might also hold true for the increased cache on the M1 Pro and Max.

    Unfortunately testing this chip is hampered by decisions completely unrelated to the hardware itself, and that also applies to certain use cases.
    it'll be more interesting to see testing the same games under Linux between an Nvidia/AMD/Intel based laptop as then the only differences should be the ISA; and immature drivers.
    Reply
  • Ppietra - Wednesday, October 27, 2021 - link

    "hitting the memory subsystem they are going to be competing for cache hits"
    CPU and GPU also have their own cache (CPU 24MB L2 total; GPU don’t know how much now) which is very substantial.
    And I think you are not seeing the picture about CPU and GPU not having to duplicate resources, working on the same data in an enormous 48MB system cache (when using native APIs of course) before even needing to access RAM, reducing latency, etc. This can be very powerful. So no, I don’t assume that there will any significant impact because of some fringe case while ignoring the great benefits that it brings.
    Reply
  • richardnpaul - Wednesday, October 27, 2021 - link

    One person's fringe edge case is another person's primary use case.

    The 24/48MB is a shared cache between the CPU and GPU (and everything else that accesses main memory).
    Reply
  • Ppietra - Wednesday, October 27, 2021 - link

    no, it’s a fringe case period! You don’t see laptop processors with these amounts of L2 cache and system cache anywhere, not even close, and yet for some reason you feel that it would be at an disadvantage, failing to acknowledge the advantages of sharing Reply
  • richardnpaul - Thursday, October 28, 2021 - link

    What you call a fringe case I call 2.35m people. Okay, so it's probably on about 1.5 to 2% of Mac users; it's ~2.5% of Steam users.

    I know people who play games on Windows Machines because their GPUs in their Macs aren't good enough. Those people who are frustrated having to maintain a Windows machine just to play games. Those people will buy into an M1 Pro or Max just so they can be rid of the Windows system. It won't be their main concern, but then they're not going to be buying an M1 Pro/Max for the reason of rendering etc when they're a web developer, they're going to buy it so that they can dump the pain in the backside Windows gaming machine. Valve don't maintain their MacOS version of Steam for no good reason.
    Reply

Log in

Don't have an account? Sign up now