Closing Remarks: Pushing Forward on 3 nm For 2024

Having attended Arm's Client Technology Day, my initial impression is that Arm has opted to refine and hone its IP for 2024 rather than make groundbreaking changes. Following on from last year's introduction of the Armv9.2 family of cores, Arm has made some notable changes within the architecture of the latest Cortex series for 2024, with a clear and intended switch to the more advanced 3 nm process node: both Samsung and TSMC 3 nm processes serve as the basis of the client CSS for the 2024 platform.

The Cortex-X925, Cortex-A725, and Cortex-A520 cores have been optimized for the 3 nm process, and Arm touts significant performance and power efficiency improvements for all three. The Cortex-X925, with its enhanced 10-wide decode and dispatch width and higher clock speeds reaching up to 3.8 GHz, looks to set a new standard for single-threaded IPC performance. Arm's updated v9.2 platform looks ideal for high-performance applications, including AI workloads and high-end gaming, both in the mobile space and within Microsoft's Windows on Arm ecosystem.

In the grand scheme of things, and going by Arm's in-house performance comparisons between the new CSS platform and last year's TCS2023 platform, Arm claims performance gains of between 30 and 60%, depending on the task and workload. If those figures are taken at face value, the improvements are impressive, although the transition to 3 nm is likely the primary driver of the gains rather than the underlying architectural improvements.

The Cortex-A725 balances performance and efficiency, making it suitable for a range of mid-range devices. Thanks to architectural enhancements such as increased cache sizes and expanded reorder buffers, Arm claims an improvement of up to 35% in performance efficiency over the previous generation. The refreshed Cortex-A520 focuses primarily on being optimized for the 3 nm node while looking to remain unmatched in power efficiency, achieving a 15% energy saving compared to its predecessor. This core is optimized for low-intensity workloads, making it ideal for power-sensitive applications like IoT devices and lower-cost smartphones.

AI capabilities have been a significant focus in Arm's latest offerings. The Cortex-X925 and Cortex-A725 cores primarily integrate dedicated AI accelerators, allowing access to optimized software libraries, such as KleidiAI and KleidiCV, ensuring efficient AI processing. These enhancements are crucial for applications such as neural language models and LLMs.

Arm also continues to support its latest Core Cluster with a characteristically comprehensive ecosystem driven by the new CSS platform, coupled with Arm Performance Studio and the KleidiAI and KleidiCV libraries. These tools give developers a robust foundation to fully leverage the new architecture's capabilities, which reduces overall time-to-market and fosters innovation across various industries, such as content creation and on-device AI inferencing. The CSS platform's integration with operating systems such as Android, Linux, and Windows (Windows on Arm) ensures a larger reach in adoption. It also encourages broader development, making software and applications available on more devices than in previous generations.

In summary, Arm's move of all its latest CPU designs onto 3 nm process technology and the refinements in the Cortex-X925 and Cortex-A725 cores demonstrate a strategic focus on optimizing existing architectures rather than making radical changes. These refinements include increased cache sizes per core, a wider pipeline, and a bolstered DSU-120 Core Cluster for 2024, which together deliver substantial performance and power efficiency gains on paper.

While these cores enable new devices capable of handling demanding applications, most of the improvements in efficiency and performance stem from the jump to the more advanced, yet more challenging, 3 nm node. As Arm continues to push the boundaries of what's possible with its IP, these technologies should pave the way for more powerful, efficient, and intelligent mobile devices. Whether in the new generation of AI-capable devices or in mobile gaming, Arm is looking to offer it all.

55 Comments

  • SarahKerrigan - Wednesday, May 29, 2024 - link

    "The core is built on Arm's latest 3 nm process technology, which enables it to achieve significant power savings compared to previous generations."

    ARM doesn't have lithography capabilities and this is a synthesizable core. This sentence doesn't mean anything.
  • meacupla - Wednesday, May 29, 2024 - link

    AFAIK, the core design needs to be adapted to the smaller process node, and it's not as simple as shrinking an existing design.
  • Ryan Smith - Wednesday, May 29, 2024 - link

    Thanks. Reworded.
  • dotjaz - Wednesday, May 29, 2024 - link

    "ARM doesn't have lithography capabilities and this is a synthesizable core"

    And? Apple also doesn't have litho. You are telling me they can't implement anything with external foundries? Do you even know the basics of modern chip design? DTCO has been THE key to achieve better results for at least half a decade now.

    Also this is clearly not just a synthesizable core. ARM explicitly announced this is available as production ready cores, that means the implementations are tied to TSMC N3E and Samsung SF3 via DTCO, and this is the first time ARM has launched with ready for production hard core implementation.

    You clearly didn't understand, and that's why it didn't mean anything TO YOU, and probably had to be dumbed down for you.

    It actually makes perfect sense to me.
  • lmcd - Wednesday, May 29, 2024 - link

    There was a turnaround time slide that didn't get Anandtech text to go with it that made this more clear, but a skim would miss it.
  • zamroni - Monday, June 17, 2024 - link

    it means the logic circuit is designed for 3nm's characteristics, e.g. signal latency, transistor density etc.

    older cortex designs can be manufactured using 3nm but it won't reach same performance as they were designed to cater higher signal latency of 4nm or older generations
  • Duncan Macdonald - Wednesday, May 29, 2024 - link

    Lots of buzzwords but low on technical content. Much of this reads like a presentation designed to bamboozle senior management.
  • Ryan Smith - Wednesday, May 29, 2024 - link

    Similar sentiments were shared at the briefing.
  • continuum - Thursday, May 30, 2024 - link

    Whole tone of this article feels like it was written by an AI given how often (compared to what I'm used to in previous articles on this from Anandtech!) certain sentiments like "3nm process" and other buzzwords are used!
  • name99 - Wednesday, May 29, 2024 - link

    Not completely true...

    Interesting points (relative to Apple, I don't know enough about Nuvia internals to comment) include
    - 4-wide load (vs Apple 3-wide load) is a nice tweak.

    - 6-wide NEON is a big jump. Of course they have to scramble to cover that they STILL don't have SVE or SME; even so there is definitely some code that will like this, and the responses will be interesting. I can see a trajectory for how Apple improves SME and SSVE as a response, probably (we shall see...) also boosting NEON to 256b-SVE2. (But for this first round, still 4xNEON=2xSVE2)
    Nuvia, less clear how they will counter.

    Regardless I'm happy about both of these and requiring a response from Apple which, in turn, makes M a better chip for math/science/engineering (which is what I care about).

    They're still relying on run-ahead for some fraction of their I-Prefetch. This SOUNDS good, but honestly, that's a superficial first response and you need to think deeper. Problem is that as far as prefetch goes, branches are of two forms – near branches (mostly if/else), which don't matter, a simple next line prefetcher covers them; and far branches (mostly call/return). You want to drive your prefetcher based on call/return patterns, not trying to run the if/else fetches enough cycles ahead of Decode. Apple gets this right with an I-prefetcher scheme that's based on call/return patterns (and has recently been boosted to use some TAGE-like ideas).

    Ultimately it looks to me like they are boxed in by the fact that they need to look good on phones that are too cheap for a real NPU or a decent GPU. Which means they're blowing most of their extra budget on throughput functionality to handle CPU-based AI.
    Probably not the optimal way to spend transistors as opposed to Apple or QC. BUT
    with the great side-effect that it makes their core a lot nicer for STEM code! Maybe not what marketing wanted to push, but as I said, I'll take it as steering Apple and QC in the right direction.
    I suspect this is part of why the announcement comes across as so light compared to the past few years – there simply isn't much new cool interesting stuff there, just a workmanlike (and probably appropriate) use of extra transistors to buy more throughput.
