Cortex-A720: Middle Core, Big on Efficiency

Arm's latest middle core, the Cortex-A720, hasn't changed much from last year's Cortex-A715, which was also Arm's first AArch64-only middle core. Arm has a set philosophy for its A700 family: increase performance through optimizations, deliver maximum power efficiency within set thermal limits, and optimize for actual use cases instead of blisteringly fast benchmark performance. Arm's key aims are to improve performance while holding the line on power efficiency and area, all within an acceptable thermal envelope. Cost is also essential, with many entry-level mobile devices already on the market leveraging the Cortex-A700 family for their main cores.

Like the Cortex-X4, the Cortex-A720 is built around the Armv9.2 ISA, and Arm has optimized the design so that the A720 delivers more performance within the same power budget as the Cortex-A715. The Arm 700-series family typically covers a much broader range of applications and caters to various markets, including, but not limited to, digital TVs (DTV), smartphones, and laptops. Having more flexibility in a more diverse space has its advantages, and Arm looks to capitalize on that with the Cortex-A720 acting as the 'workhorse' of the TCS23 core cluster.

Entry-level devices such as smartphones typically aim to reduce cost while maximizing performance and efficiency, and that's where cores such as the Cortex-A720 come into play. The Cortex-X4 is primarily allocated to flagship devices or those that require the most burst and sustained performance, such as top-tier smartphones, tablets, and laptops. Meanwhile, the Cortex-A720 is the next step down, giving up the X4's high peak performance for a much smaller core size and correspondingly lower energy consumption.

For the Cortex-A720 in particular, Arm is also offering multiple configuration options. Along with the standard, highest-performing option, Arm has what they're terming an "entry-tier" configuration that shaves A720 down to the same size as Arm Cortex-A78, all while still offering a 10% uplift in overall performance. With some Arm customers being especially austere on die sizes, moves such as these are necessary to convince them to finally make the jump over to the Cortex-A7xx series and Armv9.

Arm's focus is to broaden the range of the entry-level market and expand on the possible use cases for its Cortex-A720 core so that it can be implemented into a wider variety of entry-level mobile devices and in lower-end markets.

One of the critical improvements to the Cortex-A720 compared to the previous A715 is faster branch mispredict recovery. A branch predictor guesses the direction and target of a branch before the branch is actually resolved, so the core can keep fetching and speculatively executing instructions along the predicted path. When the guess is wrong, the speculative work has to be discarded and the pipeline refilled from the correct path. Recovering from a mispredict faster has multiple benefits: it reduces the delay before useful instructions are executing again, which improves overall performance, and because a misprediction disrupts the flow of instructions through the pipeline, shortening that disruption also benefits overall power efficiency.

Arm has reduced the overall branch mispredict penalty on the A720 to 11 cycles, down from 12 on the Cortex-A715. They have also improved upon their 2-taken branch prediction technique, which lets the predictor handle up to two taken branches per cycle, and which, again, adds efficiency to the pipeline and reduces the penalties associated with misprediction.
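To put the one-cycle reduction in perspective, here is a minimal back-of-the-envelope sketch using the textbook stall model, where the cycles lost per instruction equal branch fraction times mispredict rate times penalty. Only the 12-cycle and 11-cycle penalties come from Arm; the base CPI, branch fraction, and mispredict rate are hypothetical values picked for illustration, not Arm figures.

```python
# Simple CPI model: effective CPI = base CPI + stall cycles from mispredicts.
# All three constants below are ASSUMED illustrative values, not Arm data.
BASE_CPI = 0.5          # hypothetical CPI with perfect branch prediction
BRANCH_FRACTION = 0.2   # hypothetical share of instructions that are branches
MISPREDICT_RATE = 0.05  # hypothetical share of branches that are mispredicted


def mispredict_overhead(penalty_cycles: int) -> float:
    """Average cycles lost per instruction to branch mispredictions."""
    return BRANCH_FRACTION * MISPREDICT_RATE * penalty_cycles


def effective_cpi(penalty_cycles: int) -> float:
    """Base CPI plus the average mispredict stall cost per instruction."""
    return BASE_CPI + mispredict_overhead(penalty_cycles)


cpi_a715 = effective_cpi(12)  # 12-cycle penalty quoted for Cortex-A715
cpi_a720 = effective_cpi(11)  # 11-cycle penalty quoted for Cortex-A720
speedup = cpi_a715 / cpi_a720

print(f"CPI with 12-cycle penalty: {cpi_a715:.3f}")
print(f"CPI with 11-cycle penalty: {cpi_a720:.3f}")
print(f"Modelled speedup: {(speedup - 1) * 100:.1f}%")
```

Under these assumed inputs the model gives a gain of roughly 1.6%. Branch-heavy or hard-to-predict code would see more, and well-predicted code would see less, which is why a change like this shows up as a workload-dependent contribution rather than a flat uplift.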

Another improvement is pipelined FDIV/FSQRT (floating point division and square root), which allows a new divide or square root operation to start before the previous one has finished. Allowing FDIV and FSQRT operations to overlap in this way can improve instruction throughput, and Arm claims to have achieved a significant speed boost without impacting the overall area. There are also faster transfers from the floating point, NEON, and SVE2 side (SVE2 being the vector extension Arm introduced with Armv9) to the integer side. This also includes overall improvements to the issue queues and execution units, which simplify the forwarding of data to the AGUs.
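The throughput benefit of pipelining a long-latency unit can be sketched with a simple cycle count. This is a generic model rather than a description of the A720's divider: the 10-cycle latency and 2-cycle initiation interval below are hypothetical numbers chosen only to show the shape of the gain, and the operations are assumed to be independent of each other.

```python
def total_cycles(num_ops: int, latency: int, initiation_interval: int) -> int:
    """Cycles to finish num_ops independent operations on one unit.

    latency: cycles from issue to result for a single operation.
    initiation_interval: cycles between successive issues to the unit.
    A non-pipelined unit has initiation_interval == latency.
    """
    if num_ops <= 0:
        return 0
    # First result after `latency` cycles, then one more result per interval.
    return latency + (num_ops - 1) * initiation_interval


# ASSUMED illustrative numbers, not Cortex-A720 specifications.
LATENCY = 10  # hypothetical divide latency in cycles
NUM_OPS = 8   # independent divides issued back to back

blocking = total_cycles(NUM_OPS, LATENCY, initiation_interval=LATENCY)
pipelined = total_cycles(NUM_OPS, LATENCY, initiation_interval=2)

print(f"Non-pipelined: {blocking} cycles")
print(f"Pipelined:     {pipelined} cycles")
```

Note that the latency of any single divide is unchanged in this model. The gain comes entirely from overlapping independent operations, so code with long dependent chains of divides would benefit far less.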

Within the memory system of the Cortex-A720, Arm has reduced the L2 cache latency to 9 cycles, and it claims up to 2x the memset(0) bandwidth within the L2 cache. Without going into much detail about its methods, Arm also claims generational improvements to the prefetcher's accuracy and coverage. The A720 also gains a new L2 spatial prefetch engine, a feature that was previously pioneered on the Cortex-X core designs.

Translating the refinements and improvements to performance, Arm estimates the performance uplift to be about 15% at iso-frequency, depending on the workload. Among other benchmarks, there are clear gains over the previous generation in SPECint2017 and improvements within internal testing with SPECint2006. For example, in the SPECint2006 403.gcc test, the Cortex-A720 has a gain of around 5% over the Cortex-A715, with an even more significant improvement of about 6% in power efficiency.

Other performance metrics on offer include DRAM reads, which Arm has focused a lot of attention on making more efficient. Most tests show minor gains, but the SPECint2006 483.xalancbmk test shows a massive improvement of up to 41% in DRAM read performance. While everything is relative and dependent on the workload, Arm has made some clear forward progress with its latest Cortex-A720 CPU core microarchitecture.

52 Comments

  • Doug_S - Tuesday, May 30, 2023 - link

    Yes TSO is a mode, which requires a setting IN THE ISA to be able to enable it. That setting does not exist on ARM CPUs, only on Apple Silicon implementations.

    abr2 found what I didn't have time to look for in the ARMv8 architecture reference manual proving your ridiculous claim that ARMv8 required AArch32 support was wrong. Now you're picking on nits trying to twist my words as if I was claiming TSO is an instruction. Give it up you are wrong, everyone knows it, go away quietly instead of making yourself look like even a bigger fool.
  • dotjaz - Tuesday, May 30, 2023 - link

    And your understanding of ARMv9 is abysmal at best. ARMv9-A made Aarch32 EL0 optional, it wasn't possible in ARMv8-A. There is no special license or "something like that".
  • Chelgrian - Tuesday, May 30, 2023 - link

    It has been possible and architecturally permissible since ARMv8.0 to create an AArch64-only implementation. If AArch32 is not supported at a particular exception level, then setting the M[4] bit in the SPSR and executing an ERET instruction to that level will produce an illegal exception return exception. Combined with designing the system to only reset into AArch64 at the highest implemented exception level, this gives you an AArch64-only design.

    This is tangentially referred to in rule R-tytwb in section D1.3.4 of revision J.a of the ARM Architecture Reference Manual.

    A conformant ARMv8.x implementation can (but is not mandated to) implement AArch32 at any exception level.

    A conformant ARMv9.x implementation may only implement AArch32 at EL0. This is documented in section 3.1 of revision J.a of the ARM Architecture Reference Manual.

    There are even documented ARMv8.1 processors out there which are AArch64 only for example the Cavium ThunderX2

    https://en.wikichip.org/wiki/cavium/thunderx2

    "Only the 64-bit AArch64 execution state is support. No 32-bit AArch32 support."
  • abr2 - Tuesday, May 30, 2023 - link

    From:
    Arm® Architecture Reference Manual
    Armv8, for Armv8-A architecture profile
    [2021 version]

    D1.20.2 Support for Exception levels and Execution states
    Subject to the interprocessing rules defined in Interprocessing on page D1-2525, an implementation of the Arm architecture could support:
    • AArch64 state only.
    • AArch64 and AArch32 states.
    • AArch32 state only.
  • techconc - Thursday, June 8, 2023 - link

    @dotjaz - You don’t know what you’re talking about. The Apple A7 chip supported both A32 and A64 instruction set. By the A11 (in 2017), Apple dropped A32 instruction set and was 64bit only.
  • dotjaz - Tuesday, May 30, 2023 - link

    > I'm very fairly certain of this, but if you know something I don't? (I might not..)

    You are clearly wrong, no ARM licensees can alter ARM ISA in any way. That's the foundation of ARM's licensing terms. And that's the sole reason Apple's AMX extension is masked as undocumented "co-processor" not available to anyone. Even if you knew nothing about the fundamental licensing terms, you should be able to figure that out because of this.
  • name99 - Monday, May 29, 2023 - link

    Jesus. The levels of delusion that are required to write a comment like this.
    You really think that
    (a) ARM is going to make a big deal about Apple being, in some legalistic sense, "non-compliant" AND
    (b) that Apple gives a fsck?

    Exactly who do you think gets hurt if Apple are not allowed to call APPLE SILICON (note that branding...) Arm Compliant?
  • Wereweeb - Tuesday, May 30, 2023 - link

    Lmao apple fanboys still as hilarious and ignorant as always
  • Silver5urfer - Sunday, May 28, 2023 - link

    So much of this nonsensical 64Bit bs. Esp in the name of security, News Flash - Qualcomm EDL mode exists and thankfully it helps the folks to unlock their Bootloaders.

    The whole 64Bit thing killed the passion on Android. Google just enforces it brutally by n-1 where n being the latest API SDK, thus making all the old apps go obsolete. Windows and x86 excels massively just because of this, Apple did it because they always want to control everything which they do, and the stupid Google just copies them in hoping to make same but they killed all fun on android now, the UI is so boring garbage and the whole Filesystem nerfs - Scoped Storage, lack of proepr SD Card app support and a ton of other APIs blacklisted. Limited the scope of foreground and background apps utilizing the hardware of a phone.

    What's the use of the ARM processor devices, when your latest and greatest X4 ARM phone will be outdated in 1 year and goes to dumpster after 2-3 years max. Non Removable, non serviceable, no longevity of the OS / HW / Software. Locked like chastity belt for the User tinkering when the core OS, the Kernel runs Linux. A big L to consumers and all that Environment jabber is literally just a worthless cacophony. Literally you have latest V30 class Micro SDs and SD Association even had PCIe / NVMe SSD class but since not a single $1000-$2000 Android phone pushes forward for a real computer in pocket, its rather a spybox and a mere 2FA device with some Navigation, Social Media, Camera attached.

    All this ARM tech is only useful if your device Software API can open it up properly and used a proper pocket computer. But that ship has sailed. All that X4 processing power and multi core non homogeneous compute wasted on basic consumables.
  • rpg1966 - Monday, May 29, 2023 - link

    Could you explain how the UI is affected by the bitness of the OS?
