DynamIQ

Before we discuss the new CPUs, we need to discuss DynamIQ. Introduced 5 years ago, ARM’s original big.LITTLE (bL) technology, which allows multiple clusters of up to 4 CPUs to be chained together, has been massively successful in the marketplace, allowing various combinations of its Cortex-A family of CPUs to power mobile devices ranging from the budget-friendly with no frills to the budget-busting flagships stuffed with technology. The combination of Cortex-A and bL extends beyond smartphones and tablets too, with applications ranging from servers to automotive.

Over the years, ARM’s IP and the needs of its customers have evolved, necessitating a new version of bL: DynamIQ. ARM started working on DynamIQ in 2013 by asking a single question: “How do you make big.LITTLE better?” Looking forward, ARM could see DynamIQ needed to be more flexible, more scalable, and offer better performance. Considering how much work went into this project, DynamIQ will be around for at least the next few years, so hopefully it delivers on those goals.

Like bL, DynamIQ provides a way to group CPUs into clusters and connect them to other processors and hardware within the system; however, there’s several significant changes, starting with the ability to place big and little Cortex-A CPUs in the same cluster. With bL, different CPUs had to reside within separate clusters. What appears to be a simple reshuffling of cores actually impacts CPU performance and configuration flexibility.

Another big change is the ability to place up to 8 CPUs inside a single cluster (up from 4 for bL), with the total number of CPUs scaling up to 256 with 32 clusters, which can scale even further to 1000s of CPUs with multi-chip support provided via a CCIX interface. Within a cluster CPUs are divided into voltage/frequency domains, and within a domain each core is inside its own power domain. This allows each CPU to be individually powered down, although all CPUs in the same domain must operate at the same frequency, which is no different from bL; however, with DynamIQ each cluster can support up to 8 voltage/frequency domains, providing greater flexibility than bL’s single voltage/frequency domain per cluster. So, what does this mean? It means that, in theory, an SoC vendor could place each CPU into its own voltage domain so that voltage/frequency could be set independently for each of the 8 CPUs in the cluster. Each voltage/frequency domain requires its own voltage regulator, which adds cost and complexity, so we’ll most likely continue to see 2-4 CPUs per domain.

ARM still sees 8-core configurations being used in mobile devices over the next few years. With bL, this would likely be a 4+4 pairing using 4 big cores and 4 little cores or 8 little cores spread across 2 clusters. With DynamIQ, all 8 cores can fit inside a single cluster and can be split into any combination (1+7, 2+6, 3+5, 4+4) of A75 and A55 cores. ARM sees the 1+7 configuration, where one A55 core is replaced by a big A75 core, as particularly appealing for the mid-range market, because it offers up to 2.41x better single-thread performance and 1.42x better multi-thread performance for only a 1.13x increase in die area compared to an octa-core A53 configuration (iso-process, iso-frequency).

The main puzzle piece that enables this flexibility is the DynamIQ Shared Unit (DSU), a separate block that sits inside each DynamIQ cluster and functions as a central hub for the CPUs within the cluster and the bridge to the rest of the system. Each voltage/frequency domain in the cluster can be configured to run synchronously or asynchronously with the DSU. Using asynchronous bridges (one per domain) allows different CPUs (A75/A55) to operate at different frequencies (using synchronous bridges would force all CPUs to operate at the same frequency).

The DSU communicates with a CCI, CCN, or CMN cache-coherent interconnect through 1 to 2 128-bit AMBA 5 ACE ports or a single 256-bit AMBA 5 CHI port. There’s also an Accelerated Coherency Port (ACP) for attaching specialized accelerators that require cache coherency with the CPUs. It’s also used for enabling DynamIQ’s cache stashing feature, which we’ll discuss in a minute. Finally, there’s a separate peripheral port that’s used to program the accelerators attached to the ACP interface (basically a shortcut for programming transactions so they do not need to be routed through the system interconnect).

So far we’ve discussed DynamIQ’s flexibility and scalability features, but it also improves CPU performance through a new cache topology. With bL, CPUs inside a cluster had access to a shared L2 cache; however, DynamIQ compatible CPUs (currently limited to A75/A55) have private L2 caches that operate at the CPU core’s frequency. Moving the L2 closer to the core improves L2 cache latency by 50% or more. DynamIQ also adds another level of cache: The optional shared L3 cache lives inside the DSU and is 16-way set associative. Cache sizes are 1MB, 2MB, or 4MB, but may be omitted for certain applications like networking. The L3 cache is technically pseudo-exclusive, but ARM says it’s really closer to being fully-exclusive, with nearly all of the L3’s contents not appearing in the L2 and L1 caches. If the new L3 cache was inclusive, meaning that it contained a copy of a CPU’s L2, then its performance benefit would be largely mitigated and a lot of area and power would be wasted.

The L3 cache can be partitioned, which can be useful for networking or embedded systems that run a fixed workload or applications that require more deterministic data management. It can be partitioned into a maximum of 4 groups, and the split can be unbalanced, so 1 CPU could get 3MB while the other 7 CPUs would share the remaining 1MB in an 8-core 4MB L3 configuration. Each group can be assigned to specific CPU(s) or external accelerators attached to the DSU via the ACP or other interface. Any processors not specifically assigned to a cache group share the remaining L3 cache. The partitions are dynamic and can be created/adjusted during runtime by the OS or hypervisor.

One of the features supported by DynamIQ is error reporting, which allows the system to report detected errors, both correctable and uncorrectable, to software. The L3 supports ECC/parity (actually all levels of cache and snoop filters do, with SECDED on caches that can hold dirty data and SED parity on caches that only hold clean data) in order to be ASIL-D compliant. The L3 also has persistent error correction and can support recovery from a single hard error (data poisoning is supported at a 64-bit granularity).

Another new feature is cache stashing, which allows a GPU or other specialized accelerators and I/O agents to read/write data into the shared L3 cache or directly into a specific CPU’s L2 cache via the ACP or AMBA 5 CHI port. A specific example is a networking appliance that uses a TCP/IP offload engine to accelerate packet processing. Instead of writing its data to system memory for the CPU to fetch or relying on some other cache coherency mechanism, the accelerator could use cache stashing to write its data directly into the CPU’s L2, increasing performance and reducing power consumption.

In order to use cache stashing, the software drivers running in kernel space need to be aware of the processor and cache topology, which will require custom code to enable hardware outside the cluster to access the shared L3 or an individual CPU’s L2. While a limitation for consumer electronics where time to market is key, it’s not as serious of an issue for commercial applications.

While cache stashing could be a useful feature for sharing data with processors sitting outside the cluster, DynamIQ also makes it easier to share data among CPUs within a cluster. This is one of the reasons why ARM wanted to bring big and little CPUs into the same cluster. Moving cache lines within a DynamIQ cluster is faster than moving them between clusters like with bL, reducing latency when migrating threads between big and little cores.


DynamIQ also includes improved power management. Having the DSU perform all cache and coherency management in hardware instead of software saves several steps when changing CPU power states, allowing the CPU cores to power up or down much faster than before with bL. The DSU can also power down portions of the L3 cache to reduce leakage power by autonomously monitoring cache usage and switching between full on, half off, and full off states.

The DSU includes a Snoop Control Unit (SCU) with an integrated snoop filter for handling the new cache topology. Together with L3 cache and other control logic, the DSU is about the same area as an A55 core in its max configuration or half the area of an A55 in its min configuration. These are just rough estimates because most of the DSU area is used by the L3 cache and the size of the DSU logic scales with the number of CPU cores.

Some of ARM’s partners may be slow to migrate from bL to DynamIQ, choosing to stick with a technology and CPU cores they are familiar with instead of investing the extra time and money required to develop new solutions. But for performance (and marketing) sensitive markets that need access to ARM’s latest CPU cores, such as mobile, the switch to DynamIQ should happen quickly, with the first DynamIQ SoCs likely to appear by the end of 2017 or early 2018.

Introduction Cortex-A75 Microarchitecture
POST A COMMENT

104 Comments

View All Comments

  • Meteor2 - Monday, May 29, 2017 - link

    How? Reply
  • Paul A. Clayton - Monday, May 29, 2017 - link

    The A55 design is constrained not merely by area and power but also by configurability. Being able to vary the L1 cache sizes from 16 KiB to 64 KiB means that the pipeline structure and cycle time is not optimized for one size. Targeting multiple processes and design factors (e.g., SRAM libraries can be tuned for different performance/area/power tradeoffs) also constrains optimization.

    While ARM might have had in mind a particular implementation for optimization (for which it might provide hard cores), it is still limited to providing acceptable designs for other implementations. Some microarchitectural optimizations might strongly depend on implementation details which are outside of ARM's control.

    There are probably also higher-risk design possibilities that were not explored simply because the resources were not available. Having multiple design teams with similar targets typically would mean wasting effort, but such provides a potential for a better design. It would be difficult for ARM to charge for the cost of unused designs given that other designs are available.

    Targeting a broad range of workloads also means a design will tend to be worse than a design targeting a narrower range of workloads.
    Reply
  • Kevin G - Monday, May 29, 2017 - link

    Of course they could but would those changes have permitted it still be within the design constraints of the A55? Small die size and lower power are two characteristics that are not compromised for the A55. Faster is easy to do with more power but considering that the A55 is the little core, higher power consumption is to be avoided. Similarly a faster core might be done with a larger die area. There are trade offs here but the pressure from ARM's customers is to keep this as small as possible.

    Considering those constraints, I considering any improvements to be rather impressive. If there is a silver bullet that ARM could have used to make it faster/smaller/consumer less power in these designs without violating the constraints they have in place, I'd like know what it was.
    Reply
  • tipoo - Monday, May 29, 2017 - link

    Alas, still waiting to find out how different Apples Zephyr is from standard Little cores like it. It's nearly twice as big. Reply
  • jjj - Monday, May 29, 2017 - link

    "What will be the goal for the next core, which will be coming from ARM’s Austin team that produced the A72? "

    That was my main question too but my hope was that the next core is aiming for much higher IPC. They need it for server and dual big core configs in mobile on 7nm.
    Or maybe they don't quite need it really, A75 is really fast and if the next core adds 15-20% higher IPC combined with higher clocks enabled by the process, that's quite a lot and rather amazing from a perf density perspective.

    Not much talk about area, any clue how A75 + DynamIQ compare to previous solutions - ofc the cache part is easy to factor in.

    It is interesting that A75 scales better with higher clocks, any guesses for clocks at 2W? A laptop with 4b4L would be rather nice.

    A55 not targeting higher clocks seems a bit odd, would mean that power goes down if folks move from A53 on 16FF to A55 on 7nm so maybe ARM has another update before 7nm.
    Reply
  • Meteor2 - Monday, May 29, 2017 - link

    I think these are still mobile CPUs. It's up to Cavium et al to do ARM ISA-compatible designs for servers. ARM's not that bothered; the mobile market is far larger. Reply
  • jjj - Monday, May 29, 2017 - link

    ARM is very eager to go server and just a year ago ARM was targeting 25% share in server by 2020. This gen does highlight infrastructure as they call it, a large segment where they've been gaining share and the next step is server.
    7nm is where it starts really, TSMC has the HPC version of the process and ARM needs to be ready too with the core that follows A75.
    What's is unclear is the strategy. A75 is already desktop class so they could just increase IPC some more but maybe they can aim higher. It seems that the Austin team got an extra year to work on the next core so that's 3 years, could be an entirely new design.
    Reply
  • Kevin G - Monday, May 29, 2017 - link

    ARM in the server space is sound much like the hype of Linux on the desktop: always 'next year'.

    The challenges ARM designs have had have been to simply get out to market. AMD's Seattle chip is indeed out but suffered two years of delays and most of the design wins have evaporated due to it. AMD's K12 efforts are MIA right now. Similarly Cavium's ThunderX line is interesting but not the game changer it was hyped to be. Broadcom has exited the ARM server market after promoting an interesting design (SMT on ARM!). Applied Micro's efforts for ARM servers have been lost to corporate mergers. Caldexa folded years ago.

    The one interesting ray of hope is that there are indeed some customers like Microsoft, Facebook, Google and Amazon who are interesting in ARM's low power nature to certain workloads. Microsoft has a version of Windows Server running on ARM but is not releasing it publicly, rather keeping it tied to their Azure cloud services. I have yet to hear where MS has gotten their ARM hardware from though. Google has dipped their toe into chip development for their deep learning efforts and it would be a straight forward process to piece together their own server designs from licensed IP blocks now that they have the in-house expertise to do it (saying they can and them doing it are two different things). In the end, the big cloud providers who could have spurred the ARM server space for everyone may keep the ARM server idea private to themselves while the rest of the market gets to deal with x86. Considering that x86 is perceived as higher power and higher cost, this serves the cloud providers well as it give incentive for companies to migrate to their cloud solutions instead of looking at ARM alternatives.

    The other difficulty for ARM in the market place right now is that Intel preemptively released their response: the Xeon D. Intel was doing a performance/watt play there and it paid for for the low end server market. In most cases, the Xeon D for a pure single socket server was a better choice than the Xeon E5 1xxxx or Xeon E3 line up. I suspect that Intel management sees Xeon D as 'too good' and thus hasn't been quick to bring an updated Sky Lake version to market.
    Reply
  • Wilco1 - Monday, May 29, 2017 - link

    Please read: http://www.anandtech.com/show/11189/appliedmicro-x... - it says both Vulcan and XGene are alive. You forgot to mention QC's Centric (48 cores on 10nm, available this year). There are also 64-core/256GB DRAM beasts made by HiSilicon. Reply
  • jjj - Monday, May 29, 2017 - link

    If you assume a 15-20% IPC gain over A75 for ARM's 7nm core and clock it past 4GHz for server, that's somewhat the worst case scenario for where ARM is in server in 2018-2019. We can assume DinamIQ evolves a bit by then too.
    That wouldn't be bad at all and ARM has extraordinary perf density. They might deliver more than that, we'll see.
    Reply

Log in

Don't have an account? Sign up now