US Dept. of Energy Announces Frontier Supercomputer: Cray and AMD to Build 1.5 Exaflop Machine
by Ryan Smith on May 7, 2019 7:40 AM EST

The history of the computing industry is one of constant progress. Processors get faster, storage gets cheaper, and memory gets denser. We see the repercussions of this advancement through all aspects of society, and that extends to the top as well, where national governments continue to invest in bigger and better supercomputers. One part technological necessity and one part technological race, the exascale era of supercomputers is about to begin, as orders for the first exaFLOP-capable systems are now going out. It's only fitting then that this morning the United States Department of Energy is announcing the contract for their fastest supercomputer yet, the Frontier system, which will be built by Cray and AMD.
Frontier is planned for delivery in 2021, and when it’s activated it will become the second and most powerful of the US DOE’s two planned 2021 exascale systems, with performance expected to reach 1.5 exaFLOPS. The ambitious system won’t come cheaply, however; with a price tag of over 500 million dollars for the system alone – and another 100 million dollars for R&D – Frontier is among the most expensive supercomputers ever ordered by the US Department of Energy.
The new supercomputer is being built as part of the US DOE’s CORAL-2 program for supercomputers, with Frontier scheduled to replace Oak Ridge National Laboratory’s current Summit supercomputer. Summit is the current reigning champion in the supercomputer world, with 200 petaFLOPS of performance, and accordingly the US DOE and Oak Ridge are aiming to significantly improve on its performance for the new computer. All told, Frontier should be able to deliver over 7x the performance of Summit, and is expected to be the fastest supercomputer in the world once it’s activated.
Like Summit (and Titan before it), Frontier is an open science system, meaning that it’s available to academic researchers to run simulations and experiments on. Accordingly, the lab is expecting the supercomputer to be used for a wide range of projects across numerous disciplines, including not only traditional modeling and simulation tasks, but also more data-driven techniques for artificial intelligence and data analytics. In fact the latter is a bit of new ground for the lab and the system’s eventual users; just as we’ve seen in the enterprise space over the past few years, neural network-based AI is becoming an increasingly popular technique to solve problems and extract analysis from large datasets, and now researchers are looking at how to refine those techniques from the current-generation systems and apply them to exascale-level projects.
US Department of Energy Supercomputers

| | Frontier | Aurora | Summit |
|---|---|---|---|
| CPU Architecture | AMD EPYC (Future Zen) | Intel Xeon Scalable | IBM POWER9 |
| GPU Architecture | Radeon Instinct | Intel Xe | NVIDIA Volta |
| Performance (RPEAK) | 1.5 EFLOPS | 1 EFLOPS | 200 PFLOPS |
| Power Consumption | ~30MW | N/A | 13MW |
| Nodes | 100 Cabinets | N/A | 3,400 |
| Laboratory | Oak Ridge | Argonne | Oak Ridge |
| Vendor | Cray | Intel | IBM |
| Year | 2021 | 2021 | 2018 |
Frontier: Powered by Cray & AMD
Officially, the prime contractor for Frontier will be Cray. But looking at the specifications, you could be excused for thinking it was AMD. Cray for its part is partnering with the chipmaker for the system, and as a result AMD is providing most of the core hardware for the new supercomputer. Frontier is designed as a next-generation CPU + accelerator system, with a mix of CPUs and GPUs doing the heavy compute work, and AMD will be supplying both the CPUs and GPUs. And as the principal processor provider, AMD will also be taking on a lot of the responsibility for developing the software stack, with the company working with Cray to develop an enhanced version of their ROCm environment to best extract performance from the massive cluster of CPUs and GPUs.
On the CPU side of matters, AMD will be supplying a customized next-generation EPYC CPU. AMD has confirmed that it’s going to be using a future generation of their Zen CPU cores, and given the timing of the project, we’re almost certainly looking at a Zen 3 or Zen 4 design here. Just how custom AMD’s CPU is remains to be seen, but their announcement has revealed that Frontier’s CPUs will include new instructions for the optimization of AI and supercomputing workloads.
Meanwhile on the GPU side of matters, AMD and Cray are holding their cards a little closer. Rather than naming any architecture or architectural generation, AMD is only saying that the GPUs are “based on the Radeon Instinct family” and have “yet to be announced.” AMD’s current public roadmap goes out to “Next Gen” in 2020, and with GPU development cycles averaging 2 years, this may be the architecture we see. But with the particular needs for a supercomputer, AMD may have something slightly more bespoke.
What the company is confirming for now is that they aren’t holding back on features. The HPC-focused GPU is being designed with Frontier in mind and will incorporate mixed precision compute support. Feeding the beast will be HBM memory, and AMD will be tapping a version of Infinity Fabric to connect the CPUs and GPUs.
In fact while AMD has kept the details on the technology light, it sounds like this version of IF will be the most advanced version yet. AMD is specifically noting that it’s an “incredibly” coherent fabric, calling it the first fully optimized CPU + GPU design for supercomputing. AMD’s GPUs and CPUs will be arranged in a 4-to-1 ratio, with 4 GPUs for each EPYC CPU. It’s worth noting that AMD’s slide shows a mesh with every GPU connected to the CPU and two other GPUs, but I’m not reading too much into this quite yet, as AMD hasn’t disclosed any other details on the IF setup.
With AMD's hardware covering everything up to the blade level, tying together all of these nodes will be Cray's job. For Frontier the supercomputer vendor is launching their new Slingshot interconnect, an equally ambitious design that will support adaptive routing, congestion management, and quality-of-service features. Slingshot is capable of 200Gb/sec per port, with individual blades incorporating a port for each GPU in the blade so that other nodes can directly read and write data to a GPU's memory. As a result Frontier will have a significant amount of interconnect bandwidth, which is all but necessary in order to allow the system to scale to exaFLOP levels.
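To put rough numbers on that bandwidth, here is a back-of-envelope sketch combining the 200Gb/sec port figure with the 4-to-1 GPU-to-CPU ratio described above; the actual node counts and topology remain undisclosed:

```c
/* Rough per-node injection bandwidth from the disclosed figures.
 * Assumes one Slingshot port per GPU and 4 GPUs per node, per the
 * announcement; everything else about the topology is undisclosed. */
#include <stdio.h>

int main(void) {
    double port_gbps = 200.0;            /* 200Gb/sec per Slingshot port */
    int ports_per_node = 4;              /* one port per GPU             */

    double port_gBps = port_gbps / 8.0;  /* -> 25 GB/sec per port        */
    double node_gBps = port_gBps * ports_per_node;

    printf("Per port: %.0f GB/sec, per node: %.0f GB/sec injection\n",
           port_gBps, node_gBps);
    return 0;
}
```

Under those assumptions, each node would have on the order of 100 GB/sec of injection bandwidth into the fabric.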
Overall, Frontier will be organized into over 100 Cray Shasta cabinets. And while Cray has not announced a specific power consumption figure for Frontier, with each cabinet rated for 300KW, this would put the complete system at over 30MW. To put things in context, that is over twice the power consumption of the 13MW Summit. So while Frontier is a significantly faster system than the supercomputer it replaces, Cray, AMD, and the US DOE are all feeling the pinch of Dennard scaling slowing down, as power efficiency gains get harder to achieve. All told, based on a passing comment made in the press briefing, it sounds like Oak Ridge will be installing a total of 40MW of capacity for Frontier, which is a significant amount of power to say the least.
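As a quick sanity check on those figures, a sketch using only the publicly quoted numbers above (nothing here is an official specification):

```c
/* Back-of-envelope power and efficiency check, using only the
 * publicly quoted figures: 100 cabinets at 300KW each, 1.5 EFLOPS
 * peak, versus Summit's 200 PFLOPS at 13MW. */
#include <stdio.h>

int main(void) {
    double cabinets = 100.0;        /* "over 100" Shasta cabinets */
    double kw_per_cabinet = 300.0;  /* per-cabinet rating         */

    double frontier_mw = cabinets * kw_per_cabinet / 1000.0; /* ~30 MW */
    double frontier_eflops = 1.5;
    double summit_mw = 13.0;
    double summit_eflops = 0.2;                              /* 200 PF */

    /* GFLOPS per watt: (EFLOPS * 1e9 GFLOPS) / (MW * 1e6 W) */
    printf("Frontier: %.0f MW, %.1f GFLOPS/W\n",
           frontier_mw, frontier_eflops * 1e9 / (frontier_mw * 1e6));
    printf("Summit:   %.0f MW, %.1f GFLOPS/W\n",
           summit_mw, summit_eflops * 1e9 / (summit_mw * 1e6));
    return 0;
}
```

That works out to roughly 50 GFLOPS per watt for Frontier versus Summit's roughly 15, so even with efficiency gains slowing, the new machine still needs to be around three times more efficient than the one it replaces.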
Along with furthering the US's own supercomputing leadership goals, securing the Frontier contract also represents big wins for Cray and AMD. Cray is now involved in both 2021 exascale systems, reinforcing its place in the supercomputing world. Meanwhile AMD, which has spent the current generation on the outside looking in, has now secured a major and prestigious win for both its CPU and GPU divisions.
In fact it’s interesting to note that of the two 2021 exascale systems being ordered, both are coming from full-service processor vendors that supply both CPUs and GPUs. Current-generation systems like Summit use mixed vendors – e.g. IBM + NVIDIA – so the move to integrated vendors is a big shift for these CPU + accelerator systems. Clearly there are technological and procurement benefits to using a single vendor for all of the processors, which benefits both AMD and Intel. Though it’s worth noting that the CORAL-2 program requires the DOE to buy systems based on two different architectures, so if the future is integrated systems, then AMD and Intel are the logical choices.
At any rate, with the contract placed for Frontier, the job is only half-done. AMD and Cray will need to continue developing their hardware and software for the system, not to mention locking down the specific specifications for the finished supercomputer. So expect to continue to hear news about Frontier trickle out over the next couple of years, leading up to its installation in 2021.
Source: AMD
77 Comments
peevee - Tuesday, May 7, 2019 - link
AMD? If it is on their EPYC CPUs and not GPUs, it is not a real 1.5 exaFLOPS. Unfortunately, on that architecture (same with other modern CPUs), peak performance while doing FMAs in AVX registers and real performance on big data, which has to be retrieved from and put back to main memory, can differ by 3-4 orders of magnitude.

On GPUs, multiplication of dense matrices can be done efficiently, but not sparse matrices in their usual compressed representation (and in most HPC applications, it is all about giant matrix multiplications).
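The gap described here is the classic roofline bound: attainable throughput is the lesser of peak compute and arithmetic intensity times memory bandwidth. A minimal sketch, with illustrative assumed hardware numbers rather than Frontier's undisclosed specs:

```c
/* Roofline sketch: attainable FLOP/s is capped by min(peak compute,
 * arithmetic intensity x memory bandwidth). The hardware numbers
 * below are illustrative assumptions, not Frontier specifications. */
#include <stdio.h>

int main(void) {
    double peak_gflops = 2000.0;  /* assumed per-socket FMA peak     */
    double mem_gbps    = 200.0;   /* assumed main-memory bandwidth   */

    /* Arithmetic intensity in FLOPs per byte moved. A streaming
     * kernel like y[i] += a*x[i] does 2 FLOPs per 24 bytes (~0.083);
     * well-blocked dense matrix multiply can reach 10 and above. */
    double ai[] = { 0.083, 1.0, 10.0, 100.0 };

    for (int i = 0; i < 4; i++) {
        double mem_bound = ai[i] * mem_gbps;
        double attained  = mem_bound < peak_gflops ? mem_bound : peak_gflops;
        printf("AI %7.3f FLOP/byte -> %8.1f GFLOP/s (%.1f%% of peak)\n",
               ai[i], attained, 100.0 * attained / peak_gflops);
    }
    return 0;
}
```

With these assumed numbers, a streaming kernel that touches main memory for every operand lands well under 1% of the FMA peak, while a cache-blocked dense kernel can run at or near it.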
amrnuke - Tuesday, May 7, 2019 - link
Good thing you brought this up. Can someone get ORNL on the line and let them know they made a bad decision based on peevee's speculation about the AVX performance of a microarchitecture that is still under development with no details?

(I'm fairly certain that when tasked with spending hundreds of millions of dollars on a mission-critical supercomputer that they themselves will be using, they've done the math already.)
peevee - Wednesday, May 8, 2019 - link
amrnuke, if you thought less about your ad hominem attacks and more about the subject, you would understand that I am right about AVX peak vs real performance, regardless of microarchitecture. The very basic Von Neumann-derived architecture is the problem here.

At the scale of the vacuum-tube computers of Von Neumann's time, in a modern computer using Intel/AMD/ARM etc. the ALU is in New York and the memory is all over Europe. Is it easier to understand this way?
Irata - Wednesday, May 8, 2019 - link
It's both on their CPU and GPU - afaik it's based on 1S nodes with one CPU + 4 GPUs.

peevee - Wednesday, May 8, 2019 - link
So the main compute is GPU, as I expected. Of course GPUs cannot be deployed alone, as they are not general-purpose processors.

mode_13h - Wednesday, May 8, 2019 - link
> On GPUs, multiplication of dense matrices can be done efficiently, but not sparse matrices in their usual compressed representation

It would be interesting if they use the texture units to facilitate this.
I expect we'll also start seeing texture units handling on-the-fly decompression of neural network weights.
peevee - Wednesday, May 8, 2019 - link
The word "compressed" means completely different thing when applied to sparse matrices vs textures. They just skip all (or most in some schemes) 0s and add a level of indirection through indices/offsets of non-zero elements of blocks of elements. That indirection is bad news for any kind of vector processors, including SMs in GPUs.I guess for simulations of nuclear blasts it does not matter that much as there should not be that many zeros there... but it is just a guess, my HPC tasks are different.
mode_13h - Thursday, May 9, 2019 - link
You could put indirection and scatter/gather support down in the memory controllers. GDDR6 and HBM2 both have fairly narrow channel widths, presumably making them less dependent on large burst sizes for good efficiency.

HStewart - Tuesday, May 7, 2019 - link
To me this sounds like AMD is playing the me-too game. One thing that is confusing is the vendor; the Aurora platform is also Cray-based, and some of the specs on cabinets are not mentioned.

I would not doubt that the version of Aurora finally delivered will match or beat this one. In reality it probably already has.
Irata - Tuesday, May 7, 2019 - link
According to Cray's press release, Aurora will use "over 200 Shasta cabinets" vs. "over 100" for Frontier.