Amazon Unveils Graviton4: A 96-Core ARM CPU with 536.7 GBps Memory Bandwidth
by Anton Shilov on November 29, 2023 4:30 PM EST
Nowadays many cloud service providers design their own silicon, but Amazon Web Services (AWS) started doing so ahead of its rivals, and by now its Annapurna Labs subsidiary develops processors that compete well with those from AMD and Intel. This week AWS introduced its Graviton4 SoC, a 96-core ARM-based chip that promises to challenge renowned CPU designers and offer unprecedented performance to AWS clients.
"By focusing our chip designs on real workloads that matter to customers, we are able to deliver the most advanced cloud infrastructure to them," said David Brown, vice president of Compute and Networking at AWS. "Graviton4 marks the fourth generation we have delivered in just five years, and is the most powerful and energy efficient chip we have ever built for a broad range of workloads."
The AWS Graviton4 processor packs 96 cores that offer, on average, 30% higher compute performance than Graviton3, along with 40% higher performance in database applications and 45% higher performance in Java applications, according to Amazon. Given that Amazon did not reveal many details about Graviton4, it is hard to attribute these performance increases to any particular characteristic of the CPU.
Yet, NextPlatform believes that the processor uses Arm Neoverse V2 cores, which offer higher instructions per clock (IPC) than the Neoverse V1 cores used in previous-generation AWS processors. Furthermore, the new CPU is expected to be fabricated on one of TSMC's N4 process technologies (4nm-class), which offers higher clock-speed potential than TSMC's N5 nodes.
"AWS Graviton4 instances are the fastest EC2 instances we have ever tested, and they are delivering outstanding performance across our most competitive and latency sensitive workloads," said Roman Visintine, lead cloud engineer at Epic. "We look forward to using Graviton4 to improve player experience and expand what is possible within Fortnite.”
In addition, the new processor features a revamped memory subsystem with 536.7 GB/s of peak bandwidth, 75% higher than that of the previous-generation AWS CPU. Higher memory bandwidth improves CPU performance in memory-intensive applications such as databases.
Meanwhile, such a large jump in memory bandwidth suggests that the new processor employs a memory subsystem with more channels than Graviton3, though AWS has not formally confirmed this.
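As a quick sanity check, the quoted figure lines up with a 12-channel DDR5-5600 memory subsystem versus eight channels of DDR5-4800 on Graviton3; the channel counts and data rates in the sketch below are assumptions drawn from that speculation, not AWS-confirmed specifications.

/* Back-of-the-envelope DDR5 bandwidth check.
 * Channel counts and data rates are assumptions (12 x DDR5-5600 for Graviton4,
 * 8 x DDR5-4800 for Graviton3), not AWS-confirmed specifications. */
#include <stdio.h>

static double ddr_bandwidth_gbps(int channels, double megatransfers_per_s) {
    /* Each DDR5 channel is 64 bits (8 bytes) wide: GB/s = channels * MT/s * 8 / 1000 */
    return channels * megatransfers_per_s * 8.0 / 1000.0;
}

int main(void) {
    double g4 = ddr_bandwidth_gbps(12, 5600.0); /* ~537.6 GB/s */
    double g3 = ddr_bandwidth_gbps(8, 4800.0);  /* ~307.2 GB/s */
    printf("Graviton4 (assumed): %.1f GB/s\n", g4);
    printf("Graviton3 (assumed): %.1f GB/s\n", g3);
    printf("Uplift: %.0f%%\n", (g4 / g3 - 1.0) * 100.0); /* ~75% */
    return 0;
}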
Graviton4 will power memory-optimized Amazon EC2 R8g instances, which are particularly useful for boosting performance of high-end databases and analytics workloads. Furthermore, these R8g instances provide up to three times more vCPUs and memory than Graviton3-based R7g instances, enabling higher throughput for data processing, better scalability, faster results, and reduced costs. To enhance the security of AWS EC2 instances, Amazon fully encrypts all of Graviton4's high-speed physical hardware interfaces.
Graviton4-based R8g instances are currently in preview and will become widely available in the coming months.
Sources: AWS, NextPlatform
Comments
TheinsanegamerN - Friday, December 1, 2023 - link
I am not aware of any mass production DDR5 PCs with 9600 MHz memory standard. LPDDR, OTOH......
mode_13h - Friday, December 1, 2023 - link
That's not simply LPDDR5, but rather LPDDR5T. DDR5 has its own variants, like MCR and some similar technique that AMD is pursuing.
name99 - Friday, December 1, 2023 - link
Most of the latency to DRAM is in the traversal through caches and on the NoC. The part that's really specific to DRAM latency is surprisingly similar across different DRAM technologies. Apple and AMD (and Intel) make different tradeoffs about the latency of this cache traversal and NoC.
Apple optimize for low power, and get away with it because their caches are much more sophisticated (hit more often) and they do a better job of hiding DRAM latency within the core.
erinadreno - Saturday, December 2, 2023 - link
I think most of the latency is actually the column address time. LPDDR chips run at lower voltage for higher bus speed, but sacrifice the row buffer charge time. Typical trade-off between PPA.
mode_13h - Saturday, December 2, 2023 - link
> Most of the latency to DRAM is in the traversal through caches and on the NoC.

Are you sure about that? Here, we see the latency of GDDR6 and GDDR6X running in the range of 227 to 269 ns, which is more than a factor of 2 greater than DRAM latency usually runs, even for server CPUs.
https://chipsandcheese.com/2022/11/02/microbenchma...
On an otherwise idle GPU, I really can't imagine why it would take so long to traverse its cache hierarchy and on-die interconnect. Not only that, but the RTX 4000 GPUs have just 2 levels of cache, in contrast to modern CPUs' 3-level cache hierarchy.
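(For context, figures like those are typically measured with a dependent pointer chase, where each load's address comes from the previous load so latencies cannot overlap. A minimal sketch in C, with the 64 MiB buffer size and iteration count picked arbitrarily:)

/* Minimal dependent pointer-chase sketch: each load's address comes from the
 * previous load, so average time per iteration approximates load-to-load latency.
 * Buffer size and iteration count are arbitrary; use a buffer larger than the
 * last-level cache to measure DRAM rather than cache hits. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N     (64u * 1024u * 1024u / sizeof(size_t)) /* 64 MiB of indices */
#define ITERS 20000000L

int main(void) {
    size_t *buf = malloc(N * sizeof(size_t));
    if (!buf) return 1;

    /* Sattolo's algorithm: builds a single random cycle so the chase visits the
     * whole buffer and the hardware prefetcher can't predict the next address. */
    for (size_t i = 0; i < N; i++) buf[i] = i;
    srand(1);
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % i;
        size_t t = buf[i]; buf[i] = buf[j]; buf[j] = t;
    }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    size_t idx = 0;
    for (long i = 0; i < ITERS; i++)
        idx = buf[idx];                     /* dependent load chain */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (double)(t1.tv_nsec - t0.tv_nsec);
    printf("avg load-to-load latency: %.1f ns (idx=%zu)\n", ns / ITERS, idx);
    free(buf);
    return 0;
}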
name99 - Monday, December 4, 2023 - link
This is covered in https://user.eng.umd.edu/~blj/papers/memsys2018-dr...

If you look at the paper (and understand it...) you will see that across a wide range of DRAM technologies the (kinda) best-case scenario from when a request "leaves the last queue" until when the data is available is about 30 ns. The problem is dramatic variation in when "leaving that last queue" occurs. WITHIN DRAM technologies, schemes that provide multiple simultaneously serviced queues can dramatically reduce the queueing delay; across SoC designs, certain schemes can dramatically reduce (or increase) the delay from "execution unit" to "memory controller".
In the case of a GPU, for example, a large part of what the GPU wants to do is aggregate memory requests so that if successive lanes of a warp, or successive warps, reference successive addresses, this can be converted into a single long request to memory.
There's no incentive to move requests as fast as possible from execution to DRAM; on the contrary they are kept around for as long as feasible in the hopes that aggregation increases.
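(A toy sketch of that aggregation idea: count how many aligned segments a warp's 32 lane addresses fall into. The 128-byte segment size and the access patterns are made up for illustration; real coalescing rules are more involved.)

/* Toy illustration of warp-level coalescing: count how many aligned 128-byte
 * segments a warp's 32 lane addresses touch. Fully sequential 4-byte accesses
 * collapse into one transaction; scattered accesses need one per lane. */
#include <stdio.h>
#include <stdint.h>
#include <stdbool.h>

#define WARP_SIZE 32
#define SEGMENT   128u   /* bytes per memory transaction (assumed) */

static int transactions(const uint64_t addr[WARP_SIZE]) {
    uint64_t seen[WARP_SIZE];
    int count = 0;
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        uint64_t seg = addr[lane] / SEGMENT;   /* which aligned segment this lane hits */
        bool dup = false;
        for (int k = 0; k < count; k++)
            if (seen[k] == seg) { dup = true; break; }
        if (!dup) seen[count++] = seg;
    }
    return count;
}

int main(void) {
    uint64_t coalesced[WARP_SIZE], scattered[WARP_SIZE];
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        coalesced[lane] = 0x1000 + 4u * lane;      /* consecutive 4-byte loads */
        scattered[lane] = 0x1000 + 4096u * lane;   /* one load per 4 KiB stride */
    }
    printf("coalesced pattern: %d transaction(s)\n", transactions(coalesced));  /* 1 */
    printf("scattered pattern: %d transaction(s)\n", transactions(scattered));  /* 32 */
    return 0;
}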
mode_13h - Monday, December 4, 2023 - link
Thanks for the paper. I noticed it just covers GDDR5. I tried looking at the GDDR6 spec, but didn't see much to warrant a big change other than the bifurcation of the interface down to 16-bit. Even that doesn't seem like enough to add more than a nanosecond or so per 64B burst.

> There's no incentive to move requests as fast as possible from execution to DRAM;
> on the contrary they are kept around for as long as feasible in the hopes that
> aggregation increases
That makes sense for writes, but not reads (which is what Chips&Cheese measured). For reads, you'd just grab a cacheline, as that's already bigger than a typical compressed texture block. Furthermore, in rendering, reads tend to be scattered, while writes tend to be fairly coherent. So, write-combining makes sense, but read-combining doesn't.
Also, write latency can (usually) be hidden from software by deep queues, but read latency can't. Even though GPUs' SMT can hide read latency, that depends on having lots of concurrency and games don't always have sufficient shader occupancy to accomplish this. Plus, the more read latency you need to hide, the more warps/wavefronts you need, which translates into needing more registers. So, SMT isn't free -- you don't want more of it than necessary.
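(A rough back-of-the-envelope version of that trade-off, with every number invented for illustration rather than taken from any real GPU:)

/* Rough Little's-law sketch: warps needed to hide memory latency, and the
 * register cost of keeping that many warps resident. All values are made-up
 * illustrative figures, not any specific GPU's specs. */
#include <stdio.h>

int main(void) {
    double latency_cycles       = 400.0;  /* assumed average read latency */
    double cycles_between_loads = 20.0;   /* assumed compute per warp between loads */
    int    regs_per_thread      = 64;     /* assumed register allocation */
    int    threads_per_warp     = 32;
    int    regfile_per_sm       = 65536;  /* assumed 64K 32-bit registers per SM */

    /* A warp issues a load, then has ~20 cycles of work; keeping the memory
     * pipeline busy across 400 cycles of latency takes ~latency/work warps. */
    double warps_needed  = latency_cycles / cycles_between_loads;            /* ~20 */
    int    regs_needed   = (int)(warps_needed * threads_per_warp * regs_per_thread);
    int    warps_that_fit = regfile_per_sm / (threads_per_warp * regs_per_thread);

    printf("warps needed to hide latency : %.0f\n", warps_needed);
    printf("registers those warps need   : %d of %d\n", regs_needed, regfile_per_sm);
    printf("warps the register file fits : %d\n", warps_that_fit);
    return 0;
}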
dotjaz - Thursday, November 30, 2023 - link
And it only has 53% the bandwidth of a 2019 mid-high end GPU (RADEON VII).
mode_13h - Thursday, November 30, 2023 - link
Again, comparing on-package memory (i.e. HBM2) vs. DDR5 RDIMMs. Intel's Xeon Max gets the same 1 TB/s as Radeon VII, but it's limited to just 64 GB of HBM.
mode_13h - Thursday, November 30, 2023 - link
Try putting a couple TB of RAM in that Mac Studio. Oops, that's right! You can't add any RAM at all! You're stuck with just the 128 GB that Apple could fit on package.
The point of servers is they're *scalable*. Same goes for PCIe.