Imagination Announces B-Series GPU IP: Scaling up with Multi-GPU
by Andrei Frumusanu on October 13, 2020 4:00 AM EST
Configurations - Up to 4 GPUs at 6 TFLOPs
Starting off with the smallest GPU building blocks, it’s worth reminding ourselves what an Imagination GPU looks like – the following is from last year’s A-Series presentation:
PowerVR GPU Comparison

| | AXT-16-512 / BXT-16-512 | GT9524 | GT8525 | GT7200 Plus |
|---|---|---|---|---|
| Core Configuration | 1 SPU (Shader Processing Unit) - "GPU Core", 2 USCs (Unified Shading Clusters) - ALU Clusters | 1 SPU, 2 USCs | 1 SPU, 2 USCs | 1 SPU, 2 USCs |
| FP32 FLOPS/Clock (MADD = 2 FLOPs, MUL = 1 FLOP) | 512 (2x (128x MADD)) | 240 (2x (40x MADD+MUL)) | 192 (2x (32x MADD+MUL)) | 128 (2x (16x MADD+MADD)) |
| FP16 Ratio | 2:1 (Vec2) | 2:1 (Vec2) | 2:1 (Vec2) | 2:1 (Vec2) |
| Pixels / Clock | 8 | 4 | 4 | 4 |
| Texels / Clock | 16 | 8 | 8 | 4 |
| Architecture | A-Series / B-Series | Series-9XTP (Furian) | Series-8XT (Furian) | Series-7XT (Rogue) |
Fundamentally, and at a high level, the new B-Series GPU microarchitecture looks very similar to the A-Series. Microarchitecturally, Imagination noted that we should generally expect a 15% increase in performance or efficiency compared to the A-Series, with the building blocks of the two GPU families being generally the same, save for a few more important additions such as the new IMGIC (Imagination Image Compression) implementation, which we’ll cover in a bit.
An XT GPU still consists of the new SPU design, which houses the more powerful TPU (Texture Processing Unit) as well as the new 128-wide ALU design that is scaled out into ALU clusters called USCs (Unified Shading Clusters).
Imagination’s current highest-end hardware implementation in the BXT series is the BXT-32-1024, and putting four of these together creates an MC4 GPU. In a high-performance implementation reaching up to 1.5 GHz clock speeds, this configuration would offer up to 6 TFLOPs of FP32 computing power. Whilst this isn’t quite enough to catch up to Nvidia and AMD, it’s a major leap for a third-party GPU IP provider that’s been mostly active in the mobile space for the last 15 years.
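For reference, the headline number falls straight out of the per-core figures at the quoted clock:

$$
4~\text{cores} \times 1024~\tfrac{\text{FP32 FLOPS}}{\text{clock}} \times 1.5~\text{GHz} \approx 6.1~\text{TFLOPS}
$$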
The company’s BXM series continues to see differentiation in the architecture, as some of its implementations do not use the ultra-wide ALU design of the XT series. For example, while the BXM-8-256 uses one 128-wide USC, the more area-efficient BXM-4-64 continues to use the 32-wide ALU from the 8XT series. Putting four BXM-4-64 GPUs together gets you to a higher performance tier with better area and power efficiency compared to a larger single-GPU implementation.
The most interesting aspect of the multi-GPU approach is found in the BXE series, which is Imagination’s smallest GPU IP and purely focuses on the best possible area efficiency. Whilst the BXT and BXM series GPUs are delivered as “primary” cores, the BXE is being offered in the form of both a primary as well as a secondary GPU implementation. The difference here is that the secondary variant of the IP lacks a firmware processor as well as geometry processing, instead fully relying on the primary GPU’s geometry throughput. Imagination says that this configuration would be able to offer quite high compute and fillrate capabilities in an extremely minuscule area.
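As a rough illustration of how such a primary/secondary split works out, here’s a minimal sketch; the class and field names are my own and not Imagination’s configuration interface. Secondary cores contribute compute and fillrate but no geometry front-end, so a four-core BXE layout ends up as the “PSSS” design listed in the MC table further below.

```python
from dataclasses import dataclass

# Hypothetical model of a BXE multi-core layout -- illustrative only,
# not Imagination's actual configuration API.
@dataclass
class BxeCore:
    primary: bool              # primary cores carry the firmware processor
    texels_per_clock: int = 4  # BXE-4-32 per-core figures from the tables below
    fp32_per_clock: int = 32

    @property
    def has_geometry(self) -> bool:
        # Only the primary core has geometry processing; secondaries rely on it.
        return self.primary

def describe(cores):
    layout = "".join("P" if c.primary else "S" for c in cores)
    texels = sum(c.texels_per_clock for c in cores)
    flops = sum(c.fp32_per_clock for c in cores)
    return f"{layout}: {texels} texels/clock, {flops} FP32 FLOPS/clock"

# A BXE-4-32 MC4 arrangement: one primary plus three secondaries ("PSSS").
mc4 = [BxeCore(primary=True)] + [BxeCore(primary=False)] * 3
print(describe(mc4))  # PSSS: 16 texels/clock, 128 FP32 FLOPS/clock
```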
PowerVR Hardware Designs GPU Comparison

| Family | Texels/Clock | FP32 FLOPS/Clock | Cores | USCs | Wavefront Width | MC Design |
|---|---|---|---|---|---|---|
| BXT-32-1024 MC1 | 32 | 1024 | 1 | 4 | 128 | P |
| BXT-16-512 MC1 | 16 | 512 | 1 | 2 | 128 | P |
| BXM-8-256 MC1 | 8 | 256 | 1 | 1 | 128 | P |
| BXM-4-64 MC1 | 4 | 64 | 1 | 1 | 32 | P |
| BXE-4-32 Secondary | 4 | 32 | 1 | 1 | 16 | S |
| BXE-4-32 MC1 | 4 | 32 | 1 | 1 | 16 | P |
| BXE-2-32 MC1 | 2 | 32 | 1 | 1 | 16 | P |
| BXE-1-16 MC1 | 1 | 16 | 1 | 1 | 8 | P |
Putting the different designs into a table, we’re looking at only 8 different hardware designs that Imagination has to create the RTL for and run through physical design and timing closure. This is already quite a nice line-up in terms of scaling, from the lowest-end, area-focused IP up to something that would be used in a premium high-end mobile SoC.
PowerVR MC GPU Configurations

| Family | Texels/Clock | FP32 FLOPS/Clock | Cores | USCs | Wavefront Width | MC Design |
|---|---|---|---|---|---|---|
| BXT-32-1024 MC4 | 128 | 4096 | 4 | 16 | 128 | PPPP |
| BXT-32-1024 MC3 | 96 | 3072 | 3 | 12 | 128 | PPP |
| BXT-32-1024 MC2 | 64 | 2048 | 2 | 8 | 128 | PP |
| BXT-32-1024 MC1 | 32 | 1024 | 1 | 4 | 128 | P |
| BXT-16-512 MC1 | 16 | 512 | 1 | 2 | 128 | P |
| BXM-8-256 MC1 | 8 | 256 | 1 | 1 | 128 | P |
| BXM-4-64 MC4 | 16 | 256 | 4 | 4 | 32 | PPPP |
| BXM-4-64 MC3 | 12 | 192 | 3 | 3 | 32 | PPP |
| BXM-4-64 MC2 | 8 | 128 | 2 | 2 | 32 | PP |
| BXM-4-64 MC1 | 4 | 64 | 1 | 1 | 32 | P |
| BXE-4-32 MC4 | 16 | 128 | 4 | 4 | 16 | PSSS |
| BXE-4-32 MC3 | 12 | 96 | 3 | 3 | 16 | PSS |
| BXE-4-32 MC2 | 8 | 64 | 2 | 2 | 16 | PS |
| BXE-4-32 MC1 | 4 | 32 | 1 | 1 | 16 | P |
| BXE-2-32 MC1 | 2 | 32 | 1 | 1 | 16 | P |
| BXE-1-16 MC1 | 1 | 16 | 1 | 1 | 8 | P |
The big flexibility gain for Imagination and their customers is that they can simply take one of the aforementioned hardware designs and scale it up seamlessly by laying out multiple GPUs. On the low end, this creates some very interesting overlaps in terms of compute capabilities, with the options offering different fillrate capabilities at different levels of area efficiency.
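To make the scaling and the low-end overlap concrete, here’s a minimal sketch that derives the multi-core figures in the table above from the single-core designs; the dictionary and function names are mine, not anything from Imagination.

```python
# Per-clock throughput of the single-core ("MC1") designs, from the tables above.
# Values are (texels/clock, FP32 FLOPS/clock).
MC1_DESIGNS = {
    "BXT-32-1024": (32, 1024),
    "BXM-8-256":   (8,  256),
    "BXM-4-64":    (4,  64),
    "BXE-4-32":    (4,  32),
}

def mc_config(design: str, cores: int) -> tuple:
    """Multi-core configs simply multiply the per-core figures by the core count."""
    texels, flops = MC1_DESIGNS[design]
    return texels * cores, flops * cores

# The low-end overlap mentioned above: a BXM-4-64 MC4 matches a BXM-8-256 MC1
# on compute (256 FP32 FLOPS/clock) but offers twice the texel fillrate.
print(mc_config("BXM-4-64", 4))     # (16, 256)
print(mc_config("BXM-8-256", 1))    # (8, 256)

# And the headline BXT configuration: 128 texels and 4096 FP32 FLOPS per clock.
print(mc_config("BXT-32-1024", 4))  # (128, 4096)
```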
At the high end, the biggest advantage is that Imagination can quadruple the processing power of their biggest GPU configuration. Imagination notes that for the BXT series they no longer created a single-core design larger than the BXT-32-1024, as the return on investment would simply be smaller and would involve more complex timing closure work than if a customer were to simply scale performance up via a multi-core implementation.
74 Comments
Yojimbo - Tuesday, October 13, 2020 - link
I didn't know Xi JinPing was an engineer...
EthiaW - Wednesday, October 14, 2020 - link
The Chinese have only managed to oust the former corporate leaders recently. The shift away from engineer culture will take time, if not reverted by the UK government.
Yojimbo - Wednesday, October 14, 2020 - link
They had the stubbornness to not be bought by Apple in order to be bought by the Chinese government. And through what will or method is the UK government going to change the culture of the company?
Yojimbo - Tuesday, October 13, 2020 - link
Hey, you're right. He studied chemical engineering. I knew that, but forgot.
melgross - Tuesday, October 13, 2020 - link
With Apple being 60% of their sales, and 80% of their profits, they demanded $1 billion from Apple, which refused that ridiculous price. The company is likely worth no more than $100 million, if that, considering their sales are now just about $20 million a year.
colinisation - Tuesday, October 13, 2020 - link
Well, if not Apple, why not ARM? I know ARM tried to buy them at some point in the past. But once Apple left, their valuation would have taken a pretty substantial hit, and while ARM's GPU IP is successful, I don't think it is the most area/power efficient, so it looked to me like something they would have explored. Both would have been in the same country, and maybe it would have spurred ARM into providing a more viable alternative to Qualcomm in the smartphone GPU space.
CiccioB - Tuesday, October 13, 2020 - link
"Whereas current monolithic GPU designs have trouble being broken up into chiplets in the same way CPUs can be, Imagination’s decentralised multi-GPU approach would have no issues in being implemented across multiple chiplets, and still appear as a single GPU to software."There's not problem in splitting today desktop monolithic GPUs into chiplets.
What is done here is to create small chiplets that have all the needed pieces as a monolithic one. The main one is the memory controller.
Splitting a GPU over chiplets all having their own MC is technically simple but makes a mess when trying to use them due to the NUMA configuration. Being connected with a slow bus makes data sharing between chiplets almost impossible and so needs the programmer/driver to split the needed data over the single chiplet memory space and not make algorithms that share data between them.
The real problem with MCM configuration is data sharing = bandwidth.
You have to allow for data to flow from one core to another independently of its physical location on which chiplet it is. That's the only way you can obtain really efficient MCM GPUs.
And that requires high power+wide buses and complex data management with most probably very big caches (= silicon and power again) to mask memory latency and natural bandwidth restriction as it is impossible to have buses as fast as actual ones that connect 1TB/s to a GPU for each chiplet.
As you can see to make their GPUs work in parallel in HCP market Nvidia made a very fast point-to-point connection and created very fast switches to connect them together.
hehatemeXX - Tuesday, October 13, 2020 - link
That's why Infinity Cache is big. The bandwidth limitation is removed.Yojimbo - Tuesday, October 13, 2020 - link
Anything Infinity is big, except compared to a bigger Infinity.CiccioB - Tuesday, October 13, 2020 - link
It is removed only up to the size of the cache. If you need more data than that, you'll still run into the bandwidth limitation.
And now with the big cache's latency added on top.
If it were so easy to reduce the bandwidth limitation, anyone would just add a big enough cache... the fact is that there's no cache big enough for the immense quantity of data GPUs work with, unless you want all your VRAM as a cache (but then you wouldn't be connected with such a limited bus).