Configurations - Up to 4 GPUs at 6TFLOPs

Starting off with the smallest GPU building blocks, it’s good to remind ourselves how an Imagination GPU looks like – the following is from last year’s A-Series presentation:

PowerVR GPU Comparison
  AXT-16-512
BXT-16-512
GT9524 GT8525 GT7200 Plus
Core Configuration
 
1 SPU (Shader Processing Unit) - "GPU Core"

2 USCs (Unified Shading Clusters) - ALU Clusters
FP32 FLOPS/Clock

MADD = 2 FLOPs
MUL = 1 FLOP
512

(2x (128x MADD))
240

(2x (40x MADD+MUL))
192

(2x (32x MADD+MUL))
128

(2x (16x MADD+MADD))
FP16 Ratio 2:1 (Vec2)
Pixels / Clock 8 4
Texels / Clock 16 8 4
Architecture A-Series
B-Series
Series-9XTP
(Furian)
Series-8XT
(Furian)
Series-7XT
(Rogue)

Fundamentally and at a high-level, the new B-Series GPU microarchitecture looks very similar to the A-Series. Microarchitecturally, Imagination noted that we should generally expect a 15% increase in performance or increase in efficiency compared to the A-Series, with the building blocks of the two GPU families being generally the same save for some more important additions such as the new IMGIC (Imagination Image Compression) implementation which we’ll cover in a bit.

An XT GPU still consists of the new SPU design which houses the new more powerful TPU (Texture Processing Unit) as well as the new 128-wide ALU designs that is scaled into ALU clusters called USCs (Unified Shading Clusters).

Imagination’s current highest-end hardware implementation in the BXT series is the BXT 32-1024, and putting four of these together creates an MC4 GPU. In a high-performance implementation reaching up to 1.5GHz clock speeds, this configuration would offer up to 6TFLOPs of FP32 computing power. Whilst this isn’t quite enough to catch up to Nvidia and AMD, it’s a major leap for a third-party GPU IP provider that’s been mostly active in the mobile space for the last 15 years.

The company’s BXM series continues to see a differentiation in the architecture as some of its implementations do not use the ultra-wide ALU design of the XT series. For example, while the BXM-8-256 uses one 128-wide USC, the more area efficient BXM 4-64 for example continues to use the 32-wide ALU from the 8XT series. Putting four BXM-4-64 GPUs together gets you to a higher performance tier with a better area and power efficiency compared to a larger single GPU implementation.

The most interesting aspect of the multi-GPU approach is found in the BXE series, which is Imagination’s smallest GPU IP that purely focuses on getting to the best possible area efficiency. Whilst the BXT and BXM series GPUs until now are delivered as “primary” cores, the BXE is being offered in the form of both a primary as well as a secondary GPU implementation. The differences here is that the secondary variant of the IP lacks a firmware processor as well as a geometry processing, instead fully relying on the primary GPU’s geometry throughput. Imagination says that this configuration would be able to offer quite high compute and fillrate capabilities in extremely minuscule area usage.

PowerVR Hardware Designs GPU Comparison
Family Texels/
Clock
FP32/
Clock
Cores USCs Wavefront
Width
MC Design
BXT-32-1024 MC1 32 1024 1 4 128 P
BXT-16-512 MC1 16 512 1 2 128 P
BXM-8-256 MC1 8 256 1 1 128 P
BXM-4-64 MC1 4 64 1 1 32 P
BXE-4-32 Secondary 4 32 1 1 16 S
BXE-4-32 MC1 4 32 1 1 16 P
BXE-2-32 MC1 2 32 1 1 16 P
BXE-1-16 MC1 1 16 1 1 8 P

Putting the different designs into a table, we’re seeing only 8 different hardware designs that Imagination has to create the RTL and do physical design and timing closure on. This is already quite a nice line-up in terms of scaling from the lowest-end area focused IP to something that would be used in a premium high-end mobile SoC.

PowerVR MC GPU Configurations
Family Texels/
Clock
FP32/
Clock
Cores USCs Wavefront
Width
MC Design
BXT-32-1024 MC4 128 4096 4 16 128 PPPP
BXT-32-1024 MC3 96 3072 3 12 128 PPP
BXT-32-1024 MC2 64 2048 2 8 128 PP
BXT-32-1024 MC1 32 1024 1 4 128 P
BXT-16-512 MC1 16 512 1 2 128 P
BXM-8-256 MC1 8 256 1 1 128 P
BXM-4-64 MC4 16 256 4 4 32 PPPP
BXM-4-64 MC3 12 192 3 3 32 PPP
BXM-4-64 MC2 8 128 2 2 32 PP
BXM-4-64 MC1 4 64 1 1 32 P
BXE-4-32 MC4 16 128 4 4 16 PSSS
BXE-4-32 MC3 12 96 3 3 16 PSS
BXE-4-32 MC2 8 64 2 2 16 PS
BXE-4-32 MC1 4 32 1 1 16 P
BXE-2-32 MC1 2 32 1 1 16 P
BXE-1-16 MC1 1 16 1 1 8 P

The big flexibility gain for Imagination and their customers is that they can simple take one of the aforementioned hardware designs, and scale these up seamlessly by laying out multiple GPUs. On the low-end, this creates some very interesting overlaps in terms of compute abilities, but offer different fillrate capabilities at different area efficiency options.

At the high-end, the biggest advantage is that Imagination can quadruple their processing power from their biggest GPU configuration. Imagination notes that for the BXT series, they no longer created a single design larger than the BXT-32-1024 because the return on investment would simply be smaller, and involve more complex timing work than if a customer would simply scale performance up via a multi-core implementation.

Introduction - Scaling up to Multi-GPU Introducing IMGIC - A better frame-buffer compression
POST A COMMENT

74 Comments

View All Comments

  • tkSteveFOX - Wednesday, October 14, 2020 - link

    I don't think those sub $100 GPUs can beat the integrated graphics. They are there for office stations to support 3-4 monitors, nothing more. Reply
  • Mat3 - Wednesday, October 14, 2020 - link

    The TBDR architecture has always allowed for efficient multi-core solutions, since they bin all the triangles before rasterization and work then tile by tile. Each core can work on different tiles. IMG has been touting this benefit for literally decades already, I'm not sure what is so different here. The Naomi 2 arcade board from over 20 years ago is a simple implementation of this.

    The other concern for scaling it up to high-end desktop levels is always the same; the number of triangles that would need to be binned for a modern a desktop game is much, much higher than a mobile game.
    Reply
  • myownfriend - Wednesday, October 14, 2020 - link

    I never understood why the polygon counts are considered an issue for TBRs.

    I know GPUs like the RTX 3080 have peak primitives counts of about 10.2 billion with boost clocks but I'm pretty sure actual game triangle counts never really get that high. If you divide that by 60 fps then we're looking at a peak of about 170 million per frame with a current target resolution of Ultra HD which is 8,294,400 pixels. Even accounting for back-facing and overlapping geometry, do any games really have 20 times more primitives than rendered pixels?
    Reply
  • supdawgwtfd - Thursday, October 15, 2020 - link

    "into different work tiles that can then the other “slave” GPUs can pull from in order to work on"

    My brain exploded from trying to read this...

    Seriously. Get a damn editor or even a basic proof reader!
    Reply

Log in

Don't have an account? Sign up now