Bifrost - The GPU architecture for next five billion

Hessed Choi, Senior FAE, ARM
ARM Tech Forum, June 28th, 2016

Vulkan

What is Vulkan?
- A 3D graphics API for the next twenty years
  - Logical successor to OpenGL and OpenGL ES
  - Modern, efficient design
  - Open, industry-controlled standard
- Here, now
  - Released in February, with unprecedented support
  - Available today for desktop Windows and Linux
  - Officially supported in Android N
  - Shipping today in Samsung Galaxy S7
  - Engaged, active developer community

Why ARM loves Vulkan
- A great fit for mobile graphics architectures!
  - No wasted effort trying to look like a desktop GPU
  - Designed to enable mobile-specific optimizations
- Radical commitment to efficiency
  - CPU load is greatly reduced, even on a single core
- Makes your multi-core CPU more useful!
  - Driver work can be distributed across many threads
  - This helps performance and power
- Makes your multi-core GPU more useful too
  - Easier for applications to keep a powerful GPU busy

Bifrost

Bifrost: The new GPU architecture
- The increasing pixel impact of modern mobile gaming continues to drive innovation
- 2010: Utgard
- 2013: Midgard
- 2016: Bifrost

ARM Mali processor generations
- BIFROST: Mali-G71 GPU
  - Unified shader cores, scalar ISA, clause execution, full coherency, Vulkan, OpenCL
- MIDGARD: Mali-T600 GPU series, Mali-T700 GPU series, Mali-T800 GPU series
  - Unified shader cores, SIMD ISA, OpenGL ES 3.x, OpenCL, Vulkan
- UTGARD: Mali-200 GPU, Mali-300 GPU, Mali-400 GPU, Mali-450 GPU, Mali-470 GPU
  - Separate shader cores, SIMD ISA, OpenGL ES 2.x

Mali-G71 efficiency drives performance
- 20% higher energy efficiency*
- 40% better performance density*
- 20% bandwidth improvement*
- 32 cores
- Optimized for next generation, advanced, real-world content
*Compared to Mali-T880, on same process node under the same conditions.

Bifrost features
- A more efficient architecture: more performance overall, per mm² and per line of real-world shader code
- Major shader core redesign
  - New scalar, clause-based ISA
  - New quad-based arithmetic units
- New core fabric
- New geometry data flow
  - Reduces memory bandwidth and footprint
- 1.5x performance improvement

Architectural innovations

Bifrost architectural innovations
- Energy efficiency
  - Claused shaders
  - Index Driven Vertex Shading
  - Wire light pipelines
- Developer friendly
  - Designed for Vulkan and VR/AR
- Heterogeneous computing
  - Full system coherency
[Diagram labels: Midgard, Bifrost; CPU, CPU, GPU, Coherent Interconnect, DRAM]

Bifrost GPU design
[Block diagram]
- Driver Software
- Job Manager
- Core 0, Core 1, Core 2, ... Core 31
- Control Fabric
- Tiler
- MMU
- L2 Cache Segment (three shown)
- AXI Memory Bus (three shown)

Scalable system design
- Up to 32 shader cores supported
[Diagram: Bifrost GPU block diagram, as on the "Bifrost GPU design" slide]

Execution core improvements
[Diagram: Bifrost GPU block diagram, as on the "Bifrost GPU design" slide]

Bifrost core design

Bifrost core design
[Block diagram of one shader core]
- Compute Frontend, Fragment Frontend
- Quad Creator (two shown)
- Quad Control, Quad Manager
- Execution Engine 0, Execution Engine 1, Execution Engine 2, each with Quad State
- Control Fabric
- Load/store Unit, Attribute Unit, Varying Unit, Texture Unit
- Blender & Tile Access, Depth & Stencil
- ZS Memory, Tile Memory, Tile Writeback
- To L2 Mem Sys (three connections shown)

Quad vectorization
- Bifrost uses quad-parallel execution
  - Four scalar threads executed in lockstep in a quad
  - One quad at a time executes in each pipeline stage
  - Each thread fills one 32-bit lane of the hardware
  - 4 threads doing a vec3 FP32 add takes 3 cycles
  - Improves utilization
- Quad vectorization is compiler friendly
  - Each thread only sees a stream of scalar operations
  - Vector operations can always be split into scalars

           Lane 0   Lane 1   Lane 2   Lane 3
  Cycle 1  T0.x     T1.x     T2.x     T3.x
  Cycle 2  T0.y     T1.y     T2.y     T3.y
  Cycle 3  T0.z     T1.z     T2.z     T3.z
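The lane schedule on this slide can be reproduced with a small software model. This is only an illustrative sketch, not a description of the hardware: the function name `quad_vec3_add` and the list-based "lanes" are assumptions made for the example. It splits a vec3 add into three scalar steps and runs each step across the four threads of a quad.

```python
from typing import List, Tuple

Vec3 = Tuple[float, float, float]
QUAD_WIDTH = 4  # four threads (lanes) per quad, as stated on the slide


def quad_vec3_add(a: List[Vec3], b: List[Vec3]) -> Tuple[List[Vec3], List[List[str]]]:
    """Add one vec3 per thread, one scalar component per cycle across all lanes.

    Returns the per-thread results and the cycle-by-cycle lane schedule.
    Hypothetical helper for illustration only.
    """
    if len(a) != QUAD_WIDTH or len(b) != QUAD_WIDTH:
        raise ValueError("a quad holds exactly four threads")
    results = [[0.0, 0.0, 0.0] for _ in range(QUAD_WIDTH)]
    schedule: List[List[str]] = []
    for comp, name in enumerate("xyz"):      # one cycle per scalar component
        cycle = []
        for lane in range(QUAD_WIDTH):       # the four lanes run in lockstep
            results[lane][comp] = a[lane][comp] + b[lane][comp]
            cycle.append(f"T{lane}.{name}")
        schedule.append(cycle)
    return [(r[0], r[1], r[2]) for r in results], schedule


if __name__ == "__main__":
    a = [(1.0, 2.0, 3.0)] * QUAD_WIDTH
    b = [(0.5, 0.5, 0.5)] * QUAD_WIDTH
    _, schedule = quad_vec3_add(a, b)
    for n, cycle in enumerate(schedule, start=1):
        print(f"Cycle {n}:", " ".join(cycle))
```

Because every cycle issues the same scalar operation for all four threads, all lanes stay busy regardless of the vector width used in the shader source, which is the utilization point the slide makes.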

Clause execution
- Back-to-back execution guaranteed within a clause
- Allows aggressive optimisation
[Diagram legend: Overhead, Instruction]

Clause execution
- Back-to-back register access is common
- The result from one instruction is often only used as input to the next
[Diagram: register file R0-R7 accessed by each instruction]
  ADD R2, R0, R1
  ADD R4, R2, R3
  ADD R0, R4, R5

Clause execution
- Back-to-back register access is common
- Register file bypass saves power
- Allows use of simpler, smaller register files
[Diagram: register file R0-R7, with intermediate results passed through temporary T]
  ADD T, R0, R1
  ADD T, T, R3
  ADD R0, T, R5
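To make the saving concrete, the two instruction sequences from the clause execution slides can be compared by counting accesses to the main register file. The sketch below is a toy model under stated assumptions: the tuple encoding of instructions and the helper names `count_regfile_accesses` and `execute` are invented for this example, and any operand named `T` is assumed to bypass the register file entirely.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, str]  # (destination, source 1, source 2) of an ADD

# The two sequences shown on the slides.
WITHOUT_BYPASS: List[Instr] = [("R2", "R0", "R1"), ("R4", "R2", "R3"), ("R0", "R4", "R5")]
WITH_BYPASS: List[Instr] = [("T", "R0", "R1"), ("T", "T", "R3"), ("R0", "T", "R5")]


def count_regfile_accesses(clause: List[Instr], temp: str = "T") -> Tuple[int, int]:
    """Return (reads, writes) that touch the main register file.

    Assumption: operands named `temp` are forwarded and never hit the register file.
    """
    reads = writes = 0
    for dst, src1, src2 in clause:
        reads += sum(1 for src in (src1, src2) if src != temp)
        writes += 0 if dst == temp else 1
    return reads, writes


def execute(clause: List[Instr], regs: Dict[str, int]) -> int:
    """Run the ADD sequence on a copy of `regs` and return the final value of R0."""
    state = dict(regs)
    for dst, src1, src2 in clause:
        state[dst] = state[src1] + state[src2]
    return state["R0"]


if __name__ == "__main__":
    regs = {f"R{i}": i for i in range(8)}
    print("without bypass:", count_regfile_accesses(WITHOUT_BYPASS))
    print("with bypass:   ", count_regfile_accesses(WITH_BYPASS))
    print("same result:   ", execute(WITHOUT_BYPASS, regs) == execute(WITH_BYPASS, regs))
```

In this model the first sequence makes 6 register-file reads and 3 writes, while the bypassed sequence makes 4 reads and 1 write and leaves the same value in R0.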

Clause scheduling
- Delay next clause if asynchronous data not ready
[Diagram labels: TEX (texture unit operation), "Unrelated?", "Required data not ready?", "Use result"; legend: Overhead, Instruction]

Clause scheduling
- Another quad can use this execution unit
- High utilization, high efficiency
[Diagram labels: TEX (texture unit operation), "Use result"; legend: Overhead, Quad 1, Quad 2]
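The effect of letting another quad run while one waits for texture data can be illustrated with a toy scheduler. Everything here is an assumption made for the sketch: the `Clause` fields, the cycle counts, the single execution engine, and the "first ready quad wins" policy are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Clause:
    cycles: int             # back-to-back cycles spent executing the clause
    async_latency: int = 0  # hypothetical wait before this quad's next clause may start


def simulate(quads: List[List[Clause]]) -> Tuple[int, int]:
    """Run all quads on one execution engine; return (total_cycles, idle_cycles)."""
    next_clause = [0] * len(quads)  # index of the next clause for each quad
    ready_at = [0] * len(quads)     # cycle at which each quad may run again
    now = idle = 0

    def pending(q: int) -> bool:
        return next_clause[q] < len(quads[q])

    while any(pending(q) for q in range(len(quads))):
        runnable = [q for q in range(len(quads)) if pending(q) and ready_at[q] <= now]
        if not runnable:
            # No quad has its data yet: the engine stalls until the earliest one wakes.
            wake = min(ready_at[q] for q in range(len(quads)) if pending(q))
            idle += wake - now
            now = wake
            continue
        q = runnable[0]  # toy policy: lowest-numbered ready quad
        clause = quads[q][next_clause[q]]
        now += clause.cycles
        ready_at[q] = now + clause.async_latency
        next_clause[q] += 1
    return now, idle


if __name__ == "__main__":
    # One clause that issues a texture fetch, then one clause that uses the result.
    shader = [Clause(cycles=2, async_latency=4), Clause(cycles=3)]
    for n in (1, 2, 3):
        total, idle = simulate([shader] * n)
        print(f"{n} quad(s): {total} cycles, {idle} idle")
```

With these made-up latencies a single quad leaves the engine idle for 4 of 9 cycles, two quads finish in 12 cycles with 2 idle, and three quads hide the wait completely.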

Arithmetic functional units

Bifrost arithmetic functional units
- Executes quad-parallel scalar operations
  - 4x32-bit multiplier (FMA)
  - 4x32-bit adder (ADD)
    - Adder includes special function unit
  - Smaller and more area efficient
- Simplified layout eases compilation
  - Better scheduling in today's code
  - Better utilization
  - One instruction word contains two instructions
[Diagram labels: Main Regs Read, FMA, ADD/SF, Main Regs Write, Temp Registers]

Bifrost arithmetic functional units
- Retains support for smaller width data types
  - Integers useful for deep learning
  - 2x performance for FP16, useful for pixel shaders

  Elements per 32-bit lane
  int8    x4   8-bit integers
  int16   x2   16-bit integers
  int32   x1   32-bit integers
  float16 x2   16-bit floating point
  float32 x1   32-bit floating point
[Diagram labels: Main Regs Read, FMA, ADD/SF, Main Regs Write, Temp Registers]
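The packing shown on this slide, where several narrow elements share one 32-bit lane, can be sketched in software. This is an illustration only: the helper names are hypothetical, the little-endian layout is an arbitrary choice, and wrap-around on overflow is an assumption of the sketch (the slides do not say whether the hardware wraps or saturates).

```python
import struct
from typing import List, Sequence

LANE_BITS = 32  # width of one hardware lane, per the slides


def elements_per_lane(element_bits: int) -> int:
    """How many elements of the given width fit in one 32-bit lane."""
    if element_bits <= 0 or LANE_BITS % element_bits != 0:
        raise ValueError("element width must divide the lane width")
    return LANE_BITS // element_bits


def pack_int8x4(values: Sequence[int]) -> int:
    """Pack four signed 8-bit integers into one 32-bit word (little-endian)."""
    return struct.unpack("<I", struct.pack("<4b", *values))[0]


def unpack_int8x4(word: int) -> List[int]:
    """Inverse of pack_int8x4."""
    return list(struct.unpack("<4b", struct.pack("<I", word)))


def add_int8x4(a: int, b: int) -> int:
    """Element-wise add of two packed words; each element wraps to signed 8 bits."""
    sums = [(x + y + 128) % 256 - 128 for x, y in zip(unpack_int8x4(a), unpack_int8x4(b))]
    return pack_int8x4(sums)


if __name__ == "__main__":
    for bits in (8, 16, 32):
        print(f"{bits}-bit elements per lane:", elements_per_lane(bits))
    a = pack_int8x4([1, -2, 3, 127])
    b = pack_int8x4([1, 1, 1, 1])
    print(unpack_int8x4(add_int8x4(a, b)))
```

Two 16-bit values per lane is also where the "2x performance for FP16" figure on the slide comes from: the same 32-bit datapath carries twice as many elements per operation.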

Special arithmetic operations
- Special function hardware is smaller than Midgard equivalent
- Many transcendental functions supported
- Special functions provide building blocks for compiled shader code
  - Part of the built-in function libraries
[Diagram labels: Main Regs Read, FMA, ADD/SF, Main Regs Write, Temp Registers]

Load/store units

New core design
[Diagram: shader core block diagram, as on the "Bifrost core design" slide]

Bifrost load/store units
- Separate units, scheduled separately, for better utilization
- Load/store Unit
  - Handles most general memory accesses
  - Includes memory address translation and coherent caching
- Attribute Unit
  - Handles attribute indexing and addressing
  - Defers to load/store for actual memory access
- Varying Unit
  - Handles varying interpolation
  - Lower power, but more range and precision than Midgard

Tiler

Geometry flow improvement
[Diagram: Bifrost GPU block diagram, as on the "Bifrost GPU design" slide]

Geometry flow: Midgard
[Figure: processing stages against data held in memory]
- Processing stages: 1. Vertex Shading, 2. Tiling, 3. Fragment Shading
- Data in memory: Positions, Attribs, Trans. Positions, Varyings, Indices, Polygon List
- Read+Write Bandwidth [x times of storage size], labels in the figure: 1x, 3.5x, 2.5x, 1x, 1x
- Legend: a numbered marker shows the leading data stream at the numbered stage; bandwidth used is relative to memory storage size

Geometry flow: Bifrost - index-driven vertex shading
[Figure: processing stages against data held in memory]
- Processing stages: Position Shading, Tiling, Varying Shading, Fragment Shading
- Data in memory: Positions, Attribs, Indices, Trans. Positions, Polygon List, Varyings
- Read+Write Bandwidth [x times of storage size], labels in the figure: 0.5x, 0.5x, 3.5x, 2.0x, 2.5x, 1.5x, 1x
- Legend: a numbered marker shows the leading data stream at the numbered stage; bandwidth used is relative to memory storage size
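The stage order on this slide (position shading, then tiling, then varying shading) can be sketched as a small pipeline to show where work is saved. This is a toy model under stated assumptions: the function names are hypothetical, the "transform" is the identity, and the visibility test (keep a triangle if any vertex lies inside the [-1, 1] square) is a stand-in for the real culling done during tiling.

```python
from typing import Dict, List, Sequence, Set, Tuple

Position = Tuple[float, float]
Triangle = Tuple[int, int, int]


def position_shade(positions: Sequence[Position],
                   indices: Sequence[Triangle]) -> Dict[int, Position]:
    """Stage 1: shade positions only for vertices the index buffer references."""
    referenced = {i for tri in indices for i in tri}
    # Identity transform as a placeholder for the real position shader.
    return {i: positions[i] for i in sorted(referenced)}


def on_screen(p: Position) -> bool:
    """Toy visibility test: inside the [-1, 1] x [-1, 1] square."""
    return -1.0 <= p[0] <= 1.0 and -1.0 <= p[1] <= 1.0


def tile(transformed: Dict[int, Position],
         indices: Sequence[Triangle]) -> List[Triangle]:
    """Stage 2: keep only triangles with at least one on-screen vertex."""
    return [tri for tri in indices if any(on_screen(transformed[i]) for i in tri)]


def varying_shade(visible: Sequence[Triangle]) -> Set[int]:
    """Stage 3: vertices that still need their varyings computed."""
    return {i for tri in visible for i in tri}


if __name__ == "__main__":
    positions = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5),   # on-screen triangle
                 (5.0, 5.0), (6.0, 5.0), (5.0, 6.0),   # off-screen triangle
                 (9.0, 9.0)]                           # never referenced
    indices = [(0, 1, 2), (3, 4, 5)]
    shaded = position_shade(positions, indices)
    visible = tile(shaded, indices)
    print("position-shaded vertices:", len(shaded))
    print("varying-shaded vertices: ", len(varying_shade(visible)))
```

In this example six vertices are position-shaded but only three go on to varying shading, so the varyings of culled geometry are never computed or written to memory.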

Memory system

Memory system
- Full coherency using ACE protocol
[Diagram: Bifrost GPU block diagram, as on the "Bifrost GPU design" slide]

Memory system
- Full system coherency support
  - Supports tightly coupled CPU+GPU use cases
- L2 cache improvements
  - Single logical L2 cache makes software easier
  - Fewer partial lines written to AXI, which improves LPDDR4 performance
[Diagram labels: Cortex-A73 CPU, Mali-G71 GPU, CoreLink CCI-550, DMC-500, DRAM]

The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright 2016 ARM Limited