The Bifrost GPU architecture and the ARM Mali-G71 GPU

Similar documents
Bifrost - The GPU architecture for next five billion

Developing the Bifrost GPU architecture for mainstream graphics

Hardware- Software Co-design at Arm GPUs

PowerVR Series5. Architecture Guide for Developers

PowerVR Hardware. Architecture Overview for Developers

3D Graphics in Future Mobile Devices. Steve Steele, ARM

Achieving Console Quality Games on Mobile

Each Milliwatt Matters

Exploring System Coherency and Maximizing Performance of Mobile Memory Systems

Course Recap + 3D Graphics on Mobile GPUs

Mali-400 MP: A Scalable GPU for Mobile Devices Tom Olson

ARM. Mali GPU. OpenGL ES Application Optimization Guide. Version: 3.0. Copyright 2011, 2013 ARM. All rights reserved. ARM DUI 0555C (ID102813)

Portland State University ECE 588/688. Graphics Processors

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

Working with Metal Overview

Vulkan Multipass mobile deferred done right

The NVIDIA GeForce 8800 GPU

! Readings! ! Room-level, on-chip! vs.!

Threading Hardware in G80

Case 1:17-cv SLR Document 1-4 Filed 01/23/17 Page 1 of 30 PageID #: 75 EXHIBIT D

Parallel Programming on Larrabee. Tim Foley Intel Corp

Optimizing Cache Coherent Subsystem Architecture for Heterogeneous Multicore SoCs

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

Antonio R. Miele Marco D. Santambrogio

Case 1:17-cv SLR Document 1-3 Filed 01/23/17 Page 1 of 33 PageID #: 60 EXHIBIT C

Enabling a Richer Multimedia Experience with GPU Compute. Roberto Mijat Visual Computing Marketing Manager

Profiling and Debugging Games on Mobile Platforms

Achieving High-performance Graphics on Mobile With the Vulkan API

ARM Multimedia IP: working together to drive down system power and bandwidth

GeForce4. John Montrym Henry Moreton

Mali Developer Resources. Kevin Ho ARM Taiwan FAE

Lecture 25: Board Notes: Threads and GPUs

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Optimizing and Profiling Unity Games for Mobile Platforms. Angelo Theodorou Senior Software Engineer, MPG Gamelab 2014, 25 th -27 th June

Building High Performance, Power Efficient Cortex and Mali systems with ARM CoreLink. Robert Kaye

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

Integrating CPU and GPU, The ARM Methodology. Edvard Sørgård, Senior Principal Graphics Architect, ARM Ian Rickards, Senior Product Manager, ARM

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Introduction to Parallel Programming Models

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

Overview. Technology Details. D/AVE NX Preliminary Product Brief

CS427 Multicore Architecture and Parallel Computing

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

The Benefits of GPU Compute on ARM Mali GPUs

Unreal Engine 4: Mobile Graphics on ARM CPU and GPU Architecture

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

Scheduling the Graphics Pipeline on a GPU

PowerVR Performance Recommendations. The Golden Rules

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

Mali-G72 Enabling tomorrow s technology today

ARM. Mali GPU. OpenGL ES Application Optimization Guide. Version: 2.0. Copyright 2011, 2013 ARM. All rights reserved. ARM DUI 0555B (ID051413)

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

Bringing AAA graphics to mobile platforms. Niklas Smedberg Senior Engine Programmer, Epic Games

Copyright Khronos Group Page 1. Vulkan Overview. June 2015

1.2.3 The Graphics Hardware Pipeline

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

Mali-G72: Enabling tomorrow s technology today

Evolving IP configurability and the need for intelligent IP configuration

Dave Shreiner, ARM March 2009

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

Windowing System on a 3D Pipeline. February 2005

Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

A Reconfigurable Architecture for Load-Balanced Rendering

Maximizing heterogeneous system performance with ARM interconnect and CCIX

Modeling Performance Use Cases with Traffic Profiles Over ARM AMBA Interfaces

EECS 487: Interactive Computer Graphics

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Efficient and Scalable Shading for Many Lights

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

Hardware-driven Visibility Culling Jeong Hyun Kim

ECE 574 Cluster Computing Lecture 16

Vulkan: Architecture positive How Vulkan maps to PowerVR GPUs Kevin sun Lead Developer Support Engineer, APAC PowerVR Graphics.

Comprehensive Arm Solutions for Innovative Machine Learning (ML) and Computer Vision (CV) Applications

CUDA Programming Model

Vulkan on Mobile. Daniele Di Donato, ARM GDC 2016

Optimisation. CS7GV3 Real-time Rendering

PowerVR Graphics - Latest Developments and Future Plans

Building blocks for 64-bit Systems Development of System IP in ARM

GRAPHICS PROCESSING UNITS

Analyzing and Debugging Performance Issues with Advanced ARM CoreLink System IP Components

Higher Level Programming Abstractions for FPGAs using OpenCL

On-chip Networks Enable the Dark Silicon Advantage. Drew Wingard CTO & Co-founder Sonics, Inc.

Lecture 9: Deferred Shading. Visual Computing Systems CMU , Fall 2013

Overview. Think Silicon is a privately held company founded in 2007 by the core team of Atmel MMC IC group

Modern Processor Architectures. L25: Modern Compiler Design

Bifurcation Between CPU and GPU CPUs General purpose, serial GPUs Special purpose, parallel CPUs are becoming more parallel Dual and quad cores, roadm

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Cortex-A75 and Cortex-A55 DynamIQ processors Powering applications from mobile to autonomous driving

Graphics Performance Optimisation. John Spitzer Director of European Developer Technology

Spring 2011 Prof. Hyesoon Kim

Beyond Hardware IP An overview of Arm development solutions

Beyond Programmable Shading. Scheduling the Graphics Pipeline

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center

Take GPU Processing Power Beyond Graphics with Mali GPU Computing

PowerVR Performance Recommendations. The Golden Rules

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

ARM instruction sets and CPUs for wide-ranging applications

Transcription:

The Bifrost GPU architecture and the ARM Mali-G71 GPU Jem Davies ARM Fellow and VP of Technology Hot Chips 28 Aug 2016

Introduction to ARM Soft IP ARM licenses Soft IP cores (amongst other things) to our Silicon Partners They then make chips and sell to OEMs, who sell consumer devices ARM doesn t make chips We provide all the RTL, integration testbenches, memories lists, reference floorplans, example synthesis scripts, sometimes models, sometimes FPGA images, sometimes with implementation advice, always with memory system requirements/recommendations Consequently silicon area, power, frequencies, performance, benchmark scores can therefore vary quite a bit in real silicon 2

ARM Mali: The world s #1 shipping GPU 140 Total licenses 65 Total licensees 27 new Mali licenses in FY15 750M Mali-based GPUs shipped in 2015 Mali is in: Mali graphics based IC shipments (units) 750m ~75% of DTVs tablets smartphones ~50% ~40% of of <50m 150m 400m 550m 2011 2012 2013 2014 2015 3

ARM Mali graphics processor generations BIFROST Mali-G71 GPU Unified shader cores, scalar ISA, clause execution, full coherency, Vulkan, OpenCL MIDGARD Mali-T600 GPU series Mali-T700 GPU series Mali-T800 GPU series Unified shader cores, SIMD ISA, OpenGL ES 3.x, OpenCL, Vulkan Presented at HotChips 2015 UTGARD Mali-200 GPU Mali-300 GPU Mali-400 GPU Mali-450 GPU Mali-470 GPU Separate shader cores, SIMD ISA, OpenGL ES 2.x 4

Bifrost features Leverages Mali s scalable architecture Scalable to 32 shader cores Major shader core redesign New scalar, clause-based ISA New quad-based arithmetic units 20% Higher energy efficiency* Scalable to 32 cores 5 New geometry data flow Reduces memory bandwidth and footprint Support for fine grain buffer sharing with the CPU 20% Bandwidth Improvement* 40% Better performance density* *Compared to Mali-T880 on same process node under the same conditions.

Bifrost features Leverages Mali s scalable architecture Scalable to 32 shader cores Major shader core redesign New scalar, clause-based ISA New quad-based arithmetic units 20% Higher energy efficiency* Scalable to 32 cores 6 New geometry data flow Reduces memory bandwidth and footprint Support for fine grain buffer sharing with the CPU 20% Bandwidth Improvement* 40% Better performance density* *Compared to Mali-T880, on same process node under the same conditions.

Bifrost GPU design Driver Software Up to 32 shader cores supported Job Manager Core 0 Core 1 Core 2 Core 31 GPU Fabric Tiler MMU 7

Bifrost features Leverages Mali s scalable architecture Scalable to 32 shader cores Major shader core redesign New quad-based arithmetic units New scalar, clause-based ISA 20% Higher energy efficiency* Scalable to 32 cores 8 New geometry data flow Reduces memory bandwidth and footprint Support for fine grain buffer sharing with the CPU 20% Bandwidth Improvement* 40% Better performance density* *Compared to Mali-T880, on same process node under the same conditions.

core improvements Driver Software Job Manager Core 0 Core 1 Core 2 Core 31 GPU Fabric Tiler MMU 9

Mali-G71 shader core design Compute Frontend Fragment Frontend Execution Engine 0 Execution Engine 1 Execution Engine 2 Quad State Quad State Quad State Quad Control Core Fabric Quad Creator Quad Manager Quad Creator ZS Memory Load/store Unit Attribute Unit Varying Unit Texture Unit Blender & Tile Access Depth & Stencil To L2 Mem Sys To L2 Mem Sys Tile Memory Tile Writeback To L2 Mem Sys 10

Quad execution Compute Frontend Fragment Frontend Execution Engine 0 Execution Engine 1 Execution Engine 2 Quad State Quad State Quad State Core Fabric Quad Creator Quad Manager Quad Creator ZS Memory Load/store Unit Attribute Unit Varying Unit Texture Unit Blender & Tile Access Depth & Stencil To L2 Mem Sys To L2 Mem Sys Tile Memory Tile Writeback To L2 Mem Sys 11

Recap: SIMD vectorization Midgard GPUs use SIMD vectorization One thread at a time executes in each pipeline stage Each thread must fill the width of the hardware Lane 0 Lane 1 Lane 2 T0.x T0.y T0.z Lane 3 Idle Cycle 1 T1.x T1.y T1.z Idle Cycle 2 T2.x T2.y T2.z Idle Cycle 3 Sensitive to shader code Code always evolving Compiler vectorization is not perfect Have to detect combinations of operations which can be merged to fill idle lanes Scalar operations can not always be merged into vectors T3.x T3.y T3.z Idle Cycle 4 12

Quad vectorization Lane 0 Lane 1 Lane 2 Lane 3 Bifrost uses quad-parallel execution Four scalar threads executed in lockstep in a quad One quad at a time executes in each pipeline stage Each thread fills one 32-bit lane of the hardware 4 threads doing a vec3 FP32 add takes 3 cycles Improves utilization T0.x T1.x T2.x T0.y T1.y T2.y T0.z T1.x T2.z T3.x T3.y T3.z Cycle 1 Cycle 2 Cycle 3 Quad vectorization is compiler friendly Each thread only sees a stream of scalar operations Vector operations can always be split into scalars 13

Bifrost execution engine Executes quad-parallel scalar operations 4x32-bit multiplier FMA 4x32-bit adder ADD Adder includes special function unit Main Regs Read Smaller and more area efficient Simplified layout eases compilation Better scheduling in today s code Better utilization FMA ADD/SF Temp Registers One instruction word contains two instructions Main Regs Write 14

Bifrost execution engine: Special arithmetic ops Special function hardware is smaller than Midgard VLUT equivalent Many transcendental functions supported Special functions provide building blocks for compiled shader code Part of the built-in function libraries Main Regs Read FMA ADD/SF Temp Registers Main Regs Write 15

Bifrost execution engine: Functional units Retains support for smaller width data types 2x performance for FP16 useful for pixel shaders Main Regs Read int8 int8 int8 int8 8-bit integers int16 int16 16-bit integers int32 32-bit integers float16 float16 16-bit floating point FMA ADD/SF Temp Registers float32 32-bit floating point Main Regs Write 16

Bifrost features Leverages Mali s scalable architecture Scalable to 32 shader cores Major shader core redesign New quad-based arithmetic units New scalar, clause-based ISA 20% Higher energy efficiency* Scalable to 32 cores 17 New geometry data flow Reduces memory bandwidth and footprint Support for fine grain buffer sharing with the CPU 20% Bandwidth Improvement* 40% Better performance density* *Compared to Mali-T880, on same process node under the same conditions.

Clause execution A group of instructions which executes atomically Architectural state visible after clause completion Bypass path registers exposed to the compiler Non-deterministic instructions on clause boundaries 18

Bifrost shader clause example Linear Source LOAD.32 r0, [r10] FADD.32 r1,r0,r0 FADD.32 r2,r1,r1 FADD.32 r3,r2,r2 FADD.32 r4,r3,r3 FADD.32 r3,r3,r4 FADD.32 r0,r3,r3 STORE.32 r0, [r10] 19

Bifrost shader clause example Linear Source LOAD.32 r0, [r10] FADD.32 r1,r0,r0 FADD.32 r2,r1,r1 FADD.32 r3,r2,r2 FADD.32 r4,r3,r3 FADD.32 r3,r3,r4 FADD.32 r0,r3,r3 STORE.32 r0, [r10] Load is variable length, so clause must be split from use of r0 Next clause uses r0, so must wait for load to complete Basic Clause Compile { LOAD.32 r0, [r10] } wait(load) { FADD.32 r1,r0,r0 FADD.32 r2,r1,r1 FADD.32 r3,r2,r2 FADD.32 r4,r3,r3 FADD.32 r3,r3,r4 FADD.32 r0,r3,r3 STORE.32 r0, [r10] } 20

Bifrost shader clause example Linear Source Basic Clause Compile Optimized Clause Compile LOAD.32 r0, [r10] FADD.32 r1,r0,r0 FADD.32 r2,r1,r1 FADD.32 r3,r2,r2 FADD.32 r4,r3,r3 FADD.32 r3,r3,r4 FADD.32 r0,r3,r3 STORE.32 r0, [r10] Load is variable length, so clause must be split from use of r0 Next clause uses r0, so must wait for load to complete { LOAD.32 r0, [r10] } wait(load) { FADD.32 r1,r0,r0 FADD.32 r2,r1,r1 FADD.32 r3,r2,r2 FADD.32 r4,r3,r3 FADD.32 r3,r3,r4 FADD.32 r0,r3,r3 STORE.32 r0, [r10] } Use temporaries: only 2 register file accesses in 6 cycles Store may also stall, so split to help scheduling { LOAD.32 r0, [r10] } wait(load) { FADD.32 t0,r0,r0 FADD.32 t1,t0,t0 FADD.32 t0,t1,t1 FADD.32 t1,t0,t0 FADD.32 t0,t1,t1 FADD.32 r0,t0,t0 } { STORE.32 r0, [r10] } 21

Bifrost features Leverages Mali s scalable architecture Scalable to 32 shader cores Major shader core redesign New scalar, clause-based ISA New quad-based arithmetic units 20% Higher energy efficiency* Scalable to 32 cores 22 New geometry data flow Reduces memory bandwidth and footprint Support for fine grain buffer sharing with the CPU 20% Bandwidth Improvement* 40% Better performance density* *Compared to Mali-T880, on same process node under the same conditions.

Tiler changes Bifrost uses the same underlying hierarchical binning design as Midgard Significantly redesigned tiler memory structures Minimum buffer allocations eliminated Buffer allocation granularity now finer Micro-triangle elimination reduces the number of primitives stored in bin buffers for geometrydense scenes Cumulative effect of all changes is up to 95% reduction in tiler memory footprint 23

Index-driven position shading Tiler Assembly Position Shading Tiler Culling ½ x Varying Shading Fragment Shading Read/write bandwidth [x times of storage size] 1x 1x ½ x ½ x ½ x Processing Memory 3.5x 2.0x 2.5x 1.5x Positions Positions Attribs Attributes Indices Positions Transformed Positions Polygon List Vertex Attributes Vertex Varyings Midgard Bifrost 1x Bandwidth used relative to memory storage size 24

Index-driven position shading Leverages existing coherency flows Driver Software Job Manager Tiler MMU Core 0 Up to 32 shader cores supported Core 1 GPU Fabric Core 2 Core 31 Multiple shader cores write transformed positions into a shared memory fifo. The fixed function Tiler reads the transformed positions directly via shared memory reads. No manual flushing required (fifo values are most likely resident in the L2C, but don t have to be) 1x Bandwidth used relative to memory storage size GPU coherency domain Once the tiler has read the positions, they are no longer needed and may be discarded 25

Bifrost features Leverages Mali s scalable architecture Scalable to 32 shader cores Major shader core redesign New scalar, clause-based ISA New quad-based arithmetic units 20% Higher energy efficiency* Scalable to 32 cores 26 New geometry data flow Reduces memory bandwidth and footprint Support for fine grain buffer sharing with the CPU 20% Bandwidth Improvement* 40% Better performance density* *Compared to Mali-T880, on same process node under the same conditions.

Memory system Driver Software Job Manager Core 0 Core 1 Core 2 Core 31 Control Fabric Tiler MMU Full coherency using AMBA ACE protocol 27

Memory system Full system coherency support Supports tightly coupled CPU+GPU use cases Cortex-A73 CPU Mali-G71 GPU L2 cache improvements Single logical L2 cache makes software easier Fewer partial lines written to memory which improves LPDDR4 performance Supports TrustZone CoreLink CCI-550 DMC-500 DRAM 28

Next-generation heterogeneous computing OpenCL 2.0 Introduces Shared Virtual Memory 1 Workload balancing memory overhead reduction Mali-G71 goes one step further with fine grained buffers Significantly eases development and enables truly heterogeneous use case Normalized runtime 0.8 0.6 0.4-81% -90% Initialization and processing Memory operations 0.2 0 Coarse grained with SW Coherency Coarse grained with Full Coherency Fine grained with Full Coherency 29

Summary Leverages Mali s scalable architecture Scalable to 32 shader cores Major shader core redesign New scalar, clause-based ISA New quad-based arithmetic units 20% Higher energy efficiency* Scalable to 32 cores 30 New geometry data flow Reduces memory bandwidth and footprint Support for fine grain buffer sharing with the CPU 20% Bandwidth Improvement* 40% Better performance density* *Compared to Mali-T880, on same process node under the same conditions.

Thank you! The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright 2016 ARM Limited