Bifrost - The GPU architecture for next five billion


Bifrost - The GPU architecture for next five billion. Hessed Choi, Senior FAE, ARM. ARM Tech Forum, June 28th, 2016.

Vulkan. ARM 2016

What is Vulkan? A 3D graphics API for the next twenty years: the logical successor to OpenGL and OpenGL ES; a modern, efficient design; an open, industry-controlled standard. Here, now: released in February 2016 with unprecedented support; available today for desktop Windows and Linux; officially supported in Android N; shipping today in the Samsung Galaxy S7; an engaged, active developer community.

Why ARM loves Vulkan. A great fit for mobile graphics architectures: no wasted effort trying to look like a desktop GPU; designed to enable mobile-specific optimizations. Radical commitment to efficiency: CPU load is greatly reduced, even on a single core. Makes your multi-core CPU more useful: driver work can be distributed across many threads, which helps performance and power. Makes your multi-core GPU more useful too: it is easier for applications to keep a powerful GPU busy.

Bifrost

Bifrost: The new GPU architecture. The increasing pixel impact of modern mobile gaming continues to drive innovation. 2010: Utgard. 2013: Midgard. 2016: Bifrost.

ARM Mali processor generations
BIFROST: Mali-G71 GPU. Unified shader cores, scalar ISA, clause execution, full coherency, Vulkan, OpenCL.
MIDGARD: Mali-T600, Mali-T700 and Mali-T800 GPU series. Unified shader cores, SIMD ISA, OpenGL ES 3.x, OpenCL, Vulkan.
UTGARD: Mali-200, Mali-300, Mali-400, Mali-450 and Mali-470 GPUs. Separate shader cores, SIMD ISA, OpenGL ES 2.x.

Mali-G71 efficiency drives performance. 20% higher energy efficiency*. 40% better performance density*. 20% bandwidth improvement*. 32 cores. Optimized for next-generation, advanced, real-world content. *Compared to Mali-T880, on the same process node under the same conditions.

Bifrost features. A more efficient architecture: more performance overall, per mm² and per line of real-world shader code. Major shader core redesign: new scalar, clause-based ISA; new quad-based arithmetic units; new core fabric. New geometry data flow: reduces memory bandwidth and footprint. 1.5x performance improvement.

Architectural innovations

Bifrost architectural innovations. Energy efficiency: claused shaders, Index Driven Vertex Shading, wire light pipelines. Developer friendly: designed for Vulkan and VR/AR. Heterogeneous computing: full system coherency. [Diagram labels: Midgard, Bifrost, CPU, CPU, GPU, Coherent Interconnect, DRAM.]

Bifrost GPU design. [Block diagram labels: Driver Software; Job Manager; Core 0, Core 1, Core 2 ... Core 31; Control Fabric; Tiler; MMU; three L2 Cache Segments; three AXI Memory Buses.]

Scalable system design. Up to 32 shader cores supported. [Block diagram labels: Driver Software; Job Manager; Core 0, Core 1, Core 2 ... Core 31; Control Fabric; Tiler; MMU; three L2 Cache Segments; three AXI Memory Buses.]

Execution core improvements. [Block diagram labels: Driver Software; Job Manager; Core 0, Core 1, Core 2 ... Core 31; Control Fabric; Tiler; MMU; three L2 Cache Segments; three AXI Memory Buses.]

Bifrost core design

Bifrost core design. [Block diagram labels: Compute Frontend; Fragment Frontend; two Quad Creators; Quad Manager; Quad Control; Execution Engine 0, 1 and 2, each with Quad State; Control Fabric; Load/store Unit; Attribute Unit; Varying Unit; Texture Unit; Blender & Tile Access; Depth & Stencil; Tile Memory; ZS Memory; Tile Writeback; three paths to the L2 memory system.]

Quad vectorization. Bifrost uses quad-parallel execution: four scalar threads are executed in lockstep in a quad; one quad at a time executes in each pipeline stage; each thread fills one 32-bit lane of the hardware. Four threads doing a vec3 FP32 add take 3 cycles, which improves utilization.

          Lane 0   Lane 1   Lane 2   Lane 3
Cycle 1   T0.x     T1.x     T2.x     T3.x
Cycle 2   T0.y     T1.y     T2.y     T3.y
Cycle 3   T0.z     T1.z     T2.z     T3.z

Quad vectorization is compiler friendly: each thread only sees a stream of scalar operations, and vector operations can always be split into scalars.
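The lane-by-cycle layout on this slide can be illustrated with a small software model. This is only a sketch: the function name `quad_vec3_add` and the data layout are assumptions made for illustration, not part of the Bifrost hardware or any ARM tool. It splits a vec3 add into three scalar steps, with four threads advancing together in each step.

```python
# Illustrative model only: the names and structure are assumptions, not ARM's design.
def quad_vec3_add(a, b):
    """Add two vec3 values per thread for a quad of 4 threads.

    a, b: lists of 4 threads, each a 3-tuple (x, y, z).
    Returns (results, schedule); schedule[c] lists what each lane did in cycle c.
    """
    assert len(a) == 4 and len(b) == 4
    results = [[0.0, 0.0, 0.0] for _ in range(4)]
    schedule = []
    for comp, name in enumerate("xyz"):      # one cycle per scalar component
        cycle = []
        for lane in range(4):                # the four lanes advance in lockstep
            results[lane][comp] = a[lane][comp] + b[lane][comp]
            cycle.append(f"T{lane}.{name}")
        schedule.append(cycle)
    return results, schedule


a = [(1.0, 2.0, 3.0)] * 4
b = [(10.0, 20.0, 30.0)] * 4
results, schedule = quad_vec3_add(a, b)
for n, cycle in enumerate(schedule, start=1):
    print(f"Cycle {n}:", " ".join(cycle))
```

In this model every lane does useful work in every cycle whatever the vector width, which is the utilization point the slide makes.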

Clause execution. Back-to-back execution is guaranteed within a clause, which allows aggressive optimisation. [Figure legend: Overhead, Instruction.]

Clause execution. Back-to-back register access is common: the result from one instruction is often only used as input to the next. Example sequence:
ADD R2, R0, R1
ADD R4, R2, R3
ADD R0, R4, R5
[Diagram: register file R0 to R7, shown once per step.]

Clause execution. Back-to-back register access is common. Register file bypass saves power and allows the use of simpler, smaller register files. Example sequence using a temporary register T:
ADD T, R0, R1
ADD T, T, R3
ADD R0, T, R5
[Diagram: register file R0 to R7, shown once per step, with the temporary T passed between instructions.]
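The saving from the temporary register can be made concrete by counting main register file accesses for the two instruction sequences shown on the clause execution slides. This is a minimal sketch: the tuple encoding and the helper `count_main_regfile_accesses` are hypothetical, and real hardware costs depend on details the slides do not give.

```python
# Illustrative accounting only: the instruction encoding here is an assumption.
def count_main_regfile_accesses(program, temps=("T",)):
    """Count reads and writes that touch the main register file.

    program: list of (dst, src0, src1) tuples for ADD instructions.
    Operands named in `temps` are forwarded and never touch the main file.
    """
    reads = writes = 0
    for dst, src0, src1 in program:
        reads += sum(1 for s in (src0, src1) if s not in temps)
        writes += 0 if dst in temps else 1
    return reads, writes


# ADD R2, R0, R1 / ADD R4, R2, R3 / ADD R0, R4, R5
without_bypass = [("R2", "R0", "R1"), ("R4", "R2", "R3"), ("R0", "R4", "R5")]
# ADD T, R0, R1 / ADD T, T, R3 / ADD R0, T, R5
with_bypass = [("T", "R0", "R1"), ("T", "T", "R3"), ("R0", "T", "R5")]

print(count_main_regfile_accesses(without_bypass))  # (6, 3)
print(count_main_regfile_accesses(with_bypass))     # (4, 1)
```

Under this simple count the bypassed sequence makes 4 reads and 1 write instead of 6 reads and 3 writes, which is the kind of reduction the slide attributes to register file bypass.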

Clause scheduling. Delay the next clause if asynchronous data is not ready. [Figure labels: TEX; Unrelated?; Required data not ready?; Use result; Texture unit operation. Legend: Overhead, Instruction.]

Clause scheduling. Another quad can use this execution unit. High utilization, high efficiency. [Figure labels: TEX; Use result; Texture unit operation. Legend: Overhead, Quad 1, Quad 2.]
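A toy scheduler shows why letting another quad use the execution unit raises utilization. This is a sketch under stated assumptions: each quad runs one clause that issues a texture fetch and then one clause that uses the result, and the clause length (4 cycles) and texture latency (8 cycles) are made-up numbers, not Bifrost figures. The function `simulate` is hypothetical.

```python
# Toy model only: clause length and texture latency are invented for illustration.
def simulate(num_quads, clause_cycles=4, tex_latency=8):
    """Each quad runs clause A (issues TEX), then clause B (uses the result).

    Returns (total_cycles, busy_cycles) for a single execution unit.
    """
    ready_at = [0] * num_quads    # earliest cycle each quad's next clause may start
    remaining = [2] * num_quads   # clauses left per quad
    cycle = busy = 0
    while any(remaining):
        runnable = [q for q in range(num_quads)
                    if remaining[q] and ready_at[q] <= cycle]
        if not runnable:
            # Nothing can run: idle until the earliest pending data arrives.
            cycle = min(ready_at[q] for q in range(num_quads) if remaining[q])
            continue
        q = runnable[0]
        cycle += clause_cycles    # a clause runs back-to-back once started
        busy += clause_cycles
        remaining[q] -= 1
        if remaining[q] == 1:     # clause A just finished; its TEX result is pending
            ready_at[q] = cycle + tex_latency
    return cycle, busy


for n in (1, 2, 3):
    total, busy = simulate(n)
    print(f"{n} quad(s): {busy}/{total} cycles busy")
```

With these invented numbers, one quad keeps the unit busy for 8 of 16 cycles, two quads for 16 of 20, and three quads for 24 of 24, so the wait for texture data is hidden once enough quads are available.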

Arithmetic functional units

Bifrost arithmetic functional units. Executes quad-parallel scalar operations: a 4x32-bit multiplier (FMA) and a 4x32-bit adder (ADD); the adder includes the special function unit. Smaller and more area efficient. Simplified layout eases compilation: better scheduling in today's code and better utilization. One instruction word contains two instructions. [Pipeline diagram labels: Main Regs Read, FMA, ADD/SF, Main Regs Write, Temp Registers.]

Bifrost arithmetic functional units. Retains support for smaller-width data types: integers are useful for deep learning; 2x performance for FP16 is useful for pixel shaders. Data type layouts shown: 4 x int8 (8-bit integers), 2 x int16 (16-bit integers), int32 (32-bit integers), 2 x float16 (16-bit floating point), float32 (32-bit floating point). [Pipeline diagram labels: Main Regs Read, FMA, ADD/SF, Main Regs Write, Temp Registers.]
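The data type layouts on this slide amount to packing several narrow values into one 32-bit word. The sketch below shows that packing with Python's standard `struct` module. The helper names and the little-endian byte order are assumptions chosen for illustration; the slides do not specify how Bifrost orders elements within a lane.

```python
import struct

LANE_BITS = 32  # each thread fills one 32-bit lane, per the quad vectorization slide


def elements_per_lane(bits):
    """How many values of the given width fit in one 32-bit lane."""
    if LANE_BITS % bits:
        raise ValueError("width must divide 32")
    return LANE_BITS // bits


def pack_int8x4(values):
    """Pack four signed 8-bit integers into one 32-bit word (little-endian, assumed)."""
    return struct.unpack("<I", struct.pack("<4b", *values))[0]


def pack_fp16x2(lo, hi):
    """Pack two half-precision floats into one 32-bit word (little-endian, assumed)."""
    return struct.unpack("<I", struct.pack("<2e", lo, hi))[0]


print(elements_per_lane(8), elements_per_lane(16), elements_per_lane(32))
print(hex(pack_int8x4([1, 2, 3, 4])))
print(hex(pack_fp16x2(1.0, 2.0)))
```

Two FP16 values per 32-bit lane is consistent with the slide's "2x performance for FP16" claim, since one lane operation can then cover two values.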

Special arithmetic operations. Special function hardware is smaller than the Midgard equivalent. Many transcendental functions are supported. Special functions provide building blocks for compiled shader code and are part of the built-in function libraries. [Pipeline diagram labels: Main Regs Read, FMA, ADD/SF, Main Regs Write, Temp Registers.]

Load/store units

New core design. [Block diagram labels: Compute Frontend; Fragment Frontend; two Quad Creators; Quad Manager; Execution Engine 0, 1 and 2, each with Quad State; Control Fabric; Load/store Unit; Attribute Unit; Varying Unit; Texture Unit; Blender & Tile Access; Depth & Stencil; Tile Memory; ZS Memory; Tile Writeback; three paths to the L2 memory system.]

Bifrost load/store units. Separate units, scheduled separately, for better utilization.
Load/store Unit: handles most general memory accesses; includes memory address translation and coherent caching.
Attribute Unit: handles attribute indexing and addressing; defers to load/store for actual memory access.
Varying Unit: handles varying interpolation; lower power, but more range and precision than Midgard.

Tiler

Geometry flow improvement. [Block diagram labels: Driver Software; Job Manager; Core 0, Core 1, Core 2 ... Core 31; Control Fabric; Tiler; MMU; three L2 Cache Segments; three AXI Memory Buses.]

Geometry flow: Midgard. [Figure: processing stages 1 Vertex Shading, 2 Tiling, 3 Fragment Shading; memory streams Positions, Attribs, Trans. Positions, Varyings, Indices, Polygon List; read+write bandwidth given in multiples of storage size (values in the extracted text: 1x, 3.5x, 2.5x, 1x, 1x). Legend: a numbered marker is the leading data stream at the numbered stage; bandwidth used is relative to memory storage size.]

Geometry flow: Bifrost - index-driven vertex shading. [Figure: processing stages Position Shading, Tiling, Varying Shading, Fragment Shading; memory streams Indices, Positions, Attribs, Trans. Positions, Polygon List, Varyings; read+write bandwidth given in multiples of storage size (values in the extracted text: 0.5x, 0.5x, 3.5x, 2.0x, 2.5x, 1.5x, 1x). Legend: a numbered marker is the leading data stream at the numbered stage; bandwidth used is relative to memory storage size.]
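The stage order on this slide (position shading, tiling, varying shading, fragment shading) suggests that varyings only need to be computed for vertices that survive tiling. The sketch below counts shader invocations under that reading. It is a simplified model, not ARM's implementation: `idvs_counts` is a hypothetical helper, position shading is an identity transform, and `on_screen` is a crude stand-in for whatever culling the real tiler performs.

```python
# Simplified model: all names and the culling rule are assumptions for illustration.
def idvs_counts(indices, positions, is_visible):
    """Return (position_shaded, varying_shaded, visible_triangles).

    indices: flat index list, three per triangle.
    positions: per-vertex (x, y) positions.
    is_visible: predicate on a triangle's three shaded positions.
    """
    referenced = sorted(set(indices))
    # Stage 1: position shading for every vertex the index stream references.
    shaded = {v: positions[v] for v in referenced}
    # Stage 2: tiling, with culling of triangles judged invisible.
    triangles = [tuple(indices[i:i + 3]) for i in range(0, len(indices), 3)]
    visible = [t for t in triangles if is_visible([shaded[v] for v in t])]
    # Stage 3: varying shading only for vertices of surviving triangles.
    varying_vertices = {v for t in visible for v in t}
    return len(referenced), len(varying_vertices), visible


def on_screen(tri):
    """Crude stand-in for culling: keep a triangle if any vertex is in [-1, 1]^2."""
    return any(-1.0 <= x <= 1.0 and -1.0 <= y <= 1.0 for x, y in tri)


positions = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5),   # on-screen triangle
             (5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]   # off-screen triangle
indices = [0, 1, 2, 3, 4, 5]
pos_count, var_count, visible = idvs_counts(indices, positions, on_screen)
print(pos_count, var_count, visible)
```

In this example all six vertices are position shaded, but only the three vertices of the on-screen triangle are varying shaded, so varying data for culled geometry is never computed or written.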

Memory system

Memory system. Full coherency using ACE protocol. [Block diagram labels: Driver Software; Job Manager; Core 0, Core 1, Core 2 ... Core 31; Control Fabric; Tiler; MMU; three L2 Cache Segments; three AXI Memory Buses.]

Memory system. Full system coherency support: supports tightly coupled CPU+GPU use cases. L2 cache improvements: a single logical L2 cache makes software easier; fewer partial lines are written to AXI, which improves LPDDR4 performance. [Diagram labels: Cortex-A73 CPU, Mali-G71 GPU, CoreLink CCI-550, DMC-500, DRAM.]

The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright 2016 ARM Limited