Hardware- Software Co-design at Arm GPUs

Similar documents
The Bifrost GPU architecture and the ARM Mali-G71 GPU

Bifrost - The GPU architecture for next five billion

Developing the Bifrost GPU architecture for mainstream graphics

Rendering Structures Analyzing modern rendering on mobile

The Changing Face of Edge Compute

PowerVR Hardware. Architecture Overview for Developers

Implementing debug. and trace access. through functional I/O. Alvin Yang Staff FAE. Arm Tech Symposia Arm Limited

How to Build Optimized ML Applications with Arm Software

Optimize HPC - Application Efficiency on Many Core Systems

Achieving Console Quality Games on Mobile

Lecture 2. Shaders, GLSL and GPGPU

How to Build Optimized ML Applications with Arm Software

A Developer's Guide to Security on Cortex-M based MCUs

Case 1:17-cv SLR Document 1-3 Filed 01/23/17 Page 1 of 33 PageID #: 60 EXHIBIT C

Connect your IoT device: Bluetooth 5, , NB-IoT

Mali-400 MP: A Scalable GPU for Mobile Devices Tom Olson

Bringing Intelligence to Enterprise Storage Drives

Optimizing and Profiling Unity Games for Mobile Platforms. Angelo Theodorou Senior Software Engineer, MPG Gamelab 2014, 25 th -27 th June

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

Spring 2009 Prof. Hyesoon Kim

Spring 2011 Prof. Hyesoon Kim

A New Security Platform for High Performance Client SoCs

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

Vulkan Multipass mobile deferred done right

Rendering approaches. 1.image-oriented. 2.object-oriented. foreach pixel... 3D rendering pipeline. foreach object...

1.2.3 The Graphics Hardware Pipeline

Case 1:17-cv SLR Document 1-4 Filed 01/23/17 Page 1 of 30 PageID #: 75 EXHIBIT D

Connect Your IoT Device: Bluetooth 5, , NB-IoT

CS GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1. Markus Hadwiger, KAUST

Shaders. Slide credit to Prof. Zwicker

PowerVR Series5. Architecture Guide for Developers

2017 Arm Limited. How to design an IoT SoC and get Arm CPU IP for no upfront license fee

CS4620/5620: Lecture 14 Pipeline

Course Recap + 3D Graphics on Mobile GPUs

Graphics Hardware. Instructor Stephen J. Guy

Bringing AAA graphics to mobile platforms. Niklas Smedberg Senior Engine Programmer, Epic Games

Mali-G72: Enabling tomorrow s technology today

CS427 Multicore Architecture and Parallel Computing

Vulkan: Architecture positive How Vulkan maps to PowerVR GPUs Kevin sun Lead Developer Support Engineer, APAC PowerVR Graphics.

A SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

PROFESSIONAL. WebGL Programming DEVELOPING 3D GRAPHICS FOR THE WEB. Andreas Anyuru WILEY. John Wiley & Sons, Ltd.

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

ARM. Mali GPU. OpenGL ES Application Optimization Guide. Version: 3.0. Copyright 2011, 2013 ARM. All rights reserved. ARM DUI 0555C (ID102813)

WAVE ONE MAINFRAME WAVE THREE INTERNET WAVE FOUR MOBILE & CLOUD WAVE TWO PERSONAL COMPUTING & SOFTWARE Arm Limited

Practical Texturing (WebGL) CS559 Fall 2016 Lecture 20 November 7th 2016

Graphics Processing Unit Architecture (GPU Arch)

CS230 : Computer Graphics Lecture 4. Tamar Shinar Computer Science & Engineering UC Riverside

Accelerate Ceph By SPDK on AArch64

GeForce4. John Montrym Henry Moreton

Vulkan on Mobile. Daniele Di Donato, ARM GDC 2016

Achieving High-performance Graphics on Mobile With the Vulkan API

Software Ecosystem for Arm-based HPC

Could you make the XNA functions yourself?

Unleash the DSP performance of Arm Cortex processors

The Application Stage. The Game Loop, Resource Management and Renderer Design

Profiling and Debugging Games on Mobile Platforms

Threading Hardware in G80

Rendering Objects. Need to transform all geometry then

TEAPOT: A Toolset for Evaluating Performance, Power and Image Quality on Mobile Graphics Systems

Hardware-driven Visibility Culling Jeong Hyun Kim

Real - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský

Using Virtual Platforms To Improve Software Verification and Validation Efficiency

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

2.11 Particle Systems

Hardware-driven visibility culling

Rasterization Overview

Adaptive Point Cloud Rendering

Shaders (some slides taken from David M. course)

Converts geometric primitives into images Is split into several independent stages Those are usually executed concurrently

CS770/870 Fall 2015 Advanced GLSL

Dave Shreiner, ARM March 2009

CS130 : Computer Graphics. Tamar Shinar Computer Science & Engineering UC Riverside

Cloud Computing Project High Level Design Document

DPDK on Arm64 Status Review & Plan

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

Drawing Fast The Graphics Pipeline

Mali-G72 Enabling tomorrow s technology today

Pipeline Operations. CS 4620 Lecture Steve Marschner. Cornell CS4620 Spring 2018 Lecture 11

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

CS451Real-time Rendering Pipeline

The Rasterization Pipeline

Pipeline Operations. CS 4620 Lecture 14

GPU Memory Model. Adapted from:

Advanced IP solutions enabling the autonomous driving revolution

Cortex-A75 and Cortex-A55 DynamIQ processors Powering applications from mobile to autonomous driving

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Accelerating intelligence at the edge for embedded and IoT applications

ARM. Mali GPU. OpenGL ES Application Optimization Guide. Version: 2.0. Copyright 2011, 2013 ARM. All rights reserved. ARM DUI 0555B (ID051413)

PowerVR Performance Recommendations. The Golden Rules

A Secure and Connected Intelligent Future. Ian Smythe Senior Director Marketing, Client Business Arm Tech Symposia 2017

C P S C 314 S H A D E R S, O P E N G L, & J S RENDERING PIPELINE. Mikhail Bessmeltsev

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

Compute solutions for mass deployment of autonomy

Lecture 25: Board Notes: Threads and GPUs

Arm crossplatform. VI-HPS platform October 16, Arm Limited

TZMP-1 Software Reference Implementation. Ken Liu 2018-Mar-12

Inside VR on Mobile. Sam Martin Graphics Architect GDC 2016

Transcription:

Hardware- Software Co-design at Arm GPUs Johan Grönqvist MCC 2017 - Uppsala

About Arm

Arm Mali GPUs: The World s #1 Shipping Graphics Processor 151 Total Mali licenses 21 Mali video and display licenses 1Bn Mali GPUs shipped in 2016 Mali GPUs are in: Mali graphics based 1Bn ~50% of mobile VR ~80% of DTVs ~50% of smartphones IC shipments (units) 400m 150m 550m 750m 2012 2013 2014 2015 2016 3

Arm 4

Graphics

Levels of Abstraction Standardized APIs allow separation of concerns Graphics APIs Application HW hidden ISA Memory hierarchy User-space graphics driver Display subsystem integration Display subsystem Portability User HW specific optimizations in the Application Application specific optimizations in the driver Kernel Kernel-space graphics driver GPU hardware OS graphics memory manager Display kernel driver 6

A Graphics Pipeline Vertices, triangles, fragments and pixels Application Vertex shader CPU GPU Primitive assembly & culling Rasterization Fragment shader ZS test & blending Framebuffer 7

A Graphics Demo by Arm Ice Cave, created by Arm in 2015 to show what kids of effects and level of detail a mobile GPU is capable of. https://www.youtube.com/watch?v=gsybmhjvhxa 8

Pipeline & Optimizations

A Schematic OpenGL Implementation def shade(vertices, vertex_shader, fragment_shader, blend_shader, vertex_data): (vertices, vertex_data ) = ([], []) for (v, vd) in zip(vertices, vertex_data): v, vd = vertex_shader(v, vd) vertices.append(v ) vertex_data.append(vd ) triangles, triangle_data = primitive_assembly(vertices, vertex_data ) for (t, td) in zip(triangles, triangle_data): fragments, fragment_data = rasterize_triangle(t, td) for (f, fd) in zip(fragments, fragment_data): depth, color = fragment_shader(f, fd) if is_visible(depth, pos): output[pos] = blend_shader(output[pos], color) 10

A Schematic OpenGL Implementation def shade(vertices, vertex_shader, fragment_shader, blend_shader, vertex_data): (vertices, vertex_data ) = ([], []) for (v, vd) in zip(vertices, vertex_data): v, vd = vertex_shader(v, vd) vertices.append(v ) vertex_data.append(vd ) triangles, triangle_data = primitive_assembly(vertices, vertex_data ) for (t, td) in zip(triangles, triangle_data): fragments, fragment_data = rasterize_triangle(t, td) for (f, fd) in zip(fragments, fragment_data): depth, color = fragment_shader(f, fd) if is_visible(depth, pos): output[pos] = blend_shader(output[pos], color) 11

A Schematic OpenGL Implementation def shade(vertices, vertex_shader, fragment_shader, blend_shader, vertex_data): (vertices, vertex_data ) = ([], []) for (v, vd) in zip(vertices, vertex_data): v, vd = vertex_shader(v, vd) vertices.append(v ) vertex_data.append(vd ) triangles, triangle_data = primitive_assembly(vertices, vertex_data ) for (t, td) in zip(triangles, triangle_data): fragments, fragment_data = rasterize_triangle(t, td) for (f, fd) in zip(fragments, fragment_data): depth, color = fragment_shader(f, fd) if is_visible(depth, pos): output[pos] = blend_shader(output[pos], color) 12

A Schematic OpenGL Implementation def shade(vertices, vertex_shader, fragment_shader, blend_shader, vertex_data): (vertices, vertex_data ) = ([], []) for (v, vd) in zip(vertices, vertex_data): v, vd = vertex_shader(v, vd) vertices.append(v ) vertex_data.append(vd ) triangles, triangle_data = primitive_assembly(vertices, vertex_data ) for (t, td) in zip(triangles, triangle_data): fragments, fragment_data = rasterize_triangle(t, td) for (f, fd) in zip(fragments, fragment_data): depth, color = fragment_shader(f, fd) if is_visible(depth, pos): output[pos] = blend_shader(output[pos], color) 13

A Schematic OpenGL Implementation def shade(vertices, vertex_shader, fragment_shader, blend_shader, vertex_data): (vertices, vertex_data ) = ([], []) for (v, vd) in zip(vertices, vertex_data): v, vd = vertex_shader(v, vd) vertices.append(v ) vertex_data.append(vd ) triangles, triangle_data = primitive_assembly(vertices, vertex_data ) for (t, td) in zip(triangles, triangle_data): fragments, fragment_data = rasterize_triangle(t, td) for (f, fd) in zip(fragments, fragment_data): depth, color = fragment_shader(f, fd) if is_visible(depth, pos): output[pos] = blend_shader(output[pos], color) 14

A Schematic OpenGL Implementation def shade(vertices, vertex_shader, fragment_shader, blend_shader, vertex_data): (vertices, vertex_data ) = ([], []) for (v, vd) in zip(vertices, vertex_data): v, vd = vertex_shader(v, vd) vertices.append(v ) vertex_data.append(vd ) triangles, triangle_data = primitive_assembly(vertices, vertex_data ) for (t, td) in zip(triangles, triangle_data): fragments, fragment_data = rasterize_triangle(t, td) for (f, fd) in zip(fragments, fragment_data): depth, color = fragment_shader(f, fd) if is_visible(depth, pos): output[pos] = blend_shader(output[pos], color) 15

Programmable Parts Everything else is fixed function def shade(vertices, vertex_shader, fragment_shader, blend_shader, vertex_data): (vertices, vertex_data ) = ([], []) for (v, vd) in zip(vertices, vertex_data): v, vd = vertex_shader(v, vd) vertices.append(v ) vertex_data.append(vd ) triangles, triangle_data = primitive_assembly(vertices, vertex_data ) for (t, td) in zip(triangles, triangle_data): fragments, fragment_data = rasterize_triangle(t, td) for (f, fd) in zip(fragments, fragment_data): depth, color = fragment_shader(f, fd) if is_visible(depth, pos): output[pos] = blend_shader(output[pos], color) 16

Parallelization Opportunities We have many shading jobs and compute jobs running at the same time Other Jobs def shade(vertices, vertex_shader, fragment_shader, blend_shader, vertex_data): (vertices, vertex_data ) = ([], []) for (v, vd) in zip(vertices, vertex_data): v, vd = vertex_shader(v, vd) vertices.append(v ) vertex_data.append(vd ) triangles, triangle_data = primitive_assembly(vertices, vertex_data ) for (t, td) in zip(triangles, triangle_data): fragments, fragment_data = rasterize_triangle(t, td) for (f, fd) in zip(fragments, fragment_data): depth, color = fragment_shader(f, fd) if is_visible(depth, pos): output[pos] = blend_shader(output[pos], color) 17

Early Visibility Testing Dependency tracking becomes tricky with visibility tests in many places def shade(vertices, vertex_shader, fragment_shader, blend_shader, vertex_data): (vertices, vertex_data ) = ([], []) for (v, vd) in zip(vertices, vertex_data): Move up v, vd = vertex_shader(v, vd) vertices.append(v ) vertex_data.append(vd ) triangles, triangle_data = primitive_assembly(vertices, vertex_data ) for (t, td) in zip(triangles, triangle_data): fragments, fragment_data = rasterize_triangle(t, td) for (f, fd) in zip(fragments, fragment_data): depth, color = fragment_shader(f, fd) if is_visible(depth, pos): output[pos] = blend_shader(output[pos], color) 18

GPUs

Arm Bifrost GPU Architecture APB IRQ Job Manager L2 cache SC0 SC1 SC3 MMU SC4 SC5... Tiler 20

A Bifrost Core Thread management Instruction Execution Visibility & ColorBuffer Memory Ops 21

ZS memory EXECUTION CORE A Bifrost Core Thread management Fragment front-end Quad creator Compute front-end Quad creator Instruction Execution Quad manager Execution engine 0 Thread state Execution engine 1 Thread state Execution engine 3 Thread state Message fabric Late ZS unit Blender & Tile access Varying unit Attribute units Load/store cache Texture unit Visibility & ColorBuffer Color memory Tile writeback To L2 Mem Sys To L2 Mem Sys Memory Ops 22

Fixed Use-Case and Standardized APIs Enabling hardware-software co-design A GPU implementation uses Fixed function hardware for triangle operations, rasterizations, interpolations, and more Hardware to ensure serialization when necessary Buffers to reorder threads to enable more parallelism, where serialization is not needed Fixed function hardware for killing hidden pixels as early as possible 23

Fixed Use-Case and Standardized APIs Enabling hardware-software co-design A GPU implementation uses Fixed function hardware for triangle operations, rasterizations, interpolations, and more Hardware to ensure serialization when necessary Buffers to reorder threads to enable more parallelism, where serialization is not needed Fixed function hardware for killing hidden pixels as early as possible Tuning hardware becomes very content dependent Complexity of the Geometry, Number of buffers used, the amounts of data load per thread Software and hardware evolve together over time 24

Fixed Use-Case and Standardized APIs Enabling hardware-software co-design A GPU implementation uses Fixed function hardware for triangle operations, rasterizations, interpolations, and more Hardware to ensure serialization when necessary Buffers to reorder threads to enable more parallelism, where serialization is not needed Fixed function hardware for killing hidden pixels as early as possible Tuning hardware becomes very content dependent Complexity of the Geometry, Number of buffers used, the amounts of data load per thread Software and hardware evolve together over time The API allows large and invasive hardware changes that are transparent to the application, due to the driver mediating between the two. 25

Mutual Adaptation Between GPUs and Applications Both evolve over time Better Hardware Heavier Applications More complex geometries and more complex computations. More triangles per object More objects More complex light effects More particle effects Application Changes Rebalanced GPUs HW buffers, memory hierarchy and compute density must be balanced for the use-case. Larger buffers to enable more out-of order and more parallelism Higher compute performance Large caches and more complex memory hierarchy 26

Index-Driven Position Shading

IDVS Introduction Discarding triangles early A vertex shader typically computes two kinds of things, a position and some data A coordinate transformation from the model space to the real space A coordinate transformation as objects move around Data transformations for, e.g., lighting calculations Often, the coordinate transformations are simple. Sometimes, we can discard triangles based only on the position data. Executing the data transformations is then useless work that we want to avoid. 28

IDVS Introduction Splitting the vertex shader We want to split the vertex shader into two parts The original vertex shader transformed both position and data out_position, out_data = vertex_shader(position, data) Split into two parts, computing one piece each out_position = position_shader(position, data) out_data = data_shader(out_position, data) Discard triangles between the two shaders Implementation Let the compiler split the shader into two Let the hardware cull triangles between position_shader and data_shader Let the driver manage memory buffers to ensure correctness 29

IDVS: Hardware Indices Primitive assembly Position shading Culling & Tiling Transformed vertex positions Vertex Data Varying shading Transformed Vertex Data 30

A Simple Example Basic vertex shader Input data in two parts, for position and data. A simple transform to position the object A separate computation for, e.g., lighting def vertex_shader(position, data): transform, buffer = data out_position = transform(position) out_data = data_transform(out_position, buffer) return (out_position, out_data) 31

A Simple Example Splits nicely into two parts Input data in two parts, for position and other data. A simple transform to position the object A separate computation for, e.g., lighting Easily split into two parts def vertex_shader(position, data): transform, buffer = data out_position = transform(position) out_data = data_transform(out_position, buffer) return (out_position, out_data) 32

A Simple Example Splits nicely into two parts Input data in two parts, for position and other data. A simple transform to position the object A separate computation for, e.g., lighting Easily split into two parts def position_shader (position, data): transform, _ = data out_position = transform(position) return out_position def data_shader(position, data): out_data = data_transform(out_position, data) return out_data 33

A More Complex Example Transformation needed to compute data Input data in two parts, for position and data. Complex dependencies on input data def vertex_shader(position, data): transform_data = compute_transform(data) out_position = transform(transform_data, position) out_data = lighting_calculation( out_position, data, transform_data) return (out_position, out_data) 34

A More Complex Example Common subexpression Input data in two parts, for position and data. Complex dependencies on input data Common subexpressions for position and data computations def vertex_shader(position, data): transform_data = compute_transform(data) out_position = transform(transform_data, position) out_data = lighting_calculation( out_position, data, transform_data) return (out_position, out_data) 35

A More Complex Example Recomputing the same function twice Input data in two parts, for position and data. Complex dependencies on input data Common subexpressions for position and data computations Naïve split requires computing transform twice def vertex_shader(position, data): transform_data = compute_transform(data) out_position = transform(transform_data, position) return out_position def data_shader(out_position, data): transform_data = compute_transform(data) out_data = lighting_calculation( out_position, data, transform_data) return out_data 36

Directions for Solution Intermediate results common for both position and data should not be recomputed Identify such intermediate results Store them to additional buffers Only split shaders when we expect a performance improvement Needs active content aware hardware-software co-design to obtain good performance in general 37

Summary Graphics APIs give device designers a HW/SW co-design opportunity Understanding the use-case is important for successful HW/SW co-design HW/SW co-design is important for obtaining good performance Applications and GPUs evolve together 38

The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. www.arm.com/company/policies/trademarks 39

Thank You! Danke! Merci! 谢谢! ありがとう! Gracias! Kiitos! 감사합니다 धन यव द 40