Vulkan on Mobile. Daniele Di Donato, ARM GDC 2016

Similar documents
Achieving High-performance Graphics on Mobile With the Vulkan API

Inside VR on Mobile. Sam Martin Graphics Architect GDC 2016

Vulkan Subpasses. or The Frame Buffer is Lava. Andrew Garrard Samsung R&D Institute UK. UK Khronos Chapter meet, May 2016

Vulkan Multipass mobile deferred done right

Vulkan: Architecture positive How Vulkan maps to PowerVR GPUs Kevin sun Lead Developer Support Engineer, APAC PowerVR Graphics.

EECS 487: Interactive Computer Graphics

Introduction to SPIR-V Shaders

Vulkan (including Vulkan Fast Paths)

Achieving Console Quality Games on Mobile

Bringing AAA graphics to mobile platforms. Niklas Smedberg Senior Engine Programmer, Epic Games

Optimizing and Profiling Unity Games for Mobile Platforms. Angelo Theodorou Senior Software Engineer, MPG Gamelab 2014, 25 th -27 th June

Using SPIR-V in practice with SPIRV-Cross

Mobile HW and Bandwidth

PowerVR Performance Recommendations. The Golden Rules

Bifrost - The GPU architecture for next five billion

Working with Metal Overview

The Bifrost GPU architecture and the ARM Mali-G71 GPU

Vulkan API 杨瑜, 资深工程师

Unreal Engine 4: Mobile Graphics on ARM CPU and GPU Architecture

Copyright Khronos Group Page 1

ARM Multimedia IP: working together to drive down system power and bandwidth

Profiling and Debugging Games on Mobile Platforms

PowerVR Hardware. Architecture Overview for Developers

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

Moving Mobile Graphics Advanced Real-time Shadowing. Marius Bjørge ARM

Hardware- Software Co-design at Arm GPUs

Developing the Bifrost GPU architecture for mainstream graphics

Course Recap + 3D Graphics on Mobile GPUs

Mali Developer Resources. Kevin Ho ARM Taiwan FAE

Optimizing Mobile Games with ARM. Solo Chang Staff Applications Engineer, ARM

Press Briefing SIGGRAPH 2015 Neil Trevett Khronos President NVIDIA Vice President Mobile Ecosystem. Copyright Khronos Group Page 1

PowerVR Performance Recommendations. The Golden Rules

Vulkan and Animation 3/13/ &height=285&playerId=

LPGPU Workshop on Power-Efficient GPU and Many-core Computing (PEGPUM 2014)

Dave Shreiner, ARM March 2009

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Hardware-driven Visibility Culling Jeong Hyun Kim

Next Generation OpenGL Neil Trevett Khronos President NVIDIA VP Mobile Copyright Khronos Group Page 1

Introducing Metal 2. Graphics and Games #WWDC17. Michal Valient, GPU Software Engineer Richard Schreyer, GPU Software Engineer

Lecture 25: Board Notes: Threads and GPUs

Mali-G72 Enabling tomorrow s technology today

Could you make the XNA functions yourself?

ARM. Mali GPU. OpenGL ES Application Optimization Guide. Version: 3.0. Copyright 2011, 2013 ARM. All rights reserved. ARM DUI 0555C (ID102813)

Moving Mobile Graphics: Mobile Graphics 101 Andrew Garrard, Samsung R&D Institute UK SIGGRAPH. All Rights Reserved

Mali-400 MP: A Scalable GPU for Mobile Devices Tom Olson

Drawing Fast The Graphics Pipeline

Shaders. Slide credit to Prof. Zwicker

ASYNCHRONOUS SHADERS WHITE PAPER 0

GUERRILLA DEVELOP CONFERENCE JULY 07 BRIGHTON

Shadows for Many Lights sounds like it might mean something, but In fact it can mean very different things, that require very different solutions.

The Application Stage. The Game Loop, Resource Management and Renderer Design

Breaking Down Barriers: An Intro to GPU Synchronization. Matt Pettineo Lead Engine Programmer Ready At Dawn Studios

Intel Core 4 DX11 Extensions Getting Kick Ass Visual Quality out of the Latest Intel GPUs

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Hardware-driven visibility culling

VR Rendering Improvements Featuring Autodesk VRED

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

Engine Development & Support Team Lead for Korea UE4 Mobile Team Lead

Programming Tips For Scalable Graphics Performance

Copyright Khronos Group Page 1. Vulkan Overview. June 2015

Press Briefing SIGGRAPH 2015 Neil Trevett Khronos President NVIDIA Vice President Mobile Ecosystem. Copyright Khronos Group Page 1

The Rasterization Pipeline

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Lecture 9: Deferred Shading. Visual Computing Systems CMU , Fall 2013

Optimizing Mobile Games with Gameloft and ARM

PowerVR Series5. Architecture Guide for Developers

Free Downloads OpenGL ES 3.0 Programming Guide

CS GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1. Markus Hadwiger, KAUST

Vulkan on Android: Gotchas and best practices

Copyright Khronos Group 2012 Page 1. Teaching GL. Dave Shreiner Director, Graphics and GPU Computing, ARM 1 December 2012

GPU Memory Model Overview

GPU Memory Model. Adapted from:

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

More frames per second. Alex Kan and Jean-François Roy GPU Software

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

Programmable GPUs. Real Time Graphics 11/13/2013. Nalu 2004 (NVIDIA Corporation) GeForce 6. Virtua Fighter 1995 (SEGA Corporation) NV1

DX10, Batching, and Performance Considerations. Bryan Dudash NVIDIA Developer Technology

Lecture 13: OpenGL Shading Language (GLSL)

Scheduling the Graphics Pipeline on a GPU

Investigating real-time rendering techniques approaching realism using the Vulkan API

Optimisation. CS7GV3 Real-time Rendering

ARM. Mali GPU. OpenGL ES Application Optimization Guide. Version: 2.0. Copyright 2011, 2013 ARM. All rights reserved. ARM DUI 0555B (ID051413)

Coding OpenGL ES 3.0 for Better Graphics Quality

Graphics Processing Unit Architecture (GPU Arch)

PowerVR Performance Recommendations The Golden Rules. October 2015

POWERVR MBX. Technology Overview

Case 1:17-cv SLR Document 1-3 Filed 01/23/17 Page 1 of 33 PageID #: 60 EXHIBIT C

Metal for OpenGL Developers

PROFESSIONAL VR: AN UPDATE. Robert Menzel, Ingo Esser GTC 2018, March

Rendering Grass with Instancing in DirectX* 10

RSX Best Practices. Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog

SIGGRAPH Briefing August 2014

Enhancing Traditional Rasterization Graphics with Ray Tracing. October 2015

Mali-G72: Enabling tomorrow s technology today

Hello & welcome to this last part about many light rendering on mobile hardware, i.e. on devices like smart phones and tablets.

WebGL and GLSL Basics. CS559 Fall 2015 Lecture 10 October 6, 2015

Copyright Khronos Group, Page Graphic Remedy. All Rights Reserved

Pipeline Operations. CS 4620 Lecture Steve Marschner. Cornell CS4620 Spring 2018 Lecture 11

Building scalable 3D applications. Ville Miettinen Hybrid Graphics

Enhancing Traditional Rasterization Graphics with Ray Tracing. March 2015

Transcription:

Vulkan on Mobile Daniele Di Donato, ARM GDC 2016

Outline Vulkan main features Mapping Vulkan Key features to ARM CPUs Mapping Vulkan Key features to ARM Mali GPUs 4

Vulkan Good match for mobile and tiling architectures Explicit multi-pass render passes No hidden costs (copies, allocs, shader recompiles, etc) Multi-threaded Low overhead The driver is lightweight and doesn t execute any error checking or validation No more safety net as in OpenGL..but freedom to squeeze performance 5

Vulkan Main Features for Mobile Multi-threading (even mid-range phones now have 4 cores) Most of the function doesn t need to be externally synchronized See chapter 2.4 Threading Behaviour of the spec Multi-pass Render Passes Able to exploit faster Tile cache memory on mobile See chapter 7 Render Pass of the spec Other features Independent samplers and textures Ability to use the same Sampler configuration to access multiple textures Low level memory bindings Resource creation doesn t allocate backing memory Sparse memory bindings Backing memory can be assigned completely or partially at runtime (use case: virtual textures) 6

Multi-Threading in Vulkan The Vulkan spec guarantees that some of its core functions don t need to be explicitly synchronized by the programmer (see chapter 2.4 of the Vulkan spec) This allows multiple threads to call the same functions or set of functions at the same time The typical use-cases are: Command buffer construction: Multiple threads build various command buffers at the same time based on the grouping made by the engine Shaders compilation: Multiple threads compile the shaders used Memory bindings: Multiple threads compute the memory requirements for the textures and allocate/assign it at runtime (multiple virtualtextures update) 7

Multiprocessing Support in ARM CPUs Inside the single core ARM SIMD Neon Allows vectorization of operations Typically used to speed-up the vector math used for physics, animations, etc. Really useful if the task to solve is sequential and cannot be parallelized or multithread overhead is not worth. Across all cores ARM big.little Able to chose between: big cores: High performance core used for hi-load tasks LITTLE cores: High efficiency cores for low-medium load tasks Schedules and migrates tasks according to the load Provides best trade-off within performance and power consumption 8

Multi-Pass in Vulkan Similar to Pixel Local Storage introduced by ARM Especially on Tiled GPUs: allows the driver to perform various optimizations when each pixel rendered in a subpass accesses the results of the previous subpass at the same pixel location All the data can be contained and remain on the fast on-chip memory Some use-cases: Deferred Rendering Tone-mapping Soft-particles (1 st subpass renders the solid geometry and the 2 nd renders the particles accessing the depth information) 10

Deferred Tile-Based Rendering 101 Typical tile-based rendering Fragment shader GPU Tile cache Framebuffer (Main memory) Tiler Heap/ Vertex Shaders ouput Shader Read-back from memory (slow) (Texture)FB 11

Deferred Tile-Based Rendering 101 Multi-pass tile-based Rendering Fragment shader Subpass 1 writes to its color attachment Subpass 2 uses color attachment from subpass 1 as input attachment The Tile cache is transferred to main memoy at the end of all subpasses GPU Tile cache Framebuffer (Main memory) Tiler Heap/ Vertex Shaders ouput Shader 12

Other Mali Features Available Enabled by you: ASTC texture compression Included in the Vulkan core spec Early-Z Avoids fragment shading for occluded pixels, sorting front-to-back of opaque geometry gives best results 4x MSAA Multisampling algorithm happening on Tile memory for free Automatically enabled: AFBC (Arm Frame Buffer Compression) Transparently reduces the memory bandwidth required to save energy Transaction Elimination Avoids the computations related to a tile if it s unchanged from the previous frame (UI and 2D games with static props will benefit from this feature) Forward Pixel Kill 13

Thank you! The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright 2016 ARM Limited

To Find Out More. ARM Booth #1624 on Expo Floor: Live demos of the techniques shown in this session In-depth Q&A with ARM engineers More tech talks at the ARM Lecture Theatre http://malideveloper.arm.com/gdc2016: Revisit this talk in PDF and video format post GDC Download the tools and resources 15

More Talks From ARM at GDC 2016 Available post-show at the Mali Developer Center: malideveloper.arm.com/ Vulkan on Mobile with Unreal Engine 4 Case Study Weds. 9:30am, West Hall 3022 Making Light Work of Dynamic Large Worlds Weds. 2pm, West Hall 2000 Achieving High Quality Mobile VR Games Thurs. 10am, West Hall 3022 Optimize Your Mobile Games With Practical Case Studies Thurs. 11:30am, West Hall 2404 An End-to-End Approach to Physically Based Rendering Fri. 10am, West Hall 2020 16

Marius Bjørge Graphics Research Engineer GDC 2016

Agenda Overview Command Buffers Synchronization Memory Shaders and Pipelines Descriptor sets Render passes 18

Overview OpenGL OpenGL is mainly single-threaded Drawcalls are normally only submitted on main thread Multiple threads with shared GL contexts mainly used for texture streaming OpenGL has a lot of implicit behaviour Dependency tracking of resources Compiling shader combinations based on render state Splitting up workloads All this adds API overhead! 19

Overview Vulkan Vulkan is designed from the ground up to allow efficient multi-threading behaviour Vulkan is explicit in nature Applications must track resource dependencies to avoid deleting anything that might still be used by the GPU or CPU Little API overhead Vulkan is very verbose in terms of lines of code Getting a simple Hello Triangle running requires ~1000 lines of code 20

Overview Vulkan To get the most out of Vulkan you probably have to think about re-designing your graphics engine Migrating from OpenGL to Vulkan is not trivial Some things to keep in mind: Do you really need Vulkan for your project? Portability? 21

Command Buffers Used to record commands which are later submitted to a device for execution This includes draw/dispatch, texture uploads, etc. Primary and secondary command buffers Command buffers work independently from each other Contains all state No inheritance of state between command buffers 22

Command Buffers vkbegincommandbuffer vkcmdbeginrenderpass vkcmdexecutecommands Secondary commands Secondary commands Secondary commands Secondary commands vkcmdendrenderpass vkendcommandbuffer vkqueuesubmit 23

Synchronization Submitted work is completed out of order by the GPU Dependencies must be tracked by the application Using output from a previous render pass Using output from a compute shader Etc Synchronization primitives in Vulkan Pipeline barriers and events Fences Semaphores 24

Allocating Memory Memory is first allocated and then bound to Vulkan objects Different Vulkan objects may have different memory requirements Allows for aliasing memory across different Vulkan objects Driver does no ref counting of any objects in Vulkan Cannot free memory until you are sure it is never going to be used again Most of the memory allocated during run-time is transient Allocate, write and use in the same frame Block based memory allocator 25

Block Based Memory Allocator Relaxes memory reference counting Only entire blocks are freed/recycled 26

Image Layout Transitions Must match how the image is used at any time Pedantic or relaxed Some implementations might require careful tracking of previous and new layout to achieve optimal performance For Mali we can be quite relaxed with this most of the time we can keep the image layout as VK_IMAGE_LAYOUT_GENERAL 27

Pipelines Vulkan bundles state into big monolithic pipeline state objects Driver has full knowledge during shader compilation vkcreategraphicspipelines(...) ; Pipeline State vkbeginrenderpass(...); vkcmdbindpipeline(pipeline); vkcmddraw(...); vkendrenderpass(...); Dynamic State Blending State Raster State Pipeline Layout Shaders Depth Stencil Input Assembly Framebuffer Formats Vertex Input 28

Pipelines In an ideal world All pipeline combinations should be created upfront but this requires detailed knowledge of every potential shader/state combination that you might have in your scene As an example, a typical fragment shader in a graphics engine such as Unreal may have ~9 000 combinations Every one of these shaders can use different render state We also have to make sure the pipelines are bound to compatible render passes An explosion of combinations! 29

Pipeline Cache Result of the pipeline construction can be re-used between pipelines Can be stored out to disk and re-used next time you run the application 30

Shaders Vulkan standardized on SPIR-V Khronos reference compiler Outputs SPIR-V from your GLSL shader sources GL_KHR_vulkan_glsl Can be easily integrated into your graphics engine 31

Descriptor Sets Textures, uniform buffers, etc. are bound to shaders in descriptor sets Hierarchical invalidation Order descriptor sets by update frequency Ideally all descriptors are pre-baked during level load Keep track of low level descriptor sets per material But, this is not trivial Simple solution: Keep track of bindings and update descriptor sets when necessary Keep around cache for non-dynamic descriptor sets 32

SPIR-V reflection Introducing SPIR2CROSS Convert SPIR-V to readable GLSL https://github.com/arm-software/spir2cross Using SPIR2CROSS we can retrieve information about bindings as well as inputs and outputs directly form the SPIR-V binary This is useful information when creating or re-using existing pipeline layouts and descriptor set layouts Also allows us to easily re-use compatible pipeline layouts across a bunch of different shader combinations 33

Push Constants Push constants replace non-opaque uniforms Think of them as small, fast-access uniform buffer memory Update in Vulkan with vkcmdpushconstants Directly mapped to registers on Mali GPUs // New layout(push_constant, std430) uniform PushConstants { mat4 MVP; vec4 MaterialData; } RegisterMapped; // Old, no longer supported in Vulkan GLSL uniform mat4 MVP; uniform vec4 MaterialData; 34

Render Passes Describes the beginning and end of rendering to a framebuffer Render passes in Vulkan are very explicit Declare when a render pass begins Load, discard or clear the framebuffer? Declare when a render pass ends Which parts do you need to be committed to memory? 35

Subpass Inputs Vulkan supports subpasses within render passes Standardized GL_EXT_shader_pixel_local_storage! // GLSL #extension GL_EXT_shader_pixel_local_storage : require pixel_local_inext GBuffer { layout(rgba8) vec4 albedo; layout(rgba8) vec4 normal;... } pls; // Vulkan layout(input_attachment_index = 0) uniform subpassinput albedo; layout(input_attachment_index = 1) uniform subpassinput normal;... 36

Niklas Smedis Smedberg Technical Director, Platform Partnerships GDC 2016

UE4 ProtoStar Demo Goals: Impressive real-time graphics to showcase UE4 Vulkan on Samsung Galaxy S7 Must be the best in the world Problems to overcome: Mobile graphics features did not exist in UE4 yet Vulkan API did not exist yet Driver did not exist yet Device did not exist yet Not enough time or people 38

UE4 Vulkan: A First Attempt 39

UE4 Vulkan: Final Results 40

UE4 Vulkan Thoughts One queue One big command buffer per frame (array round-robin reused) Multi-threaded rendering a great future opportunity in Vulkan External synchronization Semaphores for synchronization on GPU (GPU wait / ordering) Fences for synchronization on CPU (CPU wait / check for completion) Recording command buffer Simplest usage-case: Instancing Nice usage-case: VR (stereoscopic rendering) 41

UE4 Vulkan Source Code UE4 Vulkan source code on github soon! unrealengine.com Read and learn Experiment with Vulkan Make something fun! 42

Galaxy Jungwoo Kim Principle Engineer, Samsung Mobile Graphics Team GDC 2016

Galaxy 3 year long-term project started in 2012 by Samsung 1.5 year contribution for Vulkan within Khronos group 1 year collaboration with our partners for the demo But this is just the beginning

Samsung Mobile is planning a game developer support program Official announcement at Samsung Developer Conference in April Samsung wants to engage with the game developer community Samsung also wishes to support Vulkan game developers Galaxy GameDev #GalaxyGameDev