Vulkan Multipass mobile deferred done right

Similar documents
Achieving High-performance Graphics on Mobile With the Vulkan API

Vulkan Subpasses. or The Frame Buffer is Lava. Andrew Garrard Samsung R&D Institute UK. UK Khronos Chapter meet, May 2016

Rendering Structures Analyzing modern rendering on mobile

Vulkan on Mobile. Daniele Di Donato, ARM GDC 2016

Optimizing and Profiling Unity Games for Mobile Platforms. Angelo Theodorou Senior Software Engineer, MPG Gamelab 2014, 25 th -27 th June

Using SPIR-V in practice with SPIRV-Cross

Vulkan: Architecture positive How Vulkan maps to PowerVR GPUs Kevin sun Lead Developer Support Engineer, APAC PowerVR Graphics.

GUERRILLA DEVELOP CONFERENCE JULY 07 BRIGHTON

Bifrost - The GPU architecture for next five billion

The Bifrost GPU architecture and the ARM Mali-G71 GPU

Achieving Console Quality Games on Mobile

Vulkan API 杨瑜, 资深工程师

Deferred Rendering Due: Wednesday November 15 at 10pm

Applications of Explicit Early-Z Z Culling. Jason Mitchell ATI Research

Whiz-Bang Graphics and Media Performance for Java Platform, Micro Edition (JavaME)

Moving Mobile Graphics Advanced Real-time Shadowing. Marius Bjørge ARM

Profiling and Debugging Games on Mobile Platforms

Developing the Bifrost GPU architecture for mainstream graphics

PowerVR Hardware. Architecture Overview for Developers

Lecture 2. Shaders, GLSL and GPGPU

PowerVR Performance Recommendations. The Golden Rules

Hi everyone! My name is Niklas and I ve been working at EA for the last 2.5 years. My main responsibility there have been to bring up the graphics

Engine Development & Support Team Lead for Korea UE4 Mobile Team Lead

Rendering Grass with Instancing in DirectX* 10

Bringing AAA graphics to mobile platforms. Niklas Smedberg Senior Engine Programmer, Epic Games

Copyright Khronos Group, Page Graphic Remedy. All Rights Reserved

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

Hello & welcome to this last part about many light rendering on mobile hardware, i.e. on devices like smart phones and tablets.

Vulkan (including Vulkan Fast Paths)

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

Mali Demos: Behind the Pixels. Stacy Smith

Rendering Objects. Need to transform all geometry then

Optimizing Mobile Games with ARM. Solo Chang Staff Applications Engineer, ARM

Lecture 9: Deferred Shading. Visual Computing Systems CMU , Fall 2013

CMSC427 Advanced shading getting global illumination by local methods. Credit: slides Prof. Zwicker

Optimizing Mobile Games with Gameloft and ARM

PowerVR Series5. Architecture Guide for Developers

Windowing System on a 3D Pipeline. February 2005

TSBK03 Screen-Space Ambient Occlusion

Mali Developer Resources. Kevin Ho ARM Taiwan FAE

Graphics Performance Optimisation. John Spitzer Director of European Developer Technology

Hardware- Software Co-design at Arm GPUs

Working with Metal Overview

Real - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský

Computer Graphics (CS 563) Lecture 4: Advanced Computer Graphics Image Based Effects: Part 2. Prof Emmanuel Agu

The Application Stage. The Game Loop, Resource Management and Renderer Design

GeForce4. John Montrym Henry Moreton

The Making of Seemore WebGL. Will Eastcott, CEO, PlayCanvas

Evolution of GPUs Chris Seitz

EECS 487: Interactive Computer Graphics

Mobile HW and Bandwidth

ARM Mali Application Developer Best Practices

Rendering. Converting a 3D scene to a 2D image. Camera. Light. Rendering. View Plane

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

DEFERRED RENDERING STEFAN MÜLLER ARISONA, ETH ZURICH SMA/

Mali-G72 Enabling tomorrow s technology today

Graphics Hardware. Instructor Stephen J. Guy

Case 1:17-cv SLR Document 1-3 Filed 01/23/17 Page 1 of 33 PageID #: 60 EXHIBIT C

Rasterization Overview

3D Authoring Tool BS Content Studio supports Deferred Rendering for improved visual quality

LPGPU Workshop on Power-Efficient GPU and Many-core Computing (PEGPUM 2014)

Enhancing Traditional Rasterization Graphics with Ray Tracing. March 2015

MAXIS-mizing Darkspore*: A Case Study of Graphic Analysis and Optimizations in Maxis Deferred Renderer

Next Generation OpenGL Neil Trevett Khronos President NVIDIA VP Mobile Copyright Khronos Group Page 1

CS452/552; EE465/505. Clipping & Scan Conversion

1.2.3 The Graphics Hardware Pipeline

Hardware-driven visibility culling

Shaders (some slides taken from David M. course)

PowerVR Performance Recommendations The Golden Rules. October 2015

Dave Shreiner, ARM March 2009

D3D12 & Vulkan: Lessons learned. Dr. Matthäus G. Chajdas Developer Technology Engineer, AMD

The Rasterization Pipeline

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

Unreal Engine 4: Mobile Graphics on ARM CPU and GPU Architecture

Programming Graphics Hardware

Graphics Processing Unit Architecture (GPU Arch)

Copyright Khronos Group Page 1

Mali-400 MP: A Scalable GPU for Mobile Devices Tom Olson

CS4620/5620: Lecture 14 Pipeline

Drawing Fast The Graphics Pipeline

Coding OpenGL ES 3.0 for Better Graphics Quality

Direct Rendering of Trimmed NURBS Surfaces

Course Recap + 3D Graphics on Mobile GPUs

Low-Overhead Rendering with Direct3D. Evan Hart Principal Engineer - NVIDIA

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Porting Roblox to Vulkan. Arseny

Enabling immersive gaming experiences Intro to Ray Tracing

RSX Best Practices. Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog

ASYNCHRONOUS SHADERS WHITE PAPER 0

Introduction to Shaders.

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

CS451Real-time Rendering Pipeline

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

Could you make the XNA functions yourself?

ARM. Mali GPU. OpenGL ES Application Optimization Guide. Version: 3.0. Copyright 2011, 2013 ARM. All rights reserved. ARM DUI 0555C (ID102813)

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

Accelerating Realism with the (NVIDIA Scene Graph)

Dominic Filion, Senior Engineer Blizzard Entertainment. Rob McNaughton, Lead Technical Artist Blizzard Entertainment

Applications of Explicit Early-Z Culling

Edge Detection. Whitepaper

Transcription:

Vulkan Multipass mobile deferred done right Hans-Kristian Arntzen Marius Bjørge Khronos 5 / 25 / 2017

Content What is multipass? What multipass allows... A driver to do versus MRT Developers to do Transient images and lazy memory Case studies Baseline app Sponza «Lofoten» demo 2

What is Multipass? Renderpasses can have multiple subpasses Subpasses can have dependencies between each other Render pass graphs Subpasses refer to subset of attachments 3

Improved MRT deferred shading in Vulkan Classic deferred has two render passes G-Buffer pass, render to ~4 textures Lighting pass, read from G-Buffer, accumulate light Lighting pass only reads G-Buffer at gl_fragcoord Rethinking this in terms of Vulkan multipass Two subpasses Dependencies COLOR DEPTH -> INPUT_ATTACHMENT COLOR DEPTH_READ VK_DEPENDENCY_BY_REGION_BIT 4

Tile-based GPUs 101 Tile-based GPUs batch up and bin all primitives in a render pass to tiles In fragment processing later, render one tile at a time Hardware knows all primitives which cover a tile Advantage, framebuffer is now in fast and small SRAM! Having framebuffer in on-chip SRAM has practical benefits Read/write to it is cheap, no external bandwidth cost Main memory is written to when tile is complete GPU Vertex Shader Tiler Fragment Shader Local Tile Memory DDR Vertex Attributes Geometry Working Set (Tile List and Varyings) Textures Framebuffer 5

Tile-based GPU subpass fusing Subpass information is known ahead of time VkRenderPass Driver can find two or more sub-passes which have BY_REGION dependencies no external side effects which might prevent fusing Fuse G-Buffer and Lighting passes Combine draw calls from G-Buffer and Lighting into one render pass G-Buffer content can remain in on-chip SRAM Reading G-Buffer data in lighting pass just needs to read tile buffer vkcmdnextsubpass essentially becomes a noop 6

Vulkan GLSL subpassload() Reading from input attachments in Vulkan is special Special image type in SPIR-V On vkcreategraphicspipelines we know renderpass subpassindex subpassload() either becomes texelfetch()-like if subpasses were not fused This is why we need VK_DESCRIPTOR_TYPE_INPUT_ATTACHMENT magicreadfromtilebuffer() if subpasses were fused Compiler knows ahead of time No last-minute shader patching required 7

Transient attachments After the lighting pass, G-Buffer data is not needed anymore G-Buffer data only needs to live on the on-chip SRAM Clear on render pass begin, no need to read from main memory storeop is DONT_CARE, so never actually written out to main memory Vulkan exposes lazily allocated memory imageusage = VK_IMAGE_USAGE_TRANSIENT_ATTACHMENT_BIT memoryproperty = VK_MEMORY_PROPERTY_LAZILY_ALLOCATED_BIT On tilers, no need to back these images with physical memory 8

Multipass benefits everyone Deferred paths essentially same for mobile and desktop Same Vulkan code (*) Same shader code (*) VkRenderPass contains all information it needs Desktop can enjoy more informed scheduling decisions Latest desktop GPU iterations seem to be moving towards tile-based At worst, it s just classic MRT (*) Minor tweaking to G-Buffer layout may apply 9

Baseline test Basic multipass sample One renderpass Light on geometry ~8 large lights Simple shading Benchmark Multipass (subpass fusing) MRT Overall performance Bandwidth 11

Baseline test data Measured on Galaxy S7 (Exynos) 4096x2048 resolution Hit V-Sync at native 1440p ~30% FPS improvement ~80% bandwidth reduction Only using albedo and normals Saving bandwidth is vital for mobile GPUs MRT Multipass 13

Sponza test Mid-high complexity geometry for mobile ~200K tris drawn / frame ~610 spot lights Full PBR Shadowmaps on some of the lights 8 reflection probes Directional light w/ shadows Full HDR pipeline Adaptive tonemapping Bloom 14

Stencil culling Update stencil state Update stencil state Render light with front face culling Render light with back face culling 19

Clustered stencil culling Classic method of per light stencil culling involves a lot of state toggling Can be expensive even on Vulkan Usually performs poorly on GPU Instead, cluster local lights along camera Z axis Each light is assigned to a cluster using conservative depth Stencil is written for all local lights in a single pass Peel depth for an extra 2-bit depth buffer in stencil Lights are finally rendered using stencil with double sided depth test 20

Sponza test data Far, far heavier than sensible mobile content 1440p (native) Overkill ~50-60% bandwidth reduction ~18% FPS increase Multipass MRT 22

«Lofoten» City scene with ~2.5 million primitives Relies heavily on CPU based culling techniques to reduce geometry load 100 spot lights and point lights w/shadow maps Reflection probes Sun light with cascaded shadow maps Atmospheric scattering Ocean simulation FFT running GPU compute Separate refraction pass Screen space reflections Bloom post-processing w/adaptive luminance tone mapping 24

Not your typical mobile graphics pipeline FFT update Render shadow depth G-buffer pass Lighting pass Bloom threshold Compute average luminance Multi-rate blur Tonemap/ composite 25

Implementation details G-buffer pass On tile-based GPUs, fill rate!= bandwidth Emissive/forward materials written directly to light buffer Stencil set to mark reflection influence Depth peeling for clustered stencil culling Lighting pass Lighting accumulated using additive blending to lighting attachment Clustered stencil culling used for local lights Transparent objects Fog applied after shading is complete Only commits light buffer to memory 26

Render passes Early decision to also make the high level interface explicit in terms of defining render passes Possible to back-port to OpenGL etc BeginRenderPass() NextSubPass() EndRenderPass() 27

Integration for transient and lazy images Add support for virtual attachments Keep a pool of images allocated with TRANSIENT_ATTACHMENT_BIT Actual image handles not visible to the API user multipass.virtualattachments = { RGBA8, RGBA8, RGB10A2 }; mutipass.numsubpasses = 2; multipass.subpass[0].colortargets = { RT0, VIRTUAL0, VIRTUAL1, VIRTUAL2 }; multipass.subpass[1].colortargets = { RT0 }; multipass.subpass[1].inputs = { VIRTUAL0, VIRTUAL1, VIRTUAL2, DEPTH }; renderpass.multipass = &multipass; BeginRenderPass(&renderpass); 28

Multipass virtual attachments // Render G-buffer pcb->beginrenderpass(pcommandbuffer, KINSO_RENDERPASS_LOAD_CLEAR_ALL KINSO_RENDERPASS_STORE_COLOR, Vec4(0.0f), 1.0f, 0, &m_multipass); DrawGeometry(); // Increment to the next subpass. pcb->nextsubpass(); // Render lights additively. DrawLightGeometry(); // Finally apply environment effects ApplyEnv(); pcb->endrenderpass(); 29

«Lofoten» test data Even heavier than Sponza 1440p (native) Overkill ~50% bandwidth save ~12% FPS increase ~25% GPU energy save! MRT Multipass 30

Performance considerations for tilers With on-chip SRAM, per-pixel buffer size is limited G-Buffer size is therefore limited On current Mali hardware, 128 bits color targets per pixel May vary between GPUs and vendors Smaller tiles may allow for larger G-Buffer At a quite large performance penalty Fewer threads active, worse occupancy on shader core Need to scan through tile list more 32

Engine integration for multiple APIs Render pass concept in engine is a must Cannot express multiple subpasses otherwise subpassload() is unique to Vulkan GLSL Solution #1, Vulkan GLSL is main shading language SPIRV-Cross can remap subpassload() to MRT-style texelfetch Solution #2, HLSL or similar Make your own intrinsic which emits subpassload in Vulkan Unroll multipass to multiple passes in other APIs Change render targets on NextSubpass() Bind input attachments for subpass to texture units Statically remap input_attachment_index -> texture unit in shader 33

Handling image layouts G-Buffer images are by design only used temporarily No need to track layouts Application should not have direct access to these images! Can use external subpass dependencies for transition VkSubpassDependency dep = {}; dep.srcsubpass = VK_SUBPASS_EXTERNAL; dep.dstsubpass = 0; attachment.initiallayout = VK_IMAGE_LAYOUT_UNDEFINED; reference.layout = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL; dep.srcaccessmask = COLOR_ATTACHMENT_WRITE; dep.dstaccessmask = COLOR_ATTACHMENT_READ WRITE; dep.srcstagemask = COLOR_ATTACHMENT_OUTPUT; dep.dststagemask = COLOR_ATTACHMENT_OUTPUT; 34

Questions? The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners. Copyright 2017 ARM Limited