Get the most out of the new OpenGL ES 3.1 API. Hans-Kristian Arntzen Software Engineer

Similar documents
Compute Shaders. Christian Hafner. Institute of Computer Graphics and Algorithms Vienna University of Technology

Unleash the Benefits of OpenGL ES 3.1 and the Android Extension Pack. Nathan Li SGL & Staff Engineer, ARM

A Review of OpenGL Compute Shaders

OpenGL Compute Shaders

Working with Metal Overview

CS195V Week 6. Image Samplers and Atomic Operations

OpenGL Compute Shaders

GDC 2014 Barthold Lichtenbelt OpenGL ARB chair

Information Coding / Computer Graphics, ISY, LiTH. Compute shaders!! The future of GPU computing or a late rip-off of Direct Compute?

Achieving High-performance Graphics on Mobile With the Vulkan API

GLSL Overview: Creating a Program

Getting the Most Out of OpenGL ES. Daniele Di Donato, Tom Olson, and Dave Shreiner ARM

Many rendering scenarios, such as battle scenes or urban environments, require rendering of large numbers of autonomous characters.

Deferred Splatting. Gaël GUENNEBAUD Loïc BARTHE Mathias PAULIN IRIT UPS CNRS TOULOUSE FRANCE.

DirectCompute Performance on DX11 Hardware. Nicolas Thibieroz, AMD Cem Cebenoyan, NVIDIA

BOF Siggraph 2012 Barthold Lichtenbelt OpenGL ARB chair

OpenGL Shading Language Hints/Kinks

Real - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský

The Bifrost GPU architecture and the ARM Mali-G71 GPU

ARM. Mali GPU. OpenGL ES Application Optimization Guide. Version: 3.0. Copyright 2011, 2013 ARM. All rights reserved. ARM DUI 0555C (ID102813)

Mali & OpenGL ES 3.0. Dave Shreiner Jon Kirkham ARM. Game Developers Conference 27 March 2013

GPU Memory Model. Adapted from:

Particle Systems with Compute & Geometry Shader. Institute of Computer Graphics and Algorithms Vienna University of Technology

Shaders. Slide credit to Prof. Zwicker

Building scalable 3D applications. Ville Miettinen Hybrid Graphics

Vulkan: Architecture positive How Vulkan maps to PowerVR GPUs Kevin sun Lead Developer Support Engineer, APAC PowerVR Graphics.

Copyright Khronos Group, Page Graphic Remedy. All Rights Reserved

Vulkan on Mobile. Daniele Di Donato, ARM GDC 2016

Bifrost - The GPU architecture for next five billion

The Application Stage. The Game Loop, Resource Management and Renderer Design

Starting out with OpenGL ES 3.0. Jon Kirkham, Senior Software Engineer, ARM

Copyright Khronos Group 2012 Page 1. Teaching GL. Dave Shreiner Director, Graphics and GPU Computing, ARM 1 December 2012

OpenGL BOF Siggraph 2011

Graphics Performance Optimisation. John Spitzer Director of European Developer Technology

Mobile Performance Tools and GPU Performance Tuning. Lars M. Bishop, NVIDIA Handheld DevTech Jason Allen, NVIDIA Handheld DevTools

Programming Guide. Aaftab Munshi Dan Ginsburg Dave Shreiner. TT r^addison-wesley

CS179 GPU Programming Introduction to CUDA. Lecture originally by Luke Durant and Tamas Szalay

Mali Developer Resources. Kevin Ho ARM Taiwan FAE

Graphics Programming. Computer Graphics, VT 2016 Lecture 2, Chapter 2. Fredrik Nysjö Centre for Image analysis Uppsala University

PERFORMANCE OPTIMIZATIONS FOR AUTOMOTIVE SOFTWARE

Why modern versions of OpenGL should be used Some useful API commands and extensions

OpenGL with Qt 5. Qt Developer Days, Berlin Presented by Sean Harmer. Produced by Klarälvdalens Datakonsult AB

Vulkan Multipass mobile deferred done right

VR Rendering Improvements Featuring Autodesk VRED

Adaptive Point Cloud Rendering

Best practices for effective OpenGL programming. Dan Omachi OpenGL Development Engineer

COMP371 COMPUTER GRAPHICS

Understanding M3G 2.0 and its Effect on Producing Exceptional 3D Java-Based Graphics. Sean Ellis Consultant Graphics Engineer ARM, Maidenhead

OpenGL on Android. Lecture 7. Android and Low-level Optimizations Summer School. 27 July 2015

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

OpenGL Status - November 2013 G-Truc Creation

PowerVR Hardware. Architecture Overview for Developers

Optimizing and Profiling Unity Games for Mobile Platforms. Angelo Theodorou Senior Software Engineer, MPG Gamelab 2014, 25 th -27 th June

Introducing Metal 2. Graphics and Games #WWDC17. Michal Valient, GPU Software Engineer Richard Schreyer, GPU Software Engineer

Profiling and Debugging Games on Mobile Platforms

CS 432 Interactive Computer Graphics

PowerVR Performance Recommendations. The Golden Rules

Octree-Based Sparse Voxelization for Real-Time Global Illumination. Cyril Crassin NVIDIA Research

Programmable Graphics Hardware

Using SPIR-V in practice with SPIRV-Cross

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

Dave Shreiner, ARM March 2009

Lecture 13: OpenGL Shading Language (GLSL)

Could you make the XNA functions yourself?

Grafica Computazionale: Lezione 30. Grafica Computazionale. Hiding complexity... ;) Introduction to OpenGL. lezione30 Introduction to OpenGL

Sign up for crits! Announcments

Antonio R. Miele Marco D. Santambrogio

Vulkan (including Vulkan Fast Paths)

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

GCN Performance Tweets AMD Developer Relations

Metal for OpenGL Developers

SPU Render. Arseny Zeux Kapoulkine CREAT Studios

Optimizing Graphics Applications

Optimizing Mobile Games with Gameloft and ARM

Whiz-Bang Graphics and Media Performance for Java Platform, Micro Edition (JavaME)

GRAPHICS CONTROLLERS

ARM. Mali GPU. OpenGL ES Application Optimization Guide. Version: 2.0. Copyright 2011, 2013 ARM. All rights reserved. ARM DUI 0555B (ID051413)

LOD and Occlusion Christian Miller CS Fall 2011

GPU Memory Model Overview

Lecture 9: Deferred Shading. Visual Computing Systems CMU , Fall 2013

More frames per second. Alex Kan and Jean-François Roy GPU Software

NVIDIA Parallel Nsight. Jeff Kiel

EECE 478. Learning Objectives. Learning Objectives. Rasterization & Scenes. Rasterization. Compositing

Parallel Programming on Larrabee. Tim Foley Intel Corp

Programming with OpenGL Shaders I. Adapted From: Ed Angel Professor of Emeritus of Computer Science University of New Mexico

OUTLINE. Implementing Texturing What Can Go Wrong and How to Fix It Mipmapping Filtering Perspective Correction

Graphics Processing Unit Architecture (GPU Arch)

Rendering Objects. Need to transform all geometry then

OpenGL SUPERBIBLE. Fifth Edition. Comprehensive Tutorial and Reference. Richard S. Wright, Jr. Nicholas Haemel Graham Sellers Benjamin Lipchak

CUDA (Compute Unified Device Architecture)

last time put back pipeline figure today will be very codey OpenGL API library of routines to control graphics calls to compile and load shaders

WebGL (Web Graphics Library) is the new standard for 3D graphics on the Web, designed for rendering 2D graphics and interactive 3D graphics.

Next Generation OpenGL Neil Trevett Khronos President NVIDIA VP Mobile Copyright Khronos Group Page 1

VULKAN TECHNOLOGY UPDATE Christoph Kubisch, NVIDIA GTC 2017 Ingo Esser, NVIDIA

Pipeline Operations. CS 4620 Lecture Steve Marschner. Cornell CS4620 Spring 2018 Lecture 11

C P S C 314 S H A D E R S, O P E N G L, & J S RENDERING PIPELINE. Mikhail Bessmeltsev

Moving Mobile Graphics Advanced Real-time Shadowing. Marius Bjørge ARM

Lecture 5 Vertex and Fragment Shaders-1. CITS3003 Graphics & Animation

Performance OpenGL Programming (for whatever reason)

Transcription:

Get the most out of the new OpenGL ES 3.1 API Hans-Kristian Arntzen Software Engineer 1

Content Compute shaders introduction Shader storage buffer objects Shader image load/store Shared memory Atomics Synchronization Indirect commands Best practices on ARM Mali Midgard GPUs Use cases Example 2

Introduction to Compute Shaders Brings some OpenCL functionality to OpenGL ES Familiar Open GLSL syntax Random access writes to buffers and textures Sharing of data between threads 3

Introduction to Compute Shaders (cont.) Graphics Buffers Compute Vertex Textures Fragment Read Random access write FBO Sequential access write 4

Compute model Traditional graphics pipeline No random access write Implicit parallelism No synchronization between threads Compute Random access writes to buffers and textures Explicit parallelism Full synchronization and sharing between threads in same work group 5

Compute model (cont.) Work group the compute building block Independent Up to three dimensions Compute dispatch Work group Work group Work group Work group Work group Work group Work group Work group Work group Work group Work group Work group 6

Compute model (cont.) Work group Shared memory Concurrent threads Synchronization Unique identification gl_localinvocation{id,index} gl_globalinvocationid gl_workgroupid Work group Shared Memory Thread Thread Thread Thread Thread Thread 7

Hello compute world #version 310 es layout(local_size_x = 1) in; layout(std430, binding = 0) buffer Output { writeonly float data[]; } output; void main() { uint ident = gl_globalinvocationid.x; output.data[ident] = float(ident); } 8

Compiling and executing a compute shader GLuint shader = glcreateshader(gl_compute_shader); //... Compile, attach and link here. gluseprogram(program); gldispatchcompute(work_groups_x, work_groups_y, work_groups_z); 9

Shader storage buffer objects (SSBO) Writeable uniform buffers Minimum required size 128 MiB Can be unsized in shader New buffer layout for SSBOs (std430), better packing than std140 glbindbufferbase(gl_shader_storage_buffer, binding, buffer_object); layout(std430, binding = 0) buffer SomeData { float data[]; }; 10

Shader image load/store Raw read/write texel access Layering support Atomics support in OES_shader_image_atomic glbindimagetexture(0, tex, level, layered, layer, access, format); layout(r32f, binding = 0) uniform writeonly image2d myimage; imagestore(myimage, ivec2(x, y), color); 11

Shared memory Same as local address space in OpenCL Shared between threads in same work group Coherent Limited in size GL_MAX_COMPUTE_SHARED_MEMORY_SIZE Implementations must support at least 16 KiB 12

Atomic operations Dedicated atomic counters SSBOs and shared memory Add, Min/Max, Exchange, CompSwap, etc 13 glbindbufferbase(gl_atomic_counter_buffer, 0, atomic); layout(binding = 0, offset = 0) uniform atomic_uint mycounter; void main() { uint unique = atomiccounterincrement(mycounter); } shared uint sharedvarible; uint previous = atomicmax(sharedvariable, 42u);

Memory qualifiers Applies to SSBOs and images coherent Writes can be read by other shader invocations in the same command Ensure visibility with shader language memory barrier Writes only visible if they have actually happened Shared memory implicitly declared coherent readonly / writeonly volatile / restrict Same meaning as in C 14

Synchronization OpenGL ES synchronizes GL commands for you Appears to operate as-if everything is in-order Random access writes are unsynchronized - Ensure visibility to other GL commands with API memory barrier 15

Synchronization (cont.) glmemorybarrier() Ensures shader writes are visible to subsequent GL calls Specify how data is read after the barrier glbindbufferbase(gl_shader_storage_buffer, 0, vbo); gldispatchcompute(groups_x, groups_y, groups_z); // Non-blocking call! Flush caches on GPU, etc. glmemorybarrier(gl_vertex_attrib_array_barrier_bit); // Draw using updated VBO contents. gldrawelements(...); 16

Synchronization (cont.) groupmemorybarrier(), memorybarrier*() Ensures coherent writes are visible to other shader invocations Writes below barrier not visible before writes above barrier barrier() All threads in work group must reach barrier before any thread can continue Must be called from dynamically uniform control flow Does not order memory memorybarriershared() before barrier() 17

Synchronization (cont.) #version 310 es layout(local_size_x = 128) in; shared float shareddata[128]; void main() { shareddata[gl_localinvocationindex] = 0.0; // Ensure shared memory writes are visible to work group memorybarriershared(); // Ensure all threads in work group // have executed statements above barrier(); } // Entire buffer now cleared for every thread 18

Indirect commands Three new indirect commands gldrawarraysindirect gldrawelementsindirect gldispatchcomputeindirect Draw/dispatch parameters sourced from buffer object Lets GPU feed itself with work Very useful when draw parameters are not known by CPU Avoids CPU/GPU synchronization point 19

Indirect commands (cont.) struct IndirectCommand { GLuint count; GLuint instancecount; GLuint firstindex; GLuint basevertex; GLuint reservedmustbezero; }; glbindbuffer(gl_draw_indirect_buffer, command); // Update instancecount on GPU. gldrawelementsindirect(gl_triangles, GL_UNSIGNED_SHORT, NULL); 20

Indirect commands (cont.) struct IndirectDispatch { GLuint num_groups_x; GLuint num_groups_y; GLuint num_groups_z; }; glbindbuffer(gl_dispatch_indirect_buffer, command); // Update dispatch buffer on GPU. gldispatchcomputeindirect(0); 21

Best practices on ARM Mali Midgard Global memory just as fast as shared memory Avoid reads from global to shared memory just for caching Use shared if sharing of computation is needed Atomics on SSBOs just as efficient as atomic counters SSBOs cleaner anyways Cheap branching Branch on gl_localinvocationindex == 0u for expensive once-per-workgroup code paths 22

Best practices on ARM Mali Midgard (cont.) Many small indirect draws are expensive instancecount of 0 often just as expensive as 1 Use sparingly Ideal case is instancing Avoid tiny work groups Limits maximum number of concurrent threads Work group of 128 threads recommended 23

Use cases for compute Wave simulation Occlusion culling Physics Particle effects Image processing 24

Occlusion culling with compute 25

Occlusion culling with compute (cont.) 26

Occlusion culling with compute (cont.) 27

Occlusion culling with compute (cont.) Only want to draw visible meshes Frustum culling not enough OpenGL ES 3.0 occlusion query too inefficient Doesn t support instancing CPU readback required Traditional CPU methods not always viable Instance data updated every frame by GPU CPU already busy with other tasks 28

Hierarchical-Z occlusion culling Rasterize occluders to depth map Simplified occluder meshes Reduced resolution good enough (256x256) Mipmap depth manually with max() filter Test every bounding volume in parallel with compute Find screen space bounding box Sample depth at appropriate LOD Append visible per-instance data to buffer Atomic counter to increment instancecount Indirect Draw 29

Results (without culling) 30

Results (with culling) 31

Performance, Mali -T604 Vertex bound ~100 vertices per sphere ~114k spheres in scene Method Frame time (ms) Culling time (ms) Vertices (per frame) Hi-Z culling 10.8 3.3 186k Frustum culling 120.5 1.3 2837k No culling 246.9 N/A 11271k 32

Closing Compute shaders is a very useful addition to OpenGL ES Indirect features allow GPU to feed itself with work 33