Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Similar documents
Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

GPU Architecture. Michael Doggett Department of Computer Science Lund university

PowerVR Hardware. Architecture Overview for Developers

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

Vulkan: Architecture positive How Vulkan maps to PowerVR GPUs Kevin sun Lead Developer Support Engineer, APAC PowerVR Graphics.

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

A Trip Down The (2011) Rasterization Pipeline

Lecture 4: Geometry Processing. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Real-Time Reyes: Programmable Pipelines and Research Challenges. Anjul Patney University of California, Davis

Bringing AAA graphics to mobile platforms. Niklas Smedberg Senior Engine Programmer, Epic Games

Graphics Processing Unit Architecture (GPU Arch)

CS427 Multicore Architecture and Parallel Computing

EECS 487: Interactive Computer Graphics

Antonio R. Miele Marco D. Santambrogio

Real-Time Buffer Compression. Michael Doggett Department of Computer Science Lund university

Scheduling the Graphics Pipeline on a GPU

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)

PowerVR Series5. Architecture Guide for Developers

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Graphics and Imaging Architectures

Lecture 7: The Programmable GPU Core. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Mobile HW and Bandwidth

Portland State University ECE 588/688. Graphics Processors

POWERVR MBX & SGX OpenVG Support and Resources

Efficient GPU Rendering of Subdivision Surfaces. Tim Foley,

Real-Time Rendering Architectures

LPGPU Workshop on Power-Efficient GPU and Many-core Computing (PEGPUM 2014)

Dave Shreiner, ARM March 2009

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

From Shader Code to a Teraflop: How Shader Cores Work

Parallel Programming for Graphics

Optimizing and Profiling Unity Games for Mobile Platforms. Angelo Theodorou Senior Software Engineer, MPG Gamelab 2014, 25 th -27 th June

Working with Metal Overview

Real-time Graphics 9. GPGPU

Approximate Catmull-Clark Patches. Scott Schaefer Charles Loop

Threading Hardware in G80

Introduction to CUDA (1 of n*)

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

Course Recap + 3D Graphics on Mobile GPUs

GPU Architecture and Function. Michael Foster and Ian Frasch

Graphics Hardware. Instructor Stephen J. Guy

Anatomy of AMD s TeraScale Graphics Engine

The Bifrost GPU architecture and the ARM Mali-G71 GPU

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

Shaders. Slide credit to Prof. Zwicker

CS GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1. Markus Hadwiger, KAUST

Current Trends in Computer Graphics Hardware

A SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization

Bifrost - The GPU architecture for next five billion

ECE 574 Cluster Computing Lecture 16

CS130 : Computer Graphics. Tamar Shinar Computer Science & Engineering UC Riverside

Real-time Graphics 9. GPGPU

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

Ray Tracing. Computer Graphics CMU /15-662, Fall 2016

What Next? Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University. * slides thanks to Kavita Bala & many others

CSE 167: Introduction to Computer Graphics Lecture #5: Rasterization. Jürgen P. Schulze, Ph.D. University of California, San Diego Fall Quarter 2015

Direct Rendering of Trimmed NURBS Surfaces

Lecture 9: Deferred Shading. Visual Computing Systems CMU , Fall 2013

CS 179: GPU Programming

GPU for HPC. October 2010

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

OpenCL Base Course Ing. Marco Stefano Scroppo, PhD Student at University of Catania

Bifurcation Between CPU and GPU CPUs General purpose, serial GPUs Special purpose, parallel CPUs are becoming more parallel Dual and quad cores, roadm

Beyond Programmable Shading Keeping Many Cores Busy: Scheduling the Graphics Pipeline

Lecture 25: Board Notes: Threads and GPUs

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

Windowing System on a 3D Pipeline. February 2005

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

Real-Time Graphics Architecture

Rendering Grass with Instancing in DirectX* 10

The Rasterization Pipeline

Introduction to Modern GPU Hardware

GPU Architecture. Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD)

Squeezing Performance out of your Game with ATI Developer Performance Tools and Optimization Techniques

CS770/870 Fall 2015 Advanced GLSL

1. Introduction 2. Methods for I/O Operations 3. Buses 4. Liquid Crystal Displays 5. Other Types of Displays 6. Graphics Adapters 7.

Efficient and Scalable Shading for Many Lights

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

GPUs and GPGPUs. Greg Blanton John T. Lubia

Mattan Erez. The University of Texas at Austin

Cornell University CS 569: Interactive Computer Graphics. Introduction. Lecture 1. [John C. Stone, UIUC] NASA. University of Calgary

Grafica Computazionale: Lezione 30. Grafica Computazionale. Hiding complexity... ;) Introduction to OpenGL. lezione30 Introduction to OpenGL

! Readings! ! Room-level, on-chip! vs.!

Could you make the XNA functions yourself?

SIGGRAPH Briefing August 2014

Case 1:17-cv SLR Document 1-3 Filed 01/23/17 Page 1 of 33 PageID #: 60 EXHIBIT C

Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University

GRAPHICS HARDWARE. Niels Joubert, 4th August 2010, CS147

Mali-400 MP: A Scalable GPU for Mobile Devices Tom Olson

The Ultimate Developers Toolkit. Jonathan Zarge Dan Ginsburg

Vulkan API 杨瑜, 资深工程师

How to Work on Next Gen Effects Now: Bridging DX10 and DX9. Guennadi Riguer ATI Technologies

Rendering Objects. Need to transform all geometry then

Transcription:

Graphics Architectures and OpenCL Michael Doggett Department of Computer Science Lund university

Overview Parallelism Radeon 5870 Tiled Graphics Architectures Important when Memory and Bandwidth limited Different to Tiled Rasterization! Tessellation OpenCL Programming the GPU without Graphics

Only 10% of our pixels require lots of samples for soft shadows, but determining which 10% is slower than always doing the samples. by ID_AA_Carmack, Twitter, 111017

Parallelism GPUs do a lot of work in parallel Pipelining, SIMD and MIMD What work are they doing?

What s running on 16 unified shaders? 128 fragments in parallel 16 cores = 128 ALUs MIMD 16 simultaneous instruction streams Slide courtesy Kayvon Fatahalian 5

vertices/fragments 128 [ primitives ] in parallel OpenCL work items CUDA threads vertices primitives fragments Slide courtesy Kayvon Fatahalian 6

Unified Shader Architecture Let s take the ATI Radeon 5870 From 2009 What are the components of the modern graphics hardware pipeline?

Unified Shader Architecture Grouper Rasterizer Unified shader Texture Z & Alpha FrameBuffer

Unified Shader Architecture Grouper Rasterizer Unified Shader Texture Z & Alpha FrameBuffer ATI Radeon 5870

Shader Inputs Vertex Grouper Tessellator Geometry Grouper Rasterizer Unified Shader Thread Scheduler 64 KB GDS 20 SIMD Engines SIMD 16 Shader Processors 32 KB LDS Shader Processor 5 32bit FP MulAdd 256 KB GPRs Texture Sampler 8 KB L1 Tex Cache Texture Shader Export Z & Alpha 4 Memory Controllers 128 KB L2 Tex Cache 4 Z & Alpha Units 32 Z tests, 8 alpha blends Z & Color cache FrameBuffer ATI Radeon 5870

Tiled Based Architectures There are two major styles of GPU architecture : Traditional One we work with and discuss Dominant in Desktop (Nvidia, AMD, Intel) Tiling based Dominant in Mobile (ARM Mali, Qualcomm Adreno) Different to tile based rasterisation mmxv mcd

Examples of Tiled Based Architectures ARM Mali (Mobile) Imagination Technologies Kyro II (PC graphics) Sega Dreamcast (console) PowerVR (Mobile) Apple GPU? Other notable ones Original Intel Larrabee used a Software Tiled based rasterizer (became Xeon Phi) mmxv mcd

Tiled Based Rendering Screen is divided into Tiles (e.g. 32x32 pixels) 1. 3. 5. Run vertex shader to project triangles to the screen Create Tile Triangle Lists (or Bins) 2. 4. 6. 1. 2. 3. 4. 5. 6. Each list contains pointers to all triangles in the tile mmxvii mcd

Tiled Based Rendering Take one tile and render all triangles in that tile list Use 1 tile sized on-chip memory 1. 2. 3. 4. 5. 6. After processing all triangles in the list On-chip memory Copy tile memory, with Z and Color, from on-chip memory to main memory 1. 3. 5. Off-chip memory mmxvii mcd 2. 4. 6.

Tiled Based Rendering Take one tile and render all triangles in that tile list Use 1 tile sized on-chip memory 1. 2. 3. 4. 5. 6. After processing all triangles in the list On-chip memory Copy tile memory, with Z and Color, from on-chip memory to main memory 1. 3. 5. Off-chip memory mmxvii mcd 2. 4. 6.

Tiled Based Rendering Take one tile and render all triangles in that tile list Use 1 tile sized on-chip memory 1. 2. 3. 4. 5. 6. After processing all triangles in the list On-chip memory Copy tile memory, with Z and Color, from on-chip memory to main memory 1. 3. 5. Off-chip memory mmxvii mcd 2. 4. 6.

Tiled Based Rendering Take one tile and render all triangles in that tile list Use 1 tile sized on-chip memory 1. 2. 3. 4. 5. 6. After processing all triangles in the list On-chip memory Copy tile memory, with Z and Color, from on-chip memory to main memory 1. 3. 5. Off-chip memory mmxvii mcd 2. 4. 6.

Tiled Based Rendering Take one tile and render all triangles in that tile list Use 1 tile sized on-chip memory 1. 2. 3. 4. 5. 6. After processing all triangles in the list On-chip memory Copy tile memory, with Z and Color, from on-chip memory to main memory 1. 3. 5. Off-chip memory mmxvii mcd 2. 4. 6.

Tiled Based Rendering Take one tile and render all triangles in that tile list Use 1 tile sized on-chip memory 1. 2. 3. 4. 5. 6. After processing all triangles in the list On-chip memory Copy tile memory, with Z and Color, from on-chip memory to main memory 1. 3. 5. Off-chip memory mmxvii mcd 2. 4. 6.

Mali Bifrost GPU Hierarchical tiling unit - 16x16, 32x32, 64x64 bins Switch to TLP quad execution design Better for compute mmxvii mcd image courtesy Jem Davies, Bifrost GPU, 2016, ARM

Mali-G71 shader core design mmxvii mcd image courtesy Jem Davies, Bifrost GPU, 2016, ARM

Tiling : Disadvantages More on chip memory for sorting triangles into tiles Increased bandwidth for triangle sorting State changes now happen for every tile (can take a lot of time) Triangles are processed multiple times (can be solved with Mali hierarchical tiling) On chip memory limits places hard limits on number of triangles Overflow causes spilling and creates a big hit on performance Too much geometry will flush whole pipeline (ParameterBuffer overflow) mmxvii mcd

Tiling : Advantages Easily parallelizable Allows access to frame buffer in a small high speed memory Z, Color accesses close to free MSAA is almost free (5-10% of rendering time) Alpha Blending is significantly cheaper mmxvii mcd

Comparing Tiling to the Traditional/Straight Pipeline Advantages Less complex pipeline More predictable performance Z Framebuffer compression techniques reduce bandwidth State changes once for frame (but too many still a problem) Disadvantages Impossible to have frame buffer in on chip memory Accessing the frame buffer cost power and latency mmxvii mcd

Expanding the graphics pipeline Vertex shaders & Pixel shaders But wait there s more!!! Geometry Shaders Triangle in, many triangles out Like a triangle shader Tessellation (in OpenGL terminology) Control Shader Tessellator Evaluation Shader

Geometry Shaders AV0 AV2 Vertex shader Geometry shader Rasterization Pixel shader Z & Alpha FrameBuffer Input and output can be points, lines, triangles But lines and triangles can also have adjacency information (3 for triangles, 2 for lines) Can think of it as the Triangle shader 1:n ratio of input:output 1 primitive in, many (or no) primitives out Must set a max_vertices for number of outputs Number of vertices, number of output components Enables storing processed vertices into multiple buffers Output lines and triangles are strips If you want individual, you need EndPrimitive Layered rendering Send primitives to specific layers of framebuffer Good for rendering cube-based shadow mapping and environment maps DirectX 10, OpenGL 3.0 feature AV1

Geometry Shader Applications Single pass render to cube map Using gl_layer Point sprite expansion/generation Fur/Fin generation Shadow volume extrusion Tessellation? mmxv mcd

Tessellation Difference between games and film is geometric detail Film uses higher order surfaces (HOS) Subdivision surfaces Bezier patches, NURBs Tessellation converts HOS into triangles Disney/Pixar. Image from Nießner, M., et al. Feature-adaptive GPU rendering of Catmull-Clark subdivision surfaces, TOG January 2012

Vertex shader Tessellation Control shader Tessellator Evaluation shader Geometry shader Rasterization Pixel shader Z & Alpha Benefits Higher order surfaces such as subdivision surfaces Higher detailed surfaces using displaced mapping View-dependent levels-of-detail Input is quad or triangle patches DirectX 11 (Hull and Domain shader), OpenGL 4.0 feature On mobile, Android Extension Pack FrameBuffer

Tessellation Control shader Transforms up to 32 control points of a patch (typically 16) Calculates tessellation factor for patch edges in tessellator Tessellator Outputs UV coordinates Evaluation shader Calculates the vertex position of the output of tessellator in the patch Images courtesy Charles Loop, Hardware Subdivision and Tessellation of Catmull-Clark Surfaces, GTC2010

Images courtesy Charles Loop, Hardware Subdivision and Tessellation of Catmull-Clark Surfaces, GTC2010

displacement mapping mmxii Images mcd courtesy Charles Loop, Hardware Subdivision and Tessellation of Catmull-Clark Surfaces, GTC2010

OpenCL OpenCL - Open Compute Language Application Programming Interface (API) Runs on GPUs Write programs (shaders) for GPU using OpenCL Industry standard between multiple companies

GPU as a parallel processor GPU parallelism increased significantly during the last decade Roughly 1000x more pixels/sec in 10 years People started to use this for more than just graphics How to enable using the GPU as a parallel processor? How could we change OpenGL?

Designing OpenCL Grouper Rasterizer Unified shader Take out all the graphics Triangles, textures, Z testing, alpha blending Keep simple small programs (graphics shaders) Compute kernels Use the C programming model for kernels (ANSI C99) Z & Alpha FrameBuffer Low-level abstraction High-performance, but device independent CPUs are also becoming parallel so OpenCL works there too

Designing OpenCL Scheduler Unified shader Take out all the graphics Triangles, textures, Z testing, alpha blending Keep simple small programs (graphics shaders) FrameBuffer Compute kernels Use the C programming model for kernels (ANSI C99) Low-level abstraction High-performance, but device independent CPUs are also becoming parallel so OpenCL works there too

OpenCL platform Compute device may be a CPU, GPU or other processor Compute device is made up of Compute Units Compute Unit is made up of Processing Elements Processing elements execute code as SIMD or SPMD What could be added to GPUs for parallel programming and not just graphics? GPU programs have a lot of dedicated memory(ram) that only shaders use Shaders cannot communicate with each other while running

How can kernels share data in OpenCL? Kernel Local Memory Allow kernel programs to share data while running Nvidia s CUDA shared memory Graphics groups pixels (shaders) with triangles OpenCL groups work-items (kernels/threads) with work-groups (blocks) Local Memory is where work-items in a work-group can share data

OpenCL Memory model Private Memory Per work-item - registers Local Memory Shared within a workgroup Global/Constant Memory Visible to all workgroups Read-only data (Constant) Off-chip memory for GPU GPU memory/framebuffer Host Memory CPU s memory Private Memory Work Item 0 Private Memory Work Item M Compute Unit 0 Local Memory Compute Device Host Private Memory Work Item 0 Global/Constant Memory Host Memory Private Memory Work Item M Compute Unit N Local Memory

How do you run commands in OpenCL? Commands are submitted to a GPU or CPU queue Execution can be in-order or out-of-order Events are used for synchronization

Execution model Work-Group equivalent to a thread (pixel) NDRange Work-Item Work-Item local size y global size y Work-Item Work-Item global size x local size x Kernels are run over an N-Dimensional range (NDRange) Each work-item has a unique identifier (global-id)

OpenCL platform Compute device may be a CPU, GPU or other processor Compute device is made up of Compute Units Compute Unit is made up of Processing Elements Processing elements execute code as SIMD or SPMD

Summary OpenCL enables parallel programming on both GPUs and CPUs from different vendors AMD (CPU&GPU), Nvidia (GPU), Intel (CPU&GPU), ARM(GPU) on different operating systems windows/linux/mac/android At low-level (performance) with well known programming language

Further material on OpenCL Khronos web page http://www.khronos.org/opencl/ Overview http://www.khronos.org/developers/library/overview/opencl_overview.pdf Specification http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf AMD GPUs sample code SDK http://developer.amd.com/gpu/atistreamsdk/pages/default.aspx Nvidia sample code http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/opencl/website/ samples.html

Projects Discussion Forum, Q&A by everyone!! Code due 13:00, Thursday, 2017-12-14 Report due following Monday, 18th Send by email to me MD15

Next Week No lecture Thursday (7th) or following Wednesday (13th) More time for projects! Next week - Wednesday 9.15am Guest Speaker Johan Grönqvist, ARM