GPU Computation Strategies & Tricks. Ian Buck NVIDIA

Similar documents
General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

From Brook to CUDA. GPU Technology Conference

What s New with GPGPU?

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

Graphics Processing Unit Architecture (GPU Arch)

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

Real - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

The GPGPU Programming Model

CS427 Multicore Architecture and Parallel Computing

A Data-Parallel Genealogy: The GPU Family Tree. John Owens University of California, Davis

GPGPU. Peter Laurens 1st-year PhD Student, NSC

The Source for GPU Programming

Portland State University ECE 588/688. Graphics Processors

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

Data-Parallel Algorithms on GPUs. Mark Harris NVIDIA Developer Technology

PowerVR Hardware. Architecture Overview for Developers

Monday Morning. Graphics Hardware

Mattan Erez. The University of Texas at Austin

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

The NVIDIA GeForce 8800 GPU

Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD)

Real-Time Rendering Architectures

GPGPU: Beyond Graphics. Mark Harris, NVIDIA

Tutorial on GPU Programming #2. Joong-Youn Lee Supercomputing Center, KISTI

Squeezing Performance out of your Game with ATI Developer Performance Tools and Optimization Techniques

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)

Windowing System on a 3D Pipeline. February 2005

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

Efficient and Scalable Shading for Many Lights

Automatic Tuning Matrix Multiplication Performance on Graphics Hardware

Pat Hanrahan. Modern Graphics Pipeline. How Powerful are GPUs? Application. Command. Geometry. Rasterization. Fragment. Display.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

GPGPU Lessons Learned. Mark Harris

ECE 571 Advanced Microprocessor-Based Design Lecture 20

The Application Stage. The Game Loop, Resource Management and Renderer Design

B-KD Trees for Hardware Accelerated Ray Tracing of Dynamic Scenes

Accelerating CFD with Graphics Hardware

Graphics Performance Optimisation. John Spitzer Director of European Developer Technology

Scheduling the Graphics Pipeline on a GPU

The Rasterization Pipeline

Antonio R. Miele Marco D. Santambrogio

Data parallel algorithms, algorithmic building blocks, precision vs. accuracy

General-Purpose Computation on Graphics Hardware

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Current Trends in Computer Graphics Hardware

Threading Hardware in G80

UberFlow: A GPU-Based Particle Engine

Graphics Hardware. Computer Graphics COMP 770 (236) Spring Instructor: Brandon Lloyd 2/26/07 1

Real-Time Reyes: Programmable Pipelines and Research Challenges. Anjul Patney University of California, Davis

Comparing Reyes and OpenGL on a Stream Architecture

Mattan Erez. The University of Texas at Austin

A Data-Parallel Genealogy: The GPU Family Tree

Spring 2009 Prof. Hyesoon Kim

Applications of Explicit Early-Z Culling

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

Performance Analysis and Culling Algorithms

Applications of Explicit Early-Z Z Culling. Jason Mitchell ATI Research

Spring 2011 Prof. Hyesoon Kim

GPU-Based Volume Rendering of. Unstructured Grids. João L. D. Comba. Fábio F. Bernardon UFRGS

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

GeForce4. John Montrym Henry Moreton

From Shader Code to a Teraflop: How Shader Cores Work

Hardware-driven visibility culling

Programmable GPUs. Real Time Graphics 11/13/2013. Nalu 2004 (NVIDIA Corporation) GeForce 6. Virtua Fighter 1995 (SEGA Corporation) NV1

Why Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Programming Graphics Hardware

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS

GPU Architecture. Samuli Laine NVIDIA Research

Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

Scanline Rendering 2 1/42

Lecture 7: The Programmable GPU Core. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Hardware-driven Visibility Culling Jeong Hyun Kim

2.11 Particle Systems

Course Recap + 3D Graphics on Mobile GPUs

ECE 574 Cluster Computing Lecture 16

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27

ECE 571 Advanced Microprocessor-Based Design Lecture 18

A SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

DiFi: Distance Fields - Fast Computation Using Graphics Hardware

CS GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1. Markus Hadwiger, KAUST

Graphics Processor Acceleration and YOU

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Save the Nanosecond! PC Graphics Performance for the next 3 years. Richard Huddy European Developer Relations Manager ATI Technologies, Inc.

Real-Time Graphics Architecture

Programmable Shaders for Deformation Rendering

Real-time Ray Tracing on Programmable Graphics Hardware

Introduction to Modern GPU Hardware

Cornell University CS 569: Interactive Computer Graphics. Introduction. Lecture 1. [John C. Stone, UIUC] NASA. University of Calgary

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

CS 152, Spring 2011 Section 10

Ray Tracing. Computer Graphics CMU /15-662, Fall 2016

Transcription:

GPU Computation Strategies & Tricks Ian Buck NVIDIA

Recent Trends 2

Compute is Cheap parallelism to keep 100s of ALUs per chip busy shading is highly parallel millions of fragments per frame 0.5mm 64-bit FPU (to scale) 90nm Chip 12mm courtesy of Bill Dally 3

...but Bandwidth is Expensive latency tolerance to cover 500 cycle remote memory access time locality to match 20Tb/s ALU bandwidth to ~100Gb/s chip bandwidth 0.5mm 1 clock 90nm Chip 12mm courtesy of Bill Dally 4

Optimizing for GPUs shading is compute intensive 100s of floating point operations output 1 32-bit color value compute to bandwidth ratio arithmetic intensity 0.5mm 1 clock 12mm 90nm Chip courtesy of Bill Dally 5

Compute vs. Bandwidth GFLOPS R300 R360 R420 GFloats/sec 6

Arithmetic Intensity GFLOPS 7x Gap R300 R360 R420 GFloats/sec 7

Arithmetic Intensity GeForce 7800 GTX Pentium 4 3.0 GHz GPU wins when Arithmetic intensity Segment 3.7 ops per word 11 GFLOPS 8

Arithmetic Intensity Overlapping computation with communication 9

Memory Bandwidth GPU wins when Streaming memory bandwidth SAXPY FFT GeForce 7800 GTX Pentium 4 3.0 GHz 10

Memory Bandwidth Streaming Memory System Optimized for sequential performance GPU cache is limited Optimized for texture filtering Read-only Small GeForce 7800 GTX Pentium 4 Local storage CPU >> GPU 11

Computational Intensity Considering GPU transfer costs: T r CPU GPU Memory Memory 12

Computational Intensity Considering GPU transfer costs: T r Computational intensity: γ γ K gpu / T r work per word transferred to outperform the CPU: speedup: s K cpu / K gpu γ > 1 s - 1 13

Kernel Overhead Considering CPU cost to issuing a kernel Generating compute geometry Graphics driver CPU limited GPU limited 14

Floating Point Precision s exponent mantissa sign * 1.mantissa * 2 (exponent+bias) NVIDIA FP32 s23e8 ATI 24-bit float s16e7 NVIDIA FP16 s10e5 15

Floating Point Precision Common Bug Pack large 1D array in 2D texture Compute 1D address in shader Convert 1D address into 2D FP precision will leave unaddressable texels! Largest Counting Number NVIDIA FP32: 16,777,217 ATI 24-bit float: 131,073 NVIDIA FP16: 2,049 16

Scatter Techniques Problem: a[i] = p Indirect write Can t set the x,y of fragment in pixel shader Often want to do: a[i] += p 17

Scatter Techniques Solution 1: Convert to Gather for each spring f = computed force mass_force[left] += f; mass_force[right] -= f; f1 m1 f2 m2 f3 18

Scatter Techniques Solution 1: Convert to Gather for each spring f = computed force for each mass mass_force = f[left] - f[right]; f1 m1 f2 m2 f3 19

Scatter Techniques Solution 2: Address Sorting Sort & Search Shader outputs destination address and data Bitonic sort based on address Run binary search shader over destination buffer Each fragment searches for source data 20

Scatter Techniques Solution 3: Vertex processor Render points Use vertex shader to set destination or just read back the data and re-issue Vertex Textures Render data and address to texture Issue points, set point x,y in vertex shader using address texture Requires texld instruction in vertex program 21

Strategies & Tricks: Conditionals 22

Conditionals Problem: if (a) b = f(); else b = g(); Limited fragment shader conditional support 23

Pre-computation Pre-compute anything that will not change every iteration! Example: static obstacles in fluid sim When user draws obstacles, compute texture containing boundary info for cells Reuse that texture until obstacles are modified Combine with Z-cull for higher performance! 24

Static Branch Resolution Avoid branches where outcome is fixed One region is always true, another false Separate FPs for each region, no branches Example: boundaries 25

Branching with Occlusion Query Use it for iteration termination Do { // outer loop on CPU BeginOcclusionQuery { // Render with fragment program that // discards fragments that satisfy // termination criteria } EndQuery } While query returns > 0 Can be used for subdivision techniques 26

Conditionals Predication Execute both f and g if (a) b = f(); else b = g(); Use CMP instruction CMP b, -a, f, g Executes all conditional code 27

Conditionals Predication Use DP4 instruction DP4 b.x, a, f Executes all conditional code a = (0, 1, 0, 0) f = (x, y, z, w) if (a.x) b = x; else if (a.y) b = y; else if (a.z) b = z; else if (a.w) b = w; 28

Conditionals Using the depth buffer Set Z buffer to a Z-test can prevent shader execution glenable(gl_depth_test) Locality in conditional if (a) b = f(); else b = g(); 29

Conditionals Using the depth buffer Optimization disabled with: ATI: Writing Z in shader Enabling Alpha test Using texkill in shader NVIDIA: Changing depth test direction in frame Writing stencil while rejecting based on stencil Changing stencil func/ref/mask in frame 30

Depth Conditionals GeForce 7800 GTX 31

Conditionals Conditional Instructions Available with NV_fragment_program2 MOVC CC, R0; IF GT.x; MOV R0, R1; # executes if R0.x > 0 ELSE; MOV R0, R2; # executes if R0.x <= 0 ENDIF; 32

GeForce 6/7 Series Branching True, SIMD branching Lots of incoherent branching can hurt performance Should have coherent regions of ~1000 pixels That is only about 30x30 pixels, so still very useable! Don t ignore overhead of branch instructions Branching over < 5 instructions may not be worth it Use branching for early exit from loops Save a lot of computation 33

Conditional Instructions GeForce 7800 GTX 34

Branching Techniques Fragment program branches can be expensive No true fragment branching on GeForce FX or Radeon 9x00-X850 SIMD branching on GeForce 6/7 Series Incoherent branching hurts performance Sometimes better to move decisions up the pipeline Replace with math Occlusion Query Static Branch Resolution Depth Buffer Pre-computation 35