Height field ambient occlusion using CUDA

Height field ambient occlusion using CUDA 3.6.2009

Outline

Height fields

Self-occlusion

Current methods
- Marching several directions from each fragment
- Sampling several times along a direction
- Optimizations based on this approach

Sweeps
- We process several (e.g. 64) directions (sweeps)
- For a height field of n² texels (e.g. 1024×1024), a sweep consists of n to 2n lines
- One direction corresponds to an infinitely high but thin light source

Spatial coherence
- Each line steps one texel at a time and keeps a record of previous occluders
- Line-wise coherence: a convex-hull subset of the occluders is enough
- In practice only about 3% of the occluders are required
- Output: an occlusion vector/value on each step, for each direction, blended together

Implications
- Previous methods are bandwidth limited; the new method samples significantly less
- Bound to discrete directions, but ideally takes each pixel along them into account
- Does not map well to shaders, hence GPGPU

Environment
- OpenGL 3.0 (Aug 2008)
- CUDA 2.2 (May 2009)
- Linux (primary), Windows
- Software (CPU) and hardware (GPU) implementations

Tesla architecture
- Global (device) memory: on-board, various access modes
- Shared memory: on-chip, per multiprocessor (MP); 16 kB, 16 banks, interleaved at 32 bits
- 30 MPs on a GTX 280, 8 ALUs per MP

OpenGL interoperability
- Too much data copying
- Not enough threads
- We can transfer more sweeps at once to remedy the latter

Blending in CUDA
- Write into multiple destinations as before
- Blend using CUDA: still requires memcpys, and sampling is slower in CUDA than in OpenGL

Sample the source PBO with CUDA
- Reduces the data passed between OpenGL and CUDA
- Need to copy the source PBO into a CUDA array
- CUDA provides 2D bilinear filtering (read-only)
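For reference, the operation the texture unit performs in hardware on the copied array is plain 2D bilinear filtering. A CPU sketch (coordinates in texel units, border clamping simplified; the function name is mine):

```c
/* Software reference for 2D bilinear filtering, the read-only
 * operation CUDA's texture hardware provides on a cudaArray.
 * Callers keep x in [0, w-1] and y in [0, h-1] so all four
 * fetched neighbours are in range. */
float bilinear(const float *tex, int w, int h, float x, float y)
{
    int x0 = (int)x, y0 = (int)y;
    int x1 = x0 + 1 < w ? x0 + 1 : x0;   /* clamp at the border */
    int y1 = y0 + 1 < h ? y0 + 1 : y0;
    float fx = x - (float)x0, fy = y - (float)y0;
    /* blend horizontally on both rows, then vertically */
    float top = tex[y0 * w + x0] * (1.0f - fx) + tex[y0 * w + x1] * fx;
    float bot = tex[y1 * w + x0] * (1.0f - fx) + tex[y1 * w + x1] * fx;
    return top * (1.0f - fy) + bot * fy;
}
```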

Preblending in CUDA
- Cannot write directly; atomic adds are slow
- 180° rotation is trivial
- 90°/270° rotations need shared-memory tricks for memory coalescing
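The 90°/270° cases follow the classic tile-staging pattern: read a small tile in row order, write it out rotated. A CPU sketch of the access pattern (tile size and names are mine; on the GPU the tile sits in shared memory, typically padded by one element per row to avoid bank conflicts, so both the global reads and writes stay coalesced):

```c
#define TILE 16

/* Rotate an n x n image 90 degrees clockwise, tile by tile
 * (n must be a multiple of TILE).  dst(col, n-1-row) = src(row, col). */
void rotate90(const float *src, float *dst, int n)
{
    float tile[TILE][TILE];
    for (int by = 0; by < n; by += TILE)
        for (int bx = 0; bx < n; bx += TILE) {
            /* stage one tile, reading rows contiguously */
            for (int y = 0; y < TILE; y++)
                for (int x = 0; x < TILE; x++)
                    tile[y][x] = src[(by + y) * n + (bx + x)];
            /* emit the tile rotated 90 degrees clockwise */
            for (int y = 0; y < TILE; y++)
                for (int x = 0; x < TILE; x++)
                    dst[(bx + x) * n + (n - 1 - (by + y))] = tile[y][x];
        }
}
```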

Packing texels
- CUDA memory coalescing only works for thread-consecutive 32-bit elements
- But we have only a single luminance value per texel
- This calls for bit operations
- Heaviest on the preblend kernels, which luckily have low arithmetic density
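Since each texel carries only a single luminance value, four of them fit in one 32-bit word using plain shifts and ors; a sketch (the byte-per-texel layout and function names are assumptions, not from the slides):

```c
#include <stdint.h>

/* Pack four 8-bit luminance texels into one 32-bit word, so that
 * consecutive threads write consecutive 32-bit elements -- the
 * pattern CUDA memory coalescing requires. */
uint32_t pack4(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return (uint32_t)a | ((uint32_t)b << 8)
         | ((uint32_t)c << 16) | ((uint32_t)d << 24);
}

/* Recover texel i (0..3) from a packed word. */
uint8_t unpack(uint32_t packed, int i)
{
    return (uint8_t)(packed >> (8 * i));
}
```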

Function of the kernel
- Processes one line: reads height values as input, writes aligned packed values
- Keeps a representation of the convex hull
- Outputs an occlusion vector
- Coalescing vs. line-length uniformity
- Thread block size vs. shared memory resources

Occluder processing
- Keep a vector of occluders in shared memory
- Search it backwards (linear/binary) using a heuristic comparison operator
- Insert the new occluder
- Remove the remaining entries (change the length indicator)
- Return the current last occluder

Performance figures
- Profilers are immature
- But we know that we get 50-80 GB/s device memory rates
- And 200-400 Gops/s

Comparison
- Microsoft published the previous state-of-the-art method at Eurographics 2008
- It scales badly: 1024×1024 @ 2.5 fps with 32 directions
- Our method achieves 36-42 fps with 64 directions
- And is texel-precise on sharp edges

Screen captures