GPGPU in Film Production. Laurence Emms, Pixar Animation Studios

Outline GPU computing at Pixar Demo overview Simulation on the GPU Future work

GPU Computing at Pixar GPUs have been used for real-time preview of assets Emphasis on matching GPU results with CPU results GPGPU allows us to speed up more stages of the asset pipeline

LPics Interactive relighting engine RenderMan surface shaders generate image-space caches Caches loaded onto GPU Light shaders run on GPU hardware [Lpics: a Hybrid Hardware-Accelerated Relighting Engine for Computer Cinematography, Fabio Pellacini et al., August 2005]

Floating Point Precision Shader Model 2.0 introduced IEEE single precision floating point accuracy (2005) Idea: Substitute GPU programs for some stages of the asset pipeline

Floating Point Textures Rendering to the default framebuffer clamps values to [0.0, 1.0] Request floating point textures with GL_RGBA32F and GL_FLOAT: glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA32F, _image_width, _image_height, 0, GL_RGBA, GL_FLOAT, NULL);

Modern OpenGL Modern OpenGL pipeline is similar to RenderMan pipeline Supports tessellation, screen space effects and displacement Allows us to use OpenGL as a preview tool until later in the pipeline

Geometry Shaders Take an OpenGL primitive passed in from a vertex or tessellation shader Generate new geometry Used for hair, particles, etc.

Vegetation Preview Artists want a grass representation in Presto Upload the CPU procedural result onto the GPU Render with OpenGL Vertex Buffer Objects (VBOs) and Geometry Shaders

Tessellation Shaders Take a GL_PATCH primitive from a vertex shader Hardware tessellation unit subdivides the patch based on the Tessellation Control Shader (TCS) The Tessellation Evaluation Shader (TES) follows

Hair Style Preview Grooming TDs want to see hair styles as they work Upload hairs to a VBO Tessellation shaders to match curves SSAO to show volume

Subdivision Surfaces OpenSubdiv: open source, hybrid CPU/GPU subdivision surface libraries https://github.com/pixaranimationstudios/opensubdiv

Procedurals Modern OpenGL Pipeline (diagram) Source: OpenGL.org wiki, Rendering Pipeline Overview http://www.opengl.org/wiki/rendering_pipeline_overview

Demo Overview Simple mass-spring simulation on the GPU Combines CUDA with OpenGL Render a set of jelly cubes

Demo Open source GPU mass spring simulation https://github.com/lemms/siggraphasiademo2012 GNU GPL License

CUDA General purpose GPU programming CPU = Host GPU = Device Good for data-parallel algorithms Runs on Streaming Multiprocessors (SMs) in the GPU. Source: NVIDIA CUDA C Programming Guide

Setup Install the CUDA Toolkit https://developer.nvidia.com/cuda-downloads CUDA programs use the nvcc compiler In Visual Studio, right-click the project name, click Build Customizations, then select the CUDA Toolkit version you installed

Kernels Execute on the device (GPU), called from the host (CPU): Declaration: __global__ void device_func( ) { } Call: device_func<<< blocks, threads_per_block >>>( );

Kernels Example CPU loop: for (int i = 0; i < n; i++) { a[i] = b[i] + c[i]; } CUDA definition: __global__ void sum(int n, int *a, int *b, int *c) { int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) a[i] = b[i] + c[i]; } Call: sum<<< blocks, threads >>>(n, a, b, c); cudaThreadSynchronize();

Threads and Blocks Multiple threads are grouped into blocks of fixed size. Each block is assigned to one SM. Threads within a block share resources.

Kernel Calls with Threads and Blocks int tpb = 256; // threads per block int n = a.size(); // a, b, c are the same size sum<<<(n+tpb-1)/tpb, tpb>>>(n, a, b, c); This creates just enough blocks to process n items with 256 threads per block.

GPU Memory Allocate: cudaMalloc(void **devPtr, size_t size) Free: cudaFree(void *devPtr) Copy to/from device: cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind) kind = cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost

STL Vectors on the GPU Idea: Manage CPU memory with std::vector and upload to the GPU. std::vector<T> cpu_data; T *gpu_data; cudaMalloc((void**)&gpu_data, cpu_data.size()*sizeof(T)); cudaMemcpy(gpu_data, &cpu_data[0], cpu_data.size()*sizeof(T), cudaMemcpyHostToDevice);

Mass Spring Simulation Masses are integrated using explicit RK4 Spring forces use Hooke's Law Simulate using very small timesteps: dt = 1e-4

Masses Masses lie on an axis-aligned Cartesian grid, forming a grid of cubes with one mass on each vertex

Mass Simulation Each mass is a structure: struct Mass { float _mass; float _x; float _y; float _z; float _vx; float _vy; float _vz; float _radius; int _state; }; An array of masses is stored in a MassList struct (AoS). We upload the array of structures using cudaMemcpy(). Access elements using masses[thread_id]._mass

Structure of Arrays (SoA) Problem: Global memory accesses are unaligned. Solution: Rearrange the data into a structure of arrays. struct MassDeviceArrays { float *_mass; float *_x; float *_y; float *_z; float *_radius; int *_state; }; 1. Allocate the individual arrays using cudaMalloc() and copy the data to the GPU using cudaMemcpy(). 2. Allocate a duplicate MassDeviceArrays struct in GPU memory to copy the array pointers into constant memory on the GPU. Access elements using masses->_mass[thread_id]

Mass Simulation Each kernel call represents one RK4 increment. masses.startFrame(); masses.clearForces(); masses.evaluateK1(dt, ground_collision); springs.applySpringForces(masses); ... masses.clearForces(); masses.evaluateK4(dt, ground_collision); springs.applySpringForces(masses); masses.update(dt, ground_collision); masses.endFrame();

Springs Simplified linear springs: F = -k_s*(dx/l_0 - 1) - k_d*dv F = force on the right mass k_s = Young's modulus k_d = linear damping constant dx = length of the spring l_0 = resting length of the spring dv = relative velocity of the right mass with respect to the left mass

Structural Springs Axis-aligned Cartesian springs connecting masses Prevent collapsing along edges

Bending Springs Axis-aligned springs between every second neighbor Prevent edges from bending A simplification of axial bending springs [Selle, A., Lentine, M., Fedkiw, R., A Mass Spring Model for Hair Simulation, ACM TOG 27, 64.1-64.11 (2008)]

Shear Springs Diagonal springs Prevent planar shearing and twisting Two diagonal springs per face and 4 interior springs per cube

Interior Springs 4 interior springs per cube connecting diagonally opposite vertices

Springs Each spring is a structure: struct Spring { Spring(MassList &masses, unsigned int mass0, unsigned int mass1); unsigned int _mass0; // mass 0 index unsigned int _mass1; // mass 1 index float _l0; // resting length float _fx0; float _fy0; float _fz0; float _fx1; float _fy1; float _fz1; };

Spring Forces Spring forces are calculated once per RK4 increment. Two stages: deviceComputeSpringForces() computes the force for each spring. deviceApplySpringForces() sums the forces from each spring attached to a mass.

Collisions Bounding boxes are calculated around each object on the CPU. Impulses from virtual springs push nearby particles apart. O(n^2), but still fast on the GPU because of shared memory. Shared memory is used primarily as a scratchpad.

Performance Runs at 30 fps on a GeForce 670M with 140k springs Creates a plausible real-time simulation with 50k springs Performance depends on: Occupancy Coalesced memory access Optimizations: Shared memory spring force accumulation Structure of arrays (SoA)

Future Work Convert general purpose data-parallel tools to run on the GPU Simulation, deformers, procedurals, etc. Dynamic Parallelism

Questions Laurence Emms lemms@pixar.com