CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

Similar documents
CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Current Trends in Computer Graphics Hardware

GPU Architecture. Michael Doggett Department of Computer Science Lund university

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

CS427 Multicore Architecture and Parallel Computing

From Brook to CUDA. GPU Technology Conference

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

GPU for HPC. October 2010

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

GPGPU. Peter Laurens 1st-year PhD Student, NSC

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

CMPE 665:Multiple Processor Systems CUDA-AWARE MPI VIGNESH GOVINDARAJULU KOTHANDAPANI RANJITH MURUGESAN

Threading Hardware in G80

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Technology for a better society. hetcomp.com

Antonio R. Miele Marco D. Santambrogio

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)

Scientific Computing on GPUs: GPU Architecture Overview

What Next? Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University. * slides thanks to Kavita Bala & many others

CS 179: GPU Programming

What s New with GPGPU?

Graphics Hardware. Instructor Stephen J. Guy

Real-Time Rendering Architectures

Graphics Processor Acceleration and YOU

ECE 574 Cluster Computing Lecture 16

Lecture 7: The Programmable GPU Core. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

General Purpose Computing on Graphical Processing Units (GPGPU(

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

General Purpose GPU Computing in Partial Wave Analysis

ECE 571 Advanced Microprocessor-Based Design Lecture 20

Introduction to CUDA (1 of n*)

Graphics Processing Unit Architecture (GPU Arch)

ECE 571 Advanced Microprocessor-Based Design Lecture 18

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

Bifurcation Between CPU and GPU CPUs General purpose, serial GPUs Special purpose, parallel CPUs are becoming more parallel Dual and quad cores, roadm

PowerVR Hardware. Architecture Overview for Developers

GPGPU introduction and network applications. PacketShaders, SSLShader

CME 213 S PRING Eric Darve

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

GPGPU, 4th Meeting Mordechai Butrashvily, CEO GASS Company for Advanced Supercomputing Solutions

Real-time Graphics 9. GPGPU

NVIDIA s Compute Unified Device Architecture (CUDA)

NVIDIA s Compute Unified Device Architecture (CUDA)

GPUs and GPGPUs. Greg Blanton John T. Lubia

Real-time Graphics 9. GPGPU

Accelerating CFD with Graphics Hardware

Windowing System on a 3D Pipeline. February 2005

CSCI-GA Graphics Processing Units (GPUs): Architecture and Programming Lecture 2: Hardware Perspective of GPUs

Technical Report on IEIIT-CNR

Using Graphics Chips for General Purpose Computation

GeForce4. John Montrym Henry Moreton

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Administrivia. HW0 scores, HW1 peer-review assignments out. If you re having Cython trouble with HW2, let us know.

Programmable Graphics Hardware (GPU) A Primer

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

Multimedia in Mobile Phones. Architectures and Trends Lund

Introduction to Modern GPU Hardware

CUDA Conference. Walter Mundt-Blum March 6th, 2008

Introduction to GPU computing

From Shader Code to a Teraflop: How Shader Cores Work

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Portland State University ECE 588/688. Graphics Processors

Monday Morning. Graphics Hardware

Graphics and Imaging Architectures

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

GPGPU on ARM. Tom Gall, Gil Pitney, 30 th Oct 2013

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Martin Kruliš, v

Experts in Application Acceleration Synective Labs AB

GRAPHICS HARDWARE. Niels Joubert, 4th August 2010, CS147

Performance potential for simulating spin models on GPU

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Mathematical computations with GPUs

GPU Architecture. Alan Gray EPCC The University of Edinburgh

Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

GPU Basics. Introduction to GPU. S. Sundar and M. Panchatcharam. GPU Basics. S. Sundar & M. Panchatcharam. Super Computing GPU.

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

GPU Architecture and Function. Michael Foster and Ian Frasch

Introduction to GPU hardware and to CUDA

Comparison of High-Speed Ray Casting on GPU

Fundamental CUDA Optimization. NVIDIA Corporation

GPU Background. GPU Architectures for Non-Graphics People. David Black-Schaffer David Black-Schaffer 1

! Readings! ! Room-level, on-chip! vs.!

Motivation Hardware Overview Programming model. GPU computing. Part 1: General introduction. Ch. Hoelbling. Wuppertal University

Exotic Methods in Parallel Computing [GPU Computing]

Numerical Simulation on the GPU

PARALLEL PROGRAMMING MANY-CORE COMPUTING: INTRO (1/5) Rob van Nieuwpoort

Programmable GPUS. Last Time? Reading for Today. Homework 4. Planar Shadows Projective Texture Shadows Shadow Maps Shadow Volumes

Prof. Hakim Weatherspoon CS 3410, Spring 2015 Computer Science Cornell University

GRAPHICS PROCESSING UNITS

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

Shaders. Slide credit to Prof. Zwicker

Anatomy of AMD s TeraScale Graphics Engine

Transcription:

Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading, textures, surface details realistic physics and kinematics effects need to be customizable and interactive Klaus Mueller Computer Science Department Stony Brook University High Performance Computing on the Desktop Just Computing PC graphics boards featuring GPUs: NVidia FX, ATI Radeon available at every computer store for less $500 set up your PC in less than an hour and play than Compute-only (no graphics): NVIDIA Tesla 1060 True GPGPU (General Purpose Computing using GPU Technology) 4 GB memory per card the latest board: Nvidia GeForce GTX 280 Bundle up to 4 cards: 960 processors, 16 GB memory

Incredible Growth History: Accelerated Graphics 1.3 TFLOPS (GTX 480) 500$ 1990s: rise (and fall) of Silicon Graphics (SGI) in 1991 pioneers the first graphics accelerator developed OpenGL evolving 2D to 3D graphics capabilities through mid-90s desktop to high-end performance workstations (O2, Octane, Onyx) #1: expensive #2: non-programmable March 2010 Performance gap GPU / CPU is growing currently 1-2 orders of magnitude is achievable (given appropriate programming and problem decomposition) The Graphics Pipeline History: Cheap Consumer Graphics Old-style, non-programmable: (X,Y,Z,W) (R,G,B,A) Late 1990s: rise of consumer graphics chips 1994: Voodoo graphics chip (and later the popular Voodoo 2) chips still separate from memory other chips soon emerge: ATI Rage, NVIDIA Riva End 1990s: consumer graphics boards with high-end graphics transform, lighting, setup, and rendering all on a single GPU the world s first GPU: NVIDIA GeForce 256 (NV 10) #1: inexpensive #2: non-programmable textures

History: Programmability, GPGPU The Graphics Pipeline 2000s: emergence of programmable consumer graphics hardware programmable vertex and fragment shaders graphics cards: NVIDIA GeForce 3, ATI Radeon 9700 evolving capabilities for floating point, loops, if s enabled GPGPU HW programming languages: CG, GLSL, HLSL SW graphics API: OpenGL, DirectX #1: inexpensive #2: programmable Modern, programmable: (X,Y,Z,W) (R,G,B,A) programmable textures History: Focus Parallel Computing Hardware Architecture 2006: parallel computing languages appear address the need to provide dedicated SDK and API for parallel high performance computing (GPGPU) CUDA (Compute Unified Device Architecture) - developed by NVIDIA OpenCL (Open Computing Language) - initially developed by Apple - now with the Khronos Compute Working Group specific GPGPU boards: NVIDIA Tesla, AMD FireStream GPU Compute-intensive Highly data parallel Original SIMD Programming language Expose the parallel capabilities of GPUs. other parallel-computing chips: Intel Larabee, IBM Cell BE ALU: Arithmetic logic unit

GPU Vital Specs Comparison with CPUs GeForce 8800 GTX GeForce GTX 280 GeForce GTX 480 Codename G80 D10U-23 GF100 Release date 11/2006 6/2008 3/2010 Transistors 681 M (90nm) 1400 M (65nm) 3000 M (40nm) Clock speed 1350 MHz 1296 MHz 1401 MHz Processors 128 240 480 Peak pixel fill rate 13.8 Gigapixels/s 19.3 Gigapixels/s 33.60 Gigapixels/s Pk memory bandwidth 86.4 GB/s (384 bit) 141.7 GB/s (512 bit) 177.4 GB/s (384bit) Memory 768 MB 1024 MB 1536 MB Intel Core 2 Quad GeForce GTX 280 GeForce GTX 480 Core 4 8x30 32x15 SIMD width / Core 4 32 32 Managed / executed NA 32 48 threads Clock speed 3 GHz 1.3GHz 1.4 GHz Performance 96 Gigaflops 933 Gigaflops 1345 Gigaflops Peak performance 520 Gigaflops 933 Gigaflops 1.3Teraflops GPU vs. CPU GPU Architecture: Overview Highly parallel GeForce 8800 has 128 processors (128-way parallel) Memory very close to processors fast data transfer CPU requires lots of cache logic and communication High % of GPU chip real-estate for computing small in CPUs (example, 6.5% in Intel Itanium) In many cases speedups of 1-2 orders of magnitude can be obtained by porting to GPU more details on the rules for effective porting later 128 processors 8 multi-processors of 16 processors each local cache L1 (4k) shared cache L2 (1M) Global Memory DRAM (global memory)

GPU Architecture: Overview GPU Architecture: Overview Memory management is key! Memory management is key! Thread management is key! 128 processors 8 multi-processors of 16 processors each 128 processors 8 multi-processors of 16 processors each local cache L1 (4k) local cache L1 (4k) shared cache L2 (1M) shared cache L2 (1M) Global Memory DRAM (global memory) Global Memory DRAM (global memory) GPU Architecture: Different View History: Focus Serious Computing 2009: next generation CUDA architectures announced NVIDIA Fermi, AMD Cypress substrate for supercomputing focused on serious high performance computing (clusters, etc) Each multiprocessor is a SIMD (Same Instruction, Multiple Data) architecture Equipped with a set of local 32-bit registers (L1 and L2 caches) The (multi-processor level) shared Constant Cache and Texture Cache The (device-level shared) Device Memory (Global Memory) has readwrite access Enrico Fermi (1901-1954) Italian physicist one of the top scientists of the 20th century developed the first nuclear reactor contributed to - quantum theory, statistical mechanics - nuclear and particle physics Nobel Prize in Physics in 1938 for his work on induced radioactivity

GPU Specifics Latency Hiding All standard graphics ops are hardwired linear interpolations matrix and vector arithmetic (+, -, *) Arithmetic intensity the ratio of ALU arithmetic per operand fetched needs to be reasonably high, else application is memory-bound GPU memory 1-2 orders of magnitude slower than GPU processors computation often better than table look-ups indirections can be expensive Be aware of GPU 2D-caching protocol (for texture memory) data is fetched in 2D tiles (recall graphics bilinear texture filtering) promote data locality in 2D tiles GPUs provide hardware multi-threading kicks in when a thread within a core ALU stalls (waiting for memory, etc) then another SIMD thread is swapped in for execution this hides the latency for the stalled thread GPU allows many threads to be maintained than SIMD- executed Hardware multi-threading requires memory contexts of all such threads must be maintained in register or memory typically limits the amount of threads that can be simultaneously maintained for latency hiding Programmability CUDA\OpenCL vs. CG GPU hardware can be programmed with shading languages (NVIDIA CG, OpenGL GLSL, Microsoft HLSL) parallel programming language (CUDA, OpenCL) Shading languages require computer graphics knowledge give access to all fixed function pipelines (fast, ASIC-accelerated) - texturing: data interpolation, filtering - rasterization: mapping into the data domain - culling: clipping, early thread removal (fragment kill) this can provide performance benefits Parallel programming languages ease programming, eliminate need to study (some) graphics Exposes details on the GPU memory and thread management memory hierarchy, latencies, operation costs, etc shading languages don t make this explicit gives programmers better control over memory, threads, and arithmetic intensity (via occupancy calculator) Promotes computations as SIMD threads, executed in kernels synonymous to the CG fragments But still requires (for optimal performance): careful computation flow planning, memory management, and analysis before coding no magic here; no pain, no gain

GPGPU GPGPU Example Applications GPGPU = General Purpose Computation on Graphics hardware (GPU) massive trend to use GPUs for main stream computing see http://www.gpgpu.org Accelerate volume rendering and advanced graphics effects computer vision scientific computing and simulations audio and image & video processing database operations, numerical algorithms and sorting data compression medical imaging and many others Organization Will have 4-5 lab projects practice what you have learned in class Either use your own PC or lab machines if your PC is not CUDA ready, can use an emulator will be OK for the first few weeks given the size of the class, lab infrastructure will be worked out in the meantime You may work in a teams of two for the labs but need to be able to explain the code by yourself Final project there will be a larger final project choose from a number of topics or choose your own