CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University


First: A Big Word of Thanks! To the millions of computer game enthusiasts worldwide, who demand the utmost in performance and realism from their game engines, and who create a market force for high performance computing that beats any federally funded effort (DOE, NASA, etc.).

High Performance Computing on the Desktop. PC graphics boards featuring GPUs: NVIDIA GeForce, ATI Radeon. Available at every computer store for less than $500; set up your PC in less than an hour and play. The latest board: NVIDIA GeForce GTX 960.

Just Computing. Compute-only (no graphics): NVIDIA Tesla K40 (Kepler), a true GPGPU (General Purpose computing using GPU technology) card. 12 GB memory per card, 2,880 processors, ~$3,500. Bundle 8 cards into a server: 23k processors, 96 GB memory.

At SUNY Korea: NVIDIA Kepler K20 workstation.

At SUNY Korea: HP ProLiant server, a dual rack server with 12 NVIDIA Tesla K20 cards.

Incredible Growth. The performance gap between GPU and CPU keeps growing; currently, speedups of 1-2 orders of magnitude are achievable (given appropriate programming and problem decomposition).

GPU Vital Specs

                            GeForce 8800 GTX       GeForce GTX 580
    Codename                G80                    GF110
    Release date            11/2006                11/2010
    Transistors             681 M (90 nm)          3,000 M (40 nm)
    Clock speed             1,350 MHz              1,544 MHz
    Processors              128                    512
    Peak pixel fill rate    13.8 Gigapixels/s      37.6 Gigapixels/s
    Peak memory bandwidth   86.4 GB/s (384-bit)    192 GB/s (384-bit)
    Memory                  768 MB                 1,536 MB
    Peak performance        520 Gigaflops          1,581 Gigaflops

Comparison with CPUs

                              Intel Xeon Westmere X5670    GeForce GTX 580
    Price                     $800                         $500
    Cores / chip              6                            16
    ALUs / core               1                            32
    Managed threads / core    2                            1,536
    Clock speed               3 GHz                        1.5 GHz
    Performance               96 Gigaflops                 1,581 Gigaflops

Comparison with CPUs: backprojection task, 496 projections of 1248 x 960 pixels each. From Treibig et al., "Pushing the limits for medical image reconstruction on recent standard multicore processors," International Journal of High Performance Computing Applications. (Hardware cost: $500 vs. $4,500.)

The Graphics Pipeline. Old-style, non-programmable: a stream of vertex 4-tuples (X, Y, Z, W) flows through fixed-function stages, with access to textures, and emerges as a stream of pixel 4-tuples (R, G, B, A).

The Graphics Pipeline. Modern, programmable: the same stream of (X, Y, Z, W) tuples is turned into a stream of (R, G, B, A) tuples, but the stages in between are programmable and can sample textures.

The Graphics Pipeline. From a computational view: a stream of control primitives and a stream of data 4-tuples are processed by programmable stages with access to textures, producing a stream of 4-tuples as output.

Stream Processing. GPUs are stream processors [Kapasi 03], with some restrictions [Venkatasubramanian 03]: an input data stream is read from the stream register file (lower bandwidth), passes through a chain of SIMT stream kernels, each backed by a high-bandwidth local register file, and the result is written back as an output data stream.

History: Accelerated Graphics
1990s: accelerated graphics
- Silicon Graphics (SGI): expensive and non-programmable
Late 1990s: rise of consumer graphics chips
- Voodoo, ATI Rage, NVIDIA Riva; chips still separate from memory
End of the 1990s: consumer graphics boards with high-end graphics
- the world's first GPU: NVIDIA GeForce 256 (NV10)
- inexpensive, but still non-programmable
2000s: programmable consumer graphics hardware
- graphics cards: NVIDIA GeForce 3, ATI Radeon 9700
- HW programming languages: Cg, GLSL, HLSL

Now: Focus on Parallel Computing
2006: parallel computing languages appear
- dedicated SDKs and APIs for parallel high performance computing (GPGPU)
- CUDA (Compute Unified Device Architecture): developed by NVIDIA
- OpenCL (Open Computing Language): initially developed by Apple, now with the Khronos Compute Working Group
- specific GPGPU boards: NVIDIA Tesla, AMD FireStream

Right Now: Focus on Serious Parallel Computing
2009: next-generation CUDA architectures announced
- NVIDIA Fermi, AMD Cypress
- substrate for supercomputing, focused on serious high performance computing (clusters, etc.)
Enrico Fermi (1901-1954)
- Italian physicist, one of the top scientists of the 20th century
- developed the first nuclear reactor
- contributed to quantum theory, statistical mechanics, nuclear and particle physics
- Nobel Prize in Physics in 1938 for his work on induced radioactivity

GPU vs. CPU
- One instruction decode per kernel stream; a CPU needs a decode for each data item.
- Highly parallel: the GTX 580 has 512 processors.
- Memory sits very close to the processors: fast data transfer.
- Threads are cheap (light-weight) to switch: use this to swap out waiting threads and swap in ready threads.
- A CPU requires lots of cache logic and communication to manage resources; the GPU has the resources close by.

GPU vs. CPU
- A high percentage of GPU chip real estate is devoted to computing; the fraction is small in CPUs (for example, 6.5% in the Intel Itanium).
- In many cases, speedups of 1-2 orders of magnitude can be obtained by porting to the GPU; more details on the rules for effective porting later.

GPU Architecture: Overview
- 128 processors: 8 multi-processors of 16 processors each
- local L1 cache (4 KB), shared L2 cache (1 MB)
- DRAM (global memory)
Memory management is key! Thread management is key!
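To make these numbers concrete for whatever card you are running on, the multiprocessor count, memory sizes, and clock rate can be queried at runtime. Below is a minimal sketch using the CUDA runtime call cudaGetDeviceProperties; the particular fields printed are just examples.

    #include <cstdio>
    #include <cuda_runtime.h>

    int main() {
        int count = 0;
        cudaGetDeviceCount(&count);                 // number of CUDA-capable GPUs
        for (int d = 0; d < count; ++d) {
            cudaDeviceProp p;
            cudaGetDeviceProperties(&p, d);         // fill in the property struct
            printf("Device %d: %s\n", d, p.name);
            printf("  multiprocessors:    %d\n",     p.multiProcessorCount);
            printf("  global memory:      %zu MB\n", p.totalGlobalMem >> 20);
            printf("  shared mem / block: %zu KB\n", p.sharedMemPerBlock >> 10);
            printf("  clock rate:         %d MHz\n", p.clockRate / 1000);
        }
        return 0;
    }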

GPU Architecture: A Different View
- Each multiprocessor is a SIMT (Single Instruction, Multiple Thread) architecture, equipped with a set of local 32-bit registers (plus L1 and L2 caches).
- The (multiprocessor-level) shared Constant Cache and Texture Cache are read-only.
- The (device-level, shared) Device Memory (Global Memory) has read-write access (with caching soon).
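A hedged sketch of how these memory spaces appear in CUDA source code: __constant__ data is read-only from kernels and cached, __shared__ data lives on the multiprocessor and is visible to one thread block, and plain device pointers refer to read-write global memory. The names (coeff, scale_and_shift, d_in, d_out) are made up for illustration.

    #include <cuda_runtime.h>

    __constant__ float coeff[2];          // read-only from kernels, served by the constant cache

    __global__ void scale_and_shift(const float *in, float *out, int n) {
        __shared__ float tile[256];       // on-chip, shared by one thread block (scratch here)
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            tile[threadIdx.x] = in[i];    // in/out live in global (device) memory: read-write
            out[i] = coeff[0] * tile[threadIdx.x] + coeff[1];
        }
    }

    // host side (sketch): upload constants, then launch
    // float h_coeff[2] = {2.0f, 1.0f};
    // cudaMemcpyToSymbol(coeff, h_coeff, sizeof(h_coeff));
    // scale_and_shift<<<(n + 255) / 256, 256>>>(d_in, d_out, n);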

GPU Specifics
- All standard graphics operations are hardwired: linear interpolations, matrix and vector arithmetic (+, -, *).
- Arithmetic intensity (the ratio of ALU arithmetic per operand fetched) needs to be reasonably high, else the application is memory-bound.
- GPU memory is 1-2 orders of magnitude slower than the GPU processors: computation often beats table look-ups, and indirections can be expensive.
- Be aware of the GPU's 2D caching protocol for texture memory: data is fetched in 2D tiles (recall bilinear texture filtering in graphics), so promote data locality in 2D tiles.
A worked example of arithmetic intensity follows below.
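As a worked example (my numbers, not from the slides), consider the classic SAXPY kernel y = a*x + y: each element performs 2 floating-point operations but moves 12 bytes (read x, read y, write y), an arithmetic intensity of about 0.17 FLOP/byte. At roughly 192 GB/s of memory bandwidth on a GTX 580 that caps SAXPY near 32 GFLOP/s, a small fraction of the 1,581 GFLOP/s peak, so the kernel is firmly memory-bound.

    // SAXPY: 2 FLOPs (one multiply, one add) per 12 bytes of traffic per element.
    // Arithmetic intensity = 2 / 12 ~ 0.17 FLOP/byte -> memory-bound on any modern GPU.
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            y[i] = a * x[i] + y[i];
    }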

Latency Hiding
- GPUs provide hardware multi-threading: when the threads on a core's ALUs stall (waiting for memory, etc.), another (light-weight) SIMT thread group is swapped in for execution; this hides the latency of the stalled threads.
- The GTX 480 allows 48x more threads to be maintained than are currently SIMT-executed.
- Hardware multi-threading has a cost: the contexts of all these threads must be kept in memory, which typically limits the number of threads that can be maintained simultaneously for latency hiding.
The sketch below shows one way to give the hardware enough threads to hide latency.
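One common way to exploit this (my example, not from the slides) is simply to launch many more threads than there are processors and let each thread walk the data with a grid-stride loop; the scheduler then always has ready thread groups to swap in while others wait on memory.

    __global__ void scale(float *data, float s, int n) {
        // Launch far more threads than cores; stride over the array so any n works.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x;
             i < n;
             i += gridDim.x * blockDim.x)
            data[i] = s * data[i];
    }

    // host side (sketch): oversubscribe the GPU so stalled thread groups can be hidden
    // int threads = 256;
    // int blocks  = 1024;          // typically several blocks per multiprocessor
    // scale<<<blocks, threads>>>(d_data, 2.0f, n);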

Programmability
GPU hardware can be programmed with
- shading languages (NVIDIA Cg, OpenGL GLSL, Microsoft HLSL)
- parallel programming languages (CUDA, OpenCL)
Shading languages require computer graphics knowledge but give access to all fixed-function pipelines (fast, ASIC-accelerated):
- texturing: data interpolation, filtering
- rasterization: mapping into the data domain
- culling: clipping, early thread removal (early fragment kill)
This can provide performance benefits.
Parallel programming languages ease programming and eliminate the need to study (some) graphics.

Parallel Programming Languages
- CUDA (C interface: C for CUDA, also Fortran) and OpenCL.
- Expose details of GPU memory and thread management: memory hierarchy, latencies, operation costs, etc.; shading languages don't make this explicit.
- Give programmers better control over memory, threads, and arithmetic intensity (via the occupancy calculator and profiler).
- Promote computations as SIMT (Single Instruction, Multiple Threads) threads, executed in kernels; these threads are synonymous with fragments in shading languages.
- But they still require, for optimal performance, careful planning of computation flow, memory management, and analysis before coding: no magic here; no pain, no gain.
A minimal CUDA example follows below.
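For orientation, here is a minimal, hedged sketch of the full CUDA workflow (allocate device memory, copy data over, launch a kernel as a grid of SIMT threads, copy the result back); names like vec_add, d_a, etc. are illustrative, not part of the course material.

    #include <cstdio>
    #include <cuda_runtime.h>

    // Each SIMT thread handles one element, identified by its block and thread index.
    __global__ void vec_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_a = new float[n], *h_b = new float[n], *h_c = new float[n];
        for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks = (n + threads - 1) / threads;   // enough threads to cover all n elements
        vec_add<<<blocks, threads>>>(d_a, d_b, d_c, n);

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);              // expect 3.0

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        delete[] h_a; delete[] h_b; delete[] h_c;
        return 0;
    }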

GPGPU
GPGPU = General Purpose computation on Graphics hardware (GPUs); a massive trend to use GPUs for mainstream computing (see http://www.gpgpu.org). Accelerate:
- volume rendering and advanced graphics effects
- computer vision
- scientific computing and simulations
- audio, image, and video processing
- database operations, numerical algorithms, and sorting
- data compression
- medical imaging
- and many others

GPGPU Example Applications

Organization
Lab projects: there will be 4-5 lab projects to practice what you have learned in class; either use your own PC or the lab machines (ideally the lab server).
Final project: there will be a larger final project; choose from a number of topics or propose your own, ideally developing a multi-GPU application.