From Brook to CUDA. GPU Technology Conference


A 50 Second Tutorial on GPU Programming, by Ian Buck

Adding two vectors in C is pretty easy:

for (i = 0; i < n; i++)
    c[i] = a[i] + b[i];

On the GPU, it's a wee bit more complicated...

First, you'll want to create a floating point PBuffer. Of course, the code is different for NVIDIA and ATI, for OpenGL and DirectX, and for Windows, Linux, and OS X, naturally.

You'll want to create some floating point textures. Don't forget to turn off filtering, otherwise everything will run in software mode (good luck finding that in the documentation).

You'll need to write the add shader:

char singlefetch[] =
    "!!ARBfp1.0\n"
    "TEMP R0;\n"
    "TEMP R1;\n"
    "TEX R0, fragment.texcoord[0], texture[0], RECT;\n"
    "TEX R1, fragment.texcoord[0], texture[1], RECT;\n"
    "ADD result.color, R0, R1;\n"
    "END\n";

Copy the data to the GPU:

glTexSubImage2D(GL_TEXTURE_RECTANGLE_EXT, 0, 0, 0, width, height,
                GLformat(ncomp[i]), GL_FLOAT, t);

Render a shaded quad:

glBindProgramARB(GL_FRAGMENT_PROGRAM_ARB, pass_id[pass_idx]);
glBegin(GL_TRIANGLES);
glMultiTexCoord4fARB(GL_TEXTURE0_ARB, f1[i].x, f2[i].y, 0.0f, 1.0f);
glVertex2f(-1.0f, 3.0f);
glMultiTexCoord4fARB(GL_TEXTURE0_ARB, f1[i].x, f2[i].y, 0.0f, 1.0f);
glMultiTexCoord4fARB(GL_TEXTURE0_ARB + i, f1[i].x, f2[i].y, 0.0f, 1.0f);
glVertex2f(-1.0f, -1.0f);
glMultiTexCoord4fARB(GL_TEXTURE0_ARB + i, f1[i].x, f2[i].y, 0.0f, 1.0f);
glVertex2f(3.0f, -1.0f);
glEnd();
CHECK_GL();

Read back from the GPU:

glReadPixels(0, 0, width, height, GLformat(ncomp[i]), GL_FLOAT, t);

Congratulations, you've successfully added two vectors. Boy, that sucked.

History... GPGPU in 2004

GFLOPS: recent trends. [Chart: observed peak multiplies per second for NVIDIA NV30/NV35/NV40, ATI R300/R360/R420, and the Pentium 4, July 2001 through January 2004.]

GPU history: NVIDIA historicals

Date     Product          Process    Transistors   MHz   GFLOPS (MUL)
Aug-02   GeForce FX5800   0.13 um    121M          500   8
Jan-03   GeForce FX5900   0.13 um    130M          475   20
Dec-03   GeForce 6800     0.13 um    222M          400   53

Translating transistors into performance: a 1.8x increase in transistors and a 20% decrease in clock rate produced a 6.6x GFLOPS speedup.

Compute is cheap (courtesy of Bill Dally). There is plenty of parallelism to keep hundreds of ALUs per chip busy: shading is highly parallel, with millions of fragments per frame. [Die diagram: a 0.5 mm 64-bit FPU, drawn to scale on a 12 mm, 90 nm, $200, 1 GHz chip.]

...but bandwidth is expensive (courtesy of Bill Dally). This demands latency tolerance to cover ~500-cycle remote memory access times, and locality to match ~20 Tb/s of ALU bandwidth to ~100 Gb/s of off-chip bandwidth. [Die diagram annotations: 0.5 mm, 1 clock; 12 mm, 90 nm, $200, 1 GHz chip.]

Arithmetic intensity: the compute-to-bandwidth ratio. Shading is compute intensive: hundreds of floating point operations go into each 32-bit color value that is output. Can we structure our computation in a similar way? [Die diagram annotations: 0.5 mm, 1 clock; 12 mm, 90 nm, $200, 1 GHz chip.]
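A rough back-of-the-envelope check (my arithmetic, using the numbers from the bandwidth slide above, not figures from the talk): the arithmetic intensity needed is roughly ALU bandwidth / off-chip bandwidth = 20 Tb/s / 100 Gb/s = 200, i.e. on the order of a couple hundred operations per value fetched from memory, which is the regime shading already operates in.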

Brook language: C with streams. A stream is a collection of records requiring similar computation, e.g. particle positions, voxels, FEM cells, ...

Ray r<200>;
float3 velocityfield<100,100,100>;

Streams are similar to arrays, but indexing operations such as position[i] are disallowed.

Issuing compute. [Diagram: Brook issues geometry into the graphics pipeline: Vertex Program, Rasterization, Fragment Program, Texture Memory.]

Brook applications: ray tracer, segmentation, SAXPY, SGEMV, FFT, edge detect, linear algebra.

Legacy GPGPU: Brook was great, but it lived within the constraints of graphics and a constrained streaming programming model. How can we improve GPUs to be better computing platforms?

Challenges of the graphics API: addressing modes (limited texture size/dimension); shader capabilities (limited outputs); instruction sets (lacking integer and bit ops); communication (limited between pixels, no scatter: a[i] = p). [Diagram: Input Registers, Texture, Constants, and Registers feed a Fragment Program, which can write only its Output Registers.]

GeForce 7800 pixel pipeline. [Diagram: Input Registers, Texture, Constants, and Registers feed the Fragment Program, which writes Output Registers.]

Thread programs. [Diagram: a Thread Number, Texture, Constants, and Registers feed a Thread Program, which writes Output Registers.] Features: millions of instructions; full integer and bit instructions; no limits on branching and looping; 1D, 2D, or 3D thread ID allocation.
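As an illustrative sketch only (my example in C for CUDA, not code from the talk): a thread program performing the vector add from the opening slide. Each thread derives its element index from its 1D thread ID, and the bounds check shows that threads may branch freely.

__global__ void vecAdd(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // 1D thread ID -> element index
    if (i < n)                                       // branching is unrestricted
        c[i] = a[i] + b[i];
}

// Launched with one thread per element, e.g.:
// vecAdd<<<(n + 255) / 256, 256>>>(a, b, c, n);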

Global memory. [Diagram: the Thread Program can also load from and store to Global Memory.] Features: fully general load/store to GPU memory (scatter/gather); programmer flexibility in how memory is accessed; untyped, not limited to fixed texture types; pointer support.
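A minimal sketch of what this enables (my example, not from the slides): gather and scatter through ordinary pointers into global memory, something the fragment-program model could not express.

__global__ void scatterGather(float *out, const float *in, const int *idx, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[idx[i]];    // gather: read from a computed address
        out[idx[i]] = v * 2.0f;  // scatter: a[i] = p, write to a computed address
    }
}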

Shared memory. [Diagram: the Thread Program additionally has on-chip Shared memory, alongside Global Memory.] Features: dedicated on-chip memory; shared between threads for inter-thread communication; explicitly managed; as fast as registers.
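A small sketch of explicitly managed shared memory (my example; the kernel and its 256-thread block size are assumptions, not from the talk): each thread stages one element on chip, and after a barrier every thread in the block can read its neighbours' elements at near-register speed.

__global__ void neighbourSum(const float *in, float *out, int n)
{
    __shared__ float tile[256];                     // dedicated on-chip memory; assumes 256 threads per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // each thread loads one element
    __syncthreads();                                // make the tile visible to the whole block
    float left = (threadIdx.x > 0) ? tile[threadIdx.x - 1] : 0.0f;
    if (i < n)
        out[i] = tile[threadIdx.x] + left;          // inter-thread communication via shared memory
}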

Example algorithm: fluids. Goal: calculate the pressure in a fluid. Pressure = sum of neighboring pressures, P_n = P_1 + P_2 + P_3 + P_4, so the pressure for each particle depends on its neighbors:

Pressure_1 = P_1 + P_2 + P_3 + P_4
Pressure_2 = P_3 + P_4 + P_5 + P_6
Pressure_3 = P_5 + P_6 + P_7 + P_8
Pressure_4 = P_7 + P_8 + P_9 + P_10
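As a sketch of how this formula maps onto threads (my code, using the slide's indexing converted to zero-based C; not from the talk): one thread computes one output pressure.

// Pressure_k = P_{2k-1} + P_{2k} + P_{2k+1} + P_{2k+2} in the slide's 1-based numbering
__global__ void pressureSum(const float *P, float *pressure, int nOut)
{
    int k = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per output pressure
    if (k < nOut)
        pressure[k] = P[2*k] + P[2*k+1] + P[2*k+2] + P[2*k+3];
}

In the CUDA column of the comparison slide that follows, the P values a block reuses would first be staged in shared memory (as in the earlier sketch) so each is read from DRAM only once.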

Example fluid algorithm across the three models. [Diagram] CPU: a single thread computes P_n = P_1 + P_2 + P_3 + P_4 out of cache, with data and computation going through DRAM. GPGPU: many fragment units each compute P_n, but they can only exchange the P_1..P_4 values through multiple passes over video memory. CUDA GPU computing: a Thread Execution Manager runs many threads that keep P_1..P_5 in on-chip Shared Data, with program/control data coming from DRAM.

Streaming vs. GPU computing. Streaming (GPGPU): gather in, restricted write; memory is far from the ALU; no inter-element communication. CUDA: a more general data parallel model with full scatter/gather; the parallel data cache (PDC) brings data closer to the ALU; the application decides how to decompose the problem across threads, and threads share and communicate to solve problems efficiently.

Divergence in parallel computing: removing divergence pain from parallel programming. The SIMD pain: the user is required to SIMD-ify the code and suffers when the computation goes divergent. GPUs decouple execution width from the programming model: threads can diverge freely, with inefficiency only when divergence granularity exceeds the native machine width; divergence is hardware managed, so handling it becomes a performance optimization; and the approach is scalable.
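A tiny illustration (my example, not from the talk) of threads diverging freely: adjacent threads may take different branches, and the hardware manages the split.

__global__ void absOrSqrt(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (x[i] < 0.0f)
            x[i] = -x[i];        // some threads take this path...
        else
            x[i] = sqrtf(x[i]);  // ...others take this one; correctness is unaffected
    }
}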

CUDA: threading in a data parallel world. Operations drive execution, not data. Users are simply given a thread ID; they decide which thread accesses which data element. One thread can map to a single data element, a block, a variable, or nothing at all. There is no need for accessors, views, or built-ins.
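For instance (an assumed sketch of mine, not from the talk): because the application chooses the mapping, one thread can cover zero, one, or many elements simply through how it uses its ID; here each thread strides across the array.

__global__ void scaleAll(float *data, int n, float s)
{
    for (int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's first element
         i < n;
         i += gridDim.x * blockDim.x)                    // then every grid-width-th element after it
        data[i] *= s;
}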

Ease of adoption vs. generality. [Diagram: customizing solutions spans a spectrum from ease of adoption to generality: ported applications, domain libraries, domain specific languages, C for CUDA, the Driver API, PTX, and finally the hardware.]

Ahead of the curve: GPUs are already where CPUs are going. Task parallelism is short lived; data parallel is the future. Express a problem as data parallel and it maps automatically to a scalable architecture. CUDA is defining that data parallel future.

BACKUP

Stunning graphics realism. [Image montage: lush, rich worlds (Crysis, 2006 Crytek / Electronic Arts); incredible physics effects (Hellgate: London, 2005-2006 Flagship Studios, Inc., licensed by NAMCO BANDAI Games America, Inc.); core of the definitive gaming platform (Full Spectrum Warrior: Ten Hammers, 2006 Pandemic Studios, LLC; 2006 THQ Inc. All rights reserved).]

GPGPU programming model: an OpenGL program to add A and B. [Diagram: Vertex Program, Rasterization, Fragment Program, Texture Memory. Fragments are created by rasterization; the fragment program reads textures as input to the OpenGL shader program and writes the answer to texture memory as a color; the CPU then reads texture memory for the results.]