What s New with GPGPU?

Similar documents
Experiences with gpu Computing

A Data-Parallel Genealogy: The GPU Family Tree. John Owens University of California, Davis

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

A Data-Parallel Genealogy: The GPU Family Tree

From Brook to CUDA. GPU Technology Conference

CS427 Multicore Architecture and Parallel Computing

Windowing System on a 3D Pipeline. February 2005

GPGPU. Peter Laurens 1st-year PhD Student, NSC

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1. Markus Hadwiger, KAUST

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

The GPGPU Programming Model

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

Lecture 2. Shaders, GLSL and GPGPU

Portland State University ECE 588/688. Graphics Processors

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

Automatic Tuning Matrix Multiplication Performance on Graphics Hardware

Data-Parallel Algorithms on GPUs. Mark Harris NVIDIA Developer Technology

Threading Hardware in G80

Current Trends in Computer Graphics Hardware

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS

Pat Hanrahan. Modern Graphics Pipeline. How Powerful are GPUs? Application. Command. Geometry. Rasterization. Fragment. Display.

Graphics Processing Unit Architecture (GPU Arch)

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

GPUs and GPGPUs. Greg Blanton John T. Lubia

General-Purpose Computation on Graphics Hardware

Accelerating CFD with Graphics Hardware

Why GPUs? Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD)

GPU Memory Model Overview

The NVIDIA GeForce 8800 GPU

Introduction to Shaders.

Today s Agenda. DirectX 9 Features Sim Dietrich, nvidia - Multisample antialising Jason Mitchell, ATI - Shader models and coding tips

Programmable GPUs. Real Time Graphics 11/13/2013. Nalu 2004 (NVIDIA Corporation) GeForce 6. Virtua Fighter 1995 (SEGA Corporation) NV1

Graphics Hardware. Instructor Stephen J. Guy

Multiple Cores, Multiple Pipes, Multiple Threads Do we have more Parallelism than we can handle? David B. Kirk

General Purpose Computing on Graphical Processing Units (GPGPU(

Real-Time Rendering Architectures

Real-time Graphics 9. GPGPU

Collaborators. Glift: : Generic, Efficient Random-Access GPU Data Structures. Compute vs. Bandwidth : 2005 Update.

Programmable GPUS. Last Time? Reading for Today. Homework 4. Planar Shadows Projective Texture Shadows Shadow Maps Shadow Volumes

Real-Time Graphics Architecture

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

Antonio R. Miele Marco D. Santambrogio

The Rasterization Pipeline

printf Debugging Examples

Spring 2009 Prof. Hyesoon Kim

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

Spring 2011 Prof. Hyesoon Kim

GPU Architecture. Michael Doggett Department of Computer Science Lund university

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany

GeForce4. John Montrym Henry Moreton

Lecture 1: Gentle Introduction to GPUs

INTEL HPC DEVELOPER CONFERENCE FUEL YOUR INSIGHT

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

The Application Stage. The Game Loop, Resource Management and Renderer Design

Introduction to Modern GPU Hardware

Real-time Graphics 9. GPGPU

frame buffer depth buffer stencil buffer

high performance medical reconstruction using stream programming paradigms

Bifurcation Between CPU and GPU CPUs General purpose, serial GPUs Special purpose, parallel CPUs are becoming more parallel Dual and quad cores, roadm

EECS 487: Interactive Computer Graphics

GPU Memory Model. Adapted from:

Shaders. Slide credit to Prof. Zwicker

Real-Time Reyes: Programmable Pipelines and Research Challenges. Anjul Patney University of California, Davis

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

Technology for a better society. hetcomp.com

Parallel Programming on Larrabee. Tim Foley Intel Corp

1.2.3 The Graphics Hardware Pipeline

Real-Time Support for GPU. GPU Management Heechul Yun

Why Use the GPU? How to Exploit? New Hardware Features. Sparse Matrix Solvers on the GPU: Conjugate Gradients and Multigrid. Semiconductor trends

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)

Mattan Erez. The University of Texas at Austin

Squeezing Performance out of your Game with ATI Developer Performance Tools and Optimization Techniques

2.11 Particle Systems

Parallel Programming for Graphics

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

Graphics Performance Optimisation. John Spitzer Director of European Developer Technology

COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. 5 th. Edition. Chapter 6. Parallel Processors from Client to Cloud

Tutorial on GPU Programming #2. Joong-Youn Lee Supercomputing Center, KISTI

Parallel Systems I The GPU architecture. Jan Lemeire

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Developer Tools. Robert Strzodka. caesar research center Bonn, Germany

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Soft Particles. Tristan Lorach

GPU Architecture. Robert Strzodka (MPII), Dominik Göddeke G. TUDo), Dominik Behr (AMD)

ECE 574 Cluster Computing Lecture 16

Comparing Reyes and OpenGL on a Stream Architecture

ECE 8823: GPU Architectures. Objectives

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

From Shader Code to a Teraflop: How Shader Cores Work

Introduction to CUDA (1 of n*)

Transcription:

What s New with GPGPU? John Owens Assistant Professor, Electrical and Computer Engineering Institute for Data Analysis and Visualization University of California, Davis

Microprocessor Scaling is Slowing 1e+7 1e+6 1e+5 1e+4 1e+3 1e+2 1e+1 52%/year ps/gate 19% Gates/clock 9% Clocks/inst 18% 19%/year Perf (ps/inst) 1e+0 1980 1985 1990 1995 2000 2005 2010 2015 2020 [courtesy of Bill Dally] [courtesy of Bill Dally]

Today s Microprocessors Scalar programming model with no native data parallelism SSE is the exception Few arithmetic units little area Optimized for complex control Optimized for low latency not high bandwidth Result: poor match for many apps Pentium III 28.1M T

Future Potential is Large 1e+7 1e+6 1e+5 1e+4 1e+3 1e+2 1e+1 1e+0 1e-1 1e-2 1e-3 52%/year 74%/year 30:1 Perf (ps/inst) Linear (ps/inst) 19%/year 1,000:1 1e-4 1980 1985 1990 1995 2000 2005 2010 2015 2020 30,000:1 2001: 30:1 2011: 1000:1 [courtesy of Bill Dally]

Parallel Processing is the Future Major vendors supporting multicore Intel, AMD Excitement about IBM Cell Hardware support for threads Interest in general-purpose programmability on GPUs Universities must teach thinking in parallel

Long-Term Trend: CPU vs. GPU GPU

Recent GPU Performance Trends Programmable 32-bit FP multiplies per second 54.4 GB/s 49.6 GB/s 6 GB/s $278 X1900 XTX $308 7800 GTX $334 P4X3.4 Data courtesy Ian Buck; from Owens et al. 2005 [EG STAR]

Functionality Improves Too! 10 years ago: Graphics done in software 5 years ago: Full graphics pipeline Today: 40x geometry, 13x fill vs.. 5 yrs ago Programmable! Programmable, data parallel processing on every desktop The GPU is the first commercial data- parallel processor

The Rendering Pipeline Application Geometry Compute 3D geometry Make calls to graphics API Transform geometry from 3D to 2D (in parallel) Rasterization Generate fragments from 2D geometry (in parallel) Composite GPU Combine fragments into image

The Programmable Rendering Pipeline Application Geometry (Vertex) Compute 3D geometry Make calls to graphics API Transform geometry from 3D to 2D; vertex programs Rasterization (Fragment) Generate fragments from 2D geometry; fragment programs Composite GPU Combine fragments into image

NVIDIA GeForce 6800 3D Pipeline Vertex Triangle Setup Z-Cull Shader Instruction Dispatch Fragment L2 Tex Fragment Crossbar Composite Memory Partition Memory Partition Memory Partition Memory Partition Courtesy Nick Triantos, NVIDIA

Programming a GPU for Graphics Application specifies geometry rasterized Each fragment is shaded w/ SIMD program Shading can use values from texture memory Image can be used as texture on future passes

Programming a GPU for GP Programs Draw a screen-sized quad Run a SIMD program over each fragment Gather is permitted from texture memory Resulting buffer can be treated as texture on next pass

GPUs are fast (why?) Characteristics of computation permit efficient hardware implementations High amount of parallelism exploited by graphics hardware High latency tolerance and feed-forward dataflow allow very deep pipelines allow optimization for bandwidth not latency Simple control Restrictive programming model Competition between vendors

but GPU programming is hard Must think in graphics metaphors Requires parallel programming (CPU-GPU, task, data, instruction) Restrictive programming models and instruction sets Primitive tools Rapidly changing interfaces Big picture: Every time I say GPU, substitute parallel processor

GPGPU in Scientific Visualization Flowfield Vis: a million particles with ease @ 60 fps Tensorfield Vis: CPU GPU Speedup = 1 : 150 (!!!) Online decompression: Interactively Visualize 20 GB of data on a 256 MB GPU GPGPU and VIS a winning team Focus and Context Raycasting: High Quality interactive visualization of large datasets on $400 hardware (winner of IEEE 2005 VIS Competition) Courtesy Jens Krüger (TU München)

GPGPU Effects Fire Water Let your Virtual Reality come to live with interactive GPGPU effects! (all off these effects are simulated and rendered on the GPU in realtime) Smoke Courtesy Jens Krüger (TU München)

Challenge: Programming Systems Programming Model High-Level Abstractions/ Libraries Low-Level Languages Performance Analysis Tools Compilers Docs CPU Scalar STL, GNU SL, MPI, C, Fortran, gcc,, vendor-specific, gdb, vtune,, Purify, Lots applications GPU Stream? Data-Parallel? Brook, Scout, sh, Glift GLSL, Cg, HLSL, Vendor-specific Shadesmith, NVPerfHUD None kernels

Glift: Data Structures for GPUs Goal Simplify creation and use of random-access GPU data structures for graphics and GPGPU programming Contributions Abstraction for GPU data structures Glift template library Iterator computation model for GPUs Aaron E. Lefohn, Joe Kniss, Robert Strzodka, Shubhabrata Sengupta, and John D. Owens. Glift: An Abstraction for Generic, Efficient GPU Data Structures '. ACM TOG Jan 2006.

Today s Vendor Support High-Level Graphics Language OpenGL D3D Low-Level Device Driver

Possible Future Vendor Support High-Level Graphics Language OpenGL D3D Compute Low-Level Device Driver High-Level Compute Lang. Low-Level API

ATI CTM Low-level interface to GPU

Big Picture Research Targets We don t t know what architecture will win. But we know it will be parallel. than bottom-up what SHOULD we support? Data structures Top-down approach rather Interaction of algorithms and data structures Export to multiple architectures Self-tuning code Communication between GPUs Programming systems for multiple parallel architectures Major obstacle: difficulty of programming. Danger of fragmentation! Opportunity for education as well. Learn from past Explore portable primitives What we re good at: using GPUs as first class computing resources

Rob Pike on Languages Conclusion A highly parallel language used by non-experts. Power of notation Good: make it easier to express yourself Better: hide stuff you don't care about Exposing Parallelism Best: hide stuff you do care about Control Flow Give the language a purpose. Data Locality Synchronization

Moving Forward What will DX10 give us? What works well now? What doesn t t work well now? What will improve in the future? What will continue to be difficult?

The New DX10 Pipeline Vertex Vertex Buffer Buffer Input Input Assembler Assembler Vertex Vertex Shader Shader Setup Setup Rasterizer Rasterizer Output Output Merger Merger Pixel Pixel Shader Shader Geometry Geometry Shader Shader Index Index Buffer Buffer Texture Texture Texture Texture Render Render Target Target Depth Depth Stencil Stencil Texture Texture Stream Stream Buffer Buffer Stream out Stream out Memory Memory memory memory programmable programmable fixed fixed Sampler Sampler Sampler Sampler Sampler Sampler Constant Constant Constant Constant Constant Constant [courtesy David Blythe]

What Runs Well on GPUs? GPUs win when Limited data reuse P4 3GHz NV GF 6800 Memory BW 6 GB/s 36 GB/s Cache BW 44 GB/s -- High arithmetic intensity: Defined as math operations per memory op Attacks the memory wall - are all mem ops necessary? Common error: Not comparing against optimized CPU implementation

Arithmetic Intensity Historical growth rates (per year): Compute: 71% DRAM bandwidth: 25% DRAM latency: 5% GFLOPS 7x Gap R300 R360 R420 GFloats/sec [courtesy Ian Buck]

Arithmetic Intensity GeForce 7800 GTX Pentium 4 3.0 GHz GPU wins when Arithmetic intensity Segment 3.7 ops per word 11 GFLOPS [courtesy Ian Buck]

Memory Bandwidth GPU wins when Streaming memory bandwidth SAXPY FFT GeForce 7800 GTX Pentium 4 3.0 GHz [courtesy Ian Buck]

Memory Bandwidth Streaming Memory System Optimized for sequential performance GPU cache is limited Optimized for texture filtering Read-only GeForce 7800 GTX Pentium 4 Small Local storage CPU >> GPU [courtesy Ian Buck]

What Will (Hopefully) Improve? Orthogonality Instruction sets Features Tools Stability Interfaces, APIs, libraries, abstractions Necessary as graphics and GPGPU converge!

What Won t Change? Rate of progress Precision (64b floating point?) Parallelism Won t sacrifice performance Difficulty of programming parallel hardware but APIs and libraries may help Concentration on entertainment apps

GPGPU Top Ten The Killer App Programming models and tools GPU in tomorrow s computer? Data conditionals Relationship to other parallel hw/sw Managing rapid change in hw/sw (roadmaps) Performance evaluation and cliffs Philosophy of faults and lack of precision Broader toolbox for computation / data structures Wedding graphics and GPGPU techniques