How Powerful are GPUs? Pat Hanrahan.

Transcription:

How Powerful are GPUs?
  Pat Hanrahan
  Computer Science Department, Stanford University
  Computer Forum 2007

Modern Graphics Pipeline
  Application -> Command -> Geometry -> Rasterization -> Texture -> Fragment -> Display

A Pitch from 5 Years Ago
  Cinematic games and media drive GPU market
  Current GPUs faster than CPUs (at graphics)
  Gap between the GPU and the CPU increasing
  Why?
    Efficiently use VLSI resources
    Programmable GPUs
    Stream processors
  Many applications map to stream processing
  Therefore, a $50 high-performance, massively parallel computer will soon ship with every PC
  Pat Hanrahan, circa 2002-2005

What Happened?
  Now
    AMD and Intel gave up on sequential CPUs with high clock rates and went multi-core (2-4)
    Gap between GPU and CPU stabilized
    GPUs are data parallel (64-128 cores)
    DX10 mandates unified graphics pipeline
    GPGPU: many algorithms implemented
  Future
    Two main types of processors
      CPU: fast sequential processor
      GPU: fast data parallel processor
    Hybrid CPU/GPU

Overview
  Current programmable GPUs
  Performance
  Programming model: Stream abstraction
  Applications
  How General?

Programmable GPUs

ATI R600 (X2X00)
  80 nm process
  ~700 million transistors
  64 4-wide unified shaders
  ~700 MHz clock
  512-bit GDDR memory
    GDDR3 @ 900 MHz = 115 GB/s
    GDDR4 @ 1100 MHz = 140 GB/s
  R300 not R600
  230 Watts

NVIDIA G80 (8800)
  90 nm TSMC process
  681 million transistors
  480 mm^2
  128 scalar processors
  1.3 GHz clock rate
  384-bit GDDR memory
    GDDR3 @ 900 MHz = 86.4 GB/s
  130 Watts
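The memory bandwidth figures on these two slides follow from bus width and memory clock. The sketch below reproduces that arithmetic in Python, assuming the quoted clock is doubled because GDDR transfers data on both clock edges and that 1 GB means 10^9 bytes. The helper name `peak_bandwidth_gb_s` is illustrative and does not come from the slides.

```python
def peak_bandwidth_gb_s(bus_width_bits, clock_mhz, transfers_per_clock=2):
    """Peak memory bandwidth in GB/s (1 GB = 1e9 bytes).

    transfers_per_clock=2 is an assumption: double data rate memory
    moves data on both the rising and falling clock edge.
    """
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * clock_mhz * 1e6 * transfers_per_clock / 1e9


if __name__ == "__main__":
    # (label, bus width in bits, memory clock in MHz) as quoted on the slides
    configs = [
        ("R600 GDDR3", 512, 900),
        ("R600 GDDR4", 512, 1100),
        ("G80 GDDR3", 384, 900),
    ]
    for label, width, clock in configs:
        print(f"{label}: {peak_bandwidth_gb_s(width, clock):.1f} GB/s")
```

Under these assumptions the three configurations come out at 115.2, 140.8 and 86.4 GB/s, which agrees with the rounded values on the slides.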

GeForce 8800 Series GPU
  [Block diagram: Host; Input Assembler; Vertex Thread, Geometry Thread, Rasterization, Pixel Thread; 8 x (TF, L1); Thread Processor; 6 x L2; 6 x FB]

Shader Model 4.0 Architecture
  [Diagram labels: Input, Parameters, Registers, Program, Textures, Output; values shown: 32 4-32-bit, 64K 32-bit, 32 4-32-bit, 64K insts, 8 4-32-bit]

Simple Graphics Pipeline

  # c[0-3]  = modelview projection (composite) matrix
  # c[4-7]  = modelview inverse transpose
  # c[32]   = eye-space light direction
  # c[33]   = constant eye-space half-angle vector
  # c[35].x = pre-multiplied diffuse light color & diffuse mat.
  # c[35].y = pre-multiplied ambient light color & diffuse mat.
  # c[36]   = specular color; c[38].x = specular power
  DP4 o[hpos].x, c[0], v[opos];        # Transform position.
  DP4 o[hpos].y, c[1], v[opos];
  DP4 o[hpos].z, c[2], v[opos];
  DP4 o[hpos].w, c[3], v[opos];
  DP3 R0.x, c[4], v[nrml];             # Transform normal.
  DP3 R0.y, c[5], v[nrml];
  DP3 R0.z, c[6], v[nrml];
  DP3 R1.x, c[32], R0;                 # R1.x = L DOT N'
  DP3 R1.y, c[33], R0;                 # R1.y = H DOT N'
  MOV R1.w, c[38].x;                   # R1.w = specular power
  LIT R2, R1;                          # Compute lighting
  MAD R3, c[35].x, R2.y, c[35].y;      # diffuse + ambient
  MAD o[col0].xyz, c[36], R2.z, R3;    # + specular
  END

G80 = Data Parallel Computer
  [Block diagram: Host; Input Assembler; Thread Execution Manager; 8 SIMD cores, each with a Parallel Data Cache; Load/store; Global Memory]
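To make the vertex program on the Simple Graphics Pipeline slide easier to follow, here is a rough Python transliteration of what it computes for one vertex. It is a sketch under assumptions, not the hardware's exact behavior: `lit` follows the usual description of the LIT instruction (diffuse is the clamped L.N, specular is the clamped H.N raised to the power and zeroed when L.N is not positive) and leaves out the exponent clamping real hardware applies. The function names and the dict-of-tuples layout for the constant registers are illustrative.

```python
def dot(a, b):
    """Dot product over as many components as the shorter operand has."""
    return sum(x * y for x, y in zip(a, b))


def lit(n_dot_l, n_dot_h, power):
    """LIT-style lighting coefficients: (1, diffuse, specular, 1)."""
    diffuse = max(n_dot_l, 0.0)
    specular = max(n_dot_h, 0.0) ** power if n_dot_l > 0.0 else 0.0
    return (1.0, diffuse, specular, 1.0)


def shade_vertex(c, opos, nrml):
    """c maps a constant-register index to a 4-tuple, as in the slide comments."""
    # DP4 x4: transform the object-space position by rows c[0..3].
    hpos = tuple(dot(c[i], opos) for i in range(4))
    # DP3 x3: transform the normal by the inverse-transpose rows c[4..6].
    n = tuple(dot(c[4 + i][:3], nrml) for i in range(3))
    n_dot_l = dot(c[32][:3], n)  # R1.x = L DOT N'
    n_dot_h = dot(c[33][:3], n)  # R1.y = H DOT N'
    _, diffuse, specular, _ = lit(n_dot_l, n_dot_h, c[38][0])
    # MAD R3: scalar swizzles broadcast, so every channel gets the same value.
    r3 = c[35][0] * diffuse + c[35][1]
    # MAD o[col0].xyz: add the specular color scaled by the specular term.
    col0 = tuple(c[36][k] * specular + r3 for k in range(3))
    return hpos, col0


if __name__ == "__main__":
    identity = {i: tuple(1.0 if j == i else 0.0 for j in range(4)) for i in range(4)}
    c = dict(identity)
    c.update({4 + i: identity[i] for i in range(3)})
    c[32] = (0.0, 0.0, 1.0, 0.0)   # light direction
    c[33] = (0.0, 0.0, 1.0, 0.0)   # half-angle vector
    c[35] = (0.5, 0.1, 0.0, 0.0)   # diffuse, ambient
    c[36] = (1.0, 1.0, 1.0, 0.0)   # specular color
    c[38] = (2.0, 0.0, 0.0, 0.0)   # specular power
    print(shade_vertex(c, (1.0, 2.0, 3.0, 1.0), (0.0, 0.0, 1.0)))
```

Every vertex runs the same short program on its own inputs with no dependence on other vertices, which is the property that lets the hardware treat shading as a data-parallel workload.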

G80 core
  [Diagram label: Parallel Data Cache]
  Each core
    8 functional units
    SIMD 16/32 warp
    8-10 stage pipeline
  Thread scheduler
    128-512 threads/core
  16 KB shared memory
  Total #threads/chip
    16 * 512 = 8K

GPU Multi-threading (version 1)
  Change threads each cycle (round robin)
  [Diagram: rows frag1, frag2, frag3, frag4; columns instr1, instr2, instr3]
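The round-robin scheme in version 1 can be pictured as a simple issue order: one instruction from each fragment thread in turn, then the next instruction from each. The Python sketch below only models that ordering and is not a model of the G80 scheduler. The function name and the constants `CORES` and `MAX_THREADS_PER_CORE` are illustrative, with values taken from the G80 core slide.

```python
CORES = 16                   # from the "Total #threads/chip" line
MAX_THREADS_PER_CORE = 512   # upper end of "128-512 threads/core"
TOTAL_THREADS = CORES * MAX_THREADS_PER_CORE


def round_robin_schedule(num_threads, instrs_per_thread):
    """Return the issue order as (thread, instruction) pairs, one per cycle."""
    schedule = []
    for instr in range(instrs_per_thread):
        for thread in range(num_threads):
            schedule.append((thread, instr))
    return schedule


if __name__ == "__main__":
    print("threads per chip:", TOTAL_THREADS)
    for cycle, (thread, instr) in enumerate(round_robin_schedule(4, 3)):
        print(f"cycle {cycle:2d}: frag{thread + 1} instr{instr + 1}")
```

With N threads in rotation, consecutive instructions of the same thread issue N cycles apart, so a pipeline latency of up to N cycles is covered by work from the other threads.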

GPU Multi-threading (version 2)
  Change thread after texture fetch/stall
  Run until stall at texture fetch
  [Diagram: frag1, frag2, frag3, frag4 (multiple instructions)]

8800GTX Peak Performance
  575 MHz * 128 processors * 2 flop/inst * 2 inst/clock
  MAD instruction
  = 332.8 GFLOPS
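The peak-performance product can be checked with a few lines of Python. As transcribed, the four factors multiply to 294.4 GFLOPS, whereas the 332.8 GFLOPS total equals the 1.3 GHz processor clock quoted on the G80 slide times 128 processors times 2 flops per MAD. Which reading the slide intended is not recoverable from the transcript, so the sketch simply evaluates both. The helper name `peak_gflops` is illustrative.

```python
def peak_gflops(clock_mhz, processors, flops_per_inst, insts_per_clock=1):
    """Peak throughput in GFLOPS for the given clock and issue assumptions."""
    return clock_mhz * 1e6 * processors * flops_per_inst * insts_per_clock / 1e9


# Factors exactly as they appear in the transcript.
as_transcribed = peak_gflops(575, 128, flops_per_inst=2, insts_per_clock=2)

# Assumption: 1.3 GHz processor clock, one MAD (2 flops) per clock.
at_processor_clock = peak_gflops(1300, 128, flops_per_inst=2)

if __name__ == "__main__":
    print(f"575 MHz * 128 * 2 * 2 = {as_transcribed:.1f} GFLOPS")
    print(f"1.3 GHz * 128 * 2     = {at_processor_clock:.1f} GFLOPS")
```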

Instructions Issue Rate
  http://graphics.stanford.edu/projects/gpubench/
  [Chart: ATI X1900XTX, NVIDIA 7900GTX]

Instructions Issue Rate
  http://graphics.stanford.edu/projects/gpubench/
  [Chart: NVIDIA 7900GTX, NVIDIA 8800GTX]

Measured BLAS Performance
  SAXPY
    X1900 (DX9): 6 GFlops
    X1900 (CTM): 6 GFlops
    8800GTX (DX9): 12 GFlops
  SGEMV
    X1900 (DX9): 4 GFlops
    X1900 (CTM): 6 GFlops
    8800GTX (DX9): 14 GFlops
  SGEMM
    X1900 (DX9): 30 GFlops
    X1900 (CTM): 120 GFlops
    8800GTX (DX9): 105 GFlops
    3 GHz Core 2: 40 GFlops

Programming Abstractions

Approach I: Run application using graphics library
  Graphics library-based programming models
    NVIDIA's Cg
    Microsoft's HLSL
    OpenGL Shading Language
  RapidMind
    Sh [McCool et al. 2004]

Approach II: Map application to parallel computer
  Communicating sequential processes (C)
    Threads: pthreads, Occam, UPC
    Message passing: MPI
  Data parallel programming
    APL, SETL, S, Fortran90, C* (lisp*), NESL
  Stream languages
    StreamIt, StreamC/KernelC
    MS Accelerator, CUDA, DPVM, PeakStream

Stream Programming Environment
  Collections stored in memory
    Multidimensional arrays (stencils)
    Graphs and meshes (topology)
  Data parallel operators
    Application: map
    Reductions: scan, reduce (fold)
    Communication: send, sort, gather, scatter
    Filter ( O < I ) and generate ( O > I )

Brook
  Ian Buck, PhD Thesis, Stanford University
  "Brook for GPUs: Stream computing on graphics hardware," I. Buck, T. Foley, D. Horn, J. Sugarman, K. Fatahalian, M. Houston, P. Hanrahan, SIGGRAPH 2004
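The data-parallel operators named on the Stream Programming Environment slide can be illustrated with small sequential Python stand-ins. This is only a sketch of what each operator means, not Brook's API. All function names are illustrative, and reading "O < I" and "O > I" as "fewer outputs than inputs" and "more outputs than inputs" is an interpretation of the slide.

```python
from functools import reduce
from itertools import accumulate
import operator


def stream_map(fn, xs):
    """Application: apply fn to every element independently."""
    return [fn(x) for x in xs]


def stream_reduce(xs):
    """Reduction (fold): combine all elements into a single sum."""
    return reduce(operator.add, xs, 0)


def stream_scan(xs):
    """Inclusive prefix sum: element i holds the sum of xs[0..i]."""
    return list(accumulate(xs))


def gather(xs, indices):
    """Read from computed addresses: out[k] = xs[indices[k]]."""
    return [xs[i] for i in indices]


def scatter(values, indices, size, fill=0):
    """Write to computed addresses: out[indices[k]] = values[k]."""
    out = [fill] * size
    for i, v in zip(indices, values):
        out[i] = v
    return out


def stream_filter(pred, xs):
    """Keep only matching elements, so the output is no longer than the input."""
    return [x for x in xs if pred(x)]


def generate(fn, xs):
    """Each input may emit several outputs, so the output can grow."""
    return [y for x in xs for y in fn(x)]


if __name__ == "__main__":
    xs = [3, 1, 4, 1, 5]
    print(stream_map(lambda x: 2 * x, xs))
    print(stream_reduce(xs), stream_scan(xs))
    print(gather(xs, [4, 0, 2]), scatter([5, 3, 4], [4, 0, 2], 5))
    print(stream_filter(lambda x: x > 2, xs), generate(lambda x: (x, x), [1, 2]))
```

On a GPU these operators run over many elements at once. The list comprehensions here only fix the semantics.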

Brook Example

  kernel void foo ( float a<>, float b<>, out float result<> )
  {
    result = a + b;
  }

  float a<100>;
  float b<100>;
  float c<100>;

  foo(a,b,c);

  // Equivalent sequential loop:
  for (i=0; i<100; i++)
    c[i] = a[i]+b[i];

Classical N-Body Simulation
  Stellar dynamics
    Gravitational acceleration
    Gravitational accel. + jerk
  Molecular dynamics
    Implicit solvent models
    Lennard-Jones
    Coulomb
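For the gravitational-acceleration case on the N-Body slide, the standard all-pairs formula is a_i = G * sum over j != i of m_j * (r_j - r_i) / |r_j - r_i|^3. The Python sketch below evaluates it directly. It is a plain sequential reference, not the Brook kernel used in the talk. The function name, the `G=1.0` default and the optional `softening` term are assumptions added for illustration.

```python
import math


def gravitational_accelerations(positions, masses, G=1.0, softening=0.0):
    """All-pairs gravitational acceleration for 3D point masses.

    positions: list of (x, y, z) tuples; masses: list of floats.
    softening (an assumption, default off) keeps close pairs from blowing up.
    """
    n = len(positions)
    accelerations = []
    for i in range(n):          # independent per body: the data-parallel "map"
        xi, yi, zi = positions[i]
        ax = ay = az = 0.0
        for j in range(n):      # per-body sum over all other bodies
            if i == j:
                continue
            dx = positions[j][0] - xi
            dy = positions[j][1] - yi
            dz = positions[j][2] - zi
            dist_sq = dx * dx + dy * dy + dz * dz + softening * softening
            scale = G * masses[j] / (dist_sq * math.sqrt(dist_sq))
            ax += scale * dx
            ay += scale * dy
            az += scale * dz
        accelerations.append((ax, ay, az))
    return accelerations


if __name__ == "__main__":
    bodies = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
    print(gravitational_accelerations(bodies, [1.0, 4.0]))
```

The outer loop has no dependence between iterations, so each body can be handled by its own thread. The inner loop is O(N) arithmetic per body on data that is read many times, which is why this kind of problem maps well to a stream processor.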

Folding@Home Performance
  Vijay Pande Group
  GROMACS on Brook
  GPU : CPU core = 40:1
    CPU: 3.0 GHz P4
    GPU: ATI X1900X

Current Statistics: March 19, 2007

  Client type       Current TFLOPS*   Current Processors
  Windows           150               157457
  Mac OS X/PPC      7                 8710
  Mac OS X/Intel    7                 2520
  Linux             34                24639
  GPU               40                682
  PS/3              26                877
  Total             223               1824132

  *TFLOPs is actual flops from software cores, not peak values
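One way to read the statistics is to divide each client type's TFLOPS by its processor count. The Python sketch below does that for the rows as transcribed. The resulting numbers are population averages over whatever hardware was active, not benchmark results, and the dictionary and function names are illustrative.

```python
# (current TFLOPS, current processors) per client type, as listed in the table
STATS = {
    "Windows": (150, 157457),
    "Mac OS X/PPC": (7, 8710),
    "Mac OS X/Intel": (7, 2520),
    "Linux": (34, 24639),
    "GPU": (40, 682),
    "PS/3": (26, 877),
}


def gflops_per_processor(tflops, processors):
    """Average sustained GFLOPS contributed by one processor of a client type."""
    return tflops * 1000.0 / processors


PER_PROCESSOR = {name: gflops_per_processor(*row) for name, row in STATS.items()}

if __name__ == "__main__":
    for name, value in PER_PROCESSOR.items():
        print(f"{name:15s} {value:8.2f} GFLOPS per processor")
```

On these figures a GPU client averages roughly 59 GFLOPS while a Windows client averages just under 1 GFLOPS, the same order of magnitude as the 40:1 GPU-to-CPU-core ratio quoted on the performance slide.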

Folding@Home GPU Cluster
  25 nodes
    Nforce4 SLI
    Dual core Opteron
    2x ATI X1900XTX
    Linux
  5 TFlops of folding power
  Not actual machine

Future

Summary
  Cinematic games and media drive GPU market
  GPU evolving into a high throughput processor
    Data parallel multi-threaded machine
    Many applications map to GPUs
  Processor of the future likely to be a CPU/GPU
    Small number of traditional CPU cores
    Large number of GPU cores

Opportunities
  Current hardware not optimal
    Incredible opportunity for architectural innovation
  Current software environment immature
    Incredible opportunity for reinventing parallel computing software, programming environments and languages

Acknowledgements
  Bill Dally, Eric Darve, Vijay Pande, Bill Mark, John Owens, Kurt Akeley, Mark Horowitz
  Ian Buck, Mattan Erez, Kayvon Fatahalian, Tim Foley, Daniel Horn, Michael Houston, Jeremy Sugarman
  Funding: DARPA, DOE, ATI, IBM, NVIDIA, SONY

Questions?