Real-Time Graphics Architecture


Real-Time Graphics Architecture
Lecture 9: Programming GPUs
Kurt Akeley and Pat Hanrahan
http://graphics.stanford.edu/cs448-07-spring/

Programming GPUs: Outline
- Caveat
- History
- Contemporary GPUs: programming model, implementation
- GPGPU: non-graphical computation using GPUs

Caveat: This is not a programming tutorial. You will gain some understanding of the architectural issues that underlie programmable GPUs.

GPU programmability is not new
Essentially all GPUs are (and have been) programmable. The questions are: who gets to program them, and when?
Example systems from SGI's early history:
- Geometry Engine (1981)
- GE4 (1987)
- VGXT Image Engine (1991)
- Dynamic code generation

Geometry Engine (1981)
- Four 32-bit floating-point units ("Jim Clark" format: 2's-complement e and m)
- Programmed using John Hennessy's SLIM (Stanford Language for the Implementation of Microcode)
- ~500-term PLA, hard-wired at fabrication time
- Programmed by Jim Clark and Marc Hannah

SGI GE4 (1987)
- Weitek 3332 floating-point ALU
- ~4K ~64-bit lines of microcode in SRAM
- Programmed in low-level assembly code
- No run-time code changes were made (though they could have been)
- Programmed by several SGI employees, including Kurt
(Image: wire-wrapped GE4 prototype)

VGXT Image Engine (1991)
- Tiny code space, used to implement texture filter and application, and frame buffer operations
- Code overlays loaded on-the-fly
- Code sequences generated prior to run time; fields (e.g., blend function) filled in at run time

Dynamic code generation
- Sun graphics card, Bresenham lines (Abrash talk)

Application programmability is not new
GPUs have a long history of application programmability. Example systems:
- Ikonas graphics system (1978)
- Trancept TAAC-1 (1987)
- Multi-pass vector processing (2000)
- Register combiners (2001)

Ikonas graphics system (1978)
- Multi-board system
- 32-bit integer processor (16x16 multiply)
- 16-bit graphics processor (16-pixel parallel render)
- Hardware Z-buffer
(Image credit: www.virhistory.com/ikonas/ikonas.html)

Trancept TAAC-1 (1987)
- C-language microcode compiler, 200-bit microcode word
- Sold to Sun Microsystems in 1987
- Customers got a C-language microcode compiler and were strongly encouraged (!) to develop their own code -- Nick England, www.virhistory.com/trancept/trancept.html
(Image credit: www.virhistory.com/trancept/trancept.html)

Multi-pass vector processing (2000)
- Treat OpenGL as a very long instruction word
- Compute vector style: apply each instruction to all pixels
- Build up the final image in many passes
- Peercy, Olano, Airey, and Ungar, Interactive Multi-Pass Programmable Shading, SIGGRAPH 2000

    #include "marble.h"
    surface marble() {
        varying color a;
        uniform string tx;
        uniform float x;
        x = 0.5;
        tx = "noisebw.tx";
        FB = texture(tx, scale(x, x, x));
        repeat(3) {
            x = x * 0.5;
            FB *= 0.5;
            FB += texture(tx, scale(x, x, x));
        }
        FB = lookup(FB, tab);
        a = FB;
        FB = diffuse;
        FB *= a;
        FB += environment("env");
    }

(Figure adapted from the SIGGRAPH paper)

Register combiners (2001)
- ARB_texture_env_combine: new texture environment function COMBINE_ARB allows texture combiner operations, including REPLACE, MODULATE, ADD, ..., INTERPOLATE
- ARB_texture_env_crossbar: adds the capability to use the texture color from other texture units as sources to the COMBINE_ARB environment function
- ARB_texture_env_dot3: adds a dot-product operation to the texture combiner operations
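The multi-pass idea can be sketched in plain Python: the framebuffer FB acts as a single vector register, and each rendering pass applies one operation to every pixel. The noise function and pass structure below are illustrative stand-ins, not the actual Peercy et al. system.

```python
import math

def texture(tex, scale, n):
    # Stand-in for a texture fetch: one value per pixel.
    return [tex(i * scale) for i in range(n)]

def run_marble_skeleton(n=4):
    noise = lambda u: 0.5 + 0.5 * math.sin(u)  # placeholder "noisebw" texture
    x = 0.5
    FB = texture(noise, x, n)                  # first pass: base octave
    for _ in range(3):                         # three more passes (repeat(3))
        x *= 0.5
        FB = [v * 0.5 for v in FB]             # FB *= 0.5
        octave = texture(noise, x, n)
        FB = [v + o for v, o in zip(FB, octave)]  # FB += texture(...)
    return FB
```

Each loop iteration corresponds to one full-screen rendering pass, which is why the paper's approach is bandwidth-hungry: every `FB` update rereads and rewrites the whole framebuffer.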

Summary of history
- General-purpose processor with attached units: could not compete in performance/cost ratio ("Heathkit" marketing)
- Multi-pass vector processing: low performance; wasteful use of bandwidth (against the trends)
- Register combiners: Rube Goldberg design quality; hopelessly increasing complexity
- Programmable pipeline stages: not accessible to application programmers. The reason (at SGI) was being unable to keep the programming model consistent from product to product, and unable to use software to hide the inconsistencies

Contemporary GPU Programming Model
Application-programmable pipeline stages

Programming model (Direct3D 10 pipeline)
- Traditional graphics pipeline (except location of the geometry shader)
- Three application-programmable stages operate on independent vertexes, primitives, and pixel fragments
- All stages share the same programming model and (most) resources
- Single shared memory, but memory objects remain distinct
(Diagram: Input Assembler -> Vertex Shader -> Geometry Shader -> Rasterizer -> Pixel Shader -> Output Merger -> Target; each application-programmable stage reads samplers, textures, and constants from the shared memory)

Direct3D 10 programmable core
- Complete instruction set: floating-point operations, integer operations (new), branching operations
- Unlimited code length
- Vector instructions
- Lots of temporary state
- Independent operation on each vertex, each primitive, or each pixel fragment
(Diagram: input registers (16, 4x32b) from the previous stage; access to textures (128), samplers (16), temp registers (4096, 4x32b), and constant buffers (16, 4096x32b); FP unit, integer ALU, and special-function unit, all 4x32b; output registers (16, 4x32b) to the next stage)

Vector instructions
- 4-component vector arithmetic is typical of graphics
- But as shaders get longer, an increasing fraction of their code is scalar
- Some GPUs support dual-issue split vectors

Programming interface
- Assembly code: low level and error prone; not available in Direct3D 10
- C-like shading languages: Cg (layered above assembly language); HLSL and the OpenGL Shading Language (intrinsic)
- Fixed-function stages: lost when an application-defined shader is used; not included in Direct3D 10; were to be omitted from OpenGL 2.0

Vertex transformation (ARB_vertex_program):

    PARAM mat[4] = { state.matrix.modelview };
    DP4 result.x, vertex, mat[0];
    DP4 result.y, vertex, mat[1];
    DP4 result.z, vertex, mat[2];
    DP4 result.w, vertex, mat[3];

Vertex transformation (Cg):

    void transform(float4 position      : POSITION,
                   out float4 oposition : POSITION,
                   uniform float4x4 modelview)
    {
        oposition = mul(modelview, position);
    }
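The four DP4 instructions in the ARB_vertex_program listing are just a 4x4 matrix-vector multiply: each DP4 dots the vertex with one row of the modelview matrix. A minimal Python sketch (the translation matrix is a made-up example):

```python
def dp4(a, b):
    # Four-component dot product, as the DP4 instruction computes.
    return sum(x * y for x, y in zip(a, b))

def transform(vertex, mat):
    # mat holds the four rows of the modelview matrix (PARAM mat[4]);
    # each dp4 writes one component of the result (result.x .. result.w).
    return [dp4(vertex, row) for row in mat]

# Example: a translation by (5, 6, 7), stored row by row.
translate = [[1, 0, 0, 5],
             [0, 1, 0, 6],
             [0, 0, 1, 7],
             [0, 0, 0, 1]]
```

`transform([0, 0, 0, 1], translate)` moves the origin to `[5, 6, 7, 1]`, matching what the assembly computes per vertex.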

Parallel coding complexity is minimized
Complexities are handled by the implementation:
- Dependencies, ordering, sorting
- Scalability of computation and bandwidth
- Load balancing (much more implementation work)
All threads are independent; message passing is not required -- and not allowed.

Parallel coding complexity is minimized (continued)
Correctness constraints include:
- Cannot modify textures directly from shaders
- Cannot bind a target as both source and sink (texture and render target; vertex buffer and stream output)
- Cannot union different data formats

Performance considerations
The programmer can largely ignore:
- Memory latency
- Floating-point cost (same as integer)
- Special function cost
- Texture filtering cost
The programmer should pay attention to:
- Divergent branching
- Number of registers used
- Size of storage formats (bandwidth)
- Re-specification of Z values (defeats Z cull)

Contemporary GPU Implementation

GeForce 7800 parallelism
(Figure courtesy of NVIDIA: GeForce 7800 block diagram -- host/FW/VTF, cull/clip/setup, shader instruction dispatch, Z-cull, L2 texture cache, fragment crossbar, and four memory partitions, each with its own DRAMs.)

SIMD control
SIMD = Single Instruction, Multiple Data. One or more SIMD blocks; each block includes many data paths controlled by a single sequencer.
SIMD benefits:
- Reduced circuitry
- Reduced instruction-fetch bandwidth
- Synchronization of events
- Better texture cache utilization
SIMD liabilities:
- Branch divergence reduces performance
- Branch convergence is complicated to implement
(Diagrams: a generic SIMD block -- one sequencer controlling several data paths -- and the Geometry Engine as an example.)
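The branch-divergence liability can be modeled in a few lines: a SIMD block's single sequencer must execute a side of an if/else whenever any lane takes it, masking off the other lanes, so a divergent block pays for both sides. The instruction costs below are made-up numbers for illustration.

```python
def simd_branch_cost(lane_conditions, then_cost, else_cost):
    """Cycles a SIMD block spends on an if/else, given each lane's branch
    outcome. The sequencer runs a side whenever ANY lane takes it."""
    cost = 0
    if any(lane_conditions):
        cost += then_cost        # at least one lane takes the 'then' side
    if not all(lane_conditions):
        cost += else_cost        # at least one lane takes the 'else' side
    return cost
```

When all lanes agree (high branching coherence) the block pays for one side only; a single disagreeing lane makes the whole block pay for both.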

Multiple threads hide memory latency
- The scheduler maintains a queue of ready-to-run threads; when a thread blocks, another is started
- Blocking determination may be simple (any attempted memory access) or advanced (block only on data dependencies)
- Thread context storage is significant: the product of the number of processors, threads per processor, and state per thread (registers, pc, ...)
- A fixed-size memory pool may be insufficient for large thread contexts: large state per thread means too few threads per processor
- Local registers may not be free

Latency hiding consumes memory
Latency hiding mechanisms consume memory:
- Cache (CPU, ~texture)
- Fragment FIFO (texture prefetching)
- Shader context (threaded execution)
GPUs devote less die area to cache than CPUs, but the real question is: what area is devoted to latency-hiding memory? Hint: ask the invited experts.
(Diagram from lecture 7: rasterizer -> FIFO -> texture filter -> texture apply, with a request FIFO and reorder buffer between the pipeline and texture memory.)
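The context-storage trade-off above is simple arithmetic: with a fixed on-chip pool, heavier thread state means fewer threads in flight and therefore less latency hiding. All numbers in this sketch (pool size, bytes per register, per-thread overhead) are hypothetical.

```python
def threads_per_processor(pool_bytes, regs_per_thread,
                          bytes_per_reg=16, pc_bytes=8):
    """How many thread contexts fit in a fixed on-chip pool.
    16 bytes per register assumes 4x32b registers; the per-thread
    overhead (pc etc.) is a made-up constant."""
    state_per_thread = regs_per_thread * bytes_per_reg + pc_bytes
    return pool_bytes // state_per_thread

# With a hypothetical 32 KB pool per processor:
light = threads_per_processor(32 * 1024, 16)   # many threads in flight
heavy = threads_per_processor(32 * 1024, 64)   # far fewer threads
```

A shader that quadruples its register use here cuts the resident thread count by roughly 4x, which is why register count appears on the "pay attention to" list earlier.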

GeForce 8800 thread scheduling
- Threads are scheduled in SIMD blocks: 4 threads per quad (2x2 fragment unit), 16 threads per warp (NVIDIA GeForce 8800)
- Thread scheduling is critical and complicated; David Kirk's assessment of 8800 load balancing: "3x more difficult overall"
- Sort-last-fragment, with load balancing for vertex processors, geometry (primitive) processors, and pixel fragment processors
- Reference: Mike Shebanow, ECE 498 AL: Programming Massively Parallel Computers, Lecture 12

Thread scheduling complexity
There are two parts to scheduling decisions:
- What class of thread should be run? (vertex, primitive, fragment)
- When should a thread (group) be blocked? When a memory access is attempted, or when a data dependency is reached?
Data-dependent blocking is worthwhile because it conserves thread-context memory space. This has been an ATI advantage over NVIDIA. Did it contribute to the 8800 scheduling difficulty?

Abstraction distance
The actual instruction set may not match the advertised one:
- This was true even when assembly language was supported
- Complex post-assembly run-time optimization
- Layered languages (like Cg): 2-layer optimization
- Even easier with Direct3D 10
- Example: use a scalar processor to emulate vector operations
A large abstraction distance wasn't practical in the 90s. CPUs are much faster now, and software technology has come a long way.

Why isn't the Output Merger programmable?
- The OM is highly optimized; memory access is the GPU limiter in many cases
- The OM has data hazards
- All programmable interfaces are memory read-only
(Diagram: the Direct3D 10 pipeline again, with the fixed-function Output Merger writing to the render target.)

GPGPU
Performance motivates interest in GPGPU. Roughly a 7x performance advantage to GPUs today:
- 47 GFLOPS: Intel Core 2 Extreme X6800
- 345 GFLOPS: NVIDIA GeForce 8800 GTX
- 350 GFLOPS: ATI X1900

Rate-of-change of performance is the key
CPU and GPU rates differ:
- GPU: ~2x CAGR
- CPU: <1.5x CAGR, recently much less
Will this 10x/decade disparity persist in the multi-core era?
(Chart: GFLOPS over time. G80 = GeForce 8800 GTX, G71 = GeForce 7900 GTX, G70 = GeForce 7800 GTX, NV40 = GeForce 6800 Ultra, NV35 = GeForce FX 5950 Ultra, NV30 = GeForce FX 5800. NVIDIA Corporation 2006.)

Additional GPGPU motivations
- Visible success stories: Folding@home gets 20-40x speedup over PCs; matrix multiply achieves 120 GFLOPS on an ATI X1900
- Ubiquity of GPU hardware: every PC has a GPU of some kind -- a first for an attached hardware accelerator
- Evolving GPU architecture: rich instruction set, highly parallel
- Improving GPU-CPU data transfer rates
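The growth-rate gap compounds. Taking the slide's figures literally, a short calculation shows why the per-decade disparity matters ("10x/decade" is a round figure; at exactly 2x vs. 1.5x the compounded gap is somewhat larger):

```python
def performance_ratio(gpu_cagr, cpu_cagr, years):
    # Compounded performance gap after `years` of differing annual
    # growth rates (CAGR expressed as a yearly multiplier).
    return (gpu_cagr / cpu_cagr) ** years

decade_gap = performance_ratio(2.0, 1.5, 10)  # (4/3)^10, roughly 17.8x
```

With CPU growth "recently much less" than 1.5x, the same arithmetic widens the gap further, which is exactly the slide's open question about the multi-core era.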

GPUs are high-performance multiprocessors
(Figure: GeForce 8800 block diagram, NVIDIA Corporation 2006 -- host, data assembler, vertex/geometry/pixel thread issue, setup/raster/Z-cull, clusters of streaming processors (SP) with texture fetch (TF) units and L1 caches, a thread processor, and six L2 cache / frame buffer (FB) partitions.)

Scatter, Gather, and Reduction
Frame buffer mindset: a fragment is associated with a specific region of memory.

                Early programmable GPUs (2001)    Today's GPUs (2007)
    Gather      Texture fetch                     Texture fetch
    Scatter     --                                Uses fragment sort
    Reduction   --                                Occlusion query
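The three access patterns in the table, written out in plain Python. On a GPU, gather maps to a (dependent) texture fetch; scatter and reduction had no direct analog on early programmable hardware, which is why those table cells are empty for 2001.

```python
def gather(data, idx):
    # out[i] = data[idx[i]] -- on a GPU, a dependent texture fetch.
    return [data[i] for i in idx]

def scatter(values, idx, size, fill=0):
    # out[idx[i]] = values[i] -- emulated on 2007 GPUs via a fragment sort.
    out = [fill] * size
    for v, i in zip(values, idx):
        out[i] = v
    return out

def reduction(data):
    # Collapse many values to one -- e.g., an occlusion query counting
    # the fragments that pass the depth test.
    return sum(data)
```

Gather reads from computed addresses, scatter writes to them; the frame-buffer mindset (a fragment lands at a fixed memory location) is exactly what makes scatter awkward on graphics pipelines.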

GPGPU concerns
- Application specificity. Requires: high arithmetic intensity (compute / bandwidth), high branching coherence, and the ability to decompose work into 1000s of threads
- Computation quality: no memory error detection/correction, no double precision, incomplete (but improving) IEEE single precision
- Possible end of the divergence of CPU and GPU performance
- Abysmal history of attached-processor success

How will CPUs and GPUs coexist?
Possible outcomes:
- CPU instruction set additions
- Heterogeneous multi-core, with shared or differentiated memory
- Generalized GPUs
What do/will we mean by "GPU"? Anything graphics related? Massively multi-threaded to hide memory latency?

Summary
- Long history of GPU programmability, but the current situation (application-programmable pipeline stages) is new
- Current GPUs are easily programmed (for graphics) and give programmers amazing power
- Current GPUs are high-performance multiprocessors
- Exciting things are happening in computer architecture, and GPUs will play an important role

Suggested readings
- Wen-Mei Hwu and David Kirk, ECE 498 AL: Programming Massively Parallel Computers, http://courses.ece.uiuc.edu/ece498/al/index.html
- Peercy, Olano, Airey, and Ungar, Interactive Multi-Pass Programmable Shading, SIGGRAPH 2000.
- Fernando, GPU Gems, Addison Wesley, 2004.
- Pharr and Fernando, GPU Gems 2, Addison Wesley, 2005.
