GRAPHICS HARDWARE. Niels Joubert, 4th August 2010, CS147


Today: Rendering Pipeline (Enabling Real-Time Graphics), History, Architecture, GPGPU Programming, and the Latest Developments.

RENDERING PIPELINE Real-Time Graphics

Rendering Pipeline: Vertex Processing. Vertices are transformed into screen space.

Rendering Pipeline: Vertex Processing. Vertices are transformed into screen space. Each vertex is transformed independently, so the work can be spread across many parallel vertex processors.
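To make the per-vertex independence concrete, here is a minimal hypothetical sketch (not from the slides), written as a CUDA kernel: one thread transforms one vertex by a 4x4 matrix, with no dependence on any other vertex. The names (transformVertices, mvp) are illustrative only.

// Hypothetical sketch: each thread transforms one vertex independently.
__global__ void transformVertices(const float4* in, float4* out,
                                  const float* mvp /* 4x4 row-major matrix */,
                                  int numVertices)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numVertices) return;
    float4 v = in[i];
    out[i] = make_float4(
        mvp[0]*v.x  + mvp[1]*v.y  + mvp[2]*v.z  + mvp[3]*v.w,
        mvp[4]*v.x  + mvp[5]*v.y  + mvp[6]*v.z  + mvp[7]*v.w,
        mvp[8]*v.x  + mvp[9]*v.y  + mvp[10]*v.z + mvp[11]*v.w,
        mvp[12]*v.x + mvp[13]*v.y + mvp[14]*v.z + mvp[15]*v.w);
}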

Rendering Pipeline: Primitive Processing. Vertices are organized into primitives; primitives are clipped and culled.

Rendering Pipeline: Rasterization. Primitives are rasterized into pixel fragments.

Rendering Pipeline: Rasterization. Primitives are rasterized into pixel fragments. Each primitive is rasterized independently!

Rendering Pipeline: Fragment Processing. Fragments are shaded to compute a color at a pixel.

Rendering Pipeline: Fragment Processing. Fragments are shaded to compute a color at a pixel. Each fragment is shaded independently!

Rendering Pipeline: Pixel Operations. Fragments are blended into the framebuffer; the Z-buffer determines visibility.

Rendering Pipeline, revisited: the framebuffer is a memory location with aggregation ability. All fragments are handled independently, yet many fragments may end up on the same pixel. Memory conflicts when writing to the framebuffer? Revisit this later! [Synchronization / Atomics]
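A hypothetical CUDA sketch of that conflict (not from the slides): if many threads add their fragment's contribution to the same pixel, an unsynchronized read-modify-write loses updates, so an atomic operation is used instead. The names (accumulate, pixelIndex, contribution) are illustrative.

// Hypothetical sketch: fragments are processed independently, but writes to
// a shared pixel must be synchronized. atomicAdd serializes conflicting writes.
// (float atomicAdd requires a Fermi-class GPU or newer.)
__global__ void accumulate(const int* pixelIndex, const float* contribution,
                           float* framebuffer, int numFragments)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numFragments) return;
    atomicAdd(&framebuffer[pixelIndex[i]], contribution[i]);
}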

Rendering Pipeline: Pipeline Entities

Rendering Pipeline: Pipeline Entities (a producer-consumer chain).
Vertices: Vertex Generation (vertex buffers) and Vertex Processing (vertex shader: transformations, 3D to screen space, position/color/texture-coordinate manipulations).
Primitives: Primitive Generation (polygon data) and Primitive Processing (geometry shader: interpolate variables, add or remove vertices).
Fragments: Fragment Generation (interpolate & rasterize) and Fragment Processing (fragment shader: lighting effects).
Pixels: Pixel Operations (Z-buffer, blend).

HISTORY How we got to where we are

History: 1963 Sketchpad The first GUI.

History: 1972 Xerox Alto. A personal computer with mouse, keyboard, files, folders, and draggable windows.

History: 1977 Apple II, Graphics for the Masses. Launched 1977: 1 MHz 6502 processor, 4KB RAM (expandable to 48KB for $1340). Did it have a graphics card?

History: 1977 Apple II. No, there is no graphics card. CPU: writes pixel data to RAM. Video hardware: reads RAM in scanlines, generates NTSC.

History: 1982 The Geometry Engine. "A VLSI Geometry System for Graphics", James H. Clark, Computer Systems Lab, Stanford University & Silicon Graphics, Inc. The Geometry System is a computing system for computer graphics constructed from a basic building block, the Geometry Engine.

History: 1982 The Geometry Engine. "A VLSI Geometry System for Graphics", James H. Clark: 5 million FLOPS.


History: 1982 The Geometry Engine Instruction Set:

History: 1982 IRIS - Integrated Raster Imaging System. Silicon Graphics Inc.'s first real-time 3D rendering engine: CPU, Multibus, Geometry System, Raster System, Ethernet.

History: 1985 BLIT & Commodore Amiga. BLIT = BLock Image Transfer: a co-processor / logic block for rapid movement and modification of memory blocks. The Commodore Amiga had a complete blitter in hardware, in a separate graphics processor.

History: 1993 Silicon Graphics RealityEngine. "Its target capability is the rendering of lighted, smooth shaded, depth buffered, texture mapped, antialiased triangles." - RealityEngine Graphics, K. Akeley, 1993. The full pipeline in hardware: Vertex Generation, Vertex Processing, Primitive Generation, Primitive Processing, Fragment Generation, Fragment Processing, Pixel Operations.

History: 1992 OpenGL 1.0 & 1995 Direct3D. Silicon Graphics' proprietary IRIS GL API was the state of the art; OpenGL 1.0 emerged as an open standard derived from IRIS GL. Standardised hardware access means device drivers become important. HUGE success: allows HW to evolve and SW to decouple.

History: 1998 NVidia RIVA TNT. CPU: Vertex Generation, Vertex Processing, Primitive Generation, Primitive Processing. GPU: Clip/Cull/Rasterize (Fragment Generation), Texture Map (Fragment Processing), Pixel Ops / Framebuffer (Pixel Operations).

History: 2002 Direct3D 9, OpenGL 2.0 (ATI Radeon 9700). The whole pipeline moves onto the GPU: banks of programmable vertex processors and texture units sit alongside fixed-function Clip / Cull / Rasterize and Pixel Ops / Framebuffer stages.

History: 2006 Unified Shading GPUs (GeForce G80). A scheduler distributes vertex, primitive, and fragment work across a single pool of identical programmable cores, with texture units and Pixel Ops units alongside the fixed-function Clip / Cull / Rasterize stage.

GRAPHICS ARCHITECTURE GPUs as Throughput-Oriented Hardware

Architecture: CPU Evolution. Run a single stream of instructions REALLY FAST: long, deep pipelines; branch prediction & speculative execution; a hierarchy of caches; instruction-level parallelism (ILP).

Architecture G5 (2003)

Architecture: G5 (2003) vs. Fermi (2010).
G5 (2003): 2 GHz, 1 GHz FSB, 4GB RAM, 2 FPUs, 50 million transistors, up to 215 instructions in flight, branch prediction.
Fermi (2010): 1.4 GHz, 1.8 GHz FSB, 4GB RAM (1.5GB on the GTX), 960 FPUs, 3 billion transistors, no branch prediction.

Architecture: Moore's Law. "The number of transistors on an integrated circuit doubles every two years." - Gordon E. Moore. What matters: how we use these transistors.

Architecture: Buy Performance with Power

Architecture: Serial Performance Scaling. We cannot continue to scale MHz: there is no 10 GHz chip. We cannot increase power consumption per area: we're melting chips. We can continue to increase transistor count.

Architecture: Using Transistors. Instruction-level parallelism: out-of-order execution, speculation, branch prediction. Data-level parallelism: vector units, SIMD execution (SSE, AVX, Cell SPE, Clearspeed). Thread-level parallelism: multithreading, multicore, manycore.

Architecture: Why Massively Parallel Processing? Computational power of graphics processing units.

Architecture: Why Massively Parallel Processing? Memory throughput of graphics processing units.

Architecture: Why Massively Parallel Processing? How can this be? Remove the transistors dedicated to making a single instruction stream fast: out-of-order execution, speculation, caches, branch prediction. Spend them on more memory bandwidth and more compute; nothing else on the card; simple design. CPU: minimize latency of an individual thread. GPU: maximize throughput of all threads.

From Shader Code to a Teraflop: How Shader Cores Work Kayvon Fatahalian Stanford University SIGGRAPH 2009: Beyond Programmable Shading: http://s09.idav.ucdavis.edu/

What's in a GPU? Shader cores, texture units, Input Assembly, Rasterizer, Output Blend, Video Decode, and a Work Distributor (HW or SW?). A heterogeneous chip multi-processor, highly tuned for graphics.

A diffuse reflectance shader (HLSL):
sampler mySamp;
Texture2D<float3> myTex;
float3 lightDir;
float4 diffuseShader(float3 norm, float2 uv)
{
  float3 kd;
  kd = myTex.Sample(mySamp, uv);
  kd *= clamp(dot(lightDir, norm), 0.0, 1.0);
  return float4(kd, 1.0);
}
Independent, but no explicit parallelism.

Compile shader: one unshaded fragment input record goes in, one shaded fragment output record comes out. The compiled shader is a short sequence of scalar instructions:
<diffuseShader>:
sample r0, v4, t0, s0
mul  r3, v0, cb0[0]
madd r3, v1, cb0[1], r3
madd r3, v2, cb0[2], r3
clmp r3, r3, l(0.0), l(1.0)
mul  o0, r0, r3
mul  o1, r1, r3
mul  o2, r2, r3
mov  o3, l(1.0)

Execute shader: a core's Fetch/Decode unit, ALU (Execute), and Execution Context step through the compiled instructions for one fragment.

CPU-style cores: Fetch/Decode, ALU (Execute), and Execution Context, plus out-of-order control logic, a fancy branch predictor, a memory pre-fetcher, and a data cache (a big one).

Slimming down. Idea #1: remove the components that help a single instruction stream run fast, keeping only Fetch/Decode, ALU (Execute), and Execution Context.

Two cores (two fragments in parallel): fragment 1 and fragment 2 each run on their own core, with its own Fetch/Decode, ALU (Execute), and Execution Context, executing its own copy of the compiled shader.

Four cores (four fragments in parallel), each with Fetch/Decode, ALU (Execute), and Execution Context.

Sixteen cores (sixteen fragments in parallel): 16 cores = 16 simultaneous instruction streams.

Instruction stream sharing: but many fragments should be able to share an instruction stream!

Recall: the simple processing core, with Fetch/Decode, ALU (Execute), and Execution Context.

Add ALUs. Idea #2: amortize the cost/complexity of managing an instruction stream across many ALUs (SIMD processing). One Fetch/Decode unit feeds ALU 1 through ALU 8, each with its own execution context, plus shared context data.

Modifying the shader. Original compiled shader: processes one fragment using scalar ops on scalar registers.

Modifying the shader. New compiled shader: processes 8 fragments using vector ops on vector registers:
<VEC8_diffuseShader>:
VEC8_sample vec_r0, vec_v4, t0, vec_s0
VEC8_mul  vec_r3, vec_v0, cb0[0]
VEC8_madd vec_r3, vec_v1, cb0[1], vec_r3
VEC8_madd vec_r3, vec_v2, cb0[2], vec_r3
VEC8_clmp vec_r3, vec_r3, l(0.0), l(1.0)
VEC8_mul  vec_o0, vec_r0, vec_r3
VEC8_mul  vec_o1, vec_r1, vec_r3
VEC8_mul  vec_o2, vec_r2, vec_r3
VEC8_mov  vec_o3, l(1.0)

Modifying the shader: fragments 1 through 8 are processed together, one per ALU lane, by the vectorized shader.

128 fragments in parallel: 16 cores = 128 ALUs = 16 simultaneous instruction streams.

128 [vertices / primitives / fragments / CUDA threads / OpenCL work items / compute shader threads] in parallel.

What is the problem?

But what about branches? Across the 8 ALU lanes (time running in clocks), a condition may be true (T) for some fragments and false (F) for others:
<unconditional shader code>
if (x > 0) {
  y = pow(x, exp);
  y *= Ks;
  refl = y + Ka;
} else {
  x = 0;
  refl = Ka;
}
<resume unconditional shader code>

But what about branches? When the lanes diverge, both sides of the branch are executed in turn with the non-participating lanes masked off. Not all ALUs do useful work! Worst case: 1/8 performance.
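The same effect shows up on the compute side. A hypothetical CUDA sketch (not from the slides): threads of a warp share one instruction stream, so when the condition below differs across lanes, the two paths execute one after the other with inactive lanes masked. The names (divergentShade, Ks, Ka, expo) are illustrative.

// Hypothetical sketch of SIMD branch divergence in CUDA terms.
__global__ void divergentShade(const float* x, float* refl,
                               float Ks, float Ka, float expo, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (x[i] > 0.0f) {
        // Lanes where the condition is true run this path...
        float y = powf(x[i], expo);
        refl[i] = y * Ks + Ka;
    } else {
        // ...then lanes where it is false run this one; a fully diverged
        // warp pays for both paths (worst case 1/warp-width useful work).
        refl[i] = Ka;
    }
}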

Plenty more intricacies! No time now - see the Beyond Programmable Shading SIGGRAPH talk. Think of a GPU as a multi-core processor optimized for maximum throughput: many SIMD cores working together.

An efficient GPU workload: thousands of independent pieces of work; uses many ALUs on many cores; amenable to instruction stream sharing (uses SIMD instructions); compute-heavy, with lots of math for each memory access (see the sketch below).
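As an illustration of the last point, here is a hypothetical compute-heavy kernel (not from the slides): it reads one value per thread and then performs many arithmetic operations on it, so the ALUs stay busy rather than waiting on memory. The names (computeHeavy, iters) are illustrative.

// Hypothetical illustration of high arithmetic intensity:
// one memory load per thread, many math operations per load.
__global__ void computeHeavy(const float* in, float* out, int n, int iters)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float x = in[i];                    // one global-memory read per thread
    for (int k = 0; k < iters; ++k)     // lots of math per byte moved
        x = x * 1.0001f + 0.5f * sinf(x);
    out[i] = x;                         // one global-memory write per thread
}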

GF100 Architecture: 30 SMs on the GTX480. Resources (per SM): 2 x 16 cores, 16 load/store units, 4 special function units (sin/cos/sqrt), 16 double-precision units. 1.44 TFLOPS.

GF100 Architecture. Mental model: on every clock cycle you can assign an instruction to two of these resources. Resources (per SM): 2 x 16 cores, 16 load/store units, 4 special function units, 16 double-precision units.

GPGPU General Purpose Computing on GPUs

GPGPU: CUDA Stream Programming. C/C++ extended with: kernels (a function executed N times in parallel), CPU/GPU synchronization, and GPU memory management.
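A minimal host-side sketch of how those three extensions fit together (not from the slides). It assumes the vecAdd kernel shown on the next slides; the helper name runVecAdd is hypothetical and error checking is omitted.

// Hypothetical host-side helper: GPU memory management, a kernel launch,
// and CPU/GPU synchronization around the vecAdd kernel from the next slides.
void runVecAdd(const float* h_A, const float* h_B, float* h_C, int N)
{
    size_t bytes = N * sizeof(float);
    float *d_A, *d_B, *d_C;
    cudaMalloc((void**)&d_A, bytes);                      // GPU memory management
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);  // copy inputs to the GPU
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);                // kernel: N threads in parallel
    cudaDeviceSynchronize();                              // CPU/GPU synchronization
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);  // copy the result back
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}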

GPGPU CUDA Stream Programming

Example: Vector Addition Kernel (device code and host code):
// Compute vector sum C = A+B
// Each thread performs one pair-wise addition
__global__ void vecAdd(float* A, float* B, float* C)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    C[i] = A[i] + B[i];
}

// Host code
int main()
{
    // Run grid of N/256 blocks of 256 threads each
    vecAdd<<<N/256, 256>>>(d_A, d_B, d_C);
}

Kernel Variations and Output:
__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
Output: 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
Output: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
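The outputs above correspond to launching 4 blocks of 4 threads each (16 array elements). A minimal hypothetical launch that reproduces them, with illustrative device/host array names d_a and h_a:

int *d_a;
int h_a[16];
cudaMalloc((void**)&d_a, 16 * sizeof(int));
kernel<<<4, 4>>>(d_a);                                          // 4 blocks x 4 threads
cudaMemcpy(h_a, d_a, 16 * sizeof(int), cudaMemcpyDeviceToHost); // read back the 16 values
cudaFree(d_a);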

Example: Shuffling Data:
// Reorder values based on keys
// Each thread moves one element
__global__ void shuffle(int* prev_array, int* new_array, int* indices)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    new_array[i] = prev_array[indices[i]];
}

// Host code
int main()
{
    // Run grid of N/256 blocks of 256 threads each
    shuffle<<<N/256, 256>>>(d_old, d_new, d_ind);
}

GPGPU Alternatives. OpenCL: attempts to be OpenGL for GPGPU; almost identical to CUDA. Need for higher-level languages: Jacket for MATLAB, PyCUDA.

The Future Massively Multi-Core Processors

The Future: MultiCore is Dead. You have an array of six-core CPUs in your rack? You're going to feel pretty stupid when all the cool admins start popping eight-core chips.

The Future: Many-Core? The number of cores becomes so large that traditional caching models don't work: we cannot keep caches coherent. A network on a chip?

The Future: Memory Models & More. We haven't even touched: coherent vs. incoherent caches; uniform vs. non-uniform memory access; transactional memory; special-purpose hardware; schedulers; programming languages.

The Future: Intel Nehalem. Up to 12 cores, 30% lower power usage, similar programming abstractions: 12 cores, each with 128-bit wide SIMD units (SSE). Intel SandyBridge: 256-bit wide SIMD units (AVX).

The Future: Parallelism Everywhere. Putting all those transistors to use: many ALUs, many cores, intricate cache hierarchies. Very difficult to program; graphics is way ahead of the game.

The Future, Learn More: The Landscape of Parallel Computing Research. http://view.eecs.berkeley.edu/wiki/main_page

Thank you! Acknowledgements: Kayvon Fatahalian (many of these slides are inspired by or copied from him); Mike Houston (CS448s Beyond Programmable Shading); Jared Hoberock & David Tarjan (CS193g Programming Massively Parallel Processors, which I TA'd last quarter).