GPU Architecture. Samuli Laine NVIDIA Research

1 GPU Architecture Samuli Laine NVIDIA Research

2 Today The graphics pipeline: Evolution of the GPU Throughput-optimized parallel processor design I.e., the GPU Contrast with latency-optimized (CPU-like) design A look at NVIDIA's GPU architecture

3 Atari: Pong (1972) Dedicated video circuitry

4 CAPCOM: Commando, C64 version (1985) Video chip with HW sprites etc.

5 id Software: DOOM (1993) 2.5D + sprites, everything done on CPU

6 id Software: Quake (1996) True 3D, everything still done on CPU

7 Valve: Half-Life (1998) Triangle rasterization hardware

8 Valve: Half-Life 2 (2004) GPU with programmable shaders

9 DICE: Star Wars Battlefront (2015) GPU with shaders, computation

10 The Graphics Pipeline

11 The Graphics Pipeline Vertex Transform & Lighting Triangle Setup & Rasterization Texturing & Pixel Shading Depth Test & Blending Framebuffer

12 The Graphics Pipeline Vertex Rasterize Pixel Test & Blend Framebuffer Remains a useful abstraction Hardware used to look like this

13 The Graphics Pipeline Vertex Rasterize Pixel Test & Blend Framebuffer Hardware used to look like this Vertex, pixel processing became programmable float4 skin(float4 restpos, uniform float4x4 xform[N_BONES], uniform float weight[N_BONES]) { float4 outpos = float4(0,0,0,0); for (int b = 0; b < N_BONES; b++) { outpos += weight[b]*mul(xform[b], restpos)...

14 Vertex Shaders f(position, attributes) → (new position, attributes) Purely functional (no side effects) Move / animate vertices Apply view and projection matrices Prepare data for pixel shaders Lighting, texture coordinates, ... Hardware interpolates vertex attributes over the triangle and gives the results to the pixel shader

15 VS Example 1: Blend Shapes E.g., face geometries Angry, happy, sad, move eyebrow, ... Each target geometry stored as difference vector For each vertex: average position + n differences Result is a weighted sum of all targets
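As a rough CPU-side illustration of the blend-shape sum described on this slide, the sketch below computes one vertex as the average position plus a weighted sum of difference vectors. It is plain Python rather than shader code, and the function name `blend_shape` and the sample numbers are illustrative assumptions, not taken from the slides.

```python
def blend_shape(average, differences, weights):
    """Return average + sum_i weights[i] * differences[i] for a single vertex."""
    result = list(average)
    for diff, weight in zip(differences, weights):
        for axis in range(len(result)):
            result[axis] += weight * diff[axis]
    return result


# Hypothetical data: one vertex, two target shapes stored as difference vectors.
avg = [0.0, 1.0, 0.0]
diffs = [[0.0, 0.5, 0.0], [0.25, 0.0, 0.0]]
blended = blend_shape(avg, diffs, [1.0, 2.0])
print(blended)
```

On a GPU the same sum would run once per vertex inside the vertex shader, with the weights supplied as uniforms.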

16 VS Example 2: Skinning Transform each vertex p_i with each bone as if it were rigidly tied to it Blend the results using bone weights
    float4 skin(float4 restpos,
                uniform float4x4 xform[N_BONES],
                uniform float weight[N_BONES])
    {
        float4 outpos = float4(0,0,0,0);
        for (int b = 0; b < N_BONES; b++) {
            outpos += weight[b] * mul(xform[b], restpos);
        }
        return outpos;
    }
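The skinning loop can be checked on the CPU with a small Python sketch. It mirrors the structure of the shader (transform the rest position by every bone, then blend by weight), but the helper names `mat_vec` and `skin` as written here, and the two sample bone matrices, are assumptions made for illustration.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]


def skin(rest_pos, xforms, weights):
    """Blend the rigidly transformed positions using per-bone weights."""
    out = [0.0, 0.0, 0.0, 0.0]
    for xform, weight in zip(xforms, weights):
        moved = mat_vec(xform, rest_pos)
        out = [o + weight * m for o, m in zip(out, moved)]
    return out


# Hypothetical bones: one leaves the vertex in place, one shifts it by 2 along x.
identity = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
shift_x = [[1, 0, 0, 2], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
skinned = skin([1.0, 0.0, 0.0, 1.0], [identity, shift_x], [0.5, 0.5])
print(skinned)
```

With equal weights the vertex lands halfway between the two rigid results, which is the blending behaviour the slide describes.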

17 VS Example 3: Projection In: vertex position in input space Out: vertex position in clip space All* vertex shaders need to do this
    float4 transform(float4 worldpos, uniform float4x4 modelviewprojection)
    {
        return mul(modelviewprojection, worldpos);
    }

18 Pixel Shaders f(interpolated attributes) → (color, [depth]) Also known as fragment shaders Purely functional (no side effects) Calculate color of the surface at the given pixel Also possible: Set blending opacity (alpha) Override hardware-generated depth value Discard, i.e., produce no output Hardware takes the produced fragment and blends it into the frame buffer

19 PS Example 1: Lighting Blinn-Torrance-Phong shading model Uses the halfway vector h between v and l (Figure: vectors h, n, l, and v at surface point p)

20 PS Example 1: Lighting (Figure: vectors h, n, l, and v at surface point p)
    struct interpolants { float4 p, n, v; };
    struct light { float4 pos; float Li; };
    float4 phong(interpolants input, uniform light lgt, uniform float q, uniform float Ks)
    {
        float4 l = lgt.pos - input.p;
        float r2 = dot(l, l);
        float4 h = normalize(normalize(l) + input.v);
        return Ks * pow(dot(input.n, h), q) * (lgt.Li / r2);
    }
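To make the halfway-vector computation concrete, here is a CPU-side Python sketch of the same specular term. It uses 3-component lists instead of float4, and it clamps the dot product at zero so that the power is always defined; the clamp, the function names, and the sample values are assumptions added for this sketch and do not appear on the slide.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]


def phong_specular(p, n, v, light_pos, light_intensity, q, ks):
    """Specular term using the halfway vector h between v and l."""
    l = [lp - pp for lp, pp in zip(light_pos, p)]  # surface point to light
    r2 = dot(l, l)                                 # squared distance falloff
    h = normalize([a + b for a, b in zip(normalize(l), v)])
    # Clamp at zero (an addition in this sketch) so the power stays real-valued.
    return ks * max(dot(n, h), 0.0) ** q * (light_intensity / r2)


# Hypothetical setup: light directly above the surface, viewer along the normal.
value = phong_specular(
    p=[0.0, 0.0, 0.0], n=[0.0, 0.0, 1.0], v=[0.0, 0.0, 1.0],
    light_pos=[0.0, 0.0, 2.0], light_intensity=4.0, q=10, ks=0.5,
)
print(value)
```

In this configuration h coincides with the normal, so the highlight is at its peak and only the `Ks` factor and the distance falloff remain.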

21 More PS Examples: Melting Ice Procedural, animated texture Bumped environment map

22 More PS Examples: Toon & Fur Toon shading Volumetric fur

23 Power of VS & PS: Half-Life 2 (2004)

24 Questions?

25 The Graphics Pipeline Vertex Rasterize Pixel Test & Blend Framebuffer Hardware used to look like this Vertex, pixel processing became programmable float4 skin(float4 restpos, uniform float4x4 xform[N_BONES], uniform float weight[N_BONES]) { float4 outpos = float4(0,0,0,0); for (int b = 0; b < N_BONES; b++) { outpos += weight[b]*mul(xform[b], restpos)...

26 The Graphics Pipeline Vertex Geometry Rasterize Pixel Test & Blend Framebuffer Hardware used to look like this Vertex, pixel processing became programmable New stages added float4 skin(float4 restpos, uniform float4x4 xform[N_BONES], uniform float weight[N_BONES]) { float4 outpos = float4(0,0,0,0); for (int b = 0; b < N_BONES; b++) { outpos += weight[b]*mul(xform[b], restpos)...

27 The Graphics Pipeline Vertex Tessellation Geometry Rasterize Pixel Test & Blend Framebuffer Hardware used to look like this Vertex, pixel processing became programmable New stages added Even more stages added GPU architecture increasingly centers around shader execution float4 skin(float4 restpos, uniform float4x4 xform[N_BONES], uniform float weight[N_BONES]) { float4 outpos = float4(0,0,0,0); for (int b = 0; b < N_BONES; b++) { outpos += weight[b]*mul(xform[b], restpos)...

28 Modern GPUs: Unified Design Vertex shaders, pixel shaders, etc. become threads running different programs on a flexible core (Diagram labels: Discrete Design: Shader A, Shader B, Shader C, Shader D; Unified Design: ibuffers, Shader Core, obuffers)

29 GPU Architecture Today GP102 (Titan X)

30 GPU Architecture Today GP102 (Titan X)

31 GPU Architecture Today Vertex Fetch PolyMorph Engine 4.0 Tessellator Attribute Setup Raster Engine Simultaneous Multi-Projection Stream Output Very small portion of chip is strictly graphics-specific hardware GP102 (Titan X)

32 GPU Architecture Today Most of the units are for general-purpose computation, suitable for running arbitrary graphics shaders GP102 (Titan X)

33 What Makes It Fast? Massive number of independent work items (pixels) Allows parallelism Usually, coherent control High degree of data locality Main sources of off-chip accesses: textures and frame buffer Luckily, these tend to be very coherent! Keep as much data as possible on-chip (vertices, attributes, etc.) Custom scheduling and resource allocation No need for software arbitration, thread launching, synchronization, ... Fixed-function units for common, expensive ops E.g., texture filtering

34 Different Workloads Graphics Large number of independent but similar work items Heavy on arithmetic (lots of math/memory op) Coherent control, little data-dependent branching Coherent memory accesses

35 Different Workloads Graphics Large number of independent but similar work items Heavy on arithmetic (lots of math/memory op) Coherent control, little data-dependent branching Coherent memory accesses Opposite Long programs with serial dependencies Complex data-dependent control and memory access patterns Few independent work items Not 2 million pixels

36 Different Workloads Graphics = Throughput-sensitive Large number of independent but similar work items Heavy on arithmetic (lots of math/memory op) Coherent control, little data-dependent branching Coherent memory accesses Opposite = Latency-sensitive Long programs with serial dependencies Complex data-dependent control and memory access patterns Few independent work items Not 2 million pixels

37 Different Workloads Graphics = Throughput-sensitive GPU Large number of independent but similar work items Heavy on arithmetic (lots of math/memory op) Coherent control, little data-dependent branching Coherent memory accesses Opposite = Latency-sensitive CPU Long programs with serial dependencies Complex data-dependent control and memory access patterns Few independent work items Not 2 million pixels

38 Physical Realities Today Clock speeds are not going up by much... ...and power consumption is superlinear in GHz Unavoidable corollary: Processors must be parallel

39 Physical Realities Today, cont'd DRAM is slow Latency is 100s of cycles More speed is exponentially more expensive DRAM is bad with random access Memory atom is large (32 bytes), need coalesced R/W Strong pressure towards 64-byte atom DRAM is power hungry Off-chip access may burn 1000x more power than reading off the register file (which is not free either) Need to minimize DRAM use, otherwise execution units are sitting idle waiting for data
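The need for coalesced reads and writes can be illustrated by counting how many 32-byte memory atoms a warp touches. The Python sketch below is a toy model: the 32-byte atom size comes from the slide, while the 32-thread warp, the 4-byte accesses, the helper name `atoms_touched`, and the two access patterns are assumptions chosen for illustration.

```python
ATOM_BYTES = 32  # memory atom size quoted on the slide


def atoms_touched(addresses, access_bytes=4):
    """Count distinct memory atoms covered by a set of per-thread accesses."""
    atoms = set()
    for addr in addresses:
        first = addr // ATOM_BYTES
        last = (addr + access_bytes - 1) // ATOM_BYTES
        atoms.update(range(first, last + 1))
    return len(atoms)


# 32 threads each read one 4-byte value.
coalesced = atoms_touched([4 * t for t in range(32)])    # neighbouring addresses
strided = atoms_touched([128 * t for t in range(32)])    # 128-byte stride
print(coalesced, strided)
```

Both patterns use 128 bytes of data, but the strided one forces 32 atoms (1024 bytes) across the memory bus instead of 4 atoms (128 bytes).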

40 Dealing with DRAM, the CPU Way 1. Get locality by large, fast on-chip caches ($) 2. Reorder instructions to hide latency 3. Use a few threads to further hide latency, e.g., Intel's HyperThreading™ Great for workloads that exhibit data reuse When cache is large enough to accommodate working set Even with non-coherent access patterns Tolerates unpredictable control by branch prediction

41 Dealing with DRAM, the GPU Way 1. Bite the bullet and wait When waiting, switch in other threads that have all the data they need With enough threads, DRAM latency is hidden What is enough? Need many times more threads than execution units (remember, latency is 100s of cycles) 2. Exploit locality by having individual threads co-operate through fast on-chip memory Allows execution units to be much simpler No need for branch prediction, instruction reordering logic, register renaming, etc.
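A back-of-the-envelope model shows why "enough" means many times more threads than execution units. Assume, purely for illustration, that every warp does a fixed number of cycles of independent work between memory requests and that each request takes a fixed latency; while one warp waits, the others must fill the gap. The function name and the numbers (400-cycle latency, 10 cycles of work, 32 threads per warp) are assumptions, not measurements.

```python
import math


def warps_to_hide_latency(latency_cycles, work_cycles_per_warp):
    """Warps needed so that some warp is always ready while one waits on DRAM."""
    # The waiting warp plus enough other warps to cover its whole wait.
    return math.ceil(latency_cycles / work_cycles_per_warp) + 1


needed_warps = warps_to_hide_latency(latency_cycles=400, work_cycles_per_warp=10)
needed_threads = needed_warps * 32
print(needed_warps, needed_threads)
```

Under these assumptions a single scheduler needs on the order of a thousand resident threads, which is consistent with the slide's point that latency is hidden by thread count rather than by caches or reordering.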

42 Shared Memory Local to each SM, can be shared between threads Goal: Bring the data closer to the ALU I.e., minimize trips to external memory Share values between threads to minimize overfetch and computation Increases arithmetic intensity by keeping data close to the processors

43 Multicore CPU: Run 10 Threads Fast Few processor cores, each supporting 1-2 hardware threads Large on-chip memory/cache near processor (Diagram: cores, each with a cache, and global memory)

44 GPU: Run Threads Fast Dozens of SMs, each supporting hundreds / thousands of hardware threads On-chip memory near processors Use as explicit local storage, allow thread co-operation Hide latency by switching between many threads (Diagram: SMs, each with cache/memory, and global memory)

45 High-Bandwidth Memory Interfaces GDDR5 / GDDR5X / HBM2 memory interface bit wide memory bus to GDDR5(X), up to 480 GB/s HBM2 memory is on-package, up to 4096-bit-wide bus and 720 GB/s (Figures: GP102 (Titan X) with GDDR5X memory; GP100 with HBM2 memory)
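The quoted bandwidths follow from bus width times per-pin data rate. The sketch below reproduces them under stated assumptions: the 4096-bit HBM2 width is from the slide, whereas the 384-bit GDDR5X bus width and the per-pin rates of 10 Gbit/s (GDDR5X) and 1.4 Gbit/s (HBM2) are not given in the slide text and are assumed here.

```python
def peak_bandwidth_gb_s(bus_width_bits, gbit_per_pin):
    """Peak bandwidth in GB/s = bus width (bits) * per-pin rate (Gbit/s) / 8."""
    return bus_width_bits * gbit_per_pin / 8


gddr5x = peak_bandwidth_gb_s(384, 10.0)   # assumed bus width and data rate
hbm2 = peak_bandwidth_gb_s(4096, 1.4)     # slide's bus width, assumed data rate
print(gddr5x, hbm2)
```

The wide-but-slow HBM2 interface and the narrow-but-fast GDDR5X interface reach the same order of magnitude; the roughly 717 GB/s result rounds to the slide's 720 GB/s figure.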

46 Questions?

47 NVIDIA Pascal Architecture GP100

48 NVIDIA Pascal Architecture GP100

49 GP100 SM Scheduler Register file Single-precision ALUs Double-precision ALUs L1 cache Shared memory

50 Warps Threads are executed in warps Warp contains up to 32 threads SM operates at warp granularity Resource allocation Execution

51 Warp scheduling At every cycle, each SM chooses which warp to execute Actually two warps per cycle in current architectures Zero overhead in switching between warps or threads Warp is eligible to be executed if all of its threads are free to execute Not waiting for memory fetches Not waiting for results from ALUs Not waiting for synchronization

52 Program counter (PC) All threads in a warp have the same PC I.e., they execute the same instruction on a given cycle

53 SIMT execution model How is this possible? Sounds like SIMD, but how can threads be independent? SIMT = Single Instruction, Multiple Threads Close to SIMD, but allows free per-thread control flow Built into SM instructions and scheduler Dedicated hardware is necessary for efficient implementation

54 SIMT vs SIMD SIMD (Single Instruction Multiple Data) Used in CPUs, e.g., Intel's SSE/AVX extensions Programmer sees a scalar thread with access to a wide ALU For example, able to do 4 or 8 additions with a single instruction SIMT (Single Instruction Multiple Thread) Programmer sees independent scalar threads with scalar ALUs Hardware internally converts independent control flow into convergent control flow

55 Managing divergence How can threads of a warp diverge if they all have the same PC? Partial solution: Per-instruction execution predication Full solution: Execution mask, execution stack in hardware

56 Example: Instruction predication
    if (a < 10) small++; else big++;
    ISETP.LT.AND P0, pt, R6, 10, pt;
    @P0  IADD R5, R5, 0x1;
    @!P0 IADD R4, R4, 0x1;

57 Example: Instruction predication
    if (a < 10) small++; else big++;
    ISETP.LT.AND P0, pt, R6, 10, pt;    // set predicate register P0 if a < 10; result can vary across warp
    @P0  IADD R5, R5, 0x1;
    @!P0 IADD R4, R4, 0x1;

58 Example: Instruction predication
    if (a < 10) small++; else big++;
    ISETP.LT.AND P0, pt, R6, 10, pt;
    @P0  IADD R5, R5, 0x1;              // in threads where P0 is set, R5 = R5 + 1
    @!P0 IADD R4, R4, 0x1;

59 Example: Instruction predication
    if (a < 10) small++; else big++;
    ISETP.LT.AND P0, pt, R6, 10, pt;
    @P0  IADD R5, R5, 0x1;
    @!P0 IADD R4, R4, 0x1;              // in threads where P0 is clear, R4 = R4 + 1
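The effect of predication across a warp can be mimicked on the CPU. In the Python sketch below, every "thread" steps through the same three instructions, and the per-thread predicate decides whether each add takes effect. The function name `run_predicated` and the four-thread warp are illustrative assumptions; real warps have up to 32 threads.

```python
def run_predicated(a_values, small, big):
    """Execute compare, add-if-set, add-if-clear for all threads in lock step."""
    # Compare: one predicate bit per thread, which may differ across the warp.
    p0 = [a < 10 for a in a_values]
    # Add 1 to 'small' only in threads where the predicate is set.
    small = [s + 1 if p else s for s, p in zip(small, p0)]
    # Add 1 to 'big' only in threads where the predicate is clear.
    big = [b if p else b + 1 for b, p in zip(big, p0)]
    return small, big


small, big = run_predicated([3, 12, 9, 10], [0, 0, 0, 0], [0, 0, 0, 0])
print(small, big)
```

Note that both adds are issued for the whole warp regardless of the data; predication avoids a branch but not the instruction slots.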

60 What about complex cases? Nested if / else blocks, loops, recursion Solution: Execution mask and execution stack

61 Execution mask & stack: Example
    if (a < 10) foo(); else bar();
    /*0048*/ ISETP.LT.AND P0, pt, R6, 10, pt;
    /*0050*/ @!P0 BRA 0x70;
    /*0058*/ ...;                       // foo()
    /*0060*/ ...;
    /*0068*/ BRA 0x80;
    /*0070*/ ...;                       // bar()
    /*0078*/ ...;
    /*0080*/ code continues here

62 Execution mask & stack: Example Case 1: All threads take the if branch
    if (a < 10) foo(); else bar();
    /*0048*/ ISETP.LT.AND P0, pt, R6, 10, pt;
    /*0050*/ @!P0 BRA 0x70;             // no thread of the warp wants to jump
    /*0058*/ ...;                       // foo()
    /*0060*/ ...;
    /*0068*/ BRA 0x80;
    /*0070*/ ...;                       // bar()
    /*0078*/ ...;
    /*0080*/ code continues here

63 Execution mask & stack: Example Case 2: All threads take the else branch
    if (a < 10) foo(); else bar();
    /*0048*/ ISETP.LT.AND P0, pt, R6, 10, pt;
    /*0050*/ @!P0 BRA 0x70;             // all threads of the warp want to jump
    /*0058*/ ...;                       // foo()
    /*0060*/ ...;
    /*0068*/ BRA 0x80;
    /*0070*/ ...;                       // bar()
    /*0078*/ ...;
    /*0080*/ code continues here

64 Execution mask & stack: Example Case 3: Some threads take the if branch, some take the else branch
    if (a < 10) foo(); else bar();
    /*0048*/ ISETP.LT.AND P0, pt, R6, 10, pt;
    /*0050*/ @!P0 BRA 0x70;             // some threads of the warp want to jump: push
    /*0058*/ ...;                       // foo()
    /*0060*/ ...;
    /*0068*/ BRA 0x80;                  // restore active thread mask
    /*0070*/ ...;                       // bar()
    /*0078*/ ...;                       // pop
    /*0080*/ code continues here
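The three cases can be imitated with a toy Python model of an execution mask and stack. This is a simplified sketch of the idea, not a description of the actual hardware mechanism: when the warp diverges, the mask for the else side is pushed, the if side runs first, and the pushed mask is popped afterwards. The function name `run_if_else` and the returned pass count are assumptions made for illustration.

```python
def run_if_else(a_values):
    """Return (per-thread branch taken, number of passes the warp executed)."""
    n = len(a_values)
    wants_jump = [a >= 10 for a in a_values]  # threads heading for the else side
    results = [None] * n
    stack = []                                # pending (branch, mask) entries
    passes = 0

    if all(wants_jump):
        branch, mask = "bar", [True] * n      # case 2: whole warp jumps
    elif not any(wants_jump):
        branch, mask = "foo", [True] * n      # case 1: nobody jumps
    else:
        # Case 3: park the else side on the stack and run the if side first.
        stack.append(("bar", wants_jump))
        branch, mask = "foo", [not j for j in wants_jump]

    while True:
        passes += 1
        for t in range(n):
            if mask[t]:                       # masked-out threads do nothing
                results[t] = branch
        if not stack:
            break
        branch, mask = stack.pop()            # switch to the other side
    return results, passes


uniform = run_if_else([1, 2, 3, 4])
diverged = run_if_else([1, 20, 3, 40])
print(uniform, diverged)
```

A uniform warp pays for one side only, while a diverged warp executes both sides back to back with part of the warp idle each time.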

65 Benefits of SIMT Supports all structured C++ constructs if / else, switch / case, loops, function calls, exceptions goto kind of works, but don't use it Multi-level constructs handled efficiently break / continue from inside multiple levels of conditionals Function return from inside loops and conditionals Retreating to exception handler from anywhere You only need to care about SIMT when tuning for performance Unlike traditional SIMD that gives you nothing unless you explicitly use it

66 Consequences of SIMT An if statement takes the same number of cycles for any number of threads greater than zero If nobody participates it's cheap Also, masked-out threads don't do memory accesses A loop is iterated until all active threads in the warp are done A warp stays alive until every thread in it has terminated Terminated threads are dead weight Same as in conditionals when masked out
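The loop rule ("iterated until all active threads in the warp are done") implies that a warp pays for its slowest thread. The Python sketch below estimates the cost and lane utilization of a loop from per-thread trip counts; the function name `warp_loop_cost` and the sample trip counts are illustrative assumptions.

```python
def warp_loop_cost(trip_counts):
    """Return (iterations the warp executes, fraction of lane-iterations doing work)."""
    iterations = max(trip_counts)             # warp loops until the last thread finishes
    if iterations == 0:
        return 0, 1.0                         # nobody participates: nothing is wasted
    useful = sum(trip_counts)                 # lane-iterations doing real work
    return iterations, useful / (iterations * len(trip_counts))


coherent = warp_loop_cost([8] * 32)           # every thread loops 8 times
incoherent = warp_loop_cost([1] * 31 + [32])  # one straggler loops 32 times
print(coherent, incoherent)
```

With one straggler, the warp runs 32 iterations while the other 31 lanes sit masked out for most of them, so utilization drops to about 6%.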

67 Coherent Execution Is Great An if statement is perfectly efficient if either everyone takes it or nobody does All threads stay active A loop is perfectly efficient if everyone does the same number of iterations Note: These are required for traditional SIMD

68 Incoherent Execution Is Okay Conditionals are efficient as long as threads usually agree Loops are efficient if threads usually take roughly the same number of iterations Much easier to program than explicit SIMD SIMT: Incoherence supported, performance degrades if control diverges SIMD: performance is fixed, incoherence not supported

69 Recap, GPU Unified programmable cores used for all shader types Fixed-function units for rasterization, texture filtering, ROP, etc. SIMT execution model Run scalar threads on widely parallel machine SIMT provides hardware support SIMD requires program to manage control flow Throughput-oriented design Tolerate DRAM latency by having lots of active threads

70 Thank you! Questions?

The Graphics Pipeline: Evolution of the GPU!

The Graphics Pipeline: Evolution of the GPU! 1 Today The Graphics Pipeline: Evolution of the GPU! Bigger picture: Parallel processor designs! Throughput-optimized (GPU-like)! Latency-optimized (Multicore CPU-like)! A look at NVIDIA s Fermi GPU architecture!

More information

Portland State University ECE 588/688. Graphics Processors

Portland State University ECE 588/688. Graphics Processors Portland State University ECE 588/688 Graphics Processors Copyright by Alaa Alameldeen 2018 Why Graphics Processors? Graphics programs have different characteristics from general purpose programs Highly

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:

More information

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010 Introduction to Multicore architecture Tao Zhang Oct. 21, 2010 Overview Part1: General multicore architecture Part2: GPU architecture Part1: General Multicore architecture Uniprocessor Performance (ECint)

More information

GRAPHICS PROCESSING UNITS

GRAPHICS PROCESSING UNITS GRAPHICS PROCESSING UNITS Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 4, John L. Hennessy and David A. Patterson, Morgan Kaufmann, 2011

More information

Advanced GPU Programming. Samuli Laine NVIDIA Research

Advanced GPU Programming. Samuli Laine NVIDIA Research Advanced GPU Programming Samuli Laine NVIDIA Research Today Code execution on GPU High-level GPU architecture SIMT execution model Warp-wide programming techniques GPU memory system Estimating the cost

More information

Modern Processor Architectures. L25: Modern Compiler Design

Modern Processor Architectures. L25: Modern Compiler Design Modern Processor Architectures L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant minimising the number of instructions

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

Parallel Computing: Parallel Architectures Jin, Hai

Parallel Computing: Parallel Architectures Jin, Hai Parallel Computing: Parallel Architectures Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology Peripherals Computer Central Processing Unit Main Memory Computer

More information

Antonio R. Miele Marco D. Santambrogio

Antonio R. Miele Marco D. Santambrogio Advanced Topics on Heterogeneous System Architectures GPU Politecnico di Milano Seminar Room A. Alario 18 November, 2015 Antonio R. Miele Marco D. Santambrogio Politecnico di Milano 2 Introduction First

More information

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics Why GPU? Chapter 1 Graphics Hardware Graphics Processing Unit (GPU) is a Subsidiary hardware With massively multi-threaded many-core Dedicated to 2D and 3D graphics Special purpose low functionality, high

More information

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

GPU Computation Strategies & Tricks. Ian Buck NVIDIA GPU Computation Strategies & Tricks Ian Buck NVIDIA Recent Trends 2 Compute is Cheap parallelism to keep 100s of ALUs per chip busy shading is highly parallel millions of fragments per frame 0.5mm 64-bit

More information

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager Optimizing DirectX Graphics Richard Huddy European Developer Relations Manager Some early observations Bear in mind that graphics performance problems are both commoner and rarer than you d think The most

More information

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS 1 Last time Each block is assigned to and executed on a single streaming multiprocessor (SM). Threads execute in groups of 32 called warps. Threads in

More information

Computer Architecture

Computer Architecture Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 10 Thread and Task Level Parallelism Computer Architecture Part 10 page 1 of 36 Prof. Dr. Uwe Brinkschulte,

More information

Mattan Erez. The University of Texas at Austin

Mattan Erez. The University of Texas at Austin EE382V (17325): Principles in Computer Architecture Parallelism and Locality Fall 2007 Lecture 12 GPU Architecture (NVIDIA G80) Mattan Erez The University of Texas at Austin Outline 3D graphics recap and

More information

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian) From Shader Code to a Teraflop: How GPU Shader Cores Work Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian) 1 This talk Three major ideas that make GPU processing cores run fast Closer look at real

More information

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008 Michael Doggett Graphics Architecture Group April 2, 2008 Graphics Processing Unit Architecture CPUs vsgpus AMD s ATI RADEON 2900 Programming Brook+, CAL, ShaderAnalyzer Architecture Challenges Accelerated

More information

GPU Fundamentals Jeff Larkin November 14, 2016

GPU Fundamentals Jeff Larkin November 14, 2016 GPU Fundamentals Jeff Larkin , November 4, 206 Who Am I? 2002 B.S. Computer Science Furman University 2005 M.S. Computer Science UT Knoxville 2002 Graduate Teaching Assistant 2005 Graduate

More information

ECE 574 Cluster Computing Lecture 16

ECE 574 Cluster Computing Lecture 16 ECE 574 Cluster Computing Lecture 16 Vince Weaver http://web.eece.maine.edu/~vweaver vincent.weaver@maine.edu 26 March 2019 Announcements HW#7 posted HW#6 and HW#5 returned Don t forget project topics

More information

Lecture 7: The Programmable GPU Core. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Lecture 7: The Programmable GPU Core. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011) Lecture 7: The Programmable GPU Core Kayvon Fatahalian CMU 15-869: Graphics and Imaging Architectures (Fall 2011) Today A brief history of GPU programmability Throughput processing core 101 A detailed

More information

Scientific Computing on GPUs: GPU Architecture Overview

Scientific Computing on GPUs: GPU Architecture Overview Scientific Computing on GPUs: GPU Architecture Overview Dominik Göddeke, Jakub Kurzak, Jan-Philipp Weiß, André Heidekrüger and Tim Schröder PPAM 2011 Tutorial Toruń, Poland, September 11 http://gpgpu.org/ppam11

More information

Multi-Processors and GPU

Multi-Processors and GPU Multi-Processors and GPU Philipp Koehn 7 December 2016 Predicted CPU Clock Speed 1 Clock speed 1971: 740 khz, 2016: 28.7 GHz Source: Horowitz "The Singularity is Near" (2005) Actual CPU Clock Speed 2 Clock

More information

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager Optimizing for DirectX Graphics Richard Huddy European Developer Relations Manager Also on today from ATI... Start & End Time: 12:00pm 1:00pm Title: Precomputed Radiance Transfer and Spherical Harmonic

More information

Graphics Processing Unit Architecture (GPU Arch)

Graphics Processing Unit Architecture (GPU Arch) Graphics Processing Unit Architecture (GPU Arch) With a focus on NVIDIA GeForce 6800 GPU 1 What is a GPU From Wikipedia : A specialized processor efficient at manipulating and displaying computer graphics

More information

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1 Architectures Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1 Overview of today s lecture The idea is to cover some of the existing graphics

More information

PowerVR Hardware. Architecture Overview for Developers

PowerVR Hardware. Architecture Overview for Developers Public Imagination Technologies PowerVR Hardware Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.

More information

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1 X. GPU Programming 320491: Advanced Graphics - Chapter X 1 X.1 GPU Architecture 320491: Advanced Graphics - Chapter X 2 GPU Graphics Processing Unit Parallelized SIMD Architecture 112 processing cores

More information

Real-Time Rendering Architectures

Real-Time Rendering Architectures Real-Time Rendering Architectures Mike Houston, AMD Part 1: throughput processing Three key concepts behind how modern GPU processing cores run code Knowing these concepts will help you: 1. Understand

More information

By: Tomer Morad Based on: Erik Lindholm, John Nickolls, Stuart Oberman, John Montrym. NVIDIA TESLA: A UNIFIED GRAPHICS AND COMPUTING ARCHITECTURE In IEEE Micro 28(2), 2008 } } Erik Lindholm, John Nickolls,

More information

Graphics Hardware. Instructor Stephen J. Guy

Graphics Hardware. Instructor Stephen J. Guy Instructor Stephen J. Guy Overview What is a GPU Evolution of GPU GPU Design Modern Features Programmability! Programming Examples Overview What is a GPU Evolution of GPU GPU Design Modern Features Programmability!

More information

From Shader Code to a Teraflop: How Shader Cores Work

From Shader Code to a Teraflop: How Shader Cores Work From Shader Code to a Teraflop: How Shader Cores Work Kayvon Fatahalian Stanford University This talk 1. Three major ideas that make GPU processing cores run fast 2. Closer look at real GPU designs NVIDIA

More information

Windowing System on a 3D Pipeline. February 2005

Windowing System on a 3D Pipeline. February 2005 Windowing System on a 3D Pipeline February 2005 Agenda 1.Overview of the 3D pipeline 2.NVIDIA software overview 3.Strengths and challenges with using the 3D pipeline GeForce 6800 220M Transistors April

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Current Trends in Computer Graphics Hardware

Current Trends in Computer Graphics Hardware Current Trends in Computer Graphics Hardware Dirk Reiners University of Louisiana Lafayette, LA Quick Introduction Assistant Professor in Computer Science at University of Louisiana, Lafayette (since 2006)

More information

The NVIDIA GeForce 8800 GPU

The NVIDIA GeForce 8800 GPU The NVIDIA GeForce 8800 GPU August 2007 Erik Lindholm / Stuart Oberman Outline GeForce 8800 Architecture Overview Streaming Processor Array Streaming Multiprocessor Texture ROP: Raster Operation Pipeline

More information

Threading Hardware in G80

Threading Hardware in G80 ing Hardware in G80 1 Sources Slides by ECE 498 AL : Programming Massively Parallel Processors : Wen-Mei Hwu John Nickolls, NVIDIA 2 3D 3D API: API: OpenGL OpenGL or or Direct3D Direct3D GPU Command &

More information

GeForce4. John Montrym Henry Moreton

GeForce4. John Montrym Henry Moreton GeForce4 John Montrym Henry Moreton 1 Architectural Drivers Programmability Parallelism Memory bandwidth 2 Recent History: GeForce 1&2 First integrated geometry engine & 4 pixels/clk Fixed-function transform,

More information

B. Tech. Project Second Stage Report on

B. Tech. Project Second Stage Report on B. Tech. Project Second Stage Report on GPU Based Active Contours Submitted by Sumit Shekhar (05007028) Under the guidance of Prof Subhasis Chaudhuri Table of Contents 1. Introduction... 1 1.1 Graphic

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

From Brook to CUDA. GPU Technology Conference

From Brook to CUDA. GPU Technology Conference From Brook to CUDA GPU Technology Conference A 50 Second Tutorial on GPU Programming by Ian Buck Adding two vectors in C is pretty easy for (i=0; i

More information

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin EE382 (20): Computer Architecture - ism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez The University of Texas at Austin 1 Recap 2 Streaming model 1. Use many slimmed down cores to run in parallel

More information

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367

More information

Shaders. Slide credit to Prof. Zwicker

Shaders. Slide credit to Prof. Zwicker Shaders Slide credit to Prof. Zwicker 2 Today Shader programming 3 Complete model Blinn model with several light sources i diffuse specular ambient How is this implemented on the graphics processor (GPU)?

More information

Addressing the Memory Wall

Addressing the Memory Wall Lecture 26: Addressing the Memory Wall Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Tunes Cage the Elephant Back Against the Wall (Cage the Elephant) This song is for the

More information

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University CSE 591/392: GPU Programming Introduction Klaus Mueller Computer Science Department Stony Brook University First: A Big Word of Thanks! to the millions of computer game enthusiasts worldwide Who demand

More information

Parallel Programming on Larrabee. Tim Foley Intel Corp

Parallel Programming on Larrabee Tim Foley Intel Corp Motivation This morning we talked about abstractions A mental model for GPU architectures Parallel programming models Particular tools and APIs This

CS195V Week 9. GPU Architecture and Other Shading Languages

CS195V Week 9 GPU Architecture and Other Shading Languages GPU Architecture We will do a short overview of GPU hardware and architecture Relatively short journey into hardware, for more in depth information,

High Performance Computing on GPUs using NVIDIA CUDA

High Performance Computing on GPUs using NVIDIA CUDA Slides include some material from GPGPU tutorial at SIGGRAPH2007: http://www.gpgpu.org/s2007 1 Outline Motivation Stream programming Simplified HW and

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer wimmer@cg.tuwien.ac.at Walking down the graphics pipeline Application Geometry Rasterizer What for? Understanding the rendering pipeline is the key

Comparing Memory Systems for Chip Multiprocessors

Comparing Memory Systems for Chip Multiprocessors Jacob Leverich Hideho Arakida, Alex Solomatnikov, Amin Firoozshahian, Mark Horowitz, Christos Kozyrakis Computer Systems Laboratory Stanford University

CS4620/5620: Lecture 14 Pipeline

CS4620/5620: Lecture 14 Pipeline 1 Rasterizing triangles Summary 1. evaluation of linear functions on pixel grid 2. functions defined by parameter values at vertices 3. using extra parameters to determine

ECE 571 Advanced Microprocessor-Based Design Lecture 18

ECE 571 Advanced Microprocessor-Based Design Lecture 18 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 11 November 2014 Homework #4 comments Project/HW Reminder 1 Stuff from Last

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas

GPU Programming Lecture 1: Introduction Miaoqing Huang University of Arkansas Outline Course Introduction GPUs as Parallel Computers Trend and Design Philosophies Programming and Execution

ECE 571 Advanced Microprocessor-Based Design Lecture 20

ECE 571 Advanced Microprocessor-Based Design Lecture 20 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 12 April 2016 Project/HW Reminder Homework #9 was posted 1 Raspberry Pi

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

CS 380 - GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8 Markus Hadwiger, KAUST Reading Assignment #5 (until March 12) Read (required): Programming Massively Parallel Processors book, Chapter

Spring 2011 Prof. Hyesoon Kim

Spring 2011 Prof. Hyesoon Kim Application Geometry Rasterizer CPU Each stage can also be pipelined The slowest of the pipeline stages determines the rendering speed. Frames per second (fps) Executes on

Master Informatics Eng.

Advanced Architectures Master Informatics Eng. 2018/19 A.J.Proença Data Parallelism 3 (GPU/CUDA, Neural Nets,...) (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2018/19 1 The

CUDA Optimizations WS Intelligent Robotics Seminar. Universität Hamburg WS Intelligent Robotics Seminar Praveen Kulkarni

CUDA Optimizations WS 2014-15 Intelligent Robotics Seminar 1 Table of content 1 Background information 2 Optimizations 3 Summary

Occupancy-based compilation

Occupancy-based compilation Advanced Course on Compilers Spring 2015 (III-V): Lecture 10 Vesa Hirvisalo ESG/CSE/Aalto Today Threads and occupancy GPUs as the example SIMT execution warp (thread-group)

1.2.3 The Graphics Hardware Pipeline

Figure 1-3. The Graphics Hardware Pipeline 1.2.3 The Graphics Hardware Pipeline A pipeline is a sequence of stages operating in parallel and in a fixed order. Each stage receives its input from the prior

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

Efficient and Scalable Shading for Many Lights

Efficient and Scalable Shading for Many Lights 1. GPU Overview 2. Shading recap 3. Forward Shading 4. Deferred Shading 5. Tiled Deferred Shading 6. And more! First GPU Shaders Unified Shaders CUDA OpenCL

HPC VT Machine-dependent Optimization

HPC VT 2013 Machine-dependent Optimization Last time Choose good data structures Reduce number of operations Use cheap operations strength reduction Avoid too many small function calls inlining Use compiler

Spring 2009 Prof. Hyesoon Kim

Spring 2009 Prof. Hyesoon Kim Application Geometry Rasterizer CPU Each stage can also be pipelined The slowest of the pipeline stages determines the rendering speed. Frames per second (fps) Executes on

Computer Architecture

Jens Teubner Computer Architecture Summer 2017 1 Computer Architecture Jens Teubner, TU Dortmund jens.teubner@cs.tu-dortmund.de Summer 2017 Part II Graphics

Pipeline Operations. CS 4620 Lecture Steve Marschner. Cornell CS4620 Spring 2018 Lecture 11

Pipeline Operations CS 4620 Lecture 11 1 Pipeline you are here APPLICATION COMMAND STREAM 3D transformations; shading VERTEX PROCESSING TRANSFORMED GEOMETRY conversion of primitives to pixels RASTERIZATION

Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

Spring 2010 Prof. Hyesoon Kim AMD presentations from Richard Huddy and Michael Doggett Radeon 2900 2600 2400 Stream Processors 320 120 40 SIMDs 4 3 2 Pipelines 16 8 4 Texture Units 16 8 4 Render Backends

GPU Memory Model. Adapted from:

GPU Memory Model Adapted from: Aaron Lefohn University of California, Davis With updates from slides by Suresh Venkatasubramanian, University of Pennsylvania Updates performed by Gary J. Katz, University

Working with Metal Overview

Graphics and Games #WWDC14 Working with Metal Overview Session 603 Jeremy Sandmel GPU Software 2014 Apple Inc. All rights reserved. Redistribution or public display not permitted without written permission

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

Pipeline Operations. CS 4620 Lecture 14

Pipeline Operations CS 4620 Lecture 14 2014 Steve Marschner 1 Pipeline you are here APPLICATION COMMAND STREAM 3D transformations; shading VERTEX PROCESSING TRANSFORMED GEOMETRY conversion of primitives

Lecture 25: Board Notes: Threads and GPUs

Lecture 25: Board Notes: Threads and GPUs Announcements: - Reminder: HW 7 due today - Reminder: Submit project idea via (plain text) email by 11/24 Recap: - Slide 4: Lecture 23: Introduction to Parallel

Real-Time Buffer Compression. Michael Doggett Department of Computer Science Lund university

Real-Time Buffer Compression Michael Doggett Department of Computer Science Lund university Project 3D graphics project Demo, Game Implement 3D graphics algorithm(s) C++/OpenGL(Lab2)/iOS/android/3D engine

The Rasterization Pipeline

Lecture 5: The Rasterization Pipeline (and its implementation on GPUs) Computer Graphics CMU 15-462/15-662, Fall 2015 What you know how to do (at this point in the course) Position objects

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

The F-Buffer: A Rasterization-Order FIFO Buffer for Multi-Pass Rendering. Bill Mark and Kekoa Proudfoot. Stanford University

The F-Buffer: A Rasterization-Order FIFO Buffer for Multi-Pass Rendering Bill Mark and Kekoa Proudfoot Stanford University http://graphics.stanford.edu/projects/shading/ Motivation for this work Two goals

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

Graphics Hardware, Graphics APIs, and Computation on GPUs Mark Segal Overview Graphics Pipeline Graphics Hardware Graphics APIs ATI's low-level interface for computation on GPUs 2 Graphics Hardware High

CME 213 Spring 2017 Eric Darve

CME 213 Spring 2017 Eric Darve Summary of previous lectures Pthreads: low-level multi-threaded programming OpenMP: simplified interface based on #pragma, adapted to scientific computing OpenMP for and

Scanline Rendering 2 1/42

Scanline Rendering 2 1/42 Review 1. Set up a Camera the viewing frustum has near and far clipping planes 2. Create some Geometry made out of triangles 3. Place the geometry in the scene using Transforms

ME964 High Performance Computing for Engineering Applications

ME964 High Performance Computing for Engineering Applications Execution Scheduling in CUDA Revisiting Memory Issues in CUDA February 17, 2011 Dan Negrut, 2011 ME964 UW-Madison Computers are useless. They

Today's Agenda. DirectX 9 Features Sim Dietrich, nvidia - Multisample antialiasing Jason Mitchell, ATI - Shader models and coding tips

Today's Agenda DirectX 9 Features Sim Dietrich, nvidia - Multisample antialiasing Jason Mitchell, ATI - Shader models and coding tips Optimization for DirectX 9 Graphics Mike Burrows, Microsoft - Performance

CUDA Programming Model

CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Lecture 15: Introduction to GPU programming Overview Hardware features of GPGPU Principles of GPU programming A good reference: David B. Kirk and Wen-mei

Overview. Technology Details. D/AVE NX Preliminary Product Brief

Overview D/AVE NX is the latest and most powerful addition to the D/AVE family of rendering cores. It is the first IP to bring full OpenGL ES 2.0/3.1 rendering to the FPGA and SoC world. Targeted for graphics

PowerVR Series5. Architecture Guide for Developers

Public Imagination Technologies PowerVR Series5 Public. This publication contains proprietary information which is subject to change without notice and is supplied 'as is' without warranty of any kind.

Optimisation. CS7GV3 Real-time Rendering

Optimisation CS7GV3 Real-time Rendering Introduction Talk about lower-level optimization Higher-level optimization is better algorithms Example: not using a spatial data structure vs. using one After that

GPUs and GPGPUs. Greg Blanton John T. Lubia

GPUs and GPGPUs Greg Blanton John T. Lubia PROCESSOR ARCHITECTURAL ROADMAP Design CPU Optimized for sequential performance ILP increasingly difficult to extract from instruction stream Control hardware

DirectCompute Performance on DX11 Hardware. Nicolas Thibieroz, AMD Cem Cebenoyan, NVIDIA

DirectCompute Performance on DX11 Hardware Nicolas Thibieroz, AMD Cem Cebenoyan, NVIDIA Why DirectCompute? Allow arbitrary programming of GPU General-purpose programming Post-process operations Etc. Not

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

Real - Time Rendering Graphics pipeline Michal Červeňanský Juraj Starinský Overview History of Graphics HW Rendering pipeline Shaders Debugging 2 History of Graphics HW First generation Second generation

Cornell University CS 569: Interactive Computer Graphics. Introduction. Lecture 1. [John C. Stone, UIUC] NASA. University of Calgary

Cornell University CS 569: Interactive Computer Graphics Introduction Lecture 1 [John C. Stone, UIUC] 2008 Steve Marschner NASA University of Calgary

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

Could you make the XNA functions yourself?

Could you make the XNA functions yourself? For the second and especially the third assignment, you need to globally understand what's going on inside the graphics hardware. You will write shaders, which

Multithreaded Processors. Department of Electrical Engineering Stanford University

Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

graphics pipeline computer graphics graphics pipeline 2009 fabio pellacini 1

graphics pipeline computer graphics graphics pipeline 2009 fabio pellacini 1 graphics pipeline sequence of operations to generate an image using object-order processing primitives processed one-at-a-time

GPU. Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano, Seminar Room, Bld 20. 11 December, 2017

Advanced Topics on Heterogeneous System Architectures GPU Politecnico di Milano Seminar Room, Bld 20 11 December, 2017 Antonio R. Miele Marco D. Santambrogio Politecnico di Milano 2 Introduction