Challenges for GPU Architecture. Michael Doggett, Graphics Architecture Group, April 2, 2008


Graphics Processing Unit Architecture: CPUs vs GPUs, AMD's ATI RADEON 2900
Programming: Brook+, CAL, ShaderAnalyzer
Architecture Challenges
Accelerated Computing

ATI WhiteOut Demo [2007]

Architecture

Chip Design Focus Point
CPU:
- Lots of instructions, little data
- Out-of-order exec
- Branch prediction
- Reuse and locality
- Task parallel
- Needs OS
- Complex sync
- Latency machines
GPU:
- Few instructions, lots of data
- SIMD
- Hardware threading
- Little reuse
- Data parallel
- No OS
- Simple sync
- Throughput machines
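A minimal C++ sketch (an illustration, not anything from the talk) of the two design points: the same scale-and-bias loop written once in a latency-oriented, branchy, one-element-at-a-time style and once in a throughput-oriented style where 64 elements advance in lockstep and the branch becomes a per-lane select.

```cpp
#include <cstddef>

// Latency-oriented (CPU-style): one element per iteration, branchy control flow,
// relying on caches, branch prediction and out-of-order execution for speed.
void scaleBiasScalar(float* out, const float* in, std::size_t n,
                     float scale, float bias) {
    for (std::size_t i = 0; i < n; ++i) {
        float v = in[i] * scale + bias;
        if (v < 0.0f) v = 0.0f;              // handled by branch prediction
        out[i] = v;
    }
}

// Throughput-oriented (GPU-style): the same math over a fixed SIMD width
// (64 here, matching the 64-element threads mentioned later in the talk),
// with the branch turned into a select so all lanes run in lockstep.
// Remainder handling omitted for brevity.
constexpr std::size_t kSimdWidth = 64;
void scaleBiasSimd(float* out, const float* in, std::size_t n,
                   float scale, float bias) {
    for (std::size_t base = 0; base + kSimdWidth <= n; base += kSimdWidth) {
        for (std::size_t lane = 0; lane < kSimdWidth; ++lane) {  // conceptually one instruction
            float v = in[base + lane] * scale + bias;
            out[base + lane] = (v < 0.0f) ? 0.0f : v;            // select, not a branch
        }
    }
}
```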

Typical CPU Operation
- One iteration at a time on a single CPU unit; cannot reach 100% utilization
- Hard to prefetch data
- Multi-core does not help; clusters do not help
- Limited number of outstanding fetches
- Waiting for memory leaves gaps that prevent peak performance
- Gap size varies dynamically
- Hard to tolerate latency
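As a concrete illustration (not from the slides), a dependent-load loop of the kind this slide describes: each fetch depends on the previous one, so only one miss is outstanding at a time and the ALU idles in the gaps.

```cpp
#include <cstddef>

struct Node { int value; Node* next; };

// Pointer-chasing sum: every load depends on the previous one, so even a deep
// out-of-order window keeps only one miss in flight.  While that fetch is
// outstanding the ALU sits idle -- the "gaps" the slide refers to -- and adding
// cores or cluster nodes does nothing for this single dependent chain.
long sumList(const Node* head) {
    long sum = 0;
    for (const Node* p = head; p != nullptr; p = p->next) {  // serial dependency chain
        sum += p->value;   // trivial ALU work, dominated by memory latency
    }
    return sum;
}
```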

GPU Threads (lower clock, different scale)
- Overlapped fetch and ALU
- Many outstanding fetches
- ALU units reach 100% utilization
- Hardware sync for final output
- Lots of threads
- Fetch unit + ALU unit
- Fast thread switch
- In-order finish
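A toy C++ model of that latency hiding (purely illustrative; the constants and the round-robin policy are assumptions, not the real hardware's): every cycle the scheduler issues one ALU op from whichever thread has its data back, so the issue slot stays busy even though each individual thread spends most of its time waiting on fetches.

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int kThreads = 24;      // try 4 to see the CPU-style gaps reappear
    const int kMemLatency = 20;   // cycles until a fetch result comes back
    const int kOpsPerThread = 32;

    struct Thread { int readyAt = 0; int remaining = kOpsPerThread; };
    std::vector<Thread> pool(kThreads);

    int busy = 0, cycle = 0, live = kThreads, rr = 0;
    while (live > 0) {
        ++cycle;
        for (int k = 0; k < kThreads; ++k) {             // cheap hardware thread switch
            Thread& t = pool[(rr + k) % kThreads];
            if (t.remaining > 0 && t.readyAt <= cycle) {
                ++busy;                                  // one ALU op issued this cycle
                t.readyAt = cycle + kMemLatency;         // issue next fetch, thread sleeps
                if (--t.remaining == 0) --live;
                rr = (rr + k + 1) % kThreads;            // round-robin among ready threads
                break;
            }
        }
    }
    std::printf("ALU busy %d of %d cycles (%.0f%%)\n",
                busy, cycle, 100.0 * busy / cycle);
}
```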

RADEON 2900 Top Level
[Block diagram: red blocks are compute/fixed function, yellow blocks are caches. Shown: command processor, vertex index fetch, setup unit with tessellator, geometry and vertex assemblers, rasterizer and interpolators, hierarchical Z with Z/stencil cache, ultra-threaded dispatch processor, unified shader processors with shader R/W cache and instruction/constant caches, texture units with unified L1/L2 texture caches, compression, shader export, stream out, memory read/write cache, and render back-ends with color cache.]

Ultra-Threaded Dispatch Processor / Scheduler
- Main control for the shader core
- All workloads have threads of 64 elements
- 100s of threads in flight
- Threads are put to sleep when they request a slow-responding resource
- Arbitration policy: Age / Need / Availability
- When in doubt, favor pixels
- Programmable
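A sketch of what an arbitration pass like the one described here might look like; the struct fields, the ordering of the criteria and the tie-break are assumptions for illustration, not the actual hardware policy.

```cpp
#include <vector>

// Hypothetical arbitration over ready wavefronts: availability first, then need,
// then age, and "when in doubt favor pixels" as the tie-break.
enum class Kind { Vertex, Geometry, Pixel };

struct Wavefront {              // a "thread" of 64 elements
    Kind kind;
    int  age;                   // cycles since last scheduled
    bool needsToRetire;         // e.g. oldest in-order export pending
    bool operandsReady;         // slow resource (texture/const) request has returned
};

const Wavefront* arbitrate(const std::vector<Wavefront>& candidates) {
    const Wavefront* best = nullptr;
    for (const auto& w : candidates) {
        if (!w.operandsReady) continue;                 // availability: skip sleeping threads
        if (!best) { best = &w; continue; }
        if (w.needsToRetire != best->needsToRetire) {   // need first
            if (w.needsToRetire) best = &w;
        } else if (w.age != best->age) {                // then age
            if (w.age > best->age) best = &w;
        } else if (w.kind == Kind::Pixel && best->kind != Kind::Pixel) {
            best = &w;                                  // when in doubt, favor pixels
        }
    }
    return best;                                        // nullptr: nothing ready this cycle
}
```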

Ultra-Threaded Dispatch Processor
[Block diagram: vertex, geometry and pixel shader command queues from the setup engine feed per-SIMD arbiter/sequencer pairs (plus texture-fetch and vertex-fetch arbiters and sequencers), which send instructions and control to four SIMD arrays of 80 stream processing units each, backed by the vertex/texture cache, shader instruction cache and shader constant cache.]

Shader Core
- 4 parallel SIMD units
- Each unit receives an independent ALU instruction
- Very Long Instruction Word (VLIW) ALU instruction, 1 to 7 64-bit words:
  - 5 scalar ops, 64 bits each for src/dst/controls/op
  - 2 additional words for literal constant(s)
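A rough C++ sketch of what such an instruction group could look like in the terms the slide gives (up to five 64-bit scalar-op words plus up to two literal words). The field names and bit widths are invented for illustration; the real R600 encoding differs.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct ScalarOp64 {              // one 64-bit word per scalar op (fields are illustrative)
    std::uint64_t opcode : 8;
    std::uint64_t dstGpr : 8;
    std::uint64_t srcA   : 12;
    std::uint64_t srcB   : 12;
    std::uint64_t srcC   : 12;
    std::uint64_t ctrl   : 12;   // write mask, clamp, lane (x/y/z/w/t), etc.
};
static_assert(sizeof(ScalarOp64) == 8, "each op packs into one 64-bit word");

struct VliwGroup {
    std::vector<ScalarOp64> ops;          // 1..5 co-issued scalar ops
    std::vector<std::uint64_t> literals;  // 0..2 words of literal constants
    std::size_t words() const { return ops.size() + literals.size(); }  // 1..7
};
```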

Stream Processing Units
- 5 scalar units
  - Each scalar unit does FP multiply-add (MAD) and integer operations
  - One also handles transcendental instructions
  - IEEE 32-bit floating point precision
- Branch execution unit for flow control instructions
- Up to 6 operations co-issued
- General purpose registers
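As an illustration of that co-issue (the mapping is assumed, not taken from the ISA), one fully packed issue could perform four MADs on the regular lanes, a transcendental on the fifth lane, and a flow-control decision on the branch execution unit:

```cpp
#include <array>
#include <cmath>

// "Up to 6 operations co-issued", modeled in C++ for illustration only.
struct BundleResult {
    std::array<float, 4> mad;   // lanes x, y, z, w
    float transcendental;       // lane t (the unit with transcendentals)
    bool  takeBranch;           // branch execution unit
};

BundleResult issueBundle(const std::array<float, 4>& a,
                         const std::array<float, 4>& b,
                         const std::array<float, 4>& c,
                         float specExponent, int loopCounter) {
    BundleResult r{};
    for (int lane = 0; lane < 4; ++lane)
        r.mad[lane] = a[lane] * b[lane] + c[lane];   // FP multiply-add per lane
    r.transcendental = std::exp2(specExponent);      // only one lane handles this
    r.takeBranch = (loopCounter - 1 == 0);           // flow control, co-issued
    return r;
}
```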

Programming

Programming Model
- Vertex and pixel kernels (shaders)
- Parallel loops are implicit
- Performance-aware code does not know how many cores or how many threads there are
- All sorts of queues are maintained under the covers
- All kinds of sync are done implicitly
- Programs are very small
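A sketch of that model in C++ with invented helper names (PixelInput, drawFullscreen), not any real graphics API: only the per-pixel kernel is the developer's code, and the loop below stands in for the parallel machinery the API and hardware supply implicitly.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Color = std::array<float, 4>;
struct PixelInput { float u, v; };                 // interpolated values for one pixel

Color pixelKernel(const PixelInput& in) {          // runs once per pixel, no loop in sight
    return { in.u, in.v, 0.5f, 1.0f };
}

// Stand-in for what the driver and GPU do under the covers: here a trivially
// serial loop, on the hardware hundreds of 64-element threads running the kernel.
std::vector<Color> drawFullscreen(std::size_t w, std::size_t h,
                                  Color (*kernel)(const PixelInput&)) {
    std::vector<Color> target(w * h);
    for (std::size_t y = 0; y < h; ++y)
        for (std::size_t x = 0; x < w; ++x)
            target[y * w + x] = kernel({ (x + 0.5f) / w, (y + 0.5f) / h });
    return target;
}
```

Calling drawFullscreen(1920, 1080, pixelKernel) shades every pixel without the application ever mentioning cores, threads, queues or synchronization.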

Parallelism Model
- All parallel operations are hidden behind domain-specific API calls
- Developers write sequential code + kernels
- A kernel operates on one vertex or pixel
- Developers never deal with parallelism directly
- No need for auto-parallelizing compilers

Ruby Demo Series
- Four versions, each done by experts to show off features of the chip as well as to develop novel, forward-looking graphics techniques
- First three written in DirectX 9, the fourth in DirectX 10

DX Pixel Shader Length
[Charts: triangle counts (in thousands, up to ~2000) and pixel shader sizes (up to ~800) across demo versions d1-d4.]
Number of pixel shaders: demo 1 = 140, demo 2 = 163, demo 3 = 312, demo 4 = 250

AMD Stream Computing

GPU ShaderAnalyzer

Architecture Challenges

GPU Directions
- Continued improvement and evolution in graphics
  - RADEON 2900 tessellator
  - Continued conversion/removal of fixed function: which fixed function?
- Increasing importance of the GPU as an accelerator
- Generalization of GPU architecture
  - More aggressive GPU thread scheduler
  - Shaders: read-only via texture, write-only via color exports
- Need new GPU shader architectures to meet much wider requirements
  - Already a range of requirements across Graphics and Compute

Accelerated Computing

The Accelerated Computing Imperative
- Homogeneous multi-core becomes increasingly inadequate
- Targeted special-purpose HW solutions can offer substantially better power efficiency and throughput than traditional microprocessors when operating on some of these new data types
- Power constraints will force applications to be performance heterogeneous
- Applications can target the HW device to get this power benefit
- GPUs: high power efficiency, more than 2 GFLOPS/watt, 20x more than a dual-core CPU
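A back-of-the-envelope check on that last bullet, using purely illustrative round numbers (assumed, not measured figures):

\[
\underbrace{\frac{400\ \text{GFLOPS}}{200\ \text{W}}}_{\text{GPU (assumed)}} = 2\ \tfrac{\text{GFLOPS}}{\text{W}},
\qquad
\underbrace{\frac{10\ \text{GFLOPS}}{100\ \text{W}}}_{\text{dual-core CPU (assumed)}} = 0.1\ \tfrac{\text{GFLOPS}}{\text{W}},
\qquad
\frac{2}{0.1} = 20\times
\]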

AMD's Accelerated Computing Initiative
- Discrete CPUs + GPUs
- Full integration: Fusion, a multi-core CPU + GPU accelerator
- Compatibility will continue to be a key enabler in our industry
  - Need SW for new HW
  - How should existing compute APIs evolve? Do we need new API models?
  - GPU parallelism is so successful because graphics APIs are sequential
  - New APIs can't tie down GPU progress
  - Need to replicate the success of DX; need IHV input
  - Can only use GPGPU APIs when performance is necessary and the programmer understands the machine
- Layers of computation
- Compilers that can target workloads at appropriate processors

Future GPU/Accelerated Computing Applications
- Which applications benefit from a combined CPU+GPU, and how?
- What type of architectural workload coupling between CPU and GPU do these applications require?
- What are the new data and communication requirements?
- What are the costs to existing performance to broaden what we do well?

Form Factors for Accelerated Computing
- The embedded space has already embraced heterogeneous computing models
- Notebook, desktop, home media server, game console, home cinema, HDTV, UMPC, digital STB/PVR, handset, OCUR

Accelerated Computing
- GPGPU APIs enable offloading compute to GPUs without heroic programming efforts
- API compatibility enables:
  - Hardware innovation without new SW
  - Improving performance of existing apps on new HW
- Broad range of possible accelerator designs
- We need both CPUs and GPUs: Amdahl's law
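Amdahl's law, for reference, is why the serial portion keeps the CPU on the critical path:

\[
\text{Speedup} = \frac{1}{(1 - p) + \dfrac{p}{s}}
\]

For example, if p = 0.9 of the work is offloaded to an accelerator that runs it s = 20x faster, the overall speedup is 1 / (0.1 + 0.9/20) ≈ 6.9x; the serial 10% left on the CPU bounds the gain, so both processors matter.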

Lots of Challenges
- Managing context state and exceptions
  - This includes the program-visible state in the compute offload engine!
  - Virtualizing the context state
- Communications/Messaging
  - Simplified & fabric-independent producer-consumer model (see the sketch below)
  - Optimized communication is a key enabler
  - It's the synchronization, stupid
- Memory BW and data movement
  - Keeping up with the computation rates will require increasingly capable memory systems
- New and appropriate APIs
  - Must offer a programming model that is actually easier than today's multi-core models
  - Use abstraction to trade some performance for programmer productivity
  - Live within bounds set by the OS and concurrent runtimes
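A minimal sketch of the producer-consumer pattern named above, in plain C++11; the queue stands in for whatever fabric actually connects CPU and accelerator, and the interesting part is exactly the synchronization the slide calls out.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>

// Consumer blocks until work arrives; producer signals when it enqueues.
template <typename T>
class WorkQueue {
public:
    void push(T item) {
        { std::lock_guard<std::mutex> lock(m_); q_.push(std::move(item)); }
        cv_.notify_one();                       // wake one waiting consumer
    }
    T pop() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });   // sleep until work arrives
        T item = std::move(q_.front());
        q_.pop();
        return item;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<T> q_;
};

int main() {
    WorkQueue<int> queue;
    std::thread consumer([&] {                  // stand-in for the offload engine
        for (int i = 0; i < 4; ++i)
            std::printf("consumed %d\n", queue.pop());
    });
    for (int i = 0; i < 4; ++i) queue.push(i);  // CPU-side producer
    consumer.join();
}
```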

Processor World Map
[Diagram placing processor families on a map: CPU, GPU, MPP, HC.]

Questions?
- Internships, ATI Fellowships
- Michael.doggett at amd.com

References
- Issues and Challenges in Compiling for Graphics Processors, Norm Rubin, Code Generation and Optimization (CGO), 8 April 2008
- The Role of Accelerated Computing in the Multi-Core Era, Chuck Moore, March 2008