Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

Similar documents
Anatomy of AMD s TeraScale Graphics Engine

AMD Radeon HD 2900 Highlights

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

The NVIDIA GeForce 8800 GPU

Windowing System on a 3D Pipeline. February 2005

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

Graphics Processing Unit Architecture (GPU Arch)

Spring 2009 Prof. Hyesoon Kim

High Performance Graphics 2010

GPU Architecture. Michael Doggett Department of Computer Science Lund university

Spring 2011 Prof. Hyesoon Kim

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

GeForce4. John Montrym Henry Moreton

1. Introduction 2. Methods for I/O Operations 3. Buses 4. Liquid Crystal Displays 5. Other Types of Displays 6. Graphics Adapters 7.

CS427 Multicore Architecture and Parallel Computing

RADEON X1000 Memory Controller

Portland State University ECE 588/688. Graphics Processors

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

GRAPHICS PROCESSING UNITS

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

Antonio R. Miele Marco D. Santambrogio

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?

The Need for Programmability

Spring 2010 Prof. Hyesoon Kim. Xbox 360 System Architecture, Anderews, Baker

SAPPHIRE R7 260X 2GB GDDR5 OC BATLELFIELD 4 EDITION

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

AMD E8860 2GB PCIEx16 Mini DP X6 GFX-AE8860F16-5J

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Ultimate Graphics Performance for DirectX 10 Hardware

SAPPHIRE DUAL-X R9 270X 2GB GDDR5 OC WITH BOOST

Xbox 360 Architecture. Lennard Streat Samuel Echefu

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Programming Graphics Hardware

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. What is Computer Architecture? Sources

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

AMD Smarter Choice. Futures Some thoughts. Raja Koduri

SAPPHIRE TOXIC R9 280X 3GB GDDR5

AMD HD5450 PCIe ADD-IN BOARD. Datasheet AEGX-A3T5-01FST1

Vertex Shader Design I

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)

From Shader Code to a Teraflop: How Shader Cores Work

Real-Time Rendering Architectures

Real-World Applications of Computer Arithmetic

Graphics Hardware. Instructor Stephen J. Guy

Squeezing Performance out of your Game with ATI Developer Performance Tools and Optimization Techniques

Threading Hardware in G80

2 x Maximum Display Monitor(s) support MHz Core Clock 28 nm Chip 384 x Stream Processors. 145(L)X95(W)X26(H) mm Size. 1.

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

A SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization

PowerVR Series5. Architecture Guide for Developers

GCN Performance Tweets AMD Developer Relations

AMD E M PCIEx16 GFX-AE6460F16-5C

AMD Embedded PCIe ADD-IN BOARD E6760/E6460 Datasheet. (ER93FLA/ER91FLA-xx)

GPU Target Applications

Evolution of GPUs Chris Seitz

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS

AMD HD5450 PCI ADD-IN BOARD. Datasheet. Advantech model number:gfx-a3t5-61fst1

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

PowerVR Hardware. Architecture Overview for Developers

CSCI-GA Graphics Processing Units (GPUs): Architecture and Programming Lecture 2: Hardware Perspective of GPUs

Readings on graphics architecture for Advanced Computer Architecture class

Mattan Erez. The University of Texas at Austin

ECE 574 Cluster Computing Lecture 16

AMD E8860 PCIe ADD-IN BOARD. Datasheet (AEGX-A5T8-20FMT1)

A Reconfigurable Architecture for Load-Balanced Rendering

Parallel Computing: Parallel Architectures Jin, Hai

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Monday Morning. Graphics Hardware

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

1.2.3 The Graphics Hardware Pipeline

CONSOLE ARCHITECTURE

AMD E8870 4GB PCIEX16 Mini DP X4 Low profile ER24FL-SK4 GFX-AE8870L16-5J

Falanx Microsystems. Company Overview

Bringing AAA graphics to mobile platforms. Niklas Smedberg Senior Engine Programmer, Epic Games

Xbox 360 high-level architecture

Bifrost - The GPU architecture for next five billion

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

Parallel Programming on Larrabee. Tim Foley Intel Corp

ASYNCHRONOUS SHADERS WHITE PAPER 0

The Bifrost GPU architecture and the ARM Mali-G71 GPU

GPU Architecture and Function. Michael Foster and Ian Frasch

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

From Brook to CUDA. GPU Technology Conference

Cornell University CS 569: Interactive Computer Graphics. Introduction. Lecture 1. [John C. Stone, UIUC] NASA. University of Calgary

Tutorial on GPU Programming #2. Joong-Youn Lee Supercomputing Center, KISTI

A Trip Down The (2011) Rasterization Pipeline

Could you make the XNA functions yourself?

Spring 2009 Prof. Hyesoon Kim

Standard Graphics Pipeline

Efficient and Scalable Shading for Many Lights

GPU A rchitectures Architectures Patrick Neill May

INSTITUTO SUPERIOR TÉCNICO. Architectures for Embedded Computing

A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications

Computer Architecture

Transcription:

Spring 2010 Prof. Hyesoon Kim AMD presentations from Richard Huddy and Michael Doggett

Radeon 2900 2600 2400 Stream Processors 320 120 40 SIMDs 4 3 2 Pipelines 16 8 4 Texture Units 16 8 4 Render Backens 16 4 4 L2 texture cache (KB) 256 128 0 Technology (nm) 80 65 65 Area (mm2) 420 153 82 Transistors (millions) 720 390 180 Memory bandwidth 512 128 64 Optimized for High clock speed Power efficiency Power efficiency AMD presentations from Richard Huddy and Michael Dogget t

320 Stream processing units 4 SMIDs 4 Texture Units 4 Render Back-end AMD presentations from Richard Huddy and Michael Dogget t

AMD presentations from Richard Huddy and Michael Doggett

AMD presentations from Richard Huddy and Michael Doggett

GPU interface with host Processes command stream from graphics driver A custom RISC based Micro-Coded engine First class memory client with Read/Write access State management AMD presentations from Richard Huddy and Michael Dogget t

Prepares data for processing by the stream processing units 3 groups of blocks feeding 3 data streams Each group feeding 16 elements Vertex blocks: Primitive tessellation, Inputs-index & instancing Sends vertex addresses to shader core Geometry blocks : Uses on/off chip staging Sends processed vertex addresses, near neighbor addresses and topological information Pixel blocks : Scan conversion, Triangle setup, Rasterizations, and interpolation Interfaces to depth to perform Hiz/EarlyZ checks AMD presentations from Richard Huddy and Michael Doggett

Main control for the shader core Separate command queues for each shader type Each thread consists of a number of instructions that will operate on a block of input data All workloads have threads of 64 elements 100 s of threads in flight Threads are put to sleep when they request a slow responding resource AMD presentations from Richard Huddy and Michael Dogget t

AMD presentations from Richard Huddy and Michael Dogget t

Initial arbiter to select with thread to submit Two arbiter units per SIMD array Allows each SIMD to be pipelined, with two operations at a time in process Dedicated arbiter units for texture and vertex fetches Can be scheduled independently from math operations Executing threads can be bumped at any time if a higher priority thread is pulled from the command queues Temporary data saved so thread can resume later Arbitration policy Age/need/availability When in doubt favor pixels Programmable AMD presentations from Richard Huddy and Michael Dogget t

Dedicated shader caches Instruction cache allows unlimited shader length Constant cache allows unlimited number of constants Both caches take advantage of data re-use to improve state change overhead and efficiency Latency hiding Cache miss, switches to another thread Suspended threads remain in the command queues until their requested data arrives Ultra-threaded dispatch processor can queue up hundreds of threads AMD presentations from Richard Huddy and Michael Dogget t

4 parallel SIMD units Each unit receives independent ALU instruction Very Long Instruction Word (VLIW) Each instruction word can include up to 6 independent, co-issued operations (5 math + 1 flow control) All operations are performed in parallel on each data element in the current thread Texture fetch and vertex fetch instructions are issued and executed separately Allows fetches to begin executing before the requested data is required by the shader ALU Instruction ( 1 to 7 64-bit words) 5 scalar ops- 64 bits for src/dst/controls/op 2 additional for literal constants

5 Scalar Units Each scalar unit does FP Multiply- Add (MAD) and integer operations One also handles transcendental instructions (SIN, COS, LOG, EXP, etc.) IEEE 32-bit floating point precision Integer and bitwise operation support Branch Execution unit Up to 6 operations co-issued AMD presentations from Richard Huddy and Michael Dogget t

Virtualizes register space Allow overflow to graphics memory Can be read from or written to by and SIMD (texture & vertex caches are read-only) 8KB Fully associative cache, write combining Stream out Allows shader output to bypass render back-ends and color buffer Render to vertex buffer Outputs sequential stream of data instead of bitmaps Uses: Used for inter-thread communication AMD presentations from Richard Huddy and Michael Dogget t

Fetch Units 8 fetch address processor each (32 total) 4 filtered and unfiltered 20 texture samplers each (80 total) Can fetch a single data value per clock 4 filtered texels (with BW) (16 total) Bilinear filter one 64-bit FP color value per clocks for each pixel 128-bit FP textures filtered at half speed Trilinear and anisotropic filtering Fetch caches Unified caches across all SIMDs Vertex/Unfiltered cache 4kB L1, 32 Kb L2 Texture cache 32KB L1, 256 KB L2 (128KB for HD 2600, HD2400 uses single level vertex/texture cache) AMD presentations from Richard Huddy and Michael Dogget t

Double rate depth/stencil test 32 pixels per clock for HD 2900 8 pixels per clock for HD2600&HD2400 Programmable MSAA (multi-sample anti-aliasing) resolve Allows custom AA filters New blend-able DX10 surface formats 128-bit and 11:11:10 floating point format Up to 8 Multiple Render Targets (MRT) with MSAA support AMD presentations from Richard Huddy and Michael Dogget t

Improved Z & Stencil compression Up to 16:1 in standard mode Z & stencil now compressed separately with each other for better efficiency Z Range optimization Re-Z Limit depth test operations to a programmable depth range (useful for speeding up stencil shadowing) Can check Z buffer twice once before pixel shader, and again after Allows early Z before shading in all cases Improved Hierarchical Z buffer Adds hierarchical stencil (HiS) for better stencil shadow performance Handles most situations where it had to be disabled in the past 32-bit floating point z-buffer support AMD presentations from Richard Huddy and Michael Dogget t

Centralized Partially distributed Fully distributed Crossbar ATI Radeon X850& earlier + All computing GPUs Hybrid Ring Bus ATI Radeon X1000 Series Ring Bus ATI Radeon HD 2000 series AMD presentations from Richard Huddy and Michael Dogget t

Over 100GB/s memory bandwidth Fully distributed design Highly scalable 512-bit interface Compacts, stacked I/O pad design More bandwidth with existing memory technology Improved cost: bandwidth ratio 8x64 bit memory channels Double ring bus 512 bit read and write AMD presentations from Richard Huddy and Michael Dogget t

Benefits of a 512-bit interface More bandwidth with existing memory technology Lower memory clock required to achieve target bandwidth Benefits of the ring bus Simplifies routing to improve scalability Reduces wire delay Reduces number of repeaters required AMD presentations from Richard Huddy and Michael Dogget t

Programmable tessellation unit Based on xbox 360 technology Provides highly effective geometry data compression Orders of magnitude faster than CPU-based or geometry shaderbased tessellation Enables: More detailed animation More realistic characters Complex terrain More sophisticated shader effect AMD presentations from Richard Huddy and Michael Dogget t

All ATIs Radeon HD 2000 series GPUs feature High bandwidth dual-link GPU interconnect Supports display resolutions up to 2560x2048 @ 60Hz Built for future scalability AMD presentations from Richard Huddy and Michael Dogget t

Focus on efficiency Old equation: architecture advances = f (performance, features) New equation, architecture advances = f(perf/watt, perf/$, features) Scale up processing power & AA performance Enhance stream computing capability Faster and more flexible DirectX 10.1, tessllation, CFAA, GDDR5, PCI-E 2.0 SIGGARH 08, Houston

800 stream processing units Texture New texture cache design New memory architecture Optimized render back-ends for faster anti-aliasing Enhanced geometry shader and tessellator performance SIGGARH 08, Houston

Each core 80 scalar stream processing units + 16KB local store Has its own control logic 4 dedicated texture units + L1 cache 16 global data share 4:1 ALU:TEX ratio SIGGARH 08, Houston

40% increase in performance per mm 2 More aggressive clock gating for improved performance per watt Fast double precision processing units SIGGARH 08, Houston

Streamlined design 70% increase in performance/mm 2 More performance Double the texture cache bandwidth 2.5x increase in 32- bit filter rate 1.25x increase in 64-bit filter rate New cache design L2s aligned with memory channels Separate vertex cache Increased bandwidth SIGGARH 08, Houston

Focus on improving AA performance per mm 2 Doubled peak rate for depth/stencil ops to 64 per clock SIGGARH 08, Houston

New distributed design with hub Controllers distributed around periphery of chip, adjacent to primary bandwidth consumers 256-bit interface allows reduced latency Hub handles relatively low bandwidth traffic PCI Express, CrossFireX interconnect, UVD2, display controllers SIGGARH 08, Houston

On-chip microcontroller Controls clock gating, engine/memory clock speeds, voltages, and fan controllers SIGGARH 08, Houston

TeraScale 2 architecture The first DirectX 11 support GPU 2.7 TeralFlops for a single precision 544 Gflops for a double precision Evolved from TeraScale architecture (HD 4800) No revolution

2X the processing power of previous Gen Over 2 TeraFLOPS Over 20 Gigapixels/Sec Major Feature and Design Enhancements: Instruction set Stream processing units SIMD layout Graphics engine Texture units Render back-ends Display controllers http://www.hardocp.com/image.html?image=mti1mzu4otm1nvldbxbla3zkzm5fnv8xx2wuz2lm

20 SIMD engines Each with 16 thread processors Each with 5 stream cores (1600 total) 80 Texture units 4 per SIMD engine 150+ GB/sec GDDR5 memory interface http://www.hardocp.com/image.html?image=mti1mzu4otm1nvldbxbla3zkzm5fnv8xx2wuz2lm

2.7 TeraFLOPs single precision 544 GigaFLOPs double precision Increased IPC More flexible dot products Co-issue MUL, dependent ADD in single clock Sum of absolute differences (SAD) 12x speed-up with native instruction Used for video encoding, computer vision Exposed via OpenCL extension DirectX 11 bit-level ops Bit count, insert, extract, etc. Fused Multiply-Add Each thread processor includes 4 stream cores + SFU Branch unit General purpose registers http://www.hardocp.com/image.html?image=mti1mzu4otm1nvldbxbla3zkzm5fnv8xx2wuz2lm

80 texture units Increased texture bandwidth Up to 68 billion bilinear filtered texel/sec Up to 272 billion 32-bit fetches/sec Increased cache bandwidth Up to 1TB/sec L1 texture fetch bandwidth Up to 435 GB/s between L1&L2 Doubled L2 cache 128KB per memory controller New DirectX 11 texture features 16k x 16k max resolution New 32-bit and 64-bit HDR block compression modes http://www.hardocp.com/image.html?image=mti1mzu4otm1nvldbxbla3zkzm5fnv8xx2wuz2lm

Dual rasterizers New tessellation unit 6 th generation technology Programmable via Direct X11 Hull & Domain shaders Pull model interpolation New DirectX 11 feature Uses stream processors for interpolation with new instructions Improved flexibility, negligible performance cost Improved performance for constant buffer updates Faster geometry shading OpenGL enhancements Improved line rendering performance and clipped speed 12-bit subpixel precision http://www.hardocp.com/image.html?image=mti1mzu4otm1nvldbxbla3zkzm5fnv8xx2wuz2lm

New readback path Texture units can now read compressed AA color buffers Improved CFAA performance Faster sample rate shading Enhanced MRT performance Faster color clears Supersample AA Anti-aliases shaders & texture as well as polygon engines - Efficient implementation based on adaptive AA technology - Works seamlessly with CFAA http://www.hardocp.com/image.html?image=mti1mzu4otm1nvldbxbla3zkzm5fnv8xx2wuz2lm

Optimized memory controller area EDC (Error detection code) CRC checks on data transfers for improved reliability at high clock speeds GDDR5 memory clock temperature compensation Enables speeds approaching 5 Gbs Fast GDDR5 link retraining Allows voltage & clock switching on the FLY without glitches http://www.hardocp.com/image.html?image=mti1mzu4otm1nvldbxbla3zkzm5fnv8xx2wuz2lm

Feature Shader model 4.0 Shader model 5.0 Benefits Thread dispatch 2D 3D Replace multiple 2D thread arrays with a single 3D array Thread limit 768 1024 More threads Thread group shared memory 16KB 32KB Increase inter-thread communication Shared memory access 256 B write only Full 32KB read/write Efficient shared memory I/O Atomic operations Not supported Supported Each thread operaets on protected memory locations Easy programming CPU based algorithms Double precision Not supported Supported Append/consume buffers Not supported Supported Useful for building and accessin g data in list or stack form Unordered access views bound to compute shader 1 8 Unordered access views bound to pixel shader Gather 4 Not supported Not supported 8 Supported

AMD presentations from Richard Huddy and Michael Dogget t