AMD Radeon HD 2900 Highlights

Hot Chips 19, 2007
AMD's Radeon HD 2900: 2nd Generation Unified Shader Architecture
Mike Mantor, Fellow, AMD Graphics Products Group
michael.mantor@amd.com

AMD Radeon HD 2900 Highlights

Technology leadership
- Clock speed: 742 MHz
- Transistors: 700 million
- Process technology: TSMC 80nm HS
- Power: ~215 W; pin count: 2140
- Die size: 420 mm² (20mm x 21mm)
- Cutting-edge image quality features
  - Advanced anti-aliasing and texture filtering capabilities
  - Fast High Dynamic Range rendering
  - Programmable Tessellation Unit

2nd generation unified architecture
- Scalar ALU design with 320 stream processing units
- 475 GigaFLOPS of (MulAdd) compute
- 47.5 GigaPixels/sec & 742 Mtri/sec
- 106 GB/sec bandwidth
- Optimized for Dynamic Game Computing and Accelerated Stream Processing

DirectX 10
- Massive shader and geometry processing performance
- Shader Model 4.0 with integer support
- Enabling the next generation of visual effects

ATI Avivo HD technology
- Delivering The Ultimate Visual Experience for HD video
- HD display and audio connectivity
- HD DVD and Blu-Ray capable

Native CrossFire technology
- Superior multi-GPU support
- Scales up rendering performance and image quality with 2 or more GPUs
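The headline rates on the highlights slide can be cross-checked against the clock speed. The sketch below is only a back-of-the-envelope check, assuming that one multiply-add counts as two floating point operations and that "Giga" and "M" are decimal prefixes; the function names are illustrative and not part of any AMD tool.

```python
# Figures quoted on the highlights slide.
CLOCK_HZ = 742e6          # engine clock
STREAM_PROCESSORS = 320   # scalar stream processing units
FLOPS_PER_MAD = 2         # assumption: one multiply-add = 2 FLOPs


def peak_gflops(units, clock_hz, flops_per_op=FLOPS_PER_MAD):
    """Peak multiply-add throughput in GFLOPS."""
    return units * flops_per_op * clock_hz / 1e9


def per_clock(rate_per_second, clock_hz):
    """Convert a per-second rate into a per-clock rate."""
    return rate_per_second / clock_hz


gflops = peak_gflops(STREAM_PROCESSORS, CLOCK_HZ)
pixels_per_clock = per_clock(47.5e9, CLOCK_HZ)
triangles_per_clock = per_clock(742e6, CLOCK_HZ)

print(f"peak MulAdd rate: {gflops:.2f} GFLOPS")
print(f"pixels per clock: {pixels_per_clock:.2f}")
print(f"triangles per clock: {triangles_per_clock:.2f}")
```

Under these assumptions the quoted 475 GigaFLOPS corresponds to 320 units each issuing one multiply-add per clock, 47.5 GigaPixels/sec to roughly 64 pixels per clock, and 742 Mtri/sec to one triangle per clock.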

Performance Improvements (5 years)
[Chart: Radeon GPU performance by approximate availability date, 2002 to 2007, vertical axis 0 to 40000. Parts plotted: ATI Radeon 9700, ATI Radeon 9800, ATI Radeon X800, ATI Radeon X1800, ATI Radeon X1950, ATI Radeon HD 2900. Curve labels: "Doubling every 18 months", "Radeon GPU Performance", "Doubling every 24 months".]

AMD Radeon HD2900 Graphics System
[Diagram labels: CPU (Host Application, Graphics Driver); System Memory Address Space (Memory Mapped HD2900 Registers, Commands, Buffers, Instruction/Constant, Inputs/Outputs); AMD HD2900 Graphics Memory (Instruction/Constant, Inputs/Output/Buffer); Memory Controller; Interrupts; AMD Radeon HD 2900; Display & Multimedia.]
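To put the two doubling rates named on the chart in perspective, the compound growth over the five-year span can be computed directly. This is a minimal sketch of the arithmetic only; it does not reproduce the chart's data points, and `growth_factor` is an illustrative name.

```python
def growth_factor(years, doubling_months):
    """Total growth after `years` if performance doubles every `doubling_months`."""
    return 2 ** (years * 12 / doubling_months)


fast = growth_factor(5, 18)   # doubling every 18 months
slow = growth_factor(5, 24)   # doubling every 24 months

print(f"5 years at 18-month doubling: {fast:.1f}x")
print(f"5 years at 24-month doubling: {slow:.1f}x")
```

Over five years the 18-month rate compounds to roughly a tenfold gain, while the 24-month rate gives a little under sixfold.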

Massive Bandwidth
World's First 512b Fully Distributed Memory Interface
- New stacked I/O pad design
[Diagram labels: 64-bit Memory Channels to GDDR3/4 DRAM; PCI Express; Ring Stops; Kilobit Ring Bus (512-bit read + 512-bit write); Crossbar Mux; Read/Write Memory Client Interfaces.]

Highlights
- Over 100 GB/sec memory bandwidth
- Target achieved via current technology
- Eight 64-bit memory channels
- Kilobit ring bus
- Lower required frequencies

AMD Radeon HD2900 Graphics Unit
2nd Generation Unified Shader Architecture
- Development from proven and successful XBOX 360 graphics
- New dispatch processor handling thousands of simultaneous threads
- Instruction Cache and Constant Cache for unlimited program size
- Up to 320 discrete, independent stream processing units
  - Scalar ALU implementation
- Dedicated branch execution units
- Three dedicated fetch units
  - Texture Cache
  - Vertex Cache
  - Load/Store Cache
- Full support for DirectX 10.0, Shader Model 4.0
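The "lower required frequencies" point follows from the interface width: the wider the bus, the lower the per-pin data rate needed for a given bandwidth. The sketch below works that out from the figures on the slides (eight 64-bit channels, 106 GB/sec), assuming GB means 10^9 bytes. The slides do not state the memory clock, so the per-pin rate is derived here, not quoted.

```python
CHANNELS = 8          # eight memory channels (from the slide)
CHANNEL_BITS = 64     # 64 bits per channel (from the slide)


def bus_width_bits(channels=CHANNELS, channel_bits=CHANNEL_BITS):
    """Total memory interface width in bits."""
    return channels * channel_bits


def implied_pin_rate_gbps(bandwidth_gb_s, width_bits):
    """Per-pin data rate (Gbit/s) needed to reach a given bandwidth."""
    return bandwidth_gb_s * 8 / width_bits


width = bus_width_bits()
rate_512 = implied_pin_rate_gbps(106, width)
rate_256 = implied_pin_rate_gbps(106, 256)   # hypothetical narrower bus, for comparison

print(f"interface width: {width} bits")
print(f"per-pin rate at 512 bits: {rate_512:.3f} Gbit/s")
print(f"per-pin rate at 256 bits: {rate_256:.3f} Gbit/s")
```

Halving the width to a hypothetical 256-bit interface would double the per-pin rate required for the same 106 GB/sec.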

Programmer's View of Shader Dataflow
Graphics Pipeline Data Flow without GS
[Diagram labels, in the order they appear: DMA Buffers carrying State Data, Draw Calls, and Sync Queries; Unique Vertices; Vertex Shader; PaC; PoC; Parameters; Connectivity; Positions; Rasterizer; Pixel Quads; Interpolation; Pixels; Pixel Shader; HiZ; EarlyZ; Shaded Pixels; Depth & Color; Color & Depth Buffer.]

Command Processor
- GPU interface with host
- A custom RISC based micro-coded engine
- Memory & register read and write access
- Multiple buffers with dependent fetch latency hiding
- Surface coherency synchronization
- Host interrupt notification system
- Hardware based validation of state data at draw call

Setup Engine
[Diagram labels: Draw/State; Vertex Index Fetch; Shader Feedback; Hierarchical and EarlyZ; Rasterizer; Interpolators; Setup Engine; Geometry Assembler; Programmable Tessellator; Vertex Assembler.]

Thread Workload
Workload preparation for shader
- Staging to collect data for submission
  - Different arrival/drain rates
  - Different storage requirements
  - Different processing needs
- Submittal arbitration policies
  - Output
  - Need feedback/availability/balance
  - Prevent over-subscription
  - When in doubt, favor pixels
Vertex workload
- Primitive Assembly & Vertex Reuse
- Primitive Tessellation (742 Mtri/sec)
- Inputs: index & instancing data
Geometry Shader staging
- On/off chip staging
- Amplification and parallelism
- Dependence on SIMD size
Pixel Shader staging
- Rasterization and interpolation
- Vertex/pixel I/O mappings
- Inputs: system variables, z, center, centroid, sample, linear
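The submittal arbitration bullets (prevent over-subscription; when in doubt, favor pixels) can be illustrated with a toy arbiter. This is a hypothetical sketch, not the hardware's actual policy: the `Staged` record, the use of register demand as the limiting resource, and the oldest-first ordering are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Staged:
    kind: str          # "VS", "GS", or "PS"
    age: int           # clocks spent waiting in staging (hypothetical metric)
    gprs_needed: int   # registers the thread group would occupy (hypothetical)


def pick_next(staged, free_gprs):
    """Pick the next staged thread group to submit, or None if nothing fits."""
    # Prevent over-subscription: only groups that fit in free registers are eligible.
    eligible = [s for s in staged if s.gprs_needed <= free_gprs]
    if not eligible:
        return None
    # Oldest first (assumption); on an age tie, favor pixel work.
    return max(eligible, key=lambda s: (s.age, s.kind == "PS"))


staged = [
    Staged("VS", age=5, gprs_needed=10),
    Staged("PS", age=5, gprs_needed=10),
    Staged("GS", age=9, gprs_needed=100),
]

tight = pick_next(staged, free_gprs=50)    # GS does not fit; VS/PS tie on age
roomy = pick_next(staged, free_gprs=200)   # everything fits; oldest wins
starved = pick_next(staged, free_gprs=5)   # nothing fits

print(tight, roomy, starved, sep="\n")
```

With limited free registers the large geometry group is held back and the pixel group wins the age tie; with plenty of room the oldest group is submitted regardless of type.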

Design Goals for Unified Shader
- Maximize performance via ALU utilization
- Provide shared resources for all shader types
- Sustain peak fetch and I/O rates
- Provide a common programming language
- Simplify design and verification process
- Enable common tool chain
- Flexibility and scalability

Requirements to meet goals
- Hide latency of memory fetches
- Create cache locality to prevent over-fetch
- Prevent resource over-subscription
- Arbitrate on age/need to protect bandwidth
- Enable performance scaling for all workloads
- Provide ALU & I/O ratios for typical workloads
- Interleave diverse workloads to balance
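The requirement to hide memory fetch latency is what drives the need for many threads in flight. A simple way to see this is a steady-state model: while one thread group waits on a fetch, the others must supply enough ALU work to cover the wait. The sketch below uses that model with made-up latency numbers; none of the figures are taken from the slides.

```python
import math


def groups_needed(fetch_latency_clocks, alu_clocks_between_fetches):
    """Thread groups needed so ALU work from the others covers one group's fetch wait.

    Model (an assumption): each group alternates A clocks of ALU work with a
    fetch of L clocks, so (N - 1) * A >= L must hold.
    """
    return 1 + math.ceil(fetch_latency_clocks / alu_clocks_between_fetches)


# Hypothetical numbers, chosen only to show the trend.
fetch_heavy = groups_needed(fetch_latency_clocks=400, alu_clocks_between_fetches=16)
alu_heavy = groups_needed(fetch_latency_clocks=400, alu_clocks_between_fetches=100)

print(f"fetch-heavy shader: {fetch_heavy} groups in flight")
print(f"ALU-heavy shader:   {alu_heavy} groups in flight")
```

Shaders with little arithmetic between fetches need far more groups in flight, which is one reason interleaving diverse workloads helps balance the machine.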

Simultaneous Multi-Threaded Engine
[Diagram labels: Fetch Limited; ALU Limited.]

Unified Shader/Stream Processors
- Single Instruction Multiple Data
  - Each SIMD receives independent ALU instruction stream
  - Each SIMD applies instruction stream to multiple data elements
- Multiple Instruction Multiple Data
  - Multiple SIMD units operating in parallel (multi-processor system)
  - Distributed or shared memory
- Very Long Instruction Word (VLIW) design
  - Co-issue of up to 6 operations (5 ALU + 1 FC)
  - 1.25 Machine Scalar operation per clock for each of 64 data elements
  - Independent scalar source and destination addressing
- Simultaneous instruction issue
  - Input, Output, Fetch, ALU, and Control Flow per SIMD
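The figure of 1.25 scalar operations per clock for each of 64 data elements is consistent with the 80 stream processing units per SIMD array shown elsewhere in the deck. The sketch below shows one reading of that arithmetic; the interpretation that a 64-element vector is issued over 16 five-wide lanes in four clocks is an assumption, not something the slide states.

```python
SPUS_PER_SIMD = 80    # stream processing units per SIMD array (from the deck)
DATA_ELEMENTS = 64    # data elements per vector (from the slide)
VLIW_ALU_SLOTS = 5    # ALU operations co-issued per bundle (from the slide)

# Scalar operations available per clock, per data element.
ops_per_clock_per_element = SPUS_PER_SIMD / DATA_ELEMENTS

# Assumed reading: the 80 units form 5-wide lanes, one element per lane per clock.
lanes = SPUS_PER_SIMD // VLIW_ALU_SLOTS
clocks_per_bundle = DATA_ELEMENTS // lanes

print(f"scalar ops per clock per element: {ops_per_clock_per_element}")
print(f"5-wide lanes per SIMD: {lanes}")
print(f"clocks to issue one bundle across 64 elements: {clocks_per_bundle}")
```

Under this reading each element receives up to 5 scalar operations every 4 clocks, which again gives 1.25 per clock.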

Shader Instructions
- VLIW (Very Long Instruction Word), variable length
- Control Flow instructions
  - Control branch, loop, stack operations
  - Clause launch
  - Barriers, allocation, and exports
- Clause: set of instructions that executes without pre-emption
  - ALU instructions
  - Texture & vertex fetch instructions
  - Memory read/write instructions
- ALU instruction (1 to 7 64-bit words)
  - 5 scalar ops, 64 bits each
  - 2 additional words for literal constants

Ultra-Threaded Dispatch Processor
[Diagram labels: Vertex Assembler; Setup Engine; Geometry Assembler; Interpolators; Unified Thread Group Queue holding interleaved VS, GS, and PS thread groups; Control Flow; Export; LoadStore; Shader Instruction Cache; Ultra-Threaded Dispatch Processor; Shader Constant Cache; Texture Fetch; Vertex Fetch; Export Buffers & CrossBar; Read/Write Cache; four SIMD Arrays of 80 Stream Processing Units each; Vertex/Texture Units; Vertex/Texture Cache.]
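The variable length of an ALU instruction follows from the slide's description: one 64-bit word per scalar operation (up to 5) plus up to 2 further words of literal constants. A minimal sketch of that size calculation is shown below; `alu_bundle_words` is an illustrative helper, not an encoder for the real instruction format.

```python
WORD_BITS = 64          # each word is 64 bits (from the slide)
MAX_SCALAR_OPS = 5      # up to 5 scalar ops per ALU instruction
MAX_LITERAL_WORDS = 2   # up to 2 additional words of literal constants


def alu_bundle_words(scalar_ops, literal_words=0):
    """Number of 64-bit words in one ALU instruction bundle."""
    if not 1 <= scalar_ops <= MAX_SCALAR_OPS:
        raise ValueError("an ALU instruction carries 1 to 5 scalar ops")
    if not 0 <= literal_words <= MAX_LITERAL_WORDS:
        raise ValueError("at most 2 words of literal constants")
    return scalar_ops + literal_words


smallest = alu_bundle_words(1)
largest = alu_bundle_words(5, 2)

print(f"smallest bundle: {smallest} word ({smallest * WORD_BITS // 8} bytes)")
print(f"largest bundle:  {largest} words ({largest * WORD_BITS // 8} bytes)")
```

The two extremes reproduce the "1 to 7 64-bit words" range quoted on the slide.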

Shader Processing Units (SPU)
- Arranged as 5-way scalar stream processors
  - Co-issue up to 5 scalar FP MAD (Multiply-Add)
  - Up to 5 integer operations supported (cmp, logical, add)
  - One of the 5 stream processing units additionally handles
    * transcendental instructions (SIN, COS, LOG, EXP, RCP, RSQ)
    * integer multiply and shift operations
- 32-bit floating point precision (round to nearest even)
- Branch execution units handle flow control and conditional operations
  - Condition code generation for full branching
  - Predication supported directly in ALU
- General Purpose Registers
  - 1 MByte of GPR space for fast register access
[Diagram labels: Branch Execution Unit; General Purpose Registers.]

Fetch Unit Design
[Diagram labels: Vertex Address Processors; Texture Address Processors; Texture Filter Units; Vertex Fetches; FP16 Texture Samples.]
- Fetch units
  - Fetch Address Processors each
    - 4 filtered (fetch neighboring data for filtering)
    - 4 un-filtered raw data fetch
  - 20 samples accessed from cache per clock
  - 4 bilinear filter results per clock (with BW)
  - Filter rate for each pixel: one 64-bit FP texture result per clock, one 128-bit FP result per 2 clocks
- Multi-level fetch cache design
  - L2/L1 cache structures
  - Unified 4 KB L1 structured cache (unfiltered)
  - Unified 32 KB L2 structured cache (unfiltered)
  - Unified 32 KB L1 texture cache
  - Unified 256 KB L2 texture cache
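The two filter rates quoted for the fetch units (one 64-bit FP result per clock, one 128-bit FP result per 2 clocks) both correspond to 64 bits of filtered output per clock. The sketch below expresses that as a small helper. Extending the rule to other texel widths is an assumption for illustration, and only the 64-bit and 128-bit cases come from the slide.

```python
import math

FILTER_BITS_PER_CLOCK = 64   # implied by both quoted rates


def clocks_per_filtered_result(texel_bits):
    """Clocks to produce one filtered result of the given width.

    Only the 64-bit and 128-bit cases are stated on the slide; other widths
    follow the same 64-bits-per-clock rule as an assumption.
    """
    return max(1, math.ceil(texel_bits / FILTER_BITS_PER_CLOCK))


fp16_rgba = clocks_per_filtered_result(64)    # four FP16 channels
fp32_rgba = clocks_per_filtered_result(128)   # four FP32 channels

print(f"64-bit FP result:  {fp16_rgba} clock")
print(f"128-bit FP result: {fp32_rgba} clocks")
```

In other words, 128-bit floating point textures filter at half the rate of 64-bit ones.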

Memory Read/Write Cache
- Virtualizes register space
  - Allows overflow to graphics memory
- Can be read from or written to by any SIMD (fetch caches are read-only)
- Can export data to stream out buffer
- 8 KB fully associative cache, write combining

Stream Out
- Allows shader output to bypass render back-ends and color buffer
- Outputs sequential stream of data instead of bitmaps
- Uses include:
  - Inter-thread communication
  - Render to vertex buffer
  - Overflow storage/output for Geometry Shader data (allowing parallel processing for large amplification)

Render Back-Ends
- Alpha testing, alpha and fog blending
- Double rate depth/stencil test
  - 32 pixels per clock for ATI Radeon HD 2900
- Multi-Sample Anti-Aliasing (MSAA) resolve functionality is programmable
  - Makes Custom Anti-Aliasing Filter possible
- New blend-able surface formats
  - Allows new DirectX 10 formats to be displayable
  - 128-bit floating point format
  - 11:11:10 floating point format
- MRT (Multiple Render Target) support
  - Up to 8 MRTs with MSAA support
[Diagram labels: Z/Stencil Cache; Decompress; Compress; Programmable MSAA Resolve; Color Cache; Alpha/Fog; Depth/Stencil; Blend.]

A Scalable Family

Part                 Stream processors  SIMDs  Texture units  Render back-ends
ATI Radeon HD 2900   320                4      4              4
ATI Radeon HD 2600   120                3      2              1
ATI Radeon HD 2400   40                 2      1              1

Note: the ATI Radeon HD 2400 uses a shared vertex/texture cache.

- Designed with a number of [...] for most elements
  - Shader, Texture, Interpolate, Raster Backend Units
- Core functionality exists in all parts
- Target specific cost/performance levels for each part

AMD Accelerated Computing Software
[Diagram labels: Stream Applications; Compilers; Libraries; Eco System; Stream Extensions for C, C++; GCC; Microsoft; ACML (Math Library); COBRA Video Transcode library; 3rd Party Developers; Havok FX; PeakStream; Rapidmind; AMD Runtime; AMD Compute Abstraction Layer (CAL); CTM HAL; Graphics API; DirectX; OpenGL; CAL Graphics Bindings; AMD Multicore CPUs; AMD Stream Processors.]
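The family figures show how the design scales: both the number of SIMDs and the width of each SIMD change from part to part. The sketch below derives the per-SIMD width from the numbers on the slide. Dividing by 5 to get the count of 5-way units per SIMD assumes every part uses the same 5-way arrangement described for the SPUs.

```python
# (stream processors, SIMDs) per part, as listed on the slide.
FAMILY = {
    "ATI Radeon HD 2900": (320, 4),
    "ATI Radeon HD 2600": (120, 3),
    "ATI Radeon HD 2400": (40, 2),
}
VLIW_WIDTH = 5   # assumption: 5-way units in every part


def simd_shape(stream_processors, simds):
    """Return (stream processors per SIMD, 5-way units per SIMD)."""
    per_simd = stream_processors // simds
    return per_simd, per_simd // VLIW_WIDTH


shapes = {part: simd_shape(*counts) for part, counts in FAMILY.items()}

for part, (per_simd, units) in shapes.items():
    print(f"{part}: {per_simd} stream processors per SIMD, {units} five-way units")
```

The stream processor counts divide evenly in all three cases, giving SIMDs of 80, 40, and 20 stream processors respectively.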

Questions?

Abstract: AMD Radeon HD 2900 Technology: A 2nd Generation Unified Shader Architecture

The internally named R600 Graphics Processor Unit (GPU) developed by Advanced Micro Devices (AMD) will support two distinct types of intensive data-parallel compute applications: high performance 3D graphics processing and general purpose algorithms with large parallel data sets. The included hybrid Simultaneous Multi-Threaded (SMT) shader core was developed to effectively hide memory fetch latency by interleaving individual vectors of threads, grouped for parallel execution, such that thousands of threads are in flight at any time. Both SIMD (Single Instruction Multiple Data) and MIMD (Multiple Instruction Multiple Data) technologies are employed to create the right balance of resources, such as instruction, constant, and fetch capabilities, to maintain peak execution rates with high utilization of computation resources when processing large parallel data sets. Vectors of all types of compute (Vertex, Primitive, Pixel, and Compute) are interleaved in this unified shading system so that the total compute power is available to each type of processing, while each is used in a cooperative manner to assist in hiding latency. At the lowest level of implementation, instruction level parallelism is utilized along with industry leading compiler technology to schedule a collection of flexible VLIW 5-way scalar processor units to hide pipe latency and achieve maximum utilization rates.

Data and instruction caches that provide the right balance of inputs for this shading complex have been developed using a mixture of distributed and unified cache structures. Distributed read-only caches for instructions and constants are used to provide a parallel, robust set of instruction streams with high re-use of fetched instructions, while providing an interleaved streaming idea for reuse when using very long shader programs. For data fetch read-only caches, both a unified texture cache for 1D/2D/3D texture map fetches and a unified vertex cache for array-of-structure type data fetches are included, each optimized for maximum reuse for that data type and for minimizing over-fetch.

This GPU is equipped with an enhanced 2nd generation dual ring memory subsystem design with protocol support for DDR3, DDR4, GDDR3, and GDDR4. This dual ring design provides the desired bandwidth from any memory channel to the desired client. The GPU is also equipped with traditional 3D command buffer fetching/execution capabilities for state, draw, and synchronization commands, vertex re-use and rasterization, and depth and post-shader blend operations.

A thin hardware/software interface layer called CTM (for Close To Metal) will enable general purpose computing access to these compute and fetch capabilities at the lowest level, to reach new heights in many areas by enabling data parallel programs to be developed for this device. Applications will be able to harness the extreme available data bandwidths and high arithmetic compute intensity to process parallel data sets in areas such as signal processing, physics, image processing, matrix-matrix multiply, FFT, and convolution.
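The abstract's pairing of data bandwidth with arithmetic compute intensity can be made concrete with a simple roofline-style estimate using the deck's own headline figures (475 GigaFLOPS, 106 GB/sec). This is a rough sketch under the assumption that a kernel is limited by whichever of the two it exhausts first; the example intensities are hypothetical and not measurements.

```python
PEAK_GFLOPS = 475.0      # quoted peak MulAdd compute
BANDWIDTH_GB_S = 106.0   # quoted memory bandwidth


def machine_balance():
    """FLOPs per byte of memory traffic needed to become compute bound."""
    return PEAK_GFLOPS / BANDWIDTH_GB_S


def attainable_gflops(flops_per_byte):
    """Upper bound on throughput for a kernel with the given arithmetic intensity."""
    return min(PEAK_GFLOPS, BANDWIDTH_GB_S * flops_per_byte)


balance = machine_balance()
streaming = attainable_gflops(1.0)    # hypothetical low-intensity kernel
blocked = attainable_gflops(10.0)     # hypothetical high-intensity kernel

print(f"machine balance: {balance:.2f} FLOPs per byte")
print(f"intensity 1.0:   {streaming:.0f} GFLOPS attainable")
print(f"intensity 10.0:  {blocked:.0f} GFLOPS attainable")
```

Kernels with high data reuse, such as a well-blocked matrix-matrix multiply, are the ones that can sit above this balance point and approach the compute peak.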

Disclaimer & Attribution

DISCLAIMER
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2007 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI logo, Avivo, Catalyst, Radeon, The Ultimate Visual Experience and combinations thereof are trademarks of Advanced Micro Devices, Inc. DirectX, Microsoft, Windows and Vista are registered trademarks of Microsoft Corporation in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.