Anatomy of AMD s TeraScale Graphics Engine

Similar documents
Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

AMD Radeon HD 2900 Highlights

High Performance Graphics 2010

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT

AMD Smarter Choice. Futures Some thoughts. Raja Koduri

AMD 780G. Niles Burbank AMD. an x86 chipset with advanced integrated GPU. Hot Chips 2008

Graphics Hardware 2008

AMD APU and Processor Comparisons. AMD Client Desktop Feb 2013 AMD

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to

The Road to the AMD. Fiji GPU. Featuring Die Stacking and HBM Technology 1 THE ROAD TO THE AMD FIJI GPU ECTC 2016 MAY 2015

viewdle! - machine vision experts

HyperTransport Technology

CAUTIONARY STATEMENT 1 AMD NEXT HORIZON NOVEMBER 6, 2018

The mobile computing evolution. The Griffin architecture. Memory enhancements. Power management. Thermal management

SAPPHIRE R7 260X 2GB GDDR5 OC BATLELFIELD 4 EDITION

GPGPU COMPUTE ON AMD. Udeepta Bordoloi April 6, 2011

SCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL

OPENCL TM APPLICATION ANALYSIS AND OPTIMIZATION MADE EASY WITH AMD APP PROFILER AND KERNELANALYZER

SAPPHIRE DUAL-X R9 270X 2GB GDDR5 OC WITH BOOST

SOLUTION TO SHADER RECOMPILES IN RADEONSI SEPTEMBER 2015

INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD

Use cases. Faces tagging in photo and video, enabling: sharing media editing automatic media mashuping entertaining Augmented reality Games

2 x Maximum Display Monitor(s) support MHz Core Clock 28 nm Chip 384 x Stream Processors. 145(L)X95(W)X26(H) mm Size. 1.

HPG 2011 HIGH PERFORMANCE GRAPHICS HOT 3D

Gestural and Cinematic Interfaces - DX11. David Brebner Unlimited Realities CTO

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS

SAPPHIRE TOXIC R9 280X 3GB GDDR5

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to

Understanding GPGPU Vector Register File Usage

AMD Graphics Team Last Updated April 29, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview April 2013 Approved for public distribution

Entering the Golden Age of Heterogeneous Computing

Using Graphics Chips for General Purpose Computation

AMD Graphics Team Last Updated February 11, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview February 2013 Approved for public distribution


CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

EPYC VIDEO CUG 2018 MAY 2018

FLASH MEMORY SUMMIT Adoption of Caching & Hybrid Solutions

ADVANCED RENDERING EFFECTS USING OPENCL TM AND APU Session Olivier Zegdoun AMD Sr. Software Engineer

MEASURING AND MODELING ON-CHIP INTERCONNECT POWER ON REAL HARDWARE

Release Notes Compute Abstraction Layer (CAL) Stream Computing SDK New Features. 2 Resolved Issues. 3 Known Issues. 3.

THE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD

STREAMING VIDEO DATA INTO 3D APPLICATIONS Session Christopher Mayer AMD Sr. Software Engineer

Xbox 360 high-level architecture

AMD Radeon ProRender plug-in for Unreal Engine. Installation Guide

Official Catalogue. Summer 2008

HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)

DR. LISA SU

From Shader Code to a Teraflop: How Shader Cores Work

ASYNCHRONOUS SHADERS WHITE PAPER 0

The NVIDIA GeForce 8800 GPU

FUSION PROCESSORS AND HPC

AMD CORPORATE TEMPLATE AMD Radeon Open Compute Platform Felix Kuehling

RADEON X1000 Memory Controller

GPU Architecture. Michael Doggett Department of Computer Science Lund university

BIOMEDICAL DATA ANALYSIS ON HETEROGENEOUS PLATFORM. Dong Ping Zhang Heterogeneous System Architecture AMD

AMD RYZEN PROCESSOR WITH RADEON VEGA GRAPHICS CORPORATE BRAND GUIDELINES

Multi-core processors are here, but how do you resolve data bottlenecks in native code?

Fan Control in AMD Radeon Pro Settings. User Guide. This document is a quick user guide on how to configure GPU fan speed in AMD Radeon Pro Settings.

GPUs and GPGPUs. Greg Blanton John T. Lubia

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Graphics Processing Unit Architecture (GPU Arch)

Real-Time Rendering Architectures

The Rise of Open Programming Frameworks. JC BARATAULT IWOCL May 2015

1. Introduction 2. Methods for I/O Operations 3. Buses 4. Liquid Crystal Displays 5. Other Types of Displays 6. Graphics Adapters 7.

GRAPHICS PROCESSING UNITS

AMD HD3D Technology. Setup Guide. 1 AMD HD3D TECHNOLOGY: Setup Guide

ECE 571 Advanced Microprocessor-Based Design Lecture 20

Spotlight: ATI Radeon HD 4890 Graphics

CS427 Multicore Architecture and Parallel Computing

ACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. What is Computer Architecture? Sources

D3D12 & Vulkan: Lessons learned. Dr. Matthäus G. Chajdas Developer Technology Engineer, AMD

Antonio R. Miele Marco D. Santambrogio

Accelerating Applications. the art of maximum performance computing James Spooner Maxeler VP of Acceleration

Xbox 360 Architecture. Lennard Streat Samuel Echefu

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

update Catalogue 09 / 2008 Summer 2008

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

AMD Embedded PCIe ADD-IN BOARD E6760/E6460 Datasheet. (ER93FLA/ER91FLA-xx)

Panel Discussion: The Future of I/O From a CPU Architecture Perspective

SAPPHIRE VAPOR-X HD 7970 GHz Edition 6GB GDDR5

SIMULATOR AMD RESEARCH JUNE 14, 2015

Maximizing Six-Core AMD Opteron Processor Performance with RHEL

Threading Hardware in G80

Run Anywhere. The Hardware Platform Perspective. Ben Pollan, AMD Java Labs October 28, 2008

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

Heterogeneous Computing

PowerVR Series5. Architecture Guide for Developers

GPU Architecture and Function. Michael Foster and Ian Frasch

AMD IOMMU VERSION 2 How KVM will use it. Jörg Rödel August 16th, 2011

Vulkan (including Vulkan Fast Paths)

GPU A rchitectures Architectures Patrick Neill May

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

AMD Brings New Value to Radeon

Automatic Intra-Application Load Balancing for Heterogeneous Systems

Transcription:

Anatomy of AMD s TeraScale Graphics Engine Mike Houston

Design Goals Focus on Efficiency f(perf/watt, Perf/$) Scale up processing power and AA performance Target >2x previous generation Enhance stream computing capability Faster and more flexible Implement advanced feature set DirectX 10.1, tessellation, UVD2, PCIe 2.0, and more 2

10 9 8 7 6 5 4 3 2 1 0 Design Efficiency ATI RADEON X1800 GigaFlops per Watt ATI RADEON X1900 GigaFlops per mm^2 4x Performance/w and Performance/mm² in less than a year ATI RADEON HD 2900 ATI RADEON HD 3800 ATI RADEON HD 4800 3

Terascale Graphics Engine 800 highly optimized stream processing units New SIMD core layout Optimized texture units New texture cache design New memory architecture Optimized render back-ends for fast anti-aliasing performance Enhanced geometry shader & tessellator performance 4

ATI Radeon HD 4800 Series Architecture 10 SIMD cores Each with 80 32-bit Stream Processing Units (800 total) GDDR5 Memory Interface 40 Texture Units ATI Radeon HD 3870 ATI Radeon HD 4870 Difference Die Size 190 mm 2 260 mm 2 1.4x Texture Units SIMD Cores Memory 72 GB/sec 115 GB/sec 1.6x AA Resolve 32 64 2x Z/Stencil 32 64 2x Texture 16 40 2.5x Shader 320 800 2.5x PCI Express Bus Interface UVD & Display Controllers 5

SIMD Cores Each core: Includes 80 scalar stream processing units in total + 16KB Local Data Share Has its own control logic and runs from a shared set of threads Has 4 dedicated texture units + L1 cache Communicates with other SIMD cores via 16KB global data share New design allows texture fetch capability to scale with shader power, maintaining 4:1 ALU:TEX ratio 6

Stream Processing Units 40% increase in performance per mm 2* More aggressive clock gating for improved Performance per Watt * Fast double precision processing (240 GigaFLOPS) Integer bit shift operations for all units (12.5x improvement * ) * Internal AMD test results comparing ATI Radeon TM HD 4800 series and ATI Radeon TM HD 3800 series 7

Texture Units Streamlined design 70% increase in performance/mm 2 * More performance Double the texture cache bandwidth of the ATI Radeon tm HD 3800 series * 2.5x increase in 32-bit filter rate * 1.25x increase in 64-bit filter rate * Up to 160 fetches per clock * Peak 32-bit texture fetch rate ATI Radeon HD 4870 120 Gtex/s ATI Radeon HD 3870 49.6 Gtex/s * Internal AMD test results comparing ATI Radeon TM HD 4800 series and ATI Radeon TM HD 3800 series 8

Texture Units New cache design L2s aligned with memory channels L1s store unique data per SIMD 2.5x increase aggregate L1 * Separate vertex cache Increased bandwidth Up to 480 GB/sec of L1 texture fetch bandwidth Up to 384 GB/sec between L1 & L2 * Comparing ATI Radeon TM HD 4800 series and ATI Radeon TM HD 3800 series 9

Render Back-Ends Focus on improving AA performance per mm 2 Doubled peak rate for depth/stencil ops to 64 per clock Doubled AA peak fill rate for 32-bit & 64-bit color * Doubled non-aa peak fill rate for 64-bit color Supports both fixed function (MSAA) and programmable (CFAA) modes Color ATI Radeon TM HD 3800 series ATI Radeon TM HD 4800 series Difference No MSAA 16 pix/clk 16 pix/clk 1x 2x/4x MSAA 32-bit 8 pix/clk 16 pix/clk 2x 8x MSAA 4 pix/clk 8 pix/clk 2x No MSAA 8 pix/clk 16 pix/clk 2x 2x/4x MSAA 64-bit 8 pix/clk 16 pix/clk 2x 8x MSAA 4 pix/clk 8 pix/clk 2x Depth/stencil only 32 pix/clk 64 pix/clk 2x * Comparing ATI Radeon TM HD 4800 series and ATI Radeon TM HD 3800 series 10

Edge Detect CFAA Filters Enhanced edge-detect filter delivers 12x & 24x CFAA modes Avoids blurring by taking additional samples along edges, not across them Polygon edge Same memory footprint as 4x & 8x MSAA Works with Adaptive AA Pixel boundary AA filter kernel AA sample point 11

Edge Detect CFAA Filters ATI 8x MSAA ATI 24x CFAA (Edge Detect) Images captured from Half-Life 2 by Valve Software 12

Custom Filter Anti-Aliasing Performance Performance benefits greatly from 2x sample generation rate and 2.5x shader resolve rate New fast path between render back-end and shader engine provides further improvements Test settings were 2560x1600 with 8x AF for Call of Duty 4 and Half Life 2, and 1920x1200 with 8xAF for Unreal Tournament 3. Test platform was an AMD Phenom X4 9850, with Catalyst 8.6 driver 13

Memory Controller Architecture New distributed design with hub Controllers distributed around periphery of chip, adjacent to primary bandwidth consumers Memory tiling & 256-bit interface allows reduced latency, silicon area, and power consumption Hub handles relatively low bandwidth traffic PCI Express, CrossFireX interconnect, UVD2, display controllers, intercommunication) 14

Geometry Shader & Tessellation Enhanced geometry amplification performance over previous generation Allow more GS-generated data to be kept on-chip 4x more GS threads supported Improved tessellation unit Instancing support Compatible with DirectX 10/10.1 200 150 100 50 0 Rightmark3D 2.0 (GS - Hyperlight) ATI Radeon HD 3870 ATI Radeon HD 4870 Test settings: High Polycount, Heavy Load settings, 640x480 resolution. Config: Intel Core2 Extreme X9650 processor, using Catalyst 8.5 driver 15

ATI Radeon HD 4800 Series Stream Architecture Several enhancements done for stream computing Fast compute vector Local and Global data shares Fast Integer Processing Fast Memexport/Memimport Significant increases in performance on many important stream processing workloads 800% 700% 600% 500% 400% 300% 200% 100% Radeon HD 3870 Radeon HD 4870 0% Peak Single Peak Double Precision Flops Precision Flops Matrix Multiplication (Single) Matrix Multiplication (Double) FFT AES Encryption Internal AMD testing, CAL SDK version 1.1, Intel QX6800 CPU, Catalyst version 8.5 16

ATI Radeon HD 4870 Computation Highlights >100 GB/s memory bandwidth 256b GDDR5 interface Targeted for handling thousands of simultaneous lightweight threads 800 (160x5) stream processors 640 (160x4) basic units (FMAC, ADD/SUB, etc.) ~1.2 TFlops theoretical peak 160 enhanced transcendental units (adds COS, LOG, EXP, RSQ, etc.) Support for INT/UINT in all units (ADD/SUB, AND, XOR, NOT, OR, etc.) 64-bit double precision FP support 1/5 single precision rate (~250GFlops theoretical performance) 4 SIMDs -> 10 SIMDs 2.5X peak performance increase over ATI Radeon 3870 ~1.2 TFlops FP32 theoretical peak ~250 GFlops FP64 theoretical peak Scratch-pad memories 16KB per SIMD (LDS) 16KB across SIMDs (GDS) Synchronization capabilities Compute Shader Launch work without rasterization Linear scheduling Faster thread launch Beyond 1 Programmable Shading: Fundamentals 7

Power Dynamic Power Management On-chip microcontroller Constantly monitors thermals sensors and activity of various GPU blocks, PCI Express bus Minimal driver overhead No clock gating Clock gating Up to 36% avg. power savings from clock gating 1 Controls clock gating, engine/memory clock speeds, voltages, and fan controller Enables 2x Perf/W improvement vs. ATI Radeon HD 3800 2 Windows Desktop 3D App Windows Desktop Time 1 Internal AMD test results for ATI Radeon TM HD 4800 series 2 Internal AMD test results comparing ATI Radeon TM HD 4800 series and ATI Radeon TM HD 3800 series 18

Terascale Graphics Have Arrived Efficient GPU design Major advances in Performance/Watt, Performance/$ Improved game performance and image quality >2x increase in AA frame rates Massive stream compute power Over 1 TeraFLOPS per GPU Advanced feature set DirectX 10.1, tessellation, CFAA, GDDR5, PCI Express 2.0 19

Disclaimer and Attribution DISCLAIMER The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. ATTRIBUTION 2008 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo, ATI, the ATI logo, CrossFireX, PowerPlay and Radeon and combinations thereof are trademarks of Advanced Micro Devices, Inc. Other names are for informational purposes only and may be trademarks of their respective owners. 20