HPG 2011 HIGH PERFORMANCE GRAPHICS HOT 3D

Similar documents
AMD GRAPHIC CORE NEXT

OPENCL TM APPLICATION ANALYSIS AND OPTIMIZATION MADE EASY WITH AMD APP PROFILER AND KERNELANALYZER

INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD

Understanding GPGPU Vector Register File Usage

SIMULATOR AMD RESEARCH JUNE 14, 2015

ADVANCED RENDERING EFFECTS USING OPENCL TM AND APU Session Olivier Zegdoun AMD Sr. Software Engineer

HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE


MEASURING AND MODELING ON-CHIP INTERCONNECT POWER ON REAL HARDWARE

THE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD

SOLUTION TO SHADER RECOMPILES IN RADEONSI SEPTEMBER 2015

AMD CORPORATE TEMPLATE AMD Radeon Open Compute Platform Felix Kuehling

GPGPU COMPUTE ON AMD. Udeepta Bordoloi April 6, 2011

AMD IOMMU VERSION 2 How KVM will use it. Jörg Rödel August 16th, 2011

AMD Radeon HD 2900 Highlights

Panel Discussion: The Future of I/O From a CPU Architecture Perspective

AMD Graphics Team Last Updated February 11, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview February 2013 Approved for public distribution

ROCm: An open platform for GPU computing exploration

High Performance Graphics 2010

D3D12 & Vulkan: Lessons learned. Dr. Matthäus G. Chajdas Developer Technology Engineer, AMD

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT

viewdle! - machine vision experts

Sequential Consistency for Heterogeneous-Race-Free

AMD RYZEN PROCESSOR WITH RADEON VEGA GRAPHICS CORPORATE BRAND GUIDELINES

AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING FELLOW 3 OCTOBER 2016

Use cases. Faces tagging in photo and video, enabling: sharing media editing automatic media mashuping entertaining Augmented reality Games

Vulkan (including Vulkan Fast Paths)

Fusion Enabled Image Processing

AMD Graphics Team Last Updated April 29, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview April 2013 Approved for public distribution

FUSION PROCESSORS AND HPC

Anatomy of AMD s TeraScale Graphics Engine

Heterogeneous Computing

Generic System Calls for GPUs

Accelerating Applications. the art of maximum performance computing James Spooner Maxeler VP of Acceleration

Graphics Hardware 2008

1 HiPEAC January, 2012 Public TASKS, FUTURES AND ASYNCHRONOUS PROGRAMMING

RegMutex: Inter-Warp GPU Register Time-Sharing

STREAMING VIDEO DATA INTO 3D APPLICATIONS Session Christopher Mayer AMD Sr. Software Engineer

AMD APU and Processor Comparisons. AMD Client Desktop Feb 2013 AMD

HyperTransport Technology

Automatic Intra-Application Load Balancing for Heterogeneous Systems

CAUTIONARY STATEMENT 1 AMD NEXT HORIZON NOVEMBER 6, 2018

The Rise of Open Programming Frameworks. JC BARATAULT IWOCL May 2015

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to

Run Anywhere. The Hardware Platform Perspective. Ben Pollan, AMD Java Labs October 28, 2008

Maximizing Six-Core AMD Opteron Processor Performance with RHEL

The mobile computing evolution. The Griffin architecture. Memory enhancements. Power management. Thermal management

KVM CPU MODEL IN SYSCALL EMULATION MODE ALEXANDRU DUTU, JOHN SLICE JUNE 14, 2015

HIGHLY PARALLEL COMPUTING IN PHYSICS-BASED RENDERING OpenCL Raytracing Based. Thibaut PRADOS OPTIS Real-Time & Virtual Reality Manager

BIOMEDICAL DATA ANALYSIS ON HETEROGENEOUS PLATFORM. Dong Ping Zhang Heterogeneous System Architecture AMD

EXPLOITING ACCELERATOR-BASED HPC FOR ARMY APPLICATIONS

LIQUIDVR TODAY AND TOMORROW GUENNADI RIGUER, SOFTWARE ARCHITECT

Multi-core processors are here, but how do you resolve data bottlenecks in native code?

ACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research

INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS

FLASH MEMORY SUMMIT Adoption of Caching & Hybrid Solutions

AMD HD3D Technology. Setup Guide. 1 AMD HD3D TECHNOLOGY: Setup Guide

Gestural and Cinematic Interfaces - DX11. David Brebner Unlimited Realities CTO

Designing Natural Interfaces

The Road to the AMD. Fiji GPU. Featuring Die Stacking and HBM Technology 1 THE ROAD TO THE AMD FIJI GPU ECTC 2016 MAY 2015

SCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to

NEXT-GENERATION MATRIX 3D IMMERSIVE USER INTERFACE [ M3D-IUI ] H Raghavendra Swamy AMD Senior Software Engineer

OpenCL Implementation Of A Heterogeneous Computing System For Real-time Rendering And Dynamic Updating Of Dense 3-d Volumetric Data

clarmor: A DYNAMIC BUFFER OVERFLOW DETECTOR FOR OPENCL KERNELS CHRIS ERB, JOE GREATHOUSE, MAY 16, 2018

AMD EPYC CORPORATE BRAND GUIDELINES

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

3D Numerical Analysis of Two-Phase Immersion Cooling for Electronic Components

AMD S X86 OPEN64 COMPILER. Michael Lai AMD

AMD SEV Update Linux Security Summit David Kaplan, Security Architect

Desktop Telepresence Arrived! Sudha Valluru ViVu CEO

Cilk Plus: Multicore extensions for C and C++

1 Presentation Title Month ##, 2012

HPCA 18. Reliability-aware Data Placement for Heterogeneous memory Architecture

NUMA Topology for AMD EPYC Naples Family Processors

Real-Time Rendering Architectures

DR. LISA SU

OPENCL GPU BEST PRACTICES BENJAMIN COQUELLE MAY 2015

MULTIMEDIA PROCESSING Real-time H.264 video enhancement by using AMD APP SDK

Portland State University ECE 588/688. Graphics Processors

AMD Radeon ProRender plug-in for Unreal Engine. Installation Guide

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

AMD RYZEN CORPORATE BRAND GUIDELINES

MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)

PROTECTING VM REGISTER STATE WITH AMD SEV-ES DAVID KAPLAN LSS 2017

Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

AMD AIB Partner Guidelines. Version February, 2015

The 3rd Generation High Frequency Microprocessor Chip

Pattern-based analytics to estimate and track yield risk of designs down to 7nm

Parallel Programming on Larrabee. Tim Foley Intel Corp

The Bifrost GPU architecture and the ARM Mali-G71 GPU

A Detailed Look at the R600 Backend. Tom Stellard November 7, 2013

This Unit: Putting It All Together. CIS 501 Computer Architecture. What is Computer Architecture? Sources

GRAPHICS PROCESSING UNITS

Unit 11: Putting it All Together: Anatomy of the XBox 360 Game Console

Beyond Programmable Shading Keeping Many Cores Busy: Scheduling the Graphics Pipeline

From Shader Code to a Teraflop: How Shader Cores Work

AMD Security and Server innovation

This Unit: Putting It All Together. CIS 371 Computer Organization and Design. Sources. What is Computer Architecture?

Transcription:

HPG 2011 HIGH PERFORMANCE GRAPHICS HOT 3D AMD GRAPHIC CORE NEXT Low Power High Performance Graphics & Parallel Compute Michael Mantor AMD Senior Fellow Architect Michael.mantor@amd.com Mike Houston AMD Fellow Architect michael.houston@amd.com At the heart of every AMD APU/GPU is a power aware high performance set of compute units that have been advancing to bring users new levels of programmability, precision and performance.

AGENDA AMD Graphic Core Next Architecture Unified Scalable Graphic Processing Unit (GPU) optimized for Graphics and Compute Multiple Engine Architecture with Multi-Task Capabilities Compute Unit Architecture Multi-Level R/W Cache Architecture What will not be discussed Roadmaps/Schedules New Product Configurations Feature Rollout Visit AMD Fusion Developers Summit online for Fusion System Architecture details http://developer.amd.com/afds/pages/session.aspx 2 HPG 2011 AMD Graphics Core Next August 2011

SCALABLE MULTI-TASK GRAPHICS ENGINE GFX Command Processor CS Pipe Work Distributor Primitive Pipe 0 HOS Tessellate Geometry Primitive Pipe n HOS Tessellate Geometry Pixel Pipe 0 Scan Conversion Scalable Graphics Engine Pixel Pipe n Scan Conversion R/W L2 MC HUB & MEM RB RB Unified Shader Core HOS High Order Surface RB - Render Backend CS - Compute Shader GFX - Graphics 3 HPG 2011 AMD Graphics Core Next August 2011

SCALABLE MULTI-TASK GRAPHICS ENGINE CS Pipe Primitive Scaling Multiple Primitive Pipelines GFX Command Processor Pixel Scaling Multiple Screen Partitions Multi-task graphics engine use of unified shader Work Distributor Primitive Pipe 0 HOS Tessellate Geometry Primitive Pipe n HOS Tessellate Geometry Pixel Pipe 0 Scan Conversion Scalable Graphics Engine Pixel Pipe n Scan Conversion R/W L2 MC HUB & MEM RB RB Unified Shader Core 4 HPG 2011 AMD Graphics Core Next August 2011

MULTI-ENGINE UNIFIED COMPUTING GPU ACE 0 CP CS Pipe 0 ACE n CP CS Pipe n GFX Command Processor Asynchronous Work Compute Distributor Engine (ACE) Primitive Command Pipe Processor 0 HOS CS Hardware Command Queue Fetcher Pipe Device Coherent R/W Cache Access Tessellate Primitive Pipe n HOS Tessellate Load Acquire/Store Release Semantics Geometry Geometry Global Data Share Access Hardware synchronization Independent & Concurrent Grid/Group Dispatchers Unified Shader Core Scalable Graphics Engine Real time task scheduling Background task scheduling Pixel Pixel R/W Pipe Compute 0 Task Graph Pipe Processing n L2 Hardware Scheduling Scan Conversion RB Scan Conversion Task Queue Context Switching Error Detection & Correction (EDCC) RB For GDDR and internal SRAM Pools MC HUB & MEM 5 HPG 2011 AMD Graphics Core Next August 2011

AMD GRAPHIC CORE NEXT COMPUTE UNIT ARCHITECTURE 6 HPG 2011 AMD Graphics Core Next August 2011

Instruction Fetch Arbitration NON-VLIW ISA WITH SCALAR + VECTOR UNITS Simpler ISA compared to previous generation Msg Bus No clauses Branch & MSG Unit Export/GDS No VLIW packing Control flow more directly programmed Advanced language feature support Instruction Arbitration Scalar Memory Scalar Unit 8 KB Integer 0 MP 1 MP 2 MP 3 MP R/W data L1 16KB Export Bus R/W L2 Exception support LDS LDS Memory Function calls Recursion 4 CU Shared 16KB Scalar Read Only L1 4 CU Shared 32KB Instruction L1 Rqst Arb R/W L2 Enhanced extended operations Media ops Integer ops Floating point atomics (min, max, cmpxchg) Improved debug support HW functionality to improve debug support 7 HPG 2011 AMD Graphics Core Next August 2011

Instruction Fetch Arbitration PROGRAMMERS VIEW OF COMPUTE UNIT Input Data: PC/State/ Register/Scalar Register Instruction Arbitration Branch & MSG Unit Msg Bus Export/GDS Memory Scalar Scalar Unit 8 KB Integer PC- Program Counter MSG - Message Single Instruction multiple data MP R/W data L1 16KB Export Bus R/W L2 LDS LDS Memory 4 CU Shared 16KB Scalar Read Only L1 4 CU Shared 32KB Instruction L1 Rqst Arb R/W L2 8 HPG 2011 AMD Graphics Core Next August 2011

SOME CODE EXAMPLES (1) float fn0(float a,float b) { if(a>b) return((a-b)*a); else return((b-a)*b); } Optional: Use based on the number of instruction in conditional section. Executed in branch unit // r0 contains a, r1 contains b //Value is returned in r2 v_cmp_gt_f32 r0,r1 //a > b, establish VCC s_mov_b64 s0,exec //Save current exec mask s_and_b64 exec,vcc,exec //Do if s_cbranch_vccz label0 //Branch if all lanes fail v_sub_f32 r2,r0,r1 //result = a b v_mul_f32 r2,r2,r0 //result=result * a label0: s_andn2_b64 exec,s0,exec //Do else (s0 &!exec) s_cbranch_execz label1 //Branch if all lanes fail v_sub_f32 r2,r1,r0 //result = b a v_mul_f32 r2,r2,r1 //result = result * b label1: s_mov_b64 exec,s0 //Restore exec mask 9 HPG 2011 AMD Graphics Core Next August 2011

Instruction Fetch Arbitration COMPUTE UNIT ARCHITECTURE Input Data: PC/State/ Register/Scalar Register Branch & MSG Unit Msg Bus Instruction Arbitration Memory Scalar LDS Export/GDS Scalar Unit 8 KB Integer 0 MP 1 MP 2 MP LDS Memory 3 MP R/W data L1 16KB Export Bus R/W L2 4 CU Shared 16KB Scalar Read Only L1 4 CU Shared 32KB Instruction L1 Rqst Arb R/W L2 10 HPG 2011 AMD Graphics Core Next August 2011

NON-VERY LONG INSTRUCTION WORD (VLIW) VECTOR ENGINES 4 WAY VLIW 4 Non-VLIW 4 Way VLIW 4 non-vliw 64 Single Precision MAC 64 Single Precision MAC VGPR 64 * 4 * 256-32bit 256KB VGPR 4 * 64 * 256-32bit 256KB 1 VLIW Instruction * 4 Ops Dependencies limitations 4 * 1 Operation Occupancy limitations 3 SRC GPRs, 1 Destination 3 SRC GPRs, 1 \1Scalar Register Destination Compiler manage VGPR port conflicts V Instruction Bandwidth 1-7 dwords(~2 dwords/clk) Interleaved wavefront instruction required Specialized complicated compiler scheduling Difficult assembly creation, analysis, & debug Complicated tool chain support Less predictive results and performance No VGPR port conflicts V Instruction Bandwidth 1-2 dwords/cycle back-to-back wavefront instruction issue Standard compiler scheduling & optimizations Simplified assembly creation, analysis, & debug Simplified tool chain development and support Stable and predictive results and performance 11 HPG 2011 AMD Graphics Core Next August 2011

R/W CACHE Read / Write Data cached Bandwidth amplification Command Processors Improved behavior on more memory access patterns Improved write to read reuse performance Compute Unit 16KB DL1 64-128KB R/W L2 per MC Channel L1 Write-through / L2 write-back caches Relaxed memory model Consistency controls available for locality of load/store/atomic GPU Coherent Acquire / Release semantics control data visibility across the machine 16 KB Scalar DL1 32 KB Instruction L1 X B A R 64-128KB R/W L2 per MC Channel L2 coherent = all CUs can have the same view of data Compute Unit 16KB DL1 Remote Global atomics Performed in L2 cache 12 HPG 2011 AMD Graphics Core Next August 2011

AMD Graphic Core Next Compute Unit Architecture Summary A heavily multi-threaded Compute Unit (CU) architected for throughput Efficiently balanced for graphics and general compute Simplified coding for performance, debug and analysis Simplified machine view for tool chain development Low latency flexible control flow operations Read/Write Cache Hierarchy improves I/O characteristics Flexible vector load, store, and remote atomic operations Load acquire / Store release consistency controls 13 HPG 2011 AMD Graphics Core Next August 2011

QUESTIONS?

Disclaimer & Attribution The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes. NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. OpenCL is a trademark of Apple Inc. used with permission by Khronos. DirectX is a registered trademark of Microsoft Corporation. 2011 Advanced Micro Devices, Inc. All rights reserved. 15 HPG 2011 AMD Graphics Core Next August 2011