THE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD

Similar documents
HETEROGENEOUS SYSTEM ARCHITECTURE: PLATFORM FOR THE FUTURE

INTRODUCTION TO OPENCL TM A Beginner s Tutorial. Udeepta Bordoloi AMD

Heterogeneous Computing

OPENCL TM APPLICATION ANALYSIS AND OPTIMIZATION MADE EASY WITH AMD APP PROFILER AND KERNELANALYZER

FUSION PROCESSORS AND HPC

AMD ACCELERATING TECHNOLOGIES FOR EXASCALE COMPUTING FELLOW 3 OCTOBER 2016

HETEROGENEOUS COMPUTING IN THE MODERN AGE. 1 HiPEAC January, 2012 Public

SIMULATOR AMD RESEARCH JUNE 14, 2015

HSA foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room A. Alario! 23 November, 2015!

Panel Discussion: The Future of I/O From a CPU Architecture Perspective

HSA Foundation! Advanced Topics on Heterogeneous System Architectures. Politecnico di Milano! Seminar Room (Bld 20)! 15 December, 2017!

x Welcome to the jungle. The free lunch is so over

Use cases. Faces tagging in photo and video, enabling: sharing media editing automatic media mashuping entertaining Augmented reality Games

HyperTransport Technology

AMD CORPORATE TEMPLATE AMD Radeon Open Compute Platform Felix Kuehling

ADVANCED RENDERING EFFECTS USING OPENCL TM AND APU Session Olivier Zegdoun AMD Sr. Software Engineer

The Rise of Open Programming Frameworks. JC BARATAULT IWOCL May 2015

Automatic Intra-Application Load Balancing for Heterogeneous Systems

AMD s Unified CPU & GPU Processor Concept

CLICK TO EDIT MASTER TITLE STYLE. Click to edit Master text styles. Second level Third level Fourth level Fifth level

Generic System Calls for GPUs

HPG 2011 HIGH PERFORMANCE GRAPHICS HOT 3D

AMD APU and Processor Comparisons. AMD Client Desktop Feb 2013 AMD

INTERFERENCE FROM GPU SYSTEM SERVICE REQUESTS

viewdle! - machine vision experts

Fusion Enabled Image Processing

AMD Graphics Team Last Updated February 11, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview February 2013 Approved for public distribution

BIOMEDICAL DATA ANALYSIS ON HETEROGENEOUS PLATFORM. Dong Ping Zhang Heterogeneous System Architecture AMD

EXPLOITING ACCELERATOR-BASED HPC FOR ARMY APPLICATIONS

EFFICIENT SPARSE MATRIX-VECTOR MULTIPLICATION ON GPUS USING THE CSR STORAGE FORMAT

AMD IOMMU VERSION 2 How KVM will use it. Jörg Rödel August 16th, 2011

MIGRATION OF LEGACY APPLICATIONS TO HETEROGENEOUS ARCHITECTURES Francois Bodin, CTO, CAPS Entreprise. June 2011

ROCm: An open platform for GPU computing exploration

HIGHLY PARALLEL COMPUTING IN PHYSICS-BASED RENDERING OpenCL Raytracing Based. Thibaut PRADOS OPTIS Real-Time & Virtual Reality Manager


1 HiPEAC January, 2012 Public TASKS, FUTURES AND ASYNCHRONOUS PROGRAMMING

AMD Graphics Team Last Updated April 29, 2013 APPROVED FOR PUBLIC DISTRIBUTION. 1 3DMark Overview April 2013 Approved for public distribution

KVM CPU MODEL IN SYSCALL EMULATION MODE ALEXANDRU DUTU, JOHN SLICE JUNE 14, 2015

MEASURING AND MODELING ON-CHIP INTERCONNECT POWER ON REAL HARDWARE

A LOOK INTO THE FUTURE. 1 HiPEAC January, 2012 Public

CAUTIONARY STATEMENT 1 AMD NEXT HORIZON NOVEMBER 6, 2018

SCALING DGEMM TO MULTIPLE CAYMAN GPUS AND INTERLAGOS MANY-CORE CPUS FOR HPL

AMD HD3D Technology. Setup Guide. 1 AMD HD3D TECHNOLOGY: Setup Guide

Graphics Hardware 2008

LIQUIDVR TODAY AND TOMORROW GUENNADI RIGUER, SOFTWARE ARCHITECT

THE HETEROGENEOUS SYSTEM ARCHITECTURE IT S BEYOND THE GPU

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to

GPGPU COMPUTE ON AMD. Udeepta Bordoloi April 6, 2011

D3D12 & Vulkan: Lessons learned. Dr. Matthäus G. Chajdas Developer Technology Engineer, AMD

CAUTIONARY STATEMENT This presentation contains forward-looking statements concerning Advanced Micro Devices, Inc. (AMD) including, but not limited to

Accelerating Applications. the art of maximum performance computing James Spooner Maxeler VP of Acceleration

Understanding GPGPU Vector Register File Usage

NEXT-GENERATION MATRIX 3D IMMERSIVE USER INTERFACE [ M3D-IUI ] H Raghavendra Swamy AMD Senior Software Engineer

Sequential Consistency for Heterogeneous-Race-Free

The mobile computing evolution. The Griffin architecture. Memory enhancements. Power management. Thermal management

The Road to the AMD. Fiji GPU. Featuring Die Stacking and HBM Technology 1 THE ROAD TO THE AMD FIJI GPU ECTC 2016 MAY 2015

SOLUTION TO SHADER RECOMPILES IN RADEONSI SEPTEMBER 2015

clarmor: A DYNAMIC BUFFER OVERFLOW DETECTOR FOR OPENCL KERNELS CHRIS ERB, JOE GREATHOUSE, MAY 16, 2018

Maximizing Six-Core AMD Opteron Processor Performance with RHEL

AMD RYZEN PROCESSOR WITH RADEON VEGA GRAPHICS CORPORATE BRAND GUIDELINES

STREAMING VIDEO DATA INTO 3D APPLICATIONS Session Christopher Mayer AMD Sr. Software Engineer

Cilk Plus: Multicore extensions for C and C++

Multi-core processors are here, but how do you resolve data bottlenecks in native code?

Gestural and Cinematic Interfaces - DX11. David Brebner Unlimited Realities CTO

Designing Natural Interfaces

Run Anywhere. The Hardware Platform Perspective. Ben Pollan, AMD Java Labs October 28, 2008

Desktop Telepresence Arrived! Sudha Valluru ViVu CEO

Vulkan (including Vulkan Fast Paths)

RegMutex: Inter-Warp GPU Register Time-Sharing

HPCA 18. Reliability-aware Data Placement for Heterogeneous memory Architecture

AMD Radeon ProRender plug-in for Unreal Engine. Installation Guide

PCCC WORKSHOP:AMD の最新製品戦略とプラットフォームソリューション FEBRUARY 19 TH 2016 HIDETOSHI IWASA, FAE MANAGER AMD JAPAN

ASYNCHRONOUS SHADERS WHITE PAPER 0

MULTIMEDIA PROCESSING Real-time H.264 video enhancement by using AMD APP SDK

Anatomy of AMD s TeraScale Graphics Engine

OpenCL Implementation Of A Heterogeneous Computing System For Real-time Rendering And Dynamic Updating Of Dense 3-d Volumetric Data

FLASH MEMORY SUMMIT Adoption of Caching & Hybrid Solutions

AMD S X86 OPEN64 COMPILER. Michael Lai AMD

AMD SEV Update Linux Security Summit David Kaplan, Security Architect

Parallel Programming on Larrabee. Tim Foley Intel Corp

Pattern-based analytics to estimate and track yield risk of designs down to 7nm

Introducing NVDIMM-X: Designed to be the World s Fastest NAND-Based SSD Architecture and a Platform for the Next Generation of New Media SSDs

AMD EPYC CORPORATE BRAND GUIDELINES

3D Numerical Analysis of Two-Phase Immersion Cooling for Electronic Components

DR. LISA SU

AMD 780G. Niles Burbank AMD. an x86 chipset with advanced integrated GPU. Hot Chips 2008

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

INTRODUCING RYZEN MARCH

Copyright Khronos Group Page 1. Vulkan Overview. June 2015

ACCELERATING MATRIX PROCESSING WITH GPUs. Nicholas Malaya, Shuai Che, Joseph Greathouse, Rene van Oostrum, and Michael Schulte AMD Research

Higher Level Programming Abstractions for FPGAs using OpenCL

OpenCL Overview. Shanghai March Neil Trevett Vice President Mobile Content, NVIDIA President, The Khronos Group

1401 HETEROGENEOUS HPC How Fusion Designs Can Advance Science

GPUs and Emerging Architectures

Resource Saving: Latest Innovation in Optimized Cloud Infrastructure

AMD AIB Partner Guidelines. Version February, 2015

AMD Elite A-Series APU Desktop LAUNCHING JUNE 4 TH PLACE YOUR ORDERS TODAY!

PROTECTING VM REGISTER STATE WITH AMD SEV-ES DAVID KAPLAN LSS 2017

AMD Brings New Value to Radeon

1 Presentation Title Month ##, 2012

Transcription:

THE PROGRAMMER S GUIDE TO THE APU GALAXY Phil Rogers, Corporate Fellow AMD

THE OPPORTUNITY WE ARE SEIZING Make the unprecedented processing capability of the APU as accessible to programmers as the CPU is today. 2 The Programmer s Guide to the APU Galaxy June 2011

OUTLINE The APU today and its programming environment The future of the heterogeneous platform AMD Fusion System Architecture Roadmap Software evolution A visual view of the new command and data flow 3 The Programmer s Guide to the APU Galaxy June 2011

APU: ACCELERATED PROCESSING UNIT The APU has arrived and it is a great advance over previous platforms Combines scalar processing on CPU with parallel processing on the GPU and high bandwidth access to memory How do we make it even better going forward? Easier to program Easier to optimize Easier to load balance Higher performance Lower power 4 The Programmer s Guide to the APU Galaxy June 2011

LOW POWER E-SERIES AMD FUSION APU: ZACATE E-Series APU 2 x86 Bobcat CPU cores Array of Radeon Cores Discrete-class DirectX 11 performance 80 Stream Processors 3rd Generation Unified Video Decoder PCIe Gen2 Single-channel DDR3 @ 1066 18W TDP Performance: Up to 8.5GB/s System Memory Bandwidth Up to 90 Gflop of Single Precision Compute 5 The Programmer s Guide to the APU Galaxy June 2011

TABLET Z-SERIES AMD FUSION APU: DESNA Z-Series APU 2 x86 Bobcat CPU cores Array of Radeon Cores Discrete-class DirectX 11 performance 80 Stream Processors 3rd Generation Unified Video Decoder PCIe Gen2 Single-channel DDR3 @ 1066 6W TDP w/ Local Hardware Thermal Control Performance: Up to 8.5GB/s System Memory Bandwidth Suitable for sealed, passively cooled designs 6 The Programmer s Guide to the APU Galaxy June 2011

MAINSTREAM A-SERIES AMD FUSION APU: LLANO A-Series APU Up to four x86 CPU cores AMD Turbo CORE frequency acceleration Array of AMD Radeon Cores Discrete-class DirectX 11 performance 3rd Generation Unified Video Decoder Blu-ray 3D stereoscopic display PCIe Gen2 Dual-channel DDR3 45W TDP Performance: Up to 29GB/s System Memory Bandwidth Up to 500 Gflops of Single Precision Compute 7 The Programmer s Guide to the APU Galaxy June 2011

COMMITTED TO OPEN STANDARDS AMD drives open and de-facto standards Compete on the best implementation Open standards are the basis for large ecosystems Open standards always win over time DirectX SW developers want their applications to run on multiple platforms from multiple hardware vendors 8 The Programmer s Guide to the APU Galaxy June 2011

Single-thread Performance Throughput Performance rn Application Performance A NEW ERA OF PROCESSOR PERFORMANCE Single-Core Era Multi-Core Era Heterogeneous Systems Era Enabled by: Moore s Law Voltage Scaling Constrained by: Power Complexity Enabled by: Moore s Law SMP architecture Constrained by: Power Parallel SW Scalability Enabled by: Abundant data parallelism Power efficient GPUs Temporarily Constrained by: Programming models Comm.overhead Assembly C/C++ Java pthreads OpenMP / TBB Shader CUDA OpenCL!!!? we are here we are here we are here Time Time (# of processors) Time (Data-parallel exploitation) 9 The Programmer s Guide to the APU Galaxy June 2011

Architecture Maturity & Programmer Accessibility Poor Excellent EVOLUTION OF HETEROGENEOUS COMPUTING Standards s Era Architected Era AMD Fusion System Architecture GPU Peer Processor Proprietary s Era Graphics & Proprietary -based APIs Adventurous programmers Exploit early programmable shader cores in the GPU Make your program look like graphics to the GPU CUDA, Brook+, etc OpenCL, DirectCompute -based APIs Expert programmers C and C++ subsets Compute centric APIs, data types Multiple address spaces with explicit data movement Specialized work queue based structures Kernel mode dispatch Mainstream programmers Full C++ GPU as a co-processor Unified coherent address space Task parallel runtimes Nested Data Parallel programs User mode dispatch Pre-emption and context switching See Herb Sutter s Keynote tomorrow for a cool example of plans for the architected era! 2002-2008 2009-2011 2012-2020 10 The Programmer s Guide to the APU Galaxy June 2011

FSA FEATURE ROADMAP Physical Integration Optimized Platforms Architectural Integration System Integration Integrate CPU & GPU in silicon GPU Compute C++ support Unified Address Space for CPU and GPU GPU compute context switch Unified Memory Controller User mode schedulng GPU uses pageable system memory via CPU pointers GPU graphics pre-emption Quality of Service Common Manufacturing Technology Bi-Directional Power Mgmt between CPU and GPU Fully coherent memory between CPU & GPU Extend to Discrete GPU 11 The Programmer s Guide to the APU Galaxy June 2011

AMD FUSION SYSTEM ARCHITECTURE AN OPEN PLATFORM Open Architecture, published specifications FSAIL virtual ISA FSA memory model FSA dispatch ISA agnostic for both CPU and GPU Inviting partners to join us, in all areas Hardware companies Operating Systems Tools and Middleware Applications FSA review committee planned 12 The Programmer s Guide to the APU Galaxy June 2011

FSA INTERMEDIATE LAYER - FSAIL FSAIL is a virtual ISA for parallel programs Finalized to ISA by a JIT compiler or Finalizer Explicitly parallel Designed for data parallel programming Support for exceptions, virtual functions, and other high level language features Syscall methods GPU code can call directly to system services, IO, printf etc Debugging support 13 The Programmer s Guide to the APU Galaxy June 2011

FSA MEMORY MODEL Designed to be compatible with C++0x, Java and.net Memory ls Relaxed consistency memory model for parallel compute performance Loads and stores can be re-ordered by the finalizer Visibility controlled by: Load.Acquire, Store.Release Fences Barriers 14 The Programmer s Guide to the APU Galaxy June 2011

Stack FSA Software Stack Apps AppsApps Apps AppsApps Apps Apps Apps Apps Apps Apps Domain Libraries FSA Domain Libraries OpenCL 1.x, DX Runtimes, User s Graphics Kernel FSA JIT Task Queuing Libraries FSA Runtime FSA Kernel Hardware - APUs, CPUs, GPUs AMD user mode component AMD kernel mode component All others contributed by third parties or AMD 15 The Programmer s Guide to the APU Galaxy June 2011

OPENCL AND FSA FSA is an optimized platform architecture for OpenCL Not an alternative to OpenCL OpenCL on FSA will benefit from Avoidance of wasteful copies Low latency dispatch Improved memory model Pointers shared between CPU and GPU FSA also exposes a lower level programming interface, for those that want the ultimate in control and performance Optimized libraries may choose the lower level interface 16 The Programmer s Guide to the APU Galaxy June 2011

TASK QUEUING RUNTIMES Popular pattern for task and data parallel programming on SMP systems today Characterized by: A work queue per core Runtime library that divides large loops into tasks and distributes to queues A work stealing runtime that keeps the system balanced FSA is designed to extend this pattern to run on heterogeneous systems 17 The Programmer s Guide to the APU Galaxy June 2011

TASK QUEUING RUNTIME ON CPUS Work Stealing Runtime Q Q Q Q CPU Worker CPU Worker CPU Worker CPU Worker X86 CPU X86 CPU X86 CPU X86 CPU CPU Threads GPU Threads Memory 18 The Programmer s Guide to the APU Galaxy June 2011

TASK QUEUING RUNTIME ON THE FSA PLATFORM Work Stealing Runtime Q Q Q Q Q CPU Worker CPU Worker CPU Worker CPU Worker GPU Manager X86 CPU X86 CPU X86 CPU X86 CPU Radeon GPU CPU Threads GPU Threads Memory 19 The Programmer s Guide to the APU Galaxy June 2011

TASK QUEUING RUNTIME ON THE FSA PLATFORM Work Stealing Runtime Q Q Q Q Q CPU Worker CPU Worker CPU Worker CPU Worker GPU Manager Memory X86 CPU X86 CPU X86 CPU X86 CPU Fetch and Dispatch CPU Threads GPU Threads Memory S I M D S I M D S I M D S I M D S I M D 20 The Programmer s Guide to the APU Galaxy June 2011

FSA SOFTWARE EXAMPLE - REDUCTION float foo(float); float myarray[ ]; Task<float, ReductionBin> sum([myarray](indexrange<1> index) [[device]] { float sum = 0.; for (size_t I = index.begin(); I!= index.end(); i++) { sum += foo(myarray[i]); } return float; }); float sum = task.enqueuewithreduce(partition<1, Auto>(1920), [] (int x, int y) [[device]] { return x+y; }, 0.); 21 The Programmer s Guide to the APU Galaxy June 2011

HETEROGENEOUS COMPUTE DISPATCH How compute dispatch operates today in the driver model How compute dispatch improves tomorrow under FSA 22 The Programmer s Guide to the APU Galaxy June 2011

TODAY S COMMAND AND DISPATCH FLOW Command Flow Data Flow Application A Direct3D User Soft Queue Kernel Command Buffer DMA Buffer A GPU HARDWARE Hardware Queue 23 The Programmer s Guide to the APU Galaxy June 2011

TODAY S COMMAND AND DISPATCH FLOW Command Flow Data Flow Application A Direct3D User Soft Queue Kernel Command Buffer DMA Buffer Command Flow Data Flow Application B Direct3D User Soft Queue Kernel A GPU HARDWARE Command Buffer DMA Buffer Command Flow Data Flow Application C Direct3D User Soft Queue Kernel Hardware Queue Command Buffer DMA Buffer 24 The Programmer s Guide to the APU Galaxy June 2011

TODAY S COMMAND AND DISPATCH FLOW Command Flow Data Flow Application A Direct3D User Soft Queue Kernel Command Buffer DMA Buffer Application B Command Flow Direct3D User Data Flow Soft Queue Kernel A C B A B GPU HARDWARE Command Buffer DMA Buffer Command Flow Data Flow Application C Direct3D User Soft Queue Kernel Hardware Queue Command Buffer DMA Buffer 25 The Programmer s Guide to the APU Galaxy June 2011

TODAY S COMMAND AND DISPATCH FLOW Command Flow Data Flow Application A Direct3D User Soft Queue Kernel Command Buffer DMA Buffer Application B Command Flow Direct3D User Data Flow Soft Queue Kernel A C B A B GPU HARDWARE Command Buffer DMA Buffer Command Flow Data Flow Application C Direct3D User Soft Queue Kernel Hardware Queue Command Buffer DMA Buffer 26 The Programmer s Guide to the APU Galaxy June 2011

FUTURE COMMAND AND DISPATCH FLOW Application C C C C C C Application codes to the hardware User mode queuing Application B Optional Dispatch Buffer Hardware Queue B B B GPU HARDWARE Hardware scheduling Low dispatch times No APIs Application A Hardware Queue A A A No Soft Queues No User s No Kernel Transitions No Overhead! Hardware Queue 27 The Programmer s Guide to the APU Galaxy June 2011

FUTURE COMMAND AND DISPATCH CPU <-> GPU Application / Runtime CPU1 CPU2 GPU 28 The Programmer s Guide to the APU Galaxy June 2011

FUTURE COMMAND AND DISPATCH CPU <-> GPU Application / Runtime CPU1 CPU2 GPU 29 The Programmer s Guide to the APU Galaxy June 2011

FUTURE COMMAND AND DISPATCH CPU <-> GPU Application / Runtime CPU1 CPU2 GPU 30 The Programmer s Guide to the APU Galaxy June 2011

FUTURE COMMAND AND DISPATCH CPU <-> GPU Application / Runtime CPU1 CPU2 GPU 31 The Programmer s Guide to the APU Galaxy June 2011

WHERE ARE WE TAKING YOU? Switch the compute, don t move the data! Every processor now has serial and parallel cores Simple and efficient program model All cores capable, with performance differences Platform Design Goals Easy support of massive data sets Support for task based programming models Solutions for all platforms Open to all 32 The Programmer s Guide to the APU Galaxy June 2011

THE FUTURE OF HETEROGENEOUS COMPUTING IS HERE The architectural path for the future is clear Programming patterns established on Symmetric Multi-Processor (SMP) systems migrate to the heterogeneous world An open architecture, with published specifications and an open source execution software stack Heterogeneous cores working together seamlessly in coherent memory Low latency dispatch No software fault lines 33 The Programmer s Guide to the APU Galaxy June 2011

Disclaimer & Attribution The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes. NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. OpenCL and the OpenCL logo are trademarks of Apple, Inc. and are used by permission by Khronos. 2011 Advanced Micro Devices, Inc. All Rights Reserved. 35 The Programmer s Guide to the APU Galaxy June 2011