THE PROGRAMMER S GUIDE TO THE APU GALAXY. Phil Rogers, Corporate Fellow AMD

THE PROGRAMMER S GUIDE TO THE APU GALAXY Phil Rogers, Corporate Fellow AMD

THE OPPORTUNITY WE ARE SEIZING Make the unprecedented processing capability of the APU as accessible to programmers as the CPU is today. 2 The Programmer s Guide to the APU Galaxy June 2011

OUTLINE The APU today and its programming environment The future of the heterogeneous platform AMD Fusion System Architecture Roadmap Software evolution A visual view of the new command and data flow 3 The Programmer s Guide to the APU Galaxy June 2011

APU: ACCELERATED PROCESSING UNIT The APU has arrived and it is a great advance over previous platforms Combines scalar processing on CPU with parallel processing on the GPU and high bandwidth access to memory How do we make it even better going forward? Easier to program Easier to optimize Easier to load balance Higher performance Lower power 4 The Programmer s Guide to the APU Galaxy June 2011

LOW POWER E-SERIES AMD FUSION APU: ZACATE E-Series APU 2 x86 Bobcat CPU cores Array of Radeon Cores Discrete-class DirectX 11 performance 80 Stream Processors 3rd Generation Unified Video Decoder PCIe Gen2 Single-channel DDR3 @ 1066 18W TDP Performance: Up to 8.5GB/s System Memory Bandwidth Up to 90 Gflop of Single Precision Compute 5 The Programmer s Guide to the APU Galaxy June 2011

TABLET Z-SERIES AMD FUSION APU: DESNA Z-Series APU 2 x86 Bobcat CPU cores Array of Radeon Cores Discrete-class DirectX 11 performance 80 Stream Processors 3rd Generation Unified Video Decoder PCIe Gen2 Single-channel DDR3 @ 1066 6W TDP w/ Local Hardware Thermal Control Performance: Up to 8.5GB/s System Memory Bandwidth Suitable for sealed, passively cooled designs 6 The Programmer s Guide to the APU Galaxy June 2011

MAINSTREAM A-SERIES AMD FUSION APU: LLANO A-Series APU Up to four x86 CPU cores AMD Turbo CORE frequency acceleration Array of AMD Radeon Cores Discrete-class DirectX 11 performance 3rd Generation Unified Video Decoder Blu-ray 3D stereoscopic display PCIe Gen2 Dual-channel DDR3 45W TDP Performance: Up to 29GB/s System Memory Bandwidth Up to 500 Gflops of Single Precision Compute 7 The Programmer s Guide to the APU Galaxy June 2011

COMMITTED TO OPEN STANDARDS AMD drives open and de-facto standards Compete on the best implementation Open standards are the basis for large ecosystems Open standards always win over time DirectX SW developers want their applications to run on multiple platforms from multiple hardware vendors 8 The Programmer s Guide to the APU Galaxy June 2011

Single-thread Performance Throughput Performance rn Application Performance A NEW ERA OF PROCESSOR PERFORMANCE Single-Core Era Multi-Core Era Heterogeneous Systems Era Enabled by: Moore s Law Voltage Scaling Constrained by: Power Complexity Enabled by: Moore s Law SMP architecture Constrained by: Power Parallel SW Scalability Enabled by: Abundant data parallelism Power efficient GPUs Temporarily Constrained by: Programming models Comm.overhead Assembly C/C++ Java pthreads OpenMP / TBB Shader CUDA OpenCL!!!? we are here we are here we are here Time Time (# of processors) Time (Data-parallel exploitation) 9 The Programmer s Guide to the APU Galaxy June 2011

Architecture Maturity & Programmer Accessibility Poor Excellent EVOLUTION OF HETEROGENEOUS COMPUTING Standards s Era Architected Era AMD Fusion System Architecture GPU Peer Processor Proprietary s Era Graphics & Proprietary -based APIs Adventurous programmers Exploit early programmable shader cores in the GPU Make your program look like graphics to the GPU CUDA, Brook+, etc OpenCL, DirectCompute -based APIs Expert programmers C and C++ subsets Compute centric APIs, data types Multiple address spaces with explicit data movement Specialized work queue based structures Kernel mode dispatch Mainstream programmers Full C++ GPU as a co-processor Unified coherent address space Task parallel runtimes Nested Data Parallel programs User mode dispatch Pre-emption and context switching See Herb Sutter s Keynote tomorrow for a cool example of plans for the architected era! 2002-2008 2009-2011 2012-2020 10 The Programmer s Guide to the APU Galaxy June 2011

FSA FEATURE ROADMAP Physical Integration Optimized Platforms Architectural Integration System Integration Integrate CPU & GPU in silicon GPU Compute C++ support Unified Address Space for CPU and GPU GPU compute context switch Unified Memory Controller User mode schedulng GPU uses pageable system memory via CPU pointers GPU graphics pre-emption Quality of Service Common Manufacturing Technology Bi-Directional Power Mgmt between CPU and GPU Fully coherent memory between CPU & GPU Extend to Discrete GPU 11 The Programmer s Guide to the APU Galaxy June 2011

AMD FUSION SYSTEM ARCHITECTURE AN OPEN PLATFORM Open Architecture, published specifications FSAIL virtual ISA FSA memory model FSA dispatch ISA agnostic for both CPU and GPU Inviting partners to join us, in all areas Hardware companies Operating Systems Tools and Middleware Applications FSA review committee planned 12 The Programmer s Guide to the APU Galaxy June 2011

FSA INTERMEDIATE LAYER - FSAIL FSAIL is a virtual ISA for parallel programs Finalized to ISA by a JIT compiler or Finalizer Explicitly parallel Designed for data parallel programming Support for exceptions, virtual functions, and other high level language features Syscall methods GPU code can call directly to system services, IO, printf etc Debugging support 13 The Programmer s Guide to the APU Galaxy June 2011

FSA MEMORY MODEL Designed to be compatible with C++0x, Java and.net Memory ls Relaxed consistency memory model for parallel compute performance Loads and stores can be re-ordered by the finalizer Visibility controlled by: Load.Acquire, Store.Release Fences Barriers 14 The Programmer s Guide to the APU Galaxy June 2011

Stack FSA Software Stack Apps AppsApps Apps AppsApps Apps Apps Apps Apps Apps Apps Domain Libraries FSA Domain Libraries OpenCL 1.x, DX Runtimes, User s Graphics Kernel FSA JIT Task Queuing Libraries FSA Runtime FSA Kernel Hardware - APUs, CPUs, GPUs AMD user mode component AMD kernel mode component All others contributed by third parties or AMD 15 The Programmer s Guide to the APU Galaxy June 2011

OPENCL AND FSA FSA is an optimized platform architecture for OpenCL Not an alternative to OpenCL OpenCL on FSA will benefit from Avoidance of wasteful copies Low latency dispatch Improved memory model Pointers shared between CPU and GPU FSA also exposes a lower level programming interface, for those that want the ultimate in control and performance Optimized libraries may choose the lower level interface 16 The Programmer s Guide to the APU Galaxy June 2011

TASK QUEUING RUNTIMES Popular pattern for task and data parallel programming on SMP systems today Characterized by: A work queue per core Runtime library that divides large loops into tasks and distributes to queues A work stealing runtime that keeps the system balanced FSA is designed to extend this pattern to run on heterogeneous systems 17 The Programmer s Guide to the APU Galaxy June 2011

TASK QUEUING RUNTIME ON CPUS Work Stealing Runtime Q Q Q Q CPU Worker CPU Worker CPU Worker CPU Worker X86 CPU X86 CPU X86 CPU X86 CPU CPU Threads GPU Threads Memory 18 The Programmer s Guide to the APU Galaxy June 2011

TASK QUEUING RUNTIME ON THE FSA PLATFORM Work Stealing Runtime Q Q Q Q Q CPU Worker CPU Worker CPU Worker CPU Worker GPU Manager X86 CPU X86 CPU X86 CPU X86 CPU Radeon GPU CPU Threads GPU Threads Memory 19 The Programmer s Guide to the APU Galaxy June 2011

TASK QUEUING RUNTIME ON THE FSA PLATFORM Work Stealing Runtime Q Q Q Q Q CPU Worker CPU Worker CPU Worker CPU Worker GPU Manager Memory X86 CPU X86 CPU X86 CPU X86 CPU Fetch and Dispatch CPU Threads GPU Threads Memory S I M D S I M D S I M D S I M D S I M D 20 The Programmer s Guide to the APU Galaxy June 2011

FSA SOFTWARE EXAMPLE - REDUCTION float foo(float); float myarray[ ]; Task<float, ReductionBin> sum([myarray](indexrange<1> index) [[device]] { float sum = 0.; for (size_t I = index.begin(); I!= index.end(); i++) { sum += foo(myarray[i]); } return float; }); float sum = task.enqueuewithreduce(partition<1, Auto>(1920), [] (int x, int y) [[device]] { return x+y; }, 0.); 21 The Programmer s Guide to the APU Galaxy June 2011

HETEROGENEOUS COMPUTE DISPATCH How compute dispatch operates today in the driver model How compute dispatch improves tomorrow under FSA 22 The Programmer s Guide to the APU Galaxy June 2011

TODAY S COMMAND AND DISPATCH FLOW Command Flow Data Flow Application A Direct3D User Soft Queue Kernel Command Buffer DMA Buffer A GPU HARDWARE Hardware Queue 23 The Programmer s Guide to the APU Galaxy June 2011

TODAY S COMMAND AND DISPATCH FLOW Command Flow Data Flow Application A Direct3D User Soft Queue Kernel Command Buffer DMA Buffer Command Flow Data Flow Application B Direct3D User Soft Queue Kernel A GPU HARDWARE Command Buffer DMA Buffer Command Flow Data Flow Application C Direct3D User Soft Queue Kernel Hardware Queue Command Buffer DMA Buffer 24 The Programmer s Guide to the APU Galaxy June 2011

TODAY S COMMAND AND DISPATCH FLOW Command Flow Data Flow Application A Direct3D User Soft Queue Kernel Command Buffer DMA Buffer Application B Command Flow Direct3D User Data Flow Soft Queue Kernel A C B A B GPU HARDWARE Command Buffer DMA Buffer Command Flow Data Flow Application C Direct3D User Soft Queue Kernel Hardware Queue Command Buffer DMA Buffer 25 The Programmer s Guide to the APU Galaxy June 2011

TODAY S COMMAND AND DISPATCH FLOW Command Flow Data Flow Application A Direct3D User Soft Queue Kernel Command Buffer DMA Buffer Application B Command Flow Direct3D User Data Flow Soft Queue Kernel A C B A B GPU HARDWARE Command Buffer DMA Buffer Command Flow Data Flow Application C Direct3D User Soft Queue Kernel Hardware Queue Command Buffer DMA Buffer 26 The Programmer s Guide to the APU Galaxy June 2011

FUTURE COMMAND AND DISPATCH FLOW Application C C C C C C Application codes to the hardware User mode queuing Application B Optional Dispatch Buffer Hardware Queue B B B GPU HARDWARE Hardware scheduling Low dispatch times No APIs Application A Hardware Queue A A A No Soft Queues No User s No Kernel Transitions No Overhead! Hardware Queue 27 The Programmer s Guide to the APU Galaxy June 2011

FUTURE COMMAND AND DISPATCH CPU <-> GPU Application / Runtime CPU1 CPU2 GPU 28 The Programmer s Guide to the APU Galaxy June 2011

FUTURE COMMAND AND DISPATCH CPU <-> GPU Application / Runtime CPU1 CPU2 GPU 29 The Programmer s Guide to the APU Galaxy June 2011

FUTURE COMMAND AND DISPATCH CPU <-> GPU Application / Runtime CPU1 CPU2 GPU 30 The Programmer s Guide to the APU Galaxy June 2011

FUTURE COMMAND AND DISPATCH CPU <-> GPU Application / Runtime CPU1 CPU2 GPU 31 The Programmer s Guide to the APU Galaxy June 2011

WHERE ARE WE TAKING YOU? Switch the compute, don t move the data! Every processor now has serial and parallel cores Simple and efficient program model All cores capable, with performance differences Platform Design Goals Easy support of massive data sets Support for task based programming models Solutions for all platforms Open to all 32 The Programmer s Guide to the APU Galaxy June 2011

THE FUTURE OF HETEROGENEOUS COMPUTING IS HERE The architectural path for the future is clear Programming patterns established on Symmetric Multi-Processor (SMP) systems migrate to the heterogeneous world An open architecture, with published specifications and an open source execution software stack Heterogeneous cores working together seamlessly in coherent memory Low latency dispatch No software fault lines 33 The Programmer s Guide to the APU Galaxy June 2011

Disclaimer & Attribution The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes. NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. OpenCL and the OpenCL logo are trademarks of Apple, Inc. and are used by permission by Khronos. 2011 Advanced Micro Devices, Inc. All Rights Reserved. 35 The Programmer s Guide to the APU Galaxy June 2011