Automatic Intra-Application Load Balancing for Heterogeneous Systems


Automatic Intra-Application Load Balancing for Heterogeneous Systems
Michael Boyer, Shuai Che, and Kevin Skadron (Department of Computer Science, University of Virginia); Jayanth Gummaraju and Nuwan Jayasena (AMD)

Motivation

Many applications fail to harness all of the available computational power. [Figure: a kernel runs on a single GPU while the CPU cores and a second GPU sit idle.]

(Presented June 14, 2011)

Leveraging Multiple Devices

Writing correct multi-device applications is challenging; writing efficient multi-device applications is even harder. The optimal division of labor depends on:
- Hardware characteristics
- Input data set
- Behavior of other applications
- Metric being optimized

Efficiently orchestrating the data movement and kernel invocations is also complicated:
- Set appropriate flags based on hardware and software characteristics
- Use multiple buffers to overlap operations

Goal

Automatically load balance a single-device application across multiple (heterogeneous) devices.

Outline

- Motivation
- Our approach: dynamic chunking
- Load-balancing framework
- Preliminary results
- Conclusions and future work

Our Approach: Dynamic Chunking

- Break the kernel into multiple chunks (key distinction: # chunks > # devices)
- Dynamically schedule chunks to devices
- Base scheduling decisions and chunk sizes on online profiling

Chunking

A chunk is a contiguous set of work groups. The division can be along any dimension, and different chunks can be different sizes. [Figure: an original NDRange divided into chunks of varying size.]
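As a hypothetical illustration of this idea (not the framework's actual code), each chunk can be represented as a (global_work_offset, global_work_size) pair, sized in whole work groups, of the kind that could be handed to clEnqueueNDRangeKernel:

```python
# Sketch: split a 2-D NDRange into chunks of contiguous work groups
# along one dimension, with chunks of different sizes.
def make_chunks(global_size, local_size, dim, sizes):
    """Split an NDRange into chunks of sizes[i] work groups each,
    taken contiguously along dimension dim."""
    num_groups = global_size[dim] // local_size[dim]
    assert sum(sizes) == num_groups, "chunks must cover the whole NDRange"
    chunks, start = [], 0
    for n in sizes:
        offset = [0] * len(global_size)
        offset[dim] = start * local_size[dim]  # where this chunk begins
        gsize = list(global_size)
        gsize[dim] = n * local_size[dim]       # how far this chunk extends
        chunks.append((tuple(offset), tuple(gsize)))
        start += n
    return chunks

# A 1024x1024 range with 16x16 work groups, split into three unequal
# chunks (16, 16, and 32 groups wide) along dimension 1.
print(make_chunks((1024, 1024), (16, 16), 1, [16, 16, 32]))
```

Each resulting pair describes a rectangular sub-range that is a whole number of work groups, which is what lets chunks be dispatched independently.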

Chunking: Kernel + Data

Need to ensure that:
- The input data consumed by each chunk is copied to device memory
- The output data produced by each chunk is copied back to host memory

[Figure: the application's kernel and data are split into chunks 1-3, which are distributed across the GPU and CPU.]

Chunking: Advantages & Challenges

Advantages:
- No training required, even as hardware changes
- Can respond to dynamic performance changes due to data-dependent behavior or contention from other applications
- Can overlap kernel execution and data transfer

Challenges:
- Managing contention: contention from other applications; the CPU is both host and compute device; memory contention between CPU and GPU operations
- Dispatch overhead
- Determining which data is accessed by each chunk (difficult data structures / access patterns)

Chunking: Overlapping Operations

[Figure: without chunking, the entire data transfer completes before computation begins; with chunking, the transfer of chunks 2-4 overlaps the computation of chunks 1-3.]
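The benefit of overlap can be sketched with a simple timing model (an illustrative assumption, not measured data): with independent copy and compute engines, only the first chunk's transfer is fully exposed, and every later step costs the larger of the two per-chunk times.

```python
# Sketch: total time for a kernel whose data transfer and computation
# can be pipelined across num_chunks chunks of equal size.
def total_time(transfer, compute, num_chunks):
    t = transfer / num_chunks   # per-chunk transfer time
    c = compute / num_chunks    # per-chunk compute time
    # First transfer cannot be hidden; each middle step runs a transfer
    # and a computation concurrently, costing max(t, c); the last chunk's
    # computation has nothing left to overlap with.
    return t + (num_chunks - 1) * max(t, c) + c

print(total_time(40.0, 60.0, 1))  # no chunking: fully serialized
print(total_time(40.0, 60.0, 4))  # chunking hides most of the transfer
```

With a 40 ms transfer and 60 ms of computation, serializing costs 100 ms while four chunks bring it down to 70 ms, which is the intuition behind the "Chunking" timeline in the figure.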

Load-Balancing Framework

Framework Overview

[Figure: in the application's view, it makes OpenCL API calls directly to the OpenCL runtime; in the system view, the load-balancing framework is interposed between the application and the OpenCL runtime.]

Framework Requirements

1. Intercept OpenCL API calls
2. Determine what data to transfer to each device
3. Orchestrate and balance execution across the devices

Framework Components

1. API intercept layer: intercepts and transforms calls from the application to the OpenCL runtime
2. Access pattern extractor: analyzes kernel source code to extract data access patterns
3. Chunk scheduler: breaks kernels into chunks and schedules them onto devices

API Intercept Layer

Intercepts each OpenCL API call and either replicates it across multiple devices or transforms it. For example, where the application sees a single GPU, the system view spans two GPUs and a CPU:

- clCreateCommandQueue: one queue in the application view becomes one queue per device in the system view
- clCreateBuffer: one buffer in the application view becomes one buffer per device in the system view

Access Pattern Extractor

We need a mechanism for sending the right data to each device. Given the kernel source, the extractor determines:
- The mapping from chunk to memory region
- The preferred chunking direction

The kernel source code is available by intercepting the call to the OpenCL compiler. The output is a function that returns the region of memory accessed by a given chunk; it is callable by the chunk scheduler and works for arbitrary chunk sizes.

Access Pattern Extractor: Details

Built on Clang (LLVM's front end). Clang's OpenCL support is under active development; for now, we add an implicit header to define built-in data types and functions.

Basic idea: traverse the kernel AST to
- Identify accesses to memory buffers
- Express buffer offsets in terms of values that can be reasoned about at kernel invocation time
- Determine the relationship between accesses from different work items

Access Pattern Extractor: Example

    __kernel void blackScholes(const __global float4 *randArray,
                               __global float4 *put, int width)
    {
        size_t xPos = get_global_id(0);
        size_t yPos = get_global_id(1);
        float4 inRand = randArray[yPos * width + xPos];              // read access
        /* ... Black-Scholes computation elided ... */
        put[yPos * width + xPos] = KexpMinusRT * phiD2 - S * phiD1;  // write access
    }

Access Pattern Extractor: Input vs. Output Buffers

Input buffers can afford to be imprecise: determine the minimum and maximum offset and transfer the entire range. Output buffers need to be precise: determine the parameters of the strided access. [Figure: for the same buffer access pattern, the input transfer covers the whole touched range while the output transfer covers only the written elements.]
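For the buf[y * width + x] pattern from the example kernel, the two policies can be sketched as follows (a hypothetical illustration for a chunk that covers a contiguous band of rows; the framework's actual representation is not shown in the slides):

```python
# Sketch: byte ranges touched by a chunk covering rows [row0, row0+rows)
# of an access of the form buf[y * width + x].

def input_range(row0, rows, width, elem_size):
    # Imprecise but safe for inputs: transfer everything between the
    # minimum and maximum offset the chunk can touch.
    lo = (row0 * width) * elem_size
    hi = ((row0 + rows - 1) * width + (width - 1) + 1) * elem_size
    return lo, hi

def output_stride(row0, rows, width, elem_size):
    # Precise for outputs: describe the writes as a strided pattern
    # (start, stride, count, length) so only written bytes are copied back.
    return (row0 * width * elem_size,  # start offset in bytes
            width * elem_size,         # stride between consecutive rows
            rows,                      # number of rows written
            width * elem_size)         # bytes written per row

# Rows 4-5 of an 8-wide float buffer (4-byte elements).
print(input_range(4, 2, 8, 4))
print(output_stride(4, 2, 8, 4))
```

For this row-contiguous access the two descriptions cover the same bytes; the strided form matters when a chunk writes only part of each row, where the min/max range would copy back stale data between the written runs.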

Scheduler

Breaks the kernel into chunks. For each chunk:
- Sends input data to the device
- Launches the kernel on the device
- Copies output data back to the host

The mapping of chunks to devices is determined dynamically, based on online profiling data.

Dynamic Scheduling

- If the number of work groups is small, skip chunking and send the whole kernel to one device
- Otherwise, send initial chunks to each device:
  - Initial chunk size set to exactly fill each device
  - Maintain two outstanding chunks per device to hide dispatch overheads
  - Exponentially increase chunk sizes for fast devices until the slow device has completed a chunk
- Once performance data is available for all devices:
  - Distribute a portion of the remaining work to all devices based on relative performance
  - Maintain aggregate history information, but decay it exponentially
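The profiling-guided step can be sketched like this (a hypothetical illustration: the decay factor and the rounding policy are assumptions, as the slides do not give the framework's actual values):

```python
# Sketch: distribute remaining work groups in proportion to each
# device's observed rate, with exponentially decayed history.
DECAY = 0.5  # assumed decay factor; recent measurements dominate

class DeviceProfile:
    def __init__(self):
        self.rate = 0.0  # work groups per second

    def record(self, groups, seconds):
        new_rate = groups / seconds
        # First measurement seeds the history; later ones are blended in
        # so that old behavior decays exponentially.
        self.rate = new_rate if self.rate == 0.0 else (
            DECAY * self.rate + (1.0 - DECAY) * new_rate)

def distribute(remaining, profiles):
    """Split `remaining` work groups across devices by relative rate."""
    total = sum(p.rate for p in profiles)
    return [round(remaining * p.rate / total) for p in profiles]

gpu, cpu = DeviceProfile(), DeviceProfile()
gpu.record(300, 1.0)  # GPU finished 300 work groups in 1 s
cpu.record(100, 1.0)  # CPU finished 100 work groups in 1 s
print(distribute(80, [gpu, cpu]))
```

With the GPU measured at three times the CPU's rate, 80 remaining work groups split 60/20, and subsequent record() calls keep shifting that ratio as contention or data-dependent behavior changes device performance.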

Preliminary Results

Experimental Systems

1. CPU + GPU: AMD Radeon HD 5870 (Cypress); Intel Core i7 920 (quad-core, hyper-threaded, 2.67 GHz)
2. Homogeneous multi-GPU: 2x AMD Radeon HD 5870
3. Heterogeneous multi-GPU: AMD Radeon HD 5870; AMD Radeon HD 6570 (Turks)

Synthetic Benchmark: Computation-to-Communication Ratio

The ratio of kernel execution time to data transfer time can be controlled arbitrarily. [Figure: example workloads with 10/90, 50/50, and 90/10 splits between data transfer and kernel execution.]

Synthetic Benchmark Results

[Figure: speedup (y-axis up to 2.5) for the multi-GPU and CPU-GPU configurations as the workload varies from data-transfer-dominated (10% kernel execution) to kernel-execution-dominated (100% kernel execution).]

Sample Applications

- From the Rodinia benchmark suite: SRAD
- From the AMD APP SDK: Mandelbrot, Black-Scholes

Results: CPU + GPU, SRAD

[Figure: speedup (y-axis from 0.8 to 1.3) versus data set size (16, 36, and 64 MB).]

Results: Heterogeneous Multi-GPU, Mandelbrot

[Figure: speedup (y-axis from 0.8 to 1.3) versus data set size (16, 32, 64, and 128 MB).]

Results: Homogeneous Multi-GPU, Black-Scholes*

[Figure: speedup (y-axis from 1.0 to 1.8) versus data set size (48, 96, and 192 MB).]

Previous Work: Qilin

Divides a CUDA kernel across a CPU and GPU. Limitations:
- Requires manual creation of CPU and GPU versions of the kernel
- Requires a training phase
- Scheduling is static
- Only works on NVIDIA GPUs

Reference: C.-K. Luk, S. Hong, and H. Kim, "Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping," in MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, 2009.

Previous Work: Single Compute Device Image

Divides an OpenCL kernel across multiple GPUs. Limitations:
- Naive scheduling: each device gets an equal amount of work
- Only works on NVIDIA GPUs

Reference: J. Kim, H. Kim, J. H. Lee, and J. Lee, "Achieving a single compute device image in OpenCL for multiple GPUs," in Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, 2011.

Challenges and Future Work

- Tune the framework for different hardware and software configurations
- Optimize for different metrics
- One version of the kernel for multiple devices: optimizations for the GPU may hurt performance on the CPU and vice versa; supporting device-specific kernels would allow a tradeoff between programmer effort and performance
- Multiple kernel calls: need to understand data flow between kernel calls
- It is possible (but rare) for work groups to communicate with each other using atomic instructions; this is difficult to support efficiently across multiple devices
- Deployment possibilities: an OpenCL layer targeting standalone applications, or a layer in the Fusion software stack

Conclusions

- Heterogeneous multi-device systems (like Fusion) are becoming ubiquitous, but effectively utilizing the available devices is difficult
- Our framework automatically load balances unmodified OpenCL applications across multiple (possibly heterogeneous) devices
- It extracts access patterns to determine the mapping of work groups to data, and uses online profiling to guide scheduling decisions
- Preliminary performance results are encouraging

Questions

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners. The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD's positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.