RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS

RUNTIME SUPPORT FOR ADAPTIVE SPATIAL PARTITIONING AND INTER-KERNEL COMMUNICATION ON GPUS
Yash Ukidave, Perhaad Mistry, Charu Kalra, Dana Schaa and David Kaeli
Department of Electrical and Computer Engineering, Northeastern University, Boston, USA
Advanced Micro Devices (AMD), USA
SBAC-PAD 2014, Paris, France, 22nd October 2014

WHAT IS THIS TALK ABOUT?
- Improving concurrent kernel execution through adaptive spatial partitioning of compute units on GPUs
- Implementing a pipe-based memory object for inter-kernel communication on GPUs

TOPICS
- Introduction
- Background & Motivation
- GPU workgroup scheduling mechanism
- Adaptive spatial partitioning
- Pipe-based communication channel
- Evaluation methodology and benchmarks
- Performance results
- Conclusion
- Future work

INTRODUCTION
- GPUs have become the most popular accelerator devices in recent years
- Applications from scientific and high performance computing (HPC) domains have reported impressive performance gains
- Modern GPU workloads comprise multiple kernels with varying degrees of parallelism
- Applications demand concurrent execution and flexible scheduling
- Dynamic resource allocation is required for efficient sharing of compute resources
- Kernels should adapt at runtime to allow other kernels to execute concurrently
- Effective inter-kernel communication is needed between concurrent kernels

BACKGROUND & MOTIVATION
Concurrent kernel execution on current GPUs:
- NVIDIA Fermi GPUs support concurrent execution using a left-over policy
- NVIDIA Kepler GPUs use Hyper-Q technology with multiple hardware queues for scheduling multiple kernels
- AMD uses Asynchronous Compute Engine (ACE) units to manage multiple kernels; ACE units allow for interleaved kernel execution
- Concurrency is limited by the number of available CUs and leads to a fixed partition

Our approach:
- We implement adaptive spatial partitioning to dynamically change CU allocation
- Kernels adapt their hardware usage to accommodate other concurrent kernels
- A novel workgroup scheduler and partition handler is implemented
- Our scheme improves GPU utilization and avoids resource starvation caused by smaller kernels

BACKGROUND & MOTIVATION (CONT'D)
Inter-kernel communication on GPUs:
- Stage-based computations are gaining popularity on GPUs (e.g., audio and video processing)
- These applications require inter-kernel communication between stages
- Real-time communication between executing kernels is not supported on current GPUs
- The pipe object was introduced in the OpenCL 2.0 specification
- We have implemented a pipe channel for inter-kernel communication

RELATED WORK
- Gregg et al. examine kernel concurrency using a concatenated-kernel approach [USENIX 2012]
- Tanasic et al. introduce preemption support on GPUs with flexible CU allocation for kernels [ISCA 2014]
- Lustig et al. propose memory design improvements to overlap computation and communication for inter-kernel communication [HPCA 2013]
- Boyer et al. demonstrate dynamic load balancing of computation shared between GPUs [Computing Frontiers 2013]

WORKGROUP SCHEDULING FOR FIXED PARTITION
- The OpenCL sub-devices API creates sub-devices with a fixed number of CUs (a host-side sketch follows below)
- Multiple command queues (CQs) are mapped to different sub-devices
- NDRange computations are launched on the different sub-devices through the CQs
- NDRanges enqueued on one sub-device use only the CUs assigned to that sub-device
- Each sub-device maintains the following information:
  - Number of mapped CQs
  - Number of NDRanges launched on each CQ
  - Number of compute units allocated to the sub-device
[Figure: High-level model of a sub-device]
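
As a concrete illustration (this code is not from the talk), a fixed partition can be created with the standard OpenCL 1.2 sub-device API; the CU counts below are arbitrary examples, and error checks are omitted:

    /* Minimal host-side sketch: fixed spatial partitioning with the
     * standard OpenCL 1.2 sub-device API. Splits one GPU into two
     * sub-devices (24 CUs and 8 CUs) and maps one command queue to
     * each. Error checks omitted for brevity. */
    #include <CL/cl.h>

    void create_fixed_partitions(cl_device_id gpu, cl_command_queue cq[2])
    {
        /* Request two sub-devices with explicit CU counts. */
        const cl_device_partition_property props[] = {
            CL_DEVICE_PARTITION_BY_COUNTS, 24, 8,
            CL_DEVICE_PARTITION_BY_COUNTS_LIST_END, 0
        };
        cl_device_id subdev[2];
        clCreateSubDevices(gpu, props, 2, subdev, NULL);

        /* Context over the sub-devices; one command queue per
         * sub-device. Kernels enqueued on cq[i] use only the CUs
         * assigned to subdev[i]. */
        cl_context ctx = clCreateContext(NULL, 2, subdev, NULL, NULL, NULL);
        cq[0] = clCreateCommandQueue(ctx, subdev[0], 0, NULL);
        cq[1] = clCreateCommandQueue(ctx, subdev[1], 0, NULL);
    }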

ADAPTIVE SPATIAL PARTITIONING
- Fixed partitions can lead to starvation caused by smaller kernels in a multi-kernel application
- Adaptive spatial partitioning is implemented as an extension to fixed partitioning
- Two new properties are added to the OpenCL clCreateSubDevices API (see the sketch below):
  - Fixed property: creates a sub-device with a fixed number of CUs
  - Adaptive property: creates a sub-device whose CU allocation can change dynamically
- The partition handler allocates CUs to an adaptive sub-device based on the size of the executing NDRange
- The adaptive partition handler is invoked when:
  - A new NDRange arrives
  - An active NDRange completes execution
- The adaptive partition handler consists of 3 modules:
  - Dispatcher
  - NDRange scheduler
  - Load balancer
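
The talk does not give the exact identifier for the new property, so the token CL_DEVICE_PARTITION_ADAPTIVE below is a hypothetical stand-in; the sketch only shows how such a request might look alongside the standard API:

    /* Hedged sketch of the proposed extension. The token
     * CL_DEVICE_PARTITION_ADAPTIVE is an assumed name, not part of
     * standard OpenCL; it stands in for the authors' new "adaptive"
     * property. The runtime's partition handler may later grow or
     * shrink the sub-device's CU allocation at NDRange boundaries. */
    #define CL_DEVICE_PARTITION_ADAPTIVE 0x5000 /* assumed token */

    const cl_device_partition_property adaptive_props[] = {
        CL_DEVICE_PARTITION_ADAPTIVE, 0
    };
    cl_device_id adaptive_subdev;
    clCreateSubDevices(gpu, adaptive_props, 1, &adaptive_subdev, NULL);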

ADAPTIVE SPATIAL PARTITIONING
Dispatcher:
- Invoked when an NDRange arrives at or leaves the GPU
- Checks for new NDRanges and dispatches them to the NDRange scheduler
- Checks for completed NDRanges and invokes the load balancer
- Manages the pending NDRanges

ADAPTIVE SPATIAL PARTITIONING
NDRange scheduler:
- Checks the sub-device property of new NDRanges
- Calls the load balancer to manage adaptive NDRanges
- Assigns the requested CUs to fixed-partition NDRanges
- Also manages pending NDRanges

ADAPTIVE SPATIAL PARTITIONING
Load balancer:
- Handles CU assignment for all adaptive NDRanges
- Assigns CUs to adaptive NDRanges in proportion to their sizes, i.e., according to the ratio of NDRange sizes
- Maintains at least 1 CU per active adaptive NDRange
- Calls the workgroup scheduler (a sketch of the ratio-based assignment follows below)
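
The talk specifies the allocation only at this level of detail; a minimal sketch of one way to implement the ratio-based split (function names and the rounding/remainder policy are assumptions) is:

    /* Minimal sketch of ratio-based CU allocation with a 1-CU floor.
     * The rounding and remainder policy is an assumption; the talk
     * only states that CUs are split by the ratio of NDRange sizes
     * with at least one CU per active adaptive NDRange. Assumes
     * total_cus >= num_ndr. */
    #include <stddef.h>

    void assign_cus(const size_t ndr_size[], int cu_share[],
                    int num_ndr, int total_cus)
    {
        size_t total_work = 0;
        for (int i = 0; i < num_ndr; i++)
            total_work += ndr_size[i];

        int assigned = 0;
        for (int i = 0; i < num_ndr; i++) {
            /* Proportional share, floored at 1 CU per NDRange. */
            int share = (int)(total_cus * ndr_size[i] / total_work);
            cu_share[i] = share > 0 ? share : 1;
            assigned += cu_share[i];
        }

        /* Hand any leftover (or excess) CUs to the largest NDRange. */
        int largest = 0;
        for (int i = 1; i < num_ndr; i++)
            if (ndr_size[i] > ndr_size[largest]) largest = i;
        cu_share[largest] += total_cus - assigned;
    }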

WORKING OF ADAPTIVE PARTITIONING
- Two adaptive NDRanges (NDR#0, NDR#1) are mapped to two command queues (CQ#0, CQ#1)
- Initial allocation of CUs is done for NDR#0
- NDR#1 arrives for execution, causing reassignment of resources
- A blocking policy is implemented to allow in-flight workgroups of NDR#0 to complete
- Blocking also prevents more workgroups from NDR#0 from being scheduled on the blocked CUs
- CUs are de-allocated from NDR#0, and CU#21 and CU#22 are mapped to NDR#1

WORKGROUP SCHEDULING MECHANISMS & PARTITION POLICIES
Workgroup scheduling mechanisms (see the sketch below):
1. Occupancy-based scheduling:
   - Maps workgroups to a CU and moves to the next CU when:
     - The maximum workgroup limit for the CU is reached, or
     - The CU has expended all of its compute resources
   - Aims for maximum occupancy on the GPU
2. Latency-based scheduling:
   - Iterates over CUs in round-robin order:
     - Assigns 1 workgroup to each CU per iteration
     - Continues until all workgroups are assigned
   - Minimizes compute latency by utilizing every CU

Partitioning policies:
1. Full-fixed:
   - Each sub-device gets a fixed number of CUs
   - Completely controlled by the user
   - Best when the user is knowledgeable about the device hardware
2. Full-adaptive:
   - Each sub-device has the adaptive property
   - CU assignment is controlled by the runtime
   - Best when the user does not know the device
3. Hybrid:
   - Combination of fixed- and adaptive-property sub-devices
   - Best for performance tuning of applications
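
To make the two mechanisms concrete, here is a hedged sketch of the two workgroup-to-CU mapping loops; the data structures and limits are illustrative assumptions, not the simulator's actual code:

    /* Sketch contrasting the two workgroup schedulers. The structure
     * and limits are assumptions for illustration; the talk defines
     * only the mapping behavior, not the implementation. */
    typedef struct {
        int wg_count;   /* workgroups currently resident on the CU */
        int wg_limit;   /* per-CU workgroup limit (regs/LDS permitting) */
    } cu_state;

    /* Occupancy-based: pack each CU to its limit before moving on.
     * Workgroups left over once all CUs are full remain pending. */
    void schedule_occupancy(cu_state cu[], int num_cus, int total_wgs)
    {
        int c = 0;
        for (int wg = 0; wg < total_wgs && c < num_cus; wg++) {
            while (c < num_cus && cu[c].wg_count == cu[c].wg_limit)
                c++;                  /* CU full: move to the next */
            if (c < num_cus)
                cu[c].wg_count++;     /* place workgroup on CU c */
        }
    }

    /* Latency-based: round-robin, one workgroup per CU per pass,
     * so every CU receives work before any CU gets a second one. */
    void schedule_latency(cu_state cu[], int num_cus, int total_wgs)
    {
        for (int wg = 0, c = 0; wg < total_wgs; wg++, c = (c + 1) % num_cus)
            cu[c].wg_count++;         /* spread workgroups across CUs */
    }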

PIPE-BASED COMMUNICATION CHANNEL
- A pipe is a typed memory object with FIFO functionality
- Data is stored in the form of packets of scalar or vector data types (int, int4, float4, etc.)
- Pipe size is determined by the number of packets and the size of each packet
- Transactions are done using the OpenCL built-in functions write_pipe and read_pipe (see the sketch below)
- A pipe can be accessed by a kernel as a read-only or write-only memory object
- Used for producer-consumer communication patterns between concurrent kernels
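
A minimal sketch of the producer-consumer pattern with OpenCL 2.0 pipes (illustrative code, not the talk's benchmark kernels):

    /* Device code (OpenCL C 2.0): a producer kernel writes int4
     * packets into a pipe and a consumer kernel drains them in
     * FIFO order. */
    kernel void producer(write_only pipe int4 p, global const int4 *in)
    {
        /* write_pipe returns 0 on success and nonzero if the pipe
         * is full; this simple sketch just drops the packet. */
        write_pipe(p, &in[get_global_id(0)]);
    }

    kernel void consumer(read_only pipe int4 p, global int4 *out)
    {
        int4 pkt;
        if (read_pipe(p, &pkt) == 0)   /* 0 == packet read OK */
            out[get_global_id(0)] = pkt;
    }

    /* Host code: pipe capacity = packet size x max packets. */
    cl_int err;
    cl_mem pipe_obj = clCreatePipe(ctx, 0 /* default flags */,
                                   sizeof(cl_int4) /* packet size */,
                                   4096 /* max packets */,
                                   NULL, &err);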

BENCHMARKS USED FOR EVALUATION
Set 1. Adaptive partition evaluation:
1. Matrix Equation Solver (MES): linear solver with 3 kernels
2. Communication Channel Analyzer (COM): emulates 4 communication channels using 4 kernels
3. Big Data Clustering (BDC): big data analysis application with 3 kernels
4. Search Application (SER): distributed search using 2 kernels mapped to 2 CQs

Set 2. Pipe-based communication evaluation:
1. Audio Signal Processing (AUD): two-channel audio processing in 3 stages, with the stages connected using pipes
2. Search-Bin Application (SBN): search benchmark with bin allocation; 2 kernels perform distributed search and 1 kernel performs bin allocation, with the search kernels and bin kernel connected by pipes
3. Texture Mixing (TEX): image application that mixes 3 textures

EVALUATION PLATFORM
- Multi2Sim simulation framework used for evaluation [multi2sim.org]
  - Cycle-level GPU simulator
  - Supports x86 CPU and AMD Southern Islands (SI) GPU simulation
  - Provides OpenCL runtime and driver layer
- Simulator configured to match the AMD SI Radeon 7970 GPU
- GPU scheduler updated for:
  - New workgroup scheduling
  - Adaptive partition handling
- Runtime updated for:
  - Sub-device property support
  - OpenCL pipe support

Device configuration:
  # of CUs: 32
  Frequency: 1 GHz
  # of memory controllers: 6

Compute unit configuration:
  # of wavefront pools / CU: 4
  # of SIMDs / CU: 4
  # of lanes / SIMD: 16
  Vector registers / CU: 64K
  Scalar registers / CU: 2K

Memory configuration:
  Global memory: 1 GB
  Local memory / CU: 64 KB
  L1 cache: 16 KB
  L2 cache: 768 KB

EFFECTS OF ADAPTIVE PARTITIONING
Occupancy-based scheduling:
- Full-fixed limits applications to a fixed set of CUs, leading to poor utilization of the GPU
- Full-adaptive shows an average improvement of 26%; BDC shows a 32% degradation
- Hybrid partitioning provides effective load balancing across fixed and adaptive sub-devices

Latency-based scheduling:
- Best performance with full-fixed and hybrid, which use the CUs of a fixed partition effectively
- Degradation with full-adaptive: the first NDRange occupies the entire GPU, and reassignment is slow
- BDC shows a 96% degradation over full-fixed; the others show a 35% degradation on average

LOAD BALANCING MECHANISM
[Figure: timeline of CU allocation and de-allocation for each kernel]
- Full-adaptive partitioning used, with a sampling interval of 5000 GPU cycles
- NDR 0 arrives first on the GPU and receives a large number of CUs
- CUs are reassigned as other NDRanges arrive
- Steady state is reached once all kernels are executing
- Average CU reassignment latency is 9300 cycles

PERFORMANCE OF PIPE COMMUNICATION
- Set-2 benchmarks evaluated with the pipe object; baseline is single-CQ execution
- Average 2.9x speedup observed with multiple CQs and pipes
- The pipe object overlaps computation and communication
- Sharing data buffers across stages improves cache behavior: a 35% increase in cache hits

CONCLUSION
- We evaluate the benefits of mapping multiple command queues on GPUs
- We consider the effects of different workgroup scheduling policies using multiple command queues
- Applying tighter control over the workgroup-to-CU mapping produces over 3x speedup
- Adaptive partitioning enables effective resource utilization and load balancing
- Developers get a choice of partitioning policies according to their computation's requirements:
  - Use adaptive partitioning when not knowledgeable about the GPU hardware
  - Use fixed or hybrid partitioning for greater control over hardware utilization
- We evaluate the performance and programmability of OpenCL pipes for inter-kernel communication

FUTURE WORK
- Consider spatial affinity between CUs and command queues in the scheduler policy
- Experiment with on-chip and off-chip memory-based pipe objects
- Evaluate pipe object efficiency for HSA-class APUs with shared virtual memory
- Implement priority-based execution for kernels

THANK YOU! QUESTIONS? COMMENTS?
Yash Ukidave
yukidave@ece.neu.edu