Predictive Runtime Code Scheduling for Heterogeneous Architectures

Predictive Runtime Code Scheduling for Heterogeneous Architectures
Víctor Jiménez, Lluís Vilanova, Isaac Gelado, Marisa Gil, Grigori Fursin, Nacho Navarro
HiPEAC 2009, January 26th, 2009

Outline
- Motivation
- Multiversioning for homogeneous architectures
- Heterogeneous architectures (CPU + GPU)
- Predictive code scheduling
- Experiments and results
- Conclusions and future work

Multiversioning for Homogeneous Architectures
[Diagram: the compiler generates several versions of each function (func1, func2, func3); the runtime selects among them.]
- Code multiversioning has proven useful for homogeneous architectures:
  - Dynamic adaptation to system workload changes
  - Best performance across different microarchitectures
  - Reduced search time when evaluating program optimizations

Multiversioning for Heterogeneous Architectures
- Heterogeneous architectures add a new dimension:
  - Processing elements (PEs) with different characteristics
  - Applications behave differently on each PE
  - Different ISAs may coexist
- Opportunity for dynamic code scheduling:
  - Adaptability: makes it possible to optimize system parameters (performance, power, etc.)
  - Requires system profiling information
  - Benefits: adaptive applications and libraries

Current Limitations
- Traditional view of heterogeneous architectures:
  - Accelerators (intensive computation): GPU
  - Control processor (accelerator management): CPU
  - This wastes processing power; dynamic scheduling can benefit from using all the PEs. Example: IBM Roadrunner (~7k Opterons, ~13k Cells)
- GPUs still run a single application at a time:
  - Traditionally used for games and visualization
  - Open to new applications thanks to CUDA
  - Multiprogramming becomes a need

Objectives
- Improve system throughput using dynamic code scheduling on a CPU/GPU system:
  - Fully use all of its computational power (CPU and GPU)
  - Allow multiprogramming on a CPU/GPU architecture
- Study scheduling algorithms for such a system

Dynamic Predictive Scheduling
- Why do we need dynamic predictive scheduling?
  - The best-fitting PE depends on the code, the input, and the system's current workload
- Example (matrix multiplication on a single-CPU/GPU system):
[Plot: matrix multiplication performance on the CPU vs. the GPU]

Scheduler Design
- Framework for a multiprogrammed environment
- User-level shared library: simplified development
[Diagram: several processes call into the scheduling runtime, a user-level shared library sitting on top of the OS and hardware.]

Scheduler Design
- Granularity: function-level
- Function versioning (funccpu, funcgpu): orthogonal to the scheduling problem
- Per-PE task queues (T1, T2, ...) and a history table mapping each function to its observed execution time (f1: t1, f2: t2)
- Explicit data transfer
[Diagram: processes submit tasks to the scheduling runtime, which keeps a task queue per PE (CPU, GPU) and a history table, on top of the OS and hardware.]
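
A minimal sketch of the bookkeeping these components imply (the names and layout are illustrative assumptions, not the actual API from the talk):

    // Illustrative sketch of the runtime's data structures (assumed names).
    #include <deque>
    #include <map>
    #include <string>

    enum PE { CPU = 0, GPU = 1, NUM_PES = 2 };

    struct Task {
        std::string func;  // function name, e.g. "matmul"
        void*       args;  // opaque argument record
    };

    // FIFO of tasks waiting to run on one PE.
    struct TaskQueue {
        std::deque<Task> pending;
    };

    // Per-PE history: last observed execution time (seconds) per function.
    using HistoryTable = std::map<std::string, double>;

    struct Scheduler {
        TaskQueue    queue[NUM_PES];
        HistoryTable history[NUM_PES];
    };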

Usage Example
1. Obtain both the CPU and the GPU versions from the current implementation (versioning via CUDA, MCUDA):

    // original code
    void matmul(float* A, float* B, float* C, int N);

    // generated versions
    void matmul_cpu(float* A, float* B, float* C, int N);
    void matmul_gpu(float* A, float* B, float* C, int N);

2. Replace calls to the original function (matmul(A, B, C, N)) with a request to the scheduler, plus a compiler-generated wrapper:

    // resulting code
    CallScheduler* cs = CallScheduler::getInstance();
    MatmulFunc* f = new MatmulFunc;
    MatmulArgs* a = new MatmulArgs(A, B, C, N);
    cs->schedule(f, a);

    // compiler-generated wrapper
    class MatmulFunc : public Func {
        void run(PE::Type petype, Args* args) {
            MatmulArgs* a = (MatmulArgs*) args;
            switch (petype) {
            case PE::CPU: matmul_cpu(a->A, a->B, a->C, a->N); break;
            case PE::GPU: matmul_gpu(a->A, a->B, a->C, a->N); break;
            }
        }
    };

Scheduling Algorithms
1. First-free algorithms: select the first free PE in the system
2. Predictive algorithms:
   - CPU/GPU performance prediction
   - Waiting-time prediction (load balancing)

First-free Algorithms
1. First-free FIFO GPU (ff-fifo-gpu): select the first free PE; if all are busy, select the GPU
2. First-free FIFO round-robin (ff-fifo-rr, 1:k): select one PE k times for each time the other is selected
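
The two policies above might look roughly like the following sketch, built on the structures sketched earlier. It is a hedged illustration: the slide does not say which PE receives the k selections, so giving them to the GPU is an assumption.

    // ff-fifo-gpu: take the first idle PE; if both are busy, default to the GPU.
    PE ffFifoGpu(bool cpuFree, bool gpuFree) {
        if (gpuFree) return GPU;
        if (cpuFree) return CPU;
        return GPU;  // all busy
    }

    // ff-fifo-rr (1:k): when both PEs are busy, pick one PE k times for each
    // time the other is picked (here the GPU gets the k picks, by assumption).
    PE ffFifoRoundRobin(bool cpuFree, bool gpuFree, int k) {
        static int counter = 0;
        if (gpuFree) return GPU;
        if (cpuFree) return CPU;
        return (counter++ % (k + 1) < k) ? GPU : CPU;
    }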

Predictive Algorithms
History GPU (history-gpu):
- Maintain a performance history table for every <function, PE> pair
- Compute the set of PEs where the task is allowed to execute:
  allowed_PE = { pe_i | ∀ pe_j : time_{f,pe_i} ≤ k · time_{f,pe_j} }
- Select the first free PE from that set; if all are busy, select the GPU
- Avoids choosing a poor <function, PE> combination
[Diagram: one history table per PE (CPU, GPU), mapping each function (matmul, fft) to its observed execution time.]
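
Continuing the earlier sketch, the allowed-PE set could be computed like this (an illustration of the rule above, not the authors' code; k is the tunable slack factor):

    #include <vector>

    // A PE is allowed for function f if its recorded time is within a factor
    // k of every other PE's recorded time for f.
    std::vector<PE> allowedPEs(const Scheduler& s, const std::string& f, double k) {
        std::vector<PE> allowed;
        for (int i = 0; i < NUM_PES; ++i) {
            bool ok = true;
            for (int j = 0; j < NUM_PES; ++j)
                if (s.history[i].at(f) > k * s.history[j].at(f))
                    ok = false;  // pe_i is too slow relative to pe_j
            if (ok) allowed.push_back(static_cast<PE>(i));
        }
        return allowed;
    }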

Predictive Algorithms
Estimate history (estimate-hist):
1. Maintain a performance history table for every <function, PE> pair
2. Predict the waiting time for a new task in each PE's queue
3. Assign the task to the queue with the minimum waiting time
- Tries to reduce load imbalance
[Diagram: per-PE history tables and task queues (CPU queue: T1, T2; GPU queue: T1, T2, T3).]
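
A sketch of the waiting-time estimate, again on the assumed structures above: the predicted wait of a queue is the sum of the predicted times of its pending tasks, plus the new task itself.

    // estimate-hist: send the task to the PE whose queue has the smallest
    // predicted completion time for it.
    PE estimateHist(const Scheduler& s, const std::string& f) {
        double best = 1e300;  // effectively +infinity
        PE choice = GPU;
        for (int i = 0; i < NUM_PES; ++i) {
            double wait = 0.0;
            for (const Task& t : s.queue[i].pending)
                wait += s.history[i].at(t.func);  // predicted time of queued work
            wait += s.history[i].at(f);           // plus the new task itself
            if (wait < best) { best = wait; choice = static_cast<PE>(i); }
        }
        return choice;
    }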

Methodology
Benchmarks:
1. Parboil Benchmark Suite (UIUC): cp (VMD), sad (H.264)
2. FTDock (protein alignment; uses the Fast Fourier Transform)
3. Matrix multiplication
Benchmark short names: cp, sad, ftdock, matmul
System:
- Intel Core 2 Duo @ 2.4 GHz
- NVIDIA GeForce 8600 GTS / 8800 GTX (PCIe x16)
- Red Hat Enterprise Linux 5 (kernel 2.6.18)

Methodology
- Concurrent execution of benchmark combinations (NMP: multiprogramming level)
  - Example: mmfs (NMP = 4)
  - A representative subset of the whole combination set
- Measures:
  - Execution time (wall-clock time; speedup over the GPU in batch execution mode)
  - Traces (scheduler overhead)

CPU-GPU Multiprogrammed Performance
- Benchmarks typically perform better on the GPU
- Some combinations perform similarly on the CPU and the GPU

Performance Speedup for NMP = 4
- First-free algorithms show inconsistent performance
- Predictive algorithms achieve up to 20-30% speedup over batch-mode execution on the GPU
- Baseline: GPU batch execution

Performance Speedup for NMP = 6
- Predictive algorithms increase the speedup up to 30-40%
- estimate-hist does not outperform history-gpu
- Baseline: GPU batch execution

PE Queue Usage
- PE usage increases with more concurrent tasks
[Plot: number of tasks waiting in the PEs' queues upon the arrival of a new task.]

Task Distribution
- Predictive algorithms achieve a better CPU/GPU balance
[Plot: distribution of task executions between the two PEs (CPU vs. GPU).]

Conclusions
- CPU/GPU multiprogrammed environment: multiversioning + dynamic scheduling
- Dynamic scheduling improves system throughput:
  - Up to 40% speedup over batch execution on the GPU
  - Throughput increases as NMP does
  - Predictive algorithms outperform first-free ones
- Scheduler overhead is less than 0.1%

Future Work
- Analyze the influence of data inputs on scheduling and performance
- Add data-input characteristics and run-time performance information (HW counters) to train the predictive model
- Collective profiling from multiple users
- Increase prediction accuracy
- Integrate the predictive scheduler with the Linux kernel for greater flexibility

Predictive Runtime Code Scheduling for Heterogeneous Architectures
Víctor Jiménez, Lluís Vilanova, Isaac Gelado, Marisa Gil, Grigori Fursin, Nacho Navarro
HiPEAC 2009, January 26th, 2009

Backup Slides

Heterogeneous Architectures: CPU + NVIDIA G80
- Processing elements with different purposes and/or characteristics
- NVIDIA G80 (CUDA):
  - General-purpose GPU
  - Up to 128 processors
  - Programming model: kernel calling, explicit DMA transfers
  - Widely available

CUDA Execution Model
- Grid-like execution: a grid is made of blocks, and a block is made of threads
- Concurrent execution of thousands of threads
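
For context, a minimal CUDA sketch of this grid/block/thread hierarchy (illustrative, not code from the talk):

    // Each of the n threads scales one array element.
    __global__ void scale(float* x, float a, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
        if (i < n) x[i] *= a;
    }

    // Host side: a grid with enough 256-thread blocks to cover n elements.
    // scale<<<(n + 255) / 256, 256>>>(d_x, a, n);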

Prediction Accuracy
- As the number of concurrent processes increases, accuracy decreases (especially for the CPU); estimate-hist suffers from this loss of accuracy
- Possible improvements:
  - Keep some cores out of the scheduling system and give them to other parts of the application and/or other processes
  - Use prediction accuracy in the scheduling decision