HETEROGENEOUS CPU+GPU COMPUTING


HETEROGENEOUS CPU+GPU COMPUTING
Ana Lucia Varbanescu, University of Amsterdam, a.l.varbanescu@uva.nl
Significant contributions by: Stijn Heldens (U Twente) and Jie Shen (NUDT, China).

Heterogeneous platforms
Systems combining main processors and accelerators, e.g., CPU + GPU, CPU + Intel MIC, AMD APU, ARM SoC.
Found everywhere, from supercomputers to mobile devices.

Heterogeneous platforms
Host-accelerator hardware model: a host (CPUs) connected to one or more accelerators (GPUs, MICs, FPGAs) via PCIe or shared memory.

Our focus today
A heterogeneous platform = CPU + GPU (most solutions also work for other/multiple accelerators).
An application workload = an application + its input dataset.
Workload partitioning = distributing the workload among the processing units of a heterogeneous system (a CPU with a few cores and a GPU with thousands of cores).

Generic multi-core CPU

Programming models (CPU), in increasing level of abstraction:
- Pthreads + intrinsics: traditional parallel library
- TBB (Thread Building Blocks): threading library
- OpenCL: to be discussed
- OpenMP: high-level, pragma-based
- Cilk: simple divide-and-conquer model

A GPU Architecture

Offloading model: host code runs on the CPU and offloads kernels to the accelerator.

Programming models (GPU), in increasing level of abstraction:
- CUDA: NVIDIA proprietary
- OpenCL: open standard, functionally portable across multi-cores
- OpenACC: high-level, pragma-based
- Different libraries, programming models, and DSLs for different domains

CPU vs. GPU
CPU: low latency, high flexibility; excellent for irregular codes with limited parallelism. Throughput ~500 GFLOPs, bandwidth ~60 GB/s.
GPU: high throughput; excellent for massively parallel workloads. Throughput ~5,500 GFLOPs, bandwidth ~350 GB/s.
Heterogeneous computing = CPU + GPU working together. Do we gain anything?

Case studies
Matrix Multiplication (SGEMM), n = 2048 x 2048 (48 MB), partitioning granularity = 10%.
[Plot: execution time (ms) for partitionings ranging from Only-GPU to Only-CPU; best = (80,20).]
T_G = GPU kernel execution time, T_D = data transfer time, T_C = CPU kernel execution time, T_Max = max(T_G + T_D, T_C).

Case studies
[Plots: execution time (ms) vs. partitioning, from Only-GPU to Only-CPU, granularity = 10%; T_G, T_D, T_C, T_Max as defined above.]
- Matrix Multiplication (SGEMM), n = 2048 x 2048 (48 MB): best = (80,20)
- Separable Convolution (SConv), n = 6144 x 6144 (432 MB): best = (30,70)
- Dot Product (SDOT), n = 14913072 (512 MB): best = Only-CPU
Very few applications are GPU-only.

Possible advantages of heterogeneity
- Increase performance: both devices work in parallel
- Decrease data communication, which is often the bottleneck of the system: different devices for different roles
- Increase flexibility and reliability: choose one/all *PUs for execution; fall-back solution when one *PU fails
- Increase energy efficiency: cheaper per flop
Main challenges: programming and workload partitioning!

Challenge 1: Programming models

Programming models (PMs)
Heterogeneous computing = a mix of different processors on the same platform. Programming options:
- A mix of programming models: one (or several?) for the CPUs, e.g. OpenMP, and one (or several?) for the GPUs, e.g. CUDA
- A single, unified programming model: OpenCL is a popular choice; OmpSs and OpenMP 4.0 also span both, from low level to high level

Heterogeneous Computing PMs
- Generic, high level (OpenACC, OpenMP 4.0, OmpSs, StarPU, HPL): high productivity; not all applications are easy to implement.
- Domain/application specific, high level (HyGraph, Cashmere, GlassWing): focus on productivity and performance.
- Generic, low level (OpenCL, OpenMP+CUDA): the most common at the moment; useful for performance, more difficult to use in practice.
- Domain specific, low level (TOTEM): focus on performance; more difficult to use; quite rare.

Heterogeneous computing PMs
- CUDA + OpenMP/TBB: typical combination for NVIDIA GPUs; individual development per *PU; glue code can be challenging.
- OpenCL (Khronos Group): functional portability => can be used as a unified model; performance portability via code specialization.
- HPL (University of A Coruna, Spain): library on top of OpenCL, to automate code specialization.
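The glue code typically splits the iteration space, launches the GPU partition asynchronously, and processes the remainder on the CPU while the kernel runs. A minimal, hypothetical sketch follows; the scale_gpu/scale_hybrid names, the kernel itself, and the split fraction beta are illustrative, not part of any framework named above.

```cuda
// Hypothetical glue-code sketch for a static CPU+GPU split with CUDA + OpenMP.
#include <omp.h>
#include <cuda_runtime.h>

__global__ void scale_gpu(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

void scale_hybrid(float *x, int n, float a, float beta) {
    int n_gpu = (int)(beta * n);          // first n_gpu elements go to the GPU
    float *d_x = NULL;

    if (n_gpu > 0) {                      // GPU partition: copy, launch (asynchronous)
        cudaMalloc(&d_x, n_gpu * sizeof(float));
        cudaMemcpy(d_x, x, n_gpu * sizeof(float), cudaMemcpyHostToDevice);
        scale_gpu<<<(n_gpu + 255) / 256, 256>>>(d_x, n_gpu, a);
    }

    #pragma omp parallel for              // CPU partition runs while the GPU kernel executes
    for (int i = n_gpu; i < n; i++)
        x[i] *= a;

    if (n_gpu > 0) {                      // copy back; cudaMemcpy waits for the kernel
        cudaMemcpy(x, d_x, n_gpu * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d_x);
    }
}
```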

Heterogeneous computing PMs
- StarPU (INRIA, France): special API for coding; runtime system for scheduling.
- OmpSs (UPC + BSC, Spain): C + OpenCL/CUDA kernels; runtime system for scheduling and communication optimization.

Heterogeneous computing PMs
- Cashmere (VU Amsterdam + NLeSC): dedicated to divide-and-conquer solutions; OpenCL backend.
- GlassWing (VU Amsterdam): dedicated to MapReduce applications.
- TOTEM (U. of British Columbia, Canada): graph processing; CUDA + multi-threading.
- HyGraph (TU Delft, U Twente, UvA, NL): graph processing; based on CUDA + OpenMP.

Challenge 2: Workload partitioning

Workload
A workload is a DAG (directed acyclic graph) of kernels. Five application classes:
I. SK-One (single kernel, executed once)
II. SK-Loop (single kernel, in a loop)
III. MK-Seq (multiple kernels, in sequence)
IV. MK-Loop (multiple kernels, in a loop)
V. MK-DAG (multiple kernels, general DAG)

Determining the partition
Static partitioning (SP) vs. dynamic partitioning (DP).

Static vs. dynamic
Static partitioning:
+ can be computed before runtime => no overhead
+ can detect GPU-only/CPU-only cases
+ no unnecessary CPU-GPU data transfers
- does not work for all applications
Dynamic partitioning:
+ responds to runtime performance variability
+ works for all applications
- incurs (high) runtime scheduling overhead
- might introduce (high) CPU-GPU data-transfer overhead
- might not work for CPU-only/GPU-only cases

Determining the partition
Static partitioning (SP): (near-)optimal, but low applicability. Dynamic partitioning (DP): often sub-optimal, but high applicability.

Heterogeneous Computing today
- Static, single kernel: systems/frameworks (Qilin, Insieme, SKMD, Glinda, ...) and libraries (HPL). Limited applicability; low overhead => high performance.
- Static, multi-kernel (complex DAG): Glinda 2.0. Low overhead => high performance; still limited in applicability.
- Dynamic, single kernel: sporadic attempts and light runtime systems. Not interesting, given that static and run-time based systems exist.
- Dynamic, multi-kernel (complex DAG): run-time based systems (StarPU, OmpSs, HyGraph). High applicability, high overhead.

Static partitioning: Glinda

Glinda*
*Jie Shen et al., "Look Before You Leap: Using the Right Hardware Resources to Accelerate Applications", HPCC'14.
- Modeling the partitioning: the application workload, the hardware capabilities, the GPU-CPU data transfer
- Predicting the optimal partitioning
- Making the decision in practice: Only-GPU, Only-CPU, or CPU+GPU with the optimal partitioning

Modeling the partitioning
Define the optimal (static) partitioning. β = the fraction of data points assigned to the GPU. The optimal partitioning balances the GPU time (plus data transfer) against the CPU time: T_G + T_D = T_C.

Modeling the workload
n: the total problem size; w: workload per work-item.
W = Σ_{i=0}^{n-1} w_i (the area of the rectangle); W (total workload size) quantifies how much work has to be done.
GPU partition: W_G = w × n × β; CPU partition: W_C = w × n × (1 − β).

Modeling the partitioning
T_G = W_G / P_G, T_C = W_C / P_C, T_D = O / Q, where:
- W: total workload size, with W_G = β × W and W_C = (1 − β) × W
- P: processing throughput (workload/second)
- O: CPU-GPU data-transfer size
- Q: CPU-GPU bandwidth (bytes/second)
Setting T_G + T_D = T_C gives W_G / W_C = (P_G / P_C) × 1 / (1 + (P_G / Q) × (O / W_G)).
R_GC: CPU kernel time vs. GPU kernel time; R_GD: GPU data-transfer time vs. GPU kernel execution time.

Predicting the optimal partitioning
Solving for β from the equation W_G / W_C = (P_G / P_C) × 1 / (1 + (P_G / Q) × (O / W_G)), using the total workload size, the hardware capability ratios, and the data-transfer size, yields a β predictor. There are three β predictors, depending on the data-transfer type:
- No data transfer: β = R_GC / (1 + R_GC)
- Partial data transfer: β = R_GC / (1 + (v/w) × R_GD + R_GC)
- Full data transfer: β = (R_GC − (v/w) × R_GD) / (1 + R_GC)
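As a minimal sketch, the three predictors can be evaluated directly once the ratios are known. The function name, the enum, and the reading of v/w as the data volume transferred per unit of workload are assumptions for illustration, not Glinda's actual code.

```cuda
// Illustrative evaluation of the three beta predictors above (not Glinda's code).
// R_gc, R_gd are the hardware-capability ratios from the model; v_over_w is taken
// here to mean the data volume transferred per unit of workload (an assumption).
enum transfer_type { NO_TRANSFER, PARTIAL_TRANSFER, FULL_TRANSFER };

double predict_beta(double R_gc, double R_gd, double v_over_w, enum transfer_type t) {
    switch (t) {
    case NO_TRANSFER:      return R_gc / (1.0 + R_gc);
    case PARTIAL_TRANSFER: return R_gc / (1.0 + v_over_w * R_gd + R_gc);
    case FULL_TRANSFER:    return (R_gc - v_over_w * R_gd) / (1.0 + R_gc);
    default:               return 0.0;
    }
}
```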

Glinda outcome
(Any?) data-parallel application can be transformed to support heterogeneous computing. The outcome is a decision on the execution of the application: only on the CPU, only on the GPU, or CPU+GPU, and, in the latter case, the partitioning point.

Results [1]
Quality (compared to an oracle): the right HW configuration in 62 out of 72 test cases, and the optimal partitioning when CPU+GPU is selected.
[Plot: execution time (ms), predicted partitioning vs. optimal partitioning (auto-tuning).]

Results [2]
Effectiveness (compared to Only-CPU/Only-GPU): 1.2x-14.6x speedup. If taking the GPU for granted, up to 96% of the performance can be lost.
[Plot: execution time (ms), our configuration vs. Only-GPU vs. Only-CPU.]

Related work
Glinda achieves similar or better performance than other partitioning approaches, at lower cost.

More advantages
- Supports different types of heterogeneous platforms: multiple GPUs, identical or non-identical
- Supports different profiling options, suitable for different execution scenarios: online or offline, partial or full
- Determines not only the optimal partitioning but also the best hardware configuration: Only-GPU / Only-CPU / CPU+GPU with the optimal partitioning
- Supports both balanced and imbalanced applications

Glinda: limitations?
- Regular applications only? >> NO, we support imbalanced applications, too.
- Single kernel only? >> NO, we support two types of multi-kernel applications, too.
- Single GPU only? >> NO, we support multiple GPUs/accelerators on the same node.
- Single node only? >> YES, for now.

What about energy?
The Glinda model can be adapted to optimize for energy (ongoing work at UvA). It is not yet clear whether optimality can be determined as easily. Finding the right energy models: we started with statistical models and plan to use existing analytical models in the next stage. There would probably be more limitations regarding the workload types.

Dynamic partitioning: HyGraph

Dynamic partitioning: StarPU, OmpSs
[Diagram: a DAG of kernels (k0...k5, class V) is handed to a runtime system, which partitions and schedules them over the processing units, e.g., k0, k1, k5 on one device and k2, k3, k4 on the other.]
Our own take on it: graph processing.

Graph processing
Graphs are increasing in size and complexity over time: Facebook 1.7B users, LinkedIn ~500M users, Twitter ~400M users. We need HPC solutions for graph processing.

CPU or GPU?
Graph processing:
- Many vertices => massive parallelism
- Some vertices inactive => irregular workload
- Varying vertex degree => irregular tasks
- Low arithmetic intensity => bandwidth limited
- Irregular access pattern => poor caching
Ideal platform?
- CPU: few cores, deep caches, asymmetric processing
- GPU: many cores, latency hiding, data-parallel processing
No clear winner => attempt CPU+GPU!

Design of HyGraph
HyGraph: dynamic scheduling.
- Replicate vertex states on CPU and GPU.
- Divide vertices into fixed-size segments called blocks.
- Processing the vertices inside a block corresponds to one task.
- CPU: 1 thread per block. GPU: 1 SM (= 1024 threads) per block.
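A minimal sketch of the block idea follows, assuming a flat vertex-state array and leaving the actual vertex update out; the block size constant and the function names are placeholders, not HyGraph's implementation.

```cuda
// Block-based task sketch (placeholder names; not HyGraph's actual code).
#define BLOCK 1024   // vertices per block, matching "1 SM (= 1024 threads) per block"

__global__ void process_blocks_gpu(float *state, int n_vertices) {
    // One CUDA thread block (one SM) processes one vertex block; one thread per vertex.
    int v = blockIdx.x * BLOCK + threadIdx.x;
    if (v < n_vertices) {
        // update state[v] from its neighbours (omitted)
    }
}

void process_blocks_cpu(float *state, int n_vertices, int first_block, int last_block) {
    // One CPU thread processes one vertex block at a time.
    #pragma omp parallel for schedule(dynamic)
    for (int b = first_block; b < last_block; b++) {
        int begin = b * BLOCK;
        int end = (begin + BLOCK < n_vertices) ? begin + BLOCK : n_vertices;
        for (int v = begin; v < end; v++) {
            // update state[v] from its neighbours (omitted)
        }
    }
}
```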

Design of HyGraph Overlap of computation with communication

Performance evaluation
- 2 GPUs presented: Kepler (K20) and Maxwell (GTX980)
- 1 CPU node: dual 8-core CPUs, 2.4 GHz
- Three algorithms; PageRank (500 iterations) presented here
- 10 datasets; 5 synthetic ones presented here
- PRELIMINARY results! (Platform: http://www.cs.vu.nl/das5/)

Dataset    |V|       |E|
RMAT22     2.4 M     64.2 M
RMAT23     4.6 M     129.3 M
RMAT24     8.9 M     260.1 M
RMAT25     17.1 M    523.6 M
RMAT26     33.0 M    1051.9 M

Preliminary results: execution time [s]
[Plot: execution time for CPU, GPU K20, Hybrid K20, GPU GTX980, Hybrid GTX980 on RMAT22-RMAT26. Lower is better.]

Preliminary results: energy [Joule]
[Plot: energy for CPU, GPU K20, Hybrid K20, GPU GTX980, Hybrid GTX980 on RMAT22-RMAT26. Lower is better.]

Results [3]: TEPS / Watt (TEPS = traversed edges per second)
[Plot: TEPS/Watt for CPU, GPU K20, Hybrid K20, GPU GTX980, Hybrid GTX980 on RMAT22-RMAT26. Higher is better.]

Results [4]: EDP
[Plot: EDP for GPU K20, Hybrid K20, GPU GTX980, Hybrid GTX980, and CPU on RMAT22-RMAT26.]

Lessons learned
For irregular applications:
- Heterogeneous computing helps performance.
- Heterogeneous computing might help energy efficiency: for the K20 it is pretty good; for the GTX980 the GPU alone wins.
- We cannot generalize these results to other types of graphs: both energy and performance are unpredictable.
Is hybrid dynamic processing feasible for energy efficiency? Still an open question.

Instead of a summary

Take home message [1]
Heterogeneous computing works! More resources, specialized resources.
Most efforts are in static partitioning and run-time systems:
- Glinda = static partitioning for single-kernel, data-parallel applications; now works for multi-kernel applications, too.
- StarPU, OmpSs = run-time based dynamic partitioning for multi-kernel, complex-DAG applications.
Domain-specific efforts, too:
- HyGraph: graph processing
- Cashmere: divide-and-conquer, distributed

Take home message [2]
Choose a system based on your application scenario:
- Single-kernel vs. multi-kernel
- Massively parallel vs. data-dependent
- Single run vs. multiple runs
- Programming model of choice
There are models to cover combinations of these choices! No framework combines them all; food for thought?

Take home message [3]
GPUs are energy efficient on their own. Hybrid processing helps performance, but for energy efficiency the outcome is heavily application, workload, and hardware specific. Irregular applications remain the cornerstone of *both* performance and energy.

Current/Future research directions
- More heterogeneous platforms
- Extension to more application classes: multi-kernel with complex DAGs (streaming applications, graph-processing applications)
- Integration with distributed systems: intra-node workload partitioning + inter-node workload scheduling
- Extension of the partitioning model for energy

Backup slides

Imbalanced app: Sound ray tracing

Example 4: Sound ray tracing

Which hardware?
Our application has massive data-parallelism (no data dependency between rays) and is compute-intensive per ray. Clearly, this is a perfect GPU workload!!!

Results [1]
[Plot: execution time (s) for W9 (1.3 GB), Only-GPU vs. Only-CPU.]
Only 2.2x performance improvement! We expected 100x.

Workload profile
[Figure: (a) sound speed (m/s) vs. altitude (km), with the aircraft position; (b) relative distance from source (km); (c) number of simulation iterations per ray ID, over launch angles of -40 to +40 degrees.]
Peak processing: ~7000 iterations per ray; bottom processing: ~500 iterations per ray.

Results [2]
[Plot: execution time (ms).]
62% performance improvement compared to Only-GPU.

Model the app workload
n: the total problem size; w: workload per work-item.

Model the app workload
n: the total problem size; w: workload per work-item.
W = Σ_{i=0}^{n-1} w_i (the area of the rectangle); W (total workload size) quantifies how much work has to be done.
GPU partition: W_G = w × n × β; CPU partition: W_C = w × n × (1 − β).

Modeling the partitioning
T_G = W_G / P_G, T_C = W_C / P_C, T_D = O / Q, where:
- W: total workload size, with W_G = β × W and W_C = (1 − β) × W
- P: processing throughput (workload/second)
- O: data-transfer size
- Q: data-transfer bandwidth (bytes/second)
Setting T_G + T_D = T_C gives W_G / W_C = (P_G / P_C) × 1 / (1 + (P_G / Q) × (O / W_G)).

Model the HW capabilities
Workload: W_G + W_C = W. Execution times: T_G and T_C.
P (processing throughput) is measured as workload processed per second; P evaluates the hardware capability of a processor.
GPU kernel execution time: T_G = W_G / P_G. CPU kernel execution time: T_C = W_C / P_C.

Model the data transfer
O (GPU data-transfer size), measured in bytes. Q (GPU data-transfer bandwidth), measured in bytes per second.
Data-transfer time: T_D = O / Q (+ latency). Latency < 0.1 ms, negligible impact.

Predict the partitioning
From T_G = W_G / P_G, T_C = W_C / P_C, T_D = O / Q and the condition T_G + T_D = T_C, we obtain W_G / W_C = (P_G / P_C) × 1 / (1 + (P_G / Q) × (O / W_G)); β is found by solving this equation.

Predict the partitioning
Solve the equation W_G / W_C = (P_G / P_C) × 1 / (1 + (P_G / Q) × (O / W_G)).
β-dependent terms: W_G = w × n × β, W_C = w × n × (1 − β), O = full data-transfer size, or full data-transfer size × β.

Predict the partitioning
Solve the equation W_G / W_C = (P_G / P_C) × 1 / (1 + (P_G / Q) × (O / W_G)).
β-dependent terms: W_G = w × n × β, W_C = w × n × (1 − β), O = full data-transfer size, or full data-transfer size × β.
β-independent terms: obtained by estimation.

Modeling the partitioning
Estimating the HW capability ratios by profiling:
- R_GC: the ratio of GPU throughput to CPU throughput (CPU kernel execution time vs. GPU kernel execution time)
- R_GD: the ratio of GPU throughput to data-transfer bandwidth (GPU data-transfer time vs. GPU kernel execution time)
These ratios plug into W_G / W_C = (P_G / P_C) × 1 / (1 + (P_G / Q) × (O / W_G)).

Predicting the optimal partitioning
Solving for β from the equation W_G / W_C = (P_G / P_C) × 1 / (1 + (P_G / Q) × (O / W_G)), using the total workload size, the HW capability ratios, and the data-transfer size, yields a β predictor. There are three β predictors (by data-transfer type):
- No data transfer: β = R_GC / (1 + R_GC)
- Partial data transfer: β = R_GC / (1 + (v/w) × R_GD + R_GC)
- Full data transfer: β = (R_GC − (v/w) × R_GD) / (1 + R_GC)

Making the decision in practice
From β to a practical HW configuration: calculate N_G = n × β (GPU partition size) and N_C = n × (1 − β) (CPU partition size).
- If N_G is below its lower bound: Only-CPU.
- Else, if N_C is below its lower bound: Only-GPU.
- Else: round up N_G, and use CPU+GPU with the final N_G and N_C.
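A sketch of this decision logic follows; the lower bounds and the rounding granularity are placeholders, since the real thresholds are platform-specific.

```cuda
// Sketch of the decision step (placeholder thresholds; not Glinda's actual code).
enum hw_config { ONLY_CPU, ONLY_GPU, CPU_PLUS_GPU };

enum hw_config decide(double beta, long n, long ng_min, long nc_min,
                      long *N_G, long *N_C) {
    *N_G = (long)(beta * n);                  // GPU partition size
    *N_C = n - *N_G;                          // CPU partition size
    if (*N_G < ng_min) { *N_G = 0; *N_C = n; return ONLY_CPU; }
    if (*N_C < nc_min) { *N_G = n; *N_C = 0; return ONLY_GPU; }
    *N_G = ((*N_G + 255) / 256) * 256;        // round N_G up (assumed granularity of 256)
    if (*N_G > n) *N_G = n;
    *N_C = n - *N_G;
    return CPU_PLUS_GPU;
}
```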

Extensions
- Different profiling options: online vs. offline profiling, partial vs. full profiling; the overall profiling overhead depends on the option chosen (offline/online, partial/full workload, number of runs).
- CPU + multiple GPUs: identical GPUs, or non-identical GPUs (may be suboptimal).

What if we have multiple kernels? Can we still do static partitioning?

Multi-kernel applications?
Use dynamic partitioning (OmpSs, StarPU), as sketched below:
- Partition the kernels into chunks and distribute the chunks to the *PUs, keeping data dependencies
- Enables multi-kernel asynchronous execution (inter-kernel parallelism)
- Responds automatically to runtime performance variability
- (Often) leads to suboptimal performance: scheduling policies and chunk size; scheduling overhead (taking the decision, data transfers, etc.)
Static partitioning: low applicability, high performance. Dynamic partitioning: low performance, high applicability. Can we get the best of both worlds?
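To make the chunk idea concrete, here is a toy self-scheduling loop in which CPU threads and one GPU-driver thread repeatedly claim fixed-size chunks until the iteration space is exhausted. The chunk size, the launch_gpu_chunk helper, and the single-kernel setting are simplifying assumptions; real StarPU/OmpSs schedulers also track data dependencies and use much richer policies.

```cuda
#include <omp.h>

#define CHUNK 4096   // chunk size: a key tuning knob for dynamic partitioning

// Hypothetical helper: copies the chunk to the GPU, runs the kernel, copies back.
void launch_gpu_chunk(float *x, int len, float a);

void run_dynamic(float *x, int n, float a) {
    int next = 0;                                       // next unclaimed index
    #pragma omp parallel
    {
        int gpu_driver = (omp_get_thread_num() == 0);   // thread 0 feeds the GPU
        while (1) {
            int start;
            #pragma omp atomic capture
            { start = next; next += CHUNK; }            // claim the next chunk
            if (start >= n) break;
            int len = (start + CHUNK <= n) ? CHUNK : n - start;
            if (gpu_driver)
                launch_gpu_chunk(x + start, len, a);    // GPU processes this chunk
            else
                for (int i = start; i < start + len; i++)
                    x[i] *= a;                          // CPU processes this chunk
        }
    }
}
```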

How to satisfy both requirements?
We combine static and dynamic partitioning. We design an application analyzer that chooses the best-performing partitioning strategy for any given application:
(1) Implement/get the application source code
(2) Analyze the application's kernel structure (application class)
(3) Select the partitioning strategy (from a partitioning strategy repository)
(4) Enable the partitioning in the code

Application classification
5 application classes (the DAG-of-kernels classification above): I SK-One, II SK-Loop, III MK-Seq, IV MK-Loop, V MK-DAG.

Partitioning strategies
Static partitioning: SP-Single (single-kernel applications); SP-Unified and SP-Varied (multiple kernels, each split between GPU and CPU, with a global sync after every kernel).
Dynamic partitioning (multi-kernel applications): DP-Dep and DP-Perf, which schedule partitions over GPU and CPU threads based on data-dependency chains and scheduling policies.
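One plausible reading of the two static multi-kernel strategies, used only for illustration: SP-Unified applies a single split fraction to every kernel, while SP-Varied picks a fraction per kernel. run_kernel_hybrid is a hypothetical helper that splits kernel k at fraction beta and returns only after both partitions have completed (the global sync).

```cuda
// Hypothetical helper: runs kernel k with fraction beta on the GPU and 1-beta on the CPU,
// returning after both partitions and any needed transfers finish (global sync).
void run_kernel_hybrid(int k, double beta);

// SP-Unified (as interpreted here): one split shared by all kernels in the sequence.
void run_sp_unified(int n_kernels, double beta) {
    for (int k = 0; k < n_kernels; k++)
        run_kernel_hybrid(k, beta);
}

// SP-Varied (as interpreted here): each kernel gets its own split.
void run_sp_varied(int n_kernels, const double *beta_per_kernel) {
    for (int k = 0; k < n_kernels; k++)
        run_kernel_hybrid(k, beta_per_kernel[k]);
}
```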

Partitioning strategies
Static partitioning: in Glinda, for single-kernel + multi-kernel applications.
Dynamic partitioning: in OmpSs, for multi-kernel applications (fall-back scenarios).

Results [4]
MK-Loop (STREAM-Loop):
- Without sync: SP-Unified > DP-Perf >= DP-Dep > SP-Varied
- With sync: SP-Varied > DP-Perf >= DP-Dep > SP-Unified
[Plot: execution time (ms) for Only-GPU, Only-CPU, SP-Unified, DP-Perf, DP-Dep, SP-Varied on STREAM-Loop with and without sync.]

Results: performance gain
Best partitioning strategy vs. Only-CPU or Only-GPU.
[Plot: speedup for MatrixMul (peak bar: 22.2x), Nbody, BlackScholes, HotSpot, STREAM-Seq-w, STREAM-Seq-w/o, STREAM-Loop-w, STREAM-Loop-w/o.]
Average: 3.0x (vs. Only-GPU), 5.3x (vs. Only-CPU).

Summary: Glinda
- Computes (close-to-)optimal partitioning
- Based on static partitioning: single-kernel; multi-kernel (with restrictions)
- No programming model attached: CUDA + OpenMP/TBB, OpenCL, others => we propose HPL