Heterogeneous platforms

Heterogeneous platforms
Systems combining main processors and accelerators, e.g., CPU + GPU, CPU + Intel MIC, AMD APU, ARM SoC.
Any platform using a GPU is a heterogeneous platform!

Further in this talk
A heterogeneous platform = 1 CPU + n GPUs (a few cores + thousands of cores).
Execution model = computation/kernel offloading.
An application workload = an application + its input dataset.
Workload partitioning = workload distribution among the processing units of a heterogeneous system.

Heterogeneity vs. homogeneity
Increase performance: both devices (might) work in parallel.
Decrease data communication: different devices play different roles.
Increase flexibility and reliability: choose one/all *PUs for execution; fall-back solution when one *PU fails.
Increase power efficiency: cheaper per flop.

Goals*
Demonstrate that heterogeneous computing is both interesting and challenging.
Discuss the landscape of heterogeneous computing: programming models and partitioning models.
Tell some success stories.
Present open questions.

CPU vs. accelerator (GPU)
CPU: low latency, high flexibility. Excellent for irregular codes with limited parallelism.
GPU: high throughput. Excellent for massively parallel workloads.

Example 1: dot product
Compute the dot product of 2 (1D) arrays.
Performance: T_G = execution time on GPU, T_C = execution time on CPU, T_D = CPU-GPU data transfer time.
GPU best or CPU best?
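To make the comparison concrete, here is a minimal sketch of the offloaded version (mine, not from the talk): a CUDA kernel reduces per-block partial sums, and the regions where T_D and T_G accrue are marked in comments. The problem size and launch configuration are illustrative.

```cuda
#include <cstdio>
#include <vector>

// Minimal dot-product kernel: each block reduces its slice into partial[blockIdx.x].
__global__ void dotKernel(const float* a, const float* b, float* partial, int n) {
    __shared__ float cache[256];
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int i = tid; i < n; i += gridDim.x * blockDim.x)
        sum += a[i] * b[i];                   // grid-stride loop over the arrays
    cache[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction in the block
        if (threadIdx.x < s) cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];
}

int main() {
    const int n = 1 << 24;                    // illustrative problem size
    const int blocks = 256, threads = 256;
    std::vector<float> a(n, 1.0f), b(n, 2.0f);
    float *dA, *dB, *dP;
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));
    cudaMalloc(&dP, blocks * sizeof(float));

    // T_D accrues here: CPU-GPU data transfers.
    cudaMemcpy(dA, a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // T_G accrues here: GPU kernel execution.
    dotKernel<<<blocks, threads>>>(dA, dB, dP, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dP, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    float dot = 0.0f;
    for (float p : partial) dot += p;         // final reduction on the CPU
    printf("dot = %f\n", dot);
    cudaFree(dA); cudaFree(dB); cudaFree(dP);
    return 0;
}
```

For a memory-bound kernel like this, T_D typically dominates T_G, which is why the CPU can win despite the GPU's raw throughput.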

Example 1: dot product
[Figure: execution time (ms) of T_G, T_D, T_C, and T_Max across problem sizes.]

Example 2: separable convolution
Separable convolution (CUDA SDK): apply a convolution filter (kernel) on a large image. A separable kernel allows applying the filter in two passes: horizontal first, vertical second.
Performance: T_G = execution time on GPU, T_C = execution time on CPU, T_D = CPU-GPU data transfer time.
GPU best or CPU best?
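To show the two-pass structure, here is a minimal sketch (mine, not the CUDA SDK code; KERNEL_RADIUS and the border clamping are my assumptions):

```cuda
#define KERNEL_RADIUS 8  // illustrative filter radius

__constant__ float d_Kernel[2 * KERNEL_RADIUS + 1];

// Horizontal pass: each thread convolves one pixel along its row.
__global__ void convRows(float* dst, const float* src, int w, int h) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int k = -KERNEL_RADIUS; k <= KERNEL_RADIUS; ++k) {
        int xx = min(max(x + k, 0), w - 1);   // clamp at image borders
        sum += src[y * w + xx] * d_Kernel[KERNEL_RADIUS + k];
    }
    dst[y * w + x] = sum;
}

// The vertical pass (convColumns) is identical with the x/y roles swapped.
// Separability turns one O(r^2)-per-pixel 2D filter into two O(r) passes:
//   convRows<<<grid, block>>>(tmp, input, w, h);
//   convColumns<<<grid, block>>>(output, tmp, w, h);
```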

Example 2: separable convolution
[Figure: execution time (ms) of T_G, T_D, T_C, and T_Max across problem sizes.]

Example 3: matrix multiply
Compute the product of 2 matrices.
Performance: T_G = execution time on GPU, T_C = execution time on CPU, T_D = CPU-GPU data transfer time.
GPU best or CPU best?

Example 3: matrix multiply
[Figure: execution time (ms) of T_G, T_D, T_C, and T_Max across problem sizes.]

Findings
There are very few GPU-only applications: the CPU-GPU communication bottleneck and the increasing performance of CPUs both matter.
Optimal partitioning between *PUs is difficult: load balancing depends on (platform, application, dataset), and imbalance => performance loss versus the original!
Programming different platforms with a coherent model is difficult.

Findings
We need systematic methods (1) to program and (2) to partition workloads for heterogeneous platforms.

Programming

Programming models (PMs)
Variety of options: platform-specific programming models, unified programming models, heterogeneous programming models (WiP).
Taxonomy: abstraction level and generality.
[Figure: taxonomy of PMs along two axes, from low level to high level and from specific to generic.]

Heterogeneous computing PMs
High level, generic: OpenACC, OpenMP 4.0, OmpSs, StarPU, HPL. Higher-level abstraction, dedicated APIs/pragmas; focus on ease of use.
High level, specific: HyGraph (graph processing), Cashmere (divide and conquer), GlassWing (MapReduce), TOTEM (graph processing). Domain- and/or application-specific; focus on productivity and performance.
Low level, generic: OpenCL, OpenMP+CUDA. The most common at the moment; useful for performance, difficult to use in practice.
Low level, specific: domain-specific, focus on performance; difficult to use.

Partitioning

Determining the partition
Static partitioning (SP) vs. dynamic partitioning (DP).
[Figure: a workload split once between the multi-core CPU and the many-core GPU (SP), vs. split into chunks distributed at runtime (DP).]

Static vs. dynamic
Static partitioning:
+ can be computed before runtime => no overhead
+ can detect GPU-only/CPU-only cases
+ no unnecessary CPU-GPU data transfers
-- does not work for all applications
Dynamic partitioning:
+ responds to runtime performance variability
+ works for all applications
-- incurs (high) runtime scheduling overhead
-- might introduce (high) CPU-GPU data-transfer overhead
-- might not work for CPU-only/GPU-only cases

Determining the partition
Static partitioning (SP): (near-)optimal, but low applicability.
Dynamic partitioning (DP): often sub-optimal, but high applicability.

A simple taxonomy
Static, single kernel: Qilin, Insieme, SKMD, Glinda, ... Limited applicability; low overhead => high performance.
Static, multi-kernel (complex DAG): ?
Dynamic, single kernel: ?
Dynamic, multi-kernel (complex DAG): run-time based systems (StarPU, OmpSs). Unlimited applicability, potentially high overhead.
Our goal is to extend the use of static partitioning to as many applications as possible. Dynamic partitioning is an excellent fallback scenario!

Static partitioning: Glinda*
Model: the application workload, the hardware capabilities, the GPU-CPU data transfer.
Predict the optimal partitioning.
Make the decision in practice: Only-GPU, Only-CPU, or CPU+GPU with the optimal partitioning.
Workflow: model => determine partitioning => deploy application.
*Jie Shen et al., HPCC'14. Look before you Leap: Using the Right Hardware Resources to Accelerate Applications.

Model the partitioning
Define the optimal (static) partitioning as the split for which T_G + T_D = T_C, i.e., the GPU execution time (plus the data transfer time) equals the CPU execution time.
β = the fraction of data points assigned to the GPU.

Model the workload
n: the total problem size; w: workload per work-item.

Model the workload
W (the total workload) quantifies how much work has to be done: W = \sum_{i=0}^{n-1} w_i, i.e., the area of the rectangle of width n and height w.
GPU partition: W_G = w × n × β. CPU partition: W_C = w × n × (1 - β).

Model the hardware
T_G = W_G / P_G, T_C = W_C / P_C, T_D = O / Q.
Two pairs of metrics: W = total workload size and P = processing throughput (W/second); O = data-transfer size and Q = data-transfer bandwidth (bytes/second).
Combining T_G + T_D = T_C with W = W_G + W_C gives:
W_G / W_C = (P_G / P_C) × 1 / (1 + (P_G / Q) × (O / W_G))

Determine the partitioning
Estimate the HW capability ratios by profiling:
R_GC = P_G / P_C (the ratio of GPU throughput to CPU throughput), measured as CPU kernel execution time vs. GPU kernel execution time.
R_GD = P_G / Q (the ratio of GPU throughput to data-transfer bandwidth), measured as GPU data-transfer time vs. GPU kernel execution time.
In W_G / W_C = (P_G / P_C) × 1 / (1 + (P_G / Q) × (O / W_G)), the remaining terms depend on β.

Determine the partitioning
Solve for β from the equation: the total workload size, the HW capability ratios, and the data transfer size turn W_G / W_C = (P_G / P_C) × 1 / (1 + (P_G / Q) × (O / W_G)) into a β predictor.
There are three β predictors (by data transfer type), with v the data transferred per work-item:
No data transfer: β = R_GC / (1 + R_GC)
Partial data transfer: β = R_GC / (1 + (v/w) × R_GD + R_GC)
Full data transfer: β = (R_GC - (v/w) × R_GD) / (1 + R_GC)
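As a sanity check (my derivation, consistent with the definitions above), the no-transfer predictor follows directly from the balance condition T_G = T_C:

```latex
\frac{W_G}{P_G} = \frac{W_C}{P_C}
\;\Longrightarrow\;
\frac{w\,n\,\beta}{P_G} = \frac{w\,n\,(1-\beta)}{P_C}
\;\Longrightarrow\;
\frac{\beta}{1-\beta} = \frac{P_G}{P_C} = R_{GC}
\;\Longrightarrow\;
\beta = \frac{R_{GC}}{1 + R_{GC}}
```

The other two predictors follow the same way once T_D = O/Q is added on the GPU side, with O proportional to the GPU partition (partial transfer) or to the whole dataset (full transfer).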

Making the decision in practice
From β to a practical HW configuration: calculate N_G = n × β (GPU partition size) and N_C = n × (1 - β) (CPU partition size).
If N_G < its lower bound => Only-CPU.
Else if N_C < its lower bound => Only-GPU.
Else => CPU+GPU: round up N_G, yielding the final N_G and N_C.
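A compact sketch of this decision procedure (the lower bounds, block size, and helper names are illustrative assumptions, not Glinda's actual API):

```cuda
#include <cstdio>

// Illustrative lower bounds: below these sizes a partition is not worth launching.
const int N_G_MIN = 1 << 14;   // e.g., not enough work to fill the GPU
const int N_C_MIN = 1 << 10;

enum Decision { ONLY_CPU, ONLY_GPU, CPU_PLUS_GPU };

// Beta predictor for the partial-data-transfer case:
// beta = R_GC / (1 + (v/w) * R_GD + R_GC)
double predictBeta(double rGC, double rGD, double vOverW) {
    return rGC / (1.0 + vOverW * rGD + rGC);
}

Decision decide(int n, double beta, int* nG, int* nC) {
    *nG = (int)(n * beta);
    *nC = n - *nG;
    if (*nG < N_G_MIN) { *nG = 0; *nC = n; return ONLY_CPU; }
    if (*nC < N_C_MIN) { *nG = n; *nC = 0; return ONLY_GPU; }
    *nG = ((*nG + 255) / 256) * 256;   // round N_G up to a block-size multiple
    *nC = n - *nG;
    return CPU_PLUS_GPU;
}

int main() {
    int nG, nC;
    // Example: GPU 4x faster than CPU; transfer costs 0.5x kernel time per item.
    double beta = predictBeta(4.0, 0.5, 1.0);
    Decision d = decide(1 << 20, beta, &nG, &nC);
    printf("beta=%.3f decision=%d N_G=%d N_C=%d\n", beta, d, nG, nC);
    return 0;
}
```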

Glinda outcome
A data-parallel application can be transformed to support heterogeneous computing.
A decision on the execution of the application: only on the CPU, only on the GPU, or CPU+GPU, and the partitioning point.

Success story #1
Applied Glinda to 7 (single-kernel) applications × 6 datasets per application = 42 tests.
In 38/42 tests Glinda selected the best configuration (14 CPU-only, 20 CPU+GPU correct; 4 CPU+GPU incorrect).
In all cases Glinda gains speed-up over GPU-only: 1.2x-14.6x.

How to use Glinda?
Profile the platform to determine R_GC and R_GD.
Use the Glinda solver to determine β.
Take the decision: Only-CPU, Only-GPU, or CPU+GPU (and the partitioning); if needed, apply the partitioning.
Code preparation: parallel implementations for both CPUs and GPUs, code templates for partitioning, instrumentation for profiling.
Code reuse: single-device code and multi-device code are reusable for different datasets and HW platforms. *HPL supports Glinda (http://hpl.des.udc.es).

Success story #2

Sound ray tracing
[Figure: illustration of sound ray tracing, shown in stages.]

Which hardware?
Our application has massive data-parallelism, no data dependency between rays, and is compute-intensive per ray. Clearly, this is a perfect GPU workload!!!

Initial results
[Figure: execution time (s) of Only-GPU vs. Only-CPU on W9 (1.3 GB).]
Only 2.2x performance improvement! We expected 100x.

Workload profile
[Figure: (a) sound speed vs. altitude around the aircraft; (b) #simulation iterations per ray (by ray ID / launch angle): peak processing ~7000 iterations, bottom processing ~500 iterations.]

Modeling the imbalanced workload
Workload sorting, then workload modeling: after sorting the data points by cost, the workload curve separates into a flat part and a peak part.
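One way to picture the sorting step (my sketch, not Glinda's code): order the rays by estimated cost so the workload curve becomes monotonic, then cut by share of total work W rather than by item count, since items no longer cost the same:

```cuda
#include <algorithm>
#include <numeric>
#include <vector>

// Sort data points by estimated cost so the workload curve becomes monotonic,
// then cut where one device's share of total work (not of items) reaches beta.
// The per-ray 'cost' estimates and 'beta' are assumed inputs (from profiling).
std::vector<int> sortAndCut(const std::vector<double>& cost, double beta, int* cut) {
    std::vector<int> order(cost.size());
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return cost[a] > cost[b]; });
    double total = std::accumulate(cost.begin(), cost.end(), 0.0);
    double acc = 0.0;
    *cut = (int)order.size();
    for (size_t i = 0; i < order.size(); ++i) {
        acc += cost[order[i]];
        if (acc >= beta * total) { *cut = (int)i + 1; break; }
    }
    // order[0..cut) goes to one device, order[cut..n) to the other.
    return order;
}
```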

Glinda for imbalanced workloads
[Figure: the sorted iteration curve (#simulation iterations vs. ray ID) split at multiple partitioning points ip1, ip2, ip3.]

Final results
[Figure: execution time (ms) across configurations.]
Glinda determines the optimal partitioning for any case, enabling the performance gain for free: 62% performance improvement compared to Only-GPU.

More complex applications
State-of-the-art: dynamic partitioning (OmpSs, StarPU). Partition the kernels into chunks, distribute chunks to *PUs, keep data dependencies, enabling multi-kernel asynchronous execution (inter-kernel parallelism). It responds automatically to runtime performance variability, but (often) leads to suboptimal performance: scheduling policies and chunk size, scheduling overhead (taking the decision, data transfer, etc.).
Can we extend the use of static partitioning?
*Jie Shen et al., IEEE TPDS'16. Workload Partitioning for Accelerating Applications on Heterogeneous Platforms.

More complex applications
We combine static and dynamic partitioning: we design an application analyzer that chooses the best-performing partitioning strategy for any given application.
(1) Implement/get the app source code. (2) Analyze the app kernel structure (app classification => app class). (3) Select the partitioning strategy (from the partitioning strategy repository). (4) Enable the partitioning in the code.

Application classification
5 application classes: I (SK-One), II (SK-Loop), III (MK-Seq), IV (MK-Loop), V (MK-DAG).
[Figure: kernel-structure diagrams (k0 ... kn) for the five classes.]
A survey* demonstrates that most applications belong to classes I-III, a few to class IV, and none to class V.
*Jie Shen et al., TU Delft technical report. A Study of Application Kernel Structure for Data Parallel Applications.

Partitioning strategies: before
Static partitioning: single-kernel applications only (SP-Single: one kernel k0 split between GPU and CPU).
Dynamic partitioning: multi-kernel applications (DP-Dep and DP-Perf: data dependency chains vs. scheduling policies, with scheduled partitions across GPU and CPU threads).

Partitioning strategies: now
Static partitioning (in Glinda): single-kernel + multi-kernel applications, via SP-Single, SP-Unified, and SP-Varied.
Dynamic partitioning (in OmpSs): multi-kernel applications (fall-back scenario).

Partitioning strategies
Glinda: single-kernel => SP-Single (one kernel split between GPU and CPU).
Glinda 2.0: multi-kernel => SP-Unified (the same partitioning ratio for all kernels, no global sync between kernels) or SP-Varied (a per-kernel partitioning ratio, with a global sync after each kernel).
OmpSs: dynamic partitioning as the fall-back scenario (DP-Dep and DP-Perf: data dependency chains vs. scheduling policies, with scheduled partitions across GPU and CPU threads). See the sketch below for the two static multi-kernel strategies.
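A host-side sketch of the difference between the two multi-kernel strategies (illustrative placeholders, not Glinda's API): SP-Unified shares one split across all kernels and synchronizes once; SP-Varied re-splits and synchronizes after every kernel.

```cuda
#include <cstdio>

// Placeholder device kernel standing in for any of the application's kernels.
__global__ void kernelK(int k, int nItems) { /* ... per-kernel GPU work ... */ }

// Placeholder CPU implementation of the same kernel.
void cpuK(int k, int nItems) { /* ... per-kernel CPU work ... */ }

// SP-Unified: one split (beta) shared by all kernels; sync once at the end.
void runSPUnified(int n, double beta, int numKernels) {
    int nG = (int)(n * beta);
    for (int k = 0; k < numKernels; ++k) {
        kernelK<<<(nG + 255) / 256, 256>>>(k, nG);  // async launch
        cpuK(k, n - nG);  // overlaps with the GPU, launch is asynchronous
    }
    cudaDeviceSynchronize();
}

// SP-Varied: a per-kernel split; a global sync after every kernel.
void runSPVaried(int n, const double* betaPerKernel, int numKernels) {
    for (int k = 0; k < numKernels; ++k) {
        int nG = (int)(n * betaPerKernel[k]);
        kernelK<<<(nG + 255) / 256, 256>>>(k, nG);
        cpuK(k, n - nG);
        cudaDeviceSynchronize();  // global sync after each kernel
    }
}

int main() {
    double betas[3] = {0.7, 0.8, 0.6};
    runSPUnified(1 << 20, 0.75, 3);
    runSPVaried(1 << 20, betas, 3);
    printf("done\n");
    return 0;
}
```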

Success story #3
6 applications: 2 of type I, 2 of type II, 1 of type III, and 1 of type IV.
Glinda detects and uses the best partitioning strategy: SP-Single for types I and II; SP-Unified or SP-Varied for types III and IV, depending on the synchronization model.
In all cases, Glinda's static partitioning outperforms OmpSs dynamic partitioning. Glinda is the only static partitioner to support multi-kernel applications.

Results*
Similar results for all 6 applications with multiple kernels. For MK-Loop (STREAM-Loop):
w/o sync: SP-Unified > DP-Perf >= DP-Dep > SP-Varied
with sync: SP-Varied > DP-Perf >= DP-Dep > SP-Unified
[Figure: execution time (ms) of Only-GPU, Only-CPU, SP-Unified, DP-Perf, DP-Dep, and SP-Varied on STREAM-Loop without and with sync.]
*Jie Shen et al., ICPP'15. Matchmaking Applications and Partitioning Strategies for Efficient Execution on Heterogeneous Platforms.

Heterogeneous computing today*
Static, single kernel: Qilin, Insieme, SKMD, Glinda, ... Limited applicability; low overhead => high performance.
Static, multi-kernel (complex DAG): only Glinda, for restricted types of DAGs. Improved applicability, but remains limited; low overhead => high performance.
Dynamic, single kernel: sporadic attempts and light runtime systems. Not interesting, given that static and run-time based systems exist.
Dynamic, multi-kernel (complex DAG): run-time based systems (StarPU, OmpSs). High applicability, potentially high overhead.
*A. L. Varbanescu et al., FDL'16. Heterogeneous Computing with Accelerators: an Overview with Examples.

Instead of a conclusion

Take-home message [1]
Heterogeneous computing works! More resources, specialized resources. Performance gain for free, or at the price of some minor code changes.
Plenty of systems to support you: different programming models, generic systems for static/dynamic partitioning, and domain-/application-specific models (TOTEM and HyGraph for graph processing, GlassWing for MapReduce, Cashmere for divide-and-conquer).

Take-home message [2]
You have an application to run?
Now: choose one solution based on your application scenario: single-kernel vs. multi-kernel, massively parallel vs. data-dependent, single run vs. multiple runs, programming model of choice.
WiP: a framework to combine them all, starting from Glinda + OmpSs. We are still working on it :)

Ultimate goal (WiP)
SEMI-automated decisions for high-performance heterogeneous computing:
(1) Implement the app source code. (2) Analyze the app kernel structure (app classification => app class). (3) Select the partitioning strategy (from the partitioning strategy repository). (4) Enable the partitioning in the code.

Open questions
Analytical modeling instead of profiling.
Unified programming models; performance portability.
Extending to more specific types of workloads.
Performance modeling and prediction: what is the right hardware for my software?
Understanding the links with other fields where heterogeneous computing is already heavily used: embedded systems, cyber-physical systems, etc.

Heterogeneous computing works!??!??
Ana Lucia Varbanescu, A.L.Varbanescu@uva.nl