Enabling GPU support for the COMPSs-Mobile framework

Enabling GPU support for the COMPSs-Mobile framework
Francesc Lordan, Rosa M. Badia and Wen-Mei Hwu
Nov 13, 2017, 4th Workshop on Accelerator Programming Using Directives (WACCPD)

COMPSs-Mobile infrastructure
Mobile devices collect data from sensors and offload computation to the Cloud, communicating over PAN, LAN and WAN links.

Programming Model: COMPSs
Programming:
- Sequential code in standard programming languages, with no APIs
- Infrastructure agnostic
- Task-based: an annotated interface (the CEI) contains the declaration of the methods (Core Elements, CEs) selected to become tasks, with information about their accesses to data

Execution is managed by a runtime system that:
- Replaces CE invocations with asynchronous tasks
- Monitors data dependencies among tasks
- Orchestrates the execution on the underlying infrastructure
- Transfers data values
- Submits task executions

Programming Model Example

Matmul.java:

    public class Matmul {
        public static void main(String[] args) {
            int[][] A;
            int[][] B;
            int[][] C;
            ...
            C = multiply(A, B);
            ...
        }

        public static int[][] multiply(int[][] A, int[][] B) {
            // Matrix multiplication code: C = A * B
            ...
            return C;
        }
    }

CEI.java:

    public interface CEI {
        @Method(declaringClass = "es.bsc.compss.matmul.Matmul")
        int[][] multiply(
            @Parameter(direction = IN) int[][] A,
            @Parameter(direction = IN) int[][] B
        );
    }

Enabling GPU support on COMPSs-Mobile

Programming Model Extension
- Add the OpenCL kernel sources as application resource files
- Indicate in the CEI that a CE is implemented as an OpenCL kernel using the @OpenCL annotation, specifying:
  - the resource containing the kernel implementation (kernel)
  - the number of work-items (globalWorkSize)
  - the work-item ID offset (offset)
  - the work-group size (localWorkSize)
  - the size of the result (resultSize)

Programming Model Extension Example

Matmul.java:

    public class Matmul {
        public static void main(String[] args) {
            int[][] A;
            int[][] B;
            int[][] C;
            ...
            C = multiply(A, B);
            ...
        }

        public static int[][] multiply(int[][] A, int[][] B) {
            // Matrix multiplication code: C = A * B
            ...
            return C;
        }
    }

Matmul.cl:

    kernel void multiply(global const int *a,
                         global const int *b,
                         global int *c) {
        // Matrix multiplication code: C = A * B
        ...
    }

CEI.java:

    public interface CEI {
        @OpenCL(kernel = "Matmul.cl",
                globalWorkSize = "par0.x,par1.y",
                resultSize = "par0.x,par1.y")
        @Method(declaringClass = "es.bsc.compss.matmul.Matmul")
        int[][] multiply(
            @Parameter(direction = IN) int[][] A,
            @Parameter(direction = IN) int[][] B
        );
    }

Kernel Execution Lifecycle
[0. Fetch input values from a remote node]
1. Create a cl_mem buffer for each parameter and for the result:
       cl_mem par0dev = clCreateBuffer(...);
2. Fill the buffers corresponding to the input values:
       clEnqueueWriteBuffer(..., par0dev, ...);
3. Run the kernel:
       cl_kernel task1 = clCreateKernel(...);
       clSetKernelArg(task1, 0, sizeof(cl_mem), &par0dev);
       clEnqueueNDRangeKernel(..., task1, ...);
4. Bring back the modified/result values:
       clEnqueueReadBuffer(..., par0dev, ...);
5. Notify the existence of the created values to enable the execution of data-dependent tasks.
A fuller host-side sketch of steps 1 to 4 follows.
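
For reference, here is a minimal, self-contained host-side sketch of steps 1 to 4 in plain OpenCL C. The context, device and built program are assumed to exist already; the matrix dimension N, the buffer sizes and the omission of error handling are illustrative assumptions, not part of the COMPSs-Mobile runtime:

    #include <CL/cl.h>

    #define N 256  /* illustrative matrix dimension (assumption) */

    /* Runs one multiply task: stage inputs, launch the kernel, read the
     * result back. ctx, dev and prog are created and built elsewhere. */
    int run_multiply(cl_context ctx, cl_device_id dev, cl_program prog,
                     const int *a, const int *b, int *c) {
        cl_int err;
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);
        size_t bytes = (size_t) N * N * sizeof(int);

        /* 1. Create a device buffer per parameter and for the result */
        cl_mem par0dev = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, &err);
        cl_mem par1dev = clCreateBuffer(ctx, CL_MEM_READ_ONLY, bytes, NULL, &err);
        cl_mem resdev  = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, bytes, NULL, &err);

        /* 2. Fill the buffers corresponding to the input values */
        clEnqueueWriteBuffer(q, par0dev, CL_FALSE, 0, bytes, a, 0, NULL, NULL);
        clEnqueueWriteBuffer(q, par1dev, CL_FALSE, 0, bytes, b, 0, NULL, NULL);

        /* 3. Create the kernel, set its arguments and launch it */
        cl_kernel task1 = clCreateKernel(prog, "multiply", &err);
        clSetKernelArg(task1, 0, sizeof(cl_mem), &par0dev);
        clSetKernelArg(task1, 1, sizeof(cl_mem), &par1dev);
        clSetKernelArg(task1, 2, sizeof(cl_mem), &resdev);
        size_t global[2] = { N, N };  /* globalWorkSize = "par0.x,par1.y" */
        clEnqueueNDRangeKernel(q, task1, 2, NULL, global, NULL, 0, NULL, NULL);

        /* 4. Bring back the result; a blocking read also drains the queue */
        clEnqueueReadBuffer(q, resdev, CL_TRUE, 0, bytes, c, 0, NULL, NULL);

        clReleaseKernel(task1);
        clReleaseMemObject(par0dev);
        clReleaseMemObject(par1dev);
        clReleaseMemObject(resdev);
        clReleaseCommandQueue(q);
        return (int) err;
    }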

Kernel Execution Lifecycle Management
Management is twofold. The runtime library directly controls the creation of the OpenCL structures:
- Host memory management
- Fetching data from remote workers
- Allocating the necessary device memory to host the buffers
- Creating the kernel object and setting up its arguments
- Publishing the existence of the produced values to release data dependencies

It delegates the ordering of the OpenCL commands to the queue's out-of-order execution mode:
- clEnqueueWriteBuffer
- clEnqueueNDRangeKernel
- clEnqueueReadBuffer
A sketch of this event-driven ordering follows.
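
A sketch of what that delegation can look like with an out-of-order queue: explicit events, rather than queue order, enforce that the kernel runs after its input transfers and the read-back runs after the kernel. The variable names carry over from the previous sketch and are illustrative:

    /* Out-of-order queue: commands may run as soon as the events in
     * their wait lists have completed (OpenCL 1.2 API). */
    cl_command_queue q = clCreateCommandQueue(
        ctx, dev, CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);

    cl_event wrote[2], ran, done;

    /* The two input transfers may proceed concurrently */
    clEnqueueWriteBuffer(q, par0dev, CL_FALSE, 0, bytes, a, 0, NULL, &wrote[0]);
    clEnqueueWriteBuffer(q, par1dev, CL_FALSE, 0, bytes, b, 0, NULL, &wrote[1]);

    /* The kernel waits only on the two writes it depends on */
    clEnqueueNDRangeKernel(q, task1, 2, NULL, global, NULL, 2, wrote, &ran);

    /* The read-back waits only on the kernel */
    clEnqueueReadBuffer(q, resdev, CL_FALSE, 0, bytes, c, 1, &ran, &done);
    clWaitForEvents(1, &done);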

Optimizations
- Keep values in device memory so that later kernels can reuse them
- Keep track of the events that generate each value
For example, in
    C = matmul(A, B)
    E = matmul(C, D)
C can stay on the device between the two calls, and the second kernel only has to wait on the event that produced it. A hypothetical sketch of such a cache follows.
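
One simple way to realize this (a hypothetical sketch, not the runtime's actual data structures): cache, per device, each datum's buffer together with the event that last wrote it, and consult the cache before staging a transfer:

    /* Hypothetical cache entry: a datum already resident on the device,
     * plus the event that produced its current contents. */
    typedef struct {
        int      data_id;  /* runtime-level identifier of the datum */
        cl_mem   buffer;   /* device-side copy of the value */
        cl_event ready;    /* event that last wrote the buffer */
    } CachedValue;

    /* On a hit, reuse the buffer and add `ready` to the consumer kernel's
     * wait list instead of enqueueing a host-to-device transfer. */
    CachedValue *lookup(CachedValue *cache, int n, int data_id) {
        for (int i = 0; i < n; i++)
            if (cache[i].data_id == data_id)
                return &cache[i];
        return NULL;  /* miss: create the buffer and enqueue the write */
    }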

Executing Platform Selection
- Each task is profiled to measure its execution time and energy consumption
- Using historical values from previous executions, the runtime forecasts:
  - the execution time on the resource
  - the energy consumption of the execution
  - the waiting time before the execution starts
- According to a heuristic defined by the application user and based on these forecasts, the runtime picks one platform (an illustrative heuristic follows)
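
Purely as an illustration (the scoring function is user-defined, and this exact form is an assumption), a heuristic in the spirit of the DynEn policy evaluated below, which accepts 1 ms of extra execution time for every 5 mJ of energy saved, could score the candidate platforms like this:

    /* Hypothetical forecast for running a task on one platform. */
    typedef struct {
        double wait_ms;    /* forecast waiting time before the task starts */
        double exec_ms;    /* forecast execution time */
        double energy_mj;  /* forecast energy consumption, in mJ */
    } Forecast;

    /* DynEn-style score: 1 ms of time is traded for 5 mJ of energy,
     * so energy is weighted at 0.2 ms per mJ. Lower is better. */
    static double dynen_score(const Forecast *f) {
        return f->wait_ms + f->exec_ms + 0.2 * f->energy_mj;
    }

    /* Pick the platform whose forecast minimizes the score. */
    int pick_platform(const Forecast *f, int n) {
        int best = 0;
        for (int i = 1; i < n; i++)
            if (dynen_score(&f[i]) < dynen_score(&f[best]))
                best = i;
        return best;
    }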

Performance Evaluation

Digits Recognition
- A CNN recognizing 512 handwritten digits
- A sequence of 8 steps (tasks) processes all 512 images
- The GPU speeds up the task computation by 6.5x
- The optimizations reduce the runtime overhead from 1,531 ms to 5 ms
- Energy consumption falls from 30 J to 8 J

Canny Edge Detection
- Detects the edges in 30 frames of a video
- Each frame goes through 4 stages (4 tasks)
- Developers normally try to balance the load with either:
  - Data partitioning: assign the processing of a whole frame to a single resource
  - Task partitioning: run the first two tasks of every frame on the GPU and the last two on the CPU
- Two dynamic policies:
  - DynPerf: minimizes the execution time
  - DynEn: accepts slowing the execution down by 1 ms whenever that saves 5 mJ

Canny Edge Detection
- The GPU speeds up the task computation by 12x
- The GPU reduces the energy consumption from 9.89 J to 1.34 J
- Data partitioning achieves lower execution times but consumes more energy than task partitioning
- DynPerf achieves the lowest execution time (375 ms)
- DynEn achieves lower energy consumption, with performance comparable to task partitioning

Conclusions
- The programming model lets developers write parallel applications that run on heterogeneous systems while staying unaware of parallelization and collaboration issues
- Using GPUs drastically reduces both the timespan and the energy consumption of the application
- The dynamic policies provide applications with portability and flexibility

Code is available at:
http://compssdev.bsc.es/gitlab/flordan/waccpd
http://compssdev.bsc.es/gitlab/flordan/compss-mobile

Future Work
- Enable the use of GPUs to compute tasks on remote nodes
- Improve the forecasts for energy and end time
- Allow the dynamic policies to revise the initial decision of which resource runs a task

Thank you! francesc.lordan@bsc.es