A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures


A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures

E. Ayguade (1,2), R.M. Badia (2,4), D. Cabrera (2), A. Duran (2), M. Gonzalez (1,2), F. Igual (3), D. Jimenez (1), J. Labarta (1,2), X. Martorell (1,2), R. Mayo (3), J.M. Perez (2) and E.S. Quintana-Ortí (3)

1 Universitat Politècnica de Catalunya (UPC)
2 Barcelona Supercomputing Center (BSC-CNS)
3 Universidad Jaume I, Castellón
4 Consejo Superior de Investigaciones Científicas (CSIC)

June 3rd, 2009

Outline

1 Motivation
2 Proposal
3 Runtime considerations
4 Conclusions

Motivation: Architecture trends

Current trends in architecture point to heterogeneous systems with multiple accelerators, raising interest in:
- the Cell processor (SPUs)
- GPGPU computing (GPUs)
- OpenCL

Motivation: Heterogeneity problems

In these environments, the user needs to take care of a number of problems:
1 Identify the parts of the problem that are suitable to offload to an accelerator
2 Separate those parts into functions with specific code
3 Compile them with a separate tool-chain
4 Write wrap-up code to offload (and synchronize) the computation
5 Possibly, optimize the function using specific features of the accelerator

Portability becomes an issue.

Motivation: Simple example, blocked matrix multiply (on an SMP)

void matmul(float *A, float *B, float *C)
{
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                C[i*BS+j] += A[i*BS+k] * B[k*BS+j];
}

float *A[NB][NB], *B[NB][NB], *C[NB][NB];

int main(void)
{
    int i, j, k;

    for (i = 0; i < NB; i++)
        for (j = 0; j < NB; j++) {
            A[i][j] = (float *) malloc(BS * BS * sizeof(float));
            B[i][j] = (float *) malloc(BS * BS * sizeof(float));
            C[i][j] = (float *) malloc(BS * BS * sizeof(float));
        }

    for (i = 0; i < NB; i++)
        for (j = 0; j < NB; j++)
            for (k = 0; k < NB; k++)
                matmul(A[i][k], B[k][j], C[i][j]);
}

Motivation: Simple example, blocked matrix multiply (in CUDA)

__global__ void matmul_kernel(float *A, float *B, float *C);

#define THREADS_PER_BLOCK 16

void matmul(float *A, float *B, float *C)
{
    ...
    // allocate device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc((void **) &d_A, BS * BS * sizeof(float));
    cudaMalloc((void **) &d_B, BS * BS * sizeof(float));
    cudaMalloc((void **) &d_C, BS * BS * sizeof(float));

    // copy host memory to device
    cudaMemcpy(d_A, A, BS * BS * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, BS * BS * sizeof(float), cudaMemcpyHostToDevice);

    // set up execution parameters
    dim3 threads(THREADS_PER_BLOCK, THREADS_PER_BLOCK);
    dim3 grid(BS / threads.x, BS / threads.y);

    // execute the kernel
    matmul_kernel<<<grid, threads>>>(d_A, d_B, d_C);

    // copy result from device to host
    cudaMemcpy(C, d_C, BS * BS * sizeof(float), cudaMemcpyDeviceToHost);

    // clean up memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}

Motivation: Simple example, blocked matrix multiply (on the Cell PPE)

void matmul_spe(float *A, float *B, float *C);

void matmul(float *A, float *B, float *C)
{
    for (i = 0; i < num_spus; i++) {
        // Initialize the thread structure and its parameters
        ...
        // Create context
        threads[i].id = spe_context_create(SPE_MAP_PS, NULL);
        // Load program
        rc = spe_program_load(threads[i].id, &matmul_spe);
        // Create thread
        rc = pthread_create(&threads[i].pthread, NULL,
                            &ppu_pthread_function, &threads[i].id);
        // Get thread control area
        threads[i].ctl_area = (spe_spu_control_area_t *)
            spe_ps_area_get(threads[i].id, SPE_CONTROL_AREA);
    }

    // Start SPUs
    for (i = 0; i < spus; i++)
        send_mail(i, 1);

    // Wait for the SPUs to complete
    for (i = 0; i < spus; i++)
        rc = pthread_join(threads[i].pthread, NULL);
}

Motivation: Simple example, blocked matrix multiply (on the Cell SPE)

void matmul_spe(float *A, float *B, float *C)
{
    ...
    while (blocks_to_process()) {
        next_block(i, j, k);
        calculate_address(baseA, A, i, k);
        calculate_address(baseB, B, k, j);
        calculate_address(baseC, C, i, j);

        mfc_get(localA, baseA, sizeof(float) * BS * BS, in_tags, 0, 0);
        mfc_get(localB, baseB, sizeof(float) * BS * BS, in_tags, 0, 0);
        mfc_get(localC, baseC, sizeof(float) * BS * BS, in_tags, 0, 0);
        mfc_write_tag_mask(1 << in_tags);
        mfc_read_tag_status_all();          // Wait for input data

        for (ii = 0; ii < BS; ii++)
            for (jj = 0; jj < BS; jj++)
                for (kk = 0; kk < BS; kk++)
                    localC[ii][jj] += localA[ii][kk] * localB[kk][jj];

        mfc_put(localC, baseC, sizeof(float) * BS * BS, out_tags, 0, 0);
        mfc_write_tag_mask(1 << out_tags);
        mfc_read_tag_status_all();          // Wait for output data
        ...
    }
}

Motivation: Our proposal

Extend OpenMP so that it incorporates the concept of multiple architectures:
- it takes care of separating the different pieces
- it takes care of compiling them adequately
- it takes care of offloading them

The user is still responsible for identifying the interesting parts to offload and for optimizing them for the target.

Motivation: Example, blocked matrix multiply, parallelization for SMP

void matmul(float *A, float *B, float *C)
{
    // original sequential matmul
}

float *A[NB][NB], *B[NB][NB], *C[NB][NB];

int main(void)
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++) {
                #pragma omp task inout([BS][BS] C)
                matmul(A[i][k], B[k][j], C[i][j]);
            }
}


Proposal: The target directive

#pragma omp target device(device-name-list) [clauses]
    omp-task | function-header | function-definition

Clauses:
- copy_in(data-reference-list)
- copy_out(data-reference-list)
- implements(function-name)

Proposal: The device clause

- Specifies that a given task (or function) could be offloaded to any device in the device-list
- Appropriate wrapping code is generated
- The appropriate frontends/backends are used to prepare the outlines
- If not specified, the device is assumed to be smp
- Other devices can be: cell, cuda, opencl, ...
- If a device is not supported, the compiler can ignore it
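As a minimal illustration of these rules (the function scale_block and its arguments are hypothetical, not part of the proposal), a task annotated as below could be offloaded to a CUDA device when that backend is available, while a compiler without CUDA support would simply ignore the cuda entry and fall back to the default smp device:

#pragma omp target device(smp, cuda)
#pragma omp task
scale_block(A[i][j], alpha);   // runs on cuda if supported, on smp otherwise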


Proposal: Moving data

- A common problem is that data needs to be moved into the accelerator memory at the beginning and out of it at the end
- The copy_in and copy_out clauses allow such movements to be specified
- Both take object references (or subobjects) that will be copied to/from the accelerator
- Subobjects can be:
  - Field members: a.b
  - Array elements: a[0], a[10]
  - Array sections: a[2:15], a[:n], a[0:][3:5]
  - Shaping expressions: [N] a, [N][M] a
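A short sketch of how these subobject forms might be combined on a single task; the function scale_rows, the vector v and the shaped pointer p are illustrative assumptions, not taken from the proposal:

// Copy in a section of v and an N x M block pointed to by p;
// copy the (possibly modified) section of v back out afterwards.
#pragma omp target device(smp, cuda) copy_in(v[first:last], [N][M] p) copy_out(v[first:last])
#pragma omp task
scale_rows(v, first, last, p);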

Proposal: Example, blocked matrix multiply

void matmul(float *A, float *B, float *C)
{
    // original sequential matmul
}

float *A[NB][NB], *B[NB][NB], *C[NB][NB];

int main(void)
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++) {
                #pragma omp target device(smp, cell) \
                    copy_in([BS][BS] A, [BS][BS] B, [BS][BS] C) \
                    copy_out([BS][BS] C)
                #pragma omp task inout([BS][BS] C)
                matmul(A[i][k], B[k][j], C[i][j]);
            }
}

Proposal: Device-specific characteristics

- Each device may define other clauses, which will be ignored for other devices
- Each device may define additional restrictions, e.g.:
  - No additional OpenMP constructs
  - No I/O

Proposal: Taskifying functions

- Extend the task construct so it can be applied to functions, a la Cilk
- Each time the function is called, a task is implicitly created
- If preceded by a target directive, the task is offloaded to the appropriate device
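A minimal sketch of the idea; the function add_block and the surrounding loop are hypothetical, for illustration only:

#pragma omp task
void add_block(float *dst, float *src);   // every call to add_block now creates a task

void add_all(float *dst[], float *src[], int nb)
{
    for (int i = 0; i < nb; i++)
        add_block(dst[i], src[i]);   // nb tasks created implicitly
    #pragma omp taskwait             // wait for all of them to finish
}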

Proposal: The implements clause

implements(function-name)

- It denotes that a given function is an alternative implementation of another one
- It allows device-specific optimizations to be implemented for a device
- It uses the function name to relate the alternative implementations

Proposal: Example, blocked matrix multiply

#pragma omp task inout([BS][BS] C)
void matmul(float *A, float *B, float *C)
{
    // original sequential matmul
}

#pragma omp target device(cuda) implements(matmul) \
    copy_in([BS][BS] A, [BS][BS] B, [BS][BS] C) copy_out([BS][BS] C)
void matmul_cuda(float *A, float *B, float *C)
{
    // kernel optimized for cuda
}

// library function
#pragma omp target device(cell) implements(matmul) \
    copy_in([BS][BS] A, [BS][BS] B, [BS][BS] C) copy_out([BS][BS] C)
void matmul_spe(float *A, float *B, float *C);

float *A[NB][NB], *B[NB][NB], *C[NB][NB];

int main(void)
{
    for (int i = 0; i < NB; i++)
        for (int j = 0; j < NB; j++)
            for (int k = 0; k < NB; k++)
                matmul(A[i][k], B[k][j], C[i][j]);
}

Runtime considerations

Scheduling:
- The runtime chooses which of the different alternatives to use, ideally taking resources into account
- If all the suitable resources are busy, the task waits until one is available
- If all the possible devices are unsupported, a runtime error is generated

Optimizing data transfers:
- The runtime can try to optimize data movement by using double buffering or pre-fetch mechanisms
- It can also use that information for scheduling, e.g. schedule tasks that use the same data on the same device
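A purely illustrative, self-contained sketch of the data-affinity part of such a scheduling policy; the device structure and the fields used here are assumptions made for the example, not part of the proposal or of any runtime:

#include <stddef.h>

typedef struct { int busy; int holds_task_data; } device;

/* Pick a device for a ready task: prefer an idle device that already holds
   the task's copy_in data (so transfers can be skipped or overlapped),
   otherwise take any idle device, otherwise make the task wait (NULL). */
static device *choose_device(device *devs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (!devs[i].busy && devs[i].holds_task_data)
            return &devs[i];
    for (size_t i = 0; i < n; i++)
        if (!devs[i].busy)
            return &devs[i];
    return NULL;
}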

Runtime considerations: Current status

- We have separate prototype implementations for SMP, Cell and GPU
- They take care of task offloading, task synchronization and data movement
- They use specific optimizations for each platform

Conclusions

Our proposal allows tagging tasks and functions for execution on a device. It takes care of:
- task offloading
- task synchronization
- data movement

It allows writing code that is portable across multiple environments, and users can still use (or develop) code optimized for the devices.

Conclusions: Future work

- Actually, fully implement it :-)
  - We have several specific implementations
  - We lack one that is able to exploit multiple devices at the same time
- Implement the OpenCL device

The End

Thanks for your attention!

Conclusions: Some results

[Figure: Cholesky factorization performance (GFLOPS vs. matrix size). Left: on 4 GPUs, GPUSs compared with CUBLAS. Right: on 8 SPUs, CellSs compared with a hand-coded version using static scheduling.]