An Extension of the StarSs Programming Model for Platforms with Multiple GPUs


Eduard Ayguadé (2), Rosa M. Badia (2), Francisco Igual (1), Jesús Labarta (2), Rafael Mayo (1), Enrique S. Quintana-Ortí (1)

(1) Departamento de Ingeniería y Ciencia de los Computadores, University Jaume I, Castellón (Spain)
(2) Barcelona Supercomputing Center - Centro Nacional de Supercomputación, Barcelona (Spain)

Motivation (I). The past: emergence of hardware accelerators

Hardware accelerators (especially GPUs) have become a real solution for HPC.
- Hardware: manycore systems on chip (up to 240 cores on modern Nvidia GPUs).
- Software: the problem is solved with high-level programming and execution models (e.g., Nvidia CUDA, Brook+, OpenCL).

Motivation (II). The present: heterogeneous multi-accelerator systems

One accelerator is not always enough for many applications, and different accelerators are suited to different applications, so multi-accelerator systems are the next step.
- Hardware: Nvidia Tesla series, multiple ClearSpeed boards per system, hybrid architectures, ...
- Software: the problem is not solved yet:
  - Big code modifications are required to port sequential code.
  - Scheduling is manual.
  - The user has to know the best accelerator for each part of the application.

Motivation (III). The future: heterogeneous multi-accelerator systems (on-chip)

The number of cores keeps increasing, so the programmability problem must be addressed as soon as possible.
- Hardware: Larrabee, AMD Fusion, ...
- Software: will determine the success or failure of novel architectures.

Outline

1 Introduction
2 The StarSs programming model. New extensions
3 The GPUSs framework
4 Experimental results
5 Conclusions and future work


Introduction. StarSs

The StarSs programming model addresses the programmability problem by exploiting task-level parallelism. It consists of:
- A few OpenMP-like pragmas identifying tasks in the user code.
- A source-to-source compiler.
- A runtime system adapted to the underlying architecture.

Several instantiations of StarSs have been developed (CellSs, SMPSs, GridSs), each targeting one specific architecture.

Introduction. GPUSs

Our proposal: GPUSs, an instantiation of the StarSs programming model targeting heterogeneous multi-accelerator platforms.

1 Heterogeneity: the target architecture is a heterogeneous multi-accelerator system.
2 Separate memory spaces: the user does not have to deal with the separate memory space of each accelerator.
3 Simplicity: only a few pragmas must be added to the sequential user code to port it to the multi-accelerator system.
4 Portability: it can be easily ported to other, similar architectures based on multiple accelerators.


The StarSs programming model

Automatic parallelization of sequential applications:
- The runtime system makes efficient use of the available resources (e.g., GPUs) in parallel.
- The user annotates the pieces of code that will be executed on a GPU (tasks).
- The runtime extracts parallelism by building a data dependency graph.

(Figure: the user adds #pragma css task annotations to functions such as void task1( float *A ) in the sequential code; the runtime builds the task dependency graph (TDG) and executes the tasks on a Tesla system.)

Proposed extensions to the StarSs programming model

GPUSs provides OpenMP-like constructs to annotate code:
- To identify a unit of work, or task: #pragma css task
- To select the execution device: #pragma css target device

Defining tasks: the task clause

Taskifying functions:

    #pragma css task [clause-list]
    function-header | function-definition

The task clause denotes a function that is always executed as a task. Whenever the program calls a function annotated in this way, the runtime creates an explicit task.

Identifying the directionality of the arguments:

    #pragma css task input( parameter ) output( parameter ) inout( parameter )
    function-header | function-definition

The input, output and inout clauses denote the directionality of each argument. They are used by the runtime to track dependencies among tasks and to manage data transfers.
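As a minimal illustration (not from the slides; the function name, the BS block-size constant and the alpha parameter are assumptions), a task that reads one block and updates another in place could be annotated like this:

    // Illustrative sketch: x is only read (input), y is read and written (inout).
    // BS is an assumed compile-time block size; alpha is passed by value.
    #pragma css task input( x[BS] ) inout( y[BS] )
    void saxpy_block( float alpha, float *x, float *y )
    {
        for ( int i = 0; i < BS; i++ )
            y[i] += alpha * x[i];
    }

With these annotations, a task writing y must wait for any earlier task that reads or writes y, while two tasks that only read the same block can run concurrently.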

Specifying target devices: the target clause

    #pragma css target device( device-name-list ) [clause-list]
    function-header | function-definition

The target construct specifies that the execution of a task can be offloaded to a given device. The target device is specified in device-name-list. When a task becomes ready, the runtime can choose among the available targets to decide where to execute the task.

Managing heterogeneity: the implements clause

The implements clause is used to specify alternative implementations for a function. Example:

    #pragma css task
    void matmul( float *A, float *B, float *C );

    #pragma css target device( cuda ) implements( matmul )
    void matmul_cuda( float *A, float *B, float *C ) {
        // tuned version for a CUDA-compatible device
    }

    #pragma css target device( smp ) implements( matmul )
    void matmul_smp( float *A, float *B, float *C ) {
        // tuned version for an SMP device
    }

Example: the matrix-matrix multiplication

Parallelizing the matrix-matrix multiplication:

    #pragma css task input( A[BS][BS], B[BS][BS] ) inout( C[BS][BS] )
    #pragma css target device( cuda )
    void matmul( float *A, float *B, float *C ) {
        // tuned CUDA code for the matmul
    }

    float *A[NB][NB], *B[NB][NB], *C[NB][NB];   // NB x NB grid of BS x BS blocks

    int main( void ) {
        for ( int i = 0; i < NB; i++ )
            for ( int j = 0; j < NB; j++ )
                for ( int k = 0; k < NB; k++ )
                    matmul( A[i][k], B[k][j], C[i][j] );
    }

Example: the Cholesky factorization

The Cholesky factorization of a dense SPD matrix $A \in \mathbb{R}^{n \times n}$ is defined as $A = LL^T$, where $L \in \mathbb{R}^{n \times n}$ is a lower triangular matrix.

Blocked algorithm:
- Chol_spotrf
- Chol_trsm
- Chol_upd: Chol_gemm, Chol_syrk

Sequential Cholesky factorization

    void Cholesky( float *A[], int ts, int nt )
    {
        for ( int k = 0; k < nt; k++ ) {
            chol_spotrf( A[k*nt + k], ts );                  // factorize diagonal block

            for ( int i = k + 1; i < nt; i++ )               // triangular solves
                chol_strsm( A[k*nt + k], A[k*nt + i], ts );

            // update trailing submatrix
            for ( int i = k + 1; i < nt; i++ ) {
                for ( int j = k + 1; j < i; j++ )
                    chol_sgemm( A[k*nt + i], A[k*nt + j], A[j*nt + i], ts );
                chol_ssyrk( A[k*nt + i], A[i*nt + i], ts );
            }
        }
    }

    int main( void )
    {
        float *A[nt*nt];
        ...
        Cholesky( A, ts, nt );    // compute the Cholesky factor
    }

Taskifying the Cholesky factorization

Each block function can be converted into a task:

    // spotrf task
    #pragma css task inout( A[NT][NT] )
    void chol_spotrf( float *A ) {
        spotrf( "Lower", &ts, A, &ts, &info );
    }

    // sgemm task
    #pragma css task input( A[NT][NT], B[NT][NT] ) inout( C[NT][NT] )
    void chol_sgemm( float *A, float *B, float *C ) {
        sgemm( "N", "T", &ts, &ts, &ts, -1.0, A, &ts, B, &ts, 1.0, C, &ts );
    }

    // ssyrk task
    #pragma css task input( A[NT][NT] ) inout( C[NT][NT] )
    void chol_ssyrk( float *A, float *C ) {
        ssyrk( "L", "N", &ts, &ts, -1.0, A, &ts, 1.0, C, &ts );
    }

    // strsm task
    #pragma css task input( T[NT][NT] ) inout( B[NT][NT] )
    void chol_strsm( float *T, float *B ) {
        strsm( "R", "L", "T", "N", &ts, &ts, 1.0, T, &ts, B, &ts );
    }

Specifying the target device for each task

By default, each task is executed on the SMP device unless the target clause is given. Example: chol_spotrf can be executed on a CUDA-capable device:

    #pragma css task inout( A[NT][NT] ) target device( cuda )
    void chol_spotrf( float *A ) {
        // CUDA kernel for the Cholesky factorization
    }

Specifying multiple implementations for each task

Multiple implementations of chol_spotrf can be given:

    #pragma css task inout( A[NT][NT] )
    void chol_spotrf( float *A );

    #pragma css task inout( A[NT][NT] ) target device( cuda ) implements( chol_spotrf )
    void chol_spotrf_cuda( float *A ) {
        // CUDA kernel for the Cholesky factorization
    }

    #pragma css task inout( A[NT][NT] ) target device( smp ) implements( chol_spotrf )
    void chol_spotrf_smp( float *A ) {
        // SMP routine for the Cholesky factorization
    }


The GPUSs framework: the target architecture

A typical multi-accelerator system:
- A host with main memory.
- Devices, each with its own local device memory.
- Communication through the PCIExpress bus.
- No direct device-device communication: all communication goes through main memory.

(Figure: a host with main memory connected through the PCIExpress bus to several devices, each with its own device memory.)
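Because the devices cannot talk to each other directly, moving a block between two GPUs has to be staged through main memory. A minimal CUDA sketch of that pattern (illustrative only; the function and buffer names are assumptions, not GPUSs runtime code):

    // Staging a block of n floats from GPU 0 to GPU 1 through host memory;
    // d_src, d_dst and h_buf are assumed to be valid allocations.
    #include <stddef.h>
    #include <cuda_runtime.h>

    void transfer_via_host( const float *d_src, float *d_dst,
                            float *h_buf, size_t n )
    {
        cudaSetDevice( 0 );                             // select source GPU
        cudaMemcpy( h_buf, d_src, n * sizeof(float),
                    cudaMemcpyDeviceToHost );           // device 0 -> host
        cudaSetDevice( 1 );                             // select destination GPU
        cudaMemcpy( d_dst, h_buf, n * sizeof(float),
                    cudaMemcpyHostToDevice );           // host -> device 1
    }

Every data movement therefore crosses the PCIExpress bus twice, which is why the runtime tries hard to avoid redundant transfers (see the locality discussion below).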

The GPUSs runtime: overview

Many features are inherited from the CellSs and SMPSs runtimes. There are two main modules:
1 Execution of the annotated user code, task generation and scheduling.
2 Data movements and task execution.

The GPUSs runtime: structure

1 A master thread:
  - Executes the user code.
  - Intercepts calls to annotated functions.
  - Generates tasks and inserts them in a Task Dependency Graph (TDG).
2 A helper thread:
  - Consumes tasks from the TDG as the GPUs become idle.
  - Maps tasks to the most suitable device.
  - Intercepts finalization signals from the worker threads.
3 A set of worker threads (a sketch of the worker loop follows below):
  - Wait for available tasks.
  - Perform the necessary data transfers from RAM to the GPU.
  - Invoke the task call on the GPU.
  - Retrieve the results (if necessary).
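A rough sketch of the worker-thread logic just described; every helper routine below is hypothetical, named only to mirror the steps on the slide, so this illustrates the structure rather than the actual GPUSs implementation:

    // Illustrative worker-thread loop; all runtime_* / task helpers are assumed.
    #include <stddef.h>
    #include <cuda_runtime.h>

    typedef struct task task_t;                    // opaque task descriptor (assumed)

    extern int     runtime_finished( void );       // hypothetical runtime queries
    extern task_t *pop_ready_task( int gpu_id );   // blocks until a task is mapped here
    extern void    copy_inputs_to_gpu( task_t *t, int gpu_id );
    extern void    launch_task_kernel( task_t *t, int gpu_id );
    extern void    copy_outputs_to_host( task_t *t, int gpu_id );
    extern void    signal_task_finished( task_t *t );

    void *worker_thread( void *arg )
    {
        int gpu_id = *(int *) arg;
        cudaSetDevice( gpu_id );                   // bind this thread to one GPU

        while ( !runtime_finished() ) {
            task_t *t = pop_ready_task( gpu_id );  // wait for an available task
            if ( t == NULL ) continue;
            copy_inputs_to_gpu( t, gpu_id );       // RAM -> device transfers
            launch_task_kernel( t, gpu_id );       // invoke the task on the GPU
            copy_outputs_to_host( t, gpu_id );     // retrieve results (if necessary)
            signal_task_finished( t );             // notify the helper thread
        }
        return NULL;
    }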

The GPUSs runtime: locality exploitation

- Host and device memories form a two-level memory hierarchy: data is transferred to device memory prior to any task execution and transferred back after execution.
- The local memory of each GPU is treated as a cache storing recently used data blocks: a software cache with memory coherence policies (write-invalidate, write-back).
- The runtime keeps a memory map of each accelerator's cache; this information can be used to improve the mapping of tasks to resources.
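One possible shape for the per-GPU software cache and its write-invalidate policy; the data structure and the MAX_BLOCKS bound are a simplified sketch under stated assumptions, not the runtime's actual code:

    // Illustrative write-invalidate bookkeeping for the per-GPU software cache.
    #include <stddef.h>

    #define MAX_BLOCKS 1024           // assumed capacity of one GPU cache

    typedef struct {
        void *host_addr;              // block address in main memory
        void *dev_addr;               // cached copy in device memory
        int   dirty;                  // written on this GPU (write-back pending)
    } cache_entry_t;

    // After GPU g writes a block, drop every other GPU's cached copy so a
    // later task on those GPUs re-fetches the fresh data through main memory.
    void write_invalidate( cache_entry_t cache[][MAX_BLOCKS], int num_gpus,
                           int g, void *host_addr )
    {
        for ( int d = 0; d < num_gpus; d++ ) {
            if ( d == g ) continue;
            for ( int b = 0; b < MAX_BLOCKS; b++ )
                if ( cache[d][b].host_addr == host_addr )
                    cache[d][b].host_addr = NULL;   // invalidate stale copy
        }
    }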

The GPUSs runtime: additional features

- The number of accelerators can be defined at runtime.
- Paraver traces can be generated to analyze performance.
- Hybrid CPU/GPU execution of tasks.
- Ported to a system with multiple ClearSpeed boards.


Experimental setup

    CPU:                  Dual Xeon QuadCore E5440
    CPU frequency:        2.83 GHz
    RAM memory:           16 GBytes
    GPU:                  Tesla S1070
    Graphics processors:  4 x GT200
    GPU frequency:        1.3 GHz
    Video memory:         4 GBytes per GPU
    Interconnection:      PCIExpress Gen2

    CUDA version:         2.0
    MKL version:          10.0.1
    Driver version:       185.18

Performance is measured in GFLOPS.

Experimental results: Cholesky factorization

(Figure: Cholesky factorization with the GPUSs runtime; GFLOPS vs. matrix size, up to 20480. Curves: GPUSs with cached write-back on 4 GPUs, GPUSs basic implementation on 4 GPUs, MKL spotrf on a dual Intel Xeon (8 cores), and CUBLAS-based Cholesky on 1 GPU.)

Tasks are executed exclusively on the GPUs (single precision). The software cache yields an important improvement.

Experimental results: scalability

(Figure: Cholesky factorization speed-up of GPUSs on 1, 2, 3 and 4 GPUs vs. matrix size, up to 20480.)

The PCIExpress bus becomes the main bottleneck as the number of GPUs increases.

Experimental results: GEMM

(Figure: performance of the matrix-matrix product (C = C + A*B) on the Tesla S1070; GFLOPS vs. matrix size (m = n = k), up to 20480. Curves: GPUSs on 4 GPUs and CUBLAS on 1 GPU.)

Performance approaches 1.1 TFLOPS on 4 GPUs, versus 346 GFLOPS on 1 GPU using CUBLAS.

Experimental results on ClearSpeed boards

(Figure: performance of the matrix-matrix product (C = C + A*B) in double precision; GFLOPS vs. matrix size (m = n = k), up to 12288. Curves: GPUSs on 4 CSX700 boards and GPUSs on 4 GPUs.)

In double precision, performance on 4 CSX700 boards is similar to that on 4 GPUs.


Conclusions and future work

Conclusions:
- The StarSs programming model is versatile and extensible to new architectures.
- Programmability will determine the success of emerging architectures.
- Our approach relies on a runtime system, requiring little user intervention.
- Many ideas can be applied to other multi-accelerator systems.

Future work:
- More complex scheduling strategies.
- Porting to other multi-accelerator platforms.
- Porting to heterogeneous multi-accelerator platforms: let the runtime automatically decide where to execute each task.

Questions?

Related work

- SuperMatrix: an extension of the SuperMatrix SMP runtime; automatic parallelization of linear algebra programs; hybrid CPU / multi-GPU systems.
- Volkov et al.: some highly tuned codes for multi-GPU systems; linear algebra codes; no runtime or automatic scheduling.
- Lee et al.: a compiler framework for automatic translation and optimization; OpenMP-to-GPU translation.
- StarPU: a runtime system for task scheduling on heterogeneous multicore and multi-GPU machines.

Tesla vs. Cell B.E.

Similarities with the Cell B.E.:
- Heterogeneous architectures: Cell B.E. = 1 PPE + 8 SPEs; Tesla = 1 (multicore) CPU + 4 GPUs.
- Each accelerator has its own local memory pool.
- Fast interconnection network.

Differences with the Cell B.E.:
- GPUs need more granularity to attain good performance.
- The PPE's performance is poor compared to that of the SPEs.
- Each GPU has a much larger local memory (GBytes) than each SPE (KBytes).
- Different impact of data transfers (PCIExpress vs. EIB).
- GPUs are passive elements: no system threads can be run on them.

Specifying data movements

Some additional clauses can be used with the device pragma:

    copy_in( data-reference-list )
    copy_out( data-reference-list )

These clauses specify data movement for the shared variables inside a task:
- copy_in moves variables from host to device memory once the task is ready for execution.
- copy_out moves variables from device to host memory once the task finishes execution.
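A hypothetical combination of these clauses with the earlier matmul task (the block size BS and the exact clause placement are illustrative assumptions):

    // Illustrative: stage the blocks into device memory when the task becomes
    // ready, and copy the updated C block back when the task finishes.
    #pragma css task input( A[BS][BS], B[BS][BS] ) inout( C[BS][BS] )
    #pragma css target device( cuda ) copy_in( A[BS][BS], B[BS][BS], C[BS][BS] ) copy_out( C[BS][BS] )
    void matmul( float *A, float *B, float *C )
    {
        // tuned CUDA code for the block product
    }

Note that C appears in copy_in as well as copy_out: an inout block must be present in device memory before the kernel can update it.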