SuperMatrix on Heterogeneous Platforms. Jianyu Huang SHPC, UT Austin


SuperMatrix on Heterogeneous Platforms Jianyu Huang SHPC, UT Austin 1

How Heterogeneous? 2

How Many Languages? 3

Question! 4

FLAME Answer: SuperMatrix. The stack: libflame + SuperMatrix on top of clBLAS (OpenCL) and cuBLAS (CUDA) for the GPU; ACML (OpenMP/pthreads; C/assembly/Fortran), MKL (OpenMP/pthreads; C/Fortran/assembly), and BLIS (OpenMP/pthreads; C/assembly) for the CPU/MIC; and BLIS (OpenMP/pthreads; C/assembly) for accelerators and other platforms. Programmability: use the tools provided by FLAME. Parallelism: directed acyclic graph (DAG) scheduling. 5

FLAME Answer: SuperMatrix.
Chan, E., Quintana-Ortí, E. S., Quintana-Ortí, G., and van de Geijn, R. SuperMatrix out-of-order scheduling of matrix operations for SMP and multi-core architectures. In SPAA'07: Proceedings of the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures, pages 116-125, San Diego, CA, USA, June 9-11, 2007.
Chan, E., Van Zee, F. G., Quintana-Ortí, E. S., Quintana-Ortí, G., and van de Geijn, R. Satisfying your dependencies with SuperMatrix. In Cluster'07: Proceedings of the 2007 IEEE International Conference on Cluster Computing, pages 91-99, Austin, TX, USA, September 17-20, 2007.
Chan, E., Van Zee, F. G., Bientinesi, P., Quintana-Ortí, E. S., Quintana-Ortí, G., and van de Geijn, R. SuperMatrix: A multithreaded runtime scheduling system for algorithms-by-blocks. In PPoPP'08: Proceedings of the Thirteenth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages 123-132, Salt Lake City, UT, USA, February 20-23, 2008.
Quintana-Ortí, G., Igual, F. D., Quintana-Ortí, E. S., and van de Geijn, R. Solving dense linear systems on platforms with multiple hardware accelerators. In PPoPP'09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009.
Quintana-Ortí, G., Quintana-Ortí, E. S., van de Geijn, R., Van Zee, F. G., and Chan, E. Programming matrix algorithms-by-blocks for thread-level parallelism. ACM Transactions on Mathematical Software, 36(3):14:1-14:26, July 2009.
Chan, E. Application of Dependence Analysis and Runtime Data Flow Graph Scheduling to Matrix Computations. Ph.D. dissertation, Department of Computer Science, The University of Texas at Austin.
Quintana-Ortí, G., Igual, F. D., Marqués, M., Quintana-Ortí, E. S., and van de Geijn, R. A runtime system for programming out-of-core matrix algorithms-by-tiles on multithreaded architectures. ACM Transactions on Mathematical Software (TOMS) 38, no. 4 (2012): 25. 6

Parallel?
S0: D := A * B
S1: A = L * L^T (Cholesky factorization; A is overwritten by its factor L)
S2: B := B * L^-T
S3: C := C - B * B^T
S4: T := L^-1 * T
Can the code be parallelized? The dependences are: Write After Read (S0, S1); Read After Write (S0, S1); Read After Write (S1, S2); Read After Write (S2, S3); Read After Write (S1, S4). Are you sure S1 and S2 cannot be parallelized? 7

Parallel? [DAG figure relating the operands D, A/L, B, C, and T of S0-S4 through the dependences above.] How to parallelize? 8

Traditional Library Approach
S0: D := A * B        ->  S0: ParGemm(A, B, D)
S1: A = L * L^T       ->  S1: L = ParPotrf(A)
S2: B := B * L^-T     ->  S2: ParTrsm(L, B)
S3: C := C - B * B^T  ->  S3: ParSyrk(B, C)
S4: T := L^-1 * T     ->  S4: ParTrsm(L, T)
How to parallelize? 9

Traditional Library Approach, implemented with libflame and BLIS
S0: D := A * B;  S1: A = L * L^T;  S2: B := B * L^-T;  S3: C := C - B * B^T;  S4: T := L^-1 * T
/*-----------------------------------------------*/
FLA_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE, FLA_ONE, A, B, FLA_ZERO, D );
FLA_Chol( FLA_LOWER_TRIANGULAR, A );
FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A, B );
FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, B, FLA_ONE, C );
FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, L, T );
/*-----------------------------------------------*/
Supported by parallel BLAS/LAPACK (multithreaded BLIS). 10

Problem for Fine-grained Parallelism. Fine-grained parallelism: libflame on top of BLIS, which is parallelized with pthreads/OpenMP. 11

Problem for Fine-grained Parallelism. Synchronization-point overhead; not suited to scenarios with multiple devices. Fine-grained parallelism: libflame on top of BLIS (pthreads/OpenMP). 11

Problem for Fine-grained Parallelism. Synchronization-point overhead; not suited to scenarios with multiple devices. Coarse-grained parallelism: libflame itself parallelized with pthreads/OpenMP, each thread calling its own BLIS instance. This introduces parallelism across instructions and suits platforms with multiple computation units. 11


Coarse-grained Parallelism. Coarse-grained parallelism: libflame with SuperMatrix scheduling multiple (sequential) BLIS instances, versus fine-grained parallelism: libflame on top of BLIS parallelized with pthreads/OpenMP. SuperMatrix introduces parallelism across instructions and suits platforms with multiple computation units. 12
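SuperMatrix extracts this coarse-grained, inter-instruction parallelism automatically. Purely as a point of comparison (this is not the FLAME mechanism), the same DAG could be written down by hand with OpenMP task dependences; the par_* wrappers below are hypothetical stand-ins for the parallel kernels, stubbed so the sketch compiles.

#include <omp.h>

/* Stand-ins for the ParGemm/ParPotrf/ParTrsm/ParSyrk kernels on the slide
   (stubbed here; a real version would call BLAS/LAPACK).                  */
static void par_gemm  (const double *A, const double *B, double *D) { (void)A; (void)B; (void)D; } /* D := A*B     */
static void par_potrf (double *A)                                   { (void)A; }                   /* A := Chol(A) */
static void par_trsm_r(const double *L, double *B)                  { (void)L; (void)B; }          /* B := B*L^-T  */
static void par_syrk  (const double *B, double *C)                  { (void)B; (void)C; }          /* C := C-B*B^T */
static void par_trsm_l(const double *L, double *T)                  { (void)L; (void)T; }          /* T := L^-1*T  */

void chain(double *A, double *B, double *C, double *D, double *T)
{
    #pragma omp parallel
    #pragma omp single
    {
        /* Each task declares what it reads/writes; the OpenMP runtime then
           derives the same DAG that SuperMatrix builds automatically.       */
        #pragma omp task depend(in: A[0], B[0]) depend(out: D[0])
        par_gemm(A, B, D);                 /* S0 */
        #pragma omp task depend(inout: A[0])
        par_potrf(A);                      /* S1: A now holds L          */
        #pragma omp task depend(in: A[0]) depend(inout: B[0])
        par_trsm_r(A, B);                  /* S2: after S0 and S1        */
        #pragma omp task depend(in: B[0]) depend(inout: C[0])
        par_syrk(B, C);                    /* S3: after S2               */
        #pragma omp task depend(in: A[0]) depend(inout: T[0])
        par_trsm_l(A, T);                  /* S4: independent of S2, S3  */
        #pragma omp taskwait
    }
}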

SuperMatrix Approach
S0: D := A * B;  S1: A = L * L^T;  S2: B := B * L^-T;  S3: C := C - B * B^T;  S4: T := L^-1 * T
[DAG figure relating the operands D, A/L, B, C, and T of S0-S4.]
How to parallelize? 13

SuperMatrix Approach
S0: D := A * B;  S1: A = L * L^T;  S2: B := B * L^-T;  S3: C := C - B * B^T;  S4: T := L^-1 * T
How to parallelize? Partitioning/Algorithm-by-blocks! 14-16

SuperMatrix Approach Construct the DAG across the instructions automatically No need to annotate the task dependencies manually! 17

Traditional Library Approach, implemented with libflame and BLIS
S0: D := A * B;  S1: A = L * L^T;  S2: B := B * L^-T;  S3: C := C - B * B^T;  S4: T := L^-1 * T
/*-----------------------------------------------*/
FLA_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE, FLA_ONE, A, B, FLA_ZERO, D );
FLA_Chol( FLA_LOWER_TRIANGULAR, A );
FLA_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A, B );
FLA_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, B, FLA_ONE, C );
FLA_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, L, T );
/*-----------------------------------------------*/
Supported by parallel BLAS/LAPACK (multithreaded BLIS). 18

SuperMatrix Approach, implemented with libflame and BLIS
S0: D := A * B;  S1: A = L * L^T;  S2: B := B * L^-T;  S3: C := C - B * B^T;  S4: T := L^-1 * T
/*-----------------------------------------------*/
FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE, FLA_ONE, A, B, FLA_ZERO, D );
FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A, B );
FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, B, FLA_ONE, C );
FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, L, T );
/*-----------------------------------------------*/
19
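The FLASH_ calls above are essentially all that changes at the application level; the runtime configuration around them is a few extra lines. Below is a minimal driver sketch, assuming the FLASH object and queue interfaces described in the libflame reference (FLASH_Obj_create, FLASH_Queue_set_num_threads, FLASH_Queue_enable/disable); exact names, signatures, and flushing behavior may differ between libflame versions, so treat this as illustrative rather than authoritative.

#include "FLAME.h"

int main( void )
{
    dim_t   n = 4096, b = 1024;          /* problem size and storage blocksize (assumed) */
    FLA_Obj A, B, C, D, T;

    FLA_Init();

    /* Hierarchical (blocked) objects with one level of b x b blocks. */
    FLASH_Obj_create( FLA_DOUBLE, n, n, 1, &b, &A );
    FLASH_Obj_create( FLA_DOUBLE, n, n, 1, &b, &B );
    FLASH_Obj_create( FLA_DOUBLE, n, n, 1, &b, &C );
    FLASH_Obj_create( FLA_DOUBLE, n, n, 1, &b, &D );
    FLASH_Obj_create( FLA_DOUBLE, n, n, 1, &b, &T );
    /* ... fill A (SPD), B, C, T with application data ... */

    FLASH_Queue_set_num_threads( 8 );    /* worker threads for the DAG               */
    FLASH_Queue_enable();                /* subsequent FLASH calls enqueue tasks     */

    FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE, FLA_ONE, A, B, FLA_ZERO, D );
    FLASH_Chol( FLA_LOWER_TRIANGULAR, A );     /* A now holds its Cholesky factor L  */
    FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A, B );
    FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, B, FLA_ONE, C );
    FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG,
                FLA_ONE, A, T );

    FLASH_Queue_disable();               /* flush and execute the queued DAG
                                            (behavior per the libflame reference)    */

    FLASH_Obj_free( &A ); FLASH_Obj_free( &B ); FLASH_Obj_free( &C );
    FLASH_Obj_free( &D ); FLASH_Obj_free( &T );
    FLA_Finalize();
    return 0;
}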

Free Lunch for Both Programmability and Performance! From libflame manual, 2011 20

The original SuperMatrix primarily targets multi-core shared-memory systems. 21

HPC Heterogeneous Platforms. [Figure: a matrix distributed across CPU and GPU/accelerator devices connected over PCIe.] 22

Challenges in Heterogeneous Platforms!
S0: D := A * A        ->  S0: ParGemm(A, A, D)
S1: A = L * L^T       ->  S1: L = ParPotrf(A)
S2: B := B * L^-T     ->  S2: ParTrsm(L, B)
S3: C := C - B * B^T  ->  S3: ParSyrk(B, C)
S4: T := L^-1 * T     ->  S4: ParTrsm(L, T)
What if there is one accelerator in your system? 23

Challenges in Heterogeneous Platforms!
/*-----------------------------*/
Memcpy(A, hA); Memcpy(D, hD); Memcpy(B, hB); Memcpy(C, hC); Memcpy(T, hT);
/*-----------------------------*/
S0: ParGemm(A, A, D)
S1: L = ParPotrf(A)
S2: ParTrsm(L, B)
S3: ParSyrk(B, C)
S4: ParTrsm(L, T)
/*-----------------------------*/
Memcpy(hT, T);
/*-----------------------------*/
What if there is one accelerator in your system? What if there are 4 GPUs and 8 CPU cores? 24

Adapting the Original SuperMatrix to Heterogeneous Platforms: Software Cache / Heterogeneous Scheduler / Asynchronous Memory Copy / Worker / Task Performance Model. 25

Naïve Approach. Transfer data (A and B) from host to device over PCIe before execution; execute the task on the device; transfer the result (C) from device to host after execution. No data reuse on the devices! 26
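For concreteness, the naïve per-task pattern sketched on this slide looks roughly like the following on a single GPU, here for one GEMM task using the CUDA runtime and the cuBLAS v2 API (error handling omitted; the function and variable names are illustrative). Every task pays a host-to-device copy of its operands and a device-to-host copy of its result, even when the next task reuses the same blocks.

#include <cublas_v2.h>
#include <cuda_runtime.h>

/* Naive offload of one GEMM task, D := A*B (n x n, column-major doubles):
 * copy in, compute, copy out -- no data reuse across tasks.               */
void naive_gemm_task(const double *hA, const double *hB, double *hD, int n)
{
    size_t bytes = (size_t)n * n * sizeof(double);
    double *dA, *dB, *dD;
    cublasHandle_t handle;
    const double one = 1.0, zero = 0.0;

    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dD, bytes);
    cublasCreate(&handle);

    /* 1. Transfer operands from host to device before execution. */
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    /* 2. Execute the task on the device. */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &one, dA, n, dB, n, &zero, dD, n);

    /* 3. Transfer the result from device to host after execution. */
    cudaMemcpy(hD, dD, bytes, cudaMemcpyDeviceToHost);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dD);
}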

Software Cache. No need to transfer data from host to device before execution if the data is already on the device; no need to transfer data from device to host after execution if the data is not required by the host immediately. Quintana-Ortí, G., et al. "Solving dense linear systems on platforms with multiple hardware accelerators." In PPoPP'09: Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009. 27
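In other words, the software cache is a per-device table that remembers which blocks already have a valid copy on that device, so a PCIe transfer happens only on a miss, and results are written back lazily. A minimal sketch of such a table follows; the names and structure are illustrative, not libflame's actual implementation, which also handles eviction, write-back, and multiple devices.

#include <stddef.h>
#include <string.h>

#define MAX_BLOCKS 1024

/* One entry per host block cached on a device: is a valid copy resident,
 * and has the device copy been modified since the last write-back?        */
typedef struct {
    const void *host_block;   /* identity of the block (its host pointer) */
    void       *dev_copy;     /* device buffer holding the copy           */
    int         valid;
    int         dirty;
} cache_entry_t;

typedef struct {
    cache_entry_t entry[MAX_BLOCKS];
    int           n;
} soft_cache_t;

/* Look up a host block; create an empty entry if it is not tracked yet.
 * (No eviction in this sketch: at most MAX_BLOCKS distinct blocks.)       */
static cache_entry_t *cache_lookup(soft_cache_t *c, const void *host_block)
{
    for (int i = 0; i < c->n; i++)
        if (c->entry[i].host_block == host_block)
            return &c->entry[i];
    cache_entry_t *e = &c->entry[c->n++];
    memset(e, 0, sizeof *e);
    e->host_block = host_block;
    return e;
}

/* Called before a task reads a block on the device: transfer only on a miss. */
void *cache_acquire(soft_cache_t *c, const void *host_block, size_t bytes,
                    void *(*dev_alloc)(size_t),
                    void (*copy_h2d)(void *dst, const void *src, size_t n))
{
    cache_entry_t *e = cache_lookup(c, host_block);
    if (!e->valid) {
        if (!e->dev_copy) e->dev_copy = dev_alloc(bytes);
        copy_h2d(e->dev_copy, host_block, bytes);   /* PCIe transfer on miss only */
        e->valid = 1;
    }
    return e->dev_copy;
}

/* Called after a task writes a block: mark it dirty, but do NOT copy it back
 * yet; the write-back happens only when the host actually needs the block.   */
void cache_mark_dirty(soft_cache_t *c, const void *host_block)
{
    cache_lookup(c, host_block)->dirty = 1;
}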

HEFT (Heterogeneous Earliest Finish Time). [Timeline figure, time 0-15, with Tasks 1-5 already scheduled on the available resources.] Where should we place Task 6? It goes to the slot that yields the earliest finish time. Topcuoglu, H., Hariri, S., and Wu, M. "Performance-effective and low-complexity task scheduling for heterogeneous computing." IEEE Transactions on Parallel and Distributed Systems, 13.3 (2002): 260-274. 28

3x3 Blocked Cholesky Decomposition
CHOL0:  A0,0 := Chol(A0,0)
TRSM1:  A1,0 := A1,0 * A0,0^-T
TRSM2:  A2,0 := A2,0 * A0,0^-T
SYRK3:  A1,1 := A1,1 - A1,0 * A1,0^T
GEMM4:  A2,1 := A2,1 - A2,0 * A1,0^T
SYRK5:  A2,2 := A2,2 - A2,0 * A2,0^T
CHOL6:  A1,1 := Chol(A1,1)
TRSM7:  A2,1 := A2,1 * A1,1^-T
SYRK8:  A2,2 := A2,2 - A2,1 * A2,1^T
CHOL9:  A2,2 := Chol(A2,2) 29
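These ten tasks are exactly what a right-looking blocked Cholesky factorization generates on a 3x3 grid of blocks. A short sketch of the loop that enumerates them; it only prints the task list, whereas in SuperMatrix each iteration would instead enqueue the corresponding block kernel.

#include <stdio.h>

/* Enumerate the tasks of a right-looking blocked Cholesky factorization
 * of an N x N grid of blocks (N = 3 reproduces CHOL0 ... CHOL9 above).   */
int main(void)
{
    const int N = 3;
    int t = 0;
    for (int k = 0; k < N; k++) {
        printf("CHOL%d: A%d,%d := Chol(A%d,%d)\n", t++, k, k, k, k);
        for (int i = k + 1; i < N; i++)
            printf("TRSM%d: A%d,%d := A%d,%d * A%d,%d^-T\n",
                   t++, i, k, i, k, k, k);
        for (int j = k + 1; j < N; j++) {
            printf("SYRK%d: A%d,%d := A%d,%d - A%d,%d * A%d,%d^T\n",
                   t++, j, j, j, j, j, k, j, k);
            for (int i = j + 1; i < N; i++)
                printf("GEMM%d: A%d,%d := A%d,%d - A%d,%d * A%d,%d^T\n",
                       t++, i, j, i, j, i, k, j, k);
        }
    }
    return 0;
}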

Data Distribution (HEFT walkthrough, slides 30-40). Each slide tracks four pieces of state while the ten Cholesky tasks are scheduled one at a time onto the CPU, GPU0, and GPU1: (i) a data-distribution table recording which blocks A0,0 ... A2,2 currently have a copy on each device; (ii) the task list CHOL0 ... CHOL9; (iii) the HEFT assignment table built up so far; and (iv) each resource's availability time (Avail), the task's earliest start time (EST) on that resource (which includes the time to transfer any operand blocks not yet resident there), its earliest finish time (EFT), and the resulting priority ranking. For example, when CHOL0 is scheduled all blocks reside only on the CPU, so Avail = (0, 0, 0), EST = (0, 1, 1) because the GPUs would first have to receive A0,0, EFT = (1.5, 2, 2), and CHOL0 goes to the CPU. Continuing in this fashion, the tasks are assigned CHOL0 -> CPU, TRSM1 -> GPU0, TRSM2 -> GPU1, SYRK3 -> GPU0, GEMM4 -> GPU0, SYRK5 -> GPU1, CHOL6 -> CPU, TRSM7 -> GPU0, SYRK8 -> GPU0, and CHOL9 -> GPU0, with the data-distribution table and the Avail/EST/EFT/priority values updated after every assignment.
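The per-step numbers above follow directly from the HEFT rule: for each candidate resource, the earliest start time is the later of the resource's availability and the time by which the task's operand blocks can be made resident there (transfers for blocks not in that device's software cache), the earliest finish time adds the task's estimated execution time on that resource, and the task goes to the resource with the smallest EFT. A minimal sketch of that placement decision; the exec/xfer estimates stand in for the runtime's task performance model and are assumed values chosen to reproduce the first step of the walkthrough.

#include <stdio.h>

#define NRES 3   /* 0 = CPU, 1 = GPU0, 2 = GPU1 */

/* Pick the resource with the earliest finish time for one task.
 *  avail[r] : time when resource r becomes free
 *  ready    : time when the task's DAG predecessors complete
 *  exec[r]  : estimated execution time of this task on resource r
 *  xfer[r]  : estimated time to transfer the task's non-resident
 *             operand blocks to resource r (0 if everything is cached)
 * Returns the chosen resource and writes back its EFT.                  */
int heft_place(const double avail[NRES], double ready,
               const double exec[NRES], const double xfer[NRES],
               double *eft_out)
{
    int best = 0;
    double best_eft = 0.0;
    for (int r = 0; r < NRES; r++) {
        double est = avail[r] > ready ? avail[r] : ready;  /* earliest start       */
        est += xfer[r];                                     /* wait for PCIe copies */
        double eft = est + exec[r];                         /* earliest finish      */
        if (r == 0 || eft < best_eft) { best = r; best_eft = eft; }
    }
    *eft_out = best_eft;
    return best;
}

int main(void)
{
    /* First step of the walkthrough: CHOL0, all blocks start on the CPU. */
    double avail[NRES] = { 0.0, 0.0, 0.0 };
    double exec[NRES]  = { 1.5, 1.0, 1.0 };   /* assumed model values      */
    double xfer[NRES]  = { 0.0, 1.0, 1.0 };   /* GPUs must first copy A0,0 */
    double eft;
    int r = heft_place(avail, 0.0, exec, xfer, &eft);
    printf("CHOL0 -> resource %d, EFT = %.1f\n", r, eft);   /* CPU, 1.5 */
    return 0;
}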

SuperMatrix Approach on Heterogeneous Platforms
S0: D := A * B;  S1: A = L * L^T;  S2: B := B * L^-T;  S3: C := C - B * B^T;  S4: T := L^-1 * T
/*-----------------------------------------------*/
FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE, FLA_ONE, A, B, FLA_ZERO, D );
FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A, B );
FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, B, FLA_ONE, C );
FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, L, T );
/*-----------------------------------------------*/
41

Performance. 6-core single-socket Xeon E5649 CPU + 1 GTX 480 GPU card (block size 1024); 6-core single-socket Xeon E5649 CPU + 2 Tesla C2070 GPU cards (block size 2048). 42

Conclusion. libflame + SuperMatrix on top of clBLAS (OpenCL) and cuBLAS (CUDA) for the GPU; ACML (OpenMP/pthreads; C/assembly/Fortran), MKL (OpenMP/pthreads; C/Fortran/assembly), and BLIS (OpenMP/pthreads; C/assembly) for the CPU/MIC; and BLIS (OpenMP/pthreads; C/assembly) for accelerators and other platforms. 43

SuperMatrix Approach on Heterogeneous Platforms
S0: D := A * B;  S1: A = L * L^T;  S2: B := B * L^-T;  S3: C := C - B * B^T;  S4: T := L^-1 * T
/*-----------------------------------------------*/
FLASH_Gemm( FLA_NO_TRANSPOSE, FLA_NO_TRANSPOSE, FLA_ONE, A, B, FLA_ZERO, D );
FLASH_Chol( FLA_LOWER_TRIANGULAR, A );
FLASH_Trsm( FLA_RIGHT, FLA_LOWER_TRIANGULAR, FLA_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, A, B );
FLASH_Syrk( FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_MINUS_ONE, B, FLA_ONE, C );
FLASH_Trsm( FLA_LEFT, FLA_LOWER_TRIANGULAR, FLA_NO_TRANSPOSE, FLA_NONUNIT_DIAG, FLA_ONE, L, T );
/*-----------------------------------------------*/
44

Related Work
Target Platform                      | LAPACK Project       | FLAME Project
Sequential                           | LAPACK               | libflame
Sequential + multithreaded BLAS      | LAPACK               | libflame
Multicore/multithreaded              | PLASMA               | libflame + SuperMatrix
Multicore + out-of-order scheduling  | PLASMA + QUARK       | libflame + SuperMatrix
CPU + single GPU                     | MAGMA                | libflame + SuperMatrix
Multicore + multi-GPU                | DAGuE/StarPU/Kaapi   | libflame + SuperMatrix
45

Questions? 46