Task-based programming models to support hierarchical algorithms
1 Task-based programming models to support hierarchical algorithms. Rosa M. Badia, Barcelona Supercomputing Center. SHAXC 2016, KAUST, 11 May 2016
2 Outline
- BSC
- Overview of superscalar programming model
- OmpSs overview
- Use of OmpSs in numerical examples
- PyCOMPSs overview
- Use of PyCOMPSs in numerical examples
- Conclusions
3 Barcelona Supercomputing Center - Centro Nacional de Supercomputación BSC-CNS objectives: R&D in Computer, Life, Earth and Engineering Sciences. Supercomputing services and support to Spanish and European researchers. BSC-CNS is a consortium that includes: Spanish Government 60% Catalonian Government 30% Universitat Politècnica de Catalunya (UPC) 10% 425 people, 40 countries 3
4 Mission of BSC R&D Departments COMPUTER SCIENCES To influence the way machines are built, programmed and used: programming models, performance tools, Big Data, computer architecture, energy efficiency EARTH SCIENCES To develop and implement global and regional stateof-the-art models for shortterm air quality forecast and long-term climate applications LIFE SCIENCES To understand living organisms by means of theoretical and computational methods (molecular modeling, genomics, proteomics) CASE To develop scientific and engineering software to efficiently exploit supercomputing capabilities (biomedical, geophysics, atmospheric, energy, social and economic simulations) 4
5 Reducing the gap between applications and architectures
New algorithms and data structures require programming models and runtimes that:
- Reduce programming complexity
- Keep portability
- Enable communication-avoiding, asynchronous algorithms, ...
[Diagram: applications sit on a high-level, clean, abstract programming-model interface; the runtime API mediates access to the computing platform and the storage platform.]
6 So, what is a superscalar programming model?
- High-level sequential programming
- Executes following the superscalar processor model: out of order
- Task is the unit of work
- Builds a task graph at runtime that expresses potential concurrency; a large number of in-flight tasks exposes distant parallelism
- Based on a runtime that makes decisions and executes the workflow, and offers an abstraction to plug applications into different resources (computing, storage)
[Diagram keywords: superscalar processor, dataflow, high-level sequential programming, workflows, utilities]
7 The StarSs programming model
StarSs instantiations: CellSs, SMPSs, GPUSs, ClusterSs, ClearSpeedSs, GridSs, OmpSs, PyCOMPSs/COMPSs
Different implementations, targeting different platforms:
- OmpSs: multicore, GPUs, clusters
- COMPSs: clusters, federated clouds, old grids
Open source: pm.bsc.es, compss.bsc.es
8 Main elements of superscalar programming model syntax
Superscalar program:
- Sequential code, single shared memory space
- Identification of tasks
Task, the main element of the programming model:
- Computation unit that operates on given parameters and local variables
- Amount of work (granularity) may vary in a wide range (from μsecs to minutes or hours) and may depend on input arguments
- Once started, executes to completion independently of other tasks
Syntax:
- Task annotations
- Task arguments directionality
- Synchronizations
9 Task annotations
Different languages, same idea: annotations are designed according to the standards of each programming language.
C/C++ (pragmas):
    #pragma omp task inout(A[k][k]) priority(10)
    spotrf (A[k][k]);
    for (i=k+1; i<nt; i++) {
        #pragma omp task in(A[k][k]) inout(...)
        strsm (A[k][k], ...
Python (decorators):
    @task(priority=True)
    def potrf(a):
        A = dpotrf(a, lower=True)[0].tolist()
Java (annotated interface):
    @Method(declaringClass = "cholesky.objects.Block")
    void ... (@Parameter(direction = Direction.INOUT) Block diag, ...)
10 Task arguments directionality
- Input, Output, Inout: indicate that the argument is read, written, or read and written by the task
- Used at execution time to determine the data dependences between tasks
- A directionality annotation is not a direct edge, but may generate one or more edges
- Gives information about locality and data to be transferred
    @task(o1=OUT)
    def meta(o1): ...

    @task(o2=IN)
    def metb(o2): ...

    def main():
        meta(myobject)
        metb(myobject)
        metb(otherobject)
Resulting graph: meta -> metb (through myobject); the metb call on otherobject is independent.
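How directionality hints turn into graph edges can be sketched in a few lines of plain Python. This is a toy model with assumed names, not the actual StarSs runtime: each task call records which objects it reads and writes, and read-after-write, write-after-read, and write-after-write pairs each generate an edge.

```python
# Toy dependence tracker (hypothetical, for illustration only): directionality
# of each argument decides which edges are added to the task graph.
class TaskGraph:
    def __init__(self):
        self.edges = []        # (producer_task, consumer_task) pairs
        self.last_writer = {}  # object id -> task that last wrote it
        self.readers = {}      # object id -> tasks that read it since that write

    def add_task(self, name, inputs=(), outputs=()):
        for obj in inputs:     # read-after-write: depend on the last writer
            w = self.last_writer.get(id(obj))
            if w is not None:
                self.edges.append((w, name))
            self.readers.setdefault(id(obj), []).append(name)
        for obj in outputs:    # write-after-read and write-after-write
            for r in self.readers.pop(id(obj), []):
                self.edges.append((r, name))
            w = self.last_writer.get(id(obj))
            if w is not None:
                self.edges.append((w, name))
            self.last_writer[id(obj)] = name

# Mirror the slide's example: meta writes myobject, metb reads it,
# and a second metb call reads an unrelated object.
g = TaskGraph()
myobject, otherobject = object(), object()
g.add_task("meta", outputs=[myobject])
g.add_task("metb", inputs=[myobject])      # edge meta -> metb
g.add_task("metb2", inputs=[otherobject])  # no producer, so no edge
```

Run on the slide's three calls, the tracker produces exactly one edge, meta -> metb, leaving the second metb call free to run concurrently.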
11 Impact of synchronization in task-based programming
- Ideally, tasks execute according to data dependences; however, synchronizations cannot always be avoided, e.g. when task results are needed
- Granularity should not be the only parameter to decide what is a task; the semantics of synchronization matter too, because synchronizations stop task graph generation
- Synchronization syntax is designed according to each language's standards; in Java, synchronizations are added by interception
    #pragma omp taskwait          (C)
    foo = compss_wait_on(foo)     (Python)
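The deferred-execution semantics that a synchronization cuts short can be sketched with a toy queue. The names here (`task`, `wait_on`) are hypothetical stand-ins, not the COMPSs API: task calls only enqueue work, and the wait drains everything submitted so far before handing back a value.

```python
# Toy deferred executor: calls build up work, wait_on is the sync point.
pending = []   # submitted-but-not-executed task invocations
results = {}   # values produced by executed tasks

def task(fn):
    def submit(*args):
        pending.append((fn, args))   # record the call, do not run it yet
    return submit

@task
def add(key, a, b):
    results[key] = a + b

def wait_on(key):
    # Synchronization: everything submitted so far must run before we
    # can return the value -- graph generation stops here.
    while pending:
        fn, args = pending.pop(0)
        fn(*args)
    return results[key]

add("x", 1, 2)                        # enqueued, not executed
add("y", 3, 4)                        # enqueued, not executed
total = wait_on("x") + wait_on("y")   # first wait drains both tasks; 3 + 7
```

Note how the first `wait_on` already forces both queued tasks to run: the synchronization serializes everything submitted before it, which is why the slide warns against unnecessary waits.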
12 Performance analysis. Runtimes are instrumented with the Extrae library to generate Paraver trace files.
13 PART I: OMPSS 13
14 OmpSs environment: Mercurium compiler
- Recognizes constructs and transforms them into calls to the runtime
- Manages code restructuring for different target devices through device-specific handlers; may generate code in a separate file
- Invokes different back-end compilers (e.g. nvcc for NVIDIA) for C/C++/Fortran
15 OmpSs environment: Nanos++ runtime
[Diagram: source code with #pragma omp task goes through the Mercurium compiler into the application binary; new tasks enter the Nanos++ runtime, which provides dependency support (task graph), coherence support (directory/cache) for data requests, and a scheduler feeding ready tasks to worker threads (local execution) and helper threads (device operations, e.g. executing tasks on CUDA/GPU or MIC threads).]
16 Example 1: Communication-avoiding QR* in OmpSs
    #pragma omp task inout( A[0;br*bc] ) output( T[0;br*bc] ) priority(3)
    void dgeqrf_dlarft (int br, int bc, int skip, double *A, double *T);

    #pragma omp task input( T[0;br*bc] ) inout( C[0;br*bc] )
    void dlarfb (int br, int bc, int skip, double *V, double *T, double *C);

    #pragma omp task inout( C[0;br*bc], D[0;br*bc] ) output( T[0;br*bc] ) priority(2)
    void dgeqrf_split (int br, int bc, int skip, double *C, double *D, double *T);

    #pragma omp task input( T[0;br*bc] ) inout( F[0;br*bc], G[0;br*bc] ) priority(1)
    void dlarfb_split (int br, int bc, int skip, double *D, double *T, double *F, double *G);

    for ( int k=0; k<nt ; k++ ) {
        dgeqrf_dlarft(br, bc, 0, A[k][k], T[k][k]);
        for ( int j=k+1; j<nt ; j++ ) {
            dlarfb(br, bc, 0, A[k][k], T[k][k], A[k][j]);
        }
        for ( int i=k+1; i<mt ; i++ ) {
            dgeqrf_split(br, bc, 0, A[k][k], A[i][k], T[i][k]);
            for ( int j=k+1; j<nt ; j++ ) {
                dlarfb_split(br, bc, 0, A[i][k], T[i][k], A[k][j], A[i][j]);
            }
        }
    }
* Demmel et al., Communication-optimal parallel and sequential QR and LU factorizations.
17 Example 2: Heterogeneous
    #pragma omp target device (hstreams) implements(dgemm_task) copy_deps
    #pragma omp task input([ts][ts] A, [ts][ts] B) inout([ts][ts] C) priority (10)
    void dgemm_phi (int ts, double *A, double *B, double *C) {
        double alpha = 1.0;
        const char trans = 'N';
        dgemm(&trans, &trans, &ts, &ts, &ts, &alpha, A, &ts, B, &ts, &alpha, C, &ts);
    }

    #pragma omp task input([ts][ts] A, [ts][ts] B) inout([ts][ts] C) priority (10)
    void dgemm_task (int ts, double *A, double *B, double *C) {
        cblas_dgemm (CblasRowMajor, CblasNoTrans, CblasNoTrans,
                     ts, ts, ts, 1, A, ts, B, ts, 1, C, ts);
    }

    void matmul (int N, int TS, double *A[N][N], double *B[N][N], double *C[N][N]) {
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    dgemm_task(TS, A[i][k], B[k][j], C[i][j]);
        #pragma omp taskwait
    }
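The `implements` clause above registers an alternative backend (here, an Xeon Phi version) for the same task, and the runtime picks one per device at execution time. The selection idea can be sketched as a toy dispatcher in Python; all names here are hypothetical, not the OmpSs runtime.

```python
# Toy version-selection sketch: several implementations are registered for
# one task, and a scheduler picks whichever matches an available device.
implementations = {}  # task name -> {device: function}

def register(task_name, device):
    def wrap(fn):
        implementations.setdefault(task_name, {})[device] = fn
        return fn
    return wrap

@register("dgemm_task", "cpu")
def dgemm_cpu(a, b, c):
    # c + a @ b for square blocks stored as lists of lists
    n = len(a)
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

@register("dgemm_task", "phi")
def dgemm_phi(a, b, c):
    # Same contract, different backend; identical math in this toy version.
    return dgemm_cpu(a, b, c)

def run(task_name, free_devices, *args):
    # Pick the first registered implementation whose device is free.
    impls = implementations[task_name]
    device = next(d for d in impls if d in free_devices)
    return impls[device](*args)

out = run("dgemm_task", {"phi"}, [[1.0]], [[2.0]], [[0.0]])  # phi version runs
```

The real runtime also weighs data transfer cost and per-device performance history when choosing among versions; this sketch only shows the registration-plus-dispatch shape.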
18 Example 3: Task parallelism in ILUPACK's preconditioned CG solver*. Exploitation of nested parallelism: tasks are split into finer-granularity tasks.
19 Example 3: Task parallelism in ILUPACK's preconditioned CG solver*. NUMA-aware execution thanks to an OmpSs-specific NUMA-aware scheduler. The code records on which socket each task executes during the initial calculation of the preconditioner; during all subsequent iterations, tasks that operate on data generated or accessed during the preconditioner calculation are mapped to the socket where they were originally executed.
20 Some results: 8 CPUs + 1 GPGPU + 1 MIC
[Charts: performance (Gflop/s) vs. matrix size (#elements on a side) for Matrix Multiply and Cholesky Factorization, comparing OmpSs configurations (OCUbf, OCUaf, OhSbf, OhSaf, OHTsm) against plain CUDA and hStreams.]
The best performance is achieved when all processing units (CPUs, GPGPU and Xeon Phi) cooperate.
21 Some results: NUMA-aware scheduler
[Chart: results with the distance-aware scheduler.]
Al-Omairy et al., Dense Matrix Computations on NUMA Architectures with Distance-Aware Work Stealing.
22 PART II: PYCOMPSS/COMPSS
23 Why Python?
"Python is powerful... and fast; plays well with others; runs everywhere; is friendly & easy to learn; is Open."*
- Its design philosophy emphasizes code readability, and its syntax allows programmers to express concepts in fewer lines of code than would be possible in languages such as C
- Large community using it, including the scientific and numeric communities
- Object-oriented programming and structured programming are fully supported
- Large number of software modules available (38,000 as of January 2014)
* From python.org
24 Task annotations
- Use of decorators to annotate tasks and indicate argument directionality
- Other annotations: constraints
- Small API
    @task(priority=True)
    def potrf(a):
        A = dpotrf(a, lower=True)[0].tolist()

    foo = compss_wait_on(foo)
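The decorator mechanics can be mimicked in a few lines of plain Python. This is a stand-in sketch, not the real PyCOMPSs binding (which also serializes arguments and ships tasks to remote workers): the decorator records the directionality hints and, as a sequential fallback, just runs the function.

```python
# Stand-in @task decorator (assumed names): directionality hints are keyword
# arguments to the decorator; the wrapped function runs unchanged here.
registry = {}

def task(**directions):
    def wrap(fn):
        registry[fn.__name__] = directions   # e.g. {"c": "INOUT"}
        return fn                            # sequential fallback: call directly
    return wrap

@task(c="INOUT", priority=True)
def multiply(a, b, c):
    # Elementwise multiply-accumulate into c (stands in for a block kernel).
    for i in range(len(c)):
        c[i] += a[i] * b[i]

v = [0.0, 0.0]
multiply([1.0, 2.0], [3.0, 4.0], v)   # v is updated in place: [3.0, 8.0]
```

A real binding would use the recorded directions (here, `c` marked INOUT) to build dependence edges, exactly as on the directionality slide earlier.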
25 PyCOMPSs runtime behavior
[Diagram: Python user code with task annotations goes through the binding, which builds the task dependency graph (TDG); the runtime schedules the tasks and moves files/objects to grids, clusters and clouds.]
26 PyCOMPSs stack + NumPy and MKL
MKL is parallelized with OpenMP, so there are two levels of parallelism to exploit: task level and thread level.
[Diagram: on the host node, the application main runs in a Python interpreter with NumPy/SciPy on top of the COMPSs master; on each worker node, task code runs in a Python interpreter with NumPy/SciPy on top of the COMPSs worker, with MKL (parallelized with OpenMP) underneath.]
27 Example 1: Matrix multiply
    @task(c=INOUT)
    def multiply(a, b, c):
        c += a*b

    def initialize_variables():
        import numpy as np
        for matrix in [A, B, C]:
            for i in range(MSIZE):
                matrix.append([])
                for j in range(MSIZE):
                    if matrix == C:
                        block = np.array(np.zeros((BSIZE, BSIZE)), dtype=np.double, copy=False)
                    ...

    initialize_variables()
    eti = time.time()
    for i in range(MSIZE):
        for j in range(MSIZE):
            for k in range(MSIZE):
                multiply(A[i][k], B[k][j], C[i][j])
    C = compss_wait_on(C)
    print "Compute Time {} s".format(time.time()-eti)
[Figure: the task dependency graph built at runtime, ending in the synchronization introduced by compss_wait_on.]
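Stripped of the runtime, the computation above is just a blocked matrix multiply. A minimal sequential sketch in plain Python (toy sizes, no PyCOMPSs) mirrors the slide's triple loop; each `multiply(...)` call is what the runtime would turn into a task.

```python
# Sequential sketch of the slide's blocked multiply: c += a @ b per block.
def multiply(a, b, c):
    # a, b, c are square blocks stored as lists of lists
    n = len(a)
    for i in range(n):
        for j in range(n):
            c[i][j] += sum(a[i][k] * b[k][j] for k in range(n))

MSIZE, BSIZE = 2, 2  # 2x2 grid of 2x2 blocks -> a 4x4 matrix

def blocks(value):
    # Build an MSIZE x MSIZE grid of BSIZE x BSIZE blocks filled with value.
    return [[[[value] * BSIZE for _ in range(BSIZE)]
             for _ in range(MSIZE)] for _ in range(MSIZE)]

A, B, C = blocks(1.0), blocks(2.0), blocks(0.0)
for i in range(MSIZE):
    for j in range(MSIZE):
        for k in range(MSIZE):
            multiply(A[i][k], B[k][j], C[i][j])
# every element of the 4x4 product is 4 * (1.0 * 2.0) = 8.0
```

Because each inner call only touches blocks A[i][k], B[k][j] and C[i][j], the INOUT annotation on `c` is enough for the runtime to chain the k-iterations on each C block while running different (i, j) blocks concurrently.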
28 Example 2: Cholesky
    @task(priority=True)
    def potrf(a):
        from scipy.linalg.lapack import dpotrf
        A = dpotrf(a, lower=True)[0].tolist()
        return A

    @task(priority=True)
    def trsm(a, B):
        from scipy.linalg import solve_triangular
        from numpy import transpose
        B = solve_triangular(a, B, lower=True, trans='T')
        ...

    @task()
    def gemm(a, B, C):
        from scipy.linalg.blas import dgemm
        alpha = -1.0
        beta = 1.0
        C = dgemm(alpha, a, B, c=C, beta=beta, trans_b=1).tolist()
        ...

    def cholesky_blocked(A):
        from pycompss.api.api import compss_wait_on
        n = len(A)
        for k in range(n):
            # Diagonal block factorization
            A[k][k] = potrf(A[k][k])
            # Triangular systems
            for i in range(k+1, n):
                A[k][i] = trsm(A[k][k], A[k][i])
            # Update trailing matrix
            for i in range(k+1, n):
                for j in range(k+1, i):
                    A[j][i] = gemm(A[k][i], A[k][j], A[j][i])
                A[i][i] = syrk(A[k][i], A[i][i])
        A = compss_wait_on(A)
        return A
[Figure: the generated task dependency graph, ending in the synchronization at compss_wait_on.]
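As a self-contained illustration of the algorithm these tasks implement, here is a sequential blocked Cholesky in plain Python. It is a sketch of the standard lower-triangular right-looking variant, with toy block kernels in place of LAPACK/BLAS; the names and the lower-block storage layout are my own, not the slide's.

```python
import math

def potrf(a):
    # Cholesky of one dense block: a = L L^T, returns lower-triangular L.
    b = len(a)
    L = [[0.0] * b for _ in range(b)]
    for i in range(b):
        for j in range(i + 1):
            s = sum(L[i][p] * L[j][p] for p in range(j))
            L[i][j] = math.sqrt(a[i][i] - s) if i == j else (a[i][j] - s) / L[j][j]
    return L

def trsm(Lkk, a):
    # Solve X * Lkk^T = a for one off-diagonal block (forward substitution).
    b = len(a)
    X = [[0.0] * b for _ in range(b)]
    for r in range(b):
        for c in range(b):
            s = sum(X[r][p] * Lkk[c][p] for p in range(c))
            X[r][c] = (a[r][c] - s) / Lkk[c][c]
    return X

def gemm(c, a, b):
    # Trailing update: c -= a @ b^T (also covers the syrk diagonal case).
    n = len(c)
    for r in range(n):
        for col in range(n):
            c[r][col] -= sum(a[r][p] * b[col][p] for p in range(n))
    return c

def cholesky_blocked(A):
    # A is an n x n grid of dense blocks; its lower part is overwritten with L.
    n = len(A)
    for k in range(n):
        A[k][k] = potrf(A[k][k])                       # diagonal block
        for i in range(k + 1, n):
            A[i][k] = trsm(A[k][k], A[i][k])           # panel solves
        for i in range(k + 1, n):
            for j in range(k + 1, i + 1):
                A[i][j] = gemm(A[i][j], A[i][k], A[j][k])  # trailing update
    return A

# Usage: factor a 4x4 SPD matrix split into a 2x2 grid of 2x2 blocks.
M = [[4.0, 1.0, 0.0, 0.0],
     [1.0, 4.0, 1.0, 0.0],
     [0.0, 1.0, 4.0, 1.0],
     [0.0, 0.0, 1.0, 4.0]]
bs = 2
A = [[[row[bj * bs:bj * bs + bs] for row in M[bi * bs:bi * bs + bs]]
      for bj in range(2)] for bi in range(2)]
cholesky_blocked(A)

# Reassemble the dense lower factor and check L @ L^T reproduces M.
L = [[A[r // bs][c // bs][r % bs][c % bs] if c // bs <= r // bs else 0.0
      for c in range(4)] for r in range(4)]
err = max(abs(sum(L[r][p] * L[c][p] for p in range(4)) - M[r][c])
          for r in range(4) for c in range(4))
```

In a task-based version each `potrf`/`trsm`/`gemm` call becomes a task, and the argument directions alone give the runtime the familiar Cholesky dependence pattern: panel solves wait on the diagonal block, trailing updates wait on their two panels.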
29 Some results: matrix multiply
Matrix size: ...; block size: 4096; 1 OpenMP thread per task; 16 tasks per node.
[Charts: matrix multiply execution time (secs), speedup and performance (GFlops) vs. number of cores/nodes.]
30 Some results: matrix multiply
Matrix size: ...; block size: 4096; 8 OpenMP threads per task; 2 tasks per node.
[Charts: matrix multiply execution time (secs), speedup and performance (GFlops) vs. number of cores/nodes, 8 threads per task.]
31 Some insight
[Paraver traces comparing BSIZE 2048 vs. BSIZE ..., 1 thread vs. 8 threads per task.]
32 Some insight
[Paraver traces comparing BSIZE 2048 vs. BSIZE ..., 1 thread vs. 8 threads per task.]
33 Conclusions
- StarSs is a family of task-based programming models targeting parallel systems
- OmpSs focuses on more traditional HPC systems, including heterogeneous nodes with GPUs and accelerators
- PyCOMPSs provides a high-level, easy interface focused on distributed computing, including cloud and Big Data
- Both systems come with a whole environment of tools: performance analysis, monitoring
- Ongoing work: integrating PyCOMPSs/COMPSs with new storage approaches to enable convergence between HPC and Big Data
- Open source: pm.bsc.es, compss.bsc.es
34 POP CoE A Center of Excellence On Performance Optimization and Productivity Promoting best practices in performance analysis and parallel programming Providing Services Precise understanding of application and system behavior Suggestion/support on how to refactor code in the most productive way Horizontal Transversal across application areas, platforms, scales For academic AND industrial codes and users! 36
35 Thank you! 37
More informationOpenMP 4.0/4.5. Mark Bull, EPCC
OpenMP 4.0/4.5 Mark Bull, EPCC OpenMP 4.0/4.5 Version 4.0 was released in July 2013 Now available in most production version compilers support for device offloading not in all compilers, and not for all
More informationHeterogeneous Multicore Parallel Programming
Innovative software for manycore paradigms Heterogeneous Multicore Parallel Programming S. Chauveau & L. Morin & F. Bodin Introduction Numerous legacy applications can benefit from GPU computing Many programming
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationOpenACC 2.6 Proposed Features
OpenACC 2.6 Proposed Features OpenACC.org June, 2017 1 Introduction This document summarizes features and changes being proposed for the next version of the OpenACC Application Programming Interface, tentatively
More informationAddressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer
Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems Ed Hinkel Senior Sales Engineer Agenda Overview - Rogue Wave & TotalView GPU Debugging with TotalView Nvdia CUDA Intel Phi 2
More informationAnalysis of the Task Superscalar Architecture Hardware Design
Available online at www.sciencedirect.com Procedia Computer Science 00 (2013) 000 000 International Conference on Computational Science, ICCS 2013 Analysis of the Task Superscalar Architecture Hardware
More informationIntel Xeon Phi архитектура, модели программирования, оптимизация.
Нижний Новгород, 2017 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Дмитрий Рябцев, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture
More informationPARALUTION - a Library for Iterative Sparse Methods on CPU and GPU
- a Library for Iterative Sparse Methods on CPU and GPU Dimitar Lukarski Division of Scientific Computing Department of Information Technology Uppsala Programming for Multicore Architectures Research Center
More informationOmpCloud: Bridging the Gap between OpenMP and Cloud Computing
OmpCloud: Bridging the Gap between OpenMP and Cloud Computing Hervé Yviquel, Marcio Pereira and Guido Araújo University of Campinas (UNICAMP), Brazil A bit of background qguido Araujo, PhD Princeton University
More informationHPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware
More informationEvaluation of sparse LU factorization and triangular solution on multicore architectures. X. Sherry Li
Evaluation of sparse LU factorization and triangular solution on multicore architectures X. Sherry Li Lawrence Berkeley National Laboratory ParLab, April 29, 28 Acknowledgement: John Shalf, LBNL Rich Vuduc,
More informationTask based parallelization of recursive linear algebra routines using Kaapi
Task based parallelization of recursive linear algebra routines using Kaapi Clément PERNET joint work with Jean-Guillaume DUMAS and Ziad SULTAN Université Grenoble Alpes, LJK-CASYS January 20, 2017 Journée
More informationOverview: Emerging Parallel Programming Models
Overview: Emerging Parallel Programming Models the partitioned global address space paradigm the HPCS initiative; basic idea of PGAS the Chapel language: design principles, task and data parallelism, sum
More informationJCudaMP: OpenMP/Java on CUDA
JCudaMP: OpenMP/Java on CUDA Georg Dotzler, Ronald Veldema, Michael Klemm Programming Systems Group Martensstraße 3 91058 Erlangen Motivation Write once, run anywhere - Java Slogan created by Sun Microsystems
More informationCUDA Programming Model
CUDA Xing Zeng, Dongyue Mou Introduction Example Pro & Contra Trend Introduction Example Pro & Contra Trend Introduction What is CUDA? - Compute Unified Device Architecture. - A powerful parallel programming
More informationParallelism V. HPC Profiling. John Cavazos. Dept of Computer & Information Sciences University of Delaware
Parallelism V HPC Profiling John Cavazos Dept of Computer & Information Sciences University of Delaware Lecture Overview Performance Counters Profiling PAPI TAU HPCToolkit PerfExpert Performance Counters
More informationSpeedup Altair RADIOSS Solvers Using NVIDIA GPU
Innovation Intelligence Speedup Altair RADIOSS Solvers Using NVIDIA GPU Eric LEQUINIOU, HPC Director Hongwei Zhou, Senior Software Developer May 16, 2012 Innovation Intelligence ALTAIR OVERVIEW Altair
More informationAdvanced OpenMP Features
Christian Terboven, Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group {terboven,schmidl@itc.rwth-aachen.de IT Center der RWTH Aachen University Vectorization 2 Vectorization SIMD =
More informationPyCOMPSs: Parallel computational workflows in Python
Original Article PyCOMPSs: Parallel computational workflows in Python The International Journal of High Performance Computing Applications 2017, Vol. 31(1) 66 82 Ó The Author(s) 2015 Reprints and permissions:
More informationMetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores
MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores Presented by Xiaohui Chen Joint work with Marc Moreno Maza, Sushek Shekar & Priya Unnikrishnan University of Western Ontario,
More informationNVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU
NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated
More informationIntroduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620
Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved
More informationIntel Performance Libraries
Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation
More informationOptimising the Mantevo benchmark suite for multi- and many-core architectures
Optimising the Mantevo benchmark suite for multi- and many-core architectures Simon McIntosh-Smith Department of Computer Science University of Bristol 1 Bristol's rich heritage in HPC The University of
More informationOpenMP Tutorial. Dirk Schmidl. IT Center, RWTH Aachen University. Member of the HPC Group Christian Terboven
OpenMP Tutorial Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group schmidl@itc.rwth-aachen.de IT Center, RWTH Aachen University Head of the HPC Group terboven@itc.rwth-aachen.de 1 Tasking
More informationMAGMA: a New Generation
1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release
More informationINTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017
INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and
More informationEasy Programming the Cloud with PyCOMPSs
www.bsc.es Easy Programming the Cloud with PyCOMPSs FiCLOUD 2014 Barcelona, August 28 Barcelona Supercomputing Center The BSC-CNS objectives: R&D in Computer Sciences, Life Sciences and Earth Sciences
More informationSPOC : GPGPU programming through Stream Processing with OCaml
SPOC : GPGPU programming through Stream Processing with OCaml Mathias Bourgoin - Emmanuel Chailloux - Jean-Luc Lamotte January 23rd, 2012 GPGPU Programming Two main frameworks Cuda OpenCL Different Languages
More informationParallel Programming. Libraries and Implementations
Parallel Programming Libraries and Implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationGuiding the optimization of parallel codes on multicores using an analytical cache model
Guiding the optimization of parallel codes on multicores using an analytical cache model Diego Andrade, Basilio B. Fraguela, and Ramón Doallo Universidade da Coruña, Spain {diego.andrade,basilio.fraguela,ramon.doalllo}@udc.es
More information