OmpSs Fundamentals. ISC 2017: OpenSuCo. Xavier Teruel


2 Outline
  OmpSs brief introduction
    - OmpSs overview and influence in OpenMP
    - Execution model and parallelization approaches
    - Memory model and target copies
  OmpSs tool-chain
    - Software components: Mercurium and Nanos++
    - Software repositories and continuous integration

3 OmpSs overview
  Parallel programming model
    - Built on an existing standard: OpenMP (directive-based, keeping a valid serial version)
    - Targeting: SMP, clusters, and accelerators (OpenCL, CUDA, FPGAs, ...)
    - Developed at the Barcelona Supercomputing Center (BSC)
      - Compiler: Mercurium (source-to-source)
      - Runtime systems: Nanos++ (present), Nanos-6 (future)
  Where it comes from (a bit of history)
    - BSC had two working lines for several years:
      - OpenMP extensions: dynamic sections, an OpenMP tasking prototype
      - StarSs: asynchronous task-parallelism ideas (the "Ss" dependences)
    - OmpSs is our effort to fold them together

4 Influence in OpenMP

5 OmpSs execution model
  - A global thread team is created on startup
    - One worker starts the main task (and also executes tasks)
    - The other N-1 workers execute tasks
    - One representative thread per device (if any)
  - All of them get work from a task pool
    - Device kernels become tasks
    - Each task is labeled with (at least) one target device
    - The scheduler decides which task to execute
    - Tasks may have several targets (versioning), as sketched below
  (figure: the global thread team pulling work from the task pool)
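A hedged sketch of versioning, borrowing the target syntax introduced on slide 11 (function names and the launch geometry are illustrative): the implements clause declares an alternate implementation of the same task for another device, and the scheduler picks a version at run time.

    #pragma omp target device(smp)
    #pragma omp task in([n] c) out([n] b)
    void scale(double *b, double *c, double a, int n);

    /* Alternate CUDA version of the same task; ndrange sets the kernel
       launch geometry (1-D, n work items, 128 per block). */
    #pragma omp target device(cuda) implements(scale) ndrange(1, n, 128)
    #pragma omp task in([n] c) out([n] b)
    void scale_cuda(double *b, double *c, double a, int n);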

6 Task directive
  Creating parallelism: the task directive

    #pragma omp task [clause[[,] clause]...]
    { structured-block }

  Where the clauses are:
    - private(list), firstprivate(list), shared(list), default(shared | none)
    - if(scalar-expression), mergeable, final(scalar-expression)
    - priority(priority-value)
    - depend(dependence-type: list)
    - tied, label(string), ...
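As a minimal, self-contained sketch of the directive in use (the recursive Fibonacci is a stock example, not taken from the slides); note that under OmpSs no enclosing parallel region is needed, since the thread team exists from startup:

    #include <stdio.h>

    int fib(int n) {
        if (n < 2) return n;
        int x, y;
        #pragma omp task shared(x) firstprivate(n)   // child task, deferred
        x = fib(n - 1);
        #pragma omp task shared(y) firstprivate(n)   // runs concurrently with the other child
        y = fib(n - 2);
        #pragma omp taskwait                         // join both children before combining
        return x + y;
    }

    int main(void) {
        printf("fib(10) = %d\n", fib(10));
        return 0;
    }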

7 Cholesky factorization (introduction)

    for (j = 0; j < NB; j++) {
        for (k = 0; k < j; k++)
            for (i = j+1; i < NB; i++)
                sgemm( A[i][k], A[j][k], A[i][j] );
        for (i = 0; i < j; i++)
            ssyrk( A[j][i], A[j][j] );
        spotrf( A[j][j] );
        for (i = j+1; i < NB; i++)
            strsm( A[j][j], A[i][j] );
    }

  Each A[i][j] is a pointer to a block; NB is the number of blocks per dimension, BS the block size.
  (figure: the NB x NB matrix of blocks, addressed through pointers to blocks)

8 Cholesky factorization (explicit synchronization)

    for (j = 0; j < NB; j++) {
        for (k = 0; k < j; k++)
            for (i = j+1; i < NB; i++) {
                #pragma omp task
                sgemm( A[i][k], A[j][k], A[i][j] );
            }
        for (i = 0; i < j; i++) {
            #pragma omp task
            ssyrk( A[j][i], A[j][j] );
        }
        #pragma omp taskwait
        #pragma omp task
        spotrf( A[j][j] );
        #pragma omp taskwait
        for (i = j+1; i < NB; i++) {
            #pragma omp task
            strsm( A[j][j], A[i][j] );
        }
        #pragma omp taskwait
    }

  (figure: read/update block access pattern of sgemm, ssyrk, spotrf, and strsm)

9 Cholesky factorization (data-flow synchronization)

    for (j = 0; j < NB; j++) {
        for (k = 0; k < j; k++)
            for (i = j+1; i < NB; i++) {
                #pragma omp task in(A[i][k], A[j][k]) inout(A[i][j])
                sgemm( A[i][k], A[j][k], A[i][j] );
            }
        for (i = 0; i < j; i++) {
            #pragma omp task in(A[j][i]) inout(A[j][j])
            ssyrk( A[j][i], A[j][j] );
        }
        #pragma omp task inout(A[j][j])
        spotrf( A[j][j] );
        for (i = j+1; i < NB; i++) {
            #pragma omp task in(A[j][j]) inout(A[i][j])
            strsm( A[j][j], A[i][j] );
        }
    }
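The dependence clauses let the runtime build the task graph itself, removing the explicit taskwaits of the previous version. As a minimal sketch of the semantics (produce and consume are placeholders):

    int x;
    #pragma omp task out(x)    // producer: generates the first version of x
    x = produce();
    #pragma omp task in(x)     // consumer: scheduled only after the producer completes
    consume(x);
    #pragma omp taskwait       // join before the surrounding code reads x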

10 OmpSs memory model
  - A global (logical) address space
  - The runtime handles device/host memories
    - SMP machines: no extra runtime support needed
    - Distributed/heterogeneous environments:
      - Multiple physical memory address spaces exist
      - Versions of the same data can reside in several of them
      - Data consistency is ensured by the runtime system
  (figure: host memory hierarchy plus device and remote memories reached through I/O and NICs)

11 Target directive
  Device information: the target directive
  - Always attached to a task directive (outlined functions)

    #pragma omp target device(type) [clause[[,] clause]...]
    { outlined-task-construct }

  Explicit copy clauses:
    - copy_in(var-list): requests a consistent copy of the variables before execution
    - copy_out(var-list): after execution, produces the next version of the variables
    - copy_inout(var-list): the combination of copy_in and copy_out
  Copying data through the task's dependence clauses: with copy_deps (the default), each dependence clause xxx(var-list) also behaves as copy_xxx(var-list)
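A minimal sketch with explicit copy clauses (array and function names assumed); under the default copy_deps, the in/out dependences alone would request the same copies:

    #pragma omp target device(smp) copy_in([n] a) copy_out([n] b)
    #pragma omp task in([n] a) out([n] b)
    void vec_scale(double *b, double *a, double s, int n) {
        for (int i = 0; i < n; i++)
            b[i] = s * a[i];   // consistent copy of a coming in; next version of b going out
    }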

12 Memory consistency (getting consistent copies)

    #pragma omp target device(cuda)
    #pragma omp task out([N] b) in([N] c)
    void scale_task_cuda(double *b, double *c, double a, int N) {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j < N) b[j] = a * c[j];
    }

    #pragma omp target device(smp)
    #pragma omp task out([N] b) in([N] c)
    void scale_task_host(double *b, double *c, double a, int N) {
        for (int j = 0; j < N; j++) b[j] = a * c[j];
    }

    int main(int argc, char *argv[]) {
        ...
        scale_task_cuda(B, A, 10.0, 1024);  // T1
        scale_task_cuda(A, B, 0.01, 1024);  // T2
        #pragma omp taskwait                // can access any of A, B, C
        ...
    }

  T1 needs a valid copy of array A in the device. It also allocates array B in the device (no copy needed) and invalidates the other copies of B.
  (figure: task dependency graph T1 -> T2 and the resulting host-to-device memory transfers)

13 Memory consistency (reusing data in place)

  (same code as on the previous slide)

  T2 can reuse arrays A and B in the device, since they have already been used by the previous task (T1). It also invalidates the other copies of A.
  (figure: task dependency graph T1 -> T2; no transfer is needed for the reused data)

14 OmpSs tool-chain

15 Compiler support: Mercurium
  (1) Source-to-source transformation: directives become runtime calls
  (2) Native compilation: gcc, nvcc, icc, xlc, ...
  Drivers: mfc (Fortran), mcc (C), mcxx (C++)
  (figure: Mercurium's language front end, OmpSs core, and device providers emit per-device sources; each is built by its native compiler, and the host compiler links and embeds them into one executable)
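As a usage sketch (file names assumed), a single driver invocation such as mcc --ompss -o app app.c covers both steps: it runs the source-to-source phase and then the native compiler and linker.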

16 Runtime support: Nanos++

17 Scheduler plugin: Criticality-Aware Task Scheduler (CATS)
  - Support for heterogeneous systems; target: ARM big.LITTLE
  - The main idea (philosophy): find the critical path, i.e. the longest path of the task dependency graph
  - Scheduling for big/little cores:
    - Tasks on the critical path go to a big core
    - Tasks not on the critical path go to a little core
  - Methodology: prioritize tasks by their bottom level in the graph
  Kallia Chronaki, Alejandro Rico, Rosa M. Badia, Eduard Ayguadé, Jesús Labarta, Mateo Valero: Criticality-Aware Dynamic Task Scheduling for Heterogeneous Architectures. ICS.

18 Where to find Mercurium and Nanos++

19 Continuous integration: overview
  (figure: development repositories with multiple branches feed merge requests into the repository holding only the master branch, which users clone locally)

20 Summary
  - OmpSs as a task-based programming model
    - Execution model: how tasks are executed (data-flow driven)
    - Memory model: how data is handled across multiple address spaces
  - A successful forerunner for OpenMP tasking
  - The tool-chain: compiler and runtime library
    - Mercurium: source-to-source compiler
    - Nanos++: runtime library
  - Continuous integration: the work flow at BSC

21 Thanks! Further information at
