Partitioned Global Address Space Paradigm ASD Distributed Memory HPC Workshop


1 Partitioned Global Address Space Paradigm ASD Distributed Memory HPC Workshop Computer Systems Group Research School of Computer Science Australian National University Canberra, Australia November 02, 2017

2 Day 4 Schedule Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

3 Introduction to the PGAS Paradigm and Chapel Outline 1 Introduction to the PGAS Paradigm and Chapel 2 Chapel Programming Strategies for Distributed Memory 3 Runtime Support for PGAS 4 MPI One-Sided Communications 5 Fault Tolerance Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

4 Introduction to the PGAS Paradigm and Chapel Partitioned Global Address Space recall the shared memory model: multiple threads with pointers to a global address space in the partitioned global address space (PGAS) model: have multiple threads, each with affinity to some portion of global address space SPMD or fork-join thread creation remote pointers to access data in other partitions the model maps to a cluster with remote memory access can also map to NUMA domains Computer Systems (ANU) PGAS Paradigm 02 Nov / 90
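
As a minimal sketch of the idea (assuming a multi-locale Chapel run; the Block distribution and the on-statement used here are introduced on later slides), a globally addressable array is partitioned across locales, and a task running on one locale can read or write elements whose affinity is to another locale:

use BlockDist;

config const n = 8;
const D = {0..#n} dmapped Block(boundingBox={0..#n});  // indices partitioned across locales
var A: [D] int;                                        // one global array, distributed storage

on Locales[numLocales-1] {     // run this block on the last locale
  A[0] = 42;                   // remote write: element 0 has affinity to locale 0
  writeln("wrote A[0] from locale ", here.id);
}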

5 Introduction to the PGAS Paradigm and Chapel Chapel: Design Principles Cray High Performance Language originally developed under DARPA High Productivity Computing Systems program Targeted at massively parallel computers object-oriented (Java-like syntax, but influenced by ZPL & HPF) supports exploratory programming implicit (statically-inferable) types, run-time settable parameters (config), implicit main and module wrappings multiresolution design: build higher-level concepts in terms of lower Fork-join, not SPMD Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

6 Introduction to the PGAS Paradigm and Chapel Chapel: Language Primitives Task parallelism: concurrent loops and blocks (cobegin, coforall) Data parallelism: Concurrent map operations (forall) Concurrent fold operations (scan, reduce) Synchronization: Task synchronization, sync variables, atomic sections Locality: locales (UMA places to hold data and run tasks) (index) domains used to specify arrays, iteration ranges distributions (mappings of domains to locales) can drastically reduce code size compared to MPI+X more info on Chapel home page Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

7 Introduction to the PGAS Paradigm and Chapel Chapel: Compile Chain chpl compiler generates standard C code, or uses LLVM backend (Image: Cray Inc.) Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

8 Introduction to the PGAS Paradigm and Chapel Chapel: Base Language variables, constants, parameters: 1 var timestep : int ; 2 param pi: real = 3.14159; 3 config const epsilon = 0.05; // $ ./myprogram --epsilon=0.01 records: 1 record Vector3D { 2 var x, y, z: real ; 3 } 4 var pos = new Vector3D (0.0, 1.0, -1.5) ; 5 pos. x = 2.0; 6 var copy = pos ; // copied by value classes: 1 class Person { 2 var firstname, surname : string ; 3 var age : int ; 4 } 5 var patsy = new Person (" Patricia ", " Stone ", 39) ; Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

9 Introduction to the PGAS Paradigm and Chapel Chapel: Base Language (2) procedures, type inference, generic methods: 1 proc square (n) { 2 return n * n; 3 } 4 5 var x = 2; 6 var x2 = square ( x); 7 writeln (x2, ": ",x2. type : string ); // 4: int (64) 8 9 var y = 0.5; 10 var y2 = square ( y); 11 writeln (y2, ": ",y2. type : string ); // 0.25: real (64) Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

10 Introduction to the PGAS Paradigm and Chapel Chapel: Base Language (3) iterators: 1 iter triangle ( n) { 2 var current = 0; 3 for i in 1.. n { 4 current += i; 5 yield current ; 6 } 7 } tuples, zippered iteration: 1 config const n = 10; 2 for (i, t) in zip (0..# n, triangle ( n)) do 3 writeln (" triangle number ", i, " is ", t); Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

11 Introduction to the PGAS Paradigm and Chapel Chapel: Task Parallelism task creation: 1 begin dostuff (); // spawn task and don't wait 2 cobegin { 3 dostuff1 (); 4 dostuff2 (); 5 } // wait for completion of all statements in the block synchronisation variables: 1 var a$: sync int ; 2 begin a$ = foo (); 3 var c = 2 * a$; // suspend until a$ is assigned Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

12 Introduction to the PGAS Paradigm and Chapel Chapel: Synchronization Variables single variables can only be written once; sync variables are reset to empty when read. 1 var item$ : sync int ; 2 proc produce () { 3 for i in 0..# N do 4 item$ = i; 5 } 6 proc consume () { 7 for i in 0..# N { 8 var x = item$ ; 9 writeln (x); 10 } 11 } begin produce (); 14 begin consume (); 1 var latch$ : single bool ; 2 proc await () { 3 latch$ ; 4 } 5 proc release () { 6 latch$ = true ; 7 } 8 9 begin await (); 10 begin release (); Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

13 Introduction to the PGAS Paradigm and Chapel Chapel: Task Parallelism Example Fibonacci numbers: 1 proc fib ( n): int { 2 if n <= 2 then 3 return 1; 4 var t1$ : single int ; 5 var t2: int ; 6 begin t1$ = fib (n -1) ; 7 t2 = fib (n -2) ; 8 // wait for t1$ 9 return t1$ + t2; 10 } 1 proc fib ( n): int { 2 if n <= 2 then 3 return 1; 4 var t1$, t2$ : single int ; 5 cobegin { 6 t1$ = fib (n -1) ; 7 t2$ = fib (n -2) ; 8 } 9 // wait for t1$ and t2$ 10 return t1$ + t2$ ; 11 } Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

14 Introduction to the PGAS Paradigm and Chapel Chapel: Data Parallelism ranges: 1 var r1 = 0..3; // 0, 1, 2, 3 2 var r2 = 0..#10 by 2; // 0, 2, 4, 6, 8 arrays, data parallel loops: 1 var A, B: [0..# N] real ; 2 forall i in 0..# N do // cf. coforall 3 A(i) = A(i) + B(i); scalar promotion: 1 A = A + B; Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

15 Introduction to the PGAS Paradigm and Chapel Chapel: Data Parallelism (2) example: DAXPY 1 config const alpha = 3.0; 2 const MyRange = 0..# N; 3 proc daxpy ( x: [ MyRange ] real, y: [ MyRange ] real ) { 4 forall i in MyRange do 5 y( i) = alpha * x( i) + y( i); 6 } Alternatively, via promotion, the forall loop can be replaced by: y = alpha * x + y; reductions and scans: 1 var mx = ( max reduce A); 2 A = (+ scan A); // prefix sum of A - parallel? the target of data parallelism could be SIMD, GPU or normal threads (currently no way to express this) Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

16 Introduction to the PGAS Paradigm and Chapel Chapel: forall vs. coforall Use forall when iterations may be executed in parallel Use coforall when iterations must be executed in parallel What s wrong with this code? 1 var a$: [0..# N] single int ; 2 forall i in {0..# N} { 3 if i < (N -1) then 4 a$[ i] = a$[ i +1] - 1; 5 else 6 a$[ i] = N; 7 var result = a$[ i]; 8 writeln ( result ); 9 } Computer Systems (ANU) PGAS Paradigm 02 Nov / 90
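
For reference, a hedged reading of the problem (the slide leaves it as an exercise): iteration i blocks reading a$[i+1], which only iteration i+1 can fill, so the iterations must all run concurrently; forall gives no such guarantee (a single task may execute several iterations in order) and can deadlock, whereas coforall creates one task per iteration:

var a$: [0..#N] single int;
coforall i in 0..#N {          // one task per iteration, so the dependence chain can resolve
  if i < (N-1) then
    a$[i] = a$[i+1] - 1;       // blocks until task i+1 has written a$[i+1]
  else
    a$[i] = N;
  var result = a$[i];
  writeln(result);
}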

17 Introduction to the PGAS Paradigm and Chapel Chapel: Task Intents constant (default): 1 config const N = 10; 2 var race : int ; 3 coforall i in 0..# N do 4 race += 1; // illegal! reference: 1 var deliberaterace : int ; 2 coforall i in 0..# N with ( ref deliberaterace ) do 3 deliberaterace += 1; reduce: 1 var sum : int ; 2 coforall i in 0..# N with (+ reduce sum ) do 3 sum += i; Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

18 Introduction to the PGAS Paradigm and Chapel Chapel: Domains domain: an index set, can be used to declare arrays dense (rectangular): a tensor product of ranges, e.g. 1 config const M = 5, N = 7; 2 const D: domain (2) = {0..# M, 0..# N}; strided: 1 const D1 = {0..# M by 4, 0..# N by 2}; Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

19 Introduction to the PGAS Paradigm and Chapel Chapel: Domains (2) sparse: 1 const SparseD : sparse subdomain ( D) 2 = ((0,0), (1,2), (3,2), (4,4) ); associative: 1 var Colours : domain ( string ) = {" Black ", " Yellow ", " Red "}; Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

20 Introduction to the PGAS Paradigm and Chapel Chapel: Locales locale: a unit of the target architecture: processing elements with (uniform) local memory 1 const Locales : [0..# numLocales ] locale =... ; // built - in 2 on Locales [1] do 3 foo (); 4 coforall ( loc, id) in zip ( Locales, 1..) do 5 on loc do // migrates this task to loc 6 coforall tid in 0..# numTasks do 7 writeln (" Task ", id, " thread ", tid, " on ", loc ); Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

21 Introduction to the PGAS Paradigm and Chapel Chapel: Domain Maps use domain maps to map indices in a domain to locales: 1 use CyclicDist ; 2 const Dist = new dmap ( 3 new Cyclic ( startIdx = 1, targetLocales = Locales [0..1]) ); 4 const D = {0..# N} dmapped Dist ; 5 var x, y: [ D] real ; Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

22 Introduction to the PGAS Paradigm and Chapel Chapel: Domain Maps (2) block: 1 use BlockDist ; 2 const space1d = {0..# N}; 3 const B = space1d dmapped Block ( boundingBox = space1d ); Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

23 Introduction to the PGAS Paradigm and Chapel Hands-on Exercise: Locales in Chapel Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

24 Chapel Programming Strategies for Distributed Memory Outline 1 Introduction to the PGAS Paradigm and Chapel 2 Chapel Programming Strategies for Distributed Memory 3 Runtime Support for PGAS 4 MPI One-Sided Communications 5 Fault Tolerance Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

25 Chapel Programming Strategies for Distributed Memory Chapel: Programming Strategies Think globally, compute locally Define key data structures arrays domains Specify distribution and layout domain maps Exploit parallelism over the available hardware (co-)forall (co-)begin Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

26 Chapel Programming Strategies for Distributed Memory Chapel: Matrix Multiplication start with sequential matrix multiplication: 1 proc matmul ( const ref A, const ref B, C) { 2 for (m, n) in C. domain { 3 var c = 0.0; 4 for k in A. domain. dim (2) do 5 c += A[m, k] * B[k, n]; 6 C[m, n] = c; 7 } 8 } 9 config const M = 4, K = 4, N = 4; 10 var A: [0..#M,0..# K] real ; 11 var B: [0..#K,0..# N] real ; 12 var C: [0..#M,0..# N] real ; 13 matmul (A, B, C); (diagram of the A, B and C index spaces omitted) Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

27 Chapel Programming Strategies for Distributed Memory Chapel: Performance Timing one way to measure elapsed time: 1 var timer : Timer ; 2 timer. start (); 3 matmul (A, B, C); 4 timer. stop (); 5 var timeMillis = timer. elapsed ( TimeUnits. milliseconds ); 6 writef (" Serial Multiply M=%i, N=%i, K=%i took %7.3dr ms (%7.3dr GFLOP/s)\n", M, N, K, timeMillis, 2*M*K*N/1e6/ timeMillis ); Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

28 Chapel Programming Strategies for Distributed Memory Chapel: Matrix Multiplication parallel, single locale: 1 proc parmatmul ( const ref A, const ref B, C) { 2 forall (m, n) in C. domain { 3 var c = 0.0; 4 for k in A. domain. dim (2) do 5 c += A[m, k] * B[k, n]; 6 C[m, n] = c; 7 } 8 } Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

29 Chapel Programming Strategies for Distributed Memory Chapel: Matrix Multiplication parallel, distributed: 1 const rows = reshape ( Locales, {0..# numLocales, 0..0}) ; 2 const cols = reshape ( Locales, {0..0, 0..# numLocales }); 3 const spacea = {0..#M, 0..# K}; 4 const da: domain (2) dmapped Block ( spacea, rows ) = spacea ; 5 const spaceb = {0..#K, 0..# N}; 6 const db: domain (2) dmapped Block ( spaceb, cols ) = spaceb ; 7 const spacec = {0..#M, 0..# N}; 8 const dc: domain (2) dmapped Block ( spacec, rows ) = spacec ; 9 var blocka : [ da] real ; 10 var blockb : [ db] real ; 11 var blockc : [ dc] real ; 12 parmatmul ( blocka, blockb, blockc ); Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

30 Chapel Programming Strategies for Distributed Memory Chapel: Programming Strategies (continued) Batch communications - avoid fine-grained remote accesses array slicing specialized distributions e.g. StencilDist Overlap computation and communication tasks sync variables Computer Systems (ANU) PGAS Paradigm 02 Nov / 90
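
A small hedged illustration of the array-slicing point (the array names here are illustrative, not from the slides): assigning whole slices lets the runtime move a contiguous block in one aggregated transfer instead of issuing many fine-grained remote accesses:

use BlockDist;

config const n = 1000000;
const D = {0..#n} dmapped Block(boundingBox={0..#n});
var A: [D] real;                 // distributed across all locales
var localBuf: [0..#1000] real;   // a purely local buffer

// one bulk transfer of 1000 elements, rather than 1000 remote reads
localBuf = A[0..#1000];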

31 Chapel Programming Strategies for Distributed Memory Chapel: Further Reading Chapel Web page: Chapel tutorials: Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

32 Chapel Programming Strategies for Distributed Memory Hands-on Exercise: 2D Stencil via Templates Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

33 Runtime Support for PGAS Outline 1 Introduction to the PGAS Paradigm and Chapel 2 Chapel Programming Strategies for Distributed Memory 3 Runtime Support for PGAS 4 MPI One-Sided Communications 5 Fault Tolerance Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

34 MPI One-Sided Communications Outline 1 Introduction to the PGAS Paradigm and Chapel 2 Chapel Programming Strategies for Distributed Memory 3 Runtime Support for PGAS 4 MPI One-Sided Communications 5 Fault Tolerance Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

35 MPI One-Sided Communications Programming Models Each process exposes a part of its memory to the other processes Allow data movement without direct involvement of process that holds the data Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

36 MPI One-Sided Communications Comparison with Two-sided Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

37 MPI One-Sided Communications It s all about Memory Consistency Remember this from the shared memory course? Memory consistency concerns how memory behaves with respect to read and write operations from multiple processors Sequential consistency: the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. See: Shared Memory Consistency Models: A Tutorial, Sarita V. Adve Kourosh Gharachorloo Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

38 MPI One-Sided Communications Reference Material Subsequent slides will draw heavily on following material: Overviews of MPI3 William D Gropp: New Features of MPI-3 Fabio Affinito: MPI3 Two detailed lectures on one sided MPI William Gropp: One-sided Communication in MPI William Gropp: More on One Sided Communication Tutorial on MPI 2.2 and 3.0 by Torsten Hoefler Torsten Hoefler: Advanced MPI 2.2 and 3.0 Tutorial Detailed paper on remote memory access programming in MPI-3 T. Hoefler et al., Remote Memory Access Programming in MPI-3. ACM Trans. Parallel Comput. 1, 1, Article 1 (March 2013) Cornell Virtual Workshop on one-sided communication methods Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

39 MPI One-Sided Communications RMA advantages and Issues Advantages Multiple transfers with single synchronization Bypass tag matching Can be faster exploiting underlying hardware support Better able to handle problems where communication pattern is unknown or irregular Issues How to create remote accessible memory Reading, writing and updating remote memory Data synchronisation Memory model Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

40 MPI One-Sided Communications Window Creation Regions of memory that we want to expose to RMA operations are called windows, they can be created in four ways Fabio Affinito: MPI3 Computer Systems (ANU) PGAS Paradigm 02 Nov / 90
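
The referenced figure is not reproduced here; for the record, the four MPI-3 window-creation routines are sketched below (argument names abbreviated):

MPI_Win_create(base, size, disp_unit, info, comm, &win);           /* expose memory the user already allocated */
MPI_Win_allocate(size, disp_unit, info, comm, &base, &win);        /* let MPI allocate the exposed memory */
MPI_Win_allocate_shared(size, disp_unit, info, comm, &base, &win); /* allocate memory shared among ranks on a node */
MPI_Win_create_dynamic(info, comm, &win);                          /* empty window; attach memory later with MPI_Win_attach */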

41 MPI One-Sided Communications Simple Window Creation Fabio Affinito: MPI3 Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

42 MPI One-Sided Communications Data Movement MPI provides operations to read, write and atomically modify remote data MPI_Get MPI_Put MPI_Accumulate MPI_Get_accumulate MPI_Compare_and_swap MPI_Fetch_and_op Fabio Affinito: MPI3 Computer Systems (ANU) PGAS Paradigm 02 Nov / 90
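
The Get/Put calls appear in the example slides later in this section; as a hedged sketch of the atomic-update calls (win, target and disp stand for an existing window, target rank and displacement, and the calls must sit inside a synchronization epoch):

int one = 1, prev;
/* atomically add 1 to the target's counter at displacement disp */
MPI_Accumulate(&one, 1, MPI_INT, target, disp, 1, MPI_INT, MPI_SUM, win);
/* add 1 and fetch the previous value in a single atomic operation */
MPI_Fetch_and_op(&one, &prev, MPI_INT, target, disp, MPI_SUM, win);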

43 MPI One-Sided Communications The Memory Consistency Issue Fabio Affinito: MPI3 Three Synchronization models Fence (active target) Post-start-complete-wait (generalized active target) Lock/Unlock (passive target) Data accesses occur within epochs Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

44 MPI One-Sided Communications Three Synchronization Models Fabio Affinito: MPI3 Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

45 MPI One-Sided Communications Passive Target Synchronization William Gropp: More on One Sided Communication Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

46 MPI One-Sided Communications Completion Model Relaxed memory model, acquire and release Immediate Data Movement Delayed Data Movement Which is best when? William Gropp: More on One Sided Communication Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

47 MPI One-Sided Communications Memory Models The unified memory model is new in MPI-3; what are its advantages? Torsten Hoefler: Advanced MPI 2.2 and 3.0 Tutorial Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

48 MPI One-Sided Communications Separate Semantics Another table for unified semantics Torsten Hoefler: Advanced MPI 2.2 and 3.0 Tutorial Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

49 MPI One-Sided Communications MPI-3 Communication Options T. Hoefler et al., Remote Memory Access Programming in MPI-3. ACM Trans. Parallel Comput. 1, 1, Article 1 Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

50 MPI One-Sided Communications Example Codes Fence Synchronization Post-Start-Complete-Wait Synchronization Lock-Unlock Synchronization Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

51 MPI One-Sided Communications Fence Synchronization
// Start up MPI ...
MPI_Win win;
if (rank == 0) {
    /* Everyone will retrieve from a buffer on root */
    int soi = sizeof(int);
    MPI_Win_create(buf, soi*20, soi, MPI_INFO_NULL, comm, &win);
} else {
    /* Others only retrieve, so these windows can be size 0 */
    MPI_Win_create(NULL, 0, sizeof(int), MPI_INFO_NULL, comm, &win);
}

/* No local operations prior to this epoch, so give an assertion */
MPI_Win_fence(MPI_MODE_NOPRECEDE, win);
if (rank != 0) {
    /* Inside the fence, make RMA calls to GET from rank 0 */
    MPI_Get(buf, 20, MPI_INT, 0, 0, 20, MPI_INT, win);
}

/* Complete the epoch - this will block until MPI_Get is complete */
MPI_Win_fence(0, win);
/* All done with the window - tell MPI there are no more epochs */
MPI_Win_fence(MPI_MODE_NOSUCCEED, win);
/* Free up our window */
MPI_Win_free(&win);
// shut down ...
Source: Cornell Virtual Workshop: Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

52 MPI One-Sided Communications Post-Start-Complete-Wait Synchronization
// Start up MPI ...
MPI_Group comm_group, group;

for (i = 0; i < 3; i++) {
    ranks[i] = i;   /* For forming groups, later */
}
MPI_Comm_group(MPI_COMM_WORLD, &comm_group);

/* Create new window for this comm */
if (rank == 0) {
    MPI_Win_create(buf, sizeof(int)*3, sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
} else {
    /* Rank 1 or 2 */
    MPI_Win_create(NULL, 0, sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
}

/* --> continues in next slide --> */
Source: Cornell Virtual Workshop: Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

53 MPI One-Sided Communications Post-Start-Complete-Wait Synchronization (2)
/* Now do the communication epochs */
if (rank == 0) {
    /* Origin group consists of ranks 1 and 2 */
    MPI_Group_incl(comm_group, 2, ranks+1, &group);
    /* Begin the exposure epoch */
    MPI_Win_post(group, 0, win);
    /* Wait for epoch to end */
    MPI_Win_wait(win);
} else {
    /* Target group consists of rank 0 */
    MPI_Group_incl(comm_group, 1, ranks, &group);
    /* Begin the access epoch */
    MPI_Win_start(group, 0, win);
    /* Put into rank==0 according to my rank */
    MPI_Put(buf, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
    /* Terminate the access epoch */
    MPI_Win_complete(win);
}

/* Free window and groups */
MPI_Win_free(&win);
MPI_Group_free(&group);
MPI_Group_free(&comm_group);

// Shut down ...
Source: Cornell Virtual Workshop: Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

54 MPI One-Sided Communications Lock-Unlock Synchronization
// Start up MPI ...
MPI_Win win;

if (rank == 0) {
    /* Rank 0 will be the caller, so null window */
    MPI_Win_create(NULL, 0, 1,
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    /* Request lock of process 1 */
    MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
    MPI_Put(buf, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    /* Block until put succeeds */
    MPI_Win_unlock(1, win);
    /* Free the window */
    MPI_Win_free(&win);
} else {
    /* Rank 1 is the target process */
    MPI_Win_create(buf, 2*sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    /* No sync calls on the target process! */
    MPI_Win_free(&win);
}
Source: Cornell Virtual Workshop: Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

55 MPI One-Sided Communications Case Studies T. Hoefler et al., Remote Memory Access Programming in MPI-3. ACM Trans. Parallel Comput. 1, 1, Article 1 Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

56 MPI One-Sided Communications Hands-on Exercise: The 3 Synchronization Methods Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

57 Fault Tolerance Outline 1 Introduction to the PGAS Paradigm and Chapel 2 Chapel Programming Strategies for Distributed Memory 3 Runtime Support for PGAS 4 MPI One-Sided Communications 5 Fault Tolerance Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

58 Fault Tolerance HPC Systems: Fast, Complex and Error Prone Sunway TaihuLight: the fastest supercomputer today (peak Pflop/s) (Image: Top500) 1. Dongarra, Jack. Report on the sunway taihulight system. Tech Report UT-EECS (2016) Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

59 Fault Tolerance The Reliability Challenge in HPC
Reliability Terms:
MTTI: Mean Time To Interrupt
MTTR: Mean Time To Repair
MTBF: Mean Time Between Failures = MTTI + MTTR
Reliability Figures for Terascale Systems:
System                    CPUs      Reliability       Src
LANL ASCI Q               8,192     MTTI: 6.5 hours   [2]
LLNL ASCI White (2003)    8,192     MTBF: 40 hours    [2]
PSC Lemieux               3,016     MTTI: 9.7 hours   [2]
LLNL BlueGene/L           106,496   MTTI: 7-10 days   [3]
2. Feng, Wu-chun. The importance of being low power in high performance computing. (2005)
3. Bronevetsky, Greg, and Adam Moody. Scalable I/O systems via node-local storage: Approaching 1 TB/sec file I/O. (2009)
Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

60 Fault Tolerance A Statistical Study of Failures on HPC Systems Conclusions: First, the failure rate of a system grows proportional to the number of processor chips in the system. Second, there is little indication that systems and their hardware get more reliable over time as technology changes. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

61 Fault Tolerance The Reliability Challenge in HPC Prediction of MTTI with three rates of growth in cores: doubling every 18, 24 and 30 months Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

62 Fault Tolerance Fault Tolerance As HPC systems grow in size, the MTTI shrinks and long-running applications face a higher risk of encountering faults. Faults are generally classified into: hard faults: inhibit process execution and result in data loss. soft faults: undetected bit flips silently corrupting data in disk, memory, or registers. Fault tolerance is the ability to contain faults and reduce their impact. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

63 Fault Tolerance Fault Tolerance Techniques Rollback recovery: Returns the application to an old consistent state. Recomputes previously reached states before the failure. Common technique: Checkpoint/restart Forward recovery: Computation proceeds after a failure without rollback. Requires a fault-aware runtime system (i.e. a runtime system that does not crash upon a failure). Common techniques: Replication Master-Worker ABFT (Algorithmic-Based Fault Tolerance) Or composite techniques e.g. Replication-enhanced checkpointing [4] 4. Ni, Xiang, et al. ACR: Automatic checkpoint/restart for soft and hard error protection. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

64 Fault Tolerance Rollback Recovery Checkpoint/Restart The most widely used fault tolerance mechanism in HPC systems. Requires saving the application state periodically on a reliable storage. Upon a failure, the application restarts from the last consistent checkpoint. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

65 Fault Tolerance Checkpointing Classifications
Coordinated: collective checkpointing; all processes restart; suitable for synchronized computations.
Uncoordinated: processes checkpoint independently; only the failed process restarts; suitable for loosely coupled processes; often requires message logging; vulnerable to the domino effect.
5. Elnozahy, Elmootazbellah Nabil, et al. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys (CSUR) 34.3 (2002)
Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

66 Fault Tolerance Checkpointing Classifications (Cont.)
Disk-based: I/O intensive; applicable to all runtime systems.
Diskless: replaces disk with in-memory replication; applicable to fault-aware systems only; more replicas give more reliability, but also higher failure-free overhead.
Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

67 Fault Tolerance Checkpointing MPI Applications Coordinated disk-based checkpointing is the common mechanism for fault tolerance on HPC platforms. Provided transparently in some MPI implementations (e.g. Intel-MPI: mpirun -chkpoint-interval 100sec -np 100 ./MyApp) Provided outside of MPI by tools that dump a process image to disk, like: BLCR: Berkeley Lab Checkpoint/Restart for Linux DMTCP: Distributed MultiThreaded CheckPointing Or done manually by programmers using file system APIs. Diskless checkpointing is only applicable to fault-aware MPI implementations (like MPI-ULFM, which we cover in the last part of this lecture) Computer Systems (ANU) PGAS Paradigm 02 Nov / 90
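
A minimal sketch of the manual option (names and file layout are illustrative only, not from the slides): each rank dumps its local state to its own file at a coordinated point, so a restart can reload a consistent set of files:

#include <mpi.h>
#include <stdio.h>

/* hypothetical helper: write this rank's state, tagged with the iteration it belongs to */
void write_checkpoint(int rank, int iter, const double *state, size_t n, MPI_Comm comm) {
    char fname[64];
    snprintf(fname, sizeof fname, "ckpt_rank%d.dat", rank);
    FILE *f = fopen(fname, "wb");
    fwrite(&iter, sizeof iter, 1, f);    /* which iteration this checkpoint represents */
    fwrite(state, sizeof *state, n, f);  /* the rank-local application state */
    fclose(f);
    MPI_Barrier(comm);                   /* coordinate: the set is consistent once all ranks are done */
}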

68 Fault Tolerance Checkpoint Interval The checkpoint interval has a crucial impact on performance: A long interval means fewer checkpoints, but more lost work upon a failure. A short interval means more checkpoints, but less lost work upon a failure. Young's formula [6] is often used to compute the optimal checkpoint interval as i = sqrt(2 * t * MTTI), where t is the checkpointing time. The effective application utilization (u) of a system can be computed as [7]: u = 1 - (lost utilization for recovery + lost utilization for checkpointing) = 1 - (i / (2 * MTTI) + t / i) 6. Young, John W. A first order approximation to the optimum checkpoint interval. Communications of the ACM 17.9 (1974) 7. Schroeder, Bianca, and Garth A. Gibson. Understanding failures in petascale computers. Journal of Physics (2007). Computer Systems (ANU) PGAS Paradigm 02 Nov / 90
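
A small worked example (the numbers are purely illustrative, not from the slides): with a 10-minute checkpoint time and a 24-hour MTTI, the formulas above give an interval of roughly 2.8 hours and about 88% effective utilization:

#include <math.h>
#include <stdio.h>

int main(void) {
    double t = 600.0;                 /* checkpointing time: 10 minutes (illustrative) */
    double mtti = 24.0 * 3600.0;      /* MTTI: 24 hours (illustrative) */
    double i = sqrt(2.0 * t * mtti);  /* Young's optimal checkpoint interval */
    double u = 1.0 - (i / (2.0 * mtti) + t / i);   /* effective utilization */
    printf("interval = %.0f s (%.1f h), utilization = %.2f\n", i, i / 3600.0, u);
    return 0;
}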

69 Fault Tolerance Projected System Utilization with C/R Effective application utilization over time Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

70 Fault Tolerance Forward Recovery Replication executes one or more replicas of each process on independent nodes when a replica is lost, another replica takes over without rollback Replication in message passing systems: the message ordering challenge used as a detection and correction mechanism for silent data corruption errors. despite its expensive resource requirements, recent studies [8,9] suggest that replication can be a viable alternative for checkpointing on extreme scale systems with short MTTI. 8. Ropars, Thomas, et al. Efficient Process Replication for MPI Applications: Sharing Work between Replicas. IPDPS Ferreira, Kurt, et al. Evaluating the viability of process replication reliability for exascale systems. SC Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

71 Fault Tolerance Forward Recovery Master-Worker Worker failure: can be tolerated without rollback by assigning the tasks of the failed worker to another worker Master failure: can be tolerated using replication or checkpointing. Because the probability of a master failure is constant (as it does not depend on the scale of the application), it is often more efficient to consider the master failure as a fatal error. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

72 Fault Tolerance Forward Recovery Algorithmic-Based Fault Tolerance the design of custom recovery mechanisms based on expert-knowledge of special algorithm properties (e.g. available data redundancy, the ability to approximate lost data from remaining data,... ). for example: using redundant data to recover lost sub-grids in PDE solvers that use the Sparse Grid Combination Technique (SGCT) [10]: (Image: Ali, et al. 2016) 10. Ali, Md Mohsin, et al. Complex scientific applications made fault-tolerant with the sparse grid combination technique. IJHPCA 30.3 (2016): Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

73 Fault Tolerance MPI and Fault Tolerance The MPI standard does not specify the behaviour of MPI when ranks fail. Most implementations terminate the application as a result of rank failure. Users rely on coordinated disk-based checkpointing because it does not require fault tolerance support from MPI. However, the time to checkpoint an application with a large memory footprint can exceed the MTTI on large systems, making coordinated disk-based checkpointing inapplicable at large scale. User-level fault tolerance techniques can deliver better performance; however, they require fault tolerance support from MPI. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

74 Fault Tolerance MPI and Fault Tolerance (Cont.) MPI User Level Failure Mitigation A proposal by the MPI Forum's Fault Tolerance Working Group to add fault tolerance semantics to MPI. Under assessment by the MPI Forum to be part of the coming MPI-4 standard. A reference implementation of ULFM is available based on Open MPI 1.7. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

75 Fault Tolerance MPI User Level Failure Mitigation In the following, we cover the following aspects of MPI-ULFM: Error Handling Failure Notification Failure Mitigation Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

76 Fault Tolerance Error Handling (1/2) In standard MPI: most MPI interfaces return an error code (e.g. 0=MPI SUCCESS, 1=MPI ERR BUFFER,... ) we can set an error handler to a communicator using: MPI Comm set errhandler there are two predefined error handlers: MPI ERRORS ARE FATAL: terminates MPI (the default). MPI ERRORS RETURN: returns an error code to the caller. The user can also define a customized error handler, as follows: 1 /* User s error handling function */ 2 void errorcallback ( MPI_Comm * comm, int * errcode,...) { } 3 4 /* Changing the communicator s error handler */ 5 MPI_Errhandler handler ; 6 MPI_Comm_create_errhandler ( errorcallback, & handler ); 7 MPI_Comm_set_errhandler ( MPI_COMM_WORLD, handler ); Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

77 Fault Tolerance Error Handling (2/2) ULFM uses the same error handling mechanism as standard MPI It adds new error codes to report process failure events: 54=MPI_ERR_PROC_FAILED 55=MPI_ERR_PROC_FAILED_PENDING 56=MPI_ERR_REVOKED The default error handler MPI_ERRORS_ARE_FATAL must not be used. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

78 Fault Tolerance Failure Notification (1/3) Process failure errors are raised only in MPI operations that involve a failed rank. Point-to-point operations Using a named rank: Using MPI_ANY_SOURCE: Computer Systems (ANU) PGAS Paradigm 02 Nov / 90
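
The figures following the two bullets are not reproduced; as a hedged sketch of the named-rank case (error-code names follow the ULFM proposal as used in the skeleton code later in this lecture; buf, partner and comm are assumed to exist):

/* with a non-fatal error handler installed, a receive that involves a dead rank reports it */
int rc = MPI_Recv(buf, 1, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);
if (rc == MPI_ERR_PROC_FAILED) {
    /* the named partner has failed: start recovery instead of crashing */
}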

79 Fault Tolerance Failure Notification (2/3) Process failure errors are raised only in MPI operations that involve a failed rank. Collective operations: some live processes may raise an error, while others return successfully. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

80 Fault Tolerance Failure Notification (3/3) Process failure errors are raised only in MPI operations that involve a failed rank. Non-blocking operations: error reporting is postponed to the corresponding completion function (e.g. MPI Wait, MPI Test). Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

81 Fault Tolerance Failure Mitigation Interfaces (1/2) MPI_Comm_failure_ack( comm ) a local operation that acknowledges all detected failures on the communicator. Its purpose is to silence process failure errors in future MPI_ANY_SOURCE calls that involve an acknowledged process failure. MPI_Comm_failure_get_acked( comm, failedgrp ) returns the group of failed ranks that were already acknowledged Computer Systems (ANU) PGAS Paradigm 02 Nov / 90
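
A short usage sketch of the two calls (interface names as given on the slide; the ULFM reference implementation spells them with an MPIX_ prefix):

MPI_Group failed;
int nfailed;

MPI_Comm_failure_ack(comm);                  /* acknowledge everything detected so far */
MPI_Comm_failure_get_acked(comm, &failed);   /* group of the acknowledged failed ranks */
MPI_Group_size(failed, &nfailed);            /* e.g. count how many ranks have died */
MPI_Group_free(&failed);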

82 Fault Tolerance Failure Mitigation Interfaces (2/2) MPI_Comm_revoke( comm ) a local operation that invalidates the communicator any future communication on a revoked communicator fails with error MPI_ERR_REVOKED live ranks must collectively create a new communicator MPI_Comm_shrink( oldcomm, newcomm ) a collective operation that creates a new communicator that excludes the dead ranks in the old communicator like other collectives, it may succeed at some ranks and fail at others. MPI_Comm_agree( oldcomm, flag ) a collective operation for participants to agree on some value. the only collective operation that returns the same result to all participants. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

83 Fault Tolerance Resilient Iterative Application Skeleton Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

84 Fault Tolerance
#define CKPT_INTERVAL 10   /* the checkpointing interval */
#define MAX_ITER 100       /* maximum no. of iters */

MPI_Comm world;            /* the working communicator */
int nprocs;                /* communicator size */
int rank;                  /* my rank */
bool restart;              /* restart flag */

void compute();            /* executes the iterative computation,
                            * and orchestrates checkpoint/restart */

int runiter(int i);        /* runs a single iteration,
                            * and returns the MPI error code
                            * of the last MPI call */

void shrinkworld();        /* shrinks a failed communicator,
                            * and sets the new rank and nprocs */

void errorcallback(MPI_Comm *comm, int *rc, ...);
                           /* the communicator's error handler */

void writeckpt();          /* creates a new checkpoint */

int readckpt();            /* loads the last checkpoint, and
                            * returns the corresponding iteration */
Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

85 Fault Tolerance
void main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    /* the initial world state */
    world = MPI_COMM_WORLD;
    MPI_Comm_rank(world, &rank);
    MPI_Comm_size(world, &nprocs);

    /* setting the error handler */
    MPI_Errhandler errhandler;
    MPI_Comm_create_errhandler(errorcallback, &errhandler);
    MPI_Comm_set_errhandler(world, errhandler);

    compute();

    MPI_Finalize();
}
Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

86 Fault Tolerance
/* orchestrates the iterative processing and C/R */
void compute() {
    int rc;              /* holds MPI return codes */
    restart = false;     /* set to true only in errorcallback() */
    int i = 0;           /* current iteration number */
    do {
        if (restart) {
            i = readckpt();
            rc = MPI_Comm_agree(world, &i);
            if (rc != MPI_SUCCESS)
                continue;
            restart = false;
        }
        while (i < MAX_ITER) {
            rc = runiter(i);
            if (rc != MPI_SUCCESS)
                break;   /* jump to the outer loop to restart */

            if (i % CKPT_INTERVAL == 0)
                writeckpt();

            i++;
        }
    } while (restart || i < MAX_ITER);
}
Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

87 Fault Tolerance
/* a callback function to handle MPI errors */
void errorcallback(MPI_Comm *comm, int *errcode, ...) {
    if (*errcode != MPI_ERR_PROC_FAILED &&
        *errcode != MPI_ERR_PROC_FAILED_PENDING &&
        *errcode != MPI_ERR_COMM_REVOKED) {
        /* We only tolerate process failure errors */
        MPI_Abort(*comm, -1);
    }

    /* acknowledge the detected failures */
    MPI_Comm_failure_ack(*comm);

    if (*errcode != MPI_ERR_COMM_REVOKED) {
        /* propagate the failure to other ranks */
        MPI_Comm_revoke(*comm);
    }

    /* all live ranks must reach this point
       to collectively shrink the communicator */
    shrinkworld();

    restart = true;
}
Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

88 Fault Tolerance
/* Creates a new communicator for the application,
   excluding dead ranks in the old (revoked) communicator */
void shrinkworld() {
    int rc;              /* shrink return code */
    MPI_Comm newcomm;
    do {
        rc = MPI_Comm_shrink(world, &newcomm);
        MPI_Comm_agree(newcomm, &rc);
    } while (rc != MPI_SUCCESS);

    /* update the communicator */
    world = newcomm;

    /* update my rank and nprocs */
    MPI_Comm_rank(world, &rank);
    MPI_Comm_size(world, &nprocs);
}
Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

89 Fault Tolerance Fault Tolerance: Summary Topics covered today: the decreasing reliability of HPC systems as they grow larger fault tolerance techniques (C/R, Replication, Master-Worker, ABFT) the MPI-ULFM proposal for adding fault tolerance support to MPI Acknowledgement: The fault tolerance part of today's lecture is influenced by materials from the SC'16 tutorial Fault Tolerance for HPC: Theory and Practice Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

90 Fault Tolerance Hands-on Exercise: Checkpointing and ULFM Computer Systems (ANU) PGAS Paradigm 02 Nov / 90


More information

Parallel Programming: Background Information

Parallel Programming: Background Information 1 Parallel Programming: Background Information Mike Bailey mjb@cs.oregonstate.edu parallel.background.pptx Three Reasons to Study Parallel Programming 2 1. Increase performance: do more work in the same

More information

EReinit: Scalable and Efficient Fault-Tolerance for Bulk-Synchronous MPI Applications

EReinit: Scalable and Efficient Fault-Tolerance for Bulk-Synchronous MPI Applications EReinit: Scalable and Efficient Fault-Tolerance for Bulk-Synchronous MPI Applications Sourav Chakraborty 1, Ignacio Laguna 2, Murali Emani 2, Kathryn Mohror 2, Dhabaleswar K (DK) Panda 1, Martin Schulz

More information

Bulk Synchronous and SPMD Programming. The Bulk Synchronous Model. CS315B Lecture 2. Bulk Synchronous Model. The Machine. A model

Bulk Synchronous and SPMD Programming. The Bulk Synchronous Model. CS315B Lecture 2. Bulk Synchronous Model. The Machine. A model Bulk Synchronous and SPMD Programming The Bulk Synchronous Model CS315B Lecture 2 Prof. Aiken CS 315B Lecture 2 1 Prof. Aiken CS 315B Lecture 2 2 Bulk Synchronous Model The Machine A model An idealized

More information

CS 426. Building and Running a Parallel Application

CS 426. Building and Running a Parallel Application CS 426 Building and Running a Parallel Application 1 Task/Channel Model Design Efficient Parallel Programs (or Algorithms) Mainly for distributed memory systems (e.g. Clusters) Break Parallel Computations

More information

Chapel Hierarchical Locales

Chapel Hierarchical Locales Chapel Hierarchical Locales Greg Titus, Chapel Team, Cray Inc. SC14 Emerging Technologies November 18 th, 2014 Safe Harbor Statement This presentation may contain forward-looking statements that are based

More information

Parallel Programming with Coarray Fortran

Parallel Programming with Coarray Fortran Parallel Programming with Coarray Fortran SC10 Tutorial, November 15 th 2010 David Henty, Alan Simpson (EPCC) Harvey Richardson, Bill Long, Nathan Wichmann (Cray) Tutorial Overview The Fortran Programming

More information

A brief introduction to OpenMP

A brief introduction to OpenMP A brief introduction to OpenMP Alejandro Duran Barcelona Supercomputing Center Outline 1 Introduction 2 Writing OpenMP programs 3 Data-sharing attributes 4 Synchronization 5 Worksharings 6 Task parallelism

More information

Principles of Parallel Algorithm Design: Concurrency and Mapping

Principles of Parallel Algorithm Design: Concurrency and Mapping Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 17 January 2017 Last Thursday

More information

Parallelism paradigms

Parallelism paradigms Parallelism paradigms Intro part of course in Parallel Image Analysis Elias Rudberg elias.rudberg@it.uu.se March 23, 2011 Outline 1 Parallelization strategies 2 Shared memory 3 Distributed memory 4 Parallelization

More information

Lecture V: Introduction to parallel programming with Fortran coarrays

Lecture V: Introduction to parallel programming with Fortran coarrays Lecture V: Introduction to parallel programming with Fortran coarrays What is parallel computing? Serial computing Single processing unit (core) is used for solving a problem One task processed at a time

More information

Lecture 7: More about MPI programming. Lecture 7: More about MPI programming p. 1

Lecture 7: More about MPI programming. Lecture 7: More about MPI programming p. 1 Lecture 7: More about MPI programming Lecture 7: More about MPI programming p. 1 Some recaps (1) One way of categorizing parallel computers is by looking at the memory configuration: In shared-memory systems

More information

Lesson 1. MPI runs on distributed memory systems, shared memory systems, or hybrid systems.

Lesson 1. MPI runs on distributed memory systems, shared memory systems, or hybrid systems. The goals of this lesson are: understanding the MPI programming model managing the MPI environment handling errors point-to-point communication 1. The MPI Environment Lesson 1 MPI (Message Passing Interface)

More information

Introduction to parallel computing concepts and technics

Introduction to parallel computing concepts and technics Introduction to parallel computing concepts and technics Paschalis Korosoglou (support@grid.auth.gr) User and Application Support Unit Scientific Computing Center @ AUTH Overview of Parallel computing

More information

Parallel Programming. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Parallel Programming. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Parallel Programming Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Challenges Difficult to write parallel programs Most programmers think sequentially

More information

Introducing Task-Containers as an Alternative to Runtime Stacking

Introducing Task-Containers as an Alternative to Runtime Stacking Introducing Task-Containers as an Alternative to Runtime Stacking EuroMPI, Edinburgh, UK September 2016 Jean-Baptiste BESNARD jbbesnard@paratools.fr Julien ADAM, Sameer SHENDE, Allen MALONY (ParaTools)

More information

OpenMP I. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen

OpenMP I. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen OpenMP I Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS16/17 OpenMP References Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press,

More information

Lecture 28: Introduction to the Message Passing Interface (MPI) (Start of Module 3 on Distribution and Locality)

Lecture 28: Introduction to the Message Passing Interface (MPI) (Start of Module 3 on Distribution and Locality) COMP 322: Fundamentals of Parallel Programming Lecture 28: Introduction to the Message Passing Interface (MPI) (Start of Module 3 on Distribution and Locality) Mack Joyner and Zoran Budimlić {mjoyner,

More information

Multithreaded Algorithms Part 1. Dept. of Computer Science & Eng University of Moratuwa

Multithreaded Algorithms Part 1. Dept. of Computer Science & Eng University of Moratuwa CS4460 Advanced d Algorithms Batch 08, L4S2 Lecture 11 Multithreaded Algorithms Part 1 N. H. N. D. de Silva Dept. of Computer Science & Eng University of Moratuwa Announcements Last topic discussed is

More information

Parallel Computing Why & How?

Parallel Computing Why & How? Parallel Computing Why & How? Xing Cai Simula Research Laboratory Dept. of Informatics, University of Oslo Winter School on Parallel Computing Geilo January 20 25, 2008 Outline 1 Motivation 2 Parallel

More information

Lecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6)

Lecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6) Lecture: Consistency Models, TM Topics: consistency models, TM intro (Section 5.6) 1 Coherence Vs. Consistency Recall that coherence guarantees (i) that a write will eventually be seen by other processors,

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

OpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico.

OpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico. OpenMP and MPI Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico November 16, 2011 CPD (DEI / IST) Parallel and Distributed Computing 18

More information

Overview: The OpenMP Programming Model

Overview: The OpenMP Programming Model Overview: The OpenMP Programming Model motivation and overview the parallel directive: clauses, equivalent pthread code, examples the for directive and scheduling of loop iterations Pi example in OpenMP

More information

Parallel Paradigms & Programming Models. Lectured by: Pham Tran Vu Prepared by: Thoai Nam

Parallel Paradigms & Programming Models. Lectured by: Pham Tran Vu Prepared by: Thoai Nam Parallel Paradigms & Programming Models Lectured by: Pham Tran Vu Prepared by: Thoai Nam Outline Parallel programming paradigms Programmability issues Parallel programming models Implicit parallelism Explicit

More information

Techniques to improve the scalability of Checkpoint-Restart

Techniques to improve the scalability of Checkpoint-Restart Techniques to improve the scalability of Checkpoint-Restart Bogdan Nicolae Exascale Systems Group IBM Research Ireland 1 Outline A few words about the lab and team Challenges of Exascale A case for Checkpoint-Restart

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical

More information

Task: a unit of parallel work in a Chapel program all Chapel parallelism is implemented using tasks

Task: a unit of parallel work in a Chapel program all Chapel parallelism is implemented using tasks Task: a unit of parallel work in a Chapel program all Chapel parallelism is implemented using tasks Thread: a system-level concept for executing tasks not exposed in the language sometimes exposed in the

More information

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on

More information

CS 470 Spring Parallel Languages. Mike Lam, Professor

CS 470 Spring Parallel Languages. Mike Lam, Professor CS 470 Spring 2017 Mike Lam, Professor Parallel Languages Graphics and content taken from the following: http://dl.acm.org/citation.cfm?id=2716320 http://chapel.cray.com/papers/briefoverviewchapel.pdf

More information

The Parallel Boost Graph Library spawn(active Pebbles)

The Parallel Boost Graph Library spawn(active Pebbles) The Parallel Boost Graph Library spawn(active Pebbles) Nicholas Edmonds and Andrew Lumsdaine Center for Research in Extreme Scale Technologies Indiana University Origins Boost Graph Library (1999) Generic

More information

Linear Algebra Programming Motifs

Linear Algebra Programming Motifs Linear Algebra Programming Motifs John G. Lewis Cray Inc. (retired) March 2, 2011 Programming Motifs 1, 2 & 9 Dense Linear Algebra Graph Algorithms (and Sparse Matrix Reordering) (2) SIAM CSE 11 Features

More information

OpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico.

OpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico. OpenMP and MPI Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico November 15, 2010 José Monteiro (DEI / IST) Parallel and Distributed Computing

More information

Task: a unit of parallel work in a Chapel program all Chapel parallelism is implemented using tasks

Task: a unit of parallel work in a Chapel program all Chapel parallelism is implemented using tasks Task: a unit of parallel work in a Chapel program all Chapel parallelism is implemented using tasks main() is the only task when execution begins Thread: a system-level concept that executes tasks not

More information

CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable)

CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) Past & Present Have looked at two constraints: Mutual exclusion constraint between two events is a requirement that

More information

Primary-Backup Replication

Primary-Backup Replication Primary-Backup Replication CS 240: Computing Systems and Concurrency Lecture 7 Marco Canini Credits: Michael Freedman and Kyle Jamieson developed much of the original material. Simplified Fault Tolerance

More information

Parallel Languages: Past, Present and Future

Parallel Languages: Past, Present and Future Parallel Languages: Past, Present and Future Katherine Yelick U.C. Berkeley and Lawrence Berkeley National Lab 1 Kathy Yelick Internal Outline Two components: control and data (communication/sharing) One

More information

Adaptive Runtime Support

Adaptive Runtime Support Scalable Fault Tolerance Schemes using Adaptive Runtime Support Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu Parallel Programming Laboratory Department of Computer Science University of Illinois at

More information

L17: Introduction to Irregular Algorithms and MPI, cont.! November 8, 2011!

L17: Introduction to Irregular Algorithms and MPI, cont.! November 8, 2011! L17: Introduction to Irregular Algorithms and MPI, cont.! November 8, 2011! Administrative Class cancelled, Tuesday, November 15 Guest Lecture, Thursday, November 17, Ganesh Gopalakrishnan CUDA Project

More information