Partitioned Global Address Space Paradigm ASD Distributed Memory HPC Workshop


1 Partitioned Global Address Space Paradigm ASD Distributed Memory HPC Workshop Computer Systems Group Research School of Computer Science Australian National University Canberra, Australia November 02, 2017

2 Day 4 Schedule Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

3 Introduction to the PGAS Paradigm and Chapel Outline 1 Introduction to the PGAS Paradigm and Chapel 2 Chapel Programming Strategies for Distributed Memory 3 Runtime Support for PGAS 4 MPI One-Sided Communications 5 Fault Tolerance Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

4 Introduction to the PGAS Paradigm and Chapel Partitioned Global Address Space recall the shared memory model: multiple threads with pointers to a global address space in the partitioned global address space (PGAS) model: have multiple threads, each with affinity to some portion of global address space SPMD or fork-join thread creation remote pointers to access data in other partitions the model maps to a cluster with remote memory access can also map to NUMA domains Computer Systems (ANU) PGAS Paradigm 02 Nov / 90
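
As a minimal sketch of the idea (assuming a multi-locale Chapel run; the Block distribution and the on-statement used here are introduced on later slides), a globally addressable array is partitioned across locales, and a task running on one locale can read or write elements whose affinity is to another locale:

use BlockDist;

config const n = 8;
const D = {0..#n} dmapped Block(boundingBox={0..#n});  // indices partitioned across locales
var A: [D] int;                                        // one global array, distributed storage

on Locales[numLocales-1] {     // run this block on the last locale
  A[0] = 42;                   // remote write: element 0 has affinity to locale 0
  writeln("wrote A[0] from locale ", here.id);
}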

5 Introduction to the PGAS Paradigm and Chapel Chapel: Design Principles Cray High Performance Language originally developed under DARPA High Productivity Computing Systems program Targeted at massively parallel computers object-oriented (Java-like syntax, but influenced by ZPL & HPF) supports exploratory programming implicit (statically-inferable) types, run-time settable parameters (config), implicit main and module wrappings multiresolution design: build higher-level concepts in terms of lower Fork-join, not SPMD Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

6 Introduction to the PGAS Paradigm and Chapel Chapel: Language Primitives Task parallelism: concurrent loops and blocks (cobegin, coforall) Data parallelism: Concurrent map operations (forall) Concurrent fold operations (scan, reduce) Synchronization: Task synchronization, sync variables, atomic sections Locality: locales (UMA places to hold data and run tasks) (index) domains used to specify arrays, iteration ranges distributions (mappings of domains to locales) can drastically reduce code size compared to MPI+X more info on Chapel home page Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

7 Introduction to the PGAS Paradigm and Chapel Chapel: Compile Chain chpl compiler generates standard C code, or uses LLVM backend (Image: Cray Inc.) Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

8 Introduction to the PGAS Paradigm and Chapel Chapel: Base Language variables, constants, parameters: 1 var timestep : int ; 2 param pi: real = 3.14159; 3 config const epsilon = 0.05; // $ ./myprogram --epsilon=0.01 records: 1 record Vector3D { 2 var x, y, z: real ; 3 } 4 var pos = new Vector3D (0.0, 1.0, -1.5) ; 5 pos. x = 2.0; 6 var copy = pos ; // copied by value classes: 1 class Person { 2 var firstname, surname : string ; 3 var age : int ; 4 } 5 var patsy = new Person (" Patricia ", " Stone ", 39) ; Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

9 Introduction to the PGAS Paradigm and Chapel Chapel: Base Language (2) procedures, type inference, generic methods: 1 proc square (n) { 2 return n * n; 3 } 4 5 var x = 2; 6 var x2 = square ( x); 7 writeln (x2, ": ",x2. type : string ); // 4: int (64) 8 9 var y = 0.5; 10 var y2 = square ( y); 11 writeln (y2, ": ",y2. type : string ); // 0.25: real (64) Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

10 Introduction to the PGAS Paradigm and Chapel Chapel: Base Language (3) iterators: 1 iter triangle ( n) { 2 var current = 0; 3 for i in 1.. n { 4 current += i; 5 yield current ; 6 } 7 } tuples, zippered iteration: 1 config const n = 10; 2 for (i, t) in zip (0..# n, triangle ( n)) do 3 writeln (" triangle number ", i, " is ", t); Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

11 Introduction to the PGAS Paradigm and Chapel Chapel: Task Parallelism task creation: 1 begin dostuff (); // spawn task and don't wait 2 cobegin { 3 dostuff1 (); 4 dostuff2 (); 5 } // wait for completion of all statements in the block synchronisation variables: 1 var a$: sync int ; 2 begin a$ = foo (); 3 var c = 2 * a$; // suspend until a$ is assigned Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

12 Introduction to the PGAS Paradigm and Chapel Chapel: Synchronization Variables single variables can only be written once; sync variables are reset to empty when read. 1 var item$ : sync int ; 2 proc produce () { 3 for i in 0..# N do 4 item$ = i; 5 } 6 proc consume () { 7 for i in 0..# N { 8 var x = item$ ; 9 writeln (x); 10 } 11 } begin produce (); 14 begin consume (); 1 var latch$ : single bool ; 2 proc await () { 3 latch$ ; 4 } 5 proc release () { 6 latch$ = true ; 7 } 8 9 begin await (); 10 begin release (); Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

13 Introduction to the PGAS Paradigm and Chapel Chapel: Task Parallelism Example Fibonacci numbers: 1 proc fib ( n): int { 2 if n <= 2 then 3 return 1; 4 var t1$ : single int ; 5 var t2: int ; 6 begin t1$ = fib (n -1) ; 7 t2 = fib (n -2) ; 8 // wait for t1$ 9 return t1$ + t2; 10 } 1 proc fib ( n): int { 2 if n <= 2 then 3 return 1; 4 var t1$, t2$ : single int ; 5 cobegin { 6 t1$ = fib (n -1) ; 7 t2$ = fib (n -2) ; 8 } 9 // wait for t1$ and t2$ 10 return t1$ + t2$ ; 11 } Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

14 Introduction to the PGAS Paradigm and Chapel Chapel: Data Parallelism ranges: 1 var r1 = 0..3; // 0, 1, 2, 3 2 var r2 = 0..#10 by 2; // 0, 2, 4, 6, 8 arrays, data parallel loops: 1 var A, B: [0..# N] real ; 2 forall i in 0..# N do // cf. coforall 3 A(i) = A(i) + B(i); scalar promotion: 1 A = A + B; Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

15 Introduction to the PGAS Paradigm and Chapel Chapel: Data Parallelism (2) example: DAXPY 1 config const alpha = 3.0; 2 const MyRange = 0..# N; 3 proc daxpy ( x: [ MyRange ] real, y: [ MyRange ] real ) { 4 forall i in MyRange do 5 y( i) = alpha * x( i) + y( i); 6 } Alternatively, via promotion, the forall loop can be replaced by: y = alpha * x + y; reductions and scans: 1 var mx = ( max reduce A); 2 A = (+ scan A); // prefix sum of A - parallel? the target of data parallelism could be SIMD, GPU or normal threads (currently no way to express this) Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

16 Introduction to the PGAS Paradigm and Chapel Chapel: forall vs. coforall Use forall when iterations may be executed in parallel Use coforall when iterations must be executed in parallel What s wrong with this code? 1 var a$: [0..# N] single int ; 2 forall i in {0..# N} { 3 if i < (N -1) then 4 a$[ i] = a$[ i +1] - 1; 5 else 6 a$[ i] = N; 7 var result = a$[ i]; 8 writeln ( result ); 9 } Computer Systems (ANU) PGAS Paradigm 02 Nov / 90
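
For reference, a hedged reading of the problem (the slide leaves it as an exercise): iteration i blocks reading a$[i+1], which only iteration i+1 can fill, so the iterations must all run concurrently; forall gives no such guarantee (a single task may execute several iterations in order) and can deadlock, whereas coforall creates one task per iteration:

var a$: [0..#N] single int;
coforall i in 0..#N {          // one task per iteration, so the dependence chain can resolve
  if i < (N-1) then
    a$[i] = a$[i+1] - 1;       // blocks until task i+1 has written a$[i+1]
  else
    a$[i] = N;
  var result = a$[i];
  writeln(result);
}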

17 Introduction to the PGAS Paradigm and Chapel Chapel: Task Intents constant (default): 1 config const N = 10; 2 var race : int ; 3 coforall i in 0..# N do 4 race += 1; // illegal! reference: 1 var deliberaterace : int ; 2 coforall i in 0..# N with ( ref deliberaterace ) do 3 deliberaterace += 1; reduce: 1 var sum : int ; 2 coforall i in 0..# N with (+ reduce sum ) do 3 sum += i; Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

18 Introduction to the PGAS Paradigm and Chapel Chapel: Domains domain: an index set, can be used to declare arrays dense (rectangular): a tensor product of ranges, e.g. 1 config const M = 5, N = 7; 2 const D: domain (2) = {0..# M, 0..# N}; strided: 1 const D1 = {0..# M by 4, 0..# N by 2}; Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

19 Introduction to the PGAS Paradigm and Chapel Chapel: Domains (2) sparse: 1 const SparseD : sparse subdomain ( D) 2 = ((0,0), (1,2), (3,2), (4,4) ); associative: 1 var Colours : domain ( string ) = {" Black ", " Yellow ", " Red "}; Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

20 Introduction to the PGAS Paradigm and Chapel Chapel: Locales locale: a unit of the target architecture: processing elements with (uniform) local memory 1 const Locales : [0..# numLocales ] locale =... ; // built - in 2 on Locales [1] do 3 foo (); 4 coforall ( loc, id) in zip ( Locales, 1..) do 5 on loc do // migrates this task to loc 6 coforall tid in 0..# numTasks do 7 writeln (" Task ", id, " thread ", tid, " on ", loc ); Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

21 Introduction to the PGAS Paradigm and Chapel Chapel: Domain Maps use domain maps to map indices in a domain to locales: 1 use CyclicDist ; 2 const Dist = new dmap ( 3 new Cyclic ( startIdx = 1, targetLocales = Locales [0..1]) ); 4 const D = {0..# N} dmapped Dist ; 5 var x, y: [ D] real ; Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

22 Introduction to the PGAS Paradigm and Chapel Chapel: Domain Maps (2) block: 1 use BlockDist ; 2 const space1d = {0..# N}; 3 const B = space1d dmapped Block ( boundingBox = space1d ); Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

23 Introduction to the PGAS Paradigm and Chapel Hands-on Exercise: Locales in Chapel Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

24 Chapel Programming Strategies for Distributed Memory Outline 1 Introduction to the PGAS Paradigm and Chapel 2 Chapel Programming Strategies for Distributed Memory 3 Runtime Support for PGAS 4 MPI One-Sided Communications 5 Fault Tolerance Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

25 Chapel Programming Strategies for Distributed Memory Chapel: Programming Strategies Think globally, compute locally Define key data structures arrays domains Specify distribution and layout domain maps Exploit parallelism over the available hardware (co-)forall (co-)begin Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

26 Chapel Programming Strategies for Distributed Memory Chapel: Matrix Multiplication start with sequential matrix multiplication: 1 proc matmul ( const ref A, const ref B, C) { 2 for (m, n) in C. domain { 3 var c = 0.0; 4 for k in A. domain. dim (2) do 5 c += A[m, k] * B[k, n]; 6 C[m, n] = c; 7 } 8 } 9 config const M = 4, K = 4, N = 4; 10 var A: [0..#M,0..# K] real ; 11 var B: [0..#K,0..# N] real ; 12 var C: [0..#M,0..# N] real ; 13 matmul (A, B, C); (diagram of the A, B and C index spaces omitted) Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

27 Chapel Programming Strategies for Distributed Memory Chapel: Performance Timing one way to measure elapsed time: 1 var timer : Timer ; 2 timer. start (); 3 matmul (A, B, C); 4 timer. stop (); 5 var timeMillis = timer. elapsed ( TimeUnits. milliseconds ); 6 writef (" Serial Multiply M=%i, N=%i, K=%i took %7.3dr ms (%7.3dr GFLOP/s)\n", M, N, K, timeMillis, 2*M*K*N/1e6/ timeMillis ); Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

28 Chapel Programming Strategies for Distributed Memory Chapel: Matrix Multiplication parallel, single locale: 1 proc parmatmul ( const ref A, const ref B, C) { 2 forall (m, n) in C. domain { 3 var c = 0.0; 4 for k in A. domain. dim (2) do 5 c += A[m, k] * B[k, n]; 6 C[m, n] = c; 7 } 8 } Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

29 Chapel Programming Strategies for Distributed Memory Chapel: Matrix Multiplication parallel, distributed: 1 const rows = reshape ( Locales, {0..# numLocales, 0..0}) ; 2 const cols = reshape ( Locales, {0..0, 0..# numLocales }); 3 const spacea = {0..#M, 0..# K}; 4 const da: domain (2) dmapped Block ( spacea, rows ) = spacea ; 5 const spaceb = {0..#K, 0..# N}; 6 const db: domain (2) dmapped Block ( spaceb, cols ) = spaceb ; 7 const spacec = {0..#M, 0..# N}; 8 const dc: domain (2) dmapped Block ( spacec, rows ) = spacec ; 9 var blocka : [ da] real ; 10 var blockb : [ db] real ; 11 var blockc : [ dc] real ; 12 parmatmul ( blocka, blockb, blockc ); Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

30 Chapel Programming Strategies for Distributed Memory Chapel: Programming Strategies (continued) Batch communications - avoid fine-grained remote accesses array slicing specialized distributions e.g. StencilDist Overlap computation and communication tasks sync variables Computer Systems (ANU) PGAS Paradigm 02 Nov / 90
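
A small hedged illustration of the array-slicing point (the array names here are illustrative, not from the slides): assigning whole slices lets the runtime move a contiguous block in one aggregated transfer instead of issuing many fine-grained remote accesses:

use BlockDist;

config const n = 1000000;
const D = {0..#n} dmapped Block(boundingBox={0..#n});
var A: [D] real;                 // distributed across all locales
var localBuf: [0..#1000] real;   // a purely local buffer

// one bulk transfer of 1000 elements, rather than 1000 remote reads
localBuf = A[0..#1000];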

31 Chapel Programming Strategies for Distributed Memory Chapel: Further Reading Chapel Web page: Chapel tutorials: Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

32 Chapel Programming Strategies for Distributed Memory Hands-on Exercise: 2D Stencil via Templates Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

33 Runtime Support for PGAS Outline 1 Introduction to the PGAS Paradigm and Chapel 2 Chapel Programming Strategies for Distributed Memory 3 Runtime Support for PGAS 4 MPI One-Sided Communications 5 Fault Tolerance Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

34 MPI One-Sided Communications Outline 1 Introduction to the PGAS Paradigm and Chapel 2 Chapel Programming Strategies for Distributed Memory 3 Runtime Support for PGAS 4 MPI One-Sided Communications 5 Fault Tolerance Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

35 MPI One-Sided Communications Programming Models Each process exposes a part of its memory to the other processes Allow data movement without direct involvement of process that holds the data Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

36 MPI One-Sided Communications Comparison with Two-sided Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

37 MPI One-Sided Communications It s all about Memory Consistency Remember this from the shared memory course? Memory consistency concerns how memory behaves with respect to read and write operations from multiple processors Sequential consistency: the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. See: Shared Memory Consistency Models: A Tutorial, Sarita V. Adve Kourosh Gharachorloo Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

38 MPI One-Sided Communications Reference Material Subsequent slides will draw heavily on following material: Overviews of MPI3 William D Gropp: New Features of MPI-3 Fabio Affinito: MPI3 Two detailed lectures on one sided MPI William Gropp: One-sided Communication in MPI William Gropp: More on One Sided Communication Tutorial on MPI 2.2 and 3.0 by Torsten Hoefler Torsten Hoefler: Advanced MPI 2.2 and 3.0 Tutorial Detailed paper on remote memory access programming in MPI-3 T. Hoefler et al., Remote Memory Access Programming in MPI-3. ACM Trans. Parallel Comput. 1, 1, Article 1 (March 2013) Cornell Virtual Workshop on one-sided communication methods Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

39 MPI One-Sided Communications RMA advantages and Issues Advantages Multiple transfers with single synchronization Bypass tag matching Can be faster exploiting underlying hardware support Better able to handle problems where communication pattern is unknown or irregular Issues How to create remote accessible memory Reading, writing and updating remote memory Data synchronisation Memory model Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

40 MPI One-Sided Communications Window Creation Regions of memory that we want to expose to RMA operations are called windows, they can be created in four ways Fabio Affinito: MPI3 Computer Systems (ANU) PGAS Paradigm 02 Nov / 90
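
The referenced figure is not reproduced here; for the record, the four MPI-3 window-creation routines are sketched below (argument names abbreviated):

MPI_Win_create(base, size, disp_unit, info, comm, &win);           /* expose memory the user already allocated */
MPI_Win_allocate(size, disp_unit, info, comm, &base, &win);        /* let MPI allocate the exposed memory */
MPI_Win_allocate_shared(size, disp_unit, info, comm, &base, &win); /* allocate memory shared among ranks on a node */
MPI_Win_create_dynamic(info, comm, &win);                          /* empty window; attach memory later with MPI_Win_attach */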

41 MPI One-Sided Communications Simple Window Creation Fabio Affinito: MPI3 Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

42 MPI One-Sided Communications Data Movement MPI provides operations to read, write and atomically modify remote data MPI_Get MPI_Put MPI_Accumulate MPI_Get_accumulate MPI_Compare_and_swap MPI_Fetch_and_op Fabio Affinito: MPI3 Computer Systems (ANU) PGAS Paradigm 02 Nov / 90
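
The Get/Put calls appear in the example slides later in this section; as a hedged sketch of the atomic-update calls (win, target and disp stand for an existing window, target rank and displacement, and the calls must sit inside a synchronization epoch):

int one = 1, prev;
/* atomically add 1 to the target's counter at displacement disp */
MPI_Accumulate(&one, 1, MPI_INT, target, disp, 1, MPI_INT, MPI_SUM, win);
/* add 1 and fetch the previous value in a single atomic operation */
MPI_Fetch_and_op(&one, &prev, MPI_INT, target, disp, MPI_SUM, win);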

43 MPI One-Sided Communications The Memory Consistency Issue Fabio Affinito: MPI3 Three Synchronization models Fence (active target) Post-start-complete-wait (generalized active target) Lock/Unlock (passive target) Data accesses occur within epochs Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

44 MPI One-Sided Communications Three Synchronization Models Fabio Affinito: MPI3 Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

45 MPI One-Sided Communications Passive Target Synchronization William Gropp: More on One Sided Communication Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

46 MPI One-Sided Communications Completion Model Relaxed memory model, acquire and release Immediate Data Movement Delayed Data Movement Which is best when? William Gropp: More on One Sided Communication Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

47 MPI One-Sided Communications Memory Models The unified memory model is new in MPI-3; what are its advantages? Torsten Hoefler: Advanced MPI 2.2 and 3.0 Tutorial Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

48 MPI One-Sided Communications Separate Semantics Another table for unified semantics Torsten Hoefler: Advanced MPI 2.2 and 3.0 Tutorial Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

49 MPI One-Sided Communications MPI-3 Communication Options T. Hoefler et al., Remote Memory Access Programming in MPI-3. ACM Trans. Parallel Comput. 1, 1, Article 1 Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

50 MPI One-Sided Communications Example Codes Fence Synchronization Post-Start-Complete-Wait Synchronization Lock-Unlock Synchronization Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

51 MPI One-Sided Communications Fence Synchronization
// Start up MPI ...
MPI_Win win;
if (rank == 0) {
    /* Everyone will retrieve from a buffer on root */
    int soi = sizeof(int);
    MPI_Win_create(buf, soi*20, soi, MPI_INFO_NULL, comm, &win);
} else {
    /* Others only retrieve, so these windows can be size 0 */
    MPI_Win_create(NULL, 0, sizeof(int), MPI_INFO_NULL, comm, &win);
}

/* No local operations prior to this epoch, so give an assertion */
MPI_Win_fence(MPI_MODE_NOPRECEDE, win);
if (rank != 0) {
    /* Inside the fence, make RMA calls to GET from rank 0 */
    MPI_Get(buf, 20, MPI_INT, 0, 0, 20, MPI_INT, win);
}

/* Complete the epoch - this will block until MPI_Get is complete */
MPI_Win_fence(0, win);
/* All done with the window - tell MPI there are no more epochs */
MPI_Win_fence(MPI_MODE_NOSUCCEED, win);
/* Free up our window */
MPI_Win_free(&win);
// shut down ...
Source: Cornell Virtual Workshop: Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

52 MPI One-Sided Communications Post-Start-Complete-Wait Synchronization
// Start up MPI ...
MPI_Group comm_group, group;

for (i = 0; i < 3; i++) {
    ranks[i] = i;   /* For forming groups, later */
}
MPI_Comm_group(MPI_COMM_WORLD, &comm_group);

/* Create new window for this comm */
if (rank == 0) {
    MPI_Win_create(buf, sizeof(int)*3, sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
} else {
    /* Rank 1 or 2 */
    MPI_Win_create(NULL, 0, sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
}

/* --> continues in next slide --> */
Source: Cornell Virtual Workshop: Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

53 MPI One-Sided Communications Post-Start-Complete-Wait Synchronization (2)
/* Now do the communication epochs */
if (rank == 0) {
    /* Origin group consists of ranks 1 and 2 */
    MPI_Group_incl(comm_group, 2, ranks+1, &group);
    /* Begin the exposure epoch */
    MPI_Win_post(group, 0, win);
    /* Wait for epoch to end */
    MPI_Win_wait(win);
} else {
    /* Target group consists of rank 0 */
    MPI_Group_incl(comm_group, 1, ranks, &group);
    /* Begin the access epoch */
    MPI_Win_start(group, 0, win);
    /* Put into rank==0 according to my rank */
    MPI_Put(buf, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
    /* Terminate the access epoch */
    MPI_Win_complete(win);
}

/* Free window and groups */
MPI_Win_free(&win);
MPI_Group_free(&group);
MPI_Group_free(&comm_group);

// Shut down ...
Source: Cornell Virtual Workshop: Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

54 MPI One-Sided Communications Lock-Unlock Synchronization
// Start up MPI ...
MPI_Win win;

if (rank == 0) {
    /* Rank 0 will be the caller, so null window */
    MPI_Win_create(NULL, 0, 1,
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    /* Request lock of process 1 */
    MPI_Win_lock(MPI_LOCK_SHARED, 1, 0, win);
    MPI_Put(buf, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    /* Block until put succeeds */
    MPI_Win_unlock(1, win);
    /* Free the window */
    MPI_Win_free(&win);
} else {
    /* Rank 1 is the target process */
    MPI_Win_create(buf, 2*sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    /* No sync calls on the target process! */
    MPI_Win_free(&win);
}
Source: Cornell Virtual Workshop: Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

55 MPI One-Sided Communications Case Studies T. Hoefler et al., Remote Memory Access Programming in MPI-3. ACM Trans. Parallel Comput. 1, 1, Article 1 Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

56 MPI One-Sided Communications Hands-on Exercise: The 3 Synchronization Methods Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

57 Fault Tolerance Outline 1 Introduction to the PGAS Paradigm and Chapel 2 Chapel Programming Strategies for Distributed Memory 3 Runtime Support for PGAS 4 MPI One-Sided Communications 5 Fault Tolerance Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

58 Fault Tolerance HPC Systems: Fast, Complex and Error Prone Sunway TaihuLight: the fastest supercomputer today (peak Pflop/s) (Image: Top500) 1. Dongarra, Jack. Report on the sunway taihulight system. Tech Report UT-EECS (2016) Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

59 Fault Tolerance The Reliability Challenge in HPC
Reliability Terms:
MTTI: Mean Time To Interrupt
MTTR: Mean Time To Repair
MTBF: Mean Time Between Failures = MTTI + MTTR
Reliability Figures for Terascale Systems:
System                    CPUs      Reliability       Src
LANL ASCI Q               8,192     MTTI: 6.5 hours   [2]
LLNL ASCI White (2003)    8,192     MTBF: 40 hours    [2]
PSC Lemieux               3,016     MTTI: 9.7 hours   [2]
LLNL BlueGene/L           106,496   MTTI: 7-10 days   [3]
2. Feng, Wu-chun. The importance of being low power in high performance computing. (2005)
3. Bronevetsky, Greg, and Adam Moody. Scalable I/O systems via node-local storage: Approaching 1 TB/sec file I/O. (2009)
Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

60 Fault Tolerance A Statistical Study of Failures on HPC Systems Conclusions: First, the failure rate of a system grows proportional to the number of processor chips in the system. Second, there is little indication that systems and their hardware get more reliable over time as technology changes. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

61 Fault Tolerance The Reliability Challenge in HPC Prediction of MTTI with three rates of growth in cores: doubling every 18, 24 and 30 months Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

62 Fault Tolerance Fault Tolerance As HPC systems grow in size, the MTTI shrinks and long-running applications face a higher risk of encountering faults. Faults are generally classified into: hard faults: inhibit process execution and result in data loss. soft faults: undetected bit flips silently corrupting data in disk, memory, or registers. Fault tolerance is the ability to contain faults and reduce their impact. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

63 Fault Tolerance Fault Tolerance Techniques Rollback recovery: Returns the application to an old consistent state. Recomputes previously reached states before the failure. Common technique: Checkpoint/restart Forward recovery: Computation proceeds after a failure without rollback. Requires a fault-aware runtime system (i.e. a runtime system that does not crash upon a failure). Common techniques: Replication Master-Worker ABFT (Algorithmic-Based Fault Tolerance) Or composite techniques e.g. Replication-enhanced checkpointing [4] 4. Ni, Xiang, et al. ACR: Automatic checkpoint/restart for soft and hard error protection. Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

64 Fault Tolerance Rollback Recovery Checkpoint/Restart The most widely used fault tolerance mechanism in HPC systems. Requires saving the application state periodically on a reliable storage. Upon a failure, the application restarts from the last consistent checkpoint. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

65 Fault Tolerance Checkpointing Classifications
Coordinated: collective checkpointing; all processes restart; suitable for synchronized computations.
Uncoordinated: processes checkpoint independently; only the failed process restarts; suitable for loosely coupled processes; often requires message logging; vulnerable to the domino effect.
5. Elnozahy, Elmootazbellah Nabil, et al. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys (CSUR) 34.3 (2002)
Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

66 Fault Tolerance Checkpointing Classifications (Cont.)
Disk-based: I/O intensive; applicable to all runtime systems.
Diskless: replaces disk with in-memory replication; applicable to fault-aware systems only; more replicas give more reliability, but also higher failure-free overhead.
Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

67 Fault Tolerance Checkpointing MPI Applications Coordinated disk-based checkpointing is the common mechanism for fault tolerance on HPC platforms. Provided transparently in some MPI implementations (e.g. Intel-MPI: mpirun -chkpoint-interval 100sec -np 100 ./MyApp) Provided outside of MPI by tools that dump a process image to disk, like: BLCR: Berkeley Lab Checkpoint/Restart for Linux DMTCP: Distributed MultiThreaded CheckPointing Or done manually by programmers using file system APIs. Diskless checkpointing is only applicable to fault-aware MPI implementations (like MPI-ULFM, which we cover in the last part of this lecture) Computer Systems (ANU) PGAS Paradigm 02 Nov / 90
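
A minimal sketch of the manual option (names and file layout are illustrative only, not from the slides): each rank dumps its local state to its own file at a coordinated point, so a restart can reload a consistent set of files:

#include <mpi.h>
#include <stdio.h>

/* hypothetical helper: write this rank's state, tagged with the iteration it belongs to */
void write_checkpoint(int rank, int iter, const double *state, size_t n, MPI_Comm comm) {
    char fname[64];
    snprintf(fname, sizeof fname, "ckpt_rank%d.dat", rank);
    FILE *f = fopen(fname, "wb");
    fwrite(&iter, sizeof iter, 1, f);    /* which iteration this checkpoint represents */
    fwrite(state, sizeof *state, n, f);  /* the rank-local application state */
    fclose(f);
    MPI_Barrier(comm);                   /* coordinate: the set is consistent once all ranks are done */
}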

68 Fault Tolerance Checkpoint Interval The checkpoint interval has a crucial impact on performance: A long interval means fewer checkpoints, but more lost work upon a failure. A short interval means more checkpoints, but less lost work upon a failure. Young's formula [6] is often used to compute the optimal checkpoint interval as i = sqrt(2 * t * MTTI), where t is the checkpointing time. The effective application utilization (u) of a system can be computed as [7]: u = 1 - (lost utilization for recovery + lost utilization for checkpointing) = 1 - (i / (2 * MTTI) + t / i) 6. Young, John W. A first order approximation to the optimum checkpoint interval. Communications of the ACM 17.9 (1974) 7. Schroeder, Bianca, and Garth A. Gibson. Understanding failures in petascale computers. Journal of Physics (2007). Computer Systems (ANU) PGAS Paradigm 02 Nov / 90
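
A small worked example (the numbers are purely illustrative, not from the slides): with a 10-minute checkpoint time and a 24-hour MTTI, the formulas above give an interval of roughly 2.8 hours and about 88% effective utilization:

#include <math.h>
#include <stdio.h>

int main(void) {
    double t = 600.0;                 /* checkpointing time: 10 minutes (illustrative) */
    double mtti = 24.0 * 3600.0;      /* MTTI: 24 hours (illustrative) */
    double i = sqrt(2.0 * t * mtti);  /* Young's optimal checkpoint interval */
    double u = 1.0 - (i / (2.0 * mtti) + t / i);   /* effective utilization */
    printf("interval = %.0f s (%.1f h), utilization = %.2f\n", i, i / 3600.0, u);
    return 0;
}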

69 Fault Tolerance Projected System Utilization with C/R Effective application utilization over time Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

70 Fault Tolerance Forward Recovery Replication executes one or more replicas of each process on independent nodes when a replica is lost, another replica takes over without rollback Replication in message passing systems: the message ordering challenge used as a detection and correction mechanism for silent data corruption errors. despite its expensive resource requirements, recent studies [8,9] suggest that replication can be a viable alternative for checkpointing on extreme scale systems with short MTTI. 8. Ropars, Thomas, et al. Efficient Process Replication for MPI Applications: Sharing Work between Replicas. IPDPS Ferreira, Kurt, et al. Evaluating the viability of process replication reliability for exascale systems. SC Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

71 Fault Tolerance Forward Recovery Master-Worker Worker failure: can be tolerated without rollback by assigning the tasks of the failed worker to another worker Master failure: can be tolerated using replication or checkpointing. Because the probability of a master failure is constant (as it does not depend on the scale of the application), it is often more efficient to consider the master failure as a fatal error. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

72 Fault Tolerance Forward Recovery Algorithmic-Based Fault Tolerance the design of custom recovery mechanisms based on expert-knowledge of special algorithm properties (e.g. available data redundancy, the ability to approximate lost data from remaining data,... ). for example: using redundant data to recover lost sub-grids in PDE solvers that use the Sparse Grid Combination Technique (SGCT) [10]: (Image: Ali, et al. 2016) 10. Ali, Md Mohsin, et al. Complex scientific applications made fault-tolerant with the sparse grid combination technique. IJHPCA 30.3 (2016): Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

73 Fault Tolerance MPI and Fault Tolerance The MPI standard does not specify the behaviour of MPI when ranks fail. Most implementations terminate the application as a result of rank failure. Users rely on coordinated disk-based checkpointing because it does not require fault tolerance support from MPI. However, the time to checkpoint an application with a large memory footprint can exceed the MTTI on large systems, making coordinated disk-based checkpointing inapplicable at large scale. User-level fault tolerance techniques can deliver better performance; however, they require fault tolerance support from MPI. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

74 Fault Tolerance MPI and Fault Tolerance (Cont.) MPI User Level Failure Mitigation A proposal by the MPI Forum's Fault Tolerance Working Group to add fault tolerance semantics to MPI. Under assessment by the MPI Forum to be part of the coming MPI-4 standard. A reference implementation of ULFM is available based on Open MPI 1.7. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

75 Fault Tolerance MPI User Level Failure Mitigation In the following, we cover the following aspects of MPI-ULFM: Error Handling Failure Notification Failure Mitigation Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

76 Fault Tolerance Error Handling (1/2) In standard MPI: most MPI interfaces return an error code (e.g. 0=MPI SUCCESS, 1=MPI ERR BUFFER,... ) we can set an error handler to a communicator using: MPI Comm set errhandler there are two predefined error handlers: MPI ERRORS ARE FATAL: terminates MPI (the default). MPI ERRORS RETURN: returns an error code to the caller. The user can also define a customized error handler, as follows: 1 /* User s error handling function */ 2 void errorcallback ( MPI_Comm * comm, int * errcode,...) { } 3 4 /* Changing the communicator s error handler */ 5 MPI_Errhandler handler ; 6 MPI_Comm_create_errhandler ( errorcallback, & handler ); 7 MPI_Comm_set_errhandler ( MPI_COMM_WORLD, handler ); Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

77 Fault Tolerance Error Handling (2/2) ULFM uses the same error handling mechanism as standard MPI It adds new error codes to report process failure events: 54=MPI_ERR_PROC_FAILED 55=MPI_ERR_PROC_FAILED_PENDING 56=MPI_ERR_REVOKED The default error handler MPI_ERRORS_ARE_FATAL must not be used. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

78 Fault Tolerance Failure Notification (1/3) Process failure errors are raised only in MPI operations that involve a failed rank. Point-to-point operations Using a named rank: Using MPI_ANY_SOURCE: Computer Systems (ANU) PGAS Paradigm 02 Nov / 90
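
The figures following the two bullets are not reproduced; as a hedged sketch of the named-rank case (error-code names follow the ULFM proposal as used in the skeleton code later in this lecture; buf, partner and comm are assumed to exist):

/* with a non-fatal error handler installed, a receive that involves a dead rank reports it */
int rc = MPI_Recv(buf, 1, MPI_INT, partner, 0, comm, MPI_STATUS_IGNORE);
if (rc == MPI_ERR_PROC_FAILED) {
    /* the named partner has failed: start recovery instead of crashing */
}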

79 Fault Tolerance Failure Notification (2/3) Process failure errors are raised only in MPI operations that involve a failed rank. Collective operations: some live processes may raise an error, while others return successfully. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

80 Fault Tolerance Failure Notification (3/3) Process failure errors are raised only in MPI operations that involve a failed rank. Non-blocking operations: error reporting is postponed to the corresponding completion function (e.g. MPI Wait, MPI Test). Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

81 Fault Tolerance Failure Mitigation Interfaces (1/2) MPI_Comm_failure_ack( comm ) a local operation that acknowledges all detected failures on the communicator. Its purpose is to silence process failure errors in future MPI_ANY_SOURCE calls that involve an acknowledged process failure. MPI_Comm_failure_get_acked( comm, failedgrp ) returns the group of failed ranks that were already acknowledged Computer Systems (ANU) PGAS Paradigm 02 Nov / 90
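
A short usage sketch of the two calls (interface names as given on the slide; the ULFM reference implementation spells them with an MPIX_ prefix):

MPI_Group failed;
int nfailed;

MPI_Comm_failure_ack(comm);                  /* acknowledge everything detected so far */
MPI_Comm_failure_get_acked(comm, &failed);   /* group of the acknowledged failed ranks */
MPI_Group_size(failed, &nfailed);            /* e.g. count how many ranks have died */
MPI_Group_free(&failed);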

82 Fault Tolerance Failure Mitigation Interfaces (2/2) MPI_Comm_revoke( comm ) a local operation that invalidates the communicator any future communication on a revoked communicator fails with error MPI_ERR_REVOKED live ranks must collectively create a new communicator MPI_Comm_shrink( oldcomm, newcomm ) a collective operation that creates a new communicator that excludes the dead ranks in the old communicator like other collectives, it may succeed at some ranks and fail at others. MPI_Comm_agree( oldcomm, flag ) a collective operation for participants to agree on some value. the only collective operation that returns the same result to all participants. Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

83 Fault Tolerance Resilient Iterative Application Skeleton Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

84 Fault Tolerance
#define CKPT_INTERVAL 10   /* the checkpointing interval */
#define MAX_ITER 100       /* maximum no. of iters */

MPI_Comm world;            /* the working communicator */
int nprocs;                /* communicator size */
int rank;                  /* my rank */
bool restart;              /* restart flag */

void compute();            /* executes the iterative computation,
                            * and orchestrates checkpoint/restart */

int runiter(int i);        /* runs a single iteration,
                            * and returns the MPI error code
                            * of the last MPI call */

void shrinkworld();        /* shrinks a failed communicator,
                            * and sets the new rank and nprocs */

void errorcallback(MPI_Comm *comm, int *rc, ...);
                           /* the communicator's error handler */

void writeckpt();          /* creates a new checkpoint */

int readckpt();            /* loads the last checkpoint, and
                            * returns the corresponding iteration */
Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

85 Fault Tolerance
void main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);

    /* the initial world state */
    world = MPI_COMM_WORLD;
    MPI_Comm_rank(world, &rank);
    MPI_Comm_size(world, &nprocs);

    /* setting the error handler */
    MPI_Errhandler errhandler;
    MPI_Comm_create_errhandler(errorcallback, &errhandler);
    MPI_Comm_set_errhandler(world, errhandler);

    compute();

    MPI_Finalize();
}
Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

86 Fault Tolerance
/* orchestrates the iterative processing and C/R */
void compute() {
    int rc;              /* holds MPI return codes */
    restart = false;     /* set to true only in errorcallback() */
    int i = 0;           /* current iteration number */
    do {
        if (restart) {
            i = readckpt();
            rc = MPI_Comm_agree(world, &i);
            if (rc != MPI_SUCCESS)
                continue;
            restart = false;
        }
        while (i < MAX_ITER) {
            rc = runiter(i);
            if (rc != MPI_SUCCESS)
                break;   /* jump to the outer loop to restart */

            if (i % CKPT_INTERVAL == 0)
                writeckpt();

            i++;
        }
    } while (restart || i < MAX_ITER);
}
Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

87 Fault Tolerance
/* a callback function to handle MPI errors */
void errorcallback(MPI_Comm *comm, int *errcode, ...) {
    if (*errcode != MPI_ERR_PROC_FAILED &&
        *errcode != MPI_ERR_PROC_FAILED_PENDING &&
        *errcode != MPI_ERR_COMM_REVOKED) {
        /* We only tolerate process failure errors */
        MPI_Abort(*comm, -1);
    }

    /* acknowledge the detected failures */
    MPI_Comm_failure_ack(*comm);

    if (*errcode != MPI_ERR_COMM_REVOKED) {
        /* propagate the failure to other ranks */
        MPI_Comm_revoke(*comm);
    }

    /* all live ranks must reach this point
       to collectively shrink the communicator */
    shrinkworld();

    restart = true;
}
Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

88 Fault Tolerance
/* Creates a new communicator for the application,
   excluding dead ranks in the old (revoked) communicator */
void shrinkworld() {
    int rc;              /* shrink return code */
    MPI_Comm newcomm;
    do {
        rc = MPI_Comm_shrink(world, &newcomm);
        MPI_Comm_agree(newcomm, &rc);
    } while (rc != MPI_SUCCESS);

    /* update the communicator */
    world = newcomm;

    /* update my rank and nprocs */
    MPI_Comm_rank(world, &rank);
    MPI_Comm_size(world, &nprocs);
}
Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

89 Fault Tolerance Fault Tolerance: Summary Topics covered today: the decreasing reliability of HPC systems as they grow larger fault tolerance techniques (C/R, Replication, Master-Worker, ABFT) the MPI-ULFM proposal for adding fault tolerance support to MPI Acknowledgement: The fault tolerance part of today's lecture is influenced by materials from the SC'16 tutorial Fault Tolerance for HPC: Theory and Practice Computer Systems (ANU) PGAS Paradigm 02 Nov / 90

90 Fault Tolerance Hands-on Exercise: Checkpointing and ULFM Computer Systems (ANU) PGAS Paradigm 02 Nov / 90


More information

Parallel Programming: Background Information

Parallel Programming: Background Information 1 Parallel Programming: Background Information Mike Bailey mjb@cs.oregonstate.edu parallel.background.pptx Three Reasons to Study Parallel Programming 2 1. Increase performance: do more work in the same

More information

EReinit: Scalable and Efficient Fault-Tolerance for Bulk-Synchronous MPI Applications

EReinit: Scalable and Efficient Fault-Tolerance for Bulk-Synchronous MPI Applications EReinit: Scalable and Efficient Fault-Tolerance for Bulk-Synchronous MPI Applications Sourav Chakraborty 1, Ignacio Laguna 2, Murali Emani 2, Kathryn Mohror 2, Dhabaleswar K (DK) Panda 1, Martin Schulz

More information

Bulk Synchronous and SPMD Programming. The Bulk Synchronous Model. CS315B Lecture 2. Bulk Synchronous Model. The Machine. A model

Bulk Synchronous and SPMD Programming. The Bulk Synchronous Model. CS315B Lecture 2. Bulk Synchronous Model. The Machine. A model Bulk Synchronous and SPMD Programming The Bulk Synchronous Model CS315B Lecture 2 Prof. Aiken CS 315B Lecture 2 1 Prof. Aiken CS 315B Lecture 2 2 Bulk Synchronous Model The Machine A model An idealized

More information

CS 426. Building and Running a Parallel Application

CS 426. Building and Running a Parallel Application CS 426 Building and Running a Parallel Application 1 Task/Channel Model Design Efficient Parallel Programs (or Algorithms) Mainly for distributed memory systems (e.g. Clusters) Break Parallel Computations

More information

Chapel Hierarchical Locales

Chapel Hierarchical Locales Chapel Hierarchical Locales Greg Titus, Chapel Team, Cray Inc. SC14 Emerging Technologies November 18 th, 2014 Safe Harbor Statement This presentation may contain forward-looking statements that are based

More information

Parallel Programming with Coarray Fortran

Parallel Programming with Coarray Fortran Parallel Programming with Coarray Fortran SC10 Tutorial, November 15 th 2010 David Henty, Alan Simpson (EPCC) Harvey Richardson, Bill Long, Nathan Wichmann (Cray) Tutorial Overview The Fortran Programming

More information

A brief introduction to OpenMP

A brief introduction to OpenMP A brief introduction to OpenMP Alejandro Duran Barcelona Supercomputing Center Outline 1 Introduction 2 Writing OpenMP programs 3 Data-sharing attributes 4 Synchronization 5 Worksharings 6 Task parallelism

More information

Principles of Parallel Algorithm Design: Concurrency and Mapping

Principles of Parallel Algorithm Design: Concurrency and Mapping Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 17 January 2017 Last Thursday

More information

Parallelism paradigms

Parallelism paradigms Parallelism paradigms Intro part of course in Parallel Image Analysis Elias Rudberg elias.rudberg@it.uu.se March 23, 2011 Outline 1 Parallelization strategies 2 Shared memory 3 Distributed memory 4 Parallelization

More information

Lecture V: Introduction to parallel programming with Fortran coarrays

Lecture V: Introduction to parallel programming with Fortran coarrays Lecture V: Introduction to parallel programming with Fortran coarrays What is parallel computing? Serial computing Single processing unit (core) is used for solving a problem One task processed at a time

More information

Lecture 7: More about MPI programming. Lecture 7: More about MPI programming p. 1

Lecture 7: More about MPI programming. Lecture 7: More about MPI programming p. 1 Lecture 7: More about MPI programming Lecture 7: More about MPI programming p. 1 Some recaps (1) One way of categorizing parallel computers is by looking at the memory configuration: In shared-memory systems

More information

Lesson 1. MPI runs on distributed memory systems, shared memory systems, or hybrid systems.

Lesson 1. MPI runs on distributed memory systems, shared memory systems, or hybrid systems. The goals of this lesson are: understanding the MPI programming model managing the MPI environment handling errors point-to-point communication 1. The MPI Environment Lesson 1 MPI (Message Passing Interface)

More information

Introduction to parallel computing concepts and technics

Introduction to parallel computing concepts and technics Introduction to parallel computing concepts and technics Paschalis Korosoglou (support@grid.auth.gr) User and Application Support Unit Scientific Computing Center @ AUTH Overview of Parallel computing

More information

Parallel Programming. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Parallel Programming. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Parallel Programming Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Challenges Difficult to write parallel programs Most programmers think sequentially

More information

Introducing Task-Containers as an Alternative to Runtime Stacking

Introducing Task-Containers as an Alternative to Runtime Stacking Introducing Task-Containers as an Alternative to Runtime Stacking EuroMPI, Edinburgh, UK September 2016 Jean-Baptiste BESNARD jbbesnard@paratools.fr Julien ADAM, Sameer SHENDE, Allen MALONY (ParaTools)

More information

OpenMP I. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen

OpenMP I. Diego Fabregat-Traver and Prof. Paolo Bientinesi WS16/17. HPAC, RWTH Aachen OpenMP I Diego Fabregat-Traver and Prof. Paolo Bientinesi HPAC, RWTH Aachen fabregat@aices.rwth-aachen.de WS16/17 OpenMP References Using OpenMP: Portable Shared Memory Parallel Programming. The MIT Press,

More information

Lecture 28: Introduction to the Message Passing Interface (MPI) (Start of Module 3 on Distribution and Locality)

Lecture 28: Introduction to the Message Passing Interface (MPI) (Start of Module 3 on Distribution and Locality) COMP 322: Fundamentals of Parallel Programming Lecture 28: Introduction to the Message Passing Interface (MPI) (Start of Module 3 on Distribution and Locality) Mack Joyner and Zoran Budimlić {mjoyner,

More information

Multithreaded Algorithms Part 1. Dept. of Computer Science & Eng University of Moratuwa

Multithreaded Algorithms Part 1. Dept. of Computer Science & Eng University of Moratuwa CS4460 Advanced d Algorithms Batch 08, L4S2 Lecture 11 Multithreaded Algorithms Part 1 N. H. N. D. de Silva Dept. of Computer Science & Eng University of Moratuwa Announcements Last topic discussed is

More information

Parallel Computing Why & How?

Parallel Computing Why & How? Parallel Computing Why & How? Xing Cai Simula Research Laboratory Dept. of Informatics, University of Oslo Winter School on Parallel Computing Geilo January 20 25, 2008 Outline 1 Motivation 2 Parallel

More information

Lecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6)

Lecture: Consistency Models, TM. Topics: consistency models, TM intro (Section 5.6) Lecture: Consistency Models, TM Topics: consistency models, TM intro (Section 5.6) 1 Coherence Vs. Consistency Recall that coherence guarantees (i) that a write will eventually be seen by other processors,

More information

An Introduction to Parallel Programming

An Introduction to Parallel Programming An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe

More information

OpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico.

OpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico. OpenMP and MPI Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico November 16, 2011 CPD (DEI / IST) Parallel and Distributed Computing 18

More information

Overview: The OpenMP Programming Model

Overview: The OpenMP Programming Model Overview: The OpenMP Programming Model motivation and overview the parallel directive: clauses, equivalent pthread code, examples the for directive and scheduling of loop iterations Pi example in OpenMP

More information

Parallel Paradigms & Programming Models. Lectured by: Pham Tran Vu Prepared by: Thoai Nam

Parallel Paradigms & Programming Models. Lectured by: Pham Tran Vu Prepared by: Thoai Nam Parallel Paradigms & Programming Models Lectured by: Pham Tran Vu Prepared by: Thoai Nam Outline Parallel programming paradigms Programmability issues Parallel programming models Implicit parallelism Explicit

More information

Techniques to improve the scalability of Checkpoint-Restart

Techniques to improve the scalability of Checkpoint-Restart Techniques to improve the scalability of Checkpoint-Restart Bogdan Nicolae Exascale Systems Group IBM Research Ireland 1 Outline A few words about the lab and team Challenges of Exascale A case for Checkpoint-Restart

More information

Parallel Programming in C with MPI and OpenMP

Parallel Programming in C with MPI and OpenMP Parallel Programming in C with MPI and OpenMP Michael J. Quinn Chapter 17 Shared-memory Programming 1 Outline n OpenMP n Shared-memory model n Parallel for loops n Declaring private variables n Critical

More information

Task: a unit of parallel work in a Chapel program all Chapel parallelism is implemented using tasks

Task: a unit of parallel work in a Chapel program all Chapel parallelism is implemented using tasks Task: a unit of parallel work in a Chapel program all Chapel parallelism is implemented using tasks Thread: a system-level concept for executing tasks not exposed in the language sometimes exposed in the

More information

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on

More information

CS 470 Spring Parallel Languages. Mike Lam, Professor

CS 470 Spring Parallel Languages. Mike Lam, Professor CS 470 Spring 2017 Mike Lam, Professor Parallel Languages Graphics and content taken from the following: http://dl.acm.org/citation.cfm?id=2716320 http://chapel.cray.com/papers/briefoverviewchapel.pdf

More information

The Parallel Boost Graph Library spawn(active Pebbles)

The Parallel Boost Graph Library spawn(active Pebbles) The Parallel Boost Graph Library spawn(active Pebbles) Nicholas Edmonds and Andrew Lumsdaine Center for Research in Extreme Scale Technologies Indiana University Origins Boost Graph Library (1999) Generic

More information

Linear Algebra Programming Motifs

Linear Algebra Programming Motifs Linear Algebra Programming Motifs John G. Lewis Cray Inc. (retired) March 2, 2011 Programming Motifs 1, 2 & 9 Dense Linear Algebra Graph Algorithms (and Sparse Matrix Reordering) (2) SIAM CSE 11 Features

More information

OpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico.

OpenMP and MPI. Parallel and Distributed Computing. Department of Computer Science and Engineering (DEI) Instituto Superior Técnico. OpenMP and MPI Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico November 15, 2010 José Monteiro (DEI / IST) Parallel and Distributed Computing

More information

Task: a unit of parallel work in a Chapel program all Chapel parallelism is implemented using tasks

Task: a unit of parallel work in a Chapel program all Chapel parallelism is implemented using tasks Task: a unit of parallel work in a Chapel program all Chapel parallelism is implemented using tasks main() is the only task when execution begins Thread: a system-level concept that executes tasks not

More information

CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable)

CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) CSL373: Lecture 5 Deadlocks (no process runnable) + Scheduling (> 1 process runnable) Past & Present Have looked at two constraints: Mutual exclusion constraint between two events is a requirement that

More information

Primary-Backup Replication

Primary-Backup Replication Primary-Backup Replication CS 240: Computing Systems and Concurrency Lecture 7 Marco Canini Credits: Michael Freedman and Kyle Jamieson developed much of the original material. Simplified Fault Tolerance

More information

Parallel Languages: Past, Present and Future

Parallel Languages: Past, Present and Future Parallel Languages: Past, Present and Future Katherine Yelick U.C. Berkeley and Lawrence Berkeley National Lab 1 Kathy Yelick Internal Outline Two components: control and data (communication/sharing) One

More information

Adaptive Runtime Support

Adaptive Runtime Support Scalable Fault Tolerance Schemes using Adaptive Runtime Support Laxmikant (Sanjay) Kale http://charm.cs.uiuc.edu Parallel Programming Laboratory Department of Computer Science University of Illinois at

More information

L17: Introduction to Irregular Algorithms and MPI, cont.! November 8, 2011!

L17: Introduction to Irregular Algorithms and MPI, cont.! November 8, 2011! L17: Introduction to Irregular Algorithms and MPI, cont.! November 8, 2011! Administrative Class cancelled, Tuesday, November 15 Guest Lecture, Thursday, November 17, Ganesh Gopalakrishnan CUDA Project

More information