for Task-Based Parallelism

Size: px

Start display at page:

Download "for Task-Based Parallelism"

Laurence Curtis
5 years ago
Views:

1 for Task-Based Parallelism School of EEECS, Queen s University of Belfast July

2 Table of Contents 2 / 24

3 Thread-Based Programming Hard for programmer to reason about thread interleavings and concurrent memory accesses Error-prone: Races, deadlocks, livelocks, all hard to reproduce Task-based parallelism offers a higher level of abstraction Early models (e.g. OpenMP, Cilk) based on explicit synchronization New models based on automatic dependence analysis are emerging Runtime dependence analysis based on programmer s annotation of memory footprint (OmpSs, SvS) dependence analysis based on compiler s inference of memory footprint (DPJ) Hybrid static-dynamic schemes this talk 3 / 24

4 OpportunityA:4for : FFT 1 void FFT 1D (long N, long N SQRT, long FFT BS, long TR BS, 2 double Complex A[N SQRT][N SQRT]) 3 { 4 const size t rowsz = N SQRT * 2 * sizeof(double); 5 const size t longsz = sizeof(long); 6 7 // Loop 1: Transpose 8 for (long I = 0; I < N SQRT; I += TR BS) { 9 void * tile ii = &A[I ][ I ]; #pragma task \ 12 in(n[longsz], N SQRT[longsz], TR BS[longsz]) \ 13 inout(tile ii[rowsz][tr BS TR BS*2*sizeof(double)]); 14 trsp blk (N, N SQRT, TR BS, tile ii); for (long J = I + TR BS; J < N SQRT; J += TR BS) { 17 void * tile ij = &A[I ][ J ]; 18 void * tile ji = &A[J][I ]; #pragma task \ 21 in(n[longsz], N SQRT[longsz], TR BS[longsz]) \ 22 inout(tile ij[rowsz][tr BS TR BS*2*sizeof(double)], \ 23 tile ji [rowsz][tr BS TR BS*2*sizeof(double)]); 24 trsp swap(n, N SQRT, TR BS, tile1, tile2); 25 } 26 } // Loop 2: First FFT round 29 for (long J = 0; J < N SQRT; J += FFT BS) { 30 tile = &A[J ][0]; #pragma task \ 33 in(n SQRT[longsz], FFT BS[longsz]) \ 34 inout(tile[fft BS*rowsz)]); 35 FFT1D(N SQRT, FFT BS, &A[J][0]); 36 } } 4 / 24

5 Table of Contents 5 / 24

6 Whole Object of whole task arguments (objects) Simple: use object base address for detecting dependencies Assumes objects are contiguous in memory Can not analyze dependencies between multi-dimensional array blocks Can not analyze dependencies static or dynamic, arbitrary collections of objects Can not detect partial overlaps May require data layout transformations (memory copies) Can not be used to write parallel code operating on dynamic data structures 6 / 24

7 Block-level analysis Intuition: partition virtual memory into fixed-size blocks Analyze dependencies between blocks instead of program objects Treat objects as arbitrary collections of memory blocks Can analyze dependencies between multi-dimensional array blocks, arbitrary collections of objects, static or dynamic Can detect partial overlaps Programmability: easier to write parallel code operating on static or dynamic structures, array-based or linked Overhead: Objects composed of N blocks need N analysis checks 7 / 24

8 Elements of Task descriptor containing the closure of task Code, memory footprint and other dependent tasks Block descriptors containing queue of tasks waiting to access the block Include access mode (in, out, inout) Use versioning (renaming) on writes Versions reveal parallelism Efficient implementation Custom memory allocator with task metadata embedded with memory allocator metadata O(1) lookups for all metadata Small memory footprint for low cache pollution 8 / 24

9 by Example (a) Task 1 9 / 24

10 by Example 10 / 24

11 (a) Task 1 by Example 11 / 24

12 by Example (b) Task 2 (d) Task 4 12 / 24

13 Table of Contents 13 / 24

14 Opportunity for (a) OpenMP Fig int a = 1, b = 2, c = 3; 2 int * alias = &b; 3 4 void set(int * x, int * y) { * x= * y; } 5 void addto(int * x, int * y) { * x += * y; } 6 7 int main() { 8 #pragma task inout(&b) safe(&c); 9 addto(&b, &c); #pragma task safe(&a) in(alias); 12 set(&a, alias ); #pragma wait all #pragma task safe(&a) safe(&c); 17 set(&a, &c); 18 } (b) IDDI Task graphs showing run-time task execution and dependencies Fig. 3. analysis unnecessary on a and c in first two tasks because of barrier analysis unnecessary on a and c in third task because of barrier Tedious for programmer to manage, best handled by compiler Tasks with independent arguments To execute this program preserving the sequential semantics, the second task set needs to wait until the value of b is produced by the first task, i.e., there is a dependence on memory location b. Note, however, that since the third task cannot be spawned 14 / until 24

15 A:6 University of Language λ Values Expressions Locations ρ L CFG Points φ F Tasks π T Types Constraints v ::= n () λ x.e e ::= v x e; e ee ref e! e e := e task(e 1,...,e n ) {e} barrier τ ::= int unit (τ, φ) (τ, φ) ref ρ (τ) C ::= C C τ τ ρ ρ φ φ ρ π ππ φ : Barrier φ : π Environments Γ ::= Γ,x: τ Fig. 4. λ : A simple task-based parallel language Simply-typed lambda calculus Section 3 describes the static analysis we use to discover independen ments like a and Extensions: c above. Inferring dynamicthat memory a task allocation, argument task does creation, not need t for dependenciesbarrier requires synchronization verifying that no other task can access that a short, the static analysis infers this independence in three steps. First, aliasing information for all memory locations in the program. Second, 15 / 24

16 A:6 University of Language λ Values Expressions Locations ρ L CFG Points φ F Tasks π T Types Constraints v ::= n () λ x.e e ::= v x e; e ee ref e! e e := e task(e 1,...,e n ) {e} barrier τ ::= int unit (τ, φ) (τ, φ) ref ρ (τ) C ::= C C τ τ ρ ρ φ φ ρ π ππ φ : Barrier φ : π Environments Γ ::= Γ,x: τ Type system generates set of constraints solved by rewriting Fig. 4. λ : A simple task-based parallel language rules: Inference labels φ and ρ generate control-flow and points-to graphs Constraints for data (ρ1 ρ 2 ), control (φ 1 φ 2 ), task memory footprints (ρ π), can happen in parallel (π 1 π 2 ) and barrier (φ) Section 3 describes the static analysis we use to discover independen ments like a and c above. Inferring that a task argument does not need t for dependencies requires verifying that no other task can access that a short, the static analysis infers this independence in three steps. First, aliasing information for all memory locations in the program. Second, which tasks can run in parallel; we do not need to check for conflicting 15 / 24a

17 Table of Contents 16 / 24

18 Details Source-to-source compiler using CIL OpenMP and OmpSs-like syntax supported Strided arguments, multi-dimensional array tiles, and dynamic regions Locksmith engine for points-to analysis Scalable runtime support: Concurrent queues, NUMA-aware data allocation and task scheduling Provable low bounds and determinism 17 / 24

19 Experimental Platform Cray XE6 compute node on Hector, 32 GB DRAM, NUMA 18 / 24

20 Experimental Results A:17 21*-34*56(47/(89 $'!"!#!%!$!' " # % $ '!""! #$%## &'()$% %*+,(-./ 01,(/ (a) Black-Scholes /.&*01&23)14,)56 #'!""! (' #$%## %&'()*+, %' &' $'!' ' -.(), (b) Ferret.+/012/34(254-6 (''' %('' %''' &('' &''' $('' $'''!(''!''' ('' '!""! #$%## &'()$% *+,(- (c) Cholesky +',-./,01)/2*)34!'' *' "' )' #' (' %' &' $'!'!""! #$%## +',-./,01)/2*)34!%!$!' " # % $!""! #$%## &"'()*'+,$*-,%. &('' &''' $('' $'''!(''!''' ('' /00/ $623 ' &'()* ' &'()* '!"#$% 19 / 24

21 +',-./, %' &' $'!' University of Manchester, ' July! "!# &'()* $% &$ Experimental Results ' (d) GMRES +',-./, # % $ &'()* (e) HPL &"'()*'+!(''!''' ('' '! "!#!"#$% (f) Jacob.+/012/34(254-6 $('' $'''!(''!''' (''!""! #$%## &'()$% -*./01.23%14,%56 &() &(' $() $('!()!(' '()!""! #$%&'( ' *+,(- (g) SMPSS-FFT '(' )*+%, (h) SPLASH-FFT (254-6 $('' $'''!(''!""! #$%## &'()$% ).1*)23 %' &( &' $(!""! #$% 20 / 24

22 ,(/ /.&*0 $'!' '!# $% &$Experimental Results -.(),.+/01!(''!''' ('' ' *+,(- -Scholes!# $% &$ ()* RES +',-./,01)/2*)34!%!$!' " # % $ (b) Ferret!""! #$%## ' &'()* (e) HPL &"'()*'+,$*-,%. &('' &''' $('' $'''!(''!''' ('' (c) Cholesky /00/ $623 '!"#$% (f) Jacobi (254-6 $('' $'''!(''!""! #$%## &'()$% %14,%56 &() &(' $() $('!""! #$%&'( 21 / 24

23 Table of Contents 22 / 24

24 Lessons and Challenges Lesson: Task-based parallelism achieves good balance between productivity and performance Lesson: Automatic dependence analysis uncovers more parallelism and enables further optimization Lesson: A new opportunity for parallelizing compilers Challenge: Overhead still present in some codes with low operational intensity (e.g. stencils) Challenge: Points-to analysis still limited (even in simple cases) Challenge: no magic recipe, expose analysis options as user knobs or autotuners 23 / 24

25 Acknowledgment Collaborators: Angelos Bilas, Angelos Papatriantafyllou, Polyvios Pratikakis, George Tzenakis, Hans Vandierendonck Funding: EU FP7 (ENCORE, TEXT, I-CORES) Infrastructure: Hector (EPCC), BSC 24 / 24

Region-Based Memory Management for Task Dataflow Models

Region-Based Memory Management for Task Dataflow Models Nikolopoulos, D. (2012). Region-Based Memory Management for Task Dataflow Models: First Joint ENCORE & PEPPHER Workshop on Programmability and Portability