Runtime Support for Scalable Task-parallel Programs
1 Runtime Support for Scalable Task-parallel Programs. Pacific Northwest National Lab. xsig workshop, May
2 Single Program Multiple Data: int main () { ... }
3-4 Task Parallelism: int main () { ... }
5 Task-parallel Abstractions. Finer specification of concurrency, data locality, and dependences conveys more application information to the compiler and runtime, and an adaptive runtime system manages the tasks. The application writer specifies the computation and writes optimizable code; tools transform that code to generate an efficient implementation.
6 The Promise. The application writer specifies the computation; the software stack maps that computation to a specific execution environment. We are transferring some of the burden away from the programmer.
7 The Challenge. We are transferring some of the burden to the software stack. Handling a million MPI processes is supposed to be hard; how about billions of tasks? And what about the software ecosystem?
8 Tracing and Constraining Work Stealing Schedulers
9 Research Directions: concurrency management and tracing, dynamic load balancing, data locality optimization, task granularity selection, data race detection.
10 Recursive Task Parallelism

fn() {
    s1;
    async { /*A1*/
        s2;
        finish async s3; //A2
        s4;
    }
    async s5; //A3
    s6;
}

(Figure: the corresponding tree of serial steps s1..s6 and async tasks A1..A3.)
11 Work Stealing. A worker begins with one or a few tasks; tasks spawn more tasks; when a worker runs out of tasks, it steals from another worker. A popular scheduling strategy for recursive parallel programs: a well-studied load balancing strategy with provably efficient scheduling and understandable space and time bounds.
12 Objective. Trace execution under work stealing; exploit information from the trace to perform various optimizations; constrain the scheduler to obtain desired behavior.
13 Tracing. Reference: Steal tree: low-overhead tracing of work stealing schedulers. PLDI '13.
14 Tracing Work Stealing. Record when and where each task executed; capture the order of events for online and offline analysis. Challenges: the sheer size of the trace, and application perturbation that might make tracing impractical.
15 Tracing Approach: Illustration. (Figure: a deque holding tasks, async steps, and steals labeled a, b, c.) Steals occur in order of levels, with almost one steal per level.
16 Space Overhead: Shared Memory. (Plots: total trace space in KB, log scale, for (a) Cholesky, (b) Heat, (c) AllQueens, and (d) Fib across problem configurations, comparing full enumeration (Enum) against StealTree on 2 to 80 cores; a 1 MB line is marked for reference.) Key result: small trace sizes, little affected by core count or problem size; StealTree traces stay near or below 1 MB while enumeration grows by orders of magnitude.
17 Space Overhead: Distributed Memory. (Plot: trace size in KB per core across core counts for AQ-HF, AQ-WF, SCF-HF, SCF-WF, TCE-HF, TCE-WF, PG-HF, PG-WF.) Still less than 160 MB in total.
18 Time Overhead: Shared Memory. (Plot: overhead at 4 to 120 threads for AllQueens, Heat, Fib, FFT, Strassen, NBody, Cholesky, LU, and Matmul.) The time overhead is within the variation in execution time.
19 Time Overhead: Distributed Memory. (Plot: overhead across core counts for AQ-HF, AQ-WF, SCF-HF, SCF-WF, TCE-HF, TCE-WF, PG-HF, PG-WF.) The time overhead is within the variation in execution time.
20 What can we do with a steal tree?
21 Visualization. Core utilization plot over time for the Cilk LU benchmark on 24 cores; the trace is under 100 KB.
22 Replay. Reference: Optimizing data locality for fork/join programs using constrained work stealing. SC '14.
23 Replay Schedulers. Strict, ordered replay (StOWS): exactly reproduce the template schedule, with donation of continuations to be stolen. Strict, unordered replay (StUWS): reproduce the template schedule but allow the order to deviate, respecting the application's dependencies. Relaxed work-stealing replay (RelWS): reproduce the template schedule as much as possible, but allow workers to deviate when they are idle by further stealing work.
24 How good are the schedulers? (Table: execution times for the traced run, StOWS, StUWS, and RelWS on heat, floyd-warshall, fdtd, cg, mg, and parallel prefix; values omitted.) Relaxed work stealing incurs some overhead because it combines replay and work stealing.
25 Relaxed Work Stealing: Adaptability I. Slow down one out of 80 workers by 4x. (Plot: execution time in seconds of fib(48) under the default schedule, under StOWS with the slow worker, and under RelWS with the slow worker.)
26 Relaxed Work Stealing: Adaptability II. Relaxed replay of a schedule from (p-10) workers on p workers. (Plot: execution time in seconds of fib(48) with the default schedule on p-10 threads, the default schedule on p threads, and RelWS(p-10) on p threads, for p = 20, 40, 60.)
27 Relaxed Work Stealing: Adaptability III. Relaxed work stealing of fib(54) with a schedule from fib(48). (Table: execution times for the traced run, StOWS, StUWS, and RelWS on fib(48) and fib(48+6); values omitted.)
28 Retentive Stealing. Reference: Work stealing and persistence-based load balancers for iterative overdecomposed applications. HPDC '12.
29 Iterative Applications. Applications that repeatedly execute the same computation; many scientific applications are iterative, with static or slowly evolving execution characteristics. Execution characteristics preclude static balancing: application characteristics (communication pattern, sparsity, ...) and the execution environment (topology, asymmetry, ...). (Figure: loop "do work" until done.)
30 Retentive Work Stealing. (Figure: local queues of procs 1..n seeded from the tasks each actually executed in the previous iteration.) Intuition: stealing indicates poor initial balance.
31 Retentive Stealing. Use work stealing to load balance within each phase (persistence-based load balancers only rebalance across phases); begin the next iteration with a trace of the previous iteration's schedule.
32 Retentive Stealing Results. (Plots, HF-Be512 on Titan: parallel efficiency and average tasks per core vs. core count for StealRet after 1, 2, 5, 10, and 14 iterations; number of successful steals vs. core count after 1, 2, and 5 iterations.) Retentive stealing stabilizes stealing costs.
33 Retentive Stealing Space Overhead: HF. (Plot: trace size in KB per core per iteration across core counts; execution on Titan.) The space overhead increases but remains manageable across iterations.
34 Retentive Stealing Space Overhead: TCE. (Plot: trace size in KB per core per iteration across core counts; execution on Titan.) The space overhead stays the same across iterations.
35 Data Locality Optimization: NUMA Locality. Reference: Optimizing data locality for fork/join programs using constrained work stealing. SC '14.
36 Constrained Schedules in OpenMP

#pragma omp parallel for schedule(static)
for (i = 0; i < size; i++)
    A[i] = B[i] = 0; // init
#pragma omp parallel for schedule(static)
for (i = 0; i < size; i++)
    B[i] = A[i]; // memcpy

Empirical study: the two loops are naturally matched, leading to good performance. Parallel memory copy of 8 GB of data using OpenMP schedule(static), on an 80-core system with eight NUMA domains and a first-touch policy. Execution time: 169 ms.
37 Cilk Scheduling

cilk_for (i = 0; i < size; i++)
    A[i] = B[i] = 0; // init
cilk_for (i = 0; i < size; i++)
    B[i] = A[i]; // memcpy

Empirical study: random work stealing mismatches the initialization and subsequent use, causing performance degradation. Parallel memory copy of 8 GB of data using MIT Cilk or OpenMP 3.0 tasks. Execution time: 436 ms (Cilk/OMP tasks) vs. 169 ms (OMP static).
38 Can we constrain the scheduler to improve NUMA locality?
39 Solution: Evolve a Schedule. Capture an application phase's steal tree. If the load is balanced, localize data and replay with strict ordered replay; otherwise, and whenever load imbalance is observed during replay, adapt the schedule using relaxed work stealing.
40 Alternative Strategy: Manual Steal Tree Construction. Explicit markup of the steal tree in the user program; useful in non-iterative applications.
41 Data Redistribution Cost. In the first few iterations, data is redistributed (copied) to match a given schedule. (Plot: execution time in seconds vs. number of threads for mg, fdtd, floyd-warshall, heat, cg.)
42 Benchmarks: Iterative, Matching Structure. (Plots: speedup vs. number of threads for heat and floyd-warshall, comparing Cilk interleave, OMP tasks (interleave), and Constrained Iter. RelWS.) Extract a template schedule, apply RelWS for five iterations until convergence, then use StOWS.
43 Benchmarks: Iterative, Differing Structure. (Plot: speedup vs. number of threads for Cilk first-touch, Cilk interleave, OMP tasks (interleave), OMP static (first-touch), Constrained Iter. RelWS, Constrained StUWS, and Constrained User-Specified.) Start with random work stealing on the kernel, refine with RelWS until convergence, then use StOWS.
44 Benchmarks: Iterative, Multiple Structures. (Plot: speedup vs. number of threads for the same scheduler configurations.) We evaluate two approaches: using the same schedule across all kernels, and using a different schedule for each kernel.
45 Benchmarks: Non-iterative, Matching Structure. (Plot: speedup vs. number of threads for the same scheduler configurations.) Reuse the schedule from initialization for other phases with StUWS.
46 Task Granularity Selection. Reference: Optimizing data locality for fork/join programs using constrained work stealing. SC '14.
47 Task Granularity Selection. A key challenge for task-parallel programs: the trade-off between exposing more concurrency and achieving good sequential performance with a coarse grain size.
48 Observation. Concurrency only needs to be exposed to achieve load balance; once the load is balanced, exposed concurrency can be turned off. We can coax the scheduler into selecting coarser-grained work units.
49 Iterative Granularity Selection. Start with a small grain size and obtain a schedule via relaxed work stealing. Then, while coarsening is still needed, drop steal tree leaves and replace them with sequential (coarse) tasks.
50 Dynamic Granularity Selection: heat. (Plots: speedup vs. number of threads for 64x8192-, 64x1024-, and 64x256-block iterative variants and a 64x256-block dynamic variant; histograms of selected block sizes, 64x512 to 64x16k, over three samples.) Iterative locality optimization combined with grain size selection.
51 Dynamic Granularity Selection: cg. (Plots: speedup vs. number of threads for 1024-, 128-, and 32-row iterative variants and a 32-row dynamic variant; histograms of selected row counts, 512 to 16k rows, over three samples.)
52 Data Race Detection. (Figure: two tasks t1 and t2 accessing A.) Reference: Steal tree: low-overhead tracing of work stealing schedulers. PLDI '13.
53 Data Race Detection. Detect conflicting operations in a fork/join program. Key check: determine whether two memory operations can execute in parallel, under any possible schedule.
54 Dynamic Program Structure Tree (DPST)

finish { //F1
    step1;
    async { //A1
        step2;
        async { //A2
            step3;
        }
        step4;
    }
    step5;
}

Two steps s1 and s2 may execute in parallel if: l1 is the least common ancestor (LCA) of s1 and s2 in the DPST; c1 is an ancestor of s1 and an immediate child of l1; and c1 is an async node.
55 Steal-Tree-Aided LCA Computation. The nodes of the DPST can be annotated with the steal tree nodes they belong to. Data race detection involves multiple walks of the DPST for each memory access checked.

lca(s1, s2):
    if (s1.st_node == s2.st_node)
        return dpst_lca(s1, s2); // DPST walk within one steal-tree node
    if (s1.st_node.level > s2.st_node.level)
        return lca(s1.st_node.victim, s2);
    return lca(s1, s2.st_node.victim);
56 Application: Data Race Detection. (Plot: percent reduction at 4 to 120 threads for AllQueens, LU, Heat, NBody, Matmul.) Significant reduction in the number of DPST edges traversed.
57 Other Results. Locality-aware task graph scheduling: color-based constraints on work stealing schedulers. Cache locality optimization: effect-based splicing of concurrent tasks to improve cache locality. Speculative work stealing: exposing greater concurrency. Localized parallel failure recovery.
58 Lessons Learned. Random work stealing with the ability to constrain its behavior can bring several benefits. Steal trees are useful in a variety of contexts: retentive stealing, data locality optimization, task granularity selection, and data race detection. We need to design interfaces to programmatically extract and use work stealing schedules.
59 Continuing Research Challenges. Recursive program specification: enabling users to express high-level intent and properties. Compiler analysis and transformation. Runtime techniques: scheduling and load balancing, fault tolerance, power/energy efficiency, data locality. Correctness and performance tools. Architectural and other low-level support for such abstractions.
60 Conclusions. Abstractions supporting task parallelism can meet performance and programmability challenges. Runtime systems can adapt productively: changing the load balancer or adding fault tolerance involved no change to user code. Maturing an execution paradigm requires a great deal of research and experience.
61 Thank You!
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #7 2/5/2017 Xuehai Qian Xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Outline From last class
More informationMulticore programming in CilkPlus
Multicore programming in CilkPlus Marc Moreno Maza University of Western Ontario, Canada CS3350 March 16, 2015 CilkPlus From Cilk to Cilk++ and Cilk Plus Cilk has been developed since 1994 at the MIT Laboratory
More informationCS420: Operating Systems
Threads James Moscola Department of Physical Sciences York College of Pennsylvania Based on Operating System Concepts, 9th Edition by Silberschatz, Galvin, Gagne Threads A thread is a basic unit of processing
More informationUni-Address Threads: Scalable Thread Management for RDMA-based Work Stealing
Uni-Address Threads: Scalable Thread Management for RDMA-based Work Stealing Shigeki Akiyama, Kenjiro Taura The University of Tokyo June 17, 2015 HPDC 15 Lightweight Threads Lightweight threads enable
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More informationScalable Algorithmic Techniques Decompositions & Mapping. Alexandre David
Scalable Algorithmic Techniques Decompositions & Mapping Alexandre David 1.2.05 adavid@cs.aau.dk Introduction Focus on data parallelism, scale with size. Task parallelism limited. Notion of scalability
More informationLoad Balancing. Minsoo Ryu. Department of Computer Science and Engineering. Hanyang University. Real-Time Computing and Communications Lab.
Load Balancing Minsoo Ryu Department of Computer Science and Engineering 2 1 Concepts of Load Balancing Page X 2 Load Balancing Algorithms Page X 3 Overhead of Load Balancing Page X 4 Load Balancing in
More informationCS 475: Parallel Programming Introduction
CS 475: Parallel Programming Introduction Wim Bohm, Sanjay Rajopadhye Colorado State University Fall 2014 Course Organization n Let s make a tour of the course website. n Main pages Home, front page. Syllabus.
More informationShared-memory Parallel Programming with Cilk Plus
Shared-memory Parallel Programming with Cilk Plus John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 4 19 January 2017 Outline for Today Threaded programming
More informationParallel Programming. March 15,
Parallel Programming March 15, 2010 1 Some Definitions Computational Models and Models of Computation real world system domain model - mathematical - organizational -... computational model March 15, 2010
More informationCUDA GPGPU Workshop 2012
CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline
More informationOpenMP on the IBM Cell BE
OpenMP on the IBM Cell BE PRACE Barcelona Supercomputing Center (BSC) 21-23 October 2009 Marc Gonzalez Tallada Index OpenMP programming and code transformations Tiling and Software Cache transformations
More information15-210: Parallelism in the Real World
: Parallelism in the Real World Types of paralellism Parallel Thinking Nested Parallelism Examples (Cilk, OpenMP, Java Fork/Join) Concurrency Page1 Cray-1 (1976): the world s most expensive love seat 2
More informationA Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores
A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores Marc Tchiboukdjian Vincent Danjean Thierry Gautier Fabien Le Mentec Bruno Raffin Marc Tchiboukdjian A Work Stealing Scheduler for
More informationCOMP Parallel Computing. SMM (2) OpenMP Programming Model
COMP 633 - Parallel Computing Lecture 7 September 12, 2017 SMM (2) OpenMP Programming Model Reading for next time look through sections 7-9 of the Open MP tutorial Topics OpenMP shared-memory parallel
More informationAsymmetry-Aware Work-Stealing Runtimes
Asymmetry-Aware Work-Stealing Runtimes Christopher Torng, Moyang Wang, and Christopher atten School of Electrical and Computer Engineering Cornell University 43rd Int l Symp. on Computer Architecture,
More informationAcknowledgments. Amdahl s Law. Contents. Programming with MPI Parallel programming. 1 speedup = (1 P )+ P N. Type to enter text
Acknowledgments Programming with MPI Parallel ming Jan Thorbecke Type to enter text This course is partly based on the MPI courses developed by Rolf Rabenseifner at the High-Performance Computing-Center
More informationMultithreading in C with OpenMP
Multithreading in C with OpenMP ICS432 - Spring 2017 Concurrent and High-Performance Programming Henri Casanova (henric@hawaii.edu) Pthreads are good and bad! Multi-threaded programming in C with Pthreads
More informationA Course-Based Usability Analysis of Cilk Plus and OpenMP. Michael Coblenz, Robert Seacord, Brad Myers, Joshua Sunshine, and Jonathan Aldrich
A Course-Based Usability Analysis of Cilk Plus and OpenMP Michael Coblenz, Robert Seacord, Brad Myers, Joshua Sunshine, and Jonathan Aldrich 1 PROGRAMMING PARALLEL SYSTEMS Parallel programming is notoriously
More informationLecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013
Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the
More informationCS 267 Applications of Parallel Computers. Lecture 23: Load Balancing and Scheduling. James Demmel
CS 267 Applications of Parallel Computers Lecture 23: Load Balancing and Scheduling James Demmel http://www.cs.berkeley.edu/~demmel/cs267_spr99 CS267 L23 Load Balancing and Scheduling.1 Demmel Sp 1999
More informationConcurrent Programming with OpenMP
Concurrent Programming with OpenMP Parallel and Distributed Computing Department of Computer Science and Engineering (DEI) Instituto Superior Técnico October 11, 2012 CPD (DEI / IST) Parallel and Distributed
More informationJoe Hummel, PhD. Microsoft MVP Visual C++ Technical Staff: Pluralsight, LLC Professor: U. of Illinois, Chicago.
Joe Hummel, PhD Microsoft MVP Visual C++ Technical Staff: Pluralsight, LLC Professor: U. of Illinois, Chicago email: joe@joehummel.net stuff: http://www.joehummel.net/downloads.html Async programming:
More informationFractals. Investigating task farms and load imbalance
Fractals Investigating task farms and load imbalance Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationOpenMP. A parallel language standard that support both data and functional Parallelism on a shared memory system
OpenMP A parallel language standard that support both data and functional Parallelism on a shared memory system Use by system programmers more than application programmers Considered a low level primitives
More informationLecture 21: Transactional Memory. Topics: consistency model recap, introduction to transactional memory
Lecture 21: Transactional Memory Topics: consistency model recap, introduction to transactional memory 1 Example Programs Initially, A = B = 0 P1 P2 A = 1 B = 1 if (B == 0) if (A == 0) critical section
More informationIntroduction to OpenMP
Christian Terboven, Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group terboven,schmidl@itc.rwth-aachen.de IT Center der RWTH Aachen University History De-facto standard for Shared-Memory
More informationCS 140 : Feb 19, 2015 Cilk Scheduling & Applications
CS 140 : Feb 19, 2015 Cilk Scheduling & Applications Analyzing quicksort Optional: Master method for solving divide-and-conquer recurrences Tips on parallelism and overheads Greedy scheduling and parallel
More informationShared Memory programming paradigm: openmp
IPM School of Physics Workshop on High Performance Computing - HPC08 Shared Memory programming paradigm: openmp Luca Heltai Stefano Cozzini SISSA - Democritos/INFM
More informationParallel Methods for Verifying the Consistency of Weakly-Ordered Architectures. Adam McLaughlin, Duane Merrill, Michael Garland, and David A.
Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures Adam McLaughlin, Duane Merrill, Michael Garland, and David A. Bader Challenges of Design Verification Contemporary hardware
More information