Cilk programs as a DAG


Dwayne Gilmore
5 years ago

1 Cilk programs as a DAG

The pattern of spawn and sync commands defines a graph.
- The graph contains the dependencies between different function invocations.
- A spawn command creates a new task with an outbound link.
- A sync command creates inbound links from the spawned tasks.

[Figure: call graph for Fib(3). Fib(n=3) spawns Fib(n=2) and Fib(n=1); Fib(n=2) spawns Fib(n=1) and Fib(n=0). Every node runs the same body:]

    cilk int Fib(int n) {
        if (n < 2) return n;
        int x = spawn Fib(n-1);
        int y = spawn Fib(n-2);
        sync;
        return x + y;
    }

HPCE / dt10 / 2014 / 16.1

2

    cilk int Fib(int n) {
        if (n < 2) return n;
        int x = spawn Fib(n-1);   /* child task, may run in parallel */
        int y = spawn Fib(n-2);   /* second child task */
        sync;                     /* wait for both children */
        return x + y;
    }

3
- Steps within a function execute sequentially.
- Independent functions may execute in parallel.

4
- Total work, $T_1$: total time required to execute all tasks.
- Critical path, $T_\infty$: longest path through all tasks.
- Assuming each step takes unit time: total work = 35; critical path = 16.
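As a rough illustration of the two measures, the sketch below computes work and critical path for the Fib call graph under an assumed cost model in which every invocation costs exactly one unit. That model is an assumption made here for simplicity, so the results will not match the 35 and 16 counted on the slide's figure, which counts individual steps. The function names `work` and `span` are illustrative.

```python
from functools import lru_cache

# Assumed cost model (not the slide's): each Fib invocation costs one unit.

@lru_cache(maxsize=None)
def work(n):
    """Total work T_1: every invocation in the call tree is counted."""
    if n < 2:
        return 1
    return 1 + work(n - 1) + work(n - 2)

@lru_cache(maxsize=None)
def span(n):
    """Critical path T_inf: only the longer of the two spawned children counts."""
    if n < 2:
        return 1
    return 1 + max(span(n - 1), span(n - 2))

if __name__ == "__main__":
    for n in (3, 10, 20):
        print(f"n={n}: work={work(n)}, span={span(n)}, "
              f"work/span={work(n) / span(n):.1f}")
```

Under this model the work grows exponentially with n while the critical path grows only linearly, which is why the ratio of the two increases so quickly.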

5 Best-case and worst-case times

Define three times: $T_1$, $T_P$, $T_\infty$
- $T_1$: time to execute on one processor (total work)
- $T_P$: time to execute on P processors
- $T_\infty$: time to execute on infinite processors (critical path)
- $T_1 / T_P$: speedup with P processors

Can establish an ordering on the times
- $T_1 / P \le T_P$: maximum speedup with P processors is P
- $T_P \ge T_\infty$: finite processors are no faster than infinite

Can talk about scalability
- If $T_1 / T_P = \Theta(P)$ then we have linear speedup (perfect scaling)

We always want linear speedup: can we achieve it?
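The two orderings combine into a single lower bound on $T_P$, namely the larger of $T_1 / P$ and $T_\infty$. The sketch below evaluates that bound using the unit-time figures quoted earlier (total work 35, critical path 16); the function names are illustrative, not part of the course material.

```python
def lower_bound(t1, t_inf, p):
    """Smallest possible T_P: limited by both T_1 / P and T_inf."""
    return max(t1 / p, t_inf)

def speedup_limit(t1, t_inf, p):
    """Largest possible speedup T_1 / T_P implied by the lower bound."""
    return t1 / lower_bound(t1, t_inf, p)

if __name__ == "__main__":
    T1, T_INF = 35, 16  # figures quoted for the example graph
    for p in (1, 2, 4, 8):
        print(f"P={p}: T_P >= {lower_bound(T1, T_INF, p)}, "
              f"speedup <= {speedup_limit(T1, T_INF, p):.2f}")
```

With these numbers the speedup limit stops growing once $T_1 / P$ drops below the critical path, so adding processors beyond that point cannot help.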

6 Greedy schedulers

A greedy scheduler executes work using an ASAP approach
- At each time step, launch all tasks with no outstanding dependencies
- The notion of a time step is deliberately context dependent

When executing with P processors we have two types of step
- Complete step: there are P or more tasks ready to execute
- Incomplete step: there are fewer than P tasks ready to execute

A greedy scheduler always achieves $T_P \le T_1 / P + T_\infty$
- Best case is easy to visualise: we do all work in $T_P$ complete steps
- Worst case is a bit more difficult
  - Steps on the critical path execute in incomplete steps
  - The last step on the critical path frees up all remaining work for complete steps
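The bound can be checked on a small example. The following is a minimal sketch of a greedy scheduler for unit-time tasks, assuming the task graph is given as a dictionary from each task to the tasks it depends on. The graph `EXAMPLE_DEPS` and both function names are hypothetical, chosen only for illustration.

```python
def greedy_schedule(deps, p):
    """Run unit-time tasks greedily on p processors; return the step count T_P."""
    remaining = {task: set(d) for task, d in deps.items()}
    done = set()
    steps = 0
    while remaining:
        # A task is ready once everything it depends on has finished.
        ready = sorted(t for t, d in remaining.items() if d <= done)
        if not ready:
            raise ValueError("dependency cycle: no task is ready")
        batch = ready[:p]  # complete step if len(ready) >= p, else incomplete
        for task in batch:
            del remaining[task]
        done.update(batch)
        steps += 1
    return steps

def critical_path(deps):
    """Length T_inf of the longest dependency chain, counted in tasks."""
    memo = {}
    def depth(task):
        if task not in memo:
            memo[task] = 1 + max((depth(d) for d in deps[task]), default=0)
        return memo[task]
    return max(depth(task) for task in deps)

# Hypothetical fork-join graph: a spawns b, c, d, e; f waits for all four.
EXAMPLE_DEPS = {
    "a": [],
    "b": ["a"], "c": ["a"], "d": ["a"], "e": ["a"],
    "f": ["b", "c", "d", "e"],
}

if __name__ == "__main__":
    t1, t_inf = len(EXAMPLE_DEPS), critical_path(EXAMPLE_DEPS)
    for p in (1, 2, 3, 4):
        t_p = greedy_schedule(EXAMPLE_DEPS, p)
        print(f"P={p}: T_P={t_p}, bound T_1/P + T_inf = {t1 / p + t_inf:.2f}")
```

For this graph $T_1 = 6$ and $T_\infty = 3$, and every measured $T_P$ falls between $\max(T_1 / P, T_\infty)$ and $T_1 / P + T_\infty$.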


8 Linear scaling and greedy schedulers

Previous equations assume zero-cost scheduling
- Some overhead is involved in tracking tasks that can be run
- Some overhead is involved in scheduling ready tasks to a processor

Define critical overhead: $c_\infty$
- Smallest $c_\infty$ such that $T_P \le T_1 / P + c_\infty T_\infty$
- Covers the cost of tracking dependencies on the critical path

Linear scaling if there is usually much more work than CPUs
- Average parallelism: $\bar{P} = T_1 / T_\infty$
- Assumption of parallel slackness: $\bar{P} / P \gg c_\infty$
- Therefore: $T_1 / P \gg c_\infty T_\infty$
- And so: $T_P \approx T_1 / P$ (linear speedup)

The assumption of parallel slackness implies linear speedup

9 Is that a reasonable assumption?

Central idea is that most steps are complete
- All processors are occupied most of the time

Does computation look like that?
- Recall Gustafson's law and the finite-difference example
- $T_1 = O(n^2)$; $T_\infty = O(n)$
- $\bar{P} = T_1 / T_\infty = O(n)$
- Assuming $c_\infty$ is not too high, we should get linear scaling

For lots of stuff the assumption is broadly true

10 Work-first rule

Define work overhead: $c_1 = T_1 / T_S$
- $T_S$: time to run the serial version of the program (serial elision)
- Cost of dynamic scheduling vs static scheduling on one CPU

What is the importance of $c_1$ vs $c_\infty$?
- Substitute into the previous definition ($T_P \le T_1 / P + c_\infty T_\infty$):
  $T_P \le c_1 T_S / P + c_\infty T_\infty$
- Now re-introduce the assumption of parallel slackness ($\bar{P} / P \gg c_\infty$):
  $T_1 / (T_\infty P) \gg c_\infty$
  $T_1 / P \gg c_\infty T_\infty$
  $c_1 T_S / P \gg c_\infty T_\infty$
- Therefore: $T_P \approx c_1 T_S / P$

Work-first rule: minimise $c_1$ rather than $c_\infty$
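To see why the rule favours $c_1$, the sketch below evaluates the bound $c_1 T_S / P + c_\infty T_\infty$ for two hypothetical schedulers. All the numbers are made up for illustration and are chosen so that parallel slackness holds; `estimate_tp` is an illustrative name, not something from the course.

```python
def estimate_tp(c1, c_inf, t_s, t_inf, p):
    """Upper bound on T_P from the slide: c1 * T_S / P + c_inf * T_inf."""
    return c1 * t_s / p + c_inf * t_inf

if __name__ == "__main__":
    # Hypothetical workload: lots of serial work, a short critical path.
    T_S, T_INF, P = 1_000_000, 100, 8

    # Scheduler A: cheap on the work path, expensive on the critical path.
    a = estimate_tp(c1=1.1, c_inf=100, t_s=T_S, t_inf=T_INF, p=P)
    # Scheduler B: expensive on the work path, cheap on the critical path.
    b = estimate_tp(c1=2.0, c_inf=1, t_s=T_S, t_inf=T_INF, p=P)

    print(f"A (low c1, high c_inf): {a:.0f}")
    print(f"B (high c1, low c_inf): {b:.0f}")
```

With these assumed values scheduler A comes out well ahead despite a critical overhead one hundred times larger, because the $c_1 T_S / P$ term dominates whenever there is enough slack.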

11
- Total work, $T_1$: total time required for Cilk on one processor (red + green).
- Serial work, $T_S$: total time required for the serial elision (green only).
- Assuming each step takes unit time: total work = 35; serial work = 22.
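Plugging the quoted figures into the definition of work overhead gives a worked value for this example graph (rounded to two decimal places):

```latex
c_1 = \frac{T_1}{T_S} = \frac{35}{22} \approx 1.59
```

So in this example, running the Cilk version on one processor costs roughly 59% more than running the serial elision.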

12 Interpreting the work-first rule

The work-first rule appears in many guises
- What are $c_1$ and $c_\infty$ in practice?

Multi-core CPUs and OSs support traditional threads
- $c_1$: how much time to swap between two threads on a CPU?
- $c_\infty$: how much time to create a new thread?

GPUs support hundreds of parallel threads
- $c_1$: nanosecond scheduling of threads in a kernel
- $c_\infty$: millisecond cost to manage kernels from the CPU

Intel TBB supports thousands of tasks
- $c_1$: agglomeration of loop iterations to reduce overheads
- $c_\infty$: hierarchical task-based scheduler (based on Cilk)

Bear this principle in mind when looking at real systems

13 Work-first has permeated everything

- Vectorisation: size of vector versus cost of operation
- Pipe processing: size of buffer versus cost of call
- FFT: size of parallel loop versus cost of spawning task
- Heat: cost of memory access versus bit-wise accesses (a bit more tenuous, but still the same principle)
- Open/Close: size of parallel batch versus latency cost
  - Does the assumption of average parallelism hold?
- Bitecoin: ?

14 Administrivia: CW6

A number of requests for coursework extensions
- I reluctantly agree to the possibility
- But only within the context of the exercise
- Some people already have sunk cost based on the original timing

Proposed amendment
- Friday 21st, 23:59. Coin weight 1. (Same)
- Sunday 23rd, 23:59. Coin weight 2. (Was Saturday)
- Friday 24th, 23:59. Coin weight 3. (Was Monday)

15 Coursework 5 debrief

Large diversity of solutions, trading off various concepts
- Currently looking at them as they get ready to compile
- Also doing final tests on CW % TBB; 27% OpenCL+TBB; 20% OpenCL; other

Lots of approaches to solving the problem
- Original loop order in chunks
- Sliding diamonds
- Clever techniques for handling border overlap

I'm reluctant to give my solution: it implies it is the correct one
- My plan is to collect together the most interesting ones
- Write up (with permission) and possibly do a short debrief

16 Reflections on the course: good

Quality of practical skills learnt is vastly better
- Some of the CW5 implementations are very sophisticated

Assessment is more authentic
- Previous assessments were too constrained and artificial

IO has been considered rather than just ignored
- All data comes from somewhere and goes somewhere

The need to test has (mostly) been integrated
- In the previous two years people did not test their code, and it showed

17 Reflections on the course: less good

Feedback: still way too slow for the first four courseworks
- Not as slow as last year
- I need to get out of the way and let GTAs help
- You will still get all the feedback for all the assessments

Not using the technology available
- I set up a message-board, then didn't realise it was invisible
- Lack of a clear interaction point: webpage, git, blackboard

(Minor) No project management: git, collaborative work
- Had to strip out when it became clear there wasn't time

18 Ideas for next year

Front-load the course more
- Schedule two lectures + one practical in the first half of term

Use the technology better
- I was acting as a conduit: not scalable, and not helpful
- Collaborative tools exist and work well

Improve feedback timing
- Now have more experience of high-throughput marking
- Can now build a more robust marking system for early coursework

19 And that's it (apart from orals)


High Performance Computing for Engineers David Thomas dt10@ic.ac.uk Room 903 HPCE / dt10/ 2014 / 0.1 High Performance Computing for Engineers Research Testing communication protocols Evaluating signal-processing