Cilk. Cilk In 2008, ACM SIGPLAN awarded Best influential paper of Decade. Cilk : Biggest principle

Size: px

Start display at page:

Download "Cilk. Cilk In 2008, ACM SIGPLAN awarded Best influential paper of Decade. Cilk : Biggest principle"

Arron Garrett
5 years ago
Views:

1 CS528 Slides are adopted from Charles E. Leiserson A Sahu Dept of CSE, IIT Guwahati HPC Flow Plan: Before MID Processor + Super scalar+ Vector Unit Serial C/C++ Coding Optimization Shared Memory Based Parallel Programming Pthread C++ Thread/ Java Thread Memory Model, Advanced launch, Sync Thread Pooling and higher level management Implicit threading : OpenMP (loop based), (dynamic DAG based) Accelerator Based Threading (Half Distributed) Nvidia GPU, Cuda Programming Multi Computer based Parallel Programming: MPI A Sahu HPC Flow Plan: AfterMID Scheduling Load Balancing Energy Efficient Scheduling Scheduling in Grid/HPC environment Cloud Environment Economic Model Power/Energy Model Developed by Leiserson at CSAIL, MIT Chapter 27, Multithreaded Algorithm, Introduction to Algorithm, Coreman, Leiserson and Rivest Initiated a startup: Plus Added _for Keyword, Reduction features Acquired by Intel, Intel uses Scheduler Addition of 6 keywords to standard C Easy to install in linux system With gcc and pthread In 2008, ACM SIGPLAN awarded Best influential paper of Decade The Implementation of the 5 Multithreaded Language, PLDI 998 PLDI 2008 Best paper Award Reducers and Other ++ Hyperobjects, PLDI 2008 : Biggest principle Programmer should be responsible for Exposing the parallelism, Identifying elements that can safely be executed in parallel Work of run time environment (scheduler) to Decide during execution how to actually divide the work between processors Work Stealing Scheduler Proved to be good scheduler Now also in GCC, Intel CC, Intel acquire ++

2 Fibonacci int fib (int n) { x = fib(n-); y = fib(n-2); C elision code cilk int fib (int n) { x = spawn fib(n-); y = spawn fib(n-2); sync; is a faithful extension of C. A program s serial elision is always a legal implementation of semantics. provides no new data types. Basic Keywords cilk int fib (int n) { x = spawn fib(n-); y = spawn fib(n-2); sync; Control cannot pass this point until all spawned children have returned. Identifies a function as a procedure, capable of being spawned in parallel. The named child procedure can execute in parallel with the parent caller. continue edge spawn edge Multithreaded Computation initial thread final thread return edge The dag G = (V, E) represents a parallel instruction stream. Each vertex v 2 V represents a () thread: a maximal sequence of instructions not containing parallel control (spawn, sync, return). Every edge e 2 E is either a spawn edge, a return edge, or a continue edge. Fib: ++ Version int fib(int n) { if (n < 2) return n; int x=cilk_spawn fib(n-); int y = fib(n-2); cilk_sync; return x + y; For loop in for (int i = 0; i < 8; ++i) do_work(i); for (int i = 0; i < 8; ++i) cilk_spawn do_work(i); cilk_ sync; Serial Parallel Loop_for in ++ cilk_for (int i=0;i<8;++i){ do_work(i); // No sync required; auto sync Parallel 2

3 Run Time Scheduler Distributed load balancing Receiver initiated Work stealing : Free processor steal a task ofbusy processor When ever a process spawns a new process, This processor starts executing the spawned one Parent goes to waiting/suspend mode Parent can be transferred to other processor Work stealing Work stealing algorithm is receiver initiated algorithm Technique commonly used for load balancing Thief processor ( Idle processor) Steal work from other processor Victim is selected randomly Victim processor (From a set of busy processor) Work is stolen from these processor A Sahu 4 Work stealing Optimal algorithm for load balancing If select victim randomly algorithm is Optimal Proved Basic assumption in work stealing All the memory access are take same time UMA (Uniform Memory Access): shared memory Can be feasible iff Task transfer time is same for all pair of processor Communication bandwidth is same for all pair of processor Run Time Scheduler P0 P0 P0 P0 P0 P0 P0 main 0 main 0 init_ init_ init_ init_ 2 init_ init_ 3 init_ P P P P P P P Request for task Task transfer main 0 main 0 main 0 init_ 5 main 0 init _6 main 0 Task return main0 init_ init_4 vadd_7 init_2 init_3 init_5 init_6 Vadd_8 Vadd_9 A Sahu 5 T = work T = work T = span* * Also called critical path length or computational depth. 3

4 T = work T = span* LOWER BOUNDS T P T /P T P T Speedup Definition: T /T P = speedup on P processors. If T /T P = Θ(P) P, we have linear speedup; = P, we have perfect linear speedup; > P, we have superlinear speedup, which is not possible in our model, because of the lower bound T P T /P. *Also called critical path length or computational depth. Parallelism Because we have the lower bound T P T, the maximum possible speedup given T and T is T / T = parallelism li = the average amount of work per step along the span. Computation DAG unfolds dynamically. spawn edge CILK Example: Fib(4) initial thread continue edge final thread return edge Assume for simplicity that each thread in fib() takes unit time to execute. Work: T = 7 Using many more Span: T = 8 than 2 processors Parallelism: T makes little sense. /T = 2.25 Ref:The System for Parallel Multithreaded Computing, MIT Phd Thesis Ref2:The Implementation of the 5 Multithreaded Language, 998 ACM SIGPLAN Scheduling allows the programmer to express potential parallelism in an application. The scheduler maps threads onto processors dynamically at runtime. Since on line schedulers are complicated, we ll illustrate the ideas with an off line scheduler. P P P $ $ $ Memory Network I/O 4

5 Complete step P threads ready. Run any P. Complete step P threads ready. Run any P. Incomplete step < P threads ready. Run all of them. Greedy Scheduling Theorem Theorem [Graham 68 & Brent 75]. Any greedy scheduler achieves T P T /P+ T. Proof. # complete steps T /P, since each complete step performs P work. # incomplete steps T, since each incomplete step reduces the span of the unexecuted dag by. Optimality of Greedy Corollary. Any greedy scheduler achieves within a factor of 2 of optimal. Proof. Let T P * be the execution time produced by the optimal scheduler. Since T P * max{t /P, T (lower bounds), we have T P T /P + T 2max{T /P, T 2T P *. Linear Speedup Corollary. Any greedy scheduler achieves near perfect linear speedup whenever P << T /T Proof. Since P << T /T is equivalent to T << T /P, the Theorem gives us T P T /P + T T /P. Thus, the speedup is T /T P P. Definition. The quantity (T / T )/P is called the parallel slackness. Performance s work stealing scheduler achieves T P = T /P + O(T ) expected time (provably); T P T /P + T time (empirically). Near perfect linear speedup if P << T /T. Instrumentation in allows the user to determine accurate measures of T and T. The average cost of a spawn in 5 is only 2 6 times the cost of an ordinary C function call, depending on the platform. 5

The Cilk part is a small set of linguistic extensions to C/C++ to support fork-join parallelism. (The Plus part supports vector parallelism.

Cilk Plus The Cilk part is a small set of linguistic extensions to C/C++ to support fork-join parallelism. (The Plus part supports vector parallelism.) Developed originally by Cilk Arts, an MIT spinoff,