Plan. 1 Parallelism Complexity Measures. 2 cilk for Loops. 3 Scheduling Theory and Implementation. 4 Measuring Parallelism in Practice


Plan

Multithreaded Parallelism and Performance Measures
Marc Moreno Maza
University of Western Ontario, London, Ontario (Canada)
CS 4402 - CS 9535

1 Parallelism Complexity Measures
2 cilk_for Loops
3 Scheduling Theory and Implementation
4 Measuring Parallelism in Practice
5 Anticipating Parallelization Overheads
6 Announcements

Parallelism Complexity Measures: The fork-join parallelism model

    int fib (int n) {
        if (n < 2) return (n);
        else {
            int x, y;
            x = cilk_spawn fib(n-1);
            y = fib(n-2);
            cilk_sync;
            return (x+y);
        }
    }

Example: fib(4). Processor oblivious: the computation dag unfolds dynamically. We shall also call this model multithreaded parallelism.
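A useful property of the language (used later, when we measure work overhead): erasing the Cilk keywords from this code yields its serial C++ elision, an ordinary sequential program computing the same result. A minimal sketch, with fib_serial as an illustrative name:

    int fib_serial(int n) {
        if (n < 2) return (n);
        else {
            int x, y;
            x = fib_serial(n-1);  // cilk_spawn erased: a plain call
            y = fib_serial(n-2);
            /* cilk_sync erased: nothing to wait for */
            return (x+y);
        }
    }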

Parallelism Complexity Measures: Terminology

[Figure: a dag of strands with an initial strand, a final strand, and continue, spawn and return edges.]

A strand is a maximal sequence of instructions that ends with a spawn, sync, or return (either explicit or implicit) statement. At runtime, the spawn relation causes procedure instances to be structured as a rooted tree, called the spawn tree or parallel instruction stream, where dependencies among strands form a dag.

Parallelism Complexity Measures: Work and span

We define several performance measures. We assume an ideal situation: no cache issues, no interprocessor costs.
- T_p is the minimum running time on p processors.
- T_1 is called the work, that is, the sum of the number of instructions at each node.
- T_∞ is the minimum running time with infinitely many processors, called the span.

Parallelism Complexity Measures: The critical path length

Assuming all strands run in unit time, the longest path in the dag is equal to T_∞. For this reason, T_∞ is also referred to as the critical path length.

Parallelism Complexity Measures: Work law

We have: T_p ≥ T_1/p. Indeed, in the best case, p processors can do p units of work per unit of time.

Parallelism Complexity Measures: Span law

We have: T_p ≥ T_∞. Indeed, T_p < T_∞ would contradict the definitions of T_p and T_∞.

Parallelism Complexity Measures: Speedup on p processors

T_1/T_p is called the speedup on p processors. A parallel program execution can have:
- linear speedup: T_1/T_p = Θ(p),
- superlinear speedup: T_1/T_p = ω(p) (not possible in this model, though it is possible in others),
- sublinear speedup: T_1/T_p = o(p).

Parallelism Complexity Measures: Parallelism

Because the Span Law dictates that T_p ≥ T_∞, the maximum possible speedup given T_1 and T_∞ is

    T_1/T_∞ = parallelism = the average amount of work per step along the span.

Parallelism Complexity Measures: The Fibonacci example (1/2)

For Fib(4), we have T_1 = 17 and T_∞ = 8, and thus T_1/T_∞ = 2.125. What about T_1(Fib(n)) and T_∞(Fib(n))?
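The numbers above can be recovered by counting strands; a small sketch (my reconstruction of the counting, assuming each spawning instance contributes three strands, namely the part before cilk_spawn, the continuation that calls fib(n-2), and the part after cilk_sync, while each leaf n < 2 is a single strand):

    #include <algorithm>
    #include <cstdio>

    long work(int n) {            // T_1: total number of strands
        if (n < 2) return 1;
        return work(n-1) + work(n-2) + 3;
    }
    long span(int n) {            // T_inf: longest path in the dag
        if (n < 2) return 1;
        // spawn path: A -> fib(n-1) -> C ; call path: A -> B -> fib(n-2) -> C
        return std::max(span(n-1) + 2, span(n-2) + 3);
    }
    int main() {
        std::printf("work=%ld span=%ld parallelism=%.3f\n",
                    work(4), span(4), (double)work(4)/span(4));
        // prints: work=17 span=8 parallelism=2.125
    }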

Parallelism Complexity Measures: The Fibonacci example (2/2)

We have T_1(n) = T_1(n-1) + T_1(n-2) + Θ(1). Let's solve it. One verifies by induction that T_1(n) ≤ a F_n - b for b > 0 large enough to dominate Θ(1) and a > 1. We can then choose a large enough to satisfy the initial condition, whatever that is. On the other hand, we also have F_n ≤ T_1(n). Therefore T_1(n) = Θ(F_n) = Θ(ψ^n) with ψ = (1 + √5)/2.

We have T_∞(n) = max(T_∞(n-1), T_∞(n-2)) + Θ(1). We easily check T_∞(n-1) ≥ T_∞(n-2). This implies T_∞(n) = T_∞(n-1) + Θ(1). Therefore T_∞(n) = Θ(n). Consequently, the parallelism is Θ(ψ^n / n).

Parallelism Complexity Measures: Series composition

[Figure: dag A followed by dag B.] Work? Span?

Work: T_1(A ∪ B) = T_1(A) + T_1(B)
Span: T_∞(A ∪ B) = T_∞(A) + T_∞(B)

Parallelism Complexity Measures: Parallel composition

[Figure: dags A and B side by side, between a common fork and join.] Work? Span?

Work: T_1(A ∪ B) = T_1(A) + T_1(B)
Span: T_∞(A ∪ B) = max(T_∞(A), T_∞(B))

Parallelism Complexity Measures: Some results in the fork-join parallelism model

    Algorithm               Work         Span
    Merge sort              Θ(n lg n)    Θ(lg^3 n)
    Matrix multiplication   Θ(n^3)       Θ(lg n)
    Strassen                Θ(n^{lg 7})  Θ(lg^2 n)
    LU-decomposition        Θ(n^3)       Θ(n lg n)
    Tableau construction    Θ(n^2)       Ω(n^{lg 3})
    FFT                     Θ(n lg n)    Θ(lg^2 n)
    Breadth-first search    Θ(E)         Θ(d lg V)

We shall prove those results in the next lectures.

cilk_for Loops: For loop parallelism in Cilk++

[Figure: an n x n matrix A and its transpose A^T.]

    cilk_for (int i=1; i<n; ++i) {
        for (int j=0; j<i; ++j) {
            double temp = A[i][j];
            A[i][j] = A[j][i];
            A[j][i] = temp;
        }
    }

The iterations of a cilk_for loop execute in parallel.
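For concreteness, a self-contained driver around this loop (my own sketch, assuming a Cilk++-style compiler). Note that iteration i only touches the pairs (i,j) and (j,i) with j < i, so distinct iterations write disjoint locations and the loop is race-free:

    #include <cstdio>

    const int n = 4;
    double A[n][n];

    int cilk_main() {
        for (int i = 0; i < n; ++i)           // fill A with distinct entries
            for (int j = 0; j < n; ++j)
                A[i][j] = 10*i + j;
        cilk_for (int i = 1; i < n; ++i) {    // in-place transpose, as above
            for (int j = 0; j < i; ++j) {
                double temp = A[i][j];
                A[i][j] = A[j][i];
                A[j][i] = temp;
            }
        }
        for (int i = 0; i < n; ++i) {         // print the transposed matrix
            for (int j = 0; j < n; ++j) std::printf("%5.0f", A[i][j]);
            std::printf("\n");
        }
        return 0;
    }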

cilk_for Loops: Implementation of for loops in Cilk++

Up to details (next week!) the previous loop is compiled as follows, using a divide-and-conquer implementation:

    void recur(int lo, int hi) {
        if (hi > lo) { // coarsen
            int mid = lo + (hi - lo)/2;
            cilk_spawn recur(lo, mid);
            recur(mid+1, hi);
            cilk_sync;
        } else {
            int i = lo;  // a single iteration of the original loop
            for (int j=0; j<i; ++j) {
                double temp = A[i][j];
                A[i][j] = A[j][i];
                A[j][i] = temp;
            }
        }
    }

cilk_for Loops: Analysis of parallel for loops

Here we do not assume that each strand runs in unit time.
- Span of loop control: Θ(log(n))
- Max span of an iteration: Θ(n)
- Span: Θ(n)
- Work: Θ(n^2)
- Parallelism: Θ(n)

cilk_for Loops: Parallelizing the inner loop

    cilk_for (int i=1; i<n; ++i) {
        cilk_for (int j=0; j<i; ++j) {
            double temp = A[i][j];
            A[i][j] = A[j][i];
            A[j][i] = temp;
        }
    }

- Span of outer loop control: Θ(log(n))
- Max span of an inner loop control: Θ(log(n))
- Span of an iteration: Θ(1)
- Span: Θ(log(n))
- Work: Θ(n^2)
- Parallelism: Θ(n^2 / log(n))

But! More on this next week...
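The same divide-and-conquer pattern as a reusable helper, to make the loop-to-recursion mapping explicit (an illustrative sketch assuming a Cilk++-style compiler; parallel_apply and the grain constant are my names, not part of the runtime):

    template <typename F>
    void parallel_apply(int lo, int hi, F f) {
        const int grain = 64;                      // coarsening threshold (illustrative)
        if (hi - lo > grain) {
            int mid = lo + (hi - lo) / 2;
            cilk_spawn parallel_apply(lo, mid, f); // left half in a child strand
            parallel_apply(mid, hi, f);            // right half in this strand
            cilk_sync;                             // join the two halves
        } else {
            for (int i = lo; i < hi; ++i) f(i);    // serial leaf: at most grain iterations
        }
    }

The control tree built by the spawns has depth Θ(log(n/grain)), which is where the Θ(log(n)) span of the loop control in the analysis above comes from.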

Scheduling Theory and Implementation: Scheduling

[Figure: p processors, each with a private cache ($), connected to memory, I/O and the network.]

A scheduler's job is to map a computation to particular processors. Such a mapping is called a schedule. If decisions are made at runtime, the scheduler is online; otherwise, it is offline. Cilk++'s scheduler maps strands onto processors dynamically at runtime.

Scheduling Theory and Implementation: Greedy scheduling (1/2)

- A strand is ready if all its predecessors have executed.
- A scheduler is greedy if it attempts to do as much work as possible at every step.

Scheduling Theory and Implementation: Greedy scheduling (2/2)

(Example with p = 3.) In any greedy schedule, there are two types of steps:
- complete step: there are at least p strands that are ready to run; the greedy scheduler selects any p of them and runs them.
- incomplete step: there are strictly fewer than p strands that are ready to run; the greedy scheduler runs them all.

Scheduling Theory and Implementation: Theorem of Graham and Brent

For any greedy schedule, we have T_p ≤ T_1/p + T_∞.

Proof. The number of complete steps is at most T_1/p, by definition of T_1. The number of incomplete steps is at most T_∞. Indeed, let G' be the subgraph of G that remains to be executed immediately prior to an incomplete step. (i) During this incomplete step, all strands that can be run are actually run. (ii) Hence removing this incomplete step from G' reduces T_∞ by one.
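A minimal simulator (an illustrative sketch, not from the slides) that runs a greedy schedule on a unit-cost dag, treating each step as complete or incomplete as defined above; it lets one check the Graham-Brent bound T_p ≤ T_1/p + T_∞ on small examples:

    #include <cstdio>
    #include <vector>
    #include <algorithm>

    // Greedy schedule of a unit-cost dag on p processors.
    // succ[v] lists the successors of strand v; returns the number of steps T_p.
    int greedy_schedule(const std::vector<std::vector<int> >& succ, int p) {
        int n = succ.size();
        std::vector<int> indeg(n, 0);
        for (int v = 0; v < n; ++v)
            for (size_t k = 0; k < succ[v].size(); ++k) indeg[succ[v][k]]++;
        std::vector<int> ready;
        for (int v = 0; v < n; ++v)
            if (indeg[v] == 0) ready.push_back(v);     // strands with no predecessors
        int steps = 0, done = 0;
        while (done < n) {
            // Complete step if at least p strands are ready; otherwise incomplete.
            int k = std::min((int)ready.size(), p);
            std::vector<int> run(ready.end() - k, ready.end());
            ready.resize(ready.size() - k);
            ++steps; done += k;
            for (size_t a = 0; a < run.size(); ++a)    // release the successors
                for (size_t b = 0; b < succ[run[a]].size(); ++b)
                    if (--indeg[succ[run[a]][b]] == 0) ready.push_back(succ[run[a]][b]);
        }
        return steps;
    }

    int main() {
        // Diamond dag: 0 -> {1,2}, {1,2} -> 3 ; here T_1 = 4 and T_inf = 3.
        std::vector<std::vector<int> > succ(4);
        succ[0].push_back(1); succ[0].push_back(2);
        succ[1].push_back(3); succ[2].push_back(3);
        for (int p = 1; p <= 3; ++p)
            std::printf("p=%d T_p=%d (Graham-Brent bound %.2f)\n",
                        p, greedy_schedule(succ, p), 4.0/p + 3);
    }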

Scheduling Theory and Implementation: Corollary 1

A greedy scheduler is always within a factor of 2 of optimal.

Proof. From the work and span laws, we have:
    T_p ≥ max(T_1/p, T_∞).                              (1)
In addition, we can trivially express:
    T_1/p ≤ max(T_1/p, T_∞),                            (2)
    T_∞ ≤ max(T_1/p, T_∞).                              (3)
From the Graham-Brent theorem, we deduce:
    T_p ≤ T_1/p + T_∞                                   (4)
        ≤ max(T_1/p, T_∞) + max(T_1/p, T_∞)             (5)
        ≤ 2 max(T_1/p, T_∞),                            (6)
which concludes the proof.

Scheduling Theory and Implementation: Corollary 2

The greedy scheduler achieves linear speedup whenever T_∞ = O(T_1/p).

Proof. From the Graham-Brent theorem, we deduce:
    T_p ≤ T_1/p + T_∞                                   (7)
        = T_1/p + O(T_1/p)                              (8)
        = Θ(T_1/p).                                     (9)
The idea is to operate in the range where T_1/p dominates T_∞. As long as T_1/p dominates T_∞, all processors can be used efficiently. The quantity T_1/(p T_∞) is called the parallel slackness.

Scheduling Theory and Implementation: The work-stealing scheduler (1/13)

Cilk/Cilk++'s randomized work-stealing scheduler load-balances the computation at run-time. Each processor maintains a ready deque:
- A ready deque is a double-ended queue, where each entry is a procedure instance that is ready to execute.
- Adding a procedure instance to the bottom of the deque represents a procedure being spawned.
- A procedure instance being deleted from the bottom of the deque represents the processor beginning/resuming execution on that procedure.
- Deletion from the top of the deque corresponds to that procedure instance being stolen.

Scheduling Theory and Implementation: The work-stealing scheduler (2/13)

A mathematical proof guarantees near-perfect linear speed-up on applications with sufficient parallelism, as long as the architecture has sufficient memory bandwidth. A spawn/return in Cilk is over 100 times faster than a Pthread create/exit and less than 3 times slower than an ordinary C function call on a modern Intel processor.
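An illustrative sketch of the ready-deque discipline just described (the names are mine, and a toy std::deque stands in for the real runtime's lock-free deque):

    #include <deque>
    #include <cstdio>

    struct Frame { int id; };

    struct ReadyDeque {
        std::deque<Frame> d;
        // The owning worker pushes a spawned frame on the bottom...
        void push_bottom(Frame f) { d.push_back(f); }
        // ...and pops from the bottom to begin/resume work (LIFO: good locality).
        bool pop_bottom(Frame& f) {
            if (d.empty()) return false;
            f = d.back(); d.pop_back(); return true;
        }
        // A thief steals from the top: the oldest frame, usually the biggest work.
        bool steal_top(Frame& f) {
            if (d.empty()) return false;
            f = d.front(); d.pop_front(); return true;
        }
    };

    int main() {
        ReadyDeque q; Frame f;
        for (int i = 0; i < 3; ++i) {          // three spawns by the owner
            Frame g = {i}; q.push_bottom(g);
        }
        q.steal_top(f);  std::printf("thief stole frame %d\n", f.id);    // frame 0
        q.pop_bottom(f); std::printf("owner resumed frame %d\n", f.id);  // frame 2
    }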

Scheduling Theory and Implementation: The work-stealing scheduler (3/13)-(6/13)

[Animation: each worker pushes a frame onto the bottom of its deque on a spawn or call ("Spawn!", "Call!") and pops the bottom frame on a return ("Return!").]

Scheduling Theory and Implementation: The work-stealing scheduler (7/13)-(10/13)

[Animation continued: when a worker's deque runs empty after a return ("Return!"), it becomes a thief and steals the topmost frame of another worker's deque ("Steal!").]

Scheduling Theory and Implementation: The work-stealing scheduler (11/13)-(13/13)

[Animation concluded: the thief resumes the stolen frame and continues executing it, spawning from it in turn ("Spawn!").]

Scheduling Theory and Implementation: Performances of the work-stealing scheduler

Assume that:
- each strand executes in unit time,
- for almost all parallel steps there are at least p strands to run,
- each processor is either working or stealing.
Then, the randomized work-stealing scheduler is expected to run in
    T_p = T_1/p + O(T_∞).                               (10)

Proof sketch. During a steal-free parallel step (a step at which all processors have work on their deque) each of the p processors consumes one unit of work; thus, there are at most T_1/p steal-free parallel steps. During a parallel step with steals, each thief may reduce the span by 1 with probability 1/p; thus, the expected number of steals is O(p T_∞). Therefore, the expected running time is
    T_p = (T_1 + O(p T_∞))/p = T_1/p + O(T_∞).

Scheduling Theory and Implementation: Overheads and burden

Obviously T_1/p + T_∞ will under-estimate T_p in practice. Many factors (the simplifying assumptions of the fork-join parallelism model, architecture limitations, the costs of executing the parallel constructs, the overheads of scheduling) will make T_p larger in practice. One may want to estimate the impact of those factors:
1. by improving the estimate of the randomized work-stealing complexity result,
2. by comparing a Cilk++ program with its C++ elision,
3. by estimating the costs of spawning and synchronizing.

Scheduling Theory and Implementation: Span overhead

Let T_1, T_∞, T_p be given. We want to refine the randomized work-stealing complexity result. The span overhead is the smallest constant c_∞ such that
    T_p ≤ T_1/p + c_∞ T_∞.
Recall that T_1/T_∞ is the maximum possible speed-up that the application can obtain. We call parallel slackness assumption the following property:
    T_1/T_∞ >> c_∞ p,                                   (11)
that is, c_∞ p is much smaller than the average parallelism. Under this assumption it follows that T_1/p >> c_∞ T_∞ holds, thus c_∞ has little effect on performance when sufficient slackness exists.

Scheduling Theory and Implementation: Work overhead

Let T_s be the running time of the C++ elision of a Cilk++ program. We denote by c_1 the work overhead:
    c_1 = T_1 / T_s.
Recall the expected running time: T_p ≤ T_1/p + c_∞ T_∞. Thus, with the parallel slackness assumption, we get
    T_p ≤ c_1 T_s/p + c_∞ T_∞ ≈ c_1 T_s/p.              (12)
We can now state the work first principle precisely: minimize c_1, even at the expense of a larger c_∞. This is a key feature since it is conceptually easier to minimize c_1 than to minimize c_∞. Cilk++ estimates T_p as T_p = T_1/p + 1.7 x burdened span, where the burdened span is 15,000 instructions times the number of continuation edges along the critical path.

Scheduling Theory and Implementation: The cactus stack

[Figure: the spawn tree A -> B, A -> C, C -> D, C -> E and the corresponding views of the stack: A; A B; A C; A C D; A C E.]

A cactus stack is used to implement C's rule for sharing of function-local variables. A stack frame can only see data stored in the current and in the previous stack frames.
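An illustrative sketch of the cactus-stack views in the figure (the frame layout is simplified; in the real runtime, frames hold locals and are managed by the scheduler):

    #include <cstdio>

    // Each frame points to its parent; a frame "sees" exactly its ancestors.
    struct Frame {
        const char* name;
        Frame* parent;
    };

    void print_view(const Frame* f) {
        // Walk to the root, then print the view top-down: e.g. E sees A C E.
        if (!f) return;
        print_view(f->parent);
        std::printf("%s ", f->name);
    }

    int main() {
        Frame A = {"A", 0};
        Frame B = {"B", &A}, C = {"C", &A};   // B and C are siblings under A
        Frame D = {"D", &C}, E = {"E", &C};   // D and E are siblings under C
        const Frame* leaves[] = {&B, &D, &E};
        for (int i = 0; i < 3; ++i) {
            print_view(leaves[i]);            // prints: A B / A C D / A C E
            std::printf("\n");
        }
    }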

Scheduling Theory and Implementation: Space bounds

(Example with p = 3.) The space S_p of a parallel execution on p processors required by Cilk++'s work-stealing scheduler satisfies:
    S_p ≤ p S_1,                                        (13)
where S_1 is the minimal serial space requirement.

Measuring Parallelism in Practice: Cilkview

[Cilkview plot: measured speedup against the number of cores, bounded above by the Work Law (linear speedup), the Span Law and the parallelism, and below by the burdened-parallelism estimate, which accounts for scheduling overheads.]

- Cilkview computes work and span to derive upper bounds on parallel performance.
- Cilkview also estimates scheduling overhead to compute a burdened span for lower bounds.

Measuring Parallelism in Practice: The Fibonacci Cilk++ example

Code fragment:

    long fib(int n) {
        if (n < 2) return n;
        long x, y;
        x = cilk_spawn fib(n-1);
        y = fib(n-2);
        cilk_sync;
        return x + y;
    }
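A minimal harness of the kind one might use to produce the timings below (my own sketch, assuming the fib above, a C++11 <chrono> library, and a worker count set through the runtime, e.g. a command-line option or environment variable depending on the Cilk version):

    #include <chrono>
    #include <cstdio>
    #include <cstdlib>

    long fib(int n);  // the Cilk++ code fragment above

    int cilk_main(int argc, char* argv[]) {
        int n = (argc > 1) ? std::atoi(argv[1]) : 40;
        std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
        long r = fib(n);  // the parallel computation being timed
        std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
        std::printf("fib(%d) = %ld  time = %.3f s\n", n, r,
                    std::chrono::duration<double>(t1 - t0).count());
        return 0;
    }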

Measuring Parallelism in Practice: Fibonacci program timing

The environment for benchmarking:
- model name: Intel(R) Core(TM)2 Quad CPU @ 2.40GHz
- L2 cache size: 4096 KB
- memory size: 3 GB

[Table: running time (s) of fib(n) and speedup on 1, 2 and 4 cores, for various n.]

Measuring Parallelism in Practice: Quicksort

Code in cilk/examples/qsort:

    void sample_qsort(int * begin, int * end) {
        if (begin != end) {
            --end;  // exclude the last element (the pivot) from the partition
            int * middle = std::partition(begin, end,
                               std::bind2nd(std::less<int>(), *end));
            using std::swap;
            swap(*end, *middle);  // move the pivot to the middle
            cilk_spawn sample_qsort(begin, middle);
            sample_qsort(++middle, ++end);
            cilk_sync;
        }
    }

Measuring Parallelism in Practice: Quicksort timing

Timing for sorting an array of integers:

[Table: running time (s) and speedup on 1, 2 and 4 cores, for various array sizes.]

Measuring Parallelism in Practice: Matrix multiplication

Code in cilk/examples/matrix. Timing of multiplying a matrix by a matrix, iterative versus recursive:

[Table: sequential time (st), parallel time with 4 cores (pt) and speedup (su) for the iterative and recursive versions, at various thresholds.]

st = sequential time; pt = parallel time with 4 cores; su = speedup.
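Returning to sample_qsort above, an illustrative driver (my own sketch, not from the examples directory; the array size n is arbitrary):

    #include <cstdio>
    #include <cstdlib>

    void sample_qsort(int * begin, int * end);  // as above

    int cilk_main() {
        const int n = 10000000;
        int * a = new int[n];
        for (int i = 0; i < n; ++i) a[i] = std::rand();  // random input
        sample_qsort(a, a + n);
        bool ok = true;                                  // verify the result is sorted
        for (int i = 1; i < n; ++i)
            if (a[i-1] > a[i]) { ok = false; break; }
        std::printf("sorted: %s\n", ok ? "yes" : "no");
        delete[] a;
        return 0;
    }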

Measuring Parallelism in Practice: The cilkview example from the documentation

Using cilk_for to perform operations over an array in parallel:

    static const int COUNT = 4;
    static const int ITERATION = 1000000;
    long arr[COUNT];

    long do_work(long k) {
        long x = 15;
        static const int nn = 87;
        for (long i = 1; i < nn; ++i)
            x = x / i + k % i;
        return x;
    }

    int cilk_main() {
        for (int j = 0; j < ITERATION; j++)
            cilk_for (int i = 0; i < COUNT; i++)
                arr[i] += do_work( j * i + i + j);
    }

1) Parallelism Profile
   Work:                                  6,480,801,250 instructions
   Span:                                  2,116,801,250 instructions
   Burdened span:                         31,920,801,250 instructions
   Parallelism:                           3.06
   Burdened parallelism:                  0.20
   Number of spawns/syncs:                3,000,000
   Average instructions / strand:         720
   Strands along span:                    4,000,001
   Average instructions / strand on span: 529

2) Speedup Estimate
   2 processors: … ; 4 processors: … ; 8 processors: … ; 16 processors: … ; 32 processors: …

Measuring Parallelism in Practice: A simple fix

Inverting the two for loops:

    int cilk_main() {
        cilk_for (int i = 0; i < COUNT; i++)
            for (int j = 0; j < ITERATION; j++)
                arr[i] += do_work( j * i + i + j);
    }

1) Parallelism Profile
   Work:                                  5,295,801,529 instructions
   Span:                                  1,326,801,074 instructions
   Burdened span:                         1,326,830,911 instructions
   Parallelism:                           3.99
   Burdened parallelism:                  3.99
   Number of spawns/syncs:                3
   Average instructions / strand:         529,580,152
   Strands along span:                    5
   Average instructions / strand on span: 265,360,221

2) Speedup Estimate
   2 processors: … ; 4 processors: … ; 8 processors: … ; 16 processors: … ; 32 processors: …

The two profiles make the work-first message concrete: both versions do comparable work, but the original performs 3,000,000 spawns/syncs with strands of about 720 instructions each, so the scheduling burden swamps the useful work (burdened parallelism 0.20), whereas the fixed version performs only 3 spawns with strands of hundreds of millions of instructions, so its burdened parallelism matches its parallelism (3.99).

Measuring Parallelism in Practice: Timing

[Table: running time (s) and speedup of the original and improved versions on 1, 2 and 4 cores.]

Anticipating Parallelization Overheads: Pascal Triangle

Construction of the Pascal Triangle: nearly the simplest stencil computation!

Anticipating Parallelization Overheads: Divide and conquer: principle

[Figure: the triangle divided into a first sub-triangle I, two regions II computed concurrently, and a final sub-triangle III.]

The parallelism is Θ(n^{2 - log₂ 3}), so roughly Θ(n^{0.415}), which can be regarded as low parallelism.
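A sketch of where this exponent comes from (my derivation, assuming the decomposition shown in the figure: region I first, then the two regions II in parallel, then region III, each of side n/2). The span then satisfies a three-stage recurrence while the work is quadratic:

    S(n) = 3 S(n/2) + Θ(1)  ⟹  S(n) = Θ(n^{log₂ 3}),
    parallelism = W(n)/S(n) = Θ(n^2) / Θ(n^{log₂ 3}) = Θ(n^{2 - log₂ 3}) ≈ Θ(n^{0.415}).

This agrees with the tableau-construction entry (work Θ(n^2), span Ω(n^{lg 3})) in the table of results earlier.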

Anticipating Parallelization Overheads: Blocking strategy: principle

[Figure: the triangle divided into B x B blocks; the blocks a_1, ..., a_7 of a band along an anti-diagonal can be processed concurrently.]

Let B be the order of a block and n be the number of elements. The parallelism of Θ(n/B) can still be regarded as low parallelism, but it is better than with the divide-and-conquer scheme.

Anticipating Parallelization Overheads: Estimating parallelization overheads

The instruction stream dag of the blocking strategy consists of n/B binary trees T_0, T_1, ..., T_{n/B} such that:
- T_i is the instruction stream dag of the cilk_for loop executing the i-th band,
- each leaf of T_i is connected by an edge to the root of T_{i+1}.

Consequently, the burdened span is

    S_b(n) = Σ_{i=1}^{n/B} log(i) = log( Π_{i=1}^{n/B} i ) = log(Γ(n/B + 1)).

Using Stirling's formula, we deduce

    S_b(n) ∈ Θ( (n/B) log(n/B) ).                       (14)

Thus the burdened parallelism (that is, the ratio of work to burdened span) is Θ(B n / log(n/B)), which is sub-linear in n, while the non-burdened parallelism is Θ(n/B).

Anticipating Parallelization Overheads: Construction of the Pascal Triangle: experimental results

[Plot: speedup and parallelism versus the number of cores/workers, for the dynamic-block and static-block strategies.]

Anticipating Parallelization Overheads: Summary and notes

Burdened parallelism: parallelism after accounting for parallelization overheads (thread management, costs of scheduling, etc.). The burdened parallelism is estimated as the ratio of work to burdened span. The burdened span is defined as the maximum number of spawns/syncs on a critical path times the cost of a cilk_spawn (cilk_sync), taken as 15,000 cycles.

Impact in practice: the example of the Pascal Triangle. Consider executing one band after another, where for each band all B x B blocks are executed concurrently. The non-burdened span is in Θ(B^2 · n/B) = Θ(B n), while the burdened span is

    S_b(n) = Σ_{i=1}^{n/B} log(i) = log(Γ(n/B + 1)) ∈ Θ( (n/B) log(n/B) ).
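The Stirling step in detail (a short derivation, writing m = n/B):

    S_b(n) = log Γ(m + 1) = Σ_{i=1}^{m} log(i)
           = m log(m) - m log(e) + O(log m)
           ∈ Θ(m log m) = Θ( (n/B) log(n/B) ).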

Announcements: Acknowledgements

- Charles E. Leiserson (MIT) for providing me with the sources of his lecture notes.
- Matteo Frigo (Intel) for supporting the work of my team with Cilk++.
- Yuzhen Xie (UWO) for helping me with the images used in these slides.
- Liyun Li (UWO) for generating the experimental data.

Announcements: References

- Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The Implementation of the Cilk-5 Multithreaded Language. Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation, June 1998.
- Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An Efficient Multithreaded Runtime System. Journal of Parallel and Distributed Computing, 55-69, August 25, 1996.
- Robert D. Blumofe and Charles E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. Journal of the ACM, Vol. 46, No. 5, pp. 720-748, September 1999.
