Plan. 1 Parallelism Complexity Measures. 2 cilk for Loops. 3 Scheduling Theory and Implementation. 4 Measuring Parallelism in Practice


Plan

Multithreaded Parallelism and Performance Measures
Marc Moreno Maza
University of Western Ontario, London, Ontario (Canada)
CS 4402 - CS 9535

1 Parallelism Complexity Measures
2 cilk_for Loops
3 Scheduling Theory and Implementation
4 Measuring Parallelism in Practice
5 Anticipating Parallelization Overheads
6 Announcements

Parallelism Complexity Measures: The fork-join parallelism model

    int fib (int n) {
        if (n < 2) return (n);
        else {
            int x, y;
            x = cilk_spawn fib(n-1);
            y = fib(n-2);
            cilk_sync;
            return (x+y);
        }
    }

Example: fib(4). Processor oblivious: the computation dag unfolds dynamically. We shall also call this model multithreaded parallelism.
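A useful property of the language (used later, when we measure work overhead): erasing the Cilk keywords from this code yields its serial C++ elision, an ordinary sequential program computing the same result. A minimal sketch, with fib_serial as an illustrative name:

    int fib_serial(int n) {
        if (n < 2) return (n);
        else {
            int x, y;
            x = fib_serial(n-1);  // cilk_spawn erased: a plain call
            y = fib_serial(n-2);
            /* cilk_sync erased: nothing to wait for */
            return (x+y);
        }
    }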

Parallelism Complexity Measures: Terminology

[Figure: a dag of strands with an initial strand, a final strand, and continue, spawn and return edges.]

A strand is a maximal sequence of instructions that ends with a spawn, sync, or return (either explicit or implicit) statement. At runtime, the spawn relation causes procedure instances to be structured as a rooted tree, called the spawn tree or parallel instruction stream, where dependencies among strands form a dag.

Parallelism Complexity Measures: Work and span

We define several performance measures. We assume an ideal situation: no cache issues, no interprocessor costs.
- T_p is the minimum running time on p processors.
- T_1 is called the work, that is, the sum of the number of instructions at each node.
- T_∞ is the minimum running time with infinitely many processors, called the span.

Parallelism Complexity Measures: The critical path length

Assuming all strands run in unit time, the longest path in the dag is equal to T_∞. For this reason, T_∞ is also referred to as the critical path length.

Parallelism Complexity Measures: Work law

We have: T_p ≥ T_1/p. Indeed, in the best case, p processors can do p units of work per unit of time.

Parallelism Complexity Measures: Span law

We have: T_p ≥ T_∞. Indeed, T_p < T_∞ would contradict the definitions of T_p and T_∞.

Parallelism Complexity Measures: Speedup on p processors

T_1/T_p is called the speedup on p processors. A parallel program execution can have:
- linear speedup: T_1/T_p = Θ(p),
- superlinear speedup: T_1/T_p = ω(p) (not possible in this model, though it is possible in others),
- sublinear speedup: T_1/T_p = o(p).

Parallelism Complexity Measures: Parallelism

Because the Span Law dictates that T_p ≥ T_∞, the maximum possible speedup given T_1 and T_∞ is

    T_1/T_∞ = parallelism = the average amount of work per step along the span.

Parallelism Complexity Measures: The Fibonacci example (1/2)

For Fib(4), we have T_1 = 17 and T_∞ = 8, and thus T_1/T_∞ = 2.125. What about T_1(Fib(n)) and T_∞(Fib(n))?
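The numbers above can be recovered by counting strands; a small sketch (my reconstruction of the counting, assuming each spawning instance contributes three strands, namely the part before cilk_spawn, the continuation that calls fib(n-2), and the part after cilk_sync, while each leaf n < 2 is a single strand):

    #include <algorithm>
    #include <cstdio>

    long work(int n) {            // T_1: total number of strands
        if (n < 2) return 1;
        return work(n-1) + work(n-2) + 3;
    }
    long span(int n) {            // T_inf: longest path in the dag
        if (n < 2) return 1;
        // spawn path: A -> fib(n-1) -> C ; call path: A -> B -> fib(n-2) -> C
        return std::max(span(n-1) + 2, span(n-2) + 3);
    }
    int main() {
        std::printf("work=%ld span=%ld parallelism=%.3f\n",
                    work(4), span(4), (double)work(4)/span(4));
        // prints: work=17 span=8 parallelism=2.125
    }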

Parallelism Complexity Measures: The Fibonacci example (2/2)

We have T_1(n) = T_1(n-1) + T_1(n-2) + Θ(1). Let's solve it. One verifies by induction that T_1(n) ≤ a F_n - b for b > 0 large enough to dominate Θ(1) and a > 1. We can then choose a large enough to satisfy the initial condition, whatever that is. On the other hand, we also have F_n ≤ T_1(n). Therefore T_1(n) = Θ(F_n) = Θ(ψ^n) with ψ = (1 + √5)/2.

We have T_∞(n) = max(T_∞(n-1), T_∞(n-2)) + Θ(1). We easily check T_∞(n-1) ≥ T_∞(n-2). This implies T_∞(n) = T_∞(n-1) + Θ(1). Therefore T_∞(n) = Θ(n). Consequently, the parallelism is Θ(ψ^n / n).

Parallelism Complexity Measures: Series composition

[Figure: dag A followed by dag B.] Work? Span?

Work: T_1(A ∪ B) = T_1(A) + T_1(B)
Span: T_∞(A ∪ B) = T_∞(A) + T_∞(B)

Parallelism Complexity Measures: Parallel composition

[Figure: dags A and B side by side, between a common fork and join.] Work? Span?

Work: T_1(A ∪ B) = T_1(A) + T_1(B)
Span: T_∞(A ∪ B) = max(T_∞(A), T_∞(B))

Parallelism Complexity Measures: Some results in the fork-join parallelism model

    Algorithm               Work         Span
    Merge sort              Θ(n lg n)    Θ(lg^3 n)
    Matrix multiplication   Θ(n^3)       Θ(lg n)
    Strassen                Θ(n^{lg 7})  Θ(lg^2 n)
    LU-decomposition        Θ(n^3)       Θ(n lg n)
    Tableau construction    Θ(n^2)       Ω(n^{lg 3})
    FFT                     Θ(n lg n)    Θ(lg^2 n)
    Breadth-first search    Θ(E)         Θ(d lg V)

We shall prove those results in the next lectures.

cilk_for Loops: For loop parallelism in Cilk++

[Figure: an n x n matrix A and its transpose A^T.]

    cilk_for (int i=1; i<n; ++i) {
        for (int j=0; j<i; ++j) {
            double temp = A[i][j];
            A[i][j] = A[j][i];
            A[j][i] = temp;
        }
    }

The iterations of a cilk_for loop execute in parallel.
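For concreteness, a self-contained driver around this loop (my own sketch, assuming a Cilk++-style compiler). Note that iteration i only touches the pairs (i,j) and (j,i) with j < i, so distinct iterations write disjoint locations and the loop is race-free:

    #include <cstdio>

    const int n = 4;
    double A[n][n];

    int cilk_main() {
        for (int i = 0; i < n; ++i)           // fill A with distinct entries
            for (int j = 0; j < n; ++j)
                A[i][j] = 10*i + j;
        cilk_for (int i = 1; i < n; ++i) {    // in-place transpose, as above
            for (int j = 0; j < i; ++j) {
                double temp = A[i][j];
                A[i][j] = A[j][i];
                A[j][i] = temp;
            }
        }
        for (int i = 0; i < n; ++i) {         // print the transposed matrix
            for (int j = 0; j < n; ++j) std::printf("%5.0f", A[i][j]);
            std::printf("\n");
        }
        return 0;
    }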

cilk_for Loops: Implementation of for loops in Cilk++

Up to details (next week!) the previous loop is compiled as follows, using a divide-and-conquer implementation:

    void recur(int lo, int hi) {
        if (hi > lo) { // coarsen
            int mid = lo + (hi - lo)/2;
            cilk_spawn recur(lo, mid);
            recur(mid+1, hi);
            cilk_sync;
        } else {
            int i = lo;  // a single iteration of the original loop
            for (int j=0; j<i; ++j) {
                double temp = A[i][j];
                A[i][j] = A[j][i];
                A[j][i] = temp;
            }
        }
    }

cilk_for Loops: Analysis of parallel for loops

Here we do not assume that each strand runs in unit time.
- Span of loop control: Θ(log(n))
- Max span of an iteration: Θ(n)
- Span: Θ(n)
- Work: Θ(n^2)
- Parallelism: Θ(n)

cilk_for Loops: Parallelizing the inner loop

    cilk_for (int i=1; i<n; ++i) {
        cilk_for (int j=0; j<i; ++j) {
            double temp = A[i][j];
            A[i][j] = A[j][i];
            A[j][i] = temp;
        }
    }

- Span of outer loop control: Θ(log(n))
- Max span of an inner loop control: Θ(log(n))
- Span of an iteration: Θ(1)
- Span: Θ(log(n))
- Work: Θ(n^2)
- Parallelism: Θ(n^2 / log(n))

But! More on this next week...
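The same divide-and-conquer pattern as a reusable helper, to make the loop-to-recursion mapping explicit (an illustrative sketch assuming a Cilk++-style compiler; parallel_apply and the grain constant are my names, not part of the runtime):

    template <typename F>
    void parallel_apply(int lo, int hi, F f) {
        const int grain = 64;                      // coarsening threshold (illustrative)
        if (hi - lo > grain) {
            int mid = lo + (hi - lo) / 2;
            cilk_spawn parallel_apply(lo, mid, f); // left half in a child strand
            parallel_apply(mid, hi, f);            // right half in this strand
            cilk_sync;                             // join the two halves
        } else {
            for (int i = lo; i < hi; ++i) f(i);    // serial leaf: at most grain iterations
        }
    }

The control tree built by the spawns has depth Θ(log(n/grain)), which is where the Θ(log(n)) span of the loop control in the analysis above comes from.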

Scheduling Theory and Implementation: Scheduling

[Figure: p processors, each with a private cache ($), connected to memory, I/O and the network.]

A scheduler's job is to map a computation to particular processors. Such a mapping is called a schedule. If decisions are made at runtime, the scheduler is online; otherwise, it is offline. Cilk++'s scheduler maps strands onto processors dynamically at runtime.

Scheduling Theory and Implementation: Greedy scheduling (1/2)

- A strand is ready if all its predecessors have executed.
- A scheduler is greedy if it attempts to do as much work as possible at every step.

Scheduling Theory and Implementation: Greedy scheduling (2/2)

(Example with p = 3.) In any greedy schedule, there are two types of steps:
- complete step: there are at least p strands that are ready to run; the greedy scheduler selects any p of them and runs them.
- incomplete step: there are strictly fewer than p strands that are ready to run; the greedy scheduler runs them all.

Scheduling Theory and Implementation: Theorem of Graham and Brent

For any greedy schedule, we have T_p ≤ T_1/p + T_∞.

Proof. The number of complete steps is at most T_1/p, by definition of T_1. The number of incomplete steps is at most T_∞. Indeed, let G' be the subgraph of G that remains to be executed immediately prior to an incomplete step. (i) During this incomplete step, all strands that can be run are actually run. (ii) Hence removing this incomplete step from G' reduces T_∞ by one.
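A minimal simulator (an illustrative sketch, not from the slides) that runs a greedy schedule on a unit-cost dag, treating each step as complete or incomplete as defined above; it lets one check the Graham-Brent bound T_p ≤ T_1/p + T_∞ on small examples:

    #include <cstdio>
    #include <vector>
    #include <algorithm>

    // Greedy schedule of a unit-cost dag on p processors.
    // succ[v] lists the successors of strand v; returns the number of steps T_p.
    int greedy_schedule(const std::vector<std::vector<int> >& succ, int p) {
        int n = succ.size();
        std::vector<int> indeg(n, 0);
        for (int v = 0; v < n; ++v)
            for (size_t k = 0; k < succ[v].size(); ++k) indeg[succ[v][k]]++;
        std::vector<int> ready;
        for (int v = 0; v < n; ++v)
            if (indeg[v] == 0) ready.push_back(v);     // strands with no predecessors
        int steps = 0, done = 0;
        while (done < n) {
            // Complete step if at least p strands are ready; otherwise incomplete.
            int k = std::min((int)ready.size(), p);
            std::vector<int> run(ready.end() - k, ready.end());
            ready.resize(ready.size() - k);
            ++steps; done += k;
            for (size_t a = 0; a < run.size(); ++a)    // release the successors
                for (size_t b = 0; b < succ[run[a]].size(); ++b)
                    if (--indeg[succ[run[a]][b]] == 0) ready.push_back(succ[run[a]][b]);
        }
        return steps;
    }

    int main() {
        // Diamond dag: 0 -> {1,2}, {1,2} -> 3 ; here T_1 = 4 and T_inf = 3.
        std::vector<std::vector<int> > succ(4);
        succ[0].push_back(1); succ[0].push_back(2);
        succ[1].push_back(3); succ[2].push_back(3);
        for (int p = 1; p <= 3; ++p)
            std::printf("p=%d T_p=%d (Graham-Brent bound %.2f)\n",
                        p, greedy_schedule(succ, p), 4.0/p + 3);
    }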

Scheduling Theory and Implementation: Corollary 1

A greedy scheduler is always within a factor of 2 of optimal.

Proof. From the work and span laws, we have:
    T_p ≥ max(T_1/p, T_∞).                              (1)
In addition, we can trivially express:
    T_1/p ≤ max(T_1/p, T_∞),                            (2)
    T_∞ ≤ max(T_1/p, T_∞).                              (3)
From the Graham-Brent theorem, we deduce:
    T_p ≤ T_1/p + T_∞                                   (4)
        ≤ max(T_1/p, T_∞) + max(T_1/p, T_∞)             (5)
        ≤ 2 max(T_1/p, T_∞),                            (6)
which concludes the proof.

Scheduling Theory and Implementation: Corollary 2

The greedy scheduler achieves linear speedup whenever T_∞ = O(T_1/p).

Proof. From the Graham-Brent theorem, we deduce:
    T_p ≤ T_1/p + T_∞                                   (7)
        = T_1/p + O(T_1/p)                              (8)
        = Θ(T_1/p).                                     (9)
The idea is to operate in the range where T_1/p dominates T_∞. As long as T_1/p dominates T_∞, all processors can be used efficiently. The quantity T_1/(p T_∞) is called the parallel slackness.

Scheduling Theory and Implementation: The work-stealing scheduler (1/13)

Cilk/Cilk++'s randomized work-stealing scheduler load-balances the computation at run-time. Each processor maintains a ready deque:
- A ready deque is a double-ended queue, where each entry is a procedure instance that is ready to execute.
- Adding a procedure instance to the bottom of the deque represents a procedure being spawned.
- A procedure instance being deleted from the bottom of the deque represents the processor beginning/resuming execution on that procedure.
- Deletion from the top of the deque corresponds to that procedure instance being stolen.

Scheduling Theory and Implementation: The work-stealing scheduler (2/13)

A mathematical proof guarantees near-perfect linear speed-up on applications with sufficient parallelism, as long as the architecture has sufficient memory bandwidth. A spawn/return in Cilk is over 100 times faster than a Pthread create/exit and less than 3 times slower than an ordinary C function call on a modern Intel processor.
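An illustrative sketch of the ready-deque discipline just described (the names are mine, and a toy std::deque stands in for the real runtime's lock-free deque):

    #include <deque>
    #include <cstdio>

    struct Frame { int id; };

    struct ReadyDeque {
        std::deque<Frame> d;
        // The owning worker pushes a spawned frame on the bottom...
        void push_bottom(Frame f) { d.push_back(f); }
        // ...and pops from the bottom to begin/resume work (LIFO: good locality).
        bool pop_bottom(Frame& f) {
            if (d.empty()) return false;
            f = d.back(); d.pop_back(); return true;
        }
        // A thief steals from the top: the oldest frame, usually the biggest work.
        bool steal_top(Frame& f) {
            if (d.empty()) return false;
            f = d.front(); d.pop_front(); return true;
        }
    };

    int main() {
        ReadyDeque q; Frame f;
        for (int i = 0; i < 3; ++i) {          // three spawns by the owner
            Frame g = {i}; q.push_bottom(g);
        }
        q.steal_top(f);  std::printf("thief stole frame %d\n", f.id);    // frame 0
        q.pop_bottom(f); std::printf("owner resumed frame %d\n", f.id);  // frame 2
    }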

Scheduling Theory and Implementation: The work-stealing scheduler (3/13)-(6/13)

[Animation: each worker pushes a frame onto the bottom of its deque on a spawn or call ("Spawn!", "Call!") and pops the bottom frame on a return ("Return!").]

Scheduling Theory and Implementation: The work-stealing scheduler (7/13)-(10/13)

[Animation continued: when a worker's deque runs empty after a return ("Return!"), it becomes a thief and steals the topmost frame of another worker's deque ("Steal!").]

Scheduling Theory and Implementation: The work-stealing scheduler (11/13)-(13/13)

[Animation concluded: the thief resumes the stolen frame and continues executing it, spawning from it in turn ("Spawn!").]

Scheduling Theory and Implementation: Performances of the work-stealing scheduler

Assume that:
- each strand executes in unit time,
- for almost all parallel steps there are at least p strands to run,
- each processor is either working or stealing.
Then, the randomized work-stealing scheduler is expected to run in
    T_p = T_1/p + O(T_∞).                               (10)

Proof sketch. During a steal-free parallel step (a step at which all processors have work on their deque) each of the p processors consumes one unit of work; thus, there are at most T_1/p steal-free parallel steps. During a parallel step with steals, each thief may reduce the span by 1 with probability 1/p; thus, the expected number of steals is O(p T_∞). Therefore, the expected running time is
    T_p = (T_1 + O(p T_∞))/p = T_1/p + O(T_∞).

Scheduling Theory and Implementation: Overheads and burden

Obviously T_1/p + T_∞ will under-estimate T_p in practice. Many factors (the simplifying assumptions of the fork-join parallelism model, architecture limitations, the costs of executing the parallel constructs, the overheads of scheduling) will make T_p larger in practice. One may want to estimate the impact of those factors:
1. by improving the estimate of the randomized work-stealing complexity result,
2. by comparing a Cilk++ program with its C++ elision,
3. by estimating the costs of spawning and synchronizing.

Scheduling Theory and Implementation: Span overhead

Let T_1, T_∞, T_p be given. We want to refine the randomized work-stealing complexity result. The span overhead is the smallest constant c_∞ such that
    T_p ≤ T_1/p + c_∞ T_∞.
Recall that T_1/T_∞ is the maximum possible speed-up that the application can obtain. We call parallel slackness assumption the following property:
    T_1/T_∞ >> c_∞ p,                                   (11)
that is, c_∞ p is much smaller than the average parallelism. Under this assumption it follows that T_1/p >> c_∞ T_∞ holds, thus c_∞ has little effect on performance when sufficient slackness exists.

Scheduling Theory and Implementation: Work overhead

Let T_s be the running time of the C++ elision of a Cilk++ program. We denote by c_1 the work overhead:
    c_1 = T_1 / T_s.
Recall the expected running time: T_p ≤ T_1/p + c_∞ T_∞. Thus, with the parallel slackness assumption, we get
    T_p ≤ c_1 T_s/p + c_∞ T_∞ ≈ c_1 T_s/p.              (12)
We can now state the work first principle precisely: minimize c_1, even at the expense of a larger c_∞. This is a key feature since it is conceptually easier to minimize c_1 than to minimize c_∞. Cilk++ estimates T_p as T_p = T_1/p + 1.7 x burdened span, where the burdened span is 15,000 instructions times the number of continuation edges along the critical path.

Scheduling Theory and Implementation: The cactus stack

[Figure: the spawn tree A -> B, A -> C, C -> D, C -> E and the corresponding views of the stack: A; A B; A C; A C D; A C E.]

A cactus stack is used to implement C's rule for sharing of function-local variables. A stack frame can only see data stored in the current and in the previous stack frames.
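An illustrative sketch of the cactus-stack views in the figure (the frame layout is simplified; in the real runtime, frames hold locals and are managed by the scheduler):

    #include <cstdio>

    // Each frame points to its parent; a frame "sees" exactly its ancestors.
    struct Frame {
        const char* name;
        Frame* parent;
    };

    void print_view(const Frame* f) {
        // Walk to the root, then print the view top-down: e.g. E sees A C E.
        if (!f) return;
        print_view(f->parent);
        std::printf("%s ", f->name);
    }

    int main() {
        Frame A = {"A", 0};
        Frame B = {"B", &A}, C = {"C", &A};   // B and C are siblings under A
        Frame D = {"D", &C}, E = {"E", &C};   // D and E are siblings under C
        const Frame* leaves[] = {&B, &D, &E};
        for (int i = 0; i < 3; ++i) {
            print_view(leaves[i]);            // prints: A B / A C D / A C E
            std::printf("\n");
        }
    }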

Scheduling Theory and Implementation: Space bounds

(Example with p = 3.) The space S_p of a parallel execution on p processors required by Cilk++'s work-stealing scheduler satisfies:
    S_p ≤ p S_1,                                        (13)
where S_1 is the minimal serial space requirement.

Measuring Parallelism in Practice: Cilkview

[Cilkview plot: measured speedup against the number of cores, bounded above by the Work Law (linear speedup), the Span Law and the parallelism, and below by the burdened-parallelism estimate, which accounts for scheduling overheads.]

- Cilkview computes work and span to derive upper bounds on parallel performance.
- Cilkview also estimates scheduling overhead to compute a burdened span for lower bounds.

Measuring Parallelism in Practice: The Fibonacci Cilk++ example

Code fragment:

    long fib(int n) {
        if (n < 2) return n;
        long x, y;
        x = cilk_spawn fib(n-1);
        y = fib(n-2);
        cilk_sync;
        return x + y;
    }
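A minimal harness of the kind one might use to produce the timings below (my own sketch, assuming the fib above, a C++11 <chrono> library, and a worker count set through the runtime, e.g. a command-line option or environment variable depending on the Cilk version):

    #include <chrono>
    #include <cstdio>
    #include <cstdlib>

    long fib(int n);  // the Cilk++ code fragment above

    int cilk_main(int argc, char* argv[]) {
        int n = (argc > 1) ? std::atoi(argv[1]) : 40;
        std::chrono::steady_clock::time_point t0 = std::chrono::steady_clock::now();
        long r = fib(n);  // the parallel computation being timed
        std::chrono::steady_clock::time_point t1 = std::chrono::steady_clock::now();
        std::printf("fib(%d) = %ld  time = %.3f s\n", n, r,
                    std::chrono::duration<double>(t1 - t0).count());
        return 0;
    }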

Measuring Parallelism in Practice: Fibonacci program timing

The environment for benchmarking:
- model name: Intel(R) Core(TM)2 Quad CPU @ 2.40GHz
- L2 cache size: 4096 KB
- memory size: 3 GB

[Table: running time (s) of fib(n) and speedup on 1, 2 and 4 cores, for various n.]

Measuring Parallelism in Practice: Quicksort

Code in cilk/examples/qsort:

    void sample_qsort(int * begin, int * end) {
        if (begin != end) {
            --end;  // exclude the last element (the pivot) from the partition
            int * middle = std::partition(begin, end,
                               std::bind2nd(std::less<int>(), *end));
            using std::swap;
            swap(*end, *middle);  // move the pivot to the middle
            cilk_spawn sample_qsort(begin, middle);
            sample_qsort(++middle, ++end);
            cilk_sync;
        }
    }

Measuring Parallelism in Practice: Quicksort timing

Timing for sorting an array of integers:

[Table: running time (s) and speedup on 1, 2 and 4 cores, for various array sizes.]

Measuring Parallelism in Practice: Matrix multiplication

Code in cilk/examples/matrix. Timing of multiplying a matrix by a matrix, iterative versus recursive:

[Table: sequential time (st), parallel time with 4 cores (pt) and speedup (su) for the iterative and recursive versions, at various thresholds.]

st = sequential time; pt = parallel time with 4 cores; su = speedup.
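Returning to sample_qsort above, an illustrative driver (my own sketch, not from the examples directory; the array size n is arbitrary):

    #include <cstdio>
    #include <cstdlib>

    void sample_qsort(int * begin, int * end);  // as above

    int cilk_main() {
        const int n = 10000000;
        int * a = new int[n];
        for (int i = 0; i < n; ++i) a[i] = std::rand();  // random input
        sample_qsort(a, a + n);
        bool ok = true;                                  // verify the result is sorted
        for (int i = 1; i < n; ++i)
            if (a[i-1] > a[i]) { ok = false; break; }
        std::printf("sorted: %s\n", ok ? "yes" : "no");
        delete[] a;
        return 0;
    }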

Measuring Parallelism in Practice: The cilkview example from the documentation

Using cilk_for to perform operations over an array in parallel:

    static const int COUNT = 4;
    static const int ITERATION = 1000000;
    long arr[COUNT];

    long do_work(long k) {
        long x = 15;
        static const int nn = 87;
        for (long i = 1; i < nn; ++i)
            x = x / i + k % i;
        return x;
    }

    int cilk_main() {
        for (int j = 0; j < ITERATION; j++)
            cilk_for (int i = 0; i < COUNT; i++)
                arr[i] += do_work( j * i + i + j);
    }

1) Parallelism Profile
   Work:                                  6,480,801,250 instructions
   Span:                                  2,116,801,250 instructions
   Burdened span:                         31,920,801,250 instructions
   Parallelism:                           3.06
   Burdened parallelism:                  0.20
   Number of spawns/syncs:                3,000,000
   Average instructions / strand:         720
   Strands along span:                    4,000,001
   Average instructions / strand on span: 529

2) Speedup Estimate
   2 processors: … ; 4 processors: … ; 8 processors: … ; 16 processors: … ; 32 processors: …

Measuring Parallelism in Practice: A simple fix

Inverting the two for loops:

    int cilk_main() {
        cilk_for (int i = 0; i < COUNT; i++)
            for (int j = 0; j < ITERATION; j++)
                arr[i] += do_work( j * i + i + j);
    }

1) Parallelism Profile
   Work:                                  5,295,801,529 instructions
   Span:                                  1,326,801,074 instructions
   Burdened span:                         1,326,830,911 instructions
   Parallelism:                           3.99
   Burdened parallelism:                  3.99
   Number of spawns/syncs:                3
   Average instructions / strand:         529,580,152
   Strands along span:                    5
   Average instructions / strand on span: 265,360,221

2) Speedup Estimate
   2 processors: … ; 4 processors: … ; 8 processors: … ; 16 processors: … ; 32 processors: …

The two profiles make the work-first message concrete: both versions do comparable work, but the original performs 3,000,000 spawns/syncs with strands of about 720 instructions each, so the scheduling burden swamps the useful work (burdened parallelism 0.20), whereas the fixed version performs only 3 spawns with strands of hundreds of millions of instructions, so its burdened parallelism matches its parallelism (3.99).

Measuring Parallelism in Practice: Timing

[Table: running time (s) and speedup of the original and improved versions on 1, 2 and 4 cores.]

Anticipating Parallelization Overheads: Pascal Triangle

Construction of the Pascal Triangle: nearly the simplest stencil computation!

Anticipating Parallelization Overheads: Divide and conquer: principle

[Figure: the triangle divided into a first sub-triangle I, two regions II computed concurrently, and a final sub-triangle III.]

The parallelism is Θ(n^{2 - log₂ 3}), so roughly Θ(n^{0.415}), which can be regarded as low parallelism.
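A sketch of where this exponent comes from (my derivation, assuming the decomposition shown in the figure: region I first, then the two regions II in parallel, then region III, each of side n/2). The span then satisfies a three-stage recurrence while the work is quadratic:

    S(n) = 3 S(n/2) + Θ(1)  ⟹  S(n) = Θ(n^{log₂ 3}),
    parallelism = W(n)/S(n) = Θ(n^2) / Θ(n^{log₂ 3}) = Θ(n^{2 - log₂ 3}) ≈ Θ(n^{0.415}).

This agrees with the tableau-construction entry (work Θ(n^2), span Ω(n^{lg 3})) in the table of results earlier.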

Anticipating Parallelization Overheads: Blocking strategy: principle

[Figure: the triangle divided into B x B blocks; the blocks a_1, ..., a_7 of a band along an anti-diagonal can be processed concurrently.]

Let B be the order of a block and n be the number of elements. The parallelism of Θ(n/B) can still be regarded as low parallelism, but it is better than with the divide-and-conquer scheme.

Anticipating Parallelization Overheads: Estimating parallelization overheads

The instruction stream dag of the blocking strategy consists of n/B binary trees T_0, T_1, ..., T_{n/B} such that:
- T_i is the instruction stream dag of the cilk_for loop executing the i-th band,
- each leaf of T_i is connected by an edge to the root of T_{i+1}.

Consequently, the burdened span is

    S_b(n) = Σ_{i=1}^{n/B} log(i) = log( Π_{i=1}^{n/B} i ) = log(Γ(n/B + 1)).

Using Stirling's formula, we deduce

    S_b(n) ∈ Θ( (n/B) log(n/B) ).                       (14)

Thus the burdened parallelism (that is, the ratio of work to burdened span) is Θ(B n / log(n/B)), which is sub-linear in n, while the non-burdened parallelism is Θ(n/B).

Anticipating Parallelization Overheads: Construction of the Pascal Triangle: experimental results

[Plot: speedup and parallelism versus the number of cores/workers, for the dynamic-block and static-block strategies.]

Anticipating Parallelization Overheads: Summary and notes

Burdened parallelism: parallelism after accounting for parallelization overheads (thread management, costs of scheduling, etc.). The burdened parallelism is estimated as the ratio of work to burdened span. The burdened span is defined as the maximum number of spawns/syncs on a critical path times the cost of a cilk_spawn (cilk_sync), taken as 15,000 cycles.

Impact in practice: the example of the Pascal Triangle. Consider executing one band after another, where for each band all B x B blocks are executed concurrently. The non-burdened span is in Θ(B^2 · n/B) = Θ(B n), while the burdened span is

    S_b(n) = Σ_{i=1}^{n/B} log(i) = log(Γ(n/B + 1)) ∈ Θ( (n/B) log(n/B) ).
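The Stirling step in detail (a short derivation, writing m = n/B):

    S_b(n) = log Γ(m + 1) = Σ_{i=1}^{m} log(i)
           = m log(m) - m log(e) + O(log m)
           ∈ Θ(m log m) = Θ( (n/B) log(n/B) ).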

Announcements: Acknowledgements

- Charles E. Leiserson (MIT) for providing me with the sources of his lecture notes.
- Matteo Frigo (Intel) for supporting the work of my team with Cilk++.
- Yuzhen Xie (UWO) for helping me with the images used in these slides.
- Liyun Li (UWO) for generating the experimental data.

Announcements: References

- Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The Implementation of the Cilk-5 Multithreaded Language. Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation, June 1998.
- Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An Efficient Multithreaded Runtime System. Journal of Parallel and Distributed Computing, 55-69, August 25, 1996.
- Robert D. Blumofe and Charles E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. Journal of the ACM, Vol. 46, No. 5, pp. 720-748, September 1999.
