
Multithreaded Parallelism and Performance Measures

Marc Moreno Maza

University of Western Ontario, London, Ontario (Canada)

CS 4402 - CS 9535

Plan

1 Parallelism Complexity Measures
2 cilk_for Loops
3 Scheduling Theory and Implementation
4 Measuring Parallelism in Practice
5 Announcements

Parallelism Complexity Measures

The fork-join parallelism model

int fib (int n) {
    if (n < 2) return (n);
    else {
        int x, y;
        x = cilk_spawn fib(n-1);
        y = fib(n-2);
        cilk_sync;
        return (x+y);
    }
}

Example: fib(4). (Figure: the computation dag of fib(4).) The computation dag unfolds dynamically. The model is processor oblivious. We shall also call this model multithreaded parallelism.
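For later reference (a standard observation about Cilk++; this particular snippet is an illustration of mine, not a slide from the deck): erasing the cilk_spawn and cilk_sync keywords from fib yields its serial C++ elision, an ordinary recursive function computing the same result. The elision is what T_s will refer to when we discuss work overhead.

int fib_serial(int n) {
    // Same computation as fib, with the parallel keywords erased.
    if (n < 2) return n;
    int x = fib_serial(n - 1);
    int y = fib_serial(n - 2);
    return x + y;
}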

Terminology

A strand is a maximal sequence of instructions that ends with a spawn, sync, or return (either explicit or implicit) statement. At runtime, the spawn relation causes procedure instances to be structured as a rooted tree, called the spawn tree or parallel instruction stream, where dependencies among strands form a dag. (Figure: initial strand, final strand, spawn edge, continue edge, return edge.)

Work and span

We define several performance measures. We assume an ideal situation: no cache issues, no interprocessor costs:
T_p is the minimum running time on p processors.
T_1 is called the work, that is, the sum of the number of instructions at each node.
T_∞ is the minimum running time with infinitely many processors, called the span.

The critical path length

Assuming all strands run in unit time, the longest path in the DAG is equal to T_∞. For this reason, T_∞ is also referred to as the critical path length.

Work law

We have: T_p ≥ T_1/p. Indeed, in the best case, p processors can do p units of work per unit of time.
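As a concrete illustration of these definitions (a sketch of mine, not from the slides, using a made-up four-strand dag): the work of a dag of unit-cost strands is the number of strands, and the span is the length of the longest path, computed here by a topological-order sweep.

// Work and span of a dag of unit-cost strands (illustration only).
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // edges[u] lists the strands that depend on strand u.
    // Here: strand 0 forks strands 1 and 2, which both join into strand 3.
    std::vector<std::vector<int>> edges = { {1, 2}, {3}, {3}, {} };
    int n = (int)edges.size();
    std::vector<int> indeg(n, 0), depth(n, 1);
    for (auto &out : edges)
        for (int v : out) indeg[v]++;
    // Topological sweep: depth[v] = length of the longest path ending at v.
    std::vector<int> stack;
    for (int u = 0; u < n; ++u)
        if (indeg[u] == 0) stack.push_back(u);
    while (!stack.empty()) {
        int u = stack.back(); stack.pop_back();
        for (int v : edges[u]) {
            depth[v] = std::max(depth[v], depth[u] + 1);
            if (--indeg[v] == 0) stack.push_back(v);
        }
    }
    int work = n;  // T1: every strand costs one unit
    int span = *std::max_element(depth.begin(), depth.end());  // T_inf
    std::printf("T1 = %d, Tinf = %d\n", work, span);
    return 0;
}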

Span law

We have: T_p ≥ T_∞. Indeed, T_p < T_∞ would contradict the definitions of T_p and T_∞.

Speedup on p processors

T_1/T_p is called the speedup on p processors. A parallel program execution can have:
linear speedup: T_1/T_p = Θ(p)
superlinear speedup: T_1/T_p = ω(p) (not possible in this model, though it is possible in others)
sublinear speedup: T_1/T_p = o(p)

Parallelism

Because the Span Law dictates that T_p ≥ T_∞, the maximum possible speedup given T_1 and T_∞ is
T_1/T_∞ = parallelism = the average amount of work per step along the span.

The Fibonacci example (1/2)

For Fib(4), we have T_1 = 17 and T_∞ = 8 and thus T_1/T_∞ = 17/8 = 2.125. What about T_1(Fib(n)) and T_∞(Fib(n))?
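Putting the work and span laws together (a toy numeric check of mine, using the Fib(4) values quoted above): the speedup on p processors can never exceed min(p, T_1/T_∞).

#include <algorithm>
#include <cstdio>
#include <initializer_list>

int main() {
    double T1 = 17.0, Tinf = 8.0;  // Fib(4) values from the text
    for (int p : {1, 2, 4, 8, 16}) {
        double bound = std::min((double)p, T1 / Tinf);
        std::printf("p = %2d: speedup T1/Tp <= %.3f\n", p, bound);
    }
    return 0;
}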

The Fibonacci example (2/2)

Work: We have T_1(n) = T_1(n-1) + T_1(n-2) + Θ(1). Let us solve it. One verifies by induction that T_1(n) ≤ a F_n − b for b > 0 large enough to dominate the Θ(1) term and a > 1. We can then choose a large enough to satisfy the initial condition, whatever that is. On the other hand, we also have F_n ≤ T_1(n). Therefore T_1(n) = Θ(F_n) = Θ(ψ^n) with ψ = (1 + √5)/2.

Span: We have T_∞(n) = max(T_∞(n-1), T_∞(n-2)) + Θ(1). We easily check T_∞(n-1) ≥ T_∞(n-2). This implies T_∞(n) = T_∞(n-1) + Θ(1). Therefore T_∞(n) = Θ(n).

Consequently the parallelism is Θ(ψ^n / n).

Series composition

(Figure: subcomputations A and B executed one after the other.) Work? Span?

Work: T_1(A ∪ B) = T_1(A) + T_1(B)
Span: T_∞(A ∪ B) = T_∞(A) + T_∞(B)

Parallel composition

(Figure: subcomputations A and B executed in parallel.) Work? Span? (Answered below.)

Work: T_1(A ∪ B) = T_1(A) + T_1(B)
Span: T_∞(A ∪ B) = max(T_∞(A), T_∞(B))

Some results in the fork-join parallelism model

Algorithm                Work         Span
Merge sort               Θ(n lg n)    Θ(lg^3 n)
Matrix multiplication    Θ(n^3)       Θ(lg n)
Strassen                 Θ(n^(lg 7))  Θ(lg^2 n)
LU-decomposition         Θ(n^3)       Θ(n lg n)
Tableau construction     Θ(n^2)       Ω(n^(lg 3))
FFT                      Θ(n lg n)    Θ(lg^2 n)
Breadth-first search     Θ(E)         Θ(d lg V)

We shall prove those results in the next lectures.

cilk_for Loops

For loop parallelism in Cilk++

Transposing a square matrix A in place (the figure shows A and its transpose A^T):

cilk_for (int i=1; i<n; ++i) {
    for (int j=0; j<i; ++j) {
        double temp = A[i][j];
        A[i][j] = A[j][i];
        A[j][i] = temp;
    }
}

The iterations of a cilk_for loop execute in parallel.
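The series and parallel composition rules above amount to a tiny calculus over (work, span) pairs; here is a minimal sketch of that calculus (an illustration of mine, not code from the deck, with made-up costs for A and B).

#include <algorithm>
#include <cstdio>

struct Cost { double work, span; };

// Series: A then B. Work adds, span adds.
Cost series(Cost a, Cost b)   { return { a.work + b.work, a.span + b.span }; }
// Parallel: A alongside B. Work adds, span is the max.
Cost parallel(Cost a, Cost b) { return { a.work + b.work, std::max(a.span, b.span) }; }

int main() {
    Cost A = {10, 4}, B = {6, 3};
    Cost s = series(A, B), p = parallel(A, B);
    std::printf("series:   work=%g span=%g\n", s.work, s.span);
    std::printf("parallel: work=%g span=%g\n", p.work, p.span);
    return 0;
}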

Implementation of for loops in Cilk++

Up to details (next week!) the previous loop is compiled as follows, using a divide-and-conquer implementation:

void recur(int lo, int hi) {
    if (hi > lo) { // coarsen
        int mid = lo + (hi - lo)/2;
        cilk_spawn recur(lo, mid);
        recur(mid+1, hi);
        cilk_sync;
    } else {
        int i = lo;
        for (int j=0; j<i; ++j) {
            double temp = A[i][j];
            A[i][j] = A[j][i];
            A[j][i] = temp;
        }
    }
}

Analysis of parallel for loops

Here we do not assume that each strand runs in unit time.

Span of loop control: Θ(log(n))
Max span of an iteration: Θ(n)
Span: Θ(n)
Work: Θ(n^2)
Parallelism: Θ(n)

Parallelizing the inner loop

cilk_for (int i=1; i<n; ++i) {
    cilk_for (int j=0; j<i; ++j) {
        double temp = A[i][j];
        A[i][j] = A[j][i];
        A[j][i] = temp;
    }
}

Span of outer loop control: Θ(log(n))
Max span of an inner loop control: Θ(log(n))
Span of an iteration: Θ(1)
Span: Θ(log(n))
Work: Θ(n^2)
Parallelism: Θ(n^2/log(n))

But! More on this next week.

Plan

1 Parallelism Complexity Measures
2 cilk_for Loops
3 Scheduling Theory and Implementation
4 Measuring Parallelism in Practice
5 Announcements

Scheduling

A scheduler's job is to map a computation to particular processors. Such a mapping is called a schedule. If decisions are made at runtime, the scheduler is online; otherwise, it is offline. Cilk++'s scheduler maps strands onto processors dynamically at runtime. (Figure: processors sharing memory, I/O and a network, each with its own cache.)

Greedy scheduling (1/2)

A strand is ready if all its predecessors have executed. A scheduler is greedy if it attempts to do as much work as possible at every step.

Greedy scheduling (2/2)

In any greedy schedule, there are two types of steps (illustrated in the figure for p = 3):
complete step: there are at least p strands that are ready to run. The greedy scheduler selects any p of them and runs them.
incomplete step: there are strictly fewer than p strands that are ready to run. The greedy scheduler runs them all.

Theorem of Graham and Brent

For any greedy schedule, we have T_p ≤ T_1/p + T_∞.

Proof: #complete steps ≤ T_1/p, by definition of T_1. #incomplete steps ≤ T_∞. Indeed, let G' be the subgraph of G that remains to be executed immediately prior to an incomplete step. (i) During this incomplete step, all strands that can be run are actually run. (ii) Hence removing this incomplete step from G' reduces T_∞(G') by one.
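The Graham-Brent bound is easy to check experimentally. The sketch below (an illustration of mine, on a small made-up dag of unit-cost strands) runs a greedy schedule on p processors, counting one step per round of complete or incomplete execution, and compares the step count with T_1/p + T_∞.

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    // Made-up dag: edges[u] lists the strands that depend on strand u.
    // Longest path is 0 -> 1 -> 3 -> 6, so Tinf = 4; T1 = 7 strands.
    std::vector<std::vector<int>> edges = { {1,2}, {3,4}, {5}, {6}, {6}, {6}, {} };
    int n = (int)edges.size(), p = 2;
    std::vector<int> indeg(n, 0);
    for (auto &out : edges)
        for (int v : out) indeg[v]++;
    std::vector<int> ready;
    for (int u = 0; u < n; ++u)
        if (indeg[u] == 0) ready.push_back(u);
    int steps = 0, done = 0;
    while (done < n) {
        // Greedy step: run min(p, #ready) ready strands.
        int k = std::min(p, (int)ready.size());
        std::vector<int> running(ready.end() - k, ready.end());
        ready.resize(ready.size() - k);
        for (int u : running) {
            ++done;
            for (int v : edges[u])
                if (--indeg[v] == 0) ready.push_back(v);
        }
        ++steps;
    }
    std::printf("Tp = %d  vs  T1/p + Tinf = %.1f\n", steps, n / (double)p + 4);
    return 0;
}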

Corollary 1

A greedy scheduler is always within a factor of 2 of optimal.

Proof: From the work and span laws, the running time of any schedule, in particular an optimal one, satisfies:
T_p ≥ max(T_1/p, T_∞)   (1)
In addition, we can trivially express:
T_1/p ≤ max(T_1/p, T_∞)   (2)
T_∞ ≤ max(T_1/p, T_∞)   (3)
From the Graham-Brent theorem, we deduce:
T_p ≤ T_1/p + T_∞   (4)
    ≤ max(T_1/p, T_∞) + max(T_1/p, T_∞)   (5)
    ≤ 2 max(T_1/p, T_∞)   (6)
which concludes the proof.

Corollary 2

The greedy scheduler achieves linear speedup whenever T_∞ = O(T_1/p). Indeed, from the Graham-Brent theorem, we deduce:
T_p ≤ T_1/p + T_∞   (7)
    = T_1/p + O(T_1/p)   (8)
    = Θ(T_1/p)   (9)
The idea is to operate in the range where T_1/p dominates T_∞. As long as T_1/p dominates T_∞, all processors can be used efficiently. The quantity T_1/(p T_∞) is called the parallel slackness.

The work-stealing scheduler (1/11)

Each processor maintains a work deque of ready strands, and it manipulates the bottom of the deque like a stack. (Figure.)

The work-stealing scheduler (2/11)

(Figure: "Call!", "Spawn!")

The work-stealing scheduler (3/11) - (6/11)

(Figures: successive snapshots of the processors' deques: "Spawn!", "Call!", "Return!", and an idle processor turning thief: "Steal!")

The work-stealing scheduler (7/11) - (10/11)

(Figures: further snapshots of the deques as the steal completes and execution continues: "Steal!", "Spawn!")

The work-stealing scheduler (11/11)

(Figure: final snapshot of the deques.)

Performances of the work-stealing scheduler

Assume that:
each strand executes in unit time,
for almost all parallel steps there are at least p strands to run,
each processor is either working or stealing.
Then, the randomized work-stealing scheduler is expected to run in
T_p = T_1/p + O(T_∞).

During a steal-free parallel step (a step at which all processors have work on their deque) each of the p processors consumes 1 work unit. Thus, there are at most T_1/p steal-free parallel steps. During a parallel step with steals, each thief may reduce the running time by 1 with a probability of 1/p. Thus, the expected number of steals is O(p T_∞). Therefore, the expected running time is
T_p = (T_1 + O(p T_∞))/p = T_1/p + O(T_∞).   (10)

Overheads and burden

Obviously T_1/p + T_∞ will under-estimate T_p in practice. Many factors (simplification assumptions of the fork-join parallelism model, architecture limitations, costs of executing the parallel constructs, overheads of scheduling) will make T_p larger in practice. One may want to estimate the impact of those factors:
1 by improving the estimate of the randomized work-stealing complexity result,
2 by comparing a Cilk++ program with its C++ elision,
3 by estimating the costs of spawning and synchronizing.

Span overhead

Let T_1, T_∞, T_p be given. We want to refine the randomized work-stealing complexity result. The span overhead is the smallest constant c_∞ such that
T_p ≤ T_1/p + c_∞ T_∞.
Recall that T_1/T_∞ is the maximum possible speedup that the application can obtain. We call parallel slackness assumption the following property:
T_1/T_∞ >> c_∞ p   (11)
that is, c_∞ p is much smaller than the average parallelism. Under this assumption it follows that T_1/p >> c_∞ T_∞ holds, thus c_∞ has little effect on performance when sufficient slackness exists.
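To see the slackness assumption numerically, here is a toy computation with assumed values for T_1, T_∞ and c_∞ (all three numbers are hypothetical, chosen only to make the point): once T_1/(p T_∞) is large relative to c_∞, the c_∞ T_∞ term barely moves the bound T_p ≤ T_1/p + c_∞ T_∞.

#include <cstdio>
#include <initializer_list>

int main() {
    double T1 = 1.0e10, Tinf = 1.0e6, c_inf = 10.0;  // assumed values
    for (int p : {4, 16, 64}) {
        double slackness = T1 / (p * Tinf);
        double bound = T1 / p + c_inf * Tinf;
        std::printf("p=%2d  slackness=%7.1f  T1/p=%.3g  bound=%.3g\n",
                    p, slackness, T1 / p, bound);
    }
    return 0;
}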

Work overhead

Let T_s be the running time of the C++ elision of a Cilk++ program. We denote by c_1 the work overhead:
c_1 = T_1/T_s.
Recall the expected running time: T_p ≤ T_1/p + c_∞ T_∞. Thus with the parallel slackness assumption we get
T_p ≤ c_1 T_s/p + c_∞ T_∞ ≈ c_1 T_s/p.   (12)
We can now state the work first principle precisely: minimize c_1, even at the expense of a larger c_∞. This is a key feature since it is conceptually easier to minimize c_1 rather than to minimize c_∞.
Cilk++ estimates T_p as T_p = T_1/p + burden_span, where burden_span is a fixed number of instructions times the number of continuation edges along the critical path.

The cactus stack

A cactus stack is used to implement C's rule for sharing of function-local variables. A stack frame can only see data stored in the current and in the previous stack frames. (Figure: a spawn tree A → {B, C}, C → {D, E}; the views of the stack are A; A,B; A,C; A,C,D; A,C,E.)

Space bounds

The space S_p of a parallel execution on p processors required by Cilk++'s work-stealing scheduler satisfies:
S_p ≤ p S_1   (13)
where S_1 is the minimal serial space requirement. (Figure: p = 3.)

Plan

1 Parallelism Complexity Measures
2 cilk_for Loops
3 Scheduling Theory and Implementation
4 Measuring Parallelism in Practice
5 Announcements
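Looking back at the cactus-stack slide above, its sharing rule can be sketched with parent-linked frames (a minimal illustration of mine, with the frame names A-E taken from the figure): each frame sees exactly its own chain of ancestors, so parallel siblings such as B and C never see each other's locals.

#include <cstdio>

struct Frame {
    const char *name;
    const Frame *parent;  // link to the caller's frame
};

// A frame's view of the stack: itself plus all of its ancestors.
void print_view(const Frame *f) {
    for (; f; f = f->parent) std::printf("%s ", f->name);
    std::printf("\n");
}

int main() {
    Frame A{"A", nullptr};
    Frame B{"B", &A}, C{"C", &A};  // B and C run in parallel under A
    Frame D{"D", &C}, E{"E", &C};  // D and E run in parallel under C
    print_view(&B);  // prints: B A
    print_view(&D);  // prints: D C A
    print_view(&E);  // prints: E C A
    return 0;
}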

Measuring Parallelism in Practice

Cilkview

(Figure: a cilkview scalability plot showing the Work Law bound (linear speedup), the Span Law bound, measured speedups, and burdened-parallelism estimates for scheduling overheads.)

Cilkview computes work and span to derive upper bounds on parallel performance. Cilkview also estimates scheduling overhead to compute a burdened span for lower bounds.

The Fibonacci Cilk++ example

Code fragment:

long fib(int n) {
    if (n < 2) return n;
    long x, y;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    cilk_sync;
    return x + y;
}

Fibonacci program timing

The environment for benchmarking:
model name: Intel(R) Core(TM)2 Quad CPU 2.40GHz
L2 cache size: 4096 KB
memory size: 3 GB

     #cores = 1   #cores = 2             #cores = 4
n    timing(s)    timing(s)   speedup    timing(s)   speedup

Quicksort

Code in cilk/examples/qsort:

void sample_qsort(int *begin, int *end) {
    if (begin != end) {
        --end;
        int *middle = std::partition(begin, end,
                          std::bind2nd(std::less<int>(), *end));
        using std::swap;
        swap(*end, *middle);
        cilk_spawn sample_qsort(begin, middle);
        sample_qsort(++middle, ++end);
        cilk_sync;
    }
}
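For completeness, a small driver for sample_qsort (a sketch of mine, not part of the examples directory; it assumes a Cilk++-capable compiler and the sample_qsort definition above): fill an array with random integers, sort it, and verify the result.

#include <algorithm>
#include <cstdio>
#include <cstdlib>

int cilk_main() {
    const int n = 10 * 1000 * 1000;
    int *a = new int[n];
    for (int i = 0; i < n; ++i) a[i] = std::rand();
    sample_qsort(a, a + n);
    std::printf("sorted: %s\n", std::is_sorted(a, a + n) ? "yes" : "no");
    delete[] a;
    return 0;
}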

Quicksort timing

Timing for sorting an array of integers:

            #cores = 1   #cores = 2             #cores = 4
# of int    timing(s)    timing(s)   speedup    timing(s)   speedup

Matrix multiplication

Code in cilk/examples/matrix. Timing of multiplying a matrix by a matrix:

             iterative               recursive
threshold    st(s)   pt(s)   su      st(s)   pt(s)   su

st = sequential time; pt = parallel time with 4 cores; su = speedup

The cilkview example from the documentation

Using cilk_for to perform operations over an array in parallel:

static const int COUNT = 4;
static const int ITERATION = ;
long arr[COUNT];

long do_work(long k) {
    long x = 15;
    static const int nn = 87;
    for (long i = 1; i < nn; ++i)
        x = x / i + k % i;
    return x;
}

int cilk_main() {
    for (int j = 0; j < ITERATION; j++)
        cilk_for (int i = 0; i < COUNT; i++)
            arr[i] += do_work(j * i + i + j);
}

1) Parallelism Profile
Work:                                    6,480,801,250 instructions
Span:                                    2,116,801,250 instructions
Burdened span:                          31,920,801,250 instructions
Parallelism:                             3.06
Burdened parallelism:                    0.20
Number of spawns/syncs:                  3,000,000
Average instructions / strand:           720
Strands along span:                      4,000,001
Average instructions / strand on span:   529

2) Speedup Estimate
2 processors:
4 processors:
8 processors:
16 processors:
32 processors:

A simple fix

Inverting the two for loops:

int cilk_main() {
    cilk_for (int i = 0; i < COUNT; i++)
        for (int j = 0; j < ITERATION; j++)
            arr[i] += do_work(j * i + i + j);
}

1) Parallelism Profile
Work:                                    5,295,801,529 instructions
Span:                                    1,326,801,107 instructions
Burdened span:                           1,326,830,911 instructions
Parallelism:                             3.99
Burdened parallelism:                    3.99
Number of spawns/syncs:                  3
Average instructions / strand:           529,580,152
Strands along span:                      5
Average instructions / strand on span:   265,360,221

2) Speedup Estimate
2 processors:
4 processors:
8 processors:
16 processors:
32 processors:

Timing

            #cores = 1   #cores = 2             #cores = 4
version     timing(s)    timing(s)   speedup    timing(s)   speedup
original
improved

Plan

1 Parallelism Complexity Measures
2 cilk_for Loops
3 Scheduling Theory and Implementation
4 Measuring Parallelism in Practice
5 Announcements

Acknowledgements

Charles E. Leiserson (MIT) for providing me with the sources of his lecture notes.
Matteo Frigo (Intel) for supporting the work of my team with Cilk++.
Yuzhen Xie (UWO) for helping me with the images used in these slides.
Liyun Li (UWO) for generating the experimental data.

References

Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The Implementation of the Cilk-5 Multithreaded Language. Proceedings of the ACM SIGPLAN '98 Conference on Programming Language Design and Implementation, June 1998.

Robert D. Blumofe, Christopher F. Joerg, Bradley C. Kuszmaul, Charles E. Leiserson, Keith H. Randall, and Yuli Zhou. Cilk: An Efficient Multithreaded Runtime System. Journal of Parallel and Distributed Computing, 37(1):55-69, 1996.

Robert D. Blumofe and Charles E. Leiserson. Scheduling Multithreaded Computations by Work Stealing. Journal of the ACM, Vol. 46, No. 5, pp. 720-748, September 1999.
