Cilk. Cilk In 2008, ACM SIGPLAN awarded Best influential paper of Decade. Cilk : Biggest principle

Size: px
Start display at page:

Download "Cilk. Cilk In 2008, ACM SIGPLAN awarded Best influential paper of Decade. Cilk : Biggest principle"

Transcription

1 CS528 Slides are adopted from Charles E. Leiserson A Sahu Dept of CSE, IIT Guwahati HPC Flow Plan: Before MID Processor + Super scalar+ Vector Unit Serial C/C++ Coding Optimization Shared Memory Based Parallel Programming Pthread C++ Thread/ Java Thread Memory Model, Advanced launch, Sync Thread Pooling and higher level management Implicit threading : OpenMP (loop based), (dynamic DAG based) Accelerator Based Threading (Half Distributed) Nvidia GPU, Cuda Programming Multi Computer based Parallel Programming: MPI A Sahu HPC Flow Plan: AfterMID Scheduling Load Balancing Energy Efficient Scheduling Scheduling in Grid/HPC environment Cloud Environment Economic Model Power/Energy Model Developed by Leiserson at CSAIL, MIT Chapter 27, Multithreaded Algorithm, Introduction to Algorithm, Coreman, Leiserson and Rivest Initiated a startup: Plus Added _for Keyword, Reduction features Acquired by Intel, Intel uses Scheduler Addition of 6 keywords to standard C Easy to install in linux system With gcc and pthread In 2008, ACM SIGPLAN awarded Best influential paper of Decade The Implementation of the 5 Multithreaded Language, PLDI 998 PLDI 2008 Best paper Award Reducers and Other ++ Hyperobjects, PLDI 2008 : Biggest principle Programmer should be responsible for Exposing the parallelism, Identifying elements that can safely be executed in parallel Work of run time environment (scheduler) to Decide during execution how to actually divide the work between processors Work Stealing Scheduler Proved to be good scheduler Now also in GCC, Intel CC, Intel acquire ++

2 Fibonacci int fib (int n) { x = fib(n-); y = fib(n-2); C elision code cilk int fib (int n) { x = spawn fib(n-); y = spawn fib(n-2); sync; is a faithful extension of C. A program s serial elision is always a legal implementation of semantics. provides no new data types. Basic Keywords cilk int fib (int n) { x = spawn fib(n-); y = spawn fib(n-2); sync; Control cannot pass this point until all spawned children have returned. Identifies a function as a procedure, capable of being spawned in parallel. The named child procedure can execute in parallel with the parent caller. continue edge spawn edge Multithreaded Computation initial thread final thread return edge The dag G = (V, E) represents a parallel instruction stream. Each vertex v 2 V represents a () thread: a maximal sequence of instructions not containing parallel control (spawn, sync, return). Every edge e 2 E is either a spawn edge, a return edge, or a continue edge. Fib: ++ Version int fib(int n) { if (n < 2) return n; int x=cilk_spawn fib(n-); int y = fib(n-2); cilk_sync; return x + y; For loop in for (int i = 0; i < 8; ++i) do_work(i); for (int i = 0; i < 8; ++i) cilk_spawn do_work(i); cilk_ sync; Serial Parallel Loop_for in ++ cilk_for (int i=0;i<8;++i){ do_work(i); // No sync required; auto sync Parallel 2

3 Run Time Scheduler Distributed load balancing Receiver initiated Work stealing : Free processor steal a task ofbusy processor When ever a process spawns a new process, This processor starts executing the spawned one Parent goes to waiting/suspend mode Parent can be transferred to other processor Work stealing Work stealing algorithm is receiver initiated algorithm Technique commonly used for load balancing Thief processor ( Idle processor) Steal work from other processor Victim is selected randomly Victim processor (From a set of busy processor) Work is stolen from these processor A Sahu 4 Work stealing Optimal algorithm for load balancing If select victim randomly algorithm is Optimal Proved Basic assumption in work stealing All the memory access are take same time UMA (Uniform Memory Access): shared memory Can be feasible iff Task transfer time is same for all pair of processor Communication bandwidth is same for all pair of processor Run Time Scheduler P0 P0 P0 P0 P0 P0 P0 main 0 main 0 init_ init_ init_ init_ 2 init_ init_ 3 init_ P P P P P P P Request for task Task transfer main 0 main 0 main 0 init_ 5 main 0 init _6 main 0 Task return main0 init_ init_4 vadd_7 init_2 init_3 init_5 init_6 Vadd_8 Vadd_9 A Sahu 5 T = work T = work T = span* * Also called critical path length or computational depth. 3

4 T = work T = span* LOWER BOUNDS T P T /P T P T Speedup Definition: T /T P = speedup on P processors. If T /T P = Θ(P) P, we have linear speedup; = P, we have perfect linear speedup; > P, we have superlinear speedup, which is not possible in our model, because of the lower bound T P T /P. *Also called critical path length or computational depth. Parallelism Because we have the lower bound T P T, the maximum possible speedup given T and T is T / T = parallelism li = the average amount of work per step along the span. Computation DAG unfolds dynamically. spawn edge CILK Example: Fib(4) initial thread continue edge final thread return edge Assume for simplicity that each thread in fib() takes unit time to execute. Work: T = 7 Using many more Span: T = 8 than 2 processors Parallelism: T makes little sense. /T = 2.25 Ref:The System for Parallel Multithreaded Computing, MIT Phd Thesis Ref2:The Implementation of the 5 Multithreaded Language, 998 ACM SIGPLAN Scheduling allows the programmer to express potential parallelism in an application. The scheduler maps threads onto processors dynamically at runtime. Since on line schedulers are complicated, we ll illustrate the ideas with an off line scheduler. P P P $ $ $ Memory Network I/O 4

5 Complete step P threads ready. Run any P. Complete step P threads ready. Run any P. Incomplete step < P threads ready. Run all of them. Greedy Scheduling Theorem Theorem [Graham 68 & Brent 75]. Any greedy scheduler achieves T P T /P+ T. Proof. # complete steps T /P, since each complete step performs P work. # incomplete steps T, since each incomplete step reduces the span of the unexecuted dag by. Optimality of Greedy Corollary. Any greedy scheduler achieves within a factor of 2 of optimal. Proof. Let T P * be the execution time produced by the optimal scheduler. Since T P * max{t /P, T (lower bounds), we have T P T /P + T 2max{T /P, T 2T P *. Linear Speedup Corollary. Any greedy scheduler achieves near perfect linear speedup whenever P << T /T Proof. Since P << T /T is equivalent to T << T /P, the Theorem gives us T P T /P + T T /P. Thus, the speedup is T /T P P. Definition. The quantity (T / T )/P is called the parallel slackness. Performance s work stealing scheduler achieves T P = T /P + O(T ) expected time (provably); T P T /P + T time (empirically). Near perfect linear speedup if P << T /T. Instrumentation in allows the user to determine accurate measures of T and T. The average cost of a spawn in 5 is only 2 6 times the cost of an ordinary C function call, depending on the platform. 5

The Cilk part is a small set of linguistic extensions to C/C++ to support fork-join parallelism. (The Plus part supports vector parallelism.

The Cilk part is a small set of linguistic extensions to C/C++ to support fork-join parallelism. (The Plus part supports vector parallelism. Cilk Plus The Cilk part is a small set of linguistic extensions to C/C++ to support fork-join parallelism. (The Plus part supports vector parallelism.) Developed originally by Cilk Arts, an MIT spinoff,

More information

Multithreaded Programming in. Cilk LECTURE 1. Charles E. Leiserson

Multithreaded Programming in. Cilk LECTURE 1. Charles E. Leiserson Multithreaded Programming in Cilk LECTURE 1 Charles E. Leiserson Supercomputing Technologies Research Group Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

More information

Multicore programming in CilkPlus

Multicore programming in CilkPlus Multicore programming in CilkPlus Marc Moreno Maza University of Western Ontario, Canada CS3350 March 16, 2015 CilkPlus From Cilk to Cilk++ and Cilk Plus Cilk has been developed since 1994 at the MIT Laboratory

More information

Multithreaded Parallelism and Performance Measures

Multithreaded Parallelism and Performance Measures Multithreaded Parallelism and Performance Measures Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 3101 (Moreno Maza) Multithreaded Parallelism and Performance Measures CS 3101

More information

CSE 260 Lecture 19. Parallel Programming Languages

CSE 260 Lecture 19. Parallel Programming Languages CSE 260 Lecture 19 Parallel Programming Languages Announcements Thursday s office hours are cancelled Office hours on Weds 2p to 4pm Jing will hold OH, too, see Moodle Scott B. Baden /CSE 260/ Winter 2014

More information

Shared-memory Parallel Programming with Cilk Plus

Shared-memory Parallel Programming with Cilk Plus Shared-memory Parallel Programming with Cilk Plus John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 4 30 August 2018 Outline for Today Threaded programming

More information

Shared-memory Parallel Programming with Cilk Plus

Shared-memory Parallel Programming with Cilk Plus Shared-memory Parallel Programming with Cilk Plus John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 4 19 January 2017 Outline for Today Threaded programming

More information

Plan. 1 Parallelism Complexity Measures. 2 cilk for Loops. 3 Scheduling Theory and Implementation. 4 Measuring Parallelism in Practice

Plan. 1 Parallelism Complexity Measures. 2 cilk for Loops. 3 Scheduling Theory and Implementation. 4 Measuring Parallelism in Practice lan Multithreaded arallelism and erformance Measures Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 3101 1 2 cilk for Loops 3 4 Measuring arallelism in ractice 5 Announcements

More information

Parallelism and Performance

Parallelism and Performance 6.172 erformance Engineering of Software Systems LECTURE 13 arallelism and erformance Charles E. Leiserson October 26, 2010 2010 Charles E. Leiserson 1 Amdahl s Law If 50% of your application is parallel

More information

CS 240A: Shared Memory & Multicore Programming with Cilk++

CS 240A: Shared Memory & Multicore Programming with Cilk++ CS 240A: Shared Memory & Multicore rogramming with Cilk++ Multicore and NUMA architectures Multithreaded rogramming Cilk++ as a concurrency platform Work and Span Thanks to Charles E. Leiserson for some

More information

Plan. 1 Parallelism Complexity Measures. 2 cilk for Loops. 3 Scheduling Theory and Implementation. 4 Measuring Parallelism in Practice

Plan. 1 Parallelism Complexity Measures. 2 cilk for Loops. 3 Scheduling Theory and Implementation. 4 Measuring Parallelism in Practice lan Multithreaded arallelism and erformance Measures Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 4435 - CS 9624 1 2 cilk for Loops 3 4 Measuring arallelism in ractice 5

More information

Multithreaded Programming in Cilk. Matteo Frigo

Multithreaded Programming in Cilk. Matteo Frigo Multithreaded Programming in Cilk Matteo Frigo Multicore challanges Development time: Will you get your product out in time? Where will you find enough parallel-programming talent? Will you be forced to

More information

Plan. 1 Parallelism Complexity Measures. 2 cilk for Loops. 3 Scheduling Theory and Implementation. 4 Measuring Parallelism in Practice

Plan. 1 Parallelism Complexity Measures. 2 cilk for Loops. 3 Scheduling Theory and Implementation. 4 Measuring Parallelism in Practice lan Multithreaded arallelism and erformance Measures Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 02 - CS 9535 arallelism Complexity Measures 2 cilk for Loops 3 Measuring

More information

The Implementation of Cilk-5 Multithreaded Language

The Implementation of Cilk-5 Multithreaded Language The Implementation of Cilk-5 Multithreaded Language By Matteo Frigo, Charles E. Leiserson, and Keith H Randall Presented by Martin Skou 1/14 The authors Matteo Frigo Chief Scientist and founder of Cilk

More information

CILK/CILK++ AND REDUCERS YUNMING ZHANG RICE UNIVERSITY

CILK/CILK++ AND REDUCERS YUNMING ZHANG RICE UNIVERSITY CILK/CILK++ AND REDUCERS YUNMING ZHANG RICE UNIVERSITY 1 OUTLINE CILK and CILK++ Language Features and Usages Work stealing runtime CILK++ Reducers Conclusions 2 IDEALIZED SHARED MEMORY ARCHITECTURE Hardware

More information

CSE 613: Parallel Programming

CSE 613: Parallel Programming CSE 613: Parallel Programming Lecture 3 ( The Cilk++ Concurrency Platform ) ( inspiration for many slides comes from talks given by Charles Leiserson and Matteo Frigo ) Rezaul A. Chowdhury Department of

More information

Cilk Plus: Multicore extensions for C and C++

Cilk Plus: Multicore extensions for C and C++ Cilk Plus: Multicore extensions for C and C++ Matteo Frigo 1 June 6, 2011 1 Some slides courtesy of Prof. Charles E. Leiserson of MIT. Intel R Cilk TM Plus What is it? C/C++ language extensions supporting

More information

Plan. Introduction to Multicore Programming. Plan. University of Western Ontario, London, Ontario (Canada) Multi-core processor CPU Coherence

Plan. Introduction to Multicore Programming. Plan. University of Western Ontario, London, Ontario (Canada) Multi-core processor CPU Coherence Plan Introduction to Multicore Programming Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 3101 1 Multi-core Architecture 2 Race Conditions and Cilkscreen (Moreno Maza) Introduction

More information

An Overview of Parallel Computing

An Overview of Parallel Computing An Overview of Parallel Computing Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS2101 Plan 1 Hardware 2 Types of Parallelism 3 Concurrency Platforms: Three Examples Cilk CUDA

More information

Multithreaded Algorithms Part 1. Dept. of Computer Science & Eng University of Moratuwa

Multithreaded Algorithms Part 1. Dept. of Computer Science & Eng University of Moratuwa CS4460 Advanced d Algorithms Batch 08, L4S2 Lecture 11 Multithreaded Algorithms Part 1 N. H. N. D. de Silva Dept. of Computer Science & Eng University of Moratuwa Announcements Last topic discussed is

More information

Reducers and other Cilk++ hyperobjects

Reducers and other Cilk++ hyperobjects Reducers and other Cilk++ hyperobjects Matteo Frigo (Intel) ablo Halpern (Intel) Charles E. Leiserson (MIT) Stephen Lewin-Berlin (Intel) August 11, 2009 Collision detection Assembly: Represented as a tree

More information

A Quick Introduction To The Intel Cilk Plus Runtime

A Quick Introduction To The Intel Cilk Plus Runtime A Quick Introduction To The Intel Cilk Plus Runtime 6.S898: Advanced Performance Engineering for Multicore Applications March 8, 2017 Adapted from slides by Charles E. Leiserson, Saman P. Amarasinghe,

More information

Introduction to Multithreaded Algorithms

Introduction to Multithreaded Algorithms Introduction to Multithreaded Algorithms CCOM5050: Design and Analysis of Algorithms Chapter VII Selected Topics T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein. Introduction to algorithms, 3 rd

More information

A Minicourse on Dynamic Multithreaded Algorithms

A Minicourse on Dynamic Multithreaded Algorithms Introduction to Algorithms December 5, 005 Massachusetts Institute of Technology 6.046J/18.410J Professors Erik D. Demaine and Charles E. Leiserson Handout 9 A Minicourse on Dynamic Multithreaded Algorithms

More information

CS 140 : Feb 19, 2015 Cilk Scheduling & Applications

CS 140 : Feb 19, 2015 Cilk Scheduling & Applications CS 140 : Feb 19, 2015 Cilk Scheduling & Applications Analyzing quicksort Optional: Master method for solving divide-and-conquer recurrences Tips on parallelism and overheads Greedy scheduling and parallel

More information

Cost Model: Work, Span and Parallelism

Cost Model: Work, Span and Parallelism CSE 539 01/15/2015 Cost Model: Work, Span and Parallelism Lecture 2 Scribe: Angelina Lee Outline of this lecture: 1. Overview of Cilk 2. The dag computation model 3. Performance measures 4. A simple greedy

More information

Compsci 590.3: Introduction to Parallel Computing

Compsci 590.3: Introduction to Parallel Computing Compsci 590.3: Introduction to Parallel Computing Alvin R. Lebeck Slides based on this from the University of Oregon Admin Logistics Homework #3 Use script Project Proposals Document: see web site» Due

More information

CS 140 : Non-numerical Examples with Cilk++

CS 140 : Non-numerical Examples with Cilk++ CS 140 : Non-numerical Examples with Cilk++ Divide and conquer paradigm for Cilk++ Quicksort Mergesort Thanks to Charles E. Leiserson for some of these slides 1 Work and Span (Recap) T P = execution time

More information

Concepts in. Programming. The Multicore- Software Challenge. MIT Professional Education 6.02s Lecture 1 June 8 9, 2009

Concepts in. Programming. The Multicore- Software Challenge. MIT Professional Education 6.02s Lecture 1 June 8 9, 2009 Concepts in Multicore Programming The Multicore- Software Challenge MIT Professional Education 6.02s Lecture 1 June 8 9, 2009 2009 Charles E. Leiserson 1 Cilk, Cilk++, and Cilkscreen, are trademarks of

More information

CS CS9535: An Overview of Parallel Computing

CS CS9535: An Overview of Parallel Computing CS4403 - CS9535: An Overview of Parallel Computing Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) January 10, 2017 Plan 1 Hardware 2 Types of Parallelism 3 Concurrency Platforms:

More information

Table of Contents. Cilk

Table of Contents. Cilk Table of Contents 212 Introduction to Parallelism Introduction to Programming Models Shared Memory Programming Message Passing Programming Shared Memory Models Cilk TBB HPF Chapel Fortress Stapl PGAS Languages

More information

Cilk Plus GETTING STARTED

Cilk Plus GETTING STARTED Cilk Plus GETTING STARTED Overview Fundamentals of Cilk Plus Hyperobjects Compiler Support Case Study 3/17/2015 CHRIS SZALWINSKI 2 Fundamentals of Cilk Plus Terminology Execution Model Language Extensions

More information

On the cost of managing data flow dependencies

On the cost of managing data flow dependencies On the cost of managing data flow dependencies - program scheduled by work stealing - Thierry Gautier, INRIA, EPI MOAIS, Grenoble France Workshop INRIA/UIUC/NCSA Outline Context - introduction of work

More information

Multithreaded Parallelism on Multicore Architectures

Multithreaded Parallelism on Multicore Architectures Multithreaded Parallelism on Multicore Architectures Marc Moreno Maza University of Western Ontario, Canada CS2101 March 2012 Plan 1 Multicore programming Multicore architectures 2 Cilk / Cilk++ / Cilk

More information

Moore s Law. Multicore Programming. Vendor Solution. Power Density. Parallelism and Performance MIT Lecture 11 1.

Moore s Law. Multicore Programming. Vendor Solution. Power Density. Parallelism and Performance MIT Lecture 11 1. Moore s Law 1000000 Intel CPU Introductions 6.172 Performance Engineering of Software Systems Lecture 11 Multicore Programming Charles E. Leiserson 100000 10000 1000 100 10 Clock Speed (MHz) Transistors

More information

Today: Amortized Analysis (examples) Multithreaded Algs.

Today: Amortized Analysis (examples) Multithreaded Algs. Today: Amortized Analysis (examples) Multithreaded Algs. COSC 581, Algorithms March 11, 2014 Many of these slides are adapted from several online sources Reading Assignments Today s class: Chapter 17 (Amortized

More information

A Primer on Scheduling Fork-Join Parallelism with Work Stealing

A Primer on Scheduling Fork-Join Parallelism with Work Stealing Doc. No.: N3872 Date: 2014-01-15 Reply to: Arch Robison A Primer on Scheduling Fork-Join Parallelism with Work Stealing This paper is a primer, not a proposal, on some issues related to implementing fork-join

More information

Cilk, Matrix Multiplication, and Sorting

Cilk, Matrix Multiplication, and Sorting 6.895 Theory of Parallel Systems Lecture 2 Lecturer: Charles Leiserson Cilk, Matrix Multiplication, and Sorting Lecture Summary 1. Parallel Processing With Cilk This section provides a brief introduction

More information

MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores

MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores Presented by Xiaohui Chen Joint work with Marc Moreno Maza, Sushek Shekar & Priya Unnikrishnan University of Western Ontario,

More information

Beyond Threads: Scalable, Composable, Parallelism with Intel Cilk Plus and TBB

Beyond Threads: Scalable, Composable, Parallelism with Intel Cilk Plus and TBB Beyond Threads: Scalable, Composable, Parallelism with Intel Cilk Plus and TBB Jim Cownie Intel SSG/DPD/TCAR 1 Optimization Notice Optimization Notice Intel s compilers may or

More information

Parallel Algorithms. 6/12/2012 6:42 PM gdeepak.com 1

Parallel Algorithms. 6/12/2012 6:42 PM gdeepak.com 1 Parallel Algorithms 1 Deliverables Parallel Programming Paradigm Matrix Multiplication Merge Sort 2 Introduction It involves using the power of multiple processors in parallel to increase the performance

More information

Plan. Introduction to Multicore Programming. Plan. University of Western Ontario, London, Ontario (Canada) Marc Moreno Maza CS CS 9624

Plan. Introduction to Multicore Programming. Plan. University of Western Ontario, London, Ontario (Canada) Marc Moreno Maza CS CS 9624 Plan Introduction to Multicore Programming Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 4435 - CS 9624 1 Multi-core Architecture Multi-core processor CPU Cache CPU Coherence

More information

COMP Parallel Computing. SMM (4) Nested Parallelism

COMP Parallel Computing. SMM (4) Nested Parallelism COMP 633 - Parallel Computing Lecture 9 September 19, 2017 Nested Parallelism Reading: The Implementation of the Cilk-5 Multithreaded Language sections 1 3 1 Topics Nested parallelism in OpenMP and other

More information

Provably Efficient Non-Preemptive Task Scheduling with Cilk

Provably Efficient Non-Preemptive Task Scheduling with Cilk Provably Efficient Non-Preemptive Task Scheduling with Cilk V. -Y. Vee and W.-J. Hsu School of Applied Science, Nanyang Technological University Nanyang Avenue, Singapore 639798. Abstract We consider the

More information

Understanding Task Scheduling Algorithms. Kenjiro Taura

Understanding Task Scheduling Algorithms. Kenjiro Taura Understanding Task Scheduling Algorithms Kenjiro Taura 1 / 48 Contents 1 Introduction 2 Work stealing scheduler 3 Analyzing execution time of work stealing 4 Analyzing cache misses of work stealing 5 Summary

More information

Efficiently Detecting Races in Cilk Programs That Use Reducer Hyperobjects

Efficiently Detecting Races in Cilk Programs That Use Reducer Hyperobjects Efficiently Detecting Races in Cilk Programs That Use Reducer Hyperobjects ABSTRACT I-Ting Angelina Lee Washington University in St. Louis One Brookings Drive St. Louis, MO 63130 A multithreaded Cilk program

More information

Brushing the Locks out of the Fur: A Lock-Free Work Stealing Library Based on Wool

Brushing the Locks out of the Fur: A Lock-Free Work Stealing Library Based on Wool Brushing the Locks out of the Fur: A Lock-Free Work Stealing Library Based on Wool Håkan Sundell School of Business and Informatics University of Borås, 50 90 Borås E-mail: Hakan.Sundell@hb.se Philippas

More information

Performance Optimization Part 1: Work Distribution and Scheduling

Performance Optimization Part 1: Work Distribution and Scheduling Lecture 5: Performance Optimization Part 1: Work Distribution and Scheduling Parallel Computer Architecture and Programming CMU 15-418/15-618, Fall 2017 Programming for high performance Optimizing the

More information

Cilk programs as a DAG

Cilk programs as a DAG Cilk programs as a DAG The pattern of spawn and sync commands defines a graph The graph contains dependencies between different functions spawn command creates a new task with an out-bound link sync command

More information

Pablo Halpern Parallel Programming Languages Architect Intel Corporation

Pablo Halpern Parallel Programming Languages Architect Intel Corporation Pablo Halpern Parallel Programming Languages Architect Intel Corporation CppCon, 8 September 2014 This work by Pablo Halpern is licensed under a Creative Commons Attribution

More information

27. Parallel Programming I

27. Parallel Programming I 771 27. Parallel Programming I Moore s Law and the Free Lunch, Hardware Architectures, Parallel Execution, Flynn s Taxonomy, Scalability: Amdahl and Gustafson, Data-parallelism, Task-parallelism, Scheduling

More information

27. Parallel Programming I

27. Parallel Programming I 760 27. Parallel Programming I Moore s Law and the Free Lunch, Hardware Architectures, Parallel Execution, Flynn s Taxonomy, Scalability: Amdahl and Gustafson, Data-parallelism, Task-parallelism, Scheduling

More information

Performance Optimization Part 1: Work Distribution and Scheduling

Performance Optimization Part 1: Work Distribution and Scheduling ( how to be l33t ) Lecture 6: Performance Optimization Part 1: Work Distribution and Scheduling Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2017 Tunes Kaleo Way Down We Go

More information

27. Parallel Programming I

27. Parallel Programming I The Free Lunch 27. Parallel Programming I Moore s Law and the Free Lunch, Hardware Architectures, Parallel Execution, Flynn s Taxonomy, Scalability: Amdahl and Gustafson, Data-parallelism, Task-parallelism,

More information

Scheduling Parallel Programs by Work Stealing with Private Deques

Scheduling Parallel Programs by Work Stealing with Private Deques Scheduling Parallel Programs by Work Stealing with Private Deques Umut Acar Carnegie Mellon University Arthur Charguéraud INRIA Mike Rainey Max Planck Institute for Software Systems PPoPP 25.2.2013 1 Scheduling

More information

Introduction to Algorithms

Introduction to Algorithms Thomas H. Cormen Charles E. Leiserson Ronald L. Rivest Clifford Stein Introduction to Algorithms Third Edition The MIT Press Cambridge, Massachusetts London, England 27 Multithreaded Algorithms The vast

More information

15-210: Parallelism in the Real World

15-210: Parallelism in the Real World : Parallelism in the Real World Types of paralellism Parallel Thinking Nested Parallelism Examples (Cilk, OpenMP, Java Fork/Join) Concurrency Page1 Cray-1 (1976): the world s most expensive love seat 2

More information

Multithreaded Parallelism on Multicore Architectures

Multithreaded Parallelism on Multicore Architectures Multithreaded Parallelism on Multicore Architectures Marc Moreno Maza University of Western Ontario, Canada CS2101 November 2012 Plan 1 Multicore programming Multicore architectures 2 Cilk / Cilk++ / Cilk

More information

Runtime Support for Scalable Task-parallel Programs

Runtime Support for Scalable Task-parallel Programs Runtime Support for Scalable Task-parallel Programs Pacific Northwest National Lab xsig workshop May 2018 http://hpc.pnl.gov/people/sriram/ Single Program Multiple Data int main () {... } 2 Task Parallelism

More information

Latency-Hiding Work Stealing

Latency-Hiding Work Stealing Latency-Hiding Work Stealing Stefan K. Muller April 2017 CMU-CS-16-112R Umut A. Acar School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 A version of this work appears in the proceedings

More information

Efficient Work Stealing for Fine-Grained Parallelism

Efficient Work Stealing for Fine-Grained Parallelism Efficient Work Stealing for Fine-Grained Parallelism Karl-Filip Faxén Swedish Institute of Computer Science November 26, 2009 Task parallel fib in Wool TASK 1( int, fib, int, n ) { if( n

More information

Principles of Tapir. Tao B. Schardl William S. Moses Charles E. Leiserson. LLVM Performance Workshop February 4, 2017

Principles of Tapir. Tao B. Schardl William S. Moses Charles E. Leiserson. LLVM Performance Workshop February 4, 2017 Principles of Tapir Tao B. Schardl William S. Moses Charles E. Leiserson LLVM Performance Workshop February 4, 2017 1 Outline What is Tapir? Tapir s debugging methodology Tapir s optimization strategy

More information

Note: Please use the actual date you accessed this material in your citation.

Note: Please use the actual date you accessed this material in your citation. MIT OpenCourseWare http://ocw.mit.edu 6.046J Introduction to Algorithms, Fall 2005 Please use the following citation format: Erik Demaine and Charles Leiserson, 6.046J Introduction to Algorithms, Fall

More information

DAGViz: A DAG Visualization Tool for Analyzing Task-Parallel Program Traces

DAGViz: A DAG Visualization Tool for Analyzing Task-Parallel Program Traces DAGViz: A DAG Visualization Tool for Analyzing Task-Parallel Program Traces An Huynh University of Tokyo, Japan Douglas Thain University of Notre Dame, USA Miquel Pericas Chalmers University of Technology,

More information

CS 140 : Numerical Examples on Shared Memory with Cilk++

CS 140 : Numerical Examples on Shared Memory with Cilk++ CS 140 : Numerical Examples on Shared Memory with Cilk++ Matrix-matrix multiplication Matrix-vector multiplication Hyperobjects Thanks to Charles E. Leiserson for some of these slides 1 Work and Span (Recap)

More information

A Simple Path to Parallelism with Intel Cilk Plus

A Simple Path to Parallelism with Intel Cilk Plus Introduction This introductory tutorial describes how to use Intel Cilk Plus to simplify making taking advantage of vectorization and threading parallelism in your code. It provides a brief description

More information

MetaFork: A Metalanguage for Concurrency Platforms Targeting Multicores

MetaFork: A Metalanguage for Concurrency Platforms Targeting Multicores MetaFork: A Metalanguage for Concurrency Platforms Targeting Multicores Xiaohui Chen, Marc Moreno Maza & Sushek Shekar University of Western Ontario September 1, 2013 Document number: N1746 Date: 2013-09-01

More information

An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware

An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware An Architectural Framework for Accelerating Dynamic Parallel Algorithms on Reconfigurable Hardware Tao Chen, Shreesha Srinath Christopher Batten, G. Edward Suh Computer Systems Laboratory School of Electrical

More information

Project 3. Building a parallelism profiler for Cilk computations CSE 539. Assigned: 03/17/2015 Due Date: 03/27/2015

Project 3. Building a parallelism profiler for Cilk computations CSE 539. Assigned: 03/17/2015 Due Date: 03/27/2015 CSE 539 Project 3 Assigned: 03/17/2015 Due Date: 03/27/2015 Building a parallelism profiler for Cilk computations In this project, you will implement a simple serial tool for Cilk programs a parallelism

More information

CS 267 Applications of Parallel Computers. Lecture 23: Load Balancing and Scheduling. James Demmel

CS 267 Applications of Parallel Computers. Lecture 23: Load Balancing and Scheduling. James Demmel CS 267 Applications of Parallel Computers Lecture 23: Load Balancing and Scheduling James Demmel http://www.cs.berkeley.edu/~demmel/cs267_spr99 CS267 L23 Load Balancing and Scheduling.1 Demmel Sp 1999

More information

Trends and Challenges in Multicore Programming

Trends and Challenges in Multicore Programming Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores

More information

Dynamic Processor Allocation for Adaptively Parallel Work-Stealing Jobs. Siddhartha Sen

Dynamic Processor Allocation for Adaptively Parallel Work-Stealing Jobs. Siddhartha Sen Dynamic Processor Allocation for Adaptively Parallel Work-Stealing Jobs by Siddhartha Sen Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements

More information

Thread Scheduling for Multiprogrammed Multiprocessors

Thread Scheduling for Multiprogrammed Multiprocessors Thread Scheduling for Multiprogrammed Multiprocessors (Authors: N. Arora, R. Blumofe, C. G. Plaxton) Geoff Gerfin Dept of Computer & Information Sciences University of Delaware Outline Programming Environment

More information

Thread Scheduling for Multiprogrammed Multiprocessors

Thread Scheduling for Multiprogrammed Multiprocessors Thread Scheduling for Multiprogrammed Multiprocessors Nimar S Arora Robert D Blumofe C Greg Plaxton Department of Computer Science, University of Texas at Austin nimar,rdb,plaxton @csutexasedu Abstract

More information

Under the Hood, Part 1: Implementing Message Passing

Under the Hood, Part 1: Implementing Message Passing Lecture 27: Under the Hood, Part 1: Implementing Message Passing Parallel Computer Architecture and Programming CMU 15-418/15-618, Today s Theme Message passing model (abstraction) Threads operate within

More information

The Data Locality of Work Stealing

The Data Locality of Work Stealing The Data Locality of Work Stealing Umut A. Acar School of Computer Science Carnegie Mellon University umut@cs.cmu.edu Guy E. Blelloch School of Computer Science Carnegie Mellon University guyb@cs.cmu.edu

More information

Processor-Oblivious Record and Replay

Processor-Oblivious Record and Replay * Processor-Oblivious Record and Replay * P PoP P * Artifact Consistent to Reuse * Complete Easy Evaluated * Well Documented * * AEC * Robert Utterback Kunal Agrawal I-Ting Angelina Lee Washington University

More information

A Framework for Space and Time Efficient Scheduling of Parallelism

A Framework for Space and Time Efficient Scheduling of Parallelism A Framework for Space and Time Efficient Scheduling of Parallelism Girija J. Narlikar Guy E. Blelloch December 996 CMU-CS-96-97 School of Computer Science Carnegie Mellon University Pittsburgh, PA 523

More information

Towards a codelet-based runtime for exascale computing. Chris Lauderdale ET International, Inc.

Towards a codelet-based runtime for exascale computing. Chris Lauderdale ET International, Inc. Towards a codelet-based runtime for exascale computing Chris Lauderdale ET International, Inc. What will be covered Slide 2 of 24 Problems & motivation Codelet runtime overview Codelets & complexes Dealing

More information

IDENTIFYING PERFORMANCE BOTTLENECKS IN WORK-STEALING COMPUTATIONS

IDENTIFYING PERFORMANCE BOTTLENECKS IN WORK-STEALING COMPUTATIONS C OV ER F E AT U RE IDENTIFYING PERFORMANCE BOTTLENECKS IN WORK-STEALING COMPUTATIONS Nathan R. Tallent and John M. Mellor-Crummey, Rice University Work stealing is an effective load-balancing strategy

More information

Parallel Real-Time Systems for Latency-Cri6cal Applica6ons

Parallel Real-Time Systems for Latency-Cri6cal Applica6ons Parallel Real-Time Systems for Latency-Cri6cal Applica6ons Chenyang Lu CSE 520S Cyber-Physical Systems (CPS) Cyber-Physical Boundary Real-Time Hybrid SimulaEon (RTHS) Since the application interacts with

More information

Dynamic Monitoring Tool based on Vector Clocks for Multithread Programs

Dynamic Monitoring Tool based on Vector Clocks for Multithread Programs , pp.45-49 http://dx.doi.org/10.14257/astl.2014.76.12 Dynamic Monitoring Tool based on Vector Clocks for Multithread Programs Hyun-Ji Kim 1, Byoung-Kwi Lee 2, Ok-Kyoon Ha 3, and Yong-Kee Jun 1 1 Department

More information

Effective Performance Measurement and Analysis of Multithreaded Applications

Effective Performance Measurement and Analysis of Multithreaded Applications Effective Performance Measurement and Analysis of Multithreaded Applications Nathan Tallent John Mellor-Crummey Rice University CSCaDS hpctoolkit.org Wanted: Multicore Programming Models Simple well-defined

More information

Overview of research activities Toward portability of performance

Overview of research activities Toward portability of performance Overview of research activities Toward portability of performance Do dynamically what can t be done statically Understand evolution of architectures Enable new programming models Put intelligence into

More information

6.886: Algorithm Engineering

6.886: Algorithm Engineering 6.886: Algorithm Engineering LECTURE 2 PARALLEL ALGORITHMS Julian Shun February 7, 2019 Lecture material taken from Parallel Algorithms by Guy E. Blelloch and Bruce M. Maggs and 6.172 by Charles Leiserson

More information

Heterogeneous Multithreaded Computing

Heterogeneous Multithreaded Computing Heterogeneous Multithreaded Computing by Howard J. Lu Submitted to the Department of Electrical Engineering and Computer Science in artial Fulfillment of the Requirements for the Degrees of Bachelor of

More information

Lesson 1 1 Introduction

Lesson 1 1 Introduction Lesson 1 1 Introduction The Multithreaded DAG Model DAG = Directed Acyclic Graph : a collection of vertices and directed edges (lines with arrows). Each edge connects two vertices. The final result of

More information

Theory of Computing Systems 2002 Springer-Verlag New York Inc.

Theory of Computing Systems 2002 Springer-Verlag New York Inc. Theory Comput. Systems 35, 321 347 (2002) DOI: 10.1007/s00224-002-1057-3 Theory of Computing Systems 2002 Springer-Verlag New York Inc. The Data Locality of Work Stealing Umut A. Acar, 1 Guy E. Blelloch,

More information

Efficient Detection of Determinacy Race in Transactional Cilk Programs

Efficient Detection of Determinacy Race in Transactional Cilk Programs Efficient Detection of Determinacy Race in Transactional Cilk Programs Xie Yong Singapore-MIT Alliance Outline Definition determinacy race in transactional Cilk Algorithm T. E. R. D. Implementation Cilk

More information

What is a minimal spanning tree (MST) and how to find one

What is a minimal spanning tree (MST) and how to find one What is a minimal spanning tree (MST) and how to find one A tree contains a root, the top node. Each node has: One parent Any number of children A spanning tree of a graph is a subgraph that contains all

More information

CMPSCI 250: Introduction to Computation. Lecture #14: Induction and Recursion (Still More Induction) David Mix Barrington 14 March 2013

CMPSCI 250: Introduction to Computation. Lecture #14: Induction and Recursion (Still More Induction) David Mix Barrington 14 March 2013 CMPSCI 250: Introduction to Computation Lecture #14: Induction and Recursion (Still More Induction) David Mix Barrington 14 March 2013 Induction and Recursion Three Rules for Recursive Algorithms Proving

More information

Parallel Computing. Hwansoo Han (SKKU)

Parallel Computing. Hwansoo Han (SKKU) Parallel Computing Hwansoo Han (SKKU) Unicore Limitations Performance scaling stopped due to Power consumption Wire delay DRAM latency Limitation in ILP 10000 SPEC CINT2000 2 cores/chip Xeon 3.0GHz Core2duo

More information

OpenMP 3.0 Tasking Implementation in OpenUH

OpenMP 3.0 Tasking Implementation in OpenUH Open64 Workshop @ CGO 09 OpenMP 3.0 Tasking Implementation in OpenUH Cody Addison Texas Instruments Lei Huang University of Houston James (Jim) LaGrone University of Houston Barbara Chapman University

More information

Scheduling Multithreaded Computations by Work Stealing. Robert D. Blumofe. Charles E. Leiserson. 545 Technology Square. Cambridge, MA 02139

Scheduling Multithreaded Computations by Work Stealing. Robert D. Blumofe. Charles E. Leiserson. 545 Technology Square. Cambridge, MA 02139 Scheduling Multithreaded Computations by Work Stealing Robert D. Blumofe Charles E. Leiserson MIT Laboratory for Computer Science 545 Technology Square Cambridge, MA 02139 Abstract This paper studies the

More information

MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores

MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores Xiaohui Chen, Marc Moreno Maza & Sushek Shekar University of Western Ontario, Canada IBM Toronto Lab February 11, 2015 Plan

More information

Molatomium: Parallel Programming Model in Practice

Molatomium: Parallel Programming Model in Practice Molatomium: Parallel Programming Model in Practice Motohiro Takayama, Ryuji Sakai, Nobuhiro Kato, Tomofumi Shimada Toshiba Corporation Abstract Consumer electronics products are adopting multi-core processors.

More information

Adaptively Parallel Processor Allocation for Cilk Jobs

Adaptively Parallel Processor Allocation for Cilk Jobs 6.895 Theory of Parallel Systems Kunal Agrawal, Siddhartha Sen Final Report Adaptively Parallel Processor Allocation for Cilk Jobs Abstract An adaptively parallel job is one in which the number of processors

More information

Shared-memory Parallel Programming with Cilk Plus (Parts 2-3)

Shared-memory Parallel Programming with Cilk Plus (Parts 2-3) Shared-memory Parallel Programming with Cilk Plus (Parts 2-3) John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 5-6 24,26 January 2017 Last Thursday

More information

GPUs and Emerging Architectures

GPUs and Emerging Architectures GPUs and Emerging Architectures Mike Giles mike.giles@maths.ox.ac.uk Mathematical Institute, Oxford University e-infrastructure South Consortium Oxford e-research Centre Emerging Architectures p. 1 CPUs

More information

The JCilk Multithreaded Language. I-Ting Angelina Lee

The JCilk Multithreaded Language. I-Ting Angelina Lee The JCilk Multithreaded Language by I-Ting Angelina Lee Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of

More information