Cilk Plus

The Cilk part is a small set of linguistic extensions to C/C++ to support fork-join parallelism. (The Plus part supports vector parallelism.) Developed originally by Cilk Arts, an MIT spin-off, which was acquired by Intel in July 2009. Based on the award-winning Cilk multithreaded language developed at MIT. Cilk Plus features a provably efficient work-stealing scheduler, provides a hyperobject library for parallelizing code with non-local variables, and includes the Cilkscreen race detector and the Cilkview scalability analyzer.

Nested Parallelism in Cilk

#include <stdint.h>
#include <cilk/cilk.h>

uint64_t fib(uint64_t n) {
  if (n < 2) {
    return n;
  } else {
    uint64_t x, y;
    x = cilk_spawn fib(n-1);  // the named child function may execute in parallel with the parent caller
    y = fib(n-2);
    cilk_sync;                // control cannot pass this point until all spawned children have returned
    return (x + y);
  }
}

Cilk keywords grant permission for parallel execution. They do not command parallel execution.
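For concreteness, here is how the function might be driven from a complete program. This is a minimal sketch, not part of the original slide: the main function, the default input of 30, and the build line are assumptions.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <cilk/cilk.h>

uint64_t fib(uint64_t n);  /* as defined above */

int main(int argc, char *argv[]) {
  /* read n from the command line, defaulting to 30 (arbitrary choice) */
  uint64_t n = (argc > 1) ? strtoull(argv[1], NULL, 10) : 30;
  printf("fib(%llu) = %llu\n", (unsigned long long)n, (unsigned long long)fib(n));
  return 0;
}

Built, for example, with a Cilk Plus-enabled compiler: gcc -O2 -fcilkplus fib.c -o fib, or the Intel C/C++ compiler.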

Loop Parallelism in Cilk

Example: in-place matrix transpose. The iterations of a cilk_for loop execute in parallel.

[Figure: an n-by-n matrix A with entries a_11 ... a_nn, and its transpose A^T.]

// indices run from 0, not 1
cilk_for (int i = 1; i < n; ++i) {
  for (int j = 0; j < i; ++j) {
    double temp = A[i][j];
    A[i][j] = A[j][i];
    A[j][i] = temp;
  }
}

Serial Semantics

Cilk source:

uint64_t fib(uint64_t n) {
  if (n < 2) {
    return n;
  } else {
    uint64_t x, y;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    cilk_sync;
    return (x + y);
  }
}

Serialization:

uint64_t fib(uint64_t n) {
  if (n < 2) {
    return n;
  } else {
    uint64_t x, y;
    x = fib(n-1);
    y = fib(n-2);
    return (x + y);
  }
}

The serialization of a Cilk program is always a legal interpretation of the program's semantics. Remember, Cilk keywords grant permission for parallel execution. They do not command parallel execution. To obtain the serialization:

#define cilk_spawn
#define cilk_sync
#define cilk_for for
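In practice the three #defines can be wrapped in a guarded header so that a single source file builds both ways. A minimal sketch, assuming a hypothetical SERIALIZE build flag and header name:

/* cilk_stub.h -- hypothetical header; compile with -DSERIALIZE for the serialization */
#ifdef SERIALIZE
  #define cilk_spawn        /* spawn disappears: plain function call */
  #define cilk_sync         /* sync disappears: no-op statement */
  #define cilk_for for      /* parallel loop becomes a serial for */
#else
  #include <cilk/cilk.h>    /* real keywords from the Cilk Plus runtime */
#endif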

Scheduling

The Cilk concurrency platform allows the programmer to express logical parallelism in an application. The Cilk scheduler maps the executing program onto the processor cores dynamically at runtime. Cilk's work-stealing scheduler is provably efficient.

[Figure: a shared-memory multiprocessor: processors P, each with a cache ($), connected through a network to memory and I/O.]

The Cilk Platform

[Figure: the Cilk platform tool flow. The Cilk++ source (fib with cilk_spawn/cilk_sync) and its serialization are fed to the compiler and linker to produce a binary. The serialization is exercised by conventional regression tests, yielding reliable single-threaded code; the binary, running on the runtime system, is exercised by parallel regression tests, yielding reliable multi-threaded code. The hyperobject library, the Cilkscreen race detector, and the Cilkview scalability analyzer round out the platform, supporting parallel performance.]

Execution Model

int fib(int n) {
  if (n < 2) return n;
  else {
    int x, y;
    x = cilk_spawn fib(n-1);
    y = fib(n-2);
    cilk_sync;
    return (x + y);
  }
}

Example: fib(4)

Execution Model (continued)

[Figure: the computation dag for fib(4) unfolding; vertices are labeled by the argument of each recursive call (4, 3, 2, 1, 0).]

The computation dag unfolds dynamically. The execution is processor oblivious.

Computation Dag

[Figure: a computation dag, with the initial strand, the final strand, and the spawn, call, return, and continue edges labeled.]

A parallel instruction stream is a dag G = (V, E). Each vertex v ∈ V is a strand: a sequence of instructions not containing a call, spawn, sync, or return (or thrown exception). Each edge e ∈ E is a spawn, call, return, or continue edge. Loop parallelism (cilk_for) is converted to spawns and syncs using recursive divide-and-conquer, as sketched below.
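A minimal sketch of that divide-and-conquer lowering (an illustration only, not the actual compiler output; GRAIN and body are hypothetical names):

#include <cilk/cilk.h>

#define GRAIN 512  /* hypothetical grain size below which a range runs serially */

void body(int i);  /* hypothetical loop body */

/* run body(i) for all i in [lo, hi), splitting the range recursively */
static void loop_dc(int lo, int hi) {
  if (hi - lo <= GRAIN) {
    for (int i = lo; i < hi; ++i)
      body(i);
  } else {
    int mid = lo + (hi - lo) / 2;
    cilk_spawn loop_dc(lo, mid);  // left half may run in parallel
    loop_dc(mid, hi);             // right half runs in the parent
    cilk_sync;                    // join both halves
  }
}

This keeps the span of the loop-control logic logarithmic in the number of iterations, rather than linear as spawning each iteration one at a time would be.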

How Much Parallelism? Assuming that each strand executes in unit time, what is the parallelism of this computation?

Amdahl's Law

"If 50% of your application is parallel and 50% is serial, you can't get more than a factor of 2 speedup, no matter how many processors it runs on." (Gene M. Amdahl)

In general, if a fraction α of an application must be run serially, the speedup can be at most 1/α.

Quantifying Parallelism

What is the parallelism of this computation? Amdahl's Law says that since the serial fraction is 3/18 = 1/6, the speedup is upper-bounded by 6.

Performance Measures

T_P = execution time on P processors.
T_1 = work. For the example dag, T_1 = 18.
T_∞ = span, also called critical-path length or computational depth. Here T_∞ = 9.

WORK LAW: T_P ≥ T_1/P.
SPAN LAW: T_P ≥ T_∞.
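As a quick worked check (arithmetic on the numbers above, not from the slide): with T_1 = 18 and T_∞ = 9, the WORK LAW gives T_2 ≥ 18/2 = 9 on two processors, and the SPAN LAW gives T_P ≥ 9 on any number of processors, so no schedule can finish in fewer than 9 steps however many cores are available.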

Series Composition

[Figure: dag A followed in series by dag B.]

Work: T_1(A ∪ B) = T_1(A) + T_1(B)
Span: T_∞(A ∪ B) = T_∞(A) + T_∞(B)

Parallel Composition

[Figure: dags A and B composed in parallel.]

Work: T_1(A ∪ B) = T_1(A) + T_1(B)
Span: T_∞(A ∪ B) = max{T_∞(A), T_∞(B)}
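These two rules make work and span recurrences easy to read off. For fib, for instance (a standard derivation consistent with the slides, though not spelled out on them): the two recursive calls compose in parallel and then join in series with Θ(1) work, so

T_1(n) = T_1(n-1) + T_1(n-2) + Θ(1) = Θ(φ^n), where φ = (1+√5)/2,
T_∞(n) = max{T_∞(n-1), T_∞(n-2)} + Θ(1) = T_∞(n-1) + Θ(1) = Θ(n),

giving parallelism T_1/T_∞ = Θ(φ^n / n), which grows rapidly with n.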

Speedup

Definition. T_1/T_P = speedup on P processors.
If T_1/T_P < P, we have sublinear speedup.
If T_1/T_P = P, we have (perfect) linear speedup.
If T_1/T_P > P, we have superlinear speedup, which is not possible in this simple performance model, because of the WORK LAW T_P ≥ T_1/P.

Parallelism

Because the SPAN LAW dictates that T_P ≥ T_∞, the maximum possible speedup given T_1 and T_∞ is T_1/T_∞ = parallelism = the average amount of work per step along the span = 18/9 = 2.

Example: fib(4)

[Figure: the computation dag for fib(4), with its strands numbered along the critical path.]

Assume for simplicity that each strand in fib(4) takes unit time to execute.

Work: T_1 = 17
Span: T_∞ = 8
Parallelism: T_1/T_∞ = 17/8 = 2.125

Using many more than 2 processors can yield only marginal performance gains.

Greedy Scheduling

IDEA: Do as much as possible on every step.

Definition. A strand is ready if all its predecessors have executed.

On each step (illustrated in the lecture with P = 3):
Complete step: ≥ P strands are ready. Run any P of them.
Incomplete step: < P strands are ready. Run all of them.

Analysis of Greedy

Theorem [G68, B75, EZL89]. Any greedy scheduler achieves T_P ≤ T_1/P + T_∞.

Proof. The number of complete steps is at most T_1/P, since each complete step performs P work. The number of incomplete steps is at most T_∞, since each incomplete step reduces the span of the unexecuted dag by 1.
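A quick numeric check (arithmetic only, using the 18-strand example from before): with T_1 = 18, T_∞ = 9, and P = 2, the theorem guarantees T_2 ≤ 18/2 + 9 = 18, while the WORK and SPAN LAWS give T_2 ≥ 9; the upper bound is within a factor of 2 of the lower bound, which the next corollary establishes in general.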

Optimality of Greedy

Corollary. Any greedy scheduler achieves within a factor of 2 of optimal.

Proof. Let T_P* be the execution time produced by the optimal scheduler. Since T_P* ≥ max{T_1/P, T_∞} by the WORK and SPAN LAWS, we have
T_P ≤ T_1/P + T_∞ ≤ 2·max{T_1/P, T_∞} ≤ 2·T_P*.

Linear Speedup

Corollary. Any greedy scheduler achieves near-perfect linear speedup whenever T_1/T_∞ ≫ P.

Proof. Since T_1/T_∞ ≫ P is equivalent to T_∞ ≪ T_1/P, the Greedy Scheduling Theorem gives us T_P ≤ T_1/P + T_∞ ≈ T_1/P. Thus, the speedup is T_1/T_P ≈ P.

Definition. The quantity T_1/(P·T_∞) is called the parallel slackness.
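To put a number on it (an arithmetic illustration, not from the slide): for fib(4) with T_1 = 17 and T_∞ = 8, the slackness on P = 2 processors is 17/(2·8) ≈ 1.06, nowhere near the ≫ 1 this corollary requires, which matches the earlier observation that even 2 processors yield only marginal gains on fib(4).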

Cilk Performance

Cilk's work-stealing scheduler achieves
T_P = T_1/P + O(T_∞) expected time (provably);
T_P ≈ T_1/P + T_∞ time (empirically).

Near-perfect linear speedup as long as P ≪ T_1/T_∞. Instrumentation in Cilkview allows you to measure T_1 and T_∞.
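Short of running Cilkview, a crude empirical check is to time the same binary at different worker counts. The sketch below is an illustration under stated assumptions: it reuses the fib function from earlier, reports P via the Intel Cilk Plus runtime call __cilkrts_get_nworkers() from <cilk/cilk_api.h>, and relies on the runtime's CILK_NWORKERS environment variable to set the worker count.

#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include <cilk/cilk.h>
#include <cilk/cilk_api.h>

uint64_t fib(uint64_t n);  /* as defined earlier */

int main(void) {
  struct timespec t0, t1;
  clock_gettime(CLOCK_MONOTONIC, &t0);
  uint64_t r = fib(35);  /* problem size chosen arbitrarily for the illustration */
  clock_gettime(CLOCK_MONOTONIC, &t1);
  double tp = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
  printf("P = %d workers: fib(35) = %llu, T_P = %.3f s\n",
         __cilkrts_get_nworkers(), (unsigned long long)r, tp);
  return 0;
}

Running with CILK_NWORKERS=1 estimates T_1; increasing the worker count should track T_1/P + T_∞ until the parallel slackness runs out.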