Shared-memory Parallel Programming with Cilk Plus

Shared-memory Parallel Programming with Cilk Plus
John Mellor-Crummey
Department of Computer Science, Rice University
johnmc@rice.edu
COMP 422/534, Lecture 4, 19 January 2017

Outline for Today
- Threaded programming models
- Introduction to Cilk Plus
  - tasks
  - algorithmic complexity measures
  - scheduling
  - performance and granularity
- Task parallelism examples
  - vector addition using divide and conquer
  - nqueens: exploratory search

What is a Thread?
A thread is an independent flow of control: a software entity that executes a sequence of instructions.
A thread requires
- a program counter
- a set of registers
- an area in memory, including a call stack
- a thread id
A process consists of one or more threads that share an address space and attributes including user id, open files, working directory, ...

An Abstract Example of Threading
A sequential program for matrix multiply

  for (row = 0; row < n; row++)
    for (col = 0; col < n; col++)
      c[row][col] = dot_product(get_row(a, row), get_col(b, col));

can be transformed to use multiple threads:

  for (row = 0; row < n; row++)
    for (col = 0; col < n; col++)
      c[row][col] = spawn dot_product(get_row(a, row), get_col(b, col));
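To make the abstract spawn concrete, here is a minimal Cilk Plus sketch of the same idea (an added illustration, not from the slides). It spawns one task per output row rather than one per dot product, which also anticipates the granularity discussion later in this lecture; matmul_cilk and mm_row are names introduced here for illustration.

  #include <cilk/cilk.h>

  /* Compute one output row of C = A * B serially. */
  static void mm_row(double **c, double **a, double **b, int row, int n) {
      for (int col = 0; col < n; col++) {
          double sum = 0.0;
          for (int k = 0; k < n; k++)
              sum += a[row][k] * b[k][col];
          c[row][col] = sum;
      }
  }

  void matmul_cilk(double **c, double **a, double **b, int n) {
      for (int row = 0; row < n; row++)
          cilk_spawn mm_row(c, a, b, row, n);   /* one task per output row */
      cilk_sync;                                /* wait for all rows to finish */
  }

Spawning whole rows keeps each task's work large enough to amortize the spawn overhead, a point the later granularity slides make explicit.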

Why Threads?
- Well matched to multicore hardware
- Employ parallelism to compute on shared data
  - boost performance on a fixed memory footprint (strong scaling)
- Useful for hiding latency
  - e.g., latency due to memory, communication, I/O
- Useful for scheduling and load balancing
  - especially for dynamic concurrency
- Relatively easy to program
  - easier than message passing? you be the judge!

Threads and Memory
- All memory is globally accessible to every thread
- Each thread's stack is treated as local to the thread
- Additional local storage can be allocated on a per-thread basis
- Idealization: treat all memory as equidistant
(Figure: schema for an SMP node - an OS thread scheduler maps threads onto processors with caches sharing a single address space.)

Targets for Threaded Programs
Shared-memory parallel systems
- Multicore processors
- Workstations or cluster nodes with multiple processors
- Xeon Phi accelerator: about 250 threads
- SGI UV: scalable shared-memory system, up to 4096 threads

Threaded Programming Models
- Library-based models
  - all data is shared, unless otherwise specified
  - examples: Pthreads, C++11 threads, Intel Threading Building Blocks, Java Concurrency Library, Boost
- Directive-based models, e.g., OpenMP
  - shared and private data
  - pragma syntax simplifies thread creation and synchronization
- Programming languages
  - Cilk Plus (Intel, GCC)
  - CUDA (NVIDIA)
  - Habanero-Java (Rice)
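For a side-by-side feel of the directive-based and language-based styles, here is a small sketch (an added illustration, not from the slides) of the same loop written with an OpenMP pragma and with the Cilk Plus cilk_for keyword. The compile flags are toolchain assumptions: roughly -fopenmp for the OpenMP version, and -fcilkplus with a GCC release that still shipped Cilk Plus support or the Intel compiler for the other.

  #include <cilk/cilk.h>

  void scale_omp(double *a, int n) {
      /* directive-based: a pragma asks the compiler to parallelize the loop */
      #pragma omp parallel for
      for (int i = 0; i < n; i++)
          a[i] *= 2.0;
  }

  void scale_cilk(double *a, int n) {
      /* language-based: cilk_for is a keyword of the extended language */
      cilk_for (int i = 0; i < n; i++)
          a[i] *= 2.0;
  }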

Toward Standard Threading for C/C++

"At last month's meeting of the C standard committee, WG14 decided to form a study group to produce a proposal for language extensions for C to simplify parallel programming. This proposal is expected to combine the best ideas from Cilk and OpenMP, two of the most widely-used and well-established parallel language extensions for the C language family.

As the chair of this new study group, named CPLEX (C Parallel Language Extensions), I am announcing its organizational meeting: June 17, 2013, 10:00 AM PDT, 2 hours.

Interested parties should join the group's mailing list, to which further information will be sent: http://www.open-std.org/mailman/listinfo/cplex
Questions can be sent to that list, and/or to me directly.

-- Clark Nelson, Intel Corporation (clark.nelson@intel.com)
   Vice chair, PL22.16 (ANSI C++ standard committee)
   Chair, SG10 (WG21 study group for C++ feature-testing)
   Chair, CPLEX (WG14 study group for C parallel language extensions)"

Cilk Plus Programming Model
A simple and powerful model for writing multithreaded programs
Extends C/C++ with three new keywords
- cilk_spawn: invoke a function (potentially) in parallel
- cilk_sync: wait for a procedure's spawned functions to finish
- cilk_for: execute a loop in parallel
Cilk Plus programs specify logical parallelism
- what computations can be performed in parallel, i.e., tasks
- not the mapping of work to threads or cores
Faithful language extension
- if the Cilk Plus keywords are elided, the result is a C/C++ program with the same semantics
Availability
- Intel compilers
- GCC (some support in 4.8, full support in 5)
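A minimal sketch (added here, not from the slides) that exercises all three keywords in one place; demo, twice, and sum_sq are illustrative names, not Cilk Plus APIs.

  #include <cilk/cilk.h>

  static long twice(long x) { return 2 * x; }

  static long sum_sq(const long *a, int n) {
      long s = 0;
      for (int i = 0; i < n; i++) s += a[i] * a[i];
      return s;
  }

  long demo(long *a, long *b, int n) {
      cilk_for (int i = 0; i < n; i++)      /* iterations may run in parallel;   */
          b[i] = twice(a[i]);               /* cilk_for syncs at the end of loop */

      long s1 = cilk_spawn sum_sq(a, n);    /* may run in parallel with ...      */
      long s2 = sum_sq(b, n);               /* ... the continuation              */
      cilk_sync;                            /* wait for the spawned call         */
      return s1 + s2;
  }

Deleting cilk_spawn and cilk_sync and replacing cilk_for with for yields the serial elision, which has the same semantics as the parallel program.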

Cilk Plus Tasking Example: Fibonacci
Fibonacci sequence: 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987, ...
Computing Fibonacci recursively:

  int fib(int n) {
    if (n < 2) return n;
    else {
      int n1, n2;
      n1 = fib(n-1);
      n2 = fib(n-2);
      return (n1 + n2);
    }
  }

Cilk Plus Tasking Example: Fibonacci
Computing Fibonacci recursively in parallel with Cilk Plus:

  int fib(int n) {
    if (n < 2) return n;
    else {
      int n1, n2;
      n1 = cilk_spawn fib(n-1);
      n2 = fib(n-2);
      cilk_sync;
      return (n1 + n2);
    }
  }
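A minimal driver for the fib above (an added sketch, assuming a toolchain with Cilk Plus support). __cilkrts_get_nworkers() from <cilk/cilk_api.h> reports the runtime's worker count, which can be set with the CILK_NWORKERS environment variable.

  #include <stdio.h>
  #include <stdlib.h>
  #include <cilk/cilk.h>
  #include <cilk/cilk_api.h>

  int fib(int n);   /* the parallel fib from the slide above */

  int main(int argc, char **argv) {
      int n = (argc > 1) ? atoi(argv[1]) : 30;
      /* number of runtime worker threads (set via CILK_NWORKERS) */
      printf("workers: %d\n", __cilkrts_get_nworkers());
      printf("fib(%d) = %d\n", n, fib(n));
      return 0;
  }

For example, CILK_NWORKERS=4 ./fib 35 runs the computation with four workers.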

Cilk Plus Terminology
Parallel control: cilk_spawn, cilk_sync, and the return from a spawned function
Strand: a maximal sequence of instructions not containing parallel control

  int fib(int n) {
    if (n < 2) return n;
    else {
      int n1, n2;
      n1 = cilk_spawn fib(n - 1);
      n2 = cilk_spawn fib(n - 2);
      cilk_sync;
      return (n1 + n2);
    }
  }

Strand A: the code before the first spawn
Strand B: the computation of n - 2 before the second spawn
Strand C: the addition n1 + n2 before the return
(Figure: fib(n) drawn as strands A, B, C connected by spawn and continuation edges.)

Cilk Program Execution as a DAG
(Figure: the spawn tree of fib(4) - fib(4), fib(3), fib(2), fib(1), fib(0) - with each call expanded into strands A, B, C; each circle represents a strand. Legend: continuation, spawn, and return edges.)

Algorithmic Complexity Measures
T_P = execution time on P processors

Computation graph abstraction:
- node = arbitrary sequential computation
- edge = dependence (a successor node can only execute after its predecessor has completed)
- the graph is a directed acyclic graph (DAG)

Processor abstraction:
- P identical processors (PROC 0 ... PROC P-1)
- each processor executes one node at a time

Algorithmic Complexity Measures
T_P = execution time on P processors
T_1 = work

Algorithmic Complexity Measures
T_P = execution time on P processors
T_1 = work
T_∞ = span (also called critical-path length)

Algorithmic Complexity Measures
T_P = execution time on P processors
T_1 = work
T_∞ = span
Lower bounds:
  T_P ≥ T_1 / P
  T_P ≥ T_∞

Speedup
Definition: T_1 / T_P = speedup on P processors
If T_1 / T_P
  = Θ(P), we have linear speedup;
  = P, we have perfect linear speedup;
  > P, we have superlinear speedup.
Superlinear speedup is not possible in this model because of the lower bound T_P ≥ T_1 / P, but it can occur in practice (e.g., due to cache effects).

Parallelism ("Ideal Speedup")
T_P depends on the schedule of computation graph nodes on the processors
- two different schedules can yield different values of T_P for the same P
For convenience, define parallelism (or ideal speedup) as the ratio T_1 / T_∞
Parallelism is independent of P, and depends only on the computation graph
Also define parallel slackness as the ratio (T_1 / T_∞) / P; the larger the slackness, the less the impact of T_∞ on performance

Example: fib(4)
Assume for simplicity that each strand in fib() takes unit time to execute.
Work: T_1 = 17
Span: T_∞ = ?
(T_P refers to execution time on P processors; span = critical-path length)
(Figure: the fib(4) strand DAG with the strands of the critical path numbered 1 through 8.)

Example: fib(4)
Assume for simplicity that each strand in fib() takes unit time to execute.
Work: T_1 = 17
Span: T_∞ = 8
Ideal speedup: T_1 / T_∞ = 17/8 ≈ 2.125
Using more than 2 processors makes little sense.
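A small program (added here, not from the slides) that mechanically checks these numbers under the slide's assumptions: every strand takes unit time, each fib(n) with n ≥ 2 contributes three strands (A, B, C) and spawns both recursive calls, and each leaf contributes a single strand.

  #include <stdio.h>

  /* Work: total number of unit-time strands in the fib(n) DAG. */
  static int work(int n) {
      if (n < 2) return 1;                  /* a leaf is a single strand */
      return 3 + work(n - 1) + work(n - 2); /* strands A, B, C plus both children */
  }

  /* Span: longest chain of strands. Strand A precedes both children,
     strand B (the second spawn) precedes the second child, and strand C
     follows the sync. */
  static int span(int n) {
      if (n < 2) return 1;
      int s1 = span(n - 1);
      int s2 = 1 + span(n - 2);             /* B, then the second child */
      return 1 + (s1 > s2 ? s1 : s2) + 1;   /* A, longest branch, C */
  }

  int main(void) {
      int n = 4;
      printf("T_1   = %d\n", work(n));                            /* 17    */
      printf("T_inf = %d\n", span(n));                            /* 8     */
      printf("parallelism = %.3f\n", (double)work(n) / span(n));  /* 2.125 */
      return 0;
  }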

Task Scheduling
Popular scheduling strategies
- work-sharing: a task is scheduled to run in parallel at every spawn
  - benefit: maximizes parallelism
  - drawback: the cost of setting up new tasks is high and should be avoided
- work-stealing: a processor looks for work when it becomes idle
  - lazy parallelism: put off work for parallel execution until necessary
  - benefits: executes with precisely as much parallelism as needed, minimizes the number of tasks that must be set up, and runs with the same efficiency as the serial program on a uniprocessor
Cilk uses work-stealing rather than work-sharing

Cilk Execution using Work Stealing
The Cilk runtime maps logical tasks to compute cores
Approach: lazy task creation plus a work-stealing scheduler
- cilk_spawn: a potentially parallel task is available
- an idle thread steals a task from a random working thread
(Figure: a possible execution of the fib spawn tree - thread 1 begins with f(n), thread 2 steals f(n-1) from thread 1, thread 3 steals f(n-2) from thread 1, and so on.)

Cilk's Work-Stealing Scheduler
Each processor maintains a work deque of ready strands, and it manipulates the bottom of the deque like a stack: spawns push work onto the bottom of its own deque, and returns pop from the bottom.
When a processor runs out of work, it steals a strand from the top of a random victim's deque.
(Figure: a sequence of animation frames showing four processors' deques through Spawn!, Return!, and Steal! events.)
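A highly simplified sketch (added here; not the Cilk runtime's actual lock-free deque protocol) of the discipline described above: the owner pushes and pops at the bottom of its own deque, while an idle thief removes from the top of a randomly chosen victim. A per-deque mutex stands in for the real synchronization, and capacity/overflow handling is omitted.

  #include <pthread.h>
  #include <stdlib.h>

  typedef void (*strand_fn)(void *);
  typedef struct { strand_fn fn; void *arg; } task_t;

  typedef struct {
      task_t items[1024];       /* fixed capacity keeps the sketch short     */
      int top, bottom;          /* thieves take from top; owner works at bottom */
      pthread_mutex_t lock;     /* stand-in for the real lock-free protocol  */
  } deque_t;

  /* Owner side: push a newly spawned task at the bottom. */
  static void push_bottom(deque_t *d, task_t t) {
      pthread_mutex_lock(&d->lock);
      d->items[d->bottom++] = t;            /* no overflow handling here */
      pthread_mutex_unlock(&d->lock);
  }

  /* Owner side: pop the most recently pushed task (LIFO), if any. */
  static int pop_bottom(deque_t *d, task_t *out) {
      pthread_mutex_lock(&d->lock);
      int ok = d->bottom > d->top;
      if (ok) *out = d->items[--d->bottom];
      pthread_mutex_unlock(&d->lock);
      return ok;
  }

  /* Thief side: steal the oldest task from the top of a random victim. */
  static int steal_top(deque_t *deques, int nworkers, task_t *out) {
      deque_t *v = &deques[rand() % nworkers];
      pthread_mutex_lock(&v->lock);
      int ok = v->bottom > v->top;
      if (ok) *out = v->items[v->top++];
      pthread_mutex_unlock(&v->lock);
      return ok;
  }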

Performance of Work-Stealing
Theorem: Cilk's work-stealing scheduler achieves an expected running time of
  T_P ≤ T_1 / P + O(T_∞)
on P processors.

Greedy Scheduling Theorem
Types of schedule steps
- complete step: at least P operations ready to run; select any P and run them
- incomplete step: strictly fewer than P operations ready to run; a greedy scheduler runs them all
Theorem: On P processors, a greedy scheduler executes any computation G with work T_1 and critical-path length T_∞ in time
  T_P ≤ T_1 / P + T_∞
Proof sketch
- there are only two types of scheduler steps: complete and incomplete
- there cannot be more than T_1 / P complete steps, else the work would exceed T_1
- every incomplete step reduces the remaining critical-path length by 1, so there are no more than T_∞ incomplete steps

Parallel Slackness Revisited
Critical-path overhead: the smallest constant c_∞ such that
  T_P ≤ T_1 / P + c_∞ · T_∞
Factoring out T_∞,
  T_P ≤ ( (T_1 / T_∞) / P + c_∞ ) · T_∞
Let P̄ = T_1 / T_∞ = parallelism = the maximum speedup (on an unbounded number of processors).
Parallel slackness assumption: P̄ / P >> c_∞, i.e., T_1 / P >> c_∞ · T_∞, and therefore
  T_P ≈ T_1 / P, i.e., linear speedup.
Critical-path overhead has little effect on performance when sufficient parallel slackness exists.
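For intuition, a made-up numeric example (added here, not from the slides): if T_1 = 10^8 unit strands, T_∞ = 10^4, and P = 16, then the parallelism is P̄ = 10^4 and the slackness is P̄ / P = 625. Even a critical-path overhead as large as c_∞ = 10 adds only c_∞ · T_∞ = 10^5 to a work term of T_1 / P = 6.25 × 10^6, so the bound stays within a couple of percent of T_1 / P.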

Work Overhead
Work overhead: c_1 = T_1 / T_S, where T_S is the running time of the serial program.
The bound becomes
  T_P ≤ c_1 · T_S / P + c_∞ · T_∞
Minimize work overhead (c_1) at the expense of a larger critical-path overhead (c_∞), because work overhead has a more direct impact on performance:
  T_P ≈ c_1 · T_S / P   (assuming parallel slackness)
You can reduce c_1 by increasing the granularity of parallel work.
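One common way to increase granularity (a sketch added here, not prescribed by the slides) is a sequential cutoff: below a threshold the code falls back to the serial version, so each spawn amortizes its overhead over more work. The cutoff value 20 is an illustrative choice that would be tuned empirically.

  #include <cilk/cilk.h>

  /* Plain serial fib: no spawn overhead at all below the cutoff. */
  static int fib_serial(int n) {
      return (n < 2) ? n : fib_serial(n - 1) + fib_serial(n - 2);
  }

  /* Coarsened parallel fib: spawning stops once subproblems get small. */
  int fib_coarse(int n) {
      if (n < 20)                    /* illustrative cutoff, tune empirically */
          return fib_serial(n);
      int n1 = cilk_spawn fib_coarse(n - 1);
      int n2 = fib_coarse(n - 2);
      cilk_sync;
      return n1 + n2;
  }

Raising the cutoff lowers c_1 (fewer spawns per unit of work) while leaving ample parallelism as long as n is well above the threshold.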

Parallelizing Vector Addition
C (here real is the slides' placeholder element type, e.g. typedef double real;):

  void vadd(real *A, real *B, int n) {
    int i;
    for (i = 0; i < n; i++) A[i] += B[i];
  }

Divide and Conquer
An effective parallelization strategy creates a good mix of large and small sub-problems
A work-stealing scheduler can allocate chunks of work efficiently to the cores, as long as there are
- not only a few large chunks: if work is divided into just a few large chunks, there may not be enough parallelism to keep all the cores busy
- not too many very small chunks: if the chunks are too small, then scheduling overhead may overwhelm the benefit of parallelism

Parallelizing Vector Addition
C:

  void vadd(real *A, real *B, int n) {
    int i;
    for (i = 0; i < n; i++) A[i] += B[i];
  }

C (recursive):

  void vadd(real *A, real *B, int n) {
    if (n <= BASE) {
      int i;
      for (i = 0; i < n; i++) A[i] += B[i];
    } else {
      vadd(A, B, n/2);
      vadd(A+n/2, B+n/2, n-n/2);
    }
  }

Parallelization strategy:
1. Convert loops to recursion.

Parallelizing Vector Addition
C:

  void vadd(real *A, real *B, int n) {
    int i;
    for (i = 0; i < n; i++) A[i] += B[i];
  }

Cilk Plus:

  void vadd(real *A, real *B, int n) {
    if (n <= BASE) {
      int i;
      for (i = 0; i < n; i++) A[i] += B[i];
    } else {
      cilk_spawn vadd(A, B, n/2);
      vadd(A+n/2, B+n/2, n-n/2);
    }
    cilk_sync;
  }

Parallelization strategy:
1. Convert loops to recursion.
2. Insert Cilk Plus keywords.
Side benefit: divide and conquer is generally good for caches!

Vector Addition

  void vadd(real *A, real *B, int n) {
    if (n <= BASE) {
      int i;
      for (i = 0; i < n; i++) A[i] += B[i];
    } else {
      cilk_spawn vadd(A, B, n/2);
      vadd(A+n/2, B+n/2, n-n/2);
      cilk_sync;
    }
  }

Vector Addition Analysis
To add two vectors of length n, where BASE = Θ(1):
Work: T_1 = Θ(n)
Span: T_∞ = Θ(lg n)
Parallelism: T_1 / T_∞ = Θ(n / lg n)
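These bounds follow from the usual divide-and-conquer recurrences; a brief derivation (added here, with BASE = Θ(1)):

  Work:  T_1(n) = 2 · T_1(n/2) + Θ(1)   =>  T_1(n) = Θ(n)
  Span:  T_∞(n) = T_∞(n/2) + Θ(1)       =>  T_∞(n) = Θ(lg n)

Only one half appears in the span recurrence because the two recursive calls run in parallel, so the critical path passes through a single half at each level. Dividing the two gives the parallelism Θ(n / lg n).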

Example: N Queens
Problem: place N queens on an N x N chess board so that no two queens are in the same row, column, or diagonal.
Example: a solution to the 8-queens problem.
Image credit: http://en.wikipedia.org/wiki/eight_queens_puzzle

N Queens: Many Solutions Possible
Example: 8 queens
- 92 distinct solutions
- 12 unique solutions; the others are rotations and reflections
Image credit: http://en.wikipedia.org/wiki/eight_queens_puzzle

N Queens Solution Sketch
Sequential recursive enumeration of all solutions:

  int nqueens(n, j, placement) {
    // precondition: placed j queens so far
    if (j == n) { print placement; return; }
    for (k = 0; k < n; k++)
      if putting queen j+1 in the k-th position of row j+1 is legal
        add queen j+1 to placement
        nqueens(n, j+1, placement)
        remove queen j+1 from placement
  }

Where's the potential for parallelism?
What issues must we consider?

Parallel N Queens Solution Sketch

  void nqueens(n, j, placement) {
    // precondition: placed j queens so far
    if (j == n) { /* found a placement */ process placement; return; }
    for (k = 1; k <= n; k++)
      if putting queen j+1 in the k-th position of row j+1 is legal
        copy placement into newplacement and add the extra queen
        cilk_spawn nqueens(n, j+1, newplacement)
    cilk_sync
    discard placement
  }

Issues regarding placements
- how can we report placements?
- what if a single placement suffices? there is no need to compute all legal placements
- so far, there is no way to terminate children exploring alternate placements

Approaches to Managing Placements
Choices for reporting multiple legal placements
- count them
- print them on the fly
- collect them on the fly; print them at the end
If only one placement is desired, the remaining search can be skipped.
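As one concrete illustration of the "count them" option (an added sketch, not the deck's solution), each spawned child can return its own count, with results written to per-child slots and summed after the sync, avoiding a race on a shared counter. is_legal and the array-of-columns board representation are assumptions introduced here, and the fixed limit of 32 queens is a simplification.

  #include <cilk/cilk.h>
  #include <string.h>

  /* Assumed helper: is placing a queen at (row, col) legal given rows 0..row-1? */
  int is_legal(const int *placement, int row, int col);

  /* Returns the number of complete legal placements reachable from this prefix. */
  long nqueens_count(int n, int j, const int *placement) {
      if (j == n) return 1;                 /* a full board is one solution */

      long counts[32] = {0};                /* one result slot per spawned child */
      int  children[32][32];                /* each child gets its own board copy */

      for (int k = 0; k < n; k++) {
          if (!is_legal(placement, j, k)) continue;
          memcpy(children[k], placement, j * sizeof(int));
          children[k][j] = k;               /* extend placement: queen j in column k */
          counts[k] = cilk_spawn nqueens_count(n, j + 1, children[k]);
      }
      cilk_sync;                            /* all children have finished */

      long total = 0;
      for (int k = 0; k < n; k++) total += counts[k];
      return total;
  }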

References
- Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar. Introduction to Parallel Computing. Addison Wesley, 2003.
- Charles E. Leiserson. Cilk LECTURE 1. Supercomputing Technologies Research Group, Computer Science and Artificial Intelligence Laboratory. http://bit.ly/mit-cilk-lec1
- Charles Leiserson, Bradley Kuszmaul, Michael Bender, and Hua-wen Jing. MIT 6.895 lecture notes - Theory of Parallel Systems. http://bit.ly/mit-6895-fall03
- Intel Cilk++ Programmer's Guide. Document #322581-001US.