CILK/CILK++ AND REDUCERS YUNMING ZHANG RICE UNIVERSITY
2 OUTLINE CILK and CILK++ Language Features and Usages Work stealing runtime CILK++ Reducers Conclusions 2
3 IDEALIZED SHARED MEMORY ARCHITECTURE Hardware model Processors Shared global memory Software model Threads Shared variables Communication Synchronization Slide from Comp 422 Rice University Lecture 4 3
4 CILK AND CILK++ DESIGN GOALS Programmer friendly Dynamic tasking Parallel extension to C Scalable performance Efficient runtime system Minimum program overhead 4
5 CILK KEYWORDS
  cilk: marks a Cilk function
  spawn: the call can execute asynchronously in a concurrent thread
  sync: the current thread waits for all locally spawned functions
6 CILK EXAMPLE
cilk int fib(int n) {
    if (n < 2) {
        return n;
    } else {
        int n1, n2;
        n1 = spawn fib(n-1);
        n2 = spawn fib(n-2);
        sync;
        return (n1 + n2);
    }
}
Borrowed from Comp 422 Rice University Lecture 4
7 CILK++ EXAMPLE
int fib(int n) {
    if (n < 2) {
        return n;
    } else {
        int n1, n2;
        n1 = cilk_spawn fib(n-1);
        n2 = fib(n-2);
        cilk_sync;
        return (n1 + n2);
    }
}
Borrowed from Comp 422 Rice University Lecture 4
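For readers without a Cilk++ compiler, the fork-join shape of the example can be approximated in standard C++. The sketch below is an illustration only, not how Cilk++ is implemented: `std::async` stands in for `cilk_spawn`, `get()` stands in for `cilk_sync`, and the names `fib_serial`, `fib_async` and the `depth` cutoff are assumptions introduced here. Unlike Cilk++, `std::async` with `std::launch::async` starts a thread per call, so the cutoff limits how many are created.

```cpp
#include <future>

// Serial elision: the same computation with the parallel keywords removed.
long fib_serial(int n) {
    return n < 2 ? n : fib_serial(n - 1) + fib_serial(n - 2);
}

// Fork-join fib. `depth` (hypothetical parameter) bounds the number of
// levels that fork, because each std::async call here starts a thread.
long fib_async(int n, int depth = 3) {
    if (n < 2) return n;
    if (depth == 0) return fib_serial(n);
    // "spawn": the child may run concurrently with the continuation.
    std::future<long> n1 =
        std::async(std::launch::async, fib_async, n - 1, depth - 1);
    // The continuation runs in the current thread, as in the Cilk++ example.
    long n2 = fib_async(n - 2, depth - 1);
    // "sync": wait for the spawned child before combining the results.
    return n1.get() + n2;
}
```

Calling `fib_async(20)` computes the same value as `fib_serial(20)`; only the schedule differs.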
8 CILK++ EXAMPLE WITH DAG Pictures from Reducers and Other CILK+ HyperObjects Talk by Matteo Frigo (Intel). Pablo Halpern ( Intel). Charles E. Leiserson (MIT). Stephen Lewin-Berlin (Intel). 8
9 OUTLINE CILK and CILK++ Language Features and Usages Work stealing runtime CILK++ Reducers Conclusions 9
10 WORK FIRST PRINCIPLE
  Work: $T_1$
  Critical path length: $T_\infty$
  Number of processors: $P$
  Expected time: $T_P = T_1/P + O(T_\infty)$
  Parallel slackness assumption: $T_1/P \gg c_\infty T_\infty$
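11 WORK FIRST PRINCIPLE
  Minimize the scheduling overhead borne by the work, even at the expense of increasing the critical-path overhead.
  $T_P \le c_1 T_S/P + c_\infty T_\infty \approx c_1 T_S/P$
  Here $T_S$ is the running time of the serial C elision and $c_1 = T_1/T_S$ is the work overhead.
  Minimize $c_1$, even at the expense of a larger $c_\infty$.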
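A small worked example may make the trade-off concrete. All numbers below are hypothetical and chosen only for illustration: a serial time of 10 s, a critical path of 0.1 ms, and 8 processors.

```latex
% Hypothetical values: T_S = 10 s, T_\infty = 10^{-4} s, P = 8.
% Design A: c_1 = 1.2, c_\infty = 100
T_P \le \frac{1.2 \cdot 10}{8} + 100 \cdot 10^{-4} = 1.5 + 0.01 = 1.51 \text{ s}
% Design B: lower work overhead, doubled critical-path overhead
% c_1 = 1.1, c_\infty = 200
T_P \le \frac{1.1 \cdot 10}{8} + 200 \cdot 10^{-4} = 1.375 + 0.02 = 1.395 \text{ s}
```

Under these assumed values the slackness condition holds (1.5 s against 0.01 s), so the first term dominates and design B gives the better bound even though its critical-path overhead is twice as large.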
12 WORK STEALING DESIGN GOALS
  Minimize contention
    Decentralized task deque
    Doubly linked deque
  Minimize communication
    Steal work rather than push work
  Load balance across cores
    Lazy task creation
    Steal from the top of the deque
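The per-worker deque can be sketched as follows, assuming C++17. This is a deliberately simplified illustration: it guards every operation with one mutex, which is an assumption made for brevity and not the low-overhead protocol a real Cilk runtime uses. The class and method names (`WorkDeque`, `push_bottom`, `pop_bottom`, `steal_top`) are hypothetical. The point it shows is the asymmetry: the owner works at the bottom (newest tasks), while a thief takes from the top (oldest tasks).

```cpp
#include <deque>
#include <mutex>
#include <optional>
#include <utility>

// One deque per worker (decentralized). Simplified: a single mutex
// protects all operations.
template <typename Task>
class WorkDeque {
public:
    // Owner: push newly created work at the bottom.
    void push_bottom(Task t) {
        std::lock_guard<std::mutex> g(m_);
        tasks_.push_back(std::move(t));
    }
    // Owner: take the most recently pushed task (LIFO order).
    std::optional<Task> pop_bottom() {
        std::lock_guard<std::mutex> g(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.back());
        tasks_.pop_back();
        return t;
    }
    // Thief: take the oldest task from the top of the deque.
    std::optional<Task> steal_top() {
        std::lock_guard<std::mutex> g(m_);
        if (tasks_.empty()) return std::nullopt;
        Task t = std::move(tasks_.front());
        tasks_.pop_front();
        return t;
    }
private:
    std::mutex m_;
    std::deque<Task> tasks_;
};
```

Because owner and thief operate on opposite ends, they only collide when the deque is nearly empty, which is what keeps contention low in practice.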
13-26 CILK WORK STEALING SCHEDULER [Fourteen animation frames illustrating the work stealing scheduler.] Pictures from Reducers and Other CILK+ HyperObjects Talk by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), Stephen Lewin-Berlin (Intel).
27 TWO CLONE STRATEGY
  Fast clone
    Identical in most respects to the C elision of the Cilk program
    Very little execution overhead
    Sync statements compile to no-ops
    Allocates a continuation: program variables and instruction pointer
  Slow clone
    A spawned procedure is converted to its slow clone only when it is stolen
    Restores program state from the activation frame, which contains the local variables, the program counter and other parts of the procedure instance
28 FAST CLONE 28
29 SLOW CLONE
slow_fib(frame *_cilk_frame) {
    switch (_cilk_frame->header.entry) {
        case 1: goto _cilk_sync1;   /* fast_fib(_cilk_frame->n - 1) */
        case 2: goto _cilk_sync2;   /* fast_fib(_cilk_frame->n - 2) */
        case 3: goto _cilk_sync3;   /* sync (not a no-op) */
    }
}
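The idea of resuming from a saved frame can be sketched in standard C++17. This is not the code a Cilk compiler generates; it is a hand-written analogue in which `FibFrame`, its fields and the entry numbering are assumptions made for illustration. The `switch` on the saved entry point plays the role of the `goto` dispatch above: execution jumps past the work already done and continues with the state stored in the frame.

```cpp
// Saved state of one fib activation (hypothetical layout).
struct FibFrame {
    int entry = 0;   // how far the activation had progressed when saved
    int n = 0;       // argument
    long x = 0;      // result of fib(n - 1), valid once entry >= 1
    long y = 0;      // result of fib(n - 2), valid once entry >= 2
};

long fib(int n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }

// Resume an activation from its frame instead of starting from the top.
long slow_fib(FibFrame* f) {
    switch (f->entry) {
        case 0:                      // nothing has run yet
            if (f->n < 2) return f->n;
            f->x = fib(f->n - 1);
            f->entry = 1;
            [[fallthrough]];
        case 1:                      // first call already done; x is in the frame
            f->y = fib(f->n - 2);
            f->entry = 2;
            [[fallthrough]];
        case 2:                      // both results available
            return f->x + f->y;
    }
    return -1;                       // unknown entry point
}
```

For instance, a frame saved with `n = 10`, `entry = 1` and `x = 34` resumes at the second call and never recomputes `fib(9)`.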
30 FRAMES C++ Main Frame Local variables of the procedure instance Temporary variables Linkage information for return values 30
31 FRAMES
  CILK++ Stack Frame
    Everything in the C++ Main Frame
    Continuation
    Parent pointer
    Has exactly one child
    Used by the fast clone
    A worker can have multiple stack frames
32 FRAMES
  CILK++ Full Frame (used by the slow clone)
    Everything in the CILK++ Stack Frame
    Lock
    Join counter
    List of children (can have more than one child)
    A worker has at most one full frame
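The two frame kinds can be pictured as plain data structures. The layout below is a sketch under assumptions: the field types, the helper names `add_child` and `child_done`, and the reading of the join counter as "number of outstanding children" are introduced here for illustration and are not taken from the runtime's source.

```cpp
#include <mutex>
#include <vector>

// Stack frame: the cheap frame used by the fast clone.
struct StackFrame {
    void (*continuation)() = nullptr;  // where to resume after a spawn
    StackFrame* parent = nullptr;      // parent pointer
};

// Full frame: adds the bookkeeping needed once work has been stolen.
struct FullFrame : StackFrame {
    std::mutex lock;                   // protects the fields below
    int join_counter = 0;              // assumed: outstanding children
    std::vector<FullFrame*> children;  // list of children
};

// Register a child with its parent (hypothetical helper).
void add_child(FullFrame& parent, FullFrame& child) {
    std::lock_guard<std::mutex> g(parent.lock);
    child.parent = &parent;
    parent.children.push_back(&child);
    ++parent.join_counter;
}

// Record that one child finished; returns true when it was the last one,
// i.e. when the parent could proceed past its sync.
bool child_done(FullFrame& parent) {
    std::lock_guard<std::mutex> g(parent.lock);
    return --parent.join_counter == 0;
}
```

The extra lock and counter are the reason full frames cost more than stack frames, and why the runtime only creates them when a steal has actually happened.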
33 EXTENDED DEQUE WITH CALL STACKS Extended Deque Call stack Stack frame Full frame 33
34-51 EXTENDED DEQUE OPERATIONS [Each slide shows a picture of the extended deque surrounded by the same operation menu (function call, spawn, call return, spawn return, sync, randomly steal, provably good steal, unconditionally steal, resume full frame) and the same legend (stack frame, full frame). Only the slide-specific text is kept below.]
34 FUNCTION CALL Extended deque before the function call
35 FUNCTION CALL Extended deque after the function call: new stack frame
36 SPAWN Extended deque before the spawn
37 SPAWN Extended deque after the spawn: set the continuation in the last stack frame
38 RESUME FULL FRAME Set the full frame to be the only frame in the call stack, then resume execution on the continuation
39-41 RANDOMLY STEAL Steal this call stack (three animation frames)
42 PROVABLY GOOD STEAL (figure annotated with the value 0)
43 UNCONDITIONALLY STEAL (figure annotated with the value 2)
44 FUNCTION CALL RETURN Extended deque before return from a call, case 1
45 FUNCTION CALL RETURN Return from a call, case 1
46 FUNCTION CALL RETURN Return from a call, case 2: the worker executes an unconditional steal
47 SPAWN RETURN Extended deque before spawn return, case 1
48 SPAWN RETURN Extended deque after spawn return, case 1
49 SPAWN RETURN Return from a spawn, case 2: the worker executes a provably good steal
50 SYNC Case 1: do nothing if it is a stack frame (no-op)
51 SYNC Case 2: pop the frame, then execute a provably good steal
52 OUTLINE CILK and CILK++ Language Features and Usages Work stealing runtime CILK++ Reducers Conclusions 52
53 PROBLEMS WITH NON-LOCAL VARIABLES
bool has_property(node *);
std::list<node *> output_list;
void walk(node *x) {
    if (x) {
        if (has_property(x)) output_list.push_back(x);  // unsynchronized update of a shared list
        cilk_spawn walk(x->left);
        walk(x->right);
        cilk_sync;
    }
}
54 REDUCER DESIGN GOALS Support parallelization of programs containing global variables Enable efficient parallel scaling by avoiding a single point of contention Provide deterministic result for associative reduce operations Operate independently of any control constructs 54
55 REDUCER EXAMPLE
bool has_property(node *);
List_append_reducer<node *> output_list;
void walk(node *x) {
    if (x) {
        if (has_property(x)) output_list.push_back(x);
        cilk_spawn walk(x->left);
        walk(x->right);
        cilk_sync;
    }
}
56 HYPER OBJECTS Pictures from Reducers and Other CILK+ HyperObjects Talk by Matteo Frigo (Intel). Pablo Halpern ( Intel). Charles E. Leiserson (MIT). Stephen Lewin-Berlin (Intel). 56
57 REDUCER Pictures from Reducers and Other CILK+ HyperObjects Talk by Matteo Frigo (Intel). Pablo Halpern ( Intel). Charles E. Leiserson (MIT). Stephen Lewin-Berlin (Intel). 57
58 SEMANTICS OF REDUCERS
  1. The child strand owns the view that the parent function owned before the cilk_spawn.
  2. The parent strand owns a new view, initialized to the identity view e. (A special optimization applies when a view is unchanged after being combined with the identity view.)
  3. The parent strand P owns the view from the completed child strands.
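The ownership rules can be imitated in standard C++ by passing views explicitly. The sketch below is not the Cilk++ reducer mechanism: `std::async` stands in for `cilk_spawn`, views are plain `std::list<int>` values, and `Node`, `View`, `combine_views`, the `depth` cutoff and the "even value" property are all assumptions made for illustration. It follows the three rules: the spawned child takes over the current view, the parent continues with a fresh identity (empty) view, and at the sync the child's view is combined on the left of the parent's.

```cpp
#include <future>
#include <list>
#include <utility>

struct Node { int value; Node* left; Node* right; };
using View = std::list<int>;   // one view of the list-append reducer

// Hypothetical property: keep nodes with an even value.
bool has_property(const Node* x) { return x->value % 2 == 0; }

// Associative reduce operation: list concatenation, left operand first.
View combine_views(View left, View right) {
    left.splice(left.end(), right);
    return left;
}

// `view` is the view this strand owns on entry; the updated view is returned.
View walk(const Node* x, View view, int depth = 2) {
    if (!x) return view;
    if (has_property(x)) view.push_back(x->value);
    // Fork real threads only near the root; deeper calls run lazily at get().
    const std::launch policy =
        depth > 0 ? std::launch::async : std::launch::deferred;
    // Rule 1: the child strand takes ownership of the current view.
    std::future<View> child =
        std::async(policy, walk, x->left, std::move(view), depth - 1);
    // Rule 2: the parent strand continues with a new identity view.
    View parent_view = walk(x->right, View{}, depth - 1);
    // Rule 3: after the sync, the views are reduced in serial order.
    return combine_views(child.get(), std::move(parent_view));
}
```

Because concatenation is associative and the child's view is always placed on the left, the result matches the order a serial pre-order walk would produce, whichever strand finishes first.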
59-60 REDUCING OVER LIST CONCATENATION [Two figure slides.] Pictures from Reducers and Other CILK+ HyperObjects Talk by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), Stephen Lewin-Berlin (Intel).
61 IMPLEMENTATION OF REDUCER
  Each worker maintains a hypermap
  Hypermap: maps reducers to their views
    User: the view of the current procedure
    Children: the view of the children procedures
    Right: the view of the right sibling
    Identity: the default value of a view
62 UNDERSTANDING HYPERMAPS
bool has_property(node *);
List_append_reducer<node *> output_list;
void walk(node *x) {                        // ← proc A
    if (x) {
        if (has_property(x)) output_list.push_back(x);
        cilk_spawn walk(x->left);           // ← proc B
        cilk_spawn walk(x->right);          // ← proc C
        cilk_sync;
    }
}
63 LAZY CREATION A new view will only be created after a steal On demand 63
64-68 HYPERMAP CREATION [Five animation frames.] Pictures from Reducers and Other CILK+ HyperObjects Talk by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), Stephen Lewin-Berlin (Intel).
69 LOOK UP FAILURE
  When a lookup fails, the runtime inserts a view containing an identity element for the reducer into the hypermap, following the lazy principle.
  The lookup then returns the newly inserted identity view.
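Lazy view creation on a failed lookup can be sketched with an ordinary hash map, assuming C++17. The types here are assumptions for illustration: reducers are identified by an integer `ReducerId`, every view is a `long` running sum, and the identity is passed in by the caller. A real hypermap would hold views of arbitrary reducer types.

```cpp
#include <unordered_map>

using ReducerId = int;   // hypothetical key identifying a reducer

struct Hypermap {
    std::unordered_map<ReducerId, long> views;

    // Return the view for reducer `r`. On a miss, insert the identity
    // first, so a view only comes into existence when it is first used.
    long& lookup(ReducerId r, long identity = 0) {
        auto [it, inserted] = views.try_emplace(r, identity);
        (void)inserted;  // true exactly when the identity view was created
        return it->second;
    }
};
```

A strand that never touches a reducer therefore never pays for a view of it; the map stays empty until the first `lookup`.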
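70 RANDOM WORK STEALING
  A random steal operation steals a full frame P and replaces it with a new full frame C in the victim.
  $\mathrm{USER}_C \leftarrow \mathrm{USER}_P$; $\mathrm{USER}_P \leftarrow \emptyset$; $\mathrm{CHILDREN}_P \leftarrow \emptyset$; $\mathrm{RIGHT}_P \leftarrow \emptyset$.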
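The hypermap updates performed on a random steal amount to one move and three clears. The sketch below assumes a hypermap is a `std::map` from a reducer name to a list view, and the `Frame` struct and `on_random_steal` function are hypothetical names introduced for illustration.

```cpp
#include <list>
#include <map>
#include <string>
#include <utility>

using View = std::list<int>;
using Hypermap = std::map<std::string, View>;

// The three hypermaps attached to a full frame.
struct Frame {
    Hypermap user, children, right;
};

// P is the stolen frame; C is the new full frame left in the victim.
void on_random_steal(Frame& P, Frame& C) {
    C.user = std::move(P.user);  // USER_C <- USER_P
    P.user.clear();              // USER_P <- empty
    P.children.clear();          // CHILDREN_P <- empty
    P.right.clear();             // RIGHT_P <- empty
}
```

The stolen frame starts with no views at all, so any view it needs is created on demand at its first lookup.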
71 RANDOM WORK STEALING Pictures from Reducers and Other CILK+ HyperObjects Talk by Matteo Frigo (Intel). Pablo Halpern ( Intel). Charles E. Leiserson (MIT). Stephen Lewin-Berlin (Intel). 71
72-77 RETURN FROM A CALL
  Let C be a child frame of the parent frame P that originally called C, and suppose that C returns.
    If C is a stack frame, do nothing.
    If C is a full frame, transfer ownership of the view. The Children and Right hypermaps of C are empty, so $\mathrm{USER}_P \leftarrow \mathrm{USER}_C$.
  [Slides 73, 74 and 76 repeat the extended-deque figures for return from a call: case 1 before and after, and case 2, where the worker executes an unconditional steal.]
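78 RETURN FROM A SPAWN
  Let C be a child frame of the parent frame P that originally spawned C, and suppose that C returns.
  Always do $\mathrm{USER}_C \leftarrow \mathrm{REDUCE}(\mathrm{USER}_C, \mathrm{RIGHT}_C)$.
    If C is a stack frame, do nothing more.
    If C is a full frame:
      If C has a left sibling L, $\mathrm{RIGHT}_L \leftarrow \mathrm{REDUCE}(\mathrm{RIGHT}_L, \mathrm{USER}_C)$.
      Otherwise C is the leftmost child, and $\mathrm{CHILDREN}_P \leftarrow \mathrm{REDUCE}(\mathrm{CHILDREN}_P, \mathrm{USER}_C)$.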
79-81 RETURN FROM A SPAWN EXAMPLE [The same code is shown on three consecutive animation slides.]
bool has_property(node *);
List_append_reducer<node *> output_list;
void walk(node *x) {                        // ← proc A
    if (x) {
        if (has_property(x)) output_list.push_back(x);
        cilk_spawn walk(x->left);           // ← proc B
        cilk_spawn walk(x->right);          // ← proc C
        cilk_sync;
    }
}
82 SYNC
  A cilk_sync statement waits until all children have completed. When frame P executes a cilk_sync, one of the following two cases applies:
    If P is a stack frame, do nothing.
    If P is a full frame, $\mathrm{USER}_P \leftarrow \mathrm{REDUCE}(\mathrm{CHILDREN}_P, \mathrm{USER}_P)$.
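The return-from-spawn and sync rules for full frames can be traced with a small model, assuming C++17. Everything named here is an assumption for illustration: hypermaps are `std::map`s from a reducer name to a list view, REDUCE is key-wise list concatenation, and `Frame`, `reduce_into`, `on_spawn_return` and `on_sync` are hypothetical helpers, not runtime functions.

```cpp
#include <list>
#include <map>
#include <string>
#include <utility>

using View = std::list<int>;
using Hypermap = std::map<std::string, View>;

// REDUCE(left, right): combine view by view, left operand first.
// The result ends up in `left`; `right` is emptied.
void reduce_into(Hypermap& left, Hypermap& right) {
    for (auto& [key, view] : right) {
        View& dst = left[key];        // a missing view acts as the identity
        dst.splice(dst.end(), view);
    }
    right.clear();
}

struct Frame {
    Hypermap user, children, right;
    Frame* left_sibling = nullptr;
    Frame* parent = nullptr;
};

// Full frame C returns from a spawn.
void on_spawn_return(Frame& C) {
    reduce_into(C.user, C.right);                    // USER_C <- REDUCE(USER_C, RIGHT_C)
    if (C.left_sibling) {
        reduce_into(C.left_sibling->right, C.user);  // RIGHT_L <- REDUCE(RIGHT_L, USER_C)
    } else if (C.parent) {
        reduce_into(C.parent->children, C.user);     // CHILDREN_P <- REDUCE(CHILDREN_P, USER_C)
    }
}

// Full frame P executes a sync.
void on_sync(Frame& P) {
    reduce_into(P.children, P.user);                 // REDUCE(CHILDREN_P, USER_P)
    P.user = std::move(P.children);                  // USER_P <- result
    P.children.clear();
}
```

With a parent holding the view {3} and two spawned children holding {1} (left) and {2} (right), the right child returning first parks {2} in the left child's Right hypermap, the left child's return yields {1, 2} in the parent's Children hypermap, and the sync leaves the parent with {1, 2, 3}: the serial order, even though the children finished out of order.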
83 BENEFITS OF REDUCERS 83
84 OUTLINE CILK and CILK++ Language Features and Usages Work stealing runtime CILK++ Reducers Conclusions 84
85 CONCLUSIONS CILK and CILK++ provide a programmer friendly programming model Extension to C Incremental parallelism Scaling on future machines Non-compromising performance Work stealing runtime Minimizing overheads Reducers 85
86 FINAL NOTES
  Designed for an idealized shared memory model
    Today's architectures are typically NUMA
  Task creation can be lazier
  Cilk_for
    Divide and conquer parallelization
Reducers and other Cilk++ hyperobjects
Reducers and other Cilk++ hyperobjects Matteo Frigo (Intel) ablo Halpern (Intel) Charles E. Leiserson (MIT) Stephen Lewin-Berlin (Intel) August 11, 2009 Collision detection Assembly: Represented as a tree
More informationCilk Plus: Multicore extensions for C and C++
Cilk Plus: Multicore extensions for C and C++ Matteo Frigo 1 June 6, 2011 1 Some slides courtesy of Prof. Charles E. Leiserson of MIT. Intel R Cilk TM Plus What is it? C/C++ language extensions supporting
More informationShared-memory Parallel Programming with Cilk Plus
Shared-memory Parallel Programming with Cilk Plus John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 4 30 August 2018 Outline for Today Threaded programming
More informationMulticore programming in CilkPlus
Multicore programming in CilkPlus Marc Moreno Maza University of Western Ontario, Canada CS3350 March 16, 2015 CilkPlus From Cilk to Cilk++ and Cilk Plus Cilk has been developed since 1994 at the MIT Laboratory
More informationA Quick Introduction To The Intel Cilk Plus Runtime
A Quick Introduction To The Intel Cilk Plus Runtime 6.S898: Advanced Performance Engineering for Multicore Applications March 8, 2017 Adapted from slides by Charles E. Leiserson, Saman P. Amarasinghe,
More informationThe Cilk part is a small set of linguistic extensions to C/C++ to support fork-join parallelism. (The Plus part supports vector parallelism.
Cilk Plus The Cilk part is a small set of linguistic extensions to C/C++ to support fork-join parallelism. (The Plus part supports vector parallelism.) Developed originally by Cilk Arts, an MIT spinoff,
More informationShared-memory Parallel Programming with Cilk Plus
Shared-memory Parallel Programming with Cilk Plus John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 4 19 January 2017 Outline for Today Threaded programming
More informationCilk. Cilk In 2008, ACM SIGPLAN awarded Best influential paper of Decade. Cilk : Biggest principle
CS528 Slides are adopted from http://supertech.csail.mit.edu/cilk/ Charles E. Leiserson A Sahu Dept of CSE, IIT Guwahati HPC Flow Plan: Before MID Processor + Super scalar+ Vector Unit Serial C/C++ Coding
More informationCSE 613: Parallel Programming
CSE 613: Parallel Programming Lecture 3 ( The Cilk++ Concurrency Platform ) ( inspiration for many slides comes from talks given by Charles Leiserson and Matteo Frigo ) Rezaul A. Chowdhury Department of
More informationMultithreaded Parallelism and Performance Measures
Multithreaded Parallelism and Performance Measures Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 3101 (Moreno Maza) Multithreaded Parallelism and Performance Measures CS 3101
More informationCS 240A: Shared Memory & Multicore Programming with Cilk++
CS 240A: Shared Memory & Multicore rogramming with Cilk++ Multicore and NUMA architectures Multithreaded rogramming Cilk++ as a concurrency platform Work and Span Thanks to Charles E. Leiserson for some
More informationThe Implementation of Cilk-5 Multithreaded Language
The Implementation of Cilk-5 Multithreaded Language By Matteo Frigo, Charles E. Leiserson, and Keith H Randall Presented by Martin Skou 1/14 The authors Matteo Frigo Chief Scientist and founder of Cilk
More informationPlan. 1 Parallelism Complexity Measures. 2 cilk for Loops. 3 Scheduling Theory and Implementation. 4 Measuring Parallelism in Practice
lan Multithreaded arallelism and erformance Measures Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 3101 1 2 cilk for Loops 3 4 Measuring arallelism in ractice 5 Announcements
More informationParallelism and Performance
6.172 erformance Engineering of Software Systems LECTURE 13 arallelism and erformance Charles E. Leiserson October 26, 2010 2010 Charles E. Leiserson 1 Amdahl s Law If 50% of your application is parallel
More informationCompsci 590.3: Introduction to Parallel Computing
Compsci 590.3: Introduction to Parallel Computing Alvin R. Lebeck Slides based on this from the University of Oregon Admin Logistics Homework #3 Use script Project Proposals Document: see web site» Due
More informationAn Overview of Parallel Computing
An Overview of Parallel Computing Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS2101 Plan 1 Hardware 2 Types of Parallelism 3 Concurrency Platforms: Three Examples Cilk CUDA
More informationOn the cost of managing data flow dependencies
On the cost of managing data flow dependencies - program scheduled by work stealing - Thierry Gautier, INRIA, EPI MOAIS, Grenoble France Workshop INRIA/UIUC/NCSA Outline Context - introduction of work
More informationCSE 260 Lecture 19. Parallel Programming Languages
CSE 260 Lecture 19 Parallel Programming Languages Announcements Thursday s office hours are cancelled Office hours on Weds 2p to 4pm Jing will hold OH, too, see Moodle Scott B. Baden /CSE 260/ Winter 2014
More informationMultithreaded Programming in Cilk. Matteo Frigo
Multithreaded Programming in Cilk Matteo Frigo Multicore challanges Development time: Will you get your product out in time? Where will you find enough parallel-programming talent? Will you be forced to
More informationPlan. Introduction to Multicore Programming. Plan. University of Western Ontario, London, Ontario (Canada) Multi-core processor CPU Coherence
Plan Introduction to Multicore Programming Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 3101 1 Multi-core Architecture 2 Race Conditions and Cilkscreen (Moreno Maza) Introduction
More informationMultithreaded Programming in. Cilk LECTURE 1. Charles E. Leiserson
Multithreaded Programming in Cilk LECTURE 1 Charles E. Leiserson Supercomputing Technologies Research Group Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology
More informationCilk Plus GETTING STARTED
Cilk Plus GETTING STARTED Overview Fundamentals of Cilk Plus Hyperobjects Compiler Support Case Study 3/17/2015 CHRIS SZALWINSKI 2 Fundamentals of Cilk Plus Terminology Execution Model Language Extensions
More informationA Primer on Scheduling Fork-Join Parallelism with Work Stealing
Doc. No.: N3872 Date: 2014-01-15 Reply to: Arch Robison A Primer on Scheduling Fork-Join Parallelism with Work Stealing This paper is a primer, not a proposal, on some issues related to implementing fork-join
More informationSynchronizing without Locks Charles E. Leiserson Charles E. Leiserson 2
OUTLINE 6.172 Performance Engineering of Software Systems Lecture 15 Synchronizing without Locks Charles E. Leiserson November 5, 2009 Memory Consistency Lock-Free Protocols The ABA Problem Reducer Hyperobjects
More informationShared-memory Parallel Programming with Cilk Plus (Parts 2-3)
Shared-memory Parallel Programming with Cilk Plus (Parts 2-3) John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 5-6 24,26 January 2017 Last Thursday
More informationPlan. 1 Parallelism Complexity Measures. 2 cilk for Loops. 3 Scheduling Theory and Implementation. 4 Measuring Parallelism in Practice
lan Multithreaded arallelism and erformance Measures Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 02 - CS 9535 arallelism Complexity Measures 2 cilk for Loops 3 Measuring
More informationPlan. 1 Parallelism Complexity Measures. 2 cilk for Loops. 3 Scheduling Theory and Implementation. 4 Measuring Parallelism in Practice
lan Multithreaded arallelism and erformance Measures Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 4435 - CS 9624 1 2 cilk for Loops 3 4 Measuring arallelism in ractice 5
More informationProject 3. Building a parallelism profiler for Cilk computations CSE 539. Assigned: 03/17/2015 Due Date: 03/27/2015
CSE 539 Project 3 Assigned: 03/17/2015 Due Date: 03/27/2015 Building a parallelism profiler for Cilk computations In this project, you will implement a simple serial tool for Cilk programs a parallelism
More informationMetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores
MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores Presented by Xiaohui Chen Joint work with Marc Moreno Maza, Sushek Shekar & Priya Unnikrishnan University of Western Ontario,
More informationPerformance Optimization Part 1: Work Distribution and Scheduling
Lecture 5: Performance Optimization Part 1: Work Distribution and Scheduling Parallel Computer Architecture and Programming CMU 15-418/15-618, Fall 2017 Programming for high performance Optimizing the
More informationCS CS9535: An Overview of Parallel Computing
CS4403 - CS9535: An Overview of Parallel Computing Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) January 10, 2017 Plan 1 Hardware 2 Types of Parallelism 3 Concurrency Platforms:
More informationEffective Performance Measurement and Analysis of Multithreaded Applications
Effective Performance Measurement and Analysis of Multithreaded Applications Nathan Tallent John Mellor-Crummey Rice University CSCaDS hpctoolkit.org Wanted: Multicore Programming Models Simple well-defined
More informationPlan. Introduction to Multicore Programming. Plan. University of Western Ontario, London, Ontario (Canada) Marc Moreno Maza CS CS 9624
Plan Introduction to Multicore Programming Marc Moreno Maza University of Western Ontario, London, Ontario (Canada) CS 4435 - CS 9624 1 Multi-core Architecture Multi-core processor CPU Cache CPU Coherence
More informationTable of Contents. Cilk
Table of Contents 212 Introduction to Parallelism Introduction to Programming Models Shared Memory Programming Message Passing Programming Shared Memory Models Cilk TBB HPF Chapel Fortress Stapl PGAS Languages
More informationMultithreaded Parallelism on Multicore Architectures
Multithreaded Parallelism on Multicore Architectures Marc Moreno Maza University of Western Ontario, Canada CS2101 March 2012 Plan 1 Multicore programming Multicore architectures 2 Cilk / Cilk++ / Cilk
More informationCilk, Matrix Multiplication, and Sorting
6.895 Theory of Parallel Systems Lecture 2 Lecturer: Charles Leiserson Cilk, Matrix Multiplication, and Sorting Lecture Summary 1. Parallel Processing With Cilk This section provides a brief introduction
More information1 Optimizing parallel iterative graph computation
May 15, 2012 1 Optimizing parallel iterative graph computation I propose to develop a deterministic parallel framework for performing iterative computation on a graph which schedules work on vertices based
More informationCost Model: Work, Span and Parallelism
CSE 539 01/15/2015 Cost Model: Work, Span and Parallelism Lecture 2 Scribe: Angelina Lee Outline of this lecture: 1. Overview of Cilk 2. The dag computation model 3. Performance measures 4. A simple greedy
More informationEfficient Work Stealing for Fine-Grained Parallelism
Efficient Work Stealing for Fine-Grained Parallelism Karl-Filip Faxén Swedish Institute of Computer Science November 26, 2009 Task parallel fib in Wool TASK 1( int, fib, int, n ) { if( n
More informationMultithreaded Algorithms Part 1. Dept. of Computer Science & Eng University of Moratuwa
CS4460 Advanced d Algorithms Batch 08, L4S2 Lecture 11 Multithreaded Algorithms Part 1 N. H. N. D. de Silva Dept. of Computer Science & Eng University of Moratuwa Announcements Last topic discussed is
More informationUni-Address Threads: Scalable Thread Management for RDMA-based Work Stealing
Uni-Address Threads: Scalable Thread Management for RDMA-based Work Stealing Shigeki Akiyama, Kenjiro Taura The University of Tokyo June 17, 2015 HPDC 15 Lightweight Threads Lightweight threads enable
More informationPerformance Optimization Part 1: Work Distribution and Scheduling
( how to be l33t ) Lecture 6: Performance Optimization Part 1: Work Distribution and Scheduling Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2017 Tunes Kaleo Way Down We Go
More informationMemory-Mapping Support for Reducer Hyperobjects
Memory-Mapping Support for Reducer Hyperobjects I-Ting Angelina Lee MIT CSAIL 32 Vassar Street Cambridge, MA 2139 USA angelee@csail.mit.edu Aamir Shafi National University of Sciences and Technology Sector
More informationAtomic Transactions in Cilk Project Presentation 12/1/03
Atomic Transactions in Cilk 6.895 Project Presentation 12/1/03 Data Races and Nondeterminism int x = 0; 1: read x 1: write x time cilk void increment() { x = x + 1; cilk int main() { spawn increment();
More informationMoore s Law. Multicore Programming. Vendor Solution. Power Density. Parallelism and Performance MIT Lecture 11 1.
Moore s Law 1000000 Intel CPU Introductions 6.172 Performance Engineering of Software Systems Lecture 11 Multicore Programming Charles E. Leiserson 100000 10000 1000 100 10 Clock Speed (MHz) Transistors