CILK/CILK++ AND REDUCERS YUNMING ZHANG RICE UNIVERSITY



2 OUTLINE
  - CILK and CILK++ language features and usages
  - Work stealing runtime
  - CILK++ reducers
  - Conclusions

3 IDEALIZED SHARED MEMORY ARCHITECTURE
  Hardware model
    - Processors
    - Shared global memory
  Software model
    - Threads
    - Shared variables
    - Communication
    - Synchronization
  (Slide from Comp 422, Rice University, Lecture 4)

4 CILK AND CILK++ DESIGN GOALS
  Programmer friendly
    - Dynamic tasking
    - Parallel extension to C
  Scalable performance
    - Efficient runtime system
    - Minimum program overhead

5 CILK KEYWORDS
  - cilk: declares a Cilk function
  - spawn: the call can execute asynchronously in a concurrent thread
  - sync: the current thread waits for all locally spawned functions

6 CILK EXAMPLE
    cilk int fib(int n) {
        if (n < 2) {
            return n;
        } else {
            int n1, n2;
            n1 = spawn fib(n-1);   /* child may run in parallel */
            n2 = spawn fib(n-2);
            sync;                  /* wait for both spawned calls */
            return (n1 + n2);
        }
    }
  (Borrowed from Comp 422, Rice University, Lecture 4)

7 CILK++ EXAMPLE
    int fib(int n) {
        if (n < 2) {
            return n;
        } else {
            int n1, n2;
            n1 = cilk_spawn fib(n-1);   // child may run in parallel
            n2 = fib(n-2);              // ordinary call in the parent
            cilk_sync;                  // wait for the spawned call
            return (n1 + n2);
        }
    }
  (Borrowed from Comp 422, Rice University, Lecture 4)

8 CILK++ EXAMPLE WITH DAG (figure omitted)
  Pictures from the talk "Reducers and Other Cilk++ Hyperobjects" by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), and Stephen Lewin-Berlin (Intel).


10 WORK FIRST PRINCIPLE
  - Work: $T_1$
  - Critical path length: $T_\infty$
  - Number of processors: $P$
  - Expected time: $T_P = T_1/P + O(T_\infty)$
  - Parallel slackness assumption: $T_1/P \gg c_\infty T_\infty$

11 WORK FIRST PRINCIPLE
  - Minimize the scheduling overhead borne by the work, even at the expense of increasing the critical path
  - $T_P \le c_1 T_S/P + c_\infty T_\infty \approx c_1 T_S/P$
  - Minimize $c_1$, even at the expense of a larger $c_\infty$
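
As a rough numeric illustration of the bound on this slide, the following C++ sketch evaluates c_1 T_S/P + c_inf T_inf for caller-supplied constants. The function names `predicted_time` and `work_term` and every sample value are assumptions made for illustration; none of them come from the slides or from measurements.

```cpp
#include <cassert>

// Upper-bound model from the slide: T_P <= c1 * T_S / P + c_inf * T_inf.
// Names and parameters are hypothetical, chosen only for this sketch.
double predicted_time(double c1, double t_serial, double processors,
                      double c_inf, double t_inf) {
    assert(processors > 0.0);
    return c1 * t_serial / processors + c_inf * t_inf;
}

// The work term alone: the approximation that remains when the
// parallel slackness assumption makes the critical-path term negligible.
double work_term(double c1, double t_serial, double processors) {
    assert(processors > 0.0);
    return c1 * t_serial / processors;
}
```

With the made-up inputs c1 = 2, T_S = 1000, P = 4, c_inf = 10 and T_inf = 1, `predicted_time` gives 510 while `work_term` gives 500. The work term dominates, which is why the principle accepts a larger c_inf in exchange for a smaller c1.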

12 WORK STEALING DESIGN GOALS
  Minimize contention
    - Decentralized task deque
    - Doubly linked deque
  Minimize communication
    - Steal work rather than push work
  Load balance across cores
    - Lazy task creation
    - Steal from the top of the deque
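
The deque discipline listed above can be sketched in standard C++ as follows. This is a single-threaded illustration only: a real scheduler needs synchronization between the owner and the thieves, which is deliberately left out here. `WorkerDeque` and its member names are hypothetical and are not taken from the Cilk runtime.

```cpp
#include <cassert>
#include <deque>

// One deque per worker (decentralized). front() is the top (oldest work),
// back() is the bottom (newest work). No locking: illustration only.
struct WorkerDeque {
    std::deque<int> tasks;

    // The owner pushes newly created work at the bottom.
    void push_bottom(int task) { tasks.push_back(task); }

    // The owner takes back its most recent work from the bottom.
    bool pop_bottom(int& out) {
        if (tasks.empty()) return false;
        out = tasks.back();
        tasks.pop_back();
        return true;
    }

    // A thief takes the oldest work from the top.
    bool steal_top(int& out) {
        if (tasks.empty()) return false;
        out = tasks.front();
        tasks.pop_front();
        return true;
    }
};
```

Because the owner and the thieves work at opposite ends, they only meet when the deque is nearly empty, which is one way to read the "minimize contention" goal.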

13-26 CILK WORK STEALING SCHEDULER (sequence of picture-only slides; figures omitted)
  Pictures from the talk "Reducers and Other Cilk++ Hyperobjects" by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), and Stephen Lewin-Berlin (Intel).

27 TWO CLONE STRATEGY
  Fast clone
    - Identical in most respects to the C elision of the Cilk program
    - Very little execution overhead
    - Sync statements compile to no-ops
    - Allocates a continuation (program variables and instruction pointer)
  Slow clone
    - A spawned procedure is converted to a slow clone only when it is stolen
    - Restores program state from the activation frame, which contains the local variables, the program counter, and other parts of the procedure instance

28 FAST CLONE

29 SLOW CLONE
    slow_fib(frame *_cilk_frame) {
        switch (_cilk_frame->header.entry) {
            case 1: goto _cilk_sync1;   /* resume after fast_fib(_cilk_frame->n - 1) */
            case 2: goto _cilk_sync2;   /* resume after fast_fib(_cilk_frame->n - 2) */
            case 3: goto _cilk_sync3;   /* resume after the sync (not a no-op) */
        }
    }

30 FRAMES
  C++ main frame
    - Local variables of the procedure instance
    - Temporary variables
    - Linkage information for return values

31 FRAMES
  CILK++ stack frame
    - Everything in the C++ main frame
    - Continuation
    - Parent pointer
    - Has exactly one child
    - Used by the fast clone
    - A worker can have multiple stack frames

32 FRAMES
  CILK++ full frame (used by the slow clone)
    - Everything in the CILK++ stack frame
    - Lock
    - Join counter
    - List of children (can have more than one child)
    - A worker has at most one full frame

33 EXTENDED DEQUE WITH CALL STACKS (figure omitted: an extended deque of call stacks; legend: stack frame, full frame)

34 FUNCTION CALL Extended deque before the function call (figure omitted). Every slide from 34 to 51 repeats a sidebar listing the runtime operations (function call, spawn, call return, spawn return, sync, randomly steal, provably good steal, unconditionally steal, resume full frame) and a figure legend (stack frame, full frame).

35 FUNCTION CALL Extended deque after the function call. Annotation: new stack frame.

36 SPAWN Extended deque before the spawn.

37 SPAWN Extended deque after the spawn. Annotation: set the continuation in the last stack frame.

38 RESUME FULL FRAME Set the full frame to be the only frame in the call stack, then resume execution at the continuation.

39-41 RANDOMLY STEAL Three diagram slides. Annotation: steal this call stack.

42 PROVABLY GOOD STEAL Diagram slide; the figure shows the value 0.

43 UNCONDITIONALLY STEAL Diagram slide; the figure shows the value 2.

44 FUNCTION CALL RETURN Extended deque before return from a call, case 1.

45 FUNCTION CALL RETURN Extended deque after return from a call, case 1.

46 FUNCTION CALL RETURN Return from a call, case 2: the worker executes an unconditional steal.

47 SPAWN RETURN Extended deque before spawn return, case 1.

48 SPAWN RETURN Extended deque after spawn return, case 1.

49 SPAWN RETURN Return from a spawn, case 2: the worker executes a provably good steal.

50 SYNC Case 1: do nothing if it is a stack frame (no-op).

51 SYNC Case 2: pop the frame, then attempt a provably good steal.


53 PROBLEMS WITH NON-LOCAL VARIABLES
    bool has_property(Node *);
    List<Node *> output_list;
    void walk(Node *x)
    {
        if (x) {
            if (has_property(x))
                output_list.push_back(x);   // parallel strands race on the shared list
            cilk_spawn walk(x->left);
            walk(x->right);
            cilk_sync;
        }
    }

54 REDUCER DESIGN GOALS
  - Support parallelization of programs containing global variables
  - Enable efficient parallel scaling by avoiding a single point of contention
  - Provide deterministic results for associative reduce operations
  - Operate independently of any control constructs

55 REDUCER EXAMPLE
    bool has_property(Node *);
    List_append_reducer<Node *> output_list;
    void walk(Node *x)
    {
        if (x) {
            if (has_property(x))
                output_list.push_back(x);   // updates the strand's own view
            cilk_spawn walk(x->left);
            walk(x->right);
            cilk_sync;
        }
    }

56 HYPER OBJECTS (figure omitted; picture from the "Reducers and Other Cilk++ Hyperobjects" talk by Frigo, Halpern, Leiserson, and Lewin-Berlin)

57 REDUCER (figure omitted; picture from the same talk)

58 SEMANTICS OF REDUCERS
  1. The child strand owns the view that the parent function owned before cilk_spawn.
  2. The parent strand owns a new view, initialized to the identity view e. A special optimization applies when a view is unchanged by being combined with the identity view.
  3. The parent strand P owns the view from completed child strands.
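
The three rules above can be simulated serially in standard C++. This is a minimal sketch, not Cilk++ code: the `steal` flag is a hypothetical switch that pretends the parent's continuation is stolen at every spawn, a view is modeled as a `std::list`, and the reduce operation is list concatenation. `Node`, `View`, and `walk` are illustrative names.

```cpp
#include <cassert>
#include <list>
#include <string>

struct Node {
    std::string name;
    const Node* left;
    const Node* right;
};

using View = std::list<std::string>;

// Serial simulation of reducer views for a tree walk.
// With steal == true, every "spawn" hands the current view to the child
// (rule 1), gives the parent's continuation a fresh identity view, here an
// empty list (rule 2), and concatenates child view then parent view at the
// "sync" (rule 3).
void walk(const Node* x, View& view, bool steal) {
    if (!x) return;
    view.push_back(x->name);
    if (steal) {
        View parent_view;                      // identity view for the continuation
        walk(x->left, view, steal);            // child strand keeps the old view
        walk(x->right, parent_view, steal);    // continuation uses the new view
        view.splice(view.end(), parent_view);  // reduce: child view, then parent view
    } else {
        walk(x->left, view, steal);            // no steal: one view throughout
        walk(x->right, view, steal);
    }
}
```

For a five-node tree with root a, children b and c, and b's children d and e, both modes produce the list a, b, d, e, c. Concatenation is associative and the views are combined in serial order, so the result matches the serial execution no matter where the simulated steals occur.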

59-60 REDUCING OVER LIST CONCATENATION (two picture-only slides; figures omitted; pictures from the "Reducers and Other Cilk++ Hyperobjects" talk by Frigo, Halpern, Leiserson, and Lewin-Berlin)

61 IMPLEMENTATION OF REDUCER
  - Each worker maintains a hypermap
  - Hypermap: maps reducers to their views
    - User: the view of the current procedure
    - Children: the view of the children procedures
    - Right: the view of the right sibling
    - Identity: the default value of a view

62 UNDERSTANDING HYPERMAPS
    bool has_property(Node *);
    List_append_reducer<Node *> output_list;
    void walk(Node *x)                      // proc A
    {
        if (x) {
            if (has_property(x))
                output_list.push_back(x);
            cilk_spawn walk(x->left);       // proc B
            cilk_spawn walk(x->right);      // proc C
            cilk_sync;
        }
    }

63 LAZY CREATION
  - A new view is created only after a steal
  - Views are created on demand

64-68 HYPERMAP CREATION (five picture-only slides; figures omitted; pictures from the "Reducers and Other Cilk++ Hyperobjects" talk by Frigo, Halpern, Leiserson, and Lewin-Berlin)

69 LOOK UP FAILURE
  - Inserts a view containing an identity element for the reducer into the hypermap, following the lazy principle
  - The lookup returns the newly inserted identity view
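
The lookup-failure rule can be sketched with an ordinary hash map. This is only an illustration of the lazy principle, not the runtime's data structure: `Hypermap`, `ReducerKey`, `View`, and `lookup` are hypothetical names, and the identity view is assumed to be an empty vector.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

using ReducerKey = int;         // stand-in for a reducer's identity
using View = std::vector<int>;  // stand-in for a view; identity = empty vector

struct Hypermap {
    std::unordered_map<ReducerKey, View> views;

    // On a miss, insert an identity view and return it; on a hit, return
    // the existing view. No view exists until someone asks for it.
    View& lookup(ReducerKey reducer) {
        auto it = views.find(reducer);
        if (it == views.end()) {
            it = views.emplace(reducer, View{}).first;
        }
        return it->second;
    }
};
```

A fresh `Hypermap` holds no views. The first `lookup(7)` creates an empty view for reducer 7, and later lookups of the same key return that same view, including anything appended to it in the meantime.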

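70 RANDOM WORK STEALING
  A random steal operation steals a full frame P and replaces it with a new full frame C in the victim. The hypermaps are updated as follows:
  $\mathrm{USER}_C \leftarrow \mathrm{USER}_P$; $\mathrm{USER}_P \leftarrow \emptyset$; $\mathrm{CHILDREN}_P \leftarrow \emptyset$; $\mathrm{RIGHT}_P \leftarrow \emptyset$.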

71 RANDOM WORK STEALING (figure omitted; picture from the "Reducers and Other Cilk++ Hyperobjects" talk by Frigo, Halpern, Leiserson, and Lewin-Berlin)

72-77 RETURN FROM A CALL
  Let C be a child frame of the parent frame P that originally called C, and suppose that C returns.
  - If C is a stack frame, do nothing.
  - If C is a full frame, transfer ownership of the view: $\mathrm{USER}_P \leftarrow \mathrm{USER}_C$. Children and Right are empty.
  (Slides 73, 74, and 76 repeat the extended-deque diagrams for a call return: case 1 before and after the return, and case 2, in which the worker executes an unconditional steal.)

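78 RETURN FROM A SPAWN
  Let C be a child frame of the parent frame P that originally spawned C, and suppose that C returns.
  - If C is a stack frame, do nothing.
  - If C is a full frame, always do $\mathrm{USER}_C \leftarrow \mathrm{REDUCE}(\mathrm{USER}_C, \mathrm{RIGHT}_C)$, then:
    - if C has a left sibling L, $\mathrm{RIGHT}_L \leftarrow \mathrm{REDUCE}(\mathrm{RIGHT}_L, \mathrm{USER}_C)$;
    - otherwise C is the leftmost child, and $\mathrm{CHILDREN}_P \leftarrow \mathrm{REDUCE}(\mathrm{CHILDREN}_P, \mathrm{USER}_C)$.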

79-81 RETURN FROM A SPAWN EXAMPLE (the same listing is shown on three consecutive slides)
    bool has_property(Node *);
    List_append_reducer<Node *> output_list;
    void walk(Node *x)                      // proc A
    {
        if (x) {
            if (has_property(x))
                output_list.push_back(x);
            cilk_spawn walk(x->left);       // proc B
            cilk_spawn walk(x->right);      // proc C
            cilk_sync;
        }
    }

82 SYNC
  A cilk_sync statement waits until all children have completed. When frame P executes a cilk_sync, one of the following two cases applies:
  - If P is a stack frame, do nothing.
  - If P is a full frame, $\mathrm{USER}_P \leftarrow \mathrm{REDUCE}(\mathrm{CHILDREN}_P, \mathrm{USER}_P)$.
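
The update rules for a random steal, a return from a spawn, and a sync can be traced with a small scripted simulation. This is a sketch under strong simplifying assumptions: each hypermap is collapsed to a single string view for one reducer, REDUCE is string concatenation, the empty hypermap is the empty string, and all frames are full frames. `Frame`, `reduce_views`, `random_steal`, `return_from_spawn`, and `sync_frame` are hypothetical names, and nothing here runs concurrently.

```cpp
#include <cassert>
#include <string>

// One full frame with its three hypermaps, each reduced to a single view.
struct Frame {
    std::string user, children, right;
};

// REDUCE for this sketch: ordered concatenation (associative, identity "").
std::string reduce_views(const std::string& lhs, const std::string& rhs) {
    return lhs + rhs;
}

// Random steal: P is stolen, the new full frame C in the victim takes P's view.
void random_steal(Frame& p, Frame& c) {
    c.user = p.user;
    p.user.clear();
    p.children.clear();
    p.right.clear();
}

// Return from a spawn; left_sibling is null when C is the leftmost child.
void return_from_spawn(Frame& c, Frame* left_sibling, Frame& parent) {
    c.user = reduce_views(c.user, c.right);
    if (left_sibling) {
        left_sibling->right = reduce_views(left_sibling->right, c.user);
    } else {
        parent.children = reduce_views(parent.children, c.user);
    }
}

// Sync on a full frame P.
void sync_frame(Frame& p) {
    p.user = reduce_views(p.children, p.user);
}
```

One possible schedule for procs A, B, and C from the example: A appends "A" and spawns B, the continuation of A is stolen (`random_steal(a, b)`), B appends "B", A spawns C and is stolen again (`random_steal(a, c)`), and C appends "C". C then returns with left sibling B, B returns as the leftmost child, and A syncs. Under these assumptions A's view ends as "ABC", the serial order.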

83 BENEFITS OF REDUCERS


85 CONCLUSIONS
  CILK and CILK++ provide a programmer-friendly programming model
    - Extension to C
    - Incremental parallelism
    - Scaling on future machines
  Non-compromising performance
    - Work stealing runtime
    - Minimizing overheads
    - Reducers

86 FINAL NOTES
  - Designed for an idealized shared memory model
    - Today's architectures are typically NUMA
  - Task creation can be lazier (link truncated in the source: arnumber= &tag=1)
  - cilk_for: divide and conquer parallelization


More information

Brushing the Locks out of the Fur: A Lock-Free Work Stealing Library Based on Wool

Brushing the Locks out of the Fur: A Lock-Free Work Stealing Library Based on Wool Brushing the Locks out of the Fur: A Lock-Free Work Stealing Library Based on Wool Håkan Sundell School of Business and Informatics University of Borås, 50 90 Borås E-mail: Hakan.Sundell@hb.se Philippas

More information

Today: Amortized Analysis (examples) Multithreaded Algs.

Today: Amortized Analysis (examples) Multithreaded Algs. Today: Amortized Analysis (examples) Multithreaded Algs. COSC 581, Algorithms March 11, 2014 Many of these slides are adapted from several online sources Reading Assignments Today s class: Chapter 17 (Amortized

More information

The Cilk++ Concurrency Platform

The Cilk++ Concurrency Platform The Cilk++ Concurrency Platform The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation As Published Publisher Leiserson, Charles

More information

CS 140 : Numerical Examples on Shared Memory with Cilk++

CS 140 : Numerical Examples on Shared Memory with Cilk++ CS 140 : Numerical Examples on Shared Memory with Cilk++ Matrix-matrix multiplication Matrix-vector multiplication Hyperobjects Thanks to Charles E. Leiserson for some of these slides 1 Work and Span (Recap)

More information
