CILK/CILK++ AND REDUCERS YUNMING ZHANG RICE UNIVERSITY

1 CILK/CILK++ AND REDUCERS YUNMING ZHANG RICE UNIVERSITY 1

2 OUTLINE
- CILK and CILK++ language features and usages
- Work-stealing runtime
- CILK++ reducers
- Conclusions

3 IDEALIZED SHARED MEMORY ARCHITECTURE
Hardware model
- Processors
- Shared global memory
Software model
- Threads
- Shared variables
- Communication
- Synchronization
Slide from Comp 422 Rice University Lecture 4

4 CILK AND CILK++ DESIGN GOALS
Programmer friendly
- Dynamic tasking
- Parallel extension to C
Scalable performance
- Efficient runtime system
- Minimum program overhead

5 CILK KEYWORDS
- cilk: marks a Cilk function
- spawn: the call can execute asynchronously in a concurrent thread
- sync: the current thread waits for all locally spawned functions

6 CILK EXAMPLE
    cilk int fib(int n) {
        if (n < 2) {
            return n;
        } else {
            int n1, n2;
            n1 = spawn fib(n-1);
            n2 = spawn fib(n-2);
            sync;
            return (n1 + n2);
        }
    }
Borrowed from Comp 422 Rice University Lecture 4

7 CILK++ EXAMPLE
    int fib(int n) {
        if (n < 2) {
            return n;
        } else {
            int n1, n2;
            n1 = cilk_spawn fib(n-1);
            n2 = fib(n-2);
            cilk_sync;
            return (n1 + n2);
        }
    }
Borrowed from Comp 422 Rice University Lecture 4
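Removing the two keywords from the Cilk++ example leaves an ordinary serial program, usually called the serial elision. The sketch below assumes a plain C++ compiler with no Cilk support, so it defines the keywords away as empty macros; the macro trick is an illustration only, not how a Cilk++ compiler treats them.

```cpp
#include <cassert>

// Assumption: no Cilk++ compiler is available, so the keywords are
// defined away. What remains is the serial elision of the example.
#define cilk_spawn
#define cilk_sync

int fib(int n) {
    assert(n >= 0);
    if (n < 2) {
        return n;
    } else {
        int n1, n2;
        n1 = cilk_spawn fib(n - 1);  // elided: an ordinary call
        n2 = fib(n - 2);
        cilk_sync;                   // elided: an empty statement
        return n1 + n2;
    }
}
```

Calling `fib(10)` on this elided version returns the same value a parallel run would, since the keywords only grant permission for parallel execution.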

8 CILK++ EXAMPLE WITH DAG Pictures from the talk "Reducers and Other Cilk++ Hyperobjects" by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), and Stephen Lewin-Berlin (Intel).

10 WORK FIRST PRINCIPLE
- Work: $T_1$
- Critical path length: $T_\infty$
- Number of processors: $P$
- Expected time: $T_P = T_1/P + O(T_\infty)$
- Parallel slackness assumption: $T_1/P \gg c_\infty T_\infty$

11 WORK FIRST PRINCIPLE
- Minimize the scheduling overhead borne by the work, even at the expense of a longer critical path
- $T_P \le c_1 T_S/P + c_\infty T_\infty \approx c_1 T_S/P$
- Minimize $c_1$, even at the expense of a larger $c_\infty$
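To see why the trade favors a small work overhead, the bound can be evaluated for two hypothetical runtime designs. The function name `time_bound` and all numeric values below are assumptions chosen for easy arithmetic, not measurements.

```cpp
#include <cassert>

// Evaluates the bound c1 * T_S / P + c_inf * T_inf from the slide.
// All inputs are in the same (arbitrary) time unit.
double time_bound(double c1, double t_s, double c_inf, double t_inf, int p) {
    assert(p > 0);
    return c1 * t_s / p + c_inf * t_inf;
}

// Hypothetical comparison with T_S = 1024, T_inf = 0.5, P = 8:
//   design A: c1 = 2, c_inf = 16
//   design B: c1 = 1, c_inf = 64 (half the work overhead,
//             four times the critical-path overhead)
double design_a() { return time_bound(2.0, 1024.0, 16.0, 0.5, 8); }
double design_b() { return time_bound(1.0, 1024.0, 64.0, 0.5, 8); }
```

With these numbers design A gives 264 and design B gives 160, so under parallel slackness the work term dominates and halving $c_1$ pays off even though $c_\infty$ quadruples.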

12 WORK STEALING DESIGN GOALS
Minimize contention
- Decentralized task deque
- Doubly linked deque
Minimize communication
- Steal work rather than push work
Load balance across cores
- Lazy task creation
- Steal from the top of the deque
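The two ends of a per-worker deque can be sketched as follows. This is a single-threaded model with hypothetical names (`WorkDeque`, `push_bottom`, `pop_bottom`, `steal_top`); a real runtime needs synchronization between the owner and thieves, which is omitted here.

```cpp
#include <cassert>
#include <deque>

// Hypothetical single-threaded model of one worker's task deque.
// Tasks are plain integers to keep the sketch small.
struct WorkDeque {
    std::deque<int> tasks;  // front = top (oldest), back = bottom (newest)

    // The owner pushes newly created work at the bottom.
    void push_bottom(int task) { tasks.push_back(task); }

    // The owner resumes the most recently pushed task.
    bool pop_bottom(int& out) {
        if (tasks.empty()) return false;
        out = tasks.back();
        tasks.pop_back();
        return true;
    }

    // A thief takes the oldest task from the opposite end.
    bool steal_top(int& out) {
        if (tasks.empty()) return false;
        out = tasks.front();
        tasks.pop_front();
        return true;
    }
};
```

Because the owner and the thief operate on opposite ends, they interfere only when the deque is nearly empty, which is one way to read the "minimize contention" goal.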

13 CILK WORK STEALING SCHEDULER Pictures from the talk "Reducers and Other Cilk++ Hyperobjects" by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), and Stephen Lewin-Berlin (Intel).

27 TWO CLONE STRATEGY
Fast clone
- Identical in most respects to the C elision of the Cilk program
- Very little execution overhead
- Sync statements compile to no-ops
- Allocates a continuation (program variables and instruction pointer)
Slow clone
- A spawned procedure is converted to its slow clone only when it is stolen
- Restores program state from the activation frame, which contains the local variables, program counter, and other parts of the procedure instance

28 FAST CLONE 28

29 SLOW CLONE
    slow_fib(frame *_cilk_frame) {
        /* jump to the entry point saved in the frame */
        switch (_cilk_frame->header.entry) {
            case 1: goto _cilk_sync1;
            case 2: goto _cilk_sync2;
            case 3: goto _cilk_sync3;
        }
        fast_fib(_cilk_frame->n - 1);
    _cilk_sync1:
        fast_fib(_cilk_frame->n - 2);
    _cilk_sync2:
        sync (not a no-op)
    _cilk_sync3:
        ;
    }

30 FRAMES
C++ Main Frame
- Local variables of the procedure instance
- Temporary variables
- Linkage information for return values

31 FRAMES
CILK++ Stack Frame
- Everything in a C++ Main Frame
- Continuation
- Parent pointer
- Has exactly one child
- Used by the fast clone
- A worker can have multiple stack frames

32 FRAMES
CILK++ Full Frame (used by the slow clone)
- Everything in a CILK++ Stack Frame
- Lock
- Join counter
- List of children (may have more than one child)
- A worker has at most one full frame
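The three frame kinds nest, which a few C++ structs can make concrete. Every type and field name below is hypothetical and chosen only to mirror the bullet points; the real runtime's layout is not shown on the slides.

```cpp
#include <cassert>
#include <mutex>
#include <vector>

// Hypothetical layouts mirroring the three frame slides.
struct MainFrame {                    // what an ordinary C++ frame holds
    std::vector<int> locals;          // local and temporary variables (placeholder type)
    void* return_linkage = nullptr;   // linkage information for return values
};

struct StackFrame : MainFrame {       // used by the fast clone
    void* continuation = nullptr;     // where to resume after a spawn
    StackFrame* parent = nullptr;     // parent pointer
};

struct FullFrame : StackFrame {       // used by the slow clone
    std::mutex lock;                  // protects the fields below
    int join_counter = 0;             // children that have not yet returned
    std::vector<StackFrame*> children;
};
```

A stack frame stays cheap because it adds only two pointers; the lock, join counter, and child list are paid for only when a frame is promoted to a full frame.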

33 EXTENDED DEQUE WITH CALL STACKS [Figure: extended deque of call stacks; legend: stack frame, full frame]

34 FUNCTION CALL [Figure: extended deque before a function call. Slide menu: function call, spawn, call return, spawn return, sync, randomly steal, provably good steal, unconditionally steal, resume full frame]

35 FUNCTION CALL [Figure: extended deque after a function call; annotation: new stack frame]

36 SPAWN [Figure: extended deque before a spawn call]

37 SPAWN [Figure: extended deque after a spawn call; annotation: set continuation in last stack frame]

38 RESUME FULL FRAME Set the full frame to be the only frame in the call stack, then resume execution on the continuation. [Figure: extended deque]

39 RANDOMLY STEAL [Figure: extended deque; annotation: steal this call stack]

40 RANDOMLY STEAL [Figure: extended deque; annotation: steal this call stack]

41 RANDOMLY STEAL [Figure: extended deque]

42 PROVABLY GOOD STEAL [Figure: extended deque; label: 0]

43 UNCONDITIONALLY STEAL [Figure: extended deque; label: 2]

44 FUNCTION CALL RETURN [Figure: extended deque before return from a call, case 1]

45 FUNCTION CALL RETURN [Figure: extended deque, return from a call, case 1]

46 FUNCTION CALL RETURN [Figure: extended deque, return from a call, case 2; annotation: worker executes an unconditional steal]

47 SPAWN RETURN [Figure: extended deque before spawn return, case 1]

48 SPAWN RETURN [Figure: extended deque after spawn return, case 1]

49 SPAWN RETURN [Figure: extended deque, return from a spawn, case 2; annotation: worker executes a provably good steal]

50 SYNC [Figure: extended deque, sync case 1; annotation: do nothing if it is a stack frame (no-op)]

51 SYNC [Figure: extended deque, sync case 2; annotation: pop the frame, provably good steal]

53 PROBLEMS WITH NON-LOCAL VARIABLES
    bool has_property(Node *);
    std::list<Node *> output_list;
    void walk(Node *x) {
        if (x) {
            if (has_property(x))
                output_list.push_back(x);  // strands running in parallel may race on this shared list
            cilk_spawn walk(x->left);
            walk(x->right);
            cilk_sync;
        }
    }

54 REDUCER DESIGN GOALS
- Support parallelization of programs containing global variables
- Enable efficient parallel scaling by avoiding a single point of contention
- Provide deterministic results for associative reduce operations
- Operate independently of any control constructs

55 REDUCER EXAMPLE
    bool has_property(Node *);
    List_append_reducer<Node *> output_list;
    void walk(Node *x) {
        if (x) {
            if (has_property(x))
                output_list.push_back(x);
            cilk_spawn walk(x->left);
            walk(x->right);
            cilk_sync;
        }
    }

56 HYPEROBJECTS Pictures from the talk "Reducers and Other Cilk++ Hyperobjects" by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), and Stephen Lewin-Berlin (Intel).

57 REDUCER Pictures from the talk "Reducers and Other Cilk++ Hyperobjects" by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), and Stephen Lewin-Berlin (Intel).

58 SEMANTICS OF REDUCERS
1. The child strand owns the view owned by the parent function before the cilk_spawn.
2. The parent strand owns a new view, initialized to the identity view e. A special optimization handles the case where a view is unchanged when combined with the identity view.
3. The parent strand P owns the view from the completed child strands.
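The ownership rules above can be imitated serially to check that the combined list comes out in serial order. The sketch below is a single-threaded simulation, not the Cilk++ reducer library: `View`, `reduce`, the integer `id` field, and the even-number predicate are all assumptions made for illustration, and every continuation is treated as if it had been stolen.

```cpp
#include <cassert>
#include <list>

struct Node {
    int id;
    const Node* left;
    const Node* right;
};

using View = std::list<int>;

// Associative reduce operation for list append: left operand first.
void reduce(View& left, View& right) { left.splice(left.end(), right); }

// Hypothetical predicate: keep nodes with an even id.
bool has_property(const Node* x) { return x->id % 2 == 0; }

void walk(const Node* x, View& view) {
    if (!x) return;
    if (has_property(x)) view.push_back(x->id);
    // "spawn": the child strand takes over the view the parent owned.
    walk(x->left, view);
    // The continuation runs on a fresh view initialized to the identity
    // (an empty list).
    View continuation_view;
    walk(x->right, continuation_view);
    // "sync": fold the continuation's view back in, child's view first.
    reduce(view, continuation_view);
}
```

Because `reduce` always places the child's view before the continuation's view, the result matches what a serial preorder walk would have appended.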

59 REDUCING OVER LIST CONCATENATION Pictures from the talk "Reducers and Other Cilk++ Hyperobjects" by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), and Stephen Lewin-Berlin (Intel).

61 IMPLEMENTATION OF REDUCER
- Each worker maintains a hypermap
- A hypermap maps reducers to their views
  User      The view of the current procedure
  Children  The view of the child procedures
  Right     The view of the right sibling
  Identity  The default value of a view

62 UNDERSTANDING HYPERMAPS
    bool has_property(Node *);
    List_append_reducer<Node *> output_list;
    void walk(Node *x)                  // proc A
    {
        if (x) {
            if (has_property(x))
                output_list.push_back(x);
            cilk_spawn walk(x->left);   // proc B
            cilk_spawn walk(x->right);  // proc C
            cilk_sync;
        }
    }

63 LAZY CREATION
- A new view is created only after a steal
- Views are created on demand

64 HYPERMAP CREATION Pictures from the talk "Reducers and Other Cilk++ Hyperobjects" by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), and Stephen Lewin-Berlin (Intel).

69 LOOK UP FAILURE
- Inserts a view containing an identity element for the reducer into the hypermap, following the lazy principle
- The lookup returns the newly inserted identity view
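The lookup-failure rule amounts to "insert the identity on a miss, then return it". A minimal sketch follows, assuming a list-append reducer whose identity is the empty list; the `Hypermap` type, its `lookup` method, and the use of the reducer's address as the key are illustrative choices, not the runtime's actual data structure.

```cpp
#include <cassert>
#include <list>
#include <unordered_map>

using View = std::list<int>;

// Hypothetical hypermap keyed by the reducer's address.
struct Hypermap {
    std::unordered_map<const void*, View> views;

    // On a miss, an identity view (an empty list for list append) is
    // inserted first, so the caller always gets a usable view back.
    View& lookup(const void* reducer) {
        auto it = views.find(reducer);
        if (it == views.end()) {
            it = views.emplace(reducer, View{}).first;
        }
        return it->second;
    }
};
```

A worker that never touches a given reducer never pays for a view of it, which is the lazy behavior described above.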

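70 RANDOM WORK STEALING
A random steal operation steals a full frame P and replaces it with a new full frame C in the victim.
- $\mathit{USER}_C \leftarrow \mathit{USER}_P$
- $\mathit{USER}_P \leftarrow \emptyset$
- $\mathit{CHILDREN}_P \leftarrow \emptyset$
- $\mathit{RIGHT}_P \leftarrow \emptyset$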

71 RANDOM WORK STEALING Pictures from the talk "Reducers and Other Cilk++ Hyperobjects" by Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), and Stephen Lewin-Berlin (Intel).

72 RETURN FROM A CALL Let C be a child frame of the parent frame P that originally called C, and suppose that C returns.

73 RETURN FROM A CALL [Figure: extended deque before return from a call, case 1]

74 RETURN FROM A CALL [Figure: extended deque, return from a call, case 1]

75 RETURN FROM A CALL
- If C is a stack frame, do nothing.

76 FUNCTION CALL RETURN [Figure: extended deque, return from a call, case 2; annotation: worker executes an unconditional steal]

77 RETURN FROM A CALL
- If C is a stack frame, do nothing.
- If C is a full frame, transfer ownership of the view: $\mathit{USER}_P \leftarrow \mathit{USER}_C$ (its Children and Right hypermaps are empty).

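78 RETURN FROM A SPAWN
Let C be a child frame of the parent frame P that originally spawned C, and suppose that C returns.
- If C is a stack frame, do nothing.
- If C is a full frame, always do $\mathit{USER}_C \leftarrow \mathrm{REDUCE}(\mathit{USER}_C, \mathit{RIGHT}_C)$, then:
  - If C has a left sibling L: $\mathit{RIGHT}_L \leftarrow \mathrm{REDUCE}(\mathit{RIGHT}_L, \mathit{USER}_C)$
  - If C is the leftmost child: $\mathit{CHILDREN}_P \leftarrow \mathrm{REDUCE}(\mathit{CHILDREN}_P, \mathit{USER}_C)$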

79 RETURN FROM A SPAWN EXAMPLE
    bool has_property(Node *);
    List_append_reducer<Node *> output_list;
    void walk(Node *x)                  // proc A
    {
        if (x) {
            if (has_property(x))
                output_list.push_back(x);
            cilk_spawn walk(x->left);   // proc B
            cilk_spawn walk(x->right);  // proc C
            cilk_sync;
        }
    }

82 SYNC
A cilk_sync statement waits until all children have completed. When frame P executes a cilk_sync, one of the following two cases applies:
- If P is a stack frame, do nothing.
- If P is a full frame, $\mathit{USER}_P \leftarrow \mathrm{REDUCE}(\mathit{CHILDREN}_P, \mathit{USER}_P)$.
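The view updates for a steal, a spawn return, and a sync can be traced by hand for a single list-append reducer. The sketch below is a serial model under stated assumptions: `Frame` holds the views of one reducer only (a real hypermap maps many reducers to views), all function names are hypothetical, and no locking or scheduling is modeled.

```cpp
#include <cassert>
#include <list>
#include <utility>

using View = std::list<int>;

// REDUCE for list append: left operand first, preserving serial order.
View reduce(View left, View right) {
    left.splice(left.end(), right);
    return left;
}

// Hypothetical full frame holding the three views of ONE reducer.
struct Frame {
    View user, children, right;
};

// Random steal: the new frame C left in the victim takes over P's user
// view; P continues in the thief with empty views.
void on_steal(Frame& p, Frame& c) {
    c.user = std::move(p.user);
    p.user.clear();
    p.children.clear();
    p.right.clear();
}

// Spawn return where full frame C is the leftmost child of P.
void on_spawn_return_leftmost(Frame& c, Frame& p) {
    c.user = reduce(std::move(c.user), std::move(c.right));
    p.children = reduce(std::move(p.children), std::move(c.user));
}

// Spawn return where full frame C has a left sibling L.
void on_spawn_return_with_sibling(Frame& c, Frame& l) {
    c.user = reduce(std::move(c.user), std::move(c.right));
    l.right = reduce(std::move(l.right), std::move(c.user));
}

// Sync in full frame P: children's results go before P's own view.
void on_sync(Frame& p) {
    p.user = reduce(std::move(p.children), std::move(p.user));
    p.children.clear();
}
```

For example, if P holds `[1]` when it is stolen, the child then appends 2 and the stolen continuation appends 3, the spawn return followed by the sync leaves P with `[1, 2, 3]`, the same order a serial run would produce.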

83 BENEFITS OF REDUCERS 83

85 CONCLUSIONS
CILK and CILK++ provide a programmer-friendly programming model
- Extension to C
- Incremental parallelism
- Scaling on future machines
Non-compromising performance
- Work-stealing runtime
- Minimizing overheads
- Reducers

86 FINAL NOTES
- Designed for an idealized shared memory model
- Today's architectures are typically NUMA
- Task creation can be lazier arnumber= &tag=1
- Cilk_for: divide and conquer parallelization
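The divide-and-conquer idea behind a parallel loop can be sketched as a recursive range split. This is a serial model in standard C++: `dc_for` and its `grain` parameter are hypothetical names, and the comments mark where a Cilk++ version would place its keywords.

```cpp
#include <cassert>

// Recursively halves [lo, hi) until a chunk is at most `grain`
// iterations long, then runs that chunk as an ordinary loop.
template <class Body>
void dc_for(int lo, int hi, int grain, Body body) {
    assert(grain > 0);
    if (hi - lo <= grain) {
        for (int i = lo; i < hi; ++i) body(i);
        return;
    }
    int mid = lo + (hi - lo) / 2;
    dc_for(lo, mid, grain, body);  // a Cilk++ version would spawn this half
    dc_for(mid, hi, grain, body);
    // ... and sync here before returning
}
```

Splitting recursively keeps the critical path logarithmic in the iteration count, whereas spawning each iteration from a single loop would make it linear.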

Reducers and other Cilk++ hyperobjects
Matteo Frigo (Intel), Pablo Halpern (Intel), Charles E. Leiserson (MIT), Stephen Lewin-Berlin (Intel)
August 11, 2009
Collision detection
Assembly: Represented as a tree