The Manticore Project
1 The Manticore Project John Reppy University of Chicago January 20, 2009
2 Introduction People The Manticore project is a joint project between the University of Chicago and the Toyota Technological Institute: Lars Bergstrom (University of Chicago), Matthew Fluet (Toyota Technological Institute at Chicago), Mike Rainey (University of Chicago), Adam Shaw (University of Chicago), and Yingqi Xiao (University of Chicago), with help from Nic Ford, Jon Riehl, and Ridg Scott (all University of Chicago). DAMP 2009 The Manticore Project 2
7 Introduction Background and motivation The Manticore project is our effort to address the programming needs of commodity applications running on commodity hardware. Hardware supports parallelism at multiple levels: SIMD, SMT, multicore, and small-scale SMP. Likewise, many commodity applications exhibit parallelism at multiple levels. To maximize the performance of these applications, parallel languages should support parallelism at multiple levels. Existing languages span a spectrum from explicitly threaded (e.g., CML, Erlang, Java) through implicitly threaded (e.g., Manticore, OpenMP) to implicitly parallel (e.g., NESL, Nepal, Id, pH, SISAL). DAMP 2009 The Manticore Project 3
8 Introduction Talk overview This talk is divided into two parts: 1. Language design for heterogeneous parallel programming. 2. The Manticore implementation. Please see http://manticore.cs.uchicago.edu for papers. DAMP 2009 The Manticore Project 4
9 Design Design DAMP 2009 The Manticore Project 5
10 Design Design overview Language design Our initial design is purposefully conservative. It can be summarized as the combination of three distinct sub-languages: A mutation-free subset of SML (no refs or arrays, but includes exceptions). Language mechanisms for implicitly-threaded parallel programming. Language mechanisms for explicitly-threaded parallel programming (a.k.a. concurrent programming). DAMP 2009 The Manticore Project 6
11 Design Implicit threading Language design (continued...) Manticore provides several light-weight syntactic forms for introducing parallel computation. These forms are hints to the compiler and runtime that a computation is a good candidate for parallel execution. Parallel arrays provide fine-grain data-parallel computations over sequences. Parallel tuples provide a basic fork-join parallel computation. Parallel bindings provide data-flow parallelism with cancelation of unused subcomputations. Parallel case provides non-deterministic speculative parallelism. DAMP 2009 The Manticore Project 7
12 Design Implicit threading Parallel arrays We support fine-grained data-parallel computation using a parallel array comprehension form (NESL/Nepal/DPH):

    [| exp | pat1 in exp1, ..., patn in expn where pred |]

For example, the parallel point-wise summing of two arrays:

    [| x+y | x in xs, y in ys |]

NOTE: zip semantics, not Cartesian-product semantics. DAMP 2009 The Manticore Project 8
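For reference, the zip semantics can be made concrete with a sequential Standard ML analogue over lists (plain Basis Library code, not Manticore's parallel arrays):

    (* Sequential analogue of [| x+y | x in xs, y in ys |]: ListPair.map
       pairs elements positionally, i.e., zip semantics. *)
    val sumPairwise = ListPair.map (op +)
    (* sumPairwise ([1, 2, 3], [4, 5, 6]) evaluates to [5, 7, 9] *)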
13 Design Implicit threading Nested data parallelism (continued...) Mandelbrot set computation:

    fun x i = x0 + dx * itof i;
    fun y j = y0 - dy * itof j;
    fun pixel (cr, ci) = let
          fun loop (cnt, re, im) =
                if (cnt < 255) andalso (re*re + im*im <= 4.0)
                  then loop (cnt+1, re*re - im*im + cr, 2.0*re*im + ci)
                  else cnt
          in loop (0, cr, ci) end;
    [| [| pixel (x i, y j) | i in [| 0..N |] |] | j in [| 0..N |] |]

DAMP 2009 The Manticore Project 9
14 Design Implicit threading Parallel tuples Parallel tuples provide a basic fork-join parallelism. For example, consider summing the leaves of a binary tree.

    datatype tree
      = LF of long
      | ND of tree * tree

    fun treeadd (LF n) = n
      | treeadd (ND (t1, t2)) = (op +) (| treeadd t1, treeadd t2 |)

DAMP 2009 The Manticore Project 10
15 Design Implicit threading Parallel bindings Parallel bindings provide more flexibility than parallel tuples. For example, consider computing the product of the leaves of a binary tree.

    fun treemul (LF n) = n
      | treemul (ND (t1, t2)) = let
          pval b = treemul t2
          val a = treemul t1
          in
            if (a = 0) then 0 else a*b
          end

NOTE: the computation of b is speculative. DAMP 2009 The Manticore Project 11
17 Design Implicit threading Parallel case Parallel case supports speculative parallelism when we want the quickest answer (e.g., search problems). For example, consider picking a leaf of the tree:

    fun treepick (LF n) = n
      | treepick (ND (t1, t2)) = (
          pcase treepick t1 & treepick t2
           of ? & n => n
            | n & ? => n)

There is some similarity with join patterns. DAMP 2009 The Manticore Project 13
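For comparison, here is a rough analogue of this pick-the-first-answer pattern written against SML/NJ's CML library; this is explicitly-threaded code, not Manticore's pcase, and the two-worker race is my own sketch:

    (* Race two computations; the first to finish wins. The loser's iPut
       raises Put, which we discard (as would any exception the loser raises). *)
    fun race (f, g) = let
          val iv = SyncVar.iVar ()
          fun worker h () = SyncVar.iPut (iv, h ()) handle _ => ()
          in
            CML.spawn (worker f);
            CML.spawn (worker g);
            SyncVar.iGet iv
          end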
19 Design Explicit threading Explicit threading Explicit threading with preemptive scheduling. Threads do not share state; instead they communicate and synchronize via message passing. Synchronization and communication abstraction are supported by the mechanism of first-class synchronous operations (called events). In addition to supporting abstraction, events also provide a uniform framework for many kinds of communication and synchronization primitives (e.g., buffered and unbuffered channels, I-variables, and M-variables). DAMP 2009 The Manticore Project 15
23 Design Explicit threading Interprocess communication The most common mechanism of communication is by channels:

    type 'a chan
    val channel : unit -> 'a chan
    val recv : 'a chan -> 'a
    val send : ('a chan * 'a) -> unit

Manticore provides both unbuffered (i.e., blocking send) and buffered channels, but we focus on the unbuffered channels here. DAMP 2009 The Manticore Project 16
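As a usage sketch, here is a simple accumulator thread built from just this channel interface. It is written against SML/NJ's CML library, where the operations are CML.channel, CML.recv, CML.send, and CML.spawn; Manticore's surface syntax may differ slightly.

    (* A thread that owns a running sum; clients interact with it only by
       sending on its channel. *)
    fun startAdder () = let
          val ch : int CML.chan = CML.channel ()
          fun loop sum = loop (sum + CML.recv ch)
          in
            CML.spawn (fn () => loop 0);
            ch
          end
    (* val ch = startAdder ()
       CML.send (ch, 5)  (* the adder thread receives 5 and updates its sum *) *)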
24 Design Explicit threading Interprocess communication (continued...) CML's events are motivated by the following observations: Most interactions between processes involve multiple messages. A process may need to interact with multiple partners (nondeterministic choice). These two properties of IPC cause a conflict. DAMP 2009 The Manticore Project 17
28 Design Explicit threading Interprocess communication (continued...) For example, consider a possible interaction between a client and two servers. [figure: message-sequence diagram in which the client sends a request to each of Server1 and Server2, accepts one server's reply (with an ack) and cancels the other (nack)] DAMP 2009 The Manticore Project 18
29 Design Explicit threading Interprocess communication (continued...) Without abstraction, the code is a mess.

    let
      val replch1 = channel() and nack1 = cvar()
      val replch2 = channel() and nack2 = cvar()
    in
      send (reqch1, (req1, replch1, nack1));
      send (reqch2, (req2, replch2, nack2));
      select [
        (replch1, fn repl1 => (set nack2; act1 repl1)),
        (replch2, fn repl2 => (set nack1; act2 repl2))
      ]
    end

But traditional abstraction mechanisms do not support choice! DAMP 2009 The Manticore Project 19
30 Design Explicit threading Events We use event values to package up protocols as first-class abstractions. An event is an abstraction of a synchronous operation, such as receiving a message or a timeout.

    type 'a event

Base-event constructors create event values for communication primitives:

    val recvevt : 'a chan -> 'a event

DAMP 2009 The Manticore Project 20
31 Design Explicit threading Events (continued...) CML event operations. Event wrappers for post-synchronization actions:

    val wrap : ('a event * ('a -> 'b)) -> 'b event

Event generators for pre-synchronization actions and cancellation:

    val guard : (unit -> 'a event) -> 'a event
    val withnack : (unit event -> 'a event) -> 'a event

Choice for managing multiple communications:

    val choose : 'a event list -> 'a event

Synchronization on an event value:

    val sync : 'a event -> 'a

DAMP 2009 The Manticore Project 21
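As a small example of these combinators in action, here is a receive that gives up after a timeout, written with SML/NJ CML's names (the base events there are CML.recvEvt and CML.timeOutEvt, and CML.select is sync o choose):

    (* Wait for a message on ch, but return NONE if t elapses first. *)
    fun recvWithTimeout (ch : 'a CML.chan, t : Time.time) : 'a option =
          CML.select [
              CML.wrap (CML.recvEvt ch, SOME),
              CML.wrap (CML.timeOutEvt t, fn () => NONE)
            ]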
32 Design Explicit threading Two-server interaction using events Server abstraction:

    type server
    val rpcevt : server * req -> repl event

The client code is no longer a mess.

    select [
      wrap (rpcevt server1, fn repl1 => act1 repl1),
      wrap (rpcevt server2, fn repl2 => act2 repl2)
    ]

Note that select is shorthand for sync o choose. DAMP 2009 The Manticore Project 22
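One plausible way to build an rpcevt-style abstraction, following the abortable-RPC pattern from the CML book, is sketched below under assumptions: the server is reached through a request channel reqch, and negative acknowledgments come from CML's withNack rather than the cvars of the unabstracted code.

    (* On synchronization the request is sent along with a fresh reply
       channel and the nack event; if the client commits to another choice,
       the server sees nack fire instead of an ack. *)
    fun rpcevt (reqch, req) =
          CML.withNack (fn nack => let
                val replch = CML.channel ()
                in
                  CML.spawn (fn () => CML.send (reqch, (req, replch, nack)));
                  CML.recvEvt replch
                end)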
33 Implementation DAMP 2009 The Manticore Project 23
34 Process abstractions Process abstractions The runtime model is based on heap-allocated first-class continuations and has three distinct notions of process abstraction: Fibers are unadorned threads of control; a suspended fiber's state is represented as a continuation. Threads correspond to language-level threads and have an ID. Virtual Processors (VProcs) are an abstraction of a computational resource. The runtime maintains a dynamic binding between fibers and fiber-local storage (FLS). DAMP 2009 The Manticore Project 24
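To make the fiber representation concrete, here is a minimal sketch using SML/NJ's first-class continuations (SMLofNJ.Cont); the Manticore runtime uses its own heap-allocated continuations, but the idea is the same:

    (* A fiber is a continuation expecting unit; resuming it transfers
       control to the suspended computation and never returns. *)
    structure Fiber = struct
      open SMLofNJ.Cont     (* callcc : ('a cont -> 'a) -> 'a; throw *)
      type fiber = unit cont
      fun resume (f : fiber) : 'a = throw f ()
    end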
35 Memory model and GC Memory model The GC is a combination of the Appel semi-generational collector and the Doligez-Leroy-Gonthier parallel collector. Each VProc has a local heap that can be collected independently. Objects that are shared between VProcs must be promoted to the global heap. Minor GCs are completely asynchronous. Major GCs are mostly asynchronous. Global GCs are parallel stop-the-world. DAMP 2009 The Manticore Project 25
41 Memory model and GC Memory model (continued...) [figure: each VProc has its own local heap and VProc state; all VProcs share the global heap] DAMP 2009 The Manticore Project 26
42 Schedulers VProcs In addition to its heap, each VProc has local state that provides the execution context for running fibers. This state includes: a ready queue of fibers paired with FLS (local access only); a signal mask for local masking of preemption; a landing pad for incoming threads (a lock-free stack); and a stack of scheduler actions. DAMP 2009 The Manticore Project 27
43 Schedulers Schedulers The Manticore runtime system provides a flexible framework for nested schedulers. Schedulers are implemented using scheduler actions. This framework has been used to implement a wide range of scheduler policies. DAMP 2009 The Manticore Project 28
46 Schedulers Scheduler actions A scheduler action is a function that implements context switching for a VProc.

    type fiber = unit cont
    datatype signal = STOP | PREEMPT of fiber
    type action = signal cont

The runtime model defines three operations on a VProc's action stack. DAMP 2009 The Manticore Project 29
47 Schedulers Action-stack transitions: running a fiber

    val run : action -> fiber -> 'a

run pushes the action on the stack and starts running the fiber: evaluating E[run act f] pushes act onto the action stack and transfers control via throw f (). [figure: before/after views of the VProc's thread queue, signal mask, and action stack] DAMP 2009 The Manticore Project 30
48 Schedulers Action-stack transitions: preemption Preemption captures the current state as a fiber f and delivers a preemption signal to the top action act on the stack (which is popped): control transfers to act (PREEMPT f). [figure: before/after views of the VProc's thread queue, signal mask, and action stack] DAMP 2009 The Manticore Project 31
49 Schedulers Action-stack transitions: forwarding a signal

    val forward : signal -> 'a

forward delivers the signal sig to the top action act on the stack (which is popped): evaluating E[forward sig] transfers control to act sig. [figure: before/after views of the VProc's thread queue, signal mask, and action stack] DAMP 2009 The Manticore Project 32
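Putting the three operations together, a round-robin scheduler can be expressed as a scheduler action in roughly this style. This is a pseudocode sketch: run and forward are the runtime operations above, enqueue and dequeue on the VProc's ready queue are assumed primitives, and the action is written as a function over signals rather than a continuation.

    (* On STOP, dispatch the next ready fiber; on PREEMPT, requeue the
       interrupted fiber and then dispatch. *)
    fun switch STOP = dispatch ()
      | switch (PREEMPT k) = (enqueue k; dispatch ())
    and dispatch () = run switch (dequeue ())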
50 Compiler support The Manticore compiler The compiler is organized as a series of transformations between IRs: Typed AST, an explicitly-typed, polymorphic abstract-syntax tree; BOM, a direct-style, normalized λ-calculus; CPS, a continuation-passing-style λ-calculus; and CFG, a first-order control-flow graph. The pipeline: Source -> (front end) -> Typed AST -> (translate) -> BOM -> (convert) -> CPS -> (closure) -> CFG -> (codegen) -> Asm. DAMP 2009 The Manticore Project 33
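To illustrate the BOM-to-CPS step, here is what CPS conversion does to a tiny direct-style function; this is plain SML for exposition, not the compiler's internal IR:

    fun add1 x = x + 1                (* direct style, as in BOM *)
    fun add1cps (x, k) = k (x + 1)    (* CPS: the result is passed to continuation k *)
    (* add1cps (41, fn r => print (Int.toString r ^ "\n")) prints 42 *)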
51 Compiler support AST transformations Introduction of compensation code for exceptions. Flattening of nested-data-parallelism (AOS to SOA). Introduction of futures for other implicitly-threaded parallel constructs. Compilation of pattern matching. DAMP 2009 The Manticore Project 34
52 Compiler support Introducing futures

    fun treemul (LF n) = n
      | treemul (ND (t1, t2)) = let
          (* pval b = treemul t2 *)
          val b = future (treemul t2)
          val a = treemul t1
          in
            if (a = 0) then (cancel b; 0) else a * (touch b)
          end

DAMP 2009 The Manticore Project 35
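The future operations used above would have signatures along these lines; this is an assumption based on the slide (where future appears as a compiler-introduced form applied directly to an expression), not a documented Manticore interface:

    val future : (unit -> 'a) -> 'a future   (* suspend a speculative computation *)
    val touch  : 'a future -> 'a             (* demand the future's value *)
    val cancel : 'a future -> unit           (* cancel an unneeded future *)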
53 Compiler support BOM transformations Standard functional-compiler optimizations (e.g., uncurrying, inlining, and contraction). High-level operator expansion. Rewriting-based domain-specific optimizations (under construction). DAMP 2009 The Manticore Project 36
54 Compiler support High-level operations BOM types and code embedded in ML modules. Used to implement concurrency and parallel features (e.g., spawn, future, and touch). Used to import scheduler code into a program and to implement scheduler operations (e.g., run and forward). Rewriting to implement DSOs (e.g., fusion). DAMP 2009 The Manticore Project 37
55 Implementation status Implementation status Core SML plus a simple module system is implemented. All of the IRs (AST, BOM, CPS, and CFG) are implemented (with invariant checkers), as are the translations between IRs. Code generation for the x86-64 is implemented using the MLRISC framework. The runtime system is implemented on Linux and Mac OS X. A number of basic BOM optimizations are implemented. Protocols for asymmetric CML have been designed and coded (see DAMP '08). Protocols for symmetric CML have been designed and model checked, but not implemented. DAMP 2009 The Manticore Project 38
62 Implementation status Preliminary performance data We have done some preliminary benchmarking of the system to test the efficiency of the scheduler framework. The benchmarks use Cilk's version of lazy-task creation (PLDI '98) and gang scheduling. By-hand chunking was used for small problem sizes. Scaling is good, but our baseline performance is still poor. We have implemented very few sequential optimizations, so we expect significant improvement in the baseline. DAMP 2009 The Manticore Project 39
67 Implementation status Preliminary performance data (continued...) [figure: speedup vs. sequential Manticore as a function of the number of processors for Barnes-Hut (LTC), Bitonic Sort (LTC), Merge Sort (LTC), and Ray Tracer, plotted against perfect speedup] DAMP 2009 The Manticore Project 40
68 Implementation status Preliminary performance data (continued...) [figure: speedup vs. SML/NJ as a function of the number of processors for Barnes-Hut (LTC), Bitonic Sort (LTC), Merge Sort (LTC), and Ray Tracer, plotted against perfect speedup] DAMP 2009 The Manticore Project 41
69 Optimizations Optimizations We are investigating several kinds of optimizations in the Manticore compiler and runtime: traditional sequential optimizations, such as arity raising; effects analysis to eliminate compensation code; aggregation of work to reduce overhead (chunking); and techniques to reduce the cost of promotion. DAMP 2009 The Manticore Project 42
73 Optimizations Reducing the cost of object promotion The requirement that shared objects get promoted to the global heap is a significant source of overhead. For example, 47% of the overhead of work stealing is from object promotion. We are investigating three ways of reducing this cost: (1) direct global allocation, (2) lazy promotion for work stealing, and (3) proxy objects. DAMP 2009 The Manticore Project 43
76 Optimizations Lazy promotion for work stealing [figure: work lists kept in the shared global heap, visible to all VProcs] In our implementation of work stealing, each worker has a globally visible work list that other workers can steal from. Thus, the units of work (i.e., continuations) must be promoted to the global heap. DAMP 2009 The Manticore Project 44
77 Optimizations Lazy promotion for work stealing (continued...) [figure: work lists kept in the VProcs' local heaps] An alternative is to keep the work list in the local heap and only promote stolen work. But this choice means that the work list cannot be visible to workers running on other vprocs. Our protocol: send a thief to the vproc that we wish to steal work from; the thief promotes the work; the thief sends the work back to its master. DAMP 2009 The Manticore Project 45
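A pseudocode sketch of this thief protocol, under assumed runtime primitives; sendToVProc, popLocalWorkList, promote, and the reply channel are all hypothetical names, and send is the channel send from earlier:

    (* The master sends a thief fiber to run on the victim vproc; only
       stolen work is promoted, and only by the thief running locally. *)
    fun steal (victim, replyCh) =
          sendToVProc (victim, fn () =>
            case popLocalWorkList ()
             of NONE => send (replyCh, NONE)
              | SOME work => send (replyCh, SOME (promote work)))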
81 Optimizations Evaluation of lazy promotion Lazy promotion and software polling reduce the overhead of stealing work. Chunking increases the amount of work per steal. We have benchmarked the combination of these features for mergesort. DAMP 2009 The Manticore Project 46
82 Optimizations Evaluation of lazy promotion (continued...) [figure: speedup vs. sequential Manticore as a function of the number of processors for Mergesort with LTC, LTC+Lazy, LTC+Chunk, and LTC+Lazy+Chunk] DAMP 2009 The Manticore Project 47
83 Optimizations Proxy continuations Another source of unnecessary promotion is when continuations are put into the scheduling queues of synchronization objects (e.g., channels). Often, the continuations may not get used or may be resumed on the same processor as they were allocated on. DAMP 2009 The Manticore Project 48
84 Optimizations Proxy continuations (continued...) [figure: a proxy continuation in the global heap, with per-VProc proxy tables mapping it to the real continuation in a local heap] Instead of copying the continuation, and everything reachable from it, copy a proxy. DAMP 2009 The Manticore Project 49
85 Conclusion Questions? DAMP 2009 The Manticore Project 50