Effectively Mapping Linguistic Abstractions for Message-passing Concurrency to Threads on the Java Virtual Machine


1 Effectively Mapping Linguistic Abstractions for Message-passing Concurrency to Threads on the Java Virtual Machine Ganesha Upadhyaya Iowa State University Hridesh Rajan Iowa State University Supported in part by the NSF grants CCF , CCF , and CCF

2 Effectively Mapping Linguistic Abstractions for Message-passing Concurrency to Threads on the Java Virtual Machine Ganesha Upadhyaya Iowa State University Hridesh Rajan Iowa State University 1) Local computation and communication behavior of a concurrent entity can help predict the performance. 2) A modular and static technique can solve the problem. Supported in part by the NSF grants CCF , CCF , and CCF

3 2 Problem [Figure: MPC abstractions a1–a5 on one side and a system architecture with cores core0–core3 on the other, connected by a Mapping layer] MPC: Message-passing Concurrency; a1–a5 are MPC abstractions

4 2 Problem [Figure: MPC abstractions a1–a5 are mapped to JVM threads; the OS Scheduler places the JVM threads onto cores core0–core3] MPC: Message-passing Concurrency; a1–a5 are MPC abstractions

5 2 Problem [Figure: MPC abstractions a1–a5 are mapped to JVM threads; the OS Scheduler places the JVM threads onto cores core0–core3] MPC: Message-passing Concurrency; a1–a5 are MPC abstractions. In this work: mapping abstractions to JVM threads

6 2 Problem [Figure: MPC abstractions a1–a5 are mapped to JVM threads; the OS Scheduler places the JVM threads onto cores core0–core3] MPC: Message-passing Concurrency; a1–a5 are MPC abstractions. PROBLEM: mapping abstractions to JVM threads. GOAL: Concurrency

7 2 Problem [Figure: MPC abstractions a1–a5 are mapped to JVM threads; the OS Scheduler places the JVM threads onto cores core0–core3] MPC: Message-passing Concurrency; a1–a5 are MPC abstractions. PROBLEM: mapping abstractions to JVM threads. GOAL: Concurrency + Performance (fast and efficient)

8 3 Motivation State of the art: Abstractions to Threads mapping is performed by programmers.

9 3 Motivation Abstractions to Threads mapping is performed by programmers. MPC frameworks (Akka, Scala Actors, SALSA) provide Schedulers and Dispatchers to programmers for mapping Abstractions to Threads

10 4 JVM-based MPC Frameworks: Akka, Scala Actors, SALSA, Kilim, Jetlang, ActorFoundry, Actors Guild. Dispatchers in Akka: default, pinned, balancing, calling-thread. Schedulers in Scala Actors: thread-based, event-based. The availability of a wide variety of Schedulers and Dispatchers suggests that - programmers can choose the ones that work best for their applications - and perform the mapping carefully

11 3 Motivation Abstractions to Threads mapping is performed by programmers. MPC frameworks (Akka, Scala Actors, SALSA) provide Schedulers and Dispatchers to programmers for mapping Abstractions to Threads. Programmers find it hard to manually perform the mapping: they start with an initial mapping and incrementally improve it. This process can be tedious and time consuming.

12 3 Motivation Abstractions to Threads mapping is performed by programmers. MPC frameworks (Akka, Scala Actors, SALSA) provide Schedulers and Dispatchers to programmers for mapping Abstractions to Threads. Stack Overflow discussions about configuring and fine-tuning the mapping suggest that randomly tweaking the mapping without finding the root cause of the performance problem doesn't help, and that without knowing the nature of the task performed by the abstractions, the mapping task becomes hard.

13 5 Motivation When manual tuning is hard, programmers use default mappings (default schedulers/dispatchers)

14 5 Motivation When manual tuning is hard, programmers use default mappings (default schedulers/dispatchers) Problem: a single default mapping may not work across programs

15 5 Motivation When manual tuning is hard, programmers use default mappings (default schedulers/dispatchers) Problem: a single default mapping may not work across programs. LogisticMap (computes the logistic map using a recurrence relation); ScratchPad (counts lines for all files in a directory) [Charts: execution time (s) of the thread, round-robin, random, and work-stealing default mappings for each benchmark, across core settings]

16 5 Motivation When manual tuning is hard, programmers use default mappings (default schedulers/dispatchers) Problem: a single default mapping may not work across programs. LogisticMap (computes the logistic map using a recurrence relation) [Chart: execution time (s) of the thread, round-robin, random, and work-stealing default mappings across core settings]

17 5 Motivation When manual tuning is hard, programmers use default mappings (default schedulers/dispatchers) Problem: a single default mapping may not work across programs. ScratchPad (counts lines for all files in a directory) [Chart: execution time (s) of the thread, round-robin, random, and work-stealing default mappings across core settings]


21 6 Motivation When manual tuning is hard and default mappings may not produce the desired performance, a brute-force technique that tries all possible combinations of the Abstractions-to-Threads mapping could be used

22 6 Motivation When manual tuning is hard and default mappings may not produce the desired performance, a brute-force technique that tries all possible combinations of the Abstractions-to-Threads mapping could be used Problem: Combinatorial explosion

23 6 Motivation When manual tuning is hard and default mappings may not produce the desired performance, a brute-force technique that tries all possible combinations of the Abstractions-to-Threads mapping could be used Problem: Combinatorial explosion For example: for an MPC program with 8 kinds of abstractions, trying all possible combinations of 4 kinds of schedulers/dispatchers requires exploring 4^8 = 65,536 different combinations (some may even violate the concurrency correctness property)

24 6 Motivation When manual tuning is hard and default mappings may not produce the desired performance, a brute-force technique that tries all possible combinations of the Abstractions-to-Threads mapping could be used Problem: Combinatorial explosion For example: for an MPC program with 8 kinds of abstractions, trying all possible combinations of 4 kinds of schedulers/dispatchers requires exploring 4^8 = 65,536 different combinations (some may even violate the concurrency correctness property) Also, a small change to the program may require re-doing the mapping.

25 6 Motivation When manual tuning is hard and default mappings may not produce the desired performance, a brute-force technique that tries all possible combinations of the Abstractions-to-Threads mapping could be used Problem: Combinatorial explosion For example: for an MPC program with 8 kinds of abstractions, trying all possible combinations of 4 kinds of schedulers/dispatchers requires exploring 4^8 = 65,536 different combinations Also, a small change to the program may require re-doing the mapping. A mapping solution that yields significant performance improvement over default mappings is desirable

26 7 Key Ideas 1) The local computation and communication behavior of a concurrent entity is predictive for determining a globally beneficial mapping

27 7 Key Ideas 1) The local computation and communication behavior of a concurrent entity is predictive for determining a globally beneficial mapping. Computation and communication behaviors: externally blocking behavior, local state, computational workload, message send/receive pattern, inherent parallelism

28 7 Key Ideas 1) The local computation and communication behavior of a concurrent entity is predictive for determining a globally beneficial mapping. Computation and communication behaviors: externally blocking behavior, local state, computational workload, message send/receive pattern, inherent parallelism 2) Determining these behaviors at a coarse/abstract level is sufficient to solve the mapping problem

29 8 Solution Outline Represent the computation and communication behavior of MPC abstractions (in a language-agnostic manner) [Pipeline figure: Input Program]

30 8 Solution Outline Represent the computation and communication behavior of MPC abstractions (in a language-agnostic manner). Perform local program analyses statically to determine the behaviors (the proposed solution is both modular and static) [Pipeline figure: Input Program → cvector Analysis → cvector]

31 8 Solution Outline Represent the computation and communication behavior of MPC abstractions (in a language-agnostic manner). Perform local program analyses statically to determine the behaviors (the proposed solution is both modular and static). A mapping function takes the represented behaviors as input and produces an execution policy for each abstraction. Execution policy: describes how the messages of an MPC abstraction are processed (in detail later) [Pipeline figure: Input Program → cvector Analysis → cvector → Mapping Function → Execution Policy]

32 9 Panini Capsules [Pipeline figure: Input Program → cvector Analysis → cvector → Mapping Function → Execution Policy] General MPC framework: MPC Abstraction, Message Handlers. Panini: Capsules, Capsule Procedures.

33 9 Panini Capsules [Pipeline figure: Input Program → cvector Analysis → cvector → Mapping Function → Execution Policy] General MPC framework: MPC Abstraction, Message Handlers. Panini: Capsules, Capsule Procedures. [Figure: a capsule with its Capsule State and procedures p0, p1, ..., pn]
    capsule Ship {
      short state = 0;
      int x = 5;
      void die() { state = 2; }
      void fire() { state = 1; }
      void moveleft() { if (x > 0) x--; }
      void moveright() { if (x < 10) x++; }
    }

34 9 Solution Outline [Figure: pipeline Input Program → cvector Analysis → cvector → Mapping Function → Execution Policy. Inside the cvector analysis (Static Analyses), the State Analysis runs on the Capsule State and, for each procedure pi (p0, p1, ..., pn), the May-Block Analysis, Call-claim Analysis, Communication Summary Analysis, and Computational Workload Analysis run; Procedure Behavior Composition combines the per-procedure results into the capsule's cvector]


36 9 Solution Outline [Analysis pipeline figure as on the previous slide, now also showing the Mapping Function turning the cvector into an execution policy (EP)]

37 10 Representing Behaviors Characteristics Vector (cvector) <β, σ, π, ρs, ρr, ω>
Blocking behavior (β): represents externally blocking behavior due to I/O, socket or db primitives; dom(β): {true, false}
Local state (σ): local state variables; dom(σ): {nil, primitive, large}
Inherent parallelism (π): inherent parallelism exposed by the capsule when it communicates with other capsules; dom(π): {sync, async, future}
Computational workload (ω): represents computations performed by the capsule; dom(ω): {math, io}
Communication pattern (ρs: message send pattern, ρr: message receive pattern)

38 10 Representing Behaviors Characteristics Vector (cvector) <β, σ, π, ρs, ρr, ω> [cvector components as on the previous slide] All behaviors <β, π, ρs, ρr, ω> except σ are defined per capsule procedure and are combined to form the behaviors of the capsule using behavior composition (described later)
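
The cvector is language-agnostic; as a concrete illustration, a minimal sketch of how such a vector could be represented on the JVM is shown below (plain Java, not the authors' implementation; the ρs/ρr names for the send and receive patterns are this transcription's notation):

    // Sketch: a capsule's characteristics vector <β, σ, π, ρs, ρr, ω> as a Java value.
    enum Sigma { NIL, PRIMITIVE, LARGE }          // σ: local state
    enum Pi { SYNC, ASYNC, FUTURE }               // π: inherent parallelism
    enum SendPattern { LEAF, ROUTER, SCATTER }    // ρs: message send pattern
    enum RecvPattern { GATHER, REQUEST_REPLY }    // ρr: message receive pattern
    enum Omega { MATH, IO }                       // ω: computational workload

    final class CVector {
        final boolean blocking;   // β: externally blocking behavior
        final Sigma state;        // σ
        final Pi parallelism;     // π
        final SendPattern send;   // ρs
        final RecvPattern recv;   // ρr
        final Omega workload;     // ω
        CVector(boolean blocking, Sigma state, Pi parallelism,
                SendPattern send, RecvPattern recv, Omega workload) {
            this.blocking = blocking; this.state = state; this.parallelism = parallelism;
            this.send = send; this.recv = recv; this.workload = workload;
        }
    }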

39 10 Representing Behaviors Characteristics Vector (cvector) <β, σ, π, ρs, ρr, ω> Blocking behavior (β): represents externally blocking behavior due to I/O, socket or db primitives; dom(β): {true, false}
    BufferedReader br = new BufferedReader(new InputStreamReader(System.in));
    try {
      username = br.readLine(); // blocking
    } catch (IOException ioe) {
    }
May-Block Analysis. Input: a manually created dictionary of blocking library calls. Analysis: flow analysis with message receives as sources and blocking library calls as sinks.
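
A minimal sketch of the dictionary lookup that the May-Block Analysis relies on is shown below; the dictionary entries and the String-based call representation are assumptions for illustration, and the real analysis is a flow analysis from message receives to such calls:

    import java.util.Set;

    final class MayBlockAnalysis {
        // Manually created dictionary of blocking library calls (illustrative entries).
        private static final Set<String> BLOCKING_CALLS = Set.of(
            "java.io.BufferedReader.readLine",
            "java.net.Socket.connect",
            "java.sql.Statement.executeQuery");

        // A procedure may block (β = true) if any library call reachable
        // from its message receive is in the dictionary.
        boolean mayBlock(Iterable<String> reachableCalls) {
            for (String call : reachableCalls) {
                if (BLOCKING_CALLS.contains(call)) return true;
            }
            return false;
        }
    }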

40 10 Representing Behaviors Characteristics Vector (cvector) <β, σ, π, ρs, ρr, ω> Local state (σ): local state variables; dom(σ): {nil, fixed, variable}
    capsule Receiver {                 // nil: no local state
      void receive() { }
    }
    capsule Receiver {                 // fixed: primitive state variables
      int state1;
      int state2;
      void receive() { }
    }
    capsule Receiver {                 // variable: collection-typed state
      Map<String,String> datamap = new HashMap<>();
      void receive() { }
    }
State Analysis. Input: state variables. Analysis: checks the types of the state variables that compose the capsule state for primitive or collection types.
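
A sketch of the classification step of the State Analysis, assuming the declared types of the state variables are available; the rule "collections mean variable-size state, otherwise fixed" illustrates the slide's description, not the exact algorithm:

    import java.util.Collection;
    import java.util.Map;

    final class StateAnalysis {
        enum StateKind { NIL, FIXED, VARIABLE }   // dom(σ) as on this slide

        StateKind classify(Collection<Class<?>> stateVariableTypes) {
            if (stateVariableTypes.isEmpty()) return StateKind.NIL;
            for (Class<?> t : stateVariableTypes) {
                // Collection- or map-typed state indicates variable-size state.
                if (Collection.class.isAssignableFrom(t) || Map.class.isAssignableFrom(t))
                    return StateKind.VARIABLE;
            }
            return StateKind.FIXED;   // only primitive / fixed-size fields
        }
    }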

41 10 Representing Behaviors Characteristics Vector (cvector) <β, σ, π, ρs, ρr, ω> Inherent parallelism (π): kind of parallelism exposed by the capsule while communicating; dom(π): {sync, async, future}
    capsule Receiver (Sender sender) {
      void receive() {            // sync
        int i = sender.get();
        // print i;
      }
    }
    capsule Receiver (Sender sender) {
      void receive() {            // async
        sender.done();
      }
    }
    capsule Receiver (Sender sender) {
      void receive() {            // future
        int i = sender.get();
        // some computation
        // print i;
      }
    }

42 10 Representing Behaviors Characteristics Vector (cvector) <β, σ, π, ρs, ρr, ω> Inherent parallelism (π); dom(π): {sync, async, future} Inherent Parallelism Analysis [other cvector components as before]

43 10 Representing Behaviors Characteristics Vector (cvector) <β, σ, π, ρs, ρr, ω> Computational workload (ω): represents computations performed by the capsule; dom(ω): {math, io} math: computation-to-wait > 1; io: computation-to-wait <= 1 Computational Workload Analysis. Computation summary: recursive calls, high-cost library calls, unbounded loops, complete read/write to state that is variable size.
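
A sketch of how the computation-to-wait ratio could yield ω; the abstract cost inputs are assumptions, the threshold is the one stated on the slide:

    final class WorkloadAnalysis {
        enum Workload { MATH, IO }   // dom(ω)

        // computationCost: abstract cost of recursive calls, high-cost library calls,
        //                  unbounded loops, reads/writes of variable-size state.
        // waitCost:        abstract cost spent waiting (I/O, message waits).
        Workload classify(double computationCost, double waitCost) {
            double ratio = (waitCost == 0) ? Double.POSITIVE_INFINITY
                                           : computationCost / waitCost;
            return ratio > 1 ? Workload.MATH : Workload.IO;   // math: ratio > 1, io: ratio <= 1
        }
    }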

44 11 Representing Behaviors Communication pattern Message send pattern (ρs); dom(ρs): {leaf, router, scatter} Message receive pattern (ρr); dom(ρr): {gather, request-reply} leaf: no outgoing communication; router: one-to-one communication; scatter: batch communication; gather: recv-to-send > 1; request-reply: recv-to-send <= 1

45 11 Representing Behaviors Communication pattern Message send pattern (ρs); dom(ρs): {leaf, router, scatter} Message receive pattern (ρr); dom(ρr): {gather, request-reply} leaf: no outgoing communication; router: one-to-one communication; scatter: batch communication; gather: recv-to-send > 1; request-reply: recv-to-send <= 1 Communication Pattern Analysis: builds a communication summary, which abstracts away all expressions except message sends/receives and state reads/writes; the analysis is a function that takes the communication summary of a procedure and produces its message send/receive pattern tuple as output.
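
A sketch of how the send/receive pattern pair could be derived from counts in a procedure's communication summary; the summary representation (plain counts and a batch-send flag) is an assumption:

    final class CommunicationPatternAnalysis {
        enum Send { LEAF, ROUTER, SCATTER }        // dom(ρs)
        enum Recv { GATHER, REQUEST_REPLY }        // dom(ρr)

        Send sendPattern(int sendCount, boolean batchSend) {
            if (sendCount == 0) return Send.LEAF;      // no outgoing communication
            if (batchSend) return Send.SCATTER;        // batch communication
            return Send.ROUTER;                        // one-to-one communication
        }

        Recv recvPattern(int receives, int sends) {
            // gather: recv-to-send > 1, request-reply: recv-to-send <= 1
            return receives > sends ? Recv.GATHER : Recv.REQUEST_REPLY;
        }
    }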

46 9 Procedure Behavior Composition [Analysis pipeline figure as before, highlighting Procedure Behavior Composition: the per-procedure analysis results are combined into the capsule's cvector, which the Mapping Function turns into an execution policy (EP)]

47 12 Procedure Behavior Composition A capsule may have multiple procedures. The behavior of the capsule is determined by combining the behaviors of its procedures. For instance, a capsule has blocking behavior if any of its procedures is blocking. Key idea: capsule behavior is predominantly defined by the procedure that executes most often.
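
A sketch of the composition rule described above: the blocking flag is the OR over all procedures, and the remaining behaviors are taken from the most frequently executing procedure (how that frequency is estimated is an assumption here):

    import java.util.List;

    final class BehaviorComposition {
        static final class ProcBehavior {
            final boolean blocking;
            final double executionFrequency;   // e.g., estimated statically from call sites
            ProcBehavior(boolean blocking, double executionFrequency) {
                this.blocking = blocking; this.executionFrequency = executionFrequency;
            }
        }

        // The capsule blocks if any of its procedures may block.
        boolean capsuleBlocks(List<ProcBehavior> procedures) {
            return procedures.stream().anyMatch(p -> p.blocking);
        }

        // The remaining capsule behaviors come from the procedure that executes most often.
        ProcBehavior dominant(List<ProcBehavior> procedures) {
            return procedures.stream()
                             .max((a, b) -> Double.compare(a.executionFrequency, b.executionFrequency))
                             .orElseThrow();
        }
    }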

48 9 [Pipeline figure: Input Program → cvector Analysis → cvector → Mapping Function → Execution Policy]


50 13 Execution Policies [Figure: four configurations of two capsules A and B under different policies: A: Th, B: Th; A: Ta, B: Ta; A: Th, B: S; A: Th, B: M] Th: Thread, Ta: Task, S: Sequential, M: Monitor. THREAD: the capsule is assigned a dedicated thread. TASK: the capsule is assigned to a task pool and the shared thread of the task pool processes its messages. SEQ/MONITOR: the calling capsule's thread itself executes the behavior at the callee capsule.
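
A rough Java analogy (not Panini's runtime) of what the four policies mean for message processing:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    final class ExecutionPolicySketch {
        // THREAD: the capsule gets a dedicated thread that drains its own mailbox.
        static Thread threadPolicy(Runnable messageLoop) {
            Thread t = new Thread(messageLoop, "capsule-thread");
            t.start();
            return t;
        }

        // TASK: messages of many capsules are processed by a shared task pool.
        static final ExecutorService TASK_POOL = Executors.newFixedThreadPool(4);
        static void taskPolicy(Runnable message) { TASK_POOL.submit(message); }

        // SEQ: the calling capsule's thread executes the callee's behavior directly.
        static void seqPolicy(Runnable message) { message.run(); }

        // MONITOR: like SEQ, but under a lock because several callers may race.
        private static final Object LOCK = new Object();
        static void monitorPolicy(Runnable message) {
            synchronized (LOCK) { message.run(); }
        }
    }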

51 9 [Pipeline figure: Input Program → cvector Analysis → cvector → Mapping Function → Execution Policy]

52 14 Mapping Heuristics Encodes several intuitions about MPC abstractions
    Heuristic   Execution Policy
    Blocking    Th
    Heavy       Th
    HighCPU     Ta
    LowCPU      M
    Hub         Ta
    Affinity    M/S
    Master      Ta
    Worker      Ta
Th - Thread, Ta - Task, M - Monitor and S - Sequential

53 14 Mapping Heuristics Encodes several intuitions about MPC abstractions. Example, the Blocking heuristic: capsules with externally blocking behaviors should be assigned the Th execution policy; rationale: other policies may lead to blocking of the executing thread, starvation and system deadlocks. [Heuristics table as on the previous slide] Th - Thread, Ta - Task, M - Monitor and S - Sequential

54 14 Mapping Heuristics Encodes several intuitions about MPC abstractions. Example, the Blocking heuristic: capsules with externally blocking behaviors should be assigned the Th execution policy; rationale: other policies may lead to blocking of the executing thread, starvation and system deadlocks. The goal of the heuristics: reduce mailbox contentions, message-passing and processing overheads, and cache misses. [Heuristics table as on the previous slide] Th - Thread, Ta - Task, M - Monitor and S - Sequential

55 15 Mapping Function: Flow Diagram The mapping function takes a cvector as input and assigns an execution policy. It encodes several intuitions (heuristics), as shown in the flow diagram (figure not reproduced here). It is complete and assigns a single policy.
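
A sketch of such a function: each flag says whether the capsule's cvector satisfies the corresponding heuristic from the table, and the first match wins; the order of the checks is an assumption here, the paper's flow diagram is the authority:

    final class MappingFunctionSketch {
        enum Policy { THREAD, TASK, MONITOR, SEQUENTIAL }

        Policy map(boolean blocking, boolean heavy, boolean highCpu, boolean lowCpu,
                   boolean hub, boolean affinity, boolean master) {
            if (blocking) return Policy.THREAD;   // Blocking -> Th
            if (heavy)    return Policy.THREAD;   // Heavy    -> Th
            if (affinity) return Policy.MONITOR;  // Affinity -> M (or S)
            if (hub)      return Policy.TASK;     // Hub      -> Ta
            if (master)   return Policy.TASK;     // Master   -> Ta
            if (highCpu)  return Policy.TASK;     // HighCPU  -> Ta
            if (lowCpu)   return Policy.MONITOR;  // LowCPU   -> M
            return Policy.TASK;                   // Worker   -> Ta (default)
        }
    }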

56 16 Evaluation Benchmark programs (15 total) that exhibit data, task, and pipeline parallelism at coarse and fine granularities. Comparing the cvector mapping against thread-all, round-robin-task-all, random-task-all, and work-stealing-task-all. Measured the reduction in program execution time and CPU consumption over default mappings on different core settings.

57 17 Results: Improvement In Program Runtime [Charts: % runtime improvement for bang, mbrot, serialmsg, FileSearch, ScratchPad, Polynomial, fasta, DCT, Knucleotide, Fannkuchredux, RayTracer, Beamformer, logmap, concdict, concsll] % runtime improvement over default mappings, for fifteen benchmarks. For each benchmark there are four core settings (2, 4, 8, 12 cores) and for each core setting there are four bars (Ith, Irr, Ir, Iws) showing improvement over four default mappings (thread, round-robin, random, work-stealing). Higher bars are better.

58 17 Results: Improvement In Program Runtime [Same charts as on the previous slide] X-axis: core settings (2, 4, 8, 12). Y-axis: % runtime improvement.

59 17 Results: Improvement In Program Runtime [Same charts as before] For each core setting there are 4 bars indicating % runtime improvements over the thread-all, round-robin, random, and work-stealing default strategies.

60 17 Results: Improvement In Program Runtime [Same charts as before] 12 of 15 programs showed improvements. On average, 40.56%, 30.71%, 59.50%, and 40.03% improvements (execution time) over the thread, round-robin, random, and work-stealing mappings respectively. 3 programs showed no improvements (data-parallel programs).

61 17 Analysis: Improvement In Program Runtime [Same charts as before] Sources of improvement: 1) reduced mailbox contentions, 2) reduced message-passing and processing overheads, 3) reduced mailbox contentions and cache misses.

62 18 Applicability We presented the cvector mapping technique for capsules; however, the technique should be applicable to other MPC frameworks. Proof of concept: evaluation on Akka. Similar results for most benchmarks. Data-parallel applications show no improvements (as in Panini). [Chart: % execution time improvements over default Akka mappings for bang, mbrot, serialmsg, fasta, scratchpad, logmap, concdict, concsll, across core settings 2, 4, 8, 12]

63 19 Related Works Placing Erlang actors on multicore efficiently, Francesquini et al. [Erlang '13 Workshop]: considers only hub-affinity behavior that is annotated by the programmer; our technique takes care of many other behaviors. Mapping task graphs to cores, Survey [DAC '13]: not directly applicable to JVM-based MPC frameworks, because the threads-to-cores mapping is left to the OS scheduler. Efficient strategies for mapping threads to cores for OpenMP multi-threaded programs, Tousimojarad and Vanderbauwhede [Journal of Parallel Computing '14]: our technique maps capsules to threads, not threads to cores.

64 19 Related Works Placing Erlang actors on multicore efficiently, Francesquini et al. [Erlang '13 Workshop]: considers only hub-affinity behavior that is annotated by the programmer; our technique takes care of many other behaviors; also: o not directly applicable to JVM-based MPC frameworks o non-automatic. Mapping task graphs to cores, Survey [DAC '13]: not directly applicable to JVM-based MPC frameworks, because the threads-to-cores mapping is left to the OS scheduler. Efficient strategies for mapping threads to cores for OpenMP multi-threaded programs, Tousimojarad and Vanderbauwhede [Journal of Parallel Computing '14]: our technique maps capsules to threads, not threads to cores.

65 Conclusion [Figure: MPC abstractions a1–a5 are mapped to JVM threads; the OS Scheduler places the JVM threads onto cores core0–core3] MPC: Message-passing Concurrency; a1–a5 are MPC abstractions. Mapping Abstractions to JVM threads. Evaluated on Panini and Akka. Questions? Ganesha Upadhyaya and Hridesh Rajan. Supported in part by the NSF grants CCF , CCF , and CCF
