Parallel Code Generation of Synchronous Programs for a Many-core Architecture

Size: px

Start display at page:

Download "Parallel Code Generation of Synchronous Programs for a Many-core Architecture"

Bridget Burns
5 years ago
Views:

(Verimag), Matthieu Moy (LIP) Benoît Dupont de Dinechin (Kalray) Jan

Sophia-Antipolis, Aoste) Examiners: Anne Bouillard (ENS Paris) Alain

1 Parallel Code Generation of Synchronous Programs for a Many-core Architecture Amaury Graillat Supervisors: Reviewers: Pascal Raymond (Verimag), Matthieu Moy (LIP) Benoît Dupont de Dinechin (Kalray) Jan Reineke (Universität des Saarlandes) Robert de Simone (INRIA Sophia-Antipolis, Aoste) Examiners: Anne Bouillard (ENS Paris) Alain Girault (INRIA Grenoble) Christine Rochange (IRIT) Thesis defense, November 16th, /48

2 2/48 SPACE, WEIGHT, AND POWER

3 2/48 SPACE, WEIGHT, AND POWER

Processor feedback Time-critical: latency constraints are part of

4 3/48 SAFETY CRITICAL SYSTEMS Example of aircraft flight controller (control-command) Pilot (stick) Actuator Sensors (altitude) Processor feedback Time-critical: latency constraints are part of the specification (e.g. < some 10ms) Designed with eg., Synchronous Languages

5 4/48 THE SYNCHRONOUS DATA-FLOW LANGUAGES (1/2) Node 1 Node 4 Node 2 Node 6 Node 3 Node 5 pre init Lustre (academic), Scade (industrial) Network of nodes

6 5/48 THE SYNCHRONOUS DATA-FLOW LANGUAGES (2/2) Logical i n 1 = 1 i n = 3 i n+1 = 7... instant n-1 instant n instant n+1... i t o n 1 = 2 o n = 6 o n+1 = 14 2 o One execution is called a logical instant

7 THE SYNCHRONOUS DATA-FLOW LANGUAGES (2/2) Logical i n 1 = 1 i n = 3 i n+1 = 7... instant n-1 instant n instant n+1... i t o n 1 = 2 o n = 6 o n+1 = 14 One execution is called a logical instant 2 o Physical i n i n 1 i n+1 instant n-1 instant n instant n+1... Step Step Step... t o n 1 o n o n+1 Requirement: o n before i n+1. Worst-Case Execution Time (WCET) 5/48

8 6/48 LUSTRE/SCADE DELAY OPERATOR instant 1 instant 2 instant 3 a a=1 a=2 a=3 init=0 c x 2 b pre init=0 b=1 b=4 b=9 d c=1 c=2 c=3 d=0 d=1 d=4 t t + t e e=1 e=3 e=7 previous operator Logical/functional delay

9 7/48 CODE GENERATION FOR SINGLE-CORE Node 1 Node 2 Node 3 Formal semantics Determinism Code Generation Compilation.c for each period { i = sensors(); o = main_step(i); actuators(o); } Static schedule Synchronous languages void main_step(i) { Lustre, o1 = Heptagon, N1_step(i); SCADE o2 = N2_step(i); Industrial returns use: N3_step(o1, SCADE in Airbus o2); A380 } Single-core processor Static schedule N1 N2 N3... WCET

10 CODE GENERATION FOR SINGLE-CORE Node 1 Node 2 Node 3 Formal semantics Determinism Code Generation Compilation.c for each period { i = sensors(); o = main_step(i); actuators(o); } Static schedule Synchronous languages void main_step(i) { Lustre, o1 = Heptagon, N1_step(i); SCADE o2 = N2_step(i); Industrial returns use: N3_step(o1, SCADE in Airbus o2); A380 } Single-core processor Static schedule N1 N2 N3... WCET OK for sequential execution. What about parallel? 7/48

Independent arbiter for each memory bank.

11 THE KALRAY MPPA2 Core Properties of the Kalray MPPA2: Cores No complex branch prediction Only LRU caches Cluster Banked Shared-Memory (16*128ko) Independent arbiter for each memory bank. Network-on-Chip (NoC) between clusters Bandwidth limiter (Network calculus possible) Performance + Predictability Cluster (=multi-core) 8/48

12 9/48 OVERVIEW Kalray MPPA2 Bostan Node 1 Node 2 Node 3 Node 4 Node 5 Node 6 pre init

13 10/48 Part I: Semantics Preserving Parallelization Extraction of Parallelism Mapping / Scheduling Code Generation Communication Channels Implementation Part II: Real-Time Guarantees Part III: Evaluation and Conclusion

14 11/48 EXTRACTION OF PARALLELISM Contribution External tool N1 N2 N3 N4 N5 pre init N6 Parallelism Extraction Method 1: functional code N1.c N2.c N3.c N4.c N5.c N6.c Mapping + Non-preemptive scheduling Code Generation System + Communication Communication and Dependency graph N1 N2 N4 N6 N5 N3 NoC Routing In the top-level node: 1 node = 1 task (runnable). Generation of sequential code for each node Similar to Architecture Description Language (Prelude, Giotto) Method 2: based on fork-join: see manuscript Executable for Kalray

15 11/48 EXTRACTION OF PARALLELISM Contribution External tool N1 N2 N3 N4 N5 pre init N6 Parallelism Extraction Method 1: functional code N1.c N2.c N3.c N4.c N5.c N6.c Mapping + Non-preemptive scheduling Code Generation System + Communication Communication and Dependency graph N1 N2 N4 N6 N5 N3 NoC Routing In the top-level node: 1 node = 1 task (runnable). Generation of sequential code for each node Similar to Architecture Description Language (Prelude, Giotto) Method 2: based on fork-join: see manuscript Executable for Kalray

16 12/48 MAPPING / SCHEDULING Contribution External tool N1 N2 N3 N4 N5 pre init N6 functional code N1.c N2.c N3.c N4.c N5.c N6.c Mapping + Non-preemptive scheduling Parallelism Extraction Communication and Dependency graph N1 N2 N4 N6 N5 N3 NoC Routing Need non-preemptive static schedule External scheduling tool We can easily check the schedule Code Generation System + Communication Executable for Kalray

17 13/48 MAPPING/SCHEDULING: EXAMPLE Node 1 Node 4 Node 2 Node 6 Node 3 Node 5 Core 0: N1 ; N4 ; N6 Core 1: N2 ; N5 Core 2: N3 Schedule checked using the dependency graph

18 14/48 CODE GENERATION Contribution External tool N1 N2 N3 N4 N5 pre init N6 Parallelism Extraction functional code N1.c N2.c N3.c N4.c N5.c N6.c Communication and Dependency graph N1 N2 N3 N4 N6 N5 Mapping + Non-preemptive scheduling NoC Routing Code Generation System + Communication Executable for Kalray

19 14/48 CODE GENERATION Contribution External tool N1 N2 N3 N4 N5 pre init N6 Dependencies compiled into wait for input. functional code N1.c N2.c N3.c N4.c N5.c N6.c Mapping + Non-preemptive scheduling Parallelism Extraction Communication and Dependency graph N1 N2 N4 N6 N5 N3 NoC Routing Sketch of code for core 0. for each period{ wait_inputs_n1(); // N1 N1_step(); write_outputs_n1(); wait_inputs_n4(); // N4 N4_step(); write_outputs_n4(); Code Generation System + Communication Executable for Kalray } wait_inputs_n6(); // N6 N6_step(); write_outputs_n6();

20 15/48 COMMUNICATION CHANNELS Contribution External tool N1 N2 N3 N4 N5 pre init N6 functional code N1.c N2.c N3.c N4.c N5.c N6.c Mapping + Non-preemptive scheduling Parallelism Extraction Communication and Dependency graph N1 N2 N4 N6 N5 N3 NoC Routing Two kinds of communications: - instantaneous ( ) - delayed ( ) Code Generation System + Communication Executable for Kalray

21 INSTANTANEOUS COMMUNICATION Contribution External tool N1 N2 N3 N4 N5 pre init N6 void write_outputs_n2() { copy(n4.in2, N2.out); functional code N1.c N2.c N3.c N4.c N5.c N6.c Parallelism Extraction Communication and Dependency graph N1 N2 N4 N6 N5 N3 } copy(n5.in1, N2.out); void wait_for_inputs_n4() { Mapping + Non-preemptive scheduling NoC Routing Code Generation System + Communication Executable for Kalray } Efficient hardware synchronization Software-based cache coherency 16/48

22 INSTANTANEOUS COMMUNICATION Contribution External tool N1 N2 N3 functional code N1.c N2.c N3.c N4.c N5.c N6.c Mapping + Non-preemptive scheduling Code Generation System + Communication Executable for Kalray N4 N5 pre init Communication and Dependency graph N1 N2 N4 N6 Parallelism Extraction N6 N5 N3 NoC Routing void write_outputs_n2() { copy(n4.in2, N2.out); channel N2 N4 = true; // + cache management notify(core N4); copy(n5.in1, N2.out); } void wait_for_inputs_n4() { while(! channel N2 N4 ) { wait(); } channel N2 N4 = false; while(! channel_n1_n4) { wait(); } channel_n1_n4 = false; // + cache management } Efficient hardware synchronization Software-based cache coherency 16/48

23 17/48 COMMUNICATION CHANNELS Contribution External tool N1 N2 N3 N4 N5 pre init N6 functional code N1.c N2.c N3.c N4.c N5.c N6.c Mapping + Non-preemptive scheduling Parallelism Extraction Communication and Dependency graph N1 N2 N4 N6 N5 N3 NoC Routing Two kinds of communications: - instantaneous ( ) - delayed ( ) Code Generation System + Communication Executable for Kalray

24 18/48 DELAYED COMMUNICATION i1 i2 A pre B o1 init b o2 A S B Transformation into a SWAP + scheduling constraints. Scenario 1 Scenario 2 n-1 instant n n+1 n-1 instant n n+1 A.in S.in Y X A B S... A.in S.in X Y B A S... Constraint: B S Constraint: A S

25 18/48 DELAYED COMMUNICATION i1 i2 A pre B o1 init b o2 A S B Transformation into a SWAP + scheduling constraints. Scenario 1 Scenario 2 n-1 instant n n+1 n-1 instant n n+1 A.in S.in Y X A B S... A.in S.in X Y B A S... Constraint: B S Constraint: A S

26 18/48 DELAYED COMMUNICATION i1 i2 A pre B o1 init b o2 A S B Transformation into a SWAP + scheduling constraints. Scenario 1 Scenario 2 n-1 instant n n+1 n-1 instant n n+1 A.in S.in Y X A B S... A.in S.in X Y B A S... Constraint: B S Constraint: A S

27 19/48 CONCLUSION OF PART I Contribution External tool N1 N2 N3 N4 N5 pre init N6 Parallelism Extraction functional code N1.c N2.c N3.c N4.c N5.c N6.c Mapping + Non-preemptive scheduling Communication and Dependency graph N1 N2 N4 N6 N5 N3 NoC Routing Semantics preserved Next step: Temporal Guaranties Code Generation System + Communication Executable for Kalray

28 20/48 Part I: Semantics Preserving Parallelization Part II: Real-Time Guarantees Shared Memory Interference Time-Triggered Execution Model Real-Time Guarantees with Network-on-Chip Part III: Evaluation and Conclusion

29 21/48 INTERFERENCE AND REACTION TIME Core 0 Task A Core 1... M E M Core 0 Core 1 Task A WCET Core 15 Single-core: WCET is sufficient

30 INTERFERENCE AND REACTION TIME Core 0 WCET WCRT Task A Core 0 Task A M Core 1 E Request Access Task B M... Core 1 Task B Core 15 Request Access Single-core: WCET is sufficient Multi-core: WCET+interference on shared resources = Worst-Case Response Time (WCRT). WCET+interference too pessimistic in the general case exploit the execution model in the analysis 21/48

31 22/48 BANKED MEMORY Bank 0 Core 0 Bank 1 Core 1... Round Robin... Private arbiter for each bank Available in the Kalray MPPA2 (silicon)

32 23/48 BANKED MEMORY IN KALRAY MPPA2 Node 1 Node 4 Node 2 Node 6 Node 3 Node 5 Bank 0 for core 0 Core 0 N1 N4 N6 Bank 1 for Core 1 Core 1 N2 N5... Round Robin Choice: Code, input buffer, local variables are mapped in core s bank of the core Interference on communication only...

33 23/48 BANKED MEMORY IN KALRAY MPPA2 Node 1 Node 4 Node 2 Node 6 Node 3 Node 5 Bank 0 for core 0 Core 0 N1 N4 N6 Bank 1 for Core 1 Core 1 N2 N5... Round Robin Choice: Code, input buffer, local variables are mapped in core s bank of the core Interference on communication only...

34 23/48 BANKED MEMORY IN KALRAY MPPA2 Node 1 Node 4 Node 2 Node 6 Node 3 Node 5 Bank 0 for core 0 Core 0 N1 N4 N6 Communication Bank 1 for Core 1 Core 1 N2 N5... Round Robin Choice: Code, input buffer, local variables are mapped in core s bank of the core Interference on communication only...

35 BANKED MEMORY IN KALRAY MPPA2 Node 1 Node 4 Node 2 Node 6 Node 3 Node 5 Bank 0 for core 0 Core 0 N1 N4 N6 Communication Bank 1 for Core 1 Core 1 N2 N5... Round Robin Choice: Code, input buffer, local variables are mapped in core s bank of the core Interference on communication only How to activate tasks?... 23/48

36 24/48 PROBLEM WITH ASAP EXECUTION Core 0 Core 1 WCET A Task A WCET C Task C WCET B Task B WCET D Task D Deadline

37 24/48 PROBLEM WITH ASAP EXECUTION WCET A WCET B Core 0 Task A Task B WCET C WCET D Core 1 Task C Task D Deadline WCET B Faster execution of A interference between B and C. Core 0 WCET A Task A Task B Timing Anomaly WCET C WCET D Core 1 Task C Task D Deadline

38 24/48 PROBLEM WITH ASAP EXECUTION WCET A WCET B Core 0 Task A Task B WCET C WCET D Core 1 Task C Task D Deadline WCET B Faster execution of A interference between B and C. Core 0 WCET A Task A WCET C Task B WCET D Timing Anomaly Solution: release dates for tasks (time-triggered) Core 1 Task C Task D Deadline

39 TIME-TRIGGERED EXECUTION MODEL Principle Release = 0 Release = 427 WCRT A WCRT B Core 0 Task A Task B Core 1 Release = 0 WCRT C WCRT D Task C Task D Release = 398 Compute a static release date when data is guaranteed to be available Time-Triggered execution prevents tasks from starting earlier Implementation void wait_inputs_n1() { while(time() < t_period + release_date_n1){ // wait } } 25/48

40 26/48 MULTI-CORE INTERFERENCE ANALYSIS WCET Dependencies Memory access Mapping MIA Release dates of tasks WCRT Multi-Core Interference Analysis (MIA) Tool [Hamza Rihani, RTNS 2016] www-verimag.imag.fr/multi-core-interference-analysis.html

41 26/48 MULTI-CORE INTERFERENCE ANALYSIS WCET Dependencies Memory access Mapping MIA Release dates of tasks WCRT Multi-Core Interference Analysis (MIA) Tool [Hamza Rihani, RTNS 2016] www-verimag.imag.fr/multi-core-interference-analysis.html

42 FRAMEWORK Contribution External tool N1 N2 N3 N4 N5 pre init N6 functional code N1.c N2.c N3.c N4.c N5.c N6.c Mapping + Non-preemptive scheduling Parallelism Extraction Communication and Dependency graph N1 N2 N4 N6 N5 N3 NoC Routing + Rate Attribution WCET Analysis (Otawa, AiT) Computation of the release dates (MIA tool) Insertion in the executable. Code Generation System + Communication Executable for Kalray Network Calculus (WCTT) [Submitted: Graillat A., Rihani H., Maiza C., Moy M., Raymond P., Dupont de Dinechin B., Real-Time Systems Journal] release dates WCET Analysis MIA 27/48

43 28/48 OUR EXECUTION MODEL Static scheduling Bare metal, no interrupt, no preemption Banked memory mapping (1 core 1 bank) Time-triggered Deterministic and WCRT guarantee

44 OVERVIEW Contribution External tool N1 N2 N3 N4 N5 pre init N6 Parallelism Extraction functional code N1.c N2.c N3.c N4.c N5.c N6.c Communication and Dependency graph N1 N2 N3 N4 N6 N5 Mapping + Non-preemptive scheduling NoC Routing + Rate Attribution Code Generation System + Communication Executable for Kalray Network Calculus (WCTT) release dates WCET Analysis MIA 29/48

45 30/48 THE KALRAY MPPA2 S NOC flow 2 North East East North Local East, North, East, Local Router Rate limiter head route data TX buffer head route data DMA Core 0 MEM Core 0 RX buffer MEM Cluster A Cluster B

46 30/48 THE KALRAY MPPA2 S NOC North East East North Local East, North, East, Local flow 1 Router Rate limiter head route data TX buffer head route data DMA Core 0 MEM Core 0 RX buffer MEM Cluster A Cluster B

47 30/48 THE KALRAY MPPA2 S NOC flow 2 North East East North Full Duplex Links Local East, North, East, Local flow 1 Router Rate limiter head route data TX buffer head route data DMA Core 0 MEM Core 0 RX buffer MEM Cluster A Cluster B

48 30/48 THE KALRAY MPPA2 S NOC flow 2 North East North East rate = 1 2 B rate = 1 2 B Local East, North, East, Local flow 1 Router Rate limiter head route data TX buffer head route data DMA Core 0 MEM Core 0 RX buffer MEM Cluster A Cluster B

49 31/48 REAL-TIME GUARANTEES WITH NETWORK-ON-CHIP f Routing Route = sequence of directions computed by sender cluster router [Dupont de Dinechin B., Graillat A., NoCArc 2017]

50 31/48 REAL-TIME GUARANTEES WITH NETWORK-ON-CHIP f East East South 1. Routing Route = sequence of directions computed by sender Local South cluster router [Dupont de Dinechin B., Graillat A., NoCArc 2017]

51 31/48 REAL-TIME GUARANTEES WITH NETWORK-ON-CHIP f1 f1.1 f f cluster router 1. Routing Route = sequence of directions computed by sender Deadlock-free algorithms (XY, Hamiltonian) [Dupont de Dinechin B., Graillat A., NoCArc 2017]

52 31/48 REAL-TIME GUARANTEES WITH NETWORK-ON-CHIP f1 f1.2 f1.1 f f2.1 f cluster router 1. Routing Route = sequence of directions computed by sender Deadlock-free algorithms (XY, Hamiltonian) Algorithm with path diversity (Hamiltonian Odd Even (HOE)) [Dupont de Dinechin B., Graillat A., NoCArc 2017]

53 31/48 REAL-TIME GUARANTEES WITH NETWORK-ON-CHIP f1 f1.2 f1.1 f f2.1 f cluster router 1. Routing Route = sequence of directions computed by sender Deadlock-free algorithms (XY, Hamiltonian) Algorithm with path diversity (Hamiltonian Odd Even (HOE)) 2. Route Selection Choose one static route per flow Optimize performance and fairness [Dupont de Dinechin B., Graillat A., NoCArc 2017]

54 31/48 REAL-TIME GUARANTEES WITH NETWORK-ON-CHIP f1 f1.1 f f cluster router 1. Routing Route = sequence of directions computed by sender Deadlock-free algorithms (XY, Hamiltonian) Algorithm with path diversity (Hamiltonian Odd Even (HOE)) 2. Route Selection Choose one static route per flow Optimize performance and fairness e.g. {f1.1, f2.2}, [Dupont de Dinechin B., Graillat A., NoCArc 2017]

55 31/48 REAL-TIME GUARANTEES WITH NETWORK-ON-CHIP f1 f1.2 f f cluster router 1. Routing Route = sequence of directions computed by sender Deadlock-free algorithms (XY, Hamiltonian) Algorithm with path diversity (Hamiltonian Odd Even (HOE)) 2. Route Selection Choose one static route per flow Optimize performance and fairness e.g. {f1.1, f2.2}, {f1.2, f2.2}... [Dupont de Dinechin B., Graillat A., NoCArc 2017]

56 31/48 REAL-TIME GUARANTEES WITH NETWORK-ON-CHIP f1 f cluster router f f [Dupont de Dinechin B., Graillat A., NoCArc 2017] 1. Routing Route = sequence of directions computed by sender Deadlock-free algorithms (XY, Hamiltonian) Algorithm with path diversity (Hamiltonian Odd Even (HOE)) 2. Route Selection Choose one static route per flow Optimize performance and fairness e.g. {f1.1, f2.2}, {f1.2, f2.2} Rate attribution Fair attribution e.g. {0.5, 0.5},

57 31/48 REAL-TIME GUARANTEES WITH NETWORK-ON-CHIP f1 f cluster router f f [Dupont de Dinechin B., Graillat A., NoCArc 2017] 1. Routing Route = sequence of directions computed by sender Deadlock-free algorithms (XY, Hamiltonian) Algorithm with path diversity (Hamiltonian Odd Even (HOE)) 2. Route Selection Choose one static route per flow Optimize performance and fairness e.g. {f1.1, f2.2}, {f1.2, f2.2} Rate attribution Fair attribution e.g. {0.5, 0.5}, {1.0, 1.0}...

58 31/48 REAL-TIME GUARANTEES WITH NETWORK-ON-CHIP f1 f cluster router f f [Dupont de Dinechin B., Graillat A., NoCArc 2017] 1. Routing Route = sequence of directions computed by sender Deadlock-free algorithms (XY, Hamiltonian) Algorithm with path diversity (Hamiltonian Odd Even (HOE)) 2. Route Selection Choose one static route per flow Optimize performance and fairness e.g. {f1.1, f2.2}, {f1.2, f2.2} Rate attribution Fair attribution e.g. {0.5, 0.5}, {1.0, 1.0} Network Calculus Compute latency and buffer level

59 32/48 1. UNICAST ROUTING f1 Hamiltonian Routing Up-Network Down-Network Deadlock-free Path-Diversity

60 33/48 1. MULTICAST ROUTING PROBLEM f Deadlock-free heuristic to travelling salesman problem : One route? Tree is no possible with Kalray architecture Hamiltonian: minimum of two routes Source Destination

61 33/48 1. MULTICAST ROUTING PROBLEM f Deadlock-free heuristic to travelling salesman problem : One route? Tree is no possible with Kalray architecture Hamiltonian: minimum of two routes Source Destination

62 34/48 1. MULTICAST ROUTING: HAMILTONIAN Dual-Path + Hamiltonian without path diversity (Lin et al., 1992) Small path diversity, we can do better

63 35/48 1. MULTICAST ROUTING: HOE Even Odd Allowed with HOE Even Hamiltonian Odd Even (Bahrebar et al.) routing: allows some non-hamiltonian paths Our choice: Dual-Path + HOE

64 36/48 EVALUATION OF DEADLOCK-FREE MULTICAST ROUTING

65 EVALUATION OF DEADLOCK-FREE MULTICAST ROUTING Minimum rate vs. Network load (Higher is better) 1 Dual Path+Hamiltonian Dual Path+HOE Realistic network load Network Load Minimal rate increased of up to 19% with Dual Path HOE compared to Dual Path Hamiltonian. 36/48

66 37/48 2. ROUTE SELECTION f1 f1.2 f f1.3 cluster router f f2.1 f2.2 f [Boyer M., Dupont de Dinechin B., Graillat A., Havet L, ERTS 2018] Input: set of route for each flow Output: one route per flow Maximize fairness Naive exploration: Enumeration all combinations, run step 3 on each solution, keep the best. f2.1 f2.2 f2.3 f , , , 0.5 f , , , 1.0 f , , , 1.0 Enumeration of 9 combinations of possible routes

67 38/48 2. ROUTE SELECTION f1 f1.2 f2 f f f2.1 f2.2 f Input: set of route for each flow Output: one route per flow Maximize fairness Naive exploration: Enumeration all combinations (9 combinations) Exploration with pruning: Ignore combinations with non-minimal bottleneck cluster router [Boyer M., Dupont de Dinechin B., Graillat A., Havet L, ERTS 2018] f1.1 f1.2 f1.3 f2.1 f2.2 f2.3

68 38/48 2. ROUTE SELECTION f1 f1.2 f f1.3 cluster router f f2.1 f2.2 f [Boyer M., Dupont de Dinechin B., Graillat A., Havet L, ERTS 2018] Input: set of route for each flow Output: one route per flow Maximize fairness Naive exploration: Enumeration all combinations (9 combinations) Exploration with pruning: Ignore combinations with non-minimal bottleneck Apply rate attribution (step 3) f2.1 f2.2 f2.3 f1.1 f , , 1.0 f , , 1.0 Enumeration of 4 combinations

69 39/48 Exploration vs. Our Exploration with Pruning

70 39/48 Exploration vs. Our Exploration with Pruning

71 40/48 2. ROUTE SELECTION f1 f1.2 f f1.3 cluster router f f2.1 f2.2 f [Boyer M., Dupont de Dinechin B., Graillat A., Havet L, ERTS 2018] Input: set of route for each flow Output: one route per flow Maximize fairness Naive exploration: Enumeration all combinations (9 combinations) Our Exploration with pruning: Ignore non-minimal bottleneck (4 combinations) Our LP-based Heuristic: Consider all alternative routes at once 1. Attribute fair rates

72 40/48 2. ROUTE SELECTION f1 f f f1.3 cluster router 1 4 f f2.1 f f [Boyer M., Dupont de Dinechin B., Graillat A., Havet L, ERTS 2018] 1 4 Input: set of route for each flow Output: one route per flow Maximize fairness Naive exploration: Enumeration all combinations (9 combinations) Our Exploration with pruning: Ignore non-minimal bottleneck (4 combinations) Our LP-based Heuristic: Consider all alternative routes at once 1. Attribute fair rates

73 40/48 2. ROUTE SELECTION f1 f f f1.3 cluster router 1 4 f f2.1 f f [Boyer M., Dupont de Dinechin B., Graillat A., Havet L, ERTS 2018] 1 4 Input: set of route for each flow Output: one route per flow Maximize fairness Naive exploration: Enumeration all combinations (9 combinations) Our Exploration with pruning: Ignore non-minimal bottleneck (4 combinations) Our LP-based Heuristic: Consider all alternative routes at once 1. Attribute fair rates 2. Remove smallest alternative routes

74 40/48 2. ROUTE SELECTION f1 f f f1.3 cluster router f f2.1 f f [Boyer M., Dupont de Dinechin B., Graillat A., Havet L, ERTS 2018] Input: set of route for each flow Output: one route per flow Maximize fairness Naive exploration: Enumeration all combinations (9 combinations) Our Exploration with pruning: Ignore non-minimal bottleneck (4 combinations) Our LP-based Heuristic: Consider all alternative routes at once 1. Attribute fair rates 2. Remove smallest alternative routes

75 40/48 2. ROUTE SELECTION f1 f f f1.3 cluster router f f2.1 f f [Boyer M., Dupont de Dinechin B., Graillat A., Havet L, ERTS 2018] Input: set of route for each flow Output: one route per flow Maximize fairness Naive exploration: Enumeration all combinations (9 combinations) Our Exploration with pruning: Ignore non-minimal bottleneck (4 combinations) Our LP-based Heuristic: Consider all alternative routes at once 1. Attribute fair rates 2. Remove smallest alternative routes 3. Enumerate the 3 combinations

76 41/48 EVALUATION OF OUR LP-BASED HEURISTIC

77 41/48 EVALUATION OF OUR LP-BASED HEURISTIC

78 42/48 LP-BASED HEURISTIC VS. OPTIMAL EXPLORATION Optimal algorithms (100%): naive exploration, exploration with pruning Minimal rate as indicator of fairness

79 43/48 Part I: Semantics Preserving Parallelization Part II: Real-Time Guarantees Use Cases and Evaluation Conclusion Part III: Evaluation and Conclusion

80 THE ROSACE CASE STUDY Altitude-only flight controller Open source (Simulink, Lustre, Giotto) [Pagetti, Soussié, RTAS 14] Sequential Best effort pure +100 cycles +200 cycles +100: each task augmented with 100 cycles +200: each task augmented with 200 cycles Best effort (ASAP): no real-time guarantees Time-Triggered (TT): real-time guarantees 44/48

81 THE ROSACE CASE STUDY Altitude-only flight controller Open source (Simulink, Lustre, Giotto) [Pagetti, Soussié, RTAS 14] Sequential Best effort pure +100 cycles +200 cycles +100: each task augmented with 100 cycles +200: each task augmented with 200 cycles Best effort (ASAP): no real-time guarantees Time-Triggered (TT): real-time guarantees 44/48

82 45/48 SYNTHETIC BENCHMARK ON 64 CORES 3 phases: Dispatch, Compute, Gather 20 Bytes per flow, high network congestion for gather phase Compute Dispatch Gather 13 15

83 SYNTHETIC BENCHMARK ON 64 CORES 3 phases: Dispatch, Compute, Gather 20 Bytes per flow, high network congestion for gather phase 54% of WCRT for functional code 46% of WCRT for communication and system code 45/48

84 CONCLUSION Contribution External tool N1 N2 N3 N4 N5 pre init N6 Parallelism Extraction functional code N1.c N2.c N3.c N4.c N5.c N6.c Communication and Dependency graph N1 N2 N3 N4 N6 N5 Mapping + Non-preemptive scheduling NoC Routing + Rate Attribution Code Generation System + Communication Executable for Kalray Network Calculus (WCTT) release dates WCET Analysis MIA 46/48

85 CONCLUSION Contrib Externa Semantics init N6 Parallelism Extraction Preserving Code Communication and Dependency graph N1 N2 N4 N6 N5 N3 Generation NoC Routing + Rate Attribution Executable for Kalray Network Calculus (WCTT) release dates WCET Analysis MIA 46/48

86 CONCLUSION Contrib Externa Semantics init N6 Parallelism Extraction Preserving Code Communication and Dependency graph N1 N2 N4 N6 N5 N3 Generation NoC Routing + Rate Attribution Time-triggered Network Calculus (WCTT) release dates Execution Model 46/48

87 CONCLUSION Contrib Externa Semantics Preserving Code Hard realtime Guarantees Generation Many-Core Time-triggered Performance release dates Execution Model 46/48

88 CONCLUSION Contrib Externa Semantics Preserving Code Generation Time-triggered Hard realtime Guarantees Many-Core Performance Bridge between academic and industry release dates Execution Model 46/48

89 47/48 FUTURE WORK Semantics Preserving Code Hard realtime Guarantees Generation Many-Core Time-triggered Performance Execution Model

90 47/48 FUTURE WORK Semantics Preserving Code Hard realtime Guarantees Generation Release date computation and memory interferences analysis with NoC Time-triggered Execution Model Many-Core Performance

91 47/48 FUTURE WORK Semantics Memory size problem (overlays?) Release date computation and memory interferences analysis with NoC Preserving Code Generation Time-triggered Execution Model Hard realtime Guarantees Many-Core Performance

92 47/48 FUTURE WORK Input/Output Management Semantics Memory size problem (overlays?) Release date computation and memory interferences analysis with NoC Preserving Code Generation Time-triggered Execution Model Hard realtime Guarantees Many-Core Performance

93 47/48 FUTURE WORK Input/Output Management Semantics Memory size problem (overlays?) Preserving Code Generation Hard realtime Guarantees What should be the standard real-time multi-core processor? Many-Core Release date computation Time-triggered Performance and memory interferences Execution analysis with NoC Model

94 FUTURE WORK Input/Output Management Semantics Memory size problem (overlays?) Preserving Code Generation Hard realtime Guarantees What should be the standard real-time multi-core processor? Many-Core Performance Release date computation Time-triggered and memory interferences Execution analysis with NoC Model Thank you for your attention. Questions? 47/48

95 48/48 Published Graillat A., Dupont de Dinechin B., DATE 2018 Parallel Code Generation of Synchronous Programs for a Many-core Architecture. Boyer M., Dupont de Dinechin B., Graillat A., Havet L, ERTS 2018 Computing Routes and Delay Bounds for the Network-on-Chip of the Kalray MPPA2 Processor. Dupont de Dinechin B., Graillat A., NoCArc 2017 Feed-Forward Routing for the Wormhole Switching Network-on-Chip of the Kalray MPPA2-256 Processor. Dupont de Dinechin B., Graillat A., AISTECS 2017 Network-on-Chip Service Guarantees on the Kalray MPPA-256 Bostan Processor. Submitted Graillat A., Rihani H., Maiza C., Moy M., Raymond P., Dupont de Dinechin B., Real-Time Systems Journal Implementation Framework for Real-Time Data-Flow Synchronous Programs on Many-Cores. Graillat A, Maiza C., Moy M., Raymond P., Dupont de Dinechin B., DATE 2019 Response Time Analysis of Dataflow Applications on a Many-Core Processor with Shared-Memory and Network-on-Chip.

96 48/48 REFERENCES [1] Bertsekas, D. P., Gallager, R. G., and Humblet, P. (1992). Data networks, vol. 2. Prentice Hall International, Englewood Cliffs, New Jersey, 7632: [2] Chen, S. and Nahrstedt, K. (1998). Maxmin fair routing in connection-oriented networks. In Proc. Euro-Parallel and Distributed Systems Conf, pages

97 48/48 Backup

98 48/48 DEADLOCK IN WORMHOLE NETWORKS A R1 R2 R4 R3 Deadlock-freeness can be ensured at routing time B Solutions: XY, Hamiltonian Odd-Even, Turn Prohibition, etc This instance of flows deadlocks A wormhole packet is spread along the route Links 1-4 and 3-2 are shared A holds 1-4 but waits for 3-2 B holds 3-2 but waits for 1-4

99 48/48 MAX MIN FAIR RATE ATTRIBUTION Rate limiter configuration to avoid buffer overflow Rate of a flow f i noted ρ i f1 f2 f3 f Valid: for each link, i ρ i 1 flit/cycle Fair attribution: max-min fairness [2]: cannot increase a rate without decreasing an already smaller (or equal) rate cluster router

100 48/48 MAX MIN FAIR RATE ATTRIBUTION Rate limiter configuration to avoid buffer overflow Rate of a flow f i noted ρ i f1 f2 f3 f Valid: for each link, i ρ i 1 flit/cycle Fair attribution: max-min fairness [2]: cannot increase a rate without decreasing an already smaller (or equal) rate Example 1: f1 = 1 2, f2 = 1 2, f3 = 1 4, f4 = cluster router

101 48/48 MAX MIN FAIR RATE ATTRIBUTION Rate limiter configuration to avoid buffer overflow Rate of a flow f i noted ρ i f1 f2 f3 f Valid: for each link, i ρ i 1 flit/cycle Fair attribution: max-min fairness [2]: cannot increase a rate without decreasing an already smaller (or equal) rate Example 1: f1 = 1 2, f2 = 1 2, f3 = 1 4, f4 = 1 4 Valid but not max-min fair (since increasing f3 or f4 can be done by reducing the greater f2) cluster router

102 48/48 MAX MIN FAIR RATE ATTRIBUTION Rate limiter configuration to avoid buffer overflow Rate of a flow f i noted ρ i f1 f2 f3 f Valid: for each link, i ρ i 1 flit/cycle Fair attribution: max-min fairness [2]: cannot increase a rate without decreasing an already smaller (or equal) rate Example 1: f1 = 1 2, f2 = 1 2, f3 = 1 4, f4 = 1 4 Valid but not max-min fair (since increasing f3 or f4 can be done by reducing the greater f2) cluster router Example 2: f1 = 2 3, f2 = 1 3, f3 = 1 3, f4 = 1 3

103 MAX MIN FAIR RATE ATTRIBUTION Rate limiter configuration to avoid buffer overflow Rate of a flow f i noted ρ i f1 f2 f3 f Valid: for each link, i ρ i 1 flit/cycle Fair attribution: max-min fairness [2]: cannot increase a rate without decreasing an already smaller (or equal) rate Example 1: f1 = 1 2, f2 = 1 2, f3 = 1 4, f4 = 1 4 Valid but not max-min fair (since increasing f3 or f4 can be done by reducing the greater f2) cluster router Example 2: f1 = 2 3, f2 = 1 3, f3 = 1 3, f4 = 1 3 Valid and max-min fair (increasing f2, f3 or f4 cannot be done without reducing one of them) Classical solution: the Water Filling algorithm [1] 48/48

104 48/48 DETERMINISTIC NETWORK CALCULUS (DNC) PRINCIPLE f1 f2 f3 f cluster router b r R r b delay backlog R t T t T γ r,b (t) β R,T (t) (β R,T γ r,b )(t) Arrival Curve Service Curve Convolution t Arrival curve is a maximum traffic entering the network Service curve is a minimum traffic handeled by the network How to compute service curve?

105 48/48 KALRAY MPPA2 NETWORK-ON-CHIP f1 f2 f3 f4 f3= f2= 1 3 f4= f1= cluster router FIFO Round-Robin... f1 f3 f2 f4 Kalray MPPA2 Network Elements 7 11

106 48/48 SEPARATED FLOW ANALYSIS (1/2) f3= 1 3 f2= 1 3 f4= 1 3 f1= FIFO Round-Robin... f1 f3 f2 f Separated Flow Analysis: Compute the service curve of each network element Compute the successive arrival curves at each network element Convolution of the element service curves network service curve

107 48/48 SEPARATED FLOW ANALYSIS (2/2) f3= 1 3 f2= 1 3 f4= 1 3 f1= FIFO Round-Robin... f1 f3 f2 f Service curve offered to f2? At routers 1, 2, 3, 7 and 11.

108 48/48 BLIND MULTIPLEXING f2= 1 3 f1= 2 3 R=1 - b1 r1 = R-r1 Output link Peak rate Arrival curve of f1 T = b1 R r1 Service Curve for f2 No information about the arbitration: consider f2 is low priority. Can we do better?

109 48/48 ROUND ROBIN MULTIPLEXING f2= 1 3 f1= 2 3 N=2 Packets of size l max Restriction: Rate R (not applicable to f1) Blind multiplexing is the conservative solution. T = (N 1)l max Service Curve for f2 R = 1 N

110 WORST-CASE TRAVERSAL TIME (WCTT): APPLICATION (1/2) f3= 1 3 f2= 1 3 f4= 1 3 f1= FIFO Round-Robin... f1 f3 f2 f Service curve offered to f2? Router 1: Round Robin multiplexing (N=2) Router 2: non active (alone) Router 3: Round Robin multiplexing (N=2, f3 and f4 are aggregated with b a = i 2 b i, r a = i 2 r i) Router 7 and 11: non active 48/48

111 48/48 WORST-CASE TRAVERSAL TIME (WCTT): APPLICATION (2/2) f3= 1 3 f2= 1 3 f4= 1 3 f1= FIFO Round-Robin... f1 f3 f2 f f1 f2 f3 f4 WCTT (cycles) With packets of l max =17 flits.

112 48/48 EVALUATION OF OUR LP-BASED HEURISTIC

113 48/48 EVALUATION OF OUR LP-BASED HEURISTIC

114 48/48 LP-BASED HEURISTIC VS. OPTIMAL EXPLORATION Optimal algorithms (100%): naive exploration, exploration with pruning Minimal rate as indicator of fairness

115 48/48 [1] Bertsekas, D. P., Gallager, R. G., and Humblet, P. (1992). Data networks, vol. 2. Prentice Hall International, Englewood Cliffs, New Jersey, 7632: [2] Chen, S. and Nahrstedt, K. (1998). Maxmin fair routing in connection-oriented networks. In Proc. Euro-Parallel and Distributed Systems Conf, pages

Response Time Analysis of Synchronous Data Flow Programs on a Many-Core Processor

Response Time Analysis of Synchronous Data Flow Programs on a Many-Core Processor Hamza Rihani, Matthieu Moy, Claire Maiza, Robert Davis, Sebastian Altmeyer To cite this version: Hamza Rihani, Matthieu