NoCo: ILP-based Worst-Case Contention Estimation for Mesh Real-Time Manycores


1 NoCo: ILP-based Worst-Case Contention Estimation for Mesh Real-Time Manycores. Jordi Cardona (1,2), Carles Hernandez (1), Enrico Mezzetti (1), Jaume Abella (1) and Francisco J. Cazorla (1,3). (1) Barcelona Supercomputing Center (BSC); (2) Universitat Politècnica de Catalunya (UPC); (3) IIIA-CSIC. December 13th, Nashville, USA. 39th IEEE Real-Time Systems Symposium (RTSS 2018).


2 Critical Real-Time Embedded Systems (CRTES). Used in industries like avionics, railway and space. Require functional correctness and timing correctness. Validation & Verification (V&V) process: need to provide evidence against the safety standards. Avionics: DO-178B/C. Automotive: ISO


3 Increasing Performance Needs in CRTES. New software implements complex functionalities: complex AI algorithms, managing huge amounts of data. Performance needs increase significantly: ARM predicts that the performance requirements of ADAS will grow 100x from 2016 to 2024. Autonomous driving.


4 Covering high-performance needs. How to deliver the performance needed by CRTES software in an efficient way? Embrace high-performance hardware coming from the mainstream market: multicores and manycores, caches, networks on chip (NoCs), accelerators. SnapDragon (automotive), Nvidia Pascal (automotive), Kalray MPPA-256 (aviation).

5 The other side of the coin. High-performance (complex) hardware complicates timing analysis, i.e. deriving WCET estimates for tasks. Source of the problem: contention, which must be bounded and reduced. Worst-case Contention Delay (WCD). Worst-Case Execution Time (WCET). Figure: 2x2 2D mesh with 4 cores.

6 Related work. Real-time specific NoC designs: provide contention-free NoCs that are easy to V&V; do not scale well (bad average performance in general); high costs for being adopted in industry. Wormhole NoC designs (wNoC): best-effort wormhole NoCs (wormhole switching); used in commercial off-the-shelf processors (low costs for industry); more difficult to derive upper bounds (can be very pessimistic). Optimize parameters of these NoCs: mapping, routing, bandwidth distribution.

7 Worst-Case Execution Time (WCET): ZLL. WCET = f(ZLL, WCD). Zero Load Latency (ZLL) = f(distance): 1. Mapping. Figure: example flows of 3 hops and 5 hops.
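The dependence of ZLL on distance can be sketched in a few lines. This is only an illustration under stated assumptions: routers are addressed by hypothetical (x, y) coordinates, distance is the Manhattan hop count between routers (the slide's figure may count hops differently), and `cycles_per_hop` and `fixed_overhead` are made-up parameters, not values from the talk.

```python
def hops(src, dst):
    """Manhattan distance between two (x, y) router coordinates in a 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def zero_load_latency(src, dst, cycles_per_hop=1, fixed_overhead=0):
    """Toy ZLL model: latency grows linearly with hop distance.

    cycles_per_hop and fixed_overhead are illustrative assumptions.
    """
    return fixed_overhead + cycles_per_hop * hops(src, dst)


if __name__ == "__main__":
    # A task mapped far from its target node pays more even with no contention.
    print(zero_load_latency((0, 0), (2, 1)))
    print(zero_load_latency((0, 0), (0, 1)))
```

Under this model, changing the mapping only changes `src`, which is why mapping is the parameter tied to ZLL.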

8 Worst-Case Execution Time (WCET). WCET = f(ZLL, WCD). Zero Load Latency (ZLL) = f(distance): 1. Mapping. Worst-case Contention Delay (WCD) = f(routing, arbitration): 2. Routing; 3. Bandwidth weighted allocation (walloc).

9 Worst-Case Execution Time (WCET): WCD. WCET = f(ZLL, WCD). Zero Load Latency (ZLL) = f(distance): 1. Mapping. Worst-case Contention Delay (WCD) = f(routing, arbitration): 2. Routing; 3. Bandwidth weighted allocation (walloc). Figures: 3x3 mesh flows mapping using XY; 3x3 mesh flows mapping using an XY-YX combination.
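To make the XY versus YX distinction concrete, the following sketch lists the routers a packet traverses under each policy. It assumes routers are addressed by hypothetical (x, y) coordinates; the function names are illustrative and do not come from the talk.

```python
def xy_route(src, dst):
    """Routers visited when travelling along X first, then along Y."""
    x, y = src
    path = [(x, y)]
    step = 1 if dst[0] > x else -1
    while x != dst[0]:
        x += step
        path.append((x, y))
    step = 1 if dst[1] > y else -1
    while y != dst[1]:
        y += step
        path.append((x, y))
    return path


def yx_route(src, dst):
    """Routers visited when travelling along Y first, then along X.

    Implemented by swapping coordinates, routing XY, and swapping back.
    """
    swapped = xy_route((src[1], src[0]), (dst[1], dst[0]))
    return [(x, y) for (y, x) in swapped]


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
    print(yx_route((0, 0), (2, 1)))
```

Both routes are minimal (same hop count), but they cross different links, so the set of flows that contend on each link depends on which policy each flow uses.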

10 Worst-Case Execution Time (WCET): WCD. WCET = f(ZLL, WCD). Zero Load Latency (ZLL) = f(distance): 1. Mapping. Worst-case Contention Delay (WCD) = f(routing, arbitration): 2. Routing; 3. Bandwidth weighted allocation (walloc). WCET is affected by all three parameters: mapping, routing and walloc. Figure: 2x2 2D mesh, XY flows mapping; panels: RR arbitration, weighted mesh arbitration (WRR); values shown: WCD = 15, WCD = 10.


11 Parameters are inter-dependent. WCET = f(ZLL, WCD) = f(mapping, routing, walloc). Optimizing each parameter individually or in pairs does not provide a globally optimal configuration, just a local one. All the NoC parameters need to be optimized at the same time: mapping constraints, routing constraints, bandwidth constraints.

12 Our proposal: NoCo. Given a workload (tasks) and a wormhole mesh NoC configuration, NoCo optimizes the WCET of applications by finding the best mesh configuration: mapping, routing and weights allocation (walloc). NoCo uses stochastic exploration to optimize routing and Integer Linear Programming (ILP) to optimize mapping and walloc.

13 Agenda. Introduction and motivation; background and problem analysis; NoCo: stochastic/ILP model; evaluation; conclusions.

14 PROPOSAL: NOCO

15 Approach/Concept: NoCo optimization framework. Routing (stochastic): generates random routes and passes them to the ILP optimizer. Placement and walloc (Integer Linear Programming, ILP): placement and walloc are optimized for each routing. Selection of the best setup. Figure: route generation (stochastic random selection), mapping and walloc optimization (ILP), best configuration.

16 NoCo Proposal. Problem description. Tasks information: Execution Time Observed (ETO), memory accesses. NoC information: target node location, number of routers. Constraints (only one task to each core). Figure: main stages of NoCo framework.


17 NoCo Proposal: Routing. Stochastic exploration: randomly generate routing configurations. Minimal-distance routing policies. Deterministic routing policies (i.e. XY, YX). Deadlock avoidance: prohibiting certain turns (no cycles). Figure: main stages of NoCo framework.
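The random generation of routing configurations could look roughly like the sketch below. It assumes that a routing configuration assigns either XY or YX to each node, which is suggested by the 2^9 and 2^16 population sizes quoted for 3x3 and 4x4 meshes but is not spelled out in the slides. The function names are hypothetical, and the sketch does not model the turn restrictions used for deadlock avoidance.

```python
import random


def random_routing(num_nodes, rng):
    """One routing configuration: an XY or YX policy chosen per node."""
    return tuple(rng.choice(("XY", "YX")) for _ in range(num_nodes))


def sample_routings(num_nodes, num_samples, seed=0):
    """Draw distinct random routing configurations (sampling without repeats).

    The number of samples is capped at the population size 2**num_nodes.
    """
    rng = random.Random(seed)
    target = min(num_samples, 2 ** num_nodes)
    seen = set()
    while len(seen) < target:
        seen.add(random_routing(num_nodes, rng))
    return sorted(seen)


if __name__ == "__main__":
    routings = sample_routings(num_nodes=9, num_samples=5)
    for routing in routings:
        print(routing)
```

Each sampled tuple would then be handed to the mapping and walloc optimizer as a fixed input.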

18 NoCo Proposal: Routing. Random sampling (finite population). C = probability that one of the top X% routes is not in the random sample. The probability of having at least one routing from the top 1% of routings in a sample of size 1000 is 1-0, = 0, (99,99%). Figure: routings ranked from worst to best, with the top 1% and 0,1% marked.
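The sampling argument can be checked numerically. The sketch below assumes independent draws (sampling with replacement), for which the miss probability is C = (1 - X)^n; for a finite population sampled without replacement the miss probability is slightly smaller, so this is a conservative approximation. The function names are illustrative.

```python
import math


def prob_hit_top_fraction(top_fraction, sample_size):
    """Probability that at least one sample falls in the top fraction.

    Assumes independent draws: miss probability C = (1 - X) ** n.
    """
    miss = (1.0 - top_fraction) ** sample_size
    return 1.0 - miss


def samples_needed(top_fraction, max_miss_prob):
    """Smallest n such that (1 - X) ** n <= max_miss_prob."""
    return math.ceil(math.log(max_miss_prob) / math.log(1.0 - top_fraction))


if __name__ == "__main__":
    # Top 1% of routings, sample of size 1000 (the case quoted on the slide).
    print(f"{prob_hit_top_fraction(0.01, 1000):.4%}")
    # Samples needed to miss the top 1% with probability at most 1%.
    print(samples_needed(0.01, 0.01))
```

The key point is that the required sample size depends on the target fraction and confidence, not on the total number of routings.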

19 NoCo Proposal: Stochastic Routing. It stochastically guarantees finding one of the best routing solutions at low cost (without exploring all the possible routings). With 330 samples out of 2^16 = 65536 routings, it finds the best routing (0,5% of the population). Figure: main stages of NoCo framework.

20 NoCo Proposal: Mapping and Walloc. ILP optimization. Figure: main stages of NoCo framework.

21 NoCo Proposal: ILP model. Objective function (parallel applications): the WCET of the application is determined by the WCET of the slowest thread. Figure labels: W_C1, W_C2, W_C3, W_C3.

22 NoCo Proposal: ILP model. Compute WCET: bandwidth and WCD modeling. Inputs: WCET in isolation; number of memory accesses; BW distribution constraints from the routing configuration; path flows mapping from the routing configuration.
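As a rough illustration of how a per-task WCET and the min-max objective fit together, the sketch below uses a deliberately simplified toy model: WCET = time in isolation + number of memory accesses x a per-access delay that depends on the core. It then searches mappings by brute force. This is not the ILP formulation from the talk; the cost model, the brute-force search and all names are assumptions chosen to keep the example self-contained.

```python
from itertools import permutations


def toy_wcet(isolation_time, accesses, delay_per_access):
    """Toy model: WCET grows with memory accesses times per-access NoC delay."""
    return isolation_time + accesses * delay_per_access


def best_mapping(tasks, core_delay):
    """Brute-force stand-in for the optimizer: minimize the slowest task's WCET.

    tasks:      list of (isolation_time, accesses) per task
    core_delay: assumed per-access delay (ZLL + WCD) seen from each core
    Returns (max_wcet, mapping) where mapping[t] is the core of task t.
    """
    best = None
    for mapping in permutations(range(len(core_delay)), len(tasks)):
        worst = max(
            toy_wcet(iso, acc, core_delay[core])
            for (iso, acc), core in zip(tasks, mapping)
        )
        if best is None or worst < best[0]:
            best = (worst, mapping)
    return best


if __name__ == "__main__":
    tasks = [(1000, 50), (1000, 10)]  # second task touches the NoC far less
    core_delay = [4, 12]              # core 0 is assumed closer to the target node
    print(best_mapping(tasks, core_delay))
```

In this toy case the memory-intensive task ends up on the core with the smaller delay, which is the intuition behind optimizing the slowest thread rather than the average.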


23 NoCo Proposal: ILP model. Compute WCET. Routing rules: bandwidth distribution and path restrictions, encoded in Boolean matrices. Other restrictions: each task is assigned to one core; one core can only run one task; BW assigned to each core > 0.0; WCD of all tasks > 0.0; total BW in the mesh must be 1.
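The "other restrictions" listed on this slide can be expressed as a simple feasibility check on a candidate solution. The sketch below is a plain validator, not ILP constraints, and the data layout (`mapping`, `weights`, `wcds`) is an assumption made for illustration.

```python
def check_constraints(mapping, weights, wcds, tol=1e-9):
    """Check a candidate solution against the restrictions listed on the slide.

    mapping[t] = core assigned to task t (hypothetical layout)
    weights[c] = bandwidth share assigned to core c
    wcds[t]    = worst-case contention delay of task t
    """
    num_cores = len(weights)
    cores_exist = all(0 <= core < num_cores for core in mapping)
    # Each task has exactly one entry, so uniqueness means one task per core.
    one_task_per_core = len(set(mapping)) == len(mapping)
    positive_bw = all(w > 0.0 for w in weights)
    positive_wcd = all(d > 0.0 for d in wcds)
    total_bw_is_one = abs(sum(weights) - 1.0) <= tol
    return (cores_exist and one_task_per_core and positive_bw
            and positive_wcd and total_bw_is_one)


if __name__ == "__main__":
    print(check_constraints([0, 1, 2], [0.5, 0.3, 0.2], [3.0, 1.0, 2.0]))
    print(check_constraints([0, 0, 2], [0.5, 0.3, 0.2], [3.0, 1.0, 2.0]))
```

In an actual ILP these conditions would be stated as linear constraints over binary assignment variables and continuous weight variables; the check above only shows what a feasible solution must satisfy.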

24 NoCo Proposal: Stochastic + ILP model. Local solutions: each provides the WCET of each task, the mapping and the bandwidth distribution (arbitration weights). Post-processing (minimum WCET) selects the global solution. Figure: main stages of NoCo framework.
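The overall flow (one local solution per sampled routing, then a minimum-WCET selection) can be sketched as a small driver loop. The optimizer is passed in as a function, so nothing here depends on a particular ILP solver; `optimize_for_routing` and the dictionary keys are hypothetical names, and the stand-in optimizer in the usage example returns made-up numbers purely to exercise the loop.

```python
def select_global_solution(routings, optimize_for_routing):
    """Run the per-routing optimizer and keep the solution with minimum max WCET.

    optimize_for_routing(routing) is assumed to return a dict with at least
    the key "max_wcet" (plus, e.g., "mapping" and "weights").
    """
    best_routing, best_solution = None, None
    for routing in routings:
        local = optimize_for_routing(routing)
        if best_solution is None or local["max_wcet"] < best_solution["max_wcet"]:
            best_routing, best_solution = routing, local
    return best_routing, best_solution


if __name__ == "__main__":
    # Stand-in optimizer with invented costs, only to show the selection step.
    fake_costs = {("XY", "XY"): 1500, ("XY", "YX"): 1200, ("YX", "YX"): 1350}

    def fake_optimizer(routing):
        return {"max_wcet": fake_costs[routing], "mapping": None, "weights": None}

    print(select_global_solution(list(fake_costs), fake_optimizer))
```

Because each routing is optimized independently, the per-routing calls could also run in parallel before the final selection.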

25 EVALUATION

26 Evaluation. Cycle-accurate simulator: SoCLib simulator integrated with gnocsim. Benchmarks: the key parameter is the frequency of access to the NoC for loads/stores. Workloads: cover the range shown for Mediabench and EEMBC auto; MIX benchmarks (e.g. MIX1 => ABCDEFFGH).

27 Evaluation: impact of optimizing each parameter. Incremental evaluation. NoC configurations: Static-base (RR) and Static-opt (WRR) as baselines; Map, Map + Walloc and Map + Walloc + R as NoCo versions. ILP optimizations: routing, weights, mapping. Table: optimization versions evaluated.

28 Results: Incremental Evaluation. Static_opt (WRR) vs Static_base (RR): -1%, 16%. Figure: effect of incremental optimizations: mapping, walloc and routing (3x3 heterogeneous workloads).

29 Results: Incremental Optimizations. Map vs Static-base (RR): 17%, 30%, 23%. Figure: effect of incremental optimizations: mapping, walloc and routing (3x3 heterogeneous workloads).

30 Results: Incremental Optimizations. Map_Walloc vs Static-base (RR): 31%, 41%, 37%. Figure: effect of incremental optimizations: mapping, walloc and routing (3x3 heterogeneous workloads).

31 Results: Incremental Optimizations. Map_Walloc_Routing (NoCo) vs XY_RR: 9%, 14%, 50%, 40%, 46%, 23%, 3%. Figure: effect of incremental optimizations: mapping, walloc and routing (3x3 heterogeneous workloads).

32 CONCLUSIONS

33 Conclusions. Optimizing the NoC to reduce WCET is a multidimensional problem: Zero Load Latency (mapping) and Worst-Case Delay (routing and arbitration). Some proposals in the state of the art optimize one or a combination of the parameters that affect the WCET of applications. We propose NoCo, a stochastic/ILP hybrid solution that optimizes at the same time: routing (XY, YX combinations), arbitration (walloc) and application mapping. NoCo reduces the maxwcet of heterogeneous tasks in 3x3 meshes by between 40% and 50% with respect to the XY-RR configuration.


35 BACKUP

36 Reliability of stochastic method. Random routing vs optimal routing. Populations: 2^9 = 512 routings; 2^16 = 65536 routings. Figure values: 88,7%, 94,6%, 98,5%; sample sizes: 100 samples, 330 samples; best solution in fifth examples.


37 Reliability of stochastic method. Improvement in homogeneous tasks running in parallel: reducing the max WCET of all tasks. rilp (9 threads): avg. w.r.t. RR 74%, avg. w.r.t. WRR 26%. rilp (16 threads): avg. w.r.t. RR 88%, avg. w.r.t. WRR 29%. Figures: rilp(m,w) maxwcet results for 9 threads; rilp(m,w) maxwcet results for 16 threads.


38 Reliability of stochastic method. Improvement in heterogeneous tasks running in parallel: reducing the summation of the max WCET of all tasks. rilp (9 threads): avg. w.r.t. RR 26%, avg. w.r.t. WRR 19%. rilp (16 threads): avg. w.r.t. RR 30%, avg. w.r.t. WRR 23%. Figures: rilp(m,w) sumwcet results for 9 threads; rilp(m,w) sumwcet results for 16 threads.
