NoCo: ILP-based Worst-Case Contention Estimation for Mesh Real-Time Manycores

1 NoCo: ILP-based Worst-Case Contention Estimation for Mesh Real-Time Manycores. Jordi Cardona (1,2), Carles Hernandez (1), Enrico Mezzetti (1), Jaume Abella (1) and Francisco J. Cazorla (1,3). (1) Barcelona Supercomputing Center (BSC); (2) Universitat Politècnica de Catalunya (UPC); (3) IIIA-CSIC. 39th IEEE Real-Time Systems Symposium (RTSS 2018), December 13th, Nashville, USA.

2 Critical Real-Time Embedded Systems. Used in industries like: avionics, railway, space. Require: functional correctness; timing correctness. Validation & Verification (V&V) process: need to provide evidence against the safety standards (avionics: DO-178B/C; automotive: ISO).

3 Increasing Performance Needs in CRTES. New software implements complex functionalities: complex AI algorithms; management of huge amounts of data. Performance needs increase significantly: ARM predicts that the performance requirements of ADAS will grow 100x from 2016 to 2024 (autonomous driving).

4 Covering high-performance needs. How to deliver the performance needed by CRTES software in an efficient way? Embrace high-performance hardware coming from the mainstream market: multicores and manycores; caches; networks on chip (NoCs); accelerators. Examples: Snapdragon (automotive), Nvidia Pascal (automotive), Kalray MPPA-256 (aviation).

5 The other side of the coin. High-performance (complex) hardware complicates timing analysis, i.e. deriving WCET estimates for tasks. Source of the problem: contention, which must be bounded and reduced. Terms: Worst-case Contention Delay (WCD); Worst-Case Execution Time (WCET). [Figure: 2x2 2D mesh with 4 cores]

6 Related work. Real-time-specific NoC designs: provide contention-free NoCs that are easy to V&V; do not scale well (bad average performance in general); high costs for adoption in industry. Wormhole NoC designs (wNoC): best-effort wormhole NoCs (wormhole switching); used in commercial off-the-shelf processors (low costs for industry); upper bounds are more difficult to derive (and can be very pessimistic); the parameters of these NoCs can be optimized: mapping, routing, bandwidth distribution.

7 Worst Case Execution Time (WCET) - ZLL. WCET = f(ZLL, WCD). Zero Load Latency (ZLL) = f(distance): 1. Mapping. [Figure: example mappings at 3 hops and 5 hops]
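
As a rough illustration of ZLL = f(distance), the sketch below counts the hops between a core and a target node in a 2D mesh using the Manhattan distance, which is the hop count of any minimal route. The function names, the coordinate convention, the target location and the linear latency model (a fixed cost plus a per-hop cost) are assumptions made for illustration; they are not taken from the talk.

```python
def hops(src, dst):
    """Number of router-to-router hops on a minimal route in a 2D mesh."""
    (sx, sy), (dx, dy) = src, dst
    return abs(sx - dx) + abs(sy - dy)


def zero_load_latency(src, dst, cycles_per_hop=1, fixed_cycles=0):
    """Hypothetical linear ZLL model: a fixed cost plus a cost per hop."""
    return fixed_cycles + cycles_per_hop * hops(src, dst)


if __name__ == "__main__":
    target = (0, 0)  # assumed location of the target node
    for core in [(1, 2), (2, 3)]:
        # cycles_per_hop=3 is an arbitrary illustrative value
        print(core, hops(core, target), zero_load_latency(core, target, cycles_per_hop=3))
```

Under this model, moving a task from a core 5 hops away to one 3 hops away lowers its ZLL in proportion to the two hops saved, which is why mapping is the parameter that controls ZLL.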

8 Worst Case Execution Time (WCET). WCET = f(ZLL, WCD). Zero Load Latency (ZLL) = f(distance): 1. Mapping. Worst-case Contention Delay (WCD) = f(routing, arbitration): 2. Routing; 3. Bandwidth weighted allocation (walloc).

9 Worst Case Execution Time (WCET) - WCD. WCET = f(ZLL, WCD). Zero Load Latency (ZLL) = f(distance): 1. Mapping. Worst-case Contention Delay (WCD) = f(routing, arbitration): 2. Routing; 3. Bandwidth weighted allocation (walloc). [Figures: 3x3 mesh flows mapping using XY; 3x3 mesh flows mapping using XY-YX combination]

10 Worst Case Execution Time (WCET) - WCD. WCET = f(ZLL, WCD). Zero Load Latency (ZLL) = f(distance): 1. Mapping. Worst-case Contention Delay (WCD) = f(routing, arbitration): 2. Routing; 3. Bandwidth weighted allocation (walloc). WCET is affected by all three parameters: mapping, routing and walloc. [Figure labels: WCD = 15; WCD = 10; 2x2 2D mesh, XY flows mapping; RR arbitration; Weighted mesh arbitration (WRR)]

11 Parameters are inter-dependent. WCET = f(ZLL, WCD) = f(mapping, routing, walloc). Optimizing each parameter individually or in pairs does not provide a globally optimal configuration, just a local one. All the NoC parameters need to be optimized at the same time: mapping constraints; routing constraints; bandwidth constraints.

12 Our proposal: NoCo. Given: a workload (tasks); a wormhole mesh NoC configuration. Optimizes: the WCET of applications, finding the best mesh configuration: mapping; routing; weights allocation (walloc). NoCo uses: stochastic exploration to optimize routing; Integer Linear Programming (ILP) to optimize mapping and walloc.

13 Agenda: Introduction and motivation; Background and problem analysis; NoCo: stochastic/ILP model; Evaluation; Conclusions.

14 PROPOSAL: NOCO

15 Approach/Concept: NoCo Optimization Framework. Routing (stochastic): generates random routes and passes them to the ILP optimizer. Placement and walloc (Integer Linear Programming, ILP): placement and walloc are optimized for each routing. Selection of the best setup. [Figure: route generation (stochastic random selection of routes); mapping and walloc optimization (ILP); optimized NoC configurations; best configuration]

16 NoCo Proposal: Problem description. Tasks information: Execution Time Observed (ETO); memory accesses. NoC information: target node location; number of routers. Constraints (only one task to each core). [Figure: main stages of the NoCo framework]

17 NoCo Proposal: Routing. Stochastic exploration: randomly generate routing configurations; minimal-distance routing policies; deterministic routing policies (i.e. XY, YX). Deadlock avoidance: prohibiting certain turns (no cycles). [Figure: main stages of the NoCo framework]
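
The route generation step can be pictured with a small sketch. It assumes that a routing configuration is an independent choice of XY or YX for each source node, all sending to a single target node; this matches the 2^9 and 2^16 routing counts quoted in the backup slides for 9 and 16 nodes, but it is an interpretation rather than something the slides spell out. All names are hypothetical, and the deadlock check (prohibited turns) is deliberately left out.

```python
import random


def xy_path(src, dst):
    """Minimal route that travels along X first, then along Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def yx_path(src, dst):
    """Minimal route that travels along Y first, then along X."""
    # Swap the coordinates, reuse xy_path, then swap them back.
    swapped = xy_path((src[1], src[0]), (dst[1], dst[0]))
    return [(x, y) for (y, x) in swapped]


def random_routing(nodes, target, rng):
    """Pick XY or YX independently per source node (2**len(nodes) options)."""
    routing = {}
    for node in nodes:
        policy = rng.choice(["XY", "YX"])
        route = xy_path(node, target) if policy == "XY" else yx_path(node, target)
        routing[node] = (policy, route)
    return routing


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sample is reproducible
    mesh = [(x, y) for x in range(3) for y in range(3)]
    for node, (policy, route) in random_routing(mesh, (0, 0), rng).items():
        print(node, policy, route)
```

Both policies are minimal, so every generated route has the same hop count as the Manhattan distance; the policies differ only in which links the flows share, and link sharing is what drives contention.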

18 NoCo Proposal: Routing. Random sampling (finite population). C = probability that none of the top X% routes is in the random sample. The probability of having at least one routing from the top 1% of routings in a sample of size 1000 is 1 - C (99.99%). [Figure: routings ordered from worst to best; labels 1% and 0.1%]
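
The sampling argument can be checked numerically. The sketch below computes C in two ways: exactly for a finite population sampled without replacement (a hypergeometric probability), and with the simpler approximation (1 - X)^n that holds when the population is much larger than the sample. The slide does not show which formula was used, so both are assumptions, as are the function names.

```python
from math import comb


def miss_probability(population, top_fraction, sample_size):
    """C: probability that a sample drawn without replacement contains
    none of the top routings (finite population, hypergeometric)."""
    top = int(population * top_fraction)
    return comb(population - top, sample_size) / comb(population, sample_size)


def miss_probability_large_population(top_fraction, sample_size):
    """Approximation of C when the population is much larger than the sample."""
    return (1.0 - top_fraction) ** sample_size


if __name__ == "__main__":
    # Top 1% of routings, sample of 1000 routings.
    print(1.0 - miss_probability_large_population(0.01, 1000))
    # Same question for a finite population of 2**16 routings.
    print(1.0 - miss_probability(2 ** 16, 0.01, 1000))
```

Sampling without replacement can only lower the chance of missing every top routing, so the finite-population value of 1 - C is at least as high as the approximation.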

19 NoCo Proposal: Stochastic Routing. It stochastically guarantees finding one of the best routing solutions at low cost (without exploring all the possible routings). With 330 samples out of 2^16 = 65536 routings (0.5% of the population), it finds the best routing. [Figure: main stages of the NoCo framework]

20 NoCo Proposal: Mapping and Walloc. ILP optimization. [Figure: main stages of the NoCo framework]

21 NoCo Proposal: ILP model. Objective function: parallel applications. The WCET of the application is determined by the WCET of the slowest thread. [Figure labels: W_C1, W_C2, W_C3, W_C3]
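
A common way to express "the WCET of the slowest thread" in a linear program is to introduce an auxiliary variable that upper-bounds every per-thread WCET and to minimize it. The formulation below is that standard linearization, written under the assumption that W_{C_i} denotes the WCET of the thread on core i (following the W_C1, W_C2, ... labels on the slide) and that n is the number of threads; the exact objective used in the NoCo model is not shown in the transcript.

```latex
\begin{aligned}
\min \quad & W_{\max} \\
\text{s.t.} \quad & W_{\max} \ge W_{C_i} \qquad \forall i \in \{1, \dots, n\}
\end{aligned}
```

At the optimum, W_max equals the largest W_{C_i}, so minimizing it minimizes the WCET of the parallel application.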

22 NoCo Proposal: ILP model. Compute WCET: bandwidth and WCD modeling. WCET in isolation; number of memory accesses; BW distribution constraints from the routing configuration; path flows mapping from the routing configuration.

23 NoCo Proposal: ILP model. Compute WCET. Routing rules (encoded in Boolean matrices): bandwidth distribution; path restrictions. Other restrictions: each task is assigned to one core; one core can only run one task; BW assigned to a core > 0.0; WCD of all tasks > 0.0; total BW in the mesh must be 1.
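
To make the assignment constraints concrete, the sketch below replaces the ILP solver with plain exhaustive search over one-task-per-core mappings, which is only practical for tiny examples. The WCET model (isolation time plus a worst-case latency per NoC access) is an assumption inspired by the terms listed on the previous slide, the per-core latencies are made-up numbers, and the bandwidth weights are not modeled at all.

```python
from itertools import permutations


def task_wcet(eto, accesses, access_latency):
    """Assumed model: isolation time plus a worst-case latency per NoC access."""
    return eto + accesses * access_latency


def best_mapping(tasks, core_latency):
    """Exhaustively assign one task per core, minimizing the slowest task.

    tasks: {name: (eto_cycles, noc_accesses)}
    core_latency: {core: assumed worst-case cycles per access from that core}
    """
    names = list(tasks)
    cores = list(core_latency)
    best = None
    # permutations() never repeats a core, so each core runs at most one task.
    for perm in permutations(cores, len(names)):
        wcets = [task_wcet(*tasks[n], core_latency[c]) for n, c in zip(names, perm)]
        candidate = (max(wcets), dict(zip(names, perm)))
        if best is None or candidate[0] < best[0]:
            best = candidate
    return best


if __name__ == "__main__":
    tasks = {"A": (1000, 50), "B": (1000, 10)}  # hypothetical workload
    core_latency = {"c0": 4, "c1": 10}          # hypothetical latencies
    print(best_mapping(tasks, core_latency))
```

In this toy case the task with more NoC accesses ends up on the core with the lower per-access latency, which is the intuition behind optimizing mapping together with the other parameters.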

24 NoCo Proposal: Stochastic + ILP model. Local solutions provide: the WCET of each task; the mapping; the bandwidth distribution (arbitration weights). Post-processing (minimum WCET): global solution. [Figure: main stages of the NoCo framework]

25 EVALUATION

26 Evaluation. Cycle-accurate simulator: SoCLib simulator integrated with gnocsim. Benchmarks: the key parameter is the frequency of access to the NoC for loads/stores. Workloads: cover the range shown for Mediabench and EEMBC auto; MIX benchmarks (e.g. MIX1 => ABCDEFFGH).

27 Evaluation: impact of optimizing each parameter. Incremental evaluation. [Table: optimization versions evaluated. NoC configurations: Static-base (RR), Static-opt (WRR), Map, Map + Walloc, Map + Walloc + R. ILP optimizations: routing, weights, mapping. Groups: baseline, NoCo.]

28 Results: Incremental Evaluation. Static-opt (WRR) vs Static-base (RR). Plot labels: -1%, 16%. [Figure: effect of incremental optimizations: mapping, walloc and routing (3x3 heterogeneous workloads)]

29 Results: Incremental Optimizations. Map vs Static-base (RR). Plot labels: 17%, 30%, 23%. [Figure: effect of incremental optimizations: mapping, walloc and routing (3x3 heterogeneous workloads)]

30 Results: Incremental Optimizations. Map_Walloc vs Static-base (RR). Plot labels: 31%, 41%, 37%. [Figure: effect of incremental optimizations: mapping, walloc and routing (3x3 heterogeneous workloads)]

31 Results: Incremental Optimizations. Map_Walloc_Routing (NoCo) vs XY_RR. Plot labels: 9%, 14%, 50%, 40%, 46%, 23%, 3%. [Figure: effect of incremental optimizations: mapping, walloc and routing (3x3 heterogeneous workloads)]

32 CONCLUSIONS

33 Conclusions. Optimizing the NoC to reduce WCET is a multidimensional problem: Zero Load Latency (mapping); Worst Case Delay (routing and arbitration). Some state-of-the-art proposals optimize one, or a combination, of the mentioned parameters that increase the WCET of applications. We propose NoCo, a stochastic/ILP hybrid solution that optimizes at the same time: routing (XY, YX combinations); arbitration (walloc); application mapping. NoCo reduces the maxWCET of heterogeneous tasks in 3x3 meshes by between 40% and 50% with respect to the XY-RR configuration.

35 BACKUP

36 Reliability of stochastic method. Random routing vs optimal routing. [Figure labels: 88.7%; 2^9 = 512 routings; 2^16 = 65536 routings; 94.6%, 100 samples; 98.5%, 330 samples; best solution in fifth examples]

37 Reliability of stochastic method. Improvement in homogeneous tasks running in parallel: reducing the max WCET of all tasks. rilp (9 threads): avg. w.r.t. RR 74%; avg. w.r.t. WRR 26%. rilp (16 threads): avg. w.r.t. RR 88%; avg. w.r.t. WRR 29%. [Figures: rilp(m,w) maxwcet results for 9 threads; rilp(m,w) maxwcet results for 16 threads]

38 Reliability of stochastic method. Improvement in heterogeneous tasks running in parallel: reducing the summation of the max WCET of all tasks. rilp (9 threads): avg. w.r.t. RR 26%; avg. w.r.t. WRR 19%. rilp (16 threads): avg. w.r.t. RR 30%; avg. w.r.t. WRR 23%. [Figures: rilp(m,w) sumwcet results for 9 threads; rilp(m,w) sumwcet results for 16 threads]


More information

Basic Switch Organization

Basic Switch Organization NOC Routing 1 Basic Switch Organization 2 Basic Switch Organization Link Controller Used for coordinating the flow of messages across the physical link of two adjacent switches 3 Basic Switch Organization

More information

IETF 90: VNF PERFORMANCE BENCHMARKING METHODOLOGY

IETF 90: VNF PERFORMANCE BENCHMARKING METHODOLOGY IETF 90: VNF PERFORMANCE BENCHMARKING METHODOLOGY Contributors: Sarah Banks:sbanks@akamai.com Muhammad Durrani: mdurrani@brocade.com Mike Chen: mchen@brocade.com Objective Create comprehensive VNF performance

More information

Fast Flexible FPGA-Tuned Networks-on-Chip

Fast Flexible FPGA-Tuned Networks-on-Chip This work was funded by NSF. We thank Xilinx for their FPGA and tool donations. We thank Bluespec for their tool donations. Fast Flexible FPGA-Tuned Networks-on-Chip Michael K. Papamichael, James C. Hoe

More information

Context. Hardware Performance. Increasing complexity. Software Complexity. And the Result is. Embedded systems are becoming more complex every day:

Context. Hardware Performance. Increasing complexity. Software Complexity. And the Result is. Embedded systems are becoming more complex every day: Context Embedded systems are becoming more complex every day: Giorgio uttazzo g.buttazzo@sssup.it more functions higher performance higher efficiency Scuola Superiore Sant nna new hardware s Increasing

More information

Context. Giorgio Buttazzo. Scuola Superiore Sant Anna. Embedded systems are becoming more complex every day: more functions. higher performance

Context. Giorgio Buttazzo. Scuola Superiore Sant Anna. Embedded systems are becoming more complex every day: more functions. higher performance Giorgio uttazzo g.buttazzo@sssup.it Scuola Superiore Sant nna Context Embedded systems are becoming more complex every day: more functions higher performance higher efficiency new hardware platforms 2

More information

Parallel Code Generation of Synchronous Programs for a Many-core Architecture

Parallel Code Generation of Synchronous Programs for a Many-core Architecture Parallel Code Generation of Synchronous Programs for a Many-core Architecture Amaury Graillat Supervisors: Reviewers: Pascal Raymond (Verimag), Matthieu Moy (LIP) Benoît Dupont de Dinechin (Kalray) Jan

More information

A Timing Effects of DDR Memory Systems in Hard Real-Time Multicore Architectures: Issues and Solutions

A Timing Effects of DDR Memory Systems in Hard Real-Time Multicore Architectures: Issues and Solutions A Timing Effects of DDR Memory Systems in Hard Real-Time Multicore Architectures: Issues and Solutions MARCO PAOLIERI, Barcelona Supercomputing Center (BSC) EDUARDO QUIÑONES, Barcelona Supercomputing Center

More information

SIC: Provably Timing-Predictable Strictly In-Order Pipelined Processor Core

SIC: Provably Timing-Predictable Strictly In-Order Pipelined Processor Core SIC: Provably Timing-Predictable Strictly In-Order Pipelined Processor Core Sebastian Hahn and Jan Reineke RTSS, Nashville December, 2018 saarland university computer science SIC: Provably Timing-Predictable

More information

A Thermal-aware Application specific Routing Algorithm for Network-on-chip Design

A Thermal-aware Application specific Routing Algorithm for Network-on-chip Design A Thermal-aware Application specific Routing Algorithm for Network-on-chip Design Zhi-Liang Qian and Chi-Ying Tsui VLSI Research Laboratory Department of Electronic and Computer Engineering The Hong Kong

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

Lecture 2: Topology - I

Lecture 2: Topology - I ECE 8823 A / CS 8803 - ICN Interconnection Networks Spring 2017 http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/ Lecture 2: Topology - I Tushar Krishna Assistant Professor School of Electrical and

More information

ibench: Quantifying Interference in Datacenter Applications

ibench: Quantifying Interference in Datacenter Applications ibench: Quantifying Interference in Datacenter Applications Christina Delimitrou and Christos Kozyrakis Stanford University IISWC September 23 th 2013 Executive Summary Problem: Increasing utilization

More information

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package

Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package High Performance Machine Learning Workshop Energy Efficient K-Means Clustering for an Intel Hybrid Multi-Chip Package Matheus Souza, Lucas Maciel, Pedro Penna, Henrique Freitas 24/09/2018 Agenda Introduction

More information

Managing Memory for Timing Predictability. Rodolfo Pellizzoni

Managing Memory for Timing Predictability. Rodolfo Pellizzoni Managing Memory for Timing Predictability Rodolfo Pellizzoni Thanks This work would not have been possible without the following students and collaborators Zheng Pei Wu*, Yogen Krish Heechul Yun* Renato

More information

Studying Optimal Spilling in the light of SSA

Studying Optimal Spilling in the light of SSA Studying Optimal Spilling in the light of SSA Quentin Colombet, Florian Brandner and Alain Darte Compsys, LIP, UMR 5668 CNRS, INRIA, ENS-Lyon, UCB-Lyon Journées compilation, Rennes, France, June 18-20

More information

Cortex-A75 and Cortex-A55 DynamIQ processors Powering applications from mobile to autonomous driving

Cortex-A75 and Cortex-A55 DynamIQ processors Powering applications from mobile to autonomous driving Cortex-A75 and Cortex- DynamIQ processors Powering applications from mobile to autonomous driving Lionel Belnet Sr. Product Manager Arm Arm Tech Symposia 2017 Agenda Market growth and trends DynamIQ technology

More information

Future Gigascale MCSoCs Applications: Computation & Communication Orthogonalization

Future Gigascale MCSoCs Applications: Computation & Communication Orthogonalization Basic Network-on-Chip (BANC) interconnection for Future Gigascale MCSoCs Applications: Computation & Communication Orthogonalization Abderazek Ben Abdallah, Masahiro Sowa Graduate School of Information

More information

WMC. MPSoCs for Mixed-Criticality Systems: Challenges and Opportunities. Mohamed Hassan

WMC. MPSoCs for Mixed-Criticality Systems: Challenges and Opportunities. Mohamed Hassan WMC MPSoCs for Mixed-Criticality Systems: Challenges and Opportunities Mohamed Hassan 1 IBM s Acorn Smart Phones Automotive 1943 1989 2010s 1981 2000s Now-Near Colossus NEC s UltaLite Wearables IoT/Smart

More information

Integration of Mixed Criticality Systems on MultiCores: Limitations, Challenges and Way ahead for Avionics

Integration of Mixed Criticality Systems on MultiCores: Limitations, Challenges and Way ahead for Avionics Integration of Mixed Criticality Systems on MultiCores: Limitations, Challenges and Way ahead for Avionics TecDay 13./14. Oct. 2015 Dietmar Geiger, Bernd Koppenhöfer 1 COTS HW Evolution - Single-Core Multi-Core

More information

Prediction Router: Yet another low-latency on-chip router architecture

Prediction Router: Yet another low-latency on-chip router architecture Prediction Router: Yet another low-latency on-chip router architecture Hiroki Matsutani Michihiro Koibuchi Hideharu Amano Tsutomu Yoshinaga (Keio Univ., Japan) (NII, Japan) (Keio Univ., Japan) (UEC, Japan)

More information

Performance Tools for Technical Computing

Performance Tools for Technical Computing Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University Intel Software Conference 2010 April 13th, Barcelona, Spain Agenda o Motivation and Methodology

More information

Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling

Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling Bhavya K. Daya, Li-Shiuan Peh, Anantha P. Chandrakasan Dept. of Electrical Engineering and Computer

More information

Interactive Realtime Multimedia Applications on SOIs

Interactive Realtime Multimedia Applications on SOIs Interactive Realtime Multimedia Applications on SOIs Advance Reservations for Distributed Real-Time Workflows with Probabilistic Service Guarantees Tommaso Cucinotta Real-Time Systems Laboratory Scuola

More information

Intelligent Interconnect for Autonomous Vehicle SoCs. Sam Wong / Chi Peng, NetSpeed Systems

Intelligent Interconnect for Autonomous Vehicle SoCs. Sam Wong / Chi Peng, NetSpeed Systems Intelligent Interconnect for Autonomous Vehicle SoCs Sam Wong / Chi Peng, NetSpeed Systems Challenges Facing Autonomous Vehicles Exploding Performance Requirements Real-Time Processing of Sensors Ultra-High

More information

Partitioned Fixed-Priority Scheduling of Parallel Tasks Without Preemptions

Partitioned Fixed-Priority Scheduling of Parallel Tasks Without Preemptions Partitioned Fixed-Priority Scheduling of Parallel Tasks Without Preemptions *, Alessandro Biondi *, Geoffrey Nelissen, and Giorgio Buttazzo * * ReTiS Lab, Scuola Superiore Sant Anna, Pisa, Italy CISTER,

More information

Real-Time Communication Services for Networks on Chip. Zheng Shi

Real-Time Communication Services for Networks on Chip. Zheng Shi Real-Time Communication Services for Networks on Chip Zheng Shi Submitted for the degree of Doctor of Philosophy Computer Science The University of York November 2009 Abstract Networks-on-Chip (NoCs),

More information

Managing Hybrid On-chip Scratchpad and Cache Memories for Multi-tasking Embedded Systems

Managing Hybrid On-chip Scratchpad and Cache Memories for Multi-tasking Embedded Systems Managing Hybrid On-chip Scratchpad and Cache Memories for Multi-tasking Embedded Systems Zimeng Zhou, Lei Ju, Zhiping Jia, Xin Li School of Computer Science and Technology Shandong University, China Outline

More information

Optimal Implementation of Simulink Models on Multicore Architectures with Partitioned Fixed Priority Scheduling

Optimal Implementation of Simulink Models on Multicore Architectures with Partitioned Fixed Priority Scheduling The 39th IEEE Real-Time Systems Symposium (RTSS 18) Optimal Implementation of Simulink Models on Multicore Architectures with Partitioned Fixed Priority Scheduling Shamit Bansal, Yecheng Zhao, Haibo Zeng,

More information

Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API

Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API EuroPAR 2016 ROME Workshop Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API Suyang Zhu 1, Sunita Chandrasekaran 2, Peng Sun 1, Barbara Chapman 1, Marcus Winter 3,

More information

Overview of Potential Software solutions making multi-core processors predictable for Avionics real-time applications

Overview of Potential Software solutions making multi-core processors predictable for Avionics real-time applications Overview of Potential Software solutions making multi-core processors predictable for Avionics real-time applications Marc Gatti, Thales Avionics Sylvain Girbal, Xavier Jean, Daniel Gracia Pérez, Jimmy

More information

STLAC: A Spatial and Temporal Locality-Aware Cache and Networkon-Chip

STLAC: A Spatial and Temporal Locality-Aware Cache and Networkon-Chip STLAC: A Spatial and Temporal Locality-Aware Cache and Networkon-Chip Codesign for Tiled Manycore Systems Mingyu Wang and Zhaolin Li Institute of Microelectronics, Tsinghua University, Beijing 100084,

More information

Measurement-Based Probabilistic Timing Analysis and Its Impact on Processor Architecture

Measurement-Based Probabilistic Timing Analysis and Its Impact on Processor Architecture Measurement-Based Probabilistic Timing Analysis and Its Impact on Processor Architecture Leonidas Kosmidis, Eduardo Quiñones,, Jaume Abella, Tullio Vardanega, Ian Broster, Francisco J. Cazorla Universitat

More information

A Detailed GPU Cache Model Based on Reuse Distance Theory

A Detailed GPU Cache Model Based on Reuse Distance Theory A Detailed GPU Cache Model Based on Reuse Distance Theory Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal Eindhoven University of Technology (Netherlands) Henri Bal Vrije Universiteit Amsterdam

More information

Hybrid Implementation of 3D Kirchhoff Migration

Hybrid Implementation of 3D Kirchhoff Migration Hybrid Implementation of 3D Kirchhoff Migration Max Grossman, Mauricio Araya-Polo, Gladys Gonzalez GTC, San Jose March 19, 2013 Agenda 1. Motivation 2. The Problem at Hand 3. Solution Strategy 4. GPU Implementation

More information