Real Time Systems Compilation with Lopht. Dumitru Potop-Butucaru, INRIA Paris / AOSTE team, Jan. 2016


Embedded control systems
A computing system that controls the execution of a piece of "physical" equipment, in order to ensure its correct functioning. [Diagram: the embedded control system (digital domain) reads the plant (the controlled "physical" system, analog domain) through sensors and drives it through actuators.]

Embedded control systems: design flow
[Diagram: control engineering produces a continuous-time model of plant + controller, discretized into cyclic control code (functional specification, functional correctness). Computer engineering adds a platform-independent real-time specification (periods, deadlines, latency constraints) and other non-functional requirements (criticality, partitioning, allocation, other real-time requirements), plus a platform model (HW + OS: CPU, bus, topology, WCET/WCCT, OS/drivers). Other engineering fields (HW design, reliability, etc.) contribute as well. The result is an executable implementation that must respect all requirements.]

Embedded systems implementation
Still largely a craft in the industry, with important manual and/or unformalized phases.
Some could be automated with existing industry-grade tools (modulo certification arguments):
- discretization of the continuous-time model
- construction of tasks from pieces of C code
- allocation and scheduling of tasks for functional correctness
- programming and implementing complex timed behaviors
Some cannot (to my knowledge):
- building platform and implementation models for precise and safe timing analysis
- allocation & scheduling providing hard guarantees of respect for real-time and other non-functional requirements
These phases are resource-consuming and error-prone, and ultimately rely on test to check system correctness.

Hand-crafting implementations no longer scales
Complexity explodes: large numbers of resources (multi-/many-cores), large amounts of SW (more tasks, more system SW), and more complex hardware, OS, and SW (communication networks such as NoCs and AFDX, hierarchical schedulers, dependent tasks).
Efficiency becomes critical: parallelized filters are dependent task systems, and performance is rapidly lost if multi-processor scheduling is not efficient. Performance highly depends on the mapping, so all system parameters must be fine-tuned: allocation, scheduling, memory allocations, synchronizations, communications.
Delegating the mapping to the HW/OS results in unpredictability: cache coherency, GPUs, dynamic allocation and scheduling, etc.

Off-line automation is needed
It requires full formalization of the functional specification, the non-functional requirements, and the execution platform models (a sound abstraction issue), as well as a formalization of implementation correctness.
It requires efficient algorithms for analysis and synthesis.
Standardization is a prerequisite (it allows cost-effective tool-building).

A success story of off-line automation: compilation
Compilation sits at the low level of the embedded implementation flow. It was made possible by the early standardization of programming languages and of execution platforms (ISAs, ABIs, APIs), and has almost completely replaced assembly coding, even in embedded systems design. [Diagram: C/Ada/… programs + platform model (ISA + ABI + API + timing) -> compiler -> executable implementation.]

Real-time scheduling vs. compilation
[Diagram: real-time scheduling takes the functional specification, the platform-independent real-time specification, other non-functional requirements, and a high-level platform model, and produces an abstract implementation model, using off-line or on-line policies (RM/EDF/RTC/etc.), scheduling tables, task models, or real-time calculus models. Compilation takes C/Ada/… programs and a platform model (ISA + ABI + API + timing) and produces the executable implementation.]

Real-time scheduling vs. compilation (continued)
[Same diagram, now showing the abstraction layer that connects the abstract implementation model to the programs fed to the compiler: WCET analysis, protocol synthesis, code generation. This layer is rarely sound.]

Challenge: full automation
[Same diagram: the challenge is to automate the complete path, from functional and non-functional specifications down to the executable implementation, including the rarely sound abstraction layer.]

Challenge: full automation
Historically difficult, due to lack of standardization. Recently, functional and non-functional specification languages have appeared:
- dynamic systems specification languages (Simulink, LabVIEW, Scade, etc.), with discretization and code generation ensuring functional correctness, but not schedulability;
- systems engineering languages (SysML, AADL): real time, criticality/partitioning, allocation.
Today, execution platforms with support for real time exist: ARINC 653, AUTOSAR, TTP/TTEthernet, etc., and (some) many-cores.

Real-time systems compilation
Move from specification to running implementation fully automatically, with no human intervention during the compilation itself. Inputs and outputs must all be formalized: functional and non-functional inputs, and a (model of the) generated code.
Provide formal correctness guarantees (functional and non-functional): no break in the formal chain, so formal proof is possible.
Failure traceability: functional and non-functional causes, making trial-and-error systems design easy.
Fast and efficient: fine-grain platform and application modeling, and fast heuristics drawn from compilation, real-time scheduling, synchronous languages, etc. Exact techniques (e.g. SMT-based) do not scale [FORMATS 15].

Real-time systems compilation
Start with statically scheduled systems: they are timing-analysis-friendly (if the HW and OS are also friendly), with good performance and timing predictability; checking functional correctness and computing WCET/WCRT is easy using existing tools; and we have good experience with static scheduling approaches.
Classes of applications with practical importance: critical embedded control applications, signal/image processing.
Implementation model: scheduling/reservation tables, shared between real-time scheduling and compilation, which facilitates the reuse of algorithms.

Previous work
Work on superscalar/VLIW/many-core compilation: optimization, not respect of requirements; no hard real-time guarantees; finer level of control.
Work on off-line real-time scheduling:
- AAA/SynDEx (Sorel et al.): no classical compiler optimizations (pipelining), fine-grain execution condition analysis, time-triggered code generation, partitioning, preemptable tasks, or memory allocation.
- Simulink -> Scade -> TTA (Caspi et al.): no automatic allocation, conditional execution, preemption, or mapping optimizations (pipelining, memory bank allocation, etc.).
- Prelude (Forget et al.): no partitioning, deadlines longer than periods, or memory bank allocations.
- PsyC/PharOS: the input is lower-level (can be used as a back-end to our approach).

The real-time compiler Lopht
A compiler for statically-scheduled real-time systems.
Functional specification: data-flow synchronous (e.g. Lustre/Scade).
Target execution platforms:
- Distributed time-triggered systems (ARINC 653-based processors, TTEthernet): full generation of the ARINC 653 configuration, the partition C/APEX code, and the bus schedule.
- "WCET-friendly" many-cores (our own, Kalray MPPA): full generation of bare-metal C code, including communication and synchronization operations, memory allocations, NoC configuration, etc., plus WCET analysis of the generated parallel code, including synchronization code (sound architecture abstraction layer).
Meaningful applications: a space launcher model (Airbus DS, complex non-functional requirements), passenger exchange (Alstom/IRT SystemX, focus on mixed criticality), and a platooning application model (CyCab, parallelized image processing code inside a control loop).

The principle: table-based scheduling
Example: engine ignition application. Functional specification: dataflow synchronous, with a cyclic execution model and 4 modes determined by 2 independent switches. [Dataflow diagram: inputs HS_IN and FS_IN; blocks F1, F2, F3, G, M, N; values x, V, ID; a delay Δ; execution conditions if (HS), if (not HS), if (FS), if (not FS). HS: high-speed mode; FS: fail-safe mode.]

The principle: table-based scheduling
Example: non-functional specification. 3 CPUs (P0, P1, P2) and 1 broadcast bus. WCETs/WCCTs and allocation constraints:
- P0: read_input = 1, F1 = 3, F2 = 8, F3 = 5
- P1: F1 = 3, F2 = 8, F3 = 5, G = 3
- P2: F1 = 3, F2 = 8, F3 = 2, N = 3, M = 1
- Bus: Bool = 2, ID = 5, V = 2, x = 2

The principle: table-based scheduling
List scheduling (as in compilers, SynDEx, etc.): allocate and schedule the blocks one by one. Allocation and scheduling are optimal for each block, with respect to a cost function: all legal allocations are tried, and we retain the one that minimizes the cost function. The simplest cost function is the worst-case end date of the block (used in this example); more complex ones can include e.g. locality-related terms. This does not imply global optimality (it is a heuristic), and there is no backtracking. Communication mapping is done during block mapping. Failure reporting: current scheduling status, blocking reason.

The principle: table-based scheduling
Result: a conditioned scheduling table. [Table over time 0–16: P0 runs HS_IN, FS_IN, if(not HS) F1, if(not HS) F2; P1 runs if(HS) G; P2 runs if(FS) N, if(not FS) M, if(not HS) F3; the bus carries send(P0,HS), send(P0,FS), if(not HS) send(P0,V), if(not HS & not FS) send(P0,ID), and if(HS & not FS) send(P1,ID).]

Before scheduling: flattening the dataflow
Translation to non-hierarchic data-flow. [Diagram: the hierarchic specification, with nested conditions if (HS), if (not HS), if (FS), if (not FS), and a delay initialized to v0. HS: high-speed mode; FS: fail-safe mode.]

Before scheduling: flattening the dataflow
Translation to non-hierarchic data-flow. [Diagram: the flattened specification, where every block now carries its own execution condition over HS and FS.]

Scheduling algorithm
Step 1: order the dataflow blocks ("list scheduling"): compute a total order compatible with the data dependencies (except for the delayed ones). [Flattened dataflow diagram.]

Scheduling algorithm
Step 1 (result): a total order compatible with the data dependencies (except for the delayed ones). [Diagram: the flattened dataflow with the blocks annotated by their position in the order, 1–8.]
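Step 1 can be sketched as a plain topological sort (Kahn's algorithm). This is an illustrative reconstruction, not Lopht's actual code: the function name and graph encoding are made up, and delayed dependencies are assumed to have been dropped beforehand, as the slide requires.

```python
# Hypothetical sketch of Step 1: order the dataflow blocks in a total
# order compatible with the (non-delayed) data dependencies.
from collections import deque

def order_blocks(blocks, deps):
    """blocks: list of block names; deps: (producer, consumer) pairs,
    with delayed dependencies already excluded."""
    succs = {b: [] for b in blocks}
    indeg = {b: 0 for b in blocks}
    for src, dst in deps:
        succs[src].append(dst)
        indeg[dst] += 1
    ready = deque(b for b in blocks if indeg[b] == 0)
    total_order = []
    while ready:
        b = ready.popleft()          # pick any ready block
        total_order.append(b)
        for s in succs[b]:           # its successors may become ready
            indeg[s] -= 1
            if indeg[s] == 0:
                ready.append(s)
    return total_order
```

On the slide's example, any order it returns places HS_IN and FS_IN before the blocks that consume their outputs.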

Scheduling algorithm
Step 2: allocate and schedule the blocks one by one. Allocation and scheduling are optimal for each block, taken separately, with respect to a cost function: all legal allocations are tried, but only one of those that minimize the cost function is retained. In this example, the cost function is the reservation end date. Local optimality does not imply global optimality, and there is no backtracking. Communication scheduling is done on demand, during block scheduling; multiple possible routes make it more complicated.
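A minimal sketch of this greedy allocation loop, under strong simplifying assumptions: a plain WCET table, no communication costs (WCCTs), and no execution conditions, all of which the real tool does handle. All names are hypothetical.

```python
# Hypothetical sketch of Step 2: greedy allocation of already-ordered
# blocks, cost function = reservation end date, no backtracking.
def schedule(ordered_blocks, procs, wcet):
    """wcet[b][p] is the WCET of block b on processor p;
    absence of an entry encodes an allocation constraint."""
    busy_until = {p: 0 for p in procs}   # end of last reservation per CPU
    table = {}                           # block -> (proc, start, end)
    for b in ordered_blocks:
        best = None
        for p in procs:
            if p not in wcet[b]:
                continue                 # b is not allowed on p
            start = busy_until[p]        # ASAP on this processor
            end = start + wcet[b][p]
            if best is None or end < best[2]:
                best = (p, start, end)   # keep the minimal end date
        p, start, end = best             # commit immediately (no backtracking)
        busy_until[p] = end
        table[b] = (p, start, end)
    return table
```

For instance, with two blocks A (faster on P0) and B (faster on P1), each lands on its faster processor because that minimizes its own end date, locally but not necessarily globally.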

Scheduling algorithm
Step 2: allocate and schedule the blocks one by one. At first, the scheduling table is empty. [Empty table: time 0–16, rows P0, P1, P2, Bus.]

Scheduling algorithm
Operation HS_IN can only be allocated on P0. The optimal schedule places the operation at date 0. [Table: HS_IN on P0 at date 0.]

Scheduling algorithm
Operation FS_IN can only be allocated on P0. The optimal schedule places the operation at date 1. [Table: HS_IN then FS_IN on P0.]

Scheduling algorithm
Operation F1 can be allocated on P0, P1, or P2. We retain the allocation on P0, because there F1 terminates at date 5, whereas on P1 or P2 it would end at date 6 (due to the communication of HS, needed to compute the activation condition). [Table: the three candidate reservations of if(not HS) F1, with the send(P0,HS) the remote ones would require on the bus; only the P0 reservation is kept.]

Scheduling algorithm
Same for F2. [Table: if(not HS) F2 retained on P0, directly after F1.]

Scheduling algorithm
N can only be allocated on P2. This forces the transmission of FS. The two operations are scheduled as soon as possible (ASAP). [Table: send(P0,FS) on the bus, then if(FS) N on P2.]

Scheduling algorithm
F3 can be allocated on P0, P1, or P2. It executes much faster on P2, which compensates for the needed communication time, so the allocation on P2 is retained. [Table: if(not HS) send(P0,V) on the bus, then if(not HS) F3 on P2.]

Scheduling algorithm
G can be allocated on P0, P1, or P2. The allocation on P1 minimizes the end date, so it is retained. [Table: if(HS) G on P1.]

Scheduling algorithm
M can only be allocated on P2. Depending on the value of HS, its input ID comes either from P0 or from P1. Moreover, this data is not needed when M is not executed, hence the execution conditions on the send operations. [Table: if(not HS & not FS) send(P0,ID) and if(HS & not FS) send(P1,ID) on the bus, then if(not FS) M on P2.]

Scheduling algorithm
Generated schedule: latency 17, throughput 1/17. [Complete conditioned scheduling table over time 0–16.]

But this is not enough (1/3)
General-purpose optimization is needed, to reduce communications, reduce the makespan, and increase the throughput.
Software pipelining of scheduling tables: improve the throughput without changing the makespan, taking conditional execution into account to allow double reservation between successive cycles.
Safe double reservation of resources requires a precise analysis of execution conditions.

But this is not enough (1/3)
Example: software pipelining. The generated schedule has latency 17 and throughput 1/17. [Same conditioned scheduling table as before.]

But this is not enough (1/3)
Example: software pipelining. [Table extended to time 0–19: the next cycle's HS_IN, FS_IN, if(not HS) F1, if(FS) N, send(P0,FS), and send(P0,HS) overlap the tail of the current cycle.]

But this is not enough (1/3)
Example: software pipelining. Pipelined schedule: latency 17, throughput 1/13. Software pipelining (a classical compilation technique) can be applied, but it required some modifications to the classical algorithms: bi-criteria optimization (latency first, throughput second) and improved predication handling (conditions between cycles, no delay slot needed). [Table: one cycle now spans 13 time units.]
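The core question of this pipelining step, how far successive cycles of a fixed table can be overlapped, can be sketched as a search for the smallest collision-free period (initiation interval). This sketch deliberately ignores the execution-condition analysis that lets Lopht safely double-reserve resources, and only checks adjacent cycles; all names are illustrative.

```python
# Hypothetical sketch: smallest period at which two successive cycles of
# a scheduling table can overlap without two reservations colliding on
# the same resource. Mutually exclusive execution conditions (which would
# permit safe double reservation) are NOT modeled here.
def min_period(reservations, horizon):
    """reservations: list of (resource, start, end) for one table cycle;
    horizon: the non-pipelined period (= the table's makespan)."""
    def collide(period):
        for r1, s1, e1 in reservations:
            for r2, s2, e2 in reservations:
                # next cycle's r2 occupies [s2+period, e2+period)
                if r1 == r2 and s2 + period < e1 and s1 < e2 + period:
                    return True
        return False
    period = horizon
    while period > 1 and not collide(period - 1):
        period -= 1          # shrink the period while no conflict appears
    return period
```

On a toy table where P0 is busy over [0, 5) and P1 over [5, 10), cycles can be overlapped down to a period of 5, i.e. throughput improves from 1/10 to 1/5 while the latency stays 10.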

But this is not enough (1/3)
Application example. [Dataflow diagram: a mode computation MC (Init = 1) selects among F1 (if m=1), F2 (if m=2), and F3 (if m=3), all feeding G; the possible mode transitions between m=1, m=2, and m=3 are shown.]

But this is not enough (1/3)
Latency-optimizing scheduling: latency = 7, throughput = 1/7. [Table over time 0–6 on P1 and P2: MC@true, F1@m=1, F2@m=2, F3@m=3, G@m=1, G@m=2, G@m=3.]

But this is not enough (1/3)
Pipelined scheduling, using mode transition information: latency = 7, throughput = 1/4. [Table over time 0–3, with double reservations of G across successive cycles made safe by the mode-transition analysis.]

But this is not enough (1/3)
Pipelined scheduling, without using mode transition information: latency = 7, throughput = 1/5. [Table over time 0–4.]

But this is not enough (1/3)
Software pipelining => memory replication (rotating registers).
Conditional execution => it is not clear which version of a register to access; a dynamically updated indirection table is needed.
Real-time guarantees: the worst-case analysis must account for these artefacts, e.g. the rotating register operations.

But this is not enough (2/3): time-triggered applications
Target: ARINC 653-compliant OSs, time-triggered communication. Synthesis of inter-partition communication code (and accounting for it during scheduling).
Non-functional properties taken as input: ARINC 653 partitioning (all, part, or none of application and platform), real time (release dates; deadlines can represent end-to-end latencies), preemptability of each task, allocation requirements.
Optimizations: deadline-driven scheduling (to improve schedulability), minimizing the number of tasks at code generation (at most one per input acquisition point), minimizing the number of partition changes.
Hypothesis: the scheduler and I/O impact on WCETs can be bounded.

But this is not enough (3/3): many-core (bare metal)
Sound architecture abstraction layer with support for efficiency.
Precise platform modeling: the NoC contributes one resource per DMA and per NoC arbiter (wormhole scheduling synchronizes the schedules along the path of each packet); memory banks are resources, reserved for data, code, and stack to avoid contentions.
WCET analysis for parallel code, considering synchronization overheads.
Synthesis of communication and synchronization code (lock-based) on the processors, accounting for communication/synchronization initiation costs; precomputed preemptions for communications, accounting for preemption costs.

But this is not enough (3/3)
Synchronized & preemptive communications. [Gantt diagram over time 0–2500: tasks f, g, h and data z, u, x, y mapped on Tile (1,1), DMA (1,1), the NoC links N(1,1)-(1,2) and N(1,2)-(2,2), the input of (2,2), and Tile (2,2).]

Other related results (1/3): a many-core platform with support for efficient predictability
Principle: avoid contentions through mapping. No shared caches; hardware locks for mutual exclusion during RAM access; NoC arbitration exposed to programming. The efficiency bottleneck is in both the CPUs and the interconnect, and similar arbitration is required of both (in various contexts), so they should provide the same level of control.
General-purpose: no change in processor programming (gcc); leaving the NoC arbiters unprogrammed maintains functional correctness; small hardware overhead.
Automatic mapping (global optimization) gives good results.

Other related results (2/3): heuristics vs. exact methods (SAT/SMT/etc.)
Recent constraint solvers (CPLEX, Gurobi, Yices2, etc.) have largely improved in performance.
Question: can they solve scheduling problems?
Answer: yes, up to a certain size, which largely depends on problem complexity, problem type (schedulability vs. optimization), and system load: from a few tasks (optimization, preemptive, multi-periodic) to more than 100 tasks.
Conclusion: developing heuristic methods still has its use, especially for complex problems like ours.

Other related results (3/3): automatic delay-insensitive implementation of synchronous programs
Objective: separation of concerns between functional and non-functional aspects.
Optimal synchronization protocols ensuring functional determinism.
Characterization of delay-insensitive synchronous components (weak endochrony): set-theoretical limits, weak endochrony checking.
A new representation of the behavior of concurrent systems by means of sets of atomic behaviors.
Synthesis of delay-insensitive concurrent implementations.

Avionics GNC application
Simplified version: 3 tasks (T_Fast, T_GNC, and T_thermal) in 3 partitions (P_Fast, P_GNC, and P_thermal). WCETs: 40ms, 200ms, and 100ms. Periods: 100ms, 1s, and 1s. [Task graph: Fast, GNC, Thermal.]

Avionics GNC application
Simplified version (as before): WCETs 40ms, 200ms, and 100ms; periods 100ms, 1s, and 1s.
Hyperperiod expansion: MTF = 1s. [Timeline: 10 instances of Fast per MTF, plus the deadlines D_Thermal and D_GNC.]
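The hyperperiod expansion above can be reproduced numerically: the major time frame (MTF) is the least common multiple of the task periods, and each task contributes MTF/period instances per frame. The task names follow the slide; the helper function is illustrative.

```python
# Sketch of hyperperiod expansion: MTF = lcm of the periods, and each
# task is expanded into MTF/period instances per major time frame.
from math import gcd
from functools import reduce

def hyperperiod(periods_ms):
    return reduce(lambda a, b: a * b // gcd(a, b), periods_ms)

# Periods from the slide, in ms
periods = {"T_Fast": 100, "T_GNC": 1000, "T_thermal": 1000}
mtf = hyperperiod(periods.values())               # 1000 ms = 1 s
instances = {t: mtf // p for t, p in periods.items()}
```

This yields an MTF of 1s with 10 instances of T_Fast and one instance each of T_GNC and T_thermal, matching the expanded timeline on the slide.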

Avionics GNC application
Real-time requirements: T_Fast reads inputs arriving periodically in an infinite buffer. [Timeline 0–1000ms: the 10 instances of Fast, plus the deadlines D_Thermal and D_GNC.]

Avionics GNC application
Real-time requirements: T_Fast reads inputs arriving periodically in a buffer of size 2. [Timeline 0–1000ms.]

Avionics GNC application
Real-time requirements, with infinite buffers: "a full flow formed of 10 instances of T_Fast, followed by an instance of T_GNC and then by an instance of T_Fast, should take less than 1400 ms". [Timeline 0–1000ms.]

Avionics GNC application
Step 1: remove delayed dependencies, replacing them by timing barriers. Heuristic approach (sub-optimal). [Timeline 0–1000ms.]

Avionics GNC application
Step 1 (result): the end-to-end latency constraint is replaced by a deadline at 1300 for GNC and Thermal. [Timeline 0–1000ms.]

Avionics GNC application
Step 2: off-line scheduling. Allocate and schedule the blocks one by one (no backtracking), deadline-driven: of all the tasks that can be executed, consider the one with the earliest deadline, where the deadline of a task is the minimum of all its successors' deadlines. [Timeline 0–1000ms.]
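The deadline derivation stated above ("deadline = minimum of all successor deadlines") can be sketched as a backward propagation over the task graph. The encoding is hypothetical; a task's own explicit deadline, where it has one, is also taken into account.

```python
# Hypothetical sketch: derive effective deadlines by propagating the
# minimum successor deadline backwards through the task graph.
def effective_deadlines(tasks, succs, explicit):
    """tasks: list in topological order; succs: task -> successor list;
    explicit: task -> explicit deadline (tasks may be absent)."""
    deadline = dict(explicit)        # start from the explicit deadlines
    for t in reversed(tasks):        # walk the graph backwards
        for s in succs.get(t, []):
            d = deadline.get(s)
            if d is not None and (t not in deadline or d < deadline[t]):
                deadline[t] = d      # inherit the tightest successor deadline
    return deadline
```

On a chain A -> B -> C where only C carries the deadline 1300 (as with the timing barrier of Step 1), all three tasks end up with deadline 1300, which is what the deadline-driven scheduler then sorts on.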

Avionics GNC application
Step 2 (result): GNC is pipelined. [Two timelines over 0–1000ms showing the resulting schedule, with the GNC deadline at 1300.]

Avionics GNC application
Step 3: post-scheduling optimization. The deadline-driven scheduling routine produces many context/partition changes; their number is reduced by moving reservations, subject to the timing and data-dependency requirements. Low complexity. [Two possible results over 0–1000ms: one with an infinite buffer, one with a buffer of size 3.]

Avionics GNC application
Full application: 4 processors, 1 broadcast bus; 13 tasks before hyper-period expansion; 7 partitions; 10 end-to-end latency constraints.
Result: processor loads of 82%, 72%, 72%, and 10% (the last one for telemetry); bus load 81%.

Conclusion (1/3)
Full-fledged real-time systems compilation is feasible, at least for certain application classes. It ensures correctness and efficiency, and allows a trial-and-error design approach for complex embedded systems.
Simple theoretical and algorithmic principles: independence, isolation, list scheduling, timetables, pipelining.

Conclusion (2/3)
But do not mistake a list of simple principles for a compiler. Integration requires careful adaptation of the concepts and techniques to match the needs of realistic applications and give good results: good resource use and good timing precision (a few percent difference between guarantees and actual execution for certain many-core applications).
Lopht is neither a scheduling toolbox nor a classical compiler, hence the title "real-time systems compiler".

Conclusion (3/3): major difficulties
Few benchmarks/case studies: sharing is difficult (technical, legal, and willingness issues).
Publication: cross-domain work, which everybody sees as interesting "for the other domain" (a sort of NIMBY syndrome).
Sharing between various targets: no general platform definition language.

Future work (1/2): short/medium term (ongoing)
Execution platform modeling.
Provide compiler correctness guarantees/proofs.
Improve the treatment of multi-periodic applications: a trade-off between scheduling freedom (and thus efficient resource use) and compactness of representation (architectural limits, legibility).
New platforms: TTEthernet (taking platform-related tools into account), Kalray MPPA (generate code over the Kalray hypervisor, NoC handling).
New case studies for/from the industry.

Future work (2/2): longer term
Improve embedded systems mapping by adapting and combining models and techniques from compilation (code replication, loop unrolling), real-time scheduling (CRPD computation), and synchronous languages (the n-synchronous clock calculus).
