Real-Time Systems Compilation


Real-Time Systems Compilation

Dumitru Potop-Butucaru

To cite this version: Dumitru Potop-Butucaru. Real-Time Systems Compilation. Embedded Systems. EDITE. HAL Id: tel. Submitted on 28 Jan 2016.

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

Université Pierre et Marie Curie
Mémoire d'habilitation à diriger les recherches (habilitation thesis)
Spécialité: Informatique
École Doctorale Informatique, Télécommunications, et Électronique (EDITE)

Compilation de systèmes temps réel (Real-Time Systems Compilation)

by Dumitru Potop-Butucaru

Reviewers:
- Sanjoy Baruah, Professor, University of North Carolina
- Nicolas Halbwachs, Directeur de recherche, CNRS/Vérimag
- Reinhard von Hanxleden, Professor, University of Kiel

Defended publicly on 10 November 2015 before the examination committee composed of:
- Albert Cohen, Directeur de recherche, INRIA Paris-Rocquencourt
- Nicolas Halbwachs, Directeur de recherche, CNRS/Vérimag
- Reinhard von Hanxleden, Professor, University of Kiel
- François Irigoin, Directeur de recherche, MINES ParisTech
- Alix Munier-Kordon, Professor, Université Pierre et Marie Curie
- François Pêcheux, Professor, Université Pierre et Marie Curie
- Renaud Sirdey, Directeur de recherche, CEA Saclay


Contents

1 Introduction
   When it all began
   Overview of previous work
   Research project: Real-time systems compilation
2 Introduction to synchronous languages
   Synchronous languages
   Related formalisms
3 Automatic synthesis of optimal synchronization protocols
   Semantics of a simple example
   Problem definition
   Previous work
   Contribution
   Definition of weak endochrony
   Characterization of delay-insensitive synchronous components
   Synthesis of delay-insensitive concurrent implementations
4 Reconciling performance and predictability on a many-core
   Motivation
   MPPA/NoC architectures for the real-time
   Structure of an MPPA
   Support for real-time implementation
   MPPA platform for off-line scheduling
   Software organization
   WCET analysis of parallel code
   Mapping (1) - MPPA-specific aspects
   Resource modeling
   Application specification
   Non-functional properties
   Scheduling and code generation
   Mapping (2) - Architecture-independent optimizations
   Motivation

   Related work and originality
   Conclusion
5 Automatic implementation of systems with complex functional and non-functional properties
   Related work
   Time-triggered systems
   General definition
   Model restriction
   Temporal partitioning
   A typical case study
   Functional specification
   Non-functional properties
   Period, release dates, and deadlines
   Modeling of the case study
   Architecture-dependent constraints
   Worst-case durations, allocations, preemptability
   Partitioning
   Scheduling and code generation
   Conclusion

Chapter 1

Introduction

Writing an habilitation thesis always involves a self-assessment of past research. In my case, this retrospective across the years revealed deep roots that may explain why, regardless of changes in research positions, all the research I did fits into a well-defined research program that I would like to continue in the future, a program I named Real-Time Systems Compilation. This continuity resulted in a particular structure for this thesis. To highlight the continuity between motivation, past research work, and research project, I describe them concisely in a single chapter (this one). The remaining four chapters provide a more technical description of my most important results obtained after my PhD. Of these four chapters, the first is dedicated to an introduction to synchronous languages and formalisms, which are the semantic basis of my research work, written from the perspective of their use in the design of real-time embedded systems.

1.1 When it all began...

Like others of my generation, I discovered the world of computing on a home computer, a Sinclair ZX Spectrum clone. These were affordable, simple microcomputers [1], but already complex enough to exhibit the two sides of computing that have followed me to this day. The rational, Apollonian side, the one I immediately liked, was that of programming. That of algorithms that I could write in a (paper) notebook using a clear syntax, before giving them to a computer. Of programs that during execution would either do exactly what I thought they would, or (if not) would allow me to find my error, usually a misunderstanding of some semantic rule. Using reason and the simple statements of BASIC I could at first create nice drawings, some musical notes, then simple games. Later came more powerful computers, increasingly sophisticated languages and programming tools, and increasingly complex applications. But the basics, which I liked, remained largely the same.
[1] 8-bit Z80 CPU, 64 KB of RAM.

But there was also a dark, Dionysian side to home computers. That of computer games, programs of a special kind that I could not fully understand and control. This was a world of magic spells [2] transmitted from gamer to gamer, allowing one to obtain more (or infinite) lives, to change screen features, etc. Of gurus able to write (in assembly!) a new loader for some game in order to permanently alter its behavior, or to add features to it. Even though I didn't know it at the time, playing with home computer games was my first contact with the world of embedded computing. This side of computing fascinated me, and yet it made me uncomfortable. Attempts to understand why its various manipulations worked seemed doomed to failure, as they required detailed understanding of:
- The physical world with which the programs interacted. For instance, understanding data loading from audio cassettes required at least basic knowledge of sampling theory.
- Techniques for the efficient and real-time implementation of programs. This included detailed knowledge of the hardware, such as the functioning and real-time characteristics of video memory. It also included mastering the software development process, including low-level aspects such as assembly coding.
Searching, guessing, and trying seemed more important here than reasoning. And yet, over the years, I had to cope with this dark side again and again. First, on toy projects. For instance, creating a virus-like resident program on a PC required reprogramming the keyboard interrupt of my DOS/x86 system, while programming a two-wheeled Lego Mindstorms robot required me to understand the basics of the inverted pendulum, PID controllers, and the functioning of sensor and actuator hardware.
Later, through my research, I learned that industrial embedded systems designers faced the same problems, scaled up in complexity according to system size, embedding constraints (safety, low consumption, etc.), and industrial process considerations. From my research I also learned that the description of physical processes no longer belonged to the dark side. Well-defined languages, such as Simulink/Stateflow or LabVIEW, had been introduced to allow their non-ambiguous description and analysis. The implementation side also gained more detail. I was progressively able to grasp the complexity of the infrastructure that allows sequential programs, written in BASIC, C, or Ada, to run correctly and efficiently. This infrastructure includes development tools (compiler, debugger) and system software (drivers, operating system, middleware). It also includes the standards on which these tools are based: programming languages, instruction set architectures (ISAs) such as x86 or ARMv5, application binary interfaces (ABIs) such as the ARM EABI, executable formats such as ELF, or even system-level standards such as POSIX or ARINC 653.

[2] Specific calls to the PEEK and POKE instructions that directly read and wrote specific memory addresses.

But even with this deeper understanding, the embedded design flow falls short of the expectations created by high-level sequential programming. Significant manual phases remain, where ensuring correctness and efficiency relies on expert intervention (the modern equivalent of magic) to either manually transform the code or at least to validate it. In this context, it is only natural to ask the simple question that has guided my research: what part of the embedded implementation process can be fully automated, in a way that ensures both correctness and efficiency?

1.2 Overview of previous work

The question of automation in the embedded design process is general enough that I was able to follow it for my entire research career. Another aspect of my research work that has never changed since the beginning of my PhD is the use of a specific tool: the synchronous, multi-clock, and time-triggered languages, presented in Chapter 2, which facilitate the formal specification and analysis of deterministic concurrent systems. Using these formalisms, I have considered three increasingly complex implementation problems that must be solved as part of the embedded design process:

Compilation of synchronous programs into efficient sequential (task) code. I started this line of work during my PhD, supervised by G. Berry and R. de Simone. I defined a technique for the compilation of imperative Esterel programs into fast and small sequential code. By introducing a series of optimizations based on static analysis of Esterel programs (using their rich structural information), I was able to produce code that still remains the fastest and a close contender in terms of size. To have a clear semantic definition of the data handling instructions of Esterel, and therefore be able to define the correctness of my compiler, I also introduced a new operational semantics for the language.
Main results on these topics are presented in my book Compiling Esterel, co-written with S. Edwards and G. Berry [132]. The prototype compiler I wrote was transferred to industry (the Esterel Technologies company).

Automatic synthesis of optimal synchronization protocols for the concurrent implementation of synchronous programs. Large synchronous specifications are often implemented as a set of components (e.g. tasks, threads) running in an asynchronous environment. This can be done for execution or simulation purposes in a multi-thread, multi-task, or distributed context. To preserve the semantics of the initial synchronous specification, supplementary inter-component synchronizations may be needed, and for efficiency purposes it is important to keep synchronization to a minimum. As a post-doc, I started working on this problem with A. Benveniste and B. Caillaud, who had already defined endochrony. Endochrony is a property of synchronous components. When executed in an asynchronous environment, an endochronous component

remains deterministic without the need for supplementary synchronization, because sufficient synchronization is provided by the message exchanges already prescribed by the initial synchronous specification. With my collaborators, I first determined that endochrony is a rather restrictive and non-compositional sufficient property. Then, we introduced the theory of weak endochrony, which characterizes exactly the components that need no supplementary synchronization [128, 126, 120]. Based on this theory, I defined algorithms for determining whether a synchronous program is weakly endochronous, and then a method for adding minimal (optimal) synchronization ensuring weak endochrony to an existing program [134, 118]. These algorithms use an original, compact representation of the synchronization patterns of a synchronous program, which I defined. They were implemented in a prototype tool connected to the Signal/Polychrony toolset (post-doc of V. Papailiopoulou, collaboration with the INRIA Espresso team) [118]. These results are presented in more detail in Chapter 3.

Efficient compilation of systems with complex functional and non-functional properties. Working with industrial partners from the embedded design field made me realize that my previous results addressed only particular, albeit important, aspects of a complex system (the synthesis of sequential tasks and the synthesis of communications). After joining the INRIA Aoste team as a permanent researcher I started investigating the system-level synthesis of real-time embedded systems [3], and in particular that of systems relying on static (off-line) or time-triggered real-time scheduling. Such systems are used in safety-critical domains (avionics, automotive, rail) and in signal processing.
I developed the conviction that building safe and efficient systems requires addressing two fundamental problems, which are only partially solved today:
- The seamless formal integration of full implementation flows going all the way from high-level specification (e.g. Scade, Simulink) to running implementation (code executing on the platform and platform configuration).
- Fast and efficient synthesis with full error traceability, which allows the use of a trial-and-error design style aimed at maximizing engineer productivity.
To solve these two problems, I have designed and built, with my students and post-docs, the LoPhT real-time systems compiler [38, 37, 73, 36, 124, 130]. By using fast allocation and scheduling heuristics to ensure scalability, LoPhT takes inspiration from previous work in the fields of off-line real-time scheduling, optimized compilation, and synchronous language analysis and implementation. But LoPhT goes beyond previous work by carefully integrating, both semantically and algorithmically, aspects that were previously considered separately, such as

[3] A subject well-studied in the team by both Y. Sorel, author of the AAA methodology [76], and R. De Simone, whose research interest in modeling time was materialized, among others, in a significant contribution to the time model of the UML MARTE standard [53].

the fine semantic properties of the high-level specifications [130], detailed descriptions of the execution platforms (hardware, OS/libraries, execution model) [124, 36, 38], and complex non-functional specifications covering all the modeling needs of realistic systems [38]. For instance, LoPhT combines in a single tool the use of a classical real-time scheduling algorithm (deadline-driven scheduling) with classical compiler optimizations (e.g. software pipelining [37, 38]), domain-specific optimizations (safe double reservation based on predicate analysis [130]), and platform-specific optimizations (minimizing the number of tasks and partition switches for ARINC 653 systems [38], preemptive communications scheduling for many-core and TTEthernet-based networks [124, 36], etc.). Combined with precise time accounting, the integration of these optimizations allows the generation of efficient code while providing formal correctness guarantees. I have dedicated special attention to ensuring that the platform models used by the scheduling algorithms are conservative abstractions of the actual platforms. To do this, I have initiated collaborations that allowed us to explore the design of execution platforms with support for off-line real-time scheduling. Such platforms allow the construction of applications that are both highly efficient and temporally predictable [55, 35, 124]. Together with my collaborators, I have determined that such architectures enable precise worst-case execution time analysis for parallel software [133] and efficient application mapping [36]. I have also initiated industrial collaborations meant to ensure that LoPhT responds to industry needs, and to promote its use [73, 38, 49]. These results are presented in more detail in Chapters 4 and 5.
The first one considers a more compilation-like point of view, focusing on fine-grain architectural detail and on mapping problems where the objective is to optimize simple metrics. While the methods presented in this chapter provide hard timing guarantees, they do not take real-time requirements into account. Subjects covered in this chapter are the mapping of applications onto many-cores, the use of advanced compiler optimizations in off-line real-time scheduling, and the worst-case execution time analysis of parallel code. Non-functional requirements of multiple types (real-time, partitioning, preemptability) are considered in Chapter 5, in conjunction with time-triggered execution targets. This completes the definition of our real-time systems compilation approach.

1.3 Research project: Real-time systems compilation

The implementation of complex embedded software relies on two fundamental and complementary engineering disciplines: real-time scheduling and compilation. Real-time scheduling covers [4] the upper abstraction levels of the implementation process, which determine how the functional specification is transformed

[4] Together with other disciplines such as systems engineering, software engineering, etc.

into a set of tasks, and then how the tasks must be allocated and scheduled onto the resources of the execution platform in a way that ensures functional correctness and the respect of non-functional requirements. By comparison, compilation covers the low-level code generation process, where each task (a piece of sequential code written in C, Ada, etc.) is transformed into machine code, allowing actual execution. In the early days of embedded systems design, both high-level and low-level implementation activities were largely manual. However, this is no longer the case at the low level, where manual assembly coding has been almost completely replaced by the combined use of programming languages such as C or Ada and compilers [60]. This shift towards high-level languages and compilation allowed a significant productivity gain by making source code safer and more portable. As compiler technology improved and systems became more complex in both hardware and software, compilation has also approached the efficiency of manual assembly coding, and in most cases now outperforms it. The widespread adoption of compilation was only possible due to the early adoption of standard interfaces that allowed the definition of economically viable compilation tools with a large enough user base. These interfaces include not only the programming languages (C, Ada, etc.), but also relatively stable microprocessor instruction set architectures (ISAs) and executable code formats like ELF. The paradigm shift towards fully automated code generation is far from complete at the system level. Aspects such as the division of the functional specification into tasks, the allocation of tasks to resources, or the configuration of the real-time scheduler are still performed manually for most industrial applications.
Furthermore, research in real-time scheduling has largely followed this trend, with most (but not all) effort still invested in verification-based approaches aimed at proving the schedulability of a given system (and in the definition of run-time mechanisms improving resource use). This slow adoption of automatic code generation can be traced back to the slower introduction of standard interfaces allowing the definition of economically viable compilers. This also explains why real-time scheduling has historically dedicated much of its research effort to verifying the correctness of very abstract and relatively standard implementation models (the task models). The actual construction of the implementations and the abstraction of these implementations as task models drew comparatively less interest, because they were application-dependent and non-portable. But if standardization and automation advanced more slowly, they advanced nevertheless. Functional specification languages such as Simulink, LabVIEW, or SCADE were introduced starting in the mid-1980s, allowing the gradual definition of techniques for the synthesis of functionally-correct sequential or even multi-task embedded code (but without real-time guarantees). The next major step came in the mid-1990s, when execution platforms were standardized in fields such as avionics (IMA/ARINC 653) and automotive (OSEK/VDX, then AUTOSAR). This second wave of standardization allowed the industrial introduction of automatic tools for the (separate) synthesis of processor

schedules or network schedules. The research community went even further and proposed real-time implementation flows that automatically produce running real-time applications [76, 37, 36, 50], where the processor and network schedules are jointly computed using a global optimization approach that results in better resource use. Of course, more work is needed to ensure the industrial applicability of such results. For instance, the aforementioned techniques could not handle all the complexity of IMA avionics systems, which involve functional specifications with multiple execution modes, multi-processor architectures with complex interconnect networks, and complex non-functional requirements including real-time, partitioning, preemptability, allocation, etc. This explains why, to this day, the design and implementation of industrial real-time systems remains to a large extent a craft, with significant manual phases. But a revolution is brewing, driven by two factors:
- Automation can no longer be avoided, as the complexity of systems steadily increases in both specification size (number of tasks, processors, etc.) and complexity of the objects involved (dependent tasks, multiple modes and criticalities, novel processing elements and communication media...).
- Fully automated implementation is attainable for industrially significant classes of systems, due to significant advances in the standardization of both specification languages and implementation platforms.
To allow the automatic implementation of complex embedded systems, I advocate a real-time systems compilation approach that combines aspects of both real-time scheduling and (classical) compilation. Like a classical compiler such as GCC, a real-time systems compiler should use fast and efficient scheduling and code generation heuristics, to ensure scalability. [5]
Similarly, it should provide traceability support in the form of informative error messages, enabling an incremental trial-and-error design style much like that of classical application software. This is more difficult than in a classical compiler, given the complexity of the transformation flow (creation of tasks, allocation, scheduling, synthesis of communication and synchronization code, etc.), and requires full formal integration along the whole flow, including the crucial issue of correct hardware abstraction. A real-time systems compiler should perform precise, conservative timing accounting along the whole scheduling and code generation flow, allowing it to produce safe and tight real-time guarantees. More generally, and unlike in classical compilers, the allocation and scheduling algorithms must take into account a variety of non-functional requirements, such as real-time constraints, criticality/partitioning, preemptability, allocation constraints, etc. As the accent is put on the respect of requirements (as opposed to the optimization of a metric, as in classical compilation), the resulting scheduling problems are quite different. Together with my students, I have defined and built such a real-time systems compiler, called LoPhT, for statically scheduled real-time systems. While first

[5] Exact application mapping techniques do not scale [72].

results are already here [37, 38, 73, 36, 35], I believe that the work on real-time systems compilation is only at its beginning. It must be extended to cover more execution platforms, and we are currently working on porting LoPhT to the Kalray MPPA256 many-core and to TTEthernet-based time-triggered systems. Efficiency is also a critical issue in practical systems design, and we must invest even more in the use of classical optimizations such as loop unrolling and inline expansion, as well as new optimizations specific to the real-time context and to each platform in particular. To cover these needs, we must also go beyond fully static/off-line scheduling, while remaining fully formal, automated, and safe. Ensuring the safety and efficiency of the generated code cannot be done by a single team. I am actively promoting the concept of real-time systems compilation in the community, and collaborations on the subject will have to cover at least the following subjects: the interaction between real-time scheduling and WCET analysis, the design of predictable hardware and software architectures, programming language support for efficient compilation, and formal proof of compiler correctness. Of course, the final objective is to promote real-time systems compilation in industry, and to this end I actively seek industrial case studies and disseminate our work towards industry. From a methodological point of view, my research will continue on its current trend of combining concepts and methods from three different communities: compilation, real-time scheduling, and synchronous languages. I am fully aware of the long-standing separation between the fields of compilation and real-time scheduling. [6] However, I believe that the original reasons for this separation [7] are less and less true today. [8]
The convergence between these two communities seems to me inevitable in the long run, and my work can be seen as part of the much-needed mutualization of resources (concepts and techniques) between the two fields. My work also shows that synchronous languages should play an important role in the convergence between real-time scheduling and compilation. First of all, as a common ground for formal modeling. Indeed, synchronous formalisms are natural extensions of formalisms of both real-time scheduling (the dependent task graphs) and compilation (static single assignment representations and the data dependency graphs). But beyond being a mere common formal ground, previous work on synchronous languages also provides powerful techniques for the modeling and analysis of complex control structures that are used in embedded systems design. [9]

[6] Publishing has been quite a struggle for this reason.
[7] Focus on sequential code and static scheduling in the compilation community, focus on dynamic, multi-task code in real-time scheduling.
[8] Compilation, for instance, now considers dynamically-scheduled targets such as GPGPUs, and some algorithms perform precise timing accounting, as in software pipelining. At the same time, real-time scheduling is considering with renewed interest statically-scheduled targets (due to industrial demand).
[9] By means of so-called clocks and delays, presented in the next chapter.
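To make the preceding discussion concrete, the core mapping step that such a compiler automates — allocating and scheduling a dependent task graph with worst-case execution times (WCETs) onto processors — can be sketched as follows. This is a toy illustration of the principle, not LoPhT's actual algorithm; all task names, durations, and dependencies are invented, and real tools also handle communications, preemption, partitions, and much more.

```python
# Illustrative off-line list scheduler for a dependent task graph.
# Tasks, WCETs and dependencies are invented for this example.
wcet = {"sense": 2, "filter": 3, "control": 4, "log": 1, "actuate": 2}
deps = {"filter": ["sense"], "control": ["filter"],
        "log": ["filter"], "actuate": ["control"]}

def list_schedule(wcet, deps, n_cpus=2):
    finish = {}                # task -> completion date
    cpu_free = [0] * n_cpus    # next free date of each CPU
    schedule = []              # (task, cpu, start, end)
    pending = set(wcet)
    while pending:
        # tasks whose predecessors are all scheduled
        ready = [t for t in pending
                 if all(p in finish for p in deps.get(t, []))]
        # greedy choice: longest WCET first (a common heuristic)
        ready.sort(key=lambda t: -wcet[t])
        t = ready[0]
        cpu = min(range(n_cpus), key=lambda c: cpu_free[c])
        start = max([cpu_free[cpu]] +
                    [finish[p] for p in deps.get(t, [])])
        end = start + wcet[t]
        finish[t] = end
        cpu_free[cpu] = end
        schedule.append((t, cpu, start, end))
        pending.remove(t)
    return schedule, max(finish.values())  # static schedule + makespan

sched, makespan = list_schedule(wcet, deps)
```

Because the computation is entirely off-line and based on conservative WCETs, the resulting table of (task, processor, start, end) entries is itself the real-time guarantee: the makespan is a provable bound, not a measurement.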

Chapter 2

Introduction to synchronous languages

As evidenced by the publication record, all three of the original synchronous languages (Esterel, Lustre, and Signal) are products of the 1980s real-time community [21, 20, 99], where they were introduced to facilitate the high-level specification of complex real-time embedded control systems. An embedded control system aims to control, in the sense of automatic control theory, a physical process in order to lead it towards a given state. The physical process and the control system, which usually form a closed loop [56], are initially specified in continuous time in order to be analyzed and simulated. Then, the control system is discretized to allow its implementation on the embedded execution platform. Fig. 2.1 describes the interactions between the discrete control system and the physical process. The embedded control system obtains its discrete-time inputs through sensors equipped with analog-to-digital converters (ADC). The discrete outputs of the embedded system are transformed by the digital-to-analog converters (DAC) of the actuators into continuous-time feedback to the physical process. Both inputs and outputs can be implemented using event-driven or periodic sampling (time-triggered) mechanisms.

[Figure 2.1: Closed-loop control system. The real-time control system (digital domain) reads the physical process (analog domain) through a sensor (ADC) and acts on it through an actuator (DAC).]
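Once discretized, the sensor-compute-actuate cycle of Fig. 2.1 reduces to a periodic reaction loop. The sketch below illustrates this shape on an invented example: a discrete proportional controller driving a simulated first-order process. All names and constants are chosen for illustration only; a real system would read and write hardware (ADC/DAC) instead of simulating the process.

```python
# Minimal discretized closed-loop sketch: a proportional controller
# regulating a simulated first-order process toward a setpoint.
# Constants are invented for illustration.
def simulate(setpoint=1.0, kp=0.5, steps=50):
    state = 0.0                    # physical process state
    for _ in range(steps):         # one iteration = one reaction
        measurement = state        # sensor (ADC) sampling
        command = kp * (setpoint - measurement)  # control law
        state += 0.5 * command     # actuator (DAC) + process dynamics
    return state

final = simulate()  # converges toward the setpoint
```

Note that the loop body is exactly a "reaction" in the sense used below: acquire inputs, compute, update outputs, with each iteration corresponding to one discrete instant of the controller's clock.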

Synchronous languages were introduced to specify discretized control systems, which are reactive, real-time systems. In a reactive system [82, 78], execution can be described as a set of interactions between the system and the physical process. Each acquisition of data by the sensors is followed by a reaction of the control system, which consists of performing some computations and then updating the actuators. Multiple reactions may be executed at the same time, competing for the same execution resources, a property known as concurrency. Real-time systems are reactive systems where reactions are subject to timing constraints. These constraints are determined by the control engineers during the discretization of the control system. The constraints may concern the sampling periods of the sensors and actuators and/or the latencies (end-to-end delays) of the reactions. The sampling constraints on the sensors and actuators determine the periods of the computation functions (tasks) that depend on or drive sensing and actuation. The latency constraints apply to chains of computation functions, which may have a sensor or an actuator as an endpoint. A deadline is a particular case of latency constraint applied to a single computation function, for instance the code controlling a sensor or a digital filter. Reactive and real-time control systems have particular specification needs. To describe the reactive aspects, synchronous languages offer syntactic constructs allowing the specification of order (dependency, sequence), concurrency (parallelism), conditional execution, and simultaneity relations between operations of the system (data acquisitions, computation functions, data transfers, and actuator updates). For the non-functional specification of the real-time aspects, the synchronous languages implicitly define, or allow the explicit definition of, one or more discrete time bases, called clocks.
A clock describes a finite or infinite sequence of events in the execution of the system. Thus, each clock divides the execution of the system into a series of execution steps, which are sometimes called logical instants, reactions, or computation instants. We can associate a clock with each periodic event (e.g. a periodic timer), sporadic event (e.g. the top dead center, or TDC, of a piston in a combustion engine), aperiodic event (e.g. a button press), or simply with an internal event of the control system, which is built from other internal or external events and therefore depends on other clocks. Clocks are so-called logical time bases. This means that the synchronous languages allow the specification of order relations between events associated with these clocks, but do not offer support for the analysis of the relations between the physical quantities they may represent (except through specific extensions detailed below). Clocks associated with physical quantities are called physical clocks. For instance, an engine control system may have a physical clock associated with a timer and another physical clock associated with the TDC event. By taking into account the maximum speed of the engine, we can determine the maximal duration (in time) between two events of the TDC clock, thus relating events of the two clocks. Such an analysis requires the application of physical theories, in addition to the theory of synchronous languages.

The execution of every operation of a synchronous system has to be synchronized with respect to at least one clock. Real-time information coming from the control specification (periods, deadlines, latencies) cannot be directly associated with operations. Instead, it is associated with the clocks, which in turn drive the execution of operations.

[Figure 2.2: Scope of application of synchronous languages in real-time systems specification. The synchronous program covers the platform-independent, discrete-time control specification (computations, dependencies, modes). The full real-time implementation problem adds the non-functional specification: periods, deadlines, latencies, WCETs, allocations, criticalities, and the resources and topology of the platform.]

As shown in Fig. 2.2, the specification of a real-time implementation problem does not only include the platform-independent discrete-time controller specification, provided in the form of a synchronous program. It also includes non-functional requirements coming from other engineering disciplines (such as criticalities) and the constraints related to the implementation of the control system on an embedded execution platform. These platform-dependent constraints include the definition of the resources of the platform, the worst-case execution time estimations (WCETs) of computations on the CPUs, the worst-case durations of communications over the buses (WCCTs), the allocation constraints, etc. Criticalities and platform-dependent information are not part of the discrete-time controller specification, and synchronous languages are not meant to represent them. In other terms, synchronous languages alone are not equipped to allow the specification and analysis of all the aspects of a real-time embedded implementation problem. Some synchronous languages allow the specification of platform-related properties through dedicated extensions that will be discussed later in this chapter.
Ignoring platform-related aspects is one of the key points of synchronous languages. Ignoring execution durations means that we may assume the computation of each reaction to take 0 time, so that its inputs and outputs are simultaneous (synchronous) in the discrete time scale (clock) that governs the

execution of the reaction. This synchrony hypothesis, which gives its name to the synchronous model, is naturally inherited through discretization from continuous-time modeling, where it is implicitly used. Implementing a specification relying on the synchrony hypothesis amounts to solving a scheduling problem which ensures that:

- The resources of the execution platform allow each reaction to terminate before its outputs are needed, either as input to other reactions or to drive the actuators.
- All period and latency requirements specified by the control engineers are satisfied.

As part of the synchrony hypothesis, we also require that a reaction has a bounded number of operations. This assumption ensures that, independently of the execution platform, the computation of a reaction is always completed in bounded time, which allows the application of real-time schedulability analysis.

Under the synchrony hypothesis, all computations of a reaction are synchronous, in the sense that they are not ordered in the discrete time scale defined by the clock of the reaction. However, their execution has to be causal:

- Two reads of the same variable/signal performed during the same reaction must always provide the same result. If a variable/signal is written during a reaction, then all reads inside the reaction will produce the same value.
- No variable/signal should be written twice during a reaction, or otherwise it must be specified which of the writes gives the variable/signal its value.

This amounts to requiring that a variable/signal is read in a reaction only after all write operations on the variable/signal have been completed. Causality ensures functional determinism in the presence of concurrency. It ensures that executing the computations and communications of a reaction will always produce the same result, for any scheduling of the operations that satisfies the data and control dependencies. 
In a causal system, a set of inputs will always produce the same set of outputs. This property is important in practice, since it simplifies the costly activities of verification and validation (test, formal verification, etc.), as well as debugging. These four ingredients (clocks, the synchrony hypothesis, causality, and functional determinism) define a formal base that is common to all synchronous languages. It ensures strong semantic soundness by allowing universally recognized mathematical models, such as Mealy machines and digital circuits, to be used as supporting foundations. In turn, these models give access to a large corpus of efficient optimization, compilation, and formal verification techniques. The synchrony hypothesis also guarantees full equivalence between the various levels of representation, thereby avoiding altogether the pitfalls of non-synthesizability of other similar formalisms.
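The causality rules above can be illustrated with a small sketch (illustrative Python, not from the thesis): each operation writes one signal and reads others, and a reaction is evaluated by firing any operation whose inputs are available. Confluence makes the result independent of the order in which ready operations are picked.

```python
# Illustrative sketch of causal reaction evaluation: every read of a signal
# happens after its (unique) write, so any valid schedule gives the same result.

def run_reaction(inputs, operations):
    """operations: list of (written_signal, read_signals, fn)."""
    env = dict(inputs)
    pending = list(operations)
    while pending:
        # Pick any operation whose read signals are all available; the choice
        # does not change the final result (confluence).
        ready = [op for op in pending if all(r in env for r in op[1])]
        if not ready:
            raise RuntimeError("causality cycle")
        op = ready[0]
        written, reads, fn = op
        if written in env:
            raise RuntimeError("signal written twice: " + written)
        env[written] = fn(*[env[r] for r in reads])
        pending.remove(op)
    return env

# o2 depends on o1, so o1's write is always scheduled before o2's read.
ops = [("o2", ("o1", "c"), lambda s, c: s * c),
       ("o1", ("a", "b"), lambda a, b: a + b)]
result = run_reaction({"a": 1, "b": 2, "c": 10}, ops)
print(result["o1"], result["o2"])  # 3 30
```

Listing the dependent operation first is deliberate: the evaluation order is recovered from the dependencies, not from the textual order.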

2.1 Synchronous languages

Structured languages have been introduced for the modeling and programming of synchronous applications. From a syntactical point of view, each one of them provides a certain number of constructs facilitating the description of complex systems: concurrency, conditional execution and/or modes of execution, dependencies, and clocks allowing the description of complex temporal relations such as multiple periods. This allows the incremental (hierarchical) specification of complex behaviors from elementary behaviors (computation functions without side effects and with bounded durations). The concision and the deterministic semantics of synchronous specifications make them a good starting point for design methodologies for safe systems, where a significant part of the time budget is devoted to formal analysis and testing.

  Language       Imperative/    Base       Physical    Real-time
                 Data Flow      clock(s)   time        analysis
  Esterel/SSM    I(+DF)         Single
  Lustre/Scade   DF(+I)         Single
  TAXYS          I              Single     APODW       WCRT, sched
  Lucy-n         DF             Affine
  SynDEx         DF             Affine     PW (D=P)    WCRT, sched
  Giotto         DF             Affine     P (D=P)
  Prelude        DF             Affine     PODW        WCRT, sched
  Signal         DF(+I)         Multiple
  EmbeddedCode   I              Multiple   AD
  ΨC             I+DF           Multiple   APODW       WCRT, sched
  SciCos         DF             Multiple   C
  Zélus          DF             Multiple   C

Table 2.1: Classification of synchronous languages. I = Imperative, DF = Data flow, P = periodic activations, A = aperiodic activations, D = Deadlines, O = Offsets, W = durations, C = continuous time.

However, beyond these aspects, each of the synchronous languages has original points and particular uses, and therefore a classification is required. Table 2.1 summarizes this classification along 4 criteria.

Programming paradigm. According to the programming paradigm, synchronous languages are divided into two large classes: declarative data-flow languages and imperative languages. Declarative languages, which include data-flow languages, focus on the definition of the function to be computed. 
Imperative languages describe the organization of the operations needed to compute this function (computations, decisions, inputs/outputs, state changes). Among the synchronous languages, Esterel [21], SyncCharts [7], ΨC [42, 43], and the embedded code formalism [86] are classified as imperative, while Lustre/SCADE [20], Signal/Polychrony [99, 19], SynDEx [76, 94], Giotto [85], and Prelude [116] are

classified as data-flow. The SciCos [32] and Zélus [28] languages are data-flow languages, with the particularity of being hybrid languages, which allow the representation of both continuous-time and discrete-time control systems.

Data-flow languages are syntactically closer to synchronous digital circuits and to real-time dependent task models. In these languages, the concurrency between operations is only constrained by explicit data dependencies. Data-flow languages are used to highlight the flow of information between parts of an application, or its structuring into tasks. Imperative languages are syntactically closer to (hierarchical) synchronous automata. They are generally used to represent complex control structures, such as those of an operating system scheduler. Besides concurrency, they allow the specification of operation sequencing and offer hierarchical constructs allowing a behavior to be stopped or resumed in response to an internal or external signal. The data dependencies are often represented implicitly, using shared variables, instead of explicit data dependencies.

The first synchronous languages could be easily classified as imperative (Esterel) or data-flow (Lustre, Signal). However, the successive evolutions of these languages have made classification more difficult. For instance, the data-flow language Scade/Lustre has incorporated imperative elements, such as state machines, whereas an imperative language such as Esterel has incorporated declarative elements such as the sustained emit instruction. This is why, in our table, some languages belong to both classes.

Number of time bases. A second criterion of classification of synchronous languages is related to the number of time bases that can be defined and used in a program. In Lustre/SCADE and Esterel, which we call single-clock, a single logical time base exists, called the global clock or base clock. 
All other time bases (clocks) are explicitly derived from the base clock by sub-sampling. The Signal and ΨC languages do not have this limitation: they allow the definition of several logical time bases. As explained above, an automotive engine control application may have two base clocks, one corresponding to time and the other to the rotation of the engine (TDC), and these two clocks cannot be defined from one another. Having two base clocks allows the operations to be ordered with respect to events of two different discrete time bases, which may facilitate both modeling and analysis.¹ The languages allowing the definition of multiple, independent clocks are called polychronous or multi-clock.

Between single-clock languages and multi-clock languages, we identify an intermediate class of languages that allow the definition of several time bases, but require that, as soon as two clocks are used by a same operation, they become linked by a relation allowing their logical instants to be completely ordered in a unique way. Therefore, a global clock can be built unambiguously for every program from the time bases specified by the programmer.²

¹ Building a single-clock model of an application is always possible [79], but analysis may be more complicated.
² More precisely, we can build such a clock if the program cannot be divided into completely independent parts.

However, it is

often more interesting not to build this global clock and instead apply specific analyses directly on the clocks specified by the programmer. For instance, the languages Giotto, Lucy-n, Prelude, and SynDEx allow the definition of clocks linked in period and phase (offset) by affine relations. These languages allow a more direct description of periodic real-time task systems with different periods [116, 51, 94].

Modeling of physical time. The third classification criterion we use is the presence in the language of extensions allowing the description of physical time. This concept appears naturally in languages allowing the specification of continuous-time systems, like SciCos or Zélus [32, 28]. However, we are more interested here in languages aiming directly at the specification of a real-time implementation problem. This requires concepts such as periodic and aperiodic activations, deadlines, offsets, and execution times. These extensions allow the application of various real-time analyses: worst-case response time analysis, schedulability analysis, or even the synthesis of schedules or the adjustment of the parameters of the scheduler of a real-time operating system.

Enforcement of the synchrony hypothesis. Our final classification criterion concerns the enforcement of the synchrony hypothesis. The Esterel, Lustre, Signal, and SynDEx languages require strict adherence to it. The computation and data transfer operations are semantically infinitely fast, so that a reaction must always terminate before the beginning of the next one. The execution of the system can therefore be seen as a totally ordered sequence of reactions.³ In particular, every operation (computation or communication), independently of its clock, can be and must be terminated in the clock cycle where it began, before the beginning of the next clock cycle. 
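The affine clock relations and the strict-synchrony constraint just described can be sketched as follows (illustrative helper names, not taken from any of the cited languages): a periodic clock is a (period, offset) pair, and under strict synchrony every operation must fit between two consecutive ticks, i.e. within the GCD of the periods.

```python
from functools import reduce
from math import gcd

# Illustrative sketch: a periodic clock with period P and offset O ticks at
# instants O, O + P, O + 2P, ...

def ticks(period, offset, horizon):
    """Instants of a (period, offset) clock strictly below the horizon."""
    return list(range(offset, horizon, period))

def max_operation_duration(periods):
    """Longest operation duration compatible with strict synchrony."""
    return reduce(gcd, periods)

print(ticks(10, 2, 40))                  # [2, 12, 22, 32]
print(max_operation_duration([10, 15]))  # 5
```

With periods 10 and 15, an operation longer than 5 time units is a "long task" in the sense discussed below, and needs explicit delays to be represented.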
³ This is true even for multi-clock languages.

If we associate real-time periods with logical clocks, this assumption implies that an operation (computation or communication) cannot have a duration longer than the greatest common divisor (GCD) of the periods of the system. However, the description of real-time systems often implies so-called long tasks, with a duration longer than the GCD of the periods. Representing such tasks in a synchronous formalism requires constraining the synchronous composition to ensure that an operation takes more than one logical instant. One way to do this is by systematically introducing explicit delays between the end of an operation and the operations depending on it. These delays explicitly represent the time (in clock cycles) reserved for the execution of the operation. Introducing such delays manually may be tedious, and some languages, such as Giotto, Prelude, and SynDEx, have proposed dedicated constructs with the same effect. In Giotto, the convention is that the outputs of a task remain available during the clock cycle following the one where the operation started, in the time base given by the clock associated with the task. Prelude is more expressive. It allows the definition of delays shorter than one clock cycle by refining the clock

of the operation and then working in this refined time base. SynDEx proposes an intermediate solution.

2.2 Related formalisms

Of course, synchronous languages are only one of the classes of formalisms used in embedded control system design. For instance, in traditional real-time systems design, two levels of representation are particularly important: real-time task models [102, 14], which serve to perform the real-time scheduling analysis (feasibility or schedulability), and the low-level implementation code, provided in languages such as C or assembly. Real-time task models are not designed as full-fledged programming languages, focusing only on the definition of properties that will be exploited by classical schedulability analysis techniques. Among these properties: the organization of computations into tasks; the periods, durations, and deadlines of tasks; and sometimes their dependencies or exclusion relations. By comparison, synchronous languages are full-fledged programming languages that can serve both as support for real-time scheduling analyses and as task-level and system-level programming languages. They allow, for instance, the full specification of a task functionality, followed by fully automatic generation of the low-level task code. They also allow the specification of full systems including tasks, OS, and hardware for simulation, formal analysis, or for the synthesis of task sequencing and synchronization code. Thus, while remaining at a high abstraction level and focusing on the specification of platform-independent functional and timing aspects, synchronous languages allow the automatic synthesis of an increasing part of the low-level code, especially in critical embedded control systems.

Like synchronous languages, the synchronous data-flow (SDF) [LEE 87] and derived formalisms (such as CSDF [22], SigmaC [74], or StreamIt [5]) feature a cyclic execution model. 
The difference is that repetitions (cycles) of the various computations and communications are not synchronized along global time references (clocks). Instead, the execution of each data-flow node is driven by the arrival of input data along lossless FIFO channels (a form of local synchronization). The pair of formalisms Simulink/StateFlow [27, 46] is the de facto standard for the modeling of control systems. These languages share with synchronous languages a great deal of their basic constructs: the use of logical and physical time bases, the synchrony hypothesis, a definition of causality and even a good part of the language constructs. However, the differences are also great: synchronous languages aim to give unique and deterministic semantics to every correct specification, and thus they aim to ensure the equivalence between analysis, simulation and execution of the implemented system. The objective of Simulink (as its name indicates) is to allow the simulation of control systems, whether they are specified in discrete and/or continuous time. The definition of causal dependencies is clear, but it depends on the chosen simulation mode,

and the number of simulation options is such that it is sometimes difficult to determine which rules apply. To accelerate the simulations, there are options that explicitly allow for non-determinism. Finally, the determinism of the simulation is sometimes only obtained through the use of rules depending on the relative position of the graphical objects of a specification (in addition to the classical causality rules). By comparison, the semantics of a synchronous program only depends on the data dependencies between operations, which preserves more concurrency and therefore gives more freedom to the scheduling algorithms.

The definition of a synchronized cyclic execution on physical or logical time bases is also shared by formalisms such as StateCharts [81] or VHDL/Verilog [90]. Like synchronous languages, these formalisms define a concept of (logical) execution time and allow a complex propagation of control within these instants. However, synchronous causality (and thus determinism) is not always required.
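Returning to the SDF-like formalisms mentioned above, their data-driven execution can be sketched as follows (a hypothetical, simplified model with one token consumed and produced per port per firing): a node fires as soon as all of its input FIFOs hold data, with no global clock synchronizing the repetitions.

```python
from collections import deque

# Illustrative sketch of SDF-style data-driven execution over lossless FIFOs.

def run_sdf(nodes, fifos, rounds):
    """nodes: list of (input_names, output_names, fn); fifos: dict of deques."""
    for _ in range(rounds):
        for ins, outs, fn in nodes:
            if all(fifos[i] for i in ins):             # firing rule: data on all inputs
                args = [fifos[i].popleft() for i in ins]
                for name, token in zip(outs, fn(*args)):
                    fifos[name].append(token)

fifos = {"src": deque([1, 2, 3]), "mid": deque(), "out": deque()}
nodes = [(["src"], ["mid"], lambda x: (x * 2,)),   # doubles each token
         (["mid"], ["out"], lambda x: (x + 1,))]   # increments each token
run_sdf(nodes, fifos, 3)
print(list(fifos["out"]))  # [3, 5, 7]
```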


Chapter 3

Automatic synthesis of optimal synchronization protocols

Synchronous programming is nowadays a widely accepted paradigm for the design of critical applications such as digital circuits or embedded real-time software [18, 122], especially when a semantic reference is sought to ensure the coherence between the implementation and the various analyses and simulations. But building concurrent (distributed, multi-task, multi-thread) implementations of synchronous specifications remains an open and difficult subject, due to the need to preserve the global synchronizations specific to the model. Synchronization artifacts most of the time need to be preserved, at least in part, in order to ensure functional correctness when the behavior of the whole system depends on properties such as the arrival order of events on the various communication media or the passage of time, as measured by the presence or absence of events in successive reactions.

Ensuring synchronization in concurrent implementations can be done in two fundamentally different manners:

Delay-insensitive synchronization protocols (in the sense of delay-insensitive algorithms [15], sometimes also called self-timed or scheduling-independent) make no hypothesis on the real-time duration of the various computations or communications. Under this hypothesis, detecting the sending order of two events arriving on different communication media is a priori impossible, as is determining that a signal is absent in a reaction, because each computation or communication can take an arbitrary, unbounded time (another consequence of this property is the impossibility of consensus in faulty asynchronous systems [63]). Delay-insensitive synchronization protocols can only rely on the ordering of events imposed by the various

system components, such as the sequencing of message transmissions on each bus, or the sequencing of computations on each processor.

Delay-sensitive synchronization protocols allow the use of hypotheses on the real-time durations of the various computations and communications. In time-triggered systems [38], for instance, time-based synchronization is dominant, and great care must be taken to ensure that a global time reference is available, with good-enough precision and accuracy. Time-triggered delay-sensitive systems will be the focus of Chapter 5.

In the current chapter I consider the problem of constructing delay-insensitive implementations of deterministic synchronous specifications. Using such delay-insensitive protocols in the construction of embedded real-time systems can be useful in two circumstances:

- When building non-real-time or soft real-time systems where the emphasis is put on computational efficiency, rather than on the respect of real-time requirements. In such systems, tasks are often executed on a platform whose temporal behavior cannot be precisely predicted due to reasons such as an unknown number of cores, the use of a dynamic fair scheduler, or the unknown cost of system software (drivers and OS).
- When building hard real-time systems where the platform provides predictability guarantees, it may be useful to enforce a separation of concerns between functional correctness and real-time correctness issues in the design flow. A delay-insensitive functionally-correct implementation may first be used for functional simulations, before being provided as input to the allocation and scheduling phases that configure the execution platform. Such an approach is taken in SynDEx [76] and OCREP [41].

In both cases, the use of delay-insensitive synchronization provides guarantees of functional correctness independently from timing correctness. 
Possibly the most popular approach to building deterministic delay-insensitive concurrent systems is the one based on the Kahn principle, which provides the theoretical basis for building Kahn process networks (KPN) [91, 105, 123]. The Kahn principle states that interconnecting deterministic delay-insensitive components by means of deterministic delay-insensitive communication lines always results in a deterministic delay-insensitive system. This provides a solid two-stage methodology for building delay-insensitive systems. The first stage consists in building deterministic delay-insensitive components, which are then incrementally composed together in the second stage.

The work I present in this chapter has focused on applying this two-stage approach to the implementation of synchronous specifications. The problem I considered is that of building synchronous components (programs) that can function as deterministic delay-insensitive systems when the global clock synchronization is removed. The main difficulty here is transforming general synchronous components into delay-insensitive ones by adding minimal synchronization to their interfaces.
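The Kahn principle can be illustrated with a minimal sketch (illustrative code, not from the thesis): two deterministic stages connected by FIFOs produce the same output stream under any scheduling of their execution steps.

```python
from collections import deque

# Illustrative sketch of the Kahn principle: deterministic processes that only
# consume available input, connected by FIFOs, are schedule-independent.

def make_stage(fn, inq, outq):
    def step():                        # one execution step; idles if no input
        if inq:
            outq.append(fn(inq.popleft()))
    return step

def run(schedule):
    a, b, c = deque([1, 2, 3]), deque(), deque()
    stages = [make_stage(lambda x: 2 * x, a, b),
              make_stage(lambda x: x + 1, b, c)]
    for i in schedule:                 # the schedule picks which stage steps
        stages[i]()
    return list(c)

# Two very different interleavings of the two stages yield the same stream.
print(run([0, 0, 0, 1, 1, 1]))  # [3, 5, 7]
print(run([0, 1, 0, 1, 0, 1]))  # [3, 5, 7]
```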

3.1 Semantics of a simple example

I use a small, intuitive example to present the problem, the desired result, and the main implementation issues. The example, pictured in Fig. 3.1, is a reconfigurable filter (in this case a simple adder, but similar reasoning can be applied to other filters, such as the FFT). In this example, two independent single-word adders can be used either independently, or synchronized to form a double-word adder. The choice between synchronized and non-synchronized mode is done using the SYNC signal. The carry between the two adders is propagated through the Boolean wire C whenever SYNC is present. To simplify figures and notations, we group both integer inputs of ADD1 under I1, and both integer inputs of ADD2 under I2. This poses no problem because, from the synchronization perspective of this chapter, the two integer inputs of an adder have the same properties.

[Figure 3.1: Data-flow of a configurable adder. ADD1 computes O1 from I1, ADD2 computes O2 from I2, and the two are connected by the carry wire C, controlled by SYNC.]

Time is discrete, and executions are sequences of reactions, indexed by a global clock. Given a synchronous program, a reaction is a valuation of its input, output, and internal (local) signals. Fig. 3.2 gives a possible execution of our example. We shall denote by V(P) the finite set of signals of a program P. We shall distinguish inside V(P) the disjoint sub-sets of input and output signals, respectively denoted I(P) and O(P).

[Figure 3.2: A synchronous run of the adder.]

If we denote by EXAMPLE our configurable adder, then

V(EXAMPLE) = {I1, I2, SYNC, O1, O2, C}
I(EXAMPLE) = {I1, I2, SYNC}
O(EXAMPLE) = {O1, O2}

All signals are typed. We denote by D_S the domain (set of possible values) of a signal S. Not all signals need to have a value in a reaction, to model cases where only parts of the program compute. We will say that a signal is present in a reaction when it has a value in D_S. Otherwise, we say that it is absent. Absence is simply represented with a special value ⊥, which is appended to all domains: D_S^⊥ = D_S ∪ {⊥}. Formally, a reaction of a program P is a valuation of all the signals S of V(P) into their extended domains D_S^⊥. We denote by R(P) the set of all reactions of P. Given a set of signals V, we denote by R(V) the set of all possible valuations of the signals in V. Obviously, R(P) ⊆ R(V(P)). In a reaction r of a program P, we distinguish the input event, which is the restriction r|I(P) of r to the input signals, and the output event, which is the restriction r|O(P) to the output signals.

In many cases we are only interested in the presence or absence of a signal, because it transmits no data, just synchronization (or because we are only interested in synchronization aspects). To represent such signals, the Signal language [77] uses a dedicated type named event, whose domain contains a single value: D_event = {•}. We follow the same convention: in our example, SYNC has type event. The types of the other signals in Fig. 3.2 are SYNC: event; O1, O2: integer; I1, I2: integer pair; C: boolean. To represent reactions, we use a set-like convention and omit signals with value ⊥. Thus, reaction 4 is denoted (I1 = (9,9), O1 = 18, I2 = (0,0), O2 = 0).

3.2 Problem definition

We consider a synchronous program, and we want to execute it in an asynchronous environment where inputs arrive and outputs depart via asynchronous FIFO channels with uncontrolled (unbounded, but finite) communication latencies. 
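The representation of reactions with absence introduced in Section 3.1 can be sketched as follows (illustrative Python, with None standing for ⊥):

```python
# Illustrative sketch: a reaction maps each signal of V(P) to a value or to
# None (absence). Restriction to the inputs or outputs yields the input and
# output events; the set-like notation omits absent signals.

ABSENT = None
V = ["I1", "I2", "SYNC", "O1", "O2", "C"]
I = {"I1", "I2", "SYNC"}
O = {"O1", "O2"}

def reaction(**present):
    r = {s: ABSENT for s in V}
    r.update(present)
    return r

def restrict(r, signals):
    return {s: v for s, v in r.items() if s in signals and v is not ABSENT}

# Reaction 4 of the synchronous run: both adders compute, SYNC and C absent.
r4 = reaction(I1=(9, 9), O1=18, I2=(0, 0), O2=0)
print(restrict(r4, I))  # {'I1': (9, 9), 'I2': (0, 0)}
print(restrict(r4, O))  # {'O1': 18, 'O2': 0}
```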
To simplify, we assume that we have exactly one channel for each input and output signal of the program. We also assume a very simple correspondence between messages on channels and signal values: each message on a channel corresponds to exactly one value (not absence) of a signal in a reaction. No message represents absence. The execution machine driving the synchronous program in the asynchronous environment cyclically performs the following 3 steps:

1. assembling asynchronous input messages arriving onto the input channels into a synchronous input event acceptable by the program,
2. triggering a reaction of the program for the reconstructed input event, and

3. transforming the output event of the reaction into messages onto the output asynchronous channels.

[Figure 3.3: GALS wrapper driving the execution of a synchronous program in an asynchronous environment. An asynchronous FSM (I/O control, buffering, clock triggering) connects the asynchronous input and output FIFO channels to the synchronous process and its clock.]

Fig. 3.3 provides the general form of such an execution machine, which is basically a wrapper transforming the synchronous program into a globally asynchronous, locally synchronous (GALS) [44] component that can be used in a larger GALS system. The actual form of the asynchronous finite state machine (AFSM) implementing the execution machine, and the form of the code implementing the synchronous program, depend on a variety of factors, such as the desired implementation (software or hardware), the asynchronous signaling used by the input and output FIFOs, the properties of the synchronous program, etc.

In order to achieve deterministic execution,³ the main difficulty lies in step (1) above, as it involves the potential reconstruction of signal absence, whereas absence is meaningless in the chosen asynchronous framework. Reconstructing reactions from asynchronous messages must be done in a way that ensures global determinism, regardless of the message arrival order. This is not always possible. Assume, as in Fig. 3.4, that we consider the inputs and outputs of Fig. 3.2 without synchronization information.

[Figure 3.4: Corresponding asynchronous run of our example. No synchronization exists between the various signals, so that correctly reconstructing synchronous inputs from the asynchronous ones is impossible.]

The adder ADD1 will then receive the first value (1,2) on the input channel I1 and a first message on SYNC. 
³ Like in [125], determinism can be relaxed here to predictability: the fact that the environment is always informed of the choices made inside the program.

Depending on the arrival order, which cannot be determined, any

of the reactions (I1 = (1,2), O1 = 3, SYNC = •, C = 0) or (I1 = (1,2), O1 = 3) can be executed by ADD1, leading to divergent computations. The problem is that these two reactions are not independent, but no value of a given channel allows one to be differentiated from the other (so one cannot deterministically choose between them in an asynchronous environment). Deterministic input event reconstruction is therefore impossible for some synchronous programs. Therefore, a methodology to implement synchronous programs on an asynchronous architecture must rely on the (implicit or explicit) identification of some class of programs for which reconstruction is possible. Then, giving a deterministic asynchronous implementation to any given synchronous program is done in two steps:

Step 1. Transforming the initial program, through added synchronizations and/or signals, so that it belongs to the implementable class.
Step 2. Generating an implementation for the transformed program.

The choice of the class of implementable programs is therefore essential. On one hand, choosing a small class can highly simplify analysis and code generation in step (2). On the other, small classes of programs result in heavier synchronization added to the programs in step (1). Our choice, justified in the next section, is the class of weakly endochronous programs.

Previous work

Aside from weak endochrony, the most developed notions identifying classes of implementable programs are the latency-insensitive systems of Carloni et al. [39] and the endochronous systems of Benveniste et al. [17, 77]. Latency-insensitive systems are those featuring no signal absence. Transforming processes featuring absence, such as our example of Figures 3.1 and 3.2, into latency-insensitive ones amounts to adding supplementary Boolean signals that transmit at each reaction the status of every other signal. 
This is easy to check and implement, but often results in an unneeded communication overhead due to messages that need to be sent at each reaction. Several variations and hardware implementations of the theory have been proposed, of which we mention here only the one by Vijayaraghavan and Arvind [149].

The endochronous systems and the related hardware-centric generalized latency-insensitive systems [145] are those where the presence and absence of all signals can be incrementally inferred starting from the state and from signals that are always present. For instance, Fig. 3.5 presents a run of an endochronous program obtained by transforming the SYNC signal of our example into one that carries values from 0 to 3: 0 for ADD1 executing alone, 1 for ADD2 executing alone, 2 for both adders executing without communicating (C absent), and 3 for the synchronized execution of the two adders (C present). Note that the value of SYNC determines the presence/absence of all signals.

Checking endochrony consists in ordering the signals of the process in a tree representing the incremental process used to infer signal presence (the signals

that are always read are all placed in the tree root).

[Figure 3.5: Endochronous solution — a run in which the value of SYNC (0 to 3) determines the presence/absence of all other signals.]

The compilation of the Signal/Polychrony language is currently founded on a version of endochrony [4]. The endochronous reaction reconstruction process is fully deterministic, and the presence of all signals is synchronized with respect to some base signal(s) in a hierarchic fashion. This means that no concurrency remains between the subprograms of an endochronous program. For instance, in the endochronous model of our adder, the behavior of the two adders is synchronized at all instants by the SYNC signal (whereas in the initial model the adders can function independently whenever SYNC is absent). As a consequence, using endochrony as the basis for the development of systems with internal concurrency has two drawbacks:

- Endochrony is non-compositional (synchronization code must be added even when composing programs sharing no signal).
- Specifications and implementations/simulations are often over-synchronized.

3.3 Contribution

Definition of weak endochrony

My first contribution here was the notion of weak endochrony, defined in collaboration with B. Caillaud and A. Benveniste [128, 127]. Weak endochrony generalizes endochrony by allowing both synchronized and non-synchronized (independent) computations to be realized by a given program. Weak endochrony ensures that compound reactions that are apparently synchronous can be split into smaller independent reactions that are asynchronously feasible in a confluent way, so that executing the first does not discard the second. Fig. 3.6 presents a run of a weakly endochronous system obtained by replacing the SYNC signal of our example with two input signals:

- SYNC1, of Boolean type, is received at each execution of ADD1.
It has value 0 to notify that no synchronization is necessary, and value 1 to notify that synchronization is necessary and the carry signal C must be produced.

- SYNC2, of Boolean type, is received at each execution of ADD2. It has value 0 to notify that no synchronization is necessary, and value 1 to notify that synchronization is necessary and the carry signal C must be read.

The two adders are synchronized when SYNC1=1 and SYNC2=1, corresponding to the cases where SYNC is present in the original design. However, the adders function independently elsewhere (between synchronization points).

[Figure 3.6: Weakly endochronous solution.]

From a practical point of view, weak endochrony supports less synchronized, concurrent GALS implementations. While the implementation of latency-insensitive and endochronous synchronous programs is strictly bound by the scheme of Fig. 3.3, wrappers of weakly endochronous programs may exploit the concurrency of the specification by directly and concurrently activating various parts of a program. In the context of the example of Fig. 3.6, the GALS wrapper may consist of an AFSM that can independently activate the two adders, the activations being synchronized only when SYNC1=SYNC2=1.

Weak endochrony provides an important theoretical tool in the analysis of concurrent synchronous systems. It generalizes the theory of Mazurkiewicz traces [54] to a synchronous setting [128]. Although it deals with signal values (note 4), (weak) endochrony is in essence strongly related to the notion of conflict-freeness, first introduced in the context of Petri Nets, which simply states that once enabled, an action cannot be disabled, and must eventually be executed. Conflict-free variants of declarative data-flow formalisms form the area of process networks, such as Kahn Process Networks [91], and of various so-called domains of the Ptolemy environment, such as SDF process networks [30].
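The conflict-freeness idea can be illustrated with a toy model: when two reactions share no channel, firing them in either order yields the same asynchronously observable state. This is a hypothetical sketch of the intuition only, not the formal definition of (weak) endochrony; the reaction contents are invented to echo the adder example.

```python
# Toy illustration of conflict-freeness / the diamond property:
# two reactions that share no channel can fire in either order
# and reach the same state. Hypothetical sketch, not the formal
# definition of weak endochrony.

def fire(state, reaction):
    """Apply a reaction (channel -> emitted value) to a state
    mapping each channel to the list of values seen so far."""
    new_state = {ch: vals[:] for ch, vals in state.items()}
    for ch, val in reaction.items():
        new_state.setdefault(ch, []).append(val)
    return new_state

def independent(r1, r2):
    """Reactions are independent when they share no channel."""
    return not (set(r1) & set(r2))

r_add1 = {"I1": (1, 2), "O1": 3}   # ADD1 computing alone
r_add2 = {"I2": (0, 0), "O2": 0}   # ADD2 computing alone

assert independent(r_add1, r_add2)
# Diamond: both interleavings converge to the same state.
assert fire(fire({}, r_add1), r_add2) == fire(fire({}, r_add2), r_add1)
```

When two reactions do share a channel, `independent` returns False and no commutation is claimed; the diamond is only required between reactions that are actually independent.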
Conflict-freeness is also called confluence (the "diamond property") in process algebra theory [110], and monotony in Kahn Process Networks.

Note 4: The signal values may be used to decide which further signals are to be causally present next in the reaction.

Characterization of delay-insensitive synchronous components

While weak endochrony provided a sufficient property ensuring delay-insensitivity, exactly characterizing the class of synchronous components that can function as

delay-insensitive deterministic systems remained an open problem. I have characterized this class [120, 129]. The characterization is given by a simple diamond closure property, very close to weak endochrony. This simple characterization is important because:

- It offers a good basis for the development of optimal synchronization protocols.
- It corresponds to a very general execution mechanism covering current practice in embedded system design. Thus, it fixes theoretical limits to what can be done in practice.

Synthesis of delay-insensitive concurrent implementations

In [134, 131] I proposed a method to check weak endochrony on multi-clock synchronous programs. This method is based on the construction of so-called generator sets. Generators are minimal synchronization patterns of a program, and the set of all generators provides a representation of all synchronization configurations of a synchronous program. I have proposed a compact representation for generator sets and algorithms for their modular construction for Signal/Polychrony [77] programs, starting from those of the language primitives.

The generator set of a program can be analyzed to determine whether the specification is weakly endochronous. If it is not, the generator sets can be used to produce intuitive error messages helping the programmer add supplementary synchronization to the program interface (or minimal synchronization can be automatically synthesized). The algorithms have been implemented in a tool connected to the Signal/Polychrony toolset. If the specification is weakly endochronous, we provided a technique for building multi-threaded and distributed implementation code [118]. This technique takes advantage of the structure of the generator set to limit the number of threads, and thus reduce scheduler overhead.


Chapter 4

Reconciling performance and predictability on a many-core

The previous chapter described work on a particular aspect of the embedded implementation flow: the synthesis of efficient synchronization protocols. I will now take the first step towards considering the whole complexity of synthesizing full real-time embedded implementations. For this first step, I consider a mapping and code generation approach that is very similar to that of classical compilation. As in a classical compiler such as GCC, our scheduling routines have as their objective the optimization of simple metrics, such as reaction latency or throughput. Furthermore, scheduling always succeeds if execution on the platform is functionally possible, because no real-time requirements are taken into account. But there are also significant differences with respect to classical compilers:

- Our systems compiler performs precise and conservative timing accounting, which allows it to provide tight, hard real-time guarantees on the generated code. Such guarantees can then be checked against real-time requirements.
- The optimization metrics may be different from those used in classical compilation.
- Architecture descriptions used by the scheduling algorithms (and, as a consequence, the architecture-dependent scheduling heuristics) are significantly different.

These differences mean that my work had to focus on two main aspects:

- Modeling the execution platform and ensuring that the platform models are conservative abstractions of the actual execution platform. To provide

tight and hard real-time guarantees, these models must be precise and include timing aspects (note 1).

- The definition of the scheduling and code generation algorithms.

As a test case I use here a many-core platform, showing how its architectural detail can be efficiently taken into account, which is an important subject per se.

4.1 Motivation

One crucial problem in real-time scheduling is that of ensuring that the application software and the implementation platform satisfy the hypotheses allowing the application of specific schedulability analysis techniques [102]. Examples of such hypotheses are the availability of a priority-driven scheduler or the possibility of including scheduler-related costs in the durations of the tasks. Classical work on real-time scheduling [52, 70, 66] has proposed formal models allowing schedulability analysis in classical mono-processor and distributed settings. But the advent of multiprocessor system-on-chip (MPSoC) architectures imposes significant changes to these models and to the scheduling techniques using them.

MPSoCs are becoming prevalent in both general-purpose and embedded systems. Their adoption is driven by scalable performance arguments (concerning speed, power, etc.), but this scalability comes at the price of increased complexity of both the software and the software mapping (allocation and scheduling) process. Part of this complexity can be attributed to the steady increase in the quantity of software that is run by a single system. But there are also significant qualitative changes concerning both the software and the hardware. In software, more and more applications include parallel versions of classical signal or image processing algorithms [150, 9, 69], which are best modeled using data-flow models (as opposed to independent tasks).
Providing functional and real-time correctness guarantees for parallel code requires accurate control of the interferences due to concurrent use of communication resources. Depending on the hardware and software architecture, this can be very difficult [154, 87]. Significant changes also concern the execution platforms, where the gains predicted by Moore's law no longer translate into improved single-processor performance, but into a rapid increase in the number of processor cores placed on a single chip [26]. This trend is best illustrated by the massively parallel processor arrays (MPPAs), which are the object of the work presented in this chapter. MPPAs are MPSoCs characterized by:

- Large numbers of processing cores, ranging in current silicon implementations from a few tens to a few hundreds [148, 113, 1, 62]. The cores are typically chosen for their area or energy efficiency instead of raw computing power.

Note 1: These models can be seen as the equivalent of the ISAs, ABIs, and APIs used in classical compilation.

- A regular internal structure where processor cores are divided among a set of identical tiles, which are connected through one or more NoCs with a regular structure (e.g., torus, mesh).

Industrial [148, 113, 1, 62] and academic [71, 23, 144] MPPA architectures targeting hard real-time applications already exist, but the problem of mapping applications onto them remains largely open. There are two main reasons for this. The first concerns the NoCs: as the tasks become more tightly coupled and the number of resources in the system increases, the on-chip networks become critical resources, which need to be explicitly considered and managed during real-time scheduling. Recent work [144, 93, 115] has determined that NoCs have distinctive traits requiring significant changes to classical real-time scheduling theory [70]. In particular, they have large numbers of potential contention points, limited buffering capacities requiring synchronized resource reservation, and network control that often operates at the level of small data packets. The second reason concerns automation: the complexity of MPPAs and of the (parallel) applications mapped on them is such that allocation and scheduling must be largely automated.

Unlike previous work on the subject, I have addressed these needs by relying on static (off-line) scheduling approaches. In theory, off-line algorithms allow the computation of scheduling tables specifying an optimal allocation and real-time scheduling of the various computations and communications onto the resources of the MPPA. In practice, this ability is severely limited by three factors:

1. The application may exhibit a high degree of dynamicity, due either to environment variability or to execution time variability resulting from data-dependent conditional control (note 2).

2. The hardware may not allow the implementation of optimal scheduling tables.
For instance, most MPPA architectures provide only limited control over the scheduling of communications inside the NoC.

3. The mapping problems I consider are NP-hard. In practice, this means that optimality cannot be attained, and that efficient heuristics are needed.

Clearly, not all applications can benefit from off-line scheduling. But this paradigm is well adapted to our target application class: (parallelized versions of) periodic embedded control and signal processing applications. Our work shows that for such applications, off-line scheduling techniques attain both high timing predictability and high performance. But reconciling performance and predictability (two properties often seen as antagonistic) was only possible by considering all the hardware, software, and mapping aspects of the design flow with a unifying view. My contributions concern all these aspects, which will be covered one by one in the following sections.

Note 2: Implementing an optimal control scheme for such an application may require more resources than the application itself, which is why dynamic/on-line scheduling techniques are often preferred.

On the hardware side, an in-depth review of

MPPA/NoC architectures with support for real-time scheduling allowed us to determine that NoCs allowing static communication scheduling offer the best support for off-line application mapping (and I have participated in the definition of such a NoC and of an MPPA based on it). I have also proposed a software organization that improves timing predictability. These hardware and software advances allowed the definition of a technique for the WCET analysis of parallel code. But my main effort went into defining a new technique and tool, called LoPhT, for automatic real-time mapping and code generation, whose global flow is pictured in Fig. 4.1.

[Figure 4.1: Global flow of the proposed mapping technique for many-cores — functional (data-flow synchronous) and non-functional specifications, together with a platform model (NoC resources; CPUs, DMAs, RAMs; WCET/WCCT timings), are fed to the LoPhT tool, which performs automatic mapping (allocation + scheduling) into a global scheduling table covering CPUs and interconnect, followed by code generation into a real-time application (CPU programs + NoC programs) with timing guarantees.]

Our tool takes as input data-flow synchronous specifications and precise hardware descriptions including all potential NoC contention points. It uses advanced off-line scheduling techniques such as software pipelining and pre-computed preemption, and it takes into account the specificities of the MPPA hardware to build scheduling tables that provide good latency and throughput guarantees and ensure an efficient use of computation and communication resources. Scheduling tables are then automatically converted into sequential code ensuring the correct ordering of operations on each resource and the respect of the real-time guarantees. Experimental evaluations show that the off-line mapping of communications not only allows us to provide static latency and throughput guarantees, but may also improve the speed of the application, when compared to (simple) hand-written parallel code.
Through these results, I have shown that taking into account the fine detail of the hardware and software architecture of a many-core is possible, and allows off-line mapping of very good precision, even when low-complexity scheduling heuristics are used in order to ensure the scalability of the approach. This is similar to what compilation did to replace manual assembly coding.
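To make the last step of this flow concrete, here is a hypothetical miniature of scheduling-table-based code generation: a global table assigns time-stamped operations to resources (CPUs, NoC multiplexers), and code generation flattens it into one sequential program per resource, preserving start-time order. All names and table entries are invented for illustration; LoPhT itself works on far richer models.

```python
# Hypothetical miniature of code generation from a global
# scheduling table: each (start_time, resource, operation) entry
# is emitted into the sequential program of its resource, in
# start-time order. Names and entries are invented.

schedule_table = [
    (0,  "CPU0",        "run task A"),
    (0,  "CPU1",        "run task B"),
    (10, "NoC_mux_1_E", "forward packet A->B"),
    (15, "CPU1",        "run task C"),
]

def generate_programs(table):
    programs = {}
    for start, resource, operation in sorted(table):
        programs.setdefault(resource, []).append(operation)
    return programs

programs = generate_programs(schedule_table)
assert programs["CPU1"] == ["run task B", "run task C"]
```

In this sketch, the real-time guarantee comes from the table itself: each operation's start time was validated off-line, so the per-resource programs only need to preserve the order.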

4.2 MPPA/NoC architectures for the real-time

This section starts with a general introduction to MPPA platforms, and then presents the main characteristics of existing NoCs, with a focus on flow management mechanisms supporting real-time implementation.

Structure of an MPPA

My work concerns hard real-time systems where timing guarantees must be determined by static analysis methods before system execution. Complex memory hierarchies involving multiple cache levels and cache coherency mechanisms are known to complicate timing analysis [154, 80], and I assume they are not used in the MPPA platforms I consider. Under this hypothesis, all data transfers between tiles are performed through one or more NoCs. A NoC can be described in terms of point-to-point communication links and NoC routers, which perform the routing and scheduling (arbitration) functions. Fig. 4.2 describes a 2-dimensional (2D) mesh NoC like the ones used in the Adapteva Epiphany [1], Tilera TilePro [148], or DSPIN [117].

[Figure 4.2: An MPPA platform with 4x3 tiles connected with a 2D mesh NoC. Black rectangles are the NoC routers. Tile coordinates are in (Y,X) order.]

The structure of a router in a 2D mesh NoC is described in Fig. 4.3. It has 5 connections (labeled North, South, West, East, and Local) to the 4 routers next to it and to the local tile. Each connection is formed of a routing component, which we call a demultiplexer (labeled D in Fig. 4.3), and a scheduling/arbitration component, which we call a multiplexer (labeled M). Data enters the router through demultiplexers and exits through multiplexers. To allow on-chip implementation at a reasonable area cost, NoCs use simple routing algorithms.
For instance, the Adapteva [1], Tilera [148], and DSPIN NoCs [117] use an X-first routing algorithm where data first travels all the way in the X direction, and only then in the Y direction. Furthermore, all NoCs mentioned in this chapter use simple wormhole switching approaches [114] requiring that all data of a communication unit (e.g. packet) follow the same route, in order, and that the communication unit is not logically split during transmission.
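The X-first routing rule is simple enough to sketch. The function below is an illustrative reconstruction (not the actual hardware logic), using the (Y,X) coordinate convention of Fig. 4.2:

```python
# Illustrative X-first (XY) routing on a 2D mesh: data travels all
# the way in the X direction first, and only then in the Y
# direction. (Y,X) coordinate order, as in Fig. 4.2. This is a
# sketch of the routing rule, not the actual router hardware.

def x_first_route(src, dst):
    """Return the list of tile coordinates visited from src to dst."""
    (sy, sx), (dy, dx) = src, dst
    path = [(sy, sx)]
    y, x = sy, sx
    while x != dx:                      # first, travel along X
        x += 1 if dx > x else -1
        path.append((y, x))
    while y != dy:                      # only then, travel along Y
        y += 1 if dy > y else -1
        path.append((y, x))
    return path

# From tile (0,1) to tile (2,3): two X hops, then two Y hops.
assert x_first_route((0, 1), (2, 3)) == [(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)]
```

Because the route is a pure function of source and destination, an off-line analysis can statically enumerate the links and routers each communication traverses, which is what makes contention accounting possible.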

[Figure 4.3: Generic router for a 2D mesh NoC with X-first routing policy.]

The use of a wormhole switching approach is justified by the limited buffering capabilities of NoCs [115] and by the possibility of decreasing transmission latencies (by comparison with more classical store-and-forward approaches). But the use of wormhole switching means that one data transmission unit (such as a packet) is seldom stored in a single router buffer. Instead, a packet usually spans several routers, so that its transmission strongly synchronizes multiplexers and demultiplexers along its path.

Support for real-time implementation

Given the large number of potential contention points (router multiplexers), and the synchronizations induced by data transmissions, providing tight static timing guarantees is only possible if some form of flow control mechanism is used.

In NoCs based on circuit switching [88], inter-tile communications are performed through dedicated communication channels formed of point-to-point physical links. Two channels cannot share a physical link. This is achieved by statically fixing the output direction of each demultiplexer and the data source of each multiplexer along the channel path. Timing interferences between channels are impossible, which radically simplifies timing analysis, and the latency of communications is low. But the absence of resource sharing is also the main drawback of circuit switching, resulting in low numbers of possible communication channels and low utilization of the NoC resources. Reconfiguration is usually possible, but it carries a large timing penalty. Virtual circuit switching is an evolution of circuit switching which allows resource sharing between circuits. But resource sharing implies the need for arbitration mechanisms inside the NoC multiplexers.
Very interesting from the point of view of timing predictability are NoCs where arbitration is based on time division multiplexing (TDM), such as Aethereal [71], Nostrum [109], and others [147]. In a TDM NoC, all routers share a common time base. The point-to-point links are reserved for the use of the virtual circuits following a fixed cyclic schedule (a scheduling table). The reservations made on the various links ensure that communications can follow their path without waiting. TDM-based NoCs allow the computation of precise latency and throughput guarantees. They also ensure strong temporal isolation between virtual circuits, so that changes to one virtual circuit do not modify the real-time characteristics of the others. When no global time base exists, the same type of latency and throughput guarantees can be obtained in NoCs relying on bandwidth management mechanisms, such as the Kalray MPPA [113, 83]. The idea here is to ensure that the throughput of each virtual circuit is limited to a fraction of the transmission capacity of a physical point-to-point link, either by the emitting tile or by the NoC routers. Two or more virtual circuits can share a point-to-point link if their combined transmission needs are less than what the physical link provides.

But TDM and bandwidth management NoCs have certain limitations. One of them is that latency and throughput are correlated [144], which may result in a waste of resources. But the latency-throughput correlation is just one consequence of a more profound limitation: TDM and bandwidth management NoCs largely ignore the fact that the needs of an application may change during execution, depending on its state. For instance, when scheduling a data-flow synchronous graph with the objective of reducing the duration of one computation cycle (also known as makespan or latency), it is often useful to allow some communications to use 100% of the physical link, so that they complete faster, before allowing all other communications to be performed. One way of taking the application state into account is to use NoCs with support for priority-based scheduling [144, 117, 62].
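The TDM reservation principle described earlier can be sketched with a toy slot table: each link cyclically grants its time slots to virtual circuits, and a flit of a circuit may cross a link only in one of its reserved slots. The slot assignments below are invented for illustration.

```python
# Toy TDM link arbitration: every link owns a cyclic slot table of
# period TDM_PERIOD; a virtual circuit may use the link at time t
# only if slot (t mod TDM_PERIOD) is reserved for it. Slot
# assignments are invented; None marks an idle slot.

TDM_PERIOD = 4
slot_table = {
    "link_W_to_E": ["vc1", "vc2", "vc1", None],
    "link_E_to_L": ["vc2", None, "vc2", "vc1"],
}

def may_transmit(link, circuit, t):
    return slot_table[link][t % TDM_PERIOD] == circuit

assert may_transmit("link_W_to_E", "vc1", 0)
assert not may_transmit("link_W_to_E", "vc1", 1)
assert may_transmit("link_W_to_E", "vc1", 6)   # slot 6 mod 4 == 2
```

A circuit holding k of the P slots on every link of its path gets a guaranteed k/P fraction of the link bandwidth, but a flit that misses its slot must wait for the next reserved one, which illustrates the latency-throughput correlation discussed in the text.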
In these NoCs, each data packet is assigned a priority level (a small integer), and NoC routers allow higher-priority packets to pass before lower-priority packets. To avoid priority inversion phenomena, higher-priority packets have the right to preempt the transmission of lower-priority ones. In turn, this requires one separate buffer for each priority level in each router multiplexer, a mechanism known as virtual channels (VCs) in the NoC community [117]. The need for VCs is the main limiting factor of priority-based arbitration in NoCs. Indeed, adding a VC is as complex as adding a whole new NoC [157, 33], and NoC resources (especially buffers) are expensive in both power consumption and area [112]. Among existing silicon implementations, only the Intel SCC chip offers a relatively large number of VCs (eight) [62], and it is targeted at high-performance computing applications. Industrial MPPA chips targeting an embedded market usually feature multiple specialized NoCs [148, 1, 83] without virtual channels. Other NoC architectures feature low numbers of VCs. Current research on priority-based communication scheduling has already integrated this limitation by investigating the sharing of priority levels [144]. Significant work already exists on the mapping of real-time applications onto priority-based NoCs [144, 138, 115, 93]. This work has shown that priority-based NoCs support the efficient mapping of independent tasks. But we already explained that the large number of computing cores in an MPPA means that applications are also likely to include parallelized code, which

is best modeled by large sets of relatively small dependent tasks (data-flow graphs) with predictable functional and temporal behavior [150, 9, 69]. Such timing-predictable specifications are those that can a priori take advantage of a static scheduling approach, which provides the best results on architectures with support for static communication scheduling [148, 61, 55]. Such architectures allow the construction of an efficient (possibly optimal) global computation and communication schedule, represented as a scheduling table and implemented as a set of synchronized sequential computation and communication programs. Computation programs run on processor cores to sequence task executions and the initiation of communications. Communication programs run on specially-designed micro-controllers that control each NoC multiplexer to fix the order in which individual data packets are transmitted. Synchronization between the programs is ensured by the data packet communications themselves. As in TDM NoCs, the use of global computation and communication scheduling tables allows the computation of very precise latency and throughput estimations. Unlike in TDM NoCs, NoC resource reservations can depend on the application state. Global time synchronization is not needed, and existing NoCs based on static communication scheduling do not use it [148, 61, 55]. Instead, global synchronization is realized by the data transmissions themselves (which eliminates some of the run-time pessimism of TDM-based approaches). The micro-controllers that drive each NoC router multiplexer are similar in structure to those used in TDM NoCs to enforce the TDM reservation pattern. The main difference is that the communication programs are usually longer than the TDM configurations, because they must cover longer execution patterns. This requires larger program memory (which can be seen as part of the tile program memory [55]).
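The behavior of such a multiplexer micro-controller can be sketched as the replay of a fixed grant sequence, with synchronization coming from the packets themselves (the multiplexer waits for each expected packet before forwarding it). This is a hypothetical sketch of the principle, not the programming model of any specific NoC.

```python
# Hypothetical sketch of a static communication program driving
# one NoC router multiplexer: a fixed sequence of
# (input_port, packet_count) grants is replayed in order. In
# hardware the multiplexer would stall until each expected packet
# arrives; in this sketch packets are assumed already queued.

from collections import deque

def run_mux_program(program, input_queues, output_link):
    for port, packet_count in program:
        for _ in range(packet_count):
            output_link.append(input_queues[port].popleft())

queues = {"North": deque(["n1", "n2"]), "Local": deque(["l1"])}
link = []
run_mux_program([("Local", 1), ("North", 2)], queues, link)
assert link == ["l1", "n1", "n2"]
```

Note how the output order is entirely fixed off-line; the only run-time freedom is *when* each packet arrives, which is exactly why precise latency guarantees can be computed from the table.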
But as in TDM NoCs, buffering needs are limited and no virtual channel mechanism is needed, which results in lower power consumption. From a mapping-oriented point of view, determining exact packet transmission orders cannot be separated from the larger problem of building a global scheduling table comprising both computations and communications. By comparison, mapping onto MPPAs with TDM-based or bandwidth-reservation-based NoCs usually separates task allocation and scheduling from the synthesis of a NoC configuration that is independent of the application state [104, 9, 160]. Under static communication scheduling, there is little run-time flexibility, as all scheduling possibilities must be considered during the off-line construction of the global scheduling table. For very dynamic applications this can be difficult. This is why existing MPPA architectures that allow static communication scheduling also allow communications with dynamic (Round-Robin) arbitration.

In conclusion, NoCs allowing static communication scheduling offer the best temporal precision in the off-line mapping of dependent tasks (data-flow graphs), while priority-based NoCs are better at dealing with more dynamic applications. As future systems will include both statically parallelized code and more dynamic aspects, NoCs should include mechanisms supporting both off-line and on-line communication scheduling. Significant work already exists on real-time mapping for priority-based platforms, while little work has addressed the

NoCs with static communication scheduling, the subject I addressed with my PhD students and post-docs and detailed in this chapter.

MPPA platform for off-line scheduling

Tiled many-cores in SoCLib

The SoCLib virtual prototyping library [101] allows the definition of tiled many-cores following a distributed shared memory paradigm where all memory banks and component programming interfaces are assigned unique addresses in a global address space. All memory transfers and component programming operations are represented as memory accesses organized as command/response transactions according to the VCI/OCP protocol [151]. To avoid interferences between commands and responses (which can lead to deadlocks), the on-chip interconnect is organized in two completely disjoint sub-networks, one for transmitting commands and the other for responses. There are two types of transactions: read and write. In write transactions, the command sub-network carries the data to be written and the target address, and the response sub-network carries a return code. In read transactions, the command sub-network carries the address and size of the requested data, and the response sub-network carries the data.

The tiled many-cores of SoCLib are formed of a rectangular set of tiles connected through a state-of-the-art 2D synchronous mesh network-on-chip (NoC) called DSPIN [117]. The NoC is formed of a command NoC and a response NoC, which are fully separated. Note that the representation of Fig. 4.2 is incomplete, as it only represents one of the two NoCs. However, we shall see in the following sections that it is sufficient to allow scheduling under certain hypotheses. Each tile has its own command and response local interconnects, linked to the NoCs and to the IP cores of the tile (CPUs, RAMs, DMAs, etc.).
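The read/write transaction scheme can be modeled abstractly as follows. This is a hypothetical sketch of the command/response split only, far above the signal-level detail of the actual VCI/OCP protocol.

```python
# Abstract sketch of command/response transactions in the VCI/OCP
# style: write commands carry address + data and get back a return
# code; read commands carry address + size and get back the data.
# Hypothetical model, not the signal-level VCI/OCP protocol.

memory = {}

def serve_command(cmd):
    """Target side: execute one command, produce its response."""
    if cmd["op"] == "write":
        memory[cmd["addr"]] = cmd["data"]
        return {"status": "ok"}                 # response: return code
    if cmd["op"] == "read":
        # The 'size' field is carried but ignored in this sketch.
        return {"status": "ok", "data": memory.get(cmd["addr"], 0)}

resp = serve_command({"op": "write", "addr": 0x1000, "data": 42})
assert resp == {"status": "ok"}
resp = serve_command({"op": "read", "addr": 0x1000, "size": 4})
assert resp["data"] == 42
```

Keeping commands and responses on disjoint sub-networks, as DSPIN does, guarantees that the two message kinds can never block each other, which is what rules out the interference deadlocks mentioned in the text.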
Modifications of the tile structure

To improve timing predictability and worst-case performance, we modify both the tiles and the NoC of the SoCLib-based many-core. We retain the global organization of the many-core, and in particular its distributed shared memory model, which allows programming using general-purpose tools. Fig. 4.4 pictures the structure of the computing tile in the original SoCLib many-core, and Fig. 4.5 its modified version.

The memory subsystem. Our objective here is to improve timing predictability by eliminating contentions. In our experiments with the original SoCLib-based many-core, the second most important source of contentions (after the NoC) is the access to the unique RAM bank of each tile. To reduce these contentions, we decided to follow the example of existing industrial many-core architectures [108, 83] and replace the single RAM bank of a tile with several memory banks that can be accessed independently.

[Figure 4.4: The computing tile in the original DSPIN-based many-core architecture — a single-bank RAM, an interrupt unit, CPUs (MIPS32) with PLRU write-back program and data caches, a DMA, an optional I/O, and a NIC, all connected through the local interconnect to the command and response routers.]

[Figure 4.5: Modified computing tile of our architecture — a multi-bank data RAM, a program RAM/ROM, a lock unit, router controllers with their program RAM, CPUs (MIPS32) with LRU write-through program and data caches, a buffered DMA, and an optional I/O, connected through a full-crossbar local interconnect to the command and response routers.]

To facilitate timing analysis, we separate data (including stack) and program memory. One RAM bank is used in each tile to store the program of all the CPUs of the tile. Data and stack are stored in a multi-bank RAM. Each bank of the data RAM has a separate connection to the local interconnect. The RAM banks of a tile are assigned contiguous address ranges, so that they can store data structures larger than a single bank. Explicit allocation of data onto the memory banks, along with the use of lock-based synchronization and the topology of the local interconnect presented below, allows the elimination of contentions due to concurrent access to memory banks, by ensuring that no data bank is accessed from two sources (CPUs or DMAs) at the same time. Note that the use of a multi-bank data RAM also removes a significant performance bottleneck of the original architecture. Indeed, a single RAM bank can only serve 4 CPUs. Experimental data shows that placing more than 4

44 4.2. MPPA/NOC ARCHITECTURES FOR THE REAL-TIME 41 CPUs per tile results in no performance gain because the RAM access is saturated. Having multiple RAM banks per tile removes this limitation. Our test configurations use a maximum of 16 CPU cores per tile and two data RAM banks per CPU core, for a maximum of 4Mbytes of RAM per tile. The local interconnect is chosen in our design so that it cannot introduce contentions due to its internal organization. Contentions can still happen, for instance, when two CPUs access concurrently the program memory. However, accesses from different sources to different targets never introduce contentions. Interconnect types allowing this are the full crossbars and the multi-stage interconnection network [12] such as the omega networks, the delta networks, or the related logarithmic interconnect [92]. My experiments used a full crossbar interconnect. The CPU core we use is a single-issue, in-order, pipelined implementation of the MIPS32 ISA with no speculative execution. We did not change this, as it simplifies timing analysis and allows small-area hardware implementation. However, significant work has been invested in designing a cycle-accurate model of this core inside a state-of-the art WCET analysis tool [133]. The caches have been significantly modified. The original design featured caches with a pseudo-lru (PLRU) replacement policy and with a writing policy that is intermediate between write-through and write-back. 3 Memory accesses from the data and instruction caches of a single CPU were multiplexed over a single conection to the local interconnect of the tile. All these choices are known to complicate timing analysis and/or to reduce the precision of the analysis [137, 80], and thus we reverted to more conservative choices: We use the LRU replacement policy, a fully write-through policy, and we let the instruction and data caches access the local tile interconnect through separate connexions. 
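To see why true LRU is friendlier to analysis than PLRU, it helps to spell the policy out. The following is a minimal, purely illustrative model of one 4-way set; the way count and the tag-only interface are assumptions, not the actual cache design.

```c
#include <stdint.h>

/* Minimal model of a single 4-way cache set under true LRU
 * replacement, the policy we adopt (the original PLRU is harder to
 * analyse precisely).  Purely illustrative. */
#define WAYS 4

typedef struct {
    uint32_t tag[WAYS];
    int age[WAYS];    /* 0 = most recently used */
    int valid[WAYS];
} lru_set;

/* Access a line by tag; returns 1 on hit, 0 on miss.  On a miss the
 * least recently used (or an invalid) way is refilled. */
int lru_access(lru_set *s, uint32_t tag) {
    for (int w = 0; w < WAYS; w++) {          /* hit? */
        if (s->valid[w] && s->tag[w] == tag) {
            int old = s->age[w];
            for (int v = 0; v < WAYS; v++)    /* more recent ways age by one */
                if (s->valid[v] && s->age[v] < old) s->age[v]++;
            s->age[w] = 0;
            return 1;
        }
    }
    int victim = 0;                            /* miss: pick a victim */
    for (int w = 0; w < WAYS; w++) {
        if (!s->valid[w]) { victim = w; break; }
        if (s->age[w] > s->age[victim]) victim = w;
    }
    for (int v = 0; v < WAYS; v++)
        if (s->valid[v]) s->age[v]++;
    s->tag[victim] = tag;
    s->age[victim] = 0;
    s->valid[victim] = 1;
    return 0;
}
```

Under LRU, a line is guaranteed to survive any sequence of fewer than 4 accesses to distinct other blocks of the set — the kind of simple invariant a WCET analyzer can exploit, and which PLRU does not offer [137].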
³ Consecutive writes inside a single cache line were buffered.

The use of a write-through policy reduces the processing speed of each CPU. This is the only modification we made to the architecture that decreases processing speed.

Synchronization. To improve temporal predictability, and also speed, our architecture does not use interrupt-based synchronization. Interrupt signaling by itself is fast, but handling an interrupt usually requires accesses to program memory, which take additional time. Furthermore, the imprecision of interrupt arrival dates and the modifications of the cache state mean that interrupts are difficult to take into account during timing analysis. To avoid these performance and predictability problems, we replace the interrupt unit present in each tile of the original architecture with a hardware lock component. These components allow synchronization with very low overhead (one non-cached local RAM access) and without modification of the cache state. The lock unit follows a simple request/grant protocol which allows a single grant operation to unblock multiple requests.

Buffered DMA. The traditional DMA unit of the original architecture requires significant software control to determine when a DMA operation is finished, so that another can start. This is done either through interrupt-based signaling, which has the drawbacks mentioned above, or through polling of the DMA registers, which consumes significant CPU time and imposes significant constraints on CPU scheduling. To avoid these problems, we use DMA units that buffer transmission commands. A CPU can send one or more DMA commands while the previous DMA operation is not yet completed. Furthermore, the DMA unit can be programmed so that it not only sends out data, but also signals the end of the transmission to the target tile by granting a lock, as described in Section 4.3. Thus, all inter-tile communication and synchronization can be performed by the DMA units, in parallel with the data computations of the CPUs and without requiring significant CPU time for control.

Modifications of the NoC

The DSPIN network-on-chip [117] is a classical 2D mesh NoC. It uses wormhole packet switching and a static routing scheme.⁴ Each router of the command or response NoC has the internal structure of Fig. Each NoC router is connected through FIFO links with the 4 neighboring routers (denoted North, South, West, and East) and with the local computing tile. Each of these connections is realized through one demultiplexer and one multiplexer. The demultiplexer ensures the routing function (X-first/Y-first): it reads the headers of the incoming packets and, depending on the target address, sends them towards one of the multiplexers of the router. The multiplexer ensures the arbitration (scheduling) function.
When two packets arrive at the same time (from different demultiplexers), a fair round-robin arbiter decides which one is transmitted first. Once the transmission of a packet has started, it cannot be stopped.⁵

This fair arbitration scheme is well adapted to applications without real-time requirements, where it ensures a good use of NoC resources. But when the objective is to provide real-time guarantees and to allocate NoC resources according to application needs, another arbitration mechanism is preferable. In our case, the objective is to provide the best possible support for the implementation of static computation and communication schedules. We therefore rely on a programmed arbitration approach, where each router multiplexer enforces a fixed packet transmission order specified in the form of a communication program.

⁴ X-first routing for the command network and Y-first for the response network.
⁵ Unless a virtual channel mechanism is used, as described in [117].
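Such a communication program can be represented as a cyclic list of (input port, packet count) entries that the multiplexer replays, as in the "26 packets from Local, then 26 from West" example of Fig. 4.6. The sketch below simulates this behaviour in plain C; the concrete hardware encoding of the program is an assumption.

```c
/* Sketch of the communication program driving one router multiplexer:
 * a cyclic list of (input port, packet count) entries.  The encoding
 * is illustrative, not the actual hardware format. */
enum port { NORTH, SOUTH, EAST, WEST, LOCAL };

typedef struct { enum port from; int count; } arb_entry;

typedef struct {
    const arb_entry *prog;
    int len;   /* entries in the cyclic program            */
    int pos;   /* current entry                            */
    int left;  /* packets still to accept in current entry */
} arb_state;

void arb_init(arb_state *s, const arb_entry *prog, int len) {
    s->prog = prog;
    s->len  = len;
    s->pos  = len - 1;  /* so the first packet uses entry 0 */
    s->left = 0;
}

/* Input port from which the multiplexer must accept the next packet;
 * advances the cyclic program. */
enum port arb_next(arb_state *s) {
    if (s->left == 0) {
        s->pos  = (s->pos + 1) % s->len;
        s->left = s->prog[s->pos].count;
    }
    s->left--;
    return s->prog[s->pos].from;
}
```

Because the transmission order is fixed off-line, the worst-case traversal time of each scheduled packet follows directly from its position in the program, instead of depending on run-time round-robin interleavings.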

Figure 4.6: Programmed arbitration in a NoC router multiplexer

Enforcing fixed packet transmission orders requires new hardware components called router controllers. These components are present in the tile description of Fig. 4.5, and their interaction with the NoC router multiplexers is pictured in Fig. 4.6. Each multiplexer of the command network is controlled by its own controller running a separate program. The program is loaded onto the component through the local interconnect. The interface between the router controller and the local interconnect is also used to start and stop the router controller; when the controller is stopped, the fair arbiter of DSPIN takes control. In Fig. 4.6, the arbitration program cyclically accepts 26 packets from the Local connection, then 26 packets from the West connection. More details on the implementation and properties of our programmed arbitration mechanism can be found in [55].

4.3 Software organization

On the hardware architecture defined above, we use non-preemptive, statically scheduled software with lock-based inter-processor synchronization. Each core is allocated one sequential thread. To ensure both performance and predictability, we require that the software follow the organization rules detailed in this section.

Data locality. We require computation functions to operate only on local tile data. If a computation needs data from another tile, this data is first transferred to the local data RAM. Under this hypothesis, the timing (WCET/BCET) analysis of computation functions can be performed without considering the NoC.

Inter-tile data transfers. These are performed only by the DMA units. The CPUs retain the possibility of directly accessing RAM banks of other tiles, but they only do so during the boot phase (which follows the standard protocol of the MIPS32 ISA) or for non-real-time code running on many-core tiles allocated to such code.
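The lock-based inter-processor synchronization relies on the request/grant protocol of the hardware lock unit introduced in Section 4.2. The plain-C behavioural sketch below captures one possible reading of that protocol — a grant deposits a number of tokens and can thereby unblock several pending requests; the interface is illustrative, not the actual register map of the lock unit.

```c
/* Behavioural sketch (plain C, no hardware access) of the lock unit's
 * request/grant protocol.  The counting interpretation -- one grant
 * deposits several tokens, unblocking up to that many pending
 * requests -- is an assumption, not a documented register-level
 * interface. */
typedef struct {
    int grants;   /* tokens deposited by granters (e.g. a DMA unit) */
    int waiting;  /* requests currently blocked                     */
} lock_unit;

/* A CPU posts a request: served immediately if a token is available,
 * otherwise recorded as pending.  Returns 1 when granted. */
int lock_request(lock_unit *l) {
    if (l->grants > 0) { l->grants--; return 1; }
    l->waiting++;
    return 0;
}

/* A single grant operation deposits `tokens` tokens at once; returns
 * the number of pending requests it unblocks. */
int lock_grant(lock_unit *l, int tokens) {
    int served = (l->waiting < tokens) ? l->waiting : tokens;
    l->waiting -= served;
    l->grants  += tokens - served;
    return served;
}
```

In the statically scheduled code, a DMA unit plays the granter role: it releases the lock guarding a data buffer when the transfer into that buffer completes, so the consuming CPUs synchronize without interrupts and without polling the DMA registers.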
Traffic generated directly by CPUs and their caches


More information

Synchronous Formal Design of Cyber-Physical Systems

Synchronous Formal Design of Cyber-Physical Systems 1 Context Synchronous Formal Design of Cyber-Physical Systems The project conducted by Centre de recherche de l ECE Paris (axis Systèmes Intelligent et Communiquants) in collaboration with other institutions

More information

High-Level Debugging Facilities and Interfaces. Design and Developement of a Debug-Oriented I.D.E.

High-Level Debugging Facilities and Interfaces. Design and Developement of a Debug-Oriented I.D.E. High-Level Debugging Facilities and Interfaces. Design and Developement of a Debug-Oriented I.D.E. Nick Papoylias To cite this version: Nick Papoylias. High-Level Debugging Facilities and Interfaces. Design

More information

Application of RMAN Backup Technology in the Agricultural Products Wholesale Market System

Application of RMAN Backup Technology in the Agricultural Products Wholesale Market System Application of RMAN Backup Technology in the Agricultural Products Wholesale Market System Ping Yu, Nan Zhou To cite this version: Ping Yu, Nan Zhou. Application of RMAN Backup Technology in the Agricultural

More information

NP versus PSPACE. Frank Vega. To cite this version: HAL Id: hal https://hal.archives-ouvertes.fr/hal

NP versus PSPACE. Frank Vega. To cite this version: HAL Id: hal https://hal.archives-ouvertes.fr/hal NP versus PSPACE Frank Vega To cite this version: Frank Vega. NP versus PSPACE. Preprint submitted to Theoretical Computer Science 2015. 2015. HAL Id: hal-01196489 https://hal.archives-ouvertes.fr/hal-01196489

More information

Lossless and Lossy Minimal Redundancy Pyramidal Decomposition for Scalable Image Compression Technique

Lossless and Lossy Minimal Redundancy Pyramidal Decomposition for Scalable Image Compression Technique Lossless and Lossy Minimal Redundancy Pyramidal Decomposition for Scalable Image Compression Technique Marie Babel, Olivier Déforges To cite this version: Marie Babel, Olivier Déforges. Lossless and Lossy

More information

Formal modelling of ontologies within Event-B

Formal modelling of ontologies within Event-B Formal modelling of ontologies within Event-B Yamine Ait Ameur, Idir Ait-Sadoune, Kahina Hacid, Linda Mohand Oussaid To cite this version: Yamine Ait Ameur, Idir Ait-Sadoune, Kahina Hacid, Linda Mohand

More information

THE COVERING OF ANCHORED RECTANGLES UP TO FIVE POINTS

THE COVERING OF ANCHORED RECTANGLES UP TO FIVE POINTS THE COVERING OF ANCHORED RECTANGLES UP TO FIVE POINTS Antoine Mhanna To cite this version: Antoine Mhanna. THE COVERING OF ANCHORED RECTANGLES UP TO FIVE POINTS. 016. HAL Id: hal-0158188

More information

Formal proofs of code generation and verification tools

Formal proofs of code generation and verification tools Formal proofs of code generation and verification tools Xavier Leroy To cite this version: Xavier Leroy. Formal proofs of code generation and verification tools. Dimitra Giannakopoulou and Gwen Salaün.

More information

How useful is the UML profile SPT without Semantics? 1

How useful is the UML profile SPT without Semantics? 1 How useful is the UML profile SPT without Semantics? 1 Susanne Graf, Ileana Ober VERIMAG 2, avenue de Vignate - F-38610 Gières - France e-mail:{susanne.graf, Ileana.Ober}@imag.fr http://www-verimag.imag.fr/~{graf,iober}

More information

Sliding HyperLogLog: Estimating cardinality in a data stream

Sliding HyperLogLog: Estimating cardinality in a data stream Sliding HyperLogLog: Estimating cardinality in a data stream Yousra Chabchoub, Georges Hébrail To cite this version: Yousra Chabchoub, Georges Hébrail. Sliding HyperLogLog: Estimating cardinality in a

More information

Efficient Gradient Method for Locally Optimizing the Periodic/Aperiodic Ambiguity Function

Efficient Gradient Method for Locally Optimizing the Periodic/Aperiodic Ambiguity Function Efficient Gradient Method for Locally Optimizing the Periodic/Aperiodic Ambiguity Function F Arlery, R assab, U Tan, F Lehmann To cite this version: F Arlery, R assab, U Tan, F Lehmann. Efficient Gradient

More information

Real-Time and Resilient Intrusion Detection: A Flow-Based Approach

Real-Time and Resilient Intrusion Detection: A Flow-Based Approach Real-Time and Resilient Intrusion Detection: A Flow-Based Approach Rick Hofstede, Aiko Pras To cite this version: Rick Hofstede, Aiko Pras. Real-Time and Resilient Intrusion Detection: A Flow-Based Approach.

More information

Developing interfaces for the TRANUS system

Developing interfaces for the TRANUS system Developing interfaces for the TRANUS system Julien Armand To cite this version: Julien Armand. Developing interfaces for the TRANUS system. Modélisation et simulation. 2016. HAL Id: hal-01401264

More information

An Introduction to Lustre

An Introduction to Lustre An Introduction to Lustre Monday Oct 06, 2014 Philipp Rümmer Uppsala University Philipp.Ruemmer@it.uu.se 1/35 ES Programming languages Which language to write embedded software in? Traditional: low-level

More information

From MDD back to basic: Building DRE systems

From MDD back to basic: Building DRE systems From MDD back to basic: Building DRE systems, ENST MDx in software engineering Models are everywhere in engineering, and now in software engineering MD[A, D, E] aims at easing the construction of systems

More information

Compositional Translation of Simulink Models into Synchronous BIP

Compositional Translation of Simulink Models into Synchronous BIP Compositional Translation of Simulink Models into Synchronous BIP Vassiliki Sfyrla, Georgios Tsiligiannis, Iris Safaka, Marius Bozga, Joseph Sifakis To cite this version: Vassiliki Sfyrla, Georgios Tsiligiannis,

More information

Specifications Part 1

Specifications Part 1 pm3 12 Specifications Part 1 Embedded System Design Kluwer Academic Publisher by Peter Marwedel TU Dortmund 2008/11/15 ine Marwedel, 2003 Graphics: Alexandra Nolte, Ges Introduction 12, 2008-2 - 1 Specification

More information

Taking Benefit from the User Density in Large Cities for Delivering SMS

Taking Benefit from the User Density in Large Cities for Delivering SMS Taking Benefit from the User Density in Large Cities for Delivering SMS Yannick Léo, Anthony Busson, Carlos Sarraute, Eric Fleury To cite this version: Yannick Léo, Anthony Busson, Carlos Sarraute, Eric

More information

Assisted Policy Management for SPARQL Endpoints Access Control

Assisted Policy Management for SPARQL Endpoints Access Control Assisted Policy Management for SPARQL Endpoints Access Control Luca Costabello, Serena Villata, Iacopo Vagliano, Fabien Gandon To cite this version: Luca Costabello, Serena Villata, Iacopo Vagliano, Fabien

More information

Executable AADL. Real Time Simulation of AADL Models. Pierre Dissaux 1, Olivier Marc 2.

Executable AADL. Real Time Simulation of AADL Models. Pierre Dissaux 1, Olivier Marc 2. Executable AADL Real Time Simulation of AADL Models Pierre Dissaux 1, Olivier Marc 2 1 Ellidiss Technologies, Brest, France. 2 Virtualys, Brest, France. pierre.dissaux@ellidiss.com olivier.marc@virtualys.com

More information

Introduction to Formal Methods

Introduction to Formal Methods 2008 Spring Software Special Development 1 Introduction to Formal Methods Part I : Formal Specification i JUNBEOM YOO jbyoo@knokuk.ac.kr Reference AS Specifier s Introduction to Formal lmethods Jeannette

More information

Mode-Automata based Methodology for Scade

Mode-Automata based Methodology for Scade Mode-Automata based Methodology for Scade Ouassila Labbani, Jean-Luc Dekeyser, Pierre Boulet To cite this version: Ouassila Labbani, Jean-Luc Dekeyser, Pierre Boulet. Mode-Automata based Methodology for

More information

Synchronous Statecharts. Christian Motika

Synchronous Statecharts. Christian Motika Execution (KlePto) Esterel to transformation (KIES) Synchronous Statecharts for executing Esterel with Ptolemy Christian Motika Real-Time Systems and Embedded Systems Group Department of Computer Science

More information

MUTE: A Peer-to-Peer Web-based Real-time Collaborative Editor

MUTE: A Peer-to-Peer Web-based Real-time Collaborative Editor MUTE: A Peer-to-Peer Web-based Real-time Collaborative Editor Matthieu Nicolas, Victorien Elvinger, Gérald Oster, Claudia-Lavinia Ignat, François Charoy To cite this version: Matthieu Nicolas, Victorien

More information

Software Synthesis from Dataflow Models for G and LabVIEW

Software Synthesis from Dataflow Models for G and LabVIEW Software Synthesis from Dataflow Models for G and LabVIEW Hugo A. Andrade Scott Kovner Department of Electrical and Computer Engineering University of Texas at Austin Austin, TX 78712 andrade@mail.utexas.edu

More information