D7a. State of the Art Assessment

Adaptivity & Control of Resources in Embedded Systems
D7a State of the Art Assessment
Responsible: Ecole Polytechnique Fédérale de Lausanne (EPFL)
Authors: Gerhard Fohler (TUKL), Alexander Neundorf (TUKL), Karl-Erik Årzén (ULUND), Christophe Lucarz (EPFL), Marco Mattavelli (EPFL), Vincent Noel (AKAtech), Carl von Platen (Ericsson), Giorgio Buttazzo (SSSA), Enrico Bini (SSSA)
Project Acronym: ACTORS
Project full title: Adaptivity and Control of Resources in Embedded Systems
Proposal/Contract no: ICT
Project Document Number: D7a 1.1
Project Document Date:
Workpackage Contributing to the Project Document: WP7
Deliverable Type and Security: R-PU


Abstract

The state of the art in the areas pertaining to the planned activities in the EU ACTORS FP-7 project is described. The project aims at developing methodologies and tools for the implementation and deployment of flexible soft real-time applications using dataflow modeling, adaptive resource management, and reservation-based scheduling.


Contents

1 Trends in Embedded Devices
  1.1 The implementation gap
2 Current approaches in high level design
  2.1 Architecture and Algorithm representations
  2.2 Tools for complexity analysis
  2.3 Heuristics and metrics for system analysis
  2.4 Current frameworks to bridge the implementation gap
  2.5 The approach in ACTORS project
  2.6 Potential research directions in the project
3 Compilation of Dataflow Programs
  3.1 Introduction
  3.2 Dataflow Graphs
  3.3 Scheduling of Dataflow Graphs
  3.4 Problems with Current Approaches
  3.5 Relation to Work Packages and Assessment
  3.6 Potential Research Directions
4 Adaptive Resource Management
  4.1 Introduction
  4.2 Principal Resource Management Methods
  4.3 Application of the Management Methods to Resources
  4.4 Adaptive Algorithms
  4.5 Relation to Work Packages and Assessment
5 Resource Reservation in Real-Time Systems
  5.1 Application Domains
  5.2 Problems without temporal protection
  5.3 Providing temporal protection
  5.4 The GPS model
  5.5 Proportional share scheduling
  5.6 Resource reservation techniques
  5.7 Resource reservations in dynamic priority systems
  5.8 Temporal guarantees
  5.9 Resource reservations in operating system kernels
6 Feedback Scheduling
  6.1 Background
  6.2 Motivation and Objectives
  6.3 Important Issues
  6.4 Feedback Scheduling of CPU Resources
  6.5 Feedback Scheduling and Resource Reservations
  6.6 OS Support
  6.7 Feedback Scheduling in ACTORS
Bibliography

Chapter 1
Trends in Embedded Devices

The research challenges in the field of digital systems for signal processing have radically changed since the first pioneering digital signal processing works of the 60s and early 70s, which aimed at showing the potential of transforming analog signals into digital samples. At the time, the implementation of the basic building blocks of digital processing systems, such as analog-to-digital converters, FIR filters and FFTs, represented the main architectural challenges. From a technology point of view, the challenges lay in the development of new silicon technologies capable of providing faster and smaller circuits. With the advent of Si CMOS technology, scaling into higher-density components emerged as the main motor driving architecture and processing innovation. Indeed, in the past two decades, the performance of digital systems has progressed at an astounding pace, sustained by the successful scaling of Si CMOS technology into the submicron range, providing powerful platforms in the form of general purpose processors, DSPs, dedicated SoCs and FPGAs that satisfy demanding new applications such as multimedia processing, digital transmission and video compression. The fundamental consequence is that sequential models for specifying and representing algorithms and SW, together with sequential processor architectures, have proven to be the winning approach for exploiting the potential of each new generation of Si CMOS scaling. SW developers and HW designers knew that at each Si CMOS scaling step the boost in performance would be higher and easier to obtain than what could have been expected from developing new concurrent, parallel systems, tools and architectures. The only important exception to this rule is a specific field, the graphics world, where the intrinsically wide parallelism of the tasks has led to the development of a family of parallel architectures used only for a very restricted set of basic operations (e.g. polygon rendering). So, while the success of Si CMOS scaling has been the main driving force behind performance progress, it has at the same time practically restricted the way system specification, SW development, HW design and all related tools and formalisms have evolved over the last twenty-five years. No common formalism, language or methodology is available to map concurrency and parallelism for SW and HW from the specification level down through all the different levels of abstraction to the final system design. No communication or video compression standard provides specifications and models that expose any form of explicit parallelism. Meanwhile, digital algorithms, such as digital video compression and multimedia processing, have grown in functionality and performance at the expense of ever growing complexity. Indeed, complexity has reached levels at which most of the tasks involved in passing from the algorithm specification to an architecture now need to be executed using new methodologies and optimization tools capable

of assisting and supporting the designer's work. In other words, the gap between algorithm/system specification and architecture, which in the past was "filled" by the designer, is getting larger and larger; what is even more worrying is that it is becoming much more difficult, or even impossible, to cover by the designer's intuition and creativity alone. It is clear that the sequential general purpose processor era, during which performance increased more than two-fold every year, is definitely over. There has been no further increase in leading-edge processor performance for over two years now, and all projects aimed at breaking the 4 GHz barrier have been canceled by the major processor manufacturers. All current projects aimed at next-generation processor systems that improve state-of-the-art performance are now based on parallel multi-core architectures. The two major factors driving this radical change are the need to reduce power dissipation and the limits of electrical design models in the microwave range. We are thus approaching the end of the "sequential processor" era, but we have not yet developed the models and tools to efficiently cope with parallelism from the specification level down to the final implementations.

1.1 The implementation gap

In embedded systems, many languages are employed. All along the design flow, the system is described in very different forms as it goes through the different steps from specification to implementation. Many different abstraction levels are crossed, and the system is represented with a different language at each of them. First of all, the designer needs to specify precisely the functionality of the system. As embedded systems become more and more complex, the algorithms must be expressed in a rigorous form in order to avoid confusion; specifying them using only prose sentences is no longer possible.
The International Organization for Standardization (ISO/IEC) faced this problem during the development of the MPEG codecs. In the last 20 years, MPEG has produced many important and innovative video coding standards, such as MPEG-1, MPEG-2, MPEG-4 Part 2 and Part 10 (AVC), and is currently working on a new standard, "scalable video coding" (SVC). Video coding technology has meanwhile reached very high levels of complexity. Different forms of specification have been adopted since the very beginning of standard video coding: MPEG-1 and MPEG-2 were described only by textual specifications, whereas for MPEG-4 it was the C reference software that became the true standard specification. However, specifying the standard by means of such monolithic C/C++ descriptions presents several limitations and drawbacks. A large portion of the coding tools is common to all the different standards, but there is no way to identify the tools that the standards share. "Higher level" formalisms must be used to adopt appropriate "top-down" system implementation methodologies. At the end of the design flow, the entire system must be described according to the different processing units composing the device on which the system will be implemented. When dealing with heterogeneous platforms, the entire system must be decomposed into several parts, and each of them must be described in the appropriate language, according to the processing unit. Subsets of the C language are used to describe the software parts, and VHDL/Verilog is used for the hardware parts. Among the processing units available in heterogeneous platforms, we can

find processors, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs) and Digital Signal Processors (DSPs). As the complexity of embedded systems increases, it is no longer possible to handle such systems at a low level of abstraction, i.e. at the VHDL/Verilog or C level. As a consequence, the level of abstraction at which the system is specified, analyzed and designed must be raised. Nowadays, the problem is how to generate low-level descriptions of the system (at the C and VHDL/Verilog level) from higher-level descriptions, which are easier and quicker to write. Being able to generate VHDL/Verilog code does not necessarily mean that the result will be an efficient design, and this is a major problem of raising the level of abstraction when specifying a system: because the task of generating code is delegated to Computer Aided Design (CAD) tools, the designer has less control over the low-level descriptions. Finding systematic ways to build efficient designs from a high-level description remains a very promising research topic. Between high-level programming languages and implementation languages there is a large gap, often called the "implementation gap". This state-of-the-art survey shows the current approaches to reducing this huge gap between the specification of a system and its real implementation. Filling this gap is, and will remain, the main challenge of digital system design during the next two decades. Let us briefly summarize the challenges that we will need to face to "fill the implementation gap". Figure 1.1 graphically represents the "implementation gap" between digital system descriptions that do or do not embody the concept of "architecture". It is clear from the graph that there are very few tools and specification formalisms that can help designers cross the gap.
Figure 1.1: Graphic representation of the "Architectural Gap" with examples of models

Filling the "Architectural Gap": The Complexity/Design Productivity Challenge

Due to the level of complexity reached by communication and multimedia processing, it is necessary to raise the level of "abstraction" at all levels of the design flow, from the specification down to the level at which the designer can specify the architecture in a more appropriate form than current HDLs allow. The size of current HDL source code has become too large for any efficient optimization stage

driven by designer creativity, which ends up frustrated by the quantity of resources that must be spent just to obtain a "working design". Although this process of raising the level of abstraction is already in place at some levels of the design flow (e.g. SystemC and C-to-gates tools), it is not based on the right specifications and behavioral models and, to be successful, more radical innovations and breakthroughs at additional levels of the design flow are needed. By specification we mean here a fully functional behavioral model such as the ones provided by ISO/IEC (MPEG) for the specification of the MPEG-4, MPEG-7 and MPEG-21 standards in the form of C, C++ and Java code. Even though (unfortunately) several technologies and standards within the communication and multimedia field still provide specifications in natural language, it is clear that current algorithmic sophistication and complexity cannot be handled by textual descriptions alone.

Filling the "Architectural Gap": The Parallelism Challenge

An approach that generates parallelism only by means of specific tools during the final mapping stage onto a (parallel) platform is clearly inadequate given the complexity of today's specifications and algorithms. Specifications and high-level models should explicitly "expose" parallelism from the very beginning of any design flow. It is then up to the designer's creativity and to the specific design tools to choose the level of parallelism suitable for the architecture of the final platform.

Chapter 2
Current approaches in high level design

This chapter provides an overview of the different languages, methods and tools used in embedded system design. It focuses on high-level design flows capable of capturing the parallelism of the systems. The major problems in embedded system design are "how to map an algorithm onto a given architecture?" and "which architecture provides the most efficient mapping?"; all the research done in embedded system design can be summarized by these two questions. Section 2.1 reviews the different ways in which algorithms and architectures can be described. In order to design complex but efficient systems, designers need additional information to characterize both algorithms and architectures; Section 2.2 gives an overview of the existing tools available for complexity analysis. Section 2.3 then surveys the different metrics and heuristics used to characterize systems. Section 2.4 reviews the existing tools and methodologies for bridging the gap between specification and implementation: how can an entire system be built, given the specification of the algorithm and the underlying architecture? Finally, Section 2.5 positions the ACTORS project in relation to the current approaches and discusses the choice of CAL for specifying algorithms. The potential improvements of the ACTORS approach and potential research directions are also presented.

2.1 Architecture and Algorithm representations

The major difficulty in embedded systems design is that it must "merge" two different aspects of the system: algorithm and architecture. What are the different models for specifying algorithms and architectures? In embedded systems, many languages are used. All along the design flow, the system can be described in very different manners as it goes through the different steps, and many abstraction levels are involved.
The system is represented with different languages at each level of abstraction, from specification down to implementation. First, the designer specifies the system using an algorithmic model. Then the system is implemented on a physical device with an architecture composed of different processing components: processors, FPGAs, ASICs and DSPs. This section overviews the different representations at the algorithm and architecture levels.

Algorithmic models

Such models are used to specify algorithms at a high level of abstraction. There are many different ways of specifying algorithms; there is no golden representation, and the choice often depends on other parameters, such as the application domain or the policy of a company or organization. MPEG, for instance, uses C/C++ for its reference software. High-level programming languages (C, C++ and Java, or more application-specific languages, such as Matlab for signal processing) can be used to specify algorithms. As these languages are in common use, existing benchmarks can conveniently be used to test the system under development. However, the ability of these languages to express concurrency is limited, which is all the more problematic now that we are entering the multiprocessor era. Transaction Level Modeling (TLM) is increasingly used because it raises the level of abstraction one step above plain SystemC. TLM is a discrete-event model of computation built on top of SystemC in which modules interchange events with time stamps. Interactions between software and hardware are modeled using shared buses. Within this model of computation, modules can be specified at several levels of abstraction, making it possible to specify functional, un-timed state machine models for the application and an instruction-accurate performance model for the architecture. The level of abstraction of this model of computation nevertheless remains quite low. It is also possible to combine heterogeneous models of computation hierarchically at a high level in the well-known Ptolemy framework [1], in a Simulink fashion. CAL has been designed as part of the Ptolemy framework. Ptolemy focuses on component-based heterogeneous modeling. It uses tokens as the underlying communication mechanism: directors regulate how the actors in a design fire and how tokens are used to communicate between them. This mechanism allows different Models of Computation (MoC) to be constructed within Ptolemy II.
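The token-and-director mechanism just described can be sketched in a few lines of Python. This is an illustrative analogy only, not the Ptolemy II API (which is Java-based); all class and function names here are invented for the sketch.

```python
# Minimal sketch of token-based actor communication: actors fire when
# their firing rule is satisfied, and a trivial "director" repeatedly
# fires any enabled actor until the network is quiescent.
from collections import deque

class Channel:
    """Unidirectional FIFO carrying tokens between two actors."""
    def __init__(self):
        self.fifo = deque()
    def put(self, token):
        self.fifo.append(token)
    def get(self):
        return self.fifo.popleft()
    def count(self):
        return len(self.fifo)

class Scale:
    """An actor that multiplies each input token by a constant factor."""
    def __init__(self, inp, out, factor):
        self.inp, self.out, self.factor = inp, out, factor
    def can_fire(self):
        return self.inp.count() >= 1      # firing rule: one token available
    def fire(self):
        self.out.put(self.inp.get() * self.factor)

def run(actors):
    """A trivial 'director': fire enabled actors until none can fire."""
    fired = True
    while fired:
        fired = False
        for a in actors:
            if a.can_fire():
                a.fire()
                fired = True

src, mid, sink = Channel(), Channel(), Channel()
for v in (1, 2, 3):
    src.put(v)
doubler = Scale(src, mid, 2)
negate = Scale(mid, sink, -1)
run([doubler, negate])
print(list(sink.fifo))   # -> [-2, -4, -6]
```

The point of the sketch is the separation of concerns: the actors only declare firing rules and token transformations, while the director alone decides the firing order.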
The models of computation presented above are the most widely used today, but there are many other ways to specify algorithms. At a high level of abstraction we find Kahn Process Networks (KPN), introduced by G. Kahn in 1974 [2]: they consist of concurrent processes that communicate through unbounded, uni-directional FIFOs. Each process is sequential, but the global execution at the process-network level is parallel. The YAPI model [3] extends KPNs by associating a data type with each channel and by introducing non-determinism, allowing dynamic decisions on selecting the next channel communication; thus, scheduling on shared resources can be modeled. Graphs are also well suited to representing algorithms. In Directed Acyclic Graphs (DAGs), nodes represent atomic operations and directed edges represent data dependencies between operations. A DAG represents the data flow of an application and can be used at different levels of granularity, from logic operators [4] to tasks [5, 6, 7] or packet processing [8]. These representations can be extended to DAGs with periods and deadlines, where computation tasks are annotated with execution deadlines and periods. As in DAGs, in Synchronous Dataflow graphs (SDF) nodes represent atomic operations/tasks and directed edges represent data dependencies between nodes. SDF extends the DAG model in that each SDF node is annotated with the number of data tokens produced and consumed by the computation at that node. This is static information, from which feasible schedules can be determined; it is also very useful for determining the memory requirements for buffering data between the processing entities. Another type of graph is the Control Data Flow Graph (CDFG), which can be extracted from the source code description of a program. The graph describes the static control flow

of the program, showing all the possible computation paths and the dynamic decision points taken at run-time. It also describes the flow of the data inside the program, showing the concurrency of the application. CDFGs are discussed in [8]. Other graphs focus on the communications between the processing entities, such as arrival curves [9] (for modeling the workload imposed on an application or system, as well as the output generated by the system) or Communication Analysis Graphs (CAG) [10]. CAGs represent communication and computation traces extracted from system-level simulations. A CAG is a DAG with communication and computation nodes that includes timing information; it can thus also be seen as an abstract task-level performance model of the application that includes a schedule. An algorithm can also be represented with Co-Design Finite State Machines (CFSM). Communication between CFSM components is asynchronous, whereas within a Finite State Machine (FSM) execution is synchronous, based on events. This representation allows the expression of concurrency. The domain-specific Click model of computation [11] is a representation especially suited to describing packet processing applications. Nodes represent computation on packets, whereas edges represent packet flow between computations. The packet flow is driven by push and pull semantics: push transfers are initiated by traffic sources, and pull transfers are driven by traffic sinks.

Architecture models

Architecture models focus more on the performance perspective of the modeled hardware. These models are useful for analyzing the system in terms of speed, power, memory or area consumption. A model is either an abstract description of the hardware with a set of associated constraints/characteristics, or an executable description.
Abstract models represent performance only symbolically, whereas executable specifications allow a more precise analysis of the underlying hardware concerning communication, computation and memory. The most well-known languages for describing hardware are the Hardware Description Languages (HDL), which describe the functionality and the behavior of an architecture. SystemC enables functional, transaction-level, and cycle-accurate models, whereas other HDLs, such as Verilog and VHDL, are especially used for describing the actual structure of the underlying hardware in the form of RTL netlists. Modeling computer architectures at a higher level of abstraction than HDL can be done using an Architecture Description Language (ADL) while preserving paths to cycle-accurate simulators and synthesizable hardware models. ADLs can be categorized according to the type of architectures they can describe (single- vs. multi-threaded, for example). LISA [12], EXPRESSION [13], nML [14], MIMOLA [15], and the languages used within the Mescal framework [16] are examples of ADLs. Instead of describing the entire architecture from scratch, another method consists in describing architectures with pre-defined templates. The main drawback is that this constrains the design to a given class of architectures (e.g. VLIW). The advantage of using well-defined and pre-configured IP blocks is that it can lead to optimized and efficient code generation with a set of compiler and simulator tools. Examples of these techniques can be found in the PICO framework [17], in SimpleScalar [18] and in the work of Lahiri et al. [10]. There are several ways to describe architectures abstractly, at different levels of abstraction. At the system level, service curves [9] describe a non-linear worst-case envelope for the computation or communication capabilities of system-level components over all possible time intervals. Service curves are measured in units of cycles, instructions, bytes, or service time per second. This model is used in EXPO [8] to model building blocks of SoC designs. At the task level, in task-accurate models, the timing behavior of a resource is described by a list of supported tasks and their worst-case or average (estimated) execution times on this resource. This abstraction level is suited to SoC designs in which the level of granularity of interest is that of computation cores, memories, and buses. This model is used in the work of Ascia et al. [6]. At the lowest level of abstraction, in instruction-accurate models, the timing behavior of a resource is described by a list of symbolic instructions and their associated latencies. Traces of symbolic instructions are generated by annotated application models during execution and handed to the architecture models to determine the overall execution time of an application. This type of model is used in the SPADE framework [19].

Discussion

The Holy Grail for embedded system designers would be a language capable of representing both algorithm and architecture. Unfortunately, it is difficult to specify an algorithm at a high level of abstraction and to deduce from it an accurate low-level representation of the same algorithm. How can high-level and low-level constructs be mixed in a single language? Furthermore, such a language must support simulation in order to ease the design of systems. Designers therefore have to use several languages to represent their algorithms at the different levels of abstraction crossed during the design flow. The main drawback of this approach is the need to convert between all these languages; compilers often support only a subset of each language, which makes these conversions a headache. The aim is thus to reduce the number of languages used during the design flow of the system.
Starting from the description of the system in a high-level language, it would be very valuable if that language could carry the description of the algorithm as far as possible down towards implementation. The only language transformation would then be from this high-level language to an implementation language such as C, VHDL or Verilog. In a nutshell, defining and developing such a language is a very challenging task, but it would constitute major progress in embedded system design. This Holy Grail language would have the following characteristics:

- a high-level language for algorithm specification
- easy expression of concurrency, for next-generation multi-core systems
- support for several levels of abstraction, to avoid conversion problems
- support for simulation, to ease design
- ease of generating implementation code from it

Abstract languages often do not support simulation, architecture representations are at too low a level of abstraction, and standard imperative languages do not express parallelism easily. This considerably reduces the set of possible candidates. A new actor/dataflow-oriented language called CAL is a very good candidate. CAL is a dataflow- and actor-oriented language that has recently been specified and developed as a subproject of the Ptolemy project at the University of California at Berkeley; the final CAL language specification was released in December 2003 [20]. CAL describes algorithms as a set of encapsulated actors communicating with each other in a dataflow manner. An actor is a modular component that encapsulates its own state. The state of a given actor is not shareable with other actors; thus, one actor cannot modify the state of another actor. The only interaction one actor has with another is through its inputs and outputs.
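The actor model just described (private state, interaction only through token ports, actions guarded by firing conditions) can be sketched in Python as an analogy. This is not CAL syntax, and all names are invented for illustration.

```python
# Sketch of a CAL-style actor: encapsulated state, input/output token
# queues, and two alternative "actions" whose selection is guarded by
# the value of the incoming token.
from collections import deque

class ClipActor:
    def __init__(self, limit):
        self.limit = limit          # internal state, hidden from other actors
        self.inp = deque()          # input port
        self.out = deque()          # output port

    def fire(self):
        if not self.inp:            # firing condition: token availability
            return False
        t = self.inp.popleft()
        if t > self.limit:          # guard on the token value
            self.out.append(self.limit)   # action "saturate"
        else:
            self.out.append(t)            # action "pass"
        return True

a = ClipActor(limit=5)
a.inp.extend([1, 9, 3])
while a.fire():
    pass
print(list(a.out))   # -> [1, 5, 3]
```

Note that no other component ever touches `a.limit` directly; as in CAL, the only way to influence the actor is to feed it tokens.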

Every actor is composed of one or several actions. Within an actor, the operations are executed sequentially; at the network level (several actors connected together), the actors work concurrently, each one executing its own sequential operations. The actions of an actor can (1) read and consume input tokens, (2) modify the internal state of the actor or (3) produce output tokens. Each action defines a kind of transition and can be fired under some conditions: (1) the availability of input tokens, (2) the value of input tokens, (3) the state of the actor or (4) the priority of that action. How the ACTORS project can improve current system design methodologies through the use of the CAL language is discussed in Section 2.5.

2.2 Tools for complexity analysis

When dealing with large software specifications in C/C++ or other imperative languages, it is difficult for designers to implement these systems by relying on their intuition alone. Because systems become more and more complex, designers need tools to understand and analyze the specifications rapidly in order to obtain an efficient implementation quickly. This section presents tools for analyzing complex C/C++ programs; the Software Instrumentation Tool (SIT) [21], an interesting tool for high-level complexity analysis, is described below. Complexity analysis consists in determining computational parameters such as the number of arithmetic, control and memory-access instructions, together with the task distribution, which reflects the level of parallelism. Given an executable specification of the application, meaningful information can be extracted from static and dynamic profiling to guide further design decisions. An example is generating a histogram of the instructions used, e.g. data transfer vs. control vs. computation vs. bit-level operations, which could suggest further exploration steps towards architectures supporting the most frequent operations in hardware.
Profiling results can therefore be used as an affinity metric towards certain design decisions. Complexity and memory analysis may be performed at different abstraction levels, i.e. at different stages in the design cycle. The most precise complexity evaluations can be obtained at the last stages of the design cycle, that is, when the entire architecture has been designed and simulated. Nevertheless, it is much more valuable to analyze the complexity of a system at the beginning of the design cycle, in order to make the right design choices at the early stages of the design flow. The number of re-design iterations (and the overall design development time and cost) is thereby reduced.

Static Approaches

The methods based on a static analysis of the source code range from simple counting of the number of operations appearing in a program up to sophisticated approaches that determine lower and upper bounds on the running time of a given program on a given processor [22, 23]. While the simple counting technique provides a very accurate evaluation of the operations, it cannot handle loops, recursion and conditional statements except in some particular cases. Explicit or implicit enumeration of program paths can handle loops and conditional statements and can yield best-case and worst-case bounds on run time [22, 23]. The main drawback of these techniques is that the real processing complexity of many algorithms depends heavily on the input data, while static analysis depends only on the algorithm. For video coding algorithms, for instance, strict worst-case

analysis can lead to results one or two orders of magnitude higher than the typical complexity values [24]. Consequently, the best-case to worst-case range is so wide that the results are meaningless.

Instruction Level Complexity Analysis

Instruction-level profiling provides the number and type of processor instructions that a program executes at runtime. These data give information on computational, control and memory-access costs and can be used for complexity evaluation, as well as for performance tuning of programs and algorithms. There are many tools for program instrumentation (Abstract Execution [25], QPT [26, 27], ATOM [28], ATOMIUM [29], compilers [30, 31], the Executable Editing Library [32], the iprof tool [33], Pixie [34], profilers such as GNU gprof [31, 35], micro-profiling [36, 37]), but the main drawback of these approaches is that the analysis is based on modification of the compiled executable and is consequently neither platform- nor compilation-independent. A reliable algorithmic complexity analysis at a high level of abstraction must be performed at the abstraction level of the source-code language, i.e. at the same abstraction level as verification models, which are conceived and developed not for a specific target architecture but for algorithmic specification and validation purposes only. Any complexity analysis based on modification of the compiled code (either the object files or the executable code) inevitably takes into account the code transformations due to the compilation process (e.g., source-code to intermediate-format transformations, optimization transformations, target-architecture-specific transformations) and depends strictly on the instruction set of the target architecture. When a pure algorithmic complexity analysis at a high abstraction level is of concern, the results of a platform- or compilation-dependent analysis could yield misleading complexity evaluations because of the aforementioned transformations.
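The data-dependence problem noted above, where static worst-case bounds sit an order of magnitude or more above typical counts, can be made concrete with a toy example. All numbers here are invented for illustration; no real codec is modeled.

```python
# Toy illustration of static worst-case counting vs. measured,
# data-dependent operation counts.
def ops_executed(data, threshold):
    """Count the 'operations' actually executed; the expensive
    processing runs only for samples above the threshold."""
    count = 0
    for x in data:
        count += 1                    # the comparison itself
        if x > threshold:
            count += 100              # costly per-sample processing
    return count

data = [0] * 95 + [1] * 5             # only 5% of samples take the costly path
static_bound = len(data) * 101        # static analysis: assume worst case everywhere
measured = ops_executed(data, 0)
print(static_bound, measured)         # 10100 vs 600: a ~17x overestimate
```

A static path enumeration must assume the expensive branch on every iteration, while the actual cost is dominated by the input distribution, which only dynamic profiling can reveal.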
Software Instrumentation Tool (SIT)

Because the foregoing methodologies for analyzing complexity are always platform- or compilation-dependent, the Software Instrumentation Tool (SIT) [21] has been developed with the goal of measuring the complexity of a specific algorithm independently of the hardware architecture on which the software model of the algorithm is run. Architectural information can be extracted from non-architectural C algorithms. This approach aims at extracting useful metrics for the design of an architecture, optimizing data transfers, memory bandwidths and storage requirements. This optimization step can be achieved directly on algorithm specifications at a high abstraction level. The new approach of SIT is made possible by a breakthrough in instrumentation/overloading technology enabling a complete detection of all C operators, without any limitation in the way pointers and data structures are used. SIT can be seen as a virtual machine for running C source code. All the operations performed during the execution of the instrumented model are intercepted and processed, providing an exhaustive basis for computational complexity and architectural analysis. Customizable virtual memory architectures can be plugged into the virtual machine, extending the analysis capabilities to the data-transfer and storage domain. The whole instrumentation process, from the source files to the instrumented executable, is completely automatic. The instrumented executable can be run on real input data to produce the complexity analysis results.

In a nutshell, SIT allows a pure algorithmic complexity analysis at the source-code level. This analysis depends neither on the underlying platform nor on the compilation. The analysis is performed with real-life input data in order to capture a true execution of the programs in real conditions. There are no limitations for ANSI C and K&R compliant C code. It allows a fully customizable memory simulation, useful for exploring the design space of memory architectures. SIT provides a number of analyses which are extremely useful at the early stages of the design:

- The complexity analysis provides information on the type and number of operations executed. The type of the operands associated with each operation is also available. It helps the designer to partition the system into software and hardware, because hardware is better for intensive computing and software for control.
- The dataflow analysis gives the data transfers between the different functions of the program. It becomes very easy to visualize at a glance where the large data transfers among the functions are. This is useful information when designing the architecture of the system.
- The critical path analysis shows the functions which belong to the critical path. Understanding which parts of the algorithm are critical is very important. For instance, it can help the designer to improve the throughput of the system by shortening the critical path as much as possible. It also gives a useful indication for the HW/SW partitioning step.
- The memory analysis shows what type of memory operations are performed.
- The memory architecture analysis helps the designer to find an optimal memory architecture for a given algorithm. The designer runs the C/C++ code with a virtual memory architecture and the tool provides the number of read or write misses and the read or write hits. Memory is a very important issue in multimedia systems.
Thus, it is very important to find the optimal solution at the first stage of the design flow.

- The execution tree indicates the execution order of the functions. It also provides the hierarchy of the function calls. How many times a part of the algorithm is run is also valuable information for detecting compute-intensive blocks.

2.3 Heuristics and metrics for system analysis

Because of the increasing complexity of systems, design problems become NP-complete, and thus good heuristics must be found for the design of complex systems. An important step in design space exploration is to evaluate the intermediate and final mappings. Nowadays, it is no longer possible (in a reasonable time) to find an optimal partitioning by relying only on the intuition of the designer. Given a set of meaningful metrics, the designer will be able to partition the system between hardware and software. There are three major aspects in embedded systems: computing, communication and memory. This part briefly reviews the metrics and heuristics for characterizing complex systems for each of these aspects, but also for global and system-level aspects.

Common global metrics

There are a number of well-known metrics used to characterize systems. They generally concern the properties of the overall system and are typically used directly as optimization goals, driving the optimization process. These metrics can be applied to every system.

A system can be characterized by its speed in different ways: the global throughput for computations and communications, the total amount of data processed or transferred, the reactivity of the system in the case of real-time systems, the global latency of the system, the time necessary for a given amount of computations, or the clock speed supported by the design.

Power dissipation is becoming a more and more important issue in the design of systems, and this criterion must be taken into account in the design flow. On the one hand, systems optimized for speed have to cope with the heat of components, which could degrade the lifetime of the system. On the other hand, embedded systems have to deal with power leakage during idle periods.

The area consumption of the design in a target technology partially determines the cost of the entire design. Reducing the area of the global design allows the company to use smaller and cheaper chips for their system, resulting in a reduced cost of the design. Other costs, like packaging, fabrication and engineering costs, are also taken into account by companies during the design flow.

Computing-related metrics

Some parts of an algorithm are better suited for execution on a given type of target: computing-intensive tasks are better suited for hardware, and control tasks for software. The aim of these metrics is precisely to detect which kind of target is best suited for the different parts of the algorithm. An affinity metric is defined in the work of Sciuto et al. [38, 39]. The considered targets are GPP, DSP and ASIC. They define a set of constructs which are better suited for the different targets.
The affinity metrics are the result of the analysis and highlight instruction sequences which are DSP-oriented (buffer circularity, MAC operations inside loops, etc.), ASIC-oriented (bit-level instructions) or GPP-oriented (conditional or I/O instruction ratios). An interesting method for selecting processors is presented in [40]. Three metrics give the orientation of functions in terms of control, data transformation and data accesses, by counting specific instructions from a processor-independent code. Then a distance is calculated with specific characteristics of processors regarding their control, bandwidth and processing capabilities. It does not, however, take instruction dependencies into account, and there is no detail about the different types of memory accesses with respect to the abstract processor model used. This approach is at a very low level of abstraction, dealing with specific processor characteristics to choose the best processor for a given application. A regularity metric is defined in [41]: if a given algorithm exhibits a high degree of regularity, i.e., requires the repeated computation of certain patterns of operations, it is possible to take advantage of such regularity to minimize implementation overhead. It is interesting to detect these different types of regularity in order to optimize the design for one special target. The Control Orientation Metric (COM) has been introduced in [42]. This metric indicates the frequency of control operations which cannot be eliminated at compile time, and is useful for evaluating the need for complex control structures to implement a function. Functions with high COM values will be more efficiently implemented on a general-purpose processor than in hardware (in which large FSMs would be needed). All these metrics are interesting because they help the designer in the HW/SW partitioning step. If a metric can indicate to the designer that a part of the code

is more suited for GPP or FPGA or even DSP, it saves a great deal of time in the design process and can lead to better implementations in terms of efficiency and power consumption. Mapping control tasks onto hardware and computing-intensive tasks onto processors would lead to an inefficient system. These metrics can help to quickly detect the control and processing functions in the program.

Communications-related metrics

Communications are very important in embedded system design because they can constitute a serious bottleneck if they are not well studied. Methodologies help the designer to handle these communications at a high level of abstraction. Among them, clustering techniques aim at gathering functions/tasks/operators which are closely related and at separating those which are independent. These techniques can reduce the communication overhead in complex systems. They can be applied at different levels of abstraction. The metrics defined in [43, 44] are computed at the functional level to highlight resource, data and communication channel sharing capabilities, in order to perform a pre-partitioning resulting in function clustering to guide hardware/software partitioning. The aim is to optimize communications and resource sharing. Locality of computations was first defined in the work of P. Peixoto et al. [41]. Computations within an algorithm are said to have a high degree of locality if the algorithm-level representation of those computations corresponds to an isolated, strongly connected (sub)graph. If an algorithm exhibits a good degree of locality of computations, i.e., has a number of strongly connected clusters of computations, such clusters define a clear way of organizing the functional units into modules so as to minimize the number of global buses, control and steering logic. At a lower level of abstraction, other communication-related metrics are presented in [45].
The goal is to quantify the communications between arithmetic operators (through memory or registers). These metrics focus on a fine-grain analysis and are mainly used to guide the design of data paths, especially to optimize local connections and resource reuse.

Memory-related metrics

Memory-specific metrics are important in order to evaluate the memory architecture of the system. Choosing a good cache architecture in multiprocessor systems can lead to a global speedup of the system. The number of conflict and capacity misses, the cache hit and miss ratios, as well as the locality of accesses extracted from the application, are extremely useful in the early stages of the design flow. These metrics can be extracted with the Software Instrumentation Tool described in section 2.2. The Memory Orientation Metric (MOM) [42] indicates the frequency of global memory accesses, i.e. accesses to input/output data. By referring to this metric the designer can see which functions require special care for implementation: those with large MOM values are most likely to require a good data bandwidth. The MOM metric also indicates the potential need for a memory hierarchy, since this metric is computed for all the hierarchy levels of a function graph.

System-level techniques

There are other metrics which try to characterize a system at the system level. The notion of flexibility was first described in [46]. The flexibility of a design is difficult to express quantitatively. However, many fundamental design decisions

are based on the need for programmability or dynamic reconfigurability, in order to extend the lifetime of a design, to be able to incorporate late fixes due to, for instance, changes in communication standards, or to ease the remote maintenance of an embedded system. An application is described as a hierarchical DAG where hierarchy levels represent different options (algorithms) to implement the same functionality, denoted by the parent node in the graph. An architecture is considered more flexible the more options described in the application DAG it can implement, given timing and cost constraints.

Today, parallelism is an important aspect of systems. Thus, knowing which parts of an algorithm are inherently parallel is very valuable information for designers. In the Software Instrumentation Tool (SIT), the critical path analysis gives precise metrics showing the potential parallelism available in the program. In [42], a metric is defined which indicates the average parallelism of a function. The functions having the largest values offer more optimization opportunities, since they are likely to present a number of implementation alternatives offered by their inherent parallelism.

2.4 Current frameworks to bridge the implementation gap

This section details the current approaches aiming at bridging the implementation gap, i.e. mapping an algorithm onto an architecture and finding the most efficient architecture for a given application. The reader can refer to the following reviews [47, 48, 49, 50, 51, 52] for more details.

Handling parallelism using Process Networks and Actors

Process networks and actor programming are interesting solutions for handling parallelism in future systems.

Daedalus Framework

Daedalus [53] provides a unified environment for rapid system-level architectural exploration, high-level synthesis, programming and prototyping of multimedia MP-SoC architectures.
In this framework, the implementation is based on a library of pre-determined and pre-verified IP components. The Daedalus design flow is strongly related to SPADE [19], Artemis [54] and the Compaan framework [55]; the latter enables the automatic transformation of nested loop programs written in Matlab into Kahn-like process networks. The Daedalus framework is based on Kahn Process Networks (KPNs) [2]. The design flow is fully automatic and comprises several tools. The KPNgen tool automatically converts sequential applications (written in C/C++) into a parallel Kahn Process Network. These KPNs are inputs to the Sesame modeling and simulation environment, which performs system-level architectural design space exploration. The output of the Sesame tool is a set of candidate system designs with their corresponding specifications, including a system-level platform description, an application-architecture mapping description, and an application description. The ESPAM tool takes these specifications, together with RTL descriptions of the corresponding components from the IP library, to perform VHDL synthesis; it implements the candidate MP-SoC platform architecture. C/C++ code is also generated for application processes that are mapped onto programmable cores. This implementation can be mapped onto an FPGA using commercial synthesis tools and compilers. The Daedalus framework is very interesting, making the design flow fully automatic from Kahn Process Networks or directly from C/C++ specifications. KPNs are

well suited for signal processing systems. Nevertheless, some modifications of the C/C++ specification are sometimes needed when the input specification does not entirely meet the requirements of the KPNgen tool, but these modifications do not seem to be very time-consuming. The modeling of interrupts is difficult because of the nature of KPN models, which limits the study of time-dependent systems.

METROPOLIS framework

Metropolis [56, 57, 58] is a framework allowing the description and refinement of a design at different levels of abstraction; it integrates modeling, simulation, synthesis, and verification tools. Metropolis is based on the Polis design environment [59]. The function of a system, such as the application, is modeled as a set of processes that communicate through media. Architecture building blocks are represented by performance models where events are annotated with the costs of interest. A mapping between functional and architecture models is determined by a third network that correlates the two models by synchronizing events (using constraints) between them. The framework uses an internal representation called the Metropolis Meta-Model (MMM). It uses different classes of objects to represent a design, in which processes, communication media and netlists describe the functional requirements in terms of input/output relations.

Mescal

The Mescal project [16], from the University of California at Berkeley, aims at designing heterogeneous, application-specific, programmable (multi)processors. The goal is to allow the programmer to describe the application in any combination of models of computation that is natural for the application domain, and to find a disciplined, correct-by-construction abstraction path from the underlying micro-architecture to an efficient mapping between application and architecture. The architecture development system is based on an architecture description language.
The Mescal architecture development framework (called Teepee) implements a design space exploration methodology based on the Y-chart.

Peace/Ptolemy

The PeaCE environment [60] specifies the system behavior with a heterogeneous composition of three models of computation. It provides a seamless co-design flow from functional simulation to system synthesis, exploiting the features of the formal models maximally during the whole design process. This framework is based on the Ptolemy project [1, 61]. When dealing with C/C++ specifications, the PeaCE approach does not propose an automatic procedure to transform the specification into dataflow graphs. This step is manual, and an example of this transformation on an MPEG-4 decoder is given in [62]. Converting such a sequential specification (C/C++) into dataflow graphs constitutes a time-consuming and burdensome task. Ptolemy focuses on component-based heterogeneous modeling. It uses tokens as the underlying communication mechanism. Controllers regulate how actors in the design fire and how tokens are used to communicate between them. This mechanism allows different MoCs to be constructed within Ptolemy II.

C-based methodologies

Another solution consists in starting the design flow directly from a specification written in an imperative language such as C/C++.

C to Gates tools

The corresponding design flows (ImpulseC [63], Handel-C [64, 65] and Spark [66]) generate VHDL code from C-like specifications. With these tools the design space is far from being completely explored, because they handle only hardware code generation; heterogeneous systems composed of processors and FPGAs cannot be dealt with. At best, these flows can be used to generate code corresponding to the hardware parts of the system, but they are not able to handle an entire heterogeneous system and explore its design space. The Handel-C and ImpulseC design flows allow some partitioning, but the well-known C-to-gates tool from Mentor Graphics (CatapultC [67]) does not allow any partitioning, making the design of heterogeneous systems difficult. Other C-to-gates tools are available, but they do not allow any partitioning either and are not incorporated in any design flow capable of handling heterogeneous systems. Commercial offerings such as Mentor CatapultC, Celoxica Handel-C [68], C2Verilog and Bach [69] define a subset of ANSI C to do either synthesis or verification. De Micheli and his students discussed in [70, 71] the main problems of using C as an HDL: concurrency and the concept of time are missing in C, and the way communication mechanisms are written in imperative languages is not suited for hardware representation.

.NET framework based tool

The .NET framework based tool [72] supports a design methodology which unifies and automates the hardware and software design flows. It is capable of automatically refining a high-level system specification, given in a programming language supported by the .NET framework, to a hardware/software description. The .NET framework [73] is a set of tools and mechanisms for the development of multi-lingual software components. The algorithm is described with a high-level programming language supported by the .NET framework and is compiled into the Common Intermediate Language (CIL) [74].
The hardware is described in CASM (Channel-based Algorithmic State Machine), an intermediate-level hardware description language [75]. The design flow based on the .NET framework is interesting because it lets designers specify the system at a much higher level of abstraction than traditional frameworks, with a large number of different programming languages, making the design flow more flexible. But design space exploration tools are missing to help the designer decide on the partitioning of the system. It is a one-way design flow, and there is no way to back-annotate the specification in order to reach an optimal design.

Transaction Modeling & SystemC

This section reviews some methodologies based on Transaction Level Modeling (TLM) and SystemC.

StepNP

StepNP [76] is a system-level exploration framework based on SystemC, targeted at network processors and developed by ST Microelectronics in collaboration with universities. It enables rapid prototyping and architectural exploration. It provides well-defined interfaces between processing cores, co-processors and communication channels to enable the usage of component models at different levels of abstraction. It enables the creation of multi-processor architectures with models of interconnects (functional channels, NoCs), processors (simple RISC), memories and coprocessors.

BlueSpec

BlueSpec [77] takes as input a SystemVerilog or SystemC subset and manipulates it with technology derived from term rewriting systems (TRS) [78], initially developed at MIT by Arvind et al. It offers an environment to capture successive refinements of an initial high-level design that are guaranteed correct by the system. Another approach [79], more focused on SW, consists in making high-level transformations on the source code in order to take into account the target on which the algorithm is going to be implemented. This work is useful when mapping a program onto processors, but does not apply to the hardware domain.

HW/SW interface design

Hardware/software interface design constitutes a bottleneck in embedded system design. The work of Jerraya et al. [80] therefore focuses on interface design for multiprocessor SoCs. The SystemC language, based on C/C++, is a solution for representing functionality, communication, software and hardware at various system levels of abstraction. SystemC is mainly used for simulation because it is not synthesisable in its whole generality. In practice, it can be restricted to a synthesisable subset, but this subset of the language is at a very low level of abstraction. Simulation at a high level of abstraction is an interesting feature, but during the implementation step the designer has to re-write the code using only the synthesisable subset, which is a burdensome, time-consuming and error-prone task. In a nutshell, it is hard to use SystemC as a high-level design language because it is not possible to implement systems from the high-level specification.

From graph specifications

In a more abstract way, algorithm specifications can be described using different types of graphs. The three frameworks presented here use, respectively, Petri nets, control data flow graphs and synchronous data flow graphs.
CodeSign

The CodeSign framework [81] uses Object Oriented Time Petri Nets (OOTPN) as its modeling language. This language is powerful enough to model real-time systems, and it can be analyzed mathematically. The notion of time allows performance evaluation at the early stages of the design. In order to facilitate the hardware/software partitioning task according to system constraints, CodeSign supports generic models to which implementation details are added as the design is refined. Interfaces between hardware and software are automatically inserted according to the protocol specifications. Since the configuration of the interfaces can have a large impact on the system performance, CodeSign allows exploring different implementations derived from a master specification. CodeSign provides C and VHDL code generators for software and hardware implementation. The CodeSign project produced the Moses Tool Suite, a tool for modeling, simulating and evaluating heterogeneous systems using Petri nets. This tool supports the CAL language [20], a dataflow- and actor-oriented language especially suited for modeling and simulating signal processing systems.

Trotter design flow

The Trotter design flow [82] enables rapid prototyping and design space exploration of applications specified using an internal graph representation of the application. In the very early steps of the design flow, the framework provides useful metrics allowing the designer to evaluate the impact of algorithmic choices on resource requirements in terms of processing, control, memory

bandwidth and potential parallelism at different levels of granularity. The system is specified using the C language and is immediately transformed into Hierarchical and Control Data Flow Graphs (HCDFG) for the rest of the design flow. The required information and the results are stored as attributes in the graph. An HCDFG is composed of elementary nodes (processing, memory, control), dependence edges (control, data) or subgraphs. The goal of this work is to perform the algorithmic exploration automatically and rapidly for the functions called from the event-based level. Tools are available for simulation, formal proof and code generation at the event-based level, but they do not consider any path to hardware.

SynDEx

SynDEx [83, 84] is a graphical and interactive tool implementing the Algorithm Architecture Adequation (AAA) methodology [85]. Within this environment, the designer defines an algorithm graph, an architecture graph and system constraints. The methodology is based on graph models that exhibit both the potential parallelism of the algorithm and the available parallelism of a given multi-component architecture. It consists in distributing and scheduling the algorithm graph on the multi-component architecture graph while satisfying real-time constraints. The heuristics take into account the execution durations of computations and of inter-component communications. The result of the graph transformations is an optimized Synchronized Distributed Executive (SynDEx), automatically built from a library of architecture-dependent executive primitives composing the executive kernel. SynDEx is computer-aided-design software aiming at mapping an algorithm onto an architecture. The architectures considered are composed only of several processors; hardware logic such as FPGAs cannot be taken into account in this flow. The design space exploration is done according to a single criterion, throughput.
UML-based design flow

The Unified Modeling Language (UML) is widely used in software engineering for designing large software programs. The most complete design flow using UML for system modeling is achieved by Kukkala et al. and is called Koski. The target of the Koski design flow [86] is the multiprocessor System-on-Chip (SoC). It is a library-based method that hides unnecessary details from high-level design phases, but does not require a plethora of model abstractions. The design flow provides an automated path from UML design entry to FPGA prototyping, including functional verification, automated architecture exploration and back-annotation. The design of the architecture is based on the application model, resulting in an application-specific implementation. The flow has been successfully applied to a WLAN video terminal [87], targeting a multiprocessor SoC platform implemented on an FPGA. The UML models (i.e. application, platform and mapping models) are written based on the designers' experience, and there is no support for their elaboration. Furthermore, it is questionable whether UML is a good way to express parallelism. When dealing with C/C++ specifications, UML models must be written, which can be a burdensome task for the designer. The design space is restricted to multiprocessor SoCs. Another aspect is highlighted in [88]: graphical languages are not well accepted because they are slower to use than writing code. Coding an algorithm in a textual manner is more productive than drawing the equivalent flow chart.

Commercial tools

There are several commercial tools helping in the design of complex systems, at different levels of abstraction.

Matlab-Simulink and LabVIEW

Simulink from The MathWorks [89] and LabVIEW from National Instruments [90] are the best-known tools for modeling control and signal processing applications in a convenient graphical environment. These tools focus on the functional aspects of the algorithm and not really on the implementation aspects, even if some progress has been made in this domain. For example, the Xilinx System Generator for DSP tries to fill the gap between the high-level abstract version of a design and its actual implementation in an FPGA.

Esterel [91] is good at describing reactive systems. It is both a programming language and a compiler which translates Esterel programs into FSMs. It is a synchronous language, well suited for programming real-time control systems. The compiler generates software or hardware implementations: it generates C code which is integrated into a larger program that handles the interface and data manipulations, and it can also generate hardware in the form of netlists of gates, which can then be embedded in a larger system. This language is more suited for control and real-time systems. Furthermore, the level of abstraction of this language is quite low, and the development time is longer than with high-level languages.

Electronic System Level (ESL) tools

The Signal Processing Worksystem (SPW) [92] is a tool that used the dataflow formalism even earlier than Ptolemy; it has since been acquired by CoWare. CoWare proposes a system-level tool performing design, analysis and simulation of systems specified in SystemC and using TLM. Synopsys System Studio [93] is a model-based design and analysis tool helping the designer to build systems at a high level of abstraction. It supports all the models of computation supported by SystemC. It is possible to run co-simulations of HDL descriptions with SystemC models or Matlab algorithms.
This tool also supports hardware synthesis from SystemC. C/C++ and SystemC are used to describe a high-level model or to integrate existing Intellectual Property (IP) blocks. Because it is hard to make reusable models in C/C++, SystemC is often used as an encapsulation mechanism. The problem with these tools is that there is no explicit methodology which helps the designer make the right mapping decisions.

Rapid Prototyping framework

Virtual Socket Platform

The Virtual Socket concept has been presented in detail in [94, 95, 96, 97]. It has been developed to support mixed specifications of MPEG-4 Part 2 and Part 10 (AVC/H.264) in terms of reference SW, including the plug-in of HDL modules. The platform consists of a standard PC, on which the SW is executed, and a PCMCIA card containing an FPGA, on which the HDL modules run. The goal of this platform is to provide a "direct map" of any SW portion to a corresponding HDL specification, without the need to specify any data transfer between the HW and SW parts of the system explicitly. Specifying the data transfers explicitly is a burdensome task when dealing with complex systems. HDL modules and the software algorithm share a unified virtual memory space, and transfers of data between the software program and the HDL modules are handled automatically. The clear advantage of such a solution is that the data transfers needed to feed an HDL module are automatic and can be directly profiled, so as to explore

different memory architecture solutions. Another advantage of such a direct map is that conformance with the original SW specification is guaranteed at any stage, and the generation of test vectors is naturally provided by the way the HDL module is plugged into the SW section. This framework can be very useful for rapid prototyping in the early stages of the implementation phase. Specifying the data transfers can be extremely painful for the designer in the case of complex multimedia systems, so the seamless integration of HDL modules into SW specifications results in a speedup of the prototyping phase.

Summary & discussion

This section discusses the different existing solutions and tries to clarify the situation in embedded system design. Table 2.1 summarizes the characteristics of the different design flows. The characteristics used for the comparison are the following:

Algorithm model: representation of the application used in the methodology.

Architecture model: representation of the architecture used in the methodology.

Design Space Exploration: tools for exploring the design space (partitioning, scheduling, simulation, etc.). Automatic exploration means that the corresponding tools are able to explore designs automatically by evaluating possible permutations of design parameters.

Target (generated code): the type of platform that is targeted.

2.5 The approach in the ACTORS project

The ACTORS project is based on the actor/dataflow-oriented language CAL [20]. This section shows why the CAL language can be a good support for an entire design flow in ACTORS. CAL has been chosen because it has some interesting properties which can be extremely useful for next-generation design flows in the case of highly parallel complex systems. The characteristics that make CAL well suited to support a design flow are detailed below.
Expressing concurrency: The ability of CAL to describe concurrent problems naturally makes actor modeling a very good fit for the system design of parallel algorithms and streaming applications.

Compactness: The MPEG-4 Simple Profile decoder has been implemented manually in different languages such as C/C++, VHDL and CAL. The full implementation is composed of approximately 4000 lines for CAL and 4100 lines for an optimized version in C/C++, while the VHDL version requires considerably more. This shows how concise CAL is: by raising the level of abstraction, CAL needs fewer lines of code to fully describe a given algorithm. CAL is not overloaded with implementation details, which is a clear advantage: it focuses only on the description of the algorithm itself and on how data is generated and consumed by the different components. Implementation details, such as the scheduling of operations, are left to code generators.

Name | Algorithm model | Architecture model | Design Space Exploration | Target (generated code)
.NET Framework | C or other language supported by the .NET framework | CASM | None | SW (ASM) and HW (VHDL)
Daedalus | C/C++ or KPN | Pearl or SystemC | Automatic, back annotation | HW (VHDL) and SW (C)
Koski | UML profile: TUT | UML profile: TUT | Partitioning, exploration, synthesis | Multiprocessors on FPGA using up to 5 NiosII (C)
SynDEx | SDF | Abstract | Throughput-dominated exploration | SW (C) for multiprocessors
Trotter | C | UAR model | Partitioning, automatic exploration with metrics | None
Mescal | Mixed MoC | MoC, Teepee | Manual | ASIP (ISA-independent language)
StepNP | Click | SystemC | Manual | Processor networks
Metropolis | Meta Model Language (functional model) | Meta Model Language (architecture model) | Simulation, formal verification, QSS scheduling | None
Peace | Dataflow (data), FSM (control) | Abstract | Automatic DSE | More focused on MPSoCs (C) but also HW (VHDL)

Table 2.1: Summary of the most interesting design flows

Analysis: The CAL language allows the analysis of individual actors and of networks of actors. For each actor, CAL provides statically analyzable information about its behavior, such as the number of tokens it produces and consumes in each firing, the necessary conditions for its firing, and what those conditions depend on. This information is very useful for effectively scheduling, composing and implementing those actors.

Portability: CAL defines models in a concise and clear way, and CAL models are completely independent of any implementation. This independence eases the integration, exchange and development of actors, and different implementations for several targets can easily be built from the same models. The encapsulation property of actors is also very convenient: global variables do not exist, which makes the integration of external actors (not written by the designer himself) into the design much easier.

Simplicity of actor design: CAL offers compact, clear and precise semantics, tailored to the constraints of actor design, which facilitates the readability and maintainability of actors. Ease of programming is also necessary for the language to be accepted by the scientific community.

Hardware and software code generation: The CAL language is a good starting point for hardware and software code generation. The first results, presented in [98], show that efficient hardware code can be generated directly from CAL models. They show the quality of the results produced by the RTL synthesis engine for a real-world application (an MPEG-4 Simple Profile video decoder): the code generated from the high-level dataflow description actually outperforms the handwritten VHDL design in terms of both throughput and silicon area. C code can also be generated from CAL models. The first results, presented in [99], show significant improvements compared to CAL dataflow simulation with the Open Dataflow environment [100]. The compilation process has been successfully applied to the same MPEG-4 Simple Profile video decoder. The synthesized software (10600 lines of code) is faster than the simulated CAL dataflow (about 20 frames/s instead of 0.15 frames/s) and close to real time (25 frames/s) for the QCIF format on a standard PC platform. It is interesting to note that the model is scalable: the number of macro-blocks decoded per second remains constant when dealing with larger image sizes.

2.6 Potential research directions in the project

CAL is a very interesting candidate for supporting a complete design flow. Once written in CAL, an algorithm can be analyzed, optimized and ported to hardware through a set of tools. The characteristics of this language make CAL a very good candidate for use in the ACTORS project.
CAL is not intended as a complete programming language, but rather as a small and concise language which must be embedded in another environment that provides the necessary infrastructure. It was designed to coordinate scheduling and communication in a component-based framework. CAL facilitates the use of several techniques for checking compatibility between connected actors. Production and consumption rates may be extracted from a CAL actor and can be used to statically check a synchronous dataflow model built from CAL actors. This is an excellent starting point, but it can be extended in several ways. This section details the potential research directions in the ACTORS project.

Efficient code generation: Although a model of a system defined in CAL looks very compact and clear, it is clearly a challenge to write a compiler whose generated code can compete with handwritten code in terms of compactness and execution speed. Depending on the time constraints and the importance of execution speed on a given platform, this may or may not be a serious challenge. CAL2C and CAL2HDL are promising tools for code generation and are still under development.

Model-Driven Development: The paper [88] presents the pros and cons of the Model-Driven Development methodology, which consists of describing systems using different languages at different abstraction levels, from a high-level description down to the implementation. Hailpern et al. highlight that these languages often lack the precise semantics needed to automate the process. CAL has concise and precise semantics, enabling automated tools such as code generators. The choice of a minimal semantic core makes language conversion much easier. Another highlighted aspect is the problem of consistency between the successive languages used during the entire design flow. According to Hailpern et al., "the more models and levels of abstraction that are associated with any given software system, the more relationships will exist among those models. [...] The basic problem is that the introduction of multiple, interrelated representations implies the issue of assuring their mutual consistency, a very difficult problem." [88]. In order to avoid these consistency problems between different levels of abstraction, the number of representations used during the design flow must be reduced as much as possible. Either the transformations between the different models are improved, or the entire system is built using a single language capable of representing it at different levels of abstraction: a first high-level view to capture the functionality of the algorithm, a second view to take the architecture into account, and a third view capable of evaluating the design using a set of metrics. In ACTORS, CAL is intended to support both architecture-aware and architecture-agnostic models. The functionality of the algorithm can be described with an initial model in CAL; as the design progresses and the architecture becomes known, the CAL model is refined in order to take the underlying architecture into account. Thus, the major design steps are made at the CAL level, and the constraints from the hardware must be raised to the CAL level. However, consistency problems will still lie in the conversion of CAL into implementation languages such as VHDL/Verilog or C. Currently, CAL is not entirely supported by the hardware and software synthesis tools, which are still under development in order to extend their support of the language.
Reducing as much as possible the number of languages used to represent the system along the design flow is an important point in ACTORS.

KPNGen from the Daedalus framework: The Daedalus framework is a complete design flow, starting from a specification (C/C++) down to an implementation (HW and SW). It has some interesting aspects, such as the conversion of the specification algorithm written in C/C++ into a Kahn Process Network (KPN), represented as a "network of small C programs". KPN and CAL models use the same approach, which consists in viewing a system as a set of actors communicating through unbounded FIFOs. This makes the KPNGen tool a potential starting point for the conversion of C/C++ algorithms into CAL models.

Imperative program representation using graphs: There is interesting work in the Trotter design flow on the way the initial algorithm, written in an imperative language, is represented. Hierarchical Control and Data Flow Graphs (HCDFG) are used to represent a complete view of the specification algorithm, and a number of metrics are defined and extracted from the analysis of these HCDFG graphs. Such metrics can be very interesting in the transformation process from C/C++ to CAL.

Software Instrumentation Tool: The SIT provides a dynamic analysis of a specification algorithm written in C/C++. It extracts run-time metrics which can be very useful for understanding the bottlenecks of the algorithm and for determining a well-suited architecture. This tool can be very useful in the design space exploration step at the CAL level. Thanks to the CAL2C tool, one can dynamically analyze an entire system using the SIT and perform an accurate analysis of the

system. Knowing how to evaluate a system composed of several actors using the SIT is very interesting, because it will guide the entire design space exploration step.

Virtual Socket Platform: The Virtual Socket platform is a framework enabling the designer to call HDL modules from a SW program running on a standard PC. It would be interesting to investigate how this framework could help in the implementation phase of the system, or even in the design exploration phase. If the use of this framework is straightforward, one can imagine using it to back-annotate CAL models with the first implementation results of a prototype.

Relation to work packages

This part of the state of the art is closely linked to Work Package 1 (WP1), which aims at developing methods for designing CAL models. These models are analyzed and, given the extracted metrics, the design space can be explored in order to obtain a (sub)optimal solution.

Work Package 1: Work Package 1 is composed of several tasks which are all related to this chapter of the state of the art.

Task 1.1, CAL Methodology, deals with the early stages of the design flow. This task aims at finding a methodology for building CAL models from C/C++ programs. By dynamically analyzing the C/C++ specification using the Software Instrumentation Tool (SIT) [21], meaningful metrics can be extracted to build CAL models. It would be interesting to look at the KPNGen tool from the Daedalus framework before developing our own methodology from scratch. The Trotter framework also presents an interesting approach for representing C/C++ algorithms with hierarchical control dataflow graphs, and proposes some interesting evaluation metrics which can help characterize sequential programs.

Task 1.2, Design Space Tool, aims at extracting meaningful metrics from the CAL models, considering them as sequential code.
The tools used in the analysis of C/C++ can be reused here to analyze the actors composing the models. The aim is to characterize the models with a set of meaningful metrics. The Software Instrumentation Tool (SIT) [21] is a very good starting point for characterizing algorithms written in C/C++, and the Virtual Socket platform can also be useful when characterizing a system implemented in hardware.

Task 1.3, Design Space Exploration, aims at partitioning CAL models according to the criteria chosen by the designer to guide the design exploration step. The analysis of CAL models is also studied in this task, to obtain further information about the networks of actors. Designing the design space exploration algorithms at a high level of abstraction also belongs to this task. It would be interesting to have a closer look at the Daedalus framework to understand how design space exploration is done using KPN models, which are quite close to CAL models.

Work Package 5: Work Package 5 aims at applying the methods developed in the project to case studies, thereby validating the whole design flow defined during the project. At the end of the project, it would be interesting to compare our methodology with the ones presented in this state of the art.

Chapter 3

Compilation of Dataflow Programs

3.1 Introduction

Efficient execution is crucial when developing mobile multimedia systems that are manufactured in long series. Typical multimedia tasks are demanding in terms of required computing power, and available resources have to be utilized efficiently in order to meet the desired design constraints. Unlike desktop systems, embedded and mobile multimedia systems can generally not rely on a hardware platform that is over-dimensioned for the multimedia task at hand, since there is a direct relation between the utilization of a hardware platform's capabilities and desirable properties such as low production cost, long battery lifetime, cool and silent operation and even the small physical dimensions of the device. For this reason, developers of multimedia devices are prepared to go to great lengths in order to maximize the utilization of available computing resources. The prevailing approach is to rely on carefully tuned low-level implementations; it is a laborious process, but development and maintenance costs are considered secondary to the benefits that are achieved in this way. Increasingly complex execution platforms for embedded multimedia systems challenge the current approach: parallelization is required to utilize multi-core architectures, vectorization is required to utilize so-called SIMD or multimedia instructions, and application-specific hardware acceleration emphasizes the need for hardware/software partitioning. A low abstraction level obstructs automated transformations to this end. It has been shown that dataflow models offer a representation that efficiently supports parallelization, vectorization and synthesis of both hardware and software (for instance, see [101, 102]). Restricted forms of dataflow, such as synchronous dataflow, are attractive in that they allow for compile-time analysis of many interesting properties and can be realized particularly efficiently in hardware and software.
More expressive dataflow models are better suited for complex applications, but cannot be analyzed as extensively. In the general case, the computations performed by a dataflow program have to be scheduled at run-time, which induces an overhead. The goal of the dataflow compiler, which will be developed within ACTORS, is to improve the realization of complex dataflow models so that the generation of production code becomes possible.

3.2 Dataflow Graphs

A dataflow graph is a directed graph, in which the nodes denote actors that perform computations and the arcs denote communication channels. The channels are

Figure 3.1: Non-deterministic merge and fair merge actors, only the latter of which has sequential firing rules and thus adheres to the dynamic dataflow computation model

connected at endpoints, called input ports and output ports. Data is transferred over the channels in quanta called tokens. An actor operates in steps called firings, during which it consumes a sequence of tokens from its input ports and produces tokens on its output ports. Firing is generally subject to constraints, such as the availability of tokens. When an actor is able to fire, it is said to be enabled. The execution of a dataflow graph is inherently parallel, with any dependence between actors explicitly specified as paths in the graph. However, to be able to reason about the effect of a dataflow graph, it has to be put in the context of a computation model that defines the fashion in which actors communicate. In what follows, two models of computation will be discussed in more detail: synchronous and dynamic dataflow.

Dynamic Dataflow (DDF)

The concept of dataflow process networks [103] provides the theoretical framework for dynamic dataflow. A fundamental property of dynamic dataflow is that it offers a determinate computation model, which means that the outputs computed by a program depend only on the inputs it has consumed; any admissible schedule of the computations produces the same result. Determinacy is a result of dataflow process networks being a special case of Kahn process networks [2]. A DDF actor may have several so-called firing rules, each of which governs a firing that consumes a distinct set of tokens from the input ports. Such sets may differ in the number of tokens that are consumed, and may differ further in the required values of these tokens. DDF actors use FIFO channels, which means that an actor consumes tokens in the same order as they are produced by its predecessors in the dataflow graph.
Blocking reads are used when consuming tokens, whereas tokens are produced by non-blocking writes. The use of blocking reads means that the order in which tokens are consumed, for the purpose of checking the firing rules, is crucial: a poorly chosen order could cause an actor to block although there is a satisfied firing rule. The firing rules of a DDF actor must be sequential, which means that an appropriate order can be determined beforehand (see Figure 3.1 for an example). A further requirement is that the mapping from input tokens to output tokens be functional, that is, free from side effects.

Synchronous Dataflow (SDF)

Synchronous dataflow is the special case of dynamic dataflow in which actors have fixed token rates. A particular actor thus consumes (and produces) the same number of tokens in every firing. This restriction sacrifices expressiveness, but allows several interesting properties to be determined using static (compile-time) analysis.
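The notions introduced above (channels, tokens, firings, enabledness) can be sketched in a few lines of Python. This is a minimal, hypothetical illustration, not from the source: the Adder actor and the data-driven execution loop are invented for the example, and the actor shown happens to have fixed token rates (one token per port and firing), as in SDF.

```python
from collections import deque

class Channel:
    """FIFO communication channel; tokens are consumed in production order."""
    def __init__(self):
        self.q = deque()
    def put(self, token):          # non-blocking write
        self.q.append(token)
    def get(self):                 # consume one token (caller checks availability)
        return self.q.popleft()
    def count(self):
        return len(self.q)

class Adder:
    """Hypothetical actor: consumes one token per input port, produces one output."""
    def __init__(self, a, b, out):
        self.a, self.b, self.out = a, b, out
    def enabled(self):             # firing is constrained by token availability
        return self.a.count() >= 1 and self.b.count() >= 1
    def fire(self):                # one firing: consume inputs, produce output
        self.out.put(self.a.get() + self.b.get())

# Build a tiny graph and run a data-driven schedule: fire while enabled.
a, b, out = Channel(), Channel(), Channel()
adder = Adder(a, b, out)
for x, y in [(1, 2), (3, 4)]:
    a.put(x); b.put(y)
while adder.enabled():
    adder.fire()
print(list(out.q))   # [3, 7]
```

The actor's firing is free from side effects on anything but its ports, in line with the functional-mapping requirement stated above.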

Switch:
  in: [x], ctrl: [true]  ==> T: [x]
  in: [x], ctrl: [false] ==> F: [x]
Select:
  T: [x], ctrl: [true]  ==> out: [x]
  F: [x], ctrl: [false] ==> out: [x]

Figure 3.2: Conditional actors: Switch forwards a token from its in port to T or F, depending on the Boolean token on the ctrl port, whereas Select forwards a token, either from T or F, to out.

Conditional actors present a problem in the SDF model of computation, since the token rates are not known beforehand. A model of computation known as Boolean dataflow results when synchronous dataflow is extended with two types of conditional actors, Switch and Select (see Figure 3.2). This is sufficient to render the dataflow language Turing complete [104]. We thus attain computational power, but certain properties of Boolean dataflow graphs can no longer be decided in general; the existence of a static schedule is one such property. However, in several other extensions of the SDF model, the task of finding a static schedule remains tractable while expressiveness is improved. One such extension, known as well-behaved dataflow [105], restricts the use of conditional actors to particular patterns: the conditional schema (if-then-else constructs) and the loop schema. Another extension, cyclo-static dataflow [106], allows token rates to vary over a fixed period that is associated with each actor. Firing advances an actor's phase within its period and thus determines the token rates of the next firing. In yet another extension, scenario-aware dataflow [107], token rates are parameterized by particular operational modes, called scenarios, but remain fixed within a single scenario. Scenario-aware dataflow is primarily intended for static analysis of models with dynamic behavior and includes stochastic modeling of performance metrics, such as throughput. Under certain conditions, it is possible to analyze combinations of scenarios in isolation, using techniques similar to those of SDF graphs.
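The firing rules of Figure 3.2 can be mimicked in Python. The following hedged sketch (the token values are invented) uses deques as stand-ins for FIFO channels; note that both actors read their control token first, which reflects the sequential ordering of the firing rules: only after the control token is known is it decided which data port to consume from.

```python
from collections import deque

def switch(inp, ctrl, T, F):
    """One Switch firing: route a token from 'in' to T or F per the control token."""
    c = ctrl.popleft()                      # consume the Boolean control token first
    (T if c else F).append(inp.popleft())   # then route the data token

def select(T, F, ctrl, out):
    """One Select firing: forward a token from T or F to 'out' per the control token."""
    c = ctrl.popleft()
    out.append((T if c else F).popleft())

# Route two tokens through Switch, then merge them back with Select.
inp, ctrl1 = deque([10, 20]), deque([True, False])
T, F, out, ctrl2 = deque(), deque(), deque(), deque([True, False])
for _ in range(2):
    switch(inp, ctrl1, T, F)
for _ in range(2):
    select(T, F, ctrl2, out)
print(list(out))   # [10, 20]
```

The data-dependent routing is exactly what makes the token rates on T, F and out unknowable at compile time, as discussed above.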
CAL and Computation Models

The CAL Actor Language [108] allows for actors that can be used in the context of different models of computation; the language was designed with this flexibility in mind. A particular CAL actor may conform to both synchronous and dynamic dataflow, but there are also CAL actors that require less constrained computation models; in particular, it is possible to express actors with non-deterministic behavior. The state variables of CAL appear to violate the DDF model's requirement of a functional mapping from input tokens to output tokens. However, the internal state of a CAL actor can be modeled as a feedback loop that surrounds the actor. In CAL, firing is constrained by the input pattern and the guard of each action. The finite state automaton, which can be associated with a CAL actor, further constrains firing, though it can be considered syntactic sugar for additional guard conditions. The constraints are not guaranteed to express sequential firing rules, but if they do, the actor adheres to DDF. An algorithm that identifies sequential firing rules is given in [103]. Adherence to the SDF model of computation is trivially fulfilled by a CAL actor with a single action. It is not obvious how to identify CAL actors that adhere to

other computation models, which generalize the SDF model.

3.3 Scheduling of Dataflow Graphs

Scheduling concerns the tasks of assigning actors to processors and ordering their execution on each processor. Assignment can be either static (performed at compile-time) or dynamic (performed at run-time) and, given a static assignment, the ordering can in turn be either static or dynamic. A distinction can be made between a fully-static schedule and a self-timed schedule [109], both of which are assigned and ordered statically: actors have exact starting times within the period of a fully-static schedule, whereas self-timed schedules rely on inter-processor synchronization. Still following the terminology of [109], we arrive at static assignment if we defer only the ordering to run-time, and at fully-dynamic scheduling if both assignment and ordering are performed at run-time. The execution of a dataflow graph is usually assumed to be non-terminating, which makes it relevant to find periodic schedules that can execute indefinitely. In particular, it is desirable to rule out the risk of deadlock, which arises when no actor is able to fire. A bound on the number of tokens which need buffering on the FIFO channels is another desirable feature. The SDF model allows these properties to be verified beforehand [101], whereas they are undecidable in general for DDF graphs [104, 110]. For this reason, DDF graphs usually require dynamic ordering, which leaves the question of successful long-term operation unanswered. SDF graphs, on the other hand, are usually ordered statically, which not only eliminates the run-time overhead of scheduling, but also enables static allocation of buffers and the generation of efficient and compact code that can be repeated indefinitely without deadlock. We stress the point that static schedulability is a property of a particular dataflow graph. Given the existence of a static schedule, deployment on one or several processors is possible.
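One part of the compile-time verification mentioned above, checking consistency of an SDF graph, amounts to solving linear balance equations: for every channel from actor i (producing p tokens per firing) to actor j (consuming c tokens per firing), the repetitions r must satisfy r[i]*p == r[j]*c. The sketch below is illustrative only; it assumes a connected graph, and the simple fixpoint propagation is not claimed to be the exact algorithm of [101].

```python
from fractions import Fraction
from math import lcm

def repetitions_vector(actors, edges):
    """Solve the SDF balance equations r[src]*prod == r[dst]*cons by propagation.
    edges: list of (src, prod, dst, cons). Assumes a connected graph. Returns the
    smallest positive integer solution, or None if the rates are inconsistent
    (in which case no static schedule exists)."""
    r = {a: None for a in actors}
    r[actors[0]] = Fraction(1)              # normalize one actor's rate to 1
    changed = True
    while changed:                          # propagate rates until a fixpoint
        changed = False
        for src, p, dst, c in edges:
            if r[src] is not None and r[dst] is None:
                r[dst] = r[src] * p / c; changed = True
            elif r[dst] is not None and r[src] is None:
                r[src] = r[dst] * c / p; changed = True
            elif r[src] is not None and r[dst] is not None:
                if r[src] * p != r[dst] * c:
                    return None             # inconsistent token rates
    m = lcm(*(f.denominator for f in r.values()))
    return {a: int(f * m) for a, f in r.items()}

# Hypothetical chain: A produces 2 tokens, B consumes 3; B produces 1, C consumes 2.
print(repetitions_vector(["A", "B", "C"],
                         [("A", 2, "B", 3), ("B", 1, "C", 2)]))
# {'A': 3, 'B': 2, 'C': 1}
```

For these rates, three firings of A and two of B balance one firing of C within each schedule period, which can be checked against the balance equations directly.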
One approach, proposed in [101], is to create a static single-processor schedule to establish static schedulability, and to use it as the basis when forming a static multi-processor schedule. Alternatively, the task precedence graph can be formed to establish the same property, which has the benefit of making standard static scheduling algorithms applicable. We refer to [111] for a survey of such algorithms.

Finding Static Schedules

Balanced token production and consumption (consistency) and the absence of deadlock are prerequisites of a static schedule. There is a practical algorithm [101] which decides whether or not an SDF graph has these properties. Consistency is shown by solving the so-called balance equations for the repetitions vector. The repetitions vector expresses the number of times each actor should fire in one period of the schedule. Construction of the precedence graph, in which the actors are duplicated according to the repetitions vector, provides the necessary evidence that the graph is deadlock-free. As mentioned, the precedence graph may also serve as input to standard scheduling algorithms. Given a dataflow graph which cannot be scheduled statically, it is still relevant to find substructures that do not require run-time scheduling. Hybrid static/dynamic scheduling [110] uses compile-time scheduling when possible and run-time scheduling when necessary. Clustering is a technique that combines adjacent actors. Conditional actors can be clustered in specific cases, as described in [104]. The result is a larger-grain actor that contains conditional constructs internally, but may expose

fixed token rates to its neighbors. Scheduling larger-grain clusters of actors is also beneficial in that it pushes overhead from frequently fired actors to less frequently fired ones.

Dynamic Scheduling

A consequence of the determinism offered by the DDF model is that the absence of deadlock is a property of a particular DDF graph, unaffected by the actors' execution order. In contrast, the required size of the FIFO buffers is dependent on the schedule. Further, even if a dataflow graph has a schedule that limits the number of tokens that need buffering, unbounded buffers may result from unfortunate scheduling decisions. It is generally not possible to limit buffer sizes beforehand, since boundedness is an undecidable property of DDF graphs and setting capacities too low causes deadlock. Both a strictly data-driven scheduling policy, in which actors fire as soon as they are enabled, and a strictly demand-driven policy, in which actors are not executed until their output is requested, may unnecessarily result in unbounded buffering. More elaborate regulation of token production and consumption is thus needed. A particularly simple approach is proposed in [110], in which buffers are bounded initially and capacities are increased, according to a scheme, whenever an artificial deadlock is caused. In this way, deadlock-free dataflow graphs can execute forever with bounded buffers whenever possible.

Relation between Scheduling and Software Synthesis

A static schedule can be translated into executable instructions by generating threaded code [112]. If a block of executable code is assumed to be available for each actor, the task of code generation essentially consists in stepping through the schedule while emitting instructions that invoke the code blocks in the corresponding order. One approach is to use the addresses of the code blocks as virtual instructions and make each code block responsible for the invocation of the following block (direct-threaded code).
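The stepping-through-the-schedule idea can be emulated in a few lines of Python, with functions standing in for the actors' code blocks and a dispatch loop playing the role of the threaded code. The actor names and the schedule are hypothetical, not taken from the source.

```python
# Hypothetical code blocks, one per actor; each records its firing in a trace.
def fire_A(state): state.append("A")
def fire_B(state): state.append("B")
def fire_C(state): state.append("C")

# A static periodic schedule A A A B B C, e.g. derived from a repetitions
# vector A:3, B:2, C:1. The "program" is a sequence of code-block references.
schedule = [fire_A, fire_A, fire_A, fire_B, fire_B, fire_C]

trace = []
for block in schedule:   # dispatch loop: invoke the code blocks in schedule order
    block(trace)
print("".join(trace))    # AAABBC
```

In real threaded code this dispatch is realized with block addresses or call instructions rather than an interpreter loop, and the whole sequence would be repeated indefinitely for a non-terminating dataflow graph.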
Another approach is to emit native call instructions, by which we arrive at subroutine-threaded code. It is also possible to knit together the code blocks in line, rather than emitting call instructions. In this way the invocation overhead is avoided, and opportunities arise for improving the efficiency of the code blocks that become adjacent. Excessive code size is a potential problem of straightforward generation of threaded code, and it is further emphasized by the code growth that is usually caused by in-lining. For this reason, it is relevant to look for static schedules that limit the number of appearances of an actor in the schedule. In a looped schedule [113], a sequence of actor invocations may form a sub-schedule, which is iterated a given number of times. Such schedule loops can be realized in software as loops with a single copy of the sub-schedule in the loop body. Allowing sub-schedules to contain schedule loops naturally leads to a software realization with nested loops. Some, but not all, static schedules can be expressed as single-appearance schedules, by which code size is minimized. If the invocation of an actor is associated with an overhead, which for instance is the case in threaded code, it is relevant to limit the number of invocations. One technique with this purpose is vectorization [102], by which the number of tokens that are consumed and produced in each firing is increased by an integer factor. Loosely speaking, the iteration that surrounds an invocation is moved into the block that realizes the actor. Control-flow constructs also result from the clustering of conditional actors. Unlike the sub-schedules of looped schedules, which have constant iteration counts, actor clusters may embed conditional (if-then-else) constructs and data-dependent iteration. The code which is generated from a static schedule needs to be accompanied by an assignment of storage for the tokens that are passed between the actors. The execution order affects the amount of buffering required; clearly, there is a tradeoff between code size, invocation overhead and buffer sizes. If the dataflow graph is not built from a library of pre-defined actors, it is not realistic to assume that suitable blocks of executable code are available. Instead, that code has to be generated from a specification in some programming language. This is of course the situation that arises when compiling networks of arbitrary CAL actors. The task of compiling such dataflow graphs is commonly divided into two steps: software synthesis and code generation. Software synthesis involves scheduling and the realization of that schedule in some intermediate language, which includes the translation of the actors to the intermediate language. Code generation translates the intermediate representation of the dataflow program into executable code, which is a well-understood problem in the field of compiler implementation. A popular approach is to leverage existing compilers for code generation by choosing a widely used programming language, usually C, as the intermediate language.

3.4 Problems with Current Approaches

Handwritten assembly language

A common practice today is to translate critical parts of embedded multimedia software by hand when the compiler fails to generate code of sufficient quality. In this case the programmer uses the target-specific assembly language directly as the source language.
There are several undesired aspects of this method. First, the process is laborious and requires highly specialized staff, which in itself may limit the portion of the program that can be addressed. Second, the process is error-prone. Third, the resulting software is specific to a particular target processor, which means that a repeated effort is required to migrate the software to a new target processor. As discussed in Section 3.1, these disadvantages are emphasized by certain features of emerging processors, which complicate the task of finding optimal (or nearly optimal) assembly-language programs. It is simply too hard to make good use of the opportunities for large-grain and fine-grain parallelism, and even to assess the relative benefit of two alternative solutions.

Compilation from C

If present compilers are not up to the task of generating sufficiently good code, are there any prospects of significant enhancements in the future? Unfortunately, progress has been very slow historically, and a breakthrough in compilation of widely used languages, such as C, appears unlikely. Todd Proebsting captures this, somewhat wittily, in his paraphrase of Moore's Law: advances in compiler optimizations double computing power every 18 years [114]. The control over low-level detail, which is considered a merit of C in multimedia programming, tends to over-specify programs: not only the algorithms themselves are specified, but also how inherently parallel computations are sequenced, how inputs and outputs are passed between the algorithms and, at a higher level, how computations are mapped to threads, processors and application-specific hardware. It is not always possible to recover the original knowledge about the program by means of analysis, and the opportunities for restructuring transformations are limited.

Code generation is constrained by the requirement of preserving the semantic effect of the original program. What constitutes the semantic effect of a program depends on the source language, but loosely speaking some observable properties of the program's execution are required to be invariant. Program analysis is employed to identify the set of admissible transformations; a code generator is required to be conservative in the sense that it can only perform a particular transformation when the analysis results can be used to prove that the effect of the program is preserved.

Dependence analysis is one of the most challenging tasks of high-quality code generation (for instance see [115]). It determines a set of constraints on the order in which the computations of a program may be performed. Efficient utilization of modern processor architectures relies on dependence analysis, for instance: to determine efficient mappings of a program onto multiple processor cores (parallelization), to utilize so-called SIMD or multimedia instructions that operate on multiple scalar values simultaneously (vectorization), and to utilize multiple functional units and avoid pipeline stalls (instruction scheduling). Determining (a conservative approximation of) the dependence relation of a C program involves pointer analysis. Since the general problem is undecidable, a tradeoff will always have to be made between the precision of the analysis and its resource requirements [116].
For a concrete example consider the loop in Figure 3.3a, which computes the sum of the vectors x and y (or at least so it appears). This is a situation in which vector instructions can be extremely useful. As suggested in Figure 3.3b, a vectorized addition instruction allows L scalar additions to be performed simultaneously; typically a speed-up factor of L results. However, in order to apply this transformation, a compiler is required to provide evidence that the effect of the program is preserved under the transformation. For this particular example taken in isolation, it is not possible to do so; it is even easy to find a counterexample (when the pointer y lags between 1 and L-1 elements behind x). The problem is that vectorization reorders reads from y with respect to writes to x and thus potentially violates a loop-carried data dependence. Putting the example loop in the context of a whole program, it may or may not be the case that tools, such as pointer analysis, are able to prove the transformation admissible even if the problematic case indeed never occurs.

a) Original loop:

    for (int i=0; i<64; i++)
        x[i] += y[i];

b) Vector code:

    for (i=0; i<64; i+=L)
        x[i:i+L-1] += y[i:i+L-1];

Figure 3.3: Length-L vectorization using loop sectioning
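The counterexample above can be made concrete by executing both loop variants on overlapping storage. The sketch below (hypothetical helper names, not part of any compiler) mimics the sequential semantics of the loop in Figure 3.3a and the reordered load/store pattern of the vector code in Figure 3.3b; when y lags one element behind x, the results diverge because the vector loads read y before the preceding writes to x have taken effect.

```python
# Demonstrates the loop-carried dependence violated by vectorization.
# 'buf' is a single array in which pointer y lags one element behind x.

def add_sequential(buf, x0, y0, n):
    # semantics of the original C loop: one element at a time
    for i in range(n):
        buf[x0 + i] += buf[y0 + i]

def add_vectorized(buf, x0, y0, n, L):
    # vector semantics: a whole chunk of y is read before x is written
    for i in range(0, n, L):
        chunk = [buf[y0 + i + k] for k in range(L)]   # vector load
        for k in range(L):
            buf[x0 + i + k] += chunk[k]               # vector store

a = [1] * 65
b = [1] * 65
add_sequential(a, 1, 0, 64)       # y lags x by one element
add_vectorized(b, 1, 0, 64, L=4)
print(a[:5], b[:5])               # [1, 2, 3, 4, 5] [1, 2, 2, 2, 2]
```

The two results differ, so a compiler that cannot rule out this overlap must not vectorize the loop; in C99 the `restrict` qualifier is one way for the programmer to supply that guarantee.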

Compilation of dataflow programs

It has been shown that the SDF model of computation and its extensions, which can be scheduled statically, allow efficient code to be generated. However, complex applications do not adhere to these restricted computation models. Whereas dynamic scheduling may be acceptable for large-grain actors, which are fired relatively infrequently, the scheduling overhead is of great concern in the case of fine-grained actors. The video decoder, which will be used in the Wireless Video Terminal demonstrator, serves as a concrete example: it operates in a fine-grained fashion, essentially pixel-by-pixel, and the token rates of the actors are not constant in general. Scheduling such an application dynamically is analogous to adding significant overhead in the innermost loops of a traditionally developed decoder.

A CAL actor is structured into actions in a manner that closely corresponds to the firing rules of a dataflow process. We are not aware of any work that considers static scheduling of actors with arbitrary firing rules. Although it is not possible to find such schedules in general, it is relevant to find substructures that have static schedules, which would allow for a hybrid scheduling approach that pushes the overhead to larger-grain actors.

Although neither generation of threaded code nor the use of C as an intermediate language is implied by dataflow compilation, we note that they are popular implementation strategies. Threaded code suffers from the overhead of invoking the actors and, even when in-lined, overhead results from concatenating code blocks without considering their combined effect. Although once a popular technique in compilers and still popular as an implementation technique for interpreters, threaded code has long been abandoned for the purposes of high-quality code generation.
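For contrast, a static looped schedule compiles down to plain counted loops with no run-time scheduling decisions. The following sketch uses hypothetical SDF actors (not from the ACTORS toolchain): A produces 2 tokens per firing and B consumes 3, so the balance equation 2*qA = 3*qB yields the repetitions vector q = (3, 2) and the single-appearance schedule (3A)(2B).

```python
# Realizing the single-appearance looped schedule (3A)(2B) as counted
# loops: no scheduler runs between firings, and the channel's buffer
# bound (6 tokens) follows directly from the schedule.

from collections import deque

fifo = deque()        # the A -> B channel
results = []

def fire_a(state):    # produces 2 tokens per firing
    t = state["next"]
    fifo.extend([t, t + 1])
    state["next"] = t + 2

def fire_b():         # consumes 3 tokens per firing
    results.append(sum(fifo.popleft() for _ in range(3)))

state = {"next": 0}
for _ in range(2):            # two iterations of the whole graph
    for _ in range(3):        # schedule loop: 3 firings of A
        fire_a(state)
    for _ in range(2):        # schedule loop: 2 firings of B
        fire_b()
    assert not fifo           # channel returns to its initial state

print(results)                # [3, 12, 21, 30]
```

The fixed buffer bound and the absence of run-time decisions illustrate the trade-off between code size, invocation overhead and buffer sizes mentioned earlier.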
The use of C as intermediate language is attractive in that it provides portability with respect to the target architecture and allows code generation to leverage the techniques implemented in an existing compiler. The code quality, however, depends on the compiler's capabilities, and producing C that renders efficient target code requires significant insight into the inner workings of a particular compiler. Vector code generation stresses this point: the C compilers that perform vectorization at all are very sensitive to the conditions under which the transformation can be applied.

3.5 Relation to Work Packages and Assessment

Relations to WP1

Tools for analysis of CAL actor networks are developed in work package 1:

- Task 1.3 (design space exploration) produces traces and execution profiles, which could be useful as scheduling heuristics.
- Task 1.4 (model compiler) implements analysis and software synthesis techniques, in particular a method of identifying statically schedulable substructures.

Current static scheduling techniques are not directly applicable to CAL networks. Techniques are needed that allow CAL actors (even better: sub-networks) to be modeled as SDF, well-behaved dataflow, cyclo-static dataflow or some other representation that allows for static scheduling. Conversely, the model compiler has the responsibility to identify actors (and clusters of statically scheduled actors) that have to be scheduled at run-time.

Fully dynamic scheduling of such actor clusters appears the most promising (and doable) approach to utilize the parallelism in a multi-core system.

Relations to WP2

Work package 2 implements the code generator.

- Task 2.2 (Caltrop extensions) concerns the integration of the ACTORS dataflow compiler (Tasks 1.4 and 2.3) into the existing opendf framework. The work concerns refactoring of the current toolchain, which has an HDL-generation focus, to better support generation of software: design of appropriate intermediate representations and of the run-time system required by the generated code.
- Task 2.3 (CAL Compiler) concerns ARM11 code generation, in particular vector code generation for the purpose of supporting SIMD instructions. Vectorization, in the sense of aggregating multiple actor firings, is a useful first step of vector code generation, but the CAL code within actions needs to be analyzed and transformed as well. A comparison between the ARM11 code generator and the solution of using C as intermediate language is relevant. Reliable techniques for vector code generation, using C as intermediate language, need to be explored. One possibility is to inject vector code using intrinsic functions or in-line assembly.

Relation to WP3

The resource management framework is developed in work package 3 (task 3.3). Application-specific resource management will be expressed as CAL actors and leverage built-in actors that are implemented by the run-time system.

3.6 Potential Research Directions

The greatest challenge in realizing CAL efficiently in software lies in avoiding the overhead of run-time scheduling when possible. Multiple CAL actions with different token rates prevent direct use of existing techniques for static scheduling. We have the following ideas on extending the class of dataflow graphs that can be scheduled statically:

- By analyzing a CAL actor in isolation, it may be possible to find periodic patterns of action firings that depend on internal state only.
This would allow certain CAL actors to be modeled using cyclo-static dataflow. We believe that abstract interpretation of relevant parts of the actors' internal state may prove a very useful technique to this end.

- Using the same technique, we believe that it is possible to enumerate distinct operational modes of actors with data-dependent behavior, so that each operational mode has a pattern of action firings that depends on internal state only.
- Operational modes of several data-dependent actors may be correlated, for instance by operating on duplicates of the same channel, by which clustering operational modes across several actors may be possible.
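The first of these ideas can be illustrated with a small sketch. The round-robin splitter below is hypothetical; its action choice is a pure function of an internal counter, so enumerating states until one repeats recovers a periodic firing pattern, i.e. the cyclo-static rates (1,0),(0,1) on its two output ports.

```python
# Sketch of deriving a cyclo-static firing pattern by (abstract)
# interpretation of an actor's internal state. The splitter actor is
# hypothetical; its action selection depends only on a mod-2 counter.

def next_action(state):
    # pure function of internal state: (action that fires, next state)
    return ("send_0", 1) if state == 0 else ("send_1", 0)

def firing_pattern(initial_state, max_steps=16):
    seen, pattern, state = {}, [], initial_state
    for step in range(max_steps):
        if state in seen:                  # state revisited: periodic
            return pattern[seen[state]:]   # the repeating part
        seen[state] = step
        action, state = next_action(state)
        pattern.append(action)
    return None   # no period found within bound: schedule dynamically

print(firing_pattern(0))   # ['send_0', 'send_1']
```

When the relevant state space is finite and independent of input data, the enumeration terminates with a period; otherwise the actor stays under dynamic scheduling, which matches the hybrid approach advocated above.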

We believe that CAL actions are more useful building blocks of static schedules than CAL actors. In particular, we believe that it is possible to identify producer/consumer pairs of actions, which could be clustered. A representation of CAL actors that exposes individual actions is required.

Parallelization on multi-core systems using hybrid static/dynamic scheduling is another interesting topic:

- Given that it proves possible to synthesize sufficiently large clusters of statically schedulable actors, a trade-off between the run-time scheduling overhead and the load balancing achieved by parallelization becomes relevant.
- Is a standard (Linux) multi-processor scheduler at all fit for dataflow workloads? Is it practical to achieve proper regulation of actor firings, which is necessary to control buffering and response latency? If not, would a dataflow-aware scheduler be conceivable?

Vector code generation from CAL is yet another interesting topic, primarily due to the prospects of simpler, more robust and more practical approaches than those offered by vectorization of imperative languages such as C:

- Is aggressive use of vector instructions facilitated by the programming model, or have we in fact traded one undecidable problem (dependence analysis in C) for another (static schedulability in CAL)?
- Vector code generation is constrained by the layout of data (stride-1 access, proper alignment, etc.). Since buffers are allocated by the CAL compiler, it appears possible to select a layout that is optimal (or at least efficient) considering the combination of the producer and the consumers of the data.

Chapter 4

Adaptive Resource Management

4.1 Introduction

Hard real-time tasks are those where missing one deadline may lead to a fatal failure of the system, so temporal and functional feasibility of the system must be preserved even in the worst case. There are different approaches to ensure this property; one of them is reserving the required resources for the corresponding tasks. In multimedia processing systems this approach has to deal with the fact that the bandwidth and computation demand of, e.g., an MPEG-2 video stream can vary widely over time [117], so that reserving resources for the worst case can lead to significant overprovisioning of hardware resources. This is costly and therefore prohibitive for mass-market devices. Instead, an adaptive approach should be preferred.

Adaptive real-time systems are able to adjust their internal strategies in response to changes in resource availability and resource demands to keep the system performance at an acceptable level. The goal of resource management is to guarantee the availability of required resources to applications so they can rely on them. This implies that applications are to some degree isolated from the behaviour of other applications on the same system. Depending on the load of the system, the resources should be distributed flexibly among the applications, and the applications should adapt their algorithms accordingly, so that the best possible Quality of Service for the given resources is achieved [118] [119] [120].
In order to be able to optimize the overall Quality of Service, several preconditions have to be met:

- the operating system must be able to measure the resource usage of individual tasks
- the operating system must be able to restrict the resource usage of tasks
- the operating system must be able to notify tasks about their currently available resources
- tasks must be able to express their desired resource requirements to the operating system
- there must be a way to determine the current Quality of Service

Real-time research has traditionally concentrated on scheduling CPU time. But basically all hardware resources which are used by multiple tasks influence the timing behaviour of the tasks using them. The resources which will be discussed in more detail are:

- CPU time
- memory and cache
- bus usage
- permanent storage
- energy consumption and heat dissipation

4.2 Principal Resource Management Methods

In this section principal methods for managing resources will be presented. All operating systems manage the available resources so that they can be used by applications. The management algorithms can range from simple, straightforward implementations to highly sophisticated frameworks. The goal of resource management in general-purpose operating systems is to achieve good average performance. This means that in the majority of use cases latencies should be small and the throughput high, while for rare use cases significantly worse performance can be acceptable. Real-time operating systems, on the other hand, need to provide deterministic and predictable behaviour, even if this means that the average performance may not reach the level of general-purpose operating systems. This is the cost of being able to guarantee timely execution of tasks even in the worst case.

Besides hard real-time systems, which preserve temporal and functional feasibility even in the worst case, there are also adaptive real-time systems. As stated above, due to the varying resource requirements, not enough resources are provisioned to meet the requirements of all tasks even in the worst case. Instead, the resource management tries to handle the available resources in such a way that the best possible Quality of Service is achieved under the given constraints [120].

There are several underlying algorithms which can be applied to the management of different resource types. If the resource can be used by only one task at one point in time, the algorithm has to deal with the temporal management of the resource, i.e. which task accesses the resource at which point in time. This is the case, e.g., for scheduling the CPU time in a single-core system, for scheduling I/O requests for a hard disk or for granting access to a shared bus.
If there are multiple instances of one resource, or if one resource can serve multiple tasks at the same time, then in addition to temporal management, spatial management of the resource has to be done. The obvious example for this is memory management: the memory can hold the data of multiple tasks at the same time. Other examples are dynamic memory allocation algorithms, disk quotas or, in multicore systems, the assignment of cores to threads.

Offline Algorithms

In offline resource management the complete distribution of resources is determined by executing the resource management algorithm before the runtime of the system. The algorithm can use complex rules and requirements, since the overhead occurs before the runtime of the system. The predetermined resource distribution is stored in a data structure and then executed at runtime. This mechanism has a very low runtime overhead for the system. In order to be able to compute the resource distribution offline, it is necessary that the system is completely known, i.e.

the set of tasks and their properties must be known. For complex dynamic systems with a mix of different tasks this can be very hard.

Online Algorithms

As opposed to offline algorithms, online algorithms are executed at runtime. Thus they can react to the runtime behaviour of the system. It is then necessary that their overhead is small enough so that they are applicable for the desired use.

First Come First Served

First Come First Served (FCFS) is the simplest way to manage a resource. Requests are served in the order they arrive; no priorities or any other measures are used. With FCFS alone it is not possible to provide any guarantees.

Static Priorities

A common approach to manage resources is to use static priorities, i.e. tasks are assigned priorities which do not change during the runtime of the system. The management algorithm uses these priorities to decide which task will get the resource. In general the resource will be allotted to the task with the highest priority. This means that the highest-priority task will always be served well and that for the lowest-priority task the risk of starvation exists. Resource management based on static priorities is one of the most common algorithms used for scheduling CPU time, called Fixed Priority Scheduling; see below for more details.

Dynamic Priorities

Besides algorithms based on static priorities, there are also algorithms where the priorities are modified during runtime depending on additional parameters. An example for this is the deadline of a task as used, e.g., in EDF scheduling (see 4.3), where the task with the earliest deadline is assigned the highest priority. Once the new priorities have been determined, the task with the highest priority will be granted access to the resource.

Reservations

Another way to manage resources is to reserve fractions of resources for tasks. This way it can be guaranteed that resources are always available when they are required by a task.
For best results the amount of reserved resources should match the resource requirements of the task as closely as possible. Knowing the required resources before runtime is not easy, especially with varying resource requirements. The reservations can also guard tasks against other, misbehaving tasks in the system and provide some degree of isolation between them. Using reservations on a temporally managed resource, it can be virtually converted into multiple separate instances of the same resource type.

The Need for Adaptive Resource Management

As mentioned before, knowing the exact resource requirements of a mix of applications at development time is hard. There are multiple reasons for this:

- resource requirements vary at runtime, e.g. encoded video streams can have widely varying bit rates within one stream

- the resource availability varies at runtime, e.g. the available bandwidth in networks or the remaining power in a mobile device
- the systems can be too complex to know everything in detail, e.g. if a lot of software runs on them, not all software can be analyzed, e.g., for its WCET
- the resource requirements change during development time, e.g. simply due to changes in the code (features, bug fixes, etc.) or in the hardware, hence previously correct numbers become incorrect
- task sets running concurrently can change at development time and at runtime, e.g. due to changes in the required feature set or user-installed software when deployed

These issues can be handled better if the distribution of resources among the tasks can be adapted according to the current state of the system and the environment at runtime. This requires that the current state is known, i.e.:

- which tasks are running
- how many resources these tasks demand
- how many resources are available
- the importance of the tasks for the overall Quality of Service of the system

If this information is available, the allocation of resources to tasks can be adapted to the current situation. This can also be seen as a control loop: the plant is the set of tasks, the sensor is the information gathering and the actuator is the modification of the resource distribution. More about that can be found in Chapter 6.

4.3 Application of the Management Methods to Resources

CPU Time

CPU time is the classic resource which real-time research is focusing on. A lot of work has been put into that field and a wide variety of algorithms exists [121]. In a system with multiple running tasks, the CPU time has to be distributed among these tasks according to scheduling goals. In general-purpose operating systems the goal is typically to achieve good average performance, high throughput and an interactive user interface. In real-time systems the goal of task scheduling is to schedule the tasks in such a way that all tasks meet their deadlines.
Typical tasks which have deadlines include, e.g., control tasks and video and audio decoding. If they are not executed in time, the control performance and the perceived video and audio quality will decrease. While in classical real-time systems the set of tasks and their properties were well known, this becomes increasingly less true. As CPU technology is evolving, features like multi-level caches, out-of-order execution and branch prediction make predicting the execution time much less reliable, with the worst-case execution time being very pessimistic. Also, with more powerful embedded systems, more applications can be run on one system, and determining properties like worst-case execution times of such application mixes is not practically doable.

Off-line Scheduling

If the system behaviour is completely known, off-line scheduling can be used. The schedule of the tasks is determined off-line, i.e. before runtime. This requires that all tasks and their execution times are known and that the tasks are periodic. A schedule which can potentially satisfy arbitrarily complex constraints can then be constructed using any of the on-line scheduling algorithms, more complex algorithms or just heuristics. Offline scheduling is used mainly in safety-critical applications such as automotive or avionics. The OSEK/VDX Time Triggered Operating System 1.0 standard, short OSEKtime [122], was released in 2001 and defines the interface for a time-triggered operating system. There are not yet many implementations of OSEKtime available. There is TTPos, the Time Triggered Protocol operating system, but the last mention of it dates back several years. Current implementations are, e.g., the ElektroBit OSEKtime RTOS [123] [124] [125].

Combined Off-line and On-line Scheduling

Offline scheduling can handle periodic tasks well, but is not well suited for handling aperiodic tasks. One way to handle them is, e.g., aperiodic servers, but they introduce a relatively high latency for the aperiodic events. In [126] the slot-shifting algorithm is presented. It combines offline and online scheduling in order to improve the response times for aperiodic tasks while keeping the guarantees for periodic tasks. Slack time of off-line scheduled tasks is analyzed and, if available, used for serving aperiodic tasks. This will delay the execution of the periodic tasks, but only within their deadlines.

Fixed Priority Scheduling

The most common scheduling method in real-time operating systems today is Fixed Priority Scheduling with preemption. Tasks are assigned priorities, and at every point in time the ready task with the highest priority runs. Fixed Priority Scheduling can be combined with round-robin for ready tasks which have the same priority.
A Fixed Priority scheduler can be implemented with complexity O(1). The assignment of priorities is done by the developer, either ad hoc or using an algorithm, e.g. Rate Monotonic Scheduling (RMS). With RMS, tasks are assigned priorities according to their periods: the smaller the period, the higher the priority. Schedulability of n tasks is guaranteed as long as the processor utilization U is below n(2^(1/n) - 1), a bound which decreases towards ln(2) ≈ 0.69 as n grows [127]. This can be a very pessimistic calculation, since it is assumed that all tasks need their worst-case execution time to finish. Under overload, i.e. U > 1.0, low-priority tasks can suffer from starvation, while the highest-priority task still has guaranteed access to the processor. Fixed Priority Scheduling is supported by most if not all available real-time operating systems, e.g. VxWorks from Wind River, the microkernel-based QNX, MS WinCE, ecos (an open-source real-time operating system) and ENEA OSE, and it is also the scheduling algorithm which has to be supported by OSEK-compliant operating systems [128] [129].

Dynamic Priority Scheduling

Besides fixed priority scheduling, there are multiple dynamic priority scheduling algorithms. In these algorithms the priorities are determined at scheduling time. Earliest Deadline First (EDF) is one such scheduling algorithm. With EDF the

ready task with the earliest deadline is scheduled to run. The complexity of that algorithm is at least O(log n), due to the required sorting of task deadlines. EDF can guarantee schedulability up to a processor utilization of 1.0 [127], i.e. it can fully exploit the available processing capacity of the processor. Under overload, i.e. U > 1.0, there are no guarantees that tasks will meet their deadlines. There is research being done on handling overload better with EDF, e.g. by adding admission control to it [130]. EDF is implemented in several research operating systems and scheduling frameworks, e.g. in the Shark real-time operating system, in the FRESCOR resource management framework, in Aquosa and Litmus, which are both extensions of the scheduler in the Linux kernel, and in the OSEK-compatible Evidence Erika RTOS [131] [132] [133].

Multi-Core Scheduling

Until a few years ago, increased processing power was achieved in large part by increasing the clock frequency of the CPU, but this has changed significantly [134]. Instead, the trend nowadays goes towards multi- and many-core architectures, as can be seen, e.g., with the Intel Core 2 Duo or the IBM Cell processor, which consists of one PowerPC accompanied by 8 SPEs (synergistic processing elements). This trend is also starting to reach embedded devices. Examples are the dual-core Analog Devices Blackfin BF561 DSP and the Texas Instruments OMAP SoC (System on Chip), which combines an ARM and a DSP on one chip. Multicore SoCs built from the new ARM Cortex-A9 architecture are to be expected in the near future. This makes scheduling algorithms optimized for multi-core systems a necessity. In multicore systems two general types of scheduling are applicable: partitioned and global scheduling. In partitioned scheduling, tasks are assigned to cores for their whole lifetime, while in global scheduling tasks can migrate between cores at runtime.
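The uni-processor utilization bounds quoted above translate directly into simple admission tests. The sketch below is illustrative only (task parameters are made up): it checks the sufficient Liu & Layland bound for RMS and the exact U <= 1.0 bound for preemptive EDF with deadlines equal to periods.

```python
# Utilization-based schedulability tests for one processor.
# tasks: list of (wcet, period) pairs.

def utilization(tasks):
    return sum(c / t for c, t in tasks)

def rms_schedulable(tasks):
    # Sufficient (not necessary) test: U <= n * (2^(1/n) - 1),
    # which tends to ln(2) ~ 0.69 for large n.
    n = len(tasks)
    return utilization(tasks) <= n * (2 ** (1 / n) - 1)

def edf_schedulable(tasks):
    # Exact for independent periodic tasks with deadline = period.
    return utilization(tasks) <= 1.0

tasks = [(1, 4), (1, 5), (2, 10)]   # U = 0.25 + 0.2 + 0.2 = 0.65
print(rms_schedulable(tasks), edf_schedulable(tasks))   # True True
```

A task set can fail the RMS test yet still be feasible (the test is only sufficient), which is why exact response-time analysis is used when the utilization bound is exceeded.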
With partitioned scheduling, regular uni-core scheduling algorithms such as FPS or EDF can be used per core. The problem is now split into two problems: the order of execution, i.e. the scheduling, and the location of execution, i.e. the allocation. Litmus is a soft real-time extension of the Linux kernel which focuses on multiprocessor real-time scheduling and synchronization. It implements Partitioned EDF and Global EDF in various variants [135] [136].

Reservation based Scheduling

With more powerful processors and in general more available resources for embedded systems, it becomes possible to run more applications on one system. These additional applications do not necessarily have real-time requirements, but offer additional features to the core functionality of the device: media players may come with metadata databases, mobile phones with web browsers, etc. In order to be able to guarantee timely behaviour for real-time applications on such systems, it is necessary to shield them from potentially misbehaving other applications. One approach is to use resource reservations to isolate tasks to some degree from each other. This also has the advantages that starvation of low-priority tasks (as can happen with FPS) can be avoided and that guarantees can be given even under overload (which is a problem with EDF). There are several different reservation-based scheduling algorithms, e.g. the Constant Bandwidth Server (CBS), which is based on EDF, Weighted Fair Scheduling, which has its origins in GPS in the networking field, and also Lottery scheduling, which takes a statistical, less deterministic approach to reservations [137] [138] [139] [140]. Reservation-based scheduling is covered in depth in Chapter 5.

Bandwidth reservation schedulers are not (yet) widespread in deployed real-time operating systems; where they are already in use is mainly in safety-critical applications, the typical example being avionics. With embedded computers becoming more powerful, there is a trend to consolidate multiple distributed systems into fewer but more powerful computers. To ensure reliability, the ARINC 653 standard, which defines real-time operating system services, was created. One key aspect is the use of partitions, which isolate the running tasks from each other. A global period is defined, and during this global period each partition is assigned a fixed time slot [141]. There are a few partitioning operating systems available, e.g. VxWorks 653 from Wind River Systems and Integrity 178B from Green Hills. Partitioned operating systems are also used to run multiple different operating systems, in one partition each.

CPU bandwidth reservations are also supported by the Green Hills Integrity real-time operating system. It employs a two-level scheduler. The first level is a regular Fixed Priority Scheduler, where the task with the highest priority is executed. If there are multiple ready tasks with the same priority, weighted fair scheduling is used among them: they are executed using round-robin, with each task getting to run for the number of time slices it has been assigned [142].

The FRESCOR project proposes a contract-based interface, where applications can negotiate contracts with the operating system which describe their scheduling requirements.
This is implemented using a two-level hierarchical scheduler, where the top-level scheduler is given by the operating system, and the second-level scheduler supports FPS, EDF and table-driven scheduling. On Linux, this is implemented using an extension for the Linux kernel, called Aquosa, which implements more sophisticated scheduling schemes. The FRESCOR API is also implemented on MarteOS, an Ada-based hard real-time operating system [143] [144] [145].

Adaptive Management

The algorithms mentioned until now manage the resource CPU time, but without feedback about the consequences of their scheduling decisions. This is a problem since, as mentioned before, resource requirements and availability can vary, or they may simply not be exactly known. There are several research projects working on this problem. By adding adaptivity, they strive to make the systems react in a meaningful way to varying conditions, i.e. so that the Quality of Service is maximized. In order to be able to adapt to the current situation, the current situation must be known, so sensors are required to gather information about the current state, e.g. measuring the CPU utilization of tasks, deadline misses or the current bitrate of an MPEG stream. This information is then used to influence the functioning of the system using actuators, which can be, e.g., task admission control, modification of task weights or priorities, or modifying the processed data so that the workload changes. These schemes resemble a control loop with sensors, actuators and a plant which is to be controlled. There are a variety of approaches for applying control theory to scheduling [146] [147] [148]; details can be found in Chapter 6.

One scheduling model which incorporates adaptivity is the Elastic Task model [149]. Here the tasks are characterized by the computation time, their period and an

elastic coefficient. Under overload (detected by sensors), the periods of the tasks are modified (actuators). This is done considering the elastic coefficients of the tasks: the period of more elastic tasks will increase more than that of less elastic tasks. This way it is possible to handle varying conditions such as temporary overload more flexibly.

In the MATRIX project a distributed system of heterogeneous, multimedia-processing nodes connected via a wireless network is considered [150]. The nodes have different properties, such as computing power, operating system (i.e. scheduler), available memory, etc. They can enter and leave the network dynamically, the bitrate of the video stream changes over time, and so does the available bandwidth of the wireless network. To handle this dynamic scenario, a network-global resource manager is introduced. It receives status updates from all participating nodes, so it has knowledge about the available computing power and network bandwidth on all nodes. This information is not detailed; only a few abstract discrete levels are supported, e.g. Low, Medium and High. Based on this global view the resource manager makes big decisions, such as switching the quality of one video stream from High to Medium if the receiving node does not have enough CPU bandwidth available for the high-quality stream. On the nodes, local decisions are made with the goal of achieving the best possible perceived Quality of Service within the bounds set by the global resource manager.

Memory and Cache

A resource which is also shared by all tasks is the main memory. The memory management influences the timing behaviour of tasks in multiple ways, which are described below. When building real-time software systems these issues have to be taken care of in order to achieve deterministic runtime behaviour.
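Returning briefly to the elastic task model described earlier, its period compression can be sketched in a few lines. This is a simplified sketch of our own: it distributes the excess utilization in a single pass and ignores the per-task minimum utilizations that the full algorithm enforces iteratively.

```python
def compress_periods(tasks, desired_util):
    """Elastic period compression (simplified): distribute the excess
    utilization over the tasks in proportion to their elastic
    coefficients.  tasks is a list of (wcet, period, elasticity)."""
    utils = [c / t for (c, t, e) in tasks]
    excess = sum(utils) - desired_util
    if excess <= 0:                      # no overload: keep the periods
        return [t for (c, t, e) in tasks]
    e_sum = sum(e for (c, t, e) in tasks)
    new_periods = []
    for (c, t, e), u in zip(tasks, utils):
        u_new = u - excess * e / e_sum   # more elastic -> bigger cut
        new_periods.append(c / u_new)    # T = C / U, so the period grows
    return new_periods
```

For two tasks of utilization 0.1 each and elasticities 3 and 1, compressing to a total utilization of 0.15 stretches the period of the more elastic task further, as the model prescribes.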
Dynamic Memory Allocation

Dynamic memory allocation is a spatial resource management problem: multiple users can use parts of one resource simultaneously. Most allocators implement an FCFS-like strategy, i.e. if a task tries to allocate memory, this succeeds as long as there is enough memory available; no priorities or weights are considered.

In general, modern software makes heavy use of dynamic memory allocation. Besides the obvious uses, e.g. for creating buffers or objects in application code, there are also less obvious ones. For example the C++ STL [151] uses dynamic memory for each item in its container classes, and software using the pimpl idiom [152] to reduce dependencies between classes also needs at least one dynamic memory allocation per object instantiation, as is done e.g. in the Qt library [153].

Various dynamic memory allocators exist, usually with the goal of achieving good average performance and limiting fragmentation, but for real-time systems it is additionally necessary that tasks can rely on bounded response times. One of the most widely used allocators is dlmalloc by Doug Lea. It offers good speed and little memory overhead. dlmalloc is available e.g. for eCos, and it is also the basis for ptmalloc2, the allocator of the GNU C library, which is used on basically every modern Linux system. While dlmalloc offers good average performance, in the worst case its response time can be far longer [154] [155] [156]. This nondeterministic timing behaviour of dlmalloc and most other memory allocators is a problem when they are used in real-time systems, where deadlines have to be met. There are different strategies to avoid the nondeterminism introduced by malloc implementations in real-time systems:

Option 1: No use of dynamic memory at all

The most straightforward strategy to avoid the problems of dynamic memory allocation is to avoid it completely, i.e. use only static memory. This has several advantages. All memory used throughout the runtime of the application is static, i.e. allocated at link time. If there is not enough memory available, linking will fail, so this problem can and must be fixed at build time, without any runtime testing. This can be considered an off-line algorithm. Avoiding calls to malloc obviously avoids the (varying) runtime overhead of the allocation algorithm itself and also the error checking which is otherwise required. This means that the code not only runs faster, it also runs with less variation in execution time. Also the class of errors due to failed malloc calls is completely excluded, which makes the code more reliable. But there is also a downside to avoiding dynamic memory: writing code without any dynamic memory is much more complicated. As mentioned above, dynamic memory is used in many libraries, so they cannot be used if dynamic memory must be avoided. This can significantly increase the development effort. If no dynamic memory is available, fixed-size memory pools are a common replacement for the cases where items have to be created at runtime. These pools can be implemented very efficiently, with O(1) allocation and deallocation. Usually one memory pool is created per type of item, e.g. one pool for event records, one for file handles, etc. This is basically a reservation approach, not per task but per functional module.

Option 2: Limited use of dynamic memory

A very common strategy is to allow limited use of dynamic memory. This is not a very precise term; in general it means avoiding memory allocations in time-sensitive parts of the code. So allocating buffers or stacks, or creating objects using dynamic memory, is allowed e.g. at application startup, but not later on, e.g. in video data processing loops.
The heap, where the dynamic memory is allocated, is usually a shared process-global resource. This means that access to it (i.e. calling malloc() and free()) is internally synchronized for multithreaded usage. This way real-time and non-real-time threads can become dependent on each other, with the well-known problems of priority inversion, etc. [157].

Option 3: Use a dynamic memory allocator with deterministic behaviour

Research is being done on developing allocation algorithms which have deterministic timing behaviour even in the worst case, as required for use in real-time systems. The TLSF allocator is a dynamic memory allocator which works in O(1) and thus has a bounded worst-case response time [154]. With these properties it makes dynamic memory allocation feasible for use in real-time code. Of course it still has the problems of runtime overhead and of allocations failing at runtime, compared to not using dynamic memory at all.

Memory Reservations

The physically available memory is a globally shared resource, used by all tasks in a system. Without special measures, applications can, simply by allocating more and more memory, grow so much that there may not be enough memory left for other, e.g. real-time, tasks. This is independent of the tasks' priorities: a low-priority thread could allocate so much memory that a subsequent allocation by a high-priority thread would fail (as discussed, memory allocations during runtime should be avoided in time-critical code sections anyway). A method to help against this is to use memory reservations for tasks. This can isolate tasks from each other with regard to memory usage.
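The isolation idea can be sketched as a toy admission and accounting layer. All names here are ours and hypothetical, not the API of any real operating system: a task may only allocate within its reserved quota, so it cannot eat into memory reserved for others.

```python
class MemoryReservations:
    """Toy per-task memory budgets: allocations are only granted
    within the requesting task's reserved quota."""

    def __init__(self, total_memory):
        self.total = total_memory
        self.quota = {}          # task -> reserved bytes
        self.used = {}           # task -> currently allocated bytes

    def reserve(self, task, nbytes):
        # Admission control: all quotas together must fit in memory.
        if sum(self.quota.values()) + nbytes > self.total:
            return False
        self.quota[task] = nbytes
        self.used[task] = 0
        return True

    def alloc(self, task, nbytes):
        # Only the over-budget task fails; other tasks are unaffected.
        if self.used[task] + nbytes > self.quota[task]:
            return False
        self.used[task] += nbytes
        return True

    def free(self, task, nbytes):
        self.used[task] -= nbytes
```

With this scheme, a greedy low-priority task that exhausts its own quota can no longer cause an allocation failure in a real-time task holding a separate reservation.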

Green Hills Integrity is one of the few deployed operating systems which offers this functionality. Tasks are assigned static memory areas, which cannot change during runtime. Using the MMU of the processor, access to these memory areas is protected, so that tasks can only access memory inside their assigned region. If a new task is created at runtime, it "lives" in the same memory area as its parent task [142].

In the Linux kernel there is currently work being done on integrating memory reservations for groups of processes, so-called "control groups" or "containers". Here the driving force is server virtualization, where service providers want to profit from consolidating multiple physical servers into one physical server running virtual servers. Containers in Linux are groups of processes which are to some degree isolated from other processes and which use their own set of resources [158] [159].

The FRESCOR project also investigates memory reservations. Here applications can negotiate contracts with the resource management framework, in which they specify minimum and maximum memory requirements. If the contract is agreed upon, the resource management will guarantee that the specified parameters are adhered to. This is implemented as a resource management layer on top of Linux and on MaRTE OS [131].

Virtual Memory

In systems supporting virtual memory, applications can use more memory than is physically available. This is done using swap space, i.e. memory can be swapped out from main memory to permanent storage. If a task is swapped out and gets scheduled to run again, it must first be swapped back into main memory. This is unacceptable for real-time tasks, since it introduces latencies usually orders of magnitude bigger than their deadlines. There are two solutions for this; one is to disable swapping completely.
This is a good approach for embedded systems, where the set of running tasks is largely known, so it can be checked whether there is in general enough memory available to implement the required features of the device. For a general-purpose system, where one of the tasks may be a real-time task, disabling swapping for the whole system is not an option: the user expects to be able to run all kinds of applications, and this can require the availability of swap space. All general-purpose operating systems provide an API to obtain memory which is locked in main memory and will never be swapped to secondary storage; on POSIX systems this is the mlock() function [160]. Basically all data which is accessed in time-sensitive code should be locked into main memory so that it is always available when required.

Cache management

Access from the CPU to main memory is more and more becoming a performance bottleneck. The latencies and transfer rates of main memory are much slower than what a CPU can handle, so the CPU has to wait in so-called wait states when accessing main memory. To optimize memory access, all modern CPUs (except low-end microcontrollers) have one or more caches, which speed up access to main memory by keeping recently used data in a much faster CPU-local memory. There can be separate caches for data and instructions, and there can be multi-level caches. While this significantly improves average performance, it also makes the timing behaviour of tasks less deterministic, since the execution time now also depends on the history, i.e. the contents of the cache. Switching between tasks will often push some of the cached data or code of the previous task out of the cache, so the next time that task runs it will initially be slowed down because the cache has to fetch the data from main memory again. Therefore, tasks affect each other even if CPU scheduling algorithms which enforce CPU bandwidth reservations are used. No current operating system supports explicit management of the cache, beyond enabling and disabling it completely.

One line of research in this field is software-controlled cache filling. Usually the cache is filled on demand, i.e. when cache misses occur. With software-controlled caches, the cache is locked, i.e. its own reload functionality is disabled, and instead the software takes care of loading the cache with contents. Hence it is known when and with which contents the cache will be loaded, which makes predicting the execution time easier and the timing behaviour more deterministic [161]. Another approach is to split the cache into areas which can be reserved for use by specific tasks. While this removes some of the nondeterminism introduced by caches, it also removes some of the actual benefits of a cache [162].

Bus

As just mentioned, the access of the CPU to main memory is a bottleneck, but there are more peripherals which need to transfer data. These peripherals can include displays, network controllers, image sensors, FPGAs and other devices. For instance, on architectures where the display does not have dedicated memory, it must be constantly refreshed from main memory; for a typical display size, colour depth and a 60 Hz refresh rate this amounts to about 9 MB/s. This bandwidth is then constantly occupied on the bus just to keep the display running. Considering e.g.
additional data transfers from a camera module to memory, it becomes obvious that the bus which connects the components plays a central role and can become a bottleneck. Additionally the mentioned examples have tight timing constraints, e.g. if image data from the camera module is not transferred in time, the data is simply lost. In [163] memory streams are analyzed and three categories are defined describing their timing requirements: hard real-time streams, soft real-time streams and low-latency streams. Using a bus arbiter inspired by the sporadic server, an algorithm is presented to find arbitration settings which satisfy the timing requirements of all memory streams.

Energy Consumption and Heat Dissipation

Especially in mobile devices which run on batteries, preserving energy is important; a longer operating time can be a significant selling point for devices like cameras, mobile phones, etc. Modern CPUs or SoCs for embedded devices can run at frequencies up to 1 GHz (e.g. Marvell XScale: 800 MHz, ARM Cortex A8: > 1 GHz). Because higher frequencies also mean higher energy consumption, these CPUs can typically be clocked down to lower speeds in order to save energy [164] [165]. This technology is present in many architectures under different names: AMD calls it PowerNow, Intel SpeedStep for desktop CPUs and Wireless SpeedStep for the ARM-based XScale processors [166]. These technologies are well supported in general-purpose operating systems. For real-time systems this is a more challenging task. Real-time tasks have to meet deadlines, and running

the CPU at a lower speed will increase their computation time, so schedulers have to take care of multiple things:

- meeting the deadlines of real-time tasks
- saving energy by switching to lower frequencies
- the overhead involved in switching frequencies
- the trade-off between saving energy in the CPU and therefore having to keep other devices active for a longer time

There is research work on the topic of power-saving scheduling, e.g. in [167] [168]. The focus is to schedule tasks in such a way that they can still finish before their deadlines, while at the same time running the processor as slowly as possible to save energy. The overhead for switching between different frequencies is considered insignificant if it happens only a few times per second. However, the CPU is not the only power consumer in an embedded device. There are also memory and other peripherals which consume power when they are active, e.g. displays, image sensors and radios. To reduce the overall power consumption of a system these devices must also be put into power-saving modes. Recent research shows that running a processor at the lowest possible speed does not automatically result in the least power consumption: if the processor runs at a lower speed, other peripherals such as the memory have to work in full-power mode for a longer time. This increased power consumption can outweigh the savings achieved by running the CPU at a lower speed [169]. Currently there are no deployed real-time operating systems with energy-aware real-time schedulers. Instead, real-time operating systems offer frameworks for implementing application-dependent power management schemes, e.g. QNX and eCos [170] [129]. The FRESCOR project provides an API for defining execution times, e.g. for different processor speeds.

With the increasing frequencies of processors for mobile devices, heat dissipation also becomes a bigger problem.
This leads to multiple problems: the device gets warmer, which may simply feel bad if it is a handheld device. Some components, e.g. image sensors, produce different results when operated at different temperatures, due to increased noise. If fans become necessary, the device gets louder and is not sealed anymore, so dust can get in and getting electromagnetic compatibility right becomes harder; in the worst case heat may lead to mechanical damage.

Permanent Storage

Permanent storage in embedded systems ranges from regular hard disks to flash memory, which can be e.g. soldered flash chips or regular CF or SD cards. Increasingly, solid state disks are used instead of rotating disks, since they consume less energy and are mechanically more robust. The storage can be used for different purposes, with and without real-time constraints. Continuously reading or writing data streams, e.g. recording video streams, has strict timing constraints: if deadlines are missed, buffers will overflow and frames will be missing or corrupted. On the other hand, if a task reads or writes data like settings, icons, or e.g. lookups in a database containing music metadata, there are no hard deadlines; it should just be fast enough that the user experiences smooth interactive behaviour.

When accessing a permanent storage device, not only the storage device itself is occupied, but also CPU time and the bus. If multiple applications access a storage device at the same time, they have to share the available transmission rate of that device. This slows down accesses and influences the execution time of applications, even if the CPU scheduler uses isolation techniques like bandwidth reservation.

The support for permanent storage varies widely between operating systems. While general-purpose operating systems offer advanced filesystem and I/O drivers with sophisticated disk-scheduling and caching algorithms, small real-time operating systems in particular offer no or only basic functionality, e.g. FreeRTOS, eCos or Segger embOS. Support for filesystems or permanent storage is often available as separate portable components, which are then usually portable between different processors and operating systems, e.g. Segger emFile. Due to this weak integration with the operating system, advanced features like per-task accounting or reservations are not supported, and available hardware resources like DMA channels are not always used optimally.

Disk Scheduling

General-purpose operating systems strive to achieve high data throughput and low latencies. Therefore multiple disk scheduling algorithms have been developed, which determine in which order requests to the disk are processed. Depending on physical characteristics, different strategies are preferable; e.g. with spinning hard disks, head movement is expensive in terms of time, so it should be minimized. Well-known disk scheduling algorithms based on this principle are the Elevator (a.k.a. SCAN) and Cyclic Elevator algorithms [171]. While these algorithms try to get the best performance out of drives with rotating disks, they do not guarantee bounded service times.
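The cyclic elevator ordering can be sketched compactly. This is a minimal sketch of the request ordering only; it ignores requests arriving during a sweep.

```python
def cyclic_elevator(head, tracks):
    """Cyclic elevator (C-SCAN) ordering: serve all requests at or
    ahead of the head in increasing track order, then jump back to
    the lowest pending track and sweep upward again."""
    ahead = sorted(t for t in tracks if t >= head)
    behind = sorted(t for t in tracks if t < head)
    return ahead + behind        # one upward sweep, then wrap around
```

The one-directional sweep keeps total head movement short and, unlike plain SCAN, treats requests at both ends of the disk similarly; still, as noted above, no bound on the service time of an individual request follows from this.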
In the Linux/RK resource kernel, Just-In-Time disk scheduling is presented, which combines elevator scheduling with deadline-based scheduling of disk requests. As long as read requests have some slack time left, i.e. their deadline is not yet imminent, requests which are closer to the current position of the disk head are served; otherwise the request with the earliest deadline is served [172]. The current Linux kernel comes with three different I/O schedulers; one of them is the Deadline Disk Scheduler, which seems to be very similar to Just-In-Time scheduling [173] [174]. This scheduler uses the cyclic elevator principle, but adds deadlines to it. Each request has a deadline, and if the deadline is about to expire, this request is preferred over the ones which would be next in the plain cyclic elevator order. It is not the default Linux scheduler, since the average behaviour of the Anticipatory Scheduler is better, and for a general-purpose operating system this is considered more important. The FRESCOR project also offers support for disk bandwidth guarantees, but this is still in the planning stages [143].

As noted above, with the increasing use of flash-based devices without moving parts, adapted algorithms are required. These algorithms do not have to take head movements into account, but should try e.g. to transfer data in big contiguous blocks, since in burst mode usually much higher data rates can be achieved than with random small accesses.
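The slack-based switch between seek-optimal order and deadline order used by such schedulers can be sketched as follows. This is our simplification of the idea, not the actual Linux/RK or Linux kernel implementation; parameter names are ours.

```python
def next_request(head, now, requests, slack_limit):
    """Pick the next disk request to serve.  requests is a list of
    (track, deadline, service_time) tuples.  Prefer the request
    closest to the head (elevator-style), but switch to earliest-
    deadline-first as soon as some request runs low on slack."""
    urgent = [r for r in requests
              if r[1] - now - r[2] <= slack_limit]       # little slack left
    if urgent:
        return min(urgent, key=lambda r: r[1])           # EDF among urgent
    return min(requests, key=lambda r: abs(r[0] - head)) # seek-optimal
```

While no deadline is close, the head serves nearby tracks cheaply; as soon as a request's slack drops below the limit, deadline order takes over, bounding how long a distant request can be postponed.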

Other Issues

Another issue is the timing behaviour of the filesystem. This comprises e.g. the timing behaviour of operations like creating a new file or appending data to a file, which is quite common when recording multimedia streams. For deterministic performance of the application it is necessary that these operations also behave deterministically. If the operating system supports disk caches, accessing the permanent storage will use the caches. Here we have a similar effect as with the CPU cache: the disk cache makes the read and write timing behaviour much more nondeterministic. Tasks can push cached data of other tasks out of the cache. The write timing can also become very nondeterministic through cached writing: while the write cache is not yet full, written data will go into the cache; once the cache management decides that the contents have to be written to disk, writing will become much slower. The general advice is not to use disk I/O in time-sensitive code, or at least to add enough buffering to relax the timing constraints.

Multi-Resource Management

As can be seen from the previous sections, tasks influence each other through the use of multiple shared resources. Isolating them from each other in just one resource only goes so far; if a task has e.g. 50 % CPU bandwidth guaranteed but cannot progress at all because it suffers from starvation in disk I/O, that does not help very much. Improvements can be achieved by coordinated multi-resource management.

FRESCOR

The FRESCOR project, mentioned in each of the previous sections, investigates multi-resource management. One purpose is to provide an easy-to-use API which gives developers access to advanced resource management technologies. On the one hand this can increase the real-world usage of these technologies, on the other hand it abstracts implementation details away, so that applications can be portable across different operating systems with different scheduler implementations.
A basic concept in FRESCOR is the contract. Applications negotiate contracts with the resource management framework. If a contract is agreed upon, the application can rely on the availability of the resource as specified in the contract. The FRESCOR framework is implemented on Linux kernel 2.6.

Hola-QOS

The Hola-QOS project is quite similar in scope; it also aims at developing a resource management framework for the various resources, using a homogeneous layered architecture. It is kept flexible by allowing different management algorithms to be loaded as exchangeable modules.

Linux/RK

Linux/RK is a resource kernel: an extension of the Linux kernel which focuses on resource management. The idea is to offer applications a common API which is portable across multiple operating systems. CPU bandwidth reservations and advanced disk scheduling mechanisms are implemented.

Nemesis

Nemesis is an operating system whose main focus is resource management. The operating system kernel itself is minimal, containing mainly interrupt handling etc. Tasks can be grouped into domains, for which resources can be reserved. Most operating system services are executed in user space: the functionality is provided by libraries which are executed in the context of the application that uses them. This makes it easier to account operating system services to the processes which actually use them, and also makes it possible to implement custom resource management schemes per process [175].

4.4 Adaptive Algorithms

If the available resources for an application change, the application should be able to adapt. If more resources become available, it should be able to deliver a higher-quality result; if fewer resources are available, it should successfully produce a lower-quality result instead of failing to produce a high-quality one. Adaptive algorithms are required to achieve this goal. These algorithms trade output quality against resource requirements. If no compromises in output quality are acceptable, then the application is not a candidate for adaptivity and always has to run with the full required resources.

Anytime Algorithms

Anytime algorithms are algorithms which provide a useful result after any amount of execution time; the longer they can execute, the better the result. Imprecise Computation algorithms are a subtype of anytime algorithms. These algorithms require a minimum execution time in which they produce a baseline result. If more time is available, optional steps of the algorithm are executed, resulting in more precise results [176].

Multi-Version Algorithms

Another way to achieve adaptability in applications are multi-version algorithms. The idea is straightforward: provide multiple implementations of the same functionality, e.g.
a fast low-quality version and a slow high-quality version, and then switch between them depending on the currently available resources.

Example: Quality Aware Frame Skipping

An MPEG stream consists of three different frame types. I-frames (intra-coded) contain the complete information for the current frame, P-frames (predicted) contain only the differences relative to the preceding reference frame, and B-frames (bi-directionally predicted) require both the preceding and the following reference frame to be available for decoding. These frames form GOPs (groups of pictures), usually consisting of around 12 frames, where each GOP starts with an I-frame followed by P- and B-frames.

Quality Aware Frame Skipping (QAFS) is an algorithm to adapt the decoding of MPEG-2 video streams to the currently available resources [177]. Without special measures, a system under overload will effectively drop random video frames: a frame is decoded, consuming CPU time and energy, but is then discarded because it finished too late. If it was the I-frame of a GOP, this means the whole GOP is useless, because without the I-frame no frame at all can

be decoded. Quality Aware Frame Skipping improves on this by selectively skipping frames. Priorities are assigned to the frames depending on their type (I-, P- or B-frame), their position in the GOP and some heuristics such as the frame size. If not enough processing time is available to decode all frames, it starts skipping, i.e. not decoding at all, the frames with the lowest priorities. This makes sure that CPU time is not wasted on calculating useless results and that the skipped frames have the lowest possible negative impact on the perceived image quality.

Example: Decoding Complexity Prediction by Coding Statistics

In QAFS the decoding time is roughly estimated, e.g. based on the frame size. A more exact decoding-time prediction can be achieved by using Decoding Complexity Prediction by Coding Statistics (DCPCS) for MPEG-4 videos [178]. Here the support of the video encoder is required. The decoding process is split up into a discrete number of processing steps, which should be easy to map to actual hardware or software implementations. The encoder knows which steps the decoder will have to perform in order to decode the video and stores the numbers of required steps in the encoded video stream. The decoder can then read these numbers and calculate the exact decoding time it will need for a frame, taking the actually available resources and implementation into account.

4.5 Relation to Work Packages and Assessment

Relations to WP1

Work package 1 is concerned with analyzing and modelling actor networks.

Task 1.3: Actor networks are simulated and profiled. The resulting information about token consumption and production, as well as hints about potential parallel execution, may provide useful metadata for the resource management. This may be used for measuring utilization, allocating threads, etc.

Task 1.4: The ACTORS model compiler will be developed. It will try to identify statically schedulable regions in the actor network, and also parallel regions.
This knowledge should be available in the ACTORS executable at runtime, to be used for scheduling and thread allocation decisions.

Task 1.5: The current abstract Quality of Service settings have to be "translated" into application-specific actions and settings. Also, the currently realized Quality of Service has to be measured, "translated" and communicated to the outside. This is closely connected to Task 3.1.

Relations to WP2

In work package 2 the CAL-to-ARM compiler and the accompanying runtime environment are developed. Adaptive resource management is related to this work package insofar as some of the components will be implemented in CAL and so will have to run in the runtime environment.

Relations to WP3

Work package 3 creates the resource management framework, which on one side interfaces with the services provided by the operating system and on the other side interfaces with the applications, both CAL and non-CAL applications.

Task 3.1: This will define the interface between the resource management and CAL applications. A way to communicate resource management parameters to CAL applications has to be defined. This may use abstractions as in the MATRIX project, i.e. provide a set of resource levels at which the application can work with good results. These levels must make sense for the application and must be in a form suitable for use in the feedback control.

Task 3.2: This task deals with feedback control modelling, which has a dedicated chapter in this document.

Task 3.3: The resource management framework is implemented in this task. The solutions from MATRIX, FRESCOR, Litmus and the Linux Containers/Control Groups are candidates for use in ACTORS. The situation in ACTORS is quite different from that in MATRIX: in MATRIX there is a wireless network of heterogeneous nodes dynamically joining and leaving, whereas in ACTORS the set of components is fixed, their properties are well known (compared to MATRIX) and there are no communication bottlenecks between them. Nevertheless, the idea of defining discrete resource levels for applications should be investigated for use in ACTORS. FRESCOR aims to provide an API for managing basically all resources which could be of interest in ACTORS, so it may make sense to reuse results from there. A downside is that the FRESCOR solution still has to be ported to the current Linux kernel (2.6.25) and that it does not support multicore scheduling. The Litmus project provides multicore scheduling, so it is an interesting option as a basis for the multicore-related work in ACTORS. Downsides are that it also needs to be ported to the current Linux kernel (2.6.25) and that it is currently not integrated with FRESCOR. This should be possible, since FRESCOR is intended to hide implementation details behind its high-level API, so the backend should be replaceable.
The Linux Containers/Control Groups are a mechanism to measure and also to limit the resource usage of groups of processes. Although originally intended for server virtualization, they may be very interesting for ACTORS. An advantage is that they are already partly integrated in the mainline kernel, so we would stay close to the mainline. This would reduce the amount of maintenance work required to keep the ACTORS solution working with new kernel versions, and it would help when actually using the ACTORS solution in products. Linux/RK seems not to be very actively developed.

Relations to WP4

This work package handles the resource reservation algorithms.

Task 4.1: The reservation schemes in this task are handled in a dedicated chapter of this document.

Task 4.2: Reservation schemes will be matched with the control abstractions. This is closely related to tasks 3.1 and 1.5. Here the interface of the resource management to the services offered by the operating system will be defined.

Tasks 4.3 and 4.4: The multicore aspect in resource management; see Relations to WP3, task


Chapter 5

Resource Reservation in Real-Time Systems

Reservation-based resource partitioning, resource reservation (RR) for short, is an emerging paradigm for resource management in embedded systems with timing requirements. We strongly believe that, sooner or later, this paradigm will become a standard for mainstream Real-Time Operating Systems. If predictability is the main goal of a system, traditional real-time scheduling theory can be successfully used to verify the feasibility of the schedule under worst-case scenarios. However, when efficiency becomes relevant and when the worst-case parameters of the tasks are too pessimistic or unknown, the hard real-time approach presents some problems. In particular, task overruns can cause temporary or permanent overload conditions that may degrade system performance in an unpredictable fashion. A number of techniques have been proposed in the literature to handle transient overload conditions due to task overruns. They are aimed at providing temporal protection among tasks, meaning that, if a task overruns, only the task itself should suffer the possible consequences. In this way, the effect of the overrun is confined, so that each task can be analyzed in isolation. After an introduction to the problem, we will present two different classes of algorithms for providing temporal protection: algorithms based on the fairness property, often referred to as proportional share algorithms, and algorithms based on resource reservation. Finally, we will describe some operating systems that provide resource reservation mechanisms.

5.1 Application Domains

There is a need for resource reservation in many application domains, from critical hard real-time domains such as aerospace applications to softer real-time domains like multimedia. Aerospace is a traditional domain for hard real-time systems.
The aerospace community has identified the need for temporal protection and has introduced it as part of the Avionics Application Software Standard Interface (APEX) [179], one of the standards in Integrated Modular Avionics (IMA) [180]. APEX introduces the concept of a partition. Partitions are (subsystems that occupy) distinct parts of physical memory. Partitions contain processes. Inter-process communication is allowed within and between partitions. Each partition has a number of properties, such as its criticality level, period, duration, and lock or pre-emption level. Partitions are scheduled by a cyclic schedule, which has been created off-line and takes the

above-mentioned properties into account. Non-critical partitions are not allowed to consume processing resources outside their slots in the cyclic schedule.

Multimedia is becoming a major player in many application domains, ranging from consumer electronics to military equipment. The underlying media and signal processing is very data intensive and thus places very high demands on the system. Nevertheless, media and signal processing is gradually moving from dedicated hardware to programmable hardware (in various degrees of programmability) in combination with software. Multimedia processing often shows high load fluctuations and a large gap between average- and worst-case processing requirements, but this is compensated by the fact that multimedia allows a trade-off between output quality (image quality, output latency, output frequency, output jitter, motion fidelity, etc.) and resource usage. Consumer devices have the additional requirements of low cost and short time to market on the one hand, and robustness and predictability on the other. Guaranteed reservations are very well suited to address robustness, predictability, and short time to market (reuse and independent development), and provide a good basis for achieving low cost by taking advantage of the quality-for-resource trade-off [181, 182].

Real-time control systems (such as robotics) have always been the domain of hard real-time scheduling. A recent study [183] has shown that the use of resource reservation techniques can bring advantages in terms of control performance. Designing the system for the worst case guarantees the absence of deadline misses, but can impose a low control loop rate. On the other hand, designing the system for the average case allows increasing the control loop rate, and thus improving the control performance on average, but can lead to deadline misses of critical activities.
Resource reservation techniques make it possible to calibrate the resource usage of the different activities. For example, in a complex control system with multirate sensor acquisition that uses a video camera for pattern recognition, the most critical low-level control loop can be treated as a hard activity, which should be assigned an amount of resource equal to its worst-case requirement. The less critical activities, like image acquisition and recognition, can be assigned a fraction of the resource corresponding to their average-case requirements, thus increasing their rate [184, 183].

5.2 Problems without temporal protection

To introduce the problems that can occur when traditional real-time algorithms are used to schedule soft real-time tasks, consider a set of two periodic tasks τ_1 and τ_2, with T_1 = 3 and T_2 = 5. If C_1 = 2 and C_2 = 1, the task set can be feasibly scheduled (in fact, 2/3 + 1/5 = 13/15 < 1), and the schedule produced by EDF is illustrated in Figure 5.1. If, for some reason, the first instance of τ_2 increases its execution time to 3, then τ_1 misses its deadline, as shown in Figure 5.2. Notice that τ_1 suffers from the misbehavior of another task (τ_2), although the two tasks are independent. This problem is not specific to EDF, but is inherent to all scheduling algorithms whose guarantee relies on worst-case execution times (WCETs). For instance, Figure 5.3 shows another example in which two tasks, τ_1 = (2, 3) and τ_2 = (1, 5), are feasibly scheduled by a fixed priority scheduler (where the tasks have been assigned priorities according to the rate monotonic priority assignment). However, if the first instance of τ_1 increases its execution from 2 to 3 units of time, then the first instance of τ_2 will miss its deadline, as shown in Figure 5.4. Again, one task (τ_2) suffers from the misbehavior of another task (τ_1).
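The EDF example above can be checked with a small discrete-time simulation. The sketch below is our own illustration (the helper `edf_simulate` is hypothetical, not part of the deliverable): in each unit slot it runs the ready job with the earliest absolute deadline, optionally overriding the execution time of selected job instances, and reports any deadline misses.

```python
def edf_simulate(tasks, horizon, overrides=None):
    """tasks: list of (C, T) pairs; overrides: {(task_idx, job_idx): exec_time}.
    Returns a list of (task_idx, job_idx, deadline) for jobs that miss."""
    overrides = overrides or {}
    jobs = []    # each job: [absolute_deadline, task_idx, job_idx, remaining]
    misses = []
    for t in range(horizon):
        # release new jobs at period boundaries (implicit deadline = next release)
        for i, (C, T) in enumerate(tasks):
            if t % T == 0:
                k = t // T
                jobs.append([t + T, i, k, overrides.get((i, k), C)])
        # a job still holding work at its deadline has missed; drop it
        for j in jobs:
            if t >= j[0] and j[3] > 0:
                misses.append((j[1], j[2], j[0]))
                j[3] = 0
        jobs = [j for j in jobs if j[3] > 0]
        # EDF: serve the pending job with the earliest absolute deadline
        if jobs:
            jobs.sort()
            jobs[0][3] -= 1
            if jobs[0][3] == 0:
                jobs.pop(0)
    return misses

tasks = [(2, 3), (1, 5)]
print(edf_simulate(tasks, 15))               # → []  (nominal case: feasible)
print(edf_simulate(tasks, 15, {(1, 0): 3}))  # → [(0, 1, 6)]  (τ_1 misses)
```

With the nominal execution times no deadline is missed over the hyperperiod; when τ_2's first job runs for 3 units instead of 1, τ_1's second instance misses its deadline at time 6, matching the behavior described in the text.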

Figure 5.1: A task set schedulable under EDF.

Figure 5.2: An instance of τ_2 executing for too long can cause a deadline miss in τ_1.

Figure 5.3: A task set schedulable under RM.

Notice that, under fixed priority scheduling, a high priority task (τ_1 in the example) cannot be influenced by a lower priority task (τ_2). However, task priorities do not always reflect importance and are often assigned based on other considerations, such as schedulability, as in the rate monotonic assignment. If importance values are not related to task rates, assigning priorities to tasks is not trivial if a high schedulability bound has to be reached. For some specific task sets, schedulability can be increased by applying a period transformation technique [185], which basically splits a task with a long period into smaller subtasks with shorter periods. However, playing with priorities is not the best approach to follow, and the method becomes inefficient for large task sets with arbitrary periods. The examples presented above show that when a real-time system includes tasks with variable (or unknown) parameters, some kind of temporal protection among tasks is desirable.

Definition 1 The temporal protection property requires that the temporal behavior of a task is not affected by the temporal behavior of the other tasks running in the system.

In a real-time system that provides temporal isolation, a task executing too much cannot cause the other tasks to miss their deadlines. For example, in the

case illustrated in Figure 5.2, if temporal protection were enforced by the system, then the task missing the deadline would be τ_2.

Figure 5.4: An instance of τ_1 executing for too long can cause a deadline miss in τ_2.

Temporal protection (also referred to as temporal isolation, temporal firewalling, or bandwidth isolation) has the following advantages:

- it prevents an overrun occurring in a task from affecting the temporal behavior of the other tasks;
- it allows partitioning the processor among tasks, so that each task can be guaranteed in isolation (that is, independently of the other tasks in the system) based only on the amount of processor utilization allocated to it;
- it makes it possible to provide different types of guarantee to different tasks, for example, a hard guarantee to one task and a probabilistic guarantee to another;
- when applied to an aperiodic server, it protects hard tasks from the unpredictable behavior of aperiodic activities.

Another important thing to notice is that, if the system is in a permanent overload condition, some high-level action must be taken to remove the overload. The techniques described in this chapter act at a lower level: they introduce temporal protection and allow the detection of the failing tasks. If the system detects that some task is in a permanent overrun condition, then some of the techniques presented in the previous chapter (for example, the elastic task model, the RED algorithm, etc.) can be applied, either by removing some task or by degrading the quality of the results of the application. Again, consider the example of Figure 5.4: if all instances of τ_1 execute for 3 instead of 2 units of time, the overload is permanent and would prevent the execution of τ_2.
In this case, the overload could be removed, for example, by enlarging task periods according to the elastic task model, by rejecting some task based on some heuristic, or by reducing the computation times by degrading the quality of the results according to the imprecise computation model.

5.3 Providing temporal protection

Several algorithms have been presented in the literature to provide some form of temporal protection, both in processor scheduling and in network scheduling¹. To

¹ In this chapter, for the sake of simplicity, we will use the terminology related to processor scheduling. Many of the properties and characteristics of the algorithms explained here are also applicable, with some differences, to the scheduling of other system resources. When necessary, we will specify the differences.

help distinguish the various algorithms and their characteristics, they have been categorized according to the taxonomy illustrated in Figure 5.5.

Figure 5.5: Schedulers providing temporal protection (a taxonomy: fair scheduling, comprising P-fair scheduling and proportional share scheduling such as GPS, WFQ, SFQ, and EEVDF, versus reservations under fixed or dynamic priorities).

The temporal protection property is also referred to as the temporal isolation property, since many authors stress the fact that each task is isolated from the others. In general, the idea is to assign each task a share of the processor, and the task is guaranteed to obtain at least its share. The class of algorithms providing temporal protection can be divided into two main classes: the class of fair scheduling algorithms and the class of resource reservation algorithms.

Fair scheduling

The class of fair share scheduling algorithms is based on a theoretical model that assumes a fluid allocation of the resources. In some cases (as in P-fair algorithms) each task is directly assigned a share of the processor, whereas in other cases (as in proportional share algorithms) each task is assigned a weight, from which a share is derived in proportion to the task weight. In both cases, the objective is to allocate the processor so that in every interval of time each task receives precisely its share of the processor. Notice that such an objective cannot be realized in practice, since it would require an infinitely divisible resource: no matter how small the interval is, each task should receive its share of the processor, but the minimum time granularity of one processor is given by the clock. As a consequence, any implementable algorithm can only approximate the ideal one. A theoretical algorithm based on the ideal fluid resource allocation model is the Generalized Processor Sharing (GPS) algorithm, which will be presented in Section 5.4.
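To make concrete how far a quantum-based schedule can drift from the ideal fluid allocation, the following sketch (our own illustration; `max_lag` is a hypothetical helper) computes, with exact rational arithmetic, the worst deviation between the service delivered by a given sequence of unit quanta and the fluid share allocation for always-backlogged tasks:

```python
from fractions import Fraction

def max_lag(weights, schedule, quantum=1):
    """Worst absolute deviation of a quantum schedule from the fluid model
    in which each always-backlogged task continuously receives the share
    w_i / sum(w).  `schedule` lists the task index served in each quantum."""
    total = sum(weights)
    served = [Fraction(0)] * len(weights)
    worst = Fraction(0)
    for n, i in enumerate(schedule, start=1):
        served[i] += quantum
        t = n * quantum  # wall-clock time after n quanta
        for j, w in enumerate(weights):
            fluid = Fraction(w, total) * t  # ideal fluid service up to t
            worst = max(worst, abs(served[j] - fluid))
    return worst

# A weighted round robin approximating weights 3:1 with unit quanta:
print(max_lag([3, 1], [0, 0, 0, 1] * 3))   # → 3/4
```

Within each round the deviation grows up to 3/4 of a quantum and returns to zero at round boundaries; shrinking the quantum shrinks this deviation proportionally, at the cost of more context switches, which is exactly the trade-off discussed in this section.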
GPS is mainly used for evaluation purposes, to verify how closely an algorithm approximates the fluid model. A parameter that can be used to measure how closely a realistic algorithm approximates the ideal one is the lag. For each task, the lag is defined as the difference between the execution time actually assigned to the task by the realistic algorithm and the amount of time assigned by the ideal fluid algorithm. Hence, the objective of a fair scheduler is to keep the lag in an interval as close as possible to 0.

Most of the algorithms belonging to this class divide the time line into intervals of fixed length, called quanta, with the constraint that only one task per processor can be executed in each quantum. The idea is to approximate the fluid allocation with small discrete intervals of time. We can further divide the class of fair scheduling algorithms into P-fair scheduling and proportional share algorithms. The main difference lies in how the processor share is assigned to tasks. In proportional share scheduling, each task is assigned a weight w_i, and it receives a share of the processor equal to:

F_i = w_i / Σ_{j=1}^{N} w_j

where N is the number of tasks. If the number of tasks does not change during the system lifetime (i.e., no new tasks are allowed to dynamically join the system, nor can tasks leave the system), then the task share is a constant. However, if tasks are allowed to dynamically join the system, task shares can change. If this change is not controlled, the temporal isolation property is broken: a new task joining the system can request a very high weight, considerably reducing the share of the existing tasks. Therefore, if we want to provide temporal guarantees in proportional share schedulers, an admission control policy is needed to check whether, after each insertion, each task still receives at least the desired share of the processor, and to re-weight the tasks in order to achieve the desired level of processor share. Proportional share algorithms can provide temporal protection only if complemented with proper admission control algorithms and re-weighting policies.

In P-fair scheduling, instead, each task is assigned a weight w_i, with Σ_{i=1}^{N} w_i ≤ M, where N is the number of tasks and M is the number of processors. Since the weights are already normalized, each task receives a share of the system equal to its weight. Therefore, the admission control test simply checks that the sum of all weights does not exceed the number of processors.

Proportional share algorithms were initially presented in the context of network scheduling, where the concept of a task is replaced by the concept of a packet flow.
A network link is shared among different flows; each flow is assigned a weight, and the goal is to allocate the bandwidth of the link to the different flows in a fair manner, so that each flow receives a share proportional to its weight. The same algorithms have also been applied in the context of processor scheduling. One difference between network scheduling and processor scheduling is that in network scheduling the basic scheduling unit is the packet: a packet must be transmitted entirely and cannot be divided into smaller units. Hence, there is no need to specify a scheduling quantum; the length of the packet is itself the scheduling quantum. The problem becomes slightly more complex if packets have different lengths. In Section 5.5, we present some of the most popular fair scheduling algorithms in the context of processor scheduling.

Resource reservation

The class of resource reservation algorithms consists of algorithms derived from classical real-time scheduling theory. The first algorithms, generically called aperiodic servers, were proposed to schedule aperiodic soft real-time tasks together with periodic hard real-time tasks. The goal was to minimize the response time of aperiodic tasks without jeopardizing the hard real-time tasks. Aperiodic server algorithms were proposed both for fixed priority scheduling and for dynamic priority scheduling. In fixed priority scheduling, the main algorithms

are the Polling Server, the Deferrable Server (DS), and the Sporadic Server (SS) [186, 187, 188]. In dynamic priority scheduling, the most important algorithms are the Total Bandwidth Server (TBS) [189, 190] and the Constant Bandwidth Server (CBS) [184, 137]. An approach similar to the server algorithms was applied for the first time to soft real-time multimedia applications by Mercer et al. [191], with the explicit purpose of providing temporal protection. Later, Rajkumar et al. [192] introduced the term resource reservation to indicate this class of techniques. In all the previously cited algorithms (with the exception of the TBS), a server is characterized by a budget Q and a period P. The processor share assigned to each server is Q/P. In the original formulation of these algorithms, one server was defined for the entire system, with the purpose of serving all aperiodic tasks in First-Come-First-Served (FCFS) order. The behavior of the server is similar to that of a periodic hard real-time task with a worst-case execution time equal to the assigned budget Q and a period equal to P. Hence, it is possible to apply the existing real-time scheduling analysis techniques to check the schedulability of the system. Resource reservation algorithms provide the temporal protection property. In one possible configuration, every task in the system (periodic or aperiodic, hard or soft real-time) is assigned a dedicated server with a share Q_i/P_i, under the constraint that:

Σ_{i=1}^{N} Q_i/P_i ≤ U_lub

where N is the number of tasks in the system and U_lub is the schedulability utilization bound, which depends on the adopted scheduling algorithm. Then, each task is guaranteed to obtain a budget Q_i every server period P_i, regardless of the behavior of the other tasks in the system. It is important to note that in the one-server-per-task configuration, the assumption of periodic or sporadic tasks can be removed.
For example, consider a non-real-time, non-interactive task (such as a complex scientific computation or the compilation of a large program). By assigning a server with a certain budget and period to this task, it will receive a steady and regular allocation of the processor, independently of the presence of other (real-time or non-real-time) tasks in the system. Resource reservation techniques will be described in detail in Section 5.6, and the Constant Bandwidth Server (CBS) [184, 137] will be presented in Section 5.7.

Before continuing the presentation of the different approaches to temporal protection, it is important to highlight the main differences between fair scheduling and resource reservation techniques. The main objective of a fair scheduler is to keep the lag between the task execution and the ideal fluid allocation as close as possible to zero. For this reason, in processor scheduling, these algorithms need to introduce the concept of a scheduling quantum, which is the basic unit of allocation. The smaller the quantum, the smaller the lag bound. However, a small quantum implies a large number of context switches. Moreover, once the scheduling quantum has been fixed for the entire system, each task is assigned one single parameter, the weight (or the share in P-fair schedulers). The granularity of the allocation depends on the scheduling quantum, while the share of the processor depends on the task weight. Therefore, if a task requires a very small granularity, we must reduce the scheduling quantum for the whole system, causing a large number of context switches and more overhead.

Conversely, the goal of a resource reservation algorithm is to keep the resource allocation under control so that a task can meet its timing constraints. To this end, each reservation is associated with two parameters, the budget Q and the period P. The period of the reservation represents the granularity of the allocation needed by the corresponding task, while the rate Q/P represents the share of the processor. Therefore, unlike fair schedulers, it is possible to select the most appropriate granularity for each task. If a task requires a very small granularity, its reservation period must be reduced accordingly, while the other reservations can keep a large period. In the general case, it is possible to show that the number of context switches produced by a reservation scheduler is considerably smaller than the number of context switches produced by a proportional share scheduler.

5.4 The GPS model

As explained above, temporal protection can be provided by adding an admission control mechanism to a fair scheduler. In fact, if a real-time task is assigned a sufficient amount of resources, it can execute at a constant rate while respecting its timing constraints. Executing each task τ_i at a constant rate is the essence of the Generalized Processor Sharing (GPS) approach [193, 194]. In this model, each shared resource needed by tasks (such as the CPU) is considered as a fluid that can be partitioned among the applications. Each task instantaneously receives a fraction f_i(t) of the resource at time t, where f_i(t) is defined as the task share. Note that the GPS model can be seen as an extreme form of a Weighted Round Robin policy. To compute the share of a resource, each task τ_i is assigned a weight w_i, and its share is computed as

f_i(t) = w_i / Σ_{τ_j ∈ Γ(t)} w_j

where Γ(t) is the set of tasks that are active at time t. Since each task consists of one or more requests for shared resources, tasks can block and unblock, and the set Γ(t) can vary with time.
Hence, the share f_i(t) is a time-varying quantity. The minimum guaranteed share is defined as the rate

F_i = w_i / Σ_{τ_j ∈ Γ} w_j

where Γ is the set of all tasks in the system. If an appropriate admission control is performed, it is possible to find an assignment of weights to tasks that guarantees real-time performance to all the time-sensitive applications. In fact, based on the task rate, the maximum response time of each task can be computed as C_i/F_i. The problem with the GPS model is that the task response time C_i/F_i and the task throughput F_i are not independent (using real-time terminology, this means that the relative deadline of a task is implicitly equal to its period). The GPS model describes a task system as a fluid flow system, in which each task τ_i is modeled as an infinitely divisible fluid and executes at a minimum rate F_i that is proportional to a user-specified weight w_i. For example, Figure 5.6 shows the ideal schedule of two GPS tasks, τ_1 and τ_2, with weights w_1 = 3 and w_2 = 1. Note that τ_2 is always active, whereas τ_1 is a periodic task with period T_1 = 8 and execution time C_1 = 3. At time t = 0, both tasks are active, hence they receive the shares f_1(0) = 3/(1 + 3) = 3/4 and f_2(0) = 1/(1 + 3) = 1/4. This means that the two tasks execute simultaneously, with τ_1 executing at 3/4 of the CPU speed, whereas τ_2

executes at 1/4 of the CPU speed. As a result, the first instance of τ_1 finishes at time C_1/f_1(0) = 3/(3/4) = 4, when τ_2 remains the only active task in the system and receives the share f_2(4) = 1. At time 8, τ_1 activates again and the schedule repeats as at time 0.

Figure 5.6: Ideal schedule of two GPS tasks. The height of a task's execution is proportional to the CPU speed it receives.

Note that the schedule represented in Figure 5.6 cannot be realized in practice, because the tasks execute simultaneously. According to the ideal GPS model, task τ_i is guaranteed to execute for an amount of time s_i(t_1, t_2) ≥ (t_2 - t_1) F_i in each backlogged interval [t_1, t_2]. More precisely, the amount of time s_i executed by task τ_i in an ideal GPS system is:

s_i(t_1, t_2) = ∫_{t_1}^{t_2} f_i(t) dt.

As a result, in the ideal fluid flow model, task execution can be described through the following GPS guarantee:

∀ τ_i active in [t_1, t_2]:  exec_i(t_1, t_2)/w_i ≥ exec_j(t_1, t_2)/w_j,  j = 1, 2, ..., n  (5.1)

where exec_i(t_1, t_2) is the amount of time actually executed by τ_i in the interval [t_1, t_2]. It can easily be seen that Equation 5.1 is equivalent to exec_i(t_1, t_2) = s_i(t_1, t_2).

5.5 Proportional share scheduling

Although the ideal GPS schedule cannot be realized on a real system, it can be used as a reference model to compare the performance of practical algorithms that attempt to approximate its behavior. In a real system, resources must be allocated in discrete time quanta of size Q, and such a quantum-based allocation causes an allocation error. Given two active tasks τ_i and τ_j, the allocation error in the time interval [t_1, t_2] can be expressed as

| exec_i(t_1, t_2)/w_i - exec_j(t_1, t_2)/w_j |.

An alternative way to express the allocation error is the maximum lag:

Lag_i = max_{t_1, t_2} | exec_i(t_1, t_2) - s_i(t_1, t_2) |.

Hence, a more realistic version of the GPS guarantee is the following:

| exec_i(t_1, t_2) - ∫_{t_1}^{t_2} f_i(t) dt | ≤ Lag_i
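The fluid reference schedule of Figure 5.6 can be reproduced numerically. The sketch below is our own illustration (`gps_two_tasks` is a hypothetical helper, and it assumes the busy interval C1/f1 fits within one period): it computes, with exact rational arithmetic, the finishing times of τ_1's instances and the total service received by the always-active τ_2.

```python
from fractions import Fraction

def gps_two_tasks(w1, C1, T1, w2, n_periods):
    """Ideal (fluid) GPS for the two-task example: tau1 is periodic with
    weight w1, execution time C1 and period T1; tau2 (weight w2) is always
    backlogged.  Assumes C1 / f1 <= T1, i.e. tau1 finishes within its period.
    Returns tau1's finishing times and tau2's total service."""
    f1 = Fraction(w1, w1 + w2)          # tau1's share while both are active
    finishes, tau2_service = [], Fraction(0)
    for k in range(n_periods):
        busy = Fraction(C1) / f1        # time tau1 needs when served at rate f1
        finishes.append(k * T1 + busy)
        # tau2 runs at share 1 - f1 while tau1 is active, then alone
        tau2_service += busy * (1 - f1) + (T1 - busy)
    return finishes, tau2_service

fin, s2 = gps_two_tasks(w1=3, C1=3, T1=8, w2=1, n_periods=2)
print(fin)   # each instance finishes 4 time units after its release
print(s2)    # tau2 receives 1 + 4 = 5 units per period, 10 over two periods
```

The computed finishing time of 4 time units per instance matches the text's value C_1/f_1(0) = 3/(3/4) = 4, and τ_2's service of 10 over 16 time units exceeds its guaranteed minimum of 16 · 1/4 = 4, as required by the GPS guarantee.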

Proportional Share (PS) scheduling was originally developed for handling network packets. It provides fairness among different streams by emulating the GPS allocation model in a real system, where multiple tasks do not run simultaneously on the same CPU but are executed using a quantum-based allocation. In other words, in a Proportional Share scheduler, resources are allocated in discrete time quanta of maximum size Q: a process acquires a resource at the beginning of a time quantum and releases it at the end of the quantum. To do this, each task τ_i is divided into requests q_i^k of size Q. Clearly, quantum-based allocation introduces an allocation error with respect to the fluid flow model. The minimum theoretical error bound is

H_{i,j} = (1/2)(Q_i/w_i + Q_j/w_j)

where Q_i is the maximum dimension of τ_i's requests and Q_j is the maximum dimension of τ_j's requests. An important property of PS schedulers (which derives directly from the GPS definition) is that they are work-conserving algorithms.

Definition 2 An algorithm is said to be work conserving if it ensures that the CPU is not idle when there are jobs ready to execute.

As we will see in the next sections, some algorithms providing temporal protection are not work conserving (for example, hard reservation algorithms). In the rest of this section, some of the most important PS scheduling algorithms are analyzed, showing how they emulate the ideal GPS allocation and evaluating their performance in terms of allocation error and lag.

Weighted Fair Queuing

The first known Proportional Share scheduling algorithm is Weighted Fair Queuing (WFQ), which emulates the behavior of a GPS system by using the concept of virtual time. The virtual time v(t) is defined by increments as follows:

v(0) = 0
dv(t)/dt = 1 / Σ_{τ_i ∈ Γ(t)} w_i.
Each quantum request q_i^k is assigned a virtual start time S(q_i^k) and a virtual finish time F(q_i^k) as follows:

S(q_i^k) = max{ v(r_{i,k}), F(q_i^{k-1}) }
F(q_i^k) = S(q_i^k) + Q_{i,k}/w_i

where r_{i,k} is the time at which request q_i^k is generated and Q_{i,k} is the request dimension (required execution time). Since Q_{i,k} is not known a priori (a task may release the CPU before the end of the time quantum), it is assumed to be equal to the quantum size Q (note that the quantum size is the same for all tasks, hence the index i can be removed). Task requests are scheduled in order of increasing virtual finish time, and the definitions presented above guarantee that each request completes before its virtual finishing time. Figure 5.7 shows an example of WFQ scheduling, with the same task set presented in Figure 5.6 and a quantum size Q = 1. The first quantum begins at time 0, hence its virtual start time is 0 for both tasks. Since the virtual finishing time of the first quantum is 0 + 1/3 = 1/3 for task τ_1 and 0 + 1/1 = 1 for task τ_2, the quantum is assigned to τ_1. The virtual start time of the second

quantum of task τ_1 is max{1/4, 1/3} = 1/3, hence F(q_1^2) = 1/3 + 1/3 = 2/3 and τ_1 is scheduled again. In the same way, S(q_1^3) = max{1/2, 2/3} = 2/3, and F(q_1^3) = 1. Since the virtual finishing times of the two tasks are now the same, both τ_1 and τ_2 can be scheduled at time t = 2: let us assume that τ_2 is scheduled. As a result, S(q_2^2) = max{3/4, 1} = 1 and F(q_2^2) = 2. Since F(q_1^3) < F(q_2^2), τ_1 is scheduled at time t = 3 and finishes its first instance at time t = 4. At this point, the virtual time changes its rate of increase to reflect the fact that τ_2 remains the only active task in the system (since w_2 = 1, dv(t) = dt). As a result, when τ_1 activates again at time t = 8, the virtual time v(8) = 5 is equal to the virtual finishing time F(q_2^5) = 5 of the latest quantum executed by τ_2. Hence, the virtual start times of the two competing quanta of τ_1 and τ_2 are the same (5), and the schedule repeats as at time 0.

Figure 5.7: WFQ schedule generated by the task set of Figure 5.6.

The WFQ algorithm is one of the first known PS schedulers, and it is the basis for all the other PS algorithms. In fact, most PS schedulers are modifications of WFQ that try to solve some of its problems. Some of the most notable problems of WFQ are:

- it needs a frequent recalculation of v(t);
- it does not perform well in dynamic systems (when a task activates or blocks, the fairness of the schedule is compromised);
- it assumes each request's size to be equal to the maximum value (the scheduling quantum); in real situations this assumption is not correct.

In general, the main difference among the various PS schedulers consists in the way they define the virtual time, or in some additional rule that can be used to increase fairness in some pathological situations.

Start Fair Queuing

Start Fair Queuing (SFQ) [195] is a proportional share scheduler that reduces the computational complexity of WFQ and increases fairness by using a simpler

definition of virtual time. The algorithm has been designed to hierarchically subdivide the CPU bandwidth among various application classes. Another difference with respect to WFQ is that SFQ schedules the requests in order of increasing virtual start time. The SFQ algorithm defines the virtual time v(t) as follows:

v(t) = 0, if t = 0
v(t) = 0, or any value, if the CPU is idle
v(t) = S(q_i^k), if request q_i^k is executing

SFQ guarantees an allocation error bound of 2 H_{i,j}, so it is nearly optimal. Moreover, SFQ calculates v(t) in a simpler way than WFQ (introducing less overhead) and does not need the virtual finish time of a request in order to schedule it, so it does not require any a priori knowledge of the request execution time (F(q_i^k) can be computed at the end of q_i^k's execution). A Proportional Share algorithm schedules the tasks so as to reduce the allocation error experienced by each of them; to provide some form of real-time execution, it is important to guarantee that lag_i(t) is bounded. SFQ and WFQ provide an optimal upper bound for the lag (max_t {lag_i(t)} = Q_i), but do not provide an optimal bound for the absolute value of the lag. For example, for SFQ this bound is max_t {|lag_i(t)|} = Q_i + f_i Σ_j Q_j, which depends on the number of active tasks.

Earliest Eligible Virtual Deadline First

In [196] the authors propose a scheduling algorithm, called Earliest Eligible Virtual Deadline First (EEVDF), that provides an optimal bound on the lag experienced by each task. EEVDF defines the virtual time as WFQ does and schedules the requests by virtual finish time (in this case called virtual deadline), but uses the virtual start time (called virtual eligible time) to decide whether a task is eligible to be scheduled: if the virtual eligible time is greater than the current virtual time, the request is not eligible. The virtual eligible and finish times are defined as follows:

S(q_i^k) = max{ v(r_{i,k}), S(q_i^{k-1}) + Q_{i,k-1}/w_i }
F(q_i^k) = S(q_i^k) + Q_{i,k}/w_i.
When a task joins or leaves the competition (i.e., activates or blocks), v(t) is adjusted in order to maintain fairness in a dynamic system. It can be proved that, although the EEVDF algorithm uses the concept of eligible time, it is still a work-conserving algorithm (in other words, if there is at least one ready task in the system, then there is at least one eligible task). The minimum theoretical bound guaranteed by EEVDF for the absolute value of the lag is Q; for this reason, EEVDF is said to be optimal. EEVDF can also schedule dynamic task sets and can use non-uniform quantum sizes, so it can be used in a real operating system. To the best of the authors' knowledge, EEVDF is the only algorithm that provides a fixed lag bound. If the lag is bounded, real-time execution can be guaranteed by keeping the share of each real-time task constant:

f_i(t) = (C_i + max_t{lag_i(t)}) / D_i
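The tag computations above can be condensed into a small sketch. This is an illustrative Python model, not the notation or code of [196]: the class and function names are invented, and per-request quanta and weights are passed in explicitly. It computes S(q_i^k) = max{v(r_{i,k}), S(q_i^{k-1}) + Q_{i,k-1}/w_i} and F(q_i^k) = S(q_i^k) + Q_{i,k}/w_i, and applies the EEVDF eligibility rule (a request is eligible only if its virtual start time does not exceed the current virtual time).

```python
from dataclasses import dataclass

@dataclass
class Task:
    """Per-task bookkeeping for virtual start/finish tags (hypothetical names)."""
    weight: float
    prev_start: float = 0.0    # S(q_i^{k-1}) of the previous request
    prev_quantum: float = 0.0  # Q_{i,k-1} of the previous request

    def tag_request(self, v_release, quantum):
        """Tag a new request: S = max{v(r), S_prev + Q_prev/w}, F = S + Q/w."""
        s = max(v_release, self.prev_start + self.prev_quantum / self.weight)
        f = s + quantum / self.weight
        self.prev_start, self.prev_quantum = s, quantum
        return s, f

def pick_eevdf(requests, v_now):
    """EEVDF selection rule: among eligible requests (those with virtual
    start time S <= v(t)), pick the one with the earliest virtual deadline F.
    Each request is a (S, F, payload) tuple."""
    eligible = [r for r in requests if r[0] <= v_now]
    return min(eligible, key=lambda r: r[1]) if eligible else None
```

For example, a task with weight 1/2 issuing back-to-back unit quanta at virtual time 0 gets tags (0, 2) and then (2, 4), so its second request becomes eligible only once v(t) reaches 2.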

5.6 Resource reservation techniques

A simple and effective mechanism for implementing temporal protection in a real-time system is to reserve for each task τ_i a specified amount of CPU time Q_i in every interval P_i. This general approach can also be applied to resources other than the CPU but, in this context, we will mainly focus on the CPU, because CPU scheduling is the topic of this book. Some authors [192] distinguish between hard and soft reservations.

Definition 3 A hard reservation is an abstraction that guarantees the reserved amount of time to the served task, but allows such a task to execute for at most Q_i units of time every P_i.

Definition 4 A soft reservation is a reservation guaranteeing that the task executes for at least Q_i time units every P_i, allowing it to execute more if there is some idle time available.

A resource reservation technique for fixed priority scheduling was first presented in [191]. According to this method, a task τ_i is first assigned a pair (Q_i, P_i) (denoted as a CPU capacity reserve) and is then enabled to execute as a real-time task for Q_i units of time every P_i. When the task consumes its reserved quantum Q_i, it is blocked until the next period, if the reservation is hard, or it is scheduled in the background as a non-real-time task, if the reservation is soft. At the beginning of the next period, the task is assigned another time quantum Q_i and it is scheduled as a real-time task until the budget expires. In this way, a task is reshaped so that it behaves like a periodic real-time task with known parameters (Q_i, P_i) and can be properly scheduled by a classical real-time scheduler. A similar technique is used in computer networks by traffic shapers, such as the leaky bucket or the token bucket [197].
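The capacity-reserve behaviour just described can be sketched as a toy Python model. This is an illustration of the hard/soft distinction, not the actual mechanism of [191]: the class and state names are invented, and scheduling itself (priorities, preemption) is left out.

```python
class CapacityReserve:
    """Toy (Q, P) CPU capacity reserve. On depletion, a hard reservation
    suspends the task until the next recharge; a soft reservation demotes
    it to background (non-real-time) execution."""

    def __init__(self, Q, P, hard=True):
        self.Q, self.P, self.hard = Q, P, hard
        self.budget = float(Q)     # remaining budget in the current period
        self.state = "realtime"

    def account(self, executed):
        """Charge 'executed' time units against the current budget."""
        self.budget = max(0.0, self.budget - executed)
        if self.budget == 0.0:
            # depleted: suspend (hard) or demote to background (soft)
            self.state = "suspended" if self.hard else "background"

    def recharge(self):
        """At each period boundary the budget is replenished to Q and the
        task is again scheduled as a real-time task."""
        self.budget = float(self.Q)
        self.state = "realtime"
```

With Q = 2 and P = 3, a hard reserve lets the task run for 2 units, suspends it, and restores it at the next period boundary; the soft variant instead keeps the task runnable at background priority.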
More formally, a reservation technique can be defined as follows:
- a reservation RSV is characterized by two parameters (Q, P), referred to as the maximum budget and the reservation period;
- a budget q (also referred to as capacity) is associated with each reservation;
- at the beginning of each reservation period, the budget q is recharged to Q;
- when the reserved task executes, the budget is decreased accordingly;
- when the budget becomes zero, the reservation is said to be depleted and an appropriate action must be taken.

The action to be taken when a reservation is depleted depends on the reservation type (hard or soft). In a hard reservation, the task is suspended until the budget is recharged, and another task can be executed. If all tasks are suspended, the system remains idle until the first recharging event. Thus, in the case of hard reservations, the scheduling algorithm is said to be non-work-conserving. In a soft reservation, if the budget is depleted and the task has not yet completed, the task's priority is downgraded to background priority until the budget is recharged. In this way, the task can take advantage of unused bandwidth in the system. When all reservations are soft, the algorithm is work-conserving.

Figure 5.8 shows how the tasks of Figure 5.2 are scheduled using two hard CPU reservations RSV_1 and RSV_2 with Q_1 = 2, P_1 = 3, Q_2 = 1, and P_2 = 5, under RM. The same figure also shows the temporal evolution of the budgets q_1 and q_2. Since

Figure 5.8: Example of CPU Reservations implemented over a fixed priority scheduler.

the reservations are based on RM, RSV_1 has priority over RSV_2, and task τ_1 starts to execute. After 2 time units, τ_1 stops and τ_2 starts executing. At time 3, τ_2 has not completed, but its current budget is q_2 = 0 and the reservation RSV_2 is depleted. Hence, τ_2 is suspended, waiting for its budget to be recharged. As we can see, τ_1 does not suffer from the overrun of τ_2. At the same time, a new period of RSV_1 starts, and budget q_1 is recharged to 2. Hence, τ_1 can execute again and completes its instance after one more unit of time. Notice that task τ_1 has missed its deadline at time 3. Moreover, since the task has a period of 3, another instance should have been activated at time 3. Depending on the actual implementation of the scheduler and of the task, the task activation at time 3 may be skipped or buffered. In Figure 5.8 we assume that the activation is buffered. Hence, at time 4 the task resumes, executing the next buffered instance. Note that, even though the first instance of τ_1 is too long, the schedule is equivalent to the one generated by RM for two tasks τ_1 = (2, 3) and τ_2 = (1, 5). In other words, the CPU reservation mechanism provides temporal isolation between the two tasks: since τ_1 is the one executing too much, it will miss some deadlines, but τ_2 is not affected.

Problems with Traditional Reservation Systems

The reservation mechanism presented in the previous section can also easily be used with dynamic priority schedulers, such as EDF, to obtain a better CPU utilization. However, when using CPU reservations in a dynamic priority system, it can be useful to extend the scheduler to solve some problems that generally affect reservation based schedulers. Hence, before presenting some more advanced reservation algorithms, we show a typical problem encountered in traditional reservation systems.

Figure 5.9: The task set is schedulable by CPU Reservations implemented over EDF.

Figure 5.10: A late arrival in τ_1 can cause a deadline miss in τ_2.

In particular, a generic reservation based scheduling algorithm can have problems in handling aperiodic task arrivals. Consider two tasks τ_1 = (4, 8) and τ_2 = (3, 6) served by two reservations RSV_1 = (4, 8) and RSV_2 = (3, 6). As shown in Figure 5.9, if the EDF priority assignment is used to implement the reservation scheme, then the task set is schedulable (and each task will respect all its deadlines). However, if an instance of one of the two tasks is activated late, the temporal isolation provided by the reservation mechanism may be broken. For example, Figure 5.10 shows the schedule produced when the third instance of τ_1 arrives at time 18 instead of time 16: the system is idle between times 17 and 18, and task τ_2 (which is behaving correctly) misses a deadline. If correctly used, dynamic priorities make it possible to fix this kind of problem and better exploit the CPU time, as shown in the next section.

5.7 Resource reservations in dynamic priority systems

To better exploit the advantages of a dynamic priority system, resource reservations can be implemented by properly assigning a dynamic scheduling deadline to each task and by scheduling tasks by EDF based on their scheduling deadlines.

Definition 5 A scheduling deadline d^s_{i,j} is a dynamic deadline assigned to a job τ_{i,j} in order to schedule it by EDF.

Note that a scheduling deadline is different from the job deadline d_{i,j}, which in this case is only used for performance monitoring. The abstract entity that is responsible for assigning a correct scheduling deadline to each job is called an aperiodic server.

Definition 6 A server is a mechanism used to assign scheduling deadlines to jobs in order to schedule them so that some properties (such as the reservation guarantee) are respected.

Aperiodic servers are widely known in the real-time literature [186, 187, 198, 199, 188, 189, 190, 200] but, in general, they have been used to reduce the response time of aperiodic requests, not to implement temporal protection. The server assigns each job τ_{i,j} an absolute time-varying deadline d^s_{i,j} which can be dynamically changed. This fact can be modeled by splitting each job τ_{i,j} into chunks H_{i,j,k}, each having a fixed scheduling deadline d^s_{i,j,k}.

Definition 7 A chunk H_{i,j,k} is a part of the job τ_{i,j} characterized by a fixed scheduling deadline d^s_{i,j,k}.

Each chunk H_{i,j,k} is characterized by an arrival time a_{i,j,k}, an execution time e_{i,j,k}, and a scheduling deadline. Note that the arrival time a_{i,j,0} of the first chunk of a job τ_{i,j} is equal to the job release time: a_{i,j,0} = r_{i,j}.

The Constant Bandwidth Server

The Constant Bandwidth Server (CBS) is a work-conserving server (implementing soft reservations) that takes advantage of dynamic priorities to properly serve aperiodic requests and better exploit the CPU. The CBS algorithm is formally defined as follows:
- A CBS S is characterized by a budget q^s and by an ordered pair (Q^s, P^s), where Q^s is the server maximum budget and P^s is the server period. The ratio U^s = Q^s/P^s is denoted as the server bandwidth.
- At each instant, a fixed deadline d^s_k is associated with the server. At the beginning, d^s_0 = 0.
- Each served job τ_{i,j} is assigned a dynamic deadline d_{i,j} equal to the current server deadline d^s_k.
- Whenever a served job τ_{i,j} executes, the budget q^s of the server S serving τ_i is decreased by the same amount.
- When q^s = 0, the server budget is recharged to the maximum value Q^s and a new server deadline d^s_{k+1} = d^s_k + P^s is generated. Notice that there are no finite intervals of time in which the budget is equal to zero.
- A CBS is said to be active at time t if there are pending jobs (remember that the budget q^s is always greater than 0); that is, if there exists a served job τ_{i,j} such that r_{i,j} ≤ t < f_{i,j}. A CBS is said to be idle at time t if it is not active.
- When a job τ_{i,j} arrives and the server is active, the request is enqueued in a queue of pending jobs according to a given (arbitrary) non-preemptive discipline (e.g., FIFO, shortest execution time first, or earliest deadline first, if tasks have soft deadlines).
- When a job τ_{i,j} arrives and the server is idle, if q^s ≥ (d^s_k − r_{i,j})U^s the server generates a new deadline d^s_{k+1} = r_{i,j} + P^s and q^s is recharged to the maximum value Q^s; otherwise the job is served with the last server deadline d^s_k using the current budget.
- When a job finishes, the next pending job, if any, is served using the current budget and deadline. If there are no pending jobs, the server becomes idle.
- At any instant, a job is assigned the last deadline generated by the server.

Figure 5.11 illustrates an example in which a hard periodic task τ_1 is scheduled by EDF together with a soft task τ_2, served by a CBS having a budget Q^s = 2 and

Figure 5.11: Simple example of CBS scheduling.

Figure 5.12: Example of CBS serving a task with variable execution time and constant inter-arrival time.

a period P^s = 7. The first job of τ_2 arrives at time r_1 = 2, when the server is idle. Since q^s ≥ (d^s_0 − r_1)U^s, the deadline assigned to the job is d^s_1 = r_1 + P^s = 9 and q^s is recharged to Q^s = 2. At time t_1 = 6 the budget is exhausted, so a new deadline d^s_2 = d^s_1 + P^s = 16 is generated and q^s is replenished. At time r_2 = 7, the second job arrives while the server is active, so the request is enqueued. When the first job finishes, the second job is served with the current server deadline (d^s_2 = 16). At time t_2 = 12, the server budget is exhausted, so a new server deadline d^s_3 = d^s_2 + P^s = 23 is generated and q^s is replenished to Q^s. The third job arrives at time 17, when the server is idle and q^s = 1 < (d^s_3 − r_3)U^s = (23 − 17) · (2/7) = 1.71, so it is scheduled with the current server deadline d^s_3 without changing the budget.

In Figure 5.12, a hard periodic task τ_1 is scheduled together with a soft task τ_2 having fixed inter-arrival time (T_2 = 7) and variable computation time, with a mean value equal to C_2 = 2. This situation is typical of applications that manage continuous media: for example, a video stream must be played periodically, but the decoding/playing time of each frame is not constant. In order to optimize the processor utilization, τ_2 is served by a CBS with a maximum budget equal to the mean computation time of the task (Q^s = 2) and a period equal to the task period (P^s = T_2 = 7). As we can see from Figure 5.12, the second job of task τ_2 is first assigned a deadline d^s_2 = r_2 + P^s = 14. At time t_2 = 12, however, since q^s is exhausted and the job has not finished, the job is scheduled with a new deadline d^s_3 = d^s_2 + P^s = 21.
As a result of the longer execution, only the soft task is delayed, while the hard task meets all its deadlines. Moreover, the exceeding portion of the late job is not executed in the background, but is scheduled with a suitable dynamic priority.
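The CBS rules exercised in these examples can be condensed into a small sketch. This is an illustrative Python model under simplifying assumptions (a single served task, no pending-job queue, and time handled abstractly), not the actual CBS implementation; the names are invented. It reproduces the arrival rule (recharge and set d^s = r + P^s only when the server is idle and q^s ≥ (d^s − r)U^s) and the depletion rule (replenish the budget and postpone the deadline by P^s).

```python
class CBS:
    """Toy Constant Bandwidth Server: budget q, deadline d, bandwidth U = Q/P."""

    def __init__(self, Q, P):
        self.Q, self.P = Q, P
        self.U = Q / P     # server bandwidth U^s = Q^s / P^s
        self.q = 0.0       # current budget
        self.d = 0.0       # current server deadline, d^s_0 = 0

    def job_arrival(self, r, server_idle):
        """Arrival rule: if the server is idle and q >= (d - r) * U,
        generate d = r + P and recharge q to Q; otherwise keep the
        current deadline and budget. Returns the deadline to use."""
        if server_idle and self.q >= (self.d - r) * self.U:
            self.d = r + self.P
            self.q = self.Q
        return self.d

    def execute(self, delta):
        """Consume 'delta' budget; on depletion, immediately replenish the
        budget and postpone the deadline by P. Returns the current deadline."""
        self.q -= delta
        while self.q <= 0.0:
            self.q += self.Q
            self.d += self.P
        return self.d
```

Replaying the numbers of the first example (Q^s = 2, P^s = 7, first job at time 2): the arrival rule yields deadline 9, and each budget exhaustion postpones the deadline to 16 and then 23.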

Figure 5.13: Example of CBS serving a task with constant execution time and variable inter-arrival time.

Figure 5.14: Example of a CBS coping with late arrivals.

In other situations, frequently encountered in continuous media (CM) applications, tasks have fixed computation times but variable inter-arrival times. For example, this is the case of a task activated by external events, such as a driver process activated by interrupts coming from a communication network. In this case, the CBS behaves exactly like a Total Bandwidth Server (TBS) [190] with a bandwidth U^s = Q^s/P^s. In fact, if C_i = Q^s, each job finishes exactly when the budget reaches 0, so the server deadline is increased by P^s. It is also interesting to observe that, in this situation, the CBS is also equivalent to a Rate-Based Execution (RBE) model [201] with parameters x = 1, y = T_i, D = T_i. An example of such a scenario is depicted in Figure 5.13.

Finally, Figure 5.14 shows how the tasks presented in Figure 5.10 are scheduled by a CBS. Since the CBS assigns a correct deadline to the instance arriving late (the third instance of τ_1), τ_2 does not miss any deadline, and temporal protection is preserved.

CBS properties

The CBS service mechanism presents some interesting properties that make it suitable for supporting CM applications. The most important one, the isolation property, is formally expressed by the following theorem.

Theorem 1 A CBS with parameters (Q^s, P^s) demands a bandwidth U^s = Q^s/P^s.

The isolation property allows us to use a bandwidth reservation strategy to allocate a fraction of the CPU time to each task that cannot be guaranteed a priori. The most important consequence of this result is that soft tasks can be scheduled together with hard tasks without affecting the a priori guarantee, even when soft requests exceed the expected load.

In addition to the isolation property, the CBS has the following characteristics:
- No assumptions are required on the WCET and the minimum inter-arrival time of the served tasks: this allows the same program to be used on different systems without recalculating the computation times. This property decouples the task model from the scheduling parameters. If the task's parameters are known in advance, a hard real-time guarantee can be performed (see Section 5.8).
- The CBS automatically reclaims any spare time caused by early completions or late arrivals. This is due to the fact that whenever the budget is exhausted, it is always immediately replenished to its full value and the server deadline is postponed. In this way, the server remains eligible and the budget can be exploited by the pending requests with the current deadline.
- Knowing the statistical distribution of the computation time of a task served by a CBS, it is possible to perform a statistical guarantee, expressed in terms of the probability for each served job to meet its deadline.

5.8 Temporal guarantees

Resource reservations provide a basic scheduling mechanism that, thanks to the temporal isolation property, can be used in different ways to serve hard or soft tasks, providing different kinds of guarantees. In this section we briefly recall some possible parameter assignment policies; note that, although most of the presented results are applied to the CBS algorithm (because they were originally developed for the CBS), they can be extended to other reservation policies.

The first (and simplest) use of a reservation algorithm is to serve aperiodic tasks so that they do not interfere with the hard real-time activities. This is the approach followed in all the works on aperiodic servers [186, 187, 198, 199, 188, 189, 190, 200]. Obviously, a single CBS can be used to serve all the soft real-time tasks, but in this case it might be very difficult to provide soft real-time guarantees.
The best way to provide some kind of performance guarantee to soft real-time tasks is to serve each task with a dedicated CBS (or CPU reservation). In this way, it is possible to guarantee that each task is periodically assigned a given amount of time; if the task parameters are not known a priori, this is the only performance guarantee that can be provided, but if some information about the task is known, more complex guarantee strategies can be used. Finally, a dedicated server can also be used to schedule hard real-time tasks, which can be guaranteed thanks to the hard schedulability property, expressed by the following lemma:

Lemma 1 A hard task τ_i with parameters (C_i, T_i) is schedulable by a CBS with parameters Q^s ≥ C_i and P^s ≤ T_i if and only if τ_i is schedulable without the CBS.

All the policies described above can be used off-line for assigning reservation parameters during the system design phase, when task parameters are known a priori. However, such a priori information is often not available, and static allocation techniques cannot be used. In this case, it is possible to dynamically change the reservation parameters using a feedback mechanism.
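The two checks involved in these guarantee policies can be sketched in a few lines. This is an illustrative Python fragment under stated assumptions (EDF-scheduled reservations on a single CPU, no scheduling overhead), not a prescribed procedure from the text; function names are invented.

```python
def admit(reservations, new_res, utilization_limit=1.0):
    """Bandwidth-based admission test sketch: accept a new (Q, P) reservation
    only if the total reserved bandwidth sum(Q_i / P_i) stays within the
    limit (assumption: EDF-based reservations on one CPU, zero overhead)."""
    total = sum(Q / P for Q, P in reservations) + new_res[0] / new_res[1]
    return total <= utilization_limit

def hard_guarantee(C, T, Qs, Ps):
    """Condition of the hard schedulability property (Lemma 1): a hard task
    (C, T) can be served by a CBS with Q^s >= C and P^s <= T (the task is
    also assumed schedulable without the CBS)."""
    return Qs >= C and Ps <= T
```

For instance, with reservations (2, 3) and (1, 5) already admitted, a new (1, 10) reservation fits (total bandwidth ≈ 0.97), whereas adding (2, 4) to (2, 3) would exceed the unit limit and be rejected.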

5.9 Resource reservations in operating system kernels

Resource reservations have been implemented in various real-time kernels (mainly research kernels), starting from Real-Time Mach, and are available in a commercial real-time extension of Linux (Linux/RK by TimeSys). While most of these systems provide CPU reservations as an alternative to classical real-time scheduling algorithms, few of them base the whole kernel on the reservation concept and provide reservations for all the resources managed by the system.

Real-Time Mach

Real-Time Mach (RT-Mach) [202] is a real-time extension of the Mach µkernel [203], developed at CMU. RT-Mach extends standard Mach by increasing the predictability of the kernel and providing a real-time threading library, a real-time scheduler, and a real-time communication mechanism. The predictability of the kernel is increased by using eager evaluation policies (as opposed to the lazy evaluation policies used by standard Mach) and by substituting the FIFO queues contained in the kernel with priority queues (where the priorities are derived from the tasks' temporal constraints). As an example of a lazy evaluation policy used in standard Mach, when a task dynamically allocates some memory, the kernel actually gives it to the task only when the task accesses the allocated memory. Such lazy allocation enhances the kernel's efficiency and enables optimizations such as copy-on-write, but increases the unpredictability of the system. Hence, RT-Mach modifies this behavior by allocating the memory immediately; other similar optimizations present in the Mach µkernel have been removed from RT-Mach for similar reasons. The real-time threading library coming with RT-Mach implements the periodic and sporadic thread models, enabling the user to express the WCET and the period (or the minimum inter-arrival time) of each thread.
In this way, RT-Mach can perform admission control and correctly schedule the threads using a Rate Monotonic (or Deadline Monotonic) scheduler. Finally, the real-time communication mechanism uses priority inheritance [204] to bound the waiting times.

CPU reservations were added to RT-Mach by Mercer and others [191] to support multimedia applications. In particular, the authors realized the lack of temporal protection in the priority-based RT-Mach scheduler (similar to the problem shown in Section 5.6) and implemented a CPU reservation mechanism based on the Rate Monotonic algorithm. This was done by enhancing the RT-Mach time accounting mechanism to exactly measure the execution time used by each thread (and keep track of the reservation budget) and by implementing an enforcement mechanism. The enforcement mechanism downgrades a thread to non-real-time when it consumes all its reserved time (the thread is promoted again to real-time priority at the beginning of the next reservation period). The authors argued that, to compensate for some approximations in accounting and enforcement, a fraction of the CPU time must be left unreserved, and they estimated this fraction at about 5-10%. Since in realistic situations the RM utilization bound is about 88% [205], the authors claimed that basing the reservation mechanism on EDF would not give any sensible advantage with respect to RM, and thus they adopted the RM scheduler provided by RT-Mach as a basis for their CPU capacity reserves. Nowadays, using modern hardware and OS kernels, the overhead for accounting and enforcement is negligible; hence there is no longer any reason to compensate for it. As a consequence, basing the reservation mechanism on EDF can be a realistic choice.

Other Research Systems

CPU reservations have also been implemented in other research kernels to support predictable CPU allocation in dynamic systems. For example, Rialto is a research system developed by Microsoft [206] that makes it possible to mix CPU reservations and other kinds of timing constraints. Rialto was designed to combine timesharing and soft real-time in a desktop operating system, and thus uses CPU reservations to isolate the different applications. Execution time is reserved to activities and monitored at runtime. Activities can be composed of multiple threads, and threads belonging to the same activity share its reserved time in a round-robin fashion. Another difference between Rialto CPU reservations and traditional ones is that in Rialto reservations are continuously guaranteed. That is to say, if an activity has a (Q, T) reservation, then for every time t the activity will run for at least Q units of time in the interval (t, t + T).² This result is impossible to obtain using a priority scheduler, and in fact Rialto uses a table-driven schedule that is computed when a reservation is created and is repeated over time. Moreover, Rialto provides time constraints: a time constraint is a tuple (s, c, b), indicating that a thread requires to execute for a time c, starting at time s and terminating before b. Based on the thread's activity reserved time in the static schedule, and on the available spare time, Rialto can guarantee the time constraint or reject it. If the time constraint is accepted, the activity's threads are scheduled so that it is respected (the scheduling algorithm used inside the activity is based on EDF).

Another system supporting resource reservations is HARTIK [207], an experimental real-time kernel developed at the ReTiS Lab of the Scuola Superiore S. Anna of Pisa to support real-time and control applications running on conventional PC hardware (based on Intel x86 processors).
Like RT-Mach and Rialto, HARTIK allows tasks' temporal constraints to be explicitly expressed, implements an on-line admission test, and can dynamically create and destroy processes. Moreover, HARTIK also provides some unique features that are rarely found all together in other kernels. These include support for both periodic and aperiodic processes, the possibility to mix hard, soft, and non-real-time tasks, the implementation of resource sharing protocols (based on SRP [208]), and the presence of a non-blocking communication mechanism (the CAB [207]) for exchanging data among periodic tasks having different rates. The HARTIK scheduler is based on EDF. The kernel was later extended to support multimedia applications through the CBS, which was explicitly designed to efficiently schedule periodic and aperiodic soft tasks with unknown execution times [209]. Nowadays, the CBS can be used in HARTIK to schedule both hard and soft real-time tasks, or to reserve a fixed fraction of the CPU bandwidth for non-real-time tasks to prevent starvation. Moreover, the CBS is used to schedule all the driver tasks, so that it is not necessary to adjust the drivers' WCET estimation the first time a driver runs on a new machine.

Another real-time kernel developed at the ReTiS Lab of the Scuola Superiore S. Anna of Pisa is SHaRK [210]. SHaRK is an evolution of HARTIK and has been designed so that new scheduling algorithms can easily be implemented in the kernel as scheduling modules. The CBS is provided as one of the standard scheduling modules, and other reservation mechanisms can easily be added; hence SHaRK provides full support for CPU reservations.

² In a traditional reservation, this is valid only for t = kT + t_0, where t_0 is a fixed offset.

A similar concept (easy implementation of new scheduling algorithms) is proposed by RED Linux, which modifies the 2.2 Linux kernel to provide high-resolution timers, low kernel latency, and a modular scheduler. This last feature makes it easy to implement CPU reservations in RED Linux.

Resource Kernels

Extending the concepts presented by the kernels described above, it is possible to consider resource reservations as an abstraction for decoupling the applications from the scheduling algorithm. Hence, applications only need to express their resource requirements in terms of reservations (Q, T) (plus an optional parameter D indicating a relative deadline), so that the kernel can perform an admission test and schedule tasks in the proper way. This is the resource-centric approach taken by Resource Kernels (RK) [192]. A resource kernel is based on the Resource Set abstraction, which describes all the resources that can be used by one or more tasks. A resource set may include multiple reservation types (for example, a CPU reservation, a network reservation, and a disk reservation), and all the tasks attached to the resource set are allowed to use those reservations. Hence, in order to be guaranteed to execute in a proper timely fashion, a task must create a resource set, create the proper resource reservations expressing its requirements, connect them to the resource set, and then attach itself to the resource set. The RK concept was initially implemented in a modified version of RT-Mach, but it is fairly general [211] and has, for example, been ported to Linux [212]. Linux/RK provides high-resolution timers and an accurate accounting mechanism, and implements the resource set abstraction, CPU reservations (based on RM, DM, or EDF), network reservations, and disk reservations. A commercial version of Linux/RK is distributed by TimeSys as TimeSys Linux [213], which adds some additional features (such as more predictable kernel services) to the original RK.
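The resource-set workflow described above (create a set, create reservations, connect them, attach the task) can be illustrated with a hypothetical sketch. To be clear, this is not the actual RK API: all class, method, and task names here are invented purely to show the flow of steps.

```python
class ResourceSet:
    """Hypothetical model of the Resource Set abstraction: a bundle of
    reservations of different types, shared by the tasks attached to it."""

    def __init__(self):
        self.reservations = {}  # kind -> (Q, P), e.g. "cpu", "net", "disk"
        self.tasks = []

    def add_reservation(self, kind, Q, P):
        # step 2-3 of the flow: create a reservation and connect it to the set
        self.reservations[kind] = (Q, P)

    def attach(self, task):
        # step 4: the task attaches itself and may use all the reservations
        self.tasks.append(task)

# The steps from the text, for a task needing CPU and network guarantees
# (all parameter values are illustrative):
rset = ResourceSet()
rset.add_reservation("cpu", Q=10, P=30)  # e.g. 10 ms every 30 ms
rset.add_reservation("net", Q=2, P=10)
rset.attach("video_decoder")
```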
The AQuoSA framework

AQuoSA (Adaptive Quality of Service Architecture) is an open architecture for the provisioning of adaptive Quality of Service functionality in the Linux kernel. The project features a flexible, portable, lightweight, and open architecture for supporting QoS-related services on top of a general-purpose operating system such as Linux. The architecture is well founded on formal scheduling analysis and control-theoretical results. A key feature of AQuoSA is the Resource Reservation layer, which is capable of dynamically adapting the CPU allocation for QoS-aware applications based on their run-time requirements. In order to provide such functionality, AQuoSA embeds a kernel-level CPU scheduler implementing a resource reservation mechanism for the CPU, which gives the Linux kernel the ability to realize (partial) temporal isolation among the tasks running within the system. The AQuoSA architecture consists of three hierarchical layers:

1. A patch to the Linux kernel;
2. A resource reservations layer;
3. An adaptive reservation layer.

Patch to the Linux kernel

At the lowest level, a patch to the Linux kernel adds the ability to notify dynamically loaded modules of any relevant scheduling event. These events have been identified as the creation and death of tasks, as well as block and unblock events. The patch is minimally invasive, in that it consists of a few lines of code properly inserted mainly within the Linux scheduler code (sched.c). It has been called the "Generic Scheduler Patch", because it potentially allows any scheduling policy to be implemented.

The Resource Reservations layer

The Resource Reservations layer is composed of three components. The core component is a dynamically loadable kernel module that implements a Resource Reservations scheduling paradigm for the CPU, by exploiting the functionality introduced into the Linux kernel through the Generic Scheduler Patch. Second, a user-level library (the QRES library) allows an application to use the new scheduling policy through a complete and well-designed set of API calls. Essentially, these calls allow an application to ask the system to reserve a certain percentage of the CPU for its process(es). Third, a kernel-level component (the Supervisor) mediates all requests made by the applications through the QRES library, so that the total sum of the requested CPU shares does not violate the consistency relationship for the scheduler (less than one, or slightly less than one, due to overhead). The Supervisor's behaviour is completely configurable by the system administrator, so that it is possible to specify, on a per-user/per-group basis, minimum guaranteed and maximum allowed values for the reservations made on the CPU. With AQuoSA, applications may use the Resource Reservation layer directly, which allows them to reserve a fraction of the CPU so as to run with the required scheduling guarantees.
For example, a multimedia application may ask the operating system to run with the guarantee of being scheduled for at least Q milliseconds every P, where Q and P depend on the nature of the application. When registering an application with the Resource Reservation layer, it is possible to specify a minimum guaranteed reservation that the system should always grant to the application. Based on the requests for minimum guaranteed reservations, the layer performs admission control, i.e., it admits a new application only if, after its addition, the new set of running applications does not exceed the CPU saturation limit.

The Adaptive Reservations layer

For typical multimedia applications making use of high-compression technologies, it may be quite difficult, impractical, or inconvenient to run applications with a fixed reservation on the CPU. In fact, the problem arises of how to tune the correct reservation to use. Traditional real-time systems make use of WCET (Worst Case Execution Time) analysis techniques in order to compute the maximum time an instance of (e.g.) a periodic task may spend on the CPU before blocking to wait for the next instance. Such analysis is very difficult for today's complex multimedia applications, especially when running on general-purpose hardware such as standard PCs, where technologies like multi-level caches, CPU execution pipelines, on-bus buffers, and multi-master buses introduce many unpredictable variables in the computation of the time required for memory accesses.

On such systems, it is much more convenient to tune the system design based on the average expected load of the application, in order to avoid heavy under-utilization of the system at run-time. For certain classes of multimedia applications, e.g., a video player, it is practically impossible to find an appropriate fixed value for the fraction of CPU required by the application at run-time, due to heavy fluctuations of the load depending on the actual data being processed by the program. A reservation based on the average requirements, or slightly greater than that, results in transient periods of poor quality during the application run (e.g., movie playback). On the other hand, one based on the maximum expected load results in unneeded over-reservation of the CPU most of the time, except for the periods in which the load really approaches the maximum expected value. For these classes of applications, it is much more convenient to use Adaptive Reservation techniques, such as those provided by the Adaptive Reservation layer of AQuoSA, which performs continuous on-line monitoring of the computational requirements of the application process(es), so that it may dynamically adapt the reservation made on the CPU depending on the monitored data. The Adaptive Reservation layer exposes to applications an API for using a set of controllers which are of quite general use within a wide range of multimedia applications.
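The adaptation loop can be illustrated by a deliberately simple controller that tracks the recent demand with an exponentially weighted average and reserves it plus a safety margin. This is our own toy policy; the controllers shipped with AQuoSA are more elaborate:

```python
def adapt_budget(consumed, period_ms, margin=1.2, alpha=0.3):
    """Given the CPU time consumed in recent periods, return the budget to
    reserve for the next period: a smoothed demand estimate times a safety
    margin, capped at the full period."""
    estimate = consumed[0]
    for c in consumed[1:]:
        estimate = alpha * c + (1 - alpha) * estimate  # exponential smoothing
    return min(margin * estimate, period_ms)

budget = adapt_budget([10.0, 10.0, 10.0], period_ms=100.0)  # steady load: about 12 ms
cap = adapt_budget([200.0, 200.0], period_ms=100.0)         # demand capped at the full period
```

The margin trades a small amount of over-reservation for robustness against load increases between controller invocations.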

Chapter 6

Feedback Scheduling

The main idea of feedback scheduling is to base the allocation of shared computing or communication resources on on-line measurements of the actual resource utilization or the actual quality of service.

6.1 Background

Feedback-based approaches have always been used in engineering systems. One example is the flow and congestion control mechanisms in the TCP transport protocol. Typical of many applications of this type is that feedback control is used in a more or less ad hoc way, without any connection to control theory. During the last 10 years this situation has changed. Today, control theory is beginning to be applied to real-time computing systems in a more structured way. Dynamic models are used to describe how the performance or quality of service depends on the resources at hand. The models are then analyzed to determine the fundamental performance limitations of the system. Based on the model and the specifications, control design is performed. In some cases the analysis and design are based on optimization.

6.2 Motivation and Objectives

In a real-time system with hard resource constraints, e.g., execution deadlines, it is paramount that the constraints are fulfilled. If sufficient information is available about worst-case resource requirements, e.g., worst-case execution times (WCET), then the results from classical schedulability theory can be applied to decide whether this is the case. It is then possible to provide a system implementation that guarantees that the resource constraints are fulfilled at all times. However, in many situations the hard real-time scheduling approach is impractical. Worst-case numbers are notoriously difficult to derive. In order to be on the safe side, a heuristically chosen safety margin is often added to measurements of worst-case values. This may lead to under-utilization of resources. In other cases resource requirements vary greatly over time.
The reason for this may be changes in the external load on the system, dynamic changes in the use cases, or mode changes in application tasks. Again, designing the system for the worst case may lead to under-utilization. The common problem in the above situations is uncertainty, in the form of either unknown parameters or time-varying system behavior. A major strength of control theory is its ability to manage uncertainty. In feedback scheduling, the allocation of resources is based on a comparison between the actual resource consumption and the desired resource consumption

(the setpoint value or the reference value). The difference, or control error, is then used to decide how the resources should be allocated to the different tasks. The decision mechanism constitutes the actual controller in the feedback scheduling scheme. The structure of a feedback-based resource allocation scheme is shown in Figure 6.1.

Figure 6.1: A feedback scheduler structure. Feedforward can be used to proactively adjust to known changes in required resources.

In the figure we assume that the resource consumers are tasks that each need a certain amount of CPU time. The setpoint of the controller/scheduler is the desired total CPU utilization. If a task knows beforehand that it is about to change its resource consumption, it may inform the scheduler about this directly. This constitutes a feedforward path in the controller. Another name for a feedback loop is a closed loop. In contrast, conventional scheduling algorithms can be described as operating in open loop, without any mechanism that allows them to adjust to changes in load, overruns, etc. Instead of basing the resource allocation on the resource utilization, one could use the quality of service as the measured variable. The often implicit assumption is then that there is a monotonically increasing relationship between the amount of resources allocated and the quality obtained. The scheme is shown in Figure 6.2.

Figure 6.2: A feedback scheduler structure using QoS.

A problem with QoS-based approaches is the difficulty of defining appropriate quality attributes for a given application. A key observation is that feedback scheduling is not suitable for applications that are truly hard in nature. The reason is that feedback acts on errors. In the CPU utilization case above, this means that some tasks might temporarily receive fewer resources than required, i.e., they could miss deadlines.
Feedback scheduling is therefore primarily suited for applications that are soft, i.e., applications that can tolerate occasional deadline misses without any catastrophic effects.
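As a toy illustration of the loop in Figure 6.1 (our own minimal sketch, not an algorithm from the literature), one controller invocation could measure the total utilization, form the control error against the setpoint, and rescale the task budgets accordingly:

```python
def feedback_step(tasks, u_setpoint, gain=0.5):
    """One controller invocation: sense total CPU utilization, compute the
    control error, and actuate by adjusting each task's budget in
    proportion to its current share of the load."""
    u = sum(t["budget"] / t["period"] for t in tasks)   # sensor
    error = u_setpoint - u                              # control error
    for t in tasks:                                     # actuator
        share = (t["budget"] / t["period"]) / u
        t["budget"] += gain * error * share * t["period"]
    return error

tasks = [{"budget": 20.0, "period": 100.0}, {"budget": 40.0, "period": 100.0}]
feedback_step(tasks, u_setpoint=0.5)   # utilization 0.6 is above the setpoint,
                                       # so both budgets are scaled down
```

Note how a positive error (unused capacity) grows the budgets and a negative error shrinks them; the gain determines how aggressively the scheduler reacts.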

Another increasingly popular term is adaptive tasks. The latter means that missing one or more deadlines does not jeopardize correct system behavior, but only causes a performance degradation. For this type of system, the goal is typically to meet some QoS requirements. The adaptive class of real-time systems is a suitable description for many practical applications, including different types of multimedia applications. It also includes a large class of control applications. Most control systems can tolerate occasional deadline misses. The control performance, or Quality of Control (QoC), is also dependent on the degree to which the timing requirements are fulfilled. It is only in safety-critical control applications, e.g., automotive X-by-wire applications, that the hard real-time model is really motivated.

6.3 Important Issues

Important issues in feedback scheduling are what the inputs and outputs of the system are, the structure and type of the controller, and what modeling formalism is employed.

Sensors and Actuators

An important issue in all control problems is to determine what the inputs and outputs are. The input to the controlled system, i.e., the control variable, is the means by which the controller changes the resource allocation. The output of the controlled system, i.e., the measured variable, is the variable or signal that the controller aims to maintain under control. The controller could try to keep the measured variable at a constant desired setpoint or have it follow a changing setpoint. The actuator is the software mechanism through which the control variable enters the controlled system. Examples of actuator mechanisms in feedback scheduling systems are admission controllers, changes in the rate of periodic tasks, skipping of individual task instances (jobs), skipping of frames in video streams, and using different versions with different resource requirements.
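For instance, the rate-change actuator in the list above can be reduced to a one-line period adjustment. This is an illustrative helper of our own, with `wcet` standing in for the task's execution-time estimate:

```python
def set_rate(task, u_target):
    """Actuate on a periodic task by choosing the period that makes its
    utilization wcet/period equal to u_target."""
    task["period"] = task["wcet"] / u_target
    return task["period"]

video = {"wcet": 2.0, "period": 40.0}   # currently 5% of the CPU
set_rate(video, u_target=0.1)           # halve the period to use 10%
```

The other actuators (admission control, job skipping, version switching) change the demand in coarser, discrete steps, which is why rate adaptation is often the preferred fine-grained knob.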
The sensor is the software mechanism that is used to actually obtain the measured variable. In a CPU utilization control application the sensor could correspond to measurements of the tasks' actual execution times. Other examples could be the deadline miss ratio or task tardiness. In order to keep the controller from over-reacting to spurious upsets in the measured variable, e.g., occasional long execution times, a low-pass filter is often included in the sensor.

Controller structure and type

Another important issue is the structure and type of the controller. In a feedback structure the controller bases its actions on the measured variable and the setpoint only. In a feedforward structure the actions are based only on the setpoint and/or on measurable disturbances acting on the controlled system. In a combined feedback and feedforward structure the feedforward path is typically used to provide a fast response to setpoint changes, whereas the feedback path is used to compensate for errors caused by disturbances acting on the controlled system or incorrect modeling assumptions. In a single-input single-output (SISO) structure the controller controls a single measured variable using one control variable, whereas in a multi-input multi-output (MIMO) structure several measured variables and control variables are used. A common controller structure is the cascade controller,

where two ordinary controllers are connected in series, the control variable of the first, outer, controller being used as the setpoint of the second, inner, controller. The controller type governs how the control variable is calculated based on the measured variable and the setpoint. A common controller type, both in control of computer systems and in control in general, is the PID controller. In this controller, the control variable is formed as a combination of three terms: a proportional term, an integral term, and a derivative term. In the proportional term the control variable at time k, u(k), is proportional to the control error at time k, i.e., u(k) = k_p (y_ref(k) - y(k)) = k_p e(k). Here, y(k) is the measured variable at time k, y_ref(k) is the setpoint (or reference value) at time k, and k_p is the proportional gain. In the integral term the control variable is proportional to the integral of the control error, i.e., u(k) = u(k-1) + k_i e(k), where k_i is the integral gain. The integrator is hence implemented through accumulation. Finally, in the derivative part the control variable is proportional to the derivative of the control error, i.e., u(k) = k_d (e(k) - e(k-1)), where k_d is the derivative gain parameter. The PID controller has become popular since it is relatively easy to tune. The proportional gain is adjusted to make the closed-loop system fast enough, the integral gain is tuned to handle load disturbances as efficiently as possible, while the derivative part can be used to reduce oscillations in the control loop. Another common controller type is state feedback from an observer. In general, the controller can be any linear filter, implemented either in input-output form or state-space form. The order of the controller corresponds to the number of old variables, i.e., the state, that must be stored in order to calculate the control variable. For example, a proportional controller is of zero order and a PI controller is of first order.
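The three PID terms above translate directly into code. A minimal discrete-time sketch, using the incremental integral form from the text:

```python
class PID:
    """u(k) = k_p*e(k) + I(k) + k_d*(e(k) - e(k-1)),
    with the integral accumulated as I(k) = I(k-1) + k_i*e(k)."""
    def __init__(self, kp, ki, kd):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, setpoint, measured):
        e = setpoint - measured
        self.integral += self.ki * e
        u = self.kp * e + self.integral + self.kd * (e - self.prev_error)
        self.prev_error = e
        return u

pid = PID(kp=1.0, ki=0.5, kd=0.0)
u = pid.update(setpoint=0.8, measured=0.6)   # positive: allocate more CPU
```

In a feedback scheduling context, `setpoint` and `measured` would be, e.g., the desired and observed CPU utilization, and `u` the correction applied to the allocation.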
The above controller types are linear. In a nonlinear controller, the control variable is a nonlinear function of the controller inputs. In an adaptive controller the controller parameters vary over time based on changing conditions, whereas in a non-adaptive controller the parameters are kept constant. Hence, the meaning of the word adaptive is quite different in the computing community compared to the control community. In the computing community an ordinary controller with constant parameters, i.e., a non-adaptive controller from the control point of view, is often considered adaptive, since it generates different control signals for different external conditions, i.e., it adapts its behaviour to the external conditions. For example, the name adaptive resource management is used in the computing community to denote resource management systems where the resource allocation is changed dynamically based on resource requirements and availability. From a control point of view, a more adequate name for this would be dynamic resource management or controlled resource management. For a general background on computer-based control, see [214]; on PID control, see [215]; on adaptive control, see [216]; and on control of computing systems, see [217].

Modeling Formalisms

When designing a controller, two main paths can be followed. In the heuristic approach, a controller structure is selected based on experience and heuristics, and the controller parameters are tuned manually, based on numerous experiments. Although this approach works well in many cases, in particular for low-order controllers with few parameters, it has little theoretical foundation. In the model-based approach, a model of the controlled system is developed and then used during the design of the controller. The model describes the dynamic relationship between system inputs and outputs. Due to the remarkable properties of feedback, it is often possible to achieve satisfactory performance using a quite coarse-grained model that only captures the dominating system dynamics. When a computer-based controller is controlling a physical system (often denoted the plant), sampling is employed. The (normally continuous) outputs of the plant are sampled with a certain sampling interval. This transforms the continuous-time signal into a discrete time series, which is then used by the controller to generate the control variable, also a discrete time series. The digital-to-analog converter often works as a zero-order hold device, generating a piecewise constant output signal which is then fed to the plant via the actuator. When sampling continuous-time signals and systems, the sampling period must be chosen with care. Too long a sampling interval may result in poor performance or instability. Aliasing effects may introduce artificial disturbance frequencies into the system unless proper analog anti-aliasing filtering is used. Measurement noise, which may be unavoidable, also imposes fundamental limitations on the performance that the controller can deliver. When controlling a real-time computing system, several things are different. The controlled system is of discrete-time nature and all variables are discrete-valued.
Sample-and-hold circuits are not needed. Measurement disturbances caused by noisy sensors are no longer a large issue. Although several things become easier, control of computing systems also introduces new problems. A main problem is the lack of first-principles models. When controlling a physical plant, the laws of physics to a large degree decide the behaviour of the plant and can be used to derive dynamical models. Some examples are mass balances, energy balances, and momentum balances, often resulting in linear differential equations. A computing system, on the other hand, is a man-made artifact whose internal behaviour is not governed by physical laws, at least not at the macroscopic level. This means that it is hard to derive useful first-principles models. Computing systems can be viewed as discrete-event dynamic systems (DEDS) [218]. This, and the fact that they are real-time systems, makes it natural to use a timed discrete-event formalism, such as timed automata or timed Petri nets, for modeling these systems. A drawback of the DEDS approach is that it is in many cases too fine-grained and easily leads to state-space explosion. This is typically the case in queuing control systems when the arrival and departure rates are large. Another issue is the types of problems that these formalisms typically lend themselves to. Automata-based formalisms are well-suited for expressing and analyzing safety properties and blocking properties. Safety properties are concerned with the reachability of certain undesirable states, which could model undesirable or faulty conditions. Blocking properties are concerned with issues like deadlock and livelock. Safety and blocking problems are however not the main concerns in performance control of real-time computing systems. Instead, it is issues such as stability,

performance, and robustness that are prime concerns. For these types of problems a time-driven/continuous-state approach is more natural. However, the lack of first-principles knowledge may necessitate a system identification-based approach, in which a discrete-time model, typically a difference equation, is derived from measured input and output data. One example of this is the fitting of a discrete-time model to measurement data using a least-squares approach. The models derived in this way are based on periodic sampling. Likewise, the controllers designed from this type of model are based on periodic sampling. Although periodic controllers are common in real-time computing, it would in many respects be more natural to invoke the controller in an event-driven fashion. For example, the controller could be executed when a task instance has completed or when a new task arrives. An event-based controller could potentially be better conditioned for controlling the transient behaviour of the system than a periodic controller that is based on time-averaged measurements. A problem with aperiodic or event-triggered control of this type (rather than the DEDS type) is the lack of control theory. The resulting system descriptions are both time-varying and nonlinear and hence very difficult to analyze. However, there are several indications from the field of control of physical systems that event-based control can have substantial advantages in terms of both fewer control actions per time unit and better tracking performance.

6.4 Feedback Scheduling of CPU Resources

Feedback scheduling of CPU resources is an area where a fair amount of research has been performed. As noted above, feedback scheduling is primarily suited for applications with soft or adaptive real-time requirements. This includes various types of multimedia applications, but also a large class of control applications.
One early approach to feedback task scheduling was taken in [219, 146], which presented a scheduling algorithm called Feedback Control EDF (FC-EDF). A PID controller regulates the deadline miss ratio for a set of soft real-time tasks with varying execution times by adjusting their CPU utilization. It is assumed that tasks can change their CPU consumption by executing different versions of the same algorithm. An admission controller is used to accommodate larger changes in the workload. The scheme is shown in Fig. 6.3. In [220] the approach is extended. An additional PID controller is added that instead controls the CPU utilization. The two controllers are combined using a min-approach. The resulting hybrid controller scheme, named FC-EDF2, gives good performance both during steady state and under transient conditions. The framework is further generalized in [221], where the feedback scheduler is broken down into three parts: the monitor, which measures the miss ratio and/or the utilization; the control algorithm; and the QoS actuator, which contains a QoS optimization algorithm to maximize the system value. Many scheduling techniques that allow QoS adaptation have been developed. An interesting mechanism for workload adjustments is given in [149], where an elastic task model for periodic tasks is presented. The relative sensitivities of tasks to period rescaling are expressed in terms of elasticity coefficients. Each task is characterized by four parameters: the computation time C_i, a nominal period T_i0, a maximum period T_imax, and an elasticity coefficient e_i >= 0. An analogy with a linear spring is used, where the utilization of a task is viewed as the length of a spring with a given rigidity coefficient (1/e_i) and length constraints. When tasks arrive in the system, a compression algorithm is run to compute the new task periods.

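A single-pass sketch conveys the spring analogy of the elastic compression. This is simplified from [149]: the published algorithm iterates when a task reaches its maximum period, which this toy version only clamps once:

```python
def elastic_compress(tasks, u_desired):
    """tasks: list of (C_i, T_i0, T_imax, e_i). Distribute the utilization
    excess over the tasks in proportion to their elasticity e_i and return
    the new periods."""
    u0 = [c / t0 for c, t0, _, _ in tasks]
    excess = sum(u0) - u_desired
    if excess <= 0:
        return [t0 for _, t0, _, _ in tasks]        # nominal periods already fit
    e_sum = sum(e for _, _, _, e in tasks)
    new_periods = []
    for (c, t0, t_max, e), u in zip(tasks, u0):
        u_new = u - excess * e / e_sum              # spring compression
        t = c / u_new if u_new > 0 else t_max
        new_periods.append(min(max(t, t0), t_max))  # periods only grow, up to T_imax
    return new_periods

periods = elastic_compress([(1.0, 10.0, 40.0, 1.0), (1.0, 10.0, 40.0, 1.0)],
                           u_desired=0.1)
# both tasks stretch from period 10 to about 20, halving the total utilization
```

Tasks with larger elasticity coefficients absorb a larger share of the compression, which is exactly the intended expression of relative sensitivity to period rescaling.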
To allow for time-varying or unknown execution times, feedback from an execution-time estimator can be added to the scheme [222].

Figure 6.3: The FC-EDF scheme (from [219]). Submitted tasks pass through an admission controller (AC); a PID controller and a service level controller (SLC) act on the measured miss ratio around the EDF scheduler.

The End-to-end Utilization CONtrol (EUCON) algorithm [223] employs a distributed performance feedback loop that dynamically enforces desired CPU utilization bounds on multiple processors in distributed real-time embedded systems. EUCON is based on a model predictive control approach that models the utilization control problem on a distributed platform as a multi-variable constrained optimization problem. A multi-input multi-output model predictive controller is designed and analyzed based on a difference equation model of distributed real-time systems.

Feedback Scheduling of Control Tasks

There are several reasons why feedback scheduling could be applied to control systems. One reason is the uncertainty associated with WCET estimation. This is something that control applications share with most real-time computing applications. However, since control applications are reactive in nature, it is more pronounced for these. An overly pessimistic WCET estimation may cause the designer to choose a more powerful processor, which will then be under-utilized. Alternatively, the designer will reduce the task utilization by increasing the task periods, which will lead to poor control performance. In some control applications, e.g., hybrid and switching controllers and controllers employing on-line optimization, the computational workload can change dramatically over time, as different control algorithms are switched in and out when the external environment changes, and from job to job due to the varying number of iterations needed in the optimization. Here we assume that a control system involving multiple control loops is implemented as a multi-tasking system, with each controller realized as a separate periodic task. The main resource of concern in these types of problems is the CPU time. The objective for the feedback scheduler is to dynamically adjust the CPU utilization of the controller tasks so that the task set remains schedulable and the stability and performance requirements of the individual controllers are met. One possible structure is shown in Fig. 6.4. The controllers are denoted C_i(z) and the physical plants are denoted P_i(s). Control is used at two levels: to control a number of physical plants and to control the resource allocation to the controllers. In this approach the control performance can be viewed as a QoS parameter. The feedback scheduling problem is often stated as an optimization problem where the objective is to maximize the global control performance according to some criterion, subject to resource and schedulability constraints. An optimization-based approach to feedback scheduling requires performance metrics that are parametrized by scheduling-related parameters, e.g., task periods.
For general applications this kind of information is normally not available. However, for control applications such performance metrics can often be derived. For example, using tools such as Jitterbug [224], it is possible to evaluate stochastic performance indices for linear control systems as functions of the sampling periods. In [225] it was shown that a simple linear rescaling of the nominal task periods in order to meet the utilization setpoint is optimal with respect to control performance under certain conditions. It holds in the case of arriving or departing control tasks with constant execution times, and if the performance indices of all controllers are either linear or quadratic functions of the periods. Linear or quadratic cost functions are quite good approximations of the true cost functions in many cases. A drawback of the previous approach is that it does not consider the on-line control performance. The optimization only concerns the expected stationary performance. Disturbances acting on the control loops will not be taken into account in the optimization. In [226] an alternative approach is proposed. Here, the performance indices of the controllers are based on finite-horizon cost functions related to the sampling period, the current state of the control loop, and the period at which the feedback scheduler is invoked. The optimization horizon corresponds to the period of the feedback scheduler. The intuition behind this formulation is that a process in a transient phase, e.g., during a setpoint change, or exposed to an external disturbance, may require more resources, e.g., a smaller sampling interval, than a process in stationarity.
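The linear rescaling policy of [225] is simple enough to state in a few lines (our sketch; the cited result also assumes the rescaled task set remains schedulable):

```python
def rescale_periods(tasks, u_setpoint):
    """tasks: list of (C_i, T_i_nominal). Stretch all nominal periods by the
    common factor that brings the total utilization down to the setpoint."""
    u_nominal = sum(c / t for c, t in tasks)
    k = max(1.0, u_nominal / u_setpoint)   # never run faster than nominal
    return [t * k for _, t in tasks]

# Two tasks with nominal utilization 0.30 in total; a setpoint of 0.20
# stretches both periods by a factor of 1.5.
periods = rescale_periods([(1.0, 10.0), (2.0, 10.0)], u_setpoint=0.2)
```

Because every period is stretched by the same factor, the relative rates of the controllers are preserved, which is what makes the policy optimal under linear or quadratic cost functions.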

Figure 6.4: Feedback scheduling of control loops.

6.5 Feedback Scheduling and Resource Reservations

The idea behind resource reservation is to explicitly control the computing resources assigned to a given activity (job, task, or application). Each activity receives a fraction U_i of the processor capacity and will behave as if it were executing alone on a slower processor. If an activity attempts to exceed its allocated reservation, it will be delayed, preserving the resource for other activities. Through resource reservation, the experienced QoS of a task will depend on the size of the reservation made for the service. The main benefit of resource reservation compared to using task priorities to express relative importance is that it provides temporal isolation between tasks. The motivation for feedback control in combination with resource reservation is the need to cope with incorrect reservations, to be able to reclaim unused resources and distribute them to more demanding tasks, and to be able to adjust to dynamic changes in resource requirements. Hence, a monitoring mechanism is needed to measure the actual demands and a feedback mechanism is needed to perform the reservation adaptation. Two types of feedback are possible: global and local.

Global Feedback

On a global, system-wide level a QoS controller adjusts the size of the individual reservations given to the different activities based on the measured performance and/or resource utilization. In the case of an overloaded system this will lead to a gradual increase of the allocated budgets up to the level where the virtualization can no longer be maintained. It therefore has to be combined with a global supervisor that scales down the allocated budgets whenever the total utilization exceeds the utilization bound. An alternative or complementary technique is resource reclaiming. Here, allocated but unused resources (slack resources) may be temporarily allocated to reservations in need of additional resources. Several resource reclaiming algorithms have been proposed, e.g., CASH [227], GRUB [228], IRIS [229], and BACKSLASH [230]. The algorithms differ with respect to whom the unused resources are given and how. A reservation scheduler without any feedback or resource reclaiming provides temporal isolation. However, without global feedback or reclaiming, the total amount of resources may not be used in an optimal way. Using feedback and reclaiming, the system maintains the temporal isolation while trying to maximize the resource utilization and, hence, hopefully, the global QoS. A problem appears when a reservation needs more resources than what is available. The temporal isolation mechanism of the reservation scheme will prevent this, but it will prevent it in a way that is oblivious to which remedy is best for the task or application executing within the reservation. For example, in order to prevent a periodic task from over-consuming resources it is possible either to change its rate or to occasionally skip a task instance. Which is best may be application dependent. This is the motivation for combining the global feedback with local feedback controllers.
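The supervisor's scale-down step can be as simple as a proportional compression of all budgets. This is one possible policy, sketched here; real supervisors may respect guaranteed minima or use per-task weights:

```python
def scale_down(shares, u_bound=0.95):
    """If the requested bandwidths sum to more than the bound, shrink them
    all by the same factor; otherwise leave them untouched."""
    total = sum(shares)
    if total <= u_bound:
        return list(shares)
    factor = u_bound / total
    return [s * factor for s in shares]

print(scale_down([0.2, 0.3]))   # fits within the bound: returned unchanged
print(scale_down([0.5, 0.5]))   # oversubscribed: compressed to sum 0.95
```

Proportional compression preserves the relative importance the QoS controller expressed through the budget sizes, while restoring schedulability.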
A large number of feedback-based or adaptive global QoS management systems have been proposed. Some examples are [231, 232, 233]. These systems typically do not consider the local application behaviour. In [234], the problem of dynamically assigning bandwidths to a set of constant bandwidth servers is analyzed. A PI-based controller structure is suggested. In [147] the authors propose a hybrid control approach. The servers are modeled as discrete switched systems, and a feedback scheduler that adjusts the server bandwidths is derived using hybrid control theory. Finally, in [235] the authors propose combining feedback based on a stochastic dead-beat controller with a feedforward moving-average predictor.

Local Feedback

On a task or activity level, local feedback is employed to adjust the resource requirements of the individual tasks based on the experienced QoS levels and the amount of resources available to the task, as decided by the global QoS controller. The local adjustment of resource requirements can be done by rate adaptation, by executing the task at different service levels using, e.g., imprecise computations or multiple versions, and by job skipping. The resulting feedback scheduling structure is hierarchical, or cascaded, and is shown in Fig. 6.5. When local and global feedback are combined, the global controller can no longer base its decisions on the measured resource demands of the different tasks. The reason for this is that the local controller will, assuming that it works properly, adjust the resource consumption of the task so that it matches the amount of global resources available to it. Instead the global controller must base its decisions on the ideal desired amount of resources defined for each task.

Figure 6.5: Hierarchical reservation control.

In [236] an adaptive reservation strategy is proposed for controlling the CPU bandwidth reserved to a task based on QoS requirements. A two-level feedback control is used to combine local application-level mechanisms (e.g., period rescaling and job skipping) with global system-level strategies (e.g., elastic task compression).

6.6 OS Support

In order to support feedback scheduling, the operating system must provide sensing and actuation possibilities. The actual feedback controller can either be implemented as a separate task, typically in kernel mode, or be integrated with the default scheduler of the OS. Examples of variables that are useful to measure are actual task execution times, actual system utilization, the amount of time a task has spent within a task queue, e.g., the ready queue, and the completion time of a task instance in relation to its deadline. Examples of actuator mechanisms include the possibility to change the periods and deadlines of periodic tasks, to modify task priorities, and to signal to a task that it should take actions to modify its execution time demands.

6.7 Feedback Scheduling in ACTORS

ACTORS has no special work package devoted to feedback scheduling. Instead, feedback issues are important in several WPs, in particular in WP1, WP3, and WP4.

Relations to WP1

In WP1 the basic CAL-based modeling environment will be developed. The following issues are related to control:

- When defining the CAL actor networks representing a multimedia or signal processing stream, it must be possible to specify what the sensors are that the local feedback scheduling algorithms may use.

- When defining the CAL actor networks representing a multimedia or signal processing stream, it must be possible to specify what the actuators are that the local feedback scheduling algorithms may use.
When defining the CAL actor networks representing a multimedia or signal processing stream, it must be possible to specify the resource and/or QoS


More information

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center

SDACCEL DEVELOPMENT ENVIRONMENT. The Xilinx SDAccel Development Environment. Bringing The Best Performance/Watt to the Data Center SDAccel Environment The Xilinx SDAccel Development Environment Bringing The Best Performance/Watt to the Data Center Introduction Data center operators constantly seek more server performance. Currently

More information

Ptolemy Seamlessly Supports Heterogeneous Design 5 of 5

Ptolemy Seamlessly Supports Heterogeneous Design 5 of 5 In summary, the key idea in the Ptolemy project is to mix models of computation, rather than trying to develop one, all-encompassing model. The rationale is that specialized models of computation are (1)

More information

ESE532: System-on-a-Chip Architecture. Today. Process. Message FIFO. Thread. Dataflow Process Model Motivation Issues Abstraction Recommended Approach

ESE532: System-on-a-Chip Architecture. Today. Process. Message FIFO. Thread. Dataflow Process Model Motivation Issues Abstraction Recommended Approach ESE53: System-on-a-Chip Architecture Day 5: January 30, 07 Dataflow Process Model Today Dataflow Process Model Motivation Issues Abstraction Recommended Approach Message Parallelism can be natural Discipline

More information

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark by Nkemdirim Dockery High Performance Computing Workloads Core-memory sized Floating point intensive Well-structured

More information

Jump-Start Software-Driven Hardware Verification with a Verification Framework

Jump-Start Software-Driven Hardware Verification with a Verification Framework Jump-Start Software-Driven Hardware Verification with a Verification Framework Matthew Ballance Mentor Graphics 8005 SW Boeckman Rd Wilsonville, OR 97070 Abstract- Software-driven hardware verification

More information

FPGA Co-Processing Architectures for Video Compression

FPGA Co-Processing Architectures for Video Compression Co-Processing Architectures for Compression Overview Alex Soohoo Altera Corporation 101 Innovation Drive San Jose, CA 95054, USA (408) 544-8063 asoohoo@altera.com The push to roll out high definition video

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Seventh Edition By William Stallings Objectives of Chapter To provide a grand tour of the major computer system components:

More information