Languages for Embedded Systems and Their Applications


Lecture Notes in Electrical Engineering, Volume 36

Martin Radetzki (Editor)

Languages for Embedded Systems and Their Applications

Selected Contributions on Specification, Design, and Verification from FDL'08

Editor: Prof. Dr. Martin Radetzki, Institut für Technische Informatik, Universität Stuttgart, Pfaffenwaldring, Stuttgart, Germany

Lecture Notes in Electrical Engineering
Springer Dordrecht Heidelberg London New York
Springer Science+Business Media B.V.

No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)

Preface

Embedded systems take over complex control and data processing tasks in diverse application fields such as automotive, avionics, consumer products, and telecommunications. They are the primary driver for improving overall system safety, efficiency, and comfort. The demand for further improvement in these aspects can only be satisfied by designing embedded systems of increasing complexity, which in turn necessitates the development of new system design methodologies based on specification, design, and verification languages.

The objective of the book at hand is to provide researchers and designers with an overview of current research trends, results, and application experiences in computer languages for embedded systems. The book builds upon the most relevant contributions to the 2008 conference Forum on Design Languages (FDL), the premier international conference specializing in this field. These contributions have been selected based on the results of reviews provided by leading experts from research and industry. In many cases, the authors have improved their original work by adding breadth, depth, or explanation.

System development includes the tasks of defining an initial, high-level specification, designing system architecture and functional blocks, and verifying that architecture and functionality meet the specified properties and requirements. The designers working on these tasks, and the electronic design automation tools deployed in the design process, have to take into account software, digital logic, and analog system components and their complex interactions in heterogeneous, mixed discrete/continuous systems. This book therefore addresses related issues in four parts, dedicated to specification, heterogeneity, design, and verification.

Part I, Model-Based System Specification Languages, focuses on two high-level specification languages which are emerging as standards for embedded systems: the Architecture Analysis and Design Language (AADL), and the Modeling and Analysis of Real-Time and Embedded Systems (MARTE) profile for the Unified Modeling Language (UML). Beyond their syntax and semantics, the methods built upon these languages and initial applications are presented in three chapters. Two further chapters are dedicated to competing approaches using an abstract state machine based language and Matlab/Simulink driven modeling, respectively.

Part II, Languages for Heterogeneous System Design, is devoted to two promising languages that provide the means to describe heterogeneous systems. The discrete-time and continuous-time worlds are brought together by SystemC-AMS on the system level, whereas VHDL-AMS provides all it takes to describe mixed analog and digital circuits.

Part III, Digital Systems Design Methodologies Based on C++, is the largest part of this book, based on the substantial impact that the SystemC library and its methodology-specific additions continue to make in the digital (hardware/software) design community. This part comprises eight chapters devoted to the subjects of transaction-level modeling and its applications, architecture and performance evaluation, design and scheduling of functional blocks, as well as programming and modeling approaches for (run-time) reconfigurable FPGA architectures.

Part IV, Verification and Requirements Evaluation, features contributions addressing both functional and beyond-functional properties. Functional aspects include the verification of circuitry implementing arithmetic operations and the debugging of contradictory functional constraints specified with the SystemC Verification Library (SCV). The analysis of beyond-functional properties, such as timing behavior, performance, area cost, and power dissipation, is covered for Multi-Processor Systems-on-Chip (MPSoC) as well as for on-chip interconnection networks.

The selection of the contributions to the aforementioned parts has been guided by the reviews provided by FDL reviewers and programme committee members. I would like to thank everybody involved in these reviews, and in particular the FDL'08 chairpersons responsible for the conference tracks that relate to the four parts, namely Dominique Borrione (PDV track, Property Driven Verification), Pierre Boulet (UMES track, UML and MDE for Embedded Systems), Sorin Huss (DCS track, Discrete and Continuous Systems), and Frank Oppenheimer (CSD track, C-Based System Design). Moreover, I would like to acknowledge the extra effort made by the authors to lay out, and in most cases revise and extend, their original work as a contribution to this book.

Universität Stuttgart
Martin Radetzki
FDL'08 General Chair

Contents

Part I  Model-Based System Specification Languages

1  Power and Energy Estimations in Model-Based Design
   Eric Senn, Saadia Douhib, Dominique Blouin, Johann Laurent, Skander Turki and Jean-Philippe Diguet
   1.1 Introduction; AADL Component Based Design Flow; Consumption Analysis: the Methodology; Power Estimation; Power Models; Multi-level Estimation; Power Estimation for Complex DSP; Power Estimation for Field Programmable Gate Array; Power Estimation for Operating System Services; Ethernet Communications Consumption Modelling; Models; Consumption Analysis Tool; Property Sets; Conclusion; References

2  MARTE vs. AADL for Discrete-Event and Discrete-Time Domains
   Frédéric Mallet and Robert de Simone
   2.1 Introduction; Marte Time Model; Definitions; Event-Triggered Communications; Time-Triggered Communications; Periodic Tasks and Physical Time; TimeSquare; AADL Modeling Elements; AADL Application Software Components; AADL Flows; AADL Ports; Three Different Configurations; The Aperiodic Case; The Mixed Event/Data Flow Case; The Periodic Case; 2.5 Conclusion; Glossary; References

3  Generation of MARTE Allocation Models from Activity Threads
   Andreas W. Liehr, Klaus J. Buchenrieder, Heike S. Rolfs and Ulrich Nageldinger
   3.1 Introduction; Related Work; Building System Models with MARTE; Utilizing Activity Threads for Design Space Exploration; Generating MARTE Allocation Models with Activity Threads; A Prototypic Implementation of the Method; Visualization of Performance Feedback; Summary and Outlook; References

4  Model-Driven System Validation by Scenarios
   A. Carioni, A. Gargantini, E. Riccobene and P. Scandurra
   4.1 Introduction; ASMs and ASMETA; Scenario-Based Validation of ASM Models; The AVALLA Language; The Model-Driven Validation Environment; From SystemC UML Models to ASM Models; Model Validator; The SimpleBus Case Study; Related Work; Conclusions and Future Work; References

5  An Advanced Simulink Verification Flow Using SystemC
   Kai Hylla, Jan-Hendrik Oetjens and Wolfgang Nebel
   5.1 Introduction; Related Work; Extended Verification Flow; Conventional Flow; Extending the Verification Flow; Implementation; Synchronization; Data Type Conversion; Evaluation; Implementation; Extended Verification Flow; 5.6 Conclusion; References

Part II  Languages for Heterogeneous System Design

6  VHDL-AMS Implementation of a Numerical Ballistic CNT Model
   Dafeng Zhou, Tom J. Kazmierski and Bashir M. Al-Hashimi
   6.1 Introduction; Mobile Charge Density and Self-Consistent Voltage; Numerical Piece-Wise Approximation of the Charge Density; Performance of Numerical Approximations; VHDL-AMS Implementation; Conclusion; References

7  Wide-Band Sigma-Delta ADC Design in Superconducting Technology
   R. Guelaz, P. Desgreys and P. Loumeau
   7.1 Introduction; Sigma-Delta Second Order Architecture; Bandpass Sigma-Delta Modulator; The Josephson Junction; The RSFQ Balanced Comparator; Sigma-Delta Modulator Operation with Josephson Junctions; System Modeling with VHDL-AMS; The Sigma-Delta ADC Design; Clock and Comparator Design; Simulation Results; Conclusion; References

8  Heterogeneous and Non-linear Modeling in SystemC-AMS
   Ken Caluwaerts and Dimitri Galayko
   8.1 Introduction; SystemC-AMS Modeling Platform; Summary of Electrostatic Harvester Operation; SystemC-AMS Modeling of the Harvester; Resonator Modeling; Implementation of the Conditioning Circuit Model; Model of the Whole System; Modeling Results; Description of the Modeling Experiment; Modeling Results Validation; Conclusion; References

Part III  Digital Systems Design Methodologies Based on C++

9  Application Workload and SystemC Platform Modeling for Performance Evaluation
   Jari Kreku, Mika Hoppari, Tuomo Kestilä, Yang Qu, Juha-Pekka Soininen and Kari Tiensyrjä
   9.1 Introduction; Performance Modeling and Simulation; Application and Workload Modeling; Execution Platform Modeling; Allocation and Transformation to SystemC; Performance Simulation; Mobile Video Player Case Example; Modeling of the Execution Platform Components; Modeling of the Services; Modeling of the Application; Analysis of Simulation Results; Conclusions; References

10 Adaptive Interconnect Models for Transaction-Level Simulation
   Rauf Salimi Khaligh and Martin Radetzki
   10.1 Introduction; Related Work; Adaptive Interconnect Models; Point-to-Point Communication; Bus-Based Communication; Model Implementation; An Adaptive FSL Model; An Adaptive AHB Model; Experimental Results; Conclusion; References

11 Efficient Architecture Evaluation Using Functional Mapping
   C. Kerstan, N. Bannow and W. Rosenstiel
   11.1 Introduction; Functional Mapping; Timing Behavior; Conventional Code Transformation; Optimization Approach; Class Unitized; Customize and Apply Unitized; Application of u_trace; 11.5 Using the Approach in the Design Flow; Handling Arrays; Design Example; Simulation Results; Limitations and Experiences; Summary; Outlook; References

12 Symbolic Scheduling of SystemC Dataflow Designs
   Jens Gladigau, Christian Haubelt and Jürgen Teich
   12.1 Introduction; Model of Computation; Symbolic Representation; QSS of SysteMoC Models; Transition Graphs; Path Searching; Scheduling Algorithm; Related Work; Example; Conclusions and Further Work; References

13 SystemC Simulation of Networked Embedded Systems
   Francesco Stefanni, Davide Quaglia and Franco Fummi
   13.1 Introduction; The Architecture of SCNSL; Main Components of SCNSL; Main Problems Solved by SCNSL; Simulation of RTL Models; Assessment of Transmission Validity; Simulation Planning; Application to a Wireless Scenario; Experimental Results; Conclusions; References

14 Modeling of Embedded Software Multitasking in SystemC/OSSS
   Philipp A. Hartmann, Philipp Reinkemeier, Henning Kleen and Wolfgang Nebel
   14.1 Introduction; Related Work; The OSSS Design Flow; Application Layer; Virtual Target Architecture Layer; 14.4 Modeling Software in OSSS; Abstraction of Run-time System; Software Tasks; Software Shared Objects; Software Execution Times; Exploration of Platform Effects; Simulation Results; Accuracy and Performance; Lazy Synchronization; Conclusion; References

15 High-Level Reconfiguration Modeling in SystemC
   Andreas Raabe and Armin Felke
   15.1 Introduction; Related Work; Basic Reconfiguration Modeling; Interpreting Reconfiguration as Circuit Switch; Creating Reconfigurable Modules from Static Ones; Control; Advanced ReChannel Features; Exportals; Synchronization; Explicit Description of Reconfiguration; Resettable Processes; Resettable Components; Binding Groups of Switches; Case Study; Conclusion and Future Work; References

16 Stream Programming for FPGAs
   Franjo Plavec, Zvonko Vranesic and Stephen Brown
   16.1 Introduction; Stream Computing; Streaming on FPGAs; Compiling Brook to Hardware; Example Brook Program; Exploiting Data Parallelism; Experimental Evaluation; Results; Concluding Remarks; References

Part IV  Verification and Requirements Evaluation

17 A New Verification Technique for Custom-Designed Components at the Arithmetic Bit Level
   Evgeny Pavlenko, Markus Wedler, Dominik Stoffel, Wolfgang Kunz, Oliver Wienand and Evgeny Karibaev
   17.1 Introduction; Normalization Method; ABL Normalization; Mixed ABL/Gate-Level Problems; Synthesis of ABL Descriptions from Gate-Level Models; Generation of the Equivalent ABL Descriptions for Boolean Functions in Reed-Muller Form; Experimental Results; Conclusion and Future Work; References

18 Debugging Contradictory Constraints in Constraint-Based Random Simulation
   Daniel Große, Robert Wille, Robert Siegmund and Rolf Drechsler
   18.1 Introduction; SystemC Verification Library; Contradiction Analysis; Problem Formulation; Concepts for Contradiction Analysis; Implementation; Experimental Evaluation; Types of Contradictions; Effect of Property 1 and Property 2; Real-Life Example; Conclusions; References

19 Design of Communication Infrastructures for Reconfigurable Systems
   Alessandro Meroni, Vincenzo Rana, Marco D. Santambrogio and Francesco Bruschi
   19.1 Introduction; Related Works; Real World Applications Analysis; Applications Layer; Scenarios Layer; Characteristics Layer; Metrics Layer; The Proposed Solution; High Level Description; High Level Network Simulation; Evaluation and Selection; Verification and Validation; Results; Concluding Remarks; References

20 Analysis of Non-functional Properties of MPSoC Designs
   Alexander Viehl, Björn Sander, Oliver Bringmann and Wolfgang Rosenstiel
   20.1 Introduction; Related Work; Preliminaries; Activity Model; Power Management Model; Design Flow; Abstraction of System Functionality; Simulation Model Generation; Communication Dependency Graphs; Temporal Environment Models; Integration of Power Consumption and Power Management; Battery Models, Placement and Chip Environment; Experimental Results; Conclusions; References

Part I  Model-Based System Specification Languages

Chapter 1
Power and Energy Estimations in Model-Based Design

Eric Senn, Saadia Douhib, Dominique Blouin, Johann Laurent, Skander Turki and Jean-Philippe Diguet

Abstract  The aim of our work is to provide methods and tools to quickly estimate power consumption in the first steps of a system design. We introduce multi-level power models and show how to use them at different levels of the specification refinement in the model-based AADL (Architecture Analysis & Design Language) design flow. Those power models, with the underlying methodology for power estimation, are currently being integrated in the Open Source AADL Tool Environment (OSATE) under the name CAT: Consumption Analysis Toolbox. In the case of a processor binding, its first prototype gives power consumption estimations for software components in the AADL component assembly model, with a maximal error ranging roughly from 5% at the most refined level (the source code of the software component is known) to 30% at the most abstract level (only the operating frequency and basic target configuration parameters are considered). We illustrate our approach with the power model of a simple RISC (PowerPC 405), of a complex DSP (TI C62), and of an FPGA (from Altera), and we show how those models can be used at different levels in the AADL flow. Obviously, the power consumption of Operating System (OS) services also has to be considered here. We show that the principal impact of the OS on the overall consumption is mainly due to services implying data transfers. We introduce a methodology to model the power and energy consumption of Inter-Process Communications (IPC), and illustrate it with the building and use of a model for Ethernet based inter-process communications.

Keywords  Power and energy consumption modelling · Model driven engineering · AADL · Embedded system · Processors · FPGA

1.1 Introduction

AADL (Architecture Analysis & Design Language) is an input modeling language for real-time embedded systems [13]. It is now commonly used in the avionic domain for the design of safety critical systems. The aim of AADL models is to provide a framework in order to verify functional and non-functional properties of a system, from early analysis of the specification to code generation for the targeted hardware platform [15, 23, 30].

E. Senn, Lab-STICC, CNRS UMR 3192, Université de Bretagne Sud, Lorient, France (eric.senn@univ-ubs.fr)

One objective of the European project ITEA/SPICES (Support for Predictable Integration of mission Critical Embedded Systems) [2] is to improve the modeling and analysis capabilities of AADL. In association with the AADL Standardization Committee [3], enrichments of the language are being proposed, to appear in the next AADL release (2.0). We are currently working in this context to provide energy and power consumption estimations at different levels in the AADL refinement process.

The advantages of early verifications are well known. They are however only possible if estimations can be performed at these high levels within a reasonable delay, allowing the user a fast exploration of the design space. Knowing the power consumption of every component in the system, or at least its upper bound, is indeed essential to guarantee the embedded system's reliability. This knowledge makes it possible to avoid the risk of burning a component in case its maximal power dissipation is exceeded (if its activity goes beyond its critical threshold, for instance), since the temperature of a component is directly related to its power dissipation. If the temperature rises, even if the component is not destroyed, its timing features can be altered. Typically, the delay of critical paths rises, and a fault can occur that causes the entire system to stall [14]. The increase of clock skew in VLSI circuits with non-uniform substrate temperature is also a source of major breakdowns [20]. Analyzing power consumption in a complete system is also necessary to avoid the risk of overloading a supply bus or a power supply source, in case its capacity is exceeded. That will ensure that the power consumption of the system and of all its components (including power sources, power buses, etc.) stays within the allowed global power budget, whatever the configuration of the system.

Significant research efforts have been devoted to developing tools for power consumption estimation at different abstraction levels in embedded system design. A lot of those tools, however, work at the Register Transfer Level (RTL) (this is the case for the tools Diesel [11] and Petrol [21]) or at the Cycle Accurate Bit Accurate (CABA) level [6, 19], and only a few tools work at the architectural level (Wattch [8] and SimplePower [31]). Such approaches cannot be used at high levels because simulation times at such low abstraction levels become enormous for complete and complex systems, like multiprocessor heterogeneous platforms. In [12] and [18], the authors present a characterization methodology for generating power models within TLM for peripheral components. The pertinent activities are identified at several levels and granularities. The characterization phase of the activities is performed at the gate level and is used to deduce the power of coarse-grained activities at higher levels. Again, applying such approaches to complex processors or complete systems is not doable. Instruction level or functional level approaches have been proposed [22, 27, 29]. They however only work at the assembly level, and need to be improved to take into account pipelined architectures, large VLIW instruction sets, and internal data and instruction caches.

We introduced the Functional Level Power Analysis (FLPA) methodology, which we have applied to the building of high-level power models for different hardware components, from simple RISC processors to complex superscalar VLIW DSPs [16, 17], and for different FPGA circuits [25]. We present here our new approach to allow for system-level power and energy estimations in a model-based design flow. We show how it fits into the AADL design flow and how our power models, being inter-operable, are used at different refinement levels.

The AADL design flow is presented in Sect. 1.2, and the methodology to perform power and energy estimations throughout this flow in Sect. 1.3. The first phase of our consumption analysis approach, the power estimation, is detailed in Sect. 1.4. We introduce the power model of the GPP PowerPC 405 and show how multi-level power models are used at different levels in the specification refinement. The power models of the DSP TI C62 and the FPGA Altera Stratix EP1S80 are introduced and used as examples in Sects. 1.5 and 1.6. Section 1.7 presents the modeling of energy consuming OS services, which ought to be included in the software components' consumption. The model of the IPC Ethernet communication is given as an example. The software development and integration in OSATE is sketched in Sect. 1.8. Dedicated packages and property sets for power estimation are exhibited there. The accuracy of our power estimations is finally evaluated in comparison with physical measurements.

1.2 AADL Component Based Design Flow

A component-based system is a component-based software application managed by a component framework and deployed on a target platform [5]. The component-based AADL design flow is presented in Fig. 1.1. It relies mainly on the three following models:

- The AADL component assembly model contains all the component and connection instances of the application. It also references the implementation models of all the component instances, found in the AADL models library.
- The AADL target platform model describes the hardware of the physical target platform. This platform is composed of at least one processor, one memory, and one bus entity to home the execution of processes and threads.
- The AADL deployment plan model describes the AADL-PSM composition process. It defines all the binding properties that are necessary to deploy the processes and services model of the component-based application on the target platform.

Those models are combined to obtain the AADL-PSM model of the complete component-based system. From there, the final implementation of the system is obtained through model transformations and code generation.

The Open Source AADL Tool Environment (OSATE) [1] provides a framework to specify a complete system using AADL. It also permits checking some of its functional and non-functional properties. Those verifications rely on the use of different plug-ins included in the tool set. During the deployment, software components in the AADL component assembly model are bound to hardware components in the AADL target platform model.

Fig. 1.1  AADL component based design

The OSATE scheduling analysis plug-in uses information embedded in the software components' description to propose a binding for the system [26]. Figure 1.2 shows the typical binding of an application on a multi-processor architecture. In this example, process main_process is composed of six threads. It is bound to the memory sysram, as is the data block data_b. Threads control_thread, ethernet_driver_thread and post_thread are bound to the first general purpose processor, GPP1. Thread pre_thread is bound to GPP2. Thread hw_thread1 is, like hw_thread2, a hardware thread. It will be implemented in the reconfigurable FPGA circuit FPGA1. One connection between pre_thread and post_thread has been declared using in and out data ports in the threads. This connection is bound to bus sys_bus since it connects two threads bound to two different components in the platform. Intra-component connections, like the one between threads control_thread and ethernet_driver_thread, do not need to be bound to specific buses. They will however consume hardware resources while being used.

In addition to communication buses, dedicated supply buses can also be declared. The resources analysis plug-in in OSATE comes with an "Analyze Bus Power Draw" command that permits checking that the power capacity of a supply bus is not exceeded. Indeed, a power capacity property (SEI::PowerCapacity) can be declared for a bus, and every component that requires an access to this bus declares a power budget (property SEI::PowerBudget) that it draws from the bus. The plug-in then adds all the power budgets declared for a bus and compares the result with the bus power capacity. This mechanism, even if it is interesting, is extremely limited: the power information is only a guess from the user, and it only applies to supply buses.
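As a quick illustration, here is a minimal Python sketch of that budget-versus-capacity check, assuming the SEI::PowerBudget and SEI::PowerCapacity values have already been read from the model; the component names and the numbers are made up, and this is not OSATE's actual plug-in code.

    def analyze_bus_power_draw(capacity_mw, budgets_mw):
        """Sum the budgets declared by the components requiring the bus
        and compare the total with the bus power capacity."""
        total = sum(budgets_mw.values())
        return total, total <= capacity_mw

    # Hypothetical supply bus with three components drawing from it (mW).
    budgets = {"GPP1": 850.0, "GPP2": 850.0, "FPGA1": 1200.0}
    total, ok = analyze_bus_power_draw(capacity_mw=3000.0, budgets_mw=budgets)
    print(f"total draw = {total} mW, within capacity: {ok}")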

Fig. 1.2  Binding components to the target platform

1.3 Consumption Analysis: the Methodology

We propose to use realistic power estimates for power analysis, relying on an accurate power estimation tool and precise power consumption models for every component in the targeted hardware platform. The challenge for the power estimation tool is to provide a realistic power budget for every component in the AADL component assembly model. In the end, two computation phases are necessary to analyze the power consumption of the system:

1. In the first phase, the power budget is computed for every software component. This phase is the power estimation.
2. In the second phase, the power budgets of the software components are combined to get the power budgets of the hardware components. This second phase is the power analysis. Using timing information, the energy analysis is performed afterwards.

Those two phases are presented in Fig. 1.3 in the case of the binding of a thread to a processor. Plain edges represent phase 1 (estimation), dotted edges represent phase 2 (analysis). The power estimation tool gathers information at different places in the system's AADL specification, here from the process and thread descriptions and from the processor specification.

Fig. 1.3  Power and energy consumption estimation in the AADL design flow

It also uses binding information that may come from a preliminary scheduling analysis. The result of the scheduling analysis indeed gives the percentage load of the processors for each thread. Whenever a processor is idle, its power consumption is at its minimum level. Scheduling analysis is performed using basic information on the threads' properties, defined for each thread implementation in the AADL component assembly model: dispatch_protocol (periodic, aperiodic, sporadic, background), period, deadline, execution_time, etc.

The tool uses the power model of the targeted hardware component (here a processor) to compute the power budget of the application component (which can be a software or hardware thread). In fact, it determines the input parameters of the model from the set of information it has gathered. Once the power budgets have been computed for every component in the application, the power analysis can be done. The power analysis tool retrieves all the component power budgets, together with additional information from the specification, and computes the power budget of every hardware component in the system. Then it computes the power estimation for the whole system. Energy analysis can be performed using information from the timing analysis tools currently being developed in the SPICES project.
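To make the two phases concrete, here is a minimal Python sketch of this flow under simplifying assumptions: a thread's budget comes from the power model of the processor it is bound to, weighted by the load reported by scheduling analysis, and hardware budgets are the sums of the software budgets bound to them. The law coefficients, loads, idle power and timing below are all hypothetical.

    from collections import defaultdict

    def thread_budget(model, params, load, p_idle_mw):
        # Phase 1 (power estimation): active power from the component's
        # power model, falling back to the minimum (idle) level otherwise.
        return load * model(params) + (1.0 - load) * p_idle_mw

    def platform_budgets(bindings):
        # Phase 2 (power analysis): combine software budgets per hardware
        # component, then for the whole system.
        hw = defaultdict(float)
        for thread, (target, p_mw) in bindings.items():
            hw[target] += p_mw
        return dict(hw), sum(hw.values())

    # Hypothetical consumption law for a processor (mW), linear in frequencies.
    ppc405 = lambda p: 0.40 * p["f_proc"] + 5.4 * p["f_bus"]

    p_pre = thread_budget(ppc405, {"f_proc": 300, "f_bus": 100}, 0.6, 200.0)
    p_post = thread_budget(ppc405, {"f_proc": 300, "f_bus": 100}, 0.3, 200.0)
    hw, total_mw = platform_budgets({"pre_thread": ("GPP2", p_pre),
                                     "post_thread": ("GPP1", p_post)})
    energy_mj = total_mw * 0.010  # energy analysis: power (mW) x 10 ms = mJ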

1.4 Power Estimation

We have developed a specific power estimation tool to compute the power budget of every software component in the AADL component assembly model. This tool, which is an evolution of our former power estimation tool SoftExplorer [24], comes with a library of power models for every hardware component on the platform. The method to build a model is based on Functional Level Power Analysis (FLPA) [17]. This approach relies on the decomposition of a complex system into functional blocks that are independent with regard to power consumption. Parameters are identified that characterize the way the functional blocks will be excited by the input specification. A set of measurements is performed, where the system consumption is measured for different values of the input parameters. Consumption charts are plotted and mathematical equations are computed by regression. This method is interesting because it links low-level measurements and observations of the physical implementation with high-level parameters from earlier steps in the design flow. It is an efficient way of evaluating how an application will use the resources of the targeted implementation.

As depicted before, the building of a model, and the choice of its parameters, are based on a set of physical measurements. Measurements are used again to validate the model: the model output is compared with measured values over a large set of realistic applications. This allows us to define precisely the maximal and average errors introduced by the model.

1.4.1 Power Models

A power model is thus a set of consumption laws that make it possible to compute the power consumption of a component as a function of a reduced set of input parameters. As an example, we present in Tables 1.1, 1.2 and 1.3 the set of consumption laws that constitute the power model of the PowerPC 405 processor that we have developed in the framework of SPICES. Those laws come with an accuracy better than 5% (the average maximum error found between the physical measurements and the results of a law); the average error is 2%. The input parameters are the processor frequency (F_processor) and the frequency of the bus it is connected to (F_bus), the type of instruction executed (calculation (Calc) or memory accesses (LD and ST)), the configuration of the memory hierarchy associated with the processor's core (i.e. which caches are used (data and/or instruction) and where the primary memory is (internal/external)), and the cache miss rate (γ).

Table 1.1 presents the consumption laws for the 2.5 V power supply when caches are disabled, or when one of the two caches is enabled. When the two caches are disabled, the program and/or data can be stored in the internal FPGA memory (using BRAM) or in an external memory (a SDRAM on our board). Even if it is possible to store the program and data in the external memory, programmers never use this possibility: the cost of one access (in power and time) to the external memory is too high to make this solution acceptable. As a result, we have considered in this case that the instructions and data are stored in the BRAM, connected to the processor either through the OCM bus or through the PLB bus.

Table 1.1  Consumption laws for the 2.5 V power supply

  (1)  BRAM:                       P (mW) = 5.37 F_bus
       Instruction cache, SDRAM:
  (2)    γ = 0:                    P_Calc (mW) = 5.37 F_bus
  (3)    γ > 0, F_bus = 50:        P_Calc (mW) = 0.76 F_processor … γ
  (4)    γ > 0, F_bus = 66:        P_Calc (mW) = … ln(γ)
  (5)    γ > 0, F_bus = 100:       P_Calc (mW) = 0.99 F_processor … γ
  (6)                              P_LD/ST (mW) = 4.96 γ … F_bus
       Data cache, SDRAM:
  (7)    LD:                       P (mW) = 4.1 γ + 6.3 F_bus
  (8)    ST:                       P (mW) = 6.88 γ … F_bus

Table 1.2  Consumption laws for the 1.5 V power supply

       Without cache, BRAM:
  (9)                              P_Calc (mW) = 0.32 F_processor … F_bus
  (10)                             P_LD/ST (mW) = 0.55 F_processor … F_bus
       Instruction cache, SDRAM:
  (11)                             P_Calc (mW) = 0.46 F_processor … F_bus
  (12)                             P_LD (mW) = 0.66 F_processor … F_bus
  (13)                             P_ST (mW) = 0.39 F_processor … F_bus
       Instruction cache, BRAM:
  (14)                             P_Calc (mW) = 0.46 F_processor … F_bus
  (15)                             P_LD (mW) = 0.70 F_processor … F_bus
  (16)                             P_ST (mW) = 0.44 F_processor … F_bus
       Data cache, SDRAM:
  (17)                             P_LD/ST (mW) = 0.38 F_processor … F_bus
       Data cache, BRAM:
  (18)                             P_LD (mW) = 0.40 F_processor … F_bus
  (19)                             P_ST (mW) = 0.44 F_processor … F_bus

Table 1.3  Consumption laws with both the data and instruction caches enabled

       BRAM:
  (20)   1.5 V:                    P (mW) = 0.40 F_processor … F_bus
  (21)   2.5 V:                    P (mW) = 5.37 F_bus
       SDRAM:
  (22)   1.5 V:                    P (mW) = 0.38 F_processor … F_bus
  (23)   2.5 V:                    P (mW) = 4.1 γ + 6.3 F_bus

If the BRAM is used, the consumption only depends on the bus frequency. If the SDRAM is used, one of the two caches (for instructions or data) must be selected. With the instruction cache, the consumption depends on the instruction type (LD/ST or computation instructions); for computation instructions, it also depends on the cache miss rate and the bus frequency. With the data cache, the consumption depends on the memory access instruction type (LD or ST).

Table 1.2 presents the consumption laws for the 1.5 V power supply, again when caches are disabled or when one of the two caches is enabled. The consumption depends on which cache is used and on the type of primary memory (BRAM or SDRAM). If the BRAM is used, then, without cache, the consumption depends on the instruction type (computation or memory access). With the instruction cache, the consumption depends on the instruction type, and with the data cache, it depends on the type of memory access instruction (LD or ST). If the SDRAM is used, then, with the instruction cache, the consumption depends on the type of instruction (same law as with the BRAM memory). With the data cache, the consumption is the same for LD and ST instructions.

Table 1.3 presents the consumption laws for the 2.5 V and 1.5 V supplies when both the data and instruction caches are enabled. In this situation, the consumption depends on the primary memory used, but not on the type of instructions. On the 1.5 V supply, the consumption depends on the processor and bus frequencies; on the 2.5 V line, it only depends on the bus frequency.
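As an illustration of how such a model is evaluated, the Python sketch below encodes two of the laws above. The coefficients marked as placeholders stand in for constants that are not legible in the tables; only the shape of the laws (linear in the frequencies) is taken from the text.

    # a-coefficients follow Tables 1.2 and 1.3; b and c are placeholders.

    def p_calc_bram_no_cache_1v5(f_proc_mhz, f_bus_mhz, b=0.5, c=0.0):
        """1.5 V computation-instruction power (mW), caches disabled, BRAM:
        P_Calc = 0.32 F_processor + b F_bus + c (law 9)."""
        return 0.32 * f_proc_mhz + b * f_bus_mhz + c

    def p_both_caches_bram_1v5(f_proc_mhz, f_bus_mhz, b=0.5, c=0.0):
        """1.5 V power (mW), both caches enabled, BRAM primary memory:
        P = 0.40 F_processor + b F_bus + c (law 20)."""
        return 0.40 * f_proc_mhz + b * f_bus_mhz + c

    print(p_both_caches_bram_1v5(300, 100))  # e.g. 300 MHz core, 100 MHz bus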

1.4.2 Multi-level Estimation

As described before, the tool, when it is invoked, extracts relevant information (a set of parameters) from the AADL specification, then computes the components' power consumption, and finally returns the results to fill the power budget properties of the software components. The information that is extracted from the specification depends on (i) the refinement level and (ii) the targeted hardware component. For instance, the information needed is not the same if the processor to which the software component (thread) is bound is an ARM7, a PowerPC, or a TI C6x DSP.

During the deployment, software components are bound to hardware components. Thus, the power estimation tool, as previously illustrated in Fig. 1.3 in the case of a thread-to-processor binding, must resolve any possible binding of software components onto hardware components, and that means: (i) threads onto processors, (ii) processes and data onto memories, and (iii) connections onto buses. The binding makes it possible to put components in the AADL component assembly model in relation with the power models of the hardware components on the targeted platform.

In the following, still using the PowerPC 405 as an example, we show how a power model is used at different refinement levels of the AADL specification. To use a power model, we have to extract some relevant information from the AADL specification in order to determine the model's input parameters. Depending on the specification refinement, it might not be possible to determine precisely the value of every parameter. It is important to determine the accuracy of our estimations at every proposed level. In order to do that, we fix the values of the parameters that are known at the considered level, and then perform estimations with all the possible values of the remaining unknown parameters. The maximum error is the difference between the average and the maximum estimations. This is repeated for another set of values of the known input parameters. The final maximum error is the maximum of the maximum errors. This is a very worst case, because most of the time the user, even if he cannot determine them precisely at a given refinement level, can give a realistic value for every unknown parameter. We finally defined three refinement levels for the PowerPC 405.

Refinement Level 1

At the first refinement level, our model gives a rough estimate of the power consumption of the software component, only from the knowledge of the processor and some basic information on its operating conditions. In the case of the PowerPC 405, the maximum error we get here is 27%. The only information we need is the processor frequency and the frequency of the internal bus (OCM or PLB) to which the processor is connected inside the FPGA.

Those are the known parameters that were used to determine the maximum error: valid processor/bus frequency couples were fixed, and estimations were performed with all the other parameters varied. Those two parameters are in fact directly related to the target platform and the hardware component. They can be changed according to the user's will. They constitute what we call hardware configuration parameters for this processor. They will be defined as a property of the AADL processor implementation of the PowerPC 405 in the AADL specification.

To calculate the maximum error, power estimations are performed with one processor/bus frequency couple and with all the possible values of the remaining parameters which are not known at the refinement level. The maximum error then comes from the difference between the average and the maximum estimations. This is repeated for every valid processor/bus frequency couple. The final maximum error is the maximum of the maximum errors.

Refinement Level 2

At this refinement level, we have to add some information about the memories used. We have to indicate which caches will be used in the PowerPC 405 (data cache, instruction cache, or both), and whether its primary memory is internal (using the FPGA BRAM memory bank) or external (using a SDRAM accessed through the FPGA I/O). Indeed, while building the power model for the PowerPC 405, we observed that it draws quite different power and energy in these various situations. Four different operating conditions are identified (the applicable laws are selected as sketched after this list):

- Caches disabled: if caches are disabled then, as explained before, instructions and data are stored in the internal memory (BRAM). Consumption laws 1, 9 and 10 are used in this case.
- Instruction cache enabled: when an instruction is not in the cache, a cache miss is generated and an access is made to the global memory (the external memory in most cases). The instruction is fetched from this memory and stored into the cache. If the global memory is (depending on the development board) the internal BRAM of the FPGA connected by the PLB bus, then laws 1, 14, 15, 16 are used. If the global memory is the external SDRAM (accessed through the FPGA I/O), then laws 2, 3, 4, 5, 6 (depending on the cache miss rate γ and the bus frequency F_bus) and laws 11, 12 and 13 are used.
- Data cache enabled: when a data item is not found in the cache, a cache miss is generated. The data is read from the global memory and written to the cache. The program instructions are stored in the internal BRAM connected to the OCM bus. The global memory can be the internal BRAM connected to the PLB bus (laws 1, 9, 18, 19 are used), or the external SDRAM (through the FPGA I/O); in this last situation, laws 7, 8, 9 and 17 are used.
- Data and instruction caches enabled: if the primary memory is the BRAM, laws 20 and 21 are used. Laws 22 and 23 are used if the primary memory is the external SDRAM.
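The Python sketch below captures that selection as a lookup table, with the law numbers taken from the list above; the dispatch table itself is our illustration, not part of the tool.

    # Map (cache configuration, global/primary memory) to the numbers of
    # the applicable consumption laws, as enumerated in the text above.
    APPLICABLE_LAWS = {
        ("no_cache", "BRAM"):     [1, 9, 10],
        ("icache",   "PLB_BRAM"): [1, 14, 15, 16],
        ("icache",   "SDRAM"):    [2, 3, 4, 5, 6, 11, 12, 13],
        ("dcache",   "PLB_BRAM"): [1, 9, 18, 19],
        ("dcache",   "SDRAM"):    [7, 8, 9, 17],
        ("both",     "BRAM"):     [20, 21],
        ("both",     "SDRAM"):    [22, 23],
    }

    def applicable_laws(cache_config, memory):
        return APPLICABLE_LAWS[(cache_config, memory)]

    print(applicable_laws("both", "SDRAM"))  # -> [22, 23]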

Table 1.4  Maximal errors with level 2 refinement, per valid F_processor/F_bus couple (MHz); the last value of each row is the row maximum

  2 caches, BRAM:      0%
  2 caches, SDRAM:     7.0%, 7.1%, 7.2%, 8.0%, 8.5%, 8.6%, 8.7%, 9.3%, 9.7%; max 9.7%
  BRAM:                1.9%, 1.5%, 1.1%, 1.4%, 1.4%, 1.1%, 0.9%, 0.8%, 0.7%; max 1.9%
  Icache BRAM & BRAM:  3.2%, 2.8%, 2.4%, 2.5%, 2.3%, 2.1%, 1.8%, 1.6%, 1.4%; max 3.2%
  Icache SDRAM:        13.8%, 13.7%, 13.7%, 13.5%, 13.5%, 13.4%, 13.5%, 13.2%, 13.2%; max 13.8%
  Dcache BRAM:         0.1%, 0.2%, 0.3%, 0.2%, 0.1%, 0.2%, 0.2%, 0.2%, 0.2%; max 0.3%
  Dcache SDRAM:        12.6%, 12.8%, 12.9%, 13.8%, 14.3%, 14.4%, 14.5%, 15.3%, 15.1%; max 15.3%

Table 1.4 shows the maximal errors we obtain for every valid set of known input parameters, the others being unknown. The maximum error we obtain is 15.3% and the average error is 6.6%. The first line indicates 0% because, in this configuration, there are no remaining unknown parameters that can change the power consumption of the processor.

The result of the scheduling analysis (which gives the percentage load of the processors) is also taken into account at this level. Indeed, for the percentage of time a processor is idle, its power consumption is at its minimum level. Scheduling analysis is performed using basic information on the threads' properties, defined for each thread implementation in the AADL component assembly model: dispatch_protocol (periodic, aperiodic, sporadic, background), period, deadline, execution_time, etc.
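The worst-case figures above can be reproduced mechanically; here is a small Python sketch of the procedure described in Sect. 1.4.2 (fix the known parameters, sweep the unknown ones, compare the maximum estimate to the average). The model and the parameter grid are illustrative only.

    from itertools import product
    from statistics import mean

    def max_relative_error(model, known, unknown_grid):
        """Error at a refinement level: fix the known parameters, estimate
        for every combination of the unknown ones, and return the gap
        between the maximum and the average estimate."""
        names = list(unknown_grid)
        estimates = [model(**known, **dict(zip(names, values)))
                     for values in product(*unknown_grid.values())]
        return (max(estimates) - mean(estimates)) / mean(estimates)

    # Hypothetical law: linear in the frequencies, sensitive to gamma.
    law = lambda f_proc, f_bus, gamma: 0.40 * f_proc + 5.4 * f_bus + 40.0 * gamma

    err = max_relative_error(law,
                             known={"f_proc": 300, "f_bus": 100},
                             unknown_grid={"gamma": [0.0, 0.1, 0.25, 0.5]})
    print(f"maximum error: {err:.1%}")  # repeated over every valid couple;
                                        # the final figure is the largest one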

Refinement Level 3

At this refinement level, the actual code of the software component is parsed. In the case of the PowerPC 405, what is important is not exactly which instruction is executed, but rather the type of instruction being executed. We have indeed shown that the power consumption changes noticeably from memory access instructions (load or store) to computation instructions (multiplication or addition). As we have seen before, the place where the data is stored in memory is also important, so the data mapping is also parsed here. The average error we get at this level is 2%; the maximum error is 5%. Logically, that corresponds to the maximum and average errors of the set of consumption laws for the component.

1.5 Power Estimation for Complex DSP

The TI C62 processor has a complex architecture. It has a VLIW instruction set, a deep pipeline (up to 15 stages), fixed point operators, and parallelism capabilities (up to 8 operations in parallel). Its internal program memory can be used like a cache in several modes, and an External Memory Interface (EMIF) is used to load and store data and program from the external memory [28]. In the case of the C62, the six following parameters are considered.

The clock frequency (F) and the memory mode (MM) are what we call architectural parameters. They are directly related to the target platform and the hardware component, and can be changed according to the user's will. The influence of F is obvious. The C62 maximum frequency is 200 MHz (for our version of the chip); the designer can tweak this parameter to adjust consumption and performance.

The remaining parameters are called algorithmic parameters; they directly depend on the application code itself. The parallelism rate α assesses the flow between the processor's instruction fetch stages and its internal program memory controller inside its IMU (Instruction Management Unit). The activity of the processing units is represented by the processing rate β; this parameter links the IMU and the PU (Processing Unit). The activity rate between the IMU and the MMU (Memory Management Unit) is expressed by the program cache miss rate γ. The pipeline stall rate (PSR) counts the number of pipeline stalls during execution. It depends on the mapping of data in memory and on the memory mode.

The memory mode MM describes the way the internal program memory is used. Four modes are available. All the instructions are in the internal memory in the mapped mode (MM_M). They are in the external memory in the bypass mode (MM_B). In the cache mode, the internal memory is used like a direct mapped cache (MM_C), as well as in the freeze mode, where no writing in the cache is allowed (MM_F). The internal logic used to fetch instructions (for instance, tag comparison in cache mode) actually depends on the memory mode, and so does the power consumption.

A precise description of the C62 power model and its building may be found in [16]. The variation of the power consumption with the input parameters, more precisely the fact that the estimation is not equally sensitive to every parameter, allows the model to be used in three different situations.

In the first situation, only the operating frequency is known. The tool returns the average value of the power consumption, which comes from the minimum and maximum values obtained when all the other parameters are made to vary. The designer can also ask for the maximum value if an upper bound is needed for the power consumption.

In the second situation, we suppose that the architectural parameters (here F and MM) are known. We also assume that the code is not known and that the designer is able to give some realistic values for every algorithmic parameter. If not, default values are proposed, coming from the values that we have observed running different representative applications on this DSP (see Table 1.5).

Table 1.5  Default algorithmic parameters for the C62 (α, β, PSR)

  LMSBV_
  MPEG_
  MPEG_2_ENC
  FFT_
  DCT
  FIR_
  EFR_Vocoder_GSM
  HISTO (image equalisation by histogram)
  SPECTRAL (signal spectral power density estimation)
  TREILLIS (soft decision sequential decoding)
  LPC (Linear Predictive Coding)
  ADPCM (Adaptive Differential Pulse Code Modulation)
  DCT_2 (imag )
  EDGE DETECTION
  G721 (Marcus Lee)
  AVERAGE VALUES

In the third situation, the source code is known. It is then parsed by our power estimation tool: the value of every algorithmic parameter is computed and the power consumption is estimated, using the power model and the values entered by the user for the frequency and the memory mode.

The error introduced by our tool obviously differs in these three situations. To calculate the maximum error, estimations are performed with given values for the parameters known in the situation, and with all the possible values of the remaining unknown parameters. The maximum error then comes from the difference between the average and the maximum estimations. This is repeated for every valid set of known input parameters. The final maximum error is the maximum of the maximum errors. Table 1.6 gives the maximum error in the three situations above, which correspond to three levels of the specification refinement.

Table 1.6  Maximum errors for the C62 power model (power in mW)

  Level 1  Known: F                                 max error: …%
  Level 2  Known: F, MM, α, β, γ, PSR
             Mapped: …%   Cache: …%   Freeze: …%   Bypass: …%
  Level 3  Known: F, MM, and the source code        Max Error = 8%, Average Error = 4%
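A compact Python sketch of these three usage situations is given below, under stated assumptions: the consumption law body, the default parameter values and the sweep grid are all placeholders (the real laws and defaults are in [16] and Table 1.5), and in situation 3 a code parser would fill the algorithmic parameters instead of the user.

    from itertools import product
    from statistics import mean

    MODES = ("MAPPED", "CACHE", "FREEZE", "BYPASS")
    DEFAULTS = {"alpha": 0.6, "beta": 0.5, "gamma": 0.1, "psr": 0.2}  # placeholders
    GRID = {name: [0.0, 0.25, 0.5, 0.75, 1.0] for name in DEFAULTS}   # sweep ranges

    def c62_law(f_mhz, mm, alpha, beta, gamma, psr):
        # Placeholder consumption law; only its role is meaningful here.
        scale = {"MAPPED": 1.0, "CACHE": 1.1, "FREEZE": 0.9, "BYPASS": 1.3}[mm]
        return scale * f_mhz * (1.0 + alpha + beta + gamma + 0.5 * psr)

    def c62_power(f_mhz, mm=None, **algo):
        if mm is None:
            # Situation 1: only F is known; average over modes and the grid.
            names = list(GRID)
            samples = [c62_law(f_mhz, m, **dict(zip(names, vals)))
                       for m in MODES for vals in product(*GRID.values())]
            return mean(samples)
        # Situation 2: F and MM known; defaults fill missing parameters.
        # Situation 3: same call, but 'algo' comes from parsing the code.
        return c62_law(f_mhz, mm, **{**DEFAULTS, **algo})

    print(c62_power(200), c62_power(200, "MAPPED", alpha=0.8))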

Note that the maximal errors computed at level 2 are really pessimistic, since we assume here that the designer is completely (100%) wrong in his evaluation of all the input parameters. If his evaluation of those parameters is only 50%, or 25%, wrong, then the error introduced by our tool is reduced as well.

1.6 Power Estimation for Field Programmable Gate Array

FPGAs (Field Programmable Gate Arrays) are now very common in electronic systems. They are often used in addition to GPPs (General Purpose Processors) and/or DSPs (Digital Signal Processors) to tackle the data intensive, dedicated parts of an application. They act as hardware accelerators where and when the application is very demanding regarding performance, typically for signal or image processing algorithms.

In this case again, power estimation can be performed at different refinement levels. At the highest levels, the code of the application is not known yet. The designer however needs to quickly evaluate the application against power, energy and/or thermal constraints. A fast estimation is necessary here, and a much larger error is acceptable. The parameters we can use from the high-level specifications are the frequency F and the occupation ratio β of the targeted FPGA implementation, which we consider as architectural parameters, and the activity rate α. The experienced designer is indeed able to provide, even at this very high level, a realistic guess of those parameters' values.

As explained before, to obtain the model, i.e. the mathematical equation linking its output to the parameters, we performed a set of measurements on the targeted FPGA. For different values of the occupation ratio, and for different values of the frequency, we varied the activity rate and measured the power consumption.
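The FLPA model-building step, measurement followed by regression, can be pictured with the short Python sketch below; a plain least-squares fit of an affine law stands in for the regressions actually used by the authors, and the measurement points are invented.

    import numpy as np

    # Invented measurement campaign: columns are F (MHz), occupation ratio
    # beta, activity rate alpha; p_measured is the measured power (mW).
    X = np.array([[10, 0.1, 0.1], [10, 0.9, 0.9], [50, 0.5, 0.5],
                  [90, 0.1, 0.1], [90, 0.9, 0.9], [50, 0.9, 0.1]], dtype=float)
    p_measured = np.array([120.0, 480.0, 700.0, 650.0, 2100.0, 900.0])

    # Fit an affine law P = a*F + b*beta + c*alpha + d by least squares.
    A = np.column_stack([X, np.ones(len(X))])
    coeffs, *_ = np.linalg.lstsq(A, p_measured, rcond=None)

    def fpga_power_mw(f_mhz, beta, alpha):
        return float(coeffs @ np.array([f_mhz, beta, alpha, 1.0]))

    print(fpga_power_mw(90, 0.5, 0.4))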

Table 1.7  Maximum errors for the Altera Stratix EP1S80 (power in mW)

  Level 1  Known: F (F = 10 MHz)              max error: …%
           Known: F (F = 90 MHz)              max error: …%
  Level 2  Known: F, α, β
             F = 10 MHz, β = …                max error: …%
             F = 10 MHz, β = …                max error: …%
             F = 90 MHz, β = …                max error: …%
             F = 90 MHz, β = …                max error: …%
  Level 3  Known: F and the source code       Max Error = 4.2%, Average Error = 1.3%

At our first refinement level, only the frequency is known. Our power estimation tool uses the model to estimate, at the given frequency, the power consumption with α = β = 0.1 and with α = β = 0.9. It then returns the average value between those minimal and maximal values. The maximal errors we obtain for F = 10 MHz and F = 90 MHz (the upper bound for the Altera Stratix EP1S80) are given in Table 1.7.

At the next refinement level, the two architectural parameters F and β are known to the user. As in the case of the processors' models, default values are proposed for α, and also for β, coming from a set of representative applications. The maximal error introduced in this case ranges from 6.9% to 44.8%. To determine this error, we compute the maximum and minimum estimations for the four extreme (F, β) couples, and compare them to the estimations with the α default value.

At the lowest refinement level, the source code (a synthesizable hardware description of the component's behaviour, which may be written in VHDL or SystemC) is used. A High-Level Synthesis tool [9] permits estimating the amount of resources necessary to implement the application and, given the targeted circuit, obtaining its occupation ratio (β) and its activity rate (α). Those two parameters and the frequency are finally used with the model.

1.7 Power Estimation for Operating System Services

We have presented how to obtain a precise estimation of the power and energy consumed by every thread in the application. There is still work to do to estimate the consumption of the whole system. We have shown in [10] that a large part of a system's power consumption is due to data transfers. Those transfers may come directly from the application, or may be induced by the operating system. Data transfers from the application are either embedded in the source code through direct memory accesses by threads, or supported by inter-process communication services of the operating system.

In the first situation, the data transfer power consumption is directly included in the power model of the processor that runs the code. The consumption of the external memory is easily computed from the number of external accesses and the basic power features of the memory component, taken from its datasheet. The power consumption of inter-process communication (IPC) services is however not included in the processors' models. As a result, we have developed specific power models for IPC. Moreover, a specific AADL package was also developed to allow the user to describe how IPC are called by the application. As an example, we show here the model that we have developed for Ethernet IPC.

1.7.1 Ethernet Communications Consumption Modelling

We select a standard peripheral device, the Ethernet interface, as a representative example implemented in most embedded systems. As a first step, we identify the key parameters that can influence the power and energy consumption of Ethernet communications. Then we conduct physical power measurements on a XUP pro development board, and take execution time values from the traces obtained by the Linux Trace Toolkit [32]. Measurements were taken while running different testbenches that contain RTOS routines stimulating the Ethernet interface. The RTOS that we analyze in this study is MontaVista embedded Linux 3.1. Once we obtained all the measurements, we built the power and energy consumption model of the Ethernet interface. The following sections explain in detail the three steps that led to the Ethernet communications model.

Analysis of Relevant Parameters

Our study is focused on the effect of the operating system on the power and energy consumption of embedded system components. In the case of the Ethernet interface component, we identified hardware and software parameters influencing energy consumption. The hardware parameters of our models are the processor frequency, the bus frequency and the primary memory. The software parameters are related to the applicative tasks and the operating system services; they correspond to the IP packet data size and the transmission protocol (UDP or TCP).

Power and Energy Characterisation

We performed power consumption characterisation for two components of the XUP board. The first component is the Ethernet MAC controller, which is embedded in the FPGA and powered by a 2.5 V power supply. The second is the physical Ethernet (PHY) controller, which is powered by a 3.3 V power supply.

We used test programs that only stimulate the OS networking services. Therefore, only the processor, the RAM and the Ethernet interface are solicited.

Models

The power consumption models of the MAC and PHY controllers are represented by the equations in Table 1.8. For the power model, the average error between the measured power consumption and the values estimated by the models is 3.5%; the maximum error is 9%.

Table 1.8  Power model of Ethernet communications

  MAC controller:  P_MAC (mW) = 0.65 F_proc (MHz) …
  PHY controller:  P_PHY (mW) = …

Following the methodology defined in Sect. 1.3, we made a performance analysis of the whole system. Then, we calculated energy dissipation values in relation to the variation of all the model parameters. We obtained energy values for the MAC and PHY controllers. In Table 1.9, we give the model related to the MAC controller. For each transmission protocol and each processor frequency, there are two laws: the first for IP packet data sizes smaller than 1500 bytes, the second for IP packet data sizes of 1500 bytes or more. Since the maximum transmission unit (MTU) of the Ethernet network is 1500 bytes, the Internet layer fragments IP packets larger than the MTU. On the other hand, there is more encapsulation and no fragmentation for IP packets smaller than the MTU. We can deduce from Table 1.9 that encapsulation yields more energy dissipation than fragmentation.

Table 1.9  Energy model of the MAC controller (2.5 V); laws of the form E (µJ/byte) = a · P_size^b, where P_size is the IP packet data size

  100 MHz:  E_UDP = … P_size^0.88 if P_size < 1500 B;  E_UDP = 1.47 P_size^0.13 if P_size ≥ 1500 B   (error 6.28%)
            E_TCP = … P_size^0.68 if P_size < 1500 B;  E_TCP = 1.80 P_size^…   if P_size ≥ 1500 B   (error 10.16%)
  200 MHz:  E_UDP = … P_size^0.86 if P_size < 1500 B;  E_UDP = 0.85 P_size^0.11 if P_size ≥ 1500 B  (error 7.14%)
            E_TCP = … P_size^0.65 if P_size < 1500 B;  E_TCP = 1.17 P_size^0.1 if P_size ≥ 1500 B   (error 8.26%)
  300 MHz:  E_UDP = … P_size^0.84 if P_size < 1500 B;  E_UDP = 0.79 P_size^0.11 if P_size ≥ 1500 B  (error 8.59%)
            E_TCP = … P_size^0.65 if P_size < 1500 B;  E_TCP = 1.09 P_size^0.09 if P_size ≥ 1500 B  (error 8.68%)
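A small Python sketch of how such a piecewise power-law model is evaluated is given below; the missing coefficients are replaced by placeholders, so only the E = a · P_size^b shape, the split at the MTU, and the legible 200 MHz values are taken from Table 1.9.

    MTU = 1500  # bytes; fragmentation above, extra encapsulation below

    # (protocol, F_proc in MHz) -> ((a, b) below the MTU, (a, b) at or above).
    # The a-values below the MTU (0.02, 0.05) are placeholders for
    # coefficients that are illegible in Table 1.9.
    LAWS = {
        ("UDP", 200): ((0.02, 0.86), (0.85, 0.11)),
        ("TCP", 200): ((0.05, 0.65), (1.17, 0.10)),
    }

    def energy_per_byte_uj(p_size, proto, f_proc):
        small, large = LAWS[(proto, f_proc)]
        a, b = small if p_size < MTU else large
        return a * p_size ** b

    packet = 4000  # bytes
    total_uj = packet * energy_per_byte_uj(packet, "TCP", 200)  # whole packet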

The model we propose has some fitting error with respect to the measured energy values it is based on. We use the following average error metric, where the Ẽ_i are the energy values given by the model and the E_i are the energy values based on the power and performance measurements:

  (1/n) Σ_{i=1..n} |Ẽ_i − E_i| / E_i

After building the power and energy models of Ethernet communications, we have integrated them in the model library of our power estimation tool. Following the methodology presented in Sect. 1.3, the tool estimates the power and energy consumption of applicative tasks communicating through Ethernet. To perform this estimation, the tool extracts pertinent parameters from the AADL specification, such as the processor frequency and the transmission protocol type.

Despite its precise modelling semantics, we noticed that AADL has some shortcomings regarding software modelling. For example, communications between threads or processes are modelled through ports (event, data or event data ports). But this modelling facility is not sufficient to describe all the communication mechanisms supported by an operating system. Therefore, it is necessary to extend the AADL language to enable precise modelling of process communications. These extensions will be presented in future publications.

1.8 Consumption Analysis Tool

Our power models for processors, FPGAs and OS services, and the underlying power estimation methodology, are integrated into the Open Source AADL Tool Environment (OSATE) in the form of a global Consumption Analysis Toolbox (CAT). For an AADL system component, or for one of its subcomponents selected in the model graphical editor window, the tool computes the power consumption and displays the results in a component-specific Eclipse view (Fig. 1.4).

1.8.1 Property Sets

As we have seen in the previous sections of this chapter, our power consumption models require specific input parameters for power consumption estimation to be performed. These parameters must be stored in the AADL specification so that they can be extracted later on by CAT. This is achieved by providing custom AADL library extensions, added to the OSATE design environment when CAT is installed in the Eclipse workbench.
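As a picture of what CAT has to do with these property sets, here is a minimal Python sketch: read the property associations attached to a component instance and turn them into the input parameters of the matching power model. The dictionary stands in for a real AADL instance model; the property names follow Table 1.10 below, and the decoding helper is our own illustration, not part of OSATE.

    # Hypothetical in-memory view of an AADL processor instance and its
    # property associations (names follow the PowerPC 405 property set).
    component = {
        "classifier": "cat::processors::ibm::processor_powerpc405_type",
        "properties": {
            "CAT_Processor_IBM_Properties::PC405_Processor_Bus_Freq_Couple":
                "P300_B100",
            "CAT_Processor_IBM_Properties::Data_Memory_Config": "PLB_SDRAM_CACHE",
            "CAT_Processor_IBM_Properties::Data_Cache_Miss_Rate": 0.1,
        },
    }

    def decode_freq_couple(value):
        """Decode an enumeration literal such as 'P300_B100' into the
        processor and bus frequencies in MHz."""
        proc, bus = value.split("_")
        return int(proc[1:]), int(bus[1:])

    props = component["properties"]
    f_proc, f_bus = decode_freq_couple(
        props["CAT_Processor_IBM_Properties::PC405_Processor_Bus_Freq_Couple"])
    gamma = props["CAT_Processor_IBM_Properties::Data_Cache_Miss_Rate"]
    # f_proc, f_bus and gamma then feed the consumption laws of Sect. 1.4.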

Fig. 1.4 Power estimation results view shown for the selected component in the OSATE editor window

Fig. 1.5 CAT AADL extensions defining families of processor classifiers and their corresponding property sets

Typically, a set of predefined component classifiers with their corresponding property sets is provided (Fig. 1.5). Those classifiers represent the actual components for which power can be estimated. The AADL extends mechanism is used to mark, or type, the model components of interest with one of our component library classifiers. The extension classifiers also provide a place to set default predefined property associations for the component. As an example, we show in Fig. 1.6 the AADL textual representation of a processor component extending our PowerPC 405 library classifier.

processor XUPProcessorType
extends cat::processors::ibm::processor_powerpc405_type
features
  Bus_PLB: requires bus access Bus_OnChip.Bus_PLB;
  Data_jtag: in out data port;
properties
  CAT_Processor_IBM_Properties::PC405_Processor_Bus_Freq_Couple => P300_B100;
end XUPProcessorType;

Fig. 1.6 AADL specification of a PowerPC processor

Table 1.10 Property set for the PowerPC 405

PowerPC 405 property set
Data_Memory_Config: CAT_Processor_IBM_Properties::Memory_Configurations applies to (processor);
Inst_Memory_Config: CAT_Processor_IBM_Properties::Memory_Configurations applies to (processor);
PC405_Processor_Bus_Freq_Couple: CAT_Processor_IBM_Properties::PC405_Supported_Processor_Bus_Frequencies applies to (processor);
PC405_Supported_Processor_Bus_Frequencies: type enumeration (P300_B100, P200_B100, P200_B66, P200_B50, P150_B50, P100_B100, P100_B50);
Data_Cache_Miss_Rate: aadlreal applies to (processor);
Data_Memory_Config: CAT_Processor_IBM_Properties::Memory_Configurations applies to (processor);
Memory_Configurations: type enumeration (OCM_BRAM, PLB_BRAM, PLB_BRAM_CACHE, PLB_SDRAM_CACHE);

The set of properties that is used by the estimation tool actually depends on the processor itself and, more precisely, on its power model. For another processor, another set of specific properties might be necessary, since another set of configuration parameters might apply. The property set of a processor thus comes as part of its power model and, as such, remains separate from the general property set associated with the current AADL working project for the application being designed in the OSATE environment.

Table 1.10 shows the property set for the PowerPC 405 classifier. This processor can be clocked at 100, 150, 200, or 300 MHz and, depending on the processor frequency, the bus (OCM or PLB) frequency can take different values between 25 and 100 MHz. This is modelled as an enumeration type property (PC405_Supported_Processor_Bus_Frequencies) ensuring that only the predefined processor/bus frequency couples can be set.

Tables 1.11, 1.12, and 1.13 respectively show property sets for generic digital signal processors, the TI C62 processor family, and the Altera Stratix EP1S80 FPGA, which we modelled as a system component.

Table 1.11 Property set for digital signal processors

Texas Instruments DSPs property set
Memory_Mode: CAT_Processor_TI_Properties::Memory_Modes applies to (processor);
Memory_Modes: type enumeration (CACHE, FREEZE, BYPASS, MAPPED);

Table 1.12 Property set for the TI C62

TI C62 property set
Processor_Frequency: aadlreal applies to (processor);
Processor_Memory_Mode: CAT_Processor_TI_Properties::Memory_Modes applies to (processor);
Processor_Parallelism_Rate: aadlreal applies to (processor);
Processor_Processing_Rate: aadlreal applies to (processor);
Processor_Cache_Miss_Rate: aadlreal applies to (processor);
Processor_Pipeline_Stall_Rate: aadlreal applies to (processor);
Processor_Memory_Mode_Type: type enumeration (CACHE, FREEZE, BYPASS, MAPPED);
Processor_Parallelism_Rate_Default: constant aadlreal => ;
Processor_Processing_Rate_Default: constant aadlreal => ;
Processor_Cache_Miss_Rate_Default: constant aadlreal => 0.25;
Processor_Pipeline_Stall_Rate_Default: constant aadlreal => ;

Table 1.13 Property set for the Altera Stratix EP1S80 FPGA

Altera Stratix EP1S80 property set
FPGA_Frequency: aadlreal applies to (system);
FPGA_Activity_Rate: aadlreal applies to (system);
FPGA_Occupation_Ratio: aadlreal applies to (system);
FPGA_Activity_Rate_Default: constant aadlreal => 0.4;
FPGA_Occupation_Ratio_Default: constant aadlreal => 0.5;

1.9 Conclusion

We have presented a method to perform power consumption estimations in the component-based AADL design flow. The power consumption of components in the AADL component assembly model is estimated whatever the targeted hardware resource in the AADL target platform model is: a DSP (Digital Signal Processor), a GPP (General Purpose Processor), an FPGA (Field Programmable Gate Array), or a peripheral device such as an Ethernet controller.

A power estimation tool has been developed with a library of multi-level power models for those (hardware) components. These models can be used at different levels in the AADL specification refinement process.

Table 1.14 Maximal errors summary
Component               Max error level 1   Max error level 2   Max error level 3
TI C62                  59%                 57%                 8%
PowerPC 405                                 15.3%               5%
Altera Stratix EP1S80   70%                 44.8%               4.2%

We have currently defined three refinement levels in the AADL flow. At the lowest level, level 3, the (software) component's actual business code is considered and an accurate estimation is performed. This code, written in C or C++ for standard threads, can also be written in VHDL or SystemC for hardware threads. At level 2, the power consumption is estimated only from the component's operating frequency and its architectural parameters (mainly linked to its memory configuration in the case of processors). At level 1, the highest level, only the operating frequency of the component is considered.

Three power models have been presented for the TI C62 DSP, the PowerPC 405 GPP, and the Altera Stratix EP1S80 FPGA. The maximum errors introduced by these models at the three refinement levels are given in Table 1.14. Our approach is, however, not limited to those architectures: it has been used successfully to develop power models for many other processors, some of them even more complex (superscalar, pipelined, VLIW, with floating-point units and L1 and L2 caches): the TI digital signal processors C62, C64, C67, and C55, and the general purpose processors ARM7, ARM9, PowerPC, and XScale (details on these power models can be found in former publications) [16].

We have also considered the effect of the operating system on the power and energy consumption of the whole system. We have noticed that the major sources of consumption are data transfers, and we modeled the most consuming type of data transfer, which is between I/O devices and applicative tasks. We presented the case of Ethernet communications.

In the frame of the SPICES project, our power estimation methodology and power models are being integrated into the Open Source AADL Tool Environment (OSATE) under the name CAT: Consumption Analysis Toolbox. A first prototype of this tool has been released in September.

We are also interested in "modelling in the large" approaches [7] like the Eclipse AM3 project [4] (ATLAS MegaModel Management Tool). In such approaches, we will have a repository of models, metamodels, model transformations, and services, where we can publish bridges to/from AADL. This is the future basis for tool interchange and interoperability. Integrating our consumption analysis tool CAT in such an environment will give it larger visibility and accessibility.

References

1. The SAE AADL Standard Info Site.
2. The SPICES ITEA Project Website.

3. SAE Society of Automotive Engineers. SAE AS5506, v1.0. Embedded Computing Systems Committee, SAE, November.
4. AM3, ATLAS MegaModel Management.
5. H. Balp, É. Borde, G. Haïk, and J.-F. Tilman. Automatic composition of AADL models for the verification of critical component-based embedded systems. In Proc. of the Thirteenth IEEE Int. Conf. on Engineering of Complex Computer Systems (ICECCS), Belfast, Ireland.
6. R. BenAtitallah, S. Niar, A. Greiner, S. Meftali, and J.L. Dekeyser. Estimating energy consumption for an MPSoC architectural exploration. In ARCS06, Frankfurt, Germany.
7. J. Bezivin, F. Jouault, and P. Valduriez. On the need for megamodels. In Proceedings of the OOPSLA/GPCE: Best Practices for Model-Driven Software Development Workshop, 19th Annual ACM Conference on Object-Oriented Programming, Systems, Languages, and Applications.
8. D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In Proc. International Symposium on Computer Architecture ISCA 00, pages 83–94, 2000.
9. P. Coussy, G. Corre, P. Bomel, E. Senn, and E. Martin. High-level synthesis under I/O timing and memory constraints. In ISCAS05, International Symposium on Circuits and Systems, May 2005, Kobe, Japan.
10. S. Dhouib, J.P. Diguet, E. Senn, and J. Laurent. Energy models of real time operating systems on FPGA. In Fourth International Workshop on Operating Systems Platforms for Embedded Real-Time Applications, OSPERT 2008, Prague, July 2–4, 2008.
11. Philips Research. Diesel user manual. Technical Report, Philips Electronic Design and Tools Group, June.
12. N. Dhanwada, I. Lin, and V. Narayanan. A power estimation methodology for SystemC transaction level models. In International Conference on Hardware/Software Codesign and System Synthesis.
13. P.H. Feiler, B.A. Lewis, and S. Vestal. The SAE architecture analysis & design language (AADL). A standard for engineering performance critical systems. In IEEE International Symposium on Computer-Aided Control Systems Design, Munich.
14. W. Huang, M.R. Stan, K. Skadron, K. Sankaranarayanan, S. Ghosh, and S. Velusamy. Compact thermal modeling for temperature aware design. In Proceedings of DAC 2004, June 7–11, San Diego, California, USA, 2004.
15. J. Hugues, B. Zalila, and L. Pautet. Rapid prototyping of distributed real-time embedded systems using the AADL and Ocarina. In Proceedings of the 18th IEEE International Workshop on Rapid System Prototyping (RSP 07). IEEE Computer Society Press, Porto Alegre, 2007.
16. N. Julien, J. Laurent, E. Senn, and E. Martin. Power consumption modeling of the TI C6201 and characterization of its architectural complexity. IEEE Micro, Special Issue on Power- and Complexity-Aware Design.
17. J. Laurent, N. Julien, E. Senn, and E. Martin. Functional level power analysis: an efficient approach for modeling the power consumption of complex processors. In Proc. Design Automation and Test in Europe DATE, Paris, France.
18. I. Lee, H. Kim, P. Yang, S. Yoo, E. Chung, K. Choi, J. Kong, and S. Eo. PowerViP: SoC power estimation framework at transaction level. In Proc. ASP-DAC.
19. M. Loghi, M. Poncino, and L. Benini. Cycle-accurate power analysis for multiprocessor systems-on-a-chip. In Proceedings of the GLSVLSI, Boston, Massachusetts, USA, April.
20. J. Long, J.C. Ku, S.O. Memik, and Y. Ismail. A self-adjusting clock tree architecture to cope with temperature variations. In Proceedings of the 2007 IEEE/ACM International Conference on Computer-Aided Design. IEEE Press, San Jose, 2007.
21. R. Peset Llopis and K. Goossens. The petrol approach to high-level power estimation. In Proceedings of the ISLPED, Monterey, California, USA, August.
22. G. Qu, N. Kawabe, K. Usami, and M. Potkonjak. Function-level power estimation methodology for microprocessors. In Proc. Design Automation Conf. DAC 00, 2000.
23. A.E. Rugina, K. Kanoun, and M. Kaâniche. AADL-based dependability modelling. Technical Report 06209, LAAS, 2006.

24. E. Senn, J. Laurent, N. Julien, and E. Martin. SoftExplorer: Estimating and optimizing the power and energy consumption of a C program for DSP applications. EURASIP Journal on Applied Signal Processing, Special Issue on DSP-Enabled Radio (16).
25. E. Senn, N. Julien, N. Abdelli, D. Elleouet, and Y. Savary. Building and using system, algorithmic, and architectural power and energy models in the FPGA design-flow. In Intl. Conf. on Reconfigurable Communication-Centric SoCs 2006, Montpellier, France, July 2006.
26. F. Singhoff, J. Legrand, L. Nana, and L. Marcé. Scheduling and memory requirements analysis with AADL. In Proceedings of the 2005 Annual ACM SIGAda International Conference on Ada, Atlanta, GA, USA, 2005.
27. S. Steinke, M. Knauer, L. Wehmeyer, and P. Marwedel. An accurate and fine grain instruction-level energy model supporting software optimizations. In Proc. Int. Workshop on Power and Timing Modeling, Optimization and Simulation PATMOS 01, 2001.
28. TMS320C6x User's Guide. Texas Instruments Inc.
29. V. Tiwari, S. Malik, and A. Wolfe. Power analysis of embedded software: a first step towards software power minimization. IEEE Trans. VLSI Systems, 2(4):437–445, 1994.
30. T. Vergnaud. Modélisation des systèmes temps-réel embarqués pour la génération automatique d'applications formellement vérifiées. PhD thesis, École Nationale Supérieure des Télécommunications de Paris, France.
31. W. Ye, N. Vijaykrishnan, M. Kandemir, and M. Irwin. The design and use of SimplePower: a cycle accurate energy estimation tool. In Proc. Design Automation Conference DAC 00, June 2000.
32. K. Yaghmour and M.R. Dagenais. Measuring and characterizing system behavior using kernel-level event logging. In 2000 USENIX Annual Technical Conference, USENIX, San Diego, CA, USA, June 18–23, 2000.

Chapter 2
MARTE vs. AADL for Discrete-Event and Discrete-Time Domains

Frédéric Mallet and Robert de Simone

Abstract  Real-time embedded applications tend to combine periodic and aperiodic computations. Modeling standards must then support both discrete-time and discrete-event models of computation and communication, whereas these historically pertain to two different communities: asynchronous and synchronous designers. In this article, two emerging standards of the domain (MARTE and AADL) are compared and their ability to tackle this issue is assessed. We plead for combining both standards and show how MARTE can be extended to integrate the AADL features required for end-to-end flow latency analysis.

Keywords  UML · MARTE · AADL · MoCC · Time requirement

2.1 Introduction

Embedded applications often combine aperiodic (or sporadic) and periodic computations. In the automotive industry, this has led to bus standards like FlexRay or TT-CAN [10] that combine event-triggered messages (for aperiodic computations) with time-triggered messages (for periodic computations). In the avionics industry, an application generally mixes aperiodic events (e.g., interactions with the pilot, switching between air/ground modes) together with periodic events when updating the system (e.g., fuel quantity, system data). On the one hand, time-triggered approaches enhance predictability by reducing latency jitters and provide higher dependability by making it easier to detect missed messages or illegal accesses to the bus. On the other hand, event-triggered systems are more flexible, support configuration changes without a complete redesign, and adapt faster to asynchronous events.

In electronic design automation (EDA), event-driven simulators (like those for VHDL or Verilog) provide a large flexibility and support the design of both synchronous and asynchronous architectures. Cycle-based simulators, however, offer better performance provided that architectures are mainly synchronous. In the EDA, avionics, and automotive industries, designers need models able to describe and compose these two communication models. The main point is that, even

F. Mallet, Aoste Project I3S-INRIA, INRIA Sophia Antipolis Méditerranée, Université de Nice Sophia Antipolis, Sophia Antipolis Cedex, France
e-mail: Frederic.Mallet@sophia.inria.fr

in time-triggered sampled communications, the propagation of data in logically instantaneous communications introduces specific phenomena akin to event-based communication features. Indeed, the consumer must wait for data availability to start computing on it. This may introduce well-known problems of priority inversion when component blocks have to be executed atomically. All these phenomena deserve careful semantic treatment to be handled correctly, which is the true essence of this work.

Considering the large number of actors in the design of very large systems (or even systems of systems), standard-based approaches are required to provide interoperability between models and to cover the whole design flow, from system requirements to code generation. These models must be precise enough to support various analyses at different refinement levels. We focus here on two particular standards: AADL (Architecture Analysis & Design Language) [12], standardized by the Society of Automotive Engineers, and the UML (Unified Modeling Language) profile for MARTE (Modeling and Analysis of Real-Time and Embedded systems) [14], recently adopted by the Object Management Group (OMG). Both standards focus on the modeling and analysis of embedded systems. Both offer constructs to model the application and the execution platform, and to allocate the former onto the latter.

The expressiveness of MARTE and AADL is compared and their ability to combine periodic and aperiodic computations is assessed. Concerning MARTE, the discussion focuses on its Time Model [2], specifically devised to specify timed domains of computation and communication in a formal way. This is the continuation of some of our previous work [3, 9] comparing both formalisms. An AADL example [6], which comes with its implementation, is used for the comparison. In the selected example, several threads, periodic or not, are connected through event, data, or event data ports. The combination of various parameters induces either asynchronous or sampled communications. The AADL two-layered model is compared to our three-layered UML-based approach. The latter gives more flexibility and avoids mixing different levels in the same model.

Rather than opposing the two languages, we have investigated how the two standards can be combined and gateways can be created. Indeed, a subset of MARTE can be combined with AADL to cover a larger scope than the one currently covered by AADL, thus benefiting AADL users. Such a combination would also benefit MARTE users because some of their models could then be analyzed by existing AADL tools.

Section 2.2 gives an overview of the MARTE time model. Section 2.3 introduces AADL and shows how MARTE is used to model the AADL constructs that address the two communication schemes under focus. Section 2.4 presents MARTE models for three configurations of an AADL-inspired example.

2.2 MARTE Time Model

Time and time-related concepts of the UML profile for MARTE have been previously described [2]. This section recalls the main definitions.

2.2.1 Definitions

In MARTE, time can be physical and considered as continuous, or discretized. It can also be logical and related to user-defined clocks. Time may even be multiform, allowing different times to progress in a non-uniform fashion, and possibly independently of any (direct) reference to physical time. The interest of tackling multiform time has been demonstrated in synchronous languages [4].

The MARTE Time subprofile, inspired by the theory of tag systems [8], provides a set of general mechanisms to define models of computation and communication (MoCC). These modeling aspects should be hidden from end-users but not from model architects. This work intends to build a MoCC suitable for AADL.

The time structure is defined by a set of clocks and relations on these clocks. Here, a clock is not a device used to measure the progress of physical time. It is rather a mathematical object lending itself to formal processing. Clocks referring to physical time are called chronometric clocks. A distinguished chronometric clock named idealClk is provided as part of the MARTE time library. This clock represents the ideal physical time used, for instance, in the laws of physics and mechanics. At the design level, most of the clocks are logical. For instance, we consider the processor cycle or the bus cycle as being logical clocks. Making a distinction between chronometric and logical clocks is important. For chronometric clocks, a metric is associated with instant labels, thus making the distance between two instants relevant. For logical clocks, the distance between two successive instants is irrelevant.

More precisely, a Clock is an ordered set of instants I with a quasi-order relation ≺ on I, named strict precedence; ≺ is a total, irreflexive, and transitive binary relation on I. A discrete-time clock is a clock with a discrete set of instants I. A Time Structure is a set of clocks C with a binary, reflexive, and transitive relation ≼, named precedence; ≼ is a partial order relation on the set of all instants of all clocks within a time structure. From ≼ we derive another instant relation named coincidence (≡). Clocks are independent of each other unless some instant relations are imposed. To impose many or infinitely many instant relations at once, clock relations are used. A comprehensive description of clock relations is available as a research report [1]; we focus here on those required to represent event-triggered and time-triggered communications.

Connections between UML model elements and MARTE clock relations are made through the stereotype ClockConstraint, which extends the metaclass UML::Constraint. The language to be used on these clock constraints is called the Clock Constraint Specification Language (CCSL). It is defined as an annex of the MARTE specification, and its formal semantics is briefly introduced separately [11].

2.2.2 Event-Triggered Communications

Two clocks (t_s, t_f) are associated with each task t; the first one contains the instants at which the task starts and the other the instants at which it finishes.

Fig. 2.1 Clock relation alternatesWith

A task cannot end before having started, and every time a task starts it must finish, in one way or another (normal ending, preemption, interruption). The clock relation alternatesWith can represent this causality relation between t_s and t_f. Equation (2.1) denotes that every ith instant of t_s strictly precedes the ith instant of t_f, which in turn (weakly) precedes the (i+1)th instant of t_s. This relation is not symmetrical and does not assume that task t is periodic.

t_s alternatesWith t_f     (∀i: t_s[i] ≺ t_f[i] ≼ t_s[i+1])      (2.1)
t1_f alternatesWith t2_s   (∀i: t1_f[i] ≺ t2_s[i] ≼ t1_f[i+1])   (2.2)

Alternation is a very general relation and can also represent an event-triggered communication from a task t1 to another task t2, Eq. (2.2). Task t1 runs to completion and sends an event that triggers the execution of t2. Figure 2.1 illustrates the alternation relation. Horizontal lines represent the clocks and their instants. Dashed arrows with a filled triangle as arrowhead are strict precedence relations, whereas arrows with a hollow triangle as arrowhead are (weak) precedence relations. The precedence relations are directly induced by Eq. (2.2). In that example, the termination of t1 asynchronously triggers the start of t2. Note that we only have partial orders, i.e., no instant relation is induced between the start or end of t2 and the next start of t1.

2.2.3 Time-Triggered Communications

With time-triggered communications, the data is sampled from a buffer according to a triggering condition. The clock relation sampledOn is used to represent sampling, and the triggering condition is given by the instants of a clock. Following our previous example, we replace Eq. (2.2) by Eq. (2.3); clk is the sampling condition, i.e., the triggering clock.

t2_s = t1_f sampledOn clk    (2.3)

Figure 2.2 illustrates the use of the clock relation sampledOn. It does not show the start of t1 since it is not relevant here. The start of task t2 is precisely given by the sampling clock clk; however, some events may be missed if the sampling clock is not fast enough. Vertical lines denote coincidence relations.
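To make these two relations more tangible, the following Python sketch (an illustration, not the TimeSquare implementation) represents clocks by the dates of their instants on finite prefixes, checks the alternation of Eqs. (2.1)–(2.2), and derives the instants of a sampledOn expression as in Eq. (2.3).

def alternates_with(left, right):
    """Check left[i] < right[i] <= left[i+1] on finite prefixes (Eqs. 2.1-2.2)."""
    return all(
        l < r and (i + 1 >= len(left) or r <= left[i + 1])
        for i, (l, r) in enumerate(zip(left, right))
    )

def sampled_on(trigger, clk):
    """Instants of 'trigger sampledOn clk': the next tick of clk at or after
    each triggering instant; triggers in the same period collapse (Eq. 2.3)."""
    return sorted({min(t for t in clk if t >= d) for d in trigger if d <= max(clk)})

t1_f = [1.0, 3.2, 5.1]              # instants at which t1 finishes
t2_s = [1.4, 3.9, 5.5]              # instants at which t2 starts
print(alternates_with(t1_f, t2_s))  # True: a valid event-triggered chaining
clk = [2.0, 4.0, 6.0, 8.0]          # periodic sampling clock
print(sampled_on(t1_f, clk))        # [2.0, 4.0, 6.0]: sampled starts of t2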

Fig. 2.2 Clock relation sampledOn

2.2.4 Periodic Tasks and Physical Time

Logical clocks are infinite sets of instants, but we do not assume any periodicity, i.e., the distance between successive instants is not known. The relation discretizedBy is used to discretize idealClk, a dense chronometric (related to physical time) and perfect (with no jitter or any other flaw) clock. Equation (2.4) creates, as an example, a 100 Hz clock.

c_100 = idealClk discretizedBy 0.01    (2.4)

Equation (2.4) states that the distance (duration) between two successive instants of clock c_100 is 0.01 s. The unit second (s) is implied by the use of idealClk.

2.2.5 TimeSquare

TIMESQUARE is a software environment for modeling and analyzing timed systems. It supports an implementation of the Time Model introduced in MARTE and its companion Clock Constraint Specification Language (CCSL). TIMESQUARE displays possible time evolutions (solutions to the clock constraint specification) as waveforms generated in the VCD format [7]. The VCD format has been chosen because it is an IEEE standard defined as part of the Verilog language and, as such, is often used in EDA. TIMESQUARE is available online.

2.3 AADL Modeling Elements

AADL supports the modeling of application software components (thread, subprogram, process), execution platform components (bus, memory, processor, device), and the binding of software onto the execution platform. Each model element (software or execution platform) must be defined by a type and comes with at least one implementation. Initially, there were plans to create a specific UML profile for AADL. However, the emerging profile for MARTE is now expected to be the basis for the UML representation

of AADL models [5]. The adopted MARTE specification provides guidelines in this direction. The main goal of this contribution is to further investigate how specific AADL concepts required for end-to-end flow latency analysis can be represented in MARTE. As such, this work is not (yet?) included in the official OMG specification.

2.3.1 AADL Application Software Components

Threads are executed within the context of a process; therefore, each process implementation must specify the number of threads it executes and their interconnections. Type and implementation declarations also provide a set of properties that characterize model elements. For threads, AADL standard properties include the dispatch protocol (periodic, aperiodic, sporadic, background), the period (if the dispatch protocol is periodic or sporadic), the deadline, and the minimum and maximum execution times, along with many others.

Fig. 2.3 MARTE model library for AADL threads

We have created a UML library (see Fig. 2.3) to model AADL application software components [9]. Only the elements of our library concerning periodic and aperiodic threads are shown. AADL threads are modeled using the stereotype SwSchedulableResource from the MARTE Software Resource Modeling subprofile. Its meta-attributes deadlineElements and periodElements explicitly identify the actual properties used to represent the deadline and the period. Using a meta-attribute of type Property avoids a premature choice of the type of such properties. This makes it easier for the transformation tools to be language and domain independent. In our library, the MARTE type NFP_Duration is used as an equivalent for the AADL type Time.
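As a rough illustration of the thread properties this library captures, the following hypothetical Python sketch mirrors them as a plain record; the field names are illustrative, not the normative AADL or MARTE property names.

from dataclasses import dataclass
from typing import Optional

@dataclass
class ThreadProperties:
    """Illustrative mirror of the AADL thread properties used by the library."""
    dispatch_protocol: str               # "periodic", "aperiodic", "sporadic", "background"
    deadline_ms: float
    min_execution_time_ms: float
    period_ms: Optional[float] = None    # required for periodic/sporadic threads

    def __post_init__(self):
        if self.dispatch_protocol in ("periodic", "sporadic") and self.period_ms is None:
            raise ValueError("periodic and sporadic threads require a period")

t2 = ThreadProperties("periodic", deadline_ms=8.0, min_execution_time_ms=2.0, period_ms=10.0)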

Fig. 2.4 The example in AADL

2.3.2 AADL Flows

AADL end-to-end flows explicitly identify a data stream from sensors to the external environment (actuators). Figure 2.4 shows an example previously used [6] to discuss flow latency analysis with AADL models. This flow starts from a sensor (Ds, an aperiodic device instance) and sinks in an actuator (Da, also aperiodic) through two process instances. The first process executes the first two threads, and the last thread is executed by the second process. The two devices are part of the execution platform and communicate via a bus (db1) with two processors (cpu1 and cpu2), which host the three processes with several possible bindings. The processes may all be executed by the same processor, or in any other combination. One possible binding is illustrated by the dashed arrows. The component declarations and implementations are not shown. Several configurations deriving from this example are modeled with MARTE and discussed in Sect. 2.4.

2.3.3 AADL Ports

There are three kinds of ports: data, event, and event data. Data ports are for data transmissions without queueing. Connections between data ports are either immediate or delayed. Event ports are for queued communications. The queue size may induce transfer delays that must be taken into account when performing latency analysis. Event data ports are for message transmissions with queueing; here again, the queue size may induce transfer delays. In our example, all components have data ports, represented as a solid triangle. We have omitted the ports of the processes since they are required to be of the same type as the connected port declared within the thread declaration and are therefore redundant.

UML components are linked together through ports and connectors. No queues are specifically associated with connectors.

The queueing policy is better represented on a UML activity diagram, which models the algorithm. Activities are made of actions. The execution sequence is given by the control flow. Data communications between the actions are represented with object flows. In UML, by default, an object flow has a queue, the size of which can be parameterized with its property upperBound. So object flows can be used to represent both event and event data AADL communication links. UML allows the specification of a customized selection policy to select which token is read among the ones stored in the object node. Unfortunately, the selection behavior is only allowed to select one single token, making it impossible to represent the AADL dequeue protocol AllItems. This protocol dequeues all items from the port every time the port is read. Therefore, only the dequeue protocol OneItem is supported.

To model data ports, UML provides «datastore» object nodes. On these nodes, tokens are never consumed, thus allowing for multiple readings of the same token. Using a data store node with an upper bound equal to one is a good way to represent AADL data port communications.

2.4 Three Different Configurations

This section illustrates the use of MARTE on three different configurations of the AADL example. First, we address a case where all threads are aperiodic. Then, we consider a mixed periodic/aperiodic case. We finish with a case where all threads are periodic and harmonic.

2.4.1 The Aperiodic Case

We rely on a model with three layers (see Fig. 2.5), where each layer denotes a different aspect of the system. The top-most layer represents the algorithm, i.e., the different actions to be executed, and includes the data and control flow. The algorithm gives causal relations among the actions. All communications are through event data ports with infinite queues, represented as object nodes. The two actions acquire and release model the behavior of the two devices. The middle layer is a composite structure diagram that models the AADL software components and represents the actual configuration under study. Here, all threads are aperiodic, and therefore the classifier AperiodicThread defined in Fig. 2.3 is used. The bottom layer represents the execution platform.

This layer-oriented approach significantly differs from the AADL model, where all the layers are combined. AADL models do not consider the pure applicative part but rather merge it either within the second or the third level (compare with Fig. 2.4). Layers can be changed independently of the others, which gives great flexibility. If required, new layers can be added to model virtual machines or middleware, for instance.

Fig. 2.5 MARTE model, fully aperiodic case

The AADL binding mechanism is equivalent to the MARTE allocation. Actions and object nodes are allocated (dashed arrows) to software components.

All threads are aperiodic; therefore, all communications are asynchronous and we only use the clock relation alternatesWith (Eqs. (2.5)–(2.8)).

Ds alternatesWith T1_s     (∀i: Ds[i] ≺ T1_s[i] ≼ Ds[i+1])       (2.5)
T1_f alternatesWith T2_s   (∀i: T1_f[i] ≺ T2_s[i] ≼ T1_f[i+1])   (2.6)
T2_f alternatesWith T3_s   (∀i: T2_f[i] ≺ T3_s[i] ≼ T2_f[i+1])   (2.7)
T3_f alternatesWith Da     (∀i: T3_f[i] ≺ Da[i] ≼ T3_f[i+1])     (2.8)

These clock relations are extracted using model-driven engineering techniques and fed into TIMESQUARE. TIMESQUARE checks the consistency of the relations and proposes, when possible, one execution conformant to the clock relations (see Fig. 2.6).

Fig. 2.6 VCD result, all aperiodic case

Fig. 2.7 Timing diagrams, all aperiodic case

With TIMESQUARE, the VCD output is annotated to embed clock relations so that instant relations are displayed. Dashed arrows denote precedences. The transformation could also target other analysis tools like, for instance, Cheddar [13], often used with AADL models.

After the analysis, it is important to bring the results back into the UML model. The model elements closest to VCD waveforms are UML timing diagrams. By combining the two clocks relative to each task (e.g., t1_s and t1_f), the whole information relative to the task itself (e.g., t1) is built. The result, illustrated in Fig. 2.7, represents a family of possible schedules for a given execution flow and a given application/execution platform pair. Computation execution times (thick horizontal lines) equal the latency for devices and range between the MinimumExecutionTime (MinET) and the Deadline for threads.

Fig. 2.8 MARTE model, mixed case

Fig. 2.9 VCD result, mixed case

2.4.2 The Mixed Event Data Flow Case

We study here a second configuration that only differs by making thread t2 periodic (Fig. 2.8). Only the second layer of our model needs to be modified, whereas with AADL the whole model has to be rebuilt. The communication from step1 to step2 becomes a sampled one. In CCSL, Eq. (2.6) is replaced by Eqs. (2.9)–(2.10), where P is the sampling period of t2.

clk = idealClk discretizedBy P    (2.9)
t2_s = t1_f sampledOn clk         (2.10)

The new simulation run processed by TIMESQUARE is shown in Fig. 2.9. Vertical plain lines with diamonds represent coincidence relations on instants. We also get a different timing diagram (see Fig. 2.10). Oblique lines linking two computation lines represent communications and sampling delays. For sampled communications, this amounts to waiting for the next tick of the receiver clock. The maximal sampling delay occurs when the communication waits for the full sampling period because the previous tick has just been missed. Oblique lines are not normative in UML timing diagrams, but they are a convenient notation to represent intermediate communication states between two steady processing states (e.g., between t1 and t2).
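The sampling delay just described can be made concrete with a small numeric sketch (illustrative only): a value produced at date t waits for the next tick of the receiver's periodic clock, and the delay approaches a full period when the previous tick has just been missed.

import math

def sampling_delay(t, period):
    """Delay until the next tick of a periodic sampling clock of the given period."""
    return math.ceil(t / period) * period - t

print(sampling_delay(10.1, 10.0))  # ~9.9: the previous tick was just missed
print(sampling_delay(19.9, 10.0))  # ~0.1: the next tick is almost immediate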

Fig. 2.10 Timing diagram, mixed case

Additionally, on this representation it is easy to compute latencies for a given flow. For this configuration, and assuming, as in [6], that the sampling delays are always maximal, we get the formulas given in the following equations.

Latency_worst-case = Ds.latency + Σ_i (t_i.deadline) + t2.period + Da.latency    (2.11)

Latency_best-case = Ds.latency + Σ_i (t_i.MinET) + t2.period + Da.latency    (2.12)

Latency_jitter = Σ_i (t_i.deadline − t_i.MinET)    (2.13)

The jitter (Eq. (2.13)) is identical to the fully asynchronous case, even though, due to the synchronization, the best-case and worst-case latencies are increased.
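A short numeric sketch of Eqs. (2.11)–(2.13) follows; the latency and time values (in milliseconds) are hypothetical and serve only to show how the three quantities are assembled.

# Worked sketch of Eqs. (2.11)-(2.13); all numeric values are hypothetical.
ds_latency, da_latency = 1.0, 1.0              # device latencies (ms)
deadlines = {"t1": 4.0, "t2": 5.0, "t3": 4.0}  # thread deadlines (ms)
min_ets   = {"t1": 1.0, "t2": 2.0, "t3": 1.0}  # minimum execution times (ms)
t2_period = 10.0                               # sampling period of t2 (ms)

worst  = ds_latency + sum(deadlines.values()) + t2_period + da_latency
best   = ds_latency + sum(min_ets.values()) + t2_period + da_latency
jitter = sum(deadlines[t] - min_ets[t] for t in deadlines)

print(worst, best, jitter)  # 25.0 16.0 9.0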

2.4.3 The Periodic Case

Finally, we address the case where all threads are periodic but not fully synchronous. Thread t2 is twice as slow as threads t1 and t3, i.e., its period is twice as large. All threads being periodic, their deadline is assumed to be smaller than their period. Only the second layer needs to be replaced by a new configuration where all threads are periodic. The timing diagram obtained with this configuration is shown in Fig. 2.11.

Fig. 2.11 Timing diagram, fully periodic case

The first three communications in the flow (from acquire to step1, from step1 to step2, and from step2 to step3) are sampled communications. The last one (from step3 to release) is data-driven, since the actuator Da is aperiodic. In this last configuration, as expected, becoming synchronous makes the system more predictable, since the latency jitter is much smaller (Eq. (2.16)). However, both the best-case and worst-case latencies are bigger than in the two previous cases.

Latency_worst-case = Ds.latency + t1.period + t2.period + t2.period + t3.deadline + Da.latency    (2.14)

Latency_best-case = Ds.latency + t1.period + t2.period + t2.period + t3.MinET + Da.latency    (2.15)

Latency_jitter = t3.deadline − t3.MinET    (2.16)

2.5 Conclusion

AADL offers many features that are important for modeling and analyzing the computations and communications of embedded systems. However, combining all these features without a guideline (not part of the standard) can lead to models that are meaningless and impossible to analyze. We have shown how the MARTE Time model can be used to obtain the same expressiveness with fewer modeling concepts. More generally, MARTE and its time model can be used to model various timed models of computation and communication.

It is important to have specifications free, as much as possible, of implementation choices (platform independent models). To achieve this goal, we need model

elements of a higher level of abstraction than AADL threads. AADL two-level models assume that part of the application has already been allocated to a software execution platform made of threads. Our approach makes such an allocation explicit when required and allows alternative solutions. We propose to use UML activities for that purpose. Making a link to the software execution platform (runtime executive) is not a refinement but rather an allocation. The former implies models of the same nature, whereas the latter makes links between models of different natures.

Additionally, rather than building a specific graphical editor for AADL models, it is more cost-effective to customize existing editors, give them the right semantics, and perform model transformations towards analysis tools. For now, the graphical customization supported by profiling tools is limited and big improvements are required. However, making AADL UML-friendly paves the way to interoperability with several other modeling standards like SysML [15], which is appropriate for modeling at the system level.

Glossary

AADL  Architecture Analysis and Design Language, standardized by SAE
EDA  Electronic Design Automation
MARTE  Modeling and Analysis of Real-Time and Embedded systems
MET/MinET  Minimum Execution Time
MoCC  Model of Computation and Communication
NFP  Non-Functional Property
OMG  The Object Management Group
SAE  Society of Automotive Engineers
TT-CAN  Time-Triggered Controller Area Network
UML  The Unified Modeling Language, adopted by the OMG
VCD  Value Change Dump
VHDL  Very high speed integrated circuit Hardware Description Language

References

1. C. André and F. Mallet. Clock constraints in UML MARTE CCSL. Research Report 6540, INRIA, May 2008.
2. C. André, F. Mallet, and R. de Simone. Modeling time(s). In MoDELS 07, LNCS 4735. Springer, Berlin, 2007.
3. C. André, F. Mallet, and R. de Simone. Modeling of AADL data-communications with UML MARTE. In E. Villar, editor, Embedded Systems Specification and Design Languages, Selected Contributions from FDL 07, LNEE 10. Springer, Berlin.
4. A. Benveniste, P. Caspi, S.A. Edwards, N. Halbwachs, P. Le Guernic, and R. de Simone. The synchronous languages 12 years later. Proceedings of the IEEE, 91(1):64–83, 2003.

5. M. Faugère, T. Bourbeau, R. de Simone, and S. Gérard. MARTE: Also a UML profile for modeling AADL applications. In ICECCS. IEEE Comput. Soc., Los Alamitos.
6. P.H. Feiler and J. Hansson. Flow latency analysis with the architecture analysis and design language. Technical Report CMU/SEI-2007-TN-010, CMU, June 2007.
7. IEEE Standards Association. IEEE Standard for Verilog Hardware Description Language. IEEE Std 1364-2005, Design Automation Standards Committee, 2005.
8. E.A. Lee and A.L. Sangiovanni-Vincentelli. A framework for comparing models of computation. IEEE Transactions on CAD of Integrated Circuits and Systems, 17(12):1217–1229, 1998.
9. S.-Y. Lee, F. Mallet, and R. de Simone. Dealing with AADL end-to-end flow latency with UML MARTE. In ICECCS. IEEE Comput. Soc., Los Alamitos.
10. G. Leen and D. Heffernan. TTCAN: a new time-triggered controller area network. Microprocessors and Microsystems, 26(2):77–94, 2002.
11. F. Mallet, C. André, and R. de Simone. CCSL: specifying clock constraints with UML MARTE. ISSE, 4(3), 2008.
12. SAE. Architecture analysis and design language (AADL). AS5506/1, www.sae.org.
13. F. Singhoff and A. Plantec. AADL modeling and analysis of hierarchical schedulers. In A. Srivastava and L.C. Baird III, editors, SIGAda. Assoc. Comput. Mach., New York.
14. The ProMARTE Consortium. UML profile for MARTE, beta 2. OMG document number ptc/, Object Management Group.
15. T. Weilkiens. Systems Engineering with SysML/UML: Modeling, Analysis, Design. The MK/OMG Press, Burlington, 2008.

Chapter 3
Generation of MARTE Allocation Models from Activity Threads

Andreas W. Liehr, Klaus J. Buchenrieder, Heike S. Rolfs and Ulrich Nageldinger

Abstract  UML and specialized profiles, such as MARTE, are established specification and modeling procedures in the system development process. While language-based system specification and resource modeling shortens the design cycle, the exploration of the design space is time-consuming. The most expensive step is the generation of system models, i.e., of architectural alternatives for subsequent exploration. This work contributes a method that utilizes activity threads to reduce the effort needed to build such a set. With this method, a group of system models, each representing one design alternative, can be generated automatically. To this end, only one architecture model and one function model, in combination with an activity thread, are required. The proposed method is a first step towards the automated comparison of the performance of design alternatives at an early stage in the development process.

Keywords  Component-based system modeling · Design-space exploration · Unified modeling language · UML · MARTE

3.1 Introduction

In the early development stages of embedded systems, fundamental decisions concerning the system architecture and a proper hardware/software partition must be made. These choices determine, e.g., power consumption, performance, and other system features. For this reason, a method plus supporting tools for fast and efficient specification and automated evaluation of system design alternatives is required. The simulation and analysis of the performance- and power-relevant parts of the system is a way to gather the information on which alternative best fits the constraints for

This work has been supported within a subcontract between Infineon Technologies AG and the Universität der Bundeswehr München. This contract is part of the project Verteilte integrierte Systeme und Netzwerkarchitekturen für die Applikationsdomänen Automobil und Mobilkommunikation (VISION), 01 M.

A.W. Liehr, Fakultät für Informatik, Universität der Bundeswehr München, Neubiberg, Germany
e-mail: andreas.liehr@unibw.de

Fig. 3.1 Utilizing activity threads for the mapping of application to resources

its realization. To evaluate such system alternatives, each of them has to be specified in a modeling language fitting this purpose. The MARTE profile [18] for UML [17] provides the features required for system modeling. The specification of the architecture is carried out with the Hardware Resource Modeling (HRM) mechanism of the MARTE specification. The Real-Time Execution (RTE) Model of Computation and Communication allows for a specification of the behavior with real-time constraints. The mapping of behavior to architecture can be achieved as declared in the Allocation Modeling (Alloc) section of the MARTE profile. The MARTE profile provides mechanisms to enrich the models with physical, logical, and functional information. These enable the generation of simulation models for performance and power estimation.

Using this approach to support the decision making of hardware/software codesign requires one set of models for every design alternative. Since the construction of alternatives is computationally expensive, we disclose a solution that drastically reduces manual interaction. In this work, we present a method to build a set of MARTE allocation models, each representing one design alternative for a hardware/software system. As input to this method, only one MARTE hardware resource model and a single real-time execution model must be specified. The differing approaches of spatial allocation from function to architecture are declared with an activity thread, as presented by Liehr and Buchenrieder [10]. As a result, one application model per design alternative is built, as depicted in Fig. 3.1. This approach speeds system development, in that system models and design alternatives can be generated much faster.

This chapter is organized as follows. Section 3.2 discusses the relationship between this approach and related research activities. Section 3.3 introduces the system specification process based on the UML MARTE profile. Section 3.4 presents activity threads and illustrates their application in the process of hardware/software codesign. In Sect. 3.5, the utilization of activity threads for design-space exploration with UML MARTE system models is demonstrated with an example. Section 3.6 provides detailed information concerning the prototypic implementation

and the choice of tools. It is followed by an example application of our approach. We conclude with a summary and an outlook.

3.2 Related Work

In the early 1980s, researchers recognized the potential of system evaluation methods that utilize formal models of computer-based systems. It became clear that such methods speed up the system development process and reduce cost, especially by reducing the demand for prototyping and testing of the real-life system. An early approach of this technique is the Software Performance Engineering method, introduced by Smith [21]. This method consists of two models: (1) a software execution model, expressed with execution graphs, representing the software behavior; and (2) a system execution model, based on queuing networks, that describes the system behavior. The input data of the second model is a combination of the results from the software execution model and information about the system hardware. Extensions of this approach have been published by Cortellessa et al. [2, 3]. Using such methods requires broad knowledge in the field of system architecture and the definition of formal graph-based models.

In the 1990s, a multitude of pattern-based approaches emerged. These frameworks, in which developers define system models from predefined and reusable system patterns, rely on executable simulation models. While such approaches improve the efficiency of the modeling process through their prepackaged nature, their versatility suffers from the fact that pattern definitions are required for every system part. Therefore, the pool of available patterns has to be extended each time novel architecture components emerge. Approaches of this method have been published, inter alia, by Petriu et al. [19, 20], Balsamo et al. [1], and Liehr and Buchenrieder [10].

With the increasing dissemination of the Unified Modeling Language (UML), system design approaches utilizing UML not only for software but also for hardware specification gained popularity. At the turn of the century, UML was already well known by most system developers. Computer-based systems were expressed with UML and annotated with information for the system evaluation process. Subsequently, executable models for the purpose of system evaluation were built from the UML descriptions without further user interaction. Representatives of this approach were introduced by Kabajunga and Pooley [8], Kähkipuro [9], Gu and Petriu [6, 7], and Woodside et al. [23].

UML was initially developed to support a team-based software development process and the modeling of relations in large projects utilizing object-oriented programming paradigms. Therefore, plain UML was of limited value for modeling holistic systems. Concurrent approaches to build more convenient UML dialects for specification demands in the field of architecture modeling were brought forward. The UML Profile for Schedulability, Performance, and Time (SPT) and the OMG Systems Modeling Language (SysML) gained broader acceptance within the research community and among developers of system modeling tools.

The OMG UML Profile for Schedulability, Performance, and Time specifies a UML profile that defines standard paradigms for modeling the time-, schedulability-, and performance-related aspects of real-time systems. It (1) enables the construction of models to make quantitative predictions; (2) facilitates communication of design intent between developers in a standard way; and (3) permits interoperability between various analysis and design tools [15].

The OMG Systems Modeling Language is a general-purpose graphical modeling language for specifying, analyzing, designing, and verifying complex systems that may include hardware, software, information, personnel, procedures, and facilities. In particular, the language provides graphical representations with a semantic foundation for modeling system requirements, behavior, structure, and parametrics, which is used to integrate with other engineering analysis models [16].

To add capabilities for the model-driven development of real-time and embedded systems to UML, the UML profile for Modeling and Analysis of Real-time and Embedded systems (MARTE) was introduced. It provides support for the specification, design, and verification and validation stages of the system development process. MARTE is intended to replace the UML SPT profile [18].

Researching enhancements to the system modeling process with UML MARTE that address the efficient specification of a set of competing design approaches for one computer-based system is just the next logical step towards the development of even more user-friendly system modeling tools.

3.3 Building System Models with MARTE

The modeling of computer systems is vital in the initial stages of system development, because design alternatives can be compared and rated, and decisions can be made. This shortens the development time and lowers the development effort. The UML MARTE profile extends UML with a detailed hardware resource model. In this model, the logical view classifies the hardware resources with respect to functionality, and the physical view covers the physical properties of the hardware resources. MARTE adopts the Y-model as presented by Dumolin et al. [4] and constitutes the system model with three design views [22]:

- Application model, specifying the system functionality
- Resource model, representing the execution platform
- Allocation model, mapping the function to the architecture

In this work, UML activity diagrams are utilized as the application model. MARTE provides UML stereotypes to include constraints for real-time execution modeling, as illustrated by Frederic et al. [5]. While this representation of the application is not sufficient for the performance simulation process, it is adequate for the demonstration of our approach.

As resource model representation, we use UML composite structure diagrams. The components of these diagrams are extended with stereotypes from the logical view and the physical view of the MARTE HRM profile.

Although the adaptation of the Y-model by MARTE consists of three models (General Resource Model, Software Resource Model, and Hardware Resource Model), only the HRM is needed to demonstrate our method.

The third component of the Y-model is the allocation of function to architecture. In this work, the allocation is realized with UML component diagrams in the context of MARTE Allocation Modeling (Alloc). Using activity threads to compose a set of such allocation models modifies the Y-model for system specification as illustrated in Fig. 3.1.

3.4 Utilizing Activity Threads for Design-Space Exploration

In previous work, we presented a method for performance prediction that utilizes UML-based software representations, pattern-based hardware models, and activity threads that map functional components to the hardware architecture for the generation of performance simulation models [10, 11].

The hardware model appointed in this approach defines the system hardware architecture for a computer system under construction. The choice of a pattern-based approach fosters the fast assembly of the model from vendor-defined hardware components, which contain performance and interface information and are stored in a hardware pattern database. The hardware model is a superset of the required hardware for all system design approaches to be evaluated. Depending on the considered design approach, only a subset of the whole hardware model is taken into account.

The software model contains the performance-related information concerning the software components of the computer system to be designed. The control flow of the intended functionality of the system is represented as a UML activity diagram. The activities of this diagram are enriched with performance information needed for the composition of the performance simulation model.

The functional descriptions from the software model are linked to hardware modules by activity threads. Activity threads are realized as graphs with a start node and an end node. Every path from the start node to the end node contains intermediate nodes, each denoting an activity from the activity diagram which represents the software. For every activity from the software model, exactly one node in every path of the activity thread must exist. Such a path represents one design method for evaluation. The number of differing paths is the number of system models to build. The specification of ATs can be achieved with UML activity diagrams, which allows a specification with off-the-shelf UML modeling tools. After the user has supplied the three models, the system model generation is carried out without user interaction.

In the original work, this method results in a set of Extended Queuing Network models (EQNs), representing the related design approaches. The simulation of the EQNs is realized with normative simulation tools or an XML-based EQN simulator [12]. The usage of EQNs for the simulation models allows for the simulation of concurrent systems with shared resources, such as busses or memory.

Fig. 3.2 The architecture composite structure diagram with MARTE extensions

3.5 Generating MARTE Allocation Models with Activity Threads

As shown in the previous section, activity threads can be exploited to automate the spatial association of functional components to architectural units with the goal of exploring design alternatives. We adapt this established approach to system models represented in UML with MARTE extensions.

Figure 3.2 depicts a system resource model, represented as a composite structure diagram. The stereotype hwComponent is applied to all components of the resource model to detail physical properties. Functional properties are specified with the appropriate stereotypes from the logical view.

The stereotype hwResource is applied to the CPU, the stereotype hwBus to the BUS, the stereotype hwRAM to RAM1 and RAM2, etc. The architecture model contains the superset of the hardware components used in all system design approaches. Every single design approach uses only a subset of the hardware model. Hardware components that are not used for a system design approach will be omitted in the generation process of the allocation model for this single approach.

To enable the evaluation of core dimensions of the system to model, the annotation of dimension-specific information into the architectural model is required. Information regarding the consumption of power by specific parts of the architecture in different stages of resource utilization would represent such information, if power consumption were one of the dimensions to evaluate. If the performance of the system is the focus of the evaluation process, model components must be annotated with performance-related data, supplied by component vendors for import. In this field, the SPIRIT Consortium breaks new ground with its standardization efforts relating to the description of multi-sourced IP blocks. The intention of SPIRIT is to provide a unified set of specifications based on IP meta-data. These contain the specifications to import complex IP bundles into SoC design tool sets and to exchange design descriptions between tools. Utilizing SPIRIT descriptions as an information source for the architectural model will lessen the amount of information the user has to deliver. It also ensures that information about novel architecture components becomes readily available, as members of the SPIRIT Consortium are committed to supplying descriptions of their IP components.

Fig. 3.3 The application as UML activity diagram

The behavior of the considered system is modeled with UML activity charts. As an example, consider Fig. 3.3. It shows the activity chart of an application in which the actions A and B are concurrently activated and the result is synchronized. Subsequently, action C executes before action D concludes. The stereotypes rtf and rtAction from the real-time execution model of the MARTE profile are applied to the activities to model real-time features.

As for the architecture model, the activities of the functional model must also be enriched with information specific to the intended evaluation. The information comprises statements regarding the complexity and the I/O behavior of the functional parts that the activities represent. Current approaches encompass the analysis of pseudo code and formal specification. Analysis of pseudo code requires the user to provide simplified code for every activity. The pseudo code is compiled and executed to deduce timing measures. For a formal analysis, complexity and I/O behavior are expressed symbolically. While system engineers prefer the first method, experts from the system analysis domain prefer the second.
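As a hypothetical illustration of the pseudo-code analysis, a simplified stand-in for one activity can be executed and timed; the stub below and the measured value are purely illustrative and do not correspond to the actual annotation toolchain.

import time

def activity_a_stub(n=100_000):
    """Simplified stand-in code for activity A, used only to deduce a timing measure."""
    acc = 0
    for i in range(n):
        acc += i * i
    return acc

start = time.perf_counter()
activity_a_stub()
elapsed = time.perf_counter() - start
print(f"timing measure for activity A: {elapsed * 1e3:.2f} ms")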

63 50 A.W. Liehr et al. Fig. 3.4 An activity thread represented as UML activity diagram system engineers prefer the first method, experts from the system analysis domain prefer the second. The architectural model and the functional model are related with Activity Threads (AT). Thereby, the user states with the definition of the activity thread, which mappings of function to architecture must be investigated. Figure 3.4 shows an example of an AT, fitted to the composite structure diagram of Fig. 3.2 and the activity diagram given in Fig The AT is represented as UML activity diagram. The semantic of the activities in the AT differs from the common usage of activity diagrams: An activity of an AT defines, to which hardware components an activity of the activity diagram, that is part of the application model, will be mapped. Each activity of the AT resides within an activity partition, denoted with a dashed box. Functional descriptions within a partition can be referenced by the ATs partition name. As an example, consider the four activities on the left side of Fig The topmost maps the function, represented by activity A from Fig. 3.3, to the hardware components ASIC1, BUS and RAM1 of the resource model in Fig The lowest, maps the same functional part to the hardware components CPU, BUS and RAM2. The implementation of the concept of activity threads, as part of a framework for system evaluation in the development process, enables user guided design-space exploration for holistic computer based systems. The user limits the range of possible design approaches from a permutation over all possibilities to design alternatives. Each path through the activity diagram from an initial node to a final node represents one design alternative. Obviously, the AT in Fig. 3.4 defines a set of different deployments of functionality to hardware. The eight deployments are illustrated in Fig Each line represents one system design approach and results in an automatic generated allocation model. Note, that a permutation over all possible solutions would lead to 1296 potential system design approaches in contrast to eight unfolded activity threads. Hence, activity threads not only reduce the complexity but also provide the user with the flexibility and convenience of an automated specification of alternative system models.
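The unfolding of an AT into individual design alternatives amounts to enumerating all paths of a directed acyclic graph. The following sketch illustrates this in Python, the language of the prototype described in Sect. 3.6 (Python 3 syntax is used here for brevity); the graph, node names and component sets are invented for illustration and are not the exact AT of Fig. 3.4:

    # Minimal sketch: an AT as a DAG whose nodes are mapping actions;
    # every path from the initial node to a final node is one alternative.
    def unfold(graph, node, path=()):
        path = path + (node,)
        successors = graph.get(node, [])
        if not successors:              # final node reached
            yield path
        for succ in successors:
            yield from unfold(graph, succ, path)

    # Illustrative AT: two alternative mappings for A and two for C
    # give 2 * 2 = 4 unfolded threads (the AT of Fig. 3.4 yields eight).
    at = {
        "initial": ["A->ASIC1,BUS,RAM1", "A->CPU,BUS,RAM2"],
        "A->ASIC1,BUS,RAM1": ["B->CPU,BUS,RAM1"],
        "A->CPU,BUS,RAM2": ["B->CPU,BUS,RAM1"],
        "B->CPU,BUS,RAM1": ["C->ASIC2,BUS,RAM1", "C->CPU,BUS,RAM2"],
        "C->ASIC2,BUS,RAM1": ["D->CPU,BUS,RAM2"],
        "C->CPU,BUS,RAM2": ["D->CPU,BUS,RAM2"],
    }
    for alternative in unfold(at, "initial"):
        print(" | ".join(alternative[1:]))  # drop the initial node

Each enumerated path corresponds to one row of the unfolded representation in Fig. 3.5.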

Fig. 3.5 The unfolded activity thread represented as UML activity diagram

The allocation model for the first mapping of the unfolded threads is shown in Fig. 3.6. For its visual representation, we chose a UML composite structure diagram, using the methods for allocation modeling as described in the specification of the MARTE profile.

3.6 A Prototypic Implementation of the Method

To build allocation models from the application model, the resource model and the activity thread, we utilize a Python program and the Python interpreter in version 2.5. We chose the lxml XML processing library in combination with libxml2 to gather information from the XML structure that represents the UML models. This configuration is also employed for the prototype that generates the XML structure representing the allocation model. As UML modeling tool, we use Enterprise Architect from Sparx Systems in version 7.1. The UML models we use as input for our prototype are exported as XML files conforming to XMI 2.1 [14]. Our current version of the program enables the generation of the allocation model from a UML activity diagram as application model, a UML composite structure diagram as resource model, and a UML activity diagram with activity partitions as activity threads.
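The way the prototype gathers model information can be pictured with a short fragment — a hypothetical sketch, not the actual prototype code: it parses an exported XMI file with lxml and collects the activity partitions of an activity thread. The namespace URIs and element names are illustrative, since the exact XMI vocabulary depends on the exporting tool:

    from lxml import etree

    # Hypothetical namespace map; the real URIs depend on the XMI export.
    NS = {"uml": "http://schema.omg.org/spec/UML/2.1",
          "xmi": "http://schema.omg.org/spec/XMI/2.1"}

    def partitions(xmi_file):
        """Map each activity partition name to the actions it owns."""
        tree = etree.parse(xmi_file)
        result = {}
        for part in tree.findall(".//uml:ActivityPartition", namespaces=NS):
            actions = [n.get("name")
                       for n in part.findall(".//uml:node", namespaces=NS)]
            result[part.get("name")] = actions
        return result

    for name, actions in partitions("activity_thread.xmi").items():
        print(name, "->", actions)

The inverse direction, serializing the generated allocation model back to XML, uses the same library.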

Fig. 3.6 The physical deployment of the application as MARTE allocation model

The activity thread of Fig. 3.4 is unfolded into single threads without forks by graph transformation. This results in the eight threads shown in Fig. 3.5, each representing one system design alternative. In the following, each of these alternatives is treated separately with the algorithm described below.

A new composite structure diagram, utilized as allocation model for the particular evaluated system alternative, is built. Inside this diagram, two empty classes are generated: the first serves as container for the application, the second as container for the architecture. Inside the application class, an empty class is generated for each activity of the application model. Within this class, a single UML part is created for each hardware component utilized by this application. It holds the MARTE stereotype app_allocated and represents the utilization of a component, such as a bus, CPU or memory, by the particular application. As the algorithm steps through the activities of the activity thread, the mappings of application parts to architectural components are resolved. The first time a component of the hardware model occurs as the target of an allocation, a UML part is created to represent this component. This UML part is deployed inside the container for the hardware of this model. The generated part inherits the MARTE stereotype ep_allocated. The mapping of functional parts to architectural components is defined by the corresponding activity thread. The mapping connects the application to the hardware: the connection starts at the UML part within the representation of the application and ends at the UML part representing the hardware component. The stereotype of this connection is an allocation.

As a result, we obtain a set of composite structure diagrams defining the allocation of the application to the resources, as seen in Fig. 3.6. To improve the readability of the diagram, we omitted the visualization of the allocation of the BUS component. The diagram is serialized as an XML file in the XMI 2.1 format and can be fed back into the utilized UML design environment as immediate visual feedback of the process.
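The essence of this generation algorithm can be sketched in a few lines of Python. The fragment below is a simplified, hypothetical rendering, not the productive prototype: it assumes an unfolded thread is already available as a list of (activity, components) pairs, and its XML vocabulary is schematic (real XMI 2.1 output uses xmi:* namespaces and UML metaclass elements):

    from lxml import etree

    def build_allocation_model(alternative, out_file):
        # one allocation model per unfolded activity thread (see above)
        root = etree.Element("AllocationModel")
        app = etree.SubElement(root, "Class", name="Application")
        arch = etree.SubElement(root, "Class", name="Architecture")
        hw_parts = {}  # hardware components already represented by a part

        for activity, components in alternative:
            # one class per application activity, holding app_allocated parts
            act_cls = etree.SubElement(app, "Class", name=activity)
            for comp in components:
                etree.SubElement(act_cls, "Part", name=comp,
                                 stereotype="app_allocated")
                if comp not in hw_parts:  # hardware part on first occurrence
                    hw_parts[comp] = etree.SubElement(
                        arch, "Part", name=comp, stereotype="ep_allocated")
                # connector from the application part to the hardware part
                etree.SubElement(root, "Connector", stereotype="allocate",
                                 client=activity + "." + comp, supplier=comp)

        etree.ElementTree(root).write(out_file, pretty_print=True,
                                      xml_declaration=True, encoding="UTF-8")

    build_allocation_model(
        [("A", ["ASIC1", "BUS", "RAM1"]), ("B", ["CPU", "BUS", "RAM1"])],
        "allocation_alt1.xmi")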

3.7 Visualization of Performance Feedback

The introduced method for the generation of allocation models was successfully integrated into an approach to visualize the suitability of design alternatives for a hardware/software system from the viewpoint of performance evaluation [13]. This method, depicted in Fig. 3.7, delivers information about the fulfillment of performance goals directly into the system description modeled with the UML MARTE profile. To this end, a novel UML stereotype was introduced that enables a model-valid integration of our approach into UML-supported system development processes. The contributed method fosters a guided design-space exploration and is seamlessly integrable into a system design flow with off-the-shelf tools.

Fig. 3.7 Generating performance simulation feedback for UML models

As we have shown, our presented approach of utilizing activity threads in the system modeling process with UML MARTE proves to be adoptable into this framework, built to support the development process of embedded systems.

3.8 Summary and Outlook

In this work, we presented an effective method to automatically generate architectural alternatives for hardware/software systems. To streamline the hardware/software codesign process, we extended our established work, based on the activity thread approach, with the UML MARTE profile [10]. As a result, the contributed method fosters a guided design-space exploration and reduces the complexity and the work that would otherwise have to be contributed manually by the system developer. For illustration, we designed an exemplary system with the codesign method brought forward here. For the implementation, we generated a set of allocation models as UML composite structure diagrams in XML. These diagrams define the deployment of the system function to the system architecture.

Our current and future research efforts focus on the automated generation of simulation models from design alternatives. For this reason, we will include mechanisms from the Performance Analysis Modeling of MARTE. The generated simulation models, based on Extended Queuing Network Models, will provide information about the performance behavior and the power consumption of the system under design. This will enable us to predict whether pre-specified timing and power consumption goals can be met. Furthermore, we will enhance our system modeling approach so that our estimation capability encompasses the dimension of power consumption as well.

References

1. S. Balsamo, M. Marzolla, and R. Mirandola. Efficient performance models in component-based software engineering. In SEAA '06: Proceedings of the 32nd Euromicro Conference on Software Engineering and Advanced Applications, Cavtat/Dubrovnik, Croatia, 2006.
2. V. Cortellessa and R. Mirandola. Deriving a queueing network based performance model from UML diagrams. In WOSP '00: Proceedings of the 2nd International Workshop on Software and Performance. Assoc. Comput. Mach., New York, 2000.
3. V. Cortellessa, A. D'Ambrogio, and G. Iazeolla. Automatic derivation of software performance models from case documents. Performance Evaluation, 45(2-3):81-105, 2001.
4. C. Dumoulin, P. Boulet, J.-L. Dekeyser, and P. Marquet. UML 2.0 structure diagram for intensive signal processing application specification. Rapport de recherche 4766, Institut National de Recherche en Informatique et en Automatique (INRIA), March 2003.
5. F. Thomas, S. Gérard, J. Delatour, and F. Terrier. Software real-time resource modeling. In Forum on Specification and Design Languages (FDL), Barcelona, Spain. European Electronic Chips & Systems design Initiative (ECSI), September 2007.
6. G.P. Gu and D.C. Petriu. Early evaluation of software performance based on the UML performance profile. In CASCON '03: Proceedings of the 2003 Conference of the Centre for Advanced Studies on Collaborative Research. IBM Press, Indianapolis, 2003.
7. G.P. Gu and D.C. Petriu. From UML to LQN by XML algebra-based model transformations. In WOSP '05: Proceedings of the 5th International Workshop on Software and Performance. Assoc. Comput. Mach., New York, 2005.
8. C. Kabajunga and R. Pooley. Simulating UML sequence diagrams. In 14th UK Performance Engineering Workshop, Edinburgh, England, July 1998.
9. P. Kaehkipuro. UML-based performance modeling framework for component-based distributed systems. In Performance Engineering: State of the Art and Current Trends, LNCS 2047. Springer, Berlin, 2001.
10. A.W. Liehr and K.J. Buchenrieder. Generation of related performance simulation models at an early stage in the design cycle. In 14th IEEE International Conference on the Engineering of Computer-Based Systems (ECBS). IEEE Comput. Soc., Tucson, 2007.
11. A.W. Liehr and K.J. Buchenrieder. Performance evaluation of HW/SW-system alternatives. In Design, Automation and Test in Europe (DATE), University Booth Demonstration and Poster Exhibition, Munich, Germany, March 2008.
12. A.W. Liehr and K.J. Buchenrieder. An XML based simulation method for extended queuing networks. In L.S. Louca, editor, 22nd European Conference on Modelling and Simulation, Nicosia, Cyprus. European Council for Modelling and Simulation, June 2008.
13. A.W. Liehr, K.J. Buchenrieder, and U. Nageldinger.
Visual feedback for design-space exploration with UML MARTE. In The Fifth International Conference on Innovations in Information Technology, Al Ain, UAE. IEEE Comput. Soc., Los Alamitos, 2008.

14. OMG. MOF 2.0/XMI mapping specification, v2.1. Object Management Group, September 2005.
15. OMG. The Unified Modeling Language (UML). documents/formal/uml.htm, July 2005.
16. OMG. OMG Systems Modeling Language (OMG SysML), v1.0. Object Management Group, September 2007.
17. OMG. OMG Unified Modeling Language. Object Management Group, November 2007.
18. OMG. UML profile for MARTE. Object Management Group, August 2007.
19. D.B. Petriu and M. Woodside. Analysing software requirements specifications for performance. In WOSP '02: Proceedings of the 3rd International Workshop on Software and Performance, pages 1-9. Assoc. Comput. Mach., New York, 2002.
20. D.C. Petriu and X. Wang. Deriving software performance models from architectural patterns by graph transformations. In TAGT '98: Selected Papers from the 6th International Workshop on Theory and Application of Graph Transformations, London, UK. Springer, Berlin, 2000.
21. C.U. Smith. The evolution of software performance engineering: a survey. In ACM '86: Proceedings of the 1986 ACM Fall Joint Computer Conference. IEEE Comput. Soc., Los Alamitos, 1986.
22. S. Taha, A. Radermacher, S. Gerard, and J.-L. Dekeyser. An open framework for detailed hardware modeling. In The IEEE Second International Symposium on Industrial Embedded Systems (SIES), Lisbon, Portugal, vol. 1. IEEE Comput. Soc., Los Alamitos, 2007.
23. M. Woodside, D.C. Petriu, D.B. Petriu, H. Shen, T. Israr, and J. Merseguer. Performance by unified model analysis (PUMA). In WOSP '05: Proceedings of the 5th International Workshop on Software and Performance. Assoc. Comput. Mach., New York, 2005.

Chapter 4
Model-Driven System Validation by Scenarios

A. Carioni, A. Gargantini, E. Riccobene and P. Scandurra

Abstract The chapter presents a method for scenario-based validation of embedded system designs provided in terms of UML models. This approach is based on model transformations from SystemC UML graphical models into Abstract State Machine (ASM) formal models, and exploits the scenario-based model validation of the ASMs. This validation approach complements an existing model-driven design methodology for embedded systems based on the SystemC UML profile. A validation tool integrated into an existing model-driven co-design environment to support the proposed scenario-based validation flow is also presented. It allows the designer to functionally validate system components from SystemC UML designs early at high levels of abstraction.

Keywords Model-based design · System validation · Abstract state machines · UML

4.1 Introduction

In the Embedded System (ES) and System-on-Chip (SoC) design area, conventional system-level design flows usually start by developing a system functional executable model from a system specification written in natural language. It is an emerging practice to develop the functional model and refine it with SystemC (built upon C++), which is considered the de facto, open [20], industry-standard language for functional system-level models [27]. The functional executable model, as program code, introduces design decisions which should be postponed until a commitment between software applications and hardware platform has been established, and it suffers from all the limitations of coding with respect to modeling: less flexibility, limited reuse, unreadable documentation. Furthermore, a system design given in terms of code is hardly traceable with respect to the initial specification and prevents a meaningful analysis of the system.

This work is supported in part by the project "Model-driven methodologies and techniques for embedded system design through UML, ASMs and SystemC" at STMicroelectronics.

E. Riccobene, Dipartimento di Tecnologie dell'Informazione, Università degli Studi di Milano, Crema, Italy, riccobene@dti.unimi.it

The improvement of the current system-level design would require new design methods, languages and tools capable of raising the level of abstraction to a point where productivity can be improved, errors can be easier to identify and correct, better documentation can be provided, and embedded system designers can collaborate more effectively. Furthermore, early stages of the design process would benefit from the use of graphical interface tools that visualize the system specification and allow multiple team members to share the relevant information [1]. All these reasons have therefore caused an increasing interest in visual software modeling languages like the UML (Unified Modeling Language) [28], which are able to capture and visualize system structure and behavior at multiple levels of abstraction, and to generate executable models in C/C++/SystemC from system specifications.

Along this research line, we defined a model-driven methodology [25] and a development process [26] for embedded system design. The new design flow is based on the principles of high-level modeling, model transformation and automatic code generation of the Model Driven Engineering (MDE) approach. As modeling languages, it involves the UML 2, a SystemC UML profile (for the hardware side), and a multi-thread C UML profile (for the software side). It allows system modeling from a functional executable level down to the Register Transfer Level (RTL).

We here address the problem of analyzing high-level UML-based embedded system descriptions, namely to find techniques for system model validation and verification. Validation is intended as the process of investigating a model with respect to its user perceptions, in order to ensure that the specification really reflects the user needs and statements about the application, and to detect faults in the specification as early as possible with limited effort. Validation should precede the application of more expensive and accurate methods, like formal verification of properties, which should be applied only when a designer has enough confidence that requirements satisfaction is guaranteed. There exist different techniques for system design validation. The scenario-based one allows the designer to build critical scenarios reflecting given system requirements and to check for their satisfaction. Of course, this technique requires tools able to support automatic scenario execution.

UML-based design methods are not yet well supported by effective validation methods, and, in general, formal model validation and verification techniques are not directly applicable to UML-based models, due to their lack of a precise semantics. Formal methods and analysis tools have been most often applied to low-level hardware design. However, these techniques are not applicable to system descriptions given in terms of programs of system-level languages like SystemC, since system descriptions are closer to software programs than to traditional hardware descriptions [29]. So far, the focus in the literature has been more on traditional code-based simulation than on design model validation.

To tackle the problem of validating UML-based system models, we combine our SystemC UML modeling language with the Abstract State Machine (ASM) [6] formal notation in order to automatically map a visual UML model into a formal ASM model, and then to exploit well-established techniques for ASM model analysis. This approach allows us to functionally validate SystemC UML designs early at high levels of abstraction. In particular, we here present the scenario-based validation of embedded system designs provided as SystemC UML models. As a proof-of-concept, the paper reports the results of the scenario-based validation for the Simple Bus case study from the SystemC distribution.

We also present a validation tool, integrated into our model-driven HW-SW co-design environment originally presented in [24], to support the scenario-based validation flow. It makes use of the ASMETA (ASM metamodeling) toolset [4] as supporting tools around ASMs. The choice of the ASMs among other formal methods is intentional and due to the fact that this method (a) comes with a rigorous scientific foundation [6], (b) provides executable specifications and is therefore suitable for high-level model validation, and (c) is endowed with a metamodel [10] defining the ASM abstract syntax in terms of an object-oriented representation; the metamodel availability allows automatic mapping of SystemC UML models into ASM models by exploiting MDE techniques of automatic model transformations [30]. A preliminary version of this work was presented in [11]. We here provide more details on the language for scenario modeling and on the tool components that allow transformations from visual to formal models, and model validation.

This paper is organized as follows. Section 4.2 provides some background on the ASMs and their supporting toolset. Section 4.3 presents our basic idea on how to target validation in the ASM context, and presents the language for scenario construction. Section 4.4 focuses on the model validation flow by describing the mapping from the SystemC UML models to ASM models and the scenario-based approach for high-level validation of SystemC UML models. Section 4.5 provides some results of the scenario-based validation of the Simple Bus case study. Section 4.6 discusses some relevant related work. Finally, Sect. 4.7 concludes the paper.

4.2 ASMs and ASMETA

Abstract State Machines are an extension of FSMs, where unstructured control states are replaced by states with arbitrarily complex data. The states of an ASM are multi-sorted first-order structures, i.e. domains of objects with functions and predicates defined on them, while the transition relation is specified by rules describing the modification of the functions from one state to the next. A complete mathematical definition of the ASM method can be found in [6]. The notion of ASMs moves from a definition which formalizes simultaneous parallel actions of a single agent, either in an atomic way (Basic ASMs) or in a structured and recursive way (Structured or Turbo ASMs), to a generalization where multiple agents interact (Multi-agent ASMs). Appropriate rule constructors also allow non-determinism and unrestricted synchronous parallelism.

The ASMETA (ASM metamodeling) toolset [4, 10] is a set of tools around ASMs developed according to the model-driven development principles. At the core of the toolset, the AsmM metamodel [4], available in both meta-languages OMG/MOF [17] and EMF/Ecore [3], provides a complete object-oriented representation of ASM concepts.
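The ASM execution semantics described above — rules computing an update set over the current state, with all updates applied simultaneously — can be illustrated with a toy interpreter. The sketch below is for intuition only and is unrelated to the actual AsmetaS implementation; the state representation and names are invented:

    # A state maps locations (function name, argument tuple) to values.
    # A rule computes an update set from the current state; all updates
    # of one step are then applied simultaneously.
    def asm_run(state, rule, steps):
        for _ in range(steps):
            updates = rule(state)           # {location: new_value}
            state = {**state, **updates}    # simultaneous application
        return state

    # Example rule: two locations updated in parallel, each reading the
    # other's *old* value -- the hallmark of ASM update semantics.
    def r_main(state):
        return {("x", ()): state[("y", ())] + 1,
                ("y", ()): state[("x", ())] + 1}

    print(asm_run({("x", ()): 0, ("y", ()): 0}, r_main, 3))
    # -> {('x', ()): 3, ('y', ()): 3}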

The ASMETA toolset includes: a notation, AsmetaL, to write ASM models conforming to the AsmM in a textual and human-comprehensible form; a text-to-model compiler, AsmetaLc, to parse AsmetaL models and check their consistency w.r.t. the AsmM OCL constraints; a simulator, AsmetaS, to execute ASM models; the AVALLA language, a domain-specific modeling language for scenario-based validation of ASM models, with its supporting tool, the ASMETAV validator; and the ATGT tool, a test case generator based on the SPIN model checker [15].

4.3 Scenario-Based Validation of ASM Models

Scenario-based validation of ASM models [9] requires the formalization (complete or incomplete) of the system behavior in terms of an ASM specification, and a scenario representing a description of external actor actions and system reactions. We support two kinds of external actors: the user, who has only a black-box (i.e. outside) view of the system, and the observer, who has, instead, a gray-box (i.e. also internal) view. By allowing two types of actors, we are able to build scenarios useful for classical validation (those including user actions and machine reactions), and scenarios useful for the testing activity (those also including observer actions), which require the inspection of the internal configurations of the machine. Therefore, our scenario-based validation approach goes beyond the UML use cases it was inspired by, and has the twofold goal of model validation and model testing.

A user actor is able to interact with the system in a black-box manner by setting the values of the external environment, thus asking for a particular service; the user then waits for a step of the machine as reaction to the request and can check the values output by the system. An observer actor has the further capabilities of inspecting the internal state of the system (i.e. values of machine functions and locations), of requiring the execution of particular system (sub-)services of the machine, and of checking the validity of possible invariants of a certain scenario. We describe scenarios in an algorithmic way as interaction sequences consisting of actions, where each action in turn is an activity of a user or observer actor, and an activity of the machine as reaction to the actor actions.

4.3.1 The AVALLA Language

The AVALLA language has been defined in [9] as a domain-specific modeling language in the context of scenario-based validation of ASM models written in AsmetaL. Figure 4.1 shows the AVALLA metamodel, which defines the language abstract syntax in terms of an (object-oriented) model. For a formal definition of the AVALLA semantics, see [9]. An instance of the class Scenario represents a scenario of a provided ASM specification. A scenario has an attribute name, an attribute spec denoting the ASM specification to validate, and a list of target commands of type Command. Additionally, a scenario may contain the specification of some critical properties, here referred to as scenario invariants, which should always hold (and are therefore checked) for the particular scenario. The composite associations between the Scenario class (the whole) and its component classes (the parts) Invariant and Command assure that each part is included in at most one Scenario instance.

Fig. 4.1 The AVALLA metamodel

The abstract class Command and its concrete sub-classes provide a classification of scenario commands. The Set command allows the user actor to set the external environment, i.e. to supply values of monitored or shared functions as input signals to the system. The Check class represents commands supplied either by the user actor to inspect external property values, or by the observer actor to further inspect internal property values in the current state of the underlying ASM. By an Exec command, an observer actor may require the execution of particular ASM transition rules performing given system (sub-)services. Finally, the commands Step and StepUntil represent the reaction of the system, which can execute one single ASM step or ASM steps iteratively until a specified condition becomes true, respectively. Examples of scenario scripts are provided in Sect. 4.5 for the Simple Bus case study.
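The class structure just described can be transcribed almost literally into code. The following is a hypothetical Python rendering of the metamodel of Fig. 4.1 — for illustration only, it is not part of the ASMETA tooling:

    from dataclasses import dataclass, field
    from typing import List

    class Command:                 # abstract base of all scenario commands
        pass

    @dataclass
    class Set(Command):            # user actor: supply monitored/shared values
        location: str
        value: object

    @dataclass
    class Check(Command):          # user/observer actor: inspect a property
        condition: str

    @dataclass
    class Exec(Command):           # observer actor: run a transition rule
        rule: str

    class Step(Command):           # machine reaction: one ASM step
        pass

    @dataclass
    class StepUntil(Command):      # machine reaction: step until condition
        condition: str

    @dataclass
    class Invariant:
        condition: str             # must hold in every state of the scenario

    @dataclass
    class Scenario:                # the whole, owning its parts (composition)
        name: str
        spec: str                  # the AsmetaL specification under validation
        invariants: List[Invariant] = field(default_factory=list)
        commands: List[Command] = field(default_factory=list)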

4.4 The Model-Driven Validation Environment

The scenario-based validation environment has been developed as a component of a more complex co-design environment [24], which allows embedded system modeling at different levels of abstraction by using the SystemC UML profile [23] and forward/reverse engineering to/from C/C++/SystemC programming languages. Figure 4.2 shows the architecture of the validation component.

Fig. 4.2 Architecture of the validation environment

The scenario-based validation process starts by applying (phase 1) the UML2AsmM transformation to the SystemC-UML model of the system (exported from the UML modeler component of the co-design environment [24]). This automatic mapping transforms the input visual model into a corresponding ASM model written in AsmetaL. Once the ASM model is generated, system validation (phase 2) is possible by supplying suitable scenarios written in AVALLA. A brief description of each activity follows. Note that, in terms of required skills and expertise, the designer has to be familiar with the SystemC UML profile (embedded in the UML modeler) and with very few commands of the AVALLA textual notation to write pertinent validation scenarios.

4.4.1 From SystemC UML Models to ASM Models

SystemC UML models, provided as input from the co-design tool [24], are transformed into corresponding ASM models (instances of the AsmM metamodel). This transformation is defined (once and for all) by establishing a set of semantic mapping rules between the SystemC UML profile and the AsmM metamodel. The UML2AsmM transformation is completely automated by means of the ATL transformation engine [2], developed as a possible implementation of the OMG QVT [22] standard.

In order to provide a one-to-one mapping (for both the structural and behavioral aspects), we first had to express in terms of ASMs the SystemC discrete (absolute and integer-valued) and event-based simulation semantics. To this end, we took inspiration from the ASM formalization of the SystemC 2.0 simulation semantics in [19] to define a precise and executable semantics of the SystemC UML profile and, in particular, of the SystemC scheduler and the SystemC process state machines (an extension of the UML statecharts for modeling the behavior of the reactive SystemC processes). We then proceeded to model in ASMs the predefined set of interfaces, ports and primitive channels (the SystemC layer 1), and SystemC-specific data types. The resulting SystemC-ASM component library is available as the target of the UML2AsmM transformation.

Exploiting the SystemC-ASM component library, a SystemC module M is mapped into an ASM containing in its signature a dynamic abstract domain M. This domain is the set of instances that can be created by the corresponding module.

Module attributes and ports of type T are mapped into controlled ASM functions declared in the signature of the ASM corresponding to the module. Basically, these functions have M as domain and T as codomain. Multiplicity and properties (like ordered, unique, etc.) of attributes and ports are captured by the codomain types of the corresponding functions. A multi-port of type T, for example, is mapped into a controlled ASM function with codomain P(T), i.e. the mathematical powerset of T. A hierarchical channel is treated as a module. A primitive channel is mapped, instead, into a concrete sub-domain of the predefined abstract domain PrimChannel, which is part of the SystemC-ASM component library. An event is mapped into an element of a predefined abstract domain Event.

For the behavioral part, a process (an sc_thread or an sc_method) is mapped into an element of a predefined abstract domain Process. A process behavior within a module is defined by a named, possibly parameterized, transition rule declared within the ASM corresponding to the container module. Moreover, since in the SystemC process state machines control structures (like if-then-else, while loops, etc.) and process synchronization points (statements like wait, static_wait, dont_initialize, etc.) are modeled in terms of stereotyped pseudo-states (junction or choice) and states, respectively, a one-to-one mapping is defined between the state-like diagram of the process behavior and the basic ASM rule constructs (if-then-else rule, seq rule, etc.). Some special ASM rule constructs, however, have been introduced in the SystemC-ASM component library in order to capture in ASMs the semantics underlying all possible forms of synchronization calls (which require dealing with the ASM agent representing the SystemC scheduler). In particular, the infinite loop mechanism of a thread has been modeled with a specific design pattern of ASM rule constructors. As an example of the application of this mapping, Fig. 4.3 shows the UML notation, the SystemC code, and the resulting ASM (in AsmetaL) for a module.

Fig. 4.3 A UML module (A), its SystemC code (B) and its corresponding ASM (C)
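The structural part of these mapping rules can be summarized programmatically. The following sketch emits AsmetaL-like declaration strings from a simplified module description; it is purely illustrative — the names and the exact AsmetaL syntax are simplified and not taken from the real UML2AsmM/ATL transformation:

    # Schematic rendering of the structural mapping rules described above.
    def map_module(name, attributes, ports, multiports):
        decls = ["abstract domain " + name]        # module -> dynamic domain
        for attr, ty in attributes:                # attribute of type T
            decls.append("controlled %s: %s -> %s" % (attr, name, ty))
        for port, ty in ports:                     # port of type T
            decls.append("controlled %s: %s -> %s" % (port, name, ty))
        for port, ty in multiports:                # multi-port -> powerset
            decls.append("controlled %s: %s -> Powerset(%s)" % (port, name, ty))
        return decls

    # Hypothetical module with one attribute, one port and one multi-port.
    for d in map_module("Fifo",
                        attributes=[("depth", "Integer")],
                        ports=[("clk", "Boolean")],
                        multiports=[("dout", "Integer")]):
        print(d)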

4.4.2 Model Validator

Scenarios written in AVALLA are executed by means of the ASMETAV validator. It is a Java application which makes use of the AsmetaS simulator to run scenarios. ASMETAV reads a user scenario written in AVALLA (see Fig. 4.2), builds the scenario as an instance of the AVALLA metamodel by means of a parser, and transforms the scenario, together with the AsmetaL specification the scenario refers to, into an executable AsmM model. Then, ASMETAV invokes the AsmetaS interpreter to simulate the scenario. During simulation the user can pause the simulation and watch the current state and the value of the update set at every step through a watching window. During simulation, ASMETAV captures any check violation, and if none occurs it finishes with a PASS verdict. Besides the PASS/FAIL verdict, during the scenario run ASMETAV collects in a final report some information about the coverage of the original model; this is useful to check which transition rules have been exercised.

4.5 The Simple Bus Case Study

The Simple Bus case study is a well-known transaction-level example, designed to also perform cycle-accurate simulation. It is made of about 1200 lines of code that implement a high-performance, abstract bus model. The complete code is available at the official SystemC web site [20]. The Simple Bus system was modeled [23] in a forward engineering flow using the SystemC UML profile. The UML object diagram in Fig. 4.4 shows the internal collaboration structure of the objects involved in a specific configuration of the Simple Bus design: three master blocks (a blocking master master_b, a non-blocking master master_nb, and a monitor master_d); two slave memories (one fast, mem_fast, and one slow, mem_slow); a bus connecting masters and slaves; an arbiter with priority-based arbitration to select a request to serve and with bus-locking support; and a clock generator C1.¹

Every master submits read/write requests to the bus at regular time instants. The designer assigns a unique priority to each master: master_nb has priority 3, while master_b has priority 4. Masters can issue a request at the same time, so the arbiter must choose one request according to some deterministic rules. In the simplest case, precedence is accorded to the device with the higher priority;² in our case the non-blocking master has priority 3, which is higher (following a decreasing order) than the priority 4 of the blocking master. When a master occupies the bus, an incoming request is therefore queued and served later at a different time instant, or served from the next clock cycle if it has a higher priority (and the current request will be terminated later).

To illustrate the typical use of the AVALLA language in writing validation scenarios, below we report two scenario examples and their related validation results for the Simple Bus design.

¹ Note that all connectors are intended as stereotyped with «sc_connector».
² Two devices cannot have the same priority, so determinism is assured.
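The deterministic, priority-based arbitration described above can be captured in a few lines — a conceptual model of the selection rule only, not the Simple Bus SystemC source:

    # At each arbitration point, the pending request with the numerically
    # lowest (i.e. highest) priority is served; the others stay queued.
    def arbitrate(pending):
        if not pending:
            return None
        # priorities are unique (footnote 2), so min() is deterministic
        return min(pending, key=lambda req: req["priority"])

    requests = [{"master": "master_b", "priority": 4, "address": 76},
                {"master": "master_nb", "priority": 3, "address": 56}]
    print(arbitrate(requests)["master"])   # -> master_nb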

Fig. 4.4 Simple Bus UML Object Diagram

The first scenario shows how high-level modeling tools like AsmetaV/AVALLA are helpful to abstract out monitoring and debugging functionality that is typically embedded within the SystemC design (in our case within the master_d monitor, the arbiter, and the bus) by inserting C++ code lines, thus further alleviating the designer's burden of writing code. The second scenario shows instead how to validate the fairness of the arbitration rules adopted for scheduling the masters' requests.

Scenario s1: At given time instants, the memory locations between address 120 and address 132 are read (directreadbus). The actual values must match the expected values.

    scenario s1
    load Top.asm
    step until time = 0 and phase = TIMED_NOTIFICATION;
    check directreadbus(bus, 120) = 0 and directreadbus(bus, 124) = 0 and
          directreadbus(bus, 128) = 0 and directreadbus(bus, 132) = 0;
    step until time = 1600 and phase = TIMED_NOTIFICATION;
    check directreadbus(bus, 120) = 16 and directreadbus(bus, 124) = 0 and
          directreadbus(bus, 128) = 0 and directreadbus(bus, 132) = 0;

Scenario s2: At time 0, the master_nb (with priority 3) issues a read request (status = SIMPLE_BUS_REQUEST and do_write = false) at address 56 (address = 56), and the master_b (with priority 4) issues a burst read request from address 76 to 136. We assume that the clock period is 15 time units. The bus checks the requests at each negative clock edge. At time 15, the bus must serve the master with the higher priority, i.e. the master_nb, and complete its request (status = SIMPLE_BUS_OK). At time 30, the master_nb issues a new write request at address 56. At time 45, the bus serves the master_nb again, ignoring for the second time the still pending read request of the master_b.

    scenario s2
    load Top.asm
    step until time = 0 and phase = TIMED_NOTIFICATION;
    check (exist $r00 in Request with priority($r00) = 3 and
           do_write($r00) = false and address($r00) = 56 and
           status($r00) = SIMPLE_BUS_REQUEST);
    check (exist $r01 in Request with priority($r01) = 4 and
           do_write($r01) = false and address($r01) = 76 and
           end_address($r01) = 136 and status($r01) = SIMPLE_BUS_REQUEST);
    step until time = 15 and phase = TIMED_NOTIFICATION;
    check (exist $r02 in Request with priority($r02) = 3 and
           status($r02) = SIMPLE_BUS_OK);
    step until time = 30 and phase = TIMED_NOTIFICATION;
    check (exist $r03 in Request with priority($r03) = 3 and
           do_write($r03) = true and address($r03) = 56 and
           status($r03) = SIMPLE_BUS_REQUEST);
    step until time = 45 and phase = TIMED_NOTIFICATION;
    check (exist $r04 in Request with priority($r04) = 3 and
           status($r04) = SIMPLE_BUS_OK);

Both scenarios ended with verdict PASS and achieved coverage of all ASM rules of the Simple Bus model.

4.6 Related Work

In [21], the authors present a model-driven development and validation process which begins by creating (from a natural language specification of the system requirements) a functional abstract model and (still manually) a SystemC implementation model. The abstract model is described using the Abstract State Machine Language (AsmL), another implementation language for ASMs. Our methodology, instead, benefits from the use of the UML as design entry level and of model translators which provide automation and ensure consistency among descriptions in different notations (such as those in SystemC and ASMs). Moreover, these can remain hidden to designers, making the process completely transparent to users who do not want to deal with them. In [21], a designer can visually explore the actions of interest in the ASM model using the Spec Explorer tool and generate tests. These tests are used to drive the SystemC implementation from the ASM model to check whether the implementation model conforms to the abstract model (conformance testing). The test generation capability is limited and not scalable. In order to generate tests, the internal algorithm of Spec Explorer extracts a finite state machine from ASM models and then uses test generation techniques for FSMs. The effectiveness of their methodology is therefore severely constrained by the limits of Spec Explorer. The authors themselves say that the main difficulty is in using Spec Explorer and its methods for state space pruning/exploration. The ASMETA ATGT tool that we want to use for the same goal exploits, instead, the method of model checking to generate test sequences, and it is based on a direct encoding of ASMs in PROMELA, the language of the model checker SPIN [15].

The work in [13] also uses AsmL and Spec Explorer to settle a development and verification methodology for SystemC. They focus on assertion-based verification of SystemC designs using the Property Specification Language (PSL), and although they mention test case generation as a possibility, the validation aspect is largely ignored. We were not able to investigate their work carefully, as their tools are unavailable. Moreover, it should be noted that the approaches in [13, 21], although using the Spec Explorer tool, do not exploit the scenario-based validation feature of Spec Explorer. Indeed, [5, 12] showed how Spec Explorer allows scenario-oriented modeling.

In [16], a model-driven methodology for development and validation of system-level SystemC designs is presented. The development and validation flow is entirely based on the specification of a functional model (reference model) in the ESTEREL language, a state machine formalism, and on the use of the ESTEREL Studio development environment for the purpose of test generation. The proposed approach concentrates on providing coverage-directed test suite generation for system-level design validation. The authors of [7] provide test case generation by performing static analysis on SystemC designs. This approach is limited by the strength of the static analysis tools and by the lack of flexibility in describing the reachable states of interest for directed test generation. Moreover, static analysis requires sophisticated syntactic analysis and the construction of a semantic model, which for a language like SystemC (built on C++) is difficult due to the lack of formal semantics.

The SystemC Verification Library [20] provides an API for transaction-based verification, constrained and weighted randomization, exception handling, and HDL connection. We aim, however, at developing formal techniques to augment SystemC verification. The Message Sequence Chart (MSC) notation [18], originally developed for telecommunication systems, can be adapted to embedded systems to allow validation. For instance, in [8] MSCs are adopted to visualize the simulation of SystemC models. The traces are only displayed and not validated, and the authors report the difficulties of adopting a graphical notation like MSC. Our approach is similar to that presented in [14], where MSCs are validated against an SDL model, from which a SystemC implementation is derived. MSCs are also generated by the SDL model and replayed for cross-validation and regression testing.

4.7 Conclusions and Future Work

We proposed a scenario-based validation approach to system-level design by the use of the SystemC UML profile (for the modeling part) and the ASM formal method with its related ASMETA toolset (for the validation part). We have been testing our validation technique on case studies taken from the standard SystemC distribution, like the Simple Bus presented here, and on some of industrial interest. Thanks to the ease of raising the abstraction level using ASMs, we believe our approach scales effectively to industrial systems.

This work is part of our ongoing effort to enact design flows that start with system descriptions using UML notations and produce C/C++/SystemC implementations of the SW and HW components as well as their communication interfaces, and that are complemented by formal analysis flows for system validation and verification. As a future step, we plan to integrate ASMETAV with the ATGT tool of the ASMETA toolset to be able to automatically generate some scenarios by using ATGT and to ask for a certain type of coverage (rule coverage, fault detection, etc.). Test cases generated by ATGT and the validation scenarios can be transformed into concrete SystemC test cases to test the conformance of the implementations with respect to their specification. Moreover, we plan to support formal verification of system properties by model checking techniques. This requires transforming ASM models into models in the language of the model checkers, such as the Promela language of the SPIN model checker.

References

1. R. Chen, M. Sgroi, G. Martin, L. Lavagno, A.L. Sangiovanni-Vincentelli, and J. Rabaey. Embedded system design using UML and platforms. In E. Villar and J. Mermet, editors, System Specification and Design Languages, CHDL Series. Kluwer Academic, Dordrecht, 2003.
2. The ATL language.
3. Eclipse Modeling Framework.
4. The ASMETA toolset.
5. M. Barnett et al. Validating use-cases with the AsmL test tool. In QSIC: Int. Conference on Quality Software. IEEE Press, New York, 2003.
6. E. Börger and R. Stärk. Abstract State Machines: A Method for High-Level System Design and Analysis. Springer, Berlin, 2003.
7. F. Bruschi, F. Ferrandi, and D. Sciuto. A framework for the functional verification of SystemC models. Int. J. Parallel Program., 33(6), 2005.
8. T. Kogel et al. Virtual architecture mapping: a SystemC based methodology for architectural exploration of system-on-chip designs. In A.D. Pimentel and S. Vassiliadis, editors, Computer Systems: Architectures, Modeling, and Simulation, SAMOS, LNCS 3133. Springer, Berlin, 2004.
9. A. Gargantini, E. Riccobene, and P. Scandurra. A scenario-based validation language for ASMs. In ABZ '08: Proc. of the 1st International Conference on Abstract State Machines, B and Z, LNCS 5238. Springer, Berlin, 2008.
10. A. Gargantini, E. Riccobene, and P. Scandurra. A language and a simulation engine for abstract state machines based on metamodeling. Journal of Universal Computer Science, 14(12), 2008.

11. A. Gargantini, E. Riccobene, P. Scandurra, and A. Carioni. Scenario-based validation of embedded systems. In FDL '08: Proc. of Forum on Specification and Design Languages. IEEE Press, New York, 2008.
12. W. Grieskamp, N. Tillmann, and M. Veanes. Instrumenting scenarios in a model-driven development environment. Information & Software Technology, 46(15), 2004.
13. A. Habibi and S. Tahar. Design and verification of SystemC transaction-level models. IEEE Transactions on VLSI Systems, 14:57-68, 2006.
14. M. Haroud et al. HW accelerated ultra wide band MAC protocol using SDL and SystemC. In IEEE Radio and Wireless Conference. IEEE Press, Los Alamitos, 2004.
15. G.J. Holzmann. The model checker SPIN. IEEE Transactions on Software Engineering, 23(5), 1997.
16. D. Mathaikutty, S. Ahuja, A. Dingankar, and S. Shukla. Model-driven test generation for system level validation. In HLDVT '07: High Level Design Validation and Test Workshop. IEEE Press, New York, 2007.
17. OMG. The Meta Object Facility, formal/.
18. Message Sequence Charts (MSC). ITU-T Recommendation Z.120.
19. W. Müller, J. Ruf, and W. Rosenstiel. SystemC Methodologies and Applications. Kluwer Academic, Dordrecht, 2003.
20. Open SystemC Initiative.
21. H.D. Patel and S.K. Shukla. Model-driven validation of SystemC designs. In DAC '07: Proc. of the 44th Design Automation Conference. Assoc. Comput. Mach., New York, 2007.
22. OMG. Query/Views/Transformations, ptc/.
23. E. Riccobene, P. Scandurra, A. Rosti, and S. Bocchio. A UML2 profile for SystemC 2.1. STMicroelectronics Technical Report, April 2005.
24. E. Riccobene, P. Scandurra, A. Rosti, and S. Bocchio. A model-driven design environment for embedded systems. In DAC '06: Proc. of the 43rd Design Automation Conference. Assoc. Comput. Mach., New York, 2006.
25. E. Riccobene, P. Scandurra, A. Rosti, and S. Bocchio. A model-driven co-design flow for embedded systems. In Advances in Design and Specification Languages for Embedded Systems (Best of FDL '06). Springer, 2007.
26. E. Riccobene, P. Scandurra, A. Rosti, and S. Bocchio. Designing a unified process for embedded systems. In Fourth Int. Workshop on Model-Based Methodologies for Pervasive and Embedded Software. IEEE Press, New York, 2007.
27. T. Gröetker, S. Liao, G. Martin, and S. Swan. System Design with SystemC. Kluwer Academic, Dordrecht, 2002.
28. OMG. The Unified Modeling Language.
29. M.Y. Vardi. Formal techniques for SystemC verification. In DAC '07: Proc. of the 44th Design Automation Conference. IEEE Press, New York, 2007.
30. T. Zhang, F. Jouault, J. Bézivin, and J. Zhao. A MDE based approach for bridging formal models. In Proc. 2nd IFIP/IEEE International Symposium on Theoretical Aspects of Software Engineering. IEEE Comput. Soc., Los Alamitos, 2008.

Chapter 5
An Advanced Simulink Verification Flow Using SystemC

Kai Hylla, Jan-Hendrik Oetjens and Wolfgang Nebel

Abstract Functional verification is a major part of today's system design task. Several approaches are available for verification on a high level of abstraction, where designs are often modeled using MATLAB/Simulink, as well as for RT-level verification. Different approaches are a barrier to a unified verification flow. For simulation-based RT-level verification, an extended test bench concept has been developed at Robert Bosch GmbH. This chapter describes how this SystemC-based test bench concept can be applied to Simulink models. The implementation of the resulting verification flow addresses the required synchronization of both simulation environments, as well as data type conversion. An example is used to evaluate the implementation and the whole verification flow. It is shown that using the extended verification flow saves a significant amount of time during development. Reusing test bench modules and test cases preserves the consistency of the test bench. Verification is done automatically rather than by inspecting the waveform manually. The extended verification flow unifies system-level and RT-level verification, yielding a holistic verification flow.

Keywords SystemC · Simulink · Verification · Co-simulation · Test bench

5.1 Introduction

Chip complexity and size have been increasing ever since the first chip was developed. Verification effort tends to increase exponentially with the size of the design. Today's verification effort is about 70% of the total project effort [2]. Alongside the development of new methodologies for designing chips, methodologies for verification have been developed. The first verifications were done by inspecting the design manually. The increasing complexity of designs led to test benches. Test benches stimulate the design under verification (DUV) in a reproducible way. The increasing demand for security and safety requirements led to formal verification methodologies. However, these cannot yet handle larger designs in an appropriate amount of time. Until they become usable, simulation-based verification is the first choice.

K. Hylla, OFFIS Institute, Escherweg 2, Oldenburg, Germany, kai.hylla@offis.de

Since a lot of design constraints and requirements are to be met, many test cases are needed. Verification of designs needs to be done automatically and repeatably. Concepts allowing reuse of test benches are needed. Ideally, these concepts provide constrained random data generation and are usable across different levels of abstraction.

This contribution addresses a simulation-based verification flow, starting at system level based on MATLAB/Simulink models. It describes concisely how a test bench concept [3], already used in the company's verification flow, is extended to cover different levels of abstraction. The extended test bench concept is usable at RT-level as well as at a higher level, where designs are modeled using Simulink.

The following Sect. 5.2 describes some solutions available for verification of system models. It also states why it is difficult to apply them to the Simulink-based verification flow currently used at Robert Bosch GmbH. Section 5.3 describes the weaknesses of a conventional verification flow, a way to improve the test bench concept, and how this can be applied to Simulink models. Section 5.4 shows how the extended verification flow has been implemented, considering synchronization and data type conversion. Section 5.5 shows how the implementation has been evaluated; an example is used to evaluate the extended verification flow presented in this chapter. Finally, Sect. 5.6 gives a conclusion and identifies further work.

5.2 Related Work

Often, verification is done in an unstructured way. However, there are several tools and methodologies for structured system verification available. A recent methodology for verifying models is provided with the Open Verification Methodology (OVM) [5]. It is jointly developed by Cadence and Mentor Graphics. OVM is based upon the Universal Reuse Methodology (URM) [11] and the Advanced Verification Methodology (AVM) [1]. OVM provides an object-oriented concept for verifying models. Communication between components of the verification environment is done using transaction-level modeling (TLM). Stimuli can be generated using different layers, and transactions can be randomized automatically.

The Verification Methodology Manual for SystemVerilog (VMM) [12], which is available from Synopsys, specifies a functional verification methodology. The standard library defined by the VMM includes constrained random stimulus generation, functional coverage collection, assertions, and TLM. The layered structure allows creating both simple and complex test benches.

A well-known tool for verifying Simulink models is Simulink Verification and Validation [9]. It allows creating designs based upon requirements and provides integrated requirements management. During simulation, user-extendable verification blocks determine whether or not the requirements have been implemented correctly. This can be done by assuring that the assertions on test signals hold. It also allows a coverage analysis of the models. A similar library is Simulink Design Verifier [6], which allows verification of requirements based on property proving.

OVM and VMM are unsuitable for Simulink models, while Simulink Verification and Validation as well as Simulink Design Verifier address the verification of Simulink models. However, these Simulink tools have a significantly different concept compared to the extended test bench concept used at the company for RT-level verification. A combination of both is hardly feasible.

5.3 Extended Verification Flow

As mentioned in the previous section, an extension of the existing verification flow is necessary. Before implementing such an extension, it is required to understand the problems and adversities of the existing flow. The company's constraints and requirements must also be considered. The next section shows a conventional flow and describes the extended test bench concept developed at the company. The subsequent section shows how the existing verification flow can be extended to comply with the company's needs.

5.3.1 Conventional Flow

A conventional verification flow starts at the specification of the system to be built. The specification is usually provided in a textual and often non-formalized form. In many instances, it consists of a set of requirements and constraints. As shown in Fig. 5.1, the description of the system is brought into an executable specification using Simulink. During this step, the system is partitioned into analog and digital components. Then, in order to verify the system, a test bench is created. This test bench consists of several test bench modules (TBMs), which generate stimuli for the system and observe the system's output. The partitioning into TBMs is typically done based upon the identified functionality and interfaces, respectively. A test case (TC) describes the stimuli that should be created by the TBMs and the design's expected output. Ideally, the TC is completely separated from the test bench.

After the verification of the system has yielded satisfying results, the system can be brought to a lower level of abstraction. Typically, a hardware description language (HDL) like VHDL or Verilog is used for that purpose. Here, the verification flow is usually split into an analog and a digital verification flow. Aside from the digital components, TBMs and TCs need to be ported, too. Normally this is done manually. Therefore, each component needs to be rewritten by a developer. The same applies to the test cases. In order to assure that porting has been done accurately, the equivalence of the ported and the original component has to be shown, which is a serious task of its own. As the analog components no longer exist in the digital flow, their behavior must be modeled. Therefore, new TBMs have to be written. The test cases must be adapted to provide the information required by the new TBMs. Again, the system is verified using the test bench. After satisfying results are achieved, the results from the analog and the digital flow are merged and further development towards the final chip design is done.

86 74 K. Hylla et al. Fig. 5.1 Conventional verification flow Fig. 5.2 Conventional test bench concept Conventional Test Bench Concept As just mentioned, test bench and TC should be clearly separated from each other. However, developers often do not follow that rule. Often test bench and test case are mixed up together. Especially when writing test benches in a HDL. Signals are manipulated directly from within the test bench and even partitioning the test bench into TBMs is not performed regularly. Often each TC is implemented in its own test bench as shown in Fig The test bench does only stimulate the DUV. The output is been traced and finally viewed using a waveform viewer. The developer decides, whether the behavior of the DUV is correct or not. The stimulus is reproducible. However, verification has to be done manually, each time the simulation has run. On larger designs, manual verification can hardly be done for the whole design. There are too many signals

As test bench and test case are indistinguishable from each other, there is a large number of test benches, one for each test case, which leads to a confusing verification environment. In order to write a new test case, a complete test bench has to be written, which is more time-consuming than writing the new test case only. If the interfaces of one of the components change, all existing test benches need to be adjusted, causing significant effort. Changing a test case requires changing the test bench, too. This is more complicated, as the test bench contains a lot more information than required for specifying the test case only. Changing the test bench requires re-compiling it, which, depending on the size of the design, can be time-consuming.

5.3.3 Extended Test Bench Concept

In order to deal with the problems just mentioned, an extended test bench concept has been developed at Robert Bosch GmbH [3]. The concept enforces a strict separation of test bench and test case. Test cases are implemented as single text files, the so-called command files. The command files contain the instructions to be executed by the test bench. The test bench itself consists of several TBMs and a shared controller. The controller processes the command file and operates the TBMs. The TBMs implement commands that can then be used within the command file. Complex operations can be provided by the TBMs by means of simple commands. Because of these user-defined commands, the syntax of the command file is highly flexible and extensible. Test cases for verifying interfaces can be reused independently of the concrete implementation of the protocol: TBMs implementing different protocols simply need to provide the same commands. The same TBMs can also be reused to verify different designs, if the components of the design implement the same interface. Figure 5.3 shows the conceptual design of a test bench using the extended test bench concept. The concept was first implemented using VHDL. This implementation has been used within a production environment for several years. Currently, the concept is being adapted to SystemC [10]. Due to the usage of SystemC, the controller is enhanced: it provides a set of new features like constrained randomization, which can be used within the command files. Co-simulation of VHDL and SystemC TBMs is possible, and existing VHDL TBMs can be used within a SystemC test bench environment. The system to be built can be implemented using either SystemC or VHDL. This allows a smooth transition from the VHDL test benches towards the more powerful SystemC test benches.

Fig. 5.3 Extended test bench concept
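The command dispatch at the heart of this concept can be pictured with a minimal C++ sketch. All names here are hypothetical and only illustrate the idea, not the actual Bosch or SystemC implementation:

#include <fstream>
#include <functional>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical sketch: a TBM registers named commands with a shared
// controller; the controller replays a command file line by line.
class Controller {
public:
    using Command = std::function<void(const std::vector<std::string>&)>;

    // A TBM makes one of its operations available to command files.
    void register_command(const std::string& name, Command cmd) {
        commands_[name] = cmd;
    }

    // Process the test case: one command per line, e.g. "send_frame A5 17".
    void run(const std::string& command_file) {
        std::ifstream in(command_file);
        std::string line;
        while (std::getline(in, line)) {
            std::istringstream tokens(line);
            std::string name;
            tokens >> name;
            std::vector<std::string> args;
            for (std::string a; tokens >> a; ) args.push_back(a);
            auto it = commands_.find(name);
            if (it != commands_.end())
                it->second(args);  // complex TBM behavior behind a simple command
        }
    }

private:
    std::map<std::string, Command> commands_;
};

Two TBMs implementing different bus protocols would register the same command name; the same command file, i.e. the test case, then runs unchanged against either of them, which is precisely the reuse property exploited in the following sections.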

Fig. 5.4 Extended verification flow

5.3.4 Extending the Verification Flow

The extended test bench concept described above has achieved several improvements. A pure SystemC-based approach is not suitable, since the existing flow should be modified as little as possible. Therefore, the presented test bench concept should be used for the Simulink models as well. The resulting verification flow is shown in Fig. 5.4. The TBMs used for verifying the Simulink model are implemented as SystemC-based test bench modules according to the extended test bench concept. The TBMs can be made available to Simulink using so-called S-functions. S-functions allow the usage of C/C++ source code from within Simulink and appear as ordinary Simulink blocks within the model; a sketch of such a wrapper is shown below. Again, the TBMs use the common controller, which is not shown in the figure. Test cases are still implemented in a single command file each. As shown in Table 5.1, the TBMs used for verifying Simulink can be used for verifying the VHDL description of the system as well. No modifications are necessary. Thus, development time and effort can be saved, and the likelihood of errors can be reduced. Using the TBMs that stimulated the analog components on RT-level as well requires the analog components to be ported. Since the resulting code will be part of the test bench, and thus does not need to be synthesizable in later steps, the Real-Time Workshop [8] can be used. The Real-Time Workshop allows the generation of C code from Simulink models. The generated source code is combined with the original TBMs to form new ones.
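The overall shape of such a wrapper can be sketched as a Level-2 C-MEX S-function. The wrapper base classes and the SystemC coupling described in Sect. 5.4 are internal to the presented flow, so they are only indicated by comments; everything else below uses the standard Simulink C API:

// tbm_sfun.cpp -- sketch of an S-function exposing a TBM to Simulink
#define S_FUNCTION_NAME  tbm_sfun   // hypothetical block name
#define S_FUNCTION_LEVEL 2
#include "simstruc.h"

static void mdlInitializeSizes(SimStruct *S) {
    ssSetNumSFcnParams(S, 1);               // e.g. the data-conversion mode
    if (!ssSetNumInputPorts(S, 1)) return;
    ssSetInputPortWidth(S, 0, 1);
    ssSetInputPortDirectFeedThrough(S, 0, 1);
    if (!ssSetNumOutputPorts(S, 1)) return;
    ssSetOutputPortWidth(S, 0, 1);
    ssSetNumSampleTimes(S, 1);
}

static void mdlInitializeSampleTimes(SimStruct *S) {
    ssSetSampleTime(S, 0, INHERITED_SAMPLE_TIME);  // variants: see Sect. 5.4.1
    ssSetOffsetTime(S, 0, 0.0);
}

static void mdlOutputs(SimStruct *S, int_T tid) {
    const real_T *u = *ssGetInputPortRealSignalPtrs(S, 0);
    real_T *y = ssGetOutputPortRealSignal(S, 0);
    // Here the wrapper would convert u, advance the encapsulated SystemC
    // kernel up to ssGetT(S), and convert the TBM outputs back.
    *y = *u;                                // placeholder only
}

static void mdlTerminate(SimStruct *S) {}

#ifdef MATLAB_MEX_FILE
#include "simulink.c"
#else
#include "cg_sfun.h"
#endif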

Table 5.1 Steps to be done when switching from Simulink level to RT level

Component                  Conventional flow                            Extended flow
digital                    ported                                       ported
analog                     rewritten as TBMs with a digital interface   connected to TBM
TBM (digital interface)    ported                                       reused
TBM (analog interface)     discarded                                    reused
Test case                  rewritten to adapt to new TBMs               reused

The new TBMs provide the same commands as the old ones, but their output is equal to the output formerly generated by the analog components. Thus, no modifications of the TCs are required. This concept allows a smooth integration into the company's existing verification flow. It can be used together with existing test bench modules implemented using Simulink components, which allows the developer to change gradually towards the extended test bench concept. Models that had been developed using a conventional verification flow can still be used.

5.4 Implementation

Co-simulation between SystemC and VHDL is already provided by the extended test bench concept. Therefore, the implementation addresses the co-simulation between SystemC and Simulink. It is described in two parts: the first one discusses the synchronization of Simulink and SystemC, while the second one describes how data is exchanged between both environments. In order to integrate the extended test bench modules into Simulink, it is required to write wrappers that implement the S-function. Advanced base classes, which provide the functionality described below, allow the developer to implement the module wrapper in a short amount of time. In order to create flexible TBMs, the type of data conversion can be chosen by a runtime parameter.

5.4.1 Synchronization

The first important step when implementing the co-simulation is to synchronize both environments. This is a complex task, as Simulink and SystemC use different models of simulation. Simulink uses a so-called sample time, which defines the points in time when a component of the model is to be updated. SystemC uses an event queue, which is processed event by event. Since the SystemC module is integrated into the Simulink environment as an S-function, it is necessary that the S-function is updated each time an incoming or outgoing signal of the SystemC module changes. There are several variants of how the environments can be synchronized. Altogether, 13 different variants have been evaluated. They can be clustered as follows:

Fixed Sample Time Variants of this type assign the S-function a fixed sample time. The sample time can be set to the time resolution of SystemC, resulting in a very low performance. If a value larger than the time resolution is chosen, an error might occur: in this case, the TBM will not be updated at the correct time. Obtaining the adequate value for the sample time from the TBM is difficult; the knowledge of the TBM's developer is required to set up the correct sample time. As internal events might occur randomly, for some modules even this is not sufficient, and their behavior cannot be represented adequately using a fixed sample time.

Variable Sample Time Synchronization variants of this type inform Simulink when the next update of the S-function should occur. Therefore, the internal event queue can easily be mapped to the sample time: the next update of the S-function should occur at the time the next event is scheduled for. However, the time at which an incoming signal changes cannot be predicted. A maximum time interval between two consecutive events can be specified; the S-function is then updated either if an event occurs, or if the maximum time interval is reached. This interval assures that the incoming signals are read at an adequate rate. Finding the optimal value for the maximum time can be a difficult task, as no information about the incoming signals is available.

Inherited In these variants, the S-function inherits its sample time from its driving Simulink components. This way it can be assured that no signal changes are missed. Internal events of the TBM, which cause changes on outgoing signals, cannot be handled correctly by these variants: the TBM cannot force an update of the S-function if an event from the event queue occurs between two consecutive updates of the S-function, as these are determined by the inherited sample time.

Base Period It cannot be assumed that all driving Simulink components of a TBM have the same sample time. Therefore, it is natural to set up the sample time for each signal individually. In addition, it cannot be predicted which outgoing signal is affected by which incoming signal. Hence, the sample time of each outgoing signal must fit the sample times of all incoming signals. This sample time is the greatest common divisor (GCD) of all incoming signals' sample times. As the sample time of each incoming signal might have an offset, the calculation of the GCD is more complex. If the internal events of the TBM can be considered periodic, these periods are taken into account during the calculation. In this case the TBM has a so-called base period, which consists of the sample times of the incoming signals and the internal periods of the module. This base period is assigned to each outgoing signal. It is also possible to assign an offset to the base period; the offset corresponds to the smallest offset of the inherited sample times. This might help to improve the performance, as smaller values must be considered when calculating the GCD. The following equations illustrate that:

α = {p, o_1, ..., o_{k-1}, o_k, o_{k+1}, ..., o_n}   (5.1)

β = {p, o_1 - o_k, ..., o_{k-1} - o_k, o_{k+1} - o_k, ..., o_n - o_k}   (5.2)

gcd(β) ≥ gcd(α)   (5.3)

The set α contains the periods p and all offsets o_i. For the set β, the smallest offset o_k is excluded and all offsets are shifted by that minimal offset. Based upon the rules applying to the GCD, it can be proven that the GCD of β is greater than or equal to the GCD of α. Hence the GCD, and with it the sample time, is larger, and thus fewer updates of the S-function are required.
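Under the assumption that all periods and offsets are expressed as integer multiples of a common resolution (for instance the SystemC time resolution), the base-period computation of Eqs. (5.1)-(5.3) can be sketched as follows:

#include <algorithm>
#include <cstdint>
#include <numeric>   // std::gcd (C++17)
#include <vector>

// Sketch: periods and offsets as integer tick counts, which sidesteps
// GCD computations on floating-point sample times.
struct PortTiming { std::int64_t period, offset; };

// Base period over all inherited port sample times plus the internal TBM
// periods, following Eqs. (5.1)-(5.3): shift all offsets by the smallest
// one before folding them into the GCD. Assumes 'inputs' is non-empty.
PortTiming base_period(const std::vector<PortTiming>& inputs,
                       const std::vector<std::int64_t>& internal_periods) {
    std::int64_t o_min = inputs.front().offset;
    for (const auto& p : inputs) o_min = std::min(o_min, p.offset);
    std::int64_t g = 0;                       // gcd(0, x) == x
    for (const auto& p : inputs) {
        g = std::gcd(g, p.period);
        g = std::gcd(g, p.offset - o_min);    // shifted offsets (the set beta)
    }
    for (auto ip : internal_periods) g = std::gcd(g, ip);
    return {g, o_min};                        // assigned to every outgoing signal
}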

Table 5.2 Synchronization methods (overview of the variants Fixed, Variable, Inherited and Base Period, each rated as good, neutral or bad with respect to performance and to the handling of incoming signals, outgoing signals and internal periods)

Each of the evaluated variants has its own advantages and disadvantages; Table 5.2 gives a short overview. No variant can handle all scenarios that might occur. Moreover, due to limitations of Simulink, some combinations of sample times are not possible. For example, the preferred variant, where the incoming signals inherit their sample time and the outgoing signals have a variable sample time, is not supported. Therefore, the sample time of the S-function is chosen depending on the kind of the TBM. Five scenarios can be identified:

Scenario 1 The TBM has no incoming signals. In this case only internal events can occur. Therefore, the S-function gets a variable sample time assigned. The S-function uses the event queue of the encapsulated SystemC module to predict the time the next event occurs; this is the time the S-function needs an update (see the sketch after this list).

Scenario 2 If Scenario 1 applies but a solver supporting a variable sample time is not available, a fixed sample time is used. The sample rate is chosen by the developer. Since information about the internal behavior of the SystemC module is required in order to determine an appropriate sample rate, this cannot be done automatically.

Scenario 3 The TBM has incoming signals but no internal period. Since there are no internal periods, the TBM only reacts on incoming signals. Therefore, the inherited sample time is chosen. In order to allow components that operate on different sample times to be connected to the S-function, the sample time is inherited for each port individually. Outgoing signals are updated whenever an incoming signal changes.

Scenario 4 The TBM has incoming signals and an internal period. In this case, the incoming signals inherit their sample time. Based upon the sample times of the incoming signals and the internal rates, the base period is calculated and assigned to the outgoing signals.

Scenario 5 This scenario is a special case of Scenario 4: the module does have incoming signals and internal processes, but no outgoing signals. In addition to the sample times inherited from the incoming signals, an additional sample time must be specified that covers the internal processes of the module. Due to restrictions of Simulink it is necessary to add an outgoing port to the S-function. This port allows the specification of the additional sample time. It does not carry any data and should be terminated within the Simulink model.
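For Scenario 1, the mapping of the SystemC event queue to a variable sample time might look as follows. This sketch extends the wrapper skeleton shown earlier; next_event_time() stands for a hypothetical query of the encapsulated module's event queue, and the maximum interval is an assumed example value:

// Sketch: variable sample time driven by the SystemC event queue.
static void mdlInitializeSampleTimes(SimStruct *S) {
    ssSetSampleTime(S, 0, VARIABLE_SAMPLE_TIME);
    ssSetOffsetTime(S, 0, 0.0);
}

// Hypothetical helper: absolute time of the next SystemC event, in seconds.
extern double next_event_time(void);

#define MDL_GET_TIME_OF_NEXT_VAR_HIT
static void mdlGetTimeOfNextVarHit(SimStruct *S) {
    const time_T max_interval = 1e-9;   // assumed bound, cf. Sect. 5.4.1
    time_T t_next = (time_T)next_event_time();
    time_T t_max  = ssGetT(S) + max_interval;
    // Update at the next internal event, but at least every max_interval,
    // so that incoming signals are still sampled at an adequate rate.
    ssSetTNext(S, (t_next < t_max) ? t_next : t_max);
}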

5.4.2 Data Type Conversion

The second step when implementing the co-simulation is data type conversion. Conversion is done by the wrapper implicitly. Different types of conversion are available. The conversion is independent of the modules and may change from model to model; therefore, it is interchangeable by a runtime parameter, which can be set for each module individually. Since Simulink internally uses default C/C++ types, these can easily be mapped. The mapping of SystemC data types like bit vectors or fixed-point types is done in different ways. The easiest way is to convert them into the default Simulink type double. Logic and bit vectors can be treated as integer numbers, as long as the vectors are small enough. A more advanced conversion uses the Simulink Fixed Point extension [7], which also provides a sufficient API. This extension provides data types commonly used in chip development. Each conversion has a built-in value check, in order to assure that only valid values are received from the driving blocks. More conversions are conceivable and can be easily implemented, due to the provided class structure.
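One such interchangeable conversion policy can be sketched in C++ with the standard SystemC data types; the value checks shown are deliberately crude placeholders for the built-in checking described above, not the actual wrapper code:

#include <stdexcept>
#include <systemc>

// Fixed-point <-> double conversion with a built-in value check.
template <int W, int I>
double fixed_to_double(const sc_dt::sc_fixed<W, I>& v) {
    return v.to_double();
}

template <int W, int I>
sc_dt::sc_fixed<W, I> double_to_fixed(double v) {
    sc_dt::sc_fixed<W, I> f = v;   // quantized per the type's configuration
    if (f.to_double() != v)        // crude check: flag any quantization loss
        throw std::invalid_argument("value not representable");
    return f;
}

// Logic vectors can be treated as integers as long as they are small enough.
template <int W>
double lv_to_double(const sc_dt::sc_lv<W>& v) {
    static_assert(W <= 53, "vector too wide for the mantissa of double");
    if (!v.is_01())                // X or Z cannot be mapped onto a number
        throw std::invalid_argument("logic vector contains X/Z");
    return static_cast<double>(v.to_uint64());
}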

5.5 Evaluation

Our approach has been evaluated twice. First, the correctness of the implementation is verified. Secondly, the extended flow has been applied to an already existing benchmark, in order to prove the assumed benefits of the flow.

5.5.1 Implementation

Example systems have been modeled in order to verify the correctness of the implementation. Each example covers one of the scenarios described in the previous section.

Scenario 1 The TBM generates a number of signal changes. The interval between two consecutive changes is a randomized value with a lower bound of 1 ns. The time the next event is scheduled for is the time the S-function should be updated. During the next update, the TBM determines the difference between the current Simulink time and the expected SystemC time. A large number of events has been simulated in a single simulation. The results are shown in Fig. 5.5a.

(a) normal scaling (b) logarithmic scaling
Fig. 5.5 Evaluation of Scenario 1

The figure shows that the simulation was not free from errors. The maximal difference is, however, ten orders of magnitude smaller than the smallest distance between two consecutive signal changes. The errors have a double-logarithmic behavior, as shown in Fig. 5.5b. This leads to the conclusion that the error results from calculation errors that occur when comparing the Simulink and the SystemC time. The calculation error is caused by limitations of the data type double used for representing time in both environments.

Scenario 2 This scenario is similar to Scenario 1. Since no variable-step solver is available, a fixed sample time is used. The sample time is chosen in a way that the smallest distance between two consecutive events can be handled appropriately. Thus an oversampling is possible, which has a negative influence on the performance of the simulation. The occurring error is similar to the error in Scenario 1 and is also based on the aforementioned calculation errors.

Scenario 3 This model consists of a TBM directly driven by two other components generating stimuli. Each component drives two signals, and both components perform the same operation. The value of the first signal set by the component corresponds to the Simulink time the signal was set. The second signal contains a simple counter, which is incremented each time the first signal is changed. This allows missed signal changes to be counted. The components operate on different sample times. The TBM should be updated each time one of the incoming signals changes. On an update, the TBM calculates the difference between the value of the time signal that had changed and the current time in Simulink. As the update should occur at the same time the signal has been written, these values should be equal; the counter should have been incremented by one since the last update. As expected, no errors occurred and no signal changes were missed at all.

Scenario 4 The TBM combines the tests from Scenarios 1 and 3. The errors caused by the incoming signals and the internal period are logged separately. The evaluation of the results shows that no errors occur and no signal changes are missed when synchronizing the incoming signals. Synchronizing the internal period of the TBM leads to errors similar to the errors shown in Fig. 5.5.

Scenario 5 The calculated base period equals exactly the one from Scenario 4. As the TBM in this scenario has no outgoing ports, the base period is assigned to the dummy port, which has been added. As expected, the occurring errors match exactly the errors that had occurred in Scenario 4.
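The double-related error blamed in Scenarios 1 and 2 is easy to reproduce in isolation. The following small C++ program, with assumed example values, compares an event time accumulated tick by tick with the same instant computed in one step; both results are doubles, but the accumulated value picks up one rounding error per addition:

#include <cstdio>

int main() {
    const double tick = 1e-15;            // 1 fs time resolution
    const long long n = 16600;            // 16.6 ps expressed in ticks
    double accumulated = 0.0;
    for (long long i = 0; i < n; ++i)
        accumulated += tick;              // built up addition by addition
    double direct = n * tick;             // computed in a single step
    // Typically prints a tiny non-zero difference: 1e-15 has no exact
    // binary representation, so each addition rounds slightly.
    std::printf("difference = %.3e s\n", accumulated - direct);
    return 0;
}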

To sum up, it has been shown that the errors measured in Scenarios 1, 2, 4 and 5 are not caused by the implementation; they are induced by the characteristics of the floating-point representation. All scenarios have shown the expected results. Thus, the implementation can be considered correct.

5.5.2 Extended Verification Flow

In order to evaluate the extended verification flow, an example Simulink model has been implemented. To achieve impartiality, an existing model [4] has been reimplemented. This model implements an overhead crane mounted on a track. The crane carries a load that is connected by means of a free-running cable. The whole system is shown in Fig. 5.6. The crane contains sensors, a diagnosis unit and a control unit. The control unit is the design to be verified. Sensor values, the behavior of the load and the positions of car and load are modeled as differential equations using default Simulink components. TBMs implement the job control, the external force f_d, as well as the possibility to overwrite the value of the α-sensor. The test bench was implemented twice. First, it was built using pure Simulink components, as it would have been done in a conventional flow; the test cases are implemented as simple tables containing time/value tuples, and verification is done by manually comparing the waveform with the implicitly expected behavior. Secondly, the TBMs have been rewritten as test bench modules using SystemC according to the extended test bench concept; the TCs are realized as the aforementioned command files. Additionally, the second design has been enhanced by an extended TBM that verifies the behavior. Thus, this TBM allows the automatic verification of the model, superseding a manual verification. Using SystemC within a Simulink model influences the speed of the simulation. We chose the time simulated divided by the time required for the simulation as a metric of performance.

Fig. 5.6 Crane with load

Table 5.3 Comparing both verification flows in respect of man hours

Task                                   Conventional   Extended
Implementing the test bench modules    1.5 h          2.5 h
Implementing the test case             1.0 h          0.5 h
Porting the test bench modules         2.0 h          0.5 h
Porting the test case                  1.5 h          0.0 h
Sum                                    6.0 h          3.5 h

Using SystemC-based TBMs slows down the simulation speed by a factor of about 2.33 for this benchmark. However, the factor depends on the number of SystemC-based TBMs and thus cannot be generalized. But not only the time required for simulation must be considered. Table 5.3 compares the times required for implementing the example using the conventional and the extended verification flow, respectively. Writing the TBMs as extended test bench modules takes longer than implementing them using Simulink components. Test cases, on the other hand, can be written in a much shorter time using the extended flow, as they can be described in a more natural way than using tables. While the conventional flow requires TBMs and TCs to be ported when switching from system level to RT level, the extended flow reuses the already existing TBMs and TCs. For the presented example, the extended flow saves about 41% of the time compared to the conventional flow: (6.0 h - 3.5 h)/6.0 h ≈ 0.42. The time for creating and porting the system model is not taken into account; it will decrease the percentage of the saved time, depending on the complexity of the model. For this example, only simple TBMs and TCs are necessary. If a more complex behavior of the TBMs is required, SystemC will have an advantage over Simulink: complex behavior can be implemented in a simpler way using SystemC, so implementing complex TBMs will take less time. Aside from the saved amount of time, the usage of the extended verification flow has provided other advantages: (i) due to the reuse of TBMs and TCs, there was no need to prove the equivalence of ported TBMs and TCs, and (ii) verification is done automatically rather than by inspecting the waveform manually.

5.6 Conclusion

In this chapter, the weaknesses of a conventional verification flow have been pointed out, and an extended test bench concept has been presented. This concept provides reusable TBMs and TCs, as well as an automatic verification of the system. It has also been shown how this concept has been applied to Simulink models and how the resulting extended verification flow has been implemented. The implementation addresses the required synchronization of both simulation environments as well as the data type conversion. Models covering each of the scenarios mentioned have been used to evaluate the implementation. The results of the evaluation show that the implementation is correct.

The whole verification flow has been evaluated using an example. It has been shown that the usage of the extended verification flow saves a significant amount of time during the development process. Reusing test bench modules and test cases preserves the consistency of the test bench and thus reduces the likelihood of errors. Verification is done automatically rather than by inspecting the waveform manually. Future development will focus on the verification of analog components, which had not been considered within this work and the presented verification flow. This will be addressed by the integration of AMS-capable tools into the flow.

References

1. Advanced Verification Methodology.
2. J.-F. Boland, C. Thibeault, and Z. Zilic. Using MATLAB and Simulink in a SystemC verification environment. In Proceedings of the Design and Verification Conference, DVCon, February.
3. R. Lissel and J. Gerlach. Introducing new verification methods into a company's design flow: an industrial user's point of view. In Design, Automation & Test in Europe Conference & Exhibition, DATE'07, pages 1-6, April 2007.
4. E. Moser and W. Nebel. Case study: system model of crane and embedded control. In Proceedings of the Conference on Design, Automation and Test in Europe, page 721.
5. Open Verification Methodology.
6. Simulink Design Verifier.
7. Simulink Fixed Point.
8. Simulink Real-Time Workshop.
9. Simulink Verification and Validation.
10. SystemC Library.
11. Universal Reuse Methodology.
12. Verification Methodology Manual for SystemVerilog.

Part II Languages for Heterogeneous System Design

Chapter 6
VHDL-AMS Implementation of a Numerical Ballistic CNT Model

Dafeng Zhou, Tom J. Kazmierski and Bashir M. Al-Hashimi

Abstract This contribution presents a VHDL-AMS implementation of a novel numerical carbon nanotube transistor (CNT) modeling approach which relies on a flexible and efficient cubic spline non-linear approximation of the non-equilibrium mobile charge density. The underlying algorithm provides a rapid and accurate solution of the numerical relationship between the charge density and the self-consistent voltage. This leads to a speed-up in the calculation of the current through the channel by about two orders of magnitude without losing much accuracy. The numerical approximation is accurate to within less than 1.5% of the normalized RMS error compared with a previously reported theoretical modeling approach. The proposed VHDL-AMS implementation has been used in simulations of a logic inverter in SystemVision to demonstrate the feasibility of applying the spline-based technique in the development of efficient and accurate CNT models for applications in circuit-level simulators.

Keywords VHDL-AMS · Ballistic transport · CNT model · Circuit-level simulation

6.1 Introduction

Transistors using carbon nanotubes are expected to become the basis of next-generation integrated circuits [1, 11]. These expectations are motivated by the growing difficulties in overcoming physical limits of silicon-based transistors fabricated using current technologies. A number of theoretical models have been created to describe the interplay between different physical effects within the nanotube channel and their effect on the performance of the device [3, 5-7, 10, 13]. The standard methodology of modeling carbon nanotube transistors (CNTs) is to derive the channel current from the non-equilibrium mobile charge injected into the channel when voltages are applied to the transistor terminals [11]. However, a common problem these models are facing is the complexity of calculating the Fermi-Dirac integral

T.J. Kazmierski
School of Electronics and Computer Science, University of Southampton, Southampton, SO17 1BJ, UK
e-mail: tjk@ecs.soton.ac.uk

and the non-linear algebraic equations which express the relationships between the charge densities and the current. Moreover, the channel current between the source and drain is affected not only by the non-equilibrium mobile charge in the nanotube but also by the charges present at the terminal capacitances, which further adds to the complexity of the current calculation and requires time-consuming iterative approaches. Recently, the standard theoretical methodology has been improved by approaches where the slow Newton-Raphson iterations and the numerical evaluation of the Fermi-Dirac integral are replaced by numerical approximations while still maintaining good accuracy compared with the theory. These new techniques use a piece-wise approximation of the charge densities, either linear [6] or non-linear [8], to simplify the numerical calculation. However, while both these approaches accelerate current calculations significantly, they are not flexible enough to allow the user to control the trade-off between modeling accuracy and implementation speed. Here we generalize our earlier piece-wise non-linear approach [8] and propose a cubic spline piece-wise approximation of the non-equilibrium mobile charge density. We develop a very accurate technique where an accuracy better than 1.5% in terms of average RMS error can be achieved with just a 5-piece spline, which compares favorably with the 5% obtained by the simple non-linear approximation [8]. The spline-based approach still achieves a speed-up of around two orders of magnitude compared with a reported implementation of the theoretical model [12] and allows an easy trade-off between accuracy and speed. The spline approximation is not only capable of describing the performance of ideal ballistic CNT models, but is also extendable with non-ballistic effects. The model has been implemented and tested in MATLAB and VHDL-AMS. As an example, we show how our VHDL-AMS model can be used to simulate a CMOS-like inverter made of two complementary CNTs. This illustrates the feasibility of using this novel model in circuit-level simulators for future logic circuit analysis.

6.2 Mobile Charge Density and Self-Consistent Voltage

When an electric field is applied between the drain and the source of a CNT, a non-equilibrium mobile charge is generated in the carbon nanotube channel. It can be described as follows [9, 11, 15]:

Q = q(N_S + N_D - N_0)   (6.1)

where N_S is the density of positive velocity states filled by the source, N_D is the density of negative velocity states filled by the drain, and N_0 is the equilibrium electron density. These densities are determined by the Fermi-Dirac probability distribution:

N_S = (1/2) ∫ D(E) f(E - U_SF) dE   (6.2)

N_D = (1/2) ∫ D(E) f(E - U_DF) dE   (6.3)

N_0 = ∫ D(E) f(E - E_F) dE   (6.4)

where D(E) is the density of states, f is the Fermi probability distribution, E represents the energy levels per nanotube unit length, and U_SF and U_DF are defined as

U_SF = E_F - qV_SC   (6.5)

U_DF = E_F - qV_SC - qV_DS   (6.6)

where E_F is the Fermi level, q is the electronic charge and V_SC represents the self-consistent voltage [11], whose presence in these equations illustrates that the CNT energy band is affected by the external terminal voltages. The self-consistent voltage V_SC is determined by the terminal voltages and the charges at the terminal capacitances through the following non-linear algebraic equation [6, 11]:

V_SC = -(Q_t + qN_S(V_SC) + qN_D(V_SC) - qN_0)/C   (6.7)

where Q_t represents the charge stored in the terminal capacitances and is defined as

Q_t = V_G C_G + V_D C_D + V_S C_S   (6.8)

where C_G, C_D, C_S are the gate, drain, and source capacitances respectively, and the total terminal capacitance C is

C = C_G + C_D + C_S   (6.9)

6.3 Numerical Piece-Wise Approximation of the Charge Density

The standard approach to the solution of Eq. (6.7) is to use the Newton-Raphson iterative method and in each iteration evaluate the integrals in Eqs. (6.2) and (6.3) to obtain the state densities N_D and N_S. This approach has been proved effective in CNT transistor modeling [6, 12]. However, the iterative computation and repeated integrations consume immense CPU resources and are thus unsuitable for circuit simulation. Our earlier work [8] proposed a piece-wise non-linear approximation technique that eliminates the need for these complex calculations. It calculates the charge densities and the self-consistent voltage by dividing the continuous density function into a number of linear and non-linear pieces which together compose a fitting approximation of the original charge density curve. The V_SC equation (6.7) is then simplified to a group of linear, quadratic and cubic equations, which can be solved easily and quickly. Although this approach has been shown to be efficient and accurate [8], its weakness is that it requires an optimal fitting process when deciding on the number of

approximation pieces and the intervals of the ranges, which makes the model inflexible and awkward to use. Here we propose to use a cubic spline piece-wise approximation to overcome these difficulties. For a set of n (n ≥ 3) discrete points (x_0, y_0), (x_1, y_1), ..., (x_{i+1}, y_{i+1}) (i = 0, 1, ..., n - 2), cubic splines can be constructed as follows [2]:

y = A y_i + B y_{i+1} + C ÿ_i + D ÿ_{i+1}   (6.10)

where A, B, C and D are the coefficients for each piece of the cubic spline and ÿ denotes the second derivative of y. For simple demonstration here, the horizontal interval between every two neighboring points is equal to h, i.e. x_1 - x_0 = x_2 - x_1 = ... = x_{i+1} - x_i = h. Therefore, the cubic spline coefficients can be expressed as functions of x:

A ≡ (x_{i+1} - x)/(x_{i+1} - x_i) = (x_{i+1} - x)/h   (6.11)

B ≡ 1 - A = (x - x_i)/(x_{i+1} - x_i) = (x - x_i)/h   (6.12)

C ≡ (1/6)(A³ - A)(x_{i+1} - x_i)²   (6.13)

D ≡ (1/6)(B³ - B)(x_{i+1} - x_i)²   (6.14)

These equations show that A and B are linearly dependent on x, while C and D are cubic functions of x. To derive the y(x) expression, the second-order derivatives of y have to be computed via a tridiagonal system:

T (ÿ_0, ÿ_1, ..., ÿ_{n-1})^T = (6/h²) (y_2 - 2y_1 + y_0, y_3 - 2y_2 + y_1, ..., y_{n-1} - 2y_{n-2} + y_{n-3})^T   (6.15)

where T is the tridiagonal matrix expressing the continuity of the first derivatives at the knots. Now that the cubic spline coefficients and the second derivatives have been obtained, the function of each spline piece can be derived, with the coefficients a_i, b_i, c_i and d_i calculated by using Eqs. (6.11), (6.12), (6.13), (6.14) and (6.15):

y_i = a_i x³ + b_i x² + c_i x + d_i   (6.16)

The two linear regions that extend the cubic splines on both sides can be described as follows:

y = y_n   (x > x_n)   (6.17)

y = a_l x + b_l   (x < x_0)   (6.18)

where a_l = ẏ_0 = 3a_0 x_0² + 2b_0 x_0 + c_0 and b_l = y_0 - a_l x_0. To demonstrate the performance of this approach, we have compared the speed and accuracy of an example model with the results of other reported approaches.
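For equidistant knots and natural boundary conditions (consistent with the linear extension regions above), the second derivatives of Eq. (6.15) follow from a standard tridiagonal solve. The chapter derives the coefficients with a MATLAB script; the following C++ sketch is offered only as an illustration of the same construction:

#include <vector>

// Natural cubic spline on equidistant knots x_i = x0 + i*h (Eqs. 6.10-6.15):
// solve the tridiagonal system (1, 4, 1) for the second derivatives ypp by
// the Thomas algorithm, with ypp[0] = ypp[n-1] = 0 (natural ends).
std::vector<double> second_derivatives(const std::vector<double>& y, double h) {
    const int n = static_cast<int>(y.size());
    std::vector<double> ypp(n, 0.0), c(n, 0.0), d(n, 0.0);
    // Interior equations: ypp[i-1] + 4*ypp[i] + ypp[i+1]
    //                     = (6/h^2) * (y[i+1] - 2*y[i] + y[i-1])
    for (int i = 1; i < n - 1; ++i) {
        double m = 4.0 - c[i - 1];              // forward elimination
        c[i] = 1.0 / m;
        d[i] = (6.0 / (h * h) * (y[i + 1] - 2.0 * y[i] + y[i - 1])
                - d[i - 1]) / m;
    }
    for (int i = n - 2; i >= 1; --i)            // back substitution
        ypp[i] = d[i] - c[i] * ypp[i + 1];
    // Expanding Eq. (6.10) with these ypp values then yields the per-piece
    // polynomial coefficients a_i..d_i of Eq. (6.16).
    return ypp;
}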

Fig. 6.1 Piece-wise cubic spline approximation with n = 4 (circled line) of the mobile charge compared with the theoretical result (solid line)

6.4 Performance of Numerical Approximations

An example model which uses three cubic splines, n = 4, and two linear pieces at the ends was compared with the theoretical curves calculated from Eqs. (6.2) and (6.3) correspondingly. To solve the resulting 3rd-order polynomial equations, Cardano's method [4] is applied to determine the appropriate root which represents the correct value of V_SC. According to the ballistic CNT transport theory [11, 12], the drain current caused by the transport of the non-equilibrium charge across the nanotube can be calculated using the Fermi-Dirac statistics as follows:

I_DS = (2qkT/πℏ) [F_0(U_SF/kT) - F_0(U_DF/kT)]   (6.19)

where F_0 represents the Fermi-Dirac integral of order 0, k is Boltzmann's constant, T is the temperature and ℏ is the reduced Planck constant. Since the self-consistent voltage V_SC is directly obtained from the spline model, the evaluation of the drain current poses no numerical difficulty, as the energy levels U_SF, U_DF can be found quickly from Eqs. (6.5), (6.6) and I_DS can be calculated using:

I_DS = (2qkT/πℏ) [log(1 + e^{(E_F - qV_SC)/kT}) - log(1 + e^{(E_F - q(V_SC + V_DS))/kT})]   (6.20)

These calculations are direct and therefore very fast, as there are no Newton-Raphson iterations or integrations of the Fermi-Dirac probability distribution.
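Cardano's method itself reduces each spline piece's cubic to a depressed form and computes its real roots in closed form. A hedged C++ sketch of that root computation (the substitution x = t - b/(3a) and the selection of the physically meaningful root are assumed to be handled by the caller):

#include <cmath>

// One real root of the depressed cubic t^3 + p*t + q = 0 via Cardano's
// formula. A general cubic a*x^3 + b*x^2 + c*x + d = 0 reduces to this
// form with the substitution x = t - b/(3a).
double cardano_real_root(double p, double q) {
    double disc = q * q / 4.0 + p * p * p / 27.0;
    if (disc >= 0.0) {                        // exactly one real root
        double s = std::sqrt(disc);
        return std::cbrt(-q / 2.0 + s) + std::cbrt(-q / 2.0 - s);
    }
    // Three real roots ("casus irreducibilis"): trigonometric form; the
    // root inside the physical V_SC range must then be selected.
    double r = std::sqrt(-p / 3.0);
    double phi = std::acos(3.0 * q / (2.0 * p * r));   // = acos(-q/(2 r^3))
    return 2.0 * r * std::cos(phi / 3.0);
}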

Table 6.1 Average CPU time comparison between different models

Loops   FETToy   3-piece PWNL Model   4-piece PWNL Model   CS Model n = 4   CS Model n = 5
                 0.02 s               0.06 s               0.57 s           0.95 s
                 0.04 s               0.12 s               1.15 s           1.91 s
                 0.19 s               0.56 s               5.82 s           9.59 s
                 0.38 s               1.12 s

Table 6.2 Average RMS errors in piece-wise and cubic spline approximations for a 1 nm nanotube at E_F = 0.32 eV and T = 300 K

V_G [V]   3-piece PWNL Model   4-piece PWNL Model   CS Model n = 4   CS Model n = 5
                               2.0%                 1.3%             0.9%
                               1.7%                 1.0%             0.8%
                               1.4%                 0.8%             0.6%
                               1.0%                 0.6%             0.5%
                               1.2%                 0.9%             0.7%
                               1.6%                 1.1%             1.0%

For performance comparison, we have also tried a 4-piece cubic spline approximation (with n = 5), which is expected to be more accurate but slower than the first model. Table 6.1 shows the average CPU times for both models and those from FETToy [12] and the previously reported piece-wise models [8], while Table 6.2 compares the accuracy of both numerical model types. It can be seen from Tables 6.1 and 6.2 that although the spline models sacrifice some speed compared with the simple piece-wise non-linear models, they are still more than two orders of magnitude faster than FETToy. They also achieve a much better accuracy than the simple piece-wise non-linear models. The extent to which the modeling accuracy was compromised by the numerical approximation was measured by calculating average RMS errors in the simulations; the results are shown in Table 6.2. As expected, the spline models are more accurate, with errors not exceeding 1.0% at T = 300 K and E_F = 0.32 eV throughout the typical ranges of drain voltage V_DS and gate bias V_G. Figure 6.2 shows the I_DS characteristics calculated by FETToy compared with the 3-piece spline model. The performance of this approach can be affected by the values of E_F, T, d and the terminal voltages. The choice of the number of cubic spline approximation pieces is an obvious trade-off between speed and accuracy, as slightly more operations need to be performed with more pieces while the shape of the mobile charge curve is reflected more accurately.

Fig. 6.2 Drain current characteristics at T = 300 K and E_F = 0.32 eV for FETToy (solid lines) and a 3-piece cubic spline approximation (circled lines)

Fig. 6.3 Schematics of the simulated inverter

6.5 VHDL-AMS Implementation

The proposed approach has been used to implement both n-type-like and p-type-like CNT transistor models in VHDL-AMS and to simulate the CMOS-like inverter shown in Fig. 6.3. The bulk voltage was also considered, to take into account the effects on the charge densities generated by the substrate voltage. This is especially important for the p-type-like transistor. Figure 6.4 shows the I_DS characteristics of the n-type-like transistor implemented in VHDL-AMS, which match closely the MATLAB calculations shown in Fig. 6.2.

Fig. 6.4 VHDL-AMS simulation results on drain current characteristics at T = 300 K and E_F = 0.32 eV for a 3-piece cubic spline model

Fig. 6.5 Inverter simulation result; input ramps from 0 V to 0.6 V

The VHDL-AMS test bench for the inverter invokes the two transistors as well as a ramp voltage source and a constant voltage source. The constant source provides the supply voltage V_CC for the gate, while the ramp source was used to produce the output characteristic of the inverter. The simulation result is shown in Fig. 6.5. Considering that the transport characteristics of both transistors are not the same, it is worth noting that the inverter output is not symmetrical at V_CC/2, due to the stronger n-type-like transistor. The VHDL-AMS code of the transistor top model is shown below.

-- VHDL-AMS model of CNT transistor I-V characteristics using cubic
-- spline approximation of S/D charge densities
-- (c) Southampton University 2008
-- Southampton VHDL-AMS Validation Suite
-- Authors: Dafeng Zhou, Tom Kazmierski and Bashir M. Al-Hashimi
--   School of Electronics and Computer Science
--   University of Southampton
--   Highfield, Southampton SO17 1BJ, United Kingdom
--   e-mail: dz05r@ecs.soton.ac.uk, tjk@ecs.soton.ac.uk
-- Created: 17 October 2007
-- Last revised: November 2008 (by Dafeng Zhou)
-- Description: This is a fast numerical model of ballistic transport in
--   carbon nanotube transistors. The default value of the Ef_i parameter
--   (Fermi level) produces n-type-like behavior; a p-type-like transistor
--   can be obtained by modifying the Fermi level. Package cntcurrent
--   provides the spline data and the body of function Fcnt which
--   calculates the current Ids from the splines.
--
-- VHDL-AMS model of a ballistic CNT transistor

library IEEE;
use IEEE.math_real.all;
use IEEE.electrical_systems.all;
library work;
use work.cntcurrent.all;
use work.SolveVscEquation_pack.all;
use work.coeff_pack.all;

entity CNTTransistor is
  generic (
    -- model parameters
    T    : real    := 300.0;
    dcnt : real    := 1.0E-9;
    Ef_i : real    := 0.512E-19;  -- 0.32 eV
    xmax : real    := 0.2;
    xmin : real    := -0.5;
    n    : integer := 4);
  port (terminal drain, gate, source, bulk : electrical);
end entity CNTTransistor;

architecture Characteristic of CNTTransistor is
  -- terminal quantities
  quantity Vdi across drain  to bulk;
  quantity Vgi across gate   to bulk;
  quantity Vsi across source to bulk;
  quantity Ids through drain to source;
begin
  Ids == Fcnt(Vgi, Vsi, Vdi, Ef_i, T, dcnt, xmax, xmin, n);
end architecture Characteristic;

The coefficients of the cubic spline approximation pieces are derived using a MATLAB script which generates the text of the VHDL-AMS package coeff_pack. The generated package is included in the simulation. Combining Eqs. (6.7), (6.16), (6.17) and (6.18), a series of continuous linear and 3rd-order polynomial equations for the self-consistent voltage is derived using the following equations:

N_D(V_SC) = N_S(V_SC - V_DS)   (6.21)

V_SC = -{Q_t + q(a_i V_SC³ + b_i V_SC² + c_i V_SC + d_i) + q[a_j(V_SC - V_DS)³ + b_j(V_SC - V_DS)² + c_j(V_SC - V_DS) + d_j] - qN_0}/C   (6.22)

From Eq. (6.21), N_D(V_SC) can be treated as a shift of N_S(V_SC) along the x-axis, and the discrepancy between them is V_DS. It can be noticed from Eq. (6.22) that, when all the parameters are fixed, the value of V_SC is determined only by V_DS and the spline coefficients. For a given V_DS, the sum of N_S(V_SC) and N_D(V_SC) can be expressed as (a_i V_SC³ + b_i V_SC² + c_i V_SC + d_i) + (a_j(V_SC - V_DS)³ + b_j(V_SC - V_DS)² + c_j(V_SC - V_DS) + d_j), which consists of several regions based on the changing values of i and j, represented as QsRange and QdRange in an inner function, respectively. It can be seen that QsRange and QdRange only change when V_DS shifts from one of the spline pieces to another, and in total there are 2n + 1 regions for

the expression. Below are the combination coefficients of the different QsRange and QdRange values due to the shifting of V_DS. The package listed below contains the code of the function Fcnt, which solves the spline approximation of the V_SC equation (Eq. (6.22)) and evaluates the drain current.

-- Package cntcurrent

library IEEE;
use IEEE.math_real.all;
use IEEE.electrical_systems.all;
library work;
use work.SolveVscEquation_pack.all;
use work.FindQRange_pack.all;
use work.coeff_pack.all;

package cntcurrent is
  function Fcnt (Vgi, Vsi, Vdi, Ef_i, T, dcnt, xmax, xmin : real;
                 n : integer) return real;
end package cntcurrent;

package body cntcurrent is

  -- some physical constants:
  constant e0   : real := 8.854E-12;
  constant pi   : real := 3.14159_26535;
  constant t0   : real := 1.5E-9;
  constant L    : real := 3.0E-8;
  constant q    : real := 1.6E-19;
  constant k    : real := 3.9;
  constant acc  : real := 1.42E-10;
  constant Vcc  : real := 4.85E-19;  -- C-C bonding energy (about 3 eV)
  constant h    : real := 6.625E-34;
  constant hbar : real := 1.05E-34;
  constant KB   : real := 1.380E-23;

  function Fcnt (Vgi, Vsi, Vdi, Ef_i, T, dcnt, xmax, xmin : real;
                 n : integer) return real is
    variable EF, Vd, Vg, Vs, Vds, Ids, Ef_t, Efi, N0, c, Cox,
             Cge, Cse, Cde, Ctot, qc, qcn0, int, Vsc : real;
    variable yy : real_vector(0 to 1);
    variable QsRange, QdRange : integer;
  begin
    Cox := 2.0*pi*k*e0 / log((t0 + dcnt/2.0)*2.0/dcnt);
    Cge := Cox;
    Cse := Cox;
    Cde := Cox;

    Ctot := Cge + Cse + Cde;
    Efi  := Ef_i;
    if Efi > 0.0 then
      Ef_t := Efi;  Vd := Vdi;  Vs := Vsi;  Vg := Vgi;
    else
      Ef_t := -Efi; Vd := -Vdi; Vs := -Vsi; Vg := -Vgi;
    end if;
    EF   := Ef_t / q;
    N0   := E3;  -- equilibrium electron density
    c    := q*(Vg*Cge + Vs*Cse + Vd*Cde) / Ctot;
    Vds  := Vd - Vs;
    qc   := q*q / Ctot;
    qcn0 := qc*N0;
    int  := (xmax - xmin) / real(n - 1);
    -- Find the ranges of the Qs and Qd approximations where the solution
    -- of Vsc is located
    yy := FindQRange(Vds, q, c, qc, qcn0, xmax, xmin, int, n);
    QsRange := integer(yy(0));
    QdRange := integer(yy(1));
    -- Calculate Vsc using Cardano's method from the 3rd-order polynomial
    -- and linear equations
    Vsc := SolveVscEquation(Vds, q, c, qc, qcn0, QsRange, QdRange);
    -- Obtain the drain/source current
    if Efi < 0.0 then
      Ids := -4.0*q*KB*T/h * (log(1.0 + exp(q*(EF - Vsc)/KB/T))
                              - log(1.0 + exp(q*(EF - Vsc - Vds)/KB/T)));
    elsif Efi > 0.0 then
      Ids := 4.0*q*KB*T/h * (log(1.0 + exp(q*(EF - Vsc)/KB/T))
                             - log(1.0 + exp(q*(EF - Vsc - Vds)/KB/T)));
    else
      Ids := 0.0;
    end if;
    return Ids;
  end function Fcnt;

end package body cntcurrent;

6.6 Conclusion

This contribution proposes and investigates the numerical performance of cubic splines in the calculation of the CNT ballistic transport current, with the

109 98 D. Zhou et al. Ctot := Cge+Cse+Cde ; Efi := Ef_i ; if Efi > 0.0 then Ef_t:= Efi ;VD:= Vdi ;VS:= Vsi ;VG:= Vgi ; else Ef_t := Efi ;VD:=Vdi;VS:=Vsi ;VG:=Vgi; end i f ; EF:=Ef_t /q;vd:=vdi;vs:=vsi;vg:=vgi; N0 := E3; c := q (Vg Cge+Vs Cse+Vd Cde ) / Ctot ; Vds := Vd Vs ; qc := q q/ctot; qcn0 := qc N0 ; int := (xmax xmin ) / real (n 1) ; Find the ranges of Qs and Qs approximations where the solution of Vsc is located yy := FindQRange (Vds, q, c,qc,qcn0,xmax, xmin, int, n) ; QsRange := integer (yy(0)); QdRange := integer (yy(1)); Calculate the Vsc using Cardone s method from 3rd order polynomial and linear equations Vsc := SolveVscEquation (Vds,q,c,qC,qCN0,QsRange, QdRange ) ; Obtain the drain / source current if Efi <=0.0 then Ids := 4.0 q KB T/h (log(1.0+exp(q (EF Vsc ) /KB/T) ) log (1.0+exp(q (EF Vsc Vds)/KB/T))); elsif Efi >0.0 then Ids := 4.0 q KB T/h (log(1.0+exp(q (EF Vsc ) /KB/T) ) log (1.0+exp(q (EF Vsc Vds)/KB/T))); else Ids := 0.0; end i f ; return Ids ; end function Fcnt ; end package body cntcurrent ; 6.6 Conclusion This contribution proposes to use and investigates the numerical performance of cubic splines in numerical calculations of CNT ballistic transport current with the

11. A. Rahman, J. Guo, S. Datta, and M.S. Lundstrom. Theory of ballistic nanotransistors. IEEE Transactions on Electron Devices, 50(9).
12. A. Rahman, J. Wang, J. Guo, S. Hasan, Y. Liu, A. Matsudaira, S.S. Ahmed, S. Datta, and M. Lundstrom. FETToy 2.0 online tool, 14 February.
13. A. Raychowdhury, S. Mukhopadhyay, and K. Roy. A circuit-compatible model of ballistic carbon nanotube field-effect transistors. Applied Physics Letters, 23(10).
14. S. Wang and T.J. Kazmierski. Southampton VHDL-AMS validation suite, 18 October.
15. M.-H. Yang, K.B.K. Teo, L. Gangloff, W.I. Milne, D.G. Hasko, Y. Robert, and P. Legagneux. Advantages of top-gate, high-k dielectric carbon nanotube field-effect transistors. Applied Physics Letters, 88(11):113507, 2006.

Chapter 7
Wide-Band Sigma-Delta ADC Design in Superconducting Technology

R. Guelaz, P. Desgreys and P. Loumeau

Abstract This chapter presents a bandpass sigma-delta ADC design in superconducting technology. We study an architecture based on VHDL-AMS modeling of Josephson junctions. We proceed by analyzing the standard linear model and propose different hierarchical models based on a functional analysis. To simplify the comparator behavior and the simulations, we assimilate SFQ pulses to ideal rectangular pulses in order to specify the ADC performance. The Josephson junction modeling is based on the RCSJ electrical model written in VHDL-AMS. Each functional element of the ADC is dissociated and simulated to validate its behavior. The pulse duration is a physical parameter determined directly by the technology. We relate the ADC performance (SNR) to the form and duration of the comparator's SFQ pulses with normalized physical parameters.

Keywords ADC · Sigma-delta bandpass · Superconducting · VHDL-AMS

7.1 Introduction

Recent advances in RSFQ (Rapid Single Flux Quantum) technology promise the opportunity to obtain better performance than CMOS technology [1] for specific applications such as space telecommunications. A bandpass sigma-delta ADC [2] is an element that could be implemented in software radio communication systems [3] as a replacement for the classic time-interleaved architecture. Sigma-delta conversion is well known to be efficient because of its simple architecture, easy to implement and composed of basic elements. RSFQ technology exploits the sigma-delta advantages because of its capability to operate at several hundred GHz [4-7]. To realize an actual converter, we propose to map the classic sigma-delta ADC structure onto a converter using Josephson junctions, which principally compose the ADC comparator and the clock generation. To reproduce the sigma-delta principle, a feedback must be operated between output and input. The SFQ (Single Flux Quantum) pulses generated by the Josephson junctions of the balanced comparator are studied and quantized to reproduce this feedback effect. Pulse form and duration directly determine the ADC

P. Desgreys
L.T.C.I., CNRS-UMR 5141, Institut Telecom, Telecom-Paristech, Paris, France
e-mail: patricia.desgreys@telecom-paristech.fr

performance; the dependence is shown on the particular parameter SNR (signal-to-noise ratio). To simplify the ADC design and allow a rapid estimation of its performance, we assimilate the SFQ pulses to rectangular forms with area and duration in agreement with the real case. We study the effect of the pulses on the current across the resonator. The VHDL-AMS language [8] permits us, in this particular case, to reproduce the effect of the pulses in the system model with the use of the break statement. Previous work [9] has considered SFQ (Single Flux Quantum) pulses as Dirac forms, but the model can be improved by incorporating the pulse duration parameter.

7.2 Sigma-Delta Second-Order Architecture

7.2.1 Bandpass Sigma-Delta Modulator

The sigma-delta modulation principle is based on oversampling and noise shaping. Its usual block diagram representation is presented in Fig. 7.1. The principle relies on a feedback loop that permits a prediction based on the preceding information. In our case, we study a bandpass sigma-delta modulator. For a second-order ADC with 1-bit quantization, a simple comparator at the output generates the feedback with +1/-1 (normalized) values. The theoretical interpretation of the resonator is represented by Eq. (7.1) in the z-domain:

B(z) = z/(z² + 1)   (7.1)

The system can be interpreted as a linearized model where quantization is considered as a source of additive white noise Q(z). The interpretation of the output Y(z), given in Eq. (7.2), is the sum of the response due to the input and the response due to the quantization noise source:

Y(z) = STF(z) X(z) + NTF(z) Q(z)   (7.2)

with the signal transfer function STF(z) and the noise transfer function NTF(z) given by

STF(z) = B(z)/(1 + B(z)C(z)),   NTF(z) = 1/(1 + B(z)C(z))   (7.3)

Fig. 7.1 2nd-order sigma-delta classic architecture

Fig. 7.2 SNR_max as a function of OSR and modulator order N

With the oversampling principle, the modulator permits a gain of 1 in the input signal band with the least quantification error, so the signal-to-noise ratio (SNR) and hence the resolution are significantly improved. To estimate the ADC performance, we consider that the noise is superposed on the signal and that the SNR is the ratio between the signal power and the quantification noise power. The power spectral density of the quantization noise, PSD_N, given by Eq. (7.4), is the product of the squared magnitude of the modulator noise transfer function NTF and the quantification noise power spectral density PSD_Q:

PSD_N(f) = |NTF(f)|² PSD_Q = (2 + 2 cos(4πf/F_clk)) (q²/12)   (7.4)

with f the frequency and q the quantizer resolution. The SNR_max is expressed by

SNR_dB-max = 10 log₁₀ [3(2N + 1) OSR^{2N+1} / (2π^{2N})]   (7.5)

with n the quantizer bit resolution (here n = 1) and 2N the filter order in the case of a resonator. The SNR as a function of the oversampling ratio (OSR) and of the filter order in the general case is given in Fig. 7.2. For example, with N = 1, which corresponds to a second-order resonator, OSR = 128 leads to SNR_max = 60 dB.
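Numerically, for N = 1 and OSR = 128, Eq. (7.5) gives 10 log₁₀(3 · 3 · 128³/(2π²)) = 10 log₁₀(18 874 368/19.7) ≈ 10 log₁₀(9.6 · 10⁵) ≈ 59.8 dB, which rounds to the 60 dB quoted above.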

7.2.2 The Josephson Junction

Fig. 7.3 RCSJ Josephson junction electric model

Fig. 7.4 Current-voltage characteristic of an overdamped junction

The basis of RSFQ technology lies in the properties of the Josephson junction, a three-layer structure composed of two superconducting materials separated by a thin metallic layer (SNS). The principal property of the junction is its perfect voltage-oscillator behavior when it is polarized with a constant current above its critical current I_c; the oscillating frequency can reach several hundred GHz. The junction model is based on writing each branch of the RCSJ (Resistively and Capacitively Shunted Junction) circuit, composed of an ideal current source with the critical current I_c in parallel with a resistance R and a capacitance C, as illustrated in Fig. 7.3. The relation between the junction phase φ and the RCSJ model can be written as

I_sj = I_c sin(φ) + (φ_0/(2πR_n)) dφ/dt + (φ_0 C/(2π)) d²φ/dt²   (7.6)

where I_sj is the current across the junction and φ_0 is the flux quantum constant, equal to 2.07 µV/GHz. The junction's current-voltage characteristic (Fig. 7.4) is non-hysteretic; when the current is above the critical current value, the junction generates a voltage pulse. To reproduce a 2nd-order sigma-delta modulator, we isolate each functional element of the ADC, mainly the comparator and the clock generation, and we propose to implement these elements from Josephson-junction-based circuitry.
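The dynamics of Eq. (7.6) are easy to explore numerically. The following C++ sketch integrates the RCSJ equation with a simple explicit scheme and illustrative parameter values (not those of the actual design); a junction biased above I_c produces the expected periodic voltage pulses:

#include <cmath>
#include <cstdio>

int main() {
    const double phi0 = 2.07e-15;      // flux quantum [Wb]
    const double Ic = 300e-6;          // assumed critical current [A]
    const double Rn = 1.0, C = 0.32e-12;
    const double Ib = 1.2 * Ic;        // bias above the critical current
    const double dt = 1e-15;           // 1 fs step
    double phi = 0.0, dphi = 0.0;      // junction phase and its derivative
    for (long n = 0; n < 200000; ++n) {
        // Eq. (7.6) solved for the second derivative of the phase:
        double d2phi = (Ib - Ic * std::sin(phi)
                        - phi0 / (2.0 * M_PI * Rn) * dphi)
                       * 2.0 * M_PI / (phi0 * C);
        dphi += d2phi * dt;
        phi  += dphi * dt;
        double v = phi0 / (2.0 * M_PI) * dphi;   // junction voltage
        if (n % 1000 == 0)
            std::printf("%g %g\n", n * dt, v);   // time, voltage trace
    }
    return 0;
}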

7.2.3 The RSFQ Balanced Comparator

The comparator is composed of two junctions, JJ2 and JJ3. The decision instants are fixed by each clock front: when the sum of the currents is above the JJ2 critical current, a pulse is generated; otherwise junction JJ2 switches. To generate the needed modulator feedback, we use the voltage pulse generated at each clock front. To assimilate the comparator behavior to the classical +1/-1 feedback, we consider that the unipolar pulses, noted V_q(t), can be decomposed into the sum of a bipolar pulse train, noted V_q2(t), and a periodic pulse train, noted V_q1(t), presented in Fig. 7.5. The principal information is in V_q2(t); the periodic effect of V_q1(t) can be compensated by an offset in the comparator, which is fixed by the JJ2 critical current. Bulzacchelli [8] has proposed this simplified representation to study the impact of the pulses in a sigma-delta modulator.

7.2.4 Sigma-Delta Modulator Operation with Josephson Junctions

As seen in the theoretical analysis of the sigma-delta modulator, the heart of the operation is the feedback loop between the modulator output and input. In the classical architecture, the comparator generates +1/-1 (normalized) voltages synchronized with the clock. In a superconducting integration, we use an L-C resonator to realize the 2nd-order bandpass filter B(z), and we consider that the feedback effect can act directly on the current across the resonator thanks to the RSFQ comparator. In fact, the voltage pulse created at the comparator node is integrated by the L-C resonator. This results in a change of the L-C current by a value proportional to the pulse area. In the case of an RSFQ pulse, the area is always equal to φ_0 and the pulse duration is negligible (like a Dirac pulse); therefore the L-C current is decremented by a constant value. Finally, the current/voltage transposition reproduces the same behavior as presented in Fig. 7.1. To identify the sigma-delta principle in the RSFQ design, we consider that each SFQ pulse delivered by the comparator results in the addition or subtraction of a value ΔI_L to the resonator current I_L:

ΔI_L = φ_0/L   (7.7)

Fig. 7.5 Assimilation with use of Josephson junctions

Fig. 7.6 ADC modeling with a comparator generating rectangular bipolar pulses

The resolution of the circuit equation (Eq. (7.8)) when considering a rectangular pulse of duration τ for V_q2(t) is described by Eq. (7.9); p denotes the Laplace variable:

V_in(p) - V_JJ3(p) = I_L(p) (Lp + 1/(Cp)) = I_L(p) (1 + LCp²)/(Cp)   (7.8)

During the time τ ≪ T_clk, it can be demonstrated that the current I_L(t) has a linear variation:

ΔI_L(t) = (A/L) t = (φ_0/(2τ)) (1/L) t   (7.9)

where A = φ_0/(2τ) is the amplitude of the bipolar pulse.

7.2.5 System Modeling with VHDL-AMS

To model the ADC behavior, we consider two approaches. The first one supposes that the comparator's effect on the current across the L-C resonator is produced at each clock front with a fixed value ±φ_0/(2L). The simulation of these abrupt changes is made possible by the break statement. This VHDL-AMS instruction stops the analog simulator and initializes it again with a new value of the current I_L. This new value is determined at each clock pulse and is a function of the present value. The second modeling approach is to implement the pulse duration for rectangular pulse forms. In the physical case, pulses have a duration that is set by the technology; varying this parameter informs us about the impact of this technological parameter on the ADC operation.
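The first approach can be mimicked outside VHDL-AMS as well. The following C++ sketch, with illustrative component values only, integrates an ideal series L-C resonator and applies the instantaneous ±φ_0/(2L) current steps at each clock front, which is exactly the role played by the break statement in the VHDL-AMS model:

#include <cmath>
#include <cstdio>

int main() {
    const double phi0 = 2.07e-15;               // flux quantum [Wb]
    const double Fclk = 120e9, Tclk = 1.0 / Fclk;
    const double L    = 10e-12;                 // assumed inductance [H]
    const double f0   = Fclk / 4.0;             // resonator tuned to Fclk/4
    const double C    = 1.0 / (L * std::pow(2.0 * M_PI * f0, 2.0));
    const double dIL  = phi0 / (2.0 * L);       // feedback step, Sect. 7.2.5
    const double dt   = Tclk / 200.0;
    const double A    = 100e-6, fin = 29.9e9;   // small input near f0
    double iL = 0.0, vC = 0.0, t = 0.0, t_clk = Tclk;
    for (long n = 0; n < 100000; ++n, t += dt) {
        double vin = A * std::sin(2.0 * M_PI * fin * t);  // input voltage
        iL += dt * (vin - vC) / L;              // series L-C state update
        vC += dt * iL / C;                      // (semi-implicit Euler)
        if (t >= t_clk) {                       // clock front: 1-bit decision
            int bit = (iL > 0.0);
            std::printf("%d", bit);             // modulator bit stream
            iL += bit ? -dIL : dIL;             // abrupt step, like 'break'
            t_clk += Tclk;
        }
    }
    return 0;
}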

Fig. 7.6 ADC modeling with a comparator generating rectangular bipolar pulses

The resolution of the circuit equation (Eq. (7.8)), when considering a rectangular pulse of duration τ for V_q2(t), is described by Eq. (7.9). We denote by p the Laplace variable.

V_in(p) − V_JJ3(p) = I_L(p) (1/(Cp) + Lp) = I_L(p) (1 + LCp²)/(Cp)    (7.8)

During the time τ ≪ T_clk, it can be shown that the current I_L(t) has a linear variation:

I_L(t) = (A/L) t = (φ₀/(2τ)) (1/L) t    (7.9)

System Modeling with VHDL AMS

To model the ADC behavior, we consider two approaches. The first one supposes that the comparator effect on the current across the resonator LC is produced at each clock edge with a fixed value ±φ₀/(2L). The simulation of these abrupt changes is made possible by the break statement: this VHDL AMS instruction stops the analog simulator and re-initializes it with a new value of the current I_L. This new value is determined at each clock pulse and is a function of the present value. The second modeling approach is to implement a pulse duration with rectangular pulses. In the physical case, pulses have a duration that is set by the technology; varying this parameter informs us about the impact of this technological parameter on the ADC operation. The inductor-law derivation behind the current step is sketched below.
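The current step of Eq. (7.7) and the linear ramp of Eq. (7.9) both follow from the inductor law; a compact derivation in our own notation (assuming a rectangular pulse amplitude A = φ₀/(2τ) for the bipolar component) reads:

\[
\Delta I_L = \frac{1}{L}\int V_q(t)\,dt = \frac{\phi_0}{L},
\qquad
I_L(t) = \frac{1}{L}\int_0^{t} A\,dt' = \frac{A}{L}\,t = \frac{\phi_0}{2\tau}\,\frac{1}{L}\,t ,
\]

since every SFQ pulse carries the fixed area φ₀, while the bipolar component V_q2(t) of duration τ carries half of it.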

Fig. 7.7 SFQ comparator with clock stimulation

7.3 The Sigma Delta ADC Design

7.3.1 Clock and Comparator Design

As presented in Fig. 7.7, the RSFQ comparator is based on two junctions, one of which (JJ3) has the higher critical current. The comparison is made on the input current: if it is positive, an SFQ pulse is generated at the output; if it is negative, the output is unchanged. An SFQ pulse is interpreted as a logic 1; otherwise it is a logic 0. The clock is generated by biasing junction JJ1 with a constant current. JJ1's critical current is fixed approximately at a value corresponding to the sum of the JJ2 and JJ3 critical currents. The L4 inductance is set to a value ensuring that the comparison results appear between two SFQ clock pulses. I_p is the comparator bias current. In accordance with the Nb/AlOx/Nb technology with parameter R_n·I_c = 300 µV, we obtain a clock at 60 GHz (T_clk = 16.6 ps) with:

V_c = 1.15 mV
R_c = 1 Ohm
L_3 = 1 pH
JJ1: I_c = 542 µA, R_n = 300 µV/542 µA = 0.55 Ohm
C = φ₀I_c/(2πR_nI_c)² = 0.32 pF

A quick cross-check of these values is sketched below. We implemented this clock separately and obtained the simulation results shown in Figs. 7.8 and 7.9. Simulations are executed with the Simplorer V7 software. When the sum of the input current I_x and the current across junction JJ2 is above the JJ3 critical current, an SFQ pulse is generated at the comparator output.
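The junction parameter values listed above can be cross-checked in a few lines of C++ (a throwaway verification of the arithmetic, nothing more):

#include <cmath>
#include <cstdio>

int main() {
    const double phi0 = 2.07e-15;  // flux quantum, V*s
    const double RnIc = 300e-6;    // Nb/AlOx/Nb: Rn*Ic product, V
    const double Ic   = 542e-6;    // JJ1 critical current, A

    double Rn = RnIc / Ic;                                   // ~0.55 Ohm
    double C  = phi0 * Ic / std::pow(2.0 * M_PI * RnIc, 2);  // ~0.32 pF
    std::printf("Rn = %.2f Ohm, C = %.2f pF\n", Rn, C * 1e12);
    return 0;
}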

Fig. 7.8 Clock generated at the JJ1 junction for a fixed frequency of 60 GHz
Fig. 7.9 JJ3 comparator output with input I_x ∈ [−300 µA; +300 µA]

With an ideal input current I_x, we analyze the comparator output obtained at JJ2 and JJ3 with the following circuit values:

JJ1: I_c = 900 µA
JJ2: I_c = 300 µA
JJ3: I_c = 310 µA
V_c = 1.2 mV
R_c = 1 Ohm
L_3 = 1 pH
L_4 = 4 pH

Coupling the comparator to the clock requires adjusting the I_c and V_c values to obtain the same form for the clock. The symmetrical signal resulting from the JJ3 junction is presented in Fig. 7.9.

These simulation results show that when the input current is negative, there are no pulses, only small oscillations resulting from JJ2 switching. In this simulation we are at the frequency limit for proper comparator operation. This is visible because there is a small pulse before the switching of JJ3, and because the pulses do not all have the same level when JJ3 switches. For the last part of our study, the SFQ pulses generated by junction JJ3 are approximated by rectangular pulses with the same area and duration; these parameters are controlled by fabrication and technology.

7.4 Simulation Results

To illustrate the sigma delta bandpass ADC operation, we implement the structure of Fig. 7.6, written in VHDL AMS, in the Smash (Dolphin) software. The converter is specified for a system clock fixed at 120 GHz. Simulations are done for an oversampling ratio of 120 and a bandwidth of 500 MHz. We specifically study the SNR as a function of the pulse duration generated by the comparator. The pulse duration ranges from 1 fs to 2.5 ps. Figure 7.10 shows the effect of the pulses on the current and the modulator output. At each clock edge we observe a step, as demonstrated in theory; it respects the current sign and is constant. The effect of the pulse duration must be taken into account and is illustrated in Fig. 7.11. When the pulse duration is not negligible, the evolution of the input signal during the pulse modifies the step value: in that figure, we can observe two different steps, noted 1 and 2, so quantization is not performed correctly. The conclusion is that using SFQ pulses to reproduce sigma delta operation has a frequency limit. To estimate this limit for a specified resolution, we simulate the impact of the pulse duration on the SNR, as shown in Fig. 7.12. This result shows that the SNR is a quasi-linear function of the comparator pulse duration. In particular, if the duration is under 2.2 ps, the SNR obtained is in the range [50–55] dB, which corresponds to an 8-bit resolution.

Fig. 7.10 Simulation results with the effect of ideal pulses on the current

Fig. 7.11 Simulation with the pulse duration effect
Fig. 7.12 SNR as a function of the pulse duration for an oversampling ratio of 120

The best results, around 55 dB, are obtained when τ ≤ T_clk/100, where T_clk is the clock period. Now we simulate the real circuit implemented with Josephson junctions described in VHDL AMS; we simulate the complete ADC, and the results are shown in Figs. 7.13 and 7.14. The simulation results show firstly the alternating switching of the comparator junctions. The clock is generated by junction JJ1 and results in constant pulses. The most difficult aspect of the design is to avoid the perturbation generated by the output switching: the decision instant must lie strictly between two SFQ clock pulses. At the limiting working frequency, as shown in the present simulation, the steps on the current differ slightly from one another, but we measure approximately ΔI_L = φ₀/L. The SNR is obtained from the spectral analysis of the output voltage. For a Blackman windowing, we obtain the result illustrated in Fig. 7.15.

Fig. 7.13 Sigma delta modulator operating at 60 GHz
Fig. 7.14 Comparison with real SFQ pulses

The noise transfer function is highlighted in this result, with its characteristic shape rejecting the noise outside the band of interest (around 15 GHz). The signal is identified at the particular frequency of 15.1 GHz. In this example, the SNR obtained is 54 dB.

7.5 Conclusion

In this work, we proposed an approach to estimate the performance of a superconducting sigma delta bandpass ADC in RSFQ technology. We assimilate the comparator behavior to a pulse generator parameterized by the pulse duration, and we proceed by successive modeling steps to capture the specific effects of using SFQ pulses to implement a sigma delta modulator. We showed that increasing the comparator pulse duration implies a quasi-linear decrease of the SNR. In a next step, simulations of the complete circuit with Josephson junctions should rapidly give the ADC performance.

Fig. 7.15 Output spectral analysis

References

1. O.A. Mukhanov, V.K. Semenov, et al. High-resolution ADC operating at 19.6 GHz clock frequency. Superconductor Science and Technology, 14, 2001.
2. R. Schreier, G.C. Temes. Understanding Delta-Sigma Data Converters. Wiley Interscience, New York, 2005.
3. P. Loumeau, L. Naviner, and J.F. Naviner. Analog-to-digital conversion for software radio. In G. Vivier, editor, Reconfigurable Mobile Radio Systems: A Snapshot of Key Aspects Related to Reconfigurability in Wireless Systems. Iste Publishing Company, London.
4. E. Baggetta, R. Setzu, J.-C. Villégier, and M. Maignan. Implementation of basic NbN RSFQ logic gates of a wide-band sigma delta modulator. In Applied Superconductivity Conference, Seattle.
5. E. Baggetta, R. Setzu, and J.-C. Villégier. Study of SNS and SIS NbN Josephson junctions coupled to a microwave band-pass filter. Journal of Physics: Conference Series, 43, 2006.
6. K.K. Likharev, V.K. Semenov. RSFQ logic/memory family: A new Josephson-junction technology for sub-terahertz-clock-frequency digital systems. IEEE Transactions on Applied Superconductivity, 1(1):3–28, 1991.
7. P. Febvre, J.-C. Berthet, D. Ney, A. Roussy, J. Tao, G. Angenieux, N. Hadacek, and J.-C. Villegier. On-chip high-frequency diagnostic of RSFQ logic cells. IEEE Transactions on Applied Superconductivity, 11(1), 2001.
8. IEEE Std 1076.1: IEEE standard VHDL analog and mixed signal extensions. SH94731, IEEE, 1999.
9. J. Bulzacchelli. A superconducting bandpass delta-sigma modulator for direct analog-to-digital conversion of microwave radio. Ph.D. thesis, Massachusetts Institute of Technology, 2003.

Chapter 8
Heterogeneous and Non-linear Modeling in SystemC AMS

Ken Caluwaerts and Dimitri Galayko

Abstract This contribution presents a SystemC AMS model of a mixed non-linear, strongly coupled and multidomain electromechanical system designed to scavenge the energy of ambient vibrations and to generate an electrical supply for an embedded microsystem. The system operates in three domains: purely mechanical (the resonator), coupled electromechanical (the electrostatic transducer associated with the moving mass) and the electrical circuit, including switches, diodes and linear electrical components with varying parameters. The modeling difficulties related to the non-linear discontinuous behavior of the system and to the limitations of SystemC AMS are resolved by the simultaneous use of the two modeling domains available in SystemC AMS: the one allowing SDF (Synchronous Data Flow) modeling and the one allowing LIN ELEC (linear electrical) circuit analysis. The modeling results are compared with VHDL AMS and Matlab Simulink models.

Keywords Vibration energy harvesting/scavenging · Capacitive transducer · Flyback · Charge pump · SystemC AMS · Heterogeneous modeling · Non-linear modeling · SDF

8.1 Introduction

Harvesters of mechanical energy are complex heterogeneous multiphysics systems that are difficult to describe analytically. They are non-linear and time-variable and exhibit complex and discontinuous behavior. Hence, a design optimizing their energetic performance requires reliable and, if possible, simple models. This work focuses on modeling harvesters which use capacitive transducers for the energy conversion from mechanical into electrical form. A typical capacitive harvester includes (Fig. 8.1):

a mechanical resonator allowing an accumulation of the mechanical energy,
an electrostatic (capacitive) transducer, with one electrode attached to a mobile mass and the other fixed to the system which is submitted to external vibrations,
a conditioning electrical circuit managing the flow of electrical charges on the transducer electrodes.

D. Galayko, LIP6, University of Paris-VI, Paris, France, dimitri.galayko@lip6.fr

Fig. 8.1 General structure of the vibration energy harvester

Electromechanical harvesters cannot generally be modeled using a purely electrical simulator, since they include non-linear electromechanical elements, typically an electromechanical transducer, requiring a behavioral model. Existing modeling approaches use signal data flow diagrams (e.g. Simulink models) [1], Spice-level descriptions including Spice macromodels to model the transducer [5, 6] and behavioral VHDL AMS descriptions [3]. However, as shown by recent research, in autonomous SOCs/SIPs the mechanical energy harvester is likely to be part of a multisource block which will include other energy sources (thermal, solar, etc.), a rechargeable battery and an intelligent energy management unit (probably a low-power microprocessor) [8]. Thus, the model of the mechanical energy harvester must be compatible with the model of the global system. SystemC AMS is one of the few modeling platforms that make it possible to describe physical, analog electrical, digital and software blocks in the same model [10, 11]. Hence, it is an excellent candidate for modeling complex energy generators, including digital and software blocks. In this contribution we present a SystemC AMS model of a vibrational energy harvester system whose conditioning circuit architecture was proposed in [13] (Fig. 8.1). In particular, we describe our solutions for the difficulties related to the non-linearity and the switching operation of the conditioning circuit. The chapter is organized as follows. In Sects. 8.1.1 and 8.1.2, we introduce the SystemC AMS platform and the modeled system. In Sect. 8.2 we present the model of all the blocks of the harvester, and in Sect. 8.3 we discuss the modeling results and compare them with a VHDL AMS and a Simulink model.

8.1.1 SystemC AMS Modeling Platform

The platform used is the first experimental version 0.15RC5 of the SystemC AMS extension [10]. The authors of this prototype participated in the release of the first draft standard of the SystemC AMS extension. This new standard enables system-level design and modeling of analog/mixed-signal systems by defining

Fig. 8.2 SystemC AMS models: a example of an algebraic non-linear system modeled in SDF, b illustration of a connection between SDF and LinElec models

additional language constructs and execution semantics for efficient simulation of discrete- and continuous-time behavior [4, 12]. The current SystemC AMS prototype allows two kinds of models to be defined: Synchronous Data Flow (SDF) models and Linear Electrical Network (LinElec) models. An SDF model is defined in the multirate synchronous data flow domain, which can be used to describe analog non-conservative (signal-flow) behaviors: each block has one or several inputs and outputs, and the data are propagated through the blocks. Multirate means that the model designer can define a time step equal to an integer multiple of the minimal model time step. This multiple can be different for each block, but neither the minimal time step nor the individual block multiples can be changed during the simulation. To run an SDF simulation, SystemC AMS uses a static scheduling algorithm. Another important point is that SystemC AMS imposes some limits on the modeling of non-linear systems. In fact, the only available method for non-linear equation resolution is the fixed-point iteration method [2]. For example, an algebraic system described by the equation

y = f(x, y),    (8.1)

where x is a known input, can only be modeled if a one-step delay is inserted in the loop (Fig. 8.2a). Hence, the output y exhibits a transient process which converges to the solution of (8.1). In fact, the system of Fig. 8.2a models the fixed-point iterative process

y_i = f(x_i, y_{i−1}).    (8.2)

It is known that this process converges if and only if the initial (guess) y value is in the attraction basin of the solution, and if x changes slowly compared with the delay. In particular, if f exhibits strong non-linearities (like the exponential models of diodes) and if the delay (step) is not small enough, convergence problems can occur; a plain C++ rendering of this iteration is sketched below. The Linear Electrical Network description level is provided for modeling linear electrical networks. The network is defined by its electrical netlist (described by specific C++ code). Before the start of a simulation, the SystemC AMS core sets up the matrix equation corresponding to the linear network, and during the simulation, closed expressions are used for computing the network quantities.
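The delayed-loop iteration of Eq. (8.2) is easy to reproduce outside the simulator; in the plain C++ sketch below, f is an arbitrary contraction chosen only for illustration.

#include <cmath>
#include <cstdio>

// One-step-delay loop y_i = f(x_i, y_{i-1}), as in Fig. 8.2a.
static double f(double x, double y) { return x + 0.5 * std::cos(y); }

int main() {
    double x = 1.0;  // input, held constant (i.e. "slowly varying")
    double y = 0.0;  // initial guess, inside the attraction basin
    for (int i = 0; i < 20; ++i) {
        y = f(x, y); // one simulation step of the delayed loop
        std::printf("i = %2d   y = %.6f\n", i, y);
    }
    // y converges to the solution of y = f(x, y); with a steep f,
    // e.g. an exponential diode law, the same scheme can diverge.
    return 0;
}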

Time-varying and even non-linear electrical networks can be modeled using the connection between SDF and LinElec models (Fig. 8.2b). This is possible thanks to the mechanism allowing the inclusion of components with time-varying parameters (resistor, capacitor, voltage and current source) in the linear circuit models. The connection exists in both directions: for example, an SDF domain signal can control the resistance value in a linear circuit, and an electrical value (current or voltage) measured in the linear circuit can be converted to an input of an SDF domain block as well (Fig. 8.2b). In this work we show that although SystemC AMS does not offer a direct possibility to model electrical networks with non-linear elements in a charge-conservative domain, this limitation can be circumvented using the coupling mechanism between SDF and LinElec models.

8.1.2 Summary of Electrostatic Harvester Operation

The modeled harvester can be seen as a combination of three blocks: an electromechanical resonant transducer including a resonator and a variable capacitor, a charge pump and a flyback circuit. The last two blocks form the conditioning circuit (Fig. 8.1).

Resonant Electromechanical Transducer

The idea of a vibration energy harvester using an electrostatic transducer is the following. The energy stored in the capacitor is given by

W = Q²/(2C),    (8.3)

where Q is the charge and C is the capacitance. If C varies thanks to some mechanical force modifying the capacitor geometry, it is possible to charge the capacitor when C is high (at the price of some energy W₁), and discharge it when C is low (getting back some energy W₂). From formula (8.3) one obtains that W₂ > W₁: with the charge Q conserved between charging and discharging, W₂ − W₁ = (Q²/2)(1/C_low − 1/C_high) > 0, and this difference corresponds to the energy harvested from the mechanical domain. Figure 8.1 presents the diagram of the resonant transducer block. It is modeled as a second-order lumped-parameter system including a mass, a spring, a damper (not shown) and a parallel-plate capacitor with a mobile electrode. The mechanical behavior is described by Newton's second law:

−kx − μẋ + F_t(x) − ma_ext = mẍ,    (8.4)

where x is the mass displacement from the equilibrium position, m is the mass, k is the stiffness of the spring, μ is the viscous damping constant of the resonator,

F_t is the force generated by the capacitive transducer and a_ext is the acceleration of the external vibrations. The term ma_ext takes into account the fact that the equation is written with regard to a non-inertial reference frame moving with acceleration a_ext [7]. The transducer force is given by the following formula:

F_t = (V_var²/2) · dC_var/dx,    (8.5)

where V_var is the voltage applied on the transducer and C_var is the transducer's capacitance. The gradient of the capacitance depends on the transducer geometry; in our model, the capacitance varies linearly with the displacement,

C_var(x) = C₀ + k_C · x (Farads),    (8.6)

where the numerical values of the constants C₀ and k_C correspond to the device presented in [9].

Conditioning Circuit Operation

The role of the conditioning circuit is to manage the electrical charge flow on the variable capacitor. The circuit proposed in [13] is composed of the charge pump and of the flyback circuit (Fig. 8.3). The charge pump is composed of two diodes and three capacitors, C_res, C_var and C_store, whose values are such that C_res ≫ C_store ≫ C_var. The role of the charge pump is to make use of the mechanical energy to transfer charges from C_res to C_store. Since C_res ≫ C_store, in this process the electrical system accumulates the energy coming from the mechanical domain. The circuit starts its operation from the state where all capacitors are charged to some initial voltage V₀, the switch is blocked (open) and the variable capacitor C_var is at its maximal value.

Fig. 8.3 Modeled conditioning circuit [13]

When C_var decreases, the voltage V_var increases, the diode D2 turns on and charges flow from C_var to C_store, increasing V_store. When C_var is minimal and starts to increase, its voltage decreases, D2 turns off and, until V_var = V_res, both diodes are off. When V_var drops below V_res, D1 turns on and charges from C_res flow towards C_var, discharging C_res. This is repeated during the next capacitance variation cycle. When V_store approaches the saturation value [13], the flyback is activated by closing the switch. The natural oscillation period of the LC network is much smaller than the vibration period, and C_store discharges very quickly onto C_res. At the moment where V_store = V_res, the switch turns off, and the system returns to its initial state. Now all the capacitors have equal voltages again, but slightly higher than at the start of the pumping process; this increase in the capacitor voltages corresponds to the harvested energy. Switching is driven by the event of the voltage V_store crossing the threshold values, and can be modeled by a finite-state automaton with a one-bit memory register [3].

8.2 SystemC AMS Modeling of the Harvester

8.2.1 Resonator Modeling

Equation (8.4) was modeled as a feedback SDF system, whose structure is given in Fig. 8.4. To define the operation of the resonator, two input values have to be provided: the voltage of the transducer's capacitor (V_var) and the external acceleration (a_ext). The output value of this block is the capacitance of the transducer, calculated in the block C_var from the mobile mass position through the known function C_var(x) (8.6). This block is implemented as a SystemC module (specified as SC_MODULE in the code); a condensed single-module sketch is given below. The delay at the input of the first integrator is required by the SystemC SDF solver, which doesn't tolerate delay-less loops.

Fig. 8.4 SDF diagram of the resonator
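For illustration, the structure of Fig. 8.4 can be condensed into a single SDF module that integrates Eq. (8.4) with forward Euler. The sketch below follows the SCA_SDF_MODULE conventions of the diode listing later in this chapter; the parameter values and the C_var(x) constants are placeholders, and the real model splits the loop into separate integrator blocks with an explicit delay.

// Hedged single-module sketch of the resonator of Fig. 8.4.
SCA_SDF_MODULE(Resonator_sdf)
{
    sca_tdf_in<double>  v_var;  // transducer voltage
    sca_tdf_in<double>  a_ext;  // external acceleration
    sca_tdf_out<double> c_var;  // transducer capacitance

    double x, vx;               // mass displacement and velocity
    double m, k, mu, dt;        // parameters, set in the constructor

    void sig_proc() {
        const double dCdx = 1e-7;               // placeholder slope of Eq. (8.6)
        double C   = 100e-12 + dCdx * x;        // placeholder C_var(x)
        double Ft  = 0.5 * v_var.read() * v_var.read() * dCdx;   // Eq. (8.5)
        double acc = (-k * x - mu * vx + Ft) / m - a_ext.read();  // Eq. (8.4)
        vx += acc * dt;                         // forward Euler integration
        x  += vx * dt;
        c_var.write(C);
    }

    Resonator_sdf(sc_module_name)
        : x(0.0), vx(0.0), m(1e-4), k(150.0), mu(2e-3), dt(4e-9) {}
        // illustrative values only
};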

8.2.2 Implementation of the Conditioning Circuit Model

The conditioning circuit contains several components whose modeling is tricky. In the following sections we describe it for each block.

Variable Capacitor

The SystemC AMS LinElec tool provides a model of a variable capacitor. This is the most important component of our application, thus we carried out several simple test cases before incorporating it in a complex model. The main question was whether or not the variable capacitor model was charge-conservative: for example, if the capacitor voltage is fixed, variations of the capacitance must generate an electrical current in the voltage source. The answer to this question was positive. Indeed, in SystemC AMS, when a variable capacitor is added to the circuit, two state variables are created: the capacitor voltage and the charge, equal to the product of the capacitance and the voltage. A variable capacitor is controlled by a signal issued from the SDF domain. Thus, it is natural to connect the signal C_var issued from the model of Fig. 8.4 to the input of the variable capacitor.

Diode Implementation

The implementation of the diodes is the most challenging part of the modeling task, since they have a very non-linear behavior. At first, we tried to model the diodes as current sources controlled by their own voltages, with an exponential functional relation between both values. The implemented scheme is given in Fig. 8.5a. The voltage on the diode is measured in the linear electrical domain and converted to the SDF domain using the predefined SystemC AMS block sca_vd2sdf (Voltage Difference to Signal Data Flow). Then the current is calculated and, after a necessary delay step, the current passing through the current source is updated.

Fig. 8.5 Different implementations of a diode in SystemC AMS

Although correct in theory, the presented diode model did not work, mainly because of the very steep exponential I-V diode characteristic in conjunction with the fixed delay between the voltage measurement and the current generation. So if at step k the voltage becomes, for example, 1 V, the corresponding very high current will be generated throughout the whole time step k+1, which will inevitably cause the modeling process to fail. For this reason, we proposed a different scheme using a variable resistance (Fig. 8.5b). In this case, the diode is modeled as a switch with on and off resistances, and switching is ordered by the voltage on the switch itself. Although the value of the resistance will also be updated with a one-time-step delay after the voltage change, there is no delay between the current and the voltage calculation; thus the model is accurate enough if the time step is small. The listing of the diode implementation is given below. Two different diode models are implemented. The first one is the two-state diode described above. The second one shows a more general diode law: instead of using the input voltage directly, we first limit its precision in order to limit the number of different states (resistance values) our diode can be in. It is very important to limit the number of states, as every state transition forces SystemC AMS to reinitialize its network matrix. To further limit the number of states, one should choose a cutoff voltage: if the voltage on the diode is lower than this voltage, the diode is considered to be completely blocking. The values used depend on the diode model one wants to use. With this technique, one can approximate many different diode models with SystemC AMS.

#define TWO_STATE_DIODE // This code implements a two-state diode.
// To implement a general diode model, uncomment the next line:
// #undef TWO_STATE_DIODE

// calculates the diode's state
SCA_SDF_MODULE(Electrical_diode_function)
{
    sca_tdf_in<double>  voltage;
    sca_tdf_out<double> resistance;

    void sig_proc()
    {
        double vc = voltage.read(); // voltage on the diode (input)
        double res;                 // the resistance of the diode (output)
#ifdef TWO_STATE_DIODE
        if (vc > 0.0) {
            res = 1e-9;  // on (conducting) resistance
        } else {
            res = 1e10;  // off (blocking) resistance
        }

#else
        double volt = round_to_n_decimals(voltage.read(), <precision>);
        double current = <realistic diode current law using volt>;
        if (current == 0.0 || volt < <cutoff_voltage>)
            res = 1e10;            // diode completely blocking
        else
            res = volt / current;  // calculate the resistance
#endif
        resistance.write(res);
    }

    void attributes()
    {
        resistance.set_delay(1); // obligatory one-step delay
    }

    Electrical_diode_function(sc_module_name name_) : sca_tdf_module() {}
};

SC_MODULE(Electrical_diode)
{
    sca_elec_port p; // electric terminals
    sca_elec_port n;

    Electrical_diode(sc_module_name name_) : sc_module(name_)
    {
        // construct blocks
        resistor  = new sca_tdf2r("source", 1e10);
        converter = new sca_vd2tdf("converter");
        function  = new Electrical_diode_function("function");

        // settings
        converter->scale = 1.0;

        // connect converter
        converter->p(p);
        converter->n(n);
        converter->tdf_voltage(voltage);

        // connect function
        function->voltage(voltage);
        function->resistance(resistance);

        // connect resistor
        resistor->p(p);
        resistor->n(n);
        resistor->ctrl(resistance);
    }

    // variable resistor
    sca_tdf2r* resistor;
    // converter
    sca_vd2tdf* converter;
    // resistance calculation block
    Electrical_diode_function* function;
    // signals
    sca_tdf_signal<double> voltage;
    sca_tdf_signal<double> resistance;
};

This model is used as a normal SystemC AMS component. For example, the insertion of a diode between the nodes p and n is achieved as follows:

Electrical_diode diode("diode");
sca_elec_node p, n;
diode.p(p);
diode.n(n);

Initial Charge of the Capacitors

When modeling power electronic systems, it is very important to be able to control the initial energy of the reactive elements. In SystemC AMS, the default values are zero, and no mechanism is provided to modify them. Thus for each capacitor we use the scheme shown in Fig. 8.6: a voltage source is connected to the capacitor during several step periods through a switch (variable resistance). The calculation of the initial pre-charge time is encapsulated in the model. The switch is on during at least the first time step; then the time constant (the RC product) is calculated, and if the latter is large, the switch stays on for a time equal to at least 10RC.

Fig. 8.6 Implementation of a capacitor with initial pre-charge

Conditioning Circuit/Flyback Switch Modeling

As shown in [3], the switch should be driven by the energy state of the charge pump. Thus, in our model we implemented the switch as a finite-state automaton (with two possible states), driven by the events of threshold-crossing by the voltage V_store: the switch turns on when it is off and V_store rises above a threshold V₂, and it turns off when it is on and V_store falls below a threshold V₁, with V₁ < V₂. A condensed sketch of this automaton is given at the end of this section.

Modeling of the Diode D3

A particular problem arises when simulating the circuit's behavior when the flyback/charge pump switch is turning off. Before this moment, the current in the inductor is maximal. When the switch turns off, the current path is broken, which normally provokes a high negative voltage on the inductor. Since the voltage of the left node of the inductor is fixed by the C_res capacitor, a negative voltage glitch is generated on D3. However, the latter is connected so as to turn on when a negative voltage is generated on the node fly (Fig. 8.3). By turning on, the D3 diode allows the inductor current to continue; one can say that the D3 diode absorbs the negative voltage glitch. However, the described phenomenon takes place instantaneously, whereas the model in SystemC AMS is strictly causal: for the diode D3 to turn on, its voltage should become negative during the preceding step. Since the switching off is abrupt (the off resistance of the switch is high), the generated voltage is very high and, being present throughout the step, seriously disturbs the circuit's state vector. To limit this negative effect, we connected the node fly to a small parallel-to-ground capacitor C_p = 1 pF, whose role is to maintain the voltage on the node fly at a reasonable level during the transition step. Adding this capacitor doesn't invalidate the model, since it naturally models the node's parasitic capacitance. When the switch turns off, the current continues to flow through the capacitor C_p, charging it to a negative voltage, and at the next step the diode turns on and the circuit operates normally.

Model of the Whole System

The model of the global system is presented in Fig. 8.7: it is composed of the conditioning circuit and the resonator models connected through the input and output terminals. The delay is necessary since there is a loop; a sca_v2sdf block converts the voltage of the C_store capacitor to the SDF domain.
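Returning to the flyback switch described above, the two-state automaton can be sketched in the same SDF style as the diode listing; the threshold and resistance constants are placeholders:

SCA_SDF_MODULE(Flyback_switch)
{
    sca_tdf_in<double>  v_store;     // observed charge pump voltage
    sca_tdf_out<double> resistance;  // drives a variable resistor (the switch)

    bool on;                         // the one-bit memory register
    double V1, V2, R_on, R_off;      // thresholds and on/off resistances

    void sig_proc() {
        double v = v_store.read();
        if (!on && v > V2) on = true;   // saturation reached: start flyback
        if ( on && v < V1) on = false;  // C_store discharged: stop flyback
        resistance.write(on ? R_on : R_off);
    }

    Flyback_switch(sc_module_name)
        : on(false), V1(5.5), V2(9.0),   // placeholder thresholds, V1 < V2
          R_on(1e-3), R_off(1e10) {}     // placeholder resistances
};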

Fig. 8.7 Diagram of the complete harvester model in SystemC AMS

Table 8.1 Numerical values of the modeling test case: resonator parameters k (N m⁻¹), m (kg), μ (N s m⁻¹), ω (rad s⁻¹) and external acceleration a_ext ∝ sin(ωt) (m s⁻²); capacitances C_res, C_store, C_p (F); inductance L (H); resistance R_L (Ω); time step Δt (s); diode and switch on/off resistances R_ON-DI, R_OFF-DI, R_ON-SW, R_OFF-SW (Ω); switch low threshold V₁ and high threshold V₂ (V)

8.3 Modeling Results

8.3.1 Description of the Modeling Experiment

Table 8.1 gives the numerical values for the circuit model. With a time step of 4 ns, a 1 s simulation requires 250 million steps. Such a small step was chosen to accurately model the quick processes involving the highly non-linear elements (diodes). In the first version of our model, the capacitance of the variable capacitor C_var calculated by the SDF resonator was updated in the LIN ELEC domain at each step (every 4 ns). This required a reinitialization of the network matrix at each step, leading to an excessively long simulation time (the linear solver of SystemC AMS is optimized for modeling time-invariant systems, or systems whose parameters change rarely). This problem was identified using the GNU profiler (gprof). Therefore, we modified the variable capacitor's mixed SDF-LIN ELEC model: in the new version, the LIN ELEC capacitance value is updated only once every 400 ns. This modification does not affect the accuracy of the model, since the capacitance varies sinusoidally with a frequency of only 298 Hz.

Fig. 8.8 Global view of the harvester operation. a V_store voltage, b V_res voltage evolution highlighting an energy accumulation
Fig. 8.9 Zoom on the circuit behavior. a V_var and V_store at the end of the first cycle, b flyback circuit operation

This optimization made the simulation run about 9 times faster. With this modification, modeling 1 second of system operation required 120 minutes of machine time on a Mac OS X 10.4 computer (Intel Core 2 Duo, 2 GHz clock, 2.0 GB of memory, 4 MB L2 cache); only one processor core was used to run the simulation. SystemC and SystemC AMS 0.15RC5 were used. Figure 8.8 presents a global view of the simulation results, showing the evolution of V_store and V_res during 1 s. This is typical behavior for a charge pump, apparently identical to the results obtained by the VHDL AMS simulation [3]. The evolution of V_res highlights an accumulation of the harvested energy in the system. Figure 8.9a presents an enlarged view of the V_store and V_var voltage evolution at the end of the first cycle; Fig. 8.9b shows the evolution of the flyback current (in the inductor).

8.3.2 Modeling Results Validation

To verify our model we compared the SystemC AMS model with a VHDL AMS and a Matlab Simulink model. The former was described in [3]; the latter was created using the electrical network equations. Every model used the same block parameters (resistances, initial voltages, etc.), but with different diode models: Simulink used a quadratic diode law, whereas VHDL AMS used a model with three zones (linear on and off zones, and a quadratic transition zone to keep the first derivative continuous). The VHDL AMS model did not include the C_p capacitor, as it caused the simulation to slow down too much. For SystemC AMS we used the two-state diode model. The values of three system quantities were compared: V_store, V_res and I_L (the quantities defining the energy state of the system). From the results of each model, V_store was measured every 10 ms to compare global system operation; V_res's peak value minus 5 V (the initial voltage) was measured after three cycles of the charge pump/flyback operation to compare only the harvested energy; and for I_L the peak value was taken from the first cycle to compare the flyback operation. We also performed a number of other global and detailed tests. In all tests SystemC AMS showed results similar to Simulink and VHDL AMS (under 2% relative difference). Table 8.2 presents the relative differences of the Simulink and SystemC AMS models versus the VHDL AMS model. These results show that the SystemC AMS model is correct. The use of various diode models is reflected in the different energy yields (reflected by the values of V_res after 4 flyback cycles, Fig. 8.8). Nonetheless, these differences are very small (less than 2%), the harvested energy value obtained with Simulink being a bit closer to that obtained with VHDL AMS (since Simulink's diode model is more similar to that of VHDL AMS). Simulation times were 3.5 minutes for Matlab Simulink using the ode23s solver on a dual Intel Xeon 3 GHz (4 cores) with 6 GB of memory, and 5.75 minutes for VHDL AMS using the ADVance MS simulator on a Sun Ultra-80. Compared with the SystemC AMS model, the Simulink and VHDL AMS simulations consume much less machine time. This is explained by the fact that the corresponding solvers use a variable simulation step, allowing a dramatic time step reduction only during the flyback circuit operation. The SystemC AMS model uses a fixed step, which is chosen to accurately model the quickest process of the system.

Table 8.2 Relative differences of the values obtained in SystemC AMS and Simulink with respect to VHDL AMS

              V_store    V_res     I_L
SystemC AMS   0.495%     1.468%    0.595%
Simulink      0.886%     0.316%    0.100%

8.4 Conclusion

This study presented a complex SystemC AMS model of a vibration energy harvester with a capacitive transducer. It was demonstrated that complex non-linear electrical circuits coupled with non-electrical domain subsystems can be accurately modeled with SystemC AMS. The modeling results were compared with VHDL AMS and Simulink simulation outputs. We explored the possibilities of this promising new extension of SystemC and, by introducing new reusable components, we extended the modeling possibilities of this simulator to non-linear electrical circuits. The main limitation of the current version of the SystemC AMS simulator for modeling complex AMS systems is the impossibility of varying the simulation step dynamically. Also, if a time-variable linear system is modeled with the LIN ELEC solver, the equation system matrix is reinitialized at each variation of the system parameters. This initialization is a very time-consuming operation which, if executed at each time step, can dramatically increase the simulation time. Nevertheless, SystemC AMS, being a SystemC extension, offers the unique possibility to model systems from the hardware level (electrical circuit) up to the software level using the same standardized language (C++). The results of this study suggest that the current version of SystemC AMS can, if necessary, be used for precise modeling of complex systems and circuits. However, the highlighted difficulties show that SystemC AMS is probably more appropriate for behavioral models at a higher abstraction level. For example, the simulation time can be reduced at the price of a small precision loss if quick processes, e.g. the flyback operation, are simulated using simplified behavioral models. The actual model architecture and the precision/detail level depend on the particular goals of the model designer.

References

1. Y. Chiu, Y.-S. Chu, and C.-T. Kuo. MEMS design and fabrication of an electrostatic vibration-to-electricity energy converter. Journal of Microsystem Technologies, 13(11–12), 2007.
2. G. Dahlquist and Å. Björck. Numerical Methods in Scientific Computing, Volume 1. SIAM, Philadelphia.
3. D. Galayko, R. Pizarro, P. Basset, A.M. Paracha, and G. Amendola. AMS modeling of controlled switch for design optimization of capacitive vibration energy harvester. In BMAS 2007 International Workshop Proceedings, 2007.
4. C. Grimm, M. Barnasconi, A. Vachoux, and K. Einwich. An introduction to modeling embedded analog/mixed-signal systems using SystemC AMS extensions. In DAC 2008 International Conference, June 2008.
5. E. Halvorsen, L.-C.J. Blystad, S. Husa, and E. Westby. Simulation of electromechanical systems driven by large random vibrations. In MEMSTECH 2007 Conference Proceedings, 2007.
6. G. Kondala Rao, P.D. Mitcheson, and T.C. Green. Simulation toolkit for energy scavenging inertial micropower generators. In PowerMEMS 2007 Workshop Proceedings, November 2007.

7. L.D. Landau and E.M. Lifshitz. Course of Theoretical Physics: Mechanics. Butterworth-Heinemann, Burlington.
8. H. Lhermet, C. Condemine, M. Plissonier, R. Salot, P. Audebert, and M. Rosset. Efficient power management circuit: From thermal energy harvesting to above-IC microbattery energy storage. IEEE Journal of Solid-State Circuits, 43(1), 2008.
9. A.M. Paracha, P. Basset, P.C.L. Lim, F. Marty, and T. Bourouina. A bulk silicon-based vibration-to-electric energy converter using an in-plane overlap plate (IPOP) mechanism. In PowerMEMS 2006 Workshop Proceedings, 2006.
10. A. Vachoux, C. Grimm, and K. Einwich. Extending SystemC to support mixed discrete-continuous system modeling and simulation. In ISCAS 2005 International Conference Proceedings, 2005.
11. M. Vasilevski, F. Pêcheux, H. Aboushady, and L. De Lamarre. Modeling heterogeneous systems using SystemC AMS case study: A wireless sensor network node. In BMAS 2007 International Workshop Proceedings, pages 11–16, 2007.
12. Official web site of the Open SystemC Initiative (OSCI) group.
13. B.C. Yen and J.H. Lang. A variable-capacitance vibration-to-electric energy harvester. IEEE Transactions on Circuits and Systems, 53(2), 2006.

Part III Digital Systems Design Methodologies Based on C++

Chapter 9
Application Workload and SystemC Platform Modeling for Performance Evaluation

Jari Kreku, Mika Hoppari, Tuomo Kestilä, Yang Qu, Juha-Pekka Soininen and Kari Tiensyrjä

Abstract The increasing number of concurrent applications in future mobile devices will be based on parallel heterogeneous multiprocessor system-on-chip platforms using network-on-chip communication to achieve scalability. A performance modeling and simulation approach is described to explore the application-platform solution/design space efficiently at system level. The application behavior is abstracted to workload models that are mapped onto performance models of the execution platform for transaction-level simulation. The approach provides separation of application and platform through service-oriented modeling. The experimentation of the approach in a mobile video player case study is presented.

Keywords Workload · Platform · Transaction-level · Performance · Simulation · Abstract instruction · Service modeling · UML2 · SystemC

9.1 Introduction

The digital processing architectures of future handheld mobile multimedia devices will evolve from current Systems-on-Chip (SoC) and Multi-Processor SoCs with a few processor cores to massively parallel computers that consist mostly of heterogeneous sub-systems, but may also contain homogeneous computing sub-systems. The Network-on-Chip (NoC) communication paradigm will replace bus-based communication to allow scalability, but it will increase uncertainties due to latencies in case large centralized or external memories are required. Moreover, several currently independent mobile devices, e.g. phone, music player, television, movie player, desktop and Internet tablet, will converge into one. The sets and types of applications running on the terminal depend on the context of the user. To deliver the requested services to the user, some of the applications run sequentially and independently, while many others execute concurrently and interact with each other. As a consequence of the above trends, the overall complexity of system development will increase by orders of magnitude.

J. Kreku, VTT Technical Research Center of Finland, Kaitoväylä 1, 90571 Oulu, Finland, jari.kreku@vtt.fi

Platform-based design [1] was introduced in the late 1990s to address the challenges of the increasing design complexity of SoCs, which typically consist of a few processor cores, hardware accelerators, memories and I/O peripherals communicating through a shared bus. To alleviate the scalability problems of the shared bus, the NoC architecture paradigm was proposed as a communication-centric approach for systems requiring multiple processors or the integration of multiple SoCs. Model-based approaches [2] emphasize the separation of application and execution platform modeling. The principles of the Y-chart model [3] are usually applied in design space exploration, i.e. a model of the application is mapped onto a model of the platform and the resulting allocated model is analyzed. Due to the change from a vertical to a horizontal business model in the mobile industry [4], a service-oriented and subsystem-based development methodology [5] is being adopted. The end-user interactions and the associated applications are modeled in terms of the services they require from the underlying execution platform. An obvious consequence is that the execution platform also needs to be modeled in terms of the services it provides for the applications. Both application and platform designers face an abundance of design alternatives and need systematic approaches for the exploration of the design space. Efficient methods and tools for early system-level performance analysis are necessary to avoid wrong decisions at this critical stage of system development. Design-time application mapping and platform exploration addressing run-time management of an MP-SoC platform is presented in [6]. The idea is to generate at design time a set of Pareto-optimal application mappings that the run-time manager uses when switching applications on the platform. An extension to the Nostrum simulator environment for system-level optimization of dynamic data and communication is presented in [7]. Measured or statistical communication traces are used in [8] for the analysis of network and computing node communication. The application mapping approach presented in [9] uses static profiling and co-simulation to achieve co-exploration of the application design space, including both the architecture model and the application source code. Performance evaluation has been approached in many ways at different levels of refinement. SPADE [10] implements a trace-driven, system-level co-simulation of application and architecture. Artemis [11] extends this by introducing the concept of virtual processors and bounded buffers. The TAPES performance evaluation approach [12] abstracts the functionalities by processing latencies and covers only the interaction of the associated sub-functions on the architecture, without actually running the corresponding program code. MESH [13] looks at resources, software, and schedulers/protocols as three abstraction levels that are modeled by software threads on the evaluation host. Our approach differs from the above in the way the application is modeled and abstracted. The workload model truly mimics the control structures of the applications, but the leaf-level load data is presented like traces. Also, the execution platform is modeled at transaction rather than instruction level, and the timing information is resolved during the simulation of the system model. In this chapter we present a model-based approach for system-level performance evaluation.
The approach combines top-down, refinement-type application

modeling with bottom-up, composition-type platform modeling. Both are based on a service-oriented approach with defined service interfaces, which brings scalability to both sides. The application models are abstracted to workload models that are mapped onto the platform performance models, and the resulting system model is simulated at transaction level to obtain performance data. The tool support is based on a commercial UML2 tool, Telelogic Tau G2 [14], and OSCI SystemC [15]. The rest of the contribution is structured as follows. Section 9.2 describes our performance modeling approach. Section 9.3 presents the mobile video player case study. Section 9.4 draws conclusions and discusses future work.

9.2 Performance Modeling and Simulation

The performance modeling and evaluation approach of ABSOLUT follows the Y-chart model, as depicted in Fig. 9.1 [16, 17]. The layered, hierarchical workload models represent the computation and communication loads the applications cause on the platform when executed. The layered, hierarchical platform models represent the computation and communication capacities the platform offers to the applications. The workload models are mapped onto the platform models, and the resulting system model is simulated at transaction level to obtain performance data.

9.2.1 Application and Workload Modeling

Starting from the end-user requirements, a hierarchically layered, service-oriented application model is created using UML2 use case, state machine, composite structure and sequence diagrams (Fig. 9.1). The top layer, called the abstract use case model, consists of system-level services visible to the user. These services are decomposed into sub-services in the refined use case model.

Fig. 9.1 Performance modeling and evaluation approach

Further refinement results in primitive services that are requested from the execution platform model. The functional simulation of the model in the Telelogic Tau UML2 tool provides sequence diagrams, which are needed for verification and for building the workload model. The purpose of workload modeling is to represent the load an application causes on an execution platform when executed. The workload models reflect accurately the control structures of the applications, but the computing and communication loads are abstractions derived either analytically [18], from measured traces [19], or using a source code compilation tool approach [16]. Due to the load abstraction, the models can be simulated before the applications are finalized, enabling early performance evaluation. As opposed to most performance simulation approaches, the workload models do not contain timing information: it is left to the platform model to find out how long it takes to process the workloads. This arrangement results in enhanced modeling and simulation speed. It is also easy to modify the models, which facilitates easier evaluation of various use cases with minor differences. For example, it is possible to parameterize the models so that the execution order of applications varies from one use case to another. The workload models have a hierarchical structure. The top-level workload model divides into application workloads, which are constructed of one or more process workloads. These are comprised of function workloads, which are basically control flow graphs of basic blocks and branches. Basic blocks are ordered sets of load primitives used for load characterization. Load primitives are the abstract instructions read and write, for modeling memory accesses, and execute, for modeling data processing. The process and function workload models can also be statistical, in which case the model describes the total number of different types of load primitives and the control is a statistical distribution over the primitives. The entire hierarchy of workload models is collected in a UML2 class diagram, which presents the associations, dependencies, and compositions of the workloads. Control inside the workloads is described with state machine diagrams, and composite structure diagrams are used to connect the control with the corresponding workload model. UML2 is a standard language used in software development, and thus the possibility to use UML2-based workload models enables reuse of existing UML2 application models, reducing the effort and making the performance simulation approach more accessible in general.

9.2.2 Execution Platform Modeling

The platform model is an abstracted hierarchical representation of the actual platform architecture, consisting of component, subsystem, and platform layers. Each layer has its own services, which are abstraction views of the architecture models. They describe the platform behaviors and related attributes, e.g. performance, but hide other details. Services in the subsystem and platform architecture layers can be invoked by workload models.

Component Layer

This layer consists of processing (e.g. processors, DSPs, dedicated hardware and reconfigurable logic), storage, and interconnection (e.g. bus and network structure) elements. An element must implement one or more types of component-layer services. The component-layer read, write and execute services are the primitive services on which higher-level services are built. All the component models contain cycle-approximate timing information. However, modeling applications as workloads places certain limitations on what needs to be modeled on the platform side, as the workloads do not express the dependencies between abstract instructions. Therefore, the data paths of processing units should not be modeled in detail; instead, the processor models have a cycles-per-instruction (CPI) value, which is used to estimate the execution time of the workloads (a back-of-the-envelope version of this timing rule is sketched at the end of this section). Furthermore, the workload models often do not have exact address information; instead they define which memory block they are addressing. As a result, cache hits and misses and e.g. SDRAM page misses must be modeled statistically in such a situation.

Subsystem Layer

The subsystem layer is built on top of the component layer and describes the components of the subsystem and how they are connected. The services used at this layer include e.g. video pre- and post-processing and decoding for a video acceleration subsystem. The model is presented as a composite structure diagram that instantiates elements from the component library. The subsystems are connected to the communication network via network interfaces.

Platform Architecture Layer

This layer is built on top of the subsystem layer by incorporating the communication network and platform software. Platform-layer services consist of service declaration and instantiation information. The service declaration describes the functionalities that the platform can provide. The instantiation information describes how a service is instantiated in the platform. COGNAC is a custom C++ tool which reads a text-based platform configuration file and generates the top-level SystemC models for the platform and base classes for the subsystems. The platform configuration file describes (1) the subsystems, (2) what components are instantiated inside each subsystem, and (3) how the subsystems and components are connected to each other. The generated subsystem models are stubs that should be extended by the designer by implementing subsystem-level services. Another configuration file is used to configure the parameters of the instantiated components. The number of parameters varies on a component-by-component basis and can be anything from a few parameters to tens of parameters. However, the designer can also set up default values inside the models.
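Coming back to the CPI-based timing of the component layer, the estimation rule reduces to a one-line calculation; the sketch below is our own illustration with made-up numbers, not code from the ABSOLUT tool set.

#include <cstdio>

// Execution time of N abstract instructions on a core with a given
// CPI value and clock frequency: t = N * CPI / f_clk.
double exec_time_s(long n_instr, double cpi, double f_clk_hz) {
    return n_instr * cpi / f_clk_hz;
}

int main() {
    // e.g. one million execute primitives, CPI = 1.2, 300 MHz core:
    std::printf("%.1f ms\n", exec_time_s(1000000, 1.2, 300e6) * 1e3); // 4.0 ms
    return 0;
}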

Each component has at least parameters for the clock frequency and latencies for accessing the interconnection. Furthermore, processor models typically have parameters for instruction and/or data cache hit probability and latency, in addition to the CPI value. Correspondingly, memory models have parameters for access latencies, and interconnect models for data transfer and arbitration latencies.

Interfaces

The platform model provides two interfaces for utilizing its resources from the workload models (Fig. 9.1). The low-level interface of the processing elements at the component layer is intended for transferring load primitives, as listed in Table 9.1 [20].

Table 9.1 Low-level interface functions

Interface      Description
read(A,W,B)    read W words of B bits from address A
write(A,W,B)   write W words of B bits to address A
execute(N)     simulate N data processing instructions

The functions of the low-level interface are blocking; in other words, a load-primitive-level workload model is not able to issue further primitives before the previous primitives have been executed. The high-level interface enables workload models to request services from the platform model (Table 9.2).

Table 9.2 High-level interface functions

Interface                Return value           Description
use_service(name, attr)  service identifier id  request service name using attr as parameters
wait_service(id)         N/A                    wait until the completion of service id

The use_service call is used to request the given service, and it is non-blocking so that the workload model can continue while the service is being processed. It returns a unique service identifier, which can be given as a parameter to the blocking wait_service call to wait until the requested service has completed. A condensed sketch of how a workload uses both interfaces is shown below. The platform model includes operating system (OS) models, which control access to the processing unit models of the platform by scheduling the execution of process workload models. The OS model supports both the low-level and the high-level interface to the workloads and relays interface function calls to the processor or other models which realize those interfaces. The OS model allows only those process workloads which have been scheduled for execution to call the interface functions, and performs rescheduling periodically according to the scheduling policy implemented in the model.
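To make the two interfaces concrete, the following hedged C++ sketch renders Tables 9.1 and 9.2 as a minimal abstract interface and shows how a process workload would use it; only the five function names and their meaning come from the text, everything else (class names, addresses, attribute strings) is our own scaffolding.

#include <string>

struct OsModelIf {                       // hypothetical minimal interface
    virtual void read(unsigned A, unsigned W, unsigned B) = 0;  // Table 9.1
    virtual void write(unsigned A, unsigned W, unsigned B) = 0;
    virtual void execute(unsigned N) = 0;
    virtual int  use_service(const std::string& name,
                             const std::string& attr) = 0;      // Table 9.2
    virtual void wait_service(int id) = 0;
    virtual ~OsModelIf() {}
};

void video_process_workload(OsModelIf& os)
{
    const unsigned FRAME_BUF = 0x1000;   // hypothetical memory block id
    os.read(FRAME_BUF, 1024, 32);        // fetch 1024 words of 32 bits
    os.execute(50000);                   // 50k data processing instructions
    os.write(FRAME_BUF, 1024, 32);       // store the results

    int id = os.use_service("video_decode", "qvga");  // non-blocking request
    os.wait_service(id);                 // block until the service completes
}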

The services inside the platform can model either hardware or software services. In ABSOLUT, software services are modeled as workload models, but unlike application models, they are integrated in the platform model and easily reusable by the applications. If the service is provided by a process or a set of processes running in the system, the service model consists of application- or process-layer workloads. If the service is implemented as a library, the model will be at the function layer. Service models can utilize other services, but eventually they consist of the same read/write/execute load primitives as the application models. There are two alternatives for implementing a HW service. It can be implemented simply as a delay in the associated component, if the processing of the service does not affect the other parts of the system at all; in this case the service must not perform I/O operations or request other services. The second alternative is to implement the service as read, write and possibly execute primitives, like the SW services, but in this case they are executed inside the HW component and not inside a process workload running on one of the processor models.

9.2.3 Allocation and Transformation to SystemC

Figure 9.2 depicts the flow from UML2 application models to generated SystemC workload models. A skeleton model of the platform is created in UML2 to facilitate mapping between the workload models, with their service requirements, and the platform models, with their service provisions. The skeleton model describes what components exist in the platform and what services the components provide. In the mapping phase, each workload entity (function, process, application) is linked to a processor or other component which is able to provide the services required by that entity. This is realized in the UML2 model using composite structure diagrams. The allocated UML2 workload models are transformed to SystemC [21] using automatic SystemC code generation from the Telelogic Tau UML2 tool.

Fig. 9.2 Transformation from UML2 to SystemC

Allocation and Transformation to SystemC

Figure 9.2 depicts the flow from UML2 application models to generated SystemC workload models. A skeleton model of the platform is created in UML2 to facilitate the mapping between the workload models, with their service requirements, and the platform models, with their service provisions. The skeleton model describes which components exist in the platform and which services the components provide. In the mapping phase, each workload entity (function, process, application) is linked to a processor or other component which is able to provide the services required by that entity. This is realized in the UML2 model using composite structure diagrams.

Fig. 9.2 Transformation from UML2 to SystemC

The allocated UML2 workload models are transformed to SystemC [21] using automatic SystemC code generation from the Telelogic Tau UML2 tool. The generator has been developed by Lund University [22] and produces SystemC code files and Makefiles for building the models. In the SystemC domain, the workload models have pointers to the platform model components they have been allocated to and utilize their services via the low- or high-level interfaces.

Performance Simulation

The executable simulation model of the combined workload and execution platform models (Fig. 9.1) is based on the OSCI SystemC library, extended with configurable instrumentation. During the simulation of the system model the workloads send load primitives and service calls to the platform model. The platform model processes the primitives and service calls, advancing the simulation time while doing so. The simulation run continues until the top-level workload model stops it once the use case has been completed.

The platform model is instrumented with counters, timers and probes, which record the status of the components during the simulation. These performance probes are manually inserted in the component models where appropriate, and they are flexible so that they can be used to gather information about platform performance as needed:

- Status probes collect information about the utilization of components and the scheduling of processes performed by the operating system models.
- Counters calculate the number of load primitives, service calls, requests and responses performed by the components.
- Timers keep track of the task switch times of the OS models and the processing times of services.

After the simulation, the performance probes output the collected performance data to the standard output. A C++-based tool, VODKA, is used for viewing e.g. processor utilization, bus and memory traffic, and execution time graphically.
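A counter probe of the kind described above can be very small; the following is a hypothetical helper class, not the actual instrumentation code:

```cpp
#include <cstdint>
#include <iostream>
#include <string>

class counter_probe {
    std::string   name_;
    std::uint64_t count_ = 0;
public:
    explicit counter_probe(std::string name) : name_(std::move(name)) {}
    void hit(std::uint64_t n = 1) { count_ += n; }   // called once per event
    void report() const {                            // called after simulation
        std::cout << name_ << ": " << count_ << '\n';
    }
};

// A component model would own e.g.  counter_probe reads_{"bus.reads"};
// and call  reads_.hit();  whenever it processes a read request.
```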

9.3 Mobile Video Player Case Example

In the mobile video player (MVP) case example, a mobile terminal user wants to view a movie on the device and selects one from a list of movies available on the mobile terminal. The execution platform provides services for storing and playing movie files.

The platform in the mobile video player case consists of four subsystems (Fig. 9.3): (1) the general purpose (GP) subsystem, which is used for executing an operating system and generic applications and services; (2) the image (IM) subsystem, which accelerates image and video processing; (3) the storage (ST) subsystem, which contains a repository for video clips; and (4) the display (DP) subsystem, which takes care of displaying the UI and video. The subsystems are interconnected by a network using a ring topology.

Fig. 9.3 Execution platform for the mobile video player

The general purpose subsystem has two ARM11 general purpose processors for executing the OS and applications. There is also a subsystem-local SDRAM memory controller and memory to be used by the two processors. For communicating with the other subsystems, each subsystem has a network interface. The image subsystem is built around the video accelerator; the services provided by the subsystem are controlled by a simple ARM7 microcontroller. There is also some SRAM memory for the ARM and a DMA controller for offloading large data transfers. The storage and display subsystems are mostly similar to the image subsystem; each contains a simple ARM and a DMA controller. The storage subsystem has local memory for storage and metadata, and the display subsystem has a graphics accelerator, local SRAM for graphics, and a display interface for the screen.

Modeling of the Execution Platform Components

The modeling of the MVP platform began with the implementation of the component models. There are 23 components in total inside the four subsystems (Fig. 9.3), but some of them are identical and others closely resemble other blocks. For example, the bus and the network interface component are the same in each subsystem. Internal SRAM memory and an ARM7 controller can be found in three of the subsystems, and the ARM11 is close enough to the ARM7 that it can use the same model with different parameter values.

Fig. 9.4 Class hierarchy of the MVP platform model

As a result, only 7 different models of the components (excluding parent classes) were needed.

The Component model is an abstract base class for all other models and defines the parameters every model must have: clock frequency and address (Fig. 9.4). The MVP platform uses the OCP protocol for communication; hence the Master and Slave models add OCP master and slave ports, respectively. They are still intended as base classes from which processor or memory models can be derived easily. The Master model contains methods for setting up OCP read and write requests, sending them to the port, and receiving responses. Correspondingly, the Slave model has methods for getting requests from the port, processing them, preparing responses, and sending them back to the port. Both models add parameters which define the latencies for accessing the port for read and write transactions.

The Bus model is the third model derived from the Component and simulates a basic OCP bus with round-robin scheduling. It contains both master and slave OCP multiports, and methods for arbitration, address decoding, and sending and receiving requests and responses. The parameters of the Bus define the latencies for moving requests and responses across the bus.

The general purpose processor (GPP) model provides the primitive interface to the workload models and abstracts the data processing capability of processors with a CPI value. The CPI is a workload-controllable parameter, which is used when the model calculates the execution time for the read, write and execute calls from the workloads.
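The timing calculation behind such a CPI-based execute primitive can be illustrated with a small sketch; the function name and parameters are assumed for illustration:

```cpp
#include <systemc>
#include <cstdint>

// Delay of executing n instructions on a core with the given CPI and clock.
sc_core::sc_time gpp_execute_delay(std::uint64_t n_instructions,
                                   double cpi, double clock_hz) {
    // n instructions, CPI cycles each, at the component clock frequency.
    double cycles = static_cast<double>(n_instructions) * cpi;
    return sc_core::sc_time(cycles / clock_hz, sc_core::SC_SEC);
}

// e.g. execute(1000) on a 166 MHz core with CPI = 1.2:
//     wait(gpp_execute_delay(1000, 1.2, 166e6));
```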

In the MVP case the GPP model is used only as a base class for the ARM model, which adds statistical instruction and data cache models on top of the GPP. The ARM model has additional parameters for the hit probability, hit latency and line size of the caches. It is also possible to define whether the data cache operates in write-through or write-back mode, and whether or not the cache is allocated on a write miss.

The HW_accel model is a pure virtual base class for hardware accelerators and provides the higher-level service interface (Table 9.2). The DMA_controller, Display_IF, and Video_accel models extend HW_accel by implementing services for DMA transfers, display updates, and video decoding and encoding, respectively.

The SRAM and SDRAM models extend the Slave model by overloading the calculation of the data access latency. The SRAM model adds parameters for read, write, and burst access latencies, whereas the SDRAM model has a different set of parameters consisting of RAS precharge, RAS-to-CAS, CAS, and burst latencies. Furthermore, the SDRAM model simulates page misses statistically, because the workloads do not generally provide exact addresses to the platform model. Thus, a parameter defining the probability of a page miss is needed.

An interesting implication of the modeling approach is that the memory models do not actually provide data storage. It is not required, because the functionality of the applications is not simulated and no data is moved in the simulated transactions.

There were 213 parameters in total for the platform's components, so it is not feasible to display all of them here. Table 9.3 shows the clock frequency of each component as an example of the configuration.

Table 9.3 Clock frequencies of the MVP platform components

    Component       GP subsys.  ST subsys.  IM subsys.  DP subsys.
    ARM 0                       50 MHz      83 MHz      100 MHz
    ARM 1
    Video_accel     –           –           166 MHz     –
    DMA controller  –           25 MHz      83 MHz      100 MHz
    Display IF      –           –           –           100 MHz
    Network IF      83 MHz      83 MHz      83 MHz      83 MHz
    Bus             166 MHz     25 MHz      83 MHz      100 MHz
    SDRAM           166 MHz     50 MHz      –           –
    SRAM            –           25 MHz      166 MHz     100 MHz

Modeling of the Services

In the MVP case, three of the four subsystems have a DMA controller, which is used for offloading large data transfers from the ARM7 microcontroller and/or the hardware accelerators in those subsystems. The DMA controller provides a component-layer dma_transfer service implemented in hardware (Table 9.4).

The service takes the source and target addresses and the size of the transfer as parameters. The implementation of the service utilizes the methods inherited from the Master model for sending and receiving OCP requests to transfer the requested amount of data.

Table 9.4 Services provided by the MVP platform

    Subsystem    Service           Provider        Description
    all          memcpy            process WL      Copy memory contents
    ST, IM, DP   dma_transfer      DMA controller  Offload memory copying
    DP           display_update    Display_IF      Update screen contents from the framebuffer
    IM           video_decode      Video_accel     MPEG4 video decoding
    IM           video_encode      Video_accel     MPEG4 video encoding
    IM           preprocess        Video_accel     Color conversion
    IM           postprocess       Video_accel     Color conversion
    ST           list_video_files  process WL      Provides a list of available films

The Display_IF provides a display update service for viewing the user interface of the entire system on the screen. However, it does not provide hardware-accelerated drawing services. The display update service takes the framebuffer address, display resolution and color depth as parameters. The Display_IF reads the contents of the framebuffer, using the same approach as the DMA controller, every time the service is requested.

The video accelerator provides hardware-accelerated video decoding and encoding services (Fig. 9.5). The service parameters include source (m_attr->from) and target (m_attr->to) addresses for the original and processed video data respectively, video resolution (m_attr->x, m_attr->y), macroblock size (m_attr->x, m_attr->y), bits per pixel (m_attr->b) and compression ratio (m_attr->c). The implementation also relies on the model parameters, which define the number of cycles required for the processing of the discrete cosine transform (m_dct_cycles), quantization (m_q_cycles), motion compensation (m_mc_cycles), and variable length coding (m_vlc_cycles). The decoding service contains several calls to the read_macroblock, write_macroblock, and process_macroblock methods; these are implemented on top of the internal read, write, and execute primitive services, respectively. The video accelerator also provides video preprocessing and postprocessing services for converting between the YUV420, YUV422 and RGB formats.

Each subsystem provides a SW service for memory copying. This is implemented using the dma_transfer component-layer service in the three subsystems with a DMA controller. The GP subsystem does not have DMA, so there the memory copying service is implemented as an OS kernel process workload, which sends the required number of read/write/execute primitives to one of the ARM11 processors when the service is requested. The ST subsystem provides a list_video_files service for generating the list of films available in the film storage; the memcpy service is used to move the actual film data.
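A plausible reconstruction of such a decoding loop, based on the description above, is sketched below. The attribute and cycle-count names are taken from the text, while the loop structure, the macroblock size and the example values are assumptions:

```cpp
// Attribute block as described in the text (field names from the text).
struct video_attr { unsigned from, to, x, y, b, c; };

struct video_accel_sketch {
    video_attr m_attr{};                      // set when the service is requested
    unsigned m_dct_cycles = 40, m_q_cycles = 10,
             m_mc_cycles  = 30, m_vlc_cycles = 20;  // illustrative values
    unsigned m_mb = 16;                       // assumed macroblock edge length

    void decode_frame() {
        for (unsigned j = 0; j < m_attr.y / m_mb; ++j)
            for (unsigned i = 0; i < m_attr.x / m_mb; ++i) {
                read_macroblock(m_attr.from);   // built on the read primitive
                // VLC decoding, inverse quantization, IDCT, motion comp.:
                process_macroblock(m_vlc_cycles + m_q_cycles +
                                   m_dct_cycles + m_mc_cycles);
                write_macroblock(m_attr.to);    // built on the write primitive
            }
    }

    void read_macroblock(unsigned)    { /* internal read primitive */ }
    void write_macroblock(unsigned)   { /* internal write primitive */ }
    void process_macroblock(unsigned) { /* internal execute primitive */ }
};
```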

Fig. 9.5 Extract from the implementation of the video decoding service

Implementing SW-based video decoding and encoding services utilizing the ARM11 processors in the GP subsystem will be considered later, to explore their effects on performance.

The MVP platform model consisted of 12,935 lines of SystemC code after implementing the components and services. This figure includes comments, but does not take into account the OCP protocol models.

Modeling of the Application

Major parts of the activity in the MVP case example are modeled as services in the platform model. The control part of the video playback application was modeled in UML2 and then transformed to SystemC. The MVP application provides a PlayVideo service to the user (Fig. 9.6). It is decomposed into the DisplayListOfVideos and PlaySelectedVideo service components. Further decomposition of DisplayListOfVideos reveals that the list of video files is produced by the list_video_files platform service, but showing the list on the display has to be implemented by the application itself. The implementation consists of load primitives used to write the display data to the framebuffer. PlaySelectedVideo, on the other hand, is decomposed into ReadVideoFrame, Decode, and ShowDecodedFrame.

Fig. 9.6 Decomposition of the MVP services

Decode is provided by the platform service, whereas the other two can be provided by using the memcpy service.

The next step was to generate SystemC models from the UML2 using Telelogic Tau and the Lund University SystemC generator. Finally, the load information was inserted into the model in the form of read, write, and execute load primitives for those parts of the application that were not provided by the platform services.

Analysis of Simulation Results

The simulation was run for the duration of the video player start-up and the playback of one second of video (25 frames). Table 9.5 presents the utilization of every component in the platform. The second ARM11 in the GP subsystem and the video accelerator have the highest load averages, about 50 and 60 per cent respectively. The ARM processors of the GP subsystem are somewhat burdened due to the video application start-up load. The result is also affected by a generic background load running in the GP subsystem, which models the execution of other applications. The video accelerator load is relatively high considering that the use case consisted of only video decoding; getting the system to handle video encoding might require increasing the clock frequency of the accelerator.

None of the components in the ST and DP subsystems have been at the limit of their capacity. From the point of view of these subsystems, it is clearly possible to execute more demanding applications on this platform, or the clock frequency of these components could be reduced to decrease the power consumption of the device. Furthermore, combining the functionality of the ST and DP subsystems should be considered, because both subsystems have low utilization. Table 9.6 shows the processing times of services from the subsystem layer.

Table 9.5 Utilization of MVP platform components

    Component       GP subsys.  ST subsys.  IM subsys.  DP subsys.
    ARM 0           11%         20%         10%         15%
    ARM 1           51%         –           –           –
    Video_accel     –           –           60%         –
    DMA_controller  –           5%          0%          6%
    Display_IF      –           –           –           31%
    Network IF      1%          3%          2%          5%
    Bus             14%         10%         7%          24%
    SDRAM           16%         3%          –           –
    SRAM            –           2%          3%          18%

Table 9.6 Examples of processing times of services

    Service        Subsystem  Average  Min     Max
    dma_transfer   ST         2.4 ms   0.1 ms  55 ms
    memcpy         ST         12 ms    10 ms   65 ms
    video_decode   IM         14 ms    14 ms   14 ms
    dma_transfer   DP         3.1 ms   2.9 ms  3.1 ms
    memcpy         DP         10 ms    10 ms   11 ms

The simulation results have not been validated with measurements, since the execution platform is an invented platform intended to portray a future architecture. Consequently, there are no cycle-accurate simulators for the platform which could be used to obtain reference performance data. However, we have modeled MPEG4 video processing and the OMAP platform earlier and compared those simulation results to measurements from a real application on a real architecture [20]. In that case the average difference between simulations and measurements was about 12%. The accuracy of the simulation approach has been validated with other case examples in [18, 19].

9.4 Conclusions

The application-platform performance modeling and evaluation approach has been presented. It allows the application and the platform to be modeled at several levels of abstraction to enable early performance evaluation of the resulting system. Applications are modeled in UML2 as workloads consisting of load primitives, whereas platform models are cycle-approximate transaction-level SystemC models. Mapping between the UML2 application models and the SystemC platform models is based on automatic generation of simulation models for system-level performance evaluation.

The tool support is based on the UML2 tool Telelogic Tau and the SystemC simulation tool of OSCI.

The approach has been evaluated with a mobile video player case study. The utilization of all of the components was obtained with simulations, which also yielded the processing times of the platform's services. Performance bottlenecks and power saving possibilities were identified based on the simulations. The results from the MVP case have not been verified, but in the other modeled case examples the average and maximum errors have typically been about 15% and 25%, respectively. The approach enables early performance evaluation, exhibits light modeling effort, allows fast exploration iterations, reuses application and platform models, and provides performance results that are accurate enough for system-level exploration. In the future, the approach will be expanded to criteria other than performance, such as power consumption. Further tool support for the automation of some steps of the approach is in progress.

Acknowledgements This work is supported by Tekes (Finnish Funding Agency for Technology and Innovation) and VTT under the EUREKA/ITEA contract MARTES, and partially by the European Community under the grant agreement MOSART.

References

1. K. Keutzer, A. Newton, J. Rabaey, and A. Sangiovanni-Vincentelli. System-level design: orthogonalization of concerns and platform-based design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19(12), 2000.
2. MDA guide version 1.0.1, June 2003. Document number: omg/…
3. F. Balarin, M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, C. Passerone, A. Sangiovanni-Vincentelli, E. Sentovich, K. Suzuki, and B. Tabbara. Hardware-Software Co-Design of Embedded Systems: The Polis Approach. Kluwer Academic, Dordrecht, 1997.
4. R. Suoranta. New directions in mobile device architectures. In The 9th Euromicro Conference on Digital System Design (DSD'06), 2006.
5. K. Kronlöf, S. Kontinen, I. Oliver, and T. Eriksson. A method for mobile terminal platform architecture development. In Advances in Design and Specification Languages for Embedded Systems. Springer, Dordrecht.
6. C. Ykman-Couvreur, V. Nollet, T. Marescaux, E. Brockmeyer, F. Catthoor, and H. Corporaal. Design-time application mapping and platform exploration for MP-SoC customized run-time management. IET Computers & Digital Techniques, 1(2), 2007.
7. L. Papadopoulos, S. Mamagkakis, F. Catthoor, and D. Soudris. Application-specific NoC platform design based on system level optimization. In IEEE Computer Society Annual Symposium on VLSI (ISVLSI'07), 2007.
8. J. Xu, W. Wolf, J. Henkel, and S. Chakradhar. A design methodology for application-specific networks-on-chip. ACM Transactions on Embedded Computing Systems, 5(2), 2006.
9. G. Beltrame, D. Sciuto, C. Silvano, P. Paulin, and E. Bensoudane. An application mapping methodology and case study for multi-processor on-chip architectures. In 2006 IFIP International Conference on Very Large Scale Integration, October 2006.
10. P. Lieverse, P. van der Wolf, K. Vissers, and E. Deprettere. A methodology for architecture exploration of heterogeneous signal processing systems. VLSI Signal Processing, 29(3), 2001.

11. A. Pimentel and C. Erbas. A systematic approach to exploring embedded system architectures at multiple abstraction levels. IEEE Transactions on Computers, 55(2):99–112, 2006.
12. T. Wild, A. Herkersdorf, and G.-Y. Lee. TAPES: Trace-based architecture performance evaluation with SystemC. Design Automation for Embedded Systems, 10(2–3).
13. J.M. Paul, D.E. Thomas, and A.S. Cassidy. High-level modeling and simulation of single-chip programmable heterogeneous multiprocessors. ACM Transactions on Design Automation of Electronic Systems, 10(3), 2005.
14. Telelogic Tau 3.0 User Guide (December 2006). Telelogic AB, 1998 pp.
15. Open SystemC Initiative website.
16. J. Kreku, M. Hoppari, T. Kestilä, et al. Combining UML2 application and SystemC platform modelling for performance evaluation of real-time embedded systems. EURASIP Journal on Embedded Systems, 2008(3), 2008.
17. J. Kreku, Y. Qu, J.-P. Soininen, and K. Tiensyrjä. Layered UML workload and SystemC platform models for performance simulation. In Proceedings of the Forum on Specification and Design Languages. ECSI, Gières.
18. J. Kreku, T. Kauppi, and J.-P. Soininen. Evaluation of platform architecture performance using abstract instruction-level workload models. In International Symposium on System-on-Chip, pages 43–48.
19. J. Kreku, J. Penttilä, J. Kangas, and J.-P. Soininen. Workload simulation method for evaluation of application feasibility in a mobile multiprocessor platform. In The Euromicro Symposium on Digital System Design.
20. J. Kreku, M. Eteläperä, and J.-P. Soininen. Exploitation of UML 2.0-based platform service model and SystemC workload simulation in MPEG-4 partitioning. In International Symposium on System-on-Chip.
21. J. Kreku, M. Hoppari, K. Tiensyrjä, and P. Andersson. SystemC workload model generation from UML for performance simulation. In Proceedings of the Forum on Specification and Design Languages. ECSI, Gières.
22. P. Andersson and M. Höst. UML and SystemC comparison and mapping rules for automatic code generation. In Proceedings of the Forum on Specification and Design Languages. ECSI, Gières, 2007.

Chapter 10
Adaptive Interconnect Models for Transaction-Level Simulation

Rauf Salimi Khaligh and Martin Radetzki

R. Salimi Khaligh
Embedded Systems Engineering Group, Institute of Computer Architecture and Computer Engineering (ITI), Universitaet Stuttgart, Pfaffenwaldring 47, Stuttgart, Germany
e-mail: salimi@informatik.uni-stuttgart.de

Abstract Transaction level models are constructed for efficient simulation of complex embedded systems and systems-on-chip. Traditionally, the use case of a transaction level model dictates its accuracy and abstraction level, which are fixed during simulation. Although the chosen level of accuracy may be required in some intervals, in other intervals the model may simply be too accurate for the scenario being simulated. This makes the model a simulation bottleneck and unnecessarily impedes simulation performance. In this contribution we present an adaptive approach for modeling interconnects. The abstraction level of an adaptive model dynamically adapts to the simulation scenario, increasing the simulation performance without sacrificing accuracy. We have developed adaptive models for point-to-point, FIFO based communication channels widely used in modern GALS and multiprocessor systems, as well as models for complex, pipelined buses. We have applied the proposed approach to two real-world communication protocols and developed adaptive models of the AMBA AHB bus and the Fast Simplex Link (FSL) in SystemC, based on the recent OSCI TLM 2 standard. Our experiments clearly show the increase in simulation performance compared to existing, non-adaptive models.

Keywords Transaction level modeling · Adaptive models · On-chip interconnect models · SystemC

10.1 Introduction

Transaction level modeling enables us to simulate complete hardware and software systems at very high speeds, which are often orders of magnitude faster than the simulation speed of RTL and lower level models. Transaction level models are used by different people for different use cases. The abstraction level of the models and their accuracy depend on the use case at hand, and there is a clear trade-off between simulation speed and accuracy.

Fig. 10.1 Sufficient model accuracy

The definition of the TLM abstraction levels and the terminology are still subject of debate and standardization, but, generally speaking, transaction level models can be divided into three categories: they are either untimed, approximately timed, or timing accurate. The untimed (UT) models are very abstract models which achieve very high simulation speeds at the cost of having no timing accuracy at all. At the lowest TLM abstraction level are the timing accurate models, for example the cycle accurate (CA) models in the domain of synchronous systems. In such models, it is possible to observe or reconstruct the state of the model at each clock cycle, making the model accurate enough for fine-grained performance analysis and verification. This level of timing accuracy comes at the price of simulation performance, which is often several orders of magnitude slower than the simulation of the untimed models. Between the two levels are models with approximate timing, which are used, for example, for early software profiling or architectural analysis. In this work we are concerned with transaction level models of interconnects used in state-of-the-art embedded systems and systems-on-chip, and the following discussions focus on the specifics of these models.

Figure 10.1 depicts examples of different abstraction levels and their accuracies for a bus TLM. The untimed models of buses typically model addressing accurately, but arbitration and timing details are absent. In more accurate, timed bus models (e.g. the PVT model in Fig. 10.1), arbitration is modeled approximately, but the transfers are atomic with an approximate total latency. Preemption of transfers is usually not modeled in these models, which may compromise functional correctness or may leave some errors undetected. A simple example of such a situation is a read burst preempted by a write burst to an overlapping region of the memory: if the bus model does not model preemption of transfers, the simulation results will be inaccurate. In more accurate models (e.g. the CX model in Fig. 10.1), arbitration and preemptions may be modeled accurately, but transfers are still simulated as chunks with aggregate timings. Accurate timing of the beats in a burst is usually modeled only in the slower, cycle accurate (CA) bus models (e.g. Fig. 10.1(e)).

Traditionally, designers either have to develop and use multiple models of the same entity at different abstraction levels, one for each use case, or, in order to have an acceptable level of accuracy, they have to use a single, accurate but slow model. The abstraction level of the chosen model is fixed during simulation, and for the simulation session as a whole this does not always represent the most efficient abstraction level and will unnecessarily degrade the simulation performance. Consider again the bus example in Fig. 10.1. During periods with a single master active (a), PVT accuracy is sufficient. Similarly, simulation of the preemption logic is only necessary when preemptions occur (b, d). Cycle accurate timing of data beats during burst transfers may only be required in some intervals (e). As an example, consider a system where a display controller periodically reads the frame buffer in fixed length bursts. We may only be interested in the effect of those bursts on bus utilization and the timing of other transfers, but not in the intra-burst timing details.

In contrast, in the same system we may need the timing and order of the beats in cache fill bursts which are performed by the processor.

In this contribution we present a novel approach for modeling interconnects which enables us to construct accurate and at the same time very efficient simulation models. This is achieved by incorporating a certain degree of adaptivity into the models. Put simply, an adaptive model contains all the necessary levels of detail for simulation at the most accurate abstraction level, but the details are only simulated when required; hence, the abstraction level of the model adapts to the simulation scenario dynamically. This is in contrast to traditional transaction level models with static abstraction levels, and is also different from multi-level, multiplexed transaction level models, which require several distinct models. We have focused on two classes of communication: point-to-point communication and shared medium, bus-based communication.

The remainder of this chapter is organized as follows. Section 10.2 gives an overview of the related work. Section 10.3 presents our modeling approach. In Sect. 10.4, details of the SystemC and OSCI TLM 2 based models of a concrete bus protocol (AMBA AHB) and a concrete point-to-point communication protocol (FSL) are shown. Section 10.5 presents the simulation performance of the developed models and compares them to existing models. Section 10.6 concludes this chapter with a summary and directions for potential future work.

10.2 Related Work

Transaction level modeling (e.g. [7, 10, 17]) and TLM based design flows have become established design methods in industrial and research communities. TLM is enabled by languages such as SystemC [12], SpecC [9] and SystemVerilog [1]; SystemC and the OSCI TLM standard [17] are currently the most general-purpose and the most widely accepted ones. The definition of the abstraction levels, terminology and some core concepts are still subject of debate [6] and formalization [15]. An interesting, systematic representation of transactions at different abstraction levels and their relationships has been presented in the GreenBus model [14].

Designers are often only interested in certain time intervals or certain components during system operation. To accelerate the simulation performance during uninteresting phases, or to focus on the behavior of certain components, several approaches have been proposed. An early pre-TLM example is SimOS [21] for the simulation of complete computer systems. This approach requires multiple models of each system component at different levels of accuracy. Another early example is the work of Hines et al. [11], which is based on a layered model of communication between a processor and its peripherals. In TLM, Beltrame et al. [5] have recently proposed a multi-accuracy approach in which several models at different abstraction levels are wrapped in one model, and the dynamic change of accuracy is based on multiplexing and demultiplexing those models.

In their model, switching between abstraction levels can only occur on transaction boundaries, so the models cannot react to conditions such as transaction preemption, and such conditions can be missed during the more abstract simulation phases. The result oriented modeling (ROM) based models [23] are adaptive to the simulation scenario in that, although in ROM data transfers are atomic, the estimated duration of a transfer is extended should the transfer get preempted. Considering transaction boundaries only, the model is timing accurate. However, since the data is transferred at the boundaries without taking the preemption into account, the data transfer is not accurate and functional correctness can be impaired. The dynamic change of abstraction level in the existing approaches is not bumpless and may result in a loss of accuracy. An approach to accuracy-adaptive simulation of transaction level models which does not require multiplexing of distinct models was recently introduced in [20] and was demonstrated with an abstract communication protocol. The applicability of the approach to real-world protocols, data dependent latencies, and cycle accuracy of the data transfers were not addressed.

We have used concrete industrial communication protocols to evaluate our modeling approach. The Fast Simplex Link (FSL) [25] is a point-to-point FIFO based communication protocol which is used in FPGA-based multiprocessor systems (e.g. [13]). The AMBA AHB [3] is a representative of complex, high performance industrial buses, used by many in the evaluation of modeling methodologies. In [18], Pasricha et al. have used AHB in the evaluation of their proposed CCATB abstraction level. In [24], SpecC-based models of AHB at different abstraction levels are compared. The AHB model of Caldari et al. [8] is an early SystemC 2.0 model which is not based on the OSCI TLM standard. A model of the AHB protocol based on an object oriented TLM approach [19] is presented in [22].

10.3 Adaptive Interconnect Models

In this section we begin with the relatively simple case of point-to-point communication and then move on to more complex bus-based communication protocols.

10.3.1 Point-to-Point Communication

Point-to-point communication is widely used in modern, complex systems-on-chip. For example, mixed clock FIFOs are used for communication between locally synchronous islands in Globally Asynchronous, Locally Synchronous (GALS) systems (e.g. [2, 16]). Another example is the use of point-to-point FIFO-based channels for communication between processors in multiprocessor systems (e.g. [13, 26]) or between processors and dedicated co-processors. In this chapter we focus on communication channels that follow the general logical structure shown in Fig. 10.2: a FIFO-based uni-directional channel which consists of internal memory, different producer and consumer clocks, and status and handshaking signals. A bidirectional channel can be modeled as two uni-directional channels in opposite directions.

Fig. 10.2 A FIFO-based point-to-point communication channel

The clocks may be from different clock domains, and the production and consumption rates may be different. The status of the FIFO is reflected in the status signals (e.g. full and empty). The physical architecture of the channel is not relevant to our discussion.

In an abstract untimed model, point-to-point communication would usually be modeled with a single function call from the producer to the consumer, transferring all data items at once. A separate model of the channel will most probably not be present. A certain level of timing accuracy can be achieved by using a FIFO model which accounts for the total latency of the data transfer. For example, let T_p and T_c be the periods of the producer-side and consumer-side clocks respectively, and assume that the FIFO is unbounded and that writing or reading a single data item requires one clock cycle. A lower bound on the total latency of transferring N items from the producer to the consumer under the above assumptions can be estimated easily. For T_p < T_c, the time required by the consumer to read the N items is greater than the time required by the producer to transfer the items to the FIFO. The first item can be retrieved by the consumer T_p time units after the producer has started the transfer. It can easily be seen that the total duration of the transfer in this case (i.e. no congestion, no producer-side or consumer-side idle cycles) is T_p + N·T_c (the case for T_p > T_c is similar). Such an estimate can be incorporated in the model, for example by means of a wait() inside the channel model.
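Under the stated assumptions, such an estimate could be coded as follows. This is a sketch; the symmetric T_p > T_c case below is an assumption based on the analogous argument:

```cpp
#include <systemc>

// Lower bound on the latency of transferring n items over the channel.
sc_core::sc_time transfer_latency(unsigned n,
                                  sc_core::sc_time t_p,
                                  sc_core::sc_time t_c) {
    if (t_p < t_c)  // producer faster: first item after T_p, then N reads
        return t_p + static_cast<double>(n) * t_c;
    // consumer faster: last item written after N*T_p, then one read (assumed)
    return static_cast<double>(n) * t_p + t_c;
}

// Inside the channel model:  wait(transfer_latency(N, Tp, Tc));
```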

To model details such as congestion or producer-side and consumer-side idle cycles, a more accurate model is required. The most natural model usually used for this purpose is a FIFO which is written to and read from one item at a time. The accuracy of such a model comes at the price of impeding the simulation performance. More importantly, in some intervals during simulation the model will be unnecessarily accurate. Figure 10.3 shows two simple scenarios of data transfer between a producer and a consumer using a FIFO channel. In scenario (a) the FIFO is initially empty and there is no congestion; here the accuracy of a model which only takes into account the total latency would be sufficient. In scenario (b), however, the FIFO is initially not empty and there is an interval of congestion. The producer is blocked after transferring two data items, until the consumer removes some items from the FIFO (x and y in the figure). Eventually, the producer transfers the remaining items to the FIFO. To capture this level of detail, the more accurate but slower model is required.

Fig. 10.3 Communication over a FIFO with and without congestion

Based on these observations, our proposed model is capable of maintaining the highest level of accuracy required in TLM use cases without unnecessarily sacrificing the simulation speed. In the proposed model the FIFO slots do not store individual data items, but chunks of data belonging to an atomic transfer. This makes item-by-item reading and writing superfluous and increases the simulation performance significantly. Without congestion (e.g. Fig. 10.3(a)), data is transferred in one chunk from the producer to the channel, and in one chunk from the channel to the consumer. In case of congestion (e.g. Fig. 10.3(b)), the transfer is divided into multiple chunks. To transfer a set of data items, the producer initially makes an optimistic attempt to transfer all of the items in one chunk. If there is enough room in the FIFO, all of the items are stored in the FIFO as one chunk, to be retrieved later by the consumer. However, if the FIFO cannot receive all items at once, only the fitting number of items is stored; the producer is notified and must transfer the remaining items later. The result is that, from the producer and consumer points of view, the size of the transferred chunks and their timings will always be accurate, without the need to transfer data one item at a time. The underlying principle and a comparison with related approaches can be found in [20].

To summarize, the model is adaptive in the sense that it always stays at the lowest abstraction level which is required at any instant. In the example shown in Fig. 10.3, in interval (a) the model is equivalent to an abstract but fast model, and in interval (b) it is equivalent to a more accurate but slower model; the difference is that in scenarios such as interval (a), the adaptive model is much more efficient than a cycle accurate model or a model which requires item-by-item transfer.
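The optimistic chunk transfer can be sketched as follows. The data structures and names are illustrative and deliberately ignore the timestamps and timing that the real model also tracks:

```cpp
#include <deque>
#include <algorithm>

// One chunk: the items of one atomic (uninterrupted) transfer.
struct chunk { unsigned n_items; /* payload, timestamps, ... */ };

class adaptive_fifo {
    std::deque<chunk> chunks_;
    unsigned free_slots_;
public:
    explicit adaptive_fifo(unsigned depth) : free_slots_(depth) {}

    // Optimistic write: try to store all n items as one chunk; returns how
    // many were accepted. On congestion (Fig. 10.3(b)) the producer must
    // retry later with the remainder.
    unsigned write(unsigned n_items) {
        unsigned accepted = std::min(n_items, free_slots_);
        if (accepted > 0) {
            chunks_.push_back(chunk{accepted});
            free_slots_ -= accepted;
        }
        return accepted;
    }

    // The consumer retrieves a whole chunk at once.
    unsigned read() {
        if (chunks_.empty()) return 0;
        unsigned n = chunks_.front().n_items;
        chunks_.pop_front();
        free_slots_ += n;
        return n;
    }
};
```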

10.3.2 Bus-Based Communication

In contrast to point-to-point communication, bus-based communication (i.e. shared medium, multi-master, multi-slave communication) involves arbitration and address decoding. The complex timing characteristics of state-of-the-art bus protocols are usually only representable at very low (often CA) TLM abstraction levels (e.g. [22]) or in cycle based models (e.g. [4]). In the following discussions we focus on pipelined bus protocols, but the concepts can easily be applied to simpler, non-pipelined protocols.

Fig. 10.4 A preempted pipelined burst

Figure 10.4 shows an eight beat burst in a typical pipelined bus from request to completion, which is preempted by a second transfer and afterwards resumed to completion. Some time after requesting bus access at req, the initiator M0 is granted access to the address bus at a1 (i.e. the transfer enters the address phase) and later to the data bus at d1. At dp the transfer is prematurely preempted (e.g. because of a higher priority transfer or because it is split). Later, at a2 and d2, the master is granted access to the address and data buses respectively, and at de the transfer is completed.

Not all of these details are used, or are relevant, for most TLM applications, even at lower abstraction levels. The total latency of the transfer (L), the duration of the uninterrupted parts (here L1, L2), the amount of data transferred in each part, and the additional latency caused by the preemption (Tp) almost always constitute a sufficient level of accuracy in TLM. It is sufficient that the timing of the atomic (i.e. non-preempted) parts and the data associated with each part be accurate. Accurate timing of each beat (e.g. the different transfer latencies of D3 and D4) is not required in all cases. Observing these facts, and that bursts always target groups of related addresses, we can abstract away unnecessary timing details and at the same time accurately model arbitration, latencies and data transfer timings by incorporating degrees of adaptivity, without the need for cycle accuracy.

Unnecessary notifications and interactions between models are avoided to increase the simulation performance. For example, unless timing accurate data transfer for bursts is required, masters are only notified after the completion and preemption of transfers, and slaves are only notified when the data phase of a transfer starts. Similar to the model used in [22], we use an abstract model in the form of a pipeline, whose stages represent the address and data bus(es) and the occupation of those buses by transactions (here a transaction corresponds to an uninterrupted portion of a bus transfer). Figure 10.5 shows an abstraction of the same burst from Fig. 10.4 in terms of its atomic parts T1 and T2, together with the pipeline model. It can be seen that the state of the pipeline only changes upon address and data bus handovers (hvr points in the figure).

We use a global event to indicate the instances at which bus handover needs to take place. Using this event and the pipeline model, the bus model keeps track of the transactions currently in the address and data phases and controls the timing of the transactions. This event may be triggered by the slaves (e.g. upon completion of single transfers, at the last address phase of bursts, or when splitting transfers), by the masters (e.g. when a request arrives at an idle bus), or by the bus model itself (e.g. when the data bus is idle). This makes a clock event and clock-based operations superfluous and increases the simulation performance significantly.

Fig. 10.5 An abstract model of bus handover
Fig. 10.6 Arbitration

From the bus handover point of view, the model adapts to the simulation scenario by being at the most accurate level when necessary (e.g. early in the burst in Fig. 10.5) and at higher abstraction levels otherwise.

Another level of adaptivity can be added to the model by simulating the arbitration mechanism only when necessary. As an example, in Fig. 10.6 the next master to be granted the address bus is determined at arb1 and does not change until hvr1. The result of the arbitration is not used until the actual handover of the address and data buses. In our proposed model, all requests are annotated with a time stamp, and upon occurrence of a handover event the bus is able to retroactively check the requests and determine the master to be granted the bus. In the case of sequential arbitration mechanisms, the relevant instant would be the closest previous clock edge. A similar retroactive check of requests is also possible in the case of combinational arbitration mechanisms, in which a request can be granted in the same cycle. In Fig. 10.6, time arb2 shows another example, where a request arrives at an idle bus and is granted immediately, causing a handover notification.
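The retroactive check can be illustrated with a small sketch. The fixed-priority policy and all names below are assumptions made for illustration:

```cpp
#include <systemc>
#include <vector>

// Time-stamped request as kept by the bus model (names assumed).
struct request { int master_id; sc_core::sc_time stamp; };

// On a handover event, retroactively determine the grant: a simple
// fixed-priority choice among all requests that were already pending at
// the previous clock edge (the sequential arbitration case).
int retro_arbitrate(const std::vector<request>& pending,
                    sc_core::sc_time now, sc_core::sc_time clk) {
    double n_edges = now / clk;                    // sc_time / sc_time -> double
    sc_core::sc_time last_edge =
        static_cast<double>(static_cast<unsigned long long>(n_edges)) * clk;
    int winner = -1;
    for (const request& r : pending)
        if (r.stamp <= last_edge && (winner < 0 || r.master_id < winner))
            winner = r.master_id;                  // lower id = higher priority (assumed)
    return winner;                                 // -1: no eligible request
}
```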

We exploit two observations regarding the timing of beats in bursts and the preemption of transfers to add further levels of adaptivity to the bus model. The first observation is that the timing of individual data beats during bursts is not always required. For example, when benchmarking software on a virtual platform, the total latency of memory access is sufficient for the processor model, but for other components in the model more accurate timing may be required. We therefore allow the accuracy of a data transfer to be indicated by the initiators and targets of a transfer.

Fig. 10.7 Accuracy of data transfer

For what we term an Aggregate Timed Transaction (ATT), only the aggregate duration is known by the master, and all data is transferred at the beginning (for a write) or at the end of the duration of the transaction (for a read). For the read bursts shown in Fig. 10.7, the data phase of the ATT starts at d3 and the slave transfers the data of both beats at d5. We call a transaction with detailed beat-wise timing a Segment Timed Transaction (STT). For the STT shown in Fig. 10.7, for example, data transfer occurs at d1 and d2. Slaves may support ATTs only, STTs only, or both. Supporting STTs only, for example, would force the data transfer to or from the slave to be beat-wise. Using STTs and ATTs, the transfer of data is accurately timed only when necessary.

The second observation is that not all transfers are preempted, and cycle accuracy is not necessary to ensure accurate transfer of the data corresponding to the atomic parts. Here, once again, we use the adaptive approach introduced in [20]. The example shown in Fig. 10.7 shows a burst preempted by a higher priority transfer. At the beginning of the data phase at d6, the slave optimistically calculates the duration of the transfer, L, and waits for that amount of time and simultaneously for a preemption event. When notified of a preemption, the slave transfers only the data corresponding to the non-preempted part. Handling of preemption in the case of STTs is similar and straightforward.
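The optimistic wait with preemption could be sketched as follows. The context structure and the uniform division of L over the beats are assumptions made for illustration:

```cpp
#include <systemc>

struct burst_ctx {
    sc_core::sc_time  latency;      // optimistic estimate L
    sc_core::sc_event preempted;    // notified by the bus model on preemption
    unsigned          total_beats;
};

// Returns the number of beats actually transferred; to be called from an
// SC_THREAD. Beat timing is assumed to divide L uniformly.
unsigned serve_burst(burst_ctx& b) {
    sc_core::sc_time start = sc_core::sc_time_stamp();
    sc_core::wait(b.latency, b.preempted);           // time OR preemption event
    sc_core::sc_time elapsed = sc_core::sc_time_stamp() - start;
    if (elapsed < b.latency) {                       // woken early: preempted
        double frac = elapsed / b.latency;
        return static_cast<unsigned>(frac * b.total_beats);
    }
    return b.total_beats;                            // completed undisturbed
}
```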

10.4 Model Implementation

10.4.1 An Adaptive FSL Model

The Fast Simplex Link (FSL) [25] is a FIFO based communication channel from Xilinx which can be used for unidirectional communication between components in FPGA based designs. For example, the MicroBlaze processor core, also from Xilinx, has input and output FSL interfaces which can be used in multiprocessor designs or for communication between the processor and dedicated co-processors. An FSL channel is a multi-clock FIFO with a configurable depth, which can be used in synchronous mode (i.e. same producer and consumer clocks) or asynchronous mode (i.e. different producer and consumer clocks).

Based on the principles introduced in Sect. 10.3.1, we have developed an adaptive model of FSL. The model is based on the OSCI TLM 2 standard [17]. It uses the blocking transport interfaces and is modeled as a passive, target-only sc_module. Internally, a FIFO data structure is used to keep track of the chunks, and each chunk may occupy more than one FIFO slot. The model is not clocked and only requires the periods of the producer and consumer clocks as constructor parameters. Each chunk has a write timestamp, which is the simulation time at which the producer started writing the chunk to the FIFO. Similarly, each chunk has a read timestamp, which is the simulation time at which the consumer started reading the chunk. Using the timestamps and the internal FIFO data structure, the number of free FIFO slots at any instant in simulation time can be determined. This is used to implement the adaptive behavior explained previously in Sect. 10.3.1.

Fig. 10.8 Transport calls for a FSL transfer

Figure 10.8 shows the sequence of b_transport calls from the producer and consumer for the example shown in Fig. 10.3(b).
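A producer-side call in the style of Fig. 10.8 might look as follows. The payload usage and the retry convention are assumptions for illustration, based on the OSCI TLM 2 blocking transport interface:

```cpp
#include <systemc>
#include <tlm.h>
#include <tlm_utils/simple_initiator_socket.h>

// Producer issuing one optimistic chunk write to the FSL channel model.
struct producer : sc_core::sc_module {
    tlm_utils::simple_initiator_socket<producer> socket;

    SC_CTOR(producer) : socket("socket") { SC_THREAD(run); }

    void run() {
        unsigned char chunk[64];                   // all items of one transfer
        tlm::tlm_generic_payload trans;
        trans.set_command(tlm::TLM_WRITE_COMMAND);
        trans.set_address(0);
        trans.set_data_ptr(chunk);
        trans.set_data_length(sizeof chunk);       // optimistic: everything at once
        sc_core::sc_time delay = sc_core::SC_ZERO_TIME;
        socket->b_transport(trans, delay);         // channel may accept fewer items
        // On congestion the channel would report a partial transfer (e.g. via
        // an extension), and the producer retries with the remainder later.
    }
};
```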

10.4.2 An Adaptive AHB Model

We have implemented an adaptive model of the AMBA AHB [3] in SystemC, based on the OSCI TLM 2 standard [17]. The implemented subset of features covers most of the AHB specification and is large enough to validate our proposed model; the missing features (e.g. RETRY transfers) do not affect our results. The bus model is implemented as a single sc_module with arrays of tlm_initiator_socket and tlm_target_socket sockets for the connection with the masters and slaves, respectively. Our model uses the non-blocking transport interfaces and utilizes the backward path. The bus model itself does not require a clock event, since all timing is handled based on the aforementioned handover event and pipeline model. Only the period of the bus clock is needed in some special cases (e.g. for a timed notification of a handover event when a request arrives at an idle bus); it is set as an attribute of the model.

A bus request starts with an nb_transport_fw call from the master to the bus, carrying the payload for that transaction. The generic payload is extended with the relevant AHB specific details and with information specific to the adaptive model. The most important extended attributes are the time stamp of the request, a preemption event, and the transaction type (ATT or STT). The time stamp is set at the time of the request by the bus model and is used later by the retrospective arbitration mechanism. The preemption event is notified by the bus model in case of a preemption and is used by the slaves for the accurate transfer of the data corresponding to the atomic part. The transaction type determines the requested accuracy of the data transfers. In AHB, the timing of the request signal is crucial for performing back-to-back pipelined transfers. To enable this, one FIFO data structure per master is used inside the bus model to hold the incoming payloads.

Fig. 10.9 Bus handover

Bus handover is implemented in an SC_METHOD sensitive to the handover event; a simplified form is shown in Fig. 10.9. AD and DT are the transactions in the address and data phases prior to the notification, and AD' and DT' those after the handover. Here the internal pipeline model (Sect. 10.3) is updated (a). The payload of the transaction entering the data phase is transported to the corresponding slave (b) (only once per burst), and the transaction entering the address phase is determined (c). In case of an idle data bus, a timed handover event is set by the bus model (d), so that the transaction currently in the address phase can resume accordingly.

The activity of the slaves is implemented in an SC_THREAD; for a simple, memory-like slave and a simple data transfer it is shown in Fig. 10.10. Processing starts with the reception of a payload (a), followed by an optimistic estimation of the latency of the transfer (b), determination of the correct amount of data (c), and the transfer of the data (d). The activity for bursts is similar. In the case of bursts, for accurate timing of the bus handover, a timed notification of the handover event by the slave at the end of the last address phase is necessary.

Figure 10.11 shows a typical sequence of transport calls along the forward and backward paths for a burst transfer. The master requests access (a), but a higher priority transaction occupies the address bus until (b), where the handover event is notified and retrospective arbitration is performed. Later (c), the payload is transported to the slave and the data phase begins. The nb_transport call at (d) is only used by the bus model to notify a handover event; the last call (e) causes a handover event and is then forwarded to the master on the backward path (f).
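The extended payload attributes mentioned at the beginning of this section could be declared along the following lines using the TLM 2 extension mechanism; the field names are assumptions:

```cpp
#include <systemc>
#include <tlm.h>

// Hypothetical TLM 2 extension carrying the adaptive-model attributes.
struct ahb_ext : tlm::tlm_extension<ahb_ext> {
    sc_core::sc_time  request_stamp;          // for retroactive arbitration
    sc_core::sc_event preempted;              // notified by the bus on preemption
    enum timing_t { ATT, STT } timing = ATT;  // requested data transfer accuracy

    tlm::tlm_extension_base* clone() const override {
        ahb_ext* e = new ahb_ext;
        e->request_stamp = request_stamp;
        e->timing = timing;                   // sc_event objects are not copied
        return e;
    }
    void copy_from(const tlm::tlm_extension_base& other) override {
        const ahb_ext& e = static_cast<const ahb_ext&>(other);
        request_stamp = e.request_stamp;
        timing = e.timing;
    }
};

// Usage: payload.set_extension(new ahb_ext); the bus model stamps
// request_stamp on nb_transport_fw and notifies 'preempted' when needed.
```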

Fig. 10.10 Processing in slaves
Fig. 10.11 Transport calls for an AHB transfer

10.5 Experimental Results

To evaluate the simulation performance of our models, we have compared the adaptive models to existing, fixed-accuracy models whose accuracies were comparable to the best level of accuracy which the adaptive models were able to deliver. For FSL we have used an sc_fifo based cycle accurate model, and for AHB we have used a cycle accurate OSCI TLM 1 model.

The adaptive FSL model achieves its highest simulation speed (i.e. it is at the highest abstraction level) when there is no congestion. To measure the simulation performance in this case, we used a small model consisting of a producer, a consumer and a FIFO model which was effectively unbounded for the simulation scenario. Figure 10.12 shows the results. Here the producer performs back-to-back transfers of increasing sizes without idle cycles, and similarly the consumer retrieves the items from the FIFO without any idle cycles. Since the cycle accurate model transfers the data byte-by-byte in all situations, its simulation performance stays effectively constant. However, without congestion, in the adaptive model data is always transferred in chunks, which results in a significant increase in simulation speed for larger chunks.

To measure the effect of congestion, we repeated the previous tests with different FIFO sizes. Predictably, the simulation performance of the cycle accurate model did not change considerably with a reduction in FIFO size.

Fig. 10.12 Maximum simulation performance of the FSL model
Fig. 10.13 Effect of FIFO congestion

Figure 10.13 shows the measurement results for the adaptive FSL model (the curves for the cycle accurate model have been omitted here for clarity). Using an adaptive model with a FIFO size of 1, data is transferred byte-by-byte, similar to the cycle accurate model; this represents the most accurate level of the adaptive model and its worst-case simulation performance. The reduction in the simulation speed with a limited FIFO size which still allows data transfers in chunks larger than one byte can be seen in the curve for a FIFO size of 50. For transfer sizes larger than 50 bytes, the transfer is broken into multiple chunks, and this results in a breakdown in the simulation performance. However, the model is still much more efficient than the cycle accurate model.

Fig. 10.14 Maximum simulation performance of the AHB model

Figure 10.14 compares the maximum simulation performance of the adaptive AHB model with that of a cycle accurate AHB model [22] in a single-master, single-slave setup (a comparison of the used cycle accurate model with the AHB models mentioned in the related work can be found in [22]). The master performs transfers of increasing sizes using the most efficient combination of AHB transactions (single data transfers and valid fixed length bursts). Again, the best performance of the adaptive model is achieved for larger transfers, where it is almost an order of magnitude faster than the cycle accurate model. This is the result of a reduction in the number of handover events per transaction, which is especially significant for large bursts. The worst-case performance of the adaptive model is reached for smaller transfer sizes, where one handover event is required per transaction.

As a result of bus traffic and the eventual preemption of transfers, the adaptive bus model will be at lower abstraction levels to maintain accuracy, and predictably the simulation performance will be lower than in the best case. This can be seen in Fig. 10.15.

Fig. 10.15 Effect of preemptions

Here a high priority master performing single data transfers with varying degrees of bus utilization is added to the system. In this setup, the worst starvation-free case occurs at u = 66%, where all preemptable transfers are preempted, which results in the decrease in simulation performance seen in the figure. For lower bus utilizations, the decrease in the simulation performance is also lower.

The above-mentioned scenarios are, however, rather synthetic. To measure the performance of the bus model in more realistic situations, we simulated different communication architectures for an ARM926EJ-S based implementation of an MPEG decoder. The four masters in the system were the processor (instruction and data interfaces), a DMA controller and an LCD controller, which communicated with six slaves. The traffic generators used in the simulations initiated bus transactions based on traces captured by a logic analyzer from a running system, and therefore represent real transaction patterns. The simulated communication architectures were chosen such that the interdependency of the transactions was respected as much as possible. Table 10.1 summarizes the results. Arch1 represents the communication architecture of the real system, which uses a multi-port memory controller (MPMC) and has four single-master buses. In Arch2, the DMA controller and the LCD controller (no overlapping transactions) share the same bus. In Arch3 and Arch4 the LCD controller uses the same data bus as the CPU; these architectures differ only in the priority assignments of the LCD controller and the processor and are therefore directly comparable. The slight decrease in performance due to the larger number of preemptions can be seen by comparing the results for Arch3 and Arch4.

Table 10.1 An ARM-based MPEG case study

                                  Arch 1   Arch 2   Arch 3   Arch 4
    Number of buses
    Number of transfers
    Number of preemptions
    Simulation speed (MCycles/s)

In Arch1 the bus model is always at the highest abstraction level, and therefore the highest performance is achieved (considering the number of buses in the model).

10.6 Conclusion

We have shown that by incorporating adaptivity into a transaction level model, it is possible to construct a single model which is capable of delivering the required level of accuracy for different situations during simulation. More importantly, an adaptive model does not unnecessarily sacrifice the simulation performance. Our experimental results clearly show the benefits of using adaptive models and the potential increase in simulation speed, which is the most important motivation for using transaction level models. This has proven to be a very promising research direction, and we are currently applying the proposed approach to other interconnect models such as networks-on-chip, as well as to processor models and reconfigurable architectures.

References

1. Accellera Organization, Inc. SystemVerilog 3.1a Language Reference Manual, May 2004.
2. R.W. Apperson, Z. Yu, M.J. Meeuwsen, T. Mohsenin, and B.M. Baas. A scalable dual-clock FIFO for data transfers between arbitrary and haltable clock domains. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 15(10), 2007.
3. ARM Limited. AMBA Specification, Version 2.0, May 1999.
4. ARM Limited. AMBA AHB Transaction Level Modeling Specification, Version 1.0.1.
5. G. Beltrame, D. Sciuto, and C. Silvano. Multi-accuracy power and performance transaction-level modeling. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD), 26, 2007.
6. M. Burton, J. Aldis, R. Guenzel, and W. Klingauf. Transaction level modelling: a reflection on what TLM is and how TLMs may be classified. In Proceedings of the Forum on Specification and Design Languages (FDL'07), September 2007.
7. L. Cai and D. Gajski. Transaction level modeling: an overview. In Proceedings of the 1st IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS'03), October 2003.
8. M. Caldari, M. Conti, M. Coppola, S. Curaba, L. Pieralisi, and C. Turchetti. Transaction-level models for AMBA bus architecture using SystemC 2.0. In Proceedings of the Conference on Design, Automation and Test in Europe (DATE'03), March 2003.


Chapter 11
Efficient Architecture Evaluation Using Functional Mapping

C. Kerstan, N. Bannow and W. Rosenstiel

Abstract For an efficient design flow, the substantiation of early design decisions is obligatory. This implies a simple and fast architecture evaluation using simulative approaches. This paper introduces an approach which enables powerful hardware/software partitioning and the reuse of already existing functional code by applying only minimal code modifications. The primary objective is to provide a solution that enables automated application. In this novel approach, code readability and transformation effort are improved significantly by using the powerful operator overloading mechanism of C++. The implementation can easily be customized and combined with other approaches concerning simulative design evaluation. For example, it is possible to realize implicit timing behavior, transparent communication across module boundaries, tracing of simulation data, or collection of debugging information.

Keywords SystemC · TLM-2.0 · Functional mapping · Communication and timing behavior

11.1 Introduction

Since the introduction of SystemC [9, 12], the improvement of the system-level design flow has been a main objective. Besides different abstraction levels, several tools and methods facilitate more efficient system development compared to classical approaches. High-level synthesis tools like Mentor's Catapult [6, 7] support the direct conversion of C/C++ and SystemC, respectively, to VHDL, assuming that hardware/software partitioning is already done. Object-oriented approaches like OSSS [2, 8] allow a more general refinement using SystemC. However, even with the SystemC approach there is still the need for a fast evaluation of several possible architectures before the final mapping decision and the hardware/software implementation. Several approaches exist to substantiate early design decisions and support a more or less flexible design flow. An approach to efficient system modeling is presented in [10], which partly uses high-level synthesis to obtain well-founded design decisions. As a result, the whole system can be synthesized in hardware, but the power of C/C++ and SystemC is reduced to a synthesizable subset.

C. Kerstan (✉) Robert Bosch GmbH, Corporate Sector Research and Advance Engineering, Schwieberdingen, Germany
e-mail: Christian.Kerstan@de.bosch.com

11.2 Functional Mapping

A more generic approach is presented in [3], introducing Module Adapters (MAs), which are responsible for inter-module communication, tracing, and data-dependent timing behavior (see Fig. 11.1). This adapter concept allows an implementation of the functional behavior in C/C++ that is independent of the architecture. The architecture itself can be realized in SystemC without regard to the concrete functional behavior. Therefore, the simulation model can be generated by using empty module shells together with already existing components, which can easily be combined thanks to the TLM-2.0 standard [13]. The module shells provide simple communication and tracing interfaces without realizing any functional behavior. The functional code has to be mapped into these shells to achieve the desired behavior (see Fig. 11.2). However, the code modifications necessary for mapping functionality onto an architecture are particularly laborious. The problem of parsing SystemC or C/C++ code for further analysis and processing is addressed by several works.

Fig. 11.1 The approach introduced in [3] allows architecture-independent functional implementation

Fig. 11.2 Functional mapping of C/C++ code to a SystemC architecture

Similar to this paper, [16] also uses the power of the C++ compiler instead of implementing its own parser. In contrast to [16], where a complete representation of a SystemC design is generated, the main focus of this approach is to smoothly enrich the C++ code with the required information to obtain initial but expressive architecture estimates. Since simulations can evaluate several system aspects, this work focuses on the timing behavior based on a correct functional execution.

11.2.1 Timing Behavior

The timing behavior of a module is introduced in different places. First of all, timing is caused by data transfers while communicating with other modules, due to the timing of the connected bus. The transferred data is generated by the functional code that is mapped into the different modules. The module's internal timing behavior for computation still has to be annotated in addition to the data transfer. This timing behavior can be gained from experience or is based on assumptions. If the module is realized in hardware, it can be extracted, for instance, with high-level synthesis, as done in [10]. If the functional code is supposed to run on a software processor, accurate timing behavior can be achieved by using a corresponding instruction set simulator (ISS). Such processor models are usually provided by commercial solutions from tool vendors like VaST [18], CoWare [5], or Synopsys [17]. To run functional code on such an ISS, the code has to be compiled for the specific processor model, which implies additional demands on the code. Because the compiled binary code does not run directly on the simulation host machine, the usage of an ISS can slow down the entire simulation. To increase the simulation speed, [15] and [19] introduce approaches which use code instrumentation to achieve accurate timing results without a detailed simulation of every single instruction. [19] uses debug information and static code analysis to annotate timing information for each code line. In contrast, the approach described in [15] divides the code into blocks. This hybrid approach annotates static timing information for the blocks and combines it with dynamic system conditions during runtime.

11.2.2 Conventional Code Transformation

Mapping existing functional C/C++ code into a SystemC design as described in [3] enables a fast and efficient analysis and evaluation of different hardware architecture realizations. This work uses TLM-2.0 compatible module shells which are very similar to the MAs, providing a simple interface for external communication. Therefore, the required communication and some other hardware aspects can be abstracted. This increases the readability of the code and allows architectural changes without modifying the functional code.

Listing 11.1 Simple functional C++ example

Fig. 11.3 Listing 11.1 mapped on a simple SystemC architecture

Listing 11.2 Mapping Listing 11.1 to the hardware architecture as shown in Fig. 11.4 requires several code modifications

Fig. 11.4 Functionally independent mapping of variables is more appropriate than the mapping shown in Fig. 11.3

If the functional code and all its variables are mapped to the same module, as shown in Fig. 11.3, no external communication and consequently no code modifications are necessary (see Listing 11.1). Variables which are located in other modules (see Fig. 11.4), however, require complex modifications. Using the conventional approach, the code has to be enriched with read and write instructions for every data access that is not local to the module (see Listing 11.2). Besides the lower readability, automation is very complex due to the expressive syntax of C/C++. Listing 11.2 shows a possible code structure for external communication. Some optimizations are possible, but a few issues have to be taken into account. For instance, there is only one generic interface for each module, which is the reason why each access needs more parameters than only the value. Alternatively, it would also be possible to have several method declarations (e.g., int reada(); void writea(int); int readb(); etc.) for data transfers to every variable that is mapped into another module.
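Since the original listings are not reproduced in this extraction, the following minimal sketch illustrates the conventional transformation; the variable names, the address map, and the bus helper functions are all assumptions made for illustration.

#include <cstdio>

// Stand-in for a memory located in another module (assumption).
static int remote_memory[2];
enum { ADDR_A = 0, ADDR_B = 1 };             // assumed address map

int  bus_read (unsigned addr)           { return remote_memory[addr]; }
void bus_write(unsigned addr, int data) { remote_memory[addr] = data; }

// Listing 11.1 style: plain functional code with local variables.
void compute_local(int& a, int& b) {
    b = a + 1;
    a = b * 2;
}

// Listing 11.2 style: a and b now live in another module, so every
// access becomes an explicit read or write over the generic interface.
void compute_mapped() {
    bus_write(ADDR_B, bus_read(ADDR_A) + 1); // was: b = a + 1;
    bus_write(ADDR_A, bus_read(ADDR_B) * 2); // was: a = b * 2;
}

int main() {
    compute_mapped();
    std::printf("a=%d b=%d\n", remote_memory[ADDR_A], remote_memory[ADDR_B]);
    return 0;
}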

11.3 Optimization Approach

To decrease the effort for code enrichment and to keep the code readable, a new approach is presented that uses the power of C++. The idea is to substitute the variable types by a meta class instance which, from the perspective of the functional code, acts like the original one. The template mechanism allows the substitution of most kinds of variables. The expected behavior is realized by operator overloading; thus, the access over an MA is transparent to the programmer. The only required modification is the replacement of all concerned original objects or variable declarations, each by its corresponding meta class instance. A basic requirement is a light, fast, and extensible solution which covers most use cases. Only a few operators need to be overloaded to allow the compiler to handle all possible operations, using implicit conversions where necessary. To avoid incorrect read() or write() operations, most overloaded operators have the original type either as parameter or as result.

11.3.1 Class Unitized

For type independence, the proposed meta class unitized uses the template mechanism. As depicted in Listing 11.3, it is designed for various usages by providing two methods which have to be customized by inheritance. First of all, there have to be some constructors which are necessary for several implicit operations used by the compiler (see Listing 11.4). Initialization of a unitized object during its definition by a value of the substituted type is enabled by a corresponding constructor. Furthermore, there also has to be a copy constructor to allow initialization by already existing unitized objects. The definition of these constructors also causes the compiler to ask for the declaration of a default constructor. The very important cast operator (see Listing 11.5) is responsible for implicit casting to the original type which is represented by this object. It is especially needed for calculations and assignments to other variables.

Listing 11.3 Declaration of unitized and its virtual methods which have to be overwritten

Listing 11.4 It is essential to allow object initialization by appropriate constructors

Listing 11.5 This operator is responsible for implicit and explicit type casts of unitized objects

Listing 11.6 Supporting the assignment of unitized or originally typed variables and values

Another special operator which has to be modified is the assignment operator. Due to the possible combinations of different parameter and result types, overwriting this operator can cause undesirable side effects in combination with other overloaded operators and declared constructors. Listing 11.6 shows an implementation avoiding unnecessary operations during an assignment. The described implementation avoids side effects by overwriting only operators which cannot implicitly be substituted by the compiler. As with the compound assignment operators, every other operator modification has to be verified carefully. Nevertheless, the modifications of the remaining operators are very similar to Listing 11.7. After implementing every other operator matching the pattern <operator>= and the increment/decrement operators, the class unitized can be adapted by inheritance.
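Since the bodies of Listings 11.3 through 11.7 are not reproduced here, the following compact sketch illustrates the described class along those lines. All member names and implementation details are assumptions; in particular, read()/write() are given a default local backing store only to keep the sketch executable, whereas the text declares them for overwriting by subclasses.

// Sketch of the unitized meta class (assumed layout, not the
// published code).
template <typename T>
class unitized {
public:
    // Constructors needed for implicit operations (Listing 11.4 style).
    unitized() : value() {}
    unitized(const T& v) : value(v) {}
    unitized(const unitized& other) : value(other.read()) {}

    // Implicit cast to the substituted type (Listing 11.5 style),
    // used by the compiler for calculations and plain assignments.
    operator T() const { return read(); }

    // Assignment from T and from another unitized (Listing 11.6 style).
    unitized& operator=(const T& v)            { write(v); return *this; }
    unitized& operator=(const unitized& other) { write(other.read()); return *this; }

    // One exemplary compound operator; -=, *=, /=, %=, |=, &=, ^= and
    // ++/-- follow the same pattern (Listing 11.7 style).
    unitized& operator+=(const T& v) { write(read() + v); return *this; }

    virtual ~unitized() {}

protected:
    virtual T    read() const      { return value; }  // customization point
    virtual void write(const T& v) { value = v; }     // customization point
    T value;  // default backing store; subclasses may ignore it
};

Note that the address operator & is deliberately left untouched; the consequences of this design choice are discussed in the limitations section below.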

Listing 11.7 The remaining compound assignment operators (-=, *=, /=, %=, |=, &=, ^=) and the increment/decrement operators (++/--) have to be implemented analogously to the exemplary += operator realization

Listing 11.8 Customizing unitized to realize automated logging using log4cxx [1]

11.4 Customize and Apply Unitized

To demonstrate how to customize and use the introduced unitized class, it is exemplarily adapted for tracing purposes. Instead of realizing the communication between modules, the first inherited class (see Listing 11.8) saves the value internally and logs every access in order to trace simulation data. The tracing itself is based on the open-source framework log4cxx [1], which provides flexible and efficient logging. By overwriting the read and write methods, the proper value is stored in value, and adequate logging is done using the log4cxx logger referenced by logger. Of course, the constructors also have to be implemented to initialize the internal variables. For completeness, the extract of logging.h (see Listing 11.9) points out how log4cxx is used in this case.
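As a hedged sketch of the u_trace idea from Listing 11.8: the original logs through log4cxx, which is replaced here by plain std::ostream output so the sketch stays dependency-free, and the class layout and member names are assumptions.

#include <iostream>
#include <string>

// Tracing customization of the unitized sketch above.
template <typename T>
class u_trace : public unitized<T> {
public:
    u_trace(const std::string& logger_name = "root")
        : name(logger_name) {}
    u_trace(const T& v, const std::string& logger_name = "root")
        : unitized<T>(v), name(logger_name) {}

    using unitized<T>::operator=;   // keep the inherited assignments visible

protected:
    T read() const override {
        std::cout << "[" << name << "] read  -> " << this->value << "\n";
        return this->value;
    }
    void write(const T& v) override {
        std::cout << "[" << name << "] write <- " << v << "\n";
        this->value = v;
    }

private:
    std::string name;   // stands in for the log4cxx logger reference
};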

Listing 11.9 LOG realizes simple applicability similar to cout and avoids superfluous invocations

Listing 11.10 Tracing a from Listing 11.1 requires only minimal modifications

Fig. 11.5 Generated output of Listing 11.10, respecting every access of a

Listing 11.11 The additional tracing of b requires a similar extra modification

Fig. 11.6 Accesses to a or b of Listing 11.11 are traced by the default logger

11.4.1 Application of u_trace

The variables of Listing 11.1 can now be traced by replacing them with u_trace objects. The only essential modification to trace a is changing the type from int to u_trace<int> (see Listing 11.10). Because variable a is constructed without specifying an explicit logger, the root logger is taken by default (see Fig. 11.5). By modifying the type of b in the same way, the modified code looks as presented in Listing 11.11. During execution, every access to a or b is traced as shown in Fig. 11.6.
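Continuing the u_trace sketch above, the application described here then only touches the declarations; the following is a hypothetical reconstruction in the spirit of Listings 11.10 and 11.11, with assumed variable names.

int main() {
    u_trace<int> a(0);        // was: int a = 0;  -- root/default logger
    u_trace<int> b(0, "b");   // was: int b = 0;  -- named logger

    b = a + 1;                // logs a read of a and a write of b
    a = b * 2;                // logs a read of b and a write of a
    return 0;
}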

Listing 11.12 Customizing unitized for TLM-2.0 based communication purposes

There is only a minor change needed for tracing variable accesses, so an automatic modification of the code can be realized without parsing and treating the whole code.

11.5 Using the Approach in the Design Flow

To use the new approach for timing evaluation, a new unitized successor has to implement the communication demands. In contrast to u_trace, it has to pass requests directly to the corresponding SystemC module. Therefore, it does not store any values but has to keep a reference to the initiator port of the parent module and, due to the memory-mapped system, the address to access the correct module and its content (see Listing 11.12). The included master module inherits from the initiator_socket of TLM-2.0 and provides a simple read/write interface. u_map enables the modification of Listing 11.1 to achieve the same result as Listing 11.2 while almost keeping the original code and thus maintaining readability (see Listing 11.13). Figure 11.7 outlines the system configuration from the point of view of the u_map objects.

11.5.1 Handling Arrays

A small enhancement of u_map enables the treatment of arrays as easily as normal variables. Listing 11.14 implements such an extension for one-dimensional arrays.
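A hedged sketch of the u_map idea from Listings 11.12 and 11.14, building on the unitized sketch above: the TLM-2.0 initiator socket is reduced to an abstract two-method master interface here, and all names are assumptions.

#include <cstdint>

// Stand-in for the master module wrapping a TLM-2.0 initiator socket.
struct bus_master {
    virtual void bus_read (uint64_t addr, void* data, unsigned size) = 0;
    virtual void bus_write(uint64_t addr, const void* data, unsigned size) = 0;
    virtual ~bus_master() {}
};

template <typename T>
class u_map : public unitized<T> {
public:
    u_map(bus_master& port, uint64_t address)
        : master(port), addr(address), elem_size(sizeof(T)) {}

    // Copying must not touch the bus; only the mapping is duplicated.
    u_map(const u_map& o)
        : unitized<T>(), master(o.master), addr(o.addr), elem_size(o.elem_size) {}

    using unitized<T>::operator=;

    // Array support (Listing 11.14 style): operator[] returns a
    // temporary u_map pointing directly at the element's address.
    u_map operator[](unsigned index) const {
        return u_map(master, addr + index * elem_size);
    }

    void setElementSize(uint64_t s) { elem_size = s; }  // row size for 2-D use

protected:
    T read() const override {                 // forward request over the bus
        T v;
        master.bus_read(addr, &v, sizeof(T));
        return v;
    }
    void write(const T& v) override {
        master.bus_write(addr, &v, sizeof(T));
    }

private:
    bus_master& master;     // initiator port of the parent module
    uint64_t    addr;       // address within the memory-mapped system
    uint64_t    elem_size;  // stride used by operator[]
};

The temporary returned by operator[] already carries the element's absolute address, so read() and write() need no extra offset parameter, mirroring the design choice described next.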

Listing 11.13 In contrast to Listing 11.2, the same mapping of Listing 11.1 is realized with almost no modifications

Fig. 11.7 Outline of the mapped Listing 11.13 and the intent of the u_map constructor parameters

Listing 11.14 Required operator overloading to empower u_map to handle common 1-D arrays

Fig. 11.8 Depiction of Listing 11.14 and its meaning in the mapped architecture

The overloaded [] operator creates a new u_map object, calculating the correct offset using the index parameter and the size of the content type. Certainly, the complexity of the calculation increases with each dimension. The implicit temporary creation of a u_map object (line 25 in Listing 11.14) avoids an interface extension of the read and write methods with an offset parameter. The new u_map object directly points to the correct address without needing an extra offset (see Fig. 11.8). Due to this, less overhead is generated, which should not impact the performance dramatically. However, one deficit occurs at declaration, because there is no longer a difference between a u_map and an array of the same type.

11.6 Design Example

The applicability of u_map will be demonstrated with a more application-relevant example which was introduced in [11]. A discrete Laplace operator has to be applied to image data (see Fig. 11.9). An implementation of such an operator is shown in Listing 11.15. While the image dimensions are defined as macros, the images themselves are represented by ordinary two-dimensional arrays. For the reason of simplicity, the algorithm is implemented assuming that source_picture contains proper image data.
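A hedged reconstruction of what Listing 11.15 may look like: the macro names, the pixel TYPE, and the kernel weights are assumptions, and the 512 x 512 dimensions are inferred from the timing calculation in the simulation results below.

#define WIDTH  512
#define HEIGHT 512
#define TYPE   int

TYPE source_picture[HEIGHT][WIDTH];
TYPE target_picture[HEIGHT][WIDTH];

void laplace() {
    // Border pixels are skipped, as stated in the text.
    for (int y = 1; y < HEIGHT - 1; ++y) {
        for (int x = 1; x < WIDTH - 1; ++x) {
            TYPE sum = 0;
            for (int dy = -1; dy <= 1; ++dy)      // 9 reads per pixel
                for (int dx = -1; dx <= 1; ++dx)
                    sum += source_picture[y + dy][x + dx];
            // 8-neighbour Laplace kernel (+8 centre, -1 elsewhere),
            // i.e. 9*c - sum of the 3x3 neighbourhood.
            target_picture[y][x] = 9 * source_picture[y][x] - sum;  // 1 write
        }
    }
}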

Fig. 11.9 2-D Laplace operator realized in the design example, exemplarily applied on Lenna [14]

Fig. 11.10 Target hardware architecture of the design example

Listing 11.15 Common basic implementation of a discrete two-dimensional Laplace operator

The first hardware architecture, shown in Fig. 11.10, consists of a processing unit executing the algorithm as well as a standard router component distributing requests to one of two standard memory components, which are responsible for the source_picture and target_picture image data. By using only standard components and generic module shells for communication, the architecture can be generated automatically. The necessary address mapping can be achieved through approaches as described in [4]. The remaining task is to map the functional code into the processing module. Since merely copying the code (see Listing 11.15) into the run() method is not sufficient, it has to be modified. Especially the accesses to the arrays source_picture and target_picture, which are mapped to different components in the architecture (see Fig. 11.10), have to be considered. Using the described u_map, the only required modification affects the declaration of both arrays (see Listing 11.16). Besides TYPE, which is globally defined for the pixel size, the image address offsets are defined by the address mapping. i_port is the initiator port of the specific component, and setElementSize() is necessary for the correct address decomposition when using two-dimensional arrays.

Listing 11.16 The two-dimensional arrays in Listing 11.15 require some extra modifications
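Correspondingly, a hedged reconstruction of the Listing 11.16 modification, reusing the u_map and the WIDTH/TYPE definitions from the sketches above; the address constants and the surrounding module structure are assumptions.

const uint64_t SOURCE_OFFSET = 0x00000000;   // assumed address map
const uint64_t TARGET_OFFSET = 0x00100000;

struct processing_module {
    bus_master& i_port;               // provided by the module shell
    u_map<TYPE> source_picture;       // was: TYPE source_picture[HEIGHT][WIDTH];
    u_map<TYPE> target_picture;       // was: TYPE target_picture[HEIGHT][WIDTH];

    processing_module(bus_master& port)
        : i_port(port),
          source_picture(i_port, SOURCE_OFFSET),
          target_picture(i_port, TARGET_OFFSET) {
        // One image row holds WIDTH pixels, so the first [] of
        // pic[y][x] must advance by WIDTH * sizeof(TYPE) bytes; the
        // temporary returned by it then indexes single pixels.
        source_picture.setElementSize(WIDTH * sizeof(TYPE));
        target_picture.setElementSize(WIDTH * sizeof(TYPE));
    }
};

With these declarations, the filter loop from the Laplace sketch can run unchanged against the member arrays; every pixel access now becomes a transparent bus transaction.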

Fig. 11.11 The simulated time advancement can be extracted from the simulation output of Listing 11.16

11.6.1 Simulation Results

Excluding the filter module, which initially has no timing, the delay of each component in this example is set to 1 ns. For each pixel, 9 pixels have to be read and the result has to be written. Because the border pixels are not processed and one access needs 2 ns, the whole processing time for a 512 × 512 image can easily be determined:

(512 − 2) · (512 − 2) · (9 + 1) · 2 ns = 5202 µs

Therefore, it is easy to analyze the simulated time advancement (see Fig. 11.11).

Fig. 11.12 Extending the hardware architecture of Fig. 11.10 by an extra buffer between the processing unit and the router

Fig. 11.13 Simulation output of the architecture shown in Fig. 11.12

But the complexity already increases significantly when a buffer with a size of 9 entries, each the size of one pixel, is inserted between the processing unit and the router (see Fig. 11.12). With the buffer delay also set to 1 ns, accessing already buffered data needs only 1 ns, whereas one memory access now takes 3 ns. Apart from the first pixel of each line, only 3 read memory accesses are necessary, since the other 6 pixel values are already available in the buffer (see Fig. 11.9). This results in an increase of the overall performance (see Fig. 11.13). Considering the later substitution of the router by a more complex bus and a more realistic timing annotation, it is an even more reasonable optimization. Using the approaches described in [15] and [19], the missing internal timing behavior can be obtained. In combination with unitized and TLM-2.0, the synchronization, i.e., the number of wait statements, can be reduced to a minimum.

11.7 Limitations and Experiences

Because unitized overloads only the necessary operators, the compiler has to apply implicit casts and operator substitution. Except for the address operator (&), even operators that are not explicitly overloaded can be used without causing undesirable side effects. To avoid direct addressing of the content of the unitized object, the & operator was not overloaded, although it cannot be substituted. Therefore, its usage refers to the unitized object itself. The occurrence of the address operator in the original source code may lead to compiler errors which have to be fixed manually. Such an error indicates the assignment of a unitized object address instead of the expected address of the original variable, or vice versa.

Listing 11.17 Exemplary implementation of a parameter-driven sawtooth-shaped signal

Listing 11.18 Partial variable mapping can cause compiler errors

However, direct access to the unitized content is not allowed, because changes and requests cannot be tracked using references or pointers. The following fictive example depicts the problem and its solutions. The exemplary application outlined in Listing 11.17 generates a sawtooth-shaped signal using parameters for its amplitude (posamp and negamp) and its edge slope (inc and dec). For this purpose, the value of the variable which is referenced by op is added to signal. If signal reaches or crosses the maximum amplitude, op switches to the other variable. Except for the local variable op, all other variables are located in other modules, e.g., sensors or memories. That is why they are substituted by adequate unitized objects (see Listing 11.18). In practice, the introduced u_map would be used, but for generality unitized is used here. In Listing 11.18, compiler errors occur in every line where op has to be set. All errors¹ are related to the incompatibility of int* and unitized<type>*. A simple remedy is to replace the type of the local pointer:

unitized<int>* op = &inc;

An alternative solution is to customize unitized similarly to u_map, which allows the implicit modification of the internal address and therefore acts like a pointer, e.g.:

unitized<unitized<int>*> op = &inc;

The unitized approach was evaluated in various applications. In this process, even parameter passing was considered. However, due to the power of the C++ language and the variety of compilers, overall correctness cannot be guaranteed. In particular, experience with complex objects which may overload operators themselves is very limited. Nonetheless, unitized is constructed to induce compiler errors instead of unwanted behavior.

¹ C2440, C2446 (compiler errors using Microsoft Visual C++).
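A hedged reconstruction of the sawtooth example of Listing 11.17 in its original, unmapped form; only the variable names are taken from the text, everything else is an assumption.

int posamp =  10, negamp = -10;   // amplitude parameters
int inc    =   2, dec    =  -3;   // edge-slope parameters
int signal =   0;                 // generated sawtooth signal

void sawtooth_step() {
    static int* op = &inc;        // local pointer selecting the slope
    signal += *op;
    if (signal >= posamp) op = &dec;   // crossed the maximum amplitude
    if (signal <= negamp) op = &inc;   // crossed the minimum amplitude
}

Substituting inc, dec, posamp, negamp, and signal by unitized<int> objects then makes every assignment to op ill-typed (int* versus unitized<int>*), which is exactly the class of errors discussed above.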

11.8 Summary

The introduced approach simplifies the mapping of C/C++ code to SystemC architectures and thereby enables the reuse of existing functional models. Due to the minimal changes, the readability of the original code is retained, and neither a complex dedicated code parser nor massive manual modifications are necessary. The demonstrated examples depict the applicability and give some incitements for realistic use cases, e.g., tracing purposes. Separating the functional behavior and the system architecture reduces the complexity of obtaining a flexible simulation model. Such an architecture can be changed in structure and granularity without touching the functional code. Only if some formerly internal variables have to be mapped to external modules do appropriate modifications in the code have to be applied. In combination with available approaches concerning timing behavior like [3, 15, 19], a fast and accurate architecture evaluation becomes possible.

11.9 Outlook

The proposed combination with [15] and [19] has to be evaluated in more detail. Furthermore, the substitution of more complex objects has to be investigated. In addition to the presented use cases and specific implementations, there are several other possibilities where this approach could be valuable. For instance, using the System Analysis Toolkit (SAT) introduced in [11], an enhanced analysis and diagnosis can easily be adapted to an existing design. Thereby, it is also possible to realize a direct online analysis of simulation data without the usage of files.

References

1. Apache. log4cxx. August.
2. N. Bannow and K. Haug. Evaluation of an object-oriented hardware design methodology for automotive applications. In DATE, February.
3. N. Bannow, K. Haug, and W. Rosenstiel. Performance analysis and automated C++ modularization using module-adapters for SystemC. In FDL, September.
4. N. Bannow, K. Haug, and W. Rosenstiel. Automatic SystemC design configuration for a faster evaluation of different partitioning alternatives. In DATE, March.
5. CoWare Inc. Processor Designer.
6. FORTE Design Systems. Cynthesizer.
7. Mentor Graphics. Catapult Synthesis.
8. E. Grimpe, B. Timmermann, T. Fandrey, R. Biniasch, and F. Oppenheimer. SystemC object-oriented extensions and synthesis features. In FDL, September.
9. T. Groetker. System Design with SystemC. Kluwer Academic, Dordrecht.
10. C. Haubelt, J. Falk, J. Keinert, T. Schlichter, M. Streubühr, A. Deyhle, A. Hadert, and J. Teich. A SystemC-based design methodology for digital signal processing systems. EURASIP Journal on Embedded Systems, 2007(1):15, 2007.

11. C. Kerstan, N. Bannow, and W. Rosenstiel. Closing the gap in the analysis and visualization of simulation data for automotive video applications. In edaworkshop 08, 2008.
12. OSCI. SystemC.
13. OSCI. TLM-2.0, June 2008.
14. A. Sawchuk. Lenna Sjööblom. Signal & Image Processing Institute, University of Southern California, July.
15. J. Schnerr, O. Bringmann, A. Viehl, and W. Rosenstiel. High-performance timing simulation of embedded software. In Proceedings of the 45th ACM/IEEE DAC, June 2008.
16. T. Schubert and W. Nebel. The Quiny SystemC front end: self-synthesising designs. In FDL, September.
17. Synopsys, Inc. Virtual Platforms.
18. VaST Systems Technology. CoMET.
19. Z. Wang, A. Sanchez, A. Herkersdorf, and W. Stechele. Fast and accurate software performance estimation during high-level embedded system design. In edaworkshop 08, 2008.

Chapter 12
Symbolic Scheduling of SystemC Dataflow Designs

Jens Gladigau, Christian Haubelt and Jürgen Teich

Abstract In this chapter, we propose a quasi-static scheduling (QSS) method applicable to SystemC dataflow designs. QSS determines a schedule in which several static schedules are combined into a dynamic schedule. This, among other things, reduces runtime overhead. QSS is done by performing as much static scheduling as possible at compile time and treating only data-dependent control flow as a runtime decision. Our approach improves on known quasi-static approaches in that it is automatically applicable to real-world designs and has fewer restrictions on the underlying model. The effectiveness of the approach, which is based on symbolic computation, is demonstrated by scheduling a SystemC design of a network packet filter.

Keywords Software scheduling · Quasi-static · SystemC · Symbolic methods

12.1 Introduction

SystemC is a system description language convenient for implementing dataflow models and widely used when developing embedded systems, typically starting with an abstract model. Further along the design flow, the abstract model is partitioned, and individual modules or clusters of modules are mapped to resources of a target architecture. These architectures may consist of processors capable of executing software, hardware accelerators, memories, etc. When generating software for an embedded processor from high-level SystemC designs, a transformation of concurrently executed modules into sequential code is needed. It is possible to manually recode the software part, but this is costly and may introduce errors. We concentrate on approaches which automatically generate C-based software from a high-level SystemC design; see [6, 7, 14] for examples of such design flows. Scheduling this software could be done, e.g., by using a real-time operating system or by re-implementing the SystemC scheduler. Another approach is to use a generic and dynamic scheduling policy such as round-robin. All these approaches, however, are affected by potential scheduling overhead. While generating C-based software from SystemC designs is out of scope

J. Gladigau (✉) Department of Computer Science, University of Erlangen-Nuremberg, Am Weichselgarten 3, Erlangen, Germany
e-mail: jens.gladigau@cs.fau.de

for this work, we concentrate on a substitute for the generic and dynamic software scheduler. For some dataflow designs it is possible to generate static schedules (e.g., [2]). However, only limited models such as synchronous or cyclo-static dataflow models can be scheduled statically [11, 15], and dynamic approaches (like round-robin) introduce considerable overhead, so quasi-static scheduling (QSS) appears to be a possible remedy. Static schedules can be repeated infinitely often without the need for runtime decisions, while dynamic schedules imply runtime decisions before each execution step. In QSS, compile-time analyses are applied to reduce runtime decisions to the minimum of data-dependent decisions, which naturally are only decidable at runtime. A quasi-static schedule consists of static sequences and unavoidable runtime decisions, so-called conflicts. At runtime, these conflicts select from alternative static sequences. That is, only inevitable data-dependent scheduling decisions are decided at runtime, and static decisions are made a priori (at compile time) where possible. While this naturally results in more complex scheduling code, the runtime overhead is minimized. To further reduce runtime overhead, checking for the availability of input data and buffer space can be omitted; in QSS, this is already considered at compile time. Besides, guarantees on memory requirements and better predictions (like worst-case execution time and dead code) can be given, which is difficult (or impossible) in the presence of dynamic schedules. Even deadlock situations resulting from bounded buffers are avoided. To summarize, important properties of quasi-static schedules are: (1) minimized runtime overhead, (2) avoidance of deadlocks, and (3) knowledge of exact memory requirements.

In this paper, we propose a quasi-static scheduling approach for SystemC dataflow designs. More precisely, we focus on generating a scheduler for those parts of the model implemented as software on a single embedded processing unit. This chapter is organized as follows: To extract the information necessary for the scheduling process from the SystemC model, a well-defined model description is mandatory. The underlying principles are described in Sect. 12.2. To utilize symbolic methods, a symbolic representation of SystemC models is needed; how to obtain such a representation is explained in Sect. 12.3. For the proposed model of computation, a novel quasi-static scheduling approach is introduced in Sect. 12.4. In Sect. 12.5, related work is reviewed and we differentiate our proposal from it. Finally, we show results from applying our symbolic quasi-static scheduling to a SystemC design, a network packet filter, in Sect. 12.6.

12.2 Model of Computation

Our goal is a quasi-static scheduling method to find schedules for dataflow SystemC models. For this purpose, a well-defined model of computation (MoC) represented in SystemC is needed. A common way to model dataflow designs is actor-oriented modeling, which is also used in modern embedded system design [12]. In actor-oriented modeling, a system is described by concurrently executed entities (called actors)

Fig. 12.1 Two actors A1 and A2 communicate via two channels c1 and c2. The communication behavior is defined by finite state machines. Predicates (demands on buffers and guard functions) and an action function are annotated on each transition

which communicate among each other only via dedicated channels. In order to allow as much automation as possible, we require the SystemC model to be transformed into SysteMoC [5], a SystemC library capable of extracting model information for analyses. Other SystemC-related approaches to modeling different MoCs exist, e.g., [8, 16]. In contrast to [16], SysteMoC does not demand modifications of SystemC's simulation kernel and is built upon standard SystemC, but avoids context switching to gain simulation speed. Additionally, SysteMoC focuses on model extraction and analyses at the system level; the designer does not have to choose a certain MoC a priori, but the most restricted MoC (ranging from SDF over KPN to non-deterministic dataflow) can be determined automatically based on principles presented in [24].

In a SysteMoC description, each SystemC module implements an actor which is defined by a finite state machine specifying the communication behavior, and by functions controlled by this finite state machine. See Fig. 12.1 for a graphical representation of a simple SysteMoC model. There, two modules A1 and A2, called actors, communicate via two FIFO channels c1 and c2. Their communication behavior is defined by the depicted state machines. On each transition, data consumption and production rates are annotated together with the associated function func_i. For example, see the transition in actor A2 from state 0 to state 1, annotated with i1(1)/func3. After this transition is taken, one single input datum, called a token, is consumed from the connected channel c1 (via port i1). Functions (like func3) are executed atomically, and data consumption and production are done after computing. Constant functions (e.g., g in Fig. 12.1), called guards, are used to test values of internal variables and of data in the input channels. Hence, SysteMoC resembles FunState (Functions driven by State machines) [22] and also realizes a rule-based model of computation [17]. If the predicates (guards and demands on available tokens/free space) annotated to a transition evaluate to true, this transition is enabled. If more than one transition is enabled, one is chosen nondeterministically to be taken, and the annotated method func_i, called an action, is performed atomically.

If an application is already available as a SystemC specification, it can be transformed into a SysteMoC model if some restrictions apply, e.g., communication is done via SystemC FIFOs and a sole SC_THREAD is used. We sketch the transformation for the SystemC module Ct_merge from Fig. 12.2. The figure shows the design of a rule-based network packet filter with connection tracking capabilities.

Fig. 12.2 A SystemC model of a network packet filter. The Ct_merge module is zoomed in, and its implementation as a SysteMoC actor with its communication-controlling finite state machine is shown. Actions req and res are member functions of the module

class Ct_merge : public sc_module {
    // ... ports & constructor including SC_THREAD(process);
    void process() {
        while (1) {
            if (o3.num_free() && i1.nb_read(request)) {
                o3.write(request);
                i3.read(response);
                o1.write(response);
            }
            if (o3.num_free() && i2.nb_read(request)) {
                o3.write(request);
                i3.read(response);
                o2.write(response);
            }
        }
    }
};

Listing 12.1 Simplified Ct_merge SystemC module

Network packets arrive from connected interfaces or from the host the firewall is running on, and packets are forwarded (or discarded) based on a firewall rule set. More explanation of the model follows in Sect. 12.6. The simplified SystemC source code of Ct_merge is given in Listing 12.1. The SystemC module Ct_merge dispatches incoming requests for connection tracking entry look-ups to the module Ct_entries and forwards the corresponding responses. Transformed into an equivalent SysteMoC actor (right side of Fig. 12.2), the finite state machine controlling the communication behavior checks for available input data and available space on the output channels to store results. If this is fulfilled, the transition is activated, and on taking it the corresponding action function

(e.g., req) is executed. Actions can access the values of the FIFOs, similar to SystemC, through ports. Additionally, a read or write offset is allowed (but not needed in this example). This is implemented with the help of bracket operators indicating a read or write offset, as shown in the following listing, the implementation of both actions of the SysteMoC actor Ct_merge.

void req(in_port &in)   { o3[0] = in[0]; }
void res(out_port &out) { out[0] = i3[0]; }

12.3 Symbolic Representation

To reduce the complexity of the data structures and computations, an abstract view of SysteMoC models is used, which for many models is sufficient for the scheduling task. This abstract view only considers the communication behavior of the model; concrete data values, data transformations performed by actions, and guard functions are neglected. The abstract model state is then given by the current fill size of each FIFO channel and the current state of each actor's state machine. For example, the abstract state of the model shown in Fig. 12.1 is defined by the tuple (q1, q2, s) of integers, with q1/q2 being the fill size of channel c1/c2, and s ∈ S, with S = {0, 1, 2}, being the state of actor A2. (A variable for the sole state in actor A1 can be omitted.) The model's state space is spanned by these integers, given the domain sets Qi = [0, max(ci)], with max(ci) being the maximum fill size of channel ci. For the example, the state space is Q1 × Q2 × S. The start state of the example is (q1, q2, s) = (0, 0, 0), and taking, e.g., transition o1(1)/func1 followed by transition i1(1)/func3 results in the abstract state (1, 0, 0) followed by state (0, 0, 1).

Scheduling such models is done by traversing their state space, as we show later. Traversing could be done by enumeration, but due to huge state spaces this is, in general, prohibitive. We need a way to traverse the state space implicitly to avoid enumeration. Symbolic techniques consider the state transition system implicitly rather than manipulating state sets directly, so we use symbolic techniques. In particular, a symbolic representation of SysteMoC models as described above is needed first. Similar to symbolic model checking [13], this is done by means of characteristic functions for model states and the state transition relation. Given a finite state machine m = (X, T, x0) with state set X, a set of transitions T, and an initial state x0 ∈ X, a state set S ⊆ X is represented by its characteristic function χ_S(s) with

χ_S(s) = 1 if s ∈ S, and χ_S(s) = 0 otherwise.

The characteristic function χ_T(s, s') of the transition relation T is defined as

χ_T(s, s') = 1 if (s, s') ∈ T, and χ_T(s, s') = 0 otherwise.

Fig. 12.3 An interval decision diagram (IDD, left) and an interval mapping diagram (IMD) of characteristic functions for the SysteMoC example in Fig. 12.1. The IDD represents the state set S = {(0, 0, 0), (1, 0, 0)}, and the IMD encodes the transition relation

These characteristic functions allow the usage of symbolic functions, called image and preimage, which are the basis for the symbolic scheduling. We define them similarly to [10].

Definition 12.1 (Image) The image(S, T) of a set of states S and a set of transitions T is the set of all states which can be reached from any state s ∈ S by taking a transition t ∈ T: image(S, T) = {s' | ∃s : χ_S(s) ∧ χ_T(s, s')}.

Definition 12.2 (Preimage) The preimage(S, T) of a set of states S and a set of transitions T is the set of all states which can reach any state s' ∈ S by taking a transition t ∈ T: preimage(S, T) = {s | ∃s' : χ_S(s') ∧ χ_T(s, s')}.

In our implementation, we use interval diagrams [20] for symbolic encoding, which are similar to Boolean decision diagrams (BDDs) [3]. Instead of Boolean values, a value range is associated with each variable, and intervals are annotated on outgoing arcs. The approach is not limited to interval diagrams; other symbolic representations (like BDDs) could be used. We use interval diagrams because they are well suited for representing transition systems with FIFO communication. A detailed description of these diagrams can be found in [20, 21]; here we give an informal introduction to interval diagrams as used in this work. There are two types of interval diagrams: (1) interval decision diagrams (IDDs), used to represent characteristic functions of state sets, and (2) interval mapping diagrams (IMDs), used to represent characteristic functions of transition relations. For the SysteMoC example from Fig. 12.1, the symbolic representation as interval diagrams is shown in Fig. 12.3, given a maximum fill size of eight for both channels. The left diagram encodes the characteristic function of the state set S = {(0, 0, 0), (1, 0, 0)}. Similar to BDDs, IDDs have a single root node, terminal nodes, and a variable order. When encoding Boolean functions, there are only two terminal nodes, 0 and 1, representing the function's result. Depending on the actual values of the variables, paths from the root node lead either to the 1 or the 0 terminal node. As the IDD encodes the characteristic function for the state set S = {(0, 0, 0), (1, 0, 0)}, only (q1, q2, s) = (0, 0, 0) and (q1, q2, s) = (1, 0, 0) lead to the 1 terminal node. All transitions of a SysteMoC model are encoded in a single interval mapping diagram, regardless of the number of different state machines (actors). Using this transition relation for the image operation results in the set of states which can be reached by a single transition of any actor.
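To pin down the semantics of Definitions 12.1 and 12.2, the following sketch implements image and preimage explicitly over enumerated sets; the symbolic implementation performs the same operations on interval diagrams instead, and the integer state encoding is an assumption made for brevity.

#include <set>
#include <utility>

using State      = int;
using Transition = std::pair<State, State>;          // (s, s')

// image(S, T): all states reachable from some s in S by one t in T.
std::set<State> image(const std::set<State>& S,
                      const std::set<Transition>& T) {
    std::set<State> result;
    for (const Transition& t : T)
        if (S.count(t.first)) result.insert(t.second);
    return result;
}

// preimage(S, T): all states that reach some s' in S by one t in T.
std::set<State> preimage(const std::set<State>& S,
                         const std::set<Transition>& T) {
    std::set<State> result;
    for (const Transition& t : T)
        if (S.count(t.second)) result.insert(t.first);
    return result;
}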

The right side of Fig. 12.3 depicts the IMD for the characteristic function of the SysteMoC example's transition relation. IMDs contain only a single terminal node. Paths from the root to this terminal node represent transformations of states due to transitions. Therefore, a mapping function and two intervals are annotated on the edges: a predicate interval determines the value a variable must have for the mapping function to apply, and an action interval (together with the mapping function) describes the value transformation. Edges without annotations are neutral, i.e., the variable may have any value and is not altered. We give an example using the dashed path in the IMD in Fig. 12.3. This path represents the transformation of variables due to the transition i1(1)/func3 in actor A2. To apply this transition to a single state (q1, q2, s), the value of q1 must be in the range [1, 8], and s must equal zero (q2 is ignored). Then, for the following state, (q1', q2', s') is determined by q1' = q1 − 1, q2' = q2, and s' = 1. This is a simplification for illustration purposes; for the actual transformations, interval arithmetic and a mapping function similar to the apply operation for BDDs are used. We refer to the cited literature for further details.

12.4 QSS of SysteMoC Models

After introducing the model of computation for SystemC designs and their symbolic representation in the previous sections, we now introduce the quasi-static scheduling procedure for those models. We need some technical terms to describe the proposed quasi-static scheduling. The abstract model state and the model's state space were introduced in Sect. 12.3: the state space is spanned by variables for buffer fill sizes and the states of the actors' finite state automata, and one particular point in this state space is an abstract state. Note that this abstraction neglects data values. The basic task while scheduling a model is searching for paths through the model's state space. A path x0 -t0-> x1 -t1-> ... -tn-1-> xn through the model's state space is a chain of model states xi and transitions tj. Taking transition tj transforms the model state from state xj to state xj+1. Transitions leaving the same state of an actor's finite state machine can be in conflict with each other; e.g., due to annotated guards, either one or the other must be taken at runtime.

The scheduling is then basically performed as follows: a path x0 -t0-> ... -ti-> x0 from the model's start state x0 back to x0 is searched, taking all transitions into account (conflicting and non-conflicting ones). If a transition tc1 on this path is in conflict with other transitions tci, additional paths are searched to cope with every possible outcome of the conflict at runtime. We call these paths alternative paths for tc1, as they are alternatives for the conflicting transition tc1 encountered in the first place. At runtime, one of these paths is chosen, depending on a runtime decision. These alternative paths are paths from the model state the conflict transition tc1 originates from to any known model state, i.e., xc -tci-> ... -> xk. A known state xk is any state on any path found so far. This way, a path is present for every possible runtime decision concerning tc1. The described conflict handling procedure is recursively

applied to every conflict transition encountered on any path. As a result, a clew of paths is obtained, including a path for every possible runtime decision. This clew is a preliminary stage of a scheduler controller automaton.

We now show that the scheduling algorithm terminates, either with a schedule as the result or indicating that no quasi-static schedule exists. Because queues have a limited maximum size and the number of state machine states in the actors is limited, too, the model's state space is limited. If at some point no alternative path for a conflict transition can be found within these state space boundaries, a different path is chosen for the path including the conflict. This backtracking strategy is applied recursively if needed. Therewith, finding a valid quasi-static schedule is guaranteed if one exists within the state space; otherwise, the algorithm terminates unsuccessfully. Note that this worst case leads to an enumeration of possible paths and therewith to vast runtimes.

The above outline of the scheduling procedure is presented in more detail in a following subsection. But first, we introduce so-called transition graphs, which we use to determine conflicts, and explain path searching, the workhorse of the scheduling procedure.

12.4.1 Transition Graphs

During scheduling it is vital to know when alternative paths are needed at runtime; our goal is a scheduler that can cope with every possible runtime condition. We need alternative paths when a transition is in conflict with other transitions. Such a conflict can only occur between transitions leaving the same state of an actor's state machine. This subsection explains a concept called transition graphs, which is used later to exactly determine a conflict. There is a transition graph for each state of each actor's state machine; this transition graph reflects the relation between the outgoing transitions. We subdivide transitions into two classes: safe and unsafe transitions. Transitions are called safe if no guards are annotated; otherwise they are called unsafe. Unsafe transitions can lead to runtime decisions, while safe transitions only depend on available data and buffer space. So, from safe transitions static scheduling sequences can be constructed, while for unsafe transitions alternative paths may be needed. Only unsafe transitions are included in the transition graph of an actor's state machine state. As an example, Fig. 12.4 shows a single state with five outgoing transitions. This is a part of an actor's state machine, simplified to only include the guards a, b, and c, as only these Boolean functions lead to runtime decisions and are thus relevant for the transition graph. The resulting transition graph for this state is depicted in Fig. 12.5. Transitions with exactly the same guards are put into the same equivalence class (node). A directed edge is inserted into the graph for a logical implication of the (Boolean) guards. The result is a possibly unconnected graph (as in the example). All nodes without leaving edges are called leaf nodes. Such a transition graph is the basis for conflict handling.
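The following sketch shows one possible way to build such a transition graph. It treats each guard annotation as a conjunction of atomic guards, so that logical implication reduces to set inclusion; this reduction, like all names below, is a simplifying assumption of this sketch, not the chapter's implementation.

#include <algorithm>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using GuardSet = std::set<std::string>;   // atomic guards of one transition

struct TransitionGraph {
    std::vector<GuardSet>                    nodes;   // equivalence classes
    std::map<int, std::vector<std::string>> members; // transitions per node
    std::set<std::pair<int, int>>           edges;   // implication edges
};

TransitionGraph build(const std::map<std::string, GuardSet>& unsafe) {
    TransitionGraph g;
    // Group transitions with identical guard sets into one node.
    for (const auto& tr : unsafe) {
        int idx;
        auto it = std::find(g.nodes.begin(), g.nodes.end(), tr.second);
        if (it == g.nodes.end()) {
            g.nodes.push_back(tr.second);
            idx = (int)g.nodes.size() - 1;
        } else {
            idx = (int)(it - g.nodes.begin());
        }
        g.members[idx].push_back(tr.first);
    }
    // Edge u -> v iff u's guard conjunction implies v's, i.e. the
    // guards of v form a subset of the guards of u.
    for (int u = 0; u < (int)g.nodes.size(); ++u)
        for (int v = 0; v < (int)g.nodes.size(); ++v)
            if (u != v &&
                std::includes(g.nodes[u].begin(), g.nodes[u].end(),
                              g.nodes[v].begin(), g.nodes[v].end()))
                g.edges.insert({u, v});
    return g;
}
// Leaf nodes are those without outgoing edges; one alternative path
// per leaf suffices, as argued in the text.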

Fig. 12.4 An actor's finite state machine state with five outgoing transitions. Only the guards (a, b, and c) are annotated at the transitions, as only they are relevant for constructing the transition graph

Fig. 12.5 The transition graph for the state machine state shown in Fig. 12.4. There are four equivalence classes. The guards of transition t2 logically imply the guards of other transitions, as depicted by the edges

Fig. 12.6 A sample path search for the example in Fig. 12.1 by consecutive image computation. The dashed arrows indicate the found path; the ellipses depict the state sets resulting from the image computations

12.4.2 Path Searching

The crucial part of the scheduling algorithm is path searching through the model's state space. The challenge is: give me a single path, starting in state x0 and ending in an arbitrary state e ∈ E, with E being a set of states. Path searching is done symbolically by consecutively computing the image Si of x0 until Si ∩ E ≠ ∅ holds. The first image S1 of x0 is the set of all states reachable in a single step. The image S2 of the image S1 is the set of states reachable from x0 in two steps, and so on. If a state e ∈ E is included in Si (i.e., can be reached), at least one path from x0 to e (with i steps) exists. By this symbolic breadth-first search, a shortest path from a single state x0 to some desired state is found. For the example in Fig. 12.1, the search for the initial path (from start state (q1, q2, s) = (0, 0, 0) back to this state) is shown in Fig. 12.6. Note that, in contrast to the figure, the exact relation between states, as depicted by the arrows between them, is not known when computing the images symbolically. By symbolic image computation, all transitions are considered implicitly. Hence, when a state is reached, the only thing we know is that at least one path to this state must exist. Backtracking is used to find a concrete path, based on consecutively calculating the preimage of concrete states [21].
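The following sketch spells out this layered forward search and the preimage-based backtracking, reusing the State, Transition, image, and preimage helpers from the sketch in Sect. 12.3; the real implementation performs the same steps on interval diagrams, and the function name is an assumption.

#include <set>
#include <vector>

// Returns one shortest path from x0 to any state in targets, or an
// empty vector if no such path exists.
std::vector<State> find_single_path(State x0,
                                    const std::set<State>& targets,
                                    const std::set<Transition>& T) {
    std::vector<std::set<State>> layers = {{x0}};   // S_0 = {x0}
    std::set<State> reached = {x0};

    auto hits = [&](const std::set<State>& S) -> bool {
        for (State s : S) if (targets.count(s)) return true;
        return false;
    };

    // Forward search: S_{i+1} = image(S_i, T) until a target is hit.
    while (!hits(layers.back())) {
        std::set<State> next = image(layers.back(), T);
        bool progress = false;                      // fixpoint check
        for (State s : next) progress |= reached.insert(s).second;
        if (!progress) return {};                   // targets unreachable
        layers.push_back(next);
    }

    // Backtracking: pick a reached target, then walk preimages while
    // staying inside the forward layers to extract one concrete path.
    State cur = x0;
    for (State s : layers.back()) if (targets.count(s)) { cur = s; break; }
    std::vector<State> path(layers.size());
    path.back() = cur;
    for (int i = (int)layers.size() - 2; i >= 0; --i) {
        std::set<State> pre = preimage({cur}, T);
        for (State s : layers[i]) if (pre.count(s)) { cur = s; break; }
        path[i] = cur;
    }
    return path;                                    // x0, ..., target
}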

Fig. 12.7 The two paths found for the example in Fig. 12.1. First, starting in state (0, 0, 0), the path with solid arcs was found. The path with dashed arcs is an alternative path for transition t4

while (1) {
    t1(); t3();
    if (g) { t1(); t5(); t6(); t2(); }
    else   t4();
    t2();
}

Fig. 12.8 The quasi-static software scheduler gained from the paths depicted in Fig. 12.7. Only one dynamic decision remains at runtime

The example in Fig. 12.1 also introduces a class of conflicts between transitions which we call multi-rate conflicts. For easier explanation, we assume an enumeration of the transitions in the example corresponding to the index of the action, i.e., t1: o1(1)/func1, t2: i2(1)/func2, and so on. Due to the guard g, the two transitions t4 (o2(1) & ¬g/func4) and t5 (i1(1) & g/func5) leaving state 1 of actor A2 are in conflict. Additionally, t4 and t5 have different rates in terms of dataflow: t4 demands one free space on channel c2, while t5 demands one token on channel c1. To cope with such multi-rate conflicts and to effectively search for runtime-dependent alternative paths, an advanced symbolic path searching algorithm is needed. In principle, we added three new features to path searching:

- A desired transition may be anywhere on the path.
- Some transitions may be disabled during the path search.
- Disabled transitions may be unlocked.

The need to cover conflicting transitions anywhere on an alternative path, not just at the beginning, arises from the possibly different rates of the conflicting transitions; otherwise, it may be impossible to take the conflicting transition because of the absence of tokens or insufficient free space to produce tokens. This is already motivated by the example. Consider the sole encountered conflict (transition t4) on the path depicted with solid arcs in Fig. 12.7. To resolve this conflict, an alternative path has to cover the transition t5. But transition t5 is not enabled in model state (q1, q2, s) = (0, 0, 1) due to the absence of tokens on channel c1. By searching for paths including the conflicting transition anywhere on the path, other actors are allowed to produce (or consume) tokens, which may finally enable the conflicting transition. In the running example, transition t1 can produce a token on c1 and therewith enables t5. So, when searching for an alternative path for the conflicting transition t4, i.e., a path originating in state (0, 0, 1) and using t5 anywhere, a path like (0, 0, 1) -t1-> (1, 0, 1) -t5-> (0, 0, 2) -t6-> (0, 2, 0) -t2-> (0, 1, 0) is found. State (0, 1, 0) is known as part of the first path and is therefore a valid end state. The two paths together are shown in Fig. 12.7. For this example, these two paths already represent a valid quasi-static schedule, and a software scheduler can be derived; see Fig. 12.8.

There is only one runtime decision needed (the if-clause testing the guard function g). Depending on the outcome of this clause, one of the two static sequences is chosen. This directly reflects the two paths searched for the model (Fig. 12.7). Note that transition t4 must initially be forbidden when searching the alternative path: we search a runtime alternative for the usage of t4 in the first place. This is why path searching needs the capability to ignore transitions. Another improvement we implemented, the transition unlocking feature, is not motivated by the simple example. It may be necessary to re-allow taking a disabled transition an alternative path is searched for. That is, if we search an alternative path for the usage of tc, transition tc is initially disabled; but when one of its conflicting transitions is used on a path, tc may afterwards be necessary on this path to reach a known state, so it is unlocked. Models can be constructed where no alternative paths are found otherwise.

To allow for two of the features, using desired transitions anywhere on the path and unlocking, we must track whether some transition (or one out of a set of transitions) was used on a path. E.g., only after some desired transition is covered on a path may the path eventually end in a known state to be a valid alternative path. Therefore, state sets have to be split into groups during path search: those which have already covered desired transitions, and those which have not. These groups have to be treated independently. To not lose the advantages of symbolic path searching and to avoid enumerating possible paths, we extended the image operations in two ways to implement the advanced path searching symbolically: (1) image computation respects a given set of transitions which are disabled while computing the image, and (2) the image function additionally returns, besides the set of states, the set of transitions involved in the image computation. Therewith, the advanced path searching as sketched above, with covering transitions anywhere, disabled transitions, and unlocking, can be implemented efficiently and symbolically. Because unlocking is rarely needed in real-world examples, we omit a detailed description of this feature. But it further raises the model complexity the approach is applicable to.

12.4.3 Scheduling Algorithm

We now explain the quasi-static scheduling algorithm for actor-oriented models in more detail. Conflicts that demand runtime decisions are based on the relation of transitions: two or more transitions leaving the same state of an actor's finite state machine can be in conflict. A naive conflict handling strategy would be: whenever using a conflicting transition on a path, search paths starting with one of the remaining transitions in conflict until one path for every transition is found. As shown above, this common conflict handling cannot cope with multi-rate conflicts. Furthermore, relations between transitions are ignored, although these relations could be exploited by the scheduling procedure, which leads to better results. In the proposed scheduling algorithm, we introduce a novel conflict handling strategy which exploits transition relations and handles multi-rate conflicts.

As a preliminary step before scheduling, for each state in each actor's state machine all outgoing transitions, including their guards, are analyzed. As the result, for each

Scheduling Algorithm

We now explain the quasi-static scheduling algorithm for actor-oriented models in more detail. Conflicts that demand runtime decisions are based on the relation of transitions: two or more transitions leaving the same state of an actor's finite state machine can be in conflict. A common conflict handling strategy would be: whenever a conflicting transition is used on a path, search paths starting with each of the remaining transitions in conflict, until one path for every transition is found. As shown above, this common conflict handling cannot cope with multi-rate conflicts. Furthermore, relations between transitions are ignored, although these relations could be exploited by the scheduling procedure, leading to better results. In the proposed scheduling algorithm, we therefore introduce a novel conflict handling strategy which exploits transition relations and handles multi-rate conflicts.

As a preliminary step before scheduling, all outgoing transitions, including their guards, are analyzed for each state in each actor's state machine. As a result, for each state the aforementioned transition graph is obtained, which represents the relation of the transitions leaving this state. If an unsafe transition (one including a guard) is taken on a path, the corresponding transition graph is examined to determine the minimum number of additional paths needed to guarantee proper execution at runtime. The important observation is: only one path must be searched for each leaf node. Thereby, alternatives among conflicting transitions and conflict implications are respected.

We give two examples, using the transition graph introduced earlier. Assume a path xi -t1-> xk is found first, covering transition t1 with guard a. According to the transition graph, only two alternative paths, with transitions t3 and t5, have to be searched, starting from state xi. Then, for every leaf node in the transition graph a path exists; i.e., paths exist for all possible runtime conditions (including all combinations of runtime conditions). Now assume the path xi -t2-> xk. Transition t2 is not included in a leaf node, so we need to find three additional paths, starting from xi and covering t1 or t4, t3, and t5, respectively. By this procedure, we find the minimum number of paths required to guarantee proper execution at runtime; alternatives and implications are exploited. Together with the advanced path searching (searching for paths that cover desired transitions anywhere on the path), the algorithm also handles multi-rate conflicts.

Listing 12.2 shows the scheduling algorithm as pseudocode. The parameters of find_schedule are the start state of the model and the set of all transitions T_all. The function returns the found paths representing the quasi-static schedule.

     1  function find_schedule(start, T_all)
     2    known    = {start};                 // known states
     3    disabled = {};                      // disabled transitions
     4    (conflicts, x0 -t0-> x1 -t1-> ... -t(n-1)-> x0) =
     5        find_path(start, known, disabled, T_all);
     6    known = known ∪ {x1, ..., x(n-1)};
     7    initialize schedule with path x0 -t0-> x1 -t1-> ... -t(n-1)-> x0;
     8    while conflicts ≠ {} do
     9      (x, t_chosen, disabled) = conflicts.pop();
    10      leaf_nodes = get_leaf_nodes(t_chosen);
    11      leaf_nodes = leaf_nodes \ disabled;
    12      while leaf_nodes ≠ {} do
    13        tmp_disabled = disabled;
    14        add transitions in conflict with t_chosen to tmp_disabled;
    15        (tmp_conflicts, x0 -t0-> x1 -t1-> ... -t(n-1)-> xk) =
    16            find_path(x, known, tmp_disabled, T_leaf_nodes);
    17        known = known ∪ {x1, ..., x(n-1)};
    18        add x0 -t0-> x1 -t1-> ... -t(n-1)-> xk to schedule;
    19        remove covered leaf from leaf_nodes;
    20        conflicts = conflicts ∪ tmp_conflicts;
    21      endwhile
    22    endwhile
    23    return schedule;

Listing 12.2 The find_schedule algorithm in pseudocode
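In C++ terms, the interface between find_schedule and find_path might be rendered as follows. This is an illustrative sketch with type names of our choosing; the actual implementation operates on symbolic state sets rather than explicit ones:

    #include <set>
    #include <vector>

    using State = unsigned;                       // abstract model state
    struct Step { State from; int t; State to; }; // one transition on a path
    using Path  = std::vector<Step>;              // x0 -t0-> x1 ... -> xk

    struct Conflict {                             // element of 'conflicts'
        State         origin;     // state x the conflicting transition leaves
        int           t_chosen;   // transition already used on a found path
        std::set<int> disabled;   // transitions disabled when it was found
    };

    struct PathResult {
        std::vector<Conflict> conflicts;  // conflicting positions on the path
        Path                  path;
    };

    // The four parameters described in the text: origin, possible end
    // states, transitions that must not be taken, and transitions to cover.
    PathResult find_path(State                  origin,
                         const std::set<State>& known,
                         const std::set<int>&   disabled,
                         const std::set<int>&   must_cover)
    {
        (void)origin; (void)known; (void)disabled; (void)must_cover;
        return PathResult{};   // stub; the real search is symbolic
    }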

The algorithm is simplified: it is assumed that a schedule is always found, without backtracking. First, a path from the model's start state start back to start is searched (lines 4 and 5) as an initial path. find_path expects four parameters: (1) the originating state of the path, (2) a set of known states (possible end states), (3) a set of disabled transitions, and (4) a set of transitions of which the path has to cover one. As its result, find_path returns the set conflicts of conflicting positions on the found path, and the path itself. Transitions on the path that are in conflict with other transitions need further investigation to assure proper operation at runtime. In line 5, T_all denotes all transitions of the model; thus, the path search is not restricted. The set of known states is updated by adding the new states of the path (line 6), and schedule, which will contain all found paths and eventually the resulting schedule, is initialized (line 7). If there are no runtime decisions to make, i.e., the set conflicts is empty after the first path search, a static schedule is returned.

While the set conflicts contains elements, additional paths for these conflicting transitions are searched (lines 8 to 22). Elements of conflicts are tuples of (1) the state from which some (2) conflicting transition originates, and (3) the transitions disabled when this conflict was encountered. The function get_leaf_nodes returns the set of leaf nodes of the transition graph (line 10). Because disabled transitions cannot be covered, they are removed (line 11). As described earlier, one alternative path has to be searched for each leaf node (lines 12 to 21). The set disabled includes all transitions that were disabled at the moment the currently treated conflict was encountered. We need to add further transitions for the currently treated conflict (line 14). These transitions are also obtained from the transition graph that includes transition t_chosen: all transitions in t_chosen's node, and all transitions in nodes with implications to t_chosen's node, must be disabled. These disabled transitions tmp_disabled are respected in the alternative path search (lines 15 and 16). The path must cover one of leaf_nodes's transitions (T_leaf_nodes) and eventually end in a known state (known) to be a proper alternative path. If a path is found, the set of known states is updated, the found path is added to schedule, the covered leaf is removed from leaf_nodes, and possibly newly encountered conflicting transitions are added to conflicts (lines 17 to 20).

If the set conflicts is empty, alternative paths have been searched for all conflicts, and the clew of paths is returned, which is the basis for the quasi-static schedule. See the following section for an example schedule and the resulting software scheduler.

Related Work

In recent years, techniques related to quasi-static scheduling using dataflow specifications [1], restricted Petri nets [4, 9, 18, 19], or FunState [23] have been presented. The key difference between these approaches and our proposal is the underlying abstract model. Our model can be extracted automatically from an actor-oriented SystemC design and has fewer restrictions. For example, whenever an output transition in free-choice Petri nets [18] or equal conflict nets [19] is enabled, all output transitions of the originating place are enabled. A similar approach is taken by Strehl et al. [23] by defining conflict states in FunState.

The symbolic QSS approach presented in [23] seems to be the most promising one for actor-oriented SystemC designs, and we used it as the basis for our proposed scheduling procedure. However, there are two major drawbacks of conflict states as used in the FunState approach: (1) each transition leaving such a conflict state results in a path search during scheduling, and (2) all transitions must have exactly the same demands on input tokens. Due to the first point, conflicting alternatives (i.e., two transitions leaving a conflict state with the same runtime condition, in conflict with other transitions) are not respected as equivalent alternatives, and conflict implications are ignored. The second point implies that no multi-rate conflicts are allowed. Our approach has neither drawback.

We extended [23] in several ways and adapted it for dataflow designs as described above, which enables the scheduling of model properties not considered before (e.g., multi-rate conflicts). Additionally, we eliminated a manual classification step, and we search only for the minimum number of required alternative paths. To avoid the construction of a monolithic scheduler specification automaton, as needed in [23], we discarded the state-classification-based approach, defined the term conflict based on relations of transitions, and introduced a novel conflict handling strategy using transition graphs. This also enables a more natural design of state machines, without the need to outsource conflicts and alternatives into separate states (if this is possible at all), as demanded by the monolithic scheduler specification approach. Summarizing, we gain several improvements over the FunState method: (1) our methodology starts with abstract SystemC models instead of FunState, which is intended for the internal representation of systems only [22]; (2) the construction of a monolithic automaton with classified states, which represents the full dynamic schedule, is avoided; (3) conflicts with different rates are considered and can be scheduled in particular cases; (4) transitions instead of states determine conflicts; and (5) by introducing transition graphs, we search only for the minimum number of required runtime-dependent alternative paths, respecting conflict alternatives and conflict implications, which results in lean schedules.

Example

Fig. 12.9 The found clew of paths for the packet filter model (Fig. 12.2). Only each fork represents a runtime decision; between the forks, chains of statically scheduled transitions are depicted

So far, the proposed symbolic quasi-static scheduling procedure has been introduced using small examples. Next, we apply our scheduling to a real-world SystemC example, the connection-tracking network packet filter shown in Fig. 12.2. The packet filter model consists of 14 modules (containing 39 transitions) and 18 channels, including source and sink modules. Filtering is done in the actors Input, Output, and Forward. Based on a rule set, a decision is made to accept or to drop a packet. Connection tracking is done by the Conntrack actors. Ct_merge merges requests for connection tracking entry lookups to Ct_entries. Even with the channel sizes kept to the minimum needed (a maximum depth of two), the model has a considerable number of reachable abstract states.

We automatically extract the finite state machine of each actor as well as the network graph (the interconnection of the modules) from the SystemC program. From this information, the symbolic representations of the state transition relation and of the model's start state are obtained. Scheduling this model takes about a second, including the symbolic encoding of the model. During scheduling, 39 conflicts were encountered, for which 30 alternative paths had to be searched. The resulting clew of paths represents the quasi-static schedule and is shown in Fig. 12.9. As an example, one path from the start state (on the left side, indicated by a small circle) back to the start state is shown with dashed arrows. This path reflects the scenario in which a packet from the host is received and accepted. The number of runtime decisions needed to process such packets is reduced from ten (fully dynamic schedule) to three (the found quasi-static schedule). Additionally, runtime checks on buffers can be omitted. Besides this significant reduction of the runtime overhead, questions like "How much processing is needed for a packet from the host which is accepted?" can easily be answered in the presence of the quasi-static schedule.

Conclusions and Further Work

In this work, we introduced a symbolic quasi-static scheduling approach which is directly and automatically applicable to actor-oriented SystemC dataflow designs. A significant improvement over prior approaches was achieved by identifying transitions as the origin of conflicts. The key contributions are (1) the conflict handling mechanism based on transition graphs, which is an efficient instrument to determine the minimum number of alternative paths required for conflicts, and (2) the handling of multi-rate conflicts, which are not considered in prior approaches. Using a real-world SystemC design, we were able to show the applicability of the approach. The resulting scheduler significantly reduced the number of runtime decisions and, hence, the scheduling overhead of its software implementation. All other aforementioned benefits apply as well: avoidance of deadlocks, knowledge of exact memory requirements, dead code detection, more predictable behavior, etc. In further work, we will use the found schedules not only for generating schedulers for embedded software, but also to reduce the processing time in the SystemC

simulation. First experiments with generated SystemC transaction-level models gave very promising results.

References

1. B. Bhattacharya and S.S. Bhattacharyya. Quasi-static scheduling of reconfigurable dataflow graphs for DSP systems. In Proc. of RSP, pages 84–89.
2. S.S. Bhattacharyya, E.A. Lee, and P.K. Murthy. Software Synthesis from Dataflow Graphs. Kluwer Academic, Norwell, 1996.
3. R.E. Bryant. Graph-based algorithms for Boolean function manipulation. IEEE Trans. Comput., 35(8):677–691, 1986.
4. J. Cortadella, A. Kondratyev, L. Lavagno, C. Passerone, and Y. Watanabe. Quasi-static scheduling of independent tasks for reactive systems. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., 24(10), 2005.
5. J. Falk, C. Haubelt, and J. Teich. Efficient representation and simulation of model-based designs in SystemC. In Proc. of FDL, Darmstadt, Germany.
6. C. Haubelt, J. Falk, J. Keinert, T. Schlichter, M. Streubühr, A. Deyhle, A. Hadert, and J. Teich. A SystemC-based design methodology for digital signal processing systems. EURASIP Journal on Embedded Systems, 2007.
7. F. Herrera, H. Posadas, P. Sanchez, and E. Villar. Systematic embedded software generation from SystemC. In Proc. of DATE.
8. F. Herrera and E. Villar. A framework for embedded system specification under different models of computation in SystemC. In Proc. of DAC. Assoc. Comput. Mach., New York.
9. P.-A. Hsiung and F.-S. Su. Synthesis of real-time embedded software by timed quasi-static scheduling. In Proc. of VLSID.
10. A.J. Hu and D.L. Dill. Efficient verification with BDDs using implicitly conjoined invariants. In Proc. of CAV. Springer, London.
11. E.A. Lee and D.G. Messerschmitt. Static scheduling of synchronous data flow programs for digital signal processing. IEEE Trans. Comput., 36(1):24–35, 1987.
12. E.A. Lee, S. Neuendorffer, and M.J. Wirthlin. Actor-oriented design of embedded hardware and software systems. Journal of Circuits, Systems, and Computers, 12(3), 2003.
13. K.L. McMillan. Symbolic Model Checking. Kluwer Academic, Norwell, 1993.
14. B. Niemann, F. Mayer, F. Javier, R. Rubio, and M. Speitel. Refining a high-level SystemC model. In SystemC: Methodologies and Applications. Kluwer Academic, Norwell.
15. T.M. Parks, J.L. Pino, and E.A. Lee. A comparison of synchronous and cycle-static dataflow. In Proc. of ASILOMAR. IEEE Comput. Soc., Washington.
16. H.D. Patel and S.K. Shukla. SystemC Kernel Extensions for Heterogeneous System Modeling: A Framework for Multi-MoC Modeling & Simulation. Kluwer Academic, Norwell.
17. H.D. Patel, S.K. Shukla, E. Mednick, and R.S. Nikhil. A rule-based model of computation for SystemC: integrating SystemC and Bluespec for co-design. In Proc. of MEMOCODE, pages 39–48.
18. M. Sgroi and L. Lavagno. Synthesis of embedded software using free-choice Petri nets. In Proc. of DAC.
19. M. Sgroi, L. Lavagno, Y. Watanabe, and A.L. Sangiovanni-Vincentelli. Quasi-static scheduling of embedded software using equal conflict nets. In Proc. of ICATPN.
20. K. Strehl. Interval diagrams: Increasing efficiency of symbolic real-time verification. In Proc. of RTCSA, Hong Kong.
21. K. Strehl. Symbolic Methods Applied to Formal Verification and Synthesis in Embedded Systems Design. PhD thesis, Swiss Federal Institute of Technology Zurich, February 2000.

22. K. Strehl, L. Thiele, M. Gries, D. Ziegenbein, R. Ernst, and J. Teich. FunState – an internal design representation for codesign. IEEE Trans. VLSI Syst., 9(4), 2001.
23. K. Strehl, L. Thiele, D. Ziegenbein, R. Ernst, and J. Teich. Scheduling hardware/software systems using symbolic techniques. In Proc. of CODES, Rome, Italy.
24. C. Zebelein, J. Falk, C. Haubelt, and J. Teich. Classification of general data flow actors into known models of computation. In Proc. of MEMOCODE, Anaheim, CA, USA, 2008.

Chapter 13
SystemC Simulation of Networked Embedded Systems

Francesco Stefanni, Davide Quaglia and Franco Fummi

Abstract The design and simulation of next-generation networked embedded systems is a challenging task, since System design choices may affect the network behavior, and Network design choices may impact the System design. For this reason, it is important in the early stages of the design flow to model and simulate not only the system under design, but also the heterogeneous networked environment in which it operates. However, System designers are more focused on System design issues and tools, while Network aspects are dealt with implicitly by choosing traditional protocols, even if, in this case, the chance of joint optimization is lost. To solve this issue, we have exploited a modeling language traditionally used for System design, SystemC, to build a System/Network simulator named SystemC Network Simulation Library (SCNSL). This library allows the modeling of network scenarios in which different kinds of nodes, or nodes described at different abstraction levels, interact. The use of SystemC as a single tool has the advantage that HW, SW, and network can be jointly designed, validated, and refined. As a case study, the proposed tool has been used to simulate a sensor network application, and it has been compared with NS-2, a well-known network simulator; SCNSL shows a nearly two-orders-of-magnitude speed-up with TLM modeling and about the same performance as NS-2 with a mixed TLM/RTL scenario. The simulator is partially available to the community.

Keywords Networked embedded systems · Network simulation · IEEE 802.15.4 · SystemC

13.1 Introduction

The widespread use of Networked Embedded Systems (i.e., embedded systems with communication capabilities, like PDAs, cell phones, routers, wireless sensors and actuators) has generated significant research on their efficient design and integration into heterogeneous networks [1, 5–8, 11, 13].

D. Quaglia: Department of Computer Science, University of Verona, Verona, 37134, Italy, davide.quaglia@univr.it

Fig. 13.1 Two-dimension design space of networked embedded systems

The design of such networked embedded systems is strictly connected with network functionality, as suggested in the literature [4]. In fact, choices taken during the System design exploration may influence the network configuration, and vice versa. The result is a two-dimensional design space, as depicted in Fig. 13.1. System and Network design can be seen as two different aspects of the same design problem, but they are generally addressed by different people, belonging to different knowledge domains and using different tools.

In the context of System design, tools aim at providing languages to describe models and engines to simulate them. The focus is the functional description of computational blocks and the structural interconnection among them. Particular attention is given to the description of concurrent processes, with the specification of the synchronization among them through wait/notify mechanisms. The most popular languages in this context are VHDL, Verilog, SystemVerilog, and SystemC [9]; in particular, the last is gaining acceptance for its flexibility in describing both HW and SW components and for the presence of add-on libraries for Transaction-Level Modeling (TLM) and verification.

In the context of network simulation, current tools reproduce the functional behavior of protocols, manage time information about the transmission and reception of events, and simulate packet losses and bit errors. The network can be modeled at different levels of detail, from packet transmission down to signal propagation [2, 12, 14].

The use of a single tool for System and Network modeling would be an advantage for the design of networked embedded systems. Network tools cannot be used for System design, since they do not model concurrency within each network node and do not provide a direct path to hardware/software synthesis. Instead, System tools might be the right candidate for this purpose, since they already model communications, at least at the system level. However, the use of a system description language for network modeling requires the creation of a basic set of primitives and protocols to support the asynchronous transmission of variable-length packets.

To fill this gap, in this work we have evaluated the potential of SystemC for the simulation of packet-switched networks. In the past, SystemC was successfully used to describe network-on-chip architectures [3] and to simulate the lowest network layers of the Bluetooth communication standard [5]. In the proposed simulator, devices are modeled in SystemC, and their instances are connected to a module that reproduces the behavior of the communication channel; propagation delay, interference, collisions, and path loss are taken into account by considering the spatial positions of the nodes and their on-going transmissions. The design of the node can be dealt with at different

abstraction levels: from system/behavioral level (e.g., Transaction-Level Modeling) down to register transfer level (RTL) and gate level. After each refinement step, nodes can be tested in their network environment to verify that communication constraints are met. Nodes with different functionality, or described at different abstraction levels, can be mixed in the simulation, thus allowing the exploration of network scenarios made of heterogeneous devices. Synthesis can be performed directly on those models, provided that they are described using a suitable subset of the SystemC syntax.

The chapter is organized as follows. Section 13.2 describes the SystemC Network Simulation Library. Section 13.3 outlines the main solutions brought by this work. Section 13.4 reports experimental results and, finally, conclusions are drawn in Sect. 13.5.

13.2 The Architecture of SCNSL

The driving motivation behind SCNSL is to have a single simulation tool to model both the embedded system under design and the surrounding network environment. SystemC has been chosen for its great flexibility, but a lot of work had to be done to introduce some important elements for network simulation. Figure 13.2 shows the relationship among the system under design, SCNSL, and the SystemC standard library. In traditional scenarios, the system under design is modeled using the primitives provided by the SystemC standard library, i.e., modules, processes, ports, and events. The resulting model is then simulated by a simulation engine, either the one provided in the free SystemC distribution or a third-party tool. To perform network simulations, new primitives are required, as described in Sect. 13.2.1. Starting from SystemC primitives, SCNSL provides such elements so that they can be used together with System models to create network scenarios.

Fig. 13.2 SCNSL in the context of SystemC modeling
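The following sketch illustrates this division of labor inside a node model: plain SystemC primitives describe the internal behavior, while network interaction goes through library-provided calls. The channel-test and send functions below are placeholders of our own, not the actual SCNSL interface:

    #include <systemc.h>

    // Schematic node model; channel_busy() and send_packet() stand in for
    // the SCNSL primitives reached through the node's proxy binding.
    SC_MODULE(SensorNode)
    {
        SC_CTOR(SensorNode) { SC_THREAD(main_loop); }

        void main_loop() {
            for (;;) {
                wait(100, sc_core::SC_MS);   // plain SystemC timing
                if (!channel_busy())          // SCNSL-style carrier test
                    send_packet(42);          // SCNSL-style packet send
            }
        }

    private:
        // Stubs standing in for the binding to a NodeProxy instance.
        bool channel_busy()        { return false; }
        void send_packet(int data) { (void)data; /* hand to NodeProxy */ }
    };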

211 204 F. Stefanni et al. Another point regards the description of the simulation scenario. In SystemC, such description is usually provided in the sc_main() function which creates module instances and connects them before starting simulation; in this phase, it is not possible to specify simulation events as in a story board (e.g., at time X the module Y is activated ). Instead, in many network simulators such functionality is available and the designer not only specifies the network topology, but also can plan events, e.g., node movements, link failures, activation/de-activation of traffic sources, and packet drops. For this reason, SCNSL also supports the registration of such network-oriented events during the initial instantiation of SystemC modules. As depicted in Fig. 13.2, the model of the system under design uses both traditional SystemC primitives for the specification of its internal behavior, and SCNSL primitives to send and receive packets on the network channel and to test if the channel is busy. SCNSL takes in charge the translation of network primitives (e.g., packets events) into SystemC primitives Main Components of SCNSL To support network modeling and simulation, a tool has to provide the following elements: Kernel: the kernel is responsible for the correct simulation, i.e., its adherence to the behavior of an actual communication channel; the kernel has to execute events in the correct temporal order and it has to take into account the physical features of the channel such as, for example, propagation delay, signal loss and so forth; Node: nodes are the active elements of the network; they produce, transform and consume transmitted data; Packet: in packet-switched networks the packet is the unit of data exchanged among nodes; it consists of a header and a payload; Channel: the channel is an abstraction of the transmitting medium which connects two or more nodes; it can be either a point-to-point link or a shared medium; Port: nodes use ports to send and receive packets. Figure 13.3 shows the main components of SCNSL; they can be easily related to the previous list as explained below. The simulation kernel is implemented by the Network_if_t class. This class is the most complex object of SCNSL, because it manages transmissions and, for this reason, it must be highly optimized. For instance, in the wireless model, the network transmits packets and simulates their transmission delay; it can delete ongoing transmissions, change node position, check which nodes are able to receive a packet, and verify if a received packet has been corrupted due to collisions. The standard SystemC kernel does not address these aspects directly, but it provides important primitives such as concurrency models and events. The network class uses these SystemC primitives to reproduce transmission behavior. In particular, it is worth to note that SCNSL does not have its own scheduler since it exploits the SystemC scheduler by mapping network events on standard SystemC events.
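Conceptually, this mapping can be pictured as follows; the class below is an illustration of the idea, not SCNSL code:

    #include <systemc.h>

    // Sketch: a packet-level "transmission completed" network event mapped
    // onto a standard SystemC event (names are ours, for illustration).
    struct TransmissionEnd
    {
        sc_core::sc_event done;

        // Schedule the completion after the transmission delay derived
        // from the packet length and the node's transmission rate.
        void schedule(unsigned length_bits, double bit_rate)
        {
            sc_core::sc_time delay(length_bits / bit_rate, sc_core::SC_SEC);
            done.notify(delay);  // network event becomes a timed SC event
        }
    };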

Fig. 13.3 Main components of SCNSL

The Node is one critical point of our library, which supports both System and Network design. From the point of view of a network simulator, the node is just the producer or consumer of packets, and therefore its implementation is not important. However, for the system designer, the node implementation is crucial, and many operations are connected to its modeling, i.e., change of abstraction level, validation, fault injection, HW/SW partitioning, mapping to an available platform, synthesis, and so forth. For this reason we introduced the class NodeProxy_if_t, which decouples node implementation from network simulation. Each Node instance is connected to a NodeProxy instance and, from the perspective of the network simulation kernel, the NodeProxy instance is the alter ego of the node. This solution allows keeping a stable and well-defined interface between the NodeProxy and the simulation kernel and, at the same time, leaves complete freedom in the modeling choices for the node; as depicted in Fig. 13.3, the box named Node is separated from the simulation kernel by the box named NodeProxy, and different strategies can be adopted for the modeling of the node, e.g., an interconnection of basic blocks or a finite-state machine. It is worth noting that other SystemC libraries can also be used in the node implementation, e.g., re-used IP blocks and testing components such as the well-known SystemC Verification Library. For example, the figure also shows an optional package above the node; this package is provided by SCNSL, and it contains some additional SystemC modules, i.e., an RTL description of a timer and a source of stimuli. These components may simplify the designer's work, even if they are outside the scope of network simulation.

Another critical point in the design of the tool has been the concept of packet. Generally, the packet format depends on the corresponding protocol, even if some features are always present, e.g., the length and the source/destination address. System design requires a bit-accurate description of the packet content to test parsing functionality, while from the point of view of the network simulator the only strictly required fields are the packet length, for bit-rate computation, and some flags to mark collisions (if routing is performed by the simulator, source/destination addresses are used too). Furthermore, the smaller the number of different packet formats, the more efficient the simulator implementation. To meet these opposite requirements, in SCNSL an internal packet format is used by the simulator, while the system designer can use other packet formats according to the protocol design. The conversion between the user packet format and the internal packet format is performed in the NodeProxy.
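The following sketch illustrates the two packet views and the proxy-side conversion. All type names and field sizes here are hypothetical:

    // Internal view: only what the simulator needs.
    struct InternalPacket {
        unsigned length_bits;   // used for transmission-delay computation
        bool     collided;      // set by the kernel when a collision occurs
        // source/destination ids would be added if the kernel did routing
    };

    // User view: protocol-dependent, defined by the designer.
    struct UserPacket {
        unsigned char header[8];
        unsigned char payload[120];
        unsigned      payload_len;  // actually used payload bytes
    };

    // NodeProxy-side conversion (schematic):
    inline InternalPacket to_internal(const UserPacket& p)
    {
        InternalPacket ip;
        ip.length_bits = 8u * (sizeof p.header + p.payload_len);
        ip.collided    = false;
        return ip;
    }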

Fig. 13.4 Communicator: class hierarchy

Channels are very important components, because they are an abstraction of the transmission media. Standard SystemC channels are generally used to model interconnections between HW components and, therefore, they can be used to model the network at the physical level [5]. However, many general-purpose network simulators reproduce transmissions at packet level to speed up simulations. SCNSL follows this approach and provides a flexible channel abstraction named Communicator_if_t. The communicator is the most abstract transmission component; in fact, both the NodeProxy and Network classes derive from it. New capabilities and behavior can easily be added by extending this class. Communicators can be interconnected with each other to create chains. Each valid chain shall have a NodeProxy instance on one end and the Network on the other end; hence, transmitted packets move from the source NodeProxy to the Network, traversing zero or more intermediate communicators, and then eventually traverse the communicators placed between the Network and the destination NodeProxy. In this way, it is possible to modify the simulation behavior by just creating a new communicator and placing its instance between the network and the desired NodeProxy. Figure 13.4 shows the class hierarchy of the Communicator; as said before, both Network and NodeProxy inherit from the Communicator. A wireless network is a specific kind of Network with its own behavior, and thus it derives from the abstract Network. A possible implementation of a wireless network model is described in Sect. 13.3.4. NodeProxies depend both on the type of network and on the abstraction level used in the node implementation; for example, Fig. 13.4 reports a TLM and an RTL version of a wireless NodeProxy.
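A minimal sketch of the chaining idea, with a reduced interface and names of our own, is shown below; a lossy link serves as an example of an intermediate communicator inserted into a chain:

    #include <cstdlib>

    struct Packet { unsigned length_bits; bool collided; };

    // Schematic communicator: NodeProxy and Network would derive from the
    // real interface; here only packet forwarding is modeled.
    struct Communicator {
        Communicator* next = nullptr;
        virtual ~Communicator() = default;

        // Default behavior: pass the packet on unchanged. A delay or loss
        // model overrides this method and then forwards the packet.
        virtual void forward(Packet& p) {
            if (next)
                next->forward(p);
        }
    };

    // Example intermediate communicator: a link that drops packets.
    struct LossyLink : Communicator {
        void forward(Packet& p) override {
            if (std::rand() % 100 < 5)   // assumed 5% loss, for illustration
                return;                  // drop: do not propagate further
            Communicator::forward(p);
        }
    };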

13.3 Main Problems Solved by SCNSL

This section describes some issues encountered during the development of SCNSL and the adopted solutions. The first problem regards the co-existence of RTL system models with the packet-level simulation. The second one regards the assessment of packet validity with reference to collisions and out-of-range transmissions. The third problem regards the planning of the network simulation, i.e., source activations, link failures, and so forth. Finally, an application of SCNSL to a wireless network is described.

13.3.1 Simulation of RTL Models

As said before, SCNSL supports the modeling of nodes at different abstraction levels. In the case of RTL models, the co-existence of RTL events with network events has to be addressed. RTL events model the setup of logical values on ports and signals, and they have an instantaneous propagation, i.e., they are triggered at the same simulation time at which the corresponding values change. Furthermore, except for tri-state logic, ports and signals always have a value associated with them, i.e., sentences like "nothing is on the port" are meaningless. Instead, network events are mainly related to the transmission of packets; each transmission is not instantaneous, because of the transmission delay, the channel is empty during idle periods, and the repeated transmission of the same packet is possible and leads to distinct network events.

In SCNSL, RTL node models handle packet-level events by using three ports, signaling the start of a packet transmission, the end of a packet transmission, and the reception of a new packet, respectively. Also in this case, the NodeProxy instance associated with each node translates network events into RTL events and vice versa. In particular, each RTL node has to write on a specific port when it starts the transmission of a packet, while another port is written by the corresponding NodeProxy when the transmission of the packet is completed. A third port is used to notify the node about the arrival of a new packet. With this approach, each transmission/reception of a packet is detected, even if packets are equal.

The last issue regards the handling of packets of different sizes. Real-world protocols use packets of different sizes, while RTL ports must have a constant width decided at compile time. SCNSL solves this problem by creating packet ports with the maximum packet size allowed in the network scenario and by using an integer port to communicate the actual packet size. In this way, a NodeProxy or a receiver node can read only the actually used bytes, thus obtaining a correct simulation.
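As an illustration, the resulting RTL-level port set might be declared as follows in SystemC; the port names and the maximum packet size are assumptions made for this sketch:

    #include <systemc.h>

    static const int MAX_PACKET_BITS = 1024;   // assumed scenario maximum

    SC_MODULE(RtlNodePorts)
    {
        sc_core::sc_in<bool> clk;
        // transmit direction
        sc_core::sc_out<sc_dt::sc_bv<MAX_PACKET_BITS> > tx_packet;
        sc_core::sc_out<unsigned> tx_length;   // actually used bytes
        sc_core::sc_out<bool>     tx_start;    // written by the node on start
        sc_core::sc_in<bool>      tx_done;     // written by the NodeProxy
        // receive direction
        sc_core::sc_in<sc_dt::sc_bv<MAX_PACKET_BITS> > rx_packet;
        sc_core::sc_in<unsigned>  rx_length;
        sc_core::sc_in<bool>      rx_new;      // notifies a new packet
        // medium state
        sc_core::sc_in<bool>      carrier;     // carrier sense

        SC_CTOR(RtlNodePorts) {}
    };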

13.3.2 Assessment of Transmission Validity

In wireless scenarios, an important task of the network simulator kernel is the assessment of transmission validity, which could be compromised by collisions and out-of-range distances. The validity check has been implemented using two flags and a counter. The first flag is associated with each node pair, and it is used to check the validity of the transmission as far as the distance is concerned; if the sender or the receiver of an on-going transmission has been moved outside the maximum transmission range, this flag is set to false. The second flag and the counter are associated with each node, and they are used to check validity with respect to collisions. The counter registers the number of active transmitters which are interfering at a given receiver; if the value of this counter is greater than one, then the on-going transmissions to the given receiver are not valid, since they are compromised by collisions. However, even if, at a given time, the counter holds one, the transmission could be invalid due to previous collisions; the flag has the purpose of tracking this case. When a packet transmission is completed, if the value of the counter is greater than one, the flag is set to false. The combined use of the flag and the counter covers all transmission cases in which packet validity is compromised by collisions.

13.3.3 Simulation Planning

In several network simulators, special events can be scheduled during the setup of the scenario; such special events regard node movements, link status changes, traffic activation, and packet drops. This feature is important because it allows simulating the model in a dynamic network context. In SCNSL the simulation kernel does not have its own event dispatcher; hence, this feature has been implemented in an optional class called EventsQueue_t. Even if SystemC allows writing in each node the code which triggers such events, the choice of managing them in a specific class of the simulator leads to the following advantages:

Standard API: the event queue provides a clear interface to schedule a network event without directly interacting with the Network class or altering the node implementation.
Simplified user code: network events are more complex than System ones; the event queue hides this complexity, thus simplifying user code and avoiding setup errors.
Higher performance: the management of all the events inside a single class improves performance; in fact, the event queue is built around a single SystemC thread, minimizing memory usage and context switching.

This class can also be used to trigger new events defined through user-defined functions. The only constraint is that such functions shall not block the caller, i.e., the events queue, to allow the correct scheduling of the following events.
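The following stand-in sketch illustrates the single-thread design; the add()/dispatch() interface is our guess for illustration and does not reproduce the actual EventsQueue_t API:

    #include <systemc.h>
    #include <functional>
    #include <map>

    // Stand-in events queue: all network events registered during scenario
    // setup are dispatched in time order by one single SystemC thread.
    struct EventsQueueSketch
    {
        std::multimap<sc_core::sc_time, std::function<void()> > events;

        void add(sc_core::sc_time t, std::function<void()> action) {
            events.emplace(t, action);
        }

        void dispatch() {                        // body of a single SC_THREAD
            sc_core::sc_time now = sc_core::SC_ZERO_TIME;
            for (auto& e : events) {
                sc_core::wait(e.first - now);    // advance to the event time
                now = e.first;
                e.second();                      // e.g., move node, drop link
            }
        }
    };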

13.3.4 Application to a Wireless Scenario

In this section, the behavior of the library is described with reference to a wireless scenario; in particular, an RTL node model is reported to clarify the concepts presented in Sect. 13.3.1. The module Rtl::Node_t represents an abstract network node. It has a set of properties which are used by the simulation framework to reproduce the network behavior. The transmission rate represents the number of bits per unit of time which the interface can handle; it is used to compute the transmission delay and the network load. The transmission power is used to evaluate the transmission range and the signal-to-noise ratio. Transmission rate and transmission power can be changed during simulation to accurately simulate and evaluate power-saving algorithms.

The module has the following ports:
- packet ports, to send and receive packets, respectively;
- a carrier port, to perform carrier sensing;
- packet length ports, to report the actual packet size (one for each direction);
- packet event management ports, to report the presence of a new packet (one for each direction) and the completion of a packet transmission;
- a rate port, to communicate the transmission rate to the simulation kernel through the NodeProxy;
- a power port, to communicate the transmission power to the simulation kernel through the NodeProxy;
- a sensor port, to model a data input whose meaning is application-specific (e.g., a temperature sensor).

The sensor port of each node is bound to an instance of the module Stimulus_t, which reproduces a generic environmental data source. It takes a clock signal as input, as a timing reference to synchronize the generation of data values. Different kinds of stimuli can be generated by sub-classes of this module; the intensity of the stimuli and their localization in time can follow a given statistical distribution or be derived from a trace file.

The class Rtl::NodeProxy_t interfaces the node with the network, and it manages two node properties, i.e., the node position and the receiver sensitivity. The node position is used to compute the path loss and to reproduce mobile scenarios. The receiver sensitivity is the minimum signal power below which a packet cannot be received. Even if these properties are related to the node, they are frequently used by the simulation kernel, and thus we decided to model them in the NodeProxy to simplify their access. When a node starts a transmission, its position relative to all other nodes in the same network is computed, and the signal level at all those nodes is derived according to the path-loss formula 1/d^α. For each node, if the signal level is higher than its receiver sensitivity, then the transmission can be detected and may interfere with other on-going transmissions. If there are already on-going transmissions reaching the receiving node, then all those messages are marked as collided (i.e., they are not valid). Also, if there are other on-going transmissions whose receivers are reached by the currently sending node, then those messages are marked as collided as well. Since wireless nodes cannot detect collisions, a collided message is not interrupted, and the channel remains busy. The transmission time depends on the packet length, the transmission rate, and the propagation delay.
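The per-receiver check implied by this description can be summarized in a few lines; units, the exponent α, and the threshold are scenario parameters, not values fixed by SCNSL:

    #include <cmath>

    // Receive-power check based on the path-loss formula 1/d^alpha: the
    // transmission is detectable (and may interfere) at the receiver only
    // if the attenuated power reaches the receiver sensitivity.
    bool reaches(double tx_power, double distance, double alpha,
                 double rx_sensitivity)
    {
        double rx_power = tx_power / std::pow(distance, alpha);
        return rx_power >= rx_sensitivity;
    }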

Fig. 13.5 CPU time as a function of the number of simulated nodes for the different tested tools and node abstraction levels

13.4 Experimental Results

The SystemC Network Simulation Library has been used to model a wireless sensor network application consisting of a master node which repeatedly polls sensor nodes to obtain data. Node communications reproduce a subset of the well-known IEEE 802.15.4 standard, i.e., peer un-slotted transmissions with acknowledgment [10]. Different scenarios have been simulated with SCNSL, using nodes at different abstraction levels: (1) all nodes at TLM-PVT level, (2) all nodes at RTL, and (3) the master node at RTL and the sensor nodes at TLM-PVT. The designer wrote 172 code lines for the sc_main(), 688 code lines for the RTL node, and 633 code lines for the TLM-PVT node.

Figure 13.5 shows the CPU time as a function of the number of nodes for the three scenarios and for a simulation provided by NS-2, representing the behavior of a pure network simulator. A logarithmic scale has been used to better show the results. Simulations have been performed on an Intel Xeon at 2.8 GHz with 8 GB of RAM and a Linux kernel; the CPU time has been computed with the time command by summing up user and system time.

The speed of SCNSL simulations at TLM-PVT level is about two orders of magnitude higher than in the case of the NS-2 simulation, showing the validity of SCNSL as a tool for efficient network simulation. Simulations at RT level are clearly slower, because each node is implemented as a clocked finite state machine, as is commonly done to increase model accuracy in System design. However, a good trade-off between simulation speed and accuracy can be achieved by mixing nodes at different abstraction levels; in this case, the experimental results report about the same performance as NS-2, with the advantage that at least one node is described at RT level.

13.5 Conclusions

We have presented a SystemC-based approach to model and simulate networked embedded systems. As a result, a single tool has been created to model both the embedded system under design and the surrounding network environment.

Different issues have been solved to reconcile System design and Network simulation requirements while preserving execution efficiency. In particular, the combined simulation of RTL system models and packet-based networks has been addressed. Experimental results for a large network scenario show a nearly two-orders-of-magnitude speed-up with respect to NS-2 with TLM modeling, and about the same performance as NS-2 with a mixed TLM/RTL scenario.

References

1. V. Aue et al. Matlab based codesign framework for wireless broadband communication DSPs. In Proc. IEEE ICASSP.
2. AWE Communications. WinProp: Software-tool for the planning of mobile communication networks.
3. D. Bertozzi et al. NoC synthesis flow for customized domain specific multiprocessor systems-on-chip. IEEE Trans. Parallel Distrib. Syst., 16(2), 2005.
4. N. Bombieri, F. Fummi, and D. Quaglia. TLM/network design space exploration for networked embedded systems. In Proc. IEEE/ACM/IFIP CODES+ISSS, pages 58–63.
5. M. Conti and D. Moretti. System level analysis of the Bluetooth standard. In Proc. IEEE DATE.
6. D. Desmet et al. Timed executable system specification of an ADSL modem using a C++ based design environment: a case study. In Proc. IEEE CODES, pages 38–42.
7. D. Dietterle, J. Ebert, G. Wagenknecht, and R. Kraemer. A wireless communication platform for long-term health monitoring. In Proc. IEEE International Conference on Pervasive Computing and Communications Workshops.
8. J. Fleischmann and K. Buchenrieder. Prototyping networked embedded systems. IEEE Computer, 32(2), 1999.
9. IEEE Std 1666-2005. IEEE Standard SystemC Language Reference Manual.
10. LAN/MAN Standards Committee of the IEEE Computer Society. IEEE Standard for Information Technology, Part 15.4: Wireless Medium Access Control (MAC) and Physical Layer (PHY) Specifications for Low Rate Wireless Personal Area Networks (LR-WPANs).
11. N. Lugil and L. Philips. A W-CDMA transceiver core and a simulation environment for 3GPP terminals. In Proc. IEEE Symp. on Spread Spectrum Techniques and Applications.
12. S. McCanne and S. Floyd. NS Network Simulator.
13. R. Pasko et al. Functional verification of an embedded network component by co-simulation with a real network. In Proc. IEEE HLDVT, pages 64–67.
14. C. Zhu et al. A comparison of active queue management algorithms using the OPNET modeler. IEEE Communications Magazine, 40(6), 2002.

Chapter 14
Modeling of Embedded Software Multitasking in SystemC/OSSS

Philipp A. Hartmann, Philipp Reinkemeier, Henning Kleen and Wolfgang Nebel

Abstract Since the software part in today's designs is increasingly important, the impact of platform decisions with respect to the hardware and the software infrastructure (OS, scheduler, priorities, mapping) has to be explored in early design phases. In this work, we present an extension of the existing SystemC-based OSSS design flow regarding software multi-tasking in system models. The simulation of the OSSS software run-time model supports different scheduling policies, as well as efficient timing annotations and deadlines. Inter-task communication is modeled via user-defined Shared Objects. The impact of the timing annotation granularity on the achievable simulation performance and preemption accuracy is studied. As a result, a lazy synchronization scheme is proposed, which is based on omitting SystemC time synchronizations that do not have observable effects on the application model.

Keywords Embedded software · Preemptive multi-tasking · SystemC · Abstract RTOS modeling · Simulation performance · Design-space exploration

14.1 Introduction

The increasing pressure on time-to-market and cost of today's embedded systems requires an ever increasing design productivity. As a result, embedded software becomes more and more important, since the required design effort for a software function is usually expected to be lower than for the respective hardware parts. Additionally, an increased flexibility and the possibility of late changes in the design process are an advantage.

On the other hand, the choice of the correct software architecture, such as the chosen RTOS, task priorities, scheduling policies, and the mapping of tasks on processors, is by no means a simple task. To help developers during this phase of the design space exploration, efficient and fast modeling of different architecture alternatives has to be supported by the chosen design flow. Apart from considering the underlying hardware platform, this

P.A. Hartmann: OFFIS Institute for Information Technology, Oldenburg, Germany, philipp.hartmann@offis.de

includes the early analysis of software/RTOS effects on the system's overall performance, which is especially important if multiple tasks share a single processor. Real-time capabilities have to be explored by choosing, e.g., the scheduler and the task priorities so as to fulfill a given set of requirements like deadlines or other application-specific constraints.

Not only since its IEEE standardization [13] is SystemC [17] a very popular modeling language for the system-level modeling of complex hardware/software systems. Since the modeling of real-time software-specific aspects is not directly supported by SystemC itself, several extensions to SystemC have been developed to enable the early exploration of real-time software properties, some of which will be compared briefly in Sect. 14.2.

The approach to software multi-tasking presented in this chapter is based on OSSS, the Oldenburg System Synthesis Subset, an extension of the SystemC synthesisable subset [21] with object-oriented features. An introduction to the accompanying OSSS design flow is given in Sect. 14.3. The OSSS methodology is characterized by a layered approach and its partly automated path to an implementation. Up to now, the support for software modeling in OSSS has been limited to a single task per processor. In this work, we extend the modeling capabilities of OSSS with support for simulating multiple tasks on a processor running a given RTOS. In Sect. 14.5, some observable effects of different software architecture decisions are shown with an instructive example. This demonstrates the feasibility of our method during (software) design space exploration.

Due to the object-oriented approach, communication between different components is modeled in OSSS by abstract method calls to so-called Shared Objects. This concept is reused for modeling inter-task communication in the new software parts. The main advantage of this approach is the abstraction from error-prone RTOS synchronization primitives and therefore a more robust modeling environment. In Sect. 14.4, we present the new software multi-tasking features of OSSS, like tasks and their properties, timing abstraction, and Software Shared Objects.

An important property of abstract software models is their simulation performance. The synchronization overhead between several tasks (i.e., SystemC processes) and the underlying simulation kernel is one of the limiting factors here. As shown later in this chapter, the granularity of the timing annotations within the Software Tasks has a direct impact on the overall simulation performance. We also show that this impact can be significantly reduced by applying our lazy synchronization scheme, which reduces the SystemC overhead without changing the observable system behavior and correctness. Finally, a summary and directions for future work are given.

14.2 Related Work

Many different approaches to modeling embedded software in the context of SystemC have been proposed. Some of them, like the SPACE framework [3], rely on

the co-simulation with an external RTOS simulator or even an instruction set simulator. Although these approaches may provide a higher simulation accuracy, they usually lack the simulation performance required for early platform exploration and are therefore out of scope here.

Abstract RTOS models, like the one presented for SpecC in [6], are better suited for an early comparison of different scheduling and priority alternatives. The timing accuracy, and therefore the simulation performance, of this approach is limited by the fixed minimal resolution of discrete time advances. Just recently, an extension deploying techniques very similar to the ones presented in this work with respect to preemptive scheduling models has been presented in [19]. This Result-Oriented Modeling collects and consumes consecutive timing annotations while still handling preemptions accurately, similar to the lazy synchronization scheme presented in this chapter.

Several approaches based on abstract task graphs [12, 15, 20] have been proposed as well. In these cases, a purely functional SystemC model is mapped onto an architecture model including an abstract RTOS. The mapping requires an abstract task graph of the model, where estimated execution times can be annotated on a per-task basis only, ignoring control-flow-dependent durations. This reduces the achievable accuracy.

A single-source approach for the generation of embedded software from SystemC-based descriptions has been proposed in [4, 11, 18]. Starting from untimed, heterogeneous models in HetSC, a POSIX-conformant description can be generated automatically by the SWgen methodology. The performance of the resulting model with respect to an underlying RTOS model can be evaluated with the PERFidiX library, which augments the generated sources with estimated execution times via operator overloading. Due to the fine-grained timing annotations, the model achieves a good accuracy but a relatively weak simulation performance. This interesting approach might benefit significantly from our proposed lazy synchronization scheme.

An early proposal of a generic RTOS model based on SystemC has been published in [14]. The presented abstract RTOS model achieves time-accurate task preemption via SystemC events and models time consumption via a delay() method. Additionally, the RTOS overhead can be modeled as well. Two different task scheduling schemes are studied: the first one uses a dedicated thread for the scheduler, while the second one is based on cooperative procedure calls, avoiding this overhead. Although explicit inter-task communication resources (message queues, ...) are required in this approach, the simulation time advances simultaneously as the tasks consume their delays.

In [10], an RTOS modeling tool is presented. Its main purpose is to accurately model an existing RTOS on top of SystemC. It cannot be used directly by a system designer. In this approach, the next RTOS event (like an interrupt, a scheduling event, ...) is predicted at run-time. This improves simulation speed, but requires a deeper knowledge of the underlying system.

Just recently, two other approaches [16, 22] have been published in German. In [16], tasks are derived from a special base class, which augments the regular

SystemC wait() method with synchronization calls to an abstract RTOS model. An additional FIFO class with a similar synchronization scheme is included as a HW/SW communication primitive. Regular SystemC events can be used by the application for inter-task communication as well. In [22], the main focus lies on precise interrupt scheduling. For this purpose, a separate scheduler is introduced to handle incoming interrupt requests. This is similar to our ring-based scheduling approach (see Sect. 14.4.1). Timing annotations and synchronization within user tasks are handled by a replacement of the SystemC wait(), similar to [16].

In this work, we present an abstract model for software multi-tasking based on OSSS, which includes some properties of the above-mentioned approaches, especially concerning a simple RTOS model. The flexible integration of user-defined communication mechanisms via Shared Objects and the accurate and efficient handling of timing annotations, even in preemptive scheduling scenarios, are the main contributions of our approach.

14.3 The OSSS Design Flow

Based on an object-oriented hardware design approach [8], one of the main objectives of OSSS is to enable the use of object-oriented features known from languages such as C++ in a synthesisable SystemC model. This includes concepts such as classes, inheritance, polymorphism, and abstract communication based on method calls. OSSS extends the synthesisable subset of SystemC [21] by defining synthesis semantics for many of these features. Furthermore, new concepts specifically targeted at the modeling and design of embedded systems are introduced to raise the level of abstraction and increase the expressiveness of the language. OSSS defines separate layers of abstraction to improve refinement support during the design process. The design entry point in OSSS is called the Application Layer. By manually applying a mapping of the system's components, the design can be refined from the Application Layer to the Virtual Target Architecture Layer, which can be synthesized to a specific target platform in a separate step by the synthesis tool Fossy [5].

14.3.1 Application Layer

On this layer, the hardware/software system is modeled as a set of communicating processes, representing hardware modules and software tasks. The Application Layer model abstracts from the details of the communication between the components of a model, such as the actual implementation of the communication channel, even across HW/SW boundaries. One concept introduced by OSSS without a direct equivalent in C++ is the so-called Shared Object, which equips user-defined classes with specific synchronization facilities: concurrent accesses are arbitrated, and a special feature called Guarded Methods can be used to block the execution of a method according to a user-defined condition.
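The following plain C++ sketch conveys the idea of a Shared Object with two Guarded Methods; the actual OSSS syntax for declaring guards, as well as the arbitration and blocking machinery, are not shown here:

    struct Packet { int data; };

    // User-defined class that would be wrapped as a Shared Object; the
    // guard conditions are given as comments only.
    class PacketBuffer
    {
    public:
        // Guarded Method: a caller blocks until the buffer is not full.
        void put(const Packet& p) /* guard: fill < N */ {
            buf[wr] = p;
            wr = (wr + 1) % N;
            ++fill;
        }

        // Guarded Method: a caller blocks until the buffer is not empty.
        Packet get() /* guard: fill > 0 */ {
            Packet p = buf[rd];
            rd = (rd + 1) % N;
            --fill;
            return p;
        }

    private:
        static const unsigned N = 8;
        Packet   buf[N];
        unsigned rd = 0, wr = 0, fill = 0;
    };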

Due to their well-defined synthesis semantics, Shared Objects can act as a replacement for some non-synthesisable features of SystemC, such as hierarchical channels, mutexes, and semaphores. As a result, they are especially useful for modeling inter-process communication, both between hardware and software processes. Communication between modules or tasks and Shared Objects is performed by method calls through abstract communication links (binding). A comprehensive description of the Shared Object concept, including several design examples, is part of the publicly available simulation library [7].

14.3.2 Virtual Target Architecture Layer

In a refinement step, the Application Layer model is transformed into a so-called Virtual Target Architecture. This involves mapping software tasks to processor(s) and hardware modules to certain hardware blocks, as shown in Fig. 14.1. Moreover, the abstract communication links of the Application Layer model are then mapped to a specific communication infrastructure, like buses or point-to-point channels. To enable an easy mapping of the method-based communication of the Application Layer to a signal-based communication, OSSS uses a concept known as Remote Method Invocation (RMI). A detailed description of this scheme is beyond the scope of this contribution, but can be found in [9]. Basically, the implementation of the RMI concept allows the transport of method calls, including their parameters and return values, over arbitrary communication infrastructures, and is used for HW/HW as well as HW/SW communication.

Fig. 14.1 Mapping of the Application Layer to the Virtual Target Architecture Layer

The Virtual Target Architecture Layer model of the design is then translated automatically by the synthesis tool Fossy [5], producing RTL SystemC or VHDL output, which is used for further synthesis by vendor-specific implementation tools.

14.4 Modeling Software in OSSS

The approach to modeling multi-tasking software in OSSS is not meant to directly model existing real-time operating system (RTOS) primitives. The Software Tasks in OSSS are meant to run on top of a generic but lightweight run-time system, as depicted in Fig. 14.2. Task synchronization and inter-task communication are modeled with (Software) Shared Objects, similar to the modeling with OSSS in the hardware domain. This enables a seamless specification environment, where the same concepts are used for both hardware and software on the Application Layer (see Sect. 14.3). The flexible timing annotation mechanism enables high simulation performance, since the synchronization overhead with the SystemC kernel can be minimized.

14.4.1 Abstraction of Run-time System

The basis of the OSSS software run-time simulation model is an OSSS software run-time abstraction class. This pre-defined library element handles the time-sharing of a single processor by several Software Tasks (see Sect. 14.4.2), which are bound to a particular RTOS instance. A specific scheduling policy can be bound to each set of tasks grouped by the same ring, as depicted in Fig. 14.2. Several frequently used scheduling policies are already provided by the modeling library, most notably static priorities (preemptive and cooperative), time-slice based round-robin, earliest deadline first, and rate monotonic. Additionally, user-defined schedulers can be implemented through an abstract interface class. The RTOS overhead of context switches and the execution times of scheduling decisions can be annotated as well.

Several rings can be specified by the designer. The rings are an additional priority layer, where every ring receives its own scheduling policy. The processor is assigned to a task from lower-priority rings only if there is no task in ready state (see Sect. 14.4.2) in any higher-priority ring. An example use case for this feature is to model (prioritized, non-preemptive) interrupt service routines alongside otherwise time-sliced, round-robin scheduled user tasks.
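A configuration of this run-time abstraction could look like the following sketch. The class and method names (osss_runtime, add_ring, the scheduler factory calls and set_context_switch_overhead) are illustrative assumptions; only the concepts, i.e. rings, per-ring scheduling policies and annotated RTOS overhead, are taken from the description above.

#include <systemc.h>

SC_MODULE(SoftwareSubsystem) {
  osss_runtime rtos;  // abstraction of one time-shared processor (assumed name)

  SC_CTOR(SoftwareSubsystem) : rtos("rtos") {
    // ring 0 (highest priority): non-preemptive static priorities,
    // modeling prioritized interrupt service routines
    rtos.add_ring(0, osss_scheduler::static_priority_cooperative());
    // ring 1: time-sliced round-robin for the remaining user tasks;
    // it only runs while no task in ring 0 is ready
    rtos.add_ring(1, osss_scheduler::round_robin(sc_time(1, SC_MS)));
    // annotated RTOS overhead for a context switch
    rtos.set_context_switch_overhead(sc_time(5, SC_US));
  }
};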

Fig. 14.2 Ring-based task scheduling

With this set of basic elements, the behavior of the real RTOS on the target platform can be modeled. Task synchronization is not part of the modeling elements, since inter-task communication is meant to be modeled using Software Shared Objects. On the target platform, an implementation of this primitive will be provided by means of the scheduling primitives of the architecture (see Sect. 14.4.3).

14.4.2 Software Tasks

SystemC processes that are meant to be implemented in software are modeled as Software Tasks in OSSS. These tasks are derived from a common base class and define their behavior within a main() method. In the multi-tasking OSSS implementation, tasks are equipped with additional properties, such as a priority, an initial startup time, optional periods and deadlines, and an optional task ring (see Sect. 14.4.1).

During simulation, the tasks can be in different states, as shown in Fig. 14.3. The distinction between blocked and waiting has been introduced to ease the run-time detection of deadlocks. A task in waiting state will enter the ready state after a given amount of time, whereas a blocked task can only be unblocked once a certain logical condition is fulfilled (usually a guard, see below).

Technically, the Software Tasks are implemented with SC_THREADs that use internal synchronization mechanisms with the RTOS abstraction during the simulation to achieve the effect of only a single active task at a time on a given processor. Task preemption is supported at arbitrary times, independently of the granularity of the annotated execution times, as discussed in Sect. 14.6.1. Regular SystemC wait() calls are disabled within Software Tasks.
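A Software Task declaration might then look as follows. The base class osss_software_task and the attribute setters are assumptions modeled on the properties listed above (priority, startup time, period, deadline, ring); only the main() method and the absence of plain SystemC wait() calls are stated in the text.

class SensorTask : public osss_software_task {  // assumed base class name
public:
  SensorTask(sc_core::sc_module_name name)
    : osss_software_task(name)
  {
    set_priority(2);                      // assumed setters for the task
    set_startup_time(sc_time(1, SC_MS));  // properties described in the text
    set_period(sc_time(10, SC_MS));
    set_deadline(sc_time(10, SC_MS));
    set_ring(1);
  }

  virtual void main() {                   // behavior, executed in an SC_THREAD
    while (true) {
      sample_sensor();                    // hypothetical computation
      wait_for_next_period();             // assumed run-time service; regular
    }                                     // SystemC wait() is disabled here
  }
};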

Fig. 14.3 Task states and transitions (terminate edges omitted)

// guard definition ( <name>, <condition> )
OSSS_GUARD( not_empty, cnt_items_ > 0 );

// guarded method declaration of int get(),
// blocking until the not_empty guard holds
OSSS_GUARDED_METHOD( int, get, OSSS_PARAMS(0), not_empty );

Listing 14.1 Example of a guarded method

14.4.3 Software Shared Objects

The inter-task communication in an OSSS software model is specified like in an OSSS hardware model, in terms of user-defined Shared Objects (see also Sect. 14.3), which are inspired by the protected objects known from Ada [1]. On the final platform, the software implementation of a Shared Object has to be integrated with the OSSS software run-time, since it usually requires RTOS primitives for the synchronization. As outlined in Sect. 14.3, Shared Objects provide mutually exclusive access and Guarded Methods to ensure deterministic behavior across several concurrent tasks (see Listing 14.1). The guard mechanism resembles the well-known monitor concept and can be implemented directly, e.g. on an underlying POSIX-compatible OS, using condition variables (pthread_cond_t). The required locks for mutually exclusive access can be implemented using the existing locking mechanism of the underlying RTOS (e.g. mutexes).
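As an illustration of this refinement path, the self-contained C++ sketch below shows how the guarded get() method from Listing 14.1 could be realized as a monitor on a POSIX-compatible RTOS. It is our example of the mapping suggested above, not code generated by OSSS.

#include <pthread.h>
#include <deque>

class sw_fifo_shared {              // software image of the Shared Object
public:
  sw_fifo_shared() {
    pthread_mutex_init(&mtx_, NULL);
    pthread_cond_init(&not_empty_, NULL);
  }
  int get() {                       // guarded: blocks until not_empty holds
    pthread_mutex_lock(&mtx_);      // mutually exclusive access
    while (items_.empty())          // re-evaluate the guard after each wakeup
      pthread_cond_wait(&not_empty_, &mtx_);
    int v = items_.front();
    items_.pop_front();
    pthread_mutex_unlock(&mtx_);
    return v;
  }
  void put(int v) {
    pthread_mutex_lock(&mtx_);
    items_.push_back(v);
    pthread_cond_signal(&not_empty_);  // the guard may have become true
    pthread_mutex_unlock(&mtx_);
  }
private:
  std::deque<int> items_;
  pthread_mutex_t mtx_;
  pthread_cond_t  not_empty_;
};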

Therefore, in OSSS the complexity of inter-task synchronization primitives is hidden from the designer, which improves design productivity. It is planned to automatically generate the required run-time system from an OSSS software model and to use cross-compilation techniques to translate Software Tasks and Shared Objects directly into target machine code.

14.4.4 Software Execution Times

A proper modeling of software multi-tasking requires consideration of the time consumption of the modeled tasks. In OSSS, the Estimated Execution Time (EET) of a code block can be annotated within Software Tasks and inside methods of Shared Objects with the help of the OSSS_EET() macro (see also [7]). The macro receives a duration as an sc_time argument that estimates the execution time of the following code block. This enables a flexible and accurate annotation, depending on the required accuracy. Control structures can be efficiently annotated, and during the simulation the resulting execution time respects the (potentially data-dependent) control flow. Listing 14.2 exemplifies the syntax of these annotations. The simulation semantics of these annotations are discussed in Sect. 14.6. OSSS_EET() blocks cannot be nested and must not contain inter-task communication calls. As of today, these times are meant to be determined by profiling the cross-compiled code on the target processor. It is envisioned, however, to extract and back-annotate these times automatically in the future.

// ...
while( some_condition )
  // the following block has to be finished within 1 ms
  OSSS_RET( sc_time( 1000, SC_US ) )
  {
    OSSS_EET( sc_time( 20, SC_US ) )
    {
      // computation that consumes 20 µs
      max_i = compute_number_of_iterations();
    }

    // estimate a data-dependent loop
    for( int i = 0; i < max_i; ++i )
      OSSS_EET( sc_time( 100, SC_US ) )
      {
        // loop body
      }

    if( my_condition ) {
      // communication only outside of EET blocks
      result = my_shared->get();  // see Listing 14.1
    }
  } // end of RET block and loop

Listing 14.2 Example of estimated and required execution time annotations

Fig. 14.4 Impact of scheduling policy on simulation

In addition to the EETs, OSSS enables the designer to specify local deadlines for a specific code block. This is especially useful in combination with inter-task communication calls or preemptive scheduling policies. The syntax follows the one for the EETs: a certain Required Execution Time (RET) is specified by the OSSS_RET() macro, which guards the duration of the following code block. If required, RETs can be nested at arbitrary depth. The consistency of nested RETs is checked during the simulation, like any other violation of the RETs or of the optional globally annotated deadline of a given task. If such a timing constraint cannot be fulfilled during the simulation, it is reported by the library. Unmet RETs can arise from the choice of the scheduling policy, from (additional) delays caused by blocking guard conditions, or simply from unexpectedly long estimated execution times (e.g. max_i ≥ 10 in Listing 14.2).

14.5 Exploration of Platform Effects

The main purpose of modeling abstract software multi-tasking in early design phases is the exploration of the impact of platform choices on the system's correctness and performance. Therefore, we tested our approach with several example scenarios. In Fig. 14.4, the simulation results of one example specified by G.C. Buttazzo [2, p. 95] are shown. Two tasks T1, T2 share the same processor with periods τ1 = 5 ms, τ2 = 7 ms and estimated execution times t1 = 2 ms, t2 = 4 ms. This minimalistic system is scheduled (a) by a rate-monotonic scheduler (RMS) and (b) according to earliest deadline first (EDF). In case of the rate-monotonic schedule, the generated traces expose several deadline violations (denoted by circles in Fig. 14.4). But even in case of more complex systems, especially with inter-task communication and varying run-times, such violations would be caught by our simulation model as well.
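This outcome agrees with the classical utilization analysis, which we add here for reference (it is the standard Liu-and-Layland and EDF reasoning, not part of the original example beyond [2]):

U = \frac{t_1}{\tau_1} + \frac{t_2}{\tau_2}
  = \frac{2}{5} + \frac{4}{7} \approx 0.971
\qquad
U_{\mathrm{RMS}}(n = 2) = 2\,(2^{1/2} - 1) \approx 0.828

Since U exceeds the rate-monotonic bound of 0.828, RMS schedulability is not guaranteed, and the simulated trace indeed shows the deadline misses; since U ≤ 1, the task set remains feasible under EDF.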

14.6 Simulation Results

An important factor for the feasibility of abstract software models is the simulation accuracy they can achieve. Nonetheless, simulation performance is a critical factor during design space exploration as well. These goals are contradictory, as we will discuss in this section.

14.6.1 Accuracy and Performance

In order to ensure that only one Software Task is active at any time during the simulation, the different tasks have to be synchronized with the central RTOS abstraction, which then dispatches the tasks according to its scheduling policy. Since the simulation time is usually handled by the (cooperative) SystemC kernel, this synchronization requires an (at least implicit) call of wait(). In order to support a preemptive scheduling policy, this implicit synchronization is performed by the abstract run-time system. For every EET block, the run-time advances the SystemC time for the current task by the annotated time plus the additional delay due to preemptions of the current task by other (e.g. higher-priority) tasks during this period.

Since every SystemC time advance comes at the cost of a host system context switch (see Fig. 14.5), the number of these synchronizations has to be reduced. This leads to a trade-off between the granularity of the annotations and their accuracy. If the implementation of timing annotations immediately consumes the annotated delays, the granularity of these annotations must be kept quite coarse to ensure good performance. As a result, control structures like the data-dependent loop shown in Listing 14.2 have to be estimated by their WCET, instead of taking the real number of iterations into account. Another difficult situation arises from sporadic or conditional inter-task communication, as shown in Listing 14.2 as well.

Fig. 14.5 Impact of EET resolution on simulation time

Fig. 14.6 Producer/Consumer benchmark

To annotate the surrounding basic blocks and to keep the number of annotations low, the annotations might have to be moved entirely before or after the communication primitive to keep the processor utilization accurate. This results in a loss of accuracy with respect to the access traces on the shared resources. Therefore, the trade-off between simulation time and modeling accuracy limits the observable effects, which might lead to wrong design decisions.

14.6.2 Lazy Synchronization

If, on the other hand, the to-be-consumed processor time can be accumulated until a time synchronization is explicitly required, a considerable speed-up is to be expected even in case of fine-grained annotations. Synchronization between the abstract OS and the currently active task is required whenever interaction with the OS or other components is requested. This especially includes inter-task communication and deadline validation. Fortunately, communication in OSSS is expressed via Shared Objects, which easily enables the integration of such an execution time accumulation.

We have implemented this alternative lazy synchronization scheme as an optional feature of the current OSSS multi-tasking library model. Task synchronization is delayed until a Shared Object call or the border of an OSSS_RET() block is encountered. By this, a task might logically pass multiple EET blocks without actually noticing any SystemC time advance. At the above-mentioned synchronization points, the accumulated delay is consumed at once. This accumulation of consecutive EET blocks without intermediate synchronization is possible since the observable behavior, which depends only on the order and time-stamps of inter-task communication, is not changed at all.

In Fig. 14.5, these two scenarios are compared. The benchmark consists of a simple producer/consumer scenario, where the producer pushes random numbers to the consumer through a FIFO Shared Object (see Fig. 14.6). The tasks are scheduled with static priorities and overall constant estimated execution times, such that the FIFO channel is nearly empty (and thus blocks the consumer task) during the simulation. The OSSS_EET() blocks in the producer are then increasingly split into small chunks, resulting in an increasing number of synchronizations in the strict scenario. The simulation consists of ten thousand FIFO calls (on both sides) and has been run on a 2.8 GHz Pentium D workstation. Figure 14.5 shows that accumulated synchronization leads to significantly faster execution with a higher number of consecutive EETs.
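The difference between the two schemes can be condensed into a few lines of code. The sketch below is a simplification of the library internals: the struct and member names are invented for illustration, and the extra delays caused by preemptions (which the real implementation adds to the consumed time) are omitted.

#include <systemc.h>

struct eet_bookkeeping {
  sc_time pending;                  // accumulated, not yet simulated time

  // strict scheme: every EET block immediately advances SystemC time,
  // i.e. one kernel synchronization (host context switch) per block
  void eet_strict(const sc_time& d) {
    wait(d);                        // called from the task's SC_THREAD
  }

  // lazy scheme: only accumulate the annotated delay
  void eet_lazy(const sc_time& d) {
    pending += d;
  }

  // called on Shared Object accesses and at OSSS_RET() borders
  void synchronize() {
    if (pending != SC_ZERO_TIME) {
      wait(pending);                // consume the accumulated delay at once
      pending = SC_ZERO_TIME;
    }
  }
};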

14.7 Conclusion

In this work, we presented an approach to modeling embedded software in OSSS. In comparison to the previously existing software modeling capabilities in OSSS, the extended implementation introduces an abstract run-time system with support for multi-tasking and SW/SW inter-task communication. The modeling primitives, like Software Tasks and Shared Objects, are similar to the elements on the OSSS Application Layer (see Sect. 14.3) and abstract from the error-prone synchronization primitives the underlying RTOSs would provide. The integrated RTOS abstraction includes different scheduling policies (preemptive and cooperative), periodic and continuous tasks, priorities, and absolute and relative deadlines, without being targeted at a specific RTOS directly. Some simulation results of different decisions on priorities and schedulers have been presented in Sect. 14.5. As long as some locking primitive is available on the software target architecture, the OSSS software run-time can be mapped onto this platform. A prototypical implementation on an existing RTOS is currently under development.

The HW/SW and SW/HW communication capabilities of OSSS (Sect. 14.3) are not yet fully integrated with the new software multi-tasking implementation. The communication refinement will follow the OSSS Channel approach, see [7, 9].

As we have shown in Sect. 14.6, the granularity of timing annotations and the resulting synchronization overhead are an important factor for simulation speed. Therefore, the presented approach offers a flexible and simulation-efficient way to specify estimated execution times. Moreover, the lazy synchronization scheme further reduces the required SystemC kernel invocations by merging consecutive EETs between required synchronizations, which has been demonstrated with a benchmark in Sect. 14.6.2. This further improves the early exploration of software platform effects for systems modeled in OSSS.

References

1. A. Burns and A. Wellings. Concurrency in Ada. Cambridge University Press, Cambridge.
2. G.C. Buttazzo. Hard Real-time Computing Systems. Kluwer Academic, Dordrecht.
3. J. Chevalier, M. de Nanclas, L. Filion, O. Benny, M. Rondonneau, G. Bois, and E.M. Aboulhamid. A SystemC refinement methodology for embedded software. IEEE Design & Test of Computers, 23(2).
4. V. Fernandez, F. Herrera, P. Sanchez, and E. Villar. Embedded software generation from SystemC for platform based design. In SystemC: Methodologies and Applications. Springer, Berlin.
5. Fossy: Functional Oldenburg System Synthesizer.
6. A. Gerstlauer, H. Yu, and D. Gajski. RTOS modeling for system level design. In Proceedings of Design, Automation and Test in Europe, pages 47–58.
7. C. Grabbe, K. Grüttner, H. Kleen, and T. Schubert. OSSS: A Library for Synthesisable System Level Models in SystemC.
8. E. Grimpe, W. Nebel, F. Oppenheimer, and T. Schubert. Object-oriented hardware design and synthesis based on SystemC 2.0. In SystemC: Methodologies and Applications. Springer, Berlin, 2003.

9. K. Grüttner, C. Grabbe, F. Oppenheimer, and W. Nebel. Object oriented design and synthesis of communication in hardware-/software systems with OSSS. In Proceedings of SASIMI 2007, October 2007.
10. Z. He, A. Mok, and C. Peng. Timed RTOS modeling for embedded system design. In 11th IEEE Real Time and Embedded Technology and Applications Symposium (RTAS 2005), March 2005.
11. F. Herrera and E. Villar. A framework for embedded system specification under different models of computation in SystemC. In Proceedings of the Design Automation Conference.
12. S.A. Huss and S. Klaus. Assessment of real-time operating systems characteristics in embedded systems design by SystemC models of RTOS services. In Proceedings of the Design & Verification Conference and Exhibition (DVCon 07), San Jose, USA, 2007.
13. IEEE Standards Association ("IEEE-SA") Standards Board. IEEE Std 1666: Open SystemC Language Reference Manual. IEEE Press, New York.
14. R. Le Moigne, O. Pasquier, and J.-P. Calvez. A generic RTOS model for real-time systems simulation with SystemC. In Proceedings of the Design, Automation and Test in Europe Conference, 2004, volume 3, pages 82–87, February 2004.
15. S. Mahadevan, M. Storgaard, J. Madsen, and K. Virk. ARTS: a system-level framework for modeling MPSoC components and analysis of their causality. In 13th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems.
16. M. Müller, J. Gerlach, and W. Rosenstiel. Abstrakte Modellierung von Hardware/Software-Systemen unter Berücksichtigung von RTOS-Funktionalität. In 11th Workshop Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (MBMV 08), pages 21–30, March 2008.
17. Open SystemC Initiative. SystemC.
18. H. Posadas, F. Herrera, V. Fernandez, P. Sanchez, and E. Villar. Single source design environment for embedded systems based on SystemC. Transactions on Design Automation of Electronic Embedded Systems, 9(4).
19. G. Schirner and R. Dömer. Introducing preemptive scheduling in abstract RTOS models using result oriented modeling. In Proceedings of Design, Automation and Test in Europe (DATE 2008), Munich, Germany, March 2008.
20. M. Streubühr, J. Falk, C. Haubelt, J. Teich, R. Dorsch, and Th. Schlipf. Task-accurate performance modeling in SystemC for real-time multi-processor architectures. In Proceedings of the Design, Automation and Test in Europe Conference. European Design and Automation Association, Leuven.
21. Synthesis Working Group Members of the Open SystemC Initiative. SystemC synthesizable subset. Draft whitepaper, Open SystemC Initiative (OSCI).
22. H. Zabel and W. Müller. Präzises Interrupt Scheduling in abstrakten RTOS Modellen in SystemC. In 11th Workshop Methoden und Beschreibungssprachen zur Modellierung und Verifikation von Schaltungen und Systemen (MBMV 08), pages 31–39, March 2008.

Chapter 15
High-Level Reconfiguration Modeling in SystemC

Andreas Raabe and Armin Felke

A. Raabe, International Computer Science Institute, Architecture Group, 1947 Center St., Suite 600, Berkeley, CA 94704, USA
e-mail: raabe@icsi.berkeley.edu

Abstract The ongoing trend towards the development of parallel software and the increased flexibility of state-of-the-art programmable logic devices are currently converging in the field of reconfigurable hardware. On the other hand, there is the traditional hardware market, with its increasingly short development cycles, which is mainly driven by high-level prototyping of products. This paper presents a library for modeling reconfiguration in the leading high-level system description language SYSTEMC, combining IP reuse and high-level modeling with reconfiguration. Details on the underlying simulation engine are given, which allows safe disabling and re-enabling of all process types without altering the kernel. Novel control statements and internal techniques that allow safe usage of process controlling in conjunction with standard SYSTEMC language constructs are presented. A real-world case study using the presented library proves its applicability.

Keywords SystemC · Dynamic reconfiguration · FPGA · Simulation

15.1 Introduction

Due to increasing micro-miniaturization in chip production, hardware development has evolved from plain circuit design into the development of complex heterogeneous systems with an increasing number of increasingly complex processing elements [12]. Not only does the productivity of hardware designers fail to grow as fast as the number of available transistors per chip (the productivity gap), but the time to bring products to market is shrinking as well. One obvious approach to filling this gap is the reuse of in-house and externally produced components (IP cores). The latter are usually provided closed-source and remain the intellectual property of the vendor. IP reuse is widely regarded as one of the major motors of productivity in contemporary chip design.

To cope with complexity, higher levels of abstraction were introduced in system design and simulation. They enable early estimation of time and hardware consumption. Especially the introduction of Transaction Level Design to evaluate the

impact of bus models has proven to be highly efficient and productive in practice [7]. A large number of system description languages, mostly based on C/C++, have been proposed since [3]. Among them, SYSTEMC became the most prominent one.

Configurable logic devices have evolved considerably as well. State-of-the-art devices provide tremendous processing power and dynamic reconfiguration abilities, qualifying them as highly parallel co-processing units. This has led to increased research in run-time reconfigurable systems, which are now close to commercial breakthrough [13].

To enable the design community to conveniently develop reconfigurable architectures with a short time-to-market, this paper presents the library RECHANNEL, which extends SYSTEMC with advanced language constructs for high-level reconfiguration modeling. Special focus lies on the functional and transactional levels of modeling. Details of the underlying simulation library are given, which allows safe disabling and re-enabling of processes. The standard OSCI SYSTEMC kernel is used without any changes.

15.2 Related Work

OSSS+R [11] is certainly one of the most prominent approaches towards a high-level design methodology for reconfigurable hardware. It is based on OSSS [4], which extends the synthesizable subset of SYSTEMC by adding constructs that enable modeling, simulation and synthesis of object-oriented features. Basically, three such constructs are introduced: Polymorphic Objects, Shared Objects and Sockets. The basic assumption of OSSS+R is that reconfiguration can be interpreted as an exchange of two objects sharing a common base type and can therefore be modeled with polymorphism. Due to the fundamental change of the programming paradigm, from component-based hardware design to object orientation, reuse of existing SYSTEMC IP cores is impossible. Additionally, only passive objects are featured as reconfigurable components.

An approach towards a SYSTEMC reconfiguration extension closer to the original methodology was developed in the ADRIATIC project [14]. It targets transaction-level (TL) description and simulation of reconfiguration aspects, focusing on early evaluation of the performance impact of reconfiguration. To this end, an automated tool is introduced that analyzes static TL models and identifies components that could be made reconfigurable. The designer's task is thus reduced to exploring the design space with respect to reconfiguration. The main limitation of the ADRIATIC approach is the restriction of the underlying interfaces: the target architecture needs a generic bus layout that fits into the supported interface scheme.

In [1] a modified SYSTEMC kernel that features advanced process control statements is presented. In [2] a modified SYSTEMC kernel is presented that allows enabling and disabling of processes; it is even used for modeling and simulation of a reconfigurable system. Both alter the kernel and hence are not standard compliant.

To provide full coverage of SYSTEMC 2.2 and to enable some of the more advanced features presented in this article, the RECHANNEL library was completely reimplemented. Since the user interface changed, a brief recap and update of the basic features presented in [8] is given in the next section. Afterwards, Sects. 15.4 and 15.5 present novel language constructs and features.

15.3 Basic Reconfiguration Modeling

RECHANNEL was designed with respect to a number of objectives. Any SYSTEMC extension should comply with the language standard IEEE 1666 [5]. IP reuse should be possible, to enable designers to exploit the vast number of commercially available components. The library should blend as seamlessly as possible with the SYSTEMC methodology and support reconfiguration on all levels of abstraction without a change in programming paradigm. Hand-crafted refinement from functional to register transfer level should still be possible, to allow maximum efficiency in resource utilization and to avoid imposing any design limitations.

15.3.1 Interpreting Reconfiguration as Circuit Switch

Hardware designers often interpret reconfiguration as switching between two or more modules [6]. This very common approach is usually modeled using a bus that is either described as a module or as a channel. The modules to be exchanged are connected to the bus and can then be addressed only one at a time. The reconfiguration control can thus be described as an arbiter. This approach has three main limitations: the development effort is tremendous, delays caused by reconfiguration are usually not respected, and the system's topology as well as its timing are altered.

RECHANNEL uses a technique that resembles such reconfiguration buses, but that does not have these grave limitations. Portals are introduced to connect the original channel to the reconfigurable module's port.¹ The portal's function is twofold. Firstly, accesses of the active module to the channel need to be executed, while inactive modules must not be allowed to access the channel. Additionally, it is necessary to provide the module's port with an interface it is able to bind. Both are achieved by binding a so-called accessor object, which is part of the portal, to the port. It is derived from the interface the port can connect to and forwards interface accesses to the channel. Secondly, channel events need to be passed to the currently active module. Since the portal's accessors implement the interface the module's port needs, they also possess the events provided by the interface.

¹ Hardware designers familiar with the Xilinx Virtex modular design flow may think of portals as bus macros.

Fig. 15.1 A portal connecting two reconfigurable modules to a standard SYSTEMC channel. A simulation sequence is shown, where a channel event is triggered by some outside source and is then forwarded to the accessor associated with the currently active module. A channel access within this module is triggered and is executed via the accessor

These events are now registered with event forwarders inside the portal. These components listen to the channel's events and notify the corresponding events inside the accessor associated with the currently active module. Figure 15.1 illustrates this. As will be described in Sect. 15.3.3, a portal's state is controlled by the modules connected to it. Conversely, the portal's state influences the modules' state, since only one reconfigurable module per portal is allowed to be active at a time.

The portal's type depends on the interface of the channel that it is bound to. Portals for all SYSTEMC channel interfaces are provided by the RECHANNEL library, along with a toolkit that allows easy construction of portals for custom-built channels. This process could be automated as well, since it does not demand any creative coding, but is merely a repetition of information known to the compiler, yet not available to the designer via C++ language constructs.
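For illustration, connecting two reconfigurable modules to a FIFO channel through a portal might look like the sketch below. The template argument and the two binding methods (bind_static(), bind_dynamic()) are our assumptions about the interface; only the portal concept itself and the one-accessor-per-module scheme are taken from the text. mod_a and mod_b stand for reconfigurable module instances as created in Sect. 15.3.2.

sc_fifo<int> fifo("fifo");              // channel in the static design part

rc_portal<sc_fifo_in_if<int> > portal;  // portal typed by the channel's
                                        // read interface (assumed)
portal.bind_static(fifo);               // static side: the channel itself
portal.bind_dynamic(mod_a.in_port);     // one accessor per reconfigurable
portal.bind_dynamic(mod_b.in_port);     // module; only the active module's
                                        // accesses reach the channel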

15.3.2 Creating Reconfigurable Modules from Static Ones

To allow IP reuse as demanded in Sect. 15.3, it is necessary to provide a mechanism that equips standard SYSTEMC modules with additional reconfiguration-related features (i.e., configuration timings, bit-file size, etc.). In the RECHANNEL environment they also have to be able to efficiently interact with the portals they are connected to. Both can be achieved by deriving from the original static module and from rc_reconfigurable, a base class providing the necessary capabilities. To provide some of the more advanced mechanisms that will be described in Sect. 15.5, it is beneficial to additionally wrap the original type into a template, which also automates this derivation. A reconfigurable version M_rc of a static module type M can now be derived:

class M_rc : public rc_reconfigurable_module<M>

The resulting type M_rc is also of type M and rc_reconfigurable. For more convenience, RECHANNEL automates this process by providing a macro RC_RECONFIGURABLE_MODULE_DERIVED, which accepts the static module type as a parameter, and a macro RC_RECONFIGURABLE_CTOR_DERIVED, where the user can define reconfiguration-related properties (e.g. the module's loading delay).

RC_RECONFIGURABLE_MODULE_DERIVED(M_rc, M) {
  RC_RECONFIGURABLE_CTOR_DERIVED(M_rc, M) {
    rc_set_delay(rc_load, sc_time(1, SC_MS));
  }
}

15.3.3 Control

To control reconfiguration, it would be tedious for designers to switch all portals manually. Hence, a reconfigurable module can be requested to activate itself. This request is passed down to the portals, which allow switching only if no other module is registered as active. A module is only activated if all its portals can be switched. But this implicit control of the portals' state via rc_module is still not convenient enough. Therefore, a simulation control object rc_control provides registration and reconfiguration control functions for modules. E.g., instantiating a control object, registering three reconfigurable modules and activating one of them via rc_control looks like this:

rc_control ctrl;
ctrl.add(m1 + m2 + m3);
ctrl.activate(m1);

Figure 15.2 shows an example of a reconfigurable design with two alternatively present modules. Their reconfiguration state is controlled by a custom configuration controller using rc_control. Additionally, in more complex designs it might be necessary to simulate the usage of multiple reconfigurable platforms with different reconfiguration behavior. By deriving from rc_control and overloading a callback function called takes_time, a simulation control object can be implemented that calculates reconfiguration timings from module attributes (i.e., bit-file size). This way, properties of the reconfigurable hardware used for the design can be modeled, or different alternatives can be evaluated with respect to their impact on system performance.
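A platform model of this kind could be sketched as follows. takes_time is the callback named in the text; its signature and the bitfile_size() attribute accessor are our assumptions for illustration.

class icap_control : public rc_control {
protected:
  // assumed signature: called by the library to obtain the
  // reconfiguration delay for a given module
  virtual sc_time takes_time(const rc_reconfigurable& module) {
    // delay proportional to the module's bit-file size, divided by
    // an assumed throughput of the configuration port
    const double bytes        = module.bitfile_size();  // assumed attribute
    const double bytes_per_us = 400.0;                  // assumed throughput
    return sc_time(bytes / bytes_per_us, SC_US);
  }
};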

Fig. 15.2 Design with two alternatively present modules

15.4 Advanced ReChannel Features

15.4.1 Exportals

A portal is a specially designed component to connect a channel of a design's static part to ports of reconfigurable modules. To allow reconfigurable modules to use exports as well as ports, the portal concept needs to be generalized. Therefore, its control interface was encapsulated in the rc_switch base class; rc_portal and the novel rc_exportal are derived from it. This way, implicit control from module to portal and from module to exportal is implemented using the same mechanism.

Additionally, an exportal needs to forward channel events in the opposite direction to a portal, from the reconfigurable to the static end. Nevertheless, the same techniques that are implemented in a portal can be used. Interface accesses, on the other hand, are more difficult. Since they need to be forwarded from the static parts of the design to the reconfigurable ones, it can occur that no reconfigurable module is currently active to answer a request. Here two cases can be distinguished. If the access is blocking, the exportal can simply wait until a module is activated that can execute the request. A non-blocking access, in contrast, must be executed immediately. If this occurs, the design will in general be erroneous, but the access still has to be executed on some interface to allow the simulation to continue and issue a warning.² This is done by supplying a fall-back channel that reacts accordingly.

15.4.2 Synchronization

Blocking accesses can cause problems as well. Since portals are plugged between port and channel, they can only gain control if an access is initiated, when it is finished, or if a channel event is notified. In timed environments this does not cause

² Note that some other options are available as well. E.g., in the case of resolved signals, a return value of X (undefined) may be more appropriate, so any warnings can be omitted. Generally speaking, it is up to the implementer of the fall-back channel to specify its favored behavior.

any problems. But for untimed (and timed-untimed hybrid) modules, it makes it difficult for the designer to control when the module is to be deactivated without input and output data becoming asynchronous. E.g., a module reading from an input port A and writing to an output port B must in general not be deactivated if it has read the input but no output has been written yet. Therefore, it must be possible either to define blocks of code within the module's processes as atomic transactions, or to externally define synchronization conditions depending on the module's communication behavior. The former is the far more elegant solution and will be described in Sect. 15.5.1. Still, it rules out IP reuse, and hence the latter approach is supported by RECHANNEL as well.

For this purpose, synchronization filters are provided, which allow bookkeeping of channel accesses by manipulating transaction counters. Only if all counters equal zero can the module be deactivated. Reconfigurable modules may now equip portals with these filters and define synchronization conditions using the information supplied by the filters. E.g., let M be an IP core with the behavior described above; then reading from input port A should increase and writing to output port B should decrease a transaction counter.

RC_RECONFIGURABLE_MODULE_DERIVED(M_rc, M) {
  RC_RECONFIGURABLE_CTOR_DERIVED(M_rc, M)
    : filtera(tc, 1),   // if data is read: begin transaction
      filterb(tc, -1)   // if data is written: end transaction
  {
    set_interface_filter(a, &filtera);
    set_interface_filter(b, &filterb);
  }

  rc_transaction_counter tc;          // initially equals 0
  rc_fifo_in_filter<int>  filtera;
  rc_fifo_out_filter<int> filterb;
}

15.5 Explicit Description of Reconfiguration

RECHANNEL also comes with a set of language extensions intended for the explicit description of reconfiguration. This is preferentially applied if a reconfigurable module is built from scratch, or if it is augmented with additional dynamic behavior. In contrast to native SYSTEMC, RECHANNEL language constructs possess an implicit reset mechanism that is triggered on reconfiguration. These language constructs primarily comprise classes, functions and macros corresponding to a particular functionality already known from SYSTEMC (e.g. rc_signal, rc_fifo, rc_event, rc_semaphore, etc.). With the availability of resettable components and processes, both structure and behavior of reconfigurable modules can be modeled in an intuitive way, without the need to care about additional logic that deals with the reconfigurable behavior itself.

Resetting a module amounts to resetting its processes and its sub-components (variables, channels and sub-modules). All sub-components and contained processes depend on a particular reconfigurable module up the SYSTEMC object hierarchy tree, i.e., the first object among a component's parents which is derived from class rc_reconfigurable. This object is denoted as the context module of its sub-components. If no such parent exists, a component is said to be used in a non-reconfigurable context. As a general rule, resettable components and processes are optimized for utilization within context modules, but they may also be employed in a non-reconfigurable context.

All components provided by RECHANNEL are already equipped with the ability to implicitly reset themselves. A designer can use RECHANNEL's predefined components without further knowledge of the underlying mechanism. How to create a custom resettable component will be outlined in Sect. 15.5.2. To enable process reset, RECHANNEL provides its own process registry and process control API for the internal management of resettable processes. The API directly builds upon standard SYSTEMC functionality and can therefore be seen as an additional layer on top of SYSTEMC's process infrastructure. Hence it does not need to alter the SYSTEMC kernel and is therefore compliant with the IEEE 1666 standard [5].

15.5.1 Resettable Processes

In order to enable process reset, a process control layer is introduced (Fig. 15.3). It integrates with the SYSTEMC infrastructure by registering itself with SYSTEMC's process registry, and thus it is not necessary to alter the kernel implementation. Any process that is registered with this process control layer instead of SYSTEMC's native infrastructure can then be disabled, (re-)enabled and reset. Processes registered with SYSTEMC will still remain fully functional and will not suffer from any performance loss. Processes with process control do not differ from standard SYSTEMC processes and can thus coexist and inter-operate with these, even within a single module.

Fig. 15.3 RECHANNEL's process control is layered on top of the SYSTEMC infrastructure

// synchronously resettable thread process
RC_THREAD(proc);
sensitive << clk.pos();
reset_signal_is(reset);
rc_set_sync_reset();

Listing 15.1 Implementation of a synchronously resettable thread using ReChannel primitives

void proc() {
  [...]  // do something
  while(true) {
    wait(new_input_available);
    [...]  // do something
  }
}

Listing 15.2 Process using an overloaded wait()

The macros available for process declaration are RC_METHOD, RC_THREAD and RC_CTHREAD. They correspond semantically to the respective SYSTEMC process types, but are registered with RECHANNEL's process control layer. The primary reset condition is the deactivation of the context module. Additional reset conditions may be assigned by the user declaring the process. The invocation of reset_signal_is() will result in the process being reset on the occurrence of an edge of a signal. The reset behavior of a process may be set to be either asynchronous (default for RC_THREAD) or synchronous (default for RC_CTHREAD). Consider the example given in Listing 15.1: it shows a thread that is made synchronously resettable using the presented primitives.

Classes of type rc_reconfigurable_module and rc_prim_channel, amongst others, possess overloaded wait() and next_trigger() methods which are specially prepared for checking the reset conditions of a process. Consider the example given in Listing 15.2 of a process function inside such a module. E.g., if the module is blocked in the wait and a reset is triggered due to deactivation of the context module, the execution of the process is canceled. It will be restarted as soon as the context module is activated again.

The reset of thread processes is implemented using the C++ exception handling mechanism. Therefore, exception safety plays a major role in simulation reliability. Additionally, it is required that the executed code can be canceled safely with respect to algorithmic correctness and data consistency. Cancellation of transactions or blocking operations not specially designed to be cancelable would render a design highly unreliable with respect to simulation stability and correctness. This implies that a reset mechanism requires fine-grained control of where and when a process may be reset. For this purpose, RECHANNEL provides the macro RC_NO_RESET, by which all reset signals can be temporarily disabled for the current process. The macro RC_TRANSACTION enables the designer to enclose blocks of code that must be finished before the reconfigurable context can deactivate. Listing 15.3 provides an example using a transaction to provide the necessary synchronization for the problem discussed in Sect. 15.4.2. This is more elegant than using filters, but can (obviously) not be applied to IP components.

x = input_fifo.read();     // read input (blocking)
RC_TRANSACTION {           // after data has been read,
  y = calc(x);             //   the calculation must not be interrupted
  output_fifo.write(y);    // write output
}                          // point of deactivation (if requested)

Listing 15.3 An example of a transaction used to define a block that must not be interrupted by reconfiguration

Calls to SYSTEMC standard functionality are always considered atomic. If a resettable process calls external code that is not intended to be cancelable at any time, a technical restriction is beneficial in this regard: for the reset of thread processes to work, it is required that these processes have previously been suspended in one of RECHANNEL's prepared wait() methods. If a thread process calls a function or interface method that uses SYSTEMC's native wait() functionality, the reset mechanism will be temporarily unavailable. Due to this characteristic, a reset condition is considered to be locally bound, e.g. within the borders of a module. Thus, only code within these boundaries needs to be exception safe and interruptible. RECHANNEL also supports resettable spawned thread processes. In contrast to non-spawned processes, these are considered to be temporary, i.e., they will be physically terminated if their context module is deactivated.

15.5.2 Resettable Components

Resettable components have the property that they can be automatically reset by RECHANNEL upon activation or deactivation of their context module. RECHANNEL already provides resettable versions of all basic SYSTEMC channels (e.g. rc_fifo, rc_signal, rc_signal_rv, rc_semaphore, etc.) and of the event class (rc_event). Additionally, the macro rc_var() allows the declaration of resettable variables of arbitrary type.

rc_var(int, i);  // declaration of resettable variable i
[...]
i = 0;           // initialization of i within the constructor

A user-defined component can easily be made resettable by deriving it from rc_resettable and implementing its abstract base interface.

class mycomponent : public rc_resettable {
  [...]  // implementation of mycomponent

  // preservation of initial state
  virtual void rc_on_init_resettable() {
    p_reset_value = p_curr_value;
  }
  // definition of reset functionality
  virtual void rc_on_reset() {
    p_curr_value = p_reset_value;
  }

  int p_curr_value, p_reset_value;
};

Listing 15.4 Implementation of a resettable component

The particular state such a component is reset to can be assigned beforehand, during the construction phase. At the start of simulation, the callback method rc_on_init_resettable() is invoked once on all resettables to give them the opportunity to store their initial state after construction has finished. The request for an immediate reset is propagated by a call to rc_on_reset(). Listing 15.4 illustrates this. For the reset mechanism to work, resettable components automatically register themselves with the current context module during construction. Hence, the designer does not have to care about any further details.

15.5.3 Binding Groups of Switches

If a channel or export is connected to a module, it is bound to a port by a single binding statement. If reconfiguration is used, switches are bound to multiple modules' ports at the dynamic end. In practice, modules provide a vast number of ports, especially in RTL descriptions. In conjunction, this results in long sequences of nearly identical binding statements, which make the code difficult to read. Even worse, implementing these binding blocks is error-prone, and they are difficult to maintain.

To enable convenient use of RECHANNEL, port maps are provided, which group ports, channels and exports. As a counterpart, switch connectors can be used to group switches. Switch connectors and port maps can then be bound using a single statement. Moreover, port maps can be used to equip a module with multiple binding schemes. This allows, e.g., providing bit-vectors in little-endian or big-endian bit order. While the standard SYSTEMC check for type compliance of bound objects is still provided, it is extended with a check of port map compliance. E.g., a port map for little-endian order cannot be bound to a switch connector which is defined for big-endian order. Last but not least, it is still possible to bind the module's ports (etc.) directly, without using its port maps.
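With these constructs, the binding code can be reduced to something like the following sketch. rc_switch_connector, the port map type and the bind() calls are assumptions about the interface, included only to illustrate the single-statement binding described above.

// without port maps: one statement per port, repeated for every module
// portal_a.bind_dynamic(m1.a);
// portal_b.bind_dynamic(m1.b);
// ...

// with port maps: group the switches once in a switch connector ...
rc_switch_connector<my_portmap> connector("connector",
                                          portal_a, portal_b, portal_c);

// ... then bind each reconfigurable module with a single statement
connector.bind(m1.portmap());
connector.bind(m2.portmap());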

Fig. 15.4 (a) Using port maps and switch connectors enables binding of complete modules to switches with a single binding statement. (b) Topmodule models the reconfigurable area explicitly and thus has the same ports and exports as the reconfigurable modules. Here only a single type of port map needs to be defined, to enable port-to-port and export-to-export binding

Figure 15.4(a) illustrates the use of port maps and switch connectors as discussed above. In Fig. 15.4(b) a more practical type of application is depicted: Topmodule models the reconfigurable area explicitly and thus has the same ports and exports as the reconfigurable modules. Here only a single type of port map needs to be defined, to enable port-to-port and export-to-export binding.

15.6 Case Study

The RECHANNEL library was tested within the CollisionChip [9] project. Figure 15.5 shows the overall hardware/software project implementing a hierarchical collision detection with a reconfigurable primitive test. It tests two objects for intersection. Depending on the primitive type the objects are constructed of, a primitive test is loaded into the FPGA. The main design preselects pairs of primitives that are very close to each other and hence might intersect. These preselected pairs are checked for intersection by the currently active primitive test.

The parts to be realized in hardware were implemented and tested on the timed functional as well as on the RT-level. Overall, the RTL project consists of over lines of SYSTEMC code. Introducing reconfiguration into the simulation using RECHANNEL took a single developer only two days. The reconfigured modules were not altered, but treated as closed-source components. A more detailed description of this case study can be found in [10].

Fig. 15.5 Hierarchical collision detection architecture with reconfigurable primitive test

15.7 Conclusion and Future Work

This article presented a library for modeling reconfiguration in the leading high-level system description language SYSTEMC. RECHANNEL combines IP reuse and high-level modeling with reconfiguration. Advanced synchronization techniques for high-level reconfiguration modeling were presented, along with an internal process control. The latter allows safe usage within the SYSTEMC framework by providing the necessary synchronization statements. RECHANNEL does not alter the SYSTEMC kernel and complies with the language standard [5]. To cover all SYSTEMC language constructs and to provide all of the above, a full reimplementation of the library was provided, and the basic mechanisms were presented. Some extended techniques (e.g. resolution of driver conflicts, make-up of variable reset functionality, etc.) and minor details were left out due to space restrictions. A case study was presented proving the RECHANNEL library to be effective and productive. RECHANNEL has been released as an open-source library under the BSD license and is thus available free of charge.

We are currently working on support for modeling mobility using RECHANNEL primitives, and on providing a synthesis case study. Among our next steps will be the discussion of functionality that should be incorporated into the SYSTEMC standard to ease the implementation of language extension libraries. We are also planning to provide portals and other components necessary to cover the TLM library.

References

1. B. Bhattacharyya, J. Rose, and S. Swan. Language extensions to SystemC: process control constructs. In DAC '07: Proceedings of the 44th Annual Conference on Design Automation. ACM Press, New York, 2007.
2. A.V. Brito, M. Kuhnle, M. Hubner, J. Becker, and E.U.K. Melcher. Modelling and simulation of dynamic and partially reconfigurable systems using SystemC. In ISVLSI '07. IEEE Computer Society, Los Alamitos, 2007.
3. S.A. Edwards. The challenges of hardware synthesis from C-like languages. In DATE '05: Proceedings of the Conference on Design, Automation and Test in Europe. IEEE Computer Society, Washington, 2005.
4. E. Grimpe and F. Oppenheimer. Aspects of object oriented hardware modelling with SystemC-Plus. In System on Chip Design Languages. Extended Papers: Best of FDL '01 and HDLCon '01. Kluwer Academic, Dordrecht.
5. IEEE Standards Association Standards Board. IEEE Std 1666: Open SystemC Language Reference Manual.
6. P. Lysaght and J. Stockwood. A simulation tool for dynamically reconfigurable field programmable gate arrays. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 4(3).
7. S. Pasricha, N. Dutt, and M. Ben-Romdhane. Extending the transaction level modeling approach for fast communication architecture exploration. In DAC '04: Proceedings of the 41st Annual Conference on Design Automation. ACM Press, New York, 2004.
8. A. Raabe, P.A. Hartmann, and J.K. Anlauf. ReChannel: describing and simulating reconfigurable hardware in SystemC. ACM Transactions on Design Automation of Electronic Systems, 13(1):1–18.
9. A. Raabe, S. Hochgürtel, G. Zachmann, and J.K. Anlauf. Space-efficient FPGA-accelerated collision detection for virtual prototyping. In Design, Automation and Test (DATE), Munich, Germany.
10. A. Raabe, A. Nett, and A. Niers. A refinement case-study of a dynamically reconfigurable intersection test hardware. In ReCoSoc '08, July 2008.
11. A. Schallenberg, F. Oppenheimer, and W. Nebel. Designing for dynamic partially reconfigurable FPGAs with SystemC and OSSS. In Forum on Specification and Design Languages, Lille, France.
12. Y. Tanurhan. Processors and FPGAs: quo vadis? IEEE Computer, 39(11).
13. N. Tredennick and B. Shimamoto. The rise of reconfigurable systems. In Engineering of Reconfigurable Systems and Algorithms.
14. N.S. Voros and K. Masselos. System Level Design of Reconfigurable Systems-on-Chip. Springer, New York, 2005.

Chapter 16
Stream Programming for FPGAs

Franjo Plavec, Zvonko Vranesic and Stephen Brown

F. Plavec, University of Toronto, Department of Electrical and Computer Engineering, 10 King's College Road, Toronto, Ontario, Canada
e-mail: plavec@eecg.toronto.edu

Abstract There is an increasing need for automated conversion of high-level design descriptions into hardware. We present a flow that converts a software application written in the Brook streaming language into a hardware description targeting FPGAs. We use a combination of our source-to-source compiler and a commercial C2H behavioral synthesis compiler to implement our flow. Our approach results in a significant throughput increase compared to software and ordinary C2H results (up to 8.9× and 4.3×, respectively). The throughput can be further increased by using more hardware resources to exploit the data parallelism available in streaming applications.

Keywords Streaming · FPGA · Data-level parallelism · Task-level parallelism · Behavioral synthesis · SOPC

16.1 Introduction

A complete system on a programmable chip (SOPC) can often fit into an FPGA device. Such a system usually contains one or more processing nodes, either soft- or hard-core, and a number of peripherals. As the complexity of SOPCs grows, there is a need for tools that allow users to design their systems at a high level. Major FPGA vendors already provide such tools for building systems based on soft processors, which help software developers who wish to target FPGAs. However, such systems do not exploit the full potential of FPGAs, because they fail to generate circuits that fully exploit the parallel nature of an application. Taking advantage of the parallelism available in FPGAs has traditionally required mastering one of the Hardware Design Languages (HDLs). Recently, there has been a push towards supporting automatic compilation of software programs into hardware.

There are three basic approaches to automatic compilation of software into hardware. Behavioral synthesis compilers [1] analyze programs written in a high-level sequential language, such as C, and attempt to extract instruction-level parallelism

by analyzing dependencies among instructions, and by mapping independent instructions to parallel hardware units. Several such compilers have been released, including C2H from Altera [2], which is fully integrated into their SOPC design flow. The main problem with the behavioral synthesis approach is that the amount of instruction-level parallelism in a typical software program is limited. In the case of C2H, programmers often have to restructure their code and explicitly manage hardware resources, such as the mapping of data to memory modules.

Fig. 16.1 Sample streaming application

Another approach is to take an existing parallel programming model and map a program written in it onto hardware [3]. This approach allows programmers to express parallelism, but they have to deal with issues such as synchronization, deadlocks and starvation. The third approach is to use a language that allows programmers to express parallelism without having to worry about synchronization and related issues. One class of languages that has been attracting a lot of attention lately is based on the streaming paradigm. In streaming, data is organized into streams, which are collections of data similar to arrays, but with elements that are guaranteed to be mutually independent [4]. Computation on the streams is performed by kernels, which are functions that implicitly operate on all elements of their input streams. Since the stream elements are independent, a kernel can operate on individual stream elements in parallel, thus exploiting data-level parallelism. If there are multiple kernels in a program, they can operate in parallel in a pipelined fashion, thus exploiting task-level parallelism. Figure 16.1 depicts an application consisting of 5 kernels (depicted as circles) and memory buffers that pass stream data between the kernels. Streaming programming languages do not specify the nature of the memory buffers. We use FIFO buffers because of their small size, which allows us to implement them in on-chip memory in the FPGA.

In this contribution we show that the streaming paradigm is suitable for implementation in FPGAs. Our compiler converts kernels into hardware blocks that operate on incoming streams of data. We chose the Brook streaming language [4] as our source language because it is based on the C programming language, so it is more likely to be accepted by the programmer community. The Brook language has been used

The Brook language has been used in a number of research projects [5–7]. We show that a program expressed in Brook can be automatically converted into parallel hardware, which executes significantly faster than software. Our methodology uses our modified version of the Brook compiler, and also leverages the commercial C2H tool.

The rest of the chapter is organized as follows. In Sect. 16.2 we describe related work in the field of stream computing. Our design flow and tools are described in Sect. 16.3. Section 16.4 presents experimental results. We make some concluding remarks in Sect. 16.5.

16.2 Stream Computing

The term stream computing (a.k.a. stream processing) has been used to describe a variety of systems. Examples of stream computing include dataflow systems, reactive systems, signal processing and other systems [8]. We focus on streaming applications defined through a set of kernels, which define the computation, and a set of data streams, which define communication. This organization allows the compiler to easily analyze communication patterns in the application, so that parallelism can be exploited. When a programmer specifies that certain data belongs to a stream, this provides a guarantee that the elements of the stream are independent from one another. The computation specified by the kernel can then be applied to stream elements in any order. In fact, all computation can be performed in parallel, limited only by the available computing resources.

Stream processing is based on the Single Instruction Multiple Data (SIMD) paradigm, and is similar to vector processing. The major difference between stream and vector processing is in computation granularity. While vector processing involves only simple operations on (typically) two operands, a kernel may take an arbitrary number of input streams, and produce one or more output streams. In addition, the kernel computation can be arbitrarily complex, and the execution time may vary significantly from one element to the next. Finally, elements of a stream can be complex (e.g. custom data structures), compared to vector processing, which can only operate on primitive data types.

The recent interest in stream processing has been driven by two trends: the emergence of multi-core architectures and general-purpose computation on graphics processing units. Multi-core architectures have sparked interest in novel programming paradigms that allow easy programming of such systems. Stream processing is a paradigm that is suitable for both building and programming these systems. For example, the Merrimac supercomputer [7] consists of many stream processors, each containing a number of floating-point units and a hierarchy of register files. A program expressed in a stream programming language is mapped to this architecture in a way that captures data locality through the register file hierarchy. Programs for Merrimac are written in the Brook streaming language [4]. Another similar project is the RAW microprocessor and the accompanying StreamIt streaming language [9].

The processing power of graphics processing units (GPUs) has led to their use for general-purpose computing [10].

GPUs are suitable for stream processing because they typically have a large number of 4-way vector execution units as well as a relatively large memory. Streaming languages can be used to program these systems, because kernel computation can be easily mapped to the processing units in a GPU. Various compilers that convert code written in a streaming language into code suitable for execution on a GPU have been proposed [10, 11]. GPU Brook [11] is a variant of the Brook streaming language specifically targeting GPUs. The GPU Brook compiler hides many details of the underlying GPU hardware from the programmer, thus making programming easier.

16.2.1 Streaming on FPGAs

Several research projects have investigated stream processing on FPGAs. Howes et al. [12] compared the performance of GPUs, PlayStation 2 (PS2) vector units, and FPGAs used as coprocessors. The applications were written in ASC (A Stream Compiler) for FPGAs, which has been extended to target GPUs and PS2. ASC is a C++ library that can be used to target hardware. The authors found that for most applications GPUs outperformed both FPGAs and PS2 vector units. However, they used the FPGA as a coprocessor card attached to a PC, which resulted in a large communication overhead between the host processor and the FPGA. They showed that removing this overhead improves the performance of the FPGA significantly. Their approach does not clearly define kernels and streams of data, so the burden is on the programmer to explicitly define which parts of the application will be implemented in hardware.

Bellas et al. [13] developed a system based on streaming accelerators. The computation in an application is expressed as a streaming data flow graph (sdfg) and data streams are specified through stream descriptors. The sdfg and the stream descriptors are then mapped to customizable units performing computation and communication. The disadvantage of this approach is that the application has to be described in a somewhat obscure format (sdfg).

Our approach to streaming on FPGAs is closest to the work described in [14], which also converts computation into hardware IP blocks interconnected in a pipelined fashion. Their approach targets ordinary C programs augmented with directives that allow the programmer to specify how an application maps to hardware. Our approach is based on a streaming language, which provides a higher abstraction level, so parallelism can be expressed without any knowledge of the target hardware.

16.3 Compiling Brook to Hardware

We propose using a streaming language as a natural choice for software programmers wishing to target their applications to FPGAs.

We believe that the stream processing paradigm is suitable for implementation in FPGAs, because programmable logic blocks in FPGAs are well suited to implementing parallel computation. Also, FPGAs are easily reprogrammable, so the generated hardware can be tailored to the needs of a specific application.

The design space for FPGA implementation of streaming applications is large. For instance, a kernel could be implemented as custom hardware, a soft-core processor, or a streaming processor. In each case, several parallel instances of the hardware implementing the kernel may be necessary to meet the throughput requirement. The choice of types and numbers of hardware units will affect the topology of the interconnection network. Finally, stream elements can be communicated through on-chip or off-chip memories, organized as regular memories or FIFO buffers.

We generate custom hardware for each kernel in the application. An ordinary soft processor would be a poor choice for implementing kernels, because it can only receive and send data through its data bus, which may quickly become a bottleneck. Custom hardware units can have as many I/O ports as needed by an application and are likely to provide the best performance. However, if a kernel is complex, the amount of circuitry needed for its implementation as custom hardware may be excessive, in which case a streaming processor may be a better choice. In this contribution we focus on implementing kernels as custom hardware units.

We base our work on GPU Brook [11] because it is open-source, used in many projects and supported through a community forum. To implement a program written in GPU Brook in an FPGA, the kernel code has to be converted into an HDL. Instead of performing this conversion directly, we generate C code for each kernel and then use the C2H behavioral compiler [2] to convert the C code into hardware. All of the C code is generated automatically by our compiler, so the programmer only has to write the Brook source code and pass it through our flow. The first part of the flow is a source-to-source compiler. We reused the original GPU Brook parser and wrote a code generator that emits C code for each kernel.

C2H allows functions in the C code to be implemented as hardware blocks. Altera documentation refers to the generated hardware block as a hardware accelerator [2]. Hardware accelerators act as coprocessors to the main soft processor (Nios II), which controls the accelerators and executes the code that was not selected for acceleration. The current version of C2H does not support floating-point data types or operations. Depending on the desired functionality, an accelerator can have one or more ports for accessing other (memory) modules in the system; for each pointer dereference in the original C code, a new port is created. Special pragma statements can be used to define which memories in the system a port connects to. We use this functionality to define how streams are passed between kernels through FIFOs. FIFOs are small, so they can be placed on-chip, and they fit naturally into the streaming paradigm, because they act as registers in the pipeline. FIFOs are used instead of simple registers because they provide buffering for cases when the execution time of a kernel varies between elements. For example, consider the system in Fig. 16.1. If the kernel mul takes a long time to process one element, the next kernel downstream (sum) could become idle if there were just one register between them. With a FIFO in between, the sum kernel can continue processing data buffered in the FIFO.
As long as the mul kernel delivers the next stream element before the FIFO buffer becomes empty, the sum kernel will not have to stall.
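To make this buffering behavior concrete, the following minimal C model mimics how such a hardware FIFO decouples two kernels. All names here (fifo_t, fifo_push, fifo_pop) are illustrative assumptions; the real FIFOs are generated on-chip hardware blocks in which a full FIFO stalls the producer and an empty FIFO stalls the consumer.

    /* Minimal C model of the on-chip hardware FIFOs between kernels.
     * Initialize all fields of fifo_t to zero before use. */
    #include <assert.h>

    #define FIFO_DEPTH 4

    typedef struct {
        int data[FIFO_DEPTH];
        int head;    /* next element to read */
        int tail;    /* next slot to write   */
        int count;   /* current fill level   */
    } fifo_t;

    static void fifo_push(fifo_t *f, int v) {
        assert(f->count < FIFO_DEPTH);        /* hardware would stall here */
        f->data[f->tail] = v;
        f->tail = (f->tail + 1) % FIFO_DEPTH;
        f->count++;
    }

    static int fifo_pop(fifo_t *f) {
        int v;
        assert(f->count > 0);                 /* hardware would stall here */
        v = f->data[f->head];
        f->head = (f->head + 1) % FIFO_DEPTH;
        f->count--;
        return v;
    }

With FIFO_DEPTH set to 4, the producer can run up to four elements ahead of the consumer, which is what absorbs short-term variation in the per-element execution time of a kernel such as mul.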

16.3.1 Example Brook Program

In Brook, streams are declared similarly to arrays, except that the characters < and > are used instead of square brackets. Kernels are denoted using the kernel keyword. We illustrate the work done by our compiler using the following Brook code:

    kernel void mul (int a<>, int b<>, out int c<>) {
        c = a * b;
    }

    reduce void sum (int a<>, reduce int r<>) {
        r = r + a;
    }

    void main () {
        int output[REDUCE_LENGTH];
        int stream1<IN_LENGTH>, stream2<IN_LENGTH>;
        int mul_result<IN_LENGTH>;
        int reduce_result<REDUCE_LENGTH>;

        create1 (stream1);
        create2 (stream2);
        mul (stream1, stream2, mul_result);
        sum (mul_result, reduce_result);
        write (reduce_result, output);
    }

The code is incomplete and contains only two kernels: mul and sum. Kernel code refers to individual streams, not stream elements. This prevents programmers from introducing data dependencies between stream elements. It is assumed that the operation is to be performed over all stream elements. A special kind of kernel, the so-called reduction kernel, uses several elements of the input stream to produce one element of the output stream. These kernels are used to perform reduction operations, and are denoted by the reduce keyword.

To convert Brook code into C, our compiler generates an explicit for loop around the statements inside the kernel function to specify that the kernel operation should be performed over all stream elements. For the Brook code above, our compiler produces code similar to this:

    void mul () {
        volatile int *a, *b, *c;
        int _iter;
        int _temp_a, _temp_b, _temp_c;
        for (_iter = 0; _iter < IN_LENGTH; _iter++) {
            _temp_a = *a;
            _temp_b = *b;
            _temp_c = _temp_a * _temp_b;
            *c = _temp_c;
        }
    }

    void sum () {
        volatile int *a, *r;
        int _temp_r, _iter, _mod_iter = 0;
        for (_iter = 0; _iter < IN_LENGTH; _iter++) {
            if ((_mod_iter == 0) && (_iter != 0))
                *r = _temp_r;
            if (_mod_iter == 0)
                _temp_r = *a;
            else
                _temp_r = _temp_r + (*a);
            if (_mod_iter == (IN_LENGTH/REDUCE_LENGTH - 1))
                _mod_iter = 0;
            else
                _mod_iter = _mod_iter + 1;
        }
        *r = _temp_r;
    }

    #pragma altera_accelerate connect_variable mul/c to mul_result/in
    #pragma altera_accelerate connect_variable sum/a to mul_result/out

In this code, all compiler-generated variable names start with the _ character. For the mul kernel, the code is a straightforward for loop that reads elements from the input streams (FIFOs) a and b, multiplies them and writes the result to the output stream (FIFO) c. The IN_LENGTH limit for the for loop was automatically inserted by the compiler, based on the sizes of the streams passed to the mul kernel from the main program. The pointers a, b and c are connected to FIFOs, which is specified by the C2H pragma statements that our compiler also generates automatically. For brevity, we show only two pragma statements in the above code. The first pragma statement specifies that the pointer c defined in the kernel mul connects to the in port of the FIFO named mul_result. The second pragma statement specifies that the pointer a defined in the kernel sum connects to the out port of the same FIFO. Together, these pragmas define a connection between the mul and sum kernels, as indicated in Fig. 16.1, based on dataflow analysis of the main program. The FIFOs are implemented in hardware, so the kernel code does not have to manage FIFO read and write pointers.

The temporary variables _temp_a, _temp_b and _temp_c are used to preserve the semantics of the original Brook program. Consider the statement c = a + a; in Brook. According to the Brook specification, this statement is equivalent to c = 2*a; and this equivalence holds in the generated code if temporary variables are used. However, if the temporary variables were not used and stream references were directly converted to pointers, the original statement would be translated into *c = *a + *a; which would produce an incorrect result. This is because each pointer dereference performs a read from the FIFO, so two consecutive stream elements would be read, instead of the same element being read twice. The temporary variables ensure that only one read is performed per input stream element, and that only one write is performed per output stream element.
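The following hypothetical kernel pair shows the two translations of c = a + a; in their full loop context. This is a sketch for illustration, not actual compiler output; once the pointers are connected to FIFOs, the two variants behave differently.

    #define IN_LENGTH 100000

    /* Faulty translation: each '*a' pops a new element from the FIFO,
     * so two consecutive elements are added instead of the same
     * element being doubled. */
    void double_wrong(volatile int *a, volatile int *c) {
        int _iter;
        for (_iter = 0; _iter < IN_LENGTH; _iter++)
            *c = *a + *a;            /* two FIFO reads per iteration */
    }

    /* Correct translation: the temporary guarantees exactly one read
     * per input element, matching the Brook semantics c = 2*a. */
    void double_right(volatile int *a, volatile int *c) {
        int _iter, _temp_a;
        for (_iter = 0; _iter < IN_LENGTH; _iter++) {
            _temp_a = *a;            /* single FIFO read */
            *c = _temp_a + _temp_a;  /* doubles the same element */
        }
    }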

The code generated for the sum kernel is more complicated because sum is a reduction kernel. The reduction operation can result in more than one element in the output stream. In our example, the input stream with IN_LENGTH elements is reduced to a stream of REDUCE_LENGTH elements. This means that IN_LENGTH/REDUCE_LENGTH consecutive elements of the input are summed to produce one element of the output. A straightforward implementation of this operation would use a modulo operation to determine when an appropriate number of elements have been added and a new addition should be started. Since the modulo operation is not efficiently implemented in FPGAs, our compiler avoids using it, and instead generates the _mod_iter variable, which emulates the modulo operation using a simple counter. Our compiler performs a similar optimization for division. Although the code above shows the use of the division operation, this is only for illustrative purposes. Since all stream sizes are known at compile time, our compiler performs the division and inserts the result in its place.

The code produced by our compiler is larger than what is shown in the previous example. According to the Brook specification, a kernel can accept streams of any dimensionality, but GPU Brook only supports 1-dimensional (1-D) and 2-dimensional (2-D) streams. To support both 1-D and 2-D streams, we have to be able to emit code for both versions of the kernel. Supporting both 1-D and 2-D streams for reduction kernels is more challenging, because one kernel in the original source code can handle reductions of both 1-D and 2-D streams, and a 2-D stream can be reduced into either a 2-D stream, a 1-D stream or a scalar. To handle all of these cases, we generate several different versions of kernels, as needed for the application.

Once the C code is generated, it is passed through Altera's C2H compiler, which generates a Verilog description of hardware accelerators for the kernels. The Verilog code is then passed through the Quartus II flow to generate a programming file for the target FPGA. To evaluate the system's performance, we generate the streams using simple loops inside the create1 and create2 kernels, as sketched below. This approach, as compared to reading the input data from memory, ensures that the runtime of our benchmarks is not dominated by communication with the shared off-chip memory, which would be the case if three different kernels were using the same memory. The results are written to the main memory using a specialized kernel write, so that they can be checked for correctness. The system also includes a Nios II processor, which verifies the correctness of the results and measures the execution time. For some applications such a processor may not be necessary, because the input data may be coming from the outside and the results may be passed back to the outside world.
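The create and write kernels could take roughly the following generated form. This is a hedged sketch, since the chapter does not list their code; the function names, the constant values and the generated input pattern are assumptions for illustration.

    #define IN_LENGTH     100000
    #define REDUCE_LENGTH 8      /* illustrative value */

    /* create1 produces the input stream from a simple on-chip loop,
     * so no off-chip memory traffic is involved. */
    void create1(volatile int *out) {
        int _iter;
        for (_iter = 0; _iter < IN_LENGTH; _iter++)
            *out = _iter;                 /* each store pushes one element */
    }

    /* write_out drains the final stream into main memory so the Nios II
     * processor can check the results ('write' in the Brook example). */
    void write_out(volatile int *in, int *mem) {
        int _iter, _temp;
        for (_iter = 0; _iter < REDUCE_LENGTH; _iter++) {
            _temp = *in;                  /* each load pops one element */
            mem[_iter] = _temp;           /* result lands in main memory */
        }
    }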

16.3.2 Exploiting Data Parallelism

The system in Fig. 16.1 contains hardware accelerators, which can operate in parallel, thus exploiting task-level parallelism. However, each accelerator processes stream elements one at a time, meaning that data parallelism is not exploited. One way to exploit data parallelism is to replicate the functionality of each kernel. This is possible because stream elements are independent, so they can be processed in parallel. In theory, we could have as many kernel replicas as there are elements in the streams being processed. In practice, the kernel that is a bottleneck for the application will be replicated as many times as necessary to achieve the required throughput. Replication will usually be limited by the available hardware resources.

Fig. 16.2 Replication example

Figure 16.2 shows the application from Fig. 16.1 with the kernels mul and sum replicated. In this example, sum and mul have comparable throughput, so both of these kernels have to be replicated to increase the application throughput. The kernels create and write were not replicated because it is assumed that they already operate at maximum throughput, limited by either the application or the I/O communication interface. If that is not the case, replicating one of these kernels may be beneficial. In this situation, the create kernels send elements to the mul 1/2 and mul 2/2 kernels alternately, in a round-robin fashion; a sketch of such a splitter is given at the end of this subsection. Each parallel branch only processes half the elements, thus effectively doubling the throughput.

One of the main goals of our research is to bring the benefits of FPGA hardware to software programmers, while hiding the details of the underlying hardware. For example, we plan to hide the details of kernel replication. The programmer only has to identify which kernels are bottlenecks in an application, and then specify by how much the throughput of the kernel should be increased. The compiler can then automatically create the necessary replicas of the kernel, or report an error if the kernel cannot be sufficiently replicated due to limited hardware availability. In the current version of our compiler we do not yet support automatic kernel replication. However, we have performed experiments with kernel replication, where we replicated kernels manually, in a manner that a compiler could easily perform automatically. We present the results of these and other experiments next.
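An illustrative round-robin splitter for two replicas could look as follows; the structure is an assumption about what automatic replication might emit, not output of the current compiler.

    #define IN_LENGTH 100000

    void split2(volatile int *in, volatile int *out1, volatile int *out2) {
        int _iter, _temp;
        for (_iter = 0; _iter < IN_LENGTH; _iter += 2) {
            _temp = *in;  *out1 = _temp;   /* even elements -> replica 1/2 */
            _temp = *in;  *out2 = _temp;   /* odd elements  -> replica 2/2 */
        }
    }

Each replica then runs the unmodified kernel loop over IN_LENGTH/2 elements, which is what doubles the aggregate throughput when the replicated kernel is the bottleneck.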

16.4 Experimental Evaluation

To validate the correctness of our design flow and estimate the performance benefits of our approach, we implemented two small applications using our flow. We then compared the throughput of our implementation in hardware with the throughput of the best software implementation running on a Nios II soft processor on the same FPGA device, and with the same software function accelerated using C2H. This comparison is fair, because our design flow presents the programmer with a compiler interface that is similar to the traditional software development flow, much like C2H. Comparing our approach to a hard-core processor or a GPU would not be fair, because of the significantly differing technologies used for their implementation. Research in [5] and [11] has shown that streaming programs can be efficiently compiled and executed on general-purpose processors and GPUs, respectively.

We chose two applications that are often used to demonstrate computation acceleration, because they are simple and their characteristics are well understood. These applications are Autocorrelation and a Finite Impulse Response (FIR) filter. Autocorrelation is an operation that computes the cross-correlation of a signal and a time-shifted version of that same signal. In our experiments we perform autocorrelation of a signal consisting of 100,000 samples for 8 different shift distances. The FIR filter is commonly used in digital signal processing as a digital filter. The filter works by storing a certain number of samples in a pipeline and then multiplying each sample by a constant factor and summing those products. The depth of the pipeline is often referred to as the number of taps in the filter. In our experiments we use a filter with 8 taps on an input signal consisting of 100,000 samples. In both applications, samples were represented as 32-bit integers. In all our experiments we use FIFOs with depth 4; we found that increasing the FIFO sizes beyond 4 was not beneficial for our benchmarks.

Our experimental system is based on the fast version (Nios II/f) of the Nios II processor, with instruction and data caches (4 KBytes each) and a hardware multiplier unit. The processor is connected to an off-chip 8-MB SDRAM module, and to the timer and UART peripherals which enable measuring and reporting the program execution times. The software implementations of both applications were first run on this system and their throughput was recorded, along with the area and the maximum operating frequency (F_max) of the system. Next, we implemented each application in the Brook streaming language and compiled it using our basic flow, with each kernel mapped to one hardware accelerator. We measured the area and throughput of each application, and then replicated the kernels in each application 2 and 4 times. Finally, we accelerated the original software function using C2H. All experiments were run on Altera's DE2 development board with a Cyclone II EP2C35F672C6 FPGA device. Software was run from the off-chip SDRAM memory, because the large dataset of 100,000 elements could not fit into the available on-chip memory. To make the comparison fair, the input dataset for software was generated inside a loop and was not stored in the off-chip memory.
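For reference, the computation performed by the 8-tap FIR benchmark corresponds to the following C model. The function name and the coefficient handling are assumptions for illustration; the chapter does not list the benchmark source, and the first NUM_TAPS-1 outputs are simply skipped here.

    #include <stdint.h>

    #define NUM_TAPS    8
    #define NUM_SAMPLES 100000

    void fir_reference(const int32_t *x, int32_t *y, const int32_t *coeff) {
        int i, t;
        for (i = NUM_TAPS - 1; i < NUM_SAMPLES; i++) {
            int32_t acc = 0;
            for (t = 0; t < NUM_TAPS; t++)
                acc += coeff[t] * x[i - t];   /* multiply-accumulate over the tap window */
            y[i] = acc;
        }
    }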

Table 16.1 Throughput and area results for different application implementations

    Application    Throughput (KB/s)   Area (LEs)   F_max (MHz)   Relative throughput   Relative area
    autocor_soft    6,255               2,…          …             …                     …
    autocor         5,878              +2,…          …             …                     …
    autocor_x2     12,240              +3,…          …             …                     …
    autocor_x4     23,833              +6,…          …             …                     …
    autocor_c2h     5,509              +1,…          …             …                     …
    fir_soft        2,163               2,…          …             …                     …
    fir             7,948              +3,…          …             …                     …
    fir_x2         15,515              +6,…          …             …                     …
    fir_x4         19,…                +…            …             …                     …
    fir_c2h         5,849              +1,…          …             …                     …

16.4.1 Results

The results of our experiments are summarized in Table 16.1. The first column indicates the application: autocor_soft and fir_soft correspond to software implementations, while autocor_c2h and fir_c2h correspond to C2H implementations; autocor and fir correspond to the basic streaming implementations, whereas the applications with x2 and x4 in their names correspond to the applications whose kernels were replicated 2 and 4 times, respectively. The second column shows absolute throughput, while the third column presents area results for the logic performing the computation. We do not include peripherals and units that are in the system only for measurement and debugging purposes in this area, because they are not necessary once a real system is deployed. For the software implementation, this means that we report only the area for the Nios II processor and the SDRAM controller. For the streaming and C2H implementations we report only the area for the accelerators, the FIFOs (where applicable) and the SDRAM controller. We do not include the area for the Nios II processor, because its role of starting the accelerators could easily be replaced by a simple state machine. We indicate this with the + character in the table, to emphasize that this area is in addition to the area for the processor. The fourth column in the table presents the system's F_max, and the final two columns show throughput and area relative to the software implementation.

There are several interesting observations that can be made. First, it is interesting to note that the streaming implementation and the software implementation of autocorrelation achieve similar throughputs. This is because the operations are simple and the input data is generated on-chip, so the processor can fit the loop code into its instruction cache and perform the computation without accessing the off-chip memory. The off-chip memory is accessed only when a result is written, which is the same behavior as that of the streaming implementation. As a consequence, the streaming version and the software version exhibit similar performance; the small difference is due to the difference in F_max of the two systems.

The C2H implementation exhibits similar behavior, but achieves lower throughput due to its lower F_max. This is because C2H implements the complete algorithm in one accelerator, while the streaming approach distributes it across several accelerators. One significant difference between the streaming implementation and the other two approaches is that the throughput achieved by the streaming implementation can be improved by replicating its kernels. As our results show, replicating the kernels two or four times results in nearly double and quadruple throughput, respectively.

The situation is slightly different for the FIR filter application. In this application the incoming samples have to be inserted into a shift register. In both implementations, this shifting is implemented as a circular buffer in memory. This operation is more efficiently implemented in hardware, because independent operations (e.g. updating the loop counters and writing to the buffer) can be performed in parallel. As a result, the C2H implementation achieves 2.7 times, and the streaming implementation 3.68 times, higher throughput than software. The throughput of the streaming application can be further increased by replicating the kernels. Doubling the number of kernels results in double throughput, as expected. However, when the kernels are replicated four times, we fail to achieve four times higher throughput, because the implementation of the shift register cannot be easily replicated. The kernel implementing the shift register requires all 8 elements of the input to be available to assign them to outputs in a round-robin fashion. Therefore, replicating the node does not reduce the amount of work each node has to perform. Although it is conceivable that this could be improved manually, there does not seem to be an easy way to do it automatically. Therefore, once the shift register implementation becomes a bottleneck, performance cannot be automatically improved any further.

Comparing the area results, the C2H implementations require less area than the equivalent streaming implementations, but they also provide lower F_max and throughput. In addition, streaming kernels can be replicated to further increase the throughput, while C2H does not provide such an option.

16.5 Concluding Remarks

In this chapter we presented a novel approach for allowing software programmers to target FPGAs. The streaming paradigm allows programmers to effectively express parallelism, and it maps well to the FPGA logic. We presented a design flow that converts a Brook streaming program into hardware using our source-to-source compiler and Altera's C2H compiler. Many FPGA systems currently use software running on a soft processor to implement a part of the functionality, while critical portions of the application are described in an HDL and implemented in hardware. Our system allows the application to be fully described in software and still exploit the capabilities of FPGA hardware, thus reducing design time and cost. Our experiments show that this approach results in up to 8.9 times better throughput than a soft-core processor, and up to 4.3 times better throughput than a C2H-accelerated implementation running on the same FPGA. Moreover, the performance can be improved by employing more hardware to perform the computation.

Our future work will focus on automating the replication of kernels to increase throughput automatically. We also want to add support for infinite streams to our flow. As shown in [14], infinite streams are important for real applications, such as multimedia. In these applications the data to be processed is constantly streaming into the chip. Although neither Brook nor GPU Brook supports infinite streams, we believe such a feature can be implemented without significant changes to the language. For instance, a stream with length 0 could be interpreted as an infinite stream. We also plan to add debug support to our design flow. A debugger for the streaming approach described in this chapter can be implemented using Nios II as a control processor. It can observe the values that pass through each of the FIFOs in the system, and stop the flow of data through the FIFOs to implement breakpoints. Finally, we plan to build several large applications to demonstrate the usability of our approach for real-world applications.

References

1. G. De Micheli. Synthesis and Optimization of Digital Circuits. McGraw-Hill, New York, 1994.
2. Altera. Nios II C-to-hardware acceleration compiler. products/ip/processors/nios2/tools/c2h/ni2-c2h.html.
3. Y.Y. Leow, C.Y. Ng, and W.F. Wong. Generating hardware from OpenMP programs. In Proc. of IEEE Int. Conf. on Field Programmable Technology, pages 73–80, 2006.
4. I. Buck. Brook Spec v0.2. Tech. Report CSTR, Stanford University, October 2003.
5. J. Gummaraju and M. Rosenblum. Stream programming on general-purpose processors. In Proc. of the 38th Int. Symp. on Microarchitecture, 2005.
6. S.-W. Liao et al. Data and computation transformations for Brook streaming applications on multiprocessors. In Proc. of the Int. Symp. on Code Generation and Optimization, 2006.
7. W.J. Dally et al. Merrimac: supercomputing with streams. In Proc. of the 2003 Conference on Supercomputing, pages 35–35, 2003.
8. R. Stephens. A survey of stream processing. Acta Informatica, 34(7), 1997.
9. M.I. Gordon et al. A stream compiler for communication-exposed architectures. ACM SIGOPS Operating Systems Review, 36(5), 2002.
10. D. Tarditi et al. Accelerator: using data parallelism to program GPUs for general-purpose uses. In Proc. of the 12th Int. Conf. on Architectural Support for Programming Languages and Operating Systems, 2006.
11. I. Buck et al. Brook for GPUs: stream computing on graphics hardware. ACM Trans. on Graphics, 23(3), 2004.
12. L.W. Howes et al. Comparing FPGAs to graphics accelerators and the PlayStation 2 using a unified source description. In Int. Conf. on Field-Programmable Logic and Applications, 2006.
13. N. Bellas et al. Template-based generation of streaming accelerators from a high-level representation. In IEEE Symp. on Field-Programmable Custom Computing Machines.
14. A.B.P. Mukherjee and R. Jones. Handling data streams while compiling C programs onto hardware. In Proc. IEEE Computer Society Annual Symp. on VLSI, 2004.

Part IV
Verification and Requirements Evaluation

Chapter 17
A New Verification Technique for Custom-Designed Components at the Arithmetic Bit Level

Evgeny Pavlenko, Markus Wedler, Dominik Stoffel, Wolfgang Kunz, Oliver Wienand and Evgeny Karibaev

Abstract Arithmetic Bit-Level (ABL) normalization has been proven a viable approach to formal property checking of datapath designs. It is applicable where arithmetic bit-level components and sub-components can be identified at the register-transfer (RT) level of the design and the property. This chapter extends the applicability of ABL normalization to cases where some of the arithmetic components are custom-designed entities, e.g., specified using Boolean equations or gates. We transform these entities into ABL building blocks using Reed–Muller expressions as an intermediate representation. We show how Boolean logic expressed in Reed–Muller form can be automatically transformed into ABL components so that such logic blocks can be treated together with the remaining ABL components in a subsequent normalization run. The approach is evaluated on a number of industrial designs generated by a commercial arithmetic module generator.

Keywords Custom-designed component · Reed–Muller expansion · ABL normalization

17.1 Introduction

Formal property checking has become mainstream in many SoC design flows. In particular, it is common practice to verify control-intensive design blocks formally in order to guarantee high quality for these blocks as well as to relieve system-level verification from local debugging tasks. Dealing with arithmetic circuits, however, there is still a notable lack of robustness when applying formal methodologies. Simulation, therefore, still prevails in industry when verifying arithmetic datapaths. This imposes the risk of overlooking bugs in arithmetic operations.

E. Pavlenko (✉) Department of Electrical and Computer Engineering, University of Kaiserslautern, Kaiserslautern, Germany
e-mail: pavlenko@eit.uni-kl.de

During the last two decades, datapath verification has been an intensive field of research within the formal methods community. Meanwhile, there exists a large variety of techniques tackling arithmetic circuit verification in different ways.

Word-level decision diagrams like *BMDs [3], which promise a compact canonical representation for arithmetic functions, have been investigated. Due to the lack of robust synthesis routines to derive these diagrams from bit-level implementations, *BMDs are, however, hardly used in RTL property checking. For example, Hamaguchi's method for *BMD synthesis [8] causes the diagram size to grow exponentially in case of faulty circuits, as noted by Wefel and Molitor [18]. Generating word-level diagrams from bit-level specifications has remained an unresolved issue also for more recent developments such as Taylor expansion diagrams [5].

As SAT-based techniques have become the predominant proof methods in formal verification, significant efforts were made to integrate SAT with solvers for other domains. The integration of SAT and ILP techniques leads to hybrid solvers like [1, 4]. However, ILP turns out to be unsuitable for RTL property checking because non-linear arithmetic functions have to be handled. Unfortunately, even simple multiplication falls into this category. More recently, SAT-modulo-theory (SMT) solvers have gained significant attention [7]. For the purpose of this chapter we consider the quantifier-free logic of fixed-size bit vectors (QF-BV) to be a natural candidate for representing the decision problems encountered in property checking. The solvers Spear [13] and Boolector [2] showed the best performance within this category in the SMT solver competitions of 2007 and 2008, respectively. However, our experience is that these solvers show the same bottlenecks as solvers based on bit blasting (i.e., direct conversion of the problem into CNF) when applied to instances derived from the verification of arithmetic datapaths.

For equivalence checking of arithmetic RTL circuits, especially multipliers, a technique based on rewriting was proposed in [15]. A database of rewrite rules is provided to support a large number of widely used multiplier implementation schemes. However, for non-standard implementations the approach requires updating the database manually and is thus not fully automatic.

Computer algebra techniques have shown promising results at higher levels of abstraction [12]. When applied to RTL designs with bit-level details, they require a large amount of intermediate result specifications [16]. However, if an arithmetic bit level (ABL) description of the datapath under verification is available [19], they are also applicable in RTL property checking. In [14] an extraction technique is presented that automatically extracts ABL information from optimized gate netlists of arithmetic circuits after synthesis. This approach is mainly designed for application in equivalence checking, and its arithmetic reasoning on the ABL is restricted to a single addition network. For RTL property checking, however, global reasoning over several arithmetic components is required.

In [17] a generalization of the ABL description was introduced and a normalization calculus for property checking was presented. All ABL information required for property checking can usually be obtained directly from the RTL description.

Unfortunately, this may change in full-custom design flows, where ABL information may no longer be available at the RT level. In order to apply ABL techniques in full-custom design flows, a description language for arithmetic circuits was introduced in [10], which designers use to manually capture the necessary ABL information. The methods used for the arithmetic proofs are similar to the normalization approach of [17].

In this chapter, we increase the level of automation for RTL property checking of arithmetic circuits in those cases where ABL information is missing for certain custom-made components of a design, such as Booth encoders or sophisticated addition components that are typically implemented below the ABL abstraction level. This reflects a large number of industrial applications where certain arithmetic components are custom-designed and others (especially array structures such as addition networks) are not. We provide an extraction technique to generate ABL descriptions for property checking that can be applied to those parts of the design where hand-crafted optimizations and specialized architectures have made it impossible to translate the RTL description into an ABL description immediately. The proposed approach thus fills an important gap and ensures that the manual approach of [10] can be reserved for high-end applications designed globally by a full-custom methodology.

The chapter is organized as follows: Sect. 17.2 presents a brief review of the ABL normalization technique, Sect. 17.3 describes our extensions to this technique, Sect. 17.4 presents experimental results and Sect. 17.5 concludes this chapter.

17.2 Normalization Method

This section presents a brief review of the ABL normalization technique presented in [17], as far as it is required to describe the proposed extensions. Furthermore, we motivate the need for a generic property checking technique which can handle mixed gate-level and ABL descriptions.

17.2.1 ABL Normalization

Sophisticated arithmetic circuit designs compose arithmetic functions using bit-level arithmetic circuitry for addition and multiplication. This composition can be modeled at the arithmetic bit level (ABL). An ABL description is a directed acyclic graph whose vertices can be of type partial product generator, addition network or comparator. Partial product generators model bit-wise multiplication and comparators model comparison of bit vectors. Addition networks model addition at different levels of abstraction, including bit-level units like half adders (HA) or full adders (FA) as well as word-level additions such as bit-vector adders or addition schemes of multipliers. Our notion of an addition network can also capture models for addition at intermediate levels of abstraction, such as carry-save adders (CSA), where bit-wise addition is performed on bit vectors.

Throughout this chapter we use the following notation: For a ∈ Z, b > 0, the remainder a mod b of the integer division a/b denotes the smallest k ≥ 0 with k = a − mb for some m ∈ Z. The unsigned integer represented by a bit vector a = (a_(n−1), …, a_0) is denoted by Z(a) = Σ_{i=0}^{n−1} 2^i · a_i. Conversely, ⟨x, n⟩ for n > 0 and x ∈ Z denotes the uniquely determined bit vector a = (a_(n−1), …, a_0) with x mod 2^n = Σ_{i=0}^{n−1} 2^i · a_i, i.e., a is the n-bit binary unsigned integer representation of x.

Addition networks represent weighted additions

    r = ⟨c + Σ_{a∈A} w(a) · a, n⟩

where c ∈ Z is a constant offset, A is a set of bit-level variables called addends, w : A → Z is a weight function and the bit vector r = (r_(n−1), …, r_0) is called the result. It has been shown in [17] that the weights w(a) can be considered to be non-negative for all addends a ∈ A. We say that a ∈ A is an addend of column k ≥ 0 if ⟨w(a), k+1⟩ has a leading 1. We use the notation A_i := {a ∈ A | a is an addend of column i} for the set of addends in column i. Note that we have A = ∪_{i=0}^{n−1} A_i. With this notation, we can reformulate the defining equation for r as

    r = ⟨Σ_{i=0}^{n−1} 2^i · (c_i + Σ_{a∈A_i} a), n⟩,   with ⟨c, n⟩ = (c_(n−1), …, c_0).

Based on this characterization we can easily see that the result bit r_k is influenced only by addends a ∈ A^k := ∪_{i=0}^{k} A_i. Whenever addends from columns i ≤ k influence the result r_(k+1), we say that the addition network generates carries at column k. This is the case when there is an assignment to the addends such that Σ_{i=0}^{k} 2^i · (c_i + Σ_{a∈A_i} a) ≥ 2^(k+1).

The main task of ABL normalization is to simplify the comparison of structurally dissimilar ABL representations for arithmetic functions. It is well known that this task is generally beyond the capacity of SAT-based methods. The main cause for structural differences between implementation and specification can be found in the application of commutative and distributive laws. These laws are applied during implementation of arithmetic circuits in order to optimize hardware costs and performance. In principle, it may be possible to enhance the specification by detailed structural information that allows the verification tool to keep track of such optimizations. However, this results in overwhelming effort to keep the specification up to date throughout the design process.

In order to match structurally different ABL descriptions, the normalization algorithm performs a sequence of local equivalence transformations on these descriptions. For example, Fig. 17.1 shows the transformation of an ABL description into a normalized instance where this structural difference between design and property is eliminated. The main transformation applied during this normalization will be discussed in the remainder of this subsection.
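The semantics of an addition network can be captured operationally in a few lines of C. The following evaluator is an illustrative aid (not part of the described tool) that computes r = ⟨c + Σ w(a)·a, n⟩ for a concrete assignment of the bit-level addends; all names and the array-based interface are assumptions for illustration.

    #include <stdint.h>

    /* Evaluates r = <c + sum_i w[i]*a[i], n> for addends a[i] in {0,1};
     * assumes n <= 32 so the result fits into a uint32_t. */
    uint32_t eval_addition_network(int64_t c, const int64_t *w,
                                   const uint8_t *a, int num_addends, int n) {
        int64_t sum = c;
        int64_t mod = (int64_t)1 << n;     /* 2^n */
        int i;
        for (i = 0; i < num_addends; i++)
            sum += w[i] * (int64_t)a[i];   /* weighted addition */
        /* <x, n> is the n-bit unsigned binary representation of x mod 2^n */
        return (uint32_t)(((sum % mod) + mod) % mod);
    }

For small networks, enumerating all addend assignments with this model is one way to observe where carries are generated, i.e., where the partial sum of columns 0..k reaches 2^(k+1).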

Fig. 17.1 ABL normalization (CMP = comparator, N_i = addition network, P_i = partial product generator)

Merging of Addition Networks

Merging of addition networks corresponds to the application of the commutative law. We consider two addition networks N, N′ for the weighted additions

    r = ⟨c + Σ_{a∈A} w(a) · a, n⟩,
    r′ = ⟨c′ + Σ_{a′∈A′} w′(a′) · a′, n′⟩.

Furthermore, we consider all the result bits r′_i of N′ to be addends in consecutive columns i + k of N with k ≥ 0; k indicates the column with r′_0 ∈ A_k. It is easy to see that w(r′_i) = 2^i · w(r′_0) holds for all i ∈ {0, …, n′−1} in this case. Finally, we require r′_(n′−1) ∈ A_(n−1) or that N′ does not generate carries in the uppermost column n′−1. Under these conditions we can merge the addition networks by replacing the results r′_i by the addends A′_i and the constant offsets c′_i in the appropriate columns of N. The resulting network merge(N, N′) can be described by the following formula:

    r = ⟨c + Σ_{a∈A} w(a) · a + 2^k · (c′ + Σ_{a′∈A′} w′(a′) · a′) − Σ_{i=0}^{n′−1} 2^(i+k) · r′_i, n⟩.

If all the results r′_i are used as addends only in a single column i + k of N, this equation simplifies to

    r = ⟨c + 2^k · c′ + Σ_{a∈A\{r′_0,…,r′_(n′−1)}} w(a) · a + 2^k · Σ_{a′∈A′} w′(a′) · a′, n⟩.

Fig. 17.2 Partial product distribution

Distribution of Partial Products

Distribution of partial products across addition networks corresponds to the application of the distributive law. Figure 17.2 shows an example of how the partial products e and f with a new variable d can be expressed as the outputs of an addition network (FA) for the partial products ad, bd and cd. This example is generalized in the following. Let N be an addition network with a set of addends A, a weight function w, constant offset c and result vector r = (r_(n−1), …, r_0). Furthermore, let P be a partial product generator for the product of r with a second bit vector q = (q_(m−1), …, q_0). Distribution of P over N results in m addition networks N_0, …, N_(m−1) and a partial product generator P′ for the addends a ∈ A with q. Here, the N_i are defined by the following properties:

- A^i := {a ∧ q_i | a ∈ A} ∪ {q_i} is the set of addends.
- The weight function w_i is defined by w_i(a ∧ q_i) = w(a) and w_i(q_i) = c.
- The constant offset is c_i = 0.

Distribution of partial products is sufficient to create a normal form for an ABL description, i.e., as defined in [17], an equivalent ABL where partial products are never generated from the results of an addition network. However, normal forms still allow for different topologies of cascaded addition networks in the fanout of the partial products. In order to decide whether or not two implementations of an addition network are equivalent, as in the left part of Fig. 17.1, we iteratively merge the addition networks in both implementations. If this process ends with a single addition network for each of the implementations, as in the right part of Fig. 17.1, the equivalence check reduces to checking whether the added partial products are pairwise equivalent and whether the equivalent partial products have the same weight modulo bit width in both implementations. Finally, it also needs to be checked whether the constant offsets agree modulo bit width (see the example in Sect. 17.3.1 for more details).

17.2.2 Mixed ABL/Gate-Level Problems

In order to improve performance or area, designers sometimes describe certain parts of an arithmetic circuit at the gate level.

Therefore, full ABL information is not always available in industrial RTL descriptions, and hence the normalization technique cannot be applied in these cases. For instance, the performance of a multiplier is dominated by the additions of partial products. A widely adopted technique to reduce the number of partial products is (radix-4) Booth encoding. At the ABL, the Booth-encoded partial products can be described by the following equation:

    p_i = 2^(2i) · b · A = 2^(2i) · b · (−2·a_(2i+1) + a_(2i) + a_(2i−1)),            (17.1)

where A ∈ {−2, −1, 0, 1, 2} is a so-called Booth digit. However, for implementation a designer will not consider instantiating a signed 3 × n-bit multiplier and a (3 + n) × 2i-bit multiplier to calculate a partial product corresponding to Eq. (17.1). By contrast, designers implement multiplication by 2^j, with j ∈ Z⁺, by shifting the corresponding bit vector, and in case A < 0 the two's-complement identity A · b = ¬(|A| · b) + 1 is used. The conditions for the extra bit shift and the negation are as follows:

    shift_i = (a_(2i+1) ∧ ¬a_(2i) ∧ ¬a_(2i−1)) ∨ (¬a_(2i+1) ∧ a_(2i) ∧ a_(2i−1)),    (17.2)
    cpl_i   = a_(2i+1) ∧ ¬(a_(2i) ∧ a_(2i−1)).                                       (17.3)

Therefore, the implemented partial products p′_i = p_i − 2^(2i) · cpl_i will be:

    p′_i = 0,   if a_(2i+1) = a_(2i) = a_(2i−1) = 0
           0,   if a_(2i+1) = a_(2i) = a_(2i−1) = 1                                  (17.4)
           B,   otherwise

where B = ((b_(n−1) ⊕ cpl_i, …, b_0 ⊕ cpl_i) ≪ shift_i) ≪ 2i. In the above equations and throughout this chapter, the symbols ¬, ∧, ∨, ⊕ and ≪ denote Boolean negation, conjunction, disjunction, exclusive-or and shift-left, whereas +, − and · denote arithmetic addition, subtraction and multiplication, respectively.

Suppose the addition of the partial products p′_i and the complement bits cpl_i is performed using a tree of carry-save adders (CSA tree). In this case we generate an ABL description of the adder tree and of the standard multiplier implementation used in the property. However, the Booth-encoded partial products will not be part of this ABL. As a result, the normalization approach fails to identify equivalent partial products after merging the addition networks of the implementation and the specification, respectively. This problem is illustrated in Fig. 17.3.

In the next section we provide a technique to synthesize an ABL description for small local gate-level descriptions of arithmetic circuits, such as the Booth encoder of a multiplier discussed above.
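The recoding conditions (17.2) and (17.3) can be sanity-checked exhaustively against the Booth digit of Eq. (17.1); the following self-contained C program (an illustrative aid, not part of the proposed flow) does so for all eight bit triples.

    #include <assert.h>

    int main(void) {
        int a2, a1, a0;   /* a_(2i+1), a_(2i), a_(2i-1) */
        for (a2 = 0; a2 <= 1; a2++)
          for (a1 = 0; a1 <= 1; a1++)
            for (a0 = 0; a0 <= 1; a0++) {
                int digit = -2 * a2 + a1 + a0;                   /* Booth digit A */
                int shift = (a2 & !a1 & !a0) | (!a2 & a1 & a0);  /* Eq. (17.2) */
                int cpl   = a2 & !(a1 & a0);                     /* Eq. (17.3) */
                assert(shift == (digit == 2 || digit == -2));    /* magnitude-2 digit */
                assert(cpl   == (digit < 0));                    /* negative digit */
            }
        return 0;
    }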

Fig. 17.3 Incompletely normalized instance

17.3 Synthesis of ABL Descriptions from Gate-Level Models

In principle, every Boolean function can be synthesized into an equivalent ABL description. To see this, suppose the Boolean function is given in positive Reed–Muller form. The positive Reed–Muller (positive Davio) decomposition of a Boolean function f : {0, 1}^n → {0, 1} with respect to a variable x_i is given by the following equation:

    f(x_0, …, x_n) = f_(x_i=0) ⊕ x_i · (f_(x_i=0) ⊕ f_(x_i=1)),                      (17.5)

where f_(x_i=1) = f(x_0, …, x_i = 1, …, x_n) and f_(x_i=0) = f(x_0, …, x_i = 0, …, x_n) denote the positive and negative cofactors of f with respect to x_i. Recursive application of this decomposition results in the Reed–Muller form for a given Boolean function. There are efficient data structures such as OFDDs [9] and OKFDDs [6] to represent and manipulate Boolean functions in Reed–Muller form. As we generate only local Reed–Muller forms for small portions of a circuit, we currently do not resort to these data structures and represent Reed–Muller expressions explicitly. In the following subsection we study how to transform the Reed–Muller form of a Boolean function into an ABL description that is suitable for the normalization approach.

17.3.1 Generation of the Equivalent ABL Descriptions for Boolean Functions in Reed–Muller Form

The product terms of the Reed–Muller form can be transformed into equivalent cascades of partial product generators, and the XOR can be implemented by a single-column addition network. Such an addition network will always generate a carry in its single column, unless it consists of a single product term only. If the result of such a single-column addition network N with more than a single addend is used as an addend in some other network N′, merging of N and N′ is only possible if the result is added to the uppermost column of N′. This, however, cannot be expected in general, since it would require that only addends to the uppermost column of an addition network implementation are specified at the gate level. In order to overcome this restriction, we extend a single-column addition network to an equivalent multi-column addition network as follows:

Fig. 17.4 Example of Reed–Muller synthesis into an addition network

Theorem 1 Let N be a single-column addition network with result r. For every n ≥ 1 there is an n-column addition network N′ with results (r_(n−1), …, r_0) such that r_0 = r and r_i = 0 for all i > 0.

Proof Without loss of generality we can suppose N to have a set of addends A = {a_1, …, a_m} with w(a_i) = 1 for all a_i ∈ A and a constant offset c ∈ {0, 1}. We need to consider the case c = 0 only. Note that c = 1 can be handled by inserting a dummy addend a. In the resulting network N′ we eliminate a by updating the constant offset to c + w(a). In order to transform N, consider the n-column addition networks N̂ with r̂ = (r̂_(n−1), …, r̂_0) = ⟨Σ_{i=1}^{m} a_i, n⟩ and Ñ with r̃ = (r̃_(n−1), …, r̃_0) = ⟨Σ_{i=1}^{m} a_i − Σ_{i=1}^{n−1} 2^i · r̂_i, n⟩. Obviously, Ñ has the results r̃_0 = r and r̃_i = 0 for all i > 0. Moreover, the r̂_i can be expressed as Boolean functions (in Reed–Muller form) in terms of the addends a_k. Therefore, we obtain a single-column addition network N̂_i (i > 0) for each of the r̂_i. By the induction hypothesis we can extend N̂_i to an (n−i)-column addition network N̂′_i. By construction we can merge the addition networks N̂′_i with Ñ and obtain an equivalent network N′.

By means of the above theorem we can generate ABL descriptions for Boolean functions that are suitable for ABL normalization. We illustrate this by the example depicted in Fig. 17.4. Suppose we want to implement r = a ⊕ b ⊕ c using a three-column addition network N′ with results r = (r_2, r_1, r_0). According to the above proof we have to consider the intermediate network N̂ with three columns and the above variables as addends in column 0. The addition network Ñ subtracts the results of N̂ from columns 1 and 2. We obtain the Reed–Muller forms r̂_0 = a ⊕ b ⊕ c, r̂_1 = ab ⊕ ac ⊕ bc and r̂_2 = 0 for the results of N̂ and recursively determine a two-column addition network N̂_1 with the results r̂_1 and 0. This network has the addends ab, ac and bc in column 0 and the addend abc in column 1. By construction, the addition network N̂_1 can be merged with the addition network Ñ. This results in the equation

    r = ⟨(a + b + c − 2(ab + ac + bc) + 4abc) mod 8, 3⟩
      = ⟨(a + b + c + 2(ab + ac + bc) + 4(abc + ab + ac + bc)) mod 8, 3⟩

for the result r of the final addition network N′. This addition network can be implemented by the half-adder/full-adder netlist depicted in Fig. 17.5.
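Both forms of this equation can be validated exhaustively; the following C sketch (illustrative only) checks them against a ⊕ b ⊕ c for all input combinations, which also confirms that the upper result bits r_1 and r_2 vanish.

    #include <assert.h>

    int main(void) {
        int a, b, c;
        for (a = 0; a <= 1; a++)
          for (b = 0; b <= 1; b++)
            for (c = 0; c <= 1; c++) {
                int v1 = (a + b + c - 2*(a*b + a*c + b*c) + 4*a*b*c) & 7;
                int v2 = (a + b + c + 2*(a*b + a*c + b*c)
                          + 4*(a*b*c + a*b + a*c + b*c)) & 7;
                assert(v1 == (a ^ b ^ c));   /* r0 = XOR, r1 = r2 = 0 */
                assert(v2 == v1);            /* both forms agree mod 8 */
            }
        return 0;
    }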

Fig. 17.5 Synthesized ABL model for the Reed–Muller form

Our overall procedure for synthesis of ABL descriptions corresponding to local parts of the circuit has two phases:

- Transform all local gate-level descriptions into Reed–Muller forms.
- Transform Reed–Muller forms into equivalent multi-column addition networks as needed during normalization.

These algorithms are invoked on demand whenever conventional ABL normalization terminates with remaining addends in the compared addition networks for which gate-level representations exist. After converting these representations into ABL, we re-run normalization.

We conclude this section by illustrating the overall flow by means of a small example. Suppose it is required to verify the design of a 2 × 2 unsigned multiplier with (radix-4) Booth-encoded partial products. We further assume that the partial products of the design are implemented at the gate level.

The partial products provided for the addition tree of the design are listed in the left part of Table 17.1. Furthermore, we annotate in braces the corresponding Reed–Muller form in terms of the multiplier inputs a_k and b_i. The table also corresponds to the addition network N_Impl obtained in the normalization algorithm after merging the adders in the addition tree of the implementation, i.e., the result of N_Impl is defined as follows:

    (r_3, r_2, r_1, r_0) = ⟨cpl_0 + p′_0[0] + 2·p′_0[1] + 4·(p′_1[2] + p′_0[2] + cpl_1)
                           + 8·(signext + p′_1[3] + p′_0[3]), 4⟩.

Table 17.1 Partial products for unsigned multipliers

    Column  Result bit  Booth-encoded (radix-4) partial products           Standard multiplier partial products
    0       r_0         cpl_0 = {b_1}, p′_0[0] = {a_0b_0 ⊕ b_1}            a_0b_0
    1       r_1         p′_0[1] = {b_1 ⊕ a_1b_0 ⊕ a_0b_1 ⊕ a_0b_0b_1}      a_1b_0, a_0b_1
    2       r_2         p′_1[2] = {a_0b_1}, cpl_1 = {0},                   a_1b_1
                        p′_0[2] = {b_1 ⊕ a_1b_1 ⊕ a_1b_0b_1}
    3       r_3         signext = {1}, p′_1[3] = {a_1b_1},
                        p′_0[3] = ¬cpl_0 = {b_1 ⊕ 1}

The partial products for a standard implementation of a multiplier are depicted in the right part of Table 17.1. It is obvious that normalization cannot establish equivalence between these addition networks, as the partial products a_0b_0 and a_1b_0 do not have an equivalent counterpart in the implementation network N_Impl. In order to complete normalization, we convert the Reed–Muller forms of the partial products of the implementation into the corresponding multi-column addition networks. The result of this computation step is summarized in Table 17.2.

Table 17.2 Multi-column addition network for implementation of the product

    Partial product  Reed–Muller form                    Addition network
    cpl_0            b_1                                 (0, 0, 0, cpl_0) = ⟨b_1, 4⟩
    p′_0[0]          a_0b_0 ⊕ b_1                        (0, 0, 0, p′_0[0]) = ⟨a_0b_0 + b_1 − 2a_0b_0b_1, 4⟩
    p′_0[1]          b_1 ⊕ a_1b_0 ⊕ a_0b_1 ⊕ a_0b_0b_1   (0, 0, p′_0[1]) = ⟨b_1 + a_1b_0 + a_0b_1 + a_0b_0b_1 − 2(a_1b_0b_1 + a_0b_1), 3⟩
    p′_1[2]          a_0b_1                              (0, p′_1[2]) = ⟨a_0b_1, 2⟩
    p′_0[2]          b_1 ⊕ a_1b_1 ⊕ a_1b_0b_1            (0, p′_0[2]) = ⟨b_1 + a_1b_1 + a_1b_0b_1 − 2a_1b_1, 2⟩
    cpl_1            0                                   (0, cpl_1) = ⟨0, 2⟩
    signext          1                                   (signext) = ⟨1, 1⟩
    p′_1[3]          a_1b_1                              (p′_1[3]) = ⟨a_1b_1, 1⟩
    p′_0[3]          b_1 ⊕ 1                             (p′_0[3]) = ⟨1 + b_1, 1⟩
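Each entry of Table 17.2 can be checked in the same exhaustive manner. For example, the following C sketch (illustrative only) verifies the multi-column network for p′_0[1]: the arithmetic expression reproduces the Reed–Muller form in result bit 0 and leaves the upper columns free of carries.

    #include <assert.h>

    int main(void) {
        int a0, a1, b0, b1;
        for (a0 = 0; a0 <= 1; a0++)
         for (a1 = 0; a1 <= 1; a1++)
          for (b0 = 0; b0 <= 1; b0++)
           for (b1 = 0; b1 <= 1; b1++) {
               int rm  = b1 ^ (a1 & b0) ^ (a0 & b1) ^ (a0 & b0 & b1);
               int net = (b1 + a1*b0 + a0*b1 + a0*b0*b1
                          - 2*(a1*b0*b1 + a0*b1)) & 7;   /* <..., 3> */
               assert(net == rm);   /* result bit 0 matches, bits 1-2 are zero */
           }
        return 0;
    }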

Merging the addition networks of Table 17.2 for the partial products with the implementation network N_Impl results in the addition network N′_Impl with

    (r_3, r_2, r_1, r_0) = ⟨b_1 + (a_0b_0 + b_1 − 2a_0b_0b_1)
                           + 2·(b_1 + a_1b_0 + a_0b_1 + a_0b_0b_1 − 2(a_1b_0b_1 + a_0b_1))
                           + 4·(a_0b_1 + (b_1 + a_1b_1 + a_1b_0b_1 − 2a_1b_1) + 0)
                           + 8·(1 + a_1b_1 + 1 + b_1), 4⟩
                         = ⟨16b_1 + a_0b_0 + 2a_1b_0 + 2a_0b_1 + 4a_1b_1, 4⟩
                         = ⟨a_0b_0 + 2a_1b_0 + 2a_0b_1 + 4a_1b_1, 4⟩.

Obviously, the resulting addition network N′_Impl and the standard addition network for unsigned multiplication are identical. Therefore, our implementation is proven to be correct.

17.4 Experimental Results

In this section we summarize the experimental evaluation of the proposed techniques. We implemented these techniques in a property checking environment utilizing SAT and ABL normalization. The overall flow of the integrated verification engine is shown in Fig. 17.6. As input to the verification engine we consider a combinational netlist of bit-vector functions representing the SAT instance that needs to be checked in order to prove a given property. To derive such a netlist from an HDL design and a property, we use the industrial property checker OneSpin 360 MV [11].

Fig. 17.6 Property checking flow

In order to simplify the corresponding SAT instance we normalize an ABL description generated from the arithmetic bit-vector functions in this netlist. However, if certain parts of the arithmetic circuit design are implemented at the gate level rather than at the arithmetic bit level, normalization will not succeed. In this case, we determine non-arithmetic bit-vector functions in the fan-in of design signals. For these bit-vector functions we determine equivalent ABL representations and include them in the ABL normalization problem. The process of extending the ABL description followed by normalization is iterated until either all comparisons for arithmetic signals are proven or no suitable extension of the normalized ABL description can be generated. In both cases a SAT solver is called to prove the remaining parts of the problem to be unsatisfiable.

The efficiency of our prototype implementation was compared against two SMT solvers, namely Spear v. 2.0 [13] and Boolector [2]. It should be noted that earlier versions of Spear and Boolector demonstrated the best results at the SMT competitions of 2007 and 2008, respectively. All experiments were carried out on an Intel Core 2 Duo E6400 (8 GB RAM) running Linux. All CPU times are specified in seconds, with a timeout limit (TO) as denoted with each table.

As a first step of the evaluation we collected experimental data in an industrial setting. A number of multiplier implementations were generated by the module generator of an industrial synthesis tool. The RTL code of the synthesized circuits contains Booth-encoded multipliers with partial product generators described at the gate level and CSA trees implementing the addition network at the ABL. Table 17.3 summarizes our experiments for signed and unsigned Booth-encoded multipliers of various bit widths. Both SMT solvers can only solve the arithmetic proof for small instances and abort due to limited computing resources on larger ones. However, when the missing ABL blocks are added to the arithmetic proof problem, our normalization-based tool can perform the proof in reasonable time. For reasons of space we omit similar results obtained for non-Booth-encoded and self-generated multipliers.

Table 17.3 Industrial multipliers (TO = 1000 sec)

Booth-encoded multiplier | Signed SMTs | Signed ABL | Unsigned SMTs | Unsigned ABL
… | TO | 0.38 | TO | …
… | TO | 1.21 | TO | …
… | TO | 3 | TO | …
… | TO | 82 | TO | 63.8

The second step of our evaluation explores the capacity limits of the Reed–Muller extractor. We generated unsigned multipliers for different input bit widths and different encodings of the partial products. Furthermore, we stepwise increased the size of the design portion specified at the gate level and at the same time reduced the ABL part of the design. Starting with an initial design where only the partial products are specified at the gate level, we iteratively implemented CSAs within the addition network by gate-level components. Figure 17.7 visualizes the first two iterations of this

experiment. This experimental setup will end up with instances where the complete multiplier is specified below the ABL abstraction and, of course, we do not expect a technique based on the Reed–Muller form to be efficient in this extreme corner case.

Fig. 17.7 Increasing the gate-level portion of the circuits by one CSA per iteration

Table 17.4 Multipliers of Fig. 17.7 (TO = 1000 sec; columns: unsigned multiplier bit width, standard scheme for i = 1, …, 5, Booth-encoded for i = 1, 2)
… | … | …

The results of Table 17.4 show that the proposed techniques can transform fairly large circuit regions specified at the logic level into ABL descriptions. In particular, if the borderline between the (optimized) partial product generators and the addition network is blurred, as is often the case in custom-designed circuits, our approach is powerful enough to transform the complete partial product generator as well as at least the first level of addition logic into ABL.

We conclude the experimental evaluation with an example of a pipelined multiplier. In three subsequent clock cycles the test circuit performs the multiplication of four words using a single multiplier. A finite state machine controls the assignment of the operands to the inputs of the multiplier, as depicted in Fig. 17.8. The figure also illustrates that the instantiated multiplier consists of a Booth encoder for the partial products implemented at the gate level and an addition network implemented at the ABL. The investigated property proves that every four clock cycles a correct arithmetic result is generated, provided that the reset and enable signals are triggered by the environment accordingly. The resulting decision problem after unrolling the design and assumption propagation is depicted in Fig. 17.9. Note that assumption propagation is used to eliminate control logic from the unrolled problem instance such that the remaining problem can be solved with the techniques

described in this chapter. The experimental results for different operand bit widths are summarized in Table 17.5. Again, our tool demonstrates good performance and solves the verification task within reasonable time for large instances.

Fig. 17.8 Shared multiplier

Fig. 17.9 Property checking instance

Table 17.5 Shared multipliers (TO = 3600 sec)
Bit-width | …
SMTs | TO TO TO TO TO TO TO
ABL | … TO

Conclusion and Future Work

In this chapter, we propose a method to transform design components described at the gate level into equivalent ABL building blocks. The method utilizes the Reed–Muller form of Boolean functions. It is applicable in cases where small portions of the arithmetic circuit implementation are specified at the gate level while ABL information is available for the remaining parts of the RTL design. The algorithms can be tightly integrated into a verification framework based on ABL normalization. The experimental results indicate applicability to typical industrial verification problems. Future work will explore how this technique can be leveraged for equivalence checking by integrating it into the approach of [14]. Moreover, we are currently integrating the work described in this chapter into advanced implementations of ABL normalization techniques based on computer algebra, namely Gröbner basis techniques over rings [19].

References

1. G. Audemard, P. Bertoli, A. Cimatti, A. Kornilowicz, and R. Sebastiani. A SAT-based approach for solving formulas over Boolean and linear mathematical propositions. In Proc. International Conference on Automated Deduction (CADE), 2002.
2. Boolector.
3. R.E. Bryant and Y.-A. Chen. Verification of arithmetic circuits with binary moment diagrams. In DAC '95: Proceedings of the 32nd ACM/IEEE Conference on Design Automation. Assoc. Comput. Mach., New York, 1995.
4. D. Chai and A. Kuehlmann. A fast pseudo-Boolean constraint solver. In Proc. International Design Automation Conference (DAC), 2003.
5. M. Ciesielski, Z. Zeng, P. Kalla, and B. Rouzeyre. Taylor expansion diagrams: A compact, canonical representation with applications to symbolic verification. In Proc. International Conference on Design, Automation and Test in Europe (DATE), 2002.
6. R. Drechsler, B. Becker, A. Sarabi, M. Theobald, and M. Perkowski. Efficient representation and manipulation of switching functions based on ordered Kronecker functional decision diagrams. In Proc. International Design Automation Conference (DAC), 1994.
7. H. Ganzinger, G. Hagen, R. Nieuwenhuis, A. Oliveras, and C. Tinelli. DPLL(T): Fast decision procedures. In Proc. International Conference on Computer Aided Verification (CAV), pages 26–37, July 2004.
8. K. Hamaguchi, A. Morita, and S. Yajima. Efficient construction of binary moment diagrams for verifying arithmetic circuits. In Proc. International Conference on Computer-Aided Design (ICCAD), pages 78–82, November 1995.
9. U. Kebschull, E. Schubert, and W. Rosenstiel. Multi-level logic based on functional decision diagrams. In Proc. European Design Automation Conference (EDAC), pages 43–47, 1992.
10. U. Krautz, C. Jacobi, K. Weber, M. Pflanz, W. Kunz, and M. Wedler. Verifying full-custom multipliers by Boolean equivalence checking and an arithmetic bit level proof. In ASP-DAC '08: Proceedings of the 2008 Conference on Asia and South Pacific Design Automation. IEEE Comput. Soc., Los Alamitos, 2008.
11. OneSpin Solutions GmbH, Germany. OneSpin 360 MV.
12. N. Shekhar, P. Kalla, and F. Enescu. Equivalence verification of polynomial datapaths using ideal membership testing. IEEE Transactions on Computer-Aided Design, 26(7), 2007.
13. Spear.
14. D. Stoffel and W. Kunz. Equivalence checking of arithmetic circuits on the arithmetic bit level. IEEE Transactions on Computer-Aided Design, 23(5), 2004.
15. S. Vasudevan, V. Viswanath, R.W. Sumners, and J.A. Abraham. Automatic verification of arithmetic circuits in RTL using stepwise refinement of term rewriting systems. IEEE Transactions on Computers, 56(10), 2007.
16. Y. Watanabe, N. Homma, T. Aoki, and T. Higuchi. Application of symbolic computer algebra to arithmetic circuit verification. In Proc. International Conference on Computer Design (ICCD), pages 25–32, October 2007.
17. M. Wedler, D. Stoffel, R. Brinkmann, and W. Kunz. A normalization method for arithmetic data-path verification. IEEE Transactions on Computer-Aided Design, 26(11), 2007.
18. S. Wefel and P. Molitor. Prove that a faulty multiplier is faulty!? In GLSVLSI '00: Proceedings of the 10th Great Lakes Symposium on VLSI. Assoc. Comput. Mach., New York, 2000.
19. O. Wienand, M. Wedler, G.-M. Greuel, D. Stoffel, and W. Kunz. An algebraic approach for proving data correctness in arithmetic data paths. In Proc. International Conference on Computer Aided Verification (CAV), Princeton, NJ, USA, July 2008.

Chapter 18
Debugging Contradictory Constraints in Constraint-Based Random Simulation

Daniel Große, Robert Wille, Robert Siegmund and Rolf Drechsler

Abstract Constraint-based random simulation is the state of the art in the verification of multi-million gate industrial designs. This method is based on stimulus generation by constraint solving. The resulting stimuli will in particular cover corner-case test scenarios which are usually hard to identify manually by the verification engineer. Consequently, constraint-based random simulation will catch corner-case bugs that would otherwise remain undetected. Therefore, the quality of design verification is increased significantly. However, in the process of constraint specification for a specific test scenario, the verification engineer is faced with the problem of over-constraining, i.e. the overall constraint specified for a test scenario has no solution. In this case the root cause of the contradiction has to be identified and resolved. Given the complexity of constraints used to describe test scenarios, this can be a very time-consuming process. In this chapter we propose a fully automated contradiction analysis method. Our method determines all nonrelevant constraints and computes all reasons that lead to the over-constraining. Thus, we point the verification engineer to exactly the sets of constraints that have to be considered to resolve the over-constraining. Experiments have been conducted in a real-life SystemC-based verification environment at the AMD Dresden Design Center. They demonstrate a significant reduction of the constraint contradiction debug time.

Keywords Constraint-based random simulation · Contradiction · Debugging · SystemC verification library

D. Große (✉) Institute of Computer Science, University of Bremen, Bremen, Germany
e-mail: grosse@informatik.uni-bremen.de

18.1 Introduction

The continued advance of circuit fabrication technology over the last 30 years now allows the integration of more than 1 billion transistors in System-on-Chip (SoC) designs. The development of SoCs of such complexity leads to enormous challenges in Computer-Aided Design (CAD), especially in the area of design verification, which needs to ensure the functional correctness of a design. Because

the capacity of formal verification is limited, simulation is still the most frequently used verification technique [22]. In directed simulation, explicitly specified stimulus patterns (e.g. written by verification engineers) are applied to the design. Each of those patterns stimulates a very specific design functionality (called a verification scenario), and the response of the design is compared thereafter with the expected result. Due to project time constraints, it is inherent to directed simulation that only a limited number of such scenarios will be verified.

With random simulation these limitations are compensated. Random stimuli are generated as inputs for the design. For example, to verify the communication over a bus, random addresses and random data are computed. A substantial time reduction for the creation of simulation scenarios is achieved by constraint-based random simulation (see e.g. [2, 22]). Here, the stimuli are generated directly from specified constraints by means of a constraint solver, i.e. stimulus patterns which satisfy the constraints are selected by the solver. The resulting stimuli will also cover test scenarios for corner cases that may be difficult to generate manually. As a consequence, design bugs will be found that might otherwise remain undetected, and the quality of design verification increases substantially.

For constraint-based random simulation several approaches have been proposed (see e.g. [4, 11, 12, 21, 23]). However, a major problem that arises when stimuli are specified in the form of constraints is over-constraining, i.e. the constraint solver is not able to find a valid solution for the given set of constraints. Whenever such a contradiction occurs in a constraint-based random simulation run, this run has to be terminated, as no valid stimulus patterns can be applied. Note that over-constraining may not necessarily happen at the very beginning of the simulation run, as modern test-bench languages such as SystemVerilog [9] allow the addition of constraints dynamically during simulation. In any case of over-constraining the verification engineer has to identify the root cause of the constraint contradiction. As this is usually done manually, by either code inspection or trial-and-error debugging, it is a tedious and time-consuming process.

To the best of our knowledge, in this work we propose the first non-trivial algorithm for contradiction analysis in constraint-based random simulation. In the area of constraint satisfaction problems, methods for diagnosing over-constrained problems have been introduced (see e.g. [1, 16]). These methods aim to find a solution for the over-constrained problem by relaxing constraints according to a given weight for each constraint. In the problem considered here no weights are available. Also, these approaches do not determine all minimal reasons that cause the overall contradiction. In contrast, Yuan et al. proposed an approach to locate the source of a conflict using a kind of exhaustive enumeration [22]. But since this method is expected to have a very large runtime, neither an implementation nor experiments are provided; instead, the authors recommend building an approximation. In the domain of Boolean Satisfiability (SAT) a somewhat similar problem can be found: computing an unsat core of an unsatisfiable formula, i.e. identifying an unsatisfiable sub-formula of the overall formula [5, 24]. However, to obtain a minimal reason the much more complex problem of a minimal unsat core has to be considered [7, 14, 15].
Furthermore, all minimal

unsat cores are required to determine all contradictions. In general this is very time consuming (see e.g. [13]).

In this chapter we propose a fully automatic technique for analyzing contradictions in constraint-based random simulation. The basic idea is as follows: The overall constraint is reformulated such that (contradicting) constraints can be disabled by introducing new free variables. Next, an abstraction is computed that forms the basis for the following steps. First, the self-contradicting constraints are identified. Then, all nonrelevant constraints are determined. Finally, for the remaining constraints (typically only a very small set) a detailed analysis is performed. In total our approach identifies all reasons of the over-constraining, i.e. all minimal constraint combinations that lead to a contradiction of the overall constraint. As shown by experiments in a verification environment of the AMD Dresden Design Center (DDC), the debugging time is reduced significantly. The verification engineer completely understands what causes the over-constraining and can resolve the contradictions in one single step.

The rest of this chapter is structured as follows. Section 18.2 briefly reviews the SystemC Verification (SCV) library that is used for constraint-based random simulation in this work. In Sect. 18.3 the considered problem is formalized and the concepts of the contradiction analysis approach are given. The implementation of the approach is described in detail in Sect. 18.4. Section 18.5 provides experimental results. First, the different types of contradictions are illustrated as examples. Then, we show the efficiency of our approach using several test cases and the application of the analysis technique to a real-life industrial example, a verification environment used at AMD DDC. Finally, the chapter is summarized in Sect. 18.6.

18.2 SystemC Verification Library

This section briefly reviews the SystemC Verification (SCV) library that is used for constraint-based random simulation in this work. The SCV library was introduced in 2002 as an open source C++ class library [10, 17, 20] on top of SystemC [8, 19]. In the following we focus only on the basic features of the SCV library for constraint-based random simulation.

Using the SCV library, constraints are modeled in terms of C++ classes. That way constraints can be hierarchically layered using C++ class inheritance. In detail, a constraint is derived from the scv_constraint_base class. The data to be randomized is specified as scv_smart_ptr variables. An example of an SCV constraint is shown in Fig. 18.1. The name of the constraint is cstr. Here, the three 64-bit unsigned integer variables a, b, and addr are randomized. The conditions on the variables a, b, and addr are defined by expressions in the respective SCV_CONSTRAINT() macros. Internally, a constraint in the SCV library is represented by the corresponding characteristic function, i.e. the function is true for all solutions of the constraint. This characteristic function of a constraint is represented as a Binary Decision Diagram (BDD), a canonical and compact data structure for Boolean functions [3]. For

stimuli generation, a weighting algorithm is applied to the constraint BDD to guarantee a uniform distribution over all constraint solutions and hence maximize the chance of entering unexplored regions of the design state space. As BDD package, CUDD [18] is used in the SCV library.

struct cstr : public scv_constraint_base {
  scv_smart_ptr<sc_uint<64> > a, b, addr;
  SCV_CONSTRAINT_CTOR(cstr) {
    SCV_CONSTRAINT( a() > 100 );
    SCV_CONSTRAINT( b() == 0 );
    SCV_CONSTRAINT( addr() >= 0 && addr() <= 0x400 );
  }
};

Fig. 18.1 Example constraint

18.3 Contradiction Analysis

In this section, first the considered problem, that is the contradiction of constraints, is formalized. Then, we present the concepts for the contradiction analysis approach.

18.3.1 Problem Formulation

Before the problem is formulated we define the type of constraints that are considered in this chapter.

Definition 18.1 A constraint is a Boolean function over variables from the set of variables V. For the specification of a constraint, the typical HDL operators such as e.g. logic AND, logic OR, arithmetic operators, and relational operators can be used.

Usually a constraint consists of a conjunction of other constraints. We formalize the resulting overall constraint in the following definition.

Definition 18.2 An overall constraint is defined as

C = C₀ ∧ C₁ ∧ ⋯ ∧ Cₙ₋₁

where the Cᵢ are constraints according to Definition 18.1.
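For illustration, the constraint cstr of Fig. 18.1 is an overall constraint in exactly this sense: each SCV_CONSTRAINT() macro contributes one Cᵢ, and the solver operates on their conjunction. A minimal usage sketch, assuming the standard scv_constraint_base interface in which each next() call asks the solver for one solution of C:

#include <scv.h>

int sc_main(int, char*[]) {
  cstr c("c");                       // constraint of Fig. 18.1
  for (int i = 0; i < 5; ++i) {
    c.next();                        // draw one solution of C = C0 & C1 & C2
    std::cout << "a=" << *c.a << " b=" << *c.b
              << " addr=0x" << std::hex << *c.addr << std::dec << std::endl;
  }
  return 0;
}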

In practice, the conjunction is built by the explicit use of several SCV_CONSTRAINT() macros or by applying inheritance, i.e. parts of the constraints are defined in a base class and inherited in the actual constraint. Note that this is not specific to constraint-based random simulation using the SCV library. In fact, the same principles are found, for example, in the random constraints of SystemVerilog [9]. During the specification of complex non-trivial constraints, the problem of over-constraining arises:

Definition 18.3 An overall constraint C is over-constrained or contradictory iff C is not satisfiable, i.e. C evaluates to 0 for all assignments to the constraint variables.

Typically, if C is contradictory the verification engineer has to manually identify the reason for the over-constraining. This process can be very time-consuming because several cases are possible. For example, one of the constraints Cᵢ may have no solution. Another reason for a contradiction may be that the conjunction of some of the constraints Cᵢ leads to 0. In the following, the term reason as used in the rest of this chapter is defined.

Definition 18.4 A reason for a contradictory overall constraint C is a set R = {C_i1, C_i2, …, C_ik} ⊆ {C₀, C₁, …, Cₙ₋₁} with the two properties:

1. The constraints in R form a contradiction, i.e. the conjunction C_i1 ∧ C_i2 ∧ ⋯ ∧ C_ik always evaluates to 0. Therefore the overall constraint C is contradictory.
2. Removing an arbitrary constraint from R resolves the contradiction, i.e. minimality of R is required.

Often the root of over-constraining results from more than one contradiction, i.e. there is more than one reason. If in this case only one reason is identified by the verification engineer, the constraint solver has to solve the fixed constraint again, but there is still no solution. Based on these observations, the following problem is considered in this chapter: How can we efficiently compute all minimal reasons for an over-constraining and thereby support the verification engineer in constraint debugging? Analyzing the contradictions in the overall constraint C and presenting all reasons is facilitated by our approach. In particular, excluding all constraints which are not part of a contradiction reduces the debugging time significantly.

18.3.2 Concepts for Contradiction Analysis

The general idea of the contradiction analysis approach is as follows: The overall constraint C is reformulated such that the conflicting constraints can be disabled by the constraint solver and C becomes satisfiable. By analyzing the logical dependencies of the disabled constraints, we can identify all reasons for the over-constraining.

Fig. 18.2 Contradictory constraint

Definition 18.5 Let C be over-constrained. Then the reformulated constraint C′ is built by introducing a new free variable sᵢ for each constraint Cᵢ and substituting each constraint Cᵢ with an implication from sᵢ to Cᵢ. That is,

C′ = ⋀ᵢ₌₀ⁿ⁻¹ (sᵢ → Cᵢ).

For the reformulated constraint C′ the following holds:

1. If sᵢ is set to 1, then the constraint Cᵢ is enabled.
2. If sᵢ is set to 0, then the constraint Cᵢ is disabled because Cᵢ can evaluate to 0 or 1.

Note that the usage of an implication is crucial. If an equivalence is used instead of an implication, sᵢ = 0 would imply the negation of Cᵢ.

Example 18.6 Figure 18.2(a) shows a constraint C which is over-constrained. Reformulating C to C′ avoids the over-constraining because a constraint Cᵢ may be disabled by assigning sᵢ to 0. The table in Fig. 18.2(b) gives all assignments to the sᵢ such that the reformulated overall constraint C′ evaluates to 1.¹ That is, the table shows which constraints have to be disabled to get a valid solution. For example, from the first row it can be seen that disabling C₀, C₂, C₃, and C₅ avoids the contradiction.

Based on the reformulation the verification engineer is able to avoid the over-constraining. But to understand what causes the over-constraining, i.e. to identify the reason of each contradiction, a more detailed analysis is required. Here, two properties of the assignment table obtained from the reformulated overall constraint can be exploited.

¹ Here '−' denotes a don't care, i.e. the value of sᵢ can be either 0 or 1. The table is derived from a symbolic BDD representation of all solutions for the sᵢ variables after abstraction of all other variables.
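At the level of the constraint description, the reformulation of Definition 18.5 corresponds to guarding every constraint expression with a fresh Boolean selector, since sᵢ → Cᵢ is equivalent to ¬sᵢ ∨ Cᵢ. The following sketch illustrates the idea on the constraint of Fig. 18.1; it is our own illustration, as the approach itself performs this reformulation internally on the BDD representation rather than in the SCV source:

struct cstr_dbg : public scv_constraint_base {
  scv_smart_ptr<sc_uint<64> > a, b, addr;
  scv_smart_ptr<bool> s0, s1, s2;          // one free selector per constraint
  SCV_CONSTRAINT_CTOR(cstr_dbg) {
    // s_i -> C_i encoded as "s_i == false || C_i":
    SCV_CONSTRAINT( s0() == false || a() > 100 );
    SCV_CONSTRAINT( s1() == false || b() == 0 );
    SCV_CONSTRAINT( s2() == false || (addr() >= 0 && addr() <= 0x400) );
  }
};

A solution with sᵢ = 0 then corresponds to a row of the assignment table in Fig. 18.2(b) in which Cᵢ is disabled.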

Note that for simplicity we always refer to the assignment table in the presentation. As shown later, in the implementation the assignment table need not be built explicitly.

Property 1 The value of variable sᵢ is 0 for all solutions (i.e. in each row of the table) iff the respective constraint Cᵢ is self-contradictory (that is, Cᵢ has no solution).

Proof ⇒: We show this by contraposition: If Cᵢ has at least one solution, then there is a row where sᵢ is 1. Obviously this solution (row) can be constructed by assigning 1 to sᵢ and 0 to sⱼ for j ≠ i, because (sᵢ → Cᵢ) = ¬sᵢ ∨ Cᵢ = 0 ∨ Cᵢ = Cᵢ = 1 and (sⱼ → Cⱼ) = ¬sⱼ ∨ Cⱼ = 1 ∨ Cⱼ = 1 for j ≠ i.
⇐: To satisfy C′ each element of the conjunction must evaluate to 1, so (sᵢ → Cᵢ) = ¬sᵢ ∨ Cᵢ = 1. Since Cᵢ has no solution (Cᵢ is always 0), sᵢ must be 0. □

Thus, each constraint Cᵢ whose sᵢ variable is always assigned to 0 is a reason for the contradictory overall constraint C.

Property 2 The value of variable sᵢ is don't care for all solutions (i.e. for all rows of the table) iff the constraint Cᵢ is never part of a contradiction of C.

Proof ⇒: This direction is shown by contradiction. Assume that sᵢ is don't care for all solutions and Cᵢ is part of a contradiction. Then, without loss of generality, there has to be another satisfiable constraint Cⱼ such that Cᵢ ∧ Cⱼ = 0.² If sⱼ is set to 1 and all other constraints Cₖ with k ≠ j are disabled by sₖ = 0, then C′ is 1. However, switching sᵢ to 1 is not possible due to the conflict of Cᵢ and Cⱼ. But this contradicts the assumption that the value of sᵢ is don't care for all solutions.
⇐: Because the constraint Cᵢ is never part of a contradiction, Cᵢ can be enabled or disabled. In other words, sᵢ can be set to 0 and also to 1 for each solution of the overall constraint, which is equivalent to sᵢ being don't care. □

Thus, each constraint Cᵢ whose sᵢ variable is always don't care is not part of a reason for the contradictory overall constraint. Therefore these constraints are not presented to the verification engineer and can be left out in the next steps.

Example 18.7 Consider again Example 18.6. Because the value of s₀ is 0 for all solutions, C₀ is self-contradictory. Thus, R₀ = {C₀} is a reason for C. Since the value of s₁ is always don't care, C₁ is never part of a contradiction. As a result, the first two constraints can be ignored in the further analysis.

Note that the overall constraint of the example in Fig. 18.2(a) has been specified to demonstrate the two properties. In practice, the number of constraints that are

² According to Property 1 both constraints Cᵢ and Cⱼ have at least one solution.

never part of a contradiction is considerably larger. Thus, applying Property 2 reduces the debugging effort significantly because each nonrelevant constraint does not have to be considered anymore by the verification engineer. In fact, all remaining constraints (if there are any) are part of at least one contradiction. Furthermore, since self-contradictory constraints have been filtered out by Property 1, only a conjunction of two or more constraints causes a contradiction. Now the question is, how can we identify the minimal contradicting conjunctions of the remaining constraints, i.e. the reasons?

Example 18.8 Again Example 18.6 is considered. The constraints C₀ and C₁ have been handled already according to Property 1 and Property 2. Now, the conjunction of two or more of the remaining constraints, C₂, C₃, C₄, C₅, and C₆, causes a contradiction. Only identifying the product of all these constraints certainly does not help to resolve the conflict easily. In contrast, the over-constraining can only be fixed if the different contradictions are understood. But this requires the computation of all minimal reasons according to Definition 18.4. In the example, three reasons can be found in total: R₁ = {C₂, C₄} and R₂ = {C₃, C₄}, which overlap, as well as R₃ = {C₅, C₆}, which is independent of the two before.

To find the minimal reason for each contradiction, all constraint combinations are tested for a contradiction, starting with the smallest conjunction. For each tested combination the respective sᵢ variables are set to 1. Thus, if the conjunction C_i1 ∧ ⋯ ∧ C_ik leads to a contradiction, i.e. (s_i1 = 1) ∧ ⋯ ∧ (s_ik = 1) ∧ C′ ≡ 0, then this combination is a reason for C. The minimality is ensured by building the constraint combinations in ascending order with respect to their size and skipping each superset of a previously found reason. Since the overall problem has already been simplified by exploiting Property 1 and Property 2, the combination-based procedure has to be applied only for a small set of constraints, i.e. the remaining ones. This is the key to the efficiency of the overall contradiction analysis procedure. The next section presents the details of the implementation of the overall contradiction analysis approach.

18.4 Implementation

As already mentioned earlier, the SCV library uses BDDs for the representation of constraints. More precisely, the characteristic function of the overall constraint is represented as a BDD. This characteristic function is true for all solutions of the constraint, and false otherwise. We implemented the contradiction analysis approach using the SCV library. Therefore our implementation is BDD driven.

The pseudo-code of the contradiction analysis approach is shown in Fig. 18.3. As input the approach starts with the BDD representation of the reformulated constraint C′ and the set of all constraint variables V. At first, all constraint variables are existentially quantified from the reformulated constraint (line 3). Thus, the resulting function C′′ only depends on the sᵢ variables. In other words, this function

is the symbolic representation of the assignment table described in the previous section. In general the quantified BDD is much more compact than the BDD for the reformulated constraint. Thus, the following BDD operations can be executed very fast.

Fig. 18.3 Overall algorithm

After quantification the two sets R and S are initialized to the empty set. R stores all reasons that are found. Note that for simplicity R contains the sets of the corresponding sᵢ variables of a reason, not the constraints themselves. The set S is used to save all sᵢ variables that are passed to the detailed analysis later. So this set corresponds to the remaining constraints. Then, for each constraint Cᵢ it is checked whether Cᵢ is either self-contradictory (line 9) or never part of a contradiction (line 12), according to Property 1 and Property 2. In the former case the respective sᵢ variable is added to the set of reasons R (line 11). Both checks are conducted on the quantified representation C′′ of the reformulated constraint, that is:

- To check if sᵢ is 0 for all solutions (see Property 1), the conjunction C′′ ∧ (sᵢ = 1) is carried out. If the result is the constant zero-function, sᵢ is never 1 in any solution, i.e. sᵢ is always zero. Thus, Cᵢ becomes a reason.

- The check if sᵢ is don't care in all solutions (see Property 2) is carried out by comparing the cofactors C′′|sᵢ=0 and C′′|sᵢ=1. If the respective BDDs are equal, it has been shown that sᵢ is don't care, since regardless of the value of sᵢ the solutions are identical. Therefore, the constraint Cᵢ is not relevant for a contradiction and is thus neither added to the set R nor to the set S.

If both properties cannot be applied (line 14), then the respective constraint Cᵢ is part of a contradiction caused by the conjunction of Cᵢ with one or more other constraints. Thus, Cᵢ is passed to the detailed analysis by inserting the respective sᵢ into S (line 16).

Finally, the detailed analysis for all elements in S (the remaining constraints) is performed (lines 18 to 25). First, the power set P(S) of S is created, resulting in all subsets (i.e. combinations) of constraints considered for detailed analysis. Note that we exclude from the power set the empty set as well as all sets which contain only one element (this is already covered by Property 1). Furthermore, during the construction the elements of the power set are ordered according to their cardinality. Then, for each subset X (i.e. for each combination) the conjunction of the respective constraints is tested for a contradiction. Therefore, the conjunction of the current combination X, represented as a cube of all variables sᵢ ∈ X, and C′′ is created, i.e. all respective constraints Cᵢ are enabled (line 23). If the conjunction leads to a contradiction, then X is a reason and thus X is added to R (line 25). To ensure minimality, each contradiction test of a subset X is only carried out if no reason X′ ∈ R exists such that X′ ⊆ X (lines 20 to 22), i.e. no subset of X has already been identified as a reason for a contradiction (see also Definition 18.4).

In summary, the presented contradiction analysis procedure computes all minimal reasons R of a contradictory overall constraint C. First, the proposed reformulation of the overall constraint allows a representation where all contradictory constraints can be disabled. From this representation a much more compact one is computed by quantification. All following operations have to be carried out on this representation only. Then, the two properties are applied, which significantly reduces the problem size, since only 2^(n−|Z|−|DC|) instead of all 2^n subsets have to be considered in the detailed analysis (Z denotes the set of self-contradictory constraints, and DC denotes the set of constraints which are not part of a contradiction). In practice, especially the number of nonrelevant constraints that belong to the set DC is very large, so the input for the detailed analysis shrinks considerably.

18.5 Experimental Evaluation

This section provides experimental results for the contradiction analysis. First, different types of contradictions that have been observed in practice are discussed. Then, we show the efficiency of our method using several test cases. Finally, we demonstrate the advantages of our approach in an industrial setting. We briefly discuss a constraint-based simulation environment used at AMD DDC for the verification of SoC designs. By means of a concrete example we will show how the time spent on debugging constraint contradictions is significantly reduced by our approach.
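Before turning to the examples, note that the checks of the algorithm in Fig. 18.3 map almost directly onto the operations of a BDD package. The following sketch is our own illustration using the CUDD C API (the package underlying the SCV library); all names are ours, error handling is omitted, and the BDD C′′ is assumed to be precomputed, e.g. via Cudd_bddExistAbstract:

#include "cudd.h"
#include <vector>

// Cq: the BDD of C'' (the reformulated constraint after existential
// quantification of all constraint variables), defined over the selector
// variables s[0..n-1]. Reasons are returned as bitmasks over constraint indices.
std::vector<unsigned> analyze(DdManager* m, DdNode* Cq, DdNode** s, unsigned n) {
  std::vector<unsigned> reasons, rest;                  // 'rest' is the set S
  DdNode* zero = Cudd_ReadLogicZero(m);
  for (unsigned i = 0; i < n; ++i) {
    DdNode* pos = Cudd_Cofactor(m, Cq, s[i]);           Cudd_Ref(pos);
    DdNode* neg = Cudd_Cofactor(m, Cq, Cudd_Not(s[i])); Cudd_Ref(neg);
    if (pos == zero)      reasons.push_back(1u << i);   // Property 1
    else if (pos != neg)  rest.push_back(i);            // candidate for S
    // pos == neg: Property 2, s_i is don't care -> C_i is nonrelevant
    Cudd_RecursiveDeref(m, pos);
    Cudd_RecursiveDeref(m, neg);
  }
  // Detailed analysis: subsets of 'rest' in ascending cardinality.
  unsigned r = rest.size();
  for (unsigned k = 2; k <= r; ++k)
    for (unsigned x = 1; x < (1u << r); ++x) {
      if (__builtin_popcount(x) != (int)k) continue;
      unsigned mask = 0;
      for (unsigned j = 0; j < r; ++j)
        if (x & (1u << j)) mask |= 1u << rest[j];
      bool hasSubset = false;                           // minimality check
      for (unsigned q : reasons)
        if ((mask & q) == q) { hasSubset = true; break; }
      if (hasSubset) continue;
      DdNode* f = Cq;  Cudd_Ref(f);
      for (unsigned j = 0; j < r; ++j)                  // enable the C_i in X
        if (x & (1u << j)) {
          DdNode* t = Cudd_bddAnd(m, f, s[rest[j]]);  Cudd_Ref(t);
          Cudd_RecursiveDeref(m, f);  f = t;
        }
      if (f == zero) reasons.push_back(mask);           // X is a reason
      Cudd_RecursiveDeref(m, f);
    }
  return reasons;
}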

Fig. 18.4 Types of contradictions

In all examples the partitioning of the constraints is given according to the specification in the constraint classes, i.e. each Cᵢ in the following corresponds to a separate SCV_CONSTRAINT() macro (see also Sect. 18.2). The contradiction analysis is started by an additional command line switch and runs fully automatically in the SCV library environment.

18.5.1 Types of Contradictions

We have identified different types of contradictions. In the following, the general structure is shown by means of examples (a minimal SCV-style illustration follows the list below). We assume that self-contradictory constraints as well as nonrelevant constraints have been removed. Assume k constraints are left. Then, one of the following cases is possible, which are automatically identified by our approach:

1. There is exactly one contradiction that is caused by all k constraints. Here, no other subset of the constraints forms a contradiction and thus all constraints are the only reason for the over-constraining. A simple and a more complex example are shown in Fig. 18.4(a).
2. There are at least two contradictions. This case can be refined further:
   a. Our approach determines m disjoint partitions of the constraint set. This means our approach has identified m independent contradictions. An example is given in Fig. 18.4(b). In this example, for the constraint set {C₀, C₁, C₂, C₃} the two reasons R₀ = {C₀, C₁} and R₁ = {C₂, C₃} are determined.
   b. There is at least one overlapping, i.e. at least one constraint Cᵢ is part of at least two reasons. Also here an example is given in Fig. 18.4(c). This example shows the two reasons R₀ = {C₀, C₂, C₄} and R₁ = {C₁, C₃, C₄}. Obviously C₄ is part of both reasons.³

³ The reasons are marked by brackets.
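The concrete constraints behind Fig. 18.4 are not reproduced here, so the following fragment is a hypothetical SCV-style illustration of the two refined cases (all names and bounds are ours): constraints C0 to C3 form two independent reasons as in case 2a, while C6 takes part in two overlapping reasons as in case 2b.

struct contra : public scv_constraint_base {
  scv_smart_ptr<sc_uint<8> > x, y, z;
  SCV_CONSTRAINT_CTOR(contra) {
    // Case 2(a): two independent contradictions.
    SCV_CONSTRAINT( x() < 10 );    // C0  reason R0 = {C0, C1}
    SCV_CONSTRAINT( x() > 20 );    // C1
    SCV_CONSTRAINT( y() == 0 );    // C2  reason R1 = {C2, C3}
    SCV_CONSTRAINT( y() > 5 );     // C3
    // Case 2(b): one constraint taking part in two reasons.
    SCV_CONSTRAINT( z() > 100 );   // C4  reason R2 = {C4, C6}
    SCV_CONSTRAINT( z() > 150 );   // C5  reason R3 = {C5, C6}
    SCV_CONSTRAINT( z() < 50 );    // C6  part of both R2 and R3
  }
};

Note that C4 and C5 are mutually satisfiable, so {C4, C5} is not a reason; each of them only contradicts C6, which makes C6 the overlapping constraint.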

Our proposed approach is able to identify the minimal reason for all these types of contradictions.

18.5.2 Effect of Property 1 and Property 2

Applying the two properties introduced in Sect. 18.3.2 significantly reduces the complexity of the contradiction analysis, since each matched constraint can be excluded from further consideration. To show the gain in efficiency, we tested our approach on several examples which contain some typical over-constraining errors (e.g. typos, contradicting implications, hierarchical contradictions, etc.). For the considered constraints we give some statistics in Table 18.1. In the first column a number to identify the test case is given. Then, in the next columns, information on the number of constraint variables and their respective sizes (i.e. number of bits) is provided. Finally, the total number of constraints is given.

Table 18.1 Constraint characteristics (columns: #, BOOL, INT, LONG, BITS, CONSTR. (n))

Table 18.2 Effect of using properties (columns: #, n, |R|, BDD TIME; without properties: 2^n, # checks, TIME; with properties: |Z|, |DC|, 2^(n−|Z|−|DC|), # checks, TIME; worst-case check counts include 32,768, 65,536 and 67,108,864, with several runs hitting the timeout, TO)

The results after application of our contradiction analysis are shown in Table 18.2. The first four columns give some information about the test case, i.e. the number of constraints in total (n), the number of contradictions/reasons (|R|), and the runtime in CPU seconds needed to construct the BDD in the SCV library (BDD TIME). The next columns provide the results for the trivial analysis approach without (W/O PROPERTIES) and with the application of the properties (WITH PROPERTIES), respectively. Here, the number of checks in the worst case (2^n or 2^(n−|Z|−|DC|), respectively), the number of checks actually executed by the approach (#), and the runtime for the

detailed analysis (TIME) are given. Additionally, the number of nonrelevant constraints (|DC|) and of self-contradictory constraints (|Z|) obtained by the two properties is provided.

The results clearly show that identifying all reasons without applying the properties leads to a large number of checks in the worst case (e.g. 2^26 = 67,108,864 in example #5). In contrast, when the properties are applied, most of the constraints can be excluded from the analysis since they are nonrelevant. This significantly reduces the number of checks to be performed in the detailed analysis. Instead of all 2^n, only 2^(n−|Z|−|DC|) checks are needed in the worst case (only 64 in example #5). As a result, the runtime of the detailed analysis is orders of magnitude faster when the properties are applied. Moreover, for the last three test cases the reasons can be determined within the timeout of 7200 CPU seconds only when the properties are applied.

18.5.3 Real-Life Example

The constraint contradiction analysis algorithm has been evaluated using a real-life design example. The corresponding verification environment is depicted in Fig. 18.5. The Design Under Verification (DUV) is a PCIe root complex design with an AMD-proprietary host bus interface which is employed in an SoC recently developed by AMD. The root complex supports a number of PCIe links. The verification tasks are to show (1) that transactions are routed correctly from the host bus to one of the PCIe links and vice versa, (2) that the PCIe protocol is not violated, and (3) that no deadlocks occur when multiple PCIe links communicate with the host bus at the same time. Host bus and PCIe links (only one is depicted in Fig. 18.5) are driven by Bus Functional Models (BFMs) which convert abstract bus transactions into the detailed signal wigglings on those buses. The abstract bus transactions are generated by means of random generators (denoted by G) which are in turn controlled by constraints. Bus monitors observe the transactions sent into or from either interface and send them to checkers which perform the end-to-end transaction checking of the DUV. The verification environment is implemented in SystemC 2.1, the SCV library, and SystemVerilog, with a special co-simulation interface synchronizing the SystemVerilog and SystemC simulation kernels. The constraint-random verification methodology was chosen in order to both reduce the effort in stimulus pattern development and to achieve high coverage of stimulation corner cases. The PCIe and host bus protocol rules were captured in SCV constraint descriptions and are used to generate the contents of the abstract bus transactions driving the BFMs.

The PCIe constraint used to control stimulus generation within the PCIe transaction generator is a layered constraint. The lower level layer describes generic PCIe protocol rules and comprises 16 constraint terms. They are shown

in Fig. 18.6(a) (denoted C₀ to C₁₅).⁴ The meaning of the constraint variables is given in Table 18.3. The upper level layer imposes user-specific constraints on the generic PCIe constraints (denoted by C_Ui) in order to generate specific stimulus scenarios. Generic PCIe constraints and user-defined constraints are usually developed by different verification engineers; the former by the designer of the test environment and the latter by the engineer who implements and runs the tests.

Fig. 18.5 Architecture for verification

⁴ Bit operators are used as introduced in [6].
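The layering mentioned above is realized with plain C++ inheritance, as introduced in Sect. 18.2: the generic protocol rules live in a base constraint class, and each test derives from it and adds its C_Ui terms. The following sketch is hypothetical (class and variable names are ours, not those of the actual AMD environment), assuming SCV's constraint inheritance mechanism:

// Lower layer: generic PCIe protocol rules (C0 ... C15 in Fig. 18.6(a)).
struct pcie_generic : public scv_constraint_base {
  scv_smart_ptr<sc_uint<64> > addr;
  scv_smart_ptr<sc_uint<8> >  length;           // transaction size in dwords
  SCV_CONSTRAINT_CTOR(pcie_generic) {
    SCV_CONSTRAINT( length() <= 32 );           // e.g. C13: at most 128 bytes
    // ... further generic protocol rules ...
  }
};

// Upper layer: user-specific constraints C_Ui added by inheritance.
struct my_test : public pcie_generic {
  SCV_CONSTRAINT_CTOR(my_test) {
    SCV_CONSTRAINT( addr() == 4000 );           // C_U2
    SCV_CONSTRAINT( length() == 25 );           // C_U3: 100 bytes = 25 dwords
  }
};

Both layers contribute to the conjunction of Definition 18.2; a solver failure can therefore be caused by any mix of generic and user-defined terms, which is exactly the debugging situation addressed by the contradiction analysis.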

Fig. 18.6 PCIe transaction generator constraint with examples

The engineer writing the tests, and hence the user-specific constraints which are layered on top of the generic PCIe constraints, is faced with the problem of resolving contradictions which are generated by imposing the user-defined constraints on the generic PCIe constraints. Given the complexity of the constraints, this is usually a non-trivial task. Two real-life examples of contradictions that are not easy to resolve by manual constraint inspection are depicted in Fig. 18.6(b). In the first example the user sets the maximum transaction length to a value greater than 128 bytes (C_U1), thereby causing a contradiction with constraint C₁₃, which states that the total transaction length must not exceed 128 bytes. In the second example, the user independently constrains the transaction address to byte

address 4000 (C_U2) and the transaction length to 100 bytes (C_U3). While both values, viewed independently, are each perfectly legal (the address should be in the 32-bit range and the transaction length is less than 128 bytes), an over-constraining occurs. The reason identified by our approach is R₁ = {C₁₂, C_U2, C_U3}. By manual constraint inspection it is not immediately obvious that a PCIe protocol rule is violated when combining constraints C_U2 and C_U3. However, reason R₁ found for the contradiction by our algorithm shows that when combining constraints C_U2 and C_U3, PCIe protocol rule C₁₂ is violated: a transaction must not cross a 4K page boundary. Our user constraints of a transaction start address set to 4000 and a transaction length of 100 bytes would result in addresses that cross a 4K page and therefore violate this constraint.

The algorithm described in this chapter is able to identify exactly the violating constraint expressions for both examples in about 30 seconds. The PCIe constraint to be analyzed contained a total of 21 random variables to be solved, which are constrained by 17 and 18 constraint expressions for the respective examples. The total bit count for the random variables amounted to 781 bits. Without such an analysis capability, we would have had to spend several hours on manual constraint inspection in order to identify the root cause of the constraint contradiction. Thus, a significant speed-up of the contradiction debug cycle was achieved.

Table 18.3 Definition of random variables used in the PCIe constraint

VARIABLE NAME | DESCRIPTION
addr | transaction address (64 bits)
addr_space | transaction address space (memory, io, config)
tkind | transaction kind (request, response)
cmd | transaction command (read, write)
msr | transaction is targeted at MSR space
posted | transaction is posted (yes/no)
length | transaction size in dwords
be[] | array of byte enables (one per dword of data)
data[] | array of dword (32-bit) data
be[].len | length of byte enable array
data[].len | length of data array
[io|mem|cfg]_addr_base0,1 | io, memory and config space window base addresses
[io|mem|cfg]_size0,1 | io, memory and config space window sizes

18.6 Conclusions

In this chapter we have presented a fully automatic approach to analyze contradictory constraints that occur in constraint-based random simulation. After reformulating the overall constraint and building an abstraction, the self-contradictory

constraints and all nonrelevant constraints are determined in an initial step. Then, for the small set of remaining constraints, all minimal reasons for a contradiction are computed efficiently and presented to the verification engineer. The minimality and completeness of the reasons allow the over-constraining to be fully understood. Thus, the verification engineer is able to resolve the conflict in one single step. In total, as shown by industrial experiments, the debugging time is reduced significantly.

References

1. R.R. Bakker, F. Dikker, F. Tempelman, and P.M. Wognum. Diagnosing and solving over-determined constraint satisfaction problems. In International Joint Conference on Artificial Intelligence, 1993.
2. J. Bergeron. Writing Testbenches Using SystemVerilog. Springer, Berlin, 2006.
3. R.E. Bryant. Graph-based algorithms for Boolean function manipulation. IEEE Trans. on Comp., 35(8):677–691, 1986.
4. R. Dechter, K. Kask, E. Bin, and R. Emek. Generating random solutions for constraint satisfaction problems. In Eighteenth National Conference on Artificial Intelligence, pages 15–21, 2002.
5. E. Goldberg and Y. Novikov. Verification of proofs of unsatisfiability for CNF formulas. In Design, Automation and Test in Europe, 2003.
6. D. Große, R. Ebendt, and R. Drechsler. Improvements for constraint solving in the SystemC verification library. In ACM Great Lakes Symposium on VLSI.
7. J. Huang. MUP: A minimal unsatisfiability prover. In ASP Design Automation Conf., 2005.
8. IEEE Std 1666. IEEE Standard SystemC Language Reference Manual.
9. IEEE Std 1800. IEEE SystemVerilog.
10. C.N. Ip and S. Swan. A tutorial introduction on the new SystemC verification standard. White paper.
11. M.A. Iyer. RACE: A word-level ATPG-based constraints solver system for smart random simulation. In Int'l Test Conf., 2003.
12. N. Kitchen and A. Kuehlmann. Stimulus generation for constrained random simulation. In Int'l Conf. on CAD, 2007.
13. M.H. Liffiton and K.A. Sakallah. On finding all minimally unsatisfiable subformulas. In Theory and Applications of Satisfiability Testing, 2005.
14. M.N. Mneimneh, I. Lynce, Z.S. Andraus, J.P. Marques-Silva, and K.A. Sakallah. A branch and bound algorithm for extracting smallest minimal unsatisfiable formulas. In Theory and Applications of Satisfiability Testing, 2005.
15. Y. Oh, M. Mneimneh, Z. Andraus, K. Sakallah, and I. Markov. AMUSE: A minimally-unsatisfiable subformula extractor. In Design Automation Conf., 2004.
16. T. Petit, J.-C. Régin, and C. Bessière. Specific filtering algorithms for over-constrained problems. In International Conference on Principles and Practice of Constraint Programming, 2001.
17. J. Rose and S. Swan. SCV randomization.
18. F. Somenzi. CUDD: CU Decision Diagram Package. University of Colorado at Boulder, Boulder.
19. Synopsys Inc., CoWare Inc., and Frontier Design Inc. Functional Specification for SystemC.
20. SystemC Verification Working Group. SystemC Verification Standard Specification, Version 1.0e.
21. J. Yuan, A. Aziz, C. Pixley, and K. Albin. Simplifying Boolean constraint solving for random simulation-vector generation. IEEE Trans. on CAD of Integrated Circuits and Systems, 23(3), 2004.

22. J. Yuan, C. Pixley, and A. Aziz. Constraint-Based Verification. Springer, Berlin, 2006.
23. J. Yuan, K. Shultz, C. Pixley, H. Miller, and A. Aziz. Modeling design constraints and biasing in simulation using BDDs. In Int'l Conf. on CAD, 1999.
24. L. Zhang and S. Malik. Validating SAT solvers using an independent resolution-based checker: Practical implementations and other applications. In Design, Automation and Test in Europe, 2003.

Chapter 19
Design of Communication Infrastructures for Reconfigurable Systems

Alessandro Meroni, Vincenzo Rana, Marco D. Santambrogio and Francesco Bruschi

Abstract Dynamic reconfiguration capabilities of FPGA devices are commonly exploited in order to perform changes in a system with respect to its computational elements. This work proposes a framework able to exploit different levels of simulation in order to perform a requirements-driven design of the communication infrastructure of a reconfigurable system, so that the overall performance can be improved. To accomplish this requirements-driven design it is necessary to perform a design space exploration of the applications and scenarios in which a particular system can be used. A new scenario-centric approach is proposed in order to identify the metrics and requirements needed to apply a communication infrastructure reconfiguration.

Keywords Communication infrastructures · Dynamic reconfiguration · FPGA

F. Bruschi (✉) Dipartimento di Elettronica e Informazione, Politecnico di Milano, Via Ponzio 34/5, Milano, Italy
e-mail: bruschi@elet.polimi.it

19.1 Introduction

With the increasing diffusion of partially and dynamically reconfigurable FPGAs [7, 19], designers have new possibilities to design at system level. One of the main possibilities offered by FPGAs is to change the functionality implemented on the physical device as many times as desired; exploiting this feature allows different approaches to be followed, in order to fit different scenarios. For instance, thanks to partial dynamic reconfiguration, it is possible to change the behavior of given areas of the chip while the system is up and running, without the need to stop its execution. In general, an architecture is characterized by its computational components and its communication infrastructure. With run-time partial reconfiguration of the device it is possible to dynamically change both these components, increasing the flexibility and the performance of the developed system. Nowadays, design space exploration has mainly focused on the computational aspect of the system-on-chip (SoC) rather than on the communication one, but as the number of components increases a communication-based analysis is important to evaluate the scalability and

the performance of the overall system; for these reasons, the proposed work is focused on both the design and the dynamic reconfiguration of the communication layer.

As described in [18], partial reconfiguration can lead to a novel view of designs and applications, giving designers new degrees of freedom. This particular ability provides more than one advantage at design time, such as the possibility to change the communication infrastructure implementation, adapt hardware algorithms, increase resource utilization, upgrade hardware remotely, change the communication infrastructure at run-time and provide continuous hardware servicing.

With respect to the different communication architectures available nowadays, it is important to state that, since the introduction of the SoC model, ad hoc mixes of buses and point-to-point solutions have characterized the communication structures. However, due to their intrinsic limitations, a common concept for segmented SoC communication structures, based on networks, arose: the network-on-chip (NoC) [3]. This new network-based approach has recently characterized the design of many embedded systems, which proves that having a scalable and reliable system is a real need. A communication-based analysis can be made by evaluating parameters such as area usage or power consumption. In order to perform a good analysis of the communication infrastructure layer, it is also necessary to take into account other metrics intrinsic to communication, such as throughput and latency [12].

The novelty introduced by the proposed work lies in:

- the introduction of a new point of view for the design of an embedded system, applying a scenario-centric approach that simplifies the architecture definition and that allows the designer to focus more, during system design, on the application to be implemented and on the requirements;
- the use of the metrics related to the particular application scenario, in order to help the designer in the definition of the best fitting communication infrastructure for the system that has to be developed;
- the possibility to apply partial dynamic reconfiguration in order to change, in several ways, the current system communication architecture, for instance changing only bottleneck links or completely replacing the infrastructure with one more suitable with respect to the user-specified constraints;
- the implementation of a simulation framework that starts with a High Level Description (HLD) of the target architecture and performs three other important phases before the actual VHDL implementation: a High Level Network Simulation (HLNS), a solutions Evaluation and Selection (E&S) and a SystemC Verification and Validation (V&V) process.

This chapter is organized as follows: related works in the communication infrastructures research field are presented in Sect. 19.2, then a real world applications analysis is made in Sect. 19.3, while in Sect. 19.4 the proposed solution is discussed and, finally, results and conclusions are drawn in Sects. 19.5 and 19.6.

19.2 Related Works

This section aims at presenting recent works that deal with the development of tools and frameworks for the automatic generation and verification of NoCs. Recently, Ost et al. [17] presented MAIA, a framework useful to generate and evaluate NoCs with varying architectural parameters. It is composed of three main steps: NoC specification and generation, traffic generation, and traffic analysis. The supported topologies are mesh, torus and ring. Also Wolkotte et al. [22] performed a study on the possibility of simulating networks-on-chip. They considered the possibility of using three different methods: VHDL, SystemC and FPGA simulations. Their work gives a good amount of simulation results, proving that their approach is very useful for parallel systems. Instead of performing simulation on FPGAs, Fen et al. [8] performed accurate simulations using the OPNET network simulator. They analyzed only latency and throughput of different NoCs such as 2D mesh, Fat-Tree and Butterfly Fat-Tree, using two switching techniques (wormhole and virtual cut-through), discovering that the Fat-Tree topology with the wormhole switching technique can be considered a good solution for NoC designs. Even if other works, such as NoCSim [15] and NoCGEN [4], have been evaluated, none of them introduced two very important concepts (exception made for our recently presented flow, on which this work is based [13]): (i) the analysis of communication infrastructures and not just different network-on-chip topologies, so that bus-based and point-to-point solutions can also be considered; (ii) the union of communication infrastructure exploration/generation/validation with the partial dynamic reconfiguration paradigm, which can be seen as the main novelty introduced by the proposed framework.

19.3 Real World Applications Analysis

The aim of this section is the introduction of a deep analysis of the real world scenarios and applications where the proposed methodology for the design of reconfigurable communication infrastructures can be successfully applied. In order to show how the metrics involved in the selection of the best fitting communication infrastructure have been chosen, the layered approach of Fig. 19.1, used to perform this real world applications analysis, is presented. First, a set of different common applications has been studied; this set has been called the Applications layer. Next, some scenarios that can be contextualized in the previously discovered applications have been identified, defining the Scenarios layer. We connect, using a unique arrow, each application with all the meaningful scenarios that can be identified in that particular instance, so that a singular case-study can be contextualized. After this linking, for each scenario in the Scenarios layer, a set of different characteristics has been profiled that best describes the peculiarities of each scenario, identifying the Characteristics layer. This leads to the identification of the Metrics layer, where

Fig. 19.1 Real world applications analysis. Layer classification: A Applications layer, B Scenarios layer, C Characteristics layer and D Metrics layer

This, in turn, leads to the identification of the Metrics layer, where the main pieces of information that can classify each scenario, based on its characteristics, are listed. All these layers compose the general classification reported in Fig. 19.1; it is a qualitative classification that does not cover all possible applications or scenarios, and not all metrics are listed. The figure illustrates the scenario-centric approach stated in the Introduction (Sect. 19.1), which can lead to a better identification of the system requirements so as to obtain a requirements-driven reconfigurable SoC. In the following, each of the identified layers is described, in order to better explain each step of this approach.

19.3.1 Applications Layer

This first layer, Fig. 19.1A, groups different real applications that may exploit the reconfiguration capability of FPGA devices. These applications may differ from one another, but they share the common need for hardware reconfiguration capability; they range from the automotive field to financial analysis.

Such a classification is also available in other works, e.g. the Xilinx Market Solutions [23]. Some of the identified applications are: Automotive [2], Robotics [9, 11, 24], Biomedical [20], Financial Analysis [10] and Sensor Networks [1]; these are just examples proposed to show which applications can be considered within this analysis. Once the Applications layer has been defined, an exploration of possible scenarios that are meaningful for each identified application has been carried out.

19.3.2 Scenarios Layer

As stated before, each scenario can be contextualized in one or more applications, giving a real case study to analyze. The connection between an application and a scenario represents a real instance in which hardware reconfiguration of the system can be applied, possibly increasing the performance of the system. A classification of this kind aims at better identifying the real scenarios in which the proposed methodology can be applied, so as to develop a set of metrics and specific aids oriented towards the automatic realization of the communication infrastructure that best fits the application and scenario requirements. Figure 19.1B shows different situations that can be found in the embedded systems industry, such as the possibility of having more than one scenario connected to more than one application. A good example is given by the Automotive and Robotics application fields: some of the scenarios identified for the former, such as adaptive cruise control, sensor acquisition, image processing and edge detection, are also connected to the latter. This can lead to the identification of shared scenarios that can be contextualized in different applications, yielding different requirements and constraints to cope with. The identification of shared scenarios can therefore lead to the definition of common rules, i.e. specific metrics applicable to the selection of communication infrastructures, and can also increase and enhance the reusability of our methodology, highlighting its flexibility in adapting to different applications and scenarios. The combination of a particular scenario mapped onto a specific application generates a precise system case study and leads to the identification of some interesting peculiarities that better characterize the needs of this system. This set of characteristics defines the next layer of our classification.

19.3.3 Characteristics Layer

As previously stated, this layer is filled with a list of possible characteristics representing each scenario contextualized in each application. Basically, a uniquely identified set (we adopted both a colored and a numbered notation, the former for simpler and faster comprehension and the latter to cope with black-and-white prints) is created, containing all the valuable features for each application/scenario mapping, chosen from a pool of common characteristics required by a real embedded system.

An example is shown in Fig. 19.1C, where the sensor acquisition scenario of an automotive application is required to be precise and immediate; these two qualitative features mean that the values acquired through sensors in an automotive application must be obtained very carefully and as soon as possible, so as to satisfy the application requirements. Another real case study is represented by the edge detection module of a robot, which must accomplish its functionality under different requirements, such as high throughput, low area usage and low power consumption, in order to keep the overall system performance high without affecting the other functional modules that may be plugged onto the robot. In this particular situation, the reconfiguration of the communication infrastructure can lead to an increase in performance, acting according to the objectives and the jobs performed by the robot, for instance keeping the overall power consumption below specific thresholds while still providing high throughput. These qualitative characteristics are then translated into metrics, so that a requirements-driven analysis can be performed based on their evaluation.

19.3.4 Metrics Layer

The Metrics layer is the most interesting one, as it outputs the actual metric characterization of a system. In this layer, a qualitative evaluation of the different metrics, based on the requirements of the considered system, is performed. Figure 19.1D shows a table reporting, for each evaluated metric, its relative estimation with respect to the considered case study, graphically represented by the colored crosses. To give a clear idea of how a metric of a particular system is evaluated, in the metrics table of Fig. 19.1D we propose a very simple classification characterized by only three levels of estimation for each metric: high, medium and low. The evaluated metrics form the core of the methodology on which our previous framework [13] is based; in fact, it performs a requirements-driven reconfiguration based on these metric values. Using this flow, the designer only has to provide as input the scenario context and the application requirements of the system; the framework then produces as output a ranking of the best fitting communication infrastructures. Moreover, it is possible to provide further system information that can support the infrastructure selection, such as constraints and specific structural or topological characteristics of the system, for instance the number of nodes or the numbers of masters and slaves.
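To make this kind of requirements-driven ranking concrete, the following minimal C++ sketch shows how qualitative metric levels could be matched against scenario requirements; it is only an illustration of the idea, and all names, the metric set and the scoring rule are our own assumptions, not part of the actual framework.

#include <array>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

// Qualitative estimation levels used in the Metrics layer (cf. Fig. 19.1D).
enum class Level { Low = 0, Medium = 1, High = 2 };

// Hypothetical metric vector: throughput, latency, area usage.
struct MetricProfile {
    std::string name;               // candidate communication infrastructure
    std::array<Level, 3> estimate;  // qualitative estimation per metric
};

// Distance between a candidate profile and the scenario requirements:
// the smaller the score, the better the fit (an assumed scoring rule).
int score(const MetricProfile& c, const std::array<Level, 3>& req) {
    int s = 0;
    for (std::size_t i = 0; i < c.estimate.size(); ++i)
        s += std::abs(static_cast<int>(c.estimate[i]) -
                      static_cast<int>(req[i]));
    return s;
}

int main() {
    // Requirements derived from the Characteristics layer, e.g. a robotic
    // edge-detection scenario: high throughput, low latency, low area usage.
    const std::array<Level, 3> req = {Level::High, Level::Low, Level::Low};

    const std::vector<MetricProfile> candidates = {
        {"bus",        {Level::Low,  Level::High,   Level::Low}},
        {"mesh",       {Level::High, Level::Medium, Level::Medium}},
        {"custom NoC", {Level::High, Level::Low,    Level::Medium}},
    };
    for (const auto& c : candidates)
        std::cout << c.name << ": score " << score(c, req) << '\n';
    return 0;
}

With the invented values above, the custom NoC obtains the lowest (best) score, mirroring the qualitative reasoning of the edge detection example.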

Fig. 19.2 Schema of the proposed simulation framework

19.4 The Proposed Solution

We propose a simulation framework that aims at identifying the most suitable communication infrastructure, or a set of such infrastructures, starting from the designer's specifications of the system. The ideal goal of this work is to let designers focus their attention on the functional characteristics of the target system rather than on the implementation details of the communication infrastructure. The system specifications can vary along with the scenario, but the information provided should nevertheless be sufficient to produce a high level description of the system. The designer can provide more than one specification, each one describing a different functionality of the same scenario, or belonging to another scenario in which the designer is interested. For instance, it is possible to provide two different system specifications of a robotic scenario, e.g. for the edge detection and image processing functionalities, or the same two specifications applied in two different scenarios, such as robotics and automotive.

Two different kinds of high level specification can be provided. The first one is completely custom: the designer provides all the information describing the target system. The other one is a scenario-based specification, in which the designer provides only the main system characteristics, such as the number of system elements and their classification (distinction between master and slave cores), the system constraints (e.g. area or timing constraints), and the communication schema of all the system elements (interconnections among elements). Once all the necessary specifications have been identified, the designer can start using the proposed framework, which leads to the automatic identification of a single static communication architecture that implements all the functionalities provided with the specifications (and that represents the best trade-off among all the identified infrastructures). If the static communication infrastructure does not respect all the specified constraints, or if the results are not a suitable solution, it is also possible to obtain a set of differential communication infrastructures. This last output represents the evolutionary feature of the target system, in which partial dynamic reconfiguration can yield very good results in terms of reconfiguration time. This concept is explained further with the description of the flow phases.

The proposed framework consists of four phases, as shown in Fig. 19.2:
- HLD: the High Level Description is the first phase of our framework. It allows the designer to create a high level visual representation of the system and to set all the previously identified specifications. The output is one or more XML files describing the system with all its characteristics;
- HLNS: the High Level Network Simulation phase is performed with a well-known network simulator [21]. Here, communication infrastructures that differ in topology and/or parameters are simulated. The simulator has been configured to read the XML files provided by the HLD phase and, at the end, it provides simulation results that are evaluated in the next phase;
- E&S: in the Evaluation and Selection phase, the best fitting communication infrastructure is automatically selected according to its metric results. Furthermore, the designer has the possibility to inspect the simulation results and to manually choose a different solution, based on his or her experience;
- V&V: the Verification and Validation phase takes as input the XML description of the selected communication infrastructures and performs a SystemC [16] simulation that validates the consistency of the system over the adopted communication infrastructure. Once this phase returns positive results, it is straightforward to create the actual VHDL implementation of the communication infrastructure, as described in the Verification and Validation paragraph.

Next, each phase is discussed in more detail.

19.4.1 High Level Description

The high level description phase is implemented as a GUI that enables the designer to create a high level design of the target system, making it possible to add master/slave elements and interconnections to the model, and also to set constraints (such as throughput and latency between elements). Given the specifications, the designer starts from an empty model that must be filled with the system elements, which can be of two types (master or slave), and with the interconnections among the elements; this is done simply by connecting a master element with a slave one through a line. Finally, all the constraints provided with the specifications must be set. For instance, the specifications can provide information about the maximum throughput supported by a connection between two elements; at present, the GUI permits setting constraints on throughput, area and latency (consistent with those evaluated in the HLNS phase) for all the elements and interconnections of the system. After the construction of the design model, the GUI generates one or more XML files that represent the description of the target system; more than one XML file is generated if the designer has created more than one specification. It is also possible to write these XML files manually and, thanks to the common XML standard, other programs can generate these description files.
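As an illustration of what such a description file could contain, the following small C++ program writes a minimal system description; the chapter does not define the actual XML schema, so every element and attribute name below is invented for the sake of the example.

#include <fstream>

int main() {
    std::ofstream xml("system.xml");
    // Two elements, one interconnection, per-link and global constraints
    // (hypothetical schema; units would have to match the HLNS phase).
    xml << R"(<system name="edge_detection">
  <element id="cpu0" type="master"/>
  <element id="mem0" type="slave"/>
  <connection from="cpu0" to="mem0">
    <constraint metric="throughput" min="200"/>
    <constraint metric="latency" max="30"/>
  </connection>
  <constraint metric="area" max="5000"/>
</system>
)";
    return 0;
}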

During this high level phase, the designer should also perform a scenario-based classification that aims at better identifying the set of metrics and specific aids oriented towards the automatic realization of the communication infrastructure that best fits the target system. This classification can lead to the analysis of different situations; for instance, it is possible to have more than one scenario contextualized for a single system. An example can be found in an application that has both Automotive and Robotics peculiarities: there are different functionalities that must be implemented (e.g. adaptive cruise control, sensor acquisition, image processing, edge detection), each one corresponding to a different scenario or to the same one. Depending on this analysis and on the specifications, the designer should build the model of his or her target system, using, if necessary, more than one XML description file.

19.4.2 High Level Network Simulation

The second phase of the framework is based on a well-known network simulator [21], which has been used to perform high level network simulations. This tool reads the XML files generated in the previous phase and then builds different communication infrastructures based on the information obtained. Some of the most common kinds of communication infrastructures used in the embedded systems research field have been selected for this simulation phase: point-to-point fully connected, bus-based, square-mesh, star and spidergon [5, 14, 18]. Moreover, a custom network-on-chip, based on the composition of mesh structures with some direct connections among switches, is also created. The generation of this custom NoC has as its primary objective the minimization of the distance (measured as the number of hops) among the master/slave elements, and as a secondary objective the maximization of throughput (given the XML specifications, it is possible to identify a number of potential bottlenecks introduced by shared links, which can be avoided by inserting direct dedicated connections between switches).

Simulation models have been used to collect information on the following metrics: delivery rate, loss rate, throughput, area usage and latency. We adopted the common wormhole switching paradigm [6], with packet-based communications among system elements and acknowledgments. We defined different intensity levels of packet generation for each master of the system (using the network simulator primitives), so that the communication infrastructures could be simulated under different traffic patterns. To better understand the analyzed metrics, the following packet type definitions are necessary:
- completed: a packet that has arrived at its destination and whose acknowledgment has returned to the source master;
- processed: a packet that has been sent from the source master to its destination, but whose acknowledgment has not yet arrived back at the source master;
- dropped: a packet that cannot be handled by the communication architecture (for instance, when the bus is busy and cannot handle another message request).
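Anticipating the delivery rate and loss rate definitions given right below, a minimal sketch of the per-node bookkeeping that such a simulation performs could look as follows; the counter structure and the numbers are invented, since the real framework relies on the network simulator's primitives.

#include <iostream>

// Hypothetical per-node counters following the packet classification above.
struct PacketStats {
    unsigned completed = 0;  // arrived and acknowledged at the source master
    unsigned processed = 0;  // sent towards the destination
    unsigned dropped   = 0;  // rejected by the communication architecture

    // Delivery rate: completed packets over processed ones (per IP core).
    double deliveryRate() const {
        return processed ? static_cast<double>(completed) / processed : 0.0;
    }
    // Loss rate: dropped packets over processed ones (per switch/bus node).
    double lossRate() const {
        return processed ? static_cast<double>(dropped) / processed : 0.0;
    }
};

int main() {
    PacketStats busNode{880, 1000, 95};  // invented numbers, for illustration
    std::cout << "delivery rate: " << busNode.deliveryRate() << '\n'
              << "loss rate:     " << busNode.lossRate() << '\n';
    return 0;
}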

The evaluated metrics have been defined as follows, using the packet classification described before:
- Delivery rate: the number of completed packets over the processed ones for each IP core (except the switch nodes in the network-on-chip architectures).
- Loss rate: the ratio between the dropped packets and the processed ones, for each switch node, bus, or central node (for the star topology).
- Throughput: defined in packets per second. In particular, a conservative formula has been used: the total number of completely routed packets (of each IP core) over the total simulation time (50,000 simulated seconds, performed in about 2 to 5 actual minutes, depending on the packet generation intensity level).
- Latency: evaluated in two different ways. The first evaluation considers the overall system with almost no load (the time taken by the complete transmission of a single packet), while the second one considers a system under full load, i.e. with more than one packet traveling through the communication architecture:
  - with no load: for the latency introduced by the point-to-point and bus architectures, two constant values have been used, 1 and 22 clock cycles respectively; the network-on-chip infrastructures consider a packing/unpacking delay of 8 cycles plus 1 cycle for each traversed switch;
  - with full load: this second evaluation considers the ratio between the average latency of one completed packet and the total number of completed packets.
- Area usage: up to now, it has been evaluated using the formulae in Table 19.1, which compute the links complexity of the system.

Table 19.1 Links complexity computation formulae

  Communication infrastructure | Total number of elements | Links complexity
  PtP (fully connected)        | n                        | m * s
  Bus-based                    | n + 1                    | n
  Square-mesh                  | n + x                    | n + 2(x - √x)
  Star                         | n + 1                    | n
  Custom NoC                   | n + x                    | [n + 2(x - √x)] + (D - B)

The links complexity has been evaluated by considering the number of links in the system, using the following conventions:
- m: number of masters in the system;
- s: number of slaves in the system;
- n: number of core elements in the system (n = m + s);
- x: number of switches in the system;
- D: number of direct dedicated links added to the custom NoC;
- B: number of bottleneck links removed from the custom NoC.

The point-to-point links complexity has been computed by considering only connections between masters and slaves, excluding inter-master and inter-slave connections. In the bus-based architecture, the links have been approximated to the number of interconnections between each core and the bus, without taking into account the connections inside the bus; the same has been done for the switches. For the custom NoC, we considered both the links added to a system with a mesh infrastructure and those removed because they were considered bottleneck connections.
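The following sketch turns the formulae of Table 19.1 and the no-load latency constants given above into executable form; the square-mesh expression assumes that x is a perfect square, and all function names are ours, not the framework's.

#include <cmath>
#include <iostream>
#include <string>

// Links complexity per Table 19.1 (m masters, s slaves, x switches,
// D direct links added and B bottleneck links removed in the custom NoC).
int ptp(int m, int s)  { return m * s; }   // fully connected master/slave
int bus(int m, int s)  { return m + s; }   // n
int star(int m, int s) { return m + s; }   // n
int squareMesh(int m, int s, int x) {      // n + 2(x - sqrt(x))
    int side = static_cast<int>(std::lround(std::sqrt(x)));
    return (m + s) + 2 * (x - side);
}
int customNoC(int m, int s, int x, int D, int B) {
    return squareMesh(m, s, x) + (D - B);
}

// No-load latency in clock cycles, as defined above: constants for PtP and
// bus, packing/unpacking (8 cycles) plus one cycle per traversed switch.
int noLoadLatency(const std::string& kind, int hops = 0) {
    if (kind == "ptp") return 1;
    if (kind == "bus") return 22;
    return 8 + hops;  // mesh / custom NoC
}

int main() {
    // 8 masters and 8 slaves on a 4x4 mesh, as in Sect. 19.5;
    // the values of D and B here are invented.
    std::cout << "PtP links:   " << ptp(8, 8) << '\n'
              << "mesh links:  " << squareMesh(8, 8, 16) << '\n'
              << "custom NoC:  " << customNoC(8, 8, 16, 3, 2) << '\n'
              << "NoC latency: " << noLoadLatency("noc", 4) << " cycles\n";
    return 0;
}

For 16 core elements, the formulae give 64 links for the fully connected point-to-point solution against 40 for the plain mesh, consistent with the trend shown later in Fig. 19.4.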

The latency metric has been defined in this way to give more detailed information about the delay trend of each analyzed communication infrastructure. Moreover, it is important to note that the simulation models are characterized by parameters, such as transmission and packing/unpacking delays, that have been calibrated against implemented solutions exploiting both the bus and the network-on-chip architectures. Furthermore, some of the results obtained through simulation have been compared with actual implementations in order to obtain a more realistic and accurate trend description. As analyzed in more detail in the results section, several runs of the simulation models have been performed; considering a scenario in which the designer has to deal with strict constraints on latency and area usage (a very common set of constraints), we found that the custom NoC architecture represents a good trade-off with respect to all the performance metrics.

At the end of the high level network simulation phase, a large amount of simulation results is generated for each of the analyzed communication infrastructures. The next phase takes these results as input and presents them graphically to the designer.

19.4.3 Evaluation and Selection

During the evaluation and selection phase, the best fitting communication infrastructure is automatically selected among all the evaluated architectures. As stated before, the designer can also manually choose which of the proposed solutions is selected; this is done by examining the charts generated after the HLNS phase and then selecting the desired implementation mode (static or dynamic). The generated charts describe different trends for each evaluated communication infrastructure and for each analyzed metric, as can be seen in Sect. 19.5. The main objective of the designer, in manual mode, is to pick the communication infrastructure that offers the best trade-off compared with the others, or that is most suitable for the target system. Different methods can be used to perform this selection, such as giving priority to area optimization rather than to latency minimization. The designer can also return to this phase after the verification and validation one, if the output of that phase shows that the choices made were wrong, not feasible or simply not efficient.
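A selection criterion of this kind can be expressed as a weighted cost function over the simulated metrics. The sketch below is a minimal illustration under our own assumptions: the field names, weights and numbers are invented, and the real framework may combine the metrics differently.

#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Simulated results for one candidate infrastructure (hypothetical fields).
struct SimResult {
    std::string name;
    double latency;     // e.g. average cycles per packet
    double area;        // e.g. links complexity from Table 19.1
    double throughput;  // e.g. packets per second
};

// Weighted cost: lower is better. The weights encode the designer's
// priorities, e.g. area optimization over latency minimization.
double cost(const SimResult& r, double wLat, double wArea, double wThr) {
    return wLat * r.latency + wArea * r.area - wThr * r.throughput;
}

int main() {
    const std::vector<SimResult> results = {
        {"bus",        22.0, 16.0, 1200.0},
        {"mesh",       14.0, 40.0, 4100.0},
        {"custom NoC", 11.0, 41.0, 4500.0},
    };
    // Area-driven selection: give area usage the largest weight.
    auto best = std::min_element(results.begin(), results.end(),
        [](const SimResult& a, const SimResult& b) {
            return cost(a, 1.0, 5.0, 0.01) < cost(b, 1.0, 5.0, 0.01);
        });
    std::cout << "selected: " << best->name << '\n';
    return 0;
}

With these particular weights, the bus wins because of its small links complexity; shifting weight from area usage to latency makes the custom NoC the selected solution instead.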

After this evaluation, the designer can choose between two different implementation approaches, a static one or a dynamic one, depending on the available devices (i.e. FPGAs) and the target applications. Each of these two approaches has different characteristics:
- Static: this solution does not exploit partial dynamic reconfiguration to change the communication infrastructure at run-time. A designer can use this implementation mode to create a single infrastructure that implements all the functionalities needed by the system; an architecture representing the composition of the individual infrastructures of each functionality is therefore created. The main advantage is that all the functionalities needed by the system are plugged onto the device and there are no reconfiguration delays. The drawbacks of a fully plugged architecture are device area usage, high power consumption, and a lack of design modularity.
- Dynamic: with a dynamic approach, the framework generates different partial communication infrastructures, based on the functionalities requested in the specifications. These architectures are generated by trying to minimize the difference between one infrastructure and the next one to be plugged, thus minimizing the reconfiguration time. This approach introduces modularity and is fully supported by partial dynamic reconfiguration paradigms [19].

19.4.4 Verification and Validation

The last phase is the verification and validation one. Here, a SystemC [16] model of the previously selected communication infrastructures is automatically generated, and consistency and feasibility checks are performed. If a communication infrastructure turns out not to be applicable, for instance due to implementation problems, the designer can always step back to the evaluation and selection phase and choose a different solution. After this phase, it is possible to easily generate the VHDL code implementing the selected infrastructures; this is done by exploiting VHDL templates of the individual system components (both computational and communication ones), which are mapped one-to-one as the communication infrastructure model requires. Some of the results achieved after the evaluation and selection phase are reported in the next section.
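To give an idea of what such a SystemC consistency check can look like, the following hand-written sketch connects one master and one slave through a FIFO channel and asserts that every injected transaction is delivered. It is only an illustration: the module names, the traffic and the check are invented, and the models generated by the framework are certainly richer.

#include <systemc.h>

// Toy master: injects a fixed number of word transactions.
SC_MODULE(Master) {
    sc_fifo_out<int> out;
    void run() {
        for (int i = 0; i < 10; ++i) {
            out.write(i);     // send one transaction
            wait(10, SC_NS);  // assumed injection rate
        }
    }
    SC_CTOR(Master) { SC_THREAD(run); }
};

// Toy slave: consumes transactions and counts them.
SC_MODULE(Slave) {
    sc_fifo_in<int> in;
    int received = 0;
    void run() { while (true) { in.read(); ++received; } }
    SC_CTOR(Slave) { SC_THREAD(run); }
};

int sc_main(int, char*[]) {
    Master m("m");
    Slave  s("s");
    sc_fifo<int> link(4);  // channel standing in for the infrastructure model
    m.out(link);
    s.in(link);
    sc_start(1, SC_US);
    sc_assert(s.received == 10);  // consistency check: nothing was lost
    return 0;
}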

19.5 Results

The aim of this section is to present the simulation results achieved by analyzing five different communication architectures: point-to-point fully connected, bus-based, star, square-mesh and custom NoC. These simulation results are very important for the evaluation and selection phase and, consequently, for the verification and validation one; it is therefore possible to state that the simulation results achieved in the high level network simulation phase are the main concept on which our simulation framework is based. Figure 19.3 reports the trends of delivery rate, loss rate and throughput with respect to three different traffic levels: high, medium and low. We performed the analysis at these three traffic levels so as to have a simple and general model from which to obtain interesting simulation results.

Fig. 19.3 Metric trends for the five analyzed architectures at three different traffic levels: high (a), medium (b) and low (c)

Fig. 19.4 Area usage estimation (defined as links complexity) for a 16-element architecture

Fig. 19.5 Latency trend w.r.t. the number of masters in mesh architectures

With low traffic (Fig. 19.3c), it can be seen that more or less all the communication architectures perform in the same way. For this reason, if the developed system is characterized by very low traffic, a simple communication infrastructure (such as a bus-based one) can be chosen. On the other hand, in the charts with medium and high traffic, the five architectures obtain very different results. For instance, it can be seen in Fig. 19.3b that, with medium traffic, the bus-based architecture has a very high loss rate compared with a mesh or a custom NoC; the latter also provides a very good delivery rate and throughput (very similar to those of a point-to-point architecture, but with smaller area usage, as shown in Fig. 19.4). In Fig. 19.5, the latency trends of different square meshes are presented; it can be seen how markedly the latency trend changes with respect to the number of masters in the system. Another interesting result obtained through the simulations is the latency trend of each analyzed communication infrastructure for an architecture composed of 8 masters and 8 slaves, shown in Fig. 19.6: the latency of the bus architecture increases from the simulation with low traffic (Fig. 19.6b) to the one with high traffic (Fig. 19.6a). The star architecture performed poorly compared with the other infrastructures at all traffic levels; this is probably due to congestion of the central node, which could not sustain the offered load.

Fig. 19.6 Latency trend on an 8-master, 8-slave architecture

In Fig. 19.6c, the latency trends are reported for the system with no load, taking into consideration just a single completed packet. As can be seen, the custom NoC performs better than the mesh; considering also the scalability of the network-on-chip architectures and the previous results with respect to area usage, it is possible to state that the custom NoC is the best solution among the NoC topologies presented.

With the presented values, it is very easy to evaluate a cost function able to detect the best fitting communication infrastructure for a system, given its characteristics and its constraints. For instance, by considering the charts in Figs. 19.3b and 19.6, it is possible to state that the custom network-on-chip proposed in this work represents the best trade-off, considering the overall performance and the constraints on latency and area usage. Moreover, it can be seen in Fig. 19.4 that the number of links of the point-to-point architecture explodes, while the mesh and the custom NoC require fewer resources. Simulation results such as those just presented are used in the evaluation and selection phase to choose the right communication infrastructure and then, in the verification and validation phase, to validate the solution with a SystemC simulation. If this last step finds the solution invalid, the designer has to go back and evaluate and select a new solution.

19.6 Concluding Remarks

The aim of this work is to present an innovative simulation framework based on a scenario-centric approach that can be used to study a system design according to the applications for which it will be used. This approach leads to a requirements-driven reconfiguration of the communication infrastructure of the system, as also described in the methodology presented in our previous work [13], which aims at the selection of the best fitting communication infrastructure, i.e. the one that increases the performance of the system while taking into account both its characteristics and its constraints. The proposed simulation framework guides the designer in the automatic identification of the best fitting communication infrastructure, given the specifications of the target system. Moreover, this framework generates, as final output, an actual implementable solution that can exploit partial dynamic reconfiguration, depending on the system/designer requirements. This is a novel concept in a simulation framework for communication infrastructure exploration that uses both high level network simulations and a SystemC verification and validation phase.

References

1. I. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci. A survey on sensor networks. IEEE Communications Magazine, 40(8), 2002.
2. AutomotiveDesignLine.
3. L. Benini and G. De Micheli. Networks on chips: A new SoC paradigm. Computer, 35(1):70-78, 2002.

4. J. Chan and S. Parameswaran. NoCGEN: a template based reuse methodology for networks on chip architecture. In Proceedings of the 17th International Conference on VLSI Design, 2004.
5. M. Coppola, R. Locatelli, G. Maruccia, L. Pieralisi, and A. Scandurra. Spidergon: a novel on-chip communication network. In SOC 2004: Proceedings of the International Symposium on System-on-Chip, Tampere, Finland, page 15, November 2004.
6. W.J. Dally and B. Towles. Route packets, not wires: on-chip interconnection networks. In DAC '01: Proceedings of the 38th Conference on Design Automation. Assoc. Comput. Mach., New York, 2001.
7. Xilinx. Early Access Partial Reconfiguration Guide. Xilinx Inc., San Jose.
8. G. Fen, W. Ning, and W. Qi. Simulation and performance evaluation for network on chip design using OPNET. In TENCON: IEEE Region 10 Conference, pages 1-4.
9. J. González Gómez, E. Aguayo, and E. Boemo. Locomotion of a modular worm-like robot using a FPGA-based embedded MicroBlaze soft-processor. In CLAWAR, CSIC, September 2004.
10. HighPerformanceComputingArchitectures.
11. H. Jung, M. Tambe, and S. Kulkarni. A dynamic distributed constraint satisfaction approach to resource allocation. In Principles and Practice of Constraint Programming.
12. H.G. Lee, N. Chang, U.Y. Ogras, and R. Marculescu. On-chip communication architecture exploration: A quantitative evaluation of point-to-point, bus, and network-on-chip approaches. ACM Transactions on Design Automation of Electronic Systems, 12(3):1-20.
13. A. Meroni, V. Rana, M. Santambrogio, and D. Sciuto. A requirements-driven reconfigurable SoC communication infrastructure design flow. In 4th IEEE International Symposium on Electronic Design, Test & Applications, DELTA'08, 2008.
14. M. Moadeli, A. Shahrabi, W. Vanderbauwhede, and M. Ould-Khaoua. An analytical performance model for the spidergon NoC. In 21st International Conference on Advanced Information Networking and Applications, AINA'07, May 2007.
15. F. Moraes. Hermes: an infrastructure for low area overhead packet-switching networks on chip.
16. OSCI. SystemC documentation (last checked March 2008). Open SystemC Initiative (OSCI).
17. L. Ost, A. Mello, J. Palma, F. Moraes, and N. Calazans. MAIA: a framework for networks on chip generation and verification. In Proceedings of the Design Automation Conference, ASP-DAC 2005, volume 1, pages 49-52, January 2005.
18. P.P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh. Performance evaluation and design trade-offs for network-on-chip interconnect architectures. IEEE Transactions on Computers, 54(8).
19. V. Rana, M.D. Santambrogio, and D. Sciuto. Dynamic reconfigurability in embedded system design. In IEEE International Symposium on Circuits and Systems, May.
20. F. Su and K. Chakrabarty. Yield enhancement of reconfigurable microfluidics-based biochips using interstitial redundancy. ACM Journal on Emerging Technologies in Computing Systems, 2(2).
21. A. Varga. OMNeT++.
22. P.T. Wolkotte, P.K.F. Holzenspies, and G.J.M. Smit. Fast, accurate and detailed NoC simulations. In First International Symposium on Networks-on-Chip, NOCS 2007, 7-9 May 2007.
23. Xilinx. Xilinx market solutions.
24. M. Yokoo, E.H. Durfee, T. Ishida, and K. Kuwabara. Distributed constraint satisfaction for formalizing distributed problem solving. In International Conference on Distributed Computing Systems, 1992.

Chapter 20
Analysis of Non-functional Properties of MPSoC Designs

Alexander Viehl, Björn Sander, Oliver Bringmann and Wolfgang Rosenstiel

Abstract In this chapter, a novel design and analysis methodology for the simulation-based determination of non-functional properties of a system design, such as performance, power consumption, and temperature, is proposed. For simulation acceleration and for handling complexity issues, the design flow includes automated abstraction of component functionality. Specified platform attributes, such as dynamic power management, and formally modeled temporal input stimuli are automatically transformed into non-functional SystemC models. The framework implements the ability to perform automated online and offline analysis of non-functional System-on-Chip properties.

Keywords Non-functional properties · Electronic system level · Performance analysis · Power estimation · Temperature estimation · SystemC

20.1 Introduction

The ongoing shrinking of semiconductor technology allows the production of chips consisting of millions of transistors and integrating dozens of highly complex components such as microprocessors and DSP units. This progress boosts product value by enabling the incorporation of increasing numbers of utility functions [9]. Whereas functional requirements can be checked at early design stages, the validation of non-functional requirements is more complicated and is not sufficiently supported by current approaches. This may lead to violations of non-functional requirements (e.g. stand-by energy consumption) and, finally, to costly design iterations or unsaleable products.

Besides complexity, shrinking feature sizes pose additional challenges to non-functional property (NFP) verification: properties like on-chip temperatures, as well as the impact of the system environment, which were ignored or considered only in rare application areas in the past, are attracting more and more attention during system design because of nanoelectronic effects.

This work was partially supported by the BMBF project AIS under grant 01M3083G and the DFG project ASOC under grants RO 1030/14-1 and 2.

A. Viehl, FZI Forschungszentrum Informatik, Haid-und-Neu-Str., Karlsruhe, Germany, viehl@fzi.de

Fig. 20.1 NFP dependencies

Especially temperatures can have a major impact on the reliability and on run-time errors of the system [2, 16, 28]. In this chapter, an approach for the early system-level evaluation of NFPs is presented. The methodology uses the causal dependencies between the non-functional system properties performance, power and temperature; these dependencies are depicted in Fig. 20.1. The NFP analysis process is based on the determination of execution time profiles of the system functionality on the platform components. The proposed methodology for automated mapping is presented in a later section. Based on these derived timing properties of single components, global activities within the entire system, resulting from component communication and interaction, are determined; this can be used for analyzing the performance of the entire system. The automated creation of abstract simulation models allowing a global analysis of activities is introduced subsequently, and the incorporation of formally specified and constrained temporal environment models into the simulation models is briefly explained. Additionally, if an activity-based description of the power characteristics of each component is given, the energy consumption over time can be determined; besides, if an activity-state-based power management policy is applied, its potentially negative impact on performance can be measured. The inclusion of dynamic power management during abstract system simulation is also described. If information on the geometry and heat capacities of the system components is given, the local temperature distribution over time can be determined; the necessary parameters and their incorporation into the design flow are described. To determine requirement violations, an online property verification approach is proposed, creating and integrating assertions that check non-functional system properties. The presented approach further allows the model-based exploration of a design by modifying parameters like mapping, geometries or input stimuli at the specification model level. The developed design flow has been tested on different examples; Section 20.7 illustrates experimental results from the NFP analysis of a JPEG decoder.
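As a toy illustration of the activity-based power view just described (not taken from the chapter: the states, the power values and the trace are invented), the energy consumed by a component can be obtained by weighting the time spent in each activity state with a state-dependent power value:

#include <iostream>
#include <map>
#include <string>
#include <utility>
#include <vector>

int main() {
    // Hypothetical activity-based power characterization of one component.
    const std::map<std::string, double> powerW = {
        {"run", 0.450}, {"idle", 0.120}, {"sleep", 0.015}};

    // An activity trace as (state, duration in seconds), e.g. produced by an
    // abstract performance simulation with a power management policy applied.
    const std::vector<std::pair<std::string, double>> trace = {
        {"run", 0.020}, {"idle", 0.005}, {"sleep", 0.050}, {"run", 0.010}};

    double energyJ = 0.0;
    for (const auto& [state, dt] : trace)
        energyJ += powerW.at(state) * dt;  // E = sum over states of P * dt

    std::cout << "energy: " << energyJ << " J\n";
    return 0;
}

Time-varying energy of this kind, combined with the component geometries and heat capacities mentioned above, is what feeds the temperature estimation.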


More information

EFFICIENT AND EXTENSIBLE TRANSACTION LEVEL MODELING BASED ON AN OBJECT ORIENTED MODEL OF BUS TRANSACTIONS

EFFICIENT AND EXTENSIBLE TRANSACTION LEVEL MODELING BASED ON AN OBJECT ORIENTED MODEL OF BUS TRANSACTIONS EFFICIENT AND EXTENSIBLE TRANSACTION LEVEL MODELING BASED ON AN OBJECT ORIENTED MODEL OF BUS TRANSACTIONS Rauf Salimi Khaligh, Martin Radetzki Institut für Technische Informatik,Universität Stuttgart Pfaffenwaldring

More information

4. Hardware Platform: Real-Time Requirements

4. Hardware Platform: Real-Time Requirements 4. Hardware Platform: Real-Time Requirements Contents: 4.1 Evolution of Microprocessor Architecture 4.2 Performance-Increasing Concepts 4.3 Influences on System Architecture 4.4 A Real-Time Hardware Architecture

More information

Optimal Porting of Embedded Software on DSPs

Optimal Porting of Embedded Software on DSPs Optimal Porting of Embedded Software on DSPs Benix Samuel and Ashok Jhunjhunwala ADI-IITM DSP Learning Centre, Department of Electrical Engineering Indian Institute of Technology Madras, Chennai 600036,

More information

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Éricles Sousa 1, Frank Hannig 1, Jürgen Teich 1, Qingqing Chen 2, and Ulf Schlichtmann

More information

FPGAs: : Quality through Model Based Design and Implementation

FPGAs: : Quality through Model Based Design and Implementation FPGAs: : Quality through Model Based Design and Implementation Yves LaCerte Rockwell Collins Advanced Technology Center 400 Collins Road N.E. Cedar Rapids, IA 52498 ylacerte@rockwellcollins.cm Yang Zhu

More information

Chapter 2 The AMBA SOC Platform

Chapter 2 The AMBA SOC Platform Chapter 2 The AMBA SOC Platform SoCs contain numerous IPs that provide varying functionalities. The interconnection of IPs is non-trivial because different SoCs may contain the same set of IPs but have

More information

Architecture Implementation Using the Machine Description Language LISA

Architecture Implementation Using the Machine Description Language LISA Architecture Implementation Using the Machine Description Language LISA Oliver Schliebusch, Andreas Hoffmann, Achim Nohl, Gunnar Braun and Heinrich Meyr Integrated Signal Processing Systems, RWTH Aachen,

More information

EEL 5722C Field-Programmable Gate Array Design

EEL 5722C Field-Programmable Gate Array Design EEL 5722C Field-Programmable Gate Array Design Lecture 19: Hardware-Software Co-Simulation* Prof. Mingjie Lin * Rabi Mahapatra, CpSc489 1 How to cosimulate? How to simulate hardware components of a mixed

More information

FPGA for Software Engineers

FPGA for Software Engineers FPGA for Software Engineers Course Description This course closes the gap between hardware and software engineers by providing the software engineer all the necessary FPGA concepts and terms. The course

More information

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Amit Kulkarni, Tom Davidson, Karel Heyse, and Dirk Stroobandt ELIS department, Computer Systems Lab, Ghent

More information

Co-synthesis and Accelerator based Embedded System Design

Co-synthesis and Accelerator based Embedded System Design Co-synthesis and Accelerator based Embedded System Design COE838: Embedded Computer System http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer

More information

THE DESIGNER S GUIDE TO VERILOG-AMS

THE DESIGNER S GUIDE TO VERILOG-AMS THE DESIGNER S GUIDE TO VERILOG-AMS THE DESIGNER S GUIDE BOOK SERIES Consulting Editor Kenneth S. Kundert Books in the series: The Designer s Guide to Verilog-AMS ISBN: 1-00-80-1 The Designer s Guide to

More information

Computer Architecture

Computer Architecture Computer Architecture Pipelined and Parallel Processor Design Michael J. Flynn Stanford University Technische Universrtat Darmstadt FACHBEREICH INFORMATIK BIBLIOTHEK lnventar-nr.: Sachgebiete: Standort:

More information

Cover TBD. intel Quartus prime Design software

Cover TBD. intel Quartus prime Design software Cover TBD intel Quartus prime Design software Fastest Path to Your Design The Intel Quartus Prime software is revolutionary in performance and productivity for FPGA, CPLD, and SoC designs, providing a

More information

VLSI Design Automation. Maurizio Palesi

VLSI Design Automation. Maurizio Palesi VLSI Design Automation 1 Outline Technology trends VLSI Design flow (an overview) 2 Outline Technology trends VLSI Design flow (an overview) 3 IC Products Processors CPU, DSP, Controllers Memory chips

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

Analog Mixed Signal Extensions for SystemC

Analog Mixed Signal Extensions for SystemC Analog Mixed Signal Extensions for SystemC White paper and proposal for the foundation of an OSCI Working Group (SystemC-AMS working group) Karsten Einwich Fraunhofer IIS/EAS Karsten.Einwich@eas.iis.fhg.de

More information

A SystemC Extension for Enabling Tighter Integration of IP-XACT Platforms with Virtual Prototypes

A SystemC Extension for Enabling Tighter Integration of IP-XACT Platforms with Virtual Prototypes A SystemC Extension for Enabling Tighter Integration of IP-XACT Platforms with Virtual Prototypes Guillaume Godet-Bar, Magillem Design Services, Paris, France (godet-bar@magillem.com) Jean-Michel Fernandez,

More information

Analysis of Sorting as a Streaming Application

Analysis of Sorting as a Streaming Application 1 of 10 Analysis of Sorting as a Streaming Application Greg Galloway, ggalloway@wustl.edu (A class project report written under the guidance of Prof. Raj Jain) Download Abstract Expressing concurrency

More information

FPGA Implementation of I2C and SPI Protocols using VHDL

FPGA Implementation of I2C and SPI Protocols using VHDL FPGA Implementation of I2C and SPI Protocols using VHDL Satish M Ghuse 1, Prof. Surendra K. Waghmare 2 1, 2 Department of ENTC 1, 2 SPPU/G.H.Raisoni College of Engineering and Management, Pune, Maharashtra/Zone,

More information

ECE 3220 Digital Design with VHDL. Course Information. Lecture 1

ECE 3220 Digital Design with VHDL. Course Information. Lecture 1 ECE 3220 Digital Design with VHDL Course Information Lecture 1 Course Information Course #: ECE 3220 Course Name: Digital Design with VHDL Course Instructor: Dr. Vida Vakilian Email: vvakilian@csub.edu

More information

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design

SRAMs to Memory. Memory Hierarchy. Locality. Low Power VLSI System Design Lecture 10: Low Power Memory Design SRAMs to Memory Low Power VLSI System Design Lecture 0: Low Power Memory Design Prof. R. Iris Bahar October, 07 Last lecture focused on the SRAM cell and the D or D memory architecture built from these

More information

Hardware Software Codesign of Embedded Systems

Hardware Software Codesign of Embedded Systems Hardware Software Codesign of Embedded Systems Rabi Mahapatra Texas A&M University Today s topics Course Organization Introduction to HS-CODES Codesign Motivation Some Issues on Codesign of Embedded System

More information

Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor.

Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor. CS 320 Ch. 18 Multicore Computers Multicore computer: Combines two or more processors (cores) on a single die. Also called a chip-multiprocessor. Definitions: Hyper-threading Intel's proprietary simultaneous

More information

NoC Test-Chip Project: Working Document

NoC Test-Chip Project: Working Document NoC Test-Chip Project: Working Document Michele Petracca, Omar Ahmad, Young Jin Yoon, Frank Zovko, Luca Carloni and Kenneth Shepard I. INTRODUCTION This document describes the low-power high-performance

More information

Embedded Systems. Information. TDDD93 Large-Scale Distributed Systems and Networks

Embedded Systems. Information. TDDD93 Large-Scale Distributed Systems and Networks TDDD93 Fö Embedded Systems - TDDD93 Fö Embedded Systems - 2 Information TDDD93 Large-Scale Distributed Systems and Networks Lectures on Lecture notes: available from the course page, latest 24 hours before

More information

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC)

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) D.Udhayasheela, pg student [Communication system],dept.ofece,,as-salam engineering and technology, N.MageshwariAssistant Professor

More information

SPECC: SPECIFICATION LANGUAGE AND METHODOLOGY

SPECC: SPECIFICATION LANGUAGE AND METHODOLOGY SPECC: SPECIFICATION LANGUAGE AND METHODOLOGY SPECC: SPECIFICATION LANGUAGE AND METHODOLOGY Daniel D. Gajski Jianwen Zhu Rainer Dömer Andreas Gerstlauer Shuqing Zhao University of California, Irvine SPRINGER

More information

Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing

Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing Walter Stechele, Stephan Herrmann, Andreas Herkersdorf Technische Universität München 80290 München Germany Walter.Stechele@ei.tum.de

More information

REAL TIME DIGITAL SIGNAL PROCESSING

REAL TIME DIGITAL SIGNAL PROCESSING REAL TIME DIGITAL SIGNAL PROCESSING UTN - FRBA 2011 www.electron.frba.utn.edu.ar/dplab Introduction Why Digital? A brief comparison with analog. Advantages Flexibility. Easily modifiable and upgradeable.

More information

Computer Architecture

Computer Architecture Informatics 3 Computer Architecture Dr. Boris Grot and Dr. Vijay Nagarajan Institute for Computing Systems Architecture, School of Informatics University of Edinburgh General Information Instructors: Boris

More information

FPGA design with National Instuments

FPGA design with National Instuments FPGA design with National Instuments Rémi DA SILVA Systems Engineer - Embedded and Data Acquisition Systems - MED Region ni.com The NI Approach to Flexible Hardware Processor Real-time OS Application software

More information

ECE 637 Integrated VLSI Circuits. Introduction. Introduction EE141

ECE 637 Integrated VLSI Circuits. Introduction. Introduction EE141 ECE 637 Integrated VLSI Circuits Introduction EE141 1 Introduction Course Details Instructor Mohab Anis; manis@vlsi.uwaterloo.ca Text Digital Integrated Circuits, Jan Rabaey, Prentice Hall, 2 nd edition

More information

Research on Industrial Security Theory

Research on Industrial Security Theory Research on Industrial Security Theory Menggang Li Research on Industrial Security Theory Menggang Li China Centre for Industrial Security Research Beijing, People s Republic of China ISBN 978-3-642-36951-3

More information

13 AutoFocus 3 - A Scientific Tool Prototype for Model-Based Development of Component-Based, Reactive, Distributed Systems

13 AutoFocus 3 - A Scientific Tool Prototype for Model-Based Development of Component-Based, Reactive, Distributed Systems 13 AutoFocus 3 - A Scientific Tool Prototype for Model-Based Development of Component-Based, Reactive, Distributed Systems Florian Hölzl and Martin Feilkas Institut für Informatik Technische Universität

More information

Overview of Microcontroller and Embedded Systems

Overview of Microcontroller and Embedded Systems UNIT-III Overview of Microcontroller and Embedded Systems Embedded Hardware and Various Building Blocks: The basic hardware components of an embedded system shown in a block diagram in below figure. These

More information

Early Power-aware Design Space Exploration for Embedded Systems: MPEG-2 Case Study

Early Power-aware Design Space Exploration for Embedded Systems: MPEG-2 Case Study Early Power-aware Design Space Exploration for Embedded Systems: MPEG-2 Case Study Feriel Ben Abdallah, Chiraz Trabelsi, Rabie Ben Atitallah and Mourad Abed Institut Mines-Telecom, Telecom ParisTech, France

More information