DIT - University of Trento. A System Level Design Methodology for Architecture Exploration of Data Processing Systems


PhD Dissertation
International Doctorate School in Information and Communication Technologies
DIT - University of Trento

A System Level Design Methodology for Architecture Exploration of Data Processing Systems

Alena Simalatsar
Advisor: Prof. Roberto Passerone, Università degli Studi di Trento
March 2009

Abstract

Electronic embedded systems are widely used for different purposes in our daily life, such as communication, automation, measurement, security, and health. By their nature, these systems are often distributed and composed of nodes and processing elements that must interact with the environment and users, and communicate among themselves. The design of such architectures must take into account constraints on cost and physical size, which requires extensive analysis and performance evaluation. However, the growing complexity of electronic systems design and time-to-market pressure should not affect the correctness of new electronic systems. Existing design tools based on Register Transfer Level (RTL) are too detailed for an effective exploration of system design alternatives, and are typically biased towards specific implementation styles. This work presents a framework for fast architecture exploration and performance analysis of Data Processing Systems based on system-level specification languages and rapid architecture profiling using both existing and newly developed tools.

Several methodologies have been developed for architectural exploration and design optimization based on the stepwise refinement of the design specification. One example is the platform-based design (PBD) methodology, based on the construction of different layers, called platforms, which represent different levels of the design abstraction, where platforms at higher levels abstract the details of lower level platforms. Our contribution is a framework that supports the PBD paradigm. Within PBD, we have focused on the design abstraction that corresponds to the deployment of an application on a computing platform that may include general-purpose processors, digital signal processors, programmable or reconfigurable components (e.g., FPGAs) and interconnection elements. Each of the analyzed computing platforms can be single- or double-processor based.

The core of our framework is a model of a flexible scheduler that represents both processor and communication resources within a structured approach to system performance evaluation. This approach not only includes performance metrics such as execution time, but also allows us to estimate the interprocessor communication overhead and evaluate different scheduling policies for both computation and communication.

Keywords: Electronic Design Automation, System-Level Design, Design Space Exploration, Platform-based Design, Software Defined Radio

Carpe diem

Acknowledgments

This is a great opportunity to express my gratitude and respect to all the people who supported me during my PhD, since they all contributed to this work, each in his or her own way.

It is difficult to overstate my gratitude to my PhD supervisor, Prof. Roberto Passerone. First, I would like to thank him for guiding and inspiring me during my research period at the University of Trento and for showing me what research is about. Second, I greatly appreciate his support not only in my research work but also in different life situations, where he assisted me in any possible way. Finally, I thank him for the encouragement he provided throughout my thesis-writing period and for the final proofreading of this work. I would have been lost without him.

I would like to thank Prof. Alberto Sangiovanni-Vincentelli for giving me the great opportunity to work for six months in his research group at UC Berkeley. Special thanks go to one of the members of his group, Douglas Densmore, who became my research companion from the moment of my arrival; we performed a lot of research work together and continue our collaboration to this day. I am also pleased to thank other researchers I met at UC Berkeley, in particular Trevor C. Meyerowitz, Abhijit Davare, and Carlo Fischione.

I wish to thank Fernando Pianegiani and Fabrizio Stefani of ArsLogica SpA for stimulating discussions and suggestions. I should also thank Gianluca Gasperini, a master's student at the University of Trento, for the deployment of the UMTS code onto a DSP platform, without which this work would not look complete. I would like to thank all the administrative and technical staff of the University of Trento, and in particular Galina Kamburova, Manuel Zucchellini, Alessandro Tomasi, and Sebastiano Perisi.

I am indebted to my many student colleagues from the University of Trento for providing a stimulating and fun environment in which to learn and grow. I am especially grateful to Andrei Papliatseyeu, Nataliya Shcherbakova, Tanya Yatskevich, Aksana Serada, and Alexander Birukou for being my older friends, always ready to help and support me. I would also like to acknowledge my friends from the enormous Russian-speaking community of our university; in particular, I am very thankful to Volha Kethet, Marina Repich, Volha Bryl, Maksim Khadkevich, Nataliia Bielova, Aleksey Chayka, and Artsiom Yautsiukhin. I also wish to thank Michele Gubian, Marco Biazzini, and Andrea Zoboli for our nice music gatherings.

Lastly, and most importantly, I wish to thank my family: my older brother Yauhen Simalatsar and my parents, Ludmila and Valery Simalatsar. They taught me, believed in me, supported me, and loved me. To them I dedicate this thesis.

Contents

1 Introduction
    1.1 The Context
    1.2 The Problem
    1.3 The Solution
    1.4 Innovative Aspects
    1.5 Structure of the Thesis
2 State of the Art
    2.1 The Design Flow
        Specification
        Validation
        Synthesis
    2.2 Methodologies
        Platform-Based Design
        Model-driven Software Development
        ROM
    2.3 Models of Computation
        Communicating Sequential Processes
        Discrete-Event
        Finite State Machines
        Finite State Machines with Outputs
        Statecharts
        2.3.6 Co-design Finite State Machines
        Process Networks
    2.4 Model Heterogeneity
    2.5 System-level Design Languages
        SystemC
        Esterel
        Lustre
    2.6 System-level Design Frameworks
        Polis
        Metropolis
        Metro II
    2.7 Data Processing Systems
        SDR
        UMTS Protocol
    2.8 Existing SDR Platforms
        Avispa-CH
        SDRXPP
        SFF SDR
    2.9 Processing Elements
        ARM
        ARM
        MicroBlaze
        Sparc
3 The Problem
    Problem 1: Performance - Multi-Processor Based System Design
    Problem 2: Interprocessor Communication
    Problem 3: Time-to-market Pressure vs. Correctness
    Problem 4: Heterogeneity of the Design Chain Components
    Solution
    3.6 Data Processing System Design
    Conclusion
4 The Proposed Approach
    4.1 Methodology
        Untimed Functional Model
        Architectural Model
        Scheduler
        Timed Functional Model
    4.2 The Implementation
        The Bones - Module n.h file
        The Muscles - Module n.cc file
        The Heart
        The Synchronized FSM
        The Interprocessor Communication
        Parameters Calculation
5 Uniprocessor System Design and Results
    UMTS DLL Case Study
        Untimed Functional Model
        Timed Functional Model
        Architecture Model
    UMTS DLL Results
6 Multiprocessor Heterogeneous System Design and Results
    UMTS DLL and PHY Case Study MPSoC
    UMTS DLL and PHY Results MPSoC
7 MPSoC Modeling with Metro II
    7.1 MPSoC Modeling with Metro II
        General Architecture
        7.1.2 Architectural Tasks
    Operating System
    Processing Elements
    Functional Model
    Architecture Model
    Mapped System
    UMTS Model Mapping into a Multiprocessor Heterogeneous Architecture using Metro II
    Results
    Processing Time and Utilization
    Mappings Estimation
    Design Effort
8 The Framework Accuracy Estimation
    UMTS C code Adaptation for CCS
    Profiling Results
9 Related Work
    9.1 Design Space Exploration
    9.2 MPSoC Design
    9.3 RTOS Modeling
    9.4 Reconfigurable Systems
    9.5 Design Environments
        Generic Modeling Environment
        Ptolemy
        Formal System Design
        Metropolis
    9.6 Industrial Development Tools
        Comet/Meteor - VaST Systems Technology
        Mirabilis Design's Visual Sim
        Cofluent's Systems Studio
        9.6.4 Coware ConvergenSC
10 Conclusion
    10.1 Summary
    10.2 Future Work
Bibliography
A Readme.txt
B Main.c
C Function Execution Times (µs)

List of Tables

5.1 Efficiency of performance analysis
Mapping Configurations
Architecture Profiling Process and Cost
Mapping Scenarios for UMTS Case Study
Processors involved in mappings
Accuracy of performance analysis

List of Figures

2.1 Metro II Three Phase Execution Semantics
Layer diagram for the User Equipment Domain of the UMTS protocol
UMTS Dedicated Transport Channel
Block diagram of the SFF SDR
Design exploration methodology
General Purpose and Programmable Processor Profiling Flows
Operating System in SystemC
ModuleN.h file structure
ModuleN.cc file structure
ModuleN.cc file FSM representation
The FSM representation of the scheduler
The FSM system representation
Multiple preemption
Hierarchical preemption
The Hierarchical Scheduling
The Hierarchical Scheduling - BUS Arbitration Model
Functional model
Block diagram of uniprocessor system
Architecture performances
Resource distribution by function
Functional Model Mapped to Two Processors
6.2 Sample Execution Times Obtained Through Profiling
Mapping Effect on System Utilization
Mapping Effect on System Performance
MPSoC Architecture Service Topology
Sparc Runtime Processing Element
UMTS Metro II Untimed Functional Model
UMTS Estimated Execution Time vs. Utilization
Die areas of studied mappings
Size/utilization/performance trade-off (optimization) function
The function execution times

Chapter 1

Introduction

This chapter introduces the problem and positions it within the main research areas of ICT (telecommunications, computer science, electronics).

1.1 The Context

Nowadays, research work can rarely be positioned within the scope of a single scientific area. More frequently it lies at the intersection of two or more research fields. This is the case for the work presented in this doctorate thesis. From a very general point of view, this work presents a new System-Level Design (SLD) methodology. SLD is a design abstraction that can be applied to the design of any kind of system, such as ecological, biological, physical, mathematical, electronic, telecommunication and many other types of systems. The main idea behind most SLD methodologies is to simplify the system design by decomposing it into smaller components that either already exist or are simpler to design, and by establishing the relations between those components while preserving system functionality within the initial constraints. The requirement of design process automation implies the formalization of a set of informal specifications of a system or its components. The formalization process is called modeling and uses formal models for the specification description known as models of computation.

More specifically, the presented methodology is oriented to the design of electronic embedded systems. Therefore, we are going to talk about Electronic Design Automation (EDA). Initially, the design of electronic embedded systems was performed at the transistor level. The growing complexity of electronic embedded systems forced designers to raise the level of design abstraction by increasing the size of the basic building blocks up to logic gates. Hardware (HW) and software (SW) design were highly separated at this time, with no possibility to verify the overall system behavior. Therefore, it was not clear whether the HW design would satisfy the requirements given by SW engineers. Thus EDA introduced a set of tools for designing and producing electronic systems at the Register Transfer Level (RTL). This allowed the verification of system behavior in terms of logical operations performed on data transferred between hardware registers by means of combinational logic. However, the growing complexity and heterogeneity of electronic embedded systems, time-to-market pressure and design cost constraints push designers to raise the level of design abstraction to the system level. The presented methodology can be seen in the scope of Electronic Design Automation (EDA) raised up to the system level.

The architecture specification of an electronic embedded system cannot be defined without considering a particular area of application. The construction of a distributed communication infrastructure, and the use of highly connected embedded systems, makes it possible today to realize new and innovative applications and services, often context-aware, that can leverage the mobility afforded by wireless connectivity [16]. Therefore, we have focused on the design of data processing systems, whose main representative class is found in telecommunication technologies. As an example of a complex and innovative telecommunication system we have focused on Software Defined Radios (SDR) [71, 70].

Taking all this into account, our work can be seen in the scope of three main research areas: computer science, by means of the presented methodology and the design framework; electronics, due to the design space exploration considering different permutations and interconnections of electronic components; and telecommunications, due to the targeted application area.

1.2 The Problem

Electronic embedded systems are widely used for different purposes in our daily life, such as communication, automation, measurement, security, and health. By their nature, these systems are often distributed and composed of nodes and processing elements that must interact with the environment and users, and communicate among themselves to exchange data. In addition, the convergence of different applications and the proliferation of communication standards suggest an implementation architecture that includes multifunctional devices able to support a number of different technologies. The implementation of more and more complex telecommunication applications exacerbates the performance and real-time system requirements, which can no longer be supported by single-processor architectures. Therefore, emerging computing platforms are increasingly multiprocessor based [64]. This, together with constraints on cost, physical size and performance of each particular device, results in system heterogeneity and increasing design complexity, which requires a higher degree of sophistication in embedded systems design.

In the race for higher performance computing, multi-processor platforms offer flexibility and a wide range of alternative design solutions that can optimally trade off the design metrics of interest. This is especially true for embedded applications, often faced with hard-to-satisfy real-time and energy requirements which are best addressed by a distributed implementation. This trend is also apparent in the design of modern microprocessors, where the use of multi-threaded cores is favored over faster clocks to speed up software execution. The design of multi-core architectures and embedded systems in general is, however, made complex by a large design space, the difficulty of integrating heterogeneous components, and time-to-market pressures.

An optimal design of such architectures that takes into account constraints on cost and physical size requires extensive analysis and performance evaluation. However, the growing complexity of electronic systems design and time-to-market pressure should not affect the correctness of new electronic systems. In contrast to architectures based on a single processor, communication components (e.g., buses, shared or distributed memories, etc.) play an essential role in determining the performance of multiprocessor systems. The latency of inter-processor communication may either tie up the processor resources or cause the processor to wait, which can drastically affect the overall system performance.

Existing design tools for the automatic mapping and synthesis of optimized platforms based on Register Transfer Level (RTL) are too detailed for an effective exploration of system design alternatives, and are typically biased towards specific domains of applications and implementation styles. Therefore, they are faced with extreme complexity, due to the size of the solution space. In addition, the lack of abstraction makes the design, as well as the validation process, difficult. Managing model heterogeneity at this level is also problematic. Therefore, realizing a suitable design relationship between all components requires raising the level of abstraction at which design is carried out. The alternative can be manual architecture selection, coupled with fast performance simulation that computes metrics with quick turnaround time. Early attempts by the industry to introduce such technology [54] have not been successful in the market for a variety of reasons, including the lack of appropriate performance models and the use of proprietary languages.

In addition to working at a higher level of abstraction, complexity can be managed by reusing pre-verified components. The validation process therefore shifts from the verification of the individual components to the verification of their composition and their interconnection. Companies that develop new products or components face great difficulties when trying to compose hardware and software components that come from different suppliers who use diverse design models.
Thus, there is a need for standardization of the hardware and software domains which can allow plug-and-play of subsystems [2]. As a result, the management of the design chain [63] becomes of primary importance, especially for system integrators, in order to assemble embedded systems composed of hardware and software components, with guarantees of correctness and optimal resource utilization within time-to-market and cost constraints. While some effort is being expended in this direction in the industry [2, 7], a structured methodology is required to simplify architecture space exploration and boost design flexibility and reuse.

Design chain management based on standardization of the hardware and software domains is a very complex task due to the existence of a number of different design domains and application areas. Each of these design domains offers a complex design space which requires extensive exploration and analysis of all possible permutations of its components and their interconnects. Therefore, it is important to develop a structured approach for the design space exploration of each domain-specific system. One possible approach is the development of an appropriate framework for the analysis of a system architecture composed of a hardware platform and an application mapped onto this platform and scheduled by means of an operating system. The framework should be able to evaluate multiprocessor-based platforms, due to the growing system performance requirements, and possible mappings of software components. Moreover, it should allow the study of different scheduling policies, because they may drastically affect the real-time characteristics of the system. The integration of this framework into a bigger design chain management methodology will propagate its parameters to the upper layers of the methodology hierarchy. Therefore, the performance of the framework should be as high as possible while not affecting the accuracy of the results.

1.3 The Solution

To overcome the problems related to the composition of heterogeneous components and the growing complexity of the design space, several System-Level Design (SLD) methodologies have been developed for fast architectural exploration and optimization based on stepwise refinement of the design specification. One example is the platform-based design (PBD) methodology [75, 35], which is based on the construction of different layers, called platforms, which represent different levels of the design abstraction. Each platform is a well separated library of computational and communication components, where platforms at higher levels abstract the details of lower level platforms, and can be used for fast performance estimation.
This is essential for quickly converging toward a platform that is not only optimized for the desired functionality, but can also support its future extensions. In PBD, the design process is a sequence of refinement steps, from specification to implementation, where the functional representation at each level is mapped onto the architectural representation at the lower level, while performance metrics are evaluated and compared to the design requirements. The levels of abstraction must be carefully chosen, to be detailed enough to support the desired analysis techniques, and abstract enough for an efficient implementation.

Evidently, different design domains and application areas require distinct libraries of components and a different range of abstraction levels. Therefore, each domain requires the development of specific architectural, performance and functional models. The effectiveness and the acceptance of the PBD methodology is therefore tied to the availability of these models and to the development of use cases that can prove their accuracy. The development of appropriate design space exploration and analysis tools is thus an essential ingredient in the evolution of the PBD methodology. The new framework presented in this thesis was developed to support PBD. It focuses on design space exploration for the data processing system domain.

1.4 Innovative Aspects

We have strengthened the PBD methodology by developing, using existing tools [82, 44], appropriate functional, architectural and middleware models for complex telecommunication systems, such as Software Defined Radio (SDR). The requirements in terms of performance and adaptability of SDR are high; therefore, defining the best computation platform for it is hard. Moreover, the functionality of SDR can vary according to the current requirements of the environment. Thus, it can be one protocol running on the platform or a mix of concurrently executing protocols. To start the analysis of SDR architectures we have developed a functional model of the Data Link (DLL) and the Physical (PHY) layers of a 3G telecommunication standard called Universal Mobile Telecommunications System (UMTS).
We have implemented the transmission part of the protocol following the specification defined by the 3rd Generation Partnership Project (3GPP) [1]. The reception part of the model performs the reverse functionality and is executed concurrently with the transmission part, thus creating a case study with two concurrently executing functional chains.

We approach the problem by raising the level of abstraction for the design of SDR platforms using the PBD methodology. Our contribution is an infrastructure based on the system-level specification language SystemC and on rapid architecture profiling using both existing and newly developed tools, which supports the PBD paradigm. Within PBD, we focus on the design abstraction that corresponds to the mapping of an application on a computing platform that may include general purpose processors (GPP), digital signal processors (DSP), programmable or reconfigurable components (e.g., FPGAs) and interconnection elements. Each of the analyzed computing platforms can be a single- or multi-processor computing platform, where the latter can be characterized as a centralized shared memory architecture.

The core of our infrastructure is a flexible scheduler used to represent both processor and communication resources within a structured approach to system performance evaluation. This approach not only includes performance metrics such as execution time, latency, throughput or processor load, but also allows us to estimate the interprocessor communication overhead and evaluate different scheduling policies for both computation and communication. For computation elements, this is achieved by separating the functional models from the architectural models, and by connecting them through control signals to regulate the overall execution. Likewise, the interprocessor communication model separates the data transmission from the bus arbitration process, by introducing input and output buffers of the processing elements that act as modules of the functional model connected to the arbiter.

Our solution is evaluated both qualitatively and quantitatively on the UMTS protocol mapped onto an architecture oriented towards the implementation of Software Defined Radios (SDR).
We show that we can explore interesting mappings quickly (in a matter of minutes) as well as measure metrics such as throughput, latency, and utilization. We also show that the presented framework was smoothly integrated into Metro II [41, 19], a bigger heterogeneous design framework, where it coexists with other models and methodologies and opens wider possibilities for the design space exploration of electronic embedded systems.
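As an illustration of the kind of mapping exploration described above, the following minimal Python sketch enumerates every assignment of a short functional chain to two processing elements and ranks the mappings by estimated latency. It is not the SystemC scheduler developed in this thesis: the task names, the execution times, the fixed bus transfer cost and the functions `evaluate` and `explore` are all hypothetical, and the model ignores preemption, bus arbitration and concurrently executing chains.

```python
from itertools import product

# Hypothetical per-task execution times in microseconds on each processor type.
EXEC_TIME_US = {
    "crc":        {"gpp": 40, "dsp": 25},
    "encode":     {"gpp": 90, "dsp": 30},
    "interleave": {"gpp": 20, "dsp": 20},
}
CHAIN = ["crc", "encode", "interleave"]  # tasks run strictly in this order
BUS_TRANSFER_US = 15  # assumed cost of moving data between two different processors


def evaluate(mapping):
    """Return (latency, busy time per processor) for one pass through the chain."""
    latency = 0
    busy = {}
    previous = None
    for task in CHAIN:
        pe = mapping[task]
        # Crossing from one processor to another pays the assumed bus cost.
        if previous is not None and previous != pe:
            latency += BUS_TRANSFER_US
        cost = EXEC_TIME_US[task][pe]
        latency += cost
        busy[pe] = busy.get(pe, 0) + cost
        previous = pe
    return latency, busy


def explore(processors=("gpp", "dsp")):
    """Enumerate all mappings and sort them by estimated latency."""
    results = []
    for assignment in product(processors, repeat=len(CHAIN)):
        mapping = dict(zip(CHAIN, assignment))
        latency, busy = evaluate(mapping)
        # Utilization here is simply busy time divided by end-to-end latency.
        utilization = {pe: t / latency for pe, t in busy.items()}
        results.append((latency, mapping, utilization))
    return sorted(results, key=lambda r: r[0])
```

Calling `explore()` returns the candidate mappings sorted by latency, so the first entry is the fastest mapping under these assumed numbers. Replacing the hypothetical table with profiled execution times would be one way to adapt the sketch.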

1.5 Structure of the Thesis

This thesis is organized as follows. Chapter 1 gives the introduction to this work. Chapter 2 introduces the state of the art. It opens with a general overview of the design flow (Section 2.1), and then introduces several design methodologies (Section 2.2). Several models of computation are discussed in Sections 2.3 and 2.4. Sections 2.5 and 2.6 present several system-level design languages and frameworks, respectively. An overview of data processing systems (Section 2.7), hardware platforms (Section 2.8) and processing elements (Section 2.9) is given at the end of Chapter 2. Chapter 3 revisits the problem addressed by this thesis, giving more details related to the area of application. The developed methodology and its implementation are presented in Sections 4.1 and 4.2 of Chapter 4, respectively. Two developed case studies and the simulation results are presented in Chapters 5 and 6. Chapter 7 describes the integration of our infrastructure into the Metro II design framework. This chapter also presents the Metro II simulation results. The accuracy estimation of the designed infrastructure is discussed in Chapter 8. Chapter 9 introduces the related work, dividing it into several groups. Section 9.1 describes other works performed in the area of design space exploration. Multi-processor System-on-Chip designs are described in Section 9.2. Different approaches to real-time operating systems modeling are presented in Section 9.3. Section 9.4 presents several reconfigurable architectures developed for SDR implementation. Different academic and industrial design environments and tools are presented in Sections 9.5 and 9.6, respectively. Chapter 10 concludes this thesis with a short summary (Section 10.1) and possible future research directions (Section 10.2).

Chapter 2

State of the Art

This chapter presents the state of the art pertaining to the research work presented in this doctorate thesis. It includes the most relevant materials that form the background of the presented research activity as well as a selection of the relevant works performed by other research groups around the world. The opening section of this chapter introduces the main phases of the embedded system design flow. After that we discuss the methodologies used to capture all of these phases, among which we highlight the Platform-Based Design (PBD) methodology. Section 2.3 discusses the models of computation used to formalize the system design, as well as model heterogeneity. Then, system-level design languages are presented, among which we single out SystemC, which was chosen for the implementation of our framework. The most relevant system-level design frameworks are also described in this chapter, while a critical discussion and the description of other academic as well as industrial design frameworks is left to Chapter 9. This chapter also introduces the Software Defined Radio as one of the most complex telecommunication and data processing systems. Here we also describe the UMTS communication protocol, as well as examples of existing platforms with high performance characteristics and processing elements, as possible components of the potential hardware platform.

2.1 The Design Flow

In this section, we first review the fundamental steps in the design of electronic embedded systems, and then show how these can be combined in a platform-based design methodology. We focus our attention on the design flow for system engineers, that is, those involved in specifying the system, defining its overall architecture and finally integrating the different parts to create the finished product. For mixed hardware and software embedded systems design, the process can be divided into three steps: specification, validation and synthesis [42].

Specification

The first step of the design process is the formalization of a set of informal specifications [53]. The formal representation of a system or subsystem is called modeling. Usually, this formalization starts from a black box model which consists only of a set of inputs and a set of outputs. Later, the model is enriched with a functional (behavioral) specification, a set of properties, and a set of constraints. The formal model for the description of the specification, known as the model of computation, consists of a set of primitive blocks and their properties, and of rules establishing how these blocks can be connected. We will consider several kinds of models of computation later in this chapter. For now, we highlight that for design automation the model must include a language with a syntax and denotational and/or operational semantics. The denotational semantics give the meaning of the language in terms of relations. The operational semantics give the meaning of the language in terms of actions taken by some abstract machine. The model of computation underlying the language is applied to construct the executable model (operational semantics) of the system and to manage the interaction of components (denotational semantics). The choice of model of computation is very important because it strongly affects the cost and reliability of the design. For the design of embedded systems, the models of computation that are able to represent concurrency and time are the most useful.
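To make the notion of operational semantics concrete, the sketch below treats a small finite state machine as a formal specification: the transition table describes the behavior, and the `step` function plays the role of the abstract machine whose actions give the table its meaning. The states, events and outputs are hypothetical and chosen only for illustration; they are not taken from the models developed in this thesis.

```python
# Hypothetical specification: (current state, input event) -> (next state, output).
TRANSITIONS = {
    ("idle", "start"): ("busy", "ack"),
    ("busy", "data"): ("busy", "processed"),
    ("busy", "stop"): ("idle", "done"),
}


def step(state, event):
    """One action of the abstract machine: consume an event, produce an output."""
    # Assumption: an event with no matching transition is ignored,
    # leaving the state unchanged and producing no output (None).
    return TRANSITIONS.get((state, event), (state, None))


def run(events, state="idle"):
    """Execute the specification on a sequence of input events."""
    outputs = []
    for event in events:
        state, out = step(state, event)
        outputs.append(out)
    return state, outputs
```

Seen from the outside, `run` is the black box model mentioned above (a sequence of inputs mapped to a sequence of outputs), while `TRANSITIONS` is the functional specification that enriches it.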

24 2.2. METHODOLOGIES Validation Validation is a phase of system development where software and hardware are analyzed to verify that they satisfy the desired properties. The most common technique for design validation is simulation. A more promising approach that can be applied when systems are specified in a more restricted way is to use formal methods. This type of validation is very important for safety-critical embedded systems. Formal verification is a process for checking whether a system satisfies a given property under all possible inputs, and is usually implemented using finite state reachability algorithms. The two approaches are used for different purposes. Simulation is typically applied to large systems for performance analysis, while formal verification is applied to safety-critical subsystems to ensure correctness Synthesis Synthesis is a process of design refinement which translates a high level specification into a lower level model. Embedded systems synthesis is divided into three classes: partitioning system onto hardware and software functional components, mapping software components onto a hardware architecture, and hardware and software synthesis. Partitioning is a process of dividing the specified system functionality onto the functional blocks and determining which part of the specification will be implemented in software and which one in hardware components. The architecture in general is composed of hardware components, and interconnection media. Mapping determines which parts of software will be executed on which of the given hardware components, which is particularly important for heterogeneous systems composed of more than one component. Software and hardware synthesis is used to derive an actual implementation of the system. 2.2 Methodologies This section presents three methodologies developed to support the design of electronic embedded systems. 
Two of these methodologies, the Platform-Based Design and the Model-driven Software Development, are very different in their motivation, concepts, and

level of abstraction, though they are similar in the idea of separating the system architecture into several independent platforms. The third one, the Result Oriented Modeling methodology, is developed for modeling system processes.

Platform-Based Design

Platform-Based Design (PBD) is a methodology that combines the specification, validation and synthesis steps of the design flow, while maintaining a clear separation between the corresponding models [75], [35]. By doing so, the designer can operate separately on the distinct steps and maintain a global view of the impact of his/her design decisions on the final implementation. The methodology includes hardware and embedded software design, where the design of the system starts at a high level of abstraction (initial design description) and proceeds to a detailed implementation by mapping the executable functional model onto progressively more detailed architectures under a set of constraints. The PBD process is neither a top-down design approach (mapping an instance of functionality onto an instance of hardware with propagation of constraints) nor a bottom-up one (starting from building the hardware platform and associating performance capabilities with it). It is a meet-in-the-middle process, where the middle point consists of a common semantics for the platform and the functional domains that have diverse semantics. As discussed, different models and interaction semantics can be used to describe a system. In the following, we consider different models of interest for the design of embedded systems, and highlight their specific properties and preferred areas of application.

Model-driven Software Development

Model-Driven Software Development (MDSD) [5], [27] is a software development methodology that was invented to organize the work of distributed project teams. Model-Driven Architecture (MDA) serves as the basis for MDSD.
The Model-Driven Architecture (MDA) consists of a Platform-Independent Model (PIM) of the application, one or more Platform-Specific Models (PSMs), and complete implementations for each supported platform. MDA tools support the mapping of the PIM to the PSMs. This separation of HW/SW platforms makes MDSD similar to PBD.

ROM

Result Oriented Modeling (ROM) [76] is a methodology aimed at modeling processes in a more accurate way than transaction level modeling (TLM). In the ROM methodology, processes are observed only at their beginning and at their end, without looking at the intermediate process state changes. The final state of the process, such as its termination time, is predicted at the beginning, and at the end of the predicted time the state of the process is checked. If other processes preempted the observed process, the termination time is recalculated by adding the time during which the process was preempted, and the whole computational cycle is repeated once more. This approach optimizes the modeling process by reducing the amount of computation.

2.3 Models of Computation

As already mentioned, models of computation are used in all phases of the design flow. Firstly, they are used for the formal description of system components during the specification phase. Secondly, they are utilized in the formal verification of system properties during the validation phase of the design flow. Thirdly, the system formalization helps preserve the system validity during the refinement process of the synthesis phase. This section introduces several models of computation that are best suited for the design of data processing system architectures.

Communicating Sequential Processes

Communicating Sequential Processes (CSP) was introduced by C. A. R. Hoare in 1978 [31], [57]. This model represents the system as a network of sequential processes. The processes communicate with each other synchronously by sending messages through unidirectional channels. This means that in order to transfer messages, processes on both sides use blocking rendezvous channels where both processes stall until the message is transferred.
Because of the tight synchronization, this model is very good at representing systems where resource sharing is a key element, and it can be used efficiently to describe resource allocation.
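The blocking rendezvous described above can be illustrated with a small sketch. The following Python code is not part of the original work: the class `RendezvousChannel` and the function `run_demo` are hypothetical names introduced only for illustration, and threads stand in for CSP processes. It shows one possible way to make both the sender and the receiver stall until the message has been handed over.

```python
import threading


class RendezvousChannel:
    """Unidirectional, unbuffered channel (hypothetical helper, not a library API).

    send() returns only after a receiver has taken the value, and recv()
    returns only after a sender has offered one, so both sides stall
    until the message is transferred.
    """

    def __init__(self):
        self._item = None
        self._item_ready = threading.Semaphore(0)  # signalled by the sender
        self._item_taken = threading.Semaphore(0)  # signalled by the receiver
        self._send_lock = threading.Lock()         # serializes concurrent senders

    def send(self, value):
        with self._send_lock:
            self._item = value
            self._item_ready.release()   # offer the message
            self._item_taken.acquire()   # stall until the receiver has it

    def recv(self):
        self._item_ready.acquire()       # stall until a sender offers a message
        value = self._item
        self._item_taken.release()       # let the sender continue
        return value


def run_demo():
    """Run one producer and one consumer process over a single channel."""
    chan = RendezvousChannel()
    received = []

    def producer():
        for i in range(3):
            chan.send(i)

    def consumer():
        for _ in range(3):
            received.append(chan.recv())

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return received
```

Because no message is buffered, the producer can never run ahead of the consumer, which is the tight synchronization that makes the model convenient for describing resource sharing.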

Discrete-Event

Discrete-event (DE) is an actor-based model [28], [57], [46] where actors communicate via sequences of timed events. Each event is a data value (a token) together with a tag t which denotes the time of the event. In practice, however, the tag is a pair t = (τ, n), where τ is the time stamp and n is a natural number that represents a microstep, which can be used to determine the order of simultaneous events that have the same time stamp. In a DE model of computation, events are processed chronologically: an actor is fired whenever its available input events are the oldest among all the active events (i.e., they have the earliest time stamp). All the actors of the DE model of computation share the same global notion of time. Because the DE model of computation prioritizes simultaneous events, it is suitable for the design of systems with timed dataflow behavior. This model is also appropriate for the design of queuing systems, communication networks, and digital hardware.

Finite State Machines

A finite state machine (FSM) [28], [46] is a model of system behavior that consists of a finite set of states linked by arcs, called transitions, and actions. Formally, an FSM can be defined as a five-tuple composed of a finite set of states (Q), a set of possible inputs or input alphabet (Σ), an initial state (q₀ ∈ Q), a set of final states (F ⊆ Q) and a transition function (δ : Q × Σ → Q). The FSM reacts to its input σ by taking the transition δ(q, σ) from one state q to another. The FSM model is the most widely used in the area of hardware design, especially for the design of sequential control logic, because of the tight synchronization and the availability of numerous analysis techniques. Unfortunately, the number of states in an FSM model may grow exponentially with respect to system size, and FSM models for even moderately complex concurrent systems may become unmanageable.
The problem can be addressed by decomposing the FSM into separate concurrent components, and by resorting to compositional analysis techniques.
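The five-tuple definition of an FSM given above can be written down almost literally. The following Python sketch is not taken from the original work: the class name `FSM`, its field names, and the parity-checking machine used as an example are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FSM:
    """A finite state machine as a five-tuple (Q, Sigma, q0, F, delta)."""
    states: frozenset      # Q
    alphabet: frozenset    # Sigma
    initial: str           # q0, must be in Q
    finals: frozenset      # F, subset of Q
    delta: dict            # transition function: (state, symbol) -> state

    def run(self, inputs):
        """Return the state reached after consuming all input symbols."""
        q = self.initial
        for sigma in inputs:
            if sigma not in self.alphabet:
                raise ValueError(f"symbol {sigma!r} is not in the input alphabet")
            q = self.delta[(q, sigma)]   # take the transition delta(q, sigma)
        return q

    def accepts(self, inputs):
        """True if the machine ends in one of its final states."""
        return self.run(inputs) in self.finals


# Illustrative machine: accepts bit strings containing an even number of '1's.
parity = FSM(
    states=frozenset({"even", "odd"}),
    alphabet=frozenset({"0", "1"}),
    initial="even",
    finals=frozenset({"even"}),
    delta={
        ("even", "0"): "even", ("even", "1"): "odd",
        ("odd", "0"): "odd",   ("odd", "1"): "even",
    },
)
```

Storing δ as an explicit table makes the state-explosion problem visible: the table needs one entry per (state, symbol) pair, so a flat product of several concurrent machines quickly becomes very large.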

Finite State Machines with Outputs

There exists a particular type of FSM where actions are associated with sending data to the outputs. These FSMs are called finite state machines with outputs. The current output of the FSM is determined by both its input and its present state. Formally, these FSMs are defined with the same set of parameters, excluding the set of final states and adding a set of possible outputs, the output alphabet. There are two types of FSMs with outputs. In the first, the output is associated with a state; these are called Moore machines. In the second, the output is associated with a transition; these are called Mealy machines. These models are also called synchronous finite state machines and are widely used in circuit control logic design.

Statecharts

The main disadvantage of synchronous FSMs in terms of system-level design is the absence of a strategy for top-down or bottom-up development. In other words, the flatness of the state-transition diagram and its diseconomy in terms of transitions in some particular cases (e.g., high-level interrupts) led to the idea of formalizing the hierarchical development and refinement of Mealy machines. Statecharts is a model introduced by David Harel. Essentially, this model is an extension of the Mealy machine that allows hierarchical development, which in turn makes the state diagrams more structured and more economical in terms of the number of transitions. The hierarchy of statecharts is presented as depth in states, achieved by drawing states as boxes composed of other boxes as sub-states that represent lower levels of the hierarchy. Interrupts are used to manage the communication between adjacent levels of the hierarchy. Each of the hierarchical levels may include several independent components that can be executed in parallel.
Each of these components has its initial states that can be put together into an AND-state, which introduces the orthogonality concept of Statecharts. Orthogonal components can communicate with each other using broadcast events. The transition from one state to another of any statecharts component is labeled

with two values: a trigger and an action. The trigger represents the conditions under which the transition takes place, and the action is the generation of a set of output events. The orthogonality concept of the model helps to prevent the exponential blow-up in FSM representation typical of complex concurrent systems, and its hierarchy allows a structured step-by-step refinement of the system.

Co-design Finite State Machines

Co-design finite state machines (CFSMs) are a particular type of FSM that makes it possible to describe the system design by decomposing the system into several FSMs, each a component of the designed system. This way, the system specification is presented as a network of several CFSMs. Communication among CFSMs in the network is not performed by means of shared variables, as in traditional FSM composition, but by means of events. Therefore, the sets of possible inputs and outputs of CFSMs are composed of events. CFSM is a locally synchronous, globally asynchronous model of the design. Internally, each CFSM acts almost as a Mealy machine, where each transition is an atomic operation. While transitioning from one state to another, a CFSM emits events that are broadcast to the other CFSMs in the network. The execution delay of a CFSM transition is not initially specified, in order to add more flexibility to the specification of hardware and software components. Therefore, the broadcast events are stored until the other CFSMs are ready to consume them. The refinement of the model leads to a more precise specification of the timing constraints.

Process Networks

Process Networks (PN) [56] is a computational model developed for modeling distributed systems. PNs can be represented as a directed graph with a set of processes (nodes) that map input tokens into output tokens, and a set of arcs along which tokens are transferred.
Kahn Process Networks (KPNs) are a particular kind of Process Networks where processes communicate with each other through unidirectional unbounded FIFO channels. Hence, producers write a sequence of tokens to the channels, and consumers read the tokens from the channels in the same order in which they were written. Because the FIFOs are

unbounded, writing to this channel is non-blocking while reading is blocking (i.e., a read blocks when the FIFO is empty). Dataflow process networks [56], [28], [57] are a special kind of KPNs. Processes of dataflow process networks execute a sequence of firings. A set of firing rules (rules that activate a particular dataflow actor) specifies precisely what tokens must be available at the inputs for the actor to be fired, and how many tokens are produced at the outputs. Synchronous dataflow (SDF) [28], [57], [46] has a set of firing rules that is determined statically (at compile time). With this restriction, it is possible to determine a fixed execution order of the actors, which helps to avoid expensive run-time scheduling decisions and permits the use of FIFOs with fixed size. SDF is useful for modeling systems with dataflow behavior, such as signal processing systems. However, because of the untimed nature of the model, SDF is not particularly well suited to control-intensive applications and to synchronization.

2.4 Model Heterogeneity

As mentioned earlier, the choice of models of computation to design a particular system is very important because it can affect the cost and the reliability of the design. Each model of computation has its own strengths and weaknesses [57]. There exist many models of computation in addition to those described above, for example continuous time (CT), Giotto [28], distributed discrete-event (DDE), other variations of the dataflow model like Dynamic dataflow (DDF) and Boolean dataflow (BDF), and many others. The models described above were chosen because of their suitability for the design of complex data processing systems. The convenience of using the best suited model for each part of the design is balanced by the requirement to deal with the resulting model heterogeneity [57].
To design such systems, a specific modeling environment that can establish the communication between heterogeneous models is required. The issue of heterogeneous models of computation is not trivial. There are several approaches for establishing the interconnection between models. One approach is to let the user specify the interrelationship between the semantics of different models. In this case the design tools simply examine the user's

specification and show possible inconsistencies and contradictions. Another approach is to define the semantics of the models formally, and then determine whether they can be composed in a common semantic domain. Below we discuss tools and languages for design based on heterogeneous models of computation.

2.5 System-level Design Languages

SystemC

SystemC is a language created by the Language Open Group (LOG) of the Open SystemC Initiative (OSCI), and is targeted to a wide range of designers. SystemC supports different models of computation and allows the design of heterogeneous systems [82], [44]. Basically, SystemC is a C++ class library, where C++ plays the role of language foundation while the library provides both a notion of process and interface, and a simulation kernel based on the Discrete-Event model. The SystemC library includes its own structural elements like modules, ports, interfaces and channels (signals, FIFOs, mutexes, semaphores, etc.). It also introduces new data types, such as 4-valued logic, bits and bit vectors, arbitrary precision integers and fixed-point types, which are useful in the specification of hardware components. Additional libraries, like the Verification Library and the Heterogeneous System Specification Methodology Library (HetSC), have also been developed to extend the original functionality of the language. SystemC is very well suited for building executable hardware and software models, but it does not support the mapping of functional models onto hardware platforms.

Esterel

Esterel is an imperative, textual and concurrent programming language [43]. Its development was started in 1980 by a team from Ecole des Mines de Paris and INRIA led by Gerard Berry. It is based on a synchronous model of time, therefore program execution is synchronized to an external clock. The representation of time is discrete, thus time is divided into discrete ticks.
The communication among Esterel programs is performed by means of broadcast signals, where a signal can be either present or absent in a particular

time tick. The computations are performed conceptually in zero time and are considered to be atomic. Esterel is widely used for the development of complex reactive systems and is very well suited for control-dominated designs. This language is still under development and its IEEE standardization is currently under way.

Lustre

Lustre is a formally defined, declarative, and synchronous dataflow programming language developed since 1984 at IMAG [36]. A typical Lustre program includes a list of modules (FSM nodes) that work at the same speed and are therefore synchronized. The communication among nodes is performed by means of inputs and outputs. There are no broadcast signals, in order to avoid side effects. All variables, constants and expressions are represented as streams that can be composed to form new streams. Each stream has a corresponding clock that in turn is a stream of boolean type. Lustre introduces its own data types (e.g., tuple) and also allows user-defined types. It also offers recursion, but the number of recursive calls needs to be known at compile time. In 1993, it became the core language of the industrial environment SCADE, developed by Esterel Technologies. It is now used for critical control software in aircraft, helicopters, and nuclear power plants.

2.6 System-level Design Frameworks

This section presents three system-level design frameworks: Polis, Metropolis and Metro II. We discuss Polis because it was the first framework that realized the idea of separating functionality and architecture. Metropolis was the first design framework to support the PBD methodology, and became the precursor of the Metro II heterogeneous design framework. There are other frameworks like Behavior-Interactive-Priority (BIP) [26], SML-Sys [65] and several others. None of these development tools allows fully automatic design of heterogeneous embedded systems.
Some of the presented tools are focused mainly on embedded software design, like GME and Ptolemy.

Polis

Polis [18] is a design framework based on a single model of computation, Co-design Finite State Machines (CFSM) (described in 2.3.6), that captures locally synchronous, globally asynchronous designs typical of the automotive design space, the application domain it targets. Polis represents the interconnection among the FSMs graphically and uses Esterel (described in 2.5.2) to describe the behavior of each FSM. The description of different hardware and software components with separate FSMs adds flexibility to the architecture selection by allowing one to easily choose and change processing elements. The performance evaluation of the selected platforms can be performed by simulating the behavior of the chosen architecture using the Ptolemy simulation environment (described in 9.5.2). The first extension of the Polis framework was a commercial tool developed by Cadence called VCC [62], based on the same modeling paradigm, where the architectural modeling, called architectural services, and the simulation environment were taken to the next stage.

Metropolis

Metropolis is an environment developed at the University of California, Berkeley to support the Platform-Based Design (PBD) methodology [20, 22]. This framework was the first system to leverage the concept of a semantic metamodel (abstract semantics) to manage the integration of heterogeneous components, to allow declarative and operational design entry, and to manage architectures and functionality in a unified way. The concept of separation of concerns, the mapping of functionality to architectural components as a way of refining a design from specification to implementation, and communication as a first-class citizen were all instrumental in building the framework. These concepts originated from years of research and development.
The roots of Metropolis can be found in Polis [18] (described in 2.6.1), which was the first framework to be based on the separation of functionality and architecture. In parallel, several research projects addressed similar issues.

Metropolis consists of an infrastructure, a set of dedicated tools, and design methodologies for various application domains. The infrastructure provides a general mechanism to represent heterogeneous components of a system uniformly. Metropolis is based on the Metropolis Meta-Model (MMM) [84], a language with associated formal semantics for the internal representation of design behavior and constraints. The design of the system can be described directly using MMM, but this requires code rewriting for each particular system, which increases design cost. To allow design re-use, Metropolis provides platforms consisting of a set of components that can be easily picked by the user. This framework provides two types of platforms: models of computation for the description of functional models, and architecture platforms for the representation of the architecture and hardware models. The design process consists of mapping a functional model onto an architecture model, evaluating the combined performance, and finally refining the mapped functionality to create a final implementation of the system. Basically, mapping is a process of correlating the functional execution and architectural actions using constraints over time and energy quantities described by temporal or propositional logic. Functional and architectural models and mapping are described using MMM. The core Metropolis infrastructure is complemented by a set of back-end tools. One of the most used back-end tools available in Metropolis is a SystemC simulator. SystemC preserves the meta-model semantics and is used for design synthesis. For formal verification of system properties described with Linear Temporal Logic (LTL), one can use the back-end tool SPIN [80]. For other application domains, one can extend Metropolis by adding appropriate back-end tools and platforms suitable for the required analysis objectives.
The goal of the Metropolis framework is not to provide algorithms and tools for all possible design domains, but to store the design information and allow design re-use.

Metro II

Metro II [39] is inspired by Metropolis, since it is also based on abstract semantics and implements the PBD design methodology. However, it goes a step further by allowing designers to import designs that are developed using tools foreign to Metro II. In addition, the mapping process is greatly simplified using an event-oriented mechanism inspired by


Rapide [59], an Executable Architecture Definition Language (EADL). Concurrency, synchronization, timing, and causality are all represented explicitly in Metro II. Further, the execution semantics is purified with respect to that of Metropolis by cleanly separating the functional execution of the components, their quantity annotations (e.g., time and power), and the execution of the component assembly that satisfies a set of implicit and explicit constraints. This modeling style allows the integration of heterogeneous models of computation and allows for both imperative and declarative specification. A component is the basic building block in Metro II. Components communicate through ports with compatible interfaces. Two events are associated with each method call on a port: a begin event and an end event. Imperative code within the component may control how events are executed, but separate declarative constraints over events can be used to influence execution as well. A key feature of Metro II is the ability to specify the functional and architectural models separately. The two are then mapped together to produce a system model with performance metrics. Mapping is realized by adding declarative constraints between events from the functional model and events from the architectural model.

Figure 2.1: Metro II Three Phase Execution Semantics

Metro II has a three-phase execution semantics. Each process in Metro II has two

states: running or suspended. Processes execute concurrently until an event is proposed on a required (output) port or until they are blocked on a provided (input) port. Once all processes are suspended, the simulation switches to the second phase of execution. In this phase, events are annotated by annotators, which represent the metrics of interest within the model. In this way, events and the methods they correspond to can be associated with a cost. In the third and final phase, events are enabled according to schedulers and constraint solvers. These enabled events then become inactive again while simultaneously allowing their associated processes to return to the running state. A collection of three completed phases is referred to as a round. Figure 2.1 illustrates the process states and the three phases in the execution semantics. Self loops on the inactive and annotated states illustrate that multiple rounds may pass without an update to a particular event's state. One way to describe a functional model in Metro II is as a process network or actor-oriented model, where concurrently executing processes communicate with each other through point-to-point channels. Metro II allows this communication to be specified either declaratively or imperatively. In the imperative model, the writer and reader query the FIFO via method calls, exposing more events to the framework and leading to more phase changes. In the declarative model, fewer phase changes take place, but the burden is instead placed on the constraint solver in phase 3. Apart from stream-based applications, it is also possible to explore other types of applications, for example control applications.

2.7 Data Processing Systems

This section presents several examples of data processing systems. A data processing system is a system which converts data received or stored by another unit of the system from one format to another.
It is essentially a chain of transformations applied to the data entry, such as encoding, decoding, formatting, or translation, before the information is output to a further step in the information processing system. The format of the data entry should be recognizable by the data processing system. All telecommunication protocols can be viewed as examples of data processing systems. In this section we are going

to present a telecommunication protocol called Universal Mobile Telecommunications System (UMTS). UMTS is one of the third-generation (3G) telecommunication standards, which is also going to be implemented in the fourth generation (4G). An example of a system that combines different telecommunication protocols and is, in addition, one of the most complicated data processing systems is a Software Defined Radio (SDR) [71, 70]. SDR is a telecommunication system, reconfigurable on-the-fly, that is able to switch between different communication standards. This is achieved by implementing the protocol in software instead of hardware in order to increase system flexibility.

SDR

SDRs have recently received a lot of attention from the research community, as well as from the industrial community, because of the flexibility they offer in an environment dominated by multiple standards. SDR is a radio communication system that is able to work with signals over a large frequency spectrum and with different types of modulation, and is therefore able to support multiple telecommunication protocols. This is accomplished by implementing SDR on a single hardware platform, and by switching between different communication protocols (reconfiguring on-the-fly) by changing the software running on this platform [29]. In a pure SDR, all the required processing, i.e., tuning, modulation and demodulation, and the handling of the higher levels of the protocols, is done in the digital domain by software running on a general purpose processor, while a slim RF front-end remains in the analog domain. Because the task is highly demanding in computation power, most platforms include dedicated hardware components that can be configured and rewired by software to adapt to the required functionality.
These very flexible architectures should allow one to switch between different communication standards by simply uploading a new version of the software in the device, and potentially support several protocols at the same time. SDR is characterized by a Programmable Digital Access (PDA). The PDA is the point in the SDR signal processing chain where analog to digital conversion occurs. Depending on the position of the PDA point, the radio communication system can be classified as

follows: non-programmable (totally analog or fixed function digital radio), baseband programmable, IF programmable, and RF programmable. Baseband programmability defines a digital radio (Programmable Digital Radio (PDR)). PDR is different from SDR because the filtering that precedes the baseband processing is implemented in hardware. As a result, the software is unable to adapt to changes in the RF structure of the physical layer to support, for example, a new standard protocol. The major advantages of programmability are flexibility and ease of adaptation, since the radio function can easily be changed and adapted to changing standards. Programmability also promises economy of scale for manufacturers, who can rely on common platforms reused across different application domains. However, the requirements in terms of performance, latency and cost make the design of these architectures hard. In an ideal scenario, a Software Defined Radio achieves RF programmability, where the signal is converted to digital form soon after the RF section. The software implementation of the RF section may however be made difficult by the high frequencies involved in several of the communication protocols. For example, for the UMTS protocol operating at about 2 GHz, the sampling frequency should in fact be about 4 GHz. Even when conversion is possible, the high data rate may be hard to handle in software by a general purpose processor. A typical SDR therefore usually employs Intermediate Frequency Conversion in order to lower the frequency of the signal. In practice, today, even IF programmability is hard to obtain without resorting to dedicated digital hardware, which makes the design exploration phase essential. While practical implementations are possible for baseband programmable architectures, their optimization remains hard without an efficient methodology to evaluate possibly very different design alternatives.
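The sampling-frequency figure quoted above for UMTS follows from the Nyquist criterion, which requires sampling at no less than twice the highest frequency present in the signal. The short Python sketch below reproduces that arithmetic and estimates the resulting raw data rate. The function names are hypothetical, and the 12-bit sample width is an assumed value chosen purely for illustration; it does not come from the text.

```python
def min_sampling_rate_hz(max_signal_freq_hz: float) -> float:
    """Nyquist criterion: sample at no less than twice the highest frequency."""
    return 2.0 * max_signal_freq_hz


def raw_data_rate_bps(sampling_rate_hz: float, bits_per_sample: int) -> float:
    """Raw bit rate produced by an A/D converter, before any decimation."""
    return sampling_rate_hz * bits_per_sample


# Carrier at about 2 GHz, as in the UMTS example from the text.
carrier_hz = 2e9
fs_hz = min_sampling_rate_hz(carrier_hz)       # 4e9 Hz, i.e. 4 GHz

# ASSUMPTION: 12 bits per sample is an illustrative value only.
rate_bps = raw_data_rate_bps(fs_hz, bits_per_sample=12)
rate_gbytes_per_s = rate_bps / 8 / 1e9         # 6.0 GB/s under this assumption
```

Under these assumptions, direct RF sampling would deliver several gigabytes of samples per second, which suggests why the signal is usually converted to an intermediate frequency before digitization.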
Software Defined Radio is a very important development for the wireless communication industry. The flexibility afforded by SDR can in fact solve the problem of heterogeneity of link-layer protocol standards across different network generations, and allow global roaming (e.g., USA, Europe) and the integration of network services. SDR can also be a solution to increase frequency spectrum utilization by implementing the idea of cognitive radio [34], which is able to adapt in order to use the most efficient communication protocol and band compatible with the current environment.

UMTS Protocol

UMTS [10] is a 3G mobile standard that adopts the Wideband Code Division Multiple Access (WCDMA) air interface based on the Direct-Sequence Spread Spectrum (DSSS) technology. UMTS uses the infrastructure of the 2G standard. In comparison to the 2G standards, UMTS is characterized by higher speeds, supports more users and uses lower power for the transmitted signals. The 3G standard was designed to support multimedia communication, and thus transmissions with higher data rates, and WCDMA became the most widely adopted third-generation air interface. The specification of WCDMA has been determined by the 3rd Generation Partnership Project (3GPP), which includes standardization groups from Europe, Japan, Korea, the USA and China, and within which WCDMA is also called UMTS Terrestrial Radio Access (UTRA). UTRA or WCDMA includes two different modes of bidirectional data transmission: Frequency Division Duplex (FDD) and Time Division Duplex (TDD); therefore there exist two separate specifications, for UTRA FDD and UTRA TDD. We focus on UTRA FDD. The work on the 3G standard specification started in 1992 at the World Administrative Radio Conference (WARC) of the International Telecommunications Union (ITU). At that time the frequencies around 2 GHz had not yet been allocated in most parts of the world, such as Europe and Asia, including Japan and Korea, members of 3GPP; therefore they were assigned to the 3G standards. In particular, in Europe and in most of Asia the 3G standard is using the MHz frequencies for uplink and MHz for downlink. In the US this spectrum had already been occupied by 2G operators, therefore the third-generation standard had to be deployed in the existing band. At the top level, the network architecture is divided into a User Equipment Domain and an Infrastructure Domain, which communicate through the radio interface.
We focus on the User Equipment Domain, which is of greater interest to mobile devices which are subject to more stringent implementation constraints. The protocol stack of UMTS for the User Equipment Domain has been standardized by the 3GPP up to the Network layer, including the Physical (PHY) [13] and Data Link (DLL) [8, 12] layers. The protocol 26

40 2.7. DATA PROCESSING SYSTEMS stack for the user domain is shown on Figure 2.2. It is divided into three different layers, corresponding to the Physical (PHY), the Data Link (DLL) and the Network Layer [11]. In addition to the usual addressing functions, the Network Layer controls the operations of the Data Link and of the Physical Layer by responding to changes in the transmission parameters. The Data Link Layer performs general packet forming and quality of service support. It includes a Radio Link Control (RLC) [8] and a Medium Access Control (MAC) [12] sublayer. The RLC communicates with the MAC through different logical channels, to distinguish between user data, signaling and control data. The RLC can operate in three different modes: acknowledged mode, unacknowledged mode and transparent mode. The acknowledged mode provides reliable data transfer over the error-prone radio interface by retransmitting RLC packets. In the unacknowledged mode, the data packets are not retransmitted in order to avoid additional delay. The functionality of transparent mode is similar to unacknowledged mode but data packets are sent without any additional protocol information as broadcast packets. In turn, the MAC layer is composed of several MAC entities among which MAC-d acts as a switch for other entities. Depending on the required quality of service, the MAC layer, its MAC-d entity, maps the logical channels into a set of transport channels, which are then passed onto the Physical Layer. In addition to the switch function, MAC-d performs the functionality of the MAC layer used for user data transmission. Finally, the Physical Layer handles lower level coding and modulation, and communicates with the radio interface through a series of physical channels, each optimized to different time and coding requirements. The architecture of the protocol stack is very complex due to the high number of different logical and transport channels, as presented on Figure 2.3. 
In this work we focus on a subset of the functionality, marked on Figure 2.3 with a dashed square and emphasized on Figure 2.2 by thick lines. It is a data path used for user data transmission also called the Dedicated Channel (DCH). DCH is a bidirectional channel used for pointto-point communication. 27

2.8 Existing SDR Platforms

The requirements of SDR as well as 3G systems are pushing the signal processing capabilities higher and higher. We surveyed several high-performance architectures specifically intended for SDR development, proposed by academia [24, 83] and industry [45, 85]. One of the most promising hardware platforms in terms of performance is the Small Form Factor SDR (SFF SDR) Development Platform from Lyrtech [60]. In our work it was the baseline model that we used as a starting point for the design space exploration. We started our design space exploration from the ARM family of processors to test the functionality of the UMTS DLL layer. Later on, we went further to study multi-core hardware platforms composed of many possible permutations of ARM, DSP, FPGA and Sparc processing elements to test the functionality of the DLL and PHY layers of UMTS.

Avispa-CH1

Avispa-CH1 [45] is a highly efficient reconfigurable inner-receiver processor designed by Philips Electronics. It is based on ULIW technology for high-performance and low-power design. The idea of ULIW [33] technology consists in the development of a multi-core SoC, thus allowing multiple data processing parallelism. Avispa-CH1 allows manufacturers to configure a communication system according to their requirements by simply downloading proprietary application code onto a pre-fabricated part. It supports various wireless applications, among which are OFDM wireless applications including digital TV (e.g., DVB-H/T, T-DMB, ISDB-T), digital radio (e.g., DAB, DRM) and wireless data communication technologies, such as Wi-Fi and WiMAX.

SDRXPP

PACT XPP Technologies, Inc. is a company that develops extreme performance processor solutions. XPP technology was designed to support maximum flexibility and performance. In April 2003 PACT announced a new platform for Software Defined Radio development called SDRXPP [85].
This platform is composed of one ARM1136EJ microcontroller, peripherals and a reconfigurable XPP processor with integrated RAMs and high-speed interfaces. The XPP-core and some additional hardware components, such as a Viterbi coder/decoder, are intended to implement the baseband processing part of the telecommunication technology, while the ARM11 is dedicated to channel estimation and higher protocol layers. Thanks to the reconfigurable XPP-core, SDRXPP enables devices that use different transmission standards (e.g., W-CDMA and UMTS) with a single unified hardware.

SFF SDR

The SFF SDR platform consists of three separate modules, the Radio Frequency (RF), Data Conversion and Baseband Processing modules, combined together, and is shown in Figure 2.4. Each of these modules can be replaced in order to satisfy the requirements of the target product. The Baseband Processing module works in the digital domain, and is the main focus of our studies. This module employs a TMS320DM6446 system-on-chip (SoC) from Texas Instruments and a Virtex-4 SX35 FPGA from Xilinx for modulation. The SoC consists of one C64x+ DSP and one ARM926EJ-S general-purpose processor. The DSP is typically used for processing the baseband, while the general-purpose processor is reserved for the upper layers of the communication protocols and other higher level applications.

2.9 Processing Elements

ARM7

The development of ARM processors started in 1990, under the ownership of Acorn, Apple and VLSI. Originally, ARM was an acronym for Acorn RISC Machine; now it stands for Advanced RISC Machines. ARM [77] is a high-performance and low-power microprocessor with a 32-bit RISC architecture. Due to its power saving features and small die area it is widely used in the design of portable devices, for example, mobile phones, pagers, media players, and PDAs. Initially, ARMs had a fixed instruction width of 32 bits to simplify instruction decoding and pipelining, at the cost of decreased code density.
The ARM7TDMI processor was the first processor in its family to introduce the Thumb mode with 16-bit instructions, which helped to increase the code density. ARM7TDMI became one of the most successful ARM designs. Typically, ARM7 runs at a 60 MHz clock frequency and consumes only 1.5 mW/MHz on a 0.35 µm process. Its core has a 3-stage pipeline that includes fetch, decode, and execute stages. ARM7 uses a Von Neumann memory architecture, where instruction fetches and data accesses occur successively (in contrast to a Harvard architecture, where they happen in parallel). Therefore its basic load and store operations, which account for about 25% of all instructions, take 3 and 2 cycles respectively.

ARM9

The ARM9 was built on the basis of the ARM7 architecture with several enhancements. Firstly, ARM9 adopts the Harvard memory architecture, which allows instruction fetches to occur in parallel with data accesses. It therefore became possible to increase the pipeline depth from 3 to 5 stages (fetch, decode, execute, memory access and write-back). The increase of the pipeline depth allowed the device to be clocked at a higher frequency. In addition, forwarding paths were introduced in the pipeline. This helped reduce the number of interlock cases and the average number of clocks per instruction. All this led to a large performance improvement in comparison to ARM7. The reorganization of the pipeline stages and the separation of the arithmetic and logic units of the Arithmetic Logic Unit (ALU) reduced the power consumption of ARM9 in comparison to ARM7.

MicroBlaze

Together with the Embedded Development Kit (EDK), Xilinx provides a library of software elements, which can be automatically implemented on an FPGA platform. These libraries include a soft processor, called MicroBlaze, along with a universal set of peripheral IPs. The MicroBlaze is a flexible 32-bit soft processor core with a Harvard RISC architecture.
It consists of 32 general-purpose registers, an ALU, a shift unit, and two levels of interrupts. The MicroBlaze architecture can be chosen and reconfigured by the user. For example, the cache size, pipeline depth (3- or 5-stage), embedded peripherals (e.g., floating-point unit, caches, exception handling, and debug logic), memory management unit, and bus interfaces can be customized. It is also possible to add either user-defined co-processors or a general-purpose processor to the MicroBlaze. The communication between a user-defined co-processor and the MicroBlaze is carried out by a FIFO-style connection called FSL (Fast Simplex Link). For communication with a general-purpose processor (GPP), the Processor Local Bus (PLB) and On-Chip Peripheral Bus (OPB) standards are used.

Sparc

Sparc, which is an abbreviation of Scalable Processor Architecture, is a RISC microprocessor instruction set architecture (ISA) with 32-bit integer and 32-, 64-, and 128-bit floating-point data types and 72 basic 32-bit wide instruction operations. It was originally designed in 1985 by Sun Microsystems, and was heavily influenced by the RISC I & II from the University of California, Berkeley. The main goals of the SPARC design were the optimization of compilers (including as few features or op-codes as possible) and the simplification of pipelined hardware implementations. The Sparc processor is composed of an integer unit (IU), a floating-point unit (FPU), and an optional coprocessor (CP). Each of these units includes its own set of registers. The unit registers are kept separate in order to allow the maximum concurrency between integer, floating-point, and coprocessor instruction execution. The Sparc processor also introduces two operational modes: user and supervisor. While functioning as a supervisor, the processor can execute any instruction, including the privileged (supervisor-only) instructions. In user mode, the execution of privileged instructions is not permitted and will cause a trap to supervisor software. The "scalable" in Sparc comes from the fact that the Sparc specification allows implementations to scale from embedded processors up through large processors.
The SPARC IU may contain from 40 to 520 general-purpose 32-bit registers. Eight of these are global registers, and the rest form a circular stack of 2 to 32 sets of 16 registers each, which constitutes the register window. The total number of registers is implementation-dependent. The FPU contains 32 32-bit floating-point registers, which can hold a maximum of either 32 single-precision, 16 double-precision, or 8 quad-precision values. Sparc allows the implementation of only one CP, which includes an implementation-dependent number of 32-bit registers.
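The register counts quoted above follow from a simple relation: 8 global registers plus 16 registers per register set. A minimal C++ sketch of this arithmetic is given here. The function name `sparc_iu_registers` and the range check are illustrative assumptions and are not taken from the Sparc specification.

```cpp
#include <stdexcept>

// Hypothetical helper: number of general-purpose registers in the integer
// unit for a given number of register sets (2 to 32, as stated in the text).
int sparc_iu_registers(int register_sets) {
    const int kGlobalRegisters = 8;    // global registers
    const int kRegistersPerSet = 16;   // registers in each set of the circular stack
    if (register_sets < 2 || register_sets > 32) {
        throw std::invalid_argument("register_sets must be in [2, 32]");
    }
    return kGlobalRegisters + kRegistersPerSet * register_sets;
}
```

With 2 sets this gives 40 registers and with 32 sets it gives 520, which matches the limits stated above.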

Figure 2.2: Layer diagram for the User Equipment Domain of the UMTS protocol

Figure 2.3: UMTS Dedicated Transport Channel

Figure 2.4: Block diagram of the SFF SDR

Chapter 3 The Problem

Electronic embedded systems are widely used for different purposes in our daily life, such as communication, automation, measurements, security, and health. The combination of those systems opens new horizons in embedded system design. The construction of a distributed infrastructure composed of nodes and processing elements that must interact with the environment and users, and communicate among themselves to exchange data, opens the possibility to provide new services and realize innovative applications. The convergence of different applications and the proliferation of communication standards may separate users whose devices support different classes of applications or communication standards, making them unable to communicate with each other. Therefore, the idea of implementing architectures that include multifunctional devices able to support a number of different technologies is emerging. This results in system heterogeneity and increasing design complexity.

The design of electronic embedded systems will always be subject to constraints on cost, physical size, performance, energy consumption, the availability of the system components, and real-time requirements. The definition of an optimal architecture within these constraints requires an extensive design space exploration in order to find the tradeoff among the metrics of interest. All this, together with time-to-market constraints, introduces a higher degree of sophistication in embedded systems design.

3.1 Problem 1: Performance - Multi-Processor Based System Design

The implementation of more and more complex electronic systems to support innovative applications exacerbates the architecture performance and real-time system requirements, which can no longer be supported by single-processor based architectures. Therefore, in order to support higher performance and better real-time characteristics, the emerging computing platforms are increasingly becoming multiprocessor based. Multi-processor based platforms offer flexibility and a wide range of alternative design solutions that are able to optimally trade off the design metrics of interest. This trend is also apparent in the design of modern microprocessors, where the use of multi-threaded cores is favored over faster clocks to speed up the software execution. The design of multi-core architectures, and of embedded systems in general, is made complex by a large design space and the difficulty of integrating heterogeneous components.

3.2 Problem 2: Interprocessor Communication

In contrast to architectures based on a single processor, communication components (e.g., buses, shared or distributed memories, etc.) play an essential role in determining the performance of multiprocessor systems. The latency of inter-processor communication may either tie up the processor resources or cause the processor to wait, which can drastically affect the overall system performance. Moreover, the choice of the arbitration policy for inter-processor interconnects plays an essential role in the latency introduced into data communication. Therefore, when modeling a multiprocessor-based system it is very important to model not only the processing elements but also the inter-processor communication elements.

3.3 Problem 3: Time-to-market Pressure vs. Correctness

Existing design tools for the automatic mapping and synthesis of optimized platforms are based on Register Transfer Level (RTL).
They are too detailed for an effective exploration of system design alternatives, and are typically biased towards specific implementation styles. These tools are faced with extreme complexity, due to the size of the solution space. In addition, the lack of abstraction makes the design, as well as the validation process, difficult, if not impossible. Managing model heterogeneity at this level is also problematic. However, the growing complexity of electronic systems design and time-to-market pressure should not affect the correctness of new electronic systems. The alternative is manual architecture selection, coupled with fast performance simulation that computes metrics with quick turnaround time. The accuracy of the result must nevertheless be sufficient to make the right design choice. Early attempts by the industry to introduce such technology [54] have not been successful in the market due to a variety of reasons, including the lack of appropriate performance models and the use of proprietary languages.

3.4 Problem 4: Heterogeneity of the Design Chain Components

Companies that develop new products or components face great difficulties when trying to compose hardware and software components that come from different suppliers who use diverse design models. Thus, there is a need for standardization of the hardware and software domains, which can allow plug-and-play of subsystems [2]. As a result, the management of the design chain [63] becomes of primary importance, especially for system integrators, in order to assemble embedded systems composed of hardware and software components, with a guarantee of correctness and optimal resource utilization within time-to-market and cost constraints. While some effort is being expended in this direction in the industry [2, 7], a structured methodology is required to simplify architecture space exploration and boost design flexibility and reuse.
3.5 Solution

To realize a suitable design relationship between all components, the level of abstraction at which design is carried out must be raised. Working at a higher level of abstraction, complexity can be managed by reusing pre-verified components. The validation process therefore shifts from the verification of the individual components to the verification of their composition and their interconnection. New System Level Design (SLD) methodologies have been developed to address heterogeneity and complexity, simplify multi-processor based architecture design, and reduce time to market without affecting the correctness of the newly designed system. One example is the platform-based design (PBD) methodology [75, 35], which is based on the construction of different layers, called platforms, which represent different levels of the design abstraction. Each platform is a well separated library of computational and communication components, where platforms at higher levels abstract the details of lower level platforms, and can be used for fast performance estimation. This is essential for quickly converging toward a platform that is not only optimized for the desired functionality, but can also support its future extensions.

3.6 Data Processing System Design

We have focused our research on the development of a system-level design methodology for architecture exploration and performance analysis of data processing systems (see Section 2.7). As the object of our study we have chosen a very complicated example of a data processing system that can combine all the telecommunication protocols, namely Software Defined Radio (SDR) [71]. SDR is a telecommunication technology in which both modulation and demodulation are performed in software or using a programmable device [70]. The major advantages of an SDR are its flexibility and ease of adaptation, since the radio function can easily be changed. Programmability also promises economy of scale for manufacturers, who can rely on standard platforms reused across different domains of applications.
However, the development of such a platform is not an easy task due to very high performance requirements. A typical architecture for an SDR platform is shown in Figure 2.4. Designing reusable optimized platforms is complex. The designer must carefully consider several different scenarios in the choice of the architectural components, and in the way they are connected. Clearly, the arrangement of CPUs, DSPs and FPGAs has a significant impact on performance, and choosing the optimal mapping is hard. Shono et al. report that the arrangement of CPUs, DSPs and FPGAs seriously influences system performance, and that the software assignment to each processor is difficult [78].

3.7 Conclusion

This work presents a system-level design methodology for architecture exploration and performance analysis of data processing systems. This methodology applies the principles of the PBD methodology. The main objective of the design exploration process for each particular architecture is to evaluate performance in terms of metrics such as latency, throughput and resource utilization. To facilitate this process, it is important to separate the specification of the architecture from that of the functionality. This way, changes in one will not cause the redesign of the entire system.

Chapter 4 The proposed approach

This chapter introduces the methodology we use for system architecture analysis. The first section gives an overview of the methodology, within which we discuss the different models (e.g., functional, performance and timed) used in the design flow. In addition, we present the general algorithm of their interworking. The second section explains the details of the SystemC implementation of the framework. It also describes the performance parameters that can be obtained with this framework and the way they are calculated.

4.1 Methodology

To define an effective system design flow for SDR, we apply the PBD paradigm [75], by evaluating different architectures against the specification constraints, and by mapping the desired functionality onto the elements of the platform. The main objective of the design exploration process for each particular architecture is to define which protocols, and how many, can be supported by the platform. To facilitate this process, it is important to separate the architecture specification from the functionality. This way, changes in the functionality will not cause the redesign of the entire system, and vice versa. In addition, this allows us to model function and architecture at two different levels of abstraction, and enables fast annotated functional simulation to quickly provide performance metrics for a variety of design choices.


Our particular implementation of the PBD methodology follows the steps presented in Figure 4.1. We first build an abstract SystemC model of the functionality of telecommunication protocols, with no notion of time (Untimed level), used to verify correctness and to study concurrency issues. Then, we extract the source code of the actual functionality performed by this model and profile it for given architectures (Profiler level), using different methodologies, in order to obtain the execution time of each separate function. Finally, at the Timed level the untimed SystemC model is annotated by a scheduler at a functional level of granularity.

Figure 4.1: Design exploration methodology

SystemC was chosen over traditional sequential programming because it is a component model which natively supports concurrency, a computation paradigm that is more appropriate for today's reactive embedded systems. It supports different models of computation and allows the design of heterogeneous systems. In addition, it provides the possibility to refine high level specification models (both hardware and software) into low level implementations and to build executable models. Finally, because SystemC is based on a standard language (C++), it is easy to share models with the other members of the research group, such as developers of applications, UI and baseband.

Untimed Functional Model

Our functional specification follows a particular kind of process network model called dataflow (described in 2.3.7). Each module (dataflow actor) of our model implements a particular function (e.g., packet forming, coding/decoding, spreading/despreading, etc.) that needs to be performed on the data transmitted through the process network. The modules of the functional model have an identical structure (described in 4.2): their main functionality is implemented in pure C code and wrapped by the specific SystemC functions used to form the functional modules of the process network and to establish the connection and synchronization among them. Every module is attached to two FIFO queues with blocking read and blocking write, one for data input and the other for data output, connected to the next module. The activation of modules is subject to constraints: a module can be activated only if there are data available in the input FIFO and space available in the output FIFO. The functional chain consists of two parts, one of which performs the direct data processing while the other implements the inverse functionality in order to verify the correctness of the process network. The first module of the functional chain is a data generator that emulates packets coming from function blocks that are not yet implemented. The chain is terminated by a module that displays the result of the data processing, so that it is possible to perform a final verification of the correctness of the process network functionality.

Architectural Model

In order to explore different hardware architectures we exploited SystemC's ability to support different models of computation, by working at the timed level. Each architectural element is modeled as a resource manager that is responsible for granting access to the resource and for correctly accounting for timing (and, possibly, other performance metrics).
For computational resources such as CPUs, the resource manager takes the form of a scheduler that implements a certain scheduling policy. We have created a performance model that we use to explore hardware platforms composed of different architectural elements. In order to create a SystemC performance model for a particular element we need to know the performance of each functional block executed on this element. To do so, we need to extract the functional code (C in this case) from our functional model (SystemC). Then we profile the execution on a set of processor emulators, using two primary profiling flows.

The first flow is used to get information for general-purpose processors (in this case ARM9 and ARM7). This process involves the use of embedded profilers or instruction set simulators, in particular the Keil ARM Development Tool [51], as well as GNU compiler tools, virtual hardware, and a custom designed code annotator. Initially, the code is cross-compiled for the particular architecture target we are interested in. This executable is then fed to a virtual hardware simulator (in this case, SimpleScalar [32]). The simulation results, along with the original source code and the corresponding binaries, are fed to a code annotator, which produces annotated code detailing the actual running time of individual segments of the original code for the given architecture target. This flow is shown in the top half of Figure 4.2.

The second flow involves libraries of elements which can be implemented on a platform FPGA (such as Xilinx's Virtex series). From the library, a set of systems is automatically generated by creating various legal permutations of the library components. One permutation may consist of a single soft processor core connected to a bus, while another may have multiple processors and memory elements. Each of these individual systems is then run through the FPGA synthesis process. At this time, the code which should be run on these systems is partitioned and bound to processing elements (such as the MicroBlaze soft processor). At the end of synthesis, the required execution cycles can be obtained for the application, along with information about the cycle time from the physical design tools. Together, these can be used to calculate an overall execution time.
This process is described in much more detail in [40].

Figure 4.2: General Purpose and Programmable Processor Profiling Flows

Scheduler

Of particular interest is the model of a processor, which, as we discussed, takes the form of a scheduler which manages resource allocation and acts like the operating system in our SystemC model. A simple example of a system with only four processes and three FIFOs mapped onto a CPU is shown in Figure 4.3. When a process is mapped onto a processor it is connected to the scheduler by two channels. The first is a boolean Ready to Run signal (r_r), which triggers the scheduler and indicates that the process is ready to run. The second is a bidirectional communication channel, which is used to exchange control information between the process and the scheduler. The scheduler is also connected to a timer, used as a stop-watch, to be notified of the completion of a process execution.

Figure 4.3: Operating System in SystemC

The scheduler is modeled as a finite state machine which controls the execution of the system. The activation of each process is controlled by the typical firing conditions of process networks, i.e., the availability of data at the input queues, and the availability of space at the output queues. These conditions are notified to the processes every time data is written to or read from the attached FIFOs. When a firing condition is satisfied, the process triggers the scheduler through its dedicated r_r signal and then waits for permission to start computation, which will be granted by the scheduler when the processor is available and when no higher priority processes are ready to run. The process is run to completion, and will stop before the results are written to the output FIFO. The computation is done in logically zero time. Instead, the scheduler will again trigger the process to post its outputs at the correct time, which will account not only for the process execution latency, but also for the time spent in running higher priority processes that had become active and preempted its execution. This way, following the ROM methodology [76] (described in Section 2.2.3), a process is never physically suspended as a result of preemption, thus reducing the overhead due to context switches. Instead, the scheduler verifies whether any preemption has occurred and, if so, updates the completion time by delaying it into the future by the appropriate amount.

An example of a fixed priority preemptive scheduler is shown in Algorithm 1. The scheduler manages a priority list whose items are process descriptors with the following structure: a process identification ID; a boolean variable status which indicates whether the process is currently executing (true) or waiting to be scheduled or resumed (false); the time T_init at which the process initiates its computation (negative if not scheduled yet); and a variable τ indicating the time left for the process to finish its computation.

Algorithm 1 Scheduler( )
 1: if (new process new_P) then
 2:   if (current_P.priority < new_P.priority) then
 3:     current_P.τ -= current_time - current_P.T_init;
 4:   end if
 5:   Add_item(new_P);
 6: else if (timeout of current_P) then
 7:   notify current_P to post data;
 8:   list.pop( );
 9: end if
10: current_P = list.top( );
11: if (current_P.T_init < 0) then
12:   trigger current_P execution (notify event);
13: end if
14: current_P.T_init = current_time;
15: reset_timer(current_P, current_P.τ);

The procedure may be triggered either by a new process entering the enabled state, or by a timeout from the timer that signals that a process has terminated execution and needs to post its data. In the case of a new process, the algorithm first compares its priority with that of the currently running process. If the new process has higher priority, then we update (decrease) the time to completion τ of the current process by the time for which it has executed, which is the difference between the current time and the time at which it was last given the resource (line 3). In all cases, the new process is added to the list of processes (line 5). If a timeout occurs, it signals that the current process has reached the end of its computation. It is therefore granted permission to post its data to the output (line 7) and removed from the list (line 8).

The CPU is then given to the process at the top of the list (line 10). If the process starts execution for the first time, then its body is actually invoked (line 12). Time does not advance during its execution, since all timing is accounted for by the scheduler. To do so, the process descriptor is updated to record the starting time (line 14), and the timer is reset with the remaining time to completion for the process. Other scheduling policies, such as round robin or earliest deadline first (EDF), can be implemented. By using a standard API for the scheduler process, these can be exchanged quickly to evaluate their impact on the mapped functionality and on the overall performance. A higher level resource manager using the technique described in [76] can then be used to mediate the data transfer between processor cores over a bus or other communication channels.

Timed Functional Model

The combination of the scheduler and the performance model makes up our Timed level model, which we use to determine the performance of the overall system and to compute the utilization of the microcontroller. This scheme has several advantages over the use of the profiler alone, which by itself can provide the system performance. First, simulations run in evaluation mode are very time consuming and depend on the complexity of the microcontroller architecture. In contrast, SystemC is relatively fast and independent of the microcontroller architecture (see Section 5.2). Second, SystemC is more flexible and makes it easier to partition the functionality onto different processor cores and to combine their performance. This is essential as platforms evolve to include more processing elements. This trend also requires the exploration of different concurrency models, from dataflow to synchronized execution, which is natively supported by SystemC but typically not by architecture profilers.
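The bookkeeping performed by the fixed priority preemptive scheduler of Algorithm 1 can be illustrated without SystemC. The following is only a minimal sketch under stated assumptions: the names Proc, Scheduler, on_ready, on_timeout and next_timeout are hypothetical, a larger priority value is assumed to mean a higher priority, and event notification is replaced by recording the identifiers of the processes whose body would be triggered. Unlike the simplified pseudocode, the sketch re-dispatches only when the head of the list actually changes.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Process descriptor, mirroring the fields described in the text.
struct Proc {
    int id;
    int priority;   // assumption: larger value means higher priority
    double tau;     // time left to completion
    double t_init;  // time the CPU was last granted; negative if never started
};

class Scheduler {
public:
    std::vector<int> started;  // ids whose body was triggered (start event)

    // A process becomes ready at time `now` (lines 1-5 of Algorithm 1).
    void on_ready(int id, int priority, double exec_time, double now) {
        const bool was_empty = list_.empty();
        const bool preempts = !was_empty && list_.front().priority < priority;
        if (preempts) {
            Proc& cur = list_.front();
            cur.tau -= now - cur.t_init;  // charge the time already executed
        }
        // Keep the list sorted by descending priority (FIFO among equals).
        auto pos = std::find_if(list_.begin(), list_.end(),
            [priority](const Proc& q) { return q.priority < priority; });
        list_.insert(pos, Proc{id, priority, exec_time, -1.0});
        if (was_empty || preempts) dispatch(now);
    }

    // The timer expired: the running process posts its data and leaves
    // the list (lines 6-8). Returns the id of the finished process.
    int on_timeout(double now) {
        const int done = list_.front().id;
        list_.erase(list_.begin());
        if (!list_.empty()) dispatch(now);
        return done;
    }

    bool idle() const { return list_.empty(); }

    // Absolute time at which the timer of the running process expires.
    double next_timeout() const {
        return list_.front().t_init + list_.front().tau;
    }

private:
    // Give the CPU to the head of the list (lines 10-15).
    void dispatch(double now) {
        Proc& cur = list_.front();
        if (cur.t_init < 0) started.push_back(cur.id);  // first start only
        cur.t_init = now;
    }

    std::vector<Proc> list_;
};
```

In this sketch a process of priority 1 that needs 10 time units and is preempted at time 4 by a process needing 3 units completes at time 13, and its body is triggered only once.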
4.2 The Implementation

This section presents the actual implementation of our methodology. The implementation is partitioned into several files representing modules of the functional

and performance models and a file employed to store the system configuration settings. Each module of the functional model is described by one SystemC source file and one header file that have similar structures. In this section we describe the content of those files as well as the main principles of operation of the performance model. In order to simplify the description of the implementation, we single out a simple subsystem of our framework, as presented in Figures 4.4 and 4.5. This subsystem is composed of two parts: a module of the functional model (Module N) and a scheduler, which represents the processing element in our framework. First, we discuss the structure of the SystemC implementation of a functional module. Then we explain the operation of the processing element (scheduler) and the way it is synchronized with the functional model by introducing the FSM representation of the system. The interprocessor communication model and the way we calculate the important performance parameters are presented at the end of this section.

The Bones - Module n.h file

If we imagine our framework as a system that consists of bones, muscles and a motive element, which is the heart, we may associate the SystemC header file with the bones of our system, because this is where we describe its structure: it gives the list of functions performed by the module and the list of ports used to connect the different communication channels. The typical SystemC header file can be logically divided into two parts, as presented in Figure 4.4. The first part includes the definition of the class which carries the name of the module. Inside this class we define the list of all the ports used to connect the communication channels to this Module N.
Each functional module has three SystemC ports: the in and out ports are used to connect the input and output FIFOs, respectively, and the control port is used to connect the bidirectional synchronization channel between Module N and the scheduler it is attached to. The second part defines the functionality of the SystemC module. Module N includes two functions, triggering_function() and main(): main() contains the functionality (C code) of the module, while triggering_function() is used to resolve execution constraints and carries out the synchronization between the module and the scheduler. The main() function is also defined as an SC_THREAD, which allows us to use the SystemC wait() function for time annotation.

    #ifndef MODULE_N_INC
    #define MODULE_N_INC

    #include <systemc.h>
    #include "fifo.h"

    class MODULE_N : public sc_module {
    public:
        sc_port<read_write_if<packet_type> > in;
        sc_port<read_write_if<packet_type> > out;
        sc_port<scheduler_module_if> control;

        SC_HAS_PROCESS(MODULE_N);

        MODULE_N(sc_module_name name) : sc_module(name) {
            SC_THREAD(main);
        }

        void main();
        void triggering_function();
        ...
    };

    #endif

[Block diagram: Module_N (SC_THREAD) with its in and out ports attached to FIFOs and its control port attached through sched_module_channel to the Scheduler (FSM)]

Figure 4.4: ModuleN.h file structure

The Muscles - Module n.cc file

The SystemC source file includes the implementation of the functions defined in the header file. This part of the Module N implementation can be associated with the muscles of our framework because it sets the main rules for module execution. The typical structure of Module N is depicted in Figure 4.5. The body of the source file is the SC_THREAD main(), which starts with variable declarations followed by an infinite cycle (while(true){...}) containing the main functionality of the module. At the beginning of the infinite cycle the triggering_function()

is called to resolve the typical blocking read and blocking write constraints of the dataflow model, that is, the availability of data in the input FIFO and of space in the output FIFO, respectively. To do so, the while(...){...} cycle of triggering_function() checks the conditions every time a read or write action takes place in one of the FIFOs. Each of these actions notifies the FIFO read_write_e event. This process is repeated until both conditions are satisfied (!((in->n_el() == 0) || (out->n_el() == fifo_size))). Whenever the constraints are resolved, the ready-to-run (r_r) signal is sent to the scheduler through the control port (control->r_r()) and triggering_function() starts waiting for the start execution event (control->start_exec_e()). The scheduler adds the request to the list of tasks, taking into account the priority initially assigned to Module N, using Algorithm 1. When the task reaches the top (zero position) of the scheduler waiting list, the start_exec_e event is notified. Immediately after its notification, the start_exec_e event is captured by triggering_function(), which triggers the beginning of function execution. Module N takes data from the input FIFO, processes them within zero simulation time and starts waiting for permission to put the data into the output FIFO (control->data_output_e()). This permission arrives from the scheduler through the control port after a particular simulation time τ_N, where τ_N is determined by the execution time of Module N profiled for the chosen processing element (scheduler). Finally, the new data are posted to the output FIFO, which produces a new read_write_e event inside this FIFO. Immediately after its notification, two modules (Module N and its successor) check whether their conditions are satisfied.

    void MODULE_N::triggering_function() {
        while ((in->n_el() == 0) || (out->n_el() == fifo_size)) {
            wait(in->read_write_e | out->read_write_e);
        }
        control->r_r();          // emit
        control->start_exec_e(); // waiting
    }

    void MODULE_N::main() {
        ... variables declaration
        while (true) {
            triggering_function();
            packet_in->read(in);
            FUNCTION BODY ...
            control->data_output_e();
            packet_out->write(out);
        }
    }

[Block diagram: Module_N (SC_THREAD) with its in and out ports attached to FIFOs and its control port attached through sched_module_channel to the Scheduler (FSM)]

Figure 4.5: ModuleN.cc file structure

Figure 4.6: ModuleN.cc file FSM representation

The Finite State Machine (FSM) representation of Module N is depicted in Figure 4.6. Module N is represented as a three-state FSM. Each of these states is illustrated with a red circle in the figure. The blue arrows correspond to the transitions provoked by

events notified in the bidirectional synchronization channel between the module and the scheduler. The black arrow of the first state is a self-loop provoked by read_write_e events of the attached FIFOs. In this state the module waits for the typical firing conditions to be satisfied. The transition through the black arrow happens any time a read_write_e event is notified in one of the FIFOs. Whenever the conditions are satisfied, the module emits the r_r signal and transitions to the next state, where it starts waiting for the start_exec_e event. When this event is notified by the scheduler, the module transitions to the third state, where it reads the input data and runs the main functionality to the end within zero simulation time. Afterwards, it starts waiting for the data_output_e() event. Upon notification of this event, Module N puts the data into the output FIFO and returns to the initial state, where it continues checking whether the initial conditions are satisfied.

The Heart

The heart of our system is a preemptive scheduler. It is a one-state FSM, presented in Figure 4.7. The FSM includes seven self-loops, three of which are drawn with red arrows and the rest with green arrows. Transitions represented with red arrows are provoked by the signal coming from Module N, the r_r signal. Green arrows correspond to the self-loops provoked by events coming from the timer. The two-state FSM of the timer is depicted in the same figure and its states are marked in green. The choice among these transitions depends on three main parameters: preemption, status and List_empty. These parameters, together with a more detailed description of all the transitions, are presented later in this section. The scheduler not only schedules tasks but also works as the time annotator. All the data obtained from profiling the functions on a particular processing element are stored in the configuration file constants.h used by the scheduler. This configuration file includes not only information about the execution time but also allows the definition of the architecture, the specification of the mapping configuration and many other parameters described in detail in Appendix A.

[Figure 4.7 content]

Scheduler (one state), self-loops written as condition / action:
    r_r && !preemption && !Status / timer_task_in(τ) && start_exec_e && status = true
    r_r && !preemption && Status / Ø
    r_r && preemption / timer_task_in(τ') && start_exec_e && Bit_preemption = true
    timer_timeout(0) && List_empty / Output_data_e && status = false
    timer_timeout(0) && !List_empty && preemption(next) / Output_data_e && timer_task_in(τ)
    timer_timeout(0) && !List_empty && !preemption(next) / Output_data_e && (timer_task_in(τ(next)) && start_exec_e(next))
    timer_timeout(τ') / Ø

Timer (two states: initial state "wait for timer_task_in", and "wait"):
    timer_task_in(τ) / Time_elapsed_e.notify(τ)
    timer_task_in(τ') / timer_timeout(τ - Δt) && Time_elapsed_e.notify(τ')
    Time_elapsed_e / timer_timeout(0)

constants.h:
    #define STR912FW44
    #ifdef STR912FW44
    #define k1 443
    #define k2 382
    #define k3 ...
    #define time1 k1
    #define time2 (6 + k2*(packet_size_ini - 10))
    #define time3 (94 + k3*(packet_size_ini - 10))
    #endif

Figure 4.7: The FSM representation of the scheduler

The Synchronized FSM

The FSM representation of the SystemC framework is depicted in Figure 4.8. For simplicity we describe a system composed of one functional module (task), one scheduler (processing element) and one timer dedicated to this scheduler. Each of the three components of the synchronized FSM has its own color used to mark the states. Thus, the task is red, the one-state scheduler is blue and the states of the dedicated timer are green. The arrow colors depend on which FSM a transition is synchronized with. For example, the transitions from one state to another in the task model are marked in blue because they are synchronized with the scheduler model. The scheduler model in turn has some transitions marked with red arrows, due to the synchronization with the task model, and the rest marked in green, because those transitions are triggered by events coming from the timer.

[Figure 4.8 content]

Task (three states: "Check/Wait for conditions", "Wait for start_exec_e", "Execute & Wait for output_data_e"), transitions written as condition / action:
    FIFO read_write_e / Ø (self-loop on "Check/Wait for conditions")
    Conditions OK / r_r
    Start_exec_e / FIFO read_write_e
    Output_data_e / FIFO read_write_e

Scheduler transitions labeled in the figure:
    A: timer_timeout(0) && !List_empty && preemption(next) / Output_data_e && timer_task_in(τ(next))
    B: timer_timeout(0) && !List_empty && !preemption(next) / Output_data_e && (timer_task_in(τ(next)) && start_exec_e(next))

The remaining scheduler and timer transitions are those of Figure 4.7.

Figure 4.8: The FSM system representation

While transitioning from one state to another, each of the FSMs of our system emits a set of events used to synchronize the operation of the system. The set of output events can be empty, as for the black self-loop of the task. This representation is in some ways reminiscent of CFSMs. In addition to these events, our FSMs may also exchange some data, for example the execution time sent between the scheduler and the timer. Apart from the events and data associated with transitions, some particular actions are performed either right after the transition to a new state (e.g., check/wait for conditions, wait for the start_exec_e event, or execute and wait for output_data_e in the task model) or before the transition (e.g., the calculation of the data that should be communicated from the scheduler to the timer and vice versa). The FSMs representing the parts of our system are tightly synchronized among themselves, which further distinguishes our FSM system representation from CFSMs.
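The three-state task FSM described above can be written as a plain transition function. This is an illustrative sketch only, not the SystemC implementation: the type and function names (TaskState, TaskEvent, TaskStep, task_step) are hypothetical, and the firing conditions are passed in as a single boolean that is assumed to mean "input FIFO not empty and output FIFO not full".

```cpp
#include <cassert>

enum class TaskState { CheckConditions, WaitStart, ExecuteWaitOutput };
enum class TaskEvent { FifoReadWrite, StartExec, DataOutput };

// Result of one transition: the next state and the emitted outputs.
struct TaskStep {
    TaskState next;
    bool emits_rr;          // ready-to-run signal sent to the scheduler
    bool emits_fifo_event;  // a FIFO read/write event is produced
};

// One step of the task FSM. `can_fire` is only consulted in the first state.
inline TaskStep task_step(TaskState s, TaskEvent e, bool can_fire) {
    switch (s) {
    case TaskState::CheckConditions:
        // Re-check the firing conditions on every FIFO event.
        if (e == TaskEvent::FifoReadWrite && can_fire)
            return {TaskState::WaitStart, true, false};
        break;
    case TaskState::WaitStart:
        // Scheduler grants execution: read input, compute in zero time.
        if (e == TaskEvent::StartExec)
            return {TaskState::ExecuteWaitOutput, false, true};
        break;
    case TaskState::ExecuteWaitOutput:
        // Scheduler grants output: write to the output FIFO.
        if (e == TaskEvent::DataOutput)
            return {TaskState::CheckConditions, false, true};
        break;
    }
    return {s, false, false};  // self-loop with an empty output set
}
```

Events that are not expected in a given state leave the state unchanged, which corresponds to the self-loop with an empty output set.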
The timer is the part of the system that helps the scheduler carry out the annotation function. Its main function can be described as that of a stopwatch: counting down a given time. The timer is represented as a two-state FSM: the initial state and one wait

state. While in the initial state, the timer waits for a timer_task_in(τ) signal from the scheduler. With this signal an execution time τ is sent to the stopwatch. Whenever this signal is received, the transition to the next state occurs. While transitioning, the timer schedules the time_elapsed_e.notify(τ) event at time τ in the future and starts waiting for this event while remaining sensitive to the signals coming from the scheduler. Therefore, there are two possible transitions from this state: one when no preemption occurs (i.e., the timer is not interrupted by the scheduler) and one when a new signal comes from the scheduler with the execution time τ' of a new, higher priority process. If no preemption has occurred, the time_elapsed_e event is notified and the timer returns to the initial state through the green arrow and starts waiting for a new signal from the scheduler. While transitioning, it sends the timer_timeout(0) signal to the scheduler with parameter 0 (zero), which means that the time remaining for the currently running function to finish execution is 0, that is, the execution is finished. In case of preemption, the scheduler sends the timer a signal timer_task_in(τ') with the execution time τ' of the new process. The timer replies to the scheduler with the timer_timeout(τ - Δt) signal. The parameter of this signal represents the amount of time remaining for the interrupted process to finish its execution (τ - Δt), where Δt is the time spent by the timer in the wait state since the last transition and τ is the function execution time (profiled, or remaining after the last preemption) received from the scheduler. Then the timer schedules the new time_elapsed_e.notify(τ') event at time τ' in the future and starts waiting for this event.
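The stopwatch behavior of the timer under preemption can be sketched as follows. This is a simplified illustration in plain C++ rather than the SystemC implementation: the struct Timer and its members are hypothetical names, simulation time is passed in explicitly as `now`, and the reply to the scheduler is modeled as a return value (a negative value is assumed to mean that no task was interrupted).

```cpp
#include <cassert>

// Two-state stopwatch: idle (initial state) or waiting for expiry.
struct Timer {
    bool waiting = false;
    double started_at = 0.0;  // time of the last transition into "wait"
    double tau = 0.0;         // time being counted down

    // The scheduler hands over a new execution time. If a task was being
    // timed, return its remaining time (tau - elapsed); otherwise -1.
    double task_in(double new_tau, double now) {
        double remaining = -1.0;
        if (waiting) remaining = tau - (now - started_at);
        waiting = true;
        started_at = now;
        tau = new_tau;
        return remaining;
    }

    // Absolute time at which the elapsed event would fire.
    double expiry() const { return started_at + tau; }

    // The elapsed event fired without preemption: back to the initial state.
    void elapsed() { waiting = false; }
};
```

For example, a task with τ = 10 started at time 0 and preempted at time 4 by a task with τ' = 3 is reported to have 6 time units left; once the preempting task finishes at time 7, the scheduler hands the 6 units back and the original task completes at time 13.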
If no preemption occurs at this point, the timer sends a timer_timeout(0) signal to the scheduler, returns to the initial state and receives a new signal timer_task_in(τ) from the scheduler, where the new τ is equal to (τ - Δt), the time remaining for the previously preempted process to finish execution. At this point the task is resumed, provided that no further preemption occurs. The timer implementation allows the modeling of multiple as well as hierarchical preemptions. Here multiple preemption describes the situation in which one process is preempted by several other processes, as depicted in Figure 4.9. We call hierarchical preemption the situation in which a process that has just preempted a lower priority process is itself preempted by a process with a priority higher than its own (see Figure 4.10). The context switch time σ can be modeled by

adding its value to the time returned to the scheduler when preemption occurs (τ - Δt + σ), the time required by the preempted process to resume its calculation.

Figure 4.9: Multiple preemption (vertical axis: priority, from lower to higher; horizontal axis: simulation time T; the context switch time is marked)

As already mentioned, the scheduler is represented by a one-state FSM that includes seven self-loops, three of which correspond to transitions triggered by the task and the rest to transitions triggered by the timer. When receiving either a timer_timeout or an r_r signal, the scheduler checks some internal conditions defined by three values: preemption, status and List_empty. Preemption is an attribute of any process in the scheduler list. Its true value indicates that the process was previously interrupted and is now waiting to be rescheduled to resume execution. Status indicates whether the process is currently executing (true) or waiting to be scheduled (false). List_empty simply indicates whether there are tasks available in the scheduler list. In this model we also use the (next) indication, which means that the parameter concerns the next item of the scheduler list. All the actions taken by the scheduler under particular conditions are depicted in Figure 4.8. In particular, note that the start_exec_e event is sent to each task only once, when the process is scheduled for execution for the first time (i.e., transition B). In case it is scheduled to resume its execution (i.e., preemption

has the true value), the start_exec_e event is not sent to the next task in the list (i.e., transition A), because the functionality of the task has already been executed within zero simulation time and the task is only waiting to output the already processed data.

Figure 4.10: Hierarchical preemption (vertical axis: priority, from lower to higher; horizontal axis: simulation time T; the context switch time is marked)

The Interprocessor Communication

Very often the interprocessor communication becomes a bottleneck in the design of multiprocessor systems, when the system performance decreases due to the unavailability of data caused by slow communication. Therefore, the modeling of an MPSoC implies the modeling of interprocessor communication. In our framework we introduce a hierarchical scheduling where the same scheduler model is used to represent each of the processing elements as well as an interprocessor BUS arbiter. Figure 4.11 presents an example of two processors communicating through a shared memory. The data sent from one processor to another are stored in common FIFOs, one for each direction of data transmission, that are located in the shared memory. Each of the processors has one output and one input buffer. Those buffers are used to store either the data that need to be sent to the other processor or data that were

just received. Buffers act in the same way as the tasks of the functional model described earlier, where the tasks implement two types of functionality: write to or read from the memory. Buffers (tasks) are connected through the control channels to the arbiter, which is represented by the scheduler model. Only one of the processors can access the BUS and write to or read from the memory at a time. If a processor needs to write to the memory, the buffer out task sends the request to the arbiter. The arbiter adds this request to its scheduling list and manages the scheduling process in the same way as the scheduler of a processing element. The time required to write to or read from the memory is determined by the type of memory used and can be modified accordingly in the constants.h file.

Figure 4.11: The Hierarchical Scheduling (Processor 1 and Processor 2, each with a Buffer in and a Buffer out, connected through the Arbiter to two FIFOs in the Memory)

The dataflow model introduces a strong ordering relation in the task sequence. If two sequential tasks are mapped onto different processors, the usual communication between tasks by means of a FIFO, described in Section 4.1.3, is replaced by communication through the BUS using the input/output buffers of the processors. The task-producer outputs its data to the buffer out instead of the common task-to-task FIFO, and its successor, the task-consumer, reads the data from the buffer in. Our framework allows the analysis

of all possible permutations of mapping. Therefore all the tasks need to be virtually connected to the task-to-task FIFO, buffer in, and buffer out at the same time. The actual data flow is determined by the user when he or she defines the mapping. One example of a possible mapping is presented in Figure 4.12, where the DLL layer of the UMTS protocol is mapped to the first processor and the PHY layer is mapped to the second one. The dashed lines represent virtual connections of tasks with the processor buffers. In the DLL transmission part only the Tr_format_sel task is actually connected to the Buffer Out TX of the first processor, which is connected through the memory FIFO with the Buffer In TX of the second processor, which in turn provides the received data to the CRC_Attachment task. Similarly, the communication between the CRC_check and C_T_DEMUX tasks is organized in the receiving part. The read/write memory operations are then scheduled by the BUS arbiter.

Parameters Calculation

Using the SystemC framework we are able to calculate three of the most important parameters for system performance estimation: the latency, the throughput and the load of each processing element. We introduce two sets of these parameters. The first set is calculated statically, before any simulation is performed. The second one is calculated at run time and gives the mean values of the performance parameters. In order to calculate statically the latency of a packet transmitted to the air interface, we simply multiply the execution time of each function by the number of times it has to be executed for one data packet. Then we sum these numbers, taking into account the time required to send the data through the BUS in the case of multi-processor system modeling.
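The static latency computation just described amounts to a weighted sum. The sketch below is a minimal illustration with hypothetical names (StaticTask, static_latency) and made-up numbers; the execution times are assumed to be expressed in whatever time unit the profiler reports, and the BUS transfer time is passed as a single aggregate value.

```cpp
#include <cassert>
#include <vector>

// One function of the protocol, as seen by the static analysis.
struct StaticTask {
    double exec_time;     // profiled execution time of one invocation
    int runs_per_packet;  // how many times it runs for one data packet
};

// Static (ideal) latency of one packet: the sum of exec_time * runs over
// all functions, plus the time spent on the BUS (0 for a single processor).
inline double static_latency(const std::vector<StaticTask>& tasks,
                             double bus_time = 0.0) {
    double total = bus_time;
    for (const StaticTask& t : tasks) {
        total += t.exec_time * t.runs_per_packet;
    }
    return total;
}
```

With two hypothetical functions taking 443 and 100 time units, the second executed three times per packet, and 50 time units of BUS transfer, the static latency is 443 + 300 + 50 = 793 time units.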
It is possible to calculate the latency assuming that the system either only transmits data, summing only the times of the TX functions (TX latency), or only receives data, summing only the times of the RX functions (RX latency). When we assume that data transmission and reception happen in parallel, the TX latency should also include the value of the RX latency, because the RX functions are executed in between the TX functions and add their time to the TX packet processing, and vice versa. The static calculation of the transmission latency for cases when the modules of the TX and

RX parts have different priorities is, however, difficult and almost impossible, since it is hard to predict the execution order of the modules. The latency calculated statically should be considered an ideal value. The throughput value defines how many bits have been sent through the air interface within a second. The throughput can be calculated in two ways: as the overall throughput of the system and as the useful throughput, which can be referred to as the nominal and the actual throughput of the system, respectively. When we start the simulation and the first packet starts to be processed by the system, the simulation time is equal to zero. We calculate the throughput any time a new packet is sent through the air interface by dividing the number of bytes sent through the air interface by the current simulation time expressed in seconds. In the case of the useful throughput we use the number of user data bits transmitted to the receiver through the air interface. In turn, the throughput of the overall system is calculated taking into account all the bits that are sent through the air interface, including not only the user bits but also the control information carried in the packet headers. The static load of each processor is calculated with respect to a particular required throughput (bit rate) of the system. We sum all the execution times of the functions mapped to this particular processor, taking into account that some of the functions are executed a number of times, and use this time to calculate the overall throughput of this processor. Then, the ratio of the required bit rate (e.g., 12.2 kbps for speech) to the actual bit rate, multiplied by 100%, gives us the processor load. Obviously, if one wishes to support higher bit rates, the load value will grow.
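The static load computation can be illustrated with a few lines of C++. This is a sketch under assumptions: the function names (achievable_bit_rate, processor_load_percent) are hypothetical, and the packet size and busy time used in the example are invented numbers chosen only to make the arithmetic easy to follow.

```cpp
#include <cassert>
#include <cmath>

// Bit rate the processor can sustain if it needs `busy_time_s` seconds of
// computation (sum of all mapped function times) per packet.
inline double achievable_bit_rate(double bits_per_packet, double busy_time_s) {
    return bits_per_packet / busy_time_s;
}

// Processor load: required bit rate over achievable bit rate, in percent.
inline double processor_load_percent(double required_bps,
                                     double achievable_bps) {
    return 100.0 * required_bps / achievable_bps;
}
```

For instance, if the functions mapped to a processor need 5 ms in total to process a 244-bit packet, the processor can sustain 48.8 kbps, so a required rate of 12.2 kbps corresponds to a load of 25%.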
We use the static calculation for only one example of task execution order in the system, where the tasks are executed one after another for each data packet without being interrupted by other processes. Our goal is to analyze the system architecture not only by means of different mappings but also by studying different scheduling policies. Therefore, we introduced the runtime calculation of the performance parameters. The runtime calculations give us the mean values of the latency, throughput and load, and include the parameter calculation not only for the processors but also for the BUS (e.g., BUS load, BUS throughput). The performance parameters are calculated every time the next

packet is sent through the air interface or the BUS. The mean value of a system performance parameter is the sum of its values divided by their number. However, we allow the transmission of an unbounded number of packets, so it is impractical to sum all the values first and then divide them by their number. To solve this problem we recalculate the mean value upon the arrival of each new value using the following formula:

$$\langle n_{N+1} \rangle = \langle n_N \rangle \frac{N}{N+1} + n_{N+1} \frac{1}{N+1},$$

where $\langle n_{N+1} \rangle$ is the new mean value after the (N+1)-th packet, $\langle n_N \rangle$ is the previous mean value and $n_{N+1}$ is the newly calculated parameter.
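The incremental update of the mean can be sketched as a small accumulator that stores only the current mean and the number of samples seen so far. The class name RunningMean and its methods are hypothetical and serve only to illustrate the recurrence; they are not taken from the framework.

```cpp
#include <cassert>
#include <cmath>

// Keeps the mean of a stream of values without storing or summing them all.
class RunningMean {
public:
    // Update rule: mean_{N+1} = mean_N * N/(N+1) + x/(N+1).
    void add(double x) {
        const double n = static_cast<double>(count_);
        mean_ = mean_ * (n / (n + 1.0)) + x / (n + 1.0);
        ++count_;
    }

    double value() const { return mean_; }
    unsigned long count() const { return count_; }

private:
    double mean_ = 0.0;
    unsigned long count_ = 0;
};
```

Feeding the values 2, 4 and 6 yields the intermediate means 2, 3 and 4, the same results as recomputing the arithmetic mean from scratch after each value.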

Figure 4.12: The Hierarchical Scheduling - BUS Arbitration Model

[Blocks shown in the figure. DLL Tx: Tr_buffer, Segmentation, RLC_header_Add, Ciphering, TrCH_type_switch, C_T_MUX, Tr_format_sel, Buff Out Tx. DLL Rx: Buff In Rx, C_T_DEMUX, Rx_TrCH_type_switch, Deciphering, RLC_header_Rem, Reassembly. PHY Tx: Buff In Tx, CRC_Attachment, Tr_block_Concat, Ch_coding, Radio_frame_equal, Interleaving_1, Radio_frame_segm, Rate_matching, TrCH_MUX, PhyCh_segm, Interleaving_2, PhyCh_mapping, Spreading, Modulation. PHY Rx: Demodulation, Despreading, PhyCh_extraction, Deinterleaving_2, PhCh_concat, TrCH_DEMUX, Rem_padding_bits, Radio_frame_concat, Deinterleaving_1, Rem_equal_padding_bits, Ch_decoding, Tr_block_Segm, CRC_check, Buff Out Rx. The two processors are connected through two FIFOs and the BUS Scheduler; PHY Tx and PHY Rx are connected through the Air_interface.]

Chapter 5

Uniprocessor System Design and Results

In this chapter we present our first case study, where we explore architectures based on a single processor with the functionality of the higher layers of the UMTS protocol mapped onto it. Due to the complexity of the UMTS protocol we have implemented and studied only a subset of the functionality, in particular a subset of the DLL layer. The upper layers of the UMTS protocol are replaced with a transmission buffer (Tr_buffer) module that simulates the data coming from these layers. The PHY layer of the protocol together with the air interface is represented as a black box connecting the transmitter and the receiver parts. This chapter presents the results of mapping the DLL functionality onto different ARM7 processors. At present, our model has been considerably extended by including the implementation of the PHY layer functionality. The design space exploration using the larger protocol implementation is presented in a later chapter.

5.1 UMTS DLL Case Study

We have implemented the DLL layer, which performs general packet forming and quality of service support. The architecture of the DLL layer of the protocol stack is very complex due to the high number of different logical and transport channels. The subset of the protocol stack that we have modeled is the section highlighted in

77 CHAPTER 5. UNIPROCESSOR SYSTEM DESIGN AND RESULTS Figure 2.2, and corresponds to the bidirectional Dedicated Channel that, in our case, is limited to point-to-point uplink user data transmission. The RLC is divided into three separate entities for Transparent (Tr), Unacknowledged (UM) and Acknowledged (AM) transmission modes. We have limited our analysis to the Unacknowledged mode, corresponding to the UM-Entity, since this is a superset of the Transparent mode, and can be used to a certain degree to estimate the performance of the Acknowledged mode, which would otherwise require the downlink model. Thus, the results of performance analysis of the UM-Entity can allow us to make approximate estimations of the performance of the complete RLC layer. Likewise, the MAC sublayer is divided into different entities that handle the mapping between the logical and the transport channels. Of these, we model the MAC-d entity which is the only one involved with the baseline (not enhanced) Dedicated Channel. The other blocks, which were introduced in more recent versions of the standard, are instead required for high speed and quality of service support Untimed Functional model The functional model of the UMTS protocol is composed of six modules (actors of a dataflow process network) for the transmitter and five modules for the receiver, and is shown in Figure 5.1. The transmitter consists of three modules for the UM RLC entity (i.e., Segmentation, RLC Header Add and Ciphering) and three to MAC-d (i.e., Transport Channel Type Switch, C/T MUX and Transport Format Combination (TFC) selection). The receiver has two modules for the MAC-d entity (C/T DEMUX and Transport Channel Type Switch) and three modules for the UM RLC entity (Deciphering, RLC Header Rem and Reassembly). Every module is attached to two fixed-size FIFOs, with blocking read and blocking write, one for data input and the other for data output, connected to the next module. 
The transmission buffer is a random data generator used to perform the simulation. The Reassembly module displays the final data as received. Finally, the PHY module is organized as a channel that adds some random distortion and time delays to the transmitted data. The transmitter deposits its data in the input FIFO of the PHY module to be

processed for further transmission, while the receiver takes data from the output FIFO of the PHY module and processes it up to the Reassembly block.

Figure 5.1: Functional model

Timed Functional model

The SystemC timed functional model has the same composition of blocks. We assume that all the blocks are executed on the same processor (an ARM processor), which means that while one of the blocks is executing the others wait for it to finish. The execution of each process takes a certain amount of time, which we add to the SystemC simulation time while processing the corresponding block. We determine the execution time of each process beforehand by running and profiling the functionality on an emulator of the processor being explored. This way, the beginning and the end of each process execution have their own time stamps. Because UMTS supports different kinds of voice and data transfers, processes may have different priorities, and must therefore be executed according to some scheduling policy.
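The time annotation described above amounts to advancing a single simulation clock by the profiled execution time of each block, since only one block runs at a time. The following minimal sketch illustrates the idea outside SystemC. The execution times in `EXEC_TIME_US` are hypothetical placeholders, not profiler measurements, and `simulate_packet` is a name introduced here for illustration.

```python
# Hypothetical per-block execution times in microseconds (placeholders, not measured data).
EXEC_TIME_US = {
    "Segmentation": 12.0,
    "RLC_header_Add": 5.0,
    "Ciphering": 40.0,
}


def simulate_packet(blocks, exec_time_us, start_us=0.0):
    """Run the blocks one after another on a single processor.

    Returns the final simulation time and a trace of
    (block name, begin time stamp, end time stamp) tuples.
    """
    now = start_us
    trace = []
    for name in blocks:
        begin = now
        now += exec_time_us[name]   # the other blocks wait while this one executes
        trace.append((name, begin, now))
    return now, trace
```

A scheduling policy would only change the order in which `blocks` is traversed; the time accounting stays the same.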

Architecture Model

To design an optimal architecture we need to decide which elements should be available on the platform to achieve the best trade-off between the metrics of interest. These elements include general-purpose processors (GPPs), Digital Signal Processors (DSPs), Field Programmable Gate Arrays (FPGAs), or a mix of them. This step also includes identifying the kind of processors to be used (and their performance), as well as their number and general interconnection topology. To begin with, we surveyed several architectures proposed by industry for SDR, and decided to take the Small Form Factor SDR (SFF SDR) Development Platform from Lyrtech [60] as our baseline model. The block diagram of this platform is presented in Figure 2.4 and its basic description is given in Section . This platform consists of three separate modules, as shown in the figure, one of which is the Baseband Processing module, the main target of our interest. This module is based on a SoC from Texas Instruments composed of one ARM and one DSP processor. The DSP is typically used for the baseband processing, while the ARM processor is intended to be used by the higher layers of the communication protocols. In this case study we restrict our attention to the ARM family of processors, in particular the ARM7, since the functionality that we are testing is limited to a subset of the UMTS DLL layer. We consider lower-performance processors than the one available on the SFF SDR platform (which we only take as a template) because of the limitations imposed by the profiler that we have used. The work presented in the next chapters extends this performance analysis to higher performance processors, such as ARM9, MicroBlaze, C67xx DSP, and Sparc, and to multiprocessor based architectures, while integrating the baseband processing and the physical layer of the UMTS protocol.
5.2 UMTS DLL Results

This section is devoted to presenting the results of the performance analysis for the RLC and the MAC sublayers. We use three kinds of metrics to characterize the performance of different architectures, to determine the distribution of resources within the protocol function, and to measure the efficiency of the performance analysis itself. To do that,


we need to know the time that each function takes to execute on the particular microcontroller.

Figure 5.2: Block diagram of uniprocessor system

We obtain the execution time of each task from the embedded profiler of the Keil ARM Development Tool [51], by running our functionality on several emulated ARM processors. Keil implements a lower level of abstraction (the profiler level), which we use to extract the relevant performance data. Our first results, depicted in Figure 5.3, show the percentage of utilization (load) of the general-purpose processor under different transfer rates, and for different architectures. We have analyzed five different ARM microcontrollers: the STR912FW44, STR750FV2 and STR736FV1 from STMicroelectronics, and the LPC2119 and LPC2194 from NXP. These have been analyzed at their maximum clock frequency (shown in Figure 5.3 under the name of the architecture), with the exception of the STR912FW44 and the STR750FV2. The first (STR912FW44) was simulated at 60 MHz, instead of 96 MHz, for an easier direct comparison with the NXP microcontrollers. The second (STR750FV2) was simulated at 25 MHz, instead of its maximum 60 MHz, to obtain roughly the same

performance results as the STR736FV1, and therefore show the savings in clock frequency.

Figure 5.3: Architecture performances (% of used resources at 12.2 kbps, 28.8 kbps, 57.6 kbps and 384 kbps for the LPC2119 at 60 MHz, LPC2194 at 60 MHz, STR912FW44 at 60 MHz, STR750FV2 at 25 MHz and STR736FV1 at 36 MHz)

For each microcontroller, we have computed the load for the four most common transmission data rates, i.e., 12.2 kbps for voice communication, 28.8 kbps and 57.6 kbps used by modems and faxes, and 384 kbps for high speed data transmission. For each data rate we use a fixed, though different, packet size for the Transport channel. For instance, we use a Transport block size of 576 bits for modems and faxes, and of 336 bits for high speed data transmission [9]. Each packet has a 16-bit overhead for RLC and MAC headers. In addition to that, information data delivered by the Dedicated Transport Channels (DTCHs) are accompanied by control packets delivered through the Dedicated Control Channels (DCCHs). In our simulations, we have assumed that each DTCH packet is accompanied by one DCCH packet. For a bit rate of x kbps, the load is computed as the ratio between the actual time it takes to transmit x bits, and the maximum time allowed

for the transmission of the same amount of bits. Formally, we have

$$L = \frac{(T_{dp} + T_{cp}) \cdot x/8}{P_{dp} + P_{cp}} \cdot 100\%,$$

where $T_{dp}$ and $T_{cp}$ are the times to transmit a data and a control packet, and $P_{dp}$ and $P_{cp}$ are the sizes, in bytes, of the data and control packets, respectively. For the control packet we have assumed a constant size of 100 bits. The results of the analysis presented in Figure 5.3 show considerable variability, both across architectures and, as expected, for different data rates. This analysis gives us a measure of the residual computing power available to the rest of the protocol, to potential other protocols running concurrently, and to higher level applications, which is essential information for the correct architecture choice and sizing of the system. In the case of the STR736FV1 and STR750FV2, the load for 384 kbps exceeds 100%, and is therefore only partially shown in the figure.

A second class of results is devoted to the analysis of the resources required by each of the functions of the model, shown in Figure 5.4. These results are for the STR912FW44 ARM9 microcontroller from STMicroelectronics and a packet size of 70 bytes, which we have taken as representative, since the results for the other architectures and packet sizes are qualitatively and quantitatively similar. The functions in Figure 5.4 are sorted by the amount of resources that they require. As we can see from the graph, the Ciphering function is in general the most resource consuming function. This kind of result is very useful for taking preliminary decisions about the distribution of resources between the functions.

Our last group of results is concerned with a measure of the efficiency of the performance analysis. In Table 5.1, we show, for each architecture, the time in seconds required to simulate the transmission of 7,000 packets, with a packet size of 70 bytes, for both the Keil profiler and SystemC.
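As a worked illustration of the load formula given earlier in this section, the sketch below evaluates it for one set of inputs. The text does not state the unit of the packet times, so the function assumes milliseconds and takes 1 kbps as 1000 bit/s, which makes x/8 a rate in bytes per millisecond. The function name and the sample times are hypothetical; the 70-byte data packet and the 100-bit (12.5-byte) control packet are the sizes quoted in the text.

```python
def load_percent(t_dp_ms, t_cp_ms, p_dp_bytes, p_cp_bytes, rate_kbps):
    """Processor load L in percent for a bit rate of rate_kbps.

    Assumption: times are in milliseconds and 1 kbps = 1000 bit/s,
    so rate_kbps / 8 is the required rate in bytes per millisecond.
    """
    bytes_per_ms = rate_kbps / 8.0
    return (t_dp_ms + t_cp_ms) * bytes_per_ms / (p_dp_bytes + p_cp_bytes) * 100.0


# Hypothetical processing times (0.75 ms and 0.25 ms) at 384 kbps.
example_load = load_percent(0.75, 0.25, 70, 12.5, 384)
```

A result above 100% means that the processor needs more time to process the packets than the data rate allows, which is the situation reported for the STR736FV1 and STR750FV2 at 384 kbps.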
The simulations were performed on a Pentium M processor running at 1.70 GHz, with 512 MB of RAM. For SystemC, we have used the reference implementation version 2.1.v1, available at the SystemC web site [82]. The performance of the SystemC simulation is independent of the architecture, since it always refers to the same model. In contrast, Keil shows variability depending on the microcontroller. The

last column of the table shows the speed-up obtained by using the more abstract SystemC models. On average, the SystemC simulation is more than 21 times faster than the profiler, which justifies the use of the profiler for data gathering, and the use of the abstract simulator for performance analysis. This methodology becomes even more compelling as the complexity of the model increases, and several protocols are run concurrently, since the simulation time will correspondingly increase. Moreover, the performance of the SystemC simulation can potentially be increased by using an optimized simulator rather than the reference implementation.

Figure 5.4: Resource distribution by function (% of used resources for Ciphering, RLCheaderAdd, C_T_MUX, Segmentation, Tr_buffer, TrFormatSel and TrChSwitch)

Table 5.1: Efficiency of performance analysis (columns: Architecture, Keil, SystemC, Speed-up; rows: LPC2119, LPC2194, STR912FW44, STR750FV2, STR736FV1, Average)

Chapter 6 Multiprocessor Heterogeneous System Design and Results

This chapter presents a larger case study in which we explore several multiprocessor based architectures. The functional model is also significantly extended: apart from the DLL layer, it includes an implementation of the PHY layer functionality. The PHY layer model is a pure software implementation of the protocol functionality, including functions such as the convolutional coder and the Viterbi decoder that, in present mobile platforms, are usually implemented in hardware. We studied several hard and soft processing elements (e.g., ARM7, ARM9, and MicroBlaze) and several heterogeneous double-processor based systems composed of different permutations of these processing elements, with the DLL mapped onto one processor and the PHY layer onto the other. The results show that only some of the studied architectures are able to support the complete set of UMTS protocol functionality implemented entirely in software.

6.1 UMTS DLL and PHY Case Study

The functional model of the UMTS protocol, shown in Figure 6.1, is composed of six DLL layer modules (actors of a dataflow process network) and fourteen PHY layer modules for the transmitter, and of five DLL modules and twelve PHY modules for the receiver. The functionality of each module of the transmitter part is implemented upon the

specification provided by 3GPP [10, 13, 8, 12, 14, 11, 9], while the receiver part provides the inverse operations to those of the transmitter. As in the previous case study, due to the complexity of the protocol stack we have made some initial assumptions in order to restrict our implementation and simplify the functional model. The studied model includes a subset of the DLL and PHY layers of the UMTS protocol stack corresponding to the bidirectional Dedicated Channel, limited to point-to-point uplink user data transmission. The DLL UMTS model is the same as the one already described in Chapter 5. The PHY layer model is an implementation of the functionality that, in the previous case study, was hidden in the black box together with the air interface model. The transmitter part of the PHY functional model is implemented upon the uplink model presented in [13] and is extended with two more blocks (Spreading and Modulation), described in [14], in order to complete the chain of digital operations. The receiver model is implemented as the chain of operations inverse to those of the transmitter, applied to the data coming from the Air interface module. For the moment, the Ch coding and Ch decoding blocks implement only the convolutional coding and Viterbi decoding algorithms, respectively. For this reason we were able to study only the lowest data rate (12.2 kbps). Other data rates require the implementation of the Turbo coding and decoding algorithms, which we plan to complete in the future. Every module is attached to two FIFO queues. Each block signals to the next the availability of a packet to be processed. The transmission buffer is a random data generator used to perform the simulation. The Reassembly module displays the final data as received. Finally, the Air interface module is organized as a channel that adds some random distortion and time delays to the transmitted data.
The modules are connected to each processor of the architecture via bidirectional channels. Each function uses only the channel dedicated to the processor used for its execution. This way, remapping a function to another processor can be achieved by simply switching to another dedicated channel. In Figure 6.1 we have shown a mapping example, which represents the execution of different protocol layers on separate processors. Connecting each of the functions to all available processors, instead of using an additional

SystemC dummy module for the idle channels, allows us to find an optimal mapping in terms of throughput, latency, and processor load automatically.

Mapping #   1      2      3      4      5
DLL         A9(4)  A9(2)  A9(4)  A9(2)  A7
PHY         A9(4)  A9(2)  µb     µb     A9(4)

Mapping #   6      7      8      9      10
DLL         A7     A7     µb     µb     µb
PHY         A9(2)  µb     A9(4)  A9(2)  µb

Table 6.1: Mapping Configurations

MPSoC

In this case study, we focused on the ARM7 (A7), ARM9 (A9), and MicroBlaze (µb) processors. They were chosen since they are well suited for embedded applications and work well with the profiling flow described earlier. The MicroBlaze elements are soft FPGA processor cores which interact with IBM's CoreConnect bus architecture. For profiling, our library elements consisted of the MicroBlaze processor core (6.00a) enabled with an FPU on Xilinx's Virtex II-Pro 2VP30, which is part of the ML310 development board. In addition, the core was connected to the On-Chip Peripheral Bus (OPB), had caching enabled in 32MB of DDR SRAM, and used its ilmb and dlmb (instruction and data local memory buses) to access 112KB of BRAM. Its core frequency was 100 MHz. The ARM7TDMI-S is a small, low power 32-bit RISC microcontroller with 128KB of on-chip Flash ROM and 16KB of RAM. The ARM9TDMI is a higher performance 32-bit processor. It has 16KB caches for both instructions and data. We run it at both 250 and 400 MHz. The architecture configurations (mappings) we explored are shown in Table 6.1. The mapping of architecture models was done at the DLL/PHY level. Combinations of the MicroBlaze, the ARM7, and the ARM9 at 400 MHz (A9(4)) and 250 MHz (A9(2)) were used. If desired, mapping could also be done at the level of the sub-functional blocks of both the PHY and DLL models. This would greatly expand the size of the design space.
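The mapping configurations of Table 6.1 can be captured as a small lookup structure, which makes it easy to select the mappings that share a processor choice. The sketch below is illustrative only. The dictionary name and the helper `mappings_with` are introduced here, and the mapping numbers are those that agree with the mapping lists quoted in Section 6.2.

```python
# Table 6.1 as (DLL processor, PHY processor) pairs, keyed by mapping number.
# A9(4) = ARM9 at 400 MHz, A9(2) = ARM9 at 250 MHz, A7 = ARM7, µb = MicroBlaze.
MAPPINGS = {
    1: ("A9(4)", "A9(4)"),
    2: ("A9(2)", "A9(2)"),
    3: ("A9(4)", "µb"),
    4: ("A9(2)", "µb"),
    5: ("A7", "A9(4)"),
    6: ("A7", "A9(2)"),
    7: ("A7", "µb"),
    8: ("µb", "A9(4)"),
    9: ("µb", "A9(2)"),
    10: ("µb", "µb"),
}


def mappings_with(dll=None, phy=None):
    """Return the mapping numbers whose DLL and/or PHY processor match."""
    return [
        number
        for number, (dll_pe, phy_pe) in sorted(MAPPINGS.items())
        if (dll is None or dll_pe == dll) and (phy is None or phy_pe == phy)
    ]
```

For example, `mappings_with(phy="µb")` selects the configurations in which the PHY layer runs on the soft core, and `mappings_with(dll="A7")` selects those in which the DLL runs on the ARM7.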

6.2 UMTS DLL and PHY Results

This section presents the performance analysis for the Data Link (DLL) and the Physical (PHY) layers mapped to the architecture configurations presented in Table 6.1. As in the previous case study, we use three kinds of metrics to characterize the performance of different architectures. The first metric is the processing element load (utilization). The second and third are the latency and throughput of the communication system, respectively. We obtain the execution time of each task mapped onto different processing elements using the profiling methods presented in Section and illustrated partially in Figure 6.2, which provides a comparison of the performance (execution times in ns) of the DLL functions for the ARM9 processor at 250 MHz, the ARM7 processor, and the MicroBlaze processor. This is provided as a sample to show that, while general trends in execution time can be seen, the times are unpredictable and require a true profiling flow as opposed to crude estimates. The profiling methods represent the lower level of abstraction (the profiler level), which we use to extract the relevant performance data to be used at the simulation's higher level. We obtain the execution time of each task for the ARM7 processor from the embedded profiler of the Keil ARM Development Tool [51], by running our functionality on several emulated ARM processors. The execution times of the functions on the ARM9 are obtained from the C code annotator [67]. Our first results, depicted in Figure 6.3, illustrate the percentage load (utilization) of the four analyzed processors with respect to the functionality mapped onto them under a fixed transfer rate used for speech transmission (12.2 kbps). This investigation examines 7 configurations. Two of them have the ARM9 mapped to the DLL (mappings 1-4).
One uses the ARM7 for the DLL (mappings 5-7). One uses the MicroBlaze for the DLL (mappings 8-10). The remaining three examine mapping the ARM9 and the MicroBlaze to the PHY (the ARM7 was not mapped to the PHY). The results of the analysis show considerable variability across the processors. This analysis gives us a measure of the residual computing power available to the rest of the protocol, to potential

other protocols running concurrently, and, at the DLL layer, to higher level applications. This is essential information for the correct architecture choice and partitioning of the system. When the PHY functionality is mapped to the FPGA (mappings 3, 4, 7, 10) or to the ARM9 at 250 MHz (mappings 2, 6, 9), the load exceeds 100% and is therefore invalid (and only partially shown in the figure). In the case of the FPGA, the load imposed by the PHY functionality exceeds even 1000%. However, some boards for SDR development presently on the market, for example [60], use FPGAs to perform only the modulation/demodulation computation (roughly a tenth part of the PHY). Our results show that this is reasonable and that the full PHY functionality is not well suited to the current crop of soft processor FPGA cores, due to their low frequencies and relatively simple, general purpose pipelines. The architecture mappings we have studied are composed of two processors. One of them is used exclusively to run the functionality of the PHY layer, that is, the functionality performed right before/after the analog part of the transmitter/receiver, respectively. Because our case study includes the implementation of only one protocol stack, we consider that the right mapping combination is achieved when the PHY processor is loaded at almost 100%. The other processor is dedicated not only to the DLL layer, but also to the other higher protocol layers and applications; thus, it should not be completely loaded by the functionality of the DLL. A second class of results is presented in Figure 6.4. It is devoted to the analysis of the latency and throughput of the analyzed mappings. From this graph we can see that the throughput adequate for speech data transfer (12.2 kbps) is supported only by the mappings that use the ARM9 at 400 MHz to run the functionality of the PHY (mappings 1, 5, 8).
In these cases, the ARM9 dedicated to the PHY is loaded at almost 100%. The architectures with the ARM9 at 250 MHz used to run the same functionality are slightly overloaded and do not give appropriate throughput. In this situation we can either increase the clock frequency or change the mapping by transferring part of the functionality of the PHY onto another available processor. The load imposed on the processor by the DLL is negligible in comparison to that of the PHY. That is why the throughput of the overall system is close to that which can be achieved by using only one processor and, therefore, is very low. Equal distribution of functionality between the processors may increase this value

significantly, while the latency will not change much. Table 6.2 presents the last group of results and details the time spent in the two profiling flows to get information for the architecture models. As is shown, profiling is not always fast. However, it should be noted that the profiling of the various models is fully independent, so that the time for profiling is dictated only by the most computationally complex model (not by the set of models). Every model needs to be profiled only once for each architectural model. The profiling information is then used to simulate a combination of models in SystemC, which is performed (for 1000 packets) in less than a minute. It does not take a lot of time to configure (map) another functional model to another architecture configuration. We simply need to comment one line and uncomment another for each function that needs to be remapped. Again, this is not very time consuming and allows a designer to explore a large design space.

Procedure             DLL    PHY         Configuration
GPP Flow
Create Static Exe     <2s    <3s         Envir.: Fedora 5
Create Debug Exe      <1s    <1s         Proc.: Xeon 3GHz
Create DisAsm File    <2s    <1s         Mem.: 3.5GB
Simplescalar          <8s    8-80m (1)
Annotation            <2m    1h-16h (1)
SystemC Exe           <1m
FPGA Flow, Xilinx 9.1i EDK
Generate BStream      35m    Same        Envir.: Ubuntu 7.1
Update BStream        <3s    Same        Proc.: P4 2GHz
Download BStream      <1m    Same        Mem.: 2GB

s=sec, m=min, h=hour. (1) Dependent on loops to amortize cache misses.

Table 6.2: Architecture Profiling Process and Cost

Figure 6.1: Functional Model Mapped to Two Processors
Processor 1, DLL Tx: Tr_buffer, Segmentation, RLC_header_Add, Ciphering, TrCH_type_switch, C_T_MUX, Tr_format_sel
Processor 1, DLL Rx: Reassembly, RLC_header_Rem, Deciphering, Rx_TrCH_type_switch, C_T_DEMUX
Processor 2, PHY Tx: CRC_Attachment, Tr_block_Concat, Tr_block_Segm, Ch_coding, Radio_frame_equal, Interleaving_1, Radio_frame_segm, Rate_matching, TrCH_MUX, PhyCh_segm, Interleaving_2, PhyCh_mapping, Spreading, Modulation
Processor 2, PHY Rx: CRC_check, Ch_decoding, Rem_equal_padding_bits, Deinterleaving_1, Radio_frame_concat, Rem_padding_bits, TrCH_DEMUX, PhCh_concat, Deinterleaving_2, PhyCh_extraction, Despreading, Demodulation
Channel between PHY Tx and PHY Rx: Air_interface

Figure 6.2: Sample Execution Times Obtained Through Profiling

Figure 6.3: Mapping Effect on System Utilization

Figure 6.4: Mapping Effect on System Performance

Chapter 7 MPSoC Modeling with Metro II

This chapter presents the results of integrating our SystemC framework into Metro II [41, 19], a larger heterogeneous design framework that allows designers to import designs developed using tools foreign to Metro II, and its co-existence there with other models. We have integrated the DLL part of the developed UMTS functional model (see Chapter 5) and the scheduler model (see Section 4.1.3). The integration of the functional model required some additional effort, since each block of the SystemC untimed functional model had to be wrapped with the Metro II adaptation code (see Section 7.2.3). However, the result obtained is well worth the work performed. The combination of our profiled processing model of the ARM7, ARM9, and MicroBlaze processing elements (see Section 7.1.4) with a runtime processing SystemC model of a Leon 3 SPARC, also integrated into Metro II (see Section 7.1.4), allowed us to study a very large design space. Only a small part of the possible mappings (48 mappings) is presented in Section . However, one can imagine the number of possible mappings, taking into account that the architectural model may contain up to 26 processing elements with all possible combinations of ARM7, ARM9, MicroBlaze, and Leon 3 SPARC, and that the functional model, composed of 11 modules, may be partitioned in many ways apart from those presented in Section .

7.1 MPSoC Modeling with Metro II

Metro II supports the development of separate architecture service models to complement the functional modeling effort. These models should be modular (support various configurations and parameterizations), flexible (offer a variety of mapping solutions), and accurate (have a firm grounding in the real-world counterparts they represent). In this work we create models with multiple processing elements which provide costs when mapped to a functional design. This section details the development of the architectures used in the UMTS case study covered in subsection .

General Architecture

Before detailing the specifics, it is important to make several key points. Architectures in Metro II are services. These services map to functional model components and provide meaningful costs when these services are requested. They minimally need to:

1. Contain components with provided ports to provide services to the functional model for mapping. These components (in this case called architectural tasks) each have their own thread of execution. These threads will generate the events associated with ports. These ports will be required ports and events associated with their interface functions will be mapped to the functional model.
2. Register the events associated with the architectural task required port function calls to the necessary annotators and schedulers. This event association will provide the costs of the architectural services and ultimately the cost of the overall simulation.

The architecture models are composed of the following three portions:

1. Tasks - active components which serve as the mapping target for each component in the functional model.
2. Operating System - explicit, imperative mechanism for scheduling and assigning tasks to processing elements.
3. Processing Element - workhorse of the architectural model. Used to model the core cost of a service.

Figure 7.1 illustrates these three aspects.

Architectural Tasks

Tasks are lightweight, active components in the architecture model. The thread for each task constantly proposes events for its provided services. Mapping creates a rendezvous constraint between the event generated by the task thread and the functional event. Therefore, there is a 1:1 mapping between these tasks and functional components. Due to the rendezvous constraint, the task remains blocked until the corresponding event from the functional model is proposed. Step 1 in Figure 7.1 illustrates the task's role in architecture model execution.

Operating System

The operating system is used to assign tasks to processing elements (in a many-to-one relationship). In addition, it carries out phase 1 scheduling, reducing the work to be done in phase 3. This is done by pruning the events proposed in the first phase. An investigation of scheduling policies is in Section 7.2. An OS is an active component with N threads (where N is the number of processing elements it controls). It maintains a queue of requested jobs which processing elements query to decide if they can execute or not. The queue contains the events proposed for processing, the processing element they wish to use, the order they were proposed in, and the statically assigned priority of each event. Scheduling controls how events are added to and removed from this queue. Access to this queue is coordinated such that there is a limited number of outstanding requests for a given processing element. Steps 2-5 in Figure 7.1 illustrate the OS's role in the architecture model execution. The OS is also used to access the annotation tables for events. The annotation tables are used by the annotator in phase 2 (where event tags can be written). These tables relate event costs to architectural services. The OS updates the appropriate entry in the table after a request is completed and the true cost is known.
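The job queue maintained by the OS can be pictured with the following minimal sketch. It is not Metro II code: the class name, its methods, the per-element limit of two outstanding requests, and the convention that a lower number means a higher priority are all assumptions made here for illustration. What it takes from the text is only the content of a queue entry (event, target processing element, proposal order, static priority) and the bound on outstanding requests per processing element.

```python
import heapq
import itertools


class OsJobQueue:
    """Toy job queue: one priority heap per processing element."""

    def __init__(self, max_outstanding=2):
        self._heaps = {}                     # processing element -> heap of requests
        self._order = itertools.count()      # records the order of proposal
        self._max_outstanding = max_outstanding

    def propose(self, event, pe, priority):
        """Add a request; refuse it if the processing element has too many pending."""
        heap = self._heaps.setdefault(pe, [])
        if len(heap) >= self._max_outstanding:
            return False
        # Assumed policy: lower priority value first, then first proposed first.
        heapq.heappush(heap, (priority, next(self._order), event))
        return True

    def next_job(self, pe):
        """Called by a processing element to fetch its next event, if any."""
        heap = self._heaps.get(pe)
        if not heap:
            return None
        _, _, event = heapq.heappop(heap)
        return event
```

A different scheduling policy would change only how `propose` orders the entries and how `next_job` selects among them.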
In addition to the processing cost, the OS may also add cost related to overhead (e.g., context switching). Tables are updated dynamically at runtime and do not need to be statically created with

Figure 7.1: MPSoC Architecture Service Topology

the netlist. Tasks themselves need to know nothing about this process and only need to indicate which service they require. Again, this separates the computation behavior from its performance cost.

Processing Elements

The third piece of the architecture platform is the set of actual processing elements. Once the OS decides to run a task request, it calls the corresponding function on one of its N required ports. The interface supported by all processing elements is the same (to provide modularity and flexibility), but there are different ways in which the cost may be calculated. Steps 6-7 in Figure 7.1 illustrate two different types of processing elements that may be used, and the interface used to inform them which processing routine they should compute a cost for. Processing element types may be changed easily to provide a balance between simulation speed and effort.

Runtime Processing

The first architectural modeling style is runtime processing. In this style, the processing elements are cycle accurate, microarchitectural models which execute code dynamically. An example of runtime processing is the cycle accurate model of a Leon 3 SPARC shown in Figure 7.2. There are three key functions in the interface. The first step in using this type of processing element is for the OS to provide information, gathered from the task, as to which operation is requested. This is done via the set_program() function, where the instruction memory of the processing element is loaded with pre-compiled code for the operation. The second step is to execute that code at runtime with run_proc(), which returns the cycle requirements for that code. The operating system uses that information to update the annotation table. Finally, the OS calls reset_proc() to ready the processor for the next request.
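The three-step protocol between the OS and a runtime processing element can be sketched as follows. Only the three function names come from the text. The class `ToyRuntimePE`, the helper `os_execute`, and the representation of pre-compiled code as a list of (mnemonic, cycles) pairs are hypothetical simplifications; a real runtime element would execute the code on a cycle accurate microarchitectural model.

```python
class ToyRuntimePE:
    """Stand-in for a runtime processing element (not a real SPARC model)."""

    def __init__(self):
        self._program = None

    def set_program(self, program):
        # Load the "instruction memory" with the pre-compiled code of the operation.
        self._program = list(program)

    def run_proc(self):
        # Execute the loaded code and return its cycle requirement.
        if self._program is None:
            raise RuntimeError("no program loaded")
        return sum(cycles for _mnemonic, cycles in self._program)

    def reset_proc(self):
        # Ready the processor for the next request.
        self._program = None


def os_execute(pe, operation, programs, annotation_table):
    """OS side: load, run, record the true cost, reset."""
    pe.set_program(programs[operation])
    cycles = pe.run_proc()
    annotation_table[operation] = cycles   # cost is known only after execution
    pe.reset_proc()
    return cycles


# Hypothetical pre-compiled code: (mnemonic, cycles) pairs, illustrative values only.
PROGRAMS = {
    "Ciphering": [("load", 2), ("xor", 1), ("store", 2)],
    "RLC_header_Add": [("load", 2), ("store", 2)],
}
```

Because the OS talks to the element only through these three calls, an element that computes its cost in a different way can be substituted without changing the OS code.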

Figure 7.2: Sparc Runtime Processing Element

While this style may result in a slower simulation as compared to the following approach, it only requires that the code for the function be available. It requires no other modeling work by the user and is as accurate as the level of detail in the microarchitectural model.

Profiled Processing

The second style is profiled processing, where precomputed performance metrics are stored for lookup. Again, the OS indicates to the processing element which services are requested. In turn, the processing element looks up the costs for the given operations. These can be trivial table lookups or more complex (but still static) calculations based on the current state of the processing element. Ways to characterize processing elements for this approach have been shown in [67], [40], and [79]. An advantage of this approach is that the lookup is extremely fast as compared to the runtime processing approach. The drawback is that the modeling of these elements is often more limited in its usage and requires that characterization be carried out prior to simulation. This precharacterization, however, only needs to be done once per computation routine. It does require a more

complex set of transformations than simple compilation (the runtime processing approach). Section describes two flows for profiling.

Functional Model

Most of what is presented in this subsection has already been described in previous chapters. However, to spare the reader from going back and forth among the chapters, we describe once more the DLL functional model integrated into Metro II. We focus on the User Equipment Domain of the UMTS protocol [10], which is of interest to mobile devices and is subject to stringent implementation constraints. The protocol stack of UMTS for the User Equipment Domain has been standardized by the 3rd Generation Partnership Project (3GPP) up to the Network layer, including the Physical (PHY) and Data Link (DLL) layers. Our model includes the implementation of the Unacknowledged mode of the DLL layer, which is composed of the functionality of the Radio Link Control (RLC) and Medium Access Control (MAC) sublayers. For the purposes of this case study, our model was largely separated into the RLC and MAC functionality as well as into receiver and transmitter portions. Simulation consists of processing 100 packets of 70 bytes each.

The functional model is represented as a dataflow network with blocking-read and blocking-write FIFOs and is shown in Figure 7.3. Both untimed and timed models, presented in Figure 7.3, were created to determine the advantages of functional/architectural separation. The timed model mixes function and architecture, while the untimed model relies on mapping with an architectural model to obtain performance metrics. The pure functional model allows processes to communicate in zero time provided data is present on the input and space is available at the output. The timed model, on the other hand, introduces a scheduler and timer.
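The untimed dataflow behavior described above can be illustrated with a small sketch. It is an assumption-laden toy rather than the Metro II model: the classes Fifo and Process, the function run, and the stage names are hypothetical, and blocking is modeled by simply not firing a process whose input is empty or whose output is full.

```python
from collections import deque


class Fifo:
    """Bounded FIFO; an empty or full FIFO blocks the attached process."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = deque()

    def has_data(self):
        return len(self.items) > 0

    def has_space(self):
        return len(self.items) < self.capacity


class Process:
    """Dataflow process with one input FIFO and one output FIFO."""

    def __init__(self, name, func, inp, out):
        self.name, self.func, self.inp, self.out = name, func, inp, out

    def can_fire(self):
        # Firing condition: data at the input and space at the output.
        return self.inp.has_data() and self.out.has_space()

    def fire(self):
        # Consume one token, compute, and produce one token (zero time).
        self.out.items.append(self.func(self.inp.items.popleft()))


def run(processes):
    """Fire enabled processes until none can fire; return the firing count."""
    firings = 0
    progress = True
    while progress:
        progress = False
        for p in processes:
            if p.can_fire():
                p.fire()
                firings += 1
                progress = True
    return firings
```

For instance, two stages connected by a one-slot FIFO process three packets in six firings, and a stage whose output FIFO is full does not fire until space becomes available.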
The time annotation of the functional model is carried out by means of a scheduler. The scheduler is modeled as a finite state machine which controls the execution of the system. The activation of each process is controlled by the typical firing conditions of process networks, i.e., the availability of data at the input FIFO, and the availability of space at the output FIFO. These conditions

are notified to the processes every time data is written to or read from the attached FIFOs. When a firing condition is satisfied, the process triggers the scheduler by sending a Ready to Run signal through the dedicated bi-directional scheduling channel and then waits for permission to start computation, which the scheduler grants when the resources are available and no higher-priority process is ready to run (the timed model scheduler has a notion of priority and preemption; see Section 4.1.3). In logically zero time, the process runs to completion and stops before the results are written to the output FIFO. The scheduler then triggers the process again to post its outputs at the correct time, which accounts not only for the process execution latency but also for the time spent running higher-priority processes that became active and preempted its execution. In this manner, a process is never physically suspended as a result of preemption, which reduces the overhead due to context switches. Instead, the scheduler verifies whether any preemption has occurred and, if so, delays the completion time by the appropriate amount.

Architecture Model

The architecture model assigns one task to each of the 11 UMTS components (TR Buffer and PHY were not mapped, as they represent the environment). The OS employs three different scheduling policies: round robin (RR), priority (PR) based on processing requirements, and first-come, first-served (FCFS). Processing elements communicate through point-to-point FIFO links or through shared memory for each processing element. The first policy is a round robin scheduler in which each processing element is simply selected sequentially. This is a cyclic process beginning with processing element 0 and moving through all the PEs. If a PE does not have a request pending, then the next PE in the list is allowed to proceed.
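The round robin policy described above can be sketched as follows. This is only an illustration under stated assumptions: the function names round_robin and schedule, and the representation of pending requests as per-PE counts, are hypothetical and do not come from the Metro II code.

```python
def round_robin(pending, start=0):
    """Return the index of the next PE with a pending request, or None.

    `pending` holds one boolean per PE; `start` is the PE whose turn it is.
    PEs without a pending request are skipped.
    """
    n = len(pending)
    for offset in range(n):
        pe = (start + offset) % n
        if pending[pe]:
            return pe
    return None


def schedule(requests):
    """Serve per-PE request counts cyclically, beginning with PE 0."""
    remaining = list(requests)
    order, turn = [], 0
    while any(remaining):
        pe = round_robin([r > 0 for r in remaining], turn)
        order.append(pe)
        remaining[pe] -= 1
        # The turn passes to the PE after the one just served.
        turn = (pe + 1) % len(remaining)
    return order
```

With request counts of 2, 0 and 1 for three PEs, the sketch serves PE 0, skips the idle PE 1, serves PE 2, and then returns to PE 0.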
The second algorithm is a priority-based scheduling algorithm in which higher priorities are assigned to tasks with higher processing requirements. These requirements are determined during the pre-profiling stage. Unlike in the timed functional model, preemption is not employed. Priority scheduling here examines all the requests for processing in a given round and selects the one with the highest priority. The selected priority is noted, and in the next round it cannot be chosen again if there are still events

Figure 7.3: UMTS Metro II Untimed Functional Model
