UML-Based Multiprocessor SoC Design Framework


TERO KANGAS, PETRI KUKKALA, HEIKKI ORSILA, ERNO SALMINEN, MARKO HÄNNIKÄINEN, and TIMO D. HÄMÄLÄINEN
Tampere University of Technology
JOUNI RIIHIMÄKI
Nokia Technology Platforms
and
KIMMO KUUSILINNA
Nokia Research Center

This paper describes a complete design flow for multiprocessor systems-on-chips (SoCs) covering the design phases from system-level modeling to FPGA prototyping. The design of complex heterogeneous systems is enabled by raising the abstraction level and providing several system-level design automation tools. The system is modeled in a UML design environment following a new UML profile that specifies the practices for orthogonal application and architecture modeling. The design flow tools are governed in a single framework that combines the subtools into a seamless flow and visualizes the design process. Novel features also include an automated architecture exploration based on the system models in UML, as well as the automatic back and forward annotation of information in the design flow. The architecture exploration is based on the global optimization of systems that are composed of subsystems, which are then locally optimized for their particular purposes. As a result, the design flow produces an optimized component allocation, task mapping, and scheduling for the described application. In addition, it implements the entire system on an FPGA prototyping board. As a case study, the design flow is utilized in the integration of state-of-the-art technology approaches, including a wireless terminal architecture, a network-on-chip, and multiprocessing utilizing an RTOS in a SoC. In this study, a central part of a WLAN terminal is modeled, verified, optimized, and prototyped with the presented framework.
Categories and Subject Descriptors: I.6.0 [Simulation and Modeling]: General; B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids - Simulation; C.2.1 [Computer-Communication Networks]: Network Architecture and Design - Wireless communication

General Terms: Design, Performance, Verification

Additional Key Words and Phrases: UML 2.0, design flow, architecture exploration

Authors' addresses: T. Kangas, P. Kukkala, H. Orsila, E. Salminen, M. Hännikäinen, T. Hämäläinen, Institute of Digital and Computer Systems, Tampere University of Technology, P.O. Box 553, FI Tampere, Finland. J. Riihimäki, Nokia Technology Platforms, Tampere, Finland. K. Kuusilinna, Nokia Research Center, Tampere, Finland.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY USA, fax: +1 (212) , or permissions@acm.org.
© 2006 ACM /06/ $5.00

ACM Transactions on Embedded Computing Systems, Vol. 5, No. 2, May 2006, Pages

1. INTRODUCTION

Platform-based design has been envisioned to meet the design challenges of ever-increasing system complexity [Ferrari and Sangiovanni-Vincentelli 1999]. The design approach theoretically allows rapid construction of very large systems and their architecture exploration. However, the analysis of such systems is very demanding and slow using traditional methods, partly because many of the physical architecture evaluation tools operate on relatively low abstraction levels, and partly because of the typically vast design space. Hence, novel methods, in the form of a design flow, are needed to overcome the design and verification gaps.

Another design challenge with large systems is the description and modeling of system components with different models of computation (MoC) and at several abstraction levels. Unified Modeling Language (UML) has been utilized in novel design methods proposing a solution for this challenge. The latest release of UML (UML 2.0), in particular, is converging toward a general design language that can be understood by system designers as well as software and hardware engineers.

As UML itself offers only a weak formalism, it needs to be adapted for specific modeling purposes to enable efficient model analysis, transformation, and synthesis. UML 2.0 supports profiles that enable the language to be applied to particular application and platform domains with sophisticated extension mechanisms. The most common mechanisms are stereotypes, constraints, and tagged values. A profile usually defines particular semantics for a subset of UML, which at the same time improves the model formalism.

This paper presents a design flow, called Koski, for multiprocessor system-on-chip (SoC) design utilizing UML 2.0 for system modeling. It is a library-based method that hides unnecessary details from high-level design phases, but does not require a plethora of model abstractions.
The design flow provides an automated path from UML design entry to FPGA prototyping, including the functional verification and the automated architecture exploration. The architecture design is based on the application model and, thus, the final implementation is application-specific. In addition, the flow supports back-annotation from low-level simulations to the UML model.

The UML modeling for the design flow uses a UML extension profile, which is especially targeted at automated embedded real-time system implementation. The application modeling and abstraction concentrate on the application tasks and their internal behavior. The aim of the design flow is to map the functionality of these tasks to the processing elements of the optimized architecture. This approach is chosen because of its clear correspondence to practical design, in which the primary target is a physical multiprocessor SoC.

Our method differs from the higher level object-oriented service-based application and platform modeling approach that would enable even faster modeling and earlier exploration. A disadvantage of the service-based method is that currently there are no tools that could transform (compile and synthesize) the model automatically to implementation level. Moreover, using very abstract models would reduce the accuracy of the architecture exploration that is one of the main phases of our implementation-oriented methodology.

In this paper, we describe the design flow with a case study, which is the design of a Wireless Local Area Network (WLAN) terminal. The focus of this paper is on the methods and tools for automating the SoC design flow rather than on the models of computation or UML modeling aspects.

Section 2 reviews the related research. Section 3 gives an overview of the Koski design flow and presents the different use case scenarios. Section 4 presents the application for the WLAN case study. The platform, which defines the architectural space, is presented in Section 5. Thereafter, each main design phase is examined in detail, starting from the UML design entry (Section 6) and ending with the physical implementation on an FPGA prototyping board (Section 10). Finally, concluding remarks are given at the end of the paper.

2. RELATED WORK

Several design frameworks have been proposed for high-level system design. Most of them concentrate on a specific design phase, such as modeling, code generation, verification, architecture exploration, or implementation, but rarely cover all of these phases. Moreover, the level of automation is often limited to internal automation of subtools, whereas the interoperation between tools is not automated.

Use of UML in embedded system modeling is an active research area [Lavagno et al. 2003]. To adapt UML for this purpose, several custom UML profiles have been proposed. In UML, a profile enables the adaptation of metamodels to specific purposes by extending existing metaclasses. For instance, the UML platform profile [Chen et al. 2003] introduces UML diagrams and notations to model architecture resources and services at different abstraction levels.
The UML profile for schedulability, performance, and time [Object Management Group (OMG) 2005] defines notations for building models of real-time systems with quality-of-service (QoS) parameters. It supports the interoperability of modeling and analysis tools, but does not specify a full methodology. The UML-RT profile [Selic and Rumbaugh 1998] defines execution semantics to capture behavior for simulation and synthesis. The profile presents capsules to represent system components, internal behavior of which is designed with state machines. However, the capabilities to model architecture and performance are very limited in UML-RT, and hence, it is not suitable for embedded system modeling, as such. HASoC [Green et al. 2002] is a design methodology that defines the modeling and refinement techniques for SoC design. It uses UML notations based on UML-RT profile for modeling an application and architecture. The design process begins from use-case models that are gradually refined via set of models toward the implementation. Currently, no tools for supporting the automatic transformations of HASoC design method are reported. A significant difference between Koski and HASoC is the technique to obtain the final implementation. Compared to HASoC, Koski design flow starts from lower application (functionally accurate state machines) and architecture (RTL library component) models.

While HASoC iteratively transforms the system models toward the implementation, Koski abstracts the initial models for fast architecture exploration and then utilizes the results of the exploration to optimize the coarse-grain architecture. The hardware components for the system-level architecture are taken from the Koski platform library, whereas the software for the processors can be automatically transformed (compiled) using an existing C code generator and compiler.

A design flow from UML to synthesizable SystemC code is described in Nguyen et al. [2004]. In this approach, the UML model is stored in XML format, which is converted to an abstract tree representation and translated to SystemC. There is also a link from SystemC back to the UML level, but automatic architecture exploration is not supported. The presented framework is based on I-Logix Rhapsody, which is a UML design tool for modeling and verifying systems as well as generating C, C++, or Ada application source code from the UML model. Another UML-SystemC flow is presented in Tan et al. [2004]. In this approach, the SystemC code is generated from a restricted UML description implemented with the Rational Rose RT tool.

Similar design flows are also based on other languages. For example, in Horn et al. [1999] and Marcon et al. [2002], the authors present two design flows that start from Specification and Description Language (SDL) and lead to VHDL models and, further, to physical implementations. The whole design is modeled in SDL and partitioned into HW and SW parts. The application code, HW description, and required interfaces are generated automatically, provided that the system is described with the supported subset of SDL. However, these design frameworks do not provide a back-annotation method from low-level stages to the design entry level.

The use of a synchronous language, such as Esterel, in system design is presented in Berry et al. [2003].
The language together with the Esterel Studio tool provides an integrated environment for control-dominated behavior capture, simulation, VHDL generation, and formal verification. Although the tool can be utilized in manual exploration of hardware/software partitioning, it does not support system-level architecture exploration. Ptolemy [Ptolemy project 2005], SpecC [Gerstlauer et al. 2001], Artemis [Pimentel 2005], CoFluent Studio [CoFluent homepage 2005], and Metropolis [Balarin et al. 2003] are well-known high-level design frameworks offering metamodels with formal semantics, tools, and methodologies for system-level design entry, simulation, analysis, and synthesis. Most of these focus on specific design aspects and are not capable of automating the entire flow from high-level description language to a physical chip with architecture exploration. Recently, the project related to SpecC has also proposed a method for estimating and exploring the architecture automatically [Cai et al. 2005]. Artemis, Metropolis, and CoFluent enable the orthogonalization of functionality and architecture in the same way as our approach. Metropolis is the only one that enables the representation of design constraints in system model similarly to our model. Among the current methodologies, Artemis is the closest match with Koski. Both of them separate the concerns by modeling the application and architecture strictly in separate models. Both methods abstract the application model

with the Kahn Process Network (KPN) model of computation and use that model to generate the load for the architecture model during the automated architecture exploration. A closer look at these commonalities, however, reveals significant differences between Koski and Artemis. Koski uses UML for the entire design entry, whereas Artemis supports a subset of Matlab for application description and Pearl or SystemC for architecture description. The most distinctive property of the methods is the model transformation. Artemis relies on the gradual refinement of the application model by trace transformations, which transform the application events into finer-grained events. Koski, in turn, utilizes the same abstract application model for each level of architecture exploration and relies on the existing code generators and compiler when refining (i.e., compiling) the application to the final processing element. In Koski, this significantly eases the design flow, as the number of model transformations is minimal. On the other hand, through the gradual model refinement, Artemis also supports automatic generation of synthesizable VHDL code from the KPN application model with the Laura tool set. Our approach is based on synthesizable library components that are automatically tuned for a specific application according to the results of the architecture exploration.

CoFluent Studio enables the graphical description of application behavior and platform model. The application modeling is very similar to UML statechart and activity diagrams. The tool can generate SystemC code from the graphical representation for time-annotated behavioral simulation. The architecture exploration is not automatic, but different component allocations and application mappings can be manually evaluated rather easily with drag-and-drop operations. CoFluent Studio is based on the MCSE methodology [Calvez et al.
1994], which defines the design phases and model transformations between the phases. However, MCSE shares the same design principles presented in the Y-chart design model [Kienhuis et al. 2002], in which application and architecture are modeled separately at each abstraction level, but can be evaluated together (after mapping) for performance estimation. Most current multiprocessor SoC design methods, including our design flow, are based on the Y-chart model.

The Roses design methodology [Cesário et al. 2002] addresses high-level component-based design by focusing on automatic generation of the hardware, software, and cosimulation interfaces. Although it does not support automation in architecture exploration, the interface generation facilitates the hardware-software integration. In this sense, Roses deals with lower-level design phases compared to our methodology.

A common property of the referenced design methods is the utilization of MoCs that describe functionality and architecture very accurately, making them suitable for model refinement. In our approach, the aim is to constrain the system model in each design phase with such an abstract MoC that only the necessary model information is included. In addition, while the other methodologies concentrate on automation of a specific phase, our framework automates the entire flow from the high-level system description to an optimized multiprocessor SoC implementation. Koski is, to our knowledge, the only framework that supports automatic back-annotation of performance results and modified system models from every exploration level (static analysis, simulation, and physical execution) to the UML domain. In addition, Koski provides control over each subphase from its graphical user interface.

3. FRAMEWORK OVERVIEW

As high-level SoC design methodology is a rapidly emerging research area, the terminology is not yet well established. The terms and concepts used in the framework of this paper are explained in the following list:

Application. An application is the functionality defined in the system specification.
Application process. A behavioral entity in an application, i.e., the application consists of a set of processes. Application task is used as a synonym in this context.
Platform. General term referring to all concrete parts of the system realization. A platform is a set of libraries that include both software and hardware models to be used in system refinement, as well as supporting models and descriptions for design automation.
Architecture. An instance of a platform. A set of hardware processing elements (PEs), memories, and their connections. These components can be instantiated from a library or described from scratch.
Architecture exploration. The architecture exploration can be divided into component allocation, task mapping, and scheduling. Allocation denotes the selection of processing elements and communication network. The application tasks are then mapped onto the allocated processing elements. Scheduling is used to define the order and timing of the execution of tasks and communication.
Design space. All the possible platform instances and mappings. With design space exploration, all or a portion of these combinations are analyzed with respect to an objective that is defined with a cost function. Architecture exploration evaluates the architectural part of the design space.
Cost, performance. The goodness of the platform instance is evaluated with a cost function.
In this paper, performance refers to the dynamic factors of the architecture, such as timing, latency, and throughput. Performance factors form a subset of the cost function parameters.
System-level. The system design can be roughly divided into system and component levels. System-level design deals with processing elements and their connections, whereas component-level design refers to the internal design and optimization of processing elements and communication networks. In this paper, the term high-level is used synonymously with system-level.
Platform-based design. In this methodology, a system is built on an existing technology platform using an underlying architecture completed with the required IP (Intellectual Property) blocks.

3.1 The Main Phases of Koski Design Flow

The purpose of the Koski design flow is to create an implementation that meets the specification. This implies that the Koski tool, in interaction with the designer, should produce an architecture to which the application is mapped. The design flow comprises a set of tasks including algorithm development, design entry, functional verification, architecture exploration, prototyping, and physical implementation. The design flow presented in this paper addresses all these phases by defining the modeling languages and levels, automating the tasks, and integrating the separate tasks into a seamless flow. The main phases of the design flow are shown in Figure 1, and the purpose of each phase is explained in Figure 2.

Fig. 1. Design flow tools and their connections.

Fig. 2. The focus of main design phases.

3.1.1 Specification and Requirements. As depicted in Figure 1, the design flow starts with the capturing of requirements for an application and architecture, including design constraints, such as the overall maximum cost.

3.1.2 UML Design. Following the requirements, the functionality of the system is described with an application model in a UML design environment and verified with functional simulations. The architecture design is based on the application model and the given platform. The relationship between application and architecture models is described with a mapping model.

3.1.3 UML Interface. The UML interface handles the transformation of the application and architecture models to abstracted models for fast architecture exploration. In particular, the functionally accurate application model is transformed to an abstract process network model. In addition, the UML interface back-annotates the information about the optimized architecture to the UML design for further evaluation and refinement.

3.1.4 Architecture Exploration. Finding an architecture for the described application is carried out with a two-phase automatic architecture exploration that consists of static and dynamic exploration methods. These phases are handled by the architecture exploration tool that examines the system models obtained from the UML level. The exploration is carried out by analyzing an extensive set of architectures, during which the utilized models are gradually refined. To control the architecture exploration, the designer constrains the design space by defining which parts of the platform can be used and which mapping combinations are allowed. In addition, the designer specifies the constraints for performance, area, and power.

3.1.5 Physical Implementation. The parts of the UML description that were mapped to processors during the architecture exploration are passed to the automatic code generation. The generated low-level software code and the component instances from the platform are then combined for physical implementation, which handles the real-time operating system (RTOS) integration, software executable generation, and hardware synthesis. As illustrated in Figure 2, the entire system, with accurate models, is executed and verified on a prototyping platform in this phase.
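
The interplay of designer constraints, mapping, and cost function described for the architecture exploration phase can be illustrated with a small sketch. It is not the Koski exploration tool: the task workloads, processing element speeds, area figures, the allowed-mapping table, and the weighted-sum cost function are all hypothetical values chosen for illustration, and the search is a plain exhaustive enumeration rather than the two-phase static and dynamic method of the paper.

```python
from itertools import product

# Hypothetical abstract workloads per task (operation counts).
TASKS = {"crc": 40, "encrypt": 120, "tdma_sched": 30}
# Hypothetical processing elements: relative speed and area cost.
PE_SPEED = {"cpu0": 1.0, "cpu1": 1.0, "crc_acc": 8.0}
PE_AREA = {"cpu0": 10, "cpu1": 10, "crc_acc": 3}

# Designer constraint: the PEs each task is allowed to be mapped to.
ALLOWED = {
    "crc": {"cpu0", "cpu1", "crc_acc"},
    "encrypt": {"cpu0", "cpu1"},
    "tdma_sched": {"cpu0"},
}


def cost(mapping, area_weight=1.0):
    """Weighted sum of the busiest PE's load (a time proxy) and allocated area."""
    load = {}
    for task, pe in mapping.items():
        load[pe] = load.get(pe, 0.0) + TASKS[task] / PE_SPEED[pe]
    busiest = max(load.values())
    # Only PEs that received at least one task count as allocated.
    area = sum(PE_AREA[pe] for pe in load)
    return busiest + area_weight * area


def explore():
    """Enumerate every allowed mapping and return the cheapest with its cost."""
    names = sorted(TASKS)
    choices = [sorted(ALLOWED[name]) for name in names]
    candidates = (dict(zip(names, combo)) for combo in product(*choices))
    best = min(candidates, key=cost)
    return best, cost(best)


if __name__ == "__main__":
    mapping, value = explore()
    print(mapping, value)
```

Allocation falls out of the mapping here: a processing element that receives no task contributes no area. Exhaustive enumeration is only feasible for toy design spaces, which is why a coarse static phase followed by a more accurate dynamic phase is attractive for larger ones.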
3.2 Main Architectural Options

The Koski framework consists of subtools for each design phase, interfaces to the external tools, and a graphical user interface (GUI). The Koski GUI combines the tools into a seamless flow, provides facilities for modeling the system and controlling subtools, handles the design project management, and visualizes the execution of tools. For instance, the result of the cost function and the parameters it depends on are plotted during architecture exploration. A screenshot of the GUI is shown in Figure 3.

With the Koski GUI, it is possible to use different combinations of subtools, depending on the design scenario. In particular, the need for extensive architecture exploration varies significantly, depending on the system requirements and the existing subsystems. The following list describes the most common high-level architectural options:

3.2.1 Fixed Single-Processor Architecture. There is no need for system-level architecture exploration and, thus, the exploration is not utilized. The rest of the design flow is still required, including the UML system design, verification, and physical implementation phases.

Fig. 3. The graphical user interface of the Koski framework. The Koski main window is on the left, architecture exploration progress windows at top right, and the UML modeling environment (Tau G2) window at bottom right.


3.2.2 Fixed Multiprocessor Architecture. As the architecture is fixed, the allocation during exploration is unnecessary. If the mapping is not fixed, the task mapping and scheduling are performed with exploration tools.

3.2.3 Limited Multiprocessor Architecture. The size of the design space is related to the available platform components and the given design constraints. By constraining the architecture to only a few components, the coarse architecture exploration (with the static method) can be omitted.

3.2.4 Large Multiprocessor Architecture. This implies complex applications with an extensive platform. The designer concentrates on the application modeling and does not have to specify the architecture or the mapping explicitly. Instead, all the framework phases are utilized to achieve a reasonable architecture exploration time and a satisfactory optimization result.

4. CASE STUDY: WIRELESS TERMINAL DESIGN

Before illustrating the design process with the presented framework, the application to be implemented is first introduced. It is a medium access control (MAC) protocol, which is a part of a Wireless Local Area Network (WLAN) terminal called TUTWLAN [Hännikäinen et al. 2003]. The development of TUTWLAN contains the design and implementation of the TUTMAC protocol, different test applications, and prototype hardware platforms for the evaluation and testing of the system in actual operation environments.

TUTMAC is a dynamic reservation time division multiple access (TDMA)-based MAC protocol. The functional architecture of the protocol is presented in Figure 4. The protocol is divided into separate payload data processing and protocol management planes. The protocol services are accessed through four service access points (SAP) for data transfers and management.

Fig. 4. TUTMAC protocol functional presentation.

The data plane contains interfaces to adjacent protocol layers, a number of functions for data processing, and the TDMA scheduling for the radio channel access. In TUTMAC, TDMA slots are dynamically allocated to active terminals in a network. The management plane contains management interfaces and functions for protocol and station management.

In the TUTWLAN topology, a base station contains extended functionality compared to a portable terminal. On the data processing side, there are specific functions for data forwarding between portable terminals and for bridging with a wired backbone LAN. On the management side, the base station protocol manages the radio usage on the network, the terminal registrations and authentications, and data channel allocations to different portable terminals. More details on the protocol can be found in Hännikäinen et al. [2003].

The protocol supports data transfer quality of service (QoS), including reservable throughput and data security. For meeting QoS, there are real-time requirements that must be met in the implementation. Thus, several time-critical functions based on real-time requirements can be identified from the protocol. These functions are cyclic redundancy check (CRC) sum calculations of TUTMAC frames, TDMA synchronization, data encryption with AES or the improved wired equivalent privacy (IWEP) algorithm, and forward error correction (FEC). The functions need time-bounded execution in order for the TUTMAC protocol to meet the delay requirements when reacting to received and transmitted frames.
Hence, the design flow must provide means to guide the architecture exploration tool in allocating accelerators for these time-critical tasks.

5. PLATFORM

With the presented design flow, an implementation of the TUTMAC application is developed by instantiating components from the TUTWLAN platform and mapping the application to the architecture defined by these components. Figure 5 depicts the platform contents in general. The platform is a set of libraries that include hardware processing elements, communication networks, software algorithm implementations, interfacing software, and supporting models for design automation. The platform contents can be extended during the system design with custom components. A new component can be put into the library by following the requirements for interfaces and description formats. An example instance of a platform is shown in Figure 6.

Fig. 5. The library contents of the utilized platform. The numbers in parentheses denote the tools utilizing the library component: (1) UML modeling environment; (2) application implementation; (3) static architecture exploration; (4) dynamic architecture exploration; and (5) physical implementation.

Fig. 6. Example of a platform instance.

Although the platform-based methodology restricts the implementation space, it enables more efficient architecture exploration and system implementation, since for each platform component there exists a preverified and usually configurable model that is described on several abstraction levels. For example, in the presented platform, there are several models of the communication network to provide a suitable model abstraction for each design phase.

5.1 Software Library

The software library includes the preimplemented application algorithms as well as platform-independent and platform-dependent software. In this case study, the library contains the C implementations for the time-critical functions in TUTMAC that were defined in Section 4. The UML models were developed from scratch, as there were no library components available. The software library also includes the platform-independent C code for implementing the communication and handling the common data between state machines.

In the TUTMAC design, the eCos [eCos homepage 2005] RTOS is utilized for executing multiple threads on a processor, as depicted in Figure 6. eCos is an open-source RTOS intended for embedded applications. A key feature of eCos is the configuration system that enables the building of an application-specific OS. The configurability and real-time properties make eCos a good solution for the TUTMAC system. In addition to the RTOS, the library includes platform-independent software layers for interfacing application functions to the RTOS, as well as a platform-dependent hardware abstraction layer (HAL) and device drivers for interfacing the OS to the target processor.

5.2 Hardware Library

The hardware library consists of communication network and processing element models with different levels of abstraction. The TUTMAC system utilizes the HIBI communication network [Salminen et al. 2004]. The hardware library includes an abstract HIBI model for fast architecture exploration and an accurate synthesizable model for detailed architecture exploration and physical implementation. The HIBI network separates computation from communication by hiding architectural complexity with simple interfaces and protocols. The simple structure, parameterization, and scalability are the HIBI properties that are utilized in the design flow to optimize the communication between concurrent processing elements.

The processing element library contains the processors and hardware accelerators to be used in the TUTWLAN platform. In this case study, the system will be implemented on an Altera Stratix FPGA and, therefore, the processor models have to be synthesizable. Here we used the Altera NIOS II [Altera homepage 2005] general-purpose 32-bit RISC processor core. A custom DMA controller was designed to connect NIOS processors to the HIBI network. The accelerators in the TUTMAC design are hardware counterparts for the software implementations of the specific time-critical functions (AES, CRC). Thus, either the software or the hardware implementation can be selected during the architecture exploration.

5.3 Design Automation Library

The third main part of the platform is the library for design automation. The functions that support the application distribution on a multiprocessor system at run-time are included in the library. These functions determine whether the target process of a signal resides on the same or another processor. In the latter case, the signal is transmitted to the target processor by calling interprocessor communication (IPC) routines.
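
The run-time decision between local delivery and IPC can be sketched as follows. This is an illustrative model only, not the library's C implementation: the `Router` class, the process-to-processor table, and the `ipc_send` callback are hypothetical names introduced for this example.

```python
# Hypothetical mapping result: process name -> processor it was mapped to.
PROCESS_LOCATION = {"crc": "cpu0", "tdma": "cpu0", "encrypt": "cpu1"}


class Router:
    """Delivers a signal locally or hands it to an IPC routine."""

    def __init__(self, local_pe, location, ipc_send):
        self.local_pe = local_pe      # processor this router runs on
        self.location = location      # process -> processor table
        self.ipc_send = ipc_send      # callable(pe, target, signal)
        self.local_queue = []         # signals for processes on this PE

    def send(self, target, signal):
        """Route one signal and report which path it took."""
        pe = self.location[target]
        if pe == self.local_pe:
            # Target process resides on the same processor.
            self.local_queue.append((target, signal))
            return "local"
        # Target process resides elsewhere: use interprocessor communication.
        self.ipc_send(pe, target, signal)
        return "ipc"


if __name__ == "__main__":
    sent = []
    router = Router("cpu0", PROCESS_LOCATION,
                    lambda pe, target, signal: sent.append((pe, target, signal)))
    print(router.send("crc", "frame"), router.send("encrypt", "frame"), sent)
```

Because the lookup table is data rather than code, the same application functions can be redistributed across processors after a new exploration result simply by replacing the table.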
The library also contains both hardware and software that are utilized for profiling the state machine communication and execution activity. The profiling functions that are combined with the application functions are required for the automatic performance evaluation. Similarly, the design automation library contains hardware monitors for evaluating the hardware-related performance, such as communication network utilization, throughput, and latency.

6. TUTMAC DESIGN WITH UML 2.0

The design entry, including the descriptions for application, architecture, and design constraints, is given in a UML 2.0 design environment. Telelogic Tau G2 [Telelogic Homepage 2005] is utilized for this purpose. The tool was selected on the basis of its support for new UML features, graphical user interface, and mature C code generation as well as debugging facilities. The TUTMAC system is described in Tau G2 following the modeling principles defined by a new UML profile, TUT profile. TUT profile defines a set of stereotypes for extending the UML metaclasses and design practices to describe applications (including real-time requirements) and architectures, as well as

their mapping. The objective is to enhance the support of external tools for automated analysis, profiling, and modifying the UML model of an embedded system. The profile and the TUTMAC system description are presented in detail in Kukkala et al. [2005b].

System design is divided into three parts: application, architecture, and mapping modeling, as shown in Figure 1. Both the application and architecture models can be developed independently of each other. TUT profile mainly concerns the structure of the application and the architecture. The application is seen as a set of active classes with an internal behavior. The platform is seen as a component library with a parameterized presentation in UML 2.0 for each library component. The profile does not restrict the behavioral modeling and, by default, it utilizes standard UML 2.0 concepts for this. The composite structure diagram is the principal UML construct in TUT profile, as the connections between processes as well as processing elements are described with it. In addition, the process grouping and group mapping are modeled in a composite structure diagram.

Fig. 7. TUTMAC application modeling in UML.

6.1 TUTMAC Application Model

The design of an application model starts from the definition of the class hierarchy with a class diagram. The top-level application class and its components are created, and the associations between components are defined as presented in Figure 7a. When the class hierarchy is defined, composite structure diagrams are used to describe the connections between parts (class instances). The parts communicate with each other by signals via their ports.

The composite structure diagram of the top-level class Tutmac Protocol is presented in Figure 7b. The data flow of the protocol goes through user interface, data processing, service support, and radio channel access classes. Management and radio management access classes are used for protocol management purposes. The user interface implements the reception and delivery of MAC service data units

(MSDU). The data processing handles the user data included in an MSDU, fragments the MSDU into multiple MAC protocol data units (MPDU), and generates MPDU headers. The service support queues MPDUs for the TDMA scheduling and takes care of retransmissions and acknowledgments. The channel access controls the utilization and synchronization of the shared transmission medium. The management controls the behavior of the protocol, while the radio management access controls the radio.

The mng and rmng parts are instances of the functional components and they represent the processes of the application. The ui, ss, rca, and dp parts are instances of the structural components. The structural components are hierarchically modeled using class diagrams and composite structure diagrams, until the behavior of the functional components can be expressed. The difference between structural and functional components is that a functional component has its behavior defined in the topmost hierarchy level, whereas a structural component consists of one or more subcomponents hierarchically.

The behavior of the functional components in TUTMAC is described using statechart diagrams combined with the UML 2.0 textual notation. Statecharts are asynchronous communicating extended finite state machines (EFSM) [Gnesi et al. 2002]. TUT profile defines an application process and sets requirements on its interface. The internal semantics of the process are not restricted; we are more interested in the external behavior of the processes, particularly the communication.

Figure 8 depicts a part of a statechart describing the functionality of CRC. The functionality is modeled with standard UML 2.0 facilities, which are extended with TUT profile. For example, the performance information can be embedded in a statechart as shown in Figure 8. There are two values related to the transition trigger symbol, giving both the required and measured times for the state transition in terms of clock cycles. The value is the execution time of a defined path in a statechart. In Figure 8, the length of the path is one state transition. A more detailed description of the performance modeling and other details of TUT profile are given in Kukkala et al. [2005a, 2005b].

There are three reasons why the statechart diagrams have been chosen. First, Tau G2 can currently generate C code from statecharts. Second, in our target application and architecture domains, the application is partitioned into processes at such a coarse granularity that each process is likely to include control structures. Third, we claim that statecharts can be applied successfully to both dataflow and control-intensive applications. A dataflow process can be modeled with a statechart that includes no actual state information and, thus, defines a state-independent behavior. Hence, we can utilize the same action semantics and code generator for both control and dataflow applications. In the TUTMAC case, the CRC-32 and AES functions are modeled with state-independent statecharts. However, the scheduler included in the generated code by Tau G2 is not optimized for dataflow applications and, hence, the implementation is not very efficient. Therefore, we are developing a custom scheduler to support scheduling strategies for different applications.

Fig. 8. Statechart of the CRC embedded with required and measured performance information. Performance value is given in terms of the numerical value, its units, whether it is required (req), measured (msr), or estimated (est), and whether the value is a minimum (min), mean, or maximum (max).

Fig. 9. TUTMAC initial architecture modeling in UML.

6.2 Architecture Model for TUTWLAN

To give a starting point for architecture exploration, the initial architecture for the TUTWLAN terminal is described in a composite structure diagram, as illustrated in Figure 9. It contains five processing elements connected to a HIBI network. Three of the processing elements are identical NIOS II processors. In addition, there is a hardware accelerator for CRC-32 calculation and a radio interface to access an external radio. The presented architecture may be changed during the allocation phase of the architecture exploration. The architecture model can also be left

undefined, in which case the architecture is defined automatically by the architecture exploration tool.

An accurate behavioral model of the architecture components is not required in TUT profile. Instead, the configuration of components is described with a set of parameters given as UML tagged values. For instance, clock frequencies and cache sizes are defined for processors. Similarly, data width, clock frequency, arbitration scheme, and latency are parameters given for the communication channel. The connection between a processing element and a HIBI segment is realized with a HIBI wrapper. Each connection is tagged with parameters, such as buffer sizes, priorities, and addresses, which correspond to the parameters in the RT-level HIBI wrapper component. This is the level of the architecture model that is required for efficient architecture exploration. For the final implementation, the low-level synthesizable models from the platform library will be utilized. The platform library describes the component capabilities and costs, which are utilized in the architecture exploration and physical implementation phases.

6.3 Mapping of TUTMAC to the TUTWLAN Platform

Mapping of the application to the architecture is performed in two stages. First, the application processes are grouped as depicted in Figure 10. In this diagram, the structural hierarchy of the application model is not visible; only functional components can be grouped. The process grouping provides a platform-independent method for defining the application structure in the final implementation. There, the grouping can be realized, for instance, with RTOS threads, each including one group of processes. Consequently, if a group is to be implemented with an RTOS thread in the final implementation, the grouping defines the composition and priorities of threads.

Fig. 10. TUTMAC process grouping.

Fig. 11. Mapping the TUTMAC protocol to the TUTWLAN platform.

The grouping can be performed according to different criteria, such as the preliminary scheduling of application processes, workload distribution, communication between process groups, dependencies between process groups, and the size of a process group (code size, memory requirements). For instance, in TUTMAC, the initial grouping in Figure 10 shows a group named HighPrio, which includes processes that are known to communicate frequently with each other during the TDMA scheduling.

The platform-dependent mapping is carried out by mapping the process groups to architecture components. The mapping given by a designer serves as a starting point for architecture exploration and, similarly to the architecture model, can even be left undefined if the automatic mapping optimization is enabled. The initial mapping in Figure 11 integrates the TUTMAC application model and the architecture model of a TUTWLAN terminal. As seen, there is one process group mapped to each processing element. However, several groups could be mapped to one processing element. The first three groups (HighPrio, DataProcessing, Management) are mapped to processors 1 to 3, all of which are NIOS II processors. Group CRC has processes that can be implemented on an existing hardware accelerator and, thus, the process group is mapped to a CRC type of processing element. Group RadioIF, containing the radio interface process, is also mapped to its HW counterpart (RadioInterface) in the platform.

Both the initial mapping and grouping are utilized to help the automated tools in exploration by giving a potentially good starting point and by reducing the exploration space. Even with a small number of processes and processing elements, the exploration space can be very large because of the high degree of parameterization of platform components. Therefore, the initial mapping model is also useful with small applications.
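As a rough illustration of how an initial mapping with fixed entries narrows the exploration space, the following Python sketch represents the group-to-element mapping as plain data. The group and element names follow the text; the class names, the `fixed` flag representation, and the choice of which mappings are marked fixed are assumptions made for this sketch, not the TUT profile notation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingElement:
    name: str
    kind: str            # e.g. "NIOS II", "CRC", "RadioInterface"


@dataclass(frozen=True)
class GroupMapping:
    group: str
    pe: str
    fixed: bool = False  # assumed flag: the exploration must not move it


PES = [
    ProcessingElement("processor1", "NIOS II"),
    ProcessingElement("processor2", "NIOS II"),
    ProcessingElement("processor3", "NIOS II"),
    ProcessingElement("crc_acc", "CRC"),
    ProcessingElement("radio_if", "RadioInterface"),
]

INITIAL_MAPPING = [
    GroupMapping("HighPrio", "processor1"),
    GroupMapping("DataProcessing", "processor2"),
    GroupMapping("Management", "processor3"),
    GroupMapping("CRC", "crc_acc", fixed=True),
    GroupMapping("RadioIF", "radio_if", fixed=True),
]


def movable_groups(mapping):
    """Groups that the exploration tool is allowed to remap."""
    return [m.group for m in mapping if not m.fixed]


def candidate_count(mapping, pes, kind="NIOS II"):
    """Number of ways to place the movable groups on processors of `kind`."""
    targets = [pe for pe in pes if pe.kind == kind]
    return len(targets) ** len(movable_groups(mapping))


print(movable_groups(INITIAL_MAPPING))
print(candidate_count(INITIAL_MAPPING, PES))
```

This count covers group placement only. Each platform component adds its own parameters on top, which is why the space grows quickly even for small systems.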
The characteristics of the TUTMAC UML model are illustrated in Table I. The composite structure diagrams are hierarchical descriptions and represent the instantiation of classes. As the statecharts model the functionality, they are

used only in the application model. As the table shows, the TUTMAC model is very complex, consisting of twenty communicating processes implemented with dozens of statechart diagrams. Therefore, manual architecture and mapping optimization is unlikely to result in the optimal solution.

Table I. The Utilization of UML Facilities in TUTMAC UML Model

                               Application     Initial Architecture     Initial Mapping
Class diagrams
Composite structure diagrams
Statechart diagrams            43
Other                          20 processes    5 processing elements    5 groups
                               52 ports        5 HIBI wrappers          3 fixed mappings
                               37 signals      1 HIBI segment           17 optimizable mappings
                               4 timers

6.4 Defining the Design Constraints

The design constraints are defined in the UML system model by parameterizing the platform model with architecture exploration specific attributes. The design constraints define the limits for optimization variables. The essential part of the constraints is the definition of the cost function, which is given in the Koski GUI. It enables the comparison of architectural options by unambiguously ranking each optimization candidate. The designer can define an arbitrary cost function in a string format by using the predefined variables, such as area, power, execution_time, network_utilization, and pe_utilization. Hence, the designer controls the optimization result by defining the parameter weights and operations of the cost function.

To present the real-time requirements, the required execution time for each application process can be indicated by tagged values in the UML model. These values are given as constraints to architecture exploration and they can later be compared to the realized values in the physical implementation. Similarly, the constraints related to the platform are defined in the architecture model of UML, as illustrated in Figure 12.
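The idea of a designer-supplied cost function string can be sketched as follows. This is an illustration only: the expression syntax accepted by the Koski GUI is not specified in the text, so the sketch assumes ordinary arithmetic over the predefined variable names, and the function name `evaluate_cost` and the metric values are hypothetical. The expression is parsed with Python's `ast` module and only whitelisted names and operators are evaluated.

```python
import ast
import operator

# Variable names are taken from the text; the expression syntax is assumed.
ALLOWED_VARS = {"area", "power", "execution_time",
                "network_utilization", "pe_utilization"}
BIN_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv,
           ast.Pow: operator.pow}


def evaluate_cost(expr: str, metrics: dict) -> float:
    """Evaluate an arithmetic cost expression over the given metrics."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if (isinstance(node, ast.Constant)
                and isinstance(node.value, (int, float))
                and not isinstance(node.value, bool)):
            return node.value
        if isinstance(node, ast.Name) and node.id in ALLOWED_VARS:
            return metrics[node.id]
        if isinstance(node, ast.BinOp) and type(node.op) in BIN_OPS:
            return BIN_OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported element in cost expression")
    return ev(ast.parse(expr, mode="eval"))


# Rank two hypothetical candidates with a hypothetical weighting.
expr = "area * execution_time ** 2"
candidates = {
    "one_cpu": {"area": 10, "execution_time": 5},
    "two_cpus": {"area": 20, "execution_time": 3},
}
ranking = sorted(candidates, key=lambda c: evaluate_cost(expr, candidates[c]))
print(ranking)
```

Changing the exponent or adding weighted terms to the string changes the ranking without touching the exploration tool, which is the control the designer is given.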
Although not utilized in this case study, the real-time constraints can be taken into account in the architecture exploration by defining them in the cost function. The hard real-time requirements must be verified in any case in the final implementation, since the models utilized in the exploration are abstract and possibly include inaccuracies. Even if the real-time requirements were not utilized in the architecture exploration, the path from UML to the final implementation is fast and automatic. Therefore, the number of iterations for evaluating the small set of alternatives in physical implementation is not a critical issue.

7. APPLICATION IMPLEMENTATION AND VERIFICATION

An application is implemented using the application model designed with UML 2.0. The application verification and implementation is carried out in four steps: automatic code generation, application build, functional verification, and application profiling. The automatic code generation produces the source code

including functionality and data types. In the application build, the generated code is complemented with supporting libraries. The functional verification is performed by simulations to verify the functionality of the application model. The application profiling is based on the execution trace gathered during simulations. The application implementation and verification, and its relation to the rest of Koski, are depicted in Figure 13.

Fig. 12. UML implementation of design constraints and optimization control.

7.1 Automatic Code Generation

The application model is purely functional, as the behavior of the application is designed using state machines in active classes. Consequently, source code can be generated automatically for conventional programming languages. We have used Telelogic Tau G2 for C source code generation.

The code generation produces platform-independent C code, which implements all the functionality of an application model. The code is generated for the active classes containing state machines as well as for classes defining custom data types and containing operations. As a result, the generated code implements the internal functionality of each application process but not their communication. It has to be complemented with a common run-time library implementing the communication between state machines and the handling of common data types. The generated code includes only calls to the Koski library functions, which are linked together in the implementation. This enables the implementation of application-independent libraries supporting Koski.

7.2 Application Build and Simulation

The application build bundles all the necessary software components to implement and integrate an application into a specific platform. These components include the generated code, run-time libraries, profiling functions, and platform-specific functionality. The state machines described in the application model may also include externally defined algorithms that are included in the application during the application build.

Fig. 13. Code generation and functional verification.

For the functional verification, the application build produces an executable application, which is executed in a workstation environment, such as Windows or Linux. The executable is a plain software implementation, as no specific hardware components are included and the architecture and mapping models are completely ignored. The executable application is complemented with debugging facilities to control and trace the execution.

The functional verification is based on simulations. The simulations are carried out using Tau G2 Model Verifier, which provides a graphical interface to the simulator. The simulations can be observed using three different tracing methods: sequence diagram, UML model, and textual tracing. The simulations and functional verification can be performed as soon as an application model with some functionality has been designed. As the application model is refined, the simulations can be performed to immediately verify the modifications. This facilitates identifying potential errors at a very early design phase and speeds up the design and implementation.

7.3 Application Profiling

During the functional verification, an execution trace is collected utilizing the profiling functions included during the application build. The profiling functions are called at the essential events of the execution: the state transitions, signal outputs, timer events, and thread switches. In each event, state and signal

identifiers and times are noted and formatted in a custom XML format. A snippet of the trace file showing two state transitions and three thread switches is depicted in Figure 14.

Fig. 14. A trace file from the TUTMAC execution.

The use of the execution trace is focused on the profiling of the state machine communication and execution activity. This gives information about the amount of communication and transferred data between state machines and about the most active state machines. The information is back-annotated to the UML application model with a tool described in Section 8.4. The execution trace is also utilized in the generation of the application model for architecture exploration. A more detailed description of this is given in a later section.

8. UML INTERFACE

Each design phase produces information that can be exploited in other phases. In our framework, the information exchange is handled with an interface tool that combines the separate design phases into a seamless flow. The UML interface transforms the models and converts the formats between the UML design environment and the back-end tools, including functional verification, architecture exploration, and physical implementation tools. The implemented interface tools consist of a UML application parser, UML architecture and mapping parser, UML application profiler, UML application performance back-annotator, and UML architecture and mapping back-annotator. These tools are emphasized with a gray background in Figure 15.

8.1 Intermediate Format: XML System Model

Telelogic Tau G2 uses a tool-specific XML format when saving a UML design to a file. The file is structured as an XML tree having a deep hierarchy. In addition to the actual UML model, the file contains coordinate information for the graphical representation. To clarify the model representation and to facilitate tool compatibility, a custom internal XML system model (XSM) has been developed to be used between all the design flow tools.
In conjunction with the


platform libraries, the XSM contains the necessary UML model information, as well as the architecture optimization and performance back-annotation information. Figure 16 shows a part of an abstract application process in XSM. For example, the XSM contains the description of the process network, including the process parameters in addition to their connections and data dependencies.

Fig. 15. Interfacing the UML design environment to the verification and exploration tools.

Fig. 16. A fragment of the TUTMAC XSM illustrating the process network model. The example process has three input and output ports. It is triggered when data is received in any of the inputs. The data is then processed for 2000 cycles, after which 36 bytes are written to output port 1.
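The behavior that the Figure 16 caption attributes to the example process can be paraphrased as a small executable sketch. The Python class below is not the XSM schema (XSM is an XML format whose element names are not given here); the class name, field names, and port numbering are assumptions chosen only to mirror the caption.

```python
from dataclasses import dataclass


@dataclass
class AbstractProcess:
    """Abstract process: a trigger, a cycle count, and an output write."""
    name: str
    in_ports: tuple
    out_ports: tuple
    exec_cycles: int
    out_port: int
    out_bytes: int

    def fire(self, in_port: int, now: int):
        """Trigger on data arriving at any input port.

        Returns the completion time and the (port, bytes) write that
        happens once processing has finished.
        """
        if in_port not in self.in_ports:
            raise ValueError("unknown input port")
        done = now + self.exec_cycles
        return done, (self.out_port, self.out_bytes)


# Values from the caption: three ports each way, 2000 cycles, 36 bytes.
example = AbstractProcess("example", (0, 1, 2), (0, 1, 2), 2000, 1, 36)
print(example.fire(in_port=2, now=100))
```

Only timing and communication volume are kept in such a model, which is the level of detail the exploration tools work with.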

8.2 UML Parsers

The UML application parser and the UML architecture and mapping parser examine the XML tree and recognize UML classes, class instances, signals, and dependencies between different objects. The parsers determine the meaning of each object based on the assigned TUT profile stereotype.

The UML application parser does not consider the actual functionality of the application processes. Instead, it recognizes the assigned tagged values that indicate the required execution time for each application process. This information is combined later with the UML application profiler result to obtain a time-annotated model of the application. Similarly, the UML architecture and mapping parser determines the architecture components, their tagged values, and dependencies between the components. The mapping is found by searching all dependencies connecting application processes and architecture components that are associated with the TUT profile stereotypes. Finally, the created system model is converted to XSM.

8.3 UML Application Profiler

The UML application profiler examines the execution trace obtained from the functional verification phase and converts the trace to a process network model defined by XSM. The execution trace is input dependent. Hence, it may lead to an incomplete process network if all the processes are not executed for the given input data. However, profiling is quite often the only method that is fast and accurate enough, especially in reactive applications. In these, the execution is highly conditional by nature and cannot be modeled using pure dataflow MoCs. The UML application profiler itself does not set any special requirements for the UML application design or model of computation.

The UML application profiler recognizes the application processes from the execution trace. In addition, it detects the dependencies between processes. Time stamps are used to determine the execution time for each process.
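The profiling step can be sketched as follows. The trace format used here is invented for illustration: the element and attribute names (`transition`, `signal`, `process`, `start`, `end`, `from`, `to`, `bytes`) are assumptions and do not reproduce the custom XML format of the actual trace files. The sketch accumulates per-process execution time from time stamps and per-edge traffic from signal events.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical trace; times are in clock cycles.
TRACE = """
<trace>
  <transition process="dp" start="100" end="350"/>
  <signal from="dp" to="crc" bytes="36" time="350"/>
  <transition process="crc" start="400" end="2400"/>
  <signal from="crc" to="ss" bytes="4" time="2400"/>
  <transition process="dp" start="2500" end="2700"/>
</trace>
"""


def profile(trace_xml: str):
    """Return (execution cycles per process, bytes per process pair)."""
    exec_cycles = defaultdict(int)
    traffic = defaultdict(int)
    for event in ET.fromstring(trace_xml.strip()):
        if event.tag == "transition":
            # Execution time comes from the difference of the time stamps.
            duration = int(event.get("end")) - int(event.get("start"))
            exec_cycles[event.get("process")] += duration
        elif event.tag == "signal":
            # A signal between two processes is a dependency edge.
            edge = (event.get("from"), event.get("to"))
            traffic[edge] += int(event.get("bytes"))
    return dict(exec_cycles), dict(traffic)


exec_cycles, traffic = profile(TRACE)
print(exec_cycles)
print(traffic)
```

In this hypothetical trace the process `ss` receives a signal but never executes, so it has no execution time entry. This is the input-dependent incompleteness mentioned above.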
The result is an XSM description, like the one depicted in Figure 16, which is the application model used in the architecture exploration. The initial hardware architecture and mapping have to be determined with the UML architecture and mapping parser.

8.4 UML Back-Annotation

In our flow, there are two tools to modify the original UML models. The measured execution times for each application process are back-annotated with the UML application performance back-annotator. The information can be added as tagged values to the TUT profile stereotypes defining the application performance. This tool does not change the structure or the functionality of the application.

In addition, the performance information can be visualized for a designer using sequence diagrams. The Koski framework provides an extension to the UML metamodel to include the message latency and execution time in statecharts [Kukkala et al. 2005a]. The information may contain both the real-time constraints and measured values that are automatically back-annotated to the

UML model. The measured performance can be verified visually against the requirements by using sequence diagrams. An example of the generated performance report is presented in Figure 17. The performance report presents the sequence of messages and transitions, their message latencies and execution times, and the execution of threads.

Fig. 17. Sequence diagram of the data frame reception and acknowledgment frame transmission in TUTMAC.

Another task is the updating of the optimized architecture and mapping obtained from the architecture exploration tools. This is done with the UML architecture and mapping back-annotator, as shown in Figure 15. This tool automatically applies the modifications to the original UML model. Therefore, UML models and the corresponding lower-level models are always synchronized. Moreover, the results of the architecture exploration, and information on how the set requirements are met, are shown to the designer. The back-annotation allows the designer to observe the modifications at the UML level, which facilitates the management of the design. In addition, a designer can easily modify the design further and, if necessary, override the back-annotated modifications.

9. ARCHITECTURE EXPLORATION

After the application, the initial architecture, and the design constraints are modeled in the UML environment, the architecture exploration tools start optimizing the system. Exploration attempts to find an optimal selection of platform components and mapping of the tasks. Mapping in the UML environment is not required, but it can be used to guide the architecture exploration tool. On the other hand, the mapping of a task can be indicated as fixed in the initial mapping, for example, when mapping a task to a HW accelerator.

In the Koski flow, architecture exploration is carried out in two phases. First, coarse-grain exploration is performed by statically analyzing the application model. Architectures are then explored with iterative simulations and more accurate system models. The optimization objective is to minimize the result of the cost function that the designer has defined in the UML design environment. The control for the architecture exploration is described in the Koski GUI. The exploration control parameters are mainly for restricting the iterations of the allocation, mapping, and network parameter optimization.

9.1 System Abstraction for Architecture Exploration

The objective of the abstraction is to hide the unnecessary details to minimize the total exploration time. The primary purpose is not functional verification, since the functionality was already verified in earlier design phases. Therefore, the behavioral accuracy can be omitted, assuming that the external behavior of application tasks (communication and timing) is preserved. The UML models that are abstracted for architecture exploration were depicted in Figures 7 to 9.

The application is modeled as communicating tasks, as depicted in Figure 18. The application model is based on the Kahn process network model [Kahn 1974]. A similar abstraction is also utilized in Artemis [Pimentel 2005]. The complexity of the application tasks is characterized by profiling the application automatically as described in Section 7.3. The generated code is executed either in a reference platform (such as x86) or in the final platform (NIOS II in this case study). In Koski, it is possible to select which one of these automated profiling methods is utilized. In the case study, we utilized the development board for profiling but using the NIOS instruction set simulator would


have been possible as well. Therefore, we were able to achieve a very accurate characterization of the application. The new functional components of the application are automatically characterized during the profiling, which is always performed before the architecture exploration.

Fig. 18. Abstraction of both the application and the architecture for the exploration.

A processing element is characterized with properties such as performance (operations/cycle), area, power, and available internal memory. As the static architecture exploration is based on application analysis, the model of the communication architecture is coarse, including the bandwidth, area, and power of the communication. In the dynamic method, the communication network model is either at the cycle-accurate level or at the transaction level with timing. The accuracy is rather high compared to the application and processing element models, since the communication timing can have a major effect on the overall system performance.

9.2 Static Method

The static part of the architecture exploration analyzes the application model to optimize the allocation, mapping, and scheduling [Orsila et al. 2005]. It is used for fast, coarse, input-independent analysis. Internally, the static method converts the process network application model to a directed acyclic graph to enable a static analysis.

The outline of the static method is illustrated in Figure 19. The initial candidates for allocation and mapping are obtained from the UML model. As mentioned, the initial candidates can be labeled as fixed so that the exploration tool does not modify them. The optimization process goes through an allocation-mapping-scheduling cycle. For each processing element allocation, a number of task mappings are performed. Several scheduling combinations are then examined for each allocation-mapping pair.
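The nesting of the allocation-mapping-scheduling cycle can be made concrete with a toy sketch. It enumerates every candidate exhaustively, which is feasible only for tiny graphs and is not how the actual tool searches. The task graph, the cycle counts, the in-order list scheduler, and the cost `n_pes * makespan ** 2` (number of identical processing elements as a stand-in for area) are all assumptions made for this illustration, and communication cost is ignored.

```python
import itertools


def makespan(mapping, order, cycles, deps):
    """Length of a schedule that runs tasks in `order` on their mapped PEs."""
    pe_free, finish = {}, {}
    for task in order:
        ready = max((finish[d] for d in deps.get(task, ())), default=0)
        start = max(ready, pe_free.get(mapping[task], 0))
        finish[task] = start + cycles[task]
        pe_free[mapping[task]] = finish[task]
    return max(finish.values())


def valid_orders(tasks, deps):
    """All task orders in which every task follows its predecessors."""
    for order in itertools.permutations(tasks):
        pos = {t: i for i, t in enumerate(order)}
        if all(pos[d] < pos[t] for t in tasks for d in deps.get(t, ())):
            yield order


def explore(cycles, deps, max_pes):
    """Return the best (cost, n_pes, mapping, order) found."""
    tasks = list(cycles)
    orders = list(valid_orders(tasks, deps))
    best = None
    for n_pes in range(1, max_pes + 1):                     # allocation
        for assignment in itertools.product(range(n_pes),
                                            repeat=len(tasks)):
            mapping = dict(zip(tasks, assignment))          # mapping
            for order in orders:                            # scheduling
                cost = n_pes * makespan(mapping, order, cycles, deps) ** 2
                if best is None or cost < best[0]:
                    best = (cost, n_pes, mapping, order)
    return best


# Hypothetical DAG: c depends on a; d depends on b and c.
CYCLES = {"a": 4, "b": 3, "c": 2, "d": 1}
DEPS = {"c": ["a"], "d": ["b", "c"]}
print(explore(CYCLES, DEPS, max_pes=3))
```

With this toy cost, adding a second processing element shortens the schedule enough to pay for itself, while a third cannot shorten the critical path any further.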
Allocation chooses 1 to M processing elements for mapping and scheduling, where M is the upper bound for the number of processing elements. In the TUTMAC case study, all the processing element targets are identical in the static optimization phase. The utilized optimization algorithms for mapping are based on simulated annealing (SA) [Kirkpatrick et al. 1983] and group migration (GM)


[Krishnamurthy 1984]. They are, however, modified for the system-level architecture exploration purposes. The idea is to localize the communication by mapping the tasks into groups so that communication is avoided while still distributing the system to achieve greater performance. The task scheduler system determines an estimate for the execution time from an allocation-mapping pair.

Fig. 19. Static method of the architecture exploration.

The SA and GM algorithms complement each other by combining the good properties of nongreedy global and greedy local optimization techniques. SA can climb from local minima to reach a global minimum. The GM algorithm is used to locally optimize SA solutions as far as possible.

Figure 20 shows the pseudocode for SA. This implementation of the algorithm has all possible mappings of the application as the state space. The algorithm has special task graph heuristics to move in the state space. One state (i.e., allocation or mapping) is denoted with S. The SA temperature T reflects the speed and size of changes per move. First, the cost associated with the initial mapping is calculated with the cost function. The cost is based on estimated execution time, memory consumption, and gate count. The objective is to minimize the cost. Inside the optimization loop, the next mapping is determined with the heuristic move function and the associated costs are calculated. If new mapping results in

lower costs, it is chosen as a base for the next iteration. Otherwise, the best mapping so far is used. The algorithm always accepts improving moves, but it can avoid local minima by accepting bad moves in a probabilistic manner. The probability for the acceptance of a bad move is given by the prob function. The lower the temperature T, the less likely it is for the algorithm to accept a locally bad move. The higher the temperature, the more radical the algorithm is in accepting locally bad moves. Temperature T is decreased at regular intervals with the calc_T function. If the mapping has not improved within max_rejects iterations, the minimum is considered to have been reached and the function terminates.

Fig. 20. The utilized simulated annealing algorithm.

After simulated annealing, the GM algorithm is applied on the mapping. The pseudocode is shown in Figure 21. The algorithm only accepts locally improving moves and, therefore, it is greedy. The algorithm consists of migration rounds. Each round either improves the solution or keeps the original solution. Optimization ends when the latest round is not able to improve the solution. Usually the algorithm converges into a local minimum after a few rounds.

A move in GM means moving one specific task to another processing element. For N tasks and M PEs, a round consists of at most N subrounds. A subround tries to move tasks one at a time to each of the M PEs. The best move from a subround is selected and the moved process is not moved anymore, i.e., the mapping of that process is fixed. This results in at most N^2 (M - 1) moves per round. If no improving move is detected in a subround, i.e., no single move can improve the cost, the round is terminated.

In the allocation optimization, the state space is the allocation based on the processing element library. The move function chooses the next allocation

from the components of the library. Currently, the heuristic for choosing the next allocation is purely random selection. However, the optimization starts with an allocation that has the minimum number of processing elements defined by the designer.

Fig. 21. The utilized group migration algorithm.

9.3 Dynamic Method

Figure 22 depicts the flow of dynamic architecture exploration. The flow is based on iterative simulations of the application and the architecture, which includes the


models of communication architecture and processing elements. The allocation and mapping optimizer refines the candidates produced by the static exploration or by subsequent dynamic exploration iterations. The optimization algorithm evaluates how moving a single task to another processing element affects the cost. The moves are performed heuristically, taking into account the interdependence of communicating tasks. Therefore, only minor changes to the initial mapping are possible. However, if no initial mapping is available, either from the UML model or from the static method, the algorithm takes a random mapping as the starting point.

Fig. 22. Dynamic method of architecture exploration.

The platform generator generates the simulation models of the processing elements and the communication network by exploiting the library components described in Section 5. Since the models for applications, processing elements, and communication architectures are at different abstraction levels, a tool that combines them into a single model for simulation is required. This tool, called Transaction Generator [Kangas et al. 2003], composes the application and architecture models as well as the mapping information for simulation. It executes the application model with respect to the architecture and produces statistics of process

execution timing and communication latencies. These statistics can later be used to determine the performance of the system. The performance information is back-annotated to the allocation and mapping optimizer for the next iteration. If necessary, part of this information may be forwarded to the UML design environment to guide model refinement. After the mapping and allocation have been fixed, the communication architecture parameters, such as priorities and buffer sizes, are optimized with a similar iterative method [Riihimäki et al. 2002].

The results of the architecture exploration are back-annotated to the UML environment after exploration to facilitate further model refinement. This information includes the optimized allocation, mapping, and scheduling, as well as the realized values of the cost function parameters. If the designer is satisfied with the optimized architecture, the model refinement can continue with a physical implementation.

9.4 Architecture Exploration Results

Both the static and the dynamic architecture exploration were performed to optimize the initial allocation and mapping model given in UML. The dynamic exploration results are not shown here for brevity. The same cost function, defined by the designer, was applied in both methods:

$$\mathit{cost} = \mathit{area} \times \mathit{execution\_time}(p, n)^{2} \qquad (1)$$

where area is the logic gate count of the processing elements and communication network, and execution_time(p, n) is the time taken when process p is executed n times.

Fig. 23. Static exploration progress with one to three NIOS II processors.

Figure 23a shows the cost for each optimization iteration of the static method. During the exploration, the number of processing elements was varied from three to five, keeping the CRC and radio interface PEs fixed, as defined in the UML mapping model. This implies that the floating processes were evaluated with one to three NIOS II processors.
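The static method behind these results pairs the cost function of Eq. (1) with the SA and GM steps described earlier. The following is a minimal sketch of that combination, not the authors' implementation. The task times, task graph, PE area, and communication penalty are hypothetical toy values. The random single-task move, the Metropolis acceptance rule, and the geometric cooling schedule are assumptions standing in for the move, prob, and calc_T functions, whose exact definitions are given only in Figure 20.

```python
import math
import random

# Hypothetical toy model, not taken from the paper: per-task run times,
# task-graph edges, one area figure per PE, and a flat communication penalty.
TASK_TIME = [4.0, 3.0, 5.0, 2.0, 6.0, 1.0]
EDGES = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
PE_AREA = 10.0
COMM_PENALTY = 1.5


def cost(mapping, num_pes):
    """Eq. (1): area * execution_time**2, with a crude time estimate."""
    load = [0.0] * num_pes
    for task, pe in enumerate(mapping):
        load[pe] += TASK_TIME[task]
    comm = sum(COMM_PENALTY for a, b in EDGES if mapping[a] != mapping[b])
    return num_pes * PE_AREA * (max(load) + comm) ** 2


def simulated_annealing(mapping, num_pes, rng, t0=100.0, alpha=0.95,
                        interval=20, max_rejects=200):
    """Nongreedy search; stops after max_rejects non-improving iterations."""
    current, current_cost = list(mapping), cost(mapping, num_pes)
    best, best_cost = list(current), current_cost
    temperature, rejects, step = t0, 0, 0
    while rejects < max_rejects:
        # Assumed move: remap one random task. The paper uses a task graph
        # heuristic that is not reproduced here.
        candidate = list(current)
        candidate[rng.randrange(len(candidate))] = rng.randrange(num_pes)
        cand_cost = cost(candidate, num_pes)
        delta = cand_cost - current_cost
        # Assumed prob function: Metropolis criterion exp(-delta / T).
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, cand_cost
        if current_cost < best_cost:
            best, best_cost = list(current), current_cost
            rejects = 0
        else:
            rejects += 1
        step += 1
        if step % interval == 0:
            # Assumed calc_T: geometric cooling at regular intervals.
            temperature = max(temperature * alpha, 1e-9)
    return best, best_cost


def group_migration(mapping, num_pes):
    """Greedy refinement: rounds of subrounds, improving moves only."""
    best, best_cost = list(mapping), cost(mapping, num_pes)
    improved = True
    while improved:                       # one migration round per pass
        improved = False
        free = set(range(len(best)))      # tasks not yet fixed in this round
        while free:                       # at most N subrounds
            move, move_cost = None, best_cost
            for task in sorted(free):
                for pe in range(num_pes):
                    if pe == best[task]:
                        continue
                    trial = list(best)
                    trial[task] = pe
                    trial_cost = cost(trial, num_pes)
                    if trial_cost < move_cost:
                        move, move_cost = (task, pe), trial_cost
            if move is None:              # no single move improves the cost
                break
            best[move[0]] = move[1]       # apply the best move and fix the task
            best_cost = move_cost
            free.discard(move[0])
            improved = True
    return best, best_cost


rng = random.Random(0)
start = [0] * len(TASK_TIME)              # all tasks on the first PE
sa_map, sa_cost = simulated_annealing(start, 3, rng)
gm_map, gm_cost = group_migration(sa_map, 3)
print(f"SA cost {sa_cost:.1f}, after GM {gm_cost:.1f}")
```

Running GM after SA mirrors the division of labor described in the text: SA explores the state space broadly, and GM then moves the result to the nearest local minimum, so the final cost can never be worse than the SA result.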
Figure 23b depicts the best values for area, execution time, and the total cost obtained with the given number of NIOS II processors. It shows that

increasing the number of processors improves the execution time but also increases the area. The execution time does not scale linearly with the number of processors because of the increased interprocessor communication. The lowest cost is achieved with two processors, which is only slightly better than the three-processor system. Hence, the best architecture with the given cost function includes four processing elements (two NIOS II, CRC-32, and radio interface). A HW accelerator for AES calculation was also considered. However, in this case study, the offered speedup (w.r.t. SW on NIOS II) was outweighed by the increased area. Consequently, the final architecture does not include AES HW.

The optimized allocation and mapping candidate from the static method was passed to the dynamic method as a starting point. The dynamic method considered its initial PE selection and process mapping to be the best candidate, but was able to cut the total system area by 2% by decreasing the buffer sizes of the HIBI wrappers. This did not affect the execution time.

Table II. Number of Evaluated Candidates and Total Exploration Time
Static Dynamic
PE allocation iterations 3 3
Network optimization 10
Mapping iterations: NIOS II NIOS II NIOS II
Total exploration time (min) 7 9

Table II tabulates the optimization iterations and the total exploration time for both the static and dynamic methods. The static method is able to evaluate tens of thousands of candidates in minutes, whereas the same number of candidates would take days with the dynamic method. This is due to the more accurate architecture model as well as the simulation-based analysis of the dynamic method. With another cost function, the resulting architecture and mapping would have been different. In this example, the cost function was rather simple, taking only the area and execution time into account.
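The trade-off implied by Eq. (1) can be stated explicitly. As a sketch in notation introduced here only for illustration, let $A_k$ and $t_k$ denote the area and the execution time of the candidate with $k$ processors:

```latex
\begin{align*}
\mathit{cost}_k &= A_k \, t_k^{2}, \\
\mathit{cost}_{k+1} < \mathit{cost}_k
  &\iff A_{k+1} \, t_{k+1}^{2} < A_k \, t_k^{2}
  \iff \frac{A_{k+1}}{A_k} < \left( \frac{t_k}{t_{k+1}} \right)^{2}.
\end{align*}
```

Because the execution time is squared, an additional processor lowers the cost only when the square of the speedup exceeds the relative area increase. For instance, a 20% larger area pays off only if the speedup is greater than $\sqrt{1.2} \approx 1.095$.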
A more complex function or different parameters, such as memory consumption and real-time constraints, could be applied, depending on the requirements.

10. PHYSICAL IMPLEMENTATION AND FPGA PROTOTYPING

The physical implementation is divided into four steps: configuration generation, software build, platform generation, and hardware synthesis. The configuration generation produces C code for the run-time configuration of software. The software build compiles and links a target executable application. The platform generation produces VHDL code to compose a platform instance using libraries of RTL models. The hardware synthesis implements the platform instance. In addition to these steps, application profiling can be performed, based on the execution trace gathered during the target execution. The physical implementation flow is depicted in Figure 24. In this section, the final architecture


is assumed to consist of two NIOS II processors, a CRC-32 accelerator, and a radio interface.

Fig. 24. Physical implementation flow.

10.1 Software Configuration and Build

The application is configured at run-time, which requires configuration code. The configuration generation script parses the mapping information from the XSM and produces mapping arrays in C. In addition, the priorities for OS threads are parsed from the XSM and included in the configuration code. The mapping arrays contain the target processor for each application process. A custom run-time library supports the use of the mapping arrays with multiprocessor platforms. The mapping arrays are used at startup to decide on which processor and RTOS thread each process is started. In addition, the mapping information is used during process communication to resolve the target processor of the transferred signal.

The software configuration tool selects an appropriate component from the library according to the architecture and mapping models. It takes the platform-dependent hardware abstraction layer (e.g., HIBI HAL) from the library and binds it to the platform-independent software layers (Communication API).

The software build produces a final target executable for each processor by combining the TUTMAC UML implementation with supporting libraries. The libraries contain the RTOS, scheduling and distribution of UML processes, as well as the API for HW accelerators. The use of the libraries depends on the architecture and mapping models and, therefore, the final compilation and linking has to be performed in the physical implementation phase. In the case of TUTMAC, the executables are identical apart from the mapping arrays. This approach enables reconfiguration at run-time and considerably eases the software implementation for a multiprocessor platform.
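The configuration generation step described above can be illustrated with a small generator that emits C mapping arrays. This is a sketch under stated assumptions rather than the actual script: the XSM format is not shown in this section, so the input is a plain dictionary, and the process names, processor names, and emitted array names (process_cpu, process_priority) are all hypothetical.

```python
def generate_mapping_c(mapping, priorities, cpu_ids):
    """Return C source text with one mapping and one priority array.

    mapping:    process name -> processor name
    priorities: process name -> OS thread priority
    cpu_ids:    processor name -> numeric processor id
    """
    names = sorted(mapping)  # fixed order so that every build agrees on indices
    out = [
        "/* Generated configuration (hypothetical layout). */",
        f"#define NUM_PROCESSES {len(names)}",
        "",
        "/* Target processor of each application process. */",
        "const unsigned char process_cpu[NUM_PROCESSES] = {",
    ]
    out += [f"    {cpu_ids[mapping[n]]}, /* {n} */" for n in names]
    out += [
        "};",
        "",
        "/* OS thread priority of each application process. */",
        "const unsigned char process_priority[NUM_PROCESSES] = {",
    ]
    out += [f"    {priorities[n]}, /* {n} */" for n in names]
    out.append("};")
    return "\n".join(out) + "\n"


# Hypothetical input; in the described flow this would be parsed from the XSM.
mapping = {"rx_ctrl": "nios_0", "tx_ctrl": "nios_1", "crypto": "nios_1"}
priorities = {"rx_ctrl": 5, "tx_ctrl": 5, "crypto": 8}
cpu_ids = {"nios_0": 0, "nios_1": 1}
print(generate_mapping_c(mapping, priorities, cpu_ids))
```

Since only the generated arrays change between mappings, the same application code can be linked for every processor, which matches the observation that the executables differ only in their mapping arrays.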
Although the application as a whole is built for each processor, the overhead of the additional code per processor is reasonable because the proportion of constant code in an executable (RTOS and related code) is high. Moreover, the significance of the instruction memory size is even lower when the data memory is taken into account. The static memory consumption for each software component is shown in Table III. The RTOS is the largest component of the total executable

taking approximately one-half of the instruction memory footprint. The TUTMAC model itself takes 16% of the instruction memory, but less than 10% of the total memory size when the static data memory is also considered.

Table III. Static Memory Consumption for Each Software Component
Columns: Software Component, Code (bytes), RW-data (bytes), Total (bytes), Percentage
Rows: TUTMAC model, eCos, State machine scheduler, HIBI API, Total

The identical executable approach provides a convenient way to handle the remapping of processes at run-time. During the remapping procedure, only the thread context has to be communicated between processors, instead of the whole application code. If there is no need for run-time process mapping, the inactive application code can be left out of the executable during the linking process.

10.2 Hardware Configuration and Synthesis

The XSM defines the structure and parameters for platform instantiation. A platform instance is composed from a library of RTL models that contains implementations of the available components. The VHDL code for a platform instance is produced by the platform generator, which parameterizes the library components (processing elements and network) and generates the top-level code instantiating the necessary component entities. In this case study, the parameterization includes the speed grade and cache sizes of the NIOS II processors, the data widths of the CRC and AES accelerators, as well as the buffer sizes, data widths, priorities, and addresses of the HIBI wrappers. The parameters are set according to the UML models and architecture exploration results. The fully automated platform generation takes the XSM and a library of RTL models as inputs and produces the hardware model for synthesis. The hardware synthesis is performed using synthesis tools for the utilized target hardware. In the presented TUTWLAN design case, the target hardware is based on a Stratix FPGA from Altera.
For the synthesis, Mentor Graphics Precision [Mentor Graphics homepage 2005] and Altera Quartus II [Altera homepage 2005] are utilized. Figure 25a depicts the final architecture of the TUTMAC on the Stratix chip. The architecture consists of two NIOS II processors executing portions of the application processes, whereas one of the time-critical processes (CRC-32) is implemented with a hardware accelerator. The actual development board for system prototyping is shown in Figure 25b. The use of external synthesis tools is automated and controlled by the Koski GUI. Therefore, the path from the UML to the physical implementation can be handled using only the Koski GUI.

10.3 Application Profiling

The application profiling of the physical implementation is realized in the same way as during functional verification (see Section 7.3). At this phase,

Fig. 25. The final implementation of TUTMAC system.
