D5.3 Accurate Simulation Platforms


Project Number 611411

D5.3 Accurate Simulation Platforms

Version 2.0
Final, Public Distribution

CNRS, Bosch

Project Partners: aicas, Bosch, CNRS, Rheon Media, The Open Group, University of Stuttgart, University of York

Every effort has been made to ensure that all statements and information contained herein are accurate; however, the DreamCloud Project Partners accept no liability for any error or omission in the same. Copyright in this document remains vested in the DreamCloud Project Partners.

Project Partner Contact Information

aicas: Fridtjof Siebert, Haid-und-Neue Strasse 18, Karlsruhe, Germany, siebert@aicas.com
Bosch: Jochen Härdtlein, Robert-Bosch-Strasse, Schwieberdingen, Germany, jochen.haerdtlein@de.bosch.com
CNRS: Gilles Sassatelli, Rue Ada, Montpellier, France, sassatelli@lirmm.fr
Rheon Media: Raj Patel, Leighton Avenue, Pinner, Middlesex HA5 3BW, United Kingdom, raj@rheonmedia.com
The Open Group: Scott Hansen, Avenue du Parc de Woluwe 56, Brussels, Belgium, s.hansen@opengroup.org
University of Stuttgart: Bastian Koller, Nobelstrasse, Stuttgart, Germany, koller@hlrs.de
University of York: Leandro Indrusiak, Deramore Lane, York YO10 5GH, United Kingdom, leandro.indrusiak@cs.york.ac.uk

Table of Contents

1 Introduction
  1.1 Contributions To The Project
  1.2 Terminology
  1.3 Structure Of This Document
2 Modular Simulation Platform
  2.1 Platform Overview
  2.2 Parser
  2.3 Execution Manager
  2.4 Mapper
  2.5 Simulation Platform Extensions
    2.5.1 Simulation Scripts
    2.5.2 Mapper Interface
    2.5.3 Application Graphs
  2.6 Summary
3 Cycle-Accurate Simulator For NoC-Based Architectures
  3.1 Related Work
  3.2 Simulator Implementation
    3.2.1 Simulated Hardware
    3.2.2 NoCTweak Design
    3.2.3 NoCTweak Extensions
  3.3 Experimental Results
    3.3.1 Setup
    3.3.2 Scalability
    3.3.3 Comparison With Abstract Platform
  3.4 Summary
4 Cycle-Accurate Simulator For Crossbar-Based Architectures
  4.1 Simulator Implementation
    4.1.1 The Crossbar
    4.1.2 Arbitration Policy
  4.2 Preliminary Experimental Results
    4.2.1 Modeling of Infineon Aurix Architecture
    4.2.2 Simulation of Automotive Applications on Aurix Architecture Model
  4.3 Assessment w.r.t. Performance Traces from Industrial Use-Case
    4.3.1 The Timing-Architects Simulator
    4.3.2 Format of Results Produced by Timing-Architects
    4.3.3 Extraction of Simulation Parameters
    4.3.4 Comparison of Performance Numbers
  4.4 Summary
5 Conclusions
References

Document Control

- 0.1: Initial draft by CNRS on the NoC cycle-accurate simulator
- Enhancement by CNRS with the crossbar cycle-accurate simulator
- Feedback integrated after internal review by CNRS
- Review and enhancement by Bosch with Timing-Architects
- Enhancement by CNRS with the comparison w.r.t. Timing-Architects
- 2.0: Final version for submission to the EU


Executive Summary

This document defines and describes cycle-accurate simulation platforms for the DreamCloud project. It thereby complements the abstract simulation platform previously proposed in deliverable D5.2. The current deliverable first presents an overview of the features common to all these simulation platforms. Then, it introduces some extended features provided beyond those already defined in D5.2. In addition, it covers the design and implementation of the new simulators, which target multicore architectures relying on various communication infrastructures: networks-on-chip and crossbars. To validate these simulators, several evaluation results are reported, including comparisons with the previous abstract simulator described in D5.2 and with an industrial simulator. The resulting modular cycle-accurate simulation platform allows a designer to build and evaluate different resource allocation heuristics for embedded applications running on a multicore platform, following the design flow proposed in DreamCloud, with AMALTHEA application models as inputs.

1 Introduction

This document presents and describes in detail the cycle-accurate simulation platforms that extend the abstract simulation capabilities proposed in the deliverable D5.2.

1.1 Contributions To The Project

In the simulation platform proposed in D5.2, the communication infrastructure of the hardware is simulated in a Transaction-Level Modeling fashion. This leads to fast simulation times at the expense of accuracy. In this deliverable, we present cycle-accurate simulation platforms, which simulate the communication infrastructure at a cycle-accurate granularity. The results provided by these platforms are thus accurate regarding timing, but require longer simulations. In particular, this deliverable describes cycle-accurate simulation platforms able to simulate both network-on-chip (NoC) and crossbar-based multicore architectures.

1.2 Terminology

In this document, we heavily use the terms simulation platform, architecture family and architecture. A simulation platform is a complete software environment, including software and hardware components, that allows simulating an application specified using a pre-determined modeling abstraction, in order to generate results such as execution time or satisfaction of real-time constraints. A simulation platform can support the simulation of applications running on top of different architecture families. The cycle-accurate simulation platform described in this deliverable supports two architecture families. The first one represents multicore architectures based on a NoC communication infrastructure, while the second one represents multicore architectures using a crossbar to let cores communicate. Finally, an architecture is a particular instance of an architecture family. In this document, an architecture mainly consists in specifying the number of cores to be used in a particular architecture family and their operating frequency.

Figure 1 shows a visual representation of the cycle-accurate simulation platform described in this document. As already stated, it supports two different architecture families, and for both of them it allows configuring the number of cores and their frequency. The NoC-based architecture family supports the 2D mesh topology only. The crossbar-based architecture family supports, in theory, an unlimited number of cores, but in practice it will be used to simulate platforms with few cores, because crossbars are only used in such small systems for scalability reasons.

1.3 Structure Of This Document

The rest of the deliverable is organized as follows. Section 2 describes the global architecture of the cycle-accurate simulation platform and the components that are common to the two supported multicore architecture families.

Figure 1: Cycle-accurate simulation platform described in this deliverable (two architecture families, NoC-based and crossbar-based, each instantiated as particular architectures such as 2x2 or 4x4 NoCs or a 4-element crossbar)

Figure 2: Abstract simulation platform described in D5.2 (a single NoC-based architecture family, with architectures such as 2x2 to 4x4)

It first introduces the basic blocks of the simulation platform. Then, the extensions regarding modularity and usability brought to the abstract simulation platform are presented. Section 3 presents the NoC-based cycle-accurate simulation platform. It first reviews related work about cycle-accurate NoC simulators and then introduces NoCTweak, the cycle-accurate NoC simulator that we use as a basis to build our own simulator. We also discuss the extensions implemented in NoCTweak to satisfy the needs of the DreamCloud project. Finally, we present the evaluation of the cycle-accurate simulator in terms of simulation time and accuracy, and we compare these results with the ones obtained with our previous abstract simulator. Section 4 then presents the crossbar-based simulation platform. The implementation of the platform is discussed, and simulation results for the DemoCar application provided by Bosch running on representative industrial hardware are presented.

2 Modular Simulation Platform

The cycle-accurate simulation platform is a modular system that accepts several input parameters to describe the hardware architecture under simulation. This section describes the global design of the platform and the components common to the two supported multicore architecture families. These two architecture families, a NoC-based one and a crossbar-based one, are described later in Section 3 and Section 4.

2.1 Platform Overview

This section first briefly introduces the different blocks composing the simulation platform. For a detailed description of each of them, please refer to the deliverable D5.2. Then, the extensions brought to the abstract simulation platform described in D5.2 are presented.

As depicted in Figure 3, the cycle-accurate simulation platform takes as input an application model and several configuration parameters to produce results in terms of execution time, number of deadline misses and energy consumption. The application model is specified using the AMALTHEA format, and results are provided as text files. The simulation platform itself is written using the C++ programming language and the SystemC simulation library.

Figure 3: Overview of the cycle-accurate simulation platform (an AMALTHEA application model and command-line configuration parameters feed the C++/SystemC parser, execution manager, mapper and architecture simulator, which produce text-file results: execution time, energy, deadline misses)

2.2 Parser

The parser component is responsible for reading the AMALTHEA model describing the application and building an in-memory object representation of it. In AMALTHEA, an application is described by a runnable call graph. Runnables are the AMALTHEA equivalent of what is called a job, or a task instance, in the real-time community. Because the AMALTHEA application modeling format is XML based, the parser relies on the Apache Xerces library. In the simulation platform, all the relevant AMALTHEA entities have an equivalent in-memory object instance, and thus the main responsibility of the parser is to create them. This in-memory representation is then passed to the execution manager handling the effective simulation of the application.

2.3 Execution Manager

The execution manager orchestrates the simulation according to the semantics of the AMALTHEA model provided by the parser. In other words, the execution manager has to manage the timing of the simulation by releasing the runnable instances when required. Basically, runnables are released on three different kinds of events:

- at time 0, when the application starts;
- each time a periodic timer ticks;
- at the completion of another runnable.

All the runnables that do not have any input dependency and that are not associated with a periodic timer are released immediately when the simulation starts, at time 0. For periodic runnables, the execution manager launches periodic timers at the beginning of the simulation, which trigger the release of the associated runnables at each period. Finally, for the runnables having execution dependencies, the execution manager relies on feedback provided by the architecture simulator: the completion of a runnable triggers the release of its dependent runnables. Using this feedback, the execution manager keeps track of the unsatisfied dependencies of each runnable, as sketched below. When the number of unsatisfied dependencies reaches 0, the runnable can be released.

When the execution manager releases a runnable, it has to allocate it on a particular core of the multicore architecture. The task of mapping the runnables onto the multicore architecture is handled by the mapper.

2.4 Mapper

The mapper decides where a particular runnable instance should be allocated on the multicore architecture. It answers requests coming from the execution manager. These requests contain the identifier of the runnable to be mapped, and the mapper responds with the identifier of the particular core of the platform responsible for executing the runnable in question. See Section 2.5.2 for the detailed description of the interface between the execution manager and the mapper.
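To make the release mechanism of Section 2.3 concrete, the following minimal C++ sketch shows the dependency-counting bookkeeping in isolation. All names (Runnable, ExecutionManager, release) are illustrative and do not necessarily match the actual DreamCloud sources.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical names throughout; this is a sketch, not the platform code.
struct Runnable {
    int unsatisfiedDeps;                  // input dependencies not yet met
    std::vector<std::string> successors;  // runnables waiting on this one
};

class ExecutionManager {
    std::map<std::string, Runnable> runnables;

    void release(const std::string& /*id*/) {
        // ask the mapper for a core, then hand the runnable to the
        // architecture simulator (both omitted in this sketch)
    }

public:
    // Callback invoked by the architecture simulator when a runnable
    // instance completes.
    void runnableEnded(const std::string& id) {
        for (const std::string& succ : runnables[id].successors) {
            // One more dependency of each successor is now satisfied;
            // release a successor once its counter reaches zero.
            if (--runnables[succ].unsatisfiedDeps == 0)
                release(succ);
        }
    }
};
```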

>> ./compile.py build --help
usage: compile.py build [-h] [-c] [-v]

optional arguments:
  -h, --help     show this help message and exit
  -c, --clean    clean previously compiled files before compiling
  -v, --verbose  enable verbose output

Figure 4: Script to compile the simulation platform

2.5 Simulation Platform Extensions

This section describes the extensions brought to the simulation platform described in D5.2. These extensions aim at making the simulation platform more modular and easier to use.

2.5.1 Simulation Scripts

In order to easily use the simulation platform, different scripts are provided. These scripts can be used to build the platform from source code, run a single simulation, or run multiple simulations. We now describe in detail how to use each of them.

Compiling the source code. The first step to use the simulation platform consists in compiling it from the source code using the compile.py script, whose help message is shown in Figure 4. The simulation platform has the following dependencies:

- cmake
- SystemC
- Xerces

As a consequence, in order to run the compile script you need to have the cmake tool and the SystemC and Xerces libraries installed on your system. If the two libraries are not installed in standard paths, you can define the SYSTEMC_HOME and/or XERCES_HOME environment variables to specify their custom paths.

Running a Single Simulation. The simulate.py script is provided to launch a single simulation. This script allows tuning all the input parameters of the simulation. These parameters concern either the application, the architecture or the mapping strategy. Figure 5 shows the help message of the simulate.py script, describing in detail each of the configuration parameters. As for the compile.py script, the SystemC and Xerces libraries must be correctly installed. If these libraries are not installed in standard paths, the XERCES_HOME and/or SYSTEMC_HOME environment variables should also be defined with their custom paths.

>> ./simulate.py --help
usage: simulate.py [-h] [-d] [-da {DC,CSE}] [-ca CUSTOM_APPLICATION] [-f FREQ]
                   [-mf MODES_FILE] [-m MAPPING_STRATEGY [MAPPING_STRATEGY ...]]
                   [-np] [-o OUTPUT_FOLDER] [-r] [-s {fcfs,prio}] [-v]
                   [-x ROWS] [-y COLS]

Cycle-accurate simulator runner script

optional arguments:
  -h, --help            show this help message and exit
  -d, --syntax_dependency
                        consider successive runnables in tasks call graph as dependent
  -da {DC,CSE}, --def_application {DC,CSE}
                        specify the application to be simulated among the default ones
  -ca CUSTOM_APPLICATION, --custom_application CUSTOM_APPLICATION
                        specify a custom application file to be simulated
  -f FREQ, --freq FREQ  specify the frequency of cores in the NoC. Supported
                        frequency units are Hz, KHz, MHz and GHz, e.g. 400MHz or 1GHz
  -mf MODES_FILE, --modes_file MODES_FILE
                        specify a modes switching file to be simulated
  -m MAPPING_STRATEGY [MAPPING_STRATEGY ...], --mapping_strategy MAPPING_STRATEGY [MAPPING_STRATEGY ...]
                        specify the mapping strategy used to map runnables on
                        cores and labels on memories. Valid strategies are
                        [ MinComm, Static, StaticSM, ZigZag, ZigZagSM, 3Core,
                        Random, StaticModes ]
  -np, --no_periodicity
                        run periodic runnables only once
  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        specify the absolute path of the output folder where
                        simulation results will be generated
  -r, --random          replace constant seed used to generate distributions
                        by a random one based on current time
  -s {fcfs,prio}, --scheduling_strategy {fcfs,prio}
                        specify the scheduling strategy used by cores to
                        choose the runnable to execute
  -v, --verbose         enable verbose output
  -x ROWS, --rows ROWS  specify the number of rows in the NoC
  -y COLS, --cols COLS  specify the number of columns in the NoC

Figure 5: Script to run a single simulation

>> ./simulators.py --help
usage: simulators.py [-h] [-v] [-s SIMULATORS] {build,run} ...

Simulators Script

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         enable verbose output
  -s SIMULATORS, --simulators SIMULATORS
                        comma separated list of the simulators to use among
                        [ NoC-TLM, NoC-CA, Xbar-CA ]

valid subcommands:
  {build,run}

Figure 6: Common options for building several platforms or running several simulations in a single step

Running Multiple Simulations. In order to ease design space exploration, the simulation platform also provides a script allowing to build several architecture simulators or to launch several simulations at once. Figure 6 shows the common options for both building platforms and running simulations (mainly, one option allows choosing the platforms to be considered). Figure 7 shows the options when running several simulations at once. Depending on the input parameters, different types of results and graphs are generated. For example, we simulate the same application with different NoC sizes for the NoC-based cycle-accurate multicore architecture using the following command:

>> ./simulators.py -s NoC-CA run -m ZigZag -sc fcfs -n 2x2,4x4,8x8 -a DC

This command will generate graphs showing execution and simulation time according to the NoC size. Examples of such graphs are shown in Figure 15 of Section 3.3 and in Figure 16 of Section 3.3.3. This script can also be used to compare the results between the cycle-accurate simulation platform and the abstract one. We use this capability in Section 3.3.3 to automatically generate comparison graphs according to different metrics.

2.5.2 Mapper Interface

In order to let users easily customize the mapping strategy, a clean and simple interface is provided to separate concerns between the execution manager and the mapper. Each mapping strategy provided by the simulation platform uses the same interface, as shown in Figure 8. When a simulation is started, the corresponding mapper strategy is instantiated according to the mapping parameter provided by the user through the simulate.py script described above. As shown in Figure 8, two mapping strategies are provided by default in the cycle-accurate simulation platform. The ZigZag one simply maps the labels and the runnables in a zig-zag fashion on the different memories and cores of the architecture.

>> ./simulators.py run --help
usage: simulators.py run [-h] [-a {DC,CSE} [{DC,CSE} ...]]
                         [-m {ZigZag,MinComm} [{ZigZag,MinComm} ...]]
                         [-n NOCSIZES] [-sc {fcfs,prio} [{fcfs,prio} ...]]

optional arguments:
  -h, --help            show this help message and exit
  -a {DC,CSE} [{DC,CSE} ...], --applications {DC,CSE} [{DC,CSE} ...]
                        specify the applications to be simulated
  -m {ZigZag,MinComm} [{ZigZag,MinComm} ...], --mappings {ZigZag,MinComm} [{ZigZag,MinComm} ...]
                        specify the mapping strategies to be used
  -n NOCSIZES, --nocsizes NOCSIZES
                        comma separated list of NoC sizes to be simulated in
                        the form ROWxCOL
  -sc {fcfs,prio} [{fcfs,prio} ...], --schedulers {fcfs,prio} [{fcfs,prio} ...]
                        specify the schedulers to use

Figure 7: Options for running several simulations in a single step

Mapper Interface:
  memory maplabel(string id);
  core maprunnable(string id);
  void runnableended(long time, string id);
  ...
Implementations: ZigZag, Local Max, state of the art [14]

Figure 8: The Mapper Interface

The Local Max strategy also maps the labels in a zig-zag fashion on the memories, and then maps each runnable onto the core holding the largest label used by that runnable.

Adding new mapping strategies is a straightforward process. The only two functions that need to be implemented are the first two shown in Figure 8. maplabel is called by the execution manager at the beginning of the simulation, once for each label. In this function, the mapper should return the identifier of the memory where the label should be allocated. In the same way, maprunnable is called by the execution manager each time a runnable should be released. In this function, the mapper should return the identifier of the core where the given runnable should be executed. The mapper can also get some optional feedback from the execution manager. This feedback is provided through callback functions implemented by the mapping strategy and called by the execution manager. For example, the void runnableended(long time, string id) function shown in Figure 8 provides the completion time of runnables. There are a few other functions in the mapper interface, but the main idea is captured by these three.
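As an illustration, the following C++ sketch shows what a minimal custom strategy against the interface of Figure 8 could look like. It is a simple round-robin placement, written only to show the shape of the interface; the memory and core identifier types, the base class declaration and the constructor are assumptions, not the platform's actual code.

```cpp
#include <string>

using memory = int;  // memory tile identifier (assumed representation)
using core   = int;  // core tile identifier (assumed representation)

// Assumed rendering of the interface of Figure 8 as an abstract class.
class Mapper {
public:
    virtual memory maplabel(std::string id) = 0;
    virtual core   maprunnable(std::string id) = 0;
    virtual void   runnableended(long time, std::string id) {}
    virtual ~Mapper() = default;
};

class RoundRobinMapper : public Mapper {
    int nMemories, nCores;
    int nextMemory = 0, nextCore = 0;

public:
    RoundRobinMapper(int mems, int cores) : nMemories(mems), nCores(cores) {}

    // Called once per label at the start of the simulation.
    memory maplabel(std::string) override { return nextMemory++ % nMemories; }

    // Called each time a runnable is released.
    core maprunnable(std::string) override { return nextCore++ % nCores; }

    // Optional feedback: completion time of each runnable.
    void runnableended(long, std::string) override {
        // could be used to rebalance load based on observed finish times
    }
};
```

A more realistic strategy would exploit the runnableended feedback, for example to place communicating runnables close to each other, as the state-of-the-art strategy [14] does.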

2.5.3 Application Graphs

Because the AMALTHEA modeling environment doesn't provide a graphical representation of applications, we developed a tool to generate visual graphs from an AMALTHEA XML model. This tool has been integrated into the cycle-accurate simulation platform in order to automatically generate graphs for the application being simulated. The tool generates one graph showing task-level information only, and one graph showing both task and runnable information. These two graphs allow one to easily visualize the internal structure of the application.

Tasks and Runnables. The first graph generated by the visualization tool shows the graph of tasks composing the application and the graph of runnables composing each task. Figure 9 shows this graph for the DemoCar application provided by Bosch. As depicted, this application is made of 7 periodic tasks (shown in blue) without dependencies between them. These tasks contain a total of 19 runnables.

Figure 9: Application graph with tasks and runnables (DemoCar). Squares are tasks, with periodic ones shown in blue, and ellipses are runnables. In this application there is no inter-task dependency.

Tasks Only. The second graph generated by the visualization tool only contains tasks. This is useful for applications containing a large number of runnables, where it becomes really hard to visualize all of them at the same time. Figure 10 shows this graph for the Control System Engine application, also provided by Bosch. This application contains 1239 runnables and 109 tasks. The figure shows only a part of the task graph, but the structure of the application is clear from it: it comprises a few periodic tasks releasing non-periodic ones, and some other tasks executed only once.

Figure 10: Application graph with tasks only (a part of Control System Engine). Periodic tasks are shown in blue.

2.6 Summary

This section introduced the design of the modular cycle-accurate simulation platform. It also introduced all the extensions added to the abstract simulation platform described in the deliverable D5.2. These extensions aim at making the simulation platform more modular, configurable and user friendly, in order to facilitate its adoption by users outside the DreamCloud project. The next two sections describe the particularities of the NoC-based and the crossbar-based multicore architecture families.

3 Cycle-Accurate Simulator For NoC-Based Architectures

This section presents the cycle-accurate simulator for NoC-based multicore architectures. We first present related work about cycle-accurate NoC simulators. Then we introduce NoCTweak, the cycle-accurate NoC simulator that we use as a basis to build our own simulator, and how we extended it for the needs of the DreamCloud project. Finally, we present results about the scalability of the simulator according to different metrics, and a comparison of simulation results with the abstract simulation platform described in D5.2.

3.1 Related Work

This subsection reviews some of the existing work on cycle-accurate simulation of NoCs and motivates our choice of NoCTweak as the basis for our cycle-accurate simulator.

BookSim is a cycle-accurate NoC simulator developed at Stanford University. It was originally designed and introduced to support the book Principles and Practices of Interconnection Networks [5]. Since then, it has been extended with many recent features of NoC design. The current major release, BookSim 2.0 [10], supports a wide range of topologies such as mesh, torus and flattened butterfly networks, provides diverse routing algorithms and includes numerous options for customizing the network's router microarchitecture. BookSim is written in C++ but does not rely on the SystemC simulation library.

Noxim [4] is also a cycle-accurate network-on-chip simulator. It is written in SystemC and mainly provides a command-line interface for evaluating the common standard performance metrics of NoCs, such as throughput and latency. The command line allows configuring different parameters regarding the internal microarchitecture of the routers to be simulated.

TOPAZ [1] is another cycle-accurate network-on-chip simulator. TOPAZ can be used standalone, through synthetic traffic patterns and application traces, or within full-system evaluation environments such as GEM5. TOPAZ enables the modeling of a wide variety of message routers, with different trade-offs between speed and precision. It originates from the SICOSYS [13] simulator, which was originally conceived to obtain results very close to those obtained by using hardware description languages.

NoCTweak [16] is a parameterizable 2D mesh NoC simulator used for early exploration of performance and energy efficiency of on-chip networks. The simulator has been developed in C++/SystemC, allowing fast modeling of concurrent hardware modules at cycle-level accuracy.

We chose NoCTweak as the basis for our own cycle-accurate NoC simulator for several reasons. First, NoCTweak is a recent open-source project with a relatively large community of users. Second, as will be described in the next section, the design of NoCTweak makes it easily extendable. Finally, because it is written in C++/SystemC, it allows reusing the execution manager, also written in C++/SystemC, from our abstract simulation platform described in the deliverable D5.2.

3.2 Simulator Implementation

This section introduces the simulator, describes the design of NoCTweak, and explains how we extended it in order to simulate a NoC supporting real-time features.

3.2.1 Simulated Hardware

The NoC-based simulator supports the simulation of 2D mesh NoC-based architectures. Each tile of the 2D mesh topology is made of a computing element, i.e. a core, a private memory, and a router allowing the core to communicate with neighboring cores. Figure 11 shows an example of a 4x4 2D mesh NoC-based architecture.

Figure 11: 4x4 2D mesh network (each tile contains a core, a memory and a router)

3.2.2 NoCTweak Design

NoCTweak is a cycle-accurate NoC simulator for 2D mesh topologies. It is written in C++/SystemC and is designed around two main abstractions. The first one represents a core of the simulated hardware, while the second one represents a router. Figure 12 shows these two entities graphically and how they connect.

NoCTweak provides two different types of cores. The first one, called Synthetic, describes cores that only inject packets into the network, either in a uniform random fashion or with particular hotspots. The second type of core, called Embedded, can be used to inject packets according to traces from real embedded applications.

Regarding the routers, NoCTweak also provides two different types. The first one is the widespread wormhole router architecture, where packets are divided into flits. The first flit is called the header flit, and all other flits follow the route opened by the header one. The second type of router provided has virtual channels. As in wormhole routing, the packets to be exchanged are divided into flits, but in this case each router has several virtual channels, which make better use of the router's resources by allowing several packets to be stored and processed in a given router at the same time.

The gray part of Figure 12 shows how the cores and the routers are connected together to make an architecture. When a simulation is launched in NoCTweak, a NoC-based architecture is built using a particular type of core, i.e. Synthetic or Embedded, and a particular type of router, i.e. Wormhole or Virtual Channels. This architecture is made of tiles, where each tile is made of both a core and a router. The number of tiles created depends on a configuration parameter provided to the simulator. Because NoCTweak focuses on the 2D mesh topology, the user specifies the horizontal and the vertical size of the NoC.

Figure 12: Class diagram for the main classes of NoCTweak (an Architecture aggregates Tiles; each Tile contains one Router Interface, specialized as Virtual Channels or Wormhole, and one Core Interface, specialized as Embedded or Synthetic)

3.2.3 NoCTweak Extensions

NoCTweak does not support the notion of real-time packet management, as stated in the related work section. As a consequence, we extended it in order to support routers with virtual channels and priority-preemptive arbitration [3]. Indeed, one of the targets of the DreamCloud project is embedded automotive applications with real-time constraints. To satisfy such constraints, NoCs with virtual channels and priority-preemptive arbitration mechanisms have been chosen. Because AMALTHEA is the application modeling language used in the project, we also extended NoCTweak to support the simulation of applications described using this format. Figure 13 shows how we extended the NoCTweak framework with a new type of router supporting priority-preemptive arbitration and a new type of core supporting the simulation of AMALTHEA applications. With these two extensions, the simulator is now able to simulate AMALTHEA applications on the same NoC as the one simulated by the abstract simulation platform described in the deliverable D5.2. This allows a fair comparison between the cycle-accurate platform and the abstract one, as will be presented in the next section. Figure 13 also shows how the modular mapping solution described in Section 2.5.2 is integrated into NoCTweak.

Implementing a new router type with virtual channels and priority-preemptive arbitration was quite straightforward. We added a new virtual channel allocator and a new switch allocator taking into account the priority of the flit to be transmitted, as sketched below. These two priority-aware allocators replace the existing ones of the default virtual channel router. Regarding the AMALTHEA core type, the implementation was more complex because it needed to integrate the execution manager responsible for releasing the runnables. This execution manager has been integrated partly into the new AMALTHEA core type and partly into the architecture class. The configuration parameters of NoCTweak have also been extended to allow specifying these new router and core types, as well as the additional parameters of the AMALTHEA core type compared to the two simple core types provided by NoCTweak (e.g., to let the user specify which scheduling strategy to use on cores).
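The following C++ fragment sketches the core of such a priority-aware allocation decision: among the virtual channels competing for an output port, the flit with the highest priority wins and can thereby preempt lower-priority traffic. It is a simplified illustration, not the actual NoCTweak code; in particular, the VirtualChannel structure and the convention that lower values mean higher priority are assumptions.

```cpp
#include <vector>

struct VirtualChannel {
    bool hasFlit;       // a flit is waiting at the head of this VC
    int  headPriority;  // priority carried by that head flit
};

// Returns the index of the VC granted the output port this cycle,
// or -1 if no VC has a flit to send.
int allocateSwitch(const std::vector<VirtualChannel>& vcs) {
    int winner = -1;
    for (int i = 0; i < static_cast<int>(vcs.size()); ++i) {
        if (!vcs[i].hasFlit)
            continue;
        // Lower value = higher priority (assumed convention): a newly
        // arrived high-priority flit preempts lower-priority traffic.
        if (winner == -1 || vcs[i].headPriority < vcs[winner].headPriority)
            winner = i;
    }
    return winner;
}
```

Because this decision is re-evaluated on every cycle, a high-priority packet arriving mid-transfer wins the port on the next flit boundary, which is what makes the arbitration preemptive.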

Figure 13: Class diagram for the extended version of NoCTweak (the Architecture now holds a Mapper Interface with ZigZag and Local Max implementations; the Router Interface gains a Virtual Channels RT specialization and the Core Interface an AMALTHEA specialization)

3.3 Experimental Results

Using the simulator described in the previous section, we are able to simulate, in a cycle-accurate way, the execution of AMALTHEA applications running on a NoC-based multicore platform. This section first reports results about the simulation time and the memory requirements of this simulator. Then, results of the comparison with the abstract simulation platform described in D5.2 are presented.

3.3.1 Setup

All the results reported in this section come from the DemoCar application provided by Bosch. The AMALTHEA model of this application comprises:

- 8 periodic tasks
- 19 runnables
- 62 labels

All the presented results correspond to a single iteration of the application, i.e. each task is executed only once. Moreover, these results have been generated by considering only explicit dependencies between runnables. In other words, consecutive runnables R1 and R2 in a task are considered dependent only if R1 contains a runnable call to R2. The mapping considered is ZigZag, and scheduling is priority based. Finally, for reproducibility concerns, all the results are generated using a random generator initialized

with a constant seed for instructions with random behavior (the AMALTHEA model supports this kind of modeling).

The workstation used to generate the presented results includes an Intel Core i processor running at 3.4 GHz with 8 GB of memory. The processor is a quad-core processor, but it is worth mentioning that most SystemC kernel implementations, including the Accellera one that we use, are single threaded. As a consequence, increasing the number of cores will not increase simulation performance. All the graphs presented in this section are generated automatically using the simulation scripts described in Section 2.5.1.

3.3.2 Scalability

This section shows how the NoC-based cycle-accurate simulator scales with the number of cores to be simulated, in terms of simulation time and memory requirements. Indeed, a cycle-accurate simulator automatically leads to longer simulation times and larger memory requirements because of the additional details simulated compared to a less accurate simulator.

Memory Usage. Figure 14 shows the memory usage required by the simulator for different NoC sizes. These results have been obtained with the standard Linux time tool, which is able to track memory usage in addition to timing information. The reported metric is the peak resident set size (RSS) of the simulator process. This represents the maximum amount of physical RAM used at a given time by the process. As shown by this graph, the peak RSS is a few megabytes for small NoCs, but quickly grows with the number of cores, reaching almost 4 gigabytes for a NoC with 400 cores. We don't report results above 400 cores because, on the machine with 8 gigabytes of memory used for the experiments, a simulation with 30x30 cores uses more than the 8 gigabytes, leading to swapping and extremely long simulations. Nevertheless, common server machines can be equipped with several hundred gigabytes, allowing the simulation of NoC-based multicore architectures far bigger than the presented ones. This high memory usage could probably be decreased by profiling memory usage and implementing smart optimizations. Nevertheless, we already performed rough analyses, and most of this high memory consumption comes from the thousands of objects required to simulate the NoC architecture in a cycle-accurate way. These objects include, for example, the internal buffers used to store the flits, the virtual channel allocators, the input and output ports and all the signals connecting these components.

Simulation Time. Figure 15 shows the simulation time for different NoC sizes. Because of the memory limit described in the previous paragraph, the biggest NoC we use on our experiment machine is 20x20. A few seconds are required to simulate the DemoCar application on small NoCs.

Figure 14: Peak memory usage for DemoCar, ZigZag mapping (peak memory usage in megabytes vs. NoC size in number of cores, Rows x Cols)

Then, as for the memory usage, simulation time sharply increases with the number of cores in the simulated multicore architecture. For the architecture with 400 cores, the simulation takes more than 200 seconds, even though many cores of the platform are not used. This is explained by the cycle-accurate simulation level. Indeed, the simulator still simulates these inactive components, as in the real hardware, to check on each clock cycle whether they have something to do. As stated in the previous section, this is explained by all the details present in the cycle-accurate simulator compared to the abstract one. Indeed, in the abstract simulator, adding a new core mainly consists in adding two simple objects: the core itself and the associated router. There is no need to create objects for all the internal details of the router, because they are abstracted away by the simulation.

Figure 15: Simulation time for DemoCar, ZigZag mapping (simulation time in seconds vs. NoC size in number of cores, Rows x Cols)

3.3.3 Comparison With Abstract Platform

We now compare the cycle-accurate simulation platform with the abstract simulation platform previously developed. For more information about the abstract simulation platform, refer to D5.2 and related publications [12, 9]. The results presented below are automatically generated using the simulators.py script described in Section 2.5.1 with the following parameters:

>> simulators.py run -sc fcfs -m ZigZag -n 2x2,3x3,4x4,5x5,6x6,7x7,8x8,9x9,10x10,15x15,20x20 -a DC

Memory Usage. Figure 16 shows the comparison of peak resident set memory usage between the cycle-accurate and the abstract simulation platforms. The X axis shows the NoC size and the Y axis shows the peak memory usage on a logarithmic scale. From this graph it is clear that the cycle-accurate simulation platform consumes far more memory than the abstract one. The peak memory size is 22 megabytes for the abstract simulation with 400 cores, while it is 3.7 gigabytes for the cycle-accurate one.

Figure 16: Comparison with abstract platform for DemoCar, ZigZag mapping - peak resident memory size in megabytes (NoC-TLM vs. NoC-CA, logarithmic scale) vs. NoC size in number of cores (Rows x Cols)

Simulation Time. Figure 17 shows the comparison of simulation time between the cycle-accurate and the abstract simulation platforms. The X axis shows the NoC size and the Y axis shows the simulation time in seconds on a logarithmic scale. As for the peak memory usage, the simulation time for the cycle-accurate platform is several orders of magnitude longer than for the abstract one. Again, this is explained by all the details simulated in the cycle-accurate platform to get exact results about what is happening in the hardware.

Execution Time. We now compare the execution time of the simulated application between the two simulation platforms. The additional memory usage and longer simulation time are needed to get accurate results about what is really happening in the hardware.

Figure 17: Comparison with abstract platform for DemoCar, ZigZag mapping - simulation time in seconds (NoC-TLM vs. NoC-CA, logarithmic scale) vs. NoC size in number of cores (Rows x Cols)

Figure 18 shows the results of this comparison. The X axis shows the NoC size and the Y axis shows the execution time of the application in nanoseconds. As we can see in the figure, the tendencies are the same for the two simulators. Nevertheless, the graph also highlights a difference between them. In all cases except the first one, for this application (DemoCar), the execution time reported by the cycle-accurate platform is lower than the one reported by the abstract platform. This is explained by the abstraction made by the abstract simulation platform in order to speed up simulation [8]. In this simulation platform, the NoC is simulated at packet level, and only the points in time when packets enter and leave the NoC are considered. As a consequence, a high-priority packet is considered to be using its route between the time it enters the NoC and the time it leaves it. In practice, because packets are divided into flits, and depending on its size, a packet will not occupy its whole route all the time. As a consequence, the abstract simulator considers contention that does not occur in reality and therefore overestimates the execution time.

Figure 18: Comparison with abstract platform for DemoCar, ZigZag mapping - execution time in nanoseconds (NoC-TLM vs. NoC-CA) vs. NoC size in number of cores (Rows x Cols)

3.4 Summary

This section introduced the particularities of the simulator for NoC-based architectures. We motivated the use of NoCTweak and presented how we extended it for our real-time and AMALTHEA modeling needs. The simulator has been validated with the DemoCar application provided by Bosch. We also presented a comparison with the abstract NoC simulator we developed previously. The simulation time and memory requirements of the cycle-accurate simulator are several orders of magnitude bigger, but the results are more accurate. This simulator removes the inaccuracies of the abstract simulator that result from the abstractions made to speed up the simulation.

4 Cycle-Accurate Simulator For Crossbar-Based Architectures

We present a new cycle-accurate simulation platform devoted to crossbar-based multicores. Initially, the DreamCloud project only targeted the NoC-based cycle-accurate simulation platform presented in the previous section. Nevertheless, we also devised a complementary crossbar-based platform (at the request of a partner) in order to cover an important family of multicore architectures that is very popular in automotive systems. In the sequel, we first describe the implementation of the cycle-accurate simulator for crossbar-based architectures. Then we report some experimental evaluation results, as well as a comparison with an industrial simulator.

4.1 Simulator Implementation

The crossbar simulator implementation relies on C++/SystemC and provides a configuration parameter to choose the arbitration policy. The main idea of the implementation consists in replacing the NoC communication SystemC component developed in our modular simulation framework by a crossbar SystemC component.

4.1.1 The Crossbar

Compared to NoC-based architectures, in a crossbar-based multicore there is a direct connection between every pair of elements linked to the crossbar. This connection is realized by the crossbar itself, as shown in Figure 19. These elements are typically cores and memories. As for the NoC simulator, a parameter allows configuring the number of tiles to be used. Nevertheless, as opposed to the NoC simulator, crossbar architectures are usually used in small shared-memory systems. As a consequence, each tile of the system is not both a core and a memory, but either a core or a memory. In order to reuse the NoC-based simulation platform and to be able to handle the simulation of such shared-memory systems, we implemented custom mappings that consider some tiles as memory tiles only and the others as core tiles only. In these mappings, the AMALTHEA runnables are mapped to core tiles only, and the AMALTHEA labels to memory tiles only.

Figure 19 shows the architecture of the crossbar simulator on an example of four elements connected to the crossbar. Each element has two associated buffers: one to send remote read and remote write requests to the crossbar, and another one to receive remote write requests and answers to remote read requests. An element wanting to send a request to the crossbar is blocked if its associated output buffer is full. Each element is notified each time a new request, or an answer to a previous remote read request, is written into its own input buffer; this buffering scheme is sketched below. The processing of the newly received message is then done by the element itself. This can be either the reply to a previous remote read request, or a remote write request from another element.
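The per-element buffering just described can be summarized by the following plain C++ sketch. The SystemC plumbing (clocked processes, signals, notification) is omitted, and all names and the FIFO size are illustrative assumptions.

```cpp
#include <cstddef>
#include <deque>

enum class ReqType { RemoteRead, RemoteWrite, ReadReply };

struct Request {
    ReqType type;
    int src, dst;   // identifiers of the source and destination elements
    int priority;   // only used by the Priority arbitration policy
};

struct Element {
    static constexpr std::size_t FIFO_SIZE = 4;  // buffer depth (made up)
    std::deque<Request> out;  // requests waiting to enter the crossbar
    std::deque<Request> in;   // writes and read replies delivered to us

    // An element wanting to send is blocked while its output buffer is
    // full; the caller must retry on a later clock cycle.
    bool trySend(const Request& r) {
        if (out.size() >= FIFO_SIZE)
            return false;
        out.push_back(r);
        return true;
    }
};
```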

The crossbar is responsible for carrying all the requests from their source to their destination. Depending on its internal architecture, some arbitration mechanisms may be implemented in order to decide which communication channel should have priority. The policy logic block inside the crossbar is responsible for performing this arbitration. The next section describes the different arbitration policies we support. A clock signal drives the behavior of the crossbar: on each clock cycle, the crossbar policy logic is activated to forward source messages to their destination.

Figure 19: Example of a crossbar-based architecture with 4 elements (e1-e4), the policy logic block and the clock signal

4.1.2 Arbitration Policy

The crossbar simulator supports three different arbitration policies. The policy to be used is specified through an additional parameter provided to the simulation scripts. All the policies below first check that the destination buffer of a request is not full.

Round Robin. This policy arbitrates concurrent accesses to the crossbar in a round-robin fashion. On each clock cycle, the policy starts from the element just after the last granted one and checks each element for a pending request (a sketch follows below).

Full. With this policy, the crossbar is able to handle, on each clock cycle, one request between every pair of elements connected to the crossbar.

Priority. This policy arbitrates concurrent accesses to the crossbar based on priority information. This information is specified at the crossbar-request level and must be assigned by the elements connected to the crossbar before sending the request.
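As an illustration of the first policy, the sketch below (reusing the Element and Request types from the previous sketch) forwards at most one pending request per cycle, starting the scan just after the last granted element. This single-grant variant is an assumption made for clarity; the actual crossbar may forward several disjoint source/destination pairs per cycle, as the Full policy suggests.

```cpp
#include <vector>
// Reuses the Element and Request types defined in the previous sketch.

class RoundRobinPolicy {
    int lastGranted = -1;  // index of the element granted last cycle

public:
    // Called once per clock cycle by the crossbar policy logic.
    void arbitrate(std::vector<Element>& elems) {
        const int n = static_cast<int>(elems.size());
        for (int k = 1; k <= n; ++k) {
            int i = (lastGranted + k) % n;    // start just after last winner
            if (elems[i].out.empty())
                continue;                      // no pending request here
            const Request& r = elems[i].out.front();
            if (elems[r.dst].in.size() < Element::FIFO_SIZE) {
                elems[r.dst].in.push_back(r);  // deliver to destination buffer
                elems[i].out.pop_front();
                lastGranted = i;
                return;                        // one grant per cycle
            }
        }
    }
};
```

Starting the scan after the last winner is what guarantees fairness: an element that was just served moves to the back of the search order for the next cycle.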

4.2 Preliminary Experimental Results

This section presents preliminary results of the crossbar-based simulator. It first introduces the simulated crossbar-based architecture and how we use the simulator described in the previous section to simulate this architecture. Then it shows results about execution and simulation time, along with results about memory utilization.

4.2.1 Modeling of Infineon Aurix Architecture

We use the cycle-accurate crossbar-based architecture simulator to simulate an Infineon Aurix architecture [2] used by Bosch. Figure 20 shows the block diagram of this architecture. It is made of three 32-bit TriCore CPUs (shown in red in the figure), different memories and different I/O components. A crossbar is used to let the cores communicate with the memories and with each other. In this architecture, all the cores can access all the memories.

Figure 20: Infineon Aurix Architecture

Figure 21 shows how we simulate this architecture with our simulator for crossbar-based architectures. Three core tiles are used to map only the runnables, representing the three TriCore CPUs, while two other tiles are used to represent the two PMU memories. As already stated, custom mappings have been implemented to ensure this.

To ease the simulation of the Aurix architecture based on TriCore CPUs, we provide the additional script shown in Figure 22 to run a simulation. This script fixes the size of the platform to be simulated to 3 cores and 3 memories. It also fixes the mapping strategy as described above.

Figure 21: Infineon Aurix Architecture Simulation Model (three core tiles c1-c3 and two memory tiles m1-m2 connected to the crossbar and its policy logic)

Moreover, the script also allows configuring both the size of the buffers in the crossbar and the latency of memory accesses.

4.2.2 Simulation of Automotive Applications on Aurix Architecture Model

The workstation used to get the results presented in this section is the same as the one used for the NoC-based simulator, already described in Section 3.3.1. Regarding the applications, we use DemoCar and Control System Engine, both provided by Bosch. DemoCar details have already been discussed in Section 3.3.1. Control System Engine is a real-life automotive application that controls the combustion process in the engine to produce the torque according to the run-time requirements invoked by the driver. The AMALTHEA model of this application comprises:

- 109 periodic tasks
- 1239 runnables
- labels

For both applications, Table 1 shows results about execution time, simulation time and peak memory usage of the simulator. These results have been obtained using the TriCore.py script described in Section 4.2.1, as follows for DemoCar:

>> ./TriCore.py -ca apps/demo_car.amxmi -xblrl 10 -xblwl 10 -xbrrl 10 -xbrwl 10 -xbfs 1 -xbp Full -np

For Control System Engine the command is:

>> ./TriCore.py -da CSE -xblrl 5 -xblwl 5 -xbrrl 5 -xbrwl 5 -xbfs 1 -xbp Full -np

>> ./TriCore.py --h
usage: TriCore.py [-h] [-d] [-da {DC,CSE}] [-ca CUSTOM_APPLICATION] [-f FREQ]
                  [-mf MODES_FILE] [-m MAPPING_STRATEGY [MAPPING_STRATEGY ...]]
                  [-np] [-o OUTPUT_FOLDER] [-r] [-s {fcfs,prio}] [-v]
                  [-xbp {Full,RoundRobin,Priority}] [-xbfs XBARFIFOSIZE]
                  [-xblrl XBARLOCALREADLATENCY] [-xblwl XBARLOCALWRITELATENCY]
                  [-xbrrl XBARREMOTEREADLATENCY] [-xbrwl XBARREMOTEWRITELATENCY]

Crossbar Simulator Runner script

optional arguments:
  -h, --help            show this help message and exit
  -d, --syntax_dependency
                        consider successive runnables in tasks call graph as dependent
  -da {DC,CSE}, --def_application {DC,CSE}
                        specify the application to be simulated
  -ca CUSTOM_APPLICATION, --custom_application CUSTOM_APPLICATION
                        specify a custom application file to be simulated
  -f FREQ, --freq FREQ  specify the frequency of all the cores in the
                        platform (e.g. 400MHz or 1GHz)
  -mf MODES_FILE, --modes_file MODES_FILE
                        specify a modes switching file to be simulated
  -m MAPPING_STRATEGY [MAPPING_STRATEGY ...], --mapping_strategy MAPPING_STRATEGY [MAPPING_STRATEGY ...]
                        specify the mapping strategy used to map runnables on
                        cores. Valid strategies are [ 3Core, StaticModes ]
  -np, --no_periodicity
                        run periodic runnables only once
  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        specify the absolute path of the output folder where
                        simulation results will be generated
  -r, --random          replace constant seed used to generate instructions
                        timing distributions by a random one based on the time
  -s {fcfs,prio}, --scheduling_strategy {fcfs,prio}
                        specify the scheduling strategy used by cores to
                        choose the runnable to execute
  -v, --verbose         enable verbose output
  -xbp {Full,RoundRobin,Priority}, --xbarpolicy {Full,RoundRobin,Priority}
                        specify the crossbar arbitration policy
  -xbfs XBARFIFOSIZE, --xbarfifosize XBARFIFOSIZE
                        specify the crossbar FIFO size
  -xblrl XBARLOCALREADLATENCY, --xbarlocalreadlatency XBARLOCALREADLATENCY
                        specify the latency of a local read
  -xblwl XBARLOCALWRITELATENCY, --xbarlocalwritelatency XBARLOCALWRITELATENCY
                        specify the latency of a local write
  -xbrrl XBARREMOTEREADLATENCY, --xbarremotereadlatency XBARREMOTEREADLATENCY
                        specify the latency of a remote read
  -xbrwl XBARREMOTEWRITELATENCY, --xbarremotewritelatency XBARREMOTEWRITELATENCY
                        specify the latency of a remote write

Figure 22: Script to simulate an AMALTHEA application running on top of the Aurix architecture

Application             Execution Time (ns)   Simulation Time (s)   Peak Memory (MB)
DemoCar
Control System Engine

Table 1: Execution and simulation time for DemoCar and Control System Engine simulated on the Aurix architecture

4.3 Assessment w.r.t. Performance Traces from Industrial Use-Case

Per the Description of Work (DOW) agreed to by the consortium, Bosch agreed to analyze the results provided by the cycle-accurate crossbar simulator developed by CNRS and to validate the results against real (automotive) hardware components. In order to provide maximum and timely support to CNRS in the development of their simulator, Bosch has chosen to use the Timing-Architects simulator to generate the reference results, see [15]. The use of Timing-Architects has enabled Bosch to quickly create application benchmarks of varying complexity to stress the CNRS simulator, as well as to quickly generate detailed performance estimates for comparison against the CNRS simulator. Compared to Timing-Architects, extracting performance data of the same quality from actual hardware for each new comparison test would usually take about 2-3 orders of magnitude more time.

The accuracy of simulation results provided by the Timing-Architects simulator has been verified in production projects executed at Bosch. For instance, in one production project, the worst-case end-to-end latencies of tasks distributed in a multicore environment as estimated by the Timing-Architects simulator were within 5% of the actual observations. For the same project, the estimation of core utilization by the Timing-Architects simulator was within 0.05% of the actual observations from the hardware. The Timing-Architects simulator is also in use at other major automotive companies, such as Audi, Volkswagen, Continental, and the BMW group.

4.3.1 The Timing-Architects Simulator

The Timing-Architects simulator takes as input the following items:

1. Description of the Hardware Platform: The most common elements required to be specified are: oscillator frequency, memories, interconnect (e.g., crossbar, bus), number of cores, and how these components are connected to each other, see Figure 23. In addition, the specification also requires the cost of accessing a memory element, e.g., the number of clock cycles required by a core to access a word in its local scratchpad. These values are extracted from the platform datasheet.

2. Description of the Operating System: The Timing-Architects simulator allows the specification of a custom operating system, together with the associated behavioral description, or the use of the built-in standard OSEK-like OS. Bosch has opted to use the built-in OSEK-like OS for generating the reference traces, due to the wide use of OSEK-like OSes in automotive computer systems.

Figure 23: A screenshot of Timing-Architects. Notice that the simulator has been configured for the Aurix TriCore platform.

3. Structure of the Application: The Timing-Architects simulator expects an accurate description of the simulated application, including the tasks, a fully ordered sequence of the functions (also called runnables in automotive software systems) executed in each task, and the priority of each task, see Figure 24. In Timing-Architects, a task τ_i with a priority i ≥ 0 interrupts another task τ_j if 0 ≤ i < j holds. In order to be able to provide extensive and representative reference data to the consortium, Bosch has opted to use both real and synthetic benchmark embedded applications. The synthetic benchmark applications are generated using statistical information derived from actual hardware implementations (e.g., number of tasks, number of runnables in each task, number of variables used, access pattern to these variables, distribution of tasks to cores, etc.). In addition, Bosch has carefully analyzed several applications in the automotive embedded domain and has been able to extract useful trends (e.g., the variation in the number of runnables between projects, anticipated increase in the complexity of software in the future). All synthetic benchmarks are generated by combining the available statistical as well as trend information, see [11].

Figure 24: Specification of the automotive embedded application in Timing-Architects

4. Performance Characteristics of Each Function: The Timing-Architects simulator also requires detailed timing observations related to each function used in the software system. In a practical system, each function takes a variable amount of time to execute, depending upon inputs, contention for resources, etc. It is therefore common to specify the best-case execution time (BCET), the worst-case execution time (WCET) and the mean execution time, along with the statistical distribution which describes the observed variations in the runtimes of each function (e.g., a Weibull distribution), see Figure 25 and the sketch after this list.

5. Task-to-Core Mapping: The information on how tasks are mapped to cores is to be provided. All functions contained in a task execute on the same core, as specified in the mapping information.
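For illustration, the fragment below shows one plausible way such a timing specification could be turned into per-instance execution times in a simulator: draw a variation from a Weibull distribution and clamp the result to the [BCET, WCET] interval. This is a hypothetical sketch of the idea, not Timing-Architects code, and all parameter values are made up.

```cpp
#include <algorithm>
#include <random>

// Draws one execution time for a function whose timing is specified as
// BCET/WCET plus a Weibull-distributed variation (hypothetical scheme).
double sampleExecTime(std::mt19937& gen,
                      double bcet, double wcet,
                      double shape, double scale) {
    std::weibull_distribution<double> variation(shape, scale);
    double t = bcet + variation(gen);  // variation added on top of best case
    return std::min(t, wcet);          // clamp: never exceed the worst case
}

// Example usage with made-up numbers (times in microseconds):
//   std::mt19937 gen(42);  // constant seed for reproducible simulations
//   double t = sampleExecTime(gen, 20.0, 80.0, 1.5, 10.0);
```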

Figure 25: Specification of the execution parameters of each function in Timing-Architects

4.3.2 Format of Results Produced by Timing-Architects

The Timing-Architects simulator produces a detailed, time-ordered trace of events such as cache misses, task preemptions and core utilization, to name a few. The traces are produced in the open-source Best Trace Format (BTF) proposed by the Eclipse Automotive Working Group, see [6, 7]. A snapshot of a BTF trace (v2.2.0) is shown in Figure 26.

Figure 26: A sample of the Timing-Architects output trace (see the BTF trace specification for more details)

In addition, Bosch has also provided helper (bash shell) scripts which operate on a given BTF file in order to extract features of interest. A brief description of the helper scripts is as follows:

1. list-tasks-systemwide: Lists all tasks that execute in the simulated system. Input: BTF trace file.
