D5.3 Accurate Simulation Platforms


Project Number 611411

D5.3 Accurate Simulation Platforms

Version 2.0
Final, Public Distribution

CNRS, Bosch

Project Partners: aicas, Bosch, CNRS, Rheon Media, The Open Group, University of Stuttgart, University of York

Every effort has been made to ensure that all statements and information contained herein are accurate; however, the DreamCloud Project Partners accept no liability for any error or omission in the same. Copyright in this document remains vested in the DreamCloud Project Partners.

Project Partner Contact Information

aicas: Fridtjof Siebert, Haid-und-Neue Strasse 18, Karlsruhe, Germany, siebert@aicas.com
Bosch: Jochen Härdtlein, Robert-Bosch-Strasse, Schwieberdingen, Germany, jochen.haerdtlein@de.bosch.com
CNRS: Gilles Sassatelli, Rue Ada, Montpellier, France, sassatelli@lirmm.fr
Rheon Media: Raj Patel, Leighton Avenue, Pinner, Middlesex HA5 3BW, United Kingdom, raj@rheonmedia.com
The Open Group: Scott Hansen, Avenue du Parc de Woluwe 56, Brussels, Belgium, s.hansen@opengroup.org
University of Stuttgart: Bastian Koller, Nobelstrasse, Stuttgart, Germany, koller@hlrs.de
University of York: Leandro Indrusiak, Deramore Lane, York YO10 5GH, United Kingdom, leandro.indrusiak@cs.york.ac.uk

Table of Contents

1 Introduction
  1.1 Contributions To The Project
  1.2 Terminology
  1.3 Structure Of This Document
2 Modular Simulation Platform
  2.1 Platform Overview
  2.2 Parser
  2.3 Execution Manager
  2.4 Mapper
  2.5 Simulation Platform Extensions
    2.5.1 Simulation Scripts
    2.5.2 Mapper Interface
    2.5.3 Application Graphs
  2.6 Summary
3 Cycle-Accurate Simulator For NoC-Based Architectures
  3.1 Related Work
  3.2 Simulator Implementation
    3.2.1 Simulated Hardware
    3.2.2 NoCTweak Design
    3.2.3 NoCTweak Extensions
  3.3 Experimental Results
    3.3.1 Setup
    3.3.2 Scalability
    3.3.3 Comparison With Abstract Platform
  3.4 Summary
4 Cycle-Accurate Simulator For Crossbar-Based Architectures
  4.1 Simulator Implementation
    4.1.1 The Crossbar
    4.1.2 Arbitration Policy
  4.2 Preliminary Experimental Results
    4.2.1 Modeling of Infineon Aurix Architecture
    4.2.2 Simulation of Automotive Applications on Aurix Architecture Model
  4.3 Assessment w.r.t. Performance Traces from Industrial Use-Case
    4.3.1 The Timing-Architects Simulator
    4.3.2 Format of Results Produced by Timing-Architects
    4.3.3 Extraction of Simulation Parameters
    4.3.4 Comparison of Performance Numbers
  4.4 Summary
5 Conclusions
References

Document Control

- 0.1: Initial draft by CNRS on the NoC cycle-accurate simulator
- Enhancement by CNRS with the crossbar cycle-accurate simulator
- Feedback integrated after internal review by CNRS
- Review and enhancement by Bosch with Timing-Architects
- Enhancement by CNRS with the comparison w.r.t. Timing-Architects
- 2.0: Final version for submission to the EU


Executive Summary

This document defines and describes cycle-accurate simulation platforms for the DreamCloud project. It thereby complements the abstract simulation platform previously proposed in deliverable D5.2. The current deliverable first presents an overview of the features common to all these simulation platforms. Then, it introduces some extended features provided beyond those already defined in D5.2. In addition, it covers the design and implementation of the new simulators, which target multicore architectures relying on various communication infrastructures: networks-on-chip and crossbars. To validate these simulators, several evaluation results are reported, including comparisons with the previous abstract simulator described in D5.2 and with an industrial simulator. The resulting modular cycle-accurate simulation platform allows a designer to build and evaluate different resource allocation heuristics for embedded applications running on a multicore platform, following the design flow proposed in DreamCloud, with AMALTHEA application models as inputs.

1 Introduction

This document presents and describes in detail the cycle-accurate simulation platforms that extend the abstract simulation capabilities proposed in the deliverable D5.2.

1.1 Contributions To The Project

In the simulation platform proposed in D5.2, the communication infrastructure of the hardware is simulated in a Transaction-Level Modeling fashion. This leads to fast simulation times at the expense of accuracy. In this deliverable, we present cycle-accurate simulation platforms, which simulate the communication infrastructure at a cycle-accurate granularity. The results provided by these platforms are thus accurate regarding timing, but require longer simulations. In particular, this deliverable describes cycle-accurate simulation platforms able to simulate both network-on-chip (NoC) and crossbar-based multicore architectures.

1.2 Terminology

In this document, we heavily use the terms simulation platform, architecture family and architecture. A simulation platform is a complete software environment, including software and hardware components, that allows simulating an application specified using a pre-determined modeling abstraction, in order to generate results such as execution time or satisfaction of real-time constraints. A simulation platform can support the simulation of applications running on top of different architecture families. The cycle-accurate simulation platform described in this deliverable supports two architecture families. The first one represents multicore architectures based on a NoC communication infrastructure, while the second one represents multicore architectures using a crossbar to let cores communicate. Finally, an architecture is a particular instance of an architecture family. In this document, an architecture mainly consists in specifying the number of cores to be used in a particular architecture family and their operating frequency.

Figure 1 shows a visual representation of the cycle-accurate simulation platform described in this document. As already stated, it supports two different architecture families, and for both of them it allows configuring the number of cores and their frequency. The NoC-based architecture family supports the 2D mesh topology only. The crossbar-based architecture family supports, in theory, an unlimited number of cores, but in practice it will be used to simulate platforms with few cores, because crossbars are only used in such small systems for scalability reasons.

1.3 Structure Of This Document

The rest of the deliverable is organized as follows. Section 2 describes the global architecture of the cycle-accurate simulation platform and the components that are common to the two supported multicore architecture families.

Figure 1: Cycle-accurate simulation platform described in this deliverable (two architecture families, NoC-based and crossbar-based, each instantiated as particular architectures such as 2x2 or 4x4 NoCs or a 4-element crossbar)

Figure 2: Abstract simulation platform described in D5.2 (a single NoC-based architecture family, with architectures such as 2x2 to 4x4)

It first introduces the basic blocks of the simulation platform. Then, the extensions regarding modularity and usability brought to the abstract simulation platform are presented. Section 3 presents the NoC-based cycle-accurate simulation platform. It first reviews related work about cycle-accurate NoC simulators and then introduces NoCTweak, the cycle-accurate NoC simulator that we use as a basis to build our own simulator. We also discuss the extensions implemented in NoCTweak to satisfy the needs of the DreamCloud project. Finally, we present the evaluation of the cycle-accurate simulator in terms of simulation time and accuracy, and we compare these results with the ones obtained with our previous abstract simulator. Section 4 then presents the crossbar-based simulation platform. The implementation of the platform is discussed, and simulation results for the DemoCar application provided by Bosch running on representative industrial hardware are presented.

2 Modular Simulation Platform

The cycle-accurate simulation platform is a modular system that accepts several input parameters to describe the hardware architecture under simulation. This section describes the global design of the platform and the components common to the two supported multicore architecture families. These two architecture families, a NoC-based one and a crossbar-based one, are described later in Section 3 and Section 4.

2.1 Platform Overview

This section first briefly introduces the different blocks composing the simulation platform. For a detailed description of each of them, please refer to the deliverable D5.2. Then, the extensions brought to the abstract simulation platform described in D5.2 are presented.

As depicted in Figure 3, the cycle-accurate simulation platform takes as input an application model and several configuration parameters to produce results in terms of execution time, number of deadline misses and energy consumption. The application model is specified using the AMALTHEA format, and results are provided as text files. The simulation platform itself is written using the C++ programming language and the SystemC simulation library.

Figure 3: Overview of the cycle-accurate simulation platform (an AMALTHEA application model and command-line configuration parameters feed the C++/SystemC parser, execution manager, mapper and architecture simulator, which produce text-file results: execution time, energy, deadline misses)

2.2 Parser

The parser component is responsible for reading the AMALTHEA model describing the application and building an in-memory object representation of it. In AMALTHEA, an application is described by a runnable call graph. Runnables are the AMALTHEA equivalent of what is called a job, or a task instance, in the real-time community. Because the AMALTHEA application modeling format is XML based, the parser relies on the Apache Xerces library. In the simulation platform, all the relevant AMALTHEA entities have an equivalent in-memory object instance, and thus the main responsibility of the parser is to create them. This in-memory representation is then passed to the execution manager handling the effective simulation of the application.

2.3 Execution Manager

The execution manager orchestrates the simulation according to the semantics of the AMALTHEA model provided by the parser. In other words, the execution manager has to manage the timing of the simulation by releasing the runnable instances when required. Basically, runnables are released on three different kinds of events:

- at time 0, when the application starts;
- each time a periodic timer ticks;
- at the completion of another runnable.

All the runnables that do not have any input dependency and that are not associated with a periodic timer are released immediately when the simulation starts, at time 0. For periodic runnables, the execution manager launches periodic timers at the beginning of the simulation, which trigger the release of the associated runnables at each period. Finally, for the runnables having execution dependencies, the execution manager relies on feedback provided by the architecture simulator: the completion of a runnable triggers the release of its dependent runnables. Using this feedback, the execution manager keeps track of the unsatisfied dependencies of each runnable, as sketched below. When the number of unsatisfied dependencies reaches 0, the runnable can be released.

When the execution manager releases a runnable, it has to allocate it on a particular core of the multicore architecture. The task of mapping the runnables onto the multicore architecture is handled by the mapper.

2.4 Mapper

The mapper decides where a particular runnable instance should be allocated on the multicore architecture. It answers requests coming from the execution manager. These requests contain the identifier of the runnable to be mapped, and the mapper responds with the identifier of the particular core of the platform responsible for executing the runnable in question. See Section 2.5.2 for the detailed description of the interface between the execution manager and the mapper.
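To make the release mechanism of Section 2.3 concrete, the following minimal C++ sketch shows the dependency-counting bookkeeping in isolation. All names (Runnable, ExecutionManager, release) are illustrative and do not necessarily match the actual DreamCloud sources.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical names throughout; this is a sketch, not the platform code.
struct Runnable {
    int unsatisfiedDeps;                  // input dependencies not yet met
    std::vector<std::string> successors;  // runnables waiting on this one
};

class ExecutionManager {
    std::map<std::string, Runnable> runnables;

    void release(const std::string& /*id*/) {
        // ask the mapper for a core, then hand the runnable to the
        // architecture simulator (both omitted in this sketch)
    }

public:
    // Callback invoked by the architecture simulator when a runnable
    // instance completes.
    void runnableEnded(const std::string& id) {
        for (const std::string& succ : runnables[id].successors) {
            // One more dependency of each successor is now satisfied;
            // release a successor once its counter reaches zero.
            if (--runnables[succ].unsatisfiedDeps == 0)
                release(succ);
        }
    }
};
```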

>> ./compile.py build --help
usage: compile.py build [-h] [-c] [-v]

optional arguments:
  -h, --help     show this help message and exit
  -c, --clean    clean previously compiled files before compiling
  -v, --verbose  enable verbose output

Figure 4: Script to compile the simulation platform

2.5 Simulation Platform Extensions

This section describes the extensions brought to the simulation platform described in D5.2. These extensions aim at making the simulation platform more modular and easier to use.

2.5.1 Simulation Scripts

In order to easily use the simulation platform, different scripts are provided. These scripts can be used to build the platform from source code, run a single simulation, or run multiple simulations. We now describe in detail how to use each of them.

Compiling the source code. The first step to use the simulation platform consists in compiling it from the source code using the compile.py script, whose help message is shown in Figure 4. The simulation platform has the following dependencies:

- cmake
- SystemC
- Xerces

As a consequence, in order to run the compile script you need to have the cmake tool and the SystemC and Xerces libraries installed on your system. If the two libraries are not installed in standard paths, you can define the SYSTEMC_HOME and/or XERCES_HOME environment variables to specify their custom paths.

Running a Single Simulation. The simulate.py script is provided to launch a single simulation. This script allows tuning all the input parameters of the simulation. These parameters concern either the application, the architecture or the mapping strategy. Figure 5 shows the help message of the simulate.py script, describing in detail each of the configuration parameters. As for the compile.py script, the SystemC and Xerces libraries must be correctly installed. If these libraries are not installed in standard paths, the XERCES_HOME and/or SYSTEMC_HOME environment variables should also be defined with their custom paths.

>> ./simulate.py --help
usage: simulate.py [-h] [-d] [-da {DC,CSE}] [-ca CUSTOM_APPLICATION] [-f FREQ]
                   [-mf MODES_FILE] [-m MAPPING_STRATEGY [MAPPING_STRATEGY ...]]
                   [-np] [-o OUTPUT_FOLDER] [-r] [-s {fcfs,prio}] [-v]
                   [-x ROWS] [-y COLS]

Cycle-accurate simulator runner script

optional arguments:
  -h, --help            show this help message and exit
  -d, --syntax_dependency
                        consider successive runnables in tasks call graph as dependent
  -da {DC,CSE}, --def_application {DC,CSE}
                        specify the application to be simulated among the default ones
  -ca CUSTOM_APPLICATION, --custom_application CUSTOM_APPLICATION
                        specify a custom application file to be simulated
  -f FREQ, --freq FREQ  specify the frequency of cores in the NoC. Supported
                        frequency units are Hz, KHz, MHz and GHz, e.g. 400MHz or 1GHz
  -mf MODES_FILE, --modes_file MODES_FILE
                        specify a modes switching file to be simulated
  -m MAPPING_STRATEGY [MAPPING_STRATEGY ...], --mapping_strategy MAPPING_STRATEGY [MAPPING_STRATEGY ...]
                        specify the mapping strategy used to map runnables on
                        cores and labels on memories. Valid strategies are
                        [ MinComm, Static, StaticSM, ZigZag, ZigZagSM, 3Core,
                        Random, StaticModes ]
  -np, --no_periodicity
                        run periodic runnables only once
  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        specify the absolute path of the output folder where
                        simulation results will be generated
  -r, --random          replace constant seed used to generate distributions
                        by a random one based on current time
  -s {fcfs,prio}, --scheduling_strategy {fcfs,prio}
                        specify the scheduling strategy used by cores to
                        choose the runnable to execute
  -v, --verbose         enable verbose output
  -x ROWS, --rows ROWS  specify the number of rows in the NoC
  -y COLS, --cols COLS  specify the number of columns in the NoC

Figure 5: Script to run a single simulation

>> ./simulators.py --help
usage: simulators.py [-h] [-v] [-s SIMULATORS] {build,run} ...

Simulators Script

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose         enable verbose output
  -s SIMULATORS, --simulators SIMULATORS
                        comma separated list of the simulators to use among
                        [ NoC-TLM, NoC-CA, Xbar-CA ]

valid subcommands:
  {build,run}

Figure 6: Common options for building several platforms or running several simulations in a single step

Running Multiple Simulations. In order to ease design space exploration, the simulation platform also provides a script allowing to build several architecture simulators or to launch several simulations at once. Figure 6 shows the common options for both building platforms and running simulations (mainly, one option allows choosing the platforms to be considered). Figure 7 shows the options when running several simulations at once. Depending on the input parameters, different types of results and graphs are generated. For example, we simulate the same application with different NoC sizes for the NoC-based cycle-accurate multicore architecture using the following command:

>> ./simulators.py -s NoC-CA run -m ZigZag -sc fcfs -n 2x2,4x4,8x8 -a DC

This command will generate graphs showing execution and simulation time according to the NoC size. Examples of such graphs are shown in Figure 15 of Section 3.3 and in Figure 16 of Section 3.3.3. This script can also be used to compare the results between the cycle-accurate simulation platform and the abstract one. We use this capability in Section 3.3.3 to automatically generate comparison graphs according to different metrics.

2.5.2 Mapper Interface

In order to let users easily customize the mapping strategy, a clean and simple interface is provided to separate concerns between the execution manager and the mapper. Each mapping strategy provided by the simulation platform uses the same interface, as shown in Figure 8. When a simulation is started, the corresponding mapper strategy is instantiated according to the mapping parameter provided by the user through the simulate.py script described above. As shown in Figure 8, two mapping strategies are provided by default in the cycle-accurate simulation platform. The ZigZag one simply maps the labels and the runnables in a zig-zag fashion on the different memories and cores of the architecture.

>> ./simulators.py run --help
usage: simulators.py run [-h] [-a {DC,CSE} [{DC,CSE} ...]]
                         [-m {ZigZag,MinComm} [{ZigZag,MinComm} ...]]
                         [-n NOCSIZES] [-sc {fcfs,prio} [{fcfs,prio} ...]]

optional arguments:
  -h, --help            show this help message and exit
  -a {DC,CSE} [{DC,CSE} ...], --applications {DC,CSE} [{DC,CSE} ...]
                        specify the applications to be simulated
  -m {ZigZag,MinComm} [{ZigZag,MinComm} ...], --mappings {ZigZag,MinComm} [{ZigZag,MinComm} ...]
                        specify the mapping strategies to be used
  -n NOCSIZES, --nocsizes NOCSIZES
                        comma separated list of NoC sizes to be simulated in
                        the form ROWxCOL
  -sc {fcfs,prio} [{fcfs,prio} ...], --schedulers {fcfs,prio} [{fcfs,prio} ...]
                        specify the schedulers to use

Figure 7: Options for running several simulations in a single step

Mapper Interface:
  memory maplabel(string id);
  core maprunnable(string id);
  void runnableended(long time, string id);
  ...
Implementations: ZigZag, Local Max, state of the art [14]

Figure 8: The Mapper Interface

The Local Max strategy also maps the labels in a zig-zag fashion on the memories, and then maps each runnable onto the core holding the largest label used by that runnable.

Adding new mapping strategies is a straightforward process. The only two functions that need to be implemented are the first two shown in Figure 8. maplabel is called by the execution manager at the beginning of the simulation, once for each label. In this function, the mapper should return the identifier of the memory where the label should be allocated. In the same way, maprunnable is called by the execution manager each time a runnable should be released. In this function, the mapper should return the identifier of the core where the given runnable should be executed. The mapper can also get some optional feedback from the execution manager. This feedback is provided through callback functions implemented by the mapping strategy and called by the execution manager. For example, the void runnableended(long time, string id) function shown in Figure 8 provides the completion time of runnables. There are a few other functions in the mapper interface, but the main idea is captured by these three.
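As an illustration, the following C++ sketch shows what a minimal custom strategy against the interface of Figure 8 could look like. It is a simple round-robin placement, written only to show the shape of the interface; the memory and core identifier types, the base class declaration and the constructor are assumptions, not the platform's actual code.

```cpp
#include <string>

using memory = int;  // memory tile identifier (assumed representation)
using core   = int;  // core tile identifier (assumed representation)

// Assumed rendering of the interface of Figure 8 as an abstract class.
class Mapper {
public:
    virtual memory maplabel(std::string id) = 0;
    virtual core   maprunnable(std::string id) = 0;
    virtual void   runnableended(long time, std::string id) {}
    virtual ~Mapper() = default;
};

class RoundRobinMapper : public Mapper {
    int nMemories, nCores;
    int nextMemory = 0, nextCore = 0;

public:
    RoundRobinMapper(int mems, int cores) : nMemories(mems), nCores(cores) {}

    // Called once per label at the start of the simulation.
    memory maplabel(std::string) override { return nextMemory++ % nMemories; }

    // Called each time a runnable is released.
    core maprunnable(std::string) override { return nextCore++ % nCores; }

    // Optional feedback: completion time of each runnable.
    void runnableended(long, std::string) override {
        // could be used to rebalance load based on observed finish times
    }
};
```

A more realistic strategy would exploit the runnableended feedback, for example to place communicating runnables close to each other, as the state-of-the-art strategy [14] does.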

2.5.3 Application Graphs

Because the AMALTHEA modeling environment doesn't provide a graphical representation of applications, we developed a tool to generate visual graphs from an AMALTHEA XML model. This tool has been integrated into the cycle-accurate simulation platform in order to automatically generate graphs for the application being simulated. The tool generates one graph showing task-level information only, and one graph showing both task and runnable information. These two graphs allow one to easily visualize the internal structure of the application.

Tasks and Runnables. The first graph generated by the visualization tool shows the graph of tasks composing the application and the graph of runnables composing each task. Figure 9 shows this graph for the DemoCar application provided by Bosch. As depicted, this application is made of 7 periodic tasks (shown in blue) without dependencies between them. These tasks contain a total of 19 runnables.

Figure 9: Application graph with tasks and runnables (DemoCar). Squares are tasks, with periodic ones shown in blue, and ellipses are runnables. In this application there is no inter-task dependency.

Tasks Only. The second graph generated by the visualization tool only contains tasks. This is useful for applications containing a large number of runnables, where it becomes really hard to visualize all of them at the same time. Figure 10 shows this graph for the Control System Engine application, also provided by Bosch. This application contains 1239 runnables and 109 tasks. The figure shows only a part of the task graph, but the structure of the application is clear from it: it comprises a few periodic tasks releasing non-periodic ones, and some other tasks executed only once.

Figure 10: Application graph with tasks only (a part of Control System Engine). Periodic tasks are shown in blue.

2.6 Summary

This section introduced the design of the modular cycle-accurate simulation platform. It also introduced all the extensions added to the abstract simulation platform described in the deliverable D5.2. These extensions aim at making the simulation platform more modular, configurable and user friendly, in order to facilitate its adoption by users outside the DreamCloud project. The next two sections describe the particularities of the NoC-based and the crossbar-based multicore architecture families.

3 Cycle-Accurate Simulator For NoC-Based Architectures

This section presents the cycle-accurate simulator for NoC-based multicore architectures. We first present related work about cycle-accurate NoC simulators. Then we introduce NoCTweak, the cycle-accurate NoC simulator that we use as a basis to build our own simulator, and how we extended it for the needs of the DreamCloud project. Finally, we present results about the scalability of the simulator according to different metrics, and a comparison of simulation results with the abstract simulation platform described in D5.2.

3.1 Related Work

This subsection reviews some of the existing work on cycle-accurate simulation of NoCs and motivates our choice of NoCTweak as the basis for our cycle-accurate simulator.

BookSim is a cycle-accurate NoC simulator developed at Stanford University. It was originally designed and introduced to support the book Principles and Practices of Interconnection Networks [5]. Since then, it has been extended with many recent features of NoC design. The current major release, BookSim 2.0 [10], supports a wide range of topologies such as mesh, torus and flattened butterfly networks, provides diverse routing algorithms and includes numerous options for customizing the network's router microarchitecture. BookSim is written in C++ but does not rely on the SystemC simulation library.

Noxim [4] is also a cycle-accurate network-on-chip simulator. It is written in SystemC and mainly provides a command-line interface for evaluating the common standard performance metrics of NoCs, such as throughput and latency. The command line allows configuring different parameters regarding the internal microarchitecture of the routers to be simulated.

TOPAZ [1] is another cycle-accurate network-on-chip simulator. TOPAZ can be used standalone, through synthetic traffic patterns and application traces, or within full-system evaluation environments such as GEM5. TOPAZ enables the modeling of a wide variety of message routers, with different trade-offs between speed and precision. It originates from the SICOSYS [13] simulator, which was originally conceived to obtain results very close to those obtained by using hardware description languages.

NoCTweak [16] is a parameterizable 2D mesh NoC simulator used for early exploration of performance and energy efficiency of on-chip networks. The simulator has been developed in C++/SystemC, allowing fast modeling of concurrent hardware modules at cycle-level accuracy.

We chose NoCTweak as the basis for our own cycle-accurate NoC simulator for several reasons. First, NoCTweak is a recent open-source project with a relatively large community of users. Second, as will be described in the next section, the design of NoCTweak makes it easily extendable. Finally, because it is written in C++/SystemC, it allows reusing the execution manager, also written in C++/SystemC, from our abstract simulation platform described in the deliverable D5.2.

3.2 Simulator Implementation

This section introduces the simulator, describes the design of NoCTweak, and explains how we extended it in order to simulate a NoC supporting real-time features.

3.2.1 Simulated Hardware

The NoC-based simulator supports the simulation of 2D mesh NoC-based architectures. Each tile of the 2D mesh topology is made of a computing element, i.e. a core, a private memory, and a router allowing the core to communicate with neighboring cores. Figure 11 shows an example of a 4x4 2D mesh NoC-based architecture.

Figure 11: 4x4 2D mesh network (each tile contains a core, a memory and a router)

3.2.2 NoCTweak Design

NoCTweak is a cycle-accurate NoC simulator for 2D mesh topologies. It is written in C++/SystemC and is designed around two main abstractions. The first one represents a core of the simulated hardware, while the second one represents a router. Figure 12 shows these two entities graphically and how they connect.

NoCTweak provides two different types of cores. The first one, called Synthetic, describes cores that only inject packets into the network, either in a uniform random fashion or with particular hotspots. The second type of core, called Embedded, can be used to inject packets according to traces from real embedded applications.

Regarding the routers, NoCTweak also provides two different types. The first one is the widespread wormhole router architecture, where packets are divided into flits. The first flit is called the header flit, and all other flits follow the route opened by the header one. The second type of router provided has virtual channels. As in wormhole routing, the packets to be exchanged are divided into flits, but in this case each router has several virtual channels, which make better use of the router's resources by allowing several packets to be stored and processed in a given router at the same time.

The gray part of Figure 12 shows how the cores and the routers are connected together to make an architecture. When a simulation is launched in NoCTweak, a NoC-based architecture is built using a particular type of core, i.e. Synthetic or Embedded, and a particular type of router, i.e. Wormhole or Virtual Channels. This architecture is made of tiles, where each tile is made of both a core and a router. The number of tiles created depends on a configuration parameter provided to the simulator. Because NoCTweak focuses on the 2D mesh topology, the user specifies the horizontal and the vertical size of the NoC.

Figure 12: Class diagram for the main classes of NoCTweak (an Architecture aggregates Tiles; each Tile contains one Router Interface, specialized as Virtual Channels or Wormhole, and one Core Interface, specialized as Embedded or Synthetic)

3.2.3 NoCTweak Extensions

NoCTweak does not support the notion of real-time packet management, as stated in the related work section. As a consequence, we extended it in order to support routers with virtual channels and priority-preemptive arbitration [3]. Indeed, one of the targets of the DreamCloud project is embedded automotive applications with real-time constraints. To satisfy such constraints, NoCs with virtual channels and priority-preemptive arbitration mechanisms have been chosen. Because AMALTHEA is the application modeling language used in the project, we also extended NoCTweak to support the simulation of applications described using this format. Figure 13 shows how we extended the NoCTweak framework with a new type of router supporting priority-preemptive arbitration and a new type of core supporting the simulation of AMALTHEA applications. With these two extensions, the simulator is now able to simulate AMALTHEA applications on the same NoC as the one simulated by the abstract simulation platform described in the deliverable D5.2. This allows a fair comparison between the cycle-accurate platform and the abstract one, as will be presented in the next section. Figure 13 also shows how the modular mapping solution described in Section 2.5.2 is integrated into NoCTweak.

Implementing a new router type with virtual channels and priority-preemptive arbitration was quite straightforward. We added a new virtual channel allocator and a new switch allocator taking into account the priority of the flit to be transmitted, as sketched below. These two priority-aware allocators replace the existing ones of the default virtual channel router. Regarding the AMALTHEA core type, the implementation was more complex because it needed to integrate the execution manager responsible for releasing the runnables. This execution manager has been integrated partly into the new AMALTHEA core type and partly into the architecture class. The configuration parameters of NoCTweak have also been extended to allow specifying these new router and core types, as well as the additional parameters of the AMALTHEA core type compared to the two simple core types provided by NoCTweak (e.g., to let the user specify which scheduling strategy to use on cores).
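The following C++ fragment sketches the core of such a priority-aware allocation decision: among the virtual channels competing for an output port, the flit with the highest priority wins and can thereby preempt lower-priority traffic. It is a simplified illustration, not the actual NoCTweak code; in particular, the VirtualChannel structure and the convention that lower values mean higher priority are assumptions.

```cpp
#include <vector>

struct VirtualChannel {
    bool hasFlit;       // a flit is waiting at the head of this VC
    int  headPriority;  // priority carried by that head flit
};

// Returns the index of the VC granted the output port this cycle,
// or -1 if no VC has a flit to send.
int allocateSwitch(const std::vector<VirtualChannel>& vcs) {
    int winner = -1;
    for (int i = 0; i < static_cast<int>(vcs.size()); ++i) {
        if (!vcs[i].hasFlit)
            continue;
        // Lower value = higher priority (assumed convention): a newly
        // arrived high-priority flit preempts lower-priority traffic.
        if (winner == -1 || vcs[i].headPriority < vcs[winner].headPriority)
            winner = i;
    }
    return winner;
}
```

Because this decision is re-evaluated on every cycle, a high-priority packet arriving mid-transfer wins the port on the next flit boundary, which is what makes the arbitration preemptive.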

Figure 13: Class diagram for the extended version of NoCTweak (the Architecture now holds a Mapper Interface with ZigZag and Local Max implementations; the Router Interface gains a Virtual Channels RT specialization and the Core Interface an AMALTHEA specialization)

3.3 Experimental Results

Using the simulator described in the previous section, we are able to simulate, in a cycle-accurate way, the execution of AMALTHEA applications running on a NoC-based multicore platform. This section first reports results about the simulation time and the memory requirements of this simulator. Then, results of the comparison with the abstract simulation platform described in D5.2 are presented.

3.3.1 Setup

All the results reported in this section come from the DemoCar application provided by Bosch. The AMALTHEA model of this application comprises:

- 8 periodic tasks
- 19 runnables
- 62 labels

All the presented results correspond to a single iteration of the application, i.e. each task is executed only once. Moreover, these results have been generated by considering only explicit dependencies between runnables. In other words, consecutive runnables R1 and R2 in a task are considered dependent only if R1 contains a runnable call to R2. The mapping considered is ZigZag, and scheduling is priority based. Finally, for reproducibility concerns, all the results are generated using a random generator initialized

with a constant seed for instructions with random behavior (the AMALTHEA model supports this kind of modeling).

The workstation used to generate the presented results includes an Intel Core i processor running at 3.4 GHz with 8 GB of memory. The processor is a quad-core processor, but it is worth mentioning that most SystemC kernel implementations, including the Accellera one that we use, are single threaded. As a consequence, increasing the number of cores will not increase simulation performance. All the graphs presented in this section are generated automatically using the simulation scripts described in Section 2.5.1.

3.3.2 Scalability

This section shows how the NoC-based cycle-accurate simulator scales with the number of cores to be simulated, in terms of simulation time and memory requirements. Indeed, a cycle-accurate simulator automatically leads to longer simulation times and larger memory requirements because of the additional details simulated compared to a less accurate simulator.

Memory Usage. Figure 14 shows the memory usage required by the simulator for different NoC sizes. These results have been obtained with the standard Linux time tool, which is able to track memory usage in addition to timing information. The reported metric is the peak resident set size (RSS) of the simulator process. This represents the maximum amount of physical RAM used at a given time by the process. As shown by this graph, the peak RSS is a few megabytes for small NoCs, but quickly grows with the number of cores, reaching almost 4 gigabytes for a NoC with 400 cores. We don't report results above 400 cores because, on the machine with 8 gigabytes of memory used for the experiments, a simulation with 30x30 cores uses more than the 8 gigabytes, leading to swapping and extremely long simulations. Nevertheless, common server machines can be equipped with several hundred gigabytes, allowing the simulation of NoC-based multicore architectures far bigger than the presented ones. This high memory usage could probably be decreased by profiling memory usage and implementing smart optimizations. Nevertheless, we already performed rough analyses, and most of this high memory consumption comes from the thousands of objects required to simulate the NoC architecture in a cycle-accurate way. These objects include, for example, the internal buffers used to store the flits, the virtual channel allocators, the input and output ports and all the signals connecting these components.

Simulation Time. Figure 15 shows the simulation time for different NoC sizes. Because of the memory limit described in the previous paragraph, the biggest NoC we use on our experiment machine is 20x20. A few seconds are required to simulate the DemoCar application on small NoCs.

Figure 14: Peak memory usage for DemoCar, ZigZag mapping (peak memory usage in megabytes vs. NoC size in number of cores, Rows x Cols)

Then, as for the memory usage, simulation time sharply increases with the number of cores in the simulated multicore architecture. For the architecture with 400 cores, the simulation takes more than 200 seconds, even though many cores of the platform are not used. This is explained by the cycle-accurate simulation level. Indeed, the simulator still simulates these inactive components, as in the real hardware, to check on each clock cycle whether they have something to do. As stated in the previous section, this is explained by all the details present in the cycle-accurate simulator compared to the abstract one. Indeed, in the abstract simulator, adding a new core mainly consists in adding two simple objects: the core itself and the associated router. There is no need to create objects for all the internal details of the router, because they are abstracted away by the simulation.

Figure 15: Simulation time for DemoCar, ZigZag mapping (simulation time in seconds vs. NoC size in number of cores, Rows x Cols)

3.3.3 Comparison With Abstract Platform

We now compare the cycle-accurate simulation platform with the abstract simulation platform previously developed. For more information about the abstract simulation platform, refer to D5.2 and related publications [12, 9]. The results presented below are automatically generated using the simulators.py script described in Section 2.5.1 with the following parameters:

>> simulators.py run -sc fcfs -m ZigZag -n 2x2,3x3,4x4,5x5,6x6,7x7,8x8,9x9,10x10,15x15,20x20 -a DC

Memory Usage. Figure 16 shows the comparison of peak resident set memory usage between the cycle-accurate and the abstract simulation platforms. The X axis shows the NoC size and the Y axis shows the peak memory usage on a logarithmic scale. From this graph it is clear that the cycle-accurate simulation platform consumes far more memory than the abstract one. The peak memory size is 22 megabytes for the abstract simulation with 400 cores, while it is 3.7 gigabytes for the cycle-accurate one.

Figure 16: Comparison with abstract platform for DemoCar, ZigZag mapping - peak resident memory size in megabytes (NoC-TLM vs. NoC-CA, logarithmic scale) vs. NoC size in number of cores (Rows x Cols)

Simulation Time. Figure 17 shows the comparison of simulation time between the cycle-accurate and the abstract simulation platforms. The X axis shows the NoC size and the Y axis shows the simulation time in seconds on a logarithmic scale. As for the peak memory usage, the simulation time for the cycle-accurate platform is several orders of magnitude longer than for the abstract one. Again, this is explained by all the details simulated in the cycle-accurate platform to get exact results about what is happening in the hardware.

Execution Time. We now compare the execution time of the simulated application between the two simulation platforms. The additional memory usage and longer simulation time are needed to get accurate results about what is really happening in the hardware.

Figure 17: Comparison with abstract platform for DemoCar, ZigZag mapping - simulation time in seconds (NoC-TLM vs. NoC-CA, logarithmic scale) vs. NoC size in number of cores (Rows x Cols)

Figure 18 shows the results of this comparison. The X axis shows the NoC size and the Y axis shows the execution time of the application in nanoseconds. As we can see in the figure, the tendencies are the same for the two simulators. Nevertheless, the graph also highlights a difference between them. In all cases except the first one, for this application (DemoCar), the execution time reported by the cycle-accurate platform is lower than the one reported by the abstract platform. This is explained by the abstraction made by the abstract simulation platform in order to speed up simulation [8]. In this simulation platform, the NoC is simulated at packet level, and only the points in time when packets enter and leave the NoC are considered. As a consequence, a high-priority packet is considered to be using its route between the time it enters the NoC and the time it leaves it. In practice, because packets are divided into flits, and depending on its size, a packet will not occupy its whole route all the time. As a consequence, the abstract simulator considers contention that does not occur in reality and therefore overestimates the execution time.

Figure 18: Comparison with abstract platform for DemoCar, ZigZag mapping - execution time in nanoseconds (NoC-TLM vs. NoC-CA) vs. NoC size in number of cores (Rows x Cols)

3.4 Summary

This section introduced the particularities of the simulator for NoC-based architectures. We motivated the use of NoCTweak and presented how we extended it for our real-time and AMALTHEA modeling needs. The simulator has been validated with the DemoCar application provided by Bosch. We also presented a comparison with the abstract NoC simulator we developed previously. The simulation time and memory requirements of the cycle-accurate simulator are several orders of magnitude bigger, but the results are more accurate. This simulator removes the inaccuracies of the abstract simulator that result from the abstractions made to speed up the simulation.

4 Cycle-Accurate Simulator For Crossbar-Based Architectures

We present a new cycle-accurate simulation platform devoted to crossbar-based multicores. Initially, the DreamCloud project only targeted the NoC-based cycle-accurate simulation platform presented in the previous section. Nevertheless, we also devised a complementary crossbar-based platform (at the request of a partner) in order to cover an important family of multicore architectures that is very popular in automotive systems. In the sequel, we first describe the implementation of the cycle-accurate simulator for crossbar-based architectures. Then we report some experimental evaluation results, as well as a comparison with an industrial simulator.

4.1 Simulator Implementation

The crossbar simulator implementation relies on C++/SystemC and provides a configuration parameter to choose the arbitration policy. The main idea of the implementation consists in replacing the NoC communication SystemC component developed in our modular simulation framework by a crossbar SystemC component.

4.1.1 The Crossbar

Compared to NoC-based architectures, in a crossbar-based multicore there is a direct connection between every pair of elements linked to the crossbar. This connection is realized by the crossbar itself, as shown in Figure 19. These elements are typically cores and memories. As for the NoC simulator, a parameter allows configuring the number of tiles to be used. Nevertheless, as opposed to the NoC simulator, crossbar architectures are usually used in small shared-memory systems. As a consequence, each tile of the system is not both a core and a memory, but either a core or a memory. In order to reuse the NoC-based simulation platform and to be able to handle the simulation of such shared-memory systems, we implemented custom mappings that consider some tiles as memory tiles only and the others as core tiles only. In these mappings, the AMALTHEA runnables are mapped to core tiles only, and the AMALTHEA labels to memory tiles only.

Figure 19 shows the architecture of the crossbar simulator on an example of four elements connected to the crossbar. Each element has two associated buffers: one to send remote read and remote write requests to the crossbar, and another one to receive remote write requests and answers to remote read requests. An element wanting to send a request to the crossbar is blocked if its associated output buffer is full. Each element is notified each time a new request, or an answer to a previous remote read request, is written into its own input buffer; this buffering scheme is sketched below. The processing of the newly received message is then done by the element itself. This can be either the reply to a previous remote read request, or a remote write request from another element.
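The per-element buffering just described can be summarized by the following plain C++ sketch. The SystemC plumbing (clocked processes, signals, notification) is omitted, and all names and the FIFO size are illustrative assumptions.

```cpp
#include <cstddef>
#include <deque>

enum class ReqType { RemoteRead, RemoteWrite, ReadReply };

struct Request {
    ReqType type;
    int src, dst;   // identifiers of the source and destination elements
    int priority;   // only used by the Priority arbitration policy
};

struct Element {
    static constexpr std::size_t FIFO_SIZE = 4;  // buffer depth (made up)
    std::deque<Request> out;  // requests waiting to enter the crossbar
    std::deque<Request> in;   // writes and read replies delivered to us

    // An element wanting to send is blocked while its output buffer is
    // full; the caller must retry on a later clock cycle.
    bool trySend(const Request& r) {
        if (out.size() >= FIFO_SIZE)
            return false;
        out.push_back(r);
        return true;
    }
};
```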

The crossbar is responsible for carrying all the requests from their source to their destination. Depending on its internal architecture, some arbitration mechanisms may be implemented in order to decide which communication channel should have priority. The policy logic block inside the crossbar is responsible for performing this arbitration. The next section describes the different arbitration policies we support. A clock signal drives the behavior of the crossbar: on each clock cycle, the crossbar policy logic is activated to forward source messages to their destination.

Figure 19: Example of a crossbar-based architecture with 4 elements (e1-e4), the policy logic block and the clock signal

4.1.2 Arbitration Policy

The crossbar simulator supports three different arbitration policies. The policy to be used is specified through an additional parameter provided to the simulation scripts. All the policies below first check that the destination buffer of a request is not full.

Round Robin. This policy arbitrates concurrent accesses to the crossbar in a round-robin fashion. On each clock cycle, the policy starts from the element just after the last granted one and checks each element for a pending request (a sketch follows below).

Full. With this policy, the crossbar is able to handle, on each clock cycle, one request between every pair of elements connected to the crossbar.

Priority. This policy arbitrates concurrent accesses to the crossbar based on priority information. This information is specified at the crossbar-request level and must be assigned by the elements connected to the crossbar before sending the request.
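As an illustration of the first policy, the sketch below (reusing the Element and Request types from the previous sketch) forwards at most one pending request per cycle, starting the scan just after the last granted element. This single-grant variant is an assumption made for clarity; the actual crossbar may forward several disjoint source/destination pairs per cycle, as the Full policy suggests.

```cpp
#include <vector>
// Reuses the Element and Request types defined in the previous sketch.

class RoundRobinPolicy {
    int lastGranted = -1;  // index of the element granted last cycle

public:
    // Called once per clock cycle by the crossbar policy logic.
    void arbitrate(std::vector<Element>& elems) {
        const int n = static_cast<int>(elems.size());
        for (int k = 1; k <= n; ++k) {
            int i = (lastGranted + k) % n;    // start just after last winner
            if (elems[i].out.empty())
                continue;                      // no pending request here
            const Request& r = elems[i].out.front();
            if (elems[r.dst].in.size() < Element::FIFO_SIZE) {
                elems[r.dst].in.push_back(r);  // deliver to destination buffer
                elems[i].out.pop_front();
                lastGranted = i;
                return;                        // one grant per cycle
            }
        }
    }
};
```

Starting the scan after the last winner is what guarantees fairness: an element that was just served moves to the back of the search order for the next cycle.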

4.2 Preliminary Experimental Results

This section presents preliminary results of the crossbar-based simulator. It first introduces the simulated crossbar-based architecture and how we use the simulator described in the previous section to simulate this architecture. Then it shows results about execution and simulation time, along with results about memory utilization.

4.2.1 Modeling of Infineon Aurix Architecture

We use the cycle-accurate crossbar-based architecture simulator to simulate an Infineon Aurix architecture [2] used by Bosch. Figure 20 shows the block diagram of this architecture. It is made of three 32-bit TriCore CPUs (shown in red in the figure), different memories and different I/O components. A crossbar is used to let the cores communicate with the memories and with each other. In this architecture, all the cores can access all the memories.

Figure 20: Infineon Aurix Architecture

Figure 21 shows how we simulate this architecture with our simulator for crossbar-based architectures. Three core tiles are used to map only the runnables, representing the three TriCore CPUs, while two other tiles are used to represent the two PMU memories. As already stated, custom mappings have been implemented to ensure this.

To ease the simulation of the Aurix architecture based on TriCore CPUs, we provide the additional script shown in Figure 22 to run a simulation. This script fixes the size of the platform to be simulated to 3 cores and 3 memories. It also fixes the mapping strategy as described above.

Figure 21: Infineon Aurix Architecture Simulation Model (three core tiles c1-c3 and two memory tiles m1-m2 connected to the crossbar and its policy logic)

Moreover, the script also allows configuring both the size of the buffers in the crossbar and the latency of memory accesses.

4.2.2 Simulation of Automotive Applications on Aurix Architecture Model

The workstation used to get the results presented in this section is the same as the one used for the NoC-based simulator, already described in Section 3.3.1. Regarding the applications, we use DemoCar and Control System Engine, both provided by Bosch. DemoCar details have already been discussed in Section 3.3.1. Control System Engine is a real-life automotive application that controls the combustion process in the engine to produce the torque according to the run-time requirements invoked by the driver. The AMALTHEA model of this application comprises:

- 109 periodic tasks
- 1239 runnables
- labels

For both applications, Table 1 shows results about execution time, simulation time and peak memory usage of the simulator. These results have been obtained using the TriCore.py script described in Section 4.2.1, as follows for DemoCar:

>> ./TriCore.py -ca apps/demo_car.amxmi -xblrl 10 -xblwl 10 -xbrrl 10 -xbrwl 10 -xbfs 1 -xbp Full -np

For Control System Engine the command is:

>> ./TriCore.py -da CSE -xblrl 5 -xblwl 5 -xbrrl 5 -xbrwl 5 -xbfs 1 -xbp Full -np

>> ./TriCore.py --h
usage: TriCore.py [-h] [-d] [-da {DC,CSE}] [-ca CUSTOM_APPLICATION] [-f FREQ]
                  [-mf MODES_FILE] [-m MAPPING_STRATEGY [MAPPING_STRATEGY ...]]
                  [-np] [-o OUTPUT_FOLDER] [-r] [-s {fcfs,prio}] [-v]
                  [-xbp {Full,RoundRobin,Priority}] [-xbfs XBARFIFOSIZE]
                  [-xblrl XBARLOCALREADLATENCY] [-xblwl XBARLOCALWRITELATENCY]
                  [-xbrrl XBARREMOTEREADLATENCY] [-xbrwl XBARREMOTEWRITELATENCY]

Crossbar Simulator Runner script

optional arguments:
  -h, --help            show this help message and exit
  -d, --syntax_dependency
                        consider successive runnables in tasks call graph as dependent
  -da {DC,CSE}, --def_application {DC,CSE}
                        specify the application to be simulated
  -ca CUSTOM_APPLICATION, --custom_application CUSTOM_APPLICATION
                        specify a custom application file to be simulated
  -f FREQ, --freq FREQ  specify the frequency of all the cores in the
                        platform (e.g. 400MHz or 1GHz)
  -mf MODES_FILE, --modes_file MODES_FILE
                        specify a modes switching file to be simulated
  -m MAPPING_STRATEGY [MAPPING_STRATEGY ...], --mapping_strategy MAPPING_STRATEGY [MAPPING_STRATEGY ...]
                        specify the mapping strategy used to map runnables on
                        cores. Valid strategies are [ 3Core, StaticModes ]
  -np, --no_periodicity
                        run periodic runnables only once
  -o OUTPUT_FOLDER, --output_folder OUTPUT_FOLDER
                        specify the absolute path of the output folder where
                        simulation results will be generated
  -r, --random          replace constant seed used to generate instructions
                        timing distributions by a random one based on the time
  -s {fcfs,prio}, --scheduling_strategy {fcfs,prio}
                        specify the scheduling strategy used by cores to
                        choose the runnable to execute
  -v, --verbose         enable verbose output
  -xbp {Full,RoundRobin,Priority}, --xbarpolicy {Full,RoundRobin,Priority}
                        specify the crossbar arbitration policy
  -xbfs XBARFIFOSIZE, --xbarfifosize XBARFIFOSIZE
                        specify the crossbar FIFO size
  -xblrl XBARLOCALREADLATENCY, --xbarlocalreadlatency XBARLOCALREADLATENCY
                        specify the latency of a local read
  -xblwl XBARLOCALWRITELATENCY, --xbarlocalwritelatency XBARLOCALWRITELATENCY
                        specify the latency of a local write
  -xbrrl XBARREMOTEREADLATENCY, --xbarremotereadlatency XBARREMOTEREADLATENCY
                        specify the latency of a remote read
  -xbrwl XBARREMOTEWRITELATENCY, --xbarremotewritelatency XBARREMOTEWRITELATENCY
                        specify the latency of a remote write

Figure 22: Script to simulate an AMALTHEA application running on top of the Aurix architecture

Application             Execution Time (ns)   Simulation Time (s)   Peak Memory (MB)
DemoCar
Control System Engine

Table 1: Execution and simulation time for DemoCar and Control System Engine simulated on the Aurix architecture

4.3 Assessment w.r.t. Performance Traces from Industrial Use-Case

Per the Description of Work (DOW) agreed to by the consortium, Bosch agreed to analyze the results provided by the cycle-accurate crossbar simulator developed by CNRS and to validate the results against real (automotive) hardware components. In order to provide maximum and timely support to CNRS in the development of their simulator, Bosch has chosen to use the Timing-Architects simulator to generate the reference results, see [15]. The use of Timing-Architects has enabled Bosch to quickly create application benchmarks of varying complexity to stress the CNRS simulator, as well as to quickly generate detailed performance estimates for comparison against the CNRS simulator. Compared to Timing-Architects, extracting performance data of the same quality from actual hardware for each new comparison test would usually take about 2-3 orders of magnitude more time.

The accuracy of simulation results provided by the Timing-Architects simulator has been verified in production projects executed at Bosch. For instance, in one production project, the worst-case end-to-end latencies of tasks distributed in a multicore environment as estimated by the Timing-Architects simulator were within 5% of the actual observations. For the same project, the estimation of core utilization by the Timing-Architects simulator was within 0.05% of the actual observations from the hardware. The Timing-Architects simulator is also in use at other major automotive companies, such as Audi, Volkswagen, Continental, and the BMW group.

4.3.1 The Timing-Architects Simulator

The Timing-Architects simulator takes as input the following items:

1. Description of the Hardware Platform: The most common elements required to be specified are: oscillator frequency, memories, interconnect (e.g., crossbar, bus), number of cores, and how these components are connected to each other, see Figure 23. In addition, the specification also requires the cost of accessing a memory element, e.g., the number of clock cycles required by a core to access a word in its local scratchpad. These values are extracted from the platform datasheet.

2. Description of the Operating System: The Timing-Architects simulator allows the specification of a custom operating system, together with the associated behavioral description, or the use of the built-in standard OSEK-like OS. Bosch has opted to use the built-in OSEK-like OS for generating the reference traces, due to the wide use of OSEK-like OSes in automotive computer systems.

Figure 23: A screenshot of Timing-Architects. Notice that the simulator has been configured for the Aurix TriCore platform.

3. Structure of the Application: The Timing-Architects simulator expects an accurate description of the simulated application, including the tasks, a fully ordered sequence of the functions (also called runnables in automotive software systems) executed in each task, and the priority of each task, see Figure 24. In Timing-Architects, a task τ_i with a priority i ≥ 0 interrupts another task τ_j if 0 ≤ i < j holds. In order to be able to provide extensive and representative reference data to the consortium, Bosch has opted to use both real and synthetic benchmark embedded applications. The synthetic benchmark applications are generated using statistical information derived from actual hardware implementations (e.g., number of tasks, number of runnables in each task, number of variables used, access pattern to these variables, distribution of tasks to cores, etc.). In addition, Bosch has carefully analyzed several applications in the automotive embedded domain and has been able to extract useful trends (e.g., the variation in the number of runnables between projects, anticipated increase in the complexity of software in the future). All synthetic benchmarks are generated by combining the available statistical as well as trend information, see [11].

Figure 24: Specification of the automotive embedded application in Timing-Architects

4. Performance Characteristics of Each Function: The Timing-Architects simulator also requires detailed timing observations related to each function used in the software system. In a practical system, each function takes a variable amount of time to execute, depending upon inputs, contention for resources, etc. It is therefore common to specify the best-case execution time (BCET), the worst-case execution time (WCET) and the mean execution time, along with the statistical distribution which describes the observed variations in the runtimes of each function (e.g., a Weibull distribution), see Figure 25 and the sketch after this list.

5. Task-to-Core Mapping: The information on how tasks are mapped to cores is to be provided. All functions contained in a task execute on the same core, as specified in the mapping information.
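For illustration, the fragment below shows one plausible way such a timing specification could be turned into per-instance execution times in a simulator: draw a variation from a Weibull distribution and clamp the result to the [BCET, WCET] interval. This is a hypothetical sketch of the idea, not Timing-Architects code, and all parameter values are made up.

```cpp
#include <algorithm>
#include <random>

// Draws one execution time for a function whose timing is specified as
// BCET/WCET plus a Weibull-distributed variation (hypothetical scheme).
double sampleExecTime(std::mt19937& gen,
                      double bcet, double wcet,
                      double shape, double scale) {
    std::weibull_distribution<double> variation(shape, scale);
    double t = bcet + variation(gen);  // variation added on top of best case
    return std::min(t, wcet);          // clamp: never exceed the worst case
}

// Example usage with made-up numbers (times in microseconds):
//   std::mt19937 gen(42);  // constant seed for reproducible simulations
//   double t = sampleExecTime(gen, 20.0, 80.0, 1.5, 10.0);
```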

Figure 25: Specification of the execution parameters of each function in Timing-Architects

4.3.2 Format of Results Produced by Timing-Architects

The Timing-Architects simulator produces a detailed, time-ordered trace of events such as cache misses, task preemptions and core utilization, to name a few. The traces are produced in the open-source Best Trace Format (BTF) proposed by the Eclipse Automotive Working Group, see [6, 7]. A snapshot of a BTF trace (v2.2.0) is shown in Figure 26.

Figure 26: A sample of the Timing-Architects output trace (see the BTF trace specification for more details)

In addition, Bosch has also provided helper (bash shell) scripts which operate on a given BTF file in order to extract features of interest. A brief description of the helper scripts is as follows:

1. list-tasks-systemwide: Lists all tasks that execute in the simulated system. Input: BTF trace file.
