Instruction Set Architecture Extensions for a Dynamic Task Scheduling Unit


Oliver Arnold, Benedikt Noethen, and Gerhard Fettweis
Vodafone Chair Mobile Communications Systems
Dresden University of Technology (TU Dresden)
Dresden, Germany
{oliver.arnold, benedikt.noethen, fettweis}@ifn.et.tu-dresden.de

Abstract

In this paper a heterogeneous Multiprocessor System-on-Chip (MPSoC) is controlled by a dynamic task scheduling unit called CoreManager. The instruction set architecture of this unit is extended to improve performance for dynamic data dependency checking, task scheduling, processing element (PE) allocation and data transfer management. In order to analyze and compare different implementations and trade-offs a tool flow was developed. Area and timing results are provided as well. A significant performance improvement can be shown for all parts of the CoreManager.

Keywords: dynamic task scheduling; heterogeneous MPSoC; instruction set extension

I. INTRODUCTION

One promising approach to ensure high parallelism in future embedded systems is task level parallelism, where multiple instructions are bundled into a task. An example is the CellSs programming model [13] running on the Cell Broadband Engine [5]. It integrates up to eight processing cores and one PowerPC, which is used for task scheduling.

There are two concepts to schedule tasks to the available processing elements (PE): static and dynamic scheduling. Static task schedulers are used especially for embedded systems, where power consumption and performance are crucial. In this case the set of applications running on the hardware is limited and the overall application requirements and tasks are known. Characterizing the applications is the main challenge of this approach, in particular for mobile handsets, where more and more applications and standards have to be supported. A full characterization is infeasible there, especially if several applications run in parallel.
In such scenarios, dynamic task scheduling is better suited since a complete characterization is not required. A dedicated core is in charge of scheduling the tasks to the available PEs. The schedule is built at runtime after a task data dependency analysis. The dynamic task scheduler can be implemented in hardware as an accelerator or in software running on a general purpose core. The hardware implementation is characterized by a very short scheduling time of less than 100 cycles for one task [6]. It has a deterministic scheduling time but is neither configurable nor extensible. Only one application can be executed at a time, and priorities cannot be specified. In [7] a software approach of the scheduler was presented and its performance impact was analyzed. It was shown that the scheduler needs more than 1000 cycles on average due to the limited performance of the core. Checking the data dependencies at runtime is the most time consuming part of the software approach [7].

Instruction set extensions of standard processors are used in many areas [1], e.g., in the field of security [2] and network applications [3]. In contrast to these works we analyze the critical sections of a software based dynamic task scheduler in more detail and define new instruction extensions to increase the overall scheduler performance. An ASIP approach that provides OS support on MPSoCs has been presented in [12]. Central hardware units in charge of scheduling are shown in [6] and [14].

The remainder of the paper is organized as follows: Section II presents the hardware system, the programming model and the tool flow. Section III introduces the CoreManager's instruction set architecture extensions. Section IV presents benchmarks and experimental results.

This work was supported by the German Federal Ministry of Education and Research (BMBF) as part of the CoolBaseStations project under grant 13N

II. SYSTEM MODEL

A. Hardware

A heterogeneous Multiprocessor System-on-Chip (MPSoC) is shown in Fig. 1. It consists of several blocks connected by a Network-on-Chip (NoC). For this purpose each block has a dedicated router (R). A router is connected to its neighbors by point-to-point data links. The routers are responsible for packet scheduling and arbitration. XY routing is applied. Further details about the integrated NoC can be found in [8].

Several types of blocks can be distinguished. Three global memory ports are available (MEM0, MEM1 and MEM2). They provide the connection to the off-chip SDRAMs. The application processor (APP) hosts the operating system and executes the sequential part of an application. The data plane of the MPSoC consists of eleven processing elements (PE): four digital signal processors (DSP), five general purpose (GP) cores and two application specific instruction set processors (ASIP). The CoreManager (CM) controls the data plane of the MPSoC. It is responsible for dynamic data dependency checking, task scheduling, PE allocation and data transfer management. Furthermore, it is responsible for the power management of the platform: it determines when each PE is powered on and at which frequency it runs.

A more detailed view of each block is shown in Fig. 2. In Fig. 2a) a PE is connected to a data and an instruction memory. Furthermore, the CoreManager Spin-Off (CM_SO) is integrated. It contains a task FIFO, so that up to four tasks can be scheduled on a PE. The CM_SO is responsible for IN and OUT data transfers. Data transfers and task execution can be performed simultaneously, which makes explicit prefetching of data possible. Nevertheless, the CoreManager is responsible for the configuration of the CM_SO. E.g., the CoreManager determines the mapping of data in the local memories. In this approach a PE operates solely on its local memory, so no cache misses occur. Task execution time is therefore deterministic, which leads to a better predictability on system level. Prefetching of data is possible for the next two tasks, but must be explicitly annotated by the CoreManager.

The application processor is formed by a Tensilica 570t as shown in Fig. 2b). It has 2-way set associative instruction and data caches, each 16 Kbyte in size. In the system model it is placed next to an off-chip memory interface for fast data access.

In Fig. 2c) the CoreManager and its subcomponents are shown. Similar to the PEs, the CoreManager works solely on local on-chip memories. Instruction and data memory size is 32 Kbyte each. The CoreManager Transfer Unit (CM_TU) is available for data transfers between the CoreManager's local memories and any other address in the system. Timers and FIFO memories are available as well. The DebugUnit can be used for online and offline debugging. E.g., it traces the internal states and the dynamic decisions of the CoreManager.

Initialization of the platform is as follows: in a first step the application processor is booted from global memory.
After the boot process the application processor copies the CoreManager binary to the local memory of the CoreManager. The CoreManager can boot itself as soon as a trigger is set by the application processor. PEs are dynamically booted by the CoreManager. For this purpose boot code is available for each PE type.

Figure 1. System model (blocks: MEM0, MEM1, MEM2, APP, CM, PE_DSP0 to PE_DSP3, PE_GP0 to PE_GP4, PE_ASIP0, PE_ASIP1)

Figure 2. Selected platform components: a) PE subsystem, b) application processor subsystem, c) CoreManager subsystem

B. Programming Model

A task based programming model is used for the development of a parallel application [6]. It is independent of the underlying hardware. Thus, applications are fully portable as long as each task can be executed on at least one PE. A task is a collection of instructions. For each task the input and output data arrays are specified at runtime. E.g., in a software defined radio system the data locations of a task are known only after the header has been processed. No static data analysis is possible for these kinds of applications.

A simple example is shown in Fig. 3. It is executed on the application processor (APP). The header is evaluated and the task description (task type and data arrays) is transferred to the CoreManager. In this example two task descriptions are transferred, either tasktype0 and tasktype1 or tasktype0 and tasktype2. In the next step the CoreManager checks the data dependencies between the tasks. If a data dependency is present, the task is delayed until its predecessor tasks are finished. As soon as all dependencies are resolved, a task can be scheduled on a suitable PE. For this reason preferred and possible PE types are annotated for each task type. An as soon as possible (ASAP) list based scheduling approach is used. Local memory of the PE must be allocated as well.
Within this step, data locality is increased by using the on-chip local memories as explicit memory buffers. The necessary information is available within the CoreManager after the data dependency checking stage. The CoreManager configures the CM_SO of the selected PE, which then carries out the following steps: If the PE is not ready, it is booted. Simultaneously, the necessary instructions and data of the task are fetched. Concurrently to the task execution, data can be fetched for the next task. After a task is finished, its output data is transferred to its destination. As soon as a task is finished, data dependencies can be resolved by the CoreManager.
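The selection of a suitable PE from the preferred and possible annotations can be illustrated with a minimal C sketch. It is not taken from the paper: it assumes that preferred, possible and currently free PEs are each encoded as a 16-bit mask with one bit per PE, that a preferred PE is always also a possible one, and that the lowest-numbered matching PE wins. The names `pick_pe`, `lowest_set_bit`, `preferred`, `possible` and `free_pes` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: index of the lowest set bit, or -1 if mask is 0.
 * Bit i of a mask refers to PE i. */
static int lowest_set_bit(uint16_t mask)
{
    for (int i = 0; i < 16; i++) {
        if (mask & (1u << i)) {
            return i;
        }
    }
    return -1;
}

/* Returns the index (0..15) of the chosen PE, or -1 if no suitable PE
 * is free and the task has to wait. */
int pick_pe(uint16_t preferred, uint16_t possible, uint16_t free_pes)
{
    /* Assumption of this sketch: preferred PEs are a subset of possible PEs. */
    assert((preferred & ~possible) == 0);

    uint16_t candidates = preferred & free_pes;   /* first choice */
    if (candidates == 0) {
        candidates = possible & free_pes;         /* fallback */
    }
    return lowest_set_bit(candidates);
}
```

With this encoding a task that prefers PE 2 but may also run on PE 3 is placed on PE 3 only when PE 2 is busy, and is delayed when neither is free.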

    task( tasktype0, IN( ptr0, size0 ), IN( ptr1, size1 ), OUT( ptr2, size2 ) );
    if ( header == 0x143 )
        task( tasktype1, IN( ptr2, size2/2 ), IN( ptr0, size0 ), OUT( ptr3, size3 ) );
    else
        task( tasktype2, IN( ptr3, size3 ), IN( ptr2, size2 ), OUT( ptr4, size4 ) );

Figure 3. Example of the task programming model

C. Tool Flow

The tool flow is shown in Fig. 4. A C/C++ application is developed and compiled for the Tensilica 570t processor. It contains task calls as shown in Fig. 3. The CoreManager and PE specifications define the hardware capabilities of the CoreManager and of all PE types, respectively. The integration, placement and connections of all cores are specified in the platform specification. RTL code can be generated by the Tensilica Xtensa Processor Generator (XPG) [9]. Suitable compilers are generated as well. Thus, CoreManager and PE binaries can be generated. The CoreManager's source code is adapted to the available hardware configuration. The CoreManager binary as well as the PE binaries are linked into the application processor binary. A cycle accurate simulation is available with the Tensilica XTSC simulation environment.

For post processing and further analysis the TaskVisualizer and the DebugVisualizer are available. The TaskVisualizer is taken from [11]; its frontend is adapted to the system used in this work. The DebugVisualizer is newly developed and allows a deep insight into the dynamic behavior of the CoreManager. For this purpose debug messages are written cyclically to a 32 MByte region of the main memory. Each debug message has the following format: { <time stamp>, <debug opcode>, <data> }. As soon as the CoreManager writes the debug opcode and the data to a 32-bit register in the DebugUnit, the time stamp is attached. Afterwards, the whole debug message is written to main memory. The DebugVisualizer analyzes these messages and checks the correct behavior of the CoreManager. Furthermore, all states of each task can be visualized.
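The cyclic storage of debug messages can be sketched as a ring buffer in C. The field widths are assumptions: the text only states that debug opcode and data share a 32-bit register and that a time stamp is attached. The 8/24-bit split, the 32-bit time stamp and the names `debug_msg`, `debug_ring` and `debug_write` are hypothetical, and a four-entry buffer stands in for the 32 MByte region.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEBUG_RING_ENTRIES 4u   /* tiny stand-in for the 32 MByte region */

typedef struct {
    uint32_t time_stamp;   /* attached when the message is written */
    uint32_t op_and_data;  /* assumed layout: opcode in bits 31..24, data in bits 23..0 */
} debug_msg;

typedef struct {
    debug_msg entries[DEBUG_RING_ENTRIES];
    size_t    next;        /* index of the slot that is written next */
} debug_ring;

/* Stores one message and wraps around at the end of the region,
 * overwriting the oldest entry. */
void debug_write(debug_ring *ring, uint32_t now, uint8_t opcode, uint32_t data)
{
    assert(ring != NULL);
    debug_msg *slot = &ring->entries[ring->next];
    slot->time_stamp  = now;
    slot->op_and_data = ((uint32_t)opcode << 24) | (data & 0x00FFFFFFu);
    ring->next = (ring->next + 1u) % DEBUG_RING_ENTRIES;
}
```

Because old entries are overwritten, a post-processing tool reading such a region would see only the most recent messages, ordered by their time stamps.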
Figure 4. Tool flow

III. COREMANAGER INSTRUCTION SET EXTENSIONS

In this section the instruction set extension of the dynamic task scheduling unit, called CoreManager, is described. To this end, the execution of the CoreManager is profiled. Each part of the CoreManager is analyzed, and the most time consuming parts are accelerated. The Tensilica tool chain is used to implement very long instruction words (VLIW) as well as single instruction multiple data (SIMD) operations [9].

For comparison a basic LX4 core is used as a reference implementation. This core is reasonably configured with functional units. E.g., a full-adder and a multiplier are available. A plain C version of the CoreManager software runs on it. It is taken from [7], where a further analysis of the runtime performance and scalability on an ARM926 can be found.

In the first step VLIW is used to group instructions for parallel execution. This step is compiler assisted. Furthermore, new instructions can be specified to improve system performance. Examples are SIMD operations, which execute one instruction on multiple data words in parallel. These new instructions are specified in a Verilog-like language. If two types of implementations satisfy all requirements, the most generic one is used.

TABLE I. NEWLY INTRODUCED INSTRUCTIONS

Instruction          Explanation
ADD3                 Adds three integers
LZ                   Count leading zeros
XOR_LZ               1. XOR, 2. count leading zeros
AND_LZ               1. AND, 2. count leading zeros
OR_LZ                1. OR, 2. count leading zeros
NEG_AND_LZ           1. Negate first argument, 2. AND, 3. count leading zeros
LoadDepCheck         1. Loads one data transfer of 64 bit, 2. increments data pointer by 8
DepCheck_SIMD_1      Dependency checking with one and one state64
DepCheck_SIMD_LD2    Dependency checking with two states and one
DepCheck_SIMD_LD4    Dependency checking with two states and two
GetDepCheckResults   Returns the last depcheck results; dependencies are marked with a dedicated bit for each transfer comparison
GetPE                Performs a PE allocation for 16 possible and 16 preferred PEs; PE annotation is bitwise
GetPePos             Returns an available taskpos on a PE; increases data locality in the case a successor task is executed on the same PE
RemoveTransfers      Removes unnecessary transfers
ADD_SHIFT_LEFT       1. ADD, 2. shift left by n bits
ADD_SHIFT_RIGHT      1. ADD, 2. shift right by n bits
SHIFT_1_XOR          res = (1<<in0)^in1
SHIFT_LEFT_OR        res = (res<<in0)|in1
SHIFT_RIGHT_OR       res = (res>>in0)|in1
SHIFT_LEFT_XOR       res = (res<<in0)^in1
SHIFT_RIGHT_XOR      res = (res>>in0)^in1
MASK_SHIFT_AND       res = (~(in0<<in1))&in2

Table I gives an overview of all newly introduced instructions. Load and store instructions are not shown. Several internal states and bit registers are available. The asm_ prefix of each instruction name is omitted.

In Fig. 5 the evolution of the dependency checking instruction is presented. The first line shows the C version (1). Two memory regions are compared. The first region is formed by the pointer p0 and the size s0, the second region by p1 and s1. Two subtractions, two compares and one OR operation are necessary. These instructions can be merged into one asm_depcheck instruction (2).
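The C version of step (1) can be written out as a small function. This is a sketch under stated assumptions rather than the paper's code: addresses and sizes are taken to be 32 bit, regions are half-open intervals [p, p + s) that do not wrap around the address space, and the name `dep_check` is hypothetical. The unsigned subtraction wraps to a very large value whenever the first pointer lies below the second, so one compare per direction covers both the lower and the upper bound.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if [p0, p0 + s0) and [p1, p1 + s1) share at least one byte. */
int dep_check(uint32_t p0, uint32_t s0, uint32_t p1, uint32_t s1)
{
    /* Assumption of this sketch: neither region wraps around 2^32. */
    assert(s0 <= UINT32_MAX - p0);
    assert(s1 <= UINT32_MAX - p1);

    /* Either p0 lies inside region 1, or p1 lies inside region 0:
     * two subtractions, two compares and one OR. */
    return ((uint32_t)(p0 - p1) < s1) || ((uint32_t)(p1 - p0) < s0);
}
```

Two regions that merely touch, such as [0, 8) and [8, 16), are reported as independent.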
Afterwards, the load of the arguments can be accelerated by applying a 64-bit data bus and 64-bit registers (3). This halves the number of memory loads. In the next step SIMD can be applied. Instead of one comparison of two transfers, four comparisons are done in parallel (4). For this purpose four transfers have to be loaded. By applying explicit load instructions and dedicated internal states, data loads can be reduced. Thus, data locality is increased and the number of register loads decreases (5). Furthermore, the depchecksimd_ld4 instruction is able to identify false dependencies in the case of read-read transfers.

IV. RESULTS

A. Benchmarks

A Global System for Mobile Communication (GSM) physical layer implementation is used to evaluate the performance of the CoreManager. The GSM benchmark consists of a receiving and a transmitting part. For each signal processing step a dedicated task type is available. These are, e.g., channel encoding/decoding, interleaving, ciphering, burst formatting, modulation and demodulation. Additionally, an additive white Gaussian noise (AWGN) channel is integrated. Channel coding is done by applying a convolutional encoder. Gaussian minimum shift keying (GMSK) is used for modulation. Ciphering uses the A5/1 algorithm. Key generation is not considered.

The most time consuming part of the CoreManager is the runtime data dependency checking. Therefore, a synthetic benchmark was implemented which solely exercises the CoreManager's initialization and data dependency checking stage. The task window size, which determines the maximum number of tasks in the system, as well as the number of input and output data transfers can be varied.
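To illustrate why the cost of the dependency checking stage depends on both parameters, the following C sketch checks the transfers of one new task against every transfer already held in the task window. It is a scalar reference model, not the paper's implementation: the `transfer` struct, the `is_output` flag, the treatment of read-read overlaps as non-dependencies and the result bitmask with one bit per queued transfer are assumptions made for illustration, and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint32_t ptr;        /* start address of the data array */
    uint32_t size;       /* length in bytes */
    int      is_output;  /* 1 for OUT transfers, 0 for IN transfers */
} transfer;

/* Overlap test for two half-open regions, using unsigned wrap-around. */
static int overlaps(const transfer *a, const transfer *b)
{
    return ((uint32_t)(a->ptr - b->ptr) < b->size) ||
           ((uint32_t)(b->ptr - a->ptr) < a->size);
}

/* Compares every transfer of the new task with every queued transfer.
 * Bit i of the result is set if queued transfer i conflicts with the
 * new task. The work is n_new * n_queued comparisons. */
uint32_t window_dep_mask(const transfer *new_task, size_t n_new,
                         const transfer *queued, size_t n_queued)
{
    assert(n_queued <= 32);  /* one result bit per queued transfer */
    uint32_t mask = 0;
    for (size_t i = 0; i < n_queued; i++) {
        for (size_t j = 0; j < n_new; j++) {
            /* Two reads of the same data do not order the tasks. */
            int both_reads = !queued[i].is_output && !new_task[j].is_output;
            if (!both_reads && overlaps(&queued[i], &new_task[j])) {
                mask |= (uint32_t)1 << i;
            }
        }
    }
    return mask;
}
```

The nested loop makes the quadratic growth visible: doubling either the number of queued transfers or the number of transfers per task doubles the number of comparisons, which is the work that four-way SIMD comparisons reduce.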

    (1)  ((unsigned)(p0 - p1) < s1) || ((unsigned)(p1 - p0) < s0)
           + Merge instructions
    (2)  asm_depcheck( p0, s0, p1, s1 )
           + 64 bit Regs ( R_X = {pX, sX} )
    (3)  asm_depcheck( R_0, R_1 )
           + SIMD (4 comparisons)
    (4)  asm_depchecksimd4( R_0, R_1, R_2, R_3 )
           + Explicit load instructions
    (5)  asm_depchecksimd_LD4( R_2, R_3 )

Figure 5. Dependency checking instruction evolution

B. Performance

In Fig. 6 the processing time of the CoreManager is shown on component level for the GSM benchmark. Odd bars represent the plain C version; even bars belong to the VLIW+SIMD execution. For each execution the minimum, average and maximum processing time are given. A decrease in processing time can be observed as soon as VLIW+SIMD is applied. The most time consuming part for this benchmark is the dynamic data dependency checking. Furthermore, it can be seen that some components have a fixed execution time, e.g., the PE allocation and the task scheduling.

In Fig. 7 to Fig. 9 the scalability of the data dependency stage is analyzed by varying the number of data transfers and the number of tasks already in the task queue of the CoreManager. Results are shown for the plain C, VLIW, SIMD and VLIW+SIMD versions of the CoreManager. In Fig. 7 and Fig. 8 it is assumed that three and 15 tasks, respectively, are already in the task queue. The number of data transfers is changed for all tasks. The processing time is greatly reduced as soon as SIMD is applied. A minor improvement can be observed for the VLIW versions. In Fig. 9 the number of data transfers is set to a fixed value of four. The number of tasks in the queue is varied between one and 15. As in the example above, a major reduction in processing time can be observed for the SIMD version. Overall a reduction of up to 97 % is achieved.

Figure 6. CoreManager component comparison between plain C and SIMD+VLIW implementation

Figure 7. Processing time of the initialization and dynamic data dependency checking stage for 3 tasks in the task queue

Figure 8. Processing time of the initialization and dynamic data dependency checking stage for 15 tasks in the task queue

Figure 9. Processing time of the initialization and dynamic data dependency checking stage for 4 transfers per task

TABLE II. AREA AND TIMING COMPARISON

              Plain-C   VLIW   SIMD   VLIW+SIMD
Area (mm2)
f (MHz)

C. Area and Timing

In Table II area and frequency are shown for the CoreManager. All cores have been synthesized with Synopsys Design Compiler for a 65 nm low power process from TSMC using worst case conditions. Only logic area is evaluated. For timing correctness the interfaces to the local memories are integrated. The local memories themselves are not included in the area. For a fair comparison synthesis was done for a target frequency of 333 MHz. An overall area increase of 98 % can be observed.

V. CONCLUSIONS AND OUTLOOK

In this paper a central scheduling unit, called CoreManager, was improved with a newly introduced instruction set architecture extension. It allows a faster processing of the dynamic data dependency checking, task scheduling, PE allocation and data transfer management. VLIW as well as SIMD is applied. The obtained results show an improvement of up to 97 % for the dynamic data dependency checking stage. Furthermore, all other stages are accelerated as well.

Future work aims at implementing a silicon prototype of the CoreManager in a heterogeneous MPSoC, including several types of processing elements as well as IO interfaces. Further optimizations of the architecture and algorithms will be investigated, especially with regard to performance, area and power consumption.
REFERENCES

[1] Wang, A.; Killian, E.; Maydan, D.; Rowen, C., "Hardware/software instruction set configurability for system-on-chip processors," Design Automation Conference, Proceedings.
[2] Potlapally, N.R.; Ravi, S.; Raghunathan, A.; Lee, R.B.; Jha, N.K., "Configuration and Extension of Embedded Processors to Optimize IPSec Protocol Execution," Very Large Scale Integration (VLSI) Systems, IEEE Transactions on, vol. 15, no. 5, May.
[3] Chormoviti, A.; Vassiliadis, N.; Theodoridis, G.; Nikolaidis, S., "Enhancing Embedded Processors with Specific Instruction Set Extensions for Network Applications," Intelligent Data Acquisition and Advanced Computing Systems: Technology and Applications, IDAACS, IEEE, 5-7 Sept.
[4] K. Asanovic et al., "The landscape of parallel computing research: a view from Berkeley," Electrical Engineering and Computer Sciences, University of California, Berkeley, Long Beach, CA, USA, Tech. Rep., Dec.
[5] Johns, C.R.; Brokenshire, D.A., "Introduction to the Cell Broadband Engine Architecture," IBM Journal of Research and Development, vol. 51, no. 5, Sept.
[6] T. Limberg, M. Winter, M. Bimberg, R. Klemm et al., "A Heterogeneous MPSoC with Hardware Supported Dynamic Task Scheduling for Software Defined Radio," DAC/ISSCC Student Design Contest.
[7] O. Arnold and G. Fettweis, "On the Impact of Dynamic Task Scheduling in Heterogeneous MPSoCs," Embedded Computer Systems (SAMOS), 2011 International Conference on, pp. 17-24, July.
[8] M. Winter and G. Fettweis, "Guaranteed Service Virtual Channel Allocation in NoCs for Run-time Task Scheduling," in Proceedings of the Design Automation and Test in Europe (DATE'11), Grenoble, France, March.
[9] March
[10] March
[11] O. Arnold and G. Fettweis, "Power Aware Heterogeneous MPSoC with Dynamic Task Scheduling and Increased Data Locality for Multiple Applications," Embedded Computer Systems (SAMOS), 2010 International Conference on, July.
[12] J. Castrillon, D. Zhang, T. Kempf, B. Vanthournout, R. Leupers, and G. Ascheid, "Task Management in MPSoCs: An ASIP Approach," International Conference on Computer-Aided Design.
[13] Bellens, P.; Perez, J.M.; Badia, R.M.; Labarta, J., "CellSs: a Programming Model for the Cell BE Architecture," in SC 06, Proceedings of the Supercomputing conference.
[14] J. Lee, V. J. Mooney III, A. Daleby, K. Ingström, T. Klevin, and L. Lindh, "A comparison of the RTU hardware RTOS with a hardware/software RTOS," in ASP-DAC '03, Proceedings of the Asia and South Pacific Design Automation Conference, 2003.


More information

Towards Optimal Custom Instruction Processors

Towards Optimal Custom Instruction Processors Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT CHIPS 18 Overview 1. background: extensible processors

More information

Instruction Encoding Synthesis For Architecture Exploration

Instruction Encoding Synthesis For Architecture Exploration Instruction Encoding Synthesis For Architecture Exploration "Compiler Optimizations for Code Density of Variable Length Instructions", "Heuristics for Greedy Transport Triggered Architecture Interconnect

More information

Design of Synchronous NoC Router for System-on-Chip Communication and Implement in FPGA using VHDL

Design of Synchronous NoC Router for System-on-Chip Communication and Implement in FPGA using VHDL Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.

More information

FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow

FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow FCUDA-NoC: A Scalable and Efficient Network-on-Chip Implementation for the CUDA-to-FPGA Flow Abstract: High-level synthesis (HLS) of data-parallel input languages, such as the Compute Unified Device Architecture

More information

A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning

A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning By: Roman Lysecky and Frank Vahid Presented By: Anton Kiriwas Disclaimer This specific

More information

Configurable Processors for SOC Design. Contents crafted by Technology Evangelist Steve Leibson Tensilica, Inc.

Configurable Processors for SOC Design. Contents crafted by Technology Evangelist Steve Leibson Tensilica, Inc. Configurable s for SOC Design Contents crafted by Technology Evangelist Steve Leibson Tensilica, Inc. Why Listen to This Presentation? Understand how SOC design techniques, now nearly 20 years old, are

More information
