Pareto-Based Application Specification for MP-SoC Customized Run-Time Management


Ch. Ykman-Couvreur 1, V. Nollet 1, Th. Marescaux 1, E. Brockmeyer 1, Fr. Catthoor 1,2, H. Corporaal 3
1 IMEC V.Z.W., Kapeldreef 75, 3001 Leuven, Belgium
2 Also prof. at Katholieke Univ. Leuven, Belgium
3 Prof. at Technical Univ. Eindhoven, The Netherlands

Abstract

In an MP-SoC environment, a customized run-time management should be incorporated on top of the basic OS services to globally optimize costs (e.g. energy consumption) across all active applications, according to constraints (e.g. performance, user requirements) and available platform resources. To that end, we have proposed a Pareto-based approach combining a design-time application mapping and platform exploration with a low-complexity run-time manager. This relieves the OS in its run-time decision making and avoids conservative worst-case assumptions. In this paper, we focus on the characterization of the Pareto-based application specification, resulting from our design-time exploration. This specification is essential as input for our run-time manager. A representative video codec multimedia application, simulated on our MP-SoC platform simulator, is used as case study. For the resulting Pareto-based specification, both binary size and performance overhead are negligible.

Fig. 1. MP-SoC environment. Platform aspect: distributed PEs interconnected by a NoC. Application aspect: dynamic set of applications (e.g. multimedia). Non-functional aspect: low power, real-time behavior, small memory footprint.

I. INTRODUCTION

In a Multi-Processor System-on-Chip (MP-SoC) environment, an ideal Operating System (OS), also called run-time management layer, should efficiently combine all application, platform, and non-functional aspects (Fig. 1). First, the OS should enable a dynamic set of multimedia applications (e.g. video messaging, web browsing, video conferencing), 3D games, and many other compute-intensive tasks [1].
These applications are becoming more heterogeneous, dynamic, and data intensive. When running them on mobile devices, which are typically battery-powered, energy consumption is a major design issue. The OS also has to fulfill the Quality-of-Service (QoS) requirements of the user (e.g. reliability, performance, and video quality).

Second, the OS has to support platforms [2] (e.g. TI OMAP and ST Nomadik) which consist of a large number of heterogeneous Processing Elements (PEs). These platforms combine the advantages of parallel computing of multiple processors with single-chip integration of SoCs. They provide high computational performance at a low energy cost, while typical embedded systems (e.g. handheld devices such as PDAs and smartphones) are limited by the restricted amount of processing power and memory. Since the application complexity is growing, the major challenges are the right parallelization of these applications and their efficient mapping on the MP-SoC platform.

Third, growing SoC complexity makes communication subsystem design as important as computation subsystem design [3], [4]. To provide reliable and scalable communication [5], a flexible interconnect Network-on-Chip (NoC) must be adopted. Designing such a NoC becomes another major task for future MP-SoCs.

Finally, for memory-intensive applications such as multimedia applications, the memory subsystem represents an important component in the overall energy cost. In the memory subsystem, ScratchPad Memories (SPMs) are used [6], [7], [8], since they perform better than caches in terms of energy per access, performance, on-chip area, and predictability. However, unlike caches, SPMs require complex design-time application analysis, to carefully decide which data to assign to the SPM, as well as software allocation techniques.
To relieve the OS in its run-time decision making, and to avoid conservative worst-case assumptions, we have proposed a customized run-time management [9] to map the applications on the platform. It consists of two phases. First, a design-time mapping and platform exploration per application leads to a multi-dimensional Pareto set of optimal mappings. Each mapping is characterized by a code version together with an optimal combination of used platform resources, costs, and constraints. The different code versions refer to different parallelizations of the application into parallel tasks and to data transfers between SPMs and local memories. Second, a low-complexity run-time manager, incorporated on top of the basic OS services, maintains the high quality of the exploration. Whenever the environment is changing (e.g., when a new application/use case starts, or when the user requirements change), for each active application, our run-time manager reacts as follows:
1) It selects in a predictable way a mapping from its Pareto set, according to the available platform resources, in order to minimize the total energy consumption of the platform, while respecting all constraints.
2) It performs Pareto point switches (Fig. 2, restricted to two dimensions), i.e. it assigns the platform resources, it adapts the platform parameters, it loads the task binaries from the shared memory in the corresponding local memories, and it issues the execution of the code versions according to the newly selected Pareto points.

Fig. 2. Pareto point switch. Energy versus time for Applications A and B on processors Proc 0 to Proc 3, with mappings on 4 PEs and 1 PE, clocks Ck1 and Ck2, and the events "A starts", "B starts", "A stops", "B stops".

/06/$ IEEE

When Application A starts, it is assigned to three PEs with a slow clock (ck2). As soon as Application B starts, a Pareto point switch is needed to map A on only two PEs. By speeding up the clock (ck1), the application deadline is still met. After A stops, B can be spread over three PEs in order to reduce the energy consumption.

In [9], the design-time exploration phase of our approach, restricted to the usage of one processor, was presented. The main new contribution of this paper is the characterization of the Pareto-based application specification, which efficiently merges all code versions present in the Pareto set and results from our design-time exploration. This specification, to be stored into the MP-SoC platform, is essential as input for our run-time manager. A representative video codec multimedia application, simulated on our MP-SoC platform simulator, is used as case study to illustrate this application specification. The resulting binary size and performance overhead are negligible.

The remainder of this paper is organized as follows. Section II summarizes the related work in the MP-SoC domain. Section III presents our customized run-time management approach. Section IV introduces our case-study application and our platform simulator.
Section V characterizes the Pareto-based application specification used in our approach. It also describes the experiments performed on our case study. Conclusions and future work are given in Section VI.

II. RELATED WORK

In recent years, industrial MP-SoC components have been introduced by companies like Texas Instruments and ST Microelectronics. For embedded systems (limited by the number of PEs), Real-Time OSs (RTOSs) are focused on execution determinism, speed, and small memory footprint. Current OSs like the TI DSP/BIOS kernel, the Quadros RTXC RTOS, and the Enea Systems OSE RTOS are clearly focused on low-level run-time management (i.e., multiplexing the hardware and providing uniform communication primitives). They only provide an abstraction layer on top of the hardware, and they expand and link together existing technologies, but they are not designed for the emerging MP-SoC environment. Support for SPMs, NoCs, dynamic power management, and QoS-aware and application-specific run-time management is lacking. Hence none of these existing OSs represents the ideal glue layer for MP-SoCs. The user is supposed to implement his own run-time manager on top of the OS services.

State-of-the-art tools and design practice are also not yet in a shape to meet the needs presented in Section I. Currently in the academic world, two diverging strategies [10] are developed to cope with the design complexity of application-specific and heterogeneous MP-SoC platforms: either the IP-driven approach [11], [12], or the design-flow-driven approach. In these IP-driven approaches, any application is synthesized separately and synthesis has no integral view on the entire system on the MP-SoC platform. Related to the design-flow-driven approach, several global optimization issues are considered: application parallelization, task scheduling, communication management, and dynamic reconfiguration.
In this paper, we focus on task scheduling and dynamic reconfiguration, for which our approach offers trade-offs.

For MP-SoC platforms, task scheduling becomes more complicated [13], and its impact on the performance and energy consumption becomes more significant. It consists of: (1) mapping, i.e. determining the order in which the tasks are executed (temporal mapping) and on which processor each task must be executed (spatial mapping); and (2) Dynamic Voltage/Frequency Scaling (DVS/DFS), i.e. determining the processor supply voltage and clock frequency, if it is allowed. Energy consumption is increasingly an issue, and not only for battery-operated devices. Even if unlimited power is available, a large number of components tightly packed onto a chip poses cooling and reliability problems. An important way to reduce the energy consumption is to shut down or slow down functional components which are idle or underutilized, by combining DVS with Dynamic Power Management (DPM). Surveys of system-level design techniques for DVS and DPM can be found in [14] and [15], respectively. The most recent scheduling approaches combining application mapping with DVS can be found in [16], [17], [18].

To support the massive data traffic, run-time communication management is a challenging task, since inter-processor communications become responsible for significant execution time and energy consumption. Approaches combining application mapping with some communication management aspects can be found in [19], [20], [21], [22].

Related to dynamic reconfiguration, some aspects are currently considered. Multimedia applications are becoming more versatile, and dynamic applications with multiple use cases need to be supported. Switching from one use case to another one at run time involves changing the application task graph configuration [23], [24]. The platform also needs to support a wide range and dynamic set of applications. This requires an efficient run-time support for platform resource management, task relocation, and reconfiguration of inter-task communication [25].

Fig. 3. Pareto set generated by our design-time exploration. For each of Applications A and B, the design-time exploration yields refined application code (Version 1, Version 2, ...) and a Pareto set with dimensions energy, memory usage, PE usage, and others.

Fig. 4. Our MP-SoC run-time management. Low-complexity run-time layer: customized run-time manager and RTOS kernel, with constraints and platform information.

III. OVERVIEW OF OUR CUSTOMIZED RUN-TIME MANAGEMENT

To meet the needs presented in Section I, our approach proposes a customized run-time management to map the applications on the platform, consisting of two phases: (1) a design-time mapping and platform exploration per application; (2) a low-complexity run-time manager incorporated on top of the basic OS services. This run-time manager globally optimizes costs (e.g. energy consumption) across all active applications, according to constraints (e.g. performance, user requirements) and available platform resources. It also performs low-cost switches between possible mappings of a same application, as required by environment changes. A similar conceptual approach was already developed for scheduling concurrent tasks on embedded systems. This was intended to optimize only the energy consumption while respecting the application deadlines [18], [14].
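
As a hedged illustration of the global optimization described above, the following Python sketch picks one Pareto point per active application so that the total energy is minimal while the PE budget and every deadline are respected. It is only a sketch under stated assumptions: the exhaustive search is a simplification chosen for clarity and is not the low-complexity algorithm of the run-time manager, and the names `ParetoPoint`, `select_mappings`, `SET_A`, `SET_B`, as well as all numbers, are hypothetical.

```python
from itertools import product
from typing import Dict, List, NamedTuple, Optional


class ParetoPoint(NamedTuple):
    """One mapping from a design-time Pareto set (hypothetical fields)."""
    code_version: str
    energy: float     # cost to minimize
    exec_time: float  # must not exceed the application deadline
    num_pes: int      # processing elements used by this mapping


def select_mappings(
    pareto_sets: Dict[str, List[ParetoPoint]],
    deadlines: Dict[str, float],
    available_pes: int,
) -> Optional[Dict[str, ParetoPoint]]:
    """Return the feasible combination with minimal total energy, or None."""
    apps = list(pareto_sets)
    best = None
    best_energy = float("inf")
    # Exhaustive search: try every combination of one point per application.
    for combo in product(*(pareto_sets[app] for app in apps)):
        if sum(p.num_pes for p in combo) > available_pes:
            continue  # not enough processing elements
        if any(p.exec_time > deadlines[app] for app, p in zip(apps, combo)):
            continue  # at least one deadline would be missed
        total_energy = sum(p.energy for p in combo)
        if total_energy < best_energy:
            best, best_energy = dict(zip(apps, combo)), total_energy
    return best


# Illustrative numbers only (assumed, not taken from the paper).
SET_A = [ParetoPoint("A_3pe_slow", 6.0, 30.0, 3),
         ParetoPoint("A_2pe_fast", 9.0, 30.0, 2),
         ParetoPoint("A_1pe", 14.0, 55.0, 1)]
SET_B = [ParetoPoint("B_3pe", 5.0, 28.0, 3),
         ParetoPoint("B_2pe", 8.0, 35.0, 2),
         ParetoPoint("B_1pe", 12.0, 60.0, 1)]

# Only A is active, then A and B are active together on four PEs.
only_a = select_mappings({"A": SET_A}, {"A": 33.0}, available_pes=4)
a_and_b = select_mappings({"A": SET_A, "B": SET_B},
                          {"A": 33.0, "B": 40.0}, available_pes=4)
```

With these assumed numbers, A alone gets its low-energy three-PE point, and once B is also active the selection moves A to a two-PE point with a faster clock, which mirrors the kind of Pareto point switch described in Section I.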
In contrast to the conventional approaches that generate only one solution for each application, the first phase is a design-time application mapping and platform exploration. For each application, this exploration generates a set of optimal mappings in a multi-dimensional design space (Fig. 3), instead of a two-dimensional one. Current dimensions are costs (e.g. energy consumption), constraints (e.g. performance, user requirements), and used platform resources (e.g. memory usage, processors, communication bandwidth, clocks, and processor supply voltage if it is allowed). Only points that are better than each of the other ones in at least one dimension are retained. They are called Pareto points. The resulting set of Pareto points is called the Pareto set. This design-time exploration phase of our approach, restricted to the usage of one processor, was presented in [9].

Depending on the application constraints and on the availability of the platform resources, any one of these Pareto points, each representing an application mapping, may be the best one and can be selected by the run-time manager. Each Pareto point is also annotated with a code version. The different code versions refer to different parallelizations of the application into parallel tasks and to data transfers between SPMs and local memories. The main contribution of this paper is the characterization and merging of all these code versions, called the Pareto-based application specification. It is presented in Section V. Hence, for any application, our Pareto set is made up of optimal mappings, each characterized by a code version together with an optimal combination of used platform resources, costs, and constraints. The description of the data structures storing information related to this Pareto set and the Pareto points is out of the scope of this paper.

The full exploration is done at design time, whereas the critical decisions are taken during the second phase by a low-complexity run-time manager (Fig. 4).
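
The retention rule for Pareto points stated in this section can be sketched in Python as a standard dominance filter. This is a minimal sketch under assumptions, not the exploration tool itself: the `Mapping` fields, the choice of four dimensions that are all "lower is better", and the sample values are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Mapping(NamedTuple):
    """A candidate mapping; all metric fields are 'lower is better'."""
    code_version: str
    energy: float     # cost
    exec_time: float  # performance constraint dimension
    num_pes: int      # processors used
    memory: int       # local memory usage in bytes


def metrics(m: Mapping) -> Tuple[float, float, int, int]:
    """Extract the dimensions that take part in the comparison."""
    return (m.energy, m.exec_time, m.num_pes, m.memory)


def dominates(a: Mapping, b: Mapping) -> bool:
    """True if a is no worse than b everywhere and strictly better somewhere."""
    pairs = list(zip(metrics(a), metrics(b)))
    return all(x <= y for x, y in pairs) and any(x < y for x, y in pairs)


def pareto_set(candidates: List[Mapping]) -> List[Mapping]:
    """Keep only the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]


# Assumed sample values for illustration.
candidates = [
    Mapping("v1_1pe", 10.0, 40.0, 1, 4096),
    Mapping("v2_2pe", 12.0, 22.0, 2, 4096),
    Mapping("v3_2pe", 13.0, 25.0, 2, 8192),  # dominated by v2_2pe
    Mapping("v4_4pe", 15.0, 12.0, 4, 8192),
]
retained = pareto_set(candidates)
```

Here `v3_2pe` is dropped because `v2_2pe` is at least as good in every dimension, while the other three mappings each win in at least one dimension and are kept.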
The run-time manager provides the following services:

Whenever a new application is activated, our run-time manager parses its Pareto set provided by the design-time exploration and stores it in the shared memory of the MP-SoC platform, including all task binaries.

Whenever the environment is changing (e.g., when a new application/use case starts, or when the user requirements change), for each active application, our run-time manager reacts as follows. First, it selects in a predictable way a mapping from its Pareto set, according to the available platform resources, in order to minimize the total energy consumption of the platform, while respecting all constraints. Second, it performs Pareto point switches (Fig. 2, restricted to two dimensions), as explained in Section I.

The Pareto point switch technique bears some resemblance to dynamic reconfiguration. It can switch to other mappings, but, in contrast to dynamic reconfiguration, it involves more complex run-time trade-offs.

IV. DEMONSTRATOR

As driver application, an inter-frame compression technique for video images, called Quadtree Structured Difference Pulse Code Modulation (QSDPCM), is used [26]. It is representative of many of today's video codec multimedia applications. It involves a three-stage hierarchical Motion Estimation (ME4, ME2, and ME1), followed by a quadtree-based encoding of the motion-compensated frame-to-frame difference signal, a Quantization, and a Huffman-based Compression (QC). Two image resolutions are allowed: either QCIF, with an image size of 176*144 pixels, or VGA, with an image size of 640*480 pixels. In our experiments, the QCIF resolution is used.

The starting algorithm, expressed in C code, has two image frames (the previous and current ones) as input, and one bit stream as output. The code is already tuned for efficient data management and processing by: (1) minimizing the size of internal arrays; (2) optimizing the loop performance and achieving software pipelining. To preserve these optimizations in later code refinements, any optimized loop is encapsulated in a function, called a kernel in the remainder of this paper. The resulting algorithm is illustrated in Fig. 5(a), where each module is a loop manipulating two pixel blocks at each iteration (one from the current frame, and the other from the previous frame).
Our MP-SoC simulator assumes a platform composed of: (1) processor nodes with local memories and buses; (2) distributed shared memory nodes; (3) communication assists similar to Direct Memory Access (DMA) controllers, providing high-level services to processors and shared memories for efficient data transfers; (4) I/O nodes; (5) a communication architecture, being the Æthereal NoC [27]. The main platform parameters that can be explored at present are: the network clock, the maximum number of time slots, the number of routers, the processor clock and supply voltage if it is allowed, the memory clock, the communication bandwidth between a processor and a shared memory, the number of processors to be used by the application, the memory usage, and some QoS requirement (either guaranteed throughput, or best effort).

Fig. 5. QSDPCM application: (a) starting algorithm; (b) relevant parallelizations on 1, 2, 3, and k+2 processors (1 < k < 6), with ME1 tasks ME1_1 to ME1_k.

V. PARETO-BASED APPLICATION SPECIFICATION

From our design-time exploration, a multi-dimensional Pareto set of optimal mappings is generated for any application to be mapped on the MP-SoC platform. Each mapping is characterized by a code version together with an optimal combination of used platform resources, costs, and constraints. First, the structure of any standalone application code version is described in Section V-A. Then, the Pareto-based specification, merging all these codes, is characterized in Section V-B.

A. Standalone Code Version Structure

Any application code version present in the Pareto set refers to a parallelization of the application into parallel tasks and to data transfers between SPMs and local memories, both derived from the design-time exploration, as follows.

Parallelization exploration: Parallelizing an application can be done both at the functional and at the data level.
At the functional level, the algorithm is partitioned into smaller tasks, and synchronization requirements between them are identified to allow pipelined execution of these tasks. At the data level, the data processed by a task is partitioned. For instance, in video applications, images can be divided into blocks of rows, and any task parallelized at the data level deals with its own block of rows.

Block transfer exploration: To optimize both performance and energy consumption in the memory subsystem, parts of data arrays stored in the SPM are copied into the processor local memory, from where they are accessed multiple times [28]. These copy operations, also called Block Transfers (BTs), are performed through function calls in the application code, first to issue a BT, and next to synchronize its completion with processing. This allows BTs to be performed in parallel with processing and hence improves the application performance. This is illustrated in Fig. 6, where a BT into a copy cp prev frame is performed in parallel with a for loop processing. This reduces the waiting time for this BT completion and yields a performance gain of 16 cycles per iteration. Several efficient solutions, yielding different local memory usage and performance, exist for the copy sizes and for the places in the code where these BT calls are inserted.

Fig. 6. BT from SPM to processor local memory

Such a code version (Fig. 7) is made up of a task set. Each task is made up of a skeleton that glues together the kernel calls (Section IV), the BT calls, and task synchronization for parallelization.

Fig. 7. Application code version structure

Experiments: Related to the functional-level parallelization, the QSDPCM can be naturally partitioned into either three tasks (ME42, ME1, and QC), or two tasks (ME42 and ME1 merged with QC). To further alleviate the computation effort of ME1, the input frames can be divided into row blocks to parallelize ME1 at the data level. Up to five parallel ME1 tasks have been considered, beyond which no further performance gain is reached because of too large task synchronization and BT overhead. These QSDPCM parallelizations are illustrated in Fig. 5(b).

Related to the BTs, three arrays (storing the current image frame, the previous one, and some internal data required in ) are too large and must be stored in the SPM. Several efficient BT solutions are explored. Table I reports, for the task ME1, the resulting processor local memory usage for copies (difference up to a factor 2) and the binary size overhead for BT calls (up to 16% of the total ME1 binary). The current implementation of a BT issue (resp. sync) call costs about 378 (resp. 20) bytes in our MP-SoC platform simulator, which explains this important BT call size overhead. This needs to be optimized in our near-future work. Similar BT solutions are derived for the tasks QC and ME1 QC, whereas only one efficient BT solution is derived for ME42.

TABLE I. BLOCK TRANSFER IMPACT ON ME1 BINARY SIZE (BYTES). Columns: BT solution, size of copies, binary size of BT calls, total ME1 binary size (four BT solutions; values not recoverable).

Hence, considering all combinations of BT solutions in all tasks of any parallelized application gives rise to a huge number of different application code versions. A Pareto-based specification, merging all of them and allowing efficient loading of any task binary into the platform, is required. This specification is characterized in Section V-B.

B. Merging Code Versions

All code versions of a same application derived from the design-time exploration are merged into a generic one, called the Pareto-based specification. It is made up of:
1) A set of tasks, derived from the functional-level parallelization exploration of the application.
2) For each task:
a) The block of image rows, derived from the data-level parallelization exploration, and used as input argument of the task.
b) An extended task skeleton integrating all BT solutions, derived from the block transfer exploration.
3) For each BT solution, implementation details specifying the size of copies to be allocated in the processor local memory, and the BT calls to be executed in the task skeleton.

This Pareto-based specification is stored in the shared memory of the MP-SoC platform. However, only the required task binaries are loaded in the corresponding local memories during Pareto point switches, as explained in Section I.

This Pareto-based specification is illustrated on the QSDPCM to show that, for this application, both code size and performance overhead are negligible. To analyze the
energy consumption overhead, an energy model in our MP-SoC platform simulator is required. This is currently under investigation.

Experiments: From the QSDPCM parallelization exploration (Fig. 5(b)), four different tasks are considered: ME42, ME1, QC, and ME1 QC. Binary sizes for these tasks, integrating all BT solutions, are detailed in Fig. 8. They include the sizes of all needed kernels, the extended skeleton, and all implementation details. The kernels, which are independent of the standalone code versions, represent the major component of any task binary. The size of task synchronization and all BT calls, being part of the extended task skeleton, is also reported.

Fig. 8. Task binary sizes (bytes) in Pareto-based QSDPCM specification

The code size overhead of the Pareto-based specification is due to: (1) the size of implementation details, which is negligible; (2) the size overhead of the extended skeletons, due to the integration of all BT solutions. The size overhead for each task binary is detailed in Fig. 9(a). Merging a standalone code version in the Pareto-based specification yields less than 5% size overhead.

To analyze the performance overhead of the Pareto-based specification, the QSDPCM mapping on six processors (Fig. 5(b)) is simulated on our MP-SoC platform simulator, using both the standalone code version and the Pareto-based specification. The comparison of processing and BT waiting times is reported in Fig. 9(b). Less than 0.17% performance overhead can be observed on each processor.

Fig. 9. Comparison between standalone code versions and Pareto-based specification

VI. CONCLUSION

In this paper, we characterize the Pareto-based application specification, used as input for our run-time manager. This specification merges all code versions of a single application derived from the design-time exploration. It refers to different parallelizations of the application and to data transfers between SPMs and local memories. It is also illustrated on a video codec multimedia application, and simulated on our MP-SoC platform simulator. For this application, less than 5% binary size overhead per merged code version and less than 0.17% performance overhead are observed.

Our future work includes the optimization of the data transfer implementation in the NoC of our MP-SoC platform (to further reduce the binary size overhead), the run-time support integration to allow Pareto point switches at run time and the analysis of the resulting run-time overhead, an energy model in our MP-SoC platform simulator, and tests on other real-life applications.

REFERENCES

[1] P. Cumming, The TI OMAP platform approach to SoC. Kluwer Academic,
[2] W. Wolf, The future of multiprocessor systems-on-chips, in Proceedings of the Design Automation Conference, pp ,
[3] D. Bertozzi, A. Jalabert, M. Srinivasan, R. Tamhankar, S. Stergiou, L. Benini, and G. De Micheli, NoC synthesis flow for customized domain specific multiprocessor systems-on-chip, IEEE Trans. Parallel Distrib. Syst., vol. 16, pp , February
[4] S. Murali and G. De Micheli, Bandwidth-constrained mapping of cores onto NoC architectures, in Proceedings of the Conference on Design, Automation and Test in Europe, Paris, France, February
[5] L. Benini and G. De Micheli, Networks on chips: a new SoC paradigm, IEEE Computer, pp ,
[6] S. Mamagkakis, D. Atienza, C. Poucet, F. Catthoor, D. Soudris, and J. Mendias, Custom design of multi-level dynamic memory management subsystem for embedded systems, in Proceedings of the IEEE Workshop on Signal Processing Systems, October 2004, pp
[7] F. Poletti, P. Marchal, D. Atienza, L. Benini, F. Catthoor, and J. Mendias, An integrated hardware/software approach for run-time scratchpad management, in Proceedings of the Design Automation Conference, pp , June
[8] M. Verma, L. Wehmeyer, and P. Marwedel, Dynamic overlay of scratchpad memory for energy minimization, in Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis, 2004, pp
[9] C. Ykman-Couvreur, E. Brockmeyer, V. Nollet, T. Marescaux, F. Catthoor, and H. Corporaal, Design-time application exploration for MP-SoC customized run-time management, in Proceedings of the International Symposium on System-on-Chip, pp , November
[10] T. Kogel and H. Meyr, Heterogeneous MP-SoC - the solution to energy-efficient signal processing, in Proceedings of the Design Automation Conference, 2004, pp
[11] T. Henriksson, J. Kang, and P. van der Wolf, Implementation of dynamic streaming applications on heterogeneous multi-processor applications, in Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis, Jersey City, NJ, September 2005, pp
[12] S. Yoo, M. Youssef, A. Bouchhima, and A. Jerraya, Multi-processor SoC design methodology using a concept of two-layer hardware-dependent software, in Proceedings of the Conference on Design, Automation and Test in Europe, Paris, February
[13] Y. Cho, S. Yoo, K. Choi, N.-E. Zergainoh, and A. Jerraya, Scheduler implementation in MP SOC design, in Proceedings of the Asia South Pacific Design Automation Conference, Shanghai, China, January
[14] C. Ykman-Couvreur, F. Catthoor, J. Vounckx, A. Folens, and F. Louagie, Energy-aware dynamic task scheduling applied to a real-time multimedia application on an Xscale board, Journal of Low Power Electronics, vol. 1, pp , December
[15] L. Benini, A. Bogliolo, and G. De Micheli, A survey of design techniques for system-level dynamic power management, IEEE Trans. VLSI Syst., vol. 8, pp , June
[16] A. Andrei, M. Schmitz, P. Eles, Z. Peng, and B. Al-Hashimi, Overhead-conscious voltage selection for dynamic and leakage energy reduction of time-constrained systems, in Proceedings of the Conference on Design, Automation and Test in Europe, pp , February
[17] P. Schaumont, B.-C. C. Lai, W. Qin, and I. Verbauwhede, Cooperative multithreading on embedded multiprocessor architectures enables energy-scalable design, in Proceedings of the Design Automation Conference, pp , June
[18] P. Yang and F. Catthoor, Dynamic mapping and ordering tasks of embedded real-time systems on multiprocessor platforms, in Proceedings of the International Workshop on Software and Compilers for Embedded Systems, pp , September
[19] L. Smit, G. Smit, J. Hurink, H. Boersma, D. Paulusma, and P. Wolkotte, Run-time mapping of applications to a heterogeneous reconfigurable tiled system on chip architecture, in Proceedings of the International Symposium on System-on-Chip, November
[20] A. Hansson, K. Goossens, and A. Radulescu, A unified approach to constrained mapping and routing on Network-on-Chip architectures, in Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis, pp , September
[21] J. Hu and R. Marculescu, Communication and task scheduling of application-specific networks-on-chip, IEE Proceedings - Computers and Digital Techniques, vol. 152, pp , September
[22] O. Bringmann, A. Siebenborn, and W. Rosenstiel, Conflict analysis in multiprocess synthesis for optimized system integration, in Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis, pp , September
[23] J. Kang, T. Henriksson, and P. van der Wolf, An interface for the design and implementation of dynamic applications on multi-processor architectures, in Proceedings of the Workshop on Embedded Systems for Real-Time Multimedia, pp , September
[24] M. Rutten et al., Dynamic reconfiguration of streaming graphs on a heterogeneous multiprocessor architecture, in Proceedings of the IS&T/SPIE's Annual Symposium on Electronic Imaging: Multimedia Processing and Applications, pp , January
[25] V. Nollet, T. Marescaux, P. Avasare, J.-Y. Mignolet, and D. Verkest, Centralized run-time resource management in a network-on-chip containing reconfigurable hardware tiles, in Proceedings of the Conference on Design, Automation and Test in Europe, pp , March
[26] P. Strobach, QSDPCM - a new technique in scene adaptive coding, in Proceedings of the Eur. Signal Processing Conference, pp , September
[27] J. Dielissen, A. Radulescu, K. Goossens, and E. Rijpkema, Concepts and implementation of the Philips network-on-chip, in Proceedings of the IP-based SOC Design, November
[28] E. Brockmeyer, M. Miranda, H. Corporaal, and F. Catthoor, Layer assignment techniques for low energy in multi-layered memory organisations, in Proceedings of the Conference on Design, Automation and Test in Europe, pp ,


Optimization of Task Scheduling and Memory Partitioning for Multiprocessor System on Chip Optimization of Task Scheduling and Memory Partitioning for Multiprocessor System on Chip 1 Mythili.R, 2 Mugilan.D 1 PG Student, Department of Electronics and Communication K S Rangasamy College Of Technology,

More information

Resource Manager for Non-preemptive Heterogeneous Multiprocessor System-on-chip

Resource Manager for Non-preemptive Heterogeneous Multiprocessor System-on-chip Resource Manager for Non-preemptive Heterogeneous Multiprocessor System-on-chip Akash Kumar, Bart Mesman, Bart Theelen and Henk Corporaal Eindhoven University of Technology 5600MB Eindhoven, The Netherlands

More information

Design and Implementation of Buffer Loan Algorithm for BiNoC Router

Design and Implementation of Buffer Loan Algorithm for BiNoC Router Design and Implementation of Buffer Loan Algorithm for BiNoC Router Deepa S Dev Student, Department of Electronics and Communication, Sree Buddha College of Engineering, University of Kerala, Kerala, India

More information

SDR Forum Technical Conference 2007

SDR Forum Technical Conference 2007 THE APPLICATION OF A NOVEL ADAPTIVE DYNAMIC VOLTAGE SCALING SCHEME TO SOFTWARE DEFINED RADIO Craig Dolwin (Toshiba Research Europe Ltd, Bristol, UK, craig.dolwin@toshiba-trel.com) ABSTRACT This paper presents

More information

DATA REUSE DRIVEN MEMORY AND NETWORK-ON-CHIP CO-SYNTHESIS *

DATA REUSE DRIVEN MEMORY AND NETWORK-ON-CHIP CO-SYNTHESIS * DATA REUSE DRIVEN MEMORY AND NETWORK-ON-CHIP CO-SYNTHESIS * University of California, Irvine, CA 92697 Abstract: Key words: NoCs present a possible communication infrastructure solution to deal with increased

More information

Low Power Mapping of Video Processing Applications on VLIW Multimedia Processors

Low Power Mapping of Video Processing Applications on VLIW Multimedia Processors Low Power Mapping of Video Processing Applications on VLIW Multimedia Processors K. Masselos 1,2, F. Catthoor 2, C. E. Goutis 1, H. DeMan 2 1 VLSI Design Laboratory, Department of Electrical and Computer

More information

Long Term Trends for Embedded System Design

Long Term Trends for Embedded System Design Long Term Trends for Embedded System Design Ahmed Amine JERRAYA Laboratoire TIMA, 46 Avenue Félix Viallet, 38031 Grenoble CEDEX, France Email: Ahmed.Jerraya@imag.fr Abstract. An embedded system is an application

More information

Design of network adapter compatible OCP for high-throughput NOC

Design of network adapter compatible OCP for high-throughput NOC Applied Mechanics and Materials Vols. 313-314 (2013) pp 1341-1346 Online available since 2013/Mar/25 at www.scientific.net (2013) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/amm.313-314.1341

More information

Single-Path Programming on a Chip-Multiprocessor System

Single-Path Programming on a Chip-Multiprocessor System Single-Path Programming on a Chip-Multiprocessor System Martin Schoeberl, Peter Puschner, and Raimund Kirner Vienna University of Technology, Austria mschoebe@mail.tuwien.ac.at, {peter,raimund}@vmars.tuwien.ac.at

More information

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management International Journal of Computer Theory and Engineering, Vol., No., December 01 Effective Memory Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management Sultan Daud Khan, Member,

More information

HARDWARE IMPLEMENTATION OF PIPELINE BASED ROUTER DESIGN FOR ON- CHIP NETWORK

HARDWARE IMPLEMENTATION OF PIPELINE BASED ROUTER DESIGN FOR ON- CHIP NETWORK DOI: 10.21917/ijct.2012.0092 HARDWARE IMPLEMENTATION OF PIPELINE BASED ROUTER DESIGN FOR ON- CHIP NETWORK U. Saravanakumar 1, R. Rangarajan 2 and K. Rajasekar 3 1,3 Department of Electronics and Communication

More information

Design Space Exploration of Real-time Multi-media MPSoCs with Heterogeneous Scheduling Policies

Design Space Exploration of Real-time Multi-media MPSoCs with Heterogeneous Scheduling Policies Design Space Exploration of Real-time Multi-media MPSoCs with Heterogeneous Scheduling Policies Minyoung Kim, Sudarshan Banerjee, Nikil Dutt, Nalini Venkatasubramanian School of Information and Computer

More information

WITH the development of the semiconductor technology,

WITH the development of the semiconductor technology, Dual-Link Hierarchical Cluster-Based Interconnect Architecture for 3D Network on Chip Guang Sun, Yong Li, Yuanyuan Zhang, Shijun Lin, Li Su, Depeng Jin and Lieguang zeng Abstract Network on Chip (NoC)

More information

System Modeling and Implementation of MPEG-4. Encoder under Fine-Granular-Scalability Framework

System Modeling and Implementation of MPEG-4. Encoder under Fine-Granular-Scalability Framework System Modeling and Implementation of MPEG-4 Encoder under Fine-Granular-Scalability Framework Literature Survey Embedded Software Systems Prof. B. L. Evans by Wei Li and Zhenxun Xiao March 25, 2002 Abstract

More information

Improving Routing Efficiency for Network-on-Chip through Contention-Aware Input Selection

Improving Routing Efficiency for Network-on-Chip through Contention-Aware Input Selection Improving Routing Efficiency for Network-on-Chip through Contention-Aware Input Selection Dong Wu, Bashir M. Al-Hashimi, Marcus T. Schmitz School of Electronics and Computer Science University of Southampton

More information

An Application Mapping Scheme over Distributed Reconfigurable System

An Application Mapping Scheme over Distributed Reconfigurable System An Application Mapping Scheme over Distributed Reconfigurable System Chao Wang Lianghua Miao Bin Xie and Tianzhou Chen College of Computer Science Zhejiang University Hangzhou Zhejiang 310027 P. R. China

More information

Behavioral Array Mapping into Multiport Memories Targeting Low Power 3

Behavioral Array Mapping into Multiport Memories Targeting Low Power 3 Behavioral Array Mapping into Multiport Memories Targeting Low Power 3 Preeti Ranjan Panda and Nikil D. Dutt Department of Information and Computer Science University of California, Irvine, CA 92697-3425,

More information

The S6000 Family of Processors

The S6000 Family of Processors The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which

More information

STG-NoC: A Tool for Generating Energy Optimized Custom Built NoC Topology

STG-NoC: A Tool for Generating Energy Optimized Custom Built NoC Topology STG-NoC: A Tool for Generating Energy Optimized Custom Built NoC Topology Surbhi Jain Naveen Choudhary Dharm Singh ABSTRACT Network on Chip (NoC) has emerged as a viable solution to the complex communication

More information

Profiling Driven Scenario Detection and Prediction for Multimedia Applications

Profiling Driven Scenario Detection and Prediction for Multimedia Applications Profiling Driven Scenario Detection and Prediction for Multimedia Applications Stefan Valentin Gheorghita, Twan Basten and Henk Corporaal EE Department, Electronic Systems Group Eindhoven University of

More information

ISSN Vol.04,Issue.01, January-2016, Pages:

ISSN Vol.04,Issue.01, January-2016, Pages: WWW.IJITECH.ORG ISSN 2321-8665 Vol.04,Issue.01, January-2016, Pages:0077-0082 Implementation of Data Encoding and Decoding Techniques for Energy Consumption Reduction in NoC GORANTLA CHAITHANYA 1, VENKATA

More information

Worst Case Execution Time Analysis for Synthesized Hardware

Worst Case Execution Time Analysis for Synthesized Hardware Worst Case Execution Time Analysis for Synthesized Hardware Jun-hee Yoo ihavnoid@poppy.snu.ac.kr Seoul National University, Seoul, Republic of Korea Xingguang Feng fengxg@poppy.snu.ac.kr Seoul National

More information

Software Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors

Software Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors Software Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors Francisco Barat, Murali Jayapala, Pieter Op de Beeck and Geert Deconinck K.U.Leuven, Belgium. {f-barat, j4murali}@ieee.org,

More information

Session: Configurable Systems. Tailored SoC building using reconfigurable IP blocks

Session: Configurable Systems. Tailored SoC building using reconfigurable IP blocks IP 08 Session: Configurable Systems Tailored SoC building using reconfigurable IP blocks Lodewijk T. Smit, Gerard K. Rauwerda, Jochem H. Rutgers, Maciej Portalski and Reinier Kuipers Recore Systems www.recoresystems.com

More information

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,

More information

An Application-Specific Design Methodology for STbus Crossbar Generation

An Application-Specific Design Methodology for STbus Crossbar Generation An Application-Specific Design Methodology for STbus Crossbar Generation Srinivasan Murali, Giovanni De Micheli Computer Systems Lab Stanford University Stanford, California 935 {smurali, nanni}@stanford.edu

More information

A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup

A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup A Hybrid Approach to CAM-Based Longest Prefix Matching for IP Route Lookup Yan Sun and Min Sik Kim School of Electrical Engineering and Computer Science Washington State University Pullman, Washington

More information

Mapping C code on MPSoC for Nomadic Embedded Systems

Mapping C code on MPSoC for Nomadic Embedded Systems -1 - ARTIST2 Summer School 2008 in Europe Autrans (near Grenoble), France September 8-12, 8 2008 Mapping C code on MPSoC for Nomadic Embedded Systems http://www.artist-embedded.org/ Lecturer: Diederik

More information

Resource-efficient Routing and Scheduling of Time-constrained Network-on-Chip Communication

Resource-efficient Routing and Scheduling of Time-constrained Network-on-Chip Communication Resource-efficient Routing and Scheduling of Time-constrained Network-on-Chip Communication Sander Stuijk, Twan Basten, Marc Geilen, Amir Hossein Ghamarian and Bart Theelen Eindhoven University of Technology,

More information

Optimization of Dynamic Data Structures in Multimedia Embedded Systems Using Evolutionary Computation

Optimization of Dynamic Data Structures in Multimedia Embedded Systems Using Evolutionary Computation Optimization of Dynamic Data Structures in Multimedia Embedded Systems Using Evolutionary Computation D. Atienza, C. Baloukas, L. Papadopoulos, C. Poucet, S. Mamagkakis, J. I. Hidalgo, F. Catthoor, D.

More information

FPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP

FPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP FPGA BASED ADAPTIVE RESOURCE EFFICIENT ERROR CONTROL METHODOLOGY FOR NETWORK ON CHIP 1 M.DEIVAKANI, 2 D.SHANTHI 1 Associate Professor, Department of Electronics and Communication Engineering PSNA College

More information

Design guidelines for embedded real time face detection application

Design guidelines for embedded real time face detection application Design guidelines for embedded real time face detection application White paper for Embedded Vision Alliance By Eldad Melamed Much like the human visual system, embedded computer vision systems perform

More information

ISSN Vol.05,Issue.09, September-2017, Pages:

ISSN Vol.05,Issue.09, September-2017, Pages: WWW.IJITECH.ORG ISSN 2321-8665 Vol.05,Issue.09, September-2017, Pages:1693-1697 AJJAM PUSHPA 1, C. H. RAMA MOHAN 2 1 PG Scholar, Dept of ECE(DECS), Shirdi Sai Institute of Science and Technology, Anantapuramu,

More information

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Who am I? Education Master of Technology, NTNU, 2007 PhD, NTNU, 2010. Title: «Managing Shared Resources in Chip Multiprocessor Memory

More information

A Novel Technique to Use Scratch-pad Memory for Stack Management

A Novel Technique to Use Scratch-pad Memory for Stack Management A Novel Technique to Use Scratch-pad Memory for Stack Management Soyoung Park Hae-woo Park Soonhoi Ha School of EECS, Seoul National University, Seoul, Korea {soy, starlet, sha}@iris.snu.ac.kr Abstract

More information

Multimedia Decoder Using the Nios II Processor

Multimedia Decoder Using the Nios II Processor Multimedia Decoder Using the Nios II Processor Third Prize Multimedia Decoder Using the Nios II Processor Institution: Participants: Instructor: Indian Institute of Science Mythri Alle, Naresh K. V., Svatantra

More information

Cross Clock-Domain TDM Virtual Circuits for Networks on Chips

Cross Clock-Domain TDM Virtual Circuits for Networks on Chips Cross Clock-Domain TDM Virtual Circuits for Networks on Chips Zhonghai Lu Dept. of Electronic Systems School for Information and Communication Technology KTH - Royal Institute of Technology, Stockholm

More information

A Simplified Executable Model to Evaluate Latency and Throughput of Networks-on-Chip

A Simplified Executable Model to Evaluate Latency and Throughput of Networks-on-Chip A Simplified Executable Model to Evaluate Latency and Throughput of Networks-on-Chip Leandro Möller Luciano Ost, Leandro Soares Indrusiak Sanna Määttä Fernando G. Moraes Manfred Glesner Jari Nurmi {ost,

More information

Low-Power Data Address Bus Encoding Method

Low-Power Data Address Bus Encoding Method Low-Power Data Address Bus Encoding Method Tsung-Hsi Weng, Wei-Hao Chiao, Jean Jyh-Jiun Shann, Chung-Ping Chung, and Jimmy Lu Dept. of Computer Science and Information Engineering, National Chao Tung University,

More information

Computer-Aided Recoding for Multi-Core Systems

Computer-Aided Recoding for Multi-Core Systems Computer-Aided Recoding for Multi-Core Systems Rainer Dömer doemer@uci.edu With contributions by P. Chandraiah Center for Embedded Computer Systems University of California, Irvine Outline Embedded System

More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra

More information

A Unified HW/SW Interface Model to Remove Discontinuities between HW and SW Design

A Unified HW/SW Interface Model to Remove Discontinuities between HW and SW Design A Unified /SW Interface Model to Remove Discontinuities between and SW Design Aimen Bouchhima, Xi Chen, Frédéric Pétrot, Wander O. Cesário, Ahmed A. Jerraya TIMA Laboratory 46 Avenue Félix Viallet 38031

More information

342 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 3, MARCH /$ IEEE

342 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 3, MARCH /$ IEEE 342 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 3, MARCH 2009 Custom Networks-on-Chip Architectures With Multicast Routing Shan Yan, Student Member, IEEE, and Bill Lin,

More information

Exploration of Distributed Shared Memory Architectures for NoC-based Multiprocessors

Exploration of Distributed Shared Memory Architectures for NoC-based Multiprocessors Exploration of Distributed Shared Memory Architectures for NoC-based Multiprocessors Matteo Monchiero Gianluca Palermo Cristina Silvano Oreste Villa Dipartimento di Elettronica e Informazione Politecnico

More information

Real-Time Dynamic Voltage Hopping on MPSoCs

Real-Time Dynamic Voltage Hopping on MPSoCs Real-Time Dynamic Voltage Hopping on MPSoCs Tohru Ishihara System LSI Research Center, Kyushu University 2009/08/05 The 9 th International Forum on MPSoC and Multicore 1 Background Low Power / Low Energy

More information

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Éricles Sousa 1, Frank Hannig 1, Jürgen Teich 1, Qingqing Chen 2, and Ulf Schlichtmann

More information

Operating system integrated energy aware scratchpad allocation strategies for multiprocess applications

Operating system integrated energy aware scratchpad allocation strategies for multiprocess applications University of Dortmund Operating system integrated energy aware scratchpad allocation strategies for multiprocess applications Robert Pyka * Christoph Faßbach * Manish Verma + Heiko Falk * Peter Marwedel

More information

Real Time NoC Based Pipelined Architectonics With Efficient TDM Schema

Real Time NoC Based Pipelined Architectonics With Efficient TDM Schema Real Time NoC Based Pipelined Architectonics With Efficient TDM Schema [1] Laila A, [2] Ajeesh R V [1] PG Student [VLSI & ES] [2] Assistant professor, Department of ECE, TKM Institute of Technology, Kollam

More information

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC)

FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) FPGA based Design of Low Power Reconfigurable Router for Network on Chip (NoC) D.Udhayasheela, pg student [Communication system],dept.ofece,,as-salam engineering and technology, N.MageshwariAssistant Professor

More information

Single Pass Connected Components Analysis

Single Pass Connected Components Analysis D. G. Bailey, C. T. Johnston, Single Pass Connected Components Analysis, Proceedings of Image and Vision Computing New Zealand 007, pp. 8 87, Hamilton, New Zealand, December 007. Single Pass Connected

More information

Network-on-Chip Architecture

Network-on-Chip Architecture Multiple Processor Systems(CMPE-655) Network-on-Chip Architecture Performance aspect and Firefly network architecture By Siva Shankar Chandrasekaran and SreeGowri Shankar Agenda (Enhancing performance)

More information

Design of a System-on-Chip Switched Network and its Design Support Λ

Design of a System-on-Chip Switched Network and its Design Support Λ Design of a System-on-Chip Switched Network and its Design Support Λ Daniel Wiklund y, Dake Liu Dept. of Electrical Engineering Linköping University S-581 83 Linköping, Sweden Abstract As the degree of

More information

Functional modeling style for efficient SW code generation of video codec applications

Functional modeling style for efficient SW code generation of video codec applications Functional modeling style for efficient SW code generation of video codec applications Sang-Il Han 1)2) Soo-Ik Chae 1) Ahmed. A. Jerraya 2) SD Group 1) SLS Group 2) Seoul National Univ., Korea TIMA laboratory,

More information

Energy-Aware Cosynthesis of Real-Time Multimedia Applications on MPSoCs Using Heterogeneous Scheduling Policies

Energy-Aware Cosynthesis of Real-Time Multimedia Applications on MPSoCs Using Heterogeneous Scheduling Policies Energy-Aware Cosynthesis of Real-Time Multimedia Applications on MPSoCs Using Heterogeneous Scheduling Policies 9 MINYOUNG KIM, SUDARSHAN BANERJEE, NIKIL DUTT, and NALINI VENKATASUBRAMANIAN University

More information

Co-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms. SAMOS XIV July 14-17,

Co-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms. SAMOS XIV July 14-17, Co-Design of Many-Accelerator Heterogeneous Systems Exploiting Virtual Platforms SAMOS XIV July 14-17, 2014 1 Outline Introduction + Motivation Design requirements for many-accelerator SoCs Design problems

More information

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore

More information

Low-Power Video Codec Design

Low-Power Video Codec Design International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn : 2278-800X, www.ijerd.com Volume 5, Issue 8 (January 2013), PP. 81-85 Low-Power Video Codec Design R.Kamalakkannan

More information

LOW POWER REDUCED ROUTER NOC ARCHITECTURE DESIGN WITH CLASSICAL BUS BASED SYSTEM

LOW POWER REDUCED ROUTER NOC ARCHITECTURE DESIGN WITH CLASSICAL BUS BASED SYSTEM Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 5, May 2015, pg.705

More information

The Design and Implementation of a Low-Latency On-Chip Network

The Design and Implementation of a Low-Latency On-Chip Network The Design and Implementation of a Low-Latency On-Chip Network Robert Mullins 11 th Asia and South Pacific Design Automation Conference (ASP-DAC), Jan 24-27 th, 2006, Yokohama, Japan. Introduction Current

More information

A Framework for Video Streaming to Resource- Constrained Terminals

A Framework for Video Streaming to Resource- Constrained Terminals A Framework for Video Streaming to Resource- Constrained Terminals Dmitri Jarnikov 1, Johan Lukkien 1, Peter van der Stok 1 Dept. of Mathematics and Computer Science, Eindhoven University of Technology

More information

Caching video contents in IPTV systems with hierarchical architecture

Caching video contents in IPTV systems with hierarchical architecture Caching video contents in IPTV systems with hierarchical architecture Lydia Chen 1, Michela Meo 2 and Alessandra Scicchitano 1 1. IBM Zurich Research Lab email: {yic,als}@zurich.ibm.com 2. Politecnico

More information

NETWORKS on CHIP A NEW PARADIGM for SYSTEMS on CHIPS DESIGN

NETWORKS on CHIP A NEW PARADIGM for SYSTEMS on CHIPS DESIGN NETWORKS on CHIP A NEW PARADIGM for SYSTEMS on CHIPS DESIGN Giovanni De Micheli Luca Benini CSL - Stanford University DEIS - Bologna University Electronic systems Systems on chip are everywhere Technology

More information

Mapping Array Communication onto FIFO Communication - Towards an Implementation

Mapping Array Communication onto FIFO Communication - Towards an Implementation Mapping Array Communication onto Communication - Towards an Implementation Jeffrey Kang Albert van der Werf Paul Lippens Philips Research Laboratories Prof. Holstlaan 4, 5656 AA Eindhoven, The Netherlands

More information

An Integration of Imprecise Computation Model and Real-Time Voltage and Frequency Scaling

An Integration of Imprecise Computation Model and Real-Time Voltage and Frequency Scaling An Integration of Imprecise Computation Model and Real-Time Voltage and Frequency Scaling Keigo Mizotani, Yusuke Hatori, Yusuke Kumura, Masayoshi Takasu, Hiroyuki Chishiro, and Nobuyuki Yamasaki Graduate

More information

ISSN Vol.03, Issue.02, March-2015, Pages:

ISSN Vol.03, Issue.02, March-2015, Pages: ISSN 2322-0929 Vol.03, Issue.02, March-2015, Pages:0122-0126 www.ijvdcs.org Design and Simulation Five Port Router using Verilog HDL CH.KARTHIK 1, R.S.UMA SUSEELA 2 1 PG Scholar, Dept of VLSI, Gokaraju

More information

Designing Area and Performance Constrained SIMD/VLIW Image Processing Architectures

Designing Area and Performance Constrained SIMD/VLIW Image Processing Architectures Designing Area and Performance Constrained SIMD/VLIW Image Processing Architectures Hamed Fatemi 1,2, Henk Corporaal 2, Twan Basten 2, Richard Kleihorst 3,and Pieter Jonker 4 1 h.fatemi@tue.nl 2 Eindhoven

More information

High performance, power-efficient DSPs based on the TI C64x

High performance, power-efficient DSPs based on the TI C64x High performance, power-efficient DSPs based on the TI C64x Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner Rice University {sridhar,cavallar,rixner}@rice.edu RICE UNIVERSITY Recent (2003) Research

More information

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes. HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes Ian Glendinning Outline NVIDIA GPU cards CUDA & OpenCL Parallel Implementation

More information

Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip

Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip ASP-DAC 2010 20 Jan 2010 Session 6C Efficient Throughput-Guarantees for Latency-Sensitive Networks-On-Chip Jonas Diemer, Rolf Ernst TU Braunschweig, Germany diemer@ida.ing.tu-bs.de Michael Kauschke Intel,

More information

Clustering-Based Topology Generation Approach for Application-Specific Network on Chip

Clustering-Based Topology Generation Approach for Application-Specific Network on Chip Proceedings of the World Congress on Engineering and Computer Science Vol II WCECS, October 9-,, San Francisco, USA Clustering-Based Topology Generation Approach for Application-Specific Network on Chip

More information

Cosimulation of ITRON-Based Embedded Software with SystemC

Cosimulation of ITRON-Based Embedded Software with SystemC Cosimulation of ITRON-Based Embedded Software with SystemC Shin-ichiro Chikada, Shinya Honda, Hiroyuki Tomiyama, Hiroaki Takada Graduate School of Information Science, Nagoya University Information Technology

More information

Energy Aware Computing in Cooperative Wireless Networks

Energy Aware Computing in Cooperative Wireless Networks Energy Aware Computing in Cooperative Wireless Networks Anders Brødløs Olsen, Frank H.P. Fitzek, Peter Koch Department of Communication Technology, Aalborg University Niels Jernes Vej 12, 9220 Aalborg

More information

Automatic Generation of Communication Architectures

Automatic Generation of Communication Architectures i Topic: Network and communication system Automatic Generation of Communication Architectures Dongwan Shin, Andreas Gerstlauer, Rainer Dömer and Daniel Gajski Center for Embedded Computer Systems University

More information

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Zhining Huang, Sharad Malik Electrical Engineering Department

More information

ENERGY EFFICIENT SCHEDULING FOR REAL-TIME EMBEDDED SYSTEMS WITH PRECEDENCE AND RESOURCE CONSTRAINTS

ENERGY EFFICIENT SCHEDULING FOR REAL-TIME EMBEDDED SYSTEMS WITH PRECEDENCE AND RESOURCE CONSTRAINTS ENERGY EFFICIENT SCHEDULING FOR REAL-TIME EMBEDDED SYSTEMS WITH PRECEDENCE AND RESOURCE CONSTRAINTS Santhi Baskaran 1 and P. Thambidurai 2 1 Department of Information Technology, Pondicherry Engineering

More information

A METHODOLOGY FOR THE OPTIMIZATION OF MULTI- PROGRAM SHARED SCRATCHPAD MEMORY

A METHODOLOGY FOR THE OPTIMIZATION OF MULTI- PROGRAM SHARED SCRATCHPAD MEMORY INTERNATIONAL JOURNAL ON SMART SENSING AND INTELLIGENT SYSTEMS VOL. 4, NO. 1, MARCH 2011 A METHODOLOGY FOR THE OPTIMIZATION OF MULTI- PROGRAM SHARED SCRATCHPAD MEMORY J. F. Yang, H. Jiang School of Electronic

More information

Hardware Scheduling Support in SMP Architectures

Hardware Scheduling Support in SMP Architectures Hardware Scheduling Support in SMP Architectures André C. Nácul Center for Embedded Systems University of California, Irvine nacul@uci.edu Francesco Regazzoni ALaRI, University of Lugano Lugano, Switzerland

More information

Power Estimation of System-Level Buses for Microprocessor-Based Architectures: A Case Study

Power Estimation of System-Level Buses for Microprocessor-Based Architectures: A Case Study Power Estimation of System-Level Buses for Microprocessor-Based Architectures: A Case Study William Fornaciari Politecnico di Milano, DEI Milano (Italy) fornacia@elet.polimi.it Donatella Sciuto Politecnico

More information

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture

Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture The 51st Annual IEEE/ACM International Symposium on Microarchitecture Multi-dimensional Parallel Training of Winograd Layer on Memory-Centric Architecture Byungchul Hong Yeonju Ro John Kim FuriosaAI Samsung

More information

Integrating MRPSOC with multigrain parallelism for improvement of performance

Integrating MRPSOC with multigrain parallelism for improvement of performance Integrating MRPSOC with multigrain parallelism for improvement of performance 1 Swathi S T, 2 Kavitha V 1 PG Student [VLSI], Dept. of ECE, CMRIT, Bangalore, Karnataka, India 2 Ph.D Scholar, Jain University,

More information
