Pareto-Based Application Specification for MP-SoC Customized Run-Time Management


Ch. Ykman-Couvreur 1, V. Nollet 1, Th. Marescaux 1, E. Brockmeyer 1, Fr. Catthoor 1,2, H. Corporaal 3

1 IMEC V.Z.W., Kapeldreef 75, 3001 Leuven, Belgium
2 Also prof. at Katholieke Univ. Leuven, Belgium
3 Prof. at Technical Univ. Eindhoven, The Netherlands

Abstract

In an MP-SoC environment, customized run-time management should be incorporated on top of the basic OS services to globally optimize costs (e.g. energy consumption) across all active applications, according to constraints (e.g. performance, user requirements) and available platform resources. To that end, we have proposed a Pareto-based approach combining a design-time application mapping and platform exploration with a low-complexity run-time manager. This relieves the OS in its run-time decision making and avoids conservative worst-case assumptions. In this paper, we focus on the characterization of the Pareto-based application specification resulting from our design-time exploration. This specification is essential as input for our run-time manager. A representative video codec multimedia application, simulated on our MP-SoC platform simulator, is used as a case study. For the resulting Pareto-based specification, both binary size and performance overhead are negligible.

[Fig. 1. MP-SoC environment. Platform aspect: distributed PEs interconnected by a NoC. Application aspect: dynamic set of applications (e.g. multimedia). Non-functional aspect: low power, real-time behavior, small memory footprint.]

I. INTRODUCTION

In a Multi-Processor System-on-Chip (MP-SoC) environment, an ideal Operating System (OS), also called the run-time management layer, should efficiently combine all application, platform, and non-functional aspects (Fig. 1). First, the OS should enable a dynamic set of multimedia applications (e.g. video messaging, web browsing, video conferencing), 3D games, and many other compute-intensive tasks [1].
These applications are becoming more heterogeneous, dynamic, and data intensive. When they run on mobile devices, which are typically battery-powered, energy consumption is a major design issue. The OS also has to fulfill the Quality-of-Service (QoS) requirements of the user (e.g. reliability, performance, and video quality).

Second, the OS has to support platforms [2] (e.g. TI OMAP and ST Nomadik) that consist of a large number of heterogeneous Processing Elements (PEs). These platforms combine the advantages of parallel computing on multiple processors with the single-chip integration of SoCs. They provide high computational performance at a low energy cost, whereas typical embedded systems (e.g. handheld devices such as PDAs and smartphones) are limited by a restricted amount of processing power and memory. Since application complexity is growing, the major challenges are the right parallelization of these applications and their efficient mapping on the MP-SoC platform.

Third, growing SoC complexity makes communication subsystem design as important as computation subsystem design [3], [4]. To provide reliable and scalable communication [5], a flexible interconnect, the Network-on-Chip (NoC), must be adopted. Designing such a NoC becomes another major task for future MP-SoCs.

Finally, for memory-intensive applications such as multimedia applications, the memory subsystem represents an important component of the overall energy cost. In the memory subsystem, ScratchPad Memories (SPMs) are used [6], [7], [8], since they perform better than caches in terms of energy per access, performance, on-chip area, and predictability. However, unlike caches, SPMs require software allocation techniques and a complex design-time application analysis to decide carefully which data to assign to the SPM.
To relieve the OS in its run-time decision making, and to avoid conservative worst-case assumptions, we have proposed a customized run-time management [9] to map the applications on the platform. It consists of two phases. First, a design-time mapping and platform exploration per application leads to a multi-dimensional Pareto set of optimal mappings. Each mapping is characterized by a code version together with an optimal combination of used platform resources, costs, and constraints. The different code versions refer to different parallelizations of the application into parallel tasks and to data transfers between SPMs and local memories. Second, a low-complexity run-time manager, incorporated on top of the basic OS services, maintains the high quality of the exploration. Whenever the environment changes (e.g., when a new application/use case starts, or when the user requirements change), our run-time manager reacts as follows for each active application:
1) It selects in a predictable way a mapping from its Pareto set, according to the available platform resources, in order to minimize the total energy consumption of the platform, while respecting all constraints.
2) It performs Pareto point switches (Fig. 2, restricted to two dimensions), i.e. it assigns the platform resources, it adapts the platform parameters, it loads the task binaries from the shared memory into the corresponding local memories, and it issues the execution of the code versions according to the newly selected Pareto points.

[Fig. 2. Pareto point switch. Panels for Application A and Application B: energy versus number of PEs (1 PE to 4 PEs) at clocks Ck1 and Ck2. Timeline over Proc 0 to Proc 3 with the events: A starts, B starts, A stops, B stops.]

When Application A starts, it is assigned to three PEs with a slow clock (ck2). As soon as Application B starts, a Pareto point switch is needed to map A on only two PEs. By speeding up the clock (ck1), the application deadline is still met. After A stops, B can be spread over three PEs in order to reduce the energy consumption.

In [9], the design-time exploration phase of our approach, restricted to the usage of one processor, was presented. The main new contribution of this paper is the characterization of the Pareto-based application specification, which results from our design-time exploration and efficiently merges all code versions present in the Pareto set. This specification, to be stored in the MP-SoC platform, is essential as input for our run-time manager. A representative video codec multimedia application, simulated on our MP-SoC platform simulator, is used as a case study to illustrate this application specification. The resulting binary size and performance overhead are negligible.

The remainder of this paper is organized as follows. Section II summarizes the related work in the MP-SoC domain. Section III presents our customized run-time management approach. Section IV introduces our case-study application and our platform simulator.
Section V characterizes the Pareto-based application specification used in our approach. It also describes the experiments performed on our case study. Conclusions and future work are given in Section VI.

II. RELATED WORK

In recent years, industrial MP-SoC components have been introduced by companies like Texas Instruments and ST Microelectronics. For embedded systems (limited by the number of PEs), Real-Time OSs (RTOSs) are focused on execution determinism, speed, and small memory footprint. Current OSs like the TI DSP/BIOS kernel, the Quadros RTXC RTOS, and the Enea Systems OSE RTOS are clearly focused on low-level run-time management (i.e., multiplexing the hardware and providing uniform communication primitives). They only provide an abstraction layer on top of the hardware; they expand and link together existing technologies, but they are not designed for the emerging MP-SoC environment. Support for SPMs, NoCs, dynamic power management, and QoS-aware and application-specific run-time management is lacking. Hence none of these existing OSs represents the ideal glue layer for MP-SoCs. The user is supposed to implement his own run-time manager on top of the OS services. State-of-the-art tools and design practice are also not yet in a shape to meet the needs presented in Section I.

Currently in the academic world, two diverging strategies [10] are developed to cope with the design complexity of application-specific and heterogeneous MP-SoC platforms: either the IP-driven approach [11], [12], or the design-flow-driven approach. In the IP-driven approaches, each application is synthesized separately, and synthesis has no integral view on the entire system on the MP-SoC platform. Related to the design-flow-driven approach, several global optimization issues are considered: application parallelization, task scheduling, communication management, and dynamic reconfiguration.
In this paper, we focus on task scheduling and dynamic reconfiguration, for which our approach offers trade-offs.

For MP-SoC platforms, task scheduling becomes more complicated [13], and its impact on performance and energy consumption becomes more significant. It consists of: (1) mapping, which determines the order in which the tasks are executed (i.e. temporal mapping) and the processor on which each task must be executed (i.e. spatial mapping); (2) Dynamic Voltage/Frequency Scaling (DVS/DFS), which determines the processor supply voltage and clock frequency, if it is allowed. Energy consumption is increasingly an issue, and not only for battery-operated devices. Even if unlimited power is available, a large number of components tightly packed onto a chip poses cooling and reliability problems. An important way to reduce the energy consumption is to shut down or slow down functional components that are idle or underutilized, by combining DVS with Dynamic Power Management (DPM). Surveys of system-level design techniques can be found in [14] and [15]. The most recent scheduling approaches combining application mapping with DVS can be found in [16], [17], [18].

To support the massive data traffic, run-time communication management is a challenging task, since inter-processor communications become responsible for a significant part of the execution time and energy consumption. Approaches combining application mapping with some communication management aspects can be found in [19], [20], [21], [22].

Related to dynamic reconfiguration, some aspects are currently considered. Multimedia applications are becoming more versatile, and dynamic applications with multiple use cases need to be supported. Switching from one use case to another at run time involves changing the application task graph configuration [23], [24]. The platform also needs to support a wide and dynamic set of applications. This requires efficient run-time support for platform resource management, task relocation, and reconfiguration of inter-task communication [25].

[Fig. 3. Pareto set generated by our design-time exploration. For each application (Application A, Application B), the design-time exploration yields refined application code (Version 1, Version 2, ...) and a Pareto set with the axes: energy, memory usage, PE usage, others.]

[Fig. 4. Our MP-SoC run-time management. Labels: low-complexity run-time layer, constraints, customized run-time manager, RTOS kernel, platform information.]

III. OVERVIEW OF OUR CUSTOMIZED RUN-TIME MANAGEMENT

To meet the needs presented in Section I, our approach proposes a customized run-time management to map the applications on the platform, consisting of two phases: (1) a design-time mapping and platform exploration per application; (2) a low-complexity run-time manager incorporated on top of the basic OS services. This run-time manager globally optimizes costs (e.g. energy consumption) across all active applications, according to constraints (e.g. performance, user requirements) and available platform resources. It also performs low-cost switches between possible mappings of the same application, as required by environment changes. A similar conceptual approach was already developed for scheduling concurrent tasks on embedded systems. It was intended to optimize only the energy consumption while respecting the application deadlines [18], [14].
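The global optimization performed by the run-time manager can be illustrated with a minimal sketch. It is not the selection algorithm of the paper, which is not detailed here: it is a brute-force search over one Pareto point per active application, and the names `ParetoPoint`, `select_mappings`, `SET_A`, `SET_B`, as well as all numeric values, are hypothetical. Only three dimensions are modeled (PE usage, execution time, energy).

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List, Optional


@dataclass(frozen=True)
class ParetoPoint:
    pes: int          # number of PEs used by this mapping
    exec_time: float  # execution time of the application (arbitrary units)
    energy: float     # energy cost of the mapping (arbitrary units)


def select_mappings(
    pareto_sets: Dict[str, List[ParetoPoint]],
    deadlines: Dict[str, float],
    available_pes: int,
) -> Optional[Dict[str, ParetoPoint]]:
    """Return one Pareto point per application with minimal total energy,
    or None if no combination meets the deadlines and the PE budget."""
    apps = sorted(pareto_sets)
    best, best_energy = None, float("inf")
    # Brute force (illustration only): try every combination of one point per application.
    for combo in product(*(pareto_sets[app] for app in apps)):
        if sum(p.pes for p in combo) > available_pes:
            continue  # not enough PEs on the platform
        if any(p.exec_time > deadlines[app] for app, p in zip(apps, combo)):
            continue  # at least one deadline would be missed
        total = sum(p.energy for p in combo)
        if total < best_energy:
            best, best_energy = dict(zip(apps, combo)), total
    return best


# Hypothetical Pareto sets: more PEs allow a slower clock, hence less energy.
SET_A = [ParetoPoint(1, 8.0, 12.0), ParetoPoint(2, 5.0, 9.0), ParetoPoint(3, 4.0, 7.0)]
SET_B = [ParetoPoint(1, 9.0, 11.0), ParetoPoint(2, 6.0, 8.0), ParetoPoint(3, 4.0, 6.0)]

alone = select_mappings({"A": SET_A}, {"A": 6.0}, available_pes=4)
both = select_mappings({"A": SET_A, "B": SET_B}, {"A": 6.0, "B": 10.0}, available_pes=4)
print(alone["A"].pes, both["A"].pes, both["B"].pes)
```

With these assumed values, application A alone is mapped on three PEs; once B becomes active on the same four-PE platform, A is switched to its two-PE point so that both deadlines are met at the lowest total energy. An exhaustive search grows exponentially with the number of applications, so a real low-complexity manager would need a cheaper heuristic.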
In contrast to the conventional approaches that generate only one solution for each application, the first phase is a design-time application mapping and platform exploration. For each application, this exploration generates a set of optimal mappings in a multi-dimensional design space (Fig. 3), instead of a two-dimensional one. Current dimensions are costs (e.g. energy consumption), constraints (e.g. performance, user requirements), and used platform resources (e.g. memory usage, processors, communication bandwidth, clocks, and processor supply voltage if it is allowed). Only points that are not dominated are retained, i.e. points for which no other point is at least as good in every dimension and strictly better in at least one. They are called Pareto points. The resulting set of Pareto points is called the Pareto set. This design-time exploration phase of our approach, restricted to the usage of one processor, was presented in [9].

Depending on the application constraints and on the availability of the platform resources, any one of these Pareto points, each representing an application mapping, can be selected as the best one by the run-time manager. Each Pareto point is also annotated with a code version. The different code versions refer to different parallelizations of the application into parallel tasks and to data transfers between SPMs and local memories. The main contribution of this paper is the characterization and merging of all these code versions into what we call the Pareto-based application specification. It is presented in Section V. Hence, in total, our Pareto set for any application is made up of optimal mappings, each characterized by a code version together with an optimal combination of used platform resources, costs, and constraints. The description of the data structures storing information related to this Pareto set and the Pareto points is out of the scope of this paper.

The full exploration is done at design time, whereas the critical decisions are taken during the second phase by a low-complexity run-time manager (Fig. 4).
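The retention rule for Pareto points can be made concrete with a short sketch. It assumes that every dimension is expressed so that lower is better (energy, execution time, PE usage); the function names and the candidate values are hypothetical and do not come from the exploration tool.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]  # one value per dimension, lower is better


def dominates(p: Point, q: Point) -> bool:
    """True if p is at least as good as q everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def pareto_set(points: Sequence[Point]) -> List[Point]:
    """Keep only the non-dominated points, preserving the input order."""
    unique = list(dict.fromkeys(points))  # drop exact duplicates
    return [p for p in unique if not any(dominates(q, p) for q in unique)]


# Hypothetical candidate mappings: (energy, execution time, number of PEs).
CANDIDATES = [
    (7.0, 4.0, 3),
    (10.0, 5.0, 2),  # dominated by (9.0, 5.0, 2)
    (9.0, 5.0, 2),
    (12.0, 8.0, 1),
    (8.0, 4.0, 3),   # dominated by (7.0, 4.0, 3)
    (13.0, 9.0, 1),  # dominated by (12.0, 8.0, 1)
]
FRONT = pareto_set(CANDIDATES)
```

Note that the one-PE and two-PE points survive although they cost more energy and time: resource usage is itself a dimension, which is what lets the run-time manager fall back on them when fewer PEs are available.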
The run-time manager provides the following services:
1) Whenever a new application is activated, our run-time manager parses its Pareto set provided by the design-time exploration and stores it, including all task binaries, in the shared memory of the MP-SoC platform.
2) Whenever the environment changes (e.g., when a new application/use case starts, or when the user requirements change), our run-time manager reacts as follows for each active application. First, it selects in a predictable way a mapping from its Pareto set, according to the available platform resources, in order to minimize the total energy consumption of the platform, while respecting all constraints. Second, it performs Pareto point switches (Fig. 2, restricted to two dimensions), as explained in Section I.

The Pareto point switch technique bears some resemblance to dynamic reconfiguration. It also switches between mappings but, in contrast to dynamic reconfiguration, it involves more complex run-time trade-offs.

IV. DEMONSTRATOR

As driver application, an inter-frame compression technique for video images, called Quadtree Structured Difference Pulse Code Modulation (QSDPCM), is used [26]. It is representative of many of today's video codec multimedia applications. It involves a three-stage hierarchical Motion Estimation (ME4, ME2, and ME1), followed by a quadtree-based encoding of the motion-compensated frame-to-frame difference signal, a Quantization, and a Huffman-based Compression (QC). Two image resolutions are allowed: either QCIF, with an image size of 176 x 144 pixels, or VGA, with an image size of 640 x 480 pixels. In our experiments, the QCIF resolution is used. The starting algorithm, expressed in C code, has two image frames (the previous and the current ones) as input, and one bit stream as output. The code is already tuned for efficient data management and processing by: (1) minimizing the size of internal arrays; (2) optimizing the loop performance and achieving software pipelining. To preserve these optimizations in later code refinements, each optimized loop is encapsulated in a function, called kernel in the remainder of this paper. The resulting algorithm is illustrated in Fig. 5(a), where each module is a loop manipulating two pixel blocks at each iteration (one from the current frame, and the other from the previous frame).
Our MP-SoC simulator assumes a platform composed of: (1) processor nodes with local memories and buses; (2) distributed shared memory nodes; (3) communication assists similar to Direct Memory Access (DMA) controllers, providing high-level services to processors and shared memories for efficient data transfers; (4) I/O nodes; (5) a communication architecture, being the AEthereal NoC [27]. The main platform parameters that can be explored at present are: the network clock, the maximum number of time slots, the number of routers, the processor clock and supply voltage if it is allowed, the memory clock, the communication bandwidth between a processor and a shared memory, the number of processors to be used by the application, the memory usage, and a QoS requirement (either guaranteed throughput or best effort).

V. PARETO-BASED APPLICATION SPECIFICATION

From our design-time exploration, a multi-dimensional Pareto set of optimal mappings is generated for any application to be mapped on the MP-SoC platform. Each mapping is characterized by a code version together with an optimal combination of used platform resources, costs, and constraints. First, the structure of any standalone application code version is described in Section V-A. Then, the Pareto-based specification, merging all these codes, is characterized in Section V-B.

[Fig. 5. QSDPCM application: (a) starting algorithm; (b) relevant parallelizations on 1 processor, on 2 processors, on 3 processors, and on k+2 processors (1 < k < 6) with data-parallel tasks ME1_1 ... ME1_k.]

A. Standalone Code Version Structure

Each application code version present in the Pareto set refers to a parallelization of the application into parallel tasks and to data transfers between SPMs and local memories, both derived from the design-time exploration, as follows.

Parallelization exploration: Parallelizing an application can be done at both the functional and the data level.
At the functional level, the algorithm is partitioned into smaller tasks, and synchronization requirements between them are identified to allow pipelined execution of these tasks. At the data level, for instance in video applications, images can be divided into blocks of rows, and each task parallelized at the data level deals with its own block of rows.

Block transfer exploration: To optimize both performance and energy consumption in the memory subsystem, parts of data arrays stored in the SPM are copied into the processor local memory, from where they are accessed multiple times [28]. These copy operations, also called Block Transfers (BTs), are performed through function calls in the application code, first to issue a BT, and next to synchronize its completion with processing. This allows BTs to be performed in parallel with processing and hence improves the application performance. This is illustrated in Fig. 6, where a BT into a copy cp prev frame is performed in parallel with the processing of a for loop. This reduces the waiting time for the BT completion and yields a performance gain of 16 cycles per iteration. Several efficient solutions, yielding different local memory usage and performance, exist for the copy sizes and for the places in the code where these BT calls are inserted.

[Fig. 6. BT from SPM to processor local memory]

[Fig. 7. Application code version structure]

Such a code version (Fig. 7) is made up of a task set. Each task is made up of a skeleton that glues together the kernel calls (Section IV), the BT calls, and the task synchronization for parallelization.

Experiments: Related to the functional-level parallelization, the QSDPCM can be naturally partitioned into either three tasks (ME42, ME1, and QC), or two tasks (ME42, and ME1 merged with QC). To further alleviate the computation effort of ME1, the input frames can be divided into row blocks to parallelize ME1 at the data level. Up to five parallel ME1 tasks have been considered, beyond which no further performance gain is obtained because the task synchronization and BT overhead become too large. These QSDPCM parallelizations are illustrated in Fig. 5(b).

Related to the BTs, three arrays (storing the current image frame, the previous one, and some internal data required in ) are too large and must be stored in the SPM. Several efficient BT solutions are explored. Table I reports for the task ME1 the resulting processor local memory usage for copies (difference up to a factor 2) and the binary size overhead for BT calls (up to 16% of the total ME1 binary). The current implementation of a BT issue (resp. sync) call costs about 378 (resp. 20) bytes in our MP-SoC platform simulator, which explains this significant BT call size overhead. This needs to be optimized in our near-future work.

TABLE I
BLOCK TRANSFER IMPACT ON ME1 BINARY SIZE (BYTES)

BT solution   Size of copies   Binary size of BT calls   Total ME1 binary size
BT 0          688              676                       8132
BT 1          44               1260                      8716
BT 2          1120             1028                      8484
BT 3          1376

Similar BT solutions are derived for the tasks QC and ME1 QC, whereas only one efficient BT solution is derived for ME42. Hence, considering all combinations of BT solutions in all tasks of any parallelized application gives rise to a huge number of different application code versions. A Pareto-based specification, merging all of them and allowing efficient loading of any task binary into the platform, is required. This specification is characterized in Section V-B.

B. Merging code versions

All code versions of the same application derived from the design-time exploration are merged into a generic one, called the Pareto-based specification. This specification is made up of:
1) A set of tasks, derived from the functional-level parallelization exploration of the application.
2) For each task: the block of image rows, derived from the data-level parallelization exploration and used as input argument of the task; and an extended task skeleton integrating all BT solutions, derived from the block transfer exploration.
3) For each BT solution: implementation details specifying the size of the copies to be allocated in the processor local memory, and the BT calls to be executed in the task skeleton.

This Pareto-based specification is stored in the shared memory of the MP-SoC platform. However, only the required task binaries are loaded into the corresponding local memories during Pareto point switches, as explained in Section I.

This Pareto-based specification is illustrated on the QSDPCM to show that, for this application, both code size and performance overhead are negligible. To analyze the energy consumption overhead, an energy model in our MP-SoC platform simulator is required.
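The content of the merged specification for one task can be sketched as a small record, together with the arithmetic behind Table I. This is only an illustration: the paper states that its data structures are out of scope, so the class and field names below are hypothetical. The base size of 7456 bytes is not given in the text; it is obtained by subtracting the BT call size from the total ME1 binary size, which gives the same value for BT 0, BT 1, and BT 2. BT 3 is left out because its row in Table I is incomplete.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class BTSolution:
    copy_bytes: int     # local memory to allocate for the copies
    bt_call_bytes: int  # binary size of the BT issue/sync calls


@dataclass(frozen=True)
class TaskSpec:
    rows: Tuple[int, int]                # block of image rows given to the task
    base_bytes: int                      # kernels and skeleton without BT calls (derived)
    bt_solutions: Dict[str, BTSolution]  # every BT solution kept in the merged skeleton

    def standalone_bytes(self, bt: str) -> int:
        """Binary size of the standalone code version using one BT solution."""
        return self.base_bytes + self.bt_solutions[bt].bt_call_bytes

    def bt_call_share(self, bt: str) -> float:
        """Fraction of the standalone binary taken by the BT calls."""
        return self.bt_solutions[bt].bt_call_bytes / self.standalone_bytes(bt)


# ME1 on a full QCIF frame (144 rows), with the three complete rows of Table I.
ME1 = TaskSpec(
    rows=(0, 144),
    base_bytes=7456,
    bt_solutions={
        "BT 0": BTSolution(copy_bytes=688, bt_call_bytes=676),
        "BT 1": BTSolution(copy_bytes=44, bt_call_bytes=1260),
        "BT 2": BTSolution(copy_bytes=1120, bt_call_bytes=1028),
    },
)

for name in ME1.bt_solutions:
    print(name, ME1.standalone_bytes(name), f"{ME1.bt_call_share(name):.1%}")
```

Under these assumptions the BT calls account for roughly 8%, 14%, and 12% of the ME1 binary for BT 0, BT 1, and BT 2 respectively. At a Pareto point switch, a loader would read `copy_bytes` of the selected BT solution to size the local memory allocation.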

The energy model is currently under investigation.

Experiments: From the QSDPCM parallelization exploration (Fig. 5(b)), four different tasks are considered: ME42, ME1, QC, and ME1 QC. Binary sizes for these tasks, integrating all BT solutions, are detailed in Fig. 8. They include the sizes of all needed kernels, the extended skeleton, and all implementation details. The kernels, which are independent of the standalone code versions, represent the major component of any task binary. The size of the task synchronization and of all BT calls, which are part of the extended task skeleton, is also reported.

[Fig. 8. Task binary sizes (bytes) in Pareto-based QSDPCM specification]

The code size overhead of the Pareto-based specification is due to: (1) the size of the implementation details, which is negligible; (2) the size overhead of the extended skeletons, due to the integration of all BT solutions. The size overhead for each task binary is detailed in Fig. 9(a). Merging a standalone code version into the Pareto-based specification yields less than 5% size overhead.

To analyze the performance overhead of the Pareto-based specification, the QSDPCM mapping on six processors (Fig. 5(b)) is simulated on our MP-SoC platform simulator, using both the standalone code version and the Pareto-based specification. The comparison (processing and BT waiting times) is reported in Fig. 9(b). Less than 0.17% performance overhead can be observed on each processor.

[Fig. 9. Comparison between standalone code versions and Pareto-based specification: (a) size overhead for each task binary; (b) processing and BT waiting times.]

VI. CONCLUSION

In this paper, we characterize the Pareto-based application specification, used as input for our run-time manager. This specification merges all code versions of a single application derived from the design-time exploration. It refers to different parallelizations of the application and to data transfers between SPMs and local memories. It is also illustrated on a video codec multimedia application, simulated on our MP-SoC platform simulator. For this application, less than 5% binary size overhead per merged code version and less than 0.17% performance overhead are observed. Our future work includes: (1) the optimization of the data transfer implementation in the NoC of our MP-SoC platform, to further reduce the binary size overhead; (2) the run-time support integration to allow Pareto point switches at run time, and the analysis of the resulting run-time overhead; (3) an energy model in our MP-SoC platform simulator; (4) tests on other real-life applications.

REFERENCES

[1] P. Cumming, The TI OMAP platform approach to SoC. Kluwer Academic,
[2] W. Wolf, The future of multiprocessor systems-on-chips, in Proceedings of the Design Automation Conference, pp ,
[3] D. Bertozzi, A. Jalabert, M. Srinivasan, R. Tamhankar, S. Stergiou, L. Benini, and G. De Micheli, NoC synthesis flow for customized domain specific multiprocessor systems-on-chip, IEEE Trans. Parallel Distrib. Syst., vol. 16, pp , February
[4] S. Murali and G. De Micheli, Bandwidth-constrained mapping of cores onto NoC architectures, in Proceedings of the Conference on Design, Automation and Test in Europe, Paris, France, February
[5] L. Benini and G. De Micheli, Networks on chips: a new SoC paradigm, IEEE Computer, pp ,

[6] S. Mamagkakis, D. Atienza, C. Poucet, F. Catthoor, D. Soudris, and J. Mendias, Custom design of multi-level dynamic memory management subsystem for embedded systems, in Proceedings of the IEEE Workshop on Signal Processing Systems, October 2004, pp
[7] F. Poletti, P. Marchal, D. Atienza, L. Benini, F. Catthoor, and J. Mendias, An integrated hardware/software approach for run-time scratchpad management, in Proceedings of the Design Automation Conference, pp , June
[8] M. Verma, L. Wehmeyer, and P. Marwedel, Dynamic overlay of scratchpad memory for energy minimization, in Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis, 2004, pp
[9] C. Ykman-Couvreur, E. Brockmeyer, V. Nollet, T. Marescaux, F. Catthoor, and H. Corporaal, Design-time application exploration for MP-SoC customized run-time management, in Proceedings of the International Symposium on System-on-Chip, pp , November
[10] T. Kogel and H. Meyr, Heterogeneous MP-SoC - the solution to energy-efficient signal processing, in Proceedings of the Design Automation Conference, 2004, pp
[11] T. Henriksson, J. Kang, and P. van der Wolf, Implementation of dynamic streaming applications on heterogeneous multi-processor applications, in Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis, Jersey City, NJ, September 2005, pp
[12] S. Yoo, M. Youssef, A. Bouchhima, and A. Jerraya, Multi-processor SoC design methodology using a concept of two-layer hardware-dependent software, in Proceedings of the Conference on Design, Automation and Test in Europe, Paris, February
[13] Y. Cho, S. Yoo, K. Choi, N.-E. Zergainoh, and A. Jerraya, Scheduler implementation in MP SOC design, in Proceedings of the Asia South Pacific Design Automation Conference, Shanghai, China, January
[14] C. Ykman-Couvreur, F. Catthoor, J. Vounckx, A. Folens, and F. Louagie, Energy-aware dynamic task scheduling applied to a real-time multimedia application on an Xscale board, Journal of Low Power Electronics, vol. 1, pp , December
[15] L. Benini, A. Bogliolo, and G. De Micheli, A survey of design techniques for system-level dynamic power management, IEEE Trans. VLSI Syst., vol. 8, pp , June
[16] A. Andrei, M. Schmitz, P. Eles, Z. Peng, and B. Al-Hashimi, Overhead-conscious voltage selection for dynamic and leakage energy reduction of time-constrained systems, in Proceedings of the Conference on Design, Automation and Test in Europe, pp , February
[17] P. Schaumont, B.-C. C. Lai, W. Qin, and I. Verbauwhede, Cooperative multithreading on embedded multiprocessor architectures enables energy-scalable design, in Proceedings of the Design Automation Conference, pp , June
[18] P. Yang and F. Catthoor, Dynamic mapping and ordering tasks of embedded real-time systems on multiprocessor platforms, in Proceedings of the International Workshop on Software and Compilers for Embedded Systems, pp , September
[19] L. Smit, G. Smit, J. Hurink, H. Boersma, D. Paulusma, and P. Wolkotte, Run-time mapping of applications to a heterogeneous reconfigurable tiled system on chip architecture, in Proceedings of the International Symposium on System-on-Chip, November
[20] A. Hansson, K. Goossens, and A. Radulescu, A unified approach to constrained mapping and routing on Network-on-Chip architectures, in Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis, pp , September
[21] J. Hu and R. Marculescu, Communication and task scheduling of application-specific networks-on-chip, IEE Proceedings - Computers and Digital Techniques, vol. 152, pp , September
[22] O. Bringmann, A. Siebenborn, and W. Rosenstiel, Conflict analysis in multiprocess synthesis for optimized system integration, in Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis, pp , September
[23] J. Kang, T. Henriksson, and P. van der Wolf, An interface for the design and implementation of dynamic applications on multi-processor architectures, in Proceedings of the Workshop on Embedded Systems for Real-Time Multimedia, pp , September
[24] M. Rutten et al., Dynamic reconfiguration of streaming graphs on a heterogeneous multiprocessor architecture, in Proceedings of the IS&T/SPIE's Annual Symposium on Electronic Imaging: Multimedia Processing and Applications, pp , January
[25] V. Nollet, T. Marescaux, P. Avasare, J.-Y. Mignolet, and D. Verkest, Centralized run-time resource management in a network-on-chip containing reconfigurable hardware tiles, in Proceedings of the Conference on Design, Automation and Test in Europe, pp , March
[26] P. Strobach, QSDPCM a new technique in scene adaptive coding, in Proceedings of the Eur. Signal Processing Conference, pp , September
[27] J. Dielissen, A. Radulescu, K. Goossens, and E. Rijpkema, Concepts and implementation of the Philips network-on-chip, in Proceedings of the IP-based SOC Design, November
[28] E. Brockmeyer, M. Miranda, H. Corporaal, and F. Catthoor, Layer assignment techniques for low energy in multi-layered memory organisations, in Proceedings of the Conference on Design, Automation and Test in Europe, pp ,
