TODAY, new applications, e.g., multimedia or advanced

Size: px

Start display at page:

Download "TODAY, new applications, e.g., multimedia or advanced"

Josephine Smith
6 years ago
Views:

1 584 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 45, NO. 5, MAY 1998 A Formal Technique for Hardware Interface Design Adel Baganne, Jean-Luc Philippe, and Eric Martin, Member, IEEE Abstract In this paper, we consider the problem of hardware interface design in a codesign approach for real-time digital signal processing (DSP) applications. We refer to the hardware component as ASICS (Applied Specific Integrated Circuits) and the software component as processors. We describe a formal technique to communication synthesis starting from hardware I/O transfer sequences computed by a high level synthesis tool, like GAUT. The original nature of our work is the fact that a communication interface is generated at the same time as the hardware module which leads to better performance and optimization and ensures communication data coherency. Our design strategy starts from the hardware I/O transfer sequences computed by GAUT. It incorporates some interface specification (I/O transfer order, timing constraints) obtained by any cosynthesis tool. The proposed allocation procedure of necessary storage components needed for data communication between hardware software components assigns for each I/O data a time interval at which its transfer could occur. As an illustration, we present a mixed implementation of the GMDF alpha algorithm, an adaptive filter well suited to acoustic echo cancellation, on both ASIC and TS320C40 DSP. Index Terms Communication synthesis, DSP systems, hardware/software codesign, interface generation, real time. I. INTRODUCTION TODAY, new applications, e.g., multimedia or advanced mobile communication systems are emerging. They need implementation of complex real-time digital signal processing (DSP) algorithms under both time, cost, and power constraints in a very short time-to-market, Hardware Software (H/S) Codesign has been widely recognized as a solution to the new requirements of system design [1], [4]. Heterogeneous implementation of applications on both hardware and software components provides significant gain of time and better performance. Traditionally, the main steps in the codesign methodology are system modeling and specification, analysis of constraints and requirements, H/S partitioning and scheduling, hardware and software module generation, H/S interface synthesis, cosimulation, and system integration [1]. This paper deals with the problem of H/S interface synthesis task which is a key issue in codesign. It specifies how the software and hardware components communicate. Interface synthesis can be divided into two major subtasks, namely communication synthesis and H/S synchronization. Communication synthesis consists in determining: 1) the communication protocol used for data transfer between com- Manuscript received October 10, 1997; revised February 25, This paper was recommended by Guest Editors F. Maloberti and W. C. Siu. The authors are with the LESTER Laboratory, Department of Electrical Engineering, Université de Bretagne Sud, Saint Haouen 56100, Lorient, France ( Adel.Baganne@univ-ubs.fr). Publisher Item Identifier S (98) ponents (blocking or nonblocking, master slave, special protocols [7]); 2) timing requirements for each data transfer (transfer delay estimation, date of transfer, and 3) the transfer mode between hardware and software (Memory mapped I/O scheme, looped DMA transfers, direct transfer by point-to-point link, etc.). The H/S synchronization subtask refers to making data transfer happen in a specified time order by task and communication scheduling, or more especially, coordinating the real time presentation of data and maintaining the time-ordered relation among hardware and software components. We present in this paper a design methodology of a hardware interface that supports functional constraints description of I/O data exchanged between hardware and software. We refer to the hardware component as ASICS (Applied Specific Integrated Circuits) and the software component as processors. Our work domain is the real time implementation of DSP and image applications. These applications are characterized by a parameter called the iteration period. All computations are repeated every time. Real time implementation of these applications on mixed H/S architecture should satisfy the following periodic constraints: 1) execution timing constraints on each component and 2) I/O data transfer sequences between both components with their timing specifications. Such constraints can be obtained by any cosynthesis tool [6]. We consider the hardware interface synthesis task as a part of the ASIC generation process by means of a high level synthesis (HLS) tool GAUT [3]. This tool involves several techniques such as behavioral transformations or hardware selection that allows a large exploration of the hardware design process. Our hardware design methodology is based on the following points: Processing unit (PU) of the ASIC is the first functional unit synthesized by GAUT because it undergoes the most important constraints from the real time execution constraints. the hardware interface is synthesized after PU and takes into account both I/O PU constraints and Processors I/O constraints specified by any cosynthesis tool used for system partitioning. Our design method allows the overlapping of computation and I/O communication which leads to better performances in execution time and hardware cost reduction. II. PREVIOUS WORK Few works have addressed the problem of interface synthesis in a global way. Most of them can be classified into two main categories. The first category aims to incorporate interface timing constraints into the scheduling of operations during high level synthesis [7], [9]. In [7], the authors presented a /98$ IEEE

2 BAGANNE et al.: A FORMAL TECHNIQUE FOR HARDWARE INTERFACE DESIGN 585 technique that attempts interface optimization by minimizing the number of blocking communication between two processes and by merging their communication channels. These processes could be rescheduled to satisfy the timing requirements. In [9] authors examined the bus and protocol generation problem after the partitioning step. They present a channel grouping technique based on average rate estimations for each communication channel. The second category addressed the interface synthesis problem between standard components which have incompatible protocols [4], [6], [10]. In [5], the authors described a template-based technique for transducers synthesis directly from timing diagrams. They used timing diagrams and signal Transitions Graphs (STG) for protocols specifications and the hardware interface is synthesized with asynchronous logic. In our case, we propose a new approach for hardware interface design that allows high overlapping between computation and I/O communication. The interface design is carried out after hardware module generation. The main objective of our synthesis is the reduction of both I/O communication overhead and hardware cost. The organization of this paper is as follows. In Section III, we briefly describe the main steps of our system design flow and we state the assumptions we make about our system target architecture and hardware design by Gaut HLS. In Section IV, we give our formulation of the hardware interface synthesis problem. In Section V, a generic architecture of the interface is presented and discussed. In Section VI, we introduce a timing and communication model for the I/O data transferred between the ASIC and the processors. In Section VII, we present our design technique for analysis and synthesis of the interface buffers under I/O timing requirements. In Section VIII, we present some results relative to the mixed implementation of the GMDF alpha algorithm [13], an adaptive filter well suited to acoustic echo cancellation, on an ASIC and a TMSC40 DSP. III. SYSTEM DESIGN OVERVIEW Our system target architecture consists of an ASIC (hardware component) and a general purpose processor (software component). Fig. 1 illustrates the design flow of both hardware and software components. We consider the design of real time application that is described at the behavioral level by some specification language [1] such as SDL, Statecharts, etc. We assume that a cosynthesis tool such as [6], [8] is used to partition the specification into three components: software, hardware, and interface. The software partition is composed of threads ( ) which are sets of computations that execute in deterministic time and other sets of computation that execute in nondeterministic time (such as, loops and synchronization tasks). A control-data driven scheduler is used in the software that allows both hardware and software to schedule threads over time and to exchange I/O data [8]. The hardware partition is specified with VHDL at behavioral level and constitutes the entry point to the high level synthesis tool GAUT. From this description, the timing constraints (computation delays) imposed after partitioning, and a technological driven library (standard cells, FPGA), the hardware architecture will Fig. 1. Hardware module design overview. be synthesized by GAUT HLS. This architecture is based on four functional units (see Fig. 2): the Processing Unit (PU), the Control Unit (CU), the Memory Unit (MU), and the Communication Unit (UCOM) that implements the H/S interface in the hardware side. The generic architecture topology of the PU is based on elementary cells including an operator, registers and interconnecting components. The memory unit implements registers, memory banks and their associated address generators, as well as channel for data transfer. After the PU synthesis step, GAUT produces a set of functional constraints relative to the PU I/O transfer sequences: the production and consumption dates of the I/O data. This sequence thus constitutes the background and the schedule of conditions for the ASIC s interface synthesis procedure. Our design process of the ASIC s interface requires some knowledge (software part) about I/O communication data exchanged with the processor such as I/O busses number, data transfer delays, and partial or total ordering of data transfer. These information could be obtained by some synthesis process such as Bus generation [10] (bus structure synthesis, by trading off the width of the bus and the processes communication over it) or protocol generation (how to define the exact mechanism of data transfer over the bus) [9]. Therefore, the H/S interface specification is assumed to be further annotated with detailed information. The following lists a description style of such information.

3 586 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 45, NO. 5, MAY 1998 How to generate the smallest amount of hardware (datapath and control logic) that still meets the I/O sequence constraints (timing requirement, data transfer order) and allows a high overlapping between computation and I/O communication? The aims are the following [see Fig. 3(c)]: mixing and demixing of data busses of processor toward PU busses: it concerns to ensure all the necessary link between the Processor and the PU busses. synthesis of I/O data storage under IOCC constraints. Fig. 2. Generic architecture of the ASIC. 1) The ASIC-processor interconnection is realized by direct links. Communications between both components are based on the message passing model [8] and are buffered in order to reduce synchronization delays. 2) The communication protocol is slave master where the software component (Processor) works as a master. The hardware and software modules interacts with an interrupt driven mechanism. The software running on processor is capable of communicating with the ASIC and performing an interrupt-driven I/O. 3) The communication channels between threads and hardware partition are mapped to I/O busses such as in [10]. Therefore bus transactions between the processor and the ASIC could be described at the lowest level: I/O data structure (scalar, vector) and their timing constraints are known. In the rest of the paper, we will refer to these constraints as IOCC (Input Output Cosynthesis Constraints). Therefore, these information constitute another entry point to our ASIC s interface design. IV. PROBLEM FORMULATION Let us consider a mixed implementation of a DSP application where the hardware component consists in a FFT implementation. For simplicity, we consider a four points FFT. Fig. 3(a) shows a target architecture composed by an ASIC (hardware component) and a processor (software component). At each iteration, an I/O sequence is transferred over the bus B between both components. Each communication data may be described by different ways: 1) transaction order and 2) timing constraints determined by some communication synthesis process [10], [11]. These constraints can include strict timing specifications (fixed dates of transfers, transfer deadlines) (see Section VI-A) or only just I/O data transaction order. As mentioned previously our architectural synthesis tool GAUT starts ASIC design cycle by the PU and gives the consumption and gives the production and consumption dates of I/O data. and are the fully specified I/O sequences transferred, respectively, over B1 and B2 (internal busses of the PU). The problem of the hardware interface generation can be explained in the following way: V. GENERIC MODEL OF THE HARDWARE INTERFACE The hardware interface model that we have adopted can be seen in Fig. 3(d). It is composed of datapath and an FSM-based controller. The datapath of the interface includes three major components: buffers for storing I/O data, busses, and interconnection components such as multiplexers, demultiplexers, and tristate buffers. The storage buffers are composed of FIFO s, LIFO s, and registers. Busses included in the interface have two types: external and internal. External busses have two categories. The first one is I/O busses which represent the physical links between ASIC and processor. Their number and the bus width are fixed by the designer or determined by some bus generation tasks such as [10]. The second category is the Interface-Processing Unit busses. Their number is determined by the synthesis step of the processing unit. However, it should be greater or equal to the maximum number of simultaneous data transfer between the interface and PU. The FSM based controller implements the communication protocol defined by the cosynthesis tool and generates the adequate control signals for the datapath components. VI. MODELING AND SPECIFICATION OF I/O SEQUENCES In this section, we will consider that processor and ASIC have the same reference of time. In the following sections, we will describe the timing specification of the ASIC-processor I/O data transfer. A. ASIC-Processor I/O Transfer Sequences In our model (see Fig. 4) the processor has busses connected to the ASIC. As mentioned previously, the set ordering of communication data is provided by a cosynthesis tool and therefore known in advance. Let us have an ordered set of all I/O data transferred over busses in one iteration period. Some of the data item may be structured (vectors). Let us define the set as an ordered set of data subsets: where denotes the ordered sets of data subsets (vectors) transferred over the th bus. the single subset transferred over the is defined as a set of ordered scalar th bus

4 BAGANNE et al.: A FORMAL TECHNIQUE FOR HARDWARE INTERFACE DESIGN 587 (a) (b) (c) (d) Fig. 3. (a) Example: hardware FFT implementation. (b) ASIC-processor data transfer sequences. (c) Hardware interface synthesis problem. (d) Generic model of the hardware interface. Fig. 4. Topology model of the hardware interface. Let be the common reference of time and let and be, respectively, the start and the end of transfer of the th data item on the th bus (see Fig. 5). The communication time for transfer is defined as. Note that our a priori knowledge only concerns the mutual time constraints of data, and not the time values themselves, i.e.,. However, our model can be refined and extended to cases where all data ordering and their timing constraints are fully satisfied. Timing constraints are then introduced to define upper and lower bounds between the start times of two data transfers. The following equations express the main ( ) timing constraints that can be specified for each pair of the exchanged data unit: where and where Fig. 5. Timing specifications of I/O processor data. and define, respectively, the minimum and maximum delays between two consecutive subset transfers over the th bus. defines the minimum delay between transfer starts of the data units and. B. Processing Unit I/O Transfer We have mentioned previously that all the PU I/O transfer sequences are fully specified by GAUT HLS. The hardware interface exchanges data with the PU over internal busses (see Fig. 4). Let be the set of PU I/O busses and an ordered I/O sequence transferred over the th bus.

5 588 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 45, NO. 5, MAY 1998 TABLE I ASIC S I/O DATA DESCRIPTION Definition 3: The Data Test Function (DTF) of the module is defined by the Boolean function : where if the I/O data can share the resource with all data of the sequence. Definition 4: The Sequence Storage Function (SSF) of the module is defined by : where represents the time interval at which the transfer of data, assumed to be stored in, can be occurred. Each input data (toward the ASIC) that belongs to is described by the following information: (Label, Receive step, First read step, Last read step, Source, Destination). The label is basically an integer identifier. The receive step [denoted as ] is the control step (c-step) at which is received by the communication unit from the processor. The first read step [denoted as ] is the c-step of the first use of by the PU. Similarly, the last read step [denoted as ] is the c-step of the last use of by the PU. The source and destination are integer values to represent the I/O ports of the UCOM or the PU that produces and uses the data. Each output data (toward the processor) that belongs to is described by the following information: (Label, Production step, Emission step, Source, Destination). The production step [denoted as ] is the c-step at which is produced by the PU. The Emission step is the c-step at which is sent to the processor. The Other information of output data have the same meaning of those of input data. All variable descriptions presented above are resumed in X Table I. C. Definitions Let be the execution time constraint of the hardware module. Let be a storage module, ( may be a Register, FIFO, or LIFO). Let and be, respectively, an input and output data that belong to the PU I/O data sequence. Definition 1: The PU lifetime of I/O data is defined by for input data and for output data. To keep homogeneous notations, we give an equivalent notation to as. Definition 2: The UCOM lifetime of I/O data is defined by for input data and for output data. Note that. VII. I/O DATA STORAGE AND OPTIMIZATION In this section, we are interested in allocation and mapping of I/O data sequences to specific storage modules which are queues (FIFO) stacks (LIFO) and registers. The allocation problem of I/O data may be solved by means of traditional techniques used for resources sharing [2] if all I/O timing are fully specified. However a full I/O timing specification can hardly be obtained because H/S communications synthesis is based on execution-time estimations, and not guaranteed to be cycle-accurate. The allocation procedure we propose here starts from the PU I/O sequence timing (fully specified by GAUT HLS) and incorporates the I/O constraints presented in Section III. Our goal is to assign, for each I/O data, a time interval at which the transfer to/from the processor could occur. This approach gives the designer a fine grain description of data transfer timing and such information could be exploited to refine and improve the granularity of system partitioning. We have developed temporal transformations SSF (, see Definition 4) based on functional properties of each storage module. Each function takes into account conditions that should be fulfilled by each pair of data to be allocated to the same storage component. For example, conditions relative to data storage in the same register must have disjoint lifetimes (the two intervals have empty intersection). All data to be allocated to FIFO or LIFO buffers must have the same type, e.g., all of them are input data or output data and must have disjoint lifetimes. Here are the main steps of our allocation procedure: First, given the PU I/O sequence and its relative constraints (I/O timing requirements and data ordering), we determine which data that can be grouped in the same storage module. Table II shows PU conditions that should be fulfilled by each pair of data to be allocated to the same storage component (Register, FIFO, LIFO). This step produces a set of subsequences that could be stored into the module is the DTF function of the module. Second, for each subsequences, some accessing constraints of the storage component are added. These constraints (denoted as access conditions in Table II) express the relationship between input and output sequence relative to each storage module. For example, the input sequence of Register or Fifo must be the same as the output sequence. Third, we apply the SSF function on each data subsequence. This function takes into account all the

6 BAGANNE et al.: A FORMAL TECHNIQUE FOR HARDWARE INTERFACE DESIGN 589 TABLE II TIMING CONDITIONS RELATIVE TO STORAGE MODULE ALLOCATION Fig. 6. DFG of GMDF algorithm. TABLE III SSF FUNCTIONS: TRANSFORMATIONS AND DERIVED CONSTRAINTS conditions discussed above and produce for each data a time interval in which the data transfer could occur. In Table III, we describe the SSF transformation of each storage module and its respective derived constraints. These derived constraints can be merged with some user constraints and therefore the allocation procedure carries out, by means of some recursive heuristics we have developed, the following tasks: 1) constraints analysis and checking; 2) I/O buffer sizing; 3) determination for each I/O data a time interval. VIII. DESIGN EXAMPLE: ACOUSTIC ECHO CANCELLER GMDF ALPHA In this example, we consider a mixed implementation of the acoustic echo canceler GMDF algorithm on heterogeneous architecture based on an ASIC and a TMS320C40 DSP. Previous implementations of this algorithm on two DSP C40 are described in [13]. A. Acoustic Echo Cancellation The real time implementation of acoustic echo cancellation algorithms is a challenging problem, especially in teleconference application, because in this context, the acoustic impulse response is very long (1024 taps) and the sampling frequency is high (16 khz). The GMDF algorithm (standing for generalized multi-delay adaptive filter with oversampling factor )isa frequency domain block algorithm where a block of input data is processed at one time, producing a block of outputs. The analysis of the learning behavior and performance of GMDF algorithm can be found in [13]. Fig. 6 shows the data flow graph (DFG) of the GMDF. This DFG includes computation tasks (shaded nodes) and some waiting delays, which are equal to a multiple of the sampling period. Note that computation of the convolution (tasks ) and the filter coefficients updating (tasks ) need the storage of complex values corresponding to FFT s of past blocks of reference signal samples. Note also, that set of tasks such as and may be executed in parallel. B. GMDF Partitioning and Scheduling The available time to meet the real time execution of GMDF is DSP cycles (1 cycle = 50 ns). We used a manual partitioning where the FFT is executed on the dedicated hardware component. All the other tasks are executed on the TMS320C40. The ASIC-processor interconnection is realized by one point-to-point link.

7 590 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 45, NO. 5, MAY 1998 Fig. 8. Hardware interface of the ASIC (FFT). Fig. 8 shows the hardware interface buffers obtained by our allocation procedure. They are composed of two FIFO memories (123 and 112 points) and eight registers. These buffers carry out the storage of 1024 I/O data relative to each FFT execution. The proposed design offers a significant gain of I/O buffer sizing (420%). Fig. 7. Scheduling and execution of the GMDF. TABLE IV HARDWARE DESIGN RESULTS: ASIC ARCHITECTURE IX. CONCLUSION We have presented in this paper a method for hardware interface synthesis taking into account two types of design constraints. The first one consists in some interface specifications produced by a cosynthesis tool. The second one is represented by I/O transfer sequence constraints resulting from the ASIC s processing unit design by a high-level synthesis tool GAUT. Our hardware interface design ensures: 1) real time execution of DSP applications and 2) I/O data coherence, by means of formal transformations we have developed, between processor and PU I/O transfer sequences. We described a global modeling of data transactions between the software (Processor) and hardware (ASIC) modules and we presented a generic model of on-chip hardware interface. The necessary storage components of the interface are composed of queues, stacks, and registers. The allocation procedure we described is based on timing transformation relative to each storage component. The proposed synthesis method allows I/O synchronization delay reduction, a high overlapping between computation and I/O communication and leads to good performance in I/O buffer sizing and optimization. REFERENCES The FFT (real, 512 points) execution time imposed to the ASIC design is 4000 cycles (0, 2 ms). Fig. 7 presents the GMDF scheduling on both DSP and ASIC produced by SYNDEX CAD tool [12]. This tool offers a rapid prototyping of DSP applications on TMS320C40 based architectures. The temporal diagram obtained shows the H/S communications (oblique lines), and tasks scheduling for each component. The total number of cycles of algorithm execution is approximately Table IV shows the hardware architecture of the ASIC obtained by GAUT tool. The component library is extracted from the Compass library (16 bits CMOS 1- m standard cells). [1] G. De Micheli and M. Sami, Hardware/Software Codesign. Amsterdam, The Netherlands: Kluwer Academic, [2] D. D. Gajski and N. Dutt, High Level Synthesis. Amsterdam, The Netherlands: Kluwer Academic, [3] E. Martin, O. Sentieys, H. Dubois, and J. L. Philippe, GAUT, An architectural synthesis tool for dedicated signal processors, in Proc. EURO-DAC, 1993, pp [4] Toward a multi-formalism framework for architectural synthesis: The ASAR project members, in Proc. 3rd Int, Workshop on H/S Codesign, Grenoble, France, Sept [5] G. Boriello and R. Katz, Synthesis of interface transducer logic, in Proc. ICCAD, 1987, pp [6] P. H. Chou, R. B Ortega, and G. Boriello, The Chinook hardware/software co-synthesis system, in Proc. ISSS, 1995, pp [7] D. Filo, D. C. Ku, C. N. Coelho, and G. De Micheli, Interface optimization for concurrent systems under timing constraints, IEEE Trans. VLSI Syst., vol. 1, pp , Sept

8 BAGANNE et al.: A FORMAL TECHNIQUE FOR HARDWARE INTERFACE DESIGN 591 [8] R. K. Gupta, C. N. Coelho, and G. De Micheli, Synthesis and simulation of digital systems containing interacting hardware and software components, in Proc. DAC, 1992, pp [9] S. Narayan and D. D Gajski, Protocol generation for communication channels, in Proc. DAC, 1994, pp [10], Synthesis of system level bus interfaces, in Proc. DAC, 1994, pp [11] C. N. Coelho, Jr. et al., Redesigning hardware software systems, in Proc. 3rd Int. Workshop on H/S Codesign, Grenoble, France, Sept [12] Y. Sorel, Massively parallel computing systems with real time constraints: The algorithm architecture adequation, in Proc. Massively Parallel Computing Systems Conf., Italy, 1994, pp [13] A. Baganne, A. Gilloire, E. Martin, P. Martineau, and J. P. Thomas, Implementation methods of an acoustic echo cancellation algorithm GMDF, in Proc. 4th Int. Workshop on Acoustic Echo and Noise Control, Roros, Norway, 1995, pp Adel Baganne was born in Tunisia in He received the Ph.D. degree in signal processing and telecommunications at the University of Rennes, France, in 1997, the Engineer degree in electronics from the National Superior Engineering School in Angers (ESEO), France, in 1993, and the Masters degree in computer architecture and signal processing from the University of Rennes in He is presently with the Université de Bretagne Sud, Lorient, France, where he is an Assistant Professor. His research interests, inside the LESTER Laboratory, include communication synthesis and hardware generation for codesign, computer architecture, VLSI design for fast signal processing, and CAD tools. Jean-Luc Philippe received the Ph.D. degree in signal processing from the University of Rennes in Since 1992, he had been an Assistant Professor at ENSSAT, University of Rennes, conducting a research group in CAD for VLSI design. Currently, he is Professor in computer science at the Universitéde Bretagne Sud, Lorient, France. His research interests include signal and image processing systems, VLSI, and high level synthesis. Eric Martin (M 93) was born in He was Holder of the agregation at the Ecole Normale Supérieure of Cachan, France, where he received the Ph.D. degree. He founded the LESTER Lab (Laboratoire d Electronique des Systèmes Temps Réels), Université de Bretagne Sud, Lorient, France, in 1994, which he Heads currently as Professor. His research interests are VLSI, high level synthesis, and the conception of dedicated architecture to signal, image processing, and telecommunication.

Scheduling with Bus Access Optimization for Distributed Embedded Systems

472 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 8, NO. 5, OCTOBER 2000 Scheduling with Bus Access Optimization for Distributed Embedded Systems Petru Eles, Member, IEEE, Alex