A Mathematical Formulation of the Loop Pipelining Problem


Jordi Cortadella, Rosa M. Badia and Fermín Sánchez
Department of Computer Architecture
Universitat Politècnica de Catalunya
Campus Nord, Mòdul D, Barcelona, Spain

Abstract

A mathematical model for the loop pipelining problem is presented. The model considers several parameters for optimization and supports any combination of resource and timing constraints. The unrolling degree of the loop is one of the variables explored by the model. By using Farey's series, an optimal exploration of the unrolling degree is performed and optimal solutions not considered by other methods are obtained. Finding an optimal schedule that minimizes resource requirements (including registers) is solved by an ILP model. A novel paradigm called branch and prune is proposed to efficiently converge towards the optimal schedule and prune the search tree for integer solutions, thus drastically reducing the running time. This is the first formulation that combines the unrolling degree of the loop with timing and resource constraints in a mathematical model that guarantees optimal solutions.

1 Introduction

It is well known that loops monopolize most of the execution time of programs. In many applications a few loops, if not only one, determine the throughput achievable by the implementation of a behavioral description. For example, DSP filters often consist of an infinite loop that executes repeatedly for every sample of the input stream. In architectural synthesis, the problem of optimizing loop execution under timing and area constraints is crucial to obtain high-quality architectures.

The techniques proposed for this problem attempt to overlap the execution of different loop iterations to reduce the cycle count (initiation interval, or II) per iteration. Different methods have been proposed with such a goal: loop folding [12, 23], functional pipelining [17], loop winding [11], rotation scheduling [4], and percolation-based synthesis [29], among others. The area of fixed-rate DSP has also drawn the attention of other authors to propose techniques for loop pipelining with timing constraints [3, 6, 34]. Similar (if not identical) problems are encountered in the area of optimizing compilers for parallel architectures (VLIW, superscalar and superpipelined). Under the general name of software pipelining [21], several techniques have been devised: modulo scheduling [31], URPR [36], Lam's algorithm [22] and perfect pipelining [1].

Loops are usually represented by means of a data dependence graph (DG). Figure 1 shows an example. Vertices represent computational nodes (henceforth called instructions). Unlabeled edges represent intra-loop dependences (ILDs), e.g. B_i (reads X[i]) depends on A_i (produces X[i]). Labeled edges represent loop-carried dependences (LCDs), e.g. B_{i+1} (reads Y[i+1]) depends on B_i (produces Y[i+1]). Labels indicate the number of iterations traversed by the dependence. Thus, ILDs can also be represented as 0-labeled edges.

    for i = 0 to I-1 do
        A_i:  X[i]   = R[i] + S[i];
        B_i:  Y[i+1] = Y[i] + 2*X[i];
        C_i:  Z[i]   = 4*Y[i+1] + T[i];
    endfor

Figure 1: Loop dependence graph

If no overlap between iterations were allowed to execute the loop, the total execution time would be 3I cycles (assuming 1-cycle instructions). Within each iteration, the execution of A_i, B_i and C_i must be sequential due to the data dependences. An overlapped execution would take I+2 cycles to complete, as shown in Figure 2. The problem of loop pipelining is basically reduced to finding a steady-state pattern (a folded loop body) that can execute at the maximum rate allowed by the dependences [32]. In the new steady state, instructions from different iterations (folds) are executed (A_{i+f} denotes that the execution of A_i belongs to fold f).

Under resource constraints, unrolling the loop before finding a steady state is crucial to obtain optimal solutions. Assume the schedule depicted in Figure 2(b). If only two adders are available, the steady state must be executed in at least two cycles (II = 2), as shown in Figure 3(a). However, if the loop is unrolled twice (Figure 3(b)), a schedule of the two iterations in three cycles can be found (II = 3/2).

Figure 2: (a) Overlapped loop execution. (b) Prologue, steady state (i = 0, ..., I-3) and epilogue.

Figure 3: Steady state with resource constraints: (a) without unrolling (II = 2), and (b) unrolling the loop twice (II = 3/2).

Another important aspect to be considered is the span, defined as the number of folds required to obtain the steady state. Figure 4(b) shows a time-optimal steady state with SPAN = 4. An approach to finding a resource-constrained steady state is to reschedule the time-optimal one according to the number of available resources (see Figure 4(c) for 2 resources), thus maintaining the same span. However, by considering resource constraints from the beginning, a steady state with a lower span can be obtained (Figure 4(d)). As further commented in Section 5.5, smaller spans result in shorter variable lifetimes, which in general reduces the register pressure of the schedule [15].

The loop optimization problem addressed in this paper comprises a large variety of formulations with different timing and resource constraints. The two extreme cases are described next:

Resource-constrained loop pipelining (RCLP). Given a set of resource constraints (functional units and registers), the objective is to find a pipelined loop schedule that minimizes the execution time.

Time-constrained loop pipelining (TCLP). Given an upper bound on the execution time, the objective is to find a pipelined loop schedule that minimizes the cost of the resources required to execute the loop.

Between RCLP and TCLP there is a wide range of problems that can be formulated, e.g. finding a time-constrained schedule with constraints on a subset of resources. In the area of optimizing compilers for parallel architectures, only the RCLP version of the problem is posed, since the resources are fixed by the target architecture.

Figure 4: (a) Dependence graph and steady states with: (b) II = 1, SPAN = 4; (c) II = 2, SPAN = 4; and (d) II = 2, SPAN = 2.

This paper presents UNRET (unrolling and retiming), a formal approach to solve the loop pipelining problem in which the following mathematical tools are used:

Farey's series [35], to calculate the unrolling degrees of the loop that yield optimal solutions.

Integer Linear Programming (ILP), to simultaneously solve the problems of scheduling and loop pipelining.

Since the decision version of loop pipelining with resource constraints is NP-complete (and the minimization version NP-hard) [5], several heuristics have been proposed to solve it in moderate computation time. Several authors have also used linear programming to obtain optimal or quasi-optimal solutions for the problems of scheduling and allocation in architectural synthesis [14, 18, 16, 10, 33]. A preliminary approach for loop pipelining was presented in [9]. The closest approaches related to the work presented here have been proposed in [13, 8, 7], in which the primary goal is the minimization of the register requirements of the final schedule.

The main contributions of UNRET in relation to existing formal methods are the following:

UNRET performs an exhaustive analysis of the unrolling degrees of the loop that can yield optimal solutions for the available resources. Therefore, more than one instance of each operation can appear in the final schedule. Unlike other methods that perform loop unrolling [19, 27, 33], this paper presents a new approach that guarantees an optimal unrolling degree.

Similarly to [13, 8, 7], the number of folds for the steady state is automatically obtained by solving an ILP model for loop pipelining. The model has been extended to the minimization of resources under timing constraints. The maximum number of live variables at any cycle, MAXLIVE, is also minimized. MAXLIVE is a lower bound that closely approximates the number of registers required to execute the loop [30].

A new approach, which we have called branch and prune, is proposed to solve the ILP model. The heuristics devised to explore the space of solutions allow a rapid convergence to the optimal solution. ILP models with more than 1000 variables and 700 constraints have been solved in reasonable running times.

The paper is organized as follows. Section 2 presents some preliminary definitions required for the mathematical model. Section 3 proposes a strategy for loop unrolling to find time-optimal schedules for the RCLP problem. A similar strategy for the TCLP problem is proposed in Section 4. The ILP model to find an optimal steady state is presented in Section 5. The branch-and-prune strategy to efficiently solve the ILP model is described in Section 6. Experimental results and conclusions are presented in Sections 7 and 8.

2 Basic definitions

For the sake of simplicity, we will first assume that all instructions can be executed in any of the resources of the architecture in one cycle. Henceforth, we will call n the number of instructions of the loop and r the number of resources available to execute instructions. Extensions to multiple-cycle, pipelined functional units will be addressed in Section 3.3.

2.1 Representation of a loop

A loop is represented by a labeled dependence graph DG(V, E), in which vertices and edges represent instructions and data dependences respectively. Labels of the DG are defined by two mappings, σ (fold) and δ (dependence distance), in the following way:

σ(u), defined on vertices, denotes the fold to which u belongs in the steady state (σ(u) ≥ 0). σ(u) = i is denoted by u_i in the DG.

δ(u, v), defined on edges, denotes the dependence distance (number of iterations traversed by the dependence) between instructions u and v. δ(u, v) = 0 corresponds to an ILD, and δ(u, v) > 0 to an LCD. An ILD between u and v is represented by u_i →⁰ v_i, or simply u_i → v_i. Similarly, an LCD between u and v with distance d is represented as u_i →ᵈ v_i.

Initially, a loop is represented by a DG with only one fold, i.e. ∀u ∈ V: σ(u) = 0. After finding a steady state, each instruction will be labeled with a fold representing the relative execution skew (in iterations) with regard to the other instructions of the loop.

Equivalent labeling functions of the original DG can be obtained by simple transformations. The way σ and δ can be transformed is illustrated in Figure 5. Assume an ILD A_i → B_i exists in the initial DG. A schedule for the loop is depicted in Figure 5(a), which corresponds to the sequential execution of the original loop body that has only one fold (i). However, the ILD can be transformed into an LCD while preserving the semantics of the loop: the dependence A_{i+1} →¹ B_i (or, in general, A_{i+d} →ᵈ B_i) is equivalent to the previous one. This transformation is analogous to the retiming technique proposed to minimize the clock period in synchronous systems [24]. Figure 5(b) shows the transformed DG with two folds (i and i+1) and one LCD. A new schedule can be found for the loop body by considering that A_{i+1} and B_i no longer have an ILD between them. LCDs are semantically preserved by the sequential execution of different iterations of the same loop body. Therefore, ILDs can be transformed into LCDs by changing the fold assignment and, thus, shorter schedules for the loop body can be found.

A formal definition of equivalent labeling functions can now be presented.

Definition 1 [Equivalent labeling functions]: Let (σ, δ) and (σ′, δ′) be two different labeling functions for the dependence graph DG = (V, E). The functions are equivalent if, ∀(u, v) ∈ E:

    σ(v) − σ(u) + δ(u, v) = σ′(v) − σ′(u) + δ′(u, v)    (1)
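A small illustration (ours, not taken from the paper) of Definition 1: two labelings are equivalent when they assign the same value of σ(v) − σ(u) + δ(u, v) to every edge. The sketch below checks exactly that for the retiming example of Figure 5.

```python
def equivalent(labeling_a, labeling_b, edges):
    """labeling = (sigma, delta): sigma maps vertices to folds,
    delta maps edges (u, v) to dependence distances."""
    (sa, da), (sb, db) = labeling_a, labeling_b
    return all(sa[v] - sa[u] + da[(u, v)] == sb[v] - sb[u] + db[(u, v)]
               for (u, v) in edges)

# Figure 5: the ILD A -> B becomes the LCD A_{i+1} ->(1) B_i after retiming
edges = [("A", "B")]
original = ({"A": 0, "B": 0}, {("A", "B"): 0})
retimed  = ({"A": 1, "B": 0}, {("A", "B"): 1})
print(equivalent(original, retimed, edges))   # True
```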

Figure 5: (a) Schedule of a loop with one ILD. (b) Schedule of the same loop with an equivalent labeling function and one LCD.

It is not difficult to prove that equivalent labeling functions preserve the semantics of a loop [24].

2.2 Initiation Interval

As proposed by other authors [22], UNRET first calculates a lower bound on the II of the loop, the minimum initiation interval (MII):

    MII = max(ResMII, RecMII)

where ResMII is determined by the number of available resources and RecMII by the cycles (recurrences) formed by the dependences of the loop. Clearly,

    ResMII = n / r

since each functional unit can execute only one instruction per cycle.

A recurrence R is a set of edges that form a cycle. Let us define δ_R as

    δ_R = Σ_{(u,v) ∈ R} δ(u, v)    (2)

For any operation u such that ∃(u, v) ∈ R, u_i must be scheduled at least |R| cycles before u_{i+δ_R}. Therefore, R imposes a minimum initiation interval, RecMII_R, on the execution of the loop [32, 5, 27]:

    RecMII_R = |R| / δ_R

A loop can have several recurrences. Therefore:

    RecMII = max_R RecMII_R

In the worst case, the number of cycles of a DG grows exponentially with |E|. However, RecMII can be found in polynomial time by using Karp's algorithm to calculate the weight of the minimum mean-weight cycle of a graph [20].

In the example of Figure 6(a), there are three recurrences with the same value of RecMII. Let us take the recurrence A_0 → B_0 → E_0 → A_0, with δ_R = 3 and |R| = 3. This indicates that A_3 must be executed at least 3 cycles after A_0, thus resulting in RecMII_R = 1. The other two recurrences are isomorphic to this one.
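For 1-cycle instructions, RecMII is the reciprocal of the minimum mean-weight cycle of the DG when every edge is weighted with its distance δ. The sketch below (ours, not the authors' code) computes it with Karp's algorithm; the example encodes the loop of Figure 1.

```python
from math import inf

def min_mean_cycle(n, edges):
    """Karp's algorithm: minimum mean weight of a cycle in a digraph.
    n: number of vertices 0..n-1; edges: list of (u, v, weight)."""
    # d[k][v] = minimum weight of a walk with exactly k edges ending at v
    # (d[0][v] = 0 for every v plays the role of a virtual source)
    d = [[inf] * n for _ in range(n + 1)]
    d[0] = [0.0] * n
    for k in range(1, n + 1):
        for u, v, w in edges:
            if d[k - 1][u] + w < d[k][v]:
                d[k][v] = d[k - 1][u] + w
    best = inf
    for v in range(n):
        if d[n][v] == inf:
            continue
        best = min(best, max((d[n][v] - d[k][v]) / (n - k)
                             for k in range(n) if d[k][v] < inf))
    return best  # inf if the graph has no cycle

def rec_mii(n, edges):
    """RecMII = max over recurrences R of |R| / delta_R (1-cycle instructions),
    i.e. the reciprocal of the minimum mean cycle over the distances delta."""
    mu = min_mean_cycle(n, edges)
    if mu == inf:
        return 0            # no recurrence constrains the loop
    return 1.0 / mu         # assumes every recurrence has delta_R > 0

# Loop of Figure 1 (A=0, B=1, C=2): ILDs A->B, B->C and the LCD B->B with distance 1
print(rec_mii(3, [(0, 1, 0), (1, 2, 0), (1, 1, 1)]))   # 1.0
```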

Figure 6: (a) DG example. (b) Different (II_K, K) pairs for 4 resources.

3 Loop unrolling for RCLP

The length of a schedule is always an integer number of cycles, whereas the MII of a loop may be a rational number. In the example of Figure 3 we have MII = 3/2 for r = 2. If only one loop instance is taken, at least ⌈3/2⌉ = 2 cycles are needed for any schedule. However, a schedule with optimal II can be found by unrolling the loop twice. In general, a necessary (but not sufficient) condition to find a schedule with II = MII = a/b (a reduced fraction) is to unroll the loop K = b times. Systematic approaches for loop unrolling of data-flow programs are presented in [19, 27, 33]. However, they do not guarantee optimal solutions, as will be demonstrated with the example of Figure 6.

Let II_K be the length of a steady state comprising K instances of the loop body; the resulting initiation interval is II = II_K / K. The goal of UNRET is to find a schedule that minimizes II. This is done by exploring pairs (II_K, K) in increasing order of II, starting from MII. In order to bound the search space for (II_K, K), a maximum value for II_K is defined: the maximum length of the steady state (II_max).

Figure 6(a) depicts the DG of a 5-instruction loop. If 4 resources are used for execution, then RecMII = 1 and MII = ResMII = 5/4. The diagram in Figure 6(b) represents the pairs (II_K, K) that can be explored to find a feasible schedule for the loop. These pairs correspond to the points with integer values for II_K and K enclosed in the triangle limited by the lines K = 0, II_K = II_max and II_K / K = MII. Distinct points with the same II lie on the same line. Among all points lying on the same line, those with smaller values of II_K (or K) are preferred, since they produce shorter steady states.

In the example, the first point to be explored to minimize II is C = (5, 4), meaning that 4 iterations must be executed in 5 cycles, which results in II = MII = 1.25. In this case, no feasible schedule with such characteristics exists. In [31], II_K is incremented by 1 when no schedule is found and, thus, the point B = (6, 4) is explored next. This results in a suboptimal solution. However, a time-optimal solution (for II_max = 8) is found if the point A = (4, 3) is explored after C = (5, 4). But, is there any efficient strategy to explore all points in increasing order of II?

Table 1: Pairs (II_K, K) explored for the example of Figure 6.

3.1 Farey's series

For a fixed integer D > 0, the ascending sequence of all nonnegative reduced fractions whose denominators do not exceed D is the Farey's series of order D (F_D). For example, F_5 is the series of fractions:

    F_5 = 0/1, 1/5, 1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4, 4/5, 1/1, ...

Let x_i/y_i be the i-th element of the series. F_D can be generated by the following recurrence [35]. The first two elements are

    x_0/y_0 = 0/1,    x_1/y_1 = 1/D

and the generic term x_i/y_i is calculated (in increasing order) as

    x_i = ⌊(y_{i−2} + D) / y_{i−1}⌋ · x_{i−1} − x_{i−2}
    y_i = ⌊(y_{i−2} + D) / y_{i−1}⌋ · y_{i−1} − y_{i−2}

The same recurrence generates the series in decreasing order when it is started from two consecutive elements at the top of the explored range. For a given integer k, the following consecutive fractions always belong to F_D:

    (kD − 1)/D,   k/1,   (kD + 1)/D

All points (II_K, K) correspond to fractions K/II_K (or fractions equivalent to them) of the series F_{II_max}. An efficient strategy to generate the points in increasing order of II (i.e. decreasing order of K/II_K) is the following:

Let m = ⌈1/MII⌉, x_0/y_0 = m/1 and x_1/y_1 = (m·II_max − 1)/II_max.

Generate fractions in decreasing order until a fraction x_p/y_p ≤ 1/MII is found.

Explore schedules for the fractions (and their equivalent fractions), generated in decreasing order starting from x_p/y_p.

For a reduced fraction x_i/y_i, all equivalent fractions k·x_i / k·y_i (with k·y_i ≤ II_max) are generated in increasing order of the denominator (k = 1, 2, ...). Table 1 shows the explored points for the example of Figure 6.
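A compact sketch (ours, not the authors' code) of this enumeration: it generates the Farey fractions K/II_K of order II_max in decreasing order with the recurrence above, keeps those not exceeding 1/MII, and expands their equivalent fractions, which yields the (II_K, K) pairs in increasing order of II.

```python
from fractions import Fraction
from math import ceil

def farey_pairs(mii, ii_max):
    """Enumerate (II_K, K) pairs in increasing order of II = II_K / K,
    starting from MII, with II_K bounded by II_max."""
    d = ii_max
    m = ceil(1 / mii)
    x2, y2 = m, 1                       # x_{i-2} / y_{i-2}
    x1, y1 = m * d - 1, d               # x_{i-1} / y_{i-1}, consecutive in F_D
    pairs = []
    while x1 > 0:
        if Fraction(x1, y1) <= 1 / mii:             # fraction K/II_K, i.e. II >= MII
            for k in range(1, d // y1 + 1):         # equivalent fractions k*x1 / k*y1
                pairs.append((k * y1, k * x1))      # (II_K, K)
        q = (y2 + d) // y1                          # next Farey term, decreasing order
        x2, y2, x1, y1 = x1, y1, q * x1 - x2, q * y1 - y2
    return pairs

# Example of Figure 6: MII = 5/4, II_max = 8
print(farey_pairs(Fraction(5, 4), 8)[:6])
# [(5, 4), (4, 3), (8, 6), (7, 5), (3, 2), (6, 4)]
```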

Figure 7: A DG unrolled 3 times.

3.2 Loop unrolling

Every pair (II_K, K) obtained from Farey's series denotes a different unrolling degree for the loop and a target initiation interval. Before trying to find a schedule with such characteristics, a new DG with K instances of the original loop body must be obtained. Unrolling a DG K times generates another DG in which each vertex v is instantiated K times (v^0, v^1, ..., v^(K−1)). Besides, data dependence distances must be changed according to the unrolling degree. A dependence u →ᵈ v in the original DG is represented in the unrolled DG by the K dependences from u^i to v^((i+d) mod K) with distance ⌊(i+d)/K⌋, for i = 0, ..., K−1 [27] (see the example in Figure 7).

3.3 Extension to multiple types of resources

In the most general case the architecture has several types of resources or functional units (FUs). Each FU is able to execute more than one type of operation. We also assume that allocation has already been done and, therefore, each instruction can only be executed in one type of FU. Let us first present some preliminary definitions:

M = {1, ..., m} is the set of FU types of the architecture.

F = (F_1, F_2, ..., F_m) is the resource vector of the architecture, where F_t is the number of FUs of type t.

ALLOC(t) is the set of instructions allocated to the FU type t.

FU(u) is the type of FU allocated for instruction u.

OP(u) is the operation executed by instruction u.

T(u) is the delay (cycle count) required to execute OP(u) in FU(u).

L(u) is the latency of FU(u) to execute OP(u) (minimum cycle count between successive inputs to the FU). For pipelined FUs, L(u) < T(u); otherwise L(u) = T(u).

Given a heterogeneous set of resources, the calculation of ResMII is done as follows:

    ResMII = max_{t ∈ M} ResMII_t,    with    ResMII_t = ( Σ_{u ∈ ALLOC(t)} L(u) ) / F_t

The numerator (Σ L(u)) corresponds to the number of cycles in which the FUs of type t are busy. We will say that t is a critical FU type if ResMII = ResMII_t.
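The two ingredients just described are easy to state as code. The sketch below (our illustration, with made-up allocation data, not the authors' implementation) unfolds a DG as in Section 3.2 and evaluates ResMII for a heterogeneous resource set.

```python
from fractions import Fraction

def unroll(edges, K):
    """Unfold a DG K times: edge (u, v, d) becomes K edges
    (u, i) -> (v, (i+d) mod K) with distance floor((i+d)/K)."""
    return [((u, i), (v, (i + d) % K), (i + d) // K)
            for (u, v, d) in edges for i in range(K)]

def res_mii(alloc, latency, fu_count):
    """ResMII = max over FU types t of (sum of L(u) over ALLOC(t)) / F_t."""
    return max(Fraction(sum(latency[u] for u in alloc[t]), fu_count[t])
               for t in alloc)

# Loop of Figure 1 (A, B, C) unrolled twice, as in Figure 3(b)
print(unroll([("A", "B", 0), ("B", "C", 0), ("B", "B", 1)], 2))

# Hypothetical allocation: A and B on one adder, C on one multiplier
print(res_mii({"add": ["A", "B"], "mul": ["C"]},
              {"A": 1, "B": 1, "C": 1}, {"add": 1, "mul": 1}))   # ResMII = 2
```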

To calculate RecMII, the delay associated with each recurrence R of the loop is now obtained as follows [32, 5]:

    RecMII_R = ( Σ_{u ∈ R} T(u) ) / δ_R

(Although formally incorrect, we use u ∈ R to refer to the vertices of the recurrence.) The generation of the pairs (II_K, K) can now be done similarly to the case of a homogeneous set of resources, by first calculating MII and then using Farey's series to obtain II_K and K.

3.4 UNRET algorithm for RCLP

The strategy followed by UNRET to solve the RCLP problem is the following:

1. Calculate MII.

2. Find the first unrolling degree K and expected initiation interval II_K that minimize II, such that II ≥ MII.

3. Unroll the loop K times.

4. Find a steady state with length II_K (ILP approach explained in Section 5).

5. If no schedule with II_K cycles is found, generate new values for K and II_K that minimize II (II is explored in increasing order by using Farey's series) and go to step 3.

4 Loop unrolling for TCLP

The problem of TCLP starts from a DG and a target initiation interval (II_T). Unlike RCLP, the number of some (or all) resources of the architecture is not constrained. Therefore, ResMII must be calculated by considering only those resources with constraints. If II_T < MII, no feasible schedule for that DG exists. Otherwise, different unrolling degrees for the loop must be explored, similarly to what is done for RCLP, in such a way that

    MII ≤ II_K / K ≤ II_T

4.1 UNRET algorithm for TCLP

The strategy followed by UNRET to solve the TCLP problem is the following:

1. Calculate MII. If II_T < MII, exit (no feasible schedule exists).

2. Find the first unrolling degree K and expected initiation interval II_K that maximize II, such that II ≤ II_T.

3. Unroll the loop K times.

4. Find a steady state with length II_K that minimizes the cost of the required resources (ILP approach explained in Section 5).

5. Optimize the initiation interval of the schedule. This is done by successively generating new values for K and II_K that minimize II (II is explored in decreasing order by using Farey's series). If II < MII then exit (a time-optimal schedule has been found). Otherwise, for each pair (II_K, K), the loop is unrolled K times and the algorithm attempts to find a schedule in II_K cycles by using the set of resources found at step 4. In this case, Farey's series is explored starting from the fraction closest to 1/II_T, so that II decreases from II_T.

5 Loop pipelining: An ILP approach

Both the RCLP and the TCLP problems can be solved by the same mathematical model. As shown in Figure 5, different labeling functions (σ, δ) for the same DG can result in different steady states. The problem of loop pipelining can be reduced to two interrelated subproblems that can be simultaneously solved by using an ILP model:

1. Folding the DG, i.e. finding the functions σ and δ.

2. Finding a schedule for the folded DG subject to the data dependences and resource constraints.

5.1 Preliminaries

Initially, a DG obtained by unrolling the original DG K times is given. Moreover, an objective initiation interval (II_K) fixes the number of cycles of the steady state. Hereafter, and for the sake of brevity, we will write II instead of II_K. C = {0, ..., II − 1} will denote the set of cycles of the steady state. All variables used in the model are nonnegative integers. The least nonnegative residue system (0, 1, ..., k−1) is used when modulo-k (mod k) operations are performed on constants (e.g. −8 mod 5 = 2).

5.2 Loop folding constraints

The following variables are defined for the labeling functions of the DG:

    σ_u,   ∀u ∈ V    (fundamental variables)
    δ_{u,v},   ∀(u, v) ∈ E    (auxiliary variables)

Folding the loop means finding an equivalent labeling function for the DG. According to equation (1), the auxiliary variables δ_{u,v} are defined as follows:

    δ_{u,v} = σ_u − σ_v + k_{u,v},    ∀(u, v) ∈ E    (3)

with k_{u,v} = σ₀(v) − σ₀(u) + δ₀(u, v), where σ₀(u), σ₀(v) and δ₀(u, v) are constants denoting the initial labeling of the DG (usually σ₀(u) = σ₀(v) = 0).

Figure 8: Time frames for scheduling v_i under different loop foldings (T(u) = 2 and II = 3).

5.3 Scheduling and data dependence constraints

The following variables are first defined:

    s_{u,i} = 1 if u starts executing at cycle i, and 0 otherwise,    ∀u ∈ V, i ∈ C

The following constraint guarantees that an instruction is scheduled at one, and only one, cycle of the steady state:

    Σ_{i ∈ C} s_{u,i} = 1,    ∀u ∈ V    (4)

For simplicity in the definition of the data dependence constraints, the auxiliary variable c_u is used to denote the cycle at which u is scheduled. Hence,

    c_u = Σ_{i ∈ C} i · s_{u,i},    ∀u ∈ V    (5)

The constraint that guarantees data dependences to be honored in the schedule is the following [5, 12, 18, 23]:

    c_v ≥ c_u + T(u) − II · δ_{u,v},    ∀(u, v) ∈ E    (6)

Figure 8 illustrates how δ_{u,v} influences the schedule of v. If δ_{u,v} = 0 (ILD), then v must be scheduled after the completion of u (cycle c_u + T(u)), as shown in Figure 8(a). By increasing the fold from u_i to u_{i+m} (Figures 8(b) and 8(c)), u_i is implicitly rescheduled m·II cycles backwards, thus increasing the time frame to schedule v_i by the same amount.
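The constraints above can be written down almost literally with an off-the-shelf modelling layer. The sketch below is our illustration using the PuLP library (the authors rely on lp_solve and the dedicated solver of Section 6, not on this code); it builds constraints (3)-(6) together with the resource bound that Section 5.4 formalizes next, for a made-up toy instance.

```python
import pulp

def build_model(V, E, T, L, fu, F, II):
    """V: instructions; E: dict (u, v) -> initial distance; T: delays; L: latencies;
    fu: instruction -> FU type; F: FU type -> number of units; II: steady-state length."""
    C = range(II)
    prob = pulp.LpProblem("loop_pipelining", pulp.LpMinimize)
    s = pulp.LpVariable.dicts("s", (V, C), cat=pulp.LpBinary)              # s_{u,i}
    sigma = pulp.LpVariable.dicts("sigma", V, lowBound=0, cat=pulp.LpInteger)
    delta = pulp.LpVariable.dicts("delta", list(E), lowBound=0, cat=pulp.LpInteger)
    prob += pulp.lpSum(sigma[u] for u in V)        # any objective; here: small folds
    for u in V:                                    # (4): schedule each u exactly once
        prob += pulp.lpSum(s[u][i] for i in C) == 1
    c = {u: pulp.lpSum(i * s[u][i] for i in C) for u in V}      # (5): start cycle
    for (u, v), d0 in E.items():
        prob += delta[(u, v)] == sigma[u] - sigma[v] + d0       # (3): loop folding
        prob += c[v] >= c[u] + T[u] - II * delta[(u, v)]        # (6): dependences
    for t, Ft in F.items():                                     # (7): resource usage
        for i in C:
            prob += pulp.lpSum(s[u][j % II]
                               for u in V if fu[u] == t
                               for j in range(i - L[u] + 1, i + 1)) <= Ft
    return prob

# Toy instance: the loop of Figure 1 on one adder and one multiplier, II = 2
V = ["A", "B", "C"]
E = {("A", "B"): 0, ("B", "C"): 0, ("B", "B"): 1}
T = {"A": 1, "B": 1, "C": 2}
L = {"A": 1, "B": 1, "C": 1}          # C assumed to run on a pipelined 2-cycle multiplier
prob = build_model(V, E, T, L, {"A": "add", "B": "add", "C": "mul"},
                   {"add": 1, "mul": 1}, II=2)
print(pulp.LpStatus[prob.solve(pulp.PULP_CBC_CMD(msg=False))])   # Optimal if feasible
```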

5.4 Resource constraints

The following constraints guarantee that no more than F_t functional units are used at each cycle:

    Σ_{u ∈ ALLOC(t)}  Σ_{j = i−L(u)+1}^{i}  s_{u, j mod II}  ≤  F_t,    ∀t ∈ M, i ∈ C    (7)

An instruction u uses an FU during the cycles c_u, ..., c_u + L(u) − 1 (mod II). In case the latency of the operation is longer than II, the execution of several instances of the same operation may overlap and, therefore, more than one FU may be required. Assume, for example, II = 3 and L(u) = 5, u being the only instruction that uses FUs of type t. The generated constraints for F_t are the following:

    s_{u,−4 mod 3} + s_{u,−3 mod 3} + s_{u,−2 mod 3} + s_{u,−1 mod 3} + s_{u,0 mod 3} ≤ F_t    (for cycle 0)
    s_{u,−3 mod 3} + s_{u,−2 mod 3} + s_{u,−1 mod 3} + s_{u,0 mod 3} + s_{u,1 mod 3} ≤ F_t    (for cycle 1)
    s_{u,−2 mod 3} + s_{u,−1 mod 3} + s_{u,0 mod 3} + s_{u,1 mod 3} + s_{u,2 mod 3} ≤ F_t    (for cycle 2)

which, after the modulo calculations, result in:

    2·s_{u,0} + s_{u,1} + 2·s_{u,2} ≤ F_t    (for cycle 0)
    2·s_{u,0} + 2·s_{u,1} + s_{u,2} ≤ F_t    (for cycle 1)
    s_{u,0} + 2·s_{u,1} + 2·s_{u,2} ≤ F_t    (for cycle 2)

5.5 Register requirements

The maximum number of variables whose lifetimes overlap at any cycle, MAXLIVE [30], is the minimum number of registers required for a schedule. Each edge u → v of the DG determines the lifetime of the variable produced by u with regard to v. The lifetime spreads from the completion of instruction u (cycle c_u + T(u)) to the cycle in which the FU executing v no longer requires the input data (cycle c_v + L(v) − 1). Other execution models can also be considered for registers: in [8], variable lifetimes spread from the initiation of u to the initiation of v, a model more appropriate for parallel architectures.

As depicted in the example of Figure 9, an edge with δ_{u,v} = k spreads the variable lifetime across k iterations. For any cycle c in the steady state, the edge requires k − 1, k, or k + 1 registers, depending on the scheduling of instructions u and v. Three cases can be distinguished for T(u) = L(v) = 1 (see the examples in Figure 9):

a) u →ᵏ v crosses cycle c forward: k + 1 registers are required (example: edge u_{i+2} →² v_i at cycle 2).

b) u →ᵏ v does not cross cycle c: k registers are required (example: edge u_{i+2} →² v_i at cycles 0 and 1, or edge r_{i+1} →¹ s_i at cycle 0).

c) u →ᵏ v crosses cycle c backwards: k − 1 registers are required (example: edge r_{i+1} →¹ s_i at cycles 1 and 2).

Figure 9: Register requirements for folded loops: (a) folded DG, (b) schedule and cycles crossed by the edges, (c) variable lifetimes and required registers.

The following auxiliary variables are defined to specify whether an operation reads or writes a result before a given cycle:

    READS_BEFORE_{u,c}  = Σ_{i=0}^{c−1} s_{u, (i−L(u)+1) mod II}     (u reads before cycle c)
    WRITES_BEFORE_{u,c} = Σ_{i=0}^{c−1} s_{u, (i−T(u)+1) mod II}     (u writes before cycle c)

The auxiliary variable r_{u,v,c} defines the number of registers required at cycle c to store a result produced by operation u and consumed by operation v:

    r_{u,v,c} = δ_{u,v} − [ ⌊(T(u)−1)/II⌋ + WRITES_BEFORE_{u, (T(u)−1) mod II} ]
                        + [ ⌊(L(v)−1)/II⌋ + READS_BEFORE_{v, (L(v)−1) mod II} ]
                        + WRITES_BEFORE_{u,c} − READS_BEFORE_{v,c}    (8)

The expressions in brackets determine a correction on δ_{u,v} produced by the execution time of u and the latency of v: no register is required to store the value while u executes, whereas a register is required during the first L(v) cycles of v's execution. In case T(u) = L(v) = 1, as in the examples depicted in Figure 9, the following simplified expression is obtained:

    r_{u,v,c} = δ_{u,v} + WRITES_BEFORE_{u,c} − READS_BEFORE_{v,c}
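For a given folded schedule, the register count that constraint (8) expresses linearly can also be evaluated directly by expanding every lifetime interval and counting, per steady-state cycle, how many copies of each value are alive. The sketch below (ours, not from the paper) does this; the toy schedule is made up so that it reproduces the 3, 2 and 3 registers quoted for cycles 0, 1 and 2 in the Figure 9 example.

```python
def maxlive(edges, c, T, L, II):
    """edges: list of (u, v, delta) in the folded DG; c[u]: start cycle (0 <= c[u] < II).
    A value written by u for v is alive from c[u]+T(u) to c[v]+L(v)-1 + delta*II."""
    live = [0] * II
    # edges with the same source share one value: keep the longest lifetime only
    last_use = {}
    for u, v, d in edges:
        start = c[u] + T[u]
        end = c[v] + L[v] - 1 + d * II
        if u not in last_use or end > last_use[u][1]:
            last_use[u] = (start, end)
    for start, end in last_use.values():
        for t in range(start, end + 1):
            live[t % II] += 1
    return max(live)

# Toy folded schedule with II = 3: edge u ->(2) v and edge r ->(1) s
edges = [("u", "v", 2), ("r", "s", 1)]
c = {"u": 1, "v": 2, "r": 2, "s": 0}
T = {"u": 1, "v": 1, "r": 1, "s": 1}
L = {"u": 1, "v": 1, "r": 1, "s": 1}
print(maxlive(edges, c, T, L, 3))   # 3  (live per cycle: [3, 2, 3])
```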

A variable can be the input of more than one instruction. This is shown in the DG by several edges sharing the same source vertex. However, there is no need to allocate different registers for the same variable: the number of registers required is the maximum over all edges with the same source. Therefore, the variables r_{u,c} can also be defined on vertices as follows:

    r_{u,v,c} ≤ r_{u,c},    ∀u ∈ V, c ∈ C    (9)

It can easily be proved that if operation v is the last use of u's result, then r_{u,v,c} = r_{u,c} for any cycle c [8]. The following constraint must be included to define MAXLIVE:

    Σ_{u ∈ V} r_{u,c} ≤ ML,    ∀c ∈ C    (10)

ML being a variable minimized in the objective function.

5.6 Objective function

Assume we have a cost (area) vector A = (A_1, ..., A_m) for the FUs of the architecture. Similarly, we can have a cost A_r for each register. (Piecewise-linear cost functions for registers can also be incorporated, as proposed in [10].) The objective function of the ILP model can be formulated as follows:

    min Area = Σ_{t ∈ M} A_t · F_t + A_r · ML    (11)

Note that F_t and ML can be either variables or constants, depending on the initial constraints defined for the problem. In case all of them are constants, the ILP model will merely find a feasible solution. Here we have considered ML as the number of registers required by the schedule. Although this might not always hold, it has been shown to be a realistic assumption for many practical cases [30].

5.7 Complexity of the model

Table 2 describes the fundamental variables and constraints of the model. |V| and |E| denote the number of nodes and edges of the DG respectively, whereas m is the number of different FU types of the architecture. Some of the variables, e.g. F_t and ML, may become constants if the number of resources of the architecture is defined in advance.

6 Branch and Prune

The most popular method to solve ILP models is the combination of branch-and-bound techniques with a linear-programming solver such as simplex [26]. Having integer variables in the model increases the complexity from polynomial to exponential. The running time for ILP is highly influenced by the exploration of the branch-and-bound tree.

Table 2: Variables and constraints of the ILP model.

    variable    number                  constraint    number
    σ_u         |V|                     (4)           |V|
    s_{u,i}     |V| · II                (6)           |E|
    F_t         m                       (7)           m · II
    r_{u,c}     |V| · II                (8), (9)      |V| · II
    ML          1                       (10)          II
    total       |V|(2·II + 1) + m + 1   total         |V|(II + 1) + |E| + II(m + 1)

For an ILP solver that is insensitive to the structure of the problem, the tree of solutions is blindly explored, and finding valid integer solutions may require solving an excessive number of LP problems. We have implemented an ad-hoc solver for loop pipelining that takes advantage of the information known a priori about the problem. The solver uses a branch-and-bound paradigm to explore the space of integer solutions but, rather than indiscriminately invoking an LP solver at each branch, it allows rapid pruning as soon as enough information has been captured to evaluate the objective function. The order in which the integer variables are explored is also crucial to reduce the running time. The approach used for this problem, which we have called branch and prune, is presented next.

6.1 Retiming and scheduling

The problem of loop pipelining can be decomposed into two different problems:

1. Retiming, or finding the loop fold of each operation, and

2. Scheduling, or assigning operations to cycles.

After solving retiming, i.e. defining σ_u for all operations, the problem of loop pipelining is reduced to the scheduling of a basic block in which not all dependences must be taken into account. According to constraint (6), all dependences with T(u) − II·δ_{u,v} ≤ 0 are not relevant for scheduling (the value of δ_{u,v} can be obtained from (3) once σ_u and σ_v have been defined). This is the rationale behind our first decision on the exploration order of the integer variables: first all σ_u (retiming) and then the rest (scheduling).

6.1.1 Retiming

Recurrences impose the most stringent constraints on the operations' folds, since the sum of the δ's of the edges of a recurrence must be constant (equation (2)). Let us take, as an example, the recurrence A-B-C-D-A of the DG depicted in Figure 10. Assume we define σ_C = 10 and σ_D = 9. Then δ_{C,D} = 1, and the folds for A and B are implicitly defined by the fact that only one of the edges of the recurrence can have δ = 1. Therefore, σ_A = σ_B = 10. The same holds for the recurrence C-D-E-C, in which σ_E = 10 for the same reason.

Figure 10 shows the exploration tree for the example. The order of the variables is selected as follows: first explore the nodes belonging to the most stringent recurrences; inside each recurrence, first explore those nodes that also belong to other recurrences. In the example, the recurrence C-D-E-C is the most stringent, and nodes C and D also belong to other recurrences; thus, the order C-D-E-A-B has been chosen.

Figure 10: Branch-and-prune search for σ_u.

Since only the relative difference between the σ's is required to determine the steady-state pattern, the fold of the first node explored can always be fixed (σ_C = 10 in this case). For nodes not belonging to any recurrence, the exploration order is defined according to a neighborhood criterion with respect to the nodes in the recurrences. In these cases, however, the number of branches to be explored may be unbounded, since no recurrence limits the value of σ. For this reason, an early calculation of a lower bound on the register requirements is done at each node of the tree. This calculation assumes worst-case values for the auxiliary variables WRITES_BEFORE and READS_BEFORE in constraint (8). In practice, this estimation has proved very effective in pruning the search tree when compared with the cost of the best solution obtained so far.

6.1.2 Scheduling

After the folds of the operations have been defined, a scheduling problem is posed for each leaf of the search tree. At this point, some dependences of the DG can be eliminated from the model (those with T(u) − II·δ_{u,v} ≤ 0) and the critical path of the retimed DG can be calculated. If the critical path is longer than II, no feasible schedule can be found and the exploration is pruned. Checking the critical path of the DG can even be done before all σ's have been defined, whenever the information available at a node of the search tree is enough to discard a solution.

The variables defined at this stage of the problem are the scheduling variables s_{u,i}. Here the exploration order is also crucial, and we have chosen one of the well-known algorithms for time-constrained scheduling, force-directed scheduling (FDS) [28], to assist the solver in defining an efficient strategy to generate the search tree. FDS performs a stepwise assignment of operations to cycles according to criteria that attempt to balance the utilization of resources over all cycles of the schedule. In our model, each operation has II 0-1 variables, at most one of which can be 1. The order of operation assignments produced by FDS is the one we use to generate the search tree. The cycle to which each operation is scheduled by FDS is the first value explored.
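A schematic skeleton of the exploration described in this section (our reading of it, with invented helper names such as domains, lower_bound and schedule_leaf; it is not the authors' solver):

```python
from math import inf

def branch_and_prune(order, domains, lower_bound, schedule_leaf):
    """Depth-first exploration of the retiming variables in a fixed order.
    order: operations to fold (most stringent recurrences first);
    domains(assignment, op): candidate folds for op, preferred value first;
    lower_bound(assignment): optimistic cost of any completion of assignment;
    schedule_leaf(assignment): (schedule, cost) found by the FDS-guided
                               scheduling phase, or None if infeasible."""
    best_cost, best_schedule = inf, None

    def explore(assignment, depth):
        nonlocal best_cost, best_schedule
        if lower_bound(assignment) >= best_cost:      # prune: cannot beat the incumbent
            return
        if depth == len(order):                       # all folds fixed: schedule the leaf
            leaf = schedule_leaf(assignment)
            if leaf is not None and leaf[1] < best_cost:
                best_schedule, best_cost = leaf
            return
        op = order[depth]
        for fold in domains(assignment, op):          # preferred value explored first
            explore({**assignment, op: fold}, depth + 1)

    explore({}, 0)
    return best_schedule, best_cost

# Smoke test with dummy callables (purely illustrative)
sched, cost = branch_and_prune(
    order=["C", "D", "E", "A", "B"],
    domains=lambda a, op: [10, 9, 11],
    lower_bound=lambda a: 0,
    schedule_leaf=lambda a: (dict(a), sum(a.values())))
print(cost)   # 45
```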

Figure 11: Generation of the search tree for scheduling, guided by FDS.

Assignments to the other cycles are explored in increasing order of the forces estimated by FDS (see Figure 11). This strategy leads the solver to a near-optimal solution very early, thus allowing an efficient pruning of the tree for the other branches. At each node of the tree, in which some operations have already been assigned to cycles, the scheduling constraints no longer relevant for the final solution are eliminated (e.g. if ASAP(u) = 2, the variables s_{u,0} and s_{u,1} are not defined and their corresponding constraints are simplified). Similarly to branch-and-bound, the LP solver is invoked at each branch of the tree to prune those solutions with a cost greater than that of the best integer solution found.

7 Experimental results

The techniques presented in the previous sections have been implemented by using the package lp_solve [2]. In this section, several results are reported to show the main features of the method: the exploration of the optimal unrolling degree and the efficiency of the branch-and-prune strategy to solve the ILP model. The types of resources used in the schedules are adders (1-cycle latency) and multipliers (2-cycle latency).

7.1 Optimal unrolling degree

Table 3 presents the results obtained by exploring two different unrolling degrees (see Table 1) for the example depicted in Figure 6. An RCLP problem with 4 resources has been solved. MAXLIVE is also minimized for the optimal II found by the model. The columns V and E indicate the number of nodes and edges of the DG after unrolling the loop. #Vars and #Const indicate the number of variables and constraints of the ILP model. SPAN is calculated as σ_max − σ_min + 1 (the number of different iterations involved in the schedule).

The first row presents the result for II_K = 5 and K = 4 (point C in Figure 6), which is an infeasible problem. The second row presents the result for point A = (4, 3). This case obtains optimal results in MAXLIVE and II, compared with the schedule found for the third point (B = (6, 4)), which is the one explored by other techniques that perform sub-optimal loop unrolling.

Table 4 shows the results obtained for some examples selected from the set presented in [13]. In all cases, 2 load/store units, 2 multipliers, 3 adders and 1 divider have been used. Note that unrolling is required to achieve a schedule in MII cycles for all cases. Table 5 shows the results obtained for the Cytron example when no register minimization is required.

Table 3: Results obtained for the example of Figure 6(a).

Table 4: Results obtained for a selection of examples (SPEC-SPICE loop7, SPEC-SPICE loop8, LIVERMORE loop1) from the Perfect Club set.

Table 5: Results for the Cytron example when no register minimization is required.

7.2 ILP model for RCLP

A significant part of the computational cost of solving the ILP model is spent in guaranteeing that the final solution is optimal. For this reason, we also report the best solution obtained after 1 minute of CPU time (only the value of MAXLIVE, which is the critical variable in the optimization of the objective function). The results presented in Tables 6 and 7 show that benchmarks with a large number of operations can be solved with moderate computational cost. The method can also be used for heuristic search by limiting the maximum CPU time and providing the best solution found at that moment (e.g. one minute). The results demonstrate that the optimal solution can often be obtained by using only a small fraction of the total computational cost, although a proof of optimality cannot be given in this case. Optimal solutions have been found for models with more than 1000 variables and 700 constraints.

Table 8 shows the results obtained for the differential equation solver. For this example, two types of resources were considered: a non-pipelined multiplier with a two-cycle delay, and an ALU with a one-cycle delay for the rest of the operations.

7.3 ILP model for TCLP

Table 9 presents some results on solving time-constrained loop pipelining for the elliptic filter. The cost of the multipliers, adders and registers has been defined as 80, 20 and 10 area units respectively.

Table 6: Results obtained for the 16-point FIR filter.

Table 7: Results for the 5th-order elliptic wave filter (with pipelined and with non-pipelined multipliers).

Table 8: Results for the differential equation solver.

In this case, the value of II is defined initially, whereas the area cost of the resources is minimized. The complexity of the models is equivalent to that of the corresponding solutions reported in Table 7. Table 10 shows equivalent results when pipelined multipliers are considered. Table 11 presents the results for the Cytron example when solving the TCLP problem.

Table 9: TCLP for the 5th-order elliptic wave filter with non-pipelined multipliers.

Table 10: Results obtained for the 5th-order elliptic filter with pipelined multipliers for TCLP.

Table 11: Results obtained for the Cytron example for TCLP.

7.4 Results without loop folding

In this section we present the results obtained when no loop folding is required. In this case, only one leaf of the branch-and-prune tree is explored: the one in which all σ's have the same value. Table 12 shows the results obtained for the elliptic filter. The program obtains the schedule without loop folding in a given number of cycles, minimizing the value of MAXLIVE.

7.5 Near-optimal solutions for large examples

Although the branch-and-prune strategy has proved to be efficient for models with 10³ variables/constraints, the exponential nature of the method becomes critical in some cases. Rather than discarding the method, we claim that it can still be used to obtain near-optimal solutions by limiting the CPU time of the search. Tables 13 and 14 present results on the Cytron and FDCT loops obtained by limiting the CPU time to 10 minutes. Given the status of the search at the time the solution was delivered, we suspect that the results are optimal (although we could not prove it). The strategies used to explore the values of σ and the scheduling variables (force-directed scheduling) are crucial to converge rapidly to near-optimal solutions.

8 Conclusions

Finding optimal schedules for moderate-size loops is feasible. In this paper we have presented a mathematical model that can be solved by using ILP. However, efficient techniques to wisely explore the space of integer solutions are required to avoid a blind navigation through the branch-and-bound tree. We have presented a new strategy called branch and prune that takes advantage of the information known a priori about the problem to seek near-optimal solutions as fast as possible. Decomposing loop pipelining into two sub-problems (retiming and scheduling) and using force-directed scheduling to assist the search have been essential heuristics to guarantee optimal solutions in moderate CPU times. We have also demonstrated that exploring the unrolling degree is necessary to find optimal solutions for loop pipelining. A mathematical formulation based on Farey's series has been proposed for such a goal.

Table 12: Results obtained for the elliptic filter when only scheduling is required (non-pipelined multipliers).

Table 13: Results for the Cytron example after 10 minutes of CPU time.

Table 14: Results obtained for the FDCT after 10 minutes of CPU time.

Finally, we think that other pruning strategies for ILP models can be considered for the problem we have solved. New techniques, such as branch and cut, successfully used for the traveling-salesman problem [25], can probably be adapted to accommodate many problems in high-level synthesis.

References

[1] A. Aiken and A. Nicolau. Perfect pipelining: a new loop parallelization technique. Volume 300 of Lecture Notes in Computer Science, Springer-Verlag, March.

[2] M.R.C.M. Berkelaar. lp_solve version 2.0: a public domain ILP solver. Available at ftp.es.ele.tue.nl.

[3] F. Catthoor, W. Geurts, and H. De Man. Loop transformation methodology for fixed-rate video, image and telecom processing applications. In Proc. of the Int. Conf. on Application-Specific Array Processors 1994 (ASAP 94).

[4] L.-F. Chao, A. LaPaugh, and E.H.-M. Sha. Rotation scheduling: a loop pipelining algorithm. In Proc. of the 30th Design Automation Conf., June.

[5] R. Cytron. Compile-time scheduling and optimization for asynchronous machines. PhD thesis, University of Illinois at Urbana-Champaign.

[6] S.M. Heemstra de Groot, S.H. Gerez, and O.E. Herrmann. Range-chart-guided iterative data-flow graph scheduling. IEEE Transactions on Circuits and Systems I, 39(5), May.

[7] Y.G. DeCastelo-Vide e Souza, M. Potkonjak, and A.C. Parker. Optimal ILP-based approach for throughput optimization using simultaneous algorithm/architecture matching and retiming. In Proc. of the 32nd Design Automation Conf., June.

[8] A.E. Eichenberger, E.S. Davidson, and S.G. Abraham. Optimum modulo schedules for minimum register requirements. In Proc. of the 9th ACM International Symposium on Supercomputing, pages 31-40, June.

[9] C.H. Gebotys. Throughput optimized architectural synthesis. IEEE Trans. on VLSI Systems, 1(3), September.

[10] C.H. Gebotys and M.I. Elmasry. Global optimization approach for architectural synthesis. IEEE Trans. Computer-Aided Design, 12(9), September.

[11] E.F. Girczyc. Loop winding: a data flow approach to functional pipelining. In Proc. Int. Symp. Circuits and Systems, May.

[12] G. Goossens, J. Rabaey, J. Vandewalle, and H. De Man. An efficient microcode compiler for application specific DSP processors. IEEE Trans. Computer-Aided Design, 9(9), September.

[13] R. Govindarajan, E.R. Altman, and G.R. Gao. Minimizing register requirements under resource-constrained rate-optimal software pipelining. In Proc. of the 27th Annual International Symposium on Microarchitecture, pages 85-94, November.

[14] L. Hafer and A. Parker. A formal method for the specification, analysis and design of register-transfer-level digital logic. IEEE Trans. Computer-Aided Design, 2(1):4-17, January.

[15] R.A. Huff. Lifetime-sensitive modulo scheduling. In Proc. of the 6th Conf. Programming Language Design and Implementation.

[16] C.-T. Hwang and Y.-C. Hsu. Zone scheduling. IEEE Trans. Computer-Aided Design, 12(7), July.

[17] C.-T. Hwang, Y.-C. Hsu, and Y.-L. Lin. Scheduling for functional pipelining and loop winding. In Proc. of the 28th Design Automation Conf., June.

[18] C.-T. Hwang, J.-H. Lee, and Y.-C. Hsu. A formal approach to the scheduling problem in high level synthesis. IEEE Trans. Computer-Aided Design, 10(4), April.

[19] R.B. Jones and V.H. Allan. Software pipelining: a comparison and improvement. In Proc. 23rd Ann. Workshop on Microprogramming and Microarchitecture, pages 46-56, November.

[20] R. Karp. A characterization of the minimum cycle mean in a digraph. Discrete Mathematics, 23.

[21] P.M. Kogge. The Architecture of Pipelined Computers. McGraw-Hill, New York.

[22] M. Lam. Software pipelining: an effective scheduling technique for VLIW machines. In Proc. of SIGPLAN 88, June.

[23] T.-F. Lee, A.C.-H. Wu, Y.-L. Lin, and D.D. Gajski. A transformation-based method for loop folding. IEEE Trans. Computer-Aided Design, 13(4), April.

[24] C.E. Leiserson and J.B. Saxe. Retiming synchronous circuitry. Algorithmica, 6:5-35.

[25] M.W. Padberg and G. Rinaldi. A branch-and-cut algorithm for the resolution of large-scale symmetric traveling salesman problems. SIAM Review, 33:60-100.

[26] C.H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall.

[27] K.K. Parhi and D.G. Messerschmitt. Static rate-optimal scheduling of iterative data-flow programs via optimum unfolding. IEEE Trans. Computers, 40(2), February.

[28] P.G. Paulin and J.P. Knight. Force-directed scheduling for the behavioral synthesis of ASICs. IEEE Trans. Computer-Aided Design, 8(6), June.

[29] R. Potasman, J. Lis, A. Nicolau, and D.D. Gajski. Percolation-based synthesis. In Proc. of the 27th Design Automation Conf.

[30] B.R. Rau. Iterative modulo scheduling: an algorithm for software pipelining loops. In Proc. of the 27th Annual International Symposium on Microarchitecture, pages 63-74, November.

[31] B.R. Rau and C.D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proc. of the 14th Annual Workshop on Microprogramming, October.

[32] M. Renfors and Y. Neuvo. The maximum sampling rate of digital filters under hardware speed constraints. IEEE Trans. on Circuits and Systems, CAS-28(3), March.

[33] M. Rim and R. Jain. Lower-bound performance estimation for the high-level synthesis scheduling problem. IEEE Trans. Computer-Aided Design, 13(4), April.

[34] F. Sánchez and J. Cortadella. Time constrained loop pipelining. In Proc. Int. Conf. Computer-Aided Design, November.

[35] M.R. Schroeder. Number Theory in Science and Communication. Springer-Verlag.

[36] B. Su, S. Ding, and J. Xia. URPR: an extension of URCR for software pipelining. In 19th Microprogramming Workshop (MICRO-19), October.


More information

STATIC SCHEDULING FOR CYCLO STATIC DATA FLOW GRAPHS

STATIC SCHEDULING FOR CYCLO STATIC DATA FLOW GRAPHS STATIC SCHEDULING FOR CYCLO STATIC DATA FLOW GRAPHS Sukumar Reddy Anapalli Krishna Chaithanya Chakilam Timothy W. O Neil Dept. of Computer Science Dept. of Computer Science Dept. of Computer Science The

More information

Symbolic Modeling and Evaluation of Data Paths*

Symbolic Modeling and Evaluation of Data Paths* Symbolic Modeling and Evaluation of Data Paths* Chuck Monahan Forrest Brewer Department of Electrical and Computer Engineering University of California, Santa Barbara, U.S.A. 1. Introduction Abstract We

More information

Register-Sensitive Software Pipelining

Register-Sensitive Software Pipelining Regier-Sensitive Software Pipelining Amod K. Dani V. Janaki Ramanan R. Govindarajan Veritas Software India Pvt. Ltd. Supercomputer Edn. & Res. Centre Supercomputer Edn. & Res. Centre 1179/3, Shivajinagar

More information

12.1 Formulation of General Perfect Matching

12.1 Formulation of General Perfect Matching CSC5160: Combinatorial Optimization and Approximation Algorithms Topic: Perfect Matching Polytope Date: 22/02/2008 Lecturer: Lap Chi Lau Scribe: Yuk Hei Chan, Ling Ding and Xiaobing Wu In this lecture,

More information

Parameterized graph separation problems

Parameterized graph separation problems Parameterized graph separation problems Dániel Marx Department of Computer Science and Information Theory, Budapest University of Technology and Economics Budapest, H-1521, Hungary, dmarx@cs.bme.hu Abstract.

More information

ROTATION SCHEDULING ON SYNCHRONOUS DATA FLOW GRAPHS. A Thesis Presented to The Graduate Faculty of The University of Akron

ROTATION SCHEDULING ON SYNCHRONOUS DATA FLOW GRAPHS. A Thesis Presented to The Graduate Faculty of The University of Akron ROTATION SCHEDULING ON SYNCHRONOUS DATA FLOW GRAPHS A Thesis Presented to The Graduate Faculty of The University of Akron In Partial Fulfillment of the Requirements for the Degree Master of Science Rama

More information

Unit 2: High-Level Synthesis

Unit 2: High-Level Synthesis Course contents Unit 2: High-Level Synthesis Hardware modeling Data flow Scheduling/allocation/assignment Reading Chapter 11 Unit 2 1 High-Level Synthesis (HLS) Hardware-description language (HDL) synthesis

More information

Area. A max. f(t) f(t *) A min

Area. A max. f(t) f(t *) A min Abstract Toward a Practical Methodology for Completely Characterizing the Optimal Design Space One of the most compelling reasons for developing highlevel synthesis systems has been the desire to quickly

More information

Accelerated SAT-based Scheduling of Control/Data Flow Graphs

Accelerated SAT-based Scheduling of Control/Data Flow Graphs Accelerated SAT-based Scheduling of Control/Data Flow Graphs Seda Ogrenci Memik Computer Science Department University of California, Los Angeles Los Angeles, CA - USA seda@cs.ucla.edu Farzan Fallah Fujitsu

More information

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 36

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 36 CS 473: Algorithms Ruta Mehta University of Illinois, Urbana-Champaign Spring 2018 Ruta (UIUC) CS473 1 Spring 2018 1 / 36 CS 473: Algorithms, Spring 2018 LP Duality Lecture 20 April 3, 2018 Some of the

More information

Theoretical Constraints on Multi-Dimensional Retiming Design Techniques

Theoretical Constraints on Multi-Dimensional Retiming Design Techniques header for SPIE use Theoretical onstraints on Multi-imensional Retiming esign Techniques N. L. Passos,.. efoe, R. J. Bailey, R. Halverson, R. Simpson epartment of omputer Science Midwestern State University

More information

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management International Journal of Computer Theory and Engineering, Vol., No., December 01 Effective Memory Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management Sultan Daud Khan, Member,

More information

Automatic Counterflow Pipeline Synthesis

Automatic Counterflow Pipeline Synthesis Automatic Counterflow Pipeline Synthesis Bruce R. Childers, Jack W. Davidson Computer Science Department University of Virginia Charlottesville, Virginia 22901 {brc2m, jwd}@cs.virginia.edu Abstract The

More information

Topic 12: Cyclic Instruction Scheduling

Topic 12: Cyclic Instruction Scheduling Topic 12: Cyclic Instruction Scheduling COS 320 Compiling Techniques Princeton University Spring 2015 Prof. David August 1 Overlap Iterations Using Pipelining time Iteration 1 2 3 n n 1 2 3 With hardware

More information

Copyright 2007 Pearson Addison-Wesley. All rights reserved. A. Levitin Introduction to the Design & Analysis of Algorithms, 2 nd ed., Ch.

Copyright 2007 Pearson Addison-Wesley. All rights reserved. A. Levitin Introduction to the Design & Analysis of Algorithms, 2 nd ed., Ch. Iterative Improvement Algorithm design technique for solving optimization problems Start with a feasible solution Repeat the following step until no improvement can be found: change the current feasible

More information

Lecture 10,11: General Matching Polytope, Maximum Flow. 1 Perfect Matching and Matching Polytope on General Graphs

Lecture 10,11: General Matching Polytope, Maximum Flow. 1 Perfect Matching and Matching Polytope on General Graphs CMPUT 675: Topics in Algorithms and Combinatorial Optimization (Fall 2009) Lecture 10,11: General Matching Polytope, Maximum Flow Lecturer: Mohammad R Salavatipour Date: Oct 6 and 8, 2009 Scriber: Mohammad

More information

Approximating Fault-Tolerant Steiner Subgraphs in Heterogeneous Wireless Networks

Approximating Fault-Tolerant Steiner Subgraphs in Heterogeneous Wireless Networks Approximating Fault-Tolerant Steiner Subgraphs in Heterogeneous Wireless Networks Ambreen Shahnaz and Thomas Erlebach Department of Computer Science University of Leicester University Road, Leicester LE1

More information

CSE 417 Branch & Bound (pt 4) Branch & Bound

CSE 417 Branch & Bound (pt 4) Branch & Bound CSE 417 Branch & Bound (pt 4) Branch & Bound Reminders > HW8 due today > HW9 will be posted tomorrow start early program will be slow, so debugging will be slow... Review of previous lectures > Complexity

More information

Obstacle-Aware Longest-Path Routing with Parallel MILP Solvers

Obstacle-Aware Longest-Path Routing with Parallel MILP Solvers , October 20-22, 2010, San Francisco, USA Obstacle-Aware Longest-Path Routing with Parallel MILP Solvers I-Lun Tseng, Member, IAENG, Huan-Wen Chen, and Che-I Lee Abstract Longest-path routing problems,

More information

Mathematical and Algorithmic Foundations Linear Programming and Matchings

Mathematical and Algorithmic Foundations Linear Programming and Matchings Adavnced Algorithms Lectures Mathematical and Algorithmic Foundations Linear Programming and Matchings Paul G. Spirakis Department of Computer Science University of Patras and Liverpool Paul G. Spirakis

More information

System-Level Synthesis of Application Specific Systems using A* Search and Generalized Force-Directed Heuristics

System-Level Synthesis of Application Specific Systems using A* Search and Generalized Force-Directed Heuristics System-Level Synthesis of Application Specific Systems using A* Search and Generalized Force-Directed Heuristics Chunho Lee, Miodrag Potkonjak, and Wayne Wolf Computer Science Department, University of

More information

CMSC 451: Lecture 22 Approximation Algorithms: Vertex Cover and TSP Tuesday, Dec 5, 2017

CMSC 451: Lecture 22 Approximation Algorithms: Vertex Cover and TSP Tuesday, Dec 5, 2017 CMSC 451: Lecture 22 Approximation Algorithms: Vertex Cover and TSP Tuesday, Dec 5, 2017 Reading: Section 9.2 of DPV. Section 11.3 of KT presents a different approximation algorithm for Vertex Cover. Coping

More information

1. Lecture notes on bipartite matching February 4th,

1. Lecture notes on bipartite matching February 4th, 1. Lecture notes on bipartite matching February 4th, 2015 6 1.1.1 Hall s Theorem Hall s theorem gives a necessary and sufficient condition for a bipartite graph to have a matching which saturates (or matches)

More information

Notes for Lecture 24

Notes for Lecture 24 U.C. Berkeley CS170: Intro to CS Theory Handout N24 Professor Luca Trevisan December 4, 2001 Notes for Lecture 24 1 Some NP-complete Numerical Problems 1.1 Subset Sum The Subset Sum problem is defined

More information

Lecture 19. Software Pipelining. I. Example of DoAll Loops. I. Introduction. II. Problem Formulation. III. Algorithm.

Lecture 19. Software Pipelining. I. Example of DoAll Loops. I. Introduction. II. Problem Formulation. III. Algorithm. Lecture 19 Software Pipelining I. Introduction II. Problem Formulation III. Algorithm I. Example of DoAll Loops Machine: Per clock: 1 read, 1 write, 1 (2-stage) arithmetic op, with hardware loop op and

More information

DM545 Linear and Integer Programming. Lecture 2. The Simplex Method. Marco Chiarandini

DM545 Linear and Integer Programming. Lecture 2. The Simplex Method. Marco Chiarandini DM545 Linear and Integer Programming Lecture 2 The Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Outline 1. 2. 3. 4. Standard Form Basic Feasible Solutions

More information

Integer Programming Theory

Integer Programming Theory Integer Programming Theory Laura Galli October 24, 2016 In the following we assume all functions are linear, hence we often drop the term linear. In discrete optimization, we seek to find a solution x

More information

REDUCING GRAPH COLORING TO CLIQUE SEARCH

REDUCING GRAPH COLORING TO CLIQUE SEARCH Asia Pacific Journal of Mathematics, Vol. 3, No. 1 (2016), 64-85 ISSN 2357-2205 REDUCING GRAPH COLORING TO CLIQUE SEARCH SÁNDOR SZABÓ AND BOGDÁN ZAVÁLNIJ Institute of Mathematics and Informatics, University

More information

Foundations of Computing

Foundations of Computing Foundations of Computing Darmstadt University of Technology Dept. Computer Science Winter Term 2005 / 2006 Copyright c 2004 by Matthias Müller-Hannemann and Karsten Weihe All rights reserved http://www.algo.informatik.tu-darmstadt.de/

More information

Theorem 2.9: nearest addition algorithm

Theorem 2.9: nearest addition algorithm There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used

More information

SDC-Based Modulo Scheduling for Pipeline Synthesis

SDC-Based Modulo Scheduling for Pipeline Synthesis SDC-Based Modulo Scheduling for Pipeline Synthesis Zhiru Zhang School of Electrical and Computer Engineering, Cornell University, Ithaca, NY zhiruz@cornell.edu Bin Liu Micron Technology, Inc., San Jose,

More information

Last topic: Summary; Heuristics and Approximation Algorithms Topics we studied so far:

Last topic: Summary; Heuristics and Approximation Algorithms Topics we studied so far: Last topic: Summary; Heuristics and Approximation Algorithms Topics we studied so far: I Strength of formulations; improving formulations by adding valid inequalities I Relaxations and dual problems; obtaining

More information

LECTURES 3 and 4: Flows and Matchings

LECTURES 3 and 4: Flows and Matchings LECTURES 3 and 4: Flows and Matchings 1 Max Flow MAX FLOW (SP). Instance: Directed graph N = (V,A), two nodes s,t V, and capacities on the arcs c : A R +. A flow is a set of numbers on the arcs such that

More information

HIGH-LEVEL SYNTHESIS

HIGH-LEVEL SYNTHESIS HIGH-LEVEL SYNTHESIS Page 1 HIGH-LEVEL SYNTHESIS High-level synthesis: the automatic addition of structural information to a design described by an algorithm. BEHAVIORAL D. STRUCTURAL D. Systems Algorithms

More information

Integer Programming ISE 418. Lecture 7. Dr. Ted Ralphs

Integer Programming ISE 418. Lecture 7. Dr. Ted Ralphs Integer Programming ISE 418 Lecture 7 Dr. Ted Ralphs ISE 418 Lecture 7 1 Reading for This Lecture Nemhauser and Wolsey Sections II.3.1, II.3.6, II.4.1, II.4.2, II.5.4 Wolsey Chapter 7 CCZ Chapter 1 Constraint

More information

A General Class of Heuristics for Minimum Weight Perfect Matching and Fast Special Cases with Doubly and Triply Logarithmic Errors 1

A General Class of Heuristics for Minimum Weight Perfect Matching and Fast Special Cases with Doubly and Triply Logarithmic Errors 1 Algorithmica (1997) 18: 544 559 Algorithmica 1997 Springer-Verlag New York Inc. A General Class of Heuristics for Minimum Weight Perfect Matching and Fast Special Cases with Doubly and Triply Logarithmic

More information

Polynomial-Time Approximation Algorithms

Polynomial-Time Approximation Algorithms 6.854 Advanced Algorithms Lecture 20: 10/27/2006 Lecturer: David Karger Scribes: Matt Doherty, John Nham, Sergiy Sidenko, David Schultz Polynomial-Time Approximation Algorithms NP-hard problems are a vast

More information

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems International Journal of Information and Education Technology, Vol., No. 5, December A Level-wise Priority Based Task Scheduling for Heterogeneous Systems R. Eswari and S. Nickolas, Member IACSIT Abstract

More information

Efficient Second-Order Iterative Methods for IR Drop Analysis in Power Grid

Efficient Second-Order Iterative Methods for IR Drop Analysis in Power Grid Efficient Second-Order Iterative Methods for IR Drop Analysis in Power Grid Yu Zhong Martin D. F. Wong Dept. of Electrical and Computer Engineering Dept. of Electrical and Computer Engineering Univ. of

More information

Published in HICSS-26 Conference Proceedings, January 1993, Vol. 1, pp The Benet of Predicated Execution for Software Pipelining

Published in HICSS-26 Conference Proceedings, January 1993, Vol. 1, pp The Benet of Predicated Execution for Software Pipelining Published in HICSS-6 Conference Proceedings, January 1993, Vol. 1, pp. 97-506. 1 The Benet of Predicated Execution for Software Pipelining Nancy J. Warter Daniel M. Lavery Wen-mei W. Hwu Center for Reliable

More information

URSA: A Unified ReSource Allocator for Registers and Functional Units in VLIW Architectures

URSA: A Unified ReSource Allocator for Registers and Functional Units in VLIW Architectures Presented at IFIP WG 10.3(Concurrent Systems) Working Conference on Architectures and Compliation Techniques for Fine and Medium Grain Parallelism, Orlando, Fl., January 1993 URSA: A Unified ReSource Allocator

More information

Max-Cut and Max-Bisection are NP-hard on unit disk graphs

Max-Cut and Max-Bisection are NP-hard on unit disk graphs Max-Cut and Max-Bisection are NP-hard on unit disk graphs Josep Díaz and Marcin Kamiński 2 Llenguatges i Sistemes Informàtics 2 RUTCOR, Rutgers University Universitat Politècnica de Catalunya 640 Bartholomew

More information

Dual-Based Approximation Algorithms for Cut-Based Network Connectivity Problems

Dual-Based Approximation Algorithms for Cut-Based Network Connectivity Problems Dual-Based Approximation Algorithms for Cut-Based Network Connectivity Problems Benjamin Grimmer bdg79@cornell.edu arxiv:1508.05567v2 [cs.ds] 20 Jul 2017 Abstract We consider a variety of NP-Complete network

More information

Topic: Local Search: Max-Cut, Facility Location Date: 2/13/2007

Topic: Local Search: Max-Cut, Facility Location Date: 2/13/2007 CS880: Approximations Algorithms Scribe: Chi Man Liu Lecturer: Shuchi Chawla Topic: Local Search: Max-Cut, Facility Location Date: 2/3/2007 In previous lectures we saw how dynamic programming could be

More information

Methods and Models for Combinatorial Optimization Exact methods for the Traveling Salesman Problem

Methods and Models for Combinatorial Optimization Exact methods for the Traveling Salesman Problem Methods and Models for Combinatorial Optimization Exact methods for the Traveling Salesman Problem L. De Giovanni M. Di Summa The Traveling Salesman Problem (TSP) is an optimization problem on a directed

More information

The alternator. Mohamed G. Gouda F. Furman Haddix

The alternator. Mohamed G. Gouda F. Furman Haddix Distrib. Comput. (2007) 20:21 28 DOI 10.1007/s00446-007-0033-1 The alternator Mohamed G. Gouda F. Furman Haddix Received: 28 August 1999 / Accepted: 5 July 2000 / Published online: 12 June 2007 Springer-Verlag

More information

Optimization Techniques for Design Space Exploration

Optimization Techniques for Design Space Exploration 0-0-7 Optimization Techniques for Design Space Exploration Zebo Peng Embedded Systems Laboratory (ESLAB) Linköping University Outline Optimization problems in ERT system design Heuristic techniques Simulated

More information

Algorithms for Integer Programming

Algorithms for Integer Programming Algorithms for Integer Programming Laura Galli November 9, 2016 Unlike linear programming problems, integer programming problems are very difficult to solve. In fact, no efficient general algorithm is

More information

COMP 355 Advanced Algorithms Approximation Algorithms: VC and TSP Chapter 11 (KT) Section (CLRS)

COMP 355 Advanced Algorithms Approximation Algorithms: VC and TSP Chapter 11 (KT) Section (CLRS) COMP 355 Advanced Algorithms Approximation Algorithms: VC and TSP Chapter 11 (KT) Section 35.1-35.2(CLRS) 1 Coping with NP-Completeness Brute-force search: This is usually only a viable option for small

More information

Chapter 16. Greedy Algorithms

Chapter 16. Greedy Algorithms Chapter 16. Greedy Algorithms Algorithms for optimization problems (minimization or maximization problems) typically go through a sequence of steps, with a set of choices at each step. A greedy algorithm

More information

Placement Algorithm for FPGA Circuits

Placement Algorithm for FPGA Circuits Placement Algorithm for FPGA Circuits ZOLTAN BARUCH, OCTAVIAN CREŢ, KALMAN PUSZTAI Computer Science Department, Technical University of Cluj-Napoca, 26, Bariţiu St., 3400 Cluj-Napoca, Romania {Zoltan.Baruch,

More information

Vertex Cover Approximations

Vertex Cover Approximations CS124 Lecture 20 Heuristics can be useful in practice, but sometimes we would like to have guarantees. Approximation algorithms give guarantees. It is worth keeping in mind that sometimes approximation

More information

Introduction to Electronic Design Automation. Model of Computation. Model of Computation. Model of Computation

Introduction to Electronic Design Automation. Model of Computation. Model of Computation. Model of Computation Introduction to Electronic Design Automation Model of Computation Jie-Hong Roland Jiang 江介宏 Department of Electrical Engineering National Taiwan University Spring 03 Model of Computation In system design,

More information

Analysis of Conditional Resource Sharing using a Guard-based Control Representation *

Analysis of Conditional Resource Sharing using a Guard-based Control Representation * Analysis of Conditional Resource Sharing using a Guard-based Control Representation * Ivan P. Radivojevi c orrest Brewer Department of Electrical and Computer Engineering University of California, Santa

More information

Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2015 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

More information

Notes for Lecture 18

Notes for Lecture 18 U.C. Berkeley CS17: Intro to CS Theory Handout N18 Professor Luca Trevisan November 6, 21 Notes for Lecture 18 1 Algorithms for Linear Programming Linear programming was first solved by the simplex method

More information

Data Path Allocation using an Extended Binding Model*

Data Path Allocation using an Extended Binding Model* Data Path Allocation using an Extended Binding Model* Ganesh Krishnamoorthy Mentor Graphics Corporation Warren, NJ 07059 Abstract Existing approaches to data path allocation in highlevel synthesis use

More information

Pracy II Konferencji Krajowej Reprogramowalne uklady cyfrowe, RUC 99, Szczecin, 1999, pp Implementation of IIR Digital Filters in FPGA

Pracy II Konferencji Krajowej Reprogramowalne uklady cyfrowe, RUC 99, Szczecin, 1999, pp Implementation of IIR Digital Filters in FPGA Pracy II Konferencji Krajowej Reprogramowalne uklady cyfrowe, RUC 99, Szczecin, 1999, pp.233-239 Implementation of IIR Digital Filters in FPGA Anatoli Sergyienko*, Volodymir Lepekha*, Juri Kanevski**,

More information

CS 372: Computational Geometry Lecture 10 Linear Programming in Fixed Dimension

CS 372: Computational Geometry Lecture 10 Linear Programming in Fixed Dimension CS 372: Computational Geometry Lecture 10 Linear Programming in Fixed Dimension Antoine Vigneron King Abdullah University of Science and Technology November 7, 2012 Antoine Vigneron (KAUST) CS 372 Lecture

More information

arxiv:cs/ v1 [cs.ds] 20 Feb 2003

arxiv:cs/ v1 [cs.ds] 20 Feb 2003 The Traveling Salesman Problem for Cubic Graphs David Eppstein School of Information & Computer Science University of California, Irvine Irvine, CA 92697-3425, USA eppstein@ics.uci.edu arxiv:cs/0302030v1

More information

6.856 Randomized Algorithms

6.856 Randomized Algorithms 6.856 Randomized Algorithms David Karger Handout #4, September 21, 2002 Homework 1 Solutions Problem 1 MR 1.8. (a) The min-cut algorithm given in class works because at each step it is very unlikely (probability

More information

High-Level Synthesis (HLS)

High-Level Synthesis (HLS) Course contents Unit 11: High-Level Synthesis Hardware modeling Data flow Scheduling/allocation/assignment Reading Chapter 11 Unit 11 1 High-Level Synthesis (HLS) Hardware-description language (HDL) synthesis

More information

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Zhining Huang, Sharad Malik Electrical Engineering Department

More information

EXERCISES SHORTEST PATHS: APPLICATIONS, OPTIMIZATION, VARIATIONS, AND SOLVING THE CONSTRAINED SHORTEST PATH PROBLEM. 1 Applications and Modelling

EXERCISES SHORTEST PATHS: APPLICATIONS, OPTIMIZATION, VARIATIONS, AND SOLVING THE CONSTRAINED SHORTEST PATH PROBLEM. 1 Applications and Modelling SHORTEST PATHS: APPLICATIONS, OPTIMIZATION, VARIATIONS, AND SOLVING THE CONSTRAINED SHORTEST PATH PROBLEM EXERCISES Prepared by Natashia Boland 1 and Irina Dumitrescu 2 1 Applications and Modelling 1.1

More information

A note on the subgraphs of the (2 )-grid

A note on the subgraphs of the (2 )-grid A note on the subgraphs of the (2 )-grid Josep Díaz a,2,1 a Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya Campus Nord, Edifici Ω, c/ Jordi Girona Salgado 1-3,

More information

General properties of staircase and convex dual feasible functions

General properties of staircase and convex dual feasible functions General properties of staircase and convex dual feasible functions JÜRGEN RIETZ, CLÁUDIO ALVES, J. M. VALÉRIO de CARVALHO Centro de Investigação Algoritmi da Universidade do Minho, Escola de Engenharia

More information

The character of the instruction scheduling problem

The character of the instruction scheduling problem The character of the instruction scheduling problem Darko Stefanović Department of Computer Science University of Massachusetts March 997 Abstract Here I present some measurements that serve to characterize

More information

Parallel Query Processing and Edge Ranking of Graphs

Parallel Query Processing and Edge Ranking of Graphs Parallel Query Processing and Edge Ranking of Graphs Dariusz Dereniowski, Marek Kubale Department of Algorithms and System Modeling, Gdańsk University of Technology, Poland, {deren,kubale}@eti.pg.gda.pl

More information

Symbolic Buffer Sizing for Throughput-Optimal Scheduling of Dataflow Graphs

Symbolic Buffer Sizing for Throughput-Optimal Scheduling of Dataflow Graphs Symbolic Buffer Sizing for Throughput-Optimal Scheduling of Dataflow Graphs Anan Bouakaz Pascal Fradet Alain Girault Real-Time and Embedded Technology and Applications Symposium, Vienna April 14th, 2016

More information

Counting the Number of Eulerian Orientations

Counting the Number of Eulerian Orientations Counting the Number of Eulerian Orientations Zhenghui Wang March 16, 011 1 Introduction Consider an undirected Eulerian graph, a graph in which each vertex has even degree. An Eulerian orientation of the

More information

RISC IMPLEMENTATION OF OPTIMAL PROGRAMMABLE DIGITAL IIR FILTER

RISC IMPLEMENTATION OF OPTIMAL PROGRAMMABLE DIGITAL IIR FILTER RISC IMPLEMENTATION OF OPTIMAL PROGRAMMABLE DIGITAL IIR FILTER Miss. Sushma kumari IES COLLEGE OF ENGINEERING, BHOPAL MADHYA PRADESH Mr. Ashish Raghuwanshi(Assist. Prof.) IES COLLEGE OF ENGINEERING, BHOPAL

More information

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs Advanced Operations Research Techniques IE316 Quiz 1 Review Dr. Ted Ralphs IE316 Quiz 1 Review 1 Reading for The Quiz Material covered in detail in lecture. 1.1, 1.4, 2.1-2.6, 3.1-3.3, 3.5 Background material

More information