A Mathematical Formulation of the Loop Pipelining Problem
Jordi Cortadella, Rosa M. Badia and Fermín Sánchez
Department of Computer Architecture, Universitat Politècnica de Catalunya
Campus Nord, Mòdul D. Barcelona, Spain
Abstract

A mathematical model for the loop pipelining problem is presented. The model considers several parameters for optimization and supports any combination of resource and timing constraints. The unrolling degree of the loop is one of the variables explored by the model. By using Farey's series, an optimal exploration of the unrolling degree is performed and optimal solutions not considered by other methods are obtained. Finding an optimal schedule that minimizes resource requirements (including registers) is solved by an ILP model. A novel paradigm called branch and prune is proposed to efficiently converge towards the optimal schedule and prune the search tree for integer solutions, thus drastically reducing the running time. This is the first formulation that combines the unrolling degree of the loop with timing and resource constraints in a mathematical model that guarantees optimal solutions.
1 Introduction

It is well known that loops monopolize most of the execution time of programs. In many applications a few loops, if not only one, determine the throughput achievable by the implementation of a behavioral description. For example, DSP filters often consist of an infinite loop that executes repeatedly for every sample of the input stream. In architectural synthesis, the problem of optimizing loop execution under timing and area constraints is crucial to obtain high-quality architectures. The techniques proposed for this problem attempt to overlap the execution of different loop iterations to reduce the cycle count (initiation interval or II) per iteration. Different methods have been proposed with this goal: loop folding [12, 23], functional pipelining [17], loop winding [11], rotation scheduling [4], and percolation-based synthesis [29], among others. The area of fixed-rate DSP has also drawn the attention of other authors, who propose techniques for loop pipelining with timing constraints [3, 6, 34]. Similar (if not identical) problems are encountered in the area of optimizing compilers for parallel architectures (VLIW, superscalar and superpipelined). Under the general name of software pipelining [21], several techniques have been devised: modulo scheduling [31], URPR [36], Lam's algorithm [22] and perfect pipelining [1].

Loops are usually represented by means of a data dependence graph (DG). Figure 1 shows an example. Vertices represent computational nodes (henceforth called instructions). Unlabeled edges represent intra-loop dependences (ILDs), e.g. B_i (reads X[i]) depends on A_i (produces X[i]). Labeled edges represent loop-carried dependences (LCDs), e.g. B_{i+1} (reads Y[i+1]) depends on B_i (produces Y[i+1]). Labels indicate the number of iterations traversed by the dependence. Thus, ILDs can also be represented as 0-labeled edges.
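As a concrete illustration (ours, not code from the paper), the DG of Figure 1 can be represented as a list of labeled edges, with the ILDs being exactly the 0-labeled ones:

```python
# Dependence graph of Figure 1 as (u, v, distance) triples.
# distance 0 = intra-loop dependence (ILD); distance > 0 = loop-carried (LCD).
EDGES = [
    ("A", "B", 0),  # B_i reads X[i], produced by A_i
    ("B", "B", 1),  # B_{i+1} reads Y[i+1], produced by B_i
    ("B", "C", 0),  # C_i reads Y[i+1], produced by B_i
]

def ilds(edges):
    """ILDs are exactly the 0-labeled edges."""
    return [(u, v) for u, v, d in edges if d == 0]

print(ilds(EDGES))  # [('A', 'B'), ('B', 'C')]
```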
If no overlap between iterations were allowed when executing the loop, the total execution time would be 3I cycles (assuming 1-cycle instructions). Within each iteration, the execution of A_i, B_i and C_i must be sequential due to the data dependences. An overlapped execution would take I + 2 cycles to complete, as shown in Figure 2. The problem of loop pipelining is basically reduced to finding a steady-state pattern (a folded loop body) that can execute at the maximum rate allowed by the dependences [32]. In the new steady state, instructions from different iterations (folds) are executed (A_{i+f} denotes that the execution of A_i belongs to fold f). Under resource constraints, unrolling the loop before finding a steady state is crucial to obtain optimal solutions. Assume the schedule depicted in Figure 2(b). If only two adders are available, the steady state must be executed in at least two cycles (II = 2), as shown in Figure 3(a). However, if the loop is unrolled twice (Figure 3(b)), a schedule in three cycles can be found (II = 3/2).

    for i = 0 to I − 1 do
        A_i: X[i] = R[i] + S[i];
        B_i: Y[i+1] = Y[i] + 2·X[i];
        C_i: Z[i] = 4·Y[i+1] + T[i];
    endfor

Figure 1: Loop dependence graph
Figure 2: (a) Overlapped loop execution, with prologue, steady state (i = 0, …, I − 3) and epilogue. (b) Steady state.

Figure 3: Steady state with resource constraints: (a) without unrolling (II = 2), and (b) by unrolling the loop twice (II = 3/2).

Another important aspect to be considered is the span, defined as the number of folds required to obtain the steady state. Figure 4(b) shows a time-optimal steady state with SPAN = 4. An approach to find a resource-constrained steady state is to reschedule the time-optimal one according to the number of available resources (see Figure 4(c) for 2 resources), thus maintaining the same span. However, by considering resource constraints from the beginning, a steady state with lower span can be obtained (Figure 4(d)). As further commented in Section 5.5, smaller spans result in shorter variable lifetimes, in general reducing the schedule's register pressure [15].

The loop optimization problem addressed in this paper comprises a large variety of formulations with different timing and resource constraints. The two extreme cases are described next:

- Resource-constrained loop pipelining (RCLP). Given a set of resource constraints (functional units and registers), the objective is to find a pipelined loop schedule that minimizes the execution time.
- Time-constrained loop pipelining (TCLP). Given an upper bound on the execution time, the objective is to find a pipelined loop schedule that minimizes the cost of the resources required to execute the loop.

Between RCLP and TCLP there is a wide range of problems that can be formulated, e.g. finding a time-constrained schedule with constraints on a subset of resources. In the area of optimizing
compilers for parallel architectures, only the RCLP version of the problem is posed, since the resources are fixed by the target architecture.

Figure 4: (a) Dependence graph and steady states with: (b) II = 1, SPAN = 4, (c) II = 2, SPAN = 4, and (d) II = 2, SPAN = 2.

This paper presents UNRET (unrolling and retiming), a formal approach to solve the loop pipelining problem in which the following mathematical tools are used:

- Farey's series [35], to calculate the unrolling degrees of the loop that yield optimal solutions.
- Integer Linear Programming (ILP), to simultaneously solve the problems of scheduling and loop pipelining.

Since the delay decision (minimization) problem of loop pipelining with resource constraints is NP-complete (NP-hard) [5], several heuristics have been proposed to solve it in moderate computation time. Several authors have also used linear programming to obtain optimal or quasi-optimal solutions for the problems of scheduling and allocation in architectural synthesis [14, 18, 16, 10, 33]. A preliminary approach for loop pipelining was presented in [9]. The closest approaches related to the work presented here have been proposed in [13, 8, 7], in which the primary goal is the minimization of the register requirements of the final schedule. The main contributions of UNRET in relation to existing formal methods are the following:

UNRET performs an exhaustive analysis of the unrolling degrees of the loop that can derive optimal solutions for the available resources. Therefore, more than one instance of each operation can appear in the final schedule. Unlike other methods that perform loop unrolling [19, 27, 33], this paper presents a new approach that guarantees an optimal unrolling degree.
Similarly to [13, 8, 7], the number of folds for the steady state is automatically obtained by solving an ILP model for loop pipelining. The model has been extended to the minimization of resources under timing constraints. The maximum number of live variables at any cycle, MAXLIVE, is also minimized. MAXLIVE is an accurate lower bound on the number of registers required to execute the loop [30].

A new approach, which we have called Branch-and-Prune, is proposed to solve the ILP model. The heuristics devised to explore the space of solutions allow a rapid convergence to the optimal solution. ILP models with more than 1000 variables and 700 constraints have been solved in a reasonable running time.
The paper is organized as follows. Section 2 presents some preliminary definitions required for the mathematical model. Section 3 proposes a strategy for loop unrolling to find time-optimal schedules for the RCLP problem. A similar strategy for the TCLP problem is proposed in Section 4. The ILP model to find an optimal steady state is presented in Section 5. The branch and prune strategy to efficiently solve the ILP model is described in Section 6. Experimental results and conclusions are presented in Sections 7 and 8.

2 Basic definitions

For the sake of simplicity, we will first assume that all instructions can be executed in any of the resources of the architecture in one cycle. Henceforth, we will call n the number of instructions of the loop and r the number of resources available to execute instructions. Extensions to multiple-cycle, pipelined functional units will be addressed in Section 3.3.

2.1 Representation of a loop

A loop is represented by a labeled dependence graph, DG(V, E), in which vertices and edges represent instructions and data dependences respectively. Labels of the DG are defined by two mappings, φ (fold) and δ (dependence distance), in the following way:

- φ(u), defined on vertices, denotes the fold to which u belongs in the steady state (φ(u) ≥ 0). φ(u) = i will be denoted by u_i in the DG.
- δ(u, v), defined on edges, denotes the dependence distance (number of iterations traversed by the dependence) between instructions u and v. δ(u, v) = 0 corresponds to an ILD, and δ(u, v) > 0 to an LCD.

An ILD between u and v is represented by u_i --0--> v_i or simply u_i --> v_i. Similarly, an LCD between u and v with distance d is represented as u_i --d--> v_i. Initially, a loop is represented by a DG with only one fold, i.e. ∀u ∈ V: φ(u) = 0. After finding a steady state, each instruction will be labeled with a fold representing the relative execution skew (in iterations) with regard to the other instructions of the loop.
Equivalent labeling functions of the original DG can be obtained by simple transformations. The way φ and δ can be transformed is illustrated in Figure 5. Assume an ILD A_i --> B_i exists in the initial DG. A schedule for the loop is depicted in Figure 5(a), which corresponds to the sequential execution of the original loop body that has only one fold (i). However, the ILD can be transformed into an LCD while preserving the semantics of the loop. Thus, the dependence A_{i+1} --1--> B_i (or, in general, A_{i+d} --d--> B_i) is equivalent to the previous one. This transformation is analogous to the retiming technique proposed to minimize the clock period in synchronous systems [24]. Figure 5(b) shows the transformed DG with two folds (i and i+1) and one LCD. A new schedule can be found for the loop body by considering that A_{i+1} and B_i no longer have an ILD between them. LCDs are semantically preserved by the sequential execution of different iterations of the same loop body. Therefore, ILDs can be transformed into LCDs by changing the fold assignment and, thus, shorter schedules for the loop body can be found. A formal definition of equivalent labeling functions can now be presented.

Definition 1 [Equivalent labeling functions]: Let (φ, δ) and (φ′, δ′) be two different labeling functions for the dependence graph DG = (V, E). The functions are equivalent if, for all (u, v) ∈ E:

    φ(v) − φ(u) + δ(u, v) = φ′(v) − φ′(u) + δ′(u, v)    (1)
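The condition of Definition 1 is mechanical to check. The following sketch (our illustration, not the paper's code) verifies that the retiming of Figure 5 yields an equivalent labeling:

```python
def equivalent(phi1, delta1, phi2, delta2, edges):
    """Definition 1: two labelings (phi, delta) are equivalent iff
    phi(v) - phi(u) + delta(u, v) is the same under both labelings,
    for every edge of the DG."""
    return all(
        phi1[v] - phi1[u] + delta1[(u, v)] ==
        phi2[v] - phi2[u] + delta2[(u, v)]
        for u, v in edges
    )

# The retiming of Figure 5: an ILD A -> B (distance 0, both in fold 0)
# becomes an LCD with distance 1 once A is moved to fold 1.
edges = [("A", "B")]
before = ({"A": 0, "B": 0}, {("A", "B"): 0})
after  = ({"A": 1, "B": 0}, {("A", "B"): 1})
print(equivalent(*before, *after, edges))  # True
```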
Figure 5: (a) Schedule of a loop with one ILD. (b) Schedule of the same loop with an equivalent labeling function and one LCD.

It is not difficult to prove that equivalent labeling functions preserve the semantics of a loop [24].

2.2 Initiation Interval

As proposed by other authors [22], UNRET first calculates a lower bound on the II of the loop: the minimum initiation interval (MII),

    MII = max(ResMII, RecMII)

where ResMII is determined by the number of available resources and RecMII by the cycles (recurrences) formed by the dependences of the loop. Clearly, ResMII = n/r, since each functional unit can execute only 1 instruction/cycle.

A recurrence R is a set of edges that form a cycle. Let us define δ_R as

    δ_R = Σ_{(u,v)∈R} δ(u, v)    (2)

For any operation u such that ∃(u, v) ∈ R, u_i must be scheduled at least |R| cycles before u_{i+δ_R}. Therefore, R imposes a minimum initiation interval, RecMII_R, on the execution of the loop [32, 5, 27]:

    RecMII_R = |R| / δ_R

A loop can have several recurrences. Therefore:

    RecMII = max_R RecMII_R

In the worst case, the number of cycles of a DG grows exponentially with |E|. However, RecMII can be found in polynomial time by using Karp's algorithm to calculate the weight of the minimum mean-weight cycle in a graph [20]. In the example of Figure 6(a), there are three recurrences with the same value of RecMII. Let us take A_0 --> B_0 --> E_0 --3--> A_0, with δ_R = 3 and |R| = 3. This indicates that A_3 must be executed at least 3 cycles after A_0, thus resulting in RecMII_R = 1. The other two recurrences are isomorphic to this one.
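The MII computation can be sketched as follows. This is our illustration, not UNRET's implementation: for a small DG we enumerate simple cycles by DFS, whereas Karp's minimum mean-weight cycle algorithm achieves the same in polynomial time. Unit-delay instructions are assumed, so |R| is the number of edges of the recurrence.

```python
from fractions import Fraction

def rec_mii(edges):
    """RecMII = max over recurrences R of |R| / delta_R, by DFS
    enumeration of simple cycles (fine for small DGs)."""
    adj = {}
    for u, v, d in edges:
        adj.setdefault(u, []).append((v, d))
    best = Fraction(0)

    def dfs(start, node, path, dist):
        nonlocal best
        for v, d in adj.get(node, []):
            if v == start and dist + d > 0:
                best = max(best, Fraction(len(path), dist + d))
            elif v != start and v not in path:
                dfs(start, v, path + [v], dist + d)

    for s in list(adj):
        dfs(s, s, [s], 0)
    return best

def mii(n, r, edges):
    """MII = max(ResMII, RecMII), with ResMII = n/r for n unit-delay
    instructions on r identical resources (MII may be rational)."""
    return max(Fraction(n, r), rec_mii(edges))

# Recurrence A -> B -> E --3--> A as in Figure 6(a): delta_R = 3 and
# |R| = 3, so RecMII = 1; with n = 5 and r = 4, MII = ResMII = 5/4.
edges = [("A", "B", 0), ("B", "E", 0), ("E", "A", 3)]
print(mii(5, 4, edges))  # 5/4
```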
Figure 6: (a) DG example. (b) Different (II_K, K) pairs for 4 resources.

3 Loop unrolling for RCLP

The length of a schedule is always an integer number of cycles, whereas the MII of a loop may be a rational number. In the example of Figure 3 we have MII = 3/2 for r = 2. If only one loop instance is taken, at least ⌈3/2⌉ = 2 cycles are needed for any schedule. However, a schedule with optimal II can be found by unrolling the loop twice. In general, a necessary (but not sufficient) condition to find a schedule with II = MII = a/b (a reduced fraction) is to unroll the loop K = b times. Systematic approaches for loop unrolling of data-flow programs are presented in [19, 27, 33]. However, they do not guarantee optimal solutions, as will be demonstrated with the example of Figure 6.

Let II_K be the initiation interval of a steady state comprising K instances of the loop body (II = II_K/K). The goal of UNRET is to find a schedule that minimizes II. This is done by exploring pairs (II_K, K) in increasing order of II, starting from MII. In order to bound the search space for (II_K, K), a maximum value for II_K is defined: the maximum length of the steady state (II_max).

Figure 6(a) depicts the DG of a 5-instruction loop. If 4 resources are used for execution, then RecMII = 1 and MII = ResMII = 5/4. The diagram in Figure 6(b) represents the pairs (II_K, K) that can be explored to find a feasible schedule for the loop. These pairs correspond to the points with integer values for II_K and K enclosed in the triangle limited by the following lines: K = 0, II_K = II_max and II_K/K = MII. Distinct points with the same II lie on the same line. Among all points lying on the same line, those with smaller values of II_K (or K) will be preferred, since they produce shorter steady states.
In the example, the first point to be explored to minimize II is C = (5, 4), meaning that 4 iterations must be executed in 5 cycles, which results in II = MII = 1.25. In this case, no feasible schedule with such characteristics exists. In [31], II_K is incremented by 1 when no schedule is found and, thus, the point B = (6, 4) is explored next. This results in a suboptimal solution. However, a time-optimal solution (for II_max = 8) is found if the point A = (4, 3) is explored after C = (5, 4). But is there any efficient strategy to explore all points in increasing order of II?
Table 1: Pairs (II_K, K) explored for the example of Figure 6.

3.1 Farey's series

For a fixed integer D > 0, the sequence of all reduced fractions whose denominators do not exceed D, arranged in increasing order of magnitude, is the Farey's series of order D (F_D). For example, F_5 is the series of fractions:

    F_5 = 0/1, 1/5, 1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4, 4/5, 1/1, …

Let x_i/y_i be the i-th element of the series. F_D can be generated by the following recurrence [35]. The first two elements are

    x_0/y_0 = 0/1,    x_1/y_1 = 1/D

The generic term x_i/y_i can be calculated as (in increasing order)

    x_i = ⌊(y_{i−2} + D) / y_{i−1}⌋ · x_{i−1} − x_{i−2}
    y_i = ⌊(y_{i−2} + D) / y_{i−1}⌋ · y_{i−1} − y_{i−2}

The same three-term recurrence, started from the two largest elements, generates the series in decreasing order. For a given integer k, the following consecutive fractions always belong to F_D:

    (kD − 1)/D,    k/1,    (kD + 1)/D

All points (II_K, K) correspond to fractions (or equivalent fractions) K/II_K of the series F_{II_max}. An efficient strategy to generate the points in increasing order of II is the following:

- Let m = ⌈1/MII⌉, x_0/y_0 = m/1 and x_1/y_1 = (m·II_max − 1)/II_max.
- Generate fractions in decreasing order until a fraction x_p/y_p ≤ 1/MII is found.
- Explore schedules for all fractions (and equivalent fractions) generated in decreasing order from x_p/y_p.

For a reduced fraction x_i/y_i, all equivalent fractions (k·x_i)/(k·y_i) (with k·y_i ≤ II_max) are generated in increasing order of the denominator (k = 1, 2, …). Table 1 shows the explored points for the example of Figure 6.
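As a sketch of the increasing-order recurrence (the paper's search additionally continues the series past 1/1 and runs it downward), the series F_5 listed above can be reproduced as follows:

```python
def farey(D):
    """Farey's series F_D in increasing order, via the recurrence
    x_i = floor((y_{i-2} + D) / y_{i-1}) * x_{i-1} - x_{i-2}
    (and the same for y), starting from 0/1 and 1/D."""
    a, b, c, d = 0, 1, 1, D   # x0/y0 = 0/1, x1/y1 = 1/D
    terms = [(a, b)]
    while (c, d) != (1, 1):
        terms.append((c, d))
        k = (b + D) // d
        a, b, c, d = c, d, k * c - a, k * d - b
    terms.append((1, 1))
    return terms

# Reproduces the F_5 sequence given in the text, from 0/1 up to 1/1.
print(farey(5))
```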
Figure 7: A DG unrolled 3 times.

3.2 Loop unrolling

Every pair (II_K, K) obtained from Farey's series denotes a different unrolling degree for the loop and a target initiation interval. Before trying to find a schedule with such characteristics, a new DG with K instances of the original loop body must be obtained. Unrolling a DG K times generates another DG in which each vertex v is instantiated K times (v_0, v_1, …, v_{K−1}). Besides, data dependence distances must be changed according to the unrolling degree. A dependence u --d--> v in the original DG will be represented by K dependences u_i --⌊(i+d)/K⌋--> v_{(i+d) mod K} (i = 0, …, K−1) in the unrolled DG [27] (see the example in Figure 7).

3.3 Extension to multiple types of resources

In the most general case the architecture has several types of resources or functional units (FUs). Each FU is able to execute more than one type of operation. We also assume that allocation has already been done and, therefore, each instruction can only be executed in one type of FU. Let us first present some preliminary definitions:

- M = {1, …, m} is the set of FU types of the architecture.
- F = (F_1, F_2, …, F_m) is the resource vector of the architecture, where F_t is the number of FUs of type t.
- ALLOC(t) is the set of instructions allocated to the FU type t.
- FU(u) is the type of FU allocated for instruction u.
- OP(u) is the operation executed by instruction u.
- T(u) is the delay (cycle count) required to execute OP(u) in FU(u).
- L(u) is the latency of FU(u) to execute OP(u) (minimum cycle count between successive inputs to the FU). For pipelined FUs, L(u) < T(u); otherwise L(u) = T(u).

Given a heterogeneous set of resources, the calculation of ResMII is done as follows:

    ResMII = max_{t∈M} ResMII_t
with

    ResMII_t = (Σ_{u∈ALLOC(t)} L(u)) / F_t

The numerator (Σ L(u)) corresponds to the number of cycles in which the FUs of type t are busy. We will say that t is a critical FU type if ResMII = ResMII_t. To calculate RecMII, the delay associated with each recurrence R of the loop is now obtained as follows¹ [32, 5]:

    RecMII_R = (Σ_{u∈R} T(u)) / δ_R

The generation of the pairs (II_K, K) can now be done similarly to that presented for a homogeneous set of resources, by first calculating MII and then using Farey's series to obtain II_K and K.

3.4 UNRET algorithm for RCLP

The strategy followed by UNRET to solve the problem of RCLP is the following:

1. Calculate MII.
2. Find the first unrolling degree (K) and expected initiation interval (II_K) that minimize II such that II ≥ MII.
3. Unroll the loop K times.
4. Find a steady state with length II_K (ILP approach explained in Section 5).
5. If no schedule with II_K cycles is found, generate new values for K and II_K that minimize II (II is explored in increasing order by using Farey's series) and go to step 3.

4 Loop unrolling for TCLP

The problem of TCLP starts from a DG and a target initiation interval (II_T). Unlike RCLP, the number of some (or all) resources of the architecture is not constrained. Therefore, ResMII must be calculated by only considering those resources with constraints. In case II_T < MII, no feasible schedule for that DG exists. Otherwise, different unrolling degrees for the loop must be explored, similarly to what is done for RCLP, in such a way that

    MII ≤ II_K/K ≤ II_T

4.1 UNRET algorithm for TCLP

The strategy followed by UNRET to solve the problem of TCLP is the following:

1. Calculate MII. If II_T < MII, exit (no feasible schedule exists).

¹ Although formally incorrect, we use u ∈ R to refer to the vertices of the recurrence.
2. Find the first unrolling degree (K) and expected initiation interval (II_K) that maximize II such that II ≤ II_T.
3. Unroll the loop K times.
4. Find a steady state with length II_K that minimizes the cost of the required resources (ILP approach explained in Section 5).
5. Optimize the initiation interval of the schedule. This is done by successively generating new values for K and II_K that minimize II (II is explored in decreasing order by using Farey's series). If II < MII then exit (an optimal-time schedule has been found). Otherwise, for each pair (II_K, K), the loop is unrolled K times, and the algorithm attempts to find a schedule in II_K cycles by using the set of resources found at step 4. In this case, Farey's series for K/II_K is explored, starting from the value closest to 1/II_T.

5 Loop pipelining: An ILP approach

Both the RCLP and the TCLP problems can be solved by the same mathematical model. As shown in Figure 5, different labeling functions (φ, δ) for the same DG can result in different steady states. The problem of loop pipelining can be reduced to two interrelated subproblems that can be simultaneously solved by using an ILP model:

1. Folding the DG, i.e. finding the functions φ and δ.
2. Finding a schedule for the folded DG subject to the data dependences and resource constraints.

5.1 Preliminaries

Initially, a DG obtained by unrolling the original DG K times will be given. Moreover, an objective initiation interval (II_K) will fix the number of cycles of the steady state. Hereafter, and for the sake of brevity, we will use II instead of II_K. C = {0, …, II − 1} will denote the set of cycles of the steady state. All variables used in the model are nonnegative integers. The least nonnegative residue system (0, 1, …, k − 1) will be used when modulo k (mod k) operations are performed on constants (e.g. −8 mod 5 = 2).
5.2 Loop folding constraints

The following variables are defined for the labeling functions of the DG:

    φ_u, ∀u ∈ V (fundamental variables)
    λ_{u,v}, ∀(u, v) ∈ E (auxiliary variables)

Folding the loop means finding an equivalent labeling function for the DG. According to equation (1), the auxiliary variables λ_{u,v} are defined as follows:

    λ_{u,v} = φ_u − φ_v + k_{u,v},  ∀(u, v) ∈ E    (3)
Figure 8: Time frames to schedule v_i for different loop foldings (T(u) = 2 and II = 3).

with k_{u,v} = φ(v) − φ(u) + δ(u, v), where φ(u), φ(v) and δ(u, v) are constants denoting the initial labeling of the DG (usually φ(u) = φ(v) = 0).

5.3 Scheduling and data dependence constraints

The following variables are first defined:

    s_{u,i} = 1 if u starts executing at cycle i, and 0 otherwise,  ∀u ∈ V, i ∈ C

The following constraint is required to guarantee that an instruction is scheduled at one, and only one, cycle of the steady state:

    Σ_{i∈C} s_{u,i} = 1,  ∀u ∈ V    (4)

For simplicity in the definition of data dependence constraints, the auxiliary variable c_u will be used to denote the cycle at which u is scheduled. Hence,

    c_u = Σ_{i∈C} i · s_{u,i},  ∀u ∈ V    (5)

The constraint that guarantees data dependences to be honored in the schedule is the following [5, 12, 18, 23]:

    c_v ≥ c_u + T(u) − II · λ_{u,v},  ∀(u, v) ∈ E    (6)

Figure 8 illustrates how λ_{u,v} influences the schedule of v. If λ_{u,v} = 0 (ILD), then v must be scheduled after the completion of u (cycle c_u + T(u)), as shown in Figure 8(a). By increasing the fold from u_i to u_{i+m} (Figures 8(b) and 8(c)), u_i is implicitly rescheduled m·II cycles backwards, thus increasing the time frame to schedule v_i by the same amount.
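The effect of folding on constraint (6) can be sketched as follows (our illustration, under the Figure 8 setting of T(u) = 2 and II = 3): each fold on an edge relaxes the dependence constraint by II cycles.

```python
def dependences_ok(c, T, lam, II, edges):
    """Check constraint (6), c_v >= c_u + T(u) - II * lambda_{u,v},
    for every edge of the DG."""
    return all(c[v] >= c[u] + T[u] - II * lam[(u, v)] for u, v in edges)

edges = [("u", "v")]
# As an ILD (lambda = 0), v one cycle after u violates the constraint
# since u takes T(u) = 2 cycles; with one fold (lambda = 1) the
# constraint is relaxed by II = 3 cycles and the schedule is legal.
print(dependences_ok({"u": 0, "v": 1}, {"u": 2}, {("u", "v"): 0}, 3, edges))  # False
print(dependences_ok({"u": 0, "v": 1}, {"u": 2}, {("u", "v"): 1}, 3, edges))  # True
```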
5.4 Resource constraints

The following constraints guarantee that no more than F_t functional units are used at each cycle:

    Σ_{u∈ALLOC(t)} Σ_{j=i−L(u)+1}^{i} s_{u,(j mod II)} ≤ F_t,  ∀t ∈ M, i ∈ C    (7)

An instruction u uses an FU during the cycles c_u … c_u + L(u) − 1 (mod II). In case the latency of the operation is longer than II, the execution of several instances of the same operation may overlap and, therefore, more than one FU may be required. Assume, for example, II = 3 and L(u) = 5, u being the only instruction that uses FUs of type t. The generated constraints for F_t are the following:

    s_{u,−4 mod 3} + s_{u,−3 mod 3} + s_{u,−2 mod 3} + s_{u,−1 mod 3} + s_{u,0 mod 3} ≤ F_t  (for cycle 0)
    s_{u,−3 mod 3} + s_{u,−2 mod 3} + s_{u,−1 mod 3} + s_{u,0 mod 3} + s_{u,1 mod 3} ≤ F_t  (for cycle 1)
    s_{u,−2 mod 3} + s_{u,−1 mod 3} + s_{u,0 mod 3} + s_{u,1 mod 3} + s_{u,2 mod 3} ≤ F_t  (for cycle 2)

which, after modulo calculations, results in:

    2s_{u,0} + s_{u,1} + 2s_{u,2} ≤ F_t  (for cycle 0)
    2s_{u,0} + 2s_{u,1} + s_{u,2} ≤ F_t  (for cycle 1)
    s_{u,0} + 2s_{u,1} + 2s_{u,2} ≤ F_t  (for cycle 2)

5.5 Register requirements

The maximum number of variables whose lifetimes overlap at any cycle, MAXLIVE [30], is the minimum number of registers required for a schedule. Each edge u --> v of the DG determines the variable lifetime of the output of u with regard to v. The variable lifetime spreads from the completion of instruction u (cycle c_u + T(u)) to the cycle in which the FU executing v no longer requires the input data (cycle c_v + L(v) − 1). Other execution models can also be considered for registers. In [8], variable lifetimes spread from the initiation of u to the initiation of v, a model more appropriate for parallel architectures. As depicted in the example of Figure 9, an edge with λ_{u,v} = k will spread the variable lifetime across k iterations. For any cycle c in the steady state, an edge will require k − 1, k, or k + 1 registers, depending on the scheduling of instructions u and v.
Three cases can be distinguished for T(u) = L(v) = 1 (see examples in Figure 9):

a) u --k--> v crosses cycle c forward; then k + 1 registers are required (example: edge u_{i+2} --2--> v_i at cycle 2).
b) u --k--> v does not cross cycle c; then k registers are required (example: edge u_{i+2} --2--> v_i at cycles 0 and 1, or edge r_{i+1} --1--> s_i at cycle 0).
c) u --k--> v crosses cycle c backwards; then k − 1 registers are required (example: edge r_{i+1} --1--> s_i at cycles 1 and 2).

The following auxiliary variables are defined to specify whether an operation reads or writes a result before a given cycle:
Figure 9: Register requirements for folded loops: (a) Folded DG, (b) Schedule and cycles crossed by edges, (c) Variable lifetimes and required registers.

    READS_BEFORE_{u,c} = Σ_{i=0}^{c−1} s_{u,(i−L(u)+1) mod II}  (u reads before cycle c)
    WRITES_BEFORE_{u,c} = Σ_{i=0}^{c−1} s_{u,(i−T(u)+1) mod II}  (u writes before cycle c)

The auxiliary variable r_{u,v,c} defines the number of registers required at cycle c to store a result produced by operation u and consumed by operation v:

    r_{u,v,c} = λ_{u,v} − [⌊(T(u)−1)/II⌋ + WRITES_BEFORE_{u,(T(u)−1) mod II}]
              + [⌊(L(v)−1)/II⌋ + READS_BEFORE_{v,(L(v)−1) mod II}]
              + WRITES_BEFORE_{u,c} − READS_BEFORE_{v,c}    (8)

The expressions in brackets determine a correction on λ_{u,v} produced by the execution time of u and the latency of v, i.e. no register is required to store the value while u executes, whereas a register is required during the first L(v) cycles of v's execution. In case T(u) = L(u) = 1, as in the examples
depicted in Figure 9, the following simplified expression is obtained:

    r_{u,v,c} = λ_{u,v} + WRITES_BEFORE_{u,c} − READS_BEFORE_{v,c}

A variable can be the input of more than one instruction. This is shown in the DG by the fact that some edges have the same source vertex. However, there is no need to allocate different registers for the same variable. The number of registers required is the maximum over all edges with the same source. Therefore, the variables r_{u,c} can also be defined on vertices as follows:

    r_{u,v,c} ≤ r_{u,c},  ∀u ∈ V, c ∈ C    (9)

It can easily be proved that if operation v is the last use of u's result, then r_{u,v,c} = r_{u,c} for any cycle c [8]. The following constraint must be included to define MAXLIVE:

    Σ_{u∈V} r_{u,c} ≤ ML,  ∀c ∈ C    (10)

ML being a variable minimized in the objective function.

5.6 Objective function

Assume we have a cost (area) vector A = (A_1, …, A_m) for the FUs of the architecture. Similarly, we can have a cost A_r for each register². The objective function of the ILP model can be formulated as follows:

    min Area = Σ_{t∈M} A_t · F_t + A_r · ML    (11)

Note that F_t and ML can be either variables or constants. This depends on the initial constraints defined for the problem. In case all of them are constants, the ILP model will merely find a feasible solution. Here we have considered ML as the number of registers required by the schedule. Although this might not be exact, it has been shown to be a realistic assumption for many practical cases [30].

5.7 Complexity of the model

Table 2 describes the fundamental variables and constraints of the model. |V| and |E| denote the number of nodes and edges of the DG respectively, whereas m is the number of different FU types of the architecture. Some of the variables, e.g. F_t and ML, may become constants if the number of resources of the architecture is defined in advance.
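The register model of Section 5.5 can be evaluated for a fixed schedule. The sketch below is our illustration (not UNRET code) of the simplified case T(u) = L(v) = 1: each edge needs λ_{u,v} + WRITES_BEFORE_{u,c} − READS_BEFORE_{v,c} registers at cycle c, edges sharing a source share registers, and MAXLIVE is the largest per-cycle total.

```python
def maxlive(cycle, edges, II):
    """MAXLIVE for unit-time instructions (T = L = 1): per source
    vertex, keep the maximum register count over its out-edges, sum
    over vertices, and maximize over the II cycles of the steady state."""
    best = 0
    for c in range(II):
        per_vertex = {}
        for u, v, lam in edges:
            # Simplified r_{u,v,c}: WRITES_BEFORE / READS_BEFORE reduce
            # to "has u (resp. v) already started before cycle c?".
            r = lam + (1 if cycle[u] < c else 0) - (1 if cycle[v] < c else 0)
            per_vertex[u] = max(per_vertex.get(u, 0), r)
        best = max(best, sum(per_vertex.values()))
    return best

# Hypothetical steady state with II = 3: u at cycle 0 feeds v at
# cycle 2 across lambda = 2 folds, so 3 registers are needed at the
# cycles the value crosses forward.
print(maxlive({"u": 0, "v": 2}, [("u", "v", 2)], II=3))  # 3
```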
6 Branch and Prune

The most popular method to solve ILP models is the combination of branch-and-bound techniques with a linear-programming solver such as simplex [26]. Having integer variables in the model increases complexity from polynomial to exponential. The running time for ILP is highly influenced by the

² Piecewise linear cost functions for registers can also be incorporated, as proposed in [10].
    variable   number             constraint  number
    φ_u        |V|                (4)         |V|
    s_{u,i}    |V|·II             (6)         |E|
    F_t        m                  (7)         m·II
    r_{u,c}    |V|·II             (8,9)       |V|·II
    ML         1                  (10)        II
    total      |V|(2II+1)+m+1     total       |V|(II+1)+|E|+II(m+1)

Table 2: Variables and constraints of the ILP model

exploration of the branch-and-bound tree. For an ILP solver insensitive to the problem, the tree of solutions is blindly explored and finding valid integer solutions may require solving an excessive number of LP problems. We have implemented an ad-hoc solver for loop pipelining that takes advantage of the information known a priori about the problem. The solver uses a branch-and-bound paradigm to explore the space of integer solutions but, rather than indiscriminately invoking an LP solver at each branch, it allows rapid pruning when enough information has been captured to evaluate the objective function. The order in which integer variables are explored is also crucial to reduce the running time. The approach used for this problem, which we have called branch and prune, is presented next.

6.1 Retiming and scheduling

The problem of loop pipelining can be decomposed into two different problems:

1. Retiming, or finding the loop fold for each operation, and
2. Scheduling, or assigning operations to cycles.

After solving retiming, i.e. defining φ_u for all operations, the problem of loop pipelining is reduced to the scheduling of a basic block in which not all dependences must be taken into account. According to constraint (6), all those dependences with T(u) − II·λ_{u,v} ≤ 0 are not relevant for scheduling³. This is the rationale behind our first decision to define the exploration order of the integer variables: first all φ's (retiming) and then the rest (scheduling).

6.1.1 Retiming

Recurrences are the most stringent constraints to define the operation folds, since the sum of their edge distances must be a constant (equation (2)). Let us take, as an example, the recurrence A-B-C-D-A of the DG depicted in Figure 10. Assume we define φ_C = 10 and φ_D = 9.
Thus we have that δ_{C,D} = 1, and the folds for A and B are implicitly defined by the fact that only one of the edges of the recurrence can have δ = 1. Therefore, ρ_A = ρ_B = 10. Similarly for the recurrence C-D-E-C, in which ρ_E = 10 for the same reason.

Figure 10 shows the exploration tree for the example. The order of the variables is selected as follows: first explore the nodes belonging to the most stringent recurrences; inside each recurrence, first explore those nodes that also belong to other recurrences. In the example, the recurrence C-D-E-C

³ The value of δ_{u,v} can be obtained from (3) once ρ_u and ρ_v have been defined.
[Figure 10: Branch-and-prune search tree for the folds ρ_u]

is the most stringent, and nodes C and D also belong to other recurrences. Thus, the order C-D-E-A-B has been chosen. Since only the relative difference between the ρ's is required to determine the steady-state pattern, the fold of the first node explored can always be fixed (ρ_C = 10 in this case).

For nodes not belonging to any recurrence, the exploration order is defined according to a neighborhood criterion with respect to the nodes in the recurrences. In these cases, however, the number of branches to be explored may be unbounded, since no recurrence limits the value of ρ. For this reason, an early calculation of a lower bound on the register requirements is done at each node of the tree. This calculation assumes the worst-case values for the auxiliary variables WRITES_BEFORE and READS_BEFORE in constraint (8). In practice, this estimation has proved very effective for pruning the search tree when compared with the cost of the best solution obtained so far.

Scheduling

After the folds of the operations have been defined, a scheduling problem is posed for each leaf of the search tree. At this point, some dependences of the DG can be eliminated from the model (those with T(u) − II·δ_{u,v} ≤ 0) and the critical path of the retimed DG can be calculated. If the critical path is longer than II, no feasible schedule exists and the exploration is pruned. The critical path of the DG can even be checked before all the ρ's have been defined, whenever the information available at a node of the search tree suffices to discard a solution.

The variables defined at this stage of the problem are the scheduling variables (s_{u,i}). Here the exploration order is also crucial, and we have chosen one of the well-known algorithms for time-constrained scheduling, force-directed scheduling (FDS) [28], to assist the solver in defining an efficient strategy to generate the search tree.
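As an illustration (our own sketch, not the authors' implementation), the critical-path pruning test just described can be written as follows; T, II and δ stand for the latencies, initiation interval and retimed dependence distances of the model:

```python
from functools import lru_cache

# Sketch of the pruning test described above. A dependence u -> v with
# T(u) - II * delta(u, v) <= 0 is already satisfied across iterations and
# is dropped; if the longest path of the remaining (acyclic) retimed graph
# exceeds II, the branch can be pruned.

def critical_path(nodes, edges, T, II, delta):
    """Length in cycles of the longest path over the still-relevant edges."""
    relevant = [(u, v) for (u, v) in edges if T[u] - II * delta[(u, v)] > 0]
    succ = {u: [] for u in nodes}
    for u, v in relevant:
        succ[u].append(v)

    @lru_cache(maxsize=None)
    def longest_from(u):
        # T(u) plus the longest chain of dependent operations after u.
        return T[u] + max((longest_from(v) for v in succ[u]), default=0)

    return max(longest_from(u) for u in nodes)

def prune_branch(nodes, edges, T, II, delta):
    """True when no feasible schedule exists for this retiming."""
    return critical_path(nodes, edges, T, II, delta) > II
```

Note how retiming interacts with the test: increasing δ on an edge can remove it from the relevant set, shortening the critical path and keeping the branch alive at a smaller II.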
FDS performs a stepwise assignment of operations to cycles according to criteria that attempt to balance the utilization of resources over all cycles of the schedule. In our model, each operation has II 0-1 variables, at most one of which is 1. The order of operation assignments produced by FDS is the one we use to generate the search tree. The cycle to which each operation is scheduled by
[Figure 11: Generation of the search tree for scheduling, guided by FDS]

FDS is the first value explored. Assignments to the other cycles are explored in increasing order of the forces estimated by FDS (see Figure 11). This strategy leads the solver to a near-optimal solution very soon, thus allowing an efficient pruning of the tree for the other branches. At each node of the tree, in which some operations have already been assigned to cycles, the scheduling constraints no longer relevant for the final solution are eliminated (e.g. if ASAP(u) = 2, the variables s_{u,0} and s_{u,1} are not defined and their corresponding constraints are simplified). As in branch-and-bound, the LP solver is invoked at each branch of the tree to prune those solutions with a cost greater than that of the best integer solution found.

7 Experimental results

The techniques presented in the previous sections have been implemented using the package lp_solve [2]. In this section, several results are reported to show the main features of the method: the exploration of the optimal unrolling degree and the efficiency of the branch-and-prune strategy for solving the ILP model. The types of resources used in the schedules are adders (1-cycle latency) and multipliers (2-cycle latency).

7.1 Optimal unrolling degree

Table 3 presents the results obtained by exploring two different unrolling degrees (see Table 1) for the example depicted in Figure 6. An RCLP problem with 4 resources has been solved. MAXLIVE is also minimized for the optimal II found by the model. The columns |V| and |E| indicate the number of nodes and edges of the DG after unrolling the loop. #Vars and #Const indicate the number of variables and constraints of the ILP model. SPAN is calculated as ρ_max − ρ_min + 1 (the number of different iterations involved in the schedule).
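To make the unrolling-degree exploration concrete: the achievable average initiation intervals II/K with unrolling degree K ≤ k_max are exactly the fractions with bounded denominator, so the candidates between the rate-optimal bound and a budget form a segment of a Farey-like series. A minimal sketch (our own; the function name is hypothetical) enumerating the distinct candidates:

```python
from fractions import Fraction

# Sketch (ours) of the unrolling-degree exploration: every average
# initiation interval II/K with K <= k_max is a fraction with bounded
# denominator, so the candidates in [lo, hi] form a Farey-like series.

def candidate_rates(lo, hi, k_max):
    """Distinct fractions II/K in [lo, hi] with 1 <= K <= k_max, ascending.

    Each returned Fraction in lowest terms names a candidate point:
    II = numerator, K = denominator (e.g. Fraction(4, 3) is II=4, K=3).
    """
    seen = set()
    for k in range(1, k_max + 1):
        ii = -(-lo.numerator * k // lo.denominator)  # ceil(lo * k)
        while Fraction(ii, k) <= hi:
            seen.add(Fraction(ii, k))
            ii += 1
    return sorted(seen)
```

For instance, between rates 1 and 2 with K ≤ 4 the candidates are 1, 5/4, 4/3, 3/2, 5/3, 7/4 and 2; a point such as A = (4, 3) corresponds to the fraction 4/3, which methods restricted to K/II = 1 would never consider.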
The first row presents the result for II = 5 and K = 4 (point C in Figure 6), which is an infeasible problem. The second row presents the result for point A = (4, 3). This case obtains better results in MAXLIVE and II than the schedule found for the third point (B = (6, 4)), which is the one explored by other techniques that perform sub-optimal loop unrolling.

Table 4 shows the results obtained for some examples selected from the set presented in [13]. In all cases 2 load/store units, 2 multipliers, 3 adders and 1 divider have been used. Note that unrolling
is required to achieve a schedule in MII cycles in all cases. Table 5 shows the results obtained for the Cytron example when no register minimization is required.

[Table 3: Results obtained for the example of Figure 6(a) (columns: Resources, MII, II, K/II_K, |V|, |E|, #Vars, #Const, CPU, SPAN, MAXLIVE)]

[Table 4: Results obtained for a selection of examples from the Perfect Club set (SPEC-SPICE loop7 and loop8, LIVERMORE loop1)]

[Table 5: Results for the Cytron example when no register minimization is required]

7.2 ILP model for RCLP

A significant part of the computational cost of solving the ILP model is spent in guaranteeing that the final solution is optimal. For this reason, we also report the best solution obtained after 1 minute of CPU time⁴. The results presented in Tables 6 and 7 show that benchmarks with a large number of operations can be solved at moderate computational cost. The method can also be used for heuristic search by limiting the maximum CPU time and reporting the best solution found at that moment (e.g. after one minute). The results demonstrate that the optimal solution can often be obtained using only a small fraction of the total computational cost, although a proof of optimality cannot be given in this case. Optimal solutions have been found for models with more than 1000 variables and 700 constraints.

Table 8 shows the results obtained for the differential equation solver. For this example two types of resources were considered: a non-pipelined multiplier with a two-cycle delay and an ALU, with a one-cycle delay, for the rest of the operations.

7.3 ILP model for TCLP

Table 9 presents some results on solving time-constrained loop pipelining for the elliptic filter. The costs of the multipliers, adders and registers have been defined as 80, 20 and 10 area units, respectively.
⁴ We only report the value of MAXLIVE after 1 minute, which is the critical variable in the optimization of the objective function.
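As an aside, the TCLP objective just described can be sketched as follows. This is our own illustration, not the paper's code: the feasibility test, which in the paper is the ILP model itself, is stubbed out as a caller-supplied function.

```python
# Illustrative sketch of the TCLP objective: for a fixed II, pick the
# resource mix minimizing the area cost
#   80 * multipliers + 20 * adders + 10 * MAXLIVE.
# `schedule` stands in for the ILP model: it returns the minimum MAXLIVE
# of a feasible schedule, or None when the mix is infeasible.

MULT_AREA, ADD_AREA, REG_AREA = 80, 20, 10

def area_cost(mults, adders, maxlive):
    return MULT_AREA * mults + ADD_AREA * adders + REG_AREA * maxlive

def best_mix(ii, max_mult, max_add, schedule):
    """Cheapest (cost, mults, adders, maxlive) for the given II, or None."""
    best = None
    for m in range(1, max_mult + 1):
        for a in range(1, max_add + 1):
            ml = schedule(ii, m, a)
            if ml is None:
                continue
            cand = (area_cost(m, a, ml), m, a, ml)
            if best is None or cand < best:
                best = cand
    return best
```

The register term makes the trade-off visible: a mix with fewer functional units may still lose if it stretches lifetimes and raises MAXLIVE.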
[Table 6: Results obtained for the 16-point FIR filter (columns: Resources, II, |V|, |E|, #Vars, #Const, CPU (secs), MAXLIVE, SPAN, MAXLIVE after 60 secs)]

[Table 7: Results for the 5th-order elliptic wave filter (the subscript p distinguishes pipelined from non-pipelined multipliers)]

[Table 8: Results for the differential equation solver]

In this case, the value of II is fixed beforehand, whereas the area cost of the resources is minimized. The complexity of the models is equivalent to that of the corresponding solutions reported in Table 7. Table 10 shows equivalent results when pipelined multipliers are considered. Table 11 presents the results for the Cytron example when solving the TCLP problem.

7.4 Results without loop folding

In this section we present the results obtained when no loop folding is required. In this case, only one leaf of the branch-and-prune tree is explored: the one in which all the ρ's have the same value. Table 12 shows the results obtained for the elliptic filter. The program obtains the schedule without loop folding in a given number of cycle steps while minimizing the value of MAXLIVE.
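For reference, MAXLIVE for a given modulo schedule can be computed by folding value lifetimes onto the II cycles of the steady state. The sketch below is the standard computation in our own words, not the paper's code:

```python
# Standard MAXLIVE computation for a modulo schedule. A value defined in
# cycle d and live for s cycles occupies cycles d .. d+s-1 of the flat
# schedule; in the steady state all iterations overlap, so register
# pressure is accumulated modulo II. A lifetime longer than II then
# correctly counts the value twice (it overlaps its own copy from the
# next iteration).

def maxlive(lifetimes, ii):
    """lifetimes: (def_cycle, length) pairs; ii: initiation interval."""
    pressure = [0] * ii
    for start, length in lifetimes:
        for c in range(start, start + length):
            pressure[c % ii] += 1
    return max(pressure)
```

This is why MAXLIVE, and not the schedule length, drives register cost: shrinking II packs the same lifetimes into fewer steady-state cycles.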
[Table 9: TCLP for the 5th-order elliptic wave filter with non-pipelined multipliers (columns: II, Multipliers, Adders, MAXLIVE, CPU (secs))]

[Table 10: Results obtained for the 5th-order elliptic filter with pipelined multipliers for TCLP]

[Table 11: Results obtained for the Cytron example for TCLP]

7.5 Near-optimal solutions for large examples

Although the branch-and-prune strategy has proved efficient for models with around 10³ variables/constraints, the exponential nature of the method becomes critical in some cases. Rather than discarding the method, we claim that it can still be used to obtain near-optimal solutions by limiting the CPU time of the search. Tables 13 and 14 present results on the Cytron and FDCT loops obtained by limiting the CPU time to 10 minutes. Given the status of the search at the time the solution was delivered, we suspect that the results are optimal (although we could not prove it). The strategies used to explore the values of ρ and the scheduling variables (force-directed scheduling) are crucial for converging rapidly to near-optimal solutions.

8 Conclusions

Finding optimal schedules for moderate-size loops is feasible. In this paper we have presented a mathematical model that can be solved using ILP. However, efficient techniques to wisely explore the space of integer solutions are required to avoid a blind navigation through the branch-and-bound tree. We have presented a new strategy, called branch and prune, that takes advantage of the information known a priori about the problem to seek near-optimal solutions as fast as possible. Decomposing loop pipelining into two sub-problems (retiming and scheduling) and using force-directed scheduling to assist the search have been essential heuristics for guaranteeing optimal solutions in moderate CPU times. We have also demonstrated that exploring the unrolling degree is necessary to find optimal solutions for loop pipelining.
A mathematical formulation based on Farey's series has been proposed for such a goal.
[Table 12: Results obtained for the elliptic filter when only scheduling is required (non-pipelined multipliers)]

[Table 13: Results for the Cytron example after 10 minutes of CPU time]

[Table 14: Results obtained for the FDCT after 10 minutes of CPU time]

Finally, we think that other pruning strategies for ILP models can be considered for the problem we have solved. New techniques, such as branch and cut, successfully used for the traveling-salesman problem [25], can probably be adapted to accommodate many problems in high-level synthesis.

References

[1] A. Aiken and A. Nicolau. Perfect Pipelining: a new loop parallelization technique, volume 300 of Lecture Notes in Computer Science. Springer-Verlag, March 1988.
[2] M.R.C.M. Berkelaar. lp_solve version 2.0: a public domain ILP solver, available at ftp.es.ele.tue.nl.
[3] F. Catthoor, W. Geurts, and H. De Man. Loop transformation methodology for fixed-rate video, image and telecom processing applications. In Proc. of the Int. Conf. on Application-Specific Array Processors (ASAP'94), 1994.
[4] L.-F. Chao, A. LaPaugh, and E.H.-M. Sha. Rotation scheduling: a loop pipelining algorithm. In Proc. of the 30th Design Automation Conf., June 1993.
[5] R. Cytron. Compile-time scheduling and optimization for asynchronous machines. PhD thesis, University of Illinois at Urbana-Champaign, 1984.
[6] S.M. Heemstra de Groot, S.H. Gerez, and O.E. Herrmann. Range-chart-guided iterative data-flow graph scheduling. IEEE Transactions on Circuits and Systems I, 39(5), May 1992.
[7] Y.G. DeCastelo-Vide e Souza, M. Potkonjak, and A.C. Parker. Optimal ILP-based approach for throughput optimization using simultaneous algorithm/architecture matching and retiming. In Proc. of the 32nd Design Automation Conf., June 1995.
[8] A.E. Eichenberger, E.S. Davidson, and S.G. Abraham. Optimum modulo schedules for minimum register requirements. In Proc. of the 9th ACM International Symposium on Supercomputing, pages 31-40, June 1995.
[9] C.H. Gebotys. Throughput optimized architectural synthesis. IEEE Trans. on VLSI Systems, 1(3), September 1993.
[10] C.H. Gebotys and M.I. Elmasry. Global optimization approach for architectural synthesis. IEEE Trans. Computer-Aided Design, 12(9), September 1993.
[11] E.F. Girczyc. Loop winding: a data flow approach to functional pipelining. In Proc. Int. Symp. Circuits and Systems, May 1987.
[12] G. Goossens, J. Rabaey, J. Vandewalle, and H. De Man. An efficient microcode compiler for application specific DSP processors. IEEE Trans. Computer-Aided Design, 9(9), September 1990.
[13] R. Govindarajan, E.R. Altman, and G.R. Gao. Minimizing register requirements under resource-constrained rate-optimal software pipelining. In Proc. of the 27th Annual International Symposium on Microarchitecture, pages 85-94, November 1994.
[14] L. Hafer and A. Parker. A formal method for the specification, analysis and design of register-transfer-level digital logic. IEEE Trans. Computer-Aided Design, 2(1):4-17, January 1983.
[15] R.A. Huff. Lifetime-sensitive modulo scheduling. In Proc. of the 6th Conf. on Programming Language Design and Implementation, 1993.
[16] C.-T. Hwang and Y.-C. Hsu. Zone scheduling. IEEE Trans. Computer-Aided Design, 12(7), July 1993.
[17] C.-T. Hwang, Y.-C. Hsu, and Y.-L. Lin. Scheduling for functional pipelining and loop winding. In Proc.
of the 28th Design Automation Conf., June 1991.
[18] C.-T. Hwang, J.-H. Lee, and Y.-C. Hsu. A formal approach to the scheduling problem in high level synthesis. IEEE Trans. Computer-Aided Design, 10(4), April 1991.
[19] R.B. Jones and V.H. Allan. Software pipelining: a comparison and improvement. In Proc. 23rd Ann. Workshop on Microprogramming and Microarchitecture, pages 46-56, November 1990.
[20] R. Karp. A characterization of the minimum cycle mean in a digraph. Discrete Mathematics, 23, 1978.
[21] P.M. Kogge. The Architecture of Pipelined Computers. McGraw-Hill, New York, 1981.
[22] M. Lam. Software pipelining: an effective scheduling technique for VLIW machines. In Proc. of SIGPLAN'88, June 1988.
[23] T.-F. Lee, A.C.-H. Wu, Y.-L. Lin, and D.D. Gajski. A transformation-based method for loop folding. IEEE Trans. Computer-Aided Design, 13(4), April 1994.
[24] C.E. Leiserson and J.B. Saxe. Retiming synchronous circuitry. Algorithmica, 6:5-35, 1991.
[25] M.W. Padberg and G. Rinaldi. A branch-and-cut algorithm for the resolution of large-scale symmetric traveling salesman problems. SIAM Review, 33:60-100, 1991.
[26] C.H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall, 1982.
[27] K.K. Parhi and D.G. Messerschmitt. Static rate-optimal scheduling of iterative data-flow programs via optimum unfolding. IEEE Trans. Computers, 40(2), February 1991.
[28] P.G. Paulin and J.P. Knight. Force-directed scheduling for the behavioral synthesis of ASICs. IEEE Trans. Computer-Aided Design, 8(6), June 1989.
[29] R. Potasman, J. Lis, A. Nicolau, and D.D. Gajski. Percolation-based synthesis. In Proc. of the 27th Design Automation Conf., 1990.
[30] B.R. Rau. Iterative modulo scheduling: an algorithm for software pipelining loops. In Proc. of the 27th Annual International Symposium on Microarchitecture, pages 63-74, November 1994.
[31] B.R. Rau and C.D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proc. of the 14th Annual Workshop on Microprogramming, October 1981.
[32] M. Renfors and Y. Neuvo. The maximum sampling rate of digital filters under hardware speed constraints. IEEE Trans. on Circuits and Systems, CAS-28(3), March 1981.
[33] M. Rim and R. Jain. Lower-bound performance estimation for the high-level synthesis scheduling problem. IEEE Trans. Computer-Aided Design, 13(4), April 1994.
[34] F. Sánchez and J. Cortadella. Time-constrained loop pipelining. In Proc. Int. Conf. Computer-Aided Design, November 1995.
[35] M.R. Schroeder.
Number Theory in Science and Communication. Springer-Verlag.
[36] B. Su, S. Ding, and J. Xia. URPR: an extension of URCR for software pipelining. In Proc. of the 19th Annual Workshop on Microprogramming (MICRO-19), October 1986.
More informationLast topic: Summary; Heuristics and Approximation Algorithms Topics we studied so far:
Last topic: Summary; Heuristics and Approximation Algorithms Topics we studied so far: I Strength of formulations; improving formulations by adding valid inequalities I Relaxations and dual problems; obtaining
More informationLECTURES 3 and 4: Flows and Matchings
LECTURES 3 and 4: Flows and Matchings 1 Max Flow MAX FLOW (SP). Instance: Directed graph N = (V,A), two nodes s,t V, and capacities on the arcs c : A R +. A flow is a set of numbers on the arcs such that
More informationHIGH-LEVEL SYNTHESIS
HIGH-LEVEL SYNTHESIS Page 1 HIGH-LEVEL SYNTHESIS High-level synthesis: the automatic addition of structural information to a design described by an algorithm. BEHAVIORAL D. STRUCTURAL D. Systems Algorithms
More informationInteger Programming ISE 418. Lecture 7. Dr. Ted Ralphs
Integer Programming ISE 418 Lecture 7 Dr. Ted Ralphs ISE 418 Lecture 7 1 Reading for This Lecture Nemhauser and Wolsey Sections II.3.1, II.3.6, II.4.1, II.4.2, II.5.4 Wolsey Chapter 7 CCZ Chapter 1 Constraint
More informationA General Class of Heuristics for Minimum Weight Perfect Matching and Fast Special Cases with Doubly and Triply Logarithmic Errors 1
Algorithmica (1997) 18: 544 559 Algorithmica 1997 Springer-Verlag New York Inc. A General Class of Heuristics for Minimum Weight Perfect Matching and Fast Special Cases with Doubly and Triply Logarithmic
More informationPolynomial-Time Approximation Algorithms
6.854 Advanced Algorithms Lecture 20: 10/27/2006 Lecturer: David Karger Scribes: Matt Doherty, John Nham, Sergiy Sidenko, David Schultz Polynomial-Time Approximation Algorithms NP-hard problems are a vast
More informationA Level-wise Priority Based Task Scheduling for Heterogeneous Systems
International Journal of Information and Education Technology, Vol., No. 5, December A Level-wise Priority Based Task Scheduling for Heterogeneous Systems R. Eswari and S. Nickolas, Member IACSIT Abstract
More informationEfficient Second-Order Iterative Methods for IR Drop Analysis in Power Grid
Efficient Second-Order Iterative Methods for IR Drop Analysis in Power Grid Yu Zhong Martin D. F. Wong Dept. of Electrical and Computer Engineering Dept. of Electrical and Computer Engineering Univ. of
More informationPublished in HICSS-26 Conference Proceedings, January 1993, Vol. 1, pp The Benet of Predicated Execution for Software Pipelining
Published in HICSS-6 Conference Proceedings, January 1993, Vol. 1, pp. 97-506. 1 The Benet of Predicated Execution for Software Pipelining Nancy J. Warter Daniel M. Lavery Wen-mei W. Hwu Center for Reliable
More informationURSA: A Unified ReSource Allocator for Registers and Functional Units in VLIW Architectures
Presented at IFIP WG 10.3(Concurrent Systems) Working Conference on Architectures and Compliation Techniques for Fine and Medium Grain Parallelism, Orlando, Fl., January 1993 URSA: A Unified ReSource Allocator
More informationMax-Cut and Max-Bisection are NP-hard on unit disk graphs
Max-Cut and Max-Bisection are NP-hard on unit disk graphs Josep Díaz and Marcin Kamiński 2 Llenguatges i Sistemes Informàtics 2 RUTCOR, Rutgers University Universitat Politècnica de Catalunya 640 Bartholomew
More informationDual-Based Approximation Algorithms for Cut-Based Network Connectivity Problems
Dual-Based Approximation Algorithms for Cut-Based Network Connectivity Problems Benjamin Grimmer bdg79@cornell.edu arxiv:1508.05567v2 [cs.ds] 20 Jul 2017 Abstract We consider a variety of NP-Complete network
More informationTopic: Local Search: Max-Cut, Facility Location Date: 2/13/2007
CS880: Approximations Algorithms Scribe: Chi Man Liu Lecturer: Shuchi Chawla Topic: Local Search: Max-Cut, Facility Location Date: 2/3/2007 In previous lectures we saw how dynamic programming could be
More informationMethods and Models for Combinatorial Optimization Exact methods for the Traveling Salesman Problem
Methods and Models for Combinatorial Optimization Exact methods for the Traveling Salesman Problem L. De Giovanni M. Di Summa The Traveling Salesman Problem (TSP) is an optimization problem on a directed
More informationThe alternator. Mohamed G. Gouda F. Furman Haddix
Distrib. Comput. (2007) 20:21 28 DOI 10.1007/s00446-007-0033-1 The alternator Mohamed G. Gouda F. Furman Haddix Received: 28 August 1999 / Accepted: 5 July 2000 / Published online: 12 June 2007 Springer-Verlag
More informationOptimization Techniques for Design Space Exploration
0-0-7 Optimization Techniques for Design Space Exploration Zebo Peng Embedded Systems Laboratory (ESLAB) Linköping University Outline Optimization problems in ERT system design Heuristic techniques Simulated
More informationAlgorithms for Integer Programming
Algorithms for Integer Programming Laura Galli November 9, 2016 Unlike linear programming problems, integer programming problems are very difficult to solve. In fact, no efficient general algorithm is
More informationCOMP 355 Advanced Algorithms Approximation Algorithms: VC and TSP Chapter 11 (KT) Section (CLRS)
COMP 355 Advanced Algorithms Approximation Algorithms: VC and TSP Chapter 11 (KT) Section 35.1-35.2(CLRS) 1 Coping with NP-Completeness Brute-force search: This is usually only a viable option for small
More informationChapter 16. Greedy Algorithms
Chapter 16. Greedy Algorithms Algorithms for optimization problems (minimization or maximization problems) typically go through a sequence of steps, with a set of choices at each step. A greedy algorithm
More informationPlacement Algorithm for FPGA Circuits
Placement Algorithm for FPGA Circuits ZOLTAN BARUCH, OCTAVIAN CREŢ, KALMAN PUSZTAI Computer Science Department, Technical University of Cluj-Napoca, 26, Bariţiu St., 3400 Cluj-Napoca, Romania {Zoltan.Baruch,
More informationVertex Cover Approximations
CS124 Lecture 20 Heuristics can be useful in practice, but sometimes we would like to have guarantees. Approximation algorithms give guarantees. It is worth keeping in mind that sometimes approximation
More informationIntroduction to Electronic Design Automation. Model of Computation. Model of Computation. Model of Computation
Introduction to Electronic Design Automation Model of Computation Jie-Hong Roland Jiang 江介宏 Department of Electrical Engineering National Taiwan University Spring 03 Model of Computation In system design,
More informationAnalysis of Conditional Resource Sharing using a Guard-based Control Representation *
Analysis of Conditional Resource Sharing using a Guard-based Control Representation * Ivan P. Radivojevi c orrest Brewer Department of Electrical and Computer Engineering University of California, Santa
More informationChapter 15 Introduction to Linear Programming
Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2015 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of
More informationNotes for Lecture 18
U.C. Berkeley CS17: Intro to CS Theory Handout N18 Professor Luca Trevisan November 6, 21 Notes for Lecture 18 1 Algorithms for Linear Programming Linear programming was first solved by the simplex method
More informationData Path Allocation using an Extended Binding Model*
Data Path Allocation using an Extended Binding Model* Ganesh Krishnamoorthy Mentor Graphics Corporation Warren, NJ 07059 Abstract Existing approaches to data path allocation in highlevel synthesis use
More informationPracy II Konferencji Krajowej Reprogramowalne uklady cyfrowe, RUC 99, Szczecin, 1999, pp Implementation of IIR Digital Filters in FPGA
Pracy II Konferencji Krajowej Reprogramowalne uklady cyfrowe, RUC 99, Szczecin, 1999, pp.233-239 Implementation of IIR Digital Filters in FPGA Anatoli Sergyienko*, Volodymir Lepekha*, Juri Kanevski**,
More informationCS 372: Computational Geometry Lecture 10 Linear Programming in Fixed Dimension
CS 372: Computational Geometry Lecture 10 Linear Programming in Fixed Dimension Antoine Vigneron King Abdullah University of Science and Technology November 7, 2012 Antoine Vigneron (KAUST) CS 372 Lecture
More informationarxiv:cs/ v1 [cs.ds] 20 Feb 2003
The Traveling Salesman Problem for Cubic Graphs David Eppstein School of Information & Computer Science University of California, Irvine Irvine, CA 92697-3425, USA eppstein@ics.uci.edu arxiv:cs/0302030v1
More information6.856 Randomized Algorithms
6.856 Randomized Algorithms David Karger Handout #4, September 21, 2002 Homework 1 Solutions Problem 1 MR 1.8. (a) The min-cut algorithm given in class works because at each step it is very unlikely (probability
More informationHigh-Level Synthesis (HLS)
Course contents Unit 11: High-Level Synthesis Hardware modeling Data flow Scheduling/allocation/assignment Reading Chapter 11 Unit 11 1 High-Level Synthesis (HLS) Hardware-description language (HDL) synthesis
More informationManaging Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks
Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Zhining Huang, Sharad Malik Electrical Engineering Department
More informationEXERCISES SHORTEST PATHS: APPLICATIONS, OPTIMIZATION, VARIATIONS, AND SOLVING THE CONSTRAINED SHORTEST PATH PROBLEM. 1 Applications and Modelling
SHORTEST PATHS: APPLICATIONS, OPTIMIZATION, VARIATIONS, AND SOLVING THE CONSTRAINED SHORTEST PATH PROBLEM EXERCISES Prepared by Natashia Boland 1 and Irina Dumitrescu 2 1 Applications and Modelling 1.1
More informationA note on the subgraphs of the (2 )-grid
A note on the subgraphs of the (2 )-grid Josep Díaz a,2,1 a Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya Campus Nord, Edifici Ω, c/ Jordi Girona Salgado 1-3,
More informationGeneral properties of staircase and convex dual feasible functions
General properties of staircase and convex dual feasible functions JÜRGEN RIETZ, CLÁUDIO ALVES, J. M. VALÉRIO de CARVALHO Centro de Investigação Algoritmi da Universidade do Minho, Escola de Engenharia
More informationThe character of the instruction scheduling problem
The character of the instruction scheduling problem Darko Stefanović Department of Computer Science University of Massachusetts March 997 Abstract Here I present some measurements that serve to characterize
More informationParallel Query Processing and Edge Ranking of Graphs
Parallel Query Processing and Edge Ranking of Graphs Dariusz Dereniowski, Marek Kubale Department of Algorithms and System Modeling, Gdańsk University of Technology, Poland, {deren,kubale}@eti.pg.gda.pl
More informationSymbolic Buffer Sizing for Throughput-Optimal Scheduling of Dataflow Graphs
Symbolic Buffer Sizing for Throughput-Optimal Scheduling of Dataflow Graphs Anan Bouakaz Pascal Fradet Alain Girault Real-Time and Embedded Technology and Applications Symposium, Vienna April 14th, 2016
More informationCounting the Number of Eulerian Orientations
Counting the Number of Eulerian Orientations Zhenghui Wang March 16, 011 1 Introduction Consider an undirected Eulerian graph, a graph in which each vertex has even degree. An Eulerian orientation of the
More informationRISC IMPLEMENTATION OF OPTIMAL PROGRAMMABLE DIGITAL IIR FILTER
RISC IMPLEMENTATION OF OPTIMAL PROGRAMMABLE DIGITAL IIR FILTER Miss. Sushma kumari IES COLLEGE OF ENGINEERING, BHOPAL MADHYA PRADESH Mr. Ashish Raghuwanshi(Assist. Prof.) IES COLLEGE OF ENGINEERING, BHOPAL
More informationAdvanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs
Advanced Operations Research Techniques IE316 Quiz 1 Review Dr. Ted Ralphs IE316 Quiz 1 Review 1 Reading for The Quiz Material covered in detail in lecture. 1.1, 1.4, 2.1-2.6, 3.1-3.3, 3.5 Background material
More information