A Mathematical Formulation of the Loop Pipelining Problem


Jordi Cortadella, Rosa M. Badia and Fermín Sánchez
Department of Computer Architecture
Universitat Politècnica de Catalunya
Campus Nord, Mòdul D, Barcelona, Spain

Abstract

A mathematical model for the loop pipelining problem is presented. The model considers several parameters for optimization and supports any combination of resource and timing constraints. The unrolling degree of the loop is one of the variables explored by the model. By using Farey's series, an optimal exploration of the unrolling degree is performed and optimal solutions not considered by other methods are obtained. Finding an optimal schedule that minimizes resource requirements (including registers) is solved by an ILP model. A novel paradigm called branch and prune is proposed to efficiently converge towards the optimal schedule and prune the search tree for integer solutions, thus drastically reducing the running time. This is the first formulation that combines the unrolling degree of the loop with timing and resource constraints in a mathematical model that guarantees optimal solutions.

1 Introduction

It is well known that loops monopolize most of the execution time of programs. In many applications a few loops, if not only one, determine the throughput achievable by the implementation of a behavioral description. For example, DSP filters often consist of an infinite loop that executes repeatedly for every sample of the input stream. In architectural synthesis, the problem of optimizing loop execution under timing and area constraints is crucial to obtain high-quality architectures.

The techniques proposed for this problem attempt to overlap the execution of different loop iterations to reduce the cycle count (initiation interval, or II) per iteration. Different methods have been proposed with such a goal: loop folding [12, 23], functional pipelining [17], loop winding [11], rotation scheduling [4], and percolation-based synthesis [29], among others. The area of fixed-rate DSP has also drawn the attention of other authors to propose techniques for loop pipelining with timing constraints [3, 6, 34]. Similar (if not identical) problems are encountered in the area of optimizing compilers for parallel architectures (VLIW, superscalar and superpipelined). Under the general name of software pipelining [21], several techniques have been devised: modulo scheduling [31], URPR [36], Lam's algorithm [22] and perfect pipelining [1].

Loops are usually represented by means of a data dependence graph (DG). Figure 1 shows an example. Vertices represent computational nodes (henceforth called instructions). Unlabeled edges represent intra-loop dependences (ILDs), e.g. B_i (reads X[i]) depends on A_i (produces X[i]). Labeled edges represent loop-carried dependences (LCDs), e.g. B_{i+1} (reads Y[i+1]) depends on B_i (produces Y[i+1]). Labels indicate the number of iterations traversed by the dependence. Thus, ILDs can also be represented as 0-labeled edges.

    for i = 0 to I-1 do
        A_i:  X[i]   = R[i] + S[i];
        B_i:  Y[i+1] = Y[i] + 2*X[i];
        C_i:  Z[i]   = 4*Y[i+1] + T[i];
    endfor

Figure 1: Loop dependence graph

If no overlap between iterations were allowed to execute the loop, the total execution time would be 3I cycles (assuming 1-cycle instructions). Within each iteration, the execution of A_i, B_i and C_i must be sequential due to the data dependences. An overlapped execution would take I+2 cycles to complete, as shown in Figure 2. The problem of loop pipelining is basically reduced to finding a steady-state pattern (a folded loop body) that can execute at the maximum rate allowed by the dependences [32]. In the new steady state, instructions from different iterations (folds) are executed (A_{i+f} denotes that the execution of A_i belongs to fold f).

Under resource constraints, unrolling the loop before finding a steady state is crucial to obtain optimal solutions. Assume the schedule depicted in Figure 2(b). If only two adders are available, the steady state must be executed in at least two cycles (II = 2), as shown in Figure 3(a). However, if the loop is unrolled twice (Figure 3(b)), a schedule of the two iterations in three cycles can be found (II = 3/2).

Figure 2: (a) Overlapped loop execution. (b) Prologue, steady state (i = 0, ..., I-3) and epilogue.

Figure 3: Steady state with resource constraints: (a) without unrolling (II = 2), and (b) unrolling the loop twice (II = 3/2).

Another important aspect to be considered is the span, defined as the number of folds required to obtain the steady state. Figure 4(b) shows a time-optimal steady state with SPAN = 4. An approach to finding a resource-constrained steady state is to reschedule the time-optimal one according to the number of available resources (see Figure 4(c) for 2 resources), thus maintaining the same span. However, by considering resource constraints from the beginning, a steady state with a lower span can be obtained (Figure 4(d)). As further commented in Section 5.5, smaller spans result in shorter variable lifetimes, which in general reduces the register pressure of the schedule [15].

The loop optimization problem addressed in this paper comprises a large variety of formulations with different timing and resource constraints. The two extreme cases are described next:

Resource-constrained loop pipelining (RCLP). Given a set of resource constraints (functional units and registers), the objective is to find a pipelined loop schedule that minimizes the execution time.

Time-constrained loop pipelining (TCLP). Given an upper bound on the execution time, the objective is to find a pipelined loop schedule that minimizes the cost of the resources required to execute the loop.

Between RCLP and TCLP there is a wide range of problems that can be formulated, e.g. finding a time-constrained schedule with constraints on a subset of resources. In the area of optimizing compilers for parallel architectures, only the RCLP version of the problem is posed, since the resources are fixed by the target architecture.

Figure 4: (a) Dependence graph and steady states with: (b) II = 1, SPAN = 4; (c) II = 2, SPAN = 4; and (d) II = 2, SPAN = 2.

This paper presents UNRET (unrolling and retiming), a formal approach to solve the loop pipelining problem in which the following mathematical tools are used:

Farey's series [35], to calculate the unrolling degrees of the loop that yield optimal solutions.

Integer Linear Programming (ILP), to simultaneously solve the problems of scheduling and loop pipelining.

Since the decision version of loop pipelining with resource constraints is NP-complete (and the minimization version NP-hard) [5], several heuristics have been proposed to solve it in moderate computation time. Several authors have also used linear programming to obtain optimal or quasi-optimal solutions for the problems of scheduling and allocation in architectural synthesis [14, 18, 16, 10, 33]. A preliminary approach for loop pipelining was presented in [9]. The closest approaches related to the work presented here have been proposed in [13, 8, 7], in which the primary goal is the minimization of the register requirements of the final schedule.

The main contributions of UNRET in relation to existing formal methods are the following:

UNRET performs an exhaustive analysis of the unrolling degrees of the loop that can yield optimal solutions for the available resources. Therefore, more than one instance of each operation can appear in the final schedule. Unlike other methods that perform loop unrolling [19, 27, 33], this paper presents a new approach that guarantees an optimal unrolling degree.

Similarly to [13, 8, 7], the number of folds for the steady state is automatically obtained by solving an ILP model for loop pipelining. The model has been extended to the minimization of resources under timing constraints. The maximum number of live variables at any cycle, MAXLIVE, is also minimized. MAXLIVE is a lower bound that closely approximates the number of registers required to execute the loop [30].

A new approach, which we have called branch and prune, is proposed to solve the ILP model. The heuristics devised to explore the space of solutions allow a rapid convergence to the optimal solution. ILP models with more than 1000 variables and 700 constraints have been solved in reasonable running times.

The paper is organized as follows. Section 2 presents some preliminary definitions required for the mathematical model. Section 3 proposes a strategy for loop unrolling to find time-optimal schedules for the RCLP problem. A similar strategy for the TCLP problem is proposed in Section 4. The ILP model to find an optimal steady state is presented in Section 5. The branch-and-prune strategy to efficiently solve the ILP model is described in Section 6. Experimental results and conclusions are presented in Sections 7 and 8.

2 Basic definitions

For the sake of simplicity, we will first assume that all instructions can be executed in any of the resources of the architecture in one cycle. Henceforth, we will call n the number of instructions of the loop and r the number of resources available to execute instructions. Extensions to multiple-cycle, pipelined functional units will be addressed in Section 3.3.

2.1 Representation of a loop

A loop is represented by a labeled dependence graph DG(V, E), in which vertices and edges represent instructions and data dependences respectively. Labels of the DG are defined by two mappings, σ (fold) and δ (dependence distance), in the following way:

σ(u), defined on vertices, denotes the fold to which u belongs in the steady state (σ(u) ≥ 0). σ(u) = i is denoted by u_i in the DG.

δ(u, v), defined on edges, denotes the dependence distance (number of iterations traversed by the dependence) between instructions u and v. δ(u, v) = 0 corresponds to an ILD, and δ(u, v) > 0 to an LCD. An ILD between u and v is represented by u_i →⁰ v_i, or simply u_i → v_i. Similarly, an LCD between u and v with distance d is represented as u_i →ᵈ v_i.

Initially, a loop is represented by a DG with only one fold, i.e. ∀u ∈ V: σ(u) = 0. After finding a steady state, each instruction will be labeled with a fold representing the relative execution skew (in iterations) with regard to the other instructions of the loop.

Equivalent labeling functions of the original DG can be obtained by simple transformations. The way σ and δ can be transformed is illustrated in Figure 5. Assume an ILD A_i → B_i exists in the initial DG. A schedule for the loop is depicted in Figure 5(a), which corresponds to the sequential execution of the original loop body that has only one fold (i). However, the ILD can be transformed into an LCD while preserving the semantics of the loop: the dependence A_{i+1} →¹ B_i (or, in general, A_{i+d} →ᵈ B_i) is equivalent to the previous one. This transformation is analogous to the retiming technique proposed to minimize the clock period in synchronous systems [24]. Figure 5(b) shows the transformed DG with two folds (i and i+1) and one LCD. A new schedule can be found for the loop body by considering that A_{i+1} and B_i no longer have an ILD between them. LCDs are semantically preserved by the sequential execution of different iterations of the same loop body. Therefore, ILDs can be transformed into LCDs by changing the fold assignment and, thus, shorter schedules for the loop body can be found.

A formal definition of equivalent labeling functions can now be presented.

Definition 1 [Equivalent labeling functions]: Let (σ, δ) and (σ′, δ′) be two different labeling functions for the dependence graph DG = (V, E). The functions are equivalent if, ∀(u, v) ∈ E:

    σ(v) − σ(u) + δ(u, v) = σ′(v) − σ′(u) + δ′(u, v)    (1)
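A small illustration (ours, not taken from the paper) of Definition 1: two labelings are equivalent when they assign the same value of σ(v) − σ(u) + δ(u, v) to every edge. The sketch below checks exactly that for the retiming example of Figure 5.

```python
def equivalent(labeling_a, labeling_b, edges):
    """labeling = (sigma, delta): sigma maps vertices to folds,
    delta maps edges (u, v) to dependence distances."""
    (sa, da), (sb, db) = labeling_a, labeling_b
    return all(sa[v] - sa[u] + da[(u, v)] == sb[v] - sb[u] + db[(u, v)]
               for (u, v) in edges)

# Figure 5: the ILD A -> B becomes the LCD A_{i+1} ->(1) B_i after retiming
edges = [("A", "B")]
original = ({"A": 0, "B": 0}, {("A", "B"): 0})
retimed  = ({"A": 1, "B": 0}, {("A", "B"): 1})
print(equivalent(original, retimed, edges))   # True
```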

Figure 5: (a) Schedule of a loop with one ILD. (b) Schedule of the same loop with an equivalent labeling function and one LCD.

It is not difficult to prove that equivalent labeling functions preserve the semantics of a loop [24].

2.2 Initiation Interval

As proposed by other authors [22], UNRET first calculates a lower bound on the II of the loop, the minimum initiation interval (MII):

    MII = max(ResMII, RecMII)

where ResMII is determined by the number of available resources and RecMII by the cycles (recurrences) formed by the dependences of the loop. Clearly,

    ResMII = n / r

since each functional unit can execute only one instruction per cycle.

A recurrence R is a set of edges that form a cycle. Let us define δ_R as

    δ_R = Σ_{(u,v) ∈ R} δ(u, v)    (2)

For any operation u such that ∃(u, v) ∈ R, u_i must be scheduled at least |R| cycles before u_{i+δ_R}. Therefore, R imposes a minimum initiation interval, RecMII_R, on the execution of the loop [32, 5, 27]:

    RecMII_R = |R| / δ_R

A loop can have several recurrences. Therefore:

    RecMII = max_R RecMII_R

In the worst case, the number of cycles of a DG grows exponentially with |E|. However, RecMII can be found in polynomial time by using Karp's algorithm to calculate the weight of the minimum mean-weight cycle of a graph [20].

In the example of Figure 6(a), there are three recurrences with the same value of RecMII. Let us take the recurrence A_0 → B_0 → E_0 → A_0, with δ_R = 3 and |R| = 3. This indicates that A_3 must be executed at least 3 cycles after A_0, thus resulting in RecMII_R = 1. The other two recurrences are isomorphic to this one.
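For 1-cycle instructions, RecMII is the reciprocal of the minimum mean-weight cycle of the DG when every edge is weighted with its distance δ. The sketch below (ours, not the authors' code) computes it with Karp's algorithm; the example encodes the loop of Figure 1.

```python
from math import inf

def min_mean_cycle(n, edges):
    """Karp's algorithm: minimum mean weight of a cycle in a digraph.
    n: number of vertices 0..n-1; edges: list of (u, v, weight)."""
    # d[k][v] = minimum weight of a walk with exactly k edges ending at v
    # (d[0][v] = 0 for every v plays the role of a virtual source)
    d = [[inf] * n for _ in range(n + 1)]
    d[0] = [0.0] * n
    for k in range(1, n + 1):
        for u, v, w in edges:
            if d[k - 1][u] + w < d[k][v]:
                d[k][v] = d[k - 1][u] + w
    best = inf
    for v in range(n):
        if d[n][v] == inf:
            continue
        best = min(best, max((d[n][v] - d[k][v]) / (n - k)
                             for k in range(n) if d[k][v] < inf))
    return best  # inf if the graph has no cycle

def rec_mii(n, edges):
    """RecMII = max over recurrences R of |R| / delta_R (1-cycle instructions),
    i.e. the reciprocal of the minimum mean cycle over the distances delta."""
    mu = min_mean_cycle(n, edges)
    if mu == inf:
        return 0            # no recurrence constrains the loop
    return 1.0 / mu         # assumes every recurrence has delta_R > 0

# Loop of Figure 1 (A=0, B=1, C=2): ILDs A->B, B->C and the LCD B->B with distance 1
print(rec_mii(3, [(0, 1, 0), (1, 2, 0), (1, 1, 1)]))   # 1.0
```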

Figure 6: (a) DG example. (b) Different (II_K, K) pairs for 4 resources.

3 Loop unrolling for RCLP

The length of a schedule is always an integer number of cycles, whereas the MII of a loop may be a rational number. In the example of Figure 3 we have MII = 3/2 for r = 2. If only one loop instance is taken, at least ⌈3/2⌉ = 2 cycles are needed for any schedule. However, a schedule with optimal II can be found by unrolling the loop twice. In general, a necessary (but not sufficient) condition to find a schedule with II = MII = a/b (a reduced fraction) is to unroll the loop K = b times. Systematic approaches for loop unrolling of data-flow programs are presented in [19, 27, 33]. However, they do not guarantee optimal solutions, as will be demonstrated with the example of Figure 6.

Let II_K be the length of a steady state comprising K instances of the loop body; the resulting initiation interval is II = II_K / K. The goal of UNRET is to find a schedule that minimizes II. This is done by exploring pairs (II_K, K) in increasing order of II, starting from MII. In order to bound the search space for (II_K, K), a maximum value for II_K is defined: the maximum length of the steady state (II_max).

Figure 6(a) depicts the DG of a 5-instruction loop. If 4 resources are used for execution, then RecMII = 1 and MII = ResMII = 5/4. The diagram in Figure 6(b) represents the pairs (II_K, K) that can be explored to find a feasible schedule for the loop. These pairs correspond to the points with integer values for II_K and K enclosed in the triangle limited by the lines K = 0, II_K = II_max and II_K / K = MII. Distinct points with the same II lie on the same line. Among all points lying on the same line, those with smaller values of II_K (or K) are preferred, since they produce shorter steady states.

In the example, the first point to be explored to minimize II is C = (5, 4), meaning that 4 iterations must be executed in 5 cycles, which results in II = MII = 1.25. In this case, no feasible schedule with such characteristics exists. In [31], II_K is incremented by 1 when no schedule is found and, thus, the point B = (6, 4) is explored next. This results in a suboptimal solution. However, a time-optimal solution (for II_max = 8) is found if the point A = (4, 3) is explored after C = (5, 4). But, is there any efficient strategy to explore all points in increasing order of II?

Table 1: Pairs (II_K, K) explored for the example of Figure 6.

3.1 Farey's series

For a fixed integer D > 0, the ascending sequence of all nonnegative reduced fractions whose denominators do not exceed D is the Farey's series of order D (F_D). For example, F_5 is the series of fractions:

    F_5 = 0/1, 1/5, 1/4, 1/3, 2/5, 1/2, 3/5, 2/3, 3/4, 4/5, 1/1, ...

Let x_i/y_i be the i-th element of the series. F_D can be generated by the following recurrence [35]. The first two elements are

    x_0/y_0 = 0/1,    x_1/y_1 = 1/D

and the generic term x_i/y_i is calculated (in increasing order) as

    x_i = ⌊(y_{i−2} + D) / y_{i−1}⌋ · x_{i−1} − x_{i−2}
    y_i = ⌊(y_{i−2} + D) / y_{i−1}⌋ · y_{i−1} − y_{i−2}

The same recurrence generates the series in decreasing order when it is started from two consecutive elements at the top of the explored range. For a given integer k, the following consecutive fractions always belong to F_D:

    (kD − 1)/D,   k/1,   (kD + 1)/D

All points (II_K, K) correspond to fractions K/II_K (or fractions equivalent to them) of the series F_{II_max}. An efficient strategy to generate the points in increasing order of II (i.e. decreasing order of K/II_K) is the following:

Let m = ⌈1/MII⌉, x_0/y_0 = m/1 and x_1/y_1 = (m·II_max − 1)/II_max.

Generate fractions in decreasing order until a fraction x_p/y_p ≤ 1/MII is found.

Explore schedules for the fractions (and their equivalent fractions), generated in decreasing order starting from x_p/y_p.

For a reduced fraction x_i/y_i, all equivalent fractions k·x_i / k·y_i (with k·y_i ≤ II_max) are generated in increasing order of the denominator (k = 1, 2, ...). Table 1 shows the explored points for the example of Figure 6.
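A compact sketch (ours, not the authors' code) of this enumeration: it generates the Farey fractions K/II_K of order II_max in decreasing order with the recurrence above, keeps those not exceeding 1/MII, and expands their equivalent fractions, which yields the (II_K, K) pairs in increasing order of II.

```python
from fractions import Fraction
from math import ceil

def farey_pairs(mii, ii_max):
    """Enumerate (II_K, K) pairs in increasing order of II = II_K / K,
    starting from MII, with II_K bounded by II_max."""
    d = ii_max
    m = ceil(1 / mii)
    x2, y2 = m, 1                       # x_{i-2} / y_{i-2}
    x1, y1 = m * d - 1, d               # x_{i-1} / y_{i-1}, consecutive in F_D
    pairs = []
    while x1 > 0:
        if Fraction(x1, y1) <= 1 / mii:             # fraction K/II_K, i.e. II >= MII
            for k in range(1, d // y1 + 1):         # equivalent fractions k*x1 / k*y1
                pairs.append((k * y1, k * x1))      # (II_K, K)
        q = (y2 + d) // y1                          # next Farey term, decreasing order
        x2, y2, x1, y1 = x1, y1, q * x1 - x2, q * y1 - y2
    return pairs

# Example of Figure 6: MII = 5/4, II_max = 8
print(farey_pairs(Fraction(5, 4), 8)[:6])
# [(5, 4), (4, 3), (8, 6), (7, 5), (3, 2), (6, 4)]
```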

Figure 7: A DG unrolled 3 times.

3.2 Loop unrolling

Every pair (II_K, K) obtained from Farey's series denotes a different unrolling degree for the loop and a target initiation interval. Before trying to find a schedule with such characteristics, a new DG with K instances of the original loop body must be obtained. Unrolling a DG K times generates another DG in which each vertex v is instantiated K times (v^0, v^1, ..., v^(K−1)). Besides, data dependence distances must be changed according to the unrolling degree. A dependence u →ᵈ v in the original DG is represented in the unrolled DG by the K dependences from u^i to v^((i+d) mod K) with distance ⌊(i+d)/K⌋, for i = 0, ..., K−1 [27] (see the example in Figure 7).

3.3 Extension to multiple types of resources

In the most general case the architecture has several types of resources or functional units (FUs). Each FU is able to execute more than one type of operation. We also assume that allocation has already been done and, therefore, each instruction can only be executed in one type of FU. Let us first present some preliminary definitions:

M = {1, ..., m} is the set of FU types of the architecture.

F = (F_1, F_2, ..., F_m) is the resource vector of the architecture, where F_t is the number of FUs of type t.

ALLOC(t) is the set of instructions allocated to the FU type t.

FU(u) is the type of FU allocated for instruction u.

OP(u) is the operation executed by instruction u.

T(u) is the delay (cycle count) required to execute OP(u) in FU(u).

L(u) is the latency of FU(u) to execute OP(u) (minimum cycle count between successive inputs to the FU). For pipelined FUs, L(u) < T(u); otherwise L(u) = T(u).

Given a heterogeneous set of resources, the calculation of ResMII is done as follows:

    ResMII = max_{t ∈ M} ResMII_t,    with    ResMII_t = ( Σ_{u ∈ ALLOC(t)} L(u) ) / F_t

The numerator (Σ L(u)) corresponds to the number of cycles in which the FUs of type t are busy. We will say that t is a critical FU type if ResMII = ResMII_t.
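The two ingredients just described are easy to state as code. The sketch below (our illustration, with made-up allocation data, not the authors' implementation) unfolds a DG as in Section 3.2 and evaluates ResMII for a heterogeneous resource set.

```python
from fractions import Fraction

def unroll(edges, K):
    """Unfold a DG K times: edge (u, v, d) becomes K edges
    (u, i) -> (v, (i+d) mod K) with distance floor((i+d)/K)."""
    return [((u, i), (v, (i + d) % K), (i + d) // K)
            for (u, v, d) in edges for i in range(K)]

def res_mii(alloc, latency, fu_count):
    """ResMII = max over FU types t of (sum of L(u) over ALLOC(t)) / F_t."""
    return max(Fraction(sum(latency[u] for u in alloc[t]), fu_count[t])
               for t in alloc)

# Loop of Figure 1 (A, B, C) unrolled twice, as in Figure 3(b)
print(unroll([("A", "B", 0), ("B", "C", 0), ("B", "B", 1)], 2))

# Hypothetical allocation: A and B on one adder, C on one multiplier
print(res_mii({"add": ["A", "B"], "mul": ["C"]},
              {"A": 1, "B": 1, "C": 1}, {"add": 1, "mul": 1}))   # ResMII = 2
```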

To calculate RecMII, the delay associated with each recurrence R of the loop is now obtained as follows [32, 5]:

    RecMII_R = ( Σ_{u ∈ R} T(u) ) / δ_R

(Although formally incorrect, we use u ∈ R to refer to the vertices of the recurrence.) The generation of the pairs (II_K, K) can now be done similarly to the case of a homogeneous set of resources, by first calculating MII and then using Farey's series to obtain II_K and K.

3.4 UNRET algorithm for RCLP

The strategy followed by UNRET to solve the RCLP problem is the following:

1. Calculate MII.

2. Find the first unrolling degree K and expected initiation interval II_K that minimize II, such that II ≥ MII.

3. Unroll the loop K times.

4. Find a steady state with length II_K (ILP approach explained in Section 5).

5. If no schedule with II_K cycles is found, generate new values for K and II_K that minimize II (II is explored in increasing order by using Farey's series) and go to step 3.

4 Loop unrolling for TCLP

The problem of TCLP starts from a DG and a target initiation interval (II_T). Unlike RCLP, the number of some (or all) resources of the architecture is not constrained. Therefore, ResMII must be calculated by considering only those resources with constraints. If II_T < MII, no feasible schedule for that DG exists. Otherwise, different unrolling degrees for the loop must be explored, similarly to what is done for RCLP, in such a way that

    MII ≤ II_K / K ≤ II_T

4.1 UNRET algorithm for TCLP

The strategy followed by UNRET to solve the TCLP problem is the following:

1. Calculate MII. If II_T < MII, exit (no feasible schedule exists).

2. Find the first unrolling degree K and expected initiation interval II_K that maximize II, such that II ≤ II_T.

3. Unroll the loop K times.

4. Find a steady state with length II_K that minimizes the cost of the required resources (ILP approach explained in Section 5).

5. Optimize the initiation interval of the schedule. This is done by successively generating new values for K and II_K that minimize II (II is explored in decreasing order by using Farey's series). If II < MII then exit (a time-optimal schedule has been found). Otherwise, for each pair (II_K, K), the loop is unrolled K times and the algorithm attempts to find a schedule in II_K cycles by using the set of resources found at step 4. In this case, Farey's series is explored starting from the fraction closest to 1/II_T, so that II decreases from II_T.

5 Loop pipelining: An ILP approach

Both the RCLP and the TCLP problems can be solved by the same mathematical model. As shown in Figure 5, different labeling functions (σ, δ) for the same DG can result in different steady states. The problem of loop pipelining can be reduced to two interrelated subproblems that can be simultaneously solved by using an ILP model:

1. Folding the DG, i.e. finding the functions σ and δ.

2. Finding a schedule for the folded DG subject to the data dependences and resource constraints.

5.1 Preliminaries

Initially, a DG obtained by unrolling the original DG K times is given. Moreover, an objective initiation interval (II_K) fixes the number of cycles of the steady state. Hereafter, and for the sake of brevity, we will write II instead of II_K. C = {0, ..., II − 1} will denote the set of cycles of the steady state. All variables used in the model are nonnegative integers. The least nonnegative residue system (0, 1, ..., k−1) is used when modulo-k (mod k) operations are performed on constants (e.g. −8 mod 5 = 2).

5.2 Loop folding constraints

The following variables are defined for the labeling functions of the DG:

    σ_u,   ∀u ∈ V    (fundamental variables)
    δ_{u,v},   ∀(u, v) ∈ E    (auxiliary variables)

Folding the loop means finding an equivalent labeling function for the DG. According to equation (1), the auxiliary variables δ_{u,v} are defined as follows:

    δ_{u,v} = σ_u − σ_v + k_{u,v},    ∀(u, v) ∈ E    (3)

with k_{u,v} = σ₀(v) − σ₀(u) + δ₀(u, v), where σ₀(u), σ₀(v) and δ₀(u, v) are constants denoting the initial labeling of the DG (usually σ₀(u) = σ₀(v) = 0).

Figure 8: Time frames for scheduling v_i under different loop foldings (T(u) = 2 and II = 3).

5.3 Scheduling and data dependence constraints

The following variables are first defined:

    s_{u,i} = 1 if u starts executing at cycle i, and 0 otherwise,    ∀u ∈ V, i ∈ C

The following constraint guarantees that an instruction is scheduled at one, and only one, cycle of the steady state:

    Σ_{i ∈ C} s_{u,i} = 1,    ∀u ∈ V    (4)

For simplicity in the definition of the data dependence constraints, the auxiliary variable c_u is used to denote the cycle at which u is scheduled. Hence,

    c_u = Σ_{i ∈ C} i · s_{u,i},    ∀u ∈ V    (5)

The constraint that guarantees data dependences to be honored in the schedule is the following [5, 12, 18, 23]:

    c_v ≥ c_u + T(u) − II · δ_{u,v},    ∀(u, v) ∈ E    (6)

Figure 8 illustrates how δ_{u,v} influences the schedule of v. If δ_{u,v} = 0 (ILD), then v must be scheduled after the completion of u (cycle c_u + T(u)), as shown in Figure 8(a). By increasing the fold from u_i to u_{i+m} (Figures 8(b) and 8(c)), u_i is implicitly rescheduled m·II cycles backwards, thus increasing the time frame to schedule v_i by the same amount.
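The constraints above can be written down almost literally with an off-the-shelf modelling layer. The sketch below is our illustration using the PuLP library (the authors rely on lp_solve and the dedicated solver of Section 6, not on this code); it builds constraints (3)-(6) together with the resource bound that Section 5.4 formalizes next, for a made-up toy instance.

```python
import pulp

def build_model(V, E, T, L, fu, F, II):
    """V: instructions; E: dict (u, v) -> initial distance; T: delays; L: latencies;
    fu: instruction -> FU type; F: FU type -> number of units; II: steady-state length."""
    C = range(II)
    prob = pulp.LpProblem("loop_pipelining", pulp.LpMinimize)
    s = pulp.LpVariable.dicts("s", (V, C), cat=pulp.LpBinary)              # s_{u,i}
    sigma = pulp.LpVariable.dicts("sigma", V, lowBound=0, cat=pulp.LpInteger)
    delta = pulp.LpVariable.dicts("delta", list(E), lowBound=0, cat=pulp.LpInteger)
    prob += pulp.lpSum(sigma[u] for u in V)        # any objective; here: small folds
    for u in V:                                    # (4): schedule each u exactly once
        prob += pulp.lpSum(s[u][i] for i in C) == 1
    c = {u: pulp.lpSum(i * s[u][i] for i in C) for u in V}      # (5): start cycle
    for (u, v), d0 in E.items():
        prob += delta[(u, v)] == sigma[u] - sigma[v] + d0       # (3): loop folding
        prob += c[v] >= c[u] + T[u] - II * delta[(u, v)]        # (6): dependences
    for t, Ft in F.items():                                     # (7): resource usage
        for i in C:
            prob += pulp.lpSum(s[u][j % II]
                               for u in V if fu[u] == t
                               for j in range(i - L[u] + 1, i + 1)) <= Ft
    return prob

# Toy instance: the loop of Figure 1 on one adder and one multiplier, II = 2
V = ["A", "B", "C"]
E = {("A", "B"): 0, ("B", "C"): 0, ("B", "B"): 1}
T = {"A": 1, "B": 1, "C": 2}
L = {"A": 1, "B": 1, "C": 1}          # C assumed to run on a pipelined 2-cycle multiplier
prob = build_model(V, E, T, L, {"A": "add", "B": "add", "C": "mul"},
                   {"add": 1, "mul": 1}, II=2)
print(pulp.LpStatus[prob.solve(pulp.PULP_CBC_CMD(msg=False))])   # Optimal if feasible
```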

5.4 Resource constraints

The following constraints guarantee that no more than F_t functional units are used at each cycle:

    Σ_{u ∈ ALLOC(t)}  Σ_{j = i−L(u)+1}^{i}  s_{u, j mod II}  ≤  F_t,    ∀t ∈ M, i ∈ C    (7)

An instruction u uses an FU during the cycles c_u, ..., c_u + L(u) − 1 (mod II). In case the latency of the operation is longer than II, the execution of several instances of the same operation may overlap and, therefore, more than one FU may be required. Assume, for example, II = 3 and L(u) = 5, u being the only instruction that uses FUs of type t. The generated constraints for F_t are the following:

    s_{u,−4 mod 3} + s_{u,−3 mod 3} + s_{u,−2 mod 3} + s_{u,−1 mod 3} + s_{u,0 mod 3} ≤ F_t    (for cycle 0)
    s_{u,−3 mod 3} + s_{u,−2 mod 3} + s_{u,−1 mod 3} + s_{u,0 mod 3} + s_{u,1 mod 3} ≤ F_t    (for cycle 1)
    s_{u,−2 mod 3} + s_{u,−1 mod 3} + s_{u,0 mod 3} + s_{u,1 mod 3} + s_{u,2 mod 3} ≤ F_t    (for cycle 2)

which, after the modulo calculations, result in:

    2·s_{u,0} + s_{u,1} + 2·s_{u,2} ≤ F_t    (for cycle 0)
    2·s_{u,0} + 2·s_{u,1} + s_{u,2} ≤ F_t    (for cycle 1)
    s_{u,0} + 2·s_{u,1} + 2·s_{u,2} ≤ F_t    (for cycle 2)

5.5 Register requirements

The maximum number of variables whose lifetimes overlap at any cycle, MAXLIVE [30], is the minimum number of registers required for a schedule. Each edge u → v of the DG determines the lifetime of the variable produced by u with regard to v. The lifetime spreads from the completion of instruction u (cycle c_u + T(u)) to the cycle in which the FU executing v no longer requires the input data (cycle c_v + L(v) − 1). Other execution models can also be considered for registers: in [8], variable lifetimes spread from the initiation of u to the initiation of v, a model more appropriate for parallel architectures.

As depicted in the example of Figure 9, an edge with δ_{u,v} = k spreads the variable lifetime across k iterations. For any cycle c in the steady state, the edge requires k − 1, k, or k + 1 registers, depending on the scheduling of instructions u and v. Three cases can be distinguished for T(u) = L(v) = 1 (see the examples in Figure 9):

a) u →ᵏ v crosses cycle c forward: k + 1 registers are required (example: edge u_{i+2} →² v_i at cycle 2).

b) u →ᵏ v does not cross cycle c: k registers are required (example: edge u_{i+2} →² v_i at cycles 0 and 1, or edge r_{i+1} →¹ s_i at cycle 0).

c) u →ᵏ v crosses cycle c backwards: k − 1 registers are required (example: edge r_{i+1} →¹ s_i at cycles 1 and 2).

Figure 9: Register requirements for folded loops: (a) folded DG, (b) schedule and cycles crossed by the edges, (c) variable lifetimes and required registers.

The following auxiliary variables are defined to specify whether an operation reads or writes a result before a given cycle:

    READS_BEFORE_{u,c}  = Σ_{i=0}^{c−1} s_{u, (i−L(u)+1) mod II}     (u reads before cycle c)
    WRITES_BEFORE_{u,c} = Σ_{i=0}^{c−1} s_{u, (i−T(u)+1) mod II}     (u writes before cycle c)

The auxiliary variable r_{u,v,c} defines the number of registers required at cycle c to store a result produced by operation u and consumed by operation v:

    r_{u,v,c} = δ_{u,v} − [ ⌊(T(u)−1)/II⌋ + WRITES_BEFORE_{u, (T(u)−1) mod II} ]
                        + [ ⌊(L(v)−1)/II⌋ + READS_BEFORE_{v, (L(v)−1) mod II} ]
                        + WRITES_BEFORE_{u,c} − READS_BEFORE_{v,c}    (8)

The expressions in brackets determine a correction on δ_{u,v} produced by the execution time of u and the latency of v: no register is required to store the value while u executes, whereas a register is required during the first L(v) cycles of v's execution. In case T(u) = L(v) = 1, as in the examples depicted in Figure 9, the following simplified expression is obtained:

    r_{u,v,c} = δ_{u,v} + WRITES_BEFORE_{u,c} − READS_BEFORE_{v,c}
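For a given folded schedule, the register count that constraint (8) expresses linearly can also be evaluated directly by expanding every lifetime interval and counting, per steady-state cycle, how many copies of each value are alive. The sketch below (ours, not from the paper) does this; the toy schedule is made up so that it reproduces the 3, 2 and 3 registers quoted for cycles 0, 1 and 2 in the Figure 9 example.

```python
def maxlive(edges, c, T, L, II):
    """edges: list of (u, v, delta) in the folded DG; c[u]: start cycle (0 <= c[u] < II).
    A value written by u for v is alive from c[u]+T(u) to c[v]+L(v)-1 + delta*II."""
    live = [0] * II
    # edges with the same source share one value: keep the longest lifetime only
    last_use = {}
    for u, v, d in edges:
        start = c[u] + T[u]
        end = c[v] + L[v] - 1 + d * II
        if u not in last_use or end > last_use[u][1]:
            last_use[u] = (start, end)
    for start, end in last_use.values():
        for t in range(start, end + 1):
            live[t % II] += 1
    return max(live)

# Toy folded schedule with II = 3: edge u ->(2) v and edge r ->(1) s
edges = [("u", "v", 2), ("r", "s", 1)]
c = {"u": 1, "v": 2, "r": 2, "s": 0}
T = {"u": 1, "v": 1, "r": 1, "s": 1}
L = {"u": 1, "v": 1, "r": 1, "s": 1}
print(maxlive(edges, c, T, L, 3))   # 3  (live per cycle: [3, 2, 3])
```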

A variable can be the input of more than one instruction. This is shown in the DG by several edges sharing the same source vertex. However, there is no need to allocate different registers for the same variable: the number of registers required is the maximum over all edges with the same source. Therefore, the variables r_{u,c} can also be defined on vertices as follows:

    r_{u,v,c} ≤ r_{u,c},    ∀u ∈ V, c ∈ C    (9)

It can easily be proved that if operation v is the last use of u's result, then r_{u,v,c} = r_{u,c} for any cycle c [8]. The following constraint must be included to define MAXLIVE:

    Σ_{u ∈ V} r_{u,c} ≤ ML,    ∀c ∈ C    (10)

ML being a variable minimized in the objective function.

5.6 Objective function

Assume we have a cost (area) vector A = (A_1, ..., A_m) for the FUs of the architecture. Similarly, we can have a cost A_r for each register. (Piecewise-linear cost functions for registers can also be incorporated, as proposed in [10].) The objective function of the ILP model can be formulated as follows:

    min Area = Σ_{t ∈ M} A_t · F_t + A_r · ML    (11)

Note that F_t and ML can be either variables or constants, depending on the initial constraints defined for the problem. In case all of them are constants, the ILP model will merely find a feasible solution. Here we have considered ML as the number of registers required by the schedule. Although this might not always hold, it has been shown to be a realistic assumption for many practical cases [30].

5.7 Complexity of the model

Table 2 describes the fundamental variables and constraints of the model. |V| and |E| denote the number of nodes and edges of the DG respectively, whereas m is the number of different FU types of the architecture. Some of the variables, e.g. F_t and ML, may become constants if the number of resources of the architecture is defined in advance.

6 Branch and Prune

The most popular method to solve ILP models is the combination of branch-and-bound techniques with a linear-programming solver such as simplex [26]. Having integer variables in the model increases the complexity from polynomial to exponential. The running time for ILP is highly influenced by the exploration of the branch-and-bound tree.

Table 2: Variables and constraints of the ILP model.

    variable    number                  constraint    number
    σ_u         |V|                     (4)           |V|
    s_{u,i}     |V| · II                (6)           |E|
    F_t         m                       (7)           m · II
    r_{u,c}     |V| · II                (8), (9)      |V| · II
    ML          1                       (10)          II
    total       |V|(2·II + 1) + m + 1   total         |V|(II + 1) + |E| + II(m + 1)

For an ILP solver that is insensitive to the structure of the problem, the tree of solutions is blindly explored, and finding valid integer solutions may require solving an excessive number of LP problems. We have implemented an ad-hoc solver for loop pipelining that takes advantage of the information known a priori about the problem. The solver uses a branch-and-bound paradigm to explore the space of integer solutions but, rather than indiscriminately invoking an LP solver at each branch, it allows rapid pruning as soon as enough information has been captured to evaluate the objective function. The order in which the integer variables are explored is also crucial to reduce the running time. The approach used for this problem, which we have called branch and prune, is presented next.

6.1 Retiming and scheduling

The problem of loop pipelining can be decomposed into two different problems:

1. Retiming, or finding the loop fold of each operation, and

2. Scheduling, or assigning operations to cycles.

After solving retiming, i.e. defining σ_u for all operations, the problem of loop pipelining is reduced to the scheduling of a basic block in which not all dependences must be taken into account. According to constraint (6), all dependences with T(u) − II·δ_{u,v} ≤ 0 are not relevant for scheduling (the value of δ_{u,v} can be obtained from (3) once σ_u and σ_v have been defined). This is the rationale behind our first decision on the exploration order of the integer variables: first all σ_u (retiming) and then the rest (scheduling).

6.1.1 Retiming

Recurrences impose the most stringent constraints on the operations' folds, since the sum of the δ's of the edges of a recurrence must be constant (equation (2)). Let us take, as an example, the recurrence A-B-C-D-A of the DG depicted in Figure 10. Assume we define σ_C = 10 and σ_D = 9. Then δ_{C,D} = 1, and the folds for A and B are implicitly defined by the fact that only one of the edges of the recurrence can have δ = 1. Therefore, σ_A = σ_B = 10. The same holds for the recurrence C-D-E-C, in which σ_E = 10 for the same reason.

Figure 10 shows the exploration tree for the example. The order of the variables is selected as follows: first explore the nodes belonging to the most stringent recurrences; inside each recurrence, first explore those nodes that also belong to other recurrences. In the example, the recurrence C-D-E-C is the most stringent, and nodes C and D also belong to other recurrences; thus, the order C-D-E-A-B has been chosen.

Figure 10: Branch-and-prune search for σ_u.

Since only the relative difference between the σ's is required to determine the steady-state pattern, the fold of the first node explored can always be fixed (σ_C = 10 in this case). For nodes not belonging to any recurrence, the exploration order is defined according to a neighborhood criterion with respect to the nodes in the recurrences. In these cases, however, the number of branches to be explored may be unbounded, since no recurrence limits the value of σ. For this reason, an early calculation of a lower bound on the register requirements is done at each node of the tree. This calculation assumes worst-case values for the auxiliary variables WRITES_BEFORE and READS_BEFORE in constraint (8). In practice, this estimation has proved very effective in pruning the search tree when compared with the cost of the best solution obtained so far.

6.1.2 Scheduling

After the folds of the operations have been defined, a scheduling problem is posed for each leaf of the search tree. At this point, some dependences of the DG can be eliminated from the model (those with T(u) − II·δ_{u,v} ≤ 0) and the critical path of the retimed DG can be calculated. If the critical path is longer than II, no feasible schedule can be found and the exploration is pruned. Checking the critical path of the DG can even be done before all σ's have been defined, whenever the information available at a node of the search tree is enough to discard a solution.

The variables defined at this stage of the problem are the scheduling variables s_{u,i}. Here the exploration order is also crucial, and we have chosen one of the well-known algorithms for time-constrained scheduling, force-directed scheduling (FDS) [28], to assist the solver in defining an efficient strategy to generate the search tree. FDS performs a stepwise assignment of operations to cycles according to criteria that attempt to balance the utilization of resources over all cycles of the schedule. In our model, each operation has II 0-1 variables, at most one of which can be 1. The order of operation assignments produced by FDS is the one we use to generate the search tree. The cycle to which each operation is scheduled by FDS is the first value explored.
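A schematic skeleton of the exploration described in this section (our reading of it, with invented helper names such as domains, lower_bound and schedule_leaf; it is not the authors' solver):

```python
from math import inf

def branch_and_prune(order, domains, lower_bound, schedule_leaf):
    """Depth-first exploration of the retiming variables in a fixed order.
    order: operations to fold (most stringent recurrences first);
    domains(assignment, op): candidate folds for op, preferred value first;
    lower_bound(assignment): optimistic cost of any completion of assignment;
    schedule_leaf(assignment): (schedule, cost) found by the FDS-guided
                               scheduling phase, or None if infeasible."""
    best_cost, best_schedule = inf, None

    def explore(assignment, depth):
        nonlocal best_cost, best_schedule
        if lower_bound(assignment) >= best_cost:      # prune: cannot beat the incumbent
            return
        if depth == len(order):                       # all folds fixed: schedule the leaf
            leaf = schedule_leaf(assignment)
            if leaf is not None and leaf[1] < best_cost:
                best_schedule, best_cost = leaf
            return
        op = order[depth]
        for fold in domains(assignment, op):          # preferred value explored first
            explore({**assignment, op: fold}, depth + 1)

    explore({}, 0)
    return best_schedule, best_cost

# Smoke test with dummy callables (purely illustrative)
sched, cost = branch_and_prune(
    order=["C", "D", "E", "A", "B"],
    domains=lambda a, op: [10, 9, 11],
    lower_bound=lambda a: 0,
    schedule_leaf=lambda a: (dict(a), sum(a.values())))
print(cost)   # 45
```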

Figure 11: Generation of the search tree for scheduling, guided by FDS.

Assignments to the other cycles are explored in increasing order of the forces estimated by FDS (see Figure 11). This strategy leads the solver to a near-optimal solution very early, thus allowing an efficient pruning of the tree for the other branches. At each node of the tree, in which some operations have already been assigned to cycles, the scheduling constraints no longer relevant for the final solution are eliminated (e.g. if ASAP(u) = 2, the variables s_{u,0} and s_{u,1} are not defined and their corresponding constraints are simplified). Similarly to branch-and-bound, the LP solver is invoked at each branch of the tree to prune those solutions with a cost greater than that of the best integer solution found.

7 Experimental results

The techniques presented in the previous sections have been implemented by using the package lp_solve [2]. In this section, several results are reported to show the main features of the method: the exploration of the optimal unrolling degree and the efficiency of the branch-and-prune strategy to solve the ILP model. The types of resources used in the schedules are adders (1-cycle latency) and multipliers (2-cycle latency).

7.1 Optimal unrolling degree

Table 3 presents the results obtained by exploring two different unrolling degrees (see Table 1) for the example depicted in Figure 6. An RCLP problem with 4 resources has been solved. MAXLIVE is also minimized for the optimal II found by the model. The columns V and E indicate the number of nodes and edges of the DG after unrolling the loop. #Vars and #Const indicate the number of variables and constraints of the ILP model. SPAN is calculated as σ_max − σ_min + 1 (the number of different iterations involved in the schedule).

The first row presents the result for II_K = 5 and K = 4 (point C in Figure 6), which is an infeasible problem. The second row presents the result for point A = (4, 3). This case obtains optimal results in MAXLIVE and II, compared with the schedule found for the third point (B = (6, 4)), which is the one explored by other techniques that perform sub-optimal loop unrolling.

Table 4 shows the results obtained for some examples selected from the set presented in [13]. In all cases, 2 load/store units, 2 multipliers, 3 adders and 1 divider have been used. Note that unrolling is required to achieve a schedule in MII cycles for all cases. Table 5 shows the results obtained for the Cytron example when no register minimization is required.

Table 3: Results obtained for the example of Figure 6(a).

Table 4: Results obtained for a selection of examples (SPEC-SPICE loop7, SPEC-SPICE loop8, LIVERMORE loop1) from the Perfect Club set.

Table 5: Results for the Cytron example when no register minimization is required.

7.2 ILP model for RCLP

A significant part of the computational cost of solving the ILP model is spent in guaranteeing that the final solution is optimal. For this reason, we also report the best solution obtained after 1 minute of CPU time (only the value of MAXLIVE, which is the critical variable in the optimization of the objective function). The results presented in Tables 6 and 7 show that benchmarks with a large number of operations can be solved with moderate computational cost. The method can also be used for heuristic search by limiting the maximum CPU time and providing the best solution found at that moment (e.g. one minute). The results demonstrate that the optimal solution can often be obtained by using only a small fraction of the total computational cost, although a proof of optimality cannot be given in this case. Optimal solutions have been found for models with more than 1000 variables and 700 constraints.

Table 8 shows the results obtained for the differential equation solver. For this example, two types of resources were considered: a non-pipelined multiplier with a two-cycle delay, and an ALU with a one-cycle delay for the rest of the operations.

7.3 ILP model for TCLP

Table 9 presents some results on solving time-constrained loop pipelining for the elliptic filter. The cost of the multipliers, adders and registers has been defined as 80, 20 and 10 area units respectively.

Table 6: Results obtained for the 16-point FIR filter.

Table 7: Results for the 5th-order elliptic wave filter (with pipelined and with non-pipelined multipliers).

Table 8: Results for the differential equation solver.

In this case, the value of II is defined initially, whereas the area cost of the resources is minimized. The complexity of the models is equivalent to that of the corresponding solutions reported in Table 7. Table 10 shows equivalent results when pipelined multipliers are considered. Table 11 presents the results for the Cytron example when solving the TCLP problem.

Table 9: TCLP for the 5th-order elliptic wave filter with non-pipelined multipliers.

Table 10: Results obtained for the 5th-order elliptic filter with pipelined multipliers for TCLP.

Table 11: Results obtained for the Cytron example for TCLP.

7.4 Results without loop folding

In this section we present the results obtained when no loop folding is required. In this case, only one leaf of the branch-and-prune tree is explored: the one in which all σ's have the same value. Table 12 shows the results obtained for the elliptic filter. The program obtains the schedule without loop folding in a given number of cycles, minimizing the value of MAXLIVE.

7.5 Near-optimal solutions for large examples

Although the branch-and-prune strategy has proved to be efficient for models with 10³ variables/constraints, the exponential nature of the method becomes critical in some cases. Rather than discarding the method, we claim that it can still be used to obtain near-optimal solutions by limiting the CPU time of the search. Tables 13 and 14 present results on the Cytron and FDCT loops obtained by limiting the CPU time to 10 minutes. Given the status of the search at the time the solution was delivered, we suspect that the results are optimal (although we could not prove it). The strategies used to explore the values of σ and the scheduling variables (force-directed scheduling) are crucial to converge rapidly to near-optimal solutions.

8 Conclusions

Finding optimal schedules for moderate-size loops is feasible. In this paper we have presented a mathematical model that can be solved by using ILP. However, efficient techniques to wisely explore the space of integer solutions are required to avoid a blind navigation through the branch-and-bound tree. We have presented a new strategy called branch and prune that takes advantage of the information known a priori about the problem to seek near-optimal solutions as fast as possible. Decomposing loop pipelining into two sub-problems (retiming and scheduling) and using force-directed scheduling to assist the search have been essential heuristics to guarantee optimal solutions in moderate CPU times. We have also demonstrated that exploring the unrolling degree is necessary to find optimal solutions for loop pipelining. A mathematical formulation based on Farey's series has been proposed for such a goal.

Table 12: Results obtained for the elliptic filter when only scheduling is required (non-pipelined multipliers).

Table 13: Results for the Cytron example after 10 minutes of CPU time.

Table 14: Results obtained for the FDCT after 10 minutes of CPU time.

Finally, we think that other pruning strategies for ILP models can be considered for the problem we have solved. New techniques, such as branch and cut, successfully used for the traveling-salesman problem [25], can probably be adapted to accommodate many problems in high-level synthesis.

References

[1] A. Aiken and A. Nicolau. Perfect pipelining: a new loop parallelization technique. Volume 300 of Lecture Notes in Computer Science, Springer-Verlag, March.

[2] M.R.C.M. Berkelaar. lp_solve version 2.0: a public domain ILP solver. Available at ftp.es.ele.tue.nl.

[3] F. Catthoor, W. Geurts, and H. De Man. Loop transformation methodology for fixed-rate video, image and telecom processing applications. In Proc. of the Int. Conf. on Application-Specific Array Processors 1994 (ASAP 94).

[4] L.-F. Chao, A. LaPaugh, and E.H.-M. Sha. Rotation scheduling: a loop pipelining algorithm. In Proc. of the 30th Design Automation Conf., June.

[5] R. Cytron. Compile-time scheduling and optimization for asynchronous machines. PhD thesis, University of Illinois at Urbana-Champaign.

[6] S.M. Heemstra de Groot, S.H. Gerez, and O.E. Herrmann. Range-chart-guided iterative data-flow graph scheduling. IEEE Transactions on Circuits and Systems I, 39(5), May.

[7] Y.G. DeCastelo-Vide e Souza, M. Potkonjak, and A.C. Parker. Optimal ILP-based approach for throughput optimization using simultaneous algorithm/architecture matching and retiming. In Proc. of the 32nd Design Automation Conf., June.

[8] A.E. Eichenberger, E.S. Davidson, and S.G. Abraham. Optimum modulo schedules for minimum register requirements. In Proc. of the 9th ACM International Symposium on Supercomputing, pages 31-40, June.

[9] C.H. Gebotys. Throughput optimized architectural synthesis. IEEE Trans. on VLSI Systems, 1(3), September.

[10] C.H. Gebotys and M.I. Elmasry. Global optimization approach for architectural synthesis. IEEE Trans. Computer-Aided Design, 12(9), September.

[11] E.F. Girczyc. Loop winding: a data flow approach to functional pipelining. In Proc. Int. Symp. Circuits and Systems, May.

[12] G. Goossens, J. Rabaey, J. Vandewalle, and H. De Man. An efficient microcode compiler for application specific DSP processors. IEEE Trans. Computer-Aided Design, 9(9), September.

[13] R. Govindarajan, E.R. Altman, and G.R. Gao. Minimizing register requirements under resource-constrained rate-optimal software pipelining. In Proc. of the 27th Annual International Symposium on Microarchitecture, pages 85-94, November.

[14] L. Hafer and A. Parker. A formal method for the specification, analysis and design of register-transfer-level digital logic. IEEE Trans. Computer-Aided Design, 2(1):4-17, January.

[15] R.A. Huff. Lifetime-sensitive modulo scheduling. In Proc. of the 6th Conf. Programming Language Design and Implementation.

[16] C.-T. Hwang and Y.-C. Hsu. Zone scheduling. IEEE Trans. Computer-Aided Design, 12(7), July.

[17] C.-T. Hwang, Y.-C. Hsu, and Y.-L. Lin. Scheduling for functional pipelining and loop winding. In Proc. of the 28th Design Automation Conf., June.

[18] C.-T. Hwang, J.-H. Lee, and Y.-C. Hsu. A formal approach to the scheduling problem in high level synthesis. IEEE Trans. Computer-Aided Design, 10(4), April.

[19] R.B. Jones and V.H. Allan. Software pipelining: a comparison and improvement. In Proc. 23rd Ann. Workshop on Microprogramming and Microarchitecture, pages 46-56, November.

[20] R. Karp. A characterization of the minimum cycle mean in a digraph. Discrete Mathematics, 23.

[21] P.M. Kogge. The Architecture of Pipelined Computers. McGraw-Hill, New York.

[22] M. Lam. Software pipelining: an effective scheduling technique for VLIW machines. In Proc. of SIGPLAN 88, June.

[23] T.-F. Lee, A.C.-H. Wu, Y.-L. Lin, and D.D. Gajski. A transformation-based method for loop folding. IEEE Trans. Computer-Aided Design, 13(4), April.

[24] C.E. Leiserson and J.B. Saxe. Retiming synchronous circuitry. Algorithmica, 6:5-35.

[25] M.W. Padberg and G. Rinaldi. A branch-and-cut algorithm for the resolution of large-scale symmetric traveling salesman problems. SIAM Review, 33:60-100.

[26] C.H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms and Complexity. Prentice-Hall.

[27] K.K. Parhi and D.G. Messerschmitt. Static rate-optimal scheduling of iterative data-flow programs via optimum unfolding. IEEE Trans. Computers, 40(2), February.

[28] P.G. Paulin and J.P. Knight. Force-directed scheduling for the behavioral synthesis of ASICs. IEEE Trans. Computer-Aided Design, 8(6), June.

[29] R. Potasman, J. Lis, A. Nicolau, and D.D. Gajski. Percolation-based synthesis. In Proc. of the 27th Design Automation Conf.

[30] B.R. Rau. Iterative modulo scheduling: an algorithm for software pipelining loops. In Proc. of the 27th Annual International Symposium on Microarchitecture, pages 63-74, November.

[31] B.R. Rau and C.D. Glaeser. Some scheduling techniques and an easily schedulable horizontal architecture for high performance scientific computing. In Proc. of the 14th Annual Workshop on Microprogramming, October.

[32] M. Renfors and Y. Neuvo. The maximum sampling rate of digital filters under hardware speed constraints. IEEE Trans. on Circuits and Systems, CAS-28(3), March.

[33] M. Rim and R. Jain. Lower-bound performance estimation for the high-level synthesis scheduling problem. IEEE Trans. Computer-Aided Design, 13(4), April.

[34] F. Sánchez and J. Cortadella. Time constrained loop pipelining. In Proc. Int. Conf. Computer-Aided Design, November.

[35] M.R. Schroeder. Number Theory in Science and Communication. Springer-Verlag.

[36] B. Su, S. Ding, and J. Xia. URPR: an extension of URCR for software pipelining. In 19th Microprogramming Workshop (MICRO-19), October.


More information

STATIC SCHEDULING FOR CYCLO STATIC DATA FLOW GRAPHS

STATIC SCHEDULING FOR CYCLO STATIC DATA FLOW GRAPHS STATIC SCHEDULING FOR CYCLO STATIC DATA FLOW GRAPHS Sukumar Reddy Anapalli Krishna Chaithanya Chakilam Timothy W. O Neil Dept. of Computer Science Dept. of Computer Science Dept. of Computer Science The

More information

Symbolic Modeling and Evaluation of Data Paths*

Symbolic Modeling and Evaluation of Data Paths* Symbolic Modeling and Evaluation of Data Paths* Chuck Monahan Forrest Brewer Department of Electrical and Computer Engineering University of California, Santa Barbara, U.S.A. 1. Introduction Abstract We

More information

Register-Sensitive Software Pipelining

Register-Sensitive Software Pipelining Regier-Sensitive Software Pipelining Amod K. Dani V. Janaki Ramanan R. Govindarajan Veritas Software India Pvt. Ltd. Supercomputer Edn. & Res. Centre Supercomputer Edn. & Res. Centre 1179/3, Shivajinagar

More information

12.1 Formulation of General Perfect Matching

12.1 Formulation of General Perfect Matching CSC5160: Combinatorial Optimization and Approximation Algorithms Topic: Perfect Matching Polytope Date: 22/02/2008 Lecturer: Lap Chi Lau Scribe: Yuk Hei Chan, Ling Ding and Xiaobing Wu In this lecture,

More information

Parameterized graph separation problems

Parameterized graph separation problems Parameterized graph separation problems Dániel Marx Department of Computer Science and Information Theory, Budapest University of Technology and Economics Budapest, H-1521, Hungary, dmarx@cs.bme.hu Abstract.

More information

ROTATION SCHEDULING ON SYNCHRONOUS DATA FLOW GRAPHS. A Thesis Presented to The Graduate Faculty of The University of Akron

ROTATION SCHEDULING ON SYNCHRONOUS DATA FLOW GRAPHS. A Thesis Presented to The Graduate Faculty of The University of Akron ROTATION SCHEDULING ON SYNCHRONOUS DATA FLOW GRAPHS A Thesis Presented to The Graduate Faculty of The University of Akron In Partial Fulfillment of the Requirements for the Degree Master of Science Rama

More information

Unit 2: High-Level Synthesis

Unit 2: High-Level Synthesis Course contents Unit 2: High-Level Synthesis Hardware modeling Data flow Scheduling/allocation/assignment Reading Chapter 11 Unit 2 1 High-Level Synthesis (HLS) Hardware-description language (HDL) synthesis

More information

Area. A max. f(t) f(t *) A min

Area. A max. f(t) f(t *) A min Abstract Toward a Practical Methodology for Completely Characterizing the Optimal Design Space One of the most compelling reasons for developing highlevel synthesis systems has been the desire to quickly

More information

Accelerated SAT-based Scheduling of Control/Data Flow Graphs

Accelerated SAT-based Scheduling of Control/Data Flow Graphs Accelerated SAT-based Scheduling of Control/Data Flow Graphs Seda Ogrenci Memik Computer Science Department University of California, Los Angeles Los Angeles, CA - USA seda@cs.ucla.edu Farzan Fallah Fujitsu

More information

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 36

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 36 CS 473: Algorithms Ruta Mehta University of Illinois, Urbana-Champaign Spring 2018 Ruta (UIUC) CS473 1 Spring 2018 1 / 36 CS 473: Algorithms, Spring 2018 LP Duality Lecture 20 April 3, 2018 Some of the

More information

Theoretical Constraints on Multi-Dimensional Retiming Design Techniques

Theoretical Constraints on Multi-Dimensional Retiming Design Techniques header for SPIE use Theoretical onstraints on Multi-imensional Retiming esign Techniques N. L. Passos,.. efoe, R. J. Bailey, R. Halverson, R. Simpson epartment of omputer Science Midwestern State University

More information

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management

Effective Memory Access Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management International Journal of Computer Theory and Engineering, Vol., No., December 01 Effective Memory Optimization by Memory Delay Modeling, Memory Allocation, and Slack Time Management Sultan Daud Khan, Member,

More information

Automatic Counterflow Pipeline Synthesis

Automatic Counterflow Pipeline Synthesis Automatic Counterflow Pipeline Synthesis Bruce R. Childers, Jack W. Davidson Computer Science Department University of Virginia Charlottesville, Virginia 22901 {brc2m, jwd}@cs.virginia.edu Abstract The

More information

Topic 12: Cyclic Instruction Scheduling

Topic 12: Cyclic Instruction Scheduling Topic 12: Cyclic Instruction Scheduling COS 320 Compiling Techniques Princeton University Spring 2015 Prof. David August 1 Overlap Iterations Using Pipelining time Iteration 1 2 3 n n 1 2 3 With hardware

More information

Copyright 2007 Pearson Addison-Wesley. All rights reserved. A. Levitin Introduction to the Design & Analysis of Algorithms, 2 nd ed., Ch.

Copyright 2007 Pearson Addison-Wesley. All rights reserved. A. Levitin Introduction to the Design & Analysis of Algorithms, 2 nd ed., Ch. Iterative Improvement Algorithm design technique for solving optimization problems Start with a feasible solution Repeat the following step until no improvement can be found: change the current feasible

More information

Lecture 10,11: General Matching Polytope, Maximum Flow. 1 Perfect Matching and Matching Polytope on General Graphs

Lecture 10,11: General Matching Polytope, Maximum Flow. 1 Perfect Matching and Matching Polytope on General Graphs CMPUT 675: Topics in Algorithms and Combinatorial Optimization (Fall 2009) Lecture 10,11: General Matching Polytope, Maximum Flow Lecturer: Mohammad R Salavatipour Date: Oct 6 and 8, 2009 Scriber: Mohammad

More information

Approximating Fault-Tolerant Steiner Subgraphs in Heterogeneous Wireless Networks

Approximating Fault-Tolerant Steiner Subgraphs in Heterogeneous Wireless Networks Approximating Fault-Tolerant Steiner Subgraphs in Heterogeneous Wireless Networks Ambreen Shahnaz and Thomas Erlebach Department of Computer Science University of Leicester University Road, Leicester LE1

More information

CSE 417 Branch & Bound (pt 4) Branch & Bound

CSE 417 Branch & Bound (pt 4) Branch & Bound CSE 417 Branch & Bound (pt 4) Branch & Bound Reminders > HW8 due today > HW9 will be posted tomorrow start early program will be slow, so debugging will be slow... Review of previous lectures > Complexity

More information

Obstacle-Aware Longest-Path Routing with Parallel MILP Solvers

Obstacle-Aware Longest-Path Routing with Parallel MILP Solvers , October 20-22, 2010, San Francisco, USA Obstacle-Aware Longest-Path Routing with Parallel MILP Solvers I-Lun Tseng, Member, IAENG, Huan-Wen Chen, and Che-I Lee Abstract Longest-path routing problems,

More information

Mathematical and Algorithmic Foundations Linear Programming and Matchings

Mathematical and Algorithmic Foundations Linear Programming and Matchings Adavnced Algorithms Lectures Mathematical and Algorithmic Foundations Linear Programming and Matchings Paul G. Spirakis Department of Computer Science University of Patras and Liverpool Paul G. Spirakis

More information

System-Level Synthesis of Application Specific Systems using A* Search and Generalized Force-Directed Heuristics

System-Level Synthesis of Application Specific Systems using A* Search and Generalized Force-Directed Heuristics System-Level Synthesis of Application Specific Systems using A* Search and Generalized Force-Directed Heuristics Chunho Lee, Miodrag Potkonjak, and Wayne Wolf Computer Science Department, University of

More information

CMSC 451: Lecture 22 Approximation Algorithms: Vertex Cover and TSP Tuesday, Dec 5, 2017

CMSC 451: Lecture 22 Approximation Algorithms: Vertex Cover and TSP Tuesday, Dec 5, 2017 CMSC 451: Lecture 22 Approximation Algorithms: Vertex Cover and TSP Tuesday, Dec 5, 2017 Reading: Section 9.2 of DPV. Section 11.3 of KT presents a different approximation algorithm for Vertex Cover. Coping

More information

1. Lecture notes on bipartite matching February 4th,

1. Lecture notes on bipartite matching February 4th, 1. Lecture notes on bipartite matching February 4th, 2015 6 1.1.1 Hall s Theorem Hall s theorem gives a necessary and sufficient condition for a bipartite graph to have a matching which saturates (or matches)

More information

Notes for Lecture 24

Notes for Lecture 24 U.C. Berkeley CS170: Intro to CS Theory Handout N24 Professor Luca Trevisan December 4, 2001 Notes for Lecture 24 1 Some NP-complete Numerical Problems 1.1 Subset Sum The Subset Sum problem is defined

More information

Lecture 19. Software Pipelining. I. Example of DoAll Loops. I. Introduction. II. Problem Formulation. III. Algorithm.

Lecture 19. Software Pipelining. I. Example of DoAll Loops. I. Introduction. II. Problem Formulation. III. Algorithm. Lecture 19 Software Pipelining I. Introduction II. Problem Formulation III. Algorithm I. Example of DoAll Loops Machine: Per clock: 1 read, 1 write, 1 (2-stage) arithmetic op, with hardware loop op and

More information

DM545 Linear and Integer Programming. Lecture 2. The Simplex Method. Marco Chiarandini

DM545 Linear and Integer Programming. Lecture 2. The Simplex Method. Marco Chiarandini DM545 Linear and Integer Programming Lecture 2 The Marco Chiarandini Department of Mathematics & Computer Science University of Southern Denmark Outline 1. 2. 3. 4. Standard Form Basic Feasible Solutions

More information

Integer Programming Theory

Integer Programming Theory Integer Programming Theory Laura Galli October 24, 2016 In the following we assume all functions are linear, hence we often drop the term linear. In discrete optimization, we seek to find a solution x

More information

REDUCING GRAPH COLORING TO CLIQUE SEARCH

REDUCING GRAPH COLORING TO CLIQUE SEARCH Asia Pacific Journal of Mathematics, Vol. 3, No. 1 (2016), 64-85 ISSN 2357-2205 REDUCING GRAPH COLORING TO CLIQUE SEARCH SÁNDOR SZABÓ AND BOGDÁN ZAVÁLNIJ Institute of Mathematics and Informatics, University

More information

Foundations of Computing

Foundations of Computing Foundations of Computing Darmstadt University of Technology Dept. Computer Science Winter Term 2005 / 2006 Copyright c 2004 by Matthias Müller-Hannemann and Karsten Weihe All rights reserved http://www.algo.informatik.tu-darmstadt.de/

More information

Theorem 2.9: nearest addition algorithm

Theorem 2.9: nearest addition algorithm There are severe limits on our ability to compute near-optimal tours It is NP-complete to decide whether a given undirected =(,)has a Hamiltonian cycle An approximation algorithm for the TSP can be used

More information

SDC-Based Modulo Scheduling for Pipeline Synthesis

SDC-Based Modulo Scheduling for Pipeline Synthesis SDC-Based Modulo Scheduling for Pipeline Synthesis Zhiru Zhang School of Electrical and Computer Engineering, Cornell University, Ithaca, NY zhiruz@cornell.edu Bin Liu Micron Technology, Inc., San Jose,

More information

Last topic: Summary; Heuristics and Approximation Algorithms Topics we studied so far:

Last topic: Summary; Heuristics and Approximation Algorithms Topics we studied so far: Last topic: Summary; Heuristics and Approximation Algorithms Topics we studied so far: I Strength of formulations; improving formulations by adding valid inequalities I Relaxations and dual problems; obtaining

More information

LECTURES 3 and 4: Flows and Matchings

LECTURES 3 and 4: Flows and Matchings LECTURES 3 and 4: Flows and Matchings 1 Max Flow MAX FLOW (SP). Instance: Directed graph N = (V,A), two nodes s,t V, and capacities on the arcs c : A R +. A flow is a set of numbers on the arcs such that

More information

HIGH-LEVEL SYNTHESIS

HIGH-LEVEL SYNTHESIS HIGH-LEVEL SYNTHESIS Page 1 HIGH-LEVEL SYNTHESIS High-level synthesis: the automatic addition of structural information to a design described by an algorithm. BEHAVIORAL D. STRUCTURAL D. Systems Algorithms

More information

Integer Programming ISE 418. Lecture 7. Dr. Ted Ralphs

Integer Programming ISE 418. Lecture 7. Dr. Ted Ralphs Integer Programming ISE 418 Lecture 7 Dr. Ted Ralphs ISE 418 Lecture 7 1 Reading for This Lecture Nemhauser and Wolsey Sections II.3.1, II.3.6, II.4.1, II.4.2, II.5.4 Wolsey Chapter 7 CCZ Chapter 1 Constraint

More information

A General Class of Heuristics for Minimum Weight Perfect Matching and Fast Special Cases with Doubly and Triply Logarithmic Errors 1

A General Class of Heuristics for Minimum Weight Perfect Matching and Fast Special Cases with Doubly and Triply Logarithmic Errors 1 Algorithmica (1997) 18: 544 559 Algorithmica 1997 Springer-Verlag New York Inc. A General Class of Heuristics for Minimum Weight Perfect Matching and Fast Special Cases with Doubly and Triply Logarithmic

More information

Polynomial-Time Approximation Algorithms

Polynomial-Time Approximation Algorithms 6.854 Advanced Algorithms Lecture 20: 10/27/2006 Lecturer: David Karger Scribes: Matt Doherty, John Nham, Sergiy Sidenko, David Schultz Polynomial-Time Approximation Algorithms NP-hard problems are a vast

More information

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems

A Level-wise Priority Based Task Scheduling for Heterogeneous Systems International Journal of Information and Education Technology, Vol., No. 5, December A Level-wise Priority Based Task Scheduling for Heterogeneous Systems R. Eswari and S. Nickolas, Member IACSIT Abstract

More information

Efficient Second-Order Iterative Methods for IR Drop Analysis in Power Grid

Efficient Second-Order Iterative Methods for IR Drop Analysis in Power Grid Efficient Second-Order Iterative Methods for IR Drop Analysis in Power Grid Yu Zhong Martin D. F. Wong Dept. of Electrical and Computer Engineering Dept. of Electrical and Computer Engineering Univ. of

More information

Published in HICSS-26 Conference Proceedings, January 1993, Vol. 1, pp The Benet of Predicated Execution for Software Pipelining

Published in HICSS-26 Conference Proceedings, January 1993, Vol. 1, pp The Benet of Predicated Execution for Software Pipelining Published in HICSS-6 Conference Proceedings, January 1993, Vol. 1, pp. 97-506. 1 The Benet of Predicated Execution for Software Pipelining Nancy J. Warter Daniel M. Lavery Wen-mei W. Hwu Center for Reliable

More information

URSA: A Unified ReSource Allocator for Registers and Functional Units in VLIW Architectures

URSA: A Unified ReSource Allocator for Registers and Functional Units in VLIW Architectures Presented at IFIP WG 10.3(Concurrent Systems) Working Conference on Architectures and Compliation Techniques for Fine and Medium Grain Parallelism, Orlando, Fl., January 1993 URSA: A Unified ReSource Allocator

More information

Max-Cut and Max-Bisection are NP-hard on unit disk graphs

Max-Cut and Max-Bisection are NP-hard on unit disk graphs Max-Cut and Max-Bisection are NP-hard on unit disk graphs Josep Díaz and Marcin Kamiński 2 Llenguatges i Sistemes Informàtics 2 RUTCOR, Rutgers University Universitat Politècnica de Catalunya 640 Bartholomew

More information

Dual-Based Approximation Algorithms for Cut-Based Network Connectivity Problems

Dual-Based Approximation Algorithms for Cut-Based Network Connectivity Problems Dual-Based Approximation Algorithms for Cut-Based Network Connectivity Problems Benjamin Grimmer bdg79@cornell.edu arxiv:1508.05567v2 [cs.ds] 20 Jul 2017 Abstract We consider a variety of NP-Complete network

More information

Topic: Local Search: Max-Cut, Facility Location Date: 2/13/2007

Topic: Local Search: Max-Cut, Facility Location Date: 2/13/2007 CS880: Approximations Algorithms Scribe: Chi Man Liu Lecturer: Shuchi Chawla Topic: Local Search: Max-Cut, Facility Location Date: 2/3/2007 In previous lectures we saw how dynamic programming could be

More information

Methods and Models for Combinatorial Optimization Exact methods for the Traveling Salesman Problem

Methods and Models for Combinatorial Optimization Exact methods for the Traveling Salesman Problem Methods and Models for Combinatorial Optimization Exact methods for the Traveling Salesman Problem L. De Giovanni M. Di Summa The Traveling Salesman Problem (TSP) is an optimization problem on a directed

More information

The alternator. Mohamed G. Gouda F. Furman Haddix

The alternator. Mohamed G. Gouda F. Furman Haddix Distrib. Comput. (2007) 20:21 28 DOI 10.1007/s00446-007-0033-1 The alternator Mohamed G. Gouda F. Furman Haddix Received: 28 August 1999 / Accepted: 5 July 2000 / Published online: 12 June 2007 Springer-Verlag

More information

Optimization Techniques for Design Space Exploration

Optimization Techniques for Design Space Exploration 0-0-7 Optimization Techniques for Design Space Exploration Zebo Peng Embedded Systems Laboratory (ESLAB) Linköping University Outline Optimization problems in ERT system design Heuristic techniques Simulated

More information

Algorithms for Integer Programming

Algorithms for Integer Programming Algorithms for Integer Programming Laura Galli November 9, 2016 Unlike linear programming problems, integer programming problems are very difficult to solve. In fact, no efficient general algorithm is

More information

COMP 355 Advanced Algorithms Approximation Algorithms: VC and TSP Chapter 11 (KT) Section (CLRS)

COMP 355 Advanced Algorithms Approximation Algorithms: VC and TSP Chapter 11 (KT) Section (CLRS) COMP 355 Advanced Algorithms Approximation Algorithms: VC and TSP Chapter 11 (KT) Section 35.1-35.2(CLRS) 1 Coping with NP-Completeness Brute-force search: This is usually only a viable option for small

More information

Chapter 16. Greedy Algorithms

Chapter 16. Greedy Algorithms Chapter 16. Greedy Algorithms Algorithms for optimization problems (minimization or maximization problems) typically go through a sequence of steps, with a set of choices at each step. A greedy algorithm

More information

Placement Algorithm for FPGA Circuits

Placement Algorithm for FPGA Circuits Placement Algorithm for FPGA Circuits ZOLTAN BARUCH, OCTAVIAN CREŢ, KALMAN PUSZTAI Computer Science Department, Technical University of Cluj-Napoca, 26, Bariţiu St., 3400 Cluj-Napoca, Romania {Zoltan.Baruch,

More information

Vertex Cover Approximations

Vertex Cover Approximations CS124 Lecture 20 Heuristics can be useful in practice, but sometimes we would like to have guarantees. Approximation algorithms give guarantees. It is worth keeping in mind that sometimes approximation

More information

Introduction to Electronic Design Automation. Model of Computation. Model of Computation. Model of Computation

Introduction to Electronic Design Automation. Model of Computation. Model of Computation. Model of Computation Introduction to Electronic Design Automation Model of Computation Jie-Hong Roland Jiang 江介宏 Department of Electrical Engineering National Taiwan University Spring 03 Model of Computation In system design,

More information

Analysis of Conditional Resource Sharing using a Guard-based Control Representation *

Analysis of Conditional Resource Sharing using a Guard-based Control Representation * Analysis of Conditional Resource Sharing using a Guard-based Control Representation * Ivan P. Radivojevi c orrest Brewer Department of Electrical and Computer Engineering University of California, Santa

More information

Chapter 15 Introduction to Linear Programming

Chapter 15 Introduction to Linear Programming Chapter 15 Introduction to Linear Programming An Introduction to Optimization Spring, 2015 Wei-Ta Chu 1 Brief History of Linear Programming The goal of linear programming is to determine the values of

More information

Notes for Lecture 18

Notes for Lecture 18 U.C. Berkeley CS17: Intro to CS Theory Handout N18 Professor Luca Trevisan November 6, 21 Notes for Lecture 18 1 Algorithms for Linear Programming Linear programming was first solved by the simplex method

More information

Data Path Allocation using an Extended Binding Model*

Data Path Allocation using an Extended Binding Model* Data Path Allocation using an Extended Binding Model* Ganesh Krishnamoorthy Mentor Graphics Corporation Warren, NJ 07059 Abstract Existing approaches to data path allocation in highlevel synthesis use

More information

Pracy II Konferencji Krajowej Reprogramowalne uklady cyfrowe, RUC 99, Szczecin, 1999, pp Implementation of IIR Digital Filters in FPGA

Pracy II Konferencji Krajowej Reprogramowalne uklady cyfrowe, RUC 99, Szczecin, 1999, pp Implementation of IIR Digital Filters in FPGA Pracy II Konferencji Krajowej Reprogramowalne uklady cyfrowe, RUC 99, Szczecin, 1999, pp.233-239 Implementation of IIR Digital Filters in FPGA Anatoli Sergyienko*, Volodymir Lepekha*, Juri Kanevski**,

More information

CS 372: Computational Geometry Lecture 10 Linear Programming in Fixed Dimension

CS 372: Computational Geometry Lecture 10 Linear Programming in Fixed Dimension CS 372: Computational Geometry Lecture 10 Linear Programming in Fixed Dimension Antoine Vigneron King Abdullah University of Science and Technology November 7, 2012 Antoine Vigneron (KAUST) CS 372 Lecture

More information

arxiv:cs/ v1 [cs.ds] 20 Feb 2003

arxiv:cs/ v1 [cs.ds] 20 Feb 2003 The Traveling Salesman Problem for Cubic Graphs David Eppstein School of Information & Computer Science University of California, Irvine Irvine, CA 92697-3425, USA eppstein@ics.uci.edu arxiv:cs/0302030v1

More information

6.856 Randomized Algorithms

6.856 Randomized Algorithms 6.856 Randomized Algorithms David Karger Handout #4, September 21, 2002 Homework 1 Solutions Problem 1 MR 1.8. (a) The min-cut algorithm given in class works because at each step it is very unlikely (probability

More information

High-Level Synthesis (HLS)

High-Level Synthesis (HLS) Course contents Unit 11: High-Level Synthesis Hardware modeling Data flow Scheduling/allocation/assignment Reading Chapter 11 Unit 11 1 High-Level Synthesis (HLS) Hardware-description language (HDL) synthesis

More information

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Zhining Huang, Sharad Malik Electrical Engineering Department

More information

EXERCISES SHORTEST PATHS: APPLICATIONS, OPTIMIZATION, VARIATIONS, AND SOLVING THE CONSTRAINED SHORTEST PATH PROBLEM. 1 Applications and Modelling

EXERCISES SHORTEST PATHS: APPLICATIONS, OPTIMIZATION, VARIATIONS, AND SOLVING THE CONSTRAINED SHORTEST PATH PROBLEM. 1 Applications and Modelling SHORTEST PATHS: APPLICATIONS, OPTIMIZATION, VARIATIONS, AND SOLVING THE CONSTRAINED SHORTEST PATH PROBLEM EXERCISES Prepared by Natashia Boland 1 and Irina Dumitrescu 2 1 Applications and Modelling 1.1

More information

A note on the subgraphs of the (2 )-grid

A note on the subgraphs of the (2 )-grid A note on the subgraphs of the (2 )-grid Josep Díaz a,2,1 a Departament de Llenguatges i Sistemes Informàtics, Universitat Politècnica de Catalunya Campus Nord, Edifici Ω, c/ Jordi Girona Salgado 1-3,

More information

General properties of staircase and convex dual feasible functions

General properties of staircase and convex dual feasible functions General properties of staircase and convex dual feasible functions JÜRGEN RIETZ, CLÁUDIO ALVES, J. M. VALÉRIO de CARVALHO Centro de Investigação Algoritmi da Universidade do Minho, Escola de Engenharia

More information

The character of the instruction scheduling problem

The character of the instruction scheduling problem The character of the instruction scheduling problem Darko Stefanović Department of Computer Science University of Massachusetts March 997 Abstract Here I present some measurements that serve to characterize

More information

Parallel Query Processing and Edge Ranking of Graphs

Parallel Query Processing and Edge Ranking of Graphs Parallel Query Processing and Edge Ranking of Graphs Dariusz Dereniowski, Marek Kubale Department of Algorithms and System Modeling, Gdańsk University of Technology, Poland, {deren,kubale}@eti.pg.gda.pl

More information

Symbolic Buffer Sizing for Throughput-Optimal Scheduling of Dataflow Graphs

Symbolic Buffer Sizing for Throughput-Optimal Scheduling of Dataflow Graphs Symbolic Buffer Sizing for Throughput-Optimal Scheduling of Dataflow Graphs Anan Bouakaz Pascal Fradet Alain Girault Real-Time and Embedded Technology and Applications Symposium, Vienna April 14th, 2016

More information

Counting the Number of Eulerian Orientations

Counting the Number of Eulerian Orientations Counting the Number of Eulerian Orientations Zhenghui Wang March 16, 011 1 Introduction Consider an undirected Eulerian graph, a graph in which each vertex has even degree. An Eulerian orientation of the

More information

RISC IMPLEMENTATION OF OPTIMAL PROGRAMMABLE DIGITAL IIR FILTER

RISC IMPLEMENTATION OF OPTIMAL PROGRAMMABLE DIGITAL IIR FILTER RISC IMPLEMENTATION OF OPTIMAL PROGRAMMABLE DIGITAL IIR FILTER Miss. Sushma kumari IES COLLEGE OF ENGINEERING, BHOPAL MADHYA PRADESH Mr. Ashish Raghuwanshi(Assist. Prof.) IES COLLEGE OF ENGINEERING, BHOPAL

More information

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs

Advanced Operations Research Techniques IE316. Quiz 1 Review. Dr. Ted Ralphs Advanced Operations Research Techniques IE316 Quiz 1 Review Dr. Ted Ralphs IE316 Quiz 1 Review 1 Reading for The Quiz Material covered in detail in lecture. 1.1, 1.4, 2.1-2.6, 3.1-3.3, 3.5 Background material

More information