Area. f(t) f(t ) A min. T min. T T T max Latency

Size: px

Start display at page:

Download "Area. f(t) f(t *) A min. T min. T T * T max Latency"

Elvin Hunter
6 years ago
Views:

1 Ecient Optimal Design Space Characterization Methodologies STEPHEN A. BLYTHE Rensselaer Polytechnic Institute and ROBERT A. WALKER Kent State University 1 One of the primary advantages of a high-level synthesis system is its ability to explore the design space. This paper presents several methodologies for design space exploration that compute all optimal tradeo points for the combined problem of scheduling, clock length determination, and module selection. We discuss how each methodology takes advantage of both the structure within the design space itself as well as the structure of, and interaction between, each of the three subproblems. 1. INTRODUCTION For many years, one of the most compelling reasons for developing high-level synthesis systems [Gajski et al. 1994] [De Micheli 1994] has been the desire to quickly explore a wide range of designs for the same behavioral description. Given a set of designs, two metrics are commonly used to evaluate their quality: area (ideally total area, but often only functional unit area), and time (the schedule length, or latency). Finding the optimal tradeo curve between these two metrics is called design space exploration. Design space exploration is generally considered too dicult to solve optimally in a reasonable amount of time, so the problem is usually limited to computing either lower bounds [Timmer et al. 1993] or estimates [Chen and Jeng 1991] on the optimal tradeo curve for some set of time or resource constraints. Moreover, the design space is usually determined by solving only the scheduling and functional unit allocation subproblems. The design space exploration methodology described here goes beyond traditional design space exploration in several ways. First, all optimal tradeo points are computed so that the design space is completely characterized. Second, these optimal tradeo points represent optimal solutions to the time-constrained scheduling (TCS) and resource-constrained scheduling (RCS) problems, rather than lower bounds or estimates. Third, the tradeo points are computed in a manner that supports more realistic module libraries by incorporating clock length determination and module selection into the methodology. Finally, these tradeo points are 0 This material is based upon work supported by the National Science Foundation under Grant No. MIP Author's addresses: Stephen A. Blythe, Department of Computer Science, Rensselaer Polytechnic Institute, Troy, NY 12180; Robert A. Walker, Department of Mathematics and Computer Science, Kent State University, Kent, OH 44242

2 2 S. A. Blythe and R. A. Walker Area A max f(t) f(t *) A min T min T T * T max Latency Fig. 1. Example design space showing Pareto points. The shaded regions show the two distinct clusters of Pareto points that many tradeo curves exhibit. computed in an ecient manner through careful pruning of the search space during the design cycle. The resulting methodology can also be extended to include additional subproblems. 1.1 The Design Space The process of exploring the design space can be viewed as solving either the timeconstrained scheduling (TCS) problem (minimizing the functional unit area) for a range of time constraints, or the resource-constrained scheduling (RCS) problem (minimizing the latency) for a range of resource constraints. Although there is a tradeo between latency and area, the tradeo curve is not smooth due to the nite combinations of the library modules available [McFarland 1987]. Consider the design space shown in Figure 1 { this curve can be described by the set of points f(t; f(t ))g, where f(t ) is the minimum area required for a given time constraint T (i.e., the optimal solution to that TCS problem). To ensure that this curve is completely characterized, one could exhaustively solve the TCS problem optimally for every time constraint T from T min (the critical path length) to T max (the time constraint corresponding to the module selection / allocation with the minimum area). However, this brute-force approach is not very ecient. Fortunately, the number of points needed to fully characterize the optimal tradeo curve is much smaller. The curve can be completely characterized by the set f(t ; f(t ))g of optimal tradeo points (shown by black dots in Figure 1) { those points for which there is no design with a smaller latency and the same area, and no design with a smaller area and the same latency. Such points are called Pareto points [De Micheli 1994] [Brayton and Spence 1984], and can be formally dened as follows: f(t ) < f(t? k); 1 k T? T min f(t ) f(t + k); 1 k T max? T Therefore, the design space exploration problem can be solved more eciently by

3 Ecient Optimal Design Space Characterization Methodologies 3 nding only the Pareto points in the design space. Furthermore, many optimal tradeo curves contain two distinct clusters of such Pareto points, as shown by the shaded regions in Figure 1: one where the latency is small and the area is large, and another where the area is small and the latency is large. This paper explores several approaches to nd all the Pareto points in an ecient manner. First, Section 2 describes a basic methodology that explores the latency axis to nd the Pareto points in the design space through repeatedly applying a TCS methodology. Section 3 discusses how to extend this problem to incorporate the clock length determination problem, and Section 4 discusses the incorporation of module selection. In Section 5, a related approach based on solving the RCS problem repeatedly while taking advantage of the structure of the module selection problem is discussed, and a comparison is made with the TCS based method. Then Section 6 examines the advantages and disadvantages of combining the two approaches in a manner similar to Timmer's bounding methodology [Timmer et al. 1993]. Sections 7 and 8 describes techniques for pivoting between the TCS and RCS-based methodologies to take advantage of the Pareto point clusters described above. In Section 9, results for a fairly complex module library are presented, and why this pivoting method works well for such a library is discussed. Lastly, Section 10 gives a summary of this work and suggests some future directions that it may take. 2. LATENCY-AXIS EXPLORATION To nd all of the Pareto points, either the TCS problem could be solved repeatedly for various time constraints, or the RCS problem could be solved repeatedly for various resource constraints. Our methodology repeatedly solves the TCS problem 2, which leads to two subproblems: (1) determining which time constraints to explore, and (2) determining how to eciently explore the design space at each time constraint. Ideally, we want to avoid exhaustively searching all time constraints in the feasible range [T min ; T max ]. If the module set and clock length are specied a priori, then the TCS problem need only be solved for those time constraints that are a multiple of the clock length, since any other time constraint could be replaced by the smaller of the two time constraints that it would lie between without any increase in area. As a simple example, consider the design space exploration problem for the DIF- FEQ example [Paulin and Knight 1989], using library A from Table 1 (the \trivial" library 1 found in [Timmer et al. 1993]) and a clock length of 100. The minimum time constraint is 600 (the length of the critical path), the maximum time constraint is 1300 (the latency required for a feasible schedule with 1 mult and 1 alu1), so the only time constraints that must be explored are those in the set f600; 700; 800; 900; 1000; 1100; 1200; 1300g. Given that set of time constraints, our Voyager design space exploration system [Chaudhuri et al. 1997] eciently characterizes the design space as follows. The main loop (see Figure 2) scans the time constraints in the direction of in- 2 Note that, although we are solving only the TCS problem, this methodology is not limited to solving only that problem, and could be extended to include register allocation, interconnect allocation, control unit design, etc.

4 4 S. A. Blythe and R. A. Walker Design Space Exploration: area cur MAXINT compute T min and T max compute all candidate time constraints T i in [T min; T max] for each T i from T min to T max if (latency(asap) > f(t i)) (T i; f(t i)) is not a feasible schedule else compute the lower bound lb on f(t i) if (lb >= area cur) else (T i; f(t i)) is not a Pareto point /* np-lb */ compute the upper bound ub on f(t i) if (lb = ub) (T i; f(t i)) is a Pareto point /* P-lbub */ area cur = lb else compute the LP-relaxed lower bound rlb on f(t i) if (rlb >= area cur) (T i; f(t i)) is not a Pareto point /* np-rlb */ else if (rlb = ub) (T i; f(t i)) is a Pareto point /* P-rlbub */ area cur = rlb else calculate IP solution ip = f(t i). if (ip < area cur) (T i; f(t i)) is a Pareto point /* P-ILP */ area cur = ip else (T i; f(t i)) is not a Pareto point/* np-ilp */ Fig. 2. Voyager's main design space exploration loop creasing latency. At each time constraint, an ASAP schedule is rst calculated to determine if a feasible schedule exists for that time constraint and clock length. If so, then it uses a heuristic to compute a lower bound on the functional unit (FU) area; if this area is the same as or larger 3 than the previous area, then that solution is not a Pareto point. This is the case for time constraints 900, 1000, 1100, and 1200 in Figure 3. However, if the lower bound is smaller than the previous area, then it is a potential 3 In the problem as specied so far, the area will never be larger. However, it may be larger when the clock length determination and module selection problems are incorporated into the methodology as described in Sections 3 and 4.

5 Ecient Optimal Design Space Characterization Methodologies 5 MODULE AREA DELAY (ns) OPERATIONS mult f*g alu f+;?; <g Table 1. Library A { Timmer's \trivial" library Optimal (Pareto-Based) Curve Optimal Solutions Lower Bounds Area Latency Fig. 3. Results from DIFFEQ using library A Pareto point. The methodology then computes an upper bound on the area, and compares it to the lower bound. If the two are equal, then the point is an optimal solution, and a Pareto point (e.g., time constraint 800 in Figure 3); if not, then the results are still inconclusive (e.g., time constraint 700). It then uses a tighter (but more computationally-intensive) FU lower-bounding method based on LPrelaxation, and tries this procedure again (in this example, determining that time constraint 700 corresponds to a Pareto point). If this method also fails, then it solves a carefully-developed ILP formulation [Chaudhuri et al. 1994] to determine the optimal solution, using the bounds determined earlier to reduce the search space for that solution. Thus our base methodology quickly determines whether or not each time constraint corresponds to a Pareto point by carefully pruning the search space. It rst computes a small set of time constraints to explore. Increasingly tighter heuristics are then used to try to determine if each time constraint corresponds to a Pareto point. Only if those heuristics fail is a more computationally-intensive ILP formulation used. Unfortunately, assuming the module set and clock length are specied a priori is unrealistic with complex module libraries. Accordingly, Section 3 describes how this base methodology can be extended to include clock length determination and Section 4 describes the incorporation of module selection. 3. ADDING CLOCK DETERMINATION As described earlier, our base methodology explores a set of time constraints, determining whether or not the solution to each TCS problem is a Pareto point. The problem was simplied by assuming the clock length was known a priori, whereas

6 6 S. A. Blythe and R. A. Walker recent work has shown that not only is determining the system clock length a dif- cult problem [Chen and Jeng 1991; Chaiyakul et al. 1992; Narayan and Gajski 1992; Corazao et al. 1993; Jha et al. 1995; Chaudhuri et al. 1997]), but the choice of the clock length has a signicant impact on the resulting design. Therefore, the problem of clock length determination must be folded into the design space exploration problem. 3.1 Prior Work As described in [Chaudhuri et al. 1997], the clock determination problem is usually ignored in favor of ad hoc decisions or estimates. For example, several early synthesis systems used the delay of the slowest functional unit as the estimated clock length, a choice which favored the use of chaining and disallowed multi-cycling. A heuristic method for nding the clock length was given in [Narayan and Gajski 1992], but the result may not be optimal. To guarantee that the optimal clock length is chosen, 4 the scheduling problem could be solved repeatedly for every possible clock length { a very computationallyintensive task. Fortunately, such an exhaustive search is not necessary, as the set of candidate clock lengths to be scheduled can be reduced. In [Corazao et al. 1993], one method for reducing that set is given. A tighter method was introduced in [Chen and Jeng 1991], and later proven correct in [Jha et al. 1995] and [Chaudhuri et al. 1997] { this method computes a small set of candidate clock lengths (one of which must be the optimal clock length) by taking the ceiling of the integral divisors of each of the functional unit delays. 3.2 Pruning the Candidate Clock Lengths Even these integral-divisor methods can lead to a set of candidate clock lengths so large that it becomes too time-consuming to solve the TCS problem for each clock length at each time constraint. Fortunately, the set of candidate clock lengths can be reduced even further, as described below. Definition 1. For a given clock length c, the slack s k of a module of type k with execution delay d k is given by s k (c) = c dk? d k ; c Theorem 1. Given a clock length c, if there exists a clock length c such that s k (c ) s k (c) for all module types k in the current module selection, then c can be replaced by c without lengthening the schedule. Proof: Since s k (c ) s k (c) for all modules k, the same quality will hold for operations in a schedule using these modules. Thus all operations in the schedule using c could be scheduled at least as soon, if not sooner, in a schedule using c because all operations will be capable of executing faster (or equally as fast) in the schedule using c. Thus, changing the clock length to c can only improve the schedule.2 4 Actually, this is only the data path component of the system clock length; the nal clock length includes controller and interconnect delays as well.

7 clocks time constr. design points np-lb np-rlb np-ilp P-lb P -rlb P-ILP Ecient Optimal Design Space Characterization Methodologies 7 MODULE AREA DELAY (ns) OPERATIONS mul f*g alu f+;?; <g Table 2. Library B { Narayan's library Clock Length slack(*) slack(+) replaced by { { { / { / / Table 3. Slack values found in library B To demonstrate the use of this theorem, consider library B, shown in Table 2 (the VDP100 library from [Narayan and Gajski 1992], augmented with areas similar to those of library A). Assuming a technology limit of 17ns on the shortest clock length, integral divisor methods give the set CK = f163; 82; 55; 48; 41; 33; 28; 24; 21; 19; 17g of candidate clock lengths, with the corresponding slack values shown in Table 3. Consider the clock length of 33ns, found as d163=5e = 33. When a multiplier is scheduled using this clock length, there will be a slack of 2ns. There are several clock lengths whose slack for the multiplier is smaller, but the slack corresponding to the alu1 is always larger. However, a clock length of 55ns has the same slack as the multiplier, and less slack for the alu1. Therefore, Theorem 1 says that any schedule that uses a clock length of 33ns can be shortened by using a clock length of 55ns (without increasing the number of functional units). When Theorem 1 is applied to the full set of candidate clock lengths, the set is reduced to CK 0 = f163; 82; 55; 24g. Note that when two sets of slack values are equivalent, the shorter clock length is dropped since it would tend to result in a larger controller. 3.3 Exploring the Candidate Clock Lengths Once the pruned set CK 0 of candidate clock lengths has been computed, the integral multiples of each of those clock lengths give the time constraints to explore. Then, for each such time constraint and candidate clock length, the methodology outlined in Figure 2 can be applied. The eciency of the search at each time Table 4. Statistics from solving DIFFEQ using library B

8 8 S. A. Blythe and R. A. Walker Optimal (Pareto-Based) Curve Optimal Solutions Lower Bounds Area Latency Fig. 4. Results from DIFFEQ using library B Design Space Exploration w/ Clock Determination: area cur MAXINT compute pruned set CK 0 of candidate clock lengths compute T min and T max for each c j in CK 0 compute all candidate time constraints T i in [T min; T max] for each T i from T min to T max using each c j in CK 0 inducing T i determine if (T i; f(t i)) is a Pareto point (see Figure 2) Fig. 5. Voyager's design space exploration loop with clock determination constraint can be improved by observing that each time constraint was derived as an integral multiple of one or more clock lengths, so only those inducing clock lengths need be explored at that time constraint. The resulting methodology is outlined in Figure 5. Using library B and the DIFFEQ example, this methodology generates the design space shown in Figure 4. From the pruned set CK 0 = f163; 82; 55; 24g of candidate clock lengths, 49 time constraints were generated, and 50 time constraint / clock length pairs were explored (note that there was only a single time constraint with more than one candidate clock length). Two corresponded to infeasible schedules, while the other 48 had to be examined to determine if they were Pareto points. As Table 4 shows (the headings of the last six columns correspond to labels in Figure 2), the vast majority of the solutions were determined to be either Pareto or non-pareto points using the bounding heuristics { only two were solved using the tighter LP-relaxation lower bounding method and no solutions required an ILP approach! Note also that in several cases (time constraints in the range ), the

9 Ecient Optimal Design Space Characterization Methodologies 9 MODULE AREA DELAY (ns) OPERATIONS mult f*g alu f+;?; <g sub f-g add f+g alu f+;?; <g sub f-g add f+g Table 5. Library C { Timmer's library 2 lower bound diered from the optimal solution, so methods based solely on lowerbounding would incorrectly characterize the design space. Finally, Figure 4 also demonstrates the importance of systematically examining all relevant clock lengths in the design space. At a time constraint of 652, the inducing clock length of 163ns leads to a solution with an area of 500, whereas the previous time constraint had a lower area of 400. Although the point (652, 500) is optimal with respect to its time constraint and a xed clock length of 163 ns, it is not a Pareto point, and is thus rejected by the line labeled /* np-lb */ in Figure ADDING MODULE SELECTION While adding clock length determination to the base methodology is an important step toward supporting more complex libraries, the methodology must also be extended to cover libraries that oer a number of possible module sets. Again, we would prefer to avoid an exhaustive search of all possible module sets, yet we must ensure that we do not miss any combination of a time constraint, clock length, and module set that corresponds to a Pareto point. 4.1 Prior Work Over the years, a variety of methods have been employed to determine the appropriate module set. One method, described in [Jain et al. 1988], generates a number of module sets, and then selects the best one. Another method, presented in [Timmer et al. 1993], computes an initial module set through a MILP formulation, and determines its validity by scheduling; if no viable schedule is found, then the set (and its allocation) are updated, and the scheduling process is repeated. 5 As with some of the previous work on clock length determination, using such techniques to determine a single module set before (and independently of) scheduling cannot guarantee a globally optimal solution. Instead of trying to nd a single module set, the method found in [Chen and Jeng 1991] exhaustively explores all possible module sets. Since this method also exhaustively explores all integral divisor based clock lengths, its computational complexity is too large for optimal scheduling, so only estimates are computed. 4.2 Exploring Dierent Module Sets Fortunately, such an exhaustive search is not necessary. Many of the possible module sets can be eliminated since they are incapable of implementing all the 5 This method also incorporates the type mapping problem into the MILP formulation { something our methodology does not yet handle. See Section 9.

10 10 S. A. Blythe and R. A. Walker 5000 library clocks time constr. design points np-lb np-rlb np-ilp P-lb P-rlb P-ILP 4500 Optimal (Pareto-Based) Curve Optimal Solutions Lower Bounds Area Latency Fig. 6. Results from DIFFEQ using library C MODULE AREA DELAY (ns) OPERATIONS alu f; +;?;<; >g mul f*g add f+g sub f-g cmp f<; >g Table 6. Library D { an articial complex library operation types found in the data ow graph. For example, in the case of the DIFFEQ, the module set must be capable of performing the operations f+;?; ; <g; any module sets that do not can be eliminated. Moreover, the number of module sets that must be explored at each time constraint can be reduced (as was the number of candidate clock lengths) by observing that each time constraint was derived as an integral multiple of a clock length derived from one or more specic modules. Therefore, only those module sets that contain at least one of those modules must be explored at that time constraint. Using library C, shown in Table 5 (library 2 from [Timmer et al. 1993]), and the DIFFEQ example, the methodology described above generates the design space shown in Figure 6. There are 32 possible module sets, but only 1 pruned candidate clock length (100ns) and 9 time constraints, resulting in 288 TCS problems to solve. 32 resulted in infeasible schedules (i.e., no solution was possible), and as before, the vast majority of the solutions were determined to be either Pareto or non-pareto points using the bounding heuristics. C D Table 7. Statistics from solving DIFFEQ using libraries C and D

11 Ecient Optimal Design Space Characterization Methodologies Optimal (Pareto-Based) Curve Optimal Solutions Lower Bounds Area Latency Fig. 7. Results from DIFFEQ using library D As another example, consider library D, shown in Table 6 (an articial library slightly less complex than library C, but with more realistic module delays). Using that library, and the DIFFEQ example, the methodology described above generates the design space shown in Figure 7. Here there were 16 possible module sets, 9 integral-divisor candidate clock lengths, and 131 time constraints { almost 19,000 combinations. Even after pruning the candidate clock lengths, there were 6 pruned candidate clock lengths, and 93 time constraints { almost 9,000 combinations. However, the methodology had to solve only 1522 TCS problems (an average of 1.35 clock lengths and module selections at each time constraint). 183 of those were infeasible, and again, the vast majority of the solutions were determined to be either Pareto or non-pareto points using the bounding heuristics. Moreover, this entire procedure took only 1.5 hours of wall-clock time. Without such a careful pruning of the search space, this problem could not have been solved optimally in a reasonable amount of time. Furthermore, with these 4 module delays, there are many resulting designs that lie above the optimal tradeo curve. Although these designs are optimal solutions for a particular clock length and module set, they are not Pareto points, so it is very important that the methodology correctly explores the design space. For example, [Timmer et al. 1993] presents a method that begins at time constraint T max and alternately performs time and area lower-bounding to nd a stair-step tradeo curve. Even if that methodology is enhanced to alternate between optimally solving the resource-constrained and time-constrained scheduling problems, it would only nd the Pareto-based tradeo curve in the absence of the combined module selection and clock length determination problem. If this combined problem was included, the enhanced methodology would fail to nd the Pareto-based curve if one of the points found by time-constrained scheduling is a suboptimal point that lies above the optimal Pareto-based curve. Such a point (which would have a non-minimal area) would then be used by resource-constrained scheduling to nd the minimal latency with this (non-minimal) area, thus compounding the problem and giving an erroneous design curve that actually lies above the optimal area curve based on

12 12 S. A. Blythe and R. A. Walker Module Area Delay Operations alu f; +;?;<;>g mul f*g add f+g sub f-g cmp f<; >g Table 8. An articial complex library the Pareto points. 5. AREA-AXIS EXPLORATION Viewing the previous approach as a latency-axis exploration methodology, an alternative approach is based on the area axis: the Pareto points can be found by determining a set of area constraints to explore, and then optimally solving the RCS problem at each area constraint. Again, it is necessary to reduce the number of constraints to explore, as searching the entire integral range along the area axis would be prohibitively expensive. Given such a reduced set of area constraints, the algorithm outlined in Figure 2 can be modied to explore the area axis instead { starting with the point that has the smallest possible area (and the largest latency) and continuing until it reaches the point with the largest possible area (and minimal latency), solving the RCS problem to minimize the schedule latency at each area constraint. The various solutions can then be determined to be either Pareto or non-pareto points using heuristics and exact techniques together in a manner similar to that described in Section 2. In much the same way that any one time constraint can be induced by more than one clock length, each area constraint can correspond to more than one module set / resource allocation. Enumerating all possible allocations whose resulting area is within A max for each module set gives an initial set of candidate area / resource constraints that could be used in the area-axis methodology. The size of this initial set can then be reduced by noting that it may contain overly loose resource constraints { for example, a resource constraint of 3 adders would be too loose for a behavioral description with only one addition operation. In general, we can reduce the set of candidate resource constraints by upper bounding the number of independent paths in the DFG that could require resources of type T and could possibly be executed in parallel. The resulting upper bound would then be the maximal number of type T resources needed in any allocation for any schedule using any clock length. Although greatly dependent on the module library and DFG in use, applying even such simple heuristics can reduce the number of area constraints to explore by 80% - 85%. Given this basic area-axis methodology, the clock length determination and module selection problems can be incorporated much as they were in the basic latencyaxis methodology. However, in the area-axis methodology, it is easier to solve module selection than clock length determination at each area constraint, as the candidate module selections have the more pronounced eect.

13 Ecient Optimal Design Space Characterization Methodologies 13 DIFF AR EWF Latency Axis Area Axis Timmer Based 14:07 2:36 5: :08 8:50 12: :45 223:22 211: Table 9. Results from axis-based and neighborhood-based Timmer-like exploration using the library from Table Latency-Axis vs. Area-Axis Exploration Due to eect that the inducing clock lengths have on the other problems, it is easier to solve the clock length determination problem than it is to solve the module selection problem at each time constraint in the latency-axis methodology. This is reected by the fact that there are often more candidate module selections than candidate clock lengths at each time constraint. The opposite is true of the area-axis methods - each area constraint is derived from a module selection and allocation, without concern for the clock length determination problem. This frequently results in a single module selection with many clock length candidates at each point considered along the area axis. Results for both the latency-axis and area-axis methodologies are given in the middle two columns of Table 9, which show results for three dierent benchmarks (DIFFEQ, AR-lattice, and Elliptic Wave Filter). In each cell of the table, the execution time 6 is given (as minutes:seconds), along with the total number of points explored (i.e., the number of TCS or RCS problems solved). Note that neither approach is universally better than the other. Most of the time, it was faster to use area-axis exploration, but for the EWF example, several of the RCS problems were quite time-consuming to solve optimally. However, expressing those problems as TCS problems tended to be signicantly faster, resulting in faster latency-axis exploration. Furthermore, the simplicity of the module library used tends to lend itself to giving fewer area constraints, which is a signicant factor in the speed of either method. Thus it is unclear how to determine, a priori, which axis to choose for exploration. 6. A TIMMER-LIKE EXPLORATION While both of the axis-based methods described in Section 5 explore the design space in a reasonably eective manner, neither is universally eective. Furthermore, each method still has a fairly large number of TCS/RCS problems to solve. To increase the eciency further, one approach would be to combine the two methods, using a search methodology similar to the one employed in [Timmer et al. 1993]. As described in that paper, Timmer's methodology solves only the lower bounding problem, but it could easily be adapted to solve the scheduling problem instead. When adapted in this manner, Timmer's methodology essentially alternates between solving TCS and RCS problems, as shown in Figure 8. First, given a clock length and a minimal area (resource) constraint, the RCS problem would be solved, 6 The executions times are based on a Sun SPARC-20 running the Solaris Operating System.

14 14 S. A. Blythe and R. A. Walker A TCS B C D Timmer Curve Timmer Pareto Candidates True Optimal Curve True Pareto Points RCS TCS RCS The pitfall of Timmer's method when considering clock length and module set determi- Fig. 8. nation minimizing the latency for that resource constraint. The resulting latency would then be used as a time constraint and the corresponding TCS problem solved, minimizing the number of resources. Then another RCS problem would be solved using that minimized number of resources, etc. Since every RCS problem solved by this methodology nds a Pareto point, the number of TCS/RCS problems to be solved is reduced considerably over the axis-based methods. As presented, this Timmer-like methodology does not consider the clock length determination problem, but as has been shown in [Chaudhuri et al. 1995] [Chen and Jeng 1991], the clock length has a signicant impact on resulting designs. In fact, failure to incorporate clock length determination can result in overlooking Pareto points in the complete optimal design space. Because Timmer's methodology assumes a single clock length, the resulting Pareto points are optimal relative only to that one clock length and may not fully characterize the design space since time constraints induced by other clock lengths frequently result in smaller areas (TCS solutions). This situation is depicted in Figure 8, where Pareto points C and D were not found with Timmer's methodology, since they correspond to dierent clock lengths than the one being used, and Pareto point A (also corresponding to a dierent clock length) was missed in favor of the false Pareto point B. However, the clock length determination problem can be incorporated into Timmer's method by considering a neighborhood of time constraints around each Timmer Pareto candidate as follows. For each candidate clock length, the induced time constraint closest to, without exceeding, that Pareto point's time constraint is found. Then, instead of solving a single TCS problem at a single time constraint, the minimum area resulting from solving the TCS problem at each of these new time constraints is found, and used as the area constraint for the next RCS problem. Unfortunately, the information from previous schedules can no longer be used to prune the search space as described in Section 2, which puts this method at distinct disadvantage over the axis-based methods. This eect can be seen in the last column of Table 9, where the execution times for our neighborhood-based Timmer-like methodology can exceed those of the axis-based methodologies, even though there is a decrease in the number of design points explored.

15 Ecient Optimal Design Space Characterization Methodologies 15 Design Space Exploration using Simple Pivoting: input percentage of time constraints to explore, perc generate candidate time constraints in range [T min; T max] locate T such that j[t min; T ]j = perc j[t min; T max]j explore [T min; T ] using the latency axis methodology using A corresponding to the point (T ; A ) generate candidate area constraints in range [A min; A ] explore [A min; A ] using the area axis methodology Fig. 9. Voyager's Simple Pivoting Methodology DIFF AR EWF % time constraints explored :07 3:58 2:24 2:01 1:38 1:49 2: :08 11:04 7:45 5:57 4:41 9:24 8: :45 7:01 3:01 1:29 221:46 209:51 223: Table 10. Results from simple pivoting 7. PIVOTING BETWEEN LATENCY-AXIS AND AREA-AXIS EXPLORATION Another approach to combining latency-axis and area-axis exploration is to consider the structure of the tradeo curve. As shown in Figure 1, a large number of the Pareto points are clustered into two regions: one where the latency is small and the area is large, and another where the area is small and the latency is large. This phenomenon is also illustrated in Figure 10. Here, the latency-axis methodology (exploring the latency axis in the direction of increasing latency) would nd many Pareto points fairly quickly, but would then waste a considerable amount of time exploring time constraints that do not correspond to Pareto points until the high-latency cluster of Pareto points is reached. Using the area-axis methodology (exploring the area axis in the direction of increasing area) has a similar shortcoming. However, this shortcoming can be overcome by pivoting between the two axisbased methods { using the latency-axis methodology to explore the high-area / lowlatency cluster, and using the area-axis methodology to explore the high-latency / low-area cluster. This process is outlined in Figure 9. When exploring the latency axis in the direction of decreasing latency, the most obvious method of pivoting is to simply switch from latency-axis exploration to area-axis exploration after exploring a certain percentage (perc) of the latency axis. Note that after making this switch, the area-axis methodology must still explore the area axis in the direction of increasing area (so that information from previous schedules can be used to prune the search space as described in Section 2), but now it can stop when it reaches the last Pareto point found by the latency-axis methodology. The results of performing this pivoting process for various percentages of the

16 16 S. A. Blythe and R. A. Walker AR Optimal (Pareto-Based) Curve EWF Optimal (Pareto-Based) Curve Area Latency Fig. 10. The EWF and AR optimal Pareto-base curves when using the library from Table 8 latency axis are presented in Table 10. Note that the 0% column corresponds to an immediate pivot to area-axis exploration, and the 100% column corresponds to using solely latency-axis exploration. Not surprisingly, for the tradeo curves depicted in Figure 10, the percentages that result in the fastest execution times are fairly low (10%-20%), since most of the low-latency cluster of Pareto points are within the rst 20% of the latency axis. Unfortunately, however, there is no consistent percentage that will always correspond to the best pivot point for every tradeo curve, regardless of whether the execution time 7 or the number of points being explored is the quantity being minimized. Thus, a better method for deciding where to pivot must be found. 8. DYNAMIC PIVOTING Since the best pivot point cannot be determined a priori, it must be determined dynamically during the exploration process. Since the tradeo curve often exhibits two clusters of Pareto points as described earlier, one approach would be to determine when a cluster is being left, and pivot while exploring the next few points that are not members of either cluster. When exploring the latency axis, this pivot would occur when the curve begins to \atten out" into a roughly horizontal line. One simple method of implementing this dynamic pivot is outlined in Figure 11, in which a window W of constant size (W size ) is kept. This window contains the last n design points explored, many of which were pruned as non-pareto points. If the area of the rst element in the window (A 1 ) is not signicantly larger than the area corresponding to the current time constraint (A n ) at any point during latency-axis exploration, the current point is selected as the pivot point. In other words, if a Pareto point has not been found recently, the curve is attening out and the pivot from latency-axis exploration to area-axis exploration is made. 7 Once again, the EWF example contains several points that are computationally more expensive when solved as RCS problems { thus the dramatic execution time increase between 20% and 10% despite the decrease in points explored.

17 Ecient Optimal Design Space Characterization Methodologies 17 Design Space Exploration using Dynamic Window Pivoting: input tolerance percentage tol of time constraints in window W size = tol j[t min; T max]j generate initial window W = [T 1; T n] = [T min; T ] such that jw j = W size explore window W using latency axis methodology using A 1 and A n from the points (T 1; A 1) and (T n; A n) in W while ((A 1? A n)=a 1 > tol) remove T 1 from W append next T i to W calculate A n using TCS method generate nal window W a as the area constraints [A min; A n] explore W a using area axis methodology Fig. 11. Voyager's Dynamic Pivoting Methodology DIFF AR EWF % time constraints in window 0(a) (t) 2:36 1:48 2:23 4:40 5:20 14: :50 10:08 6:20 7:34 8:21 48: :22 256:05 256:15 2:50 3:50 55: Table 11. Results from dynamic window-based pivoting The results of applying this dynamic window-based pivoting are given in Table 11. To determine when a \signicant" change in area was reached, the size of the current window was compared to the change in area over that window. If the percentage change in area was smaller than the size of the window as a percentage of the total number of time constraints, we pivoted to using the area axis; otherwise we continued using the latency axis. Unfortunately, there was no consistent window size that yielded the best result for every case. In general, a window size of 10%-15% of the time constraints generally seemed to give good results. Looking at Table 11, the rst example shown (DIFFEQ) is small enough that 5% of the time constraints is statistically insignicant, leading to results that are dominated by the area-axis exploration. However, the EWF results give a strong argument for using dynamic pivoting { here a bad a priori choice of using only latency-axis exploration or area-axis exploration (as shown in Table 10) could lead to a signicantly larger execution time than 15% dynamic pivoting. 9. FURTHER RESULTS In all of our results so far, we have used the library presented in Table 8. That library has a number of dierent delays, which complicates any design space exploration methodology that considers clock length determination. However, it has only two alternatives for each operation type, leading to a fairly small number of

18 18 S. A. Blythe and R. A. Walker Module Area Delay Operations mul f*g mul f*g mul f*g sub f-,<g sub f-,<g sub f-,<g add f+g add f+g add f+g Table 12. A module selection intensive library DIFF AR EWF Latency Area Timmer Pivot (15%) 44:22 4:37 32:58 6: :32 213:17 313:27 192: :59 232:01 298:46 43: Table 13. Results when using a library with many module selections 4000 AR Optimal (Pareto-Based) Curve EWF Optimal (Pareto-Based) Curve Area Latency Fig. 12. The EWF and AR optimal Pareto-base curves when using the library from Table 12

19 Ecient Optimal Design Space Characterization Methodologies 19 module selection candidates. Now consider Table 12, which is the opposite: it has fewer unique functional unit delays, and several of those delays are multiples of each other. Both of these factors result in fewer resulting candidate clock lengths (for example, several functional units have 50 as a candidate clock length). However, this library has a much larger number of module selection candidates. Results using this library are presented in Table 13. Compared to results using the previous library when the latency-axis methodology is used, there is signicantly more time being spent exploring the latency axis, since the module selection problem is now much more dicult and latency-axis exploration gets most of its time savings due to the structure of the clock length determination problem. However, for area-axis exploration the results are now generally faster, reecting the savings due to considering the structure of the module selection problem. Again, as with the rst library, EWF gives several RCS problems that are time consuming to solve optimally (while the corresponding TCS problems are not as time consuming), thus dramatically increasing overall run time for the area axis. Note that in all cases the number of area constraints to solve is much higher { thus the savings in execution time must result from the fact that there is more structure to each constraint along the area axis for this library. As with the prior library, Timmer-like exploration (even our neighborhood-based Timmer-like exploration) once again fails to produce faster run times for this library, although in this case the primary contributing factor is not only clock length determination but the number of possible module selection candidates at each of the generated time constraints. For AR and EWF, the pivoting method once again gave the best execution times 8, but this time it also explored fewer points than the Timmer-like method! Finally, note that the resulting design spaces for these two benchmarks are also much more complex for this library, as can be seen in Figure 12. The added complexity of these plots is directly attributable to the complexity of the module selection problem { many more area constraints exist, leading to more Pareto points being derived from the corresponding resource constraints. 10. CONCLUSIONS AND FUTURE WORK This paper has examined the process of design space exploration, reducing that process to one of characterizing the optimal latency-area tradeo curve by nding all the Pareto points on that curve. For the combined problem of scheduling, clock length determination, and module selection, we have presented several exploration methodologies: dedicated latency-axis or area-axis exploration, a Timmer-like exploration method, and two methods (one static, one dynamic) for pivoting between the two axis-based methods. Each of these methodologies takes advantage of the structure found along both the latency and are axes by carefully pruning a large number of sub-optimal solutions at each level of the design cycle, making it possible to use optimal scheduling techniques rather than bounds or estimates. Furthermore, 8 The DIFFEQ benchmark once again gives skewed results as its size does not allow a statistically signicant number of Pareto points to be incorporated within the 15% design window, thus not allowing the pivoting method to take full advantage of the structure in the resulting design space.

20 20 S. A. Blythe and R. A. Walker we discussed how the tradeo curve is dominated by two clusters of Pareto points, and how that structure, along with the structure of the combined problem, can be used to more eciently nd the Pareto points. Tests using various benchmarks and dierent module libraries have shown the importance of considering the clock length determination and module selection problem in conjunction with the scheduling problem. When these subproblems are not considered in conjunction like this (often such subproblems are resolved prior to and independently of scheduling), we have shown that results do not accurately reect the optimal tradeo curve. In many cases, methods that do not consider this combined problem entirely miss globally optimal points. Although the methodologies presented solve the design space exploration problem optimally, they could also be used to generate a preliminary characterization by replacing the optimal scheduler with a heuristic scheduler or lower bound estimate. 9 The reductions in the number of constraints to explore would be similar to that found in the optimal case, but the amount if execution time would be lower at the expense of optimality. At present, although these methodologies allow us to handle more realistic module libraries than most previous methodologies since they consider clock length determination and module selection, they do not consider the type mapping problem. That is, they assume that all operations of a given type are mapped to a single functional unit type (found by module selection). To more more fully take advantage of module libraries, and thus to more completely characterize the optimal tradeo curve, the methodologies (in particular the scheduling portions) must be enhanced to handle the complete type-mapping problem. When nding optimal solutions to this type mapping problem, it will crucial to nd tight heuristic bounding techniques for the type mapping problem so that the axis-based methods can maintain their eciency through pruning methods. Furthermore, a more realistic model of the resulting design must be developed so that the methodology also incorporates registers, interconnect issues, controller eects, etc. REFERENCES Blythe, S. A. and Walker, R. A Towards a Practical Methodology for Completely Characterizing the Optimal Design Space. In Proc. of the 9th International Symposium on System-Level Synthesis (La Jolla, California, Nov ). IEEE Computer Society Press. Brayton, R. K. and Spence, R Sensitivity and Optimization. Computer-aided design of electronic circuits. Elsevier Science Publishing Co., INC., 52 Vandervilt Avenue, New York, NY 10017, USA. Chaiyakul, V., Wu, A. C.-H., and Gajski, D. D Timing Models for High Level Synthesis. In Proc. of the European Design Automation Conference (EuroDAC) (Hamburg, Germany, Feb. 1992), pp. 60{65. IEEE Computer Society Press. Chaudhuri, S., Blythe, S. A., and Walker, R. A An Exact Methodology for Scheduling in a 3D Design Space. In Proc. of the 8th International Symposium on System- Level Synthesis (Cannes, France, Sept ), pp. 78{83. IEEE Computer Society Press. Chaudhuri, S., Blythe, S. A., and Walker, R. A A Solution Methodology for 9 Note that the bounding methodology described here would more fully characterize the design space than the one described in [Timmer et al. 1993], for the reasons explained in Section 4.2.

21 Ecient Optimal Design Space Characterization Methodologies 21 Exact Design Space Exploration in a Three Dimensional Design Space. IEEE Transactions on VLSI Systems 5, 1 (March), (to appear). Chaudhuri, S., Walker, R. A., and Mitchell, J. E Analyzing and Exploiting the Structure of the Constraints in the ILP Approach to the Scheduling Problem. IEEE Transactions on VLSI Systems 2, 4 (Dec.), 456{471. Chen, L.-G. and Jeng, L.-G Optimal Module Set and Clock Cycle Selection for DSP Synthesis. In Proc. of 1991 IEEE International Symp. on Circuits and Systems. (Singapore, June ), pp. 2200{2203. IEEE Computer Society Press. Corazao, M., Khalaf, M., Guerra, L., Potkonjak, M., and Rabaey, J. M Instruction Set Mapping for Performance Optimization. In Proc. of the IEEE/ACM International Conference on Computer-Aided Design (Santa Clara, California, Nov ), pp. 518{521. IEEE Computer Society Press. De Micheli, G Synthesis and Optimization of Digital Circuits. McGraw-Hill series in electrical and computer engineering. McGraw-Hill, New York, NY, USA. Gajski, D. D., Vahid, F., Narayan, S., and Gong, J Specication and Design of Embedded Systems. P T R Prentice-Hall, Englewood Clis, NJ 07632, USA. Jain, R., Parker, A. C., and Park, N Module Selection for Pipeline Synthesis. In Proc. of the 25th ACM/IEEE Design Automation Conf. (Anaheim, California, June ), pp. 542{547. IEEE Computer Society Press. Jha, P., Parameswaran, S., and Dutt, N Reclocking for High Level Synthesis. In Proc. of the Asia-South Pacic Conference on Design Automation (ASP-DAC) (Makuhari Messe, Chiba, Japan, Aug. 29-Sept ). IEEE Computer Society Press. McFarland, M. C Reevaluating the Design Space for Register Transfer Hardware Synthesis. In Proc. of the IEEE/ACM International Conference on Computer-Aided Design (Santa Clara, California, Nov ), pp. 262{265. IEEE Computer Society Press. Narayan, S. and Gajski, D. D System Clock Estimation based on Clock Slack Minimization. In Proc. of the European Design Automation Conference (EuroDAC) (Hamburg, Germany, Feb. 1992), pp. 66{71. IEEE Computer Society Press. Paulin, P. G. and Knight, J. P Force Directed Scheduling for the Behavioral Synthesis of ASICs. IEEE Transactions on Computer-Aided Design 8, 6 (June), 661{679. Timmer, A. H., Heijiligers, M. J. M., and Jess, J. A. G Fast System-Level Area- Delay Curve Prediction. In Proc. of 1st APCHDLSA (1993), pp. 198{ Proc. of the European Design Automation Conference (EuroDAC) (Hamburg, Germany, Feb. 1992). IEEE Computer Society Press.

Area. A max. f(t) f(t *) A min

Area. A max. f(t) f(t *) A min Abstract Toward a Practical Methodology for Completely Characterizing the Optimal Design Space One of the most compelling reasons for developing highlevel synthesis systems has been the desire to quickly