THIS paper describes a new algorithm for performance

Size: px

Start display at page:

Download "THIS paper describes a new algorithm for performance"

Wilfrid Jackson
6 years ago
Views:

1 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 5, NO. 2, JUNE Unifiable Scheduling and Allocation for Minimizing System Cycle Time Steve C.-Y. Huang and Wayne H. Wolf, Senior Member, IEEE Abstract This paper describes a new scheduling and allocation algorithm which optimizes a datapath-controller system for clock cycle time. The cycle time of a VLSI system depends not only on the characteristics of the datapath and controller in isolation but also on the interactions between them. A datapath may impose both arrival time constraints on controller inputs and departure time constraints on controller outputs. Late-arriving controller inputs may be generated by complex datapath functions, such as ALU carry-out, while early-departure controller outputs may be required to control slow datapath units. If the controller is not designed taking into account arrival and departure times, it may unnecessarily put control logic on the critical timing path. Our synthesis heuristic, which can be used in conjunction with other scheduling heuristics, identifies critical interactions between datapath and controller and reallocates/reschedules them to reduce system cycle time during high-level synthesis. Experimental results show that a unifiable scheduling and allocation (USA) can substantially improve system cycle time with only small area penalties. Index Terms Allocation, clock period, high-level synthesis, scheduling. I. INTRODUCTION THIS paper describes a new algorithm for performance optimization of datapath-controller systems. The algorithm minimizes the cycle time of the system by analyzing interactions between the datapath and the controller and using scheduling and allocation to avoid creating long critical delay paths which involve both the datapath and the controller. Because most high-level synthesis algorithms consider only the controller or datapath in isolation, this algorithm is an important tool for high-level performance optimization. Fig. 1(a) is a behavioral description to be executed in a control step. If we do not use two multipliers to precompute the ( ) and ( ) operations, we would just need one subtracter and one multiplier to execute the description, as illustrated in Fig. 1(b). Because there are no data dependencies between the subtraction and the multiplication, ideally these two operations could be executed in parallel if we do not consider the controller s effect on the total system cycle time. Suppose that the subtraction has a delay of ns and the multiplication has a delay of ns, then the system cycle Manuscript received September 13, 1994; revised October This work was supported by the National Science Foundation under Grant MIP S. C.-Y. Huang was with the Department of Electrical Engineering, Princeton University, Princeton, NJ USA. He is now with Synopsys, Mountain View, CA USA. W. H. Wolf is with the Department of Electrical Engineering, Princeton University, Princeton, NJ USA. Publisher Item Identifier S (97) Fig. 1. The ideal design: subtraction and multiplication can be executed in parallel. time would be the maximum of and. In this case the estimated cycle time would be, because subtraction usually has a smaller delay than multiplication. However, the controller also plays an important role in determining the system cycle time. In the past, interactions between datapath and controller have not been well considered in system design or estimation. To simplify synthesis, traditional approaches usually treat the controller and datapath separately and cannot take into account their interactions. Although the controller usually occupies less area than datapath components, Fig. 1(b) is still an idealization which oversimplifies the controller s influence on system cycle time. In contrast, Fig. 2 shows how datapath and controller interactions can substantially increase the cycle time. In the controller of Fig. 2, represents the condition, which is a primary input from the datapath into the controller; PS and NS denote present state and next state respectively; and are the controller s output signals, which are /97$ IEEE

2 198 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 5, NO. 2, JUNE 1997 Fig. 2. The worst-case design: controller implementation might introduce undesired dependencies between datapath components Sub and Mul. also the multiplexer select signals for the multiplier component in the datapath. In this configuration, datapath components may impose both arrival time constraints on controller input ( ) and departure time constraints on controller outputs ( and ). The simplified controller can be encoded by a state assignment program JEDI [1] and implemented in a multilevel logic form using SIS [2]. Equation (1) shows the logic functions of and An finite state machine (FSM) output is dependent on input if is a function of.if has no dependency on, is independent of. From (1), we notice that output signals and are dependent on the primary input. Therefore, the functional unit can not select the proper multiplication operands to perform the multiplication until the functional unit is done with the subtraction. In other words, the controller implementation introduces undesired dependencies between datapath components and. The multiplexer (MUX) select signals have to wait for the late-arriving input signals to continue the following datapath operations. As a result, operations and are actually not executed in parallel as predicted in Fig. 1(b). In fact, they would be executed in a serial manner due to the undesired dependencies introduced by the controller implementation as described in Fig. 2. The estimated system cycle time would be ( ) ns rather max( ) ns. Traditional high-level synthesis methods emphasize either the datapath or the controller, but make it difficult to study the interactions between them. As this example shows, datapathcontroller interactions can be subtle yet have a profound influence on the system cycle time. These problems occur at the cycle on which the conditional occurs path-based scheduling algorithms try to share resources across conditionals, but do not consider the interaction between the conditional logic itself with the operations performed immediately afterward. One way to eliminate these long delay paths is to require (1) every controller to be Moore machine. However, that solution introduces unnecessary memory elements. Our scheduling algorithm adds additional clock cycle delays only where required to minimize cycle time. In previous work [3], we demonstrated that in many cases the total time required for a schedule can be reduced by trading cycle time for the number of cycles. Reducing the cycle time of the controller-datapath system requires analyzing the timing of the control signals which flow between the two components. The undesired dependencies from the controller are usually related to the controller structure. Scheduling and allocation affect the design of the controller and its interaction with the datapath, helping to determine the delay of the controllerdatapath system. Therefore, to minimize system cycle time, we must simultaneously schedule the controller and datapath. In many cases, the undesired dependencies can be minimized by generating a unifiable controller structure [3], which we will briefly review in Section II-C. The idea is to minimize the controller output s dependencies on late-arriving input signals from datapath components, so that controller output s departure time can be minimized to continue performing the remaining datapath operations. In this paper, we propose a unifiable scheduling and allocation (USA) method, which is a mixed scheduling and allocation approach, to generate a unifiable controller. This approach is particularly useful when a system has both limited hardware and tight performance constraints other approaches may slow down the circuit unnecessarily. Our ultimate goal is to minimize the unnecessary dependencies between datapath and controller so that system cycle time can be improved with very small to no area overhead. Our heuristics can be used in conjunction with other scheduling and allocation heuristics our presentation of USA will be based on force-directed list scheduling and left-edge allocation. The USA heuristics, when used in conjunction with other heuristics, ensure that system cycle time and controllerdatapath operations will be taken into account during scheduling. The next section reviews previous work in scheduling and allocation and dependency reduction in register-transfer machines. Section III describes our input representation and target architecture. Section IV uses an example to illustrate the ideas in unifiable scheduling and allocation and Section V explains the algorithms in detail. Experimental results are discussed in Section VI. Finally, conclusions are presented in Section VII. II. PREVIOUS WORK Given a behavioral description, we can generate a corresponding internal design representation suitable for further processing. A basic block is defined as a straight-line sequence of operations without branches. A data flow graph (DFG) describes the basic block s data dependencies: a vertex corresponds to an operation and a directed edge corresponds to a variable and its associated data dependency. We refer to a data flow graph for whom a schedule has been found a scheduled data flow graph (SDFG). A control-data flow

3 HUANG AND WOLF: MINIMIZING SYSTEM CYCLE TIME 199 Fig. 3. List scheduling. graph (CDFG) is a DFG with control flows. Scheduling and allocation algorithms usually work on design representations similar to DFG s and CDFG s. We will discuss CDFG s in more detail in Section III. A. Scheduling Scheduling approaches can be divided into time-constrained and resource-constrained scheduling [5], [6]. Linear programming (LP) and integer linear programming (ILP) approaches have been used to solve the general scheduling problem by [7], [8]. However, since both time-and resource-constrained scheduling are NP-complete [9], most synthesis systems use heuristic scheduling methods. Most scheduling algorithms concentrate on scheduling of datapath operations and model delay only as the number of clock cycles required to execute a resource (which may be fractional in the case of chained resources). This simple model ignores the importance of the clock cycle time in determining the total time required to execute a sequence of operations and, as a result, do not consider the influence of control signals and controller delays on the system cycle time. List scheduling, summarized in Fig. 3, keeps a list of ready operations to be scheduled by a priority function. The algorithm schedules the nodes in the data flow graph starting from the sources and working toward the sinks, assigning the operations to a control step Cstep. The prioritize() function chooses which of the currently-ready operations are to be scheduled at the current control step. The exact function used by prioritize() to choose operators for scheduling can be modified to implement different scheduling heuristics: Splicer used the mobility of an operation as the priority function [10]; Cathedral-II [11] and ELF [12] used urgency of an operation as the priority function; The HAL system s forcedirected list scheduling (FDLS) prioritizes based on force computations [13]. We will see in Section V that USA is based on force-directed list scheduling. While the details of the force computations are not essential to an understanding of USA, FDLS s forces model both the uniformity with which operations are allocated to resources over time and the effects of data dependencies on that allocation. USA augments the FDLS force computations with unifiability heuristics to determine which operations will be scheduled at each call to prioritize(). The work which has most closely considered logic delays in scheduling is by Kuehlmann and Bergamaschi, who added to HIS a timing-driven scheduling algorithm based on timing modules and behavioral timing networks [14]. Control flow timing network and data flow timing network were generated separately. Then these two networks were connected to model the interactions between the control flow and the data flow. However, a one-hot encoding model was used to represent each state in the control flow timing network. Onehot encoding usually produces designs which are inefficient in both area and testability; if a different encoding is used for the implementation, the results of the one-hot analysis are invalidated. Also, the delay estimation for scheduling was based on incremental addition of worst case delays on a limited number of timing modules. The IBM high-level synthesis system HIS used a pathbased scheduling algorithm to minimize the number of control steps under given constraints such as timing and area [15]. Wakabayashi and Yoshimura [16] developed a different algorithm with similar optimization goals. These algorithms schedule operations on the legs of a conditional to maximize the commonality of the datapaths required for the conditionals, minimizing hardware requirements. These algorithms do not change the scheduling of operations across the conditionals. And as we will see in Section IV, some of the decisions made by these algorithms to minimize hardware cost have the side effect of decreasing system performance. Parallel compiler optimizations, such as program dependency graph (PDG) operations [17], can be used to reorder operations in nested conditionals. Other research into scheduling has attacked different aspects of the problem. Ku and De Micheli used a relative scheduling formulation to support operations with fixed and unbounded delays [18]. It emphasized interface synthesis whose delays are unknown at compile time but did not consider the interaction between datapath and controller delays. In controller synthesis, Nestor and Tamas presented a heuristic algorithm that exploits

4 200 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 5, NO. 2, JUNE 1997 scheduling freedom to minimize the controller area [19]. Delay was not addressed in that work. They used Espresso [20] as a predictor, it was difficult to deduce dependency properties of the logic implementation from the schedule in this approach. Chaudhuri et al. [21] developed an algorithm which selects a clock cycle time during scheduling and allocation to optimize performance. Their work considers only datapath delays, not controller contributions to clock cycle time. In the scheduling approaches discussed above, arrival/ departure timing constraints between the datapath and controller have not been taken into account to minimize dependencies on late-arriving signals and thus to improve system cycle time. The first scheduling work which considered this kind of controller-datapath interactions was MDS, which applied the minimal dependency concept to schedule controller events [3]. This work showed the first cycle time-number of cycles tradeoff. By increasing the number of cycles while reducing the clock period, the algorithm could in many cases reduced the total execution time (cycle time number of cycles) required for a computation. B. Allocation In general, allocation approaches can be categorized as decomposition, greedy constructive, or iterative refinement approaches [5]. Most previous work in allocation has concentrated on optimizing some combination of datapath resources, registers, functional units, and interconnect. The REAL system introduced the left-edge algorithm, similar to the left-edge algorithm used in channel routing, for register allocation [22]. The left-edge algorithm considers the variables in order of the time step in which they are first defined. A register is considered available at Cstep if the last variable which was assigned to it has ended its lifetime at. Variables are assigned to registers greedily previously defined and available registers are used first, then new registers are created. This greedy algorithm is optimal for acyclic scheduled data flow graphs, as shown by Kurdahi and Parker. As we will see in Section V, USA uses a modification of the left-edge algorithm, in which a variable is allocated to a register only if it surpasses a unifiability threshold. The EMUCS systems used a global selection criterion to allocate the next element for minimal number of registers, modules and multiplexers [23]. The STAR package used branch and bound search for subtask space and performs a constructive binding followed by an iterative refinement for minimal hardware resources [24]. The OAS synthesizer applied integer programming model for scheduling and allocation for embedded VLSI chips [25]. A method for resource sharing and minimal control sequence based on condition vector has been presented in Cyber [26]. These approaches did not try to predict the controller structure. As a result, allocation decisions could have adverse affects on the controller s delay and, therefore the entire cycle time of the entire controller-datapath system. MCD, developed by Huang and Wolf, first applied the minimal dependency concept in datapath allocation to reduce the controller delay on the critical path [27]. TABLE I FSM-1, z0 IS DISTINCT FOR S1 AND z1 IS UNIFIABLE FOR S1 C. Controller Unifiability and Minimum Dependency In earlier work, we introduced the unifiability optimization, which uses don t-care conditions in the controller to eliminate unnecessary dependencies of primary outputs on primary inputs. We developed an algorithm for minimum-dependency state encoding [28]. For state in a Mealy machine, ifthe value of output has at least one 1 and at least one 0 on the transitions out of, we say that output is distinct for state. An output is unifiable for state if is not distinct for state. For instance, if the value of is either 1 or don tcare in transitions with state as the source state, then is unifiable for. is called a unifiable controller if every primary output is unifiable for each state. We want to build a unifiable controller with respect to late-arriving input signals, but not necessarily to all input signals. This is different from building a Moore machine in which output values depend only on the present state. For simplicity, we will assume that all the input signals to the finite state machine are late-arriving signals throughout this paper. Therefore, having a unifiable output corresponds to having a unifiable output with respect to late-arriving input signals in this context. For example, for FSM-1 in Table I, there are two transitions associated with. One transition specifies that binary output is 1 when input is 0, where PO in Table I means primary output. The other transition indicates that becomes 0 when input is 1. Therefore is distinct (not unifiable) for state. However, is unifiable for, because is either 0 or don t-care for the two transitions associated with. Suppose that is a latearriving input signal and other input signals are not shown for clarity. If we assign the don t-care as 0, then s value is unified to be 0 for branch state with respect to late-arriving signal. A unifiable output s logic function can eventually be independent of the late primary input. For instance, if we treat the symbolic present state input as an input in addition to primary input and assign the don t-care in the last row of Table I to be 0, we can write s function as independent of. However, the nonunifiable output s logic function will depend on primary input. We developed the PDS algorithm [28] to perform this minimum-dependency-driven don t-care assignment. After generating the symbolic state transition table, we encoded the states FSM-1 using standard methods and implemented it in multilevel logic form using scripts in SIS [2]. (SIS is a sequential logic optimization system which includes features from the combinational optimization system misii.) The relationship between unifiability and dependency is illustrated by the equations, where

5 HUANG AND WOLF: MINIMIZING SYSTEM CYCLE TIME 201 Fig. 4. (a) (b) (a) Part of a CDFG derived from Sehwa and (b) an annotated CDFG. is a binary present state variable arriving around time zero. The unifiability of a binary controller output can be verified by performing XOR operation on its care output values with respect to branch states. That is, we need only look at the ones and zeros of a controller output values at conditional states. If the XOR result is zero, then the output values must be either all ones or all zeros in addition to don t-care values. In this case, the output is unifiable and it will lead to a minimal dependency structure. On the other hand, if the XOR result is one, the output is not unifiable and the resulting controller structure is not desirable. Scheduling for minimal dependency and allocation for minimal dependency have been applied separately in our previous work [3], [27]. In this paper, we present a mixed scheduling and allocation approach and use a symbolic PO controller rather than a binary PO controller to investigate unifiability even earlier in the processing stage. III. INPUT REPRESENTATION AND TARGET ARCHITECTURE A control data flow graph (CDFG) is a representation for a behavioral description with conditional branches. The vertices in a CDFG can be categorized as operation node, conditional node, and join node. Operation node denotes functional unit operation. Conditional node specifies the beginning of the operations following a branch condition, and join node denotes the end of the conditional operations. Variable in a CDFG is associated with the edge between vertices. Usually is a temporary value to be stored in a register or other storage unit. For instance, Fig. 4(a) is part of a CDFG derived from Sehwa [29]. In Fig. 4(a), 7 and 8 are operation nodes; is a conditional node;, and are join nodes, and and are variables. Conditional nodes and join nodes are important features which distinguish a CDFG from a data flow graph (DFG). For example in Fig. 4(a), conditional node and join node imply that operations 7 and 8 are mutually exclusive due to the conditional branch introduced. However, some CDFG benchmark representations do not give specific detail of the operation involved with conditional Fig. 5. A system with a nonpipelined controller. nodes. For instance, it is not clear to us what kind of operations are associated with the conditional node in Fig. 4(a). By simply looking at the join nodes and above the conditional node, it is also difficult to determine what operands are associated with the operations following the conditional node. To overcome the problem, some previous work assumed that the conditional signal is an external signal generated from some operation not specified in the CDFG [30], [31]. Because our goal is to generate both the datapath and the controller for a given CDFG benchmark, we would like to attach some more information to the conditional node operation. Fig. 4(b) is an annotated CDFG derived from Fig. 4(a). In Fig. 4(b), a?5 node is used to denote the conditional node operation, where in this case it is a chained subtraction and comparison operation. The left branch is executed when the condition is false. The right branch is executed when the condition is true. The left and right operands will flow to the operation following the conditional node accordingly. With the modified configuration, we will be able to generate a consistent datapath and controller for a given CDFG. Our target architecture is a configuration with a nonpipelined controller (as shown in Fig. 5) and a multiplexer-based datapath [5]. The dotted lines in Fig. 5 denote the signal communication between the controller and the datapath. Multiplexers are used at the inputs of both registers and functional units. We will discuss the extensions required to synthesize pipelined and multiple FSM controllers in Section VII. Given an unscheduled CDFG, the synthesis goal is to find a scheduling and allocation method to minimize the system cycle time, including both the datapath and the controller. Minimal overhead in the number of control steps and area should be achieved. The cycle time and area are to be measured after the technology mapping of the logic design. IV. AN EXAMPLE USA is a mixed scheduling and allocation algorithm because it uses a model to predict the resulting allocation configuration in the course of the unifiable scheduling. We will use an example to introduce our unifiable scheduling and allocation

6 202 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 5, NO. 2, JUNE 1997 Fig. 6. A nonunifiable allocated SCDFG. (USA) approach in this section. Section V will describe the USA algorithm in detail. A scheduled CDFG (SCDFG) is a CDFG where each operation has been assigned to a specific control step. Fig. 6 is part of an example SCDFG. For simplicity, we only show the conditional branch part in detail. In the following discussion, we will first assume that there is only one adder available to schedule and allocate the whole CDFG, and inputs, and are stored separately. In Fig. 6, both and are bound to by conditional resource sharing. Because Fig. 6 also includes some allocation information, we call it an allocated SCDFG to differentiate that from a traditional SCDFG. The figure also shows a partial state transition table for the controller around the conditional. and are the left operand and the right operand of, respectively. The figure also shows the symbolic PO controller in which the primary outputs are still in symbolic form. The concept is similar to the symbolic control table in Bridge [32]. According to Fig. 6, if is false, will be executed. Therefore in the state transition table, and will have and as the input, respectively. This is shown in the first row of state transition where primary input is 0 and present state is. Similarly if is true, and will become the inputs to. Fig. 6 also shows the multiplexer configuration for. The signal is used to select the appropriate input to. Similarly, is for. The controller for this datapath operates such that, when is false, and will become 0 to select and into and accordingly. Therefore, ( ) is performed on. Similarly, if is true, ( ) is performed on. In Section II-C, we discussed the concept of a unifiable controller. This scheduling and allocation shown in Fig. 6 produces a nonunifiable controller and

7 HUANG AND WOLF: MINIMIZING SYSTEM CYCLE TIME 203 Fig. 7. A unifiable allocated design from USA. are not unifiable for with respect to the controller input. We can write formulas and, where present state variables are denoted in a symbolic form. The primary input of the controller is, which is a late-arriving input from the datapath into the controller. The multiplexer control outputs have undesired dependencies on the late arriving input. Therefore, it will slow down the operation to select the appropriate operands for the functional unit, and hence delay the addition operation. The total system cycle time will potentially be increased. The concept is similar to that of Fig. 2 in Section I. If we use the symbolic PO table in Fig. 6 to investigate why the nonunifiable allocation of the SCDFG creates a nonunifiable controller, we will realize the reason is that and require different input operands for different primary input conditions with respect to a branch condition ( and in this case). Therefore, MUX select signals generate a unique value for every specific input operand in the binary PO controller. In this case, at least one distinct (i.e., nonunifiable) binary primary output is produced, as shown in the binary controller. The discussion above explains the relationship between a unifiable allocated SCDFG, a unifiable symbolic PO controller and a unifiable binary PO controller. We define a few useful terms. For a symbolic PO controller, if the value of symbolic primary output has at least two different symbolic inputs on the transitions out of, we say that symbolic output is distinct for state. An symbolic output is unifiable for state if is not distinct for state. is called a unifiable symbolic PO controller if every symbolic output is unifiable for each state. A unifiable symbolic PO controller generates only unifiable binary PO controller. A unifiable allocated SCDFG is an allocated SCDFG which generates a unifiable controller. As we can see, unifiability can be determined by checking the existence of a unifiable allocated SCDFG or a unifiable symbolic PO controller, which can be detected earlier in the processing stage than using a binary PO controller. A binary PO controller cannot be generated until the multiplexer tree for interconnection is built. In Fig. 7, the operation has been rescheduled to control step. As a result, the symbolic controller symbolic is unifiable. The multiplexer configuration is the same as in Fig. 6. However, the corresponding binary PO controller becomes unifiable because the allocation of operations to function units has changed. Simultaneous consideration of scheduling effects on the controller and allocation effects on the datapath were required to synthesize a high-performance implementation. V. ALGORITHMS Having explained the idea of unifiable scheduling and allocation, we will describe the algorithms in detail. The unifiable concept can be embedded in existing scheduling and allocation methods. Here some well-known heuristic approaches will be chosen as the foundation of the USA program. The scheduling part of USA is based on the force-directed list scheduling (FDLS) method and left-edge allocation algorithm, both of which were described in Section II. USA could be adapted to other scheduling methods as well we illustrate the USA heuristics using these well-known algorithms for specificity. The unifiability of an operation node or a variable can be determined by the resulting controller structure s unifiability, which is related to both scheduling and allocation.

8 204 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 5, NO. 2, JUNE 1997 Fig. 8. USA s approach to assigning a ready operation list. USA defines constants unifiability, high, and low can be real numbers from zero to one where high low. The values of these constants can be set by the user to guide synthesis. We index the values high and low with subscripts to differentiate various allocated SCDFG s according to the cost of preserving controller unifiability. A threshold value is used in the functions unifiable_allocated_scdfg() of Fig. 8 and allocate_variable_list() of Fig. 10. If threshold low, USA acts like the force-directed list scheduling with left-edge allocation. If high threshold low, USA performs unifiable scheduling and allocation. Unlike FDLS, USA tries to allocate the current operation only if its unifiability passes the threshold. The thresholds are also computed such that, when scheduling outside the range at which unifiability affects clock period, the algorithm makes the same choices as FDLS or left-edge algorithm. In USA, operations with lower forces and higher unifiability are scheduled first. Fig. 8 describes USA s assign_ready_list() function, which performs the selection operation of FDLS. The function first sorts the ready list readylist by force value, Fig. 9. Part of a unifiable allocated SCDFG. then determines s unifiability; the unifiability value, in turn, determines whether the operation will be scheduled on this control step or the next. There are several possible cases: 1) is executed unconditionally (the 1 case of Fig. 9), in which case it is given a high unifiability value; 2) is executed conditionally but another resource of the proper type is available in this control step so that s unifiability is set high;

9 HUANG AND WOLF: MINIMIZING SYSTEM CYCLE TIME 205 Fig. 10. USA s approach to assigning a variable list. 3) is conditionally executed and there are no unused resources available on this cycle (see Fig. 7), in which case s unifiability is set to low. The second and third cases require the algorithm to recheck the available resources to find another resource of the same type which is not involved in the conditional. The function unifiable_allocated_scdfg() actually allocates to the control step. But unlike standard FDLS, it schedules only if s unifiability is above threshold. If not, as in case 3 above, will be deferred to a later schedule in order to make the control signals unifiable. The complexity of the list scheduling with conditional resource sharing is where is the number of operation nodes in a CDFG. The while loops in the functions Listscheduling() and assign_ready_list() run at most iterations, respectively. The mutual-exclusivity test in the function assign_ready_list() takes at most comparisons. The complexity of USA s function assign_ready_list() in Fig. 8 is the same as the corresponding operation in FDLS. Because USA has the same structure as FDLS, its complexity is as well. Register allocation in USA starts with lifetime analysis of the variables in SCDFG. It is based on the left-edge algorithm, which performs register allocation in polynomial time by using the channel-routing analogy of packing horizontal segments into as few tracks as possible [22]. USA modifies the leftedge algorithm and allocates variables with high unifiability to the same register. USA s allocation algorithm is shown in Fig. 10. The register allocation algorithm of USA in Fig. 10 is similar to the standard left-edge allocation algorithm; however, register_unifiability() takes unifiability into account in TABLE II DESIGN CHARACTERISTICS OF SEVERAL BENCHMARKS deciding when to allocate a variable. To determine the unifiability of a variable, the function first checks if is mutually-exclusive with the set of variables assigned to the current register. If not, it implies that would be associated with a unconditional operation and a unifiable allocated SCDFG could be preserved. In this case, will be assigned a high unifiability value and be assigned to if and have nonoverlapping lifetime intervals. On the other hand, if is mutually-exclusive with, register_unifiability() will check if the operation node preceding and operation nodes preceding are operations next to conditional nodes. If yes, USA will assign a low unifiability value to prevent from being assigned to if and are of different functional unit types. This step tries to avoid creating a nonunifiable allocated SCDFG. However, if and are of the same type and same resource-id, a high unifiability will be assigned to. register_unifiability() is called by the outer loop of the register allocation algorithm; it has a user-definable threshold

10 206 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 5, NO. 2, JUNE 1997 TABLE III DETAILED RESULTS ON THE BENCHMARKS which determines how aggressively register allocation takes unifiability into account. If the threshold is set very low, allocation operates exactly like the left-edge algorithm; as the threshold is raised, more decisions are made which take unifiability into account. Left-edge allocation, while optimal for pure CFG s, is not optimal for specifications with conditionals. We have used leftedge allocation as the basis of our discussion and experiments due to its simplicity and clarity. As we have mentioned, the unifiability heuristic can be adapted to other allocation algorithms as well. The complexity of our USA extension to the left-edge algorithm is the same as for standard left-edge allocation:, where is the number of variables in the CDFG and is the number of variables allocated. The variables can be sorted in time, The mutual-exclusivity test in the function register_sharable() takes at most comparisons, since there will be at most registers created during allocation. VI. EXPERIMENTAL RESULTS A. Generation of Interconnected Controller and Datapath Given a CDFG, USA can generate its corresponding allocated SCDFG and a system netlist consisting of interconnected controller and datapath. USA collects the allocated SCDFG for each control step and updates the symbolic PO controller according to all possible transitions in every control step. Multiplexer binding for functional units and registers is then created based on the symbolic PO controller; a point-to-point model for interconnection is used in this case. A balanced tree consisting of two-input multiplexers are generated for minimal area and balanced delay distribution for all multiplexer inputs. The inputs to the controller are the output values of the conditional operations in the allocated SCDFG. USA generates MUX-select signals for functional units and registers and load signals for registers as the primary outputs of the controller. The final controller is in KISS format [33] acceptable by JEDI [1]. The datapath part is generated based on PDL++ type input format [34]. USA then combines the controller and datapath in blif format to be optimized by SIS [2]. B. Comparison of Results We compare the results from USA with those from traditional force-directed list scheduling and left-edge algorithms. We call the latter as the BASE algorithm in the following discussion. The benchmarks we used are from Sehwa [29], Kim [35], Maha [36], and HAL [37]. They all have conditional operations. Given an unscheduled CDFG and optional resource constraints, we generate the allocated SCDFG s and the corresponding controller and datapath netlists for both USA and BASE. Scripts script.rugged and script.delay in SIS are used to perform logic optimization. The system cycle time are measured with library model after technology mapping using MCNC libraries [33]. Table II shows the number of control steps and functional units used in both USA and BASE methods. Basically, USA can usually use the same amount of functional units to generate a unifiable allocated SCDFG with minimal control steps, whereas BASE generates nonunifiable controllers in all cases. Table III describes some other statistics in detail. denotes the total number in bits of the MUX select lines for functional units. Similarly, is for register s MUX select lines. is the total number of 2-to- 1 multiplexers in the balanced two-input multiplexer tree for the functional unit. Similarly for. For these test cases, USA tends to use more multiplexers than BASE. We think that is because USA would usually generate a more balanced allocated SCDFG to available resources, whereas traditional resource sharing tends to be greedy and happens to use a few less multiplexers in these cases. For instance, suppose we have five inputs to bind to two resources and. USA would generate a binding with three inputs to and two inputs to, whereas BASE would produce a binding with four inputs to and one input to. In the former case, the total bit width of MUX select lines needed would be ; for the latter case, it would be. However, we think that the multiplexer overhead can be reduced and we will see later that the cycle time improvement is more significant than the area overhead. We extract the cycle time (in nanoseconds) after technology mapping and define delay ratio as the ratio of (USA/BASE). Similarly area ratio is defined as the (USA/BASE) ratio in lambda area using MCNC standard cell libraries. Fig. 11 shows the mean delay and area ratio for the test cases above when bit width of the resources equal to 4, 8, and 16, respectively. According to the experimental data, the delay improvement increases as bit width increases. We think it contributes to the fact that as bit width increases, the datapath

11 HUANG AND WOLF: MINIMIZING SYSTEM CYCLE TIME 207 Fig. 11. Average delay ratio and area ratio as a function of bit width. Fig. 12. Delay ratio and area ratio comparison. functional units take more time to generate control signals for the controller. Therefore, for a nonunifiable controller with unnecessary dependencies, it further delays the generation of BASE controller s primary output signals to control the remaining operations in the datapath. Fig. 11 also shows that delay improvement ratio is larger than area overhead ratio for in all cases. The area and delay ratio data are shown in Fig. 12 for bit width 4, 8, and 16. Two lines, delay ratio 1 and area ratio 1, are also shown as reference lines. If a datum point is toward the origin (0, 0), it means that USA achieves both cycle time improvement and area reduction for that case. The data distribution shows that in most cases, USA produces cycle time improvement with small to no area overhead.

12 208 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 5, NO. 2, JUNE 1997 Fig. 13. The Branch CDFG. TABLE IV DETAILED MULTICYCLING RESULTS ON THE BRANCH CDFG C. Multicycled Operations So far it has been assumed that each operation node in the CDFG takes one control step to be executed. In fact, operations with long execution time can be split into more than one cycle to further reduce system cycle time. This technique is called multicycling. USA can handle multicycled operations for a unifiable allocated SCDFG. The basic algorithms are the same except that lifetime analysis for resources needs to be updated accordingly [38]. Fig. 13 is a Branch CDFG example derived from the Cond CDFG in Section VI-B. This CDFG will be used to compare experimental results on multicycling. and are hardwired constants. The table compares variations on the design: Branch- is the same as the Branch CDFG except that the multiplier is multicycled over two cycles; Branchuses two multicycled multipliers instead of one, meaning that Branch- needs one more control step than Branch-. Fig. 14 shows the results from the BASE and the USA approaches. Now the multiplication operations span two cycles. Conditional resource sharing are applied for and. Because there is only one adder available, USA will defer the 2 operation to create a unifiable allocated SCDFG. Table IV shows some statistics in detail when bit width 4. USA has smaller system cycle time in all cases. There are two cases where USA achieves area reduction as well. VII. CONCLUSIONS A mixed scheduling and allocation approach is presented in this paper to generate a unifiable allocated scheduled control data flow graph. USA takes a control data flow graph and optional resource constraints and applies the unifiable scheduling and allocation approach to generate the controller and datapath netlist for the entire system. Experimental results show that by generating a unifiable allocated SCDFG, unnecessary dependencies between datapath and controller can be reduced, and hence most of the datapath components can be executed in parallel rather than in serial with respect to conditional operations. The system cycle time is hence improved with small to no area overhead. In our target architecture, we assume that the controller architecture is a single FSM. For a system with multiple FSM s, USA can be used to produce a unifiable structure for each FSM. By minimizing the dependencies between FSM s and its corresponding datapath operations, the entire system will have minimal dependencies as well. As for a pipelined datapath, the original datapath operation will be divided into several stages with storage units between them. In this case, USA will check the unifiability with respect to the branch state associated with the first stage of the pipeline operations. A unifiable allocated SCDFG can hence be generated.

13 HUANG AND WOLF: MINIMIZING SYSTEM CYCLE TIME 209 (a) (b) Fig. 14. Results using two multicycled multipliers from (a) the BASE approach and (b) the USA approach. The unifiable concept can be extended to system cycle time estimation and be applied to a synthesis system with cycle time constraints in addition to resource and control step constraints. Power consumption can also be reduced by scaling the voltage due to minimized cycle time [39]. ACKNOWLEDGMENT The authors would like to thank the anonymous reviewers for their helpful comments. REFERENCES [1] B. Lin and A. R. Newton, Synthesis of multiple level logic from symbolic high-level description languages, in Proc. VLSI Conf., Munich, Germany, Aug [2] E. M. Sentovich, K. J. Singh, C. Moon, H. Savoj, R. K. Brayton, and A. Sangiovanni-Vincentelli, Sequential circuit design using synthesis and optimization, in Proc. ICCD-92, Oct. 1992, pp [3] S. C.-Y. Huang and W. H. Wolf, Scheduling for minimal dependency in FSM s, in Proc. ICCAD-93, Nov. 1993, pp [4] E. A. Snow, Automation of Module Set Independent Register-Transfer Level Design, Ph.D. dissertation, Carnegie-Mellon Univ., Pittsburgh, PA, Apr [5] D. Gajski, A. Wu, N. Dutt, and S. Lin, High-Level Synthesis: Introduction to Chip and System Design. Boston, MA: Kluwer Academic, [6] G. De Micheli, Synthesis and Optimization of Digital Circuits. New York: McGraw-Hill, [7] C. A. Papachristou and H. Honuk, A linear program driven scheduling and allocation method, in Proc. DAC-90, 1990, pp [8] C.-T. Hwang, J.-H. Lee, and Y.-C. Hsu, A formal approach to the scheduling problem in high-level synthesis, IEEE Trans. Computer- Aided Design, vol. 10, pp , Apr [9] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. San Francisco, CA: W. H. Freeman, [10] B. M. Pangrle, A behavior compiler for intelligent silicon compilation, Ph.D. dissertation, Univ. Illinois at Urbana-Champaign, [11] H. DeMan, J. Rabaey, P. Six, and L. Claesen, Cathedral-II: A silicon compiler for digital signal processing, IEEE Design & Test, vol. 3, pp , Dec [12] E. F. Girczyc, R. J. Buhr, and J. P. Knight, Application of a subset of ADA as an algorithmic hardware description language for graph-based hardware compilation, IEEE Trans. Computer-Aided Design, vol. 4, pp , Apr [13] P. G. Paulin and J. P. Knight, Force-directed scheduling for the behavioral synthesis of ASIC s, IEEE Trans. Computer-Aided Design, vol. 8, pp , June [14] A. Kuehlmann and R. A. Bergamaschi, Timing analysis in high-level synthesis, in Proc. ICCAD-92, 1992, pp [15] R. Camposano, Path scheduling for synthesis, IEEE Trans. Computer- Aided Design, vol. 10, pp , Jan [16] K. Wakabayashi and T. Yoshimura, A resource sharing and control synthesis method for conditional branches, in Proc. ICCAD-89, IEEE Computer Society Press, 1989, pp [17] J. Ferrante, K. J. Ottenstein, and J. D. Warren, The program dependence graph and its use in optimization, ACM Trans. Programming Languages and Syst.,, vol. 9, no. 3, pp , July [18] D. Ku and G. De Micheli, Relative scheduling under timing constraints: Algorithms for high-level synthesis of digital circuits, IEEE Trans. Computer-Aided Design, vol. 11, pp , June [19] J. Nestor and V. Tamas, Exploiting scheduling freedom in controller synthesis, in Proc. Sixth Int. Symp. High-Level Synth., Nov. 1992, pp [20] R. Rudell and A. L. Sangiovanni-Vincentelli, ESPRESSO-MV: Algorithms for multivalued logic minimization, in Proc. Custom Integr. Circ. Conf., Portland, OR, May [21] S. Chaudhuri, S. A. Blythe, and R. A. Walker, An exact methodology for scheduling in a 3d design space, in Proc., 8th Int. Symp. Sys. Synth., IEEE Computer Society Press, 1995, pp [22] F. J. Kurdahi and A. C. Parker, REAL: A program for register allocation, in Proc. DAC-87, June 1987, pp [23] D. E. Thomas, E. D. Lagnese, R. A. Walker, J. A. Nestor, J. V. Rajan, and R. L. Blackburn, Algorithmic and Register-Transfer Level Synthesis: The System Architect s Workbench. Boston, MA: Kluwer Academic, [24] F.-S. Tsai and Y.-C. Hsu, STAR: An automatic data path allocator, IEEE Trans. Computer-Aided Design, vol. 11, pp , Sept [25] C. H. Gebotys, Optimal scheduling and allocation of embedded VLSI chips, in Proc. DAC-92, June 1992, pp [26] K. Wakabayashi, Cyber: High level synthesis system from software into ASIC, High-Level VLSI Synthesis, R. Camposano and W. H. Wolf, Eds. Kluwer Academic Publishers, [27] S. C.-Y. Huang and W. H. Wolf, How datapath allocation affects controller delay, in Proc. Seventh Int. Symp. High-Level Synth., IEEE Computer Society Press, May 1994, pp [28], Performance-driven synthesis in controller-datapath systems, IEEE Trans. VLSI Syst., vol. 2, pp , Mar [29] N. Park and A. C. Parker, Sehwa: A software package for synthesis of pipelines from behavioral specification, IEEE Trans. Computer-Aided Design, pp , Mar [30] J. J. Kim, F. J. Kurdahi, and N. Park, Automatic synthesis of timestationary controllers for pipelined data paths, in Proc. ICCAD-91, Nov. 1991, pp [31] T.-C. Lee, N. K. Jha, and W. H. Wolf, A conditional resource sharing method for behavioral synthesis of highly testable data paths, in Int. Test Conf., Baltimore, MD, Oct [32] C.-J. Tseng, R.-S. Wei, S. G. Rothweiler, M. M. Tong, and A. K. Bose, Bridge: A versatile behavioral synthesis system, in Proc. DAC-88, 1988, pp [33] R. Lisanke, Logic synthesis benchmark circuits, in Int. Workshop on Logic Synthesis, May [34] R. J. Lipton, D. N. Serpanos, and W. H. Wolf, PDL++: An optimizing generator language for register-transfer design, in Proc. ISCAS-90, May 1990, IEEE Circuits and Systems Society, pp [35] T. Kim, J. W. S. Liu, and C. L. Liu, A scheduling algorithm for conditional resource sharing, in Proc. ICCAD-91, Nov. 1991, pp [36] A. C. Parker, J. T. Pizarro, and M. Mlinar, MAHA: A program for datapath synthesis, in Proc. DAC-86, June 1986, pp [37] P. G. Paulin, Global scheduling and allocation algorithms in the HAL system, in High-Level VLSI Synthesis, R. Camposano and W. H. Wolf, Eds. New York: Kluwer Academic, [38] S. Bhatia and N. K. Jha. Behavioral synthesis for hierarchical testability of controller/datapath circuits with conditional branches, in Proc. ICCD-94, Sept [39] D. Liu and C. Svensson, Trading speed for low power by choice of supply and threshold voltages, IEEE J. Solid-State Circuits, vol. 28, Jan

How Datapath Allocation Affects Controller Delay

How Datapath Allocation Affects Controller Delay Steve C.-Y. Huang and Wayne H. Wolf Dept. of Electrical Engineering Princeton University Princeton, NJ 08544 Abstract We present in this paper an allocation