How Datapath Allocation Affects Controller Delay


Steve C.-Y. Huang and Wayne H. Wolf
Dept. of Electrical Engineering, Princeton University, Princeton, NJ

Abstract

We present an allocation approach that considers the controller's effect on system delay in order to minimize the system cycle time. Most allocation algorithms and conditional resource sharing methods emphasize a minimum number of resources or minimum area. Previous work has not modeled the structure of the resulting controller or its contribution to the system delay in a controller/datapath system. Our allocation method generates a controller with minimum delay on the system's critical path. The resulting system cycle time is therefore shorter than with other allocation approaches.

1 Introduction

Many systems are built from a datapath and a controller. The system cycle time depends on the interactions between the two. A datapath may impose both arrival times on controller inputs and departure times on controller outputs. Late-arriving controller inputs may be generated by complex datapath functions, such as an ALU carry-out, while early-departing controller outputs may be required to control slow datapath units. If the controller is designed without taking arrival and departure times into account, it may unnecessarily put control logic on the critical timing path.

Unit binding in the allocation process affects not only the datapath configuration but also the controller structure. Our allocation approach builds the resulting controller structure at the same time, so that the controller ends up with minimum delay on the system critical path. In a previous paper [1], we introduced unifiability as a method for reducing controller delay during the scheduling process. However, that algorithm operated only on the controller, considering only the 0/1 values of its primary outputs.
This paper shows how to choose datapath allocations that make controller signals unifiable: allocation choices determine controller unifiability, which in turn determines system cycle time. In the following discussion, we first review previous allocation work in Section 2. Our allocation approaches are presented in Sections 3 and 4. Experimental results and conclusions are given in Sections 5 and 6.

2 Review of Allocation Approaches

Allocation approaches can be categorized as decomposition approaches, greedy constructive approaches, and iterative refinement approaches [2]. The REAL program uses lifetime analysis and the greedy left-edge algorithm for register allocation, which uses the minimum number of registers for acyclic scheduled data flow graphs (SDFGs) [3]. The EMUCS system uses a global selection criterion to allocate the next element so as to minimize the number of registers, modules, and multiplexers [4]. The STAR package uses a branch-and-bound search over the subtask space and performs a constructive binding followed by an iterative refinement to minimize hardware resources [5]. The OAS synthesizer uses an integer programming model for scheduling and allocation of embedded VLSI chips [6].

These approaches minimize the number of resources and the interconnection complexity, but they do not try to predict the controller structure. The controller structure, and the delay it adds to the combined datapath and controller configuration, is therefore not known during resource binding. In this paper, we take the interaction between datapath and controller into account and propose an allocation approach that constructs a controller structure with minimum delay on the system critical path.

3 Dependency-Driven Allocation

Controller implementation may have a significant effect on the system cycle time. It may lengthen the critical path and delay the execution of datapath operations.
Minimum-controller-delay allocation is an allocation method that results in a controller implementation with minimum delay on the existing critical path of the controller-datapath system, and hence a smaller system cycle time. To find such an allocation method, we consider several resource binding heuristics in the subsections below. Simple examples are given first. Experimental results based on published benchmarks are discussed in Section 5.

3.1 Unifiability and Dependency

Unifiability uses don't-care conditions in the controller to eliminate dependencies of primary outputs on primary inputs. The concept of minimum dependency has been applied in scheduling [1] and encoding [8]. We first introduce some terms before explaining our allocation approach.

An FSM output $z_j$ is dependent on input $x_k$ if $z_j$ is a function of $x_k$. If $z_j$ has no dependence on $x_k$, $z_j$ is independent of $x_k$. For state $S_i$ in a Mealy machine $M$, if the value of output $z_j$ has at least one 1 and at least one 0 on the transitions out of $S_i$, we say that output $z_j$ is distinct for state $S_i$. An output $z_j$ is unifiable for state $S_i$ if $z_j$ is not distinct for state $S_i$. For instance, if the value of $z_j$ is either 0 or don't-care on all transitions with $S_i$ as the source state, then $z_j$ is unifiable for $S_i$.

Table 1: FSM-0. z0 is distinct for S1 and z1 is unifiable for S1.

  primary input   present state   next state   primary output
  x0                                           z0 z1
  -               S0              S1           1  0
  0               S1              S2           1  1
  1               S1              S3           0  -

Figure 1(a), scheduled description:

  /* c = (v8 - v9 > 0) */
  cycle 1:  w0 = v0 + v1;   w1 = v2 + v3;
  cycle 2:  w2 = v4 + v5;   w3 = v6 + v7;
  cycle 3:  if (v8 - v9 > 0) { w4 = v8 + v9; } else { w5 = v10 + v11; }

Figure 1(b), multiplexer inputs: mux-0 (v0, v4, v8, v10) and mux-1 (v1, v5, v9, v11) feed adder-0, which produces w0, w2, w4, and w5; mux-2 (v2, v6) and mux-3 (v3, v7) feed adder-1, which produces w1 and w3.

For example, in FSM-0 there are two transitions out of S1. One transition specifies that z0 is 1 when input x0 is 0. The other specifies that z0 is 0 when input x0 is 1. Therefore z0 is distinct (not unifiable) for state S1. However, z1 is unifiable for S1, because z1 is either 1 or don't-care on the two transitions out of S1. If we assign the don't-care to 1, the value of z1 is unified to 1. A unifiable output's logic function can be made independent of the primary input.
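The definitions of distinct and unifiable outputs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the row encoding, the name `classify_outputs`, and the use of `'-'` for a don't-care are choices made here, not part of the paper's tools, and the FSM-0 rows are the ones read from Table 1.

```python
from typing import Dict, List, Tuple

# (primary input x0, present state, next state, outputs "z0 z1").
# '-' marks a don't-care. Rows follow Table 1 (FSM-0) as read here.
Transition = Tuple[str, str, str, str]

FSM0: List[Transition] = [
    ("-", "S0", "S1", "10"),
    ("0", "S1", "S2", "11"),
    ("1", "S1", "S3", "0-"),
]


def classify_outputs(fsm: List[Transition], state: str) -> Dict[int, str]:
    """Label each output bit 'distinct' or 'unifiable' for the given state."""
    rows = [out for (_, ps, _, out) in fsm if ps == state]
    labels: Dict[int, str] = {}
    for j in range(len(rows[0])):
        # Care values of output j on the transitions leaving `state`.
        cares = {out[j] for out in rows} - {"-"}
        # Distinct means both a 0 and a 1 appear; anything else is unifiable.
        labels[j] = "distinct" if cares == {"0", "1"} else "unifiable"
    return labels


print(classify_outputs(FSM0, "S1"))  # {0: 'distinct', 1: 'unifiable'}
```

Output index 0 stands for z0 and index 1 for z1. Once the don't-care of a unifiable output is assigned its single care value, the output is constant within that state, so its logic needs no primary input.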
For instance, if we treat the symbolic present state as another input in addition to the primary input x0, and assign the don't-care in the last row of Table 1 to 1, we can write the function of z1 as in (EQ 1), which is independent of x0:

$\{z_1\} = \bar{x}_0 S_1 + x_0 S_1 = S_1$  (1)

However, the non-unifiable output z0 depends on the primary input x0, as shown in (EQ 2):

$\{z_0\} = \bar{x}_0 S_1$  (2)

We use PDS [8] to perform the minimum-dependency-driven don't-care assignment and encoding described above, and then implement the FSM in multi-level logic using SIS [9]. (EQ 3) confirms the relationship between unifiability and dependency, where $ps_0$ is a binary present-state variable:

$\{z_0\} = \overline{ps_0} + \bar{x}_0, \quad \{z_1\} = ps_0$  (3)

The properties of unifiability and minimum dependency are used in the following discussion.

3.2 Functional-Unit and Interconnection Binding

In this subsection, we describe how to bind the available functional units and interconnection resources so as to minimize the potential delay caused by the controller implementation. Figure 1(a) shows part of a scheduled behavior. For simplicity, we look only at the behavior in cycle 1, cycle 2, and cycle 3. Register allocation and other issues are considered in later subsections. Assume that two adders are available and that the input variables v0, v1, ..., v11 are stored separately. A simplified controller-functional-unit configuration is shown in Figure 1(b).

Figure 1: (a) Part of a scheduled description. (b) The simplified controller-functional-unit configuration with conditional resource sharing. The critical path will be delayed by the controller's late-departing m-0 and m-1 signals.

Because the statements w4 = v8 + v9 and w5 = v10 + v11 are mutually exclusive, conditional resource sharing has been applied in binding adder-0, on the left-hand side of Figure 1(b). In Figure 1(b), control signals are shown as dotted lines with arrows. We also use signal c to denote the controller input (v8 − v9 > 0).
For simplicity, only the multiplexer configuration is shown in detail. Signals m-0, m-1, m-2, and m-3 are the control signals for multiplexers mux-0, mux-1, mux-2, and mux-3, respectively. The critical path runs through the subtracter, the multiplexer, and adder-0. The configuration in Figure 1(b) generates the FSM shown in Table 2, where PS, NS, and PO denote present state, next state, and primary output, respectively. Notice that the signal m-0 is actually two bits wide; we therefore use m-0_0 and m-0_1 to denote its first and second bits, respectively. The binding in Figure 1(b) applies conditional resource sharing and results in only two four-input multiplexers.

Table 2: FSM-1. The output signals m-0_1 and m-1_1 of this FSM depend on the late-arriving input (v8 − v9 > 0); therefore, the critical path is further delayed by the controller's implementation.

  primary input       PS   NS   PO
  c (v8 - v9 > 0)               m-0_0 m-0_1 m-1_0 m-1_1 m-2 m-3
  -                   S0   S1   000000
  -                   S1   S2   010111
  1                   S2   Sx   1111--
  0                   S2   Sy   1010--

However, the output signals m-0_1 and m-1_1 are distinct (not unifiable) for state S2, because the output value is 1 when (v8 − v9 > 0) is 1 and 0 when (v8 − v9 > 0) is 0. Therefore, the two signals depend on the primary input (v8 − v9 > 0), as shown in (EQ 4), which is derived from the multi-level logic implementation of FSM-1 produced by SIS. The signals $ps_0$ and $ps_2$ are binary present-state variables that arrive around time zero.

$\{m\text{-}0_1\} = ps_0\,(v_8 - v_9 > 0) + ps_2, \quad \{m\text{-}1_1\} = ps_0\,(v_8 - v_9 > 0) + ps_2$  (4)

The input signal (v8 − v9 > 0) is late-arriving because of the subtraction. If we can build a controller structure during allocation that makes the multiplexer control signals independent of the late-arriving input, the critical path delay will be reduced.

Figure 2 is another configuration for the description in Figure 1(a). Conditional resource sharing has not been applied to this configuration. Because two adders are needed anyway, we might as well eliminate the undesired controller output dependencies by connecting v10 and v11 to mux-2 and mux-3 instead of mux-0 and mux-1. This also means that the addition is performed by adder-1 instead of adder-0. FSM-1-MCD (Minimum-Controller-Delay) in Table 3 is the new controller, where PI denotes the primary input. All the primary output signals are unifiable in FSM-1-MCD, because they are either 1s or don't-cares. None of them will end up depending on the late-arriving input (v8 − v9 > 0). In contrast to (EQ 4), m-0_1 and m-1_1 are now independent of the late-arriving input (v8 − v9 > 0), as shown in (EQ 5). Therefore, the configuration in Figure 2 has a shorter critical path than Figure 1(b).

$\{m\text{-}0_1\} = ps_0, \quad \{m\text{-}1_1\} = ps_0$  (5)

Figure 2: The simplified controller-functional-unit configuration with minimum controller delay; the critical path delay will be reduced. Multiplexer inputs: mux-0 (v0, v4, v8) and mux-1 (v1, v5, v9) feed adder-0; mux-2 (v2, v6, v10) and mux-3 (v3, v7, v11) feed adder-1, which produces w1, w3, and w5.

Table 3: FSM-1-MCD. None of the output signals depends on the late-arriving input (v8 − v9 > 0); therefore, the critical path delay is smaller.

  PI   PS   NS   PO
  c              m-0_0 m-0_1 m-1_0 m-1_1 m-2_0 m-2_1 m-3_0 m-3_1
  -    S0   S1   00000000
  -    S1   S2   01010101
  1    S2   Sx
       S2   Sy   1111

We would like to consider the controller structure and the datapath binding at the same time. In many cases, conditional resource sharing does not consider controller delay and may be too greedy. It can therefore introduce undesired control dependencies and a longer critical path, as we have seen in Figure 1(b) and FSM-1. Generating a controller with unifiable primary outputs during resource binding is one of our approaches to reducing the control dependency and hence the system cycle time.

3.3 Register Allocation

In the following, we explain how register allocation changes the controller's dependency on the critical path. Figure 3(a) is part of a scheduled behavior. We assume that the Boolean variable c denotes the condition (v0 − v1 > 0), and that f(), g(), h(), and r() are different simple operators. For simplicity, we analyze only the lifetimes of the variables on the left-hand side of the assignment statements: w0, w1, w2, and w3, as shown in Figure 3(b). The dotted line between (c == 1) and (c == 0) in Figure 3(b) indicates that conditional resource sharing is possible because of mutual exclusiveness.

Figure 3: (a) Part of a scheduled description. (b) A simplified lifetime analysis.

  /* c = (v0 - v1 > 0) */
  cycle 1:  if (v0 - v1 > 0) { w0 = f(v0); } else { w1 = g(v1); }
  cycle 2:  w2 = h(v2);   w3 = r(v3);

We first apply the left-edge algorithm with conditional resource sharing for register allocation: R0 = (w0, w1, w2), R1 = (w3), where R0 and R1 are registers. The simplified controller-register configuration is illustrated in Figure 4(a). To keep things simple, we use a signal c to denote the primary input (v0 − v1 > 0) to the controller, and we show only the two other control signals of interest, mux-0 and mux-1.

Figure 4: A simplified controller-register configuration based on (a) the left-edge register allocation with conditional resource sharing, (b) the minimum-controller-delay register allocation. In (a), f(v0), g(v1), and h(v2) feed R0 through a three-input multiplexer, and r(v3) feeds R1. In (b), f(v0) and h(v2) feed R0, and g(v1) and r(v3) feed R1.

The corresponding controller for Figure 4(a) is shown in Table 4(a). Because primary output mux-0 is distinct for state S0, mux-0 depends on the late-arriving input c, namely (v0 − v1 > 0), as shown in (EQ 6). In addition, a three-input multiplexer is placed in front of R0. We now introduce another register allocation configuration that avoids these potential problems.

$\{mux\text{-}0\} = ps_1\,c = ps_1\,(v_0 - v_1 > 0), \quad \{mux\text{-}1\} = ps_1$  (6)

Figure 4(b) is an alternative register allocation. It binds (w0, w2) to R0 and (w1, w3) to R1. In contrast to Figure 4(a), there is only a two-input multiplexer in front of R0. Table 4(b) describes the resulting controller. The primary output signals are all unifiable in this case. The logic functions of mux-0 and mux-1 are shown in (EQ 7).
$\{mux\text{-}0\} = ps_1, \quad \{mux\text{-}1\} = ps_1$  (7)

Because the primary outputs do not depend on the late-arriving controller input c, and only a two-input multiplexer is placed in front of R0, Figure 4(b) has a shorter critical path than Figure 4(a).

In this subsection, we have seen that allocating registers so that controller outputs are unifiable helps to find a controller-datapath implementation with minimum controller delay on the critical path.

3.4 Binding Non-Unifiable Signals

From the discussion above, we know that unifiable outputs generally help to minimize the controller's delay on the critical path. Because of inherent limitations, we sometimes cannot improve the binding configuration to the point of producing a controller whose outputs are all unifiable. As a heuristic, however, we would like to generate as many unifiable outputs as possible to potentially reduce the system cycle time. In addition to the finite state machine structure, multiplexer assignment is another factor that affects the number of unifiable outputs. We would also like to reduce the number of multiplexer inputs on the critical path to potentially improve system performance, while keeping the total number of multiplexer inputs as small as possible to minimize the interconnection cost. Experiments show that these heuristics can improve system performance when some controller outputs are not unifiable.

4 Algorithms

The goal is to generate a controller with unifiable outputs, eliminating its dependencies on late-arriving controller inputs during the allocation process. Given a scheduled control data flow graph (SCDFG), our Minimum-Controller-Delay (MCD) allocation approach can be applied on top of existing allocation methods. In this paper, we choose a base algorithm (BASE) as the basis of comparison for MCD. BASE uses the greedy left-edge and conditional resource-sharing algorithms for register allocation.
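As a rough illustration of the register allocation step, the following sketch implements the textbook left-edge algorithm with an optional mutual-exclusion set for conditional sharing. It is not taken from BASE: the function name `left_edge`, the half-open lifetime intervals, and the concrete lifetimes assigned to w0 through w3 are assumptions chosen to mimic the example of Figure 3.

```python
from typing import Dict, FrozenSet, List, Tuple

Lifetime = Tuple[int, int]  # hypothetical half-open interval [birth, death)


def left_edge(lifetimes: Dict[str, Lifetime],
              exclusive: FrozenSet[FrozenSet[str]] = frozenset()
              ) -> List[List[str]]:
    """Pack variables into registers, scanning them by left edge (birth)."""

    def can_share(a: str, b: str) -> bool:
        (a0, a1), (b0, b1) = lifetimes[a], lifetimes[b]
        overlap = a0 < b1 and b0 < a1
        # Mutually exclusive variables may share even if lifetimes overlap.
        return (not overlap) or frozenset((a, b)) in exclusive

    pending = sorted(lifetimes, key=lambda n: lifetimes[n][0])
    registers: List[List[str]] = []
    while pending:
        reg: List[str] = []
        for name in list(pending):  # iterate over a copy while removing
            if all(can_share(name, other) for other in reg):
                reg.append(name)
                pending.remove(name)
        registers.append(reg)
    return registers


# Assumed lifetimes: w0 and w1 are written in cycle 1 on exclusive branches.
LIFETIMES = {"w0": (1, 2), "w1": (1, 2), "w2": (2, 3), "w3": (2, 3)}
EXCLUSIVE = frozenset({frozenset(("w0", "w1"))})

print(left_edge(LIFETIMES, EXCLUSIVE))  # [['w0', 'w1', 'w2'], ['w3']]
print(left_edge(LIFETIMES))             # [['w0', 'w2'], ['w1', 'w3']]
```

Under these assumed lifetimes, the first call reproduces the grouping R0 = (w0, w1, w2), R1 = (w3) of Figure 4(a). The second call, which ignores mutual exclusion, happens to give the grouping of Figure 4(b); the paper's MCD method chooses that binding for unifiability reasons, not by dropping the exclusion information.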
A weighted module allocation graph is built from the preferences given by register allocation and conditional resource sharing. Maximum-weight clique partitioning then solves the module allocation. Commutativity is used in interconnection binding for the point-to-point model.

Table 4: (a) FSM-2, the FSM derived from the left-edge register allocation algorithm with conditional resource sharing. The mux-0 signal depends on the late-arriving input c, i.e., (v0 − v1 > 0). (b) FSM-2-MCD, the FSM derived from the minimum-controller-delay register allocation. The mux-0 signal is independent of the late-arriving input c.

  (a) FSM-2
  primary input       present state   next state   primary output
  c (v0 - v1 > 0)                                  mux-0 mux-1
  1                   S0              Sx           0     0
  0                   S0              Sy           1     0
  -                   Sx              S1           -     1
  -                   Sy              S1           -     1

  (b) FSM-2-MCD
  primary input       present state   next state   primary output
  c (v0 - v1 > 0)                                  mux-0 mux-1
  1                   S0              Sx           0     -
  0                   S0              Sy           -     0
  -                   Sx              S1           1     1
  -                   Sy              S1           1     1

In comparison with BASE, our MCD algorithm maintains the resulting controller structure during the allocation process. For an SCDFG, the binding of the operation nodes associated with a conditional branch directly influences the unifiability of the controller structure. For the conditional branch nodes in the given SCDFG, during register allocation we choose a register-sharing binding that yields unifiable controller outputs when we run the left-edge algorithm after lifetime analysis. Similarly, we assign higher edge weights in the weighted module allocation graph if a module binding generates a unifiable controller.

The unifiability of a controller output can be verified by XORing its care output values with respect to the branch states. That is, we need only look at the 1s and 0s of a controller output's values at the conditional nodes. If the XOR result is 0, the output values must be either all 1s or all 0s, apart from don't-care values. In this case, the output is unifiable and the binding leads to a minimum-dependency structure. If the XOR result is 1, on the other hand, the binding is not desirable. When some controller outputs cannot be unified because of the inherent structure, we try to reduce the number of multiplexer inputs on the critical path and choose a proper multiplexer assignment to increase the number of unifiable outputs, potentially reducing the system cycle time.
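The XOR test on care values can be sketched as follows. The helper names and the string encoding are assumptions made for this illustration. The "shared" rows are the m-0_0, m-0_1, m-1_0, m-1_1 bits of FSM-1 at branch state S2 (Table 2); the "split" rows are a hypothetical alternative binding invented here only to show a candidate whose outputs are all unifiable.

```python
from typing import Dict, Iterable, List


def is_unifiable(column: Iterable[str]) -> bool:
    """XOR every care value against the first one; all zeros means unifiable."""
    cares = [int(v) for v in column if v != "-"]
    return all((v ^ cares[0]) == 0 for v in cares)


def non_unifiable_count(branch_rows: List[str]) -> int:
    """Count output bits that are distinct across the branch transitions."""
    return sum(not is_unifiable(col) for col in zip(*branch_rows))


# One output string per branch transition of the same state; '-' is don't-care.
CANDIDATES: Dict[str, List[str]] = {
    "shared": ["1111", "1010"],  # FSM-1 at S2, c = 1 and c = 0
    "split": ["11--", "--11"],   # hypothetical binding without sharing
}

# Prefer the binding that leaves the fewest non-unifiable outputs.
best = min(CANDIDATES, key=lambda name: non_unifiable_count(CANDIDATES[name]))
print({n: non_unifiable_count(r) for n, r in CANDIDATES.items()}, best)
# {'shared': 2, 'split': 0} split
```

The count of 2 for the shared binding corresponds to the two distinct outputs m-0_1 and m-1_1. In an allocator, such a count could serve as a penalty when ranking bindings or weighting edges; how MCD actually weights these choices is not specified beyond the description above.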
When the allocation is done, minimum-dependency-driven don't-care assignment and encoding [8] can be used to eliminate the undesired dependencies.

5 Experimental Results

We have tried our allocation algorithms on several benchmarks, including those from Kim [10], Maha [11], and Sehwa [12]. The schedules for these control data flow graphs are similar to those in [13]. For the experiments, we assume that the conditional node in a fork branch and the operation that follows the conditional branch are scheduled in the same cycle. Each conditional node contains an operation and generates a control signal as an input to the controller. For simplicity, we assume that the conditional node operation is a comparison operator, and we randomly generate the arrival times of the controller inputs to reflect the fact that these signals arrive late. Select signals for multiplexers and load signals for registers are generated as the controller outputs.

The datapath part can be generated by PDL++ [14]. The controller part is in KISS format and is generated after the allocation process. The two parts are integrated by SIS [9]. The circuits are optimized by ESPRESSO [15] and by the delay-driven multi-level logic scripts in SIS. We use the mcnc.genlib library and delay-driven options for technology mapping. The cycle time of the whole system is measured using the library model after technology mapping.

We summarize the results of our MCD algorithm in Table 5, where RT state means register-transfer state and BASE denotes the comparison base algorithm explained in Section 4. In the experiments, two adders and two subtracters are used in all cases. Greedy conditional resource sharing by BASE results in non-unifiable controller outputs in all three cases, which lengthens the critical path of the whole system. MCD, on the other hand, is able to produce a controller structure with unifiable outputs and to eliminate the undesired dependencies of the controller outputs on the late-arriving inputs.
In Table 5, the cycle time comparison treats the BASE result as a unit delay and shows the corresponding MCD cycle time. On average, the system cycle time improvement, computed as (BASE − MCD)/BASE × 100%, is 31%. Similarly, we normalize the BASE area of each of the three benchmarks to one and show the MCD area accordingly. The area improvement is 24% on average. We attribute this to the minimum-dependency structure of the unifiable outputs and the resulting simplification of their logic functions.
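The improvement formula can be checked on a single table entry. The sketch below applies it to the normalized Maha area reported in Table 5 (BASE = 1, MCD = 0.69); the function name `improvement_percent` is an assumption introduced for this example.

```python
def improvement_percent(base: float, mcd: float) -> float:
    """Relative improvement of MCD over BASE, (BASE - MCD) / BASE * 100."""
    return (base - mcd) / base * 100.0


# Normalized Maha area from Table 5: BASE = 1, MCD = 0.69.
print(round(improvement_percent(1.0, 0.69), 1))  # 31.0
```

The 24% and 31% figures quoted in the text are averages of such per-benchmark improvements over the three benchmarks.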

Table 5: Experimental results. Average cycle time improvement is 31%; average area improvement is 24%.

  benchmark   RT       conditional   allocation   FSM       non-unifiable   MUX      area         cycle time
  name        states   nodes         method       outputs   outputs         inputs   comparison   comparison
  Maha        18       5             BASE         25        6               19       1            1
                                     MCD          28        0               22       0.69         0.
  Kim         19       2             BASE
                                     MCD
  Sehwa       17       5             BASE
                                     MCD

6 Conclusions

Most allocation approaches minimize the number of resources. Greedy conditional resource sharing methods often result in a controller that adds delay where it interacts with the datapath part of the system. We propose an allocation method that reduces the system delay through the controller and datapath by means of several heuristics: unifiable controller outputs, fewer multiplexer inputs on the critical paths, and proper multiplexer assignment. This method builds a controller structure with minimum dependency on the late-arriving inputs during the allocation process. The performance of the whole datapath and controller configuration is thereby improved.

References

[1] S. C.-Y. Huang and W. H. Wolf. Scheduling for minimum dependence in FSMs. In Proceedings, ICCAD-93, Nov.
[2] D. Gajski, A. Wu, N. Dutt, and S. Lin. High-Level Synthesis: Introduction to Chip and System Design. Kluwer Academic Publishers, Boston.
[3] F. J. Kurdahi and A. C. Parker. REAL: A program for register allocation. In Proceedings, DAC-87, June.
[4] D. E. Thomas, E. D. Lagnese, R. A. Walker, J. A. Nestor, J. V. Rajan, and R. L. Blackburn. Algorithmic and Register-Transfer Level Synthesis: The System Architect's Workbench. Kluwer Academic Publishers, Boston.
[5] F.-S. Tsai and Y.-C. Hsu. STAR: An automatic data path allocator. IEEE Transactions on CAD/ICAS, 11(9), Sept.
[6] C. H. Gebotys. Optimal scheduling and allocation of embedded VLSI chips. In Proceedings, DAC-92, June.
[7] K. Wakabayashi and T. Yoshimura. A resource sharing and control synthesis method for conditional branches. In Proceedings, ICCAD-89, pages 62-65.
[8] S. C.-Y. Huang and W. H. Wolf. Performance-driven synthesis in controller-datapath systems. IEEE Transactions on VLSI Systems, March.
[9] E. M. Sentovich, K. J. Singh, C. Moon, H. Savoj, R. K. Brayton, and A. Sangiovanni-Vincentelli. Sequential circuit design using synthesis and optimization. In Proceedings, ICCD-92, October.
[10] T. Kim, J. W. S. Liu, and C. L. Liu. A scheduling algorithm for conditional resource sharing. In Proceedings, ICCAD-91, pages 84-87, Nov.
[11] A. C. Parker, J. T. Pizarro, and M. Mlinar. MAHA: A program for datapath synthesis. In Proceedings, DAC-86, June.
[12] N. Park and A. C. Parker. Sehwa: A software package for synthesis of pipelines from behavioral specification. IEEE Transactions on CAD/ICAS, March.
[13] T.-C. Lee, W. H. Wolf, and N. K. Jha. A conditional resource sharing method for behavioral synthesis of highly testable data paths. In International Test Conference, Baltimore, MD, Oct.
[14] R. J. Lipton, D. N. Serpanos, and W. H. Wolf. PDL++: An optimizing generator language for register-transfer design. In Proceedings, ISCAS-90. IEEE Circuits and Systems Society, May.
[15] R. Rudell and A. L. Sangiovanni-Vincentelli. ESPRESSO-MV: Algorithms for multivalued logic minimization. In Proceedings, Custom Integrated Circuits Conference, Portland, OR, May.
