Incorporating the Controller Effects During Register Transfer Level Synthesis

Champaka Ramachandran and Fadi J. Kurdahi
Department of Electrical & Computer Engineering, University of California, Irvine, CA 92717, U.S.A.

Abstract

High Level Synthesis (HLS) has been mainly concerned with datapath synthesis of a digital system. Consequently, controller effects are often ignored when performing HLS tasks. However, the controller may sometimes contribute significantly to the overall system area and delay. Thus, it is necessary to incorporate the controller effects during datapath synthesis. Since control synthesis tools such as MISII are time consuming, it is not feasible to synthesize a controller netlist every time a high level design decision is made. As a result, it is necessary to estimate the controller contribution. As a first step towards a comprehensive prediction scheme, we present a simple yet effective controller estimation model which can be invoked during the Register-Transfer synthesis phase of HLS, and which attempts to reflect the incremental effects of iterative RT level transformations on the controller area and delay. Our model has been benchmarked and found to efficiently account for the controller area and delay.

1 Introduction

The design of a VLSI chip begins with a behavioral description and typically ends with a detailed layout. The global decisions made during the early phases of design have a significant effect on the quality of the final chip layout. However, such effects will typically not be apparent until the final stages of the design process. Thus, there is a need for accurate design quality metrics which can properly reflect the impact of the subsequent design steps and provide guidance for the global design decisions. An important factor that has typically been ignored during high level synthesis is the controller.
In many cases, especially for control-dominated designs and designs with complex controllers, the controller area and, more importantly, the controller delay can be a significant contributor to the total chip area and performance. In a previous work [1], a netlist-based model for chip level area and delay estimation was proposed. This model assumes as input an already synthesized Register-Transfer level datapath and a controller netlist. The model was experimentally benchmarked with respect to high level synthesis designs as well as industry standard designs and found to be quite accurate while not sacrificing runtime efficiency (i.e. the model was not too expensive to evaluate). Thus, this model is useful during datapath synthesis, but it may not be efficient enough when controller effects are to be accounted for during high level synthesis, since one has to run logic synthesis tools such as MIS to obtain a controller netlist. Such control synthesis tools are time consuming and, if run repeatedly, would significantly increase the estimation runtime and hence the runtime of the overall high level synthesis procedure. Hence, there is a need for a predictive model of controller area and delay which is effective in reflecting the controller contributions to the chip area and delay. A complete model of the controller starting from the state diagram necessitates modeling the logic as well as the physical design phases. Modeling the impact of logic synthesis is an extremely complex task which has received very little attention in the past. As a first step towards a complete predictive model of the controller, we propose in this paper a predictive model of the incremental controller changes which occur during the RT-binding phase of the high level synthesis procedure.

(This work was supported by NSF Grant # MIP, by a MICRO grant from the University of California, and by a DAC fellowship.)
This model is shown to be effective in tracking the changes in controller area and delay and hence can be used when an RT level design is undergoing an iterative improvement phase consisting of a sequence of re-binding transformations.

2 Previous work

Most of the previous work in developing predictive models of layout was done at the gate or transistor levels. The standard cell style, being the most popular design method for custom random logic applications, was studied by researchers, and predictive models of standard cell layouts were developed. Most notable is the work by Pedram and Preas [2], who developed accurate analytical models for area and wire length estimation, and Zimmerman [3], who developed a novel slicing technique for estimating the area and shape function of custom layouts. All these models were benchmarked and found to predict the area of standard cell layouts with errors around 5 to […]%. The work in [4, 5] describes a layout area and delay prediction approach using a hardware model which combines analytical and constructive predictive models of layout. In [6] and [7], abstracted layout area and timing models for high level synthesis were presented. These models were experimentally shown to accurately and efficiently reflect the effects of the datapath design tradeoffs on the final layout. However, these models concentrated on modeling the datapath and controller separately and did not consider the impact of floorplanning and logic optimization, which could generally be a significant factor in area and delay. In [1], a chip level netlist-based area and delay estimation model was proposed. This approach was based on a constructive-analytical mixture of models to hierarchically estimate chip area and worst case register-to-register delay. In [8] and [9], a model for estimating the controller complexity was proposed. Given a state table, this model estimates the number of cells needed for the controller implementation. This is accomplished using empirical formulae whose parameters are statistically derived for a given technology. This model does not incorporate delay estimates and, furthermore, does not account for wiring effects since no netlist is produced.

3 Approach

3.1 Architectural Model

Typically, high level synthesis systems use an FSMD [10] design model as shown in Figure 1. This model consists of two important components: a) a controller, which can be represented as a finite state machine and synthesized into a state register and combinational logic, and b) a datapath, which contains the functional units and storage units that perform the required computations.

Figure 1: Architectural model for a digital system with a Moore-style controller (control unit with next state logic, state register and control logic; datapath with register file, muxes, functional unit and datapath registers)
The control unit controls the computations in the datapath using the control signals and receives the status of various computations through the status signals.

3.2 Problem Statement

Figure 2(a) shows a typical flow of high level synthesis, which consists of the traditional phases of scheduling and allocation followed by RT level synthesis. During RT level synthesis, the resulting RT level design is often further optimized by an iterative sequence of re-binding transformations which are aimed at improving the design area and/or delay. The re-binding phase of high level synthesis involves changing the values mapped to registers (e.g. moving a value from register R_a to register R_b as given in [11]), moving operators between functional units, and modifying the interconnections between them (e.g. mux connections). An example of a transformation is shown in Fig. 3.

Figure 2: Design methodology. (a) Traditional flow of HL synthesis, with control synthesis inside the re-binding loop; (b) proposed flow of HL synthesis, with control estimation inside the re-binding loop

Figure 3: An example of a re-binding transformation (merging two registers, which removes one register and its load signal)

All these transformations are performed, one at a time, in an iterative fashion, and each translates to a change in the datapath as well as a change in the state table of the controller. This means that both the controller and the datapath need to be re-synthesized. In [1], we have shown the estimation of the area and the delay of the datapath given an RT-netlist description, using the architectural model shown in Figure 1. Hence, if we measure the change in the controller due to the re-binding operation, we can determine whether the re-binding indeed produced an improved design. In order to obtain the change in the controller, we have to perform control synthesis on the modified state table.
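To make the effect of a re-binding move on the state table concrete, the following is a minimal sketch, not the authors' implementation. It assumes a hypothetical three-state controller whose state table is stored as a mapping from each state to the set of control lines asserted in it; the state names, the signal names and the choice of merging R3 into R1 are illustrative assumptions only.

```python
# Hypothetical state table: each state maps to the datapath control lines
# asserted in that state. All names here are illustrative assumptions.
next_state = {"S0": "S1", "S1": "S2", "S2": "S0"}
outputs = {
    "S0": {"R1_LOAD", "R2_LOAD"},
    "S1": {"R3_LOAD"},
    "S2": {"R4_LOAD"},
}


def merge_registers(table, keep, drop):
    """Return a new output table in which register `drop` is folded into `keep`.

    Every assertion of <drop>_LOAD becomes an assertion of <keep>_LOAD.
    The input table is left untouched.
    """
    old, new = f"{drop}_LOAD", f"{keep}_LOAD"
    return {
        state: {new if line == old else line for line in lines}
        for state, lines in table.items()
    }


def control_lines(table):
    """Set of all control lines that appear anywhere in the table."""
    return set().union(*table.values())


rebound = merge_registers(outputs, keep="R1", drop="R3")

# The states and their transitions are untouched; only the outputs change.
print(sorted(control_lines(outputs)))
print(sorted(control_lines(rebound)))
```

In this sketch the move removes one control line and changes the output of one state, while `next_state` is never consulted, which is the situation the rest of the paper exploits.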
The control synthesis process can be divided into three different phases: state encoding, logic optimization and technology mapping. During state encoding, the symbolic names for the states are encoded into binary values based on certain heuristics [12]. After encoding, the state table resembles a truth table. This truth table can be optimized using logic optimization techniques [13]. The optimized netlist is then mapped onto components selected optimally from a given (standard cell) library. This phase is called technology mapping and generates a gate level netlist. Next, the netlist is placed and the interconnections are routed (typically in the standard cell design style). The above sequence of steps that results in a structural netlist is quite time-consuming because of the complexities involved in logic optimization and technology mapping. When performing high level synthesis, we need to exercise the control synthesis process for each design choice. This significantly limits the practical application of high level synthesis algorithms. There are two possible ways to deal with this problem. One possibility is to dispense with the control re-synthesis phase during re-binding. Re-binding, however, may result in significant changes in the controller structure which may affect the overall chip area and delay. An alternative solution is to replace the control re-synthesis phase with an incremental estimation phase as shown in Figure 2(b). Such an approach would be much more efficient than a complete re-synthesis step during every iteration and would enable the correct tracking of the controller effects.

3.3 The Control Model

During the re-binding transformations described in the previous section, the design is not re-scheduled, so the number of states remains constant, as do the transitions between the states. However, the values on the datapath control lines depend on the transformations in the datapath. So, the boolean expressions that determine the values on the datapath control lines get modified during the re-binding transformations. Since we use a non-sharing scheduler with no status registers, we need a Moore machine model for the controller [14].
In this model, the controller output depends only on the current state of the design. In other words, the boolean value on each datapath control line is a function of the Decoded Current State (DCS) of the datapath. The state encoding that we have assumed depends on the state transitions and hence is invariant during the re-binding phase. Thus, the boolean expressions that determine the state decoder and the next state are identical across the re-binding transformations. (This is a basic assumption in our system. However, we note that datapath changes could sometimes affect the status lines, and re-encoding of the states may result in changing the next state logic structure.)

Fig. 4 shows the new Partitioned Control Model (PCM) that we propose for use with the re-binding phase of high level synthesis. The partitioned control model consists of two sets of expressions, namely the Invariant Expressions (IE), consisting of the next state logic and the state decoder equations, and the Transformation Sensitive Expressions (TSE), consisting of the output logic equations.

Figure 4: Control model example (the clocked state register, state decoder and next state logic form the Invariant Expressions (IE); the decoded states (DCS) drive the output logic (TSE), which drives the control lines to the datapath; status lines return from the datapath)

State encoding and initial logic synthesis are performed only once, by running the IE through Mustang and MISII (or any other logic synthesis tools) during the first pass through the design cycle. As shown in Fig. 2, we obtain the state encoding, the next state logic and the state decoder during this first pass. We will call the logic that implements the IE the Invariant Logic (IL). We will re-use the IL during the multiple re-binding iterations. In the next section we will show how we can estimate the Transformation Sensitive Logic (TSL) that implements the TSE.
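The partitioning described above can be illustrated with a small data structure. This is a sketch under stated assumptions rather than the CLEAR implementation: the class name `PartitionedController`, its fields and the example state and signal names are all hypothetical, and the invariant part is represented by plain dictionaries instead of synthesized logic.

```python
from dataclasses import dataclass


@dataclass
class PartitionedController:
    """Moore controller split into invariant and transformation-sensitive parts.

    Hypothetical illustration: `encoding` and `next_state` stand in for the
    Invariant Expressions, `output_table` for the Transformation Sensitive
    Expressions.
    """
    encoding: dict      # invariant: state name -> binary code
    next_state: dict    # invariant: state name -> successor state name
    output_table: dict  # sensitive: state name -> asserted control lines

    def decoded_state(self, state):
        """One-hot vector over the states, i.e. the state decoder outputs."""
        return [int(s == state) for s in self.encoding]

    def outputs(self, state):
        """Moore outputs: a function of the current state only."""
        return set(self.output_table[state])

    def rebind(self, new_output_table):
        """Replace only the output logic; the invariant part is re-used as is."""
        return PartitionedController(self.encoding, self.next_state,
                                     new_output_table)


pcm = PartitionedController(
    encoding={"S0": "00", "S1": "01", "S2": "10"},
    next_state={"S0": "S1", "S1": "S2", "S2": "S0"},
    output_table={"S0": {"R1_LOAD"}, "S1": {"R3_LOAD"}, "S2": {"R4_LOAD"}},
)
rebound = pcm.rebind({"S0": {"R1_LOAD"}, "S1": {"R1_LOAD"}, "S2": {"R4_LOAD"}})

print(pcm.decoded_state("S1"))
print(rebound.outputs("S1"))
print(rebound.next_state is pcm.next_state)
```

The design choice mirrored here is that `rebind` never recomputes `encoding` or `next_state`, so whatever cost was paid to obtain them is paid once for the whole sequence of re-binding iterations.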
3.4 Predicting the TSE

In order to predict the logic required to implement the TSE without actually performing logic synthesis, we need to mimic the various phases of logic synthesis while avoiding the time-consuming and complex tasks of logic optimization and technology mapping. So, let us examine how we can determine a relation between the TSE and the DCS (decoded current state). The $TSE_i$ of a datapath control line can be expressed as

$$TSE_i = \sum_{j=1}^{N} (V_{ij}\, DCS_j) \qquad (1)$$

where $V_{ij}$ is a boolean variable, $N$ is the number of decoded states, and the sum denotes a Boolean OR. This relationship can also be expressed in the form of a bipartite graph as shown in fig. 5. The left nodes of the graph are the $DCS_j$'s and the right nodes are the $TSE_i$'s. The nets of the graph are the $V_{ij}$'s, and a net exists if the value of $V_{ij}$ is 1.

Figure 5: Bipartite graph model of TSE

We can now cast the logic optimization problem as a bipartite graph clustering problem. Let $C_j$ be a boolean variable that is equal to 1 when all the $DCS_j$ nodes in the cluster $K_l$ are connected to a node $TSE_i$:

$$C_j = \prod (V_{ij}) \quad \forall\, j \in K_l \qquad (2)$$

Hence, the clustering problem is that of determining $K_l$ such that $C_j$ is maximized. We can also cast the technology mapping problem as an introduction of hierarchy on the above bipartite graph clustering problem. Technology mapping involves mapping a set of gates given in a technology library to implement a given expression. Because of the special property of the TSEs given in expression (1), we can observe that they can be built using only OR gates. So, the technology library need only consist of a set of OR gates with a maximum of $M$ inputs per gate. The technology mapping problem now reduces to recognizing clusters $K_l$ of size $M$ or smaller. In the case where a cluster $K_l$ is larger than $M$, we can build a hierarchical tree of sub-clusters such that each sub-cluster is of size $M$ or smaller.

Since the above problem is NP-complete and we need a quick solution that can be used during the design iterations of high level synthesis, we have used the Fiduccia-Mattheyses (FM) technique, which is an improvement of the Kernighan-Lin heuristic [15], [16], to provide an approximate solution. We have transformed the bipartite graph to a hypergraph as shown in figure 5 in order to apply the FM technique. The nodes of the new graph are the left nodes of the bipartite graph, i.e. the $DCS_j$. The nets of the graph are the $TSE_i$, and $TSE_i$ is connected to a $DCS_j$ whenever $V_{ij}$ is 1. The hypergraph can now be partitioned such that the cost function $C_j$ shown in expression (2) is maximized. Given the example shown in figure 5, we can now derive the hierarchical cluster tree shown in figure 6. The internal nodes of the tree can be directly mapped to the OR gates in the technology library. The number of inputs to each OR gate is determined by the number of children of the internal node.

Figure 6: Hierarchical cluster tree of TSE
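The mapping from expression (1) to a tree of bounded fan-in OR gates can be sketched as follows. This is a deliberately simplified stand-in and not the paper's method: it chunks the inputs of each TSE independently instead of running Fiduccia-Mattheyses partitioning, so it never shares a cluster between two control lines. The matrix `V`, the fan-in limit `M = 3` and the function names are assumptions made for illustration.

```python
def evaluate_tse(V, dcs):
    """Expression (1): TSE_i is the OR over j of (V[i][j] AND dcs[j])."""
    return [int(any(v and d for v, d in zip(row, dcs))) for row in V]


def tse_inputs(V):
    """For each control line i, the decoded states j with V[i][j] == 1."""
    return [[j for j, v in enumerate(row) if v] for row in V]


def or_tree(inputs, M):
    """Decompose an OR over `inputs` into gates with at most M inputs each.

    Returns (root, gates). Each gate is a tuple of its children; a child is
    either an input label or a nested gate tuple. Simple chunking is used
    here in place of FM-based clustering (an assumption of this sketch).
    """
    if M < 2:
        raise ValueError("maximum fan-in M must be at least 2")
    level = list(inputs)
    gates = []
    while len(level) > 1:
        next_level = []
        for k in range(0, len(level), M):
            group = tuple(level[k:k + M])
            if len(group) == 1:
                next_level.append(group[0])  # lone leftover passes through
            else:
                gates.append(group)
                next_level.append(group)
        level = next_level
    return (level[0] if level else None), gates


# Hypothetical V matrix: 3 control lines (rows) over 5 decoded states (columns).
V = [
    [1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
]
M = 3
gate_counts = [len(or_tree(ins, M)[1]) for ins in tse_inputs(V)]
print(gate_counts, sum(gate_counts))
print(evaluate_tse(V, [0, 0, 1, 0, 0]))
```

A control line driven by a single decoded state needs no gate at all, and a line with more than `M` inputs produces several levels of gates. A clustering step that recognizes inputs shared between lines, as the text describes, can reduce the total below what this per-line chunking reports.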
Hence, we have derived a logic netlist of the output logic using an approximation to the logic optimization and technology mapping processes. Since the FM algorithm is pseudo-linear, the predicted netlist of the controller can be obtained in close to linear time.

Figure 7: Area and delay comparison of MISII produced and CLEAR predicted netlists (layout area in sq. microns and delay in ns for clk_div, ctr, ellipf, hal, maha and timer; series: MISII run, CLEAR run)

4 Experimental results

We have implemented the PCM model and the incremental Control Logic EstimAtion for RT synthesis in the CLEAR system. We have tested our partitioned control model (PCM) and the incremental control estimation (CLEAR) on 7 designs, which include diffeq, maha, elliptic filter, FIR filter and 3 industrial examples which are sub-circuits of a DSP chip. We synthesized the RT implementations from the behavior specifications for each benchmark and obtained the state table of the controller. We then implemented the logic design of the state table using Mustang [12] and MISII [13]. In order to run MISII, we used the standard script provided with the MISII release directory, which provides an optimal gate count. We also applied PCM to the above state table and estimated the logic netlist of the TSEs. Table 1 shows the complexity of the designs in terms of the number of states and outputs in the state table. It also compares the number of cells in the controller obtained by running MISII and CLEAR. Column 4 indicates the percentage of TSL as compared to the size of the controller. We can notice that the TSL is indeed a significant part of the controller. Columns 5 and 6 show the CPU times for the MISII and CLEAR runs on these designs. It can be clearly seen that CLEAR is at least […]-8 times faster than MISII while estimating the number of cells with a relative error less than about 9 percent.
Here, one could argue that we could run MISII in the fast mode by performing minimal optimization and technology mapping. We conducted this experiment on the controller for the FIR filter and found that the fast mode overestimated by close to 9%. These results are shown in Fig. 8. Area and delay values of the logic designs were estimated using LAST and TELE [4], [5], which account for wiring area and delay. The graphs in fig. 7 show comparisons of the estimated area and delay for the netlist produced by MISII and the one estimated by CLEAR. We can observe that the area of the netlist estimated by CLEAR closely tracks that of the netlist produced by MISII.

Table 1: Characteristics, runtimes and logic netlist sizes of the benchmark circuits. Columns: bench designs, states, outputs, % output logic, MIS CPU secs, CLEAR CPU secs, MIS num cells, CLEAR num cells. Rows: clk-div, ctr, timer, diffeq, ellipf, maha, fir. (The numeric entries are not recoverable from this copy.)

In the next set of experiments, we applied the set of transformations described in [11] to RT-designs of MAHA and the FIR filter. This enabled us to re-bind the datapath and generate new state tables at every iteration. Figs 8 and 9 show the area, delay and cell count of the controllers generated from the re-bound RT-designs of MAHA [17] and 8 controllers of the FIR filter. As we can see, the area and the cell count predicted by CLEAR closely match those produced by MISII. On the other hand, the delay predicted by CLEAR does not always closely track the delay obtained from the MISII netlist. This is because the MISII script we used was not tuned for performance optimization.

5 Conclusions

Our experiments with the high level synthesis benchmarks show that CLEAR with the partitioned control model can be used during the iterative RT-synthesis phase when re-binding transformations are being applied to the design. CLEAR is not intended as a replacement for logic synthesis tools such as MISII, but uses the MISII run effectively. We have noticed that MISII takes a significant amount of CPU time to synthesize the logic netlist of a state table. When MISII needs to be executed repeatedly in an iterative design cycle, it could become the bottleneck in the entire process. In our approach, we invoke MISII just once during the entire re-binding phase and perform some quick computations to predict the logic netlist. Hence, we avoid the repeated invocation of MISII and speed up the design iterations. In this paper, we have also described a partitioned control model (PCM) and shown experimentally that it could lead to designs with better performance with a small penalty in area. These experiments were only performed on some HLSW92 benchmarks.
We have not yet performed similar experiments on the Logic Synthesis benchmarks. As a future step, we intend to extend this concept to account for controller changes during re-scheduling and re-allocation and the entire iterative re-synthesis cycle.

Figure 8: Area, delay and number of cells of MISII and CLEAR predicted controller netlists for MAHA designs (layout area in sq. microns, number of cells and delay in ns for designs a to l produced by re-binding transformations; series: MIS run, CLEAR run)

Figure 9: Area, delay and number of cells of MISII and CLEAR predicted controller netlists for FIR designs (designs a to h produced by re-binding transformations; series: CLEAR run, MIS (slow) mode, MIS fast mode (MAP))

References

[1] C. Ramachandran, F. J. Kurdahi, D. Gajski, V. Chaiyakul, and A. Wu, "Accurate layout area and delay modeling for system level design," in Proc. ICCAD-92, Nov. 1992.
[2] M. Pedram and B. Preas, "Interconnection length estimation for optimized standard cell layouts," in Proc. ICCAD-89, pp. 390-393, IEEE/ACM, 1989.
[3] G. Zimmerman, "A new area and shape function estimation technique for VLSI layouts," in Proc. 25th Design Automation Conf., pp. 60-65, IEEE/ACM, 1988.
[4] F. J. Kurdahi and C. Ramachandran, "LAST: A layout area and shape function estimator for high level applications," in Proc. Second European Conf. on Design Automation, Feb. 1991.
[5] C. Ramachandran and F. J. Kurdahi, "TELE: a timing evaluator using layout estimation for high level applications," in Proc. EDAC-92, 1992.
[6] A. C.-H. Wu, V. Chaiyakul, and D. D. Gajski, "Layout-area models for high-level synthesis," in Proc. ICCAD-91, pp. 34-37, Sept. 1991.
[7] V. Chaiyakul, A. Wu, and D. Gajski, "Timing models for high-level synthesis," in Proc. EuroDAC-92, 1992.
[8] B. Mitra et al., "Estimating the complexity of synthesized designs from FSM specifications," IEEE Design and Test, vol., pp. 36-4, Mar.
[9] Q. Ji, Y. Oh, M. Lightner, and F. Somenzi, "Technology independent estimation of area in logic synthesis," in Proc. of the SASIMI9 Workshop, pp. 7-8, 99.
[10] D. Gajski, N. Dutt, A. Wu, and S. Lin, High-Level Synthesis: Introduction to Chip and System Design. Kluwer Academic Publishers, 1992.
[11] C. Papachristou, H. Harmanani, and M. Nourani, "An approach for redesigning in datapath synthesis," in Proc. of the DAC93 Conf., pp. 419-423, ACM, June 1993.
[12] S. Devadas et al., "MUSTANG: State assignment for finite state machines for multi-level logic implementations," in Proc. ICCAD-87, pp. 16-19, 1987.
[13] R. Brayton et al., "MIS: a multiple level logic optimization system," IEEE Trans. CAD, vol. CAD-6, pp. 1062-1081, Nov. 1987.
[14] L. Ramachandran and D. Gajski, "Architectural tradeoffs in synthesis of pipelined controls," to appear in Proc. of the EuroDAC93 Conf., IEEE/ACM, September 1993.
[15] B. W. Kernighan and S. Lin, "An efficient heuristic for partitioning graphs," Bell Syst. Tech. Jour., vol. 49, no. 2, pp. 291-307, 1970.
[16] C. M. Fiduccia and R. M. Mattheyses, "A linear-time heuristic for improving network partitions," in Proc. of the 19th Design Automation Conference, pp. 175-181, IEEE/ACM, 1982.
[17] N. Dutt, "Status of HLSW92 benchmarks," in 6th International Workshop on High Level Synthesis, IEEE/ACM, November 1992.
