Sequential Logic Synthesis with Retiming in Encounter RTL Compiler (RC)


Christoph Albrecht (1), Shrirang Dhamdhere (1), Suresh Nair (1), Krishnan Palaniswami (2), Sascha Richter (1)
(1) Cadence Design Systems, (2) Focus Semiconductor

Session Track: Digital IC Design
Session Number: 2.3
Relevant Cadence Products: Encounter RTL Compiler (RC), Encounter Conformal Logic Equivalence Checker (LEC)

Abstract

Typical ASIC designs are highly unbalanced with respect to the timing criticality of their combinational logic paths. This is mainly due to the ad-hoc manual design specification of the register transfer level (RTL), which does not use any information regarding the sequential timing criticality. Traditional logic synthesis does not support borrowing of timing slack across registers, and the optimization is restricted by fixed positions of the registers. This may result in a suboptimal solution, in a loss of performance, and unnecessary area and power consumption.

This paper explains the concept of clock scheduling and retiming used by Encounter RTL Compiler (RC) to optimize across register boundaries. Retiming is a structural transformation which changes the positions of the registers without modifying the input-output behavior of the circuit. The reader will understand how the area, the number of registers, or the delay of the design is minimized. Computational results show the tradeoff between these two objectives.

Practical applications are discussed: Registers may have different control signals, enable signals, or reset signals. This leads to the multiclass retiming problem and the reset line justification problem. Retiming used to be a difficult challenge for equivalence checking. However, together with Encounter Conformal Logic Equivalence Checker (LEC) the verification is now simple: RC writes out checkpoint netlist files and one script, which LEC can then process to automatically verify the golden RTL against the final netlist.
We present a case study showing how retiming was used by Focus Semiconductor, a division of Focus Enhancements, on a 1.5 M instance UWB baseband chip. Retiming substantially improved the Quality of Results (QoR) and helped to meet the design objectives. CDNLive! Silicon Valley

1 Introduction

Traditional combinational logic synthesis focuses all the optimization efforts on the combinational paths between the registers. It does not support any tradeoff between tight paths and loose paths when these are separated by registers. To motivate the use of sequential logic synthesis with retiming, we will discuss the slack distribution of a typical ASIC design.

Figure 1: Slack distribution of a typical ASIC design.

Figure 1 shows the slack distribution, more specifically the distribution of the setup slacks of a late-mode analysis after synthesis. For each slack interval on the x-axis, the number of combinational paths which have a slack value within that interval is shown. The design has a worst negative slack of -529 ps.

Figure 2: Slack distribution of the same ASIC design as in Figure 1, this time with optimized clock latencies.

Figure 2 shows an optimized slack distribution for the same design. The netlist was not changed, only the clock latencies at the registers. The latencies were computed with a slack balancing algorithm which we will discuss later. The number of critical paths has decreased drastically. Only a small fraction of the paths have a negative slack. In this case it was not possible to improve the worst negative slack, because the worst path in this design is a path from a primary input to a primary output.

Together, Figure 1 and Figure 2 demonstrate the optimization potential which becomes available when the registers are unlocked and no longer kept fixed as hard boundaries that constrain the synthesis optimization algorithms. With the optimized clock latencies, many paths become uncritical. The additional slack can be used to downsize the combinational gates or even to use a different logic structure that has smaller area and power consumption.
While clock scheduling was not able to reduce the worst negative slack for this specific design, it was able to improve the slack of the side paths. These are either combinational paths that start at the primary input of the critical path and end at a register, or paths that start at a register and end at the primary output of the critical path. This is helpful for the synthesis optimization algorithms in RC: RC is able to improve the slack of a path by using the slack of its side paths.

In this paper we discuss the two sequential optimization techniques, clock scheduling and retiming, and show how the combination of both these techniques is used in RC. The paper is organized as follows: In Section 2 we discuss clock scheduling. Clock scheduling is also known as useful skew. It changes the latencies of the clock signal but does not change the logic. The different latencies need to be realized by a sophisticated clock network. In Section 3 we describe retiming. Retiming is a structural transformation. While retiming does not change the combinational gates, it modifies the netlist by moving the registers forward and backward in the logic.

RC can use clock scheduling as an intermediate step to drive the logic synthesis and optimization process. Ultimately, it realizes the different latencies by retiming so that a conventional zero or limited skew backend flow can place the design, construct the clock network, and route the nets. This is described in Section 4. In practice, retiming can be constrained by registers that have different control signals (for example, enable signals, asynchronous set or reset signals). Section 5 discusses these constraints. In Section 6 we discuss the automatic verification flow with LEC. In Section 7 we present a case study of how retiming was used on a UWB baseband chip from Focus Semiconductor.

2 Clock Scheduling

Figure 3 shows how the worst slack of a design can be improved by changing the clock latencies: Buffers are added to the clock distribution network and the switching time of the register is delayed. In this case the worst slack is improved from -2 ns to 0 ns and the design meets the timing requirements.
If the clock latency of the capturing register of a combinational path is increased, the slack of the combinational path increases by the same amount. If, on the other hand, the clock latency for the capturing register is decreased, the slack of the combinational path decreases. Increasing the clock latency of the launching register decreases the slack, and decreasing the latency has the opposite effect on the slack of the path.

Figure 3: The worst slack is improved by adjusting the clock latencies. (Target clock period: 5 ns. Worst slack without clock latencies: -2 ns. Worst slack with clock latencies: 0 ns.)

A linear programming formulation

The clock scheduling problem can be formulated as a linear program. This was first done by Fishburn in 1990 [1]. Let $T$ be the clock period, which is to be minimized. Furthermore, let $l_i$ be the latency of the clock signal arriving at register $i$, and let $d_{ij}$ be the maximum delay of all combinational paths from register $i$ to register $j$.

$$\min T \quad \text{subject to} \quad l_i + d_{ij} \le l_j + T \quad \text{for all combinational paths } (i, j).$$

The difference between the right-hand side and the left-hand side of the inequality is the slack. Should the design have constrained primary inputs or outputs, we can represent all these inputs and outputs by one dummy register that can have, without loss of generality, a clock latency of zero. Hence, we can assume that even in this case the linear program has the form above.

The linear program has a very special structure and can be solved efficiently with combinatorial algorithms. It can be proved that the minimum clock period achievable by clock scheduling is equal to the maximum mean delay over all cycles in the register-to-register timing graph. The register-to-register timing graph contains a node for every register and an edge whenever there is a combinational path between the registers, with a weight equal to the maximum delay of these paths.

In general, the linear program does not have a unique solution, and an arbitrary solution that minimizes the clock period is usually not the desirable one. For example, consider the ASIC design for which the two different slack distributions are shown in Figure 1 and Figure 2. The worst negative slacks of the two slack distributions are equal and so are the clock periods at which the chips can operate without failure.

Clock scheduling optimally balancing the slack

In the following we discuss how it is possible to compute a clock schedule with a specific property which we call optimally balanced slack. As a result of this property many paths are uncritical and have a lot of slack.
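The linear program above is a system of difference constraints, so one of the combinatorial approaches it allows can be sketched with Bellman-Ford relaxation and a bisection on the clock period. This is only a minimal illustration, not the algorithm used in RC; the function names and the two-register loop used as input are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# (launching register i, capturing register j, maximum path delay d_ij)
Edge = Tuple[str, str, float]


def latencies_for_period(nodes: List[str], edges: List[Edge],
                         period: float) -> Optional[Dict[str, float]]:
    """Return latencies with l_i + d_ij <= l_j + T, or None if T is infeasible."""
    # Each constraint reads l_i <= l_j + (T - d_ij): a difference constraint.
    lat = {n: 0.0 for n in nodes}  # start as if a virtual source reached every node
    for _ in range(len(nodes)):
        changed = False
        for i, j, d in edges:
            bound = lat[j] + period - d
            if lat[i] > bound + 1e-9:
                lat[i] = bound
                changed = True
        if not changed:
            return lat
    return None  # still relaxing: a cycle has mean delay above T


def min_clock_period(nodes: List[str], edges: List[Edge],
                     iterations: int = 60) -> float:
    """Bisect on T between 0 and the largest delay (feasible with zero latencies)."""
    lo, hi = 0.0, max(d for _, _, d in edges)
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if latencies_for_period(nodes, edges, mid) is None:
            lo = mid
        else:
            hi = mid
    return hi


# Hypothetical loop of two registers with path delays 9 and 11.
loop_nodes = ["a", "b"]
loop_edges = [("a", "b", 9.0), ("b", "a", 11.0)]
best_period = min_clock_period(loop_nodes, loop_edges)
```

For this loop the result is close to 10, the mean delay of the only cycle, whereas without clock latencies the period is bounded by the longest path delay of 11. The returned latencies are only determined up to a common offset.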
This part is more theoretical. Readers with limited time may skip it, because the following sections are more important for practical use.

We consider a small example circuit with four registers, a, b, c, and d, shown in Figure 4.

Figure 4: Example circuit with combinational gates and four registers. The numbers specify the delay of the gates.

From the circuit we can construct the register-to-register timing graph which is shown in Figure 5. The graph has one node for each of the four registers and an edge between two nodes whenever there is a combinational path between the corresponding registers. Associated with the edges is the maximum delay of the combinational paths.

Figure 5: Register-to-register timing graph for the circuit in Figure 4.

Without clock latencies, the minimum feasible clock period for this circuit is equal to the maximum delay of the combinational paths, in this case T = 11. By increasing the clock latency for the register b to +1, the clock period can be decreased to T = 10. This is the minimum clock period which can be achieved by clock scheduling, because with these latencies the two paths (b,d) and (d,b) have a slack of zero. Figure 6 shows the register-to-register timing graph with the latency +1 at register b. In addition to the combinational delays, the figure also shows the slacks for the clock period T = 10 in brackets.

Figure 6: A clock schedule applied to the registers such that the worst incoming slack equals the worst outgoing slack for every register. Edges are labeled with delay (slack) for the clock period T = 10. The edges corresponding to the critical paths with a slack smaller than or equal to 1 are shown in red.

The clock schedule shown in Figure 6 has the property that for every register the worst incoming slack is equal to the worst outgoing slack. Changing the clock latency of one single register alone does not give an improvement, since the worst slack of all the paths starting or ending at the register can only get worse.

Figure 6 shows that there is one critical edge in red, the edge (d,c), which is not part of a critical cycle. It is possible to increase the slack of this edge by increasing the clock latency of the registers a and c simultaneously. This does not affect the two critical edges (c,a) and (a,c). The result is shown in Figure 7. In this figure the worst incoming slack equals the worst outgoing slack for every subset of the registers.
Note that before, in Figure 6, the worst outgoing slack for the registers a and c together is equal to 5 whereas the worst incoming slack is only 1.

Figure 7: An optimally balanced clock schedule for the clock period T = 10, with latency +2 at registers a and c: The worst incoming slack equals the worst outgoing slack for every subset of the registers.
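The balance property can be checked by brute force on the four-register example. The sketch below is an illustration under an assumption: the figures did not survive extraction, so the endpoints of the edges with delays 5, 6 and 7 are inferred from the slack values quoted for Figures 6 and 7 and may differ from the original drawing. All names are hypothetical.

```python
from itertools import combinations

T = 10  # clock period from the example
REGS = "abcd"

# (launching register, capturing register) -> maximum delay.
# Assumption: endpoints of the 5, 6 and 7 edges are inferred, not taken from the figure.
DELAY = {
    ("a", "c"): 9, ("c", "a"): 9, ("a", "b"): 6, ("c", "d"): 5,
    ("d", "a"): 7, ("d", "c"): 9, ("b", "d"): 9, ("d", "b"): 11,
}


def slacks(lat):
    """Slack of edge (i, j) is T + l_j - l_i - d_ij."""
    return {(i, j): T + lat[j] - lat[i] - d for (i, j), d in DELAY.items()}


def worst_in_out(lat, subset):
    """Worst slack entering and worst slack leaving a set of registers."""
    s = slacks(lat)
    worst_in = min(v for (i, j), v in s.items() if i not in subset and j in subset)
    worst_out = min(v for (i, j), v in s.items() if i in subset and j not in subset)
    return worst_in, worst_out


def unbalanced_subsets(lat):
    """All proper subsets whose worst incoming and outgoing slacks differ."""
    bad = []
    for size in range(1, len(REGS)):
        for subset in combinations(REGS, size):
            worst_in, worst_out = worst_in_out(lat, set(subset))
            if worst_in != worst_out:
                bad.append(subset)
    return bad


schedule_fig6 = {"a": 0, "b": 1, "c": 0, "d": 0}  # only register b delayed
schedule_fig7 = {"a": 2, "b": 1, "c": 2, "d": 0}  # a and c delayed as well
```

Under these assumptions the first schedule is balanced for every single register but not for the pair a and c (worst incoming slack 1, worst outgoing slack 5) or its complement, while the second schedule is balanced for every subset. Enumerating all subsets is exponential, so this check is only practical for toy graphs.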

The clock schedule shown in Figure 2, in which the number of critical paths has decreased so drastically, has exactly this property. It is computationally too expensive to consider all subsets of the registers, because there are exponentially many of them. Nevertheless, the efficient minimum balance algorithm by Young, Tarjan and Orlin [3] can find such a solution by iteratively finding critical cycles and contracting them.

For synthesis operations it is helpful if the side paths of a critical path have additional slack. The slack can be used to reduce the delay of the critical path. An example of such a synthesis operation is Shannon decomposition, shown in Figure 8.

Figure 8: A critical path becomes short and fast using Shannon decomposition.

If only one path starting at a point a and ending at a point x is critical and all other paths ending at x are uncritical, then the fanin logic of x can be duplicated: in one copy the value of a is permanently set to zero and in the other it is set to one. The two outputs of the replicated logic feed a multiplexer that chooses the right value for x depending on the value of a. The constant values for a are propagated to simplify the logic. After this transformation the path from a to x is very short and hence very fast.

Limitations of clock scheduling

Clock scheduling has limitations. Changing the clock latencies may increase the number of hold violations. The hold constraint ensures that data signals do not arrive too early at the data input pin of the register at the end of the path. The signal has to arrive after the register has closed. A high number of hold violations can potentially lead to an enormous number of hold buffers, which need to be added at the end of the flow. Due to process variations the final delay of the paths on the fabricated chip can deviate from the computed delay. This limits the use of clock scheduling further.
For example, it is not possible to have a long combinational path that has a combinational delay equal to ten times the clock period and realize the timing constraints by adjusting the latencies of the clock signals at the launching and receiving registers. On such a combinational path there would be 10 different data signals at the same time. These signals need to arrive at the receiving register at the right time. If the combinational delay of the path were only 10% smaller on the final fabricated chip due to process variations, the signal would arrive too early and this would result in a hold time violation. As the delay could also increase, it is not possible to fix this hold violation by adding additional delay with hold buffers.

Nevertheless, RC can internally use large positive and negative clock latencies and optimize the combinational logic with these latencies. In the end, the latencies are realized by retiming and moving the registers through the combinational logic. The latencies are only bounded by the number and the movement of the registers.
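The arithmetic behind the ten-cycle example can be written down directly. The sketch below assumes zero setup and hold times and zero clock-to-q delay, which the paper does not state, and uses a hypothetical clock period of 1000 ps; the function names are illustrative only.

```python
def timing_ok(delay_ps: int, latency_diff_ps: int, period_ps: int):
    """Check one path for a given latency difference l_j - l_i.

    Setup: l_i + d <= l_j + T.  Hold (zero hold time assumed): l_i + d >= l_j.
    """
    setup_ok = delay_ps <= latency_diff_ps + period_ps
    hold_ok = delay_ps >= latency_diff_ps
    return setup_ok, hold_ok


def feasible_latency_exists(nominal_ps: int, variation_pct: int, period_ps: int) -> bool:
    """Is there a latency difference that is safe for every delay within +/- variation?"""
    d_min = nominal_ps * (100 - variation_pct) // 100
    d_max = nominal_ps * (100 + variation_pct) // 100
    # Setup needs latency_diff >= d_max - T, hold needs latency_diff <= d_min.
    return d_max - period_ps <= d_min


PERIOD = 1000          # hypothetical clock period in ps
NOMINAL = 10 * PERIOD  # path delay of ten clock periods

ten_percent = feasible_latency_exists(NOMINAL, 10, PERIOD)
four_percent = feasible_latency_exists(NOMINAL, 4, PERIOD)
```

With a delay spread of plus or minus 10% the arrival window is two clock periods wide, so no latency difference satisfies setup and hold at the same time; with 4% the window is narrower than one period and a feasible latency exists.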

3 Retiming

Retiming is a powerful sequential optimization technique which overcomes the limitations of clock scheduling. Retiming moves the registers across the combinational logic to improve the performance without changing the input/output behavior of the circuit. Figure 9 shows how the slack of a circuit can be improved by retiming. It is the same circuit to which we applied clock scheduling in Figure 3. The registers are retimed backward against the direction of the signal propagation.

Figure 9: The worst slack is improved by retiming the registers backward against the direction of the signal propagation. (Target clock period: 5 ns. Worst slack before retiming: -2 ns. Worst slack after retiming: 0 ns.)

This example shows that retiming changes the number of registers. In this case, the number of registers increases. However, the number of registers can also decrease. RC minimizes the clock period as a first objective. Among all possible retiming solutions that achieve the minimum clock period, RC finds the solution with the minimum number of registers. In addition, RC has the option to minimize the number of registers without increasing the current clock period.

Any retiming can be achieved by a sequence of two elementary retiming steps: Forward retiming removes the registers at the inputs of a gate and creates new registers at the outputs. Backward retiming does the opposite: It removes the registers at the outputs and creates a new register at each input. The two retiming steps are shown in Figure 10.

Figure 10: Registers retimed forward and backward over an AND gate.

For forward retiming it is necessary that each input of the gate is driven by a register. Similarly, for backward retiming the gate must not drive any combinational gate but only registers.
In order to ensure equivalent input/output behavior of the circuit, retiming cannot change the number of registers on any loop or on any path from a primary input to a primary output. This is guaranteed by the two operations. Of course, it may still be possible to retime registers forward or backward over a gate if the condition on its inputs or outputs does not hold for the original circuit, but the condition then has to be established first by elementary retiming steps applied to other gates.

Constants and dangling logic (logic that does not drive anything) are an exception. Constant propagation as part of the RC synthesis operations simplifies any logic driven by a constant, unless the gates are preserved by an attribute. Similarly, dangling logic is removed. However, should this logic be preserved, retiming is able to create or remove registers at constants and dangling logic.

Figure 11 shows an example in which retiming cannot improve the critical path because no elementary retiming step is possible.

Figure 11: An example in which retiming cannot improve the clock period because the register cannot be moved forward.

Depending on the clk-to-q delay of the register, the critical path goes from the register to the primary output C. If the primary inputs are unconstrained, the critical path starts at the register in any case. Looking only at the slack at the data input pin and the output pin of the register, the user may wonder why the register was not moved forward. This is not possible, because there is no register directly following the primary input B.

Efficient algorithms for retiming have been developed and published. We refer the interested reader to the fundamental paper by Leiserson and Saxe published in 1991 [2], in which the problem of finding a retiming that realizes a given clock period and minimizes the number of registers is formulated and solved as a minimum cost flow problem. Polynomial time algorithms have been developed for this problem. A comprehensive treatment of timing in general, and of clock scheduling and retiming in particular, is the recent book by S. Sapatnekar [5].
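The invariant described in this section is easy to see in the graph model commonly used for retiming, in which every gate v gets an integer lag r(v) and an edge from u to v carrying w registers carries w + r(v) - r(u) registers afterwards. The following sketch illustrates that model on a hypothetical loop of three gates; it is not RC's implementation, and the gate names and delays are made up.

```python
def retime(edges, lag):
    """Register count on each edge after retiming: w + r(v) - r(u)."""
    return {(u, v): w + lag[v] - lag[u] for (u, v), w in edges.items()}


def is_legal(edges):
    """A retiming is legal only if no edge ends up with a negative register count."""
    return all(w >= 0 for w in edges.values())


def clock_period(edges, delay):
    """Longest gate delay sum along edges without registers.

    Assumes every loop contains at least one register (no combinational cycles).
    """
    memo = {}

    def arrival(v):
        if v not in memo:
            memo[v] = delay[v] + max(
                (arrival(u) for (u, x), w in edges.items() if x == v and w == 0),
                default=0,
            )
        return memo[v]

    return max(arrival(v) for v in delay)


# Hypothetical loop A -> B -> C -> A with two registers between A and B.
gate_delay = {"A": 1, "B": 4, "C": 4}
before = {("A", "B"): 2, ("B", "C"): 0, ("C", "A"): 0}

# r(B) = -1 moves one register forward over gate B.
after = retime(before, {"A": 0, "B": -1, "C": 0})
```

In this example one register moves from the input of B to its output, the loop still contains two registers, and the longest register-free path shrinks from 9 to 5 delay units.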
Relationship between clock scheduling and retiming

The two sequential optimization techniques, clock scheduling and retiming, are related: It can be proved that the clock period achievable by clock scheduling (ignoring any hold constraints) is a lower bound on the clock period that can be achieved by retiming [3]. It can also be proved that retiming can almost achieve this clock period: The minimum clock period achievable by retiming is at most the minimum clock period achievable by clock scheduling plus the maximum delay of all gates.

If a clock schedule is given, a retiming can be computed as follows: Find a register with the maximum positive clock latency. Decrease the clock latency until the incoming slack is zero. If the slack is already zero, perform a backward retiming over the gate driving the register. The new registers added in front of the gate get a clock latency equal to the latency of the original registers minus the delay of the gate. This procedure is repeated until the clock latency of each register is smaller than half the delay of the gate driving the register. Then a similar procedure is applied for registers with the minimum negative clock latency. The registers are moved forward and the clock latency is increased by the delay of the gate until the clock latency of each register is larger than the negative value of half the delay of the gate driven by the register. If the clock latency of every register is then set to zero, the retimed circuit has a clock period which is at most the clock period of the original circuit with clock scheduling plus the maximum delay of all gates.
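The two bounds stated in this section can be summarized in one line. The symbols are introduced here only for illustration and do not appear in the original: $T_{\mathrm{cs}}$ is the minimum clock period achievable by clock scheduling without hold constraints, $T_{\mathrm{ret}}$ the minimum clock period achievable by retiming, and $d(g)$ the delay of gate $g$.

```latex
T_{\mathrm{cs}} \;\le\; T_{\mathrm{ret}} \;\le\; T_{\mathrm{cs}} + \max_{g} d(g)
```

The latency bookkeeping in the conversion procedure follows the same notation: a register with latency $l$ that is moved backward over a gate $g$ is replaced by registers with latency $l - d(g)$, and one moved forward over $g$ by registers with latency $l + d(g)$.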

4 The global sequentially driven synthesis flow in RC

RC combines the two sequential optimization techniques, clock scheduling and retiming, in a global sequential synthesis flow shown in Figure 12.

Figure 12: The global sequentially driven synthesis flow in RC. (Sequentially driven synthesis and optimization, consisting of combinational synthesis and clock scheduling, is followed by retiming and then by combinational synthesis.)

The logic synthesis and optimization algorithms are tightly interlinked with clock scheduling. Clock scheduling computes clock latencies which improve the clock period and the slack of the combinational paths. The synthesis algorithms can use the slack of side paths to further improve critical paths. In the next step, retiming moves the registers through the combinational logic. It minimizes the clock period and, as a second objective, minimizes the number of registers. Finally, retiming is followed once more by combinational synthesis. This is necessary because the loads of the gates have changed as the registers were moved.

RC performs these steps automatically. The user only has to set the attribute retime to true for either the top design or the subdesigns for which retiming should be performed and then call the synthesize command.

5 Special cases for retiming

In this section we describe special cases for retiming due to control signals at the registers. The control signals at the registers may constrain the movement of the registers. First we discuss the retiming of registers with enable signals. Then we describe the case when registers with an enable signal are implemented by a simple register with a multiplexer feedback loop. Finally, we discuss asynchronous set and reset signals.

Retiming of registers with different enable signals

In practice, the retiming of the registers can be constrained: The registers in the circuit may have different control signals, for example enable signals. Retiming cannot combine registers which have different control signals.
Figure 13 shows an example. To improve the timing, the two registers should be combined and retimed backward. However, this is not possible because the two registers receive different enable signals. RC can combine and retime registers forward or backward only if they receive the same enable signals.

Figure 13: The two registers cannot be moved backward because they receive different enable signals.

Multiplexer feedback loop

Registers with an enable signal can also be implemented by a simple register and a multiplexer. This may be an advantage for retiming because the registers can then be merged even though the enable signals are different. It may, however, also constrain the register movement and increase the number of registers.

Figure 14 shows that the number of registers can be larger. It is a pipeline design with three stages of registers at the primary outputs. The enable is realized by a multiplexer. When the registers are retimed into the combinational logic (applying only the elementary retiming steps in Figure 10), one register has to remain in each loop with the multiplexer. Furthermore, registers pile up at the select lines of the multiplexer.

Figure 14: Registers with enable can be implemented by a simple register and a multiplexer. This may increase the register count when the registers are moved backward.

If the registers have an enable signal that can be moved with the registers, instead of a loop with a multiplexer, then the number of registers after retiming is smaller. If the registers with the multiplexers are at the primary inputs and have to be moved forward, the problem is different: only the last register can be retimed forward. To retime more registers forward it would be necessary to have additional registers at the select lines of the multiplexers.

By default RC uses registers which have enable logic built into the register. Only if the variable hdl_ff_keep_feedback is true does RC use simple registers which are in a loop with a multiplexer. The results depend on the structure of the design and can differ drastically.
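That the two implementations of an enabled register behave identically can be checked with a small cycle-by-cycle simulation. This is a behavioral sketch with hypothetical function names, not a model of any RC library cell.

```python
def run_enable_register(data, enable, init=0):
    """Register with a built-in enable pin: load the data only when enable is 1."""
    q, outputs = init, []
    for d, en in zip(data, enable):
        if en:
            q = d
        outputs.append(q)
    return outputs


def run_mux_feedback(data, enable, init=0):
    """Plain register fed by a 2:1 multiplexer that selects new data or the old value."""
    q, outputs = init, []
    for d, en in zip(data, enable):
        mux_out = d if en else q  # feedback loop from the register output
        q = mux_out               # register without enable loads every cycle
        outputs.append(q)
    return outputs


stimulus_data = [1, 0, 1, 1, 0, 0, 1]
stimulus_enable = [1, 0, 0, 1, 1, 0, 0]
trace_enable = run_enable_register(stimulus_data, stimulus_enable)
trace_mux = run_mux_feedback(stimulus_data, stimulus_enable)
```

Both traces are identical for any stimulus. The difference that matters for retiming is structural: in the second form the register sits in a loop through the multiplexer, and a loop must keep its register count.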
Retiming of registers with asynchronous set and reset signals

Retiming of registers with asynchronous set or reset signals is more involved. When these registers are retimed forward or backward through the combinational logic it is necessary to compute the new reset values. Moving these registers forward through the combinational logic is simple: The reset values are propagated through the logic. Figure 15 shows an example.

Figure 15: The registers are retimed forward. The reset values are propagated to the registers in the new locations.

Moving registers backward is more complicated. First, all the registers driven by the gate need to have the same reset values. Second, the reset values of the new registers that drive the inputs of the gate are not unique. A naive approach that moves the registers over the gates one gate at a time and arbitrarily chooses any valid reset values does not work: the wrong reset values could be chosen such that later the registers cannot be retimed backward over a gate because the reset values are different. Hence, it is necessary to solve a global problem: what are the required 0/1 reset values for the registers in the new locations such that propagating these values through the logic results in the given reset values at the registers in the original locations? This problem can be transformed into a satisfiability problem. It is very similar to verifying that two netlists are equivalent, in which we ask the question: do 0/1 values exist for the registers and primary inputs such that propagating these values through the logic results in different values at an input of a register or a primary output?

Sometimes no 0/1 reset values exist for the registers in the new locations such that propagating these values forward would result in the given values at the original locations. Figure 16 shows an example. In this case no valid reset values exist if the registers were moved further backward. RC can move registers with asynchronous set or reset backward only as far as valid reset values for the registers exist.

Figure 16: It is not possible to find reset values for the registers in the new locations such that propagating these values results in the given values for the registers in the original locations.
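For a handful of signals the reset line justification problem can be solved by plain enumeration, which makes the difference between the forward and the backward direction concrete. The sketch below is a brute-force illustration with hypothetical names; a production flow would use a satisfiability solver, and the unsolvable case shown here is a made-up example, not the circuit of Figure 16.

```python
from itertools import product


def propagate_reset(gate, input_resets):
    """Forward retiming: the new reset value is the gate function of the old ones."""
    return gate(*input_resets)


def justify_reset(gates, num_inputs, required):
    """Backward retiming: every input assignment that reproduces the required
    reset value at the output of each gate fed by the same inputs."""
    return [
        bits
        for bits in product((0, 1), repeat=num_inputs)
        if all(g(*bits) == r for g, r in zip(gates, required))
    ]


def and2(x, y):
    return x & y


def xor2(x, y):
    return x ^ y


forward_value = propagate_reset(and2, (1, 0))
unique_choice = justify_reset([and2], 2, [1])
several_choices = justify_reset([and2], 2, [0])
# Hypothetical unsolvable case: an AND and an XOR share both inputs and
# the registers at both outputs reset to 1.
no_choice = justify_reset([and2, xor2], 2, [1, 1])
```

Forward propagation always yields exactly one value. Going backward over the AND gate gives one solution for a reset value of 1 and three for a reset value of 0, and the shared-input case has none, which is the situation in which the registers cannot be moved further backward.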
If all the registers that retiming needs to merge and move either forward or backward receive equivalent control signals and if also the reset line justification problem is solvable, then retiming is more powerful than clock scheduling. It is possible to have extremely long combinational paths that have a delay as large as several times the clock period. If there are sufficient registers at the beginning or end of the paths, retiming can move these registers into the combinational logic and still achieve the target clock period. Earlier we had seen that clock scheduling is limited because hold constraints need to be considered. If the delays of the paths as well as the variations of the path delays are too large, it is at some point impossible to realize the hold constraints together with the setup constraints. Retiming may increase the number of registers. This is the only drawback. For some designs the increase can be significant. However, RC can also decrease the number of registers. Usually for larger designs that have only one critical part, RC can improve the clock period as well as decrease the number of registers: In the uncritical parts the locations of the registers are very flexible and hence the registers can be moved and possibly merged.

6 An automated verification flow

Retiming used to pose fundamental hurdles for equivalence checking. Proving that two netlists are equivalent if one netlist was generated from the other through combinational synthesis as well as through retiming is a problem of enormous complexity. To address these verification challenges RC writes out checkpoint files (Verilog netlists) that describe the design at a particular stage. When retiming is used, RC can write out the checkpoint files before and after retiming as shown in Figure 17.

Figure 17: The automated synthesis and verification flow with checkpoint files generated by RC and read by LEC. (RC reads the RTL, performs combinational synthesis, writes a pre-retiming checkpoint netlist, performs retiming, writes a post-retiming checkpoint netlist, performs combinational synthesis, and writes the final netlist. LEC runs equivalence check 1 (combinational) between the initial RTL and the pre-retiming checkpoint netlist, equivalence check 2 (retiming) between the pre-retiming and post-retiming checkpoint netlists, and equivalence check 3 (combinational) between the post-retiming checkpoint netlist and the final netlist.)

Along with each checkpoint file, RC also generates a corresponding dofile, a command script used by Conformal Logic Equivalence Checker (LEC). Equivalence between the RTL and the final netlist is established through a series of verification steps which compare the initial RTL with the first checkpoint file, checkpoint file to checkpoint file, and the last checkpoint file to the final netlist. The appropriate dofile sets up the verification of the corresponding stages as shown in the diagram. Conformal verifies the equivalence under the assumption that either only combinational synthesis operations were performed or only the registers were moved by retiming operations.

7 Case study: Retiming for a UWB baseband chip from Focus Enhancements

As a case study we describe how retiming in RC was used by Focus Semiconductor, a division of Focus Enhancements, for the dual-phy UWB baseband chip MADRAS.
This chip supports a proprietary Focus (Turbo) mode and a WiMedia mode which is compliant with the Multiband OFDM Alliance (MBOA). The Focus mode is more powerful than the MBOA mode: The ratio of the bandwidth versus the distance is about 2x greater. The chip is designed in a 0.13 µm TSMC CMOS process technology with an analog front end. It has about 4 million transistors which correspond to approximately 1.5 million instances.

The Synchronization Module has a three-stage hierarchical datapath implementation. Each stage is composed of a finite impulse response (FIR) filter which required datapath optimization support from RC. The Synchronization Peak Finder Module contains a divider which is used to normalize the synchronization threshold. Enough pipeline registers were added at the inputs and outputs of the block. RC then rebalances the combinational paths by retiming the registers into the combinational logic.

The Coarse Equalization Module consists of a Media Access Controller (MAC) and scratchpad memory. Retiming was also used for this module. Pipeline registers were added at the primary inputs and outputs and retiming automatically moved these registers into the logic and rebalanced the delay of the combinational paths. The Fine Equalization and the Tracking Module use a similar MAC and memory that made the use of retiming for these modules necessary.

A top-down sequential synthesis flow with retiming

The design consists of a 600K instance top level block FPT which was synthesized top-down. The retime attribute was set on 16 submodules corresponding to about 45% of the total logic and 49% of the registers. The following table shows all the modules for which the retime attribute was set to true in the automatic synthesize retime flow.

Columns: subdesign, gates, PIs, POs, number of registers (before, after, change), clock period in ps (before, after, change)

block_1 51, ,589 2, % 12,908 3, %
block_2 13, ,766 2, % 13,119 3, %
block_3 28, ,283 6, % 6,583 3, %
block_4 2, % 6,724 3, %
block_5 17, % 5,489 3, %
block_6 8, % 9,044 4, %
block_7-a 7, ,269 1, % 5,484 3, %
block_7-b 7, ,269 1, % 5,484 3, %
block_7-c 7, ,269 1, % 5,451 3, %
block_7-d 7, ,269 1, % 5,446 3, %
block_7-e 7, ,269 1, % 5,465 3, %
block_7-f 7, ,269 1, % 5,459 3, %
block_8 7, ,088 1, % 8,421 5, %
block_9 28, ,500 1, % 12,291 5, %
block_10 18, ,862 3, % 9,195 4, %
block_11 88,925 1,683 1,700 6,694 5, % 5,212 4, %
Average 19, ,081 2, % (1) 7,611 3, % (2)

(1) percentage change of the average number of registers before and after retiming
(2) average of the percentage change of the clock period before and after retiming

The table shows the number of combinational gates, the number of primary inputs (PIs), and the number of primary outputs (POs). The next three columns show the number of registers before and after retiming and the percentage change.
The last three columns show the clock period in picoseconds before and after retiming and the percentage change. The table shows that retiming can either increase or decrease the number of registers. Overall, the number of registers decreases by 0.6%. The clock period always improves. For many of the subdesigns, a large decrease in the clock period is expected because pipeline registers were added at either the primary inputs or the primary outputs.
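The two footnotes of the table describe different aggregates: (1) is the percentage change of the averages, while (2) is the average of the per-block percentage changes. The following sketch shows how the two can diverge. The numbers are made up for illustration and are not taken from the table, and the function names are hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[float, float]  # (value before retiming, value after retiming)


def pct_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`."""
    return 100.0 * (after - before) / before


def change_of_averages(pairs: List[Pair]) -> float:
    """Footnote (1) style: average the columns first, then take the change."""
    avg_before = sum(b for b, _ in pairs) / len(pairs)
    avg_after = sum(a for _, a in pairs) / len(pairs)
    return pct_change(avg_before, avg_after)


def average_of_changes(pairs: List[Pair]) -> float:
    """Footnote (2) style: take the change per block, then average."""
    return sum(pct_change(b, a) for b, a in pairs) / len(pairs)


# Hypothetical blocks: a large one that shrinks, a small one that grows.
rows = [(1000.0, 900.0), (100.0, 120.0)]
print(round(change_of_averages(rows), 2))  # -7.27
print(round(average_of_changes(rows), 2))  # 5.0
```

The first aggregate is dominated by large blocks, whereas the second weights every block equally, so the two can differ in size and even in sign.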

Conclusion

With increasing demands for faster designs and shorter time-to-market, it is important for designers to look for efficient optimization techniques. Retiming in Encounter RTL Compiler is a powerful technique that can achieve substantial improvements in performance. In this paper we have described how RTL Compiler uses clock scheduling in a sequentially driven synthesis flow and then performs retiming to minimize the clock period and the number of registers. We have discussed special cases of retiming: registers with enable signals, registers with a multiplexer feedback loop, and registers with asynchronous set and reset signals. With RTL Compiler it is easy to perform retiming, and the direct link to Conformal Logic Equivalence Checking provides a complete verification solution.

Appendix: Encounter RTL Compiler commands for retiming

Automatic synthesis with retiming

It is easy to use retiming in RC: only the attribute retime needs to be set to true for the design or subdesign which should be retimed. Then during synthesis the design or subdesign is processed automatically by the sequentially driven synthesis flow with retiming as described in Section 4.

    set_attr retime true [subdesign]
    synthesize -to_mapped

Manual retiming flow

This flow can be used when a specific module or modules need to be retimed. It can be used as an exploratory tool to see the impact that retiming has on a subdesign in a mapped design. The first step, retime -prepare, prepares the design for retiming, and retime -min_delay performs the actual retiming. Even though retime -min_delay performs a local mapping of the logic immediately around the flops, it is recommended to follow it with an incremental synthesis or, preferably, a global synthesis, depending on the granularity of the changes.

    retime -prepare [subdesign | design]
    retime -min_delay [subdesign | design]
    synthesize -to_mapped [-incr]

Manual retiming flow minimizing the number of registers

This flow explicitly tries to minimize the number of registers and thus the area. It should be used only for a design which has positive slack.

    synthesize -to_mapped
    retime -min_area [subdesign | design]
    synthesize -to_mapped [-incr]

Attributes

    set_attr dont_retime true [flop]
        Do not retime the register specified.

    set_attr retime_hard_region true [subdesign]
        Retiming cannot move registers into or out of the subdesign.

    set_attr boundary_opto false [subdesign]
        Disable boundary optimization (constant propagation and rewiring of equivalent signals across hierarchy) and preserve the input and output pins of a subdesign. This enables easier ECO for the blocks and might be necessary for formal verification.

    set_attr retime_async_reset true
        Enable retiming on flops with asynchronous set or reset signals. The runtime may increase if registers need to be moved backward. By default, registers with asynchronous set or reset signals are excluded from retiming.

    set_attr retime_optimize_reset true
        If this attribute is used in combination with the previous attribute, the reset logic is optimized by replacing asynchronous flops with simple flops wherever possible.

For more information refer to the Encounter RTL Compiler User Guide, chapter 9, Retiming the Design.

Interface to Conformal Logic Equivalence Checker (LEC)

The checkpoint files of the automatic verification flow described in Section 6 and the corresponding dofiles for LEC are generated by RC if the checkpoint attributes are set as shown below.

    set_attribute checkpoint_flow true
    set_attribute library my_library.lib
    read my_design.v
    elaborate
    set_attribute checkpoint_netlist_naming_style \
        my_chk_dir/chk_%d.v /designs/my_top
    set_attribute checkpoint_dofile_naming_style \
        my_chk_dir/chk_%d_to_chk_%d.do /designs/my_top
    read_sdc my_constraints.sdc
    set_attr retime true my_top
    synthesize -to_mapped
    write -m > final.v
    write_do_lec -revised final.v > final.do

To run LEC:

    lec -ultra -Dofile hdl_to_chk_01.do
    lec -ultra -Dofile chk_01_to_chk_02.do
    lec -ultra -Dofile final.do

For more information refer to the document Interfacing between RTL Compiler and Conformal.
