Experiments in the Iterative Application of Resynthesis and Retiming

Soha Hassoun and Carl Ebeling
Department of Computer Science and Engineering
University of Washington, Seattle, WA
{soha,ebeling}@cs.washington.edu

July, 1997
Submitted for TAU97

Abstract

Many attempts have been made to combine some form of retiming with combinational optimization techniques to improve the performance of sequential circuits. To achieve improvements, registers are shifted to expose different and sometimes larger combinational blocks for resynthesis. A simple, yet unexplored, sequential optimization consists of iteratively applying retiming and resynthesis. Retiming changes the boundaries of the combinational blocks in the circuit, and resynthesis changes the delays. It is only logical to repeat this procedure until the circuit performance converges. In this paper we evaluate a set of experiments based on the iterative application of resynthesis and retiming. For a set of the original MCNC benchmark circuits, we show that with a total of 18 alternating resynthesis and retiming steps, we obtain an average of 37% reduction in the clock period at an area increase of 5%. This improvement translates to an 18% further reduction in clock period over a single step of synthesis followed by one step of retiming. We consider the effects of (a) varying the number of iterations, (b) starting with an initial retiming step instead of a resynthesis step, and (c) retiming without register minimization. We also describe the limitations of the iterative procedure in improving the performance of circuits with single-register cycles.

Contact Author: Soha Hassoun, Department of Computer Science and Engineering, Box 352350, University of Washington, Seattle, WA 98195-2350. Fax: (206)543-2969. Phone: (206)543-5143. e-mail: soha@cs.washington.edu

1 Introduction

There has been considerable research into making retiming [9] a more practical algorithm. There are now efficient implementations for retiming to achieve the minimum clock period [16], and to achieve the minimum number of registers under clock period constraints [12]. Other recent research in retiming includes incorporating more accurate delay models [8], and dealing with state equivalence with the original circuit [4]. There is also recent work on using retiming at a higher abstraction level [14], and earlier in the design process [11]. Significant progress has also been made for level-clocked circuits [10, 7].

The concept of retiming has also been instrumental in using combinational optimization techniques for the optimization of sequential circuits at the structural level. It is intuitive to try to use some form of retiming to change the boundaries of combinational stages, and then apply existing combinational optimization techniques to different, and sometimes larger, combinational blocks. The register movements may be large, such as in peripheral retiming [13], which attempts to expose the whole circuit for combinational resynthesis. The register movement could also be restricted to a portion of the circuit. For example, Dey et al. identify subcircuits with equal-weight reconvergent paths (petals) to which peripheral retiming can be applied [3]. Precomputation-based architectural retiming, where a signal is computed one cycle earlier using the combinational logic in the previous pipeline stage, is another technique that involves retiming with logic duplication of a portion of the circuit [5, 6].

Sequential optimization techniques based on more fine-grain register movements include applying local algebraic transformations across register boundaries [2], and implicit retiming, where a restricted form of precomputation is applied to specific kernels in the circuit [1]. The more common use of retiming in conjunction with synthesis, however, has been to perform sequential optimization and technology mapping, and then follow that with retiming to achieve the desired performance.

The experiments presented here are based on a simple, yet unexplored, sequential optimization technique consisting of iteratively applying retiming and resynthesis. Retiming changes the boundaries of the combinational blocks in the circuit while resynthesis changes the delays. It is only logical to repeat this procedure until the circuit performance converges.

In this paper we investigate the effectiveness of the iterative application of retiming and resynthesis. In Section 2 we describe the details of the experiment. In Section 3 we present the results, and analyze the limitations of this technique. We then examine some variations of the experiment. In Section 4 we vary the number of retiming and synthesis steps and show that our example circuits cannot further benefit from the process. In Section 5 we consider some secondary effects on the iterative process. More specifically, we consider the effects of starting the iterative process with a retiming step instead of a synthesis step, and also the effects of retiming without minimizing the number of registers. We conclude with some comments on the run times of the experiments, and a discussion of challenges in sequential optimization.

Circuit   # inputs   # outputs   # nodes   # registers
mm4a      7          4           35        12
s208.1    10         1           104       8
s27       4          1           10        3
s298      3          6           119       14
s349      9          11          161       15
s386      7          7           159       6
s400      3          6           162       21
s444      3          6           181       21
s641      20         14          379       19
s713      20         13          393       19

Table 1: Circuits from the MCNC benchmarks used in the experiments.

2 Experiment

Our set of benchmark circuits was selected from the MCNC sequential multilevel circuits. The circuits were available in blif format, which can be read by the Berkeley SIS logic optimization program [15]. The reference, or initial, circuit in each experiment was obtained by first decomposing all the nodes into 2-input OR gates (tech_decomp -o 2), and then performing technology mapping to respect the driving load assuming all arrival and required times are zero (map -n 1 -AFG -p). We used the lib2.genlib and lib2_latch.genlib libraries for mapping in the initial circuit and also in reporting the results of the experiments.

In the experiment, there is a resynthesis step and a retiming step. The resynthesis step was performed by first running the SIS script script.rugged, which is recommended for area minimization. This script was followed by script.delay without the redundancy removal command (see footnote 1); script.delay synthesizes a circuit for a final implementation that is optimal with respect to speed. In the retiming step, we minimized the clock period while minimizing the number of registers (retime -mi).

The experiment we conducted was straightforward. We alternated resynthesis and retiming steps, beginning with a resynthesis step. The goal of this experiment was to measure the effectiveness of the iterative procedure in minimizing the clock period while also minimizing the register and logic area. The initial circuit is labeled as step number 0, and each subsequent step is labeled accordingly. We performed a total of 30 steps, that is, 15 resynthesis steps interleaved with 15 retiming steps.

Table 1 summarizes the circuits we used in our experiments. We list the number of primary inputs, primary outputs, logic functions (.names in blif format), and registers. From the LGsynth91 sequential multilevel circuits, we report the results of those that completed at least 6 iteration steps for this experiment and for its variations, which are presented in Section 4 and in Section 5.

3 Experimental Results

We first summarize the results of the experiment. We then examine more closely the results for one of the circuits, and comment on the results in general.

Footnote 1: SIS was unable to perform redundancy removal reliably.
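The alternating procedure described in Section 2 can be sketched as a simple loop. The following Python fragment is only an illustration under stated assumptions: a circuit is reduced to a hypothetical `(clock_period, area)` pair, and `resynthesize` and `retime` are placeholder callables standing in for the SIS resynthesis scripts and the retiming command; none of these names come from the paper or from SIS.

```python
from typing import Callable, List, Tuple

# Hypothetical circuit summary: (clock_period, area). A real flow would
# carry a netlist and invoke a synthesis tool instead.
Circuit = Tuple[float, float]


def iterate_resynthesis_retiming(
    reference: Circuit,
    resynthesize: Callable[[Circuit], Circuit],
    retime: Callable[[Circuit], Circuit],
    total_steps: int = 30,
) -> Tuple[int, Circuit, List[Circuit]]:
    """Alternate resynthesis (odd steps) and retiming (even steps).

    Step 0 is the reference circuit. Returns the step number with the
    minimum clock period, the circuit at that step, and the full history.
    """
    history: List[Circuit] = [reference]
    for step in range(1, total_steps + 1):
        transform = resynthesize if step % 2 == 1 else retime
        history.append(transform(history[-1]))
    # min() returns the earliest step on ties, so the reference circuit
    # (step 0) is reported when no later step improves on it.
    best_step = min(range(len(history)), key=lambda s: history[s][0])
    return best_step, history[best_step], history
```

Reporting the earliest step with the minimum clock period mirrors the way the results tables list a single step number per circuit; stopping early once the clock period no longer improves would be a natural variation.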

          Reference circuit    resy-ret step,             Iterative procedure,       Iterative procedure,
                               normalized to reference    normalized to reference    normalized to resy-ret
Circuit   Tclk     Area        Tclk     Area              Tclk     Area              Tclk     Area              It
mm4a      28.50    410176      1        2                 1        0.79              075      1.28              30
s208.1    16       138272      3        0.93              0        0                 0.97     8                 0
s27       7.44     30160       2        0.92              2        0.92              0        0                 2
s298      11.14    223648      0.91     9                 1        1.32              7        1.49              16
s349      13.22    238496      0.98     1.38              7        1.73              9        1.26              12 (a)
s386      15.83    262624      3        9                 6        8                 0.79     0.99              27
s400      15.23    309488      0.73     4                 6        1.16              0.77     1.12              12
s444      15.94    322944      0.73     0.97              5        1.65              0.75     1.70              24
s641      23       341504      2        4                 0.71     1.13              7        8                 7 (b)
s713      23       348464      4        0.99              0.72     9                 5        1.10              8 (c)

(a) Maximum iteration completed is 14
(b) Maximum iteration completed is 10
(c) Maximum iteration completed is 10

Table 2: Results of applying the iterative resynthesis and retiming on some MCNC benchmarks.

In Table 2 we report the changes in the clock period and area. The leftmost column lists the names of the circuits. The next two columns report the clock period (in time units) and area for the reference circuit. The next two columns, labeled resy-ret, report the clock period and area, both normalized to the reference circuit, after one step of resynthesis and retiming, which is typical of what has been used so far in sequential logic optimization. The next two columns report the minimum clock period and the corresponding area obtained through the iterative procedure. They are both normalized to the clock period and area of the reference circuit. The following two columns present the same information, but normalized to the results of the resy-ret step. The final column lists the number of the step during which the minimum clock period was achieved. Section 4 provides detailed geometric mean calculations based on the number of iterations completed.

Figure 1 shows the results of our experiment applied to circuit s386. The performance improvement here is close to the average results for the test circuits. The lowest clock period, however, is obtained later in the iteration process when compared to the other circuits.

Figure 1(a) shows the clock period normalized to the clock period of the initial circuit. The leftmost column thus has a clock period of 1. In the next step, the clock period is reduced by synthesis. The following retiming step further reduces the clock period. In the rest of the steps, it is typical that resynthesis increases the clock period, while retiming reduces it. The resynthesis step is supposed to optimize for delay. However, since no target clock period is specified and the reduce_depth command in the delay script restructures a technology-independent network rather than a mapped circuit, it is typical that the delay of the circuit is increased. The trend may have been different had we used the technology-dependent command speed_up instead. The resynthesis step then changes the maximum delay of the circuit; the following retiming step places the registers optimally in the circuit.

The minimum clock period for s386 over the 30 iteration steps occurs in iteration 27. Notice, however, that most of the improvement occurs earlier in the process. In iteration 15, for example, the normalized clock period is 0.675, and it is 0.657 for iteration 27, a difference of 1.80%. Notice also that retiming does not move any registers in the last few iterations. In this case, it is the synthesis step that changes the delays, as the circuit is unmapped, resynthesized, and mapped again with each iteration step.

Figure 1: Circuit s386. All numbers are normalized to the reference circuit. (a) Clock period vs. iteration number. (b) Total area vs. iteration number. (c) Logic and register areas vs. iteration number. (d) Area vs. clock period.

Figure 1(b) presents the corresponding normalized area for circuit s386. The total area is reduced by 23% after the first synthesis step, and the following retiming step increases the area to achieve a smaller clock period. This was typical behavior for all the circuits we examined, as we were optimizing for performance and not for area. After the first two iterations, however, the change in area was unpredictable.

The top half of Figure 1(c) displays the logic area of circuit s386 for each iteration step, while the bottom half displays the register area. Both are normalized to the reference circuit. The normalized logic area corresponds to 158 gates from the library, and the normalized register area is due to 6 registers. With each iteration, the logic area does not increase beyond that of the reference circuit. The number of registers, however, more than doubles in iterations 2, 3, 6, 7, and 8. Notice that the logic area does not change with each retiming step, as expected; however, the number of registers sometimes decreases with logic minimization. For example, the number of registers decreases by one after resynthesis in iterations 9 and 11.

Figure 1(d) plots each point we were able to obtain in the design space. The label next to each point is the step number. The reference circuit is at (1, 1). The circuit resulting from resynthesis is labeled 1. It has the minimum area. Points labeled 27 and 28 correspond to the circuit that had the minimum clock period over the 30 iterations.

Now that we understand the iterative process better, let us take a closer look at the results table and explain the more atypical results. Circuit s208.1 initially runs at 10.66 time units. After the first synthesis step, the minimum clock period is 11.44, and then it drops to 11.00 after the first retiming step. During the next iterations, synthesis slightly changes the delay, but retiming does not move any registers. This is due to a single-register cycle, whose delay imposes a lower bound on the minimum clock period. The normalized clock period after one resynthesis and retiming step is greater than one. The table entry suggesting that the iterative procedure does better than a single step of resynthesis and retiming (a normalized clock period of 0.97) therefore does not indicate that the iterative procedure improved the circuit. It simply reflects that we report the circuit with the minimum clock period, which happened to be the reference circuit; the reference circuit has a shorter clock period than the one obtained via a single resynthesis and retiming step.

Circuit s27 also has a single-register cycle. Unlike s208.1, however, s27 benefits significantly from the first resynthesis (33% reduction in clock period) and also from the first retiming step (another 15% reduction). Circuit s27 reaches a minimum clock period after a single step of resynthesis and a step of retiming. The iterative procedure cannot further improve the performance.

We are certainly not discouraged by the results for circuits s27 and s208.1. They simply point to the ineffectiveness of retiming, and thus of the iterative procedure, in optimizing circuits dominated by single-register cycles. If the critical path in the circuit, the one that limits performance, is a single-register cycle, then retiming cannot further improve the performance of the circuit, and thus does not change the position of the register along that cycle. Techniques that change the combinational stage boundaries might then be more suitable than the iterative procedure for optimizing the performance of circuits with single-register cycles [2, 13, 3, 6, 1].

4 Varying the Number of Iterations

In Table 2 we reported the circuit with the minimum clock period obtained through a total of 30 retiming and resynthesis steps. As we noticed for circuit s386 in Figure 1(a), most of the benefits occur earlier in the iterative process. We must then examine the relation between the number of iterations and the reduction in clock period to determine when the iterative procedure has exhausted its potential.
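One way to decide when the iterative procedure has exhausted its potential is to track the best clock period seen so far at each step and note the last step that improved it. The sketch below is a minimal illustration of that bookkeeping, assuming the per-step clock periods are already available as a list indexed by step number; the function name is hypothetical and not taken from the paper.

```python
from typing import List, Tuple


def running_best(clock_periods: List[float]) -> Tuple[List[float], int]:
    """Return the best-so-far clock period per step and the last improving step.

    clock_periods[0] is the reference circuit (step 0).
    """
    best_so_far: List[float] = []
    last_improvement = 0
    best = float("inf")
    for step, period in enumerate(clock_periods):
        if period < best:
            # A strictly smaller clock period counts as an improvement.
            best = period
            last_improvement = step
        best_so_far.append(best)
    return best_so_far, last_improvement
```

Evaluating the best-so-far value at a few checkpoints (for example after one, three, and more resynthesis-retiming iterations) gives the kind of curve discussed in this section.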

          Basic experiment               Additional retiming step first     No reg minimization during retiming
Circuit   Tclk   logic   reg     It      Tclk   logic   reg     It         Tclk   logic   reg     It
                 area    area                   area    area                      area    area
mm4a      1      0.75    8       30      5      2       0       2          1      0.91    1.58    15
s208.1    0      0       0       0       2      0       1.88    1          0.94   1.10    2.50    4
s27       2      6       0       2       0      0.94    1.33    2          2      6       0       2
s298      1      0.95    2.21    16      8      1.46    3.07    23         3      1.37    5.21    18
s349      7      1.43    2.47    12 (a)  1      2       2.40    5 (b)      0.74   2.12    6.13    20
s386      6      0.77    1.83    27      4      0.90    0       15         0.70   2       3.50    12
s400      6      6       1.38    12      2      1.20    1.57    7          3      1.72    4.00    10
s444      5      1.27    2.52    24      3      2       1.62    7 (c)      9      2.37    5.81    28
s641      0.71   1.14    1.11    7 (d)   0.70   0.99    0.95    5 (e)      4      1.29    1.53    23
s713      0.72   1       1.32    8 (f)   0.75   0.97    1.16    5 (g)      6      8       1.16    9

(a) Maximum iteration completed is 14
(b) Maximum iteration completed is 24
(c) Maximum iteration completed is 30
(d) Maximum iteration completed is 10
(e) Maximum iteration completed is 11
(f) Maximum iteration completed is 10
(g) Maximum iteration completed is 7

Table 3: Results of applying iterative resynthesis and retiming with an additional retiming step and without register minimization during retiming. All the numbers are normalized to the reference circuit.

Figure 2: The effects of varying the number of iterations. (a) The normalized clock period for our test circuits vs. the step number. (b) The normalized area vs. the step number. Each plot shows circuits mm4a, s208.1, s27, s298, s349, s386, s400, s444, s641, and s713, together with the geometric mean.

The graph in Figure 2(a) plots the normalized best clock period that can be obtained at the end of a single resynthesis-retiming iteration (step 2), at the end of three iterations (step 6), and every three iterations thereafter. We also plot the geometric mean at each of the evaluation points. The average reduction in clock period after one synthesis-retiming step is then 19%. After three resynthesis-retiming steps, it is 28%, and so on. The clock period does not improve further after step 18 for any circuit except mm4a. After 18 steps, the geometric mean indicates a 37% reduction in clock period. In other words, the iterative procedure can improve an optimized circuit by an additional 18%, for the seven circuits that completed 9 iterations of resynthesis-retiming.

The graph in Figure 2(b) plots the corresponding normalized area. The area does not change significantly after step 12 except for circuit s444. The area change after 18 steps ranged between -25% and +40%, which is within the range of change observed after a single synthesis/retiming step. The geometric mean indicates an average increase of 5%.

5 Secondary Effects

The most important factor that affects iterative resynthesis and retiming is the number of steps used in the procedure. In this section we examine two other factors that may influence the results. We examine the effects on circuit s386 for each factor, and then present the results for the rest of the circuits.

5.1 Retiming Step First

In our original experiment we chose to begin our iterative procedure with a synthesis step. How would the results have been modified had we begun with a retiming step? The left three graphs in Figure 3(a,c,e) illustrate the clock period, the area, the logic area, and the register area, all normalized to the reference circuit. There are a couple of minor differences between the graphs here and the ones for the original experiment. First, the minimum clock period is achieved at an earlier time step. Second, there is more variation in the register area.

5.2 No Register Minimization

We examined the effect of retiming without register minimization on the test circuits. The right three graphs in Figure 3(b,d,f) illustrate the clock period, the area, the logic area, and the register area, all normalized to the reference circuit. While there is a reasonable 6% reduction in the clock period when compared to the minimum clock period obtained using the original setup, there is a large difference in the total area. Upon closer examination, it is the area of the registers that grows rapidly with each iteration, while the area of the logic remains about the same.

5.3 Results Summary

Table 3 reports the normalized clock period and the corresponding logic area, register area, and step number during which the circuit with the smallest clock period was found. All the results are normalized to the reference circuit. Figure 4 summarizes the results of our experiment and its variations. We plot the geometric mean of the smallest clock period at the end of steps 2, 6, 8, 12, 18, 24, and 30 for the original experiment and for the experiment that had retiming without register minimization. We also plot the geometric mean at the end of iterations 1, 3, 9, 13, 19, 25, and 31 for the experiment with the retiming step first.
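The geometric-mean summary used for these plots can be sketched as follows. This is an illustrative fragment, not the authors' tooling: the data layout (a dictionary mapping each circuit name to its best normalized clock period at each checkpoint step) and both function names are assumptions made for the example. Circuits that did not reach a checkpoint are simply left out of that checkpoint's mean, which is one plausible reading of how incomplete runs were handled.

```python
import math
from typing import Dict, Iterable, List


def geometric_mean(values: List[float]) -> float:
    """Geometric mean of positive values, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


def mean_at_checkpoints(
    best_by_circuit: Dict[str, Dict[int, float]],
    checkpoints: Iterable[int],
) -> Dict[int, float]:
    """Geometric mean across circuits at each checkpoint step.

    best_by_circuit maps a circuit name to {step: best normalized clock
    period up to that step}. Circuits without an entry for a checkpoint
    (for example, runs that did not complete that many steps) are skipped.
    """
    result: Dict[int, float] = {}
    for step in checkpoints:
        values = [per_step[step] for per_step in best_by_circuit.values()
                  if step in per_step]
        if values:
            result[step] = geometric_mean(values)
    return result
```

A geometric mean is the usual choice for averaging normalized ratios, since it treats a halving and a doubling of the clock period symmetrically.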

[Figure 3: plots of normalized clock period, normalized total area, normalized logic area, and normalized register area.]