Cluster-Based Architecture, Timing-Driven Packing and Timing-Driven Placement for FPGAs

by Alexander R. Marquardt

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science, Department of Electrical and Computer Engineering, University of Toronto

Copyright by Alexander Ronald Marquardt, 1999

Abstract

As process geometries shrink into the deep-submicron region, interconnect resistance and capacitance account for an increasingly significant portion of the delay of circuits implemented in Field-Programmable Gate Arrays (FPGAs). One way to improve FPGA speed is to employ logic-cluster-based architectures, which have high-speed local connections among groups of logic elements. In this work we show what size of logic cluster results in the best area-speed trade-off. To obtain the best choices for a cluster-based architecture, we use computer-aided design (CAD) tools to experimentally evaluate architectures with logic clusters of different sizes. As part of this CAD flow, we develop a timing-driven algorithm that packs logic elements into these clusters. In addition, we develop a timing-driven placement algorithm that results in significant improvements in FPGA speed over existing non-timing-driven algorithms.

Acknowledgments

I would like to thank my advisor Jonathan Rose for providing direction, motivation, and advice throughout this work. He has taught me a great deal about FPGA research. I would also like to give thanks to Vaughn Betz. He and I spent many hours discussing FPGA architecture and CAD, and each discussion we had was educational. I would also like to thank the students in Jonathan's research group, Yaska, Jordan, Khalid, Rob, and Paul. Through our weekly meetings, and through other informal meetings, we have shared many ideas. I am grateful to my parents for giving me constant support and encouragement throughout my life and always having faith in me.

Table of Contents

CHAPTER 1 Introduction
    Cluster-Based Logic Blocks
    Timing-Driven Packing
    Timing-Driven Placement
    Thesis Organization

CHAPTER 2 Background and Previous Work
    Overview of FPGA Architecture
    Cluster-Based Logic Blocks
    CAD for FPGAs
    Timing Analysis
    Packing Algorithms for Cluster-Based FPGAs
    The VPack Logic Cluster Packing Tool
    RASP
    Placement
    Simulated Annealing
    The VPR Placement Tool (VPlace)
    Timing-Driven Placement
    TimberWolfSC
    PROXI
    Summary

CHAPTER 3 Timing-Driven and Connection-Driven Packing
    Experimental Methodology
    Timing-Driven Packing: T-VPack
    Timing Analysis and Delay Models
    Timing-Driven Packing Description
    Preliminary Definitions
    Seed Selection and Attraction Function
    Algorithm Analysis
    Computational Complexity
    Connection-Driven Packing: C-VPack
    Attraction Function
    Time Complexity
    Result Quality of T-VPack, C-VPack, and VPack
    Summary

CHAPTER 4 The Effect of Cluster Size on FPGA Speed and Density
    Trade-offs in Cluster-Based FPGAs
    Architecture Modeling
    Area Model
    Delay Model
    Effect of Cluster Size on the Physical Length of FPGA Routing Segments
    Sizing Routing Transistors to Compensate for Different Physical Segment Lengths
    FPGA Architectural Parameters
    Basic Architecture
    Inputs Required vs. Cluster Size
    Routing Architecture
    Flexibility of Logic Block to Routing Interconnect vs. Cluster Size
    Architecture Evaluation Metric: Area-Delay Product
    Speed and Area-Efficiency vs. Cluster Size
    Discussion of Delay vs. Cluster Size Results
    Effect of Cluster Size on Compile Time
    Summary

CHAPTER 5 Timing-Driven Placement
    Introduction
    Timing-Driven Placement: T-VPlace
    Delay Modeling and Cost Function
    Delay Lookup Matrix
    Cost Function
    Algorithm Tuning
    Verification of the Fidelity of the Placement Estimated Critical Path Delay
    Time Complexity
    Results: VPlace vs. T-VPlace
    Summary

CHAPTER 6 Conclusions and Future Work
    Conclusions and Contributions
    Future Work

APPENDIX A MCNC Benchmarks
APPENDIX B VPack and T-VPack Sink Delay Distributions: Size 8 Clusters
APPENDIX C Sink Delay Distributions for the 20 MCNC Benchmark Circuits
    C.1 Placement Estimated Sink Delay Distributions: Size 1 Clusters
    C.2 Low-Stress Sink Delay Distributions: Size 1 Clusters
    C.3 Placement Estimated Sink Delay Distributions: Size 8 Clusters
    C.4 Low-Stress Sink Delay Distributions: Size 8 Clusters

List of Tables

TABLE 3.1 Effects of using tie-breakers, and the recompute timing interval (cluster size = 8)
TABLE 3.2 Comparison of VPack, T-VPack, and C-VPack result quality (cluster size = 8)
TABLE 3.3 Net absorption and inputs used (cluster size = 8)
TABLE 4.1 Important intra-cluster delays in TSMC's 0.35 µm CMOS process
TABLE 4.2 Inputs required for 98% utilization for VPack and T-VPack
TABLE 4.3 Routing area vs. F_c,input for various cluster sizes
TABLE 5.1 Effect of re-timing-analysis in the outer loop
TABLE 5.2 Effect of re-timing-analysis in the inner loop
TABLE 5.3 Effect of Criticality_Exponent with a λ value of
TABLE 5.4 Effect of Criticality_Exponent with a λ value of
TABLE 5.5 Effect of λ with an adaptive Criticality_Exponent of
TABLE 5.6 Post-place-and-route comparison of VPlace and T-VPlace (cluster size = 1)
TABLE 5.7 Post-place-and-route comparison of VPlace and T-VPlace (cluster size = 8)
TABLE 5.8 Post-place-and-route comparison with Xilinx-like architecture (cluster size = 4)
TABLE A.1 MCNC benchmark circuits

List of Figures

FIGURE 1.1 Example logic cluster containing two LUTs [Betz99]
FIGURE 2.1 A generic FPGA [Brow92]
FIGURE 2.2 Logic cluster and basic logic element (BLE)
FIGURE 2.3 CAD flow
FIGURE 2.4 Packing example
FIGURE 2.5 Pseudo-code for VPack [Betz98b, Betz99]
FIGURE 2.6 Pseudo-code of a generic Simulated Annealing-based placer [Betz98b, Betz99]
FIGURE 3.1 Architecture evaluation CAD flow [Betz98b, Betz99]
FIGURE 3.2 Pseudo-code for T-VPack
FIGURE 3.3 Determining BaseBLECrit from connection criticalities
FIGURE 3.4 Example of first criticality tie-breaker
FIGURE 3.5 Example of second criticality tie-breaker
FIGURE 3.6 Post place and route T-VPack alpha trade-off curves
FIGURE 3.7 Post place and route C-VPack alpha trade-off curves
FIGURE 3.8 Why reducing the number of nets in a circuit is good
FIGURE 4.1 Structure and speed paths of a logic cluster
FIGURE 4.2 Effect of cluster size on physical length of routing segments
FIGURE 4.3 Effect of cluster size on tile length
FIGURE 4.4 Inputs required for 98% utilization vs. cluster size
FIGURE 4.5 FPGA with length 4 segments, 50% buffered and 50% pass transistor switches
FIGURE 4.6 Total area vs. cluster size
FIGURE 4.7 Area components vs. cluster size
FIGURE 4.8 Critical path delay vs. cluster size
FIGURE 4.9 Area-delay product vs. cluster size
FIGURE 4.10 Inter-cluster and intra-cluster nets on the critical path
FIGURE 4.11 Breakdown of critical path delay into inter-cluster and intra-cluster components
FIGURE 4.12 Decrease in logical Manhattan distance as cluster size increases
FIGURE 4.13 Variation of circuit compile time with logic cluster size
FIGURE 5.1 Pseudo-code for T-VPlace
FIGURE 5.2 Graph showing fidelity of placement estimated critical path

CHAPTER 1 Introduction

Field-Programmable Gate Arrays (FPGAs) have become one of the most popular implementation media for digital circuits, and since their introduction in 1984, FPGAs have become a multibillion dollar industry. The key to the success of FPGAs is their programmability, which allows any circuit to be instantly realized by appropriately programming an FPGA. FPGAs have some compelling advantages over Standard Cells or Mask-Programmed Gate Arrays (MPGAs): faster time-to-market, lower non-recurring engineering (NRE) costs, and easier debugging. Additionally, FPGAs offer designers the ability to fix errors or to add features to systems that have already been manufactured. FPGAs are also useful for implementing designs that are low volume or are required immediately, since they do not require extensive manufacturing like Standard Cells or MPGAs.

The benefits offered by FPGAs come at a price: FPGAs are at least three times slower, and require at least ten times the area of MPGAs [Brow92]. This loss in speed is mainly due to the fact that logic in FPGAs is connected via programmable switches, while in Standard Cells or MPGAs, logic is directly connected with metal wires. The programmable switches in FPGAs have high resistance and capacitance compared to the metal wiring in Standard Cells or MPGAs, and therefore reduce circuit speed. Interconnect delay accounts for a larger proportion of circuit delay in FPGAs than it does in MPGAs or Standard Cells, and consequently it is more important to minimize interconnect delay in FPGAs.

Another important factor affecting circuit delay is the process used in the manufacture of an FPGA. As process geometries shrink into the deep-submicron region, interconnect (see footnote 1) resistance and capacitance become increasingly significant: smaller processes that improve logic speed do not yield similar improvements in interconnect speed. As a result, interconnect delay accounts for an increasing proportion of total circuit delay with each process shrink, and it must be minimized to achieve the best possible circuit performance.

The quality of the computer-aided design (CAD) tools used to map circuits into an FPGA and the quality of the FPGA architecture can have a significant impact on the FPGA's performance. Because interconnect delay is an increasingly important factor in the overall performance of an FPGA, it is crucial that FPGA CAD tools and FPGA architectures minimize this delay. Our research focuses on the following two areas:

1. Exploring FPGA logic block architectures to minimize interconnect delay, and
2. Developing CAD tools that minimize interconnect delay.

It is important that FPGA architecture and CAD be studied in concert, since architectural features must be properly utilized by CAD tools to be of any benefit, and CAD tool enhancements cannot be properly evaluated without a good architecture.

In this thesis, we are concerned with improving FPGA performance without sacrificing large amounts of area. To accomplish this we investigate three promising aspects of FPGA architecture and CAD: logic-cluster-based FPGA architectures, timing-driven packing, and timing-driven placement. These three areas are described in the following sections.

Footnote 1. Interconnect is the wiring and switches that connect logic elements.

FIGURE 1.1 Example logic cluster containing two LUTs [Betz99] (labels: logic cluster inputs, local interconnect (X-Bar), two BLEs, logic cluster outputs)

1.1 Cluster-Based Logic Blocks

An important factor affecting the area and speed of an FPGA is the logic block (logic cluster) architecture used within the FPGA. In general a logic cluster consists of one or more basic logic elements (BLEs) connected by fast local interconnect [Betz98b, Betz99], where the BLE (described fully in Section 2.1.1) that we use consists of a 4-LUT and a register. Figure 1.1 shows an example logic cluster consisting of two BLEs and local interconnect.

The size of the logic cluster (the number of BLEs it contains) used in an FPGA architecture can have a dramatic effect on its area and performance. Previous work [Betz98b] demonstrated the effect of cluster size on area efficiency. Also, in [Betz98b] it was speculated that circuit speed would improve as cluster size is increased. As cluster size is increased, two things happen:

1. More critical path connections are able to use the fast local interconnect rather than slow inter-cluster (between cluster) interconnect, but this local interconnect becomes slower.
2. More connections are completely absorbed within clusters so less inter-cluster routing is required (reducing area), but the local interconnect area per cluster grows quadratically (increasing area).

We are concerned with determining the effect of logic cluster size on circuit speed as well as area, and with finding what size of logic cluster has the best area-delay trade-off. To our knowledge no work has been done which simultaneously investigates logic clusters with respect to both area and speed.

1.2 Timing-Driven Packing

To fairly evaluate logic clusters of different sizes with respect to speed, it is important that the CAD tools take advantage of the fast local interconnect within the clusters in order to minimize the critical path delay. A packing algorithm selects how BLEs in a circuit are to be mapped into logic clusters, while a timing-driven packing algorithm attempts to map BLEs along the critical path into the fewest possible clusters so that many critical path connections use fast local interconnect. We give a more formal definition of packing in Chapter 2.

1.3 Timing-Driven Placement

Placement involves selecting the coordinates in the FPGA at which each logic cluster will be located. A timing-driven placement algorithm attempts to map logic clusters that are on the critical path into physical locations that are close together so as to minimize the amount of interconnect through which the critical signal must travel. Previous work [Betz99, Betz98b] has done a good job of considering timing during routing, but it did not consider timing during placement. While there is evidence that timing-driven placement improves speed for standard cells, there has been no clear quantification of how large the improvement is for FPGAs. A goal of this work is to determine what improvements can be obtained with timing-driven placement. Placement is formally defined in Chapter 2.

1.4 Thesis Organization

This thesis is organized as follows. Chapter 2 describes FPGA architecture and CAD, and gives an overview of existing CAD tools. Chapter 3 introduces a new timing-driven logic block packing algorithm. Chapter 4 describes architecture experiments that evaluate different size logic clusters with respect to area and speed. Chapter 5 describes a new timing-driven placement algorithm. Finally, in Chapter 6 we present our conclusions and suggestions for future work.


CHAPTER 2 Background and Previous Work

In this chapter, we first give an overview of FPGA architecture with a focus on logic block architecture. After this we discuss the CAD flow used to map circuits into FPGAs, including an introduction to timing analysis and a detailed review of logic cluster packing, placement, and timing-driven placement.

2.1 Overview of FPGA Architecture

In general, an FPGA consists of logic blocks, I/O blocks, and programmable routing as shown in Figure 2.1. To implement a circuit in an FPGA, each of the logic blocks in the FPGA is appropriately programmed to perform a small portion of the functionality of the desired circuit, and each of the I/O blocks is programmed to be an input pad or an output pad as required by the circuit. These functional portions and I/Os are then connected through the programmable routing. The logic block used in an FPGA can have a significant impact on the performance of the FPGA, and since we are interested in determining the effects and trade-offs of cluster-based logic blocks, we describe cluster-based logic blocks below.

FIGURE 2.1 A generic FPGA [Brow92] (labels: logic block, I/O block, programmable routing)

2.1.1 Cluster-Based Logic Blocks

We are interested in studying logic blocks that consist of a grouping of basic logic elements (BLEs) connected with fast local interconnect. In general, a BLE is a small indivisible unit combining sequential and combinational logic; the BLE that we study consists of a 4-LUT and a flip-flop as shown in Figure 2.2-b. A logic block combining many BLEs is known as a logic cluster [Betz99, Betz98b]. Examples of logic clusters are the Logic Array Blocks used in Altera's FLEX 6K, FLEX 8K, and FLEX 10K parts [Alte98a], as well as the Configurable Logic Blocks used in the Xilinx 5200 [Xili97] and Virtex [Xili98] parts.

Figure 2.2-a shows the structure of a logic cluster that consists of one or more BLEs and the routing required to connect them together. The clusters that we study are fully connected, meaning that any BLE input can connect to any cluster input or any BLE output. Since the cluster is fully connected, it is possible to bring a net into the cluster on a single cluster input and route this net to many BLEs within the cluster via the local routing. This allows the number of nets brought into the cluster (number of cluster inputs

used) to be less than the total number of BLE inputs within the cluster. Another benefit of fully connected clusters is that CAD tools are simplified since all BLEs within the cluster are logically equivalent.

FIGURE 2.2 Logic cluster and basic logic element (BLE). (a) Logic cluster: I inputs and a clock feed the local routing (X-Bar), which drives BLE #1 through BLE #N, giving N outputs. (b) Basic logic element (BLE): a 4-LUT followed by a clocked DFF.

A logic cluster consisting of BLEs is described with the following four parameters [Betz99, Betz98b]:

1. The size of (number of inputs to) a LUT, K,
2. The number of BLEs in a cluster, N,
3. The number of inputs to the cluster for use as inputs by the LUTs, I, and
4. The number of clock inputs to a cluster (for use by the registers), M_clk.

The work of [Betz99, Betz98b] focused on logic clusters in which the LUT size, K, is 4 and the number of clock pins on a cluster, M_clk, is 1; this is the case shown in Figure 2.2. The total number of BLE inputs is K × N; however, only I inputs are brought into the cluster. [Betz98b] showed that a good rule of thumb (see footnote 1) is to design logic clusters with I = 2N + 2. It also showed that FPGAs composed of logic clusters of size 1-10 BLEs (with the exception of size 2) have the best area efficiency. That research did not consider the effect of cluster size on circuit speed; however, it was speculated that larger cluster sizes would have a positive impact on FPGA performance.

Footnote 1. This rule of thumb applies to the case when the LUT size, K, is 4. An interesting direction for future research would be to study the interactions between LUT size, K, the number of inputs to a cluster, I, and the number of BLEs in a cluster, N, and determine the best combination of these parameters.
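As a small worked illustration of the cluster parameters above, the following Python sketch compares the total number of BLE inputs, K × N, with the number of cluster inputs suggested by the I = 2N + 2 rule of thumb. The function name `cluster_input_budget` is hypothetical and introduced only for this example; the rule itself is stated in the text only for K = 4.

```python
def cluster_input_budget(n, k=4):
    """Return (total BLE inputs, rule-of-thumb cluster inputs) for N = n.

    Hypothetical helper: the I = 2N + 2 rule is quoted in the text only
    for 4-input LUTs, so other values of k are outside its stated scope.
    """
    total_ble_inputs = k * n      # K x N inputs across all BLEs
    cluster_inputs = 2 * n + 2    # I = 2N + 2 (rule of thumb for K = 4)
    return total_ble_inputs, cluster_inputs


for n in (1, 4, 8):
    total, inputs = cluster_input_budget(n)
    print(f"N={n}: {total} BLE inputs, {inputs} cluster inputs")
```

For N = 8 this gives 32 BLE inputs served by 18 cluster inputs, which is possible only because nets are shared among BLEs or generated inside the cluster.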

2.2 CAD for FPGAs

Figure 2.3 illustrates the CAD flow that is used to evaluate FPGA architectures and CAD algorithms. This CAD flow mirrors the actual CAD flow employed by FPGA and ASIC designers. Each circuit we use is logic-optimized by SIS [Sent92] and then technology-mapped into 4-LUTs by FlowMap [Cong94]. VPack [Betz98b] is then used to group the LUTs and registers into logic clusters (see footnote 1) of the desired size. Next, we use VPR [Betz98b, Betz99] to place (determine the x, y position of each cluster in the FPGA) and route (connect the wires) each circuit. VPR's timing-driven router extracts the Elmore delay [Elmo48] of each routed net, and performs a path-based timing analysis to determine the delay of the circuit's critical path. Finally, VPR uses a transistor-based area model [Betz98b, Betz99] to estimate the total layout area required by this FPGA to implement each circuit.

FIGURE 2.3 CAD flow (circuit, logic optimization, technology mapping to 4-LUTs, packing BLEs into logic clusters of size N, placement, routing, timing and area results)

Footnote 1. Following the convention of [Betz98b], our CAD flow shows packing and placement as two separate steps. After packing, we treat a logic cluster as an indivisible unit which is then placed. This division is not always necessary (depending on the CAD flow used), but we impose it in order to simplify the CAD tools. Another approach would be to eliminate packing, and allow the placement algorithm to move LUTs and registers freely between different clusters. This approach to placement would considerably increase the computational complexity of the placement algorithm, but would likely produce better results.

In this section we first describe how timing analysis is used to evaluate a circuit's speed, and how it guides timing-driven algorithms. Then we discuss two packing algorithms, VPack and RASP. After this we discuss placement, give an overview of Simulated Annealing and VPR's placement tool, and discuss several timing-driven placement approaches.

Timing Analysis

Timing analysis [Hitc83] has two main purposes:

1. To determine the final maximum speed that a circuit implementation can achieve.
2. To determine the delay of all the paths and connections in a circuit during placement and routing, and use these as a guide to reduce the total circuit delay.

To perform a timing analysis, we must first represent the circuit as a directed graph. Nodes in the graph represent input and output pins of circuit elements such as LUTs, registers, and I/O pads. Connections (see footnote 1) between these nodes are modeled with edges in the graph. These edges are annotated with a delay corresponding to the physical delay between the nodes. To determine the delay of the circuit, a breadth-first traversal is performed on the graph starting at the sources (input pads and register outputs). We then compute the arrival time, T_arrival, at all nodes in the circuit with the following equation:

$$T_{arrival}(i) = \max_{j \in fanin(i)} \{ T_{arrival}(j) + delay(j, i) \} \tag{2.1}$$

where i is the node whose arrival time is being computed, and delay(j, i) is the delay value of the edge joining node j to node i. The delay of the circuit, D_max, is then the maximum arrival time over all nodes in the circuit.

Footnote 1. In a graph representation of the circuit we define a connection to be an edge between a net driver and any of its terminals.
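The arrival-time computation of Equation 2.1 can be sketched in Python as follows. This is a minimal illustration rather than the implementation used in VPR: the graph encoding (a dictionary from edge `(j, i)` to delay), the function name `arrival_times`, and the node names in the example are all assumptions made for this sketch. Nodes with no fan-in are treated as sources with an arrival time of zero.

```python
from collections import defaultdict, deque


def arrival_times(edges):
    """Forward pass of Equation 2.1 on a DAG given as {(j, i): delay}.

    Returns (t_arrival, d_max). Nodes without fan-in are sources.
    """
    fanout = defaultdict(list)
    pending = defaultdict(int)          # unprocessed fan-in count per node
    nodes = set()
    for (j, i), delay in edges.items():
        fanout[j].append((i, delay))
        pending[i] += 1
        nodes.update((j, i))

    t_arrival = {n: 0.0 for n in nodes}
    ready = deque(n for n in nodes if pending[n] == 0)
    while ready:                        # breadth-first, in topological order
        j = ready.popleft()
        for i, delay in fanout[j]:
            t_arrival[i] = max(t_arrival[i], t_arrival[j] + delay)
            pending[i] -= 1
            if pending[i] == 0:         # all fan-in arrival times are final
                ready.append(i)
    return t_arrival, max(t_arrival.values())


# Hypothetical four-node circuit: two input pads, one LUT, one output pad.
example = {
    ("in1", "lut_a"): 1.0,
    ("in2", "lut_a"): 2.0,
    ("lut_a", "out"): 1.5,
    ("in2", "out"): 0.5,
}
times, d_max = arrival_times(example)
print(times["lut_a"], d_max)
```

A node is only expanded once all of its fan-in has been processed, so each edge is visited exactly once.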

To guide a placement or routing algorithm, it is useful to know how much delay may be added to a connection before the path that the connection is on becomes critical. The amount of delay that may be added to a connection before it becomes critical is called the slack [Hitc83] of that connection. To compute the slack of a connection, we must compute the required arrival time, T_required, at every node in the circuit. We first set T_required at all sinks (output pads and register inputs) to be D_max. Required arrival time is then propagated backwards starting from the sinks with the following equation:

$$T_{required}(i) = \min_{j \in fanout(i)} \{ T_{required}(j) - delay(i, j) \} \tag{2.2}$$

Finally, the slack of a connection from node i to node j is defined as:

$$Slack(i, j) = T_{required}(j) - T_{arrival}(i) - delay(i, j) \tag{2.3}$$

Packing Algorithms for Cluster-Based FPGAs

A packing algorithm takes a netlist consisting of LUTs and registers and produces a netlist consisting of logic clusters. This involves combining the LUTs and registers into BLEs, and then grouping the BLEs into logic clusters (Figure 2.4). There are two main constraints that packing algorithms must meet:

1. The number of BLEs in a cluster must be less than or equal to the cluster size, N.
2. The number of distinct inputs generated outside the cluster and used as inputs to BLEs within the cluster must be less than or equal to the number of cluster inputs, I.
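Returning briefly to timing analysis, the required-time and slack definitions of Equations 2.2 and 2.3 can be illustrated with a short Python sketch. It is an illustration under stated assumptions, not the thesis's code: the edge-dictionary encoding, the function name `connection_slacks`, and the example node names are hypothetical, delays are assumed non-negative, and every node without fan-out is treated as a sink.

```python
def connection_slacks(edges):
    """Slack of every connection in a DAG given as {(i, j): delay}."""
    nodes = {n for edge in edges for n in edge}
    fanin = {n: [] for n in nodes}
    fanout = {n: [] for n in nodes}
    for (i, j), delay in edges.items():
        fanout[i].append((j, delay))
        fanin[j].append((i, delay))

    # Topological order: start at sources, release a node once all of its
    # fan-in has been ordered (the list grows while it is being scanned).
    remaining = {n: len(fanin[n]) for n in nodes}
    order = [n for n in nodes if remaining[n] == 0]
    for n in order:
        for j, _ in fanout[n]:
            remaining[j] -= 1
            if remaining[j] == 0:
                order.append(j)

    # Forward pass (Equation 2.1).
    t_arrival = {n: 0.0 for n in nodes}
    for n in order:
        for i, delay in fanin[n]:
            t_arrival[n] = max(t_arrival[n], t_arrival[i] + delay)
    d_max = max(t_arrival.values())

    # Backward pass (Equation 2.2): sinks keep T_required = D_max.
    t_required = {n: d_max for n in nodes}
    for n in reversed(order):
        for j, delay in fanout[n]:
            t_required[n] = min(t_required[n], t_required[j] - delay)

    # Equation 2.3.
    return {(i, j): t_required[j] - t_arrival[i] - delay
            for (i, j), delay in edges.items()}


example = {
    ("in1", "lut_a"): 1.0,
    ("in2", "lut_a"): 2.0,
    ("lut_a", "out"): 1.5,
    ("in2", "out"): 0.5,
}
for connection, slack in sorted(connection_slacks(example).items()):
    print(connection, slack)
```

Connections on the critical path (here in2 to lut_a to out) come out with zero slack, while the others show how much delay they could absorb before becoming critical.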

FIGURE 2.4 Packing example (a netlist of BLEs A through H is packed into a netlist of clusters)

Altera has an in-house tool [Alte95] that targets cluster-based logic blocks, and Xilinx has an in-house tool targeting the cluster-like logic blocks of the 5200 [Xili97] and Virtex [Xili98] FPGAs; however, to our knowledge, this work has not been made publicly available. In this section we discuss two publicly available packing algorithms, VPack [Betz98b] and RASP [Cong96].

The VPack Logic Cluster Packing Tool

VPack [Betz98b, Betz99] takes a netlist of LUTs and registers, and produces a netlist of logic clusters. All parameters relating to the logic clustering (N, I, K, M_clk) are specified at run-time. VPack first groups LUTs and registers into BLEs, and then packs the BLEs into logic clusters. The pseudo-code for the VPack algorithm is given in Figure 2.5 [Betz98b, Betz99].

The VPack algorithm has two optimization goals. The first is to pack each logic cluster to its capacity to minimize the number of clusters needed. The second goal is to minimize the number of inputs to each cluster in order to reduce the number of connections required between clusters.

Let: UnclusteredBLEs be the set of BLEs not contained in any cluster
     C be the set of BLEs contained in the current cluster
     LogicClusters be the set of clusters (where each cluster is a set of BLEs)

UnclusteredBLEs = PatternMatchToBLEs (LUTs, Registers);
LogicClusters = NULL;
while (UnclusteredBLEs != NULL) {    /* More BLEs to cluster */
    C = GetBLEwithMostUsedInputs (UnclusteredBLEs);
    while (|C| < N) {    /* Cluster is not full */
        BestBLE = MaxAttractionLegalBLE (C, UnclusteredBLEs);
        if (BestBLE == NULL)    /* No BLE can be added to cluster */
            break;
        UnclusteredBLEs = UnclusteredBLEs - BestBLE;
        C = C ∪ BestBLE;
    }
    LogicClusters = LogicClusters ∪ C;
}

FIGURE 2.5 Pseudo-code for VPack [Betz98b, Betz99]

VPack uses a greedy algorithm to construct each cluster sequentially. At the start of each cluster operation, VPack selects as a seed an unclustered BLE with the most used inputs, and then places this seed into a cluster C. Then VPack selects a new BLE, B, to pack into C based on the attraction that B has to C. Attraction is determined by the number of inputs and outputs that B and C have in common:

$$Attraction(B) = | Nets(B) \cap Nets(C) | \tag{2.4}$$

BLEs are added to the current cluster until it cannot fit any more, at which point packing begins on a new cluster. The process terminates when there are no more unclustered BLEs left.
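The greedy procedure of Figure 2.5 and the attraction function of Equation 2.4 can be sketched in runnable Python as follows. This is a simplified illustration, not the VPack source: the netlist encoding (each BLE as a dictionary with an `inputs` set and an `output` net), the function names, and the alphabetical tie-breaking are assumptions made for this example, and clock inputs (M_clk) are ignored.

```python
def ble_nets(ble):
    """All nets attached to a BLE (its inputs plus its output)."""
    return ble["inputs"] | {ble["output"]}


def external_inputs(cluster, bles):
    """Distinct nets used inside the cluster but generated outside it."""
    used = set().union(*(bles[b]["inputs"] for b in cluster))
    produced = {bles[b]["output"] for b in cluster}
    return used - produced


def greedy_pack(bles, n, i_max):
    """Pack BLEs into clusters of at most n BLEs and i_max cluster inputs."""
    unclustered = set(bles)
    clusters = []
    while unclustered:
        # Seed: the unclustered BLE with the most used inputs.
        seed = max(sorted(unclustered), key=lambda b: len(bles[b]["inputs"]))
        cluster = [seed]
        unclustered.remove(seed)
        while len(cluster) < n:
            cluster_nets = set().union(*(ble_nets(bles[b]) for b in cluster))
            best, best_attraction = None, -1
            for b in sorted(unclustered):
                if len(external_inputs(cluster + [b], bles)) > i_max:
                    continue  # adding b would need too many cluster inputs
                attraction = len(ble_nets(bles[b]) & cluster_nets)  # Eq. 2.4
                if attraction > best_attraction:
                    best, best_attraction = b, attraction
            if best is None:
                break  # no legal BLE can be added to this cluster
            cluster.append(best)
            unclustered.remove(best)
        clusters.append(cluster)
    return clusters


# Hypothetical netlist of four BLEs.
netlist = {
    "A": {"inputs": {"n1", "n2"}, "output": "a"},
    "B": {"inputs": {"a", "n3"}, "output": "b"},
    "C": {"inputs": {"n4", "n5", "n6"}, "output": "c"},
    "D": {"inputs": {"c", "n4"}, "output": "d"},
}
print(greedy_pack(netlist, n=2, i_max=4))
```

In this example C is chosen as the first seed because it uses the most inputs, and D joins it because it shares nets c and n4 with the cluster while keeping the external input count within the limit.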

The time complexity of this algorithm is O(k_max · K · n): for each of the n BLEs that is clustered, we must examine all of the nets attached to the BLE (K nets), and for each net we must examine all of the BLEs that it fans out to (maximum fanout k_max). This results in an execution time of about four seconds to pack the largest MCNC circuit (clma) [Yang91] on a 296 MHz UltraSPARC-II processor. (Footnote 1. We give a brief overview of the 20 largest MCNC circuits in Appendix A.)

RASP

In [Cong96] the RASP logic block packing tool is described. This tool is capable of mapping circuits represented as a network of LUTs into several different types of logic blocks. The algorithm uses a closeness cost function to weigh the desirability of mapping LUTs into the same logic block. This closeness cost function can be set up to prefer to minimize delay or area, or to maximize routability. The closeness of two LUTs is marked on an edge in a compatibility graph if it is allowable to pack the two LUTs into one logic block. If the LUTs cannot be packed together (i.e. they violate some hard constraint such as the number of inputs or BLEs allowed) then no edge is put into the compatibility graph. The packing step selects LUTs to pack together by performing a maximum weighted matching on the compatibility graph.

The complexity of this algorithm is O(nm), where n is the number of LUTs and m is the number of edges in the compatibility graph. With the logic blocks used in our research, the number of edges, m, in the compatibility graph is O(n^2), which leads to an algorithm complexity of O(n^3).

Placement

Placement is the process by which a netlist of circuit blocks (I/Os or logic clusters) is mapped into physical locations in an FPGA. The locations that blocks are mapped to can significantly affect the performance of the FPGA. There are three main goals that placement algorithms may attempt to satisfy:

1. To minimize the amount of wiring required, which we refer to as wirelength-driven placement.
2. To balance the wiring density across the FPGA, called routability-driven placement.
3. To minimize the delay of the critical path(s), called timing-driven placement.

Placement algorithms may simultaneously satisfy one or more of these goals. In the remainder of this section we review the Simulated Annealing algorithm that is commonly applied to placement problems. Then we discuss the Simulated Annealing-based placer built into VPR [Betz98b, Betz99], which we call VPlace. After this we review various timing-driven placement approaches.

Simulated Annealing

The Simulated Annealing placement algorithm mimics the annealing process used to gradually cool molten metal to produce high-quality metal structures [Kirk83]. A Simulated Annealing-based placer initially places logic clusters and I/Os (circuit blocks) randomly into physical locations in an FPGA. The placement is then iteratively improved by randomly swapping blocks and evaluating the goodness of each swap with a cost function. If a move would reduce the placement cost, it is accepted. If a move would increase the placement cost, it may still be accepted even though it makes the placement worse. The purpose of accepting some bad moves is to prevent the Simulated Annealing-based placer from becoming trapped in a local minimum. The probability of accepting a bad move is given by e^(-ΔC/T), where ΔC is the positive change in the cost function that acceptance of the move would result in, and T is a parameter called temperature that controls the likelihood of accepting each move. Initially, a Simulated Annealing-based placer starts at a high temperature, so that almost all moves are accepted; the temperature is then gradually reduced so that the probability of accepting moves that make the placement worse becomes very low. In the final stages of placement, only moves that decrease the placement cost are accepted.
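The acceptance rule described above can be made concrete with a short Python sketch. The function name `acceptance_probability` and the sample temperatures are assumptions chosen for illustration; the sketch only evaluates the rule and does not perform any placement.

```python
import math


def acceptance_probability(delta_cost, temperature):
    """Probability that a move with cost change delta_cost is accepted.

    Moves that reduce the cost are always accepted; other moves are
    accepted with probability e^(-delta_cost / temperature).
    Assumes temperature > 0.
    """
    if delta_cost < 0:
        return 1.0
    return math.exp(-delta_cost / temperature)


# The same cost-increasing move (delta_cost = 2) at three temperatures.
for temperature in (100.0, 1.0, 0.01):
    print(temperature, acceptance_probability(2.0, temperature))
```

At the highest temperature the bad move is accepted almost always, while at the lowest temperature its acceptance probability is effectively zero, which matches the behaviour described for the early and final stages of the anneal.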

S = RandomPlacement ();
T = InitialTemperature ();
R_limit = InitialR_limit ();
while (ExitCriterion () == False) {    /* Outer loop */
    while (InnerLoopCriterion () == False) {    /* Inner loop */
        S_new = GenerateViaMove (S, R_limit);
        ΔC = Cost (S_new) - Cost (S);
        if (ΔC < 0) {
            S = S_new;    /* Move is good, accept */
        } else {
            r = random (0, 1);
            if (r < e^(-ΔC/T)) {
                S = S_new;    /* Move is bad, accept anyway */
            }
        }
    }    /* End inner loop */
    T = UpdateTemp ();
    R_limit = UpdateR_limit ();
}    /* End outer loop */

FIGURE 2.6 Pseudo-code of a generic Simulated Annealing-based placer [Betz98b, Betz99]

In the final (low temperature) stages of the placement, if all blocks in the FPGA are considered for swapping, most swaps will be rejected because they result in large positive changes in the cost function. To increase the number of accepted moves at low temperatures, only blocks that are close together should be considered for swapping, since local swaps tend to result in relatively small changes in the placement cost. Accordingly, a Simulated Annealing-based placer uses a parameter called R_limit ("range limiter") that controls how close together circuit blocks must be to be considered for swapping. Initially, R_limit spans the entire FPGA, which means that blocks on opposite sides of the FPGA may be considered for swapping. As the placement proceeds, R_limit is decreased, so that in the final stages of placement, only blocks that are close together are considered for swapping. In Figure 2.6 we show the pseudo-code for a generic Simulated Annealing-based placer, as presented in [Betz98b, Betz99].
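A toy, runnable version of the loop in Figure 2.6 is sketched below in Python. Everything specific in it is an assumption made for illustration and is not taken from VPR: the unweighted bounding-box cost, the starting temperature of 10, the geometric cooling factor of 0.9, the inner-loop length, and the way R_limit shrinks alongside the temperature. Blocks are placed on a square grid with one block per site.

```python
import math
import random


def bounding_box_cost(placement, nets):
    """Sum over nets of horizontal plus vertical bounding-box span."""
    total = 0
    for net in nets:
        xs = [placement[b][0] for b in net]
        ys = [placement[b][1] for b in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total


def move(placement, occupant, block, target):
    """Move block to target, swapping with any block already there."""
    source = placement[block]
    other = occupant.get(target)
    placement[block], occupant[target] = target, block
    if other is None:
        del occupant[source]
    else:
        placement[other], occupant[source] = source, other


def anneal(blocks, nets, grid, seed=0):
    rng = random.Random(seed)
    sites = [(x, y) for x in range(grid) for y in range(grid)]
    rng.shuffle(sites)
    placement = dict(zip(blocks, sites))            # S = RandomPlacement()
    occupant = {site: b for b, site in placement.items()}
    cost = bounding_box_cost(placement, nets)
    temperature, r_limit = 10.0, float(grid)        # assumed starting values
    while temperature > 0.01:                       # outer loop
        for _ in range(20 * len(blocks)):           # inner loop
            block = rng.choice(blocks)
            x, y = placement[block]
            r = max(1, int(r_limit))                # only consider nearby sites
            target = (min(grid - 1, max(0, x + rng.randint(-r, r))),
                      min(grid - 1, max(0, y + rng.randint(-r, r))))
            if target == (x, y):
                continue
            move(placement, occupant, block, target)
            delta = bounding_box_cost(placement, nets) - cost
            if delta < 0 or rng.random() < math.exp(-delta / temperature):
                cost += delta                       # accept the move
            else:
                move(placement, occupant, block, (x, y))   # reject: undo
        temperature *= 0.9                          # assumed cooling schedule
        r_limit = max(1.0, r_limit * 0.9)           # assumed range schedule
    return placement, cost


blocks = ["a", "b", "c", "d", "e", "f"]
nets = [("a", "b"), ("b", "c", "d"), ("d", "e"), ("e", "f", "a")]
final_placement, final_cost = anneal(blocks, nets, grid=4)
print(final_cost, final_placement)
```

Recomputing the full cost after every move keeps the sketch short; a practical placer would instead update the cost incrementally for only the nets attached to the moved blocks.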

The VPR Placement Tool (VPlace)

In this document we refer to the placement algorithm used within VPR as VPlace. VPlace is a Simulated Annealing-based placement algorithm that attempts to minimize the wirelength of the resulting circuit by placing circuit blocks that are on the same net close together. To accomplish this, VPlace uses a bounding-box-based linear congestion cost function [Betz98b, Betz99] to estimate wirelength requirements. The VPlace algorithm follows the format of the pseudo-code shown in Figure 2.6.

The linear congestion cost function has the following functional form [Betz98b, Betz99]:

$$\text{Cost}_{\text{linear congestion}} = \sum_{i=1}^{N_{nets}} q(i) \, [\, bb_x(i) + bb_y(i) \,] \qquad (2.5)$$

where there are $N_{nets}$ nets in the circuit. The cost of each net, $i$, is determined by its horizontal span, $bb_x(i)$, and its vertical span, $bb_y(i)$. The $q(i)$ factor compensates for the fact that the bounding box wire length model underestimates the wiring necessary to connect nets with more than three terminals. The values used for $q(i)$ were obtained from [Chen94]: $q(i)$ is set to 1 for nets with 3 or fewer terminals, and it slowly increases to 2.79 for nets with 50 terminals. Beyond 50 terminals, $q(i)$ increases linearly as

$$q(i) = 2.79 + 0.02616 \, (\text{Num\_Terminals} - 50) \qquad (2.6)$$

The complexity of this algorithm is $O(n^{4/3})$, where $n$ is the number of blocks in the circuit.

Timing-Driven Placement

Placement algorithms that attempt to minimize the critical path delay of the resulting circuits are called timing-driven. There are different approaches to minimizing critical path delay in timing-driven placement algorithms. One approach, which we call path-based timing-driven placement, computes path delays at every stage of the placement and uses these delays in its cost function. This path-based approach is computationally expensive, since path delays must be continuously re-computed.
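For reference, the linear congestion cost of Equation 2.5 used by VPlace can be sketched in a few lines of Python. This is an illustration rather than VPR's code: the function names are hypothetical, the span of a net is assumed to be the difference between its extreme coordinates, and the correction factor q is supplied by the caller because the [Chen94] values are not reproduced here.

```python
def linear_congestion_cost(nets, q):
    """Sum q(n) * (bb_x + bb_y) over all nets (sketch of Equation 2.5).

    nets -- list of nets, each a list of (x, y) terminal locations
    q    -- function mapping a net's terminal count to its correction factor
    """
    total = 0.0
    for terminals in nets:
        xs = [x for x, _ in terminals]
        ys = [y for _, y in terminals]
        # Assumption: span is measured as max - min along each axis.
        bb_x = max(xs) - min(xs)
        bb_y = max(ys) - min(ys)
        total += q(len(terminals)) * (bb_x + bb_y)
    return total


def unit_q(num_terminals):
    """Placeholder correction factor: valid only for nets with <= 3 terminals."""
    return 1.0


nets = [[(0, 0), (2, 3)], [(1, 1), (1, 1), (4, 1)]]
print(linear_congestion_cost(nets, unit_q))  # prints 8.0
```

The first net spans 2 columns and 3 rows and the second spans 3 columns and 0 rows, so with q = 1 the total is 5 + 3 = 8.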
Another approach is connection-based timing-driven placement, which involves performing a path-based timing analysis and assigning a slack to each connection in the circuit. During placement, more attention is then paid to connections with low slack, but the more global view of the complete path delay is not used. It is also possible to combine connection-based and path-based timing-driven placement by periodically performing a full path analysis based on the current placement and then updating the slacks on individual connections.

In this section we discuss the existing timing-driven placement algorithms that are most relevant to our work.

TimberWolfSC

The TimberWolfSC timing-driven placement algorithm for row-based standard cell ICs is presented in [Swar95]. This algorithm uses a Simulated Annealing approach to placement. In this algorithm, net delay is computed as

$$\text{Net Delay} = T_{driver} + R_{driver} \, ( C_{net} + C_{gates} ) \qquad (2.7)$$

where $T_{driver}$ is the intrinsic delay of the driver, $R_{driver}$ is the resistance of the driver, $C_{net}$ is the estimated capacitance of the net, and $C_{gates}$ is the gate input capacitance of all sinks on the net. The arrival time at the sink of a path is the sum of all of the net delays along that path. This formulation of delay assumes that the driver resistance is much larger than the wiring resistance, so that wiring resistance can be ignored. Ignoring wiring resistance likely makes these net delays optimistic, especially for circuits implemented in deep-submicron processes, where wiring resistance and delay are significant.

The cost function used in this algorithm penalizes any path whose arrival time is greater than the required (user-defined) arrival time with the following:

$$\text{Penalty} = T_{arrival} - T_{required} \qquad (2.8)$$

The total timing penalty $P_t$ is the sum of all critical path penalties.
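The net delay model and path penalty of Equations 2.7 and 2.8 can be sketched as follows. This is a minimal illustration, not code from [Swar95]: the function names are hypothetical, and clamping the penalty at zero is this sketch's reading of the statement that only paths arriving later than required are penalized.

```python
def net_delay(t_driver, r_driver, c_net, c_gates):
    """Equation 2.7: lumped delay of one net, ignoring wire resistance."""
    return t_driver + r_driver * (c_net + c_gates)


def path_penalty(net_delays, t_required):
    """Equation 2.8 for one path, applied only when the path is late.

    The arrival time at the path's sink is the sum of the net delays
    along the path. Clamping at zero is an assumption of this sketch.
    """
    t_arrival = sum(net_delays)
    return max(0.0, t_arrival - t_required)


def total_timing_penalty(paths):
    """Sum the penalties of all paths, given (net_delays, t_required) pairs."""
    return sum(path_penalty(delays, req) for delays, req in paths)
```

For example, a path made of two nets with delays 2.5 and 1.5 and a required arrival time of 3.0 contributes a penalty of 1.0, while a path that meets its requirement contributes nothing.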

$$P_t = \sum_{\text{paths}} \text{Penalty} \qquad (2.9)$$

The cost function combines a wire length term, $W$, and the total timing penalty, $P_t$, using a trade-off variable $\lambda$:

$$\text{Cost} = W + \lambda \, P_t \qquad (2.10)$$

The authors of [Swar95] found that setting

$$\lambda = 3 \, \frac{\overline{\Delta W}}{\overline{\Delta P_t}} \qquad (2.11)$$

gave the best results, where $\overline{\Delta W}$ is the average change in wire length and $\overline{\Delta P_t}$ is the average change in the timing penalty, both measured during the first outer loop iteration of a Simulated Annealing algorithm. This implies that changes in the timing penalty are three times as important as changes in the wire length.

The authors presented results for three MCNC standard cell circuits for which timing information was previously available. Compared to the previous results they reduced delay by 28% - 5% at an area cost of between 2.5% and 6%. It is not clear from the paper how the previous timing results were obtained. This algorithm is path-based, so the computational complexity is likely quite high, but it is not revealed in the paper.

PROXI

In [Nag95] a performance-driven simultaneous place and route algorithm (PROXI) is presented. After each placement perturbation in the anneal, a small subset of relevant nets (previously unroutable and newly disturbed nets) is ripped up and rerouted with a fast maze router. As the placement evolves, the critical path is evaluated. The cost function used in this algorithm is

$$\text{Cost} = W_r \cdot R + W_t \cdot T \qquad (2.12)$$

where $R$ is the number of unrouted nets and $T$ is the critical path delay. $W_r$ and $W_t$ are weights that are determined adaptively at runtime to normalize the components of the cost function, so that each term contributes equally.

This algorithm is unique in that it performs placement and routing simultaneously; most place and route software performs placement first and then routes the placed circuit. Performing placement and routing in one stage should theoretically give better results than a two-stage (place then route) algorithm, but it is much more computationally expensive. This algorithm achieves an 8% - 15% improvement in delay when compared to the Xilinx XACT 5.0 place and route system. It has, however, a significant disadvantage in CPU compile time compared to the XACT 5.0 tool, ranging from 6 times for the smallest design (12x12 array) to 11 times for the largest design (16x16 array).

2.3 Summary

In this chapter we presented an overview of FPGA architecture, including a description of cluster-based logic blocks [Betz99, Betz98b]. We then discussed CAD for FPGAs, including timing analysis, packing algorithms, and placement.


CHAPTER 3 Timing-Driven and Connection-Driven Packing

In this chapter we first discuss the experimental methodology that we use to evaluate different CAD algorithms and FPGA architectures. Then we introduce two new packing algorithms that are extensions to the VPack [Betz98b, Betz99] algorithm. The first is a timing-driven packing algorithm that we call T-VPack, and the second is a connection-absorption-driven packing algorithm that we call C-VPack. We then compare the results of both of these algorithms to the results of VPack.

3.1 Experimental Methodology

The CAD flow that we use to evaluate different CAD algorithms and FPGA architectures is the same as in [Betz98b, Betz99], and is given in Figure 3.1. First each circuit is logic-optimized by SIS [Sent92] and technology mapped into 4-LUTs by FlowMap [Cong94]. T-VPack (described in Section 3.2) is then used to group the LUTs and registers into logic clusters of the desired size with the desired number of inputs. Then VPR is used to place and route each circuit. The placement algorithm in VPR is simulated annealing based and optimizes the final placement to minimize the required routing area. The router in VPR is fully timing-driven and attempts to minimize the critical path delay (given the current placement). After placement and routing, we know the estimated area and track width required to implement each circuit and the estimated critical path delay, where area and delay values are computed using the area and delay models described in the next chapter.

FIGURE 3.1 Architecture evaluation CAD flow [Betz98b, Betz99]. Inputs: the circuit, the cluster parameters (N, I, K), and the routing architecture parameters (Fc, etc.). Steps: logic optimization (SIS); technology mapping to 4-LUTs (FlowMap + Flowpack); packing of FFs and LUTs into logic clusters (T-VPack); placement (VPR); routing (VPR, timing-driven router), repeated with adjusted channel capacities (W) until the minimum number of tracks, Wmin, is determined; routing with W = 1.2 Wmin (VPR, timing-driven router); determination of critical path delay and of the transistor area needed to build the FPGA (VPR + TransCount).

Figure 3.1 shows how VPR computes the minimum number of tracks in which a circuit will route, which we refer to as a high-stress routing. VPR repeatedly routes each circuit with different channel widths (number of tracks per channel), scaling the FPGA's architecture until it finds the minimum number of tracks in which the circuit will route. We define a low-stress routing (as does [Swar98a]) to occur when an FPGA has 20% more routing resources than the minimum required to route a given circuit. We feel that low-stress routings are indicative of how an FPGA will generally be used (it is rare that a user will utilize 100% of all routing and logic resources), so many of our delay results are based on low-stress routings. We also present results that are based on an infinite¹ number of routing resources. These infinite routing results tell us the best possible router-achievable speed of a circuit given the current packing and placement of that circuit. We feel that this is a useful indicator of how well a packing or placement algorithm performs with respect to delay.

By allowing the channel width to vary, and searching for the minimum routable width, we can detect small improvements in FPGA architectures or CAD algorithms that might otherwise go unnoticed. Compare this to mapping a circuit into a fixed-size FPGA, which would only tell us whether the circuit fit or not. A binary result like this makes it difficult to draw conclusions about new architectures or CAD algorithms.

1. Infinite routing resource results are delay results from the router when it ignores congestion, i.e. the router is allowed to use a single resource for multiple unrelated connections. This allows the router to allocate the fastest possible resource for every connection in the circuit. See [Betz98b, Betz99] for a detailed description of how the router in VPR works.

3.2 Timing-Driven Packing: T-VPack

Our timing-driven logic block packing algorithm, T-VPack, attempts not only to pack each logic block to capacity and minimize the number of cluster inputs used, but also to minimize the number of inter-cluster (between cluster) connections on the critical path(s). The local routing within clusters is faster than the general-purpose routing between logic clusters, so reducing the number of inter-cluster connections on the critical path(s) reduces circuit delay.

The basic operation of the algorithm is the same as that of the VPack algorithm described in Chapter 2, with a few modifications. We show the pseudo-code for the T-VPack algorithm in Figure 3.2. T-VPack first performs a timing analysis (defined in Section 2.2.1) to determine the critical path(s) of the circuit. Then T-VPack finds a seed BLE by selecting a BLE on the critical path(s), rather than selecting a BLE with the most used inputs. BLEs are then added to the current cluster based on the attraction they have to the current cluster, where the attraction function is modified to prefer to absorb connections along the critical path(s). After each cluster is full, packing begins on a new cluster.

    Let: UnclusteredBLEs be the set of BLEs not contained in any cluster
         C be the set of BLEs contained in the current cluster
         LogicClusters be the set of clusters (where each cluster is a set of BLEs)

    UnclusteredBLEs = PatternMatchToBLEs (LUTs, Registers);
    LogicClusters = NULL;
    ComputeCriticalities ();
    BLEsSinceLastCriticalityRecompute = 0;
    while (UnclusteredBLEs != NULL) {            /* More BLEs to cluster */
        C = GetMostCriticalBLE (UnclusteredBLEs);
        BLEsSinceLastCriticalityRecompute++;
        while (|C| < N) {                        /* Cluster is not full */
            if (BLEsSinceLastCriticalityRecompute >= RecomputeInterval) {
                ComputeCriticalities ();
                BLEsSinceLastCriticalityRecompute = 0;
            }
            BestBLE = MaxAttractionLegalBLE (C, UnclusteredBLEs);
            if (BestBLE == NULL)                 /* No BLE can be added to cluster */
                break;
            UnclusteredBLEs = UnclusteredBLEs - BestBLE;
            C = C ∪ BestBLE;
            BLEsSinceLastCriticalityRecompute++;
        }
        LogicClusters = LogicClusters ∪ C;
    }

FIGURE 3.2 Pseudo-code for T-VPack

In this section we first discuss timing analysis and delay modeling within T-VPack. Then we give details of the algorithm implementation. After this we analyze T-VPack to see the effect of its various parameters. Finally, we analyze the complexity of the algorithm.
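The overall shape of the loop in Figure 3.2 can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the T-VPack implementation: the attraction used here (criticality plus the number of nets shared with the cluster) is a hypothetical stand-in for the attraction function of this chapter, criticalities are supplied once and never recomputed, and the legality check on the number of cluster inputs is omitted.

```python
def pack_greedy(bles, criticality, nets_of, cluster_size):
    """Greedy, criticality-seeded packing in the spirit of Figure 3.2.

    bles         -- iterable of BLE names
    criticality  -- dict: BLE name -> criticality (larger = more critical)
    nets_of      -- dict: BLE name -> set of nets the BLE connects to
    cluster_size -- maximum number of BLEs per cluster (N)
    """
    unclustered = set(bles)
    clusters = []
    while unclustered:
        # Seed each cluster with the most critical unclustered BLE.
        seed = max(sorted(unclustered), key=lambda b: criticality[b])
        unclustered.remove(seed)
        cluster = [seed]
        cluster_nets = set(nets_of[seed])
        while len(cluster) < cluster_size and unclustered:
            # Hypothetical attraction: criticality plus shared-net count.
            best = max(
                sorted(unclustered),
                key=lambda b: criticality[b] + len(nets_of[b] & cluster_nets),
            )
            unclustered.remove(best)
            cluster.append(best)
            cluster_nets |= nets_of[best]
        clusters.append(cluster)
    return clusters
```

Sorting the candidates before taking the maximum only makes tie-breaking deterministic; it is a convenience of the sketch rather than part of the algorithm.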

Timing Analysis and Delay Models

To minimize the number of inter-cluster connections on the critical path(s), T-VPack first needs to determine which connections are on the critical path(s). Accordingly, T-VPack performs a timing analysis to determine the slack of each connection between BLEs. The timing analyzer within T-VPack models three types of delay: the delay through a BLE, or LogicDelay; the connection delay between blocks within the same cluster, or IntraClusterConnectionDelay; and the connection delay between blocks that are in different clusters, or InterClusterConnectionDelay. The delay of a connection between two BLEs in different logic clusters is not known until after a circuit has been placed and routed, so T-VPack approximates the delay between clusters as a constant InterClusterConnectionDelay. Note that this leads to some inaccuracy in T-VPack's estimate of where the critical path(s) lies, so that sometimes T-VPack will attempt to shorten a path that will not be part of the post-place-and-route critical path(s).

The performance of T-VPack is not very sensitive to the exact values chosen for these three delay parameters. Throughout this work we set LogicDelay to 0.1, IntraClusterConnectionDelay to 0.1 and InterClusterConnectionDelay to 1.0.

Note that the timing analysis can be performed as often as the user specifies: a timing analysis can be performed after each BLE is clustered or, at the other end of the spectrum, it may be done once at the beginning of the algorithm execution and never again. The effect of this recompute interval is discussed later in this chapter.

Timing-Driven Packing Description

After a timing analysis is complete, we are able to begin packing. This section describes how we determine which BLE will be selected as a seed for each cluster, and how BLEs to be added to each cluster are selected. We first define several sub-equations that are used in selecting a cluster seed and in the attraction function. After these preliminaries, we present how we select a cluster seed, and our new attraction function.
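Before moving to those definitions, the three-parameter delay model from the Timing Analysis and Delay Models discussion can be illustrated with a small slack calculation. The sketch below uses a conventional arrival-time and required-time formulation, which is an assumption here rather than a reproduction of the analysis in Section 2.2.1; the function and variable names are hypothetical, and any connection whose endpoints are not both in the same cluster is treated as inter-cluster.

```python
LOGIC_DELAY = 0.1          # delay through a BLE
INTRA_CLUSTER_DELAY = 0.1  # connection inside one cluster
INTER_CLUSTER_DELAY = 1.0  # connection between clusters


def connection_slacks(order, edges, cluster_of):
    """Return {(src, dst): slack} for every BLE-to-BLE connection.

    order      -- BLE names in topological order (sources first)
    edges      -- list of (src, dst) connections
    cluster_of -- dict: BLE name -> cluster id (missing = not yet clustered)
    """
    def conn_delay(u, v):
        cu, cv = cluster_of.get(u), cluster_of.get(v)
        same = cu is not None and cu == cv
        return INTRA_CLUSTER_DELAY if same else INTER_CLUSTER_DELAY

    # Forward pass: arrival time at each BLE output.
    arrival = {b: LOGIC_DELAY for b in order}
    for v in order:
        for src, dst in edges:
            if dst == v:
                arrival[v] = max(
                    arrival[v], arrival[src] + conn_delay(src, v) + LOGIC_DELAY
                )
    d_max = max(arrival.values())  # critical path delay

    # Backward pass: latest allowed time at each BLE output.
    required = {b: d_max for b in order}
    for u in reversed(order):
        for src, dst in edges:
            if src == u:
                required[u] = min(
                    required[u], required[dst] - LOGIC_DELAY - conn_delay(u, dst)
                )

    return {
        (u, v): required[v] - LOGIC_DELAY - conn_delay(u, v) - arrival[u]
        for u, v in edges
    }
```

Connections with zero slack lie on the critical path under this model. Placing both ends of such a connection in one cluster replaces a delay of 1.0 with 0.1, which is the effect the packer tries to exploit.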
