Basic Block. Inputs. K input. N outputs. I inputs MUX. Clock. Input Multiplexors

Size: px

Start display at page:

Download "Basic Block. Inputs. K input. N outputs. I inputs MUX. Clock. Input Multiplexors"

Dorcas Lambert
6 years ago
Views:

1 RPack: Rability-Driven packing for cluster-based FPGAs E. Bozorgzadeh S. Ogrenci-Memik M. Sarrafzadeh Computer Science Department Department ofece Computer Science Department UCLA Northwestern University UCLA Los Angeles, CA Evanston, IL Los Angeles, CA Abstract Ring tools consume a signicant portion of the total design time. Considering rability at earlier steps of the CAD ow would both yield better quality and faster design process. In this paper we are presenting a rability-driven clustering method for cluster-based FPGAs. Our method packs LUTs into logic clusters while incorporating rability metrics into a cost function. The objective is to minimize this rability cost function. Our cost function is consistently able to indicate improved rability. Our method yields up to 50 % improvement over existing clustering methods in terms of the number of ring tracks required. The average improvement obtained is 16.5 %. Reduction in number of tracks yields reduced ring area. I. Introduction CAD ow for FPGAs contains four major steps which are logic optimization, technology mapping, placement, and ring. Among those, placement and ring are the most time consuming tasks. Due to highly constrained and discrete interconnect structure of current FPGAs, ring is a challenging problem. Most of the time current FPGA rers can not use available ring resources eciently. This leads to a large portion of the ring area to be wasted. Also, depending on the complexity of the particular design ring might require a fairly large amount of time, often several hours to be completed. Hence considering rability at earlier steps of the CAD ow would both yield a better quality result and less design time. FPGAs consist of smaller congurable building blocks called logic blocks or congurable logic blocks (CLBs), which are placed on the FPGA chip either on a two-dimensional array orinrows. There are two kinds of such logic blocks, LUT based blocks and multiplexor based blocks. LUT based logic blocks are more popular. One LUT based logic block may contain multiple LUTs depending on the type of the FPGA. FPGAs with logic blocks containing multiple LUTs are called cluster-based FPGAs. FPGA vendors have dierent logicblock congurations. The structure and granularity of the logic block has a signicant impact on the area-eciency of the FPGA. If the logic block is ne-grained, e.g. contains a single LUT, the circuit to be implemented will be distributed over more number of logic blocks. This has a negative impact on rability, since more blocks need to be interconnected. If several LUTs are clustered into one logic block, signal sharing among LUTs can be exploited. Since the interconnect inside the logic blocks are hardwired local interconnect can be made very fast and eciently. This improves rability and decreases the load on the rer signicantly. On the other hand, it is not feasible to increase the complexity of the logic blocks beyond a certain limit. If the logic blocks become too complex it becomes dicult to fully utilize these blocks leading to waste of logic, hence chip area. As a result there is a trade-o between the complexity of the logic blocks and the area-eciency. Many commercial FPGAs, such as Xilinx, Altera, and Actel FPGAs all include logic blocks that contain several LUTs. A collection of LUTs that are grouped together to be placed in one complex logic block is called a cluster. At the technology mapping stage logic gates are assigned to LUTs. In the following phase which only applies to cluster-based FPGAs several LUTs are collected in one cluster. The task of assigning LUTs to clusters is called packing. If during packing special properties of the interconnect, such as sharing among the pins can be exploited, signicant gains can be obtained in terms of rability. Rability problem is dicult to deal with at early stages of CAD ow, since there are no accurate means available to estimate the interconnect at this level. This problem has not been considered very extensively at the packing stage before. Clustering phase being at a later stage than technology mapping, can bring improvements on the rability, since after technology mapping a more accurate estimation on the interconnect is available. In this paper we are proposing improvements upon the state-of-theart logic packing algorithm called VPack: Logic Block Packing Algorithm [2]. We are introducing a new method to consider rability at the packing stage. We are presenting the eect of our method on the rability bysynthesizing the benchmark circuits through the complete CAD ow. We have technology mapped a given circuit, then applied our rability-driven packing method for clustering, and nally placed and red the circuit. We are presenting the results of the nal ring and show that our method improves the rability signicantly. This paper is organized in several sections. Previous work on rability-driven technology mapping will be presented in Section II. Also algorithms for cluster packing will be discussed. Section III provides denitions and the FPGA architecture we are targeting. The issues related to rability will be discussed in Section IV. Our rability-driven packing method will be presented in Section V. Section VI presents experimental results. In Section VII conclusions and future work are given. II. Previous Work Packing LUTs into clusters is an important design step introduced for cluster-based FPGAs. It can be viewed as a sub-task within the technology mapping stage. Hence we will rst mention contributions made in the technology mapping area. The majority of research devoted to technology mapping has been done with the objective of improving either timing or area-eciency. Compared

2 IV. Rability I inputs Clock Input Multiplexors (a) Basic Block Basic Block N puts Inputs Clock K input LUT FF (b) 2x1 MUX Fig. 1. (a) Internal Structure of a Logic Cluster (b) Basic Logic Block to the amount of these eorts there is little work in the rability driven technology mapping domain [5], [6]. The rability driven technology mapper for LUT based arrays, Rmap, employs a mapping strategy that considers several objectives aecting rability and tries to balance those goals. Those objectives will be discussed in more detail in Section V, where we will also present further objectives. On clustering there is also relatively little work published. VPack is one of the best known packing tools for FPGAs [2]. VPack rstpacks a ip op and a LUT together into one BLE, basic logic element. Then these BLEs are packed into logic clusters with the optimization objectives being to ll each cluster to its capacity and minimize the number of used inputs to each cluster. In [3] a timing-driven packing tool for FPGAs, T-VPack was also proposed. We are introducing new metrics that are used to form a new objective function to evaluate rability. Wehave taken the algorithm of VPack as a basis as we will describe in later sections and have built our own approach upon it. Our new algorithm indeed improves rability compared to VPack. As we will describe our results in Section VI, we are able to improve te ring area by 16.5% on average. III. Architecture Description The FPGA we are targeting is of island-style structure. It contains a square matrix of logic blocks. Between every row and column ring tracks are located. The structure of the basic logic block is illustrated in Figure 1 (b). It contains a K input LUT and one ip op. In our case K is 4. The inputs to the basic block are the inputs of the LUT and the clock input for the ip op. The put is red through a two input multiplexor. The logic cluster is shown in Figure 1 (a). The two-dimensional array of logic blocks is constructed from these clusters. The cluster size, N, is dened as the number of basic blocks contained in the cluster. The cluster takes I inputs that are connected to the LUTs inside basic blocks. Not all 4N basic block inputs are accesible externally. Only I of these are connected to input multiplexors of the cluster. These input multiplexors allow any of the I inputs to be connected to any ofthe4n basic block inputs. Also any put of N basic blocks can be connected to any basic block input through these multiplexors. The cluster contains N put pins connecting each basic block put to one cluster put. One advantage of cluster-based logic blocks is that many nets can be red locally inside a cluster, which make faster interconnect compared to general-purpose interconnect between logic blocks. Since part of the interconnect is internally red, less number of nets are left to be red among the clusters by the rer. Since most of an FPGA's area and delay is due to ring, it is important to consider the rability of the design in earlier stages. Technology mapping, clustering and placement can aect the rability. By performing actual ring, rability could be incorporated into early CAD stages. However, this is a very slow process. Also the actual ring cost is very insensitive and thus is a poor cost function. Therefore we consider rability in early stages by using predictors of rability. By clustering the LUTs, the number of connections between clusters can be reduced. In other words the rability can be increased. Consequently, the ring area can be improved in terms of number of tracks required in each ring channel and switch area. In FPGAs, since the ring tracks in each channel are xed, ring in a congested channel is hard for an FPGA rer. There are several factors aecting rability atdierent CAD ow stages. In technology mapping and clustering stage the following factors can improve the rability. The rst three factors have been mentioned previously in[6]: Ratio of cells used to total cells available Ratio of pins per cell Ratio of pins per net Ratio of used pins to total number of pins of the logic block Number of nets The rst two factors depend on the architecture. Two clusterbased FPGAs with dierent cluster sizes and dierent numbers of input pins are not equivalent in terms of rability. The impact of dierent logic block architectures on FPGA performance has been studied earlier [1]. Increasing either LUT size, K, or cluster size, N, increases the functionality of the logic block. This can lead to a decrease in the total number of logic blocks needed to implement a circuit on FPGA, but the size of the logic block increases with both K and N. The size of the cluster is quadratic in N [2]. In addition and more important, the ring area depends on K and N. It has been experimentally shown that the best area-delay can be obtained by using cluster with size of between 4 and 10 and LUT sizes of 4 to 6 [4]. Betz in [4] showed that the uniform distribution of pins on the logic block perimeter is better than top/bottom pin architecture in terms of rability and area. The remaining three rability factors are algorithmic and relatively architecture independent. These factors can be considered in technology mapping or packing algorithms. Ratio of pins per net indicates the density of high-fan nets in the circuit. Rers spend the major part of their eort for ring high-fan nets. Hence the goal is to make this ratio low. Ratio of used pins to total number of pins of the logic block indicates the trac in and around a logic block. A block with high connection trac is likely to create a congested area on the chip. Similarly high connectivity between logic blocks can lead to congestion. Number of nets is also closely related to rability. As the number of nets increases the more dicult it is for the rer to complete the ring. Naturally the number of nets depends on the circuit. In a cluster-based FPGA local ring inside clusters is not a concern compared to ring the external nets among clusters. In the technology mapping and packing stages the number of external nets can be reduced by inserting as many nets as possible inside the clusters. In this paper, we are focusing on the last three factors and we will actually show that they can aect rability signicantly.

3 Ring Area ( 2 * M * M * W ) Our Cost Function Ring Area tseng diffeq bigkey alu4 dsip s298 frisc benchmarks spla ex1010 s tseng ex5p apex4 s298 apex2 elliptic dsip s38584 spla pdc Advanced Method Cost Simple Method Cost Advanced Method Simple Method Fig. 2. Rability Cost after Packing V. RPack:Rability-Driven Packing for Cluster-Based FPGAs We introduce a new method to pack the logic blocks into clusters while considering rability by satisfying the last three factors mentioned in Section IV In order to satisfy these factors we are using the following objective function: RabilityCost = X Nets N i i (1) where N i denotes the number of nets with i terminals and i is the rability weight of nets with i terminals. A net with more number of terminals is harder to re. The rability weight, i, reects the degree of diculty to re a specic net with i terminals. Studies on wirelength estimation, [7], suggest that the ratio of imin imax should be chosen such that, as the number of terminals increases i should increase. On the other hand, for high-fan nets the rability does not vary signicantly. Therefore the dierence of their corresponding i should be small. Due to this reason the range of values that i can take is restricted. In our approach we dene i to be i =2; 1 i : (2) Our objective function is a weighted sum of nets considering the rability of nets with dierent fan. This objective function reects both the eect of the number of nets and also the nature of the nets exposed to the rer. Our goal is to minimize this objective function. In Figure 2 two approaches for packing are compared and the values of the cost function are presented for dierent benchmark circuits. One approach is a simple packer, and the other approach is a sophisticated packing method. From the graph it can be seen that the cost function dierentiates between the two approaches and takes larger values in the case of the simple method. In Figure 3 the corresponding ring area resulting from both approaches is presented. The ring area is measured in terms of number of ring tracks. We are proposing a packing method that aims to minimize the cost function introduced. The input to our packing algorithm, is a list of LUTs, registers, and connections among the resources. In RPack, similar to VPack [2,4],in the rst stage, a LUT and a register are packed into a basic logic block when possible. After that, the blocks are packed into clusters using a heuristic approach since clustering is an NP-hard problem. In this approach, clusters are constructed sequentially. After choosing a seed for a cluster being constructed, the logic block which gives the highest gain is selected to be added to the current cluster provided that it is a Fig. 3. Ring Results of Simple Method and Advanced Method. M M is array size,wisthenumber of tracks in each ring channel. N4 N1 in0 in1 in2 in3 in4 N2 Basic Logic Block B g ( N1, C, B ) =1 g ( N2, C, B ) = 2 N3 Cluster C being built g ( N3, C, B ) = 2 g ( N4, C, B ) = Gain ( B, C ) = 4 Fig. 4. A Cluster Being Constructed and a Candidate Block, K =4,I=5 legal choice. This means that the number of external inputs do not exceed the number of input pins of the cluster. The algorithm continues adding blocks into one cluster until the cluster is full or no more legal choices can be found. Similarly new clusters are constructed until all the blocks are packed into clusters. The best choice for the seed for each cluster is the unclustered logic block with the most used inputs as mentioned in [2, 4]. Assume that the gain computed for a candidate block B, is the number of inputs and puts it has in common with current cluster C as in [2] Gain(B) =j Nets(B) \ Nets(C) j (3) In Figure 4, a candidate basic block B and cluster C being constructed are shown. According to the denition of the gain function in Equation 3, nets N1, N2, and, N3 have the same contribution to the gain of block B to be added to the cluster C. If these nets are observed closely, their contributions to rability gain of the block are slightly dierent. Block B has three common nets with cluster C: N1, N2 and N3. An input pin of the net N1 is inside the cluster C. By adding the block Btothecluster, another input terminal of the net N1 would be inside the cluster. This leads to a decrease in the number of the terminals of the multi-terminal net N1. The gain obtained by moving B to cluster C corresponding to net N1 using Equation 3 is 1. In fact, the gain

4 function originates from the third rability factor discussed in Section IV, i.e. reducing pins per net. Table 1.Gain of a Candidate Block According to a Single Net pin in K! status in-pin edge put total pin in C gain gain congestion gain! in ! in all in pins in C in! in in! in! rest of pins already in C new net in new net The driving pin of net N2 is the put of the logic block B. There is an input pin of net N2 inside the cluster C. If the put terminal of a net is inside the cluster, internal connections can be used to connect the input pins of the net which are located inside the cluster. In such a case, there is no need to use an input pin of the cluster to connect the net N2 to other terminals of the net side the cluster, since an put pin of the cluster can be used for external interconnection. According to the reason above, the contribution of net N2 to block gainis more than just covering an edge of a multi terminal net. Actually, by adding block Bto cluster C, an input pin of the cluster gets free and can be used for another net connection. In Table 1 this is dened as in-pin gain. This increases the probability of acceptance of adding the block to the cluster. In other words, the probability of violating the input constraints of the cluster decreases. Note that each blockput pin is accessible from side and there is no sharing among the put pins. Therefore there would be no put pin constraints for the clusters. According to this observation, saving on put pins does not bring any gain except in one case. Suppose all the input pins of a net are already inside the cluster and the logic block being added to a cluster contains the put pin of the net. Net N3 in Figure 4 is an example of such a case. The put pin of the cluster corresponding to the block driving the net N3 cannot be used by other blocks. This means that there would be no connection from side to this pin since all the terminals of the corresponding net is located inside the cluster. Therefore the number of external connections of the cluster decreases, dened as put congestion gain in Table 1. This yields less congestion among the clusters. In other words, it reduces the number of used pins of a cluster which is the fourth rability factor. Net N4 has no pin in the cluster. The gain of moving logic block B to the cluster would be zero according to the gain function above. However, not only no edge from N4 would be covered, but also one input pin of the cluster would be used for N4. So the gain of moving logic block B to the cluster due to N4 in terms of used pins per cluster is -1. This means N4 has a degrading eect on the rability according to the fourth rability factor. As explained above, by considering just the number of shared inputs and puts as in Equation 3, the packing algorithm cannot dierentiate among the candidate blocks which have dierent impact on rability. All the dierent cases yielding dierent total gains are presented in the table below for one net connected to a candidate block. By incorporating the other rability factors, the gain for each logic block B going into cluster C can be computed as follows: where Gain(B C) = f(nets(b) Nets(C)) X = g(i Nets(C) B) g(i C B) = 8>< >: Nets(B) 1+f in(p (i B) P(i C)) +f o(p (i B) P(i C)) i 2 Nets(C) ;1 T (i B) otherwise f in(p (i B) P(i C)) is dened as the gain obtained in input pins of cluster C as dened in Table 1. Similarly f o(p (i B) P(i C)) is the gain obtained in put congestion. The additional gain of value 1 to the sum of these two gain terms corresponds to the edge gain. T(i, B) returns the type of the pin of Net i connected to basic block B. It returns 0 if the pin is an put pin, and 1 otherwise. P(i, B) is the set of all pins of Net i that are on block B. P(i, C) is the set of pins of Net i connected to cluster C. Nets(C) is the set of nets connected to cluster C. After choosing the best block according to our gain function, the feasibility of adding the block to the cluster is checked. The pseudo code of our approach is shown below. Input: Netlist of LUTs and Registers N = Cluster Size K=LUTSize I = Inputs per Cluster Output: List of Logic Clusters 1 Pack LUTs and Registers together into Basic Blocks 2 while Unclustered Basic Blocks available 3 Find Seed for new Cluster 4 while Cluster is not full 5 Update gains of unclustered Basic Blocks //Candidate blocks 6 Choose Basic Block with highest gain // Pick a candidate block 7 If Candidate is NOT feasible 8 then Go to Step 6 9 Else 10 Remove blockfrom unclustered blocks list 11 Add block to current Cluster 12 end while 13 end while At this pointwe are able to analyze the operation of our algorithm from another point of view. Inserting a whole multi-terminal net in one cluster is practically impossible. In most of the cases the best we can do is to eliminate two-terminal nets. Therefore, reducing an edge from a multi-terminal net should not be considered equivalent to reducing an edge resulting from inserting a twoterminal net inside a cluster. The average gain that a block can take from an n-terminal net i connected to one of its pins, depending on type of the net can be estimated from Table 1 as follows: (4)

5 ( 2 n =2 Gain avg(i n) = (5) n + 2 n;1 3 n n>2 According to Equation 5, the average gain obtained from a twoterminal net is the highest. This implies that the algorithm gives priority topushingatwo-terminal net entirely inside a cluster as compared to reducing a pin of a multi-terminal net. This leads to a decrease in number of nets satisfying the last rability factor. Lemma: The complexity ofrpack iso(m 2 ), where M is the number of basic blocks. Proof: Assume the interconnection among basic blocks is represented by a connectivity graph and each hypernet is represented by a complete graph. If this algorithm is applied on such a connectivity graph, each edge would be visited P at most once. At each visit the gains of the nodes (representing basic blocks) are updated. Therefore the er loop takes O( Nets P (i)2 ), where P (i) isthenumber of pins of net i. Also nding the seed for each cluster takes O(M 2 ) time, where M is the number of basic blocks. Totally the run-time of the algorithm would be O( P Nets P (i)2 + M 2 ). Let P be the total number pins in the circuit. Since P Nets P (i)2 <P 2 and P 2 ( M K )2, the complexity of the algorithm can be expressed as O(M 2 ). 2 VI. Results We are presenting our experimental results in this section. We have implemented our method on top of VPack. The rst set of our experiments consists of comparing our algorithm and VPack in terms of their success in satisfying the rability factors discussed earlier. Number of nets, the average number of pins per cluster and the number of clusters derived from the packing results are reported. Comparing number of nets or just average used pins per cluster does not strongly indicate rability. Since the cost function described in Section V is able to reect the rability quite successfully, it is our comparison metric for rability. Figure 5 presents the rability cost function for the two approaches, VPack and RPack. The gure shows that in all benchmarks RPack yields a smaller cost compared to VPack, indicating enhanced rability. The average improvement in the cost function over all benchmarks is 22.4 %. Although our algorithm gives priority to reducing two-terminal nets the distribution of high-fan nets is also improved. Otherwise, the existing of very high-fan nets would degrade the value of the cost function. In order to compare area, we are reporting the total number of resulting clusters and FPGA array size ( M M).TheFPGA array is the smallest square M M grid into which the resulting logic clusters can t. Array size is a more accurate metric for logic area as compared to total number of clusters. We have set the values of logic architecture parameters according to the discussion in Section IV for best logic utilization. The LUT size is 4 and the cluster size is 8. According to Equation 6 experimentally derived in [1], the input size is set to 18. Finally, a uniform pin distribution is used. I =(N K +1) (6) 2 We ran the 20 largest MCNC benchmarks on VPACK and RPack. The blif input format of each benchmark is obtained by SIS [9] logic minimization and FlowMap [8] technology mapper. The results presented in Table 2 show that our method successfully decreased the number of nets and the average number of used pins per cluster. RPack and VPack use similar number of clusters Our Cost Function alu4 apex4 clma diffeq elliptic ex5p misex3 s298 s38584 spla benchmarks Rpack Cost VPack Cost Fig. 5. Rability Cost after Packing (Our cost function correlates well with the ring results as shown in Table 3) such that the array size resulting from both approaches for almost all benchmarks is same. Even in one case RPack yielded smaller array size. The array size of each benchmark is reported in Table 2. In accordance with average gain estimated in Section V, the results show that the major portion of the decrease in the number of the nets is due to decrease in the number of two-terminal nets. We also observed that the decrease in the average number of used pins per cluster is due to reduced number of put pins. In conclusion, reducing the number of put pins is strongly related to reducing the number of nets. Table 2. Statistics of VPack and RPack.N= 8, K =4, I = 18 Number of Pins Used Number circuit Nets per Cluster of (avg.) Clusters VPack RPack % VPack RPack VPack RPack alu apex apex bigkey clma des dieq dsip elliptic ex ex5p frisc misex pdc s s s seq spla tseng Avg In order to verify that our method meets the objective of

6 improving rability, we synthesized the benchmark circuits through the complete CAD ow to obtain the ring area. We used VPR [2] to place and re the benchmarks. The ring architecture which weusedemploys only the single segment wires which leads to better rability. Subset switch type, in which each track inachannel can be connected to the same track number of the neighboring channels, is used. In addition we have set the fraction of the tracks of each channel to which each logic block input and put pins connect to 1. Table 3. Ring area and array size resulted from RPack and VPack. N=8,K=4,I=1 circuit Array Size Number of Tracks Improvement VPack RPack VPack RPack % alu apex apex bigkey clma des dieq dsip elliptic ex ex5p frisc misex pdc s s s seq spla tseng Average Table 3 summarizes our results after placement and ring of the benchmarks. As shown in Table 3 RPack is able to improve the ring area by decreasing the number of tracks signicantly. The average improvement we obtained is 16.5 %. The number of ring tracks required in each channel is a reliable metric for ring area, since less number of ring tracks does not only mean saving wiring area, but also decreasing the size of the ring switches drastically. In total our approach achieved a big gain in ring area. Such an improvement in ring area decreases the total chip area signicantly, since ring area is typically a large percentage of the total area. We also compared the results of VPack and RPack in terms of ring area and logic area for cluster size 4 and input size 10. By comparing the ring area and logic area for the two dierent cluster sizes, we note that the improvement in ring area is larger for the bigger cluster size. On the other hand, the more ecient logic utilization in cluster size 4 leads to a smaller logic area compared to cluster size 8. The FPGA placement and ring tools play a crucial role in the nal result. The VPR placer and rer are not transparent to the optimization goal of our algorithm. VPR placement algorithm is simulated annealing-based. The cost function is the total bounding box of the nets. In other words, this algorithm is not congestion-driven. Despite the fact that we could not use a suitable placer and rer, we were still able to improve the ring area signicantly. We expect that by combining our packing method with a congestion-driven placer and rer, better results can be obtained. VII. Conclusions and Future Work In this paper we proposed a rability driven cluster packing method for cluster-based FPGAs. Our method is able to improve the rability by decreasing the number of required tracks in the FPGA ring channels. This improvement was achieved by incorporating several rability factors in our packing algorithm. We proposed a new cost function as an indicator for rability before placement. Like VPack, an existing cluster packing tool for FPGAs, RPack is a quite fast heuristic approach. There is even more space for further improvements in RPack, since the time consumed for packing is a very small fraction of the total time consumed to synthesize a circuit through the whole CAD ow. Our results indicate that providing more accurate design decisions at the packing stage even at the expense of spending more time is recommended. Eorts to improve the design performance in early stages such as clustering can yield a big gain in the nal result. References [1] E. Ahmed, J. Rose, \The Eect of LUT and Cluster Size on Deep- Submicron FPGA Performance and Density," ACM/SIGDA International Symposium on FPGA, pp. 3-12, Feb [2] V. Betz, J. Rose, \VPR: A New Packing Placement and Ring Tool for FPGA Research," Int. Workshop on Field-Programmable Logic and Application, pp , [3] A. Marquardt, V. Betz, J. Rose, \Using Cluster-Based Logic Blocks and Timing-Driven Packing to Improve FPGA Speed and Density," ACM/SIGDA International Symposium on FPGAs, pp.37-46, Feb [4] V. Betz, J. Rose, A. Marquardt, \ Architecture and CAD for Deep- Submicron FPGAs," Kluwer Academic Publishers, [5] N. Bhat, D. Hill, \Rable Technology Mapping for FPGAs," First International Workshop on FPGAs, pp , Feb [6] M. Schlag, J. Kong, P. K. Chan, \Rability-Driven Technology Mapping for Lookup Table-Based FPGAs," IEEE Transactions on CAD, Vol. 13, pp.13-26, Jan [7] A. E. Caldwell, A. B. Kahng, S. Mantik, I. L. Markov, A. Zelikovsky, \ On Wirelength Estimations for Row-Based Placement," International Symposium on Physical Design, [8] J. Cong, Y. Ding, \ FlowMap: An Optimal Technology Mapping Algorithm for Delay Optimization in Lookup-Table Based FPGA Design," IEEE Trans. on Computer-Aided Design, Vol.13, No.1, pp. 1-12,Jan [9] E. M. Sentovich, et al, \SIS: A System for Sequential Analysis," Tech. Report No. UCD/ERL M92/41, University of California, Berkeley, [10] S. Yang, \Logic Synthesis and Optimization Benchmarks, Version 3.0," Tech. Report, Microelectronics Center of North Carolina, 1991.

How Much Logic Should Go in an FPGA Logic Block?

How Much Logic Should Go in an FPGA Logic Block? Vaughn Betz and Jonathan Rose Department of Electrical and Computer Engineering, University of Toronto Toronto, Ontario, Canada M5S 3G4 {vaughn, jayar}@eecgutorontoca