AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY

Size: px

Start display at page:

Download "AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY"

Arleen Carroll
5 years ago
Views:

1 AN ANALYTICAL MODEL DESCRIBING THE RELATIONSHIPS BETWEEN LOGIC ARCHITECTURE AND FPGA DENSITY Andrew Lam 1, Steven J.E. Wilton 1, Phili Leong 2, Wayne Luk 3 1 Elec. and Com. Engineering 2 Comuter Science and Engineering 3 Det. of Comuting University of British Columbia Chinese University of Hong Kong Imerial College Vancouver, B.C., Canada Hong Kong London, U.K. {andrewl,stevew}@ece.ubc.ca hwl@cse.cuhk.edu.hk wl@doc.ic.ac.uk ABSTRACT This aer describes an analytical model, based rincially on Rent s Rule, that relates logic architectural arameters to the area efficiency of an FPGA. In articular, the model relates the looku-table size, the cluster size, and the number of inuts er cluster to the amount of logic that can be acked into each looku-table and cluster, and the number of used inuts er cluster. Comarison to exerimental results show that our models are accurate. This accuracy combined with the simle form of the equations make them a owerful tool for FPGA architects to better understand and guide the develoment of future FPGA architectures. 1. INTRODUCTION Field-Programmable Gate Arrays (FPGAs) have evolved considerably since their introduction. A dramatic increase in the number of available transistors has motivated research in both academia and industry, leading to new logic block structures, flexible embedded blocks, and comlex interconnect networks. FPGA architectural enhancements are often develoed in a somewhat ad-hoc manner. Exert FPGA architects erform exeriments in which benchmark circuits are maed using reresentative comuter-aided design (CAD) tools, and the resulting density, seed, and/or ower dissiation are estimated [1, 2, 3]. Based on the results of these exeriments, architects use their intuition and exerience to design new architectures, and then evaluate these architectures using another set of exeriments. This is reeated numerous times, until a suitable architecture is found. This rocess occurs both within FPGA comanies as new generations of devices are develoed, and within academia, as new architectural innovations are investigated. During this rocess, there is virtually no body of theory that architects can use to guide their investigations. Previous studies have roduced emirical rules of thumb (such as the best logic element size), however, these rules of thumb usually rovide no insight into the underlying tradeoffs between flexibility and density, seed, and ower dissiation. Such insight, however, would be extremely valuable for two reasons. First, understanding the relationshis between architectural arameters enables early-stage architecture develoment [4] in which the design sace can be searched quickly using analytical models. Once a romising region of the architecture sace has been identified, traditional exerimental methods can be used to choose recise architectural arameters. This would significantly accelerate the FPGA architecture design rocess. It may also allow the study of a wider variety of interesting architectures since exerimental CAD tools need not be develoed for each architecture under consideration. Second, the develoment of such theory will encourage researchers to understand why certain architectures work well, and may eventually rovide bounds on the caabilities and efficiencies of rogrammable logic. This aer is a ste towards such a body of theory. Secifically, this aer resents an analytical model that describes the relationshi between logic block and cluster arameters and the area-efficiency of the resulting FPGA. The inuts of the model are the looku-table size, cluster size, and inuts er cluster. The oututs are (1) the exected number of two-inut logic gates that can be acked into each lookutable, (2) the exected number of looku-tables that can be acked into each cluster, and (3) the exected number of inuts of each cluster that are used. The first two oututs can be used to deduce the density of an FPGA imlementation, while the third outut can be used as an inut to the channel width model resented in [4]. Together with [4], our model can be used to quickly evaluate a wide variety of looku-table/cluster architectures, without requiring timeconsuming emirical exeriments. This aer is organized as follows. Related work and reliminary background are described in Sections 2 and 3, resectively. The model itself is described in Section 4, and it is validated against exerimental results in Section 5. Section 6 gives an examle of how the model can be used as a tool in FPGA architectural investigation /8/$ IEEE. 221

2 2. RELATED WORK Several revious ublications have described analyticallyderived relationshis between FPGA architectural arameters. Much of the work focuses on the routing fabric. El Gamal derives a model that relates that relates the area required for routing to the total number of ins in the logic gates [5]. Although this work was originally designed to describe a non-rogrammable chi, the model has been used in the design of numerous generations of FPGAs [6]. Work by Brown et al relates various arameters that describe an FPGA routing architecture to the routability of that architecture [7], and more recently Fang and Rose related the same architectural arameters to the channel width of an FPGA [4]. Pistorius and Hutton relates the Rent arameter (the Rent arameter is a measure of the comlexity of the interconnect attern in a circuit [8]) of a circuit to various architectural arameters [9]. There has also been much work on interconnect rediction. Early work by Donath [1] and others related the area requirements of routing wires to the Rent arameter of a circuit. Later work by Stroobandt refined the models to consider more realistic network toologies and architectural assumtions [11]. More recent work has roduced more accurate estimates of routing area using more information from the circuit to be imlemented [12]. The revious work closest to ours is by Gao et al, who relates looku-table size to area in a non-clustered FPGA [13]. Comared to that work, we consider clustered architectures (which are more reresentative of real FPGAs) and use cluster architectural arameters in our equations Guiding Princiles 3. PRELIMINARIES Three rinciles guided us in the develoment of this model. First, we endeavored to develo the model by deriving relations analytically, without relying on curve-fitting or exerimental techniques. This ensures that we are caturing the essence of rogrammable logic, and not creating a model that is limited to a articular CAD flow or tool suite. As will be discussed in Section 4, all asects of our model were derived analytically 1 excet for one arameter, γ. Second, since we wish to derive relations between architectural arameters, we attemted to create a model that is indeendent of the circuit to be imlemented on the FPGA. For examle, we would refer a relation between block size and cluster size that is indeendent of the circuit to be imlemented. This is different from much rior estimation work, in which the goal is to redict the area, seed, or ower for a 1 Technically, Rent s Rule is an emirical exression, yet we use it as art of our analytical derivation. Rent s Rule is well acceted, so we do not feel it violates the sirit of this guiding rincile. Table 1. Parameters Architectural Parameters: K Number of inuts er looku table N Number of looku tables er cluster I Number of inuts er cluster Circuit Parameters: Rent arameter of a given circuit n 2 Number of 2-LUTs in a given circuit Imlementation Parameters: n k Number of K-LUTs needed to imlement a given circuit n c Number of clusters needed to imlement a given circuit c Average number of LUTs acked into each cluster (c = n 2 /n c) i Average number of inuts used in each cluster o Average number of oututs used in each cluster f Average fanout of all nets in the circuit γ Average number of inuts not used in each LUT given circuit [12]. That being said, it is imossible to comletely ignore the imact of the circuit; we describe our circuit using the Rent Parameter and size of the un-techmaed circuit. Third, we attemted to balance the comlexity of our equations with their accuracy. We feel that simle equations rovide significantly more insight into architectural tradeoffs than exressions which are unnecessarily comlex Parameters Table 1 summarizes the arameters used to describe the architecture, circuit, and imlementation. In general, uercase letters are used for architectural arameters, and lowercase letters are used for circuit arameters and arameters that describe the imlementation of the circuit on a given architecture. 4. MODEL DERIVATION Although our goal is to model architectures, our model mirrors the CAD flow used when maing circuits to FPGAs. The first art of our model mirrors the rocess of technology maing, in which 2-inut gates are maed to K-LUTs. This art of the model can be used to comute the number of K-LUTs needed to imlement a circuit as a function of K. The second art of the model mirrors clustering, in which K-LUTs are acked into clusters with a re-defined caacity and a re-defined number of unique inuts [1]. The inuts of this art of the model are the number of K-LUTs in a circuit as well as the cluster arameters N (the caacity of each cluster) and I, the maximum number of available inut ins on each cluster, and the oututs are two quantities that describe the clustering: the exected number of LUTs that can be acked into each cluster, and the exected number of inuts to each cluster that are used. In the following, we focus on each art of the model searately. 222

3 4.1. Number of K-LUTs to imlement a circuit We first model the rocess of technology maing. Consider an un-techmaed circuit consisting of n 2 two-inut LUTs. During technology maing, these n 2 gates will be maed into a smaller number of LUTs, which we will denote n k. In this section, we seek a closed form exression for n k. Consider a ortion of the un-techmaed-circuit consisting of x two-inut gates (1 <x n 2 ). Denote the number of signals that connect across the boundary of this region as y. Since each gate has three ins (two inuts and one outut), we can use Rent s Rule [8] to write: y =3x (1) where is the Rent arameter of the circuit. Now suose this same region is maed to zk-luts using a technology maing algorithm. The number of signals that connect across the boundary of this region is still y. The number of ins used in each K-LUT is 1+K γ (the first two terms corresond to the outut and K inuts, and the final term, γ, will be described below). We can then write y =(K +1 γ)z (2) Since y is the same in both equations, we can eliminate y to get z x = 3 K +1 γ. (3) In other words, every zk-luts can imlement x 2-LUTs in the original circuit. Intuitively, this ratio is a measure of how much logic can be acked into each looku-table. Finally, using Equation 3, we can write: n k = n 2 3 K +1 γ The term γ in Equations 2 to 4 arises because not all K inuts are always used in a K-LUT. γ is the exected number of inuts to a K-LUT that are not used. We have not found a way to accurately model this analytically, however, exerimentally we have found that γ (as a function of K) is extremely consistent across all benchmark circuits we considered. Table 2 shows our measured values of γ; the derivation of a closed form for this exression is an interesting toic of future work. (4) clustering, these n k LUTs will be acked into a smaller number of clusters, which we will denote n c. In this subsection, we seek closed-form exressions for n c. In the next subsection, we derive an exression for the exected number of inuts used er cluster, i. Each cluster can contain u to N LUTs and have u to I unique inuts. In an architecture with a large value of I and small value of N, it is likely that most clusters will be comletely filled, while in architectures with a small value of I, some clusters may not be comletely filled, because of the limitation on the number of unique inuts. Since we wish our model to aly to both tyes of architectures, we consider each case searately below. We also define an equation for the boundary between the two cases I-Limited Clustering We first consider architectures in which I is small, and the exected number of LUTs acked into each cluster is dictated by the number of hysical ins on each cluster. In this case, the exected number of LUTs acked into each cluster, c = n k /n c, will be smaller than the caacity of the cluster N. To estimate c, and hence n c, we emloy Rent s Rule as follows. Consider the same region of the technologymaed circuit from Section 4.1 which contains zk-luts and has y signals that connect outside the region. When the same region is maed to clusters, we can write y =(i + o)v (5) where v is the number of clusters needed for this region (v z), i is the average number of used inuts er cluster, and o is the average number of used oututs er cluster. The latter quantity can be written as o = i/f where f is the average fanout of the circuit (this term will be comuted below). Thus, we can write: y = i(1 + 1 f )v (6) Eliminating y from Equations 2 and 6 and solving for n c gives: n c = n K +1 γ k I(1 + 1 f ), (7) Table 2. γ values from 2 MCNC benchmarks K γ and c = I(1 + 1 f ) K +1 γ (8) 4.2. Number of clusters needed to imlement a circuit We next model the rocess of clustering. Consider a technology maed circuit consisting of n k K-LUTs. During 2 Note that in real FPGA imlementations, regardless of N and I,some clusters will be full and some will not be. We could model this by considering a distribution of acking caacities rather than a single exected value. We have chosen to model a single exected value, since we feel it rovides simler equations, and thus more insight into the underlying tradeoffs between the architectural arameters. 223

4 The fanout f in Equations 7 and 8 can be calculated using a formula from [14]: where: f = 1 (f max +1) ( 1) 1 (f max +1) ( 2) φ(, f max ) 1 (9) φ(, f max )= f max n=1 and f max is the maximum fan-out written as: n n 2 (n +1), (1) f max =[(i + N) n k N (1 )](1/(3 )). (11) In Equation 11 we aroximated the number of oututs as N and the number of clusters as n k /N ; exerimentally, we have found that the fanout is only a weak function of f max, and thus these aroximations lead to only a small error N-Limited Clustering This case is trivial. Since cluster inut ins are lentiful, clusters can be filled to caacity. Hence, c = N and Boundary Condition n c = n k N (12) For architectures in which c<n, clustering is I-limited, otherwise it is N-limited. Using Equation 8, we can write the following condition that indicates that clustering is I- limited: I(1 + 1 f ) <N (13) K +1 γ This can be rearranged to roduce: I<N K +1 γ 1+ 1 f (14) For all values of I in which Inequality 14 holds, clustering is I-limited Average Number of Used Inuts Again, we consider I-limited and N-limited architectures searately. The boundary condition between the two tyes of architectures is the same as in the revious section I-Limited Clustering In these architectures, we would exect all cluster inut ins to be used. Thus, we can write i = I (15) N-Limited Clustering By alying Rent s Rule to a single cluster, we obtain: i + o =(K +1 γ)n, (16) since a cluster has N K-LUTs that each has (K +1 γ)i/o ins and i + o number of exterior ins. By substituting o = i/f into (16) and solving for i, we obtain (K +1 γ)n i =. (17) 1+ 1 f 5. MODEL VERIFICATION To evaluate the accuracy of our model, we comare the model redictions to exerimental results. Figure 1(a) illustrates the accuracy of our technology maing model. The exerimental results were obtained by averaging the results for twenty large MCNC benchmark circuits. Results for two different technology maing algorithms are shown. The analytical results were obtained from Equation 4 using the average Rent arameter from all benchmark circuits. As the grah shows, the analytical results track the exerimental results very closely. Figures 1(b) through 1(d) illustrate the accuracy of our clustering model. Figure 1(b) shows the ratio n2 n c as a function of cluster size. In all cases, we set K =4, and I to be.88n +3.2 from [4] (this ensures that our clustering is I- limited, which is the interesting case for this grah). The analytical results were obtained using Equation 7 and 4, while the exerimental results were obtained using two searate clustering algorithms, T-Vack [1] and our imlementation of irac [15]. Again, the analytical results are very consistent with both sets of exerimental results. Figure 1(c) shows the same ratio as a function of the number of inut ins er cluster, I. In all cases, K =4and N =2. The boundary between I-limited and N-limited architectures is also shown. The grah shows that our model tracks the exerimental results well in both regions. Finally, Figure 1(d) shows the average number of used inuts er cluster as a function of cluster size. Although our model tracks both sets of exerimental results, it matches the irac results more closely. This is exected, since irac exlicitly tries to minimize the use of cluster ins. 6. EXAMPLE APPLICATION OF OUR MODEL One of the uroses of our model is to allow for early architectural evaluation. In this section, we show how the model, along with the channel width model from [4], can be used to estimate the number of rogramming bits in an FPGA as a function of the architectural arameters K, N, and I. We will investigate whether an analytical flow emloying our 224

5 n 2 /n k Exerimental (EMa) 5 Exerimental (Flowma) Inuts er LUT (K) Cluster Size (N) n 2 /n c Exerimental (TVPack) Exerimental (irac) a) Logic er LUT vs. LUT Size b) Logic acked er cluster vs. cluster size n 2 /n c Boundary Exerimental 15 between I (TVPack) and N-limited Exerimental 1 clustering (irac) Inuts er Cluster (I) Used inuts er cluster (i) Exerimental (TVPack) Exerimental (irac) Cluster Size (N) c) Logic acked er cluster vs. cluster inuts d) Used Inuts er Cluster Fig. 1. Validation model leads to similar conclusions that would be obtained by a more time-consuming exerimental methodology. We consider three flows: 1. The first flow is urely analytical. We use the model described in Section 4, to estimate n2 n c and i. These quantities are then used in conjunction with the channel width model in [4] to determine the amount of routing needed for a given architecture. We then use equations that model the number of rogramming bits in a clustered logic block, connection block, switch block. These equations are similar to those used within VPR 5., and assume an architecture with uni-directional, single-driver wires. Multilexers are assumed to be imlemented using a two-tiered structure with onehot select lines (as in VPR 5.). Finally, these area estimates are used to determine the number of rogramming bits required for each of twenty large benchmark circuits. Note that this flow is urely analytical and does not require exerimental CAD tools. 2. The second flow is urely exerimental. We ma each of the twenty benchmark circuits to 4-inut LUTs using Flowma, clusters containing four LUTs and 1 inuts using TVPack, and then lace and route the circuits using VPR 5.. We use the model within VPR 5. to count the number of rogramming bits for each benchmark circuit. 3. As an intermediate between the first two flows, we use the model in Section 4 to estimate n2 n c and i, but use VPR 5. to determine the actual channel width (instead of using the model from [4]). This flow is included to rovide insight into each ortion of the analytical model. Figure 2 shows the results for an architecture with F s = 9, F cin =2, F cout =4, and L =1. Figure 2(a) shows the number of rogramming bits as a function of K, Figure 2(b) shows the number of rogramming bits as a function of N, and Figure 2(c) shows the number of rogramming bits as a function of I. In all cases, the results from Flow 2 (urely exerimental) and Flow 3 (analytical technology maing and clustering but exerimental routing) match closely. This is to be exected, given the accuracy illustrated in Figure 1. When the model from [4] is used (Flow 1), the analytical estimates still track the exerimental results, however, not as 225

6 Average # of Programming Bits (x1 5 ) K Average # of Programming Bits (x1 5 ) I Average # of Programming Bits (x1 5 ) N Flow 1 Flow 2 Flow 3 Fig. 2. Routing Bit Verification closely. The largest differences arise for architectures with a small number of cluster inuts. These architectures contain far fewer inuts er cluster than the equations in [4] were intended to model. We have found that for such extreme architectures, the average wirelength is 25% larger than the architectures considered in [4]. This is rimarily because as I decreases, the sharing of LUT inuts within a cluster is encouraged. This tends to decrease the average fanout, which increases the average wirelength. This is an imortant observation although the model resented in this aer correctly models architectures with small values of I, a comlete analytical flow that accurately models these architectures does not yet exist. Addressing this limitation is an interesting area for future research. 7. CONCLUSION In this aer, we have resented an analytical model that relates looku-table size, cluster size, and the number of inuts er cluster to amount of logic that can be acked into each looku-table and cluster, and the number of used inuts er cluster. Comaring the model redictions with exerimental results has shown that our model is accurate. We have shown that the model can be used during early architecture evaluation, in which otential architectures are considered before custom CAD tools are created. However, the model has the otential to be much more owerful than that. By combining our model with that from [4], we have a means of relating routing architecture arameters with looku-table and cluster size. Recent FPGAs are emloying larger looku-tables, and it seems reasonable to exect that the logic architecture of future FPGAs may change further. Using a model such as ours rovides the ability to understand the imact that otential logic block architectures may have on the routing fabric, and more imortantly, why the routing fabric is imacted in a certain way. For FPGA architects, this could otentially be much more useful than exerimental results that evaluate a single architecture (or even a swee of a single architectural arameter). A C-imlementation of our model can be downloaded from htt:// stevew. 8. REFERENCES [1] V. Betz, J. Rose, and A. Marquardt, Architecture and CAD for Deesubmicron FPGAs. Kluwer Academic Publishers, [2] M. Hutton, FPGA architecture design methodology, in Int l Conf. on Field-Programmable Logic and Alications, Aug. 26,. 1. [3] E. Ahmed and J. Rose, The effect of LUT and cluster size on deesubmicron FPGA erformance and density, IEEE Trans. on VLSI Systems, vol. 12, no. 3, , Mar. 24. [4] W. Fang and J. Rose, ing FPGA routing demand in early-stage architecture develoment, in Int l Sym. on Field-Programmable Gate Arrays, Feb. 28. [5] A. A. El Gamal, Two-dimensional stochastic models for interconnections in master-slice integrated circuits, IEEE Trans. on Circuits and Systems, vol. 26, no. 4, , Feb [6] J. Rose, Closing the ga between FPGAs and ASICs, Keynote Address, in Int l Conference on Field-Programmable Technology, Dec. 26. [7] S. Brown, J. Rose, and Z. Vranesic, A stochastic model to redict the routability of field-rogrammable gate arrays, IEEE Trans. on Comuter-Aided Design of Circuits and Systems, vol. 12, no. 12, , Dec [8] B. Landman and R. Russo, On a in vs. block relationshi for artitions of logic grahs, IEEE Trans. on Comuters, vol. C-2, , [9] J. Pistorius and M. Hutton, Placement rent exonent calculation methods, temoraral behavior and FPGA architectural evaluation, in SLIP Worksho, Ar. 23, [1] W. Donath, Placement and average interconnect lengths of comuter logic, IEEE Trans. on Circuits and Systems, vol. 26, no. 4, , [11] D. Stroobandt, A Priori Wire Length Estimates for Digital Design. Kluwer Academic Publishers, 21. [12] S. Balachandran and D. Bhatia, A riori wirelength estimation and interconnect estimation based on circuit characteristics, IEEE Trans. on Comuter-Aided Design of Integrated Circuits and Systems, vol. 24, no. 7, , July 25. [13] H. Gao, Y. Yang, X. Ma, and G. Dong, Analysis of the effect of LUT size on FPGA area and delay using theoretical derivations, in Int l Symosium on Quality Electronic Design, Mar. 25, [14] P. Zarkesh-Ha, J. A. Davis, W. Loh, and J. D. Meindl, Prediction of interconnect fan-out distribution using rent s rule, in Proceedings of the 2 international worksho on System-level interconnect rediction. New York, NY, USA: ACM, 2, [15] A. Singh and M. Marek-Sadowska, Efficient circuit clustering for area and ower reduction in FPGAs, in Int l Symosium on FPGAs, Feb. 22,

IMS Network Deployment Cost Optimization Based on Flow-Based Traffic Model

IMS Network Deployment Cost Optimization Based on Flow-Based Traffic Model IMS Network Deloyment Cost Otimization Based on Flow-Based Traffic Model Jie Xiao, Changcheng Huang and James Yan Deartment of Systems and Comuter Engineering, Carleton University, Ottawa, Canada {jiexiao,