Multi-way Netlist Partitioning into Heterogeneous FPGAs and Minimization of Total Device Cost and Interconnect


Roman Kužnar (1), Franc Brglez (2), Baldomir Zajc (1)
(1) Department of ECE, Tržaška 25, University of Ljubljana, 61000 Ljubljana, Slovenia
(2) CBL, Dept. of Elec. & Computer Eng., North Carolina State University, Raleigh, N.C., U.S.A.

Abstract

This paper considers the problem of partitioning a large logic circuit into a collection of subcircuits, each of which is implemented with a device from a specific (FPGA) library. The objective function that we minimize is not only the total cost of devices to be used in the partition but also the size of the interconnect between the devices. We introduce the concept of functional replication and a unified cost model for min-cut partitioning with replication. A prototype implementation demonstrates the feasibility of the approach, based on experimental results with a set of large benchmark circuits.

I. Introduction

FPGAs are widely used in many applications [1]. Large designs cannot be implemented with FPGAs unless they are partitioned into smaller subcircuits. Optimizing a set of large design specifications will in general require partitioning into multiple FPGAs of varying sizes and types. Different sizes and types of devices can be combined to reduce the design cost and achieve a better performance for the entire design. A survey of partitioning techniques related to physical design problems and a comprehensive list of references on the subject can be found in [2]. Except for [3], [4], none of the recent publications on partitioning, e.g. [5], [6], [7], [8], [9], [10], incorporate sufficient constraints to specifically address the problem of FPGA partitioning.
In this paper we extend the formulation of the partitioning problem in [3]: Find a feasible k-way partition with the minimum cost $\$_k$, where

$$\$_k = \sum_{i=1}^{q} d_i n_i \qquad (1)$$

with $q$ the number of device types in the library, $d_i$ the unit cost of device type $i$, $n_i$ the number of devices of type $i$ to be used in the k-way partition, and the number of partitions $k = \sum_{i=1}^{q} n_i$. A partition $P_j$ is called feasible if it fits the size and the terminal constraints of a specific FPGA library. If all FPGA devices in the implementation are of the same type, the partitioning problem is reduced to finding the minimum number k of subsets that all meet the same size and terminal constraints.

An example of a library, from [] and used in [3], is shown in Table I. Each device $D_i$ in the FPGA library is described with five parameters, $D_i = (c_i, t_i, d_i, l_i, u_i)$, representing the number of elementary circuit units contained in the device, the number of terminals, the price, and the lower and upper bounds on the utilization of elementary circuit units. The circuit unit utilization is the ratio of the number of elementary circuit units assigned to a subcircuit which is to be implemented with device $D_i$, to its capacity, $c_i$. For Xilinx-based devices, $c_i$ represents the number of configurable logic blocks (CLBs), and $t_i$ represents the number of input-output blocks (IOBs).

Roman Kužnar was supported in part by Slovenian Ministry of Research and Technology under grant S /535/93. Franc Brglez was supported in part by a grant from the Semiconductor Research Corporation (SRC). Xilinx Inc. provided the XACT toolset to verify routability of each benchmark partition.

TABLE I. A subset of the Xilinx XC3000 device library. Columns: Device; $c_i$ (CLB); $t_i$ (IOB); $d_i$ (N$); $l_i$; $u_i$; CLB cost $d_i/c_i$. Devices listed: XC3020x-x, XC3030x-x, XC3042x-x, XC3064x-x, XC3090x-x.

We extend the formulation of the partitioning problem defined above as follows: Find a feasible k-way partition with the minimum cost as defined in (1) and the minimum interconnect between the partitions.
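As an illustration of the feasibility test and the cost in (1), the following minimal Python sketch checks whether a subcircuit fits a device and sums the device costs of a partition. The `Device` class, the function names, and the two library entries are hypothetical; the numeric values are invented for the example and are not taken from Table I.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Device:
    """One library entry D_i = (c_i, t_i, d_i, l_i, u_i)."""
    name: str
    c: int    # capacity in CLBs (c_i)
    t: int    # number of IOBs (t_i)
    d: float  # unit cost (d_i)
    l: float  # lower bound on CLB utilization (l_i)
    u: float  # upper bound on CLB utilization (u_i)


def is_feasible(clbs_used: int, terminals_used: int, dev: Device) -> bool:
    """A subcircuit fits a device if its CLB utilization lies within
    [l_i, u_i] and it needs no more terminals than the device offers."""
    utilization = clbs_used / dev.c
    return dev.l <= utilization <= dev.u and terminals_used <= dev.t


def total_cost(library: list[Device], counts: list[int]) -> float:
    """Equation (1): sum of d_i * n_i over all device types."""
    return sum(dev.d * n for dev, n in zip(library, counts))


# Hypothetical two-device library; all values are made up for illustration.
SMALL = Device("small", c=64, t=64, d=1.0, l=0.0, u=0.9)
LARGE = Device("large", c=320, t=144, d=4.0, l=0.0, u=0.9)
LIBRARY = [SMALL, LARGE]

if __name__ == "__main__":
    print(is_feasible(50, 60, SMALL))   # within both constraints
    print(is_feasible(60, 60, SMALL))   # utilization above the upper bound
    print(total_cost(LIBRARY, [2, 1]))  # two small devices and one large
```

With all devices of one type, `total_cost` is proportional to the number of partitions k, which matches the remark that the problem then reduces to minimizing k.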
By defining $t_{P_j}$ as the number of terminals used in the partition $P_j$ and a measure of interconnect as the average utilization $\tau_k$ of input-output blocks (IOBs) in a given k-way partition, we can compare solutions in this paper directly with the solutions in [3]:

$$\tau_k = \sum_{j=1}^{k} t_{P_j} \Big/ \sum_{i=1}^{q} t_i n_i \qquad (2)$$

Our approach to minimizing the measure of interconnect in (2) is based on introducing module replication at each step of the bipartitioning process as implemented in [3]. As pointed out in [12], [13], [14], replication can reduce the size of the min-cut in a bipartition. The min-cut replication algorithm proposed in [14] is applicable to graphs with no constraints on the sizes of partitions. After technology mapping, the number of inputs of a mapped cell increases relative to the number of output pins, seriously limiting the benefits of traditional replication. In this paper, we propose an effective approach to reducing the size of the min-cut in a bipartition of a hypergraph. By introducing the concept of functional replication, we significantly increase the potential for reducing the number of nets in the cut set. We show that we can remove not only the nets connected to the output pins of a replicated cell but also the nets connected to cell input pins in a large number of replicated cells. Compared to results in [3], we report significant reductions in interconnect while also consistently reducing the total device cost.

31st ACM/IEEE Design Automation Conference. Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. 1994 ACM.
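The interconnect measure in (2) is a plain ratio and can be sketched in a few lines of Python. The function name and the sample numbers below are illustrative assumptions, not data from the paper.

```python
def iob_utilization(terminals_used, device_iobs, device_counts):
    """Equation (2): terminals used by all partitions divided by the
    IOBs available on all devices selected for the partition.

    terminals_used -- t_Pj for each of the k partitions
    device_iobs    -- t_i for each of the q device types
    device_counts  -- n_i for each of the q device types
    """
    if len(terminals_used) != sum(device_counts):
        raise ValueError("need exactly one device per partition")
    available = sum(t * n for t, n in zip(device_iobs, device_counts))
    return sum(terminals_used) / available


if __name__ == "__main__":
    # Hypothetical 3-way partition: two devices with 64 IOBs, one with 144.
    print(iob_utilization([40, 50, 100], [64, 144], [2, 1]))
```

A lower value means fewer device terminals are consumed by nets that cross partition boundaries, which is why the paper uses this ratio to compare interconnect across solutions.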

II. Functional Replication

Following key definitions, we review the role of traditional replication, leading to the concept of functional replication and replication potential. The latter provides the basis to generate a unique distribution for a large set of benchmark circuits, leading to an effective implementation of the replication-based bipartitioning algorithm. We use and extend the notation introduced in [3].

Hypergraph model of a circuit. The circuit partitioning problem addresses implementation of a digital circuit as a collection of subcircuits, each of which can be implemented as a single FPGA. We model the circuit as a hypergraph $H = (\{X; Y\}, E)$, where $X$ and $Y$ denote respectively the interior and terminal node sets, $X \cap Y = \emptyset$, and $E$ is the set of nets. Whenever appropriate and for simplicity, we may interchangeably refer to the interior node sets as cells or modules, and the terminal node sets as I/Os or IOBs. A k-way partition of $H$ implies an assignment of the nodes in $X$ and $Y$ to a set of k non-overlapping hypergraphs $P_j = (\{X_j; Y_j\}, E_j)$, with $X_j \cap Y_j = \emptyset$ for all $j$. When partitioning is performed without replication, each of the interior nodes of the original hypergraph is assigned to the interior node set of exactly one component hypergraph, thus $\sum_{j=1}^{k} |X_j| = |X|$. When partitioning is performed with replication, some of the interior nodes of the original hypergraph are assigned to the interior node set of more than one component hypergraph, thus $\sum_{j=1}^{k} |X_j| > |X|$. The apparent increase in the size of each partition is expected to be absorbed within each device implementing this partition; the benefit of replication is measured in terms of reduced interconnect between partitions. The question that arises is which and how many of the interior nodes should be replicated to minimize the interconnect.

Traditional replication of a cell. An example of how traditional replication evaluates a move to minimize the size of the cut set is illustrated in Figure 1.
The presence of dotted lines inside the cell should be ignored when discussing traditional replication. When the cell $M_i$ is replicated, it is copied from partition $P_k$ to the partition $R_k$. This move permits elimination from the cut of the net connected to the output pin Y. However, during this process an additional net, connecting to the input pin a, has been added to the cut. Consequently, no reduction in the cut set has been achieved and there is no indication why replication should be accepted in this case.

Fig. 1. Traditional replication ignores the I/O dependencies. (The figure shows the cell $M_i$ with inputs a, b, c and outputs X, Y on one side of the cut line, and its replica $M'_i$ on the other side.)

Functional replication of a cell. The concept of functional replication relies on capturing the cell functional dependency at its outputs with respect to its inputs. We will formally evaluate the potential gain associated with functional replication in the section that follows. Here, we illustrate the concept by way of the example in Figure 1. This time, the dotted lines inside the cell carry the information about the dependency of output pins X and Y with respect to the input pins {a, b, c}. Specifically, we associate with output X an adjacency vector $A_X = [1\ 1\ 0]^T$, and similarly, with output Y an adjacency vector $A_Y = [0\ 1\ 1]^T$. Clearly, only the input b is adjacent at both outputs and controls the function of each output. Input a is adjacent at the output X only, and similarly, input c is adjacent at the output Y only. As a consequence, the net in the cut set connecting to pin a in the replicated cell can be removed from the cut.

Fig. 2. A 2-output cell with the replication potential of 4. (The cell has inputs $a_1, \ldots, a_5$ and outputs $X_1 = f_1(a_1, a_2, a_3, a_4)$ and $X_2 = f_2(a_4, a_5)$; its replication potential is $\psi = 4$.)

Replication potential of a cell. Functional replication relies on the information about the dependency of cell outputs with respect to cell inputs. Consider the illustration shown in Figure 2.
This illustration is based on the information extracted from a netlist after technology mapping. While the information about the specific functions associated with each of the outputs may be of interest in other applications, we only require two adjacency vectors to assess the replication potential of this cell: one with respect to the output $X_1$, the other with respect to the output $X_2$:

$$A_{X_1} = [1\ 1\ 1\ 1\ 0]^T \quad \text{and} \quad A_{X_2} = [0\ 0\ 0\ 1\ 1]^T. \qquad (3)$$

We associate a replication potential with each cell by counting all inputs which control only a single output of a cell. Thus, the cell in Figure 1 has a replication potential of 2, and the cell in Figure 2 has a replication potential of 4. The higher the replication potential of the cell, the more nets may be removed from the cut set during cell replication. We will illustrate this concept further in the following section.

There are three binary operations we will perform on the adjacency vectors introduced above as well as on others to be introduced in the following section:

Complementation. For example, given that $A_{X_1} = [1\ 1\ 1\ 1\ 0]^T$, then $\overline{A}_{X_1} = [0\ 0\ 0\ 0\ 1]^T$.

Logical AND. For example, given $A_{X_1} = [1\ 1\ 1\ 1\ 0]^T$ and $A_{X_2} = [0\ 0\ 0\ 1\ 1]^T$, we get a product vector $A_{X_1} A_{X_2} = [0\ 0\ 0\ 1\ 0]^T$.

Norm. For example, given $A_{X_2} = [0\ 0\ 0\ 1\ 1]^T$, $\|A_{X_2}\| = 2$.
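The counting rule and the three vector operations can be tried out with a short Python sketch. The function names are illustrative assumptions; the vectors are those of the cells in Figures 1 and 2, read off from the input/output dependencies stated in the text. Returning 0 for a single-output cell is also an assumption of this sketch, made because such a cell has no input that is exclusive to one of several outputs.

```python
def complement(a):
    """Bitwise complement of a 0/1 vector."""
    return [1 - x for x in a]


def logical_and(a, b):
    """Element-wise AND of two 0/1 vectors of equal length."""
    return [x & y for x, y in zip(a, b)]


def norm(a):
    """Number of ones in a 0/1 vector."""
    return sum(a)


def replication_potential(adjacency):
    """Count the inputs that control exactly one output.

    adjacency -- one 0/1 vector per output; entry p is 1 if input p
                 is adjacent to (controls) that output.
    """
    if len(adjacency) <= 1:
        return 0  # assumption: single-output cells get potential 0
    psi = 0
    for i, a_i in enumerate(adjacency):
        only_i = list(a_i)
        for j, a_j in enumerate(adjacency):
            if j != i:
                # Mask out every input that also controls output j.
                only_i = logical_and(only_i, complement(a_j))
        psi += norm(only_i)
    return psi


if __name__ == "__main__":
    fig1 = [[1, 1, 0], [0, 1, 1]]              # outputs X, Y over inputs a, b, c
    fig2 = [[1, 1, 1, 1, 0], [0, 0, 0, 1, 1]]  # outputs X1, X2 over a1..a5
    print(replication_potential(fig1))
    print(replication_potential(fig2))
```

For the Figure 2 cell, inputs a1, a2, a3 control only X1 and input a5 controls only X2, while a4 is shared, so four inputs are counted.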

Formally, for a cell with n inputs and m outputs $\{X_1, X_2, \ldots, X_m\}$, we find the corresponding set of adjacency vectors $\{A_{X_1}, A_{X_2}, \ldots, A_{X_m}\}$. Then the replication potential $\psi$ is defined and can be evaluated as follows:

$$\psi = \begin{cases} \sum_{i=1}^{m} \Big\| A_{X_i} \prod_{j=1,\, j \neq i}^{m} \overline{A}_{X_j} \Big\| & \text{if } m > 1 \\ 0 & \text{if } m = 1 \end{cases} \qquad (4)$$

In (4), the adjacency vectors are complemented and AND-ed before taking the norm. For example, given the adjacency vectors in (3), the expression (4) evaluates to $\psi = 3 + 1 = 4$. This is also illustrated in Figure 2.

Cell distribution versus replication potential. Let $X$ designate the set of all cells in the circuit before partitioning and let $\psi$ be the replication potential associated with each cell as defined in (4). Then $d_X(\psi)$ is a cell distribution with respect to the cell replication potential $\psi$, namely

$$\sum_{\psi=0}^{\infty} d_X(\psi) = 1. \qquad (5)$$

We have evaluated the replication potential of each cell in the benchmark set in [11] and generated the distributions shown in Figure 3. Notably, less than 5% of all cells on average have a single output and, by definition, a replication potential of 0. Some of the two-output cells also have a replication potential of 0 (denoted as 0*). All other remaining cells have a replication potential which is greater than 0. Our experimental results clearly point out that the cells which contribute to the largest decrease of the overall interconnect in (2) have $\psi > 0$, provided we use functional rather than traditional replication. In contrast, replication based on the set of all cells where $\psi = 0$ reduces the interconnect only marginally in a few cases.

The maximum cell replication factor, $r_T$, relates to the distribution in (5) and is defined as

$$r_T = \sum_{\psi=T}^{\infty} d_X(\psi) \qquad (6)$$

The choice of $T = 0$ allows maximum replication of all cells, while $T = \infty$ corresponds to partitioning without replication as formulated in [3]. We call $T$ the threshold replication potential.

[Figure 3 plots the cell distribution against the cell replication potential $\psi$ for the circuits c3540, c5315, c6288, c7552, s5378, s9234, s13207, s15850, and s[...]; the entry 0* refers to multi-output cells with $\psi = 0$.]

Fig. 3.
Distribution of cells after Xilinx-based technology mapping.

III. Partitioning with Replications: Cost Model

The preceding section introduced the motivation and the concept of functional replication. Fundamental to this concept is the notion of the replication potential, which can be calculated from the respective adjacency vectors associated with the output pins of each cell. In this section we extend the notion of adjacency vectors of a cell and also make binary-valued assignments to the nets connected to the cell and crossing the cut. Our discussion will be guided by the example shown in Figure 4. By inspection we find:

1. moving the single cell across the cut line increases the size of the cut set from 3 to 4, hence the gain of this move equals -1;
2. replicating the cell in a traditional manner increases the size of the cut set from 3 to 5, hence the gain of this move equals -2;
3. replicating the cell functionally, exploiting the knowledge of the input-output dependencies, reduces the size of the cut from 3 to 1, hence the gain of this move equals +2.

Fig. 4. Options to reduce the size of the cutset during a bipartition. (Single cell move: gain = -1. Traditional replication: gain = -2. Functional replication: gain = +2.)

We next introduce a unified formulation to calculate gains in each of the cases illustrated: (1) single move of a cell, (2) traditional replication/unreplication move, and (3) functional replication/unreplication move. Given that a cell under consideration has n inputs and m outputs, we associate the following binary vectors with the cell:

- m I/O adjacency vectors $A_{X_i}$ associated with each output $X_i$, each vector of size n.
- a pair of cutset adjacency vectors, $C^I$ and $C^O$; $C^I$ is of size n, $C^O$ is of size m. An element $c^I_j \in C^I$ is equal to 1 if the net in the cutset is adjacent to the j-th input pin of the cell. An element $c^O_i \in C^O$ is equal to 1 if the net in the cutset is adjacent to the i-th output pin of the cell. A net state is called cut if the net is in the cutset; otherwise it is called nocut.
- a pair of critical net vectors, $Q^I$ and $Q^O$; $Q^I$ is of size n, $Q^O$ is of size m. A net is critical if one move changes its state. An element $q^I_j \in Q^I$ is equal to 1 if the net adjacent to the j-th input pin of the cell is critical. An element $q^O_i \in Q^O$ is equal to 1 if the net adjacent to the i-th output pin of the cell is critical.

With respect to Figure 4, these vectors are as follows: $A_{X_1} = [...]$; $A_{X_2} = [...]$; $C^I = [...]$; $Q^I = [...]$; $C^O = [...]$; $Q^O = [...]$.

A. Gain of a single move. The gain of a single move can be calculated by counting the number of cut critical nets which are eliminated from the cut set and the number of nocut critical nets which are added to the cut set. Based on our definitions, the gain of a single move, $G_m$, is then:

$$G_m = \left( \| C^I Q^I \| + \| C^O Q^O \| \right) - \left( \| \overline{C}^I Q^I \| + \| \overline{C}^O Q^O \| \right) \qquad (7)$$

For example, using the vectors above, $G_m = (2 + 1) - (3 + 1) = -1$.

B. Gain of traditional replication. Replication duplicates a cell and moves the replicated cell across the cut line to another partition while the original cell remains in the original partition. According to traditional replication as defined in [13], the replicated cell is identical to the original cell and connects exactly the same nets as the original one. Traditional replication eliminates all output nets from the cut set while adding all input nets to it. Since we know the number of input and output nets, as well as the number of cut nets connecting the cell before replication, the gain of traditional replication, $G_{tr}$, is simply:

$$G_{tr} = \left( \| C^I \| + \| C^O \| \right) - n. \qquad (8)$$

For example, $G_{tr} = (2 + 1) - 5 = -2$.

C. Gain of functional replication. If the functionality of a logic cell is known, we can exploit this information to leave some of the input and output pins of the original and replicated cell floating, resulting in additional reduction of the cut set. As shown in Figure 4, we can replicate the cell and leave one output pin and all input pins that control this output floating. For simplicity of presentation, we next derive a generalized formulation for a cell with two outputs only. Assume that in the original cell the output pin #1 is used while in the replicated cell it is left floating. Similarly, the output pin #2 is used in the replicated cell and left floating in the original cell.
Since only the output pin #2 is used in the replicated cell, all input pins adjacent to pin #1 only can be left floating. Thus, when calculating the gain, only input pins adjacent to the pin #2 need to be considered, and vice versa. In general, we can write a gain expression for each of the outputs:

$$G_{X_1} = \left( \| (C^I A_{X_1}) (Q^I A_{X_1} \overline{A}_{X_2}) \| + (c^O_1 q^O_1) \right) - \left( \| (\overline{C}^I A_{X_1}) (Q^I A_{X_1} \overline{A}_{X_2}) \| + (\overline{c}^O_1 q^O_1) \right) \qquad (9)$$

and

$$G_{X_2} = \left( \| (C^I A_{X_2}) (Q^I A_{X_2} \overline{A}_{X_1}) \| + (c^O_2 q^O_2) \right) - \left( \| (\overline{C}^I A_{X_2}) (Q^I A_{X_2} \overline{A}_{X_1}) \| + (\overline{c}^O_2 q^O_2) \right). \qquad (10)$$

For example, we calculate the gain $G_r$ for the best case, where we use pin #2 in the replicated cell: $G_{X_2} = (1 + 1) - (0 + 0) = 2$, and similarly, we can calculate $G_{X_1} = (0 + 0) - (3 + 1) = -4$.

Expressions (9) and (10) are basically an extension of (7) where we used logic operations to eliminate input nets not adjacent to the output pin that is connected in the replicated cell. The gain of a functional replication, $G_r$, is based on the highest gain associated with a given output. Since only two outputs have been considered in this case, we have:

$$G_r = \max(G_{X_1}, G_{X_2}). \qquad (11)$$

When the replication is performed, the original and the replicated cells are disconnected from some nets. The net state and criticality are updated only for cells which are currently connected to the net. The gain of unreplication is equal to the gain of a move of the original cell to the partition that contains the replicated cell, or vice versa. Here, the gain calculations consider only those cells in the nets which are currently connected. Therefore, there is no need to derive an additional gain equation for unreplication. When the unreplication move is performed, the original and the replicated cell are merged into a single cell.

D. Implementation highlights. The proposed approach to bipartitioning with functional cell replication was implemented as an extension of the traditional F-M heuristic [15]. We measure the cost of bipartition with the objective function as proposed in [3].
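The three gain calculations of this section can be checked numerically with a small Python sketch. Everything in it is an assumption made for illustration: the function names, the reading of the functional-replication gain (complements placed as in the single-move gain, with inputs shared by both outputs masked out), and the example vectors, which were chosen so that a five-input, two-output cell reproduces the gains of -1, -2, and +2 quoted for Figure 4. They are not the vectors printed in the paper.

```python
def band(a, b):
    """Element-wise AND of two 0/1 vectors."""
    return [x & y for x, y in zip(a, b)]


def bnot(a):
    """Bitwise complement of a 0/1 vector."""
    return [1 - x for x in a]


def norm(a):
    """Number of ones in a 0/1 vector."""
    return sum(a)


def gain_move(CI, QI, CO, QO):
    """Single move: cut critical nets removed minus nocut critical nets added."""
    removed = norm(band(CI, QI)) + norm(band(CO, QO))
    added = norm(band(bnot(CI), QI)) + norm(band(bnot(CO), QO))
    return removed - added


def gain_traditional(CI, CO):
    """Traditional replication: nets already in the cut minus all n inputs."""
    return norm(CI) + norm(CO) - len(CI)


def gain_output(CI, QI, CO, QO, A_used, A_other, out):
    """Gain when only output `out` stays connected in the replicated cell.

    Only critical inputs adjacent to the used output and not shared with
    the other output are counted (assumed reading of the gain expression).
    """
    mask = band(QI, band(A_used, bnot(A_other)))
    removed = norm(band(band(CI, A_used), mask)) + (CO[out] & QO[out])
    added = norm(band(band(bnot(CI), A_used), mask)) + ((1 - CO[out]) & QO[out])
    return removed - added


def gain_functional(CI, QI, CO, QO, A1, A2):
    """Functional replication of a two-output cell: best of the two outputs."""
    return max(gain_output(CI, QI, CO, QO, A1, A2, 0),
               gain_output(CI, QI, CO, QO, A2, A1, 1))


# Assumed example vectors (inputs a1..a5, outputs X1 and X2).
A1 = [1, 1, 1, 1, 0]   # X1 depends on a1..a4
A2 = [0, 0, 0, 1, 1]   # X2 depends on a4, a5
CI = [0, 0, 0, 1, 1]   # nets on a4 and a5 are in the cut
QI = [1, 1, 1, 1, 1]   # every input net is critical
CO = [0, 1]            # the net on X2 is in the cut
QO = [1, 1]            # both output nets are critical

if __name__ == "__main__":
    print(gain_move(CI, QI, CO, QO))
    print(gain_traditional(CI, CO))
    print(gain_functional(CI, QI, CO, QO, A1, A2))
```

Under these assumptions the sketch also reproduces the per-output values quoted in the text, -4 when output #1 is kept in the replica and +2 when output #2 is kept, which is some evidence that the assumed placement of complements is consistent with the worked example.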
Due to space limitations, we omit discussion of the modifications that were implemented to report the results in the following section. These details are reported in [16].

IV. Experimental Results

We exercised the proposed algorithms on a set of benchmarks introduced in [3] and [11]. The characteristics of the benchmark circuits after mapping into the XC3000 family are shown in Table II.

TABLE II. Benchmark circuit characteristics. Columns: Circuit; #CLBs; #IOBs; #DFF; #NETs; #PINs.

We performed two experiments. First, we bipartitioned all benchmarks into two equal-sized partitions with the objective of minimizing the cut set, completely relaxing the terminal constraints. F-M min-cut was based on an implementation of the original min-cut algorithm in [15]. In F-M min-cut + Func. Repl., we extended the original min-cut algorithm in [15] with functional replication as introduced in this paper. We performed 2 bipartitioning runs for each benchmark circuit, measuring the best and the average size of the cutset. In all experiments, the threshold replication potential T was set to allow maximum utilization of replications.

TABLE III. Best gains and average gains in the size of the cutset. Columns: Circuit; F-M min-cut (best, average); F-M min-cut + Func. Repl. (best cut and its reduction in %, average cut and its reduction in %).

Table III shows promising results with functional replication. The reduction of the best cut ranges from 7.7% for circuit c3540 to 62.9% for circuit s[...]. Averaged over all circuits, the best cut of 2 runs per benchmark circuit resulted in a reduction of 34.6%. The reduction of the average cut ranges from .6% for circuit s5378 to 64.% for circuit s[...]. Averaged over all circuits, the average cut of 2 runs per benchmark circuit resulted in a reduction of 32.7%. Note that the larger reduction of the cut set is achieved for the set of sequential ISCAS'89 benchmarks, where cells are more clustered. We have every indication that functional replication is effective and consistent for a wide range of circuit sizes and characteristics. The average increase in CPU running cost due to functional replication was 34%. Combining this approach with techniques in [14], [17] may potentially reduce the size of the cut even further.
In the second experiment, we extended the original min-cut algorithm in [15] with functional replication as introduced in this paper, combined with the k-way partitioning algorithm formulated in (1) and (2) in the introduction of this paper, the main objective being the reduction of the total device cost as well as the interconnect between devices. We limited the circuit expansion due to replication by using different values of the threshold replication potential T as defined in (6). Experimental results in Table IV show that partitioning with replication adds only moderately to the total number of cells. Depending on the value of the threshold potential, the percentage of cells which are replicated ranges from [...]% to 9.8%. Averaged over all circuits, the percentage of replicated cells ranges only from 3.3% to 5.[...]%. Since each feasible partition must satisfy both the size and the terminal constraints, searching for the feasible partitions may increase the total CPU times for some circuits over the linear-time characteristic of the run without replications. For direct comparisons with [3], the CPU times shown in Tables IV-VII are for the case when 5 feasible partitions per bipartitioning run were generated on a SUN SparcStation 1+.

TABLE IV. Percentage of replicated cells and CPU cost of 5 runs. Columns: Circuit; percentage of replicated cells (%) for each setting of the threshold T; CPU time (sec.) with replication and as reported in [3]. Note: the setting T = 0* includes multi-output cells with $\psi = 0$.

Although we do not limit the replications during each bipartition explicitly, Table V shows that the utilization of FPGA devices did not increase beyond 90% for most of the circuits (except for the circuit s5378 when using threshold potential T > [...]). Compared with results in [3], the average utilization of CLBs when using functional replication increased from 77% to at most 83%.
While in Table IV we report for circuit s13207 that 9.8% of cells have been replicated, the average increase in CLB utilization is from 72% for the partitioned circuit without replication to 85% for the partitioned circuit with replication. Glancing ahead, we see that the average IOB utilization was reduced from 88% to 65% for the same circuit.

The final results in this paper are reported with respect to the objective functions formulated in (1) and (2). Table VI reports on the total design cost as defined in (1). The reported results are compared with the results published in [3]. Except for the circuit s15850, we reduced the overall design cost for at least one setting of the threshold potential T while consistently reducing the size of interconnect, reported in Table VII. Table VII summarizes results on the IOB utilization as a measure related to the interconnect density between FPGA devices, defined in (2). Compared to [3], we reduced the average IOB utilization for most circuits: typical reductions range from 4.3% to 53.9%. The circuit c5315 proved to be an exceptionally difficult case. Averaging over all circuits, we also achieve an IOB utilization of 67%, compared to 77% as reported in [3]. We conclude that the greater the freedom for unlimited functional replication, the greater the reduction of the average IOB count. Our partitioning with replication utilizes different FPGA devices, so while the total costs are comparable with [3], the device distributions are quite different. A more detailed analysis of current partitioning results, including a report on routability, is in progress.

TABLE V. Average CLB utilization after partitioning. Columns: Circuit; utilization in [3] (%); utilization (%) and increase (%) under partitioning with functional replication, for each setting of the threshold T.

TABLE VI. Total design cost after partitioning. Columns: Circuit; cost in [3]; cost and reduction (%) under partitioning with functional replication, for each setting of the threshold T.

TABLE VII. Average IOB utilization after partitioning. Columns: Circuit; utilization in [3] (%); utilization (%) and reduction (%) under partitioning with functional replication, for each setting of the threshold T.

V. Conclusions

We extended the formulation of the problem of partitioning a large logic circuit into a collection of subcircuits, each of which is implemented with a device from a specific (FPGA) library. The objective function which we minimized was not only the total cost of devices used but also the size of the interconnect between the devices. We introduced the concept of functional replication with a unified cost model for min-cut partitioning with replication and demonstrated its effectiveness in achieving both objectives.

References

[1] Stephen D. Brown, Robert J. Francis, and Jonathan Rose. Field-Programmable Gate Arrays. Kluwer Academic Publishers, Boston, 1992.
[2] W. E. Donath. Logic Partitioning. In Physical Design Automation of VLSI Systems, B. Preas and M. Lorenzetti, eds. The Benjamin/Cummings Publishing Company, Menlo Park, California 94025, 1988.
[3] R. Kuznar, F. Brglez, and K. Kozminski. Cost minimization of partitions into multiple devices. In 30th Design Automation Conference, ACM/IEEE, pages 315-320, June 1993.
[4] N.-S. Woo and J. Kim. An Efficient Method of Partitioning Circuits for Multiple-FPGA Implementation. In 30th Design Automation Conference, ACM/IEEE, pages 202-207, June 1993.
[5] L. A. Sanchis. Multiple-way network partitioning. IEEE Transactions on Computers, 38(1):62-81, January 1989.
[6] C. W. Yeh and C. K. Cheng. A general purpose multiple way partitioning algorithm. In Proceedings of the 28th IEEE Design Automation Conference, 1991.
[7] C. J. Alpert and A. B. Kahng. Geometric Embeddings for Faster and Better Multi-Way Netlist Partitioning. In 30th Design Automation Conference, ACM/IEEE, 1993.
[8] P. K. Chan, M. D. F. Schlag, and J. Y. Zien. Spectral K-Way Ratio-Cut Partitioning and Clustering. In 30th Design Automation Conference, ACM/IEEE, June 1993.
[9] J. Cong and M. Smith. A Parallel Bottom-up Clustering Algorithm with Applications to Circuit Partitioning in VLSI Design. In 30th Design Automation Conference, ACM/IEEE, June 1993.
[10] M. Shih and E. S. Kuh. Quadratic Boolean Programming for Performance-Driven System Partitioning. In 30th Design Automation Conference, ACM/IEEE, June 1993.
[11] Benchmark directory pub/benchmark/partitioning93, June 1993. Send e-mail to benchmarks@mcnc.org for details on ftp access.
[12] R. L. Russo, P. H. Odden, and P. K. Wolff. A heuristic procedure for the partitioning and mapping of computer logic graphs. IEEE Transactions on Computers, 20, 1971.
[13] C. Kring and A. R. Newton. A Cell-Replicating Approach to Mincut-Based Circuit Partitioning. In IEEE International Conference on Computer-Aided Design ICCAD-91, pages 2-5, November 1991.
[14] J. Hwang and A. El Gamal. Optimal Replication for Min-Cut Partitioning. In IEEE Int. Conf. on Computer-Aided Design, November 1992.
[15] Charles M. Fiduccia and R. M. Mattheyses. A linear-time heuristic for improving network partitions. In Proceedings of the 19th IEEE Design Automation Conference, pages 175-181, 1982.
[16] R. Kuznar, F. Brglez, and B. Zajc. A Unified Cost Model for K-Way Netlist Partitioning with Replication. Technical report, CBL (CAD Benchmarking Laboratory), Elec. & Comp. Engineering, NCSU, Raleigh, N.C., 1994.
[17] L. Hagen and A. B. Kahng. A New Approach to Effective Circuit Clustering. In IEEE Int. Conf. on Computer-Aided Design, November
