PARALLEL PERFORMANCE DIRECTED TECHNOLOGY MAPPING FOR FPGA

Laurent Lemarchand
Dépt Informatique, EA 2215, UBO University, BP 809, F-29285 Brest, France
lemarch@univ-brest.fr

ABSTRACT

An efficient distributed method is developed for the technology mapping of Look Up Table-based Field Programmable Gate Arrays. Parallelization shortens the design cycle time for rapid prototyping of large designs onto FPGA. In our algorithm, the boolean network is partitioned using an effective k-way partitioning tool, the subgraphs are synthesized for performance using the nominal delay predict model, and then merged back to form the covering of the circuit. Blocks are processed independently in parallel on a network of workstations. Experimental results for a set of large combinational circuits from the LGSynth'91 benchmark suite show linear speedups. The produced designs are equivalent or better in terms of performance and area compared to designs processed without partitioning.

I. INTRODUCTION

Field Programmable Gate Arrays (FPGA) with user-programmability have become very popular for rapid prototyping, DSP and logic emulation because of their short design time and low cost. A Look Up Table (LUT)-based FPGA consists of an array of LUT which implement sequential and combinational logic functions, and a user-configurable network which provides connections among the LUT. A K-input LUT computes any boolean function of up to K variables.

Technology mapping tools convert a design, represented as a boolean network, into a functionally equivalent network of K-LUT. A mapped circuit includes a LUT for each primary output of the boolean network, and for each LUT input which is not a primary input. The objective is to reduce the area (number of LUT), to increase the routability by balancing signals among the LUT, or to optimize the delays of the boolean network. Mixed objectives are also of interest, such as minimizing area while preserving delays. In this paper, we focus on the delay objective.
This criterion is crucial for real-time applications such as DSP. The runtimes of performance-oriented technology mappers are very large for the complex designs that can be accommodated in today's FPGA components. High-level tools manage the circuit complexity at the behavioral level. We have chosen a divide and conquer method based on data partitioning in order to handle complex circuits at the boolean level, and thus to speed up the prototyping process.

We first present the performance-oriented technology mapping problem, and the algorithms used for combinational circuits. In section III, we detail how to partition the network in order to reduce the problem size and parallelize the synthesis. Experimental results obtained with a distributed implementation of the synthesis are presented last.

II. PERFORMANCE ORIENTED TECHNOLOGY MAPPING

During the technology mapping process, designs are represented as boolean networks. A boolean network is a Directed Acyclic Graph (DAG) G = (V, E) where V is the set of nodes and E the set of edges. Node v ∈ V represents a logic gate or a primary input, and (u, v) ∈ E means that node u is an input of node v, i.e. v is a terminal of the net rooted at u. Node u (resp. v) belongs to the fanin (resp. fanout) of v (resp. u). Each node of a K-bounded network has no more than K inputs. Such a network can be mapped by covering each node by a single LUT.

Performance optimization objectives are based on delay models at the boolean level that reflect the expected actual delays of the mapped circuit. The first static model is the unit delay model [8]: each LUT has a constant delay. The technology mapping objective is then to minimize the number of cascaded LUT in the boolean network. More accurate models include the net delay model [4], in which each node (net) has a pre-assigned (thus also static) delay, the general delay model [13], which associates a propagation delay with each connection between terminals, and the nominal delay model [3], where the cost of a net is proportional to its number of terminals. This last model is dynamic, since the delay of a net varies with the covering solution of the associated node. All models except the unit delay model take into account both the delays induced by the LUT and those of the interconnections within the mapping solution.

Intuitively, the routing resources involved in the propagation of a signal to all of its terminals increase with the number of terminals. Thus the nominal delay model is the most accurate for predicting routing delays. With this model, mappers need to identify congested areas in the network in order to decrease the delay cost of high fanout nodes. This also increases the routability of the design.

[Figure 1: Performance-oriented technology mapping for LUT-based FPGA. Flow: boolean network, (a) decomposition, 2-bounded network, (b) covering, K-bounded network (optimal delay), (c) packing, mapped network (optimized area/delay).]

Performance-oriented technology mapping is usually performed in 3 steps, as shown in figure 1:

(a) decomposition: the network is decomposed into a set of 2-input LUT. It has been shown that decomposition increases the solution space for covering, thus leading to better solutions [5]. This is the reason for the 2-LUT decomposition step. Many decomposition methods have been proposed for this pre-processing. Dmig [1], a polynomial algorithm, provides results as good as those of other methods [5].

(b) covering: the 2-bounded network is then covered by a network of K-LUT.
Researchers have mainly focused on the optimization of static delays, since the optimization of dynamic delays is an NP-hard problem. With static delays, the delay of a node depends exclusively on the delays of the nodes on paths from the PI to this node. Current algorithms make use of dynamic programming and network flow computations for the mapping of a network, starting at the PI and processing the nodes in topological order. For a K-bounded network, these algorithms find the optimal solution in polynomial time, according to some static delay model. For the covering, FlowMap-d [4] has been proven to be optimal for the net delay model. It runs in O(Knm · log n) time, and prepares the network for area reduction.

(c) packing: finally, an area optimization step reduces the number of LUT while preserving the delays obtained in (b). Various heuristics try to reduce the area as a post-processing step. df-map [2] guarantees that the optimal delay is preserved while the area is reduced. It exploits MFFC (Maximum Fanout Free Cone) structures in the network to reduce the number of LUT.

Cong introduces in [4] the nominal delay predict model for net-based performance mapping: the delay assigned to each node (net) prior to covering reflects the expected fanout of the node in the mapping solution. Fanout size and reconvergent paths from the node are considered for the delay estimation. His results show a 3 to 10 % improvement in the actual delays of mapped circuits compared to covering with the unit delay model. We use this model to assign delays to nodes prior to the covering of the network with FlowMap-d.

III. PARALLEL SYNTHESIS USING PARTITIONING

Even if most of the synthesis algorithms used for performance-oriented technology mapping have polynomial runtimes, these runtimes are prohibitive for the complex designs which fit into today's FPGA. Our approach for reducing the synthesis runtimes consists in partitioning the circuit and processing the subnetworks independently.
These are much smaller than the original circuit, and the synthesis can be easily parallelized by distributing the blocks over a network of processors. Both factors reduce the synthesis runtimes. However, the synthesis is no longer global over the network, and this could lower the quality of the resulting netlist. Concerning the optimization of performance according to the nominal delay predict model, we must limit the loss of connectivity information induced by the partitioning, since performance optimization is based on network structures. For packing, the partitioning must take into account the MFFC structures exploited by df-map. For both partitioning tools, the sizes of the blocks must be balanced: since each block is expected to be processed in parallel, runtimes must be balanced to obtain good speedups for the parallel algorithm.

We first present the partitioning algorithm we use for the decomposition and covering steps. Balancing partition sizes is also discussed in this section. Next we detail our MFFC-based partitioning tool. These two algorithms are used for the parallel synthesis of circuits. The overall parallel algorithm is presented last.

A. Partitioning for performance optimization

Given a boolean network and a number k, k-way partitioning consists in finding an assignment of each node v ∈ V to one of the k blocks while minimizing the cut, i.e. the number of nets crossing block boundaries, and balancing the number of nodes per block. By using such an algorithm, we aim at minimizing the maximal size of the subnetworks within the blocks, and at limiting the loss of connectivity information between the blocks. This point is crucial since delay optimization is mainly based on connection structures within the network.

We extended our PPart algorithm [11], devoted to the rapid logic synthesis of boolean networks using partitioning, to the case of performance optimization for LUT-based FPGA. The partitioning algorithm used is HMetis [10], an efficient multi-level partitioner which has been successfully applied in the VLSI domain. The algorithm works on the hypergraph obtained from the boolean network by connecting each node and its fanout to form a hyperedge. HMetis minimizes the number of hyperedges spanning terminals in different blocks.

After partitioning, subnetworks are built according to the assignment of nodes to partitions. PO are added to the subnetworks to export signals used in other partitions. Corresponding PI are also added for nodes using nets from outside the partition they belong to.
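The hypergraph construction and the two partitioning objectives described above (cut and balance) can be illustrated with a small sketch. This is not HMetis; it only evaluates a given assignment. The function names and the toy network are hypothetical, and nodes without an explicit weight are assumed to weigh 1.

```python
from collections import defaultdict


def build_hyperedges(edges):
    """Group each driving node with its fanout into one hyperedge (net).

    `edges` holds (u, v) pairs meaning "u is an input of v".
    Returns a dict mapping each driver to the set of nodes on its net.
    """
    nets = defaultdict(set)
    for u, v in edges:
        nets[u].add(u)
        nets[u].add(v)
    return dict(nets)


def cut_size(nets, block_of):
    """Count the hyperedges whose terminals lie in more than one block."""
    return sum(
        1
        for terminals in nets.values()
        if len({block_of[node] for node in terminals}) > 1
    )


def block_weights(block_of, weight):
    """Sum node weights per block (a missing weight is assumed to be 1)."""
    totals = defaultdict(int)
    for node, block in block_of.items():
        totals[block] += weight.get(node, 1)
    return dict(totals)


# Toy network: a and b feed g1; g1 feeds g2 and g3; c feeds g2.
edges = [("a", "g1"), ("b", "g1"), ("g1", "g2"), ("g1", "g3"), ("c", "g2")]
nets = build_hyperedges(edges)
assignment = {"a": 0, "b": 0, "g1": 0, "c": 1, "g2": 1, "g3": 1}
print(cut_size(nets, assignment), block_weights(assignment, {}))
```

In this toy assignment only the net driven by g1 crosses the block boundary, so g1 would become a PO of block 0 and a PI of block 1 when the subnetworks are built.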
The building process preserves fanin and fanout sizes, thus allowing good nominal delay estimation. Moreover, the partitioner exposes congested areas in the network, since it does not split such areas because of the cost they would induce for the cut. The networks (blocks) considered are much smaller than the whole circuit. The pre-processing and covering steps benefit from this problem size reduction. Applying FlowMap-d with the nominal delay predict model on a high density subnetwork gives good results, since delays at the partition boundaries are minimized locally.

Balancing partition sizes. Since PPart processes each subnetwork independently in parallel, the balancing of the partitions is the second objective of the partitioning tool. Each partition weight is calculated as the sum of the weights of the nodes included in the partition. The weight of a node must thus reflect the cost it induces for the processing of its associated subnetwork. Predicting this cost is very difficult, since runtimes depend not only on node and network structures but also on the synthesis algorithms used for the processing of the blocks. ProperPart [7] is a parallel synthesis tool devoted to the logic optimization of combinational circuits with MIS II by partitioning. It uses an experimentally defined cost function for nodes, tuned for the MIS II algorithm. However, results show that even with a specialized cost function, runtime estimations are coarse. Thus, we have chosen instead a general weighting function that can be applied when coupling PPart with different synthesis tools. The weight of a node is calculated as the number of literals (positive or complemented variables) in the sum-of-products representation of the boolean function associated with the node.

B. Partitioning for packing

The third step, packing, exploits the Maximum Fanout Free Cone (MFFC) structures.
We have developed another partitioning algorithm, based on MFFC clustering, that guarantees the same results for every partitioning of the clustered network when applying df-map.

For each node v of a network G = (V, E), the MFFC of v is the maximal set MFFC_v of predecessors of v such that every path leaving a node of MFFC_v other than v stays within MFFC_v until it reaches v. PI are excluded, and v itself is included in MFFC_v. Intuitively, MFFC_v consists of the nodes whose paths all converge to v. It is proven that, for every pair of internal nodes in a network, their MFFC are either disjoint or one contains the other [2]. Thus it is possible to cluster the nodes according to the MFFC: nodes belonging to the same MFFC are grouped together and replaced by a single node in the network, until every MFFC consists of a unique node. The resulting network is unique, and each initial node is assigned to a single cluster (disjoint partitioning).

Such a clustering algorithm has been exploited prior to partitioning for acyclic partitioning of boolean networks [6]. However, when applied in conjunction with HMetis, the MFFC-based clustering gives poor results: even if the partitioned hypergraph is smaller, the partitioned nodes have large weights and dense connectivity. Thus, we apply the clustering technique only when an MFFC-based synthesis algorithm is to be used.

df-map reduces the area of a K-bounded network by collapsing nodes into their successors while preserving both the K-feasibility of the network and the delays obtained. Since no duplication of nodes is allowed, it is sufficient to process each disjoint MFFC independently. Because of this restriction, the optimal area is calculated in polynomial time. Since the clusters are processed independently by df-map, the pre-clustering technique guarantees the optimality of the solution. Moreover, no cut reduction is needed. This implies that we can partition the clustered network by considering balancing constraints only. Thus, our partitioning algorithm is as follows:

(1) cluster the nodes based on MFFC structures; the weight of a cluster is the sum of the nodes it includes
(2) partition the clusters into roughly equal-size blocks
(3) restore the original nodes in each partition

The partitioning algorithm sorts the clusters in decreasing weight order and assigns them to partitions circularly.

C. Parallel algorithm

The parallel algorithm for performance-oriented technology mapping of LUT-based FPGA is integrated into the PPart parallel synthesis tool. It is a master/slave algorithm. The master partitions the circuit and distributes synthesis tasks to the different slaves. The master collects the optimized circuits and merges them back into a netlist. If the number of tasks is greater than the number of available processors, tasks are sent on demand to idle slaves. The number of parts and the synthesis tool used are set according to the user directives.
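Two mechanisms described above, the circular assignment of weight-sorted clusters to partitions and the on-demand dispatch of blocks to idle workers, can be sketched together as follows. This is a minimal illustration under stated assumptions, not the PPart implementation: Python threads stand in for the PVM processes, `pack_block` is a hypothetical placeholder for the synthesis run on one block, and the cluster names and weights are invented.

```python
from concurrent.futures import ThreadPoolExecutor


def assign_circularly(cluster_weights, k):
    """Sort clusters by decreasing weight, then deal them to k blocks in turn."""
    # Ties are broken by cluster name so that the result is deterministic.
    order = sorted(cluster_weights, key=lambda c: (-cluster_weights[c], c))
    blocks = [[] for _ in range(k)]
    for index, cluster in enumerate(order):
        blocks[index % k].append(cluster)
    return blocks


def pack_block(block):
    """Hypothetical stand-in for the synthesis step applied to one block."""
    return sorted(block)


def run_master(blocks, n_workers):
    """Queue every block; each idle worker picks up the next pending one."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() returns the results in block order, ready for merging.
        return list(pool.map(pack_block, blocks))


weights = {"c1": 9, "c2": 7, "c3": 4, "c4": 4, "c5": 1}
blocks = assign_circularly(weights, 2)
print(blocks)
print(run_master(blocks, n_workers=2))
```

Dealing the heaviest clusters first keeps the block weights close (14 and 11 in this toy case) without any cut computation, which matches the observation that only balancing matters once the MFFC are clustered.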
Defining different synthesis policies allows the user to target various optimization objectives. For performance optimization, we have defined the following procedure:

(1) partition the network using HMetis
(2) in parallel on each block
    (2.1) predict the nominal delay for each node (delays are calculated as in [4])
    (2.2) perform FlowMap-d
(3) merge the blocks and re-partition using MFFC clustering
(4) in parallel on each block
    (4.1) apply df-map
(5) merge the blocks into the final circuit

The PPart tool has been implemented in C. Parallelization is based on the PVM [9] routines library. The partitioning tools are integrated into the SIS package of UCB [12]. All of the synthesis commands integrated in SIS are eligible for parallel execution on partitioned networks.

IV. RESULTS

We have tested our algorithm on a network of Sun Ultra 1/140 MHz workstations with 128 MB of memory. The tested circuits are large examples from the LGSynth'91 [14] public benchmark suite. The results are given for 4 and 8 partitions. All of them are to be compared with those obtained without partitioning. Our goal is to improve the runtimes of synthesis tools for large circuits without affecting the quality of the mapped designs. We first detail the results in terms of quality (delay and area optimization) and then present the speedups obtained on a network of up to 8 processors.

A. Quality

Table 1 presents the results obtained with the usual unit delay model. Performance according to the nominal delay predict model (calculated as in [4]) is shown in table 2. Table 3 is devoted to the results in terms of area.

The results obtained with the unit delay model illustrate the main drawback of the partitioning approach for the synthesis: quality can decrease significantly since the algorithms are not applied globally on the network. With the unit delay model, the objective function corresponds to the minimization of the critical path lengths in the network.
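The critical path length under the unit delay model can be computed by levelizing the network in topological order, as in the following sketch. It assumes that every non-PI node stands for exactly one LUT and relies on the standard `graphlib` module (Python 3.9 or later); the function name and the toy network are hypothetical.

```python
from graphlib import TopologicalSorter


def unit_delay_levels(fanin):
    """Return the LUT level of every node under the unit delay model.

    `fanin` maps each LUT to the list of its inputs. Nodes that never
    appear as keys are primary inputs and sit at level 0.
    """
    level = {}
    # static_order() yields every node after all of its predecessors.
    for node in TopologicalSorter(fanin).static_order():
        inputs = fanin.get(node, [])
        level[node] = 1 + max(level[i] for i in inputs) if inputs else 0
    return level


# Toy mapped network: g3 depends on g2, which depends on g1.
fanin = {"g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["g2", "g1"]}
levels = unit_delay_levels(fanin)
print(max(levels.values()))  # number of cascaded LUT on the critical path
```

A covering algorithm that only sees one block can minimize these levels inside the block, but it has no view of the levels accumulated in the other blocks along the same path.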
Partitioning breaks these critical paths and leads to poor results for the final circuits, with an average loss of quality of over 20 %.

unit delay   no partitioning   4 partitions   8 partitions
bigkey              4                4              4
des                 9               11             11
misex3              7               10             10
seq                 9                9             10
C7552               7               11             13
ex1010              8               11             12
pdc                11               14             13
s38417              8                8              8
s38584.1           11               13             12
avg gain            -            -24.3 %        -25.9 %

Table 1: Performance according to the unit delay model

By contrast, the results for the more accurate nominal delay model show the opposite trend. Even though the synthesis process is no longer global over the network because of partitioning, an average performance improvement of more than 10 % is obtained for nominal delay (table 2). The nominal delay metric takes congested areas into account. The partitioning exposes such zones, since splitting a high density connection area would increase the cut. Thus the decomposition phase benefits from the partitioning, which makes it possible to improve delays at the boundaries of each congested zone individually. This leads to an overall better mapped design.

nominal delay   no partitioning   4 partitions   8 partitions
bigkey               12034            13494          13738
des                   6792             3274           3364
misex3                8702             9034           8734
seq                   6054             4980           6180
C7552                 4098             4222           4418
ex1010               21326            23130          22736
pdc                  24602            11686          13190
s38417                5902             5472           5294
s38584.1             16236            16140          14632
avg gain                 -           13.5 %         12.7 %

Table 2: Performance according to the nominal delay predict model

Area results are roughly equivalent with or without partitioning. The covering phase thus has a small impact on the packing phase. The MFFC-based clustering guarantees the optimality of the solution obtained by partitioning. Thus the good delay performance is not paid for by a significant loss of area for the design.

area in # 4-LUT   no partitioning   4 partitions   8 partitions
bigkey                  1805             2025           2063
des                     1375             1197           1169
misex3                  2276             2143           2217
seq                     2068             1910           2034
C7552                    864              818            813
ex1010                  3285             3805           3748
pdc                     5893             5735           5905
s38417                  2784             3080           3141
s38584.1                3060             3109           3063
avg gain                   -           -1.7 %         -3.2 %

Table 3: Quality obtained in terms of area when applying the nominal delay predict model

B. Speedup

Table 4 shows the speedups obtained by distributing the synthesis process on 4 and 8 processors. Times are in seconds for the sequential case, and speedups are given otherwise. Parallel execution times include partitioning, synthesis and merging. Circuits are listed in increasing order of their synthesis runtime on a single processor without partitioning.

             seq. time (s)   speedup, 4 processors   speedup, 8 processors
bigkey             125                1.3                     1.3
des                285                3.2                     4.3
misex3             520                3.3                     4.1
seq                571                5.2                     5.6
C7552             1128                8.4                    13.3
ex1010            1921                3.7                     4.5
pdc               4277                3.8                     6.2
s38417            4614               11.4                    17.0
s38584.1          6600                9.9                    17.5
average              -                5.6                     8.2

Table 4: Speedup on a network of workstations

Results show linear speedups on average. Since the partitioning tools have very small runtimes, partitioning does not penalize the overall execution times if the circuits are large enough. For example, bigkey takes only about 2 minutes (125 s) of CPU time sequentially, and its speedups are small (1.3) on 4 or 8 processors. On the other hand, the synthesis of large circuits such as pdc and the circuits that follow it in table 4 benefits from the parallel approach. The poor result obtained for ex1010 on 8 processors is due to unbalanced partition sizes.
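The figures of table 4 can be cross-checked with a few lines of arithmetic. The sketch below recomputes the average speedups, derives the parallel efficiency (speedup divided by the number of processors) and lists the circuits whose speedup exceeds the processor count. The data are copied from table 4; the function names are hypothetical.

```python
# circuit: (sequential time in s, speedup on 4 processors, speedup on 8)
TABLE4 = {
    "bigkey": (125, 1.3, 1.3),
    "des": (285, 3.2, 4.3),
    "misex3": (520, 3.3, 4.1),
    "seq": (571, 5.2, 5.6),
    "C7552": (1128, 8.4, 13.3),
    "ex1010": (1921, 3.7, 4.5),
    "pdc": (4277, 3.8, 6.2),
    "s38417": (4614, 11.4, 17.0),
    "s38584.1": (6600, 9.9, 17.5),
}


def average_speedup(table, column):
    """Mean of one speedup column, rounded to one decimal as in the table."""
    values = [row[column] for row in table.values()]
    return round(sum(values) / len(values), 1)


def efficiency(speedup, processors):
    """Parallel efficiency: 1.0 corresponds to a perfectly linear speedup."""
    return speedup / processors


def above_linear(table, column, processors):
    """Circuits whose speedup exceeds the number of processors."""
    return [name for name, row in table.items() if row[column] > processors]


print(average_speedup(TABLE4, 1), average_speedup(TABLE4, 2))
print(efficiency(average_speedup(TABLE4, 2), 8))
print(above_linear(TABLE4, 2, 8))
```

The recomputed averages agree with the last row of table 4, and the per-circuit spread makes visible how much of the average comes from the three largest sequential runtimes.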
For some of the circuits (C7552, s38417, s38584.1), speedups are super-linear. These results are mainly due to the problem size reduction brought by the partitioning. Partitioning avoids the large memory footprint needed for the direct synthesis of circuits, and thus avoids swapping memory to disk, which slows down processing.

V. CONCLUSION

In this paper we have presented a partitioning approach for the performance-oriented technology mapping of large combinational circuits onto LUT-based FPGA. The synthesis process makes use of both exact and heuristic methods for optimizing the performance and area of circuits. The partitioning tools are adapted to the synthesis algorithms used.

Even if partitioning degrades performance when a simple unit delay model is used for the delay estimation, results are much better with the more accurate nominal delay predict model. The partitioning exposes congested areas in circuits, which makes it possible to optimize delay propagation locally. This local performance optimization increases the overall solution quality compared to a non-partitioning approach. A specialized partitioning tool limits the loss of area induced by partitioning. The model also increases the routability of the circuits, leading to better results in terms of performance for the placed and routed designs.

The partitioning approach makes it easy to parallelize the synthesis on a network of computers. The parallel algorithm provides linear speedups and avoids prohibitive runtimes for the rapid prototyping of large designs.

VI. ACKNOWLEDGEMENTS

Thanks to Prof. Jason Cong, from UCLA, who provided the source of his FlowMap package (version 0.2) for our experiments with SIS.

VII. REFERENCES

[1] K.-C. Chen et al. DAG-map: Graph-based FPGA technology mapping for delay optimisation. IEEE Design and Test of Computers, pages 7-20, September 1992.
[2] J. Cong and Y. Ding. On area/depth trade-off in LUT-based FPGA technology mapping. IEEE Trans. on VLSI Systems, 2(2):137-148, June 1994.
[3] J. Cong and Y. Ding. On nominal delay minimization in LUT-based FPGA technology mapping. Integration - The VLSI Journal, 18:73-94, November 1994.
[4] J. Cong et al. LUT-based FPGA technology mapping under arbitrary net-delay models. Computers and Graphics, 18(4):507-516, 1994.
[5] J. Cong and Y.-Y. Hwang. Structural gate decomposition for depth-optimal technology mapping in LUT-based FPGA designs. In Proc. ACM/IEEE Design Automation Conf., Las Vegas, NV, June 1996.
[6] J. Cong et al. Acyclic multi-way partitioning of boolean networks. In Proc. ACM/IEEE Design Automation Conf., pages 670-675, 1994.
[7] K. De and P. Banerjee. Parallel logic synthesis using partitioning. In Proc. Int'l Conf. on Parallel Processing, 1994.
[8] R. Francis et al. Technology mapping of lookup table-based FPGAs for performance. In Proc. Int'l Conf. on Computer-Aided Design, pages 568-571, Santa Clara, CA, November 1991.
[9] G. Geist et al. PVM: Parallel Virtual Machine - A Users Guide and Tutorial for Network Parallel Computing. MIT Press, 1994.
[10] G. Karypis et al. Multilevel hypergraph partitioning: Application in VLSI domain. In Proc. ACM/IEEE Design Automation Conf., June 1997.
[11] L. Lemarchand. Parallel synthesis of large combinational circuits for FPGAs. In Proc. of High Performance Computing and Networking Europe'97, volume 1225 of Lecture Notes in Computer Science, Vienna, Austria, April 1997. Springer-Verlag.
[12] E. Sentovich et al. SIS: a system for sequential circuit synthesis. Memorandum UCB/ERL M92/41, University of California at Berkeley, May 1992.
[13] H. Yang and D.F. Wong. Edge-map: Optimal performance driven technology mapping for iterative LUT-based FPGAs designs. In Proc. Int'l Conf. on Computer-Aided Design, pages 150-155, San Jose, CA, 1994.
[14] S. Yang. Logic synthesis and optimization benchmarks user guide. Technical report, Stanford University, 1991.