PARALLEL PERFORMANCE DIRECTED TECHNOLOGY MAPPING FOR FPGA

Laurent Lemarchand
Dépt. Informatique, EA 2215
UBO University, BP 809
F-29285 Brest, France
lemarch@univ-brest.fr

ABSTRACT

An efficient distributed method is developed for the technology mapping of Look-Up Table-based Field Programmable Gate Arrays. Parallelization shortens the design cycle for rapid prototyping of large designs onto FPGA. In our algorithm, the boolean network is partitioned using an effective k-way partitioning tool, the subgraphs are synthesized for performance using the nominal delay predict model, and then merged back to form the covering of the circuit. Blocks are processed independently in parallel on a network of workstations. Experimental results for a set of large combinational circuits from the LGSynth'91 benchmark suite show linear speedups. The designs produced are equivalent or better in terms of performance and area compared to designs processed without partitioning.

I. INTRODUCTION

Field Programmable Gate Arrays (FPGA) with user programmability have become very popular for rapid prototyping, DSP and logic emulation because of their short design time and low cost. A Look-Up Table (LUT)-based FPGA consists of an array of LUTs, which implement sequential and combinational logic functions, and a user-configurable network which provides connections among the LUTs. A K-input LUT computes any boolean function of up to K variables.

Technology mapping tools convert a design, represented as a boolean network, into a functionally equivalent network of K-LUTs. A mapped circuit includes a LUT for each primary output of the boolean network, and for each LUT input which is not a primary input. The objective is to reduce the area (number of LUTs), to increase the routability by balancing signals among the LUTs, or to optimize the delays of the boolean network. Mixed objectives are also of interest, such as minimizing area while preserving delays. In this paper, we focus on the delay objective.
This criterion is crucial for real-time applications such as DSP. The runtimes of performance-oriented technology mappers are very large for the complex designs that can be accommodated in today's FPGA components. High-level tools manage circuit complexity at the behavioral level. We have chosen a divide-and-conquer method based on data partitioning in order to handle complex circuits at the boolean level, and thus to speed up the prototyping process.

We first present the performance-oriented technology mapping problem and the algorithms used for combinational circuits. In Section III, we detail how to partition the network to reduce the problem size and parallelize the synthesis. Experimental results obtained with a distributed implementation of the synthesis are presented last.

II. PERFORMANCE ORIENTED TECHNOLOGY MAPPING

During the technology mapping process, designs are represented as boolean networks. A boolean network is a Directed Acyclic Graph (DAG) G = (V, E), where V is the set of nodes and E the set of edges. A node v ∈ V represents a logic gate or a primary input (PI), and (u, v) ∈ E means that node u is an input of node v, i.e. v is a terminal of the net rooted at u. Node u (resp. v) belongs to the fanin (resp. fanout) of v (resp. u). Each node of a K-bounded network has no more than K inputs. Such a network can be mapped by covering each node with a single LUT.

Performance optimization objectives are based on delay models at the boolean level that reflect the expected actual delays of the mapped circuit. The first static model is the unit delay model [8]: each LUT has a constant delay. The technology mapping objective is then to minimize the number of cascaded LUTs in the boolean network. More accurate models include the net delay model [4], in which each node (net) has a pre-assigned (thus also static) delay, the general delay model [13], which associates a propagation delay with each connection between terminals, and the nominal delay model [3], where the cost of a net is proportional to its number of terminals. This last model is dynamic, since the delay of a net varies with the covering solution of the associated node. All models except the unit delay model take into account both the delays induced by the LUTs and those of the interconnections within the mapping solution.

Intuitively, the routing resources involved in propagating a signal to all of its terminals increase with the number of terminals. Nominal delay is therefore the most accurate of these models for predicting routing delays. With this model, mappers need to identify congested areas in the network in order to decrease the delay cost of high-fanout nodes. This also increases the routability of the design.

[Figure 1: Performance-oriented technology mapping for LUT-based FPGA. Flow: boolean network, (a) decomposition, 2-bounded network, (b) covering, delay-optimal K-bounded network, (c) packing, area/delay-optimized mapped network.]

Performance-oriented technology mapping is usually performed in 3 steps, as shown in Figure 1:

(a) decomposition: the network is decomposed into a set of 2-input LUTs. It has been shown that decomposition increases the solution space for covering, thus leading to better solutions [5]. This is the reason for the 2-LUT decomposition step. Many decomposition methods have been proposed for this pre-processing. Dmig [1], a polynomial algorithm, provides results as good as those of other methods [5].

(b) covering: the decomposed network is then covered by a network of K-LUTs. Researchers have mainly focused on static delay optimization, since dynamic delay optimization is an NP-hard problem. The delay of a node depends exclusively on the delays of the nodes on paths from the PIs to this node. Current algorithms make use of dynamic programming and network flow computations for the mapping of a network, starting at the PIs and processing the nodes in topological order. For a K-bounded network, these algorithms find the optimal solution in polynomial time, according to some static delay model. For the covering, FlowMap-d [4] has been proven to be optimal for the net delay model. It runs in O(Knm log n) time, and prepares the network for area reduction.

(c) packing: finally, an area optimization step reduces the number of LUTs while preserving the delays obtained in (b). Various heuristics try to reduce the area as a post-processing step. DF-Map [2] guarantees that the optimal delay is preserved while reducing the area. It exploits MFFC (Maximum Fanout Free Cone) structures in the network to reduce the number of LUTs.

Cong introduces in [4] the nominal delay predict model for net-based performance mapping: the delay assigned to each node (net) prior to covering reflects the expected fanout of the node in the mapping solution. Fanout size and reconvergent paths from the node are considered for the delay estimation. His results show an improvement of 3 to 10% in the actual delays of mapped circuits compared to covering with the unit delay model. We use this model to assign delays to nodes prior to covering the network with FlowMap-d.

III. PARALLEL SYNTHESIS USING PARTITIONING

Even though most of the synthesis algorithms used for performance-oriented technology mapping have polynomial runtimes, these runtimes are prohibitive for the complex designs which fit into today's FPGA. Our approach for reducing the synthesis runtimes consists in partitioning the circuit and processing the subnetworks independently. These are much smaller than the original circuit, and the synthesis can easily be parallelized by distributing the blocks over a network of processors. Both factors reduce the synthesis runtimes. However, the synthesis is no longer global over the network, and this could lower the quality of the resulting netlist.

Concerning the optimization of performance according to the nominal delay predict model, we must limit the loss of connectivity information induced by the partitioning, since performance optimization is based on network structures. For packing, the partitioning must take into account the MFFC structures exploited by DF-Map. For both partitioning tools, the sizes of the blocks must be balanced: since the blocks are processed in parallel, their runtimes must be balanced to obtain good speedups for the parallel algorithm.

We first present the partitioning algorithm we use for the decomposition and covering steps. Balancing partition sizes is also discussed in this section. Next we detail our MFFC-based partitioning tool. These two algorithms are used for the parallel synthesis of circuits. The overall parallel algorithm is presented last.

A. Partitioning for performance optimization

Given a boolean network and a number k, k-way partitioning consists in finding an assignment of each node v ∈ V to one of the k blocks while minimizing the cut, i.e. the number of nets crossing block boundaries, and balancing the number of nodes per block. By using such an algorithm, we aim at minimizing the maximal size of the subnetworks within the blocks, and at limiting the loss of connectivity information between the blocks. This point is crucial since delay optimization is mainly based on connection structures within the network.

We extended our PPart algorithm [11], devoted to the rapid logic synthesis of boolean networks using partitioning, to the case of performance optimization for LUT-based FPGA. The partitioning algorithm used is HMetis [10], an efficient multi-level partitioner which was successfully applied in the VLSI domain. The algorithm works on the hypergraph obtained from the boolean network by connecting each node and its fanout to form a hyperedge. HMetis minimizes the number of hyperedges spanning terminals in different blocks.

After partitioning, subnetworks are built according to the assignment of nodes to partitions. Primary outputs (PO) are added to subnetworks to export signals used in other partitions. Corresponding PIs are also added for nodes using nets from outside the partition they belong to.
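As an illustration of the construction just described, the following Python sketch builds one hyperedge per driven net, counts the cut for a given assignment of nodes to blocks, and lists the signals that would have to be exported as extra PO or imported as extra PI. It is not the PPart or HMetis code: the function names, the dictionary encoding of the network and the toy circuit are assumptions made for this example, and primary inputs are treated like any other driver node.

```python
from collections import defaultdict


def build_hyperedges(fanins):
    """Return {driver: set of pins}, one hyperedge per net (driver plus its fanout).

    `fanins` maps each node to the list of nodes feeding it (hypothetical encoding).
    Nodes without fanout drive no net and therefore get no hyperedge.
    """
    nets = defaultdict(set)
    for node, drivers in fanins.items():
        for driver in drivers:
            nets[driver].add(driver)
            nets[driver].add(node)
    return dict(nets)


def cut_size(nets, block_of):
    """Number of hyperedges whose pins lie in more than one block."""
    return sum(1 for pins in nets.values()
               if len({block_of[p] for p in pins}) > 1)


def boundary_signals(nets, block_of, k):
    """Signals each block must export (extra PO) and import (extra PI)."""
    exported = [set() for _ in range(k)]
    imported = [set() for _ in range(k)]
    for driver, pins in nets.items():
        home = block_of[driver]
        for pin in pins:
            if block_of[pin] != home:
                exported[home].add(driver)
                imported[block_of[pin]].add(driver)
    return exported, imported


# Toy network (assumed for illustration): a, b, c are PIs, g1..g4 are gates.
fanins = {"a": [], "b": [], "c": [],
          "g1": ["a", "b"], "g2": ["b", "c"],
          "g3": ["g1", "g2"], "g4": ["g1", "c"]}
block_of = {"a": 0, "b": 0, "g1": 0, "g4": 0, "c": 1, "g2": 1, "g3": 1}

nets = build_hyperedges(fanins)
exported, imported = boundary_signals(nets, block_of, 2)
print("cut:", cut_size(nets, block_of))
print("exported per block:", exported)
print("imported per block:", imported)
```

In this sketch every cut hyperedge contributes exactly one exported signal in its home block, so a small cut directly limits the number of PI and PO added to the subnetworks.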
The building process preserves fanin and fanout sizes, thus allowing good nominal delay estimation. Moreover, the partitioner exposes congested areas in the network, since it does not split such areas because of the cost they would induce for the cut. The networks (blocks) considered are much smaller than the whole circuit. The pre-processing and covering steps benefit from this reduction in problem size. Applying FlowMap-d with the nominal delay predict model on a high-density subnetwork gives good results, since delays at the partition boundaries are minimized locally.

Balancing partition sizes. Since PPart processes each subnetwork independently in parallel, balancing the partitions is the second objective of the partitioning tool. Each partition weight is calculated as the sum of the weights of the nodes included in the partition. The weight of a node must thus reflect the cost it induces for the processing of its associated subnetwork. Predicting this cost is very difficult, since runtimes depend not only on node and network structures but also on the synthesis algorithms used to process the blocks. ProperPart [7] is a parallel synthesis tool devoted to the logic optimization of combinational circuits with MIS II by partitioning. It uses an experimentally defined cost function for nodes, tuned for the MIS II algorithm. However, results show that even with a specialized cost function, runtime estimations are coarse. We have therefore chosen a general weighting function that can be applied when coupling PPart with different synthesis tools. The weight of a node is calculated as the number of literals (positive or complemented variables) in the sum-of-products representation of the boolean function associated with the node.

B. Partitioning for packing

The third step, packing, exploits the Maximum Fanout Free Cone (MFFC) structures.
We have developed another partitioning algorithm, based on MFFC clustering, that guarantees the same results for every partitioning of the clustered network when applying DF-Map.

For each node v of a network G = (V, E), the MFFC of v is the maximal set MFFC_v of predecessors of v such that any path starting from a node of MFFC_v stays within MFFC_v until it reaches v. PIs are excluded, and v itself is included in MFFC_v. Intuitively, MFFC_v consists of the nodes whose paths all converge to v. It is proven that, for every pair of internal nodes in a network, their MFFCs are either disjoint or one contains the other [2]. It is thus possible to cluster the nodes according to the MFFCs: nodes belonging to the same MFFC are grouped together and replaced by a single node in the network, until every MFFC consists of a single node. The resulting network is unique, and each initial node is assigned to a single cluster (disjoint partitioning).

Such a clustering algorithm has been exploited prior to partitioning for the acyclic partitioning of boolean networks [6]. However, when applied in conjunction with HMetis, the MFFC-based clustering yields poor results because, even though the partitioned hypergraph is reduced, the partitioned nodes have large weights and dense connectivity. We therefore apply the clustering technique only when an MFFC-based synthesis algorithm is to be used.

DF-Map reduces the area of a K-bounded network by collapsing nodes into their successors while preserving both the K-feasibility of the network and the delays obtained. Since no duplication of nodes is allowed, it is sufficient to process each disjoint MFFC independently. Because of this restriction, the optimal area is computed in polynomial time. Since the clusters are processed independently by DF-Map, the pre-clustering technique guarantees the optimality of the solution. Moreover, no cut reduction is needed. This implies that we can partition the clustered network by considering balancing constraints only. Our partitioning algorithm is thus as follows:

(1) cluster the nodes based on MFFC structures; the weight of a cluster is the sum of the weights of the nodes it includes
(2) partition the clusters into roughly equal-size blocks
(3) restore the original nodes in each partition

The partitioning algorithm sorts the clusters in decreasing weight order and assigns them to partitions circularly.

C. Parallel algorithm

The parallel algorithm for performance-oriented technology mapping of LUT-based FPGA is integrated into the PPart parallel synthesis tool. It is a master/slave algorithm. The master partitions the circuit and distributes synthesis tasks to the different slaves. The master collects the optimized circuits and merges them back into a netlist. If the number of tasks is greater than the number of available processors, tasks are sent on demand to idle slaves. The number of parts and the synthesis tool used are set according to the user directives.
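The balancing-only partitioning of the clusters and the on-demand distribution of tasks can be sketched together in a few lines of Python. This is a simulation written for illustration, not the PPart implementation: the function names and the sample cluster weights are hypothetical, and the sketch assumes that the processing time of a block is proportional to its weight, which the discussion on balancing above identifies as only a coarse approximation.

```python
import heapq


def circular_partition(cluster_weights, k):
    """Sort clusters by decreasing weight and deal them to k blocks in turn.

    Returns a list of k blocks, each a list of cluster indices.
    """
    order = sorted(range(len(cluster_weights)),
                   key=lambda i: cluster_weights[i], reverse=True)
    blocks = [[] for _ in range(k)]
    for rank, cluster in enumerate(order):
        blocks[rank % k].append(cluster)
    return blocks


def on_demand_makespan(task_costs, n_workers):
    """Simulate a master sending the next task to the first idle worker.

    Returns the time at which the last worker finishes.
    """
    idle_times = [0.0] * n_workers      # when each worker becomes idle
    heapq.heapify(idle_times)
    for cost in task_costs:
        start = heapq.heappop(idle_times)   # earliest idle worker takes the task
        heapq.heappush(idle_times, start + cost)
    return max(idle_times)


# Hypothetical cluster weights, partitioned into 3 blocks run on 2 workers.
weights = [4, 9, 1, 7, 3, 6, 2, 5]
blocks = circular_partition(weights, 3)
block_weights = [sum(weights[i] for i in block) for block in blocks]
makespan = on_demand_makespan(block_weights, 2)
print(blocks, block_weights, makespan)
print("simulated speedup:", sum(block_weights) / makespan)
```

With more blocks than workers, the simulated finishing time is driven by the heaviest blocks, which is why the balance of block weights matters for the speedup.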
Defining different synthesis policies allows the user to target various optimization objectives. For performance optimization, we have defined the following procedure:

(1) partition the network using HMetis
(2) in parallel on each block
    (2.1) predict the nominal delay of each node (delays are calculated as in [4])
    (2.2) perform FlowMap-d
(3) merge the blocks and re-partition using MFFC clustering
(4) in parallel on each block
    (4.1) apply DF-Map
(5) merge the blocks into the final circuit

The PPart tool has been implemented in C. Parallelization is based on the PVM [9] routine library. The partitioning tools are integrated into the SIS package from UCB [12]. All of the synthesis commands integrated in SIS are eligible for parallel execution on partitioned networks.

IV. RESULTS

We have tested our algorithm on a network of Sun Ultra 1/140 MHz workstations with 128 MB of memory. The tested circuits are large examples from the LGSynth'91 [14] public benchmark suite. The results are given for 4 and 8 partitions. All of them are to be compared with those obtained without partitioning. Our goal is to improve the runtimes of synthesis tools for large circuits without affecting the quality of the mapped designs. We first detail the results in terms of quality (delay and area optimization) and then present the speedups obtained on a network of up to 8 processors.

A. Quality

Table 1 presents the results obtained with the usual unit delay model. Performance according to the nominal delay predict model (calculated as in [4]) is shown in Table 2. Table 3 is devoted to the results in terms of area.

The results obtained with the unit delay model illustrate the main drawback of the partitioning approach for synthesis: quality can decrease considerably, since the algorithms are not applied globally to the network. With the unit delay model, the objective function corresponds to the minimization of the critical path lengths in the network.
Partitioning destroys such structures and yields poor results for the final circuits, with an average loss of quality of over 20%.

By contrast, the results for the more accurate nominal delay model show the opposite trend. Even though the synthesis process is no longer global over the network, due to partitioning, a performance improvement of more than 10% on average is obtained for nominal delay (Table 2). The nominal delay metric takes congested areas into account. The partitioning exposes such zones, since splitting an area of high connection density would increase the cut. The decomposition phase thus benefits from the partitioning, which makes it possible to improve the delays at the boundaries of each congested zone individually. This leads to an overall better mapped design.

Area results are roughly equivalent with or without partitioning. The covering phase thus has a small impact on the packing phase. The MFFC-based clustering guarantees the optimality of the solution obtained by partitioning. The good delay performance therefore does not come at the cost of a significant loss of area for the design.

Table 1: Performance according to the unit delay model

unit delay    no partitioning    4 partitions    8 partitions
bigkey               4                  4               4
des                  9                 11              11
misex3               7                 10              10
seq                  9                  9              10
C7552                7                 11              13
ex1010               8                 11              12
pdc                 11                 14              13
s38417               8                  8               8
s38584.1            11                 13              12
avg gain             1            -24.3 %         -25.9 %

Table 2: Performance according to the nominal delay predict model

nominal delay    no partitioning    4 partitions    8 partitions
bigkey               12034              13494           13738
des                   6792               3274            3364
misex3                8702               9034            8734
seq                   6054               4980            6180
C7552                 4098               4222            4418
ex1010               21326              23130           22736
pdc                  24602              11686           13190
s38417                5902               5472            5294
s38584.1             16236              16140           14632
avg gain                 1             13.5 %          12.7 %

Table 3: Quality obtained in terms of area when applying the nominal delay predict model

area in # 4-LUT    no partitioning    4 partitions    8 partitions
bigkey                  1805               2025            2063
des                     1375               1197            1169
misex3                  2276               2143            2217
seq                     2068               1910            2034
C7552                    864                818             813
ex1010                  3285               3805            3748
pdc                     5893               5735            5905
s38417                  2784               3080            3141
s38584.1                3060               3109            3063
avg gain                   1             -1.7 %          -3.2 %

B. Speedup

Table 4 shows the speedups obtained by distributing the synthesis process over 4 and 8 processors. Times are in seconds for the sequential case, and speedups are given otherwise. Parallel execution times include partitioning, synthesis and merging. Circuits are ordered by increasing runtime for their synthesis on a single processor without partitioning.

Results show linear speedups on average. Since the partitioning tools have very small runtimes, partitioning does not penalize the overall execution times if the circuits are large enough. For example, bigkey requires only about 2 minutes of CPU time sequentially (125 s), and its speedups are small (1.3) on 4 or 8 processors. On the other hand, the synthesis of large circuits, such as pdc and the circuits that follow it in Table 4, benefits from the parallel approach. The poor result obtained for ex1010 on 8 processors is due to the imbalance of the partition sizes.

Table 4: Speedup on a network of workstations

circuit     seq. time (s)    speedup, 4 processors    speedup, 8 processors
bigkey           125                  1.3                      1.3
des              285                  3.2                      4.3
misex3           520                  3.3                      4.1
seq              571                  5.2                      5.6
C7552           1128                  8.4                     13.3
ex1010          1921                  3.7                      4.5
pdc             4277                  3.8                      6.2
s38417          4614                 11.4                     17.0
s38584.1        6600                  9.9                     17.5
average            1                  5.6                      8.2
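The figures of Table 4 can be cross-checked with a short calculation. The Python sketch below recomputes the average speedups, derives the parallel efficiency (speedup divided by the number of processors) and lists the circuits whose speedup exceeds the processor count in both configurations. The data are copied from Table 4; the variable names and the efficiency measure are additions made for this illustration.

```python
# circuit: (sequential time in s, speedup on 4 processors, speedup on 8 processors)
TABLE4 = {
    "bigkey":   (125,  1.3,  1.3),
    "des":      (285,  3.2,  4.3),
    "misex3":   (520,  3.3,  4.1),
    "seq":      (571,  5.2,  5.6),
    "C7552":    (1128, 8.4,  13.3),
    "ex1010":   (1921, 3.7,  4.5),
    "pdc":      (4277, 3.8,  6.2),
    "s38417":   (4614, 11.4, 17.0),
    "s38584.1": (6600, 9.9,  17.5),
}

# Average speedup per configuration, rounded as in the table.
avg4 = round(sum(s4 for _, s4, _ in TABLE4.values()) / len(TABLE4), 1)
avg8 = round(sum(s8 for _, _, s8 in TABLE4.values()) / len(TABLE4), 1)

# Circuits whose speedup exceeds the processor count in both configurations.
super_linear = [name for name, (_, s4, s8) in TABLE4.items()
                if s4 > 4 and s8 > 8]

print("average speedup:", avg4, avg8)
print("super-linear on 4 and 8 processors:", super_linear)
for name, (seq_time, s4, s8) in TABLE4.items():
    # Efficiency above 1.0 means super-linear speedup; parallel time is seq/speedup.
    print(f"{name:9s} eff4={s4 / 4:.2f} eff8={s8 / 8:.2f} "
          f"t8={seq_time / s8:.0f}s")
```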

For some of the circuits (C7552, s38417, s38584.1), speedups are super-linear. These results are mainly due to the reduction in problem size brought by the partitioning. Partitioning avoids the large memory footprint needed for the direct synthesis of circuits, and thus avoids swapping memory to disk, which slows down processing.

V. CONCLUSION

In this paper we have presented a partitioning approach for the performance-oriented technology mapping of large combinational circuits onto LUT-based FPGA. The synthesis process makes use of both exact and heuristic methods for optimizing the performance and area of circuits. The partitioning tools are adapted to the synthesis algorithms used. Although partitioning degrades performance when a simple unit delay model is used for delay estimation, results are much better with the more accurate nominal delay predict model. The partitioning exposes the congested areas of the circuits. This makes it possible to optimize delay propagation locally, and this local performance optimization increases the overall solution quality compared to a non-partitioning approach. A specialized partitioning tool limits the loss of area induced by partitioning. The model also increases the routability of the circuits, leading to better performance for the placed and routed designs. The partitioning approach makes it easy to parallelize the synthesis on a network of computers. The parallel algorithm provides linear speedups and avoids prohibitive runtimes for the rapid prototyping of large designs.

VI. ACKNOWLEDGEMENTS

Thanks to Prof. Jason Cong, from UCLA, who provided the source of his FlowMap package (version 0.2) for our experiments with SIS.

VII. REFERENCES

[1] K.-C. Chen et al. DAG-Map: Graph-based FPGA technology mapping for delay optimization. IEEE Design and Test of Computers, pages 7-20, September 1992.
[2] J. Cong and Y. Ding. On area/depth trade-off in LUT-based FPGA technology mapping. IEEE Trans. on VLSI Systems, 2(2):137-148, June 1994.
[3] J. Cong and Y. Ding. On nominal delay minimization in LUT-based FPGA technology mapping. Integration, the VLSI Journal, 18:73-94, November 1994.
[4] J. Cong et al. LUT-based FPGA technology mapping under arbitrary net-delay models. Computers and Graphics, 18(4):507-516, 1994.
[5] J. Cong and Y.-Y. Hwang. Structural gate decomposition for depth-optimal technology mapping in LUT-based FPGA designs. In Proc. ACM/IEEE Design Automation Conf., Las Vegas, NV, June 1996.
[6] J. Cong et al. Acyclic multi-way partitioning of boolean networks. In Proc. ACM/IEEE Design Automation Conf., pages 670-675, 1994.
[7] K. De and P. Banerjee. Parallel logic synthesis using partitioning. In Proc. Int'l Conf. on Parallel Processing, 1994.
[8] R. Francis et al. Technology mapping of lookup table-based FPGAs for performance. In Proc. Int'l Conf. on Computer-Aided Design, pages 568-571, Santa Clara, CA, November 1991.
[9] G. Geist et al. PVM: Parallel Virtual Machine - A Users' Guide and Tutorial for Network Parallel Computing. MIT Press, 1994.
[10] G. Karypis et al. Multilevel hypergraph partitioning: Application in VLSI domain. In Proc. ACM/IEEE Design Automation Conf., June 1997.
[11] L. Lemarchand. Parallel synthesis of large combinational circuits for FPGAs. In Proc. of High Performance Computing and Networking Europe'97, volume 1225 of Lecture Notes in Computer Science, Vienna, Austria, April 1997. Springer-Verlag.
[12] E. Sentovich et al. SIS: a system for sequential circuit synthesis. Memorandum UCB/ERL M92/41, University of California at Berkeley, May 1992.
[13] H. Yang and D.F. Wong. Edge-Map: Optimal performance driven technology mapping for iterative LUT based FPGA designs. In Proc. Int'l Conf. on Computer-Aided Design, pages 150-155, San Jose, CA, 1994.
[14] S. Yang. Logic synthesis and optimization benchmarks user guide. Technical report, Stanford University, 1991.