A New Enhanced Approach to Technology Mapping


Alan Mishchenko, Satrajit Chatterjee, Robert Brayton
Department of EECS, University of California, Berkeley
{alanmi, satrajit, brayton}@eecs.berkeley.edu

Xinning Wang, Timothy Kam
Strategic CAD Labs, Intel Corporation
{xinning.wang, timothy.kam}@intel.com

Abstract

An important part of the design flow, technology mapping, expresses logic functions of the netlist using gates from the technology library, in the presence of various design constraints. This paper proposes a new approach to technology mapping, which relies on several known techniques, combined and tuned to work in a new way. The previous work on DAG mapping is extended by proposing new methods for enumerating mapping choices and performing Boolean matching, which guarantees the delay-optimum phase assignment at the gate boundaries. Two ways of capturing flexibility in technology mapping are explored and compared: supergates and choice nodes. An experimental technology mapper developed in the MVSIS environment compares favorably with other technology mappers in terms of delay, area, and runtime.

1 Introduction

Technology mapping is an important step in the design flow. Typically, technology mapping is applied to a Boolean network after technology-independent logic synthesis [2][20]. The traditional approach to technology mapping uses the tree-covering algorithm [8] implemented in SIS [18]. It was improved in [13][7] by adding implicit enumeration of all algebraic decompositions. A different approach to technology mapping was proposed in [9][10], which performs technology mapping as part of Boolean decomposition. This method was extended in [16] to consider a wider class of decomposition choices and compute decomposition functions more efficiently. Although technology mapping has been an active research area for many years, a number of open problems remain to be solved.
The quality of mapping, in particular the delay/area/power trade-offs achieved by present-day technology mappers, is often suboptimal because only a small set of mapping choices is explored. The algebraic mappers, such as [8][13], do not use general Boolean properties of the functions, while Boolean mappers, such as [10][16], are often limited by the decomposition schemes employed and cannot beat the algebraic mappers. Besides, both high-effort algebraic mappers [13] and Boolean mappers [10][16] have long runtimes, which prevents their use for design-space exploration and fast prototyping.

The proposed approach to technology mapping incorporates advantages of the published methods and proposes new solutions where the previous work fails. The characteristic features of the present approach are:

- It relies on graph covering [13] instead of constructive decomposition [16].
- It is DAG-based [12] rather than tree-based [8].
- It combines algebraic [3] and Boolean [9] methods.
- It employs a new algorithm to enumerate subgraphs to be matched, similar to [5].
- It uses Boolean matching [1] rather than structural matching [8] and is implemented in a new way [6].
- It performs delay-optimal mapping [12], followed by area recovery [14][11].

The rest of the paper is organized as follows. Section 2 reviews relevant background on technology mapping. Section 3 discusses the previous work. Section 4 presents the new technology mapping flow. Section 5 discusses supergates. Section 6 shows experimental results. Section 7 gives conclusions and future work.

2 Background

Technology mapping consists of expressing logic functions of a circuit in terms of logic functions of elementary gates, which belong to some technology library. The circuit is represented by an object graph, while the gates are represented by subject graphs. The subject graphs, with possible duplications, should be connected to cover the object graph, while preserving its functionality.
Traditionally, both types of graphs are represented by NAND-graphs (the networks composed of two-input NANDs) or AND-INV graphs (AIGs) (the networks composed of two-input ANDs and inverters). The proposed approach is based on the AIG representation. The difference compared to other AIG-based approaches, such as [13], is that AIGs in the current work are enhanced with functional reduction.

Functionally reduced AIGs (FRAIGs) are constructed to guarantee that each node has a unique functionality, that is, that no two nodes have identical functionality in terms of the PI variables of the graph. A detailed discussion of FRAIGs and their advantages compared to the traditional AIGs can be found in [17]. To simplify the presentation in this paper, it can be assumed that the object graph is represented by the traditional AIGs, unless FRAIGs are specifically mentioned. A survey of several approaches to technology mapping is given in [4]. The functions mentioned in the algorithms and examples are completely specified Boolean functions. The terms delay and arrival time of a signal are used interchangeably.

3 Previous Work

The previous work on technology mapping can be divided into algebraic and Boolean. The algebraic work pioneered by [8] became the method of choice for many CAD tools, in particular, SIS [18]. The main concepts of algebraic mapping are: constructing the object graph using algebraic decomposition of logic functions, representing the mapping graphs using two-input nodes, splitting general DAGs into disjoint trees, delay-optimal mapping of the trees, and structural matching between the object graph of the network and the subject graphs of the gates.

The limitations of the algebraic approach [8] were addressed by several later approaches. A number of important improvements are proposed in [13]: an improved generation of the starting object graph, an implicit enumeration of all algebraic decompositions using choice nodes, and better handling of the design constraints. This work was further advanced in [7] by developing new methods for efficient area recovery and achieving practical area/delay trade-offs. Yet another improvement [12] adapted the linear-time delay-optimal DAG-mapping algorithm [5], originally designed for LUT-based FPGAs, to work for library-based mapping.
This algorithm, called DagMap, is implemented in the SIS environment.

Another major thrust to improve the quality of technology mapping led to the development of the Boolean approach [10][16]. It performs technology mapping in the context of constructive decomposition following the pioneering work on symbolic Boolean methods [9]. The main difficulty here is that Boolean methods are often too complicated and short-sighted to achieve good quality in the first pass over the netlist. They work better in the context of iterative re-synthesis, which targets incremental improvements to a mapped netlist. In this case, re-synthesis requires numerous updates to the netlist and incremental recomputation of timing information, which can make it very slow even for relatively small circuits.

Another difficulty with Boolean methods [9][15][16] is that they target the reduction in the size of the support of logic functions as the main criterion in choosing decompositions. This is often misleading. Analysis of good decompositions derived by algebraic methods [3] suggests that the support size in these decompositions can temporarily grow before decreasing sharply. This phenomenon suggests that some type of redundant re-encoding of the information, which increases the support size, may be necessary to find good decompositions.

4 New Mapping Flow

The proposed approach to technology mapping can be divided into several phases: (1) creating the starting AND-INV graph (AIG), (2) computing all k-feasible cuts for each node, (3) performing Boolean matching between the functions of cuts and those of the library gates, (4) mapping the internal nodes, (5) selecting delay-optimal mapping, (6) post-processing to recover area.

4.1 Creating the starting AIG

The object graph represented as an AIG can be constructed using two methods, which differ in the amount of flexibility present in the graph.
One netlist: This approach is similar to the way the object graph is constructed in the previous work [8][13][12]. It assumes that the netlist after technology-independent logic synthesis is available. The SOP representations of the nodes are converted into AIGs while minimizing the output arrival times measured in terms of the number of logic levels.

Example. Suppose product $p = abcd$ is being converted into an AIG, and the signal delays are $D(a) = 4$, $D(b) = 3$, $D(c) = 3$, $D(d) = 5$. The resulting AIG balanced by delay is $p = ((a(bc))d)$. Each two-input AND adds one level, so $D(bc) = 4$, $D(a(bc)) = 5$, and $D(p) = 6$.

The sums found in the SOPs of the nodes, along with the products, are converted into AIGs similarly, by transforming the two-input ORs into two-input ANDs using the DeMorgan rule.

Multiple netlists (not yet implemented): This approach uses the functionally reduced AIGs (FRAIGs) [17] to represent multiple structural implementations of the logic functions found in different versions of the same netlist. One way of deriving the set of netlists is to apply a logic optimization script to the original netlist and save the intermediate netlists after each command in the script. All the netlists are functionally equivalent, but their internal logic structure differs depending on the optimization commands applied. Constructing the FRAIG representation for all of them is equivalent to systematically storing all the different AIG structures for each logic function.

The FRAIGs store the alternative structures of the object graph using an implicit data structure similar to choice nodes [13]. The main difference is in the source of the flexibility. In [13] the choice nodes are used to enumerate different algebraic decompositions, while in the proposed approach FRAIGs detect and store functionally-equivalent, structurally-different AIGs coming from different versions of the same design arising in logic optimization.

In summary, the starting object graph can be choiceless or it may contain choice nodes. In both cases, the next stages of technology mapping work uniformly. They treat a choiceless netlist as the special case of the netlist with choices and try to make the most of the available flexibility. (The experimental results may compare the mapping quality with and without choice nodes.)

4.2 Computing k-feasible cuts

Definition. A cut $C$ for node $n$ is a set of nodes such that every path from node $n$ to the PI nodes contains a node belonging to $C$. Node $n$ is the root of $C$. The nodes in the set are the leaves of $C$. The nodes between the leaves and the root are the internal nodes of $C$. The size of the cut, denoted $|C|$, is the number of nodes in $C$.

Example. Figure 1 shows an AIG with the root $n_1$, the internal AND nodes $n_1$ through $n_6$, and the PI nodes $x_1$ through $x_5$. The bubbles on the edges denote inverters. Node $n_1$ has cuts $\{n_2, n_3\}$ and $\{n_2, n_5, n_6\}$, while the set of nodes $\{n_4, n_5\}$ is not a cut because there is a path from node $n_1$ to PI $x_5$ which does not go through $n_4$ or $n_5$.

[Figure 1. Example of an AIG.]

Definition. A cut $C$ of node $n$ is redundant if a leaf node can be removed from it while $C$ remains a cut of node $n$. A cut that is not redundant is called irredundant.

Definition. An irredundant cut $C$ is a k-feasible cut ($k > 0$) if $|C| \le k$.

Computation of all k-feasible cuts for all nodes in the network is performed in one pass over the nodes as shown in Figure 2. The cuts are defined by the interconnection of the nodes while the inverters on the edges are ignored. Procedure NetworkKFeasibleCuts calls procedure NodeKFeasibleCuts for the PO nodes. After checking the trivial cases and the case when the node is already visited, the procedure calls itself recursively for the fanins of the node.
The cut set of the node is derived by merging the cut sets of the fanins. The trivial cut composed of the node itself, {n}, is added to the resulting set. Procedure MergeCutSets initializes the resulting cut set to be empty, and considers all pairs of cuts from the two sets. For each pair, the merged cut is found as the union of the nodes belonging to the generating cuts. If the size of the merged cut does not exceed k, and the cut is encountered for the first time, it is added to the resulting cut set.

Example. The cut set of node $n_4$ is $\{\{x_1, x_2\}, \{n_4\}\}$. The cut set of node $n_5$ is $\{\{x_2, x_3\}, \{n_5\}\}$. Suppose we compute 2-feasible cuts of node $n_2$ using procedure MergeCutSets. The four cut pairs yield the following different cuts: $\{x_1, x_2, x_3\}$, $\{x_1, x_2, n_5\}$, $\{x_2, x_3, n_4\}$, $\{n_4, n_5\}$. Only one of them is 2-feasible, so the resulting cut set is $\{\{n_4, n_5\}\}$.

    void NetworkKFeasibleCuts( map_graph * Graph, int k )
    {
        map_node * node;
        for each PO (node) of the mapping graph (Graph)
            NodeKFeasibleCuts( node, k );
    }

    map_cutset * NodeKFeasibleCuts( map_node * n, int k )
    {
        map_cutset * Set1, * Set2, * SetRes;
        if ( n is const )   return ∅;
        if ( n is PI )      return { {n} };
        if ( n is visited ) return NodeReadCutSet( n );
        MarkNodeAsVisited( n );
        Set1 = NodeKFeasibleCuts( NodeReadChild1(n), k );
        Set2 = NodeKFeasibleCuts( NodeReadChild2(n), k );
        SetRes = MergeCutSets( Set1, Set2, k ) ∪ { {n} };
        NodeWriteCutSet( n, SetRes );
        return SetRes;
    }

    map_cutset * MergeCutSets( map_cutset * Set1, map_cutset * Set2, int k )
    {
        map_cutset * SetRes;
        map_cut * Cut1, * Cut2, * CutRes;
        SetRes = ∅;
        for each cut (Cut1) in Set1
            for each cut (Cut2) in Set2
            {
                CutRes = Cut1 ∪ Cut2;
                if ( |CutRes| ≤ k and CutRes ∉ SetRes )
                    SetRes = SetRes ∪ { CutRes };
            }
        return SetRes;
    }

Figure 2. Computation of all k-feasible cuts.

4.3 Boolean matching

When all k-feasible cuts of the nodes are computed, an attempt is made to implement each of them using the gates from the library.
Unlike most of the previous approaches to technology mapping, which use structural matching (comparing the AIG structure of the cut with those of the library gates), the proposed approach uses Boolean matching (comparing the Boolean function of the cut with those of the library gates). Boolean matching increases the likelihood of matching because it compares the function of the cut output, expressed in terms of the cut inputs, and disregards the interconnection structure of the internal nodes. Structural matching, by contrast, finds a match only if the internal nodes of the cut match as well. Structural matching can be made equally powerful by considering the complete set of AIG structures of each gate, but this is inefficient for large gates (such as supergates in Section 5), because large gates have many different AIG structures. An additional advantage of Boolean matching is that it can perform phase assignment as a by-product of matching. The phase assignment, discussed in the next subsection, selects the polarities of the inputs of the cut to minimize the arrival time of the output.
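The cut enumeration of Section 4.2, whose output Boolean matching consumes, can be made concrete with a small sketch. The Python below mirrors the recursion and merging of Figure 2 under stated assumptions: the `AIG` dictionary is a hypothetical graph consistent with what the text says about Figure 1 (edge inverters are ignored, as in the paper), the names `merge_cut_sets` and `node_cuts` are illustrative, and, like the pseudo-code, it does not filter redundant cuts.

```python
from itertools import product

# Hypothetical AIG modeled on the textual description of Figure 1:
# each AND node maps to its two fanins; PIs (x1..x5) have no entry.
AIG = {
    "n4": ("x1", "x2"),
    "n5": ("x2", "x3"),
    "n6": ("x4", "x5"),
    "n2": ("n4", "n5"),
    "n3": ("n5", "n6"),
    "n1": ("n2", "n3"),
}


def merge_cut_sets(set1, set2, k):
    """Pairwise unions of cuts from the two fanin cut sets, kept if size <= k."""
    result = set()
    for cut1, cut2 in product(set1, set2):
        merged = cut1 | cut2
        if len(merged) <= k:
            result.add(merged)  # a set silently ignores repeated cuts
    return result


def node_cuts(aig, node, k, memo):
    """Cut set of `node`: merged fanin cuts plus the trivial cut {node}."""
    if node in memo:  # already visited
        return memo[node]
    trivial = frozenset([node])
    if node not in aig:  # primary input
        cuts = {trivial}
    else:
        left, right = aig[node]
        cuts = merge_cut_sets(
            node_cuts(aig, left, k, memo), node_cuts(aig, right, k, memo), k
        )
        cuts.add(trivial)
    memo[node] = cuts
    return cuts


cuts_n2_k2 = node_cuts(AIG, "n2", 2, {})
cuts_n1_k3 = node_cuts(AIG, "n1", 3, {})
```

With `k = 2`, the only non-trivial cut of `n2` is `{n4, n5}`, as in the example of Section 4.2. With `k = 3`, the cuts of `n1` include `{n2, n3}` and `{n2, n5, n6}` but not `{n4, n5}`.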

Definition. The function of a cut is the Boolean function of the root of the cut expressed in terms of the leaves.

The function of the cut is computed by assigning the functions of the elementary variables to the leaves of the cut, followed by computing the functions of the internal nodes of the cut in a DFS order. At the end of this computation, the function of the root of the cut is available.

Example. Consider the AIG in Figure 1. The function of the cut $\{x_1, x_2, x_3\}$ with root $n_2$ is $f(x_1, x_2, x_3) = x_1 x_2 \,\&\, x_2 x_3 = (x_1 + x_2)(x_2 + x_3) = x_1 x_2 + x_1 x_3 + x_2 x_3$. The function of the cut $\{n_5, x_4, x_5\}$ with root $n_3$ is $f(n_5, x_4, x_5) = n_5 x_4 + n_5 x_5$. [The complement bars over some literals in these expressions are not preserved in the source.]

Definition. Two Boolean functions, $f(x)$ and $g(x)$, are NPN-equivalent if one of them can be derived from the other by selectively complementing the inputs (N), permuting the inputs (P), and optionally complementing the output (N).

Of particular interest in the current work on technology mapping is the N-equivalence of Boolean functions.

Definition. Two Boolean functions, $f(x)$ and $g(x)$, are N-equivalent if one of them can be derived from the other by selectively complementing the inputs.

Example. Consider functions $f_1$, $f_2$, and $f_3$, each of the form $x_1 x_2 x_3 + x_2 x_3 + x_4$ with different literals complemented. [The complement bars distinguishing the three functions are not preserved in the source.] It can be shown that $f_1$ and $f_2$ are not N-equivalent, while $f_1$ and $f_3$ are N-equivalent because they can be transformed into each other by complementing $x_2$ and $x_4$.

All Boolean functions of the given number of variables can be divided into N-equivalence classes. Representatives of each class can be transformed into each other by complementing their inputs, but there is no transformation of this type between representatives of different classes.

Definition. The truth table of a Boolean function is the bit-string of length $2^n$, where $n$ is the number of variables in the function.
The individual bits of the bit-string are equal to the values of the function for the input minterms ordered as natural numbers.

    x1 x2 x3 | f
    ---------+---
     0  0  0 | 0
     0  0  1 | 0
     0  1  0 | 1
     0  1  1 | 1
     1  0  0 | 1
     1  0  1 | 0
     1  1  0 | 1
     1  1  1 | 1

Figure 3. Truth table of f.

    c1 c2 c3 | Truth table | Integer
    ---------+-------------+--------
     0  0  0 | <00111011>  |  59
     0  0  1 | <00110111>  |  55
     0  1  0 | <11001110>  | 206
     0  1  1 | <11001101>  | 205
     1  0  0 | <10110011>  | 179
     1  0  1 | <01110011>  | 115
     1  1  0 | <11101100>  | 236
     1  1  1 | <11011100>  | 220

Figure 4. Canonical form of f.

Example. Consider function $f = x_1 \bar{x}_3 + x_2$. For variable ordering $(x_1, x_2, x_3)$, the truth table is shown in Figure 3. The corresponding bit-string is <00111011>.

Definition. The phase transforming function $f$ into a representative $g$ of its N-equivalence class is the bit-string of length $n$, where $n$ is the number of variables in $f$. The individual bits, $c_i$, of the phase bit-string show whether the corresponding variables, $x_i$, of $f$ should be complemented to transform $f$ into $g$.

Example. Phase <001> transforms the truth table <00111011> into <00110111>, as shown in Figure 4.

Property 1. Let functions $f$ and $g$ belong to the same N-equivalence class. Then the phase transforming $f$ into $g$ is the same as the phase transforming $g$ into $f$: $p_{f \to g} = p_{g \to f}$.

Definition. The N-canonical form of a Boolean function is the representative of its N-equivalence class whose truth table has the smallest integer value.

Example. Consider function $f = x_1 \bar{x}_3 + x_2$. The complete set of eight transformations of its truth table is shown in Figure 4. The truth table <00110111> is the canonical form because it corresponds to the smallest integer value (55). The phase transforming $f$ into its canonical form is <001>.

Property 2. Functions $f$ and $g$ are N-equivalent iff their N-canonical forms are identical. Suppose $p_f$ and $p_g$ are phases transforming N-equivalent functions $f$ and $g$ into their canonical form. Then the phase transforming $f$ into $g$ is $p_{f \to g} = p_f \oplus p_g$, where $\oplus$ is the bitwise EXOR operation.
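The definitions above can be checked on the running example with a short sketch. The Python below is only an illustration, not the paper's implementation: truth tables and phases are held as bit-strings in the paper's ordering (minterm 0 leftmost, $x_1$ as the most significant variable), the function names are hypothetical, and the canonical form is found by brute force over all $2^n$ phases.

```python
from itertools import product


def apply_phase(table: str, phase: str) -> str:
    """Complement the inputs selected by `phase` (one bit per variable, x1 first).

    Complementing inputs permutes minterms: entry m of the new table is
    entry (m XOR phase) of the old one.
    """
    n = len(phase)
    assert len(table) == 1 << n
    mask = int(phase, 2)
    return "".join(table[m ^ mask] for m in range(1 << n))


def n_canonical(table: str, n: int):
    """Return (canonical table, phase) by trying all 2^n phases."""
    best_table, best_phase = None, None
    for bits in product("01", repeat=n):
        phase = "".join(bits)
        candidate = apply_phase(table, phase)
        if best_table is None or int(candidate, 2) < int(best_table, 2):
            best_table, best_phase = candidate, phase
    return best_table, best_phase


def xor_phase(p: str, q: str) -> str:
    """Bitwise EXOR of two phase bit-strings of equal length."""
    return "".join("1" if a != b else "0" for a, b in zip(p, q))


f = "00111011"                      # truth table of Figure 3 (integer 59)
canon_f, p_f = n_canonical(f, 3)    # canonical form and phase of f
g = apply_phase(f, "110")           # an N-equivalent function (row 110 of Figure 4)
canon_g, p_g = n_canonical(g, 3)
p_f_to_g = xor_phase(p_f, p_g)      # Property 2: phase transforming f into g
```

For this input, the search reproduces Figure 4: the canonical table is <00110111> (55) with phase <001>, and the EXOR of the two canonical phases recovers the phase <110> that was used to derive `g` from `f`.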
The proposed Boolean matching procedure pre-computes truth tables of all gates from the library and their N-canonical forms. During the pre-computation phase, which is performed once for each technology library, a hash table is created. This table maps the N-canonical form into a set of gates implementing this form. Additionally, each gate is associated with the phase, which transforms the truth table of the gate into its N-canonical form.

The Boolean matching considers k-feasible cuts of all nodes of the object graph in an arbitrary order. For each cut, the following is computed: the truth table, its N-canonical form, and the phase transforming the truth table into its canonical form. The canonical form is used to find the set of gates which can implement this cut. For each gate, the transformation phase from the function of the gate into the function of the cut is found using Property 2.

4.4 Mapping internal nodes

The delay-optimal mapping of the object graph is computed using procedure NetworkComputeMapping, shown in Figure 5. The arrival times of the PIs of the object graph are set to 0, or to the values supplied in the netlist specification. The arrival times of all nodes, considered in topological order, are computed using the arrival times of the fanins by calling procedure NodeComputeMapping.

Procedure NodeComputeMapping iterates through the previously computed k-feasible cuts of each node. For each cut, the arrival time of the cut output is determined. The cut with the earliest arriving output is selected to implement the node. Although not shown in the pseudo-code, the area of the cuts is used as a tie-breaker.

The arrival time of a cut is determined by calling procedure CutComputeArrivalTime. This procedure uses the results of Boolean matching found at the previous steps: the set of gates which can implement the cut, each of them with its own phase. When all the cuts are considered, the gate with the earliest output arrival time is selected as the gate used to implement the cut.

The arrival time of a gate is computed using procedure GateComputeArrivalTime. This procedure takes the gate from the technology library, the cut matched with this gate by Boolean matching, and the phase transforming the truth table of the gate into that of the cut. The phase tells what polarities of the cut inputs should be used as the gate inputs for the gate's output to produce the function of the cut.

    void NetworkComputeMapping( map_graph * Graph )
    {
        map_node * node;
        SetPiArrivalTimes( Graph );
        for each node (node) of the mapping graph (Graph) in topological order
            NodeComputeMapping( node );
    }

    map_time NodeComputeMapping( map_node * n )
    {
        map_cut * Cut, * CutBest;
        map_time DelayCut, DelayBest;
        CutBest = NULL;
        DelayBest = INFINITY;
        for each cut (Cut) of the node (n)
        {
            DelayCut = CutComputeArrivalTime( Cut );
            if ( DelayBest > DelayCut )
            {
                DelayBest = DelayCut;
                CutBest = Cut;
            }
        }
        NodeSetOptimalCut( n, CutBest );
        return DelayBest;
    }

    map_time CutComputeArrivalTime( map_cut * Cut )
    {
        map_gate * Gate, * GateBest;
        map_time DelayGate, DelayBest;
        map_truth Table;
        map_phase Phase;
        Table = CutReadTruthTable( Cut );
        GateBest = NULL;
        DelayBest = INFINITY;
        for each gate (Gate) matching [...]
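The selection logic described in Section 4.4 can be sketched in a few lines of Python. This is a minimal illustration under assumptions that are not taken from the paper: each matched gate is given as a hypothetical name with load-independent pin-to-pin delays, the arrival time of a gate is taken to be the maximum over its inputs of leaf arrival time plus pin delay, and phases, inverters, and the area tie-breaker are ignored.

```python
import math


def gate_arrival(pin_delays, leaf_arrivals):
    """Arrival time at the gate output (assumed model: max of arrival + pin delay)."""
    return max(t + d for t, d in zip(leaf_arrivals, pin_delays))


def cut_arrival(matches, leaf_arrivals):
    """Earliest output arrival over all gates matched to one cut."""
    best_time, best_gate = math.inf, None
    for gate_name, pin_delays in matches:
        t = gate_arrival(pin_delays, leaf_arrivals)
        if t < best_time:
            best_time, best_gate = t, gate_name
    return best_time, best_gate


def node_mapping(cuts, arrival):
    """Pick the cut (and gate) with the earliest output arrival time for a node."""
    best = (math.inf, None, None)
    for leaves, matches in cuts:
        t, gate = cut_arrival(matches, [arrival[x] for x in leaves])
        if t < best[0]:
            best = (t, leaves, gate)
    return best


# Arrival times of already processed nodes (m is an internal node).
arrival = {"a": 0.0, "b": 0.0, "c": 2.0, "m": 1.0}

# Two cuts of a node n, each with its hypothetical matched gates.
cuts_of_n = [
    (("m", "c"), [("AND2", (1.0, 1.0))]),
    (("a", "b", "c"), [("AND3_slow", (1.5, 1.5, 1.5)),
                       ("AND3_fast", (2.0, 2.0, 0.5))]),
]

best_time, best_leaves, best_gate = node_mapping(cuts_of_n, arrival)
```

Here the two-input cut gives arrival 3.0, while the three-input cut matched with the gate whose late-arriving pin is fast gives 2.5, so the latter is stored as the best cut of the node.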

Procedure NodeSelectOptimalDelayMapping assumes that the node for which it is called should be implemented in the given polarity. If this procedure has already been called for the node in the given polarity, the result is returned. Otherwise, one of the two polarities of the node, positive or negative, is selected using the arrival times. The possibility of adding an inverter to the opposite polarity of the node is also considered. The selected cut and its transformation phase are retrieved. Next, the same procedure is called recursively for the fanins of this cut, which are needed in the polarity given by the transformation phase. Finally, the optimal delay cut and its polarity are stored at the node.

4.6 Area recovery

Currently, efficient area recovery is not implemented. Two approaches can be explored: iterative minimization of area flow [13] and incremental re-synthesis through symbolic resubstitution [11]. It is expected that both options will recover some area without increasing delay. The first option is likely to be faster, but the improvement is expected to be around 10%. The second option is likely to be slower while possibly giving improvements of up to 30%. Both approaches should be implemented and tested.

5 Supergates

5.1 Motivation

Gate libraries vary in the quantity and diversity of their elementary gates. Some libraries include only a few gates, while other libraries are composed of negative unate gates. The problem is that the functionality of gates in such libraries is not diverse enough to match the functionality of the majority of cuts computed for the nodes in the object graph. To increase the diversity of Boolean functions that can be implemented, it is natural to group the elementary gates into gate clusters or gate combinations, called supergates. This approach was introduced in [16] and proved useful for increasing the efficiency of technology mapping.
More formally, a supergate is a single-output combinational Boolean network composed of elementary gates belonging to the given library.

[Figure 7. A supergate.]

[Figure 8. Supergate generation.]

Example. Figure 7 shows a typical supergate with two logic levels of elementary gates. The two-input EXOR is the root gate, while the three-input OR and the two-input AND with a shared input occupy the second level.

For a library of gates, the supergate library is generated as a preprocessing step before mapping. The supergate generation is guarded by constraints and resource limits, such as the limit on the number of inputs, the limit on the total area and delay, and the runtime limit. The generation process is described in the following subsection.

It is important that the supergate library is generated once, stored compactly in a file, and used when technology mapping is invoked. The supergates should be recomputed only if changes are made to the original library. This is why supergate generation has the additional advantage of reducing the total runtime of mapping by pre-computing and reusing the mapping information, which depends on the library but does not depend on the netlist to be mapped.

A supergate library can be viewed as an elementary gate library which contains a large number of gates with diverse functionality. A library of elementary gates can be viewed as a supergate library, with supergates composed of one level of the elementary gates. The algorithms in this paper work uniformly for both types of libraries.

5.2 Generation

Overview: The supergate generation is performed recursively. One recursive step adds one level of gates on top of the available supergates. At the beginning, the set of available supergates is the set of elementary variables. In the current implementation, their number may be up to six; it depends on the largest allowed support size of the generated supergate library.
In each recursion step, all possible root gates (the elementary gates from the library) are considered, and the available supergates are plugged into the root gates to create new supergate candidates.

Example. The generation process is illustrated in Figure 8. The root gate AOI21 is added on top of three supergates: S1 (the elementary variable a), S2 (the supergate composed of two NAND2 gates), and S3 (the OR2 gate). The resulting candidates have an additional level of elementary gates compared to the starting set of supergates.

Before a candidate supergate is accepted, its truth table is computed and the hash table is checked for a supergate with the same functionality but better delay-area parameters. If such a gate exists, the new supergate is discarded. Otherwise, the new supergate is added to the set of supergates used to generate the next level.

Constraints: The runtime of supergate generation can be dramatically reduced by applying constraints on the candidates. The constraints on the runtime of generation and on the number of logic levels are straightforward. Two other types of constraints are trickier to implement: restrictions on the maximum area and on the maximum pin-to-pin delay of the resulting supergates.

To ensure that a candidate with a pin-to-pin delay exceeding the given limit is never created, the available supergates are sorted by their maximum pin-to-pin delay. Then, when we consider a root gate with some value of the maximum pin-to-pin delay, we only try the supergates whose delay, when added to the delay of the root gate, does not exceed the delay limit.

Example. Consider the supergate candidate in Figure 8. Suppose AOI21 has delay 1.7, NAND2 has delay 1.0, and OR2 has delay 1.5. Suppose the global delay limit imposed on the generated supergates is 4.0. In this case, the largest allowed maximum delay for the component supergates is equal to the global delay limit minus the delay of the root gate (4.0 - 1.7 = 2.3). The supergate composed of two NANDs has delay 2.0. Both this supergate and OR2 have delay less than 2.3. Therefore, the supergate candidate in Figure 8 will be considered. Now suppose instead that the delay of NAND2 is 1.2. In this case, the supergate composed of two NANDs will have delay 2.4, which is larger than 2.3. Therefore, this supergate will not be allowed as a component when generating supergates with the root gate AOI21.

A similar restriction can be developed for area. For each root gate, the available supergates are sorted by area. While available supergates are added to the supergate under construction, the maximum area limit is checked. If the limit is exceeded, the supergate under construction is dropped and another one is tried.

Storage: The resulting supergates are stored in a file in the form of a tree. Instead of writing each supergate individually as one line in the output file, all supergates are written as a shared tree of gates, beginning with the elementary variables. In the process of writing, some subtrees may themselves be supergates, while others may be just useful parts of supergates. The former are denoted by a special symbol. Each line of the output file (except the comments and the header lines) describes one gate. The supergates are referred to by the numbers of the lines on which they are described.
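The delay constraint and the hash-table check described in this subsection can be sketched as follows. This is an illustration under stated assumptions rather than the paper's implementation: the `Supergate` record and the functions `delay_allowed`, `usable_components`, and `accept` are hypothetical names, delay is modeled as a single number per supergate, one entry is kept per function, and "better delay-area parameters" is interpreted as "no worse in both delay and area". The delays are those of the example above; the areas and truth-table values are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Supergate:
    name: str
    truth_table: int   # function encoded as an integer bit vector
    delay: float       # maximum pin-to-pin delay
    area: float


def delay_allowed(root_delay, component, delay_limit, eps=1e-9):
    """A component may feed the root gate only if the summed delay fits."""
    return root_delay + component.delay <= delay_limit + eps


def usable_components(root_delay, available, delay_limit):
    """Sort by delay and stop at the first supergate that is too slow."""
    result = []
    for sg in sorted(available, key=lambda s: s.delay):
        if not delay_allowed(root_delay, sg, delay_limit):
            break  # every remaining supergate is at least as slow
        result.append(sg)
    return result


def accept(table, cand):
    """Keep cand unless the stored supergate with the same function is no worse.

    Simplification of this sketch: only one supergate is stored per function.
    """
    best = table.get(cand.truth_table)
    if best is not None and best.delay <= cand.delay and best.area <= cand.area:
        return False  # dominated: discard the candidate
    table[cand.truth_table] = cand
    return True


# Numbers from the example: AOI21 root (delay 1.7), global limit 4.0.
AOI21_DELAY = 1.7
DELAY_LIMIT = 4.0

var_a = Supergate("a", 0b1010, 0.0, 0.0)
or2 = Supergate("OR2", 0b1110, 1.5, 2.0)
two_nands = Supergate("NAND2-NAND2", 0b1000, 2.0, 2.0)       # NAND2 delay 1.0
slow_two_nands = Supergate("NAND2-NAND2", 0b1000, 2.4, 2.0)  # NAND2 delay 1.2

fast_case = usable_components(AOI21_DELAY, [two_nands, or2, var_a], DELAY_LIMIT)
slow_case = usable_components(AOI21_DELAY, [slow_two_nands, or2, var_a], DELAY_LIMIT)
```

With NAND2 delay 1.0 all three components pass the filter; with NAND2 delay 1.2 the two-NAND supergate (delay 2.4 > 2.3) is excluded. Sorting first lets the scan stop early, which is the point of keeping the available supergates ordered by delay.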
5.3 Comparison with choice nodes

Both supergates and choice nodes are introduced to improve the quality of mapping by increasing the search space. An interesting question is how the flexibilities afforded by these two notions relate to each other. An experiment should be performed to get a practical answer to this question. Here we give a general argument showing that the sources of flexibility provided by supergates and choice nodes are complementary.

Supergates are limited, by construction, to several logic levels of gates. As a result, the extended search space due to supergates is relatively shallow. However, this space is densely populated because the generation process is exhaustive for the given number of logic levels. On the other hand, the choice nodes derived from different versions of the netlist lead to a search space extension that is deep and sparse. It is deep because the structural differences between the netlists used to generate the choices may encompass many logic levels. It is sparse because the number of structural differences is relatively small: their enumeration is not exhaustive and is controlled by external resource limits.

6 Experimental results

Experiments may include comparison of the new technology mapper with the SIS mapper [8][19], the mapper based on enumeration of algebraic decompositions [13], the later improvements to this mapper [7], the DAG mapper DagMap [12], and constructive decomposition [16].

It may also be interesting to perform the following experiments to evaluate the impact of various aspects of technology mapping on the quality of the final results:
- The contribution of supergates and choice nodes. (How much do they add to quality? Which is more powerful? Do they cover the same space?)
- The area recovery using different options.
- The trade-off between quality and runtime when different numbers of choice nodes are used.

7 Conclusions

This paper describes a new approach to technology mapping.
This approach is based on the combination of known techniques with the following distinctive features:
- It gracefully combines Boolean and algebraic methods. It uses algebraic decomposition to construct the starting object graph and allows for Boolean transformations of this graph to increase the number of mapping choices.
- It uses Boolean matching rather than structural matching when expressing the functionality of cuts in the AIG in terms of the gates from the library.
- It improves the mapping quality by controlling the amount and origin of mapping choices added to the object graph. The choices can be added using algebraic decomposition, similar to [13], based on different versions of the same network, or based on pre-computed tables of equivalent AIG structures.
- It reduces the runtime by fine-tuning the implementation and relying on pre-computation whenever possible.

Future work will include:
- Experiments with different ways of adding choices (for example, using Boolean decomposition or a precomputed library of small AIGs).
- Exploring the role of don't-cares in Boolean matching, in particular for critical path synthesis. For example, using subsets of local SDCs may increase the efficiency of Boolean matching without sacrificing compatibility, as might be the case with complete don't-cares.

References

[1] L. Benini, G. DeMicheli, A survey of Boolean matching techniques for library binding, ACM TODAES, Vol. 2, No. 3, July 1997, pp. 193-226.
[2] R. Brayton, G. Hachtel, A. Sangiovanni-Vincentelli, Multilevel logic synthesis, Proc. IEEE, Vol. 78, Feb. 1990.
[3] R. K. Brayton and C. McMullen, The decomposition and factorization of Boolean expressions, Proc. ISCAS '82, pp. 29-54.
[4] S. Hassoun and T. Sasao, eds., Logic synthesis and verification, Kluwer 2002, Chapter ???, Technology mapping, pp. ???.
[5] J. Cong, Y. Ding, FlowMap: An optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs, IEEE Trans. CAD, Vol. 13, No. 1, January 1994, pp. 1-12.
[6] D. Debnath and T. Sasao, Fast Boolean matching under variable permutation using representative, Proc. ASP-DAC '99, pp. 359-362.
[7] D.-J. Jongeneel, R. Otten, Y. Watanabe, R. K. Brayton, Area and search space control for technology mapping, Proc. DAC '00, pp. 86-91.
[8] K. Keutzer, DAGON: Technology binding and local optimizations by DAG matching, Proc. DAC '87, pp. 617-623.
[9] V. N. Kravets, Constructive multi-level synthesis by way of functional properties, Ph.D. Thesis, University of Michigan, 2001.
[10] V. N. Kravets and K. A. Sakallah, Constructive library-aware synthesis using symmetries, Proc. DATE '00, pp. 208-216.
[11] V. N. Kravets and P. Kudva, Implicit enumeration of structural changes in circuit optimization, Proc. DAC '04, pp. ???.
[12] Y. Kukimoto, R. K. Brayton, P. Sawkar, Delay-optimal technology mapping by DAG covering, Proc. DAC '98, pp. 348-351.
[13] E. Lehman, Y. Watanabe, J. Grodstein, and H. Harkness, Logic decomposition during technology mapping, IEEE Trans. CAD, Vol. 16, No. 8, 1997, pp. 813-833.
[14] V. Manohararajah, S. D. Brown, Z. G. Vranesic, Heuristics for area minimization in LUT-based FPGA technology mapping, Proc. IWLS '04.
[15] A. Mishchenko, B. Steinbach, M. Perkowski, An algorithm for bi-decomposition of logic functions, Proc. DAC '01, pp. 103-108.
[16] A. Mishchenko, X. Wang, T. Kam, A new enhanced constructive decomposition and mapping algorithm, Proc. DAC '03, Los Angeles, pp. 143-147.
[17] A. Mishchenko, R. J.-H. Jiang, R. K. Brayton, FRAIGs: Functionally reduced AND-INV graphs. http://www.ee.pdx.edu/~alanmi/fraig
[18] E. Sentovich, et al., SIS: A system for sequential circuit synthesis, Tech. Rep. UCB/ERL M92/41, ERL, Dept. of EECS, Univ. of California, Berkeley, 1992.
[19] H. Touati, Performance-oriented technology mapping, Ph.D. dissertation, UC Berkeley, November 1990.
[20] C. Yang and M. Ciesielski, BDS: A BDD-based logic optimization system, IEEE Trans. CAD, Vol. 21, No. 7, July 2002, pp. 866-876.