Reducing Structural Bias in Technology Mapping


S. Chatterjee, A. Mishchenko, R. Brayton
Department of EECS, U. C. Berkeley
{satrajit, alanmi,

X. Wang, T. Kam
Strategic CAD Labs, Intel Corporation
{xinning.wang,

ABSTRACT

Technology mapping based on DAG-covering suffers from the problem of structural bias: the structure of the mapped netlist depends strongly on the subject graph. In this paper we present a new mapper aimed at mitigating structural bias. It is based on a simplified cut-based Boolean matching algorithm, and using the speed afforded by this simplification we explore two ideas to reduce structural bias. The first, called lossless synthesis, leverages recent advances in structure-based combinational equivalence checking to combine the different networks seen during technology-independent synthesis into a single network with choices in a scalable manner. We show how cut-based mapping extends naturally to handle such networks with choices. The second idea is to combine several library gates into a single gate (called a supergate) in order to make the matching process less local. We show how supergates help address the structural bias problem, and how they fit naturally into the cut-based Boolean matching scheme. An implementation based on these ideas significantly outperforms state-of-the-art mappers in terms of delay, area and run-time on academic and industrial benchmarks.

1. INTRODUCTION

The task of technology mapping in standard-cell logic synthesis is to express a given Boolean function as a network of gates chosen from a given standard-cell library so as to optimize some objective function such as total area or delay. In these general terms, technology mapping is intractable. The problem is usually simplified by first representing the Boolean function as a good initial multi-level network of simple gates called the subject graph. The subject graph is then transformed into a multi-level network of library gates by means of local substitutions. This simplification means that the structure of the subject graph dictates to a large extent the structure of the mapped network; this is known as structural bias.

In this work we present a new Boolean technology mapper aimed at mitigating the effects of structural bias. At the core of the mapper is a simplified Boolean matching algorithm that is faster than structural matching and produces better results. With the speed afforded by this matching technique, we propose two complementary techniques to reduce structural bias: lossless synthesis and supergates.

Lossless Synthesis. To obtain a good structure for the subject graph, a number of technology-independent synthesis steps are usually performed. An example of this is the SIS rugged script shown in Figure 1(a). Each step in the script is heuristic, and the subject graph produced at the end of the script is not necessarily optimal; indeed it is possible that an intermediate network is better in some respects than the final network. We explore the idea of combining these intermediate networks into a single subject graph with choices and using that to derive the mapped netlist. This is shown schematically in Figure 1(b). Following the work of Lehman et al. [13], the mapping is done optimally over all the networks encoded in the subject graph. The mapper is not constrained to use any one network, but can pick and choose the best parts of each network. We call this lossless synthesis since no network seen during the synthesis process is ever lost. By including the initial network in the choice network, we can be sure that the heuristic logic synthesis operations never make things worse. Furthermore, as in [13], multiple scripts (with possibly different objectives) could be used to accumulate more choices.

Figure 1: (a) Conventional synthesis. (b) Lossless synthesis. In lossless synthesis intermediate networks are combined to create a choice network which is then used for mapping.

Figure 2: Simple example illustrating the utility of supergates. (a) The subject graph. (b) An easy match to find. (c) A hard match to find. Consider the subject graph shown in (a), where each node represents a 2-input AND gate and a dashed edge indicates the presence of an inverter. The match in (b) is easy to find since both inputs of the XOR are present in the subject graph. In contrast the match in (c) is hard to find, since the MUX input labeled x is not present in the subject graph. The match in (c) can be found by combining some gates into a supergate. Note that all the inputs of the supergate are still required to be present in the subject graph.

The challenging part of this proposal is the efficient creation of the choice network. Towards this end we propose to leverage recent advances in combinational equivalence checking. State-of-the-art equivalence checkers depend on finding functionally equivalent internal points in the networks being checked in order to reduce the complexity of the decision procedure. By using a combination of random simulation and SAT it is possible to quickly determine equivalent points in two circuits. These equivalent points provide the choices for mapping in our proposal.

Supergates. A supergate is a single-output combinational network of a few library gates which is treated as a single library gate by our algorithms. The advantage of doing this may not be immediately obvious: one might expect that if a supergate matches at a node, the conventional matching algorithm would return the same result, except that it would match library gate by library gate rather than all the gates at once. This is not true, and Figure 2 provides a simple counter-example. The subject graph is shown in Figure 2(a). Conventional mapping would find the mapping consisting of the XOR and AOI gates as shown in Figure 2(b), but would fail to find the mapping with the MUX as shown in Figure 2(c). To see this, observe that one of the inputs of the MUX (labeled x in Figure 2(c)) is not present in the subject graph. Consequently, the MUX by itself is not a valid match for f. In contrast, the inputs of the XOR (nets p and d) in Figure 2(b) are both present in the subject graph, thus making the XOR a valid match at f. Now if we connect the MUX, the XOR, and the inverter together to form a new gate (in the manner shown in Figure 2(c)), it is easily verified that this new gate is a match at f in the subject graph. This new gate is a supergate built from the library gates expressly for the purpose of finding better matches.

This example illustrates the main idea behind supergates: using bigger gates allows the matching procedure to be less local, and thus less affected by the structure of the subject graph. Furthermore, as this example illustrates, supergates are useful even with standard-cell libraries that are functionally rich.

The three ideas presented in this paper (the faster matching algorithm, lossless synthesis, and supergates) are logically independent. However, one of the contributions of this paper is to show how these ideas come together naturally in the context of cut-based Boolean mapping.

In Section 2 we present related work. Section 3 has an overview of the mapper and details of the matching algorithm. In Section 4 we present practical techniques to construct the choice network for lossless synthesis and extend the basic mapping to handle choices. In Section 5 we present details of supergate construction. In Section 6 we present the results of experiments designed to determine the run-time versus quality trade-off of the techniques presented in this paper. We also present comparisons with other industrial and academic mappers. We conclude in Section 7 by pointing out some limitations of these techniques and suggesting future work.

Finally, we note that the focus of this paper is on mitigating structural bias, that is, on the logical aspects of the technology mapping problem.
Therefore, to simplify exposition, we present the algorithms in the context of a simple gain-based delay model which does not take capacitive loading into account. Such a model is useful in practice in a flow that uses technology mapping for topology selection, and does sizing and buffering iteratively with placement. Furthermore, the techniques in this paper continue to be applicable in a more traditional load-based flow, since essentially they are techniques to increase the number of topologies considered during mapping, and are orthogonal to the usual techniques of considering loads during technology mapping.

2. RELATED WORK

The literature on technology mapping provides a spectrum of techniques that trade structural bias for computational complexity. The classical structural approaches, such as tree- and DAG-covering [8, 12], lie at one end of this spectrum. They have relatively short run-times but provide sub-optimal results since their mapping choices are completely constrained by the given subject graph.

Constructive Boolean approaches [9, 19] lie closer to the other end of the spectrum. Although they do not depend as much on the structure of the subject graph, they are limited by the choice of their (heuristic) decomposition schemes and by their local view, since they decompose one node at a time. These limitations, combined with long run-times, make them useful mostly in a re-synthesis flow after a mapped network has been obtained by some other means.

The approach described by Lehman et al. [13] lies between these two extremes: a number of different local algebraic decompositions are encoded into the given subject graph as choices. The subject graph itself is constructed by combining the results of different technology mapping scripts. The ideas relating to lossless synthesis presented in this paper are variations on this basic approach. Beyond the obvious difference of using intermediate networks from one synthesis script, there are significant algorithmic differences. First, the use of structural equivalence checking to detect choices (as opposed to BDDs) allows for lossless synthesis on large circuits. Second, the extension of Boolean matching to handle choices presented in this paper overcomes the run-time limitations of the structural matching methods used by Lehman et al. Furthermore, there is an improvement in quality since matching is done directly on general DAGs and complex multi-fanout gates are handled more naturally.
Wavefront mapping [21] is a practical enhancement of [13] that maps the circuit in stages, thus reducing the amount of match data that is stored at one time, though at the cost of optimality. An extension of this algorithm allows for dynamic decomposition based on a partially mapped circuit. Compared to the approach of Lehman et al. it reduces the number of decompositions that need to be explored. The idea of a wavefront can be used in conjunction with the techniques in this paper to reduce the amount of match data that needs to be stored in memory at one time.

In addition, this paper puts forth the idea of combining library gates into supergates to obtain a more Boolean matching. By squinting a little, one can view supergates as an alternate source of choices. The choices induced by supergates come from the library rather than the network. In the literature similar combinations of gates have been used for other purposes such as constructive decomposition [19] and rewiring [2]. We note here that supergates allow library-aware Boolean decompositions to be explored, as opposed to the algebraic decompositions in [13]; we defer the discussion to Section 5.5.

The work related to the simplified Boolean matching is presented in Section 3.2.

3. BOOLEAN MATCHING

We introduce the simplified Boolean matching algorithm in the context of the overall mapping flow. The mapping procedure is a Boolean, cut-based one [4, 6, 7] on a DAG, using dynamic programming [3, 12] to guarantee delay optimality.

3.1 Overview of Boolean Mapping

The input to the mapping procedure is an AND-Inverter graph (AIG) [11]. An AIG is a DAG whose nodes represent either AND gates or primary inputs (PIs). Its edges represent wires. Inverters are represented by bubbles on the edges. Given an AIG, the mapping is done in 5 steps.

Step 1. Compute k-feasible cuts. A feasible cut of a node N in the AIG is a set of nodes $\{X_i\}$ in the transitive fan-in cone of N such that an arbitrary assignment of values to the $X_i$ completely determines the value of N. A feasible cut is redundant if the value of a node in the cut is completely determined by an assignment of values to the other nodes in the cut. A k-feasible cut is a feasible cut of size at most k that is not redundant. The cut $\{N\}$ is always a k-feasible cut of node N (for any k) and is called the trivial cut.

Let $\Phi(N)$ denote the set of k-feasible cuts of node N. If N is a PI, then $\Phi(N) = \{\{N\}\}$. If N is an AND node with inputs A and B, then

$$\Phi(N) = \{\{N\}\} \cup \{\, u \cup v \mid u \in \Phi(A),\ v \in \Phi(B),\ |u \cup v| \le k \,\}$$

We compute all 5-feasible cuts of every node in the network by a simple bottom-up traversal based on the above recursion. Although in general a graph may have $O(n^5)$ 5-feasible cuts, we found that most test-cases have between 20 and 30 5-feasible cuts per node. We restrict our attention to 5-feasible cuts since our experiments show that the total number of cuts increases very quickly with k. For larger k, pruning techniques have to be applied, and the mapping results are not significantly better (since the pruning is quite arbitrary).

Step 2. Compute truth-tables of cuts. The next step is to compute the local function of a node in terms of its cut. This is done for every non-trivial k-feasible cut of every node in the network. Given a node N and a cut $\{X_i\}$ of that node, formal variables are assigned to each cut node (in no particular order). Using these variables, the functionality of the node is computed symbolically. We note here that BDDs are not necessary for this computation: since usually only 5-feasible cuts are considered, the symbolic function computation can be performed efficiently using 32-bit machine words to represent truth-tables. In what follows we use the words function and truth-table interchangeably.

Step 3. Boolean Matching. For each node in the network, for every cut, an appropriate gate (if one exists) is chosen from the library. Each gate thus chosen is called a match for the node.
Our matching procedure differs from the traditional approach and we present this in greater detail in the following section.

Step 4. Compute best arrival time at each node. Starting from the PIs and working in topological order towards the outputs, the best arrival time is computed for each node from amongst all its matches.

Step 5. Choose the best cover. In reverse topological order, the best gate for each primary output is chosen. Next, the best gates implementing the inputs of these gates are chosen, and so on until all primary inputs have been reached.

3.2 Simplified Boolean Matching

In the matching step we wish to choose the appropriate library gate to implement a given function. The traditional solution is to use NPN-equivalence classes. (Two Boolean functions are NPN-equivalent if one can be transformed into the other by permuting or negating the inputs or negating the output.) First the NPN-canonical representatives of the library gates are computed. The gates are stored in a hash table indexed by the representatives. During matching the NPN-canonical representative is computed for every cut function. It is used to look up the hash table to find the appropriate gate.

However this approach is slow since it requires the computation of the NPN-representative for every cut function. Furthermore, once the appropriate gate is found, the appropriate variable correspondence must be found between the library gate and the cut function. Since these computations are done in the inner body of the mapper, there has been a lot of research on speeding them up [1, 5, 7, 22].

Our simplified matching procedure is motivated by the fact that in the mapping procedure described above, the functions have only 5 variables. It is therefore advantageous to precompute all functions obtained by permuting the inputs to library gates and to add those to the hash table. Thus during the actual matching, there is no need to compute a canonical representative. The cut function can be used directly to look up the hash table. In addition to avoiding the NPN-equivalence checking, this also avoids the need to establish input correspondence after the gate is found.

Example. For an AOI gate having the function $(x_1 x_2 + x_3)$, we precompute all permutations, i.e. $(x_1 x_3 + x_2)$, $(x_2 x_1 + x_3)$, $(x_2 x_3 + x_1)$ etc. Note that both $(x_2 x_1 + x_3)$ and $(x_1 x_2 + x_3)$ are kept although they are the same function.
This is because in industrial libraries, the input-to-output delays are often different even for functionally symmetric pins.

Once again, all functional computations are done with truth tables. In Section 5.2 on supergate generation we describe this precomputation procedure in more detail. (The precomputation can be thought of as generation of supergates consisting of single library gates.) So far we have not considered the phase assignment at the inputs. This is because the mapping is done dual rail, as explained in the following section on optimal phase selection.

The problem with this simplified matching procedure is that it is not scalable. Indeed, this would not work if the functions we were interested in had, say, 10 variables. But in the framework of Boolean mapping it suffices since we deal with functions having only a few variables. (In Boolean mapping, as more variables are used, the cut computation becomes a bottleneck before the matching does.) This means that in the worst case (a library gate with 5 inputs, and no symmetries) 120 functions are added to the hash table (instead of 1 as in the NPN-representative case). However since each match is now only a hash table lookup (versus an NPN-representative computation, lookup, and input correspondence), the matching procedure runs faster.

Figure 3: In order to map p and q optimally for delay, the node x must be mapped in both polarities.

3.3 Optimal Phase Selection

The mapping quality can be improved by exploiting the additional flexibility of mapping each node in both polarities: positive (as above) and negative. When the final mapping is selected (in Step 5 of Section 3.1) the appropriate polarity is chosen to guarantee the shortest delay on each path. This may lead to an increase in area since the same node may be required in both polarities, and may have to be duplicated.

Example. Consider the AIG fragment in Figure 3. Node x is required in both polarities. If it is mapped in only one polarity then the arrival time at either p or q increases by an additional inverter delay.

Observe that this dual rail mapping means that the inputs of a cut are available in both polarities. Consequently the function of a cut is not precisely determined: it belongs to a class of functions which differ only by complementation of inputs. This class is called the N-equivalence class. There is greater flexibility since any gate belonging to the N-equivalence class may be used to implement the cut.

The simplified matching procedure above is extended to work with N-equivalence classes. During library preprocessing, for every permutation of a gate, the N-equivalence representative is computed. This is defined as the function with the lexicographically smallest truth table. Similarly during matching, for every node function the N-equivalence representative is computed and used for looking up in the hash table.

The alert reader will have noticed that this extension negates some of the speed benefit of the simplified matching procedure presented in Section 3.2. However this may be justified with the following arguments. First, computing the N-equivalence representative is less expensive than computing the NPN-equivalence representative since the class is smaller ($2^n$ versus $2^{n+1}\, n!$). Second, this is a necessary price to pay for exploring this larger search space of decompositions (c.f. the inverter transform in Lehman et al. [13]). In structural mappers this search space is explored by adding a pair of inverters between two nodes, and by adding a wire (with zero cost) to the library. That technique is not well suited for Boolean mapping since it significantly increases the number of cuts in the network.

3.4 Using Don't Cares

An advantage of the Boolean matching technique outlined above is the ability to handle local satisfiability don't cares during the matching process. If the matching is done with k-feasible cuts, then for every node, a cut larger than k is constructed.
This can be done by a simple depth-first search starting from the node. (For example if matching is done with 5-feasible cuts, a larger cut of size, say, 10 is considered.) For every k-feasible cut of the node, it is possible to compute the satisfiability don't cares symbolically. This can be done either with exhaustive simulation using truth-tables or with BDDs.

Once the satisfiability don't cares of a k-feasible cut are identified, each don't care minterm can be set to one or zero to obtain a completely specified cut function. This completely specified function is used to look up the hash table as before. Thus each k-feasible cut gives rise to a set of cut functions when satisfiability don't cares are considered. The main drawback of this method is the exponential blow-up in the number of cut functions, since each don't care gives rise to two possible functions. Thus in practice only an arbitrary subset of the cut functions can be used for matching.

Our preliminary experiments confirmed that using don't cares in this manner improved the quality of mapping. However, in our application the increase in quality was not justified by the increase in run-time.

4. LOSSLESS LOGIC SYNTHESIS

The idea behind lossless logic synthesis is to remember every network seen during a synthesis flow (or a set of flows) and to select the best parts of each network during technology mapping. This is useful for two reasons. First, technology-independent synthesis algorithms are usually heuristic, and so there is no guarantee that the final network is optimal. By mapping using only the final network, we might miss out on a better result that could be obtained from an intermediate network in the optimization flow. Second, synthesis operations usually apply to the network as a whole. So a flow to optimize delay might significantly increase area, since the whole network is optimized for delay. By combining such a delay-optimized network with another network that has been optimized for area, it is possible to get the best of both.
On the critical path, the mapper can choose from the delay-optimized network, whereas off the critical path, the mapper chooses from the area-optimized network.

The main problem is constructing the choice network efficiently. In Section 4.1 we give an overview of how this is done. In Section 4.2 we extend the Boolean mapping procedure of Section 3.1 to handle choices.

4.1 Constructing the choice network

The choice network is constructed from a collection of networks that are functionally equivalent. The key idea is to use recent advances in equivalence checking that are based on identifying functionally equivalent internal points in the networks being checked.

Conceptually the procedure is as follows: one can imagine each network to be decomposed into AND gates and inverters to form an AIG. Now for every node in the network the global function is computed, say by building BDDs. All those nodes which have the same global function are collected in equivalence classes. Thus, the choice network is an AIG which has multiple functionally equivalent points collected in equivalence classes.

However for large circuits computing global BDDs is not feasible. Note that in the procedure outlined above, it is not necessary to actually compute BDDs. One can use random simulation to identify potentially equivalent nodes, and then use a SAT engine to verify equivalence and construct the equivalence classes. We have implemented a package called FRAIG (functionally reduced and-inverter graphs) that exposes the same API as a BDD package but internally uses simulation and SAT. (The details of this package are in the technical report [18]. A discussion of the issues involved can also be found in recent work on structure-based equivalence checking [10, 11, 14, 15].)

Example. Figure 4 illustrates the creation of a network with choices. Networks 1 and 2 show the subject graphs obtained from two networks that are functionally equivalent, but structurally different. The nodes $x_1$ and $x_2$ are functionally equivalent (up to complementation) in the two subject graphs. They are collected in an equivalence class in the choice network, and an arbitrary member ($x_1$ in this case) is used as a representative for the class in the choice network. Note that there is no choice corresponding to node o since the procedure described above detects the maximal commonality between the two networks.

Algebraic re-writing. A different way to generate choices is by iteratively applying the Λ- and Δ-transformations described by Lehman et al. [13]. Given an AIG, we use the associativity of AND to locally re-write the graph (the Λ-transformation), i.e. whenever the structure AND(AND(x1, x2), x3) is seen in the AIG, it is replaced by the equivalent structures AND(AND(x1, x3), x2) and AND(x1, AND(x2, x3)). If this process is done until no new AND nodes are created, it is equivalent to identifying the maximal multi-input AND-gates in the AIG and adding all possible tree decompositions. Similarly, the distributivity of AND over OR (the Δ-transformation) provides another source of choices. In Section 6 we present experimental results that compare the flexibility provided by these choices with those from lossless synthesis.

Note that this leads to a new way of thinking about logic synthesis: one can use arbitrary transformations to re-write the network and create choices. The best combination of these choices is selected during mapping.

4.2 Mapping with choices

The cut-based Boolean mapping procedure of Section 3.1 can be extended naturally to handle equivalence classes of nodes. Only the cut computation step needs modification. Given a node N, let $\bar{N}$ denote the equivalence class it belongs to. Let $\bar{\Phi}(\bar{N})$ denote the set of cuts of the equivalence class $\bar{N}$. Then,

$$\bar{\Phi}(\bar{N}) = \bigcup_{N \in \bar{N}} \Phi(N)$$

where, if A and B are the two inputs of N, $\Phi(N)$ is given by

$$\Phi(N) = \{\{N\}\} \cup \{\, u \cup v \mid u \in \bar{\Phi}(\bar{A}),\ v \in \bar{\Phi}(\bar{B}),\ |u \cup v| \le k \,\}$$

This expression for $\Phi(N)$ is a slight modification of the one in Section 3.1: the cuts of N are obtained from the cuts of the equivalence classes of its inputs (instead of the cuts of just its inputs). The reader should verify that in the absence of choices (which corresponds to the situation when all equivalence classes have only one node) this computation is essentially the same as the one presented in Section 3.1. As before, the cut computation can be done in a bottom-up manner from PIs to outputs in a single pass.
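The cut recursion over equivalence classes can be sketched in a few lines of Python. This is only an illustrative sketch, not the mapper's implementation: the function name `enumerate_cuts`, the data layout, and the small AIG below are assumptions. The AIG structure (p = AND(b, c), q = AND(p, d), r = AND(p, e), s over d and e, x1 over q and r, x2 over p and s, o over a and the class of x1) is inferred from the cut sets listed in the Figure 4 example; inverters are ignored because they do not affect which node sets form cuts. The sketch also does not filter redundant cuts.

```python
def enumerate_cuts(nodes, rep, k):
    """Bottom-up k-feasible cut enumeration on an AIG with choices.

    nodes: list of (name, fanins) in topological order, where fanins is
           None for a primary input or a pair (a, b) for an AND node.
           Assumption: every member of a fanin's equivalence class
           appears before the node that uses it.
    rep:   maps each node name to the representative of its class.
    Returns (node_cuts, class_cuts); cuts are frozensets of node names.
    """
    node_cuts = {}
    class_cuts = {}
    for name, fanins in nodes:
        cuts = {frozenset([name])}            # the trivial cut {N}
        if fanins is not None:
            a, b = fanins
            # Combine cuts of the fanins' equivalence classes,
            # not just the cuts of the fanin nodes themselves.
            for u in class_cuts[rep[a]]:
                for v in class_cuts[rep[b]]:
                    if len(u | v) <= k:
                        cuts.add(u | v)
        node_cuts[name] = cuts
        # The cuts of a class are the union of its members' cuts.
        class_cuts.setdefault(rep[name], set()).update(cuts)
    return node_cuts, class_cuts


# Hypothetical choice network modeled on the Figure 4 example.
NODES = [(pi, None) for pi in "abcde"] + [
    ("p", ("b", "c")),
    ("q", ("p", "d")),
    ("r", ("p", "e")),
    ("s", ("d", "e")),
    ("x1", ("q", "r")),
    ("x2", ("p", "s")),
    ("o", ("a", "x1")),
]
REP = {name: name for name, _ in NODES}
REP["x2"] = "x1"                              # x1 and x2 form one class

node_cuts, class_cuts = enumerate_cuts(NODES, REP, k=3)
for cut in sorted(class_cuts["o"], key=sorted):
    print(sorted(cut))
```

With every class a singleton, `rep` is the identity map and the loop reduces to the recursion of Section 3.1.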

Figure 4 (panels: Network 1, with p = bc and o = a + pd + pe; Network 2, with p = bc and o = a + p(d + e); Network with Choice): Example illustrating the creation of a network with choices. A choice is created when there are two nodes with the same global function (up to complementation), but different structures. Thus nodes $x_1$ and $x_2$ lead to a choice, but node p does not (p is structurally the same in both networks).

Example. Consider the computation of the 3-feasible cuts of the equivalence class $\{o\}$ in Figure 4. Let x represent the equivalence class $\{x_1, x_2\}$. Now,

$$\bar{\Phi}(x) = \Phi(x_1) \cup \Phi(x_2) = \{\{x_1\}, \{x_2\}, \{q, r\}, \{p, s\}, \{q, p, e\}, \{p, d, r\}, \{p, d, e\}, \{b, c, s\}\}$$

and $\bar{\Phi}(\{a\}) = \Phi(a) = \{\{a\}\}$. We have

$$\bar{\Phi}(\{o\}) = \Phi(o) = \{\{o\}\} \cup \{\, u \cup v \mid u \in \bar{\Phi}(\bar{a}),\ v \in \bar{\Phi}(\bar{x}_1),\ |u \cup v| \le 3 \,\}$$

Now since $\bar{a} = \{a\}$ and $\bar{x}_1 = x$, we get

$$\bar{\Phi}(\{o\}) = \{\{o\}, \{a, x_1\}, \{a, x_2\}, \{a, q, r\}, \{a, p, s\}\}$$

Observe that the set of cuts of o involves nodes from the two choices $x_1$ and $x_2$, i.e. o may be implemented using either of the two structures.

The subsequent steps of the mapping process (steps 2 to 5 of Section 3.1) remain unchanged, except that now the operations are done for equivalence classes of nodes, rather than for individual nodes.

5. SUPERGATES

A supergate is a single-output combinational network of a few library gates which is treated as a single library gate by the mapping procedure. As the example in the introduction shows (see Figure 2), supergates match larger portions of the subject graph than library gates. This makes the matching more Boolean and less dependent on the structure of the subject graph. This greatly increases the number of matches seen by the mapper, and leads to better results. In what follows we use the term simple gates to mean the gates in the original library.

5.1 Use in Mapping

Supergates require no change to the mapping procedure since they are no different from simple gates for the mapper. Supergates are generated in a pre-processing step as described in Section 5.2.
The supergate library is generated once, stored compactly in a file, and used when technology mapping is invoked. The supergates are recomputed only if changes are made to the original library. This is why supergate generation has the additional advantage of reducing the total run-time of mapping: it pre-computes and re-uses the mapping information which depends on the library but not on the network to be mapped. After the network is mapped using supergates, each supergate in the mapped netlist is replaced by its constituent simple gates; thus the final netlist consists of only the library gates.

5.2 Supergate Generation

Supergates are generated recursively in a number of rounds. In each round we generate a new set of supergates and compute their functions. These supergates are used in subsequent rounds to generate new supergates. As usual we represent functions by truth tables.

Let k be the maximum number of inputs in the support of the supergates we consider. For example when mapping with 5-feasible cuts, k would be 5. Let $G_i$ be the set of supergates at round i. Let L be the set of simple library gates. Let $J_k$ denote $\{1, \ldots, k\}$, $\pi(X)$ the set of permutations of a set X, and $|g|$ the number of inputs of a simple gate $g \in L$.

The supergate generation is as follows. Initially, $G_0 = \{x_i \mid i \in J_k\}$ where the $x_i$ are elementary functions. Each $G_{i+1}$ is the union of $G_i$ and an additional set of functions of the form $g(x_{i_1}, x_{i_2}, \ldots, x_{i_n})$ where $g \in G_i$ and $(i_1, i_2, \ldots, i_n) \in \pi(Y)$, Y being a subset of $J_k$ of cardinality $n = |g|$.

The total number of rounds corresponds to the maximum number of logic levels in a supergate, and is a user-specified parameter. If the generation is stopped after the first round (i.e. at $G_1$), then the set of supergates contains only the library gates with all permutations of input variables (c.f. Section 3.2).

Example. Let k = 3, i.e. we are interested in supergates with at most 3 inputs. Let L = {AND, OR}. $G_0 = \{x_1, x_2, x_3\}$, where $x_1$, $x_2$ and $x_3$ are each represented by an 8-bit truth table. Each $G_{i+1}$ contains, in addition to the functions in $G_i$, functions of the form AND($y_1$, $y_2$) and OR($y_1$, $y_2$) where $y_1$, $y_2$ are functions in $G_i$. Thus $x_1$, AND($x_1$, $x_2$), AND($x_2$, $x_1$), AND($x_1$, $x_3$), etc. are some functions in $G_1$. Similarly, AND(AND($x_1$, $x_2$), $x_3$), AND(OR($x_2$, $x_1$), AND($x_1$, $x_3$)), AND(AND($x_1$, $x_2$), AND($x_1$, $x_3$)), are some functions in $G_2$.

5.3 Pruning by Dominance

Since the above procedure is exhaustive, a large number of supergates is generated. However, some of the supergates generated above are sub-optimal, i.e. they are dominated by other gates.

Example. Consider the gate AND(AND($x_1$, $x_2$), AND($x_1$, $x_3$)) from the example above. It has worse delay and area than the functionally equivalent supergate AND(AND($x_1$, $x_2$), $x_3$).

Whenever a new supergate is created, it is checked against existing supergates that implement the same function (by means of a hash table). If the new supergate is worse in terms of area and delay than an existing one, then it is not added to the set of supergates. Also note that usually in industrial libraries functional symmetry of the underlying gate is not very useful. For example both AND($x_1$, $x_2$) and AND($x_2$, $x_1$) have to be retained since the pin-to-pin delays are different in the two cases.

5.4 Pruning by Resource Limits

Another technique to reduce the number of supergates is pruning based on resource limits. This is important because of a combinatorial explosion inherent in the above formulation.

Example. If at one point there are 1000 supergates, and a 4-input NAND is used as a root gate, there would be $1000^4 = 10^{12}$ supergates to consider.
Many of these would be added (since the pin-to-pin delays are different) and the number of supergates would increase significantly, making the next round impossible to complete.

We experimented with a number of heuristics to reduce the combinatorial explosion. The simplest heuristic is to use only gates with small support (say, 3 inputs) as root gates. A second heuristic is to set area and delay limits on the supergates. These limits can be handled efficiently by sorting the supergates and using as inputs to the root gate only those supergates for which the resulting supergate would be within the area and delay limits.

The supergate generation technique presented above is rather basic and inefficient because of the bottom-up nature of the generation process. Improving the generation process is an interesting research problem. One idea is to study the cut functions actually encountered during mapping, and then to employ constructive decomposition techniques to generate good supergates for the commonly occurring functions. This leads to the exciting possibility of a mapper that learns from the circuits it processes!

Table 1: Comparison of various mapper modes (columns: Mode, Delay, Area, Run-time; rows: B, B+L, B+S, B+L+S). B is Baseline, L is Lossless synthesis, S is Supergates. Run-time for B+L and B+L+S does not include the choice generation time. The time required for choice generation is roughly three times the baseline run-time.

5.5 Comparison with Algebraic Re-Writing

Recall from Section 4.1 that algebraic re-writing includes the addition of choices by adding alternative decompositions based on associative and distributive transforms in the manner of Lehman et al. [13]. If supergates are generated exhaustively without resource limits, it is possible to do exact logic synthesis (the entire circuit would map to a single supergate). In this theoretical setting, mapping with supergates is the most general form of logic synthesis. In practice, as discussed above, only a limited set of supergates can be used.
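The round-based generation of Section 5.2 and the dominance check of Section 5.3 can be illustrated with a small Python sketch. It is not the MVSIS implementation: the two-gate library, its areas and pin-to-pin delays, the bit ordering of the elementary truth tables, and the names `generate` and `dominates` are all assumptions made for illustration. Area is summed as if every supergate were a tree, and a candidate is simply rejected when an existing supergate with the same truth table is no worse in area and in every input-to-output delay.

```python
from itertools import product

K = 3                          # maximum supergate support size
MASK = (1 << (1 << K)) - 1     # truth tables are 8-bit integers for K = 3
NO_PATH = float("-inf")        # output does not depend on this variable

# Elementary functions x1..x3 as truth tables (bit ordering is an assumption).
VARS = [0xAA, 0xCC, 0xF0]

# Hypothetical library: name -> (function, area, pin-to-pin delays).
LIBRARY = {
    "AND": (lambda a, b: a & b, 2, (10, 12)),
    "OR": (lambda a, b: a | b, 2, (10, 12)),
}


def dominates(a, b):
    """True if a is no worse than b in area and in every variable's delay."""
    return a["area"] <= b["area"] and all(
        x <= y for x, y in zip(a["delay"], b["delay"])
    )


def generate(rounds):
    """Return a hash table: truth table -> list of retained supergates."""
    table = {}
    for j, tt in enumerate(VARS):  # G_0: the elementary variables
        delay = tuple(0 if i == j else NO_PATH for i in range(K))
        table[tt] = [{"expr": f"x{j + 1}", "tt": tt, "area": 0, "delay": delay}]
    for _ in range(rounds):
        # Snapshot of G_i; supergates created in this round feed the next one.
        current = [s for bucket in table.values() for s in bucket]
        for name, (fn, area, pins) in LIBRARY.items():
            for inputs in product(current, repeat=len(pins)):
                cand = {
                    "expr": f"{name}({', '.join(s['expr'] for s in inputs)})",
                    "tt": fn(*(s["tt"] for s in inputs)) & MASK,
                    # Tree assumption: shared sub-supergates are counted twice.
                    "area": area + sum(s["area"] for s in inputs),
                    # Worst delay from each variable through any root pin.
                    "delay": tuple(
                        max(p + s["delay"][j] for p, s in zip(pins, inputs))
                        for j in range(K)
                    ),
                }
                bucket = table.setdefault(cand["tt"], [])
                if not any(dominates(old, cand) for old in bucket):
                    bucket.append(cand)
    return table
```

With these assumed delays, `generate(1)` keeps both AND(x1, x2) and AND(x2, x1), because neither dominates the other, while `generate(2)` rejects AND(AND(x1, x2), AND(x1, x3)) in favour of AND(AND(x1, x2), x3). The nested `product` over all current supergates is also where the combinatorial explosion of Section 5.4 shows up.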
Practically, the search space explored by supergates is different from the search space explored by re-writing in two ways. First, since supergates are built by combining gates, supergates capture different Boolean decompositions of a function. Therefore they are more general than algebraic re-writing, which by definition is limited to algebraic decompositions. Second, since the set of decompositions explored depends on the library, and the pruning techniques described above prune away sub-optimal decompositions, supergates explore a space of good decompositions relative to the gates in the library. In summary, the decompositions due to supergates are a targeted though sparse sampling of the large space of Boolean decompositions. In contrast, algebraic re-writing is a dense sampling of the smaller space of algebraic decompositions.

6. EXPERIMENTAL RESULTS

The techniques described in this paper have been implemented in the MVSIS logic synthesis system [17]. The implementation is freely available. We performed a number of experiments to characterize the performance of the mapper in its various modes, and to benchmark it against other state-of-the-art mappers.

Area recovery. In the experiments reported here, delay is the parameter of interest, and area is not directly controlled during mapping. However, unnecessary duplication of logic during DAG-mapping can lead to poor area. The area of the mapped netlist is improved by making two passes after the mapping procedure described in Section 3.1. Before each pass, the required time is computed at every node. The nodes are then processed in topological order (from inputs to outputs). For nodes having positive slack, matches that minimize area without violating required times are chosen. The two passes differ in the metric used to measure area. The first pass uses area-flow [16], a global metric that accounts for fanout. The second pass uses exact area, a local metric that captures the area contribution due to gates in the maximal fanout-free cone of a match. (In exact area, the area cost of a match is the sum of the areas of all gates used exclusively by the match.) Our experiments show that the combination of these heuristics (in this order) is very effective: post-sizing area is reduced by about 30% without increasing delay.

6.1 Basic Evaluation

The first two experiments were done in an academic setting, using mcnc.genlib and a simple load-independent delay model. The benchmarks chosen were the 15 largest publicly available ones (listed in Table 2), and they were preprocessed with the script shown in Figure 1(a) followed by balancing for delay. For lossless synthesis, the choices were generated using the scheme shown in Figure 1(b).

Characterizing the quality run-time trade-off. Table 1 summarizes the relative delay, area and run-times of the mapper in its various modes. As might be expected, the fastest run-time is obtained when neither supergates nor lossless synthesis is used (this mode is called baseline), and the best quality (32% improvement in delay over baseline) is obtained when both techniques are used (this mode is called B+L+S). In addition to these extreme cases, Table 1 also shows the intermediate situations when either lossless synthesis or supergates is used alone, giving a range of quality run-time trade-offs. We note that since the absolute run-time of the baseline mapper is very small (less than 4 seconds for the largest public benchmarks), the order of magnitude increase in run-time when using both lossless synthesis and supergates is acceptable for better quality.

Experimental comparison with algebraic re-writing. Since a direct comparison with the implementation described by Lehman et al. [13] was not possible, we performed a simple experiment to estimate the effect of algebraic re-writing. As per Section 4.1, a number of associative and distributive decompositions were added through local re-writing. Decompositions were iteratively added until the number of nodes in the AIG tripled. Compared to baseline, algebraic re-writing led to a 9% reduction in delay (cf. the 32% reduction with B+L+S). When used in conjunction with either lossless synthesis or supergates, re-writing led to a smaller improvement in delay (about 4% in both cases). When used in conjunction with both supergates and lossless synthesis, there was an improvement of only 2% in delay. This confirms the analysis in Section 5.5 that in practice supergates and algebraic re-writing explore different search spaces. In the subsequent experiments we use the baseline and the B+L+S modes (supergates and lossless synthesis) for comparisons with other mappers.

Comparison with SIS. Table 2 shows the performance of the mapper on the benchmark circuits in comparison with the mapper in SIS (used in delay-optimal mode). In the baseline mode the mapper runs 5 times faster than the tree mapper in SIS [20] and produces 33% better delay without degrading area. In the B+L+S mode, the mapper produces the best results, with a 30% reduction in delay over baseline and 54% over SIS.

6.2 Comparison with Industrial Mappers

The next set of experiments was conducted in an industrial setting. The examples are timing-critical combinational blocks extracted from a high-performance microprocessor design which were optimized for delay during technology-independent synthesis. After technology mapping, buffering and sizing are done separately in accordance with a gain-based flow. As part of the mapping, an attempt is made to prefer those gates that can drive the estimated fanout loads. (This is done iteratively, like area recovery, using fanout estimation techniques similar to those used in the area-flow algorithm [16].)
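The two area metrics used in area recovery can be made concrete with a short Python sketch. The text above only characterizes them informally, so the recurrence used here for area flow (gate area plus each fanin's area flow divided by that fanin's fanout count) is the form commonly given in the FPGA mapping literature and is an assumption, as are the toy netlist and the function names. Exact area is computed by dereferencing the fanins of a node and summing the areas of the gates whose reference count drops to zero, i.e. the gates used exclusively by that node.

```python
from collections import Counter

# Hypothetical mapped netlist: node -> (gate area, fanin nodes).
# Primary inputs have area 0 and no fanins.
NETLIST = {
    "a": (0, []), "b": (0, []), "c": (0, []),
    "n1": (2, ["a", "b"]),    # shared by n3 and n4
    "n2": (1, ["c"]),         # used only by n3
    "n3": (2, ["n1", "n2"]),
    "n4": (2, ["n1", "c"]),
}
OUTPUTS = ["n3", "n4"]


def fanout_counts(netlist, outputs):
    """Number of references to each node (fanouts plus primary outputs)."""
    refs = Counter(outputs)
    for _, fanins in netlist.values():
        refs.update(fanins)
    return refs


def area_flow(node, netlist, refs, memo=None):
    """Global metric: the area of a shared fanin is split among its fanouts."""
    memo = {} if memo is None else memo
    if node not in memo:
        area, fanins = netlist[node]
        memo[node] = area + sum(
            area_flow(f, netlist, refs, memo) / refs[f] for f in fanins
        )
    return memo[node]


def exact_area(node, netlist, refs):
    """Local metric: total area of the gates used exclusively by `node`."""
    refs = Counter(refs)  # work on a copy so the caller's counts are unchanged

    def deref(n):
        area, fanins = netlist[n]
        total = area
        for f in fanins:
            refs[f] -= 1
            if refs[f] == 0:  # f is no longer needed by anyone else
                total += deref(f)
        return total

    return deref(node)


REFS = fanout_counts(NETLIST, OUTPUTS)
```

In this toy netlist the area flow of n3 is 4.0 (its own area, half of the shared gate n1, and all of n2), whereas its exact area is 3 (its own area plus n2), since n1 would still be needed by n4 if n3 were removed.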
Table 3 shows a comparison of the mapper with two other state-of-the-art mappers: DAG mapper [12] and GraphMap, which is an independent implementation of the Lehman-Watanabe mapper [13] that uses Boolean matching. Neither of these mappers has area recovery. Using supergates and choices, the mapper outperforms both GraphMap and DAG mapper in delay and area and has a significantly shorter run-time.

Table 4 shows the performance of the mapper on some larger blocks from the microprocessor, in comparison with DAG mapper. Delay reduces by 12% while area (measured after sizing) reduces by 24%. Thus, with larger blocks, the improvement in area is greater. It was pointed out in [12] that DAG mapper can produce significantly faster circuits compared to the traditional tree mapping approach [8]. However, the area increase for DAG mapper can sometimes be quite significant. The significant area reduction achieved by the new mapper makes the DAG mapping approach much more practical, especially as leakage power consumption is becoming an increasingly important consideration in high-performance designs.

7. CONCLUSIONS AND FUTURE WORK

Our experiments demonstrate that Boolean mapping based on the simplified matching algorithm and optimal phase selection is a better alternative to structural matching since it produces superior results with shorter run-time. Supergates and choices fit nicely into this framework and greatly improve the quality of mapping by mitigating structural bias. Furthermore, the intermediate networks seen during technology-independent synthesis are a useful source of choices for the final mapping. Supergates, though generated by brute-force enumeration, improve the quality of mapping, even with industrial libraries.

To give a balanced view of the techniques presented, we should point out their limitations. The exhaustive cut computation, which works very well in baseline mode (when no choices are used), becomes a computational bottleneck when many choices are added. We have developed pruning heuristics to restrict the number of cuts considered for each node, but extensions of the techniques proposed in [4] to handle choices would be useful. A general limitation of cut-based matching methods is that library gates with many inputs cannot be handled. In practice this is solved by using a structural matcher just for those gates during the matching phase (as is done in the IBM system [21]). We are currently exploring more seamless techniques to extend Boolean matching to gates with many inputs.

Table 2: Comparison with SIS on public benchmarks (columns: Name, then Delay, Area and Run-time, each reported for SIS, Baseline and B+L+S; benchmarks include bigkey, clma, clmb and dsip; final row: Ratio). The numbers for Baseline are absolute; those for SIS and B+L+S are relative to Baseline. Run-time is in seconds on a 1.6GHz Intel laptop.

Table 3: Comparison with the other mappers on industrial benchmarks (columns: Name, then Area and Delay, each reported for DAG Mapper, GraphMap and B+L+S; rows: industrial examples and their average).

Table 4: Comparison on large industrial circuits (columns: Name, then Area, Delay and Run-time, each reported for DAG Mapper and Baseline; rows: industrial examples and their average).

The exhaustive nature of supergate generation (as presented in Section 5.2) is inefficient since (i) the generated functions may not correlate well with the actual cut functions in the circuits, and (ii) the same function may be generated multiple times. It would be interesting to explore methods for guided supergate generation where more computational effort is invested in finding the supergates for the frequently occurring cut functions. This suggests the possibility of a mapping procedure that learns from previous runs how to guide supergate generation.

For our current prototype within a gain-based methodology, sizing and buffering are performed after mapping, during physical synthesis. We plan to extend our mapper for use in a flow that combines logical and physical synthesis.

8. ACKNOWLEDGMENTS

The first three authors were supported in part by C2S2, the MARCO Focus Center for Circuit and System Solution, under MARCO contract 2003-CT-888; and in part by California Micro Program along with our industrial sponsors Cadence, Synplicity, Intel, Magma, and Fujitsu.

9. REFERENCES

[1] L. Benini, G. DeMicheli, A survey of Boolean matching techniques for library binding, ACM TODAES, Vol. 2, No. 3, July 1997.
[2] C.-W. Chang and M. Marek-Sadowska, Who are the alternative wires in your neighborhood? (Alternative wire identification without search), Proc. GLSVLSI '01.
[3] J. Cong and Y. Ding, FlowMap: An optimal technology mapping algorithm for delay optimization in lookup-table based FPGA designs, IEEE Trans. CAD, Vol. 13(1), 1994.
[4] J. Cong, C. Wu and Y. Ding, Cut ranking and pruning: Enabling a general and efficient FPGA mapping solution, Proc. FPGA '99.
[5] D. Debnath and T. Sasao, Fast Boolean matching under variable permutation using representative, Proc. ASP-DAC '99.
[6] S. Hassoun and T. Sasao, eds., Logic synthesis and verification, Kluwer 2002, Chapter 5, Technology mapping.
[7] U. Hinsberger and R. Kolla, Boolean matching for large libraries, Proc. DAC '98.
[8] K. Keutzer, DAGON: Technology binding and local optimizations by DAG matching, Proc. DAC '87.
[9] V. N. Kravets and K. A. Sakallah, Constructive library-aware synthesis using symmetries, Proc. DATE '00.
[10] A. Kuehlmann, Dynamic transition relation simplification for bounded property checking, Proc. ICCAD '04.
[11] A. Kuehlmann, V. Paruthi, F. Krohm, and M. K. Ganai, Robust Boolean reasoning for equivalence checking and functional property verification, IEEE Trans. CAD, Vol. 21(12), 2002.
[12] Y. Kukimoto, R. K. Brayton, P. Sawkar, Delay-optimal technology mapping by DAG covering, Proc. DAC '98.
[13] E. Lehman, Y. Watanabe, J. Grodstein, and H. Harkness, Logic decomposition during technology mapping, IEEE Trans. CAD, 16(8), 1997.
[14] F. Lu, L. Wang, K. Cheng, R. Huang, A circuit SAT solver with signal correlation guided learning, Proc. DATE '03.
[15] F. Lu, L. Wang, K. Cheng, J. Moondanos and Z. Hanna, A signal correlation guided ATPG solver and its applications for solving difficult industrial cases, Proc. DAC '03.
[16] V. Manohararajah, S. D. Brown, Z. G. Vranesic, Heuristics for area minimization in LUT-based FPGA technology mapping, Proc. IWLS '04.
[17] MVSIS Group, MVSIS: Multi-Valued Logic Synthesis System, U. C. Berkeley. http://www-cad.eecs.berkeley.edu/research/mvsis/
[18] A. Mishchenko, S. Chatterjee and R. Brayton, Fraigs: Integrated verification and synthesis, Tech. Report, Dept. of EECS, U. C. Berkeley. http://www-cad.eecs.berkeley.edu/research/mvsis/
[19] A. Mishchenko, X. Wang, T. Kam, A new enhanced constructive decomposition and mapping algorithm, Proc. DAC '03.
[20] E. Sentovich, et al., SIS: A system for sequential circuit synthesis, Tech. Rep. UCB/ERL M92/41, ERL, Dept. of EECS, U. C. Berkeley.
[21] L. Stok, M. A. Iyer, A. J. Sullivan, Wavefront technology mapping, Proc. DATE.
[22] Wu et al., Efficient Boolean matching algorithm for cell libraries, Proc. ICCD '94.
