Structural Gate Decomposition for Depth-Optimal Technology Mapping in LUT-Based FPGA Designs


JASON CONG and YEAN-YOW HWANG
University of California

In this paper we study structural gate decomposition in general, simple gate networks for depth-optimal technology mapping using K-input Lookup-Tables (K-LUTs). We show that (1) structural gate decomposition in any K-bounded network results in an optimal mapping depth smaller than or equal to that of the original network, regardless of the decomposition method used; and (2) the problem of structural gate decomposition for depth-optimal technology mapping is NP-hard for K-unbounded networks when $K \ge 3$ and remains NP-hard for K-bounded networks when $K \ge 5$. Based on these results, we propose two new structural gate decomposition algorithms, named DOGMA and DOGMA-m, which combine the level-driven node-packing technique (used in Chortle-d) and the network flow-based labeling technique (used in FlowMap) for depth-optimal technology mapping. Experimental results show that (1) among five structural gate decomposition algorithms, DOGMA-m results in the best mapping solutions; and (2) compared with speed_up (an algebraic algorithm) and TOS (a Boolean approach), DOGMA-m completes decomposition of all tested benchmarks in a short time while speed_up and TOS fail in several cases. However, speed_up results in the smallest depth and area in the following technology mapping steps.

Categories and Subject Descriptors: B.6.1 [Logic Design]: Design Styles; B.6.3 [Logic Design]: Design Aids (automatic synthesis); B.7.1 [Integrated Circuits]: Types and Design Styles

General Terms: Design, Experimentation, Measurement, Performance, Theory

Additional Key Words and Phrases: Computer-aided design of VLSI, decomposition, delay minimization, FPGA, logic optimization, programmable logic, simplification, synthesis, system design, technology mapping

The authors would like to acknowledge the support of the NSF Young Investigator (NYI) Award MIP-9357582, grants from Xilinx, Quickturn, and Lucent Technologies under the California MICRO programs, and the donation of software by Synopsys.

Authors' address: Department of Computer Science, University of California, Los Angeles, CA.

Permission to make digital / hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the ACM, Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and / or a fee. © 2000 ACM $5.00

ACM Transactions on Design Automation of Electronic Systems, Vol. 5, No. 2, April 2000, Pages 193-225.

1. INTRODUCTION

Field programmable gate arrays (FPGAs) have been widely used in circuit design implementation and system prototyping due to their short design cycles and low nonrecurring engineering costs. An important class of FPGAs uses lookup-tables (LUTs) as the basic logic element. A K-input LUT (K-LUT), which consists of $2^K$ SRAM cells, can store the truth table of an arbitrary Boolean function of up to K variables. By connecting LUTs into a network, LUT-based FPGAs can be used to implement circuit designs in a short time.

Logic synthesis for LUT-based FPGAs transforms networks of logic gates into functionally equivalent LUT networks. The process is usually divided into two tasks: logic optimization and technology mapping. Logic optimization extracts common subfunctions to reduce the circuit size and/or resynthesizes critical paths to reduce the circuit delay. Technology mapping consists of two subtasks: gate decomposition and LUT mapping. In gate decomposition, large gates are decomposed into gates of at most K inputs (that is, K-bounded). The resulting K-bounded network is then mapped onto (i.e., covered by) K-LUTs in the LUT mapping step. The separation of optimization and mapping tasks is artificial. Some LUT synthesis algorithms (e.g., Lai et al. [1994] and Wurth et al. [1995]) decompose collapsed networks into LUT networks directly. The objectives of these tasks include area minimization, delay minimization, routability maximization, or a combination of all of them. A comprehensive survey of gate decomposition, LUT mapping, and logic synthesis algorithms for LUT-based FPGAs can be found in Cong and Ding [1996].

The delay of an LUT network can be measured by the number of levels (or depth) in the network under the unit delay model. A number of algorithms were proposed in the past for delay-oriented LUT mapping. We classify them into two classes. The first class of algorithms, such as Chortle-d [Francis et al. 1991b], DAG-Map [Chen et al. 1992], and FlowMap [Cong and Ding 1994a], performs LUT mapping without logic resynthesis. Among these algorithms, Chortle-d guarantees depth-optimal technology mapping for simple gate tree networks, and FlowMap guarantees depth-optimal LUT mapping for general K-bounded networks. Following FlowMap, FlowMap-r [Cong and Ding 1994b] and CutMap further reduce the mapping area, and FlowMap-d [Cong and Ding 1994c] and Edge-Map [Yang and Wong 1994] minimize delay under a more accurate net delay model. Another class of LUT mapping algorithms, such as MIS-pga-delay [Murgai et al. 1991], TechMap-D [Sawkar and Thomas 1993], FlowSyn [Cong and Ding 1993], and ALTO [Huang et al. 1996], collapses critical paths and then applies delay-oriented logic resynthesis. Due to resynthesis, this class of algorithms can obtain a mapping depth smaller than the optimal depth computed by FlowMap, but usually at the cost of longer computation time.

Gate decomposition may significantly affect the network depth obtained by the algorithms in the first LUT mapping class. For example, the network in Figure 1(a) is not K-bounded for the LUT size K used in the figure. When node v is decomposed as shown in Figure 1(b), any mapping algorithm is forced to a larger depth. But if node v is decomposed in the way shown in Figure 1(c), a mapping solution with a smaller depth can be obtained. In addition, when a K-bounded network is further decomposed, the mapping depth can be reduced. Figure 2(a) shows a K-bounded network for which FlowMap produces a mapping solution of 5 LUTs. (Every shaded square represents an LUT in the figure.) But if node v is further decomposed, FlowMap produces a shallower network of 4 LUTs (Figure 2(b)). The two examples demonstrate that gate decomposition affects the depth obtained by LUT mapping algorithms.

Fig. 1. Impact of gate decomposition on mapping depth. (a) Initial network; (b) a decomposition resulting in a larger mapping depth; (c) a decomposition resulting in a smaller mapping depth.

Fig. 2. Gate decomposition in a K-bounded network. (a) Initial K-bounded network (before decomposition); (b) decomposed network with a smaller mapping depth (after decomposition).

We classify gate decomposition methods into structural, algebraic, and Boolean approaches. Structural gate decomposition can only be applied to simple gates (e.g., AND gates, OR gates, XOR gates). Complex gates need to be transformed into simple gates (e.g., via AND-OR decomposition) before any structural decomposition. The tech_decomp algorithm in SIS [Sentovich et al. 1992], the dmig algorithm [Wang 1989; Chen et al. 1992], and the Chortle family of mapping algorithms [Francis et al. 1991a; 1991b] all perform structural gate decomposition. In algebraic gate decomposition approaches, networks are usually partially collapsed and gates are represented in the sum-of-product (SOP) form. Common logic subfunctions are then extracted with algebraic divisions [Rudell 1989; De Micheli 1994]. The speed_up algorithm in SIS [Sentovich et al. 1992] is an algebraic approach that collapses critical paths and then resynthesizes the network for delay minimization. In Boolean gate decomposition approaches, logic gates are decomposed via functional operations. Shannon expansion, if-then-else (ITE) decomposition, and AND-OR decomposition are very common Boolean gate decomposition operations. Recently, functional decomposition techniques [Ashenhurst 1959; Curtis 1961; Roth and Karp 1962] were used in a number of LUT network synthesis algorithms [Lai et al. 1994; Wurth et al. 1995; Legl et al. 1996b]. In these algorithms, networks are completely collapsed whenever possible so that the outputs can be represented as functions of the network inputs directly. The output functions are then decomposed into K-input subfunctions for implementation using K-LUTs. Optional LUT mapping steps may follow to improve the synthesis results. The FGSyn algorithm [Lai et al. 1994] and the BoolMap-D algorithm [Legl et al. 1996b] take this approach for delay-oriented LUT network synthesis. Generally speaking, algebraic approaches and Boolean approaches are more effective for both area and delay minimization in technology mapping, while structural approaches are usually faster. Hybrid approaches, such as algebraic decompositions followed by structural decompositions, are used in many logic synthesis approaches.

In this paper we study structural gate decomposition for delay minimization in general networks with the following motivations. First, as shown above, the way gates are decomposed can affect the mapping depth computed by FlowMap.
A good gate decomposition step allows mapping algorithms to obtain the smallest mapping depth. Second, structural gate decomposition allows arbitrary grouping of gate inputs for our optimization objective, while algebraic or Boolean approaches do not have this advantage. Third, structural gate decomposition is computationally efficient. This is an important factor for mapping large designs and estimating the mapping delay or area. Nowadays, IC process technology has advanced to 0.18 μm and below. Million-gate FPGAs have become a reality. Structural gate decomposition algorithms can be employed in technology mapping approaches to keep pace with this technology trend.

Several delay-oriented structural gate decomposition algorithms were proposed in the past. The tech_decomp algorithm [Sentovich et al. 1992] decomposes each simple gate into a balanced fanin tree to minimize the number of levels locally. The dmig algorithm [Wang 1989; Chen et al. 1992] is based on the Huffman coding algorithm and guarantees the minimum depth in the decomposed network. However, the mapping depth might not be the minimum. The network in Figure 1(b) is actually decomposed using dmig and results in a suboptimal mapping depth. The Chortle-d algorithm [Francis et al. 1991b] employs bin-packing heuristics to achieve depth minimization, but is optimal for trees only. In this paper we go one step further and develop structural gate decomposition algorithms for depth-optimal technology mapping on general networks.

The rest of this paper is organized as follows. Section 2 defines the terminology, presents general properties, and formulates the structural gate decomposition problems. Section 3 addresses the NP-completeness of the problems. Section 4 presents two new algorithms, DOGMA and DOGMA-m, for structural gate decomposition. Experimental results are presented in Section 5, and Section 6 concludes the paper. A preliminary version of this work was published in DAC 96 [Cong and Hwang 1995]; it omitted the proofs of the theorems and considered single-gate decompositions only.

2. PROBLEM FORMULATION

2.1 Definitions and Preliminaries

A combinational Boolean network N can be represented by a directed acyclic graph $N = (V, E)$ where each node $v \in V$ represents a logic gate and each directed edge $(u, v) \in E$ represents a connection from the output of node u to the input of node v. A node v is a simple gate if v implements one of the following functions: AND, OR, XOR, or their inversions. Primary inputs (PIs) are nodes of in-degree zero. Other nodes are internal, and some are designated as primary outputs (POs). A node v is a predecessor of a node u if there is a directed path from v to u in N. The depth of a node v is the number of edges on the longest path from any PI to v. Each PI has a depth of zero. The depth of a network is the largest depth of the nodes in the network. Let $input(v)$ and $fanout(v)$ represent the set of fanins and the set of fanouts of node v, respectively. Given a subgraph H of N, let $input(H)$ denote the set of distinct nodes outside H that supply inputs to nodes in H. A fanin cone $C_v$ rooted at v is a connected subnetwork consisting of v and its predecessors. Node v is the root node of $C_v$, denoted $root(C_v)$. Let K be the LUT input size. A node v is K-bounded if $|input(v)| \le K$. Otherwise, v is K-unbounded.
A network N is K-bounded if it contains only K-bounded nodes. Given a K-bounded network N, a set $M = \{L_1, L_2, \ldots, L_m\}$ of subnetworks is a K-LUT mapping solution of N if
(C1) for every $L_i \in M$, $L_i$ is a fanin cone in N and $|input(L_i)| \le K$;
(C2) for every $L_i \in M$, $input(L_i)$ contains only PIs or root nodes of other subnetworks in M;
(C3) for every $L_i \in M$, $root(L_i)$ is either a PO or belongs to $input(L_j)$ for some $L_j \in M$; and
(C4) for every PO v of N, $v = root(L_i)$ for some $L_i \in M$.
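Conditions (C1) to (C4) can be checked mechanically. The following is a minimal sketch, not code from the paper: it assumes a network is stored as a dictionary from each node to its list of fanins (PIs map to an empty list), and the names `cone_inputs`, `is_fanin_cone`, `is_mapping_solution`, and the toy network `TOY` are hypothetical. The fanin-cone test is one reading of the definition above: every node of the subnetwork must reach the root through edges inside the subnetwork.

```python
def cone_inputs(fanins, cone):
    """input(H): distinct nodes outside `cone` that drive a node inside it."""
    return {u for v in cone for u in fanins.get(v, ()) if u not in cone}


def is_fanin_cone(fanins, cone, root):
    """True if every node of `cone` reaches `root` via edges inside `cone`."""
    reached, stack = {root}, [root]
    while stack:
        v = stack.pop()
        for u in fanins.get(v, ()):
            if u in cone and u not in reached:
                reached.add(u)
                stack.append(u)
    return reached == set(cone)


def is_mapping_solution(fanins, pos, luts, k):
    """Check (C1)-(C4); `luts` maps each root to the node set of its subnetwork."""
    pis = {v for v, f in fanins.items() if not f}
    roots = set(luts)
    all_inputs = set()
    for root, cone in luts.items():
        ins = cone_inputs(fanins, cone)
        # (C1) the subnetwork is a fanin cone with at most K inputs
        if root not in cone or not is_fanin_cone(fanins, cone, root) or len(ins) > k:
            return False
        # (C2) its inputs are PIs or roots of other subnetworks
        if not ins <= pis | (roots - {root}):
            return False
        all_inputs |= ins
    # (C3) every root is a PO or an input of some other subnetwork
    if any(r not in pos and r not in all_inputs for r in roots):
        return False
    # (C4) every PO is the root of some subnetwork
    return set(pos) <= roots


# Toy network (hypothetical): u = AND(a, b), w = AND(u, c), v = AND(w, d).
TOY = {"a": [], "b": [], "c": [], "d": [],
       "u": ["a", "b"], "w": ["u", "c"], "v": ["w", "d"]}
```

For the toy network with PO v, the call `is_mapping_solution(TOY, {"v"}, {"v": {"v", "w"}, "u": {"u"}}, 3)` accepts a two-LUT cover, while packing all three gates into one subnetwork is rejected for K = 3 because that cone has four inputs.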

A mapping solution M is duplication-free if $L_i \cap L_j = \emptyset$ for all $L_i \ne L_j$ in M. By implementing every subnetwork in M using a K-LUT, we obtain a K-LUT network that is functionally equivalent to N. The mapping area and the mapping depth of M are the LUT count (i.e., $|M|$) and the depth of the K-LUT network that implements M, respectively. Given a K-bounded network N, let $S_K(N)$ represent the set of K-LUT networks that implement all mapping solutions of N. The minimum mapping depth of N, denoted $MMD(N)$, is the minimum network depth over all K-LUT networks in $S_K(N)$. Let $N_v$ represent the largest fanin cone rooted at v in N. The minimum mapping depth of a node $v \in N$, denoted $MMD_N(v)$, is $MMD(N_v)$. The mapping depth of any PI is 0.

Given a K-bounded network N, the FlowMap algorithm [Cong and Ding 1994a] computes $MMD_N(v)$ for every node $v \in N$ in polynomial time. A cut in $N_v$ is a partition $(X, \bar{X})$ of $N_v$ such that $\bar{X}$ is a fanin cone rooted at v and X is $N_v - \bar{X}$. The cutset of the cut, denoted $n(X, \bar{X})$, is defined as $input(\bar{X})$. The cut is K-feasible if $|n(X, \bar{X})| \le K$. The height of the cut, denoted $height(X, \bar{X})$, is $\max\{MMD_N(u) : u \in n(X, \bar{X})\}$. FlowMap computes a min-height K-feasible cut in the fanin cone of each node v to obtain $MMD_N(v)$.

The following two lemmas concern the minimum mapping depth in general networks. Lemma 1 states the monotone property of the minimum mapping depth and Lemma 2 gives a way to compute $MMD_N(v)$.

LEMMA 1 [Cong and Ding 1994a]. Let $N = (V, E)$ be a K-bounded network and let node $v \in V$. Then $MMD_N(u) \le MMD_N(v)$ for every fanin $u \in input(v)$.

LEMMA 2 [Cong and Ding 1994a]. Let $N = (V, E)$ be a K-bounded network, node $v \in V$, and let $p = \max\{MMD_N(u) : u \in input(v)\}$. Then $MMD_N(v) = p$ if there exists a K-feasible cut of height $p - 1$ in $N_v$. Otherwise, $MMD_N(v) = p + 1$.

2.2 Properties of Structural Gate Decomposition

Simple gates allow arbitrary grouping of their fanins in decomposition. However, the grouping and the resulting gate size in decomposition can significantly affect the depth and area in the final mapping solution.
In this section, we show that the best mapping results can only be obtained from completely decomposed networks.

Let node v be a simple gate in a network N and let $|input(v)| \ge 3$. Given a structural gate decomposition algorithm D, a decomposition step $D_v$ on node v (i) chooses two fanins $u_1$ and $u_2$ of v; (ii) removes edges $(u_1, v)$ and $(u_2, v)$; and (iii) introduces a node w and three edges $(u_1, w)$, $(u_2, w)$, and $(w, v)$ to reconnect $u_1$, $u_2$, and v. Because v is a simple gate, $D_v$ can always be applied. Node w has the same gate type as node v. For any subnetwork $N' = (V', E')$ of N and a decomposition step $D_v$, we define $D_v(N') = (V' \cup \{w\},\ (E' - \{(u_1, v), (u_2, v)\}) \cup \{(u_1, w), (u_2, w), (w, v)\})$ if $v \in V'$, and $D_v(N') = N'$ if $v \notin N'$. A network is completely decomposed when it becomes 2-bounded. In Figure 3(a), the subnetwork $N'$ contains node v together with one of the u nodes, and $input(N')$ consists of a, b, c, d, and another u node. Figure 3(b) shows $D_v(N')$ after one decomposition step $D_v$. The subnetwork is completely decomposed in Figure 3(c).

Fig. 3. Decomposition of node v. (a) Before decomposition; (b) $D_v(N')$ after one decomposition step $D_v$; (c) complete decomposition of v.

We have the following theorem.

THEOREM 1. Let $N = (V, E)$ be a K-bounded network, node $v \in V$ be a simple gate, and $|input(v)| \ge 3$. Then $S_K(N) \subset S_K(D_v(N))$ for any structural gate decomposition algorithm D.

PROOF. Let w be the node introduced by $D_v$. Let $M = \{L_1, L_2, \ldots, L_m\}$ be an arbitrary mapping solution of N. We claim that $M' = \{D_v(L_1), D_v(L_2), \ldots, D_v(L_m)\}$ is a mapping solution of $D_v(N)$. First, N and $D_v(N)$ have the same set of PIs and POs. From Figure 3, it should be clear that $L_i$ and $D_v(L_i)$ have the same set of inputs as well as the same output node. As a result, $M'$ satisfies conditions (C1) to (C4) as a mapping solution of $D_v(N)$. The K-LUT that implements $L_i$ also implements $D_v(L_i)$. Hence the K-LUT network that implements M also implements $M'$. Therefore, $S_K(N) \subseteq S_K(D_v(N))$. However, a mapping solution $M'$ of $D_v(N)$ cannot be a mapping solution for N if w is the root node of some subnetwork in $M'$ (because $w \notin N$). There exists at least one such mapping solution, namely $D_v(N)$ itself. As a result, $S_K(N) \subset S_K(D_v(N))$. □

COROLLARY 1.1. Let $N = (V, E)$ be a K-bounded network, node $v \in V$ be a simple gate, and $|input(v)| \ge 3$. Then $MMD(D_v(N)) \le MMD(N)$ for any structural gate decomposition algorithm D.

PROOF. Since $S_K(N) \subset S_K(D_v(N))$ for any decomposition algorithm D, by definition, $MMD(D_v(N)) \le MMD(N)$. □

Note that Theorem 1 and Corollary 1.1 hold as long as the decomposition step at v (structural, algebraic, or Boolean) can be carried out, regardless of whether v is a simple gate or not. However, the algebraic or functional decomposition of a complex gate may not always be possible. Since the set of all possible functionally equivalent K-LUT networks expands whenever a simple gate is decomposed (Theorem 1), it is always beneficial to decompose simple-gate networks into 2-bounded networks so that LUT mapping algorithms can exploit the larger mapping solution space.

The experimental results reported in Cong and Ding [1994a] confirm this conclusion. In their experiments, the input networks were first transformed into simple gate networks and then decomposed structurally into 5-bounded, 4-bounded, 3-bounded, or 2-bounded networks before LUT mapping. The resulting mapping depth decreases monotonically as the gate size in decomposition decreases. An interesting contrast comes from the results reported in Legl et al. [1996a], where networks were first collapsed completely and then decomposed functionally into 5-bounded, 4-bounded, or 3-bounded networks for LUT mapping. The best mapping solutions in terms of area and depth are mostly from the 5-bounded networks. The two experiments show an important difference between structural and functional decompositions: logic signals are preserved in structural decompositions, while new gates are synthesized during functional decompositions. In Legl et al. [1996a], the 5-bounded, 4-bounded, and 3-bounded networks contain totally different sets of internal gates, which are synthesized independently in three functional decomposition processes. In fact, according to Corollary 1.1, if the 5-bounded networks in Legl et al. [1996b] were further decomposed before LUT mapping, even smaller mapping depths could be obtained in their experiments.
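The decomposition step $D_v$ defined earlier in this section, and its repeated application until the network is 2-bounded, can be sketched as follows. This is only an illustration under stated assumptions: the network is a dictionary from node to fanin list, gate types are not modeled, the function names and the generated node names are hypothetical, and always pairing the first two fanins is an arbitrary choice rather than the grouping strategy of any algorithm discussed in the paper.

```python
def decomposition_step(fanins, v, u1, u2, w):
    """One step D_v: replace edges (u1, v), (u2, v) by (u1, w), (u2, w), (w, v).

    Returns a new network; the input dictionary is left unchanged.
    """
    new = {n: list(f) for n, f in fanins.items()}
    ins = new[v]
    if len(ins) < 3 or u1 == u2 or u1 not in ins or u2 not in ins or w in new:
        raise ValueError("decomposition step is not applicable")
    ins.remove(u1)
    ins.remove(u2)
    ins.append(w)       # edge (w, v)
    new[w] = [u1, u2]   # edges (u1, w) and (u2, w)
    return new


def decompose_completely(fanins):
    """Apply steps until every node has at most two fanins (2-bounded).

    Assumption: the first two fanins are merged each time; a depth-aware
    algorithm would choose the pair differently.
    """
    net = {n: list(f) for n, f in fanins.items()}
    counter = 0
    for v in list(net):
        while len(net[v]) > 2:
            u1, u2 = net[v][0], net[v][1]
            net = decomposition_step(net, v, u1, u2, f"{v}_w{counter}")
            counter += 1
    return net
```

A gate with n fanins needs n - 2 steps, so a 5-input gate ends up with three intermediate nodes. Which pairs are merged does not affect the node count, but, as the examples in Section 1 show, it can change the mapping depth.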
The following lemma specifies a condition under which structural gate decomposition does not reduce the mapping depth any further.

LEMMA 3. Let $N = (V, E)$ be a K-bounded network, node $v \in V$ be a simple gate, and $|input(v)| \ge 3$. Assume that nodes $u_1, u_2 \in input(v)$ and $MMD_N(u_1) = MMD_N(v)$ (see Figure 4(a)). Let $D_v$ be the decomposition step that merges $u_1$, $u_2$ into an intermediate node w (see Figure 4(b)). Then $MMD(N) = MMD(D_v(N))$.

PROOF. Assume $MMD_N(u_1) = MMD_N(v) = p$. First, $MMD_{D_v(N)}(u_1) = p$ (as $N_{u_1} = D_v(N)_{u_1}$). Next, Lemma 1 (monotone property) assures that $p \le MMD_{D_v(N)}(w) \le MMD_{D_v(N)}(v)$. Then, according to Corollary 1.1, we have $MMD_{D_v(N)}(v) \le MMD_N(v) = p$. Therefore, $MMD_{D_v(N)}(w) = MMD_{D_v(N)}(v) = p$ (see Figure 4(b)).

Now we show $MMD(D_v(N)) = MMD(N)$. Suppose this is not the case. Then $MMD(D_v(N)) < MMD(N)$, and there exists a mapping solution $M' = \{L_1, L_2, \ldots, L_m\}$ for $D_v(N)$ such that $M'$ has a depth smaller than $MMD(N)$. Let $x_i$ represent the output node of each K-feasible subnetwork $L_i$ in $M'$. First, $w = x_i$ for some i. Otherwise, $M'$ would be a mapping solution of N (by collapsing w into v) and $MMD(D_v(N))$ would not be smaller than $MMD(N)$. Next, there must exist some $x_i$ such that $MMD_{D_v(N)}(x_i) < MMD_N(x_i)$. We call node $x_i$ a depth-reduced node. There are two cases for any depth-reduced node $x_i$.

(i) $w \notin input(L_i)$. Then we can find another node $x_j \in input(L_i)$ such that $MMD_{D_v(N)}(x_j) < MMD_N(x_j)$. Otherwise node $x_i$ would not be a depth-reduced node. We continue to trace depth-reduced nodes towards the PIs. This tracing, however, cannot reach the PIs, since PIs have a depth of 0. At a certain depth, the second case must occur.

(ii) $w \in input(L_i)$. Then $(N_{x_i} - L_i, L_i)$ is a cut in the fanin cone $N_{x_i}$ in $D_v(N)$ (see Figure 4(c)). But we can move node v from $L_i$ to $N_{x_i} - L_i$ and obtain another K-feasible cut of height p in $N_{x_i}$ (see Figure 4(d)), since w is fanout-free and w and v have the same mapping depth p. This implies $MMD_{D_v(N)}(x_i) = MMD_N(x_i)$. As a result, $x_i$ is not a depth-reduced node, a contradiction. So we have proved $MMD(D_v(N)) = MMD(N)$. □

Fig. 4. (a) Before $D_v$; (b) after $D_v$; (c) $w \in input(L_i)$; (d) v is moved out of $L_i$.

LEMMA 4. Let $N = (V, E)$ be a K-bounded network, node $v \in V$ be a simple gate, and $|input(v)| \ge 3$. If $MMD_N(u_i) = MMD_N(v)$ for every fanin $u_i \in input(v)$, then $MMD(N) = MMD(D(N))$ for any structural gate decomposition algorithm D.

PROOF. Since the intermediate node w has the same depth as node v, this lemma is true according to Lemma 3. □

2.3 Integrated versus Two-Step Technology Mapping

Gate decomposition and LUT mapping can be performed in two different ways. In an integrated mapping approach, the input network is decomposed and covered by LUTs simultaneously, while in a two-step mapping approach, the input network is decomposed into a K-bounded network before LUT mapping is performed. For example, Chortle-d is an integrated mapping approach, while FlowMap fits only into a two-step mapping approach. The separation of gate decomposition and LUT mapping is a restriction in general, since integrated approaches allow better-informed gate decomposition and LUT mapping decisions, while two-step approaches do not have this advantage. It may appear that the minimum mapping depth over all integrated mapping approaches will be smaller than the minimum mapping depth over all two-step mapping approaches. However, we show that this is not the case for structural gate decomposition.

THEOREM 2. Given a K-bounded network N, if only structural gate decomposition is allowed, the minimum mapping depth for all integrated mapping approaches equals the minimum mapping depth for all two-step mapping approaches.

PROOF. Given an arbitrary K-bounded network N, assume some integrated approach results in the optimal depth $MMD(N)$ in a mapping solution $M_N$. Then $M_N$ is a mapping solution of some K-bounded network $N'$ decomposed structurally from N. A depth-optimal mapper (e.g., FlowMap) can take $N'$ as input and generate a mapping solution $M_{N'}$. Since $M_{N'}$ is depth-optimal with respect to $N'$, we have $MMD(N') \le MMD(N)$. But $M_N$ is depth-optimal with respect to N. As a result, $MMD(N) \le MMD(N')$. Therefore, $MMD(N) = MMD(N')$. □

Our mapping algorithms, presented in Section 4, should be considered a hybrid approach. On one hand, depth minimization is achieved in structural gate decomposition (by DOGMA or DOGMA-m) to return a network topology of the minimum mapping depth; on the other hand, the LUT mapping solution is computed in depth-optimal LUT mapping with area minimization as a second objective. As a result, the depth and the area are optimized separately in the two steps of technology mapping.
Hence we consider our algorithm a hybrid approach.

2.4 The SGD/K and K-SGD/K Problems

In this paper we study structural gate decomposition of K-bounded or K-unbounded simple gate networks into 2-bounded networks such that LUT mapping algorithms (e.g., FlowMap) can obtain the smallest mapping depth. We formulate the following two problems.

Structural gate decomposition for K-LUT mapping (SGD/K). Given a simple-gate K-unbounded network N, decompose N into a 2-bounded network $N_2$ such that $MMD(N_2) \le MMD(N_2')$ for any other 2-bounded decomposed network $N_2'$ of N.

Structural gate decomposition in K-bounded network for K-LUT mapping (K-SGD/K). Given a simple-gate K-bounded network $N_K$, decompose $N_K$ into a 2-bounded network $N_2$ such that $MMD(N_2) \le MMD(N_2')$ for any other 2-bounded decomposed network $N_2'$ of $N_K$.
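Both problems can be written as one minimization over the set of complete structural decompositions. The display below only restates the two definitions above; the symbols $\mathcal{D}_2(N)$ and $N_2^{*}$ are notation introduced here for illustration and are not taken from the paper, and `\operatorname*` assumes the amsmath package.

```latex
\[
  N_2^{*} \in \operatorname*{arg\,min}_{N_2 \in \mathcal{D}_2(N)} \mathit{MMD}(N_2),
  \qquad
  \mathcal{D}_2(N) = \{\, N_2 : N_2 \text{ is 2-bounded and obtained from } N
  \text{ by structural decomposition steps} \,\}
\]
```

SGD/K takes N to be a K-unbounded simple-gate network, and K-SGD/K takes $N = N_K$ to be K-bounded; the objective is the same in both cases.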

3. COMPLEXITY OF SGD/K AND K-SGD/K PROBLEMS

We show the following results: (1) the SGD/K problem is NP-hard for $K \ge 3$; and (2) the K-SGD/K problem is NP-hard for $K \ge 5$. We present the construction for the NP-completeness reduction, the lemmas and theorems, and the proofs of the theorems. Proofs of the lemmas can be found in the Appendix.

Our results are based on polynomial-time transformations from the 3SAT problem to the decision versions of the SGD/K and the K-SGD/K problems. The 3SAT problem, which is a well-known NP-complete problem [Garey and Johnson 1979], is defined as follows:

Problem: 3-Satisfiability (3SAT).
Instance: A set of Boolean variables $X = \{x_1, x_2, \ldots, x_n\}$ and a collection of m clauses $C = \{C_1, C_2, \ldots, C_m\}$, where (i) each clause is the disjunction (OR) of 3 literals of the variables; and (ii) each clause contains at most one of $x_i$ and $\bar{x}_i$ for any variable $x_i$.
Question: Is there a truth assignment for the variables in X such that $C_j = 1$ for $1 \le j \le m$?

We transform an arbitrary instance of 3SAT to an instance of SGD/K in polynomial time. The idea is to relate the truth assignment of variables in 3SAT to the decision of gate decomposition in SGD/K. Since determining the truth assignment is difficult, the decision of gate decomposition is also difficult. We define the decision version of the SGD/K problem as follows:

Problem: Structural gate decomposition for K-LUT mapping (SGD/K-D).
Instance: A constant K, a depth bound B, and a simple-gate K-unbounded network N.
Question: Is there a way to structurally decompose N into a 2-bounded network $N_2$ such that the depth-optimal K-LUT mapping solution of $N_2$ has a depth no more than B?

Given an instance F of 3SAT with n variables $x_1, x_2, \ldots, x_n$ and m clauses $C_1, C_2, \ldots, C_m$, we construct a K-unbounded network $N(F)$ corresponding to the instance F, as follows.
First, or each ariable x i, we construct a subnetwork N x i, which consists o the ollowing nodes: (a) two output nodes denoted x i and x i ; (b) K K PI nodes in which two o them are denoted PI i and PI i ; (c) K internal nodes, denoted i,..., i K, u i,..., u i k, w i, w i and s i, respectiely; The nodes are connected as shown in Figure 5. Each node o w i and w i has K PI anins. Node s i has 4 anins rom w i, w i, PI i and PI i. Eery other internal node has K PI anins. Note that N x i is well deined or K and is K-bounded or K 4. Next, or each clause C j with literals l j, l j, l j, we construct a subnetwork N C j, which consists o the ollowing nodes: (a) one output node denoted C j ; (b) three literal nodes denoted l j, l j, l j ; (c) K 5 internal ACM Transactions on Design Automation o Electronic Systems, Vol. 5, No., April 000.

12 04 J. Cong and Y.-Y. Hwang K- PI s K- PI s PI i PI i i K PI s K PI s K- i w i w i s i u i K PI s K PI s K- u i? x i x i? Fig. 5. Construction o network N x i or each Boolean ariable x i. nodes q j,..., q j K 5, each is the root o a complete -leel K-ary tree with PI nodes as leaes; (d) (K ) internal nodes r j,..., r j k, each is the root o a complete -leel K-ary tree with PI nodes as leaes. The connections are shown in Figure 6(a). The output node C j has all internal nodes as its anins in N C j. Note that N C j is well deined or K. Howeer, the output node C j is not K-bounded. Finally, we connect the subnetworks N C j, j,,..., m with the subnetworks N x i, i,,..., n, as ollows, to orm the network N F. Let literal l j k be a literal in clause C j.il j k x i where x i is a ariable, we connect node x i in N x i as the single anin o node l j k in N C j. Similarly, i l j k x i, we connect node x i in N x i as the single anin o node l j k in N C j. Note that eery literal node has exactly one anin. This anin node is called the ariable node o the corresponding literal node. Network N F has m primary outputs: nodes C,..., C m. We illustrate the construction o N F by an example. Assume F x x x x x x 4 x x x 4. The network N F is shown in Figure 7. Because clause C x x x, we connect nodes x, x, x as anins to nodes l, l, l in N C, respectiely. Node x is the ariable node o node l. We hae the ollowing lemma. LEMMA 5. The SAT instance F is satisiable i and only i N F can be decomposed into D N F such that MMD D N F 4. THEOREM. The SGD/K problem is NP-hard or K. PROOF. The transormation rom an instance F o SAT to the network N F takes O K n m time. I the SGD/K-D problem could be soled ACM Transactions on Design Automation o Electronic Systems, Vol. 5, No., April 000.

13 Gate Decomposition and LUT Mapping 05 q j q j K-5 l j l j l j r j r j K- q j q j K-5 l j l j l j r j r j K-? C j (a) 4 C j Fig. 6. (a) Construction o network N C j or each clause C j ; (b) exact K nodes o depth appear when MMD l j. (b) N(x ) x x x l l l N(C ) C N(x ) x x x x 4 l l l N(C ) C N(x ) N(x ) 4 x 4 l l l N(C ) C Fig. 7. The network N F or F x x x x x x 4 x x x 4. in polynomial time, we can set B 4 and sole SAT in polynomial time. Since SAT is NP-hard, the SGD/K-D problem is NP-hard. For a gien decomposed network D N F o N F, it takes polynomial time to compute its mapping depth d and eriy whether d B (e.g., by FlowMap). As a result, the SGD/K-D problem is NP-complete. Since N x i and N C j are well deined or K, the SGD/K-D problem is NP-complete or K. Hence the SGD/K problem is NP-hard or K. e We now show the complexity o the K-SGD/K problem. In this construction o reduction, we must hae eery node K-bounded (note that N C j is not K-bounded in the preious construction). Gien an instance F o the SAT with n ariables x, x,..., x n and m clauses C i, C,..., C m,we ACM Transactions on Design Automation o Electronic Systems, Vol. 5, No., April 000.

construct a corresponding K-bounded network N^K_F as follows. For each variable x_i, construct the subnetwork N_(x_i) as before (shown in Figure 5). However, for each clause C_j, construct a subnetwork N^K_(C_j) consisting of (a) one output node, denoted C_j; (b) three literal nodes, denoted l_j^1, l_j^2, l_j^3; and (c) K - 5 internal nodes q_j^1, ..., q_j^(K-5), each of which is the root of a complete K-ary tree with PI nodes as leaves. The subnetwork N^K_(C_j) is shown in Figure 8(a). Note that N^K_(C_j) is well defined and K-bounded for K ≥ 5. We connect the subnetworks N_(x_i) and N^K_(C_j) according to the formula F as before to obtain the network N^K_F. We have the following lemma.

Fig. 8. Construction of the K-bounded subnetwork N^K_(C_j) for each clause C_j.

LEMMA 6. The SAT instance F is satisfiable if and only if N^K_F can be decomposed into D(N^K_F) such that MMD(D(N^K_F)) satisfies the required depth bound.

THEOREM 4. The K-SGD/K problem is NP-hard for K ≥ 5.

PROOF. The subnetwork N_(x_i) is K-bounded for K ≥ 4. The subnetwork N^K_(C_j) is K-bounded for K ≥ 5. Based on arguments similar to those in the proof of the NP-hardness result for K-unbounded networks, it is easy to see that the K-SGD/K problem is NP-hard for K ≥ 5. □

4. GATE DECOMPOSITION ALGORITHMS FOR DEPTH-OPTIMAL MAPPING

In this section we combine the node-packing technique of Chortle-d with the min-height K-feasible cut technique of FlowMap for the structural gate decomposition of simple-gate networks. Our objective is to minimize the depth of the final mapping solution. We propose two algorithms. The first algorithm decomposes logic gates independently, as in most previous approaches, while the second algorithm decomposes multiple gates simultaneously to exploit common fanins.

The advantage of multigate decomposition can be seen in an example. Nodes a, b, ..., e in Figure 9 are primary inputs. If nodes u and v in Figure 9(a) are decomposed independently, we might obtain the network in Figure 9(b). For the given K, the best
mapping solution in this case is a network of 4 LUTs. However, if nodes u and v are decomposed together to exploit their common fanins c and d, as shown in Figure 9(c), a network of 4 LUTs with fewer levels can be obtained. The depth of the mapping solution is reduced.

Fig. 9. Multigate decomposition. (a) Initial network; (b) single gate decomposition result; (c) multigate decomposition result. (Shaded nodes are LUT outputs.)

4.1 Single Gate Decomposition

We present our single gate decomposition algorithm DOGMA (Depth-Optimal Gate decomposition for MApping) in this section. Given a simple gate network N, DOGMA decomposes nodes in topological order from PIs to POs. At each node v, DOGMA decomposes v and labels v with the number l(v) = MMD(N_v), where N denotes the decomposed network. The set of fanins of label q in input(v), denoted S_q, is called a stratum of depth q. A K-feasible cut of height q - 1 exists for every node in S_q. A K-feasible cut of height q - 1 exists for a set B of nodes if such a cut exists for a node s created with input(s) = B. DOGMA groups input(v) into strata according to their labels and processes each stratum in two steps.

(1) Starting from the stratum S_q of the smallest depth, DOGMA partitions S_q into a minimum number of subsets such that a K-feasible cut of height q - 1 exists for each subset of nodes. The process is similar to packing objects into bins. Each bin has a size of K. The size of a node (also called an object) is the size of its min-cut of height q - 1. A set of nodes can be packed into one bin if their overall size is no larger than K. Such a bin is called a min-height K-feasible bin, which corresponds to a partitioned subset of S_q. Note that the overall cut size for the nodes in a set could be smaller than the sum of their individual cut sizes.

(2) After partitioning S_q into subsets (or min-height K-feasible bins), an intermediate node (also called a bin node) w_i is created for each bin B_i with input(w_i) = B_i and is labeled l(w_i) = q. A buffer node b_i is then
created for each w_i with input(b_i) = {w_i} and a label l(b_i) = q + 1. All buffer nodes are put into the set S_(q+1). Note that if a bin B_i contains more than two nodes, the bin node w_i needs to be further decomposed. However, according to Lemma 4, no matter how w_i is decomposed, the minimum mapping depth of the network does not change. DOGMA arbitrarily decomposes w_i into an unbalanced tree.

DOGMA repeats steps (1) and (2) for stratum S_(q+1), and so on, until all strata have been processed. The last bin node corresponds to node v. Note that buffer nodes are introduced only for the packing process and are removed when the decomposition is complete.

To determine whether a K-feasible cut of height q - 1 exists for a bin B_i ⊆ S_q of nodes, we compute a max-flow in a flow network constructed as follows [Cong and Ding 1994a]:

(i) Create a sink node t with input(t) = B_i.
(ii) Create a source node s that fans out to all PIs in N_t.
(iii) Assign every edge in N_t an infinite flow capacity.
(iv) Replace every node u ∈ N_t, except s and t, by a subgraph (V_u, E_u), where V_u = {u_1, u_2} and E_u = {(u_1, u_2)}, such that input(u_1) = input(u) and fanout(u_2) = fanout(u). Assign (u_1, u_2) an infinite flow capacity if l(u) ≥ q; otherwise assign it a unit flow capacity.
(v) Finally, compute a max-flow in the constructed flow network. The amount of flow corresponds to the min-cut size in the flow network. If the flow is no larger than K, there exists a K-feasible cut of height q - 1 for the bin B_i of nodes.

We illustrate DOGMA on the example in Figure 10. The output node v in Figure 10(a) is under decomposition. Among the five fanins of v, nodes b, c, d have labels l(b) = l(c) = l(d) = 2 and nodes a, e have labels l(a) = l(e) = 3. As a result, S_2 = {b, c, d} and S_3 = {a, e}. According to DOGMA, b and c are packed into one bin, since a K-feasible cut of height 1 exists for them, and d is packed into another bin, for a total of two (which is the minimum) min-height K-feasible bins. Then bin nodes f and g with labels l(f) = l(g) = 2 and buffer nodes h and i with labels l(h) = l(i) = 3 are created for the two bins, respectively (see Figure 10(b)). DOGMA proceeds to the stratum of depth 3. Two K-feasible cuts of height 2 are found for {a, h} and {i, e}, respectively. Again, bin nodes j and k with labels l(j) = l(k) = 3 and buffer nodes m and n with labels l(m) = l(n) = 4 are created for the two bins, respectively. Nodes m and n are then packed into a bin that corresponds to v (see Figure 10(c)). Finally, nodes g, h, i, m, and n are removed and node v is completely decomposed with a label l(v) = 4.

The following problem has to be solved in DOGMA.

Min-height K-feasible bin-packing problem. Given a stratum S_q of depth q, pack the nodes in S_q into a minimum number of min-height K-feasible bins.

In our study we developed three heuristics to solve the problem. First-fit-decreasing (FFD) and best-fit-decreasing (BFD) are two heuristics
for the bin-packing problem [Horowitz and Sahni 1978]. The FFD heuristic sorts the objects into a list of decreasing sizes, indexes the bins 1, 2, 3, ..., then removes the objects from the list in order and puts each one into the first bin that can accommodate it. The initial conditions on the bins and objects in the BFD heuristic are the same as in the FFD heuristic, but BFD puts each object into the bin that leaves the smallest empty space.

Fig. 10. Decomposition of gate v by the DOGMA algorithm. (a) Before decomposition; (b) b and c, d are packed into f and g; (c) a and h, i and e are packed into j and k.

For the min-height K-feasible bin-packing problem, we propose two min-cut-based heuristics, MC-FFD and MC-BFD, which are analogous to FFD and BFD, except that every object is a node whose size is defined to be the size of its min-cut of height q - 1. A set of nodes can be packed into a K-feasible bin as long as their combined cut size is no larger than K. The third heuristic, called maximal-sharing-decreasing (MC-MSD), encourages sharing during packing, i.e., packings in which the size of the min-cut for the packed nodes is smaller than the sum of their individual min-cut sizes. The packing that produces the maximum sharing is considered the best-fit packing when MC-MSD calls MC-BFD for a packing result.

Experimental results (Table I) show very few differences in mapping results among the three heuristics (DOGMA followed by CutMap) for the MCNC benchmarks. This indicates that in most cases the three heuristics obtained the same number of bins, which could be due to the small bin size K = 5 in the experiment. We chose MC-FFD for its efficiency.

The FFD heuristic is also used in Chortle-d for packing nodes into bins. However, MC-FFD packs nodes according to the size of their min-height K-feasible cut for better performance. With reconvergent fanouts in general networks,
one cannot decide locally whether a set of nodes can be packed into one bin. For example, it is not obvious that nodes e and i in Figure 10(b) can be packed into one bin. The MC-FFD heuristic employs max-flow computation and can decide the packing feasibility correctly.

Table I. Comparing Packing Heuristics MC-FFD, MC-BFD, and MC-MSD in DOGMA. (Columns D and A for each of MC-FFD, MC-BFD, and MC-MSD; one row per benchmark circuit, plus a total row.)

The time complexity of DOGMA is computed as follows. For every node v in the input network N = (V, E), structural gate decomposition creates |input(v)| - 2 nodes. In total, Σ_(v ∈ V) (|input(v)| - 2) = |E| - 2|V| = O(|E|) nodes are created. The min-height K-feasible cut computation has a time complexity of O(K|E|) [Cong and Ding 1994a], where K is the LUT input size, and is carried out O(|input(v)|^2) times in the worst case at each node v in the MC-FFD heuristic. Let d_max be the maximal fanin size for nodes in N. Then the time complexity of DOGMA is O(K d_max |E|^2). We can reduce the time complexity of the min-height cut computation to O(K|E_p|) by constructing partial flow networks only to a certain depth, where E_p is the edge set of the partial flow network. Let E_p^max represent the edge set of the largest partial flow network constructed during decomposition. Then the time complexity of DOGMA is reduced to O(K d_max |E_p^max| |E|).
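As an illustration only, the cut-feasibility test of steps (i) to (v) in Section 4.1 can be sketched in Python. The function name has_k_feasible_cut, the dictionary-based network representation, and the use of a capacity of K + 1 as a stand-in for an infinite capacity are assumptions of this sketch and are not taken from the authors' C implementation. Giving the large capacity to nodes with label at least q is likewise an assumed reading of step (iv).

```python
from collections import defaultdict, deque


def has_k_feasible_cut(fanins, label, bin_nodes, q, K):
    """Return True if the nodes in bin_nodes admit a cut of size <= K
    that uses only nodes with label < q (hypothetical helper).

    fanins: dict node -> list of fanin nodes (PIs have no entry)
    label:  dict node -> integer label
    """
    big = K + 1  # assumption: any capacity > K behaves like "infinite" here
    cap = defaultdict(int)
    adj = defaultdict(set)

    def add_edge(a, b, c):
        cap[(a, b)] += c
        adj[a].add(b)
        adj[b].add(a)  # residual direction

    # Collect the transitive fanin cone N_t of the bin.
    cone, stack = set(), list(bin_nodes)
    while stack:
        u = stack.pop()
        if u not in cone:
            cone.add(u)
            stack.extend(fanins.get(u, ()))

    for u in cone:
        # Node splitting: (u, "in") -> (u, "out") carries the node capacity.
        add_edge((u, "in"), (u, "out"), big if label[u] >= q else 1)
        ins = fanins.get(u, ())
        if not ins:
            add_edge("s", (u, "in"), big)  # source feeds every PI
        for w in ins:
            add_edge((w, "out"), (u, "in"), big)
    for u in set(bin_nodes):
        add_edge((u, "out"), "t", big)  # sink collects the bin

    flow = 0
    while flow <= K:  # stop early once the flow exceeds K
        parent = {"s": None}
        queue = deque(["s"])
        while queue and "t" not in parent:
            a = queue.popleft()
            for b in adj[a]:
                if b not in parent and cap[(a, b)] > 0:
                    parent[b] = a
                    queue.append(b)
        if "t" not in parent:
            break  # no augmenting path: flow is the min-cut size
        path, b = [], "t"
        while parent[b] is not None:
            path.append((parent[b], b))
            b = parent[b]
        push = min(cap[e] for e in path)
        for a, b in path:
            cap[(a, b)] -= push
            cap[(b, a)] += push
        flow += push
    return flow <= K


# Toy network (hypothetical): PIs a..d with label 0, gates x, y, z with label 1.
fanins = {"x": ["a", "b"], "y": ["b", "c"], "z": ["c", "d"]}
label = {"a": 0, "b": 0, "c": 0, "d": 0, "x": 1, "y": 1, "z": 1}
print(has_k_feasible_cut(fanins, label, ["x", "y"], 1, 3))       # True
print(has_k_feasible_cut(fanins, label, ["x", "y", "z"], 1, 3))  # False
```

In the toy network, x and y each have a min-cut of size 2, but their combined cut {a, b, c} has size 3 because b is shared, which is the situation in which the overall cut size is smaller than the sum of the individual cut sizes.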

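The first-fit-decreasing packing loop behind MC-FFD in Section 4.1 can be sketched along the following lines. This is a simplified illustration, not the authors' implementation: the function name mc_ffd and the input format are hypothetical, and the combined cut of a bin is approximated by the union of precomputed per-node cuts, whereas DOGMA recomputes a max-flow for each candidate bin. The union is only an upper bound on the true combined min-cut size.

```python
def mc_ffd(cuts, K):
    """First-fit-decreasing packing with a shared-cut size measure.

    cuts: dict node -> set of cut nodes (assumed precomputed min-cuts)
    K:    bin size (LUT input size)
    Returns a list of bins, each a list of packed nodes.
    """
    for n, c in cuts.items():
        if len(c) > K:
            raise ValueError(f"node {n!r} has no K-feasible cut")

    # Decreasing individual cut size; ties broken by name for determinism.
    order = sorted(cuts, key=lambda n: (-len(cuts[n]), n))

    bins = []  # each bin: {"nodes": [...], "cut": set of cut nodes}
    for n in order:
        for b in bins:
            merged = b["cut"] | cuts[n]
            # Shared cut nodes are counted once, so the merged size can be
            # smaller than the sum of the individual sizes.
            if len(merged) <= K:
                b["nodes"].append(n)
                b["cut"] = merged
                break
        else:
            bins.append({"nodes": [n], "cut": set(cuts[n])})
    return [b["nodes"] for b in bins]


# Hypothetical cut sets: b and c share the cut node p2, d shares nothing.
cuts = {"b": {"p1", "p2"}, "c": {"p2", "p3"}, "d": {"p4", "p5"}}
print(mc_ffd(cuts, 3))  # [['b', 'c'], ['d']]
```

With these made-up cut sets the result has the same shape as the packing of stratum {b, c, d} in Figure 10(b): b and c share a bin and d gets its own. A best-fit variant would instead scan all bins and pick the one that leaves the least free space after merging.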
4.2 Multiple Gate Decomposition

We present our multiple gate decomposition algorithm, called DOGMA-m, and illustrate the procedure on the network shown in Figure 11(a) for K = 3. DOGMA-m is outlined in Figure 12. We call the stratum of each node a local stratum. The union of all local strata of depth q is called the global stratum of depth q. For each depth q, a node v is under decomposition if |input(v)| > 2 (i.e., v is not yet completely decomposed) and input(v) intersects the global stratum of depth q. Starting from depth q = 1 and up, the nodes of the same gate type that are under decomposition are decomposed simultaneously. In Figure 11(a), nodes a, b, ..., h all have a label of 1. Nodes x, y, and z are under decomposition for q = 1. The local stratum of depth 1 is {a, b, c} for node x, {b, c, d, e, f} for node y, and {e, f, g, h} for node z. The global stratum of depth 1 is {a, b, c, d, e, f, g, h}.

In initialization, buffers are created for PIs to supply inputs to the rest of the network. PIs are labeled 0 and buffers are labeled 1. In Figure 11(a), nodes a, b, ..., h are PI buffers. Gray regions represent the global strata of depth 1 and 2 in Figure 11(a)-(c) and (d), respectively. The gate decomposition proceeds as follows:

(1) For each depth q and for each gate type, the nodes under decomposition are collected into a set G_q. Then the global stratum of depth q, denoted S_q, is computed as the union of the local strata of depth q for all nodes in G_q. In Figure 11(a), taking the gate type to be AND, we have G_1 = {x, y, z} and S_1 = {a, b, c, d, e, f, g, h}. Based on G_q and S_q, we formulate the Global Stratum Bin-Packing (GSBP) problem (to be formally defined later). By solving the GSBP problem, we achieve that (i) for each node in G_q, its local stratum of depth q is packed into min-height K-feasible bins, and (ii) there is a minimum number of min-height K-feasible bins in total. The second objective is achieved by packing common fanins for the nodes in G_q. Intermediate nodes (also called bin nodes) are created for the bins. In Figure 11(b), nodes b and c, e and f, and g and h are packed into bin nodes i, j, and k, respectively.

(2) It is possible that some nodes in G_q have been decomposed completely (e.g., nodes x and z in Figure 11(b)), while the local strata of other nodes can be packed further (e.g., node y in Figure 11(b)). Both G_q and S_q are updated, and a new instance of the GSBP problem for the same q value is formulated and solved. The process iterates until the global stratum of depth q has been minimally packed into bins (as a result, the network does not change). In Figure 11(b), we have l(x) = 1, l(z) = 2, G_1 = {v, y}, and S_1 = {i, d, j, x}. By solving the GSBP problem for the updated G_1 and S_1, nodes d and i are packed into a bin node m. Node y is now completely decomposed with a label l(y) = 2. The
process iterates with the updated G_1 = {v} and S_1 = {x}, but no further packing is possible for q = 1 (see Figure 11(c)).

(3) Buffer nodes are created and labeled q + 1 for every fanin in the global stratum S_q. The decomposition process iterates steps (1) and (2) until the network is 2-bounded. In Figure 11(d), a buffer node n is created for node x, nodes y and z are then packed into a bin, and the decomposition of node v is completed.

Fig. 11. Multiple gate decomposition. (a) Initial network; (b) after one q = 1 iteration; (c) after two q = 1 iterations; (d) completely decomposed network.

Two points are worth mentioning. First, in DOGMA, each node is decomposed only after all its fanins have been decomposed and labeled. In DOGMA-m, however, nodes can undergo decomposition even though some of their fanins have not been labeled. For example, node v in Figure 11(b) is
under decomposition (v ∈ G_1), while its fanin y is not labeled yet. Second, for each depth q and gate type, multiple instances of the GSBP problem might be solved in order to pack the local strata into a minimal number of bins. For example, two instances of the GSBP problem are solved for q = 1 before the local stratum of node y is minimally packed (from Figure 11(a) to (c)). In our experiments, we found that solving three instances of the GSBP problem is sufficient for each q value.

procedure DOGMA-m(N, K)
/* N is the input network and K is the LUT input size. */
  Initialization
  N_old = ∅
  for q = 1, 2, ... until N is 2-bounded do
    while N ≠ N_old do
      N_old = N
      for each gate function type do
        G_q = {v | func(v) is of this type, |input(v)| > 2, ∃u ∈ input(v) s.t. label(u) = q}
        S_q = {u | label(u) = q, u ∈ input(v), v ∈ G_q}
        Solve the GSBP(G_q, S_q, K) problem
        for each min-height K-feasible bin B_i created in GSBP do
          create bin node w_i, label(w_i) = q
          add w_i to N, update fanins of nodes in G_q
    for each node u_i ∈ S_q do
      create buffer node b_i, label(b_i) = q + 1
      add b_i to N, N_old = ∅
  return N

Fig. 12. Multiple gate decomposition algorithm.

The Global Stratum Bin-Packing (GSBP) problem is formally defined as follows.

Global stratum bin-packing (GSBP) problem. Given a set G_q of nodes of the same gate type under decomposition and a global stratum S_q of depth q that contains the fanins of the nodes in G_q, pack the fanins in S_q into a set of bins such that (i) for each node in G_q, its local stratum of depth q is packed into min-height K-feasible bins; and (ii) there is a minimum number of min-height K-feasible bins in total.

To solve the GSBP problem, we build a matrix M whose rows correspond to the nodes in G_q = {v_1, v_2, ..., v_n} and whose columns correspond to the fanins in S_q = {u_1, u_2, ..., u_m}. An entry M(i, j) = 1 if u_j ∈ input(v_i), and M(i, j) = 0 otherwise. A rectangle is a subset of rows and columns, denoted by a pair (R, C) indicating the row and column subsets, in which all entries are 1. C corresponds
to a bin of fanins and R corresponds to a set of nodes that share the fanins in C. A solution of the GSBP problem is a rectangle cover for M, subject to the constraint that a K-feasible cut of height q - 1 exists for the fanins in each column set C. This matrix representation is similar to the cube-literal matrix used for solving the cube-extraction problem [Rudell 1989; De Micheli 1994]. However, the algorithms for cube extraction cannot be applied directly because the C in every rectangle (R, C) must satisfy the K-feasible cut constraint.

Fig. 13. FFD bin-packing heuristic for the GSBP problem. (a) Initial M; (b) M after the first run of bin-packing.

We use the MC-FFD packing heuristic to compute a rectangle cover for the GSBP problem as follows. First, compute the fanout factor o_j = Σ_(i=1..n) M(i, j) and the cut size s_j of the min-cut of height q - 1 for every fanin u_j ∈ S_q. The weight of each fanin is o_j · s_j. Then we sort the fanins according to their weights and follow the MC-FFD bin-packing heuristic to pack fanins into bins (starting from the fanin with the largest weight). Our strategy is to group fanins of large cut sizes to obtain a minimum number of bins and to group fanins of large fanout sizes to exploit common fanins. A set of fanins can be packed into one bin C if (i) a K-feasible cut of height q - 1 exists for the fanins in C, and (ii) the largest rectangle (R, C) satisfies |R| ≥ r_min (i.e., at least r_min nodes in G_q share these fanins), where r_min is a user-specified parameter. By performing the MC-FFD packing heuristic, we obtain a set of rectangles. Each rectangle (R, C) that satisfies |C| ≥ c_min (another user-specified parameter) is saved and covered with 0's in M. The MC-FFD packing procedure is repeated until M contains only 0's. A rectangle cover for M is then obtained, and the set C in each rectangle corresponds to a bin. In our implementation, we set r_min = 2 and c_min = 2 in the first pass of the MC-FFD packing procedure and decrease both values to 1 in subsequent iterations. The decrease of the values guarantees the termination of our procedure.

We demonstrate the MC-FFD packing heuristic for solving the GSBP problem on the network in Figure 11(a) for K = 3. The initial matrix M is shown in Figure 13(a). The rows correspond to the nodes in G_1 = {x, y, z} and the columns correspond to the fanins in S_1 = {a, b, c, d, e, f, g, h}. The weight of each fanin is its fanout size (i.e., the number of 1's in its column), since every fanin is a PI buffer whose cut size is 1. Fanins are
sorted into the order b, c, e, f, a, d, g, h according to their weights. Nodes b and c are packed into the first bin, which corresponds to the rectangle (R, C) = ({x, y}, {b, c}). Although there is a 3-feasible cut of height 0 for nodes b, c, e, they cannot be packed into one bin because the rectangle for them has |R| = |{y}| < r_min. As a result, node e is put into a separate bin and packed with node f, which corresponds to the rectangle (R, C) = ({y, z}, {e, f}). Then the two rectangles are covered with 0's (Figure 13(b)). We reset r_min = c_min = 1 and perform another run of the MC-FFD packing heuristic. Three bins are obtained, but only one bin contains two fanins. In total, three bin nodes are created. The network in Figure 11(a) is now decomposed into the network in Figure 11(b).

Table II. Circuit Optimization Using the Rugged Script. (For each benchmark circuit: circuit size and gate fanin size distribution of the original network; circuit size, gate fanin size distribution, and optimization time in seconds of the rugged-optimized network; plus a total row.)

5. EXPERIMENTAL RESULTS

We implemented DOGMA and DOGMA-m in the C language and incorporated them into the RASP logic synthesis system for FPGAs [Cong et al. 1996]. We prepared two sets of benchmarks in our experiments. The first set, C_original, consists of 24 original multilevel MCNC benchmarks, which all
