FPGA PLB Architecture Evaluation and Area Optimization Techniques using Boolean Satisfiability


1 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. X, NO. XX, APRIL

FPGA PLB Architecture Evaluation and Area Optimization Techniques using Boolean Satisfiability

Andrew C. Ling, Member, IEEE, Deshanand P. Singh, Member, IEEE, and Stephen D. Brown, Member, IEEE

Abstract: This work presents a Field-Programmable Gate Array (FPGA) logic synthesis technique based upon Boolean Satisfiability (SAT). It shows how to map any Boolean function into an arbitrary PLB architecture without any custom decomposition techniques. The authors illustrate several useful applications of this technique by showing how it can be used for architecture evaluation and area optimization. When evaluating FPGA architectures, the authors focus on the basic building block of the FPGA, which they refer to as a programmable logic block (PLB). To illustrate the flexibility of their evaluation framework, several unrelated PLB architectures are evaluated in an automated fashion. Furthermore, the authors show that their technique reduces FPGA resource usage by 27% on average in common subcircuits found in digital design.

Index Terms: Design Automation, Field-programmable gate array, Quantified Boolean Satisfiability, Boolean Satisfiability, Logic Synthesis, Resynthesis.

I. INTRODUCTION

Field-Programmable Gate Arrays (FPGAs) are integrated circuits characterized by two distinct features: programmable logic blocks (PLBs) and programmable interconnect structures. An FPGA consists of groups of PLBs known as clusters, which are connected through programmable connection blocks and switch blocks to form a regular array of clusters as shown in Fig. 1. A cluster combined with its associated routing forms a tile. Previous work has shown that grouping PLBs into clusters greatly improves the performance of FPGAs, since the intra-cluster delay is an order of magnitude less than the inter-cluster delay [1].
Manuscript received November 20, 2005; revised November 30, This work was supported by the IEEE. A. Ling is with the University of Toronto. D. Singh is with Altera Corporation. S. Brown is with Altera Corporation and the University of Toronto.

Fig. 1. An illustration of an FPGA consisting of a regular array of clustered PLBs. (Figure labels: PLB, Cluster, Tile, Switch Block, I/O pad, Routing Tracks.)

Fig. 2. Programmable Logic Block. (Figure labels: 2-LUT, configuration bits $L_1$ to $L_4$, LUT output $z_1$, PLB output $G$.)

An example of a PLB is shown in Fig. 2. In this example, the logic block is composed of a 2-input lookup table (2-LUT) that feeds an AND gate. The 2-LUT is capable of implementing any arbitrary Boolean function of 2 variables. Assuming K is the number of inputs to the LUT, the LUT is implemented with a set of $2^K$ static RAM (SRAM) bits that are programmed with the truth-table values for the function to be implemented (4 SRAM bits in Fig. 2). The 2 inputs ($x_1$, $x_2$) feed a multiplexer that selects the appropriate truth-table value from the SRAM bits. In cases where the PLB consists only of LUTs, we will refer to them as K-LUT architectures.

A. Motivation

In general, many modern PLBs are based on the K-input lookup table (K-LUT). Although the K-LUT is very flexible, it is usually beneficial to add dedicated non-programmable logic to the PLB, such as adders or XOR/AND gates ([2], [3]). These features increase the number of functions that can be implemented by a PLB without the power, speed, and area costs associated with programmable logic. However, since this reduces the flexibility of the PLB, optimally mapping functions to these non-programmable components is difficult. This creates an area penalty which is hard to quantify objectively. The PLB area usage significantly affects the cost of the final circuit implementation on the FPGA. Although

the FPGA silicon area is dominated by the routing interconnect, the cost of implementing a circuit in an FPGA is directly proportional to the PLB capacity of the FPGA [4]. Since FPGAs are sold in a number of pre-fabricated sizes, decreasing the number of PLBs in the final circuit netlist may allow the circuit to be realized in a smaller FPGA, thereby reducing the cost of the design. Reducing the PLB usage also has a more localized effect, since it allows subcircuits to be realized in a smaller number of clusters. This produces a much faster subcircuit because it reduces the number of inter-cluster connections in these subcircuits, which are known to dominate FPGA delay [1].

Both the PLB architecture and the technology mapper, which converts a gate-level netlist into a netlist of PLBs, have a large impact on the final area of the circuit. Bad PLB designs and poor-quality technology mappers can lead to very costly circuit implementations with poor performance in the FPGA. Thus, it is important to evaluate PLB architectures and develop high-quality technology mappers during FPGA development. In this paper, we present two tools that accomplish both of these goals using a new PLB function mapping approach based on Boolean Satisfiability (SAT). We will illustrate that the main benefit of our technique is its generality: it can be applied to any PLB architecture and requires no custom decomposition techniques. The first tool we present helps quantify the area usage of various PLB architectures. The second tool is a resynthesis technique which is guaranteed to optimally map functions to small subcircuits. During our resynthesis study, we focus on a class of functions known to be non-disjoint, which we will show synthesis tools and technology mappers have great difficulty mapping optimally.
Before we introduce our tools, we present some background on the technology mapping problem in Section II. This is followed by a description of the SAT problem and an explanation of how PLB function mapping is transformed into the SAT problem in Section III. We follow this with a detailed description of our PLB evaluation method and resynthesis technique in Section IV. Finally, we present several results illustrating the generality of our technique in Section V.

II. BACKGROUND

A. Technology Mapping

Technology mapping a circuit description into a netlist of PLBs occurs after logic synthesis. Logic synthesis optimizes a gate-level circuit description through a sequence of technology-independent transformations [5] to improve area and delay. In this work, delay is considered proportional to the depth of a circuit, where the depth of a node is defined as the length of the longest path from the node to a primary input. A primary input is any node in a circuit with no fanin, such as an input pin. The dual to this is a primary output, which is any node in a circuit with no fanouts, such as an output pin. Technology mapping takes the optimized gate-level netlist and converts it into a netlist of PLBs. Previous work showed that the depth-optimal technology mapping solution can be obtained in polynomial time using a dynamic programming procedure [6]. The disjoint relationship between logic synthesis and technology mapping often leads to technology-mapped circuits that are far from optimal. In later sections, we will show methods to resolve this problem through SAT.
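The depth definition above can be made concrete with a short sketch. The netlist, the node names, and the `fanins` dictionary layout are illustrative assumptions and do not come from the paper; the code assigns depth 0 to primary inputs and one more than the deepest fanin to every other node.

```python
def node_depths(fanins):
    """Return {node: depth} for a combinational netlist.

    `fanins` maps each node to the list of nodes driving it.
    Primary inputs have an empty fanin list and depth 0.
    """
    depth = {}

    def visit(v):
        # Memoised longest-path computation back to the primary inputs.
        if v not in depth:
            ins = fanins.get(v, [])
            depth[v] = 0 if not ins else 1 + max(visit(u) for u in ins)
        return depth[v]

    for v in fanins:
        visit(v)
    return depth


# Hypothetical netlist: f = AND(a, b), g = OR(f, c).
netlist = {"a": [], "b": [], "c": [], "f": ["a", "b"], "g": ["f", "c"]}
depths = node_depths(netlist)
print(depths["f"], depths["g"])  # 1 2
```

The recursion is adequate for small examples; a production mapper would more likely process nodes in topological order to avoid deep call stacks.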

The process of technology mapping is often treated as a covering problem. For example, consider the process of mapping a circuit into LUTs as illustrated in Fig. 3. Fig. 3a illustrates the initial gate-level netlist, Fig. 3b illustrates a possible covering of the initial netlist using LUTs, and Fig. 3c illustrates the LUT netlist produced by the covering. In the mapping given, the gate labeled x is covered by both LUTs and is said to be duplicated. In a duplication-free mapping, each gate in the initial circuit is covered by a single LUT in the mapped circuit [7]. Surprisingly, however, the controlled use of duplication can lead to further area savings [8]. In contrast to the depth minimization problem, the area minimization problem was shown to be NP-hard for LUTs of size four and greater ([9], [10]). Thus, solving the area minimization problem requires heuristics.

Fig. 3. Technology mapping as a covering problem. (Panels: (a) Initial Netlist, (b) Possible Covering, (c) LUT Mapping; each panel has inputs a to e and outputs f and g.)

Another way to look at technology mapping is as a cone selection problem. The subcircuits circled in Fig. 3b are examples of cones. Technology mapping seeks to find the best set of cones that can be mapped to the current PLB architecture, where "best" is determined by the optimization goal, such as area, speed, or power. If the FPGA architecture consists solely of K-LUTs, mapping from cones to K-LUTs is a direct process, since any cone with K inputs or fewer can be implemented in a K-LUT. A cone with K inputs or fewer is said to be K-feasible. Thus, to technology map circuits to K-LUTs, the circuit simply has to be decomposed into a set of K-feasible cones. However, if the FPGA architecture consists of generic K-input PLBs, mapping from cones to PLBs is much more difficult, since PLBs cannot implement all possible K-feasible cones. For example, the PLB in Fig. 4 cannot implement a 3-input OR gate. Previous work solved this problem using two main approaches. In the first, a specialized PLB is proposed and a customized mapping algorithm is implemented to map benchmark circuits to the proposed PLB [11]. In the second, functions are decomposed using specialized Boolean matching techniques so that they match the structure of the PLB [12]. A problem with both of these approaches is that they require specific Boolean techniques to map functions to a given PLB architecture. We solve this problem in a general manner using SAT, making our technique applicable to any PLB architecture and any Boolean function.

Although more limited in functionality, PLBs offer speed, area, and power advantages over fully programmable K-LUTs. In general, only a small subset of K-feasible cones will appear in most logic circuits; therefore, as long

as a given PLB architecture captures most cones encountered in real circuits, it will be successful in implementing those circuits. In [12], the authors evaluate PLBs based upon the number of functions a given PLB can implement. We adopt a similar measure: we determine the flexibility of a PLB by extracting a set of K-feasible cones from benchmark circuits and determining how many of these cones can fit into the PLB, where a high fit percentage is desired. Although we adopt a comparison metric similar to that of [12], no previous work has been general enough to apply to all the PLB architectures we present later.

Fig. 4. Example PLB. (Figure labels: LUT, configuration bits $L_1$ to $L_4$, LUT output $z_1$, PLB output $G$.)

The PLB flexibility gives only a preliminary estimate of the efficiency of the PLB. To gauge how much area overhead the non-programmable components in the PLB will add to an FPGA, a full area estimate of the FPGA device is necessary. This can be calculated by deriving the number of PLBs required to implement a given circuit, in conjunction with an area estimate of an FPGA tile containing a cluster of PLBs. We present two tools in this paper that perform both of these tasks.

III. BOOLEAN SATISFIABILITY APPLIED TO FPGA FUNCTION MAPPING

The following sections provide a brief overview of Boolean satisfiability and Quantified Boolean satisfiability. The informed reader may skip these sections and go on to Section III-B. Boolean Satisfiability (SAT) has gained recent interest, particularly in CAD for digital circuits. The primary reason for this is that several problems that occur in CAD can be represented as a Boolean formula and thus can be solved using SAT.
SAT was the first problem shown to be NP-Complete [13, ch. 34] and is formally defined as follows.

Definition 3.1: The Boolean Satisfiability Problem: Given a Boolean formula defined on variables $x_1, x_2, \ldots, x_n$, seek an assignment to these variables such that the Boolean formula evaluates to true. If this is possible, the Boolean formula is said to be satisfiable (SAT); otherwise, it is said to be unsatisfiable (UNSAT).

For ease of readability, we will use the term SAT to refer to both Boolean Satisfiability and satisfiable when the meaning is obvious from the context. SAT solvers are tools that seek to solve the SAT problem. Generally, SAT solvers work on Boolean formulae in Conjunctive Normal Form (CNF, also known as Product-of-Sums). A Boolean formula is in CNF if it consists only of a conjunction of clauses, where each clause contains a disjunction of literals and a literal is defined as

any variable or its complement. In this form, SAT seeks an assignment of variables such that every clause in the Boolean formula has at least one literal evaluating to true. For example, Fig. 5 gives an illustration of a Boolean formula in CNF and a satisfying assignment.

Fig. 5. An example CNF with a satisfying assignment. (The figure shows a three-clause CNF formula F with one literal and one clause labeled, together with the satisfying assignment $x_1 = 0$, $x_2 = 1$, $x_3 = 1$.)

A. Quantified Boolean Satisfiability

Fig. 6. An example QBF. (The figure shows a three-clause formula F with quantifiers applied to its variables.)

SAT is actually a special case of the much more difficult problem called Quantified Boolean Satisfiability (QSAT). Definition 3.1 still holds for QSAT; however, QSAT is the more general problem of determining whether a Quantified Boolean Formula (QBF) is satisfiable. A QBF is a Boolean formula where quantifiers are applied to its variables. For example, Fig. 6 shows an example of a QBF in CNF. A Boolean formula is a special case of a QBF in which all the variables have an implicit existential quantifier.

QSAT is known to be PSPACE-complete [14]. Although not formally proven, PSPACE-complete problems are thought to be harder than NP-complete problems; it is known that $\mathrm{NP} \subseteq \mathrm{PSPACE}$. An intuitive explanation of this can be given through a simple example. Consider Equation 1, which shows a simple Boolean expression and a possible satisfying assignment. Now consider the same expression but with quantifiers added to its variables, shown in Equation 2. The satisfying assignment to the QBF shown in Equation 2 is much more elaborate than its unquantified counterpart. This simple example shows that QSAT must explore a much larger search space to find a satisfying solution when compared against SAT.

B. Transforming FPGA Function Mapping to SAT

At its core, mapping digital circuits to FPGA fabric is the process of decomposing a circuit into a set of Boolean functions that map into a netlist of PLBs. In general, mapping Boolean functions into programmable logic is not trivial, since general programmable structures with K inputs, such as PLBs, can only implement a small subset of K-input functions. Interestingly, this problem can be represented as a QBF and solved using QSAT where a satisfying

assignment indicates that the mapping is possible. Furthermore, if satisfiable, QSAT will return the programmable configuration necessary to implement the Boolean function in the given programmable structure. We will show later how to transform QSAT into SAT and use SAT solvers to solve the simplified problem.

Fig. 7. A QSAT complexity example. (Equation 1 is an unquantified three-clause expression with satisfying assignment $x_1 = 1$, $x_2 = 1$, $x_3 = 0$. Equation 2 is the same expression with quantifiers added; its satisfying assignment lists both $x_1 = 1$, $x_2 = 1$, $x_3 = 0$ and $x_1 = 0$, $x_2 = 0$, $x_3 = 0$.)

In order to formulate the Boolean function mapping problem as QSAT, it needs to be formalized as follows.

Problem 3.2: A Boolean function, F, with n inputs can be realized in a programmable circuit, G, with m inputs and l programmable bits, where $n \le m$, if and only if there exists at least one configuration of the l programmable bits such that $G \equiv F$ for all inputs applied in the same manner to G and F.

The QBF representation of Problem 3.2 is shown in Fig. 8. To ensure that the inputs to F and G are applied in the same manner, we represent their inputs by the same variables, $x_1, x_2, \ldots, x_n$. $L_1, L_2, \ldots, L_l$ represent the programmable bits, and $z_1, z_2, \ldots, z_o$ represent any auxiliary variables found in the expression $G \equiv F$. The existence of auxiliary variables will be explained later, and is a side effect of the derivation method we use to construct $G \equiv F$.

$H = \exists L_1 L_2 \ldots L_l \; \forall x_1 x_2 \ldots x_n \; \exists z_1 z_2 \ldots z_o \; (G \equiv F)$

Fig. 8. QBF representation of Problem 3.2.

To derive the expression $G \equiv F$, we adapt the circuit characteristic function. For a detailed description of the characteristic function, please refer to [15]. A characteristic function, $\psi$, is a Boolean representation of a digital circuit in CNF which can be modified to express $G \equiv F$. The characteristic function describes all consistent input, output, and intermediate wire vectors of the digital circuit. For example, consider the OR-gate shown in Fig. 9. Next to the OR-gate is its characteristic function truth table. The onset of this truth table represents all cubes that are consistent with an OR-gate, such as $x_1 = 0$, $x_2 = 1$, $G = 1$. Using any standard minimization procedure, the OR-gate characteristic function can be derived as shown in Fig. 9. Characteristic functions for large circuits can be derived from the conjunction of the characteristic functions of

their basic elements. For example, consider Fig. 10, which consists of an OR-gate fed by a 2:1 MUX. To construct the characteristic function of this circuit, we simply conjoin (logical AND) the characteristic functions of the MUX and OR-gate (the derivation of the MUX characteristic function can be accomplished using the technique shown in Fig. 9). The auxiliary variables, $z_i$, in Fig. 8 stem from the intermediate wires found in the configurable circuits. This is seen in Fig. 10 with variable $z_1$. These auxiliary variables provide a logical link between the basic characteristic functions to form a unified expression representing the entire configurable circuit.

$F_{OR} = (x_1 + x_2 + \overline{G}) \cdot (\overline{x_1} + G) \cdot (\overline{x_2} + G)$

Fig. 9. Deriving a characteristic function for an OR-gate.

$F_{MUX\text{-}OR} = F_{OR} \cdot F_{MUX}$ (3), where $F_{OR} = (x_1 + z_1 + \overline{G}) \cdot (\overline{x_1} + G) \cdot (\overline{z_1} + G)$ and $F_{MUX}$ is the four-clause characteristic function of the 2:1 MUX that drives $z_1$.

Fig. 10. Deriving characteristic functions from basic elements.

Using the previous procedure, the characteristic function, which we will refer to as $\psi$, can be found for any configurable circuit. Using $\psi$, we can derive the expression for $G \equiv F$. To represent the equivalence operator, all instances of the variable G in $\psi$ are replaced with the expression representing F. This substitution is written as $\psi[G/F]$, which will be used in later sections. Going back to the original problem formulation, our goal is to find whether there exists a configuration of our circuit G such that $G \equiv F$ for all inputs applied to G. Thus, as one final step, quantifiers are added to the expression $G \equiv F$ to form a final CNF expression representing Problem 3.2, as shown in Fig. 11, where $L_1 L_2 \ldots L_l$ represent the configuration bits and $x_1 \ldots x_n$ represent the inputs to G.

1) Removing Quantifiers on QBF to form SAT: Although function H in Fig. 11 can be solved using QSAT, it is often faster to remove the quantifiers found on a QBF and use SAT solvers.
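The meaning of the quantified expression H (some configuration exists such that, for every input, G equals F) can be illustrated by plain enumeration on a tiny circuit. The sketch below is not the SAT formulation used in the paper: it brute-forces every configuration and every input. The PLB model (a 2-LUT on $x_1$, $x_2$ whose output is ANDed with $x_3$), the bit ordering of the LUT, and all function names are assumptions made for illustration.

```python
from itertools import product


def plb(config, inputs):
    """Hypothetical PLB: a 2-LUT on (x1, x2) feeding an AND gate with x3.

    `config` holds the four SRAM bits; the address is 2*x1 + x2
    (an assumed ordering).
    """
    x1, x2, x3 = inputs
    z1 = config[2 * x1 + x2]  # LUT output (intermediate wire)
    return z1 & x3


def find_configuration(f, g, n_inputs, n_bits):
    """Return a configuration with g == f on all inputs, or None."""
    for config in product((0, 1), repeat=n_bits):          # exists L
        if all(g(config, x) == f(x)
               for x in product((0, 1), repeat=n_inputs)):  # for all x
            return config
    return None


def and3(x):
    return x[0] & x[1] & x[2]


def or3(x):
    return x[0] | x[1] | x[2]


print(find_configuration(and3, plb, 3, 4))  # (0, 0, 0, 1)
print(find_configuration(or3, plb, 3, 4))   # None
```

Under this assumed model the 3-input AND fits with a single LUT bit set, while the 3-input OR does not fit, because the output is forced to 0 whenever $x_3 = 0$. The enumeration costs $2^l \cdot 2^n$ evaluations, which is why a solver-based formulation is preferred for realistic circuits.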
In [16], the author presents a method to remove all quantifiers on a QBF to convert the QSAT problem into SAT. We adopt a similar method in this work to remove the quantifiers in Fig. 11. To do this, first the unquantified expression $\psi[G/F]$ is replicated $2^n$ times

and conjoined together, where n is the number of universally quantified variables ($x_1, \ldots, x_n$). This is shown in the first line of Equation 4. Next, the universal variables in each replicated expression are replaced with one possible enumeration such that no two replicated expressions have identical enumerations, which is shown in Equation 5. The purpose of this is to explicitly cover all possible values of the universal variables. In addition, variables bound to the innermost existential quantifier ($z_1 z_2 \ldots z_o$) are replaced with unique variables in each replicated expression. This preserves the meaning of their original existential quantifier. Finally, the remaining existential quantifier on the configuration variables $L_1 L_2 \ldots L_l$ does not have to be shown explicitly, since variables without an explicit quantifier implicitly have an existential quantifier applied to them. The resulting expression can then be passed to a SAT solver, where a satisfying assignment implies that F can be realized in G.

$H = \exists L_1 L_2 \ldots L_l \; \forall x_1 \ldots x_n \; \exists z_1 z_2 \ldots z_o \; (\psi[G/F])$

Fig. 11. Characteristic function based representation of Problem 3.2.

$\Psi = \psi_0 \cdot \psi_1 \cdot \psi_2 \cdots \psi_{2^n - 1}$ (4)

$\;\;\; = \psi_0[x_1/0, x_2/0, \ldots, x_n/0, z_1/z_{o+1}, \ldots, z_o/z_{2o}] \cdot \psi_1[x_1/0, \ldots, x_n/1, z_1/z_{2o+1}, \ldots, z_o/z_{3o}] \cdots \psi_{2^n - 1}[x_1/1, \ldots, x_n/1, z_1/z_{(2^n - 1)o+1}, \ldots, z_o/z_{2^n o}]$ (5)

Fig. 12. Removing the quantifiers on the QBF in Fig. 11.

Fig. 13. Permutation example. (Panels: (a) Original Function, (b) New Function; each panel is a truth table of F(X).)

2) Permutable Inputs: In addition to having programmable bits to program the function being implemented, the inputs of programmable circuits are usually permutable. This greatly expands the number of functions a programmable circuit can implement. For example, consider the simple 3-input function shown in Fig. 13a. By

changing the inputs $x_1$ and $x_2$, a new function can be realized as shown in Fig. 13b. In fact, a K-input function can be transformed into at most $K! - 1$ other functions by simply permuting its variables in every possible way. The way to model this flexibility in a digital circuit is to add multiplexers at the inputs, as shown in Fig. 14. These virtual multiplexers are extremely versatile in that they can also add restrictions on routing. For example, assume configuration $V_5 V_6 = 11$ would feed input $x_3$ into the XOR-gate ($z_4$). To prevent this, the clause ($\overline{V_5} + \overline{V_6}$) is conjoined to the PLB characteristic function.

Fig. 14. Modeling permutable inputs of a programmable circuit. (Figure labels: 2-LUT, configuration bits $L_1$ to $L_4$, intermediate wires $z_1$ to $z_4$, virtual multiplexer select bits $V_1$ to $V_6$, output $G$.)

3) Function Mapping using SAT Example: To give a better understanding of the previously described concepts, an example is provided. Taking the circuit shown in Fig. 14, we wish to determine whether the Boolean function described in Fig. 13a can be realized in it. The first step is to create the expression $\psi[G/F]$. This is done by finding the characteristic function of Fig. 14, as shown in Fig. 15. Notice that the characteristic function $\psi$ is created from the conjunction of the basic characteristic function components that form the configurable circuit. Following the construction of $\psi$, all instances of G need to be replaced by F to create the expression $G \equiv F$, as shown in Fig. 16. The expression $\psi[G/F]$ depends on the variables representing all inputs, outputs, configuration bits, and intermediate wires. In this form, quantifiers can be added to these variables and the final QBF can be solved using a QBF solver, where a satisfying assignment implies that function F can be realized in the configurable circuit. This QBF was shown in Fig. 11, where $x_1 \ldots x_n$, $L_1 L_2 \ldots L_l$, and $z_1 z_2 \ldots z_o$ represent the inputs, configuration bits, and intermediate wire variables respectively. As an extra step, there is the option to remove all quantifiers and solve the final expression using the SAT method shown previously in Fig. 12.

IV. FPGA AREA DRIVEN SAT BASED APPLICATIONS

The primary power of the SAT technique shown in the previous section is its generality. There are no restrictions on the type of circuit or function that it can represent. We demonstrate several algorithms that use our SAT-

based decision process. These applications can be categorized into PLB evaluation [17] and resynthesis. The PLB evaluation algorithm provides a quantitative area assessment of new PLB architectures, while the resynthesis algorithm helps reduce the final area of the circuit implementation in an FPGA.

Fig. 15. Characteristic function of PLB seen in Fig. 14. The figure gives Equations 6 to 11:
$G_{LUT}$ (6): eight clauses over $z_3$, $z_2$, $L_1 \ldots L_4$, and $z_1$, two per configuration bit, which force $z_1 = L_i$ whenever ($z_3$, $z_2$) selects $L_i$.
$G_{XOR} = (\overline{z_1} + \overline{z_4} + \overline{G}) \cdot (z_1 + z_4 + \overline{G}) \cdot (z_1 + \overline{z_4} + G) \cdot (\overline{z_1} + z_4 + G)$ (7)
$G_{VMUX1}$ (8), $G_{VMUX2}$ (9), $G_{VMUX3}$ (10): six clauses each, over the select bits ($V_1$, $V_2$), ($V_3$, $V_4$), and ($V_5$, $V_6$), the PLB inputs, and the multiplexer outputs $z_2$, $z_3$, and $z_4$ respectively.
$\psi = G_{LUT} \cdot G_{XOR} \cdot G_{VMUX1} \cdot G_{VMUX2} \cdot G_{VMUX3}$ (11)

$(G \equiv F) = \psi[G/F]$ (12)

Fig. 16. Forming expressions $G \equiv F$.

A. Application to PLB Evaluation

To evaluate PLB area efficiency, we take two approaches. First, we develop a tool to characterize the flexibility of a PLB. The metric we use to represent PLB flexibility is a fit percentage, which is the percentage of cones sampled from various circuits that can be realized in a single PLB. Using this metric, PLBs with a high fit percentage are considered more flexible than PLBs with a low fit percentage. This gives a rough indication of how often the non-programmable components are expected to be utilized.

Our second approach yields a more conclusive area estimate associated with a given PLB architecture. This approach uses the PLB resource usage required to implement various circuits, together with the FPGA tile area, to derive an overall area estimate. The flow of this process is illustrated in Fig. 17. As Fig. 17 shows, a generic PLB technology mapper is necessary to derive the PLB usage. Since area is our primary comparison metric, one requirement for the technology mapper is to be competitive with state-of-the-art technology mappers, yet be general enough to

map to any PLB architecture. We achieve the competitive requirement by using the same heuristics as the IMap LUT mapper, which outperforms all other technology mappers in terms of area; and we achieve the generality requirement by using our SAT-based function mapping technique to map functions to any PLB architecture. This is shown at the top of Fig. 17, where the PLB circuit description and netlist are passed to the PLB technology mapper. The technology mapper uses this description to generate the CNF expression representing the PLB, as described previously in Section III-B.

Fig. 17. Area estimation flow for a given PLB architecture. (Flow: the PLB description and a BLIF netlist of primitive gates feed PLB technology mapping, which produces a technology-mapped netlist and the PLB usage; the PLB usage is combined with the MWT area of a single FPGA tile to give the MWT area.)

Once the PLB usage is derived, we can estimate how much area a technology-mapped circuit will consume. Since FPGA area is known to be dominated by transistor area, we use minimum width transistor area [18] as our estimate of the overall area taken by the circuit. This is shown at the bottom of Fig. 17. Minimum width transistor area gives a process-independent metric of the number of transistors required to implement a given circuit, where larger transistors are counted as several minimum width transistors.

$MWT_{device} = \mathrm{CEIL}\left(\frac{USAGE_{PLB}}{PLBs\ per\ TILE}\right) \cdot MWT_{Tile}$

Fig. 18. Minimum width transistor (MWT) counts for smallest device capable of fitting the given circuit.

The way we estimate total area from PLB usage and tile area is shown in Fig. 18. $MWT_{Tile}$ is the minimum width transistor estimate of a single tile. $USAGE_{PLB}$ is the number of PLBs required to implement the given circuit. The term $\mathrm{CEIL}(USAGE_{PLB} / PLBs\ per\ TILE)$ returns the number of tiles required to implement a given circuit using a particular PLB architecture.
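The estimate in Fig. 18 amounts to a ceiling division followed by a multiplication. A minimal sketch follows; the function name, parameter names, and the numbers in the usage line are hypothetical and are not results from the paper.

```python
def mwt_device(usage_plb, plbs_per_tile, mwt_tile):
    """Area estimate of Fig. 18 in minimum-width-transistor units.

    usage_plb     -- PLBs needed by the technology-mapped circuit
    plbs_per_tile -- PLBs in the cluster of one tile
    mwt_tile      -- minimum width transistor count of a single tile
    """
    # Integer ceiling division: a partly filled tile still costs a full tile.
    tiles = -(-usage_plb // plbs_per_tile)
    return tiles * mwt_tile


# Made-up numbers: 1001 PLBs, 10 PLBs per tile, 5000 MWT per tile.
print(mwt_device(1001, 10, 5000))  # 505000 (101 tiles)
```

Integer arithmetic is used for the ceiling so that large PLB counts are not affected by floating-point rounding.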
Finally, $MWT_{device}$ is the area estimate of the smallest FPGA required to implement the technology-mapped circuit returned by our generic PLB technology mapper. Using this metric, a fair area comparison of various PLB architectures can be made.

1) PLB Fit Percentage: Fig. 19 shows a high-level overview of our PLB fit percentage algorithm. As stated previously, PLBs that can capture the functionality of most cones found in real circuits are desired, since their non-programmable components will not be wasted. To help find such PLBs, our tool can be used to return

a PLB cone fit percentage, where a high fit percentage is preferred. This fit percentage is found by extracting a set of cones from a list of circuits, then applying our SAT decision step to remove cones that do not fit in the given architecture, as shown in lines 1 and 2 of Fig. 19. By recording the number of cones generated and discarded, a fit percentage for various PLB architectures can be found.

  1  X ← GENERATECONES()
  2  Y ← REMOVENOFITCONES()
  3  FitPercent ← (X − Y)/X

Fig. 19. An overview of the PLB evaluation algorithm.

A version of the algorithm described in [8] is used to generate and store all K-feasible cones in the graph. The K-feasible cones are generated as the graph is traversed in topological order from primary inputs to primary outputs. At every internal node v, new cones are generated by combining the cones at the input nodes.

2) Technology Mapping Using SAT: Our function mapping technique allows us to convert any K-LUT technology mapper into a K-input PLB technology mapper. As stated in Section II-A, technology mapping to LUTs can be considered as a covering problem. The same is true for K-input PLBs; however, because a K-input PLB is not fully programmable, not all K-input cones can fit into the PLB. Thus, when generating cones during the technology mapping phase, cones that do not fit into the given PLB should be discarded. This leaves a set of cones guaranteed to fit into the PLB architecture.

  1  GENERATECONES()
  2  REMOVENOFITCONES()
  3  for i ← 1 upto MaxI
  4      TRAVERSEFWD()
  5      TRAVERSEBWD()
  6  end for
  7  CONESTOPLBS()

Fig. 20. High-level overview of generic PLB technology mapper algorithm.

We base our work on IMap [19], an iterative K-LUT technology mapping algorithm. For a detailed description of IMap, please refer to [19], which shows that IMap produces among the best area results of any known technology mapper.
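The fit-percentage computation of Fig. 19 can be sketched in a few lines. In this sketch the cones are given directly as 3-input Boolean functions, and `fits_lut_and` is a stand-in for the SAT decision step: it assumes a hypothetical PLB consisting of a 2-LUT on ($x_1$, $x_2$) ANDed with $x_3$, with a fixed input order, for which a function fits exactly when it is 0 whenever $x_3 = 0$. None of these names or example cones come from the paper.

```python
from itertools import product


def fits_lut_and(f):
    """Stand-in fit check for a hypothetical 2-LUT-AND PLB.

    With G = LUT(x1, x2) AND x3 and no input permutation, f fits
    exactly when f(x1, x2, 0) == 0 for every x1, x2.
    """
    return all(f(x1, x2, 0) == 0 for x1, x2 in product((0, 1), repeat=2))


def fit_percentage(cones, fits):
    """FitPercent = (X - Y) / X, following Fig. 19."""
    generated = len(cones)                               # X
    if generated == 0:
        raise ValueError("no cones were generated")
    discarded = sum(1 for c in cones if not fits(c))     # Y
    return (generated - discarded) / generated


# Hypothetical cone functions sampled from a circuit.
cones = [
    lambda a, b, c: a & b & c,   # fits
    lambda a, b, c: a | b | c,   # does not fit
    lambda a, b, c: c,           # fits (LUT programmed to constant 1)
    lambda a, b, c: a ^ b ^ c,   # does not fit
]
print(fit_percentage(cones, fits_lut_and))  # 0.5
```

In the flow described in the text, the predicate passed as `fits` would be the SAT-based function mapping check rather than this closed-form test.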
Here, we give a brief overview of the algorithm; the basic framework for our technology mapper is presented in Fig. 20. First, a call to GENERATECONES generates a set of K-feasible cones for each node in the graph, where K is the input size of the PLB. Next, a call to REMOVENOFITCONES discards all cones that cannot fit into the PLB architecture. This decision process uses SAT as described in Section III-B. Once a set of valid cones is found, a series of forward and backward graph traversals is started to select the best cover of the graph. The cost of the cover is measured in terms of area and depth. The forward traversal, TRAVERSEFWD, selects a cone for each node, and the backward traversal, TRAVERSEBWD, selects a set of cones to cover the graph. Iteration is beneficial because every backward traversal influences the behavior of the forward traversal that follows it.

During the forward traversal, the algorithm updates the depth and the area flow for every node and edge encountered. Area flow is a heuristic for estimating the area of the mapping solution below a node or an edge; minimizing it leads to smaller mapping solutions, as described in [19]. The definition of area flow is repeated here for convenience. As Fig. 21 shows, iteration is necessary since area flow is influenced by the covering found in the previous backward traversal ($C_v$).

$AreaFlow(v) = 1 + \sum_{i \in fanin(C_v)} AreaFlow(i)$ (13)

$AreaFlow(i) = \frac{AreaFlow(u)}{fanout(C_u)}$ (14)

Fig. 21. Area Flow definition for a node v and edge i. Note that i is an edge that flows from node u to v, $C_v$ is a cone selected to cover v, and $fanout(C_u)$ is the number of fanouts leaving the cone covering u.

In the first iteration, where no previous backward traversal has occurred, $C_v$ is estimated as the node v itself. Also, $fanout(C_u)$ must be estimated and is taken as the weighted average of the previous iterations, where it is initially estimated as $fanout(u)$. A detailed description of this procedure can be found in [19].

At each internal node v, a cone rooted at v is selected to cover v and some of its predecessors in a mapping solution. The quality of the mapping solution is determined by the cone selection procedure. During area-oriented mapping, on the first mapping iteration, the cone with the lowest area flow is selected. If cones have equal area flow, the cone with the lowest depth is selected. During depth-oriented mapping, the first forward traversal establishes the optimal mapping depth, ODepth, which can then be used in subsequent iterations to bound the depth of cones selected at every node. Using the optimal depth and the height of a node v, a bound can be defined on the depth of a cone $C_v$ as follows:

$depth(C_v) \le ODepth - height(v)$ (15)

The height of a node or cone is defined as the length of the longest path from that node or cone to a primary output of the circuit. Cones that meet the bound requirement are preferred, and among a set of cones that meet the bound requirement, cones with lower area flows are selected. This selection strategy ensures that the mapping solutions will still achieve the optimal depth while minimizing area.

During the backward traversal, internal nodes of the graph are visited in reverse topological order and a cover of cones is produced. During this traversal, height(v) of each internal node is updated to the height of the cone covering it. This is for use in Equation 15 in the next forward traversal. If v is found in several cones, the largest height is used. Finally, a call to CONESTOPLBS converts the cones selected by the final backward traversal into PLBs.

3) Generating K-feasible Cones: A version of the algorithm described in [8] is used to generate and store all K-feasible cones in the graph. The K-feasible cones are generated as the graph is traversed in topological order from primary inputs to primary outputs. At every internal node v, new cones are generated by combining the

cones at the input nodes. The original IMap algorithm combined the cones in every possible way. In our work, in order to prune the number of cones explored, the cone generation algorithm collapses cones only if they have no more than $(k + e)$ distinct inputs in total, i.e. $(l + m - 1) \le (k + e)$, where $l$ and $m$ are the number of distinct inputs to each cone being collapsed into one. As long as $e$ was set to a sufficiently high number (2 in our experiments), this heuristic increased the speed of the cone generation process without significantly impacting the quality of the mapping solution.

B. Resynthesis

In this section, we address the problem that the technology mapper often fails to find an optimal solution for subcircuits. We consider publicly available state-of-the-art K-LUT technology mappers. As stated in Section II-A, technology mapping is a step that follows gate-level synthesis. Furthermore, gate-level synthesis and technology mapping are largely disjoint steps. A problem arises when the cost metrics of gate-level synthesis and technology mapping do not coincide. This is explored in detail in Section V. To bridge this gap between synthesis and technology mapping, we introduce a post-technology-mapping step that optimally resynthesizes small subcircuits.

1) Subcircuit Resynthesis: Resynthesizing several subcircuits in a sliding window fashion can reduce the overall LUT count of the entire circuit. Since a subcircuit of LUTs forms a cone, the subcircuit resynthesis problem is the function fitting problem stated previously in Problem 3.2. In this case, the Boolean function is extracted from the subcircuit consisting of X K-input LUTs and is then checked to see whether it can fit into a programmable structure containing fewer than X K-input LUTs. This check is done using our SAT-based technique. To illustrate this process, consider Fig. 22.
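At toy scale, the fit check can be pictured with a brute-force stand-in for the SAT formulation. The sketch below is not the paper's SAT encoding: it simply enumerates every configuration of an assumed two-LUT cascade (the first 2-LUT feeds the second, which also takes the third input) and every input ordering. The function name `fits_two_lut_cascade` and the cascade topology are assumptions made for illustration.

```python
from itertools import permutations, product

ASSIGNMENTS = list(product((0, 1), repeat=3))   # all 8 input vectors
LUT_CONFIGS = list(product((0, 1), repeat=4))   # all 16 contents of a 2-LUT


def fits_two_lut_cascade(truth_table):
    """Return True if a 3-input function fits two cascaded 2-LUTs.

    truth_table maps each (a, b, c) tuple of 0/1 values to the output bit.
    Assumed structure: out = lut2[lut1[p, q], r] for some ordering (p, q, r)
    of the three inputs.
    """
    for p, q, r in permutations(range(3)):
        for lut1 in LUT_CONFIGS:
            for lut2 in LUT_CONFIGS:
                if all(
                    lut2[2 * lut1[2 * x[p] + x[q]] + x[r]] == truth_table[x]
                    for x in ASSIGNMENTS
                ):
                    return True
    return False


# (a AND b) XOR c decomposes through a single intermediate signal, so it fits.
and_xor = {x: (x[0] & x[1]) ^ x[2] for x in ASSIGNMENTS}
# The 3-input majority function needs more than one intermediate bit.
majority = {x: int(sum(x) >= 2) for x in ASSIGNMENTS}

print(fits_two_lut_cascade(and_xor))
print(fits_two_lut_cascade(majority))
```

Exhaustive enumeration is workable only because two 2-LUTs have 256 joint configurations; the configuration space grows exponentially with LUT size and count, which is the case a SAT formulation is meant to handle.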
The original cone in Fig. 22a consists of three 2-LUTs that implement a three-input function. Since only three inputs enter the cone, it may be possible to resynthesize Fig. 22a into Fig. 22b to save one LUT. To determine whether resynthesis from Fig. 22a to Fig. 22b is possible, Fig. 22b is converted into a CNF expression as described in Section III-B, and the function extracted from Fig. 22a is tested using SAT to see if it fits into Fig. 22b. If the expression is satisfiable, resynthesis can proceed successfully.

Fig. 22. Resynthesis of three-input cone of logic. (a) Original cone (three 2-LUTs). (b) Resynthesized cone (two 2-LUTs).

Unfortunately, resynthesis of subcircuits with more than 6 distinct inputs is too slow for use in real-time resynthesis engines. However, this technique can be used to build a cache of optimal configurations of

digital logic blocks. This is similar to the technique used in [20], where the authors focus on multiplexer transformations. In [20], the authors traverse a technology-mapped netlist and identify multiplexers. Once identified, they replace the multiplexer circuit with their cached optimal configuration. This is a linear-time operation with respect to the circuit size, and thus has negligible impact on the running time. Our tool can help extend this process and find the optimal configuration of several types of subcircuits that technology mapping fails to find.

V. RESULTS

To demonstrate the previously described techniques, we provide several concrete examples here. We first show results for our PLB evaluation method and follow with results for our area optimization techniques. When evaluating PLBs, we show that the main benefit of our technique is its generality. We demonstrate this using three different approaches: (1) we measure the flexibility of several PLB architectures; (2) we explore a large set of hardwired PLB configurations in an automated fashion; (3) we incorporate routing features into the PLB evaluation. Each approach emphasizes how our technique can be applied to any PLB without any modification to our PLB technology mapper or evaluation framework. After our PLB evaluation results, we focus on our area optimization technique. Here, we resynthesize several common subcircuits using our sliding window technique. This is followed by a discussion of why synthesis and technology mapping miss the optimal configuration provided by our resynthesis technique.

When running our experiments, we focus on the MCNC benchmark circuit set [21]. The SAT solver used to drive our function mapper was the Chaff solver developed by M. W. Moskewicz et al. [22]. All of our algorithms were built on top of the Berkeley MVSIS project [23].

A. PLB Evaluation

1) Generality of Technique: First, to illustrate the generality of our evaluation algorithm, several unrelated PLB architectures were evaluated. Fig. 23 shows the five different PLB architectures used for evaluation. To derive the fit percentages, approximately 1000 K-input cones were extracted from each circuit sampled, where K was the input size of the PLB. Cones were extracted randomly to generate a large set of unrelated logic functions. Table I summarizes our results. Each column shows a fit percentage per circuit for each respective PLB, and % Fit shows the final fit percentages when considering all the circuits. Note that the cone fit percentage varies widely for all PLBs depending on the circuit. This shows that PLB usefulness depends on the application of the circuit. Interestingly, PLB(b) failed for all circuits except the ALU circuit (C2670). One reason is that PLB(b) uses an XOR gate; XOR gates are rare in most control circuits and are generally used for arithmetic logic. PLB(e) was only able to fit 9-input cones for a few circuits. This was expected since PLB(e) is a simplified version of a commercial PLB primarily used to implement 5-input functions or a 4:1 MUX, and is rarely used as a general 9-input function generator [24]. In order to obtain a more accurate picture of this PLB's functionality, in

addition to generating 9-input functions for PLB(e), 6-, 7-, and 8-input functions were evaluated. The results are shown in Table II. As the numbers show, this PLB looks much more useful when considering a wider range of functions.

Fig. 23. Five PLBs, (a) through (e), used in evaluation experiments.

TABLE I. PLB EVALUATION RESULTS. Columns: Circuit, (a), (b), (c), (d), (e); final row: % Fit.

2) FPGA Architecture Exploration: Here, we demonstrate how to use our SAT-based technology mapper to give a full area comparison between various PLB architectures. The standard architectures that we used for area comparisons were 4-LUT and 5-LUT based FPGAs. Our goal is to demonstrate the generality of our technique by exploring the resource usage of a wide range of possible PLB structures using our technology mapping algorithm. This is followed by incorporating minimum width transistor (MWT) areas and the routing architecture to get a full comparison. For all experiments in this section, we optimized our circuits using SIS [25] with script.rugged. These optimized circuits were then passed to our SAT-based technology mapper. Since we wanted to achieve the smallest area results, our technology mapper was tuned to optimize for area while ignoring depth for all circuits.

Footnote 1: No functions with 8 or more inputs could be found in this circuit.

TABLE II. THE PERCENTAGE OF CONES THAT FIT INTO FIG. 23(E). Columns: Circuit, 6-input, 7-input, 8-input; final row: Total % Fit.

First, we focus on the highlighted steps shown in Fig. 24. As Fig. 24 illustrates, the PLB description can be modified to explore various PLB architectures.

Fig. 24. Steps in overall area estimation flow to derive the PLB usage. Blocks shown: PLB Description; BLIF Netlist of Primitive Gates; PLB Technology Mapping; Technology Mapped Netlist; MWT Area of Single FPGA Tile; PLB Usage; MWT Area.

In order to explore several PLB architectures in an automated fashion, instead of manually creating several PLB structures, we used the PLB shown in Fig. 25 as our PLB. The PLB shown has four distinct inputs and consists of a 3-LUT in conjunction with a 3-input hardwired function. The benefit of this PLB architecture is that there are $2^{8} = 256$ possible hardwired functions. Each possible hardwired function can be explored quickly by modeling the hardwired function as a preconfigured 3-LUT that is common to all PLBs in the FPGA. To illustrate the exploration process, we technology mapped one benchmark circuit to all 256 possible PLB hardwired configurations. The results are illustrated in Table III, which summarizes the hardwired SRAM preconfigurations that produced the lowest and highest overhead in terms of PLB usage. Row 4-LUT shows the PLB usage if only 4-LUTs were used. Column Ratio shows the ratio of PLB usage when compared against the 4-LUT architecture. The configurations that produced the lowest PLB usage occurred for the preconfiguration values of 0x00 and 0xFF. This corresponds to the PLBs shown in Fig. 26. The original intent of this experiment was to show how we could evaluate a wide range of PLBs in an automated fashion. Since

this is just an illustrative study, only one benchmark circuit was used. A more conclusive study, however, would include a wide range of benchmark circuits. Having said that, note that the 0x00 and 0xFF configurations model an AND and OR gate cascade as seen in Fig. 27. Digital logic contains a large number of AND and OR gates, thus we suspect that other benchmark circuits would yield similar results. Furthermore, the area results we obtain for the PLB models shown in Fig. 27 coincide with industrial findings, where AND and OR-gate cascade structures are common in industrial FPGAs such as Altera's Apex20k [2].

Fig. 25. 3-input hardwired support function based PLB (a hardwired function combined with a 3-LUT).

TABLE III. SUMMARY OF SRAM PRECONFIGURATIONS THAT PRODUCED LOW AND HIGH PLB USAGE FOR CIRCUIT EX5P. Columns: Bit Mask, PLB Usage, Ratio.

Table III clearly shows that there is an associated PLB usage overhead when removing some programmability from the PLB. However, as long as the increase in PLB usage is amortized by the decrease in silicon area of the non-programmable components, the loss in flexibility may be beneficial. To explore this idea, we focused on the two PLB configurations that produced the lowest area overhead in Table III. Both of these PLB configurations can be realized as a single 4-input PLB using a 2:1 MUX and an SRAM bit as shown in Fig. 28a, which we refer to as 4-MUX-PLB. To estimate the area performance of the 4-MUX-PLB architecture, we technology mapped 182 MCNC benchmark circuits to it and recorded the PLB usage for each circuit. Furthermore, since we want to illustrate that our evaluation tool works for any PLB architecture and we know that the cascade structure of 4-MUX-PLB is similar to industrial

5-input PLBs, we also compared against the 5-LUT architecture using the 5-input PLB shown in Fig. 28b, referred to as 5-MUX-PLB, for technology mapping. After the PLB usage is recorded, we use those numbers to calculate the MWT area and compare the final results.

Fig. 26. PLB configurations, (a) and (b), that produced the smallest area results when compared against a simple 4-LUT architecture.

Fig. 27. Equivalent PLB representation, (a) and (b), using basic gates.

Fig. 28. Candidate PLBs used in area experiments. (a) 4-MUX-PLB. (b) 5-MUX-PLB.

A summary of the PLB usage is shown in Table IV, where only the 20 largest of the 182 circuits tested are shown in detail. A geometric mean of PLB usage ratios is shown, with comparisons against both 4-LUTs and 5-LUTs. Ratio shows the ratio of the PLB architecture against the LUT architecture. GeoMean is the

geometric mean of the Ratio for all 182 circuits tested. Figs. 29 and 30 give a graphical view of the PLB usage overhead for all the circuits, where the overhead is calculated as Overhead = Ratio − 1. The results show that the 4-MUX-PLB has a 20.1% usage overhead when compared against a 4-LUT architecture, and 5-MUX-PLB has a 10.5% overhead when compared against a 5-LUT architecture. Again, this PLB usage increase may be acceptable if it is amortized by the decrease in MWT area.

TABLE IV. MCNC PLB USAGE SUMMARY AND COMPARISON AGAINST 4-LUT AND 5-LUT BASED FPGA ARCHITECTURES. Columns: Circuit, 4-MUX-PLB, 4-LUTs, Ratio, 5-MUX-PLB, 5-LUTs, Ratio; final row: GeoMean.

Fig. 29. MCNC benchmark circuit PLB usage overhead when comparing the 4-MUX-PLB against the 4-LUT architecture. The geometric mean of the overhead is 20.0%.

Finally, we finish our area estimation demonstration with the steps highlighted in Fig. 31. In these steps, we use the PLB usage counts to find a full area comparison. This requires the minimum width transistor area for an FPGA
