Space- and Time-Efficient BDD Construction via Working Set Control


Bwolen Yang, Yirng-An Chen, Randal E. Bryant, David R. O'Hallaron
Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, USA

Abstract

Binary decision diagrams (BDDs) have been shown to be a powerful tool in formal verification. Efficient BDD construction techniques become more important as the complexity of protocol and circuit designs increases. This paper addresses this issue by introducing three techniques based on working set control. First, we introduce a novel BDD construction algorithm based on partial breadth-first expansion. This approach has the good memory locality of the breadth-first BDD construction while maintaining the low memory overhead of the depth-first approach. Second, we describe how memory management on a per-variable basis can improve spatial locality of BDD construction at all levels, including expansion, reduction, and rehashing. Finally, we introduce a memory-compacting garbage collection algorithm to remove unreachable BDD nodes and minimize memory fragmentation. Experimental results show that when the applications fit in physical memory, our approach has speedups of up to 1.6 in comparison to both depth-first (CUDD) and breadth-first (CAL) packages. When the applications do not fit into physical memory, our approach outperforms both CUDD and CAL by up to an order of magnitude. Furthermore, the good memory locality and low memory overhead of this approach have enabled us to be the first to have successfully constructed the entire C6288 multiplication circuit from the ISCAS85 benchmark set using only conventional BDD representations.

I. INTRODUCTION

With the increasing complexity of protocol and circuit designs, formal verification has become an important research area. Binary decision diagrams (BDDs) have been shown to be a powerful tool in formal verification [4]. Even though many functions have compact BDD representations, some functions can have very large BDDs.
For example, BDD representations for integer multiplication have been shown to be exponential in the number of input bits [5]. To address this issue, there are many BDD-related research efforts directed towards reducing the size of the graph with techniques like new compact representations for specific classes of functions (KFDD [9] and *BMD [6]), divide-and-conquer (POBDD [11] and ACV [7]), function abstraction (aBDD [12]), and variable reordering [19]. Despite these efforts, large graphs can still naturally arise for more irregular functions or for incorrect implementations of a specification. An incorrect implementation can break the structure of a function and thus can greatly increase the graph size. For example, the *BMD representation for integer multiplication is linear. However, a mistake in the implementation of integer multiplication logic can cause an exponential explosion of the resulting graph. The ability to handle large graphs efficiently can enable us to represent more irregular functions and to provide counterexamples for incorrect implementations.

Conventional BDD algorithms [2] are based on depth-first traversal of BDD graphs. This approach has small memory overhead, but poor memory locality.

Effort sponsored in part by the Advanced Research Projects Agency and Rome Laboratory, Air Force Materiel Command, USAF, under agreement number F , in part by the National Science Foundation under Grant CMS , and in part by a grant from the Intel Corporation. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright annotation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Advanced Research Projects Agency, Rome Laboratory, or the U.S. Government.

Supported in part by the Defense Advanced Research Project Agency (DARPA) under contract number DABT63-96-C
To address the issue of constructing large BDDs efficiently, there have been many implementations [14, 15, 1, 10, 18] based on breadth-first traversal. The breadth-first approach, which exploits its graph traversal pattern by using specialized memory layouts, has better memory access locality and thus often has better performance. However, the breadth-first approach can have a large memory overhead, up to quadratic in the size of BDD operands. This extra memory overhead can result in an increased number of page faults and thus poor performance.

To maintain memory access locality with low memory overhead, we introduce a new algorithm based on partial breadth-first expansion. This algorithm improves locality of reference by controlling the working set size and thus reducing overhead due to page faults. We describe how memory management on a per-variable basis can improve spatial locality of BDD construction at all levels, including expansion, reduction, and rehashing. Finally, we introduce a breadth-first BDD garbage collection algorithm which performs memory compaction without incurring additional memory overhead. All of these techniques work together to control the working set size and have a significant impact on performance of BDD construction. As these techniques exploit inherent properties of BDD construction, graph reduction techniques (like *BMD, POBDD, and dynamic variable reordering) can be incorporated into our algorithms to further expand the usefulness of these

algorithms.

Experimental results show that when the applications fit in physical memory, our approach has speedups of up to 1.6 in comparison to leading depth-first (CUDD) and breadth-first (CAL) packages. When the applications do not fit into physical memory, our algorithm outperforms both CUDD and CAL by up to an order of magnitude. Furthermore, to demonstrate how our techniques can efficiently build very large graphs, we constructed the output BDDs for the C6288 multiplication circuit from the ISCAS85 benchmark. To the best of our knowledge, this has never been done before.

Beyond the sequential world, another advantage of the partial breadth-first algorithm is that it can be parallelized [22]. This approach achieves speedups of up to four on eight processors of a shared memory system.

The rest of this paper is organized as follows: Section II gives an overview of BDDs and how they are constructed. Section III describes the partial breadth-first algorithm and other techniques for controlling the working set size. Section IV presents a performance evaluation of our implementation. Section V demonstrates the usefulness of this implementation by constructing very large BDDs for 16-bit array multipliers. Finally, Section VII summarizes this paper and offers some concluding remarks.

II. BDD OVERVIEW

A boolean expression can be represented by a complete binary tree called a binary decision tree, which is based on the expression's truth table. Fig.1(a) shows the truth table for a boolean expression and Fig.1(b) shows the corresponding binary decision tree. Each internal vertex is labeled by a variable and has edges directed toward two children: the 0-branch (shown as a dashed line) corresponds to the case where the variable is assigned 0, and the 1-branch (shown as a solid line) corresponds to the case where the variable is assigned 1. Each leaf node is labeled 0 or 1.
Each path from the root to a leaf node corresponds to a truth table entry where the value of the leaf node is the value of the function and the path corresponds to the assignment of the boolean variables.

Fig. 1. A boolean expression represented with (a) Truth table, (b) Binary decision tree, (c) Binary decision diagram. The dashed edges are 0-branches and the solid edges are 1-branches.

A binary decision diagram (BDD) is a directed acyclic graph (DAG) representation of a binary decision tree where equivalent boolean subexpressions are uniquely represented. Fig.1(c) shows the BDD representation of the binary decision tree in Fig.1(b). Since all subexpressions in a BDD are uniquely represented, a BDD can be exponentially more compact than its corresponding truth table or binary decision tree representations. One necessary condition for guaranteeing uniqueness of the BDD representation is that all the BDDs constructed must follow the same variable ordering; i.e., for any two variables $x$ and $y$, if $x$ has higher precedence than $y$ ($x < y$), then for any path that contains both $x$ and $y$, $x$ must appear before $y$ on this path. Note that the BDD size can be very sensitive to the variable ordering, where the graph size of one ordering can be exponentially more compact than the graph size of another ordering.

Before describing the basis for BDD construction, we first introduce some terminology and notation. $f|_{x \leftarrow 0}$ and $f|_{x \leftarrow 1}$ are cofactor functions of the function $f$ with respect to the boolean variable $x$, where $f|_{x \leftarrow 0}$ is equal to $f$ with the value of $x$ set to 0, and $f|_{x \leftarrow 1}$ is equal to $f$ with the value of $x$ set to 1. A reachable subgraph of a node $n$ is defined to be all the nodes that can be reached from $n$ by traversing 0 or more directed edges. BDD nodes are defined to be internal vertices of BDDs. Given a BDD $b$, the function $f$ represented by $b$ is recursively defined by

$$f = \bar{x}\, f|_{x \leftarrow 0} + x\, f|_{x \leftarrow 1} \qquad (1)$$

where $x$ is the variable corresponding to $b$'s root node and the cofactor function $f|_{x \leftarrow 0}$ is recursively defined by the reachable subgraph of $b$'s 0-branch child.
Similarly, $f|_{x \leftarrow 1}$ is recursively defined by the reachable subgraph of the BDD's 1-branch child.

A. Basis for BDD Construction

BDD construction is a memoization-based dynamic programming algorithm. Due to the large number of distinct subproblems, instead of a memoization table, a cache known as the computed cache is used to record the result of each subproblem. Given a variable ordering and two BDDs $f$ and $g$, the resulting BDD $r$ of a boolean operation $f \;\mathrm{op}\; g$ is constructed based on the Shannon expansion

$$r = f \;\mathrm{op}\; g = \bar{\tau}\,\bigl(f|_{\tau \leftarrow 0} \;\mathrm{op}\; g|_{\tau \leftarrow 0}\bigr) + \tau\,\bigl(f|_{\tau \leftarrow 1} \;\mathrm{op}\; g|_{\tau \leftarrow 1}\bigr) \qquad (2)$$

where $\tau$ is the variable (top variable) with the highest precedence among all the variables in $f$ and $g$, and $f|_{\tau \leftarrow 0}$, $f|_{\tau \leftarrow 1}$, $g|_{\tau \leftarrow 0}$, and $g|_{\tau \leftarrow 1}$ are the corresponding cofactor functions of $f$ and $g$. In the top-down expansion phase, this Shannon expansion process repeats recursively following the given variable ordering for all the boolean variables in $f$ and $g$. The base case (also called the terminal case) of this recursive process is when the operation can be trivially evaluated. For example, the boolean operation $f \wedge 1$ is a terminal case because it can be trivially evaluated to $f$. Similarly, $f \wedge 0$ is also a terminal case. At the end of the expansion phase, there may be unreduced subexpressions like $\bar{x}\,h + x\,h$. Thus, in order to ensure uniqueness, a bottom-up reduction phase is necessary to reduce expressions

like $\bar{x}\,h + x\,h$ to $h$. This reduction phase also needs to ensure that each BDD node created is unique.

Fig.2 illustrates the Shannon expansion (Equation 2) for the operation $r = f \;\mathrm{op}\; g$. On the left side of this figure, the operation is represented with an operator node which refers to BDD representations of $f$ and $g$ as operands. The right side of this figure shows the Shannon expansion of this operation with respect to the variable $\tau$. Further expansion of operator nodes can be performed in any order. In particular, the depth-first construction always expands the operator node with the greatest depth. Note that the depth-first algorithm does not explicitly store the operations as operator nodes. Instead, the operation is implicitly stored on the stack as arguments to the recursive calls. In the breadth-first construction, the Shannon expansion is performed top-down from the variables with the highest to the lowest precedence so that operations with the same top variable are expanded together. The reduction phase is performed bottom-up in reverse order. Thus, all operations with the same top variable are reduced at the same time.

Fig. 2. Shannon Expansion: the dashed edge represents the 0-branch of a variable and the thick solid edge represents the 1-branch.

For the rest of this paper, we will refer to boolean operations issued by a user of a BDD package as the top level operations to distinguish them from operations generated internally by the Shannon expansion process.

B. Memory Overhead and Access Locality

BDD construction is often memory intensive, especially when large graphs are involved. It not only requires a lot of memory, it also requires frequent accesses to many small data structures (the node size is typically 16 bytes on 32-bit machines). The depth-first BDD construction has poor memory behavior because of irregular control flow and memory access patterns.
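The depth-first construction of Section II.A can be sketched in a few lines. The following is a minimal illustration in Python, not the code of any package discussed in this paper: the class name `BDD`, the integer node ids, and the method names are all assumptions made for the sketch, a smaller variable index stands for a higher precedence, and only the simplest terminal case (both operands constant) is recognized. The recursion applies Equation 2, the `computed` dictionary plays the role of the computed cache, and `mk` performs the reduction step with a unique table.

```python
import operator


class BDD:
    """Tiny reduced ordered BDD manager (illustrative names, not a real package API)."""

    def __init__(self):
        self.nodes = [None, None]   # ids 0 and 1 are the terminal nodes
        self.unique = {}            # (var, low, high) -> node id
        self.computed = {}          # (op, f, g) -> result id (the computed cache)

    def mk(self, var, low, high):
        """Reduction: drop a redundant test and share structurally equal nodes."""
        if low == high:
            return low
        key = (var, low, high)
        if key not in self.unique:
            self.nodes.append(key)
            self.unique[key] = len(self.nodes) - 1
        return self.unique[key]

    def var(self, index):
        """BDD of a single variable; smaller index = higher precedence."""
        return self.mk(index, 0, 1)

    def top(self, f):
        return self.nodes[f][0] if f > 1 else float("inf")

    def cofactor(self, f, var, bit):
        return self.nodes[f][1 + bit] if self.top(f) == var else f

    def apply(self, op, f, g):
        """Depth-first Shannon expansion followed by bottom-up reduction."""
        if f <= 1 and g <= 1:                  # terminal case: both constants
            return op(f, g)
        key = (op, f, g)
        if key in self.computed:               # already solved subproblem
            return self.computed[key]
        tau = min(self.top(f), self.top(g))    # top variable of f and g
        low = self.apply(op, self.cofactor(f, tau, 0), self.cofactor(g, tau, 0))
        high = self.apply(op, self.cofactor(f, tau, 1), self.cofactor(g, tau, 1))
        result = self.mk(tau, low, high)
        self.computed[key] = result
        return result

    def evaluate(self, f, assignment):
        """Follow Equation 1 down to a terminal for one variable assignment."""
        while f > 1:
            var, low, high = self.nodes[f]
            f = high if assignment[var] else low
        return f


bdd = BDD()
a, b = bdd.var(0), bdd.var(1)
conj = bdd.apply(operator.and_, a, b)
```

Because every node is created through the unique table, two constructions of the same function yield the same node id; for instance `bdd.apply(operator.and_, b, a)` returns the same id as `conj`.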
The control flow of the depth-first construction is irregular because the recursive expansion can terminate at any time when a terminal case is detected or when the operation is cached in the computed cache. The memory access pattern is irregular because a BDD node can be accessed due to expansion on any of its many parents; and, since the BDD is traversed in the depth-first manner, expansions on the parents are scattered in time. The performance impact of the depth-first algorithm's poor memory locality is especially severe for BDDs larger than the physical memory.

Recently, there has been much interest in BDD construction based on breadth-first traversal [14, 15, 1, 10, 18]. In a breadth-first traversal, the expansion phase expands operations one variable at a time with all the operations of the same variable expanded together. Furthermore, during the reduction phase, all the new BDD nodes of the same variable are created together. The breadth-first construction exploits this structured access by clustering nodes (for both BDD and operator nodes) of the same variable together in memory with specialized node managers.

Despite its better memory locality, the breadth-first construction has much larger memory overhead in comparison to the depth-first construction. The number of operations that the depth-first construction keeps track of at any given time is the depth of the recursion, which is at most the number of variables. Since the number of variables is typically small, the depth-first construction does not require much memory to store these operations. In contrast, for each top level operation, the breadth-first construction will keep all operations generated by Shannon expansion of this top level operation until the result for this top level operation is constructed. Since the number of operations can be quadratic in the size of the BDD operands, the breadth-first approach can incur a large memory overhead.
Thus, on some applications where the depth-first construction fits in physical memory while the breadth-first construction does not, the performance of the breadth-first construction can be significantly worse due to page faults.

III. OUR APPROACH TO BDD CONSTRUCTION

Since BDD construction involves a large number of accesses of many small data structures, localizing the memory access pattern to bound the working set size is critical because good memory access locality results in good hardware cache locality and fewer page faults. This section introduces three techniques to control the working set size by limiting memory overhead and by improving both temporal and spatial locality. These are followed by a brief discussion on how these techniques can work together with variable reordering algorithms.

A. Partial Breadth-First Construction

For the pure breadth-first construction (which normally has good memory locality), if the BDD operands do not fit in physical memory, then the pages of operator nodes swapped in during the expansion phase will be swapped out by the time the reduction phase takes place. Furthermore, as described in Section II.B, breadth-first construction can incur a large memory overhead. To overcome these drawbacks while bounding the memory overhead, we introduce partial breadth-first expansion based on context switching. Within each evaluation context, the breadth-first expansion is used until a fixed evaluation threshold is reached. Upon reaching this threshold, the current context is pushed onto a context stack and a new child context is started. The remaining operations of the parent context are partitioned into smaller groups and the child context evaluates these operations one group at a time. This process repeats each time the

current evaluation context reaches its threshold. By keeping the evaluation threshold to be a small fraction of the available physical memory, we can bound the number of BDD nodes and compute cache nodes created and accessed and thus control the working set size. Note that by setting the evaluation threshold to 1, this algorithm degenerates to depth-first construction. Similarly, by setting the evaluation threshold to ∞, this algorithm is identical to pure breadth-first construction.

Fig.3(a) shows an example of a context switch. In this figure, the top triangle denotes the graph of the initial expansion. Upon reaching the evaluation threshold, the remaining unexpanded operations are divided into two partitions (shown as two dashed rectangles) and the new child context is started. This new child context continues to expand on the first partition. After the child context finishes building BDD results for the first partition, it continues to expand on the second partition as shown in Fig.3(b). Note that expansion of these two partitions might share some operations in common. For these common operations, the expansion of the second partition can benefit from the results computed from the expansion of the first partition via the compute cache. However, since the compute cache is not a complete cache, some common operations may need to be recomputed.

This figure also depicts how the partial breadth-first construction can reduce memory overhead. The operator nodes created from expanding the first partition do not need to be kept during the expansion of the second partition. In comparison, the pure breadth-first construction (shown in Fig.3(c)) needs to keep all the operator nodes until after the reduction phase.

Fig. 3. A Context Switch Example.
(a) Upon reaching the evaluation threshold, current unexpanded operations are divided into two partitions (shown as two dashed rectangles) and the new child context continues to expand on the first partition. (b) After the reduction for the first partition, this child context expands on the second partition. (c) Pure breadth-first expansion is shown for comparison.

Other than the memory locality and the memory overhead, the evaluation threshold can also impact the effectiveness of the compute cache. In the pure breadth-first traversal, the expanded operator nodes must be kept until after the reduction phase. This feature effectively results in a complete cache within an expansion phase. Similarly for the partial breadth-first approach, expansion within each evaluation context maintains a complete cache. Thus, a larger evaluation threshold results in a larger and more complete cache for the current evaluation context at the cost of higher memory overhead.

The rest of this section formally describes this partial breadth-first algorithm. Fig.4 shows the top level procedure pbf-op() and a helper function for this partial breadth-first construction. For each variable, there is an expansion queue and a reduction queue. An expansion queue queues the operations of the same variable to be Shannon expanded during the expansion phase. A reduction queue queues the operations of the same variable to be reduced in the reduction phase. The top level procedure pbf-op() builds the result BDD by repeatedly doing the Shannon expansion (line 3) and reduction (line 4) until there are no more operations in the top context (lines 5 to 8) and until there are no more evaluation contexts on the context stack (lines 9 to 11). Procedure preprocess-op() first determines whether or not the operation is a terminal case or is cached (lines 13 to 15). If not, this operation is added to its top variable's expansion queue (lines 17 and 18) to indicate that further Shannon expansion is necessary for this operation.
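The evaluation-context mechanism of this section can be sketched as follows. This is a simplified illustration in Python under several stated assumptions, not the implementation evaluated in this paper: the Python call stack stands in for the explicit context stack, the threshold counts expanded operations rather than bytes, remaining operations are partitioned into groups of `eval_threshold` operations, the shared cache stores only finished results (each context keeps its own complete `pending` table of operator nodes), and only constant operands are treated as terminal cases. All class and method names are hypothetical.

```python
import operator

INF = float("inf")   # "no variable" for terminals, "no threshold" for pure breadth-first


class OpNode:
    """Operator node: the pending operation `f op g` with top variable `var`."""

    def __init__(self, op, f, g, var):
        self.op, self.f, self.g, self.var = op, f, g, var
        self.branch = [None, None]   # Shannon expansion results (BDD id or OpNode)
        self.result = None           # BDD id, set by the reduction phase

    def key(self):
        return (self.op, self.f, self.g)


class PBF:
    def __init__(self, eval_threshold=INF):
        self.eval_threshold = eval_threshold
        self.nodes = [None, None]    # ids 0 and 1 are the terminals
        self.unique = {}             # (var, low, high) -> BDD id
        self.cache = {}              # (op, f, g) -> finished BDD id

    def mk(self, var, low, high):
        if low == high:
            return low
        key = (var, low, high)
        if key not in self.unique:
            self.nodes.append(key)
            self.unique[key] = len(self.nodes) - 1
        return self.unique[key]

    def var(self, index):
        return self.mk(index, 0, 1)  # smaller index = higher precedence

    def top(self, f):
        return self.nodes[f][0] if f > 1 else INF

    def cofactor(self, f, var, bit):
        return self.nodes[f][1 + bit] if self.top(f) == var else f

    def preprocess(self, op, f, g, pending, expand):
        """Return a BDD id, or an operator node queued for expansion."""
        if f <= 1 and g <= 1:
            return op(f, g)                      # terminal case
        key = (op, f, g)
        if key in self.cache:
            return self.cache[key]               # finished in some context
        if key not in pending:                   # complete cache within a context
            node = OpNode(op, f, g, min(self.top(f), self.top(g)))
            pending[key] = node
            expand.setdefault(node.var, []).append(node)
        return pending[key]

    def evaluate(self, group):
        """Evaluate one group of operator nodes in a fresh evaluation context."""
        pending, expand, reduce_q, processed = {}, {}, {}, 0
        for node in group:
            if node.key() in self.cache:         # solved meanwhile by a sibling group
                node.result = self.cache[node.key()]
            else:
                pending[node.key()] = node
                expand.setdefault(node.var, []).append(node)
        while expand:                            # expansion: highest precedence first
            var = min(expand)
            queue = expand.pop(var)
            for i, node in enumerate(queue):
                for bit in (0, 1):
                    node.branch[bit] = self.preprocess(
                        node.op, self.cofactor(node.f, var, bit),
                        self.cofactor(node.g, var, bit), pending, expand)
                reduce_q.setdefault(var, []).append(node)
                processed += 1
                if processed >= self.eval_threshold:     # context switch
                    rest = queue[i + 1:] + [n for q in expand.values() for n in q]
                    expand.clear()
                    for start in range(0, len(rest), self.eval_threshold):
                        self.evaluate(rest[start:start + self.eval_threshold])
                    break
        for var in sorted(reduce_q, reverse=True):       # reduction: bottom-up
            for node in reduce_q[var]:
                low, high = (b.result if isinstance(b, OpNode) else b
                             for b in node.branch)
                node.result = self.mk(var, low, high)
                self.cache[node.key()] = node.result

    def apply(self, op, f, g):
        root = self.preprocess(op, f, g, {}, {})
        if isinstance(root, OpNode):
            self.evaluate([root])
            return root.result
        return root

    def value(self, f, assignment):
        while f > 1:
            var, low, high = self.nodes[f]
            f = high if assignment[var] else low
        return f
```

With `PBF(1)` every context expands a single operation before switching, which behaves like the depth-first construction; with the default `PBF()` no context switch ever happens and the whole operation is evaluated in one breadth-first context.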
The queued operation is also inserted into the compute cache (line 19) to avoid expanding redundant operations in the future. This procedure returns either the BDD result (for the terminal case and for the case when the cached result is a BDD) or an operator node. If an operator node is returned, this operator node's field opnode.result will contain the result BDD after this operator node is processed in the reduction phase.

pbf-op(op, f, g)
 1  opnode ← preprocess-op(op, f, g)
 2  if opnode is a BDD node, return opnode.
 3  call expansion()
 4  call reduction()
 5  if top context of the context stack has operations, then
 6      take a group of operations from the top context
 7      add each operation to its top variable's expansion queue
 8      goto line 3 and repeat until top context is empty
 9  if context stack is not empty,
10      pop the top context and use it as the current context
11      goto line 3 and repeat until context stack is empty
12  return opnode.result

preprocess-op(op, f, g)
13  if terminal case, return simplified result
14  if the operation (op, f, g) is in compute cache,
15      return result found in cache
16  opnode ← (op, f, g)
17  x ← top variable of f and g
18  add opnode to x's expansion queue
19  insert opnode into the compute cache
20  return opnode

Fig. 4. Partial Breadth-First Construction: top level procedure and a helper function

Fig.5 shows the expansion phase. This top-down expansion phase processes operations queued from the variable with the highest to the lowest precedence. Here, all the operations of the same variable are Shannon expanded together (lines 3 to 7). The branch0 and the branch1 fields of an operator node are used to store the results of Shannon expansion, and as described earlier, these results returned by the procedure preprocess-op()

can be either a BDD node or an operator node. In the latter case, the procedure preprocess-op() would have queued the new operator nodes to be processed by the expansion phase later. The variable nopsprocessed is used to track the size of the current evaluation context and when it exceeds a constant evaluation threshold evalthreshold, the current context is pushed onto the context stack and a new child context is started (lines 9 to 13).

expansion()
 1  nopsprocessed ← 0
 2  for each variable x in the current evaluation context, from the highest to lowest precedence
 3      for each node opnode in x's expansion queue
 4          (op, f, g) ← opnode
 5          opnode.branch0 ← preprocess-op(op, f|x←0, g|x←0)
 6          opnode.branch1 ← preprocess-op(op, f|x←1, g|x←1)
 7          add opnode to variable x's reduce queue
 8          nopsprocessed++
 9          if (nopsprocessed > evalthreshold)
10              partition the remaining operators into small groups.
11              push current context with these operation groups onto the context stack
12              start a new evaluation context
13              return

Fig. 5. Partial Breadth-First Construction: expansion phase

Fig.6 shows the reduction phase. This bottom-up reduction algorithm is the same as the pure breadth-first construction's reduction phase, where Shannon expanded operations are processed together one variable at a time, starting from the variable with the lowest precedence and moving upwards to the variables with the highest precedence. The results from the children are obtained in lines 4 to 11. Lines 12 to 19 perform the reduction and ensure the result is unique. The result of a reduction is stored in the opnode.result field of an operator node (lines 13 and 19).

B. Memory Management

As in breadth-first BDD algorithms, specialized node managers are the key factors in exploiting structured access in the partial breadth-first approach. In our implementation, each variable is associated with a BDD-node manager as in the breadth-first algorithm of [18]. Each variable's BDD-node manager clusters BDD nodes of the same variable by allocating memory in terms of blocks and allocates BDD nodes contiguously within each block.
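The bookkeeping of such a per-variable node manager can be illustrated with a short Python sketch. It is only an analogy for the block-based layout described above: Python does not expose real memory placement, the block size of four nodes is arbitrary, children are stored as plain integer handles, and the names `NodeManager`, `allocate`, and `nodes_per_block` are hypothetical. Each block is one contiguous `array`, and nodes are placed in consecutive slots of the newest block.

```python
from array import array


class NodeManager:
    """Holds the (low, high) child handles of one variable's nodes in fixed-size blocks."""

    def __init__(self, nodes_per_block=4):
        self.nodes_per_block = nodes_per_block
        self.blocks = []   # each block: one contiguous array of 2 * nodes_per_block ints
        self.count = 0     # number of nodes allocated so far

    def allocate(self, low, high):
        """Place a node in the next free slot, opening a new block when needed."""
        block_index, slot = divmod(self.count, self.nodes_per_block)
        if block_index == len(self.blocks):
            self.blocks.append(array("q", [0] * (2 * self.nodes_per_block)))
        block = self.blocks[block_index]
        block[2 * slot], block[2 * slot + 1] = low, high
        self.count += 1
        return self.count - 1   # index of the node within this manager

    def __iter__(self):
        """Visit this variable's nodes in allocation order by walking the blocks."""
        for index in range(self.count):
            block_index, slot = divmod(index, self.nodes_per_block)
            block = self.blocks[block_index]
            yield index, block[2 * slot], block[2 * slot + 1]


manager = NodeManager(nodes_per_block=4)
handles = [manager.allocate(low, low + 1) for low in range(5)]
```

Walking a manager visits only nodes of one variable and touches its blocks in order, which is the access pattern that a per-variable traversal (for example during rehashing) relies on.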
We further extend this clustering concept to using one operator-node manager for each variable. With this design, we not only benefit from good locality of node clustering, we also eliminate the need for having both the expansion and the reduction queues, since we can access all the operator nodes of each variable by simply traversing memory blocks of each operator-node manager. Furthermore, we associate one compute cache and one unique table per variable. Thus, cache lookup in the expansion phase and the BDD unique table lookup in the reduction phase will only traverse nodes of the same variable. Since nodes of the same variables are clustered by the node managers, this results in better memory locality.

Combined with per-variable node managers, we can perform rehashing for each variable independently by traversing the memory blocks of the corresponding node manager. Again, this rehashing approach has better memory locality than the traditional approach, which traverses the hash table.

reduction()
 1  for each variable x in the current evaluation context, from the lowest to highest precedence
 2      for each node opnode in x's reduce queue
 3          (op, f, g) ← opnode
 4          if opnode.branch0 is a BDD,
 5              res0 ← opnode.branch0
 6          else
 7              res0 ← opnode.branch0.result
 8          if opnode.branch1 is a BDD,
 9              res1 ← opnode.branch1
10          else
11              res1 ← opnode.branch1.result
12          if (res0 == res1)
13              opnode.result = res0
14          else
15              b ← BDD node (x, res0, res1)
16              opnode.result ← lookup(unique table, b)
17              if BDD node b does not exist in the unique table,
18                  insert b into the unique table
19                  opnode.result ← b

Fig. 6. Partial Breadth-First Construction: reduction phase

C. Garbage Collection

No BDD package is complete without a good garbage collector. External users of a BDD package can free references to exported BDDs and since BDD construction is a memory intensive application, reusing the space of unreachable BDD nodes is important. Most BDD packages use reference counting and maintain a free list of unreferenced nodes. This approach has several drawbacks.
Most notably, it has poor memory locality because the free-list approach can scatter newly created BDD nodes in memory, thus reversing the clustering effects of specialized node managers. In our implementation, a mark-and-sweep garbage collector with memory compaction is used. Unlike a copying garbage collector, our garbage collection algorithm performs memory compaction without requiring any additional memory. This compaction algorithm is stable; i.e., the nodes' linear ordering is maintained. This property allows nodes which are allocated nearby in time to stay together. This can help access locality because nodes allocated together are likely to be accessed together in the future.

Our garbage collection algorithm consists of two phases, both of which are breadth-first traversals from the variable with

highest precedence to the variable with the lowest precedence. The first phase marks and compacts all the reachable nodes and the second phase fixes all the references and rehashes these nodes.

Fig.7 shows the algorithm for the mark-and-compact phase. Line 1 marks all the roots of exported BDDs to indicate that these nodes and their descendants are all the nodes that we need to keep. The top-down breadth-first marking of descendants is performed by traversing BDD nodes in each node manager (lines 2 to 6). In this algorithm, n denotes the marked BDD node that is being processed and new denotes the next target location for compaction. For each marked BDD node n, its children are marked (line 7). Line 8 establishes the new location for node n by setting n's forward field. Lines 9 and 10 copy the relevant information in n to this new target location. Line 11 advances new to the next node in the node manager mgr as the new target location. Line 12 advances n to the next marked node in this node manager. This process repeats until we have processed all the marked nodes in this node manager mgr; after which, all the marked nodes are compacted into memory blocks before new and thus all the blocks after new are marked as free blocks to be freed after the second phase (line 13).

mark-and-compact()
 1  mark all the root nodes of exported BDDs we need to keep.
 2  for each variable x from the highest to lowest precedence,
 3      mgr ← x's BDD-node manager
 4      n ← first marked node in manager mgr
 5      new ← first node in manager mgr
 6      while n is still in node manager,
 7          mark children n.left and n.right
 8          n.forward ← new
 9          new.left ← n.left
10          new.right ← n.right
11          new ← ManagerNextNode(mgr, new)
12          n ← ManagerNextMarkedNode(mgr, n)
13      put memory blocks for all the nodes after new into mgr.freeblocks.

Fig. 7. Garbage Collection's Mark and Compact Phase. This phase marks nodes that we want to keep and at the same time compacts the memory to avoid memory fragmentation.

Fig.8 shows the second phase of the garbage collection algorithm. Initially, all external references are updated (lines 2 and 3).
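A minimal Python sketch of the two-phase collector (mark-and-compact as in Fig. 7, followed by the fix-and-rehash phase of Fig. 8) is given below. It is an illustration under assumptions rather than the actual implementation: each node manager is modeled as a Python list of slot objects in allocation order, a separate `forward` field is used as in the explanatory pseudocode, the unique tables are plain dictionaries keyed by child identity, and the names `Node`, `collect`, and `evaluate` are hypothetical.

```python
class Node:
    """A slot in a node manager; terminals have var None and never move."""

    def __init__(self, var, left=None, right=None, value=None):
        self.var, self.left, self.right, self.value = var, left, right, value
        self.marked = False
        self.forward = self if var is None else None


def collect(managers, roots):
    """Compact live nodes toward the front of each manager.

    managers: {var: [Node, ...]}, smaller var = higher precedence.
    roots: list of exported nodes, updated in place.
    Returns the rebuilt per-variable unique tables.
    """
    order = sorted(managers)
    for root in roots:                       # roots of exported BDDs
        root.marked = True
    kept = {}
    for var in order:                        # phase 1: mark and compact, top-down
        slots, new = managers[var], 0
        for node in slots:
            if not node.marked:
                continue
            node.left.marked = node.right.marked = True
            target = slots[new]              # next target location (never after node)
            node.forward = target
            target.left, target.right = node.left, node.right
            new += 1
        kept[var] = new
    roots[:] = [root.forward for root in roots]   # phase 2 starts with external refs
    unique = {}
    for var in order:                        # fix references and rehash, top-down
        slots = managers[var]
        table = unique[var] = {}
        for node in slots[:kept[var]]:
            node.left, node.right = node.left.forward, node.right.forward
            node.marked = False
            table[(id(node.left), id(node.right))] = node
        del slots[kept[var]:]                # free the slots after the compacted part
    return unique


def evaluate(node, assignment):
    while node.var is not None:
        node = node.right if assignment[node.var] else node.left
    return node.value


T0, T1 = Node(None, value=0), Node(None, value=1)
c, d = Node(2, T0, T1), Node(2, T1, T0)      # d is garbage
a, b = Node(1, T0, c), Node(1, c, T1)        # a is garbage
g, r = Node(0, a, b), Node(0, b, T1)         # g is garbage, r is exported
managers = {0: [g, r], 1: [a, b], 2: [d, c]}
roots = [r]
inputs = [(x, y, z) for x in (0, 1) for y in (0, 1) for z in (0, 1)]
before = [evaluate(roots[0], bits) for bits in inputs]
unique = collect(managers, roots)
```

In this example each manager holds one dead slot followed by one live node, so the live node's contents slide into the first slot and every manager shrinks to a single slot while the exported function is unchanged. The compaction is stable because the target slot never lies after the node being moved.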
After the external references are updated, this phase proceeds in a top-down breadth-first manner to fix each BDD node's children references (lines 7 and 8) and reinsert this node back into the unique table (line 9). After all the references of a BDD-node manager are updated, its associated free blocks are freed (line 10).

For the purpose of explanation, the garbage collection algorithm shown uses an additional field forward for each BDD node. In the actual implementation, each BDD node's hashnext field, used for chained hashing, is also used as the forward field during the garbage collection. This dual use of the same field is only correct if hash insertion of a node does not occur until after all the references to this node are fixed. This condition is guaranteed by first fixing external references (lines 2 and 3 in Fig.8) and then performing the top-down breadth-first traversal, which updates all the parents' references before inserting a node into the hash table. Thus, this two-phase breadth-first garbage collection algorithm is able to perform memory compaction without requiring any additional memory.

fix-and-rehash()
 1  clear all unique tables
 2  for each root node of exported BDDs
 3      update root nodes of exported BDDs to the forwarded location
 4  for each variable x from the highest to lowest precedence,
 5      mgr ← x's BDD-node manager
 6      for each node n in manager mgr
 7          n.left ← n.left.forward
 8          n.right ← n.right.forward
 9          insert n into variable x's unique table
10      free all memory blocks in mgr.freeblocks.

Fig. 8. Garbage Collection's Fix and Rehash Phase. This phase updates all the children references and reinserts the BDD nodes into unique tables.

D. Variable Reordering

Dynamic variable reordering is an important part of BDD construction. Even though we have not yet implemented dynamic variable reordering, the following is an outline of potential problems and their solutions.

1. Some variable reordering algorithms require reference counts.
Since garbage collection is generally invoked right before variable reordering, we can compute reference counts during the mark-and-compact phase of garbage collection (line 1 and line 7 of Fig.7).

2. Dynamic variable reordering can counteract the clustering effects achieved by the per-variable memory managers [16]. The solutions proposed in [16] should be directly applicable to our approach.

IV. PERFORMANCE EVALUATION

In this section, we present a performance evaluation of our approach. The test cases are the ISCAS85 benchmarks [3], a collection of ten circuits used in industry. The variable ordering we used is generated by order dfs in SIS [20]. To get more test cases, we generate array multiplier circuits of different sizes based on carry ripple adders [6]. For the rest of this section, we shall refer to this multiplier circuit as MCRA (Multiplier based on Carry Ripple Adders). For -bit multiplier with two operands and 2 0, the variable ordering used is

For all the test cases, to minimize memory usage, we freed the intermediate results (those that are neither inputs nor outputs of the circuit) immediately after their last reference.

In this section, we use two leading BDD packages for comparison. The first package is CAL version 2.0 from UC Berkeley, which implements the breadth-first algorithm described in [18]. The second package is CUDD version [21] from the University of Colorado at Boulder, which implements the depth-first algorithm for BDD construction. Both are the latest releases as of November. All packages are compiled with gcc using the optimization flag -O3. In this section, we will refer to our package as PBF.

For both CAL and CUDD, we used all the default settings with the exception of dynamic variable reordering features, which we disabled for two reasons. First, we have not implemented dynamic variable reordering yet. Second, turning off the dynamic reordering features removes the performance impact due to different dynamic reordering algorithms.

For the CAL package, the results we present are without its superscalarity and pipelining features [18] because of adverse performance impact. These features require decomposing all operations into a single operation type. For the multipliers, such decomposition increases the running time by up to 60% and superscalarity of 10 with automatic pipelining increases the memory usage by 30% with little (< 1%) or no performance improvement. For C2670 and C3540 from the ISCAS85 benchmarks, the results are less clear. Thus, for these two circuits, the results using superscalarity of 10 with automatic pipelining will also be included.

A. Evaluation Threshold

In this section, we examine how different evaluation thresholds impact the memory usage and running time of our approach. The system used for this evaluation is an SGI Power Challenge with 1 GBytes of physical memory. This system has 12 processors running IRIX 6.2 with 32-bit address space. Each processor is a 196MHz MIPS R. We perform our experiments using one processor under light load conditions where our processes are the only active processes. Timing results reported are measured CPU time.
In this study, the evaluation threshold ranges from 8 KBytes to ∞, where the ∞ case corresponds to the pure breadth-first case. The results from very small cases ( 10 seconds CPU time and 10 MBytes memory usage) are omitted. The results in Fig. 9 show that in general, the running time varies by about 10 to 20%, except for C2670. For C2670, there is a speedup of 2 for the ∞ case vs. the cases with smaller evaluation thresholds. This is most likely caused by the fact that a larger evaluation threshold results in a more complete cache (as discussed in Section III.A). This is substantiated by the fact that the ∞ case has a total of 23 million Shannon expansions, while the cases with smaller evaluation thresholds have over 135 million Shannon expansions. The results in Fig. 9 also show that different evaluation thresholds can have an impact on the memory usage; e.g., for C2670, the ratio between maximum and minimum memory usage is In general, this memory usage difference may be the key factor in whether or not an application fits into physical memory and thus can have a significant effect on the running time.

[Fig. 9 table: CPU time (seconds) / memory usage (MBytes) for each evaluation threshold (KBytes) on C2670, C3540, MCRA14, and MCRA; the numeric entries are not recoverable.]
Fig. 9. Effects of evaluation threshold. The ∞ case corresponds to the case with pure breadth-first.

Note that overall, the evaluation threshold of 4096 KBytes strikes a reasonable balance between memory usage and running time. Since 4096 KBytes is 1/256 of the physical memory size (1 GBytes), for the rest of the performance evaluation in this paper, we choose the evaluation threshold for our package to be 1/256 of the physical memory size.

B. Performance Comparison: No Paging

This section compares our approach (PBF) to CAL and CUDD when the test cases fit in physical memory. The system used for evaluation is the same as in the previous section. The memory usage limit is set to 1 GBytes. The evaluation threshold chosen for our package is 4 MBytes, which is 1/256 of the physical memory size of 1 GBytes.

Fig. 10 shows the results of this study. The results for smaller cases are shown in the top half of this table. The results for the C6288 and C7552 cases are not available because they both exceeded the memory limit. Note that for CAL, C2670 and C3540 have better performance using CAL's superscalarity and pipelining feature at the cost of 71% to 84% higher memory usage. These results are marked with in Fig. 10.

The results show that for the larger cases, PBF consistently outperforms both CAL and CUDD, with speedups ranging from 1.10 (MCRA15) to 1.60 (C3540) in comparison to the best of CAL and CUDD. For the smaller cases, PBF is slower. However, since these smaller cases take less than 2 seconds to finish, performance differences among the different approaches are less significant.

As for memory usage, PBF's memory usage tracks very closely with that of CUDD's depth-first implementation. For small cases ( 10 MBytes), PBF's memory usage is higher due to the memory overhead of per-variable data structures. However, for large cases like C3540 and the MCRA circuits, PBF's memory usage is actually slightly smaller than CUDD's memory usage. In contrast, CAL's memory usage is up to 1.6 times PBF's memory usage (MCRA15).

[Fig. 10 table: CPU time (seconds) and memory (MBytes) for PBF, CAL, and CUDD on the ISCAS85 and MCRA circuits; the numeric entries are not recoverable. C6288 and C7552 are listed as n/a for all three packages.]
Fig. 10. Performance comparison when the test cases fit in physical memory. Both C6288 and C7552 cases exceeded the 1 GBytes memory limit and thus the results are not available. Numbers marked with are CAL's results using superscalarity of 10 with automatic pipelining.

C. Performance Comparison: Paging

This section compares our approach (PBF) to CAL and CUDD when the test cases do not fit into physical memory. We repeated the experiments on a smaller system, a 200MHz Pentium Pro with 256 KBytes L2 cache and 128 MBytes of 60ns EDO DRAM. This system is running Linux with 32-bit address space. All measurements were obtained under single user mode. Timing results reported are elapsed time, and the time limit is set to 24 hours of elapsed time. For this experiment, we chose the test cases which use more memory than the available physical memory (128 MBytes).

Fig. 11 shows that our approach (PBF) consistently outperforms both CAL and CUDD, with speedups ranging from 1.51 (C2670) to 13.2 (MCRA14) in comparison to the best of CAL and CUDD. The significant speedup for MCRA14 is mainly due to the fact that our approach's memory usage for this case is only slightly more than the available physical memory. This case demonstrates the importance of limiting the memory overhead. Another interesting point to note is that both the PBF (our approach) and the CAL (breadth-first) approach have much better paging locality than the CUDD (depth-first) approach. For the C3540 circuit, this locality resulted in an order of magnitude difference in performance.

V. ARRAY MULTIPLIERS

In this section, we demonstrate the effectiveness of our techniques by building very large output BDDs of two types of integer multiplication circuits. The first type is based on C6288 from the ISCAS85 benchmark. C6288 is a 16-bit array multiplier using carry save adders. Based on its design, we derived corresponding circuits from 1 to 15 bits. The second type is an array multiplier with carry ripple adders (MCRA) as in Section IV.
In this study, we characterize both multipliers from 1 to 16 bits. The system used for this evaluation is an SGI Power Challenge with 4 GBytes of physical memory. This system has 16 processors running IRIX 6.2 with 64-bit address space. Each processor is a 194MHz MIPS R We perform our experiments under dedicated mode using one processor. Note that for BDD applications, memory usage on 64-bit machines is generally twice that of 32-bit machines.

Fig. 12 shows the results for this experiment. Fig. 13 plots the memory usage of the output BDDs and the memory usage for constructing the C6288 and MCRA circuits in a semi-log graph. Note that the output BDD sizes grow exponentially by a factor of about 2.87 per bit of word size. Fig. 13 also shows that other than the initial overhead, which affects the memory usage of smaller circuits, the total memory usage grows at the same rate as the output BDDs' memory usage. This plot is a semi-log plot to clearly show the numbers for small cases. However, it is worth noting that even though the total memory usage for the 16-bit multiplier is about a factor of three to four over the size of the output BDDs, this semi-log plot deemphasizes this difference.

To better understand the memory usage, we analyze the BDD construction for building the C6288 circuit. The maximum memory usage for building this circuit is 3803 MBytes. The maximum number of BDD nodes that exist simultaneously during the BDD construction process is about 110 million (3352 MBytes). To accommodate these BDD nodes, the unique tables have a combined total of 48 million bins (366 MBytes). Thus the memory overhead of the operator nodes, the compute cache, and other auxiliary data structures is 85 MBytes, which is only 2.2% of the total memory usage. This result demonstrates that our approach has very little memory overhead. As far as we know, this is the first time that the entire C6288 circuit has been built using conventional BDD representations.
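The overhead figure above follows directly from the three quoted sizes, and the short check below reproduces it. The per-node and per-bin sizes are derived values rather than numbers stated in the text; they assume that 1 MByte is 2^20 bytes and treat the quoted "about 110 million" and "48 million" counts as exact.

```python
# Figures quoted in the text for building the 16-bit C6288 circuit.
TOTAL_MB = 3803          # maximum memory usage
BDD_NODES_MB = 3352      # about 110 million simultaneously live BDD nodes
UNIQUE_TABLES_MB = 366   # about 48 million unique-table bins

BYTES_PER_MB = 2 ** 20   # assumption: the text does not define MByte


def overhead_mb(total, nodes, tables):
    """Memory left for operator nodes, compute cache, and auxiliary data."""
    return total - nodes - tables


def overhead_percent(total, nodes, tables):
    """Overhead as a percentage of the total memory usage."""
    return 100.0 * overhead_mb(total, nodes, tables) / total


def bytes_per_item(mbytes, count):
    """Average size of one item, given a total size and an item count."""
    return mbytes * BYTES_PER_MB / count


if __name__ == "__main__":
    print(overhead_mb(TOTAL_MB, BDD_NODES_MB, UNIQUE_TABLES_MB))       # 85
    print(f"{overhead_percent(TOTAL_MB, BDD_NODES_MB, UNIQUE_TABLES_MB):.1f}%")
    print(round(bytes_per_item(BDD_NODES_MB, 110e6)))      # bytes per node
    print(round(bytes_per_item(UNIQUE_TABLES_MB, 48e6)))   # bytes per bin
```

Under these assumptions a node averages roughly 32 bytes and a bin roughly 8 bytes, which is consistent with one 64-bit pointer per bin.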
[Fig. 11 table: elapsed time (seconds) and memory (MBytes) for PBF, CAL, and CUDD on two ISCAS85 circuits and the MCRA circuits; the numeric entries are not recoverable. MCRA15 is listed as n/a for all three packages.]
Fig. 11. Performance comparison when the test cases do not fit into physical memory. The MCRA15 case exceeded the time limit of 24 hours for all three packages. CAL's numbers are measured with neither its superscalarity nor its pipelining features, to reduce the memory usage and minimize paging.

[Fig. 12 table: number of bits, output size (number of nodes), CPU time (seconds), and memory (MBytes) for C6288 and MCRA; the numeric entries are not recoverable.]
Fig. 12. Results for multiplier circuits. Note that since a 64-bit machine is used for this study, the memory usage is roughly twice as big as results on a 32-bit machine.

[Fig. 13 plot: memory (MBytes) versus number of bits for MCRA, C6288, and BDD.]
Fig. 13. Maximum memory usage for both C6288 and MCRA compared with memory usage of output BDDs (labeled as BDD).

VI. RELATED WORK

There are many research efforts based on breadth-first BDD construction [14, 15, 1, 10, 18]. However, none of these propose how to bound the memory overhead of the breadth-first construction. To address this issue, we introduced a hybrid algorithm which performs the breadth-first construction to exploit memory locality and switches to the depth-first construction when the memory overhead becomes too high [8]. This hybrid approach has the drawback that when a BDD operation is much larger than the switch-over threshold, it will be dominated by the depth-first portion and thus have poor memory behavior. Note that this hybrid is similar to the mixed depth-first and breadth-first approach that prunes unnecessary recursion branches for the quantification and relational product operations [18].

SMV's [13] BDD package uses a mark-and-sweep garbage collector without memory compaction. In [15, 1, 17], memory compaction is used to avoid memory fragmentation. These three approaches are all based on reference counting. In [15], the compaction algorithm is stable (i.e., the linear ordering of the nodes is maintained) and does not require additional memory. Our approach is quite similar to this. In [1], the garbage collection uses a free-list, and when memory fragmentation becomes high, a separate memory compaction algorithm based on copying is used. In [17], the garbage collection phase is also free-list based, and memory compaction is performed after garbage collection only when memory fragmentation becomes high. This compaction is performed by moving the newest set of live nodes to fill the holes left behind by the oldest set of dead nodes; thus, no additional memory is required. This algorithm has the advantage of moving the minimum number of nodes necessary, but it does not maintain the linear ordering of the live nodes. The performance impact of this tradeoff deserves further study. Our approach combines many attributes of the approaches above by integrating a mark-and-sweep garbage collector with a stable memory compaction without any additional memory overhead.

VII. SUMMARY AND CONCLUSIONS

This paper has introduced three techniques to control the working set size by limiting memory overhead and improving both temporal and spatial locality. First, we have introduced a novel BDD construction algorithm based on partial breadth-first expansion. This approach has the good memory locality of the breadth-first BDD construction while maintaining the low memory overhead of the depth-first approach. Second, we have described how memory management on a per-variable basis can improve spatial locality of BDD construction at all levels, including expansion, reduction, and rehashing. Finally, we have introduced a memory compacting garbage collection algorithm to avoid memory fragmentation due to unreachable BDD nodes. These algorithms work together in controlling the working set size to gain better memory access locality with little memory overhead. As these techniques exploit inherent properties of BDD construction, graph reduction techniques (like *BMD, POBDD, and variable reordering) can be incorporated into our algorithms to further expand their usefulness.

Experimental results show that by controlling the evaluation threshold, the partial breadth-first approach can reduce the memory usage by 60% in comparison to our pure breadth-first case (∞ evaluation threshold). In the performance comparison study, the results show that when the applications fit in physical memory, our approach is consistently faster for larger cases ( 2 seconds), with speedups of up to 1.6 in comparison to the leading depth-first (CUDD) and breadth-first (CAL) packages. When the applications do not fit into physical memory, our approach outperforms both CUDD and CAL by up to an order of magnitude. Furthermore, to demonstrate how our techniques can efficiently build very large graphs, we constructed the output BDDs for the C6288 multiplication circuit from the ISCAS85 benchmark and showed that the memory overhead of our approach is 2.2%. These results show that our techniques have successfully achieved better memory locality while reducing the memory overhead.

Beyond the sequential world, another advantage of the partial breadth-first algorithm is that it can be parallelized by using each processor's context stack as a distributed work queue [22]. This approach achieves speedups of up to four on eight processors of a shared memory system.
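As an illustration of the stable compaction discussed in Section VI, the following is a minimal sketch of a mark phase followed by a sliding compaction that preserves the linear order of live nodes. It is not the algorithm of this paper or of [15]: it uses a separate forwarding table (`new_index`), whereas the approaches described above need no additional memory, and it works on a single node array rather than per-variable memory managers. The node layout `(var, lo, hi)`, the negative terminal encoding, and the function names are all assumptions made for the example.

```python
FALSE, TRUE = -1, -2  # hypothetical encoding of the two terminal nodes


def mark(nodes, roots):
    """Return the set of node indices reachable from the roots."""
    live = set()
    stack = list(roots)
    while stack:
        i = stack.pop()
        if i < 0 or i in live:   # terminals and visited nodes are skipped
            continue
        live.add(i)
        _, lo, hi = nodes[i]
        stack.extend((lo, hi))
    return live


def compact(nodes, roots):
    """Slide live nodes toward index 0 in place, keeping their relative
    order, and return the updated root indices."""
    live = mark(nodes, roots)

    # Forwarding table: old index -> new index (extra memory, see lead-in).
    new_index = {}
    for old in range(len(nodes)):
        if old in live:
            new_index[old] = len(new_index)

    def forward(child):
        return child if child < 0 else new_index[child]

    # Ascending order guarantees new <= old, so no unread live node is
    # overwritten before it has been moved.
    for old in range(len(nodes)):
        if old in live:
            var, lo, hi = nodes[old]
            nodes[new_index[old]] = (var, forward(lo), forward(hi))

    del nodes[len(new_index):]   # release the tail that is now unused
    return [new_index[r] for r in roots]


# Five nodes, of which indices 1 and 3 are unreachable from the root.
nodes = [
    (2, FALSE, TRUE),   # 0: live
    (2, TRUE, FALSE),   # 1: dead
    (1, 0, TRUE),       # 2: live
    (1, 1, 0),          # 3: dead
    (0, 2, 0),          # 4: live, root
]
roots = compact(nodes, [4])
```

After compaction the three live nodes occupy indices 0 to 2 in their original relative order and the root index becomes 2.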

ACKNOWLEDGEMENT

We thank Claudson F. Bornstein and Henry R. Rowley for numerous discussions on efficient BDD implementations. We also thank Rajeev K. Ranjan for his help in setting up our performance study with the CAL package. This work utilized Silicon Graphics Power Challenge shared memory machines at both the Pittsburgh Supercomputing Center and the National Center for Supercomputing Applications at Urbana-Champaign. We are very grateful to the wonderful support staff at both supercomputing centers.

REFERENCES

[1] R. Ashar and M. Cheong. Efficient breadth-first manipulation of binary decision diagrams. In Proceedings of the International Conference on Computer-Aided Design, pages , November
[2] K. Brace, R. Rudell, and R. E. Bryant. Efficient implementation of a BDD package. In Proceedings of the 27th ACM/IEEE Design Automation Conference, pages 40-45, June
[3] F. Brglez and H. Fujiwara. A neutral netlist of 10 combinational benchmark circuits and a target translator in Fortran. In 1985 International Symposium on Circuits and Systems, June. Partially described in F. Brglez, P. Pownall, R. Hum. Accelerated ATPG and Fault Grading via Testability Analysis. In 1985 International Symposium on Circuits and Systems, pages , June
[4] R. E. Bryant. Graph-based algorithms for Boolean function manipulation. IEEE Transactions on Computers, C-35(8): , August
[5] R. E. Bryant. On the complexity of VLSI implementations and graph representations of Boolean functions with application to integer multiplication. IEEE Transactions on Computers, 40(2): , February
[6] R. E. Bryant and Y.-A. Chen. Verification of arithmetic circuits with binary moment diagrams. In Proceedings of the 32nd ACM/IEEE Design Automation Conference, pages , June
[7] Y.-A. Chen and R. E. Bryant. ACV: An arithmetic circuit verifier. In Proceedings of the International Conference on Computer-Aided Design, pages , November
[8] Y.-A. Chen, B. Yang, and R. E. Bryant. Breadth-first with depth-first BDD construction: A hybrid approach. Technical Report CMU-CS , School of Computer Science, Carnegie Mellon University,
[9] R. Drechsler, A. Sarabi, M. Theobald, B. Becker, and M. A. Perkowski. Efficient representation and manipulation of switching functions based on ordered Kronecker functional decision diagrams. In Proceedings of the 31st ACM/IEEE Design Automation Conference, pages , June
[10] A. Hett, R. Drechsler, and B. Becker. MORE: Alternative implementation of BDD-packages by multi-operand synthesis. In Proceedings of the European Design Automation Conference, pages 16-20, September
[11] J. Jain, J. Bitner, J. A. Abraham, and D. S. Fussell. Functional partitioning for verification and related problems. In Proceedings of the Brown/MIT VLSI Conference, pages , March
[12] S. Jha, Y. Lu, M. Minea, and E. M. Clarke. Equivalence checking using abstract BDDs. In 1997 IEEE Proceedings of the International Conference on Computer Design, pages , October
[13] K. L. McMillan. Symbolic Model Checking. Kluwer Academic Publishers,
[14] H. Ochi, N. Ishiura, and S. Yajima. Breadth-first manipulation of SBDD of Boolean functions for vector processing. In Proceedings of the 28th ACM/IEEE Design Automation Conference, pages , June
[15] H. Ochi, K. Yasuoka, and S. Yajima. Breadth-first manipulation of very large binary-decision diagrams. In Proceedings of the International Conference on Computer-Aided Design, pages 48-55, November
[16] R. K. Ranjan, W. Gosti, R. K. Brayton, and A. Sangiovanni-Vincentelli. Dynamic reordering in a breadth-first manipulation based BDD package: Challenges and solutions. In 1997 IEEE Proceedings of the International Conference on Computer Design, pages , October
[17] R. K. Ranjan and J. Sanghavi. CAL-2.0: Breadth-first Manipulation Based BDD Library. Public software. University of California, Berkeley, CA, June bdd/.
[18] R. K. Ranjan, J. V. Sanghavi, R. K. Brayton, and A. Sangiovanni-Vincentelli. High performance BDD package based on exploiting memory hierarchy. In Proceedings of the 33rd ACM/IEEE Design Automation Conference, pages , June
[19] R. Rudell. Dynamic variable ordering for ordered binary decision diagrams. In Proceedings of the International Conference on Computer-Aided Design, pages , November
[20] E. M. Sentovich, K. J. Singh, L. Lavagno, C. Moon, R. Murgai, A. Saldanha, H. Savoj, P. R. Stephan, R. K. Brayton, and A. L. Sangiovanni-Vincentelli. SIS: A system for sequential circuit synthesis. Technical Report UCB/ERL M92/41, Electronics Research Lab, University of California, May
[21] F. Somenzi. CUDD-2.1.2: CU Decision Diagram Package, April ftp://vlsi.colorado.edu/pub/cudd tar.gz.
[22] B. Yang and D. R. O'Hallaron. Parallel breadth-first BDD construction. In Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pages , June 1997.
