Bit-level Arithmetic Optimization for Carry-Save Additions

Kei-Yong Khoo, Zhan Yu and Alan N. Willson, Jr.
Integrated Circuits and Systems Laboratory
University of California, Los Angeles, CA 995
{khoo, zhanyu, willson}@icsl.ucla.edu

Abstract

This paper addresses the bit-level optimization of carry-save adder (CSA) arrays when the operands are of unequal wordlength (such as in some datapaths in digital signal processing circuits). We first show that by relaxing the carry-save representation to allow for more than two signals per bit position, we gain flexibility in the bit-level implementation of CSA arrays that can be exploited to achieve a more efficient design. We then propose algorithms to optimize a single adder array at the bit-level. In addition, we propose a heuristic to optimize a series of adder arrays that might occur in a datapath. We have applied our algorithms to the optimization of high-speed digital FIR filters and have achieved 5% to 3% savings (weighted cost) in the overall filter implementation in comparison to the standard carry-save implementation.

I. Introduction and Motivation

The efficient realization of addition in VLSI is an important problem because addition is a very common arithmetic operation and it is also relied on heavily to implement other operations such as multiplication and subtraction. In binary integer arithmetic, an unsigned n-bit (i.e., wordlength n) vector $X(n)$ is a concatenation of n binary signals $x_{n-1} x_{n-2} \cdots x_0$ whose weighted sum represents its arithmetic value

$$X = \sum_{i=0}^{n-1} x_i 2^i.$$

The least-significant-bit (LSB) and the most-significant-bit (MSB) are $x_0$ and $x_{n-1}$, respectively. For adding a series of operands

$$Z = A_1(n_1) + \cdots + A_K(n_K),$$

an efficient and commonly used method to realize a high-speed implementation employs carry-save addition (CSA), which defers the costly carry propagation that would otherwise follow every addition. Instead, a so-called vector merge adder (VMA) is used after the last carry-save addition to implement the carry propagation and obtain the final arithmetic value, as shown in Fig. 1. An n-bit CSA consists of n full-adders (FAs) as shown in Fig. 2. Note that instead of propagating the carry signals to the MSBs, they are saved in the carry vector C so that the adder has only one FA delay.
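The carry-save idea described above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `full_adder` and `carry_save_add` are hypothetical, and operands are modeled as plain non-negative integers rather than signal vectors. It shows that each bit position uses its own full-adder, so no carry ripples, while the pair (S, C) still holds the same arithmetic value as the three inputs.

```python
def full_adder(x, y, z):
    """One-bit full adder: returns (sum_bit, carry_bit)."""
    s = x ^ y ^ z
    c = (x & y) | (x & z) | (y & z)
    return s, c


def carry_save_add(x, y, z, n):
    """Reduce three n-bit unsigned integers to a sum vector S and a carry vector C.

    Every bit position is handled by an independent full adder, so the
    delay is one FA regardless of n. The carry out of bit i has weight
    2**(i + 1), so it is stored one position to the left in C.
    """
    s_vec = 0
    c_vec = 0
    for i in range(n):
        s, c = full_adder((x >> i) & 1, (y >> i) & 1, (z >> i) & 1)
        s_vec |= s << i
        c_vec |= c << (i + 1)
    return s_vec, c_vec


if __name__ == "__main__":
    s_vec, c_vec = carry_save_add(11, 6, 13, 4)
    # The final carry-propagating addition S + C is the job of the VMA.
    print(s_vec, c_vec, s_vec + c_vec)
```

The final `s_vec + c_vec` plays the role of the vector merge adder: it is the only place where a carry actually propagates.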
However, the arithmetic output value of the CSA is now the sum of two vectors C and S. In comparison, a carry-ripple adder reduces two n-bit vectors to one sum vector, but requires n FA delays.

This research was partially supported by NSF Grant MIP-963698 and California MICRO Grant 98-73. -783-583-X /99/$. 999 IEEE.

Fig. 1. The bit-level CSA optimization problem.

Fig. 2. An n-bit carry-save adder.

The first problem that we want to address is the optimization, at the bit-level, of the adder array shown in Fig. 1. In this paper, we will limit our scope to cases where the operands are added linearly. That is, we will not consider the use of an addition tree to reduce the depth of additions, which is desirable when the number of operands is large (say, greater than five). To be sure, if the operands' wordlengths are identical and the minimum delay is desired, then a uniform CSA array will most likely yield the best solution. However, if the operands are of different wordlengths, which occurs in many datapaths in digital signal processing, a uniform adder structure is intuitively no longer efficient.

Obviously, the adder array in Fig. 1 can be subjected to a standard combinatorial logic optimization algorithm. However, not only are arithmetic circuits inherently more difficult to optimize due to the prevalence of exclusive-or operations in their logic relations, the problem is further compounded by an output Z that is preceded by a fixed functional block (the VMA). A more feasible approach is to optimize the adder array alone without the VMA by specifying the logic relations from the inputs A to the carry-save outputs C and S. However, carry-save is a redundant representation in that there exist many combinations of S and C that represent the same arithmetic value S + C. Therefore, specifying the logic relations using a specific carry-save representation unnecessarily limits the flexibility in optimization. This implies that special techniques should be developed for arithmetic

logic optimization.

We make the key observation that the two-vector carry-save output of the adder array is really a constraint imposed by the VMA, that each of its bit positions can have no more than two signals. So long as the arithmetic value is preserved, there is really no requirement to adhere to this carry-save representation at internal array stages. It might, for instance, be advantageous for a bit to have only one output (i.e., no carry signal). In fact, we can take this flexibility one step further and relax the carry-save representation within the adder array such that some bit positions can have more than two signals (i.e., more than one carry signal) per bit!

The optimization of additions using CSA was explored in [1] but the transformations were done at the word-level. While arithmetic optimizations have been studied extensively and arithmetic properties such as commutativity and associativity have been applied in various forms [2], [3], to our knowledge, there has been no previous work that exploits the representation flexibility in optimizing the carry-save additions at the bit-level. Thus, this work also provides a useful link between word-level carry-save optimization (e.g., [1]) and bit-level optimization.

In this paper, we first show in Section II how we can exploit the flexibility of the relaxed-carry-save representation to optimize the bit-level carry-save additions. We then present an algorithm in Section III to optimize a single adder array and an algorithm in Section IV to optimize a series of cascaded adder arrays. Finally, we demonstrate the effectiveness of our algorithms in Section V by optimizing several practical fixed-coefficient digital filters. Here we achieve savings of 5% to 3% compared to uniform CSA implementations.

II. Relaxed Carry-Save Representation

To illustrate the usefulness of a relaxed-carry-save (RCS) representation, let us consider the simple carry-save addition of three operands W(4) + X(4) + Y(2) (note that Y has a shorter wordlength). Fig. 3(a) shows the straightforward CSA implementation that produces the standard carry-save vectors S and C.
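To make the three-operand example concrete, the following Python sketch models two bit-level implementations as lists of signals per bit column. It is an illustration under stated assumptions, not the authors' code: the names `standard_csa`, `relaxed_csa`, and `value` are hypothetical, the standard version is assumed to use full-adders at bits 0 and 1 and half-adders at bits 2 and 3, and the relaxed version is assumed to simply pass the bit-2 and bit-3 inputs through without half-adders. Both keep the arithmetic value W + X + Y; only the number of signals per column differs.

```python
def full_adder(a, b, c):
    """One-bit full adder: returns (sum_bit, carry_bit)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def half_adder(a, b):
    """One-bit half adder: returns (sum_bit, carry_bit)."""
    return a ^ b, a & b


def bit(v, i):
    """Bit i of the non-negative integer v."""
    return (v >> i) & 1


def value(columns):
    """Arithmetic value of per-column signal lists (list index = bit weight)."""
    return sum(sum(signals) << j for j, signals in enumerate(columns))


def standard_csa(w, x, y):
    """Assumed standard form: FAs at bits 0-1, HAs at bits 2-3."""
    cols = [[] for _ in range(5)]
    for j in (0, 1):
        s, c = full_adder(bit(w, j), bit(x, j), bit(y, j))
        cols[j].append(s)
        cols[j + 1].append(c)
    for j in (2, 3):
        s, c = half_adder(bit(w, j), bit(x, j))
        cols[j].append(s)
        cols[j + 1].append(c)
    return cols


def relaxed_csa(w, x, y):
    """Assumed relaxed form: same FAs, but bits 2-3 pass through with no HAs."""
    cols = [[] for _ in range(5)]
    for j in (0, 1):
        s, c = full_adder(bit(w, j), bit(x, j), bit(y, j))
        cols[j].append(s)
        cols[j + 1].append(c)
    for j in (2, 3):
        cols[j].extend([bit(w, j), bit(x, j)])
    return cols


if __name__ == "__main__":
    w, x, y = 13, 10, 3
    for impl in (standard_csa, relaxed_csa):
        cols = impl(w, x, y)
        print(impl.__name__, [len(c) for c in cols], value(cols), w + x + y)
```

In this model the standard form never exceeds two signals per column, while the relaxed form has three signals in column 2 and uses two fewer adder cells.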
Note that while the two LSB full-adders reduce their respective three inputs to two outputs, the two MSB half-adders appear to do no useful work in reducing their inputs! Of course, the two MSB half-adders are necessary because of the carry signal. However, if we relax the carry-save output requirement and allow bit-2 to have two carry signals, then the two MSB half-adders can be eliminated as shown in Fig. 3(b). This is a perfectly viable transformation if a subsequent adder stage can eventually reduce the signals to the final carry-save form. On the other hand, given a less stringent timing constraint (say two FA delays), we can, instead of using fewer adders, use more full-adders in order to reduce the number of output signals as shown in Fig. 3(c). Clearly, the benefit of having a reduced number of signals will propagate to the subsequent adder stages as well, where more stringent timing-constraint compensation might occur.

Fig. 3. Equivalent carry-save implementations.

This simple example illustrates two important points. First, even though all three implementations in Fig. 3 have the same arithmetic value, they are not logically equivalent. This implies that an optimization process is best guided by arithmetic equivalence rather than logical equivalence, as we will show in Section III. Second, the flexibility in the relaxed-carry-save representation allows a wide range of implementations with different trade-offs. In Fig. 3(b), the adder stage is simplified by imposing a burden (an extra carry) on subsequent adder stages. In Fig. 3(c), the adder stage is made more complex to benefit subsequent adder stages. Therefore, the power of the relaxed-carry-save representation will be more evident when we optimize many cascading adder blocks simultaneously, as we will see in Section IV.

III. Optimization of an Adder Array

Given the linear carry-save addition of K operands $A_1(n_1) + \cdots + A_K(n_K)$, the adder array optimization problem seeks a minimum-cost implementation subject to a maximum-delay timing constraint ($T_{MAX}$). Our method begins by constructing a rectangular lattice (shown in Fig.
4) with K rows and N columns, where K is the number of operands and N is the output wordlength. Each cell in the lattice represents a possible location to place an adder cell as well as to accept inputs and produce outputs. The columns in the lattice are numbered from right to left starting at zero. Each column corresponds to a bit-slice where the signals in the j-th column have the arithmetic weight $2^j$. Next, we direct the first three operands $A_1(n_1), \ldots, A_3(n_3)$ to the first row of the lattice with their bits placed in the respective columns that correspond to their weights. The rest of the operands are directed to subsequent rows (one operand per row) in the order specified. (The ordering of the operands has a slight impact on the optimization but is beyond the scope of this paper.) We shall call the inputs to the lattice-cells from the operands local inputs to distinguish them from signals propagated from neighboring lattice-cells. The inputs to lattice-cell $c_{i,j}$, at row i and column j, can be the

sum outputs of cell $c_{i-1,j}$ (i.e., the same bit of the previous row), the carry outputs from cell $c_{i,j-1}$ (i.e., the lower bit of the same row), and the local inputs. Each lattice-cell can produce sum outputs to cell $c_{i+1,j}$ and carry outputs (with the help of an adder cell) to cell $c_{i,j+1}$. This is illustrated in Fig. 5.

Fig. 4. Construction of the cell lattice (panels: candidate cells in CS implementation; candidate cells in RCS implementation).

Fig. 5. Connections to a lattice-cell.

It is clear that the traditional CSA array can be realized by this lattice as illustrated in Fig. 4. In fact, the bit-adder in each lattice-cell in a traditional CSA array is uniquely determined by the number of inputs at each lattice-cell as shown in Fig. 6. Note that in a traditional CSA array, a lattice-cell almost always has two sum outputs and never has more than two sum outputs.

The key to our optimization is to relax the carry-save representation by allowing for more or less than two sum outputs for each lattice-cell. This flexibility allows use of a wider variety of lattice-cell implementations. Fig. 6 shows the feasible implementations for a lattice-cell with three inputs under the RCS representation. It is a straightforward matter to enumerate all possible lattice-cells for all possible numbers of inputs. For brevity, we will not reproduce the list here. We note that in our implementation, the maximum number of lattice-cell sum outputs is three because our experience has shown that there is little to gain in practice by allowing more.

Given the lattice, the local inputs and the list of feasible lattice-cell implementations, we can now determine the optimal implementation, subject to a maximum-delay constraint ($T_{MAX}$). This can be done by an exhaustive search that examines all feasible implementations for each lattice-cell. The feasibility of a lattice-cell implementation is constrained by the number of cell inputs and the delay requirement.
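The enumeration of lattice-cell implementations mentioned above can be sketched as follows. This is a hedged illustration, not the list the authors omitted: the function name `cell_options` is hypothetical, and the model assumes a single level of adders per cell, where each input feeds at most one adder, a full-adder turns three inputs into one sum and one carry, a half-adder turns two inputs into one sum and one carry, and unused inputs pass straight through as sum outputs. The cap of three sum outputs comes from the text; whether the resulting options match Fig. 6 exactly cannot be confirmed from the text alone.

```python
def cell_options(num_inputs, max_sum_outputs=3):
    """Enumerate candidate implementations for one lattice-cell.

    Returns a list of tuples
        (full_adders, half_adders, sum_outputs, carry_outputs)
    under the simplifying assumption that every input feeds at most one
    adder, so the cell contributes at most one adder delay.
    """
    options = []
    for fa in range(num_inputs // 3 + 1):
        remaining = num_inputs - 3 * fa
        for ha in range(remaining // 2 + 1):
            # Each FA replaces 3 inputs by 1 sum; each HA replaces 2 by 1;
            # inputs not consumed by an adder pass through unchanged.
            sums = num_inputs - 2 * fa - ha
            carries = fa + ha
            if sums <= max_sum_outputs:
                options.append((fa, ha, sums, carries))
    return options


if __name__ == "__main__":
    for k in range(1, 6):
        print(k, cell_options(k))
```

For a cell with three inputs this model yields three candidates: pass all three signals through, use one half-adder, or use one full-adder. A search over the lattice would then pick one candidate per cell subject to the delay limit and the two-output limit on the last row.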
In addition, the number of outputs for the last row of lattice cells is constrained to be no more than two in conformity with the carry-save representation. Since the inputs to a lattice-cell depend on the outputs from the top cell and the cell to the right, the search must begin with the top-right cell. We chose to proceed with the search in a column-by-column fashion until we reach the leftmost cell in the last row. This particular search sequence (instead of a row-by-row search) better limits the search space since the last row of cells has the most limiting constraints. Even with an exhaustive search, due to the many constraints, optimizing an adder array in all of our examples takes only a fraction of a second on a SUN ULTRA- workstation.

Fig. 6. Feasible implementations for lattice-cells.

IV. Optimization of Cascaded Adder Arrays

As we have mentioned in Section II, the power of the RCS representation becomes more pronounced when we consider the optimization of a series of cascaded adder arrays. Here, we will only consider the case where the cascading adder arrays are separated by registers as shown in Fig. 7. The registers may be used for pipelining or, in the case of DSP, for implementing the digital signal delay. Even though a similar lattice could be constructed for the entire problem (Fig. 7) as in the previous section, it will not be feasible to exhaustively search in such a lattice due to the large problem size. Instead, we will exploit the registers that effectively isolate the timing between the adder arrays so that we can determine the optimal implementation of each adder array independently.

All feasible implementations (of registers and adder arrays) for the entire circuit can be represented using a directed graph G(V, E) as shown in Fig. 7(c). Recall that with RCS representations we can have different output patterns for each adder array. This is represented by the vertices, where each vertex $v_{k,i}$ represents the i-th possible output pattern of the k-th adder stage (except the source vertex, which represents the all-zero pattern). The vertex set $V_k$ is the set of all vertices that represent the output patterns for the k-th register stage, as illustrated in the figure.
A directed edge $e_{k,i,j}$ exists from vertex $v_{k-1,i}$ to $v_{k,j}$ if there exists a feasible implementation of adder array stage k whose input and output patterns are $v_{k-1,i}$ and $v_{k,j}$, respectively. The edge weight is the cost of the optimal adder array implementation found using the algorithm of Section III plus the cost of the registers at the adder outputs. It is clear that every path in G from the source vertex to any vertex in $V_Z$, where Z is the number of adder stages and $V_Z$ is the vertex set for the last adder stage, represents a feasible implementation of the cascaded adder arrays. Therefore, the problem of

finding the optimal cascaded adder arrays can be solved by finding the minimum-cost path in G.

Fig. 7. Optimization of cascaded adder arrays.

While the minimum-cost path can be found by using a breadth-first search starting from the source vertex, the method is feasible only for smaller problems because the size of the vertex sets $V_k$ can be very large. For larger problems, we propose a simple yet effective heuristic with reduced runtime and memory requirement. The idea is to select a subset of vertices in each $V_k$ that is most promising in obtaining the final optimal solution and to expand only these vertices in the breadth-first search. The question here is: given two paths $p_1$ and $p_2$, starting from the source vertex and ending at $v_{k,i}$ and $v_{k,j}$, respectively, in the same vertex set, how can we estimate the relative goodness of the two paths?

We first note that for vertices in the same set $V_k$, the remaining problem (i.e., the adder arrays after stage k) is identical for both $p_1$ and $p_2$. Therefore, any difference in the optimal cost from $v_{k,i}$ and $v_{k,j}$ to vertices in $V_Z$ will be due to the different register patterns of $v_{k,i}$ and $v_{k,j}$. That is, if we can predict the cost impact of a register pattern on the subsequent adder arrays, then we can add the respective predicted costs to $p_1$ and $p_2$ and the lowest cost path is sufficient to lead us to the optimal solution. The difficulty is that we cannot accurately determine the impact cost because differences in the register patterns will possibly affect all subsequent adder stages and so the impact cost can only be estimated.

To illustrate how we may estimate the impact cost, we refer back to Fig. 3(a) and Fig. 3(b). While Fig. 3(a) has a larger cost, it has a simpler output pattern whereas Fig. 3(b) has a smaller cost but has an extra carry in the output pattern. Clearly, Fig. 3(b) will require more subsequent adders to reduce the output patterns to the final CSA representation.
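The staged search discussed above can be sketched in a few lines of Python. This is a simplified model under stated assumptions rather than the authors' optimizer: the function name `search_cascade`, the dictionary encoding of stages, the `impact` callback, and the `slack` parameter are all hypothetical. Each stage maps an input pattern to the output patterns reachable through a feasible adder array, with the edge cost assumed to be already known. With `slack=None` the sweep keeps every vertex and returns the exact minimum-cost path; with a numeric `slack` it keeps only vertices whose path cost plus estimated impact cost lies within `slack` of the best one in that stage.

```python
def search_cascade(stages, source, impact=lambda pattern: 0, slack=None):
    """Stage-by-stage (breadth-first) search for a minimum-cost path.

    stages : list of dicts, one per adder stage, mapping an input pattern
             to a list of (output_pattern, edge_cost) pairs.
    source : the pattern feeding the first stage.
    impact : hypothetical estimate of the extra cost a pattern imposes
             on later stages (0 by default).
    slack  : if None, keep all vertices (exact search); otherwise keep
             only vertices within `slack` of the best estimated cost.
    """
    frontier = {source: 0}
    last = len(stages) - 1
    for k, edges in enumerate(stages):
        reached = {}
        for pattern, cost in frontier.items():
            for out_pattern, edge_cost in edges.get(pattern, []):
                total = cost + edge_cost
                if out_pattern not in reached or total < reached[out_pattern]:
                    reached[out_pattern] = total
        if not reached:
            raise ValueError("no feasible implementation at stage %d" % k)
        # Prune intermediate stages only; the last stage has no successors.
        if slack is not None and k < last:
            best = min(c + impact(p) for p, c in reached.items())
            reached = {p: c for p, c in reached.items()
                       if c + impact(p) <= best + slack}
        frontier = reached
    return min(frontier.values())


if __name__ == "__main__":
    # Toy graph: pattern "b" is cheaper now but costlier for the next stage.
    stages = [
        {"src": [("a", 5), ("b", 3)]},
        {"a": [("z", 1)], "b": [("z", 4)]},
    ]
    estimate = {"a": 0, "b": 3}
    print(search_cascade(stages, "src"))
    print(search_cascade(stages, "src", impact=lambda p: estimate.get(p, 0), slack=0))
    print(search_cascade(stages, "src", slack=0))
```

In the toy graph, pruning on path cost alone keeps only the locally cheaper pattern and misses the better overall path, while adding an impact estimate steers the pruned search back to the exact answer. The quality of a pruned search therefore depends on how well the estimate reflects the real downstream cost.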
Therefore, an intuitive estimation for a pattern's impact cost is to use the cost of ripple adders needed to reduce the RCS representation to a carry-save representation. Note that this estimation is inherently inaccurate because it does not consider the impact of timing constraints. This estimated cost can now be added to the paths to the vertices in each $V_k$, and we will limit the breadth-first search at each stage to continue only on paths within a small range (to accommodate the inaccuracies in the impact estimation) from the minimum cost for the stage.

V. Experimental Results

We have implemented an optimizer for bit-level carry-save additions that exploits the RCS representation and we have successfully used it to optimize a series of fixed-coefficient high-speed transposed form finite-impulse-response (FIR) filters as summarized in Table I. The results were collected on a SUN ULTRA- workstation. The number of taps corresponds to the number of vertex sets in the search graph. Examples SI and KA are the standard sinc function and Kaiser window filters (with 3-bit wordlength) and the two LP examples are lowpass filters (with -bit wordlength). These filters have fixed coefficients of various sizes, resulting in carry-save additions of operands with various wordlengths. We have used the following weights for the cells: FA =, HA =, Reg =, which roughly reflect the size of the respective cells. For ease of comparison, we let the FAs and HAs have unit delays and the registers have zero delays. However, our algorithm can be used with more general delay models.

Table I shows the number of full-adders (#FA), the number of half-adders (#HA), the number of registers (#Reg) and the overall weighted cost (Cost) for filters implemented using traditional CSAs and our bit-level optimized RCS implementations. The superiority of the proposed RCS representation is clearly evident in the large percentage savings (from 5% to over 3%) in the weighted cost. Note that the RCS implementations use more full-adders (even though the full-adders have higher cost!) but significantly fewer half-adders.
This is consistent with our earlier observation in Section II that the half-adders are not really useful under the RCS representation. Furthermore, even though the RCS implementations require three registers per bit at some

locations, the overall RCS implementations have %-4% fewer registers.

TABLE I
SUMMARY OF RESULTS
Columns: Filter, taps, Run time (sec.), CSA (#FA, #HA, #Reg, Cost), Proposed (#FA, #HA, #Reg, Cost), Percentage Savings
SI 5 75 7 4 46 84 83 36 5.3%
KA 48 595 77 49 7 35 33 845 53 5.6%
LP 8 9 66 33 55 4 8 48 838 3.%
LP 4 53 33 387 96 79 5 35 688 5.7%

TABLE II
COMPARISONS WITH OPTIMAL SOLUTIONS
Columns: T MAX, Optimal (#FA, #HA, #Reg, Cost), With impact cost est. (#FA, #HA, #Reg, Cost), Without impact cost est. (#FA, #HA, #Reg, Cost)
84 7 84 359 84 83 36 84 6 9 365 3 83 6 338 84 9 6 338 85 8 65 343 4 85 9 47 36 85 9 47 36 86 6 54 33 5 84 3 43 34 86 7 46 35 86 6 47 35 6 84 3 39 3 84 3 39 3 86 6 48 36 7 84 5 34 37 84 5 34 37 86 6 48 36

For the smallest example SI, we were able to use a full breadth-first search to find the optimal solution. This was useful for examining the effectiveness of our search strategy using the impact cost estimation. Table II compares the optimal results with our results using the estimation and with the case where no estimation cost is used. Clearly, the impact cost estimation is effective in producing solutions that are very close to the optimal. In addition, the table also shows the range of solutions attainable under various delay constraints T MAX (number of adder delays), that is, the trade-off between the implementation cost and the delay requirement.

Fig. 8. Comparison of lattice cells for the Kaiser window filter (CSA implementation versus RCS implementation; legend: FA, HA, Register).

To better understand where the savings were achieved, we compare in Fig. 8 the cell lattice of example KA using the traditional CSA implementation against our optimized RCS implementation. We have drawn the cells in the figure such that their area roughly corresponds to the weighting function, thereby giving a visual sense of the overall weighted cost. Since the RCS implementation is less regular than the CSA implementation, the percentage area savings in a compacted layout will be slightly less than that of the weighted cost due to the routings.
However, even for a structured layout where area savings are not exploited, the RCS implementation remains desirable because of its power savings as a result of using fewer registers and adder cells.

VI. Conclusions

We have shown that by relaxing the carry-save representation to allow for more or less than two signals per bit position, we can increase the flexibility in the bit-level implementations of an adder array. Such flexibility can in turn lead to a more efficient implementation using the proposed algorithm to optimize a single adder array at the bit-level. In addition, we have proposed a heuristic to optimize a series of adder arrays that might occur in a datapath. We have applied our algorithms to the optimization of high-speed digital FIR filters and were able to achieve 5% to 3% savings in weighted cost compared to using standard carry-save implementations.

References

[1] T. Kim, W. Jao, and S. Tjiang, "Arithmetic optimization using carry-save-adders," in Proc. 35th Design Automation Conf., pp. 433-438, June 1998.
[2] M. Potkonjak and J. Rabaey, "Optimizing resource utilization using transformations," in Proc. 1991 IEEE Int. Conf. Computer-Aided Design, pp. 88-91, Nov. 1991.
[3] L. Rijnders, Z. Sahraoui, P. Six, and H. De Man, "Timing optimization by bit-level arithmetic transformations," in Proc. Euro. Design Automation Conf., pp. 48-53, Sept. 1995.