Using Delayed Addition Techniques to Accelerate Integer and Floating-Point Calculations in Configurable Hardware


Draft submitted for publication. Please do not distribute.

Using Delayed Addition Techniques to Accelerate Integer and Floating-Point Calculations in Configurable Hardware

Zhen Luo, Nonmember, and Margaret Martonosi, Member, IEEE

Abstract-- The speed of arithmetic calculations in configurable hardware is limited by carry propagation, even with the dedicated carry propagation hardware found in recent FPGAs. This paper proposes and evaluates an approach called delayed addition that reduces the carry-propagation bottleneck and improves the performance of arithmetic calculations. Our approach employs the idea used in Wallace trees to store results in an intermediate form and delay addition until the end of a repeated calculation such as accumulation or dot-product; this effectively removes carry propagation overhead from the calculation's critical path. We present both integer and floating-point designs that use our technique. Our pipelined integer multiply-accumulate (MAC) design is based on a fairly traditional multiplier design, but with delayed addition as well. This design achieves a 66MHz clock rate on an XC4036XL-2 FPGA. Next, we present a 32-bit floating-point accumulator based on delayed addition. Here delayed addition requires a novel alignment technique that decouples the incoming operands from the accumulated result. A conservative version of this design achieves a 33MHz clock rate. We also present a 32-bit floating-point accumulator design with compiler-managed overflow avoidance that achieves a 66MHz clock rate. Finally, we present an application of delayed addition techniques to solve a system of linear equations using the conjugate gradient method. These designs and applications demonstrate the utility of delayed addition for accelerating FPGA calculations in both the integer and floating-point domains.

Index Terms-- FPGA, delayed addition, Wallace tree, multiply-accumulate (MAC)

I. INTRODUCTION

When an arithmetic calculation is carried out in a RISC microprocessor, each instruction typically has two source operands and one result.
In many computations, however, the result of one arithmetic instruction is just an intermediate result in a long series of calculations. For example, dot product and other long summations use a long series of integer or floating-point operations to compute a final result. While FPGA designs often suffer from much slower clock rates than custom VLSI, configurable hardware allows us to build specialized hardware for these cases; with this, we can optimize the pipelining characteristics for the particular computation. (The authors are with the Electrical Engineering Department of Princeton University, Princeton, NJ 08544, USA.)

A typical multiplier in a full-custom integrated circuit has three stages. First, it uses Booth encoding to generate the partial products. Second, it uses one or more levels of Wallace tree compression to reduce the number of partial products to two. Third, it uses a final adder to add these two numbers and get the result. For such a multiplier, the third stage, performing the final add, generally takes about one-third of the total multiplication time [8, 9]. If implemented using FPGAs, stage 3 could become an even greater bottleneck because of the carry propagation problem. It is hard to apply fast-adder techniques to speed up carry propagation within the constraints of current FPGAs. In Xilinx 4000-series chips, for example, the fastest 16-bit adder possible is the hardwired ripple-carry adder [19]. The minimum delay of such an adder (in a -2 speed grade XC4000XL part) is more than four times the delay of an SRAM-based, 4-input look-up table that forms the core of the configurable logic blocks. Since this carry propagation is such a bottleneck, it impedes pipelining long series of additions or multiplies in configurable hardware; because the carry propagation lies along the critical path, it determines the pipelined clock rate for the whole computation. Our work removes this bottleneck from the critical path so that stages 1 and 2 can run at full speed. This improves the performance of inner products and other series calculations.
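The three-stage structure described above can be sketched in software. This is an illustrative model of ours, not the paper's design: `booth_partial_products` and `wallace_reduce` are hypothetical names, and the bit widths are arbitrary. Stage 1 is radix-4 Booth recoding, stage 2 is carry-propagation-free 3-2 compression, and stage 3 is the single carry-propagating add that the paper identifies as the bottleneck.

```python
def booth_partial_products(a, b, nbits):
    """Stage 1: radix-4 Booth recoding rewrites b as sum(d * 2**i) over even
    bit positions i, with each digit d in {-2,-1,0,1,2}, so an nbits-wide b
    yields only about nbits/2 partial products of a."""
    b_ext = b << 1                       # implicit 0 below bit 0
    rows = []
    for i in range(0, nbits, 2):
        t = (b_ext >> i) & 7             # overlapping triple of bits of b
        d = (t & 1) + ((t >> 1) & 1) - 2 * ((t >> 2) & 1)
        rows.append((a * d) << i)        # partial product a * d * 2**i
    return rows

def wallace_reduce(rows):
    """Stage 2: repeated 3-2 compression reduces the rows to two, with no
    carry chain. Stage 3: one final carry-propagating add produces a*b."""
    while len(rows) > 2:
        x, y, z = rows[:3]
        s = x ^ y ^ z                            # bitwise sum, no carries
        c = ((x & y) | (y & z) | (z & x)) << 1   # carries, deferred one place
        rows = [s, c] + rows[3:]
    return sum(rows)                             # the final add (stage 3)
```

For instance, `wallace_reduce(booth_partial_products(7, 6, 8))` evaluates to 42; the 3-2 step preserves the invariant s + c = x + y + z, so only the last line ever propagates a carry.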
As an example, consider the summation C of a vector A:

C = Σ_{i=0}^{99} A[i]

Our goal is to accumulate the elements of A without paying the price of 99 serialized additions. We observe that in traditional multiplier designs (e.g., the multiply units of most recent microprocessors [15, 16]), Wallace trees are used to accumulate the partial products. Our work proposes and evaluates ways in which similar techniques can be used to replace time-consuming additions in series calculations with Wallace tree compression. The technique is applicable to configurable hardware because, in a dynamically configurable system, it is practical to consider building specific hardware for inner products or other repeated calculations. The technique is effective for configurable hardware because it removes addition's carry-propagation logic from the critical path of these calculations, thus allowing them to be pipelined at much faster clock rates.
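In software terms, the summation above keeps the running total in redundant (S, C) form and pays for carry propagation only once. A minimal sketch (the function name is ours; the 3-2 compression mirrors the Wallace-tree cell reviewed in Section II.B):

```python
def delayed_accumulate(values):
    """Accumulate a vector while deferring carry propagation: the running
    total is a pseudo-sum pair (S, C) whose true value is always S + C."""
    S = C = 0
    for v in values:
        # 3-2 compression: three addends in, two out, no carry chain
        s = S ^ C ^ v
        c = ((S & C) | (C & v) | (v & S)) << 1
        S, C = s, c
    return S + C      # one carry-propagating add per whole summation
```

Summing A[0..99] this way costs 100 compressions plus a single final add, instead of 99 carry-propagating additions.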

By using Wallace trees to accumulate results without carry propagation overhead, we can greatly accelerate both integer and floating-point calculations. We demonstrate our ideas on three designs. The first design is an integer unit that performs pipelined sequences of MAC (multiply-accumulate) operations; this pipelined design operates at a 66.7MHz clock rate. The second and third designs perform floating-point accumulations (i.e., repeated additions) on 32-bit IEEE single-precision format numbers. One of them uses a conservative stall technique to respond to possible overflows; it operates at 33MHz. The other design relies on compiler assistance to avoid overflows by breaking calculations into chunks of no more than 512 summation elements at a time. This approach yields a 66MHz clock rate for 32-bit IEEE single-precision summations. These clock rates indicate the significant promise of this approach in implementing high-speed pipelined computations on FPGA-based systems. We demonstrate this promise on an example application: solving linear equations using the conjugate gradient method.

The remainder of this paper is structured as follows. Section II introduces the basic idea of delayed addition calculation and presents a design for a pipelined integer multiply-accumulate unit based on this approach. Section III moves into the floating-point domain, presenting the design of a pipelined 33MHz 32-bit floating-point accumulator with delayed addition. Building on this basic design, Section IV then presents the 66MHz floating-point accumulator with compiler-managed overflow avoidance, which is used to form the MAC unit in our application to solving linear equations with the conjugate gradient method in Section V. Section VI discusses issues of rounding and error theory related to these designs, Section VII presents related work, and Section VIII provides our conclusions.

II. DELAYED ADDITION IN A PIPELINED INTEGER MULTIPLY-ACCUMULATOR

A. Overview
A multiply-accumulator unit consists of a multiplier and an adder. For adders of 16 bits or less implemented in Xilinx FPGAs, the hardwired ripple-carry adder is the fastest. For adders more than 16 bits long, a carry-select adder is a good choice for fast addition in an FPGA. It uses ripple-carry adders as basic elements and a few multiplexers to choose the result; thus it can still utilize the hardwired ripple-carry logic on Xilinx FPGAs to achieve relatively high speed.

Most of the multipliers that have been implemented so far in FPGAs are based on bit-serial multipliers [2, 14]. This is because bit-serial multipliers take much less area than any other kind of multiplier. Since they have a regular layout, they are easy to map onto an FPGA to achieve a very high clock rate. However, a bit-serial multiplier requires a very long latency to produce a result. For two multiplicands M and N bits long, it takes M+N clock cycles to get the product [7]. Although some implementations have tried to relieve this problem by multiplying more than one bit per cycle [2], we know of no such implementations with an overall throughput of more than 10MHz. Bit-array multipliers also have a regular layout, which makes them easy to map onto an FPGA at high clock rates [3]. Unlike bit-serial multipliers, they produce one product every cycle. Thus they can achieve very high throughput at the price of a large area cost. In the case of a 32-bit integer MAC with a 64-bit final result, we would expect to have a bit-array multiplier with 63 pipeline stages for multiplication and one for accumulation. Thus we would need a 64x64 CLB matrix to implement it [3]. However, a CLB matrix of this size can barely fit into the largest Xilinx part available (XC40125XV) now (as of Sept. 1998) [20], which would involve a huge cost. Our design, as we will see next, has performance comparable to a bit-array multiplier for vector MAC and is much more area efficient.

B. Background on Wallace Trees

Fig. 1 (a) An array of n 3-2 adders. (b) An array of n 4-2 adders.
Before continuing on to detailed designs, we will first give a brief review of some basics of the Wallace tree [10] and its derivatives [11]. One level of a Wallace tree is composed of arrays of 3-2 adders (or compressors). The logic of a 3-2 adder is the same as a full adder, except that the carry-out from the previous bit has now become an external input. For each bit of a 3-2 adder, the logic is:

S[i] = A1[i] XOR A2[i] XOR A3[i]
C[i] = A1[i]A2[i] + A2[i]A3[i] + A3[i]A1[i]

For the whole array, S + 2C = A1 + A2 + A3.

S and C are partial results that we refer to in this paper as the pseudo-sum. They can be combined during a final addition phase to compute a true sum.

Fig. 2 Integer MAC with delayed addition. Booth encoders feed four levels of Wallace-tree adder arrays (4-2 and 3-2), followed by a compressor stage that holds the pseudo-sum; the final 64-bit adder below the pipeline is not pipelined.

Fig. 3 Pipeline diagram of the integer MAC computing Σ_{i=0}^{5} A[i]·B[i]:

cycle #   1    2    3    4    5    6    7    8    9    10   11
Input 1   BH   W1   W2   W3   W4   CPR
Input 2        BH   W1   W2   W3   W4   CPR
Input 3             BH   W1   W2   W3   W4   CPR
Input 4                  BH   W1   W2   W3   W4   CPR
Input 5                       BH   W1   W2   W3   W4   CPR
Input 6                            BH   W1   W2   W3   W4   CPR

The stages marked BH (Booth encoders), W1-W4 (Wallace tree levels 1-4) and CPR (Compressor) refer to the six pipeline stages shown in Figure 2. The final addition is performed only once per summation and does not impact the pipelined clock rate.

The total number of inputs across an entire level of a 3-2 adder array is the same as the bit-width of the inputs. Fig. 1(a) shows the layout of such an array example. In some Wallace tree designs, 4-2 adder arrays have also been used, because they reduce the number of compressor levels required [11]. Each bit of such an array is composed of a 4-2 adder. The typical logic is:

Cout[i] = A1[i]A2[i] + A2[i]A3[i] + A3[i]A1[i]
S[i] = A1[i] XOR A2[i] XOR A3[i] XOR A4[i] XOR Cin[i]
C[i] = (A1[i] XOR A2[i] XOR A3[i] XOR A4[i])·Cin[i] + NOT(A1[i] XOR A2[i] XOR A3[i] XOR A4[i])·A4[i]

For the whole array, S + 2C = A1 + A2 + A3 + A4.

Fig. 1(b) shows the layout of an array example using 4-2 adders. At first glance, one might initially think that Cin and Cout are similar to the carry-in and carry-out of ripple-carry adders. The key difference, however, is that Cin does not propagate to Cout. The critical path of an array of 3-2 or 4-2 adders is in the vertical, not horizontal, direction. Furthermore, the logic shown maps well to coarse-grained FPGAs.
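The 4-2 cell's non-propagating carries can be checked numerically. This word-level sketch of the equations above is ours (names and widths illustrative); note that Cin is just Cout shifted one bit position, never a ripple chain:

```python
def compress_4_2(a1, a2, a3, a4):
    """One level of 4-2 compression: four addends in, pseudo-sum (S, C) out,
    with the invariant S + 2C == a1 + a2 + a3 + a4."""
    u = a1 ^ a2 ^ a3
    cout = (a1 & a2) | (a2 & a3) | (a3 & a1)   # per-bit Cout, independent of Cin
    cin = cout << 1                             # Cout[i] becomes Cin[i+1]; no ripple
    x = u ^ a4
    s = x ^ cin                                 # S[i]
    c = (x & cin) | (~x & a4)                   # C[i], complement on the XOR term
    return s, c
```

Because each Cout depends only on A1-A3 and not on Cin, all bits evaluate in parallel; the horizontal "carry" is a fixed one-bit shift, which is exactly why the array's critical path is vertical.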
With Xilinx 4000-series parts, we can fit each S or C, for either a 3-2 or 4-2 adder, into a single CLB using the F, G, and H function generators.

C. Design of an Integer MAC with Delayed Addition

For an integer MAC unit, the implementation is straightforward because integers are fixed-point and are therefore aligned. Our design looks exactly like a traditional multiplier design with Booth encoding and a Wallace tree, except that a 4-2 adder array is inserted into the pipeline before the final addition. To achieve accumulation, we repeatedly execute:

Pseudo-sum = Pseudo-sum + (the final two partial products for each multiplication)

Recall that pseudo-sum refers to the S and C values currently being computed by a 3-2 or 4-2 adder array, awaiting the final addition that will calculate the true result. Fig. 2 shows a block diagram of our implementation.
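The accumulation step above can be modeled in a few lines. This sketch is ours, not the paper's VHDL: the two rows left by each multiplication's Wallace tree are simulated by an arbitrary split of the product (a stand-in, since we do not model the whole multiplier), and the 4-2 array is expressed as two chained 3-2 compressions:

```python
def csa(x, y, z):
    """3-2 (carry-save) compression: returns (s, c) with s + c == x + y + z."""
    return x ^ y ^ z, ((x & y) | (y & z) | (z & x)) << 1

def mac_delayed_addition(pairs):
    """Repeatedly executes: pseudo-sum = pseudo-sum + (the final two partial
    products for each multiplication), with one final add for the true sum."""
    S = C = 0
    for a, b in pairs:
        p = a * b
        # stand-in for the two rows left by the multiplier's Wallace tree
        p1, p2 = p & 0x55555555, p & ~0x55555555
        s1, c1 = csa(S, C, p1)        # a 4-2 array behaves like two 3-2 steps
        S, C = csa(s1, c1, p2)
    return S + C                      # single final add per whole dot product
```

Each loop iteration is pure compression with no carry chains, which is what lets the hardware version clock the pipeline independently of the 64-bit adder's delay.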

Design                         | Xilinx part number | CLB matrix size | CLBs used | Flip-flops used | Pipeline stages | Speed (MHz) | Final Addition Delay (ns)
IMAC (delayed addition)        | XC4036xlhq208-2    | 36x36           | 1287      | 1866            | 7               | 66.7        | 38.27
Traditional integer multiplier | XC4036xlhq208-2    | 36x36           | 1243      | 1836            | 8               | 54.5        | N/A

Table 1: Synthesis results for the pipelined integer MAC with delayed addition and the pipelined adder.

Each level of a Wallace tree has a similar delay, and this delay is also similar to that of a Booth encoder. Thus, as shown in Fig. 2, a natural way to pipeline this design is to let each level of logic (above the dotted line) be one of the pipeline stages. The well-matched delays make for a very efficient pipelined implementation. The final compressor, just above the dotted line, stores and updates the pseudo-sum every cycle. When the repeated summation is complete, a final add (not part of the pipeline) converts this intermediate form to a true sum result. The pseudo-sum is updated each cycle, but the final adder is only used when the full accumulation is done. Therefore, it is not one of the pipeline stages, but rather constitutes a post-processing step as shown in Figure 3. With this structure, the carry propagation time of the final addition is no longer on the critical path that determines the clock rate of the pipelined MAC design. For sufficiently long vectors, this final addition time, incurred only once per entire summation rather than once per element, will be negligible even compared to the faster vector MAC calculations of this design.

D. Design Synthesis Results

For all the designs in this paper, we used the Synopsys fpga_analyzer tool (1997.08) to generate a .sxnf file from our VHDL input, and we used the Xilinx Foundation tools (V1.3) for the rest of the synthesis. In order to remove the bottleneck at the pad inputs, we added an extra pipeline stage before the Booth encoder to buffer the chip inputs.
After several initial attempts, we targeted our design at a speed of 66.7MHz and specified this information in the timing constraint file (.pcf file), where we listed this requirement for all the critical paths. The PAR (placement and routing) completed successfully, and the Timing Analyzer gives all the timing information once the design is completely placed and routed. The synthesis results for the above design are listed in Table 1. Most notably, our design fits in a Xilinx 4036 part and achieves the targeted clock rate of 66.7MHz. The final addition delay, incurred once per vector as a post-processing step, takes roughly 40ns, or nearly 3 cycles.

To demonstrate the advantage of delayed addition, we also tried to implement an integer MAC composed of a traditional integer multiplier and an adder. However, the design was too big to fit on one XC4036 chip, so we built an integer multiplier on the chip instead. To trade off between pipeline speed and area cost, we used a carry-select adder for the final addition and divided it into two pipeline stages. In the first stage, three 32-bit additions are carried out in parallel, one for the lower 32 bits and two for the upper 32 bits. In the second stage, we select between the two results for the upper 32 bits using the carry-out of the lower 32-bit addition. The synthesis results of this design are also listed in Table 1. Timing analysis shows that it is exactly the two pipeline stages of the adder that are the bottleneck of the whole design. However, further pipelining the adder would involve a much larger area cost and is not likely to give any performance gain, due to the long wiring delays in an FPGA.

From Table 1, we can see that by using the delayed addition algorithm, we have achieved a faster pipeline speed than the traditional multiplier-and-accumulator design. According to the data above, an IMAC with delayed addition requires 7 + (N-1) + 3 = N + 9 cycles to complete an integer inner product of length N, where 7 is the number of pipeline stages and 3 is the cycle count of the final addition.
The overall latency for this design would thus be 15ns x (N + 9) = (15N + 135)ns. Since we could only implement the multiplier on one XC4036 chip in the traditional design, we have to add another two pipeline stages for the accumulator in our calculation. Thus the overall latency for an inner product of size N using the traditional IMAC would be 10 + (N-1) = N + 9 cycles as well. Since the cycle time in the delayed addition design is 20% shorter than in the traditional design, the delayed addition design achieves a performance speedup of 1.2x.

III. USING DELAYED ADDITION IN A FLOATING-POINT ACCUMULATOR

Multiply-and-accumulate also appears frequently in floating-point applications. For example, of the 24 Livermore Loops, 5 loops (loops 3, 4, 6, 9, and 21) are basically long vector inner-product-like computations [17]. In certain applications, such as the conjugate gradient example in Section V, multiply-and-accumulate dominates the whole computation process. Thus it would be ideal if we could also use our delayed addition techniques to build a floating-point multiply-accumulator to speed up this kind of computation, as we did in the integer case. However, a floating-point MAC unit uses too much area to fit on a single FPGA chip. The major reason is that floating-point accumulation is a much more complex process than the integer case, as explained below.

Fig. 4 IEEE single-precision format: sign s (bit 31), exponent (bits 30-23), fraction (bits 22-0). S is the sign; the exponent is biased by 127. If the exponent is not 0 (normalized), mantissa = 1.fraction; if the exponent is 0 (denormalized), mantissa = 0.fraction.

Rather than a MAC unit, we instead focus here on a floating-point accumulator using delayed addition. We first give a brief review of traditional approaches, then describe how we have used delayed addition techniques to optimize performance.

A. Traditional Single-Precision Addition Algorithm

As shown in Fig. 4, a traditional floating-point adder first extracts the 1-bit sign, 8-bit exponent and 23-bit fraction of each incoming number from the IEEE 754 single-precision format. By checking the exponent, the adder determines whether each incoming number is denormalized. If the exponent bits are all 0, which means the number is denormalized, the mantissa is 0.fraction; otherwise, the mantissa is 1.fraction. Next, the adder compares the exponents of the two numbers and shifts the mantissa of the smaller number to get them aligned. Sign adjustments also occur at this point if either of the incoming numbers is negative. Next, it adds the two mantissas; the result needs another sign adjustment if it is negative. Finally, the adder re-normalizes the sum, adjusts the exponent accordingly, and truncates the resulting mantissa to 24 bits with the appropriate rounding scheme [2].

The above algorithm is designed for a single addition rather than a series of additions. Even more so than in the integer case, this straightforward approach is difficult to pipeline. One problem lies in the fact that the incoming next-element-to-be-summed must be aligned with the current accumulated result. This poses a challenge for our delayed addition technique, since we do not keep the accumulated result in its final form and thus cannot align incoming addends to it. Likewise, at the end of the computation, renormalization also impedes a delayed addition approach.
For these two problems, we have come up with two solutions:

1. Minimize the interaction between the incoming number and the accumulated result. To achieve this, we self-align the incoming number on each cycle, rather than aligning it to the pseudo-sum. Section 1) below describes self-alignment in more detail.

2. Use delayed addition for accumulation only. Postpone rounding and normalization until the end of the entire accumulation. This approach is also used when implementing MAC in some full-custom IC floating-point units [12].

B. Our Delayed Addition Floating-Point Accumulation Algorithm

This section describes our approach to delayed addition accumulation of floating-point numbers. Similar to what we did in the integer MAC, we repeatedly execute pseudo-sum = pseudo-sum + incoming operand. Each incoming operand is an IEEE single-precision floating-point number, with a 1-bit sign, 8-bit exponent (EXP[7-0]) and 23-bit fraction. For simplicity of discussion, we consider the exponent bits as three subfields: a high-order exponent, a decision bit, and a low-order exponent. The high-order exponent refers to EXP[7-6], the decision bit is EXP[5], and the low-order exponent refers to EXP[4-0]. We take different actions according to the values of these three fields.

Like the traditional adder, our design first extends the 23-bit fraction into a 24-bit mantissa. However, we choose not to align the incoming operand and the current pseudo-sum directly, because that way the incoming operand interacts with the accumulated pseudo-sum throughout the alignment process, which makes further pipelining impossible. Thus the alignment process could easily become the bottleneck of the whole pipeline if we still adopted the traditional alignment method. Instead, we keep summary information about the high-order exponent of the accumulated result, and align its mantissa to a fixed boundary according to its low-order exponent. We refer to this technique as "self-alignment" and describe it below.

1) Self-Aligning Incoming Operands

There are two ways to align two floating-point numbers.
The common way is to shift the mantissa of one number by d bits, where d is the difference between the exponents of the two numbers. Another way is to instead shift both mantissas to some common boundary. Traditional floating-point adders use the first method. In our case, however, the second way is used, since we would like to minimize the interactions between the incoming number and the accumulated pseudo-sum. We could have fully unrolled the incoming operand and the accumulated pseudo-sum by left-shifting their mantissas by the number of bits denoted by their exponents, except for the huge area cost involved. In that case, the shifted mantissa would be as long as 255 (the largest 8-bit exponent possible) + 24 (the width of the single-precision mantissa) = 279 bits. Because of this, we only left-shift the mantissa by the number of bits denoted by the low-order exponent (EXP[4-0]) in our design. Since the low-order exponent is a 5-bit quantity, the largest decimal it can express is 31. Thus, by left-shifting to account for the low-order bits, we have extended the width of our mantissa to 55 bits. Although this is still wide, our design fits into 910 CLBs on a Xilinx 4036 as we will see later, and this gives us the ability to garner truly high-performance single-precision floating-point from an FPGA-based design.

Fig. 5 Compressor design (compressor-0). The datapath comprises a 55-bit register for the self-aligned incoming number, 3-2 adder arrays, multiplexers, and 64-bit registers for the pseudo-sum feeding the 64-bit adder. GS is the stall signal generated by the control unit. P indicates whether the top 2 bits of the incoming number's exponent are bigger than those of the accumulated S and C; E indicates whether the top 2 bits of the exponent of the accumulated S and C are bigger than those of the incoming number. E and other control signals together serve as the clock-enable signal for the S and C flip-flops.

In the above self-aligning process, we did not take into consideration the high-order exponent (EXP[7-6]) or the decision bit (EXP[5]). Thus the shifted mantissa of the incoming operand is still not perfectly aligned with that of the current pseudo-sum. We use the following fact to solve this remaining problem. In single-precision IEEE floating-point, the mantissa is only 24 bits wide. Thus, if we try to add two originally normalized numbers that differ by more than a factor of 2^24, alignment will cause the smaller of the two numbers to be "right-shifted" out of the expressible range for this format. For example, 2^26 + 2^2 = 2^26 in single-precision calculations. Our algorithm efficiently uses this fact to identify such cases and handle them appropriately.

Once self-aligned, the incoming number can be thought of as m (the 55-bit mantissa) x 2^(32·EXP[7-5]). Meanwhile, our pseudo-sum is stored as mp (the 64-bit mantissa) x 2^(32·EXP[7-5]). If the current pseudo-sum and the incoming operand are identical in the decision bit (EXP[5]), then if the high-order exponent (EXP[7-6]) of the incoming number is bigger than that of the pseudo-sum, the mantissa of the pseudo-sum will be shifted out of the expressible range, as long as it is no more than 64 bits wide. In this case, we simply replace the current pseudo-sum with the incoming operand.
On the other hand, if the high-order exponent of the incoming number is smaller than that of the pseudo-sum, the incoming number will be shifted out of the expressible range, since m is less than 64 bits wide; thus we simply ignore the incoming operand. Compression only takes place when the high-order exponent of the pseudo-sum is equal to that of the incoming number. Note that if the current pseudo-sum and the incoming operand are not identical in EXP[5], then determining the appropriate response would actually require subtracting the two full exponents to determine by how much they differ. This would pose a bottleneck in the pipeline; thus we hope to avoid this scenario entirely. This leads to our design described in Section 2) below.

2) Compressor Implementation Details

In order to avoid the undesirable scenario of unequal decision bits, we actually keep two running pseudo-sums. One compressor, referred to as compressor-0, takes care of incoming operands whose decision bit is 0, and the other compressor (compressor-1) handles those with a decision bit of 1. We simply shunt each incoming operand to the appropriate compressor, as shown in Figure 6. In this way, we can always take the actions corresponding to the high-order exponent as described above. The two pseudo-sums from compressor-0 and compressor-1 are both added together during the final add stage, as a post-processing step following the pipelined computation.

Figure 5 shows the design layout for one of the two compressor units in the design, namely compressor-0. Compressor-1 has an essentially identical structure, except that it cross-connects with adder-0 as shown in Figure 6. The running pseudo-sum is stored as the Wallace tree's S and C partial results in the 64-bit registers shown. Were it not for the possibility of either pseudo-sum overflowing, the design would now be complete. Since the accumulated result may exceed the register capacity, we have also devised a technique for recognizing and responding to potential pseudo-sum overflows.
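The self-alignment and dual-compressor rules above can be modeled in software. This sketch is ours, not the paper's VHDL: it keeps an exact integer pseudo-sum per decision-bit bucket rather than carry-save S and C registers, ignores denormalized inputs, and the 2^(32·EXP[7-5] - 150) scale simply folds the IEEE bias (127) and the 23 fraction bits into the exponent arithmetic.

```python
import struct
from fractions import Fraction

def fields(x):
    """Decompose a float32 into (sign, raw 8-bit exponent E, 24-bit mantissa)."""
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    s, e, frac = bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF
    mant = frac if e == 0 else (1 << 23) | frac   # denormalized vs normalized
    return s, e, mant

class SelfAligningAccumulator:
    """Each operand's mantissa is shifted left by its low-order exponent
    EXP[4-0]; operands are bucketed by the decision bit EXP[5]; within a
    bucket, comparing the high-order exponent EXP[7-6] decides whether to
    replace, ignore, or add into the running pseudo-sum."""
    def __init__(self):
        self.buckets = {0: None, 1: None}  # decision bit -> (high2, mantissa sum)

    def add(self, x):
        s, e, mant = fields(x)
        m = mant << (e & 31)               # self-align by low-order exponent
        if s:
            m = -m
        d = (e >> 5) & 1                   # decision bit EXP[5]
        h = e >> 6                         # high-order exponent EXP[7-6]
        cur = self.buckets[d]
        if cur is None or h > cur[0]:
            self.buckets[d] = (h, m)       # old pseudo-sum shifted out of range
        elif h == cur[0]:
            self.buckets[d] = (h, cur[1] + m)  # delayed addition (carry-save in HW)
        # h < cur[0]: incoming operand is too small to matter; ignore it

    def value(self):
        """Final add + normalize: combine both buckets at their true scales."""
        total = Fraction(0)
        for d, cur in self.buckets.items():
            if cur is not None:
                h, m = cur
                total += Fraction(m) * Fraction(2) ** (32 * (2 * h + d) - 150)
        return float(total)
```

Operands whose exponents share the top three bits accumulate exactly; an operand whose high-order exponent is too small is dropped, which mirrors the single-precision fact that 2^26 + 2^2 = 2^26 anyway.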
Since we are not doing the full carry-propagation of a traditional adder, we cannot use the traditional overflow-detection technique of comparing carry-in and carry-out at the highest bit. In fact, without performing the final add that converts the pseudo-sum to the true sum, it is impossible to know precisely a priori when overflows will occur. Our approach instead relies on conservatively detecting whenever an overflow might occur, and then stalling the pipeline to respond. We can conservatively detect possible overflow situations by examining the top three bits of the S and C portions of the pseudo-sum and the sign bit of the 55-bit incoming operand. We have used espresso to form a minimized truth table generating the GlobalStall signal (GS in Fig. 5) as a Boolean function of these 7 bits. As shown in Fig. 5, the GlobalStall signal is used as the clock-enable signal on the first three pipeline stages; when it is asserted, the pipeline stalls and no new operands are processed until we have responded to the possible overflow. Since the design's two compressors are summing different numbers, and only one number is added at a time, they will of course approach overflow at different times. Our design, however, does overflow processing in both compressors whenever either compressor's GlobalStall signal is asserted. This coordinated effort avoids cases where overflow handling in one compressor is immediately followed by an overflow in the other compressor, and it potentially reduces the number of stalls needed, too, since we process the two pseudo-sums in parallel during the stall. When a stall occurs, our response is to sum the S and C portions of each compressor's pseudo-sum using the 64-bit adders shown in the Stall Response box in Figure 5.

[Figure 6 (pipelined portion). Stage 1: read in and generate the 24-bit mantissa. Stage 2: shift by EXP[2]-EXP[0]. Stage 3: shift by EXP[4]-EXP[3] and complement the mantissa if s = 1. Stage 4: compressor 0 and compressor 1 (adders not included). Stall response: 64-bit adders 0 and 1, with overflow detection and handling generating GlobalStall. Combinational logic (not pipelined): align the numbers from adder 0 and adder 1, final 64-bit adder, normalize the result.]

Fig. 6 Floating-point accumulator pipelining scheme

Table 2: Synthesis results for floating-point accumulation with delayed addition

Xilinx part number    CLB matrix size    CLBs used    Flip-flops    Pipeline stages    Speed (MHz)    Final Add and Norm. Delay (ns)
XC4036xlhq208-2       36 x 36            910          378           4                  33             72.4
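The espresso-minimized GlobalStall function itself is not reproduced in the text, so the sketch below uses an illustrative stand-in threshold of our own (stall once either 64-bit partial result, viewed as two's complement, has grown past its top three sign bits); only its conservative flavor matches the real design:

```python
# Hypothetical software stand-in for conservative overflow detection.
# The real design uses an espresso-minimized function of the top three
# bits of S, the top three bits of C, and the incoming operand's sign.
def top3(v):
    # top three bits of a 64-bit value
    return (v >> 61) & 0b111

def global_stall(s, c):
    # Stall unless both S and C still look sign-extended in their top
    # three bits, i.e. both are comfortably far from the register limits.
    return top3(s) not in (0b000, 0b111) or top3(c) not in (0b000, 0b111)
```

Such a check can be overly conservative (it may fire when no true overflow would occur), which is exactly the case the stall-response logic below must distinguish.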
This is a traditional 64-bit addition incurring a significant carry-propagation delay, but since it occurs during the stall time, it does not lie on the critical path that determines the design's pipelined clock rate. (As long as stalls are infrequent, it does not noticeably impact performance.) The decision of what to do with the newly formed sum depends on its value, i.e., on whether (i) an overflow truly occurred or (ii) we were overly conservative in our stall detection. In cases where an overflow does occur, the value of EXP[5] in the pseudo-sum will change. Recall that compressor-0 is to handle the accumulation of incoming operands whose EXP[5] bit is 0, with a pseudo-sum whose EXP[5] bit is also 0. If the pseudo-sum overflow causes EXP[5] to change value, then we need to pass the newly-computed full sum over to the other compressor. This is why the design in Fig. 6 includes the cross-coupled connections of adder-1 to compressor-0 and vice versa. When we were overly conservative in predicting a stall, EXP[5] will not change value; in this case, we retain the pseudo-sum in its current form.

C. Experimental Results

Figure 6 shows the block diagram of this design, and Table 2 summarizes the synthesis results. Because this is a

[Figure 7 (pipelined portion). Stage 1: read in and generate the 24-bit mantissa. Stage 2: shift by exp[1] and exp[0]. Stage 3: shift by exp[3] and exp[2]. Stage 4: shift by exp[4] and complement the mantissa if s = 1. Stage 5: compressor 0 and compressor 1 (adders not included). Combinational logic (not pipelined): align the S and C from the two compressors, a 4-2 compressor array taking S and C from both compressors, final 64-bit adder, normalize the result.]

Fig. 7 Floating-point accumulator with compiler-managed overflow avoidance: pipelining scheme

floating-point accumulator rather than a MAC unit, it is actually smaller than the integer MAC unit discussed in the previous section. Using 4 pipeline stages, our design attains a clock rate of 33 MHz. Because of the extra bookkeeping required to renormalize the final result, the post-processing delay in this design is larger. At 72.4 ns, this delay corresponds to roughly 2.4 pipelined clock cycles. As in the integer case, this difference between the final add time and the pipelined clock cycle time highlights the utility of delayed addition: by pulling this delay off the vector computation's critical path, we pay for it only once per vector, not on each clock cycle. According to the data above, using the same analysis as in Section III and assuming there is no stall during the computation, we will have to wait 4 + (N-1) + 3 = N + 6 cycles for an accumulation of length N to complete. Since each stall causes a 3-cycle bubble in the pipeline, and too many stalls may eventually incur expensive system interrupts, we also want to know how frequent stalls might be when we accumulate N numbers. We ran two simulations. Simulation I used 100,000 uniformly distributed floating-point numbers with absolute values ranging from 2^-31 to 2^31. Because positives and negatives are balanced, we did not encounter even one case of stalling.
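The latency formula above, together with the 3-cycle stall bubble, can be captured in a one-line model (function name ours):

```python
def accumulation_cycles(n, stalls=0):
    # 4 pipeline stages to fill, (n - 1) further issue cycles, roughly 3
    # cycles of final add/normalize, and a 3-cycle bubble per stall
    return 4 + (n - 1) + 3 + 3 * stalls
```

For N = 100,000 this gives 100,006 cycles, so even a couple dozen stalls change the total by well under 0.1%.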
Simulation II used 100,000 uniformly distributed positive floating-point numbers ranging from 0 to 2^31, and we found only 24 cases of stalling. Summing these 100,000 numbers needs 100,006 cycles, so 72 stall cycles are negligible. From this experiment we conclude that overflow and stalling pose little problem for most applications, as long as we have a reasonably large local buffer for operands. As we will show and exploit in Section IV, we can prove that for vectors shorter than 512 elements there is no chance of stalling at all.

IV. FLOATING-POINT ACCUMULATOR WITH COMPILER-MANAGED OVERFLOW AVOIDANCE

The main reason we have the overflow detection and handling logic in the previous design is the possible overflow of the pseudo-sum after a number of operations. However, the stall-related logic is very complicated and has a big area cost. Worst of all, it sits on the critical path of our design and slows down the pipeline. To avoid the area and speed overhead due to overflow detection and handling, we present a different style of design here. This design omits overflow handling by relying on the compiler to break a large accumulation into smaller pieces, so that overflow is guaranteed not to occur while each of these pieces is executed. Avoiding the area overhead for stall handling is desirable, but we will not gain much if we have to break an accumulation into very small pieces, so our goal is to bound how often overflow can occur. The largest incoming mantissa that can be fed into one compressor is 11...1100...00 (the first 24 bits are 1's and

Table 3: Synthesis results for floating-point accumulation with delayed addition and compiler-managed overflow avoidance

Xilinx part number    CLB matrix size    CLBs used    Flip-flops    Pipeline stages    Speed (MHz)    Final Addition Delay (ns)
XC4036xlhq208-2       36 x 36            839          417           5                  66             73.23

the rest 31 bits are 0's). This is less than 2^55. Thus if the pseudo-sum is stored in an n-bit (n > 55) register, we may have an overflow every 2^(n-55) accumulations. We can use this formula to choose a suitable n for specific applications. In this design, we choose the pseudo-sum width to be 64, as in the design from Section III. Since 64 - 55 = 9, overflows may occur every 2^9 = 512 accumulations. We therefore rely on the compiler to break summations into 512-element vectors. Since we no longer need the overflow checking and handling, all the related components of our previous design can be removed, and the two 64-bit adders below the compressors in Figure 6 are replaced by an array of 4-2 adders. This also greatly simplifies the control logic on the critical path and enables us to pipeline the design further. Figure 7 shows the resulting 5-pipeline-stage design. Synthesis results, summarized in Table 3, show that we have doubled the speed of the conservative design, achieving a clock rate of 66 MHz. For an accumulation of size N, we need 5 + (N-1) + 5 = N + 9 cycles to complete the whole computation.

V. APPLICATIONS OF DELAYED ADDITION TECHNIQUES

In previous sections, we examined MACs and accumulations in isolation. We now evaluate the utility of our approach within a larger application. Our design is especially useful for applications in which inner-product computations dominate the overall execution time. One such example is solving a system of linear equations using the conjugate gradient method. The system of linear equations takes the form:

A x = b,   (5.1)

where A is an N x N matrix and x, b are N x 1 vectors. Such computations arise frequently in scientific matrix-oriented code, such as in Space-Time Adaptive Processing (STAP) [18, 21].
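Stepping back to the Section IV design for a moment, the overflow bound and pipeline latency just derived are easy to sanity-check in a few lines (function names ours):

```python
MAX_OPERAND_BITS = 24 + 31   # 24-bit mantissa shifted by up to 31 places: < 2**55

def accumulations_before_overflow(n_bits):
    # pseudo-sum held in an n-bit register (n > 55)
    return 2 ** (n_bits - MAX_OPERAND_BITS)

def pipelined_accumulation_cycles(n):
    # 5 pipeline stages + (n - 1) further issue cycles + 5 final-add cycles
    return 5 + (n - 1) + 5
```

With a 64-bit pseudo-sum this gives 512 guaranteed-safe accumulations, and a full-length 512-element accumulation then takes 521 cycles.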
STAP is a problem for which configurable accelerators are often explored, so a high-performance, FPGA-based inner-product unit would form the core of such an accelerator.

A. Conjugate Gradient Method: Background

The conjugate gradient method is based on the idea of minimizing the function

f(x) = (1/2) x^T A x - b^T x   (5.2)

The function is minimized when its gradient ∇f = A x - b is zero, which is exactly the solution of (5.1).

Table 4: Number of each kind of operation in Equations (5.4)-(5.7). "ip" stands for inner product, "Div" for division, "Sub" for subtraction; "a+b" and "c·a" are element-wise additions and scalar multiplications.

                     N-way ip    a+b    c·a    Div    Sub
Eq. 5.4: α_i         N + 2       -      -      1      -
Eq. 5.5: x_{i+1}     -           N      N      -      -
Eq. 5.6: g_{i+1}     N           -      -      -      N
Eq. 5.7: d_{i+1}     N + 1       N      1      -      -

We can use regression to find the minimizing x in (5.2). Set the initial conditions to be

x_0 = 0 (initial solution), d_0 = b (initial direction), g_0 = -d_0 (initial gradient)   (5.3)

and iteratively calculate the following steps:

α_i = -(g_i^T d_i) / (d_i^T A d_i)   (5.4)

x_{i+1} = x_i + α_i d_i   (5.5)

g_{i+1} = A x_{i+1} - b   (5.6)

d_{i+1} = -g_{i+1} + ((g_{i+1}^T A d_i) / (d_i^T A d_i)) d_i   (5.7)

until ||x_{i+1} - x_i|| < err   (5.8)

where err is a pre-specified error limit. As we can see from (5.4) to (5.7), each iteration is divided into four steps, and these four steps must be serialized, since each step relies on full knowledge of the previous step's result. Each step has significant parallelism, however. The number of each kind of calculation required per step is summarized in Table 4. Wherever N appears in a table entry, the corresponding operation has N-fold parallelism. The above analysis shows that using the conjugate gradient method to solve a system of linear equations is potentially a good candidate for highlighting the utility of FPGA computation with our specially designed accumulation units. Next, we present a configurable computing system that implements this application.

B. Proposed Architecture

Current configurable computing systems can be roughly categorized into closely-coupled and loosely-coupled architectures. In closely-coupled architectures, the configurable hardware is part of a microprocessor and is generally treated as a special functional unit (SFU).
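For reference, the iteration of Eqs. (5.3)-(5.8) can be sketched in pure Python (illustrative only; function names ours, and the sign convention follows our reading of (5.4)). In the proposed system, each inner product below is exactly the work mapped onto the MAC units:

```python
# Software sketch of the conjugate gradient iteration, Eqs. (5.3)-(5.8).
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def conjugate_gradient(A, b, err=1e-10, max_iter=1000):
    n = len(b)
    x = [0.0] * n                   # x_0 = 0    (initial solution)
    d = list(b)                     # d_0 = b    (initial direction)
    g = [-bi for bi in b]           # g_0 = -d_0 (initial gradient)
    for _ in range(max_iter):
        Ad = matvec(A, d)                           # N inner products
        dAd = dot(d, Ad)
        if dAd == 0.0:                              # direction vanished: done
            break
        alpha = -dot(g, d) / dAd                    # Eq. (5.4)
        step = [alpha * di for di in d]
        x = [xi + si for xi, si in zip(x, step)]    # Eq. (5.5)
        g = [gi - bi for gi, bi in zip(matvec(A, x), b)]   # Eq. (5.6)
        beta = dot(g, Ad) / dAd
        d = [-gi + beta * di for gi, di in zip(g, d)]      # Eq. (5.7)
        if max(abs(si) for si in step) < err:       # Eq. (5.8)
            break
    return x
```

Every line of the loop body is either an inner product or an element-wise update built from them, which is what makes the method such a good match for an accelerator organized around inner-product MAC units.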
The SFU behaves similarly to other functional units on chip; it also takes two operands and produces a result for each operation. This architecture, however, is not suitable for our application, for several reasons. First of all, all state-of-the-art microprocessors have floating-point multipliers and adders. Generally these functional units are pipelined and run at very high speed; some microprocessors can even use them to form two MAC units, like the HP PA-RISC. Our SFU, on the other hand, is much slower. What is more, as we can see from the previous discussion, it takes a lot of hardware resources to implement a MAC unit in an FPGA, which is beyond the capability of most current closely-coupled configurable computing systems. For those that do have this capability, the number of SFUs that can be implemented is very limited, too, so the parallelism in the algorithms can hardly be exploited. Loosely-coupled configurable computing systems, on the other hand, can provide plenty of hardware resources for taking advantage of the massive parallelism in applications. In this architecture, we normally have one host machine and multiple FPGA boards connected to the I/O bus, with 10-60 FPGAs and associated memory on each board. The host machine configures and initializes each FPGA board and is responsible for communication with other machines, while the FPGA boards take care of the major calculation. The host machine starts the operation on the FPGA boards after initialization, and an FPGA board signals the host machine by interrupt when its job is done. For our design, each FPGA board will be used to implement several MAC units, and will also contain some local memory. We use floating-point accumulators and multipliers to form MAC units. Since the size of the matrix A is generally known before computation, we choose the floating-point accumulator with compiler-managed overflow avoidance described in Section IV. On a separate FPGA, we have an 8-stage pipelined floating-point multiplier; this design can run at 33 MHz on 2 Xilinx parts. Thus we can form a MAC unit from two floating-point multipliers and one floating-point accumulator, with the accumulator taking in the products alternately from each of the two multipliers every clock cycle (Fig. 8). In this way, the MAC unit can run at the full speed of the accumulator, 66 MHz.

Fig. 8 A floating-point MAC: the accumulator takes in operands alternately from the two 33 MHz multipliers through a MUX. The whole MAC runs at 66 MHz.

Local memory on the FPGA board helps manage the I/O data flow from the host machine to the FPGA board. It holds the data of matrix A and the results of each iteration, including α_i, x_{i+1}, g_{i+1}, and d_{i+1}, and also provides space for intermediate results. When solving more than one system of linear equations simultaneously, it would be more efficient to have two memory chips. This would allow one chip to serve the computation while the other downloads a new system of linear equations from the host machine; the memories would swap roles once the computation is done. The communication latency between the host machine and each FPGA board would thus be totally hidden, except for the first transfer.

C. Performance of the FPGA-based Conjugate Gradient Accelerator

Suppose we have L (L > 2) MAC units on an FPGA board. This FPGA board is thus capable of processing all the multiplications and accumulations in (5.4)-(5.8), except for the division, which is handled by the CPU via an interrupt. To simplify the calculation, we assume the compiler guarantees N is less than 512, so we do not have to worry about breaking long inner products into small pieces. We also assume each MAC unit is fed from local memory at its optimum bandwidth of two 32-bit operands at 66 MHz; with L MAC units on the FPGA board, the required local memory bandwidth is thus 528 × L megabytes per second. Before going into the calculation details, we first define K-ip time as the number of clock cycles needed for our MAC unit to compute the inner product of two vectors of length K.
In our case, we need 8 + 5 + (K-1) + 5 = K + 17 cycles to complete such a computation, where 8 is the number of pipeline stages in the multiplier, 5 is the number of pipeline stages in the accumulator, K-1 is the vector length minus 1, and the final 5 accounts for the combinational delay of the final addition. We can then calculate the latency of one iteration of the computation as follows:

1) Latency for step 1: computing α_i = -(g_i^T d_i) / (d_i^T A d_i)

In this step (5.4), we first calculate d_i^T A. The calculation consists of N inner products, which requires ⌈N/L⌉ N-ip times. Since d_i^T A is a 1 x N vector, (d_i^T A) d_i is simply one inner product of length N, and so is g_i^T d_i. Thus we can compute (d_i^T A) d_i and g_i^T d_i simultaneously within one N-ip time. In all, we need (⌈N/L⌉ + 1) N-ip times for the inner

products, as well as one interrupt for the division. Since d_i^T A d_i will be used later as a divisor in step 4, we can save one interrupt by sending back both α_i and 1/(d_i^T A d_i).

2) Latency for step 2: computing x_{i+1} = x_i + α_i d_i

In this step (5.5), we have to compute each element of x_{i+1}, and there are N such computations. For element j of x_{i+1}, we have x_{i+1,j} = x_{i,j} + α_i d_{i,j}. This can be computed as an inner product of vector length two. Although an inner product of size two is not at all economical on our MAC unit, it is far better than sending these numbers back to the host machine for computation and returning the results, which would incur a high communication-time penalty. Thus we need ⌈N/L⌉ 2-ip times for this step.

3) Latency for step 3: computing g_{i+1} = A x_{i+1} - b

In this step (5.6), we also have to compute each element of g_{i+1}, and there are N such computations. Since b is fixed across iterations, we can store b in the memory beforehand. Thus for element j of g_{i+1}, we have g_{i+1,j} = (-b_j) + Σ_k A_{j,k} x_{i+1,k}. This can be computed as an inner product of vector length N+1, so we need ⌈N/L⌉ (N+1)-ip times for this step.

4) Latency for step 4: computing d_{i+1} = -g_{i+1} + ((g_{i+1}^T A d_i) / (d_i^T A d_i)) d_i

In this step (5.7), we need (⌈N/L⌉ + 1) N-ip times to calculate g_{i+1}^T A d_i, for the same reason stated in step 1. Since 1/(d_i^T A d_i) is available from step 1, we need one multiplication time for g_{i+1}^T A d_i × (1/(d_i^T A d_i)). And we need another ⌈N/L⌉ 2-ip times to get the result, for the same reason stated in step 2. In all, we need (⌈N/L⌉ + 1) N-ip times + ⌈N/L⌉ 2-ip times + 1 multiplication time for this step.

5) Overall Performance

By summing the time needed in each step, we get the time for one iteration of the computation. In all, it takes 2(⌈N/L⌉ + 1) N-ip times + ⌈N/L⌉ (N+1)-ip times + 2⌈N/L⌉ 2-ip times + 1 multiplication time + 1 interrupt time per iteration.
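Taking one multiplication time as 8 cycles and one interrupt as 500 cycles, the per-iteration total above can be written out directly (function names ours):

```python
def ip_time(k):
    # K-ip time: 8 multiplier stages + 5 accumulator stages
    # + (k - 1) further issue cycles + 5 final-addition cycles = k + 17
    return 8 + 5 + (k - 1) + 5

def iteration_cycles(n, l, mult_time=8, interrupt_time=500):
    c = -(-n // l)                        # ceil(N / L)
    return (2 * (c + 1) * ip_time(n)      # steps 1 and 4: N-ip work
            + c * ip_time(n + 1)          # step 3: (N+1)-ip work
            + 2 * c * ip_time(2)          # steps 2 and 4: 2-ip work
            + mult_time + interrupt_time)
```

This reproduces the closed form (3N + 90)⌈N/L⌉ + 2N + 542 derived below; for example, iteration_cycles(100, 2) gives 20,242.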
If we plug in the N-ip and 2-ip times, take 8 cycles as one multiplication time, and assume one interrupt takes 500 cycles, then we need (3N + 90)⌈N/L⌉ + 2N + 542 cycles per iteration. Fig. 9 shows the cycle counts for different problem sizes and numbers of MAC units. From this chart we can see that, as the problem size grows, the number of cycles needed becomes inversely proportional to the number of MAC units L; this is because the cost of the interrupt becomes trivial in the overall latency when N gets large.

Fig. 9 Cycle count for one iteration (number of cycles needed, up to 120,000) versus problem size N (0 to 300), for L = 2, 4, and 8 MAC units.

It is worth mentioning here that the major computation in checking for convergence (5.8), ||x_{i+1} - x_i||, is also an inner-product operation. From (5.5) we know that x_{i+1} - x_i = α_i d_i, each element of which is already calculated by the multiplication process in step 2. Thus we just have to save these elements of α_i d_i in memory and calculate the norm later. A scheduling that definitely puts no extra latency on the system is to calculate it during the last N-ip time of step 1 and check whether ||x_{i+1} - x_i|| < err during the interrupt.

VI. DISCUSSION

In this section we discuss some of the issues raised by our delayed addition technique, particularly with respect to floating-point calculations. The IEEE floating-point standard [1] specifies the format of a floating-point number of either single or double precision, as well as rounding operations and exception handling. It provides us with both a representation standard and an operation standard. The representation standard is helpful for transporting code from one system to another. The operation standard, together with the representation standard, works to ensure that the same result can be expected for floating-point calculations on different platforms (if they all choose the same rounding scheme). Our designs, described in Sections III, IV, and V, abide by the representation portion of the IEEE standard.
Our approaches implement single-precision IEEE floating-point, including denormalized numbers. Our operations do reorder computations, however, and make the basic assumption that

the additions performed are commutative. This assumption is also routinely made by most current-generation microprocessors, where out-of-order execution likewise assumes that floating-point operations on independent registers are commutative. Similarly, many compiler optimizations geared at scientific code assume commutativity; optimizations such as loop interchange and loop fusion reorder computations as a matter of course. Where users are concerned, error theory for computations can be used to determine to what extent such reordering is safe [4,5].

VII. RELATED WORK

This paper touches on areas related to both computer arithmetic and configurable computing. Hennessy and Patterson provide an overview of computer arithmetic in Appendix A of [6]. They concentrate on the logic principles of various designs of basic arithmetic components. In addition, innumerable books and papers go into more detail on adder and multiplier designs at the transistor level. For example, Weste and Eshraghian [7] provide a detailed discussion of various multiplier implementations and Wallace trees. Examples of recent full-custom multiplier designs can be found in the work by Ohkubo et al. [8] and Makino et al. [9]. We can also find recent multiplier designs in state-of-the-art microprocessors such as the DEC Alpha 21164 [15] and the SUN UltraSPARC [16]. These designs, as with most, use Booth encoders and Wallace trees. Our approach employs the idea used in Wallace trees. C. S. Wallace used 3-2 adders to build the first Wallace tree [10]. There have been many derivatives since then. The most important change is the use of 4-2 adders in place of the 3-2 adders of the original implementation. In many designs, pass transistors rather than full CMOS logic gates are used to build the 4-2 adders to improve circuit speed, as we can see in Heikes et al. [11]. As early as 1994, Canik et al. [3] discussed how to map a bit-array multiplier onto Xilinx FPGAs.
They built an 8 x 8 bit-array multiplier for integer multiplication with the Xilinx 3000 series, and their fully pipelined implementation on an XC3190-3 achieved more than 100 MHz. By now, almost all major FPGA vendors provide implementations of integer multipliers or multiply-accumulators of 16-bit or shorter length [22, 24]. A comparison of the speed of these implementations can be found in [23]. Most of them are based on bit-serial or bit-array multipliers. However, a bit-array multiplier has too big an area cost for long integer multiplication. We know of no implementations of a 32-bit bit-array multiplier, nor have we seen any implementations of 32-bit integer multipliers or multiply-accumulators based on bit-serial or other algorithms. More recent work has examined implementing floating-point units in FPGAs [2,3,13,14]. Louca et al. [2] present approaches for implementing IEEE-standard floating-point addition and multiplication in FPGAs; they used a modified bit-serial multiplier and prototyped their designs on Altera FLEX8000s. Ligon et al. [14] also discuss the implementation of IEEE single-precision floating-point multiplication and addition, and they assessed the practicability of several of their designs on Xilinx 4000 series FPGAs. Shirazi et al. [13] discuss limited-precision floating-point arithmetic; they adapted the IEEE standard for limited-precision computations such as FFTs in DSP. At the application level, there are many papers about Space-Time Adaptive Processing (STAP). Some background can be found in Gupta [18]. In addition, Smith et al. [19] discuss using the conjugate gradient method for STAP, and Gupta [18] discusses the possibility of using configurable hardware for STAP. Finally, several authors have discussed rounding and error theory [4,5,12]. Taylor [4] shows how error is treated in physics; Wilkinson [5] and Heikes et al. [12] touch on rounding error in MAC and inner-product units.

VIII.
CONCLUSIONS

Within many current FPGAs, carry propagation represents a significant bottleneck that impedes implementing truly high-performance pipelined adders, multipliers, or multiply-accumulate (MAC) units within configurable designs. This paper describes a delayed addition technique for improving the pipelined clock rate of designs that perform repeated pipelined calculations. Our technique draws on Wallace trees to accumulate values without performing a full carry propagation; Wallace trees are universally used within the multiply units of high-performance processors. The unique nature of configurable computing allows us to apply these techniques not simply within a single calculation, but across entire streams of calculations. We have demonstrated the significant leverage of our approach by presenting three designs exemplifying both integer and floating-point calculations. The designs operate at pipelined clock rates between 33 and 66 MHz. We have also used our designs in a real application: solving a system of linear equations using the conjugate gradient method. These techniques and applications should help to broaden the space of integer and floating-point computations that can be customized for high-performance execution on current FPGAs.

REFERENCES

[1] IEEE Standards Board. IEEE Standard for Binary Floating-Point Arithmetic. Technical Report ANSI/IEEE Std. 754-1985, Institute of Electrical and Electronics Engineers, New York, 1985.