High-Performance Floating Point Divide


Albert A. Liddicoat and Michael J. Flynn
Computer Systems Laboratory, Stanford University, Stanford, CA 94305
liddicoat@stanford.edu and flynn@umunhum.stanford.edu

Abstract

In modern processors floating point divide operations often take 20 to 25 clock cycles, five times that of multiplication. Typically multiplicative algorithms with quadratic convergence are used for high-performance divide. A divide unit based on the multiplicative Newton-Raphson iteration is proposed. This divide unit utilizes the higher-order Newton-Raphson reciprocal approximation to compute the quotient fast, efficiently and with high throughput. The divide unit achieves fast execution by computing the square, cube and higher powers of the approximation directly and much faster than the traditional approach with serial multiplications. Additionally, the second, third, and higher-order terms are computed simultaneously, further reducing the divide latency. Significant hardware reductions have been identified that reduce the overall computation significantly and therefore reduce the area required for implementation and the power consumed by the computation. The proposed hardware unit is designed to achieve the desired quotient precision in a single iteration, allowing the unit to be fully pipelined for maximum throughput.

1 Introduction

Division can be expressed as the product of the dividend, $a$, and the reciprocal of the divisor, $q = a \cdot (1/b)$. Multiplicative techniques such as Newton-Raphson and series expansion algorithms are often used to compute the reciprocal for high-performance division [10]. The IBM PowerPC and Power2 processors use Newton-Raphson algorithms to implement divide and square root. The AMD K7 [9] and IBM Power3 [2] processors use algorithms based on series expansion for both divide and square root.

Typically, the first-order Newton-Raphson iteration with quadratic convergence is used. The first-order Newton-Raphson iteration requires two dependent multiplications per iteration. Rabinowitz [11] extended the Newton-Raphson reciprocal recurrence to include higher-order polynomials. The convergence of the higher-order iteration is $\epsilon_{i+1} = \epsilon_i^{k+1}$, where $\epsilon_i$ is the error of the reciprocal approximation for iteration $i$ and $k$ is the order of the recurrence [5].

Series expansion algorithms are also used to compute the reciprocal using multiplicative iterations. The binomial series expansion technique, often called Goldschmidt's algorithm [6] [4], is based on the familiar Taylor series expansion of a function at a point. The binomial expansion algorithm requires two independent multiplications per iteration and provides quadratic convergence.

Recent work in the area of high-performance division has shown that higher-order iterations improve performance. Wong and Flynn [12] proposed a very-high radix division scheme that is based on look-up tables and Taylor series approximations for the reciprocal. Higher-order terms of the Taylor series are computed to increase the precision of successive quotient approximations. This approach offers linear convergence while retiring […] or more bits per iteration. Ito, Takagi, and Yajima [7] developed an accelerated higher-order Newton-Raphson division and square root algorithm suitable for implementation using a multiply-accumulate unit. This implementation accelerates the convergence of the higher-order iteration by using a lookup table to estimate the cube of an intermediate value. Ercegovac, Lang, Muller, and Tisserand [3] proposed a method to compute the reciprocal and other functions based on argument reduction and series expansion. This method uses tables and small multipliers to compute the terms of a series expansion. Small serial multipliers are used to compute the square and cube of an intermediate value.

A multiplicative divide unit based on the higher-order Newton-Raphson reciprocal approximation is proposed and analyzed. A parallel cubing unit proposed by Liddicoat and Flynn [8] exposes additional computational parallelism. The parallel cube computation is extendable to compute higher powers and thus further accelerate the convergence of the approximation. The proposed divide unit exploits the computation parallelism exposed by the parallel powering

units. Furthermore, by using higher-order approximations the desired precision may be obtained in a single iteration, allowing a fully pipelined implementation.

The various divide algorithms and implementations differ on several accounts. First, the inherent computational parallelism that allows latency reduction. Second, the subunit precision and the effect of the subunit precision on the latency, area, and power consumption required for the divide computation. Finally, the error convergence of the approximation determines whether it is feasible to compute the quotient to the desired precision in a single iteration.

This paper is organized as follows: sections 2 and 3 present the Newton-Raphson and binomial series expansion divide algorithms. In section 4 the higher-order Newton-Raphson divider and subunits are proposed. In section 5 significant hardware reductions applicable to the proposed architecture are presented and the final hardware configuration is discussed. In the remaining sections, the proposed divide unit is compared to alternate techniques and brief conclusions are presented.

Figure 1. NR divide: (a) 1st order, (b) 3rd order.

2 Newton-Raphson Divide Unit

The first-order Newton-Raphson reciprocal approximation with quadratic convergence is expressed as $X_{i+1} = X_i(2 - bX_i)$. The initial approximation, $X_0$, for the reciprocal $1/b$ is generally determined using a ROM lookup table before the first iteration begins. A fused multiply-subtract subunit may be used within the iteration to compute $(2 - bX_i)$ in a single operation. Therefore, each iteration requires two dependent multiplication operations. After the final iteration has completed, the quotient is determined by multiplying the dividend with the reciprocal of the divisor. Figure 1(a) illustrates the first-order Newton-Raphson divide unit. Each multiplication within the iteration is dependent on the result produced by the previous multiplication and must be computed serially. If $L$ iterations are required to achieve the desired quotient precision, then the latency of the divide unit implemented with a single multiplier is $t_{div} = t_{lookuptable} + 2L\,t_{mult} + t_{mult}$. Using two multipliers the total latency may be reduced by one multiplication if the final multiplication with the dividend is overlapped in the last iteration, since $q = (aX_i)(2 - bX_i)$. The latency for the first-order Newton-Raphson divide unit using two multipliers is $t_{div} = t_{lookuptable} + 2L\,t_{mult}$.

The generalized Newton-Raphson reciprocal iteration may be expressed as the following $k^{th}$ order iteration, $X_{i+1} = X_i(1 + (1 - bX_i) + (1 - bX_i)^2 + \cdots + (1 - bX_i)^k)$. Here $X_i$ is the $i^{th}$ approximation of the reciprocal of the divisor, $b$. Figure 1(b) shows a third-order Newton-Raphson divide unit designed using standard multiplication, addition, and subtraction units. The subtraction and additions may be fused with the multiplications as described previously without significantly increasing the multiplication latency. If a single multiplier is used and $L$ iterations are required for the desired quotient precision, then the latency of the third-order Newton-Raphson divide unit is $t_{div} = t_{lookuptable} + 4L\,t_{mult} + t_{mult}$. Due to the faster convergence, one iteration of the third-order divide unit reduces the reciprocal error by the same amount as two iterations of the first-order divide unit. The latency for one iteration of the third-order divide unit is also equivalent to the latency for two iterations of the first-order divide unit. As was the case with the first-order divide unit, the final multiplication with the dividend may be overlapped when two multipliers are available, reducing the latency by one multiplication. The latency of the divide unit using two multipliers is $t_{div} = t_{lookuptable} + 4L\,t_{mult}$. Again the latency for one iteration of the third-order divide unit is equivalent to the latency of two iterations of the first-order Newton-Raphson iteration. There is no benefit in using a higher-order iteration if full precision serial multiplications are used to compute the powers of $(1 - bX_i)$.

3 Division by Series Expansion

The disadvantage with the standard form of the Newton-Raphson divide is that the multiplications in the iteration are dependent and must be performed serially. Therefore, each iteration requires two or more serial multiplications. Binomial series expansion is another multiplicative division technique with quadratic convergence.
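The Newton-Raphson recurrences of section 2 can be checked numerically. The following is a minimal sketch in Python using exact rational arithmetic; the function names and the sample values of the divisor and initial estimate are illustrative assumptions, not part of the paper. It shows that one third-order step leaves the same residual as two first-order steps.

```python
from fractions import Fraction

def nr_reciprocal(b, x0, order=1, iterations=1):
    """Iterate X <- X * (1 + e + e^2 + ... + e^order), with e = 1 - b*X.

    order=1 reduces to the first-order form X * (2 - b*X).
    """
    b = Fraction(b)
    x = Fraction(x0)
    for _ in range(iterations):
        e = 1 - b * x                       # residual of the current estimate
        x = x * sum(e ** j for j in range(order + 1))
    return x

def residual(b, x):
    """Relative error of x as an approximation of 1/b, i.e. |1 - b*x|."""
    return abs(1 - Fraction(b) * x)

# Illustrative values (assumed): b = 1.5, initial estimate 0.625, residual 1/16.
b, x0 = Fraction(3, 2), Fraction(5, 8)
two_first_order = nr_reciprocal(b, x0, order=1, iterations=2)
one_third_order = nr_reciprocal(b, x0, order=3, iterations=1)
print(residual(b, two_first_order), residual(b, one_third_order))
```

Both residuals equal the fourth power of the starting residual, which is the sense in which the third-order step only pays off if its powers are computed faster than two serial full multiplications.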
The typical form of the series expansion recurrence is based on the Maclaurin series where $b = 1 + X$, as shown in equation 1.

$\frac{1}{b} = \frac{1}{1 + X} = 1 - X + X^2 - X^3 + \cdots$ (1)

After factoring equation 1 and multiplying by the dividend, the quotient, $q$, can be expressed by the multiplicative series shown in equation 2.

$q = a(1 - X)(1 + X^2)(1 + X^4)(1 + X^8)(1 + X^{16}) \cdots$ (2)

Each multiplication in equation 2 quadratically reduces the error in the quotient $q$ and is considered an iteration towards the final quotient. Here, $q_0 = a$, $q_1 = a(1 - X)$, and $q_{i+1} = q_i(1 + X^{2^i}) = q_i r_i$ for $i \ge 1$. Let $r_0 = (1 - X)$, $d_0 = (1 + X)$ and $d_{i+1} = d_i r_i$, then $r_i = (2 - d_i)$ for $i \ge 1$. A multiplication and subtraction must be performed to obtain the next factor $r_i$. Within each iteration both $d_i$ and $r_i$ must be computed, and a fused multiply-subtract cannot be used. Figure 2 shows a divide unit based on the iterative form of the binomial series expansion. The right side of the divide unit computes the next factor $r_{i+1}$ while the left side computes the quotient approximation $q_{i+1}$. The factor $r_{i+1}$ is independent of the quotient, $q_{i+1}$, computation and therefore the two multiplications may occur simultaneously.

Similarly to the Newton-Raphson division, a lookup table can be used to reduce the number of iterations required to obtain the desired precision. The first term $(1 - X)$, or the product of the first few terms $(1 - X)(1 + X^2)(1 + X^4) \cdots (1 + X^{2^m})$, is found in a ROM lookup table. Then $d_{m+1}$ is computed by multiplying the result returned from the lookup table by $(1 + X)$. The initial quotient approximation $q_{m+1}$ is computed by multiplying the dividend, $a$, by the result returned from the lookup table such that $q_{m+1} = a(1 - X)(1 + X^2)(1 + X^4) \cdots (1 + X^{2^m})$. These two multiplications are also independent and may occur in parallel. Then the iterations continue as before.

The latency of the binomial expansion divide algorithm depends on how many multipliers are used and the number of iterations, $L$, required to obtain the desired quotient precision. If one multiplier is used, then the divide unit latency is $t_{div} = t_{lookuptable} + 2L\,t_{mult} + t_{mult}$. Here, the subtract must be overlapped with the quotient multiplications. If two multipliers are used then the latency reduces to $t_{div} = t_{lookuptable} + t_{mult} + L(t_{\text{2's comp}} + t_{mult})$. Interestingly, if a single multiplier is used the binomial expansion divide unit latency is equivalent to that of the Newton-Raphson divide unit. However, if two multipliers are used then the latency of the binomial expansion divider is reduced by approximately $L(t_{mult} - t_{\text{2's comp}}) - t_{mult}$.

It has been shown that the first-order Newton-Raphson algorithm and the binomial expansion algorithms are equivalent when $X_0 = 1$. In fact, they are two different ways of expressing the same computation. Both algorithms require the same number and type of operations. However, due to the way each algorithm is expressed, the multiplications in the Newton-Raphson iteration are dependent and must be performed serially, while the two multiplications in the binomial expansion are independent and can be performed in parallel.

Figure 2. Binomial expansion divide unit.

4 Proposed Divide Architecture

The divide unit may compute the quotient directly and need not iterate to a solution. The error in the initial approximation and the convergence of the computation must be sufficient to guarantee the desired quotient precision will be achieved with one computation. This joint constraint implies that there is a tradeoff between the lookup table size and the computational complexity of the algorithm. Next, we express the quotient directly as the product of the dividend and the $k^{th}$-order Newton-Raphson reciprocal approximation, $q = aX(1 + (1 - bX) + (1 - bX)^2 + \cdots + (1 - bX)^k)$. Here $X$ is the initial estimation of $1/b$, generally found in a lookup table with error $\epsilon_{lookuptable}$. The quotient error is expressed as $\epsilon_q = \epsilon_{lookuptable}^{k+1}$. The $k^{th}$ order approximation increases the number of bits of precision of $X$ by a factor of $k+1$. Therefore, in order to compute an $n$-bit reciprocal in a single iteration the precision of the initial approximation must be $\frac{n}{k+1}$ bits. The lookup table size must be $2^{n/(k+1)} \times \frac{n}{k+1}$ bits. For example, the third-order Newton-Raphson approximation would require a $2^{n/4} \times \frac{n}{4}$ bit lookup table.
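The quotient error relation for the direct $k^{th}$-order form follows from a finite geometric sum. A short derivation, writing $e = 1 - bX$ (the symbol $e$ is introduced here only for brevity and is not the paper's notation):

```latex
\begin{align*}
bX \sum_{j=0}^{k} e^{j} &= (1 - e)\sum_{j=0}^{k} e^{j} = 1 - e^{k+1}, \\
\frac{a}{b} - aX \sum_{j=0}^{k} e^{j}
  &= \frac{a}{b}\left(1 - bX \sum_{j=0}^{k} e^{j}\right)
   = \frac{a}{b}\, e^{k+1}.
\end{align*}
```

So a table estimate whose relative error is below $2^{-p}$ yields a quotient with relative error below $2^{-p(k+1)}$. As a worked instance of the idealized sizing formula, $n = 24$ and $k = 3$ give $p = 6$ and a $2^{6} \times 6 = 384$ bit table, before any allowance for truncation in the subunits.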

Figure 3 illustrates the hardware structure required to implement the proposed higher-order divide unit. The latency of the divide unit is $t_{div} = t_{lookuptable} + 3\,t_{mult}$. Here, it is assumed that the powers of $(1 - bX)$ may be computed directly, in parallel, and faster than a full precision multiplication. Liddicoat and Flynn [8] propose a parallel cubing unit and describe a parallel squaring unit suitable for the proposed higher-order divide architecture. These powering units are described in more detail in subsection 4.3. The proposed divide unit may be fully pipelined if two small multipliers, the powering units, and one full multiplier are used. Small multipliers are used to compute $(1 - bX)$ and $(aX)$ since $X$ is approximately $\frac{n}{k+1}$ bits in length. A third-order divide unit would be constructed out of two $\frac{1}{4}$ size multipliers, one squaring unit, one cubing unit, and one full multiplier. A 24-bit implementation of the proposed third-order divide unit is presented and discussed in detail in the following subsections.

Figure 3. Proposed higher-order NR divide.

4.1 Lookup table

An initial approximation for the reciprocal of the divisor is determined by table lookup. The $j+1$ most significant bits of the divisor are used to index the lookup table and the $m+1$ most significant bits of the reciprocal approximation are returned from the lookup table. A $2^j \times m$ bit ROM with $j$ address bits and $m$-bit word size is used for the lookup table since the most significant bit is constant for normalized IEEE floating point operands [1].

In order to eliminate the need to represent negative numbers in the computation, the lookup table must be programmed such that the $(1 - bX)$ fused multiply-subtract always produces a positive result. Furthermore, if the result of the $(1 - bX)$ computation is positive then the computed reciprocal will be equal to or less than the true reciprocal. This can be demonstrated by realizing that the exact reciprocal can be computed using an infinite-order Newton-Raphson approximation. If $(1 - bX)$ is positive then the exact reciprocal is proportional to an infinite sum of decreasing positive terms and a finite-order approximation is the truncation of this infinite series.

To guarantee that $(1 - bX) \ge 0$, the lookup table must return a value that when multiplied by $b$ is always less than or equal to one. This can be accomplished by programming the table entries such that the value stored in each lookup table address is less than the reciprocal for all possible values of $b$ that map to that particular address location. Let $b_{trunc}$ be equivalent to the divisor truncated to $j+1$ bits. To determine the value to store in each address of the lookup table we first add $2^{-j}$ to $b_{trunc}$ so that $b_{trunc} + 2^{-j} > b$ for any possible $b$. The reciprocal of $b_{trunc} + 2^{-j}$ is then computed, noting that $\frac{1}{b_{trunc} + 2^{-j}} < \frac{1}{b}$. Finally the result of $\frac{1}{b_{trunc} + 2^{-j}}$ is truncated to $m+1$ bits. All of the table entries are of the form $0.1xxx \ldots xxx$. The first digit to the right of the radix point is always one and therefore does not need to be stored in the lookup table. If the lookup table is programmed according to this procedure, then it will always return an approximation less than $\frac{1}{b}$ and the result of the $(1 - bX)$ fused multiply-subtract operation is always positive. The values to be programmed in each address of the lookup table are a function of the address, the number of bits used to address the lookup table, $j$, and the number of output bits from the table, $m$.

To determine the best lookup table size we exhaustively simulated several table sizes using full precision computations. The results from these simulations are shown in table 1. The table size of […] bits was selected for the reciprocal approximation since the maximum error was less than 0.5 ulp (unit in the last place). Furthermore, the number of leading zeros to the right of the radix point in the $(1 - bX)$ computation is guaranteed to be six or more when using a lookup table of this size.

Table 1. Lookup table sizes for 24-bit operand. Columns: address bits, word bits, table size, leading 0s, error (ulps).
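The table-programming procedure of section 4.1 can be sketched as follows. This is a minimal illustration in Python with exact rationals, not the authors' tool: the function names are hypothetical, and the 7 address bits and 8 stored word bits are assumed values chosen only for the example. Because $bX$ increases with $b$ for a fixed entry, checking the two ends of each address interval covers every 24-bit divisor that maps to it.

```python
from fractions import Fraction
from math import floor

def build_table(addr_bits, word_bits):
    """One entry per address: 1/(b_trunc + 2^-addr_bits), truncated to word_bits+1 fraction bits."""
    step = Fraction(1, 1 << addr_bits)
    scale = 1 << (word_bits + 1)            # leading fraction bit is always 1 and need not be stored
    table = []
    for addr in range(1 << addr_bits):
        b_trunc = 1 + addr * step           # divisor in [1, 2) truncated to the address bits
        upper = b_trunc + step              # strictly above every divisor mapping to this address
        table.append(Fraction(floor(scale / upper), scale))
    return table

def worst_case(table, addr_bits, frac_bits=23):
    """Return (smallest residual 1 - b*X, fewest leading zero bits of 1 - b*X) over all addresses."""
    step = Fraction(1, 1 << addr_bits)
    ulp = Fraction(1, 1 << frac_bits)
    residuals, zero_counts = [], []
    for addr, x in enumerate(table):
        b_lo = 1 + addr * step
        b_hi = b_lo + step - ulp            # largest representable divisor for this address
        residuals.append(1 - b_hi * x)      # smallest residual in the interval
        e_max = 1 - b_lo * x                # largest residual in the interval
        zeros = 0
        while e_max < Fraction(1, 1 << (zeros + 1)):
            zeros += 1
        zero_counts.append(zeros)
    return min(residuals), min(zero_counts)

table = build_table(addr_bits=7, word_bits=8)        # assumed sizes, for illustration only
smallest_residual, fewest_zeros = worst_case(table, addr_bits=7)
print(smallest_residual > 0, fewest_zeros)
```

Rounding every entry down is what keeps $(1 - bX)$ non-negative, at the cost of a slightly larger worst-case residual than round-to-nearest entries would give.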

4.2 Computing $(1 - bX)$

The first arithmetic operation that follows the lookup table access is the $(1 - bX)$ fused multiply-subtract. Since $X$ is approximately $\frac{n}{k+1}$ bits, a small multiplier is used. The divisor $b$ is in the normalized IEEE single precision format. Therefore, $b_{23} = 1$ and bits $b_{22}$ through $b_0$ depend on the value stored in $b$. In order to compute the result of $(1 - bX)$, $b$ is sign extended with eight leading zeros and then the two's complement of $b$ is taken to produce $-b$. The sign extended $-b$ is represented by 8 leading 1s, then one zero, followed by the complement of bits $b_{22}$ to $b_0$, and an additional one is added to bit $b_0$ to complete the two's complement. The bits from $X$ are used to select partial products of $-b$. Lastly the constant 1.0 is added to the partial product array (PPA) to account for the 1 that $bX$ is subtracted from to form $(1 - bX)$. Figure 4 illustrates the hardware unit required to calculate $(1 - bX)$.

Recall that exhaustive simulation indicates that the leading six bits to the right of the radix point will always be zero. These bits do not have to be computed, further reducing the hardware needed for the $(1 - bX)$ computation. The boxed area in figure 4 indicates the columns in the PPA that need to be summed to compute the 24 most significant bits of $(1 - bX)$. The PPA can be reduced using a Wallace tree structure in four CSA delays. The area required to implement the PPA for the $(1 - bX)$ unit is approximately […]% of the size of a 24-bit direct multiplier partial product array.

Figure 4. $(1 - bX)$ fused multiply-subtract unit.

4.3 Computing the powers of $(1 - bX)$

A squaring circuit that computes the square of a 24-bit operand 5% faster with slightly less than half the area required to perform a 24-bit direct multiplication is used to compute $(1 - bX)^2$. Figure 5 illustrates the squaring unit partial product array reduction. Similarly the cube may be computed directly and concurrently with the square using the cubing unit proposed by Liddicoat and Flynn [8].

Figure 5. Squaring unit PPA reduction using 2 […].

Figure 6 shows the partial product array required to compute the parallel cube for a 4-bit operand. Figure 6 also identifies three reductions that may be applied to reduce the size of the PPA. The terms from the 1st reduction have a weight of one while terms from the 2nd and 3rd reductions have a weight of three. The terms with a weight of three are summed in carry save fashion using a Wallace tree. Then the three times multiple is computed and summed with the 1X terms using a carry free addition stage. The precise cube of the 24-bit input produces a 72-bit result. The exact cube is not needed to achieve the desired quotient precision for the divide unit. In section 5, truncations of the reduced cube PPA are studied. These reductions not only decrease the area requirement but also the latency of the cube operation. The reduced parallel cube is approximately 6% faster than computing the cube using direct multipliers and requires only about […]% of the area that is required by a single $n$-bit direct multiplier. The parallel cubing unit is easily extendable to compute higher-order powers of $(1 - bX)$. Using higher-order powers of $(1 - bX)$ will accelerate the convergence of the Newton-Raphson approximation and reduce the precision needed by the initial reciprocal estimate.

4.4 Computing the Final Multiplications

The multiplication of $aX$ is a small multiplication since $X$ is approximately $\frac{n}{k+1}$ bits. The multiplication area is about […]% that of a full 24-bit direct multiplication. A slow area efficient multiplier may be used since the result of this computation is not needed until after the powers of $(1 - bX)$ have been computed. The final multiplier computes the product of $(aX)$ with the sum of powers of $(1 - bX)$ to produce the final quotient. This is the only full multiplication that is required by the proposed divide unit.
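The data flow of sections 4.1 through 4.4 can be followed end to end in a few lines. The sketch below is a behavioral model in Python with exact rationals, not a description of the hardware: `initial_estimate` is a hypothetical stand-in for the lookup table (the reciprocal rounded down to an assumed 7 fraction bits), and no subunit truncation is modeled.

```python
from fractions import Fraction
from math import floor

def initial_estimate(b, bits=7):
    """Hypothetical stand-in for the lookup table: 1/b rounded down to `bits` fraction bits."""
    scale = 1 << bits
    return Fraction(floor(scale / Fraction(b)), scale)

def divide_single_pass(a, b, order=3, bits=7):
    """Return (quotient estimate, residual e) for one pass through the subunits."""
    a, b = Fraction(a), Fraction(b)
    x = initial_estimate(b, bits)                      # table lookup
    e = 1 - b * x                                      # small fused multiply-subtract
    powers = [e ** j for j in range(2, order + 1)]     # squaring, cubing, ... (concurrent in hardware)
    s = 1 + e + sum(powers)                            # sum of powers of (1 - bX)
    ax = a * x                                         # small multiply, overlapped with the powering
    return ax * s, e                                   # the only full multiplication

a, b = Fraction(3), Fraction(7, 4)                     # illustrative operands (assumed)
q, e = divide_single_pass(a, b)
print(float(q), float(a / b - q))
```

Since the estimate never exceeds the true reciprocal, the residual is non-negative and the computed quotient approaches the exact value from below, with a shortfall of $(a/b)\,e^{4}$ in the third-order case.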

Figure 6. Parallel cubing unit PPA reductions (1), (2), and (3).

5 The Final Hardware Configuration

The number of bits in a multiplier PPA increases by the square of the operand length, $n^2$, while the squaring unit area grows by $\frac{1}{2}n^2$ and the cubing unit grows by $\frac{1}{6}n^3$ [8]. Therefore, effort must be made to minimize the intermediate operand length and the required output precision from each subunit. Significant reductions in the hardware area required to implement the subunits are presented in this section.

The divide unit has been exhaustively simulated to determine the maximum error in the reciprocal computation for various truncations of the cubing unit PPA. Figure 7 indicates the reciprocal maximum error in terms of ulps for varying truncations of the cubing unit PPA. A sharp knee exists in the curve when 6[…] or more of the least significant columns have been truncated. Only the eight most significant columns of the cubing unit PPA are required to achieve an error of less than 0.5 ulps.

Figure 7. The reciprocal error versus the cubing unit PPA column truncation (24-bit). Axes: reciprocal maximum error (ulps) versus cube PPA columns truncated.

Additionally, the divide unit has been exhaustively simulated to determine the maximum error in the reciprocal for various truncations of the squaring unit given that 59, 6[…] and 6[…] columns have been truncated from the cubing unit PPA. Similarly, a sharp knee exists in the curve plotting the reciprocal error versus the number of least significant columns truncated from the squaring unit. Truncating up to 9 columns from the squaring unit PPA does not significantly increase the reciprocal error. The reciprocal accuracy is less than 0.5 ulps for the design points listed in Table 2. Design […] was selected since the maximum PPA height and number of bits in the PPA are minimized. Furthermore, Design […] achieves the smallest maximum error. Therefore, the least significant […] columns of the squaring unit PPA and the least significant 6[…] columns in the cubing unit PPA may be truncated. The PPA area for the squaring unit is less than 5% of the size of a 24-bit direct multiplier while the PPA area for the cubing unit is less than […]% of the size of a 24-bit direct multiplier.
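The effect of truncating least significant PPA columns can be illustrated on a squaring array. The sketch below is a simplified model in Python and rests on a stated assumption: it uses the plain, unfolded AND array with product bit $x_i x_j$ in column $i + j$, not the reduced (folded) squaring and cubing arrays of [8], so its column counts do not correspond to the design points in this section. The function names are hypothetical.

```python
def truncated_square(x, n, dropped_columns):
    """Sum the product bits x_i * x_j * 2^(i+j) of an n-bit operand, skipping low columns."""
    bits = [(x >> i) & 1 for i in range(n)]
    total = 0
    for i in range(n):
        for j in range(n):
            if bits[i] and bits[j] and i + j >= dropped_columns:
                total += 1 << (i + j)
    return total

def truncation_error_bound(dropped_columns):
    """Largest possible dropped sum when dropped_columns <= n.

    Column c (c < n) holds c + 1 product bits of weight 2^c, and
    sum_{c < t} (c + 1) * 2^c = (t - 1) * 2^t + 1.
    """
    t = dropped_columns
    return (t - 1) * (1 << t) + 1

n, t = 8, 6                                   # assumed sizes, for illustration only
worst = max(x * x - truncated_square(x, n, t) for x in range(1 << n))
print(worst, truncation_error_bound(t))
```

The bound grows roughly as $t \cdot 2^{t}$, which is why the reciprocal error stays flat until the truncated columns approach the weight of the last retained result bit and then rises sharply.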
Since the squaring and cubing units have been significantly reduced, the latency of these units is less than that of a single multiplier. In fact the cube computation is 6% faster than can be computed using serial multipliers.

Finally, the divide unit has been exhaustively simulated to determine the maximum error in the reciprocal for various truncations of the $(1 - bX)$ multiply unit given that the cubing unit PPA has been truncated by 6[…] columns and the squaring unit by […] columns. The least significant three columns of the $(1 - bX)$ multiply-subtract unit may be truncated while maintaining an error of less than 0.5 ulps. A maximum reciprocal error of 0.496 ulps is achieved with the cubing unit truncated by 6[…] columns, the squaring unit truncated by […] columns, and the $(1 - bX)$ multiply-subtract unit truncated by […] columns.

Table 2. Reciprocal error for truncated units. Columns: cube truncation, square truncation, error (ulp), PPA height, PPA bits.

Let's re-examine the third-order Newton-Raphson reciprocal approximation $\frac{1}{b} \approx X(1 + (1 - bX) + (1 - bX)^2 + (1 - bX)^3) = XS$, where $S = 1 + (1 - bX) + (1 - bX)^2 + (1 - bX)^3$. In figure 8 the bit fields for each of the four components of $S$ have been aligned. In this figure the X's represent bits that will be computed by columns in the PPA for each unit and the T's indicate columns that may be truncated from the PPA for each unit. From this diagram it is clear that most of the bits from the $(1 - bX)$ multiply-subtract unit will contribute to the 24 most significant bits of $S$, while only about 1/[…] of the columns from the squaring unit and approximately 1/[…] of the columns from the cubing unit contribute to the 24 most significant bits of $S$. Progressively less computation is required to achieve the higher-order terms of the Newton-Raphson iteration.

The design that was presented in the preceding discussion was selected to minimize latency and hardware area under the constraint of computing the reciprocal to less than 0.5 ulps error. By slightly increasing the number of columns in the square and cube computations, the worst-case reciprocal error will be improved. The lookup table precision required depends on the order of the Newton-Raphson iteration and the precision of the sub-unit computation. Increasing the number of columns in the sub-unit PPAs will decrease the lookup table precision needed to achieve a given worst-case error. The lookup table area may be reduced by 50% for each bit of precision that it is reduced. Therefore, the designer may trade off computational complexity for area or vice versa.

Figure 8. 3rd order Newton-Raphson approximation sub-unit precision.

6 Summary

Division by functional iteration utilizes multiplication as the fundamental operation. We presented the standard Newton-Raphson reciprocal iteration and the binomial series expansion reciprocal iteration. The computational parallelism in these approaches is limited to two parallel multiplications. We proposed a divide unit architecture based on the higher-order Newton-Raphson reciprocal iteration. The divide unit uses truncated squaring, cubing and powering units. A 24-bit third-order implementation was used as an example to describe the divide unit in detail. We found that the first $(1 - bX)^1$, second $(1 - bX)^2$, third $(1 - bX)^3$ and fourth $(1 - bX)^4$ order computations require progressively less precision. Reducing the precision of the higher-order computations will maximize the efficiency, reduce the latency and minimize the power consumption of the overall computation. The reduction in the precision of the higher-order computations in the proposed architecture differs from the typical Newton-Raphson or series expansion approach. The latter algorithms require full precision computation after the first iteration.

Furthermore, truncating the subunits significantly reduces the area required to implement the divider. The $(1 - bX)$ fused multiply-subtract unit, squaring unit, cubing unit, and the $(aX)$ multiply unit require respectively […]%, 5%, […]%, and […]% of the area that is required by a direct multiplier. If the divider is designed with separate units then the entire implementation would be less than the size of two full precision multipliers. The final multiplication may be performed on a shared multiplier, further reducing the dedicated hardware requirement by the divide unit.

A 53-bit IEEE double precision divide unit was also designed and tested. The same design techniques were applied to the double precision unit. To reduce the lookup table size the highest-order 5 columns of the $(1 - bX)^4$ parallel computation were included. Adding a few additional columns for the $(1 - bX)^4$ terms only increased the 1X PPA by a total of 8 bits and the 3X PPA by a total of two bits. Using a $2^{1[\ldots]} \times 1[\ldots]$ bit lookup table the 53-bit reciprocal can be computed in one iteration with the squaring unit and cubing unit truncated to approximately 4% and […]% of the size of a 53-bit direct multiplier. A second 53-bit design was studied using a lookup table of $2^{1[\ldots]} \times 1[\ldots]$ bits, half the size of the previous 53-bit design point. The truncated squaring unit was approximately 4% of the size of a 53-bit direct multiplier and the truncated cubing unit was approximately 5% of the size of a 53-bit direct multiplier. The 53-bit designs were proportionally very similar to the results for the 24-bit designs, indicating that the proposed architecture scales well over the studied range.

Table 3 summarizes the area requirements of the multiplicative divide techniques for IEEE double precision operands. The lookup table requires most of the dedicated area needed to implement the divide unit. Table 4 summarizes the latencies for the multiplicative division algorithms discussed. In the table the following abbreviations are used: SM = small multiply, SNM = small narrow multiply, FM = full multiply ($n \times n$), SUB = subtract, FMAC = full multiply accumulate, SDA = […] operand signed digit adder. The proposed divide unit has the lowest latency and area requirements. Additionally, the number of iterations required by each algorithm to achieve an error reduction of $\epsilon_{initial}^{[\ldots]}$ is listed. The implementations listed in Table 4 and 5 were selected for comparison since each one requires the area of approximately two multipliers or less.

Table 3. Divide area comparison (IEEE DP)
Algorithm               Lookup Table Size   HW Area
N-R 1st-order           8,67B               […] Mult.
Series Expansion        8,67B               […] Mult.
N-R 3rd-order           8,67B               […] Mult.
Ito et al. [7]          6,44B               […] Mult-Acc.
Ercegovac et al. [3]    65,56B              […] Mult.
Proposed Arch.          4,6B                […] Mult.

Table 4. Divide latency comparison
Algorithm               Iter.   Comp. Latency
N-R 1st-order           […]     SM+FM
Series Expansion        […]     SM+Sub+FM
N-R 3rd-order           […]     SM+FM
Ito et al. [7]          4       4FMAC
Ercegovac et al. [3]    […]     SM+SNM+SDA+FM
Proposed Arch.          […]     SM+FM

The proposed architecture is easily amenable to a fully pipelined implementation, since the quotient is computed in a single pass through the subunits. A new divide operation can be dispatched each cycle. This differs from the first-order Newton-Raphson iteration and binomial series expansion technique that require multiple iterations to achieve the same convergence as the proposed architecture. Our algorithm and the Ercegovac, Lang, Muller, Tisserand approach are fully pipeline-able without a significant increase in hardware.

7 Conclusions

A fast, efficient, and high-throughput divide unit is proposed. This unit utilizes the higher-order Newton-Raphson reciprocal approximation. Parallel squaring, cubing and powering units perform low latency concurrent computation and reduce the overall latency of the divide unit. It has been demonstrated that progressively less computation is required to compute the second, third and higher-order terms.
Therefore, significant hardware reductions are achievable by truncating the powering unit partial product arrays. The proposed architecture achieves the desired precision in a single iteration and is amenable to a fully pipelined implementation that dispatches one divide instruction per cycle. Designing an optimal divide unit for a specific operand length requires balancing the subunit precision and lookup table size.

References

[1] ANSI/IEEE Std 754-1985, IEEE Standard for Binary Floating-Point Arithmetic, 1985.
[2] R. C. Agarwal, F. G. Gustavson, and M. S. Schmookler. Series Approximation Methods for Divide and Square Root in the Power3 Processor. In Proc. 14th IEEE Symp. on Computer Arithmetic, pages 116–123, April 1999.
[3] M. D. Ercegovac, T. Lang, J.-M. Muller, and A. Tisserand. Reciprocation, Square Root, Inverse Square Root, and Some Elementary Functions Using Small Multipliers. IEEE Transactions on Computers, 49(7):628–637, July 2000.
[4] M. D. Ercegovac, D. W. Matula, J.-M. Muller, and G. Wei. Improving Goldschmidt Division, Square Root, and Square Root Reciprocal. IEEE Transactions on Computers, 49(7):759–763, July 2000.
[5] M. Flynn. On Division by Functional Iteration. IEEE Transactions on Computers, C-19(8):702–706, August 1970.
[6] R. E. Goldschmidt. Applications of Division by Convergence. Master's thesis, Dept. of Electrical Engineering, Massachusetts Institute of Technology, Cambridge, Mass., June 1964.
[7] M. Ito, N. Takagi, and S. Yajima. Efficient Initial Approximations and Fast Converging Methods for Division and Square Root. In Proc. 12th IEEE Symp. on Computer Arithmetic, pages 2–9, July 1995.
[8] A. Liddicoat and M. Flynn. The Parallel Square and Cube Computation. In IEEE 34th Asilomar Conference on Signals, Systems and Computers, October 2000.
[9] S. F. Oberman. Floating Point Division and Square Root Algorithms and Implementation in the AMD-K7 Microprocessor. In Proc. 14th IEEE Symp. on Computer Arithmetic, pages 106–115, April 1999.
[10] S. F. Oberman and M. Flynn. Division Algorithms and Implementations. IEEE Transactions on Computers, 46(8):833–854, August 1997.
[11] P. Rabinowitz. Multiple-Precision Division. In Communications of the ACM, volume 4, page 98, February 1961.
[12] D. Wong and M. Flynn. Fast Division Using Accurate Quotient Approximations to Reduce the Number of Iterations. In IEEE Transactions on Computers, pages […], August 1992.
