Parallel Square and Cube Computations

Albert A. Liddicoat and Michael J. Flynn
Computer Systems Laboratory, Department of Electrical Engineering
Stanford University
Gates Building, 353 Serra Mall, Stanford, CA 94305, USA
liddicoat@stanford.edu and flynn@umunhum.stanford.edu

Abstract

Typically multipliers are used to compute the square and cube of an operand. A squaring unit can be used to compute the square of an operand faster and more efficiently than a multiplier. This paper proposes a parallel cubing unit that computes the cube of an operand 5 to % faster than can be computed using multipliers. Furthermore, the reduced squaring and cubing units are mathematically modeled and the performance and area requirements are studied for operands up to 54 bits in length. The applicability of the proposed cubing circuit is discussed with relation to the current Newton-Raphson and Taylor series function evaluation units.

1 Introduction

Iterative techniques such as the Newton-Raphson and Taylor series expansion can be used to compute the reciprocal, square root, inverse square root, and other elementary functions. Using higher-order function approximations decreases the number of iterations required to achieve a desired precision. Using fast and efficient parallel squaring and cubing units reduces the number of iterations and overall latency of the computation of elementary functions.

The typical Newton-Raphson method converges to the reciprocal of $b$ quadratically using the following iteration [6]:

$$X_{i+1} = X_i (2 - b X_i)$$

The error in the reciprocal approximation decreases by $\epsilon_{i+1} = \epsilon_i^2$ for each iteration (with $\epsilon_0 < 1$). Rabinowitz [7] proposed a higher-order Newton-Raphson reciprocal approximation,

$$X_{i+1} = X_i + X_i (1 - b X_i) + X_i (1 - b X_i)^2 + \cdots + X_i (1 - b X_i)^k$$

Flynn [2] shows that for the $k^{th}$ order Newton-Raphson iteration the error decreases as $\epsilon_{i+1} = \epsilon_i^{k+1}$.

Ito [4] proposed fast converging methods for division and square root based on the higher-order Newton-Raphson iteration. To approximate the third-order Newton-Raphson iteration, an estimate of the cube of $(1 - b X_i)$ is determined using a $2^l \times l$ bit lookup table.
They report that the precision of the approximated third-order iteration exceeds the second-order iteration by $(l - 2)$ bits. If the cube may be computed directly and in parallel with the computation of the square, then the latency of the third-order iteration is reduced. Furthermore, if the desired precision may be obtained in a single iteration, then the unit may be readily pipelined as described by Liddicoat [5].

Wong [9] proposes high-radix fast division using accurate quotient approximations. The following Taylor series approximation is used to compute an accurate reciprocal approximation.

$$\frac{1}{Y} \approx \frac{1}{Y_h} + \frac{\Delta Y}{Y_h^2} + \frac{\Delta Y^2}{Y_h^3} + \cdots$$

Here, $Y_h$ is the most significant $m$ bits of $Y$ extended with 1s such that $Y_h \ge Y$ and $\Delta Y = Y_h - Y$. The reciprocal approximation is then used to successively compute quotient approximations. In Wong's proposed technique separate lookup tables are used to store the powers of $1/Y_h$ while the powers of $\Delta Y$ are computed using serial multiplications. The reciprocal approximation is on the critical path in the divide unit. The overall latency for the division can be reduced by the parallel computation of the square and cube of $\Delta Y$. Additionally, the powers of $1/Y_h$ may also be computed using parallel squaring and cubing units to reduce the table requirements.

Ercegovac et al. [1] propose a method to compute the reciprocal, square root, inverse square root, and other elementary functions based on argument reduction and series expansion. In this scheme, small multiplications are used to compute the Taylor series expansion for each function approximation. The cube is computed using two serial $k$-bit multiplications, where $k$ is between one quarter and one third of the operand length. The cubing computation is on the critical path and therefore the overall latency may be reduced by using parallel squaring and cubing units.

Ienne [3] proposed a circuit that serially computes the square of a variable. This approach has been extended to compute the square of an operand in parallel using a partial product array for the reduction. The PPA required for the parallel squaring unit is about half the height and size of a direct multiplier partial product array.
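The convergence behaviour of the higher-order Newton-Raphson reciprocal iteration discussed in this section can be checked numerically. The following is a minimal sketch, not code from the paper: the function name `nr_reciprocal_step`, the operand `b`, and the starting estimate `x0` are illustrative assumptions, and exact rational arithmetic is used so that the error after one step can be compared with the power of the initial error.

```python
from fractions import Fraction

def nr_reciprocal_step(x, b, order):
    """One Newton-Raphson step of the given order toward 1/b (hypothetical helper).

    Computes x * (1 + e + e**2 + ... + e**order) with e = 1 - b*x.
    order=1 gives the familiar x * (2 - b*x).
    """
    e = 1 - b * x
    return x * sum(e ** p for p in range(order + 1))

b = Fraction(3, 4)        # operand whose reciprocal 4/3 is sought (assumed value)
x0 = Fraction(5, 4)       # initial estimate (assumed value)
e0 = 1 - b * x0           # initial error, 1/16

# Error after a single step for orders 1, 2 and 3.
errors = {k: 1 - b * nr_reciprocal_step(x0, b, k) for k in (1, 2, 3)}
for k, err in errors.items():
    print(f"order {k}: error {err} = e0**{k + 1}")
```

With these inputs the error after one step of order k equals the initial error raised to the power k + 1, which is why an iteration that includes the square and cube terms needs fewer passes to reach a given precision.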
This work analyzes the reduced parallel squaring unit. Then a parallel cubing unit is proposed and analyzed. The parallel squaring and cubing units are then compared with standard direct multipliers for both latency and area.

2 Parallel Squaring Unit

2.1 Partial Product Array

Generally a multiplier unit is used to compute the square of an operand. However, the square of an operand can be computed directly with a specialized unit that has lower latency and a smaller area implementation. The middle portion of figure 1 shows the partial product array for a parallel square using a multiplier PPA. The boxes indicate which terms may be combined using the equivalence $x_i x_j + x_j x_i = 2 x_i x_j$. In the lower portion of figure 1, $2 x_i x_j$ is represented by placing $x_i x_j$ one column to the left, which has a weighting of two times that of the current column. The square of operand $X$ can be computed with the reduced partial product array shown in the lower portion of figure 1.

Figure 1. PPA Reduction for Squaring Unit

2.2 Analysis

The partial product array for the squaring unit can be expressed mathematically for an $n$-bit number as:

$$X^2 = \sum_{i=0}^{n-1} x_i 2^{2i} + \sum_{i=0}^{n-2} \sum_{j=i+1}^{n-1} x_i x_j 2^{i+j+1} \quad (1)$$

The height of the squaring unit partial product array can be expressed as:

$$PPA_{height} = \left\lfloor \frac{n}{2} \right\rfloor + 1 \quad (2)$$

The number of input bits for the squaring unit partial product array is:

$$PPA_{bits} = \frac{1}{2}(n^2 - n) + n = \frac{n^2 + n}{2} \quad (3)$$

2.3 Comparison to Direct Multiplier

The squaring unit partial product array height is approximately one half the height of the direct multiplier partial product array. Additionally, the number of input bits in the squaring unit partial product array is approximately one half that of the direct multiplier. The input logic for the direct multiplier and the squaring unit both require a single two-input AND gate per term. The input logic for the direct multiplier requires $n^2$ AND gates while the input logic for the squaring unit requires $(n^2 - n)/2$ AND gates. Less than half of the input logic gates are required by the squaring unit as compared to the direct multiplier. The partial product array reduction for the direct multiplier requires more than twice the number of CSAs that are required to reduce the partial product array terms for the squaring unit.
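The reduced squaring array described above can be cross-checked by brute force. The sketch below is an illustration under stated assumptions rather than the authors' model: the helper names `reduced_square_ppa`, `evaluate` and `ppa_stats` are hypothetical, and the closed forms it compares against (input bits, maximum column height, AND gate count) are obtained here by counting the terms of the folded array.

```python
def reduced_square_ppa(n):
    """Map each column weight to the terms of the reduced squaring PPA.

    A term is a tuple of operand bit indices that are ANDed together.
    """
    cols = {}
    for i in range(n):
        cols.setdefault(2 * i, []).append((i,))             # x_i * x_i = x_i
        for j in range(i + 1, n):
            # x_i*x_j + x_j*x_i = 2*x_i*x_j: one term, shifted one column left
            cols.setdefault(i + j + 1, []).append((i, j))
    return cols

def evaluate(cols, x):
    """Sum the array for a concrete operand x."""
    total = 0
    for col, terms in cols.items():
        for term in terms:
            if all((x >> bit) & 1 for bit in term):
                total += 1 << col
    return total

def ppa_stats(n):
    """Return (input bits, maximum column height, two-input AND gates)."""
    cols = reduced_square_ppa(n)
    bits = sum(len(terms) for terms in cols.values())
    height = max(len(terms) for terms in cols.values())
    and_gates = sum(1 for terms in cols.values() for t in terms if len(t) == 2)
    return bits, height, and_gates

n = 8
cols = reduced_square_ppa(n)
assert all(evaluate(cols, x) == x * x for x in range(1 << n))
print(ppa_stats(n), "versus", n * n, "bits and height", n, "for a direct multiplier")
```

For an 8-bit operand the folded array holds 36 bits in at most 5 rows and needs 28 AND gates, against 64 bits, 8 rows and 64 AND gates for the direct multiplier array.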
The final CPA is the same for both units. The delay differs only in the partial product array reduction. A Wallace tree [8] was used to reduce the partial product array for both the squaring unit and the direct multiplier. For most of the operand widths, the squaring unit latency is two CSA delays faster than the direct multiplier. This is because a Wallace tree can reduce twice as many partial products by doubling the tree hardware, computing the carry save sum on both halves of the partial product array in parallel, and then combining the two carry save sums using two additional CSAs.

Figure 2 shows the latency and area of the squaring unit relative to the direct multiplier. The squaring unit partial product array latency is generally -5% faster on operands of to 54 bit widths, while the partial product array reduction area for the squaring unit is about 5-7% less than that of the direct multiplier. Since the area required for the input logic and partial product array is much larger than the area required for the CPA, the squaring unit can be implemented in less than half the area required by a direct multiplier. The interconnect delay and number of routing channels required for the squaring unit are less than those for a multiplier since only one operand needs to be distributed throughout the partial product array.

When comparing the partial product array of the squaring unit to that of a Booth-2 multiplier, the PPA height and number of input bits are comparable. However, the squaring unit does not require the logic and delay for the Booth recoding nor the multiplicand multiple selection. Therefore, the squaring unit would require less area and have better performance than a Booth-2 multiplier.

Figure 2. PPA Area and Delay for the Squaring Unit relative to Direct Multiplier (relative area and delay versus operand length in bits; series: PPA Delay, PPA Area)

3 Parallel Cubing Unit

3.1 Partial Product Array

Computing a precise cube of an $n$-bit operand using two serial multiplications requires an $n \times n$ bit multiplication followed by a $2n \times n$ bit multiplication. The cube of an operand can instead be computed in parallel, similar to a multiply. The middle portion of figure 3 shows the expanded parallel partial product array used to compute the cube of a 4-bit operand. There are $n^3$ bits in the non-reduced cubing unit partial product array. The boxes in figure 3 identify three reduction techniques that can be applied to the cubing unit partial product array. These reduction techniques eliminate terms from the cube partial product array and reduce the overall height of the partial product array. Therefore, the latency and hardware needed to sum the partial products are significantly reduced.

The first reduction technique is performed on the partial product terms that include three identical bits such as $x_0 x_0 x_0$. These terms can be replaced by single bit terms such as $x_0$. For an $n$-bit operand there are $n$ three-bit terms that can be replaced by single-bit terms.

The second reduction technique can be applied to the partial product terms that include two identical bits. Box 2 in figure 3 indicates the three terms with two $x_0$-bits and one $x_1$-bit. The three terms in box 2 can be replaced by one $x_0 x_1$ term with a weighting of 3.

The third reduction technique can be applied to the partial product terms that have three different input bits. Box 3 in figure 3 indicates six boolean terms that each have one $x_0$-bit, one $x_1$-bit, and one $x_2$-bit. The six terms in box 3 can be replaced with one $x_0 x_1 x_2$ term with a weighting of 3 shifted one column to the left. Figure 3 shows the reduced partial product array to compute the cube of a 4-bit operand after applying the three reduction techniques.

Figure 3. Cubing Unit PPA Reduction

The hardware required to compute the reduced parallel cube of an operand is similar to that of a multiplier. The terms with a weighting of three can be summed to a carry save result using a Wallace tree. Then, using a carry free (5,5,4) counter stage, the three times multiple of the carry save result may be computed and summed with the one-bit terms. The final carry save result may then be summed using a carry propagate adder.

3.2 Analysis of the Reduced Cubing Unit

After applying the three reduction techniques, the cube can be represented by equation 4.

$$X^3 = \sum_{i=0}^{n-1} x_i 2^{3i} + 3 \left[ \sum_{i=0}^{n-2} \sum_{j=i+1}^{n-1} x_i x_j \left( 2^{2i+j} + 2^{i+2j} \right) + \sum_{i=0}^{n-3} \sum_{j=i+1}^{n-2} \sum_{k=j+1}^{n-1} x_i x_j x_k 2^{i+j+k+1} \right] \quad (4)$$
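The three reduction techniques can be exercised directly in software as a sanity check. The following sketch is an illustration and not the authors' implementation: the names `reduced_cube_terms`, `term_sum` and `reduced_cube` are hypothetical, and the multiplication by three stands in for the three times multiple that the hardware forms with a counter stage.

```python
from itertools import combinations

def reduced_cube_terms(n):
    """Return (single-bit terms, weight-3 terms) as (bit indices, column) pairs."""
    ones = [((i,), 3 * i) for i in range(n)]            # x_i*x_i*x_i = x_i
    threes = []
    for i, j in combinations(range(n), 2):              # two identical bits
        threes.append(((i, j), 2 * i + j))              # x_i, x_i, x_j
        threes.append(((i, j), i + 2 * j))              # x_i, x_j, x_j
    for i, j, k in combinations(range(n), 3):           # three different bits
        threes.append(((i, j, k), i + j + k + 1))       # 6 = 3 * 2: one column left
    return ones, threes

def term_sum(terms, x):
    """Add up the column weights of every term whose AND is 1 for operand x."""
    return sum(1 << col for bits, col in terms
               if all((x >> b) & 1 for b in bits))

def reduced_cube(x, n):
    ones, threes = reduced_cube_terms(n)
    return term_sum(ones, x) + 3 * term_sum(threes, x)

n = 4
assert all(reduced_cube(x, n) == x ** 3 for x in range(1 << n))
ones, threes = reduced_cube_terms(n)
print(len(ones) + len(threes), "terms instead of", n ** 3)
```

For the 4-bit case the reduced array holds 20 terms instead of the 64 in the non-reduced array, and it reproduces the exact cube for every operand value.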

The height of the reduced cubing unit partial product array for the terms with a weighting of three is approximated by equation 5.

[Equation 5 is not recoverable from the source text.]

The number of bits in the parallel cube unit after the reductions have been applied is expressed in equation 6. Recall that the number of bits in the non-reduced parallel cube unit is $n^3$ and the number of bits in the partial product array for the multipliers that are required to compute the exact cube is $3n^2$.

$$PPA_{bits} = n + n(n-1) + \frac{n(n-1)(n-2)}{6} \quad (6)$$

Figure 4 plots the number of bits required for the non-reduced cubing unit, the reduced cubing unit, and the multiply-multiply unit required to perform the cubing function. The cubing unit requires fewer PPA bits than multiply-multiply for operand lengths of less than 15 bits.

Figure 4. PPA bits for the Cube Computation (bits in PPA versus operand length in bits; series: Parallel Cube, Reduced Parallel Cube, Multiplier)

3.3 Cubing Unit Comparison

The reduced parallel cube is compared to the traditional approach of two serial direct multiplications. In addition, the reduced parallel cube is compared to a method of forming the cube by squaring an operand using the previously described reduced squaring unit followed by a serial multiplication. All of the cubing methods produce the exact result. Both performance and area of the three methods are compared. The performance is measured by the number of gate delays required to produce the final result. The gate delays include the PPA input gate delay, the Wallace tree delay to sum the partial products, the carry propagate additions, and the three times multiple that is required by the reduced parallel cube technique. The area analysis compares the number of CSAs required to sum the partial product arrays. The Wallace tree circuitry constitutes the majority of the unit area.

Figure 5 shows the number of gate delays each unit requires for various operand lengths. The reduced parallel cubing unit achieves the best performance for all operand lengths.

Figure 5. Performance of Cubing Units (gate delay versus operand length in bits; series: Reduced Cube, Multiply Multiply, Square Multiply)

Figure 6 graphs the number of CSAs required to implement each of the cubing units as a function of the input operand length. For operand lengths of less than -bits, the cubing unit requires fewer CSAs in the PPA reduction than the multiply-multiply unit. However, for a 54-bit operand, the cubing unit requires a factor of . more CSAs than the multiply-multiply unit. For operands of length less than -bits, the cubing unit provides a higher performance unit with a comparable area implementation.

Figure 7 shows the area and delay of the reduced parallel cubing unit and the square-multiply unit relative to the multiply-multiply unit. The parallel cubing unit performs better than both the square-multiply unit and the multiply-multiply unit for all operand lengths. However, the hardware requirement of the reduced parallel cube unit grows faster with increased operand length than that of the other methods, so for larger operand sizes the performance improvement has to be weighed against the increase in the area required to implement the reduced parallel cubing unit. We see that the square-multiply unit performs about % faster than the multiply-multiply unit and requires only 8% of the area for implementation. Therefore, the square-multiply unit is better suited than the multiply-multiply unit to compute the cube for all operand lengths.
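The partial product array sizes compared in Figure 4 can be tabulated from simple counts. This is a sketch under assumptions rather than the paper's own model: the reduced-array count is derived here from the three reduction techniques (n single-bit terms, two column entries per pair of distinct bits, one entry per triple of distinct bits), the multiply-multiply count assumes an n by n array followed by a 2n by n array, and the function name `ppa_bits` is hypothetical.

```python
from math import comb

def ppa_bits(n):
    """PPA bit counts for the three ways of forming the cube of an n-bit operand."""
    return {
        "parallel_cube": n ** 3,                                   # non-reduced array
        "reduced_parallel_cube": n + 2 * comb(n, 2) + comb(n, 3),  # after the reductions
        "multiply_multiply": n * n + 2 * n * n,                    # n x n, then 2n x n
    }

# Smallest operand length at which the reduced cube array is larger
# than the two multiplier arrays combined.
crossover = next(
    n for n in range(1, 55)
    if ppa_bits(n)["reduced_parallel_cube"] > ppa_bits(n)["multiply_multiply"]
)

for n in (4, 8, 14, 15, 54):
    print(n, ppa_bits(n))
print("crossover at", crossover, "bits")
```

Under these counts the reduced cube array is the smaller of the two up to 14 bits, and its cubic growth term dominates for long operands, which is the trade-off between performance and area that the comparison above describes.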

Figure 6. CSA Area for Cubing Units (CSAs in PPA reduction versus operand length in bits; series: Reduced Cube, Multiply Multiply, Square Multiply)

Figure 7. Parallel Cube and Square-Multiply Relative to Multiply-Multiply (relative area and delay versus operand length in bits; series: Cube Area, Cube Delay, Sqr Mult Area, Sqr Mult Delay)

4 Conclusions

The reduced squaring unit generally computes the square of an operand -% faster than a direct multiplier for to 54 bit operands. Additionally, the squaring unit requires less than half of the area needed to implement a direct multiplier. Therefore, when the square of an operand must be computed on a dedicated hardware unit, the reduced squaring unit provides .4 times higher performance per area as compared to a direct multiplier.

The reduced cubing unit is the fastest method studied to compute the cube of an operand. The reduced cubing unit is 5-% faster than the direct multiply-multiply. However, the area required to implement the reduced cube grows more rapidly than the area required to implement the cube using multipliers. For operands with length less than 5 bits, the reduced cubing unit requires less area to implement the cube than direct multipliers. However, for 54 bit operands the reduced cubing unit requires approximately three times the area of the cube implementation using direct multipliers. Therefore both performance and area requirements must be considered. Alternately, the cube can be computed with a reduced squaring unit followed by a multiplier. This method performs % faster and requires 8% of the area to implement the cube than by using direct multipliers.

In section 1 several higher-order iterative function evaluation techniques were discussed. These techniques included Newton-Raphson and Taylor series techniques to compute functions such as the reciprocal, square root, inverse square root, and other elementary functions. The squaring unit and proposed cubing unit should decrease the latency of the higher-order function evaluation and may reduce the area needed for implementation of such units.

References

[1] M. D. Ercegovac, T. Lang, J.-M. Muller, and A. Tisserand. Reciprocal, Square Root, Inverse Square Root, and Some Elementary Functions Using Small Multipliers. In IEEE Transactions on Computers, volume 49, pages 628-637, July 2000.
[2] M. Flynn. On Division by Functional Iteration. In IEEE Transactions on Computers, volume C-19, pages 702-706, August 1970.
[3] P. Ienne and M. Viredaz. Bit-Serial Multipliers and Squarers. In IEEE Transactions on Computers, volume 43, pages 1445-1450, December 1994.
[4] M. Ito, N. Takagi, and S. Yajima. Efficient Initial Approximations and Fast Converging Methods for Division and Square Root. In Proc. 12th IEEE Symp. Computer Arithmetic, pages 2-9, July 1995.
[5] A. A. Liddicoat and M. J. Flynn. Pipelineable Division Unit. Technical Report CSL-TR-00-809, Computer Systems Laboratory, Stanford University, 2000.
[6] S. Oberman. Division Algorithms and Implementations. In IEEE Transactions on Computers, volume 46, pages 833-854, August 1997.
[7] P. Rabinowitz. Multiple-Precision Division. In Communications of the ACM, volume 4, page 98, February 1961.
[8] C. S. Wallace. A Suggestion for a Fast Multiplier. In IEEE Transactions on Computers, pages 14-17, February 1964.
[9] D. Wong. Fast Division Using Accurate Quotient Approximations to Reduce the Number of Iterations. In IEEE Transactions on Computers, volume 41, pages 981-995, August 1992.