Automatic Generation of Polynomial-Basis Multipliers in $GF(2^n)$ using Recursive VHDL

J. Nelson, G. Lai, A. Tenca

Abstract

Multiplication in $GF(2^n)$ is very commonly used in the fields of cryptography and error-correcting codes. Automating the design process for these multipliers could reduce the cost and development time of hardware implementations. In this project, we present a general design for these multipliers and a strategy to generate them automatically, given the precision of the operands and the irreducible polynomial which defines the field. In particular, the generation and use of tree structures based on a previously proposed recursive VHDL technique is compared with other Galois Field multipliers. It is shown that the final result of this automatic design tool is very competitive with some specialized designs presented in the literature, with the advantage that it can be adjusted to any type of trinomial or pentanomial in the IEEE standard.

Index Terms: recursive VHDL, $GF(2^n)$ multiplication, automatic generation, cryptography, reconfigurable.

I. INTRODUCTION

Multiplication in binary extension fields, or Galois Fields $GF(2^n)$, is a very important operation in several numerical algorithms used in cryptographic applications and error-correcting codes [1]. The field is defined based on an irreducible polynomial and, as a result, the multipliers have a structure that depends on this polynomial [2]. Polynomials with fewer coefficients have less complexity, and for this reason the most common irreducible polynomials are trinomials and pentanomials (3 and 5 coefficients respectively). The IEEE standard provides an extensive list of such polynomials [3].

Most of the work presented in the literature estimates the complexity of multipliers in $GF(2^n)$ [4], [5], [6], [7] and usually proposes design alternatives to optimize them. However, the actual design process, even with these design guidelines, is always a complex and tedious task. In this work we present an HDL (hardware description language) description for multipliers in $GF(2^n)$ that use polynomial basis.
This description allows the generation of multipliers for any field size (within the limits of the synthesis tool) and for any trinomial or pentanomial. However, the basic concept is general and could be used for any type of irreducible polynomial. As in other multipliers, the fast addition of partial products is the most important factor in determining performance. The use of tree structures is the fastest way to perform this task. However, a parameterizable description of tree structures using VHDL can only be obtained when using recursive calls. The use of recursive VHDL is briefly explained in Section IV.

The benefits of using an automatic generator of multipliers are flexibility and speed. From the information about the field size and possibly the irreducible polynomial, the multiplier is generated with minimal user intervention. This approach significantly reduces the cost of getting a working and efficient design solution.

We initially present an overview of multiplication in $GF(2^n)$ and show the general design approach to be used in this work. A brief description of recursive VHDL is also shown. Some experimental results are presented and compared against published results. We initially imagined that the circuit would be inefficient, but the results show that the synthesized design is very competitive with optimized multiplier designs. These results are shown in later sections.

Authors are with the Department of Electrical Engineering & Computer Science, Oregon State University, Corvallis, Oregon 97331. E-mail: {nelsonja,laige,tenca}@ece.orst.edu

II. POLYNOMIAL BASIS MULTIPLICATION IN $GF(2^n)$

For our design we chose to use the standard or polynomial basis to represent elements in $GF(2^n)$. When using polynomial basis, each element of the field is represented by a polynomial of the form

$a(x) = a_{n-1}x^{n-1} + a_{n-2}x^{n-2} + \dots + a_2x^2 + a_1x + a_0$

whose coefficients are always 1 or 0. To create a proper field, an irreducible polynomial of degree $n$ must be chosen to define that field. All operations within the field are then performed modulo the polynomial.
Addition and subtraction, when using polynomial basis, are a simple XOR of the corresponding coefficients. Multiplication, on the other hand, is a much more complicated procedure, requiring the result to be reduced by the irreducible polynomial $p(x)$ as shown in (1).

$c(x) = a(x)b(x) \bmod p(x)$    (1)

This reduction can be performed by subtracting $p(x)x^{i-n}$ for each term of degree $i$ greater than $n-1$. If we view the multiplication as shown in (2), then it is possible to reduce each partial product separately, as shown in (3).

$\left( a(x)b_{n-1}x^{n-1} + \dots + a(x)xb_1 + a(x)b_0 \right) \bmod p(x)$    (2)

$c(x) = \sum_{i=0}^{n-1} \left( a(x)b_ix^i \bmod p(x) \right)$    (3)

In fact, the reduction can be further simplified by using each reduced partial product as the base for the next partial product. We can produce any partial product $z(x)_i$ as

$z(x)_i = z(x)_{i-1}x + z_{n-1}p(x)$    (4)

where $z_{n-1}$ is the MSB of $z(x)_{i-1}$ and $z(x)_0 = a(x)$. Each $z(x)_i$ is then ANDed with $b_i$ to select the actual partial products. This allows us to perform the reduction while calculating each partial product by adding $z_{n-1}p(x)$ at each iteration.
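The recurrence in (4) is easy to check by hand on small fields with a bit-level software model. The Python sketch below is only such a model, not the hardware description used in this work. The function name `gf2_mul` and the convention of passing the irreducible polynomial as an integer bit mask (bit i holds the coefficient of x^i, including the x^n term) are assumptions made for illustration.

```python
def gf2_mul(a: int, b: int, poly: int, n: int) -> int:
    """Multiply a(x) and b(x) in GF(2^n), one partial product per iteration.

    poly is the irreducible polynomial as a bit mask that includes x^n.
    """
    mask = (1 << n) - 1            # keeps only the n coefficient bits
    z = a & mask                   # first partial product is a(x) itself
    c = 0
    for i in range(n):
        if (b >> i) & 1:           # AND with b_i selects this partial product
            c ^= z                 # addition in GF(2^n) is a bitwise XOR
        msb = (z >> (n - 1)) & 1   # MSB of the current partial product
        z = (z << 1) & mask        # multiply by x; the x^n term is dropped ...
        if msb:
            z ^= poly & mask       # ... and replaced by the low part of p(x)
    return c


# GF(2^4) with p(x) = x^4 + x + 1:
# (x^2 + x)(x^2 + x + 1) = x^4 + x, and x^4 = x + 1, so the product is 1.
print(gf2_mul(0b0110, 0b0111, 0b10011, 4))   # prints 1
```

Because the reduction happens inside the loop, every intermediate value fits in n bits, which is the property the hardware partial product generators rely on.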

Fig. 1. Block diagram of the recursive implementation (partial product generators built from Shift Left, MSB, and MUX blocks, feeding the recursive XOR trees that produce $c(x)$). Note that the result of each partial product module is fed into the next.

It is also possible to view the polynomial basis multiplication as the multiplication of a binary matrix and a binary vector, as presented in [2]. Each column of the binary matrix contains one partial product $a(x)x^i$. The binary vector contains the coefficients of $b(x)$. It is then possible to create a reduction matrix $Q$ based on the irreducible polynomial $p(x)$. It has been shown that the logic required to add in the reduction matrix $Q$ can be greatly simplified for most polynomials [4], [5], [6]. For all polynomials presented in [3] the $Q$ matrix will be very sparse. This leads to a faster and smaller implementation.

III. MULTIPLIER DESIGN APPROACH

The multiplier we present in this paper is a Galois Field multiplier that operates on field elements defined using polynomial basis. The irreducible polynomial can be prespecified for each desired multiplier design, or one will be automatically chosen from [3] based on the operand size. Rather than describing the matrix presented in [2], our HDL describes a structure with two main parts. The first is the partial product generation stage and the second combines the resulting partial products. The combination of the partial products is performed by the recursive XOR trees described in Section IV.

Individual partial products are computed by left-shifting the previous partial product by one bit and checking the most significant bit (MSB) of the result. If this bit is equal to 0, then the current partial product has not exceeded the degree of the field. If it is equal to 1, then it has exceeded the degree of the field and must be reduced by the irreducible polynomial $p(x)$. The reduction is accomplished by a bitwise XOR of the shifted partial product and $p(x)$. This way, the partial product computed in any iteration can always be represented in an $n$-bit vector.
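The shift, MSB check, and conditional XOR described above can be modeled in a few lines of Python, together with the matrix view in which each reduced partial product forms one column. This is a minimal sketch under stated assumptions: the names `partial_products` and `matrix_vector_product` are hypothetical, and polynomials are encoded as integer bit masks with bit i holding the coefficient of x^i (the mask for the irreducible polynomial includes the x^n term).

```python
def partial_products(a: int, poly: int, n: int) -> list:
    """Return the n reduced partial products a(x) * x^i, i = 0 .. n-1."""
    z = a & ((1 << n) - 1)
    columns = []
    for _ in range(n):
        columns.append(z)
        shifted = z << 1             # left shift by one bit (n+1 bits wide)
        if (shifted >> n) & 1:       # MSB of the shifted value is set
            shifted ^= poly          # XOR with the polynomial clears bit n
        z = shifted                  # fits in n bits again
    return columns


def matrix_vector_product(columns: list, b: int, n: int) -> int:
    """Binary matrix (given by its columns) times the bit vector of b(x)."""
    result = 0
    for row in range(n):
        bit = 0
        for col in range(n):
            # AND selects the entry, XOR accumulates the sum over GF(2)
            bit ^= ((columns[col] >> row) & 1) & ((b >> col) & 1)
        result |= bit << row
    return result


# GF(2^4) with the polynomial x^4 + x + 1 and a(x) = x^2 + x
cols = partial_products(0b0110, 0b10011, 4)
print([format(z, "04b") for z in cols])        # ['0110', '1100', '1011', '0101']
print(matrix_vector_product(cols, 0b0111, 4))  # prints 1
```

Each row of the matrix-vector product is an AND of matrix entries with the bits of b(x) followed by an XOR reduction, which corresponds to one AND stage and one XOR tree per output bit.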
The computed partial product is passed on to the next iteration and this process continues until all partial products have been generated. As shown in Figure 1, each dotted block represents one iteration step. In each block, the current partial product is left-shifted and its MSB is used to determine whether the partial product will be reduced by the irreducible polynomial. The resulting partial product is then used in the next iteration step. Then, effective partial products are selected by AND operations of the partial product at block $i$ with $b_i$, the $i$-th coefficient of $b(x)$. To obtain the result, a bitwise XOR of all the effective partial products is accomplished by means of an XOR tree.

IV. RECURSIVE VHDL DESCRIPTION OF TREES

Several efficient hardware circuits are constructed using trees. Describing these trees using non-recursive constructs can be very difficult. By using recursive VHDL [8], we found a compact and regular way to describe these trees. The following code is the architecture body of the entity called tree:

    BEGIN -- beginning of architecture body
      -- d = 0: the tree is a single wire
      degenerated_tree: IF d = 0 GENERATE
        output <= input(0);
      END GENERATE degenerated_tree;

      -- d = 1: a single n-input XOR gate
      single_gate: IF d = 1 GENERATE
        xor_gate_root: xorn
          GENERIC MAP(n => n)
          PORT MAP(a => input, z => output);
      END GENERATE single_gate;

      -- d > 1: one gate whose inputs are driven by n subtrees of depth d-1
      compound_tree: IF d > 1 GENERATE
        the_gate: xorn
          GENERIC MAP(n => n)
          PORT MAP(a => xor_input, z => output);
        subtree_array: FOR i IN 0 TO n-1 GENERATE
          the_subtree: tree
            GENERIC MAP(d => d-1, n => n)
            PORT MAP(input  => input(i*n**(d-1) TO (i+1)*n**(d-1)-1),
                     output => xor_input(i));
        END GENERATE subtree_array;
      END GENERATE compound_tree;
    END; -- end of architecture body

where xorn is an n-input XOR gate component. The variable d corresponds to the tree level, and depending on the level different instantiations are made. When d = 0 the tree is a single wire (degenerated tree). When d = 1 the level consists of a single XOR gate (the base of the tree). When d > 1, one XOR gate is used, and each gate input is connected to one new tree. That is when the same component tree is invoked again.
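The three cases of the recursive description (d = 0, d = 1, d > 1) can be mirrored by a small Python model that evaluates the tree and counts the XOR gates it instantiates. This is an illustrative sketch, not a translation of the synthesized design. The names `xor_tree` and `pad_to_full_tree` are hypothetical, and tying unused inputs to 0 is an assumption about how a tree with n**d inputs is fed when fewer bits are available.

```python
from functools import reduce
from operator import xor


def xor_tree(bits: list, d: int, n: int) -> tuple:
    """Model of a tree with d levels of n-input XOR gates over n**d inputs.

    Returns (output_bit, number_of_gates).
    """
    assert len(bits) == n ** d
    if d == 0:                        # degenerated tree: a single wire
        return bits[0], 0
    if d == 1:                        # single gate
        return reduce(xor, bits), 1
    width = n ** (d - 1)              # number of inputs seen by each subtree
    sub_outputs, gates = [], 1        # one gate at this level
    for i in range(n):                # one subtree per gate input
        out, g = xor_tree(bits[i * width:(i + 1) * width], d - 1, n)
        sub_outputs.append(out)
        gates += g
    return reduce(xor, sub_outputs), gates


def pad_to_full_tree(bits: list, n: int) -> tuple:
    """Pad with zeros up to the next power of n; return (padded_bits, depth)."""
    d = 0
    while n ** d < len(bits):
        d += 1
    return bits + [0] * (n ** d - len(bits)), d


padded, depth = pad_to_full_tree([1, 0, 1, 1, 0], 2)
print(depth)                          # prints 3
print(xor_tree(padded, depth, 2))     # prints (1, 7)
```

In this example, three of the eight leaves are constant zeros, so several of the seven gates counted by the model have constant inputs and would be candidates for removal during synthesis.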
This recursive VHDL description is extremely simple and flexible enough to generate trees of different sizes. It allows the control of the fan-in of the XOR gate and the number of tree levels. Once a tree is used to assimilate several bits from the partial products, there will most frequently be unused inputs. It is expected that the synthesis tool will prune unused logic from the tree generated from the recursive description. Given the sparse nature of the large penta- or trinomial, the synthesis tool is able to remove a large number of XOR gates that have all the inputs as zero, or only one variable with all the remaining inputs as zeros.

V. INITIAL SYNTHESIS RESULTS

To determine the characteristics of the proposed multiplier design, it was synthesized using several different operand sizes. For each run, one of the irreducible polynomials supplied in [3] was used. These polynomials included both trinomials and pentanomials. The synthesis was performed using Leonardo and the HEP ADK with its AMI-05 CMOS libraries.

As explained before, we expected to have the synthesis tool prune unused XOR gates from the tree. However, we did not expect that the synthesis tool would break the connections between partial product generators and make the final bit-slices largely independent. In doing this, the synthesis tool undoubtedly improved the delay of the final circuit.

Table I contains results from synthesis runs at 4, 8, 16, 32, 64, and 128 bits. We had attempted to synthesize larger versions but the system requirements exceeded the available resources. Ideally, we would like to determine the behavior of the recursive design at 512 and 1024 bits.

Table I
Auto-Generation ASIC Results

Size      XOR gates   AND gates   Delay
4-bit     15          16          $T_A + 3T_X$
8-bit     79          64          $T_A + 7T_X$
16-bit    287         256         $T_A + 7T_X$
32-bit    1,083       1,024       $T_A + 9T_X$
64-bit    4,22        4,096       $T_A + 10T_X$
128-bit   16,638      16,384      $T_A + 10T_X$

The memory requirement to synthesize designs with a large number of bits is significantly high. This is due largely to the number of instances needed for such a large bit-parallel multiplier. The amount of memory available to the synthesis tool will almost certainly define the largest multiplier which can be created using auto-generation.

As can be seen, the area of the recursive design is quite large. Given the large number of XOR gates needed to construct the recursive trees, the area results are not surprising. It is interesting to note the delay characteristics of this design. Given the design of the partial product generation, it was expected that there would be a large linear component, with a logarithmic component contributed by the XOR trees. However, when the synthesis tool optimized the partial product generation module, it changed the delay characteristics of the proposed multiplier structure.

VI. COMPARISON TO OTHER MULTIPLIERS

For comparison, other parallel implementations of $GF(2^n)$ multipliers are introduced.
The first is based on a modified version of the Karatsuba algorithm for polynomial multiplication. The second is a form of Mastrovito multiplier. For more details on these designs see [7,8]. The multipliers are similar to the one presented in this work in the sense that their delay grows logarithmically as $n$ increases. It should be noted that the delays for the multipliers from [7,8] are purely theoretical and the delay and area characteristics are based solely on the algorithms. See Tables II and III for the estimated delay and area values.

The data shows that the recursive design is generally about twice as large as the Karatsuba multiplier. This is consistent for all the values of $n$ we have synthesized. These results may indicate that there is more which can be done to optimize the area of the auto-generated multiplier.

It is in the delay performance that the auto-generated design begins to show a real benefit. Figure 2 compares the delay characteristics of the recursive design with those of the other multipliers. As can be seen in the chart, the auto-generated design is able to outperform the Karatsuba multiplier at higher operand sizes.

Table II
Karatsuba-Ofman [7] ASIC Estimates

Size      XOR gates   AND gates   Delay
4-bit     9           16          $T_A + 2T_X$
8-bit     55          48          $T_A + 6T_X$
16-bit    225         144         $T_A + 10T_X$
32-bit    799         432         $T_A + 14T_X$
64-bit    2,649       1,296       $T_A + 18T_X$
128-bit   8,455       3,888       $T_A + 22T_X$

Table III
Mastrovito Multiplier [2] ASIC Estimates

Size          XOR gates   AND gates   Delay
4-bit [4]     15          16          $T_A + 3T_X$
8-bit [6]     72          64          $T_A + 6T_X$
16-bit [6]    285         256         $T_A + 7T_X$
32-bit [6]    1,085       1,024       $T_A + 9T_X$
64-bit [6]    4,160       4,096       $T_A + 9T_X$
128-bit [6]   16,637      16,384      $T_A + 10T_X$

Even when compared to the ideal Mastrovito multiplier, the auto-generated multiplier remains very competitive. The gap between the two never exceeds one $T_X$ (XOR delay) for any operand size. In both cases where the auto-generated multiplier lags behind the ideal Mastrovito multiplier, the irreducible polynomial is of the form $x^n + x^4 + x^3 + x + 1$.
For this type of pentanomial, the synthesis tool seems unable to completely optimize partial product generation. This may in fact be due to a limitation of auto-generation, or it may simply be a weakness in the synthesis tool. More detailed analysis of the resulting networks will need to be performed before results like [6] can be obtained for these polynomials.

VII. FPGA IMPLEMENTATION

Automatic generation of multipliers in $GF(2^n)$ is particularly interesting for FPGAs, given the large number of polynomials that are available in the standard. For more security, several designs could be generated and replaced periodically. The flexibility of the FPGA, combined with the auto-generation capabilities, would allow for rapid changes and make periodic design changes quick and cost effective.

Given the resources available within an FPGA, pipelining the multiplier becomes a very effective way to improve performance without much of an impact on size. By pipelining the auto-generated multiplier, the performance hit caused by the nature of FPGA implementations could be significantly reduced. This change will greatly improve the throughput of the auto-generated multiplier.

Fig. 2. Our auto-generated multiplier outperforms the Karatsuba multiplier for larger values of $n$ and remains very close in performance to the ideal Mastrovito multiplier.

Fig. 3. Block diagram of pipelined FPGA implementation.

The original recursive design was modified to include input and output registers as well as registers to hold each partial product. Flip-flops were also added to the recursive XOR trees. The flip-flops were inserted between every other level of the tree structure as shown in Fig. 3. The recursive VHDL allowed for easy integration of these flip-flops into the XOR trees. For details see Section VIII. The results of the FPGA synthesis for the pipelined implementation are shown in Table IV.

For the target of our FPGA synthesis we again used Leonardo. We chose Xilinx VIRTEX as our target family. The specific device was varied depending on the space requirements of the multiplier being generated. To study the performance of the pipelined recursive multiplier, a non-pipelined version was also synthesized to the same FPGAs. The synthesis results for this multiplier are shown in Table V.

Table V
Non-Pipelined Multiplier FPGA Results

Size      LUTs     CLB Slices   Delay (ns)
4-bit     13       7            5.67
8-bit     84       42           5.8
16-bit    324      162          9.0
32-bit    1,365    683          2.99
64-bit    5,435    2,718        22.30
128-bit   21,760   10,880       24.95

As expected, the throughput of the auto-generated multiplier was greatly increased by pipelining. See Fig. 4 for a comparison of throughput for the two designs. While the latency of a single calculation would be less in the non-pipelined design, the ability of the pipelined multiplier to complete a multiplication every clock cycle makes it faster when considering a large number of multiplications performed back to back.

Fig. 4. Pipelined performance does not include cold start.

Table IV
Pipelined Recursive Multiplier FPGA Results
Size      LUTs     CLB Slices   Dff      Clk (MHz)
4-bit     2        14           28       93
8-bit     86       44           88       20
16-bit    358      184          368      3
32-bit    1,363    682          1,259    05
64-bit    5,592    2,819        5,638    0
128-bit*  9,0 8,0 00

*Unoptimized, only an initial pass could be made.

Currently, the main limiting factor in performance of the pipelined recursive design is the partial product generation stage. To allow the synthesis tool to freely optimize partial product generation, D flip-flops cannot be inserted into this stage prior to synthesis. The key to improving performance at larger operand sizes will be to modify the partial product generation to allow for automated pipelining without impairing the synthesis tool's ability to optimize this stage based on the polynomial.

VIII. PIPELINING THE RECURSIVE VHDL TREES

To make the multiplier more effective in an FPGA environment, the design was pipelined to increase throughput and clock speed. To pipeline the recursive tree structures, additional generic parameters are passed between levels. The generic values indicate whether or not the output of that level should be latched. One value sets the number of recursive levels in between latches. The second determines which levels will actually be latched. This inherent flexibility allows for the pipelined recursive tree to be optimized for clock frequency and target device. The code for tree was modified in the following manner:

    BEGIN -- beginning of architecture body
      ...
      compound_tree: IF d > 1 GENERATE
        -- L = 0: register the output of this level
        latched_level: IF L = 0 GENERATE
          the_gate: xorn
            GENERIC MAP(n => n)
            PORT MAP(a => xor_input, z => temp_out);
          tree_ff: PROCESS(reset, clk, temp_out)
          BEGIN
            IF reset = '1' THEN
              output <= '0';
            ELSIF clk'EVENT AND clk = '1' THEN
              output <= temp_out;
            END IF;
          END PROCESS tree_ff;
          subtree_array: FOR i IN 0 TO n-1 GENERATE
            the_subtree: tree
              GENERIC MAP(d => d-1, n => n, L => c, c => c)
              PORT MAP(input  => input(i*n**(d-1) TO (i+1)*n**(d-1)-1),
                       output => xor_input(i),
                       clk => clk, reset => reset);
          END GENERATE subtree_array;
        END GENERATE latched_level;
        -- L > 0: same structure as the original tree, no register
        plain_level: IF L > 0 GENERATE
          the_gate2: xorn
            GENERIC MAP(n => n)
            PORT MAP(a => xor_input, z => output);
          subtree_array2: FOR i IN 0 TO n-1 GENERATE
            the_subtree2: tree
              GENERIC MAP(d => d-1, n => n, L => L-1, c => c)
              PORT MAP(input  => input(i*n**(d-1) TO (i+1)*n**(d-1)-1),
                       output => xor_input(i),
                       clk => clk, reset => reset);
          END GENERATE subtree_array2;
        END GENERATE plain_level;
      END GENERATE compound_tree;
    END; -- end of architecture body

L is used to determine if a particular level within the tree should be latched. c is the number of levels between latches minus one. When L = 0, a D flip-flop is created to hold the output of the XOR gate. Each sub-tree created for the levels above will have L mapped to c. For L > 0, the level is created almost identically to the original recursive code, except for the additional generic and port mappings. In this case, the sub-trees are created with L mapped to L-1. This implementation not only allows for control over the number of pipeline stages within the tree, but also where those stages are inserted. Changing the value of L passed to the top level of the recursive tree will move the latches up or down within the tree.
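The effect of the two generics L and c on register placement can be traced with a short Python model of the generic mapping. This is a hedged sketch: the function name `latched_levels` is hypothetical, and only levels with d > 1 are modeled because the listing elides the other branches, so the d = 1 and d = 0 levels are assumed to carry no register here.

```python
def latched_levels(d: int, L: int, c: int) -> list:
    """Return the levels (values of d greater than 1) whose output is registered.

    d is the depth passed to the top-level tree, L the initial countdown,
    and c the value that reloads the countdown after each registered level.
    """
    levels = []
    level = d
    while level > 1:               # only the compound (d > 1) case is modeled
        if L == 0:
            levels.append(level)   # a flip-flop holds this level's XOR output
            L = c                  # subtrees are created with L mapped to c
        else:
            L = L - 1              # subtrees are created with L mapped to L-1
        level -= 1
    return levels


print(latched_levels(6, 0, 1))     # prints [6, 4, 2]
print(latched_levels(6, 1, 1))     # prints [5, 3]
print(latched_levels(6, 0, 2))     # prints [6, 3]
```

With c = 1 the registers land on every other level, and changing only the initial L from 0 to 1 shifts the whole pattern down by one level while leaving the top gate unregistered.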
Moving the latches in this way would allow the final result to be latched within the tree, or allow the last level(s) to be used and then latched outside of tree. c can be adjusted to improve clock frequency or overall latency depending on the implementation. Larger values of c will mean fewer pipeline stages with a lower clock frequency. Smaller values would create more stages and allow for a higher clock frequency.

IX. CONCLUSION

As the development environment continues to call for shorter and shorter cycles, auto-generation will become more and more critical to successful product development. Moreover, the ability to generate a dedicated solution based only on general parameters is of great interest to a system that has FPGAs. The results obtained with the synthesis of this flexible VHDL description for both AMI-CMOS and FPGAs are comparable to those created by traditional and optimized design methods. The results show that automatic generation of polynomial basis multipliers in $GF(2^n)$ is a quick, flexible and efficient alternative to traditional design.

It was when attempting to pipeline the auto-generated multiplier that the real advantage of the recursive VHDL became apparent. The ability of recursion to more accurately model tree structures makes it clearly superior to cruder descriptions. When used effectively, recursion can greatly improve the development of tree structures in VHDL.

The test results we acquired in terms of delay versus area are promising, and we hope that this design approach for auto-generating multipliers in $GF(2^n)$ can still be improved to lead to better optimized designs and also to generate new ideas for other designs.

X. ACKNOWLEDGEMENTS

The authors would like to acknowledge the partial support of NSF from the CAREER Grant CCR-0093434 - Computer Arithmetic Algorithms and Scalable Hardware Designs for Cryptographic Applications.

REFERENCES

[1] S. B. Wicker and V. K. Bhargava, Reed-Solomon Codes and Their Applications. IEEE Press, New York.
[2] E. D. Mastrovito, "VLSI designs for multiplication over finite fields GF(2^m)," in Proceedings of the 6th International Conference on Applied Algebra, Algebraic Algorithms and Error-Correcting Codes. Springer-Verlag, 1989, pp. 297-309.
[3] IEEE P1363, "Standard specifications for public key cryptography," Draft 13, November 1999, http://grouper.ieee.org/groups/1363/.
[4] B. Sunar and Ç. K. Koç, "Mastrovito multiplier for all trinomials," IEEE Trans. Comput., vol. 48, no. 5, pp. 522-527, 1999.
[5] A. Halbutoğulları and Ç. K. Koç, "Mastrovito multiplier for general irreducible polynomials," IEEE Trans. Comput., vol. 49, no. 5, pp. 503-518, 2000.
[6] A. Reyhani-Masoleh and M. A. Hasan, "On low complexity bit parallel polynomial basis multipliers," in Cryptographic Hardware and Embedded Systems, ser. Lecture Notes in Computer Science, No. 2779, C. D. Walter, Ç. K. Koç, and C. Paar, Eds. Springer, Berlin, Germany, 2003, pp. 189-202.
[7] F. R. Henriquez, "Research problem: Fully parallel multipliers for GF(2^m)," CINVESTAV-IPN #2508, http://delta.cs.cinvestav.mx/ adiaz/reccomp2003/rproblems.pdf.
[8] P. J. Ashenden, "Recursive and repetitive hardware models in VHDL," Technical Report TR 160/12/93/ECE, Department of Electrical and Computer Engineering, University of Cincinnati, also published as Technical Report 93-19, Department of Computer Science, The University of Adelaide, Australia.