
IEEE TRANSACTIONS ON COMPUTERS, VOL. N, NO. N, MM X

New Hardware Architectures for Montgomery Modular Multiplication Algorithm

Miaoqing Huang, Member, Kris Gaj, Tarek El-Ghazawi, Senior Member

Abstract: Montgomery modular multiplication is one of the fundamental operations used in cryptographic algorithms, such as RSA and Elliptic Curve Cryptosystems. At CHES 1999, Tenca and Koç proposed the Multiple-Word Radix-2 Montgomery Multiplication (MWR2MM) algorithm and introduced a now-classic architecture for implementing Montgomery multiplication in hardware. With parameters optimized for minimum latency, this architecture performs a single Montgomery multiplication in approximately $2n$ clock cycles, where $n$ is the size of operands in bits. In this paper we propose two new hardware architectures that are able to perform the same operation in approximately $n$ clock cycles with almost the same clock period. These two architectures are based on pre-computing partial results using two possible assumptions regarding the most significant bit of the previous word. These two architectures outperform the original architecture of Tenca and Koç in terms of the product latency times area by 3% and 5%, respectively, for several most common operand sizes used in cryptography. The architecture in radix-2 can be extended to the case of radix-4, while preserving a factor of two speed-up over the corresponding radix-4 design by Tenca, Todorov, and Koç from CHES 2001. Our optimization has been verified by modeling it using Verilog-HDL, implementing it on Xilinx Virtex-II 6000 FPGA, and experimentally testing it using SRC-6 reconfigurable computer.

Index Terms: Montgomery Multiplication, MWR2MM Algorithm, Hardware Optimization, Field-Programmable Gate Arrays

1 INTRODUCTION

SINCE the introduction of the RSA algorithm [1] in 1978, high-speed and space-efficient hardware architectures for modular multiplication have been a subject of constant interest for more than 30 years.
During this period, one of the most useful advances came with the introduction of the Montgomery multiplication algorithm due to Peter L. Montgomery [2]. Montgomery multiplication is the basic operation of the modular exponentiation, which is required in the RSA public-key cryptosystem. It is also used in Elliptic Curve Cryptosystems, and several methods of factoring, such as ECM, p-1, and Pollard's "rho" method, as well as in many other cryptographic and cryptanalytic transformations [3].

At CHES 1999, Tenca and Koç introduced a word-based algorithm for Montgomery multiplication, called Multiple-Word Radix-2 Montgomery Multiplication (MWR2MM), as well as a scalable hardware architecture capable of executing this algorithm [4], [5]. Several follow-up designs based on the MWR2MM algorithm have been proposed in order to reduce the computation time [6] []. In [6], a high-radix word-based Montgomery algorithm (MWR$2^k$MM) was proposed using the Booth encoding technique. Although the number of scanning steps was reduced, the complexity of control and computational logic increased substantially at the same time. In [7], Harris et al. implemented the MWR2MM algorithm in a quite different way, i.e., left shifting Y and M instead of right shifting S. Their approach was able to process an n-bit precision Montgomery multiplication in approximately $n$ clock cycles, while keeping the scalability and simplicity of the original implementation.

M. Huang is with the Department of Computer Science and Computer Engineering, University of Arkansas, Fayetteville, AR 72701, USA, mqhuang@uark.edu. K. Gaj is with the Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA 22030, USA, kgaj@gmu.edu. T. El-Ghazawi is with the Department of Electrical and Computer Engineering, The George Washington University, Washington, DC 20052, USA, tarek@gwu.edu. Manuscript received Sept., 8, revised Dec. 9, accepted Jan..
In [8] and [9], the left-shifting technique was applied on the radix-2 and radix-4 versions of the parallelized Montgomery algorithm [10], respectively. In [11], Michalski and Buell introduced a MWR$2^k$MM algorithm, which is derived from The Finely Integrated Operand Scanning Method described in [12]. The MWR$2^k$MM algorithm requires the built-in multipliers in the FPGA device to speed up the computation. This feature makes the implementation expensive. The systolic high-radix design by McIvor et al. described in [13] is also capable of very high speed operation, but suffers from the same disadvantage of large area requirements for fast multiplier units. A different approach based on processing multi-precision operands in carry-save form has been presented in [14]. This architecture is optimized for the minimum latency and is particularly suitable for repeated sequences of Montgomery multiplications, such as the sequence used in modular exponentiations (e.g., RSA).

In this paper, we focus on the optimization of hardware architectures for MWR2MM and MWR4MM algorithms in order to minimize the number of clock cycles required to compute an n-bit precision Montgomery multiplication. We start with the introduction of Montgomery multiplication in Section 2. Then, the classic MWR2MM architecture is discussed. The new optimized architecture, which is able to perform the n-bit precision MWR2MM algorithm in approximately $n$ clock cycles, is presented in Section 3. In Section 4, we propose an alternative optimized architecture that is able to achieve the same performance goal with simpler logic design. In Section 5, the high-radix version of our new architecture is introduced. In Section 6, we first compare our two optimized architectures with three previous architectures from the conceptual point of view. Then, the hardware implementations of all discussed architectures are presented and contrasted with each other. Finally, in Section 7, we present the summary and conclusions for this work.

2 MONTGOMERY MULTIPLICATION ALGORITHM

Let $M > 0$ be an odd integer. In many cryptosystems, such as RSA, computing $X \cdot Y \pmod{M}$ is a crucial operation. The reduction of $X \cdot Y \pmod{M}$ is a more time-consuming step than the multiplication $X \cdot Y$ without reduction. In [2], Montgomery introduced a method for calculating products (mod M) without the costly reduction (mod M), since then known as Montgomery multiplication. Montgomery multiplication of X and Y (mod M), denoted by $MP(X, Y, M)$, is defined as $X \cdot Y \cdot 2^{-n} \pmod{M}$ for some fixed integer $n$.

Since Montgomery multiplication is not an ordinary multiplication, there is a conversion process between the ordinary domain (with ordinary multiplication) and the Montgomery domain. The conversion between the ordinary domain and the Montgomery domain is given by the relation $X \leftrightarrow X'$, where $X' = X \cdot 2^n \pmod{M}$. The corresponding diagram is shown in Table 1.

TABLE 1
Conversion between ordinary and Montgomery domains

Ordinary Domain    <->   Montgomery Domain
$X$                <->   $X' = X \cdot 2^n \pmod{M}$
$Y$                <->   $Y' = Y \cdot 2^n \pmod{M}$
$XY$               <->   $(XY)' = X \cdot Y \cdot 2^n \pmod{M}$

Table 1 shows that the conversion is compatible with multiplications in each domain, since

$MP(X', Y', M) \equiv X' \cdot Y' \cdot 2^{-n} \equiv (X \cdot 2^n)(Y \cdot 2^n) \cdot 2^{-n}$   (1a)
$\equiv X \cdot Y \cdot 2^n \equiv (X \cdot Y)' \pmod{M}.$   (1b)
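The compatibility stated in (1a) and (1b) can be checked numerically. The following is a minimal sketch, not a hardware algorithm: `mont_product` and `to_montgomery` are hypothetical helper names introduced here, and the product is evaluated directly from the definition with a modular inverse of $2^n$ (Python 3.8 or later is assumed for `pow` with a negative exponent).

```python
def mont_product(a, b, M, n):
    """Reference value of MP(a, b, M) = a*b*2^(-n) mod M, taken from the definition."""
    return (a * b * pow(2, -n, M)) % M  # modular inverse of 2^n exists because M is odd


def to_montgomery(x, M, n):
    """Map x from the ordinary domain to the Montgomery domain: x' = x*2^n mod M."""
    return (x << n) % M


M = 239                      # illustrative odd modulus (an assumption, not from the paper)
n = M.bit_length()           # n = floor(log2 M) + 1
X, Y = 57, 201

Xm, Ym = to_montgomery(X, M, n), to_montgomery(Y, M, n)
# Multiplying inside the Montgomery domain gives the image of the ordinary product.
assert mont_product(Xm, Ym, M, n) == to_montgomery((X * Y) % M, M, n)
```

The check mirrors Table 1 row by row: both operands are mapped into the Montgomery domain, multiplied there, and compared with the mapped ordinary product.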
The conversion between each domain can be done using the same Montgomery operation, in particular $X' = MP(X, 2^{2n} \pmod{M}, M)$ and $X = MP(X', 1, M)$, where $2^{2n} \pmod{M}$ can be precomputed. Despite the initial conversion cost, we achieve an advantage over ordinary multiplication if we do many Montgomery multiplications followed by an inverse conversion at the end, which is the case, for example, in RSA.

Algorithm 1: Radix-2 Montgomery Multiplication
Input: odd $M$, $n = \lfloor \log_2 M \rfloor + 1$, $X = \sum_{i=0}^{n-1} x_i 2^i$, with $0 \le X, Y < M$
Output: $Z = MP(X, Y, M) \equiv X \cdot Y \cdot 2^{-n} \pmod{M}$, $0 \le Z < M$
1.1  $S[0] = 0$;
1.2  for $i = 0$ to $n - 1$ step 1 do
1.3      $q_i = (x_i \cdot Y_0) \oplus S[i]_0$;
1.4      $S[i+1] = (S[i] + x_i \cdot Y + q_i \cdot M)/2$;
1.5  if $S[n] > M$ then
1.6      $S[n] = S[n] - M$;
1.7  return $Z = S[n]$;

Algorithm 1 shows the pseudocode for the radix-2 Montgomery multiplication, where we choose $n = \lfloor \log_2 M \rfloor + 1$; $n$ is the size of $M$ in bits. The verification of the above algorithm is given below. Let us define $S[i]$ as

$S[i] \equiv \frac{1}{2^i} \left( \sum_{j=0}^{i-1} x_j 2^j \right) Y \pmod{M}$   (2)

with $S[0] = 0$. Then, $S[n] \equiv X \cdot Y \cdot 2^{-n} \pmod{M} = MP(X, Y, M)$. $S[n]$ can be computed iteratively using the following dependence:

$S[i+1] \equiv \frac{1}{2^{i+1}} \left( \sum_{j=0}^{i} x_j 2^j \right) Y$   (3a)
$\equiv \frac{1}{2^{i+1}} \left( \sum_{j=0}^{i-1} x_j 2^j + x_i 2^i \right) Y$   (3b)
$\equiv \frac{1}{2} \left( \frac{1}{2^i} \left( \sum_{j=0}^{i-1} x_j 2^j \right) Y + x_i Y \right)$   (3c)
$\equiv \frac{1}{2} (S[i] + x_i Y) \pmod{M}.$   (3d)

Therefore, depending on the parity of $S[i] + x_i Y$, we compute $S[i+1]$ as

$S[i+1] = \frac{S[i] + x_i Y}{2}$ or $\frac{S[i] + x_i Y + M}{2}$,   (4)

to make the numerator divisible by 2. Since $Y < M$ and $S[0] = 0$, one has $0 \le S[i] < 2M$ for all $0 \le i < n$. In [15], [16], it is shown that the result of a Montgomery multiplication $X \cdot Y \cdot 2^{-n} \pmod{M} < 2M$ when $X, Y < 2M$ and $2^n > 4M$. As a result, by redefining $n$ to be the smallest integer such that $2^n > 4M$, the subtraction at the end of Algorithm 1 can be avoided and the output of the multiplication can be directly used as an input for the next Montgomery multiplication.

3 OPTIMIZING MWR2MM ALGORITHM

In [4], [5], Tenca and Koç proposed a scalable architecture based on the Multiple-Word Radix-2 Montgomery Multiplication Algorithm (MWR2MM), shown as Algorithm 2. In Algorithm 2, the operand Y (multiplicand) is scanned word-by-word, and the operand X is scanned bit-by-bit.

Algorithm 2: Multiple-Word Radix-2 Montgomery Multiplication Algorithm [4]
Input: odd $M$, $n = \lfloor \log_2 M \rfloor + 1$, word size $w$, $e = \lceil (n+1)/w \rceil$, $X = \sum_{i=0}^{n-1} x_i 2^i$, $Y = \sum_{j=0}^{e-1} Y^{(j)} 2^{wj}$, $M = \sum_{j=0}^{e-1} M^{(j)} 2^{wj}$, with $0 \le X, Y < M$
Output: $Z = \sum_{j=0}^{e-1} S^{(j)} 2^{wj} = MP(X, Y, M) \equiv X \cdot Y \cdot 2^{-n} \pmod{M}$, $0 \le Z < 2M$
2.1  $S = 0$;   /* initialize all words of S */
2.2  for $i = 0$ to $n - 1$ step 1 do
2.3      $q_i = (x_i \cdot Y^{(0)}_0) \oplus S^{(0)}_0$;
2.4      $(C^{(1)}, S^{(0)}) = x_i \cdot Y^{(0)} + q_i \cdot M^{(0)} + S^{(0)}$;
2.5      for $j = 1$ to $e$ step 1 do
2.6          $(C^{(j+1)}, S^{(j)}) = C^{(j)} + x_i \cdot Y^{(j)} + q_i \cdot M^{(j)} + S^{(j)}$;
2.7          $S^{(j-1)} = (S^{(j)}_0, S^{(j-1)}_{w-1..1})$;
2.8      $S^{(e)} = 0$;
2.9  return $Z = S$;

The operand length is $n$ bits, and the word length is $w$ bits. $e = \lceil (n+1)/w \rceil$ words are required to store S since its range is $[0, 2M - 1]$. The original M and Y are extended by one extra bit of 0 as the most significant bit. Presented as vectors, $M = (0, M^{(e-1)}, \ldots, M^{(1)}, M^{(0)})$, $Y = (0, Y^{(e-1)}, \ldots, Y^{(1)}, Y^{(0)})$, $S = (0, S^{(e-1)}, \ldots, S^{(1)}, S^{(0)})$, and $X = (x_{n-1}, \ldots, x_1, x_0)$.

The carry variable $C^{(j)}$ has two bits, as explained below. Assuming $C^{(0)} = 0$, each subsequent value of $C^{(j+1)}$ is given by $(C^{(j+1)}, S^{(j)}) = C^{(j)} + x_i \cdot Y^{(j)} + q_i \cdot M^{(j)} + S^{(j)}$. Assuming that $C^{(j)} \le 3$, we obtain

$(C^{(j+1)}, S^{(j)}) = C^{(j)} + x_i \cdot Y^{(j)} + q_i \cdot M^{(j)} + S^{(j)} \le 3 + 3 \cdot (2^w - 1) = 3 \cdot 2^w \le 2^{w+2} - 1.$   (5)

From (5), we have $C^{(j+1)} \le 3$. By induction, $C^{(j)} \le 3$ is ensured for any $0 \le j \le e$. Additionally, based on the fact that $S \le 2M - 1$, we have $C^{(e)} \le 1$.

The data dependency graph of the hardware implementation for the MWR2MM algorithm by Tenca and Koç is shown in Fig. 1. Each circle in the graph represents an atomic computation and is labeled according to the type of action performed. Task A consists of computing lines 2.3 and 2.4 in Algorithm 2. Task B corresponds to computing lines 2.6 and 2.7 in Algorithm 2. The data dependencies among the operations within the j loop make it impossible to execute the steps in a single iteration of the j loop in parallel. However, parallelism is possible among executions of different iterations of the i loop.
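As a software cross-check of Algorithms 1 and 2, the sketch below models both in plain Python integers. It is an illustrative model under stated assumptions, not the authors' implementation: the function names `mont_radix2` and `mwr2mm` and the default word size are hypothetical, and the final comparison in `mont_radix2` uses `>=` instead of the `>` of line 1.5 so that the result is always fully reduced.

```python
def mont_radix2(X, Y, M):
    """Bit-serial model of Algorithm 1; returns X*Y*2^(-n) mod M with n = bit length of M."""
    n = M.bit_length()
    S = 0
    for i in range(n):
        x = (X >> i) & 1
        q = (x & Y & 1) ^ (S & 1)        # q_i makes S + x*Y + q*M even
        S = (S + x * Y + q * M) >> 1     # exact division by 2
    return S - M if S >= M else S        # assumption: '>=' used for a fully reduced result


def mwr2mm(X, Y, M, w=8):
    """Word-serial model of Algorithm 2; returns Z < 2M congruent to X*Y*2^(-n) mod M."""
    n = M.bit_length()
    e = -(-(n + 1) // w)                 # e = ceil((n + 1) / w)
    mask = (1 << w) - 1
    # Words 0..e; word e is the zero extension of Y and M.
    Yw = [(Y >> (w * j)) & mask for j in range(e + 1)]
    Mw = [(M >> (w * j)) & mask for j in range(e + 1)]
    S = [0] * (e + 1)
    for i in range(n):
        x = (X >> i) & 1
        q = (x & Yw[0] & 1) ^ (S[0] & 1)                  # line 2.3
        t = x * Yw[0] + q * Mw[0] + S[0]                  # line 2.4
        C, S[0] = t >> w, t & mask
        for j in range(1, e + 1):
            t = C + x * Yw[j] + q * Mw[j] + S[j]          # line 2.6
            C, S[j] = t >> w, t & mask
            S[j - 1] = ((S[j] & 1) << (w - 1)) | (S[j - 1] >> 1)  # line 2.7
        S[e] = 0                                          # line 2.8
    return sum(S[j] << (w * j) for j in range(e))
```

Both functions should agree modulo M for any odd M and operands below M; `mwr2mm` returns the unreduced value in $[0, 2M - 1]$, matching the range stated for S.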
In [4], Tenca and Koç suggested that each column in the graph may be computed by a separate processing element (PE), and the data generated from one PE may be passed into another PE in a pipelined fashion. Following this method, all atomic computations represented by circles in the same row can be processed concurrently.

Fig. 1. Data dependency graph of the original architecture of MWR2MM algorithm [4].

The processing of each column takes $e + 1$ clock cycles (1 clock cycle for Task A, $e$ clock cycles for Task B). Because there is a delay of 2 clock cycles between the processing of a column for $x_i$ and the processing of a column for $x_{i+1}$, the minimum computation time $T$ (in clock cycles) is $T = 2n + e - 1$, given that $P_{max} = \lceil (e+1)/2 \rceil$ PEs are implemented to work in parallel. In this configuration, after $e + 1$ clock cycles, PE #0 switches from executing column 0 to executing column $P_{max}$. After another two clock cycles, PE #1 switches from executing column 1 to executing column $P_{max} + 1$, etc.

The opportunity of improving the implementation performance of Algorithm 2 is to reduce the delay between the processing of two subsequent iterations of the i loop from 2 clock cycles to 1 clock cycle. The 2-clock-cycle delay comes from the right shift (division by 2) in both Algorithm 1 and Algorithm 2. Take the first two PEs in Fig. 1 for example. These two PEs compute the S words in the first two columns. Starting from clock #0, PE #1 has to wait for two clock cycles before it starts the computation of $S^{(0)}(i = 1)$ in the clock cycle #2.
In order to reduce the 2-clock-cycle delay to half, we propose an approach to pre-computing the partial results using two possible assumptions regarding the most significant bit of the previous word. As shown in Fig. 2, PE #1 can take the $w - 1$ most significant bits of $S^{(0)}(i = 0)$ from PE #0 at the beginning of clock #1, do a right shift, and compute two versions of $S^{(0)}(i = 1)$, based on the two different assumptions about the most significant bit of this word at the start of computations. At the beginning of the clock cycle #2, the previously missing bit becomes available as the least significant bit of $S^{(1)}(i = 0)$. This bit can be used to choose between the two precomputed versions of $S^{(0)}(i = 1)$. Similarly, in the clock cycle #2, two different versions of $S^{(0)}(i = 2)$ and $S^{(1)}(i = 1)$ are computed by PE #2 and PE #1 respectively, based on two different assumptions about the most significant bits of these words at the start of computations. At the beginning of the clock cycle #3, the previously missing bits become available as the least significant bits of $S^{(1)}(i = 1)$ and $S^{(2)}(i = 0)$, respectively. These two bits can be used to choose between the two precomputed versions of these words. The same pattern of computations is repeated in subsequent clock cycles. Furthermore, since $e$ words are enough to represent the values in S, $S^{(e)}$ is discarded in our designs. Therefore, $e$ clock cycles are required to compute one iteration of S.

Fig. 2. Data operation in the optimized architecture (Architecture 1) (S words belonging to the same i loop share the same background pattern).

The proposed optimization technique can be applied onto both non-redundant and redundant representation of the partial sum S, as demonstrated in Fig. 3. It is logically straightforward to apply the approach when S is represented in non-redundant form because each digit of S consists of only one bit. When S is represented in redundant Carry-Save (CS) form, each digit of S consists of two bits, the sum (SS) bit and the carry (SC) bit. As shown in Fig. 3(b) and Fig. 3(c), after the update of $S^{(j)}$, only the sum bit of $S^{(j+1)}_0$, i.e., $SS^{(j+1)}_0$, is missing in order to determine a full word $S^{(j)}$ after right shift. The carry bit, $SC^{(j+1)}_0$, has been already computed and can be forwarded to the next PE together with $S^{(j)}_{w-1..1}$. Then, the same approach can be applied to update $S^{(j)}$. In the remainder of this paper, we use the non-redundant form in all the diagrams and description for the sake of simplicity. The corresponding diagrams and implementations in redundant format can be derived from the non-redundant case accordingly.

Fig. 3. Update of an S word: (a) S is represented in non-redundant form, (b) S is represented in redundant form, (c) Logic diagram of an update of an S word ($w = 3$) in redundant form.

Fig. 4. Data dependency graph of the optimized architecture (Architecture 1) of MWR2MM algorithm (S is represented in non-redundant form).

Algorithm 3: Computations in Task D
Input: $x_i$, $Y^{(0)}$, $M^{(0)}$, $S^{(1)}_0$, $S^{(0)}_{w-1..1}$
Output: $q_i$, $C^{(1)}$, $S^{(0)}_{w-1..1}$
3.1  $q_i = (x_i \cdot Y^{(0)}_0) \oplus S^{(0)}_1$;
3.2  $(CO^{(1)}, SO^{(0)}_{w-1}, S^{(0)}_{w-2..0}) = (1, S^{(0)}_{w-1..1}) + x_i \cdot Y^{(0)} + q_i \cdot M^{(0)}$;
3.3  $(CE^{(1)}, SE^{(0)}_{w-1}, S^{(0)}_{w-2..0}) = (0, S^{(0)}_{w-1..1}) + x_i \cdot Y^{(0)} + q_i \cdot M^{(0)}$;
3.4  if $S^{(1)}_0 = 1$ then
3.5      $C^{(1)} = CO^{(1)}$;
3.6      $S^{(0)}_{w-1..1} = (SO^{(0)}_{w-1}, S^{(0)}_{w-2..1})$;
3.7  else
3.8      $C^{(1)} = CE^{(1)}$;
3.9      $S^{(0)}_{w-1..1} = (SE^{(0)}_{w-1}, S^{(0)}_{w-2..1})$;

Algorithm 4: Computations in Task E
Input: $q_i$, $x_i$, $C^{(j)}$, $Y^{(j)}$, $M^{(j)}$, $S^{(j+1)}_0$, $S^{(j)}_{w-1..1}$
Output: $C^{(j+1)}$, $S^{(j)}_{w-1..1}$, $S^{(j)}_0$
4.1  $(CO^{(j+1)}, SO^{(j)}_{w-1}, S^{(j)}_{w-2..0}) = (1, S^{(j)}_{w-1..1}) + C^{(j)} + x_i \cdot Y^{(j)} + q_i \cdot M^{(j)}$;
4.2  $(CE^{(j+1)}, SE^{(j)}_{w-1}, S^{(j)}_{w-2..0}) = (0, S^{(j)}_{w-1..1}) + C^{(j)} + x_i \cdot Y^{(j)} + q_i \cdot M^{(j)}$;
4.3  if $S^{(j+1)}_0 = 1$ then
4.4      $C^{(j+1)} = CO^{(j+1)}$;
4.5      $S^{(j)} = (SO^{(j)}_{w-1}, S^{(j)}_{w-2..0})$;
4.6  else
4.7      $C^{(j+1)} = CE^{(j+1)}$;
4.8      $S^{(j)} = (SE^{(j)}_{w-1}, S^{(j)}_{w-2..0})$;

The data dependency of the optimized architecture for implementing the MWR2MM algorithm is shown in Fig. 4. Similar to the original implementation by Tenca and Koç, each circle in the graph of Fig. 4 represents an atomic computation. Task D consists of three steps: the computation of $q_i$, the calculation of two sets of possible results, and the selection between these two sets of results using an additional input $S^{(1)}_0$, which becomes available at the end of the processing time for Task D. These three steps are shown in Algorithm 3. Task E corresponds to two steps, as shown in Algorithm 4. The data forwarding of $S^{(j)}_0$ and $S^{(j)}_{w-1..1}$ from one circle to the two circles in the right column takes place at the same time. However, $S^{(j)}_0$ is used for selecting the two partial results of $S^{(j-1)}$, and $S^{(j)}_{w-1..1}$ is used for generating the two partial results of $S^{(j)}$.

The exact approach to avoiding the extra clock cycle delay due to the right shift is detailed as follows by taking Task E as an example. Each PE first computes two versions of $C^{(j+1)}$ and $S^{(j)}_{w-1}$ simultaneously, as shown in Algorithm 4. One version assumes that $S^{(j+1)}_0$ is equal to one, and the other assumes that this bit is equal to zero. Both results are stored in registers. At the same moment, the bit $S^{(j+1)}_0$ becomes available and this PE can output the correct $C^{(j+1)}$ and $S^{(j)}$.
For Task D, the computation of $q_i$ is performed in addition to the computation of $C^{(1)}$ and $S^{(0)}$.

The diagram of the PE logic is given in Fig. 5. The signals at the left and right sides are for the interconnection purpose. The carry C is fed back to the core logic of the same PE. The signal $x_i$ remains unchanged during the computation of a whole column in Fig. 4. $S^{(j)}$ is a word of the final output at the end of the computation of the whole multiplication. The core logic in Fig. 5 consists of two parts, the combinational logic and a finite state machine. The multiplications $x_i \cdot Y^{(j)}$ and $q_i \cdot M^{(j)}$ are shown to be carried out using multiplexers. A row of AND gates is another implementation option. On FPGA devices, the designer may leave the choice of the real implementation up to the synthesis tool for the best performance in terms of the trade-off between speed and area.

The direct implementation of the two branches (i.e., lines 4.1 and 4.2 in Algorithm 4) requires the use of two ripple-carry adders, each of which consists of three $w$-bit inputs and a carry. It is easy to see that these two additions only differ in the most significant bit of the S word and share all remaining operand bits. Therefore, it is desirable to consolidate the shared part of these two additions into one ripple-carry adder with three $(w-1)$-bit inputs and a carry. The remaining separate parts are then carried out using two small adders. Following this implementation, the resource requirements increase only marginally while performing computation for two different cases. When S is represented in redundant form (see Fig. 3(c)), only one additional Full Adder is required to cover the two possible cases of S.

The optimized architecture keeps the scalability of the original architecture described in [4]. Fig. 6 illustrates how to use $p$ PEs to implement the MWR2MM algorithm. Both $M^{(j)}$ and $Y^{(j)}$ are moved from left to right every clock cycle through registers. $S^{(j)}$ has been registered inside each PE. Therefore, it can be passed into the next PE directly.
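The pre-computation idea of Algorithms 3 and 4 can be modeled in software as follows. This is a behavioral sketch under stated assumptions rather than the PE logic itself: the function name `mwr2mm_speculative` is hypothetical, all words of one iteration are processed sequentially instead of in a pipeline, and a few spare bits of word storage are used so that no special last-word (carry) handling is needed. Each word is formed from the $w - 1$ upper bits of the previous iteration's word, both candidate sums (missing most significant bit assumed 0 or 1) are computed, and the late-arriving least significant bit of the next word selects between them.

```python
def mwr2mm_speculative(X, Y, M, w=8):
    """Software model of the shift-before-add schedule with two pre-computed branches."""
    n = M.bit_length()
    e = (n + 2) // w + 1            # assumption: headroom for the unshifted sum (< 4M)
    mask = (1 << w) - 1
    Yw = [(Y >> (w * j)) & mask for j in range(e)]
    Mw = [(M >> (w * j)) & mask for j in range(e)]
    R = [0] * (e + 1)               # unshifted words of the previous iteration; R[e] stays 0
    for i in range(n):
        x = (X >> i) & 1
        q = (x & Yw[0] & 1) ^ ((R[0] >> 1) & 1)      # line 3.1: uses bit 1 of word 0
        carry = 0
        new = [0] * (e + 1)
        for j in range(e):
            low = R[j] >> 1                          # known bits w-1..1, already shifted
            addend = carry + x * Yw[j] + q * Mw[j]
            t_even = low + addend                    # branch assuming missing MSB = 0
            t_odd = low + (1 << (w - 1)) + addend    # branch assuming missing MSB = 1
            t = t_odd if (R[j + 1] & 1) else t_even  # select once bit 0 of word j+1 is known
            new[j], carry = t & mask, t >> w
        R = new
    Z = sum(R[j] << (w * j) for j in range(e))
    return Z >> 1                   # one final right shift yields S[n]
```

Because the shift is applied to the incoming word instead of the outgoing sum, the stored value is twice the S of Algorithm 2, which is why $q_i$ is derived from bit 1 of word 0 and why one extra right shift is applied at the end.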
(Footnote 1: Ripple-carry adders are used when S is represented in non-redundant form. When S is represented in redundant form, carry-save adders should be used instead.)

Fig. 5. The PE logic used in the optimized Architecture 1 of MWR2MM implementation (only the combinational logic in Task E is illustrated, S is represented in non-redundant form).

Fig. 6. The optimized architecture (S is represented in non-redundant form, $i = 0, p, 2p, \ldots$).

Fig. 7. An example of computations for 5-bit operands in Architecture 1 using (a) three PEs ($n = 5$, $w = 2$, $e = 3$, $p = 3$), (b) two PEs ($n = 5$, $w = 2$, $e = 3$, $p = 2$).

The total computation time $T$, in clock cycles, when $p$ stages are used in the pipeline to compute the multiplication for the case with $n$ bits of operand size, is given by

$T = \begin{cases} n + e - 1 & \text{if } e \le p \\ n + k(e - p) + e - 1 & \text{otherwise} \end{cases}$   (6)

where $k = \lfloor n/p \rfloor$.

The first case shown in (6) represents the situation when there are more PEs than the number of words. Then it would take $n$ clock cycles to scan the $n$ bits in X and take another $e - 1$ clock cycles to compute the remaining $e - 1$ words in the last iteration. The second case models the condition when the number of words in the operand is larger than the number of PEs. If we define a kernel cycle as the computation in which $p$ bits of X are processed, then there is an $(e - p)$-clock-cycle extra delay between two kernel cycles. In this case, $k$ complete and one partial kernel cycles are required to process all $n$ bits in X. Overall, the new architecture is capable of reducing the processing latency to half of the latency of the Tenca-Koç design, given the maximum number of PEs.

Fig. 7 demonstrates these two different cases with a simplified example. If $e > p$, the output from the rightmost PE is fed into a queue and processed by the leftmost PE later. This is the example shown in Fig. 7(b). Since there is an $(e - p)$-clock-cycle extra delay between two kernel cycles, the length of the queue $Q$ is determined as

$Q = \begin{cases} 0 & \text{if } e \le p \\ e - p & \text{otherwise.} \end{cases}$   (7)

In order to distinguish this architecture from the other architecture, which is described in Section 4, the architecture discussed in this section is called Architecture 1 hereafter.

4 THE ALTERNATIVE OPTIMIZED HARDWARE ARCHITECTURE OF MWR2MM ALGORITHM

In Section 3, we presented the optimization technique for improving the performance of the original implementation architecture by Tenca and Koç. In this section, we present an alternative optimized hardware architecture for implementing the MWR2MM algorithm. The corresponding data dependency graph is shown in Fig. 8. Similar to the previous data dependency graphs in Fig. 1 and Fig. 4, the computation of each column in Fig. 8 can be processed by one separate PE. Similarly to the graph in Fig. 4, there is only one clock cycle latency between the processing of two adjacent columns in this data dependency graph.

Fig. 8. Data dependency graph of the proposed alternative architecture (Architecture 2) of MWR2MM algorithm (S is represented in non-redundant form).

Fig. 9. Three different approaches for mapping MWR2MM algorithm: (a) The architecture by Tenca and Koç, (b) The proposed Architecture 1, (c) The proposed alternative Architecture 2.

These three data dependency graphs map Algorithm 2 following different strategies, as shown in Fig. 9. In Fig. 1 and Fig. 4, each column corresponds to a single iteration of the i loop and covers all iterations of the j loop, as shown in Fig. 9(a) and Fig. 9(b) respectively. In contrast, each column in Fig. 8 corresponds to a single iteration of the j loop and covers all iterations of the i loop, as shown in Fig. 9(c).

Algorithm 5: Computations in Task F
Input: $q_i$, $x_i$, $C^{(e-1)}$, $Y^{(e-1)}$, $M^{(e-1)}$, $S^{(e-1)}_{w-1..1}$, $C^{(e)}$
Output: $C^{(e)}$, $S^{(e-1)}_{w-1..1}$, $S^{(e-1)}_0$
5.1  $(C^{(e)}, S^{(e-1)}) = (C^{(e)}_0, S^{(e-1)}_{w-1..1}) + C^{(e-1)} + x_i \cdot Y^{(e-1)} + q_i \cdot M^{(e-1)}$;

Following the data dependency graph in Fig. 8, we present an alternative hardware architecture of the MWR2MM algorithm in Fig. 10.
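The latency expression (6) and the queue length (7) given for Architecture 1 can be evaluated with a few lines of code. This is a small sketch assuming $e = \lceil (n+1)/w \rceil$ and $k = \lfloor n/p \rfloor$ as read from the text; the function names `latency_arch1` and `queue_length` are hypothetical.

```python
def latency_arch1(n, w, p):
    """Clock cycles for one n-bit multiplication with p PEs and word size w, per (6)."""
    e = -(-(n + 1) // w)            # e = ceil((n + 1) / w)
    if e <= p:
        return n + e - 1            # enough PEs: no stall between kernel cycles
    k = n // p                      # number of complete kernel cycles
    return n + k * (e - p) + e - 1  # each kernel cycle boundary costs e - p extra cycles


def queue_length(n, w, p):
    """Length of the queue between the rightmost and leftmost PE, per (7)."""
    e = -(-(n + 1) // w)
    return 0 if e <= p else e - p


# The two cases of the 5-bit example (n = 5, w = 2, hence e = 3).
print(latency_arch1(5, 2, 3), queue_length(5, 2, 3))
print(latency_arch1(5, 2, 2), queue_length(5, 2, 2))
```

For a fixed operand size, sweeping `p` from 1 to `e` with these helpers shows how the latency falls toward $n + e - 1$ as PEs are added, which is the scalability trade-off discussed for the pipeline.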
This architecture can finish the computation of Montgomery multiplication of n-bit operands in $n + e - 1$ clock cycles. Furthermore, this alternative design is simpler than the approach given in [4] in terms of control logic and data path logic. Hereafter, we call this alternative architecture Architecture 2.

As shown in Fig. 10(d), Architecture 2 consists of $e$ PEs forming a computation chain. Each PE focuses on the computation of a specific word in S, i.e., PE #j only works on $S^{(j)}$. In other words, each PE corresponds to one fixed round $j$ in the inner loop of Algorithm 2. Meanwhile, all PEs scan different bits of operand X at the same time. The same optimization technique is applied to avoid the extra clock cycle delay due to the right shift. The pseudocode in Algorithm 4 describes the function and internal logic of PE #j. The function of the combinational logic is given by lines 4.1 and 4.2. Lines 4.3 to 4.8 are implemented using two 2-to-1 multiplexers, shown in the diagram to the right of Register. Fig. 11 demonstrates the computations of the first 3 PEs in the first 3 clock cycles.

The internal logic of all PEs is the same except for the two PEs residing at the head and tail of the chain. PE #0, shown in Fig. 10(a) as the cell of type D, is also responsible for computing $q_i$ and has no $C^{(j)}$ input. This PE implements Algorithm 3. PE #(e-1), shown in Fig. 10(c) as type F, has only one internal branch because the most significant bit of $S^{(e-1)}$ is equivalent to $C^{(e)}_0$, which is determined at the beginning of every clock cycle. This PE implements Algorithm 5.

Fig. 10. (a) The internal logic of PE #0 of type D. (b) The internal logic of PE #j of type E. (c) The internal logic of PE #e-1 of type F. (d) The proposed alternative architecture of MWR2MM algorithm, Architecture 2 (S is represented in non-redundant form).

Fig. 11. Data operation in the alternative architecture (Architecture 2).

Two shift registers parallel to the PEs carry $x_i$ and $q_i$, respectively, and do a right shift every clock cycle. Before the start of multiplication, all registers, including the two shift registers and the internal registers of PEs, should be reset to zeros. All the bits of X will be pushed into the shift register one by one, followed by zeros. The second shift register will be filled with values of $q_i$ computed by PE #0 of type D. All the registers can be enabled at the same time after the multiplication process starts because the additions of $Y^{(j)}$ and $M^{(j)}$ will be nullified by the zeros in the two shift registers before the values of $x_0$ and $q_0$ reach a given stage.

The internal register of PE #j keeps the value of $S^{(j)}$ that should be shifted one bit to the right for the next round of calculations. This feature gives us two options to generate the final product.
1) The contents of $S^{(j)}_{w-1..0}$ can be stored in $e$ clock cycles after PE #0 finishes the calculation of the most significant bit of X, i.e., after $n$ clock cycles, and then the circuit can do a right shift on all accumulated bits. Or,
2) One more round of calculation can be performed right after the round with the most significant bit of X. In order to do so, one bit of 0 needs to be pushed into the two shift registers to make sure that the additions of $Y^{(j)}$ and $M^{(j)}$ are nullified and the only operation performed by the circuit is a right shift. Then the contents of $S^{(j)}$ are collected in $e$ clock cycles after PE #0 finishes its extra round of calculations. These words are concatenated to form the final product.

After the final product is generated, there are two methods to collect it. If the internal registers of PEs are disabled after the end of computation, the entire result can be read in parallel after $n + e$ clock cycles. Alternatively, the results can be read word by word in $e$ clock cycles by connecting internal registers of PEs into a shift register chain. The exact way of collecting the results largely depends on the application. For example, in the implementation of RSA, a parallel output would be preferred; while in the ECC computations, reading results word by word may be more appropriate.

5 HIGH-RADIX ARCHITECTURE OF MONTGOMERY MULTIPLICATION

The concepts illustrated in Fig. 4 and Fig. 8 can be adopted to the design of a high-radix hardware architecture of Montgomery multiplication. Instead of scanning one bit of X every time, several bits of X can be scanned together for high-radix cases. Assuming $k$ bits of X are scanned at one time, $2^k$ branches should be covered at the same time to maximize the performance. Considering that the value of $2^k$ increases exponentially as $k$ increments, the design becomes impractical beyond radix-4.

Algorithm 6: Multiple-Word Radix-4 Montgomery Multiplication Algorithm
Input: odd $M$, $n = \lfloor \log_2 M \rfloor + 1$, word size $w$, $e = \lceil (n+1)/w \rceil$, $X = \sum_{i=0}^{\lceil n/2 \rceil - 1} x^{(i)} 4^i$, $Y = \sum_{j=0}^{e-1} Y^{(j)} 2^{wj}$, $M = \sum_{j=0}^{e-1} M^{(j)} 2^{wj}$, with $0 \le X, Y < M$
Output: $Z = \sum_{j=0}^{e-1} S^{(j)} 2^{wj} = MP(X, Y, M) \equiv X \cdot Y \cdot 2^{-n} \pmod{M}$, $0 \le Z < 2M$
6.1  $S = 0$;   /* initialize all words of S */
6.2  for $i = 0$ to $n - 1$ step 2 do
6.3      $q^{(i)} = Func(S^{(0)}_{1..0}, x^{(i)}, Y^{(0)}_{1..0}, M^{(0)}_{1..0})$;   /* $q^{(i)}$ and $x^{(i)}$ are 2-bit long */
6.4      $(C^{(1)}, S^{(0)}) = S^{(0)} + x^{(i)} \cdot Y^{(0)} + q^{(i)} \cdot M^{(0)}$;   /* C is 3-bit long */
6.5      for $j = 1$ to $e - 1$ step 1 do
6.6          $(C^{(j+1)}, S^{(j)}) = C^{(j)} + S^{(j)} + x^{(i)} \cdot Y^{(j)} + q^{(i)} \cdot M^{(j)}$;
6.7          $S^{(j-1)} = (S^{(j)}_{1..0}, S^{(j-1)}_{w-1..2})$;
6.8      $S^{(e-1)} = (C^{(e)}_{1..0}, S^{(e-1)}_{w-1..2})$;
6.9  return $Z = S$;

Following the same definitions regarding words as in Algorithm 2, the radix-4 version of Montgomery multiplication is shown as Algorithm 6. Two bits of X are scanned in one step this time instead of one bit as in Algorithm 2.
While reaching the maximal parallelism, the radix-4 version design takes n + e clock cycles to process n-bit Montgomery multiplication.
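The word-serial structure of Algorithm 6 can be illustrated with a small software model. The following is only a sketch under stated assumptions, not the authors' hardware description: the function names `quotient_digit` and `mwr4mm` are hypothetical, the number of words is assumed to be e = ceil((n+1)/w), X is assumed to be scanned in ceil(n/2) radix-4 digits, and the quotient digit is computed arithmetically (using the fact that an odd number is its own inverse modulo 4) rather than with the look-up table a hardware PE would use.

```python
def quotient_digit(s_low, x, y_low, m_low):
    """Return the 2-bit q with s_low + x*y_low + q*m_low == 0 (mod 4).

    m_low is odd, so m_low is its own inverse modulo 4 (1*1 = 1, 3*3 = 9).
    """
    return (-(s_low + x * y_low) * m_low) % 4


def mwr4mm(X, Y, M, w=16):
    """Software model of multiple-word radix-4 Montgomery multiplication.

    Returns (Z, d) where d is the number of radix-4 digits scanned and
    Z * 4**d == X * Y (mod M), with 0 <= Z < 2*M.
    """
    assert M % 2 == 1 and 0 <= X < M and 0 <= Y < M
    n = M.bit_length()
    e = -(-(n + 1) // w)            # assumed word count: ceil((n+1)/w)
    d = (n + 1) // 2                # assumed digit count: ceil(n/2)
    mask = (1 << w) - 1
    Yw = [(Y >> (w * j)) & mask for j in range(e)]
    Mw = [(M >> (w * j)) & mask for j in range(e)]
    S = [0] * e

    for i in range(d):
        x = (X >> (2 * i)) & 3                      # next radix-4 digit of X
        q = quotient_digit(S[0] & 3, x, Yw[0] & 3, Mw[0] & 3)
        t = S[0] + x * Yw[0] + q * Mw[0]
        C, S[0] = t >> w, t & mask                  # C never exceeds 3 bits
        for j in range(1, e):
            t = C + S[j] + x * Yw[j] + q * Mw[j]
            C, S[j] = t >> w, t & mask
            # two-bit right shift across the word boundary
            S[j - 1] = ((S[j] & 3) << (w - 2)) | (S[j - 1] >> 2)
        S[e - 1] = ((C & 3) << (w - 2)) | (S[e - 1] >> 2)

    Z = sum(S[j] << (w * j) for j in range(e))
    return Z, d


if __name__ == "__main__":
    Z, d = mwr4mm(7, 11, 13)
    print(Z, d, (Z * 4**d - 7 * 11) % 13)   # last value is 0 when Z is correct
```

The model returns a value below 2M that is congruent to X·Y·4^(-d) modulo M, which matches the output condition of the algorithm when d = n/2. A final conditional subtraction of M, if needed, is left to the caller.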

Fig. 12. Data operation in Harris' architecture [7] (bits with the gray background are ignored due to the left shift).

The carry variable C has 3 bits, which can be proven in a similar way to the proof of the radix-2 case. The value of $q^{(i)}$ at line 6.3 of Algorithm 6 is defined by a function involving $S^{(0)}_{1..0}$, $x^{(i)}$, $Y^{(0)}_{1..0}$ and $M^{(0)}_{1..0}$ so that (8) is satisfied.

$$S^{(0)}_{1..0} + x^{(i)} \cdot Y^{(0)}_{1..0} + q^{(i)} \cdot M^{(0)}_{1..0} \equiv 0 \pmod{4} \qquad (8)$$

Since M is odd, $M^{(0)}_0 = 1$. From (8), we can derive

$$q^{(i)}_0 = S^{(0)}_0 \oplus (x^{(i)}_0 \cdot Y^{(0)}_0) \qquad (9)$$

where $x^{(i)}_0$ and $q^{(i)}_0$ denote the least significant bit of $x^{(i)}$ and $q^{(i)}$, respectively. The bit $q^{(i)}_1$ is a function of only seven one-bit variables and can be computed using a relatively small look-up table.

The multiplication by 3, which is necessary to compute $x^{(i)} \cdot Y^{(j)}$ and $q^{(i)} \cdot M^{(j)}$, can be done on the fly or avoided by using Booth recoding as discussed in [6]. Using Booth recoding would require adjusting the algorithm and architecture to deal with signed operands.

Furthermore, we can generalize Algorithm 6 to handle the MWR$2^k$MM algorithm. In general, $x^{(i)}$ and $q^{(i)}$ are both k-bit variables. $x^{(i)}$ is a k-bit digit of X, and $q^{(i)}$ is defined by (10).

$$S^{(0)} + x^{(i)} \cdot Y^{(0)} + q^{(i)} \cdot M^{(0)} \equiv 0 \pmod{2^k} \qquad (10)$$

Nevertheless, the implementation of the proposed optimization for the case of $k > 2$ would be impractical in the majority of applications.

6 HARDWARE IMPLEMENTATION AND COMPARISON OF DIFFERENT ARCHITECTURES

In this section, we compare five major types of architectures for Montgomery multiplication from the point of view of the number of PEs and the latency in clock cycles.

In the architecture by Tenca and Koç, the number of PEs can vary between one and $P_{max} = \lceil (e+1)/2 \rceil$. The larger the number of PEs, the smaller the latency, but the larger the circuit area. This feature allows the designer to choose the best possible trade-off between these two requirements.

The architecture by Harris et al. [7] has similar scalability to the original architecture by Tenca and Koç [4]. Instead of right-shifting the intermediate values, their architecture left-shifts Y and M to avoid the data dependency between $S^{(j)}$ and $S^{(j-1)}$. The data processing diagram of Harris' architecture is shown in Fig. 12. For the design with the number of PEs optimized for minimum latency, the architecture by Harris reduces the number of clock cycles from $2n + e - 1$ (for Tenca and Koç [4]) to $n + 2e - 1$.

Our optimized architecture, Architecture 1, is built using similar concepts to the architecture by Tenca and Koç. However, it is able to reduce the processing latency to approximately half while preserving the scalability of the original architecture.

Our alternative architecture, Architecture 2, and the architecture by McIvor et al. both have fixed size, optimized for minimum latency. Our architecture consists of e PEs, each operating on operands of the size of a single word. The architecture by McIvor et al. consists of just one PE, operating on multi-precision numbers represented in the carry-save form. The final result of the McIvor architecture, obtained after n clock cycles, is expressed in the carry-save redundant form. In order to convert this result to the non-redundant binary representation, an additional e clock cycles are required, which makes the total latency of this architecture comparable to the latency of our architecture. In a sequence of modular multiplications, such as the one required for modular exponentiation, the conversion to the non-redundant representation can be delayed to the very end of computations. Therefore each subsequent Montgomery multiplication can start every n clock cycles. A similar property can be implemented in our architecture by starting a new multiplication immediately after the first PE, PE #0, has released the least significant word of the final result.

Architecture 2 can be parameterized in terms of the value of the word size w.
The larger the w, the smaller the number of PEs, but the larger the size of a single PE. Additionally, the larger the w, the smaller the maximum clock frequency, especially in the redundant representation. The latency expressed in the number of clock cycles is equal to $n + \lceil (n+1)/w \rceil - 1$, and is almost independent of w for $w \ge 16$. Since actual FPGA-based platforms, such as the SRC-6 used in our implementations, have a fixed target clock frequency, this target clock frequency determines the optimum value of w. Additionally, the same HDL code can be used for different values of the

operand size n and the parameter w, with only a minor change in the values of the respective constants.

Fig. 13. Distributing the computation of $c + S^{(j)} + Y^{(j)} + q_i \cdot M^{(j)}$ into two clock cycles: (a) Logic diagram, (b) Implementation of Full Adder in Xilinx FPGAs, (c) Implementation of Half Adder in Xilinx FPGAs, (d) Implementation of $S_{w-1..0} + Z_{w..0} + C_{1..0}$ in Xilinx Virtex-II FPGA device, $w = 5$ ($Z = Y_{w-1..0} + q_i \cdot M_{w-1..0}$).

Both optimized architectures, Architecture 1 and Architecture 2, have been implemented in Verilog HDL, and their codes have been verified using a reference software implementation. The results completely matched.

We have selected the Xilinx Virtex-II 6000FF1517-4 FPGA device used in the SRC-6 reconfigurable computer for the prototype implementations. The synthesis tool was Synplify Pro 9.x and the Place and Route tool was Xilinx ISE 9.x.

We have implemented four different sizes of multipliers, 1024, 2048, 3072 and 4096 bits, respectively, in the radix-2 case using Verilog-HDL to verify our approach. The resource utilization on a single FPGA is shown in Table 2. For comparison, we have implemented multipliers of these four sizes following the hardware architectures by Tenca and Koç and by Harris et al. as well. Additionally, we have implemented the approach based on CSA (Carry Save Addition) from [14] as a reference. The purpose is to show how the MWR2MM architecture compares with other types of architectures in terms of resource utilization and performance. The word size w is fixed at 16 bits for most of the architectures implementing the MWR2MM algorithm. Moreover, the 32-bit case of Architecture 2 is tested as well to show the trade-off among clock rate, minimum latency and area.
In order to maximize the performance, we used the maximum number of PEs in the implementation of all three scalable architectures, i.e., the architecture by Tenca and Koç [4], the architecture by Harris et al. [7], and Architecture 1. Therefore, the queue (shown in Fig. 6) is not implemented in all three cases.

In the implementation of these four architectures, S is represented in non-redundant form. In other words, carry-ripple adders are used in the implementation. In order to minimize the critical path delay in the carry-ripple addition of $c + S^{(j)} + Y^{(j)} + q_i \cdot M^{(j)}$, this three-input addition with carry is broken into two two-input additions. As shown in Fig. 13(a), $Y^{(j)} + M^{(j)}$ is pre-computed one clock cycle ahead of its addition with $S^{(j)}$. This technique is applied to the implementation of all four cases to maximize the frequency. This design point is appropriate when the target device is an FPGA device with abundant hardware resources. When the area constraint is of high priority, or S is represented in redundant form (as suggested in [4], [5], [7]), this frequency-oriented technique may become unnecessary.

The real implementation of the second two-input addition with two-bit carry in a Xilinx Virtex-II device is illustrated in Fig. 13(d). $w + 1$ full adders (FAs) and $w$ half adders (HAs) form two parallel chains to perform the addition. Considering the $w$ FAs used in the first addition, the implementation of the logic in Fig. 13(a) requires $3w + 1$ FAs or HAs. Compared with the $2w$ FAs used in Fig. 3(c), the non-redundant pipelined implementation of Montgomery multiplication will consume approximately 50% more hardware resources than the implementation in redundant form on the Xilinx Virtex-II platform.

From Table 2, we can see that both Architecture 1 and Architecture 2 (radix-2 and w=16) give a speedup by a factor of almost two compared with the architecture by Tenca and Koç [4] in terms of latency expressed in the number of clock cycles.
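The near-twofold speedup in clock cycles can be checked with a few lines of arithmetic. The sketch below assumes the latency expressions 2n + e - 1 for the architecture by Tenca and Koç and n + e - 1 for the proposed architectures, with e = ceil((n+1)/w) words; the function names are hypothetical and the script models cycle counts only, not clock period or area.

```python
def num_words(n, w):
    """Number of w-bit words, assumed to be e = ceil((n + 1) / w)."""
    return -(-(n + 1) // w)


def cycles_tenca_koc(n, w=16):
    """Assumed minimum latency of the original architecture: 2n + e - 1."""
    return 2 * n + num_words(n, w) - 1


def cycles_proposed(n, w=16):
    """Assumed minimum latency of the proposed architectures: n + e - 1."""
    return n + num_words(n, w) - 1


if __name__ == "__main__":
    for n in (1024, 2048, 3072, 4096):
        base = cycles_tenca_koc(n)
        new = cycles_proposed(n)
        print(f"n={n}: {base} vs {new} cycles, speedup {base / new:.2f}")
```

Under these assumptions and w = 16, the ratio is 33/17 for all four operand sizes, that is, slightly below two, because the e - 1 cycles needed to drain the pipeline are not halved.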
The minimum clock period is comparable in both cases, and the extra propagation delay in our architecture is introduced only by the multiplexers directly following the registers, as shown in Fig. 6 and Fig. 10.

The resource requirements of the PE in the three scalable architectures are very close to each other because most of their logic is the same. The implementations of both Harris' architecture and Architecture 1 use twice as many PEs as the architecture by Tenca and Koç. At the same time, they both require only about 44% more resources (in LUTs) compared with Tenca and Koç's architecture. This is due to the way LUTs are counted by implementation tools; namely, a LUT is counted as one even if not all of its inputs are used. A close observation of the area report by Synplify Pro reveals that in the cases of both Harris' architecture and Architecture 1, the percentage of fully or close-to-fully used LUTs is much higher than in the case of Tenca and Koç's architecture.

Architecture 2 occupies 16% less resources than the architecture by Tenca and Koç in terms of LUTs, although our Architecture 2 uses almost twice as many PEs. This result

is mainly due to the fact that our PE shown in Fig. 10(b) is substantially simpler than the PE in the architecture by Tenca and Koç [4]. The PE in [4] is responsible for calculating multiple columns of the dependency graph of the MWR2MM algorithm. Therefore it must switch its function between Task A and Task B, depending on the phase of calculations.

TABLE 2. Hardware resource requirement and performance of the implementations on Xilinx Virtex-II 6000FF1517-4 FPGA. For each architecture the table reports the maximum frequency (MHz), the number of PEs, the minimum latency (clks and µs), the area (LUTs), and the product of minimum latency and area (µs × LUTs). The minimum latency in clock cycles is:

Architecture                                    1024-bit  2048-bit  3072-bit  4096-bit
Scalable architectures
  Tenca and Koç [4] (radix-2, w=16)                2,112     4,224     6,336     8,448
  Harris et al. [7] (radix-2, w=16)                1,167     2,319     3,471     4,623
  Our proposed Architecture 1 (radix-2, w=16)      1,088     2,176     3,264     4,352
Non-scalable architectures
  McIvor et al. [14] (radix-2)                     1,025     2,049     3,073     4,097
  Our proposed Architecture 2 (radix-2, w=16)      1,088     2,176     3,264     4,352
  Our proposed Architecture 2 (radix-2, w=32)      1,056     2,112     3,168     4,224

1. The number of PEs is optimized for the minimum latency.
2. In all the implementations except the one by McIvor et al. [14], S is represented in non-redundant form.
In contrast, in our Architecture 2, each PE is responsible for only one column of the dependency graph in Fig. 8 and one Task, either D or E or F. Additionally, in [4] the words $Y^{(j)}$ and $M^{(j)}$ must rotate with regard to the PEs, which further complicates the control logic.

Compared with the architecture by McIvor et al. [14], our Architecture 2 (radix-2 and w=16) has a comparable latency expressed in the number of clock cycles. In terms of clock frequency, McIvor's architecture is better by 4-47%, but in terms of area, our architecture is superior by almost a factor of 2. As a result, Architecture 2 outperforms McIvor's design in terms of the product of latency times area by about 20%.

In Table 3, the performance gain of various architectures against the architecture of Tenca and Koç is summarized. Harris' architecture, Architecture 1 and Architecture 2 all consistently outperform the classic architecture by Tenca and Koç in terms of both latency and the product of latency times area, for all four investigated operand sizes. Both Harris' architecture and Architecture 1 achieve a gain of around 20% regarding the product of latency times area. Architecture 2 can achieve a gain of up to 50% due to much smaller resource requirements.

In all investigated architectures, the time between two consecutive Montgomery multiplications can be further reduced by overlapping computations for two consecutive sets of operands. In the original architecture by Tenca and Koç, this repetition interval is equal to 2n clock cycles, and in all other investigated architectures n clock cycles.

For the radix-4 case, we have only implemented four different operand sizes, 1024, 2048, 3072, and 4096, of

Montgomery multipliers in Architecture 2 as a showcase. The word length is the same as the one in the radix-2 case, i.e., 16 bits. For all four cases, the maximum frequency is comparable for both radix-2 and radix-4 designs. Moreover, the minimum latency of the radix-4 designs is almost half that of the radix-2 designs. In the meantime, the radix-4 designs occupy more than twice as many resources as the radix-2 versions. These figures fall within our expectations because the radix-4 PE has 4 internal branches, which doubles the quantity of branches of the radix-2 version, and some small design tweaks were required to redeem the propagation delay increase caused by more complicated combinational logic. Some of these optimization techniques are listed below.
1) At line 6.6 of Algorithm 6 there is an addition of three operands whose length is w bits or larger. To reduce the propagation delay of this step, we precomputed the value of $x^{(i)} \cdot Y^{(j)} + q^{(i)} \cdot M^{(j)}$ one clock cycle before it arrives at the corresponding PE.
2) For the first PE, in which the update of $S^{(0)}$ and the evaluation of $q^{(i)}$ happen in the same clock cycle, we cannot precompute the value of $x^{(i)} \cdot Y^{(0)} + q^{(i)} \cdot M^{(0)}$ in advance. To overcome this difficulty, we precompute the four possible values of $x^{(i)} \cdot Y^{(0)} + q^{(i)} \cdot M^{(0)}$ corresponding to $q^{(i)} = 0, 1, 2, 3$, and make a decision at the end of the clock cycle based on the real value of $q^{(i)}$.

As mentioned at the beginning of Section 5, the hardware implementation of our optimization beyond radix-4 is no longer viable considering the large resource cost of covering all the $2^k$ branches in one clock cycle, and the need to perform multiplications of words by numbers in the range $0..2^k - 1$.

TABLE 3. Performance gain (%) against the architecture by Tenca and Koç [4], in terms of minimum latency (µs) and of latency × area, for 1024-, 2048-, 3072- and 4096-bit operands. Rows: Harris' architecture (radix-2, w=16), Architecture 1 (radix-2, w=16), McIvor's architecture (radix-2), Architecture 2 (radix-2, w=16), Architecture 2 (radix-2, w=32).

TABLE 4. Comparison of the radix-2 and the radix-4 versions of Architecture 2 (w=16) for the implementation on Xilinx Virtex-II 6000FF1517-4 FPGA. Columns: bit length, radix, maximum frequency (MHz), minimum latency (clocks and µs), area (LUTs), and the radix-4/radix-2 ratios of latency (µs) and of latency (µs) × area.

7 CONCLUSION

In this paper, we present two new hardware architectures for Montgomery multiplication. These architectures are based on the new idea of enhancing parallelism by precomputing partial results using two different assumptions regarding the most significant bit of each partial result word. Additionally, Architecture 2 introduces a new original data dependency graph, aimed at significantly simplifying the control unit of each Processing Element. Both architectures improve on the well-known architecture by Tenca and Koç, first presented at CHES 1999, and then published in the IEEE Transactions on Computers in 2003. Both architectures reduce the circuit latency by almost a factor of two, from $2n + e - 1$ clock cycles to $n + e - 1$ clock cycles, with a negligible penalty in terms of the minimum clock period.

Our Architecture 1 preserves the scalability of the original design by Tenca and Koç. Further, it outperforms the Tenca-Koç design by about 23% in terms of the product of latency times area when implemented on a Xilinx Virtex-II 6000 FPGA. Our Architecture 2 breaks with the scalability of the original scheme in favor of optimizing the design for the case of minimum latency. This architecture outperforms the original design by Tenca and Koç by 50% in terms of the product of latency times area for the four


most popular operand sizes used in cryptography (1024, 2048, 3072 and 4096 bits).

Both our architectures have also been compared with two other latency-optimized architectures reported earlier in the literature: the scalable architecture by Harris et al. from 2005 and the non-scalable architecture by McIvor et al. from 2004. Our scalable Architecture 1 demonstrates performance comparable to that of the architecture by Harris et al., while using a substantially different optimization method. Our non-scalable Architecture 2 has a longer latency than the architecture by McIvor et al., but at the same time it outperforms this architecture in terms of the product of latency times area by about 20% for all operand sizes.

These two new architectures can be extended from radix-2 to radix-4 in order to further reduce their circuit latency at the cost of increasing the product of latency times area. Our architectures have been fully verified by modeling them using Verilog-HDL, and comparing their function against a reference software implementation of Montgomery multiplication based on the GMP library. Our code has been implemented on a Xilinx Virtex-II 6000 FPGA and experimentally tested on the SRC-6 reconfigurable computer. Our architectures can be easily parameterized, so the same generic code with different values of parameters can be used for multiple operand and word sizes.

ACKNOWLEDGMENT

The authors would like to acknowledge the contributions of Hoang Le, Ramakrishna Bachimanchi and Marcin Rogawski from George Mason University, who provided results for their implementation of the Montgomery multiplier from [4]. The authors also would like to thank Prof. Soonhak Kwon from Sungkyunkwan University in South Korea for helpful discussions and comments. Finally, we are grateful to the anonymous reviewers for their invaluable suggestions and comments to improve the quality and fairness of this paper.

REFERENCES

[1] R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, no. 2, pp. 120-126, 1978.
[2] P. L. Montgomery, "Modular multiplication without trial division," Mathematics of Computation, vol. 44, no. 170, pp. 519-521, Apr. 1985.
[3] K. Gaj et al., "Implementing the elliptic curve method of factoring in reconfigurable hardware," in CHES 2006, Springer-Verlag Lecture Notes in Computer Sciences, vol. 4249, Oct. 2006.
[4] A. F. Tenca and Ç. K. Koç, "A scalable architecture for Montgomery multiplication," in CHES 99, Springer-Verlag Lecture Notes in Computer Sciences, vol. 1717, 1999.
[5] A. F. Tenca and Ç. K. Koç, "A scalable architecture for modular multiplication based on Montgomery's algorithm," IEEE Trans. Comput., vol. 52, no. 9, pp. 1215-1221, Sept. 2003.
[6] A. F. Tenca, G. Todorov, and Ç. K. Koç, "High-radix design of a scalable modular multiplier," in CHES 2001, Springer-Verlag Lecture Notes in Computer Sciences, vol. 2162, 2001, pp. 185-201.
[7] D. Harris, R. Krishnamurthy, M. Anders, S. Mathew, and S. Hsu, "An improved unified scalable radix-2 Montgomery multiplier," in Proc. the 17th IEEE Symposium on Computer Arithmetic (ARITH 17), June 2005.
[8] N. Jiang and D. Harris, "Parallelized radix-2 scalable Montgomery multiplier," in Proc. IFIP International Conference on Very Large Scale Integration, 2007 (VLSI-SoC 2007), Oct. 2007.
[9] N. Pinckney and D. M. Harris, "Parallelized radix-4 scalable Montgomery multipliers," Journal of Integrated Circuits and Systems, vol. 3, Mar. 2008.
[10] K. Kelly and D. Harris, "Parallelized very high radix scalable Montgomery multipliers," in Proc. the Thirty-Ninth Asilomar Conference on Signals, Systems and Computers, 2005, Oct. 2005, pp. 1196-1200.
[11] A. Michalski and D. A. Buell, "A scalable architecture for RSA cryptography on large FPGAs," in Proc. International Conference on Field Programmable Logic and Applications, 2006 (FPL 2006), Aug. 2006.
[12] Ç. K. Koç, T. Acar, and B. S. Kaliski Jr., "Analyzing and comparing Montgomery multiplication algorithms," IEEE Micro, vol. 16, no. 3, pp. 26-33, 1996.
[13] C. McIvor, M. McLoone, and J. V. McCanny, "High-radix systolic modular multiplication on reconfigurable hardware," in Proc. IEEE International Conference on Field-Programmable Technology 2005 (ICFPT 2005), Dec. 2005.
[14] C. McIvor, M. McLoone, and J. V. McCanny, "Modified Montgomery modular multiplication and RSA exponentiation techniques," IEE Proceedings Computers and Digital Techniques, vol. 151, no. 6, pp. 402-408, Nov. 2004.
[15] L. Batina and G. Muurling, "Montgomery in practice: How to do it more efficiently in hardware," in Proc. The Cryptographer's Track at the RSA Conference on Topics in Cryptology (CT-RSA 2002), Feb. 2002.
[16] C. D. Walter, "Precise bounds for Montgomery modular multiplication and some potentially insecure RSA moduli," in Proc. The Cryptographer's Track at the RSA Conference on Topics in Cryptology (CT-RSA 2002), Feb. 2002.

Miaoqing Huang is an Assistant Professor in the Department of Computer Science and Computer Engineering at the University of Arkansas. His research interests include reconfigurable computing, high-performance computing architectures, cryptography, image processing, computer arithmetic, and cache design in Solid-State Drives. Huang received a B.S. degree in electronics and information systems from Fudan University, China in 1998, and a Ph.D. degree in computer engineering from The George Washington University in 2009. He is a member of IEEE.

Kris Gaj received the M.Sc. and Ph.D. degrees in Electrical Engineering from Warsaw University of Technology in Warsaw, Poland. He was a founder of Enigma, a Polish company that generates practical software and hardware cryptographic applications used by major Polish banks. In 1998, he joined George Mason University, where he currently works as an Associate Professor, doing research and teaching courses in the area of cryptographic engineering and reconfigurable computing. His research projects center on new hardware architectures for secret key ciphers, hash functions, public key cryptosystems, and factoring, as well as development of specialized libraries and application kernels for high-performance reconfigurable computers. He has been a member of the Program Committees of CHES, CryptArchi, and Quo Vadis Cryptology workshops, and a General Co-Chair of CHES 2008 in Washington, D.C. He is an author of a book on breaking the German Enigma cipher during World War II.


Tarek El-Ghazawi is a Professor in the Department of Electrical and Computer Engineering at The George Washington University. At GWU, he is the founding director of GW IMPACT: The Institute for Massively Parallel Applications and Computing Technologies, and a founding Co-Director of the NSF Industry/University Center for High-Performance Reconfigurable Computing (CHREC). El-Ghazawi's research interests include high-performance computing, computer architectures, and reconfigurable computing. He is one of the principal co-authors of the UPC parallel programming language and the UPC book from John Wiley and Sons. He received his Ph.D. degree in Electrical and Computer Engineering from New Mexico State University in 1988. El-Ghazawi has authored numerous refereed research publications in these areas. Dr. El-Ghazawi's research has been frequently supported by government agencies and industry, and he received the IBM faculty partnership award in 2004. He serves or has served on many technical advisory boards. El-Ghazawi is a Program Chair for the 6th International Symposium on Applied Reconfigurable Computing (ARC 2010) and a General Chair for the 10th IEEE International Conference on Scalable Computing and Communications (ScalCom-10), and has served in many conference leadership and editorial duties. He is a senior member of the Institute of Electrical and Electronics Engineers (IEEE), and a member of the ACM, IFIP WG10.3, and Phi Kappa Phi National Honor Society.


High-Performance and Area-Efficient Hardware Design for Radix-$2^k$ Montgomery Multipliers
Liang Zhou, Miaoqing Huang, Scott C. Smith
University of Arkansas, Fayetteville, Arkansas 771, USA
Abstract Montgomery