

IEEE TRANSACTIONS ON COMPUTERS, VOL. N, NO. N, MM X

New Hardware Architectures for Montgomery Modular Multiplication Algorithm

Miaoqing Huang, Member, Kris Gaj, Tarek El-Ghazawi, Senior Member

Abstract: Montgomery modular multiplication is one of the fundamental operations used in cryptographic algorithms, such as RSA and Elliptic Curve Cryptosystems. At CHES 1999, Tenca and Koç proposed the Multiple-Word Radix-2 Montgomery Multiplication (MWR2MM) algorithm and introduced a now-classic architecture for implementing Montgomery multiplication in hardware. With parameters optimized for minimum latency, this architecture performs a single Montgomery multiplication in approximately 2n clock cycles, where n is the size of operands in bits. In this paper we propose two new hardware architectures that are able to perform the same operation in approximately n clock cycles with almost the same clock period. These two architectures are based on pre-computing partial results using two possible assumptions regarding the most significant bit of the previous word. These two architectures outperform the original architecture of Tenca and Koç in terms of the product latency times area by 3% and 5%, respectively, for several most common operand sizes used in cryptography. The architecture in radix-2 can be extended to the case of radix-4, while preserving a factor of two speed-up over the corresponding radix-4 design by Tenca, Todorov, and Koç from CHES 2001. Our optimization has been verified by modeling it using Verilog-HDL, implementing it on a Xilinx Virtex-II 6000 FPGA, and experimentally testing it using an SRC-6 reconfigurable computer.

Index Terms: Montgomery Multiplication, MWR2MM Algorithm, Hardware Optimization, Field-Programmable Gate Arrays

1 INTRODUCTION

Since the introduction of the RSA algorithm [1] in 1978, high-speed and space-efficient hardware architectures for modular multiplication have been a subject of constant interest for more than 30 years.
During this period, one of the most useful advances came with the introduction of the Montgomery multiplication algorithm due to Peter L. Montgomery [2]. Montgomery multiplication is the basic operation of modular exponentiation, which is required in the RSA public-key cryptosystem. It is also used in Elliptic Curve Cryptosystems and in several methods of factoring, such as ECM, p−1, and Pollard's rho method, as well as in many other cryptographic and cryptanalytic transformations [3].

At CHES 1999, Tenca and Koç introduced a word-based algorithm for Montgomery multiplication, called Multiple-Word Radix-2 Montgomery Multiplication (MWR2MM), as well as a scalable hardware architecture capable of executing this algorithm [4], [5]. Several follow-up designs based on the MWR2MM algorithm have been proposed in order to reduce the computation time [6] []. In [6], a high-radix word-based Montgomery algorithm (MWR$2^k$MM) was proposed using the Booth encoding technique. Although the number of scanning steps was reduced, the complexity of control and computational logic increased substantially at the same time. In [7], Harris et al. implemented the MWR2MM algorithm in a quite different way, i.e., left shifting Y and M instead of right shifting S. Their approach was able to process an n-bit precision Montgomery multiplication in approximately n clock cycles, while keeping the scalability and simplicity of the original implementation.

M. Huang is with the Department of Computer Science and Computer Engineering, University of Arkansas, Fayetteville, AR 72701, USA, mqhuang@uark.edu. K. Gaj is with the Department of Electrical and Computer Engineering, George Mason University, Fairfax, VA 22030, USA, kgaj@gmu.edu. T. El-Ghazawi is with the Department of Electrical and Computer Engineering, The George Washington University, Washington, DC 20052, USA, tarek@gwu.edu. Manuscript received Sept., 8, revised Dec. 9, accepted Jan..
In [8] and [9], the left-shifting technique was applied to the radix-2 and radix-4 versions of the parallelized Montgomery algorithm [], respectively. In [], Michalski and Buell introduced an MWRkMM algorithm, which is derived from the Finely Integrated Operand Scanning method described in []. The MWRkMM algorithm requires the built-in multipliers in the FPGA device to speed up the computation, which makes the implementation expensive. The systolic high-radix design by McIvor et al. described in [13] is also capable of very high speed operation, but suffers from the same disadvantage of large area requirements for fast multiplier units. A different approach based on processing multi-precision operands in carry-save form has been presented in [14]. This architecture is optimized for minimum latency and is particularly suitable for a repeated sequence of Montgomery multiplications, such as the sequence used in modular exponentiation (e.g., RSA).

In this paper, we focus on the optimization of hardware architectures for the MWR2MM and MWR4MM algorithms in order to minimize the number of clock cycles required to compute an n-bit precision Montgomery multiplication. We start with the introduction of Montgomery multiplication in Section 2. Then, the classic MWR2MM architecture is discussed. The new

optimized architecture, which is able to perform the n-bit precision MWR2MM algorithm in approximately n clock cycles, is presented in Section 3. In Section 4, we propose an alternative optimized architecture that is able to achieve the same performance goal with simpler logic design. In Section 5, the high-radix version of our new architecture is introduced. In Section 6, we first compare our two optimized architectures with three previous architectures from the conceptual point of view. Then, the hardware implementations of all discussed architectures are presented and contrasted with each other. Finally, in Section 7, we present the summary and conclusions for this work.

2 MONTGOMERY MULTIPLICATION ALGORITHM

Let M be a positive odd integer. In many cryptosystems, such as RSA, computing $X \cdot Y \pmod{M}$ is a crucial operation. The reduction of $X \cdot Y \pmod{M}$ is a more time-consuming step than the multiplication $X \cdot Y$ without reduction. In [2], Montgomery introduced a method for calculating products (mod M) without the costly reduction (mod M), since then known as Montgomery multiplication. Montgomery multiplication of X and Y (mod M), denoted by $MP(X, Y, M)$, is defined as $X \cdot Y \cdot 2^{-n} \pmod{M}$ for some fixed integer n.

Since Montgomery multiplication is not an ordinary multiplication, there is a conversion process between the ordinary domain (with ordinary multiplication) and the Montgomery domain. The conversion between the ordinary domain and the Montgomery domain is given by the relation $X \longleftrightarrow X'$, where $X' = X \cdot 2^{n} \pmod{M}$. The corresponding diagram is shown in Table 1.

TABLE 1
Conversion between ordinary and Montgomery domains

Ordinary Domain          Montgomery Domain
$X$                      $X' = X \cdot 2^{n} \pmod{M}$
$Y$                      $Y' = Y \cdot 2^{n} \pmod{M}$
$X \cdot Y$              $(X \cdot Y)' = X \cdot Y \cdot 2^{n} \pmod{M}$

Table 1 shows that the conversion is compatible with multiplications in each domain, since

$$MP(X', Y', M) \equiv X' \cdot Y' \cdot 2^{-n} \equiv (X \cdot 2^{n}) \cdot (Y \cdot 2^{n}) \cdot 2^{-n} \qquad (1a)$$
$$\equiv X \cdot Y \cdot 2^{n} \equiv (X \cdot Y)' \pmod{M}. \qquad (1b)$$
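The domain conversion can be checked numerically with a few lines of Python. This is only an illustrative sketch, not part of the paper: the helper names `mp`, `to_mont`, and `from_mont` are hypothetical, the Montgomery product is evaluated directly from its definition with a modular inverse (Python 3.8+ `pow` with a negative exponent) rather than by a hardware-style algorithm, and the conversion uses the standard precomputed constant $2^{2n} \bmod M$.

```python
def mp(a, b, M, n):
    """Montgomery product a*b*2^(-n) mod M, evaluated straight from the definition."""
    return a * b * pow(2, -n, M) % M

def to_mont(x, M, n):
    """Ordinary domain -> Montgomery domain, using the constant 2^(2n) mod M."""
    return mp(x, pow(2, 2 * n, M), M, n)

def from_mont(x_m, M, n):
    """Montgomery domain -> ordinary domain (multiply by 1)."""
    return mp(x_m, 1, M, n)

M = 239                      # any odd modulus (illustrative value)
n = M.bit_length()           # size of M in bits
X, Y = 57, 200
X_m, Y_m = to_mont(X, M, n), to_mont(Y, M, n)
Z_m = mp(X_m, Y_m, M, n)     # product computed inside the Montgomery domain
Z = from_mont(Z_m, M, n)     # back to the ordinary domain
```

Here `Z` should equal `X * Y % M`, which is the compatibility property stated in the table: multiplying in the Montgomery domain and converting back gives the ordinary modular product.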
The conversion between each domain can be done using the same Montgomery operation, in particular $X' = MP(X, 2^{2n} \pmod{M}, M)$ and $X = MP(X', 1, M)$, where $2^{2n} \pmod{M}$ can be precomputed. Despite the initial conversion cost, we achieve an advantage over ordinary multiplication if we do many Montgomery multiplications followed by an inverse conversion at the end, which is the case, for example, in RSA.

Algorithm 1: Radix-2 Montgomery Multiplication
Input: odd M, $n = \lfloor \log_2 M \rfloor + 1$, $X = \sum_{i=0}^{n-1} x_i \cdot 2^i$, with $0 \le X, Y < M$
Output: $Z = MP(X, Y, M) \equiv X \cdot Y \cdot 2^{-n} \pmod{M}$, $0 \le Z < M$
1.1  $S[0] = 0$;
1.2  for $i = 0$ to $n - 1$ step 1 do
1.3      $q_i = (x_i \cdot Y_0) \oplus S[i]_0$;
1.4      $S[i+1] = (S[i] + x_i \cdot Y + q_i \cdot M)/2$;
1.5  if $S[n] > M$ then
1.6      $S[n] = S[n] - M$;
1.7  return $Z = S[n]$;

Algorithm 1 shows the pseudocode for the radix-2 Montgomery multiplication, where we choose $n = \lfloor \log_2 M \rfloor + 1$, i.e., n is the size of M in bits.

The verification of the above algorithm is given below. Let us define $S[i]$ as

$$S[i] \equiv \frac{1}{2^{i}} \left( \sum_{j=0}^{i-1} x_j \cdot 2^{j} \right) \cdot Y \pmod{M} \qquad (2)$$

with $S[0] = 0$. Then, $S[n] \equiv X \cdot Y \cdot 2^{-n} \pmod{M} = MP(X, Y, M)$. $S[n]$ can be computed iteratively using the following dependence:

$$S[i+1] \equiv \frac{1}{2^{i+1}} \left( \sum_{j=0}^{i} x_j \cdot 2^{j} \right) \cdot Y \qquad (3a)$$
$$\equiv \frac{1}{2^{i+1}} \left( \sum_{j=0}^{i-1} x_j \cdot 2^{j} + x_i \cdot 2^{i} \right) \cdot Y \qquad (3b)$$
$$\equiv \frac{1}{2} \left( \frac{1}{2^{i}} \left( \sum_{j=0}^{i-1} x_j \cdot 2^{j} \right) \cdot Y + x_i \cdot Y \right) \qquad (3c)$$
$$\equiv \frac{1}{2} \left( S[i] + x_i \cdot Y \right) \pmod{M}. \qquad (3d)$$

Therefore, depending on the parity of $S[i] + x_i \cdot Y$, we compute $S[i+1]$ as

$$S[i+1] = \frac{S[i] + x_i \cdot Y}{2} \quad \text{or} \quad \frac{S[i] + x_i \cdot Y + M}{2}, \qquad (4)$$

to make the numerator divisible by 2. Since $Y < M$ and $S[0] = 0$, one has $0 \le S[i] < 2M$ for all $0 \le i < n$.

In [15], [16], it is shown that the result of a Montgomery multiplication $X \cdot Y \cdot 2^{-n} \pmod{M} < 2M$ when $X, Y < 2M$ and $2^{n} > 4M$. As a result, by redefining n to be the smallest integer such that $2^{n} > 4M$, the subtraction at the end of Algorithm 1 can be avoided and the output of the multiplication can be directly used as an input for the next Montgomery multiplication.

3 OPTIMIZING MWR2MM ALGORITHM

In [4], [5], Tenca and Koç proposed a scalable architecture based on the Multiple-Word Radix-2 Montgomery Multiplication Algorithm (MWR2MM), shown as Algorithm 2. In Algorithm 2, the operand Y (multiplicand) is scanned word-by-word, and the operand X is scanned

bit-by-bit. The operand length is n bits, and the wordlength is w bits. $e = \lceil (n+1)/w \rceil$ words are required to store S since its range is $[0, 2M - 1]$. The original M and Y are extended by one extra bit of 0 as the most significant bit. Presented as vectors, $M = (0, M^{(e-1)}, \ldots, M^{(1)}, M^{(0)})$, $Y = (0, Y^{(e-1)}, \ldots, Y^{(1)}, Y^{(0)})$, $S = (0, S^{(e-1)}, \ldots, S^{(1)}, S^{(0)})$, and $X = (x_{n-1}, \ldots, x_1, x_0)$.

Algorithm 2: Multiple-Word Radix-2 Montgomery Multiplication Algorithm [4]
Input: odd M, $n = \lfloor \log_2 M \rfloor + 1$, word size w, $e = \lceil (n+1)/w \rceil$, $X = \sum_{i=0}^{n-1} x_i \cdot 2^{i}$, $Y = \sum_{j=0}^{e-1} Y^{(j)} \cdot 2^{w \cdot j}$, $M = \sum_{j=0}^{e-1} M^{(j)} \cdot 2^{w \cdot j}$, with $0 \le X, Y < M$
Output: $Z = \sum_{j=0}^{e-1} S^{(j)} \cdot 2^{w \cdot j} = MP(X, Y, M) \equiv X \cdot Y \cdot 2^{-n} \pmod{M}$, $0 \le Z < 2M$
2.1  $S = 0$; /* initialize all words of S */
2.2  for $i = 0$ to $n - 1$ step 1 do
2.3      $q_i = (x_i \cdot Y^{(0)}_0) \oplus S^{(0)}_0$;
2.4      $(C^{(1)}, S^{(0)}) = x_i \cdot Y^{(0)} + q_i \cdot M^{(0)} + S^{(0)}$;
2.5      for $j = 1$ to $e$ step 1 do
2.6          $(C^{(j+1)}, S^{(j)}) = C^{(j)} + x_i \cdot Y^{(j)} + q_i \cdot M^{(j)} + S^{(j)}$;
2.7          $S^{(j-1)} = (S^{(j)}_0, S^{(j-1)}_{w-1..1})$;
2.8      $S^{(e)} = 0$;
2.9  return $Z = S$;

The carry variable $C^{(j)}$ has two bits, as explained below. Assuming $C^{(0)} = 0$, each subsequent value of $C^{(j+1)}$ is given by $(C^{(j+1)}, S^{(j)}) = C^{(j)} + x_i \cdot Y^{(j)} + q_i \cdot M^{(j)} + S^{(j)}$. Assuming that $C^{(j)} \le 3$, we obtain

$$(C^{(j+1)}, S^{(j)}) = C^{(j)} + x_i \cdot Y^{(j)} + q_i \cdot M^{(j)} + S^{(j)} \le 3 + 3 \cdot (2^{w} - 1) = 3 \cdot 2^{w}. \qquad (5)$$

From (5), we have $C^{(j+1)} \le 3$. By induction, $C^{(j)} \le 3$ is ensured for any $0 \le j \le e - 1$. Additionally, based on the fact that $S \le 2M - 1$, we have $C^{(e)} \le 1$.

The data dependency graph of the hardware implementation for the MWR2MM algorithm by Tenca and Koç is shown in Fig. 1. Each circle in the graph represents an atomic computation and is labeled according to the type of action performed. Task A consists of computing lines 2.3 and 2.4 in Algorithm 2. Task B corresponds to computing lines 2.6 and 2.7 in Algorithm 2. The data dependencies among the operations within the j loop make it impossible to execute the steps in a single iteration of the j loop in parallel. However, parallelism is possible among executions of different iterations of the i loop.
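The bit-serial and word-serial formulations (Algorithm 1 and Algorithm 2) can be cross-checked in software. The following Python sketch is an illustration under stated assumptions, not the authors' code: the function names `mont_radix2` and `mwr2mm` are hypothetical, words are stored in little-endian lists with one extra zero word on top (the zero extension of Y and M), and the final correction in the bit-serial version uses `>=` so that the returned value is always fully reduced.

```python
def mont_radix2(X, Y, M):
    """Bit-serial radix-2 Montgomery product X*Y*2^(-n) mod M (Algorithm 1 style)."""
    n = M.bit_length()
    S = 0
    for i in range(n):
        x_i = (X >> i) & 1
        q_i = (x_i & Y & 1) ^ (S & 1)        # makes S + x_i*Y + q_i*M even
        S = (S + x_i * Y + q_i * M) >> 1     # exact division by 2
    return S - M if S >= M else S            # assumption: '>=' for a fully reduced result

def mwr2mm(X, Y, M, w):
    """Word-serial version (Algorithm 2 style); returns a value in [0, 2M)."""
    n = M.bit_length()
    e = -(-(n + 1) // w)                     # ceil((n + 1) / w)
    mask = (1 << w) - 1
    Yw = [(Y >> (w * j)) & mask for j in range(e + 1)]   # word e is the zero extension
    Mw = [(M >> (w * j)) & mask for j in range(e + 1)]
    S = [0] * (e + 1)
    for i in range(n):
        x_i = (X >> i) & 1
        q_i = (x_i & Yw[0] & 1) ^ (S[0] & 1)
        t = x_i * Yw[0] + q_i * Mw[0] + S[0]
        C, S[0] = t >> w, t & mask
        for j in range(1, e + 1):
            t = C + x_i * Yw[j] + q_i * Mw[j] + S[j]
            C, S[j] = t >> w, t & mask
            # right shift: bit 0 of word j becomes the MSB of word j-1
            S[j - 1] = ((S[j] & 1) << (w - 1)) | (S[j - 1] >> 1)
        S[e] = 0
    return sum(s << (w * j) for j, s in enumerate(S))
```

For example, `mont_radix2(5, 7, 13)` and `mwr2mm(5, 7, 13, 2) % 13` should agree with `5 * 7 * pow(2, -4, 13) % 13`. The inner loop makes the dependency discussed in the text visible: word j-1 cannot be finalized until bit 0 of word j is known.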
In [4], Tenca and Koç suggested that each column in the graph may be computed by a separate processing element (PE), and the data generated from one PE may be passed into another PE in a pipelined fashion. Following this method, all atomic computations represented by circles in the same row can be processed concurrently. The processing of each column takes $e + 1$ clock cycles (1 clock cycle for Task A, e clock cycles for Task B). Because there is a delay of 2 clock cycles between the processing of a column for $x_i$ and the processing of a column for $x_{i+1}$, the minimum computation time T (in clock cycles) is $T = 2n + e - 1$, given that $P_{max} = \lceil (e+1)/2 \rceil$ PEs are implemented to work in parallel. In this configuration, after $e + 1$ clock cycles, PE #0 switches from executing column 0 to executing column $P_{max}$. After another two clock cycles, PE #1 switches from executing column 1 to executing column $P_{max} + 1$, etc.

[Fig. 1. Data dependency graph of the original architecture of MWR2MM algorithm [4].]

The opportunity of improving the implementation performance of Algorithm 2 is to reduce the delay between the processing of two subsequent iterations of the i loop from 2 clock cycles to 1 clock cycle. The 2-clock-cycle delay comes from the right shift (division by 2) in both Algorithm 1 and Algorithm 2. Take the first two PEs in Fig. 1 for example. These two PEs compute the S words in the first two columns. Starting from clock #0, PE #1 has to wait for two clock cycles before it starts the computation of $S^{(0)}$ ($i = 1$) in clock cycle #2.
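The schedule of the original architecture can be tabulated with a small helper. This is a hedged sketch: the function name `tenca_koc_schedule` is hypothetical, and it simply assumes what the paragraph above describes, namely that column i starts at clock cycle 2i and occupies its PE for e + 1 cycles.

```python
def tenca_koc_schedule(n, w):
    """Return (e, p_max, latency) for the original pipelined MWR2MM organization."""
    e = -(-(n + 1) // w)            # ceil((n + 1) / w) words
    p_max = -(-(e + 1) // 2)        # a PE is busy e + 1 cycles; a new column starts every 2
    last_start = 2 * (n - 1)        # start cycle of the column for x_{n-1}
    latency = last_start + (e + 1)  # equals 2n + e - 1
    return e, p_max, latency
```

For 1024-bit operands and 16-bit words, `tenca_koc_schedule(1024, 16)` evaluates to `(65, 33, 2112)`, so the two-cycle spacing between columns dominates the total time.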
In order to reduce the 2-clock-cycle delay to half, we propose an approach of pre-computing the partial results using two possible assumptions regarding the most significant bit of the previous word. As shown in

Fig. 2, PE #1 can take the $w - 1$ most significant bits of $S^{(0)}$ ($i = 0$) from PE #0 at the beginning of clock #1, do a right shift, and compute two versions of $S^{(0)}$ ($i = 1$), based on the two different assumptions about the most significant bit of this word at the start of computations. At the beginning of clock cycle #2, the previously missing bit becomes available as the least significant bit of $S^{(1)}$ ($i = 0$). This bit can be used to choose between the two precomputed versions of $S^{(0)}$ ($i = 1$). Similarly, in clock cycle #2, two different versions of $S^{(0)}$ ($i = 2$) and $S^{(1)}$ ($i = 1$) are computed by PE #2 and PE #1 respectively, based on two different assumptions about the most significant bits of these words at the start of computations. At the beginning of clock cycle #3, the previously missing bits become available as the least significant bits of $S^{(1)}$ ($i = 1$) and $S^{(2)}$ ($i = 0$), respectively. These two bits can be used to choose between the two precomputed versions of these words. The same pattern of computations is repeated in subsequent clock cycles. Furthermore, since e words are enough to represent the values in S, $S^{(e)}$ is discarded in our designs. Therefore, e clock cycles are required to compute one iteration of S.

[Fig. 2. Data operation in the optimized architecture (Architecture 1). S words belonging to the same i loop share the same background pattern.]

The proposed optimization technique can be applied to both non-redundant and redundant representations of the partial sum S, as demonstrated in Fig. 3. It is logically straightforward to apply the approach when S is represented in non-redundant form because each digit of S consists of only one bit. When S is represented in redundant Carry-Save (CS) form, each digit of S consists of two bits, the sum (SS) bit and the carry (SC) bit. As shown in Fig. 3(b) and Fig.
3(c), after the update of $S^{(j)}$, only the sum bit of $S^{(j+1)}_0$, i.e., $SS^{(j+1)}_0$, is missing in order to determine a full word $S^{(j)}$ after right shift. The carry bit, $SC^{(j+1)}_0$, has already been computed and can be forwarded to the next PE together with $S^{(j)}_{w-1..1}$. Then, the same approach can be applied to update $S^{(j)}$. In the remainder of this paper, we use the non-redundant form in all the diagrams and descriptions for the sake of simplicity. The corresponding diagrams and implementations in redundant format can be derived from the non-redundant case accordingly.

[Fig. 4. Data dependency graph of the optimized architecture (Architecture 1) of MWR2MM algorithm (S is represented in non-redundant form).]

Algorithm 3: Computations in Task D
Input: $x_i$, $Y^{(0)}$, $M^{(0)}$, $S^{(1)}_0$, $S^{(0)}_{w-1..1}$
Output: $q_i$, $C^{(1)}$, $S^{(0)}_{w-1..1}$
3.1  $q_i = (x_i \cdot Y^{(0)}_0) \oplus S^{(0)}_1$;
3.2  $(CO^{(1)}, SO^{(0)}_{w-1}, S^{(0)}_{w-2..0}) = (1, S^{(0)}_{w-1..1}) + x_i \cdot Y^{(0)} + q_i \cdot M^{(0)}$;
3.3  $(CE^{(1)}, SE^{(0)}_{w-1}, S^{(0)}_{w-2..0}) = (0, S^{(0)}_{w-1..1}) + x_i \cdot Y^{(0)} + q_i \cdot M^{(0)}$;
3.4  if $S^{(1)}_0 = 1$ then
3.5      $C^{(1)} = CO^{(1)}$;
3.6      $S^{(0)}_{w-1..0} = (SO^{(0)}_{w-1}, S^{(0)}_{w-2..0})$;
3.7  else
3.8      $C^{(1)} = CE^{(1)}$;
3.9      $S^{(0)}_{w-1..0} = (SE^{(0)}_{w-1}, S^{(0)}_{w-2..0})$;

The data dependency of the optimized architecture for implementing the MWR2MM algorithm is shown in Fig. 4. Similar to the original implementation by Tenca and Koç, the circle in the graph of Fig. 4 represents an atomic

computation. Task D consists of three steps: the computation of $q_i$, the calculation of two sets of possible results, and the selection between these two sets of results using an additional input $S^{(1)}_0$, which becomes available at the end of the processing time for Task D. These three steps are shown in Algorithm 3. Task E corresponds to two steps, as shown in Algorithm 4. The data forwarding of $S^{(j)}_0$ and $S^{(j)}_{w-1..1}$ from one circle to the two circles in the right column takes place at the same time. However, $S^{(j)}_0$ is used for selecting between the two partial results of $S^{(j-1)}$, and $S^{(j)}_{w-1..1}$ is used for generating the two partial results of $S^{(j)}$.

[Fig. 3. Update of an S word: (a) S is represented in non-redundant form, (b) S is represented in redundant form, (c) logic diagram of an update of an S word (w = 3) in redundant form.]

Algorithm 4: Computations in Task E
Input: $q_i$, $x_i$, $C^{(j)}$, $Y^{(j)}$, $M^{(j)}$, $S^{(j+1)}_0$, $S^{(j)}_{w-1..1}$
Output: $C^{(j+1)}$, $S^{(j)}_{w-1..1}$, $S^{(j)}_0$
4.1  $(CO^{(j+1)}, SO^{(j)}_{w-1}, S^{(j)}_{w-2..0}) = (1, S^{(j)}_{w-1..1}) + C^{(j)} + x_i \cdot Y^{(j)} + q_i \cdot M^{(j)}$;
4.2  $(CE^{(j+1)}, SE^{(j)}_{w-1}, S^{(j)}_{w-2..0}) = (0, S^{(j)}_{w-1..1}) + C^{(j)} + x_i \cdot Y^{(j)} + q_i \cdot M^{(j)}$;
4.3  if $S^{(j+1)}_0 = 1$ then
4.4      $C^{(j+1)} = CO^{(j+1)}$;
4.5      $S^{(j)}_{w-1..0} = (SO^{(j)}_{w-1}, S^{(j)}_{w-2..0})$;
4.6  else
4.7      $C^{(j+1)} = CE^{(j+1)}$;
4.8      $S^{(j)}_{w-1..0} = (SE^{(j)}_{w-1}, S^{(j)}_{w-2..0})$;

The exact approach to avoiding the extra clock cycle delay due to the right shift is detailed as follows, taking Task E as an example. Each PE first computes two versions of $C^{(j+1)}$ and $S^{(j)}_{w-1}$ simultaneously, as shown in Algorithm 4. One version assumes that $S^{(j+1)}_0$ is equal to one, and the other assumes that this bit is equal to zero. Both results are stored in registers. At the same moment, the bit $S^{(j+1)}_0$ becomes available and this PE can output the correct $C^{(j+1)}$ and $S^{(j)}$.
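The precompute-then-select idea can be mimicked in software to confirm that it yields the same product. The Python sketch below is an illustration only and does not model clock cycles or registers: the function name `mwr2mm_two_branch` is hypothetical, it assumes $w \ge 2$, it keeps the unshifted sum words of the previous iteration, and it uses the carry out of the top word as the missing most significant bit of word e-1 (an assumption made here for the top of the chain).

```python
def mwr2mm_two_branch(X, Y, M, w):
    """Word-serial Montgomery product using two speculative sums per word (w >= 2)."""
    n = M.bit_length()
    e = -(-(n + 1) // w)                         # ceil((n + 1) / w)
    mask = (1 << w) - 1
    Yw = [(Y >> (w * j)) & mask for j in range(e)]
    Mw = [(M >> (w * j)) & mask for j in range(e)]
    # S[j]: unshifted sum word from the previous iteration; S[e]: last carry-out bit.
    S = [0] * (e + 1)
    for i in range(n):
        x_i = (X >> i) & 1
        # bit 1 of the unshifted word 0 is bit 0 of the shifted word 0
        q_i = (x_i & Yw[0] & 1) ^ ((S[0] >> 1) & 1)
        C = 0
        nxt = [0] * (e + 1)
        for j in range(e):
            # part shared by both branches: bits w-1..1 shifted down, plus addends
            shared = (S[j] >> 1) + C + x_i * Yw[j] + q_i * Mw[j]
            even = shared                        # branch assuming the missing MSB is 0
            odd = shared + (1 << (w - 1))        # branch assuming the missing MSB is 1
            total = odd if S[j + 1] & 1 else even    # late selection by bit 0 of word j+1
            C, nxt[j] = total >> w, total & mask
        nxt[e] = C                               # at most 1, feeds the top word next round
        S = nxt
    # one final right shift turns the unshifted sum into the result
    return sum(s << (w * j) for j, s in enumerate(S)) >> 1
```

For instance, `mwr2mm_two_branch(5, 7, 13, 2)` evaluates to 3, which matches `5 * 7 * pow(2, -4, 13) % 13`. Note that `shared` is computed once and the two branches differ only by the constant `1 << (w - 1)`, which mirrors the observation that the two additions differ only in the most significant bit of the S word.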
For Task D, the computation of $q_i$ is performed in addition to the computation of $C^{(1)}$ and $S^{(0)}$. The diagram of the PE logic is given in Fig. 5. The signals at the left and right sides are for the interconnection purpose. The carry C is fed back to the core logic of the same PE. The signal $x_i$ remains unchanged during the computation of a whole column in Fig. 4. $S^{(j)}$ is a word of the final output at the end of the computation of the whole multiplication.

The core logic in Fig. 5 consists of two parts, the combinational logic and a finite state machine. The multiplications $x_i \cdot Y^{(j)}$ and $q_i \cdot M^{(j)}$ are shown to be carried out using multiplexers. A row of AND gates is another implementation option. On FPGA devices, the designer may leave the choice of the real implementation up to the synthesis tool for the best trade-off between speed and area.

The direct implementation of the two branches (i.e., lines 4.1 and 4.2 in Algorithm 4) requires the use of two ripple-carry adders, each of which has three w-bit inputs and a carry. These two additions only differ in the most significant bit of the S word and share all remaining operand bits. Therefore, it is desirable to consolidate the shared part of these two additions into one ripple-carry adder with three $(w-1)$-bit inputs and a carry. The remaining separate parts are then carried out using two small adders. Following this implementation, the resource requirements increase only marginally while performing computation for two different cases. When S is represented in redundant form (see Fig. 3(c)), only one additional Full Adder is required to cover the two possible cases of S.

The optimized architecture keeps the scalability of the original architecture described in [4]. Fig. 6 illustrates how to use p PEs to implement the MWR2MM algorithm. Both $M^{(j)}$ and $Y^{(j)}$ are moved from left to right every clock cycle through registers. $S^{(j)}$ has been registered inside each PE. Therefore, it can be passed into the next PE directly.
The total computation time T, in clock cycles, when p stages are used in the pipeline to compute a Montgomery multiplication with n-bit operands, is given by

$$T = \begin{cases} n + e - 1 & \text{if } e \le p \\ n + k(e - p) + e - 1 & \text{otherwise} \end{cases} \qquad (6)$$

where $k = \lfloor n/p \rfloor$.

1. Ripple-carry adders are used when S is represented in non-redundant form. When S is represented in redundant form, carry-save adders should be used instead.

The first case shown in (6) represents the situation when the number of PEs is at least the number of words ($e \le p$). Then it takes n clock cycles to scan the n bits in X and another $e - 1$ clock cycles to compute the remaining $e - 1$ words in the last iteration. The second case models the condition when the number of words in the operand is larger than the number of PEs. If we define a kernel cycle as the computation in which p bits of X are processed, then there is an $(e - p)$-clock-cycle extra delay between two kernel cycles. In this case, k complete and one partial kernel cycles are required to process all n bits in X. Overall, the new architecture is capable of reducing the processing latency to half of the latency of the Tenca-Koç design, given the maximum number of PEs.

Fig. 7 demonstrates these two different cases with a simplified example. If $e > p$, the output from the rightmost PE is fed into a queue and processed by the leftmost PE later. This is the example shown in Fig. 7(b). Since there is an $(e - p)$-clock-cycle extra delay between two kernel cycles, the length of the queue Q is determined as

$$Q = \begin{cases} 0 & \text{if } e \le p \\ e - p & \text{otherwise.} \end{cases} \qquad (7)$$

In order to distinguish this architecture from the other architecture, which is described in Section 4, the architecture discussed in this section is called Architecture 1 hereafter.

[Fig. 5. The PE logic used in the optimized Architecture 1 of MWR2MM implementation (only the combinational logic in Task E is illustrated, S is represented in non-redundant form).]

[Fig. 6. The optimized architecture (S is represented in non-redundant form, i = 0, p, 2p, ...).]

[Fig. 7. An example of computations for 5-bit operands (n = 5, w = 2, e = 3) in Architecture 1 using (a) three PEs (p = 3), (b) two PEs (p = 2).]

4 THE ALTERNATIVE OPTIMIZED HARDWARE ARCHITECTURE OF MWR2MM ALGORITHM

In Section 3, we presented the optimization technique for improving the performance of the original implementation architecture by Tenca and Koç. In this section, we present an alternative optimized hardware architecture for implementing the MWR2MM algorithm. The corresponding data dependency graph is shown in Fig. 8. Similar to the previous data dependency graphs in Fig. 1 and Fig. 4, the computation of each column in Fig. 8 can be processed by one separate PE. Similarly to the graph in Fig. 4, there is only one clock cycle latency

between the processing of two adjacent columns in this data dependency graph. These three data dependency graphs map Algorithm 2 following different strategies, as shown in Fig. 9. In Fig. 1 and Fig. 4, each column corresponds to a single iteration of the i loop and covers all iterations of the j loop, as shown in Fig. 9(a) and Fig. 9(b) respectively. In contrast, each column in Fig. 8 corresponds to a single iteration of the j loop and covers all iterations of the i loop, as shown in Fig. 9(c).

[Fig. 8. Data dependency graph of the proposed alternative architecture (Architecture 2) of MWR2MM algorithm (S is represented in non-redundant form).]

[Fig. 9. Three different approaches for mapping MWR2MM algorithm: (a) the architecture by Tenca and Koç, (b) the proposed Architecture 1, (c) the proposed alternative Architecture 2.]

Algorithm 5: Computations in Task F
Input: $q_i$, $x_i$, $C^{(e-1)}$, $Y^{(e-1)}$, $M^{(e-1)}$, $S^{(e-1)}_{w-1..1}$, $C^{(e)}_0$
Output: $C^{(e)}_0$, $S^{(e-1)}_{w-1..1}$, $S^{(e-1)}_0$
5.1  $(C^{(e)}, S^{(e-1)}_{w-1..0}) = (C^{(e)}_0, S^{(e-1)}_{w-1..1}) + C^{(e-1)} + x_i \cdot Y^{(e-1)} + q_i \cdot M^{(e-1)}$;

Following the data dependency graph in Fig. 8, we present an alternative hardware architecture of the MWR2MM algorithm in Fig. 10.
This architecture can finish the computation of a Montgomery multiplication of n-bit operands in $n + e - 1$ clock cycles. Furthermore, this alternative design is simpler than the approach given in [4] in terms of control logic and data path logic. Hereafter, we call this alternative architecture Architecture 2.

As shown in Fig. 10(d), Architecture 2 consists of e PEs forming a computation chain. Each PE focuses on the computation of a specific word in S, i.e., PE #j only works on $S^{(j)}$. In other words, each PE corresponds to one fixed round j in the inner loop of Algorithm 2. Meanwhile, all PEs scan different bits of operand X at the same time.

The same optimization technique is applied to avoid the extra clock cycle delay due to the right shift. The pseudocode in Algorithm 4 describes the function and internal logic of PE #j. The function of the combinational logic is given by lines 4.1 and 4.2. Lines 4.3 to 4.8 are implemented using two 2-to-1 multiplexers, shown in the diagram to the right of the register. Fig. 11 demonstrates the computations of the first 3 PEs in the first 3 clock cycles.

The internal logic of all PEs is the same except for the two PEs residing at the head and tail of the chain. PE #0, shown in Fig. 10(a) as the cell of type D, is also responsible for computing $q_i$ and has no $C^{(j)}$ input. This PE implements Algorithm 3. PE #(e−1), shown in Fig. 10(c) as type F, has only one internal branch because the most significant

[Fig. 10. (a) The internal logic of PE #0 of type D. (b) The internal logic of PE #j of type E. (c) The internal logic of PE #e−1 of type F. (d) The proposed alternative architecture of MWR2MM algorithm, Architecture 2 (S is represented in non-redundant form).]

bit of $S^{(e-1)}$ is equivalent to $C^{(e)}_0$, which is determined at the beginning of every clock cycle. This PE implements Algorithm 5.

Two shift registers parallel to the PEs carry $x_i$ and $q_i$, respectively, and do a right shift every clock cycle. Before the start of multiplication, all registers, including the two shift registers and the internal registers of PEs, should be reset to zeros. All the bits of X will be pushed into the shift register one by one, followed by zeros. The second shift register will be filled with values of $q_i$ computed by PE #0 of type D. All the registers can be enabled at the same time after the multiplication process starts because the additions of $Y^{(j)}$ and $M^{(j)}$ will be nullified by the zeros in the two shift registers before the values of $x_0$ and $q_0$ reach a given stage.

The internal register of PE #j keeps the value of $S^{(j)}$ that should be shifted one bit to the right for the next round of calculations. This feature gives us two options to generate the final product.
1) The contents of $S^{(j)}_{w-1..0}$ can be stored in e clock cycles after PE #0 finishes the calculation of the most significant bit of X, i.e., after n clock cycles, and then the circuit can do a right shift on all accumulated bits. Or,
2) One more round of calculation can be performed right after the round with the most significant bit of X. In order to do so, one bit of 0 needs to be pushed into the two shift registers to make sure that the additions of $Y^{(j)}$ and $M^{(j)}$ are nullified and the only operation performed by the circuit is a right shift. Then the contents of $S^{(j)}_{w-1..0}$ are collected in e clock cycles after PE #0 finishes its extra round of calculations. These words are concatenated to form the final product.

After the final product is generated, there are two methods to collect it. If the internal registers of PEs are disabled after the end of computation, the entire result can be read in parallel after n + e clock cycles. Alternatively, the results can be read word by word in e clock cycles by connecting the internal registers of PEs into a shift register chain. The exact way of collecting the results largely depends on the application. For example, in the implementation of RSA, a parallel output would be preferred, while in ECC computations, reading results word by word may be more appropriate.

[Fig. 11. Data operation in the alternative architecture (Architecture 2).]

5 HIGH-RADIX ARCHITECTURE OF MONTGOMERY MULTIPLICATION

The concepts illustrated in Fig. 4 and Fig. 8 can be adapted to the design of a high-radix hardware architecture of Montgomery multiplication. Instead of scanning one bit of X every time, several bits of X can be scanned together for high-radix cases. Assuming k bits of X are scanned at one time, $2^k$ branches should be covered at the same time to maximize the performance. Considering that the value of $2^k$ increases exponentially as k increments, the design becomes impractical beyond radix-4.

Algorithm 6: Multiple-Word Radix-4 Montgomery Multiplication Algorithm
Input: odd M, $n = \lfloor \log_2 M \rfloor + 1$, word size w, $e = \lceil (n+1)/w \rceil$, $X = \sum_{i=0}^{\lceil n/2 \rceil - 1} x^{(i)} \cdot 4^{i}$, $Y = \sum_{j=0}^{e-1} Y^{(j)} \cdot 2^{w \cdot j}$, $M = \sum_{j=0}^{e-1} M^{(j)} \cdot 2^{w \cdot j}$, with $0 \le X, Y < M$
Output: $Z = \sum_{j=0}^{e-1} S^{(j)} \cdot 2^{w \cdot j} = MP(X, Y, M) \equiv X \cdot Y \cdot 2^{-n} \pmod{M}$, $0 \le Z < 2M$
6.1  $S = 0$; /* initialize all words of S */
6.2  for $i = 0$ to $n - 1$ step 2 do
6.3      $q^{(i)} = Func(S^{(0)}_{1..0}, x^{(i)}, Y^{(0)}_{1..0}, M^{(0)}_{1..0})$; /* $q^{(i)}$ and $x^{(i)}$ are 2-bit long */
6.4      $(C^{(1)}, S^{(0)}) = S^{(0)} + x^{(i)} \cdot Y^{(0)} + q^{(i)} \cdot M^{(0)}$; /* C is 3-bit long */
6.5      for $j = 1$ to $e - 1$ step 1 do
6.6          $(C^{(j+1)}, S^{(j)}) = C^{(j)} + S^{(j)} + x^{(i)} \cdot Y^{(j)} + q^{(i)} \cdot M^{(j)}$;
6.7          $S^{(j-1)} = (S^{(j)}_{1..0}, S^{(j-1)}_{w-1..2})$;
6.8      $S^{(e-1)} = (C^{(e)}_{1..0}, S^{(e-1)}_{w-1..2})$;
6.9  return $Z = S$;

Following the same definitions regarding words as in Algorithm 2, the radix-4 version of Montgomery multiplication is shown as Algorithm 6. Two bits of X are scanned in one step this time instead of one bit as in Algorithm 2.
While reaching the maximal parallelism, the radix-4 version design takes n + e clock cycles to process n-bit Montgomery multiplication.
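The word-level data flow of Algorithm 6 can be checked with a small software model. The Python sketch below is illustrative only and is not the authors' hardware description: the function name `mwr4mm` is hypothetical, the quotient digit is obtained arithmetically (using the fact that an odd M is its own inverse modulo 4) instead of from a look-up table, and the outer loop is assumed to run over the ⌈n/2⌉ radix-4 digits of X, so the result carries a factor of 4^(-⌈n/2⌉), which equals 2^(-n) for even n.

```python
def mwr4mm(X: int, Y: int, M: int, w: int = 16) -> int:
    """Software model of a multiple-word radix-4 Montgomery multiplication.

    Returns Z with Z * 4**ceil(n/2) == X * Y (mod M) and 0 <= Z < 2M.
    """
    assert M % 2 == 1 and 0 <= X < M and 0 <= Y < M
    n = M.bit_length()
    e = -(-(n + 1) // w)            # number of w-bit words, ceil((n+1)/w)
    digits = (n + 1) // 2           # number of radix-4 digits of X, ceil(n/2)
    mask = (1 << w) - 1
    Yw = [(Y >> (w * j)) & mask for j in range(e)]
    Mw = [(M >> (w * j)) & mask for j in range(e)]
    S = [0] * e
    m_inv = M & 3                   # for odd M, M * M == 1 (mod 4)

    for i in range(digits):
        x = (X >> (2 * i)) & 3      # current 2-bit digit of X
        # Choose q so that S(0) + x*Y(0) + q*M(0) is divisible by 4.
        q = (-(S[0] + x * Yw[0]) * m_inv) % 4
        t = S[0] + x * Yw[0] + q * Mw[0]
        C, S[0] = t >> w, t & mask
        for j in range(1, e):
            t = C + S[j] + x * Yw[j] + q * Mw[j]
            C, S[j] = t >> w, t & mask
            # Two-bit right shift across the word boundary.
            S[j - 1] = ((S[j] & 3) << (w - 2)) | (S[j - 1] >> 2)
        S[e - 1] = ((C & 3) << (w - 2)) | (S[e - 1] >> 2)

    return sum(s << (w * j) for j, s in enumerate(S))


print(mwr4mm(7, 5, 13, w=4))  # → 3, since 3 * 4**2 = 48 = 35 (mod 13)
```

The model processes the words of S strictly from least to most significant, which is the dependency that the hardware architectures resolve by precomputing both branches for the incoming bits.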

The carry variable C has 3 bits, which can be proven in a similar way to the proof of the radix-2 case. The value of $q^{(i)}$ at line 6.3 of Algorithm 6 is defined by a function involving $S^{(0)}_{1..0}$, $x^{(i)}$, $Y^{(0)}_{1..0}$, and $M^{(0)}_{1..0}$ so that (8) is satisfied.

$$S^{(0)}_{1..0} + x^{(i)} \cdot Y^{(0)}_{1..0} + q^{(i)} \cdot M^{(0)}_{1..0} \equiv 0 \pmod{4} \tag{8}$$

Since M is odd, $M^{(0)}_0 = 1$. From (8), we can derive

$$q^{(i)}_0 = S^{(0)}_0 \oplus (x^{(i)}_0 \cdot Y^{(0)}_0) \tag{9}$$

where $x^{(i)}_0$ and $q^{(i)}_0$ denote the least significant bit of $x^{(i)}$ and $q^{(i)}$, respectively. The bit $q^{(i)}_1$ is a function of only seven one-bit variables and can be computed using a relatively small look-up table.

The multiplication by 3, which is necessary to compute $x^{(i)} \cdot Y^{(j)}$ and $q^{(i)} \cdot M^{(j)}$, can be done on the fly or avoided by using Booth recoding as discussed in [6]. Using the Booth recoding would require adjusting the algorithm and architecture to deal with signed operands.

Furthermore, we can generalize Algorithm 6 to handle the MWR$2^k$MM algorithm. In general, $x^{(i)}$ and $q^{(i)}$ are both k-bit variables. $x^{(i)}$ is a k-bit digit of X, and $q^{(i)}$ is defined by (10).

$$S^{(0)} + x^{(i)} \cdot Y^{(0)} + q^{(i)} \cdot M^{(0)} \equiv 0 \pmod{2^k} \tag{10}$$

Nevertheless, the implementation of the proposed optimization for the case of k > 2 would be impractical in the majority of applications.

6 HARDWARE IMPLEMENTATION AND COMPARISON OF DIFFERENT ARCHITECTURES

In this section, we compare five major types of architectures for Montgomery multiplication from the point of view of the number of PEs and latency in clock cycles.

In the architecture by Tenca and Koç, the number of PEs can vary between one and $P_{max} = \lceil (e+1)/2 \rceil$. The larger the number of PEs, the smaller the latency, but the larger the circuit area. This feature allows the designer to choose the best possible trade-off between these two requirements.

The architecture by Harris et al. [7] has similar scalability to the original architecture by Tenca and Koç [4]. Instead of right-shifting the intermediate $S^{(j)}$ values, their architecture left-shifts Y and M to avoid the data dependency between $S^{(j)}$ and $S^{(j-1)}$. The data processing diagram in Harris' architecture is shown in Fig. 12. For the design with the number of PEs optimized for minimum latency, the architecture by Harris reduces the number of clock cycles from 2n + e − 1 (for Tenca and Koç [4]) to n + 2e − 1.

Fig. 12. Data operation in Harris' architecture [7] (bits with the gray background are ignored due to the left shift).

Our optimized architecture, Architecture 1, is built using similar concepts to the architecture by Tenca and Koç. However, it is able to reduce the processing latency to approximately half while preserving the scalability of the original architecture.

Our alternative architecture, Architecture 2, and the architecture by McIvor et al. both have fixed size, optimized for minimum latency. Our architecture consists of e PEs, each operating on operands of the size of a single word. The architecture by McIvor et al. consists of just one PE, operating on multi-precision numbers represented in the carry-save form. The final result of the McIvor architecture obtained after n clock cycles is expressed in the carry-save redundant form. In order to convert this result to the non-redundant binary representation, additional e clock cycles are required, which makes the total latency of this architecture comparable to the latency of our architecture. In a sequence of modular multiplications, such as the one required for modular exponentiation, the conversion to the non-redundant representation can be delayed to the very end of computations. Therefore each subsequent Montgomery multiplication can start every n clock cycles. A similar property can be implemented in our architecture by starting a new multiplication immediately after the first PE, PE #0, has released the least significant word of the final result.

Architecture 2 can be parameterized in terms of the value of the word size w.
The larger w, the smaller the number of PEs, but the larger the size of a single PE. Additionally, the larger w, the smaller the maximum clock frequency, especially in the redundant representation. The latency expressed in the number of clock cycles is equal to $n + \lceil (n+1)/w \rceil - 1$, and is almost independent of w for w ≥ 16. Since actual FPGA-based platforms, such as SRC-6 used in our implementations, have a fixed target clock frequency, this target clock frequency determines the optimum value of w. Additionally, the same HDL code can be used for different values of the operand size n and the parameter w, with only a minor change in the values of respective constants.

Fig. 13. Distributing the computation of $c + S^{(j)} + x_i \cdot Y^{(j)} + q_i \cdot M^{(j)}$ into two clock cycles: (a) Logic diagram, (b) Implementation of Full Adder in Xilinx FPGAs, (c) Implementation of Half Adder in Xilinx FPGAs, (d) Implementation of $S_{w-1..0} + Z_{w..0} + C_{1..0}$ in Xilinx Virtex-II FPGA device, w = 5 ($Z = x_i \cdot Y_{w-1..0} + q_i \cdot M_{w-1..0}$).

Both optimized architectures, Architecture 1 and Architecture 2, have been implemented in Verilog HDL, and their codes have been verified using a reference software implementation. The results completely matched. We have selected the Xilinx Virtex-II 6000FF1517-4 FPGA device used in the SRC-6 reconfigurable computer for the prototype implementations. The synthesis tool was Synplify Pro 9 and the Place and Route tool was Xilinx ISE 9.

We have implemented four different sizes of multipliers, 1024, 2048, 3072 and 4096 bits, respectively, in the radix-2 case using Verilog-HDL to verify our approach. The resource utilization on a single FPGA is shown in Table 2. For comparison, we have implemented the multipliers of these four sizes following the hardware architectures by Tenca and Koç and by Harris et al. as well. Additionally, we have implemented the approach based on CSA (Carry Save Addition) from [14] as a reference. The purpose is to show how the MWR2MM architecture compares with other types of architectures in terms of resource utilization and performance. The word size is fixed at 16 bits for most of the architectures implementing the MWR2MM algorithm. Moreover, the 32-bit case of Architecture 2 is tested as well to show the trade-off among clock rate, minimum latency and area.
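The cycle counts discussed in this section are easy to tabulate. The sketch below is illustrative only: it assumes a latency of 2n + e − 1 clock cycles for the architecture by Tenca and Koç and n + e − 1 clock cycles for the proposed architectures, with e = ⌈(n+1)/w⌉ words, and the function names are hypothetical.

```python
def num_words(n: int, w: int) -> int:
    """Number of w-bit words e = ceil((n + 1) / w)."""
    return -(-(n + 1) // w)


def latency_tenca_koc(n: int, w: int) -> int:
    """Assumed minimum latency (clock cycles) of the original architecture."""
    return 2 * n + num_words(n, w) - 1


def latency_proposed(n: int, w: int) -> int:
    """Assumed minimum latency (clock cycles) of the proposed architectures."""
    return n + num_words(n, w) - 1


for n in (1024, 2048, 3072, 4096):
    print(n,
          latency_tenca_koc(n, 16),   # original architecture, w = 16
          latency_proposed(n, 16),    # proposed architectures, w = 16
          latency_proposed(n, 32))    # proposed architecture, w = 32
# First line printed: 1024 2112 1088 1056
```

Doubling the word size from 16 to 32 bits changes the proposed latency by only about 3% for n = 1024, which illustrates why the cycle count is nearly independent of w and the choice of w is driven mainly by the achievable clock frequency and area.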
In order to maximize the performance, we used the maximum number of PEs in the implementation of all three scalable architectures, i.e., the architecture by Tenca and Koç [4], the architecture by Harris et al. [7], and Architecture 1. Therefore, the queue (shown in Fig. 6) is not implemented in all three cases. In the implementation of these four architectures, S is represented in non-redundant form. In other words, carry-ripple adders are used in the implementation. In order to minimize the critical path delay in the carry-ripple addition of $c + S^{(j)} + x_i \cdot Y^{(j)} + q_i \cdot M^{(j)}$, this three-input addition with carry is broken into two two-input additions. As shown in Fig. 13(a), $x_i \cdot Y^{(j)} + q_i \cdot M^{(j)}$ is pre-computed one clock cycle ahead of its addition with $S^{(j)}$. This technique is applied to the implementation of all four cases to maximize the frequency. This design point is appropriate when the target device is an FPGA device with abundant hardware resources. When area constraint is of high priority, or S is represented in redundant form (as suggested in [4], [5], [7]), this frequency-oriented technique may become unnecessary.

The real implementation of the second two-input addition with two-bit carry in a Xilinx Virtex-II device is illustrated in Fig. 13(d). w + 2 full adders (FAs) and w half adders (HAs) form two parallel chains to perform the addition. Considering the w FAs used in the first addition, the implementation of the logic in Fig. 13(a) requires 3w + 2 FAs or HAs. Compared with the 2w FAs used in Fig. 3(c), the non-redundant pipelined implementation of Montgomery multiplication will consume approximately 50% more hardware resources than the implementation in redundant form on the Xilinx Virtex-II platform.

From Table 2, we can see that both Architecture 1 and Architecture 2 (radix-2 and w=16) give a speedup by a factor of almost two compared with the architecture by Tenca and Koç [4] in terms of latency expressed in the number of clock cycles.
The minimum clock period is comparable in both cases and extra propagation delay in our architecture is introduced only by the multiplexers directly following the Registers, as shown in Fig. 6 and Fig. 10.

The resource requirements of the PE in the three scalable architectures are very close to each other because most of their logic is the same. The implementations of both Harris' architecture and Architecture 1 use twice as many PEs as the architecture by Tenca and Koç. At the same time, they both require only about 44% more resources (in LUTs) compared with Tenca and Koç's architecture. This feature is due to the way LUTs are counted by implementation tools; namely, a LUT is counted as one even if not all of its inputs are used. A close observation of the area report by Synplify Pro reveals that in the cases of both Harris' architecture and Architecture 1, the percentage of fully or close-to-fully used LUTs is much higher than in the case of Tenca and Koç's architecture.

Architecture 2 occupies 16% less resources than the architecture by Tenca and Koç in terms of LUTs, although our Architecture 2 uses almost twice as many PEs. This result is mainly due to the fact that our PE shown in Fig. 10(b) is substantially simpler than the PE in the architecture by Tenca and Koç [4]. The PE in [4] is responsible for calculating multiple columns of the dependency graph shown in Fig. 1. Therefore it must switch its function between Task A and Task B, depending on the phase of calculations.

TABLE 2
Hardware resource requirement and performance of the implementations on Xilinx Virtex-II 6000FF1517-4 FPGA. For each architecture and each operand size (1024, 2048, 3072 and 4096 bits), the table lists Max Frequency (MHz), Number of PEs, Min Latency (clks), Min Latency (µs), Area (LUTs), and Min Latency × Area (µs × LUTs).

Min Latency (clks)                               1024-bit  2048-bit  3072-bit  4096-bit
Scalable Architectures
  Tenca and Koç [4] (radix-2, w=16)                 2,112     4,224     6,336     8,448
  Harris et al. [7] (radix-2, w=16)                 1,167     2,319     3,471     4,623
  Our Proposed Architecture 1 (radix-2, w=16)       1,088     2,176     3,264     4,352
Non-scalable Architectures
  McIvor et al. [14] (radix-2)                      1,025     2,049     3,073     4,097
  Our Proposed Architecture 2 (radix-2, w=16)       1,088     2,176     3,264     4,352
  Our Proposed Architecture 2 (radix-2, w=32)       1,056     2,112     3,168     4,224

1. The number of PEs is optimized for the minimum latency.
2. In all the implementations except the one by McIvor et al. [14], S is represented in non-redundant form.
In contrast, in our Architecture 2, each PE is responsible for only one column of the dependency graph in Fig. 8 and one Task, either D or E or F. Additionally in [4], the words $Y^{(j)}$ and $M^{(j)}$ must rotate with regard to PEs, which further complicates the control logic.

Compared with the architecture by McIvor et al. [14], our Architecture 2 (radix-2 and w=16) has a comparable latency expressed in the number of clock cycles. In terms of clock frequency, McIvor's architecture is better by 4-47%, but in terms of area, our architecture is superior by almost a factor of two. As a result, Architecture 2 outperforms McIvor's design in terms of the product of latency times area by about 20%.

In Table 3, the performance gain of various architectures against the architecture of Tenca and Koç is summarized. Harris' architecture, Architecture 1 and Architecture 2 all consistently outperform the classic architecture by Tenca and Koç in terms of both latency and the product of latency times area, for all four investigated operand sizes. Both Harris' architecture and Architecture 1 achieve a gain of around 20% regarding the product of latency times area. Architecture 2 can achieve a gain up to 50% due to much smaller resource requirements.

In all investigated architectures, the time between two consecutive Montgomery multiplications can be further reduced by overlapping computations for two consecutive sets of operands. In the original architecture by Tenca and Koç, this repetition interval is equal to 2n clock cycles, and in all other investigated architectures n clock cycles.

TABLE 3
Performance gain (%) against the architecture by Tenca and Koç [4]. Columns: 1024-bit, 2048-bit, 3072-bit, 4096-bit. Row groups: Min Latency (µs) and Latency × Area. Rows in each group: Harris' Architecture (radix-2, w=16), Architecture 1 (radix-2, w=16), McIvor's Architecture (radix-2), Architecture 2 (radix-2, w=16), Architecture 2 (radix-2, w=32).

TABLE 4
Comparison of the radix-2 and the radix-4 versions of Architecture 2 (w=16) for the implementation on Xilinx Virtex-II 6000FF1517-4 FPGA. Columns: Bit length, Radix, Max freq. (MHz), Min latency (clocks), Min latency (µs), Area (LUTs), and the radix-4/radix-2 ratios of latency (µs) and latency (µs) × area.

For the radix-4 case, we have only implemented four different operand sizes, 1024, 2048, 3072, and 4096 bits, of Montgomery multipliers in Architecture 2 as a showcase. The word-length is the same as the one in the radix-2 case, i.e., 16 bits. For all four cases, the maximum frequency is comparable for both radix-2 and radix-4 designs. Moreover, the minimum latency of the radix-4 designs is almost half of that of the radix-2 designs. In the meantime, the radix-4 designs occupy more than twice as many resources as the radix-2 versions. These figures fall within our expectations because the radix-4 PE has 4 internal branches, which doubles the quantity of branches of the radix-2 version, and some small design tweaks were required to redeem the propagation delay increase caused by more complicated combinational logic. Some of these optimization techniques are listed below,

1) At line 6.6 of Algorithm 6 there is an addition of three operands whose length is w-bit or larger. To reduce the propagation delay of this step, we precomputed the value of $x^{(i)} \cdot Y^{(j)} + q^{(i)} \cdot M^{(j)}$ one clock cycle before it arrives at the corresponding PE.
2) For the first PE, in which the update of $S^{(0)}$ and the evaluation of $q^{(i)}$ happen in the same clock cycle, we cannot precompute the value of $x^{(i)} \cdot Y^{(0)} + q^{(i)} \cdot M^{(0)}$ in advance. To overcome this difficulty, we precompute four possible values of $x^{(i)} \cdot Y^{(0)} + q^{(i)} \cdot M^{(0)}$ corresponding to $q^{(i)}$ = 0, 1, 2, 3, and make a decision at the end of the clock cycle based on the real value of $q^{(i)}$.

As mentioned at the beginning of Section 5, the hardware implementation of our optimization beyond radix-4 is no longer viable considering the large resource cost for covering all the $2^k$ branches in one clock cycle, and the need to perform multiplications of words by numbers in the range $0..2^k - 1$.

7 CONCLUSION

In this paper, we present two new hardware architectures for Montgomery multiplication. These architectures are based on the new idea for enhancing parallelism by precomputing partial results using two different assumptions regarding the most significant bit of each partial result word. Additionally, Architecture 2 introduces a new original data dependency graph, aimed at significantly simplifying the control unit of each Processing Element. Both architectures improve on the well known architecture by Tenca and Koç, first presented at CHES 1999, and then published in the IEEE Transactions on Computers in 2003. Both architectures reduce the circuit latency by almost a factor of two, from 2n + e − 1 clock cycles to n + e − 1 clock cycles, with a negligible penalty in terms of the minimum clock period. Our Architecture 1 preserves the scalability of the original design by Tenca and Koç. Further, it outperforms the Tenca-Koç design by about 23% in terms of the product of latency times area when implemented on Xilinx Virtex-II 6000 FPGA. Our Architecture 2 breaks with the scalability of the original scheme in favor of optimizing the design for the case of minimum latency. This architecture outperforms the original design by Tenca and Koç by 50% in terms of the product of latency times area for the four most popular operand sizes used in cryptography (1024, 2048, 3072 and 4096 bits).

Both our architectures have also been compared with two other latency-optimized architectures reported earlier in the literature: the scalable architecture by Harris et al. from 2005 and the non-scalable architecture by McIvor et al. from 2004. Our scalable Architecture 1 demonstrates performance comparable to that of the architecture by Harris et al., while using a substantially different optimization method. Our non-scalable Architecture 2 has a longer latency than the architecture by McIvor et al., but at the same time it outperforms this architecture in terms of the product of latency times area by about 20% for all operand sizes. These two new architectures can be extended from radix-2 to radix-4 in order to further reduce their circuit latency at the cost of increasing the product of latency times area.

Our architectures have been fully verified by modeling them using Verilog-HDL, and comparing their function vs. a reference software implementation of Montgomery multiplication based on the GMP library. Our code has been implemented on Xilinx Virtex-II 6000 FPGA and experimentally tested on the SRC-6 reconfigurable computer. Our architectures can be easily parameterized, so the same generic code with different values of parameters can be easily used for multiple operand and word sizes.

ACKNOWLEDGMENT

The authors would like to acknowledge the contributions of Hoang Le, Ramakrishna Bachimanchi and Marcin Rogawski from George Mason University who provided results for their implementation of the Montgomery multiplier from [14]. The authors also would like to thank Prof. Soonhak Kwon from Sungkyunkwan University in South Korea for helpful discussions and comments. Finally we are grateful to the anonymous reviewers for their invaluable suggestions and comments to improve the quality and fairness of this paper.

REFERENCES

[1] R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, no. 2, pp. 120-126, 1978.
[2] P. L. Montgomery, "Modular multiplication without trial division," Mathematics of Computation, vol. 44, no. 170, pp. 519-521, Apr. 1985.
[3] K. Gaj et al., "Implementing the elliptic curve method of factoring in reconfigurable hardware," in CHES 2006, Springer-Verlag Lecture Notes in Computer Sciences, vol. 4249, Oct. 2006.
[4] A. F. Tenca and Ç. K. Koç, "A scalable architecture for Montgomery multiplication," in CHES 99, Springer-Verlag Lecture Notes in Computer Sciences, vol. 1717, 1999.
[5] A. F. Tenca and Ç. K. Koç, "A scalable architecture for modular multiplication based on Montgomery's algorithm," IEEE Trans. Comput., vol. 52, no. 9, pp. 1215-1221, Sept. 2003.
[6] A. F. Tenca, G. Todorov, and Ç. K. Koç, "High-radix design of a scalable modular multiplier," in CHES 2001, Springer-Verlag Lecture Notes in Computer Sciences, vol. 2162, 2001, pp. 185-201.
[7] D. Harris, R. Krishnamurthy, M. Anders, S. Mathew, and S. Hsu, "An improved unified scalable radix-2 Montgomery multiplier," in Proc. the 17th IEEE Symposium on Computer Arithmetic (ARITH 17), June 2005.
[8] N. Jiang and D. Harris, "Parallelized radix-2 scalable Montgomery multiplier," in Proc. IFIP International Conference on Very Large Scale Integration, 2007 (VLSI-SoC 2007), Oct. 2007.
[9] N. Pinckney and D. M. Harris, "Parallelized radix-4 scalable Montgomery multipliers," Journal of Integrated Circuits and Systems, vol. 3, Mar. 2008.
[10] K. Kelley and D. Harris, "Parallelized very high radix scalable Montgomery multipliers," in Proc. the Thirty-Ninth Asilomar Conference on Signals, Systems and Computers, 2005, Oct. 2005, pp. 1196-1200.
[11] D. A. Michalski and D. A. Buell, "A scalable architecture for RSA cryptography on large FPGAs," in Proc. International Conference on Field Programmable Logic and Applications, 2006 (FPL 2006), Aug. 2006.
[12] Ç. K. Koç, T. Acar, and B. S. Kaliski Jr., "Analyzing and comparing Montgomery multiplication algorithms," IEEE Micro, vol. 16, no. 3, pp. 26-33, 1996.
[13] C. McIvor, M. McLoone, and J. V. McCanny, "High-radix systolic modular multiplication on reconfigurable hardware," in Proc. IEEE International Conference on Field-Programmable Technology 2005 (ICFPT 2005), Dec. 2005.
[14] C. McIvor, M. McLoone, and J. V. McCanny, "Modified Montgomery modular multiplication and RSA exponentiation techniques," IEE Proceedings Computers and Digital Techniques, vol. 151, no. 6, pp. 402-408, Nov. 2004.
[15] L. Batina and G. Muurling, "Montgomery in practice: How to do it more efficiently in hardware," in Proc. The Cryptographer's Track at the RSA Conference on Topics in Cryptology (CT-RSA 2002), Feb. 2002.
[16] C. D. Walter, "Precise bounds for Montgomery modular multiplication and some potentially insecure RSA moduli," in Proc. The Cryptographer's Track at the RSA Conference on Topics in Cryptology (CT-RSA 2002), Feb. 2002.

Miaoqing Huang is an Assistant Professor in the Department of Computer Science and Computer Engineering at University of Arkansas. His research interests include reconfigurable computing, high-performance computing architectures, cryptography, image processing, computer arithmetic, and cache design in Solid-State Drives. Huang received a B.S. degree in electronics and information systems from Fudan University, China in 1998, and a Ph.D. degree in computer engineering from The George Washington University in 2009. He is a member of IEEE.

Kris Gaj received the M.Sc. and Ph.D. degrees in Electrical Engineering from Warsaw University of Technology in Warsaw, Poland. He was a founder of Enigma, a Polish company that generates practical software and hardware cryptographic applications used by major Polish banks. In 1998, he joined George Mason University, where he currently works as an Associate Professor, doing research and teaching courses in the area of cryptographic engineering and reconfigurable computing.
His research projects center on new hardware architectures for secret key ciphers, hash functions, public key cryptosystems, and factoring, as well as development of specialized libraries and application kernels for high-performance reconfigurable computers. He has been a member of the Program Committees of CHES, CryptArchi, and Quo Vadis Cryptology workshops, and a General Co-Chair of CHES 2008 in Washington D.C. He is an author of a book on breaking the German Enigma cipher during World War II.

Tarek El-Ghazawi is a Professor in the Department of Electrical and Computer Engineering at The George Washington University. At GWU, he is the founding director of GW IMPACT: The Institute for Massively Parallel Applications and Computing Technologies, and a founding Co-Director of the NSF Industry/University Center for High-Performance Reconfigurable Computing (CHREC). El-Ghazawi's research interests include high-performance computing, computer architectures, and reconfigurable computing. He is one of the principal co-authors of the UPC parallel programming language and the UPC book from John Wiley and Sons. He received his Ph.D. degree in Electrical and Computer Engineering from New Mexico State University in 1988. El-Ghazawi has close to 200 refereed research publications in these areas. Dr. El-Ghazawi's research has been frequently supported by government agencies and industry, and he received the IBM faculty partnership award in 2004. He serves or has served on many technical advisory boards. El-Ghazawi is a Program Chair for the 6th International Symposium on Applied Reconfigurable Computing (ARC 2010) and a General Chair for the 10th IEEE International Conference on Scalable Computing and Communications (ScalCom-10) and has served in many conference leadership and editorial duties. He is a senior member of the Institute of Electrical and Electronics Engineers (IEEE), and a member of the ACM, IFIP WG10.3, and Phi Kappa Phi National Honor Society.


More information
