
A Super-Serial Galois Fields Multiplier for FPGAs and its Application to Public-Key Algorithms

Gerardo Orlando, GTE Government Systems, 77 A. St., Needham, MA
Christof Paar, ECE Department, Worcester Polytechnic Institute, Institute Road, Worcester, MA

Presented at: FCCM '99, April 21-23, 1999, Napa Valley, California

Abstract

This contribution introduces a scalable multiplier architecture for Galois fields $GF(2^k)$ amenable to field programmable gate array (FPGA) implementations. This architecture is well suited for the implementation of public-key cryptosystems, which require programmable multipliers in large Galois fields. The architecture trades a reduction in resources for an increase in the number of clock cycles. It is also fine-grain scalable in both the time and the area (or logic) dimensions, thus facilitating implementations that maximize their use of finite FPGA resources while achieving fast computational speed. This leads to an architecture that requires fewer resources than traditional bit-serial multipliers, which we demonstrate with implementations of multipliers in the field $GF(2^{167})$. Our results demonstrate that for this field one can realize super-serial multipliers that use 2.76 times fewer function generators and 6.84 times fewer flip-flops than their serial multiplier counterparts. We also extrapolated the performance of these multipliers in an elliptic curve cryptosystem.

1 Introduction

Public-key cryptography is an indispensable technology of the new digital age. This technology enables identification, authentication, secure communications, intellectual property protection, exchanges of goods and services, and many other services. Many popular public-key algorithms require arithmetic in Galois fields $GF(2^k)$, including in particular schemes based on the intractable discrete logarithm in finite fields (see [1] for an overview), elliptic curve discrete logarithm [2, 3], or hyperelliptic curves [4].
Galois field multiplication is usually considered the most crucial operation for the performance of these cryptosystems. All practical public-key schemes require operations in relatively large finite fields; e.g., about 150 to 250 bits for elliptic curve systems and about 1024 bits for systems based on the discrete logarithm problem in finite fields. For physical security as well as for performance reasons, implementations of Galois field arithmetic in hardware are very attractive. At the same time, the algorithm-independent design paradigm of modern cryptographic protocols and flexible security levels require alterable implementations that are difficult to provide with traditional (non-reconfigurable) hardware. Our architecture deals with these issues by offering fine-grained scalability and the inherent reprogrammability and speed of FPGAs.

Previous work in this area includes traditional bit-serial multipliers (referred to here as serial multipliers), which compute $GF(2^k)$ multiplication in $k$ cycles using $O(k)$ logic elements, and parallel multipliers that compute field multiplications in one cycle using $O(k^2)$ logic elements. More recently, hybrid multipliers for composite fields $GF((2^n)^m)$ were introduced in [5, 6] and digit-serial architectures were introduced in [7]. These multipliers perform multiplication in $m$ cycles using $O(mk)$ logic elements. Most of the documented studies focus on ASIC implementations; only a few, including [8, 9, 10], focus on reprogrammable logic.

This work introduces, to our knowledge, a new sliced, polynomial basis multiplier based on a traditional serial multiplier [5]. We refer to the new architecture as super-serial. This multiplier emulates the operation of the serial multiplier using a smaller number of processors but requires more clock cycles, thus realizing smaller and proportionally slower implementations by exploring a time-space trade-off.

This paper begins with an introduction of a serial multiplier, then continues with a detailed description of the super-serial multiplier, and concludes with an analysis of the proof-of-concept implementations and their estimated performance in an elliptic curve cryptosystem. Throughout this document the acronyms SM and SSM refer to the serial and the super-serial multipliers, respectively.

2 Preliminaries

2.1 Galois Field Multiplication

We consider arithmetic in an extension field of $GF(2)$. The extension degree is denoted by $k$, so that the field can be denoted by $GF(2^k)$. This field is isomorphic to $GF(2)[x]/(P(x))$, where $P(x) = x^k + \sum_{i=0}^{k-1} p_i x^i$ is an irreducible polynomial of degree $k$ with $p_i \in GF(2)$. In the following, a residue class will be identified with the polynomial of least degree in its class. The product of two field elements, $W = UV$, where $W(x) = \sum_{i=0}^{k-1} w_i x^i$, $U(x) = \sum_{i=0}^{k-1} u_i x^i$, and $V(x) = \sum_{i=0}^{k-1} v_i x^i$, is defined as follows:

$$W(x) = U(x) V(x) \bmod P(x) \qquad (1)$$

$GF(2^k)$ multiplication can be carried out by multiplying $U(x)$ and $V(x)$ and then performing reduction modulo $P(x)$, or alternatively by interleaving multiplication and reduction according to Equation (2). In this equation $W'^{(i)}$ corresponds to the partial result generated at step $i$ of the recursion.

$$W'^{(i)} = x W'^{(i-1)} \bmod P(x) + u_{k-i} V(x) \qquad (2)$$

for $i = 1, 2, \ldots, k$; $W'^{(0)} = 0$; $W(x) = W'^{(k)}$.

In (2), the products $x W'^{(i-1)}$ are polynomials of degree $k$, which must be reduced modulo $P(x)$. These reductions are done using the following identity.

$$x^k \equiv p_{k-1} x^{k-1} + \cdots + p_1 x + p_0 \bmod P(x) \qquad (3)$$

Each field multiplication requires a total of $k^2$ bit multiplications and $k - 1$ polynomial reductions.

2.2 Serial Multiplier Architecture

The serial multiplier, sometimes referred to as the "MSB-First multiplier," is a polynomial basis multiplier that uses $k$ processors (or slices) and computes $GF(2^k)$ multiplication in $k$ cycles using Algorithm 1 [11, 12]. This algorithm is an implementation of the multiplication in (2) that uses identity (3) for polynomial reductions. The computation is carried out in $k$ cycles, each of which realizes the operation in Equation (2).

Algorithm 1: SM multiplication algorithm

    w[-1] = 0
    w[0 to k-1] = 0
    For i = k-1 down to 0 Do
        Parallel For j = 0 to k-1 Do
            w[j] = w[j-1] + (u[i] * v[j]) + (w[k-1] * p[j])
        End Parallel For
    End For

A hardware realization of the multiplier is shown in Figure 1 (note that each connection is one bit wide). The serial multiplier is particularly attractive for FPGA implementations because it uses very simple processors (or slices) and a highly distributed interconnect network. Each processor consists of three registers ($v$, $w$, $p$), two AND gates ($GF(2)$ multipliers), and an XOR gate ($GF(2)$ adder). The interconnect network consists mainly of connections between neighboring processors along with two global interconnects ($u$, $w_{k-1}$). The complexity of this multiplier can be further reduced for implementations that use irreducible polynomials with few coefficients, such as trinomials or pentanomials. On average, these implementations save one register, one AND gate, and one input of the XOR gate per processor. They also require only one global connection from the U operand shift register to all the processors.

Figure 1: Serial multiplier in $GF(2^k)$

The main building blocks of modern FPGAs are flip-flops (FF), 4-input function generators (FG) that implement any logical function of their inputs, and memory elements (MM). Using these logic elements as a metric, we estimated the logic complexity of this multiplier for configurations that implement programmable $P(x)$ polynomials and those that implement fixed $P(x)$ polynomials. The complexity of the latter is significantly lower, especially for implementations of low complexity polynomials such as trinomials and pentanomials. We also estimated the complexity of configurations that implement the U operand shift register with memory elements as well as with flip-flops. These estimates are summarized in Table 1.

In general, the complexity of the serial multiplier is proportional to $k$. For fixed polynomial implementations, it also depends on $p$, the number of non-zero coefficients of $P(x)$ minus one. For cryptographic applications where $k$ is large, $p$ can be approximated by zero for low complexity polynomials such as trinomials and pentanomials, for which $p = 2$ and $p = 4$ respectively. Because memory implementations are not uniform across FPGAs, Table 1 summarizes the complexity of implementations of the U operand shift register with memory elements as $c$, where $c$ represents the number of resources utilized; that is, $c$ under the #FG column represents the number of function generators utilized and under the #FF column represents the number of flip-flops utilized. These U shift register implementations require counters that generate memory addresses and miscellaneous multiplexing and decoding logic. For cryptographic applications, the complexity of this circuitry can also be ignored as $c \ll k$.

3 A Super-Serial Multiplier

3.1 Principal Idea

As can be seen from the SM entries of Table 1, the classical serial multiplier requires $ak$ resources for multiplication in $GF(2^k)$, where $a \geq 1$ is a constant.
Since the values of $k$ required for practical public-key algorithms are relatively large, at least 150 bits for elliptic curve systems, it seems attractive to provide multiplier architectures that require fewer resources, especially if FPGA implementations are intended. In the following, such an architecture will be developed, which still requires $bk$ resources, but this time with $b < 1$. We name this architecture the super-serial multiplier as it takes several clock cycles to compute a partial product $W'^{(i)}$.

The super-serial multiplier is also a polynomial basis multiplier. It uses $m < k$ processors and computes $GF(2^k)$ multiplication in $k \lceil k/m \rceil$ cycles (see Figure 2). It implements Algorithm 2, which is a partially parallel realization of the multiplication (2) that uses identity (3) for polynomial reduction. Like the serial multiplier, it performs $GF(2^k)$ multiplication by recursively computing $W'^{(i)}$. Each partial result is computed by its $m < k$ processors in $\lceil k/m \rceil$ cycles, thus realizing a multiplication in $k \lceil k/m \rceil$ cycles. In contrast, the serial multiplier computes the same partial result in one cycle using a larger number of processors, thus computing an entire multiplication in $k$ cycles.

It should be noted that, contrary to a serial multiplier, which for a particular field $GF(2^k)$ requires a fixed number of processors ($k$) and computes a multiplication in a fixed number of cycles ($k$), the super-serial multiplier allows implementations to choose the number of processors ($m$) that best meets their target processing time ($k \lceil k/m \rceil$ cycles per multiplication) and area requirements. This flexible design option is particularly relevant for implementations in reconfigurable hardware.

Algorithm 2: SSM multiplication algorithm

    w'[-1] = 0
    w'[0 to k-1] = 0
    w[0 to k-1] = 0
    For i = k-1 down to 0 Do
        For j = 0 to ceil(k/m)-1 Do
            Parallel For l = 0 to m-1 Do
                w[j m + l] = w'[j m + l - 1] + (u[i] * v[j m + l]) + (w'[k-1] * p[j m + l])
            End Parallel For
        End For
        w'[0 to k-1] = w[0 to k-1]
    End For

The super-serial multiplier trades speed for area. It performs essentially the same algorithm the serial multiplier implements but with a lower degree of parallelism. It should be noted that the reduced parallelism is accompanied by a reduced number of processing elements and an increased number of storage elements. Discrete implementations of the super-serial multiplier using gates and flip-flops would result in designs that are larger than their serial multiplier counterparts. On the other hand, implementations in FPGAs that provide memory elements result in smaller designs because it is generally "cheaper" to store data in memory elements than in flip-flops. As an example, the logic complexity of a 16x1-bit memory element in a Xilinx XC4000 FPGA is comparable to that of a single flip-flop.
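The relationship between Algorithm 1 and Algorithm 2 can be illustrated with a small bit-level software model. The following is only a sketch under stated assumptions, not the hardware design: the function names (`bits`, `to_int`, `sm_multiply`, `ssm_multiply`) are hypothetical, the inner loop is taken to run over $\lceil k/m \rceil$ passes, slice indices beyond $k - 1$ in the last pass are simply skipped, and the example field $GF(2^8)$ with $P(x) = x^8 + x^4 + x^3 + x + 1$ is chosen only because it is small and well known.

```python
from math import ceil

def bits(x, k):
    """Coefficient list (index i holds the coefficient of x^i) of an integer-encoded polynomial."""
    return [(x >> i) & 1 for i in range(k)]

def to_int(w):
    """Inverse of bits(): pack a coefficient list back into an integer."""
    return sum(b << i for i, b in enumerate(w))

def sm_multiply(u, v, p, k):
    """Algorithm 1 model: k slices update in parallel, once per cycle, for k cycles."""
    w = [0] * k
    for i in range(k - 1, -1, -1):
        fb = w[k - 1]  # global feedback w[k-1] from the previous cycle
        # Every slice reads the old state, so build the new state in one step.
        w = [(w[j - 1] if j > 0 else 0) ^ (u[i] & v[j]) ^ (fb & p[j])
             for j in range(k)]
    return w

def ssm_multiply(u, v, p, k, m):
    """Algorithm 2 model: m slices, ceil(k/m) cycles per partial product."""
    w_prev = [0] * k          # w' in Algorithm 2
    cycles = 0
    for i in range(k - 1, -1, -1):
        w = [0] * k
        fb = w_prev[k - 1]    # latched result of the emulated processor S_{k-1}
        for j in range(ceil(k / m)):
            for l in range(m):            # the m slices work in parallel
                idx = j * m + l
                if idx >= k:              # assumption: unused slices in the last pass idle
                    continue
                left = w_prev[idx - 1] if idx > 0 else 0
                w[idx] = left ^ (u[i] & v[idx]) ^ (fb & p[idx])
            cycles += 1
        w_prev = w
    return w_prev, cycles

if __name__ == "__main__":
    k = 8
    p = bits(0x1B, k)  # low coefficients of x^8 + x^4 + x^3 + x + 1
    u, v = bits(0x57, k), bits(0x83, k)
    print(hex(to_int(sm_multiply(u, v, p, k))))
    for m in (1, 3, 8):
        w, cycles = ssm_multiply(u, v, p, k, m)
        print(m, hex(to_int(w)), cycles)
```

In this model both functions return the same product, and the cycle counter of `ssm_multiply` equals $k \lceil k/m \rceil$; with $m = k$ the schedule degenerates to that of the serial multiplier.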

3.2 Serial Multiplier Emulation

Functionally, the super-serial multiplier performs $GF(2^k)$ multiplication by emulating the operation of a serial multiplier. The emulation is done by logically mapping the functions performed by the $k$ processors of the serial multiplier onto the $m$ processors of the super-serial multiplier. This mapping assigns the processing functions of processor $x$ of the serial multiplier to processor $y$ of the super-serial multiplier, where the relationship between $x$ and $y$ is given by $y \equiv x \bmod m$. An example of such a mapping is illustrated in Figure 2 for $m = 5$. It should be stressed that only one row with $m$ processing elements is actually realized in hardware.

Figure 2: Mapping of functions from the serial multiplier to the super-serial multiplier with $m = 5$. Here $s_x$ denotes a serial multiplier processor and $ss_x$ a super-serial multiplier processor: $ss_0$ emulates $s_0, s_5, \ldots, s_{k-4}$; $ss_1$ emulates $s_1, s_6, \ldots, s_{k-3}$; $ss_2$ emulates $s_2, s_7, \ldots, s_{k-2}$; $ss_3$ emulates $s_3, s_8, \ldots, s_{k-1}$; $ss_4$ emulates $s_4, s_9, \ldots$

The super-serial multiplier emulates the serial multiplier by first emulating processors $S_0$ to $S_{m-1}$, then processors $S_m$ to $S_{2m-1}$, and so on until all $k$ processors are emulated. This process is repeated $k$ times for a full multiplication, which results in a complete multiplication after a total of $k \lceil k/m \rceil$ cycles.

To carry out this emulation, the super-serial multiplier incorporates processors that realize the same mathematical function implemented by the processors of the serial multiplier. Each of these processors also incorporates $3 \lceil k/m \rceil$ bits of storage for the V operand, the multiplication results ($W$), and the coefficients of the irreducible polynomial ($P$). This storage is used to save and restore the state ($v$, $w$, $p$) of the processors they emulate. In addition, the super-serial multiplier incorporates a mechanism for the propagation of results from one cycle to another, as the emulation of a serial multiplier cycle spans multiple cycles.
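The mapping $y \equiv x \bmod m$ and the order of emulation can be tabulated with a few lines of code. This is a minimal sketch for illustration only; the function names are hypothetical, and the field size $k = 14$ is an assumption picked so that, with $m = 5$, the last serial processor $s_{k-1}$ lands on $ss_3$ as in the mapping described for Figure 2.

```python
from math import ceil

def emulation_schedule(k, m):
    """Map each super-serial slice y to the (pass, serial processor) pairs it emulates."""
    schedule = {y: [] for y in range(m)}
    for x in range(k):
        # Serial processor x is emulated by slice x mod m during pass x // m.
        schedule[x % m].append((x // m, x))
    return schedule

def storage_bits_per_slice(k, m):
    """Bits of v, w and p state that one slice saves and restores."""
    return 3 * ceil(k / m)

if __name__ == "__main__":
    k, m = 14, 5  # illustrative values, not taken from the implementation
    for y, entries in emulation_schedule(k, m).items():
        print(f"ss_{y}:", ", ".join(f"s_{x} (pass {j})" for j, x in entries))
    print("storage bits per slice:", storage_bits_per_slice(k, m))
```

Each pass corresponds to one clock cycle of the super-serial multiplier, so one emulated serial-multiplier cycle takes $\lceil k/m \rceil$ passes.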
When a processor of the super-serial multiplier emulates a processor of the serial multiplier, it restores the state of the emulated processor, computes the next partial result, and then saves the new state. For these processors, the first multiplication cycle is a special one. This cycle initializes $W$ with the product $u_{k-1} V(x)$ (the first product of Equation (2)). The processors support this cycle with a special reset emulation circuit that forces their output to zero when active (shown in Figure 3 with multiplexers).

The super-serial multiplier incorporates a data transfer mechanism from processor $SS_{m-1}$ to processor $SS_0$. This mechanism is used to emulate the connections between processors $S_{m-1}$ and $S_m$, $S_{2m-1}$ and $S_{2m}$, ..., $S_{m \lfloor k/m \rfloor - 1}$ and $S_{m \lfloor k/m \rfloor}$ of the serial multiplier. Although not entirely obvious, this mechanism incorporates $\lceil k/m \rceil - 1$ bits of storage because the results from the emulated processors $S_{m-1}, S_{2m-1}, \ldots, S_{m \lfloor k/m \rfloor - 1}$ are not immediately used in the following cycle. These results are used in the computation of the next partial result $W'^{(i)}$, not the current one. This detail is easier to grasp by noticing that at the beginning of a multiplication all the $w$ registers of the serial multiplier are reset, and thus all the processors receive a zero from their neighboring processors. The result of this cycle is the initialization of the multiplier with the product $u_{k-1} V(x)$. This partial result is then used in the following multiplication cycle.

This multiplier also propagates the partial results of the emulated $S_{k-1}$ processor. This result is latched so that it is available throughout the $\lceil k/m \rceil$ cycles in which it is used by various processors. As an example, in Figure 3 we assume that processor $S_{k-1}$ (which is the last processor of the serial multiplier) is realized by slice $m - 2$, so that the register for the result of $S_{k-1}$ is placed after slice $m - 2$.
In the example shown in Figure 2, this result is propagated to processor $SS_0$, where it is used in the first cycle, and to processor $SS_1$, where it is used in the second cycle.

It should be pointed out that the serial multiplier requires the loading of the V operand and one bit of the U operand before it can start computing a product. The multiplication result also becomes available all at once on the last clock cycle. For most practical implementations, the loading and unloading of data must span multiple clock cycles because the data is typically carried over busses that are much narrower than the field elements. Field elements are commonly more than 150 bits wide in public-key applications. Therefore, these multipliers must idle while I/O takes place or must store the results in temporary registers. Implementations of the super-serial multiplier can overcome this limitation by using a number of processors equal to the width of the interfacing busses. These multipliers can start computing a product as soon as the $m$ least significant bits of the V operand and the most significant bit of the U operand are available. They can also make their results available piecewise, in successive groups of $m$ bits, starting with the least significant ones and ending with the most significant ones.

3.3 Architecture

The architecture of the super-serial multiplier, shown in Figure 3, is similar to that of the serial multiplier in its processor architecture, processor communications, and storage of the U operand, but it is substantially more complex in its control structure. Whereas the serial multiplier requires minimal control, the super-serial multiplier requires a more sophisticated controller that guides the emulation of the processors of a serial multiplier and controls the transfer of partial results from one intermediate cycle to another.

Because the architecture of the super-serial multiplier is memory independent, it can be efficiently implemented in FPGAs that implement either centralized or distributed memory. Its partially distributed and partially centralized interconnect network is also well suited for FPGA implementations. Its distributed interconnects are analogous to those of the serial multiplier, and its centralized ones are very regular and in practice link a moderate number of processors.

Using the FPGA metric previously noted, we estimated the logic complexity of this multiplier for configurations that implement programmable $P(x)$ polynomials and those that implement fixed polynomials. These estimates are summarized in Table 1.
Note that these estimates exclude the complexity of the multiplier's controller, as it is tightly coupled to the FPGA technology used and to the system details. As a reference, the complexity of the controller used in the implementation documented in Subsection 4.2 is 64 function generators and 26 flip-flops. As for the serial multiplier, these estimates also represent the complexity of the U operand shift register with $c$.

The super-serial multiplier core uses mainly function generators and memory elements. It uses only one flip-flop to store the result from the emulated processor $S_{k-1}$. Its controller consumes the bulk of the flip-flops along with some function generators. In general, the complexity in terms of function generators is proportional to $m$. For fixed polynomial implementations, it also depends on $p'$, the number of non-overlapping sets of non-zero coefficients of $P(x)$. These non-overlapping sets need to be supported with independent hardware. For example, the multiplier shown in Figure 2 requires the propagation of the result from the emulated processor $S_{k-1}$ to the emulated processors $S_0$ and $S_6$, requiring the propagation of this result to two different processors, $SS_0$ and $SS_1$. For this case $p' = 2$. If this result is used by a single processor, then $p'$ equals one. This would have been the case in Figure 2 if the propagation of the result were to the emulated processor $S_5$ instead of $S_6$.

The multiplier's memory complexity depends on both $k$ and $m$. The term $k$ corresponds to the U operand memory and the terms proportional to $m$ correspond to the processors' memory. Unless $m$ is large, accurate estimates should not ignore the complexity of the controller and the U operand shift register.

Table 1: Multipliers' estimated complexity

    Mult.    U^1   P(x)^2   #FG^{3,4,5}       #FF^3     MMb^{5,6}
    SM       FF    P        2k                4k        0
    SM       FF    F        k + p             3k        0
    SM       MM    P        2k + c            3k + c    k
    SM       MM    F        k + p + c         2k + c    k
    SSM^7    MM    P        2m + 1 + c        c + 1     k + (3m + 1) * ceil(k/m)
    SSM^7    MM    F        m + 1 + p' + c    c + 1     k + (2m + 1 + p') * ceil(k/m)

    1 U - shift register implementation: memory (MM) or flip-flops (FF)
    2 P(x) - programmable polynomial (P) or fixed polynomial (F)
    3 c - refers to U shift register logic
    4 p - number of non-zero coefficients of P(x) minus 1
    5 p' - number of slices requiring the w_{k-1} input
    6 MMb - storage bits
    7 Excludes the complexity of its controller.

Figure 3: Super-serial multiplier

4 Proof-of-Concept Implementation

4.1 Rationale and Parameter Choice

The theoretical concepts were verified through implementations of a serial and a super-serial multiplier in modern FPGAs. In order to assure practical relevance, we designed a multiplier for the finite field $GF(2^{167})$ with the field polynomial $P(x) = x^{167} + x^6 + 1$ over $GF(2)$. This field is highly interesting as an underlying algebraic structure for public-key cryptosystems based on elliptic curves [13]. Elliptic curves form the most recent family of public-key algorithms with practical relevance. Due to their small field orders (as opposed to RSA or schemes based on the discrete logarithm problem in finite fields), they are very attractive for implementation in reconfigurable logic.

We concentrated on compact implementations and thus minimized the I/O and the control logic and implemented the U operand shift register using memory elements. We also implemented a common coprocessor interface for both multipliers consisting of an 11-bit data bus, a 6-bit address bus, a device select control signal, and an interrupt signal that signals the end of a multiplication.

The selection of an 11-bit data bus resulted in an area-efficient implementation for $GF(2^{167})$. It led to a high utilization of the memory elements available in the FPGAs used and minimized the I/O complexity. Since for these FPGAs the function generators can be configured as 16x1-bit memories, very efficient implementations are achieved when $k/m$ approximates a multiple of sixteen. For our implementation, this figure is $167/11 \approx 15.2$, which represents a high utilization of memory elements.

4.2 FPGA Implementations

The prototype implementations were done using Xilinx XC4000X FPGAs of speed grade -09. These devices incorporate distributed synchronous single and dual port memories that can be used in place of 4-input function generators.
For these parts the logic complexity of a synchronous single port 16x1-bit memory element is the same as that of a 4-input function generator. This FPGA family also defines a number of parts with different logic densities, some of which were used to verify the scalability of both architectures with respect to FPGA resources.

To achieve fast and area-efficient implementations, we used logic generated by Xilinx's LogiBLOX version M1.5.19, which we incorporated into our VHDL code. For design compilation and synthesis, we used Synopsys' FPGA Analyzer and Xilinx's Design Manager version M.

Our proof-of-concept approach focused on prototyping practical multipliers with common interfaces. To realize these prototypes, logic was added to the basic multipliers described in Sections 2.2 and 3. The additions consisted of interface circuitry and, for the super-serial multiplier, the use of pipelined processors. The processors were pipelined by adding registers at the outputs of the U operand shift register and at the outputs of the $v$ and the $w$ memory elements. Minor modifications to the controller were also necessary. Pipelining was used because it was possible at practically no logic cost. As Table 1 shows, the super-serial multiplier uses far more function generators than flip-flops, which could result in a large number of wasted flip-flops due to the unavailability of routing resources.

For the I/O interface, we used an 11-bit data bus. This bus width was chosen because it matched the number of processors in the implementation of the super-serial multiplier ($m = 11$). This implementation achieved a high utilization of the 16x1-bit memory elements for the $v$'s and $w$'s. The overall result is a very compact super-serial multiplier for the field $GF(2^{167})$. This parameter selection did not affect the overall area of the serial multiplier, as the complexity of its core circuitry is much larger than the complexity of its interface circuitry.
The implementation results are summarized in Tables 2 and 3. Table 2 summarizes the logic complexity. To facilitate comparisons, this table also includes normalized results that use the complexity of the super-serial multiplier as a reference. Table 3 summarizes the operational frequency and the percentage of FPGA resources used by each implementation for different devices.

Table 2: Area (or logic) results for $GF(2^{167})$. Columns: Mult.; #FG (abs, norm); #FF (abs, norm); #CLB^8 (abs, norm). Rows: SSM ($m = 11$); SM.

Table 3: Timings and resource usage for $GF(2^{167})$. Columns: Mult.; FPGA; % FG; % FF; % CLB; Freq. (MHz). Rows: SSM ($m = 11$) on three XC parts; SM^9 on two XC parts.

5 Elliptic Curve Performance

As described above, it is very interesting to implement elliptic curve cryptosystems on reconfigurable logic. For that reason, we extrapolated the performance of these multipliers for an elliptic curve cryptosystem. Our timing estimates, summarized in Table 4, are based on the computation of elliptic curve point multiplication using projective coordinates as documented in [10]. We considered projective coordinates because they eliminate field inversions from the computation of elliptic curve point multiplications at the cost of additional field multiplications [13].

The projected results assume the use of the double-and-add algorithm, which on average requires $k - 1$ point doubles and $(k - 1)/2$ point additions per point multiplication. In these projective coordinates, the computation of a point double requires 12 field multiplications and the computation of a point addition requires 14 field multiplications. A point multiplication therefore takes on average $12(k - 1) + 14(k - 1)/2 = 19(k - 1)$ field multiplications, leading to the number of clock cycles shown in Table 4. Note that the results in Table 5 ignore the transformation from projective to affine coordinates. This transformation requires an inversion and two field multiplications. These results also ignore additions, which are realized by a single bitwise XOR operation and are thus very inexpensive compared to multiplications.

The expressions in Table 4 were used along with the performance numbers recorded in Table 3 to estimate the time to perform a $GF(2^k)$ multiplication and an elliptic curve point multiplication. These results are summarized in Table 5.

Table 4: Number of clock cycles for elliptic curve point multiplication

    SM     19 k (k - 1)
    SSM    19 k (k - 1) ceil(k/m)

Table 5: Estimated timing for $GF(2^{167})$ field and elliptic curve point multiplication. Columns: Mult.; FPGA; GF Mult. (μsec); Elliptic Curve Point Mult. (msec). Rows: SSM ($m = 11$) on three XC parts; SM^9 on two XC parts.
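The cycle-count expressions of Table 4 follow directly from the operation counts given in this section, and can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the clock frequency passed to `seconds` is a free parameter that would have to be taken from Table 3, not a value assumed here.

```python
from math import ceil

def field_mults_per_point_mult(k):
    """Average field multiplications per point multiplication (double-and-add)."""
    doubles = k - 1                # point doubles, 12 field multiplications each
    twice_additions = k - 1        # twice the average (k - 1) / 2 point additions
    return 12 * doubles + (14 * twice_additions) // 2   # equals 19 * (k - 1)

def sm_cycles(k):
    """Serial multiplier: k clock cycles per field multiplication."""
    return field_mults_per_point_mult(k) * k

def ssm_cycles(k, m):
    """Super-serial multiplier: k * ceil(k/m) clock cycles per field multiplication."""
    return sm_cycles(k) * ceil(k / m)

def seconds(cycles, freq_mhz):
    """Convert a cycle count to seconds for a clock frequency given in MHz."""
    return cycles / (freq_mhz * 1e6)

if __name__ == "__main__":
    k, m = 167, 11
    print(field_mults_per_point_mult(k))  # 3154
    print(sm_cycles(k))                   # 526718
    print(ssm_cycles(k, m))               # 8427488
```

For $k = 167$ and $m = 11$ the super-serial design needs $\lceil 167/11 \rceil = 16$ times as many cycles as the serial one, which is the time side of the time-area trade-off.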
6 Conclusions

In this work we introduced a super-serial multiplier architecture which is, to our knowledge, a new $GF(2^k)$ multiplier. This architecture is particularly attractive for FPGA implementations because of its regularity; its mainly distributed network with few global interconnects; its simple processors, which are efficiently implementable with function generators, flip-flops, and memory; and, to a large extent, because of its fine-grained time and area scalability. This latter point is of great practical importance as it allows many implementations to reach a balance between performance and area over a wide range, which could be translated into cost reductions and product growth.

We verified the theoretical concepts with proof-of-concept implementations that demonstrated that substantial area savings are achievable for multipliers suitable for secure elliptic curve cryptosystems. More precisely, our implementation of the serial multiplier uses 2.76 times more function generators, 6.84 times more flip-flops, and 5.78 times more CLBs than our implementation of the super-serial multiplier. Our results also demonstrate the scalability of both serial and super-serial multipliers with respect to FPGA resources, as their overall performance remained virtually constant across parts with different logic densities.

8 Configurable logic blocks consisting of one 3-input and two 4-input function generators, and two flip-flops.
9 Design did not fit in the XC4005 part.

References

[1] A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone, Handbook of Applied Cryptography. CRC Press.
[2] N. Koblitz, "Elliptic curve cryptosystems," Mathematics of Computation, vol. 48, pp. 203-209.
[3] V. Miller, "Uses of elliptic curves in cryptography," in Advances in Cryptology CRYPTO '85, pp. 417-426, Springer-Verlag.
[4] N. Koblitz, "Hyperelliptic cryptosystems," Journal of Cryptology, vol. 1, no. 3, pp. 129-150.
[5] E. Mastrovito, VLSI Architectures for Computation in Galois Fields. PhD thesis, Linköping University, Dept. Electr. Eng., Linköping, Sweden.
[6] C. Paar and P. S. Rodriguez, "Fast Arithmetic Architectures for Public-Key Algorithms over Galois Fields $GF((2^n)^m)$," in Advances in Cryptology EUROCRYPT '97, pp. 363-378, Springer-Verlag, LNCS.
[7] L. Song and K. Parhi, "Efficient finite fields serial/parallel multiplication," in Proc. Int. Conf. Application Specific System Architectures and Processors, pp. 72-82, Chicago, IL, August.
[8] A. Klindworth, "FPLD-Implementation of computations over finite fields $GF(2^m)$ with applications to error control coding," in 5th Intern. Workshop on Field-Programmable Logic and Applications, (Oxford, UK), pp. 261-71, LNCS 975, Springer-Verlag, September.
[9] C. Paar and M. Rosner, "Comparison of arithmetic architectures for Reed-Solomon decoders in reconfigurable hardware," in Fifth Annual IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM '97, (Napa Valley, USA), April.
[10] M. Rosner, "Elliptic curve cryptosystems on reconfigurable hardware," Master's thesis, ECE Dept., Worcester Polytechnic Institute, Worcester, USA, May.
[11] T. Beth and D. Gollmann, "Algorithm engineering for public key algorithms," IEEE Journal on Selected Areas in Communications, vol. 7, no. 4, pp. 458-466, 1989.
[12] S. Lin and D. Costello, Error Control Coding: Fundamentals and Applications. Englewood Cliffs, NJ: Prentice-Hall.
[13] A. Menezes, Elliptic Curve Public Key Cryptosystems. Kluwer Academic Publishers.
