ANALYSIS OF AN AREA EFFICIENT VLSI ARCHITECTURE FOR FLOATING POINT MULTIPLIER AND GALOIS FIELD MULTIPLIER*


IJVD: 3(1), 2012, pp.

Anbuselvi M. and Salivahanan S.
Department of Electronics and Communication Engineering, SSN College of Engineering, Rajiv Gandhi Salai, Kalavakkam, India

This article deals with the VLSI architecture of floating point and Galois field multipliers using a technique called wave-pipelining. Wave-pipelining is a circuit design technique that allows digital synchronous systems to be clocked at rates higher than those achievable with conventional pipelining. It can improve the throughput of a logic circuit while avoiding some of the overheads of traditional pipelining. Multiplication plays a very important role in signal processing applications. On a VLSI platform, area consumption is judged by the number of gates required to realize the logic. Traditional multiplier structures are computation intensive and therefore use a large number of flip-flops and slices when the architecture is realized. To reduce the area consumption, wave-pipelining has been incorporated, which also paves the way for a low power architecture. The concept has also been verified on another kind of multiplier, the Galois field multiplier, which is used in coding theory and cryptography. The designed architectures are analyzed in Xilinx and Synopsys, targeted to 90nm technology.

Keywords: Digital design, Floating point, Galois field, Maximum rate pipelining, Multiplier, VLSI architecture, Wave-pipelining.

1. INTRODUCTION

With the advent of signal processing techniques and technologies, the computational complexity of the individual blocks of any application has been optimized. In general, the most common operation in any signal processing application is multiplication.
In fact, multiplication is computation intensive and consumes more power and area than the other arithmetic operations. In an effort to improve the throughput of digital systems, multiplier architectures have been optimized using pipelining. When a logic network is pipelined, synchronizing elements, either latches or registers, are inserted to partition the network into stages. Pipelining a circuit into N stages can speed up throughput by up to a factor of N. However, the inserted synchronizing elements increase the area and power consumption of the logic, and they add overheads in latency and cycle time. Conventional pipelined systems allow data to propagate from a register through the combinational network to another register before the next data transfer is initiated. Thus, the maximum operating frequency is determined by the propagation delay through the longest pipeline stage.

Wave-pipelining, or maximum rate pipelining, is a circuit design technique that allows digital systems to be clocked at rates higher than can be achieved with conventional pipelining. Wave-pipelining relies on the predictable, finite signal propagation delay through combinational logic for virtual data storage. Wave-pipelining of combinational circuits has been shown to achieve clock rates 2 to 7 times those possible for the same circuits with conventional pipelining. Unlike ordinary pipelining, wave-pipelining does not require internal clock elements to increase throughput. Rather, knowledge of the signal propagation delay characteristics of the logic network is used at design time to manage the signal delays, so that an operation interferes with neither its predecessor nor its successor [1, 2].
The synchronization of internal computations is achieved by balancing the inherent RC delays of the combinational logic elements, which allows circuits to be pipelined at a very fine grain. The rate at which logic can propagate through the circuit depends not on the longest path delay but on the difference between the longest and shortest path delays [3].

2. WAVE-PIPELINING

While improving the throughput of a logic circuit, traditional pipelining of VLSI systems results in overheads in latency, cycle time, area, and power consumption. Cycle time overhead results from the time required for signals to propagate out of the synchronizing elements, from the time required for signals to set up at the synchronizing elements before being stored, and from unintentional clock skew in the arrival of the synchronizer clock signal. In wave-pipelining, by contrast, cycle time is determined by the variation in the propagation delay of the signals through the logic and by the input and output register delays. Latency through a traditional pipeline is defined as the total elapsed time from the introduction of data at the input of the first pipeline stage to the arrival of the results computed on that data at the output of the final stage. Area and power overheads result from the additional transistors and wires used to implement the synchronizing latches or registers, and from the increased clock buffer area and power needed to drive the clock inputs of the synchronizers. These area and power overheads are avoided in wave-pipelining, since there are no separate synchronizers [4].

Figure 1: Model of a Wave-pipelined Circuit

Figure 1 shows the model of a wave-pipelined circuit. There are no internal registers inside the logic block; flip-flops are inserted only at the input and output of the logic block. For the designed logic, the maximum and minimum delays are calculated. Buffer insertion can be used to equalize the delays inside the logic block.
$$T_{CK} \ge (D_{MAX} - D_{MIN}) + T_S + T_H + 2\Delta_{CK} \qquad (1)$$

Here $D_{MAX}$ and $D_{MIN}$ are the maximum and minimum delays through the logic block, $T_S$ and $T_H$ are the register setup and hold times, and $\Delta_{CK}$ is the clock skew. According to equation (1), the achievable clock period depends directly on the difference between the maximum and minimum delay. By reducing this difference through buffer insertion, the clock speed can be increased, thereby realizing the wave-pipelined circuit.

3. FLOATING POINT MULTIPLIER

IEEE 754 defines the standard single-precision floating-point representation. The floating-point representation is one way to represent real numbers. A floating-point number n is represented with an exponent e and a mantissa m, so that n = b^e × m, where b is the base (also called the radix). The three basic components are the sign, exponent, and mantissa, as shown in Figure 2. The IEEE 754 standard represents the sign with a single bit, the exponent with 8 bits and the mantissa with 23 bits.

Figure 2: The Storage Layout for Single-precision Floating-point Binary

The floating-point format can represent a wide range of scales without losing precision, while the fixed-point format has a fixed window of representation. Hence, for example, a 32-bit floating-point representation can cover a very wide range of magnitudes with ease. This is one of the reasons why floating-point representation is the most common solution. Floating-point representations also include special values such as infinity and Not-a-Number (NaN), e.g., the result of taking the square root of a negative number.

The architecture of the floating point multiplier is shown in Figure 3. The sign bits of the multiplicand and multiplier are XORed. The exponents of the multiplicand and the multiplier are added and normalized to obtain the exponent of the result. The mantissas of the multiplicand and multiplier are multiplied and normalized to obtain the product term. Normalization is done to compensate for the loss in precision. At the product term, overflow is taken care of by the rounding logic.
The parallel architecture for speeding up the computation has been addressed in the literature [5].
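The data flow described above (XOR of the signs, addition of the exponents, multiplication and normalization of the mantissas, then rounding) can be sketched as a small software model. This is a minimal sketch, not the authors' VHDL design: the function names are hypothetical, round-to-nearest-even is an assumption about the rounding logic, and zero, subnormal, infinity, NaN and exponent overflow or underflow are deliberately not handled.

```python
import struct

BIAS = 127  # exponent bias of IEEE 754 single precision


def float_to_bits(x: float) -> int:
    """Return the 32-bit single-precision pattern of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bits_to_float(bits: int) -> float:
    """Return the value encoded by a 32-bit single-precision pattern."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def fp32_multiply(a_bits: int, b_bits: int) -> int:
    """Multiply two normalized single-precision patterns (illustrative model)."""
    # Sign: XOR of the two sign bits.
    sign = ((a_bits >> 31) ^ (b_bits >> 31)) & 1

    # Exponent: add the biased exponents and remove one bias.
    exp = ((a_bits >> 23) & 0xFF) + ((b_bits >> 23) & 0xFF) - BIAS

    # Mantissa: restore the hidden leading 1, then multiply (24 x 24 -> 48 bits).
    man_a = (a_bits & 0x7FFFFF) | 0x800000
    man_b = (b_bits & 0x7FFFFF) | 0x800000
    product = man_a * man_b

    # Normalize: the product lies in [1, 4); if it is >= 2, shift one extra bit.
    if product >> 47:
        shift = 24
        exp += 1
    else:
        shift = 23
    mantissa = product >> shift
    remainder = product & ((1 << shift) - 1)
    half = 1 << (shift - 1)

    # Round to nearest, ties to even (assumed rounding mode).
    if remainder > half or (remainder == half and (mantissa & 1)):
        mantissa += 1
        if mantissa >> 24:  # rounding carried out of the 24-bit mantissa
            mantissa >>= 1
            exp += 1

    return (sign << 31) | ((exp & 0xFF) << 23) | (mantissa & 0x7FFFFF)


if __name__ == "__main__":
    result = fp32_multiply(float_to_bits(1.5), float_to_bits(2.5))
    print(bits_to_float(result))
```

In hardware the sign, exponent and mantissa paths operate in parallel; the sequential model above only mirrors the order in which the text describes them.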

Figure 3: Floating Point Multiplier Architecture

4. GALOIS FIELD MULTIPLIER

The need for portable circuits able to communicate at high bandwidths pushes the development of high speed and low-power circuits. In this context, efficient Galois field GF(2^m) arithmetic blocks are desired in many fields, such as error-control coding and cryptosystems. In error-control coding, GF(2^m) arithmetic, mainly field addition and multiplication, is the basis of Reed-Solomon encoding and decoding [6, 7]. In cryptographic applications, GF(2^m) arithmetic is largely used in elliptic-curve cryptosystems. In these applications, the building blocks that most influence system complexity and timing performance are the ones implementing the algebraic operations. Addition in GF(2^m) is equivalent to a simple bitwise XOR operation. Multiplication, on the other hand, requires larger and slower hardware. The multiplier design presented here has a small area, which makes it suitable for elliptic curve crypto processor design. An elliptic curve cryptosystem can therefore be used in applications that require small area and low power consumption, such as smart cards and cellular telephones. Different architectures of the Galois field multiplier are addressed in the literature [8], but a trade-off between area and speed always exists across the various architectures. This paper presents efficient hardware implementations of the Galois field multiplier.

Figure 4 shows a basic 4-bit multiplier structure. The operands are as shown, with the multiplier residing in a 4-bit shift register, the multiplicand in a 4-bit register, the result in the middle (R(3) to R(0)), and an irreducible polynomial at the bottom.
It is possible to load the multiplier and multiplicand serially, and to have the irreducible polynomial arrive as part of the power-on initialization process. As the operation proceeds, a common clock shifts the multiplier and the result registers. The irreducible polynomial and the multiplicand remain static. All numbers in a Galois field GF(2^m) are written with 1s and 0s, and there are 2^m distinct symbols. For m = 4, there are 16 distinct symbols. For multiplication we use what is called the polynomial form, so the arithmetic is similar to standard multiplication, except that if a result overflows the four bit limit, it must be adjusted by subtracting (XORing) the modulus, i.e., the irreducible polynomial. The irreducible polynomial we used is x^4 + x + 1, which is represented by 10011 in binary [9]. The value of the multiplier (in bold) is incrementally placed in front of the parenthetic multiplicand, so successive bits of the multiplier can be read down that position from row to row. They arrive most significant bit first. Multiplying a number by one preserves the number, and multiplying by zero produces zero. Due to the large number of partial results that have 0000 in them, we don't see the effect of intermediate shifting.
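The shift-and-add procedure described above (multiplier bits consumed most significant bit first, a result register that is shifted on every step, and reduction by the irreducible polynomial on overflow) can be modelled in a few lines. This is a minimal software sketch for GF(2^4) with x^4 + x + 1, i.e. binary 10011; the function name and the loop structure are illustrative assumptions, not the authors' register-level design.

```python
M = 4                    # field degree: GF(2^4) has 16 symbols
IRREDUCIBLE = 0b10011    # x^4 + x + 1


def gf16_multiply(multiplicand: int, multiplier: int) -> int:
    """Multiply two GF(2^4) elements given as 4-bit integers."""
    result = 0
    # Process the multiplier bits most significant bit first.
    for i in range(M - 1, -1, -1):
        # Shift the running result (multiply it by x).
        result <<= 1
        # If the shift overflowed the 4-bit limit, subtract (XOR) the modulus.
        if result & (1 << M):
            result ^= IRREDUCIBLE
        # Add (XOR) the multiplicand when the current multiplier bit is 1.
        if (multiplier >> i) & 1:
            result ^= multiplicand
    return result


if __name__ == "__main__":
    # x * x^3 = x^4, which reduces to x + 1 modulo x^4 + x + 1.
    print(format(gf16_multiply(0b0010, 0b1000), "04b"))
```

Because addition in GF(2^m) is a bitwise XOR, the hardware equivalent of each loop step is one shift plus two conditional XORs, which is why the result register and the multiplier register can share a common clock.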

Figure 4: A Basic 4-bit Galois Field Multiplier

5. SYNTHESIS

The architectures of the floating point multiplier and the Galois field multiplier are realized in VHDL. Logic verification has been performed using Modelsim. The designed structure is synthesized using the Xilinx ISE 9.1 tool. The synthesis report for the Spartan-3E FPGA is analyzed for the device utilization of the designed architecture. The floating point multiplier architecture shown above is designed with different stages of pipelining.

Table 1: Device Utilization Summary for Floating Point Multiplier

Device utilized (logic utilization, logic distribution)
No. of slice flip flops
No. of 4 input LUTs             2,481    2,491    2,494    2,409
No. of occupied slices          1,548    1,614    1,645    1,463
No. of slices                   1,548    1,614    1,645    1,463
Total number of 4 input LUTs    2,658    2,669    2,660    2,594
Gate count                      24,115   25,366   26,080   22,505

The wave-pipelined architecture of the multiplier is designed by computing the maximum and minimum delays along the different paths inside the logic. The non-critical path with the minimum delay is considered for delay equalization. Buffers are inserted on the appropriate paths, thereby reducing the difference between the maximum and minimum delay of the logic block.

The synthesis report for the floating point multiplier with different pipelining stages is shown in Table 1. The detailed synthesis report covers the device utilization, the timing involved and the total memory usage. From the device utilization report, the logic utilization is analyzed in terms of the number of flip-flops, the number of lookup tables and the total gate count. The floating point multiplier is analyzed with different stages of pipelining and compared with the wave-pipelined structure. The results in Table 1 show that the area consumption of the multiplier is reduced with wave-pipelining, in terms of the number of flip-flops and LUTs.
The above analysis has been strengthened with the Synopsys tool, targeted to the 90nm technology.
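The delay-equalization step used in this section can be illustrated numerically. The sketch below assumes that equation (1) has the usual wave-pipelining form T_CK >= (D_MAX - D_MIN) + T_S + T_H + 2*Δ_CK and that every inserted buffer adds the same fixed delay. The function names, the path delays and the timing values (all in picoseconds) are hypothetical and are not taken from the synthesis reports.

```python
def min_clock_period(path_delays, t_setup, t_hold, clock_skew):
    """Lower bound on the clock period from the assumed form of equation (1)."""
    d_max = max(path_delays)
    d_min = min(path_delays)
    return (d_max - d_min) + t_setup + t_hold + 2 * clock_skew


def equalize_with_buffers(path_delays, buffer_delay):
    """Pad each shorter path with whole buffers without exceeding the longest path."""
    d_max = max(path_delays)
    balanced = []
    for delay in path_delays:
        n_buffers = (d_max - delay) // buffer_delay  # buffers that still fit
        balanced.append(delay + n_buffers * buffer_delay)
    return balanced


if __name__ == "__main__":
    # Hypothetical path delays through one logic block, in picoseconds.
    delays = [900, 1400, 2000]
    t_setup, t_hold, skew = 50, 30, 20

    before = min_clock_period(delays, t_setup, t_hold, skew)
    balanced = equalize_with_buffers(delays, buffer_delay=150)
    after = min_clock_period(balanced, t_setup, t_hold, skew)

    print("balanced delays:", balanced)
    print("minimum clock period before:", before, "ps; after:", after, "ps")
```

The longest path is left untouched, so latency does not change; only the spread between the fastest and slowest paths shrinks, which is the term that limits the clock period in a wave-pipelined circuit. The added buffers cost some combinational area, which has to be weighed against the registers that conventional pipelining would require.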

Table 2: Area Analysis for Floating Point Multiplier in Synopsys Tool

Area
Combinational area (µm^2)
Noncombinational area (µm^2)
Net interconnect (µm^2)
Total cell area (µm^2)
Total area (µm^2)

We infer from Table 2 that the total area occupied by the logic increases as the number of pipelining stages increases. With wave-pipelining, however, the latency and the total area are reduced and the throughput improves compared with the initial single stage architecture.

The synthesis report for the Galois field multiplier with different pipelining stages is shown in Table 3. The results show that, when the architecture is targeted to a Xilinx FPGA, the area consumption of the multiplier is reduced with wave-pipelining, in terms of the number of flip-flops and LUTs. This analysis has been strengthened with the Synopsys tool, targeted to the 90nm technology, in which the Galois field multiplier has also been analyzed with different stages of pipelining. The inference from Table 4 is that both area and power are reduced for the wave-pipelined architecture when compared with the other architectures of the GF multiplier. Thus the designed architecture is area efficient and power efficient.

Table 3: Device Utilization Summary for Galois Field Multiplier

Device utilized (logic utilization, logic distribution)
No. of slice flip flops
No. of occupied slices
No. of slices
Total number of 4 input LUTs
Gate count                      1,712    1,902    2,074    1,528

Table 2: Area Analysis for Floating Point Multiplier in Synopsys Tool

Area
Combinational area (µm^2)
Noncombinational area (µm^2)
Net interconnect (µm^2)
Total cell area (µm^2)
Total area (µm^2)
Power                           µW       µW       1.23 mW  µW

6. CONCLUSION

This paper analyzes the performance of floating point and Galois field multipliers under wave-pipelining. Both architectures have been studied, and different stages of pipelining have been implemented. The different architectures of both the floating point and the GF multiplier are also synthesized using the Synopsys tool, targeted to 90nm. It is found that the GF multiplier with the wave-pipelined structure is both area and power efficient. Hence wave-pipelining is found to be superior in terms of area and power when compared with the other pipelining stages. The same architectures can be designed with other wave-pipelining methods, such as logic restructuring and node collapsing.

REFERENCES

[1] Donald A. Joy and Maciej J. Ciesielski, Clock Period Minimization with Wave Pipelining, IEEE Transactions on Computer Aided Design of Integrated Circuits and Systems, 12(14), April.
[2] Fabian Klass, Maciej Ciesielski, Wayne P. Burleson and Wentai Liu, Wave-Pipelining: A Tutorial and Research Survey, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 6(3), September.
[3] G. Lakshminarayanan and B. Venkataramani, Optimization Techniques for FPGA-Based Wave-pipelined DSP Blocks, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 13(7), July.
[4] Ramalingam Sridhar and Xuguang Zhang, Synchronization of Wave Pipelined Circuits, IEEE.
[5] Sanjiv Kumar Mangal, Raghavendra B. Deshmukh, M. Badghare and R. M. Patrikar, FPGA Implementation of Low Power Parallel Multiplier, 20th International Conference on VLSI Design (VLSID 07).
[6] Nick Iliev, James Stine, and Nathan Jachimiec, Digital Finite-Field Multiplier for Reed-Solomon Channel Codes in GF(2^8) with Programmable Basis Polynomial, IIT VLSI Lab.
[7] R. Lidl and H. Niederreiter, Introduction to Finite Fields and Their Applications, Cambridge Univ. Press.
[8] Jose Luis Imana, Bit-Parallel Arithmetic Implementations Over Finite Fields GF(2^m) with Reconfigurable Hardware, pp., Kluwer Academic.
[9] C. Yeh, I. S. Reed, and T. K. Truong, Systolic Multipliers for Finite Fields GF(2^m), IEEE Trans. on Computers, C-33, pp. 357, 1984.
