VLSI Design of Karatsuba Integer Multipliers and Its Evaluation

Electronics and Communications in Japan, Vol. 92, No. 4, 2009
Translated from Denki Gakkai Ronbunshi, Vol. 128-C, No. 2, February 2008, pp. [...]

SYUNJI YAZAKI (1) and KOKI ABE (2)
(1) Tokyo University of Technology, Japan
(2) University of Electro-Communications, Japan

SUMMARY

Multidigit multiplication has come to be widely used in recent years in applications such as numerical calculation, chaos arithmetic, and primality testing. Systems with high performance and low energy consumption are in demand, especially for image processing and for communications with cryptography using chaos. The Karatsuba algorithm, with a computational complexity of $O(n^{1.58})$, where $n$ is the bit-length of the operands, has been employed in software for multiplication of hundreds to thousands of bits. In this paper, a hardware design of multidigit integer multiplication based on the Karatsuba algorithm is described, and its VLSI realization is evaluated in terms of cost, performance, and energy consumption. We present two design choices for Karatsuba hardware: RKM (Recursive Karatsuba Multiplier) and IKM (Iterative Karatsuba Multiplier). We found that RKM has a lower area cost than WTM (Wallace Tree Multiplier) for bit-lengths larger than $2^9$, at which point the area cost is about 30 mm². The critical path delay of RKM is always larger than that of WTM. Therefore, WTM should be used as the combinational multiplier inside IKM to obtain a better cost/performance ratio. We also found that a version of IKM using a 0.18 µm process can perform 1024-bit multiplications 30 times faster than software at an area cost of 10.9 mm². The energy consumed by this IKM version for the computation was found to be nearly 1/600 of that consumed by a general-purpose processor executing the software. The results obtained in this study will help designers of systems that require multidigit multiplication to choose among design alternatives, including ASIC realization.

Wiley Periodicals, Inc.
Electron Comm Jpn, 92(4): 9–20, 2009; Published online in Wiley InterScience (wiley.com). DOI /ecj

Contract grant sponsors: Synopsys, Inc. via VDEC, University of Tokyo, and a JSPS Grant-in-Aid [Fundamental Research (C) (2)].

Key words: multidigit multiplication; Karatsuba algorithm; VLSI.

1. Introduction

Recently, numerous attempts have been made to use ASICs (Application-Specific Integrated Circuits) for entire applications or parts of them in order to achieve faster calculation. In addition, compact, low-power ASICs are often used instead of general-purpose processors to implement inexpensive embedded devices. Further, multidigit calculations far beyond the word length of general-purpose processors are required in many fields such as numerical analysis [1], chaos computing [2], primality testing [3], and cryptography [4]. In particular, rounding errors are known to strongly affect the whole calculation in chaos-based image processing [5] and communication systems [6]. In such systems, the influence of errors can be reduced by using multidigit calculations, and high speed and low power consumption are usually required as well.

In this context, we focus on multidigit multiplication, a frequently used and time-consuming kind of multidigit computation. Specifically, we propose a VLSI implementation offering high performance, small area, and low power consumption. When multidigit multiplication is built from multipliers with a fixed narrow width, whether in software or hardware, the multidigit numbers must be decomposed into pieces that the narrow multipliers process repeatedly. With this approach, an efficient implementation requires some scheme to minimize the number of iterations. In particular, when designing hardware multidigit multipliers, such a scheme is necessary to minimize memory accesses, computing time, and power consumption.
The reduction of repetitive multiplication operations has long been studied, with the Karatsuba algorithm [7] and FFT (Fast Fourier Transform) multiplication [8] being typical algorithms for this purpose. With these techniques, $n$-bit multiplication, which otherwise has a complexity of $O(n^2)$, can be performed with a complexity of $O(n^{1.58})$ and $O(n \log n \log\log n)$, respectively. FFT multiplication is asymptotically faster than the Karatsuba algorithm; however, it suffers from large overhead, and the Karatsuba algorithm is usually used for multiplication in the range of hundreds to thousands of bits.

A hardware implementation of FFT multiplication has already been reported [9]. There are also many examples of hardware Karatsuba implementations over a Galois field (GF), for use in elliptic curve cryptography and other fields [4, 10]. On the other hand, there are few examples of the implementation and evaluation of hardware Karatsuba multipliers for integer operations (e.g., a 32-bit Karatsuba integer multiplier [11]), and no studies dealing with the range from hundreds to thousands of bits are available. In this study, we propose a hardware implementation of the Karatsuba integer multiplier in this range, and evaluate its performance.

This paper is organized as follows. We give a brief explanation of the Karatsuba algorithm in Section 2, then present two design choices for hardware design (RKM implemented as a combinational circuit and IKM implemented as a sequential circuit) in Section 3. The two implementations are then discussed in Sections 4 and 5, respectively. A summary is given in Section 6.

2. Karatsuba Algorithm

The Karatsuba algorithm uses the following method of multiplication. Suppose that $A$ and $B$ are $2n$-bit integers. Each is split into two halves: $A = a_1 2^n + a_0$, $B = b_1 2^n + b_0$. Then the product $P$ of $A$ and $B$ can be expressed as follows:

\[ P = a_1 b_1 2^{2n} + (a_1 b_0 + a_0 b_1) 2^n + a_0 b_0 \tag{1} \]

Considering that $a_1 b_0 + a_0 b_1 = (a_1 + a_0)(b_1 + b_0) - (a_1 b_1 + a_0 b_0)$, Eq. (1) can be rewritten as follows:

\[ P = a_1 b_1 2^{2n} + \{(a_1 + a_0)(b_1 + b_0) - (a_1 b_1 + a_0 b_0)\} 2^n + a_0 b_0 \tag{2} \]

Comparing Eqs.
(1) and (2), one can see that the number of $n$-bit multiplications required to obtain the product of two $2n$-bit integers is reduced from 4 to 3. Moreover, by applying the algorithm recursively to the multiplications in Eq. (2), the computational complexity is finally reduced to $O(n^{\log_2 3}) \approx O(n^{1.58})$. In software implementations, this recursion is repeated until the operands fit in one machine word.

3. Design Choices

There are two design choices for a hardware implementation of the Karatsuba algorithm: one based on a combinational circuit that instantiates all required multipliers, as shown in Fig. 1(a), and the other based on a sequential circuit that repeatedly uses one set of arithmetic units (a basic multiplier and adders), as shown in Fig. 1(b). The combinational configuration is called an RKM (Recursive Karatsuba Multiplier), and the sequential configuration is called an IKM (Iterative Karatsuba Multiplier).

Fig. 1. Configurations of Karatsuba multiplier. (a) Recursive Karatsuba multiplier (RKM). (b) Iterative Karatsuba multiplier (IKM).

By applying the Karatsuba algorithm, RKM operates with fewer multipliers. Compared to the well-known Wallace Tree Multiplier (WTM) [12], the order of the chip area is reduced from $O(n^2)$ to $O(n^{1.58})$, while the order of the delay time is $O(\log n)$ for both RKM and WTM. Thus, a theoretical comparison in terms of computational complexity suggests that RKM is better than WTM, offering a smaller chip area at the same order of delay. However, when comparing the performance and area cost of actual implementations, one cannot neglect the influence of constant factors and low-order terms. Therefore, the evaluation must involve not only computational complexity but also experimental results. There is one example of the implementation and evaluation of a 32-bit integer Karatsuba multiplier [11]. However, no research has been done on Karatsuba multipliers for hundreds to thousands of bits, and further evaluation of Karatsuba multipliers based on combinational configurations is required.

On the other hand, combinational implementations of the Karatsuba multiplier are limited by the chip area, and a sequential IKM may prove satisfactory for certain applications. When multidigit multiplication is configured sequentially, a fixed number of short bit-length multipliers are used repeatedly; in this case, the chip area is a constant that depends on the implementation. With a fixed number of arithmetic units, the computing time is $O(n^2)$ when the Karatsuba algorithm is not applied. When the Karatsuba algorithm is used, the number of repeated multiplications is reduced, and the order of the computing time is lowered to $O(n^{1.58})$. This results in a considerable acceleration of the actual calculations, and long bit-length operations that are impractical for $O(n^2)$ multipliers may become feasible.

In IKM, the number of multiplications is reduced while the number of additions increases; therefore, the IKM configuration loses its advantage when the time or area cost of addition cannot be ignored compared to multiplication. However, the cost of addition is normally very low compared to multiplication. The cost and performance of IKM vary with the bit-length of the basic multipliers and adders, so a trade-off between cost and performance can be found by varying the bit-length. In addition, one must examine which kind of basic multiplier, RKM or WTM, should be employed. There are no examples of such an examination because the Karatsuba algorithm has not previously been implemented using sequential circuits such as IKM.
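The reduction from four to three sub-multiplications in Eq. (2), and its effect on the number of basic multiplications that either an RKM or an IKM must perform, can be illustrated with a short software sketch. This is only a behavioral model, not the hardware design: the function name `karatsuba`, the `counter` argument, and the use of Python's native multiplication as a stand-in for the basic multiplier are assumptions made for illustration.

```python
def karatsuba(a, b, bits, base_bits, counter):
    """Multiply non-negative integers a and b (nominally `bits` wide) via Eq. (2).

    Recursion stops once `bits` <= `base_bits`; there a native multiplication
    stands in for the basic multiplier, and counter[0] is incremented.
    """
    if bits <= base_bits:
        counter[0] += 1
        return a * b
    n = bits // 2
    mask = (1 << n) - 1
    a1, a0 = a >> n, a & mask          # A = a1 * 2^n + a0
    b1, b0 = b >> n, b & mask          # B = b1 * 2^n + b0
    hi = karatsuba(a1, b1, n, base_bits, counter)            # a1 * b1
    lo = karatsuba(a0, b0, n, base_bits, counter)            # a0 * b0
    # The sums may be one bit wider than n; the high part simply absorbs it.
    mid = karatsuba(a1 + a0, b1 + b0, n, base_bits, counter)
    return (hi << (2 * n)) + ((mid - hi - lo) << n) + lo


if __name__ == "__main__":
    a = (1 << 1024) - 1
    b = (1 << 1023) + 12345
    count = [0]
    assert karatsuba(a, b, 1024, 128, count) == a * b
    # Three levels of recursion give 3**3 basic multiplications,
    # against 4**3 = 64 for the schoolbook decomposition.
    print(count[0])
```

With 1024-bit operands and a 128-bit base, the recursion depth is three, so the counter reaches $3^3 = 27$; each additional halving of the base width multiplies the count by 3 rather than 4, which is where the $O(n^{\log_2 3})$ bound comes from.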
Based on the above considerations, we performed VLSI design for both RKM and IKM, and compared their cost and performance for various numbers of algorithm applications and bit-lengths.

4. RKM

In this section, we explain the RKM design. First, we implement a 32-bit RKM with one-time application of the Karatsuba algorithm, and compare our results to those given in a previous report [11]. Then we consider a more general case of RKM design.

4.1 32-Bit RKM

4.1.1 Design

The configuration of the RKM designed according to Eq. (2) is shown in Fig. 2. Here WTM multipliers are employed inside the RKM. Since the algorithm is applied only once, the WTM bit-length is set to half the input width, that is, to 16 bits. In addition, CSAs (Carry Save Adders) are used to accumulate the WTM outputs, while a CPA (Carry Propagation Adder) is employed in the final stage to output the product. This configuration is called Carry-Save RKM. For comparison, we also designed an RKM in which all adders are CPAs, as shown in Fig. 3. This configuration is called Binary RKM. In the implementations of RKM, we used IP (Intellectual Property) blocks by Synopsys, specifically DW02_mult for WTM, and DW01_add (DW01_sub) for CPA.

Fig. 2. Configuration of Carry-Save RKM.

Fig. 3. Configuration of Binary RKM.

4.1.2 Implementation results and discussion

Logic synthesis of the two designed RKMs was performed using Design Compiler Ver. W [...] SP2 by Synopsys. The results thus obtained were then used to estimate the critical path delay and area cost. In the logic synthesis, we employed the cell library for the 0.18 µm process developed by VDEC (VLSI Design and Education Center) [13] according to Hitachi specifications. The synthesis constraints were set so as to minimize the area cost. The synthesis results are presented in Table 1.

Table 1. Synthesis results of 32-bit RKMs using Hitachi 0.18 µm technology with area-preferred constraint

In addition, for comparison with the results reported in Ref. 11, we also performed logic synthesis of the Carry-Save RKM using a cell library for the 0.35 µm process based on the ROHM specifications. The synthesis constraints were set so as to minimize the critical path delay. Comparison results are presented in Table 2.

Table 2. Synthesis results of 32-bit RKMs using ROHM 0.35 µm technology with delay-preferred constraint

As is evident from Table 1, the critical path delay is reduced by about 22% when CSAs are used as the adders in the nonfinal stages. However, the area cost increases by 6%, because the carry-save implementation requires many full adders. For an RKM intended for area reduction, this configuration does not appear ideal; however, the considerable reduction of the delay suggests a good cost/performance ratio. Thus, in this study, we adopted CSAs as the nonfinal adders.

In addition, as indicated by Table 2, the Carry-Save RKM configured in this study shows a critical path delay and an area cost that are, respectively, 1.16 times and 1.28 times those of the RKM reported in Ref. 11. The reason is that in Ref. 11 the CPA at the final stage is optimized manually in steps of 2 or 3 bits, whereas this study employs logic synthesis using IP blocks, which does not assure sufficient optimization.

4.2 General form of RKM

4.2.1 Design

Assume that the multiplication bit-length is $2n = 2^k$ and that the bit-length of the WTM components of the RKM is $2^l$; then the number of recursions is $k - l$. Here $k$ and $l$ ($0 < l < k$) are integers. The 32-bit RKM considered in Section 4.1 can be interpreted as a particular case of such an RKM with $k = 5$, $l = 4$.

When designing a general form of RKM, the following three types of circuit components are involved.

(1) Components that transform the carry-save (CS) product into binary (B) form using a CPA so as to obtain the final product
(2) Components that do not require a CPA
(3) Components with a $2^l$-bit WTM

The above components correspond to the topmost, intermediate, and undermost layers in the RKM module configuration, and are denoted by T, I, and U, respectively. In addition, components I and U include those with CS inputs ($I_{cs}$, $U_{cs}$) and those with B inputs ($I_b$, $U_b$). In particular, the U components with CS inputs ($U_{cs}$) require a CPA in order to transform the CS input values into B form. Thus, five types of modules are needed, as shown in Fig. 4.

In the diagram, the KaratsubaCS and KaratsubaB modules are replaced depending on the number of recursions, as shown in Table 3. For example, in the case of one-time recursion, KaratsubaCS and KaratsubaB in the T module are replaced, respectively, by $U_{cs}$ and $U_b$. As a result, the RKM shown in Fig. 2 is obtained. When the number of recursions is 2 or more, KaratsubaCS and KaratsubaB in the T module are replaced, respectively, by $I_{cs}$ and $I_b$, and KaratsubaCS and KaratsubaB in $I_{cs}$ and $I_b$ are replaced, respectively, by $U_{cs}$ and $U_b$. The shaded V module in Fig. 4 performs carry-save addition according to Eq. (2).

As indicated by Fig. 4, the critical path delay and area cost of this RKM can be estimated by Eqs. (3) to (9) and (10) to (14), respectively:

[Eqs. (3) to (10): equation bodies not recoverable from the source]

[Eqs. (11) to (14): equation bodies not recoverable from the source]

Fig. 4. Components of Karatsuba multiplier for large bit-lengths: topmost component (T), intermediate component (I), and undermost components (U). Suffixes cs and b denote components with inputs in carry-save and binary form, respectively. (a) T, (b) $I_{cs}$, (c) $I_b$, (d) $U_{cs}$, (e) $U_b$.

Here $DT(k)$ and $AT(k)$ represent the critical path delay and the area cost of the entire $2^k$-bit RKM. In addition, $DI_{\{cs,b\}}(j)$ and $AI_{\{cs,b\}}(j)$ denote the critical path delay and area cost of the I component, and $DU_{\{cs,b\}}(l)$ and $AU_{\{cs,b\}}(l)$ denote the critical path delay and area cost of the U component; $l < j < k$. $DT_q(k)$ and $DT_r(k)$ are the delay times for paths q and r shown in Fig. 4(a); $D_{CSA}(i)$ and $D_{CPA}(k)$ are the delay times of the $2^i$-bit CSA and $2^k$-bit CPA, respectively. Similarly, $A_{CSA}(i)$ and $A_{CPA}(k)$ are the area costs of the $2^i$-bit CSA and $2^k$-bit CPA, respectively; here $i$ is a positive integer. $D_W(l)$ and $A_W(l)$ are the critical path delay and area cost of the $2^l$-bit WTM. When $j - 1 = l$, that is, in the final recursion, the U component is employed, and $DI_{cs}(l) = DU_{cs}(l)$, $AI_{cs}(l) = AU_{cs}(l)$.

4.2.2 Implementation results and discussion

We estimate the critical path delay and area cost of RKMs with bit-lengths of $2^5$ to $2^9$ (32 to 512 bits) using Eqs. (3) to (14). This estimation requires delay and area cost data for the basic multipliers, CPAs, and CSAs. Here we use a 16-bit WTM ($l = 4$) as the basic multiplier. The delay and area cost of the 16-bit WTM measured by logic synthesis (under the same conditions as in Section 4.1.2) are, respectively, $D_W(2^4) = 4.15$ ns and $A_W(2^4) = [...]$ mm². The delay and area cost of the CPA were taken from the results of logic synthesis of the adder from the Synopsys library (DW01_add); the adder configuration was CLA (Carry Lookahead Adder). The data thus obtained are presented in Table 4.
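The estimation procedure composes per-component costs level by level. Because the bodies of Eqs. (3) to (14) are not legible here, the sketch below does not reproduce them; it uses a deliberately simplified stand-in recurrence, assumed only for illustration, in which each level instantiates three multipliers of half the width plus adder area proportional to the operand width. The function names and the parameters `wtm_area` and `adder_area_per_bit` are hypothetical, and the numbers in the usage lines are arbitrary placeholders rather than synthesis data.

```python
import math


def rkm_area_sketch(k, l, wtm_area, adder_area_per_bit):
    """Area of a 2^k-bit RKM built on 2^l-bit base multipliers.

    Simplified stand-in model (NOT Eqs. (10)-(14) of the paper):
        area(l) = wtm_area
        area(j) = 3 * area(j - 1) + adder_area_per_bit * 2^j   for l < j <= k
    """
    area = wtm_area
    for j in range(l + 1, k + 1):
        area = 3 * area + adder_area_per_bit * (1 << j)
    return area


def loglog_slope(k, l, wtm_area, adder_area_per_bit):
    """Local slope of log2(area) against log2(bit-length) between 2^k and 2^(k+1)."""
    lower = rkm_area_sketch(k, l, wtm_area, adder_area_per_bit)
    upper = rkm_area_sketch(k + 1, l, wtm_area, adder_area_per_bit)
    return math.log2(upper / lower)


if __name__ == "__main__":
    # Placeholder unit costs; only the qualitative behavior matters here.
    for k in range(5, 10):
        print(k, rkm_area_sketch(k, 4, 1.0, 0.01), loglog_slope(k, 4, 1.0, 0.01))
```

Under this assumed model the local slope is always somewhat larger than $\log_2 3 \approx 1.58$ whenever the adder term is nonzero, because each level adds area beyond the three sub-multipliers.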
For the CSA, the delay and area cost were found similarly from the synthesis results; for 16 bits, the values are 0.27 ns and [...] mm², respectively. The critical path delay of the CSA does not change as the bit-length increases, and therefore we used the value for a 16-bit CSA regardless of the actual bit-length. The area cost, on the other hand, is assumed to increase with the bit-length.

Table 3. Modules to be replaced in recursion

Table 4. Critical path delay and area cost of CPA (DW01_add)

The critical path delay and area cost of the RKM thus estimated are given in Table 5(a). For comparison, Table 5(b) gives the synthesis results for the multiplier (DW02_mult) from the Synopsys library. The corresponding graphs are shown in Figs. 5(a) and 5(b). In the diagrams, the x-axis represents the base-2 logarithm of the bit-length, and the y-axis represents the base-2 logarithm of the area cost and critical path delay, respectively.

Table 5. Critical path delay and area cost of RKM (a) and WTM (b)

Fig. 5. Comparison between RKM and WTM in terms of area cost (a) and critical path delay (b).

In order to compare the estimates with actual values, we implemented a $2^6$-bit RKM and measured the critical path delay and area cost by logic synthesis. The values thus obtained were, respectively, [...] ns and [...] mm². The estimated values were [...] ns and [...] mm² [Table 5(a)]. The errors can be regarded as acceptable considering that the measured values were obtained by automatic synthesis.

Now consider the results obtained for the area cost. The area cost plot for RKM in Fig. 5(a), approximated by a straight line, has a slope of about [...]. The difference between this slope and the theoretical value of $O(n^{1.58})$ for the Karatsuba algorithm can be attributed to the additional area cost incurred by addition. The approximation for the WTM results, on the other hand, has a slope of about 1.90, which is close to the theoretical value of $O(n^2)$. In addition, as indicated by Fig. 5(a), RKM achieves a lower area cost than WTM at bit-lengths above $2^9$. At this crossover point, the area is about 30 mm². For the critical path delay, Fig. 5(b) shows $O(\log n)$ behavior for both multipliers, with RKM having the larger slope.

As may be concluded from the above, WTM should be chosen when calculation speed is the first priority. RKM, on the other hand, proves helpful in area-preferred design at bit-lengths above $2^9$. At such bit-lengths, however, both RKM and WTM have areas of about 30 mm², and in terms of cost/performance ratio, WTM may be regarded as the more practicable design. In IKM design, too, better performance can be achieved by using WTM as the basic multiplier.

5. IKM

In this section, we describe the design of IKM based on the repetitive use of basic multipliers and adders. First, in Section 5.1, we present an example of IKM design for two-time application of the Karatsuba algorithm. Then, in Section 5.2, we describe implementation results obtained for various bit-lengths and numbers of algorithm applications.

The IKM design is based on the approach to Karatsuba implementations over Galois fields described in Ref. 4. However, calculations over Galois fields do not involve carry operations, so that addition becomes logical XOR and multiplication becomes logical AND. Hence, the circuit configuration and scale differ from the case of integer calculations.

5.1 Design method

Here we design a $4n$-bit IKM using two-time iteration of the Karatsuba algorithm, assuming an $n$-bit basic multiplier. First, the $4n$-bit multiplier $A$ and multiplicand $B$ are both split into four $n$-bit parts as follows:

\[ A = a_3 2^{3n} + a_2 2^{2n} + a_1 2^{n} + a_0 \tag{15} \]

\[ B = b_3 2^{3n} + b_2 2^{2n} + b_1 2^{n} + b_0 \tag{16} \]

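Then the product $P$ of $A$ and $B$ can be written as

\[ P = p_7 2^{7n} + p_6 2^{6n} + \cdots + p_1 2^{n} + p_0 \tag{17} \]

Writing $A_1 = a_3 2^n + a_2$, $A_0 = a_1 2^n + a_0$, $B_1 = b_3 2^n + b_2$, $B_0 = b_1 2^n + b_0$, the product $P$ can be expressed as follows:

\[ P = A_1 B_1 2^{4n} + (A_{10} B_{10} - A_1 B_1 - A_0 B_0) 2^{2n} + A_0 B_0 \tag{18} \]

Here $A_{10} = A_1 + A_0$ and $B_{10} = B_1 + B_0$. Below we use such subscript pairs to represent the addition of two variables (for example, $a_{32} = a_3 + a_2$). The terms $A_1 B_1$, $A_0 B_0$, and $A_{10} B_{10}$ in Eq. (18) can be expanded as follows:

\[ A_1 B_1 = a_3 b_3 2^{2n} + (a_{32} b_{32} - a_3 b_3 - a_2 b_2) 2^{n} + a_2 b_2 \tag{19} \]

\[ A_0 B_0 = a_1 b_1 2^{2n} + (a_{10} b_{10} - a_1 b_1 - a_0 b_0) 2^{n} + a_0 b_0 \tag{20} \]

\[ A_{10} B_{10} = a_{31} b_{31} 2^{2n} + (a_{3210} b_{3210} - a_{31} b_{31} - a_{20} b_{20}) 2^{n} + a_{20} b_{20} \tag{21} \]

The following final expression is obtained by substituting Eqs. (19), (20), and (21) into Eq. (18) and collecting powers of $2^n$:

\[
\begin{aligned}
P ={}& a_3 b_3 2^{6n} + (a_{32} b_{32} - a_3 b_3 - a_2 b_2) 2^{5n} + (a_{31} b_{31} - a_3 b_3 + a_2 b_2 - a_1 b_1) 2^{4n} \\
&+ (a_{3210} b_{3210} - a_{31} b_{31} - a_{20} b_{20} - a_{32} b_{32} - a_{10} b_{10} + a_3 b_3 + a_2 b_2 + a_1 b_1 + a_0 b_0) 2^{3n} \\
&+ (a_{20} b_{20} - a_2 b_2 + a_1 b_1 - a_0 b_0) 2^{2n} + (a_{10} b_{10} - a_1 b_1 - a_0 b_0) 2^{n} + a_0 b_0
\end{aligned}
\tag{22}
\]

Table 6. Relations between coefficients $p_7^g, \ldots, p_0^g$ and partial products (rows L and H represent the lower n-bit part and the remaining higher part of the partial products, respectively)

The relations between $p_7^g, \ldots, p_0^g$ and the partial products are illustrated in Table 6. In the table, L and H respectively represent the lower $n$ bits and the remaining higher bits of every partial product. The coefficients $p_7^g, \ldots, p_0^g$ of the product $P$ in Eq. (17) before carry are found by accumulating the partial products according to the table. If a coefficient $p_i^g$ exceeds $n$ bits after accumulation, the excess part must be added to the next higher coefficient (carry operation). In this way, $p_7, \ldots, p_0$ are eventually obtained. For more than two applications of the Karatsuba algorithm, tables similar to Table 6 can be derived. For example, for three-time recursion, the number of partial products is 3 times that for two-time recursion, and 16 coefficients $p_i$ are involved; therefore, the table size will be [...].

5.1.1 IKM configuration

The coefficients $p_7, \ldots, p_0$ in Eq. (17) can be obtained by accumulating the 9 products $a_0 b_0, \ldots, a_3 b_3, a_{10} b_{10}, \ldots, a_{32} b_{32}$ that form the coefficients in Eq. (22). Below we denote these terms by pp0, ..., pp8, and call them partial products. Carries occur when these partial products are added. Therefore, carry processing from $p_0$ toward $p_7$ is required after the addition to obtain the final $p_7, \ldots, p_0$.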
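The derivation above can be checked with a small behavioral model. The sketch below forms the nine partial products, splits each into its lower $n$ bits (L) and remaining higher bits (H), accumulates them with signs into eight pre-carry coefficients, and then ripples the carries from the lowest coefficient upward. The placement table follows from expanding Eq. (18) with the two-level Karatsuba identity; the ordering of the partial products, the name `ikm_two_level`, and the table layout are assumptions for illustration and need not match the paper's pp0 to pp8 numbering or Table 6 row order.

```python
# (indices of the operand parts that are summed, [(coefficient index, sign), ...])
PARTIAL_PRODUCTS = [
    ((0,),         [(0, +1), (1, -1), (2, -1), (3, +1)]),   # a0*b0
    ((1,),         [(1, -1), (2, +1), (3, +1), (4, -1)]),   # a1*b1
    ((2,),         [(2, -1), (3, +1), (4, +1), (5, -1)]),   # a2*b2
    ((3,),         [(3, +1), (4, -1), (5, -1), (6, +1)]),   # a3*b3
    ((1, 0),       [(1, +1), (3, -1)]),                     # a10*b10
    ((3, 2),       [(3, -1), (5, +1)]),                     # a32*b32
    ((2, 0),       [(2, +1), (3, -1)]),                     # a20*b20
    ((3, 1),       [(3, -1), (4, +1)]),                     # a31*b31
    ((3, 2, 1, 0), [(3, +1)]),                              # a3210*b3210
]


def ikm_two_level(A, B, n):
    """Product of A and B (each assumed to be < 2^(4n)) via two Karatsuba levels."""
    mask = (1 << n) - 1
    a = [(A >> (i * n)) & mask for i in range(4)]
    b = [(B >> (i * n)) & mask for i in range(4)]

    # Generate each partial product, then accumulate its L part at the listed
    # coefficient and its H part one position higher (signed accumulation).
    pg = [0] * 8
    for idx, placements in PARTIAL_PRODUCTS:
        pp = sum(a[i] for i in idx) * sum(b[i] for i in idx)
        low, high = pp & mask, pp >> n
        for pos, sign in placements:
            pg[pos] += sign * low
            pg[pos + 1] += sign * high

    # Carry processing from p0 toward p7; intermediate values may be negative,
    # and Python's >> and & behave as floor division and modulo here.
    carry, product = 0, 0
    for i in range(8):
        total = pg[i] + carry
        product |= (total & mask) << (i * n)
        carry = total >> n
    assert carry == 0  # the product of two 4n-bit numbers fits in 8n bits
    return product


if __name__ == "__main__":
    A, B = 0xFEDCBA9876543210, 0x0123456789ABCDEF
    assert ikm_two_level(A, B, 16) == A * B
```

Each of the four products $a_i b_i$ appears at four coefficient positions and every other partial product at fewer, so at most four signed additions per partial-product part are needed in this formulation.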
Here the $p_7, \ldots, p_0$ before carry processing are denoted by $p_7^g, \ldots, p_0^g$.

Here we configure the IKM based on the approach explained above. As shown in Fig. 6, the IKM is composed of three units, PPG (Partial Product Generator), ACC (Accumulator), and CP (Carry Propagator), and a control module (CTRL). Each module is explained in detail below.

Fig. 6. Configuration of IKM.

5.1.2 PPG design

The PPG is composed of adders, multipliers, and a buffer (PPG Buffer) to generate partial products, as shown in Fig. 7. In this configuration, one WTM is employed as the multiplier; several multipliers may be used in more complex, higher-performance IKM designs. The WTM has 6 pipeline stages so that its stage delay matches that of the adders included in the IKM. Owing to this pipelining, several addition operations can be performed in parallel during a multiplication cycle; that is, the addition time in the PPG is hidden behind the multiplication time. This pipelined WTM was implemented with a multiplier (DW02_mult_6_stage) from the Synopsys library.

As regards adders, two units are used in this design, considering that the cost of an adder is much lower than that of the multiplier. Using two adders makes it possible to keep the multiplier always busy. The adders, too, were taken from the Synopsys library (DW01_add).

The PPG buffer stores all the data involved in the operations. A dedicated I/O port is provided for each arithmetic unit and input terminal. In addition, the buffer was designed for synchronous read and asynchronous write. The buffer capacity depends on the bit-length and number of the partial products, and on the number of arithmetic units. With every additional recursion, the number of partial products increases by a factor of 3, and the operand bit-length increases by 1. For example, when two-time recursion is performed at a basic bit-length of 16, ten 18-bit entries are required for the PPG buffer.

5.1.3 ACC design

The ACC is composed of adders/subtractors and a buffer (ACC Buffer) to accumulate the partial products generated by the PPG, as shown in Fig. 8.
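The PPG buffer entry width quoted in the PPG design above follows from how operand sums grow: each recursion adds two values of equal width, which needs one extra bit, and each recursion triples the number of partial products. A minimal check of that arithmetic is sketched below; the function names are hypothetical, and the sketch covers only the entry width and the partial-product count, not the number of buffer entries, which also depends on the scheduling.

```python
def ppg_operand_bits(base_bits, recursions):
    """Width of the widest multiplier operand: one extra bit per recursion."""
    return base_bits + recursions


def partial_product_count(recursions):
    """Each Karatsuba level replaces one product by three."""
    return 3 ** recursions


def max_operand_sum(base_bits, recursions):
    """Largest value of a sum of 2^recursions base-width parts (e.g. a3+a2+a1+a0)."""
    return (2 ** recursions) * ((1 << base_bits) - 1)


if __name__ == "__main__":
    # Two-time recursion at a basic bit-length of 16.
    print(ppg_operand_bits(16, 2))      # entry width in bits
    print(partial_product_count(2))     # number of partial products
    # The worst-case operand sum indeed fits in the computed width.
    assert max_operand_sum(16, 2).bit_length() <= ppg_operand_bits(16, 2)
```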
We used four adders/subtractors, considering the number of operations that can be performed simultaneously for one partial product. This corresponds to the maximum number of operations per row in Table 6. More adders/subtractors could be used to configure the ACC, but this would complicate control. We used adders/subtractors (DW01_addsub) from the same library.

Fig. 7. Configuration of PPG.

Fig. 8. Configuration of ACC.

As in the case of the PPG buffer, the ACC buffer is provided with an I/O port for every arithmetic unit and input terminal. The buffer capacity depends on the bit-length and number of the partial products, and on the number of arithmetic units. Since the ACC involves signed arithmetic, a sign bit must be assigned to every value. In addition, the internal bit-length of the ACC increases by 2 ($= \lceil \log_2 3 \rceil$) with every additional recursion because of the carries produced by accumulation. For example, when two-time recursion is performed at a basic bit-length of 16, eighteen 21-bit entries are required for the ACC buffer.

5.1.4 CP design

The CP is composed of adders and a register (Carry Register); it receives the coefficients accumulated by the ACC [$p_0^g$ to $p_7^g$ for Eq. (17) in the case of two-time recursion] and performs ripple carry, as shown in Fig. 9. The CP module divides each input value into the lower $n$ bits and the remaining bits. The higher bits are stored in the Carry Register, to be added to the next higher-order coefficient in the next cycle. No carry enters the lowest coefficient $p_0$, and therefore 0 is added in this case.

Fig. 9. Configuration of CP.

5.1.5 CTRL design

As explained above, the CTRL module sends appropriate control signals to the PPG, ACC, and CP modules. The control signals include read/write addresses and Write Enable signals for the PPG buffer and ACC buffer, an addition/subtraction selection signal for the ACC, and a signal to add 0 to the lowest coefficient in the CP. Below we give an example of the CTRL module for the considered configuration with two-time recursion.

Under the control of the CTRL module, the PPG performs the operations scheduled as shown in Table 7. The table shows the input timing and usable output timing for every arithmetic unit (one multiplier and two adders). Similarly, the ACC operation scheduling is given in Table 8. In the table, the suffixes L and H attached to the partial products represent the lowest $n$ bits and the remaining bits, respectively. For the CP, the multiplexer is controlled so as to select zero carry addition when the lowest coefficient $p_0$ is input.

Table 7. PPG operation scheduling

Table 8. ACC operation scheduling

The ACC and CP operations can start prior to completion of the preceding operations (PPG and ACC, respectively). Specifically, the ACC can start to receive the first partial product pp0 as soon as the PPG reaches its eighth step. The ACC operation ends by finding the coefficient $p_3$. This coefficient is output at the 17th step, and the coefficients $p_0$ to $p_2$ can be input to the CP before that. Therefore, the product $P$ is obtained in $(8 - 1) + (17 - 1) + 4 = 27$ steps after the lowest terms $a_0$ and $b_0$ of multiplier $A$ and multiplicand $B$ are input.

5.2 Implementation results and discussion

We designed the IKM as explained in Section 5.1 with 1, 2, and 3 recursions (below referred to as R1IKM, R2IKM, and R3IKM, respectively). The multipliers were described in Verilog-HDL, and logic synthesis was performed while varying the basic bit-length from 4 to 128, with priority given to minimization of the delay time. We used the library for the 0.18 µm process as mentioned in Section [...].

Table 9 presents the critical path delay, the area cost, the multiplication time, and the power consumption obtained from the results of logic synthesis. The power consumption was evaluated at a supply voltage of 1.8 V. The first row of each block gives the basic bit-length of the IKM. The fourth row, "Time," gives the multiplication time, found as the delay time multiplied by the necessary number of steps, namely, 14 for R1IKM, 27 for R2IKM, and 79 for R3IKM. The uppermost row, "Bit length," pertains to the multiplier and multiplicand.

Fig. 10. Performance comparison between IKMs and software implementation (exflib). The dashed lines with a slope of 1.58 represent the multiplication time $O(n^{1.58})$ of the Karatsuba algorithm.

For comparison with a software implementation, we measured the processing speed of Karatsuba multiplication using exflib Ver. [...] [14] (a package of fast multiple-precision arithmetic routines). The data thus obtained are presented in Fig. 10. In the diagram, the dashed lines have a slope of 1.58, representing the ideal performance of the Karatsuba algorithm.
As is evident from the graphs, the measured performance of exflib is close to the ideal slope, as is the performance of R1IKM, R2IKM, and R3IKM. Thus, the relative performance of IKM with respect to the software implementation remains nearly constant as the number of recursions is increased. However, the capacity of the PPG and ACC buffers increases with more recursions, resulting in a larger area. In addition, the diagram indicates that the performance increases with the basic bit-length. With basic bit-lengths of 32, 64, and 128, IKM outperforms the software by factors of about 5, 10, and 30, respectively. As regards the area cost, even the largest value of 10.9 mm², shown by R3IKM with a basic bit-length of 128 in Table 9, is sufficiently practicable.

Table 9. Evaluation of IKMs in terms of time, area, and power consumption

Fig. 11. Area cost of IKMs for multiplication of x-bit integers (x/2, x/4, and x/8-bit WTMs denote basic multipliers of R1IKM, R2IKM, and R3IKM, respectively).

As is evident from Table 9, the energy consumption of the 1024-bit R3IKM with a basic bit-length of 128 is 663 nJ (= 1874 mW × [...] ns). On the other hand, in the software implementation, the power consumption of a Pentium [...] GHz CPU is about 63 W [15], and the computing time for 856-bit (256 decimal digits) multiplication is about 6366 ns. Therefore, the energy consumption is 401 µJ (= 63 W × 6366 ns), which is about 600 times that of the hardware IKM.

Figure 11 shows how the area cost of R1IKM, R2IKM, and R3IKM varies with the basic bit-length. In the diagram, the x-axis and y-axis represent the multiplication bit-length and area cost, respectively. The IKM area consists primarily of the WTM, the PPG buffer, the ACC buffer, and the adders. Thus, we also show the area cost of the respective WTMs on the same graph.* When the basic bit-length is increased at a fixed number of recursions, the number of entries in every buffer does not change; only the entry length increases. Therefore, the area cost of every buffer is $O(n)$. That is, the area of every IKM approaches the area of its internal WTM as the bit-length is increased.

6. Conclusions

We examined the performance and area cost of two hardware designs of the Karatsuba algorithm, namely, RKM with a combinational configuration and IKM with a sequential configuration. We found that RKM can achieve a lower area cost than an ordinary WTM at $2^9$ or more bits, where the area cost is about 30 mm². On the other hand, RKM has larger delays than WTM at any bit-length. Therefore, WTM offers a better cost/performance ratio than RKM. In addition, we showed that IKM with basic bit-lengths of 32, 64, and 128 outperforms software by factors of about 5, 10, and 30, respectively. The greatest area cost, that of R3IKM with a basic bit-length of 128, is 10.9 mm², which is quite practicable. The energy consumption of a 1024-bit R3IKM with a basic bit-length of 128 is just 1/600 that of a general-purpose processor.
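The 1/600 energy ratio restated here can be re-derived from the figures quoted in Section 5.2: energy is power multiplied by time, and watts multiplied by nanoseconds give nanojoules. The short check below uses only the numbers stated in the text (63 W and 6366 ns for the software run, 663 nJ for the R3IKM); the function name is hypothetical. Note that, as in the text, the software figure is for an 856-bit multiplication and the hardware figure for a 1024-bit multiplication, so the ratio is indicative rather than a like-for-like comparison.

```python
def energy_nj(power_w, time_ns):
    """Energy in nanojoules: watts * nanoseconds = nanojoules."""
    return power_w * time_ns


SOFTWARE_ENERGY_NJ = energy_nj(63.0, 6366.0)   # general-purpose CPU, from the text
HARDWARE_ENERGY_NJ = 663.0                      # 1024-bit R3IKM, from Table 9 as quoted

RATIO = SOFTWARE_ENERGY_NJ / HARDWARE_ENERGY_NJ

if __name__ == "__main__":
    print(f"software: {SOFTWARE_ENERGY_NJ / 1000:.0f} uJ")
    print(f"ratio:    {RATIO:.0f}")
```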
Hence, we may conclude that WTM should be used rather than RKM for multiplication with relatively small bit-lengths, and that IKM using WTM as the basic multiplier is a proper solution for large bit-lengths. This study provides guidelines for optimal IKM design according to the application parameters.

* Based on Fig. 11, the WTM area can be approximated by O(n^1.70). The difference from O(n^2) can be explained by the priority given to speed in the logic synthesis. In the case of area-preferred synthesis, O(n^1.90) is obtained.

Acknowledgments

The present study was supported by Synopsys, Inc. via VDEC, University of Tokyo. In addition, we were subsidized by a JSPS Grant-in-Aid [Fundamental Research (C) (2)]. We express our gratitude to all the persons and institutions concerned.

REFERENCES

1. Fujiwara H. High-accurate numerical method for integral equations of the first kind under multiple-precision arithmetic. Theor Appl Mech Japan 2003;52:
2. Sprott JC. Chaos and time-series analysis. Oxford University Press;
3. Agrawal M, Kayal N, Saxena N. PRIMES is in P
4. Dyka Z, Langendoerfer P. Area efficient hardware implementation of elliptic curve cryptography by iteratively applying Karatsuba's method. Proc of the Design, Automation and Test in Europe Conference and Exhibition, Vol. 3, p 70-75,
5. Ling BW-K, Ho CY-F, Tam PK-S. Chaotic filter bank for computer cryptography. Chaos, Solitons & Fractals, in press, Available online 5 June
6. Chien T-I, Liao T-L. Design of secure digital communication systems using chaotic modulation, cryptography and chaotic synchronization. Chaos, Solitons and Fractals 2005;24:
7. Karatsuba A, Ofman Y. Multiplication of multidigit numbers on automata. Sov Phys Dokl 1963;7:
8. Knuth DE. The art of computer programming 2nd edition: Seminumerical algorithms, Vol. 2. Addison-Wesley;
9. Yazaki S, Abe K. VLSI design of FFT multi-digit multiplier. Trans Japan Soc Ind Appl Math 2006;15: (in Japanese)
10. Grabbe C, Bednara M, Teich J, von zur Gathen J, Shokrollahi J. FPGA designs of parallel high performance GF(2^233) multiplier. Proc IEEE International Symposium on Circuits and Systems, p ,
11. Shibaoka M, Takagi N, Takagi K. Reduced area parallel multiplier based on Karatsuba algorithm. IEICE General Conference, Vol. A-3, p 66, (in Japanese)
12. Wallace CS. A suggestion for a fast multiplier. IEEE Trans Electronic Computers 1964;13:
13. VLSI Design and Education Center Homepage,
14. exflib Extended Precision Float-Point Arithmetic Library, ~ fujiwara/ exflib/exflibindex.html
15. Intel(R) Pentium(R) 4 processor specifications, specs.htm

AUTHORS (from left to right)

Syunji Yazaki (nonmember) graduated from Tokyo University of Technology in 2002, completed the first and second stages of the doctoral program at the University of Electro-Communications in 2004 and 2007, and joined the faculty of Tokyo University of Technology as a research associate. His research interests include VLSI system design, multidigit multiplication, and social support systems for elderly people. He holds a D.Eng. degree, and is a member of IEICE, JSIAM, and PARTHENON Research Group.

Koki Abe (member) graduated from Yokohama National University in 1969 and completed the M.E. program in He withdrew from the doctoral program at the University of Tokyo in 1974 to become a research associate at the University of Electro-Communications. He was a visiting researcher at Carnegie Mellon University from 1980 to He has been an associate professor at the University of Electro-Communications since His research interests include computer architectures, VLSI system design, and computer networks. He holds a D.Sc. degree, and is a member of IPSJ, IEICE, and IEEE.

An Optimum Design of FFT Multi-Digit Multiplier and Its VLSI Implementation

An Optimum Design of FFT Multi-Digit Multiplier and Its VLSI Implementation An Optimum Design of FFT Multi-Digit Multiplier and Its VLSI Implementation Syunji Yazaki Kôki Abe Abstract We designed a VLSI chip of FFT multiplier based on simple Cooly Tukey FFT using a floating-point

More information

Area Efficient, Low Power Array Multiplier for Signed and Unsigned Number. Chapter 3

Area Efficient, Low Power Array Multiplier for Signed and Unsigned Number. Chapter 3 Area Efficient, Low Power Array Multiplier for Signed and Unsigned Number Chapter 3 Area Efficient, Low Power Array Multiplier for Signed and Unsigned Number Chapter 3 3.1 Introduction The various sections

More information

FPGA Implementation of a High Speed Multiplier Employing Carry Lookahead Adders in Reduction Phase

FPGA Implementation of a High Speed Multiplier Employing Carry Lookahead Adders in Reduction Phase FPGA Implementation of a High Speed Multiplier Employing Carry Lookahead Adders in Reduction Phase Abhay Sharma M.Tech Student Department of ECE MNNIT Allahabad, India ABSTRACT Tree Multipliers are frequently

More information

OPTIMIZATION OF AREA COMPLEXITY AND DELAY USING PRE-ENCODED NR4SD MULTIPLIER.

OPTIMIZATION OF AREA COMPLEXITY AND DELAY USING PRE-ENCODED NR4SD MULTIPLIER. OPTIMIZATION OF AREA COMPLEXITY AND DELAY USING PRE-ENCODED NR4SD MULTIPLIER. A.Anusha 1 R.Basavaraju 2 anusha201093@gmail.com 1 basava430@gmail.com 2 1 PG Scholar, VLSI, Bharath Institute of Engineering

More information

An Efficient Fused Add Multiplier With MWT Multiplier And Spanning Tree Adder

An Efficient Fused Add Multiplier With MWT Multiplier And Spanning Tree Adder An Efficient Fused Add Multiplier With MWT Multiplier And Spanning Tree Adder 1.M.Megha,M.Tech (VLSI&ES),2. Nataraj, M.Tech (VLSI&ES), Assistant Professor, 1,2. ECE Department,ST.MARY S College of Engineering

More information

A High-Speed FPGA Implementation of an RSD- Based ECC Processor

A High-Speed FPGA Implementation of an RSD- Based ECC Processor A High-Speed FPGA Implementation of an RSD- Based ECC Processor Abstract: In this paper, an exportable application-specific instruction-set elliptic curve cryptography processor based on redundant signed

More information

Fixed-Width Recursive Multipliers

Fixed-Width Recursive Multipliers Fixed-Width Recursive Multipliers Presented by: Kevin Biswas Supervisors: Dr. M. Ahmadi Dr. H. Wu Department of Electrical and Computer Engineering University of Windsor Motivation & Objectives Outline

More information

Area-Delay-Power Efficient Carry-Select Adder

Area-Delay-Power Efficient Carry-Select Adder Area-Delay-Power Efficient Carry-Select Adder Shruthi Nataraj 1, Karthik.L 2 1 M-Tech Student, Karavali Institute of Technology, Neermarga, Mangalore, Karnataka 2 Assistant professor, Karavali Institute

More information

An Efficient Carry Select Adder with Less Delay and Reduced Area Application

An Efficient Carry Select Adder with Less Delay and Reduced Area Application An Efficient Carry Select Adder with Less Delay and Reduced Area Application Pandu Ranga Rao #1 Priyanka Halle #2 # Associate Professor Department of ECE Sreyas Institute of Engineering and Technology,

More information

VLSI Design Of a Novel Pre Encoding Multiplier Using DADDA Multiplier. Guntur(Dt),Pin:522017

VLSI Design Of a Novel Pre Encoding Multiplier Using DADDA Multiplier. Guntur(Dt),Pin:522017 VLSI Design Of a Novel Pre Encoding Multiplier Using DADDA Multiplier 1 Katakam Hemalatha,(M.Tech),Email Id: hema.spark2011@gmail.com 2 Kundurthi Ravi Kumar, M.Tech,Email Id: kundurthi.ravikumar@gmail.com

More information

Fused Floating Point Arithmetic Unit for Radix 2 FFT Implementation

Fused Floating Point Arithmetic Unit for Radix 2 FFT Implementation IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 6, Issue 2, Ver. I (Mar. -Apr. 2016), PP 58-65 e-issn: 2319 4200, p-issn No. : 2319 4197 www.iosrjournals.org Fused Floating Point Arithmetic

More information

Analysis of Different Multiplication Algorithms & FPGA Implementation

Analysis of Different Multiplication Algorithms & FPGA Implementation IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 4, Issue 2, Ver. I (Mar-Apr. 2014), PP 29-35 e-issn: 2319 4200, p-issn No. : 2319 4197 Analysis of Different Multiplication Algorithms & FPGA

More information

EE878 Special Topics in VLSI. Computer Arithmetic for Digital Signal Processing

EE878 Special Topics in VLSI. Computer Arithmetic for Digital Signal Processing EE878 Special Topics in VLSI Computer Arithmetic for Digital Signal Processing Part 6c High-Speed Multiplication - III Spring 2017 Koren Part.6c.1 Array Multipliers The two basic operations - generation

More information

II. MOTIVATION AND IMPLEMENTATION

II. MOTIVATION AND IMPLEMENTATION An Efficient Design of Modified Booth Recoder for Fused Add-Multiply operator Dhanalakshmi.G Applied Electronics PSN College of Engineering and Technology Tirunelveli dhanamgovind20@gmail.com Prof.V.Gopi

More information

OPTIMIZING THE POWER USING FUSED ADD MULTIPLIER

OPTIMIZING THE POWER USING FUSED ADD MULTIPLIER Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,

More information

Area Delay Power Efficient Carry-Select Adder

Area Delay Power Efficient Carry-Select Adder Area Delay Power Efficient Carry-Select Adder Pooja Vasant Tayade Electronics and Telecommunication, S.N.D COE and Research Centre, Maharashtra, India ---------------------------------------------------------------------***---------------------------------------------------------------------

More information

Design and Analysis of Kogge-Stone and Han-Carlson Adders in 130nm CMOS Technology

Design and Analysis of Kogge-Stone and Han-Carlson Adders in 130nm CMOS Technology Design and Analysis of Kogge-Stone and Han-Carlson Adders in 130nm CMOS Technology Senthil Ganesh R & R. Kalaimathi 1 Assistant Professor, Electronics and Communication Engineering, Info Institute of Engineering,

More information

I. Introduction. India; 2 Assistant Professor, Department of Electronics & Communication Engineering, SRIT, Jabalpur (M.P.

I. Introduction. India; 2 Assistant Professor, Department of Electronics & Communication Engineering, SRIT, Jabalpur (M.P. A Decimal / Binary Multi-operand Adder using a Fast Binary to Decimal Converter-A Review Ruchi Bhatt, Divyanshu Rao, Ravi Mohan 1 M. Tech Scholar, Department of Electronics & Communication Engineering,

More information

FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST

FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST SAKTHIVEL Assistant Professor, Department of ECE, Coimbatore Institute of Engineering and Technology Abstract- FPGA is

More information

Low-Power FIR Digital Filters Using Residue Arithmetic

Low-Power FIR Digital Filters Using Residue Arithmetic Low-Power FIR Digital Filters Using Residue Arithmetic William L. Freking and Keshab K. Parhi Department of Electrical and Computer Engineering University of Minnesota 200 Union St. S.E. Minneapolis, MN

More information

Design of an Efficient 128-Bit Carry Select Adder Using Bec and Variable csla Techniques

Design of an Efficient 128-Bit Carry Select Adder Using Bec and Variable csla Techniques Design of an Efficient 128-Bit Carry Select Adder Using Bec and Variable csla Techniques B.Bharathi 1, C.V.Subhaskar Reddy 2 1 DEPARTMENT OF ECE, S.R.E.C, NANDYAL 2 ASSOCIATE PROFESSOR, S.R.E.C, NANDYAL.

More information

High Speed Special Function Unit for Graphics Processing Unit

High Speed Special Function Unit for Graphics Processing Unit High Speed Special Function Unit for Graphics Processing Unit Abd-Elrahman G. Qoutb 1, Abdullah M. El-Gunidy 1, Mohammed F. Tolba 1, and Magdy A. El-Moursy 2 1 Electrical Engineering Department, Fayoum

More information

Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm

Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm Low Power and Memory Efficient FFT Architecture Using Modified CORDIC Algorithm 1 A.Malashri, 2 C.Paramasivam 1 PG Student, Department of Electronics and Communication K S Rangasamy College Of Technology,

More information

Path Traced Perceptron Branch Predictor Using Local History for Weight Selection

Path Traced Perceptron Branch Predictor Using Local History for Weight Selection Path Traced Perceptron Branch Predictor Using Local History for Selection Yasuyuki Ninomiya and Kôki Abe Department of Computer Science The University of Electro-Communications 1-5-1 Chofugaoka Chofu-shi

More information

High Speed Systolic Montgomery Modular Multipliers for RSA Cryptosystems

High Speed Systolic Montgomery Modular Multipliers for RSA Cryptosystems High Speed Systolic Montgomery Modular Multipliers for RSA Cryptosystems RAVI KUMAR SATZODA, CHIP-HONG CHANG and CHING-CHUEN JONG Centre for High Performance Embedded Systems Nanyang Technological University

More information

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator A.Sindhu 1, K.PriyaMeenakshi 2 PG Student [VLSI], Dept. of ECE, Muthayammal Engineering College, Rasipuram, Tamil Nadu,

More information

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Digital Computer Arithmetic ECE 666 Part 6c High-Speed Multiplication - III Israel Koren Fall 2010 ECE666/Koren Part.6c.1 Array Multipliers

More information

CHAPTER 3 METHODOLOGY. 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier

CHAPTER 3 METHODOLOGY. 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier CHAPTER 3 METHODOLOGY 3.1 Analysis of the Conventional High Speed 8-bits x 8-bits Wallace Tree Multiplier The design analysis starts with the analysis of the elementary algorithm for multiplication by

More information

A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN

A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN A SIMULINK-TO-FPGA MULTI-RATE HIERARCHICAL FIR FILTER DESIGN Xiaoying Li 1 Fuming Sun 2 Enhua Wu 1, 3 1 University of Macau, Macao, China 2 University of Science and Technology Beijing, Beijing, China

More information

A High Speed Design of 32 Bit Multiplier Using Modified CSLA

A High Speed Design of 32 Bit Multiplier Using Modified CSLA Journal From the SelectedWorks of Journal October, 2014 A High Speed Design of 32 Bit Multiplier Using Modified CSLA Vijaya kumar vadladi David Solomon Raju. Y This work is licensed under a Creative Commons

More information

Design and Implementation of CVNS Based Low Power 64-Bit Adder

Design and Implementation of CVNS Based Low Power 64-Bit Adder Design and Implementation of CVNS Based Low Power 64-Bit Adder Ch.Vijay Kumar Department of ECE Embedded Systems & VLSI Design Vishakhapatnam, India Sri.Sagara Pandu Department of ECE Embedded Systems

More information

ISSN (Online), Volume 1, Special Issue 2(ICITET 15), March 2015 International Journal of Innovative Trends and Emerging Technologies

ISSN (Online), Volume 1, Special Issue 2(ICITET 15), March 2015 International Journal of Innovative Trends and Emerging Technologies VLSI IMPLEMENTATION OF HIGH PERFORMANCE DISTRIBUTED ARITHMETIC (DA) BASED ADAPTIVE FILTER WITH FAST CONVERGENCE FACTOR G. PARTHIBAN 1, P.SATHIYA 2 PG Student, VLSI Design, Department of ECE, Surya Group

More information

High Speed Multiplication Using BCD Codes For DSP Applications

High Speed Multiplication Using BCD Codes For DSP Applications High Speed Multiplication Using BCD Codes For DSP Applications Balasundaram 1, Dr. R. Vijayabhasker 2 PG Scholar, Dept. Electronics & Communication Engineering, Anna University Regional Centre, Coimbatore,

More information

ISSN Vol.08,Issue.12, September-2016, Pages:

ISSN Vol.08,Issue.12, September-2016, Pages: ISSN 2348 2370 Vol.08,Issue.12, September-2016, Pages:2273-2277 www.ijatir.org G. DIVYA JYOTHI REDDY 1, V. ROOPA REDDY 2 1 PG Scholar, Dept of ECE, TKR Engineering College, Hyderabad, TS, India, E-mail:

More information

VLSI Design and Implementation of High Speed and High Throughput DADDA Multiplier

VLSI Design and Implementation of High Speed and High Throughput DADDA Multiplier VLSI Design and Implementation of High Speed and High Throughput DADDA Multiplier U.V.N.S.Suhitha Student Department of ECE, BVC College of Engineering, AP, India. Abstract: The ever growing need for improved

More information

Design and Implementation of Advanced Modified Booth Encoding Multiplier

Design and Implementation of Advanced Modified Booth Encoding Multiplier Design and Implementation of Advanced Modified Booth Encoding Multiplier B.Sirisha M.Tech Student, Department of Electronics and communication Engineering, GDMM College of Engineering and Technology. ABSTRACT:

More information

FPGA Matrix Multiplier

FPGA Matrix Multiplier FPGA Matrix Multiplier In Hwan Baek Henri Samueli School of Engineering and Applied Science University of California Los Angeles Los Angeles, California Email: chris.inhwan.baek@gmail.com David Boeck Henri

More information

The p-sized partitioning algorithm for fast computation of factorials of numbers

The p-sized partitioning algorithm for fast computation of factorials of numbers J Supercomput (2006) 38:73 82 DOI 10.1007/s11227-006-7285-5 The p-sized partitioning algorithm for fast computation of factorials of numbers Ahmet Ugur Henry Thompson C Science + Business Media, LLC 2006

More information

A Simple Method to Improve the throughput of A Multiplier

A Simple Method to Improve the throughput of A Multiplier International Journal of Electronics and Communication Engineering. ISSN 0974-2166 Volume 6, Number 1 (2013), pp. 9-16 International Research Publication House http://www.irphouse.com A Simple Method to

More information

An Efficient Pipelined Multiplicative Inverse Architecture for the AES Cryptosystem

An Efficient Pipelined Multiplicative Inverse Architecture for the AES Cryptosystem An Efficient Pipelined Multiplicative Inverse Architecture for the AES Cryptosystem Mostafa Abd-El-Barr and Amro Khattab Abstract In this paper, we introduce an architecture for performing a recursive

More information

At the ith stage: Input: ci is the carry-in Output: si is the sum ci+1 carry-out to (i+1)st state

At the ith stage: Input: ci is the carry-in Output: si is the sum ci+1 carry-out to (i+1)st state Chapter 4 xi yi Carry in ci Sum s i Carry out c i+ At the ith stage: Input: ci is the carry-in Output: si is the sum ci+ carry-out to (i+)st state si = xi yi ci + xi yi ci + xi yi ci + xi yi ci = x i yi

More information

Design and Characterization of High Speed Carry Select Adder

Design and Characterization of High Speed Carry Select Adder Design and Characterization of High Speed Carry Select Adder Santosh Elangadi MTech Student, Dept of ECE, BVBCET, Hubli, Karnataka, India Suhas Shirol Professor, Dept of ECE, BVBCET, Hubli, Karnataka,

More information

High Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC

High Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC Journal of Computational Information Systems 7: 8 (2011) 2843-2850 Available at http://www.jofcis.com High Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC Meihua GU 1,2, Ningmei

More information

FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression

FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression Divakara.S.S, Research Scholar, J.S.S. Research Foundation, Mysore Cyril Prasanna Raj P Dean(R&D), MSEC, Bangalore Thejas

More information

UNIT II - COMBINATIONAL LOGIC Part A 2 Marks. 1. Define Combinational circuit A combinational circuit consist of logic gates whose outputs at anytime are determined directly from the present combination

More information

Design and Implementation of 3-D DWT for Video Processing Applications

Design and Implementation of 3-D DWT for Video Processing Applications Design and Implementation of 3-D DWT for Video Processing Applications P. Mohaniah 1, P. Sathyanarayana 2, A. S. Ram Kumar Reddy 3 & A. Vijayalakshmi 4 1 E.C.E, N.B.K.R.IST, Vidyanagar, 2 E.C.E, S.V University

More information

JOURNAL OF INTERNATIONAL ACADEMIC RESEARCH FOR MULTIDISCIPLINARY Impact Factor 1.393, ISSN: , Volume 2, Issue 7, August 2014

JOURNAL OF INTERNATIONAL ACADEMIC RESEARCH FOR MULTIDISCIPLINARY Impact Factor 1.393, ISSN: , Volume 2, Issue 7, August 2014 DESIGN OF HIGH SPEED BOOTH ENCODED MULTIPLIER PRAVEENA KAKARLA* *Assistant Professor, Dept. of ECONE, Sree Vidyanikethan Engineering College, A.P., India ABSTRACT This paper presents the design and implementation

More information

IMPLEMENTATION OF AN ADAPTIVE FIR FILTER USING HIGH SPEED DISTRIBUTED ARITHMETIC

IMPLEMENTATION OF AN ADAPTIVE FIR FILTER USING HIGH SPEED DISTRIBUTED ARITHMETIC IMPLEMENTATION OF AN ADAPTIVE FIR FILTER USING HIGH SPEED DISTRIBUTED ARITHMETIC Thangamonikha.A 1, Dr.V.R.Balaji 2 1 PG Scholar, Department OF ECE, 2 Assitant Professor, Department of ECE 1, 2 Sri Krishna

More information

Pipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications

Pipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications , Vol 7(4S), 34 39, April 204 ISSN (Print): 0974-6846 ISSN (Online) : 0974-5645 Pipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications B. Vignesh *, K. P. Sridhar

More information

A VLSI Architecture for H.264/AVC Variable Block Size Motion Estimation

A VLSI Architecture for H.264/AVC Variable Block Size Motion Estimation Journal of Automation and Control Engineering Vol. 3, No. 1, February 20 A VLSI Architecture for H.264/AVC Variable Block Size Motion Estimation Dam. Minh Tung and Tran. Le Thang Dong Center of Electrical

More information

A High-Speed FPGA Implementation of an RSD-Based ECC Processor

A High-Speed FPGA Implementation of an RSD-Based ECC Processor RESEARCH ARTICLE International Journal of Engineering and Techniques - Volume 4 Issue 1, Jan Feb 2018 A High-Speed FPGA Implementation of an RSD-Based ECC Processor 1 K Durga Prasad, 2 M.Suresh kumar 1

More information

COPY RIGHT. To Secure Your Paper As Per UGC Guidelines We Are Providing A Electronic Bar Code

COPY RIGHT. To Secure Your Paper As Per UGC Guidelines We Are Providing A Electronic Bar Code COPY RIGHT 2018IJIEMR.Personal use of this material is permitted. Permission from IJIEMR must be obtained for all other uses, in any current or future media, including reprinting/republishing this material

More information

Implementation of Reduce the Area- Power Efficient Fixed-Point LMS Adaptive Filter with Low Adaptation-Delay

Implementation of Reduce the Area- Power Efficient Fixed-Point LMS Adaptive Filter with Low Adaptation-Delay Implementation of Reduce the Area- Power Efficient Fixed-Point LMS Adaptive Filter with Low Adaptation-Delay A.Sakthivel 1, A.Lalithakumar 2, T.Kowsalya 3 PG Scholar [VLSI], Muthayammal Engineering College,

More information

THE orthogonal frequency-division multiplex (OFDM)

THE orthogonal frequency-division multiplex (OFDM) 26 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 57, NO. 1, JANUARY 2010 A Generalized Mixed-Radix Algorithm for Memory-Based FFT Processors Chen-Fong Hsiao, Yuan Chen, Member, IEEE,

More information

A Macro Generator for Arithmetic Cores

A Macro Generator for Arithmetic Cores A Macro Generator for Arithmetic Cores D. Bakalis 1,2, M. Bellos 1, H. T. Vergos 1,2, D. Nikolos 1,2 & G. Alexiou 1,2 1 Computer Engineering and Informatics Dept., University of Patras, 26 500, Rio, Greece

More information

Power-Mode-Aware Buffer Synthesis for Low-Power Clock Skew Minimization

Power-Mode-Aware Buffer Synthesis for Low-Power Clock Skew Minimization This article has been accepted and published on J-STAGE in advance of copyediting. Content is final as presented. IEICE Electronics Express, Vol.* No.*,*-* Power-Mode-Aware Buffer Synthesis for Low-Power

More information

Design and Verification of Area Efficient High-Speed Carry Select Adder

Design and Verification of Area Efficient High-Speed Carry Select Adder Design and Verification of Area Efficient High-Speed Carry Select Adder T. RatnaMala # 1, R. Vinay Kumar* 2, T. Chandra Kala #3 #1 PG Student, Kakinada Institute of Engineering and Technology,Korangi,

More information

RADIX-4 AND RADIX-8 MULTIPLIER USING VERILOG HDL

RADIX-4 AND RADIX-8 MULTIPLIER USING VERILOG HDL RADIX-4 AND RADIX-8 MULTIPLIER USING VERILOG HDL P. Thayammal 1, R.Sudhashree 2, G.Rajakumar 3 P.G.Scholar, Department of VLSI, Francis Xavier Engineering College, Tirunelveli 1 P.G.Scholar, Department

More information

Efficient Radix-4 and Radix-8 Butterfly Elements

Efficient Radix-4 and Radix-8 Butterfly Elements Efficient Radix4 and Radix8 Butterfly Elements Weidong Li and Lars Wanhammar Electronics Systems, Department of Electrical Engineering Linköping University, SE581 83 Linköping, Sweden Tel.: +46 13 28 {1721,

More information

A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter

A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter A.S. Sneka Priyaa PG Scholar Government College of Technology Coimbatore ABSTRACT The Least Mean Square Adaptive Filter is frequently

More information

Efficient VLSI Huffman encoder implementation and its application in high rate serial data encoding

Efficient VLSI Huffman encoder implementation and its application in high rate serial data encoding LETTER IEICE Electronics Express, Vol.14, No.21, 1 11 Efficient VLSI Huffman encoder implementation and its application in high rate serial data encoding Rongshan Wei a) and Xingang Zhang College of Physics

More information

DESIGN OF DECIMAL / BINARY MULTI-OPERAND ADDER USING A FAST BINARY TO DECIMAL CONVERTER

DESIGN OF DECIMAL / BINARY MULTI-OPERAND ADDER USING A FAST BINARY TO DECIMAL CONVERTER DESIGN OF DECIMAL / BINARY MULTI-OPERAND ADDER USING A FAST BINARY TO DECIMAL CONVERTER Sk.Howldar 1, M. Vamsi Krishna Allu 2 1 M.Tech VLSI Design student, 2 Assistant Professor 1,2 E.C.E Department, Sir

More information

Paper ID # IC In the last decade many research have been carried

Paper ID # IC In the last decade many research have been carried A New VLSI Architecture of Efficient Radix based Modified Booth Multiplier with Reduced Complexity In the last decade many research have been carried KARTHICK.Kout 1, MR. to reduce S. BHARATH the computation

More information

Digital Logic & Computer Design CS Professor Dan Moldovan Spring 2010

Digital Logic & Computer Design CS Professor Dan Moldovan Spring 2010 Digital Logic & Computer Design CS 434 Professor Dan Moldovan Spring 2 Copyright 27 Elsevier 5- Chapter 5 :: Digital Building Blocks Digital Design and Computer Architecture David Money Harris and Sarah

More information

Design of a Multiplier Architecture Based on LUT and VHBCSE Algorithm For FIR Filter

Design of a Multiplier Architecture Based on LUT and VHBCSE Algorithm For FIR Filter African Journal of Basic & Applied Sciences 9 (1): 53-58, 2017 ISSN 2079-2034 IDOSI Publications, 2017 DOI: 10.5829/idosi.ajbas.2017.53.58 Design of a Multiplier Architecture Based on LUT and VHBCSE Algorithm

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 Advance Encryption Standard (AES) Rijndael algorithm is symmetric block cipher that can process data blocks of 128 bits, using cipher keys with lengths of 128, 192, and 256

More information

FPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE Standard

FPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE Standard FPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE 754-2008 Standard M. Shyamsi, M. I. Ibrahimy, S. M. A. Motakabber and M. R. Ahsan Dept. of Electrical and Computer Engineering

More information

Implimentation of A 16-bit RISC Processor for Convolution Application

Implimentation of A 16-bit RISC Processor for Convolution Application Advance in Electronic and Electric Engineering. ISSN 2231-1297, Volume 4, Number 5 (2014), pp. 441-446 Research India Publications http://www.ripublication.com/aeee.htm Implimentation of A 16-bit RISC

More information

Design of Delay Efficient Carry Save Adder

Design of Delay Efficient Carry Save Adder Design of Delay Efficient Carry Save Adder K. Deepthi Assistant Professor,M.Tech., Department of ECE MIC College of technology Vijayawada, India M.Jayasree (PG scholar) Department of ECE MIC College of

More information

Design and Implementation of VLSI 8 Bit Systolic Array Multiplier

Design and Implementation of VLSI 8 Bit Systolic Array Multiplier Design and Implementation of VLSI 8 Bit Systolic Array Multiplier Khumanthem Devjit Singh, K. Jyothi MTech student (VLSI & ES), GIET, Rajahmundry, AP, India Associate Professor, Dept. of ECE, GIET, Rajahmundry,

More information

On-Line Error Detecting Constant Delay Adder

On-Line Error Detecting Constant Delay Adder On-Line Error Detecting Constant Delay Adder Whitney J. Townsend and Jacob A. Abraham Computer Engineering Research Center The University of Texas at Austin whitney and jaa @cerc.utexas.edu Parag K. Lala

More information

HIGH PERFORMANCE FUSED ADD MULTIPLY OPERATOR

HIGH PERFORMANCE FUSED ADD MULTIPLY OPERATOR HIGH PERFORMANCE FUSED ADD MULTIPLY OPERATOR R. Alwin [1] S. Anbu Vallal [2] I. Angel [3] B. Benhar Silvan [4] V. Jai Ganesh [5] 1 Assistant Professor, 2,3,4,5 Student Members Department of Electronics

More information

FIR Filter Synthesis Algorithms for Minimizing the Delay and the Number of Adders

FIR Filter Synthesis Algorithms for Minimizing the Delay and the Number of Adders 770 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 48, NO. 8, AUGUST 2001 FIR Filter Synthesis Algorithms for Minimizing the Delay and the Number of Adders Hyeong-Ju

More information

Combinational Logic. Prof. Wangrok Oh. Dept. of Information Communications Eng. Chungnam National University. Prof. Wangrok Oh(CNU) 1 / 93

Combinational Logic. Prof. Wangrok Oh. Dept. of Information Communications Eng. Chungnam National University. Prof. Wangrok Oh(CNU) 1 / 93 Combinational Logic Prof. Wangrok Oh Dept. of Information Communications Eng. Chungnam National University Prof. Wangrok Oh(CNU) / 93 Overview Introduction 2 Combinational Circuits 3 Analysis Procedure

More information

Digital Computer Arithmetic

Digital Computer Arithmetic Digital Computer Arithmetic Part 6 High-Speed Multiplication Soo-Ik Chae Spring 2010 Koren Chap.6.1 Speeding Up Multiplication Multiplication involves 2 basic operations generation of partial products

More information

Efficient Radix-10 Multiplication Using BCD Codes

Efficient Radix-10 Multiplication Using BCD Codes Efficient Radix-10 Multiplication Using BCD Codes P.Ranjith Kumar Reddy M.Tech VLSI, Department of ECE, CMR Institute of Technology. P.Navitha Assistant Professor, Department of ECE, CMR Institute of Technology.

More information
