VLSI Design of Karatsuba Integer Multipliers and Its Evaluation


Electronics and Communications in Japan, Vol. 92, No. 4, 2009
Translated from Denki Gakkai Ronbunshi, Vol. 128-C, No. 2, February 2008

SYUNJI YAZAKI (1) and KOKI ABE (2)
(1) Tokyo University of Technology, Japan
(2) University of Electro-Communications, Japan

SUMMARY

Multidigit multiplication has come to be widely used in recent years in applications including numerical calculation, chaos arithmetic, and primality testing. Systems with high performance and low energy consumption are in demand, especially for image processing and for communications with chaos-based cryptography. The Karatsuba algorithm, with a computational complexity of $O(n^{1.58})$, has been employed in software for multiplication of hundreds to thousands of bits, where $n$ is the bit-length of the operands. In this paper, a hardware design of multidigit integer multiplication based on the Karatsuba algorithm is described, and its VLSI realization is evaluated in terms of cost, performance, and energy consumption. We present two design choices for the Karatsuba hardware: RKM (Recursive Karatsuba Multiplier) and IKM (Iterative Karatsuba Multiplier). We found that RKM has a lower area cost than WTM (Wallace Tree Multiplier) for bit-lengths larger than $2^9$, at an area cost of about 30 mm$^2$. The critical path delay of RKM is always larger than that of WTM. Therefore, WTM should be used as the combinational circuit inside IKM to obtain better cost performance. We also found that a version of IKM using a 0.18 µm process can perform 1024-bit multiplications 30 times faster than software at an area cost of 10.9 mm$^2$. The energy consumed by this IKM version for the computation was found to be nearly 1/600 of that consumed by a general-purpose processor executing the software. The results of this study will help system designers of applications requiring multidigit multiplication to select among design alternatives, including ASIC realization.

Wiley Periodicals, Inc.
Electron Comm Jpn, 92(4): 9-20, 2009; Published online in Wiley InterScience ( wiley.com). DOI /ecj

Contract grant sponsors: Synopsys, Inc. via VDEC, University of Tokyo, and a JSPS Grant-in-Aid [Fundamental Research (C) (2) ].

Key words: multidigit multiplication; Karatsuba algorithm; VLSI.

1. Introduction

Recently, numerous attempts have been made to use ASICs (Application-Specific Integrated Circuits) for entire applications or parts of them in order to achieve faster calculation. In addition, compact low-power ASICs are often used instead of general-purpose processors to implement inexpensive embedded devices. Further, multidigit calculations far beyond the word length of general-purpose processors are required in many fields such as numerical analysis [1], chaos computing [2], primality testing [3], and cryptography [4]. In particular, rounding errors are known to strongly affect the whole calculation in chaos-based image processing [5] and communication systems [6]. In such systems, the influence of errors can be reduced by using multidigit calculations. High speed and low power consumption are usually required as well.

In this context, we focus on multidigit multiplication, a frequently used and time-consuming kind of multidigit computing. Specifically, we propose a VLSI implementation offering high performance, small area, and low power consumption.

When multidigit multiplication is performed with multipliers of a fixed narrow width, whether in software or in hardware, the multidigit numbers must be decomposed appropriately and processed repeatedly by the narrow multipliers. Efficient implementation therefore requires some scheme to minimize the number of iterations. In particular, when designing hardware multidigit multipliers, such a scheme is necessary to minimize memory accesses, computing time, and power consumption.
The reduction of repetitive multiplication operations has long been studied, with the Karatsuba algorithm [7] and FFT (Fast Fourier Transform) multiplication [8] being typical algorithms for this purpose. With these techniques, $n$-bit multiplication, which otherwise has a complexity of $O(n^2)$, can be performed with a complexity of $O(n^{1.58})$ and $O(n \log n \log\log n)$, respectively. FFT multiplication is asymptotically faster than the Karatsuba algorithm; however, it suffers from large overhead, and the Karatsuba algorithm is usually used for multiplication in the range of hundreds to thousands of bits.

A hardware implementation of FFT multiplication has already been reported [9]. There are also many examples of hardware Karatsuba implementations over a Galois field (GF), for use in elliptic curve cryptography and other fields [4, 10]. On the other hand, there are few examples of the implementation and evaluation of hardware Karatsuba multipliers for integer operations (e.g., the 32-bit Karatsuba integer multiplier of Ref. 11), and no studies dealing with the range from hundreds to thousands of bits are available. In this study, we propose a hardware implementation of the Karatsuba integer multiplier in this range, and evaluate its performance.

This paper is organized as follows. We give a brief explanation of the Karatsuba algorithm in Section 2, then present two design choices for hardware design (RKM, implemented as a combinational circuit, and IKM, implemented as a sequential circuit) in Section 3. The two implementations are then discussed in Sections 4 and 5, respectively. A summary is given in Section 6.

2. Karatsuba Algorithm

The Karatsuba algorithm uses the following method of multiplication. Suppose that $A$ and $B$ are $2n$-bit integers. Each is split into two halves: $A = a_1 2^n + a_0$, $B = b_1 2^n + b_0$. Then the product $P$ of $A$ and $B$ can be expressed as follows:

$P = a_1 b_1 2^{2n} + (a_1 b_0 + a_0 b_1) 2^n + a_0 b_0$    (1)

Considering that $a_1 b_0 + a_0 b_1 = (a_1 + a_0)(b_1 + b_0) - (a_1 b_1 + a_0 b_0)$, Eq. (1) can be rewritten as follows:

$P = a_1 b_1 2^{2n} + \{(a_1 + a_0)(b_1 + b_0) - (a_1 b_1 + a_0 b_0)\} 2^n + a_0 b_0$    (2)

Comparing Eqs. (1) and (2), one can see that the number of $n$-bit multiplications required to obtain the product of $2n$-bit integers is reduced from 4 to 3. Moreover, by recursively applying the Karatsuba algorithm to the multiplications in Eq. (2), the computational complexity is reduced to $O(n^{1.58})$ ($\log_2 3 \approx 1.58$). In software implementations, this recursion is repeated until the numbers fit in one machine word.

3. Design Choices

There are two design choices for a hardware implementation of the Karatsuba algorithm: one based on a combinational circuit that instantiates all required multipliers, as shown in Fig. 1(a), and the other based on a sequential circuit that repeatedly uses one set of arithmetic units (a basic multiplier and adders), as shown in Fig. 1(b). The combinational configuration is called an RKM (Recursive Karatsuba Multiplier), and the sequential configuration is called an IKM (Iterative Karatsuba Multiplier).

Fig. 1. Configurations of Karatsuba multiplier. (a) Recursive Karatsuba multiplier (RKM). (b) Iterative Karatsuba multiplier (IKM).

By applying the Karatsuba algorithm, RKM operates with fewer multipliers. Compared to the well-known Wallace Tree Multiplier (WTM) [12], the order of the chip area can be reduced from $O(n^2)$ to $O(n^{1.58})$. On the other hand, the order of the delay time is $O(\log n)$ for both RKM and WTM. Thus, a theoretical comparison in terms of computational complexity suggests that RKM is better than WTM, offering a smaller chip area at the same order of delay. However, when comparing the performance and area cost of actual implementations, one cannot neglect the influence of constant factors and low-order terms. Therefore, the evaluation must involve not only computational complexity but also experimental results. There is an example of the implementation and evaluation of a 32-bit integer Karatsuba multiplier [11]. However, no research has been done on Karatsuba multipliers for hundreds to thousands of bits, and further evaluation of Karatsuba multipliers based on combinational configurations is required.

On the other hand, combinational implementations of the Karatsuba multiplier are limited by the chip area, and the sequential IKM may prove satisfactory for certain applications. When multidigit multiplication is configured sequentially, a fixed number of short bit-length multipliers are used repeatedly; in this case, the chip area is a constant depending on the implementation. With a fixed number of arithmetic units, the computing time is $O(n^2)$ when the Karatsuba algorithm is not applied. When the Karatsuba algorithm is used, the number of repetitive multiplications is reduced, and the order of the computing time is lowered to $O(n^{1.58})$. This results in a considerable acceleration of the actual calculations, and long bit-length operations that are impractical for $O(n^2)$ multipliers may become possible. In IKM, the number of multiplications is reduced while the number of additions increases; therefore, the IKM configuration loses its advantage when the time or area cost of addition cannot be ignored compared to multiplication. However, the addition cost is normally very low compared to multiplication.

The cost and performance of IKM vary with the bit-length of the basic multipliers and adders. Therefore, a trade-off between cost and performance can be found by varying the bit-length. In addition, one must examine which kind of basic multiplier, RKM or WTM, should be employed. There are no examples of such an examination because the Karatsuba algorithm has not previously been implemented for integers using sequential circuits such as IKM.
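The reduction from four to three multiplications per level in Eq. (2), and its effect on the number of basic multiplications, can be illustrated with a short software model. This is a minimal sketch, not the authors' hardware: the names `karatsuba`, `basic_multiplications`, `base_bits`, and `counter` are introduced here for illustration, and Python's built-in `*` stands in for the basic multiplier.

```python
def karatsuba(a, b, bits, base_bits, counter):
    """Return a * b for non-negative integers, applying Eq. (2) recursively.

    bits      -- nominal operand width at this level (a power of two here)
    base_bits -- nominal width handled by the basic multiplier
    counter   -- one-element list; counter[0] counts basic multiplications
    """
    if bits <= base_bits:
        counter[0] += 1              # one use of the basic multiplier
        return a * b
    half = bits // 2
    mask = (1 << half) - 1
    a1, a0 = a >> half, a & mask     # A = a1 * 2^half + a0
    b1, b0 = b >> half, b & mask     # B = b1 * 2^half + b0
    hi = karatsuba(a1, b1, half, base_bits, counter)             # a1 * b1
    lo = karatsuba(a0, b0, half, base_bits, counter)             # a0 * b0
    mid = karatsuba(a1 + a0, b1 + b0, half, base_bits, counter)  # (a1+a0)(b1+b0)
    # Eq. (2): the middle coefficient is mid - (hi + lo).
    return (hi << (2 * half)) + ((mid - hi - lo) << half) + lo


def basic_multiplications(bits, base_bits):
    """Return (Karatsuba count, count without Karatsuba) of basic multiplications."""
    levels = 0
    while bits > base_bits:
        bits //= 2
        levels += 1
    return 3 ** levels, 4 ** levels


x = 0xFEDCBA9876543210
y = 0x0123456789ABCDEF
calls = [0]
product = karatsuba(x, y, 64, 16, calls)
```

Note that the operand sums `a1 + a0` and `b1 + b0` can be one bit wider than the nominal width, so the basic multiplier in this model occasionally receives operands slightly wider than `base_bits`; the arithmetic stays exact because the low half is always masked to `half` bits.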
Based on the above considerations, we performed VLSI design for both RKM and IKM, and compared their cost and performance for various numbers of algorithm applications and bit-lengths.

4. RKM

In this section, we explain the RKM design. First, we implement a 32-bit RKM with one-time application of the Karatsuba algorithm, and compare our results to those given in a previous report [11]. Then we consider a more general case of RKM.

4.1 32-Bit RKM

4.1.1 Design

The configuration of the RKM designed according to Eq. (2) is shown in Fig. 2. Here WTM multipliers are employed inside the RKM. Since the algorithm is applied only once, the WTM bit-length is set to half the input width, that is, to 16 bits. In addition, a CSA (Carry Save Adder) is used to accumulate the WTM outputs, while a CPA (Carry Propagation Adder) is employed in the final stage to output the product. This configuration is called Carry-Save RKM. For comparison, we also designed an RKM in which all adders are CPAs, as shown in Fig. 3. This configuration is called Binary RKM. In the implementations of RKM, we used IP (Intellectual Property) blocks by Synopsys, specifically, DW02_mult for WTM, and DW01_add (DW01_sub) for CPA.

Fig. 2. Configuration of Carry-Save RKM.

Fig. 3. Configuration of Binary RKM.

4.1.2 Implementation results and discussion

Logic synthesis of the two designed RKMs was performed using Design Compiler Ver. W SP2 by Synopsys. The results thus obtained were then utilized to estimate the critical path delay and area cost. In the logic synthesis, we employed the cell library for the 0.18 µm process developed by VDEC (VLSI Design and Education Center) [13] according to Hitachi specifications. The synthesis constraints were set so as to minimize the area cost. The synthesis results are presented in Table 1. In addition, for comparison with the results reported in Ref. 11, we also performed logic synthesis of the Carry-Save RKM using a cell library for the 0.35 µm process based on the ROHM specifications. The synthesis constraints were set so as to minimize the critical path delay. Comparison results are presented in Table 2.

Table 1. Synthesis results of 32-bit RKMs using Hitachi 0.18 µm technology with area-preferred constraint

Table 2. Synthesis results of 32-bit RKMs using ROHM 0.35 µm technology with delay-preferred constraint

As is evident from Table 1, the critical path delay can be reduced by about 22% when CSA is used as the adder in the nonfinal stages. However, the area cost increases by 6%, because the carry-save implementation requires many full adders. For an RKM intended for area reduction, this configuration may not appear adequate; however, the considerable reduction of the delay suggests a good cost/performance ratio. Thus, in this study, we adopted CSA for the nonfinal adders.

In addition, as indicated by Table 2, the Carry-Save RKM configured in this study shows a critical path delay and an area cost that are 1.16 times and 1.28 times as great, respectively, as those of the RKM reported in Ref. 11. The reason is that in Ref. 11 the CPA at the final stage is optimized manually in steps of 2 or 3 bits, whereas logic synthesis using IP is employed in this study, which does not assure sufficient optimization.

4.2 General form of RKM

4.2.1 Design

Let the multiplication bit-length be $2n = 2^k$ and the bit-length of the WTM components of the RKM be $2^l$; the number of recursions is then $k - l$. Here $k$ and $l$ ($0 < l < k$) are integers. The 32-bit RKM considered in Section 4.1 can be interpreted as a particular case of such an RKM with $k = 5$, $l = 4$. When designing a general form of RKM, the following three types of circuit components are involved.

(1) Components that transform the carry-save (CS) product into binary (B) form using a CPA so as to obtain the final product
(2) Components that do not require a CPA
(3) Components with a $2^l$-bit WTM

The above components correspond to the topmost, intermediate, and undermost layers in the RKM module configuration, and are denoted by T, I, and U, respectively. In addition, components I and U include those with CS input ($I_{cs}$, $U_{cs}$) and those with B input ($I_b$, $U_b$). In particular, the U components with CS input values ($U_{cs}$) require a CPA in order to transform the CS input values into B form. Thus, five types of modules are needed, as shown in Fig. 4. In the diagram, the KaratsubaCS and KaratsubaB modules are replaced depending on the number of recursions, as shown in Table 3. For example, in the case of one-time recursion, KaratsubaCS and KaratsubaB in the T module are replaced, respectively, by $U_{cs}$ and $U_b$. As a result, the RKM shown in Fig. 2 is obtained. When the number of recursions is 2 or more, KaratsubaCS and KaratsubaB in the T module are replaced, respectively, by $I_{cs}$ and $I_b$, and KaratsubaCS and KaratsubaB in $I_{cs}$ and $I_b$ are replaced, respectively, by $U_{cs}$ and $U_b$ at the final recursion. The shaded V module in Fig. 4 performs the carry-save addition according to Eq. (2). As indicated by Fig. 4, the critical path delay and area cost of this RKM can be estimated by Eqs. (3) to (9) and (10) to (14), respectively:

[Eqs. (3) to (10)]

[Eqs. (11) to (14)]

Here $D_T(k)$ and $A_T(k)$ represent the critical path delay and the area cost of the entire $2^k$-bit RKM. In addition, $D_{I\{cs,b\}}(j)$ and $A_{I\{cs,b\}}(j)$ denote the critical path delay and area cost of the I component, and $D_{U\{cs,b\}}(l)$ and $A_{U\{cs,b\}}(l)$ denote the critical path delay and area cost of the U component; $l < j < k$. $D_{T_q}(k)$ and $D_{T_r}(k)$ are the delay times for paths q and r shown in Fig. 4(a); $D_{CSA}(i)$ and $D_{CPA}(k)$ are the delay times of the $2^i$-bit CSA and $2^k$-bit CPA, respectively. Similarly, $A_{CSA}(i)$ and $A_{CPA}(k)$ are the area costs of the $2^i$-bit CSA and $2^k$-bit CPA, respectively; here $i$ is a positive integer. $D_W(l)$ and $A_W(l)$ are the critical path delay and area cost of the $2^l$-bit WTM. When $j - 1 = l$, that is, in the final recursion, the U component is employed, and $D_{I_{cs}}(l) = D_{U_{cs}}(l)$, $A_{I_{cs}}(l) = A_{U_{cs}}(l)$.

Fig. 4. Components of Karatsuba multiplier for large bit-lengths: topmost component (T), intermediate component (I), and undermost components (U). Suffixes cs and b denote components with inputs in carry-save and binary form, respectively. (a) T, (b) $I_{cs}$, (c) $I_b$, (d) $U_{cs}$, (e) $U_b$.

4.2.2 Implementation results and discussion

We estimate the critical path delay and area cost of an RKM with a bit-length of $2^5$ to $2^9$ bits (32 to 512 bits) using Eqs. (3) to (14). This estimation requires the delay and area cost data for the basic multipliers, the CPA, and the CSA. Here we use a 16-bit WTM ($l = 4$) as the basic multiplier. The delay and area cost for the 16-bit WTM measured by logic synthesis (under the same conditions as in Section 4.1.2) are, respectively, $D_W(2^4) = 4.15$ ns and $A_W(2^4) = $ [...] mm$^2$. The delay and area cost for the CPA were found from the results of logic synthesis of the adder from the Synopsys library (DW01_add); the adder configuration was CLA (Carry Lookahead Adder). The data thus obtained are presented in Table 4.
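For the range of bit-lengths considered here, the recursion depth $k - l$ and the number of basic multipliers at the undermost layer can be tabulated directly. The following is a small sketch under the assumption that every application of Eq. (2) instantiates three sub-multipliers, so that a $2^k$-bit RKM contains $3^{k-l}$ nominally $2^l$-bit WTMs; the function name `rkm_structure` is introduced here for illustration and does not appear in the paper.

```python
def rkm_structure(k, l):
    """Return (recursion depth, number of basic WTMs) for a 2**k-bit RKM
    built from nominally 2**l-bit WTMs, assuming three sub-multipliers
    per application of Eq. (2)."""
    if not 0 < l < k:
        raise ValueError("requires 0 < l < k")
    depth = k - l
    return depth, 3 ** depth


# Range estimated in the text: 2^5 to 2^9 bits with 16-bit WTMs (l = 4).
for k in range(5, 10):
    depth, wtms = rkm_structure(k, 4)
    print(f"{2 ** k:4d}-bit RKM: {depth} recursion(s), {wtms} basic WTMs")
```

The count covers only the multipliers; the adders (CSA and CPA) that the estimation also accounts for are not modelled in this sketch.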
For the CSA, the delay and area cost were found similarly from the synthesis results; for 16 bits, the values are 0.27 ns and [...] mm$^2$, respectively. The critical path delay of the CSA does not change as the bit-length increases, and therefore we used the value for a 16-bit CSA regardless of the actual bit-length. On the other hand, we assume that the area cost increases with the bit-length.

Table 3. Modules to be replaced in recursion

Table 4. Critical path delay and area cost of CPA (DW01_add)

The critical path delay and area cost of the RKM thus estimated are given in Table 5(a). For comparison, Table 5(b) gives the synthesis results for the multiplier (DW02_mult) from the Synopsys library. The respective graphs are shown in Figs. 5(a) and 5(b). In the diagrams, the x-axis represents the base-2 logarithm of the bit-length, and the y-axis represents the base-2 logarithm of the area cost and of the critical path delay, respectively.

Table 5. Critical path delay and area cost of RKM (a) and WTM (b)

In order to compare the estimates with actual values, we implemented a $2^6$-bit RKM and measured the critical path delay and area cost by logic synthesis. The values thus obtained were, respectively, [...] ns and [...] mm$^2$. The estimated values were [...] ns and [...] mm$^2$ [Table 5(a)]. The errors can be regarded as acceptable considering that the measured values were obtained by automatic synthesis.

Now consider the results obtained for the area cost. The area cost plot for RKM in Fig. 5(a), approximated by a straight line, has a slope of about [...]. The difference between this slope and the theoretical value of $O(n^{1.58})$ for the Karatsuba algorithm can be attributed to the additional area cost incurred by addition. On the other hand, the approximation for the WTM results has a slope of about 1.90, which is close to the theoretical value of $O(n^2)$. In addition, as indicated by Fig. 5(a), RKM is capable of achieving a lower area cost than WTM at bit-lengths above $2^9$. At this crossover point, the area is about 30 mm$^2$. For the critical path delay, Fig. 5(b) shows $O(\log n)$ behavior for both multipliers, with the RKM having the larger slope.

Fig. 5. Comparison between RKM and WTM in terms of area cost (a) and critical path delay (b).

As may be concluded from the above, WTM should be chosen when calculation speed is the first priority. On the other hand, RKM proves helpful in area-preferred design at bit-lengths above $2^9$. At such bit-lengths, however, both RKM and WTM have areas of about 30 mm$^2$, and in terms of cost/performance ratio, WTM may be regarded as the more practicable design. In IKM design, too, better performance can be achieved by using WTM as the basic multiplier.

5. IKM

In this section, we describe the design of IKM based on the repetitive use of basic multipliers and adders. First we present in Section 5.1 an example of IKM design for two-time application of the Karatsuba algorithm. Then in Section 5.2, we describe implementation results obtained for various bit-lengths and numbers of algorithm applications.

The IKM design is based on the approach to Karatsuba algorithm implementations over Galois fields described in Ref. 4. However, calculations over Galois fields do not involve carry operations, so that addition becomes logical XOR and multiplication becomes logical AND. Hence, the circuit configuration and scale are different from the case of integer calculations.

5.1 Design method

Here we design a $4n$-bit IKM using two-time application of the Karatsuba algorithm, assuming an $n$-bit basic multiplier. First, the $4n$-bit multiplier $A$ and multiplicand $B$ are both split into four $n$-bit parts as follows:

$A = a_3 2^{3n} + a_2 2^{2n} + a_1 2^n + a_0$    (15)

$B = b_3 2^{3n} + b_2 2^{2n} + b_1 2^n + b_0$    (16)

Then the product $P$ of $A$ and $B$ can be written as

$P = p_7 2^{7n} + p_6 2^{6n} + \cdots + p_1 2^n + p_0$    (17)

Writing $A_1 = a_3 2^n + a_2$, $A_0 = a_1 2^n + a_0$, $B_1 = b_3 2^n + b_2$, $B_0 = b_1 2^n + b_0$, the product $P$ can be expressed as follows:

$P = A_1 B_1 2^{4n} + (A_{10} B_{10} - A_1 B_1 - A_0 B_0) 2^{2n} + A_0 B_0$    (18)

Here $A_{10} = A_1 + A_0$, $B_{10} = B_1 + B_0$. Below we use such subscript pairs to represent the addition of two variables. The terms $A_1 B_1$, $A_0 B_0$, and $A_{10} B_{10}$ in Eq. (18) can be expanded as follows:

$A_1 B_1 = a_3 b_3 2^{2n} + (a_{32} b_{32} - a_3 b_3 - a_2 b_2) 2^n + a_2 b_2$    (19)

$A_0 B_0 = a_1 b_1 2^{2n} + (a_{10} b_{10} - a_1 b_1 - a_0 b_0) 2^n + a_0 b_0$    (20)

$A_{10} B_{10} = a_{31} b_{31} 2^{2n} + (a_{3210} b_{3210} - a_{31} b_{31} - a_{20} b_{20}) 2^n + a_{20} b_{20}$    (21)

The following final expression is obtained by substituting Eqs. (19), (20), (21) into Eq. (18):

$P = a_3 b_3 2^{6n} + (a_{32} b_{32} - a_3 b_3 - a_2 b_2) 2^{5n} + (a_{31} b_{31} + a_2 b_2 - a_3 b_3 - a_1 b_1) 2^{4n} + \{(a_{3210} b_{3210} - a_{31} b_{31} - a_{20} b_{20}) - (a_{32} b_{32} - a_3 b_3 - a_2 b_2) - (a_{10} b_{10} - a_1 b_1 - a_0 b_0)\} 2^{3n} + (a_{20} b_{20} + a_1 b_1 - a_2 b_2 - a_0 b_0) 2^{2n} + (a_{10} b_{10} - a_1 b_1 - a_0 b_0) 2^n + a_0 b_0$    (22)

The coefficients $p_7, \ldots, p_0$ in Eq. (17) can be obtained by accumulation of the 9 components $a_0 b_0, \ldots, a_3 b_3, a_{10} b_{10}, \ldots, a_{32} b_{32}$ that form the coefficients in Eq. (22). Below we denote these terms by pp0, ..., pp8, and call them partial products. Carry occurs when these partial products are added. Therefore, carry processing from $p_0$ toward $p_7$ is required after addition to obtain the final $p_7, \ldots, p_0$. Here the $p_7, \ldots, p_0$ before carry processing are denoted by $p_7^g, \ldots, p_0^g$.

The relations between $p_7^g, \ldots, p_0^g$ and the partial products are illustrated in Table 6. In the table, L and H respectively represent the lower $n$ bits and the remaining higher bits of every partial product. The coefficients $p_7^g, \ldots, p_0^g$ before carry of the product $P$ in Eq. (17) are found by accumulation of the partial products according to the table. If the coefficients $p_7^g, \ldots, p_0^g$ exceed $n$ bits after accumulation, the excess parts must be added to the higher coefficients (carry operation). Thus, $p_7, \ldots, p_0$ are eventually obtained. In the case of more than two applications of the Karatsuba algorithm, tables similar to Table 6 can be derived. For example, for three-time recursion, the number of partial products will be 3 times as great as in the case of two-time recursion, and 16 coefficients $p_i$ are involved; therefore, the table size will be [...].

Table 6. Relations between coefficients $p_7^g, \ldots, p_0^g$ and partial products (rows L and H represent the lower $n$-bit part and the remaining higher part of the partial products, respectively)

5.1.1 IKM configuration

Here we configure the IKM based on the approach explained above. As shown in Fig. 6, the IKM is composed of three units, PPG (Partial Product Generator), ACC (Accumulator), and CP (Carry Propagator), and a control module (CTRL). Every module is explained in detail below.

Fig. 6. Configuration of IKM.

5.1.2 PPG design

The PPG is composed of adders, multipliers, and a buffer (PPG Buffer) to generate partial products, as shown in Fig. 7. In this configuration, one WTM is employed as the multiplier; several multipliers may be used in more complex and higher-performance IKM designs. The WTM has 6 pipeline stages so that its stage delay is balanced with that of the adders included in the IKM. Owing to this pipelining, several addition operations can be performed in parallel during a multiplication cycle. That is, the addition time in the PPG can be hidden by the multiplication time. This pipelined WTM was implemented with a multiplier (DW02_mult_6_stage) from the Synopsys library. As regards adders, two units are used in this design, considering that the cost of an adder is much lower than that of the multiplier. Using two adders makes it possible to keep the multiplier always busy. The adders, too, were taken from the Synopsys library (DW01_add).

The PPG buffer stores all the data involved in the operations. A dedicated I/O port was provided for each of the arithmetic units and input terminals. In addition, the buffer was designed for synchronous read and asynchronous write. The buffer capacity depends on the bit-length and number of the partial products, and on the number of arithmetic units. With every additional recursion, the number of partial products increases by a factor of 3, and the bit-length increases by 1. For example, when two-time recursion is performed at the basic bit-length of 16, ten 18-bit entries are required for the PPG buffer.

5.1.3 ACC design

The ACC is composed of adders/subtractors and a buffer (ACC Buffer) to accumulate the partial products generated by the PPG, as shown in Fig. 8.
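The data flow of Section 5.1 for two-time recursion, from the nine partial products through accumulation to carry processing, can be checked with a small software model. This is a sketch under stated assumptions rather than a description of the hardware: it accumulates whole partial products instead of the separate L and H halves of Table 6, it ignores the operation schedule, and the names `split4`, `ikm_two_level`, and `assemble` are introduced here for illustration.

```python
def split4(x, n):
    """Split x into four n-bit parts [x0, x1, x2, x3], least significant first."""
    mask = (1 << n) - 1
    return [(x >> (i * n)) & mask for i in range(4)]


def ikm_two_level(A, B, n):
    """Multiply two 4n-bit integers using nine basic multiplications.

    Returns the eight n-bit coefficients [p0, ..., p7] of Eq. (17).
    """
    a0, a1, a2, a3 = split4(A, n)
    b0, b1, b2, b3 = split4(B, n)

    # Operand sums (role of the PPG adders); subscripts name the parts added.
    a10, a32, a20, a31 = a1 + a0, a3 + a2, a2 + a0, a3 + a1
    b10, b32, b20, b31 = b1 + b0, b3 + b2, b2 + b0, b3 + b1
    a3210, b3210 = a31 + a20, b31 + b20
    # Two recursions widen the operands by at most two bits.
    assert max(a3210, b3210) < 1 << (n + 2)

    # The nine partial products (the only multiplications performed).
    p00, p11, p22, p33 = a0 * b0, a1 * b1, a2 * b2, a3 * b3
    p10, p32, p20, p31 = a10 * b10, a32 * b32, a20 * b20, a31 * b31
    pall = a3210 * b3210

    # Accumulation (role of the ACC): coefficients of 2^(k*n) before carry.
    m0 = p10 - p11 - p00
    m1 = p32 - p33 - p22
    m2 = pall - p31 - p20
    coeff = [p00, m0, p20 + p11 - p22 - p00, m2 - m1 - m0,
             p31 + p22 - p33 - p11, m1, p33, 0]

    # Carry processing (role of the CP): keep n bits, pass the rest upward.
    mask = (1 << n) - 1
    p, carry = [], 0            # the carry into p0 is zero
    for c in coeff:
        t = c + carry
        p.append(t & mask)
        carry = t >> n
    assert carry == 0           # the product fits in 8n bits
    return p


def assemble(p, n):
    """Recombine n-bit coefficients into a single integer."""
    return sum(pk << (k * n) for k, pk in enumerate(p))


A = 0xFEDCBA9876543210
B = 0xFFFFFFFFFFFFFFFF
P = assemble(ikm_two_level(A, B, 16), 16)
```

The internal assertion on `a3210` and `b3210` reflects the statement in the PPG description that the operand bit-length grows by one bit per recursion (18 bits for a basic bit-length of 16 and two recursions).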
We used four adders/subtractors considering the number of operations that can be performed simultaneously for one partial product. This corresponds to the maximum number of operations per row in Table 6. More adders/subtractors can be used to configure the ACC, but this complicates control. We used adders/subtractors (DW01_addsub) from the same library.

As in the case of the PPG buffer, the ACC buffer is provided with an I/O port for every arithmetic unit and input terminal. The buffer capacity depends on the bit-length and number of the partial products, and on the number of arithmetic units. Since the ACC involves signed arithmetic, a sign bit must be assigned to every value. In addition, the internal bit-length of the ACC increases by 2 ($= \lceil \log_2 3 \rceil$) with every additional recursion because of the carry addition. For example, when two-time recursion is performed at the basic bit-length of 16, eighteen 21-bit entries are required for the ACC buffer.

Fig. 7. Configuration of PPG.

Fig. 8. Configuration of ACC.

5.1.4 CP design

The CP is composed of adders and a register (Carry Register) to receive the coefficients accumulated by the ACC [$p_0$ to $p_7$ in Eq. (17) in the case of two-time recursion], and to perform ripple carry, as shown in Fig. 9. The CP module divides each input value into the lower $n$ bits and the remaining bits. The higher bits are stored in the Carry Register, to be added to the next higher-order coefficient in the next cycle. No carry can enter the lowest coefficient $p_0$, and therefore 0 is added in this case.

Fig. 9. Configuration of CP.

5.1.5 CTRL design

The CTRL module sends appropriate control signals to the PPG, ACC, and CP modules. The control signals include read/write addresses and Write Enable signals for the PPG buffer and ACC buffer, an addition/subtraction selection signal for the ACC, and a signal to add 0 to the lowest coefficient in the CP. Below we give an example of the CTRL module for the considered configuration with two-time recursion.

Under the control of the CTRL module, the PPG performs the operations scheduled as shown in Table 7. The table shows the input timing and usable output timing for every arithmetic unit (one multiplier and two adders). Similarly, the ACC operation scheduling is given in Table 8. In the table, the suffixes L and H attached to the partial products represent the lowest $n$ bits and the remaining bits, respectively. For the CP, the multiplexer is controlled so as to select zero carry addition when the lowest coefficient $p_0$ is input.

Table 7. PPG operation scheduling

Table 8. ACC operation scheduling

The ACC and CP operations can start prior to completion of the preceding operations (PPG and ACC, respectively). Specifically, the ACC can start to receive the first partial product pp0 as soon as the PPG comes to its eighth step. In addition, the ACC operation ends by finding the coefficient $p_3$. This coefficient is output at the 17th step, and the coefficients $p_0$ to $p_2$ can be input to the CP before that. Therefore, the product $P$ can be obtained in $(8 - 1) + (17 - 1) + 4 = 27$ steps after the lowest terms $a_0$ and $b_0$ of multiplier $A$ and multiplicand $B$ are input.

5.2 Implementation results and discussion

We designed the IKM as explained in Section 5.1 with 1, 2, and 3 recursions (below referred to as R1IKM, R2IKM, and R3IKM, respectively). The multipliers were described in Verilog-HDL, and the logic synthesis was performed while varying the basic bit-length from 4 to 128, with priority given to minimization of the delay time. We used the same library for the 0.18 µm process as in the RKM synthesis.

Table 9 presents evaluations of the critical path delay, the area cost, the multiplication time, and the power consumption obtained from the results of logic synthesis. Here the power consumption was found at a driving voltage of 1.8 V. The first row of each block shows the basic bit-length of the IKM. The fourth row, "Time," shows the multiplication time, found as the delay time multiplied by the necessary number of steps, namely, 14 for R1IKM, 27 for R2IKM, and 79 for R3IKM. The uppermost row, "Bit length," pertains to the multiplier and multiplicand.

For comparison with a software implementation, we measured the processing speed of Karatsuba multiplication using exflib [14] (a package of fast multiple-precision arithmetic routines). The data thus obtained are presented in Fig. 10. In the diagram, the dashed lines have a slope of 1.58, representing the ideal performance of the Karatsuba algorithm.

Fig. 10. Performance comparison between IKMs and software implementation (exflib). The dashed lines with a slope of 1.58 represent the multiplication time $O(n^{1.58})$ of the Karatsuba algorithm.
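The slope of 1.58 on the log-log plot can be traced to the count of basic multiplications. As a short derivation, assuming a basic multiplier of width $n_0$ and writing $M(n)$ for the number of basic multiplications needed for $n$-bit operands (both symbols are introduced here for illustration, and additions are ignored):

```latex
\begin{align*}
M(n_0) &= 1, \qquad M(2m) = 3\,M(m) \\
\Rightarrow\quad M(2^r n_0) &= 3^r = \left(2^r\right)^{\log_2 3}
  = \left(\frac{n}{n_0}\right)^{\log_2 3}, \qquad n = 2^r n_0, \\
\log_2 3 &\approx 1.585 .
\end{align*}
```

Doubling the operand bit-length at a fixed basic bit-length therefore multiplies the number of basic multiplications by 3, which appears as a slope of about 1.58 when both axes are logarithmic. The step counts quoted in this section (14, 27, and 79) also include addition, accumulation, and carry steps, so they need not equal $3^r$ exactly.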
As is evident from the graphs, the measured performance of exflib is close to the ideal value, as is the performance of R1IKM, R2IKM, and R3IKM. Thus, the relative performance of IKM with respect to the software implementation remains nearly constant as the number of recursions is increased. However, the capacity of the PPG and ACC buffers increases with more recursions, resulting in a larger area. In addition, the diagram indicates that the performance increases with the basic bit-length. With basic bit-lengths of 32, 64, and 128, IKM outperforms the software by factors of about 5, 10, and 30, respectively. As regards the area cost, even the largest value of 10.9 mm$^2$, shown by R3IKM with a basic bit-length of 128 in Table 9, is sufficiently practicable.

Table 9. Evaluation of IKMs in terms of time, area, and power consumption

Fig. 11. Area cost of IKMs for multiplication of x-bit integers (x/2, x/4, and x/8-bit WTMs denote basic multipliers of R1IKM, R2IKM, and R3IKM, respectively).

As is evident from Table 9, the energy consumption of the 1024-bit R3IKM with a basic bit-length of 128 is 663 nJ (= 1874 mW × [...] ns). On the other hand, in the software implementation, the power consumption of a Pentium [...] GHz CPU is about 63 W [15], and the computing time for 856-bit (256 decimal digits) multiplication is about 6366 ns. Therefore, the energy consumption is 401 µJ (= 63 W × 6366 ns), which is about 600 times that of the hardware IKM.

Figure 11 shows how the area cost of R1IKM, R2IKM, and R3IKM varies with the basic bit-length. In the diagram, the x-axis and y-axis represent the multiplication bit-length and area cost, respectively. The IKM area consists primarily of the WTM, the PPG buffer, the ACC buffer, and the adders. Thus, we also show the area cost of the respective WTMs on the same graph.* When the basic bit-length is increased at a fixed number of recursions, the number of entries in every buffer does not change: only the entry length increases. Therefore, the area cost of every buffer is $O(n)$. That is, the area of every IKM approaches the area of its internal WTM as the bit-length is increased.

6. Conclusions

We examined the performance and area cost of two hardware designs of the Karatsuba algorithm, namely, RKM with a combinational configuration and IKM with a sequential configuration. We found that RKM can achieve a lower area cost than ordinary WTM at $2^9$ or more bits, where the area cost is about 30 mm$^2$. On the other hand, RKM has larger delays than WTM at any bit-length. Therefore, WTM offers a better cost/performance ratio than RKM. In addition, we showed that IKM with basic bit-lengths of 32, 64, and 128 outperforms software by factors of about 5, 10, and 30, respectively. The greatest area cost, that of R3IKM with a basic bit-length of 128, is 10.9 mm$^2$, which is quite practicable. The energy consumption of a 1024-bit R3IKM with a basic bit-length of 128 is just 1/600 that of a general-purpose processor.
Hence, we may conclude that WTM should be used rather than RKM for multiplication with relatively small bit-lengths, and that IKM using WTM as the basic multiplier is a proper solution for large bit-lengths. This study provides guidelines for optimal IKM design according to the application parameters.

* Based on Fig. 11, WTM area can be approximated by O(n^1.70). The difference from O(n^2) can be explained by the priority given to speed in the logic synthesis. In the case of area-preferred synthesis, O(n^1.90) is obtained.

Acknowledgments

The present study was supported by Synopsys, Inc. via VDEC, University of Tokyo. In addition, we were subsidized by a JSPS Grant-in-Aid [Fundamental Research (C) (2)]. We express our gratitude to all the persons and institutions concerned.

REFERENCES

1. Fujiwara H. High-accurate numerical method for integral equations of the first kind under multiple-precision arithmetic. Theor Appl Mech Japan 2003;52.
2. Sprott JC. Chaos and time-series analysis. Oxford University Press.
3. Agrawal M, Kayal N, Saxena N. PRIMES is in P.
4. Dyka Z, Langendoerfer P. Area efficient hardware implementation of elliptic curve cryptography by iteratively applying Karatsuba's method. Proc of the Design, Automation and Test in Europe Conference and Exhibition, Vol. 3, p 70–75.
5. Ling BW-K, Ho CY-F, Tam PK-S. Chaotic filter bank for computer cryptography. Chaos, Solitons & Fractals, in press, available online 5 June.
6. Chien T-I, Liao T-L. Design of secure digital communication systems using chaotic modulation, cryptography and chaotic synchronization. Chaos, Solitons and Fractals 2005;24.
7. Karatsuba A, Ofman Y. Multiplication of multidigit numbers on automata. Sov Phys Dokl 1963;7.
8. Knuth DE. The art of computer programming 2nd edition: Seminumerical algorithms, Vol. 2. Addison-Wesley.
9. Yazaki S, Abe K. VLSI design of FFT multi-digit multiplier. Trans Japan Soc Ind Appl Math 2006;15 (in Japanese).
10. Grabbe C, Bednara M, Teich J, von zur Gathen J, Shokrollahi J. FPGA designs of parallel high performance GF(2^233) multiplier. Proc IEEE International Symposium on Circuits and Systems.
11. Shibaoka M, Takagi N, Takagi K. Reduced area parallel multiplier based on Karatsuba algorithm. IEICE General Conference, Vol. A-3, p 66 (in Japanese).
12. Wallace CS. A suggestion for a fast multiplier. IEEE Trans Electronic Computers 1964;13.
13. VLSI Design and Education Center Homepage.

14. exflib Extended Precision Float-Point Arithmetic Library, http://www-an.acs.i.kyoto-u.ac.jp/~fujiwara/exflib/exflibindex.html
15. Intel(R) Pentium(R) 4 processor specifications, http://www.

AUTHORS (from left to right)

Syunji Yazaki (nonmember) graduated from Tokyo University of Technology in 2002, completed the first and second stages of the doctoral program at the University of Electro-Communications in 2004 and 2007, and joined the faculty of Tokyo University of Technology as a research associate. His research interests include VLSI system design, multidigit multiplication, and social support systems for elderly people. He holds a D.Eng. degree, and is a member of IEICE, JSIAM, and PARTHENON Research Group.

Koki Abe (member) graduated from Yokohama National University in 1969 and completed the M.E. program in He withdrew from the doctoral program at the University of Tokyo in 1974 to become a research associate at the University of Electro-Communications. He was a visiting researcher at Carnegie Mellon University from 1980 to He has been an associate professor at the University of Electro-Communications since His research interests include computer architectures, VLSI system design, and computer networks. He holds a D.Sc. degree, and is a member of IPSJ, IEICE, and IEEE.
