A Novel Distributed Arithmetic Multiplierless Approach for Computing Complex Inner Products


Int'l Conf. Par. and Dist. Proc. Tech. and Appl. | PDPTA'15

Kevin N. Bowlyn, Ph.D. Candidate, Dept. of Electrical & Computer Engineering, Southern Illinois Univ., Carbondale, IL, USA
Nazeih M. Botros, Professor Emeritus, Dept. of Electrical & Computer Engineering, Southern Illinois Univ., Carbondale, IL, USA

Abstract - In this paper we present a new integrated approach for computing complex inner products using the Distributed Arithmetic (DA) technique and the Complex Binary Number System (CBNS). With CBNS, each complex number is represented as one single unit instead of two. Our extended goal is to apply the approach to the design of a Fast Fourier Transform and to realize the design on field-programmable gate arrays (FPGAs). A DA look-up table (LUT) is used to store all linear combinations of the coefficients. Our results show that, for a radix-2 FFT computation, the number of N-point arithmetic calculations would decrease by 75% and the total number of real adders by approximately 67% when compared to the traditional radix-2 FFT computation. The proposed design is multiplierless. The approach is implemented on a 3-tap filter; preliminary analysis shows a power consumption reduction of approximately 55% compared to the MAC approach.

Keywords: Distributed arithmetic (DA), fast Fourier transform (FFT), look-up table (LUT), complex binary number system (CBNS), complex numbers

1 Introduction

Most digital devices in today's technology require Digital Signal Processing (DSP) for calculating the dot product of two vectors. Complex numbers are widely used in DSP applications and are essential to the many computer applications that depend on complex arithmetic [].
Although there is a great need for a better representation of complex numbers as a single unit, especially in the FFT algorithm, modern technology still relies heavily on the divide and conquer approach, in which the real and imaginary parts are treated as two separate entities [].

1.1 Fast Fourier Transform

The traditional FFT algorithm, which is built on vector dot products, uses the divide and conquer approach: it breaks the Discrete Fourier Transform (DFT) computation into two half-length transforms over the even and odd indexes [] [3]. Splitting the DFT in this way is useful because the output of each stage of the butterfly structure becomes the input of the next stage []. This algorithm is fast and efficient, reducing the number of multiplications from the order of $N^2$ to $(N/2)\log_2 N$, but the radix-2 DIT algorithm still requires a large number of multipliers, which are quite costly [4].

1.2 Distributed Arithmetic

The DA approach uses less hardware and computation and requires no multiplier hardware at all []. It replaces the multiplier with look-up tables (LUTs). The LUT memory, however, may grow exponentially in size, from $2^K$ to $2^{K+1}$ words, which is one of the major limiting factors of this technique. For example, a DA-based filter with $K = 256$ taps would require a memory size of $2^{257}$ words, which is significant. However, there are several techniques, such as ROM decomposition, modifying the adder to an adder/subtractor, and offset binary coding, by which the size of the ROM can be greatly reduced [5]. Overall, the DA approach has been designed and tested as an efficient tool for computing the vector dot product without any dedicated multipliers.
It also consumes far less power and time in computing the inner product of two vectors [6] [7].

1.3 Complex Binary Number System

The CBNS algorithm requires far fewer arithmetic computations on complex numbers because each complex number is treated as a single entity instead of two. Unlike the traditional way of calculating with complex numbers, this method does not rely on the divide and conquer approach. The idea was first introduced by Penney in 1964 [8] [9], who developed a negative 4 base conversion for the complex number system. Later, in 1965, he introduced the $(-1 + j)$ complex binary base, in which a complex number is represented as a single entity in order to perform arithmetic operations such as addition, subtraction, and multiplication [] [0]. In 2000, Jamil [] reintroduced an efficient $(-1 + j)$-based algorithm for addition, subtraction, multiplication, and division of complex numbers [] [-6].

1.4 Aims/Objectives

The CBNS algorithm highlights the importance of a single-unit representation for complex numbers, which will be of great benefit to 21st-century technology. Not only does it offer an advantage in computer applications, it is also expected to improve the performance of today's computer designs. Integrating this algorithm with the DA approach results in a 100% reduction in multiplier hardware, as no multipliers are used in computing the FFT structure. This decreases area and cost, because multipliers occupy a large amount of hardware and are fairly costly to implement.

1.5 Paper Structure

The remainder of this paper is organized as follows. Section 2 presents the proposed DA structure and ways to reduce its memory size. The conversion to the complex binary number system is presented in Section 3. The results are presented in Section 4. Work in progress is presented in Section 5 and the conclusion in Section 6.

2 The Distributed Arithmetic (DA)

In DA, multiplications are reordered and mixed such that the arithmetic becomes distributed through the structure rather than being lumped. DA is very effective for calculating inner products without the use of multipliers. It relies on simple operations: adders, shifters, and look-up tables (LUTs), and it facilitates the mapping of these operations onto FPGAs. DA computes the inner products and stores all possible linear combinational sums in a LUT ROM. The two input vectors, one fixed and one variable, are two's-complement binary numbers whose most significant bit, the sign bit, lies to the left of the binary point [7] [8] [9]. To explain the DA technique, consider the sum of products in equation (1),

$$ y = \sum_{k=1}^{K} A_k x_k \tag{1} $$

where $x_k$ is the input data, a two's-complement binary number scaled such that $|x_k| < 1$ (a fixed point number), and $A_k$ are the fixed coefficients.
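Therefore, $x_k$ can be expressed as

$$ x_k = -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \tag{2} $$

where $b_{k0}$ is the sign bit, $b_{kn}$ are bits (0 or 1), $b_{k,N-1}$ is the least significant bit (LSB), and $N$ is the word length of the input variable. Substituting equation (2) into (1) and expressing $y$ in terms of the bits of $x_k$ yields:

$$ y = \sum_{k=1}^{K} A_k \left[ -b_{k0} + \sum_{n=1}^{N-1} b_{kn} 2^{-n} \right] = -\sum_{k=1}^{K} b_{k0} A_k + \sum_{k=1}^{K} A_k \sum_{n=1}^{N-1} b_{kn} 2^{-n} \tag{3} $$

This equation is the conventional form of the inner product. Direct mechanization of this equation defines lumped arithmetic computation [], as it multiplies and adds the partial products of different shifts together before they are summed to give the final result. Expanding equation (3) yields:

$$ y = -\sum_{k=1}^{K} b_{k0} A_k + \sum_{k=1}^{K} \left[ A_k b_{k1} 2^{-1} + A_k b_{k2} 2^{-2} + \cdots + A_k b_{k,N-1} 2^{-(N-1)} \right] $$

$$ \begin{aligned} y = {} & -\left[ b_{10} A_1 + b_{20} A_2 + \cdots + b_{K0} A_K \right] \\ & + \left[ b_{11} A_1 2^{-1} + b_{12} A_1 2^{-2} + \cdots + b_{1,N-1} A_1 2^{-(N-1)} \right] \\ & + \left[ b_{21} A_2 2^{-1} + b_{22} A_2 2^{-2} + \cdots + b_{2,N-1} A_2 2^{-(N-1)} \right] \\ & + \cdots + \left[ b_{K1} A_K 2^{-1} + b_{K2} A_K 2^{-2} + \cdots + b_{K,N-1} A_K 2^{-(N-1)} \right] \end{aligned} \tag{4} $$

Regrouping like terms (terms with the same power of two) gives:

$$ \begin{aligned} y = {} & -\left[ b_{10} A_1 + b_{20} A_2 + \cdots + b_{K0} A_K \right] \\ & + \left[ b_{11} A_1 + b_{21} A_2 + \cdots + b_{K1} A_K \right] 2^{-1} \\ & + \left[ b_{12} A_1 + b_{22} A_2 + \cdots + b_{K2} A_K \right] 2^{-2} \\ & + \cdots + \left[ b_{1,N-1} A_1 + b_{2,N-1} A_2 + \cdots + b_{K,N-1} A_K \right] 2^{-(N-1)} \end{aligned} \tag{5} $$

Consequently, by interchanging the order of summation, a more distributed arithmetic computation is obtained:

$$ y = -\sum_{k=1}^{K} A_k b_{k0} + \sum_{n=1}^{N-1} \left[ \sum_{k=1}^{K} A_k b_{kn} \right] 2^{-n} \tag{6} $$

As a result, in DA the multiplication block is eliminated: the arithmetic is spread throughout the structure rather than performed in a combined form [0] [] []. Equation (6) shows that the partial products of equal shift are added together first, and the result is then summed with the partial products of the next shift.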
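The reordering from the lumped form of equation (1) to the distributed form of equation (6) can be checked numerically. The following is a minimal sketch, not the authors' implementation; the function names (`to_bits`, `lumped`, `distributed`) and the use of exact fractions are assumptions made for illustration.

```python
from fractions import Fraction


def to_bits(x, n_bits):
    """Two's-complement bits b_0 .. b_{N-1} of a fraction x in [-1, 1), sign bit first."""
    scaled = x * 2 ** (n_bits - 1)
    # x must be exactly representable with n_bits (illustrative restriction)
    assert scaled.denominator == 1
    assert -2 ** (n_bits - 1) <= scaled < 2 ** (n_bits - 1)
    word = int(scaled) % (1 << n_bits)  # wrap negative values into two's complement
    return [(word >> (n_bits - 1 - n)) & 1 for n in range(n_bits)]


def lumped(coeffs, xs):
    """Equation (1): multiply each pair, then add."""
    return sum(a * x for a, x in zip(coeffs, xs))


def distributed(coeffs, xs, n_bits):
    """Equation (6): sum each bit plane over k first, then weight it by 2^-n."""
    bits = [to_bits(x, n_bits) for x in xs]
    # sign-bit plane enters with a negative sign
    y = -sum(a * b[0] for a, b in zip(coeffs, bits))
    for n in range(1, n_bits):
        plane = sum(a * b[n] for a, b in zip(coeffs, bits))  # no multiplier needed: b is 0 or 1
        y += plane * Fraction(1, 2 ** n)
    return y


coeffs = [Fraction(3, 4), Fraction(-1, 2), Fraction(1, 8)]
xs = [Fraction(-3, 8), Fraction(5, 8), Fraction(1, 4)]
print(lumped(coeffs, xs), distributed(coeffs, xs, 4))
```

Because each bit $b_{kn}$ is 0 or 1, every bit-plane sum is just a selection of coefficients, which is what makes the table look-up described next possible.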

2.1 ROM-based Construction

ROM-based DA speeds up the multiplication process by pre-computing all possible values and storing them in the LUT ROM. The bits of the input data $\{x_{0k}, x_{1k}, \ldots, x_{ik}\}$ form the ROM address directly, and an accumulator with feedback (a scaling shift accumulator) performs the successive scaling and shifting by powers of two. Because this structure is designed for fixed point numbers, each scaling by a power of two is no more than a bit shift to the right [7] [8] [0]. Consider the square-bracketed term of equation (6), shown below as equation (7):

$$ \sum_{k=1}^{K} A_k b_{kn} \tag{7} $$

Since each $b_{kn}$ has only two possible values, 0 or 1, the sum in equation (7) may take only $2^K$ possible values. These pre-calculated values can therefore be stored in a LUT of $2^K$ words. The word addresses directly access the location of the LUT ROM that contains the pre-computed result for that address [7] [0]. Since the constant coefficients $A_k$ are known and the two's-complement bits $b_{kn}$ are either 0 or 1, each vector dot product is a combination of the constant-coefficient sums stored in the ROM. To accommodate the negative term of the first summation in equation (6), one more address line, called $T_s$, must be added to the LUT. The result is a ROM size of $2^{K+1}$, where $T_s$ is a control timing signal equal to one during the sign-bit time and zero otherwise. This bit is significant because it determines when the final result is complete [7]. Figure 1 shows a 3-tap $2^{K+1}$ DA-based FIR filter implementation.

The size of the ROM, however, can be reduced in three ways: by ROM decomposition, by modifying the adder to an adder/subtractor, and by offset binary coding (OBC).
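The ROM-based scheme above can be modeled in software as follows. This is a hedged sketch under stated assumptions: coefficients and samples are plain integers rather than scaled fractions (so weights are left shifts by $n$ instead of right shifts), bits are processed LSB first, and the names `build_lut_with_ts` and `da_inner_product` are hypothetical. The $T_s$ line is modeled as the top address bit, selecting the negated sums during the sign-bit time.

```python
def build_lut_with_ts(coeffs):
    """ROM of 2^(K+1) words: address = (Ts << K) | (one input bit per tap).

    The lower half (Ts = 0) holds every sum of selected coefficients;
    the upper half (Ts = 1) holds the negated sums used at sign-bit time.
    """
    K = len(coeffs)
    sums = [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << K)]
    return sums + [-s for s in sums]


def da_inner_product(coeffs, samples, n_bits):
    """Bit-serial inner product using only look-ups, shifts, and adds."""
    K = len(coeffs)
    lut = build_lut_with_ts(coeffs)
    mask = (1 << n_bits) - 1
    words = [s & mask for s in samples]  # two's-complement encoding of each sample
    acc = 0
    for n in range(n_bits):  # one clock cycle per bit, LSB first
        addr = 0
        for k, w in enumerate(words):
            addr |= ((w >> n) & 1) << k  # bit n of every input forms the address
        ts = 1 if n == n_bits - 1 else 0  # Ts is high only at sign-bit time
        acc += lut[(ts << K) | addr] << n
    return acc


print(da_inner_product([3, -5, 7], [-4, 3, 2], 4))
```

The table has $2^{K+1}$ entries for $K$ taps, which is the exponential growth that the three reduction techniques in the following subsections address.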
2.2 ROM Decomposition

As previously described, the size of the ROM grows rapidly with the number of taps $K$ of the filter. The ROM can, however, be divided into smaller DA-LUT units whose output sums are added to give the final result. In other words, equation (6) can be broken into $m$ smaller $k$-tap DA-based filters [9]. The total memory is then $m \times 2^k$ words, where $m$ is the number of DA-LUT units and $k$ is the number of taps per unit. For example, a 128-tap filter requires a LUT with $2^{128}$ memory entries. This memory can be broken into 32 smaller DA-LUTs with 4 input taps each, reducing the memory from $2^{128}$ to $32 \times 2^4$, which comprises only 512 memory entries [9] [0]. Figure 2 shows the implementation of a ROM-decomposed 4-tap DA-based filter with $m = 2$ and $k = 2$. The output $y[n]$, however, is only available after $N + \lceil \log_2(m) \rceil$ clock cycles. The additional logarithmic term accounts for the adder tree shown in the figure [9].

Figure 2: 4-tap DA-based filter with $m = 2$ and $k = 2$

2.3 Modifying the Distributed Arithmetic to an Adder/Subtractor

Another way to reduce the ROM size is to modify the adder to an adder/subtractor, which halves the memory from a $2^{K+1}$-word to a $2^K$-word ROM. This is accomplished by using $T_s$ as the add/subtract control line rather than as a direct input to the LUT [7] []. Figure 3 shows the modified 4-tap DA-based FIR filter implementation.

Figure 1: 3-tap $2^{K+1}$ DA-based implementation FIR filter
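The memory saving from ROM decomposition can be illustrated with a short calculation. The sketch below is illustrative only: the function names are hypothetical, the split assumes that the unit size divides the tap count evenly, and the 128-tap, 4-input configuration is used as an example.

```python
def lut_words_single(num_taps):
    """Words needed by one monolithic DA-LUT over all taps."""
    return 2 ** num_taps


def lut_words_decomposed(num_taps, taps_per_unit):
    """Words needed by m = num_taps / taps_per_unit smaller DA-LUTs."""
    assert num_taps % taps_per_unit == 0, "sketch assumes an even split"
    m = num_taps // taps_per_unit
    return m * 2 ** taps_per_unit


def bit_plane_sum_decomposed(coeffs, bits, taps_per_unit):
    """One bit-plane sum (equation (7)) assembled from the partial LUT outputs."""
    total = 0
    for start in range(0, len(coeffs), taps_per_unit):
        group = coeffs[start:start + taps_per_unit]
        # small LUT for this group only: 2^taps_per_unit words
        lut = [sum(c for i, c in enumerate(group) if (addr >> i) & 1)
               for addr in range(1 << len(group))]
        addr = sum(b << i for i, b in enumerate(bits[start:start + taps_per_unit]))
        total += lut[addr]  # in hardware these outputs feed the adder tree
    return total


print(lut_words_decomposed(128, 4))                       # words after decomposition
print(bit_plane_sum_decomposed([1, 2, 3, 4], [1, 0, 1, 1], 2))
```

The trade-off is the adder tree that combines the partial outputs, which is the source of the extra $\lceil \log_2(m) \rceil$ latency term mentioned above.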

Figure 3: 4-tap DA-based implementation FIR filter

2.4 Distributed Arithmetic using Offset Binary Coding (OBC)

It is possible to reduce the memory size from $2^K$ words to $2^{K-1}$ words by employing additional logic in the architecture. To do this, the input data are read as an offset binary code over $(-1, 1)$ rather than in the conventional binary code form over $(0, 1)$ [7] []. Consider again the sum of products (SOP) from equation (1),

$$ y = \sum_{k=1}^{K} A_k x_k $$

and recall that $x_k$ is the two's-complement input data. To cast the data into an offset binary code over $(-1, 1)$, $x_k$ can be rewritten as

$$ x_k = \frac{1}{2} \left[ x_k - (-x_k) \right] \tag{8} $$

The negated term $-x_k$ can also be expressed in two's-complement form (one's complement plus 1) using equation (2), where an overbar denotes the complement of a bit:

$$ -x_k = -\bar{b}_{k0} + \sum_{n=1}^{N-1} \bar{b}_{kn} 2^{-n} + 2^{-(N-1)} \tag{9} $$

Substituting equations (2) and (9) into equation (8) and regrouping yields

$$ x_k = \frac{1}{2} \left[ -\left( b_{k0} - \bar{b}_{k0} \right) + \sum_{n=1}^{N-1} \left( b_{kn} - \bar{b}_{kn} \right) 2^{-n} - 2^{-(N-1)} \right] \tag{10} $$

Now the offset binary code $c_{kn}$ can be defined as

$$ c_{kn} = \begin{cases} b_{kn} - \bar{b}_{kn}, & n \neq 0 \\ -\left( b_{kn} - \bar{b}_{kn} \right), & n = 0 \end{cases} \qquad \text{where } c_{kn} \in \{-1, 1\} \tag{11} $$

Equation (10) can then be rewritten as

$$ x_k = \frac{1}{2} \left[ \sum_{n=0}^{N-1} c_{kn} 2^{-n} - 2^{-(N-1)} \right] \tag{12} $$

Substituting $x_k$ from equation (12) into the sum of products of equation (1) yields equation (13):

$$ y = \sum_{k=1}^{K} \frac{A_k}{2} \left[ \sum_{n=0}^{N-1} c_{kn} 2^{-n} - 2^{-(N-1)} \right] \tag{13} $$

Interchanging the order of summation produces equation (14), a more distributed arithmetic computation for the offset binary code implementation:

$$ y = \sum_{n=0}^{N-1} \left[ \sum_{k=1}^{K} \frac{A_k}{2} c_{kn} \right] 2^{-n} - \left[ \sum_{k=1}^{K} \frac{A_k}{2} \right] 2^{-(N-1)} \tag{14} $$

For simplicity, the first inner summation of equation (14) may be written as $Q(b_n)$, as shown in equation (15):

$$ Q(b_n) = \sum_{k=1}^{K} \frac{A_k}{2} c_{kn} \tag{15} $$

The latter summation of equation (14) is a constant held in an initial-condition register, and it may be expressed as $Q(0)$.
$$ Q(0) = -\sum_{k=1}^{K} \frac{A_k}{2} \tag{16} $$

Now equation (14) can be simplified to equation (17):

$$ y = \sum_{n=0}^{N-1} Q(b_n) 2^{-n} + 2^{-(N-1)} Q(0) \tag{17} $$

Table 1 is an example of a $2^{K-1}$ (8-word) DA-based ROM filter where $K = 4$; $b_{kn}$ are the input data memory addresses and $c_{kn}$ is the data set that has been cast as $(-1, 1)$ instead of $(0, 1)$. It can also be seen that the top half of the table, color-coded red, is just a mirror image (inverse symmetry) of the


bottom half of the table, color-coded blue []. Since $T_s$ is not directly fed into the ROM LUT, only the top half of the table is used. Figure 4 shows a 4-tap $2^{K-1}$ DA-based filter implementation.

Table 1: $2^{K-1}$ 8-word DA-based ROM filter where $K = 4$

Figure 4: 4-tap $2^{K-1}$ DA-based implementation filter

3 Conversion of Complex Binary Number System

3.1 (-1 + j)-based CBNS conversion algorithm

A fixed point number in the complex binary base $(-1 + j)$ may be written in the form of equation (19), where $a$ is a complex binary number (0 or 1) scaled such that $a < 1$. This form is similar to the traditional base power series, but instead of the base being 2, it is $(-1 + j)$ []. In order to convert from base 10 to base $(-1 + j)$, we first have to convert the fixed point number ($F$) into a form in which it is expressed in terms of powers of ½, as in equation (20) [] [3], where $r_i$ represents the remainder value and the coefficients $f_i$ are binary digits, either 0 or 1. According to Jamil and Ali, the steps to convert the fraction to CBNS are as follows []:

First step: if $2r_0 - 1 < 0$ then $f_1 = 0$ and set $r_1 = 2r_0$; or if $2r_0 - 1 \geq 0$ then $f_1 = 1$ and set $r_1 = 2r_0 - 1$.

Second step: if $2r_i - 1 < 0$ then $f_{i+1} = 0$ and $r_{i+1} = 2r_i$; or if $2r_i - 1 \geq 0$ then $f_{i+1} = 1$ and $r_{i+1} = 2r_i - 1$.

As in the traditional conversion of fixed point numbers, this procedure continues until the remainder $r_i = 0$, which signifies that the fraction has terminated, or until the machine precision limit has been reached [] [3]. To represent $2^{-i}$ in base $(-1 + j)$, we substitute $2^{-i}$ according to the tables below, where $i = 4s + t$, $s$ is any positive integer, and $0 \leq t \leq 3$ [] []. Table 2 shows the first four values of $i$ and Table 3 shows the overall binary representation for any $2^{-i}$ power.

Table 2: Representations for the First Four $2^{-i}$ Values in Base $(-1 + j)$

Table 3: Representations for all $2^{-i}$ Powers in Base $(-1 + j)$

4 Results

4.1 Power Consumption and Time within the DA approach

To compare the power consumption of the modified adder/subtractor DA technique with that of the traditional MAC approach, a 3-tap DA-based filter was designed and implemented. The DA consumes 0.8 mW compared to 0.6 mW for the MAC approach, which signifies that the DA approach consumes approximately 55% less power than the traditional MAC technique. In terms of time, the maximum delay of the DA approach was also lower than that of the MAC approach, by approximately 5%. The maximum time delay for the DA approach was ns, while for the traditional approach it was 8.37 ns. A 3-tap DA-OBC was also implemented and its power consumption was mW, but this is due to the extra logic needed to implement this design without any multipliers. These techniques will be tested on a larger scale for computing the FFT structure and analyzed further to see which technique gives the best overall power and time consumption.

4.2 Savings between the CBNS and FFT Arithmetic Computation

Given the review of CBNS above, this subsection focuses on how efficient the CBNS algorithm will be in computing radix-2 FFT calculations. Representing each complex number as a single unit reduces the arithmetic computations, as seen in Table 4.

Table 4: N-point comparison between DFT, Radix-2 FFT, and Radix-2 CBNS FFT

Table 4 shows that the radix-2 CBNS FFT uses 75% fewer multiplier hardware structures and approximately 67% fewer adders than the radix-2 FFT. It also shows that the radix-2 CBNS FFT requires fewer arithmetic computations than the DFT.

5 Work in Progress

The flow chart in Figure 5 demonstrates how the CBNS N-point FFT will be implemented using the DA approach. Recall that a DA architecture is bit serial in nature and computes the SOP of two vectors: fixed and variable input data. In implementing this structure, the fixed point complex numbers are first converted into the $(-1 + j)$-base representation. After the conversion, the varying input data become $x[n]$ and the twiddle factor becomes the fixed constant coefficient.
Figure 5: CBNS N-point FFT Implementation

These data then become the driving input for the butterfly structure. For each of the $m$ stages of the butterfly structure, the twiddle factors are loaded into the LUT, and the DA structure calculates the partial products of equal shift, which are added before being summed with the partial products of the next shift. This process continues for each stage until $y[n]$ contains the final result. The final results are output in the $(-1 + j)$-base representation, which can then be reconverted using the algorithm described in Section 3.1. In this proposed design, the three DA ROM reductions discussed in Section 2 will be used in implementing the FFT structure. The resulting hardware structure will have zero multipliers, a 100% reduction as shown in Table 5, since the multiplication process is carried out by shifting and adding only. The total number of real adders, however, will change slightly, as the adders needed for the DA and FFT structures are combined. The CBNS truth tables for addition and subtraction in the $(-1 + j)$ base will be used instead of the conventional approach for adding and subtracting. Currently, this design for implementing the FFT algorithm using CBNS and the DA approach is in its initial stage of development.

Table 5: Proposed DA CBNS FFT Structure
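To make the single-unit $(-1 + j)$-base representation concrete, the sketch below converts a complex number with integer real and imaginary parts to a base $(-1 + j)$ bit string and back. It uses repeated division by the base, a standard positional-conversion technique, and not the table-substitution procedure for fractions described in Section 3.1; the function names `to_cbns` and `from_cbns` are hypothetical.

```python
def to_cbns(re, im):
    """Bit string (MSB first) of the Gaussian integer re + j*im in base (-1 + j)."""
    if re == 0 and im == 0:
        return "0"
    digits = []
    while re != 0 or im != 0:
        r = (re + im) % 2          # digit is 1 when re and im differ in parity
        digits.append(str(r))
        re -= r                    # remove the digit, then divide exactly by (-1 + j):
        # (a + jb) / (-1 + j) = ((b - a) - j(a + b)) / 2
        re, im = (im - re) // 2, -(re + im) // 2
    return "".join(reversed(digits))


def from_cbns(bits):
    """Evaluate a base (-1 + j) bit string by Horner's rule; returns (re, im)."""
    re, im = 0, 0
    for ch in bits:
        # multiply the running value by (-1 + j), then add the next digit
        re, im = -re - im + int(ch), re - im
    return re, im


print(to_cbns(2, 0), to_cbns(-1, 0), to_cbns(0, 1))
print(from_cbns(to_cbns(3, -5)))
```

A single bit string carries both the real and imaginary parts, including their signs, so no separate sign bit or separate real and imaginary words are needed.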

6 Conclusion

For a DIT butterfly structure in the radix-2 FFT algorithm, each butterfly requires four multiplications and six additions. When CBNS is integrated into the FFT algorithm, the arithmetic computations for each butterfly are greatly reduced, because each complex number is represented as one single unit and the operations no longer rely on the divide and conquer method. By treating each complex number as one entity, each butterfly requires only one multiplication and two additions/subtractions. This reduces the number of N-point calculations by 75% for the multipliers (from four to one) and by approximately 67% for the adder/subtractors (from six to two). The proposed method will reduce the multiplier complexity by 100% and decrease the cost of implementing this structure compared to traditional ways of computing these algorithms. Given the great demand for new and improved technology for DSP applications that calculate the SOP, the methods discussed in this paper will be used to implement the new radix-2 and radix-4 FFT structures employing DA. These three new methods will be compared with the traditional method to investigate which gives the best overall performance in terms of area, speed, and timing.

7 References

[1] T. Jamil, "The complex binary number system," IEEE Potentials, vol. 0, no. 5, pp. 39-4, 00.
[2] R. G. Lyons, "FFT Software Programs," in Understanding Digital Signal Processing, Prentice Hall, 004.
[3] Govil and S. R. Chowdhury, "High performance and low cost implementation of Fast Fourier Transform algorithm based on Hardware Software co-design," 04 IEEE Region 0 Symposium, pp , 04.
[4] R. Lyons, "Relationship of the FFT to the DFT," in Understanding Digital Signal Processing, nd ed., Prentice Hall, 004.
[5] R. Guo, "Two high-performance adaptive filter implementation schemes using distributed arithmetic," IEEE Transactions on Circuits and Systems II, vol. 58, no. 9, pp , 0.
[6] M. Jiang, B. Yang, R. Huang, T. Y. Zhang and Y. Y. Wang, "A multiplierless fast Fourier transform architecture," Electronics Letters, vol. 43, no. 3, pp. 9-9, 007.
[7] V. K. Sharma, K. K. Mahapatra and U. C. Pati, "An efficient distributed arithmetic based VLSI architecture for DCT," International Conference on Devices and Communications, pp. -5, 0.
[8] W. Penney, "A numerical system with a negative base," Mathematical Student Journal, pp. -, 964.
[9] W. Penney, "A binary system for complex numbers," Journal of the ACM, vol., no., pp , 965.
[10] H. Zaini and R. G. Deshmukh, "A novel method for arithmetic operations using complex binary number system and the reconversion of the result to the decimal complex number system," Proceedings of the IEEE SoutheastCon 003, pp. 3-37, 003.
[11] T. Jamil, "Impact of shift operations on (-1+j)-base complex binary numbers," Journal of Computers, vol. 3, no., pp. 63-7, 008.
[12] T. Jamil and U. Ali, "Effects of multiple-bit shift-right operations on complex binary numbers," Proceedings of IEEE SoutheastCon 007, pp , 007.
[13] D. C. Blest and T. Jamil, "Efficient division in the binary representation of complex numbers," Proceedings of the IEEE SoutheastCon 00, pp , 00.
[14] T. Jamil, "An introduction to complex binary number system," Fourth International Conference on Information and Computing, pp. 9-3, 0.
[15] T. Jamil, N. Holmes and D. Blest, "Towards implementation of a binary number system for complex numbers," Proceedings of the IEEE SoutheastCon 000, pp , 000.
[16] J. Goode, T. Jamil and D. Callahan, "A simple circuit for adding complex numbers," WSEAS Transactions on Information Science and Applications, vol., no., pp. 6-66, 004.
[17] S. A. White, "Applications of distributed arithmetic to digital signal processing: A tutorial review," IEEE ASSP Magazine, pp. 4-9, 989.
[18] S. Ramprasad, N. R. Shanbhag and I. N. Hajj, "Low-power distributed arithmetic architectures using non-uniform memory partitioning," IEEE International Symposium on Circuits and Systems, vol. 3, pp , 999.
[19] W. Huang, "Implementation of adaptive digital FIR and reprogrammable mixed-signal filters using distributed arithmetic," PhD Thesis, Dept. Elect. & Comput. Eng., Georgia Institute of Tech., Atlanta, 009.
[20] R. Guo and L. S. DeBrunner, "A novel adaptive filter implementation scheme using distributed arithmetic," Signals, Systems and Computers, pp , 0.
[21] R. M. Jiang, "An area-efficient FFT architecture for OFDM and digital video broadcasting," IEEE Transactions on Consumer Electronics, vol. 53, no. 4, pp. 3-36, 007.
[22] S. Chandrasekaran and A. Amira, "Novel sparse OBC based distributed arithmetic architecture for matrix transforms," IEEE International Symposium on Circuits and Systems, pp , 007.
