ISSN Vol.03,Issue.05, July-2015, Pages:


WWW.IJITECH.ORG
ISSN 2321-8665, Vol.03, Issue.05, July-2015, Pages: 0707-0713

Evaluation of Fast Radix-10 Multiplication using Redundant BCD Codes for Speedup and Simplify Applications

B. NAGARJUN SINGH 1, T. VENKATARAMARAO 2
1 Assoc Prof & HOD, Dept of ECE, Sarada Institute Technology and Science, India, E-mail: nagarjun.singh24@gmail.com.
2 PG Scholar, Dept of ECE, Sarada Institute Technology and Science, TS, India, E-mail: venkataramarao321@gmail.com.

Abstract: We present the algorithm and architecture of a BCD parallel multiplier that exploits properties of two different redundant BCD codes to speed up its computation: the redundant BCD excess-3 code (XS-3) and the overloaded BCD representation (ODDS). In addition, new techniques are developed to significantly reduce the latency and area of previous representative high-performance implementations. Partial products are generated in parallel using a signed-digit radix-10 encoding of the BCD multiplier with the digit set [-5, 5], and a set of positive multiplicand multiples (0X, 1X, 2X, 3X, 4X, 5X) coded in XS-3. Hardware implementations normally use BCD instead of binary to manipulate decimal fixed-point operands and the integer significands of decimal floating-point (DFP) numbers, because BCD allows easy conversion between machine and user representations. However, BCD is less efficient than binary for encoding integers, since codes 10 to 15 are unused. Moreover, BCD arithmetic is more complicated to implement than binary arithmetic, which leads to area and delay penalties in the resulting arithmetic units. The proposed decimal multiplier internally uses a redundant BCD arithmetic to speed up and simplify the implementation. In this paper we present the algorithm and architecture of a new BCD parallel multiplier. The area and delay figures estimated from both a theoretical model and synthesis show that our BCD multiplier requires 20-35% less area than existing designs.
Keywords: Parallel Multiplication, Decimal Hardware, Overloaded BCD Representation, Redundant Excess-3 Code, Redundant Arithmetic.

I. INTRODUCTION

Decimal fixed-point and floating-point formats are important in financial, commercial, and user-oriented computing. Since area and power dissipation are critical design factors in state-of-the-art decimal floating-point units (DFPUs), multiplication and division are performed iteratively by means of digit-by-digit algorithms and therefore offer low performance. Moreover, the aggressive cycle time of these processors puts an additional constraint on the use of parallel techniques for reducing the latency of DFP multiplication in high-performance DFPUs.

This work improves parallel decimal multiplication by exploiting the redundancy of two decimal representations: the ODDS and the redundant BCD excess-3 (XS-3) representation, a self-complementing code. A general redundant BCD arithmetic (which includes the ODDS, XS-3 and BCD representations) is used to accelerate parallel BCD multiplication in two ways: partial product generation and partial product reduction.

Decimal multiplication is one of the most important decimal arithmetic operations, with a growing demand in commercial, financial, and scientific computing, so area-efficient, high-speed designs with reasonable power consumption are needed. Almost all computer systems use binary data for arithmetic operations because of the speed and simplicity of binary arithmetic and the efficiency of storing binary data. For digital signal processing and commercial applications, however, decimal arithmetic is still relevant, and speed is a major concern when it is implemented in software. Moreover, commercial databases contain more decimal data than binary data. For the purpose of processing, these decimal data are converted into binary data.
Once the processing is completed, the results are converted back into decimal format. These conversions cause delay, which motivates the ongoing and widespread research on Binary Coded Decimal (BCD) arithmetic with reduced area and delay. The computer arithmetic literature includes articles on decimal arithmetic hardware such as BCD adders and multi-operand BCD adders, and sequential BCD multipliers and dividers. However, some single arithmetic operations (e.g., division or square root), and also function evaluation circuits (e.g., radix-10 exponentiation or logarithm), are often implemented by a sequence of simpler operations including several multiplications. Therefore, hardware implementation of such operations and functions calls for high-speed parallel decimal multiplication. Such multiplication schemes are generally implemented as a sequence of three steps: partial product generation (PPG), partial product reduction (PPR), and the final carry-propagating addition.

Parallel binary multipliers are used extensively in most binary floating-point units for high performance. However, decimal multiplication is more difficult to implement due to the complexity of generating multiplicand multiples and the inefficiency of representing decimal values in systems based on binary signals. These issues complicate the generation and reduction of partial products.

Copyright @ 2015 IJIT. All rights reserved.

The first implementation of a parallel decimal multiplier is described in earlier work. Several different parallel decimal multiplier architectures have been proposed, which use new techniques for partial product generation and reduction. Furthermore, some of these architectures were extended to support binary multiplication. Some of these concepts were applied to design decimal 4:2 compressor trees. All of the previous designs are combinational fixed-point architectures. A pipelined IEEE 754-2008 compliant DFP multiplier based on such an architecture has also been presented. The work is a major extension of a previous paper, which presented a new family of high-performance parallel decimal multipliers. In this paper, we deal with a fully combinational decimal fixed-point architecture. We describe in some detail the methods proposed for partial product generation and reduction, and introduce new techniques to reduce the latency and the hardware complexity of the previous designs.

The main objective is to improve the performance of BCD multiplication. The specific objectives are:
1) To avoid long carry propagations in the generation of decimal positive multiplicand multiples.
2) To obtain the negative multiples easily from the corresponding positive ones.
3) To simplify the conversion of the partial products generated in XS-3 to the ODDS representation for efficient partial product reduction.

The partial product array PP(i) depicted in Fig.1 is reduced by carry-save adder (CSA) techniques; each column of M+1 operands is condensed into two. The strategy adopted to simplify the design is to divide the partial products into subsets, as shown in Fig.3. To carry out the reduction, this paper proposes decimal Q:2 CSAs, which reduce Q digits to two. These reductions are implemented in different versions, for Q = 3, 4, 5, 8, 9 decimal operands.
This will be further detailed in Section III. It must be pointed out that the processing is based on the (4221) coding. Several decimal additions are applied to the two reduced digits. The fast carry-chain adder described later is a suitable alternative due to its high performance in terms of area-delay.

II. AN OVERVIEW OF BCD MULTIPLIER

A generic N-digit by M-digit multiplication P = AxB is shown in Fig.1, where A and B are the multiplicand and the multiplier, respectively. The operation is made up of the following modules: generation of partial products, a partial product reduction tree, and fast decimal adders. Decimal multiplication coded in BCD-8421 (8421) is more complex to implement than binary multiplication due to the invalid (8421) digits {A, B, ..., F}. These need to be corrected, which adds computation cost, and additional multiplicand multiples must be implemented by the multiplier.

The generation of decimal partial products is based on previously described techniques. First of all, the decimal digits of A are represented in BCD-4221 (4221) to avoid the corrections mentioned above. This reduces the computational complexity of the multiplicand multiples coded in (4221). A BCD-5211 (5211) format is also introduced in this section. Both formats are needed to determine these multiples. For example, the multiple 2A is computed as follows: each (8421) digit is first recoded to (5421) decimal, as shown in Fig.2a; then a 1-bit left shift is carried out, which yields the 2A multiple in (8421). Finally, the 2A multiple is easily recoded from (8421) to (4221).

Fig.1. Decimal SD radix-10 pipelined multiplier diagram.

The rest of the multiples are computed according to Fig.2a. The multiplier B is recoded into signed-digit (SD) radix-10. The (4221) and (5211) codings are convenient because they speed up the CSA reduction and because the 9's complement of a digit is generated by simple bit inversion.
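The bit-inversion property mentioned above follows from the fact that the weights of both codes add up to 9 (4+2+2+1 and 5+2+1+1). The following is a minimal sketch that checks this exhaustively; the function names and the tuple representation of a digit are illustrative assumptions, not part of the design described in the paper.

```python
# Illustrative sketch: a digit is a tuple of 4 bits, most significant weight first.
# The names below are hypothetical and only serve this example.
W4221 = (4, 2, 2, 1)
W5211 = (5, 2, 1, 1)

def digit_value(bits, weights):
    """Decimal value of a 4-bit pattern under the given positional weights."""
    return sum(w * b for w, b in zip(weights, bits))

def invert(bits):
    """Bitwise inversion of a 4-bit pattern."""
    return tuple(1 - b for b in bits)

def all_patterns():
    """All sixteen 4-bit patterns."""
    return [tuple((n >> k) & 1 for k in (3, 2, 1, 0)) for n in range(16)]

# Because the weights sum to 9, inverting every bit maps a digit d to 9 - d.
for weights in (W4221, W5211):
    for bits in all_patterns():
        assert digit_value(invert(bits), weights) == 9 - digit_value(bits, weights)
```

Every one of the sixteen patterns is a valid digit in these codes (the largest value is 9), which is why no correction step is needed after the inversion.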
In order to simplify the computational process, a reduced non-redundant decimal code (4221) is adopted in this paper. The reduced coding is shown in Fig.2b.

(a) Multiplicand Multiples Generation
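The 2A computation described earlier in this section (recode each 8421 digit to 5421, then shift the whole vector left by one bit) can be sketched in software as follows. This is only a behavioural model under stated assumptions: the function names are hypothetical, digits are passed most significant first, and the final recoding from (8421) to (4221) is omitted.

```python
def to_5421(d):
    """Recode a decimal digit 0..9 to four bits with weights 5, 4, 2, 1."""
    hi, lo = divmod(d, 5)          # hi is the weight-5 bit, lo is in 0..4
    return [hi, (lo >> 2) & 1, (lo >> 1) & 1, lo & 1]

def double_bcd(digits):
    """Return the digits of 2*A (one digit longer than A), most significant first.

    After the 1-bit left shift, the weight-5 bit of each digit lands in the
    weight-1 position of the next higher digit (a decimal carry of 10), and
    the 4/2/1 bits land on weights 8/4/2, so every group is valid 8421 BCD.
    """
    bits = [0, 0, 0, 0]            # extra leading digit that receives the carry out
    for d in digits:
        bits.extend(to_5421(d))
    shifted = bits[1:] + [0]       # 1-bit left shift of the whole vector
    result = []
    for i in range(0, len(shifted), 4):
        b3, b2, b1, b0 = shifted[i:i + 4]
        result.append(8 * b3 + 4 * b2 + 2 * b1 + b0)
    return result

print(double_bcd([4, 7, 9, 3]))    # digits of 2 * 4793
```

No carry propagates further than one digit, which is the reason this multiple can be generated in constant time regardless of the operand length.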

Each block stage carries out M+3+(M/#blocks) CSA reductions. At the same time, the reductions in the current pipeline stage are processed. The 4221 Adder Recoding unit transforms the two 8-bit outputs 2H(i), S(i) of the previous unit into two 4-bit digits D1(i), D0(i) coded in (4221). Index (i) denotes the i-th CSA reduction. This unit is based on a (4221) addition and a circuit to detect special cases. Immediately afterwards, the (4221) outputs are recoded into (8421). Each pipeline stage executes a decimal addition based on carry-chain techniques. Finally, the output of the last pipeline stage is the (N+M)-digit product AxB.

(b) Reduced BCD-4221 coding
Fig.2. Multiples for SD radix-10 encoding.

III. NXM BCD MULTIPLIER IMPLEMENTATION

A general overview of the NxM-digit BCD multiplier architecture is described below and depicted in Fig.1. Dashed blocks indicate the main modules of the design, and the dotted lines indicate the pipeline stages. An N-digit multiplicand A and an M-digit multiplier B, both unsigned decimal, are assumed. The (8421) multiplication is processed as follows: the multiplicand multiples generation unit takes the operand A and computes a set of N+4-digit multiplicand multiples {1A, 2A, 3A, 4A, 5A} coded in (4221). In parallel, the SD radix-10 unit recodes each (8421) digit of B into the set {-5, -4, ..., +4, +5}, represented in a format with a 1-bit sign and a 5-bit magnitude. The recoded B is multiplied digit by digit by the previously computed multiplicand multiples. It is important to emphasize that the (8421) coding introduces a computational cost due to the corrections needed in the decimal CSA reduction. An alternative that avoids this drawback is to use the (4221) coding.
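The SD radix-10 recoding of the multiplier described above can be illustrated with a short behavioural sketch. It assumes the usual transfer-digit formulation (a digit of 5 or more sends a transfer of 1 to the next position and is reduced by 10), which the text does not spell out; the function name is hypothetical, and plain signed integers stand in for the 1-bit sign and 5-bit magnitude signals of the hardware.

```python
def sd_radix10_recode(digits_lsd_first):
    """Recode BCD digits (least significant first) into digits in [-5, 5].

    Assumed rule: transfer_out = 1 when the BCD digit is >= 5, and the
    recoded digit is b + transfer_in - 10 * transfer_out. The transfer
    depends only on the digit itself, so there is no carry propagation.
    An M-digit input yields M+1 recoded digits.
    """
    recoded = []
    transfer_in = 0
    for b in digits_lsd_first:
        transfer_out = 1 if b >= 5 else 0
        recoded.append(b + transfer_in - 10 * transfer_out)
        transfer_in = transfer_out
    recoded.append(transfer_in)    # most significant recoded digit, 0 or 1
    return recoded

# B = 2958, written least significant digit first
print(sd_radix10_recode([8, 5, 9, 2]))
```

The extra most significant digit is the reason an M-digit multiplier produces M+1 partial products, and the restriction to magnitudes 0 to 5 is why only the multiples 1A to 5A have to be precomputed.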
Next, the partial product generation unit takes the outputs of the two modules described above (multiplicand multiples generation and SD radix-10 recoding) and generates M+1 partial products, each of which depends on one recoded digit of B in signed-digit radix-10 format. Each partial product, coded in (4221), is at most N+3 digits long. All partial products are taken as inputs by the decimal Q:2 CSA reduction tree module, which reduces Q digits to two decimal digits coded in (4221). The basic scheme for the decimal 3:2 CSA reduction fulfils the relation A(i) + B(i) + C(i) = 2H(i) + S(i), i ∈ {0, 1, 2, ..., M-1} (Fig.3a). The 4-bit inputs A(i), B(i), and C(i) are reduced to the 4-bit S and H by means of 4 full-adder cells. Note that 2H(i) is produced by the 2x block. Therefore, the outputs considered require at least 5-bit operands. The proposed decimal Q:2 reduction has two outputs, 2H and S, of 8 bits each. This architecture can be extended to different versions of decimal Q:2 compressors. To validate and evaluate the design, several decimal Q:2 CSAs between 3:2 and 9:2 compressors were tested. The proposed decimal Q:2 CSA is based on several basic binary CSAs (Fig.3b, 3c, 3d, 3e), which in turn are made up of full adders or half adders as basic cells. The partial products are split into #blocks blocks, which provide the different pipeline stages (Fig.4). A performance comparison between 8x8 and 16x16 multiplications implemented with several pipeline levels has also been presented.

Fig.3. Proposed decimal 3:2 CSA and several binary adder schemes.

A. Signed-Digit Radix-10 Recoding / Multiplicand Multiples Generation

The block diagram is depicted in Fig.2a; the recoding is applied to each digit of B. This transforms the (8421) digit set {0, 1, 2, ..., 9} into the signed-digit set {-5, -4, ..., +4, +5}, made up of a

sign signal SX(i) and a 5-bit magnitude X(i). Each digit X(i) selects the corresponding multiplicand multiple coded in (4221). The sign SX(i) indicates whether a negative multiple is produced. It is important to highlight that the negative version of a partial product is obtained simply by inverting the bits of the positive version, using arrays of XOR gates controlled by SX(i). The above circuit requires 2 slices and 6 LUTs.

The set of multiplicand multiples (Fig.2a) is coded in (4221). Multiples 2A and 5A are generated by recoding (8421 to 4221, 5211 to 4221) and a left shift. Multiple 4A is computed as 2x2A, and 3A comes from a decimal addition of A and 2A. Each partial product has N+3 digits.

As an example, the decimal 8:2 CSA is described: this implementation requires 4 binary 8:4 CSA modules, 2 binary 3:2 CSAs and seven 2x blocks. Each binary 8:4 CSA is made up of one half adder and 4 full adders and has an area of 7 LUT4s and 3 LUT6s. Table I shows the area contribution of several decimal CSA implementations.

TABLE I: Area of Several Proposed Decimal CSAs

B. Partial Product Reduction

After generating the M+1 partial products, coded in (4221), the partial products are reduced by means of decimal Q:2 carry-save additions (CSA). First, the partial products are aligned and divided into blocks, as shown in Fig.4. A circuit based on decimal 3:2, 4:2, and 8:2 CSA compressors is proposed, which entails block sizes of (M+3+M/#blocks) x Q decimal operands. A decimal 3:2 CSA compressor (Fig.3a) adds three numbers A, B, C, each 4 bits wide (coded in 4221), to obtain a decimal sum S and a carry H such that A+B+C = S+2H, as explained at the beginning of Section III. The 2X module in Fig.3a is composed of a (4221 to 5211) recoding operation and a one-bit left shift.
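The relation A+B+C = S+2H and the structure of the 2X module can be checked with a small behavioural model. This is a sketch under stated assumptions, not the circuit of Fig.3a: digits are tuples of four bits with weights (4, 2, 2, 1), numbers are lists of digits with the most significant digit first, the greedy encoder is just one of several valid encodings, and all function names are hypothetical.

```python
W4221 = (4, 2, 2, 1)
W5211 = (5, 2, 1, 1)

def encode(value, weights):
    """Greedy encoding of a digit 0..9 as four bits under the given weights."""
    bits = []
    for w in weights:
        bit = 1 if value >= w else 0
        bits.append(bit)
        value -= w * bit
    return tuple(bits)

def decode(bits, weights):
    """Decimal value of a 4-bit pattern under the given weights."""
    return sum(w * b for w, b in zip(weights, bits))

def csa_3to2_digit(a, b, c):
    """Four binary full adders working bitwise on one (4221) digit position."""
    s = tuple(x ^ y ^ z for x, y, z in zip(a, b, c))
    h = tuple((x & y) | (x & z) | (y & z) for x, y, z in zip(a, b, c))
    return s, h

def times_two(h_digits):
    """2X block: recode each digit from (4221) to (5211), then shift left one bit.

    After the shift the bits of weight 5, 2, 1, 1 sit on weights 10, 4, 2, 2,
    so the result is again a (4221) vector, one digit longer than the input.
    """
    bits = [0, 0, 0, 0]
    for h in h_digits:
        bits.extend(encode(decode(h, W4221), W5211))
    shifted = bits[1:] + [0]
    return [tuple(shifted[i:i + 4]) for i in range(0, len(shifted), 4)]

def number(digits):
    """Integer value of a list of (4221) digits, most significant first."""
    n = 0
    for d in digits:
        n = 10 * n + decode(d, W4221)
    return n

def to_digits(n, width):
    """Encode a non-negative integer as `width` (4221) digits."""
    return [encode(int(ch), W4221) for ch in str(n).zfill(width)]

def reduce_3to2(x, y, z, width):
    """Reduce three operands to the pair (S, 2H) using digit-wise 3:2 CSAs."""
    pairs = [csa_3to2_digit(a, b, c)
             for a, b, c in zip(to_digits(x, width), to_digits(y, width), to_digits(z, width))]
    s = [p[0] for p in pairs]
    h = [p[1] for p in pairs]
    return s, times_two(h)

s, two_h = reduce_3to2(907, 358, 999, 3)
assert number(s) + number(two_h) == 907 + 358 + 999
```

The model also shows why the codes need no decimal correction: every 4-bit pattern is a valid (4221) digit, so the bitwise sum and carry vectors are legal operands for the next level of the tree.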
The decimal 3:2 CSA is implemented with 4 binary 3:2 CSAs, each using 2 LUT3s, and a 2X block. The hardware cost of the different decimal Q:2 versions is shown in Table I. The basic binary 3:2 (full adder) circuit is used as the basic cell in the different CSA compressor schemes.

Fig.4. Example of partial products reduction architecture.

C. BCD-4221 Adder Recoding

After the previous stage, the (4221) Adder Recoding unit in Fig.5 takes as inputs the outputs of each Q:2 reduction and recodes them into two 4-bit decimal digits in (4221). This implementation is based on a (4221) addition and a circuit to detect special cases. As specified in Section III, a (4221) decimal addition does not require the correction that is usually needed in an (8421) decimal adder. As soon as the circuit receives the two 8-bit outputs of the previous stage, a 2-digit decimal addition in (4221) format is carried out. The function is represented as: 2H(i)(7..0) + S(i)(7..0) = D1(i)(3..0) & D0(i)(3..0), where (i) denotes the i-th Q:2 reduction process. The implemented architecture is shown in Fig.5. On the one hand, the Q:2 compressors are limited to a reduced subset of the (4221) code. This makes it necessary to recode the outputs of the compressors to guarantee that 2H and S belong to that subset. On the other hand, certain decimal digits that match the reduced codes do not produce a correct addition in (4221) format; a detection circuit is therefore needed to handle these cases (shown in Fig.5 as the correction circuit). The (4221) adder is a fully combinational design and uses an area of 32 LUTs and 10 slices.

D. Decimal Addition Design

After the previous stage, as soon as its outputs are computed, an (8421) decimal addition is carried out. First, the vectors D0 and D1 are aligned, just before both of them are recoded into (8421).
As pointed out at the beginning of this section, log2(M)+1 pipeline levels are implemented. Several decimal additions are processed at each pipeline stage (Fig.4), carrying out σ additions of (M+3+M/#blocks) digits. The proposed addition is based on 10's complement BCD numbers and uses carry-chain techniques throughout the design. The carry-chain addition consists of computing all carry-propagate and carry-generate conditions beforehand, in order to reduce the overall execution time. The implemented adder uses an area of 10 x (M+3+M/#blocks) LUTs. The high-performance adder proposed in earlier work has been adopted.
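The idea of computing the generate and propagate conditions of every digit before resolving the carries can be sketched as follows. This is a minimal software model, not the FPGA carry-chain mapping used in the design: it handles unsigned operands only (the 10's complement handling mentioned above is left out), digits are given least significant first, and the function name is hypothetical.

```python
def bcd_add_carry_chain(a, b, carry_in=0):
    """Add two equal-length decimal digit vectors (least significant first).

    Step 1: per-digit conditions, all independent of each other:
            generate  g_i = (a_i + b_i >= 10)
            propagate p_i = (a_i + b_i == 9)
    Step 2: carry chain  c_{i+1} = g_i or (p_i and c_i)
    Step 3: sum digits   s_i = (a_i + b_i + c_i) mod 10
    """
    generate = [x + y >= 10 for x, y in zip(a, b)]
    propagate = [x + y == 9 for x, y in zip(a, b)]

    carries = [carry_in]
    for g, p in zip(generate, propagate):
        carries.append(1 if g or (p and carries[-1]) else 0)

    digits = [(x + y + c) % 10 for x, y, c in zip(a, b, carries)]
    return digits, carries[-1]

# 4793 + 5207, least significant digit first
print(bcd_add_carry_chain([3, 9, 7, 4], [7, 0, 2, 5]))
```

In hardware the second step maps onto a dedicated fast carry path, so only the short per-digit logic of steps 1 and 3 is built from general-purpose LUTs.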

Fig.5. BCD-4221 adder recoding architecture.

IV. EVALUATION AND COMPARISON

The proposed combinational architectures for BCD 16x16-digit and 34x34-digit multipliers are evaluated and compared with other representative BCD multipliers. The area and delay figures of our architectures were obtained from an area-delay evaluation model for static CMOS gates, and validated with the synthesis of verified RTL models coded in VHDL. This evaluation is detailed in Section IV.A. Finally, the most representative sequential and parallel decimal multipliers have been compared with our architecture. The results of the comparison are summarized in Section IV.B.

A. Evaluation

As stated above, the evaluation has been performed in two steps. First, a technology-independent evaluation using a model for static CMOS circuits based on Logical Effort (LE) has been carried out, and then the results obtained with this model have been validated with the synthesis and functional verification of the RTL model.

Technological Independent Evaluation: Our technology-independent evaluation model allows us to obtain a rough estimate of the area and delay figures for the architecture being evaluated. It takes into account the different input and output gate loads, but neither interconnections nor gate sizing optimizations are modeled. The delay is given in FO4 units, that is, the delay of a 1x inverter with a fan-out of four inverters. The hardware complexity is given as the number of equivalent minimum-size NAND2 gates. We do not expect this rough model to give absolute area-delay figures, due to the high wiring complexity of parallel multipliers. However, based on our experience this model is good enough for making design decisions at gate level and it provides reasonable accuracy of area and delay ratios to compare different designs. The model characterizes the delay, input capacitance (Lin) and area of the main building blocks used in the BCD multipliers. The input capacitance is normalized to the input capacitance of the 1x inverter. The Lout parameter represents the normalized output load connected to the gate. The XOR2 gate is implemented with CMOS transmission gates. To evaluate our architectures, gates with the drive strength of the minimum-sized (1x) inverter have been assumed, and buffers have been inserted to deal with high loads. The critical path delay in every stage of the multiplier has been estimated as the sum of the delays of the gates on this critical path. The area and delay figures obtained for the 16x16-digit and 34x34-digit architectures are shown in Table II.

TABLE II: Area and Delay (LE-Based Model) for the Proposed Mults

Synthesis Results: The designs have been synthesized using Synopsys Design Compiler B-2012.09-SP3 and a 90 nm CMOS standard cell library [12]. The FO4 delay for this library is 49 ps under typical conditions (1 V, 25 °C). The area-delay curves of Fig. 6 have been obtained with the constraint Cout = Cin = 4 Cinv, where Cinv is the input capacitance of a 1x inverter of the library.

Fig.6. Area-delay space obtained from synthesis.

We also include in Fig. 6 the area-delay points estimated from the LE-based model evaluation. We have kept the hierarchy of the design in the synthesis process as described in Sections 3 to 6 (no top-level architecture optimization options). Nevertheless, some specific structures have been optimized internally to reduce the overall delay. To ensure the correctness of the designs we have simulated the RTL models of the 16x16-digit and 34x34-digit multipliers using the Synopsys VCS-MX tool and a comprehensive set of random test vectors.

B. Comparison

Table III shows the area and delay estimations obtained from synthesis for some representative BCD sequential and combinational multipliers. As far as we know, the most

representative high-performance BCD multipliers are the 16x16-digit combinational and sequential [10], [11] implementations. The area and delay figures shown in Table III correspond to the minimum delay point of each implementation, and were obtained from the synthesis results provided in the respective reference, except for the two multipliers of reference, which correspond to an estimation obtained by their authors using an LE-based model. The comparison ratios are given with respect to the area and delay figures of a 53-bit binary Booth radix-4 multiplier taken from the literature.

The PPG of the multipliers in [8] is based on an SD radix-5 scheme that generates 32 BCD partial products for a 16-digit multiplier. Though it only requires simple, constant-time-delay BCD multiplicand multiples, the 9's complement operation for obtaining the negative multiples is more complex than a simple bit inversion. The partial product reduction implemented is a BCD carry-save adder tree built of BCD digit adders. On the other hand, the BCD partial products are reduced in [8] by using counters that compute the binary sum of each column of digits, followed by binary-to-decimal conversions [8].

The BCD multipliers in use employ either the SD radix-5 PPG scheme or an SD radix-10 PPG scheme. The latter has the advantage that it practically halves the number of partial products generated by the former (17 against 32 for 16x16-digit multiplications). However, it has the disadvantage of a significant latency overhead due to the generation of the complex multiple 3X. The latency and area of prior-art multipliers are improved by representing the partial products in (4221) or (5211) decimal codes, which allow the PPR to be implemented using a very regular and compact tree of 4-bit binary carry-save adders (built of 3:2 or 4:2 compressors) and decimal digit doublers. The most recent implementation is presented next.
It also uses an SD radix-10 PPG scheme to reduce the number of partial products generated to 17 and, consequently, the area of the PPR tree. To avoid the latency overhead of the 3X multiple generation, the partial products are coded in a redundant SD representation.

The BCD multiplier precomputes all the positive decimal multiplicand multiples {0X, ..., 9X}. The delay of the PPG is reduced by representing the complex multiples (3X, 6X, 7X, 8X, 9X) as the sum of two simpler multiples. The number of partial products generated is therefore equivalent to that of the SD radix-5 scheme. The PPR tree is implemented with BCD digit adders. This has the disadvantage of a large area compared to the other BCD multipliers analyzed.

The two 16x16-digit BCD multipliers implement an easy-multiple PPG [10] (which only precomputes {2X, 4X, 5X}) that produces 32 BCD partial products. The intermediate decimal partial product sums are computed in overloaded BCD to speed up the PPR evaluation. The delay-improved design uses a tree structure built of five levels of overloaded BCD digit adders, while the area-improved design replaces two levels of these custom-designed adders by three levels of 4:2 compressors and a binary counter. This reduces the area consumption, but at the cost of a significant latency penalty.

TABLE III: Synthesis Results for Fixed-Point Multipliers

Fig.7. Area-delay space for the fastest 16x16-digit mults.

Sequential 16x16-digit (Decimal64) BCD multipliers are about two times smaller than equivalent parallel implementations, but have higher latency and reduced throughput (one multiplication issued every 17 cycles). For example, the proposed multiplier is about seven times faster than the best sequential implementation proposed in [10], but requires 2.5 times more area. To illustrate the high hardware cost of a combinational Decimal128 implementation, we also include in Table III the area and delay figures obtained for our 34x34-digit BCD multiplier.
Due to the tight area and power consumption constraints of current DFUs [6], a sequential architecture seems a more realistic solution than a fully pipelined implementation for a commercial Decimal128 multiplier.

Finally, we present a more detailed comparison of the fastest 16 × 16-digit BCD combinational multipliers (SD radix-5, SD radix-10, and the proposed one) in terms of latency and area. The corresponding area-delay synthesis values are shown in Fig. 7. We have directly introduced in the figure the area-delay curves of the referenced multipliers as provided by their authors, since all of them were synthesized using 90 nm CMOS standard cell libraries under similar conditions. The area-delay points for two of the referenced multipliers correspond to an estimation obtained by their authors using a LE-based model. From the area-delay space represented in Fig. 7, we observe that our proposed decimal multiplier has an area improvement roughly in the range of 20-35 percent, depending on the target delay. On the other hand, for the minimum delay point (44 FO4), the proposed multiplier is still competitive with the fastest design.

More recently, the authors of one of the referenced designs have presented a comparison study between their delay-improved multiplier and another referenced multiplier, based on synthesis results using a TSMC 130 nm standard CMOS process under typical conditions (1.2 V, 25 °C). They show that, for the minimum delay point of each of the two area-delay curves obtained, the delay-improved multiplier is 20 percent faster and has 10 percent less area than the other design. Therefore, according to that study, the curve corresponding to the compared design should be to the left of the area-delay points corresponding to the delay-improved design.

V. CONCLUSION

This paper has reassessed several implementations of N × M-digit multiplications on Xilinx FPGAs. This work presents the design of several BCD multipliers and their implementations on a Virtex-6 FPGA. Previously proposed techniques and designs are analyzed to carry out a performance comparison in terms of area and delay, as seen in Section 4. The proposed pipelined multiplication is based on a multiplier operand recoded in SD radix-10. The decimal partial products are generated in parallel and reduced using carry-save adder techniques and decimal adders. The proposed circuit is a high-performance decimal multiplier design. Our proposal shows encouraging results: on the one hand, its figures are comparable to or better than those of other multipliers; on the other hand, a considerable number of cores can fit into a large FPGA. Future work plans to include the proposed multiplier in operations based on decimal floating point.
Also, a good alternative would be to code the multiplier operand in SD radix-5 and SD radix-4 formats and to compare the performance results. Finally, a Xilinx MicroBlaze embedded processor can be used to execute multiplications in software and in hardware, in order to estimate the software-hardware performance-cost trade-off.

VI. REFERENCES

[1] A. Vázquez, E. Antelo, and J. D. Bruguera, "Fast radix-10 multiplication using redundant BCD codes," IEEE Trans. Comput., vol. 63, no. 8, Aug. 2014.
[2] A. Aswal, M. G. Perumal, and G. N. S. Prasanna, "On basic financial decimal operations on binary machines," IEEE Trans. Comput., vol. 61, no. 8, pp. 1084-1096, Aug. 2012.
[3] M. F. Cowlishaw, E. M. Schwarz, R. M. Smith, and C. F. Webb, "A decimal floating-point specification," in Proc. 15th IEEE Symp. Comput. Arithmetic, Jun. 2001, pp. 147-154.
[4] M. F. Cowlishaw, "Decimal floating-point: Algorism for computers," in Proc. 16th IEEE Symp. Comput. Arithmetic, Jul. 2003, pp. 104-111.
[5] S. Carlough and E. Schwarz, "Power6 decimal divide," in Proc. 18th IEEE Symp. Appl.-Specific Syst., Arch., Process., Jul. 2007, pp. 128-133.
[6] S. Carlough, S. Mueller, A. Collura, and M. Kroener, "The IBM zEnterprise-196 decimal floating point accelerator," in Proc. 20th IEEE Symp. Comput. Arithmetic, Jul. 2011, pp. 139-146.
[7] L. Dadda, "Multi-operand parallel decimal adder: A mixed binary and BCD approach," IEEE Trans. Comput., vol. 56, no. 10, pp. 1320-1328, Oct. 2007.
[8] L. Dadda and A. Nannarelli, "A variant of a radix-10 combinational multiplier," in Proc. IEEE Int. Symp. Circuits Syst., May 2008, pp. 3370-3373.
[9] L. Eisen, J. W. Ward, H.-W. Tast, N. Mading, J. Leenstra, S. M. Mueller, C. Jacobi, J. Preiss, E. M. Schwarz, and S. R. Carlough, "IBM POWER6 accelerators: VMX and DFU," IBM J. Res. Dev., vol. 51, no. 6, pp. 663-684, Nov. 2007.
[10] M. A. Erle and M. J. Schulte, "Decimal multiplication via carry save addition," in Proc. IEEE Int. Conf. Appl.-Specific Syst., Arch., Process., Jun. 2003, pp. 348-358.
[11] M. A. Erle, E. M. Schwarz, and M. J. Schulte, "Decimal multiplication with efficient partial product generation," in Proc. 17th IEEE Symp. Comput. Arithmetic, Jun. 2005, pp. 21-28.
[12] Faraday Tech. Corp. (2004). 90nm UMC L90 standard performance low-k library (RVT). [Online]. Available: http://freelibrary.faraday-tech.com

Author's Profile:

B. NAGARJUN SINGH, Assoc Prof & HOD, Dept of ECE, Sarada Institute Technology and Science, India, E-mail: nagarjun.singh24@gmail.com.

T. VENKATARAMARAO, PG Scholar, Dept of ECE, Sarada Institute Technology and Science, TS, India, E-mail: venkataramarao321@gmail.com