A Novel Carry-look ahead approach to an Unified BCD and Binary Adder/Subtractor

A Novel Carry-look ahead approach to an Unified BCD and Binary Adder/Subtractor Abstract Increasing prominence of commercial, financial and internet-based applications, which process decimal data, there is an increasing interest in providing hardware support for such data. In this paper, new architecture for efficient binary and Binary Coded Decimal (BCD) adder/subtractor is presented. This employs a new method of subtraction unlike the existing designs which use 10 s complements, to obtain a much lower latency. Though there is a necessity of correction in some cases, the delay overhead is minimal. A complete discussion about such cases and the required logic to process is presented. The architecture is run-time reconfigurable to facilitate both BCD and binary operations, including signed and unsigned numbers. The proposed circuits are compared (both qualitatively as well as quantitatively) with the existing circuits in literature and are shown to perform better. Simulation results show that the proposed architecture is at least 11% faster than the existing designs. 1. Introduction Due to growing importance of decimal arithmetic in commercial, financial and internet-based applications, which cannot tolerate errors from converting between binary and decimal formats, hardware support for decimal arithmetic is receiving an increased attention. Recently, specifications for decimal floating point arithmetic have been added to the draft revision of the IEEE-754r Standard for Floating Point Arithmetic [1]. Despite the widespread use of binary arithmetic, decimal computation remains essential for many applications. Not only it is required whenever numbers are presented for human inspection, but it is also often a necessity when fractions are involved. Decimal fractions are pervasive in human endeavors, yet most cannot be represented by binary fractions. The value 0.1, for example, requires an infinitely recurring binary number. If a binary approximation is used instead of an exact decimal fraction, results can be incorrect even if subsequent arithmetic is correct. [2] It is anticipated that once the IEEE-754r Standard is finally approved, hardware support for decimal floating point arithmetic will be incorporated on processors for various applications. Still, the major consideration while implementing Binary Coded Decimal (BCD) arithmetic will be to enhance its speed as much as possible which is being addressed in this paper. But to facilitate even binary applications on the same hardware a reconfigurable approach needs to be adopted. This paper deals with the design of an architecture that can perform both binary and BCD addition/subtraction. It also supports both signed and unsigned operations. All the existing architectures, as per the knowledge of the authors, use 10 s or 9 s complement to implement subtraction in BCD. But this has been found to have a very high latency. Hence a new approach has been proposed to overcome this problem. The architecture has been designed to have maximum hardware utilization. The rest of the paper is organized as follow: Section 2 provides a brief mathematical background of BCD while section 3 gives a brief explanation about the existing architectures. The proposed algorithm for the unified BCD and binary adder/subtractor is given in section 4. In section 5, the proposed architecture is presented. Simulation results for the proposed and existing circuits are given in section 6 and comparisons are carried out. Finally a conclusion is presented in section 7. 2. Mathematical Background for BCD BCD is a decimal representation of a number directly coded in binary, digit by digit. For example the number (9527)10 = (1001 0101 0010 0111) BCD. It can be seen that each digit of the decimal number is coded in binary and then concatenated, to form the BCD representation of the decimal number. To use this representation all the arithmetic and logical operations need to be defined. As the decimal

number system contains 10 digits, at least 4 bits are needed to represent a BCD digit. Consider a BCD digit A. The BCD representation of A is A 4 A 3 A 2 A 1 where all A k ( 0,1). The only point of note is that the maximum value that can be represented by a BCD digit is 9. The representation of (10) 10 in BCD is (0001 0000). Addition in BCD can be explained by considering two decimal digits A and B with BCD representations as A 4 A 3 A 2 A 1 and B 4 B 3 B 2 B 1 respectively. In the conventional algorithm, these two numbers are added using a 4-bit binary adder. It is possible that the resultant sum can exceed 9 which results in overflow. If the sum is greater than 9, the binary equivalent of 6 is added to the resultant sum to obtain the exact BCD representation. This can be illustrated with the following example A B Sum Add BCD 0110 0101 1011 0110 1 0001 Answer = (0001 (6) (5) (11) (6) (11 in BCD) 0001) Fig 1. Hwang s proposal [13] Flora [11] presents an adder similar to the carry-select adder. This employs the use of duplicate hardware to compute the output in the presence of a carry and in its absence. It then selects the appropriate one as the carry is computed. This was improvised by Fischer et. al. where only a single adder was employed to reduce the area over-head. But there was a higher latency due to the additional correction block employed. 3. Related Work There is a wide range of literature available in field of BCD arithmetic. Some of the first contributions were made by Schmookler et. al.[10] and Adiletta et. al. [14]. An approach towards to architecture dealing with both BCD and binary was shown by Levine et. al. and Anderson, while one of the first BCD sign-magnitude adder/subtractor architecture was presented by Grupe [17]. An area efficient signmagnitude adder was later developed by Hwang [13]. In his approach two additional conversions were introduced before and after the binary addition. Area occupied by this design was least amongst all the previous designs. Fig 2. Fischer s proposal [11] The approach to construct BCD architectures in many IBM processors is based on the work presented by Haller et. al. in [12]. This architecture shown in Fig. 3 operated in a single cycle, though requiring corrections in some cases. In the case of subtraction there is a need for the computation of the complement after the subtraction to obtain the correct difference, hence increasing the latency. Another improvement in the same architecture was the optimization of the carry chain resulting in a slight delay improvement with an increased area of the unit.

4. Proposed Algorithm Fig 3. Haller s proposal [12] The Universal adder micro architecture design (Fig. 4) proposed by Humberto et. al. [18] uses effective addition/subtraction operations on unsigned, sign-magnitude, and various complement representations. This design also overcomes the limitations of previously reported approaches that produce some of the results in complement representation when operating on sign-magnitude numbers. This design proposed that the major disadvantage of the previous designs i.e. having the subtrahend the smaller number in magnitude, was eliminated by their approach. One of the major points to note is that all these proposals make use of complement addition to perform subtraction. This paper proposed a new method without the essential use of 10 s complement to perform BCD subtraction. The proposed algorithm aims at performing both BCD and binary addition/subtraction. The major concern is to avoid 10 s complement to perform subtraction which is the reason for the high latency in the existing architectures. The proposed design can be divided into three major parts, the pre-computation stage, the prefix network [20] and the postcomputation stage. The pre-computation stage generates control signals named propagate (P) and generate (G). These control the operation of the prefix network. These control signals are generated for every significant stage in the N-digit number i.e. there 2*N( P*N + G*N) control signals. These denote whether the k th stage propagates the carry/borrow signal or generates it respectively. The concept of propagate and generate in both binary and BCD is illustrated below with equations and examples. In the case of addition of binary numbers the following equations denote propagate and generate for A and B (two bits at the k th stage): P = G = A B A B A B A B Fig 5. Examples of Binary addition illustrating the concept of propagate and generate In the case of addition of BCD digits A and B (two digits at the k th stage) the following equations denote propagate and generate: P => A+B=9 Fig 4. Humberto s proposal [18] G => A+B>9 This operation can be better illustrated from the following example:

Fig 6. Examples of BCD addition illustrating the concept of propagate and generate In the case of subtraction of BCD digits A and B (two digits at the k th stage) the following equations denote propagate and generate: P => A=B G => A<B This operation can be better illustrated from the following example: In case of BCD, it is a little complicated. As mentioned in Section 2, for the addition of two BCD digits if there is an overflow then a correction value of 0110(6) has to be added. For the subtraction of BCD digits, all the existing architectures are employing 10 s complement subtraction. But computing 10 s complement induces a very high latency in the operation. Hence the proposed architecture uses propagate and generate bits defined specifically for subtraction to compute the borrow bits for every stage using the prefix network. After the borrow bit is computed, the individual digits at every significant stage are subtracted by using 2 s complement using 4-bit binary adders. But then if a borrow output is generated then the output has to be corrected by adding 1010 (10) to the difference. This is checked by inspecting the borrow input of the subsequent stage which has already been generated by the prefix network. This operation is illustrated in the following example: Fig 7. Examples of BCD subtraction illustrating the concept of propagate and generate These propagate and generate bits are sent to the prefix network which has a network of blocks which calculates the group propagate and generate bits. The group P k:0 and G k:0 bits denote whether the first k stages propagate or generate the carry/borrow. After the calculation of the group propagate and generate, the carry/borrow can be calculated based on the carry/borrow input based on the following equation. C out =G + P. C in Thus, the carry/borrow at every significant stage is obtained. These bits are sent to the post-computation blocks which compute the final sum/difference based on these bits. In case of binary, after obtaining the carry/borrow the sum/difference is the XOR of the carry/borrow and the propagate bit. Fig 8. Examples of Binary and BCD operations illustrating the dataflow of the proposed architecture 5. Architecture of the proposed BCD and binary Adder/Subtractor The proposed architecture performs both BCD and binary addition/subtraction including signed and unsigned numbers. This architecture can be divided into three major parts, the pre-computation stage, the prefix network and the post-computation stage. This architecture is illustrated by a block diagram in Fig. 9.

The BCD Full Adder/Subtractor computes the sum if selected by the control signal. The addition operation is performed by adding the two BCD digits using the 4-bit binary carry look-ahead adder and the correction (mentioned in Sec. 2) block. This diagram is shown in Fig. 11. S4 S3 S2 S1 S4 S3 S2 Fig 9. Block Diagram of the proposed architecture 1 0 1 0 For the case of BCD computation it can be observed from the diagram that the pre-computation block for every significant stage consists of logic to generate (P, G) and (P*, G*). Depending on whether addition or subtraction is selected the corresponding propagate and generate bits are sent to the prefix network using an array of multiplexers. The selection of the prefix network can be made according to the requirements of area, power and delay from the wide range available in literature. For simulation purposes Sklansky network is used in the design. [19] Though Sklansky network is being used, any available prefix network [20] can be chosen depending upon the area, power and delay criterion. This generates the group propagate and generate which when combined with the carry/borrow input generates carry/borrow for every stage. These bits are taken by the final post-computation BCD Full Adder/Subtractor blocks shown in Fig 10. Overflow 0 1 0 1 0 1 O4 Fig 11. Gate-level diagram of the correction block for the addition operation The subtraction operation is done by 2 s complement addition. But the difference generated needs to be corrected. The borrow input for the subsequent stage that has already been generated by the prefix network is checked and the correction value (1010) is added. The correction block is shown in Fig. 12. The final set of multiplexers select the output based on the operation selected. The only thing that needs to be made sure in this logic is that during subtraction the subtrahend is always the smaller number (in magnitude). This is the case even in all the existing architectures except the design of Humberto et. al.. In the proposed architecture this is managed by using the outputs of the (P*, G*) blocks which have comparators in their logic. Thus essentially there is no additional latency overhead due to this comparison. O3 O2 O1 Fig 12. Gate level diagram of the correction block for the subtraction operation Fig 10. Block diagram of the 1-digit BCD adder/subtractor As it comes to the binary operation of the architecture, the prefix network is the major block. The

pre-computation of the binary numbers consists of an array of XOR gates which compute the 2 s complement when subtraction needs to be performed. Then the propagate and generate bits are generated based on the equations mentioned in the previous section. These bits are sent to the prefix network which generates the carry at every stage. These individual carry bits are XORed to the corresponding propagate bits for every stage to compute the sum/difference. Thus the prefix network that is the major block both in the BCD and binary computation is shared between the two operations to facilitate re-configurability. The signed numbers are taken care by the control logic at the beginning which take the two sign bits and OpSelect(Operation Select) as inputs to compute the control signal that selects the appropriate multiplexers depending on the operation. 6. Simulations and Results 6.1 Simulation Environment The proposed architectures have been described using Verilog HDL and 8, 16 and 32 digits wide operands. The design was synthesized using Mentor Graphics Leonardo spectrum targeting Xilinx Virtex- E v50ecs144 (speed grade-8) FPGA. The approximate values of area have also been mentioned. The inputs are given at a clock frequency of 500 MHz. 6.2 Results and Discussion The proposed architectures have been simulated in the simulation environment mentioned above and the results for latency and area for 32-bits architectures are given in Table 1. Table 1. Comparison of the existing and the proposed architectures in terms of latency and area Design Latency(ns) Area (LUTs) Hwang[13] 10.5 158 Haller[11] 10 584 Humberto[17] 12.1 495 Proposed 8.9 523 In terms of latency Haller turned out to be better of the all the existing designs. Compared to it, this design is 11% faster and uses 10% to 11% less hardware. This is due to the absence of 10 s complement approach. In terms of area Hwang s proposal is by far the best one presented, as it uses only a single adder in its design. 7. Conclusion Existing and proposed architectures for the BCD and binary reconfigurable adders are presented, simulated and compared. A novel way of implementing subtraction in Binary Coded Decimal without the use of 10 s complement is explained. Though there is the necessity for the check of magnitude of the two numbers before subtraction it is implemented in a way as not to affect the latency. All the cases where correction is necessary and the logic correction blocks The proposed BCD and binary adder/subtractor is at least 11% faster than the fastest one till now while occupying a considerable amount of area. 8. References [1] Draft IEEE Standard for Floating-Point Arithmetic. New York: IEEE,Inc., 2004, http://754r.ucbtest.org/drafts. [2] Michael F. Cowlishaw, Decimal Floating Point: Algorithm for Computers, Proceedings of the 16 th IEEE Symposium on Computer Arithmetic (ARITH 03), pp 104-111, June 2003. [3] M.S.Schmookler and A.W. Weinderger, High Speed Decimal Addition, IEEE Transactions on. Computers, vol. C-20, pp. 862-867, August 1971. [4] W. Bultmann, W. Haller, H. Wetter, and A. Worner, Binary and Decimal Adder Unit," U.S. Patent #6,292,819, September 2001. [5] R.D. Kenney and M.J. Schulte, Multioperand Decimal Addition, Proc. IEEE CS Ann. Symp. VLSI, pp. 251-253, Feb. 2004. [6] M. J. Adiletta and V. C. Lamere. BCD Adder Circuit. Digital Equipment Corporation, US patent 4805131, pages 1 18, Jul 1989. [7] J. L. Anderson. Binary or BCD Adder with Precorrected Result. Motorola, Inc., US patent 4172288, pages 1 8, Oct 1979. [8] H. Fischer andw. Rohsaint. Circuit Arrangement for Adding or Subtracting Operands Coded in BCD-Code or Binary-Code. Siemens Aktiengesellschaft, US patent 5146423, pages 1 9, Sep 1992.

[9] M. J. Adiletta and V. C. Lamere. BCD Adder Circuit. Digital Equipment Corporation, US patent 4805131, pages 1 18, Jul 1989. [10] M.S.Schmookler and A. Weinderger. Decimal Adder for Directly Implementing BCD Addition Utilizing Logic Circuitry. International Business Machines Corporation, US patent 3629565, pages 1 19, Dic 1971. [11] W. Haller, U. Krauch, and H. Wetter. Combined Binary/Decimal Adder Unit. International Business Machines Corporation, US patent 5928319, pages 1 9, Jul 1999. [12] W. Haller, W. H. Li, M. R. Kelly, and H. Wetter. Highly Parallel Structure for Fast Cycle Binary and Decimal Adder Unit. International Business Machines Corporation, US patent 2006/0031289, pages 1 8, Feb 2006. [13] S. Hwang. High-Speed Binary and Decimal Arithmetic Logic Unit. American Telephone and Telegraph Company, AT&T Bell Laboratories, US patent 4866656, pages 1 11, Sep 1989. [16] S. R. Levine, S. Singh, and A. Weinberger. Integrated Binary-BCD Look-Ahead Adder. International Business Machines Corporation, US patent 4118786, pages 1 13, Oct 1978. [17] U. Grupe. Decimal Adder. Vereinigte Flugtechnische Werke-Fokker gmbh, US patent 3935438, pages 1 11, Jan 1976. [18] D.R.H. Calderón, G. N. Gaydadjiev, S. Vassiliadis, Reconfigurable Universal Adder, in Proceedings of the IEEE International Conference on Application-Specific Systems, Architectures, and Processors (ASAP 07), Montreal, Quebec, Canada, July 2007 [19] J. Sklansky, Conditional-sum addition logic, IRE Trans. Electronic Computers, vol. EC-9, pp. 226-231, June 1960 [20] Harris D, A taxonomy of parallel prefix networks, in Proc. 37th Asilomar Conf. Signals, Systems, and Computers, Nov. 2003, Vol. 2, pp. 2213-2217 [14] M. J. Adiletta and V. C. Lamere. BCD Adder Circuit. Digital Equipment Corporation, US patent 4805131, pages 1 18, Jul 1989. [15] J. L. Anderson. Binary or BCD Adder with Precorrected Result. Motorola, Inc., US patent 4172288, pages 1 8, Oct 1979.