A Low-Power Carry Skip Adder with Fast Saturation

Similar documents
Partitioned Branch Condition Resolution Logic

16 Bit Low Power High Speed RCA Using Various Adder Configurations

Implementation of Ripple Carry and Carry Skip Adders with Speed and Area Efficient

Hybrid Signed Digit Representation for Low Power Arithmetic Circuits

Design and Characterization of High Speed Carry Select Adder

A Quadruple Precision and Dual Double Precision Floating-Point Multiplier

Chapter 3: part 3 Binary Subtraction

IMPLEMENTATION OF DIGITAL CMOS COMPARATOR USING PARALLEL PREFIX TREE

Design and Analysis of Kogge-Stone and Han-Carlson Adders in 130nm CMOS Technology

Design and Implementation of CVNS Based Low Power 64-Bit Adder

Designing and Characterization of koggestone, Sparse Kogge stone, Spanning tree and Brentkung Adders

Performance of Constant Addition Using Enhanced Flagged Binary Adder

CPE300: Digital System Architecture and Design

Arithmetic Logic Unit. Digital Computer Design

Reduced Delay BCD Adder

Fundamentals of Computer Systems

High-Performance Full Adders Using an Alternative Logic Structure

CAD4 The ALU Fall 2009 Assignment. Description

Lecture 5. Other Adder Issues

DESIGN AND SIMULATION OF 1 BIT ARITHMETIC LOGIC UNIT DESIGN USING PASS-TRANSISTOR LOGIC FAMILIES

IMPLEMENTATION OF TWIN PRECISION TECHNIQUE FOR MULTIPLICATION

Design of Low Power Digital CMOS Comparator

VARUN AGGARWAL

A novel technique for fast multiplication

ECE 341 Midterm Exam

VLSI Implementation of Adders for High Speed ALU

University of Illinois at Chicago. Lecture Notes # 10

Arithmetic and Logical Operations

A Novel Carry-look ahead approach to an Unified BCD and Binary Adder/Subtractor

Performance Evaluation of Digital CMOS Comparator Using Parallel Prefix Tree.

Implementation of Floating Point Multiplier Using Dadda Algorithm

VHDL IMPLEMENTATION OF FLOATING POINT MULTIPLIER USING VEDIC MATHEMATICS

ECE468 Computer Organization & Architecture. The Design Process & ALU Design

High Speed Special Function Unit for Graphics Processing Unit

IMPLEMENTATION OF LOW POWER AREA EFFICIENT ALU WITH LOW POWER FULL ADDER USING MICROWIND DSCH3

Area-Time Efficient Square Architecture

Basic Arithmetic (adding and subtracting)

On-Line Error Detecting Constant Delay Adder

A 65nm Parallel Prefix High Speed Tree based 64-Bit Binary Comparator

ECE 341 Midterm Exam

the main limitations of the work is that wiring increases with 1. INTRODUCTION

A Review of Various Adders for Fast ALU

HIGH PERFORMANCE QUATERNARY ARITHMETIC LOGIC UNIT ON PROGRAMMABLE LOGIC DEVICE

Chapter 6 ARITHMETIC FOR DIGITAL SYSTEMS

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

By, Ajinkya Karande Adarsh Yoga

Chapter 4 Design of Function Specific Arithmetic Circuits

International Journal of Advanced Research in Electrical, Electronics and Instrumentation Engineering

High Performance and Area Efficient DSP Architecture using Dadda Multiplier

OUTLINE Introduction Power Components Dynamic Power Optimization Conclusions

Analysis of Different Multiplication Algorithms & FPGA Implementation

A Novel Design of 32 Bit Unsigned Multiplier Using Modified CSLA

Analysis of Performance and Designing of Bi-Quad Filter using Hybrid Signed digit Number System

Area-Delay-Power Efficient Carry-Select Adder

Delay Optimised 16 Bit Twin Precision Baugh Wooley Multiplier

Carry Select Adder with High Speed and Power Efficiency

Low Power Circuits using Modified Gate Diffusion Input (GDI)

Lecture 3: Basic Adders and Counters

AN IMPROVED FUSED FLOATING-POINT THREE-TERM ADDER. Mohyiddin K, Nithin Jose, Mitha Raj, Muhamed Jasim TK, Bijith PS, Mohamed Waseem P

Design and Implementation of an Eight Bit Multiplier Using Twin Precision Technique and Baugh-Wooley Algorithm

Design and Implementation of Signed, Rounded and Truncated Multipliers using Modified Booth Algorithm for Dsp Systems.

Binary Arithmetic. Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T.

DIGITAL ARITHMETIC: OPERATIONS AND CIRCUITS

Performance Evaluation of Guarded Static CMOS Logic based Arithmetic and Logic Unit Design

Implementation of ALU Using Asynchronous Design

Learning Outcomes. Spiral 2-2. Digital System Design DATAPATH COMPONENTS

Chapter 4 Arithmetic Functions

1. Mark the correct statement(s)

SIDDHARTH INSTITUTE OF ENGINEERING AND TECHNOLOGY :: PUTTUR (AUTONOMOUS) Siddharth Nagar, Narayanavanam Road QUESTION BANK UNIT I

VTU NOTES QUESTION PAPERS NEWS RESULTS FORUMS Arithmetic (a) The four possible cases Carry (b) Truth table x y

DESIGN OF QUATERNARY ADDER FOR HIGH SPEED APPLICATIONS

A Novel Floating Point Comparator Using Parallel Tree Structure

FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN

High Speed Han Carlson Adder Using Modified SQRT CSLA

Implementation of 64-Bit Kogge Stone Carry Select Adder with ZFC for Efficient Area

Fundamentals of Computer Systems

THE latest generation of microprocessors uses a combination

FLOATING POINT ADDERS AND MULTIPLIERS

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator

OPTIMIZATION OF AREA COMPLEXITY AND DELAY USING PRE-ENCODED NR4SD MULTIPLIER.

Number Systems and Computer Arithmetic

M.J. Flynn 1. Lecture 6 EE 486. Bit logic. Ripple adders. Add algorithms. Addition. EE 486 lecture 6: Integer Addition

Learning Outcomes. Spiral 2 2. Digital System Design DATAPATH COMPONENTS

VLSI Design Of a Novel Pre Encoding Multiplier Using DADDA Multiplier. Guntur(Dt),Pin:522017

A Unified Addition Structure for Moduli Set {2 n -1, 2 n,2 n +1} Based on a Novel RNS Representation

Learning Outcomes. Spiral 2 2. Digital System Design DATAPATH COMPONENTS

Principles of Computer Architecture. Chapter 3: Arithmetic

Lecture Topics. Announcements. Today: Integer Arithmetic (P&H ) Next: continued. Consulting hours. Introduction to Sim. Milestone #1 (due 1/26)

VLSI Arithmetic Lecture 6

Compound Adder Design Using Carry-Lookahead / Carry Select Adders

Henry Lin, Department of Electrical and Computer Engineering, California State University, Bakersfield Lecture 7 (Digital Logic) July 24 th, 2012

Combinational Logic II

Software and Hardware

FPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE Standard

LECTURE 4. Logic Design

RADIX-4 AND RADIX-8 MULTIPLIER USING VERILOG HDL

Study, Implementation and Survey of Different VLSI Architectures for Multipliers

Srinivasasamanoj.R et al., International Journal of Wireless Communications and Network Technologies, 1(1), August-September 2012, 4-9

Carry-Free Radix-2 Subtractive Division Algorithm and Implementation of the Divider

Transcription:

A Low-Power Carry Skip Adder with Fast Saturation Michael Schulte,3, Kai Chirca,2, John Glossner,2,Suman Mamidi,3, Pablo Balzola, and Stamatis Vassiliadis 2 Sandbridge Technologies, Inc. White Plains, NY 060 USA jglossner@sandbridgetech.com 2 Delft University of Technology Electrical Engineering, Mathematics and Computer Science Department Delft, The Netherlands 3 University of Wisconsin Dept. of ECE 5 Engineering Drive Madison, WI, 53706, USA ABSTRACT In this paper, we present the design of a low power carry-skip adder with fast saturation that achieves low power dissipation and high-performance operation. To reduce the adder s delay and power consumption, the adder is divided into variable-sized blocks that balance the inputs to the carry chain. The optimum block sizes for minimizing the critical path delay with complementary carry generation are achieved. Within blocks, highly optimized carry-lookahead logic, which computes block generate and block propagate signals, is used to further decrease delay. The adder architecture decreases power consumption by reducing the number of logic levels, switching glitches, and transistors. To achieve balanced delay, input bits are grouped unevenly in the carry chain. This grouping reduces active power by minimizing extraneous glitching and transitions. The design has been simulated in both 0.3u and 90nm CMOS technology. Worst case performance is 830MHz and.2ghz, respectively. Worst case power dissipation normalized to 600MHz operation is 0.mW and 0.25mW, respectively. Extensions to the carry-skip adder allow it to quickly perform saturating addition, which is useful in a variety of digital signal processing and multimedia applications. Index Term Low Power, Computer Arithmetic, Carry Skip Adders.. INTRODUCTION Adders are fundamental arithmetic components in many computer systems, since addition is the most common arithmetic operation in a wide variety of important applications []. Consequently, several adder implementations, including ripple carry, Manchester carry chain, carry skip, carry lookahead, carry select, conditional sum, and various parallel prefix adders are available to satisfy different area, delay, and power requirements [2-5]. Ripple carry and Manchester carry chain adders are the simplest, but slowest adders with O(n) area and O(n) delay, where n is the operand size in bits. Carry lookahead, conditional sum, and parallel prefix adders have 0(n log(n)) area and O(log(n)) delay, but typically suffer from irregular layout. Carry skip adders, which have O(n) area and O( n ) delay provide a good compromise in terms of area and delay, along with a simple and regular layout [0]. As demonstrated in [3,] carry skip adders also dissipate less power than carry lookahead, conditional sum, carry select, and parallel prefix adders due to their low transistor counts and short wire lengths. Several studies have been performed to reduce the delay of carry-skip adders by using variable block sizes [6-0]. Although these studies are useful in theory, they include assumptions that are not realistic for modern high-performance circuit design. Furthermore, they do not take into account delay optimizations that can achieved by complementary carry generation and the use of carry lookahead addition techniques within the blocks. Previously we have presented the design and implementation of a full-static carry skip adder with low power consumption and low critical path delay [5]. The adder reduces the delay and power consumption by using variable-sized blocks to balance the inputs to the carry chain. Further reductions in delay are achieved by using complementary carry logic in the carry chain and by using highly optimized carry lookahead logic to compute block generate and block propagate signals. The adder has just seven logic levels on the critical delay path, where each logic level corresponds to one complex CMOS gate. Efficient AndOrInverters (s) and OrAndInverters (OAIs) are used to reduce delay and power. In Section 2, we present our Carry Skip Adder implementation. In Section 3, we present our results. In Section, we present extensions to the carryskip adder that allow it to quickly perform saturating addition. In Section 5, we give some concluding remarks. 2. 32-BIT ADDER IMPLEMENTATION An archicture may specify that two numbers are to be added together to produce specific results. An example is two s complement encoded operands being added or subtracted to produce a two s complement result. An example of an admissible implementation of an adder that performs a particular arithmetic function may be a Carry Lookahead Adder (CLA).

Often, there are multiple admissible implementations that logically perform the same architectural function. In the case of adders, Ripple Carry Adders (RCA), Carry Skip Adders (CSKA), Carry Select Adders (CSEA), and Carry Lookahead Adders (CLA) are all examples of admissible implementations. It is sometimes possible to incorporate different types of adders in different portions of the arithmetic circuit. An example may be an 8-bit RCA section followed by an 8-bit CLA section to perform a 6-bit addition. Different implementations are developed to optimize various design parameters. Some important design parameters include propagation delay, area utilization, and power dissipation. Most adder implementations tend to trade off performance and area. Occasionally dynamic switching power is considered. In our implementation, we are concerned with minimizing dynamic switching power and short circuit power. With carry skip adders [2], the linear growth of adder carry generation delay with the size of the input operands is improved by allowing carries to skip across blocks of bits, rather than rippling through them. The idea behind our adder design is to find the optimal bit partitioning to balance the propagation delay of the inputs to the carry chain. For the less significant portion of the adder, fewer bits are combined into blocks to speed up the carry generation. As the propagation delay increases along the carry chain, more bits are combined into blocks to generate the carry out. where The carry out of the jth bit C j+, can be expressed as: C j+ = G j + P j C j () G j = A j B j generate signal (2) P j = A j + B j propagate signal Assume a block has input bits from i to j, then C i is the carry in from the previous block. Expanding the equations above, we get: C j+ = G j + P j G j- + P j P j- G j-2 + + P j P i C i = (G j +P j G j- +P j P j- G j-2 + +P j P j- P i+ G i )+P j P i C i = G block + P block C i = G j:i + P j:i C i () Therefore, while C i is propagating along the carry chain, G block and P block can be calculated in parallel. The number of bits, j-i+, involved in one block strongly depends on the propagation delay of C i in order to guarantee that G block and P block signals have the same propagation delay. Then, the logic gate used to generate the block carry out, C j, will have a glitch-free output, since all three input delays are matched. Applying this rule along the carry chain optimizes both the block partitions and the logic depth on the critical path. 2. TOP-LEVEL ARCHITECTURE In order to favor the carry generation at the lower bits and the sum generation at the higher bits, and to balance the inputs along the carry chain, a variable-sized block partition scheme is used, as shown in Figure. A 3:0 and B 3:0 are two 32-bit input operands and C 0 is the carry in. S 3:0 is the 32-bit sum output of the addition operation with a carry out signal of C 32. The smaller blocks at the two ends of the diagram are used to speed up the carry and sum generation. The block sizes in the middle part of the adder, which are multiples of three, balance the carry chain inputs very well. The reason for selecting these block sizes is explained below for each carry skip block. In the following discussion, we assume the three inputs A, B, and C 0 are fed into the adders from input registers and have the same delay at the inputs of the adder. B 3:28 A 3:28 B 27:0 A 27:0 B 9: A 9: B 3:0 A 3:0 C 32 CS CS8 CS6 CS C C 28 0 C 0 C S 3:28 S 27:0 S 9: S 3:0 Figure. Block diagram of a 32-bit carry-skip adder with optimized block sizes

A 3 B 3 A 2 B 2 C 3 A B P 3 () G 3 () P 2 () G 2 () S 3 S 2 OAI P 3:2 (2) G 3:2 (2) P () G () S A 0 B 0 C C 2 (2) OAI C () C 0 S 0 Figure 2. Block diagram of the first CS block 2.2 FIRST -BIT BLOCK For the least significant bits, fast carry generation is more important than sum generation, since the sum bits are not on the critical delay path. To match the propagation delay of each carry in the carry chain, block P and G logic combines a small numbers of input bits, which means that block sizes are small. A structure similar to a ripple carry adder is used for the least significant bits because the ripple carry adder has good performance and power consumption when there are only a few adder input bits. The logic depth, however, accumulates to make larger block sizes advantageous at the more significant end. The circuit architecture of the first -bit carry skip block (CS) is shown in Figure 2. The LSB is on the right and the MSB is on the left. The delays for signals that affect the critical path delay are shown in parenthesis next to the signal s name. For example, P 3:2 (2) indicates that the block propagate signal, P 3:2, is available after two complex gate delays. The LSB addition is implemented using a standard full adder (). Since the sum generation is not on the critical path here, minimal size devices are used to reduce power consumption. This takes one complex gate delay to generate the inverted carry out, C. Producing an inverted carry reduces the logic depth on the carry chain. In the second, P and G logic operates in parallel with the carry generation in the first. The P and G logic computes inverted outputs G A B = inverted generate signal (5) P = + inverted propagate signal (6) A B As with the inverted carry, C, we use one logic level to generate the inverted G and P, instead of G and P to reduce the logic path length. The propagate signal is implemented using a NOR gate instead of an XNOR gate to simplify the logic without affecting the next carry generation. From (), the carry output of this bit, C 2 is expressed as: C 2 = G + P C 0 = G P + ) (7) ( C A simple OrAndInverter (OAI) gate takes the inverted inputs and implements the desired AndOr function producing the carry C 2 after one more complex gate delay. By matching the path lengths on all the inputs of the OAI gate, glitches do not occur on its output. This assumes the NAND, NOR and OAI gates have the same delays, which is done by properly sizing the transistors of the logic gates. As shown in Figure 2, C 2 is available after two complex gate delays. The next block is a 2-bit adder with inputs A 3:2 and B 3:2, and carry in signal C 2. To match the two units of propagation delay on C 2 from (7), two generate and propagate signals are combined to form, G 3:2 and P 3:2. In Figure 2, the logic to produce G 3:2 and P 3:2 has two gate delays. The inverted carry out is computed as: C = G3 + P3 G2 + P3 P2 C2 = G3:2 + P3:2 C2 G 3:2 = G 3 + P 3 G 2 P 3:2 = P 3 P 2 (8)

G 3:2 and P 3:2 each are ready after two gate delays, which match the delay of the input C 2. They also have the same polarity as C 2. Since all the inputs are non-inverted, an AND-OR-INVERT () cell is used to generate the inverted carry, C. As shown in Figure 2, C is available after three complex gate delays and the inputs used to compute C are balanced. 2.3 6-BIT BLOCK The 6-bit carry skip block has C from the previous block as the carry in signal. To match the C delay, G block and P block of this block should have three levels of logic depth before entering the carry chain. This allows three pairs of bits to be used as inputs to the P and G logic of this block. A block diagram for the 6-bit carry skip block (CS6) is shown in Figure 3. At the top of Figure 3, the two 3-bit adders are identical. The block propagate and block generate signals, P 9:7, G 9:7, P 6:, and G 6: are ready after two gate delays. Then these signals are fed into the third level P and G logic to produce P 9: and G 9:. Notice that after 3 levels of logic, the signal outputs are all inverted, which means an OAI gate is used in the carry chain of this block and the carry out polarity is flipped back again. The equation for C 0 is: C = G P + ) (9) 0 9: ( 9: C A 9:7 B 9:7 A 6: B 6: C 7 P 9,7 (2) G 9,7 (2) P 6: (2) G 6: (2) S 9:7 S 6: P 9: G 9: C 0 () OAI C Figure 3. Block diagram of CS6 block Sum generations in this block require some additional design consideration. If the second 3-bit adder takes the ripple carry out from the first 3-bit adder block, the subsequent sum generation will be on the critical path. To avoid ripple carries inside the block for sum generation, we use the carry skip logic to produce C 7. Therefore, P 6: and G 6: from the first 3-bit block are re-used together with carry in C to produce the carry in C 7 for the higher 3-bit block in order to reduce the delay path for sum generation. This way the higher 3-bit sum delay is only one gate delay greater than the lower 3-bit block delay. By the end of this block, the 0 lower bits of the operands are added with a carry chain logic depth of, and all the inputs along the carry chain are well balanced. Every bit input takes the same number of logic steps to the carry out C 0. Also notice that compared to the previous carry skip block, which has bits 2 and 3 joined in Figure 2, this block has 6 bits combined. Thus, a one logic level increase of P and G logic triples the number of input bits. The reason for this tripling comes from the usage of complex logic cells to generate P 6:, P 9:7, G 6:, and G 9:7. This ratio is also used in the next block. 2. 8-BIT BLOCK Since the carry generated by the previous block, C 0, takes levels of logic, the block P and G logic in this block has one more logic level to balance the carry chain inputs. Applying the factor of three discussed in the previous section, the total number of bits in this block is three times as large as the previous blocks, which gives a block size of 8 bits. The block diagram for the 8-bit carry skip block (CS8) is shown in Figure. Notice that the sum calculations become more critical as the bit number increases. The 3-bits adders in Figure use local P and G logic to balance the carry inputs for each 3-bit adder for sum generations. The capacitance on carry in C 0 may also become high. This can be mitigated with local buffers, which are not shown in Figure.

A 27:25 B 27:25 A 2:22 B 2:22 A 2:9 B 2:9 A 8:6 B 8:6 A 5:3 B 5:3 A 2:0 B 2:0 S 27:25 S 2:22 S 2:9 S 8:6 S 5:3 S 2:0 P 27,25 (2) G 27,25 (2) P 2,22 (2) G 2,22 (2) P 2,9 (2) G 2,9 (2) P 8,6 (2) G 8,6 (2) P 5,3 (2) G 5,3 (2) P 2,0 (2) G 2,0 (2) CLL CLL Group Propagate Group Generate Group Propagate Group Generate P 27,9 G 27,9 P 8,0 G 8,0 Group Propagate Group Generate P 27,0 () G 27,0 () C 28 (5) C 0 () Figure. Block diagram of CS8 block By the end of this block, 28 pairs of operand bits have been added, but only 5 levels of logic were used along the carry chain. Again, all critical paths are balanced to avoid glitches. 2.5 FINAL -BIT BLOCK Since there are only bit inputs left in this block, and the sum generation is more critical than the carry generation, more effort is put into calculating the sum by using carry select logic to obtain the correct sum, and the same carry skip blocks to obtain the carries. The block diagram for the final -bit carry skip block (CS) is shown in Figure 5. As shown in Figure 5, the logic depth of the final carry out, C 32, is equal to 6. To balance the inputs along the carry chain with only a few operand bits, XNOR gates are employed to obtain the propagate signal instead of NOR gates. This signal can also be shared by the sum generations to save hardware. Since the critical paths are well balanced on the carry chain, the critical paths of the 32-bit adder include the four sum outputs in this block. By using local carry skip adders and carry select style logic, the propagation delays of the sums are equal to 7 logic levels, which is one level more than the delay of the carry chain.

A 3 B 3 A 30 B 30 A 29 B 29 A 28 B 28 Carry Select Carry Select Carry Select Carry Select P 3 () G 3 () S 3 (7) P 30 () G 30 () S 30 P 29 () G 29 () S 29 P 28 () G 28 () S 28 OAI OAI P 3:30 G 3:30 P 29:28 G 29:28 Group Propagate and Generate P 3,28 (5) G 3,28 (5) C 32 (6) C 28 (5) Figure 5. Block diagram of final CS block 3. RESULTS The design was implemented for different technologies and process corners. All input delay balancing along the carry chain was implemented in layout. To also balance wiring lengths, the carry chain is put close to the block P and G logic. Full spice simulation with wire load estimation was performed on the circuit. The performance results are shown in for both 0.3u and 90nm processes in Table, where SS, TT, and FF correspond to slow (Vdd 0%), typical (Vdd), and fast (Vdd+0%) process corners. Power analysis is done at different process corners as well, and results are shown in Table 2. SS TT FF 0.3u 830MHz.08V 970MHz.2V.8MHz.32V 90nm.20GHz 0.9V.32GHz.0V.8GHz.V Table. Performance Results 0.3u, 600MHz 90nm, 600MHz SS TT FF 0.38mW 0.2mW 0.8mW 0.22mW 0.25mW 0.29mW Table 2. Power Results The adder average power does not change much across different process corners, but the speed is increased dramatically from the SS to FF corners. By balancing the critical paths throughout the design, glitches are eliminated. Unnecessary switching during the addition is reduced by using complementary logic gates, multiple input complex logic gates, and by sharing logic among signals. Therefore, this adder design has improved the performance and also reduced the power, compared to previous carry skip adders.

. ST SATURATING ADDITION Our low-power carry-skip adder has been extended to perform fast saturating addition. With saturating addition, results of operations that overflow (i.e., that are too small or too large to represent in a specified number format) are set to the most positive or most negative number that can be represented in the specified format. Our carry-skip adder with saturation is specifically designed for digital signal processing (DSP) applications, which often process data as fixed-point operands. On our target DSP architecture, data is typically represented using one of three wide two s complement number formats:. A 32-bit fraction with a sign bit and 3 fraction bits 2. A 0-bit sign-extended fraction with 8 bits of sign extension, a sign bit, and 3 fraction bits 3. A 0-bit mixed-number with a sign bit, 8 integer guard bits, and 3 fraction bits. Each of these formats is shown in Figure 6. The 32-bit fraction format is supported by using the 0-bit sign-extended fraction format and ignoring the 8 most significant bits. Our target architecture also supports 6-bit fractions, 2-bit sign-extended fractions, and 2-bit mixed numbers by ignoring the 6 least significant bits in the corresponding wide format number. In a variety of DSP applications, it is useful to perform operations on operands in the mixed-number format and produce saturated results that are in the fractional format or sign-extended fractional format. It is also useful to have a single adder than can produce results in the mixed-number format, fractional format, or sign-extended fractional format. 3 30 0 sign bit 3 fraction bits 39 32 3 30 0 8 extend bits sign bit 3 fraction bits 39 38 3 30 0 sign bit 8 guard bits 3 fraction bits Fraction Sign- extended Fraction Mixed number Figure 6. Wide Two s Complement Number Formats Techniques have been developed to perform overflow detection and saturation with two s complement addition [-]. When the input and result operands all use the same format, overflow is often detected by examining the sign bits of the input and result operands. If the input operands have the same sign and the sign of the result is different, then overflow occurs and the result should be saturated; otherwise overflow is guaranteed not to occur. Another method for detecting this same condition is to examine the carries into and out of the sign bit. If the carry into the sign bit differs from the carry out of the sign bit, then overflow occurs and the result should be saturated; otherwise overflow is guaranteed not to occur. Although these techniques work well when the input and results operands use the same format, they cannot be used when the input and result operands have different formats. A straight-forward mechanism for performing two s complement saturating addition when the input operands are in mixed-number format and the output operands are in fractional format or sign-extended fractional format involves producing a result in the mixed-number format and then examining that result and the sign bits of the input operands to determine if overflow occurs and if the final result should be saturated. This can be accomplished by have one circuit that detects if overflow occurs in the mixed number format and a second circuit that detects if the mixed-number result cannot be exactly represented in the fractional format. The first circuit can detect overflow by examining the sign bits of the input operands and the sign bits of the result, as described in the previous paragraph. The second circuit can detect overflow by comparing the sign bit of the result with all of the guard bits of the result. If the sign bit differs from any of these bits than overflow has occurred. Although this approach correctly detects overflow, it has the disadvantage that the guard bits of the result must be computed before it can be determined if overflow has occurred. Our technique for fast saturating addition, uses an enhanced form of carry select addition to compute the most significant bits of the result and to determine if the result overflows. To perform high-speed saturating addition on 0-bit operands, we send the 32 least significant bits of both operands (A 3:0 and B 3:0 ) into the 32-bit carry skip adder described in Section 3. The carry skip adder computes the temporary sum, T 3:0, and a carry-out bit, C 32, as

{C 32, T 3:0 } = A 3:0 + B 3:0 (0) where {C, T} indicates C and T are concatenated. In parallel with this, the 8 most significant bits of both operands (A 39:32 and B 39:32 ) are sign extended by one bit and sent into a dual-adder. The 9-bit dual-adder computes two temporary sums T0 0:32 = {A 39,A 39:32 } + {B 39,B 39:32 } (used when C 32 = 0) () T 0:32 = {A 39,A 39:32 } + {B 39,B 39:32 } + (used when C 32 = ) (2) Thus, T 0:32 is always one more than T0 0:32. By sign-extending A 39:32 and B 39:32 one bit, the most significant bit of the temporary sums, T0 0 and T 0, always contain the true sign for the result of the addition, and can be used to determine the correction direction for saturation. Techniques similar to those presented in [-5] can be used to reduce the area, delay, and power consumption of the dual adder. The outputs of the dual adder, T0 0:32 computes the following saturation flags. and T 0:32, are fed as inputs to a preliminary saturation detection unit, which + Sat 0 = T 00 ( T 039 + T 038 + T 037 + T 036 + T 035 + T 03 + T 033 + T 032 ) + Sat = T0 ( T39 + T38 + T37 + T36 + T35 + T3 + T33 + T32 ) Sat 0 = T 00 ( T 039 + T 038 + T 037 + T 036 + T 035 + T 03 + T 033 + T 032 ) Sat = T0 ( T39 + T38 + T37 + T36 + T35 + T3 + T33 + T32 ) (3) () (5) (6) Thus, the final sum should be saturated to the most positive number if the most significant temporary sum bit, T0 0 or T 0, is zero and any of the guard bits are one. It should be saturated to the most negative number if the most significant temporary sum bit, T0 0 or T 0, is one and any of the guard bits are zero. Otherwise, the final result should not be saturated. Once C 32 is available, it is used as the select signal on a multiplexer, which outputs T 0 = T0 0, Sat + = Sat0 +, and Sat - = Sat0 - when C 32 = 0, and outputs T 0 = T 0, Sat + = Sat +, and Sat - = Sat - when C 32 = 0. The saturation logic then takes the temporary sum T0 39:0 and the saturation bits, Sat + and Sat -. If Sat + =, positive overflow has occurred and the result should be saturated to the most positive representable number; if Sat - =, negative overflow has occurred and the result should be saturated to the most negative representable number; otherwise, overflow has not occurred and the correct result is T0 39:0. Figure 7 shows a block diagram of our 0-bit two s complement adder with fast saturation. The input operands, A and B, which are each 0 bits, are divided into 32-bit less significant portions and 8-bit more significant portions. The least significant portions, A 3:0 and B 3:0, are fed as inputs to a carry-propagate adder, which computes the temporary sum, T 3:0, and carry-output, C 32, based on Equation (0). The most significant portions, A 39:32 and B 39:32, are sign extended one bit and fed as inputs to a dual adder, which computes T0 39:32 and T 39:32, based on Equations () and (2). These values are then sent to a preliminary saturation detection unit, which compute Sat0 +, Sat +, Sat0 -, and Sat -, based on Equations (3) to (6). The carry-output, C 32, is used as the select signal of a multiplexor, which determines the values of T 39:32, Sat +, and Sat - that go to the saturation logic. The saturation logic outputs T0 39:0, the most positive representable number, or the most negative representable number, based on the values of Sat + and Sat -. The output carry, C 32, is ready after seven complex gate delays. Once it is ready, the signals T 39:32, Sat +, and Sat - are ready after one 0-bit 2-to- multiplexor delay, and the final sum, S 39:0, is ready after one 0-bit 3-to- multiplexer delay. Minor extensions to the carry skip adder with fast saturation can be made to allow it to support other useful operations. For example, two s complement subtraction is performed by inverting the B input operand and setting C 0 to one. Wraparound addition is performed by modifying the saturation detection so that T0 39:0 is always selected. One simple method for accomplishing this is to AND the saturation signals, Sat + and Sat -, with a control signal, sat, which is one when saturating addition is performed and zero when wrap-around addition is performed.

A 39:32 A 3:0 B 39:32 B 3:0 Dual Adder & Preliminary Saturation Detection Carry Skip Adder C 0 C 32 Multiplexor Sat + Sat - T 39:32 Saturation Logic T 3:0 T 39:0 Figure 7. Block Diagram of a Two s Complement Adder with Fast Saturation 5. CONCLUSIONS We have presented a carry skip adder implementation with balanced critical paths. To achieve balanced delay, input bits are grouped into variable-sized carry skip blocks. This grouping reduces active power by minimizing extraneous glitching and transitions. The adder has been implemented in both 0.3u and 90nm CMOS technology. Worst case performance is 830MHz and.2ghz, respectively. Worst case power dissipation normalized to 600MHz operation is 0.mW and 0.25mW, respectively. Extensions to the carry-skip adder allow it to quickly perform saturating addition and subtraction, which are useful in a many digital signal processing and multimedia applications. REFERENCES [] K. Uming, T. Balsara, and W. Lee, Low-power design techniques for high-performance CMOS adders, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 3, no. 2, pp. 327-333, June 995. [2] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press, New York, 2000. [3] C. Nagendra, M. J. Irwin, and R. M. Owens, Area-time-power tradeoffs in parallel adders, IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 53, no. 0, pp 689-702, October, 996. [] T. K. Callaway, and E. E. Swartzlander, Jr., Estimating the power consumption of CMOS adders, Proceedings of the th IEEE Symposium on Computer Arithmetic, pp. 20-26, July, 993. [5] V. G. Oklobdzija, B. R. Zeydel, H. Dao, S. Mathew, and R. Krishnamurthy, Energy-delay estimation technique for highperformance microprocessor VLSI adders, Proceedings of the 6th IEEE Symposium on Computer Arithmetic, pp. 272-279, June 2003. [6] A. Guyot, B. Hochet, and J. Muller, A way to build efficient carry-skip adders, IEEE Transactions on Computers, vol. 36, Oct. 987. [7] S. Turrini, Optimum group distribution in carry-skip adders, in Proceedings of the 9th IEEE Symposium on Computer Arithmetic, pp. 96 03, 989. [8] P. Chan, M. Schlag, C. Thomborson, and V. Oklobdzija, Delay optimizationof carry-skip adders and block carry-lookahead adders using multidimensional dynamic programming, IEEE Transactions on Computers, vol., no. 8, pp. 920-930, August 992.

[9] V. Kantabutra, Designing optimum one-level carry-skip adders, IEEE Transactions on Computers, vol. 2, no. 6, pp. 759 76, June 993. [0] M. Alioto and G. Palumbo, A simple strategy for optimized design of one-level carry-skip adders, IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 50, no., pp. -8, January 2003. [] N. Yadav, M. J. Schulte, and J. Glossner, Parallel Saturating Fractional Arithmetic Units, Proceedings of the Ninth Great Lakes Symposium on VLSI, pp. 2-27, March, 999. [2] Lee et al., Efficient Hardware Handling of Positive and Negative Overflow Resulting from Arithmetic Operations, United States Patent No. 5,8,509, September 5, 995. Anderson, Fast Overflow and Underflow Limiting Circuit for Signed Adder, United States Patent No. 5,6,9, November 7, 992. [3] Tyagi, A., "A Reduced-area Scheme for Carry-Select Adders," IEEE Transactions on Computers, vol. 2, no. 0, pp. 63-70, Oct. 993. [] Y. Kim; L.-S. Kim, "A Low Power Carry Select Adder with Reduced Area," Proceedings of the 200 IEEE International Symposium on Circuits and Systems, vol., pp. 28-22, 200. [5] K. Chirca, M. Schulte, J. Glossner, S. Mamidi, and S. Vassiliadis, A Static Low-Power, High-Performance 32-bit Carry Skip Adder, Accepted for publication at Euromicro 200 Symposium on Digital System Design (DSD), August 3 September 3, 200, Rennes, France.