A Taxonomy of Parallel Prefix Networks


David Harris
Harvey Mudd College / Sun Microsystems Laboratories
301 E. Twelfth St., Claremont, CA 91711
David_Harris@hmc.edu

Abstract - Parallel prefix networks are widely used in high-performance adders. Networks in the literature represent tradeoffs between number of logic levels, fanout, and wiring tracks. This paper presents a three-dimensional taxonomy that not only describes the tradeoffs in existing parallel prefix networks but also points to a family of new networks. Adders using these networks are compared using the method of logical effort. The new architecture is competitive in latency and area for some technologies.

I. INTRODUCTION

A parallel prefix circuit computes N outputs {Y_N, ..., Y_1} from N inputs {X_N, ..., X_1} using an arbitrary associative two-input operator ∘ as follows [13]:

  Y_1 = X_1
  Y_2 = X_2 ∘ X_1
  Y_3 = X_3 ∘ X_2 ∘ X_1
  ...
  Y_N = X_N ∘ X_{N-1} ∘ ... ∘ X_2 ∘ X_1                        (1)

Common prefix computations include addition, incrementation, priority encoding, etc. Most prefix computations precompute intermediate variables {Z_{N:N}, ..., Z_{1:1}} from the inputs. The prefix network combines these intermediate variables to form the prefixes {Z_{N:1}, ..., Z_{1:1}}. The outputs are postcomputed from the inputs and prefixes. For example, adders take inputs {A_N, ..., A_1}, {B_N, ..., B_1}, and C_in and produce a sum output {S_N, ..., S_1} using intermediate generate (G) and propagate (P) prefix signals. The addition logic consists of the following calculations and is shown in Fig. 1.

  Precomputation:   G_{i:i} = A_i · B_i      G_{0:0} = C_in
                    P_{i:i} = A_i ⊕ B_i      P_{0:0} = 0        (2)

  Prefix:           G_{i:j} = G_{i:k} + P_{i:k} · G_{k-1:j}
                    P_{i:j} = P_{i:k} · P_{k-1:j}               (3)

  Postcomputation:  S_i = P_i ⊕ G_{i-1:0}                       (4)

Fig. 1. Prefix computation: 4-bit adder.

There are many ways to perform the prefix computation. For example, serial-prefix structures like ripple carry adders are compact but have a latency of O(N). Single-level carry lookahead structures reduce the latency by a constant factor. Parallel prefix circuits use a tree network to reduce latency to O(log N) and are widely used in fast adders, priority encoders [3], and other prefix computations. This paper focuses on valency-2 prefix operations (i.e. those that use 2-input associative operators), but the results readily generalize to higher valency [1].

Many parallel prefix networks have been described in the literature, especially in the context of addition. The classic networks include Brent-Kung [2], Sklansky [11], and Kogge-Stone [8]. An ideal prefix network would have log2 N stages of logic, a fanout never exceeding 2 at each stage, and no more than one horizontal track of wire at each stage. The classic architectures deviate from ideal with 2 log2 N - 1 stages, fanout of N/2 + 1, and N/2 horizontal tracks, respectively. The Han-Carlson family of networks [5] offers tradeoffs in stages and wiring between Brent-Kung and Kogge-Stone. The Knowles family [7] similarly offers tradeoffs in fanout and wiring between Sklansky and Kogge-Stone, and the Ladner-Fischer family [10] offers tradeoffs between fanout and stages between Sklansky and Brent-Kung. The Kowalczuk, Tudor, and Mlynek prefix network [9] has also been proposed, but this network is serialized in the middle and hence not as fast for wide adders.

This paper develops a taxonomy of parallel prefix networks based on stages, fanout, and wiring tracks. The area of a datapath layout is the product of the number of rows and columns in the network. The latency strongly depends on fanout and wiring capacitance, not just the number of logic levels. Therefore, the latency is evaluated using the method of logical effort [12, 6].
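As a concrete illustration of the precomputation, prefix, and postcomputation steps of EQs (2)-(4), the short Python sketch below adds two small numbers. It is not code from the paper: the names black_cell and prefix_add are illustrative, and the prefix step is written as a simple left-to-right scan purely for clarity, since any network that produces the same group generate/propagate prefixes (including the parallel trees discussed below) is equally valid.

def black_cell(hi, lo):
    """EQ (3): combine (G_i:k, P_i:k) with (G_k-1:j, P_k-1:j) into (G_i:j, P_i:j)."""
    (g_hi, p_hi), (g_lo, p_lo) = hi, lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def prefix_add(a_bits, b_bits, c_in=0):
    """Add two little-endian bit lists using the precompute/prefix/postcompute flow."""
    n = len(a_bits)
    # Precomputation, EQ (2): bitwise generate/propagate, plus the carry-in term.
    gp = [(c_in, 0)] + [(a & b, a ^ b) for a, b in zip(a_bits, b_bits)]
    # Prefix: accumulate (G_i:0, P_i:0); a serial scan is the simplest correct
    # network, and a parallel prefix tree computes the same values in O(log N) levels.
    prefixes = [gp[0]]
    for i in range(1, n + 1):
        prefixes.append(black_cell(gp[i], prefixes[i - 1]))
    # Postcomputation, EQ (4): S_i = P_i xor G_(i-1):0.
    return [gp[i][1] ^ prefixes[i - 1][0] for i in range(1, n + 1)]

print(prefix_add([1, 1, 0, 1], [0, 1, 1, 0]))  # 11 + 6 = 17 -> [1, 0, 0, 0] (low 4 bits)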
The taxonomy suggests new families of networks with different tradeoffs. One of these networks has area comparable with the smallest known network and latency comparable with the fastest known network.

Section II reviews the parallel prefix networks in the literature. Section III develops the taxonomy, which reveals a new family of prefix networks. Performance comparison appears in Section IV and Section V concludes the paper.

II. PARALLEL PREFIX NETWORKS

Parallel prefix networks are distinguished by the arrangement of prefix cells. Fig. 2 shows six such networks for N = 16. The upper box performs the precomputation and the lower box performs the postcomputation. In the middle, black cells, gray cells, and white buffers comprise the prefix network. Black cells perform the full prefix operation, as given in EQ (3). In certain cases, only part of the intermediate variable is required. For example, in many adder cells, only the G_{i:0} signal is required, and the P_{i:0} signal may be discarded. Such gray cells have lower input capacitance. White buffers are used to reduce the loading of later non-critical stages on the critical path. The span of bits covered by each cell output appears near the output. The critical path is indicated with a heavy line.

The prefix graphs illustrate the tradeoffs in each network between number of logic levels, fanout, and horizontal wiring tracks. All three of these tradeoffs impact latency; Huang and Ercegovac [4] showed that networks with a large number of wiring tracks increase the wiring capacitance because the tracks are packed on a tight pitch to achieve reasonable area. Observe that the Brent-Kung and Han-Carlson networks never have more than one black or gray cell in each pair of bits on any given row. This suggests that the datapath layout may use half as many columns, saving area and wire length.

Fig. 2. Parallel prefix networks: (a) Brent-Kung, (b) Sklansky, (c) Kogge-Stone, (d) Han-Carlson, (e) Knowles [2,1,1,1], (f) Ladner-Fischer.

III. TAXONOMY

Parallel prefix structures may be classified with a three-dimensional taxonomy (l, f, t) corresponding to the number of logic levels, fanout, and wiring tracks. For an N-bit parallel prefix structure with L = log2 N, l, f, and t are integers in the range [0, L-1] indicating:

  Logic Levels:  L + l
  Fanout:        2^f + 1
  Wiring Tracks: 2^t

This taxonomy is illustrated in Fig. 3 for N = 16. The actual logic levels, fanout, and wiring tracks are annotated along each axis in parentheses. The parallel prefix networks from the previous section all fall on the plane l + f + t = L - 1, suggesting an inherent tradeoff between logic levels, fanout, and wiring tracks. The Brent-Kung (L-1, 0, 0), Sklansky (0, L-1, 0), and Kogge-Stone (0, 0, L-1) networks occupy vertices.
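To see what the taxonomy implies numerically, the following Python snippet (an illustration, not part of the paper) enumerates the (l, f, t) points on the plane l + f + t = L - 1 for N = 16 and prints the logic levels, fanout, and wiring tracks assigned to each. The vertex labels are those just given; the edge labels anticipate the family assignments discussed in the next paragraph.

import math

def classify(l, f, t, L):
    """Label the vertices and edges of the plane l + f + t = L - 1."""
    if (l, f, t) == (L - 1, 0, 0): return "Brent-Kung"
    if (l, f, t) == (0, L - 1, 0): return "Sklansky"
    if (l, f, t) == (0, 0, L - 1): return "Kogge-Stone"
    if f == 0: return "Han-Carlson family"
    if l == 0: return "Knowles family"
    if t == 0: return "Ladner-Fischer family"
    return "new family (l, f, t all > 0)"

N = 16
L = int(math.log2(N))
for l in range(L):
    for f in range(L - l):          # choose f so that t = L-1-l-f stays >= 0
        t = (L - 1) - l - f
        print(f"({l},{f},{t}): levels={L + l}, fanout={2**f + 1}, "
              f"tracks={2**t} -> {classify(l, f, t, L)}")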

The Ladner-Fischer (L-2, 1, 0) network saves one level of logic at the expense of greater fanout. The Han-Carlson (1, 0, L-2) network reduces the wiring tracks of the Kogge-Stone network by nearly a factor of two at the expense of an extra level of logic. In general, Han and Carlson describe a family of networks along the diagonal (l, 0, t) with l + t = L - 1. Similarly, the Knowles family of networks occupies the diagonal (0, f, t) with f + t = L - 1, and the Ladner-Fischer family occupies the diagonal (l, f, 0) with l + f = L - 1. Knowles networks are described by L integers specifying the fanout at each stage. For example, the [8,4,2,1] and [1,1,1,1] networks represent the Sklansky and Kogge-Stone extremes, and [2,1,1,1] was shown in Fig. 2e. In general, a (0, f, t) network corresponds to the Knowles network [2^f, 2^(f-1), ..., 1, 1], which is the Knowles network closest to the diagonal.

The taxonomy suggests yet another family of parallel prefix networks found inside the cube with l, f, t > 0. Fig. 4 shows such a (1,1,1) network. For N = 32, the new networks would include (1,1,2), (1,2,1), and (2,1,1).

Fig. 4. New (1,1,1) parallel prefix network.

IV. RESULTS

Table 1 compares the parallel prefix networks under consideration. The delay depends on the number of logic levels, the fanout, and the wire capacitance. All cells are designed to have the same drive capability; this drive is arbitrary and generally greater than minimum. Networks with l > 0 are sparse and require half as many columns of cells. The wire capacitance depends on layout and process and can be expressed by w, the ratio of wire capacitance per column traversed to the input capacitance of a unit inverter. Reasonable estimates from a trial layout in a 180 nm process are w = 0.5 for widely spaced tracks and w = 1 for networks with a large number of tightly spaced wiring tracks. The method of logical effort is used to estimate the latency of adders built with each prefix network, following the assumptions made in [6]. Tables 2-4 show how the latency depends on adder size, circuit family, and wire capacitance.
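The toy estimator below is only a rough sketch of the style of calculation Section IV relies on; it is my own simplification, not the model of [6] or the source of Tables 1-4. Every prefix stage is modeled as one gate with an assumed logical effort g and parasitic delay p, driving `fanout` unit-capacitance cells plus a wire load of w per column spanned, with the per-stage wire span crudely approximated by the number of wiring tracks.

def prefix_delay_fo4(levels, fanout, tracks, w, g=5/3, p=2.0):
    """Very rough critical-path delay estimate, in FO4 inverter delays.

    Assumptions (all illustrative): every stage has logical effort g and
    parasitic delay p (in units of tau), cells have unit input capacitance,
    and each stage's wire spans roughly `tracks` columns, adding w per column.
    Applying the worst-case fanout to every stage is pessimistic for networks
    such as Sklansky, whose large fanout occurs only in the final stages.
    """
    total_tau = 0.0
    for _ in range(levels):
        h = fanout + w * tracks      # electrical effort: cell load plus wire load
        total_tau += g * h + p       # logical-effort stage delay, d = g*h + p
    return total_tau / 5.0           # one FO4 delay is roughly 5 tau

# Illustrative comparison for N = 16 prefix networks (not the paper's numbers).
for name, levels, fanout, tracks in [("Brent-Kung", 7, 2, 1),
                                     ("Sklansky", 4, 9, 1),
                                     ("Kogge-Stone", 4, 2, 8)]:
    print(f"{name:12s} ~{prefix_delay_fo4(levels, fanout, tracks, w=0.5):.1f} FO4")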
V. CONCLUSION

This paper has presented a three-dimensional taxonomy of parallel prefix networks showing the tradeoffs between number of stages, fanout, and wiring tracks. The taxonomy captures the networks used in the parallel prefix adders described in the literature. It also suggests a new family of parallel prefix networks inside the cube. The new architecture appears to have competitive latency in many cases.

REFERENCES

[1] A. Beaumont-Smith and C. Lim, "Parallel prefix adder design," Proc. 15th IEEE Symp. Comp. Arith., pp. 218-225, June 2001.
[2] R. Brent and H. Kung, "A regular layout for parallel adders," IEEE Trans. Computers, vol. C-31, no. 3, pp. 260-264, March 1982.
[3] C. Huang, J. Wang, and Y. Huang, "Design of high-performance CMOS priority encoders and incrementer/decrementers using multilevel lookahead and multilevel folding techniques," IEEE J. Solid-State Circuits, vol. 37, no. 1, pp. 63-76, Jan. 2002.
[4] Z. Huang and M. Ercegovac, "Effect of wire delay on the design of prefix adders in deep-submicron technology," Proc. 34th Asilomar Conf. Signals, Systems, and Computers, vol. 2, pp. 1713-1717, 2000.
[5] T. Han and D. Carlson, "Fast area-efficient VLSI adders," Proc. 8th Symp. Comp. Arith., pp. 49-56, Sept. 1987.
[6] D. Harris and I. Sutherland, "Logical effort analysis of carry propagate adders," Proc. 37th Asilomar Conf. Signals, Systems, and Computers, 2003.
[7] S. Knowles, "A family of adders," Proc. 15th IEEE Symp. Comp. Arith., pp. 277-281, June 2001.
[8] P. Kogge and H. Stone, "A parallel algorithm for the efficient solution of a general class of recurrence relations," IEEE Trans. Computers, vol. C-22, no. 8, pp. 786-793, Aug. 1973.
[9] J. Kowalczuk, S. Tudor, and D. Mlynek, "A new architecture for an automatic generation of fast pipeline adders," Proc. European Solid-State Circuits Conf., pp. 101-104, 1991.
[10] R. Ladner and M. Fischer, "Parallel prefix computation," J. ACM, vol. 27, no. 4, pp. 831-838, Oct. 1980.
[11] J. Sklansky, "Conditional-sum addition logic," IRE Trans. Electronic Computers, vol. EC-9, pp. 226-231, June 1960.
[12] I. Sutherland, R. Sproull, and D. Harris, Logical Effort, San Francisco: Morgan Kaufmann, 1999.
[13] R. Zimmermann, Binary Adder Architectures for Cell-Based VLSI and their Synthesis, ETH Dissertation 12480, Swiss Federal Institute of Technology, 1997.

Fig. 3. Taxonomy of prefix graphs (axes: logic levels l, fanout f, wire tracks t).

Table 1. Comparison of parallel prefix network architectures.

Table 2. Adder delays: w = 0.5; inverting and noninverting static CMOS, footed and footless domino circuit families.

Table 3. Adder delays: w = 0.5; N = 32/64.

Table 4. Adder delays: inverting static CMOS; N = 32/64.