Analysis and Design of High Performance 128-bit Parallel Prefix End-Around-Carry Adder


A Thesis Presented by Ogun Turkyilmaz to The Department of Electrical and Computer Engineering in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Computer Engineering

Northeastern University, Boston, Massachusetts

August 2011

Abstract

Addition is a timing-critical operation in today's floating point units. To enable faster processing, an end-around-carry (EAC) adder was proposed as part of a fused multiply-add unit, which performs multiplication followed by addition [5]. The proposed EAC adder was later investigated as a complete adder, together with other prefix adders, in FPGA technology [6]. In this thesis, we propose a 128-bit standalone adder with parallel prefix end-around-carry logic and conditional sum blocks to improve the critical path delay and provide the flexibility to design with different adder architectures. In previous works, CLA logic was used for the EAC logic. Using a modified structure of a parallel prefix $2^n - 1$ adder provides flexibility to the design and decreases the length of the carry path. After the architecture is tested and verified, the critical path is analyzed using the FreePDK 45nm library. Full custom design techniques are applied carefully during critical path optimization. Critical path analysis provides a fast comparison of the total delay among different architectures without designing the whole circuit, and a simpler approach to sizing the transistors for the lowest possible delay. As a final step, the datapath is designed as a recurring bitslice for fast layout entry. The results show that the proposed adder achieves 142 ps delay, 2.42 mW average power dissipation, and 3,132 sq. micron area, assuming there is not much routing area overhead in the estimated area.

Acknowledgements

I would like to express my foremost appreciation to my advisor, Prof. Yong-Bin Kim, for giving me the opportunity to conduct research at Northeastern University. I am grateful for his technical guidance and constant support in my graduate career. Without his valuable suggestions and assistance, this thesis would not have been accomplished.

I would like to thank the committee members, Prof. Fabrizio Lombardi and Prof. Gunar Schirner, for reading my thesis and offering valuable suggestions and contributions. I am especially grateful to Prof. Schirner for the long discussions about graduate study. He has been a mentor and a teacher to me, who generously shares his knowledge and experiences with tremendous enthusiasm and never-ending encouragement.

It has been an honor and a great pleasure to study as a Fulbright Scholar. I would like to express my appreciation to the Fulbright Commission for giving me the chance to pursue further academic study and connect with many accomplished scholars.

I would also like to thank Faith Crisley, Graduate Coordinator at the ECE Department, for her support and valuable suggestions. She has always been helpful with her comforting manner even in the most stressful moments.

Last but not least, I would like to express my sincere appreciation to my beloved parents, Nuket and Nevzat Turkyilmaz, and my sister, Pinar Turkyilmaz, who encouraged me continuously in every step I took, supported me constantly through every hardship I faced, and loved me without boundaries. Without them, I could not even imagine being where I am today.

Ogun Turkyilmaz
August 2011

Contents

1 Introduction
    1.1 Fused Multiply-Add Operation
    1.2 Adders
        Ripple Carry Adders
        Carry lookahead adders
        Parallel Prefix Adders
    1.3 Tree Adders
    1.4 Recurrence Algorithms
        Weinberger Recurrence
        Ling Recurrence
    1.5 Conclusion
2 Modulo Adders
    2.1 Introduction
    2.2 Addition in Modulo $2^n - 1$ Adder
    2.3 Analysis of Previous End-Around-Carry Adders
    2.4 Carry-lookahead EAC Logic Unit
    2.5 Conclusion
3 Modified Parallel Prefix EAC Adder
    3.1 Introduction
    3.2 Proposed Adder
    The 16bit blocks in EAC adder
    Parallel Prefix $2^n - 1$ EAC Block
    Implementation and Validation
    Conclusion
4 Critical Path Analysis
    Path Identification
    Path Design
    Transistor Level Design and Sizing
    Logic Level Minimization
    Late arriving signal exploitation
    Logical Effort
    Design with Helpers
    Transistor Sizing
    4.5 Simulation Results
    Conclusion
5 Datapath Library
    Introduction
    Concepts in Full Custom Design
    Datapath Design
    Layout Design
    Results
    Conclusion
Conclusion and Future Works
A Verilog Code of the Proposed Adder
B HSPICE Simulation Files
    B.1 Cells
    B.2 Simulation Code
    B.3 Condition of Transistors

List of Figures

1.1 Ripple Carry adder [1]
1.2 Carry lookahead adder [2]
1.3 Group PG cells [3]
1.4 Taxonomy of prefix networks [3]
1.5 Kogge-Stone adder [3]
1.6 Sklansky adder [3]
1.7 Brent-Kung adder [3]
1.8 Han-Carlson adder [3]
1.9 Knowles [2,1,1,1] adder [3]
1.10 Ladner Fischer adder [3]
2.1 Prefix graph with fast end-around carry [4]
2.2 General block diagram Modulo $2^n - 1$ Adder [4]
2.3 Block diagram of the 128-bit binary adder [5]
2.4 Architecture of the EAC adder [6]
2.5 Architecture of the EAC adder [6]
3.1 Architecture of the modified EAC Adder
3.2 16bit conditional sum blocks
$c_{in}$ merge with fast carry link [7]
$c_{in}$ merge with extra bit [7]
Modified 8bit Kogge-Stone EAC block
Critical path of the modified EAC adder
Gate level design of critical path
Reduced first stage in Weinberger recursion adder [8]
AOI without late arriving exploitation
AOI with late arriving exploitation
EAC logic with helper
Spreadsheet for Logical Effort Calculation
Transistor level schematic
Transistor level schematic
Delay vs. $V_{dd}$ at 25 °C
Power vs. $V_{dd}$ at 25 °C
Delay vs. $V_{dd}$ at 100 °C
Power vs. $V_{dd}$ at 100 °C
Global floorplan of a datapath [9]
5.2 Regularity placement and routing datapath circuit [10]
Schematic layout of datapath and detailed view of bitcell [11]
Representation of a datapath cell [9]
Designed basic cells
Bit slices of the blocks in the adder
Bit slice of 16bit Kogge-Stone adder
Wide layout
Stacked compact layout
Layout of the blocks

List of Tables

1.1 Trade-off between different adder topologies
Delay and power dissipation values in correspondence to $V_{DD}$ and temperature
Results comparison of proposed adder with the previous work

Chapter 1

Introduction

The fused multiply-add unit plays an important role in modern microprocessors. It performs a floating point multiplication followed by an addition of the product with a third floating point operand. In 2007, a seven-cycle fused multiply-add pipeline unit was proposed as part of the floating point unit in IBM's POWER6 microprocessor [5]. In this fused multiply-add data flow, the product must be aligned before it is added to the addend. Because the magnitude of the product is unknown in the early stages prior to the combination with the addend, it is difficult to determine a priori which operand is bigger. Even if it were determined early that the product is bigger, there would be a problem in conditionally complementing two intermediate operands, namely the carry and sum outputs of the counter tree. Thus an adder needs to be designed to always output a positive magnitude result and, preferably, to complement only one operand.

In [6], the adder in POWER6 was taken as a reference for design space exploration in FPGA technology. The authors designed a complete adder independent of the FMA block and showed that Kogge-Stone does not provide the best performance in FPGA technology. Zhang et al. [12] recently proposed a 108-bit adder for an FMA unit. All these adders took the adder in POWER6 as a reference. We believe that using a CLA block in this adder limits the possibility of fully exploiting the benefits of parallel prefix adders. We designed an adder with a parallel prefix $2^n - 1$ block. Although the carry increment topology is still employed, the number of carry merge terms is decreased, as is the length of the end-around-carry path.

Another area of improvement lies in the design of the first level addition blocks. Using a carry-select scheme provides the benefit of choosing the sum at the end of the computation. However, the carry path should not include the calculation of group propagate and generate (PG) terms according to the input carry. In our conditional sum blocks, the carry path only includes PG terms with $c_{in} = 0$, and the sum is calculated in the non-critical path for both conditions, $c_{in} = 0$ and $c_{in} = 1$. At the final stage, the real sum is selected according to the output of the EAC block.

The thesis is organized as follows. In Chapter 1, general information about adders is provided to show the importance of the design space. In Chapter 2, a detailed analysis of modulo $2^n - 1$ adders and EAC adders is given. The modified adder is described in Chapter 3 and compared with the previous architectures. The critical path analysis methodology is presented in detail in Chapter 4, and simulation results are provided. Finally, in Chapter 5 the datapath library design methodology is described.

1.1 Fused Multiply-Add Operation

A fused multiply-add (FMA) unit performs the multiplication $A \times B$ followed immediately by an addition of the product and a third operand $C$, so that the result $T$ is calculated as in Eqn. 1.1 in a single indivisible step [2]. Such a unit is capable of performing a multiply only, by setting $C = 0$, and an add (or subtract) only, by setting, for example, $B = 1$.

$$T = A \times B + C = M + C \tag{1.1}$$

An advantage of a fused multiply-add unit, compared to a separate multiplier and adder, arises when executing floating-point operations, since rounding is performed only once for the result of $T = A \times B + C$ rather than twice (for the multiply and then for the add) [13]. Since rounding may introduce computation errors, reducing the number of rounding operations reduces the overall error. The operands are produced by the CSA (Carry-Save Adder) multiplier tree, and their magnitudes are not known prior to the addition, so it cannot be determined in advance which operand has the greater value.
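The single-rounding advantage described above can be illustrated numerically. The following is a minimal sketch, not a model of any hardware unit: it assumes IEEE double precision, emulates the fused operation with exact rational arithmetic followed by one rounding, and uses hypothetical operand values chosen so that the intermediate product is not representable.

```python
from fractions import Fraction

def separate_multiply_add(a: float, b: float, c: float) -> float:
    """Multiply, round to double, then add and round again (two roundings)."""
    return a * b + c

def fused_multiply_add(a: float, b: float, c: float) -> float:
    """Form A*B + C exactly with rationals, then round once to double."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)

# Hypothetical operands: the exact product is 1 - 2**-60, which is not a
# double and is rounded to 1.0 when the multiply is rounded on its own.
A = 1.0 + 2.0 ** -30
B = 1.0 - 2.0 ** -30
C = -1.0

print("two roundings:", separate_multiply_add(A, B, C))
print("one rounding: ", fused_multiply_add(A, B, C))
```

With these operands the separately rounded result loses the small term entirely, while the single rounding keeps it.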
Since floating point is a sign-magnitude operation, the result of the adder should be in two's complement form [14]. Therefore, an adder is needed to produce two separate results for the following cases:

Case 1: If operand $M > C$, $|M - C| = M - C = M + \overline{C} + 1$

Case 2: If operand $C > M$, $|M - C| = C - M = -(M - C) = -(M + \overline{C} + 1) = -(M + \overline{C}) - 1 = \overline{(M + \overline{C})}$

During the subtraction $M - C$, the final carry out $C_{out}$ is 1 when $M > C$, and 0 when $C > M$. Consequently, an End-Around-Carry adder produces two different results, and $C_{out}$ determines whether case 1 or case 2 applies [12].

1.2 Adders

Ripple Carry Adders

The addition of two operands is the most frequent operation in almost any arithmetic unit. A two-operand adder is used not only when performing additions and subtractions, but is also often employed when executing more complex operations like multiplication and division. Consequently, a fast two-operand adder is essential [3].

$$s_i = a_i \oplus b_i \oplus c_i \tag{1.2}$$

$$c_{i+1} = a_i b_i + (a_i + b_i) c_i \tag{1.3}$$

Figure 1.1: Ripple Carry adder [1].
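Equations 1.2 and 1.3 can be exercised directly in software. The sketch below is a behavioural illustration only; the function names `full_adder` and `ripple_carry_add` are chosen here for the example and do not come from the thesis.

```python
def full_adder(a: int, b: int, c: int) -> tuple[int, int]:
    """One bit position: sum from Eqn. 1.2, carry out from Eqn. 1.3."""
    s = a ^ b ^ c
    c_out = (a & b) | ((a | b) & c)
    return s, c_out

def ripple_carry_add(x: int, y: int, n: int, c_in: int = 0) -> tuple[int, int]:
    """Chain n full adders; the carry ripples from bit 0 up to bit n-1."""
    carry = c_in
    total = 0
    for i in range(n):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry  # (n-bit sum, final carry out)
```

Because each stage waits for the carry of the stage below it, the delay of this structure grows linearly with the operand width.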

The most straightforward implementation of a parallel adder for two operands x and y is through the use of n basic units called full adders. A full adder (FA) is a logical circuit that accepts two operand bits, say $x_i$ and $y_i$ (written $a_i$ and $b_i$ in Eqns. 1.2 and 1.3), and an incoming carry bit, denoted by $c_i$, and produces the corresponding sum bit, denoted by $s_i$, and an outgoing carry bit, denoted by $c_{i+1}$. As the notation suggests, the outgoing carry $c_{i+1}$ is also the incoming carry for the subsequent FA, which has $x_{i+1}$ and $y_{i+1}$ as its input bits. The FA is a combinational digital circuit implementing the binary addition of three bits through the Boolean equations in Eqns. 1.2 and 1.3. The ripple carry adder is shown in Figure 1.1.

Carry lookahead adders

The most commonly used scheme for accelerating carry propagation is the carry lookahead scheme [2], shown in Figure 1.2. The equations in Eqn. 1.4 show the realization of the carry lookahead generator. The main idea behind carry lookahead addition is an attempt to generate all incoming carries in parallel (for all the n-1 high order FAs) and avoid the need to wait until the correct carry propagates from the stage (FA) of the adder where it has been generated. This can be accomplished in principle, since the carries generated and the way they propagate depend only on the digits of the original numbers $x_{n-1}, x_{n-2}, \ldots, x_0$ and $y_{n-1}, y_{n-2}, \ldots, y_0$. These digits are available simultaneously to all stages of the adder, and consequently each stage can have all the information it needs to calculate the correct value of the incoming carry and compute the sum bit accordingly. This, however, would require an inordinately large number of inputs to each stage of the adder, rendering this approach impractical [1].

Figure 1.2: Carry lookahead adder [2].
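The lookahead idea can be sketched in software by computing every carry directly from the operand bits and the initial carry, without using any previously computed carry. This is an illustrative bit-level expansion under the OR form of propagate; the helper names are assumptions made for the example.

```python
def lookahead_carries(x: int, y: int, n: int, c0: int = 0) -> list[int]:
    """Return [c_0, ..., c_n], each carry computed only from operand bits and c_0."""
    g = [(x >> i) & (y >> i) & 1 for i in range(n)]      # generate
    p = [((x >> i) | (y >> i)) & 1 for i in range(n)]    # propagate (OR form)
    carries = [c0]
    for i in range(1, n + 1):
        # c_i = g_{i-1} + p_{i-1} g_{i-2} + ... + p_{i-1} ... p_0 c_0
        c = 0
        for j in range(i):
            term = g[j]
            for k in range(j + 1, i):
                term &= p[k]
            c |= term
        term = c0
        for k in range(i):
            term &= p[k]
        carries.append(c | term)
    return carries

def lookahead_add(x: int, y: int, n: int, c0: int = 0) -> tuple[int, int]:
    """Form the sum bits from the independently computed carries."""
    carries = lookahead_carries(x, y, n, c0)
    total = 0
    for i in range(n):
        total |= (((x >> i) ^ (y >> i) ^ carries[i]) & 1) << i
    return total, carries[n]
```

The number of product terms for $c_i$ grows with $i$, which mirrors the remark above that a full lookahead needs an impractically large number of inputs per stage.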

$$c_4 = G_0 + c_0 P_0, \qquad c_8 = G_1 + G_0 P_1 + c_0 P_0 P_1, \qquad c_{12} = G_2 + G_1 P_2 + G_0 P_1 P_2 + c_0 P_0 P_1 P_2 \tag{1.4}$$

Parallel Prefix Adders

A parallel prefix circuit is a combinational circuit with n inputs $x_1, x_2, \ldots, x_n$ producing the outputs $x_1,\ x_2 \circ x_1,\ \ldots,\ x_n \circ x_{n-1} \circ \cdots \circ x_1$, where $\circ$ is an associative binary operation. The first stage of the adder generates the individual P and G signals. The remaining stages constitute the parallel prefix circuit, with the fundamental carry operation serving as the associative binary operation. This part of the adder can be designed in many different ways.

$$g_i = a_i b_i, \qquad p_i = a_i \oplus b_i \tag{1.5}$$

$$G_{i:k} = G_{i:j} + P_{i:j} G_{j-1:k}, \qquad P_{i:k} = P_{i:j} P_{j-1:k} \tag{1.6}$$

Although carry-propagate addition can be computed with generate and propagate signals, its implementation in VLSI can be quite inefficient due to the number of wires that have to be connected together. Parallel-prefix adders solve this problem by making the wires shorter with simple gate structures to aid in the passing of groups of carries to the next weight [3] [15]. The proof of the parallel prefix adder can be found in [16]. Parallel-prefix adders can be broken down into three stages:

Pre-computation: single-bit carry generate/propagate signals are obtained with Equation 1.5 and the temporary sum is generated. This stage can be simplified applying the rules defined in Section

Parallel-prefix tree: the carry at each bit is computed with group carry generate/propagate signals, where Equation 1.6 is applied. It is possible to simplify the first stage of the tree using the rules defined in Section

Post-computation: the sum and carry-out are derived with Equation 1.2.

The basic cell definitions of prefix adders are shown in Figure 1.3.

Figure 1.3: Group PG cells [3].

1.3 Tree Adders

Tree structures have been used for graphically representing the various parallel prefix algorithms. Many state-of-the-art adder circuits use parallel prefix schemes to achieve high performance [17-19]. For wide adders, the delay of carry-lookahead (or carry-skip or carry-select) adders becomes dominated by the delay of passing the carry through the lookahead stages. This delay can be reduced by looking ahead across the lookahead blocks [20]. In general, it is possible to construct a multilevel tree of lookahead structures to achieve a delay that grows with log N. Such adders are variously referred to as tree adders, logarithmic adders, multilevel-lookahead adders, parallel-prefix adders, or simply lookahead adders. There are many ways to build the lookahead tree that offer tradeoffs among the number of stages of logic, the number of logic gates, the maximum fanout on each gate, and the amount of wiring between stages. Figure 1.4 shows a three-dimensional taxonomy of prefix adders [21]. The adders at the far edges are Kogge-Stone [22], Sklansky [23] and Brent-Kung [24], with the properties of low logic level and high wiring track count; low logic level and high fanout; and high logic level and low fanout, respectively, as shown in Table 1.1.

Table 1.1: Trade-off between different adder topologies.

Topology           Logic Level   Fanout   Wiring Track
Kogge-Stone [22]   Low           Low      High
Sklansky [23]      Low           High     Low
Brent-Kung [24]    High          Low      Low

Figure 1.4: Taxonomy of prefix networks [3].

The following parallel prefix adders can be found in the literature:

Kogge-Stone: The Kogge-Stone tree [22], shown in Figure 1.5, achieves both $\log_2 N$ stages and a fanout of 2 at each stage. This comes at the cost of long wires that must be routed between stages. The tree also contains more PG cells; while this may not impact the area if the adder layout is on a regular grid, it will increase power consumption. Despite these costs, the Kogge-Stone adder is generally used for wide adders because it shows the lowest delay among the structures.

Figure 1.5: Kogge-Stone adder [3].

Sklansky: The Sklansky or divide-and-conquer tree [23], shown in Figure 1.6, reduces the delay to $\log_2 N$ stages by computing intermediate prefixes along with the large group prefixes. This comes at the expense of fanouts that double at each level: the gates fan out to [8, 4, 2, 1] other columns. These high fanouts cause poor performance on wide adders unless the gates are appropriately sized or the critical signals are buffered before being used for the intermediate prefixes. Transistor sizing can cut into the regularity of the layout because multiple sizes of each cell are required, although the larger gates can spread into adjacent columns. With appropriate buffering, the fanouts can be reduced to [8, 1, 1, 1].

Figure 1.6: Sklansky adder [3].

Brent-Kung: The Brent-Kung tree [24], shown in Figure 1.7, computes prefixes for 2-bit groups. These are used to find prefixes for 4-bit groups, which in turn are used to find prefixes for 8-bit groups, and so forth. The prefixes then fan back down to compute the carries-in to each bit. The tree requires $2\log_2 N - 1$ stages. The fanout is limited to 2 at each stage. The diagram shows buffers used to minimize the fanout and loading on the gates, but in practice the buffers are generally omitted.

Figure 1.7: Brent-Kung adder [3].
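The stage counts quoted for the Kogge-Stone and Brent-Kung trees can be checked with a small behavioural model built on the carry operator of Eqn. 1.6. This is a sketch under stated assumptions: the operand width is a power of two, the carry into bit 0 is 0, and all function names are chosen for the example rather than taken from the thesis.

```python
def combine(hi, lo):
    """Fundamental carry operator of Eqn. 1.6 applied to (G, P) pairs."""
    g_hi, p_hi = hi
    g_lo, p_lo = lo
    return (g_hi | (p_hi & g_lo), p_hi & p_lo)

def kogge_stone(gp):
    """Return (prefixes, stage count) for a Kogge-Stone network."""
    n, span, stages = len(gp), 1, 0
    while span < n:
        gp = [combine(gp[i], gp[i - span]) if i >= span else gp[i]
              for i in range(n)]
        span *= 2
        stages += 1
    return gp, stages

def brent_kung(gp):
    """Return (prefixes, stage count) for a Brent-Kung network."""
    gp, n, span, stages = list(gp), len(gp), 1, 0
    while span < n:                       # up-sweep: 2-, 4-, 8-bit groups ...
        for i in range(2 * span - 1, n, 2 * span):
            gp[i] = combine(gp[i], gp[i - span])
        span *= 2
        stages += 1
    span = n // 4
    while span >= 1:                      # down-sweep: fan back to the other bits
        for i in range(3 * span - 1, n, 2 * span):
            gp[i] = combine(gp[i], gp[i - span])
        span //= 2
        stages += 1
    return gp, stages

def prefix_add(x, y, n, network):
    """Add two n-bit numbers; returns (sum mod 2**n, carry out, stages)."""
    a = [(x >> i) & 1 for i in range(n)]
    b = [(y >> i) & 1 for i in range(n)]
    gp = [(ai & bi, ai ^ bi) for ai, bi in zip(a, b)]   # Eqn. 1.5
    prefixes, stages = network(gp)
    carries = [0] + [g for g, _ in prefixes]            # c_{i+1} = G_{i:0}
    total = sum((gp[i][1] ^ carries[i]) << i for i in range(n))
    return total, carries[n], stages
```

For 16-bit operands, `prefix_add(x, y, 16, kogge_stone)` reports 4 stages and `prefix_add(x, y, 16, brent_kung)` reports 7, matching $\log_2 N$ and $2\log_2 N - 1$.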

Beyond these three major topologies, further trade-offs can be achieved with the following adders.

The Han-Carlson trees [25] are a family of networks between Kogge-Stone and Brent-Kung. Figure 1.8 shows such a tree that performs Kogge-Stone on the odd-numbered bits, and then uses one more stage to ripple into the even positions.

The Knowles trees [26] are a family of networks between Kogge-Stone and Sklansky. All of these trees have $\log_2 N$ stages, but differ in the fanout and number of wires. If we say that 16-bit Kogge-Stone and Sklansky adders drive fanouts of [1, 1, 1, 1] and [8, 4, 2, 1] other columns, respectively, the Knowles networks lie between these extremes. For example, Figure 1.9 shows a [2, 1, 1, 1] Knowles tree that halves the number of wires in the final track at the expense of doubling the load on those wires.

The Ladner-Fischer trees [27] are a family of networks between Sklansky and Brent-Kung. Figure 1.10 is similar to Sklansky, but computes prefixes for the odd-numbered bits and again uses one more stage to ripple into the even positions. Cells at high-fanout nodes must still be sized or merged appropriately to achieve good speed.

Figure 1.8: Han-Carlson adder [3].
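The fanout figures quoted for the 16-bit Kogge-Stone and Sklansky networks can be reproduced by listing, for each level, which column feeds which. The sketch below is an illustration only; the edge-list representation and the function names are assumptions made for this example.

```python
from collections import Counter

def kogge_stone_edges(n):
    """Per-level (destination, source) column pairs of a Kogge-Stone network."""
    levels, span = [], 1
    while span < n:
        levels.append([(i, i - span) for i in range(span, n)])
        span *= 2
    return levels

def sklansky_edges(n):
    """Per-level (destination, source) column pairs of a Sklansky network."""
    levels, span = [], 1
    while span < n:
        levels.append([(i, (i // span) * span - 1)
                       for i in range(n) if i & span])
        span *= 2
    return levels

def max_fanout(levels):
    """Largest number of other columns driven by one source, per level."""
    return [max(Counter(src for _, src in level).values()) for level in levels]

def computes_all_prefixes(levels, n):
    """Check that column i ends up covering bits i..0, tracking (high, low) spans."""
    spans = [(i, i) for i in range(n)]
    for level in levels:
        nxt = list(spans)
        for dst, src in level:
            hi, lo = spans[dst]
            src_hi, src_lo = spans[src]
            if src_hi != lo - 1:          # groups must be adjacent to merge
                return False
            nxt[dst] = (hi, src_lo)
        spans = nxt
    return spans == [(i, 0) for i in range(n)]
```

`max_fanout` lists the levels from the first stage to the last, so the Sklansky result reads [1, 2, 4, 8]; the text above quotes the same numbers from the last stage back to the first.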

Figure 1.9: Knowles [2,1,1,1] adder [3].

Figure 1.10: Ladner-Fischer adder [3].

1.4 Recurrence Algorithms

Recurrence algorithms have been a research area for a long time [8] [28]. Weinberger presented the most widely known carry recurrence for VLSI addition in 1958 [20]. Over the years, several addition algorithms have been developed. These algorithms manipulate the carry and sum equations in an attempt to improve the speed of addition. The equations for sum and carry are defined in Equations 1.2 and 1.3.

Ling modified the algorithm to reduce the complexity of the carry computation at the cost of increased complexity in the sum computation. An analysis was later performed by Doran [29] to determine the set of recurrences which have recurrence properties that are similar to Weinberger's and Ling's.

Weinberger Recurrence

Weinberger [20] demonstrated that addition speed could be improved by parallelizing the computation of the carry. Although widely credited with only the Carry Look-Ahead Adder, Weinberger's recurrence was not limited in group size or number of levels for carry computation [20]. The fundamental advancement of his work was the introduction of generate and propagate as shown in Eqn. Weinberger defined the terms: bitwise generate ($g$), bitwise propagate ($p$), group generate ($G$), and group propagate ($P$). These terms allow carry computation to be performed in parallel, yielding a significant improvement in performance compared to ripple-carry addition. For a group of 4 bits, Weinberger's recurrence has ten terms for the generation of $G_{5:2}$ from the inputs and four terms for the generation of $P_{5:2}$. The maximum transistor stack height is 5. Weinberger demonstrated that these recurrences could be used to create blocks of arbitrary size and could be parallelized to form multiple levels of recurrence [20]. Thus, the majority of parallel prefix adders proposed for high-performance addition employ realizations of Weinberger's recurrence, e.g., Kogge-Stone [22], Brent-Kung [24], Han-Carlson [25], Ladner-Fischer [27], and those described by Knowles [26].

Ling Recurrence

$$g_i = a_i b_i, \qquad p_i = a_i + b_i \tag{1.7}$$

IBM ECL technology limitations on fan-in (limited to 4) and wired-OR (limited to 8) motivated Ling to develop a transformation that reduced the fan-in of Weinberger's recurrence [30] [8]. For clarity, a simple derivation of Ling's transformation will be shown. This derivation provides the physical meaning of the signals used in Ling's transformation and identifies the favorable characteristics of Ling for implementation in modern CMOS technology. In the derivation, the bitwise generate signal is defined as $g_i = a_i b_i$ and the bitwise propagate signal is defined as $t_i = a_i + b_i$. Note that the propagate signal $t_i$ is the same as Weinberger's $p_i$ (when implemented using an OR). To maintain consistency with Ling's original paper, $t_i$ will be used for propagate. Ling's transformation reduces the complexity of Weinberger's recurrence by factoring $t_i$ from $c_{i+1}$ to create a pseudo-carry ($h_i$) on which the recurrence is performed. The transformation is shown below on $c_1$ to form $h_0$. The carry-out signal, $c_1$, of the first bit position is

$$c_1 = g_0 + t_0 c_0 \tag{1.8}$$

Ling's transformation uses the property $t_i g_i = g_i$ to form

$$c_1 = t_0 g_0 + t_0 c_0 = t_0 (g_0 + c_0) \tag{1.9}$$

where $g_0 + c_0 = h_0$, which leads to

$$c_1 = t_0 (g_0 + c_0) = t_0 h_0 \tag{1.10}$$

The general transformation of $c_i$ is defined as

$$c_i = \begin{cases} t_{i-1} h_{i-1} & \text{if } i > 0 \\ c_0 & \text{if } i = 0 \end{cases} \tag{1.11}$$

where the pseudo-carry, $h_i$, is defined as

$$h_i = g_i + c_i \tag{1.12}$$

The physical meaning of the pseudo-carry signal $h$ can be described as follows. By factoring $t_i$ out of the carry expression and propagating $h_i$ instead of $c_{i+1}$, all cases where a carry is generated and/or propagated from the stage preceding stage $i$ are included in $h_i$. This includes the case where a carry-in to the $i$th stage can be assimilated (which should not result in a carry-out). The assimilate condition is handled when forming $c_{i+1}$ by ANDing $h_i$ with $t_i$ to produce $c_{i+1}$. If the carry-assimilate (carry-kill) condition exists, then $t_i = 0$, which results in $c_{i+1} = 0$. A recurrence for $h_i$ can be defined, as has been done previously for Weinberger's $c_i$. The group pseudo-carry and transmit, which allow for parallel prefix computation, can be defined over a group of bits (capital letters are used to refer to the group):

$$T_{i:j} = t_i t_{i-1} \cdots t_j \tag{1.13}$$

$$H_{i:j} = g_i + g_{i-1} + t_{i-1} g_{i-2} + t_{i-1} t_{i-2} g_{i-3} + \cdots + t_{i-1} t_{i-2} \cdots t_{j+1} g_j \tag{1.14}$$

The recurrence can be expressed using the $\circ$ operator as

$$\left(H_{i:j},\ T_{i-1:j-1}\right) \circ \left(H_{j-1:k},\ T_{j-2:k-1}\right) = \left(H_{i:j} + T_{i-1:j-1} H_{j-1:k},\ T_{i-1:j-1} T_{j-2:k-1}\right) \tag{1.15}$$

The transformation from Weinberger's recurrence to Ling's recurrence for a group of 4 bits is shown in the example in Fig. 2. This figure should dispel any difficulties associated with understanding Ling's original derivation. The advantage of using the pseudo-carry instead of the carry is offset by the increased complexity of the sum computation, which requires the real carry to form the individual sum signals. In CMOS technology, the sum can be efficiently calculated conditionally, thus avoiding the AND operation on the critical carry path:

$$s_i = \begin{cases} a_i \oplus b_i & \text{if } h_{i-1} = 0 \\ a_i \oplus b_i \oplus t_{i-1} & \text{if } h_{i-1} = 1 \end{cases} \tag{1.16}$$

1.5 Conclusion

In this chapter, general information about binary adder realization from a VLSI perspective is described. Parallel prefix adders provide the fastest carry propagation on the critical path for wide adders, and their regularity allows fast layout design. The most important advantage is the design space, with many trade-offs in delay, power dissipation and area. Recurrence algorithms are also presented to show that it is possible to shorten the carry path with different propagate and generate terms. In the next chapter, a detailed analysis of modulo adders is given. End-around-carry adders are analyzed in detail as a specialized realization of modulo adders. An analysis of previous work is also included.
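The Ling recurrence of Section 1.4 (Eqns. 1.11, 1.12 and 1.16) can be checked numerically against ordinary binary addition. The sketch below evaluates the pseudo-carry serially for clarity, so it illustrates the algebra rather than a parallel prefix implementation; the function name `ling_add` is an assumption made for this example.

```python
def ling_add(x: int, y: int, n: int, c0: int = 0) -> tuple[int, int]:
    """Add two n-bit numbers using Ling's pseudo-carry h_i = g_i + c_i."""
    a = [(x >> i) & 1 for i in range(n)]
    b = [(y >> i) & 1 for i in range(n)]
    g = [ai & bi for ai, bi in zip(a, b)]      # bitwise generate
    t = [ai | bi for ai, bi in zip(a, b)]      # bitwise transmit (OR propagate)

    h = []                                      # pseudo-carries h_0 .. h_{n-1}
    c = c0
    for i in range(n):
        h.append(g[i] | c)                      # Eqn. 1.12
        c = t[i] & h[i]                         # Eqn. 1.11: c_{i+1} = t_i h_i
    carry_out = c

    total = a[0] ^ b[0] ^ c0                    # bit 0 uses c_0 directly
    for i in range(1, n):
        # Eqn. 1.16: choose between two precomputed candidates with h_{i-1}
        s_if_h0 = a[i] ^ b[i]
        s_if_h1 = a[i] ^ b[i] ^ t[i - 1]
        total |= (s_if_h1 if h[i - 1] else s_if_h0) << i
    return total, carry_out
```

Both sum candidates depend only on operand bits, so in hardware they can be prepared off the carry path and selected once $h_{i-1}$ arrives.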

Chapter 2

Modulo Adders

2.1 Introduction

Modular arithmetic has been of interest to researchers in a wide range of areas, since its operations are the basis for systems that use Residue Number Systems (RNS) [31]. Modulo addition/subtraction and multiplication can also be applied to digital filters [32], cryptography [33], error detection and correction [34], as well as checksum computation in high-speed networks [35]. More importantly, modulo $2^n - 1$ addition is a common operation that can be implemented in hardware because of its circuit efficiency and simple implementation [36].

In end-around-carry adders the carry-in depends on the carry-out [4], i.e., the carry-out $c_{out}$ is fed through some logic back to the carry-in $c_{in}$. In particular, modulo $2^n + 1$ [37] and $2^n - 1$ [38] addition rely on a decrement and an increment, respectively, of the addition result depending on $c_{out}$. Since prefix algorithms actually rely on incrementer structures, considering parallel-prefix schemes for this kind of adder is very promising. In order to obtain fast end-around-carry adders, both conditions of fast carry-out generation and fast carry-in processing have to be met. This implies that there should be no combinational path between $c_{in}$ and $c_{out}$. A fast end-around-carry adder can be built using the prefix structure. Here the last prefix stage is used as an incrementer which is controlled by the $c_{out}$ of the previous prefix stages.

2.2 Addition in Modulo $2^n - 1$ Adder

Addition modulo $2^n - 1$, or ones' complement addition, can be formulated by the following equation:

$$(A + B) \bmod (2^n - 1) = \begin{cases} (A + B + 1) \bmod 2^n & \text{if } A + B \geq 2^n - 1 \\ A + B & \text{otherwise} \end{cases} \tag{2.1}$$

However, the condition $A + B \geq 2^n - 1$ is not trivial to compute. It can be rewritten as $A + B \geq 2^n$ with a carry input of 1.

$$(A + B) \bmod (2^n - 1) = \begin{cases} (A + B + 1) \bmod 2^n & \text{if } A + B \geq 2^n \\ A + B & \text{otherwise} \end{cases} \tag{2.2}$$

Now the carry out $c_{out}$ from the addition $A + B$ can be used to determine whether the increment has to be performed or, even simpler, $c_{out}$ can be added to the sum $A + B$. This equation, however, results in a double representation of zero (i.e., 0 = 00...0 = 11...1).

Figure 2.1: Prefix graph with fast end-around carry [4].

The standard approach for the implementation of a modulo $2^n - 1$ adder is to use a conventional carry propagate adder (CPA) and have the carry out fed back into the carry in of the adder. This creates the end-around carry needed for modulo $2^n - 1$ addition to operate correctly, as shown in Figure 2.2.
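The two formulations in Eqns. 2.1 and 2.2 can be compared with a short behavioural sketch. It assumes both operands already fit in n bits; the function names are chosen for this example only.

```python
def add_mod_eac(a: int, b: int, n: int) -> int:
    """Eqn. 2.2, end-around-carry form: add c_out of A + B back into the sum."""
    mask = (1 << n) - 1
    total = a + b
    return (total + (total >> n)) & mask   # an all-ones result also encodes zero

def add_mod_single_zero(a: int, b: int, n: int) -> int:
    """Eqn. 2.1: increment whenever A + B >= 2**n - 1, so zero is unique."""
    mask = (1 << n) - 1
    total = a + b
    return (total + 1) & mask if total >= mask else total
```

For n = 4, `add_mod_eac(7, 8, 4)` returns 15 (all ones), the second representation of zero mentioned above, while `add_mod_single_zero(7, 8, 4)` returns 0.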

Figure 2.2: General block diagram of the modulo $2^n - 1$ adder [4].

2.3 Analysis of Previous End-Around-Carry Adders

The previously defined end-around action can be obtained using different rules. Although the EAC adder has been used [39] and implemented in several microprocessors, very few details exist on its formulations and arithmetic algorithms in today's literature. Schwarz [40] provided explanations of some aspects of the EAC adder's algorithm as a part of a fused multiply-add (FMA) unit. Shedletsky [41] defined the indeterminate behavior of EAC adders, and Liu et al. [42] presented a formal analysis of EAC adders.

Figure 2.3: Block diagram of the 128-bit binary adder [5].

Liu et al. [6] [42] extended the algorithm to make the adder independent, without being part of an FMA unit. The design mainly follows the algorithms of the EAC adder implemented in the IBM POWER6 microprocessor [5]. The additional logic units of the proposed adder ensure that the whole adder works independently. Another advantage is that it is easier to implement and test the adder in FPGA technology, which enables design space exploration. Figure 2.4 shows the architecture of the adder.

Figure 2.4: Architecture of the EAC adder [6].

EAC means that when subtracting two numbers that are in sign-magnitude format, the subtraction is implemented by the addition of the first operand with the Boolean complement of the second operand. For this addition, instead of setting a carry into the least significant digit, the carry out of the most significant digit is taken as the carry in. This ensures that the result of the addition is always a positive magnitude and that only one operand needs to be conditionally complemented. Thus, an EAC adder performs addition like other regular adders, and subtraction using the end-around carry to ensure that the result is positive. The adder shown in Figure 2.4 should satisfy the following conditions:

1) When $x.\hat{s} = \tilde{y}.\hat{s}$, the adder should do addition, and we have $s.\hat{s} = x.\hat{s}$ and $s.s = x.x + \tilde{y}.y$.

2) When $x.\hat{s} \neq \tilde{y}.\hat{s}$, the adder should do subtraction. If $x.x \geq \tilde{y}.y$, then $s.\hat{s} = x.\hat{s}$ and $s.s = x.x - \tilde{y}.y$; if $x.x < \tilde{y}.y$, then $s.\hat{s} = \tilde{y}.\hat{s}$ and $s.s = \tilde{y}.y - x.x$.

The subtraction operation can be described as follows:

1) Determining which operand is bigger. After a subtraction operation, if the result is positive, operand x is bigger; otherwise y is bigger. In n-bit arithmetic, $x.x - \tilde{y}.y$ is computed as $x.x + \overline{\tilde{y}.y} + 1 = x.x + 2^n - \tilde{y}.y$, so the carry out of $x.x + \overline{\tilde{y}.y} + 1$ will be 1 when $x.x \geq \tilde{y}.y$. Therefore, $c_{out}$ is 1 if x is bigger and 0 if y is bigger. Hence the sum equation can be written as $x.x + \overline{\tilde{y}.y} + c_{out}$.

2) When y is bigger, $c_{out} = 0$ and the subtraction can be written as $s.s = \tilde{y}.y - x.x = -(x.x - \tilde{y}.y) = -(x.x + \overline{\tilde{y}.y} + 1) = \overline{(x.x + \overline{\tilde{y}.y} + 0)} = \overline{(x.x + \overline{\tilde{y}.y})}$.

3) The $c_{out}$ is used to select the correct result:

$$s.s = \begin{cases} x.x + \overline{\tilde{y}.y} + c_{out} & c_{out} = 1 \\ \overline{x.x + \overline{\tilde{y}.y} + c_{out}} & c_{out} = 0 \end{cases}$$

In order to implement addition and subtraction in one adder, y should be complemented conditionally. The effective operation can be defined as:

$$O_s = x.\hat{s} \oplus \tilde{y}.\hat{s} \tag{2.3}$$

$$y_t = \begin{cases} \tilde{y}.y & O_s = 0 \\ \overline{\tilde{y}.y} & O_s = 1 \end{cases}$$

The sign of the result is determined by the sign logic:

$$s.\hat{s} = \begin{cases} x.\hat{s} & c_{out} = 1 \\ \tilde{y}.\hat{s} & c_{out} = 0 \end{cases}$$

Generally, the implementation of an adder/subtracter is achieved using two different adders, one for addition and one for subtraction. After the results of both operations are calculated, the final result is selected with a multiplexer, as given in Figure 2.5.
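The selection rules above can be collected into a small behavioural model of a sign-magnitude adder/subtracter. This is a sketch under stated assumptions: operands are given as separate sign and n-bit magnitude values, the effective addition returns the full (possibly n+1 bit) magnitude instead of handling overflow, and the function name `eac_add` is chosen for this example.

```python
def eac_add(x_sign: int, x_mag: int, y_sign: int, y_mag: int, n: int):
    """Return (sign, magnitude) of x + y for n-bit sign-magnitude operands."""
    mask = (1 << n) - 1
    o_s = x_sign ^ y_sign                  # effective operation, Eqn. 2.3
    if o_s == 0:
        return x_sign, x_mag + y_mag       # effective addition
    y_t = ~y_mag & mask                    # conditionally complemented operand
    c_out = (x_mag + y_t + 1) >> n         # 1 when x_mag >= y_mag
    if c_out:
        return x_sign, (x_mag + y_t + c_out) & mask
    return y_sign, ~(x_mag + y_t + c_out) & mask
```

For example, `eac_add(0, 3, 1, 5, 4)` models 3 + (-5) and returns `(1, 2)`, a negative sign with magnitude 2, without ever forming a negative intermediate value.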

Figure 2.5: Architecture of the EAC adder [6].

2.4 Carry-lookahead EAC Logic Unit

The use of an EAC unit helps to implement the adder/subtracter using only one adder. The algorithm for an EAC unit with four carry bits can be found in [40]. The most significant bit is labeled 0. The group carries for a CLA adder are defined as:

$$\begin{aligned}
C_0 &= G_0 + P_0 G_1 + P_0 P_1 G_2 + P_0 P_1 P_2 G_3 + P_0 P_1 P_2 P_3 C_{in} \\
C_1 &= G_1 + P_1 G_2 + P_1 P_2 G_3 + P_1 P_2 P_3 C_{in} \\
C_2 &= G_2 + P_2 G_3 + P_2 P_3 C_{in} \\
C_3 &= G_3 + P_3 C_{in}
\end{aligned} \tag{2.4}$$

If the carry out $C_0$ is fed to the carry in, EAC operation is achieved as in Equation 2.5.

$$\begin{aligned}
C_0 &= G_0 + P_0 G_1 + P_0 P_1 G_2 + P_0 P_1 P_2 G_3 + P_0 P_1 P_2 P_3 \\
C_1 &= G_1 + P_1 G_2 + P_1 P_2 G_3 + P_1 P_2 P_3 G_0 + P_0 P_1 P_2 P_3 \\
C_2 &= G_2 + P_2 G_3 + P_2 P_3 G_0 + P_2 P_3 P_0 G_1 + P_0 P_1 P_2 P_3 \\
C_3 &= G_3 + P_3 G_0 + P_3 P_0 G_1 + P_3 P_0 P_1 G_2 + P_0 P_1 P_2 P_3
\end{aligned} \tag{2.5}$$

Combining the carries in this way results in a carry chain for every group whose length equals the width of the adder. This wrapping of the carries is correct for subtraction but not for addition. To make the adder selectable between addition and subtraction, the $P_3$ term needs to be modified. An extra bit is combined with the least significant bit of the adder to assert carry propagation when the effective operation is subtraction. This bit can be integrated into $P_3$ to make $P_3 = 0$ for an effective operation of addition, as defined in Eqn 2.6.

$$P^t_3 = \begin{cases} P_3 & O_s = 1 \\ 0 & O_s = 0 \end{cases} \tag{2.6}$$

2.5 Conclusion

In this chapter, detailed information about binary modulo adders and previous work on EAC adders is provided. It is shown that feeding $c_{out}$ back into a carry-increment stage provides the EAC operation, as in $2^n - 1$ adders. The designs in [5] and [12] employ the adder as a part of an FMA flow. However, by adding extra logic, Liu [6] proposes a standalone adder. In the next chapter, a detailed analysis of the proposed parallel prefix EAC adder is given. The design promises an alternative to the previous work while providing a wider design space.
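As a sanity check on the carry-lookahead equations of Section 2.4, the following sketch evaluates Equations 2.4 and 2.5 for every combination of the four group generate and propagate bits. It checks two things: that the wrapped carries of Eq. 2.5 are what Eq. 2.4 gives when $C_{in}$ is tied to $C_0$, and that gating $P_3$ with $O_s$ (Eq. 2.6) turns the unit back into a plain adder with $C_{in} = 0$. The function names are hypothetical and the code is only a truth-table exercise, not a description of the hardware in [40].

```python
from itertools import product

def cla_carries(G, P, c_in):
    """Eq. 2.4: group carries of a 4-group CLA (index 0 = most significant)."""
    G0, G1, G2, G3 = G
    P0, P1, P2, P3 = P
    return (
        G0 | P0 & G1 | P0 & P1 & G2 | P0 & P1 & P2 & G3 | P0 & P1 & P2 & P3 & c_in,
        G1 | P1 & G2 | P1 & P2 & G3 | P1 & P2 & P3 & c_in,
        G2 | P2 & G3 | P2 & P3 & c_in,
        G3 | P3 & c_in,
    )

def eac_carries(G, P, o_s=1):
    """Eq. 2.5 with the P3 gating of Eq. 2.6 applied first."""
    G0, G1, G2, G3 = G
    P0, P1, P2, P3 = P
    P3 = P3 & o_s                      # Eq. 2.6: P3 is forced to 0 for addition
    all_p = P0 & P1 & P2 & P3
    return (
        G0 | P0 & G1 | P0 & P1 & G2 | P0 & P1 & P2 & G3 | all_p,
        G1 | P1 & G2 | P1 & P2 & G3 | P1 & P2 & P3 & G0 | all_p,
        G2 | P2 & G3 | P2 & P3 & G0 | P2 & P3 & P0 & G1 | all_p,
        G3 | P3 & G0 | P3 & P0 & G1 | P3 & P0 & P1 & G2 | all_p,
    )

def check_all():
    """Exhaustively compare both formulations over all 256 input patterns."""
    for bits in product((0, 1), repeat=8):
        G, P = bits[:4], bits[4:]
        wrapped = eac_carries(G, P, o_s=1)
        # Subtraction: Eq. 2.5 equals Eq. 2.4 with C_in tied to C_0.
        assert wrapped == cla_carries(G, P, wrapped[0])
        # Addition: with P3 gated off, no carry wraps around.
        assert eac_carries(G, P, o_s=0) == cla_carries(G, P[:3] + (0,), 0)
    return True
```

Calling `check_all()` runs through all input patterns and raises an assertion error if any of them disagrees.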

Chapter 3

Modified Parallel Prefix EAC Adder

3.1 Introduction

Previously proposed adders were analyzed in Section 2.3. Since [5] did not include implementation details about how the blocks of the adder were built internally, [6] made a detailed analysis of how the first stage adder and the EAC blocks are defined. They also extended the work to a complete adder without an FMA unit. Their implementation was targeted at FPGAs. Although their analysis answered many of the questions about the previous work, a number of parts of the adder still need to be analyzed. In this chapter, we propose a modified adder which uses a modified parallel prefix $2^n - 1$ adder block as the EAC logic, together with conditional sum blocks, for flexibility among different adder architectures and lower total propagation delay.

3.2 Proposed Adder

Figure 3.1 shows the architecture of the proposed adder. The first level includes eight 16-bit blocks of Kogge-Stone prefix-2 adders for 128-bit inputs, and the second level includes a modified $2^n - 1$ parallel prefix Kogge-Stone adder. In order to design a standalone adder, the input complement, add/sub and sign blocks are included as suggested in [6].

Figure 3.1: Architecture of the modified EAC Adder.

The 16bit blocks in EAC adder

Figure 3.2 shows the 16-bit conditional sum blocks. The black and gray blocks are the same as those introduced earlier. The dashed lines correspond to the half-sum blocks, $h_i = a_i \oplus b_i$. GG and GP refer to the group generate and group propagate signals. As explained in [6], when the carry in to the adder block is assumed to be 0, it is possible to reduce the complexity of the adder. When the carry in is intended to propagate, the generate term in Equation 1.6 can be extended as $G_{i:k} = G_{i:j} + P_{i:j} c_{in}$. Thus, for $c_{in} = 0$ it results in $G_{i:k} = G_{i:j}$. The generated group carry for the corresponding bit position therefore need not be merged with the previous carry, and it is possible to reduce the number of black cells and use gray cells instead. Namely, if there is no carry merge operation in the next level, the corresponding level can be terminated with a gray cell. Figure 1.5 shows a 16-bit Kogge-Stone tree with gray and black cells. However, since our design includes a second stage, we need both the generate and propagate terms. This necessity increases the use of black cells instead of gray cells.

Figure 3.2: 16bit conditional sum blocks.

The adder in [6] is not clear about how the input carry is merged in the first level of adders. A discussion of the most efficient approaches for the traditional carries can be found in [43]. The carry-in bit can be included either by adding a fast carry increment stage or by treating $c_{in}$ as an extra bit of the preprocessing stage of the adder. The first case is shown in Figure 3.3. The second case can be derived by setting $g_{-1} = c_{in}$ according to Equation 1.6. As a result of these schemes, the complexity increases in order to solve the carry incorporation problem. Additionally, [6] and [5] do not explain how the carry is propagated after the 8-bit blocks. Although a similar structure is used in [12], the adder architecture is different and only one set of generate-propagate terms is calculated and propagated in the first level. Since [6] uses two different adders, one for each of the conditions $c_{in} = 0$ and $c_{in} = 1$, two sets of carries need to be selected before leaving the first level. This operation makes the calculation more complicated. Thus, we use the architecture in Figure 3.2 to calculate one set of generate-propagate group terms for the condition $c_{in} = 0$. In the next stage, the conditional carry for $c_{in} = 1$ is calculated with the simple equation $G_{i:k} = G_{i:j} + P_{i:j}$, which is simply an OR gate.

Figure 3.3: $c_{in}$ merge with fast carry link [7].

Figure 3.4: $c_{in}$ merge with extra bit [7].

As a final step, the sum for each carry condition, $c_{in} = 0$ and $c_{in} = 1$, is calculated, to be selected according to the result of the second level of carry calculation.

Parallel Prefix $2^n - 1$ EAC Block

As explained in Chapter 2, $2^n - 1$ adders can be used for EAC calculation. In this section, we extend the use of $2^n - 1$ adders to the adder/subtractor. Figure 3.5 shows the modified 8-bit Kogge-Stone EAC block.

Figure 3.5: Modified 8bit Kogge-Stone EAC block.

The block first takes the group generate and propagate terms that are calculated in the first level. Since there are eight 16-bit adders in the first level, the carry for the whole adder must be calculated first. The 8 carries are merged in the Kogge-Stone adder. In order to achieve the subtraction operation, the final carry at the most significant bit position must be merged with the less significant positions. This step is also called the carry increment stage. The effective operation signal $O_s$ is defined in Eqn 2.3. When the operation is subtraction, $O_s = 1$, the AND gate propagates the most significant carry; when the operation is addition, $O_s = 0$, the most significant carry is blocked and the unit works as a regular adder. The end-around carry $c^t_8 = O_s G_{127:0}$ is defined in Eqn 3.1.

$$c^t_8 = \begin{cases} G_{127:0} & O_s = 1 \\ 0 & O_s = 0 \end{cases} \tag{3.1}$$

In the literature [37] [4] [44] [45] [15], the carry-increment stage has found many uses for $2^n - 1$ operation. Especially for wide adders, $N \ge 64$, the final carry needs to travel a very long path to arrive at the least significant position: in effect twice the adder length, from the least significant position around to the least significant position again. It is shown in [40] that the carry needs to travel only the total length in an EAC operation. That result motivated the use of CLA logic as the EAC logic. In our adder, we use a second stage for the EAC calculation instead of one long parallel prefix propagation followed by a carry increment stage. Namely, for a 128-bit EAC adder the shortest stage count is $n = 7$, from $2^n = 128$, plus one stage of 127 carry merge terms. In our adder, we have 4 stages for the first level, 3 stages for the second level and 1 stage of 8 carry merge terms, which keeps the total stage count the same with a much lower count of carry merge terms. Decreasing the number of carry merge terms decreases the length of the carry path as well as the delay.

3.3 Implementation and Validation

After the adder architecture is finalized, both the modified adder and the previously proposed adder [6] are designed in Verilog. A test fixture is created to validate the adder with corner cases such as the 0-to-1 crossover and carry propagation from the 16th bit to the next bit. A check procedure is defined to signal whenever the output of the adder and the calculated result differ. The tests show that both adders produce the same results, and thus the modified EAC adder is verified.

3.4 Conclusion

In this chapter, the proposed parallel prefix EAC adder is analyzed. It is shown that using conditional sum blocks solves the carry incorporation problem existing in the previous works. Moreover, the EAC logic in [6] is redesigned with a modified $2^n - 1$ adder to provide a wider design space and a shorter carry path. As a final step, the adder is implemented in Verilog and validated through simulation. In the next chapter, a simple method for transistor level realization of the critical path is discussed.
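The two-level organization described in this chapter can be mimicked with a short behavioral model. This is a hedged sketch in Python rather than the Verilog used for validation, which is not reproduced in the text. It assumes XOR-type propagate signals, and all function names (`merge`, `kogge_stone`, `block16`, `eac_adder`) are hypothetical. The first level evaluates each 16-bit block for $c_{in} = 0$ and forms both conditional sums, the second level merges the eight group terms, and the carry increment stage applies the end-around carry of Eqn 3.1.

```python
WIDTH, BLOCK = 128, 16
MASK = (1 << WIDTH) - 1

def merge(hi, lo):
    """Black-cell prefix operator on (G, P) pairs; hi is the more significant group."""
    return (hi[0] | (hi[1] & lo[0]), hi[1] & lo[1])

def kogge_stone(gp):
    """Return the group terms (G_{i:0}, P_{i:0}) for every position i (index 0 = LSB)."""
    d = 1
    while d < len(gp):
        gp = [merge(gp[i], gp[i - d]) if i >= d else gp[i] for i in range(len(gp))]
        d *= 2
    return gp

def block16(a, b):
    """First level: one 16-bit block, evaluated for c_in = 0.

    Returns the block's group (GG, GP) pair and both conditional sums."""
    bits = [((a >> i) & 1, (b >> i) & 1) for i in range(BLOCK)]
    half = [x ^ y for x, y in bits]                        # h_i = a_i XOR b_i
    pre = kogge_stone([(x & y, x ^ y) for x, y in bits])
    sum0 = sum1 = 0
    for i in range(BLOCK):
        c0 = pre[i - 1][0] if i else 0                     # carry into bit i, c_in = 0
        c1 = (pre[i - 1][0] | pre[i - 1][1]) if i else 1   # c_in = 1: G OR P
        sum0 |= (half[i] ^ c0) << i
        sum1 |= (half[i] ^ c1) << i
    return pre[-1], sum0, sum1

def eac_adder(x, y, o_s):
    """Two-level model; returns (magnitude, end_around_carry)."""
    y_t = (~y & MASK) if o_s else y                        # conditional complement
    blocks = []
    for k in range(WIDTH // BLOCK):
        a = (x >> (k * BLOCK)) & 0xFFFF
        b = (y_t >> (k * BLOCK)) & 0xFFFF
        blocks.append(block16(a, b))
    pre = kogge_stone([blk[0] for blk in blocks])          # second level: 3 merge stages
    eac = o_s & pre[-1][0]                                 # Eqn 3.1: O_s AND G_{127:0}
    result = 0
    for k, (_, sum0, sum1) in enumerate(blocks):
        # Carry increment stage: carry into block k, including the end-around carry.
        cin = eac if k == 0 else pre[k - 1][0] | (pre[k - 1][1] & eac)
        result |= (sum1 if cin else sum0) << (k * BLOCK)   # conditional sum select
    if o_s and not eac:
        result = ~result & MASK                            # y was bigger: complement
    return result, eac
```

With `o_s = 1` the model returns the magnitude of the difference of its two inputs, and with `o_s = 0` it returns their sum modulo $2^{128}$, which makes it easy to compare against ordinary integer arithmetic.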

Chapter 4

Critical Path Analysis

4.1 Path Identification

The critical path of the proposed adder is identified as the path from the sign logic to the sum at the output, as shown in Figure 4.1. In order to conditionally complement the second input, the effective operation $O_s$ is calculated from the sign inputs sa and sb. After $B_t$ is calculated, the propagate and generate terms are calculated from their defining equation. The half-sum block, $h_i = a_i \oplus b_i$, is not on the critical path because the half-sum is needed only for the sum calculation and not for carry propagation. After the 4 stages of the 16-bit adder, the carry is merged with the other group terms in the 8-bit EAC block. The EAC block consists of 3 carry merge stages and 1 carry increment stage. We only take into account the carry merge operation $G_{i:k} = G_{i:j} + P_{i:j} G_{j-1:k}$, which is simply an AND-OR-INVERT (AOI) gate, because the group propagate term is not on the critical path. The most significant carry should either be blocked or propagated. Thus, the $O_s$ controlled AND gate is on the critical path. Since the end-around carry term, the most significant bit in the EAC block, is calculated before the carry increment stage, $c_8$ is not on the critical path. Choosing one of the carries between the last and first bits gives a good estimate of the delay of the operation. At the next stage, one of the two sums needs to be selected as the output of each 16-bit adder according to the carry from the EAC block. Therefore, one mux is on the critical path. As a final stage, the output needs to be conditionally complemented according to the end-around carry and the operation. The calculation of the condition is not on the critical path because it can be calculated during the first MUX stage. Thus, we end up with one XOR stage for complementing.

Figure 4.1: Critical path of the modified EAC adder.

4.2 Path Design

After the critical path is identified, the gate level model is created as in Figure 4.2. All the inverters on the critical path are removed to decrease the number of logic levels by applying the inverting property of CMOS to the consecutive levels. As a result, the carry merge stages alternate between AOI and OAI gates. The output load is one minimum size inverter.
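The alternation of AOI and OAI gates follows from De Morgan's law: an AOI stage fed with true-polarity inputs produces the inverted carry, and an OAI stage fed with inverted inputs produces the true carry again, so no inverter is needed between stages. The truth-table check below is a minimal sketch of that argument; the function names are hypothetical and the gates are modeled as pure Boolean functions without any timing.

```python
from itertools import product

def carry_merge(g_hi, p_hi, g_lo):
    """Reference carry merge: G = G_hi + P_hi * G_lo."""
    return g_hi | (p_hi & g_lo)

def aoi(g_hi, p_hi, g_lo):
    """AND-OR-INVERT stage: true-polarity inputs, inverted carry out."""
    return 1 - (g_hi | (p_hi & g_lo))

def oai(ng_hi, np_hi, ng_lo):
    """OR-AND-INVERT stage: inverted inputs, true-polarity carry out."""
    return 1 - (ng_hi & (np_hi | ng_lo))

def polarity_alternates():
    """Check every input pattern for both gate types."""
    for g_hi, p_hi, g_lo in product((0, 1), repeat=3):
        ref = carry_merge(g_hi, p_hi, g_lo)
        assert aoi(g_hi, p_hi, g_lo) == 1 - ref
        assert oai(1 - g_hi, 1 - p_hi, 1 - g_lo) == ref
    return True
```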

Figure 4.2: Gate level design of critical path.

Gate level design is just a logical expression of the circuit. Therefore, in order to make accurate simulations, the gates should be mapped to transistor level models. In this phase, we created a library of custom designed gates instead of using a standard cell library, for higher performance.

4.3 Transistor Level Design and Sizing

Full custom design generally outperforms standard cell design in terms of power consumption, area and propagation delay. However, full custom design is hard to automate and takes much longer than design with standard cells. A number of techniques exist for designing at the transistor level.

Logic Level Minimization

Before designing any logic circuit, it is beneficial to exploit logic level minimization. In our design, we tried to combine the pg generation and the first level of AOI, as suggested in [8]. This approach allows a single stage to be removed from the Weinberger adder realization. However, this method increases the stack count, and since our blocks were designed with transistors close to minimum width, we did not observe any improvement in the delay. Figure 4.3 shows the reduced first stage.

Figure 4.3: Reduced first stage in Weinberger recursion adder [8].

As another step, Ling recursion is applied to the first step. However, as suggested in [15], Ling carries must be converted to real carries by ANDing the most significant carry with the corresponding propagate term, $c_{i+1} = d_i p_i$, before leaving the adder block. This calculation is on the critical path. Therefore, Ling recursion is not used in the final design. As a final step, we combined the $O_s$ controlled AND gate with the previous OAI gate. However, we observed that increasing the complexity of the gate did not provide much delay improvement because the gates are already sized close to minimum.

Late arriving signal exploitation

One of the most important practices in transistor level design is to connect the late arriving signals close to the output. For the AOI gate in Figure 4.4, assuming A is the late arriving signal, the arrangement in Figure 4.5 provides a better delay.

Figure 4.4: AOI without late arriving exploitation.

Figure 4.5: AOI with late arriving exploitation.

Logical Effort

Sutherland et al. described a very useful concept called Logical Effort [46]. The method of logical effort is founded on a simple model of the delay through a single MOS logic gate. The model describes delays caused by the capacitive load that the logic gate drives and by the topology of the logic gate. Clearly, as the load increases, the delay increases, but delay also depends on the logic function of the gate. Inverters, the simplest logic gates, drive loads best and are often used as amplifiers to drive large capacitances. Logic gates that compute other functions require more transistors, some of which are connected in series, making them poorer than inverters at driving current. A NAND gate has more delay than an inverter with similar transistor sizes that drives the same load. The method of logical effort quantifies these effects to simplify delay analysis for individual logic gates and multistage logic networks.

The logical effort of a logic gate tells how much worse it is at producing output current than an inverter, given that each of its inputs may present the same input capacitance as the inverter. Reduced output current means slower operation, and thus the logical effort number for a logic gate tells how much more slowly it will drive a load than would an inverter. Equivalently, logical effort is how much more input capacitance a gate must present in order to deliver the same output current as an inverter.

Logical Effort for Multistage Networks

The method of logical effort reveals the best number of stages in a multistage network and how to obtain the least overall delay by balancing the delay among the stages. The notions of logical and electrical effort generalize easily from individual gates to multistage paths. The logical effort along a path compounds by multiplying the logical efforts of all the logic gates along the path. The symbol G denotes the path logical effort, so that it is distinguished from g, the logical effort of a single gate in the path. The subscript i indexes the logic stages along the path.

$$G = \prod g_i \tag{4.1}$$

The electrical effort along a path through a network is simply the ratio of the capacitance that loads the last logic gate in the path to the input capacitance of the first gate in the path. The uppercase symbol H indicates the electrical effort along a path. In this case, $C_{in}$ and $C_{out}$ refer to the input and output capacitances of the path.

$$H = C_{out} / C_{in} \tag{4.2}$$

Branching effort b is used to account for fanout within a network. When fanout occurs within a logic network, some of the available drive current is directed along the analyzed path and some is directed off the path. The branching effort b at the output of a logic gate is defined to be

$$b = (C_{onpath} + C_{offpath}) / C_{onpath} \tag{4.3}$$

Note that if the path does not branch, the branching effort is one. The branching effort along an entire path, B, is the product of the branching efforts at each of the stages along the path.

$$B = \prod b_i \tag{4.4}$$

Using the definitions of logical, electrical and branching effort along a path, the path effort F is defined. Note that the path branching and electrical efforts are related to the electrical effort of each stage.

$$F = G B H \tag{4.5}$$

Although it is not a direct measure of delay along the path, the path effort holds the key to minimizing the delay. Observe that the path effort depends only on the circuit topology and loading and not upon the sizes of the transistors used in the logic gates embedded within the network. The path effort is related to the minimum achievable delay along the path. Only a little work is needed to find the best number of stages and the proper transistor sizes to realize the minimum delay. Optimizing the design of an N-stage logic network proceeds from a very simple principle: the path delay is least when each stage in the path bears the same stage effort. This minimum delay is achieved when the stage effort is

$$f = g_i h_i = F^{1/N} \tag{4.6}$$

To equalize the effort borne by each stage on a path, and therefore achieve the minimum delay along the path, appropriate transistor sizes for each stage of logic along the path must be chosen. Each logic stage should be designed with electrical effort

$$h_i = F^{1/N} / g_i \tag{4.7}$$

From this relationship, it is straightforward to determine the transistor sizes of the gates along a path. Starting at the end of the path and working backward, the capacitance transformation is applied:

$$C_{in,i} = (g_i C_{out,i}) / f \tag{4.8}$$
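The sizing procedure of Equations 4.1 to 4.8 can be sketched in a few lines. The example below assumes a path without branching (B = 1), and the logical effort values 1, 4/3 and 5/3 are the common textbook values for an inverter, a 2-input NAND and a 2-input NOR; they are used only for illustration and are not the values of the gates in this design, which use a 2.5/1 ratio. The function name `size_path` and the capacitance units are likewise hypothetical.

```python
from math import prod

def size_path(g, c_in, c_out):
    """Size an N-stage path for minimum delay, assuming no branching (B = 1).

    g     : logical effort of each stage, first stage first
    c_in  : input capacitance of the first stage
    c_out : load capacitance on the last stage
    Returns (stage effort f, input capacitance of every stage).
    """
    n = len(g)
    path_effort = prod(g) * (c_out / c_in)   # Eq. 4.5 with B = 1
    f = path_effort ** (1.0 / n)             # Eq. 4.6
    caps = [0.0] * n
    load = c_out
    for i in reversed(range(n)):             # work backward from the output
        caps[i] = g[i] * load / f            # Eq. 4.8
        load = caps[i]
    return f, caps

# Illustrative path: inverter -> NAND2 -> NOR2 driving 64 unit capacitances.
stage_effort, stage_caps = size_path([1.0, 4.0 / 3.0, 5.0 / 3.0], 1.0, 64.0)
```

Working backward in this way returns the specified input capacitance for the first stage, which is a convenient check that the stage efforts have been equalized.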

The equation determines the input capacitance of each gate, which can then be distributed appropriately among the transistors connected to the input.

Design with Helpers

As described in [47], in most adders the stage effort is generally constant if wire capacitance is neglected. This means uniform gate sizes may be used throughout with little loss in performance, and a very regular layout is possible. However, adders such as Sklansky show an exponential increase in fanout. In this case, the stage effort becomes high for those cells and, as a result, a bigger driver is needed. Therefore, Harris et al. [47] propose the concept of helpers. When the stage effort increases, it is good practice to duplicate the driving cells in parallel to maintain a lower delay.

Figure 4.6: EAC logic with helper.

Using the concept described in [47], we duplicate the operation controlled NAND gate. That helps to decrease the fanout of the carry increment stage and, consequently, the size of the NAND gate.

4.4 Transistor Sizing

Logical effort has found wide interest in transistor sizing [48-51]. In our proposed adder, we used the concept to size the transistors in the critical path. After the sizes are determined, it is possible to use these widths in the remaining blocks, because the blocks that are not on the critical path have the same loads and fan-outs within the same logic level. In order to equalize the rising and falling edges of the output signal, a ratio of $w_p / w_n = 2.5/1$ is used between the PMOS and NMOS transistors. In the transistor level library, late arriving signal exploitation, helpers and logical effort are used to achieve the highest performance. In addition, the $O_s$ controlled AND gate needs to drive 7 OAI cells. If we use two helpers instead of one, as described in 4.3.4, each gate drives at most 4 OAI cells, and it is possible to decrease both the gate size and the delay.

In order to apply logical effort, a spreadsheet is created as in Figure 4.7. After the logical effort parameters are calculated, the width is distributed to each stage according to the stage coefficients. Since the gates are designed according to the 2.5/1 ratio, the stage coefficients show how large each gate is compared to an inverter.

Figure 4.7: Spreadsheet for Logical Effort Calculation.

After the appropriate transistor sizes are found, we performed simulations using HSPICE [52] and the FreePDK45 [53] [54] library. Figures 4.8 and 4.9 show the transistor level schematic.

Figure 4.8: Transistor level schematic.

Figure 4.9: Transistor level schematic.

Simulation Results

During the simulation, a square wave signal with 150ps rising and falling edges is applied. The delay between input and output is then measured. The power dissipation of the critical path is measured to estimate the power of one bit of the adder. Temperature and supply voltage are swept to observe the optimal performance. It can be observed from the figures that the critical path shows 141.8ps delay and 18.9µW power dissipation for one bit when 10% switching activity is considered [3], which makes 2.42mW for 128 bits.

Figure 4.10: Delay vs. $V_{dd}$ at 25 °C.
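The 128-bit power figure quoted above follows from scaling the per-bit measurement linearly with the word width. The short calculation below spells this out; the symbols $P_{bit}$ and $P_{128}$ are introduced here only for illustration, and the linear scaling assumes that every bit slice dissipates roughly as much as the measured critical path slice.

```latex
\begin{align*}
P_{128} &\approx 128 \times P_{bit} \\
        &= 128 \times 18.9\,\mu\text{W} \\
        &= 2419.2\,\mu\text{W} \approx 2.42\,\text{mW}
\end{align*}
```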

Figure 4.11: Power vs. $V_{dd}$ at 25 °C.

Figure 4.12: Delay vs. $V_{dd}$ at 100 °C.

Table 4.1: Delay and power dissipation values in correspondence to $V_{DD}$ and temperature. Columns: $V_{dd}$; Delay (ps) at 25 °C and 100 °C; Power (µW) at 25 °C and 100 °C.

Figure 4.13: Power vs. $V_{dd}$ at 100 °C.

Table 4.1 shows the trade-off between temperature and supply voltage for the proposed adder. When $V_{dd}$ is increased from 1.0V to 1.2V, delay decreases by 12% and power increases by 79%; when it is decreased from 1.0V to 0.8V, delay increases by 26% and power decreases by 45%. When the temperature changes from 25 °C to 100 °C, delay increases by 51% and power increases by 9.2%. For low power applications, lowering the supply voltage therefore saves proportionally more power than it costs in delay. For high speed applications, however, more power should be provided. Since a higher temperature has an adverse effect on both power and delay, the circuit should remain at a lower temperature for higher performance.

Conclusion

In this chapter, the methodology for transistor level critical path optimization is provided. First, the critical path of the adder is identified, taking into account the logic levels. Transistors are sized with the Logical Effort concept [46]. Architecture level and transistor level optimizations are applied according to the full custom design rules of the FreePDK45 library [54]. As a final step, transient simulations are carried out to calculate the delay and power dissipation. It is observed that the method enables a fast comparison of the critical path delay and power dissipation among different adder architectures. In the next chapter, a methodology for fast area estimation and layout entry is given. A datapath library is generated by exploiting the regularity of the proposed adder.

Chapter 5

Datapath Library

5.1 Introduction

Datapaths have been a research topic in recent years as an approach to make layout entry faster [9] [10] [55] [56]. Informally, datapaths are circuits where the same or similar logic is applied to several bits [57]. A datapath stack [58] is made up of many custom word lines, such as registers, ALUs, adders, shifters, multiplexers and buffers, which form the data flow of the functional units. Datapaths are characterized by a highly regular layout structure. A typical datapath floorplan consists of an array of horizontally oriented words of identical bit cells, called datapath cells, and vertically oriented bit slices, as shown in Figure 5.1. Since each bit slice is replicated a number of times (determined by the datapath width) with very little or no modification, layout generation of such regular structures reduces to a careful design, often by means of handcrafting, of the individual datapath cells. Figure 5.2 shows the regularity of placement and routing inside a datapath slice. Figure 5.3 shows the schematic layout of a datapath and the corresponding bit slice [55].

Figure 5.1: Global floorplan of a datapath [9].

Figure 5.2: Regularity placement and routing datapath circuit [10].

Figure 5.3: Schematic layout of datapath and detailed view of bitcell [11].

Datapath circuits are typically organized in horizontal rows of words representing the same functional block and vertical bit slices, delimited by vertically running power and ground rails. The layout of the datapath cell of bit slice i is identical to that of bit slice (i+1), but mirrored along the vertical axis so that adjacent bit slices can share a common power or ground rail.

Figure 5.4: Representation of a datapath cell [9].

The width of the bit slice, also known as the pitch, is fixed; it determines the width of all the datapath cells, as outlined in Figure 5.4. Power and ground ($V_{DD}$/$V_{SS}$) supply rails generally delimit the pitch. Signal nets are connected to the datapath cell components by means of bristles. Vertical bristles, or data lines, provide wiring between different datapath cells within the same bit slice. They run parallel to the power rails. Horizontal bristles, or control lines, provide wiring between datapath cells of different bit slices. Control lines span the width of the datapath and run perpendicular to the power rails. Since adders exploit very regular structures, they can be designed with datapaths.

5.2 Concepts in Full Custom Design

Transistor chaining and device merging. Transistor chaining is a widely used technique to improve both the area and the performance of datapath cells. Several transistors can be chained together by combining their diffusion areas in order to reduce the diffusion capacitance. Diffusion sharing applied to simple logic gates in the same datapath cell is known as device merging.

Transistor folding. Transistor folding is another popular technique aimed at minimizing area and improving performance of custom designs. Folding changes the aspect ratio of the component while maintaining the required device size (W/L ratio). By performing folding with different numbers of fingers (poly gates), different component instances can be created for the placement phase.

Intracell sharing. Two component areas (diffusion regions or poly gates) belonging to components from adjacent bit slices can be merged if they share the same global net, such as a power line, control line or clock signal. In a typical organization of a datapath, adjacent bit slices are identical copies of each other, reflected with respect to the vertical boundary line. In this case, the components can be pushed under the boundary line (ground or power rail) to create a more compact layout.

5.3 Datapath Design

The logic gates are designed using the concepts defined in Sections 5.2 and 4.3. Figures 5.5(a) and 5.5(b) show the design of the basic AOI and MUX2 cells according to these rules. The AOI cell occupies 0.77µm × 1.5µm and the MUX2 cell 0.97µm × 1.5µm.

(a) AOI Layout (b) MUX2 Layout
Figure 5.5: Designed basic cells.

The longest cell according to the values from Figure 4.7 has a height of 1.5µm. Thus, the other cells are designed according to the longest cell to maintain regularity and to exploit the datapath. Figure 5.7 shows the bitslice of the 16-bit Kogge-Stone adder. Using small cells allowed us to reduce the height of the bit slice. The bitslice includes the sign logic and the conditional sum calculation. It can be seen from the datapath that the carry merge and propagate cells alternate from one stage to the next due to the intrinsic negation of CMOS design.

(a) EAC bitslice (b) EAC last bit
Figure 5.6: Bit slices of the blocks in the adder.

Figure 5.7: Bit slice of 16bit Kogge-Stone adder.

Using the same approach as in Figure 5.7, bitslices for the last and regular bits of the parallel prefix EAC block can be designed. Since the last bit has an irregular layout, it is designed separately. Figures 5.6(a) and 5.6(b) show the designed bitslices for the EAC block.

5.4 Layout Design

Using the datapath designs, it is straightforward to design the whole adder. Replicating the bitslice designed in Figure 5.7 gives the layout of the 16-bit Kogge-Stone adder. Figure 5.8(a) shows the layout of the 16-bit adder. The 16-bit adder and the EAC block measure 13.42µm × 21.98µm and 5.18µm × 11.06µm respectively. As a final step, using the designed blocks, it is possible to estimate the total area of the 128-bit EAC adder. Two different schemes are considered. Figure 5.9 has a thinner but wider layout, to be used as a part of a larger datapath. Figure 5.10 has a thicker but narrower, more compact layout because half of the adder is stacked on top of the other. The first and the second measure 20.8µm × 175µm and 35.8µm × 87.5µm respectively.

Figure 5.9: Wide layout.

Figure 5.10: Stacked compact layout.
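The area estimates can be reproduced from the dimensions given above. In the worked arithmetic below, the symbols $A_{wide}$, $A_{stacked}$ and $A_{blocks}$ are introduced here for illustration only; $A_{blocks}$ assumes that the adder consists of exactly eight 16-bit blocks and one EAC block and ignores any space between them.

```latex
\begin{align*}
A_{wide}    &= 20.8\,\mu\text{m} \times 175\,\mu\text{m} = 3640\,\mu\text{m}^2 \\
A_{stacked} &= 35.8\,\mu\text{m} \times 87.5\,\mu\text{m} = 3132.5\,\mu\text{m}^2 \\
A_{blocks}  &= 8 \times (13.42 \times 21.98)\,\mu\text{m}^2 + (5.18 \times 11.06)\,\mu\text{m}^2
             \approx 2417\,\mu\text{m}^2
\end{align*}
```

The stacked layout is therefore the one that corresponds to the area of about 3,132 µm² reported for the proposed adder.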

Table 5.1: Results comparison of proposed adder with the previous work.

            Delay (ps)     Power (mW)    Area (µm²)       Technology
[5]         200 (+29%)     -             -                65nm
[12]        270 (+47%)     20 (+88%)     17,237 (+82%)    65nm
Proposed    142            2.42          3,132            45nm

5.5 Results

The results of the previous works and the proposed adder are listed in Table 5.1. In previous works, a pipelined 128-bit adder as part of a 5GHz+ binary floating point adder [5] and a 108-bit EAC adder [12] are proposed. A direct comparison with either adder is not possible because both are implemented as a part of an FMA unit and both are designed in 65nm IBM SOI technology, which is not publicly available. The adder designed in [5] is a pipelined adder, which does not fall into the scope of this thesis. The adder in [6] is designed in FPGA technology, whereas our adder is a full custom design. Therefore, to the best of our knowledge, this is the first adder to use a small parallel prefix $2^n - 1$ EAC block with the full custom design methodology. It can be seen in Table 5.1 that the adder operates with a delay of 142ps and 2.42mW power dissipation in 3,132µm² area at 25 °C with a 1V supply. The proposed adder shows up to 47% improvement in delay, 81% improvement in area and 88% improvement in power dissipation in comparison with the previous works, assuming routing does not contribute significantly to delay and power.

5.6 Conclusion

In this final chapter, a detailed analysis of layout entry through datapath design is provided. The comparison between the proposed adder and the previous work is included. The datapath library provides building blocks for early area estimation before designing the whole adder. Further, routing is simplified with the datapath since wires can be placed regularly in the bitslice. The cells are designed using full custom design rules to achieve a compact, area efficient layout. Two different layouts are generated by repeating the bitslices. We aimed to keep the longest carry path under 100µm, assuming the routing does not contribute significantly to delay. The final result shows that, in the stacked layout, the longest path remains under 45µm, and the proposed adder shows up to 47% improvement in delay, 81% improvement in area and 88% improvement in power dissipation in 45nm technology at 25 °C and 1.0V $V_{dd}$.

(a) Layout of 16bit Kogge-Stone adder (b) Layout of parallel prefix EAC block
Figure 5.8: Layout of the blocks.


More information

COMPUTER ARCHITECTURE AND ORGANIZATION Register Transfer and Micro-operations 1. Introduction A digital system is an interconnection of digital

COMPUTER ARCHITECTURE AND ORGANIZATION Register Transfer and Micro-operations 1. Introduction A digital system is an interconnection of digital Register Transfer and Micro-operations 1. Introduction A digital system is an interconnection of digital hardware modules that accomplish a specific information-processing task. Digital systems vary in

More information

Binary Arithmetic. Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T.

Binary Arithmetic. Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. Binary Arithmetic Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. MIT 6.004 Fall 2018 Reminder: Encoding Positive Integers Bit i in a binary representation (in right-to-left order)

More information

CAD4 The ALU Fall 2009 Assignment. Description

CAD4 The ALU Fall 2009 Assignment. Description CAD4 The ALU Fall 2009 Assignment To design a 16-bit ALU which will be used in the datapath of the microprocessor. This ALU must support two s complement arithmetic and the instructions in the baseline

More information

Design of Efficient VLSI Arithmetic Circuits

Design of Efficient VLSI Arithmetic Circuits Design of Efficient VLSI Arithmetic Circuits Thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in ELECTRONICS AND COMMUNICATION ENGINEERING by Sreehari

More information

Digital Computer Arithmetic

Digital Computer Arithmetic Digital Computer Arithmetic Part 6 High-Speed Multiplication Soo-Ik Chae Spring 2010 Koren Chap.6.1 Speeding Up Multiplication Multiplication involves 2 basic operations generation of partial products

More information

Designing and Characterization of koggestone, Sparse Kogge stone, Spanning tree and Brentkung Adders

Designing and Characterization of koggestone, Sparse Kogge stone, Spanning tree and Brentkung Adders Vol. 3, Issue. 4, July-august. 2013 pp-2266-2270 ISSN: 2249-6645 Designing and Characterization of koggestone, Sparse Kogge stone, Spanning tree and Brentkung Adders V.Krishna Kumari (1), Y.Sri Chakrapani

More information

DIGITAL TECHNICS. Dr. Bálint Pődör. Óbuda University, Microelectronics and Technology Institute

DIGITAL TECHNICS. Dr. Bálint Pődör. Óbuda University, Microelectronics and Technology Institute DIGITAL TECHNIC Dr. Bálint Pődör Óbuda University, Microelectronics and Technology Institute 4. LECTURE: COMBINATIONAL LOGIC DEIGN: ARITHMETIC (THROUGH EXAMPLE) 2nd (Autumn) term 28/29 COMBINATIONAL LOGIC

More information

A Unified Addition Structure for Moduli Set {2 n -1, 2 n,2 n +1} Based on a Novel RNS Representation

A Unified Addition Structure for Moduli Set {2 n -1, 2 n,2 n +1} Based on a Novel RNS Representation A Unified Addition Structure for Moduli Set { n -, n, n +} Based on a Novel RNS Representation Somayeh Timarchi,, Mahmood Fazlali,, and Sorin D.Cotofana Department of Electrical and Computer Engineering,

More information

Design and Characterization of High Speed Carry Select Adder

Design and Characterization of High Speed Carry Select Adder Design and Characterization of High Speed Carry Select Adder Santosh Elangadi MTech Student, Dept of ECE, BVBCET, Hubli, Karnataka, India Suhas Shirol Professor, Dept of ECE, BVBCET, Hubli, Karnataka,

More information

International Journal Of Global Innovations -Vol.6, Issue.I Paper Id: SP-V6-I1-P14 ISSN Online:

International Journal Of Global Innovations -Vol.6, Issue.I Paper Id: SP-V6-I1-P14 ISSN Online: DESIGN OF ERROR FREE PARALLEL PREFIX ADDER USING HAN CARLSON ADDER #1 AKKATAVELLI MANISHA, M.Tech student, #2 R.SHIVASHENKAR, Assistant Professor, #3 S.BALAIAH, Associate Professor (Pursuing Phd), Dept

More information

Part 1-2. Translation of Netlist to CNF

Part 1-2. Translation of Netlist to CNF 2D Project Report Part 1-2. Translation of Netlist to CNF Team members: MA Ke, WEI Fanding, CHEN Jian, CHEN Ziyi Introduction In this part, we are required to translate the netlist of our optimized adder

More information

EE878 Special Topics in VLSI. Computer Arithmetic for Digital Signal Processing

EE878 Special Topics in VLSI. Computer Arithmetic for Digital Signal Processing EE878 Special Topics in VLSI Computer Arithmetic for Digital Signal Processing Part 6b High-Speed Multiplication - II Spring 2017 Koren Part.6b.1 Accumulating the Partial Products After generating partial

More information

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Digital Computer Arithmetic ECE 666 Part 6b High-Speed Multiplication - II Israel Koren ECE666/Koren Part.6b.1 Accumulating the Partial

More information

Design of High Speed Modulo 2 n +1 Adder

Design of High Speed Modulo 2 n +1 Adder Design of High Speed Modulo 2 n +1 Adder M. Varun M. Tech, Student Department of ECE Vardhaman College of Engineering M. Nagarjuna Assistant Professor Department of ECE Vardhaman College of Engineering

More information

High Speed Han Carlson Adder Using Modified SQRT CSLA

High Speed Han Carlson Adder Using Modified SQRT CSLA I J C T A, 9(16), 2016, pp. 7843-7849 International Science Press High Speed Han Carlson Adder Using Modified SQRT CSLA D. Vamshi Krishna*, P. Radhika** and T. Vigneswaran*** ABSTRACT Binary addition is

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October-2013 1502 Design and Characterization of Koggestone, Sparse Koggestone, Spanning tree and Brentkung Adders V. Krishna

More information

DESIGN AND ANALYSIS OF COMPETENT ARITHMETIC AND LOGIC UNIT FOR RISC PROCESSOR

DESIGN AND ANALYSIS OF COMPETENT ARITHMETIC AND LOGIC UNIT FOR RISC PROCESSOR DESIGN AND ANALYSIS OF COMPETENT ARITHMETIC AND LOGIC UNIT FOR RISC PROCESSOR M. Priyanka 1 and T. Ravi 2 1 M.Tech VLSI Design, Sathyabama University, Chennai, Tamil Nadu, India 2 Department of Electronics

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 3. Arithmetic for Computers Implementation

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 3. Arithmetic for Computers Implementation COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 3 Arithmetic for Computers Implementation Today Review representations (252/352 recap) Floating point Addition: Ripple

More information

Partitioned Branch Condition Resolution Logic

Partitioned Branch Condition Resolution Logic 1 Synopsys Inc. Synopsys Module Compiler Group 700 Middlefield Road, Mountain View CA 94043-4033 (650) 584-5689 (650) 584-1227 FAX aamirf@synopsys.com http://aamir.homepage.com Partitioned Branch Condition

More information

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Digital Computer Arithmetic ECE 666 Part 6c High-Speed Multiplication - III Israel Koren Fall 2010 ECE666/Koren Part.6c.1 Array Multipliers

More information

Get Free notes at Module-I One s Complement: Complement all the bits.i.e. makes all 1s as 0s and all 0s as 1s Two s Complement: One s complement+1 SIGNED BINARY NUMBERS Positive integers (including zero)

More information

EE878 Special Topics in VLSI. Computer Arithmetic for Digital Signal Processing

EE878 Special Topics in VLSI. Computer Arithmetic for Digital Signal Processing EE878 Special Topics in VLSI Computer Arithmetic for Digital Signal Processing Part 6c High-Speed Multiplication - III Spring 2017 Koren Part.6c.1 Array Multipliers The two basic operations - generation

More information

Implementation of Ripple Carry and Carry Skip Adders with Speed and Area Efficient

Implementation of Ripple Carry and Carry Skip Adders with Speed and Area Efficient ISSN (Online) : 2278-1021 Implementation of Ripple Carry and Carry Skip Adders with Speed and Area Efficient PUSHPALATHA CHOPPA 1, B.N. SRINIVASA RAO 2 PG Scholar (VLSI Design), Department of ECE, Avanthi

More information

Chapter 4 Design of Function Specific Arithmetic Circuits

Chapter 4 Design of Function Specific Arithmetic Circuits Chapter 4 Design of Function Specific Arithmetic Circuits Contents Chapter 4... 55 4.1 Introduction:... 55 4.1.1 Incrementer/Decrementer Circuit...56 4.1.2 2 s Complement Circuit...56 4.1.3 Priority Encoder

More information

Chapter 5 Design and Implementation of a Unified BCD/Binary Adder/Subtractor

Chapter 5 Design and Implementation of a Unified BCD/Binary Adder/Subtractor Chapter 5 Design and Implementation of a Unified BCD/Binary Adder/Subtractor Contents Chapter 5... 74 5.1 Introduction... 74 5.2 Review of Existing Techniques for BCD Addition/Subtraction... 76 5.2.1 One-Digit

More information

Submitted by 1

Submitted by   1 Submitted by WWW.ASSIGNMENTPOINT.COM 1 In computing, an Arithmetic Logic Unit (ALU) is a digital circuit that performs arithmetic and logical operations. The ALU is a fundamental building block of the

More information

DESIGN AND PERFORMANCE ANALYSIS OF CARRY SELECT ADDER

DESIGN AND PERFORMANCE ANALYSIS OF CARRY SELECT ADDER DESIGN AND PERFORMANCE ANALYSIS OF CARRY SELECT ADDER Bhuvaneswaran.M 1, Elamathi.K 2 Assistant Professor, Muthayammal Engineering college, Rasipuram, Tamil Nadu, India 1 Assistant Professor, Muthayammal

More information

Reduced Delay BCD Adder

Reduced Delay BCD Adder Reduced Delay BCD Adder Alp Arslan Bayrakçi and Ahmet Akkaş Computer Engineering Department Koç University 350 Sarıyer, İstanbul, Turkey abayrakci@ku.edu.tr ahakkas@ku.edu.tr Abstract Financial and commercial

More information

Design and Implementation of Adder for Modulo 2 n +1 Addition

Design and Implementation of Adder for Modulo 2 n +1 Addition IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 4, Issue 2, Ver. I (Mar-Apr. 2014), PP 61-67 e-issn: 2319 4200, p-issn No. : 2319 4197 Design and Implementation of Adder for Modulo 2 n +1

More information

Chapter 3 Part 2 Combinational Logic Design

Chapter 3 Part 2 Combinational Logic Design University of Wisconsin - Madison ECE/Comp Sci 352 Digital Systems Fundamentals Kewal K. Saluja and Yu Hen Hu Spring 2002 Chapter 3 Part 2 Combinational Logic Design Originals by: Charles R. Kime and Tom

More information

MODULO 2 n + 1 MAC UNIT

MODULO 2 n + 1 MAC UNIT Int. J. Elec&Electr.Eng&Telecoms. 2013 Sithara Sha and Shajimon K John, 2013 Research Paper MODULO 2 n + 1 MAC UNIT ISSN 2319 2518 www.ijeetc.com Vol. 2, No. 4, October 2013 2013 IJEETC. All Rights Reserved

More information

Combinational Logic II

Combinational Logic II Combinational Logic II Ranga Rodrigo July 26, 2009 1 Binary Adder-Subtractor Digital computers perform variety of information processing tasks. Among the functions encountered are the various arithmetic

More information

ECE468 Computer Organization & Architecture. The Design Process & ALU Design

ECE468 Computer Organization & Architecture. The Design Process & ALU Design ECE6 Computer Organization & Architecture The Design Process & Design The Design Process "To Design Is To Represent" Design activity yields description/representation of an object -- Traditional craftsman

More information

Chapter 4 Arithmetic Functions

Chapter 4 Arithmetic Functions Logic and Computer Design Fundamentals Chapter 4 Arithmetic Functions Charles Kime & Thomas Kaminski 2008 Pearson Education, Inc. (Hyperlinks are active in View Show mode) Overview Iterative combinational

More information

DLD VIDYA SAGAR P. potharajuvidyasagar.wordpress.com. Vignana Bharathi Institute of Technology UNIT 3 DLD P VIDYA SAGAR

DLD VIDYA SAGAR P. potharajuvidyasagar.wordpress.com. Vignana Bharathi Institute of Technology UNIT 3 DLD P VIDYA SAGAR DLD UNIT III Combinational Circuits (CC), Analysis procedure, Design Procedure, Combinational circuit for different code converters and other problems, Binary Adder- Subtractor, Decimal Adder, Binary Multiplier,

More information

Arithmetic Logic Unit. Digital Computer Design

Arithmetic Logic Unit. Digital Computer Design Arithmetic Logic Unit Digital Computer Design Arithmetic Circuits Arithmetic circuits are the central building blocks of computers. Computers and digital logic perform many arithmetic functions: addition,

More information

Principles of Computer Architecture. Chapter 3: Arithmetic

Principles of Computer Architecture. Chapter 3: Arithmetic 3-1 Chapter 3 - Arithmetic Principles of Computer Architecture Miles Murdocca and Vincent Heuring Chapter 3: Arithmetic 3-2 Chapter 3 - Arithmetic 3.1 Overview Chapter Contents 3.2 Fixed Point Addition

More information

An Efficient Fused Add Multiplier With MWT Multiplier And Spanning Tree Adder

An Efficient Fused Add Multiplier With MWT Multiplier And Spanning Tree Adder An Efficient Fused Add Multiplier With MWT Multiplier And Spanning Tree Adder 1.M.Megha,M.Tech (VLSI&ES),2. Nataraj, M.Tech (VLSI&ES), Assistant Professor, 1,2. ECE Department,ST.MARY S College of Engineering

More information

ARITHMETIC operations based on residue number systems

ARITHMETIC operations based on residue number systems IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 53, NO. 2, FEBRUARY 2006 133 Improved Memoryless RNS Forward Converter Based on the Periodicity of Residues A. B. Premkumar, Senior Member,

More information

Figure 1. An 8-bit Superset Adder.

Figure 1. An 8-bit Superset Adder. Improving the Adder: A Fault-tolerant, Reconfigurable Parallel Prefix Adder Kyle E. Powers Dar-Eaum A. Nam Eric A. Llana ECE 4332 Fall 2012 University of Virginia @virginia.edu ABSTRACT

More information

ANALYZING THE PERFORMANCE OF CARRY TREE ADDERS BASED ON FPGA S

ANALYZING THE PERFORMANCE OF CARRY TREE ADDERS BASED ON FPGA S ANALYZING THE PERFORMANCE OF CARRY TREE ADDERS BASED ON FPGA S RENUKUNTLA KIRAN 1 & SUNITHA NAMPALLY 2 1,2 Ganapathy Engineering College E-mail: kiran00447@gmail.com, nsunitha566@gmail.com Abstract- In

More information

FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST

FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST SAKTHIVEL Assistant Professor, Department of ECE, Coimbatore Institute of Engineering and Technology Abstract- FPGA is

More information

Computer Architecture Set Four. Arithmetic

Computer Architecture Set Four. Arithmetic Computer Architecture Set Four Arithmetic Arithmetic Where we ve been: Performance (seconds, cycles, instructions) Abstractions: Instruction Set Architecture Assembly Language and Machine Language What

More information

THE DESIGN OF AN IC HALF PRECISION FLOATING POINT ARITHMETIC LOGIC UNIT

THE DESIGN OF AN IC HALF PRECISION FLOATING POINT ARITHMETIC LOGIC UNIT Clemson University TigerPrints All Theses Theses 12-2009 THE DESIGN OF AN IC HALF PRECISION FLOATING POINT ARITHMETIC LOGIC UNIT Balaji Kannan Clemson University, balaji.n.kannan@gmail.com Follow this

More information

Multi-Modulus Adder Implementation and its Applications

Multi-Modulus Adder Implementation and its Applications Multi-Modulus Adder Implementation and its Applications Aditya Urkude, Gargi Deshmukh, Minal Shinde M. Tech. VLSI Design, M. Tech. VLSI Design M. Tech. VLSI Design VIT University, Vellore, TN. VIT University,

More information

Chapter 3: part 3 Binary Subtraction

Chapter 3: part 3 Binary Subtraction Chapter 3: part 3 Binary Subtraction Iterative combinational circuits Binary adders Half and full adders Ripple carry and carry lookahead adders Binary subtraction Binary adder-subtractors Signed binary

More information

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator A.Sindhu 1, K.PriyaMeenakshi 2 PG Student [VLSI], Dept. of ECE, Muthayammal Engineering College, Rasipuram, Tamil Nadu,

More information

Design of Two Different 128-bit Adders. Project Report

Design of Two Different 128-bit Adders. Project Report Design of Two Different 128-bit Adders Project Report By Vladislav uravin Concordia ID: 5505763 COEN6501: Digital Design & Synthesis Offered by Professor Asim Al-Khalili Concordia University December 2004

More information

Non-Heuristic Optimization and Synthesis of Parallel-Prefix Adders

Non-Heuristic Optimization and Synthesis of Parallel-Prefix Adders International Workshop on Logic and Architecture Synthesis (IWLAS, Grenoble, ecember Non-Heuristic Optimization and Synthesis of Parallel-Prefix Adders Reto Zimmermann Integrated Systems Laboratory Swiss

More information

EC2303-COMPUTER ARCHITECTURE AND ORGANIZATION

EC2303-COMPUTER ARCHITECTURE AND ORGANIZATION EC2303-COMPUTER ARCHITECTURE AND ORGANIZATION QUESTION BANK UNIT-II 1. What are the disadvantages in using a ripple carry adder? (NOV/DEC 2006) The main disadvantage using ripple carry adder is time delay.

More information

16 BIT IMPLEMENTATION OF ASYNCHRONOUS TWOS COMPLEMENT ARRAY MULTIPLIER USING MODIFIED BAUGH-WOOLEY ALGORITHM AND ARCHITECTURE.

16 BIT IMPLEMENTATION OF ASYNCHRONOUS TWOS COMPLEMENT ARRAY MULTIPLIER USING MODIFIED BAUGH-WOOLEY ALGORITHM AND ARCHITECTURE. 16 BIT IMPLEMENTATION OF ASYNCHRONOUS TWOS COMPLEMENT ARRAY MULTIPLIER USING MODIFIED BAUGH-WOOLEY ALGORITHM AND ARCHITECTURE. AditiPandey* Electronics & Communication,University Institute of Technology,

More information

High-Performance Carry Chains for FPGA s

High-Performance Carry Chains for FPGA s 138 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 8, NO. 2, APRIL 2000 High-Performance Carry Chains for FPGA s Scott Hauck, Member, IEEE, Matthew M. Hosler, and Thomas W. Fry

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Arithmetic Unit 10122011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Recap Fixed Point Arithmetic Addition/Subtraction

More information

UNIT-III REGISTER TRANSFER LANGUAGE AND DESIGN OF CONTROL UNIT

UNIT-III REGISTER TRANSFER LANGUAGE AND DESIGN OF CONTROL UNIT UNIT-III 1 KNREDDY UNIT-III REGISTER TRANSFER LANGUAGE AND DESIGN OF CONTROL UNIT Register Transfer: Register Transfer Language Register Transfer Bus and Memory Transfers Arithmetic Micro operations Logic

More information

Computer Architecture and Organization

Computer Architecture and Organization 3-1 Chapter 3 - Arithmetic Computer Architecture and Organization Miles Murdocca and Vincent Heuring Chapter 3 Arithmetic 3-2 Chapter 3 - Arithmetic Chapter Contents 3.1 Fixed Point Addition and Subtraction

More information

International Journal of Computer Trends and Technology (IJCTT) volume 17 Number 5 Nov 2014 LowPower32-Bit DADDA Multipleir

International Journal of Computer Trends and Technology (IJCTT) volume 17 Number 5 Nov 2014 LowPower32-Bit DADDA Multipleir LowPower32-Bit DADDA Multipleir K.N.V.S.Vijaya Lakshmi 1, D.R.Sandeep 2 1 PG Scholar& ECE Department&JNTU Kakinada University Sri Vasavi Engineering College, Tadepalligudem, Andhra Pradesh, India 2 AssosciateProfessor&

More information

Parallel-Prefix Adders Implementation Using Reverse Converter Design. Department of ECE

Parallel-Prefix Adders Implementation Using Reverse Converter Design. Department of ECE Parallel-Prefix Adders Implementation Using Reverse Converter Design Submitted by: M.SHASHIDHAR Guide name: J.PUSHPARANI, M.TECH Department of ECE ABSTRACT: The binary adder is the critical element in

More information

Lecture 19: Arithmetic Modules 14-1

Lecture 19: Arithmetic Modules 14-1 Lecture 19: Arithmetic Modules 14-1 Syllabus Objectives Addition and subtraction Multiplication Division Arithmetic and logic unit 14-2 Objectives After completing this chapter, you will be able to: Describe

More information

Compound Adder Design Using Carry-Lookahead / Carry Select Adders

Compound Adder Design Using Carry-Lookahead / Carry Select Adders Journal From the SelectedWorks of Journal December, 2013 Compound Adder Design Using Carry-Lookahead / Carry Select Adders Jayaprakash M Dr. A. Shanmugam This work is licensed under a Creative Commons

More information

DESIGN METHODOLOGY. 5.1 General

DESIGN METHODOLOGY. 5.1 General 87 5 FFT DESIGN METHODOLOGY 5.1 General The fast Fourier transform is used to deliver a fast approach for the processing of data in the wireless transmission. The Fast Fourier Transform is one of the methods

More information

6.004 Computation Structures Spring 2009

6.004 Computation Structures Spring 2009 MIT OpenCourseWare http://ocw.mit.edu 6.004 Computation Structures Spring 009 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. M A S S A C H U S E T T

More information

A Low-Power Carry Skip Adder with Fast Saturation

A Low-Power Carry Skip Adder with Fast Saturation A Low-Power Carry Skip Adder with Fast Saturation Michael Schulte,3, Kai Chirca,2, John Glossner,2,Suman Mamidi,3, Pablo Balzola, and Stamatis Vassiliadis 2 Sandbridge Technologies, Inc. White Plains,

More information

CS 5803 Introduction to High Performance Computer Architecture: Arithmetic Logic Unit. A.R. Hurson 323 CS Building, Missouri S&T

CS 5803 Introduction to High Performance Computer Architecture: Arithmetic Logic Unit. A.R. Hurson 323 CS Building, Missouri S&T CS 5803 Introduction to High Performance Computer Architecture: Arithmetic Logic Unit A.R. Hurson 323 CS Building, Missouri S&T hurson@mst.edu 1 Outline Motivation Design of a simple ALU How to design

More information

FPGA Implementation of a High Speed Multiplier Employing Carry Lookahead Adders in Reduction Phase

FPGA Implementation of a High Speed Multiplier Employing Carry Lookahead Adders in Reduction Phase FPGA Implementation of a High Speed Multiplier Employing Carry Lookahead Adders in Reduction Phase Abhay Sharma M.Tech Student Department of ECE MNNIT Allahabad, India ABSTRACT Tree Multipliers are frequently

More information

Chapter 6. CMOS Functional Cells

Chapter 6. CMOS Functional Cells Chapter 6 CMOS Functional Cells In the previous chapter we discussed methods of designing layout of logic gates and building blocks like transmission gates, multiplexers and tri-state inverters. In this

More information

At the ith stage: Input: ci is the carry-in Output: si is the sum ci+1 carry-out to (i+1)st state

At the ith stage: Input: ci is the carry-in Output: si is the sum ci+1 carry-out to (i+1)st state Chapter 4 xi yi Carry in ci Sum s i Carry out c i+ At the ith stage: Input: ci is the carry-in Output: si is the sum ci+ carry-out to (i+)st state si = xi yi ci + xi yi ci + xi yi ci + xi yi ci = x i yi

More information

Excerpt from: Stephen H. Unger, The Essence of Logic Circuits, Second Ed., Wiley, 1997

Excerpt from: Stephen H. Unger, The Essence of Logic Circuits, Second Ed., Wiley, 1997 Excerpt from: Stephen H. Unger, The Essence of Logic Circuits, Second Ed., Wiley, 1997 APPENDIX A.1 Number systems and codes Since ten-fingered humans are addicted to the decimal system, and since computers

More information

Lecture #21 March 31, 2004 Introduction to Gates and Circuits

Lecture #21 March 31, 2004 Introduction to Gates and Circuits Lecture #21 March 31, 2004 Introduction to Gates and Circuits To this point we have looked at computers strictly from the perspective of assembly language programming. While it is possible to go a great

More information

Arithmetic Circuits. Nurul Hazlina Adder 2. Multiplier 3. Arithmetic Logic Unit (ALU) 4. HDL for Arithmetic Circuit

Arithmetic Circuits. Nurul Hazlina Adder 2. Multiplier 3. Arithmetic Logic Unit (ALU) 4. HDL for Arithmetic Circuit Nurul Hazlina 1 1. Adder 2. Multiplier 3. Arithmetic Logic Unit (ALU) 4. HDL for Arithmetic Circuit Nurul Hazlina 2 Introduction 1. Digital circuits are frequently used for arithmetic operations 2. Fundamental

More information

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University Hardware Design Environments Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University Outline Welcome to COE 405 Digital System Design Design Domains and Levels of Abstractions Synthesis

More information

A Novel Carry-look ahead approach to an Unified BCD and Binary Adder/Subtractor

A Novel Carry-look ahead approach to an Unified BCD and Binary Adder/Subtractor A Novel Carry-look ahead approach to an Unified BCD and Binary Adder/Subtractor Abstract Increasing prominence of commercial, financial and internet-based applications, which process decimal data, there

More information

High Speed Multiplication Using BCD Codes For DSP Applications

High Speed Multiplication Using BCD Codes For DSP Applications High Speed Multiplication Using BCD Codes For DSP Applications Balasundaram 1, Dr. R. Vijayabhasker 2 PG Scholar, Dept. Electronics & Communication Engineering, Anna University Regional Centre, Coimbatore,

More information

VTU NOTES QUESTION PAPERS NEWS RESULTS FORUMS Arithmetic (a) The four possible cases Carry (b) Truth table x y

VTU NOTES QUESTION PAPERS NEWS RESULTS FORUMS Arithmetic (a) The four possible cases Carry (b) Truth table x y Arithmetic A basic operation in all digital computers is the addition and subtraction of two numbers They are implemented, along with the basic logic functions such as AND,OR, NOT,EX- OR in the ALU subsystem

More information

Implementing FIR Filters

Implementing FIR Filters Implementing FIR Filters in FLEX Devices February 199, ver. 1.01 Application Note 73 FIR Filter Architecture This section describes a conventional FIR filter design and how the design can be optimized

More information

Digital System Design Using Verilog. - Processing Unit Design

Digital System Design Using Verilog. - Processing Unit Design Digital System Design Using Verilog - Processing Unit Design 1.1 CPU BASICS A typical CPU has three major components: (1) Register set, (2) Arithmetic logic unit (ALU), and (3) Control unit (CU) The register

More information

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1>

Chapter 5. Digital Design and Computer Architecture, 2 nd Edition. David Money Harris and Sarah L. Harris. Chapter 5 <1> Chapter 5 Digital Design and Computer Architecture, 2 nd Edition David Money Harris and Sarah L. Harris Chapter 5 Chapter 5 :: Topics Introduction Arithmetic Circuits umber Systems Sequential Building

More information

Design and Implementation of High Performance Parallel Prefix Adders

Design and Implementation of High Performance Parallel Prefix Adders Design and Implementation of High Performance Parallel Prefix Adders CH.Sudha Rani, CH.Ramesh Student, Department of ECE, Ganapathy Engineering College, Warangal, India. Associate Professor, Department

More information

Boolean Algebra and Logic Gates
